CN111382428A

CN111382428A - Malicious software recognition model training method, malicious software recognition method and device

Info

Publication number: CN111382428A
Application number: CN201811647282.9A
Authority: CN
Inventors: 佘三华; 余沛
Original assignee: Beijing Qihoo Technology Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2018-12-29
Filing date: 2018-12-29
Publication date: 2020-07-07

Abstract

The invention discloses a training method of a malicious software identification model, a malicious software identification method and a malicious software identification device, relates to the technical field of network security, and can solve the problem that the existing malicious software identification model needs to be continuously retrained along with software change. The method mainly comprises the following steps: the method comprises the steps of obtaining a PE file with security identification, wherein the security identification comprises malicious identification and benign identification; calculating a transition probability matrix of the PE file according to the calling type of the API in the PE file; and performing model training based on the transition probability matrix and the security identification of the PE file to obtain a malicious software identification model. The method is mainly suitable for the scene of identifying the malicious software by analyzing the PE file.

Description

Malicious software recognition model training method, malicious software recognition method and device

Technical Field

The invention relates to the technical field of network security, in particular to a malicious software identification model training method, a malicious software identification method and a malicious software identification device.

Background

With the continuous development of the internet, various malicious software are layered endlessly, the number of deformation and variation is more, and the efficiency of screening features and implementing detection based on manual analysis is difficult to meet the requirement of large-scale detection, so that a machine learning detection method based on a PE (portable executable) file is gradually applied to actual protection services.

At present, a detection model is mainly constructed by manually extracting features, namely a machine learning model is constructed by analyzing information such as sample file structures of a PE (provider edge) header, an import table and the like. Representative work includes constructing a neural network detection model based on PE structure information proposed by Raff and the like, and constructing a deep learning detection model based on byte entropy, character string entropy and the like by Saxe and the like. The methods have a certain detection effect, but the methods construct a machine learning model according to the characteristics of a sample depending on the sample file structure and the like, so that when the sample (namely software) is changed continuously, the detection effect can be maintained by continuous retraining, and the operation is complicated.

Disclosure of Invention

In view of this, the present invention provides a malware recognition model training method, a malware recognition method and a device thereof, and aims to solve the problem that the existing malware recognition model needs to be continuously retrained along with software changes.

The purpose of the invention is realized by adopting the following technical scheme:

in a first aspect, the present invention provides a training method for a malware recognition model, the method including:

the method comprises the steps of obtaining a PE file with security identification, wherein the security identification comprises malicious identification and benign identification;

calculating a transition probability matrix of the PE file according to the calling type of the API in the PE file;

and performing model training based on the transition probability matrix and the security identification of the PE file to obtain a malicious software identification model.

Optionally, calculating the transition probability matrix of the PE file according to the call type of the API in the PE file includes:

acquiring a control flow graph corresponding to each function in the PE file based on a decompiling tool;

analyzing codes in the PE file to obtain an API name contained in a basic block of each control flow graph, and identifying a calling type corresponding to the API name;

counting the number of the types of the calling types related to all the control flow diagrams of the PE file and the times that any two calling types are arranged in the basic block in sequence and appear adjacently;

and calculating the transition probability matrix according to the number of the types and the times.

Optionally, identifying the call type corresponding to the API name includes:

and searching the calling type corresponding to the acquired API name according to a pre-established API type dictionary containing the mapping relation between the API name and the calling type.

Optionally, calculating the transition probability matrix according to the number of the types and the number of times includes:

constructing a transition probability matrix with M (i, j) matrix elements and N row numbers and column numbers;

wherein, M (i, j) represents the times that the ith call type and the jth call type are arranged in the basic block in sequence and have adjacent subsequent call types in the occurrence times of the ith call type and the ith call type, and N represents the number of the types.

Optionally, the obtaining the PE file with the security identifier includes:

extracting a PE file from a software installation package with a security identifier;

judging whether the extracted PE file is subjected to shell adding treatment or not according to the static information of the extracted PE file;

and if the extracted PE file is subjected to shell adding, placing the extracted PE file into a sandbox for operation, dumping the memory occupied by the extracted PE file after the behavior of the extracted PE file is completely triggered, and extracting the PE file which is not subjected to shell adding from the dump file.

Optionally, the static information includes any one or more of the following combinations: file format, program entry point instruction characteristics, and import table.

Optionally, performing model training based on the transition probability matrix and the security identifier of the PE file, and obtaining a malware recognition model includes:

directly performing model training according to the transition probability matrix of the PE file and the corresponding security identifier to obtain the malicious software identification model;

or converting the transition probability matrix of the PE file into a one-dimensional characteristic vector, and performing model training according to the characteristic vector and a corresponding security identifier to obtain the malicious software identification model.

In a second aspect, the present invention provides a malware identification method, including:

acquiring a PE file to be identified;

calculating a transition probability matrix of the PE file to be identified according to the calling type of the API in the PE file to be identified;

and identifying whether the software corresponding to the file to be identified is malware or not by using the transition probability matrix of the PE file to be identified and a pre-established malware identification model, wherein the malware identification model is obtained by training according to the malware identification model training method of the first aspect.

Optionally, identifying whether the software corresponding to the to-be-identified file is malware by using the transition probability matrix of the to-be-identified PE file and a pre-established malware identification model includes:

directly inputting the transition probability matrix of the PE file to be identified into the malicious software identification model for malicious identification so as to determine whether the software corresponding to the PE file to be identified is malicious software;

or converting the transition probability matrix of the PE file to be identified into a one-dimensional feature vector, and inputting the feature vector obtained by conversion into the malicious software identification model for malicious identification so as to determine whether the software corresponding to the PE file to be identified is malicious software.

In a third aspect, the present invention provides a malware recognition model training apparatus, including:

the device comprises an acquisition unit, a storage unit and a processing unit, wherein the acquisition unit is used for acquiring a PE file with security identification, and the security identification comprises malicious identification and benign identification;

the calculation unit is used for calculating a transition probability matrix of the PE file according to the calling type of the API in the PE file;

and the training unit is used for carrying out model training based on the transition probability matrix and the security identification of the PE file to obtain a malicious software identification model.

Optionally, the computing unit includes:

a first obtaining module, configured to obtain, based on a decompiling tool, a control flow graph corresponding to each function in the PE file;

a second obtaining module, configured to obtain, by analyzing the code in the PE file, an API name included in a basic block of each control flow graph;

the identification module is used for identifying the calling type corresponding to the API name;

the counting module is used for counting the number of the types of the calling types related to all the control flow graphs of the PE file and the times that any two calling types are arranged in the basic block in sequence and appear adjacently;

and the calculating module is used for calculating the transition probability matrix according to the number of the types and the times.

Optionally, the identifying module is configured to search for the call type corresponding to the obtained API name according to a pre-established API type dictionary including mapping relationships between API names and call types.

Optionally, the computing module is configured to construct a transition probability matrix with M (i, j) matrix elements and N rows and columns;

Optionally, the obtaining unit includes:

the first extraction module is used for extracting the PE file from the software installation package with the security identifier;

the judging module is used for judging whether the extracted PE file is subjected to shell adding treatment or not according to the static information of the extracted PE file;

the dump module is used for placing the extracted PE file into a sandbox for operation when the extracted PE file is subjected to shell adding processing, and dumping the memory occupied by the extracted PE file after the behavior of the extracted PE file is completely triggered;

and the second extraction module is used for extracting the uncapped PE file from the dump file.

Optionally, the static information according to the determining module includes any one or a combination of the following items: file format, program entry point instruction characteristics, and import table.

Optionally, the training unit is configured to perform model training directly according to the transition probability matrix of the PE file and the corresponding security identifier, to obtain the malware recognition model; or converting the transition probability matrix of the PE file into a one-dimensional characteristic vector, and performing model training according to the characteristic vector and a corresponding security identifier to obtain the malicious software identification model.

In a fourth aspect, the present invention provides a malware identification apparatus, including:

the acquisition unit is used for acquiring the PE file to be identified;

the computing unit is used for computing a transition probability matrix of the PE file to be identified according to the calling type of the API in the PE file to be identified;

and the recognition unit is used for recognizing whether the software corresponding to the to-be-recognized file is malware or not by using the transition probability matrix of the to-be-recognized PE file and a pre-established malware recognition model, wherein the malware recognition model is obtained by training according to the malware recognition model training method of the first aspect.

Optionally, the identifying unit is configured to directly input the transition probability matrix of the PE file to be identified into the malware identification model for malicious identification, so as to determine whether software corresponding to the PE file to be identified is malware; or converting the transition probability matrix of the PE file to be identified into a one-dimensional feature vector, and inputting the feature vector obtained by conversion into the malicious software identification model for malicious identification so as to determine whether the software corresponding to the PE file to be identified is malicious software.

In a fifth aspect, the present invention provides a storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor and to perform the malware recognition model training method according to the first aspect, or to be loaded by a processor and to perform the malware recognition method according to the second aspect.

In a sixth aspect, the present invention provides an electronic device comprising a storage medium and a processor;

the processor is suitable for realizing instructions;

the storage medium adapted to store a plurality of instructions;

the instructions are adapted to be loaded by the processor and to perform the malware recognition model training method of the first aspect, or to load and perform the malware recognition method of the second aspect.

By means of the technical scheme, the malicious software recognition model training method, the malicious software recognition method and the malicious software recognition device provided by the invention can be used for firstly obtaining the PE file with the security identification, then calculating the transition probability matrix of the PE file according to the calling type of the API in the PE file, and finally performing model training based on the transition probability matrix and the security identification of the PE file to obtain the malicious software recognition model so as to be convenient for subsequently utilizing the malicious software recognition model to perform malicious software recognition. Therefore, the malicious software recognition model is obtained based on the transfer probability matrix training, the transfer probability matrix captures the dependency relationship among system calls with different functions, the attack resistance and the robustness are high, and the dependency relationship cannot be changed along with the change of the software structure, so that the malicious software recognition model does not need to be continuously retrained when the software is continuously changed, and the retraining frequency of the malicious software recognition model is reduced. In addition, by the method for fundamentally capturing the software characteristics, malicious code authors have difficulty in bypassing the detection of the malicious software through simple code obfuscation and deformation means.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a flowchart illustrating a malware recognition model training method according to an embodiment of the present invention;

FIG. 2 is a flow chart illustrating a malware identification method provided by an embodiment of the present invention;

FIG. 3 is a block diagram illustrating components of a malware recognition model training apparatus according to an embodiment of the present invention;

FIG. 4 is a block diagram illustrating another malware recognition model training apparatus according to an embodiment of the present invention;

fig. 5 shows a block diagram of a malware identification apparatus according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The embodiment of the invention provides a training method for a malicious software recognition model, which mainly comprises the following steps of:

101. and acquiring the PE file with the security identifier.

Wherein the security identity comprises a malicious identity and a benign identity. Specifically, a certain scale of manually or automatically labeled malicious samples and benign samples can be collected as a sample library, when model training is required, the PE files with malicious identifications are obtained from the malicious samples, and the PE files with benign identifications are obtained from the benign samples.

102. And calculating the transition probability matrix of the PE file according to the calling type of the API in the PE file.

Because the API transition probability matrix can capture the dependency relationship among system calls with different functions and has higher anti-attack capability and robustness, after PE files with security identifications are obtained, each PE file can be analyzed respectively to determine the call types of all APIs in each PE file, and then the transition probability matrix of the PE files is calculated according to the call types.

103. And performing model training based on the transition probability matrix and the security identification of the PE file to obtain a malicious software identification model.

The malware identification model can be a support vector machine model, a neural network model or other models.

According to the training method for the malicious software recognition model, provided by the embodiment of the invention, the PE file with the security identifier can be obtained firstly, then the transition probability matrix of the PE file is calculated according to the calling type of the API in the PE file, and finally the model training is carried out based on the transition probability matrix of the PE file and the security identifier to obtain the malicious software recognition model so as to be convenient for carrying out malicious software recognition by using the malicious software recognition model subsequently. Therefore, the malicious software recognition model is obtained based on the transition probability matrix training, the transition probability matrix captures the dependency relationship among system calls with different functions, the attack resistance and the robustness are high, and the dependency relationship cannot be changed along with the change of the software structure, so that the malicious software recognition model does not need to be continuously retrained when the software is continuously changed, and the retraining frequency of the malicious software recognition model is reduced. In addition, by the method for fundamentally capturing the software characteristics, malicious code authors have difficulty in bypassing the detection of the malicious software through simple code obfuscation and deformation means.

In another embodiment of the present invention, an alternative implementation of the above step 102 is further described, the implementation comprising:

(1) and acquiring a control flow graph corresponding to each function in the PE file based on a decompiling tool.

The decompilation tool is a tool such as IDA Pro (Interactive Disassembler Professional edition) or R2 that can perform reverse analysis on PE file code segments through static analysis, the PE file can identify functions therein through decompilation, and obtain a control flow graph from an entry of each function, where nodes in the control flow graph are basic blocks and edges are control flow transfer relationships between the basic blocks. A basic block is a basic unit in program code that can and cannot be executed starting with the first instruction of the basic block and going to the last instruction. Each function has a unique entry and there is a corresponding control flow graph, i.e. a control flow graph typically comprises a plurality of unconnected sub-graphs, each sub-graph corresponding to a unique function entry. This step can be achieved fully automated acquisition by writing IDC scripts.

(2) And analyzing codes in the PE file to obtain the API name contained in the basic block of each control flow graph, and identifying the calling type corresponding to the API name.

The specific implementation manner of identifying the call type may be: and searching the calling type corresponding to the acquired API name according to a pre-established API type dictionary containing the mapping relation between the API name and the calling type.

API calls refer to functions for exporting main system files, such as kernell 32.dll, ntdlll.dll, and advpi.dll, wherein each type of function can be classified according to the functions implemented by the function, for example, according to system resources affected and operated by the function, the function can be classified into files (such as file creation, reading and writing), networks (such as creation of connection, data transmission and reception), registries (such as registry key value creation, reading and writing), processes (such as process creation, closing, and the like), services (such as system service creation, starting and stopping), and interface UI operations (such as window drawing and destruction).

(3) And counting the number of the types of the calling types related to all the control flow diagrams of the PE file and the times of the adjacent appearance of any two calling types which are arranged in sequence in the basic block.

(4) And calculating the transition probability matrix according to the number of the types and the adjacent times of any two calling types.

Constructing a transition probability matrix with M (i, j) matrix elements and N row numbers and column numbers; wherein, M (i, j) represents the times that the ith call type and the jth call type are arranged in the basic block in sequence and have adjacent subsequent call types in the occurrence times of the ith call type and the ith call type, and N represents the number of the types.

For example, in a basic block, after a file reading operation is completed, the file is immediately sent to the network, the first operation is the file type, the second operation is the network type, and the file type and the network type are added with 1 in a certain statistical number in a corresponding matrix. This step requires statistics of the number of transfers of various Windows API types for all basic blocks in all control flow graphs.

Illustratively, if there are only 3 call types in a PE file, the number of times that the 1 st call type and the 2 nd call type appear successively and adjacently is 2, the number of times that the 1 st call type and the 3 rd call type appear successively and adjacently is 3, the number of times that the 2 nd call type and the 1 st call type appear successively and adjacently is 1, the number of times that the 2 nd call type and the 3 rd call type appear successively and adjacently is 4, the number of times that the 3 rd call type and the 1 st call type appear successively and adjacently is 2, the number of times that the call type in the 3 rd call type and the call type in the 2 nd call type appear successively and adjacently is 2, the transition probability matrix is

In another embodiment of the present invention, an alternative implementation of step 101 is further described, which includes: extracting a PE file from a software installation package with a security identifier; judging whether the extracted PE file is subjected to shell adding treatment or not according to the static information of the extracted PE file; and if the extracted PE file is subjected to shell adding, placing the extracted PE file into a sandbox for operation, dumping the memory occupied by the extracted PE file after the behavior of the extracted PE file is completely triggered, and extracting the PE file which is not subjected to shell adding from the dump file.

Wherein the static information comprises any one or a combination of more than one of: file format, program entry point instruction characteristics, and import table. Shell addition refers to the fact that a program is subjected to protective processing by third-party software or a malicious software author, and therefore a PE file cannot be analyzed through static analysis.

Performing memory dump on a shell-added PE file, executing the shell-added PE file in a controllable closed environment (such as a virtual machine environment such as VMWare or VirtualBox), when the shell-added PE file is started for a period of time or a network triggering behavior of the shell-added PE file is detected, acquiring a memory snapshot image through a memory dump function of a virtual execution environment, searching the memory image based on a PE structure, and recovering the shell-removed PE file.

In another embodiment of the present invention, an alternative implementation of step 103 is further described, which includes: directly performing model training according to the transition probability matrix of the PE file and the corresponding security identifier to obtain the malicious software identification model; or converting the transition probability matrix of the PE file into a one-dimensional characteristic vector, and performing model training according to the characteristic vector and a corresponding security identifier to obtain the malicious software identification model.

That is, the transition probability matrix is a two-dimensional vector, and the malware recognition model may be trained directly using a machine learning algorithm that supports multidimensional vectors, or may be trained first by converting the two-dimensional vector into a one-dimensional vector and then using a machine learning algorithm that indicates the one-dimensional vector.

Further, according to the above method embodiment, another embodiment of the present invention further provides a malware identification method, as shown in fig. 2, the method includes:

201. and acquiring the PE file to be identified.

Specifically, the PE file to be identified may be extracted from the software installation package to be identified; then judging whether the extracted PE file is subjected to shell adding treatment or not according to the static information of the extracted PE file; and if the extracted PE file is subjected to shell adding, placing the extracted PE file into a sandbox for operation, dumping the memory occupied by the extracted PE file after the behavior of the extracted PE file is completely triggered, and extracting the PE file which is not subjected to shell adding from the dump file.

202. And calculating a transition probability matrix of the PE file to be identified according to the calling type of the API in the PE file to be identified.

Specifically, a control flow graph corresponding to each function in the PE file to be identified may be obtained based on a decompiling tool; then, acquiring an API name contained in a basic block of each control flow graph by analyzing codes in the PE file to be identified, and identifying a calling type corresponding to the API name; then counting the number of the calling types related to all the control flow diagrams of the PE file to be identified and the adjacent times of any two calling types; and finally, calculating the transition probability matrix according to the number of the types and the adjacent times of any two calling types. For a more specific implementation, see step 102 above for details.

203. And identifying whether the software corresponding to the to-be-identified file is malicious software or not by utilizing the transition probability matrix of the to-be-identified PE file and a pre-established malicious software identification model.

The malicious software recognition model is obtained by training according to the malicious software recognition model training method. The malware identification model can be a support vector machine model, a neural network model or other models.

According to the malicious software identification method provided by the embodiment of the invention, after the PE file to be identified is obtained, the transition probability matrix of the PE file to be identified is calculated according to the calling type of the API in the PE file to be identified, and then the malicious software identification model is established by utilizing the transition probability matrix of the PE file to be identified and the transition probability matrix based on a large number of known good and bad PE files in advance, so that whether the software corresponding to the file to be identified is malicious software is identified. Therefore, the malicious software recognition model is obtained based on the transition probability matrix training, the transition probability matrix captures the dependency relationship among system calls with different functions, the attack resistance and the robustness are high, and the dependency relationship cannot be changed along with the change of the software structure, so that the malicious software recognition model does not need to be continuously retrained when the software is continuously changed, and the retraining frequency of the malicious software recognition model is reduced. In addition, by the method for fundamentally capturing the software characteristics, malicious code authors have difficulty in bypassing the detection of the malicious software through simple code obfuscation and deformation means.

In another embodiment of the present invention, an alternative implementation of step 103 is further described, which includes: when the malicious software identification model is obtained by directly training according to a transition probability matrix, directly inputting the transition probability matrix of the PE file to be identified into the malicious software identification model for malicious identification so as to determine whether the software corresponding to the PE file to be identified is malicious software; or when the malware identification model is obtained by training according to the one-dimensional feature vector converted from the transition probability matrix, converting the transition probability matrix of the PE file to be identified into the one-dimensional feature vector, and inputting the feature vector obtained by conversion into the malware identification model for malicious identification so as to determine whether the software corresponding to the PE file to be identified is malware.

Further, according to the foregoing method embodiment, another embodiment of the present invention further provides a malware recognition model training apparatus, as shown in fig. 3, the apparatus includes:

an obtaining unit 31, configured to obtain a PE file having security identifiers, where the security identifiers include malicious identifiers and benign identifiers;

a calculating unit 32, configured to calculate a transition probability matrix of the PE file according to a call type of an API in the PE file;

and the training unit 33 is configured to perform model training based on the transition probability matrix and the security identifier of the PE file to obtain a malware recognition model.

Optionally, as shown in fig. 4, the calculating unit 32 includes:

a first obtaining module 321, configured to obtain, based on a decompiling tool, a control flow graph corresponding to each function in the PE file;

a second obtaining module 322, configured to obtain, by analyzing the code in the PE file, an API name included in a basic block of each control flow graph;

the identification module 323 is used for identifying the calling type corresponding to the API name;

a counting module 324, configured to count the number of types of calls related to all control flow diagrams of the PE file and the number of times that any two call types are arranged in sequence and occur adjacently in a basic block;

a calculating module 325, configured to calculate the transition probability matrix according to the number of the types and the number of times.

Optionally, the identifying module 323 is configured to search for the call type corresponding to the obtained API name according to an API type dictionary that is pre-established and contains mapping relationships between API names and call types.

Optionally, the calculating module 325 is configured to construct a transition probability matrix with M (i, j) matrix elements and N rows and columns;

Optionally, as shown in fig. 4, the obtaining unit 31 includes:

a first extracting module 311, configured to extract a PE file from a software installation package with a security identifier;

a determining module 312, configured to determine whether the extracted PE file is shelled according to the static information of the extracted PE file;

the dumping module 313 is configured to place the extracted PE file into a sandbox for operation when the extracted PE file is shelled, and dump the memory occupied by the extracted PE file after the behavior of the extracted PE file is completely triggered;

and a second extracting module 314, configured to extract the uncapped PE file from the dump file.

Optionally, the training unit 33 is configured to perform model training directly according to the transition probability matrix of the PE file and the corresponding security identifier, to obtain the malware recognition model; or converting the transition probability matrix of the PE file into a one-dimensional characteristic vector, and performing model training according to the characteristic vector and a corresponding security identifier to obtain the malicious software identification model.

The training device for the malicious software recognition model provided by the embodiment of the invention can obtain the PE file with the security identifier, then calculate the transition probability matrix of the PE file according to the calling type of the API in the PE file, and finally perform model training based on the transition probability matrix of the PE file and the security identifier to obtain the malicious software recognition model so as to perform malicious software recognition by using the malicious software recognition model in the following. Therefore, the malicious software recognition model is obtained based on the transition probability matrix training, the transition probability matrix captures the dependency relationship among system calls with different functions, the attack resistance and the robustness are high, and the dependency relationship cannot be changed along with the change of the software structure, so that the malicious software recognition model does not need to be continuously retrained when the software is continuously changed, and the retraining frequency of the malicious software recognition model is reduced. In addition, by the method for fundamentally capturing the software characteristics, malicious code authors have difficulty in bypassing the detection of the malicious software through simple code obfuscation and deformation means.

Further, according to the above method embodiment, another embodiment of the present invention further provides a malware identification apparatus, as shown in fig. 5, the apparatus includes:

an obtaining unit 41, configured to obtain a PE file to be identified;

the calculating unit 42 is configured to calculate a transition probability matrix of the PE file to be identified according to a call type of an API in the PE file to be identified;

the identifying unit 43 is configured to identify whether software corresponding to the to-be-identified file is malware or not by using the transition probability matrix of the to-be-identified PE file and a pre-established malware identification model, where the malware identification model is obtained by training according to the above malware identification model training method.

Optionally, the identifying unit 43 is configured to directly input the transition probability matrix of the PE file to be identified into the malware identification model for malicious identification, so as to determine whether software corresponding to the PE file to be identified is malware; or converting the transition probability matrix of the PE file to be identified into a one-dimensional feature vector, and inputting the feature vector obtained by conversion into the malicious software identification model for malicious identification so as to determine whether the software corresponding to the PE file to be identified is malicious software.

The malicious software identification device provided by the embodiment of the invention can calculate the transition probability matrix of the PE file to be identified according to the calling type of the API in the PE file to be identified after the PE file to be identified is obtained, and then identify whether the software corresponding to the file to be identified is malicious software or not by utilizing the transition probability matrix of the PE file to be identified and the malicious software identification model which is established in advance based on the transition probability matrices of a large number of known good and bad PE files. Therefore, the malicious software recognition model is obtained based on the transition probability matrix training, the transition probability matrix captures the dependency relationship among system calls with different functions, the attack resistance and the robustness are high, and the dependency relationship cannot be changed along with the change of the software structure, so that the malicious software recognition model does not need to be continuously retrained when the software is continuously changed, and the retraining frequency of the malicious software recognition model is reduced. In addition, by the method for fundamentally capturing the software characteristics, malicious code authors have difficulty in bypassing the detection of the malicious software through simple code obfuscation and deformation means.

Further, according to the above method embodiment, another embodiment of the present invention further provides a storage medium storing a plurality of instructions, where the instructions are adapted to be loaded by a processor and executed by the above malware recognition model training method, or loaded by a processor and executed by the above malware recognition method.

The storage medium may include a volatile storage medium in a computer readable medium, a random access storage medium (RAM) and/or a non-volatile memory, such as a Read Only Memory (ROM) or a flash memory (flash RAM), and the storage medium includes at least one memory chip.

Further, according to the above method embodiment, another embodiment of the present invention also provides an electronic device, which includes a storage medium and a processor;

the processor is suitable for realizing instructions;

the storage medium adapted to store a plurality of instructions;

the instructions are adapted to be loaded and executed by the processor by a malware recognition model training method as described above, or by a malware recognition method as described above.

The present application further provides a computer program product adapted to perform program code for initializing the following method steps when executed on an electronic device:

acquiring a PE file to be identified;

and identifying whether the software corresponding to the to-be-identified file is malware or not by using the transition probability matrix of the to-be-identified PE file and a pre-established malware identification model, wherein the malware identification model is obtained by training according to the malware identification model training method.

The embodiment of the invention also discloses:

a1, a training method of a malware recognition model, the method comprising:

A2, according to the method in A1, the calculating the transition probability matrix of the PE file according to the calling type of the API in the PE file comprises:

A3, according to the method in A2, the step of identifying the call type corresponding to the API name comprises the steps of:

A4, according to the method of A2, wherein the calculating the transition probability matrix according to the number of the categories and the number of times comprises:

A5, according to the method in A1, obtaining the PE file with the security identifier includes:

A6, the method of A5, the static information comprising any one or combination of: file format, program entry point instruction characteristics, and import table.

A7, according to the method of any one of A1-A6, model training is carried out based on the transition probability matrix and the security identification of the PE file, and obtaining a malware recognition model comprises:

B8, a malware identification method, the method comprising:

acquiring a PE file to be identified;

and identifying whether the software corresponding to the file to be identified is malicious software or not by utilizing the transition probability matrix of the PE file to be identified and a pre-established malicious software identification model, wherein the malicious software identification model is obtained by training according to any one of the malicious software identification model training methods 1-7.

B9, according to the method of B8, identifying whether the software corresponding to the PE file to be identified is malware by using the transition probability matrix of the PE file to be identified and a pre-established malware identification model comprises the following steps:

C10, a malware recognition model training apparatus, the apparatus comprising:

C11, the apparatus of C10, the computing unit comprising:

And C12, the device according to C11, the recognition module is configured to search for the call type corresponding to the obtained API name according to a pre-established API type dictionary containing API names and call type mapping relationships.

C13, the apparatus according to C11, the calculating module is configured to construct a transition probability matrix with M (i, j) matrix elements and N rows and columns;

C14, the apparatus of C10, the obtaining unit comprising:

C15, the apparatus according to C14, the static information according to which the determining module is based on includes any one or combination of: file format, program entry point instruction characteristics, and import table.

C16, the device according to any one of C10-C15, the training unit is used for performing model training directly according to the transition probability matrix of the PE file and the corresponding security identification to obtain the malware recognition model; or converting the transition probability matrix of the PE file into a one-dimensional characteristic vector, and performing model training according to the characteristic vector and a corresponding security identifier to obtain the malicious software identification model.

D17, a malware identification apparatus, the apparatus comprising:

the acquisition unit is used for acquiring the PE file to be identified;

and the identification unit is used for identifying whether the software corresponding to the to-be-identified file is malware or not by utilizing the transition probability matrix of the to-be-identified PE file and a pre-established malware identification model, wherein the malware identification model is obtained by training according to the malware identification model training method of any one of A1-A7.

D18, the device according to D17, the recognition unit being configured to directly input the transition probability matrix of the PE file to be recognized into the malware recognition model for malware recognition, so as to determine whether the software corresponding to the PE file to be recognized is malware; or converting the transition probability matrix of the PE file to be identified into a one-dimensional feature vector, and inputting the feature vector obtained by conversion into the malicious software identification model for malicious identification so as to determine whether the software corresponding to the PE file to be identified is malicious software.

E19, a storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the malware recognition model training method of any one of a1-a7, or the malware recognition method of any one of B8-B9.

F20, an electronic device comprising a storage medium and a processor;

the processor is suitable for realizing instructions;

the storage medium adapted to store a plurality of instructions;

the instructions are adapted to be loaded and executed by the processor by a malware recognition model training method as described in any one of A1-A7, or by a malware recognition method as described in any one of B8-B9.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

It will be appreciated that the relevant features of the method and apparatus described above are referred to one another. In addition, "first", "second", and the like in the above embodiments are for distinguishing the embodiments, and do not represent merits of the embodiments.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the malware recognition model training method, the malware recognition method and the apparatus according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

Claims

1. A malware recognition model training method, the method comprising:

2. The method of claim 1, wherein computing the transition probability matrix for the PE file based on the type of API calls in the PE file comprises:

3. The method of claim 2, wherein identifying the call type corresponding to the API name comprises:

4. The method of claim 2, wherein computing the transition probability matrix based on the number of classes and the number of times comprises:

5. The method of claim 1, wherein obtaining the PE file having the security identifier comprises:

6. A malware identification method, the method comprising:

acquiring a PE file to be identified;

and identifying whether the software corresponding to the PE file to be identified is malware or not by using the transition probability matrix of the PE file to be identified and a pre-established malware identification model, wherein the malware identification model is obtained by training according to the malware identification model training method of any one of claims 1 to 5.

7. A malware recognition model training apparatus, the apparatus comprising:

8. An apparatus for malware identification, the apparatus comprising:

the acquisition unit is used for acquiring the PE file to be identified;

an identifying unit, configured to identify whether software corresponding to the to-be-identified file is malware by using the transition probability matrix of the to-be-identified PE file and a pre-established malware identification model, where the malware identification model is obtained by training according to the malware identification model training method according to any one of claims 1 to 5.

9. A storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the malware recognition model training method of any one of claims 1-5 or the malware recognition method of claim 6.

10. An electronic device, comprising a storage medium and a processor;

the processor is suitable for realizing instructions;

the storage medium adapted to store a plurality of instructions;

the instructions are adapted to be loaded by the processor and to perform the malware recognition model training method of any one of claims 1-5, or to load and perform the malware recognition method of claim 6.