CN114254319A - Network virus identification method and device, computer equipment and storage medium

Info

Publication number: CN114254319A
Application number: CN202111521591.3A
Authority: CN
Inventors: 潘佳斌; 董雷; 童志明
Original assignee: Antiy Technology Group Co Ltd
Current assignee: Antiy Technology Group Co Ltd
Priority date: 2021-12-13
Filing date: 2021-12-13
Publication date: 2022-03-29

Abstract

The application provides a network virus identification method, a network virus identification device, computer equipment and a storage medium, relates to the technical field of computational security, and is used for improving the identification accuracy of network viruses. The method mainly comprises the following steps: determining original characteristics and virus labels respectively corresponding to a plurality of types of virus sample program codes, wherein the original characteristics comprise static characteristics and dynamic characteristics; performing neural network learning according to the original characteristics and the virus labels to obtain a target neural network model and model parameters thereof; based on the model parameters of the target neural network model, performing neural network learning on the original characteristics and virus labels corresponding to the sample program codes of the target type of virus to obtain a target type virus identification model; and identifying, based on the target type virus identification model, whether the sample to be detected belongs to the network viruses of the target type.

Description

Network virus identification method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of network security technologies, and in particular, to a method and an apparatus for identifying a network virus, a computer device, and a storage medium.

Background

Malicious code recognition is, in essence, a complex and very large-scale task of classifying and discriminating network viruses. Traditional methods, which extract discriminative feature fragments through manual analysis or automation, struggle to generalize well enough to discover unknown samples and lag behind new threats. Machine-learning classification methods, using a well-trained learning model, can therefore supplement the traditional ability to recognize network viruses.

In the traditional technology, for a problem in a specific domain, the scale of the training sample set severely restricts the expressive power of a model. On the one hand, using all available data, including data not closely related to the domain, can improve the stability of the model but limits its sensitivity to problems in the specific domain; on the other hand, relying only on the in-domain data set aggravates the shortage of training data, amplifies the overfitting of the artificial intelligence model, and limits its practicability. Therefore, the accuracy of identifying network viruses based on existing models is low.

Disclosure of Invention

The embodiment of the application provides a network virus identification method, a network virus identification device, computer equipment and a storage medium, which are used for improving the accuracy of network virus identification.

The embodiment of the invention provides a network virus identification method, which comprises the following steps:

determining original characteristics and virus labels respectively corresponding to a plurality of types of virus sample program codes, wherein the original characteristics comprise static characteristics and dynamic characteristics;

performing neural network learning according to the original characteristics and the virus labels to obtain a target neural network model and model parameters thereof;

based on the model parameters of the target neural network model, carrying out neural network learning on original characteristics and virus labels corresponding to the sample program codes of the target viruses to obtain a target virus identification model;

and identifying whether the sample to be detected belongs to the network viruses of the target type or not based on the target type virus identification model.

The embodiment of the invention provides a network virus identification device, which comprises:

a determining module, used for determining original characteristics and virus labels respectively corresponding to various types of virus sample program codes, wherein the original characteristics comprise static characteristics and dynamic characteristics;

the training module is used for carrying out neural network learning according to the original characteristics and the virus labels to obtain a target neural network model and model parameters thereof;

the training module is further used for carrying out neural network learning on original characteristics corresponding to the sample program codes of the target viruses and the virus labels based on the model parameters of the target neural network model to obtain a target virus identification model;

and the identification module is used for identifying whether the sample to be detected belongs to the network viruses of the target type or not based on the target type virus identification model.

A computer device comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the computer program to realize the network virus identification method.

A computer-readable storage medium, which stores a computer program that, when executed by a processor, implements the network virus identification method described above.

A computer program product comprising a computer program which, when executed by a processor, implements the above-described network virus identification method.

The invention provides a network virus identification method, a network virus identification device, computer equipment and a storage medium. First, original characteristics and virus labels respectively corresponding to a plurality of types of virus sample program codes are determined, wherein the original characteristics comprise static characteristics and dynamic characteristics; then, neural network learning is performed according to the original characteristics and the virus labels to obtain a target neural network model; next, based on the model parameters of the target neural network model, neural network learning is performed on the original characteristics and virus labels corresponding to the sample program codes of the target type of virus to obtain a target type virus identification model; finally, whether the sample to be detected belongs to the network viruses of the target type is identified according to the target type virus identification model. By combining transfer learning with a deep neural network model structure, the invention trains a target neural network model using the original characteristics and virus labels of all virus types, and then applies the model parameters of the target neural network model to a domain-specific model (for the target type of virus), thereby reusing basic knowledge information. Meanwhile, in combination with the actual requirements of the specific problem domain, the model continues to be trained with training samples from the application domain, which strengthens the model's expressive power for that domain and improves the identification accuracy for the target type of virus.

Drawings

Fig. 1 is a flowchart of a network virus identification method provided in the present application;

Fig. 2 is a flowchart of another network virus identification method provided in the present application;

Fig. 3 is a schematic structural diagram of a network virus identification apparatus provided in the present application;

Fig. 4 is a schematic diagram of a computer device provided in the present application.

Detailed Description

In order to better understand the technical solutions described above, the technical solutions of the embodiments of the present application are described in detail below with reference to the drawings and the specific embodiments, and it should be understood that the specific features of the embodiments and the embodiments of the present application are detailed descriptions of the technical solutions of the embodiments of the present application, and are not limitations of the technical solutions of the present application, and the technical features of the embodiments and the embodiments of the present application may be combined with each other without conflict.

Referring to Fig. 1, a method for identifying a network virus according to an embodiment of the present invention specifically includes steps S101 to S104:

Step S101, determining original characteristics and virus labels respectively corresponding to various types of virus sample program codes.

The original features comprise static features and dynamic features, and refer to malicious code feature information extracted from sample program codes through means of static and dynamic feature analysis and the like. Specifically, the static characteristics obtained through static analysis include file format information, file attribute information, character string information, binary information, and instruction characteristic information; the dynamic characteristics obtained by the dynamic analysis include local behavior characteristics, network behavior characteristics, API call characteristics, and the like, and the embodiment of the present invention is not particularly limited.

In the embodiment of the present invention, the virus labels indicate the types of viruses, and there are as many virus labels as there are types of viruses in the embodiment. Viruses can be classified into categories such as virus, trojan, and worm; each category contains a plurality of malicious code families, each family may have a plurality of variants, and each variant has a plurality of files. The different sample classes here may be any of the different malicious code variants.

It should be noted that, in addition to the virus type, the virus label in this embodiment may also represent the form in which the virus appears, such as a self-extracting archive or a packed (shelled) executable; the form is not specifically limited in this embodiment.

Further, after the original features are obtained, the corresponding preprocessing needs to be performed according to the feature value types corresponding to the original features, where the feature value types refer to extracted original representation forms of the features, for example, for a person, the feature value type of height and weight is a numerical value, the feature value type of gender is a boolean variable, and the fingerprint is a picture. Specifically, the characteristic value types in this embodiment include a numerical value (number of file resources, number of file sections), a boolean variable (whether executable sections exist), serialized data (disassembly instruction sequence), a graph structure characteristic (system call flow chart), and the like, and this embodiment is not particularly limited.

In this embodiment, the preprocessing includes one or more of the following: if the feature value type is a numerical feature or an encoded feature, normalizing the corresponding feature; if the feature value type is a sequence feature, performing word vectorization on the corresponding feature by using an Embedding method; and if the feature value type is a relational graph feature, performing graph vectorization on the corresponding feature.
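The type-dependent preprocessing described above can be illustrated with a minimal Python sketch. The specific techniques here (min-max scaling, averaging token embeddings, and an out-degree vector for graphs) are assumptions chosen for illustration; the embodiment does not prescribe them, and the function names and the embedding table are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def normalize(values: Sequence[float]) -> List[float]:
    """Min-max scale numeric values into [0, 1] (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def embed_sequence(tokens: Sequence[str],
                   table: Dict[str, List[float]],
                   dim: int) -> List[float]:
    """Average the embedding vectors of the tokens; unknown tokens map to zeros."""
    if not tokens:
        return [0.0] * dim
    total = [0.0] * dim
    for tok in tokens:
        vec = table.get(tok, [0.0] * dim)
        for i in range(dim):
            total[i] += vec[i]
    return [x / len(tokens) for x in total]


def vectorize_graph(edges: Sequence[Tuple[str, str]],
                    nodes: Sequence[str]) -> List[float]:
    """Represent a call graph by the out-degree of each node in a fixed order."""
    out_degree = {n: 0 for n in nodes}
    for src, _dst in edges:
        if src in out_degree:
            out_degree[src] += 1
    return [float(out_degree[n]) for n in nodes]


def preprocess(feature_type: str, value, **kwargs) -> List[float]:
    """Dispatch on the feature value type, as the embodiment describes."""
    if feature_type == "numeric":
        return normalize(value)
    if feature_type == "sequence":
        return embed_sequence(value, kwargs["table"], kwargs["dim"])
    if feature_type == "graph":
        return vectorize_graph(value, kwargs["nodes"])
    raise ValueError(f"unsupported feature value type: {feature_type}")
```

For instance, `preprocess("numeric", [2, 4, 6])` scales the three values to the range 0 to 1, while a disassembled instruction sequence would be passed with `feature_type="sequence"` together with an embedding table.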

Step S102, performing neural network learning according to the original characteristics and the virus labels to obtain a target neural network model and model parameters thereof.

In this embodiment, a target neural network model is constructed, which includes but is not limited to a CNN structure network, an RNN structure network, a BERT structure network, and the like, and the constructed target neural network model has a multilayer structure. Specifically, after the original features and virus labels respectively corresponding to multiple types of virus sample program codes are determined, the original features preprocessed by normalization, Embedding and other methods are used as the input of the model and the virus labels are used as the output to perform neural network learning, and the target neural network model and its model parameters are obtained through training.

Step S103, performing neural network learning on original characteristics and virus labels corresponding to the sample program codes of the target type of virus based on the model parameters of the target neural network model to obtain a target type virus identification model.

The target type of virus may be any kind of virus, for example, an Advanced Persistent Threat (APT) or a Downloader type, which is not specifically limited in this embodiment.

The target type virus identification model in the embodiment of the invention is divided into two large structural parts: the multilayer structure connected with the input layer is the model migration layer, and the structure connected with the output layer is the model domain layer. The model migration layer retains more generalization knowledge information related to the target neural network model; the model domain layer retains more basic information related to the target neural network model.

The target type virus identification model reuses the migration layer structure and parameters of the target neural network model, and the model domain layer structure is rebuilt according to the actual requirements of the domain; with the model parameters of the migration layer reused, the parameters of the target type virus identification model are trained using domain sample information (samples of the target type of virus); and when the model parameters are stable, the newly built model structure matching the domain requirements is exported.
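The structural reuse described above can be sketched as follows. This is only an illustration under stated assumptions: layers are modeled as plain weight matrices, `build_target_model` and `init_layer` are hypothetical names, and the number of reused layers is a free choice. The embodiment does not state how many layers are migrated or whether they are frozen during the subsequent training.

```python
import copy
import random
from typing import List

# A layer is a weight matrix: one row per output unit, one column per input unit.
Layer = List[List[float]]


def init_layer(n_out: int, n_in: int, rng: random.Random) -> Layer:
    """Create a freshly initialized layer with small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def build_target_model(base_layers: List[Layer],
                       n_migration: int,
                       n_domain_out: int,
                       rng: random.Random) -> List[Layer]:
    """Reuse the first n_migration layers and rebuild the domain layer."""
    # Migration layers: structure and parameters copied from the trained base model.
    migration = copy.deepcopy(base_layers[:n_migration])
    # Domain layer: rebuilt for the target type, sized to the last migration layer.
    n_in = len(migration[-1])
    domain = init_layer(n_domain_out, n_in, rng)
    return migration + [domain]


rng = random.Random(0)
# Stand-in for a trained multi-class model: 16 inputs, 10 virus types as outputs.
base_model = [init_layer(8, 16, rng), init_layer(4, 8, rng), init_layer(10, 4, rng)]
# Target type model: keep two migration layers, new 2-way output (target type or not).
target_model = build_target_model(base_model, n_migration=2, n_domain_out=2, rng=rng)
```

The resulting `target_model` would then be trained further on samples of the target type of virus; that training loop is omitted here because the embodiment leaves the optimizer and stopping criterion unspecified beyond the parameters becoming stable.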

Step S104, identifying whether the sample to be detected belongs to the network viruses of the target type or not based on the target type virus identification model.

In an optional embodiment provided by the present invention, the identifying, based on the target class virus identification model, whether the sample to be detected belongs to a network virus of a target class includes: acquiring original characteristics corresponding to a sample code to be detected, wherein the original characteristics comprise static characteristics and dynamic characteristics; and inputting the original characteristics corresponding to the sample code to be detected into the target type virus identification model, and determining the virus classification result of the sample code to be detected.

In this embodiment, after the original features corresponding to the sample code to be detected are obtained, they need to be preprocessed. The preprocessing may specifically be: determining the feature value type corresponding to each feature in the original features; and preprocessing the corresponding features according to the feature value types to obtain the processed original features. The preprocessing includes one or more of the following: if the feature value type is a numerical feature or an encoded feature, normalizing the corresponding feature; if the feature value type is a sequence feature, performing word vectorization on the corresponding feature; and if the feature value type is a relational graph feature, performing graph vectorization on the corresponding feature.

The invention provides a network virus identification method. First, original characteristics and virus labels respectively corresponding to a plurality of types of virus sample program codes are determined, wherein the original characteristics comprise static characteristics and dynamic characteristics; then, neural network learning is performed according to the original characteristics and the virus labels to obtain a target neural network model; next, based on the model parameters of the target neural network model, neural network learning is performed on the original characteristics and virus labels corresponding to the sample program codes of the target type of virus to obtain a target type virus identification model; finally, whether the sample to be detected belongs to the network viruses of the target type is identified according to the target type virus identification model. By combining transfer learning with a deep neural network model structure, the invention trains a target neural network model using the original characteristics and virus labels of all virus types, and then applies the model parameters of the target neural network model to a domain-specific model (for the target type of virus), thereby reusing basic knowledge information. Meanwhile, in combination with the actual requirements of the specific problem domain, the model continues to be trained with training samples from the application domain, which strengthens the model's expressive power for that domain and improves the identification accuracy for the target type of virus.

Referring to Fig. 2, another network virus identification method according to an embodiment of the present invention includes steps S201 to S205:

Step S201, obtaining the original characteristics corresponding to the sample code to be detected.

Wherein the original features include static features and dynamic features. It should be noted that the detailed description of step S201 in this embodiment is the same as the description of the corresponding step in fig. 1, and this embodiment is not repeated herein.

Step S202, calculating the similarity between the original characteristics corresponding to the sample code to be detected and the characteristics of various types of viruses in the virus library.

The virus library stores various types of viruses and corresponding virus characteristics, and the virus characteristics also include static characteristics and dynamic characteristics. After the original features corresponding to the sample code to be detected are obtained, the similarity between the static features in the original features and the static features of various viruses in the virus library is calculated, the similarity between the dynamic features in the original features and the dynamic features of various viruses in the virus library is calculated, and then the sum of the similarity between the static features and the similarity between the dynamic features is calculated.

In an optional embodiment provided by the present invention, after obtaining the similarity between the static features in the original features and the static features of various viruses in the virus library, and the similarity between the dynamic features in the original features and the dynamic features of various viruses in the virus library, the similarity of the sample code to be detected may be obtained through weighted calculation.

For example, the similarity between the static features in the original features and the static features of various viruses in the virus library is 80%, the similarity between the dynamic features in the original features and the dynamic features of various viruses in the virus library is 50%, and if the weight value of the dynamic features is 0.8 and the weight value corresponding to the static features is 0.2, the similarity of the sample code to be detected obtained through weighting calculation is 56%.
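The weighted calculation in this example can be checked with a short Python sketch; the function name `weighted_similarity` is hypothetical, and the inputs are the figures given above.

```python
def weighted_similarity(static_sim: float, dynamic_sim: float,
                        static_weight: float, dynamic_weight: float) -> float:
    """Combine static and dynamic similarity using the given weights."""
    return static_weight * static_sim + dynamic_weight * dynamic_sim


# Static similarity 80% (weight 0.2), dynamic similarity 50% (weight 0.8).
score = weighted_similarity(static_sim=0.80, dynamic_sim=0.50,
                            static_weight=0.2, dynamic_weight=0.8)
print(f"{score:.0%}")  # 0.2 * 0.80 + 0.8 * 0.50 = 0.16 + 0.40, i.e. 56%
```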

Step S203, if the similarity exceeds the threshold, determining the virus type corresponding to the virus feature in the virus library as the target type.

The threshold may be set according to actual requirements, and this embodiment does not specifically limit this. Specifically, the present embodiment determines, as the target category, a virus category corresponding to a virus feature whose similarity exceeds a threshold in the virus library.

For example, the virus library includes virus characteristics of 5 virus types, i.e., virus type 1, virus type 2, virus type 3, virus type 4, and virus type 5. Suppose the similarity between the original characteristics corresponding to the sample code to be detected and the virus characteristics is 65% for virus type 1, 60% for virus type 2, 90% for virus type 3, 54% for virus type 4, and 89% for virus type 5. If the threshold value is 85%, virus type 3 and virus type 5 can be determined to be target types. After the target types are determined, the original characteristics corresponding to the sample code to be detected are respectively input into the type 3 virus identification model and the type 5 virus identification model, so as to determine whether the sample code to be detected belongs to virus type 3 or virus type 5.
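The threshold selection in this example can be sketched in Python as follows; `select_target_types` is a hypothetical helper, and the similarity values are those listed above.

```python
from typing import Dict, List


def select_target_types(similarities: Dict[str, float], threshold: float) -> List[str]:
    """Return every virus type whose similarity exceeds the threshold."""
    return [virus_type for virus_type, sim in similarities.items() if sim > threshold]


similarities = {
    "virus type 1": 0.65,
    "virus type 2": 0.60,
    "virus type 3": 0.90,
    "virus type 4": 0.54,
    "virus type 5": 0.89,
}
# With a threshold of 85%, only virus types 3 and 5 qualify as target types.
target_types = select_target_types(similarities, threshold=0.85)
```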

Step S204, respectively inputting the original characteristics corresponding to the sample code to be detected into the corresponding target type virus identification models, and determining the probability value of the sample code to be detected in each target type virus identification model.

Step S205, determining the result of the target type virus identification model with the highest probability value as the virus classification result of the sample code to be detected.

For example, if the target types determined in step S203 are virus type 3 and virus type 5, the original features corresponding to the sample code to be detected are respectively input into the type 3 virus identification model and the type 5 virus identification model. If the type 3 virus identification model gives a probability of 80% that the sample code to be detected is virus type 3, and the type 5 virus identification model gives a probability of 40% that it is virus type 5, the virus classification result corresponding to the sample code to be detected can be determined to be virus type 3, since that model yields the highest probability value.
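Steps S204 and S205 amount to querying each target type model and keeping the one with the highest probability value. A minimal sketch follows, in which the per-type models are stand-in functions returning fixed probabilities (80% and 40%, as in the example); real models and the name `classify` are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[List[float]], float]  # maps preprocessed features to a probability


def classify(features: List[float], models: Dict[str, Model]) -> Tuple[str, float]:
    """Run every target type model and return the type with the highest probability."""
    probabilities = {name: model(features) for name, model in models.items()}
    best = max(probabilities, key=probabilities.get)
    return best, probabilities[best]


# Stand-in models that return fixed probabilities for illustration only.
models: Dict[str, Model] = {
    "virus type 3": lambda features: 0.80,
    "virus type 5": lambda features: 0.40,
}
result = classify([0.1, 0.2, 0.3], models)  # selects the model with the 80% probability
```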

In the embodiment of the invention, by combining transfer learning with the deep neural network model structure, the target neural network model is trained using the original characteristics and virus labels of all virus types, and the multi-layer migration structure of that model is then applied to a domain-specific model (for the target type of virus), thereby reusing basic knowledge information. Different virus type recognition models are obtained by training on feature data of specific types, so that whether a given program code belongs to a given virus type can be recognized according to the corresponding model, which improves the accuracy of virus recognition.

In an optional embodiment provided by the present invention, in order to verify the accuracy of the identification result of the target type virus identification model, after determining the result of the target type virus identification model with the highest probability value as the virus classification result of the sample code to be detected, the sample code to be detected is input into a sandbox for execution, and the execution result corresponding to the sample code to be detected is determined; verifying whether the sample code to be detected is consistent with the result of the target type virus identification model with the highest probability value according to the execution result; and updating and training the target type virus identification model with the highest probability value according to the verification result.
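One possible reading of this verification step is sketched below. It assumes that the sandbox execution result can be reduced to a yes/no verdict on the target type and that inconsistent samples are collected, labeled with the sandbox verdict, for later update training; both are assumptions, since the embodiment does not specify how the update is carried out. The name `verify_prediction` is hypothetical.

```python
from typing import List, Tuple

# (preprocessed features, label) where label 1 = target type, 0 = not target type
Sample = Tuple[List[float], int]


def verify_prediction(features: List[float],
                      predicted_is_target: bool,
                      sandbox_is_target: bool,
                      update_set: List[Sample]) -> bool:
    """Compare the model result with the sandbox verdict.

    When they disagree, the sample is stored with the sandbox verdict as its
    label so the model can be update-trained on it later (assumed policy).
    """
    consistent = predicted_is_target == sandbox_is_target
    if not consistent:
        update_set.append((features, int(sandbox_is_target)))
    return consistent


update_set: List[Sample] = []
# The model flagged the sample as the target type, but the sandbox run disagreed.
ok = verify_prediction([0.1, 0.9], predicted_is_target=True,
                       sandbox_is_target=False, update_set=update_set)
```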

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.

In an embodiment, an apparatus for identifying a network virus is provided, where the apparatus for identifying a network virus corresponds to the method for identifying a network virus in the foregoing embodiment one to one. As shown in fig. 3, the functional modules of the network virus identification apparatus are described in detail as follows:

a determining module 31, configured to determine original features and virus tags respectively corresponding to multiple types of virus sample program codes, where the original features include static features and dynamic features;

the training module 32 is used for performing neural network learning according to the original features and the virus labels to obtain a target neural network model and model parameters thereof;

the training module 32 is further configured to perform neural network learning on original features and virus labels corresponding to sample program codes of the target virus type based on the model parameters of the target neural network model to obtain a target virus type identification model;

and the identifying module 33 is configured to identify whether the sample to be detected belongs to the network virus of the target category based on the target category virus identification model.

In an alternative embodiment, the identification module 33 is specifically configured to:

acquiring original features corresponding to a sample code to be detected, wherein the original features comprise static features and dynamic features;

and inputting the original characteristics corresponding to the sample code to be detected into the target type virus identification model, and determining the virus classification result of the sample code to be detected.

In an optional embodiment, the apparatus further comprises: a pre-processing module 34;

the determining module 31 is further configured to determine a feature value type corresponding to each feature in the original features;

and the preprocessing module 34 is configured to preprocess the corresponding features according to the feature value types to obtain processed original features.

In an alternative embodiment, the preprocessing includes one or more of the following:

if the characteristic value type is a numerical characteristic or a coding characteristic, performing normalization processing on the corresponding characteristic;

if the characteristic value type is a sequence type characteristic, performing word vectorization on the corresponding characteristic;

and if the characteristic value type is the characteristic of the relational graph, carrying out graph vectorization on the corresponding characteristic.

calculating the similarity between the original characteristics corresponding to the sample code to be detected and various types of virus characteristics in a virus library; the virus library stores virus types respectively corresponding to various types of virus characteristics;

and if the similarity exceeds a threshold value, determining the virus type corresponding to the virus characteristics in the virus library as the target type.

In an alternative embodiment, the identification module 33 is specifically configured to:

respectively inputting the original characteristics corresponding to the sample code to be detected into corresponding target type virus identification models, and determining the probability value of the sample code to be detected in each target type virus identification model;

and determining the result of the target type virus identification model with the highest probability value as the virus classification result of the sample code to be detected.

In an optional embodiment, the apparatus further comprises: a verification module 35;

the determining module 31 is further configured to input the sample code to be detected into a sandbox for execution, and determine an execution result corresponding to the sample code to be detected;

the verification module 35 is configured to verify whether the sample code to be detected is consistent with the result of the target class virus identification model with the highest probability value according to the execution result;

the training module 32 is further configured to update and train the target type virus identification model in which the probability value is highest according to the verification result.

For specific limitations of the network virus identification device, reference may be made to the above limitations of the network virus identification method, which are not described herein again. The various modules in the above-described apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a network virus identification method.

In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:

In one embodiment, a computer program product is provided, the computer program product comprising a computer program executed by a processor to perform the steps of:

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

It will be apparent to those skilled in the art that, for convenience and brevity of description, the above division into functional units and modules is given only as an example. In practical applications, the functions described above may be allocated to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.

The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, and that some of their technical features may be replaced by equivalents. Such modifications and substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims

1. A method for identifying a network virus, the method comprising:

2. The method according to claim 1, wherein identifying whether the sample to be detected belongs to the network viruses of the target type based on the target type virus identification model comprises:

inputting the original features corresponding to the sample code to be detected into the target type virus identification model, and determining the virus classification result of the sample to be detected.

3. The method according to claim 1 or 2, characterized in that the method further comprises:

determining the feature value type corresponding to each feature in the original features;

and preprocessing the corresponding features according to the feature value types to obtain the processed original features.

4. The method of claim 3, wherein the pre-processing comprises at least one of:

5. The method according to claim 2, wherein before inputting the original features corresponding to the sample code to be detected into the target type virus identification model and determining the virus classification result of the sample to be detected, the method further comprises:

and if the similarity exceeds a threshold value, determining the virus type corresponding to the virus characteristics in the virus library as the target type.

6. The method according to claim 5, wherein inputting the original features corresponding to the sample code to be detected into the target type virus identification model and determining the virus classification result of the sample code to be detected comprises:

7. The method of claim 5, wherein after determining the result of the target type virus identification model with the highest probability value as the virus classification result of the sample code to be detected, the method further comprises:

inputting the sample code to be detected into a sandbox for execution, and determining an execution result corresponding to the sample code to be detected;

verifying, according to the execution result, whether the sample code to be detected is consistent with the result of the target type virus identification model with the highest probability value;

and updating and training the target type virus identification model with the highest probability value according to the verification result.

8. An apparatus for identifying a network virus, the apparatus comprising:

9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the network virus identification method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium storing a computer program, wherein the computer program is executed by a processor to implement the network virus identification method according to any one of claims 1 to 7.
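
By way of illustration only, the flow recited in claims 5 to 7 (selecting target types by similarity against a virus library, taking the result of the target type virus identification model with the highest probability value, and checking that result against a sandbox execution result) can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the claims do not name a similarity measure, so cosine similarity is assumed; the names `select_target_types`, `classify`, and `verify_and_collect` are hypothetical; each model is a stand-in callable returning a probability; and the sandbox result is assumed to be reducible to a virus type label.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical stand-in for a trained target type model:
# maps a feature vector to the probability that the sample is of that type.
Model = Callable[[Vector], float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Assumed similarity measure; the claims do not specify one."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_target_types(
    features: Vector, virus_library: Dict[str, Vector], threshold: float
) -> List[str]:
    """Keep every virus type whose library features exceed the similarity threshold."""
    return [
        virus_type
        for virus_type, reference in virus_library.items()
        if cosine_similarity(features, reference) > threshold
    ]


def classify(
    features: Vector, models: Dict[str, Model], target_types: List[str]
) -> Optional[Tuple[str, float]]:
    """Run each selected target type model and return the highest-probability result."""
    scored = [(t, models[t](features)) for t in target_types if t in models]
    return max(scored, key=lambda item: item[1]) if scored else None


def verify_and_collect(
    features: Vector,
    predicted_type: str,
    sandbox_type: str,
    retrain_queue: List[Tuple[str, List[float], float]],
) -> bool:
    """Compare the prediction with the sandbox result and queue a labeled
    example (1.0 = confirmed, 0.0 = contradicted) for update training."""
    consistent = predicted_type == sandbox_type
    retrain_queue.append((predicted_type, list(features), 1.0 if consistent else 0.0))
    return consistent
```

In this sketch the similarity step only narrows the set of models that are evaluated, and the sandbox check produces labeled examples that a later update training pass could consume; how the update training itself is performed is left open here.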