CN114266045A - Network virus identification method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN114266045A
Authority
CN
China
Prior art keywords
virus
feature
characteristic
sample
hash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111522830.7A
Other languages
Chinese (zh)
Inventor
潘佳斌
董雷
童志明
肖新光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Antiy Technology Group Co Ltd
Original Assignee
Antiy Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Antiy Technology Group Co Ltd
Priority to CN202111522830.7A
Publication of CN114266045A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a network virus identification method and device, computer equipment and a storage medium, relates to the technical field of computer security, and is used to improve the efficiency and accuracy of network virus identification. The method mainly comprises: determining the original features and virus labels corresponding to multiple types of virus samples; calculating the hash codes corresponding to the original features; determining the locality-sensitive hash features corresponding to the samples according to those hash codes; clustering the locality-sensitive hash features and the virus labels corresponding to the samples with a clustering algorithm to obtain a plurality of clusters; and identifying, according to the plurality of clusters, whether a sample to be detected is a network virus.

Description

Network virus identification method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of network security technologies, and in particular, to a method and an apparatus for identifying a network virus, a computer device, and a storage medium.
Background
Malicious code recognition is, in essence, a complex and very-large-scale task of classifying and discriminating network viruses. Traditional methods, which extract discriminative feature fragments through manual analysis or automation, struggle to generalize well enough to discover unknown samples, and they respond with some lag.
The traditional way to analyze and detect a network virus is to analyze and debug it manually, extract a distinctive signature that reflects its behavior pattern, and then use that signature for detection. However, manual detection of network viruses is inefficient and inaccurate.
Disclosure of Invention
The embodiment of the application provides a network virus identification method, a network virus identification device, computer equipment and a storage medium, which are used for improving the efficiency and accuracy of network virus identification.
The embodiment of the invention provides a network virus identification method, which comprises the following steps:
determining the original features and virus labels corresponding to multiple types of virus samples;
calculating the hash codes corresponding to the original features;
determining the locality-sensitive hash features corresponding to the samples according to the hash codes corresponding to the original features;
clustering the locality-sensitive hash features and the virus labels corresponding to the samples according to a clustering algorithm to obtain a plurality of clusters;
and identifying, according to the plurality of clusters, whether a sample to be detected is a network virus.
The embodiment of the invention provides a network virus identification device, which comprises:
the determining module is used for determining the original features and virus labels corresponding to multiple types of virus samples;
the calculation module is used for calculating the hash codes corresponding to the original features;
the determining module is further used for determining the locality-sensitive hash features corresponding to the samples according to the hash codes corresponding to the original features;
the calculation module is further used for clustering the locality-sensitive hash features and the virus labels corresponding to the samples according to a clustering algorithm to obtain a plurality of clusters;
and the identification module is used for identifying, according to the plurality of clusters, whether a sample to be detected is a network virus.
A computer device comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the computer program to realize the network virus identification method.
A computer-readable storage medium, which stores a computer program that, when executed by a processor, implements the network virus identification method described above.
A computer program product comprising a computer program which, when executed by a processor, implements the above-described network virus identification method.
The invention provides a network virus identification method and device, computer equipment and a storage medium. Original features and virus labels corresponding to multiple types of virus samples are determined, and the hash codes corresponding to the original features are calculated; the locality-sensitive hash features corresponding to the sample program code are determined according to those hash codes; the locality-sensitive hash features and the virus labels corresponding to the sample program code are clustered with a clustering algorithm to obtain a plurality of clusters; and finally, whether a sample to be detected is a network virus is identified according to the plurality of clusters. By using locality-sensitive hash feature fusion, the invention achieves feature dimension reduction and a uniform representation while largely retaining the sample-specific information in the multi-source features; clustering is then performed on the locality-sensitive hash features to determine whether the sample to be detected is a network virus, which improves the efficiency and accuracy of network virus identification.
Drawings
Fig. 1 is a flowchart of a network virus identification method provided in the present application;
Fig. 2 is a flowchart of another network virus identification method provided in the present application;
Fig. 3 is a schematic structural diagram of a network virus identification apparatus provided in the present application;
Fig. 4 is a schematic diagram of a computer device provided in the present application.
Detailed Description
To aid understanding of the above technical solutions, the technical solutions of the embodiments of the present application are described in detail below with reference to the drawings and specific embodiments. It should be understood that the embodiments and their specific features are detailed descriptions of the technical solutions of the present application rather than limitations on them, and that the embodiments and their technical features may be combined with one another where no conflict arises.
Referring to Fig. 1, a network virus identification method according to an embodiment of the present invention comprises steps S101 to S105:
Step S101: determining the original features and virus labels corresponding to multiple types of virus samples.
The original features refer to malicious code feature information extracted from sample program codes through means of static and dynamic feature analysis and the like, and the original features comprise static features and dynamic features. Specifically, static characteristics can be obtained through static analysis, and the static characteristics comprise file format information, file attribute information, character string information, binary information and instruction characteristic information; the dynamic characteristics are obtained by using dynamic analysis, and the dynamic characteristics include local behavior characteristics, network behavior characteristics, API call characteristics, and the like.
Further, after each type of original feature is determined, this embodiment preprocesses it according to its feature value type, that is, the raw form in which the extracted feature is represented. By analogy, for a person, height and weight are numerical values, gender is a Boolean variable, and a fingerprint is an image. Specifically, according to their data type in the sample program code, the original features may be divided into numerical features (number of file resources, number of file sections), character features, serialization features (disassembled instruction sequences), graph features (system call flow graphs), Boolean features (whether an executable section exists), and the like.
In this embodiment, the virus label indicates the type of virus, and there are as many virus labels as there are virus types. Viruses can be divided into categories such as viruses, Trojans and worms; each category contains multiple malicious code families, each family may have multiple variants, and each variant may have multiple files. The different sample classes here may be any of the different malicious code variants.
It should be noted that, in addition to the virus type, the virus label in this embodiment may also indicate the form in which the virus is presented, for example a self-extracting archive or a packed (shelled) file; this embodiment does not specifically limit the form.
Step S102: calculating the hash codes corresponding to the original features.
In an optional embodiment provided by the present invention, the original features comprise numerical features, character features, serialization features and/or graph features, and calculating the hash codes respectively corresponding to these features comprises:
Step S1021: performing hash calculation on the numerical features and the character features to obtain their respective hash codes.
Specifically, hash calculation is performed directly on the feature value of a character feature (such as an IP address or a domain name) to obtain its hash code. For a numerical feature (such as the number of PE file sections or the number of resource files), the hash code may be calculated either from the feature name alone or from the feature name together with the corresponding feature value.
For example, if a numerical feature is named "called file number" and its value is 50, the hash code may be calculated from "called file number" alone, or from "called file number" combined with the value 50.
Further, in this embodiment, the numerical feature may be normalized before its hash code is calculated, and the hash calculation is then performed on the normalized numerical feature.
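As an illustration of step S1021, the following is a minimal sketch rather than the method of the application. The text does not name a hash function or a code length, so MD5 and a 128-bit code are assumptions, and the helper names hash_code, hash_character_feature and hash_numerical_feature are hypothetical.

```python
import hashlib
from typing import List, Optional

HASH_BITS = 128  # assumed code length (MD5 yields exactly 128 bits)


def hash_code(text: str) -> List[int]:
    """Map a string to a fixed-length list of 0/1 bits (MD5 is an assumed choice)."""
    digest = hashlib.md5(text.encode("utf-8")).digest()  # 16 bytes
    return [(byte >> shift) & 1 for byte in digest for shift in range(7, -1, -1)]


def hash_character_feature(value: str) -> List[int]:
    """Character features (IP address, domain name) are hashed on their value."""
    return hash_code(value)


def hash_numerical_feature(name: str, value: Optional[float] = None) -> List[int]:
    """Numerical features are hashed on the name alone, or on name plus value."""
    if value is None:
        return hash_code(name)
    return hash_code(f"{name}={value}")


name_only = hash_numerical_feature("called file number")
name_and_value = hash_numerical_feature("called file number", 50)
domain_code = hash_character_feature("example.com")
```

Both variants described above are covered: name_only hashes the feature name alone, and name_and_value hashes the name combined with the value 50.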
Step S1022: converting each feature in the serialized features into a fixed-length feature vector.
The length of the feature vector may be the same as the length of the hash code.
Step S1023: adding the feature vectors of all features in the serialized features to obtain a target feature vector, and determining the hash code corresponding to the serialized features according to the target feature vector.
Determining the hash code corresponding to the serialized features according to the target feature vector comprises: obtaining the value of each element of the target feature vector; and resetting each element whose value is greater than 0 to 1 and each element whose value is less than or equal to 0 to 0, to obtain the hash code corresponding to the serialized features.
For example, suppose the serialized feature is a disassembled instruction sequence with content (lea, mov, mov, cmp, jz). Embedding is applied to obtain a fixed-length (128-dimensional) vector for each of lea, mov, mov, cmp and jz; these vectors are summed to obtain one fixed-length vector for the whole sequence; and each element of that vector is thresholded (1 if greater than 0, 0 if less than or equal to 0) to obtain a 128-bit binary sequence, which is the hash code corresponding to the serialized feature.
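The sum-and-threshold procedure of steps S1022 and S1023 can be sketched as follows. This is only an illustration: a real system would use a learned Embedding, whereas toy_embedding below is a hypothetical stand-in that derives a deterministic vector of +1.0 and -1.0 values from an MD5 digest, so that the example runs without a trained model.

```python
import hashlib
from typing import List

VECTOR_LENGTH = 128  # fixed vector length, matching the 128-bit example


def toy_embedding(token: str) -> List[float]:
    """Hypothetical stand-in for a learned embedding: values in {-1.0, +1.0}."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    bits = [(byte >> shift) & 1 for byte in digest for shift in range(8)]
    return [1.0 if bit else -1.0 for bit in bits]


def sequence_hash_code(tokens: List[str]) -> List[int]:
    """Sum the per-token vectors, then threshold each element at 0."""
    total = [0.0] * VECTOR_LENGTH
    for token in tokens:
        for i, component in enumerate(toy_embedding(token)):
            total[i] += component
    # Elements greater than 0 become 1; elements less than or equal to 0 become 0.
    return [1 if component > 0 else 0 for component in total]


code = sequence_hash_code(["lea", "mov", "mov", "cmp", "jz"])
```

Because the vectors are summed before thresholding, a sequence that differs by one instruction changes only a few elements of the sum, which is what makes the resulting code locality-sensitive.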
A graph feature can be represented as a set of nodes (functions or API calls) and edges (the relationships between them), and the nodes and edges of the graph can be given a vectorized representation by Embedding.
Step S103: determining the locality-sensitive hash feature corresponding to the sample program code according to the hash codes corresponding to the original features.
Specifically, this embodiment may add the hash codes respectively corresponding to the numerical features, the character features, the serialization features and the graph features, and take the sum as the locality-sensitive hash feature corresponding to the sample program code (original features).
In an optional embodiment of the present invention, determining the locality-sensitive hash feature corresponding to the sample program code according to the hash codes respectively corresponding to the numerical features, the character features, the serialization features and the graph features comprises: determining the weight values respectively corresponding to the numerical features, the character features, the serialization features and the graph features; and performing a weighted calculation on the hash codes respectively corresponding to these features to obtain the locality-sensitive hash feature corresponding to the sample program code.
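The text does not spell out how the weighted calculation combines the hash codes. One common way to do it, shown here purely as an assumption, is the SimHash-style signed accumulation: each 1 bit contributes +weight, each 0 bit contributes -weight, and the sign of the total gives the fused bit. The function name fuse_hash_codes and the weights in the example are hypothetical.

```python
from typing import List, Sequence


def fuse_hash_codes(codes: Sequence[Sequence[int]], weights: Sequence[float]) -> List[int]:
    """SimHash-style fusion (an assumed scheme): signed, weighted bit accumulation."""
    length = len(codes[0])
    totals = [0.0] * length
    for code, weight in zip(codes, weights):
        for i, bit in enumerate(code):
            totals[i] += weight if bit else -weight
    # A positive total means the weighted majority of codes had a 1 at this position.
    return [1 if total > 0 else 0 for total in totals]


# Three 4-bit hash codes with illustrative weights.
fused = fuse_hash_codes(
    [[1, 0, 1, 0], [0, 0, 1, 1], [1, 1, 0, 0]],
    [0.5, 0.2, 0.2],
)
```

With these weights the first code dominates every position, so the fused result equals it; with equal weights the result is a plain per-bit majority vote.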
In this embodiment, the weight values corresponding to the numerical features, the character features, the serialization features and the graph features may be determined as follows: normalizing the numerical features to obtain their weight values; determining the weight values of the character features through the term frequency-inverse document frequency (TF-IDF) algorithm; and presetting the weight values of the serialization features and the graph features. Specifically, for character features or Boolean features, the term frequency (TF) of each feature value and its inverse document frequency (IDF) over the whole sample set are computed, and the weight value is calibrated by the TF-IDF method; for the Embedding hash codes obtained from the serialized features, a pre-calibrated empirical weight is used as the weight of the corresponding hash code.
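The TF-IDF calibration for character features might be sketched as below. Several TF-IDF variants exist and the text does not say which is used; this sketch assumes the smoothed form idf = ln((1 + N) / (1 + df)) + 1, and the function name and sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_weights(sample: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Weight each feature value of one sample by TF-IDF over the whole sample set."""
    counts = Counter(sample)
    n_samples = len(corpus)
    weights = {}
    for value, count in counts.items():
        tf = count / len(sample)  # frequency of the value within this sample
        doc_freq = sum(1 for other in corpus if value in other)
        idf = math.log((1 + n_samples) / (1 + doc_freq)) + 1  # assumed smoothed variant
        weights[value] = tf * idf
    return weights


# Each inner list holds the character feature values (domains) of one sample.
corpus = [["a.example", "b.example"], ["a.example"], ["a.example", "c.example"]]
weights = tf_idf_weights(corpus[0], corpus)
```

Here "a.example" appears in every sample and therefore receives a lower weight than "b.example", which appears in only one: feature values that are rare across the sample set are treated as more discriminative.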
In addition, the determination method of the weight values corresponding to the numerical characteristic, the character characteristic, the serialization characteristic and the graph characteristic may also be: combining the numerical type features, the character type features, the serialization features and the graph features to obtain summary features; inputting the summarized features into a weight recognition model to obtain weight values corresponding to the features, wherein the weight recognition model is obtained by training the summarized feature samples and the weight values corresponding to the features in the summarized feature samples, and the weight values of the features are determined according to a TF-IDF algorithm.
Specifically, hash codes corresponding to the numerical characteristic, the character characteristic, the serialization characteristic, and the graph characteristic are summarized to obtain a summarized characteristic list, and then the summarized characteristic list is input to a weight identification model to obtain weight values corresponding to the characteristics.
Step S104: clustering the locality-sensitive hash features and the virus labels corresponding to the sample program codes according to a clustering algorithm to obtain a plurality of clusters.
In this embodiment, the Hamming distance may be used as the feature distance metric, and the locality-sensitive hash features of all sample program codes are taken as input for clustering to obtain a plurality of clusters. Then, according to the distribution of the virus labels associated with the locality-sensitive hash features in each cluster, the virus label with the largest share is selected as the virus label of that cluster. The virus label may include the malware family name to which the sample program code belongs, together with flag information such as whether the sample is a self-extracting archive, whether it is packed, and whether it is an APT tool.
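The text does not name a particular clustering algorithm. As a minimal sketch, the example below assumes a simple greedy "leader" clustering under the Hamming distance, followed by the majority-label assignment described above; the function names, the distance threshold and the sample data are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_samples(samples: List[Tuple[List[int], str]], max_distance: int) -> List[Dict]:
    """Greedy leader clustering (an assumed algorithm) with majority cluster labels."""
    clusters: List[Dict] = []
    for bits, label in samples:
        for cluster in clusters:
            if hamming(bits, cluster["leader"]) <= max_distance:
                cluster["members"].append((bits, label))
                break
        else:  # no existing cluster is close enough: start a new one
            clusters.append({"leader": bits, "members": [(bits, label)]})
    for cluster in clusters:
        votes = Counter(label for _, label in cluster["members"])
        cluster["label"] = votes.most_common(1)[0][0]  # label with the largest share
    return clusters


samples = [
    ([0, 0, 0, 0, 0, 0, 0, 0], "trojan"),
    ([0, 0, 0, 0, 0, 0, 0, 1], "trojan"),
    ([0, 0, 0, 0, 0, 0, 1, 1], "worm"),
    ([1, 1, 1, 1, 1, 1, 1, 1], "worm"),
    ([1, 1, 1, 1, 1, 1, 1, 0], "worm"),
]
clusters = cluster_samples(samples, max_distance=2)
```

The first cluster contains two Trojan samples and one worm sample, so its cluster label is "trojan" even though one member carries a different label.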
Step S105: identifying, according to the plurality of clusters, whether the sample to be detected is a network virus.
Identifying whether the sample to be detected is a network virus according to the plurality of clusters comprises: acquiring the original features of the sample to be detected; determining the locality-sensitive hash feature corresponding to those original features; determining, through the clustering algorithm, whether the locality-sensitive hash feature of the sample to be detected belongs to a cluster; if it belongs to a cluster, determining the network virus corresponding to the sample to be detected according to the cluster label of that cluster, where the cluster label represents the virus type of the cluster; and if it does not belong to any cluster, determining that the sample to be detected is not a network virus.
For example, a K-nearest-neighbor search is performed for the sample to be detected with a maximum effective distance threshold of 6 and K = 20. Labeled sample program codes whose locality-sensitive hash features lie at a distance of less than 6 are effective neighbors; suppose 100 effective neighbors are found, with distances ranging from 0 to 5. The 100 neighbors are sorted by distance in ascending order and the nearest 20 are selected. The virus labels of these 20 samples (sample family, whether self-extracting archive, whether packed) are put to a vote. If 19 of the 20 samples are labeled Trojan and 1 is labeled worm, and all are labeled as not packed and not self-extracting archives, the sample to be detected is judged to be a Trojan that is neither a self-extracting archive nor packed.
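The nearest-neighbor vote in this example can be sketched as follows, using the stated defaults K = 20 and a maximum effective distance of 6. For brevity the sketch votes on a single label per sample, whereas the example above votes on several label fields; the function name knn_vote and the toy data are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_vote(query: Sequence[int],
             labelled: List[Tuple[Sequence[int], str]],
             k: int = 20,
             max_distance: int = 6) -> Optional[str]:
    """Vote among the k nearest labelled samples closer than max_distance."""
    neighbours = [(hamming(query, bits), label) for bits, label in labelled]
    valid = [pair for pair in neighbours if pair[0] < max_distance]
    if not valid:
        return None  # no effective neighbour: the sample matches no known virus
    valid.sort(key=lambda pair: pair[0])  # ascending distance
    votes = Counter(label for _, label in valid[:k])
    return votes.most_common(1)[0][0]


query = [0] * 16
labelled = (
    [([0] * 16, "trojan")] * 3       # distance 0
    + [([1] + [0] * 15, "worm")]     # distance 1
    + [([1] * 16, "worm")] * 5       # distance 16, beyond the threshold
)
verdict = knn_vote(query, labelled)
```

Only four of the nine labelled samples fall within the distance threshold, and three of those four are Trojans, so the vote returns "trojan".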
The invention provides a network virus identification method in which the original features and virus labels respectively corresponding to samples of a plurality of virus types are determined, where the original features comprise numerical features, character features, serialization features and/or graph features; the hash codes corresponding to each of the original features are calculated; the locality-sensitive hash feature corresponding to each sample is determined according to those hash codes; the locality-sensitive hash features and the virus labels corresponding to the samples are clustered with a clustering algorithm to obtain a plurality of clusters; and finally, whether the sample to be detected is a network virus is identified according to the plurality of clusters. By using locality-sensitive hash feature fusion, the invention achieves feature dimension reduction and a uniform representation while largely retaining the sample-specific information in the multi-source features; clustering is then performed on the locality-sensitive hash features to determine whether the sample to be detected is a network virus, which improves the efficiency and accuracy of network virus identification.
Referring to fig. 2, another network virus identification method according to an embodiment of the present invention includes steps S201 to S205:
Step S201: determining the locality-sensitive hash feature of the sample to be detected.
It should be noted that the locality-sensitive hash feature in step S201 is determined in the same way as described for the corresponding step in Fig. 1, so the description is not repeated here.
Step S202: obtaining, according to the locality-sensitive hash feature of the sample to be detected, the network virus determined for the sample by the clustering algorithm and the network virus determined for the sample by the virus library.
It should be noted that obtaining the network virus determined by the clustering algorithm from the locality-sensitive hash feature of the sample to be detected is implemented in the same way as described for the corresponding step in Fig. 1, so the description is not repeated here.
In this embodiment, obtaining the network virus determined by the virus library according to the locality-sensitive hash feature of the sample to be detected comprises: calculating the similarity between the locality-sensitive hash feature of the sample to be detected and the virus features of each type in the virus library; and determining the virus type whose locality-sensitive hash feature has a similarity exceeding a preset value as the network virus corresponding to the sample to be detected.
The virus library stores the virus type corresponding to the locality-sensitive hash feature of each type of virus. In this embodiment, after the locality-sensitive hash feature of the sample to be detected is obtained, its similarity to each locality-sensitive hash feature in the virus library is calculated, and the virus type corresponding to any virus feature whose similarity exceeds the preset value is determined as the network virus corresponding to the sample to be detected. The preset value may be set according to actual requirements, for example 80%, 85% or 90%; this embodiment does not specifically limit it.
For example, the virus library contains the locality-sensitive hash features of 5 virus types: virus type 1 through virus type 5. Suppose the similarity between the locality-sensitive hash feature of the sample to be detected and that of virus type 1 is 65%, virus type 2 is 60%, virus type 3 is 90%, virus type 4 is 54%, and virus type 5 is 89%. If the preset value is 85%, virus type 3 and virus type 5 are determined to be the network viruses corresponding to the sample to be detected.
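The library lookup can be sketched as follows. The text does not define the similarity measure, so this sketch assumes the fraction of matching bit positions (that is, 1 minus the normalized Hamming distance); the function names and the 20-bit toy library are hypothetical.

```python
from typing import Dict, Sequence


def similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of matching bit positions (an assumed similarity measure)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def match_library(query: Sequence[int],
                  library: Dict[str, Sequence[int]],
                  preset_value: float = 0.85) -> Dict[str, float]:
    """Return every virus type whose similarity to the query exceeds the preset value."""
    matches = {}
    for virus_type, bits in library.items():
        score = similarity(query, bits)
        if score > preset_value:
            matches[virus_type] = score
    return matches


query = [1] * 20
library = {
    "virus type 1": [1] * 13 + [0] * 7,  # 65% similar
    "virus type 3": [1] * 18 + [0] * 2,  # 90% similar
    "virus type 5": [1] * 19 + [0] * 1,  # 95% similar
}
matches = match_library(query, library)
```

As in the example above, more than one virus type can exceed the preset value, which is why the later steps compare the candidates by probability value.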
Step S203: calculating the probability value of each network virus determined by the clustering algorithm and by the virus library, combining the values where both identify the same network virus.
Step S204: determining the network virus with the highest probability value as the network virus corresponding to the sample to be detected.
For example, in step S202 the network viruses determined for the sample to be detected from the virus library are virus type 3 and virus type 5, where the similarity (that is, the probability value) of virus type 3 is 90% and that of virus type 5 is 89%. The network viruses determined by the clustering algorithm are virus type 3 and virus type 2, where the probability of belonging to the virus type 3 cluster is 90% and the probability of belonging to the virus type 2 cluster is 20%. The probability values for the network virus identified by both methods (virus type 3) are averaged, giving 90% for virus type 3; virus type 5 keeps 89% and virus type 2 keeps 20%. The network virus with the highest probability value is virus type 3, so the virus type of the sample to be detected is determined to be virus type 3.
The invention provides a network virus identification method in which, according to the locality-sensitive hash feature of the sample to be detected, the network virus determined by the clustering algorithm and the network virus determined by the virus library are both obtained, and the network virus corresponding to the sample to be detected is then determined by combining the two results, which further improves the accuracy of network virus identification.
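The arithmetic of this example can be reproduced with a short sketch. It follows the averaging described in the example; the function name fuse_results and the dictionary layout are hypothetical.

```python
from typing import Dict, Tuple


def fuse_results(cluster_probs: Dict[str, float],
                 library_probs: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Average the values of types found by both methods, then pick the highest."""
    combined = {}
    for virus_type in set(cluster_probs) | set(library_probs):
        values = [source[virus_type]
                  for source in (cluster_probs, library_probs)
                  if virus_type in source]
        combined[virus_type] = sum(values) / len(values)
    best = max(combined, key=combined.get)
    return best, combined


best, combined = fuse_results(
    cluster_probs={"virus type 3": 0.90, "virus type 2": 0.20},
    library_probs={"virus type 3": 0.90, "virus type 5": 0.89},
)
```

A type reported by only one method keeps its single value, so virus type 5 stays at 89% and virus type 2 at 20%, while virus type 3 averages to 90% and is selected.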
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, an apparatus for identifying a network virus is provided, where the apparatus for identifying a network virus corresponds to the method for identifying a network virus in the foregoing embodiment one to one. As shown in fig. 3, the functional modules of the network virus identification apparatus are described in detail as follows:
a determining module 31, configured to determine original features and virus tags corresponding to multiple types of virus samples;
a calculating module 32, configured to calculate a hash code corresponding to the original feature;
the determining module 31 is further configured to determine, according to the hash code corresponding to the original feature, a locality sensitive hash feature corresponding to the sample;
the calculating module 32 is further configured to cluster the locality-sensitive hash features and the virus labels corresponding to the samples according to a clustering algorithm to obtain a plurality of clusters;
and the identifying module 33 is configured to identify, according to the plurality of clusters, whether the sample to be detected is a network virus.
In an alternative embodiment, the primitive features include: a numeric feature, the glyph-like feature, the serialization feature, and/or the graph feature; a calculation module 32, specifically configured to;
performing hash calculation on the numerical characteristic and the character type characteristic to obtain hash codes corresponding to the numerical characteristic and the character type characteristic respectively;
converting each of the serialized features into a fixed-length feature vector, the length of the feature vector being the same as the length of the hash code;
adding the feature vectors of each feature in the serialized features to obtain a target feature vector; and determining the hash code corresponding to the serialized features according to the target feature vector.
In an alternative embodiment, the calculating module 32 is further specifically configured to perform the following:
obtaining the value of each component of the target feature vector;
resetting each component of the target feature vector whose value is greater than 0 to 1, and resetting each component whose value is less than or equal to 0 to 0, to obtain the hash code corresponding to the serialized features.
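The hashing of serialized features described above (fixed-length vector per element, component-wise sum, then thresholding at 0) can be sketched as follows. How each element is turned into a vector is not specified in this passage; the sketch assumes a SimHash-style mapping in which each bit of an MD5 digest becomes +1 or -1, and the 64-bit code length `HASH_BITS` is likewise an assumed value.

```python
import hashlib
from typing import List, Sequence

HASH_BITS = 64  # assumed hash-code length; the text does not fix a value


def element_vector(element: str, bits: int = HASH_BITS) -> List[int]:
    """Map one sequence element to a fixed-length vector of +1/-1 entries.

    Assumption: bit i of the element's MD5 digest decides the sign of
    component i (bits must not exceed 128, the MD5 digest size).
    """
    value = int.from_bytes(hashlib.md5(element.encode("utf-8")).digest(), "big")
    return [1 if (value >> i) & 1 else -1 for i in range(bits)]


def serialized_hash(elements: Sequence[str], bits: int = HASH_BITS) -> List[int]:
    """Add the per-element vectors, then reset values > 0 to 1 and the rest to 0."""
    target = [0] * bits  # the target feature vector
    for element in elements:
        for i, component in enumerate(element_vector(element, bits)):
            target[i] += component
    return [1 if component > 0 else 0 for component in target]
```

For example, `serialized_hash(["CreateFileA", "WriteFile", "CloseHandle"])` returns a list of 64 bits; the API names here are purely illustrative inputs.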
In an optional embodiment, the determining module 31 is specifically configured to:
determining weight values corresponding to the numerical characteristic, the character type characteristic, the serialization characteristic and the graph characteristic respectively;
and performing weighted calculation on hash codes respectively corresponding to the numerical characteristic, the character characteristic, the serialization characteristic and the graph characteristic to obtain the locality sensitive hash characteristic corresponding to the sample program code.
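One possible reading of the weighted calculation is sketched below. The passage does not spell out the arithmetic, so the sketch assumes a SimHash-style combination: every hash code has the same length, a 1 bit contributes `+weight`, a 0 bit contributes `-weight`, and the sign of each summed position gives one bit of the locality sensitive hash feature. The function name and the example weights are hypothetical.

```python
from typing import Dict, List, Tuple


def weighted_lsh(codes: Dict[str, Tuple[List[int], float]]) -> List[int]:
    """Combine per-feature hash codes into one locality sensitive hash feature.

    `codes` maps a feature name to (hash code as a bit list, weight value).
    Assumed scheme: weighted signed sum per bit position, thresholded at 0.
    """
    lengths = {len(bits) for bits, _ in codes.values()}
    if len(lengths) != 1:
        raise ValueError("all hash codes must have the same, non-zero count")
    accumulator = [0.0] * lengths.pop()
    for bits, weight in codes.values():
        for i, bit in enumerate(bits):
            accumulator[i] += weight if bit else -weight
    return [1 if value > 0 else 0 for value in accumulator]


example = weighted_lsh({
    "numerical": ([1, 0, 1, 0], 3.0),
    "character": ([1, 1, 0, 0], 5.0),
    "serialized": ([0, 0, 1, 0], 2.0),
    "graph": ([0, 1, 1, 1], 1.0),
})
```

With these illustrative weights, the character-type code dominates wherever the other three codes do not jointly outweigh it.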
In an optional embodiment, the determining module 31 is further specifically configured to perform the following:
carrying out normalization processing on the numerical characteristic to obtain a corresponding weight value;
determining a weight value corresponding to the character type characteristic through a term frequency-inverse document frequency (TF-IDF) algorithm;
and presetting the weight values corresponding to the serialization characteristics and the graph characteristics.
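The first two weighting rules above can be sketched as follows. The text names normalization and TF-IDF but not their exact formulas, so min-max scaling and the smoothed IDF variant `log((1 + N) / (1 + df)) + 1` are assumptions, as are the function names and the sample strings.

```python
import math
from typing import Dict, List


def min_max_weight(value: float, low: float, high: float) -> float:
    """Scale a numerical feature relative to the range [low, high] (assumed min-max)."""
    if high == low:
        return 0.0
    return (value - low) / (high - low)


def tf_idf_weights(tokens: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Weight each character-type feature of one sample by TF times smoothed IDF.

    `tokens` are the strings extracted from the sample; `corpus` holds the
    string lists of all samples and supplies the document frequencies.
    """
    if not tokens:
        return {}
    weights = {}
    for token in set(tokens):
        tf = tokens.count(token) / len(tokens)
        df = sum(1 for document in corpus if token in document)
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0
        weights[token] = tf * idf
    return weights


corpus = [["cmd.exe", "http"], ["cmd.exe", "regedit"], ["cmd.exe"]]
weights = tf_idf_weights(["cmd.exe", "regedit"], corpus)
```

A string that occurs in every sample, such as `cmd.exe` in this toy corpus, receives a lower weight than a string that occurs in only one sample.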
In an alternative embodiment, the determining module 31 is specifically configured to:
summarizing the numerical characteristic, the character type characteristic, the serialization characteristic and the graph characteristic to obtain a summarized characteristic list;
and inputting the summary feature list into a weight recognition model to obtain weight values corresponding to the features respectively.
In an alternative embodiment, the identifying module 33 is specifically configured to perform the following:
acquiring the original features of the sample to be detected;
determining the locality sensitive hash feature corresponding to the original features of the sample to be detected;
determining, through the clustering algorithm, whether the locality sensitive hash feature of the sample to be detected belongs to any cluster;
if the locality sensitive hash feature of the sample to be detected belongs to a cluster, determining the network virus corresponding to the sample to be detected according to the cluster label of that cluster, wherein the cluster label represents the virus type of the corresponding cluster;
and if the locality sensitive hash feature of the sample to be detected does not belong to any cluster, determining that the sample to be detected does not belong to a network virus.
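The membership test above can be sketched as a nearest-cluster lookup. The text leaves the membership criterion to the clustering algorithm, so the sketch assumes that each cluster is summarized by one representative code and that a sample belongs to the nearest cluster only if its Hamming distance is within a preset `radius`; both are illustrative assumptions.

```python
from typing import List, Optional, Sequence, Tuple


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count differing positions of two equal-length bit codes."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_cluster(
    sample: Sequence[int],
    clusters: List[Tuple[str, Sequence[int]]],
    radius: int,
) -> Optional[str]:
    """Return the label of the nearest cluster within `radius`, else None.

    `clusters` holds (cluster label, representative code) pairs. A None
    result corresponds to "does not belong to a network virus".
    """
    best_label, best_distance = None, radius + 1
    for label, representative in clusters:
        distance = hamming(sample, representative)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label


clusters = [
    ("trojan", [1, 1, 0, 0, 1, 0, 1, 1]),
    ("worm", [0, 0, 1, 1, 0, 1, 0, 0]),
]
```

Here `assign_cluster([1, 1, 0, 0, 1, 0, 1, 0], clusters, radius=2)` falls within one bit of the first representative, while a code that is far from both representatives yields `None`.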
In an optional embodiment, the calculating module 32 is further configured to calculate similarity between the locality sensitive hash feature of the sample to be detected and various types of virus features in a virus library; the virus library stores virus types respectively corresponding to the locality sensitive hash characteristics of various types of viruses;
the determining module 31 is further configured to determine the virus type corresponding to the locality sensitive hash feature whose similarity exceeds a preset value in the virus library as the network virus corresponding to the sample to be detected.
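The virus-library comparison can be sketched as below. The similarity measure is not named in this passage; the sketch assumes hash codes packed into integers and a similarity equal to one minus the normalized Hamming distance, and the library contents, bit width, and threshold are hypothetical values.

```python
from typing import Dict, List, Tuple


def similarity(a: int, b: int, bits: int) -> float:
    """Return 1 - (Hamming distance / bits) for two integer-packed hash codes."""
    return 1.0 - bin(a ^ b).count("1") / bits


def match_library(
    sample: int, library: Dict[int, str], bits: int, threshold: float
) -> List[Tuple[str, float]]:
    """List (virus type, similarity) pairs whose similarity exceeds `threshold`.

    `library` maps the locality sensitive hash feature of a known virus to
    its virus type; results are sorted from most to least similar.
    """
    hits = [
        (virus_type, similarity(sample, code, bits))
        for code, virus_type in library.items()
    ]
    return sorted(
        (hit for hit in hits if hit[1] > threshold),
        key=lambda hit: hit[1],
        reverse=True,
    )


library = {0b11001011: "trojan", 0b00110100: "worm"}
matches = match_library(0b11001010, library, bits=8, threshold=0.8)
```

In this toy library the sample differs from the first entry in a single bit, so only that entry passes the 0.8 threshold.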
In an optional embodiment, the calculating module 32 is further configured to obtain the network virus determined for the sample to be detected by the clustering algorithm and the network virus determined for the sample to be detected by the virus library, and to calculate a probability value for each network virus, where results from the clustering algorithm and from the virus library that identify the same network virus are counted together;
the determining module 31 is further configured to determine the network virus with the highest probability value as the network virus corresponding to the sample to be detected.
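How the two results are combined into probability values is not detailed here. The following is a minimal sketch under stated assumptions: each source reports a score per candidate virus type, scores for the same virus type are added, the totals are normalized so that they sum to 1, and the type with the highest value is chosen. The function name and the example scores are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple


def fuse_results(
    cluster_scores: Dict[str, float],
    library_scores: Dict[str, float],
) -> Tuple[Optional[str], Dict[str, float]]:
    """Merge clustering and virus-library results into per-type probabilities.

    Scores for the same virus type are summed across the two sources and
    divided by the overall total (an assumed normalization). Returns the
    most probable virus type, or None if neither source reports a candidate.
    """
    totals = defaultdict(float)
    for scores in (cluster_scores, library_scores):
        for virus_type, score in scores.items():
            totals[virus_type] += score
    norm = sum(totals.values())
    if norm <= 0:
        return None, {}
    probabilities = {vt: total / norm for vt, total in totals.items()}
    best = max(probabilities, key=probabilities.get)
    return best, probabilities


best, probabilities = fuse_results({"trojan": 0.75}, {"trojan": 0.5, "worm": 0.75})
```

In this example both sources support the same virus type, so its combined probability exceeds that of the type reported by the virus library alone.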
For specific limitations of the network virus identification apparatus, reference may be made to the above limitations of the network virus identification method, which are not repeated here. The modules of the above apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a network virus identification method.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
determining original characteristics and virus labels corresponding to various types of virus samples;
calculating a hash code corresponding to the original characteristic;
determining the locality sensitive hash characteristics corresponding to the samples according to the hash codes corresponding to the original characteristics;
clustering the locality sensitive hash features and the virus labels corresponding to the samples according to a clustering algorithm to obtain a plurality of clusters;
and identifying whether the sample to be detected belongs to a network virus according to the plurality of clusters.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
determining original characteristics and virus labels corresponding to various types of virus samples;
calculating a hash code corresponding to the original characteristic;
determining the locality sensitive hash characteristics corresponding to the samples according to the hash codes corresponding to the original characteristics;
clustering the locality sensitive hash features and the virus labels corresponding to the samples according to a clustering algorithm to obtain a plurality of clusters;
and identifying whether the sample to be detected belongs to a network virus according to the plurality of clusters.
In one embodiment, a computer program product is provided, the computer program product comprising a computer program executed by a processor to perform the steps of:
determining original characteristics and virus labels corresponding to various types of virus samples;
calculating a hash code corresponding to the original characteristic;
determining the locality sensitive hash characteristics corresponding to the samples according to the hash codes corresponding to the original characteristics;
clustering the locality sensitive hash features and the virus labels corresponding to the samples according to a clustering algorithm to obtain a plurality of clusters;
and identifying whether the sample to be detected belongs to a network virus according to the plurality of clusters.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (12)

1. A method for identifying a network virus, the method comprising:
determining original characteristics and virus labels corresponding to various types of virus samples;
calculating a hash code corresponding to the original characteristic;
determining the locality sensitive hash characteristics corresponding to the samples according to the hash codes corresponding to the original characteristics;
clustering the locality sensitive hash features and the virus labels corresponding to the samples according to a clustering algorithm to obtain a plurality of clusters;
and identifying whether the sample to be detected belongs to a network virus according to the plurality of clusters.
2. The method of claim 1, wherein the original features comprise: a numerical feature, a character-type feature, a serialized feature, and/or a graph feature;
the calculating the hash code corresponding to the original feature includes:
performing hash calculation on the numerical characteristic and the character type characteristic to obtain hash codes corresponding to the numerical characteristic and the character type characteristic respectively;
converting each of the serialized features into a fixed-length feature vector, the length of the feature vector being the same as the length of the hash code;
adding the feature vectors of each feature in the serialized features to obtain a target feature vector; and determining the hash code corresponding to the serialized features according to the target feature vector.
3. The method of claim 2, wherein determining the hash code corresponding to the serialized features from the target feature vector comprises:
obtaining the value of each component of the target feature vector;
resetting each component of the target feature vector whose value is greater than 0 to 1, and resetting each component whose value is less than or equal to 0 to 0, to obtain the hash code corresponding to the serialized features.
4. The method of claim 2, wherein the determining the locality-sensitive hash feature corresponding to the sample program code according to the hash code corresponding to the original feature comprises:
determining weight values corresponding to the numerical characteristic, the character type characteristic, the serialization characteristic and the graph characteristic respectively;
and performing weighted calculation on hash codes respectively corresponding to the numerical characteristic, the character characteristic, the serialization characteristic and the graph characteristic to obtain the locality sensitive hash characteristic corresponding to the sample program code.
5. The method of claim 4, wherein the determining the weighting values corresponding to the numerical feature, the character feature, the serialized feature, and the graph feature respectively comprises:
carrying out normalization processing on the numerical characteristic to obtain a corresponding weight value;
determining a weight value corresponding to the character type characteristic through a term frequency-inverse document frequency (TF-IDF) algorithm;
and presetting the weight values corresponding to the serialization characteristics and the graph characteristics.
6. The method of claim 4, wherein the determining the weighting values corresponding to the numerical feature, the character feature, the serialized feature, and the graph feature respectively comprises:
summarizing the numerical characteristic, the character type characteristic, the serialization characteristic and the graph characteristic to obtain a summarized characteristic list;
and inputting the summary feature list into a weight recognition model to obtain weight values corresponding to the features respectively.
7. The method according to any one of claims 1-6, wherein the identifying whether the sample to be detected belongs to the network virus according to the plurality of clusters comprises:
acquiring the original features of the sample to be detected;
determining the locality sensitive hash feature corresponding to the original features of the sample to be detected;
determining, through the clustering algorithm, whether the locality sensitive hash feature of the sample to be detected belongs to any cluster;
and if the locality sensitive hash feature of the sample to be detected belongs to a cluster, determining the network virus corresponding to the sample to be detected according to the cluster label of that cluster, wherein the cluster label represents the virus type of the corresponding cluster.
8. The method of claim 7, further comprising:
calculating the similarity between the locality sensitive hash feature of the sample to be detected and various types of virus features in a virus library; the virus library stores virus types respectively corresponding to the locality sensitive hash features of various types of viruses;
and determining the virus type corresponding to the locality sensitive hash characteristics with similarity exceeding a preset value in the virus library as the network virus corresponding to the sample to be detected.
9. The method of claim 8, further comprising:
acquiring the network viruses corresponding to the samples to be detected determined by the clustering algorithm and the network viruses corresponding to the samples to be detected determined by the virus library;
calculating a probability value for each network virus, where results from the clustering algorithm and from the virus library that identify the same network virus are counted together;
and determining the network virus with the highest probability value as the network virus corresponding to the sample to be detected.
10. An apparatus for identifying a network virus, the apparatus comprising:
the determining module is used for determining original characteristics and virus labels corresponding to various types of virus samples;
the calculation module is used for calculating the hash code corresponding to the original characteristic;
the determining module is further configured to determine a locality sensitive hash feature corresponding to the sample according to the hash code corresponding to the original feature;
the calculation module is further used for clustering the locality sensitive hash features and the virus labels corresponding to the samples according to a clustering algorithm to obtain a plurality of clusters;
and the identification module is used for identifying whether the sample to be detected belongs to a network virus according to the plurality of clusters.
11. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the network virus identification method according to any one of claims 1 to 9 when executing the computer program.
12. A computer-readable storage medium storing a computer program, wherein the computer program is executed by a processor to implement the network virus identification method according to any one of claims 1 to 9.
CN202111522830.7A 2021-12-13 2021-12-13 Network virus identification method and device, computer equipment and storage medium Pending CN114266045A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111522830.7A CN114266045A (en) 2021-12-13 2021-12-13 Network virus identification method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111522830.7A CN114266045A (en) 2021-12-13 2021-12-13 Network virus identification method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114266045A true CN114266045A (en) 2022-04-01

Family

ID=80826939

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111522830.7A Pending CN114266045A (en) 2021-12-13 2021-12-13 Network virus identification method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114266045A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114969732A (en) * 2022-04-28 2022-08-30 国科华盾(北京)科技有限公司 Malicious code detection method and device, computer equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334781A (en) * 2018-03-07 2018-07-27 腾讯科技(深圳)有限公司 Method for detecting virus, device, computer readable storage medium and computer equipment
CN111901282A (en) * 2019-05-05 2020-11-06 四川大学 Method for generating malicious code flow behavior detection structure
CN112347477A (en) * 2019-08-07 2021-02-09 腾讯云计算(北京)有限责任公司 Family variant malicious file mining method and device
CN112084500A (en) * 2020-09-15 2020-12-15 腾讯科技(深圳)有限公司 Method and device for clustering virus samples, electronic equipment and storage medium
CN112615861A (en) * 2020-12-17 2021-04-06 赛尔网络有限公司 Malicious domain name identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination