CN114266045A - Network virus identification method and device, computer equipment and storage medium - Google Patents
- Publication number
- CN114266045A, CN202111522830.7A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The application provides a network virus identification method and device, computer equipment and a storage medium, relates to the technical field of computational security, and is used to improve the efficiency and accuracy of network virus identification. The method mainly comprises the following steps: determining original features and virus labels corresponding to multiple types of virus samples; calculating hash codes corresponding to the original features; determining the locality sensitive hash features corresponding to the samples according to the hash codes corresponding to the original features; performing clustering calculation on the locality sensitive hash features and the virus labels corresponding to the samples according to a clustering algorithm to obtain a plurality of clusters; and identifying whether a sample to be detected belongs to a network virus according to the plurality of clusters.
Description
Technical Field
The present application relates to the field of network security technologies, and in particular, to a method and an apparatus for identifying a network virus, a computer device, and a storage medium.
Background
Malicious code recognition is, in essence, a complex and very large-scale task of classifying and discriminating network viruses. Traditional methods that extract discriminative feature fragments, whether by manual analysis or automatically, do not generalize well enough to discover unknown samples, and they lag behind newly emerging viruses.
The traditional way to analyze and detect a network virus is to analyze and debug the virus manually, extract a distinctive signature that characterizes its behavior pattern, and then use that signature to detect the virus. However, the efficiency and accuracy of such manual detection are low.
Disclosure of Invention
The embodiment of the application provides a network virus identification method, a network virus identification device, computer equipment and a storage medium, which are used for improving the efficiency and accuracy of network virus identification.
The embodiment of the invention provides a network virus identification method, which comprises the following steps:
determining original characteristics and virus labels corresponding to various types of virus samples;
calculating a hash code corresponding to the original characteristic;
determining the locality sensitive hash characteristics corresponding to the samples according to the hash codes corresponding to the original characteristics;
performing clustering calculation on the locality sensitive hash features and the virus labels corresponding to the samples according to a clustering algorithm to obtain a plurality of clusters;
and identifying whether a sample to be detected belongs to a network virus according to the plurality of clusters.
The embodiment of the invention provides a network virus identification device, which comprises:
the determining module is used for determining original characteristics and virus labels corresponding to various types of virus samples;
the calculation module is used for calculating the hash code corresponding to the original characteristic;
the determining module is further configured to determine a locality sensitive hash feature corresponding to the sample according to the hash code corresponding to the original feature;
the calculation module is further used for performing clustering calculation on the locality sensitive hash features and the virus labels corresponding to the samples according to a clustering algorithm to obtain a plurality of clusters;
and the identification module is used for identifying whether a sample to be detected belongs to a network virus according to the plurality of clusters.
A computer device comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the computer program to realize the network virus identification method.
A computer-readable storage medium, which stores a computer program that, when executed by a processor, implements the network virus identification method described above.
A computer program product comprising a computer program which, when executed by a processor, implements the above-described network virus identification method.
The invention provides a network virus identification method and device, computer equipment and a storage medium. Original features and virus labels corresponding to multiple types of virus samples are determined, and hash codes corresponding to the original features are calculated; the locality sensitive hash features corresponding to the sample program codes are determined according to those hash codes; clustering calculation is performed on the locality sensitive hash features and the virus labels corresponding to the sample program codes according to a clustering algorithm to obtain a plurality of clusters; and finally, whether a sample to be detected belongs to a network virus is identified according to the plurality of clusters. By using locality sensitive hash feature fusion, the invention achieves feature dimension reduction and a uniform formatted representation while largely retaining the sample-specific information in the multi-source features. Clustering calculation is then performed on the locality sensitive hash features to determine whether the sample to be detected belongs to a network virus, which improves the efficiency and accuracy of network virus identification.
Drawings
Fig. 1 is a flowchart of a network virus identification method provided in the present application;
Fig. 2 is a flowchart of another network virus identification method provided in the present application;
Fig. 3 is a schematic structural diagram of a network virus identification apparatus provided in the present application;
Fig. 4 is a schematic diagram of a computer device provided in the present application.
Detailed Description
For a better understanding of the technical solutions described above, the technical solutions of the embodiments of the present application are described in detail below with reference to the drawings and specific embodiments. It should be understood that the embodiments and the specific features in them are detailed descriptions of the technical solutions of the present application and do not limit those solutions, and that the embodiments and their technical features may be combined with each other where no conflict arises.
Referring to Fig. 1, a network virus identification method according to an embodiment of the present invention includes steps S101 to S105:
Step S101, determining original features and virus labels corresponding to multiple types of virus samples.
The original features refer to malicious code feature information extracted from sample program codes through means of static and dynamic feature analysis and the like, and the original features comprise static features and dynamic features. Specifically, static characteristics can be obtained through static analysis, and the static characteristics comprise file format information, file attribute information, character string information, binary information and instruction characteristic information; the dynamic characteristics are obtained by using dynamic analysis, and the dynamic characteristics include local behavior characteristics, network behavior characteristics, API call characteristics, and the like.
Further, after determining each type of feature in the original features, the embodiment needs to perform corresponding preprocessing according to a feature value type corresponding to the original features, where the feature value type refers to an extracted original representation form of the feature, for example, for a person, the feature value type of height and weight is a numerical value, the feature value type of gender is a boolean variable, and a fingerprint is a picture. Specifically, according to the data type of the original features in the sample program code, the original features may be divided into numerical features (number of file resources, number of file sections), character features, serialization features (disassembly instruction sequence), graph features (system call flow chart), boolean features (whether executable sections exist), and the like.
In the embodiment of the present invention, the virus labels indicate the types of viruses, and there are as many virus labels as there are virus types in the embodiment. Viruses can be divided into categories such as virus, Trojan and worm; each category contains several different malicious code families, each family may have several different variants, and each variant may have several different files. A sample class here may be any one of the different malicious code variants.
It should be noted that, in addition to the virus type, the virus label in this embodiment may represent the form in which the virus is delivered, for example a self-extracting archive or a packed (shelled) file; this embodiment does not specifically limit that form.
Step S102, calculating the hash codes corresponding to the original features.
In an optional embodiment provided by the present invention, the original features comprise numerical features, character features, serialized features and/or graph features, and calculating the hash codes corresponding to the original features comprises:
Step S1021, performing hash calculation on the numerical features and the character features to obtain the hash codes corresponding to the numerical features and the character features respectively.
Specifically, hash calculation is performed directly on the feature value of a character feature (such as an IP address or a domain name) to obtain the hash code corresponding to that character feature. For a numerical feature (such as the number of PE file sections or the number of resource files), hash calculation may be performed directly on the feature name to obtain the corresponding hash code, or it may be performed on the feature name together with the corresponding feature value.
For example, if the feature value of the numerical feature is named "called file number" and the feature value is 50, the hash calculation may be performed on the "called file number" to obtain the corresponding hash code, or the hash calculation may be performed according to the "called file number" in combination with the feature value 50 to obtain the corresponding hash code.
Further, in this embodiment, before the hash code corresponding to the numerical characteristic is calculated, normalization processing may be performed on the numerical characteristic, and then hash calculation may be performed according to the normalized numerical characteristic to obtain the corresponding hash code.
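Step S1021 can be illustrated with a minimal sketch. The application does not name a hash function or a normalization scheme, so the use of SHA-256 truncated to 128 bits, min-max normalization, and all function names below are assumptions made only for illustration.

```python
import hashlib

HASH_BITS = 128  # assumed code length, chosen to match the 128-bit example for serialized features


def hash_code(text):
    """Hash a string to a HASH_BITS-bit integer (truncated SHA-256 is an illustrative choice)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[: HASH_BITS // 8], "big")


def hash_character_feature(value):
    """Character features (for example an IP address or domain name) are hashed by value."""
    return hash_code(value)


def hash_numeric_feature(name, value=None):
    """Numerical features are hashed by name alone, or by name combined with the value."""
    key = name if value is None else "{}={}".format(name, value)
    return hash_code(key)


def min_max_normalize(value, lo, hi):
    """Optional normalization before hashing; min-max scaling is an assumed scheme."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


by_name = hash_numeric_feature("called file number")
by_name_and_value = hash_numeric_feature("called file number", 50)
```

Hashing the name together with the value makes two samples agree on this feature only when the values match exactly, which is one reason a normalization or bucketing step before hashing may be useful.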
Step S1022, converting each item of the serialized feature into a fixed-length feature vector.
Wherein the length of the feature vector may be the same as the length of the hash code.
Step S1023, adding the feature vectors of each feature in the serialized features to obtain a target feature vector; and determining the hash code corresponding to the serialized features according to the target feature vector.
Determining the hash code corresponding to the serialized feature according to the target feature vector comprises: obtaining the value of each element of the target feature vector; and resetting each element whose value is greater than 0 to 1 and each element whose value is less than or equal to 0 to 0, thereby obtaining the hash code corresponding to the serialized feature.
For example, the serialized feature is a disassembled instruction sequence whose content is (lea, mov, mov, cmp, jz). Embedding is applied to the serialized feature to obtain a vector for each item, that is, a fixed-length (128-element) vector representation for each of lea, mov, mov, cmp and jz. The item vectors are summed to give a fixed-length vector for the sequence (lea, mov, mov, cmp, jz), and each value in that vector is binarized (1 when greater than 0, 0 when less than or equal to 0), yielding a 128-bit binary sequence that is the hash code corresponding to the serialized feature.
A graph feature can be represented as a set of nodes (functions or API calls) and edges (associations between them), and the nodes and edges of the graph can be given a vectorized representation by Embedding.
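The summation and binarization in steps S1022 and S1023 can be sketched as follows. The application does not specify the Embedding model, so `toy_embedding` is a hypothetical stand-in that derives deterministic pseudo-random values from a hash; a real system would use learned embeddings.

```python
import hashlib

DIM = 128  # vector length, equal to the hash code length in the example above


def toy_embedding(token):
    """Hypothetical stand-in for a learned embedding: DIM deterministic values in [-1, 1)."""
    values = []
    counter = 0
    while len(values) < DIM:
        block = hashlib.sha256("{}:{}".format(token, counter).encode("utf-8")).digest()
        values.extend(b / 128.0 - 1.0 for b in block)
        counter += 1
    return values[:DIM]


def binarize(vector):
    """Set elements greater than 0 to 1 and elements less than or equal to 0 to 0."""
    return [1 if v > 0 else 0 for v in vector]


def sequence_hash(tokens):
    """Sum the per-item vectors, then binarize the sum into a DIM-bit hash code."""
    total = [0.0] * DIM
    for token in tokens:
        for i, v in enumerate(toy_embedding(token)):
            total[i] += v
    return binarize(total)


code = sequence_hash(["lea", "mov", "mov", "cmp", "jz"])
```

Because the item vectors are summed, sequences that share most of their instructions produce sums with mostly the same signs, so their hash codes differ in only a few bits.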
Step S103, determining the locality sensitive hash feature corresponding to the sample program code according to the hash codes corresponding to the original features.
Specifically, this embodiment may add together the hash codes corresponding to the numerical feature, the character feature, the serialized feature and the graph feature respectively, and take the result of the addition as the locality sensitive hash feature corresponding to the sample program code (that is, to its original features).
In an optional embodiment of the present invention, determining the locality sensitive hash feature corresponding to the sample program code according to the hash codes respectively corresponding to the numerical feature, the character feature, the serialized feature and the graph feature includes: determining the weight values corresponding to the numerical feature, the character feature, the serialized feature and the graph feature respectively; and performing a weighted calculation on the hash codes respectively corresponding to the numerical feature, the character feature, the serialized feature and the graph feature to obtain the locality sensitive hash feature corresponding to the sample program code.
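One possible reading of this weighted calculation is a SimHash-style vote, sketched below. The application only says that the hash codes are added or weighted, so the per-bit voting rule (add the weight for a 1 bit, subtract it for a 0 bit, then keep the sign) and the function name `fuse_hash_codes` are assumptions.

```python
def fuse_hash_codes(codes, weights, bits=128):
    """Fuse per-feature hash codes (integers) into one code by weighted per-bit voting."""
    totals = [0.0] * bits
    for code, weight in zip(codes, weights):
        for i in range(bits):
            bit = (code >> i) & 1
            totals[i] += weight if bit else -weight
    fused = 0
    for i, total in enumerate(totals):
        if total > 0:  # same rule as the serialized-feature binarization: > 0 gives 1
            fused |= 1 << i
    return fused


# Two 4-bit codes: the first carries twice the weight of the second.
weighted = fuse_hash_codes([0b1100, 0b1010], [2.0, 1.0], bits=4)
unweighted = fuse_hash_codes([0b1100, 0b1010], [1.0, 1.0], bits=4)
```

With weights 2.0 and 1.0 the fused code follows the heavier code wherever the two disagree; with equal weights the disagreeing bits tie at 0 and are cleared.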
In the embodiment of the present invention, the weight values corresponding to the numerical feature, the character feature, the serialized feature and the graph feature may be determined as follows: the numerical feature is normalized to obtain its weight value; the weight value of the character feature is determined by the term frequency-inverse document frequency (TF-IDF) algorithm; and the weight values of the serialized feature and the graph feature are preset. Specifically, for character features or Boolean features, the frequency TF with which a feature value appears and its inverse document frequency IDF over the whole sample set are counted, and the weight value is calibrated by the TF-IDF method; for the Embedding hash code obtained from the serialized feature, a pre-calibrated empirical weight is used as the weight of that hash code.
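The TF-IDF weighting of character feature values can be sketched as below. The application does not state which TF-IDF variant is used; the smoothed IDF formula and the function name `tf_idf_weights` are assumptions.

```python
import math
from collections import Counter


def tf_idf_weights(sample_values, all_samples):
    """Return a TF-IDF weight for each feature value of one sample.

    sample_values: list of feature values (e.g. domain names) seen in one sample.
    all_samples:   list of such lists, one per sample in the whole sample set.
    """
    if not sample_values:
        return {}
    counts = Counter(sample_values)
    n_samples = len(all_samples)
    weights = {}
    for value, count in counts.items():
        tf = count / len(sample_values)
        df = sum(1 for sample in all_samples if value in sample)
        idf = math.log((1 + n_samples) / (1 + df)) + 1.0  # smoothed IDF (assumed variant)
        weights[value] = tf * idf
    return weights


corpus = [["a.com", "b.com"], ["a.com"], ["a.com", "c.com"]]
weights = tf_idf_weights(corpus[0], corpus)
```

A value that appears in every sample, such as `a.com` here, receives a lower weight than a value that appears in only one sample, so rare values dominate the fused hash.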
In addition, the determination method of the weight values corresponding to the numerical characteristic, the character characteristic, the serialization characteristic and the graph characteristic may also be: combining the numerical type features, the character type features, the serialization features and the graph features to obtain summary features; inputting the summarized features into a weight recognition model to obtain weight values corresponding to the features, wherein the weight recognition model is obtained by training the summarized feature samples and the weight values corresponding to the features in the summarized feature samples, and the weight values of the features are determined according to a TF-IDF algorithm.
Specifically, hash codes corresponding to the numerical characteristic, the character characteristic, the serialization characteristic, and the graph characteristic are summarized to obtain a summarized characteristic list, and then the summarized characteristic list is input to a weight identification model to obtain weight values corresponding to the characteristics.
Step S104, performing clustering calculation on the locality sensitive hash features and the virus labels corresponding to the sample program codes according to a clustering algorithm to obtain a plurality of clusters.
In this embodiment, the Hamming distance may be used as the feature distance metric, and the locality sensitive hash features corresponding to the sample program codes may be used as input to the clustering calculation, which yields a plurality of clusters. Then, according to the distribution of the virus labels corresponding to the locality sensitive hash features within a cluster, the virus label with the largest share is selected as the virus label of that cluster. The virus label may be the name of the malware family to which the sample program code belongs, together with flag information such as whether the sample program code is a self-extracting archive, whether it is packed, whether it is an APT tool, and the like.
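A minimal sketch of this step follows. The Hamming distance and the majority-label rule come from the text above; the clustering algorithm itself is not named in the application, so the greedy "leader" clustering with a fixed radius shown here is only an assumed placeholder.

```python
from collections import Counter


def hamming(a, b):
    """Number of differing bits between two hash codes stored as integers."""
    return bin(a ^ b).count("1")


def leader_cluster(codes, radius):
    """Greedy clustering (assumed algorithm): join the first cluster whose leader is within radius."""
    clusters = []  # list of (leader_code, [member indices])
    for index, code in enumerate(codes):
        for leader, members in clusters:
            if hamming(code, leader) <= radius:
                members.append(index)
                break
        else:
            clusters.append((code, [index]))
    return clusters


def cluster_label(labels):
    """Pick the virus label with the largest share inside one cluster."""
    return Counter(labels).most_common(1)[0][0]


codes = [0b0000, 0b0001, 0b1111, 0b1110]
labels = ["trojan", "trojan", "worm", "worm"]
clusters = leader_cluster(codes, radius=1)
cluster_labels = [cluster_label([labels[i] for i in members]) for _, members in clusters]
```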
Step S105, identifying whether the sample to be detected belongs to a network virus according to the plurality of clusters.
Identifying whether the sample to be detected belongs to a network virus according to the plurality of clusters comprises the following steps: acquiring the original features of the sample to be detected; determining the locality sensitive hash feature corresponding to those original features; and determining, through the clustering algorithm, whether the locality sensitive hash feature of the sample to be detected belongs to a cluster. If it belongs to a cluster, the network virus corresponding to the sample to be detected is determined according to the cluster label of that cluster, where the cluster label represents the virus type of the cluster. If it belongs to no cluster, the sample to be detected is determined not to be a network virus.
For example, a K-nearest-neighbor search is performed for the sample to be detected. If the maximum effective distance threshold is set to 6 and K is 20, then every virus-labeled sample program code whose locality sensitive hash feature lies at a distance of less than 6 is an effective nearest neighbor. Suppose a total of 100 effective nearest neighbors are found, at distances ranging from 0 to 5. The 100 neighbors are sorted by distance in ascending order and the nearest 20 are selected. The virus labels of these 20 samples (sample family, whether the sample is a self-extracting archive, whether it is packed) are then put to a vote. If 19 of the 20 samples are labeled Trojan and 1 is labeled worm, and all of them are labeled as neither packed nor self-extracting, the sample to be detected is judged to be a Trojan that is neither a self-extracting archive nor a packed file.
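The K-nearest-neighbor vote in this example can be sketched as below, using the threshold of 6 and K of 20 from the text. The function name `knn_vote`, the integer encoding of hash codes, and voting on a single label per sample (rather than on family and flags separately) are simplifying assumptions.

```python
from collections import Counter


def knn_vote(query, labeled, k=20, max_distance=6):
    """Vote among the k nearest labeled codes whose Hamming distance is below max_distance.

    labeled: list of (code, label) pairs. Returns None when no effective neighbor exists.
    """
    neighbors = []
    for code, label in labeled:
        distance = bin(query ^ code).count("1")
        if distance < max_distance:
            neighbors.append((distance, label))
    neighbors.sort(key=lambda item: item[0])  # ascending distance
    nearest = neighbors[:k]
    if not nearest:
        return None
    return Counter(label for _, label in nearest).most_common(1)[0][0]


library = [(0b0, "trojan")] * 19 + [(0b1, "worm")] + [(0xFFFF, "worm")] * 30
verdict = knn_vote(0b0, library)
```

Returning `None` corresponds to the case in which the sample has no effective neighbor and is therefore not attributed to any network virus.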
The invention provides a network virus identification method in which original features and virus labels corresponding to virus samples of multiple types are determined, where the original features comprise numerical features, character features, serialized features and/or graph features; the hash code corresponding to each of the original features is calculated; the locality sensitive hash feature corresponding to each sample is determined according to the hash codes corresponding to its features; clustering calculation is performed on the locality sensitive hash features and the virus labels corresponding to the samples according to a clustering algorithm to obtain a plurality of clusters; and finally, whether a sample to be detected belongs to a network virus is identified according to the plurality of clusters. By using locality sensitive hash feature fusion, the invention achieves feature dimension reduction and a uniform formatted representation while largely retaining the sample-specific information in the multi-source features. Clustering calculation is then performed on the locality sensitive hash features to determine whether the sample to be detected belongs to a network virus, which improves the efficiency and accuracy of network virus identification.
Referring to Fig. 2, another network virus identification method according to an embodiment of the present invention includes steps S201 to S205:
Step S201, determining the locality sensitive hash feature of the sample to be detected.
It should be noted that, in this embodiment, a determination manner of the locally sensitive hash feature in step S201 is the same as the description content of the corresponding step in fig. 1, and this embodiment is not described again here.
Step S202, according to the local sensitive hash characteristics of the sample to be detected, network viruses corresponding to the sample to be detected determined through a clustering algorithm and network viruses corresponding to the sample to be detected determined through a virus library are obtained.
It should be noted that, in this embodiment, a specific implementation manner of obtaining the network virus corresponding to the sample to be detected, which is determined by the clustering algorithm, according to the locality sensitive hash feature of the sample to be detected is the same as the description content of the corresponding step in fig. 1, and this embodiment is not described herein again.
In this embodiment, obtaining the network virus corresponding to the sample to be detected, which is determined by the virus library, according to the locality sensitive hash feature of the sample to be detected includes: calculating the similarity between the local sensitive Hash characteristics of the sample to be detected and various types of virus characteristics in a virus library; and determining the virus type corresponding to the locality sensitive hash characteristics with similarity exceeding a preset value in the virus library as the network virus corresponding to the sample to be detected.
The virus library stores virus types respectively corresponding to the locality sensitive hash characteristics of various types of viruses. In this embodiment, after the locality sensitive hash feature of the sample to be detected is obtained, the similarity between the locality sensitive hash feature of the sample to be detected and the locality sensitive hash feature in the virus library is calculated, and finally, the virus type corresponding to the virus feature whose similarity exceeds a preset value in the virus library is determined as the network virus corresponding to the sample to be detected. The preset value may be set according to an actual requirement, for example, the preset value may be 80%, 85%, or 90%, and this embodiment is not particularly limited.
For example, the virus library holds the locality sensitive hash features of 5 virus types: virus type 1, virus type 2, virus type 3, virus type 4 and virus type 5. Suppose the similarity between the locality sensitive hash feature of the sample to be detected and that of virus type 1 is 65%, that of virus type 2 is 60%, that of virus type 3 is 90%, that of virus type 4 is 54%, and that of virus type 5 is 89%. If the preset value is 85%, virus type 3 and virus type 5 are determined to be the network viruses corresponding to the sample to be detected.
Step S203, calculating probability values for the network viruses determined by the clustering algorithm and by the virus library, combining the values of those results that refer to the same network virus.
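The virus library lookup can be sketched as follows. The application does not define the similarity measure, so taking similarity as one minus the normalized Hamming distance is an assumption, as are the function names and the 100-bit codes used to reproduce three of the percentages above.

```python
def similarity(a, b, bits):
    """Assumed similarity: share of matching bits between two hash codes."""
    return 1.0 - bin(a ^ b).count("1") / bits


def match_library(query, library, threshold, bits):
    """Return {virus_type: similarity} for library entries whose similarity exceeds threshold."""
    hits = {}
    for virus_type, code in library.items():
        score = similarity(query, code, bits)
        if score > threshold:
            hits[virus_type] = score
    return hits


# Toy 100-bit codes differing from the all-zero query in 35, 10 and 11 bits respectively.
library = {
    "virus type 1": (1 << 35) - 1,  # similarity 65%
    "virus type 3": (1 << 10) - 1,  # similarity 90%
    "virus type 5": (1 << 11) - 1,  # similarity 89%
}
hits = match_library(0, library, threshold=0.85, bits=100)
```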
Step S204, determining the network virus with the highest probability value as the network virus corresponding to the sample to be detected.
For example, in step S202 the network viruses determined from the virus library for the sample to be detected are virus type 3 and virus type 5, where the similarity (that is, the probability value) of virus type 3 is 90% and that of virus type 5 is 89%. The network viruses determined by the clustering algorithm are virus type 3 and virus type 2, where the probability value of belonging to the cluster of virus type 3 is 90% and that of virus type 2 is 20%. The probability values of the virus type reported by both sources (virus type 3) are averaged, (90% + 90%) / 2 = 90%, which gives virus type 3 a probability value of 90%, virus type 5 a probability value of 89%, and virus type 2 a probability value of 20%. The network virus with the highest probability value is virus type 3, so the sample to be detected is determined to be virus type 3.
The invention provides a network virus identification method in which, according to the locality sensitive hash feature of the sample to be detected, the network viruses determined by a clustering algorithm and the network viruses determined by a virus library are both obtained, and the network virus corresponding to the sample to be detected is then determined by combining the two results, which further improves the accuracy of network virus identification.
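Steps S203 and S204 can be sketched with the numbers from this example. Averaging the values of a virus type reported by both sources follows the text; keeping the single value unchanged when only one source reports a type is inferred from the example, and the function name `fuse_results` is hypothetical.

```python
def fuse_results(cluster_probs, library_probs):
    """Average probabilities of types reported by both sources, then pick the highest."""
    fused = {}
    for virus_type in set(cluster_probs) | set(library_probs):
        scores = [
            source[virus_type]
            for source in (cluster_probs, library_probs)
            if virus_type in source
        ]
        fused[virus_type] = sum(scores) / len(scores)
    if not fused:
        return None, fused
    best = max(fused, key=fused.get)
    return best, fused


best, fused = fuse_results(
    {"virus type 3": 0.90, "virus type 2": 0.20},  # from the clustering algorithm
    {"virus type 3": 0.90, "virus type 5": 0.89},  # from the virus library
)
```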
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, an apparatus for identifying a network virus is provided, where the apparatus for identifying a network virus corresponds to the method for identifying a network virus in the foregoing embodiment one to one. As shown in fig. 3, the functional modules of the network virus identification apparatus are described in detail as follows:
a determining module 31, configured to determine original features and virus tags corresponding to multiple types of virus samples;
a calculating module 32, configured to calculate a hash code corresponding to the original feature;
the determining module 31 is further configured to determine, according to the hash code corresponding to the original feature, a locality sensitive hash feature corresponding to the sample;
the calculating module 32 is further configured to perform clustering calculation on the locality sensitive hash features and the virus labels corresponding to the samples according to a clustering algorithm to obtain a plurality of clusters;
and the identifying module 33 is configured to identify whether a sample to be detected belongs to a network virus according to the plurality of clusters.
In an alternative embodiment, the original features include: the numerical feature, the character feature, the serialized feature and/or the graph feature; the calculating module 32 is specifically configured to perform the following:
performing hash calculation on the numerical characteristic and the character type characteristic to obtain hash codes corresponding to the numerical characteristic and the character type characteristic respectively;
converting each of the serialized features into a fixed-length feature vector, the length of the feature vector being the same as the length of the hash code;
adding the feature vectors of each feature in the serialized features to obtain a target feature vector; and determining the hash code corresponding to the serialized features according to the target feature vector.
In an alternative embodiment, the calculating module 32 is further specifically configured to:
obtaining the value of each element of the target feature vector;
resetting each element of the target feature vector whose value is greater than 0 to 1, and resetting each element whose value is less than or equal to 0 to 0, to obtain the hash code corresponding to the serialized features.
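The serialized-feature steps above resemble a SimHash-style construction. The following is a minimal sketch under stated assumptions: each serialized item (for example an instruction or API-call name) is mapped to a vector of +1/-1 values taken from an MD5 digest, the hash-code length is 64, and non-positive sums are reset to 0. The digest-based mapping, the length, and the function names `item_to_vector` and `serialized_hash_code` are illustrative and do not come from the text.

```python
import hashlib

HASH_BITS = 64  # assumed hash-code length; the text only requires equal lengths


def item_to_vector(item: str, bits: int = HASH_BITS) -> list[int]:
    """Map one serialized item to a fixed-length vector of +1/-1 values."""
    digest = int.from_bytes(hashlib.md5(item.encode("utf-8")).digest(), "big")
    return [1 if (digest >> i) & 1 else -1 for i in range(bits)]


def serialized_hash_code(items: list[str], bits: int = HASH_BITS) -> list[int]:
    """Add the per-item vectors, then binarize the target feature vector."""
    target = [0] * bits
    for item in items:
        for i, value in enumerate(item_to_vector(item, bits)):
            target[i] += value
    # Elements greater than 0 become 1; all others become 0 (assumed).
    return [1 if value > 0 else 0 for value in target]


code = serialized_hash_code(["push", "mov", "call", "mov", "ret"])
print(len(code))
```

Because the per-item vectors are summed, the resulting code does not depend on the order of the items, and sequences that share most items tend to produce codes that differ in only a few positions.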
In an optional embodiment, the determining module 31 is specifically configured to:
determining the weight values corresponding to the numerical feature, the character-type feature, the serialized feature and the graph feature respectively;
and performing weighted calculation on the hash codes respectively corresponding to the numerical feature, the character-type feature, the serialized feature and the graph feature to obtain the locality-sensitive hash feature corresponding to the sample program code.
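The text does not spell out the form of the weighted calculation. One plausible reading, sketched below purely as an assumption, is a weighted bitwise vote: each feature's hash code adds its weight to a position where the bit is 1 and subtracts it where the bit is 0, and the sign of the total gives the fused bit. The function name `fuse_hash_codes`, the 4-bit codes, and the weights are hypothetical.

```python
def fuse_hash_codes(codes: dict[str, list[int]],
                    weights: dict[str, float]) -> list[int]:
    """Weighted bitwise vote over per-feature hash codes (assumed scheme)."""
    bits = len(next(iter(codes.values())))
    totals = [0.0] * bits
    for name, code in codes.items():
        weight = weights[name]
        for i, bit in enumerate(code):
            totals[i] += weight if bit else -weight
    # A positive total yields 1, otherwise 0.
    return [1 if total > 0 else 0 for total in totals]


codes = {
    "numerical": [1, 0, 1, 0],
    "character": [1, 1, 0, 0],
    "serialized": [0, 1, 1, 0],
    "graph": [0, 0, 1, 1],
}
weights = {"numerical": 0.5, "character": 0.25, "serialized": 0.125, "graph": 0.125}
lsh_feature = fuse_hash_codes(codes, weights)
print(lsh_feature)  # [1, 0, 1, 0]
```

With these illustrative weights the numerical feature dominates, so the fused code matches its hash code; more balanced weights would let the other features flip individual bits.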
In an optional embodiment, the determining module 31 is further specifically configured to:
normalizing the numerical feature to obtain its corresponding weight value;
determining the weight value corresponding to the character-type feature through the term frequency-inverse document frequency (TF-IDF) algorithm;
and presetting the weight values corresponding to the serialized features and the graph feature.
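The two computed weightings above can be sketched as follows. This is only an illustration: min-max scaling is assumed for the normalization, a smoothed IDF variant is assumed for TF-IDF, and the token lists, the function names `normalize` and `tf_idf`, and the preset values are hypothetical.

```python
import math
from collections import Counter


def normalize(values: list[float]) -> list[float]:
    """Min-max normalization of numerical features into [0, 1] (assumed form)."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0] * len(values)
    return [(value - low) / (high - low) for value in values]


def tf_idf(sample_tokens: list[str], corpus: list[list[str]]) -> dict[str, float]:
    """TF-IDF weight of each token in one sample, using a smoothed IDF."""
    counts = Counter(sample_tokens)
    total = len(sample_tokens)
    weights = {}
    for token, count in counts.items():
        doc_freq = sum(1 for doc in corpus if token in doc)
        idf = math.log((1 + len(corpus)) / (1 + doc_freq)) + 1
        weights[token] = (count / total) * idf
    return weights


corpus = [
    ["CreateFile", "WriteFile"],
    ["CreateFile", "RegSetValue"],
    ["CreateFile", "connect"],
]
numeric_weights = normalize([2.0, 4.0, 6.0])
char_weights = tf_idf(["CreateFile", "connect"], corpus)
preset_weights = {"serialized": 0.5, "graph": 0.5}  # hypothetical preset values
print(numeric_weights, char_weights)
```

A token that appears in every sample, such as `CreateFile` here, receives a lower weight than a token that appears in only one, which is the usual motivation for TF-IDF.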
In an alternative embodiment, the determining module 31 is specifically configured to:
aggregating the numerical feature, the character-type feature, the serialized feature and the graph feature to obtain an aggregated feature list;
and inputting the aggregated feature list into a weight recognition model to obtain the weight value corresponding to each feature.
In an alternative embodiment, the identifying module 33 is specifically configured to:
acquiring the original features of the sample to be detected;
determining the locality-sensitive hash feature corresponding to the original features of the sample to be detected;
determining, through the clustering algorithm, whether the locality-sensitive hash feature of the sample to be detected belongs to a cluster;
if the locality-sensitive hash feature of the sample to be detected belongs to a cluster, determining the network virus corresponding to the sample to be detected according to the cluster label of that cluster, wherein the cluster label represents the virus type of the corresponding cluster;
and if the locality-sensitive hash feature of the sample to be detected does not belong to any cluster, determining that the sample to be detected is not a network virus.
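The membership test above depends on the clustering algorithm, which the text leaves open. As a minimal sketch, assume each cluster is summarized by a representative hash code and a cluster label, and that a sample belongs to the nearest cluster only if its Hamming distance to the representative is within a radius. The representative codes, the radius `max_distance`, and the function names are assumptions for illustration.

```python
from typing import Optional


def hamming(a: list[int], b: list[int]) -> int:
    """Number of positions at which two equal-length hash codes differ."""
    return sum(x != y for x, y in zip(a, b))


def identify(sample_code: list[int],
             clusters: list[tuple[list[int], str]],
             max_distance: int) -> Optional[str]:
    """Return the label of the nearest cluster, or None if none is close enough."""
    best_label, best_distance = None, max_distance + 1
    for representative, label in clusters:
        distance = hamming(sample_code, representative)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label  # None means the sample is not treated as a network virus


clusters = [
    ([1, 1, 1, 1, 0, 0, 0, 0], "worm"),
    ([0, 0, 0, 0, 1, 1, 1, 1], "trojan"),
]
print(identify([1, 1, 1, 0, 0, 0, 0, 0], clusters, max_distance=2))  # worm
print(identify([1, 1, 0, 0, 1, 1, 0, 0], clusters, max_distance=2))  # None
```

The second sample is four bits away from both representatives, so it falls outside every cluster and is reported as not belonging to a network virus.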
In an optional embodiment, the calculating module 32 is further configured to calculate similarity between the locality sensitive hash feature of the sample to be detected and various types of virus features in a virus library; the virus library stores virus types respectively corresponding to the locality sensitive hash characteristics of various types of viruses;
the determining module 31 is further configured to determine the virus type corresponding to the locality sensitive hash feature whose similarity exceeds a preset value in the virus library as the network virus corresponding to the sample to be detected.
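The virus-library lookup can be sketched as a thresholded similarity search. The similarity measure is not specified in the text; the sketch below assumes one minus the normalized Hamming distance between hash codes. The library contents, the threshold value, and the function names are hypothetical.

```python
def similarity(a: list[int], b: list[int]) -> float:
    """Fraction of matching bits between two equal-length hash codes."""
    return 1.0 - sum(x != y for x, y in zip(a, b)) / len(a)


def match_library(sample_code: list[int],
                  library: list[tuple[list[int], str]],
                  threshold: float) -> list[tuple[str, float]]:
    """Return (virus type, similarity) pairs whose similarity exceeds the threshold."""
    matches = []
    for code, virus_type in library:
        score = similarity(sample_code, code)
        if score > threshold:
            matches.append((virus_type, score))
    # Most similar entries first.
    return sorted(matches, key=lambda match: match[1], reverse=True)


library = [
    ([1, 0, 1, 0, 1, 0, 1, 0], "ransomware"),
    ([1, 1, 1, 1, 0, 0, 0, 0], "worm"),
]
matches = match_library([1, 0, 1, 0, 1, 0, 1, 1], library, threshold=0.8)
print(matches)  # [('ransomware', 0.875)]
```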
In an optional embodiment, the calculating module 32 is further configured to obtain the network virus corresponding to the sample to be detected as determined by the clustering algorithm and the network virus corresponding to the sample to be detected as determined by the virus library, and to calculate a probability value for each network virus so determined, where results from the clustering algorithm and the virus library that refer to the same network virus are counted together;
the determining module 31 is further configured to determine the network virus with the highest probability value as the network virus corresponding to the sample to be detected.
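How the probability values are computed is not stated. One simple possibility, shown here only as an assumption, is to add up the scores that the clustering algorithm and the virus library assign to the same network virus, normalize the totals so they sum to 1, and select the largest. The score values and the function name `fuse_results` are hypothetical.

```python
from collections import defaultdict


def fuse_results(cluster_results: list[tuple[str, float]],
                 library_results: list[tuple[str, float]]):
    """Combine per-source scores into probabilities and pick the most likely virus."""
    scores = defaultdict(float)
    for virus, score in cluster_results + library_results:
        scores[virus] += score  # results naming the same virus reinforce each other
    total = sum(scores.values())
    if total == 0:
        return None, {}
    probabilities = {virus: score / total for virus, score in scores.items()}
    best = max(probabilities, key=probabilities.get)
    return best, probabilities


best, probabilities = fuse_results(
    cluster_results=[("worm", 0.5)],
    library_results=[("worm", 0.25), ("trojan", 0.25)],
)
print(best, probabilities)  # worm {'worm': 0.75, 'trojan': 0.25}
```

A virus named by both sources accumulates more score than one named by a single source, which matches the stated aim of integrating the two determinations.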
For specific limitations of the network virus identification apparatus, reference may be made to the limitations of the network virus identification method above, which are not repeated here. Each module of the apparatus may be implemented wholly or partly in software, hardware, or a combination of the two. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and perform the corresponding operations.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a network virus identification method.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
determining original characteristics and virus labels corresponding to various types of virus samples;
calculating a hash code corresponding to the original characteristic;
determining the locality sensitive hash characteristics corresponding to the samples according to the hash codes corresponding to the original characteristics;
clustering the locality-sensitive hash features and the virus labels corresponding to the sample programs according to a clustering algorithm to obtain a plurality of clusters;
and identifying, according to the plurality of clusters, whether the sample to be detected belongs to a network virus.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
determining original characteristics and virus labels corresponding to various types of virus samples;
calculating a hash code corresponding to the original characteristic;
determining the locality sensitive hash characteristics corresponding to the samples according to the hash codes corresponding to the original characteristics;
clustering the locality-sensitive hash features and the virus labels corresponding to the sample programs according to a clustering algorithm to obtain a plurality of clusters;
and identifying, according to the plurality of clusters, whether the sample to be detected belongs to a network virus.
In one embodiment, a computer program product is provided, the computer program product comprising a computer program executed by a processor to perform the steps of:
determining original characteristics and virus labels corresponding to various types of virus samples;
calculating a hash code corresponding to the original characteristic;
determining the locality sensitive hash characteristics corresponding to the samples according to the hash codes corresponding to the original characteristics;
clustering the locality-sensitive hash features and the virus labels corresponding to the sample programs according to a clustering algorithm to obtain a plurality of clusters;
and identifying, according to the plurality of clusters, whether the sample to be detected belongs to a network virus.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.
Claims (12)
1. A method for identifying a network virus, the method comprising:
determining original characteristics and virus labels corresponding to various types of virus samples;
calculating a hash code corresponding to the original characteristic;
determining the locality sensitive hash characteristics corresponding to the samples according to the hash codes corresponding to the original characteristics;
clustering the locality-sensitive hash features and the virus labels corresponding to the sample programs according to a clustering algorithm to obtain a plurality of clusters;
and identifying, according to the plurality of clusters, whether the sample to be detected belongs to a network virus.
2. The method of claim 1, wherein the original features comprise: a numerical feature, a character-type feature, a serialized feature, and/or a graph feature;
the calculating the hash code corresponding to the original feature includes:
performing hash calculation on the numerical feature and the character-type feature to obtain the hash codes corresponding to the numerical feature and the character-type feature respectively;
converting each feature in the serialized features into a fixed-length feature vector, the length of the feature vector being the same as the length of the hash code;
adding the feature vectors of all features in the serialized features to obtain a target feature vector; and determining the hash code corresponding to the serialized features according to the target feature vector.
3. The method of claim 2, wherein determining the hash code corresponding to the serialized features from the target feature vector comprises:
obtaining the value of each element of the target feature vector;
resetting each element of the target feature vector whose value is greater than 0 to 1, and resetting each element whose value is less than or equal to 0 to 0, to obtain the hash code corresponding to the serialized features.
4. The method of claim 2, wherein the determining the locality-sensitive hash feature corresponding to the sample program code according to the hash code corresponding to the original feature comprises:
determining the weight values corresponding to the numerical feature, the character-type feature, the serialized feature and the graph feature respectively;
and performing weighted calculation on the hash codes respectively corresponding to the numerical feature, the character-type feature, the serialized feature and the graph feature to obtain the locality-sensitive hash feature corresponding to the sample program code.
5. The method of claim 4, wherein the determining the weight values corresponding to the numerical feature, the character-type feature, the serialized feature, and the graph feature respectively comprises:
normalizing the numerical feature to obtain its corresponding weight value;
determining the weight value corresponding to the character-type feature through the term frequency-inverse document frequency (TF-IDF) algorithm;
and presetting the weight values corresponding to the serialized features and the graph feature.
6. The method of claim 4, wherein the determining the weight values corresponding to the numerical feature, the character-type feature, the serialized feature, and the graph feature respectively comprises:
aggregating the numerical feature, the character-type feature, the serialized feature and the graph feature to obtain an aggregated feature list;
and inputting the aggregated feature list into a weight recognition model to obtain the weight value corresponding to each feature.
7. The method according to any one of claims 1-6, wherein the identifying whether the sample to be detected belongs to the network virus according to the plurality of clusters comprises:
acquiring the original features of the sample to be detected;
determining the locality-sensitive hash feature corresponding to the original features of the sample to be detected;
determining, through the clustering algorithm, whether the locality-sensitive hash feature of the sample to be detected belongs to a cluster;
and if the locality-sensitive hash feature of the sample to be detected belongs to a cluster, determining the network virus corresponding to the sample to be detected according to the cluster label of that cluster, wherein the cluster label represents the virus type of the corresponding cluster.
8. The method of claim 7, further comprising:
calculating the similarity between the locality-sensitive hash feature of the sample to be detected and the virus features of each type in a virus library; the virus library stores the virus types respectively corresponding to the locality-sensitive hash features of various types of viruses;
and determining the virus type corresponding to a locality-sensitive hash feature in the virus library whose similarity exceeds a preset value as the network virus corresponding to the sample to be detected.
9. The method of claim 8, further comprising:
acquiring the network virus corresponding to the sample to be detected as determined by the clustering algorithm and the network virus corresponding to the sample to be detected as determined by the virus library;
calculating a probability value for each network virus so determined, where results from the clustering algorithm and the virus library that refer to the same network virus are counted together;
and determining the network virus with the highest probability value as the network virus corresponding to the sample to be detected.
10. An apparatus for identifying a network virus, the apparatus comprising:
a determining module, configured to determine original features and virus labels corresponding to multiple types of virus samples;
a calculating module, configured to calculate the hash codes corresponding to the original features;
the determining module is further configured to determine, according to the hash codes corresponding to the original features, the locality-sensitive hash features corresponding to the samples;
the calculating module is further configured to cluster the locality-sensitive hash features and the virus labels corresponding to the sample programs according to a clustering algorithm to obtain a plurality of clusters;
and an identifying module, configured to identify, according to the plurality of clusters, whether the sample to be detected belongs to a network virus.
11. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the network virus identification method according to any one of claims 1 to 9 when executing the computer program.
12. A computer-readable storage medium storing a computer program, wherein the computer program is executed by a processor to implement the network virus identification method according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111522830.7A CN114266045A (en) | 2021-12-13 | 2021-12-13 | Network virus identification method and device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111522830.7A CN114266045A (en) | 2021-12-13 | 2021-12-13 | Network virus identification method and device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114266045A (en) | 2022-04-01 |
Family
ID=80826939
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111522830.7A Pending CN114266045A (en) | 2021-12-13 | 2021-12-13 | Network virus identification method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114266045A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114969732A (en) * | 2022-04-28 | 2022-08-30 | 国科华盾(北京)科技有限公司 | Malicious code detection method and device, computer equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108334781A (en) * | 2018-03-07 | 2018-07-27 | 腾讯科技(深圳)有限公司 | Method for detecting virus, device, computer readable storage medium and computer equipment |
CN111901282A (en) * | 2019-05-05 | 2020-11-06 | 四川大学 | Method for generating malicious code flow behavior detection structure |
CN112084500A (en) * | 2020-09-15 | 2020-12-15 | 腾讯科技(深圳)有限公司 | Method and device for clustering virus samples, electronic equipment and storage medium |
CN112347477A (en) * | 2019-08-07 | 2021-02-09 | 腾讯云计算(北京)有限责任公司 | Family variant malicious file mining method and device |
CN112615861A (en) * | 2020-12-17 | 2021-04-06 | 赛尔网络有限公司 | Malicious domain name identification method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||