CN113836534B - Virus family identification method, system, equipment and computer storage medium - Google Patents


Info

Publication number
CN113836534B
Authority
CN
China
Prior art keywords
family
target
virus
information
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111146137.4A
Other languages
Chinese (zh)
Other versions
CN113836534A (en
Inventor
祝洪宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd filed Critical Sangfor Technologies Co Ltd
Priority to CN202111146137.4A priority Critical patent/CN113836534B/en
Publication of CN113836534A publication Critical patent/CN113836534A/en
Application granted granted Critical
Publication of CN113836534B publication Critical patent/CN113836534B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a virus family identification method, system, device, and computer-readable storage medium. The method acquires the family identification information that each of a plurality of virus family identification devices produces for each of a plurality of target viruses; converts the family identification information into vector information; clusters the target viruses based on the vector information to obtain target virus family clusters; and determines a target family label for each target virus family cluster. Converting the family identification information into vector information allows it to be processed quickly. Because clustering groups similar objects together, the viruses in one target virus family cluster are similar to one another, and viruses that share common traits are grouped together whether or not a target virus is new. Once the target family label of each cluster is determined, similar target viruses therefore carry the same family label, which improves identification accuracy.

Description

Virus family identification method, system, equipment and computer storage medium
Technical Field
The present application relates to the field of network security technologies, and in particular, to a virus family identification method, system, device, and computer storage medium.
Background
As networks have developed and spread, users spend more and more time online. While users are online, attackers seeking profit or pursuing other aims may launch network attacks against them, for example by sending virus files or virus videos to user devices. Such attacks cause losses to users, so network security is increasingly important. To protect users, viruses on the network can be detected, identified, and removed. Because there are many kinds of viruses, they can be classified into virus families, each of which describes the characteristics of one type of virus, so that viruses can be handled quickly.
When a virus family needs to be identified, each antivirus engine can perform family identification on the target virus to obtain family identification information, and the family label of the target virus is then determined from that information by weighted voting or a similar scheme, which allows virus families to be identified quickly. However, the accuracy of this approach drops when it encounters new viruses or similar viruses.
In view of the above, how to accurately identify virus families is a problem to be solved by those skilled in the art.
Disclosure of Invention
The purpose of the application is to provide a virus family identification method, which can solve the technical problem of how to accurately identify the virus family to a certain extent. The application also provides a virus family identification system, a device and a computer readable storage medium.
To achieve the above object, in a first aspect, the present application provides a virus family identification method, including:
acquiring family identification information of a plurality of virus family identification devices for each target virus in a plurality of target viruses;
converting the family identification information into vector information;
clustering a plurality of target viruses based on the vector information to obtain a target virus family cluster;
determining a target family label for each of the target virus family clusters.
In the application, after the family identification information is acquired, it is converted into vector information so that it can be processed quickly. The target viruses are then clustered based on the vector information. Because clustering groups similar objects together, the viruses in one target virus family cluster are similar to one another, and viruses that share common traits are grouped together whether or not a target virus is new. Once the target family label of each target virus family cluster is determined, similar target viruses therefore carry the same family label, which improves identification accuracy.
Preferably, the converting the family identification information into vector information includes:
processing the family identification information to obtain an initial family name;
the initial family name is converted into the vector information.
Preferably, the processing the family identification information to obtain an initial family name includes:
dividing the family identification information to obtain family identification sub-information;
removing preset information in the family identification sub-information to obtain removed family sub-information;
performing character string selection on the family sub-information after the elimination to obtain the initial family name;
the initial family name is converted into the vector information.
Preferably, the clustering the plurality of target viruses based on the vector information to obtain a target virus family cluster includes:
clustering the vector information to obtain a vector information family cluster;
determining a cluster number of the vector information family cluster;
and clustering a plurality of target viruses based on the cluster numbers to obtain the target virus family cluster.
Preferably, the clustering the plurality of target viruses based on the cluster number to obtain the target virus family cluster includes:
for each target virus, combining the cluster numbers corresponding to the family identification information of the target virus to obtain a normalized sequence of the target virus;
and clustering a plurality of target viruses based on the normalized sequence to obtain the target virus family cluster.
In this method, the cluster numbers corresponding to the family identification information of a target virus are combined into the normalized sequence of that target virus. Because each cluster number stands for a distinct group of vector information, the normalized sequence reflects the different pieces of virus family information reported for the target virus.
Preferably, the clustering the plurality of target viruses based on the normalized sequence to obtain the target virus family cluster includes:
the same normalized sequences are gathered together to obtain a sequence clustering result;
determining a sequence vector of the sequence clustering result based on the vector information;
clustering the sequence clustering result based on the sequence vector to obtain a target clustering result;
and gathering the target viruses corresponding to the target clustering result together to obtain the corresponding target virus family cluster.
Preferably, the determining, based on the vector information, a sequence vector of the sequence clustering result includes:
determining a sample vector of the target virus based on the vector information corresponding to the target virus;
based on the sample vector, the sequence vector of the sequence clustering result is determined.
Preferably, the determining the sequence vector of the sequence clustering result based on the sample vector includes:
and taking the center of the sample vector corresponding to the sequence clustering result as the sequence vector.
Preferably, said determining a target family label for each of said target virus family clusters comprises:
determining the family name with the largest occurrence number in the target virus family cluster as an initial family label;
if several target virus family clusters have the same initial family label, adding distinguishing information to those initial family labels to obtain the target family labels;
if no other target virus family cluster has the same initial family label, taking the initial family label as the target family label.
In a second aspect, the present application provides a virus family recognition system comprising:
a family identification information acquisition module, configured to acquire family identification information of a plurality of virus family identification devices on each of a plurality of target viruses;
the vector conversion module is used for converting the family identification information into vector information;
the clustering module is used for clustering a plurality of target viruses based on the vector information to obtain a target virus family cluster;
and the family label determining module is used for determining the target family label of each target virus family cluster.
In a third aspect, the present application provides an electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the virus family identification method as described in any one of the above when executing the computer program.
In a fourth aspect, the present application provides a computer readable storage medium having stored therein a computer program which when executed by a processor performs the steps of a method for virus family identification as described in any of the above.
Drawings
To illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a first flowchart of a method for identifying a virus family according to an embodiment of the present application;
FIG. 2 is a second flowchart of a method for identifying a virus family according to an embodiment of the present application;
FIG. 3 is a third flowchart of a method for identifying a virus family according to an embodiment of the present disclosure;
FIG. 4 is a fourth flowchart of a method for identifying a virus family according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a virus family identification system according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a hardware composition structure of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present application. All other embodiments that a person of ordinary skill in the art obtains from the present disclosure without creative effort fall within its scope of protection.
Referring to fig. 1, fig. 1 is a first flowchart of a virus family identification method according to an embodiment of the present application.
The virus family identification method provided by the embodiment of the application can comprise the following steps:
Step S101: Family identification information of a plurality of virus family identification devices for each of a plurality of target viruses is acquired.
In practical applications, the prior art already provides virus family identification devices, such as antivirus engines, that can identify the family information of a target virus, and their identification results have a certain accuracy. The family identification information that each of the plurality of virus family identification devices produces for each target virus can therefore be acquired first, so that the family of the target virus can later be determined from it, which improves the operating efficiency of the method to some extent.
Note that the number of virus family identification devices and the number of target viruses can be chosen as needed. For example, with 10 virus family identification devices and 100 target viruses, each device outputs one piece of family identification information per target virus, so each target virus has 10 pieces of family identification information and there are 1,000 pieces in total.
Step S102: the family identification information is converted into vector information.
In practical applications, there may be many pieces of family identification information, and a single piece may be large, so processing the family identification information directly would reduce the operating efficiency of the method. To avoid this, the family identification information can be converted into corresponding vector information. Because different family identification information converts into different vector information, the vector information can stand in for the family identification information in subsequent processing, and its simpler structure improves the operating efficiency of the method to some extent. Most importantly, vector information quantifies the distance between pieces of family identification information, so virus family identification can be performed on the target viruses more accurately.
Note that, in the present application, vector information refers to the information obtained by vectorizing the family identification information. For example, when the family identification information is fuse, the vector information obtained by conversion may be [0,1,0], and when the family identification information is fuse, the vector information obtained by conversion may be [0,1,0,1]. The conversion method can be chosen as needed; for example, the FastText technique from NLP (Natural Language Processing) can be used. The present application places no specific limitation on this.
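The vectorization step can be illustrated with a minimal sketch. It does not use FastText itself; as a stand-in it hashes character n-grams into a fixed-size vector, which loosely mirrors the subword idea behind FastText. The function names `ngram_vector` and `cosine`, the vector size, and the n-gram length are all assumptions made for illustration and do not come from the application.

```python
import hashlib
import math


def ngram_vector(name, dim=16, n=3):
    """Embed a family name as an L2-normalised bag of hashed character n-grams.

    Illustrative stand-in only: the application mentions FastText, which
    learns its vectors, whereas this sketch merely counts hashed n-grams.
    """
    padded = "<" + name.lower() + ">"  # boundary markers around the name
    grams = [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]
    vec = [0.0] * dim
    for gram in grams:
        # md5 is used because it is stable across runs, unlike built-in hash().
        bucket = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two vectors that are already L2-normalised."""
    return sum(x * y for x, y in zip(a, b))
```

Names that share many character n-grams tend to land close together under this encoding, although hash collisions in a small vector can blur the picture; a learned embedding would normally be preferred.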
Step S103: and clustering a plurality of target viruses based on the vector information to obtain a target virus family cluster.
In practical applications, considering that the division of the families of viruses is performed according to commonalities among viruses, and the clustering can be performed to group similar objects together, the target viruses can be clustered based on vector information to obtain target virus family clusters, and at this time, the target viruses clustered into one target virus family cluster are viruses belonging to the same virus family.
It should be noted that, the clustering method applied in the present application may be determined according to actual needs, for example, the target viruses may be clustered based on vector information by means of DBSCAN (Density-Based Spatial Clustering of Applications with Noise), claans (A Clustering Algorithmbased on Randomized Search, clustering algorithm based on random selection), DENCLUE (Density Clustering), and the like, which is not specifically limited herein. In addition, when a certain target virus is a new virus, if the new virus is similar to other existing target viruses, the new virus can be clustered together with similar existing target viruses, namely, in the same target virus family cluster, if the new virus is dissimilar to other existing target viruses, the new virus can be clustered into a single target virus family cluster, namely, whether the target virus is a new virus or a known virus, the target virus family cluster corresponding to the target virus can be obtained.
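As one illustration of the clustering step, the following is a compact, dependency-free sketch of DBSCAN, one of the algorithms named above. The function name `dbscan`, the `NOISE` constant, and the parameter values in the usage note are assumptions for illustration; a production system would more likely rely on an established library.

```python
import math

NOISE = -1  # label for points that belong to no dense region


def dbscan(points, eps, min_pts):
    """Return one cluster id per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be re-labelled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # core point: keep expanding the cluster
                queue.extend(nbrs)
        cluster += 1
    return labels
```

For instance, `dbscan([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)], eps=1.5, min_pts=2)` separates the two dense groups and marks the isolated point as noise. To match the behaviour described above, where a dissimilar new virus forms a cluster of its own, each noise point could be treated as a singleton cluster.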
Step S104: a target family tag is determined for each target virus family cluster.
In practical application, after the target virus family cluster is obtained, the target family label of each target virus family cluster can be determined, so that the target viruses in the target virus family cluster inherit the target family label, that is, the target family label is the virus family identification result of the target viruses in the target virus family cluster, and the type and the content of the target family label can be determined according to practical needs, which is not particularly limited herein.
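One labelling option stated in the disclosure is to take the most frequent family name in each cluster as its initial label and to add distinguishing information when several clusters end up with the same initial label. The sketch below illustrates that option. The function name `label_clusters` and the numeric-suffix scheme are assumptions; the application does not say what form the distinguishing information takes.

```python
from collections import Counter


def label_clusters(clusters):
    """Assign one label per cluster.

    clusters: list of non-empty lists, each holding the initial family
    names observed for the viruses in one target virus family cluster.
    """
    # Initial label: the most frequent family name in each cluster.
    initial = [Counter(names).most_common(1)[0][0] for names in clusters]
    totals = Counter(initial)
    seen = Counter()
    labels = []
    for name in initial:
        if totals[name] > 1:
            # Several clusters share this label: append a running number.
            # The "_<n>" suffix is a hypothetical choice of distinguishing info.
            seen[name] += 1
            labels.append(f"{name}_{seen[name]}")
        else:
            labels.append(name)
    return labels
```

With this scheme, three clusters whose majority names are emotet, zbot, and emotet would be labelled `emotet_1`, `zbot`, and `emotet_2`.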
According to the virus family identification method, family identification information of a plurality of virus family identification devices on each target virus is obtained; converting the family identification information into vector information; clustering the target viruses based on the vector information to obtain a target virus family cluster; a target family tag is determined for each target virus family cluster. In the application, after acquiring a plurality of family identification information, the family identification information needs to be converted into vector information so as to quickly process the family identification information based on the vector information; then, the target viruses are required to be clustered based on vector information, because similar objects are clustered together, viruses in the target virus family cluster are similar viruses, and whether the target viruses are new viruses or not, viruses with commonality can be clustered together in the method; therefore, after the target family label of each target virus family cluster is finally determined, similar target viruses can be marked with the same family label, and the identification is good and the accuracy is high.
Referring to fig. 2, fig. 2 is a second flowchart of a virus family identification method according to an embodiment of the present application.
The virus family identification method provided by the embodiment of the application can comprise the following steps:
Step S201: Family identification information of a plurality of virus family identification devices for each of a plurality of target viruses is acquired.
Step S202: The family identification information is divided to obtain family identification sub-information.
In practical applications, the family identification information carries an initial family name that reflects a virus family, and this initial family name is useful for virus family identification. When converting the family identification information into vector information, the family identification information can therefore be processed first to obtain the initial family name, and the initial family name is then converted into vector information.
In a specific application scenario, the family identification information may contain information unrelated to the virus family, or a known generic family name. Carrying such content into subsequent processing would reduce the accuracy of the final identification result. To avoid this, the family identification information can first be divided into family identification sub-information, which exposes the unrelated information and generic family names. For example, the family identification information can be divided at the separators it contains, such as "/" and ":". If one piece of family identification information is abcd/efg/hi, dividing it at the separator "/" yields three pieces of family identification sub-information: abcd, efg, and hi.
Step S203: and eliminating preset information in the family identification sub-information to obtain family sub-information after elimination.
In practical application, after the family identification information is segmented to obtain the family identification sub-information, preset information in the family identification sub-information, namely, information irrelevant to the virus family or known common family names and the like, can be removed to obtain the removed family sub-information only relevant to the virus family, and the removed family sub-information reflects the specific family information which is exclusively belonging to the target virus, in other words, the family information which is exclusively belonging to the target virus can be exposed through the removing operation of the application. Taking abcd/efg/hi as an example of the family identification information, if abcd belongs to a common family name, abcd needs to be removed to obtain two family information after removal of efg and hi.
Step S204: and selecting the character strings of the family sub-information after the elimination to obtain the initial family name.
In practical application, after the preset information in the family identification sub-information is removed, one or more pieces of removed family sub-information may be obtained, and at this time, character string selection may be performed on the removed family sub-information to obtain an initial family name representing the virus family to which the target virus belongs.
It should be noted that, the selection rule for selecting the character string of the family sub-information after the elimination may be determined according to actual needs, for example, the first family sub-information obtained after the elimination of the useless information may be selected as an initial family name, or the longest family identifier sub-information obtained after the elimination of the useless information may be selected as an initial family name, which is not specifically limited herein.
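Steps S202 to S204 can be sketched as a single helper that divides the family identification information, drops the preset tokens, and selects one remaining string. The function name `initial_family_name`, the `preset` and `rule` parameters, and the choice of "/" and ":" as the only separators are assumptions for illustration; a real deployment would maintain its own separator set and preset list.

```python
import re


def initial_family_name(raw, preset, rule="first"):
    """Derive an initial family name from one piece of family identification info.

    raw:    the family identification information, e.g. "abcd/efg/hi".
    preset: lower-case tokens to remove (generic names, unrelated info).
    rule:   "first" keeps the first remaining token, "longest" the longest.
    """
    # S202: divide at the separators "/" and ":" and drop empty pieces.
    tokens = [t for t in re.split(r"[/:]", raw) if t]
    # S203: remove the preset information (case-insensitive match).
    kept = [t for t in tokens if t.lower() not in preset]
    if not kept:
        return None  # nothing specific to the virus family is left
    # S204: select one character string as the initial family name.
    if rule == "first":
        return kept[0]
    if rule == "longest":
        return max(kept, key=len)
    raise ValueError(f"unknown rule: {rule}")
```

Using the example from the text, `initial_family_name("abcd/efg/hi", preset={"abcd"})` removes the generic name abcd and returns efg under the "first" rule.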
Step S205: the initial family name is converted into vector information.
In practical applications, after the initial family name is obtained, it can be converted into vector information. Note that removing the information unrelated to the virus family and the known generic family names from the family identification sub-information exposes the family information specific to the target virus. The resulting vector information therefore reflects that specific family information, and identifying the virus family from it improves the accuracy and robustness of identification.
Step S206: The plurality of target viruses are clustered based on the vector information to obtain target virus family clusters.
Step S207: A target family label is determined for each target virus family cluster.
Referring to fig. 3, fig. 3 is a third flowchart of a virus family identification method according to an embodiment of the present application.
The virus family identification method provided by the embodiment of the application can comprise the following steps:
Step S301: Family identification information of a plurality of virus family identification devices for each of a plurality of target viruses is acquired.
Step S302: The family identification information is converted into vector information.
Step S303: The vector information is clustered to obtain vector information family clusters.
In practical applications, when the target viruses are clustered based on the vector information, the vector information itself can be clustered first to obtain vector information family clusters, which groups similar vector information together. The target viruses whose vector information falls in the same vector information family cluster then share common traits.
Step S304: cluster numbers of the vector information family clusters are determined.
In practical applications, the vector information converted from two similar pieces of family identification information is also similar, so the two pieces of vector information can be clustered together. However, a single piece of vector information represents only one piece of family identification information, whereas a target virus has several pieces of family identification information and therefore several pieces of vector information. Dividing target viruses into families solely on the basis of the cluster to which one piece of vector information belongs may give an inaccurate result. To further improve identification accuracy, the cluster number of each vector information family cluster can be determined, so that the cluster numbers characterize the different vector information. For example, if vector information A and vector information B belong to one vector information family cluster, and vector information C and vector information D belong to another, the cluster number of A and B can be 1 and the cluster number of C and D can be 2, and the two vector information family clusters are distinguished by the numbers 1 and 2.
Step S305: For each target virus, the cluster numbers corresponding to the family identification information of the target virus are combined to obtain a normalized sequence of the target virus.
In practical applications, after the cluster numbers of the vector information family clusters are determined, the target viruses can be clustered based on the cluster numbers to obtain the target virus family clusters. For a given target virus, its several pieces of vector information reflect different pieces of virus family information, and determining its virus family from all of them makes the final identification result more accurate. To this end, for each target virus, the cluster numbers corresponding to its family identification information can be combined into the normalized sequence of the target virus. The way the cluster numbers are combined can be chosen as needed. For example, the virus family identification devices can first be put in a fixed order, and the cluster numbers corresponding to the family identification information of the target virus can then be combined in that order to obtain the normalized sequence.
Note that if two pieces of family identification information are similar, their vector information is clustered into the same vector information family cluster and they are assigned the same cluster number. Therefore, if two normalized sequences have the same value at the same position, the two corresponding pieces of family identification information are similar. For ease of understanding, assume that there are two virus family identification devices and two target viruses. The first device outputs family identification information C1 and C2 for the two target viruses, and the second device outputs D1 and D2. Assume that only C1 and C2 are similar, so that the cluster number of both C1 and C2 is 1, the cluster number of D1 is 2, and the cluster number of D2 is 3. Let the normalized sequence have the structure CD, where C is the cluster number corresponding to the family identification information of the first device and D is the cluster number corresponding to that of the second device. The normalized sequences of the two target viruses are then 12 and 13, respectively.
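The construction of normalized sequences can be sketched by replaying the two-device, two-virus example. The function name `normalized_sequences`, the device and virus identifiers, and the use of tuples rather than concatenated digits are assumptions for illustration.

```python
def normalized_sequences(cluster_of, reports, device_order):
    """Build one normalized sequence per target virus.

    cluster_of:   maps a piece of family identification info to its cluster number.
    reports:      maps virus id -> {device id: family identification info}.
    device_order: fixed ordering of the devices, so positions are comparable.
    """
    return {
        virus: tuple(cluster_of[by_device[device]] for device in device_order)
        for virus, by_device in reports.items()
    }


# C1 and C2 are similar (cluster 1); D1 and D2 are not (clusters 2 and 3).
cluster_of = {"C1": 1, "C2": 1, "D1": 2, "D2": 3}
reports = {
    "virus_1": {"dev_C": "C1", "dev_D": "D1"},
    "virus_2": {"dev_C": "C2", "dev_D": "D2"},
}
sequences = normalized_sequences(cluster_of, reports, ["dev_C", "dev_D"])
```

Here `sequences` maps `virus_1` to `(1, 2)` and `virus_2` to `(1, 3)`. Tuples are used instead of digit strings so that cluster numbers above 9 remain unambiguous.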
Step S306: clustering the plurality of target viruses based on the normalized sequences to obtain target virus family clusters.
In practical application, a normalized sequence reflects the family identification information given by all virus family identification devices for a target virus. Identical values at the same position in two normalized sequences indicate that the two corresponding pieces of family identification information are similar, and different values at the same position indicate that they are dissimilar. Clustering the target viruses directly based on the normalized sequences is therefore equivalent to family clustering that takes into account both the commonalities and the differences among the family identification information. The resulting target virus family clusters reflect both the commonalities and the differences of the family identification results among the target viruses, which improves the rationality and robustness of the clustering.
Step S307: a target family tag is determined for each target virus family cluster.
Referring to fig. 4, fig. 4 is a fourth flowchart of a virus family identification method according to an embodiment of the present application.
The virus family identification method provided by the embodiment of the application can comprise the following steps:
Step S401: acquiring family identification information of a plurality of virus family identification devices for each of a plurality of target viruses.
Step S402: converting the family identification information into vector information.
Step S403: clustering the vector information to obtain vector information family clusters.
Step S404: determining cluster numbers of the vector information family clusters.
Step S405: for each target virus, combining the cluster numbers corresponding to the family identification information of the target virus to obtain a normalized sequence of the target virus.
Step S406: grouping identical normalized sequences together to obtain sequence clustering results.
Step S407: determining a sequence vector of each sequence clustering result based on the vector information.
Step S408: clustering the sequence clustering results based on the sequence vectors to obtain target clustering results.
Step S409: grouping the target viruses corresponding to each target clustering result to obtain the corresponding target virus family cluster.
In practical application, a normalized sequence is in essence a sequence of cluster numbers. When normalized sequences are clustered directly, only identical normalized sequences can be grouped together, that is, only target viruses with the same family identification information are grouped, while those with merely similar family identification information are not. To avoid this, in the process of clustering the target viruses based on the normalized sequences to obtain the target virus family clusters, identical normalized sequences can first be grouped together to obtain sequence clustering results. A sequence vector of each sequence clustering result is then determined based on the vector information, so that the normalized sequence is converted into a corresponding vector. Next, the sequence clustering results are clustered based on the sequence vectors to obtain target clustering results, so that the normalized sequences are clustered at the vector level. Finally, the target viruses corresponding to each target clustering result are grouped to obtain the corresponding target virus family cluster.
In a specific application scenario, to make the sequence vector easier to determine, a sample vector of each target virus may first be determined based on the vector information corresponding to that target virus, for example by taking the center of the vector information corresponding to the target virus as the sample vector. The sequence vector of the sequence clustering result is then determined based on the sample vectors, for example by taking the center of the sample vectors corresponding to the sequence clustering result as the sequence vector.
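Steps S406 to S409 can be sketched as below. The embodiment does not name a clustering algorithm for the sequence vectors, so this sketch assumes a simple greedy distance-threshold scheme; the function names, the `threshold` parameter, the Euclidean metric, and the sample data are all hypothetical.

```python
from collections import defaultdict
from math import dist

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cluster_viruses(sequences, vectors, threshold):
    """sequences: virus -> normalized sequence (tuple of cluster numbers).
    vectors: virus -> list of label vectors, one per identification device.
    Returns a list of target virus family clusters (lists of virus ids)."""
    # S406: group viruses whose normalized sequences are identical.
    groups = defaultdict(list)
    for virus, seq in sequences.items():
        groups[seq].append(virus)

    # S407: sample vector = center of a virus's label vectors;
    # sequence vector = center of the sample vectors in one group.
    sample = {virus: centroid(vecs) for virus, vecs in vectors.items()}
    seq_vec = {
        seq: centroid([sample[v] for v in members])
        for seq, members in groups.items()
    }

    # S408 (assumed algorithm): attach each sequence to the first existing
    # cluster whose leader vector lies within `threshold`, else start a new one.
    clusters = []  # list of (leader_vector, [sequences])
    for seq, vec in seq_vec.items():
        for leader, seqs in clusters:
            if dist(leader, vec) <= threshold:
                seqs.append(seq)
                break
        else:
            clusters.append((vec, [seq]))

    # S409: gather the viruses behind each target clustering result.
    return [sorted(v for seq in seqs for v in groups[seq]) for _, seqs in clusters]

# Hypothetical data: "c" differs from "a" and "b" in one cluster number only,
# but its label vectors are close to theirs.
sequences = {"a": (1, 2), "b": (1, 2), "c": (1, 3), "d": (4, 5)}
vectors = {
    "a": [[0.0, 0.0], [0.0, 2.0]],
    "b": [[0.0, 0.0], [0.0, 2.0]],
    "c": [[0.0, 0.0], [0.0, 2.2]],
    "d": [[10.0, 10.0], [12.0, 10.0]],
}
result = cluster_viruses(sequences, vectors, threshold=0.5)
```

With this data, "c" is merged with "a" and "b" even though its normalized sequence differs from theirs, which is the effect the vector-level clustering is meant to achieve. Any standard clustering method could replace the greedy step.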
Step S410: a target family tag is determined for each target virus family cluster.
In the virus family identification method provided in the embodiment of the present application, in order to determine the target family label of each target virus family cluster quickly, the family name that occurs most often in the target virus family cluster may be determined as the initial family label, or the most authoritative family name in the target virus family cluster may be determined as the initial family label, and so on. If identical initial family labels exist, distinguishing information is added to them to obtain the target family labels, and the type of the distinguishing information can be determined according to actual needs. For example, if the initial family labels of two target virus family clusters are both cerber, the target family label of one cluster is determined to be cerber_0 and that of the other to be cerber_1. If no identical initial family labels exist, the initial family label is taken as the target family label.
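The majority-vote labelling and the cerber_0 / cerber_1 disambiguation described above can be sketched as follows. The function names and the sample clusters are assumptions for illustration, and a numeric suffix is only one possible form of distinguishing information.

```python
from collections import Counter

def initial_label(family_names):
    """Family name with the largest number of occurrences in one cluster.
    With equal counts, Counter returns the name that was counted first."""
    return Counter(family_names).most_common(1)[0][0]

def target_labels(clusters):
    """clusters: list of lists of family names, one inner list per cluster.
    Returns one target family label per cluster, in the same order."""
    initial = [initial_label(names) for names in clusters]
    totals = Counter(initial)   # how many clusters share each initial label
    seen = Counter()            # running index for each duplicated label
    labels = []
    for name in initial:
        if totals[name] > 1:
            # Duplicate initial label: append distinguishing information.
            labels.append(f"{name}_{seen[name]}")
            seen[name] += 1
        else:
            labels.append(name)
    return labels

# Hypothetical family names collected for three target virus family clusters.
clusters = [
    ["cerber", "cerber", "locky"],
    ["cerber", "zeus", "cerber"],
    ["wannacry"],
]
labels = target_labels(clusters)
```

Labels that are unique across clusters are left unchanged, so only clusters that collide on the same initial label receive a suffix.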
Referring to fig. 5, fig. 5 is a schematic structural diagram of a virus family identification system according to an embodiment of the present application.
The virus family identification system provided in the embodiment of the application may include:
a family identification information acquisition module 101 for acquiring family identification information of a plurality of virus family identification devices for each of a plurality of target viruses;
a vector conversion module 102 for converting family identification information into vector information;
a clustering module 103, configured to cluster a plurality of target viruses based on vector information, to obtain a target virus family cluster;
the family tag determination module 104 is configured to determine a target family tag of each target virus family cluster.
In the virus family identification system provided in the embodiment of the present application, the vector conversion module may include:
the family name acquisition sub-module is used for processing the family identification information to acquire an initial family name;
and a conversion unit for converting the initial family name into vector information.
In the virus family identification system provided in the embodiment of the present application, the family name acquisition sub-module may include:
the division unit is used for dividing the family identification information to obtain family identification sub-information;
The rejecting unit is used for rejecting preset information in the family identification sub-information to obtain the family sub-information after being rejected;
and the selection unit is used for selecting the character strings of the family sub-information after the elimination to obtain an initial family name.
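The three units above (division, rejection, selection) might be sketched as follows. The delimiter rule, the contents of `PRESET_TOKENS`, and the choice of the longest remaining token as the selection rule are all assumptions made for illustration; this passage does not specify them.

```python
import re
from typing import Optional, Set

# Hypothetical preset information: generic tokens that carry no family name.
PRESET_TOKENS: Set[str] = {
    "trojan", "win32", "win64", "generic", "malware", "ransom", "variant",
}

def initial_family_name(label: str, preset: Set[str] = PRESET_TOKENS) -> Optional[str]:
    """Derive an initial family name from one piece of family identification information."""
    # Division unit: split on any run of non-alphanumeric characters.
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", label) if t]
    # Rejection unit: drop the preset (generic) tokens.
    kept = [t for t in tokens if t not in preset]
    if not kept:
        return None
    # Selection unit (assumed rule): keep the longest remaining token.
    return max(kept, key=len)

name_a = initial_family_name("Trojan.Win32.Cerber.a")
name_b = initial_family_name("Ransom:Win32/Cerber.A")
```

Both sample labels reduce to the same initial family name even though the two hypothetical devices format their output differently, which is what makes the later vector conversion and clustering meaningful.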
In the virus family identification system provided in the embodiment of the present application, the clustering module may include:
the first clustering unit is used for clustering the vector information to obtain a vector information family cluster;
a determining unit configured to determine a cluster number of a vector information family cluster;
and the first clustering sub-module is used for clustering a plurality of target viruses based on the cluster numbers to obtain target virus family clusters.
In the virus family identification system provided in the embodiment of the present application, the first clustering sub-module may include:
the combination unit is used for combining cluster numbers corresponding to family identification information of the target viruses for each target virus to obtain a normalized sequence of the target viruses;
and the second clustering unit is used for clustering the target viruses based on the normalized sequence to obtain a target virus family cluster.
In the virus family identification system provided in the embodiment of the present application, the second clustering unit may be specifically configured to: group identical normalized sequences together to obtain a sequence clustering result; determine a sequence vector of the sequence clustering result based on the vector information; cluster the sequence clustering results based on the sequence vectors to obtain target clustering results; and group the target viruses corresponding to each target clustering result to obtain the corresponding target virus family cluster.
In the virus family identification system provided in the embodiment of the present application, the second clustering unit may be specifically configured to: determine a sample vector of the target virus based on the vector information corresponding to the target virus; and determine the sequence vector of the sequence clustering result based on the sample vector.
In the virus family identification system provided in the embodiment of the present application, the second clustering unit may be specifically configured to: take the center of the sample vectors corresponding to the sequence clustering result as the sequence vector.
In the virus family identification system provided in the embodiment of the present application, the family tag determination module may include:
a tag determining unit configured to determine, as an initial family tag, a family name having the largest number of occurrences in the target virus family cluster;
the processing unit is used for adding distinguishing information to the same initial family label if the same initial family label exists, so as to obtain a target family label; if the same initial family tag does not exist, the initial family tag is taken as the target family tag.
Based on the hardware implementation of the program module, and in order to implement the method of the embodiment of the present invention, the embodiment of the present invention further provides an electronic device, and fig. 6 is a schematic diagram of a hardware composition structure of the electronic device of the embodiment of the present invention, as shown in fig. 6, where the electronic device includes:
A communication interface 1 capable of information interaction with other devices such as network devices and the like;
A processor 2, connected with the communication interface 1 to realize information interaction with other devices, and configured to execute, when running a computer program, the virus family identification method provided by one or more of the above technical solutions. The computer program is stored in the memory 3.
Of course, in practice, the various components in the electronic device are coupled together by a bus system 4. It will be appreciated that the bus system 4 is used to enable connected communications between these components. The bus system 4 comprises, in addition to a data bus, a power bus, a control bus and a status signal bus. But for clarity of illustration the various buses are labeled as bus system 4 in fig. 6.
The memory 3 in the embodiment of the present invention is used to store various types of data to support the operation of the electronic device. Examples of such data include: any computer program for operating on an electronic device.
It will be appreciated that the memory 3 may be volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be read-only memory (ROM, Read-Only Memory), programmable read-only memory (PROM, Programmable Read-Only Memory), erasable programmable read-only memory (EPROM, Erasable Programmable Read-Only Memory), electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read-Only Memory), magnetic random access memory (FRAM, ferromagnetic random access memory), flash memory (Flash Memory), magnetic surface memory, optical disk, or compact disc read-only memory (CD-ROM, Compact Disc Read-Only Memory); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be random access memory (RAM, Random Access Memory), which acts as external cache memory. By way of example, and not limitation, many forms of RAM are available, such as static random access memory (SRAM, Static Random Access Memory), synchronous static random access memory (SSRAM, Synchronous Static Random Access Memory), dynamic random access memory (DRAM, Dynamic Random Access Memory), synchronous dynamic random access memory (SDRAM, Synchronous Dynamic Random Access Memory), double data rate synchronous dynamic random access memory (DDR SDRAM, Double Data Rate Synchronous Dynamic Random Access Memory), enhanced synchronous dynamic random access memory (ESDRAM, Enhanced Synchronous Dynamic Random Access Memory), synchronous link dynamic random access memory (SLDRAM, SyncLink Dynamic Random Access Memory), and direct Rambus random access memory (DRRAM, Direct Rambus Random Access Memory). The memory 3 described in the embodiments of the present invention is intended to include, without being limited to, these and any other suitable types of memory.
The method disclosed in the above embodiment of the present invention may be applied to the processor 2 or implemented by the processor 2. The processor 2 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 2 or by instructions in the form of software. The processor 2 described above may be a general purpose processor, DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 2 may implement or perform the methods, steps and logic blocks disclosed in embodiments of the present invention. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiment of the invention can be directly embodied in the hardware of the decoding processor or can be implemented by combining hardware and software modules in the decoding processor. The software modules may be located in a storage medium in the memory 3 and the processor 2 reads the program in the memory 3 to perform the steps of the method described above in connection with its hardware.
The corresponding flow in each method of the embodiments of the present invention is implemented when the processor 2 executes the program, and for brevity, will not be described in detail herein.
In an exemplary embodiment, the present invention also provides a storage medium, i.e. a computer storage medium, in particular a computer readable storage medium, for example comprising a memory 3 storing a computer program executable by the processor 2 for performing the steps of the method described above. The computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash Memory, magnetic surface Memory, optical disk, or CD-ROM.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, terminal and method may be implemented in other manners. The device embodiments described above are only illustrative. For example, the division of the units is only a division by logical function, and other divisions are possible in practice: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described as separate units may or may not be physically separate, and units displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present invention may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware associated with program instructions, where the foregoing program may be stored in a computer readable storage medium, and when executed, the program performs steps including the above method embodiments; and the aforementioned storage medium includes: a removable storage device, ROM, RAM, magnetic or optical disk, or other medium capable of storing program code.
Alternatively, the above-described integrated units of the present invention may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in essence or a part contributing to the prior art in the form of a software product stored in a storage medium, including several instructions for causing an electronic device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, ROM, RAM, magnetic or optical disk, or other medium capable of storing program code.
The description of the relevant parts in the virus family identification system, the apparatus and the computer readable storage medium provided in the embodiments of the present application refers to the detailed description of the corresponding parts in the virus family identification method provided in the embodiments of the present application, and will not be repeated here. In addition, the parts of the above technical solutions provided in the embodiments of the present application, which are consistent with the implementation principles of the corresponding technical solutions in the prior art, are not described in detail, so that redundant descriptions are avoided.
It is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of virus family identification comprising:
acquiring family identification information of a plurality of virus family identification devices for each target virus in a plurality of target viruses;
converting the family identification information into vector information;
clustering a plurality of target viruses based on the vector information to obtain a target virus family cluster;
determining a target family tag for each of the target virus family clusters;
wherein the clustering the plurality of target viruses based on the vector information to obtain a target virus family cluster comprises:
clustering the vector information to obtain a vector information family cluster;
determining a cluster number of the vector information family cluster;
for each target virus, combining the cluster numbers corresponding to the family identification information of the target virus to obtain a normalized sequence of the target virus;
and clustering a plurality of target viruses based on the normalized sequence to obtain the target virus family cluster.
2. The method of claim 1, wherein the converting the family identification information into vector information comprises:
processing the family identification information to obtain an initial family name;
converting the initial family name into the vector information.
3. The method of claim 2, wherein processing the family identification information to obtain an initial family name comprises:
dividing the family identification information to obtain family identification sub-information;
removing preset information in the family identification sub-information to obtain removed family sub-information;
and selecting the character string of the family sub-information after the elimination to obtain the initial family name.
4. The method of claim 1, wherein clustering the plurality of target viruses based on the normalized sequence to obtain the target virus family cluster comprises:
grouping identical normalized sequences together to obtain a sequence clustering result;
determining a sequence vector of the sequence clustering result based on the vector information;
clustering the sequence clustering result based on the sequence vector to obtain a target clustering result;
and gathering the target viruses corresponding to the target clustering result together to obtain the corresponding target virus family cluster.
5. The method of claim 4, wherein the determining the sequence vector of the sequence clustering result based on the vector information comprises:
determining a sample vector of the target virus based on the vector information corresponding to the target virus;
based on the sample vector, the sequence vector of the sequence clustering result is determined.
6. The method of claim 5, wherein the determining the sequence vector for the sequence clustering result based on the sample vector comprises:
and taking the center of the sample vector corresponding to the sequence clustering result as the sequence vector.
7. The method of any one of claims 1 to 6, wherein said determining a target family tag for each of said target virus family clusters comprises:
determining the family name with the largest occurrence number in the target virus family cluster as an initial family label;
if the same initial family label exists, adding distinguishing information to the same initial family label to obtain the target family label;
if the same initial family tag does not exist, the initial family tag is taken as the target family tag.
8. A virus family identification system, comprising:
a family identification information acquisition module, configured to acquire family identification information of a plurality of virus family identification devices on each of a plurality of target viruses;
a vector conversion module, configured to convert the family identification information into vector information;
the clustering module is used for clustering a plurality of target viruses based on the vector information to obtain a target virus family cluster;
a family tag determining module, configured to determine a target family tag of each of the target virus family clusters;
wherein, the clustering module includes:
the first clustering unit is used for clustering the vector information to obtain a vector information family cluster;
a determining unit configured to determine a cluster number of the vector information family cluster;
a combination unit, configured to combine, for each target virus, the cluster numbers corresponding to the family identification information of the target virus, to obtain a normalized sequence of the target virus;
and the second clustering unit is used for clustering a plurality of target viruses based on the normalized sequence to obtain the target virus family cluster.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the virus family identification method according to any one of claims 1 to 7 when executing said computer program.
10. A computer readable storage medium, wherein a computer program is stored in the computer readable storage medium, which computer program, when being executed by a processor, implements the steps of the virus family identification method according to any one of claims 1 to 7.
CN202111146137.4A 2021-09-28 2021-09-28 Virus family identification method, system, equipment and computer storage medium Active CN113836534B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111146137.4A CN113836534B (en) 2021-09-28 2021-09-28 Virus family identification method, system, equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111146137.4A CN113836534B (en) 2021-09-28 2021-09-28 Virus family identification method, system, equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN113836534A CN113836534A (en) 2021-12-24
CN113836534B true CN113836534B (en) 2024-04-12

Family

ID=78967192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111146137.4A Active CN113836534B (en) 2021-09-28 2021-09-28 Virus family identification method, system, equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN113836534B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1253501A2 (en) * 2001-04-29 2002-10-30 Beijing Rising Technology Corporation Limited Method and system for scanning and cleaning known and unknown computer viruses, recording medium and transmission medium therefor
US7210041B1 (en) * 2001-04-30 2007-04-24 Mcafee, Inc. System and method for identifying a macro virus family using a macro virus definitions database
GB0822619D0 (en) * 2008-12-11 2009-01-21 Scansafe Ltd Malware detection
WO2013020426A1 (en) * 2011-08-09 2013-02-14 腾讯科技(深圳)有限公司 Clustering processing method and device for virus files
CN103476788A (en) * 2010-12-10 2013-12-25 新加坡科技研究局 Immunogenic chikungunya virus peptides
CN105512555A (en) * 2014-12-12 2016-04-20 哈尔滨安天科技股份有限公司 Homologous family dividing and mutation method and system based on file string cluster
WO2016127037A1 (en) * 2015-02-06 2016-08-11 Alibaba Group Holding Limited Method and device for identifying computer virus variants
CN109190653A (en) * 2018-07-09 2019-01-11 四川大学 Malicious code family homology analysis technology based on semi-supervised Density Clustering
CN110399722A (en) * 2019-02-20 2019-11-01 腾讯科技(深圳)有限公司 A kind of virus family generation method, device, server and storage medium
CN111783088A (en) * 2020-06-03 2020-10-16 杭州迪普科技股份有限公司 Malicious code family clustering method and device and computer equipment
CN112084500A (en) * 2020-09-15 2020-12-15 腾讯科技(深圳)有限公司 Method and device for clustering virus samples, electronic equipment and storage medium
CN112287952A (en) * 2019-07-22 2021-01-29 腾讯科技(深圳)有限公司 Virus clustering method, virus clustering device, storage medium and electronic device
CN112765428A (en) * 2021-01-15 2021-05-07 济南大学 Malicious software family clustering and identifying method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11689561B2 (en) * 2019-11-11 2023-06-27 Microsoft Technology Licensing, Llc Detecting unknown malicious content in computer systems

Also Published As

Publication number Publication date
CN113836534A (en) 2021-12-24

Similar Documents

Publication Publication Date Title
WO2022051663A1 (en) Domain name processing systems and methods
CN111241389B (en) Sensitive word filtering method and device based on matrix, electronic equipment and storage medium
US10216848B2 (en) Method and system for recommending cloud websites based on terminal access statistics
CN111800404B (en) Method and device for identifying malicious domain name and storage medium
CN112769775B (en) Threat information association analysis method, system, equipment and computer medium
CN114244611B (en) Abnormal attack detection method, device, equipment and storage medium
CN112839014A (en) Method, system, device and medium for establishing model for identifying abnormal visitor
CN115189914A (en) Application Programming Interface (API) identification method and device for network traffic
CN114201756A (en) Vulnerability detection method and related device for intelligent contract code segment
CN111966673B (en) Big data based data auditing method and device and storage medium
CN114372267A (en) Malicious webpage identification and detection method based on static domain, computer and storage medium
CN113836534B (en) Virus family identification method, system, equipment and computer storage medium
CN110427496B (en) Knowledge graph expansion method and device for text processing
CN113688240B (en) Threat element extraction method, threat element extraction device, threat element extraction equipment and storage medium
CN111694928A (en) Data index recommendation method and device, computer equipment and readable storage medium
CN114095176B (en) Malicious domain name detection method and device
CN112541357B (en) Entity identification method and device and intelligent equipment
CN108132971B (en) Analysis method and device for database fragment files
CN112307174A (en) Multi-platform data integration method and device, computer equipment and readable storage medium
CN113765852B (en) Data packet detection method, system, storage medium and computing device
CN114860673B (en) Log feature identification method and device based on dynamic and static combination
CN114500261B (en) Network asset identification method and device, electronic equipment and storage medium
CN115455011B (en) Multi-source information data processing method and device
CN109088859B (en) Method, device, server and readable storage medium for identifying suspicious target object
CN112035890B (en) Data integrity verification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant