CN113836534A - Virus family identification method, system, equipment and computer storage medium - Google Patents

Publication number
CN113836534A
Authority
CN
China
Prior art keywords
family
target
virus
information
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111146137.4A
Other languages
Chinese (zh)
Other versions
CN113836534B (en)
Inventor
祝洪宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd
Priority to CN202111146137.4A
Publication of CN113836534A
Application granted
Publication of CN113836534B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a virus family identification method, system, device, and computer-readable storage medium. The method acquires the family identification information that a plurality of virus family identification devices produce for each of a plurality of target viruses; converts the family identification information into vector information; clusters the target viruses based on the vector information to obtain target virus family clusters; and determines a target family tag for each target virus family cluster. Converting the family identification information into vector information allows it to be processed quickly. Because clustering groups similar objects together, the viruses within a target virus family cluster are similar to one another, and viruses that share common characteristics are grouped together whether or not they are new. Once the target family tag of each cluster is determined, similar target viruses therefore receive the same family tag, which improves identification accuracy.

Description

Virus family identification method, system, equipment and computer storage medium
Technical Field
The present application relates to the field of network security technologies, and in particular, to a method, a system, a device, and a computer storage medium for virus family identification.
Background
With the development and popularization of networks, users' online activities are increasingly frequent. While users are online, attackers may target them for profit or other motives, for example by sending virus files, virus videos, and the like to user equipment. Such network attacks cause losses to users, so network security is increasingly important. Because there are many types of viruses, viruses can be classified into different virus families, and a virus family describes the characteristics of one class of viruses so that those viruses can be handled quickly.
When virus families need to be identified, several antivirus engines can each perform family identification on a target virus to obtain family identification information, and the family tag of the target virus is then determined from that information by weighted voting or a similar scheme. However, this approach identifies the virus family with low accuracy when it encounters new viruses or similar viruses.
In view of the above, how to accurately identify virus families is a problem to be solved by those skilled in the art.
Disclosure of Invention
The purpose of the present application is to provide a virus family identification method, which can solve the technical problem of how to accurately identify a virus family to a certain extent. The application also provides a virus family identification system, equipment and a computer readable storage medium.
In order to achieve the above object, in a first aspect, the present application provides a virus family identification method, comprising:
acquiring family identification information produced by a plurality of virus family identification devices for each of a plurality of target viruses;
converting the family identification information into vector information;
clustering the target viruses based on the vector information to obtain a target virus family cluster;
determining a target family tag for each of the target virus family clusters.
In the application, after the family identification information is acquired, it is converted into vector information so that it can subsequently be processed quickly. The target viruses are then clustered based on the vector information. Because clustering groups similar objects together, the viruses in a target virus family cluster are similar to one another, and viruses that share common characteristics are grouped together whether or not they are new. Once the target family tag of each target virus family cluster is determined, similar target viruses therefore receive the same family tag, which improves identification accuracy.
Preferably, the converting the family identification information into vector information includes:
processing the family identification information to obtain an initial family name;
converting the initial family name to the vector information.
Preferably, the processing the family identification information to obtain an initial family name includes:
segmenting the family identification information to obtain family identifier sub-information;
removing preset information from the family identifier sub-information to obtain removed family sub-information;
selecting a character string from the removed family sub-information to obtain the initial family name;
converting the initial family name to the vector information.
Preferably, the clustering the plurality of target viruses based on the vector information to obtain a target virus family cluster includes:
clustering the vector information to obtain a vector information family cluster;
determining a cluster number of the vector information family cluster;
and clustering the target viruses based on the cluster numbers to obtain the target virus family cluster.
Preferably, the clustering the target viruses based on the cluster numbers to obtain the target virus family cluster includes:
for each target virus, combining the cluster numbers corresponding to the family identification information of the target virus to obtain a normalized sequence of the target virus;
and clustering a plurality of target viruses based on the normalized sequences to obtain the target virus family cluster.
In this method, the cluster numbers corresponding to the family identification information of a target virus are combined to obtain the normalized sequence of that target virus. Because cluster numbers represent different vector information, the normalized sequence reflects the different virus family information reported for the target virus. Clustering the target viruses based on the normalized sequences therefore takes into account both what the family identification information has in common and where it differs, which improves the rationality and robustness of the resulting target virus family clusters.
Preferably, the clustering the target viruses based on the normalized sequence to obtain the target virus family cluster includes:
gathering identical normalized sequences together to obtain a sequence clustering result;
determining a sequence vector of the sequence clustering result based on the vector information;
clustering the sequence clustering result based on the sequence vector to obtain a target clustering result;
and clustering the target viruses corresponding to the target clustering result together to obtain the corresponding target virus family cluster.
Preferably, the determining a sequence vector of the sequence clustering result based on the vector information includes:
determining a sample vector of the target virus based on the vector information corresponding to the target virus;
determining the sequence vector of the sequence clustering result based on the sample vector.
Preferably, the determining the sequence vector of the sequence clustering result based on the sample vector includes:
and taking the center of the sample vector corresponding to the sequence clustering result as the sequence vector.
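For ease of understanding, taking the center of the sample vectors can be sketched as follows. This is a minimal sketch that assumes the center is the component-wise arithmetic mean, which the application does not specify; the function name `sequence_vector` is hypothetical.

```python
from typing import List, Sequence


def sequence_vector(sample_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Return the component-wise mean of the sample vectors of one sequence cluster.

    Assumption: "center" means the arithmetic mean (centroid).
    """
    if not sample_vectors:
        raise ValueError("at least one sample vector is required")
    dim = len(sample_vectors[0])
    if any(len(v) != dim for v in sample_vectors):
        raise ValueError("all sample vectors must have the same dimension")
    n = len(sample_vectors)
    # Average each coordinate over all sample vectors.
    return [sum(v[i] for v in sample_vectors) / n for i in range(dim)]
```

A median or medoid could be substituted for the mean if the sample vectors contain outliers.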
Preferably, the determining the target family tag of each target virus family cluster comprises:
determining the family name with the largest occurrence number in the target virus family cluster as an initial family tag;
if the same initial family tag exists, adding distinguishing information to the same initial family tag to obtain the target family tag;
if the same initial family tag does not exist, the initial family tag is used as the target family tag.
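The tag determination described above can be sketched as follows. This is a minimal sketch under stated assumptions: each cluster is given as a non-empty list of initial family names, the distinguishing information is a numeric suffix (the application does not say what form it takes), and the name `assign_family_tags` is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def assign_family_tags(clusters: Dict[int, List[str]]) -> Dict[int, str]:
    """Map each cluster id to a target family tag."""
    # Initial tag: the family name that occurs most often in the cluster.
    initial = {
        cid: Counter(names).most_common(1)[0][0]
        for cid, names in clusters.items()
    }
    usage = Counter(initial.values())  # how many clusters share each initial tag
    seen: Counter = Counter()
    tags: Dict[int, str] = {}
    for cid in sorted(initial):
        name = initial[cid]
        if usage[name] > 1:
            # Same initial tag on several clusters: add a distinguishing
            # suffix (hypothetical format "<name>_<k>").
            seen[name] += 1
            tags[cid] = f"{name}_{seen[name]}"
        else:
            tags[cid] = name
    return tags
```

For example, two clusters whose most frequent name is both `fuerboos` would be tagged `fuerboos_1` and `fuerboos_2`, while a cluster with a unique most frequent name keeps that name unchanged.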
In a second aspect, the present application provides a virus family identification system comprising:
the family identification information acquisition module is used for acquiring the family identification information produced by a plurality of virus family identification devices for each of a plurality of target viruses;
the vector conversion module is used for converting the family identification information into vector information;
the clustering module is used for clustering the target viruses based on the vector information to obtain a target virus family cluster;
a family tag determination module for determining a target family tag for each of the target virus family clusters.
In a third aspect, the present application provides an electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of any of the above virus family identification methods when executing the computer program.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the virus family identification method as described above.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a first flowchart of a virus family identification method according to an embodiment of the present disclosure;
FIG. 2 is a second flowchart of a virus family identification method provided in the embodiments of the present application;
FIG. 3 is a third flowchart of a virus family identification method provided in the embodiments of the present application;
FIG. 4 is a fourth flowchart of a virus family identification method according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a virus family identification system according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a hardware component structure of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to FIG. 1, FIG. 1 is a first flowchart of a virus family identification method according to an embodiment of the present disclosure.
The virus family identification method provided by the embodiment of the application can comprise the following steps:
Step S101: family identification information produced by a plurality of virus family identification devices for each of a plurality of target viruses is acquired.
In practical application, the prior art already provides virus family identification devices, such as antivirus engines, that can directly determine the family information of a target virus, and their identification results have a certain accuracy. The family identification information that a plurality of such devices produce for each target virus can therefore be acquired first and later used to determine the family information of the target virus, which improves the operating efficiency of the method to a certain extent.
It should be noted that the number of virus family identification devices and the number of target viruses may be determined according to actual needs. For example, with 10 virus family identification devices and 100 target viruses, each device outputs one piece of family identification information per target virus, so each target virus has 10 pieces of family identification information, for a total of 1000 pieces.
Step S102: the family identification information is converted into vector information.
In practical application, there may be many pieces of family identification information, and a single piece may itself be long, so processing the family identification information directly would reduce the operating efficiency of the method. To avoid this, the family identification information can be converted into corresponding vector information. Different family identification information yields different vector information, so the vector information can stand in for the family identification information in subsequent processing, and its simple structure improves operating efficiency to a certain extent. Most importantly, vector information quantifies the distance between pieces of family identification information, so the virus family of a target virus can be identified more accurately.
It should be noted that the vector information in the present application refers to information obtained by vectorizing the family identification information. For example, when the family identification information is fuerboos, the converted vector information may be [0,1,1,0], and when it is fuerboose, the converted vector information may be [0,1,0,1]. The conversion method may be determined according to actual needs; for example, the family identification information may be converted into vector information with the FastText technique from NLP (Natural Language Processing). The present application is not specifically limited herein.
Step S103: the target viruses are clustered based on the vector information to obtain target virus family clusters.
In practical application, viruses are divided into families according to what they have in common, and clustering groups similar objects together. The target viruses can therefore be clustered based on the vector information to obtain target virus family clusters, and the target viruses gathered in one target virus family cluster belong to the same virus family.
It should be noted that the clustering method applied in the present application may be determined according to actual needs. For example, the target viruses may be clustered based on the vector information by means of DBSCAN (Density-Based Spatial Clustering of Applications with Noise), CLARANS (a clustering algorithm based on randomized search), density clustering, and the like; the present application is not specifically limited herein. In addition, when a target virus is new, it is clustered into the same target virus family cluster as the existing target viruses it resembles, and if it resembles none of them it forms a target virus family cluster of its own. That is, whether a target virus is new or known, the method yields the target virus family cluster corresponding to it.
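As one possible illustration of density-based clustering, a minimal DBSCAN sketch in pure Python follows. It is a simplified illustration rather than the implementation used in the application; the parameters `eps` and `min_pts` and the Euclidean distance are assumptions.

```python
import math
from typing import List, Sequence


def dbscan(points: Sequence[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    UNVISITED, NOISE = -2, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i: int) -> List[int]:
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, so expand through it
    return labels
```

With `min_pts=1` every point is a core point, so a sample that resembles no other sample receives a cluster of its own instead of the noise label, which matches the behavior described above for a new virus.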
Step S104: a target family tag is determined for each target virus family cluster.
In practical application, after the target virus family clusters are obtained, a target family tag can be determined for each of them, and every target virus in a cluster inherits that cluster's target family tag. In other words, the target family tag is the virus family identification result for the target viruses in the cluster. The type and content of the target family tag may be determined according to actual needs, and the application is not specifically limited herein.
In the virus family identification method provided by the application, family identification information produced by a plurality of virus family identification devices for each target virus is acquired; the family identification information is converted into vector information; the target viruses are clustered based on the vector information to obtain target virus family clusters; and a target family tag is determined for each target virus family cluster. Converting the family identification information into vector information allows it to be processed quickly. Because clustering groups similar objects together, the viruses in a target virus family cluster are similar to one another, and viruses that share common characteristics are grouped together whether or not they are new. Once the target family tag of each cluster is determined, similar target viruses therefore receive the same family tag, which improves identification accuracy.
Referring to FIG. 2, FIG. 2 is a second flowchart of a virus family identification method according to an embodiment of the present application.
The virus family identification method provided by the embodiment of the application can comprise the following steps:
Step S201: family identification information produced by a plurality of virus family identification devices for each of a plurality of target viruses is acquired.
Step S202: the family identification information is divided to obtain family identifier sub-information.
In practical application, the family identification information carries an initial family name that reflects a virus family, and this name is useful for virus family identification. When the family identification information is converted into vector information, it can therefore first be processed to obtain the initial family name, and the initial family name is then converted into vector information.
In a specific application scenario, the family identification information may include information unrelated to the virus family, or a known generic family name. If such information were carried into subsequent processing, it could affect the accuracy of the final virus family identification result. To avoid this, when the family identification information is converted into vector information, it can first be divided into family identifier sub-information so that the unrelated information or generic family names are exposed. For example, the family identification information may be divided at the separators it contains, such as "/" and ":". If one piece of family identification information is abcd/efg/hi, dividing it at the separator "/" yields three pieces of family identifier sub-information: abcd, efg, and hi.
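The division step can be sketched as follows. This is a minimal sketch; the function name `split_identification` and the default separator set "/" and ":" are assumptions, and a real deployment would likely need the separators used by each engine.

```python
import re
from typing import List


def split_identification(info: str, separators: str = "/:") -> List[str]:
    """Split one piece of family identification information at any separator character."""
    # Build a character class such as [/:] from the separator characters.
    pattern = "[" + re.escape(separators) + "]"
    # Drop empty strings produced by leading, trailing, or doubled separators.
    return [part for part in re.split(pattern, info) if part]
```

Applied to the example above, `split_identification("abcd/efg/hi")` yields the three pieces abcd, efg, and hi.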
Step S203: preset information is removed from the family identifier sub-information to obtain removed family sub-information.
In practical application, after the family identification information is divided into family identifier sub-information, the preset information in it, namely information unrelated to the virus family, known generic family names, and the like, can be removed. The removed family sub-information that remains is related only to the virus family and reflects the family information unique to the target virus. Continuing the example in which the family identification information is abcd/efg/hi, and assuming that abcd is a generic family name, abcd is removed, leaving two pieces of removed family sub-information: efg and hi.
Step S204: a character string is selected from the removed family sub-information to obtain an initial family name.
In practical application, removing the preset information from the family identifier sub-information may leave one or more pieces of removed family sub-information. A character string is then selected from them to obtain an initial family name that represents the virus family to which the target virus belongs.
It should be noted that the rule for selecting a character string from the removed family sub-information may be determined according to actual needs. For example, the first piece of family sub-information remaining after the useless information is removed may be selected as the initial family name, or the longest remaining piece may be selected; the present application is not specifically limited herein.
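The removal and selection steps can be sketched as follows. This is a minimal sketch under stated assumptions: the contents of `PRESET_TOKENS` are hypothetical (the application does not list the preset information), matching is case-insensitive, and the function names are illustrative.

```python
from typing import FrozenSet, Iterable, List, Optional

# Hypothetical preset list: generic names and tokens that carry no family information.
PRESET_TOKENS: FrozenSet[str] = frozenset({"abcd", "trojan", "win32", "generic"})


def remove_preset(parts: Iterable[str],
                  preset: FrozenSet[str] = PRESET_TOKENS) -> List[str]:
    """Drop sub-information that matches the preset list (case-insensitive)."""
    return [p for p in parts if p.lower() not in preset]


def select_initial_name(parts: List[str], rule: str = "first") -> Optional[str]:
    """Pick the initial family name: the first remaining piece, or the longest one."""
    if not parts:
        return None  # nothing family-specific survived the removal step
    if rule == "first":
        return parts[0]
    if rule == "longest":
        return max(parts, key=len)  # ties resolve to the earliest piece
    raise ValueError(f"unknown rule: {rule}")
```

With abcd treated as a generic name, the pieces abcd, efg, and hi reduce to efg and hi, and both selection rules then return efg.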
Step S205: the initial family name is converted to vector information.
In practical applications, after the initial family name is obtained, it can be converted into vector information. Because information unrelated to the virus family, known generic family names, and the like have been removed from the family identifier sub-information, the family information unique to the target virus is exposed, and the converted vector information reflects it. Performing virus family identification on this basis enhances identification accuracy and robustness.
Step S206: the target viruses are clustered based on the vector information to obtain target virus family clusters.
Step S207: a target family tag is determined for each target virus family cluster.
Referring to FIG. 3, FIG. 3 is a third flowchart of a virus family identification method according to an embodiment of the present application.
The virus family identification method provided by the embodiment of the application can comprise the following steps:
Step S301: family identification information produced by a plurality of virus family identification devices for each of a plurality of target viruses is acquired.
Step S302: the family identification information is converted into vector information.
Step S303: the vector information is clustered to obtain vector information family clusters.
In practical application, when the target viruses are clustered based on the vector information to obtain target virus family clusters, the vector information can first be clustered to obtain vector information family clusters, so that similar vector information is gathered together. The target viruses associated with one vector information family cluster then have something in common.
Step S304: a cluster number of the vector information family cluster is determined.
In practical application, when the vector information converted from two pieces of family identification information is similar, the two pieces of vector information are clustered together. However, a single piece of vector information represents only one piece of family identification information, whereas a target virus has a plurality of pieces of family identification information and therefore a plurality of pieces of vector information. If target viruses were divided into families according only to the vector information family clusters formed from single pieces of vector information, the final division could be inaccurate. To further improve identification accuracy, the cluster number of each vector information family cluster can be determined, so that different vector information is represented by cluster numbers. For example, if vector information A and B belong to one vector information family cluster and vector information C and D belong to another, the cluster number of A and B may be 1 and the cluster number of C and D may be 2, and the two vector information family clusters are distinguished by 1 and 2.
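Assigning cluster numbers can be sketched as follows. This is a minimal sketch; the function name `number_clusters` is hypothetical, and numbering from 1 simply follows the example above.

```python
from typing import Dict, Hashable, Iterable


def number_clusters(clusters: Iterable[Iterable[Hashable]]) -> Dict[Hashable, int]:
    """Assign cluster numbers 1, 2, ... and map every member to its cluster's number."""
    numbering: Dict[Hashable, int] = {}
    for number, members in enumerate(clusters, start=1):
        for member in members:
            numbering[member] = number
    return numbering
```

For the example above, `number_clusters([["A", "B"], ["C", "D"]])` maps A and B to 1 and C and D to 2.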
Step S305: for each target virus, the cluster numbers corresponding to the family identification information of the target virus are combined to obtain a normalized sequence of the target virus.
In practical application, after the cluster numbers of the vector information family clusters are determined, the target viruses can be clustered based on the cluster numbers to obtain target virus family clusters. For a single target virus, its several pieces of vector information reflect different virus family information, and determining the virus family from all of that information gives a more accurate result. To achieve this, for each target virus the cluster numbers corresponding to its family identification information can be combined to obtain the normalized sequence of the target virus. The manner of combination may be determined according to actual needs; for example, the virus family identification devices may first be sorted, and the cluster numbers corresponding to the family identification information of the target virus are then combined in that order.
It should be noted that if two pieces of family identification information are similar, their vector information is clustered into the same vector information family cluster and is assigned the same cluster number. Every target virus has a normalized sequence, so when two target viruses receive similar family identification information from the same device, the values at the corresponding position of their normalized sequences are the same, which indicates that the two pieces of family identification information are similar. For ease of understanding, assume that there are two virus family identification devices and two target viruses. The first device reports family identification information C1 and C2 for the two target viruses, and the second device reports D1 and D2. Assume that only C1 and C2 are similar, so that the cluster number of both C1 and C2 is 1, the cluster number of D1 is 2, and the cluster number of D2 is 3. If the structure of the normalized sequence is CD, where C is the cluster number for the first device's family identification information and D is the cluster number for the second device's, the normalized sequences of the two target viruses are 12 and 13 respectively.
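The example with C1, C2, D1, and D2 can be reproduced with a minimal sketch. The function name, the engine identifiers `engine_c` and `engine_d`, and the use of a tuple instead of a concatenated string are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def normalized_sequence(verdicts: Dict[str, str],
                        engine_order: List[str],
                        cluster_number: Dict[str, int]) -> Tuple[int, ...]:
    """Combine the cluster numbers of one virus's verdicts in a fixed engine order."""
    return tuple(cluster_number[verdicts[engine]] for engine in engine_order)


# Cluster numbers from the example: only C1 and C2 are similar.
cluster_number = {"C1": 1, "C2": 1, "D1": 2, "D2": 3}
engine_order = ["engine_c", "engine_d"]  # hypothetical engine identifiers

virus_1 = normalized_sequence({"engine_c": "C1", "engine_d": "D1"},
                              engine_order, cluster_number)
virus_2 = normalized_sequence({"engine_c": "C2", "engine_d": "D2"},
                              engine_order, cluster_number)
```

Here `virus_1` is `(1, 2)` and `virus_2` is `(1, 3)`, corresponding to the sequences 12 and 13 in the text. A tuple is used so that multi-digit cluster numbers stay unambiguous, which plain concatenation would not guarantee.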
Step S306: clustering the plurality of target viruses based on the normalized sequences to obtain a target virus family cluster.
In practical application, the normalized sequence reflects the family identification information of all virus family identification devices for the target virus. Equal values at the same position of two normalized sequences indicate that the two corresponding pieces of family identification information are similar, and different values at the same position indicate that they are not. Clustering the target viruses directly on the normalized sequences is therefore equivalent to family clustering that takes into account both the commonalities and the distinctive characteristics of the family identification information. The resulting target virus family clusters reflect both aspects of the family identification results across target viruses, which improves the rationality and robustness of the clustering.
Step S307: a target family tag is determined for each target virus family cluster.
Referring to fig. 4, fig. 4 is a fourth flowchart of a virus family identification method according to an embodiment of the present disclosure.
The virus family identification method provided by the embodiment of the application can comprise the following steps:
Step S401: family identification information of a plurality of virus family identification devices for each target virus in a plurality of target viruses is acquired.
Step S402: the family identification information is converted into vector information.
Step S403: clustering the vector information to obtain a vector information family cluster.
Step S404: a cluster number of the vector information family cluster is determined.
Step S405: for each target virus, combining cluster numbers corresponding to the family identification information of the target virus to obtain a normalized sequence of the target virus.
Step S406: gathering the same normalized sequences together to obtain a sequence clustering result.
Step S407: determining a sequence vector of the sequence clustering result based on the vector information.
Step S408: clustering the sequence clustering result based on the sequence vector to obtain a target clustering result.
Step S409: clustering the target viruses corresponding to the target clustering result together to obtain a corresponding target virus family cluster.
In practical application, a normalized sequence is in essence a sequence of cluster numbers. If normalized sequences are clustered directly, only identical normalized sequences are grouped together; that is, only target viruses with the same family identification information are grouped, while those with similar family identification information are not. To avoid this, in the process of clustering the target viruses based on the normalized sequences to obtain the target virus family clusters, identical normalized sequences can first be gathered together to obtain sequence clustering results. A sequence vector of each sequence clustering result is then determined based on the vector information, which converts the normalized sequence into a corresponding vector. The sequence clustering results are clustered based on the sequence vectors to obtain target clustering results, so that the normalized sequences are clustered at the vector level. Finally, the target viruses corresponding to each target clustering result are clustered together to obtain the corresponding target virus family cluster.
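The first of these stages, gathering identical normalized sequences (step S406), might be sketched as follows. The function name `group_identical_sequences` and the sample data are hypothetical, and normalized sequences are represented as tuples of cluster numbers purely as an illustrative assumption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

NormSeq = Tuple[int, ...]


def group_identical_sequences(sequences: Dict[str, NormSeq]) -> Dict[NormSeq, List[str]]:
    """Gather target viruses that share exactly the same normalized sequence."""
    groups: Dict[NormSeq, List[str]] = defaultdict(list)
    for virus, sequence in sequences.items():
        groups[sequence].append(virus)
    return dict(groups)


# Hypothetical normalized sequences for three target viruses.
sequences = {"v1": (1, 2), "v2": (1, 3), "v3": (1, 2)}
groups = group_identical_sequences(sequences)
print(groups)  # {(1, 2): ['v1', 'v3'], (1, 3): ['v2']}
```

Each key of `groups` plays the role of one sequence clustering result. Exact grouping alone keeps v2 apart from v1 and v3 even though the first positions agree, which is the limitation that the later vector-level clustering is meant to address.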
In a specific application scenario, the sequence vector of a sequence clustering result may be determined from the vector information in two stages. First, a sample vector of each target virus is determined based on the vector information corresponding to that target virus, for example by taking the center of that vector information as the sample vector. Second, the sequence vector of the sequence clustering result is determined based on the sample vectors, for example by taking the center of the sample vectors corresponding to the sequence clustering result as the sequence vector.
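A minimal sketch of steps S407 to S409 under stated assumptions: the "center" is taken to be the component-wise mean, and the clustering of sequence vectors uses a simple greedy distance threshold, which is a stand-in chosen here because the text does not name a clustering algorithm. All function names, the threshold value, and the sample data are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]
NormSeq = Tuple[int, ...]


def center(vectors: Sequence[Vector]) -> Vector:
    """Component-wise mean of equally sized vectors (assumed meaning of 'center')."""
    count = len(vectors)
    return tuple(sum(column) / count for column in zip(*vectors))


def sequence_vectors(
    groups: Dict[NormSeq, List[str]], virus_vectors: Dict[str, List[Vector]]
) -> Dict[NormSeq, Vector]:
    """Sample vector = center of a virus's vectors; sequence vector = center of sample vectors."""
    result = {}
    for sequence, viruses in groups.items():
        sample_vectors = [center(virus_vectors[virus]) for virus in viruses]
        result[sequence] = center(sample_vectors)
    return result


def cluster_sequences(seq_vecs: Dict[NormSeq, Vector], threshold: float) -> List[List[NormSeq]]:
    """Greedy stand-in clustering: join the first cluster whose first member is within threshold."""
    clusters: List[List[NormSeq]] = []
    for sequence, vector in seq_vecs.items():
        for cluster in clusters:
            if math.dist(vector, seq_vecs[cluster[0]]) <= threshold:
                cluster.append(sequence)
                break
        else:
            clusters.append([sequence])
    return clusters


def to_virus_clusters(
    seq_clusters: List[List[NormSeq]], groups: Dict[NormSeq, List[str]]
) -> List[List[str]]:
    """Map each cluster of normalized sequences back to its target viruses."""
    return [[virus for seq in cluster for virus in groups[seq]] for cluster in seq_clusters]


# Hypothetical data: sequence clustering results and per-virus vector information.
groups = {(1, 2): ["v1", "v3"], (1, 3): ["v2"], (4, 5): ["v4"]}
virus_vectors = {
    "v1": [(0.0, 0.0), (1.0, 0.0)],
    "v3": [(0.0, 0.0), (1.0, 2.0)],
    "v2": [(0.0, 0.0), (1.0, 1.0)],
    "v4": [(10.0, 10.0), (12.0, 10.0)],
}

seq_vecs = sequence_vectors(groups, virus_vectors)
seq_clusters = cluster_sequences(seq_vecs, threshold=1.0)
virus_clusters = to_virus_clusters(seq_clusters, groups)
print(virus_clusters)  # [['v1', 'v3', 'v2'], ['v4']]
```

In this example the sequences (1, 2) and (1, 3) differ as cluster number sequences but have nearby sequence vectors, so their viruses end up in one target virus family cluster. Any other vector clustering method could be substituted for the threshold rule.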
Step S410: a target family tag is determined for each target virus family cluster.
In the virus family identification method provided by the embodiment of the application, in order to determine the target family tag of each target virus family cluster quickly, the family name that occurs most frequently in the target virus family cluster can be determined as the initial family tag, or the most authoritative family name in the target virus family cluster can be determined as the initial family tag. If identical initial family tags exist, distinguishing information is added to them to obtain the target family tags, and the type of distinguishing information can be determined according to actual needs. For example, if the initial family tags of two target virus family clusters are both cerber, the target family tag of one cluster can be determined as cerber_0 and that of the other as cerber_1. If no identical initial family tags exist, the initial family tag is used as the target family tag.
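The most-frequent-name variant of this tagging rule can be sketched as below. The function name `assign_family_tags` and the sample family names other than cerber are hypothetical, each cluster is assumed to be non-empty, and a numeric suffix is used as the distinguishing information to mirror the cerber_0 / cerber_1 example.

```python
from collections import Counter
from typing import List


def assign_family_tags(cluster_names: List[List[str]]) -> List[str]:
    """Pick the most frequent family name per cluster, then disambiguate repeated tags."""
    # Initial family tag: the name with the largest number of occurrences in each cluster.
    initial = [Counter(names).most_common(1)[0][0] for names in cluster_names]

    totals = Counter(initial)      # how many clusters share each initial tag
    seen: Counter = Counter()      # running index used as distinguishing information
    tags = []
    for tag in initial:
        if totals[tag] > 1:
            tags.append(f"{tag}_{seen[tag]}")
            seen[tag] += 1
        else:
            tags.append(tag)
    return tags


# Family names observed in three hypothetical target virus family clusters.
clusters = [
    ["cerber", "cerber", "locky"],
    ["cerber", "zbot", "cerber"],
    ["wannacry"],
]
tags = assign_family_tags(clusters)
print(tags)  # ['cerber_0', 'cerber_1', 'wannacry']
```

Selecting the most authoritative name instead would only change how `initial` is computed; the disambiguation step would stay the same.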
Referring to fig. 5, fig. 5 is a schematic structural diagram of a virus family identification system according to an embodiment of the present disclosure.
The virus family identification system provided by the embodiment of the application can comprise:
a family identification information obtaining module 101, configured to obtain family identification information of each target virus in the multiple target viruses by the multiple virus family identification devices;
a vector conversion module 102, configured to convert the family identification information into vector information;
the clustering module 103 is used for clustering a plurality of target viruses based on the vector information to obtain a target virus family cluster;
a family tag determination module 104 for determining a target family tag for each target virus family cluster.
In the virus family identification system provided in the embodiment of the present application, the vector conversion module may include:
the family name acquisition submodule is used for processing the family identification information to acquire an initial family name;
and the conversion unit is used for converting the initial family name into vector information.
In the virus family identification system provided in the embodiment of the present application, the family name obtaining submodule may include:
the dividing unit is used for dividing the family identification information to obtain family sub-information;
the removing unit is used for removing preset information from the family sub-information to obtain the family sub-information after removal;
and the selecting unit is used for selecting a character string from the family sub-information after removal to obtain an initial family name.
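The divide, remove, and select sequence handled by these three units can be illustrated with a short Python sketch. Everything specific in it is an assumption made for illustration: the delimiter pattern, the `PRESET_TOKENS` set, the rule of keeping the first alphabetic token of at least four characters, and the sample detection labels are not taken from the application.

```python
import re
from typing import Optional

# Hypothetical preset information: generic tokens that carry no family meaning.
PRESET_TOKENS = {"trojan", "win32", "win64", "generic", "ransom", "malware", "variant"}


def extract_initial_family_name(label: str) -> Optional[str]:
    """Divide a label into parts, drop preset tokens, and select one remaining string."""
    # Divide: split on any run of non-alphanumeric characters (assumed delimiters).
    parts = [part.lower() for part in re.split(r"[^A-Za-z0-9]+", label) if part]
    # Remove: discard tokens listed as preset information.
    kept = [part for part in parts if part not in PRESET_TOKENS]
    # Select: keep the first purely alphabetic token of at least four characters (assumed rule).
    candidates = [part for part in kept if part.isalpha() and len(part) >= 4]
    return candidates[0] if candidates else None


name_a = extract_initial_family_name("Trojan.Win32.Cerber.abc")
name_b = extract_initial_family_name("Ransom:Win32/Cerber!rfn")
name_c = extract_initial_family_name("Win32.Generic")
print(name_a, name_b, name_c)  # cerber cerber None
```

Two differently formatted labels reduce to the same initial family name here, which is what allows the later vector conversion and clustering to treat them as similar.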
In the virus family identification system provided in the embodiment of the present application, the clustering module may include:
the first clustering unit is used for clustering the vector information to obtain a vector information family cluster;
a determining unit for determining a cluster number of the vector information family cluster;
and the first clustering submodule is used for clustering the plurality of target viruses based on the cluster numbers to obtain a target virus family cluster.
In the virus family identification system provided in the embodiment of the present application, the first clustering submodule may include:
the combination unit is used for combining the cluster numbers corresponding to the family identification information of the target viruses to obtain the normalized sequence of the target viruses;
and the second clustering unit is used for clustering the target viruses based on the normalized sequence to obtain a target virus family cluster.
In the virus family identification system provided in the embodiment of the present application, the second clustering unit may be specifically configured to: gathering the same normalized sequences together to obtain a sequence clustering result; determining a sequence vector of the sequence clustering result based on the vector information; clustering the sequence clustering result based on the sequence vector to obtain a target clustering result; and clustering the target viruses corresponding to the target clustering result together to obtain a corresponding target virus family cluster.
In the virus family identification system provided in the embodiment of the present application, the second clustering unit may be specifically configured to: determining a sample vector of the target virus based on vector information corresponding to the target virus; and determining a sequence vector of the sequence clustering result based on the sample vector.
In the virus family identification system provided in the embodiment of the present application, the second clustering unit may be specifically configured to: and taking the center of the sample vector corresponding to the sequence clustering result as a sequence vector.
In the virus family identification system provided in the embodiment of the present application, the family tag determining module may include:
a tag determining unit, configured to determine a family name with the largest number of occurrences in a target virus family cluster as an initial family tag;
the processing unit is used for adding distinguishing information to the same initial family tag to obtain a target family tag if the same initial family tag exists; and if the same initial family tag does not exist, taking the initial family tag as a target family tag.
Based on the hardware implementation of the program module, and in order to implement the method according to the embodiment of the present invention, an embodiment of the present invention further provides an electronic device, fig. 6 is a schematic diagram of a hardware composition structure of the electronic device according to the embodiment of the present invention, and as shown in fig. 6, the electronic device includes:
a communication interface 1 capable of information interaction with other devices such as network devices and the like;
and a processor 2, connected to the communication interface 1 to implement information interaction with other devices, and configured to execute, when running a computer program, the virus family identification method provided by one or more of the above technical solutions. The computer program is stored in the memory 3.
In practice, of course, the various components in the electronic device are coupled together by the bus system 4. It will be appreciated that the bus system 4 is used to enable connection communication between these components. The bus system 4 comprises, in addition to a data bus, a power bus, a control bus and a status signal bus. For the sake of clarity, however, the various buses are labeled as bus system 4 in fig. 6.
The memory 3 in the embodiment of the present invention is used to store various types of data to support the operation of the electronic device. Examples of such data include: any computer program for operating on an electronic device.
It will be appreciated that the memory 3 may be volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a ferromagnetic random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 3 described in the embodiments of the present invention is intended to comprise, without being limited to, these and any other suitable types of memory.
The method disclosed by the above embodiment of the present invention can be applied to the processor 2, or implemented by the processor 2. The processor 2 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 2. The processor 2 described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 2 may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present invention. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed by the embodiment of the invention can be directly implemented by a hardware decoding processor, or can be implemented by combining hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in the memory 3, and the processor 2 reads the program in the memory 3 and in combination with its hardware performs the steps of the aforementioned method.
When the processor 2 executes the program, the corresponding processes in the methods according to the embodiments of the present invention are realized, and for brevity, are not described herein again.
In an exemplary embodiment, the present invention further provides a storage medium, i.e. a computer storage medium, in particular a computer readable storage medium, for example comprising a memory 3 storing a computer program, which is executable by a processor 2 to perform the steps of the aforementioned method. The computer readable storage medium may be Memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface Memory, optical disk, or CD-ROM.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, terminal and method may be implemented in other manners. The above-described device embodiments are only illustrative, for example, the division of the unit is only one logical function division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling an electronic device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
For a description of relevant parts in the virus family identification system, the device and the computer readable storage medium provided in the embodiments of the present application, reference is made to detailed descriptions of corresponding parts in the virus family identification method provided in the embodiments of the present application, and details are not repeated here. In addition, parts of the above technical solutions provided in the embodiments of the present application, which are consistent with the implementation principles of corresponding technical solutions in the prior art, are not described in detail so as to avoid redundant description.
It is further noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method for identifying a virus family, comprising:
acquiring family identification information of a plurality of virus family identification devices on each target virus in a plurality of target viruses;
converting the family identification information into vector information;
clustering the target viruses based on the vector information to obtain a target virus family cluster;
determining a target family tag for each of the target virus family clusters.
2. The method of claim 1, wherein converting the family identification information into vector information comprises:
processing the family identification information to obtain an initial family name;
converting the initial family name to the vector information.
3. The method of claim 2, wherein said processing the family identification information to obtain an initial family name comprises:
segmenting the family identification information to obtain family sub-information;
removing preset information from the family sub-information to obtain the family sub-information after removal;
and selecting a character string from the family sub-information after removal to obtain the initial family name.
4. The method of claim 1, wherein the clustering the target viruses based on the vector information to obtain a target virus family cluster comprises:
clustering the vector information to obtain a vector information family cluster;
determining a cluster number of the vector information family cluster;
and clustering the target viruses based on the cluster numbers to obtain the target virus family cluster.
5. The method of claim 4, wherein the clustering the plurality of target viruses based on the cluster numbers to obtain the target virus family cluster comprises:
for each target virus, combining the cluster numbers corresponding to the family identification information of the target virus to obtain a normalized sequence of the target virus;
and clustering a plurality of target viruses based on the normalized sequences to obtain the target virus family cluster.
6. The method of claim 5, wherein clustering the plurality of target viruses based on the normalized sequences to obtain the target virus family cluster comprises:
gathering the same normalized sequences together to obtain a sequence clustering result;
determining a sequence vector of the sequence clustering result based on the vector information;
clustering the sequence clustering result based on the sequence vector to obtain a target clustering result;
and clustering the target viruses corresponding to the target clustering result together to obtain the corresponding target virus family cluster.
7. The method of claim 6, wherein the determining a sequence vector of the sequence clustering result based on the vector information comprises:
determining a sample vector of the target virus based on the vector information corresponding to the target virus;
determining the sequence vector of the sequence clustering result based on the sample vector.
8. The method of claim 7, wherein the determining the sequence vector of the sequence clustering result based on the sample vector comprises:
and taking the center of the sample vector corresponding to the sequence clustering result as the sequence vector.
9. The method of any one of claims 1 to 8, wherein said determining a target family tag for each of said target virus family clusters comprises:
determining the family name with the largest occurrence number in the target virus family cluster as an initial family tag;
if the same initial family tag exists, adding distinguishing information to the same initial family tag to obtain the target family tag;
if the same initial family tag does not exist, the initial family tag is used as the target family tag.
10. A virus family identification system, comprising:
the family identification information acquisition module is used for acquiring the family identification information of a plurality of virus family identification devices on each target virus in a plurality of target viruses;
the vector conversion module is used for converting the family identification information into vector information;
the clustering module is used for clustering the target viruses based on the vector information to obtain a target virus family cluster;
a family tag determination module for determining a target family tag for each of the target virus family clusters.
11. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the virus family identification method of any one of claims 1 to 9 when executing said computer program.
12. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the virus family identification method according to any one of claims 1 to 9.
CN202111146137.4A 2021-09-28 2021-09-28 Virus family identification method, system, equipment and computer storage medium Active CN113836534B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111146137.4A CN113836534B (en) 2021-09-28 2021-09-28 Virus family identification method, system, equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111146137.4A CN113836534B (en) 2021-09-28 2021-09-28 Virus family identification method, system, equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN113836534A true CN113836534A (en) 2021-12-24
CN113836534B CN113836534B (en) 2024-04-12

Family

ID=78967192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111146137.4A Active CN113836534B (en) 2021-09-28 2021-09-28 Virus family identification method, system, equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN113836534B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1253501A2 (en) * 2001-04-29 2002-10-30 Beijing Rising Technology Corporation Limited Method and system for scanning and cleaning known and unknown computer viruses, recording medium and transmission medium therefor
US7210041B1 (en) * 2001-04-30 2007-04-24 Mcafee, Inc. System and method for identifying a macro virus family using a macro virus definitions database
GB0822619D0 (en) * 2008-12-11 2009-01-21 Scansafe Ltd Malware detection
WO2013020426A1 (en) * 2011-08-09 2013-02-14 腾讯科技(深圳)有限公司 Clustering processing method and device for virus files
CN103476788A (en) * 2010-12-10 2013-12-25 新加坡科技研究局 Immunogenic chikungunya virus peptides
CN105512555A (en) * 2014-12-12 2016-04-20 哈尔滨安天科技股份有限公司 Homologous family dividing and mutation method and system based on file string cluster
WO2016127037A1 (en) * 2015-02-06 2016-08-11 Alibaba Group Holding Limited Method and device for identifying computer virus variants
CN109190653A (en) * 2018-07-09 2019-01-11 四川大学 Malicious code family homology analysis technology based on semi-supervised Density Clustering
CN110399722A (en) * 2019-02-20 2019-11-01 腾讯科技(深圳)有限公司 A kind of virus family generation method, device, server and storage medium
CN111783088A (en) * 2020-06-03 2020-10-16 杭州迪普科技股份有限公司 Malicious code family clustering method and device and computer equipment
CN112084500A (en) * 2020-09-15 2020-12-15 腾讯科技(深圳)有限公司 Method and device for clustering virus samples, electronic equipment and storage medium
CN112287952A (en) * 2019-07-22 2021-01-29 腾讯科技(深圳)有限公司 Virus clustering method, virus clustering device, storage medium and electronic device
CN112765428A (en) * 2021-01-15 2021-05-07 济南大学 Malicious software family clustering and identifying method and system
US20210141897A1 (en) * 2019-11-11 2021-05-13 Microsoft Technology Licensing, Llc Detecting unknown malicious content in computer systems

Also Published As

Publication number Publication date
CN113836534B (en) 2024-04-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant