CN111368294B - Virus file identification method and device, storage medium and electronic device

Info

Publication number: CN111368294B
Application number: CN201811595453.8A
Authority: CN
Inventors: 沈江波; 程虎; 彭宁
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-12-25
Filing date: 2018-12-25
Publication date: 2023-02-10
Anticipated expiration: 2038-12-25
Also published as: CN111368294A

Abstract

The invention discloses a virus file identification method and device, a storage medium and an electronic device. The method comprises the following steps: acquiring sample records, wherein each sample record records the characteristic values, on a plurality of characteristic dimensions, of a program file that initiates access on an intelligent terminal; clustering the sample records to obtain a sample set, wherein the sample set stores sample records of the same type; searching the sample set for target sample records that have the same characteristic value on a target characteristic dimension, wherein the plurality of characteristic dimensions include the target characteristic dimension; and, when the similarity between the program file recorded in a target sample record and a known virus file of a target type reaches a first threshold, taking the program file recorded in the target sample record as a virus file of the target type. The invention solves the technical problem of low virus identification accuracy in the related technology.

Description

Virus file identification method and device, storage medium and electronic device

Technical Field

The invention relates to the field of Internet, in particular to a virus file identification method and device, a storage medium and an electronic device.

Background

A network virus (also called a virus) refers to a computer instruction or a program code that is programmed or inserted into a computer program, destroys computer functions or data, affects computer usage, and can replicate itself.

Antivirus software is used to eliminate computer threats such as computer viruses, Trojan horses, and other malicious software. By monitoring in real time and scanning disks, it compares data flowing through memory with the signatures in its own virus library to decide whether the data is a virus, and removes any virus found in order to protect the computer. Existing antivirus software mainly relies on virus signatures to identify malicious software, and these signatures are extracted from collected malicious samples. Consequently, when a new virus appears and no sample of it has been collected, the antivirus software cannot identify it accurately.

In view of the above problems, no effective solution has been proposed.

Disclosure of Invention

The embodiment of the invention provides a virus file identification method and device, a storage medium and an electronic device, and at least solves the technical problem of low virus identification accuracy in the related technology.

According to an aspect of the embodiments of the present invention, there is provided a method for identifying a virus file, including: acquiring a sample record, wherein the sample record records characteristic values of a program file which initiates access on an intelligent terminal on a plurality of characteristic dimensions; clustering sample records to obtain a sample set, wherein the sample set stores the same type of sample records; searching target sample records with the same characteristic value on target characteristic dimensions in a sample set, wherein the characteristic dimensions comprise the target characteristic dimensions; and in the case that the similarity between the program file recorded in the target sample record and the known virus file of the target type reaches a first threshold value, taking the program file recorded in the target sample record as a virus file of the target type.

According to another aspect of the embodiments of the present invention, there is also provided an apparatus for identifying a virus file, including: an acquisition unit, configured to acquire sample records, wherein each sample record records the characteristic values, on a plurality of characteristic dimensions, of a program file that initiates access on an intelligent terminal; a clustering unit, configured to cluster the sample records to obtain a sample set, wherein the sample set stores sample records of the same type; a searching unit, configured to search the sample set for target sample records that have the same characteristic value on a target characteristic dimension, wherein the plurality of characteristic dimensions include the target characteristic dimension; and an identification unit, configured to take the program file recorded in a target sample record as a virus file of a target type when the similarity between that program file and a known virus file of the target type reaches a first threshold.

According to another aspect of the embodiments of the present invention, there is also provided a storage medium including a stored program which, when executed, performs the above-described method.

According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the method described above through the computer program.

In the embodiment of the invention, a user behavior clustering method is introduced: the acquired sample records are clustered to obtain a sample set, and target sample records with the same characteristic value on the target characteristic dimension are searched for in the sample set. When the similarity between the program file recorded in a target sample record and a known virus file of the target type reaches a first threshold, the program file recorded in the target sample record is taken as a virus file of the target type. In other words, variants of a family, such as variants that differ in the IP (Internet protocol) addresses or domain names accessed by the sample, are identified by combining clustering with static feature matching of samples. This solves the technical problem of low virus identification accuracy in the related technology and achieves the technical effect of improving virus identification accuracy.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a schematic diagram of a hardware environment for a virus file identification method according to an embodiment of the present invention;

FIG. 2 is a flow chart of an alternative method of identifying a virus file according to an embodiment of the present invention;

FIG. 3 is a schematic representation of an alternative virus family according to embodiments of the present invention;

FIG. 4 is a flow chart of an alternative method of identifying a virus file according to an embodiment of the present invention;

FIG. 5 is a schematic illustration of an alternative virus sample record according to an embodiment of the present invention;

FIG. 6 is a schematic illustration of an alternative virus sample record according to embodiments of the present invention;

FIG. 7 is a schematic illustration of an alternative virus sample record according to an embodiment of the present invention;

FIG. 8 is a schematic illustration of an alternative virus sample record according to an embodiment of the present invention;

FIG. 9 is a schematic illustration of an alternative virus sample record according to embodiments of the present invention;

FIG. 10 is a schematic illustration of an alternative virus sample record according to embodiments of the present invention;

FIG. 11 is a schematic diagram of an alternative virus file identification apparatus according to an embodiment of the present invention; and

FIG. 12 is a block diagram of a terminal according to an embodiment of the present invention.

Detailed Description

In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

First, some of the terms appearing in the description of the embodiments of the present invention are explained as follows:

Hash: a hash algorithm transforms an input of arbitrary length (also called a pre-image) into an output of fixed length, and this output is the hash value. The transformation is a compression mapping: the space of hash values is usually much smaller than the space of inputs, so different inputs may hash to the same output, and a unique input cannot be determined from a hash value. Put simply, a hash is a function that compresses a message of arbitrary length into a message digest of some fixed length.
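The fixed-length property described above can be illustrated with a minimal sketch. SHA-256 is used here only as an assumption for illustration; the description does not name a specific hash algorithm, and `file_digest` is a hypothetical helper name.

```python
import hashlib

def file_digest(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string.

    SHA-256 is an assumed choice; the description names no algorithm.
    """
    return hashlib.sha256(data).hexdigest()

short_digest = file_digest(b"a")
long_digest = file_digest(b"a" * 1_000_000)

# Inputs of very different lengths map to digests of the same fixed length.
print(len(short_digest), len(long_digest))
```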

According to an aspect of the embodiments of the present invention, an embodiment of a method for identifying a virus file is provided.

Optionally, in this embodiment, the virus file identification method described above may be applied to a hardware environment formed by the user terminal 101 and the server 103, as shown in FIG. 1. The server 103 is connected to the terminal 101 through a network and may be configured to provide services (such as game services or application services) for the terminal or for a client installed on the terminal. The terminal 101 feeds back sample records of network behavior on the terminal to the server 103. A database 105 may be provided on the server or separately from the server to provide the server 103 with a data storage service for the sample records. The network is not limited to any particular type, and the terminal 101 is not limited to a PC, a mobile phone, a tablet computer, and the like. The method for identifying a virus file according to the embodiment of the present invention may be executed by the server 103, or by the terminal 101 through a client installed thereon.

Fig. 2 is a flowchart of an alternative virus file identification method according to an embodiment of the present invention, and as shown in fig. 2, the method may include the following steps:

Step S202, the server obtains a sample record, in which the characteristic values, on a plurality of characteristic dimensions, of a program file that initiates access on the intelligent terminal are recorded.

The sample record is a record of network access behavior of the terminal. The network access behavior may be the access behavior of a program used by a user or the access behavior of a virus program. The plurality of feature dimensions are features that describe the access behavior of the program, including but not limited to: information about the program file itself (such as the sample Hash, that is, the hash value of the file dropped by the virus or by another form of attack), the internet protocol address accessed by the program file, and the domain name accessed by the program file.
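A sample record with the three feature dimensions named above could be represented as in the following sketch. The class name, the field names, and the `user_id` field are assumptions made for illustration; the description only states which feature values are recorded.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SampleRecord:
    """One reported access by a program file (illustrative layout only)."""
    user_id: str      # hypothetical: identifies the reporting terminal or user
    sample_hash: str  # hash value of the program file
    ip: str           # internet protocol address accessed by the program file
    domain: str       # domain name accessed by the program file

# Example values are placeholders from documentation address ranges.
record = SampleRecord(
    user_id="user-1",
    sample_hash="ab12cd34",
    ip="203.0.113.7",
    domain="example.test",
)
print(record.ip)
```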

Step S204, the server clusters the sample records to obtain a sample set, wherein the sample set stores the sample records of the same type, the number of the sample sets is the same as that of the types, namely, each type corresponds to one set, and the types of the sample records in any two sets are different.

Clustering is performed using the feature values on the plurality of feature dimensions. The result of clustering is that samples lying close to one another, in the space whose dimensions are the plurality of feature dimensions, form one type (each feature dimension is analogous to one dimension of the space, as in a three-dimensional space formed by an X-axis feature, a Y-axis feature, and a Z-axis feature).

Step S206, the server searches target sample records with the same characteristic value in the target characteristic dimension in the sample set, and the plurality of characteristic dimensions comprise the target characteristic dimension.

The target characteristic dimension is at least one of the plurality of characteristic dimensions, such as the internet protocol address accessed by the program file or the domain name accessed by the program file. Because virus programs tend to spread widely, the number of sample records caused by a virus program should be large. Searching the sample set for target sample records with the same characteristic value on the target characteristic dimension therefore amounts to screening out the sample records that occur in large numbers. Filtering by count improves identification efficiency and at the same time rejects some sample records that belong to the user's own access behavior.

In step S208, when the similarity between the program file recorded in the target sample record and the known virus file of the target type reaches the first threshold, the server takes the program file recorded in the target sample record as a virus file of the target type. Here the program file recorded in the target sample record differs from the known virus file but is highly similar to it; in other words, it is a new variant of the known virus file. The first threshold is a predetermined constant, such as 90% or 80%.

The above embodiment takes as an example the case in which the method for identifying a virus file is executed by the server 103, but the method of the present application may also be executed jointly by the server 103 and the terminal 101. In that case, the system may comprise a client on the terminal and a cloud server. The client collects sample behavior data (that is, sample records). The server receives the data reported by the client and establishes a user behavior database and a behavior relation database. The server can then perform clustering and sample similarity calculation on the reported data to determine whether a new variant has appeared.

Through the above steps, a user behavior clustering method is introduced: the acquired sample records are clustered to obtain a sample set, and target sample records with the same characteristic value on the target characteristic dimension are searched for in the sample set. When the similarity between the program file recorded in a target sample record and a known virus file of the target type reaches a first threshold, the program file recorded in the target sample record is taken as a virus file of the target type. In other words, variants of a family, such as variants that differ in the IP (Internet protocol) addresses or domain names accessed by the sample, are identified by combining clustering with static feature matching of samples. This solves the technical problem of low virus identification accuracy in the related technology and achieves the technical effect of improving virus identification accuracy.

In the related art, there is no scheme that can effectively identify Hash-level variants of known family samples, and the identification of new IPs, domain names, and other variants appearing in a family is not addressed at all. In practice, however, there is a need to effectively identify newly appearing IPs, domain names, and similar information within a family.

In the technical solution of the present application, a known family database may be established in advance according to requirements. The known family database includes n families to be monitored (each family may be regarded as a set of viruses of one type, such as the known virus files of a target type). As shown in FIG. 3, each family may include a plurality of subsets, such as a sample Hash set (whose elements are sample Hashes, that is, hash values of program files), a sample access IP set (whose elements are internet protocol addresses accessed by program files), and a sample access domain name set (whose elements are domain names accessed by program files). Each subset contains all known sample Hashes, IPs, or domain names that have historically appeared in the known family. The technical solution of the present application is further detailed below with reference to the steps shown in FIG. 2.
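The known family database described above, with its three subsets per family, might be modelled as in this sketch. The class name `KnownFamily`, the `is_known` helper, and the dictionary layout are illustrative assumptions rather than the structure used in the patent.

```python
from dataclasses import dataclass, field

@dataclass
class KnownFamily:
    """One monitored family with its three historical subsets."""
    name: str
    sample_hashes: set[str] = field(default_factory=set)
    ips: set[str] = field(default_factory=set)
    domains: set[str] = field(default_factory=set)

    def is_known(self, dimension: str, value: str) -> bool:
        """Check whether a value has already appeared in this family."""
        subsets = {
            "hash": self.sample_hashes,
            "ip": self.ips,
            "domain": self.domains,
        }
        return value in subsets[dimension]

# Hypothetical database holding a single family with placeholder values.
family_db = {
    "family-a": KnownFamily(
        name="family-a",
        sample_hashes={"ab12cd34"},
        ips={"203.0.113.7"},
        domains={"bad.example"},
    )
}
print(family_db["family-a"].is_known("ip", "203.0.113.7"))
```

A value that clusters strongly among infected users but is not yet in any subset is a candidate new variant, which is what the later steps evaluate.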

In the technical solution provided in step S202, the server obtains a sample record, in which the feature values, on multiple feature dimensions, of a program file that initiates access on the intelligent terminal are recorded.

As shown in FIG. 4, the technical solution of the present application may be integrated in the underlying system of a terminal or in a client. After a virus or an attack on the terminal starts to execute, the active defense system of the client captures the virus or attack behavior data, that is, it collects sample behavior data (also called sample records), and then reports this data to the server. The server receives the data reported by the client and establishes a user behavior database and a behavior relation database, and then performs clustering and sample similarity calculation on the reported data to determine whether a new variant has appeared.

In the technical solution provided in step S204, the server clusters the sample records to obtain a sample set, and the sample set stores the same type of sample records.

Optionally, the obtaining of the sample set by clustering the sample records includes: clustering the sample records according to the spacing distance between the sample records determined by the characteristic values on the characteristic dimensions recorded in the sample records to obtain a plurality of sample sets, wherein the spacing distance between any two sample records in one sample set is smaller than the spacing distance between the sample record in one sample set and the sample record in the other sample set.

In the above embodiment, clustering the sample records according to the separation distance between the sample records determined by the feature values on the multiple feature dimensions recorded in the sample records, so as to obtain multiple sample sets, may be implemented by the following technical solutions shown in steps 1 to 4:

Step 1: m center points are randomly selected. For example, m first sample records are randomly selected from the sample records as target sample records, and a candidate sample set is created for each target sample record. Each of the plurality of sample records is then clustered into the candidate sample set of the target sample record at the smallest separation distance from it; that is, all sample records are traversed and each is assigned to the set of its nearest center point.

Step 2: a sample record is reselected from each candidate sample set as the target sample record, which corresponds to calculating the average value of each cluster to serve as the new center point. The average separation distance between the reselected sample record and the other sample records in the candidate sample set is smaller than the average separation distance between any second sample record and the other sample records in the candidate sample set, where a second sample record is any sample record in the candidate sample set other than the reselected one.

Each sample record described above can be regarded as a point in a multidimensional space, with each characteristic dimension analogous to an axis of that space. For example, if a sample record includes the hash value of the program file (corresponding to the value X on the X axis of a three-dimensional space), the internet protocol address accessed by the program file (corresponding to the value Y on the Y axis; the internet protocol address can be represented by a numerical code), and the domain name accessed by the program file (corresponding to the value Z on the Z axis; the domain name can be represented by a numerical code), then the distance d between any two sample records (X1, Y1, Z1) and (X2, Y2, Z2) can be represented, for example as the Euclidean distance, as

$$d = \sqrt{(X_1 - X_2)^2 + (Y_1 - Y_2)^2 + (Z_1 - Z_2)^2}$$

For any sample record, the separation distance d between it and each of the other sample records in the set can be calculated, and from these the average separation distance of that sample record can be obtained.

Step 3: when the target sample records selected after clustering differ from the target sample records before clustering, the steps of creating a candidate sample set for each newly selected target sample record and clustering each sample record into the candidate sample set of the target sample record at the smallest separation distance from it are executed again, until the target sample records before clustering are the same as those selected after clustering. That is, assignment and reselection are repeated until the m center points no longer change (convergence) or enough iterations have been executed.

Step 4: when the target sample records before clustering are the same as the target sample records selected after clustering, the plurality of candidate sample sets are taken as the plurality of sample sets.

The clustering process is illustrated in FIGS. 5 to 10. Each point to be clustered in FIG. 5 can be regarded as a sample record. The points marked with the symbol "x" in FIG. 6 are the first sample records selected in step 1. Steps 2 and 3 amount to correcting the selected center points, and FIGS. 7 to 10 show this correction process, in which each center point moves steadily closer to the position where it should actually lie.
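Steps 1 to 4 can be sketched as follows. This is a minimal illustration that follows the reading in which each new center is an existing sample record (the one with the smallest average distance to the rest of its candidate set); it assumes the hash, IP, and domain name have already been encoded as numbers, uses Euclidean distance, and adds a hypothetical `initial_centers` parameter so the run is reproducible.

```python
import math
import random

def distance(a, b):
    """Euclidean distance between two numerically encoded sample records."""
    return math.dist(a, b)

def mean_distance(i, members, points):
    """Average distance from record i to the other records in its candidate set."""
    others = [j for j in members if j != i]
    if not others:
        return 0.0
    return sum(distance(points[i], points[j]) for j in others) / len(others)

def cluster(points, m, initial_centers=None, max_iter=100, seed=0):
    """Group points into m sample sets; returns a list of lists of points."""
    # Step 1: pick m records as the initial target sample records (centers).
    if initial_centers is None:
        centers = random.Random(seed).sample(range(len(points)), m)
    else:
        centers = list(initial_centers)
    groups = {}
    for _ in range(max_iter):
        # Assign every record to the candidate set of its nearest center.
        groups = {c: [] for c in centers}
        for i, p in enumerate(points):
            if i in groups:
                groups[i].append(i)  # a center always stays in its own set
            else:
                nearest = min(centers, key=lambda c: distance(p, points[c]))
                groups[nearest].append(i)
        # Step 2: reselect, per set, the record with the smallest average distance.
        new_centers = [
            min(members, key=lambda i: mean_distance(i, members, points))
            for members in groups.values()
        ]
        # Step 4: stop once the centers no longer change.
        if set(new_centers) == set(centers):
            break
        # Step 3: otherwise repeat with the newly selected centers.
        centers = new_centers
    return [[points[i] for i in members] for members in groups.values()]

points = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (10, 10, 10), (10, 11, 10), (11, 10, 10)]
sample_sets = cluster(points, m=2, initial_centers=[0, 1])
print(sample_sets)
```

As with any procedure of this kind, the result can depend on the initial centers, so a poor random start may converge to a less natural grouping.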

In the technical solution provided in step S206, the server searches for target sample records with the same feature value in a target feature dimension in the sample set, where the plurality of feature dimensions include the target feature dimension.

Optionally, the feature values of the multiple feature dimensions include hash values of the program files, internet protocol addresses accessed by the program files, and domain names accessed by the program files, and searching for target sample records with the same feature value in the target feature dimension in the sample set includes at least one of: searching target sample records with the same hash value of the program file in the sample set; searching target sample records with the same Internet protocol address accessed by the program file in the sample set; and searching the target sample record with the same domain name accessed by the program file in the sample set.
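The lookup in step S206 can be sketched as a grouping of the records in one sample set by their value on a chosen dimension. The function name, the dictionary-based record layout, and the rule of keeping only values shared by at least two records are assumptions for illustration.

```python
from collections import defaultdict

def find_target_records(sample_set, dimension):
    """Group records by their value on one dimension.

    Only values shared by at least two records are kept, since the step
    looks for records that have the same value on that dimension.
    """
    groups = defaultdict(list)
    for record in sample_set:
        groups[record[dimension]].append(record)
    return {value: recs for value, recs in groups.items() if len(recs) > 1}

# Placeholder records; keys mirror the three dimensions named in the text.
sample_set = [
    {"hash": "h1", "ip": "203.0.113.7", "domain": "a.example"},
    {"hash": "h2", "ip": "203.0.113.7", "domain": "b.example"},
    {"hash": "h3", "ip": "198.51.100.9", "domain": "b.example"},
]
same_ip = find_target_records(sample_set, "ip")
print(sorted(same_ip))
```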

In the technical solution provided in step S208, when the similarity between the program file recorded in the target sample record and the known virus file of the target type reaches the first threshold, the server regards the program file recorded in the target sample record as a virus file of the target type.

Optionally, let the number of users in the user group G be m, and cluster the behavior data of the user group G. Suppose the following result is obtained: x users share the same sample Hash, y users all access the same IP, and z users all access the same domain name. The server can then determine the similarity between the program file recorded in the target sample record and the known virus file of the target type as follows:

Step 1: the hash value of the program file recorded in the target sample record is obtained, the characteristic values of the plurality of characteristic dimensions including the hash value of the program file.

In an optional embodiment, obtaining the hash value of the program file recorded in the target sample record comprises: when the target sample records are sample records with the same accessed internet protocol address, acquiring a first ratio y/m between the number y of target sample records and the number m of sample records in the sample set, and a second ratio y/p between the number y of target sample records and the number of accesses to that same internet protocol address (that is, the total number p of users across the whole network who access the IP); and, when the first ratio is greater than the second threshold T2 and the second ratio is greater than the third threshold T3, obtaining the hash value of the program file recorded in the target sample record.

In yet another optional embodiment, obtaining the hash value of the program file recorded in the target sample record comprises: when the target sample records are sample records with the same accessed domain name, acquiring a first ratio between the number of target sample records and the number of sample records in the sample set, and a third ratio between the number of target sample records and the number of accesses to that same domain name (that is, the total number q of users across the whole network who access the domain name); and, when the first ratio is greater than the second threshold and the third ratio is greater than the third threshold, obtaining the hash value of the program file recorded in the target sample record.

Step 2: the similarity between the hash value of the program file recorded in the target sample record and the hash value of the known virus file of the target type is calculated.

The hash value of the program file recorded in the target sample record (denoted hash value 1) and the hash value of the known virus file of the target type (denoted hash value 2) are obtained with the same hash algorithm, so the two have the same number of bits. When the similarity of the two hash values is calculated, the value of each bit in hash value 1 can be compared with the value at the same position in hash value 2; the ratio of the number of bits whose values match to the total number of bits in hash value 1 is the similarity.
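The bit-by-bit comparison described above can be sketched as follows. The function name and the use of `bytes` inputs are assumptions. Note that with a cryptographic hash, a small change in a file changes about half of the output bits, so this measure is only informative if the hash algorithm preserves similarity; the description does not say which algorithm is used.

```python
def bit_similarity(hash1: bytes, hash2: bytes) -> float:
    """Fraction of bit positions at which two equal-length hashes agree."""
    if len(hash1) != len(hash2) or not hash1:
        raise ValueError("hashes must be non-empty and of equal length")
    total_bits = len(hash1) * 8
    # XOR leaves a 1 at every position where the two hashes differ.
    differing = sum(bin(a ^ b).count("1") for a, b in zip(hash1, hash2))
    return (total_bits - differing) / total_bits

similarity = bit_similarity(bytes.fromhex("ff00"), bytes.fromhex("ff01"))
FIRST_THRESHOLD = 0.9  # example value; the text mentions 90% and 80%
print(similarity, similarity >= FIRST_THRESHOLD)
```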

With the above technical solution, new variants of a known family can be effectively detected, namely new sample Hashes of the family virus and new IPs, domain names, and the like accessed by family samples. The variants include but are not limited to sample Hashes, IPs, and domain names.
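The two ratio checks that gate the hash lookup in step 1 above can be sketched as follows. The function and parameter names and the threshold values are hypothetical; the description only states that both ratios must exceed their respective thresholds.

```python
def passes_ratio_thresholds(
    shared_count: int,        # y: users in the group sharing the IP or domain
    group_size: int,          # m: number of users (sample records) in the group
    network_total: int,       # p or q: users across the whole network accessing it
    group_threshold: float,   # threshold for the within-group ratio
    network_threshold: float, # threshold for the whole-network ratio
) -> bool:
    """Return True when both ratios exceed their thresholds."""
    if group_size <= 0 or network_total <= 0:
        return False
    first_ratio = shared_count / group_size
    second_ratio = shared_count / network_total
    return first_ratio > group_threshold and second_ratio > network_threshold

# An IP reached by 80 of 100 infected users and by only 90 users network-wide
# is concentrated in the infected group; a popular IP is not.
print(passes_ratio_thresholds(80, 100, 90, 0.5, 0.5))
print(passes_ratio_thresholds(80, 100, 1000, 0.5, 0.5))
```

The second ratio is what filters out widely used addresses: many infected users may reach a popular service, but they make up only a small share of that service's total users.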

As an alternative embodiment, the technical solution of the present application is further detailed below with reference to the embodiment shown in fig. 4:

With the user's knowledge, the client collects sample behavior data and reports it to the server, where the sample behavior includes sample Hash queries, IPs accessed by samples, domain names accessed by samples, and the like. The server establishes two databases from the reported data:

a user behavior database, which records the sample Hashes queried by a given user, the IPs accessed by the user, and the domain names accessed by the user;

a behavior relation database, which records the IPs accessed by each sample Hash, the domain names accessed by each sample Hash, and the like.
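The two databases listed above could be built from client reports roughly as in this sketch. In-memory dictionaries stand in for real storage, and the report fields (`user_id`, `sample_hash`, `ip`, `domain`) and function names are assumptions for illustration.

```python
from collections import defaultdict

def new_databases():
    """Create empty in-memory stand-ins for the two databases."""
    user_behavior = defaultdict(lambda: {"hash": set(), "ip": set(), "domain": set()})
    behavior_relation = defaultdict(lambda: {"ip": set(), "domain": set()})
    return user_behavior, behavior_relation

def ingest(report, user_behavior, behavior_relation):
    """Record one client report in both databases."""
    user = user_behavior[report["user_id"]]
    user["hash"].add(report["sample_hash"])
    user["ip"].add(report["ip"])
    user["domain"].add(report["domain"])
    relation = behavior_relation[report["sample_hash"]]
    relation["ip"].add(report["ip"])
    relation["domain"].add(report["domain"])

def hashes_accessing(behavior_relation, dimension, value):
    """Return the sample Hashes that accessed a given IP or domain name."""
    return {h for h, rel in behavior_relation.items() if value in rel[dimension]}

users, relations = new_databases()
ingest({"user_id": "u1", "sample_hash": "h1", "ip": "203.0.113.7", "domain": "a.example"}, users, relations)
ingest({"user_id": "u2", "sample_hash": "h2", "ip": "203.0.113.7", "domain": "b.example"}, users, relations)
print(sorted(hashes_accessing(relations, "ip", "203.0.113.7")))
```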

Step S401, the server reads the set data of a family (sample Hashes, IPs, and so on); specifically, the known sample Hash, IP, and domain name data of the known families to be monitored can be taken one family at a time.

Step S402, according to this data and the user behavior database, the user group G infected with the family virus is marked, and the number of users in the user group G is set to m.

Step S403, the behavior data (including sample Hash, IP, and domain name) of the user group G is clustered. Suppose the following result is obtained: x users share sample Hash 11111, y users all access IP 22222, and z users all access domain name 33333.

Step S404, whether each clustering result (sample Hash, IP, domain name) exceeds its thresholds is judged one by one; if so, step S405 is executed, otherwise the flow returns to step S401. Taking IP as an example, if y/m > threshold T1 (i.e., the second threshold) and y / (total number of users across the whole network who access the IP) > threshold T2 (i.e., the third threshold), the IP is considered to be a variant of the known family.

Step S405, determining whether the clustering result is an IP or domain name, if so, performing step S406, otherwise (i.e., the clustering result is a sample Hash), performing step S408.

Step S406, if the clustering result is an IP or a domain name, the behavior relation database is queried for sample Hashes that access the IP or the domain name.

Step S407, determining whether there is a sample for accessing the IP or the domain name, if so, performing step S408, otherwise, performing step S409.

Step S408, the similarity between the input sample Hash and the known sample Hashes of the family is calculated (for example, using static features); if they are similar, the clustering result is highly suspicious as a variant of the known family.

Step S409, the result is classified as a suspicious variant or a highly suspicious variant, and it can be put into operation after manual confirmation.
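The decision flow of steps S404 to S409 can be sketched as a single function. The function name, the parameters, and the returned labels are illustrative assumptions. In particular, the description does not state the outcome when a sample is found but is not similar to the family, so returning "suspicious" in that case is an assumption.

```python
from typing import Callable, Iterable, Optional

def classify_cluster_result(
    kind: str,                                # "hash", "ip", or "domain"
    exceeds_thresholds: bool,                 # outcome of the S404 ratio checks
    candidate_hashes: Iterable[str],          # the clustered Hash, or Hashes found in S406
    known_hashes: Iterable[str],              # known sample Hashes of the family
    similarity: Callable[[str, str], float],  # similarity measure used in S408
    first_threshold: float = 0.9,
) -> Optional[str]:
    """Return None, "suspicious", or "highly suspicious" for one clustering result."""
    # S404: results below the thresholds are dropped (flow returns to S401).
    if not exceeds_thresholds:
        return None
    candidates = list(candidate_hashes)
    # S405 to S407: an IP or domain with no associated sample goes to S409.
    if kind in ("ip", "domain") and not candidates:
        return "suspicious"
    # S408: compare each candidate Hash with the family's known Hashes.
    known = list(known_hashes)
    for candidate in candidates:
        if any(similarity(candidate, k) >= first_threshold for k in known):
            return "highly suspicious"
    # Assumption: a dissimilar sample is still reported for manual review.
    return "suspicious"

# Toy similarity for demonstration only: Hashes match if their prefixes agree.
def prefix_match(a: str, b: str) -> float:
    return 1.0 if a[:4] == b[:4] else 0.0

print(classify_cluster_result("ip", True, ["abcd9999"], ["abcd0000"], prefix_match))
```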

By the above technical means, variants of known sample families can be effectively monitored, in particular the IPs and domain names of the variants, and interception policies can be issued in a timely manner, which benefits client users and enterprise IT administrators.

It should be noted that, for simplicity of description, the above method embodiments are presented as a series of combined acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the invention. Further, those skilled in the art will appreciate that the embodiments described in this specification are preferred embodiments, and that the acts and modules involved are not necessarily required by the invention.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

According to another aspect of the embodiment of the present invention, there is also provided a virus file identification apparatus for implementing the virus file identification method. Fig. 11 is a schematic diagram of an alternative virus file identification apparatus according to an embodiment of the present invention, and as shown in fig. 11, the apparatus may include: an acquiring unit 1101, a clustering unit 1103, a searching unit 1105, and an identifying unit 1107.

The acquiring unit 1101 is configured to acquire a sample record, where feature values of a program file, which has initiated access on an intelligent terminal, on a plurality of feature dimensions are recorded in the sample record;

the sample record is a record of the network access behavior of the terminal; the network access behavior may be the access behavior of a program used by a user or the access behavior of a virus program. The plurality of feature dimensions are features describing the access behavior of the program, including but not limited to: information about the program file itself (such as the sample Hash, i.e., the file hash value of a virus or other form of attack), the internet protocol address accessed by the program file, and the domain name accessed by the program file.
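
As an illustration of the three feature dimensions named above, a sample record could be modeled as a small immutable structure. The class name `SampleRecord`, its field names and the example values are assumptions made for this sketch; the embodiment does not prescribe a storage format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleRecord:
    sample_hash: str  # hash value of the program file that initiated the access
    ip: str           # internet protocol address accessed by the program file
    domain: str       # domain name accessed by the program file


# Hypothetical example values (documentation-reserved address and domain).
record = SampleRecord(sample_hash="0f3a9c", ip="203.0.113.7", domain="example.com")
```

Making the record immutable and hashable lets identical records be deduplicated or used as dictionary keys during clustering.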

A clustering unit 1103, configured to obtain a sample set by clustering sample records, where the sample set stores sample records of the same type;

A searching unit 1105, configured to search a sample set for target sample records with the same feature value in a target feature dimension, where a plurality of feature dimensions include the target feature dimension;

the target feature dimension is at least one of the plurality of feature dimensions, such as the internet protocol address accessed by the program file or the domain name accessed by the program file. Because a virus program typically spreads widely, the number of sample records it produces should be large. Finding target sample records with the same feature value in the target feature dimension in the sample set therefore amounts to screening out the values shared by a large number of sample records; filtering by count improves identification efficiency and also rejects sample records that reflect the access behavior of isolated individual users.
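
A minimal sketch of this lookup, assuming each sample record is a dictionary keyed by dimension name: the records of one sample set are grouped by their value in the target dimension, and only values shared by more than one record are kept. The function name `find_target_records`, the dictionary layout and the "more than one record" cutoff are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def find_target_records(sample_set: List[dict], dimension: str) -> Dict[str, List[dict]]:
    """Group records by their value on `dimension`; keep values shared by several records."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for rec in sample_set:
        groups[rec[dimension]].append(rec)
    # Assumption: a value counts as "the same feature value" once at least two records share it.
    return {value: recs for value, recs in groups.items() if len(recs) > 1}


sample_set = [
    {"hash": "h1", "ip": "203.0.113.7", "domain": "a.example"},
    {"hash": "h2", "ip": "203.0.113.7", "domain": "b.example"},
    {"hash": "h3", "ip": "198.51.100.9", "domain": "c.example"},
]
targets = find_target_records(sample_set, "ip")  # only 203.0.113.7 is shared
```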

The identifying unit 1107 is configured to, in a case that the similarity between the program file recorded in the target sample record and the known virus file of the target type reaches a first threshold, regard the program file recorded in the target sample record as a virus file of the target type. The program file recorded in the target sample record is different from the known virus file, but the similarity between the two is high. The first threshold is a predetermined constant, such as 90% or 80%.

It should be noted that the acquiring unit 1101 in this embodiment may be configured to execute step S202 in this embodiment, the clustering unit 1103 in this embodiment may be configured to execute step S204 in this embodiment, the searching unit 1105 in this embodiment may be configured to execute step S206 in this embodiment, and the identifying unit 1107 in this embodiment may be configured to execute step S208 in this embodiment.

It should be noted that the examples and application scenarios implemented by the above modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the foregoing embodiments. The above modules, as part of the apparatus, may operate in a hardware environment as shown in fig. 1, and may be implemented by software or by hardware.

Through the above modules, a user behavior clustering scheme is introduced: the acquired sample records are clustered to obtain a sample set, and target sample records with the same feature value in the target feature dimension are searched for in the sample set; in a case that the similarity between the program file recorded in the target sample record and the known virus file of the target type reaches a first threshold, the program file recorded in the target sample record is taken as a virus file of the target type. That is, family variants, such as variants that differ in the IP or domain name accessed by the sample, are identified in combination with a sample static feature matching technique, so that the technical problem of low virus identification accuracy in the related art can be solved and the technical effect of improving virus identification accuracy is achieved.

In the related art, there is no scheme that can effectively identify Hash-level variants of known family samples, and the identification of variants such as new IPs and domain names appearing in a family is not addressed; in practice, however, there is a need to effectively identify information such as new IPs and domain names appearing in a family.

In the technical solution of the present application, a known family database may be established in advance according to a requirement definition, where the known family database includes n families to be monitored (each family may be regarded as a set of viruses of one type, such as a known virus file of a target type), and as shown in fig. 3, each family may include a plurality of subsets, such as a sample Hash set (including elements of a sample Hash, that is, a Hash value of a program file), an IP set for sample access (including elements of an internet protocol address IP for program file access), and a sample access domain name set (including elements of a domain name for program file access), where each subset includes all known sample hashes, IPs, and domain names that have historically appeared in the known family.
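
The known family database described above can be pictured as a mapping from family name to three sets. The sketch below is only one possible layout: the names `Family`, `known_families` and `family_of_ip`, and the example entries, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Family:
    hashes: Set[str] = field(default_factory=set)   # sample Hash set
    ips: Set[str] = field(default_factory=set)      # IP set for sample access
    domains: Set[str] = field(default_factory=set)  # sample access domain name set


# Hypothetical database with a single monitored family.
known_families: Dict[str, Family] = {
    "family_a": Family(hashes={"h1", "h2"}, ips={"203.0.113.7"}, domains={"bad.example"}),
}


def family_of_ip(ip: str) -> Optional[str]:
    """Return the name of the first family whose IP set contains `ip`, if any."""
    for name, fam in known_families.items():
        if ip in fam.ips:
            return name
    return None
```

Keeping each subset as a set makes the membership tests on historical hashes, IPs and domain names constant-time on average.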

Optionally, the clustering unit may be further configured to: clustering the sample records according to the spacing distance between the sample records determined by the characteristic values on the characteristic dimensions recorded in the sample records to obtain a plurality of sample sets, wherein the spacing distance between any two sample records in one sample set is smaller than the spacing distance between the sample record in one sample set and the sample record in the other sample set.

Alternatively, the clustering unit may include: a clustering module, configured to select first sample records among the plurality of sample records as target sample records, create a candidate sample set for each target sample record, and cluster each sample record of the plurality of sample records into the candidate sample set of the target sample record with the smallest spacing distance to that sample record; a selecting module, configured to reselect one sample record from each candidate sample set as the target sample record, where the average spacing distance between the reselected sample record and the other sample records in the candidate sample set is smaller than the average spacing distance between a second sample record and the sample records in the candidate sample set other than the second sample record, the second sample record being any sample record in the candidate sample set other than the reselected sample record; a re-clustering module, configured to, when the target sample records before clustering differ from the target sample records selected after clustering, again perform the steps of creating a candidate sample set for each reselected target sample record and clustering each sample record of the plurality of sample records into the candidate sample set of the target sample record with the smallest spacing distance to that sample record, until the target sample records before clustering are the same as the target sample records selected after clustering; and a determining module, configured to take the candidate sample sets as the plurality of sample sets when the target sample records before clustering are the same as the target sample records selected after clustering.
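
The iteration performed by these modules resembles k-medoids clustering, with the target sample record of each candidate set playing the role of the medoid. The sketch below illustrates that reading under stated assumptions: the spacing distance is taken to be the number of feature dimensions on which two records differ, the first k distinct records serve as the initial target records, and the names `distance` and `cluster` are hypothetical. The embodiment does not fix any of these choices.

```python
from typing import Dict, List, Sequence, Tuple

Record = Tuple[str, str, str]  # (sample hash, IP, domain name)


def distance(a: Record, b: Record) -> int:
    # Assumption: spacing distance = number of feature dimensions that differ.
    return sum(x != y for x, y in zip(a, b))


def cluster(records: Sequence[Record], k: int, max_iter: int = 100) -> List[List[Record]]:
    # Assumption: the first k distinct records are the initial target records.
    targets = list(dict.fromkeys(records))[:k]
    sets: Dict[Record, List[Record]] = {}
    for _ in range(max_iter):
        # Clustering module: put each record in the candidate set of its nearest target record.
        sets = {t: [] for t in targets}
        for r in records:
            nearest = min(targets, key=lambda t: distance(r, t))
            sets[nearest].append(r)
        # Selecting module: in each candidate set, pick the record whose total
        # (hence average) distance to the other members is smallest.
        new_targets = [
            min(members, key=lambda c: sum(distance(c, o) for o in members))
            for members in sets.values() if members
        ]
        # Determining module: stop once the target records no longer change.
        if new_targets == targets:
            break
        targets = new_targets  # re-clustering module: repeat with the new target records
    return [members for members in sets.values() if members]


r1, r2, r3 = ("h1", "ip1", "d1"), ("h2", "ip1", "d1"), ("h3", "ip1", "d1")
r4, r5 = ("h4", "ip2", "d2"), ("h5", "ip2", "d2")
clusters = cluster([r1, r4, r2, r3, r5], k=2)
```

As with k-medoids in general, the outcome depends on the initial target records, so a practical system would likely need a more careful initialization than taking the first k records.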

Optionally, the feature values of the plurality of feature dimensions include a hash value of the program file, an internet protocol address accessed by the program file, and a domain name accessed by the program file, and the searching unit is further configured to perform at least one of: searching target sample records with the same hash value of the program file in the sample set; searching target sample records with the same Internet protocol address accessed by the program file in the sample set; and searching the target sample record with the same domain name accessed by the program file in the sample set.

Optionally, the identifying unit may be further configured to determine a similarity between the program file recorded by the target sample record and the known virus file of the target type as follows: obtaining hash values of program files recorded in a target sample record, wherein the characteristic values of the multiple characteristic dimensions comprise the hash values of the program files; and calculating the similarity between the hash value of the program file recorded in the target sample record and the hash value of the known virus file of the target type.
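
The comparison against the first threshold might look like the sketch below. The text does not specify how similarity between hash values is measured, so the sketch assumes the hash values are similarity-preserving digest strings and uses the character-matching ratio of Python's `difflib.SequenceMatcher` purely as a stand-in measure. The function names and the default threshold of 0.9 (one of the example values given for the first threshold) are assumptions.

```python
from difflib import SequenceMatcher
from typing import Iterable


def similarity(hash_a: str, hash_b: str) -> float:
    # Stand-in measure: ratio of matching characters between two digest strings (0.0 to 1.0).
    return SequenceMatcher(None, hash_a, hash_b).ratio()


def is_family_variant(sample_hash: str, family_hashes: Iterable[str],
                      first_threshold: float = 0.9) -> bool:
    """True if the sample reaches the first threshold against any known hash of the family."""
    return any(similarity(sample_hash, known) >= first_threshold for known in family_hashes)
```

An ordinary cryptographic hash would not behave usefully under such a measure, since near-identical files produce unrelated digests; this is why the sketch assumes a similarity-preserving digest.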

Alternatively, the identifying unit may obtain the hash value of the program file recorded in the target sample record as follows: in a case that the target sample records are sample records with the same accessed internet protocol address, acquiring a first ratio between the number of the target sample records and the number of sample records in the sample set, and a second ratio between the number of the target sample records and the number of accesses whose accessed internet protocol address is the same as that of the target sample records; and in a case that the first ratio is greater than the second threshold and the second ratio is greater than the third threshold, obtaining the hash value of the program file recorded in the target sample record.

Alternatively, the identifying unit may obtain the hash value of the program file recorded in the target sample record as follows: in a case that the target sample records are sample records with the same accessed domain name, acquiring a first ratio between the number of the target sample records and the number of sample records in the sample set, and a third ratio between the number of the target sample records and the number of accesses whose accessed domain name is the same as that of the target sample records; and in a case that the first ratio is greater than the second threshold and the third ratio is greater than the third threshold, obtaining the hash value of the program file recorded in the target sample record.
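
The two-ratio test applied before the hash value is obtained can be sketched as a single predicate. The function name `passes_ratio_check`, its parameter names and the threshold values used in the example are hypothetical; the same check applies whether the shared feature is an internet protocol address or a domain name.

```python
def passes_ratio_check(target_count: int, set_size: int, total_accesses: int,
                       second_threshold: float, third_threshold: float) -> bool:
    """Apply the first-ratio and second/third-ratio tests described above.

    target_count   -- number of target sample records sharing the IP or domain name
    set_size       -- number of sample records in the sample set
    total_accesses -- number of accesses to that same IP or domain name overall
    """
    if set_size <= 0 or total_accesses <= 0:
        return False  # guard against division by zero on empty inputs
    first_ratio = target_count / set_size
    other_ratio = target_count / total_accesses
    return first_ratio > second_threshold and other_ratio > third_threshold


# Hypothetical numbers: 80 of the 100 records in the set share the IP, which was accessed 100 times.
ok = passes_ratio_check(80, 100, 100, second_threshold=0.5, third_threshold=0.5)
```

The second test keeps widely used addresses from being flagged: an IP that dominates one sample set but accounts for a tiny share of all accesses to it fails the check.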

By the above technical means, variants of a known sample family, particularly variant IPs and domain names, can be effectively monitored, and interception strategies can be issued in time to clients and enterprise IT managers.

It should be noted that the examples and application scenarios implemented by the above modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the foregoing embodiments. The above modules, as part of the apparatus, may operate in a hardware environment as shown in fig. 1, and may be implemented by software or by hardware, where the hardware environment includes a network environment.

According to another aspect of the embodiment of the present invention, a server or a terminal for implementing the virus file identification method is also provided.

Fig. 12 is a block diagram of a terminal according to an embodiment of the present invention, and as shown in fig. 12, the terminal may include: one or more processors 1201 (only one of which is shown in fig. 12), a memory 1203, and a transmission device 1205. As shown in fig. 12, the terminal may also include an input-output device 1207.

The memory 1203 may be used to store software programs and modules, such as program instructions/modules corresponding to the method and apparatus for identifying a virus file in the embodiment of the present invention, and the processor 1201 executes various functional applications and data processing by running the software programs and modules stored in the memory 1203, that is, implements the method for identifying a virus file described above. The memory 1203 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1203 may further include memory located remotely from the processor 1201, which may be connected to the terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The above-mentioned transmission device 1205 is used for receiving or sending data via a network, and may also be used for data transmission between the processor and the memory. Examples of the network may include a wired network and a wireless network. In one example, the transmission device 1205 includes a network adapter (Network Interface Controller, NIC) that can be connected to a router via a network cable so as to communicate with the internet or a local area network. In one example, the transmission device 1205 is a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.

The memory 1203 is used for storing application programs, among other things.

The processor 1201 may invoke, via the transmission device 1205, the application stored in the memory 1203 to perform the following steps:

acquiring a sample record, wherein characteristic values of a program file which initiates access on an intelligent terminal on a plurality of characteristic dimensions are recorded in the sample record;

clustering sample records to obtain a sample set, wherein the sample set stores the same type of sample records;

searching target sample records with the same characteristic value on target characteristic dimensions in a sample set, wherein the characteristic dimensions comprise the target characteristic dimensions;

and in the case that the similarity between the program file recorded in the target sample record and the known virus file of the target type reaches a first threshold value, taking the program file recorded in the target sample record as a virus file of the target type.

The processor 1201 is further configured to perform the following steps:

selecting a first sample record in the plurality of sample records as a target sample record, creating a candidate sample set for each target sample record, and clustering the sample records in the plurality of sample records to the candidate sample set of the target sample record with the minimum spacing distance with the sample record;

reselecting a sample record from the candidate sample set as a target sample record, wherein the average value between the reselected sample record and the sample record in the candidate sample set except the reselected sample record is smaller than the average value between the second sample record and the sample record in the candidate sample set except the second sample record, and the second sample record is the sample record in the candidate sample set except the reselected sample record;

under the condition that the target sample records before clustering are different from the target sample records selected after clustering, the steps of creating a candidate sample set for each newly selected target sample record and clustering the sample records in the plurality of sample records to the candidate sample set of the target sample record with the minimum interval distance with the sample record are executed until the target sample records before clustering are the same as the target sample records selected after clustering;

and under the condition that the target sample records before clustering are the same as the target sample records selected after clustering, taking the candidate sample sets as a plurality of sample sets.

The processor 1201 is further configured to perform the steps of:

in a case that the target sample records are sample records with the same accessed internet protocol address, acquiring a first ratio between the number of the target sample records and the number of sample records in the sample set, and a second ratio between the number of the target sample records and the number of accesses whose accessed internet protocol address is the same as that of the target sample records;

and under the condition that the first ratio is greater than the second threshold and the second ratio is greater than the third threshold, obtaining the hash value of the program file recorded in the target sample record.

The processor 1201 is further configured to perform the following steps:

in a case that the target sample records are sample records with the same accessed domain name, acquiring a first ratio between the number of the target sample records and the number of sample records in the sample set, and a third ratio between the number of the target sample records and the number of accesses whose accessed domain name is the same as that of the target sample records;

and under the condition that the first ratio is greater than the second threshold value and the third ratio is greater than the third threshold value, acquiring the hash value of the program file recorded in the target sample record.

By adopting the embodiment of the present invention, a user behavior clustering method is introduced: the acquired sample records are clustered to obtain a sample set, and target sample records with the same feature value in the target feature dimension are searched for in the sample set; in a case that the similarity between the program file recorded in the target sample record and the known virus file of the target type reaches a first threshold, the program file recorded in the target sample record is taken as a virus file of the target type. That is, family variants, such as variants that differ in the IP or domain name accessed by the sample, are identified in combination with a sample static feature matching technique, so that the technical problem of low virus identification accuracy in the related art can be solved and the technical effect of improving virus identification accuracy is achieved.

Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments, and this embodiment is not described herein again.

It should be understood by those skilled in the art that the structure shown in fig. 12 is only illustrative, and the terminal may be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD. Fig. 12 does not limit the structure of the above electronic device. For example, the terminal may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 12, or have a different configuration from that shown in fig. 12.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware related to the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.

The embodiment of the invention also provides a storage medium. Optionally, in this embodiment, the storage medium may be used to store program code for executing the virus file identification method.

Optionally, in this embodiment, the storage medium may be located on at least one of a plurality of network devices in the network shown in the above embodiment.

Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps:

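acquiring a sample record, wherein the sample record records characteristic values of a program file which initiates access on an intelligent terminal on a plurality of characteristic dimensions;

clustering sample records to obtain a sample set, wherein the sample set stores the same type of sample records;

searching target sample records with the same characteristic value on a target characteristic dimension in a sample set, wherein a plurality of characteristic dimensions comprise the target characteristic dimension;

and in the case that the similarity between the program file recorded in the target sample record and the known virus file of the target type reaches a first threshold value, taking the program file recorded in the target sample record as a virus file of the target type.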

Optionally, the storage medium is further arranged to store program code for performing the steps of:

Optionally, for a specific example in this embodiment, reference may be made to the example described in the foregoing embodiment, and this embodiment is not described herein again.

Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.

In the above embodiments of the present invention, the description of each embodiment has its own emphasis, and reference may be made to the related description of other embodiments for parts that are not described in detail in a certain embodiment.

In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.

The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for identifying a virus file, comprising:

acquiring sample records, wherein the sample records record characteristic values of a program file which initiates access on an intelligent terminal on a plurality of characteristic dimensions;

clustering the sample records to obtain a sample set, wherein the sample records of the same type are stored in the sample set;

searching target sample records with the same characteristic value on target characteristic dimensions in the sample set, wherein the characteristic dimensions comprise the target characteristic dimensions;

under the condition that the target sample records are sample records with the same accessed internet protocol address, acquiring a first ratio between the number of the target sample records and the number of the sample records in the sample set and a second ratio between the number of the target sample records and the same access times of the accessed internet protocol address and the target sample records, and under the condition that the first ratio is larger than a second threshold and the second ratio is larger than a third threshold, acquiring a hash value of a program file recorded in the target sample records; or under the condition that the target sample records are sample records with the same accessed domain name, acquiring the first ratio between the number of the target sample records and the number of the sample records in the sample set and a third ratio between the number of the target sample records and the same access times of the accessed domain name and the target sample records, and under the condition that the first ratio is greater than the second threshold and the third ratio is greater than the third threshold, acquiring the hash value of the program file recorded in the target sample records; the second threshold and the third threshold are preset values;

and under the condition that the similarity of the hash value of the program file recorded in the target sample record and the hash value of the known virus file of the target type reaches a first threshold value, taking the program file recorded in the target sample record as a virus file of the target type.

2. The method of claim 1, wherein obtaining a set of samples by clustering the sample records comprises:

clustering the sample records according to the spacing distance between the sample records determined by the characteristic values on the characteristic dimensions recorded in the sample records to obtain a plurality of sample sets, wherein the spacing distance between any two sample records in one sample set is smaller than the spacing distance between the sample record in one sample set and the sample record in the other sample set.

3. The method of claim 2, wherein clustering the sample records according to separation distances between the sample records determined by feature values on a plurality of feature dimensions of the records in the sample records, resulting in a plurality of the sample sets comprises:

selecting a first sample record in the plurality of sample records as a target sample record, creating a candidate sample set for each target sample record, and clustering the sample records in the plurality of sample records to the candidate sample set of the target sample record with the minimum spacing distance with the sample record;

reselecting one sample record from the candidate sample set as the target sample record, wherein an average value between the reselected sample record and the sample records in the candidate sample set except the reselected sample record is smaller than an average value between a second sample record and the sample records in the candidate sample set except the second sample record, and the second sample record is the sample record in the candidate sample set except the reselected sample record;

under the condition that the target sample records before clustering are different from the target sample records selected after clustering, the steps of creating a candidate sample set for each newly selected target sample record and clustering the sample records in the plurality of sample records to the candidate sample set of the target sample record with the minimum spacing distance with the sample record are executed until the target sample records before clustering are the same as the target sample records selected after clustering;

and taking a plurality of candidate sample sets as a plurality of sample sets under the condition that the target sample records before clustering are the same as the target sample records selected after clustering.

4. The method according to any one of claims 1 to 3, wherein the feature values of the plurality of feature dimensions comprise hash values of program files, internet protocol addresses accessed by the program files, and domain names accessed by the program files, and wherein finding target sample records in the sample set with the same feature value in the target feature dimension comprises at least one of:

searching the target sample records with the same hash value of the program files in the sample set;

searching the target sample records with the same Internet protocol addresses accessed by the program files in the sample set;

and searching the target sample records with the same domain name accessed by the program file in the sample set.

5. An apparatus for identifying a virus file, comprising:

an acquisition unit, configured to acquire sample records, wherein the sample records record characteristic values, on a plurality of characteristic dimensions, of a program file which initiates access on an intelligent terminal;

the clustering unit is used for clustering the sample records to obtain a sample set, wherein the sample records of the same type are stored in the sample set;

a searching unit, configured to search, in the sample set, target sample records with the same feature value in a target feature dimension, where the multiple feature dimensions include the target feature dimension;

the identification unit is used for acquiring a first ratio between the number of the target sample records and the number of the sample records in the sample set and a second ratio between the number of the target sample records and the same access times of the accessed internet protocol addresses and the target sample records under the condition that the target sample records are the sample records with the same accessed internet protocol addresses, and acquiring a hash value of a program file recorded in the target sample records under the condition that the first ratio is greater than a second threshold and the second ratio is greater than a third threshold; or under the condition that the target sample records are sample records with the same accessed domain name, acquiring the first ratio between the number of the target sample records and the number of the sample records in the sample set and a third ratio between the number of the target sample records and the same access times of the accessed domain name and the target sample records, and under the condition that the first ratio is greater than the second threshold and the third ratio is greater than the third threshold, acquiring the hash value of the program file recorded in the target sample records; the second threshold and the third threshold are both preset values; and under the condition that the similarity of the hash value of the program file recorded in the target sample record and the hash value of the known virus file of the target type reaches a first threshold value, taking the program file recorded in the target sample record as a virus file of the target type.

6. The apparatus of claim 5, wherein the clustering unit is further configured to:

7. The apparatus of claim 6, wherein the clustering unit comprises:

a clustering module, configured to select first sample records from the plurality of sample records as target sample records, create one candidate sample set for each target sample record, and cluster each sample record of the plurality of sample records into the candidate sample set of the target sample record at the smallest distance from that sample record;

a selecting module, configured to reselect one sample record from the candidate sample set as the target sample record, wherein the average value of the distances between the reselected sample record and the other sample records in the candidate sample set is smaller than the average value of the distances between a second sample record and the other sample records in the candidate sample set, the second sample record being any sample record in the candidate sample set other than the reselected sample record;

a re-clustering module, configured to, in a case where the target sample records before clustering are different from the target sample records selected after clustering, perform again the steps of creating one candidate sample set for each reselected target sample record and clustering each sample record of the plurality of sample records into the candidate sample set of the target sample record at the smallest distance from that sample record, until the target sample records before clustering are the same as the target sample records selected after clustering;

a determining module, configured to use the candidate sample sets as the sample sets when the target sample record before clustering is the same as the target sample record selected after clustering.
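The four modules of claim 7 describe an iterative procedure that resembles k-medoids clustering. A minimal Python sketch under stated assumptions follows: sample records are numeric feature tuples, the distance is Euclidean (`math.dist`), and the function names, the `initial_centers` argument and the `max_iter` cap are hypothetical. The claim itself does not fix the distance measure or how the first sample records are chosen.

```python
import math


def assign(records, centers):
    """Clustering module: put each record in the set of its nearest center."""
    clusters = {c: [] for c in centers}
    for index, record in enumerate(records):
        nearest = min(centers, key=lambda c: math.dist(record, records[c]))
        clusters[nearest].append(index)
    return clusters


def reselect(records, members):
    """Selecting module: pick the member with the smallest average distance
    to the other members of the same candidate set."""
    def mean_distance(i):
        others = [j for j in members if j != i]
        if not others:
            return 0.0
        return sum(math.dist(records[i], records[j]) for j in others) / len(others)
    return min(members, key=mean_distance)


def cluster(records, initial_centers, max_iter=100):
    """Repeat assignment and reselection until the centers stop changing."""
    centers = list(initial_centers)
    for _ in range(max_iter):
        clusters = assign(records, centers)
        # Keep the old center if its candidate set happens to be empty.
        new_centers = [reselect(records, clusters[c]) if clusters[c] else c
                       for c in centers]
        if new_centers == centers:  # determining module: centers are stable
            break
        centers = new_centers       # re-clustering module: iterate again
    final = assign(records, centers)
    return [[records[i] for i in members] for members in final.values()]


records = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
sample_sets = cluster(records, initial_centers=[0, 1])
```

Starting from two centers taken from the same group, the reselection step moves one center to the second group, after which the centers no longer change and the candidate sets are returned as the sample sets.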

8. A storage medium, characterized in that the storage medium comprises a stored program, wherein the program when executed performs the method of any of the preceding claims 1 to 4.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the method of any of the preceding claims 1 to 4 by means of the computer program.