CN111371812B - Virus detection method, device and medium

Info

Publication number: CN111371812B
Application number: CN202010463067.4A
Authority: CN
Inventors: 刘敏; 齐文杰; 魏向前; 程虎; 沈江波; 彭宁; 曹有理; 谭昱
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-05-27
Filing date: 2020-05-27
Publication date: 2020-09-01
Anticipated expiration: 2040-05-27
Also published as: CN111371812A

Abstract

The invention discloses a virus detection method, device and medium. The method comprises: acquiring a suspicious file containing a parent file and a child file, wherein the parent file is a white file and the file attribute of the child file is grey or unknown; determining a first basic static feature of the parent file and a second basic static feature of the child file; generating a first distribution feature corresponding to the suspicious file based on a sample file set, the first basic static feature and the second basic static feature; and outputting a virus detection result for the suspicious file by using a virus detection model for white-utilization viruses, with the first basic static feature, the second basic static feature and the first distribution feature as input. The virus detection model is obtained by machine learning training on the features corresponding to the labeled sample files in the sample file set. The invention can improve the accuracy and adaptability of virus detection.

Description

Virus detection method, device and medium

Technical Field

The invention relates to the technical field of internet communication, in particular to a virus detection method, a virus detection device and a virus detection medium.

Background

With the continuous development of communication technology, the internet has become part of every aspect of life. However, the underground (black-market) industry that has grown up alongside the internet has also become ubiquitous and seriously threatens network security. In this underground industry, a virus is disguised as a child file of a safe and harmless parent file, so that when the parent file starts, the malicious child file is loaded or executed as well. A child file that is a virus can damage computer functions or data, affect computer usage, and replicate itself. Parent and child files that rely on this white-utilization technique (also called the white-plus-black technique) are together referred to as a white-utilization virus. In the related art, white-utilization viruses are detected mainly by sample identification and active defense. Sample identification is based on the static or dynamic features of the child file and is easily bypassed by viruses. Active defense is mostly based on process detection and cannot effectively cover viruses of finer-grained categories. Therefore, a more effective detection scheme for white-utilization viruses is needed.

Disclosure of Invention

In order to solve problems such as low accuracy when the prior art is applied to detecting white-utilization viruses, the invention provides a virus detection method, device and medium, as follows.

In one aspect, the present invention provides a virus detection method, the method comprising:

acquiring a suspicious file containing parent and child files, wherein the parent file in the suspicious file is a white file, and the file attribute of the child file in the suspicious file is grey or unknown;

determining a first basic static feature of the parent file and a second basic static feature of the child file;

generating a first distribution feature corresponding to the suspicious file based on a sample file set, the first basic static feature and the second basic static feature;

outputting a virus detection result of the suspicious file by using a virus detection model for white-utilization viruses, with the first basic static feature, the second basic static feature and the first distribution feature as input;

wherein the virus detection model is determined by machine learning training based on the features corresponding to the labeled sample files in the sample file set.

Another aspect provides a virus detection apparatus, the apparatus comprising:

an acquisition module, configured to acquire a suspicious file containing parent and child files, wherein the parent file in the suspicious file is a white file, and the file attribute of the child file in the suspicious file is grey or unknown;

a determination module, configured to determine a first basic static feature of the parent file and a second basic static feature of the child file;

a generation module, configured to generate a first distribution feature corresponding to the suspicious file based on the sample file set, the first basic static feature and the second basic static feature;

a detection module, configured to output a virus detection result of the suspicious file by using a virus detection model for white-utilization viruses, with the first basic static feature, the second basic static feature and the first distribution feature as input.

Another aspect provides an electronic device, which includes a processor and a memory, where at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded by the processor and executed to implement the virus detection method as described above.

Another aspect provides a computer-readable storage medium, in which at least one instruction or at least one program is stored, and the at least one instruction or the at least one program is loaded and executed by a processor to implement the virus detection method as described above.

The virus detection method, the device and the medium provided by the invention have the following technical effects:

The method determines the basic static features of the parent and child files in each sample file and the first distribution feature corresponding to each sample file, performs machine learning training on these features to obtain a virus detection model, and uses the virus detection model to detect viruses in suspicious files. Because the first distribution feature describes an individual file within the global file space, the training takes into account both the parent-child relationship chain and the individual file, and the virus detection model obtained by this training can improve the accuracy and adaptability of virus detection.

Drawings

In order to illustrate more clearly the technical solutions and advantages of the embodiments of the present invention or of the prior art, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of an application environment provided by an embodiment of the invention;

FIG. 2 is a schematic flow chart of a virus detection method according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a process for determining a second distribution characteristic corresponding to a suspicious file according to an embodiment of the present invention;

FIG. 4 is a schematic flow chart of a method for training a virus detection model according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of an architecture of a virus detection apparatus according to an embodiment of the present invention;

FIG. 6 is a schematic flow chart of constructing a tag set according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of an application scenario of a virus detection model according to an embodiment of the present invention;

FIG. 8 is a schematic flow chart of a method for training a virus detection model according to an embodiment of the present invention;

FIG. 9 is a schematic flowchart of adding a virus label to a suspicious sample file based on a virus detection result of the suspicious sample file and a preset determination rule according to an embodiment of the present invention;

FIG. 10 is a block diagram of a virus detection apparatus according to an embodiment of the present invention;

FIG. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

It should be noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of the present invention and the above-described drawings, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Referring to fig. 1, fig. 1 is a schematic diagram of an application environment according to an embodiment of the present invention, which may include a client 01 and a server 02, where the client and the server may be directly or indirectly connected through wired or wireless communication. The client sends the suspicious file to the server, and the server performs virus detection on the received suspicious file by using the virus detection model and outputs a virus detection result. It should be noted that fig. 1 is only an example.

Specifically, the client 01 may include a physical device such as a smart phone, a desktop computer, a tablet computer, a notebook computer, an Augmented Reality (AR)/Virtual Reality (VR) device, a digital assistant, or a smart wearable device, and may also include software running on the physical device, such as a computer program. The operating system running on the client 01 may include, but is not limited to, Android, iOS (a mobile operating system developed by Apple Inc.), Linux, Microsoft Windows, and the like.

Specifically, the server 02 may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers. The server 02 may comprise a network communication unit, a processor and a memory, etc. The server 02 may provide background services for the clients.

An embodiment of the virus detection method of the present invention is described below. FIG. 2 is a schematic flow chart of a virus detection method provided by an embodiment of the present invention. This specification provides the method operation steps as described in the embodiment or the flow chart, but more or fewer operation steps may be included based on conventional or non-inventive labor. The order of steps recited in the embodiment is only one of many possible execution orders and does not represent the only order of execution. In practice, a system or server product may execute the steps sequentially or in parallel (e.g., in a parallel-processor or multi-threaded environment) according to the embodiment or the method shown in the figures. Specifically, as shown in FIG. 2, the method may include:

S201: acquiring a suspicious file containing parent and child files, wherein the parent file in the suspicious file is a white file, and the file attribute of the child file in the suspicious file is grey or unknown;

In an embodiment of the invention, the server obtains a suspicious file and a sample file set. A suspicious file indicates a corresponding parent file and a corresponding child file; likewise, each sample file in the sample file set indicates a corresponding parent file and a corresponding child file. The preset parent-child relationship between the parent file and the child file in the suspicious file may be a program loading relationship or a program execution relationship. The parent file in the suspicious file is a white file, that is, a program that the antivirus engine has identified as safe and harmless. A grey file attribute indicates that the child file cannot currently be determined to be either a safe and harmless program or a malicious program (virus). An unknown file attribute indicates that the file attribute of the child file cannot currently be determined, for example because the antivirus engine has not been able to identify the child file in time. Further, the child file may be a Dynamic Link Library (DLL).

In practical applications, the suspicious file and the sample file set may be sent by a client to the server. The client can maintain a white list, determine file attributes using a "non-white is black (or grey)" strategy, and handle the related files accordingly: release parent and child files that are both safe and harmless; intercept the loading of the child file of a white-utilization virus (the parent file is a safe and harmless program and the child file is a malicious program); intercept the loading of the child file of an unknown file (the parent file is a safe and harmless program and the file attribute of the child file is unknown); intercept the loading of the child file of a grey file (the parent file is a safe and harmless program and the file attribute of the child file is grey); and intercept malicious programs acting as parent files. Correspondingly, the parent and child files released and intercepted by the client in a historical time period can serve as the data source of the sample file set, and the unknown files and grey files intercepted by the client in the current time period correspond to suspicious files.
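The client-side handling described above can be sketched as a small decision function. This is only an illustration: the `Attr` enumeration, the `triage` function, and the action strings are hypothetical names, and the `out_of_scope` branch for a grey or unknown parent file is an assumption, since the text does not describe that case.

```python
from enum import Enum


class Attr(Enum):
    WHITE = "white"      # safe and harmless
    BLACK = "black"      # malicious
    GREY = "grey"        # cannot yet be judged white or black
    UNKNOWN = "unknown"  # not yet identified by the antivirus engine


def triage(parent: Attr, child: Attr) -> str:
    """Return a hypothetical client action for one parent-child pair."""
    if parent is Attr.BLACK:
        return "intercept_parent"          # malicious program acting as parent
    if parent is not Attr.WHITE:
        return "out_of_scope"              # assumption: not covered by the text
    if child is Attr.WHITE:
        return "release"                   # both files safe and harmless
    if child is Attr.BLACK:
        return "intercept_child_load"      # white-utilization virus
    # Grey or unknown child under a white parent: intercept the load and
    # treat the pair as a suspicious file to be sent to the server.
    return "intercept_and_report"
```

Pairs that yield `intercept_and_report` correspond to the suspicious files handled in the later steps.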

S202: determining a first base static feature of the parent file and a second base static feature of the child file;

In the embodiment of the invention, the server determines the basic static feature of the parent file in the suspicious file (the first basic static feature) and the basic static feature of the child file in the suspicious file (the second basic static feature). Compared with the extended static features described later, the basic static features are shallower static features that are easier to extract.

The first basic static feature may include the file name, file path, MD5 (MD5 message-digest algorithm) value, digital signature, white score, and so on of the parent file in the suspicious file. The file path of the parent file indicates the on-disk storage location of the parent file. The MD5 value of the parent file is a hash value obtained from the parent file and a preset cryptographic hash function; it can serve as a unique identifier of the parent file and can be written as a 32-character hexadecimal string (a 128-bit digest). The digital signature of the parent file is a string of digits that only the sender of the information can generate and that others cannot forge; it also serves as valid proof of the authenticity of the information sent. The digital signature of the parent file indicates a certificate obtained when the parent file was certified by an authority; it can be generated based on that certificate, can correspond to a binary string, and can be extracted directly from the file structure of the parent file. The white score of the parent file may be used as a confidence indicator that the parent file is a white file.

The second basic static feature may include the file name, file path, MD5 value, digital signature, virus name, and so on of the child file in the suspicious file. The file path of the child file indicates the on-disk storage location of the child file. The MD5 value of the child file is a hash value obtained from the child file and a preset cryptographic hash function (which may be the same function as that used for the parent file); it can serve as a unique identifier of the child file and can be written as a 32-character hexadecimal string. The digital signature of the child file is a string of digits that only the sender of the information can generate and that others cannot forge; it also serves as valid proof of the authenticity of the information sent. The digital signature of the child file indicates a certificate obtained when the child file was certified by an authority; it can be generated based on that certificate, can correspond to a binary string, and can be extracted directly from the file structure of the child file. The virus name of the child file may indicate the virus that the child file is suspected to be; for example, if the child file shows a behavior path suspected of belonging to an injection-type virus, the virus name of the child file may be the injection-type virus. It should be noted that although the virus name of the child file can be used as evidence for determining its file attribute, the virus name alone does not prove that the child file is a malicious program (virus).

In practical applications, the white score of the parent file and the virus name of the child file may be label information that the client adds to the suspicious file based on the report of the antivirus engine. The client sends the suspicious file carrying this label information to the server, and the server incorporates the label information into the basic static features corresponding to the suspicious file.
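As a minimal sketch of how the shallow features listed above might be collected, the following uses Python's standard `hashlib` and `ntpath` modules. The function name `basic_static_features`, the dictionary keys, and the way the signature and client-supplied labels are passed in are assumptions made for illustration; extracting a real digital signature from a file structure is not shown.

```python
import hashlib
import ntpath  # parses Windows-style paths on any platform


def basic_static_features(path, content, signature=None, labels=None):
    """Collect basic static features of one file (parent or child).

    `signature` and `labels` (e.g. a white score or a virus name reported
    by the client) are taken as given; both are hypothetical inputs.
    """
    features = {
        "filename": ntpath.basename(path),
        "path": path,
        "md5": hashlib.md5(content).hexdigest(),  # 32 hex characters
        "signature": signature,
    }
    features.update(labels or {})
    return features


# Usage with placeholder values; the empty byte string stands in for file content.
parent = basic_static_features(
    r"C:\App\app.exe", b"", signature="Example Corp", labels={"white_score": 0.98}
)
```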

S203: generating (calculating) a first distribution feature corresponding to the suspicious file based on the sample file set, the first base static feature and the second base static feature;

In an embodiment of the present invention, the first distribution feature corresponding to the suspicious file characterizes, in numerical form and based on the basic static features, how the suspicious file as an individual file is represented within the sample file set, which serves as the global file space. For example, the file name of the parent file in the suspicious file may be used as the basic static feature of the suspicious file; the number of sample files in the sample file set whose parent file has the same file name is determined, and this number may be used as the first distribution feature corresponding to the suspicious file. More generally, the first distribution feature may be the count obtained by filtering the sample file set for matching files, using at least one basic static feature as the filter parameter.
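The file-name example can be sketched as a simple count over the sample file set. The record layout (dictionaries with a `parent_filename` key) and the function name are hypothetical.

```python
def count_same_parent_name(suspicious, samples):
    """Count sample files whose parent file name matches the suspicious file's."""
    name = suspicious["parent_filename"]
    return sum(1 for s in samples if s["parent_filename"] == name)


# Placeholder sample file set.
samples = [
    {"parent_filename": "app.exe"},
    {"parent_filename": "app.exe"},
    {"parent_filename": "tool.exe"},
]
first_distribution = count_same_parent_name({"parent_filename": "app.exe"}, samples)
```

Filtering on several basic static features at once would only change the matching condition inside the generator expression.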

In practical applications, the file paths of the parent and child files in the suspicious file can be used as basic static features of the suspicious file. Based on these file paths, it can be determined whether the suspicious file points to a known system file or to known software with a comparable number of users. If it does, the parent file in the suspicious file has value for white utilization, and the first distribution feature corresponding to the suspicious file can be determined based on the user breadth of that system file or known software.

The file path of the parent file in the suspicious file (e.g., file path 1) may serve as the basic static feature of the suspicious file. In the sample file set, the sample files whose parent file path is the same as file path 1 are found, and the file attributes of the child files in these sample files are determined. If the proportion of these child files that are safe and harmless programs (white) or malicious programs (black) exceeds a proportion threshold (e.g., 80%), the probability that the suspicious file is a white-utilization virus is low. Accordingly, the first distribution feature corresponding to the suspicious file may be determined based on this proportion or on the distribution of the file attributes (white, black, grey) of these child files. The first distribution feature may also be determined based on the number of these child files.

The file paths of the parent and child files in the suspicious file, such as file path 1 (parent) and file path 2 (child), may serve as the basic static features of the suspicious file. In the sample file set, first files whose file path is the same as file path 1 and second files whose file path is the same as file path 2 are found, and the first distribution feature corresponding to the suspicious file can be determined based on the number of first files and the number of second files.
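A rough sketch of the proportion-based variant follows. The record keys and function names are hypothetical, and the threshold test reads the text as "either the white share or the black share alone exceeds the threshold", which is one possible interpretation.

```python
from collections import Counter


def child_attr_distribution(parent_path, samples):
    """Share of white/black/grey child files among samples with this parent path."""
    attrs = [s["child_attr"] for s in samples if s["parent_path"] == parent_path]
    counts = Counter(attrs)
    total = len(attrs)
    return {a: (counts[a] / total if total else 0.0)
            for a in ("white", "black", "grey")}


def looks_low_risk(dist, threshold=0.8):
    """Assumed reading: a dominant white or black share makes white utilization unlikely."""
    return dist["white"] > threshold or dist["black"] > threshold


# Placeholder data: nine white child files and one grey one under the same parent path.
samples = ([{"parent_path": "path1", "child_attr": "white"}] * 9
           + [{"parent_path": "path1", "child_attr": "grey"}])
dist = child_attr_distribution("path1", samples)
```

Either the dictionary of shares or the matching count could be passed on as the first distribution feature.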

In a specific embodiment, as shown in fig. 3, after the generating a first distribution feature corresponding to the suspicious file based on the sample file set, the first base static feature and the second base static feature, the method further includes:

S301: obtaining an object identifier corresponding to each sample file in the sample file set to obtain an object identifier set;

S302: acquiring a target object identifier corresponding to the suspicious file;

S303: determining a second distribution feature corresponding to the suspicious file based on the target object identifier and the object identifier set;

The object identifier corresponding to a sample file may indicate account information (e.g., for a registered user), client information (e.g., for a registered user or a guest user), and the like corresponding to the sample file. Correspondingly, the target object identifier corresponding to the suspicious file may also indicate account information, client information, and the like corresponding to the suspicious file. The account information may include age information, region information, etc. of the registered user. The client information may include information related to the physical device used by the user, such as Internet Protocol address (IP address) information, operating system information, and the like.

The object identifier set is obtained from the object identifiers corresponding to the sample files in the sample file set, and the second distribution feature corresponding to the suspicious file is determined based on the target object identifier and the object identifier set. For example, the file paths of the parent and child files in the suspicious file (e.g., file path 1 (parent) and file path 2 (child)) may be used as the basic static features of the suspicious file. In the sample file set, first files whose file path is the same as file path 1 and second files whose file path is the same as file path 2 are found. The user location is derived from the IP address, the user locations of the sample files corresponding to the first files and to the second files are determined, and the second distribution feature corresponding to the suspicious file is then determined based on the number and distribution of these user locations. Determining the first and second distribution features increases the intelligence value of the suspicious file.
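The IP-based example might look like the sketch below. The prefix-to-region table, the `region_of_ip` helper, the record keys, and the two summary values returned are all hypothetical; the addresses come from the reserved documentation ranges, and a real system would use a proper IP geolocation source.

```python
from collections import Counter

# Hypothetical lookup table from address prefix to region label.
REGIONS = {"192.0.2": "region-A", "198.51.100": "region-B", "203.0.113": "region-C"}


def region_of_ip(ip):
    return REGIONS[ip.rsplit(".", 1)[0]]


def second_distribution(parent_path, child_path, samples):
    """Summarize where the users of matching sample files are located."""
    regions = Counter(
        region_of_ip(s["ip"])
        for s in samples
        if s["parent_path"] == parent_path or s["child_path"] == child_path
    )
    total = sum(regions.values())
    return {
        "num_regions": len(regions),
        "top_region_share": max(regions.values()) / total if total else 0.0,
    }


samples = [
    {"parent_path": "path1", "child_path": "x", "ip": "192.0.2.10"},
    {"parent_path": "path1", "child_path": "path2", "ip": "192.0.2.11"},
    {"parent_path": "y", "child_path": "path2", "ip": "198.51.100.7"},
    {"parent_path": "y", "child_path": "z", "ip": "203.0.113.5"},
]
feature = second_distribution("path1", "path2", samples)
```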

Accordingly, in step S204, a virus detection result of the suspicious file may be output by using the virus detection model with the first basic static feature, the second basic static feature, the first distribution feature and the second distribution feature as inputs.

S204: outputting a virus detection result of the suspicious file by using a virus detection model of a white utilization virus with the first basic static feature, the second basic static feature and the first distribution feature as input;

In the embodiment of the invention, the virus detection model is determined by machine learning training based on the features corresponding to the labeled sample files in the sample file set. The features corresponding to a sample file include the basic static features of the parent and child files in the sample file, the first distribution feature corresponding to the sample file (computed over the sample file set), and the second distribution feature corresponding to the sample file (computed over the sample file set). For these features, refer to the descriptions in steps S202 to S203 above, which are not repeated here.

In a specific embodiment, before the virus detection result of the suspicious file is output by the virus detection model for white-utilization viruses with the first basic static feature, the second basic static feature and the first distribution feature as input, the following preprocessing is performed. First, when a feature to be input into the virus detection model is of the character string type, it is converted into a feature of the numerical type. Then, the resulting numerical features are normalized to obtain the target features.

Features of the character string type can be converted into features of the numerical type by calling functions from standard libraries, for example for categorical mapping or character encoding. For the digital signatures of the parent and child files, the LabelEncoder (a label encoder) provided by the sklearn library (scikit-learn, a third-party library for Python, a cross-platform programming language) may be used for the conversion, which is computationally simple and efficient. The resulting numerical features may differ in scale. To eliminate the influence of these scale differences on the output of the subsequent model, the numerical features are normalized so that their value ranges are uniformly mapped to the interval from 0 to 1. Normalization may use z-score standardization (a data processing method), softmax function (a normalized exponential function) transformation, atan function (arctangent function) transformation, and the like. In practical applications, min-max normalization (a linear normalization function) may be selected.
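The two preprocessing steps can be illustrated without any third-party dependency. The sketch below is a plain-Python stand-in rather than the sklearn implementation: `label_encode` assigns integers to the sorted distinct values, and `min_max` applies the linear rescaling (x - min) / (max - min). Both function names are hypothetical.

```python
def label_encode(values):
    """Map each distinct string to an integer code (codes follow sorted order)."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


def min_max(values):
    """Linearly rescale numbers to the range 0..1; constant input maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Placeholder signer names standing in for digital-signature strings.
codes, classes = label_encode(["Corp B", "Corp A", "Corp B", "Corp C"])
scaled = min_max(codes)
```

Integer codes from label encoding carry an arbitrary order, which tree-based models generally tolerate better than linear ones.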

In another specific embodiment, as shown in fig. 4 and 8, the training process of the virus detection model includes the following steps:

S401: acquiring the sample file set, and dividing the sample file set into a positive and negative sample file set and a suspicious sample file set, wherein the virus label carried by a positive sample file in the positive and negative sample file set indicates a white-utilization virus;

The sample file set may be derived from the parent and child files released and intercepted by clients in the historical time period mentioned in step S201 above, where the clients are the plurality of clients that receive the background service provided by the server. Considering that the features of parent and child files differ between time periods, a target time period (for example, the last 30 days) may be determined first, and the sample file set is then obtained for that target time period. As shown in FIG. 8, the feature vectors of suspicious parent-child relationship chains from the last 30 days may be used to enrich the sample file set.

The positive sample files in the sample file set are those identified as white-utilization viruses by the antivirus engine, and the virus label carried by a positive sample file may be flag = 1 (positive stimulus). The negative sample files in the sample file set correspond to parent and child files that are both safe and harmless (for example, the MD5 value of the parent file (Opermd5) indicates a safe and harmless program and the MD5 value of the child file (Submd5) indicates a safe and harmless program), and the virus label carried by a negative sample file may be flag = 0 (negative stimulus). For the suspicious sample files in the suspicious sample file set, the parent file is a safe and harmless program and the file attribute of the child file is grey or unknown. In practical applications, as shown in FIG. 6, constructing the label set carrying virus labels (see FIG. 8) is not a one-off process; it is combined with steps S404-S405 described below. In FIG. 6, Opermd5 and Submd5 being white means that the file attribute is white, i.e., the corresponding parent or child file is a safe and harmless program. The preset characters may be hijack, lpk, fake, and the like. The feedback system identification result may correspond to the file attribute of the suspicious sample file determined by applying the preset determination rule in step S405 described later.

For convenient searching, the following label items can be set for the positive and negative sample files:

Opermd5

Operpath

Operfilename

Submd5

Subpath

Subfilename

flag

Here, Operpath is the file path of the parent file, Operfilename is the file name of the parent file, Subpath is the file path of the child file, and Subfilename is the file name of the child file.
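One way to hold these label items is a small record type. The field names follow the list above; the `LabelRecord` class, the `build_index` helper, and the choice of the (Opermd5, Submd5) pair as lookup key are assumptions for illustration, and the MD5 strings in the usage lines are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelRecord:
    Opermd5: str       # MD5 value of the parent file
    Operpath: str      # file path of the parent file
    Operfilename: str  # file name of the parent file
    Submd5: str        # MD5 value of the child file
    Subpath: str       # file path of the child file
    Subfilename: str   # file name of the child file
    flag: int          # 1 = white-utilization virus, 0 = safe and harmless


def build_index(records):
    """Index records by the parent/child MD5 pair for quick lookup."""
    return {(r.Opermd5, r.Submd5): r for r in records}


record = LabelRecord("aa" * 16, r"C:\App\app.exe", "app.exe",
                     "bb" * 16, r"C:\App\lpk.dll", "lpk.dll", 1)
index = build_index([record])
```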

S402: based on the characteristics corresponding to the positive and negative sample file sets, using a preset machine learning model to perform virus detection training, and adjusting the model parameters of the preset machine learning model in the training until the virus detection result output by the preset machine learning model is matched with the virus label carried by the input sample file;

The features corresponding to the positive and negative sample files include the basic static features of the parent and child files in the sample file, the first distribution feature corresponding to the sample file (computed over the sample file set), and the second distribution feature corresponding to the sample file (computed over the sample file set). For these features, refer to the descriptions in steps S202 to S203 above, which are not repeated here. In addition, when the features corresponding to the positive and negative sample files are of the character string type, the processing for converting them into features of the numerical type follows the description in step S204 and is not repeated here. Because the first and second distribution features of the sample files are used in training, white-utilization viruses with low breadth and high intelligence value can be mined from massive data to serve as positive sample files, which improves detection coverage of more white-utilization viruses.

Model training is performed based on the features corresponding to the positive and negative sample file sets. The preset machine learning model may be an initial model or an intermediate model. The positive and negative sample file sets may be divided into a training set (for example, 90%) and a verification set (for example, 10%), and the features corresponding to the training set are input into two different preset machine learning models (for example, a logistic regression model and a decision tree model) respectively to perform virus detection training. Then, the features corresponding to the verification set are input into the two trained models respectively to obtain virus detection results. The recall rate and the accuracy rate corresponding to the two models are determined according to the virus labels carried by the sample files in the verification set and the virus detection results output by the two trained models. Finally, an optimal model is determined based on the recall rate, the accuracy rate and preset requirements. Of course, the preset machine learning model is not limited to the logistic regression model and the decision tree model.
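The split, evaluation and selection described above can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the "accuracy rate" is interpreted here as precision, the "preset requirements" are reduced to a minimum recall, and the function names and the example scores are hypothetical.

```python
import random


def split_train_validation(samples, train_ratio=0.9, seed=0):
    """Shuffle the labeled samples and split them, e.g. 90% training / 10% verification."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def recall_and_precision(labels, predictions):
    """Compute recall and precision for the positive class (flag = 1)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


def select_model(scores, min_recall):
    """Among models meeting the recall requirement, pick the most precise one.

    scores maps a model name to its (recall, precision) on the verification set.
    Returns None when no model meets the requirement.
    """
    eligible = {name: rp for name, rp in scores.items() if rp[0] >= min_recall}
    if not eligible:
        return None
    return max(eligible, key=lambda name: eligible[name][1])
```

For example, with hypothetical verification scores of (0.96, 0.10) for a logistic regression model and (0.97, 0.15) for a decision tree model, select_model with min_recall=0.95 would return the decision tree model.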

In practical applications, the decision tree model is preferred over the logistic regression model. When the recall rate of the decision tree model for positive and negative examples in the subsequent prediction set is more than 95%, the accuracy of judging whether a file is a white-utilization virus is about 15%. Therefore, the range in which white-utilization viruses need to be located can be greatly reduced, and meanwhile, other security threats, such as mining trojans hidden in games, can be conveniently uncovered from the recalled white-utilization viruses.

In training, the model parameters may be adjusted based on the difference between the intermediate result output by the model (whether the sample file is a white-utilization virus) and the virus label carried by the sample file.
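To make the parameter adjustment concrete, the following sketch trains a logistic regression model by stochastic gradient descent, where each update is proportional to the difference between the model output and the virus label. The choice of optimizer, the learning rate, the number of epochs and the toy feature values are assumptions for illustration and are not specified by the embodiment.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, learning_rate=0.5, epochs=200):
    """Adjust weights and bias from the difference between output and label."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            output = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = output - y  # intermediate result minus virus label (flag)
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (white-utilization virus) or 0 for one normalized feature vector."""
    return 1 if sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5 else 0
```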

S403: taking a preset machine learning model corresponding to the adjusted model parameters as a candidate model;

the candidate model is determined by machine learning training based on the characteristics corresponding to the positive and negative sample file sets.

S404: taking the corresponding characteristics of the suspicious sample file set as input, and outputting the virus detection result of each suspicious sample file by using the candidate model;

The suspicious sample file set is taken as a prediction set, and the features corresponding to the prediction set are input into the candidate model for virus detection.

S405: adding virus labels to the suspicious sample files based on virus detection results of the suspicious sample files and preset judgment rules;

A preset judgment rule is introduced to further check and confirm the virus detection result output by the candidate model, so that the prediction result of the candidate model can be purified. After verification and confirmation, the positive and negative sample files can be continuously expanded by adding virus labels to the suspicious sample files. Accordingly, steps S401-S407 can form a closed loop of model learning -> validation -> relearning based on the data set. As the positive and negative sample files increase, the white-utilization virus cases covered by the data set become more and more abundant, and the prediction accuracy of the model becomes higher and higher. Since white-utilization viruses are often stealthy and difficult to find, the preset judgment rule can be determined based on big data.
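One pass of this learning -> validation -> relearning loop can be sketched as below. The callables train, predict and verify are hypothetical placeholders for the model training, the candidate model prediction and the preset judgment rule; as a simplification, the sketch retrains on the enlarged labeled set instead of continuing to train the candidate model.

```python
def relearning_round(labeled, suspicious, train, predict, verify):
    """Run one learning -> validation -> relearning pass.

    labeled:    list of (features, flag) pairs, flag 1 = positive, 0 = negative
    suspicious: list of feature items without a label
    train(labeled) -> model
    predict(model, features) -> 0 or 1
    verify(features, predicted) -> 0, 1, or None when the rule cannot decide
    """
    candidate = train(labeled)
    enlarged = list(labeled)
    undecided = []
    for features in suspicious:
        predicted = predict(candidate, features)
        flag = verify(features, predicted)
        if flag is None:
            undecided.append(features)          # left for later rounds or manual analysis
        else:
            enlarged.append((features, flag))   # expands the positive and negative samples
    return train(enlarged), enlarged, undecided
```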

The verification confirmation may correspond to the following processing:

(1) First, when the virus detection result of the suspicious sample file indicates that the suspicious sample file is a white-utilization virus, acquiring a distribution characteristic corresponding to the suspicious sample file based on the preset judgment rule; then, when the distribution characteristic corresponding to the suspicious sample file meets the preset requirements, adding the virus label carried by negative sample files to the suspicious sample file.

The candidate model predicts that a suspicious sample file is a white-utilization virus. Correspondingly, the preset judgment rule judges whether the suspicious sample file has obvious non-hijacking characteristics. If the user breadth of the file to be checked is larger than a threshold, the file can be judged not to be a white-utilization virus. Accordingly, the distribution characteristics corresponding to the suspicious sample file are obtained; when the corresponding user breadth is larger than the threshold, the suspicious sample file is verified not to be a white-utilization virus, and the virus label carried by negative sample files (flag = 0) is added to the suspicious sample file. The preset judgment rule used in this embodiment may correspond to "white filtering" in fig. 9.

(2) Firstly, when the virus detection result of the suspicious sample file indicates that the suspicious sample file is a white-utilization virus, acquiring a digital signature of a parent file and a digital signature of a child file in the suspicious sample file based on the preset judgment rule; and then, when the digital signature of the parent file is the same as the digital signature of the child file, adding a virus label carried by a negative sample file to the suspicious sample file.

The candidate model predicts that a suspicious sample file is a white-utilization virus. Correspondingly, the preset judgment rule judges whether the suspicious sample file has obvious non-hijacking characteristics. If the digital signatures of the parent file and the child file of the file to be checked are the same, the file can be judged not to be a white-utilization virus. Accordingly, the digital signature of the parent file and the digital signature of the child file in the suspicious sample file are acquired; when the two digital signatures are the same, the suspicious sample file is verified not to be a white-utilization virus, and the virus label carried by negative sample files (flag = 0) is added to the suspicious sample file. The preset judgment rule used in this embodiment may correspond to "white filtering" in fig. 9.
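The two "white filtering" checks in (1) and (2) can be sketched as a single function. The function name, the representation of a digital signature as a signer string, and the treatment of unsigned files (None never counts as "the same signature") are assumptions for illustration; the breadth threshold is a parameter because the embodiment does not give a value.

```python
def white_filter(user_breadth, parent_signature, child_signature, breadth_threshold):
    """Return 0 (negative label) when a rule shows the pair is evidently not hijacked.

    Returns None when neither rule applies, so the sample needs further analysis.
    """
    # Rule (1): a pair seen on very many users' machines is judged not to be a
    # white-utilization virus.
    if user_breadth > breadth_threshold:
        return 0
    # Rule (2): parent and child signed with the same digital signature.
    if parent_signature is not None and parent_signature == child_signature:
        return 0
    return None
```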

(3) When the candidate model predicts that a suspicious sample file is a white-utilization virus and the suspicious sample file does not correspond to a safe and harmless parent-child file, it can be labeled as follows:

1) when the virus detection result of the suspicious sample file indicates that the suspicious sample file is a white-utilization virus, adding the subfiles in the suspicious sample file into a suspicious subfile set;

2) based on the static characteristics and the dynamic characteristics corresponding to each suspicious subfile in the suspicious subfile set, performing clustering processing on the suspicious subfile set to obtain at least one suspicious subfile subset;

3) respectively selecting a suspicious subfile from each suspicious subfile subset as a reference subfile;

4) determining the virus attribute of the reference subfile based on the preset judgment rule, determining the reference virus label of the suspicious sample file corresponding to the reference subfile based on the virus attribute of the reference subfile, and taking the reference virus label as the virus label of the suspicious sample file corresponding to each suspicious subfile in the suspicious subfile subset where the reference subfile is located;

5) and respectively marking the suspicious sample file corresponding to each suspicious subfile in the same suspicious subfile subset by using the reference virus label corresponding to the reference subfile in each suspicious subfile subset.

The subfiles in the suspicious sample files are gathered together to construct a suspicious subfile set. Within the suspicious subfile set, clustering is performed based on the static features and dynamic features corresponding to each suspicious subfile to obtain at least one suspicious subfile subset. During clustering, the similarity of the static and dynamic features between different suspicious subfiles can be calculated based on the simhash algorithm (a deduplication algorithm); when the similarity is greater than a certain threshold (for example, 85%), the suspicious subfiles concerned can be judged to be similar samples, and the similar samples form a suspicious subfile subset. The clustering can overcome the differences between different versions of a sample file and resist the MD5 countermeasures that network black products apply to sample files. One suspicious subfile is selected from each suspicious subfile subset as a reference subfile for analysis, and the remaining suspicious subfiles are not analyzed, so that the analysis amount can be reduced by more than half.
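A minimal sketch of simhash-based clustering follows. It assumes that the static and dynamic features of each subfile have already been turned into a list of string tokens (for example, imported function names); the 64-bit fingerprint, the use of SHA-256 as the token hash, and the greedy grouping against the first member of each subset are illustrative choices, not details given by the embodiment.

```python
import hashlib

BITS = 64  # fingerprint width used in this sketch


def simhash(tokens):
    """Build a 64-bit simhash fingerprint from a list of feature tokens."""
    counts = [0] * BITS
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            counts[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i in range(BITS):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def similarity(a, b):
    """Share of identical bits between two fingerprints, from 0.0 to 1.0."""
    return 1.0 - bin(a ^ b).count("1") / BITS


def cluster_subfiles(fingerprints, threshold=0.85):
    """Greedy clustering; the first member of each subset serves as its reference."""
    subsets = []
    for name, fingerprint in fingerprints.items():
        for subset in subsets:
            if similarity(fingerprints[subset[0]], fingerprint) > threshold:
                subset.append(name)
                break
        else:
            subsets.append([name])
    return subsets
```

Because only the first member of each subset is analyzed further, the amount of analysis shrinks with the number of subsets rather than the number of subfiles.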

When the virus attribute of the reference subfile is determined based on the preset judgment rule, static features of the reference subfile can be extracted for analysis. The static features are extracted from the disassembled code and can correspond to the extracted import and export function features and the code features of the entry point. Extended static features may be analyzed here; compared with the basic static features described above, the extended static features are deeper static features of the file. Specifically, the reference subfile is disassembled to obtain disassembled code; the extended static features corresponding to the reference subfile are extracted from the disassembled code; and the virus attribute of the reference subfile is determined based on the extended static features. Correspondingly, when the virus attribute of the reference subfile points to a malicious program, the reference virus label of the suspicious sample file corresponding to the reference subfile is determined to indicate a white-utilization virus. This reference virus label is adopted for the suspicious sample file corresponding to each suspicious subfile in the suspicious subfile subset where the reference subfile is located, and the virus label carried by positive sample files (flag = 1) is added to these suspicious sample files. When the virus attribute of the reference subfile points to a safe and harmless program, the reference virus label of the suspicious sample file corresponding to the reference subfile is determined to indicate that the file is not a white-utilization virus. This reference virus label is adopted for the suspicious sample file corresponding to each suspicious subfile in the suspicious subfile subset where the reference subfile is located, and the virus label carried by negative sample files (flag = 0) is added to these suspicious sample files. Here, an automated qualification tool (see FIG. 9) can determine whether a sample file is a white-utilization virus.
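The propagation of the reference virus label to every sample in a subset can be sketched as follows. The analysis of the reference subfile (disassembly and extended static features) is abstracted into a hypothetical callable, reference_is_malicious, and taking the first member of each subset as the reference is an assumption of this example.

```python
def propagate_labels(subsets, reference_is_malicious):
    """Label every sample in a subset with the label derived from its reference.

    subsets: list of lists of sample identifiers; the first entry of each list
             is treated as the reference subfile's sample.
    reference_is_malicious: callable returning True when the virus attribute of
             the reference subfile points to a malicious program.
    Returns a dict mapping sample identifier to flag (1 = positive, 0 = negative).
    """
    labels = {}
    for subset in subsets:
        flag = 1 if reference_is_malicious(subset[0]) else 0
        for sample_id in subset:
            labels[sample_id] = flag
    return labels
```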

Of course, in addition to the verification and confirmation modes (1), (2) and (3) above, for some highly suspicious parent-child relationship chains, accurate analysis can be performed with human intervention to add a label.

S406: based on suspicious sample files carrying virus labels, using the candidate model to carry out virus detection training, and adjusting model parameters of the candidate model in the training until virus detection results output by the candidate model are matched with the virus labels carried by the input sample files;

Referring to step S402, the features corresponding to the suspicious sample files carrying virus labels are input into the candidate model for virus detection training. In training, the model parameters may be adjusted based on the difference between the intermediate result output by the model (whether the sample file is a white-utilization virus) and the virus label carried by the sample file.

S407: and taking the candidate model corresponding to the adjusted model parameter as the virus detection model.

A virus detection model with high generalization capability is obtained by training the machine learning model. When this virus detection model is used for virus detection, the adaptability of determining whether a file is a white-utilization virus can be improved, so that the reliability and effectiveness of virus detection can be greatly improved.

Fig. 7 is a schematic view of an application scenario of a virus detection model according to an embodiment of the present invention. In fig. 7, the training data are sample files, and each sample file carries a label indicating whether it is a white-utilization virus; correspondingly, the virus detection model obtained by training can detect whether a suspicious file is a white-utilization virus. In fig. 7, the suspicious file is input to the virus detection model, and the virus detection model outputs a result indicating whether the suspicious file is a white-utilization virus.

Whether the suspicious file is a white-utilization virus is predicted based on a supervised algorithm, and a preset judgment rule is combined in model training, so that misjudgment by the virus detection model in application can be effectively reduced.

In another specific embodiment, the processing introduced in model training, in which a preset judgment rule is used to check and confirm the virus detection results of suspicious sample files output by the candidate model, can be referred to, so that the virus detection result of the suspicious file output by the virus detection model is checked and confirmed accordingly. When the virus detection result of the suspicious file indicates that the suspicious file is a white-utilization virus, the verification and confirmation may correspond to the following processing modes:

(1) The distribution characteristics corresponding to the suspicious file can be obtained based on the preset judgment rule; then, when the distribution characteristics corresponding to the suspicious file meet the preset requirements, a virus detection result indicating that the suspicious file is not a white-utilization virus is generated to update the current virus detection result.

(2) The digital signature of the parent file and the digital signature of the child file in the suspicious file are acquired based on the preset judgment rule; then, when the digital signature of the parent file and the digital signature of the child file are the same, a virus detection result indicating that the suspicious file is not a white-utilization virus is generated to update the current virus detection result.

(3) Subfiles in the suspicious file may be added to a first suspicious subfile set; then, based on the static characteristics and the dynamic characteristics corresponding to each first suspicious subfile in the first suspicious subfile set, clustering the first suspicious subfile set to obtain at least one first suspicious subfile subset; then, respectively selecting a first suspicious subfile from each first suspicious subfile subset as a first reference subfile; then, determining the virus attribute of the first reference subfile based on the preset judgment rule, determining a first reference virus label of the suspicious file corresponding to the first reference subfile based on the virus attribute of the first reference subfile, and taking the first reference virus label as the virus label of the suspicious file corresponding to each first suspicious subfile in the first suspicious subfile subset where the first reference subfile is located; and finally, labeling the suspicious file corresponding to each first suspicious subfile in the same first suspicious subfile subset by using the first reference virus label corresponding to the first reference subfile in each first suspicious subfile subset.

When the virus attribute of the first reference subfile points to a malicious program, the current virus detection result of the suspicious file corresponding to each first suspicious subfile in the first suspicious subfile subset where the first reference subfile is located is retained. When the virus attribute of the first reference subfile points to a safe and harmless program, a virus detection result indicating that the file is not a white-utilization virus is generated for the suspicious file corresponding to each first suspicious subfile in the first suspicious subfile subset where the first reference subfile is located, so as to update the current virus detection result.

Of course, for some highly suspicious parent-child relationship chains, accurate analysis can be performed with human intervention to determine whether the current virus detection result is accurate. Here, based on the virus detection result after verification and confirmation, a corresponding virus label may be added to the suspicious file, and the suspicious file carrying the virus label is used to update the training data.

According to the technical solution provided by the embodiments of this specification, the basic static features of the parent and child files in the sample files and the distribution features corresponding to the sample files are determined, machine learning training is performed based on these features to obtain a virus detection model, and the virus detection model is used to perform virus detection on suspicious files. The machine learning training takes into account the distribution characteristics of the parent-child relationship chain and of the individual files in the global file space, and using the virus detection model obtained by this training can improve the accuracy, timeliness and adaptability of virus detection.

An embodiment of the present invention further provides a virus detection apparatus, as shown in fig. 10, the apparatus includes:

an acquisition module 1010: configured to acquire a suspicious file containing parent and child files and a sample file set, wherein the parent and child files have a preset parent-child relationship, the parent file in the suspicious file is a white file, and the file attribute of the child file in the suspicious file is grey or unknown;

a determination module 1020: configured to determine a first basic static feature of the parent file and a second basic static feature of the child file;

a generation module 1030: configured to generate a first distribution feature corresponding to the suspicious file based on the sample file set, the first basic static feature and the second basic static feature;

a detection module 1040: configured to output a virus detection result of the suspicious file by using a virus detection model for white-utilization viruses, with the first basic static feature, the second basic static feature and the first distribution feature as input.

In a specific embodiment, the virus detection apparatus may be architected with reference to FIG. 5.

It should be noted that the device embodiments and the method embodiments are based on the same inventive concept.

An embodiment of the present invention provides an electronic device, where the electronic device includes a processor and a memory, where the memory stores at least one instruction or at least one program, and the at least one instruction or the at least one program is loaded and executed by the processor to implement the virus detection method provided in the foregoing method embodiment.

Further, fig. 11 shows a schematic hardware structure diagram of an electronic device for implementing the virus detection method provided by the embodiment of the present invention, where the electronic device may participate in forming or including the virus detection apparatus provided by the embodiment of the present invention. As shown in fig. 11, electronic device 110 may include one or more processors 1102 (shown as 1102a, 1102b, ..., 1102n; a processor 1102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 1104 for storing data, and a transmission device 1106 for communication functions. In addition, the electronic device may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 11 is only an illustration and is not intended to limit the structure of the electronic device. For example, electronic device 110 may also include more or fewer components than shown in fig. 11, or have a different configuration than shown in fig. 11.

It should be noted that the one or more processors 1102 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single, stand-alone processing module, or incorporated in whole or in part into any of the other elements in the electronic device 110 (or mobile device). As referred to in the embodiments of the application, the data processing circuit acts as a processor control (e.g. selection of a variable resistance termination path connected to the interface).

The memory 1104 may be used for storing software programs and modules of application software, such as the program instructions/data storage devices corresponding to the methods described in the embodiments of the present invention. The processor 1102 executes various functional applications and data processing by running the software programs and modules stored in the memory 1104, so as to implement the virus detection method described above. The memory 1104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1104 may further include memory located remotely from the processor 1102, which may be connected to the electronic device 110 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 1106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the electronic device 110. In one example, the transmission device 1106 includes a network adapter (NIC) that can be connected to other network devices through a base station to communicate with the internet. In one embodiment, the transmission device 1106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.

The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the electronic device 110 (or mobile device).

The embodiment of the present invention further provides a storage medium, which can be disposed in an electronic device to store at least one instruction or at least one program for implementing a virus detection method in the method embodiment, where the at least one instruction or the at least one program is loaded and executed by the processor to implement the virus detection method provided in the method embodiment.

Alternatively, in this embodiment, the storage medium may be located in at least one network server of a plurality of network servers of a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.

It should be noted that the order of the above embodiments of the present invention is only for description and does not represent the merits of the embodiments. Specific embodiments of this specification have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the device and electronic apparatus embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method for detecting a virus, the method comprising:

obtaining an object identifier corresponding to each sample file in the sample file set to obtain an object identifier set;

acquiring a target object identifier corresponding to the suspicious file, and determining a second distribution characteristic corresponding to the suspicious file based on the target object identifier and the object identifier set;

outputting a virus detection result of the suspicious file by using a virus detection model of a virus with the first basic static feature, the second basic static feature, the first distribution feature and the second distribution feature as input;

the virus detection model is determined by machine learning training based on the features corresponding to the labeled sample files in the sample file set, wherein the features corresponding to the sample files comprise basic static features of parent and child files in the sample files, first distribution features corresponding to the sample files and second distribution features corresponding to the sample files.

2. The method of claim 1, wherein the training process of the virus detection model comprises the following steps:

acquiring the sample file set, and dividing the sample file set into a positive sample file set and a negative sample file set and a suspicious sample file set, wherein virus marks carried by positive sample files in the positive sample file set and the negative sample file set indicate white utilization viruses;

based on the characteristics corresponding to the positive and negative sample file sets, using a preset machine learning model to perform virus detection training, and adjusting the model parameters of the preset machine learning model in the training until the virus detection result output by the preset machine learning model is matched with the virus label carried by the input sample file;

taking a preset machine learning model corresponding to the adjusted model parameters as a candidate model;

taking the corresponding characteristics of the suspicious sample file set as input, and outputting the virus detection result of each suspicious sample file by using the candidate model;

adding virus labels to the suspicious sample files based on virus detection results of the suspicious sample files and preset judgment rules;

based on suspicious sample files carrying virus labels, using the candidate model to carry out virus detection training, and adjusting model parameters of the candidate model in the training until virus detection results output by the candidate model are matched with the virus labels carried by the input sample files;

and taking the candidate model corresponding to the adjusted model parameter as the virus detection model.

3. The method according to claim 2, wherein the adding a virus label to the suspicious sample file based on the virus detection result of the suspicious sample file and a preset determination rule comprises:

when the virus detection result of the suspicious sample file indicates that the suspicious sample file is a white-utilization virus, acquiring a distribution characteristic corresponding to the suspicious sample file based on the preset judgment rule;

and when the distribution characteristics corresponding to the suspicious sample file meet the preset requirements, adding virus labels carried by the negative sample file for the suspicious sample file.

4. The method according to claim 2, wherein the adding a virus label to the suspicious sample file based on the virus detection result of the suspicious sample file and a preset determination rule comprises:

when the virus detection result of the suspicious sample file indicates that the suspicious sample file is a white-utilization virus, acquiring a digital signature of a parent file and a digital signature of a child file in the suspicious sample file based on the preset judgment rule;

and when the digital signature of the parent file is the same as the digital signature of the child file, adding virus marks carried by negative sample files for the suspicious sample files.

5. The method according to claim 2, wherein the adding a virus label to the suspicious sample file based on the virus detection result of the suspicious sample file and a preset determination rule comprises:

when the virus detection result of the suspicious sample file indicates that the suspicious file is a white-utilization virus, adding the subfiles in the suspicious sample file into a suspicious subfile set;

based on the static characteristics and the dynamic characteristics corresponding to each suspicious subfile in the suspicious subfile set, performing clustering processing on the suspicious subfile set to obtain at least one suspicious subfile subset;

respectively selecting a suspicious subfile from each suspicious subfile subset as a reference subfile;

determining the virus attribute of the reference subfile based on the preset judgment rule, determining the reference virus label of the suspicious sample file corresponding to the reference subfile based on the virus attribute of the reference subfile, and taking the reference virus label as the virus label of the suspicious sample file corresponding to each suspicious subfile in the suspicious subfile subset where the reference subfile is located;

and respectively marking the suspicious sample file corresponding to each suspicious subfile in the same suspicious subfile subset by using the reference virus label corresponding to the reference subfile in each suspicious subfile subset.

6. The method of claim 5, wherein the determining the virus attribute of the reference subfile based on the preset judgment rule comprises:

disassembling the reference subfile to obtain a disassembled code;

extracting extended static features corresponding to the reference subfiles from the disassembled codes;

determining a virus attribute of the reference subfile based on the extended static features.
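A minimal sketch of the feature-extraction step of claim 6 follows. It assumes the disassembler has already produced one instruction per line of text with the mnemonic first (for example `mov eax, ebx`). The specific features computed here (instruction count, call ratio, jump ratio) are illustrative assumptions; this section does not enumerate the extended static features.

```python
from collections import Counter
from typing import Dict, Iterable


def extended_static_features(disassembly: Iterable[str]) -> Dict[str, float]:
    """Compute simple mnemonic statistics from disassembled code.

    Each non-blank line is assumed to start with the instruction mnemonic.
    """
    mnemonics = [line.split()[0].lower() for line in disassembly if line.strip()]
    counts = Counter(mnemonics)
    total = len(mnemonics)
    if total == 0:
        return {"instruction_count": 0.0, "call_ratio": 0.0, "jump_ratio": 0.0}
    # Treat every mnemonic beginning with "j" (jmp, je, jne, ...) as a jump.
    jumps = sum(n for m, n in counts.items() if m.startswith("j"))
    return {
        "instruction_count": float(total),
        "call_ratio": counts["call"] / total,
        "jump_ratio": jumps / total,
    }
```

The resulting dictionary would then be passed to whatever rule determines the virus attribute of the reference subfile; that rule is not shown because the claim leaves it open.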

7. The method of claim 1, wherein before the outputting of the virus detection result of the suspicious file by using the virus detection model of a white-utilization virus with the first basic static feature, the second basic static feature, the first distribution feature and the second distribution feature as input, the method further comprises:

when the feature to be input of the virus detection model is a character string type feature, converting the feature to be input into a numerical value type feature;

and normalizing the features of the numerical value types obtained after the processing to obtain the target features.
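The preprocessing of claim 7 could look roughly like this. The claim does not specify how strings become numbers or which normalization is used, so the hash-bucket encoding, the `buckets` parameter and min-max scaling are all assumptions chosen for illustration.

```python
import hashlib
from typing import List, Sequence, Union


def to_numeric(value: Union[str, int, float], buckets: int = 1024) -> float:
    """Convert a string-type feature into a numerical one by hashing it into
    a fixed number of buckets; numeric inputs pass through unchanged."""
    if isinstance(value, str):
        digest = hashlib.md5(value.encode("utf-8")).digest()
        return float(int.from_bytes(digest[:4], "big") % buckets)
    return float(value)


def min_max_normalize(column: Sequence[float]) -> List[float]:
    """Scale one feature column into [0, 1]; a constant column maps to zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]
```

For example, a column of file-name features would first be mapped through `to_numeric` and the resulting numbers then passed to `min_max_normalize` to obtain the target features.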

8. A virus detection apparatus, the apparatus comprising:

a first obtaining module, configured to acquire a suspicious file containing parent and child files, wherein the parent file in the suspicious file is a white file, and the file attribute of the child file in the suspicious file is grey or unknown;

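a second obtaining module, configured to determine a first basic static feature of the parent file and a second basic static feature of the child file, and to generate a first distribution feature corresponding to the suspicious file based on a sample file set, the first basic static feature and the second basic static feature;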

a determination module, configured to acquire a target object identifier corresponding to the suspicious file and to determine a second distribution feature corresponding to the suspicious file based on the target object identifier and the object identifier set;

a detection module, configured to output a virus detection result of the suspicious file by using a virus detection model of a white-utilization virus with the first basic static feature, the second basic static feature, the first distribution feature and the second distribution feature as input.

9. A computer-readable storage medium, wherein at least one instruction or at least one program is stored in the storage medium, and the at least one instruction or the at least one program is loaded and executed by a processor to implement the virus detection method according to any one of claims 1 to 7.