CN112149121A - Malicious file identification method, device, equipment and storage medium - Google Patents


Info

Publication number: CN112149121A
Authority: CN (China)
Prior art keywords: file, sample, malicious, identified, class
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201910570372.0A
Other languages: Chinese (zh)
Inventors: 章明星, 刘彦南
Current Assignee: Sangfor Technologies Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Sangfor Technologies Co Ltd
Application filed by: Sangfor Technologies Co Ltd
Priority: CN201910570372.0A (patent/CN112149121A/en)
Publication: CN112149121A (patent/CN112149121A/en)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Abstract

The invention discloses a malicious file identification method, which comprises the following steps: dividing all historical file samples into a plurality of sample classes in advance; training to obtain a malicious file identification model corresponding to each sample class; when a new file sample exists, determining a sample class of the new file sample; updating a malicious file identification model corresponding to the sample class to which the new file sample belongs; and when the file to be identified is detected, identifying the file to be identified by using the corresponding malicious file identification model. By applying the technical scheme provided by the embodiment of the invention, the malicious file identification model of the related sample class can be rapidly updated, and the method can be rapidly applied to the identification of the latest malicious file and timely respond to the latest threat. The invention also discloses a malicious file identification device, equipment and a storage medium, and has corresponding technical effects.

Description

Malicious file identification method, device, equipment and storage medium
Technical Field
The present invention relates to the field of computer application technologies, and in particular, to a malicious file identification method, apparatus, device, and storage medium.
Background
With the rapid development of computer technology, malicious file identification has gradually evolved from approaches based on static malicious-file signatures, rule matching, heuristic antivirus scanning and the like to approaches in which a machine learning algorithm is used to obtain a malicious file identification model. Such a model is obtained with a machine learning algorithm, for example a classification algorithm; thanks to its generalization capability, it can learn detection rules from known file samples and use them to identify unknown malicious files that bear a certain similarity to those samples.
The generalization capability of the malicious file identification model depends on the richness of the training sample set. If the training sample set does not contain information describing the variants of a malicious file sample, the trained identification model will not cope well with such variants. That is, the detection rate is relatively good when the malicious file identification model has just been trained, but it decreases continually over time. This phenomenon arises because malicious files evolve and mutate continuously, so the further apart in time two active samples are, the less similar they tend to be.
Therefore, the malicious file identification model needs to be updated by obtaining a new file sample, so as to obtain a better detection rate and identification accuracy of the malicious file. In the prior art, when a malicious file identification model is updated, all file samples, including new file samples and historical file samples, are mostly added to a training sample set, the malicious file identification model is retrained to obtain an updated malicious file identification model, and then the updated malicious file identification model is used for identifying malicious files.
This approach has a clear drawback. Because all file samples are added to the training sample set, the set becomes very large, and training the malicious file identification model on it takes a long time. An updated model is therefore difficult to obtain in a short time, so the latest malicious files cannot be identified quickly and the latest threats cannot be responded to in time.
Disclosure of Invention
The invention aims to provide a malicious file identification method, a malicious file identification device, malicious file identification equipment and a storage medium, so that a malicious file identification model can be updated quickly, and latest threats can be responded in time.
In order to solve the technical problems, the invention provides the following technical scheme:
a malicious file identification method, comprising:
dividing all historical file samples into a plurality of sample classes in advance;
training to obtain a malicious file identification model corresponding to each sample class based on historical file samples in each sample class;
when a new file sample exists, determining a sample class of the new file sample;
updating a malicious file identification model corresponding to the sample class to which the new file sample belongs based on the new file sample and a historical file sample in the sample class to which the new file sample belongs;
and when the file to be identified is detected, identifying the file to be identified by using a corresponding malicious file identification model.
In a specific embodiment of the present invention, the pre-dividing all the history file samples into a plurality of sample classes includes:
acquiring all historical file samples;
extracting the original characteristics of each historical file sample;
and clustering all the historical file samples based on the original characteristics of each historical file sample to obtain a plurality of sample classes.
In a specific embodiment of the present invention, the clustering all the history file samples based on the original features of each history file sample to obtain a plurality of sample classes includes:
and performing clustering processing on all the historical file samples at least twice based on the original characteristics of each historical file sample to obtain a plurality of sample classes.
In an embodiment of the present invention, the at least two clustering processes include at least one clustering process using a density-based clustering algorithm.
In a specific embodiment of the present invention, the clustering all the history file samples at least twice based on the original features of each history file sample to obtain a plurality of sample classes includes:
based on the original characteristics of each historical file sample, performing pre-clustering processing on all historical file samples by using a K-means clustering algorithm to obtain a plurality of sample classes;
and for each sample class, clustering the historical file samples in the sample class by using a density-based clustering algorithm to obtain a plurality of sample classes.
In an embodiment of the present invention, the original feature includes at least one of a binary feature, a character string feature, an assembly code feature, and a dynamic feature.
In a specific embodiment of the present invention, the updating, based on the new file sample and the historical file sample in the sample class to which the new file sample belongs, the malicious file identification model corresponding to the sample class to which the new file sample belongs includes:
determining a training sample set based on the new file sample and historical file samples in the sample class to which the new file sample belongs;
and updating the malicious file identification model corresponding to the sample class to which the new file sample belongs by using the training sample set.
In a specific embodiment of the present invention, determining a training sample set based on the new file sample and the historical file sample in the sample class to which the new file sample belongs includes:
in the new file sample and the historical file sample in the sample class to which the new file sample belongs:
selecting file samples meeting preset necessary conditions, extracting other unselected file samples according to a preset extraction rule, and generating a training sample set;
the preset extraction rule is as follows: the later the file sample acquisition time is, the higher the probability of being extracted is, and the earlier the file sample acquisition time is, the lower the probability of being extracted is.
In a specific embodiment of the present invention, the determining the sample class to which the new file sample belongs includes:
calculating the distance between the new file sample and the center point of each sample class through a clustering algorithm;
and determining the sample class of the new file sample according to the distance.
In a specific embodiment of the present invention, the identifying, when a file to be identified is detected, the file to be identified by using a corresponding malicious file identification model includes:
when a file to be identified exists, determining a sample class to which the file to be identified belongs;
identifying the file to be identified by using a malicious file identification model corresponding to the sample class to which the file to be identified belongs;
and determining whether the file to be identified is a malicious file or not according to the identification result.
In a specific embodiment of the present invention, the identifying a file to be identified by using a malicious file identification model corresponding to a sample class to which the file to be identified belongs includes:
identifying the file to be identified with the malicious file identification model corresponding to each sample class to which the file to be identified belongs, respectively, to obtain an identification result from the model of each such sample class;
correspondingly, the determining whether the file to be identified is a malicious file according to the identification result includes:
and integrating the obtained identification results, and determining whether the file to be identified is a malicious file.
In an embodiment of the present invention, the integrating the obtained recognition results includes:
determining the weight of the identification result of each sample class to which the file to be identified belongs according to the distance between the file to be identified and the determined central point of each sample class to which the file to be identified belongs;
and carrying out weighted average on the obtained identification results according to the determined weight.
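The distance-weighted integration of identification results can be sketched as follows. This is a minimal illustration, not the patented implementation: the source only says that weights are determined from the distance to each class center, so the inverse-distance weighting, the score range of 0 to 1, the 0.5 decision threshold, and the names `fuse_results` and `is_malicious` are all assumptions.

```python
from typing import Dict


def fuse_results(scores: Dict[str, float],
                 distances: Dict[str, float],
                 eps: float = 1e-9) -> float:
    """Distance-weighted average of per-class identification scores.

    Assumption: weight = 1 / (distance + eps), so that the model of a
    nearer sample class contributes more to the final result.
    """
    weights = {cid: 1.0 / (distances[cid] + eps) for cid in scores}
    total = sum(weights.values())
    return sum(weights[cid] * scores[cid] for cid in scores) / total


def is_malicious(scores: Dict[str, float],
                 distances: Dict[str, float],
                 threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: malicious if the fused score reaches the threshold."""
    return fuse_results(scores, distances) >= threshold
```

With equal distances the function reduces to a plain average of the per-class results; any other monotonically decreasing weight function could be substituted.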
A malicious file identification apparatus comprising:
the sample dividing module is used for dividing all historical file samples into a plurality of sample classes in advance;
the model training module is used for training and obtaining a malicious file identification model corresponding to each sample class based on the historical file samples in each sample class;
the model updating module is used for determining a sample class of a new file sample when the new file sample exists; updating a malicious file identification model corresponding to the sample class to which the new file sample belongs based on the new file sample and a historical file sample in the sample class to which the new file sample belongs;
and the file identification module is used for identifying the file to be identified by using a corresponding malicious file identification model when the file to be identified is detected.
A malicious file identification device comprising:
a memory for storing a computer program;
a processor for implementing the steps of any of the above malicious file identification methods when executing the computer program.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of any of the above-described malicious file identification methods.
With the technical scheme provided by the embodiment of the invention, all historical file samples are divided into a plurality of sample classes in advance, and each sample class has its own malicious file identification model. When a new file sample exists, the sample class to which it belongs is determined, and the malicious file identification model corresponding to that sample class is updated based on the new file sample and the historical file samples in that class. When a file to be identified is detected, it is identified with the corresponding malicious file identification model. Because a new file sample is assigned to a specific sample class and only the model of that class is updated, and because the number of samples in one sample class is far smaller than the total number of samples, the model of a sample class can be updated quickly. The method can therefore be applied quickly to the identification of the latest malicious files and can respond to the latest threats in time.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart illustrating an implementation of a malicious file identification method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a malicious file identification apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a malicious file identification device in an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, an implementation flowchart of a malicious file identification method provided in an embodiment of the present invention is shown, where the method may include the following steps:
S110: dividing all historical file samples into a plurality of sample classes in advance.
In the embodiment of the invention, historical file samples can be obtained through information collection, reports from security software and the like. The number of historical file samples obtained is huge; all of them can be divided in advance into a plurality of sample classes, each of which contains a plurality of historical file samples.
In practical applications, the sample class of the historical file sample can be divided by a technician according to experience or set rules.
In one embodiment of the present invention, all history file samples may be divided into a plurality of sample classes in advance by:
step one: acquiring all historical file samples;
step two: extracting the original features of each historical file sample;
step three: clustering all the historical file samples based on the original features of each historical file sample to obtain a plurality of sample classes.
For convenience of description, the above three steps are combined for illustration.
Historical file samples are samples that were collected previously and can take part in training the malicious file identification model. All historical file samples are acquired, and the original features of each are extracted.
The number of historical file samples is huge, often on the order of tens of millions, so the feature extraction performed on them should not be too complex. Only shallow features need be extracted as original features, including but not limited to binary strings of the file extracted in n-gram form, all character strings contained in the file, and all import and export table information contained in the PE header of the file. This information can be compressed into feature vectors of a few hundred dimensions by means of a hashing technique.
Based on the original features of each historical file sample, all historical file samples can be clustered to obtain a plurality of sample classes. Specifically, a density-based clustering algorithm can be applied directly to all historical file samples to obtain the sample classes. The density-based clustering algorithm may be DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
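The compression of shallow n-gram features into a fixed-length vector by hashing can be sketched as below. This is only an illustration under stated assumptions: the source does not name a hash function, n-gram length, vector size or normalisation, so the use of CRC32, `n=3`, `dim=256`, the count normalisation and the function name `hashed_ngram_features` are all hypothetical choices.

```python
import zlib
from typing import List


def hashed_ngram_features(data: bytes, n: int = 3, dim: int = 256) -> List[float]:
    """Hash every byte n-gram of `data` into a fixed-length count vector.

    Each n-gram is mapped to one of `dim` buckets by CRC32 (an assumed,
    non-cryptographic hash); colliding n-grams simply share a bucket.
    """
    vec = [0.0] * dim
    for i in range(len(data) - n + 1):
        gram = data[i:i + n]
        vec[zlib.crc32(gram) % dim] += 1.0
    total = sum(vec)
    # Normalise the counts so that files of different sizes are comparable.
    return [v / total for v in vec] if total else vec
```

The same bucket vector could also absorb extracted strings or import table names by hashing their encoded bytes, which keeps the per-file cost linear in file size.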
In a specific embodiment of the present invention, based on the original features of each history file sample, all history file samples may be clustered at least twice to obtain a plurality of sample classes. Specifically, all the historical file samples may be clustered once to obtain a plurality of large classes, and then the historical file samples in each large class are clustered once to obtain a plurality of small classes, where each small class is the final desired sample class. Wherein the at least two clustering processes comprise at least one clustering process using a density-based clustering algorithm.
However, the number of historical file samples is huge, and running a density-based clustering algorithm directly on all of them consumes a large amount of resources. In the embodiment of the invention, therefore, all historical file samples are first pre-clustered with the K-means clustering algorithm, based on the original features of each sample, to obtain a plurality of large classes; the historical file samples in each large class are then clustered with the density-based clustering algorithm to obtain the small classes that serve as the final sample classes. This saves computing, storage and other resources.
In the embodiment of the present invention, to facilitate clustering with a clustering algorithm, the original features may specifically include at least one of a binary feature, a character string feature, an assembly code feature, and a dynamic feature. The original features may also include other types of features, as the case may be. The binary feature is the binary stream (composed of 0s and 1s) that represents the historical file sample. The character string feature consists of the various character strings extracted from the historical file sample, from which special information such as section names, compiler names, URLs (Uniform Resource Locators) in the program, IP addresses and mail addresses, as well as the names of called system functions, can be obtained. The assembly code feature is obtained by disassembling the historical file sample and extracting, from the disassembly result, information such as the access sequence of registers and the execution sequence of instruction codes. The dynamic feature consists of the various file operations generated when the historical file sample is executed, and may specifically include time features, file access pattern features, and other system call features.
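The two-stage scheme (K-means pre-clustering followed by density-based clustering inside each coarse class) can be sketched in plain Python as follows. This is a toy illustration with assumed parameters (`k`, `eps`, `min_pts`, the iteration count and seed are all hypothetical), written with simple quadratic-time loops; a production system would presumably rely on an optimized library.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


def kmeans(points: List[Point], k: int, iters: int = 20, seed: int = 0) -> List[int]:
    """Plain Lloyd's K-means; returns one coarse label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels


def dbscan(points: List[Point], eps: float, min_pts: int) -> List[int]:
    """Minimal DBSCAN; returns a cluster label per point, -1 for noise."""
    labels: List[Optional[int]] = [None] * len(points)

    def neighbours(i: int) -> List[int]:
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nb = neighbours(i)
        if len(nb) < min_pts:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(nb)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbours(j)
            if len(nb_j) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nb_j)
    return [lab if lab is not None else -1 for lab in labels]


def two_stage_cluster(points: List[Point], k: int, eps: float,
                      min_pts: int) -> List[Tuple[int, int]]:
    """Final class of each point as (coarse K-means label, DBSCAN label)."""
    coarse = kmeans(points, k)
    final: List[Tuple[int, int]] = [(0, -1)] * len(points)
    for c in range(k):
        idx = [i for i, lab in enumerate(coarse) if lab == c]
        sub = dbscan([points[i] for i in idx], eps, min_pts)
        for i, s in zip(idx, sub):
            final[i] = (c, s)
    return final
```

Because DBSCAN only runs inside each coarse class, its pairwise distance computations are limited to that class rather than the full sample set, which is the resource saving the paragraph above refers to.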
In practical application, the clustering method is not limited to the above clustering method, and a suitable clustering algorithm can be selected for clustering according to the number and types of the historical file samples, the types of the extracted original features, and the like, and more levels of clustering can be implemented.
S120: and training to obtain a malicious file identification model corresponding to each sample class based on the historical file samples in each sample class.
After all the historical file samples are divided into a plurality of sample classes in advance, each sample class comprises a plurality of historical file samples. For each sample class, based on the historical file samples in the sample class, a malicious file identification model corresponding to the sample class can be obtained through training. Specifically, all the historical file samples in the sample class may be added to the training set of the sample class, or a part of the historical file samples in the sample class may be selected to be added to the training set of the sample class according to a set selection rule, and the training set is used to train and obtain the malicious file identification model corresponding to the sample class.
In a specific embodiment of the present invention, a feature extraction model may be selected to perform feature extraction on each historical file sample in the sample class to obtain original features, such as binary features, character string features, assembly code features, dynamic features, and the like, and then a feature vector is obtained based on the original features, so as to train and obtain a malicious file identification model corresponding to the sample class.
In another specific embodiment of the present invention, each feature extraction model in a set of feature extraction models may be used to perform feature extraction on each historical file sample in the sample class, so as to obtain the original features corresponding to each feature extraction model, such as a binary feature, a character string feature, an assembly code feature, a dynamic feature, and the like. This approach combines the strengths of multiple feature extraction models or algorithms: each model extracts features from the same historical file sample, and feature fusion is then performed on the resulting original features. The original features produced by the feature extraction models are structurally complex and high-dimensional, which hinders the subsequent feature fusion operation, so dimensionality reduction can be applied to them first. After dimensionality reduction, the original features are fused to obtain a fused feature vector, and the malicious file identification model corresponding to the sample class is then trained on the fused feature vector.
A malicious file identification model corresponding to each sample class is trained on the historical file samples in that class. The number of malicious file identification models obtained therefore equals the number of sample classes into which the historical file samples are divided, with each sample class corresponding to one model.
S130: and when a new file sample exists, determining the sample class of the new file sample.
After the malicious file identification model corresponding to each sample class is obtained through training, the malicious file identification model can be issued to a user for use, and a file to be identified is identified.
In order to improve the detection rate and the identification accuracy of the malicious files, the malicious file identification models corresponding to the sample classes can be updated based on the new file samples.
In the embodiment of the present invention, new file samples within a set time period may be obtained, for example the new file samples obtained in the current update cycle. The current update cycle may be the time period from the moment the set update trigger condition was last reached to the moment it is reached this time. A new file sample may be a normal file sample or a malicious file sample. Specifically, each executable file collected on a terminal or a traffic device, such as a PE (Portable Executable, the executable file format on Windows) file, may be used as a new file sample.
In practical application, a file sample that cannot be determined to be malicious after identification by the malicious file identification model can serve as a candidate new file sample. That is, if a file sample cannot be determined to be a malicious file after being identified by the model, it can be analyzed further, for example manually, to determine whether it is malicious; if so, it can be used as a new file sample to update the corresponding malicious file identification model, which improves both update efficiency and identification accuracy.
The update triggering condition may be set according to an actual situation, for example, when a set update period is reached, the set update triggering condition is considered to be reached, or when the number of the acquired new file samples reaches a set number requirement, the set update triggering condition is considered to be reached, or when an update instruction is received, the set update triggering condition is considered to be reached.
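The three example trigger conditions above (update period reached, enough new samples collected, explicit update instruction received) can be combined in a small check such as the following. The class name `UpdateTrigger`, its field names and the concrete values in the usage note are illustrative assumptions, not part of the source.

```python
from dataclasses import dataclass


@dataclass
class UpdateTrigger:
    """Hypothetical container for the update trigger settings."""
    period_seconds: float   # length of one update period
    min_new_samples: int    # number of new samples that forces an update

    def should_update(self, seconds_since_last: float,
                      new_sample_count: int,
                      instruction_received: bool = False) -> bool:
        # Any one of the three conditions is sufficient to trigger an update.
        return (instruction_received
                or seconds_since_last >= self.period_seconds
                or new_sample_count >= self.min_new_samples)
```

For instance, `UpdateTrigger(period_seconds=86400, min_new_samples=1000)` would trigger once a day, or earlier if a thousand new samples accumulate or an operator issues an update instruction.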
When there is a new file sample, the sample class to which the new file sample belongs may be determined. If there are more than one new file sample, the sample class of each new file sample is determined.
Specifically, the sample class to which the new file sample belongs can be determined through a clustering algorithm. For example, the distance between the new file sample and the center point of each sample class, such as the Hamming distance or the Euclidean distance, is calculated through the clustering algorithm, and the sample class to which the new file sample belongs is determined according to that distance. For instance, every sample class whose distance is smaller than a preset threshold may be determined as a sample class to which the new file sample belongs; if several sample classes qualify, the new file sample is assigned to all of them. Alternatively, the sample class with the smallest distance may be determined as the sample class to which the new file sample belongs, and the new file sample is assigned to that class only. The above describes a single new file sample; if there are several, each can be assigned to its sample class or classes in the same way.
In an embodiment of the present invention, several types of distance may be used, such as the Hamming distance and the Euclidean distance. When determining the sample class to which the new file sample belongs, a weighted average of these distances between the new file sample and the center point of each sample class may be computed, and the sample class is then determined according to the weighted average. Assigning the new file sample to a specific sample class allows the malicious file identification model corresponding to that class to be updated in a more targeted way.
For example, the Hamming distance between the new file sample and the center point of each sample class is calculated by a first clustering algorithm, and the Euclidean distance between the new file sample and the center point of each sample class is calculated by a second clustering algorithm. If the Hamming distance indicates that the new file sample is close to the center points of several sample classes, the weighted average of the Hamming distance and the Euclidean distance can be determined for each of those sample classes, and the sample class to which the new file sample belongs is determined according to the weighted average, for example as the sample class with the smallest weighted average.
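Both assignment rules described above (all classes within a threshold, or the single nearest class) can be sketched with Euclidean distance as follows. The function name `assign_classes`, the class identifiers and the dictionary layout of the class centers are assumptions made for illustration.

```python
import math
from typing import Dict, List, Optional, Sequence


def assign_classes(sample: Sequence[float],
                   centers: Dict[str, Sequence[float]],
                   threshold: Optional[float] = None) -> List[str]:
    """Return the sample class or classes a feature vector is assigned to.

    With `threshold` set, every class whose center is closer than the
    threshold is returned (possibly several, possibly none); otherwise
    only the nearest class is returned.
    """
    distances = {cid: math.dist(sample, c) for cid, c in centers.items()}
    if threshold is not None:
        return sorted(cid for cid, d in distances.items() if d < threshold)
    return [min(distances, key=distances.get)]
```

Note that the threshold rule can return an empty list when the sample is far from every center; the source does not say how that case is handled, so a caller would need its own fallback, such as the nearest-class rule.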
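A weighted combination of Hamming and Euclidean distance can be sketched as below. The source does not specify the weights or how the two distances are scaled, so the equal default weights, the normalisation of the Hamming distance by vector length, the pairing of a bit vector with a real-valued vector per class center, and all function names are assumptions.

```python
import math
from typing import Dict, Sequence, Tuple


def hamming(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def combined_distance(bits: Sequence[int], vec: Sequence[float],
                      center_bits: Sequence[int], center_vec: Sequence[float],
                      w_hamming: float = 0.5, w_euclid: float = 0.5) -> float:
    """Weighted average of normalised Hamming and Euclidean distance."""
    h = hamming(bits, center_bits)
    e = math.dist(vec, center_vec)
    return (w_hamming * h + w_euclid * e) / (w_hamming + w_euclid)


def nearest_class(bits: Sequence[int], vec: Sequence[float],
                  centers: Dict[str, Tuple[Sequence[int], Sequence[float]]]) -> str:
    """Pick the class whose center has the smallest weighted-average distance."""
    return min(centers,
               key=lambda cid: combined_distance(bits, vec, *centers[cid]))
```

In the scenario described above, the Hamming distance alone may leave several classes tied or nearly tied; adding the Euclidean term breaks the tie.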
S140: and updating the malicious file identification model corresponding to the sample class to which the new file sample belongs based on the new file sample and the historical file sample in the sample class to which the new file sample belongs.
In the embodiment of the invention, when the set update trigger condition is reached, the malicious file identification model corresponding to the sample class to which the new file sample belongs can be updated based on the new file sample and the historical file samples in that sample class. For example, suppose the new file sample N belongs to class A, the historical file samples in class A are O1, O2, O3 and O4, and the malicious file identification model corresponding to class A is X. Model X can then be updated based on the new file sample N and the historical file samples O1, O2, O3 and O4, and after the update the malicious file identification model corresponding to class A is the updated X. The updated malicious file identification model may be published to users.
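The class A example can be sketched as a per-class model registry in which only the affected class is retrained. The `train` callable stands in for whatever learning algorithm is used; it, the function name `update_affected_models` and the choice to fold new samples into the class history after the update are assumptions, not details given in the source.

```python
from typing import Any, Callable, Dict, List


def update_affected_models(models: Dict[str, Any],
                           history: Dict[str, List[Any]],
                           new_by_class: Dict[str, List[Any]],
                           train: Callable[[List[Any]], Any]) -> List[str]:
    """Retrain only the models whose sample class received new samples.

    Returns the identifiers of the classes that were updated; models of
    all other classes are left untouched.
    """
    updated = []
    for cid, new_samples in new_by_class.items():
        if not new_samples:
            continue
        training_set = history[cid] + new_samples
        models[cid] = train(training_set)
        # Assumption: new samples become part of the class history afterwards.
        history[cid] = training_set
        updated.append(cid)
    return updated
```

Here the whole class history plus the new samples is used as the training set, which corresponds to the simplest option; a sampling step could be applied to `training_set` before calling `train`.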
Based on the new file sample and the historical file samples in the sample class to which the new file sample belongs, a training sample set may be determined first. One simple way is to add both the new file sample and the historical file sample in the sample class to the training sample set. Or simply add all new file samples to the training sample set.
However, if only new file samples are used, the model may overfit to emerging, currently popular samples, which lowers the detection rate for samples that are old but still active. To achieve better coverage and avoid this problem, the embodiment of the invention determines the training sample set through the following steps:
in the new file sample and the historical file sample in the sample class to which the new file sample belongs:
selecting file samples meeting preset necessary conditions, extracting other unselected file samples according to a preset extraction rule, and generating a training sample set;
the preset extraction rule is as follows: the later the file sample acquisition time is, the higher the probability of being extracted is, and the earlier the file sample acquisition time is, the lower the probability of being extracted is.
There are multiple new file samples and multiple historical file samples.
The necessary conditions can be set according to actual conditions. For example, if the historical file samples include common normal files (such as Windows system files and common software files), well-known malware (malicious files published or verified by an authoritative organization), historically known false-positive samples, and false-negative samples, these file samples are considered to meet the preset necessary conditions and can be selected as training samples. These file samples may be included in the training sample set at each training. In addition, the new file sample may also be considered to meet the preset necessary conditions, in which case the new file sample must be included in the training sample set.
The file samples other than the selected file samples meeting the preset necessary conditions are extracted according to a preset extraction rule, and the extracted file samples are taken as training samples. A training sample set is then generated based on the selected and extracted file samples, that is, a training sample set containing all selected and extracted file samples.
The preset extraction rule may be: the later a file sample was acquired, the higher its probability of being extracted, and the earlier a file sample was acquired, the lower its probability of being extracted. For example, a file sample acquired on the current day is more likely to be extracted than a file sample acquired in the previous week. The extraction may use a rejection sampling technique, which ensures that each decision on whether to add a file sample to the training sample set has O(1) complexity, without the need to calculate a complex joint probability function.
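The selection and extraction steps above can be sketched as below. This is an illustration under stated assumptions: the description does not give the acceptance function, so the exponential decay with a `half_life_days` parameter is hypothetical, as are the dictionary fields `acquired` (a timestamp in seconds) and `mandatory`. Each non-mandatory file sample gets one independent accept/reject draw, in the spirit of rejection sampling, so the per-sample decision cost is O(1).

```python
import random

SECONDS_PER_DAY = 86400.0

def build_training_set(samples, now, is_mandatory, half_life_days=30.0, rng=None):
    """Select mandatory samples, then accept each remaining sample with a
    probability that decays with its age (assumed exponential decay)."""
    rng = rng or random.Random()
    selected = []
    for s in samples:
        if is_mandatory(s):
            # Samples meeting the preset necessary conditions are always kept.
            selected.append(s)
            continue
        age_days = max(0.0, (now - s["acquired"]) / SECONDS_PER_DAY)
        accept_prob = 0.5 ** (age_days / half_life_days)
        # One O(1) accept/reject decision per sample.
        if rng.random() < accept_prob:
            selected.append(s)
    return selected

now = 4_000_000_000  # hypothetical current time, in seconds
samples = [
    {"id": "system_file", "acquired": now - 3000 * SECONDS_PER_DAY, "mandatory": True},
    {"id": "fresh", "acquired": now, "mandatory": False},
    {"id": "ancient", "acquired": now - 30000 * SECONDS_PER_DAY, "mandatory": False},
]
training_set = build_training_set(
    samples, now, is_mandatory=lambda s: s["mandatory"], rng=random.Random(0)
)
chosen_ids = {s["id"] for s in training_set}
```

With this decay function, a file sample acquired at the current moment is always accepted and the acceptance probability halves every `half_life_days`; any other monotonically decreasing function of age would satisfy the stated rule equally well.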
Of course, if the number of new file samples in a sample class is large, the new file samples may instead be treated as falling outside the necessary conditions and be extracted according to the preset extraction rule. Because the new file samples were acquired in the most recent period, their probability of being extracted is high.
In practical application, among the file samples contained in the sample class to which the new file sample belongs, at least one virus sample of each virus family can be selected as a training sample and added to the training sample set. The family names of virus families can be obtained from the antivirus software of well-known vendors.
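A minimal sketch of the per-family selection, assuming each file sample carries a hypothetical `family` field holding the family name reported by antivirus software (and `None` for samples without a known family). Choosing the representative at random is an assumption; the description only requires at least one sample per family.

```python
import random
from collections import defaultdict

def pick_one_per_family(samples, rng=None):
    """Return one randomly chosen sample for every virus family present."""
    rng = rng or random.Random()
    by_family = defaultdict(list)
    for s in samples:
        if s.get("family"):
            by_family[s["family"]].append(s)
    return [rng.choice(members) for members in by_family.values()]

class_samples = [
    {"id": 1, "family": "FamilyX"},
    {"id": 2, "family": "FamilyX"},
    {"id": 3, "family": "FamilyY"},
    {"id": 4, "family": None},  # normal file or unknown family
]
representatives = pick_one_per_family(class_samples, rng=random.Random(0))
```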
After the training sample set is determined, it can be used to train the malicious file identification model corresponding to the sample class, and the updated malicious file identification model is obtained. Specifically, the training sample set may be used to retrain the malicious file identification model corresponding to the sample class. Alternatively, an intermediate result of the malicious file identification model that meets a preset requirement may be determined from a previous training process, and the malicious file identification model may be trained with the training sample set starting from that intermediate result.
Although all the historical file samples are divided into a plurality of sample classes in advance, the number of historical file samples in each sample class is still large, so the determined training sample set generally contains many file samples, and even with distributed training a long time is needed for the model to converge. Therefore, in practical application, the malicious file identification model can be trained by an incremental training method. Instead of restarting from randomly initialized parameters each time, training proceeds from an intermediate result of the last training process (the last training process may be the malicious file identification model training process in step S120 or, if the malicious file identification model has already been updated, the training process of the most recent update). For example, assume that 3000 iterations are required to complete model training. Because malicious files are similar to one another, training with the adjusted training sample set does not need to start from randomly initialized parameters; it can start from an intermediate result of the last training process, such as the result at round 2000, and continue with the new training sample set. This approach, which is similar to transfer learning, shortens the time the final model needs to converge.
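The benefit of starting from an intermediate result rather than from fresh parameters can be illustrated with a deliberately tiny example. The one-parameter least-squares model below is a stand-in chosen only so that the effect is easy to verify; it is not the malicious file identification model of the embodiment, and the data, learning rate and tolerance are all hypothetical.

```python
def train(xs, ys, w0=0.0, lr=0.1, tol=1e-6, max_iters=10_000):
    """Gradient descent on mean squared error for the model y = w * x.
    Returns (w, iterations_used)."""
    w = w0
    n = len(xs)
    for it in range(max_iters):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        if abs(grad) < tol:
            return w, it
        w -= lr * grad
    return w, max_iters

# Previous training process on the old training set, stopped at an
# intermediate round to act as the saved checkpoint.
old_xs, old_ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
checkpoint, _ = train(old_xs, old_ys, max_iters=5)

# Updated training set: similar to the old one, but slightly shifted.
new_xs, new_ys = [1.0, 2.0, 3.0, 4.0], [2.1, 4.2, 6.3, 8.4]

w_cold, cold_iters = train(new_xs, new_ys, w0=0.0)         # fresh start
w_warm, warm_iters = train(new_xs, new_ys, w0=checkpoint)  # incremental start
```

Both runs reach the same solution, but the run that starts from the checkpoint needs fewer iterations because the old and new training sets are similar, which is the premise the description relies on.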
It should be noted that, when the malicious file identification model corresponding to each sample class is obtained through training based on the historical file samples in each sample class in step S120, some file samples may be selected from the historical file samples of the sample class as training samples for each sample class, and the malicious file identification model corresponding to the sample class is obtained through training. For the selection rule of the file sample as the training sample, the determination method of the training sample set may be referred to, and details are not repeated.
S150: When a file to be identified is detected, identifying the file to be identified by using the corresponding malicious file identification model.
In practical application, executable files on a terminal, a network traffic device, or the like may be monitored. Each monitored executable file may be determined as a file to be identified, or each monitored executable file that meets an identification condition may be determined as a file to be identified.
When the files to be identified are to be identified, the corresponding malicious file identification model can be used for identifying the files to be identified.
In one embodiment of the present invention, step S150 may include the following steps:
Step one: When a file to be identified exists, determining the sample class to which the file to be identified belongs. The specific determination method may refer to the above method for determining the sample class to which the new file sample belongs.
Step two: Identifying the file to be identified by using the malicious file identification model corresponding to the sample class to which the file to be identified belongs. After the sample class of the file to be identified is determined, the file can be identified by using the malicious file identification model corresponding to that sample class. If the malicious file identification model corresponding to the sample class has been updated, the latest malicious file identification model corresponding to the sample class is used.
Step three: Determining whether the file to be identified is a malicious file according to the identification result.
When the file to be identified is identified by using the malicious file identification model corresponding to the sample class to which it belongs, the identification result may take one of two forms: the probability that the file is a normal file together with the probability that it is a malicious file, or a direct determination that the file is a normal file, a malicious file, or an abnormal file. According to the identification result, whether the file to be identified is a malicious file can be determined.
If the identification result takes the former form, the file to be identified is determined to be a normal file when the probability that it is a normal file is greater than the probability that it is a malicious file; it is determined to be a malicious file when the probability that it is a malicious file is greater than the probability that it is a normal file; and it is determined to be an abnormal file when the two probabilities are equal.
In the embodiment of the invention, if the file to be identified is determined to be a malicious file, it can be reported to a security system so that the malicious file can be intercepted or removed; if the file to be identified is determined to be a normal file, no processing is needed; if the file to be identified is determined to be an abnormal file, it can be reported to a management system so that an administrator can further confirm whether the abnormal file is a malicious file.
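The probability comparison and the follow-up handling can be sketched as a small decision function. The `tol` parameter is an assumption added because two floating-point probabilities are rarely exactly equal; with the default of 0.0 the function follows the rule as stated. The action names in `ACTIONS` are hypothetical labels, not interfaces of any real security system.

```python
def decide(p_normal, p_malicious, tol=0.0):
    """Map the two probabilities to 'normal', 'malicious' or 'abnormal'."""
    if abs(p_normal - p_malicious) <= tol:
        return "abnormal"  # the model cannot tell the two apart
    return "malicious" if p_malicious > p_normal else "normal"

# Hypothetical follow-up action for each verdict.
ACTIONS = {
    "malicious": "report_to_security_system",
    "normal": "no_action",
    "abnormal": "report_to_management_system",
}

verdict = decide(p_normal=0.2, p_malicious=0.8)
action = ACTIONS[verdict]
```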
By applying the method provided by the embodiment of the invention, all historical file samples are divided into a plurality of sample classes in advance, and each sample class corresponds to its own malicious file identification model. When a new file sample exists, the sample class to which it belongs is determined, and the malicious file identification model corresponding to that sample class can be updated based on the new file sample and the historical file samples in that sample class. When a file to be identified is detected, it is identified by using the corresponding malicious file identification model. Because the historical file samples are divided into sample classes in advance, a new file sample can be assigned to a specific sample class and only the malicious file identification model of that sample class needs to be updated. Since the sample size of one sample class is far smaller than the total sample size, the malicious file identification model of one sample class can be updated rapidly and applied promptly to the identification of the latest malicious files, so that the latest threats can be responded to in time.
In an embodiment of the present invention, there are a plurality of sample classes to which the file to be recognized belongs, and the file to be recognized can be recognized by using the malicious file recognition model corresponding to each sample class to which the file to be recognized belongs, respectively, so as to obtain the recognition result of the malicious file recognition model corresponding to each sample class;
correspondingly, the obtained identification results can be integrated to determine whether the file to be identified is a malicious file.
In the embodiment of the invention, when the sample class to which the file to be identified belongs is determined, it may happen that the distances alone cannot settle the matter, so that the file to be identified belongs to a plurality of sample classes at the same time. In this case, the file to be identified may be identified by using the malicious file identification model corresponding to each sample class to which it belongs, so as to obtain the identification result of the malicious file identification model corresponding to each sample class. The obtained identification results are then integrated, for example by weighted averaging, and whether the file to be identified is a malicious file is determined according to the integrated result. This improves identification accuracy.
Specifically, the weight of the identification result of each sample class to which the file to be identified belongs may be determined according to the distance between the file to be identified and the determined center point of each sample class to which the file to be identified belongs, and the obtained identification results may be weighted and averaged according to the determined weight.
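One way to realize the distance-based weighting is sketched below. The description states only that the weights are determined according to the distances; the inverse-distance weighting used here, the small `eps` guard against division by zero, and the representation of each identification result as a probability of being malicious are all assumptions made for illustration.

```python
def integrate(results, eps=1e-9):
    """results: list of (distance_to_class_center, p_malicious) pairs.
    Returns the weighted average of p_malicious, with closer classes
    weighted more heavily (assumed inverse-distance weights)."""
    weights = [1.0 / (d + eps) for d, _ in results]
    total = sum(weights)
    return sum(w * p for w, (_, p) in zip(weights, results)) / total

# The file is close to class A (distance 1.0) and farther from class B (3.0).
combined = integrate([(1.0, 0.9), (3.0, 0.3)])
is_malicious = combined > 0.5  # hypothetical decision threshold
```

In this example the nearer class receives three times the weight of the farther one, so its identification result dominates the integrated result.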
Corresponding to the above method embodiment, the embodiment of the present invention further provides a malicious file identification apparatus. The malicious file identification apparatus described below and the malicious file identification method described above may be referred to in correspondence with each other.
Referring to fig. 2, the apparatus includes the following modules:
the sample dividing module 210 is configured to divide all history file samples into a plurality of sample classes in advance;
the model training module 220 is configured to train to obtain a malicious file identification model corresponding to each sample class based on the historical file samples in each sample class;
the model updating module 230 is configured to determine, when a new file sample exists, a sample class to which the new file sample belongs; updating a malicious file identification model corresponding to the sample class to which the new file sample belongs based on the new file sample and the historical file sample in the sample class to which the new file sample belongs;
and the file identification module 240 is configured to identify the file to be identified by using a corresponding malicious file identification model when the file to be identified is detected.
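How the four modules fit together can be sketched in a single class. Everything concrete here is an assumption made to keep the sketch short: the sample classes are passed in already divided (the clustering of module 210 is not reproduced), class membership is decided by the nearest class center using `math.dist`, class centers are not recomputed on update, and `majority_label_model` is a toy stand-in for a real classifier.

```python
import math
from collections import Counter

def majority_label_model(samples):
    """Toy stand-in for model training: predicts the most common label."""
    label = Counter(lbl for _, lbl in samples).most_common(1)[0][0]
    return lambda features: label

class MaliciousFileIdentifier:
    def __init__(self, train_fn):
        self.train_fn = train_fn  # hypothetical: list of (features, label) -> model
        self.samples, self.centers, self.models = {}, {}, {}

    def fit(self, classes):
        """Modules 210 and 220: store each sample class and train its model."""
        for name, samples in classes.items():
            self.samples[name] = list(samples)
            dims = zip(*(f for f, _ in samples))
            self.centers[name] = tuple(sum(d) / len(samples) for d in dims)
            self.models[name] = self.train_fn(self.samples[name])

    def nearest_class(self, features):
        return min(self.centers, key=lambda n: math.dist(features, self.centers[n]))

    def update(self, features, label):
        """Module 230: retrain only the model of the affected sample class."""
        name = self.nearest_class(features)
        self.samples[name].append((features, label))
        self.models[name] = self.train_fn(self.samples[name])
        return name

    def identify(self, features):
        """Module 240: route the file to the model of its sample class."""
        return self.models[self.nearest_class(features)](features)

identifier = MaliciousFileIdentifier(majority_label_model)
identifier.fit({
    "A": [((0.0, 0.0), "normal"), ((0.2, 0.0), "normal")],
    "B": [((5.0, 5.0), "malicious")],
})
updated_class = identifier.update((0.1, 0.1), "malicious")
result = identifier.identify((4.8, 5.1))
```

Only the model of the class that receives the new file sample is retrained, which is the property that lets an update touch far fewer samples than the full history.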
By applying the device provided by the embodiment of the invention, all historical file samples are divided into a plurality of sample classes in advance, and each sample class corresponds to its own malicious file identification model. When a new file sample exists, the sample class to which it belongs is determined, and the malicious file identification model corresponding to that sample class can be updated based on the new file sample and the historical file samples in that sample class. When a file to be identified is detected, it is identified by using the corresponding malicious file identification model. Because the historical file samples are divided into sample classes in advance, a new file sample can be assigned to a specific sample class and only the malicious file identification model of that sample class needs to be updated. Since the sample size of one sample class is far smaller than the total sample size, the malicious file identification model of one sample class can be updated rapidly and applied promptly to the identification of the latest malicious files, so that the latest threats can be responded to in time.
In an embodiment of the present invention, the sample dividing module 210 is specifically configured to:
acquiring all historical file samples;
extracting the original characteristics of each historical file sample;
and clustering all the historical file samples based on the original characteristics of each historical file sample to obtain a plurality of sample classes.
In an embodiment of the present invention, the sample dividing module 210 is specifically configured to:
and performing clustering processing on all the historical file samples at least twice based on the original characteristics of each historical file sample to obtain a plurality of sample classes.
In one embodiment of the present invention, the at least two clustering processes comprise at least one clustering process using a density-based clustering algorithm.
In an embodiment of the present invention, the sample dividing module 210 is specifically configured to:
based on the original characteristics of each historical file sample, performing pre-clustering processing on all historical file samples by using a K-means clustering algorithm to obtain a plurality of sample classes;
and for each sample class, clustering the historical file samples in the sample class by using a density-based clustering algorithm to obtain a plurality of sample classes.
In one embodiment of the present invention, the original features include at least one of binary features, character string features, assembly code features, and dynamic features.
In an embodiment of the present invention, the model updating module 230 is specifically configured to:
determining a training sample set based on the new file sample and the historical file sample in the sample class to which the new file sample belongs;
and updating the malicious file identification model corresponding to the sample class to which the new file sample belongs by using the training sample set.
In an embodiment of the present invention, the model updating module 230 is specifically configured to:
in the new file sample and the historical file sample in the sample class to which the new file sample belongs:
selecting file samples meeting preset necessary conditions, extracting other unselected file samples according to a preset extraction rule, and generating a training sample set;
the preset extraction rule is as follows: the later the file sample acquisition time is, the higher the probability of being extracted is, and the earlier the file sample acquisition time is, the lower the probability of being extracted is.
In an embodiment of the present invention, the model updating module 230 is specifically configured to:
calculating the distance between the new file sample and the center point of each sample class through a clustering algorithm;
and determining the sample class of the new file sample according to the distance.
In a specific embodiment of the present invention, the new file sample is a file sample for which, after identification by the malicious file identification model, it cannot be determined whether the file is a malicious file.
In an embodiment of the present invention, the file identification module 240 is specifically configured to:
when a file to be identified exists, determining the sample class to which the file to be identified belongs;
identifying the file to be identified by using a malicious file identification model corresponding to the sample class to which the file to be identified belongs;
and determining whether the file to be identified is a malicious file or not according to the identification result.
In an embodiment of the present invention, the file identification module 240 is specifically configured to:
respectively identifying the files to be identified by using the malicious file identification models corresponding to the sample classes to which the files to be identified belong to obtain identification results of the malicious file identification models corresponding to the sample classes;
correspondingly, according to the identification result, determining whether the file to be identified is a malicious file comprises the following steps:
and integrating the obtained identification results to determine whether the file to be identified is a malicious file.
In an embodiment of the present invention, the file identification module 240 is specifically configured to:
determining the weight of the identification result of each sample class to which the file to be identified belongs according to the distance between the file to be identified and the determined central point of each sample class to which the file to be identified belongs;
and carrying out weighted average on the obtained identification results according to the determined weight.
Corresponding to the above method embodiment, an embodiment of the present invention further provides a malicious file identification device, as shown in fig. 3, including:
a memory 310 for storing a computer program;
the processor 320 is configured to implement the steps of the malicious file identification method when executing the computer program.
Corresponding to the above method embodiment, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the above malicious file identification method.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other.
Those of skill in the art would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The principle and the implementation of the present invention are explained in the present application by using specific examples, and the above description of the embodiments is only used to help understanding the technical solution and the core idea of the present invention. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims (15)

1. A malicious file identification method, comprising:
dividing all historical file samples into a plurality of sample classes in advance;
training to obtain a malicious file identification model corresponding to each sample class based on historical file samples in each sample class;
when a new file sample exists, determining a sample class of the new file sample;
updating a malicious file identification model corresponding to the sample class to which the new file sample belongs based on the new file sample and a historical file sample in the sample class to which the new file sample belongs;
and when the file to be identified is detected, identifying the file to be identified by using a corresponding malicious file identification model.
2. The method of claim 1, wherein the pre-classifying all history file samples into a plurality of sample classes comprises:
acquiring all historical file samples;
extracting the original characteristics of each historical file sample;
and clustering all the historical file samples based on the original characteristics of each historical file sample to obtain a plurality of sample classes.
3. The method according to claim 2, wherein the clustering all the history file samples based on the original features of each history file sample to obtain a plurality of sample classes comprises:
and performing clustering processing on all the historical file samples at least twice based on the original characteristics of each historical file sample to obtain a plurality of sample classes.
4. The method of claim 3, wherein the at least two clustering processes comprise at least one clustering process using a density-based clustering algorithm.
5. The method according to claim 4, wherein the clustering all the history file samples at least twice based on the original features of each history file sample to obtain a plurality of sample classes comprises:
based on the original characteristics of each historical file sample, performing pre-clustering processing on all historical file samples by using a K-means clustering algorithm to obtain a plurality of sample classes;
and for each sample class, clustering the historical file samples in the sample class by using a density-based clustering algorithm to obtain a plurality of sample classes.
6. The method according to any one of claims 2 to 5, wherein the original features comprise at least one of binary features, string features, assembly code features, dynamic features.
7. The method according to claim 1, wherein the updating the malicious file identification model corresponding to the sample class to which the new file sample belongs based on the new file sample and the historical file sample in the sample class to which the new file sample belongs comprises:
determining a training sample set based on the new file sample and historical file samples in the sample class to which the new file sample belongs;
and updating the malicious file identification model corresponding to the sample class to which the new file sample belongs by using the training sample set.
8. The method of claim 7, wherein determining a training sample set based on the new file sample and historical file samples in a sample class to which the new file sample belongs comprises:
in the new file sample and the historical file sample in the sample class to which the new file sample belongs:
selecting file samples meeting preset necessary conditions, extracting other unselected file samples according to a preset extraction rule, and generating a training sample set;
the preset extraction rule is as follows: the later the file sample acquisition time is, the higher the probability of being extracted is, and the earlier the file sample acquisition time is, the lower the probability of being extracted is.
9. The method of claim 1, wherein the determining the sample class to which the new file sample belongs comprises:
calculating the distance between the new file sample and the center point of each sample class through a clustering algorithm;
and determining the sample class of the new file sample according to the distance.
10. The method according to any one of claims 1 to 5 and 7 to 9, wherein the identifying the file to be identified by using a corresponding malicious file identification model when the file to be identified is detected comprises:
when a file to be identified exists, determining a sample class to which the file to be identified belongs;
identifying the file to be identified by using a malicious file identification model corresponding to the sample class to which the file to be identified belongs;
and determining whether the file to be identified is a malicious file or not according to the identification result.
11. The method according to claim 10, wherein there are a plurality of sample classes to which the file to be identified belongs, and the identifying the file to be identified by using the malicious file identification model corresponding to the sample class to which the file to be identified belongs comprises:
respectively identifying the files to be identified by using the malicious file identification models corresponding to the sample classes to which the files to be identified belong to, and obtaining identification results of the malicious file identification models corresponding to the sample classes;
correspondingly, the determining whether the file to be identified is a malicious file according to the identification result includes:
and integrating the obtained identification results, and determining whether the file to be identified is a malicious file.
12. The method according to claim 11, wherein the integrating the obtained recognition results comprises:
determining the weight of the identification result of each sample class to which the file to be identified belongs according to the distance between the file to be identified and the determined central point of each sample class to which the file to be identified belongs;
and carrying out weighted average on the obtained identification results according to the determined weight.
13. An apparatus for identifying malicious files, comprising:
the sample dividing module is used for dividing all historical file samples into a plurality of sample classes in advance;
the model training module is used for training and obtaining a malicious file identification model corresponding to each sample class based on the historical file samples in each sample class;
the model updating module is used for determining a sample class of a new file sample when the new file sample exists; updating a malicious file identification model corresponding to the sample class to which the new file sample belongs based on the new file sample and a historical file sample in the sample class to which the new file sample belongs;
and the file identification module is used for identifying the file to be identified by using a corresponding malicious file identification model when the file to be identified is detected.
14. A malicious file identification device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method of identifying malicious files according to any one of claims 1 to 12 when executing the computer program.
15. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for malicious file identification according to any of claims 1 to 12.
CN201910570372.0A 2019-06-27 2019-06-27 Malicious file identification method, device, equipment and storage medium Pending CN112149121A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910570372.0A CN112149121A (en) 2019-06-27 2019-06-27 Malicious file identification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112149121A true CN112149121A (en) 2020-12-29

Family

ID=73868868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910570372.0A Pending CN112149121A (en) 2019-06-27 2019-06-27 Malicious file identification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112149121A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140090061A1 (en) * 2012-09-26 2014-03-27 Northrop Grumman Systems Corporation System and method for automated machine-learning, zero-day malware detection
US20150244733A1 (en) * 2014-02-21 2015-08-27 Verisign Inc. Systems and methods for behavior-based automated malware analysis and classification
CN105095755A (en) * 2015-06-15 2015-11-25 安一恒通(北京)科技有限公司 File recognition method and apparatus
US20170046510A1 (en) * 2015-08-14 2017-02-16 Qualcomm Incorporated Methods and Systems of Building Classifier Models in Computing Devices
CN107180192A (en) * 2017-05-09 2017-09-19 北京理工大学 Android malicious application detection method and system based on multi-feature fusion
US20180150724A1 (en) * 2016-11-30 2018-05-31 Cylance Inc. Clustering Analysis for Deduplication of Training Set Samples for Machine Learning Based Computer Threat Analysis
CN108304720A (en) * 2018-02-06 2018-07-20 恒安嘉新(北京)科技股份公司 A kind of Android malware detection methods based on machine learning
CN109492395A (en) * 2018-10-31 2019-03-19 厦门安胜网络科技有限公司 A kind of method, apparatus and storage medium detecting rogue program

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113806485A (en) * 2021-09-23 2021-12-17 厦门快商通科技股份有限公司 Intention identification method and device based on small sample cold start and readable medium
CN113806485B (en) * 2021-09-23 2023-06-23 厦门快商通科技股份有限公司 Intention recognition method and device based on small sample cold start and readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination