CN114091019A

CN114091019A - Data set construction method and device, malicious software identification method and device, and identification model construction method and device

Info

Publication number: CN114091019A
Application number: CN202011412019.9A
Authority: CN
Inventors: 赵毅强; 王志刚; 刘恒; 齐向东; 吴云坤
Original assignee: Qax Technology Group Inc; Secworld Information Technology Beijing Co Ltd
Current assignee: Qax Technology Group Inc; Secworld Information Technology Beijing Co Ltd
Priority date: 2020-12-03
Filing date: 2020-12-03
Publication date: 2022-02-25

Abstract

The invention provides methods and devices for data set construction, malware identification, and identification model construction. The method determines abnormal samples using an anomaly detection algorithm, so that samples added to the seed set differ strongly from the samples already in it, while new samples that differ little from the seed set are discarded. This reduces redundant data in the seed set, shrinks its scale, and saves data storage space. Moreover, when the smaller seed set is used for malware identification, the training time of the malware identification model is shortened, each sample remaining in the seed set is highly representative, and the problem in conventional methods that excessive redundant data introduces noise into the data set and degrades the accuracy of the identification model is solved.

Description

Data set construction method and device, malicious software identification method and device, and identification model construction method and device

Technical Field

The invention relates to the technical field of computers, in particular to a method and a device for data set construction, malicious software identification and identification model construction.

Background

With the continuous development of computer programming technology, software written in various programming languages lets people complete all kinds of tasks on a computer more conveniently. However, malicious software carrying malicious content has appeared alongside it, attacking normal data files or stealing the fruits of other people's work. It is therefore important to identify whether software under test is malware.

Existing intelligent malware identification techniques generally use machine learning, and a machine learning algorithm cannot be applied without a training data set. In these techniques the training data set contains a large amount of malicious and non-malicious software, but it is severely homogeneous internally (two similar programs may differ only slightly); that is, the data set is highly redundant. The redundant data wastes data storage space, prolongs the training time of the malware identification model, and can even reduce its accuracy.

Disclosure of Invention

The invention provides a method and a device for data set construction, malicious software identification and identification model construction, which are used for solving the defect of high redundancy degree of a data set in the prior art.

The invention provides a data set construction method for malicious software identification, which comprises the following steps:

a new sample acquisition step: obtaining a new sample; wherein the sample type of the new sample is consistent with the sample type in a pre-constructed seed set, wherein the seed set contains malicious samples and/or non-malicious samples;

an abnormal sample determining step: determining, based on the new sample and the samples in the seed set, whether the new sample is an abnormal sample by using an anomaly detection algorithm, and if so, adding the new sample to the seed set;

an updating step: cyclically executing the new sample acquisition step and the abnormal sample determining step until the number of samples in the seed set meets a preset condition.

According to the data set construction method provided by the invention, the abnormal sample determining step specifically comprises:

determining, based on the new sample and the samples in the seed set, whether the new sample is an abnormal sample by using an isolation forest algorithm, and if so, adding the new sample to the seed set.

According to the data set construction method provided by the invention, determining, based on the new sample and the samples in the seed set, whether the new sample is an abnormal sample by using an isolation forest algorithm, and if so, adding the new sample to the seed set, comprises:

constructing N isolation trees based on the seed set by using an isolation forest algorithm or an extended isolation forest algorithm;

scoring the new sample for abnormality based on the N isolation trees, and if the score is higher than a preset threshold, adding the new sample to the seed set.

According to the data set construction method provided by the invention, scoring the new sample for abnormality based on the N isolation trees comprises:

determining the depth value of the new sample on each of the N isolation trees;

computing an anomaly score for the new sample from the depth values by using an anomaly scoring function.

The invention also provides a malicious software identification method, which comprises the following steps:

acquiring software to be identified;

inputting the software into a malicious software identification model, and acquiring an identification result of the software;

the malware identification model is obtained by performing machine learning training on a first seed set containing malicious samples constructed by adopting any one of the data set construction methods and/or a second seed set containing non-malicious samples constructed by adopting any one of the data set construction methods.

The invention also provides a method for constructing the malicious software identification model, which comprises the following steps:

constructing a first seed set containing malicious samples by using any one of the data set construction methods described above; and/or

constructing a second seed set containing non-malicious samples by using any one of the data set construction methods described above;

training a machine learning model based on the first seed set and/or the second seed set to obtain a malware identification model.

The invention also provides a data set construction device for malware identification, comprising:

a new sample acquiring unit, configured to acquire a new sample; wherein the sample type of the new sample is consistent with the sample type in a pre-constructed seed set, wherein the seed set contains malicious samples and/or non-malicious samples;

an abnormal sample determining unit, configured to determine, based on the new sample and the samples in the seed set, whether the new sample is an abnormal sample by using an anomaly detection algorithm, and if so, add the new sample to the seed set;

an updating unit, configured to cyclically invoke the new sample acquiring unit and the abnormal sample determining unit until the number of samples in the seed set meets a preset condition.

The present invention also provides a malware identification apparatus, including:

an acquisition unit for acquiring software to be identified;

the identification unit is used for inputting the software into a malicious software identification model and acquiring an identification result of the software;

The invention also provides a device for constructing the malware identification model, comprising:

a first constructing unit, configured to construct a first seed set containing malicious samples by using any one of the data set construction methods described above; and/or

a second constructing unit, configured to construct a second seed set containing non-malicious samples by using any one of the data set construction methods described above;

a training unit, configured to train a machine learning model based on the first seed set and/or the second seed set to obtain a malware identification model.

The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of any one of the data set construction methods; and/or the processor, when executing the computer program, implements the steps of the malware identification method as described above; and/or the processor, when executing the computer program, implements the steps of the malware identification model building method as described above.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the data set construction method according to any one of the above; and/or the processor, when executing the computer program, implements the steps of the malware identification method as described above; and/or the processor, when executing the computer program, implements the steps of the malware identification model building method as described above.

The invention also provides a computer program product having stored thereon executable instructions which, when executed by a processor, cause the processor to carry out the steps of the data set construction method as described in any one of the above; and/or the processor, when executing the computer program, implements the steps of the malware identification method as described above; and/or the processor, when executing the computer program, implements the steps of the malware identification model building method as described above.

According to the data set construction method and device, the malware identification method and device, and the identification model construction method and device provided by the invention, a new sample is added to the seed set if, based on the new sample and the samples in the seed set, an anomaly detection algorithm determines it to be an abnormal sample, and this continues until the number of samples in the seed set meets a preset condition. Because abnormal samples are determined by the anomaly detection algorithm, the samples added differ strongly from those already in the seed set, and new samples that differ little from the seed set are discarded. This reduces redundant data in the seed set, shrinks its scale, and saves data storage space. Moreover, when the smaller seed set is used for malware identification, the training time of the malware identification model is shortened, each sample remaining in the seed set is highly representative, and the problem in conventional methods that excessive redundant data introduces noise into the data set and degrades the accuracy of the identification model is solved.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a schematic flow chart of a data set construction method provided by the present invention;

FIG. 2 is a schematic diagram of a seed set construction process provided by the present invention;

FIG. 3 is a schematic diagram of a seed set updating process provided by the present invention;

FIG. 4 is a flowchart illustrating a malware identification method provided by the present invention;

FIG. 5 is a flowchart illustrating a malware identification model building method provided in the present invention;

FIG. 6 is a schematic structural diagram of a data set constructing apparatus provided by the present invention;

FIG. 7 is a schematic structural diagram of a malware recognition apparatus provided in the present invention;

FIG. 8 is a schematic structural diagram of a malware identification model building apparatus provided in the present invention;

FIG. 9 is a schematic structural diagram of an electronic device provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Malware refers to viruses, worms, Trojan horse programs, and the like that perform malicious tasks on a computer system. Malware can steal user information, degrade the user's experience of an electronic device, and even seriously threaten the user's property, so it needs to be identified. Existing intelligent malware identification techniques generally use machine learning, and machine learning algorithms cannot be applied without training data sets. In these techniques a data set is built from a large amount of malicious and non-malicious software, but because the software items are highly similar to one another, a naively collected data set is severely homogeneous internally: for example, it contains adjacent minor versions of the same software, different versions of viruses from the same family, or the same software infected by similar viruses. The data set is therefore highly redundant, and the redundant data wastes data storage space and prolongs the training time of the malware identification model. Moreover, compared with a laboratory environment, real malware identification scenarios are often more complex and malware evolves faster, so new samples must be continually obtained and added to the training set to update the identification model. The training set thus grows larger, the redundant data in it grows accordingly, noise is introduced into the training of the identification model, and its accuracy suffers.

In this regard, the present invention provides a data set construction method for malware identification. Fig. 1 is a schematic flow chart of a data set constructing method provided by the present invention, and as shown in fig. 1, the method includes the following steps:

step 110, new sample acquisition step: obtaining a new sample; and the sample type of the new sample is consistent with the sample type in a pre-constructed seed set, wherein the seed set contains malicious samples and/or non-malicious samples.

In this step, in order to train the malware identification model, new samples need to be acquired as its training set data; one or more malicious/non-malicious binary classification models are then trained on the training set data and used to identify malware. The sample type of the new sample is consistent with the sample type in the pre-constructed seed set, and the seed set contains malicious samples and/or non-malicious samples; that is, the new sample is a malicious sample and/or a non-malicious sample. Malicious samples include software that attempts to damage a computer, collect sensitive information, or illegally access other computers, and non-malicious samples include software that works normally on a computer. Malicious or non-malicious samples may be downloaded from a sample site, but if the downloaded samples are added directly to the seed set, identical samples may end up in the seed set and inflate its size. Therefore, in this embodiment, after a malicious or non-malicious sample is downloaded, the samples are screened and deduplicated to remove samples identical to ones already in the seed set. For example, a hash digest algorithm (MD5, SHA1, etc.) may be used for deduplication: any sample whose digest value equals that of a sample in the seed set is removed, which ensures that no two samples in the seed set are identical.
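The hash-based deduplication described above can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: the function names `digest` and `deduplicate` and the in-memory set of digests are assumptions, and SHA-1 is used only because the text names it as one option.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-1 hex digest of a sample's raw bytes."""
    return hashlib.sha1(data).hexdigest()


def deduplicate(downloaded: list[bytes], seed_digests: set[str]) -> list[bytes]:
    """Keep only samples whose digest has not been seen before.

    `seed_digests` (hypothetical name) holds the digests of samples already
    in the seed set. It is updated in place, so duplicates inside
    `downloaded` itself are dropped as well.
    """
    unique = []
    for data in downloaded:
        d = digest(data)
        if d not in seed_digests:
            seed_digests.add(d)
            unique.append(data)
    return unique


seed_digests = {digest(b"sample-A")}
fresh = deduplicate(
    [b"sample-A", b"sample-B", b"sample-B", b"sample-C"], seed_digests
)
print(fresh)  # [b'sample-B', b'sample-C']
```

Digest comparison only removes byte-identical samples; near-duplicates such as adjacent minor versions pass through unchanged.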

In addition, in order to find the most effective features among the original features of the samples (invariant across samples of the same type, discriminative between different samples, and robust to noise), this embodiment performs feature extraction on the deduplicated samples, converting the original features into a set of features with clear physical or statistical meaning, or kernel-based features. This reduces data redundancy, accurately captures the features that reflect the essence of the samples, helps the subsequent malware identification model learn from and generalize over the samples, and improves the prediction accuracy of the identification model.

As shown in FIG. 2, in this embodiment malicious or non-malicious samples are downloaded from a sample site; the downloaded samples are deduplicated, removing those identical to data already in the seed set; features are then extracted from the deduplicated samples so that each is represented in a quantized form (usually a high-dimensional real vector); and this continues until the seed set reaches a preset scale. It should be noted that the seed set obtained in this embodiment is small: for example, it contains up to ten thousand samples, which is equivalent to 1% of the data scale used in existing intelligent malware identification techniques. The scale of the seed set is thus greatly reduced and storage space is saved.
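The text does not specify which features are extracted. Purely to illustrate what representing a sample as a real vector can look like, the sketch below uses a normalized byte histogram; this feature choice and the name `byte_histogram` are assumptions and are not taken from the embodiment.

```python
def byte_histogram(data: bytes) -> list[float]:
    """Map raw bytes to a 256-dimensional vector of byte-value frequencies.

    Hypothetical feature, used here only as an example of a fixed-length
    real-valued representation of a binary sample.
    """
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    total = len(data) or 1  # avoid division by zero for empty input
    return [c / total for c in counts]


vec = byte_histogram(b"\x00\x00\xff\x10")
print(len(vec), vec[0], vec[255])  # 256 0.5 0.25
```

Real feature vectors for binaries are typically much higher-dimensional, as the description notes later.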

In addition, although the acquired seed set contains malicious or non-malicious samples, in an actual malware identification scenario both malicious and non-malicious software are updated quickly. To maintain the prediction accuracy of the identification model, the model must keep pace with software updates, so new samples are frequently acquired from a sample site to retrain it. This embodiment may obtain new samples from the sample site periodically, and the sample type of each new sample is consistent with the sample type in the seed set, that is, malicious or non-malicious.

Step 120, determining an abnormal sample: and determining whether the new sample is an abnormal sample or not by adopting an abnormal detection algorithm based on the new sample and the sample in the seed set, and if so, adding the new sample to the seed set.

In this step, it should be noted that conventional methods obtain the data set for malware identification through the steps of sample downloading, sample deduplication, feature extraction, and data set construction. Although sample deduplication (for example with a hash algorithm) eliminates identical samples, the samples that survive hash-based deduplication may still be homogeneous (differing only slightly from one another): for example, several adjacent minor versions of the same software, different versions of viruses from the same family, or the same software infected by similar viruses. As a result, the final data set contains a large number of similar, that is, redundant, samples. Redundant samples provide no useful information for training the identification model and cause the data set to grow excessively, which seriously reduces training efficiency. In addition, because the popularity of different software varies greatly, the degree of redundancy also varies greatly across samples in the final data set, which introduces considerable bias into the training of the identification model and harms its prediction performance.

Therefore, in this embodiment, after a new sample is obtained, an anomaly detection algorithm determines whether it is an abnormal sample. If so, the new sample differs greatly from the samples in the seed set, and it is added to the seed set so that the samples in the seed set are highly representative. If not, the new sample differs little from the samples in the seed set, that is, it is a redundant sample, and it is discarded so that the seed set does not become redundant. Here, anomaly detection refers to the process of finding objects whose behavior differs markedly from what is expected. In this embodiment, data that the anomaly detection algorithm considers abnormal is treated as high-quality non-redundant data, and non-abnormal data is treated as redundant data; that is, non-abnormal data, which has high redundancy and a low degree of abnormality, is removed, so that a high-quality data set can be constructed at low cost. Meanwhile, the samples in a data set selected by the anomaly detection algorithm are highly representative, so the noise around the sample points is greatly reduced, the data scale can be reduced without reducing data coverage, and the training convergence speed, prediction accuracy, and generalization performance of the identification model are effectively improved.

The invention thus innovatively applies reverse thinking to the problem that existing sample sets cannot be generated both efficiently and with high quality. During sample set generation, an anomaly detection algorithm is used to detect abnormal data, and the abnormal data is then treated as high-quality non-redundant data. It can be understood that non-abnormal data can be regarded as redundant because it is highly similar to samples already in the sample set, whereas abnormal data can be regarded as non-redundant because its similarity to those samples is low. Generally speaking, anomaly detection algorithms are used to eliminate abnormal data; the invention creatively reverses this process and uses the anomaly detection algorithm to select useful sample data, thereby reducing redundant data in the seed set, shrinking its scale, and saving data storage space. Moreover, when the smaller seed set is used for malware identification, the training time of the malware identification model is shortened, each sample remaining in the seed set is highly representative, and the problem in conventional methods that excessive redundant data introduces noise into the data set and degrades the accuracy of the identification model is solved.

In addition, to improve data set quality, conventional methods generally require the data labels to be confirmed a second time by hand (and sometimes through multiple rounds of review), which is extremely time-consuming and labor-intensive. By reducing the scale of the data set, the present method also greatly reduces the manual review cost of producing a high-quality data set.

Step 130, updating step: and circularly executing the new sample acquisition step and the abnormal sample determination step until the number of samples in the seed set meets a preset condition.

In this step, it should be noted that the pre-constructed seed set itself contains some redundancy. After new samples are added to the seed set through step 110 and step 120, the pre-constructed seed set accounts for only a small proportion of the seed set obtained in step 130, so its redundancy is diluted and becomes negligible. Therefore, by executing step 110 and step 120 cyclically, this embodiment not only obtains new data set samples for training the identification model but also dilutes the redundancy of the seed set, reduces the scale of the data set, and saves storage overhead. The preset condition may be that the number of samples in the seed set reaches a preset scale, or that the number of updates reaches a preset number; this embodiment does not specifically limit it.
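The loop formed by steps 110 to 130 can be sketched as below. This is an illustration under stated assumptions: `update_seed_set`, `target_size`, and `max_updates` are hypothetical names; counting every examined sample as one update is one possible reading of "the number of updates"; and `far_from_seed` is a simple nearest-neighbor placeholder standing in for the anomaly detection algorithm, not the isolation forest of the embodiment.

```python
import math
from typing import Callable, Iterable

Vector = list[float]


def far_from_seed(sample: Vector, seed_set: list[Vector], min_dist: float = 1.0) -> bool:
    """Placeholder detector (assumption): abnormal if no seed sample is within min_dist."""
    return all(math.dist(sample, s) > min_dist for s in seed_set)


def update_seed_set(
    seed_set: list[Vector],
    new_samples: Iterable[Vector],
    is_abnormal: Callable[[Vector, list[Vector]], bool],
    target_size: int,
    max_updates: int,
) -> list[Vector]:
    """Grow seed_set with abnormal samples until a preset condition is met."""
    updates = 0
    for sample in new_samples:  # step 110: obtain a new sample
        # step 130: stop once either preset condition is met
        if len(seed_set) >= target_size or updates >= max_updates:
            break
        if is_abnormal(sample, seed_set):  # step 120: keep only abnormal samples
            seed_set.append(sample)
        updates += 1
    return seed_set


seed = [[0.0, 0.0]]
stream = [[0.1, 0.1], [5.0, 5.0], [5.2, 5.1], [-4.0, 3.0], [9.0, 9.0]]
result = update_seed_set(seed, stream, far_from_seed, target_size=3, max_updates=10)
print(result)  # [[0.0, 0.0], [5.0, 5.0], [-4.0, 3.0]]
```

Passing the detector as a callable keeps the loop independent of the particular anomaly detection algorithm.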

According to the data set construction method provided by the invention, a new sample is added to the seed set if, based on the new sample and the samples in the seed set, an anomaly detection algorithm determines it to be an abnormal sample, and this continues until the number of samples in the seed set meets a preset condition. Because abnormal samples are determined by the anomaly detection algorithm, the samples added differ strongly from those already in the seed set, and new samples that differ little from the seed set are discarded. This reduces redundant data in the seed set, shrinks its scale, and saves data storage space. Moreover, when the smaller seed set is used for malware identification, the training time of the malware identification model is shortened, each sample remaining in the seed set is highly representative, and the problem in conventional methods that excessive redundant data introduces noise into the data set and degrades the accuracy of the identification model is solved.

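Based on the above embodiment, the abnormal sample determining step specifically includes:

determining, based on the new sample and the samples in the seed set, whether the new sample is an abnormal sample by using an isolation forest algorithm, and if so, adding the new sample to the seed set.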

In this embodiment, it should be noted that the isolation forest (Isolation Forest) is a fast, ensemble-based outlier detection method with linear time complexity and high accuracy. Whereas other anomaly detection algorithms characterize how far apart samples are through quantitative measures such as distance and density, the isolation forest algorithm detects outliers by isolating sample points. Specifically, the algorithm isolates samples using a binary search tree structure called an isolation tree. Because outliers are few and lie apart from the majority of samples, they are isolated earlier; that is, outliers end up closer to the root node of the isolation tree, while normal values end up farther from it. In addition, compared with traditional algorithms such as LOF and K-means, the isolation forest algorithm is more robust on high-dimensional data.

In this embodiment, an isolation forest algorithm is used: isolation trees are built on the seed set, the new sample is compared against the nodes of each isolation tree, and its depth on the tree is determined. The smaller the depth, the closer the new sample is to the root node of the isolation tree and the more it differs from the samples in the seed set. It should be noted that the isolation forest algorithm can compute an anomaly score from the depth of the new sample in the isolation trees; the larger the anomaly score, the closer the new sample is to the root node. It can be understood that, given a preset value, if the anomaly score is greater than the preset value, the new sample is determined to be an abnormal sample and is added to the seed set.
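To make the relation between depth and anomaly score concrete, the sketch below uses the scoring function commonly associated with isolation forests in the literature, s = 2^(-E[h] / c(n)), where E[h] is the mean depth and c(n) is the average path length of an unsuccessful binary search tree lookup over n points. Whether this is exactly the function used in this embodiment is an assumption, and the names `average_path_length` and `anomaly_score` are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329


def average_path_length(n: int) -> float:
    """c(n): normalizing constant for a subsample of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    # 2 * H(n - 1) - 2 * (n - 1) / n, with H(i) approximated by ln(i) + gamma
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def anomaly_score(depths: list[float], subsample_size: int) -> float:
    """Score in (0, 1]; a smaller mean depth gives a larger score."""
    if subsample_size < 2:
        raise ValueError("subsample_size must be at least 2")
    mean_depth = sum(depths) / len(depths)
    return 2.0 ** (-mean_depth / average_path_length(subsample_size))


shallow = anomaly_score([2, 3, 2], subsample_size=256)    # isolated near the root
deep = anomaly_score([12, 11, 13], subsample_size=256)    # buried deep in the trees
print(shallow > 0.5 > deep)  # True
```

Under this function a sample whose mean depth equals c(n) scores exactly 0.5, which is one way to interpret a threshold of 0.5.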

In addition, the data set for the identification model consists of machine-readable binary samples. If a clustering method were used instead, clustering the whole data set and taking the cluster centers as the low-redundancy set, it would still be possible to judge whether a new sample differs from the seed set samples, but clustering has higher algorithmic complexity than the isolation forest algorithm. Particularly for binary samples represented as very high-dimensional vectors (usually several thousand dimensions, and even tens of thousands), the computational cost of clustering seriously reduces its usability.

According to the data set construction method provided by the invention, whether the new sample is an abnormal sample is determined by an isolation forest algorithm based on the new sample and the samples in the seed set, which reduces computational overhead and offers better robustness.

Based on the above embodiment, determining, based on the new sample and the samples in the seed set, whether the new sample is an abnormal sample by using an isolation forest algorithm, and if so, adding the new sample to the seed set, includes:

constructing N isolation trees based on the seed set by using an isolation forest algorithm or an extended isolation forest algorithm;

scoring the new sample for abnormality based on the N isolation trees, and if the score is higher than a preset threshold, adding the new sample to the seed set.

In this embodiment, it should be noted that the feature vectors of malicious or non-malicious samples may mix discrete and continuous components, their dimensionality is usually large, and normalizing the vectors is time-consuming because of the large data volume; the isolation forest algorithm can be used to avoid these problems. If the feature vectors of the malicious or non-malicious samples are continuous and already normalized, the extended isolation forest algorithm (Extended Isolation Forest) can be used to further reduce bias in the constructed data set. Therefore, this embodiment can select the anomaly detection algorithm according to the nature of the feature vectors of the malicious or non-malicious samples, which improves computational efficiency.

In this embodiment, the isolation forest algorithm or the extended isolation forest algorithm constructs N isolation trees (Isolation Trees) on the seed set. Each tree is built from only a subset obtained by randomly sampling the seed set; one feature is then selected at random from all features to split the data, and this process is repeated until the current partition contains a single sample or the depth reaches a threshold. N may be set by a domain expert, for example to an initial value of 100, and may then be increased according to a specified policy, for example by 5 relative to the previous round.
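The tree construction just described can be sketched in plain Python as follows. This is a simplified illustration and not the embodiment's implementation: the names `build_tree`, `build_forest`, `depth_of`, and `mean_depth` are hypothetical; the split value is drawn uniformly between the minimum and maximum of the chosen feature and the depth threshold is set to ceil(log2(sample_size)), both of which follow common practice for isolation forests and are assumptions here; and the usual correction for external nodes that still hold several samples is omitted for brevity.

```python
import math
import random


def build_tree(points, depth, max_depth, rng):
    """Recursively split `points` on a random feature at a random value."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}  # external node
    feature = rng.randrange(len(points[0]))
    values = [p[feature] for p in points]
    lo, hi = min(values), max(values)
    if lo == hi:  # all values equal: this feature cannot split the node
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def build_forest(seed_set, n_trees=100, sample_size=64, seed=0):
    """Build N isolation trees, each on a random subsample of the seed set."""
    rng = random.Random(seed)
    size = min(sample_size, len(seed_set))
    max_depth = math.ceil(math.log2(size)) if size > 1 else 0
    return [
        build_tree(rng.sample(seed_set, size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]


def depth_of(sample, tree):
    """Number of splits traversed before `sample` reaches an external node."""
    depth = 0
    while "feature" in tree:
        side = "left" if sample[tree["feature"]] < tree["split"] else "right"
        tree = tree[side]
        depth += 1
    return depth


def mean_depth(sample, forest):
    return sum(depth_of(sample, t) for t in forest) / len(forest)


data_rng = random.Random(1)
seed_set = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
forest = build_forest(seed_set)
far_depth = mean_depth((10.0, 10.0), forest)   # far from every seed sample
near_depth = mean_depth((0.0, 0.0), forest)    # in the dense region of the seed set
print(far_depth < near_depth)  # True
```

A point far from the seed set is separated after fewer splits than a point in the dense region, which is the property the anomaly score relies on.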

In this embodiment, after the N isolation trees are constructed on the seed set, each new sample is scored for outliers. Samples whose score is lower than a preset threshold are filtered out (the preset threshold may be set by an expert, for example 0.5; samples below it are regarded as redundant or low-quality samples), and samples whose score is higher than the preset threshold are added to the seed set until the size of the seed set reaches α times its size before the iteration (α may be set by a domain expert, for example α = 2.0). This ensures that the new samples differ from the existing seed set without affecting the prediction accuracy of the identification model, and dilutes the redundancy of the seed set obtained in step 110.
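The filtering loop can be sketched as below, assuming a scoring function is supplied by the caller. The names `grow_seed_set` and `toy_score` are hypothetical; `toy_score` is a simple distance-based stand-in used only to make the example runnable, not the isolation forest score of the embodiment.

```python
import math

def grow_seed_set(seed_set, candidates, score_fn, threshold=0.5, alpha=2.0):
    """Add candidates scoring above `threshold` until the seed set reaches
    `alpha` times its size at the start of the iteration."""
    target = alpha * len(seed_set)
    grown = list(seed_set)
    rejected = []
    for sample in candidates:
        if len(grown) >= target:
            break                      # size condition met, stop the loop
        if score_fn(grown, sample) > threshold:
            grown.append(sample)       # treated as high-quality, non-redundant
        else:
            rejected.append(sample)    # treated as redundant / low-quality
    return grown, rejected

def toy_score(current_set, sample):
    """Stand-in scorer (assumption): maps nearest-neighbour distance d to d/(1+d)."""
    d = min(math.dist(sample, s) for s in current_set)
    return d / (1.0 + d)

seed = [(0.0, 0.0), (1.0, 0.0)]
candidates = [(0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (-4.0, 0.0), (9.0, 9.0)]
grown, rejected = grow_seed_set(seed, candidates, toy_score)
print(grown)     # [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (-4.0, 0.0)]
print(rejected)  # [(0.1, 0.0), (5.1, 5.0)]
```

Scoring each candidate against the growing set, rather than the original one, is a design choice of this sketch: once (5.0, 5.0) is accepted, the near-duplicate (5.1, 5.0) is rejected as redundant.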

As shown in fig. 3, after the seed set of step 110 is obtained through sample downloading, sample deduplication, and feature extraction, isolation trees are constructed on the seed set, and each new sample is given a quality score based on the isolation trees to judge whether it is a high-quality, non-redundant sample; if so, the new sample is added to the seed set and the seed set is updated.

Based on the above embodiment, scoring the new sample for outliers based on the N isolation trees includes:

In this embodiment, N isolation trees are constructed on the seed set. Each isolation tree uses only a subset obtained by randomly sampling the seed set; one of the features is then selected at random to split the data, and the process is repeated until the current node contains a single sample or the depth reaches a threshold. A new sample corresponds to a depth value on each tree, and the smaller the depth value, the larger the difference between the new sample and the samples of the seed set. The isolation forest algorithm calculates the mean depth of the new sample over all isolation trees and substitutes it into an outlier scoring function to obtain the outlier score:

wherein s(x, n) represents the outlier score, x represents the new sample, n represents the number of nodes used for constructing the isolation tree, E(H(x)) represents the mean of the depths of x over all the isolation trees, and H represents the total depth value of the isolation tree.
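The scoring expression itself is not reproduced in this text. As a hedged illustration only, the sketch below uses the commonly published isolation forest score, s(x, n) = 2^(-E(h(x)) / c(n)), where c(n) is the average path length of an unsuccessful binary search tree lookup over n samples. This is an assumption: it may differ from the expression intended here, which refers to a total depth value H. The names `c` and `outlier_score` are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Average path length of an unsuccessful BST search over n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def outlier_score(depths, n):
    """Score in (0, 1]: shallow mean depth -> score near 1 (more anomalous)."""
    mean_depth = sum(depths) / len(depths)
    return 2.0 ** (-mean_depth / c(n))

# Depths of one new sample on three trees, each tree built from 256 samples.
shallow = outlier_score([2, 3, 2], 256)
deep = outlier_score([9, 10, 11], 256)
print(shallow > deep)  # True: the shallower sample scores higher
```

Under this form, a mean depth equal to c(n) gives a score of 0.5, which is consistent with using 0.5 as an example threshold between redundant and non-redundant samples.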

It should be noted that the isolation forest algorithm is one of many anomaly detection algorithms, and almost all anomaly detection algorithms are applied to anomaly detection itself (for example, detecting a sudden change in traffic, or a few data items that differ greatly from the rest). The key point of the present application is that abnormal data are not removed; instead, the data the algorithm considers abnormal are regarded as high-quality, non-redundant data, and the non-abnormal data are regarded as redundant data (non-abnormal data, that is, data with high redundancy and a low degree of abnormality, need to be removed). Applying the algorithm in this reverse manner makes it possible to construct a data set of very high quality at very low cost.

In addition, to improve the quality of a data set as much as possible, the labels of the data often need a second manual confirmation (or even several rounds of checking and confirmation). For binary software samples, manual confirmation is extremely time-consuming and labor-intensive, so reducing the scale of the data set also indirectly and greatly reduces the manual checking cost of a high-quality data set.

At present, no existing work effectively reduces redundancy for binary sample data. From a machine learning viewpoint, the whole data set could be clustered and the cluster centers taken as a low-redundancy set, but the algorithmic complexity of clustering is higher than that of the isolation forest algorithm. In particular, binary sample data are usually represented as ultra-high-dimensional vectors (typically several thousand, or even tens of thousands of, dimensions), for which the computational overhead of clustering seriously reduces its usability.

Based on the above embodiments, as shown in fig. 4, the present invention provides a malware identification method, including the following steps:

step 410, acquiring software to be identified;

step 420, inputting the software into a malicious software identification model to obtain an identification result of the software;

the malware identification model is obtained by performing machine learning training on a first seed set containing malicious samples constructed by adopting the data set construction method of any one of the embodiments and/or a second seed set containing non-malicious samples constructed by adopting the data set construction method of any one of the embodiments.

In this embodiment, it should be noted that the first seed set containing malicious samples and/or the second seed set containing non-malicious samples is constructed by the data set construction method of any one of the above embodiments, which removes redundant samples from the seed sets and reduces their scale. The samples in the finally formed data set are therefore highly representative: noise around the sample points is greatly reduced, and the data scale is reduced without reducing data coverage, which effectively improves the training convergence speed, prediction accuracy, and generalization performance of the identification model. Accordingly, in this embodiment, an accurate identification result can be obtained by inputting the software to be identified into the malware identification model, and the identification model can be retrained as the software samples are updated, so that the identification result of the software to be identified can be updated in real time.

Based on the above embodiment, as shown in fig. 5, the present invention provides a method for constructing a malware identification model, including the following steps:

step 510, constructing a first seed set containing malicious samples by adopting the data set construction method of any one of the above embodiments; and/or,

step 520, constructing a second seed set containing non-malicious samples by adopting the data set construction method of any one of the above embodiments;

step 530, training a machine learning model based on the first seed set and/or the second seed set to obtain the malware identification model.
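Steps 510 to 530, followed by the identification of steps 410 and 420, can be sketched as below. The embodiment does not name a specific machine learning model, so this example substitutes a nearest-centroid classifier purely as a stand-in; the function names and the two-dimensional feature vectors are hypothetical, and the sketch assumes both seed sets are available.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def train_identifier(first_seed_set, second_seed_set):
    """Stand-in for step 530: fit one centroid per class and return a
    function that labels a feature vector by its nearest centroid."""
    malicious_center = centroid(first_seed_set)
    benign_center = centroid(second_seed_set)

    def identify(features):
        d_malicious = math.dist(features, malicious_center)
        d_benign = math.dist(features, benign_center)
        return "malicious" if d_malicious < d_benign else "non-malicious"

    return identify

# Toy seed sets (assumed to be the output of the data set construction method).
first_seed_set = [[0.9, 0.8], [0.8, 0.95]]    # malicious samples
second_seed_set = [[0.1, 0.2], [0.2, 0.05]]   # non-malicious samples

identify = train_identifier(first_seed_set, second_seed_set)
print(identify([0.85, 0.9]))  # malicious
print(identify([0.1, 0.1]))   # non-malicious
```

Any supervised classifier could take the place of the stand-in; the point of the sketch is only that the model is trained on the reduced seed sets rather than on the full, redundant sample collection.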

In this embodiment, it should be noted that the first seed set containing malicious samples and/or the second seed set containing non-malicious samples is constructed by the data set construction method of any one of the above embodiments, which removes redundant samples from the seed sets and reduces their scale. The samples in the finally formed data set are therefore highly representative: noise around the sample points is greatly reduced, and the data scale is reduced without reducing data coverage. This effectively improves the training convergence speed, prediction accuracy, and generalization performance of the identification model, so that the obtained identification model can accurately determine whether the software to be identified is malware.

The following describes the data set constructing apparatus provided by the present invention, and the data set constructing apparatus described below and the data set constructing method described above may be referred to correspondingly.

Based on the above embodiment, as shown in fig. 6, the present invention provides a data set constructing apparatus for malware identification, including:

a new sample acquisition unit 610 for acquiring a new sample; the sample type of the new sample is consistent with the sample type in a pre-constructed seed set, wherein the seed set contains malicious samples and/or non-malicious samples;

an abnormal sample determining unit 620, configured to determine, based on the new sample and the samples in the seed set, whether the new sample is an abnormal sample by using an anomaly detection algorithm, and if so, add the new sample to the seed set;

and the updating unit 630 is configured to cyclically execute the steps in the new sample obtaining unit and the steps in the abnormal sample determining unit until the number of samples in the seed set meets a preset condition.

Based on the above embodiment, the abnormal sample determining unit 620 is specifically configured to:

Based on the above embodiment, the abnormal sample determining unit 620 being configured to determine whether the new sample is an abnormal sample by using an isolation forest algorithm based on the new sample and the samples in the seed set, and if so, add the new sample to the seed set, includes:

Based on the above embodiment, the preset condition is that a first ratio reaches a preset value, where the first ratio is the ratio of a first sample number, namely the number of samples in the seed set after new samples are added in the current iteration, to a second sample number, namely the number of samples in the seed set before new samples are added in the current iteration.

Based on the above embodiments, as shown in fig. 7, the present invention provides a malware identification apparatus, including:

an obtaining unit 710, configured to obtain software to be identified;

the identification unit 720 is used for inputting the software into the malicious software identification model and acquiring the identification result of the software;

the malware identification model is obtained by performing machine learning training on a first seed set containing malicious samples constructed by adopting the data set construction method in any embodiment and/or a second seed set containing non-malicious samples constructed by adopting the data set construction method in any embodiment.

Based on the above embodiment, as shown in fig. 8, the present invention provides a malware identification model building apparatus, including:

a first constructing unit 810, configured to construct a first seed set containing malicious samples by using the data set construction method according to any of the above embodiments; and/or,

a second constructing unit 820, configured to construct a second seed set containing non-malicious samples by using the data set constructing method according to any of the above embodiments;

the training unit 830 is configured to train the machine learning model in a machine learning manner based on the first seed set and/or the second seed set to obtain the malware recognition model.

Fig. 9 is a schematic structural diagram of an electronic device provided by the present invention. As shown in fig. 9, the electronic device may include a processor 910, a communications interface 920, a memory 930, and a communication bus 940, wherein the processor 910, the communications interface 920, and the memory 930 communicate with each other via the communication bus 940. The processor 910 may invoke logic instructions in the memory 930 to perform a data set construction method comprising: a new sample acquisition step: obtaining a new sample, wherein the sample type of the new sample is consistent with the sample type in a pre-constructed seed set, and the seed set contains malicious samples and/or non-malicious samples; an abnormal sample determining step: determining, based on the new sample and the samples in the seed set, whether the new sample is an abnormal sample by using an anomaly detection algorithm, and if so, adding the new sample to the seed set; and an updating step: cyclically executing the new sample acquisition step and the abnormal sample determining step until the number of samples in the seed set meets a preset condition.

Furthermore, the logic instructions in the memory 930 may be implemented in the form of software functional units and, when sold or used as independent products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the data set construction method provided above, the method comprising: a new sample acquisition step: obtaining a new sample, wherein the sample type of the new sample is consistent with the sample type in a pre-constructed seed set, and the seed set contains malicious samples and/or non-malicious samples; an abnormal sample determining step: determining, based on the new sample and the samples in the seed set, whether the new sample is an abnormal sample by using an anomaly detection algorithm, and if so, adding the new sample to the seed set; and an updating step: cyclically executing the new sample acquisition step and the abnormal sample determining step until the number of samples in the seed set meets a preset condition.

In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the data set construction method provided above, the method comprising: a new sample acquisition step: obtaining a new sample, wherein the sample type of the new sample is consistent with the sample type in a pre-constructed seed set, and the seed set contains malicious samples and/or non-malicious samples; an abnormal sample determining step: determining, based on the new sample and the samples in the seed set, whether the new sample is an abnormal sample by using an anomaly detection algorithm, and if so, adding the new sample to the seed set; and an updating step: cyclically executing the new sample acquisition step and the abnormal sample determining step until the number of samples in the seed set meets a preset condition.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A data set construction method for malware identification, comprising:

2. The data set construction method according to claim 1, wherein the abnormal sample determination step specifically includes:

3. The data set construction method according to claim 2, wherein determining whether the new sample is an abnormal sample by using an isolation forest algorithm based on the new sample and the samples in the seed set, and if so, adding the new sample to the seed set, includes:

4. The data set construction method of claim 3, wherein scoring the new sample for outliers based on the N isolation trees comprises:

5. A malware identification method, comprising:

acquiring software to be identified;

the malware identification model is obtained by performing machine learning training on a first seed set containing malicious samples constructed by the data set construction method according to any one of claims 1 to 4 and/or a second seed set containing non-malicious samples constructed by the data set construction method according to any one of claims 1 to 4.

6. A method for constructing a malware identification model is characterized by comprising the following steps:

constructing a first seed set containing malicious samples by adopting the data set construction method according to any one of claims 1 to 4; and/or,

constructing a second seed set containing non-malicious samples by adopting the data set construction method according to any one of claims 1 to 4;

7. A data set building apparatus for malware identification, comprising:

8. A malware identification device, comprising:

an acquisition unit for acquiring software to be identified;

9. A malware recognition model building apparatus, comprising:

a first construction unit, configured to construct a first seed set including malicious samples by using the data set construction method according to any one of claims 1 to 4; and/or,

a second construction unit, configured to construct a second seed set including non-malicious samples by using the data set construction method according to any one of claims 1 to 4;

10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the data set construction method according to any one of claims 1 to 4; and/or the processor, when executing the program, implements the steps of the malware identification method of claim 5; and/or the processor, when executing the program, implements the steps of the malware identification model building method of claim 6.

11. A non-transitory computer readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the data set construction method according to any one of claims 1 to 4; and/or the processor, when executing the program, implements the steps of the malware identification method of claim 5; and/or the processor, when executing the program, implements the steps of the malware identification model building method of claim 6.

12. A computer program product having stored thereon executable instructions, characterized in that the instructions, when executed by a processor, cause the processor to carry out the steps of the data set construction method according to any one of claims 1 to 4; and/or the processor, when executing the program, implements the steps of the malware identification method of claim 5; and/or the processor, when executing the program, implements the steps of the malware identification model building method of claim 6.