CN114598443A

CN114598443A - Malicious software detector training method, detector, electronic device and storage medium

Info

Publication number: CN114598443A
Application number: CN202210201495.9A
Authority: CN
Inventors: 王海州
Original assignee: Agricultural Bank of China
Current assignee: Agricultural Bank of China
Priority date: 2022-03-03
Filing date: 2022-03-03
Publication date: 2022-06-07

Abstract

The invention discloses a malicious software detector training method, a detector, electronic equipment and a storage medium. The malware detector training method comprises the following steps: acquiring an original sample data set, and obtaining the original malicious software detection rate of the original sample data set; acquiring a characteristic parameter of each original sample, wherein the characteristic parameter is used for representing the uncertainty of the original sample as malicious software; selecting a representative sample data set with a total sample proportion of alpha from the original sample data set according to the characteristic parameters, and obtaining the malicious software detection rate of the representative sample data set, wherein alpha is larger than 0 and smaller than 1, and the difference value between the malicious software detection rate and the original malicious software detection rate is within a first preset range; and inputting the representative sample data set into a preset training model for training to obtain the malicious software detector. The training method of the malicious software detector provided by the invention can reduce the difficulty of model training and ensure the accuracy of the trained model.

Description

Malicious software detector training method, detector, electronic device and storage medium

Technical Field

The present invention relates to the field of software security technologies, and in particular, to a malware detector training method, a detector, an electronic device, and a storage medium.

Background

At present, the huge volume of malware poses a serious threat to the security of the Android system and to the rights and interests of its users. Research on Android malware detection methods is therefore an important part of security protection for mobile terminal operating systems.

Interpretable Android malware detection methods are mainly rule-based. Such a method extracts the permissions that malware requests frequently but benign software requests rarely, takes those permissions as rules for detecting Android malware, and then detects malware with the resulting rule set.

However, the inventors found that although a rule-based Android malware detection method can reflect the causal relationship between features and the detection result, it rests on a large amount of manual analysis, which makes the model difficult to train.

Disclosure of Invention

The invention provides a training method of a malicious software detector, a detector, electronic equipment and a storage medium, which can reduce the difficulty of model training and ensure the accuracy of a trained model.

According to an aspect of the present invention, there is provided a malware detector training method, including: acquiring an original sample data set and obtaining an original malware detection rate of the original sample data set, wherein the original sample data set comprises a plurality of original samples; acquiring a characteristic parameter of each original sample, wherein the characteristic parameter is used for characterizing the uncertainty of the original sample being malware; selecting, according to the characteristic parameters, a representative sample data set that accounts for a proportion α of the total samples from the original sample data set, and obtaining the malware detection rate of the representative sample data set, wherein α is greater than 0 and smaller than 1, and the difference between the malware detection rate and the original malware detection rate is within a first preset range; and inputting the representative sample data set into a preset training model for training to obtain the malware detector.

According to another aspect of the present invention, there is provided a malware detector comprising: an original sample detection rate acquisition module, configured to acquire an original sample data set and obtain an original malware detection rate of the original sample data set, wherein the original sample data set comprises a plurality of original samples; a characteristic parameter obtaining module, configured to obtain a characteristic parameter of each original sample, wherein the characteristic parameter is used for characterizing the uncertainty of the original sample being malware; a representative sample detection rate obtaining module, configured to select, according to the characteristic parameters, a representative sample data set that accounts for a proportion α of the total samples from the original sample data set, and obtain a malware detection rate of the representative sample data set, wherein α is greater than 0 and smaller than 1, and the difference between the malware detection rate and the original malware detection rate is within a first preset range; and a detector training module, configured to input the representative sample data set into a preset training model for training to obtain the malware detector.

According to another aspect of the present invention, there is provided an electronic apparatus including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform a malware detector training method as described in any one of the embodiments of the present invention.

According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement the malware detector training method according to any one of the embodiments of the present invention when executed.

Compared with the related art, the embodiment of the invention at least has the following advantages:

by selecting the representative sample data set from the original sample data set according to the characteristic parameters of the original sample, on one hand, the number of sample data input into the preset model for training is reduced, so that the model does not need to train a large amount of data, and the difficulty of model training is reduced; on the other hand, the difference value between the malicious software detection rate and the original malicious software detection rate can be ensured to be within a first preset range, so that the training effect same as that of training through the original sample data set can be achieved through the representative sample data set training, and the accuracy of the trained preset model is ensured; in addition, the model training method does not need manual analysis, and the labor cost can be greatly reduced.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a flowchart of a malware detector training method according to a first embodiment of the present invention;

FIG. 2 is a flowchart of a malware detector training method according to a second embodiment of the present invention;

FIG. 3 is a flowchart of a malware detector training method according to a third embodiment of the present invention;

FIG. 4 is a flowchart of a malware detector training method according to the fourth embodiment of the present invention;

FIG. 5 is a functional block diagram of a malware detector training method according to the fourth embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a malware detector according to a fifth embodiment of the present invention;

FIG. 7 is a schematic structural diagram of an electronic device implementing the malware detector training method according to the sixth embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Example one

Fig. 1 is a flowchart of a malware detector training method according to a first embodiment of the present invention. As shown in Fig. 1, the method includes:

S110, obtaining an original sample data set, and obtaining an original malware detection rate of the original sample data set.

Specifically, the original sample data set comprises a plurality of original samples. When the original sample data set is obtained, the type of each original sample (malware or benign software) is also known, so the exact number of malware samples in the original sample data set is known at this point. The original samples are input into a feature set training classifier to obtain the probability that each sample is classified as malware or as benign software (for example, if the probability that a sample is malware is 0.6 and the probability that it is benign software is 0.4, the sample is judged to be malware). Assuming that the original sample data set contains 500 malware samples and that the feature set training classifier detects 450 of them, the original malware detection rate of the original sample data set is 450/500 = 90%.
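The detection rate calculation in S110 can be sketched in a few lines of Python. This is only an illustration: the function name `detection_rate`, the 0/1 label encoding and the 0.5 decision threshold are assumptions made for the example, not details given in the text.

```python
def detection_rate(true_labels, malware_probs, threshold=0.5):
    """Fraction of truly malicious samples that the classifier flags as malware.

    true_labels: 1 for malware, 0 for benign (assumed encoding).
    malware_probs: classifier probability that each sample is malware.
    threshold: assumed decision threshold; a sample above it counts as malware.
    """
    detected = sum(
        1
        for label, prob in zip(true_labels, malware_probs)
        if label == 1 and prob > threshold
    )
    total_malware = sum(1 for label in true_labels if label == 1)
    if total_malware == 0:
        raise ValueError("the sample set contains no malware")
    return detected / total_malware


# Numbers from the example above: 500 malware samples, 450 of them detected.
labels = [1] * 500 + [0] * 100
probs = [0.6] * 450 + [0.4] * 50 + [0.1] * 100
rate = detection_rate(labels, probs)
print(f"original malware detection rate: {rate:.0%}")
```

The 100 benign samples are included only to show that they do not enter the ratio.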

S120, acquiring a characteristic parameter of each original sample.

Specifically, the characteristic parameter is used for characterizing the uncertainty of the original sample being malware. In this embodiment, the characteristic parameter is information entropy, and the information entropy of the original samples can be obtained as follows:

inputting the plurality of original samples into a preset training classifier to obtain the probability that each original sample is classified as malware or benign software; the information entropy is then obtained according to the following formula:

$$H(y) = -\sum_{i=1}^{n} p(y_i)\,\log p(y_i)$$

where $n$ is the number of original samples, $i$ is the index of the original sample, $p(y_i)$ is the probability that the original sample is classified as malware or benign software, and $H(y)$ is the information entropy.

It should be noted that the preset training classifier may be a feature set training classifier mentioned above, and the present embodiment does not specifically limit the type of the preset training classifier, and only needs to be able to distinguish whether the original sample is malware or benign software.
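As a hedged illustration of how an entropy score can express the uncertainty of a single sample, the sketch below computes the Shannon entropy over the two class probabilities (malware, benign) of one sample. Treating the sum as running over the two classes of one sample, and using a base-2 logarithm, are assumptions made for this example; `binary_entropy` is a hypothetical helper name.

```python
import math


def binary_entropy(p_malware):
    """Shannon entropy (in bits) of a two-class prediction for one sample."""
    entropy = 0.0
    for p in (p_malware, 1.0 - p_malware):
        if p > 0.0:  # by convention, 0 * log(0) contributes nothing
            entropy -= p * math.log2(p)
    return entropy


h_uncertain = binary_entropy(0.5)   # classifier cannot decide
h_confident = binary_entropy(0.99)  # classifier is almost sure
print(h_uncertain, h_confident)
```

A sample with probabilities close to 0.5/0.5 gets the highest score, which matches the idea that high entropy marks samples that are hard to judge.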

S130, selecting, according to the characteristic parameters, a representative sample data set that accounts for a proportion α of the total samples from the original sample data set, and obtaining the malware detection rate of the representative sample data set.

Specifically, α is greater than 0 and less than 1, and the difference between the malware detection rate and the original malware detection rate is within a first preset range.

It is worth mentioning that, to keep the difference between the malware detection rate and the original malware detection rate within the first preset range, the original samples in the original sample data set are first sorted in descending order of information entropy, and the samples with the largest information entropy, amounting to a proportion α of the total samples, are then selected as the representative sample data set. The larger the information entropy, the higher the uncertainty of the original sample being malware, that is, the harder it is to judge whether the original sample is malware. Selecting by entropy therefore makes the representative sample data set more targeted and gives it wider coverage, so that the difference between the malware detection rate and the original malware detection rate stays within the first preset range and the accuracy of subsequent training is ensured.
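The selection step can be sketched as a sort followed by a cut at proportion α. The helper name `select_representative` and the choice to round the subset size up are assumptions of this example; the text only specifies descending order of entropy and a proportion α.

```python
import math


def select_representative(samples, entropies, alpha):
    """Keep the proportion `alpha` of samples with the largest entropy."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be greater than 0 and smaller than 1")
    if len(samples) != len(entropies):
        raise ValueError("samples and entropies must have the same length")
    # Indices sorted by entropy, largest first.
    order = sorted(range(len(samples)), key=lambda i: entropies[i], reverse=True)
    # Assumed rounding rule: round the subset size up, keep at least one sample.
    keep = max(1, math.ceil(alpha * len(samples)))
    return [samples[i] for i in order[:keep]]


apps = ["a", "b", "c", "d"]
scores = [0.2, 0.9, 0.5, 1.0]
representative = select_representative(apps, scores, alpha=0.5)
print(representative)
```

With four samples and α = 0.5, the two samples with the highest entropy scores are kept.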

S140, inputting the representative sample data set into a preset training model for training to obtain the malware detector.

Specifically, the preset training model in this example may be a detection model constructed based on the AdaBoost algorithm.

Compared with the related art, the embodiment of the invention at least has the following advantages: by selecting the representative sample data set from the original sample data set according to the characteristic parameters of the original sample, on one hand, the number of sample data input into the preset model for training is reduced, so that the model does not need to train a large amount of data, and the difficulty of model training is reduced; on the other hand, the difference value between the malicious software detection rate and the original malicious software detection rate can be ensured to be within a first preset range, so that the training effect same as that of training through the original sample data set can be achieved through the representative sample data set training, and the accuracy of the trained preset model is ensured; in addition, the model training method does not need manual analysis, and the labor cost can be greatly reduced.

Example two

Fig. 2 is a flowchart of a malware detector training method according to a second embodiment of the present invention, which exemplifies the foregoing embodiment and specifically illustrates: how to ensure that the difference between the malware detection rate and the original malware detection rate is within a first preset range.

Specifically, as shown in fig. 2, the method includes:

S210, obtaining an original sample data set, and obtaining an original malware detection rate of the original sample data set.

S220, acquiring a characteristic parameter of each original sample.

S230, selecting, according to the characteristic parameters, a representative sample data set that accounts for a proportion α of the total samples from the original sample data set, and obtaining the malware detection rate of the representative sample data set.

S240, judging whether the difference between the malware detection rate and the original malware detection rate is within a first preset range; if so, executing step S260; if not, executing step S250.

Specifically, the first preset range may be set according to actual requirements, for example, the first preset range may be set to be 0 to 0.3%, and the size of the first preset range is not specifically limited in this embodiment.

S250, adjusting the size of α, and returning to step S230.

Specifically, if the difference between the malware detection rate and the original malware detection rate is not within the first preset range, α is increased step by step until the difference between the new malware detection rate and the original malware detection rate falls within the first preset range. If the initially selected α already keeps the difference within the first preset range, α can be reduced appropriately so that the number of representative samples, and with it the difficulty of model training, is as small as possible.
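The loop formed by S230 to S250 can be sketched as a simple search over α. Everything specific here is an assumption made for illustration: the function name `tune_alpha`, the starting value 0.5, the fixed step of 0.05, and `toy_rate`, a made-up stand-in for "select the subset and measure its detection rate".

```python
def tune_alpha(rate_for_alpha, original_rate, tolerance, alpha=0.5, step=0.05):
    """Search for a proportion alpha whose detection rate stays close to the original."""

    def within_range(a):
        return abs(rate_for_alpha(a) - original_rate) <= tolerance

    if within_range(alpha):
        # The starting alpha already works: shrink it while the smaller set still does.
        while alpha - step > 0.0 and within_range(round(alpha - step, 10)):
            alpha = round(alpha - step, 10)
        return alpha
    # Otherwise keep increasing alpha until the difference falls within the range.
    while round(alpha + step, 10) < 1.0:
        alpha = round(alpha + step, 10)
        if within_range(alpha):
            return alpha
    raise ValueError("no alpha below 1 satisfies the tolerance")


def toy_rate(alpha):
    # Hypothetical detection rate that improves as more samples are kept.
    return 0.90 - 0.02 * (1.0 - alpha)


alpha_up = tune_alpha(toy_rate, original_rate=0.90, tolerance=0.0035, alpha=0.5)
alpha_down = tune_alpha(toy_rate, original_rate=0.90, tolerance=0.0035, alpha=0.95)
print(alpha_up, alpha_down)
```

Both calls settle on the same α in this toy setting: one reaches it by growing the subset, the other by shrinking it.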

S260, inputting the representative sample data set into a preset training model for training to obtain the malware detector.

It is to be understood that steps S210 to S230, and S260 in this embodiment are the same as steps S110 to S140 in the foregoing embodiment, and are not repeated herein to avoid repetition.

Example three

Fig. 3 is a flowchart of a malware detector training method according to a third embodiment of the present invention, which exemplifies the foregoing embodiments and specifically illustrates: how to obtain a malware detector.

Specifically, as shown in fig. 3, the method includes:

S310, obtaining an original sample data set, and obtaining an original malware detection rate of the original sample data set.

S320, acquiring a characteristic parameter of each original sample.

S330, selecting, according to the characteristic parameters, a representative sample data set that accounts for a proportion α of the total samples from the original sample data set, and obtaining the malware detection rate of the representative sample data set.

S340, inputting the representative sample data set into a detection model based on the AdaBoost algorithm, and extracting initial detection rules.

Specifically, in this embodiment, the detection model based on the AdaBoost algorithm constructs a plurality of interdependent decision trees, and the final classification result is a weighted combination of the outputs of all the decision trees. Sample selection and attribute selection serve as two random processes in generating the decision trees, which effectively reduces overfitting. Meanwhile, detecting malware with a combination of several trees avoids the under-fitting that can result from discrimination by a single tree, and can noticeably improve the detection effect.

For the convenience of understanding, the method for extracting the initial rule in the embodiment is specifically described below:

A rule $r = \{\text{if } f_1 \cap f_2 \cap f_3 \dots \cap f_n \text{ then result}\}$ consists of a rule body and a detection result, where the rule body is $C = \{f_1 \cap f_2 \cap f_3 \dots \cap f_n\}$ and the detection result is that the application is malicious or benign. For every leaf node of a tree, the path from the root node to that leaf node forms one rule. For example, in "if android>0.5 then malware", "android>0.5" is a Boolean expression, "&" is the logical connection word that joins Boolean expressions, and "malware" is the detection result.
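To make the root-to-leaf idea concrete, the following sketch walks a small hand-written decision tree and emits one rule per leaf. The nested-dictionary tree layout, the feature names `perm_a` and `perm_b`, and the helpers `extract_rules` and `format_rule` are all hypothetical; a trained AdaBoost model would expose its trees through its own library-specific structures.

```python
def extract_rules(node, path=()):
    """Return one (conditions, result) rule for every root-to-leaf path."""
    if "label" in node:  # leaf node: the conditions collected so far form the rule body
        return [(list(path), node["label"])]
    feature, threshold = node["feature"], node["threshold"]
    rules = []
    # Assumed convention: the left branch is "<= threshold", the right branch is "> threshold".
    rules += extract_rules(node["left"], path + ((feature, "<=", threshold),))
    rules += extract_rules(node["right"], path + ((feature, ">", threshold),))
    return rules


def format_rule(conditions, result):
    """Render a rule as 'if f1 & f2 ... then result'."""
    body = " & ".join(f"{feat}{op}{thr}" for feat, op, thr in conditions)
    return f"if {body} then {result}"


toy_tree = {
    "feature": "perm_a", "threshold": 0.5,
    "left": {"label": "benign"},
    "right": {
        "feature": "perm_b", "threshold": 0.5,
        "left": {"label": "benign"},
        "right": {"label": "malware"},
    },
}

rules = extract_rules(toy_tree)
for conditions, result in rules:
    print(format_rule(conditions, result))
```

Each printed line has the shape described above: Boolean expressions joined by "&", followed by the detection result.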

S350, removing the redundant logical connection words from each initial detection rule, and taking the initial detection rule with the redundant logical connection words removed as a simplified detection rule.

Specifically, in this embodiment, the initial detection rule may be pruned by a leave-one-out pruning method so as to remove its redundant logical connection words.

The leave-one-out pruning method specifically comprises the following steps: acquiring the initial error rate of an initial detection rule and the error rates of the initial detection rule after each logical connection word is removed in turn; judging whether the difference between each error rate and the initial error rate is within a second preset range; when the difference is within the second preset range, that is, when removing the logical connection word barely changes the error rate, taking the logical connection word corresponding to that error rate as a redundant logical connection word; and removing the redundant logical connection words.

It should be noted that the second preset range may be set according to actual requirements, for example, the second preset range may be set to 0 to 0.1%, and the size of the second preset range is not specifically limited in this embodiment.

Please refer to table 1, which shows the simplified rule extraction method code of this embodiment:

TABLE 1

For ease of understanding, the following is a specific example of how the embodiment removes redundant logical connection words:

Assume that the second preset range is 0 to 0.05%, that the initial detection rule is "A and B and C or D" (A, B, C and D are all rules; "and" and "or" are logical connection words), and that the initial error rate of the initial detection rule is 2%. Removing A, B, C and D one at a time gives a first rule "B and C or D", a second rule "A and C or D", a third rule "A and B or D", and a fourth rule "A and B and C". If the error rate of the first rule is 2.1%, that of the second rule is 2.02%, that of the third rule is 2.08%, and that of the fourth rule is 2.2%, then only the second rule differs from the initial error rate by no more than 0.05% (a difference of 0.02%), so the logical connection word corresponding to rule B is a redundant logical connection word and is removed. The process is repeated until all redundant logical connection words in the initial detection rule have been removed.
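One pass of leave-one-out pruning can be sketched as follows, using the error rates from the example above (in percent). The function name `find_redundant` and the lookup table standing in for a real error-rate measurement are assumptions of this sketch; the redundancy test follows the worked example, in which the element whose removal changes the error rate least (B) is the one removed.

```python
def find_redundant(conditions, error_rate, tolerance):
    """Return the elements whose removal changes the error rate by at most `tolerance`."""
    base = error_rate(conditions)
    redundant = []
    for i, cond in enumerate(conditions):
        reduced = conditions[:i] + conditions[i + 1:]  # leave element i out
        if abs(error_rate(reduced) - base) <= tolerance:
            redundant.append(cond)
    return redundant


# Error rates in percent, taken from the worked example.
errors = {
    ("A", "B", "C", "D"): 2.0,   # initial detection rule
    ("B", "C", "D"): 2.1,        # A removed
    ("A", "C", "D"): 2.02,       # B removed
    ("A", "B", "D"): 2.08,       # C removed
    ("A", "B", "C"): 2.2,        # D removed
}

redundant = find_redundant(("A", "B", "C", "D"), lambda rule: errors[rule], tolerance=0.05)
print(redundant)
```

This covers a single pass only; in practice the pass would be repeated on the pruned rule until nothing further is flagged.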

S360, constructing a malware detector according to the simplified detection rules.

It is to be understood that steps S310 to S330, and S360 in this embodiment are the same as steps S210 to S230, and S260 in the foregoing embodiment, and are not repeated herein to avoid repetition.

Example four

Fig. 4 is a flowchart of a malware detector training method according to a fourth embodiment of the present invention, which is further improved on the basis of the foregoing embodiment, and the specific improvement is that: and after the simplified rule is obtained, removing the redundant rule in the simplified rule, and then constructing a malicious software detector according to the simplified detection rule after the redundant rule is removed. In this way, effective detection of malware can be achieved and the interpretability of the malware detector is improved.

Specifically, as shown in fig. 4, the method includes:

S410, obtaining an original sample data set, and obtaining an original malware detection rate of the original sample data set.

S420, acquiring a characteristic parameter of each original sample.

S430, selecting, according to the characteristic parameters, a representative sample data set that accounts for a proportion α of the total samples from the original sample data set, and obtaining the malware detection rate of the representative sample data set.

S440, inputting the representative sample data set into a detection model based on the AdaBoost algorithm, and extracting initial detection rules.

S450, removing the redundant logical connection words from each initial detection rule, and taking the initial detection rule with the redundant logical connection words removed as a simplified detection rule.

S460, obtaining evaluation index parameters of the simplified detection rules.

Specifically, the evaluation index parameter in this embodiment includes one of the following or any combination thereof: rule frequency of occurrence, rule error rate, and rule length.

More specifically, the rule occurrence frequency is the proportion of the number of samples satisfying the rule to the total number of samples; the rule error rate is the error rate caused by classification by using the rule, and the smaller the error rate is, the better the capability of the rule for detecting the malicious software is represented; the rule length is the number of logical connection words of the rule, and a smaller number indicates a higher readability and a higher interpretability of the rule.
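The three evaluation index parameters can be computed directly from a labelled sample set, as in the sketch below. Several choices are assumptions made for illustration: rules are represented as lists of Python predicates, the error rate is measured over the samples the rule covers, and the rule length is counted as the number of conditions in the rule body. The feature names `perm_a` and `perm_b` are invented.

```python
def rule_metrics(conditions, predicted_label, samples, labels):
    """Return (frequency, error_rate, length) of one rule on a labelled sample set."""
    # Labels of the samples that satisfy every condition of the rule.
    covered = [
        label
        for sample, label in zip(samples, labels)
        if all(cond(sample) for cond in conditions)
    ]
    frequency = len(covered) / len(samples)
    errors = sum(1 for label in covered if label != predicted_label)
    error_rate = errors / len(covered) if covered else 0.0
    return frequency, error_rate, len(conditions)


samples = [
    {"perm_a": 1, "perm_b": 1},
    {"perm_a": 1, "perm_b": 0},
    {"perm_a": 1, "perm_b": 1},
    {"perm_a": 0, "perm_b": 1},
]
labels = ["malware", "benign", "benign", "benign"]
rule_body = [lambda s: s["perm_a"] > 0.5, lambda s: s["perm_b"] > 0.5]

metrics = rule_metrics(rule_body, "malware", samples, labels)
print(metrics)
```

Here the rule covers two of the four samples and misclassifies one of them, so both the frequency and the error rate come out at one half.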

S470, removing the redundant rules from the simplified detection rules according to the evaluation index parameters.

Specifically, among detection rules that conflict with one another, the rule with a low error rate, a high frequency and a short length is selected first. Then a rule matrix containing sample serial numbers, rules and sample labels is constructed as a rule training set, as shown in Table 2. The sample label indicates whether the sample is malware or benign software; a value of 1 in the matrix means that the sample conforms to the corresponding rule, and a value of 0 means that it does not. Next, an AdaBoost classifier is trained on the rule training set to obtain the importance of each rule. Finally, the rules are sorted in descending order of importance and the redundant rules are eliminated.

TABLE 2
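A rule matrix of the kind described for the rule training set can be built as in this sketch. The row layout (one 0/1 entry per rule, followed by the sample label), the predicate-based rule representation and the feature names are assumptions of the example; training an AdaBoost classifier on the matrix to rank the rules is not shown.

```python
def build_rule_matrix(samples, labels, rules):
    """Build one row per sample: a 0/1 entry for each rule, then the sample label."""
    matrix = []
    for sample, label in zip(samples, labels):
        # 1 if the sample satisfies every condition of the rule, otherwise 0.
        row = [1 if all(cond(sample) for cond in rule) else 0 for rule in rules]
        matrix.append(row + [label])
    return matrix


uses_a = lambda s: s["perm_a"] > 0.5
uses_b = lambda s: s["perm_b"] > 0.5
rules = [[uses_a], [uses_b], [uses_a, uses_b]]

samples = [
    {"perm_a": 1, "perm_b": 0},
    {"perm_a": 1, "perm_b": 1},
    {"perm_a": 0, "perm_b": 0},
]
labels = ["benign", "malware", "benign"]

rule_matrix = build_rule_matrix(samples, labels, rules)
for serial, row in enumerate(rule_matrix, start=1):
    print(serial, row)
```

The row index plays the role of the sample serial number, and the last column carries the sample label.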

S480, constructing a malware detector according to the simplified detection rules from which the redundant rules have been removed.

Referring to Fig. 5, for ease of understanding, the malware detector training method of this example is described in detail below:

The framework of the malware detector training method mainly comprises two modules: representative sample selection and rule detection model construction. In the representative sample selection module, samples are selected by information entropy, redundant samples are removed, and the representative samples are taken as a new data set. In the detection rule extraction module, an AdaBoost model is trained with the new data set; operations such as initial rule extraction, rule measurement, rule pruning and redundant rule elimination are then carried out on the trained random forest model to obtain a simplified rule set; finally, a malware rule detector is constructed from the simplified rule set to detect Android malware, yielding both a detection result (whether an application is malicious or benign) and an interpretation of that result.

The steps of the above methods are divided only for clarity of description. In implementation, several steps may be combined into one step, or one step may be split into several steps; as long as the same logical relationship is included, all such variants fall within the protection scope of this patent. Adding insignificant modifications to an algorithm or process, or introducing insignificant design changes, without changing the core design of the algorithm or process, also falls within the protection scope of this patent.

Example five

Fig. 6 is a schematic structural diagram of a malware detector according to a fifth embodiment of the present invention. As shown in fig. 6, the malware detector includes:

an original sample detection rate acquisition module 1, configured to acquire an original sample data set and obtain the original malware detection rate of the original sample data set, wherein the original sample data set comprises a plurality of original samples;

a characteristic parameter obtaining module 2, configured to obtain a characteristic parameter of each original sample, wherein the characteristic parameter is used for characterizing the uncertainty of the original sample being malware;

a representative sample detection rate obtaining module 3, configured to select, according to the characteristic parameters, a representative sample data set that accounts for a proportion α of the total samples from the original sample data set, and obtain a malware detection rate of the representative sample data set, wherein α is greater than 0 and smaller than 1, and the difference between the malware detection rate and the original malware detection rate is within a first preset range;

and the detector training module 4 is used for inputting the representative sample data set into a preset training model for training to obtain the malicious software detector.

It should be understood that this embodiment is a device embodiment corresponding to the first embodiment and can be implemented in cooperation with the first embodiment. The related technical details mentioned in the first embodiment remain valid in this embodiment and are not repeated here. Accordingly, the related technical details mentioned in this embodiment can also be applied to the first embodiment.

It should be noted that, all modules involved in this embodiment are logic modules, and in practical application, one logic unit may be one physical unit, may also be a part of one physical unit, and may also be implemented by a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, a unit which is not so closely related to solve the technical problem proposed by the present invention is not introduced in the present embodiment, but this does not indicate that there is no other unit in the present embodiment.

The following specific experimental analysis is performed on the malware detector provided in this embodiment:

first, 80% of the data were used as training set and 20% of the data were used as test set.

And secondly, selecting the parameters of the malware detection model extracted based on the rules by utilizing a training set. When selecting the representative sample, the input features 274 dimensions, with a representative sample scale of 0.5. When a random forest model is trained, a ten-fold cross validation method is adopted for parameter adjustment, 80% of data in a training set is taken as training data in turn, 20% of data is taken as validation data to optimize model parameters, the number of final trees is 100, and the number of features randomly selected by each decision tree is 241. When the rules are extracted from the detection model based on the AdaBoost algorithm, a ten-fold cross validation method is adopted for parameter adjustment, 80% of data in a training set are taken as training data and 20% of data are taken as validation data to optimize model parameters in turn, the minimum frequency threshold of the rules is 5e-04, the error rate threshold is 0.04, and the rule length threshold is 3.

Finally, the test set was used to calculate the evaluation indexes of the interpretable malware detection rule extraction method (RBE) and 9 comparison methods, the evaluation indexes being accuracy, precision, recall and F value. The 9 comparison methods are: decision tree (DT), gradient boosting decision tree (GBDT), neural network (MLP), logistic regression (LR), boosting (ADABOOST), naive Bayes (NB), RFRULES, EBBAM and Sigpid. The 6 detection tools used for comparison are: AntiVir, AVG, BitDefender, ClamAV, ESET and F-Secure.
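The four evaluation indexes can be computed from binary predictions roughly as follows. This is a minimal sketch rather than the evaluation code used in the experiment; it assumes that label 1 denotes malware, that the "F value" is the F1 score, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = malware)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malware detected
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign accepted
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malware missed
    total = tp + tn + fp + fn

    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # the "detection rate"
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration only, not experimental data.
metrics = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                                 [1, 1, 0, 0, 0, 1, 1, 0])
```

Recall is the quantity the text also calls the detection rate, which is why it is the index used when comparing against the detection tools.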

TABLE 3 malicious software detection test results

Table 4 malware detection tool comparison experimental results

The experimental results show that:

(1) The interpretable malware detection rule extraction method outperforms the comparison methods. When the malware detector training method of the previous embodiment is compared with the other basic classification algorithms, the experimental results show that the accuracy (0.974), recall (0.982) and F1 value (0.978) of the rule-extraction-based malware detection algorithm are all higher than those of the other algorithms. The method automatically extracts a small number of low-complexity, high-accuracy decision rules from the tree model and mines the relationship between the Boolean logic among features and the detection result, thereby detecting malware effectively; it therefore detects better than the single models used in the comparison methods.

(2) The interpretable malware detection rule extraction method outperforms the compared malware detection tools. The malware detector of this embodiment was compared with the detection tools AntiVir, AVG, BitDefender, ClamAV, ESET and F-Secure. The detection rate (recall) of the interpretable Android malware detection rule extraction method is 98.2%, higher than that of all 6 detection tools. The detection tools rely on rules summarized by experts; these rules are not updated in a timely manner, which limits the tools' ability to detect malware. For example, the F-Secure tool reaches a detection rate (recall) of only 64.16%. Compared with the detection tools, the interpretable Android malware detection rule extraction method stays up to date and can detect malware effectively.

Case analysis, the number of rules and the interpretation degree are used as evaluation indexes of the experimental results. Case analysis interprets a specific malware family using the interpretation result and evaluates whether that result matches the characteristics of the family; the number of rules is the number of rules in the rule set. The interpretation degree of a rule set is defined in terms of the following quantities: RuleSet is the rule set, weight_i is the weight of the i-th rule, and i is the index of the rule. Each rule also has its own interpretation degree, which is defined in terms of maxAttribute, the number of attributes, and currCondition, the number of conditions in that rule.

First, 80% of the data in the DREBIN data set was used as the training set and 20% as the test set. The training set was then used to select the parameters of the rule-extraction-based Android malware detection model. 274-dimensional features selected from the DREBIN data set were used as the input feature set.
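An 80/20 split of this kind can be sketched as below. The text does not say how the samples were partitioned, so the random shuffle, the fixed seed and the function name `train_test_split` are assumptions made for illustration only.

```python
import random


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and split them into (train, test) lists.

    The seeded shuffle is an assumption; the source only states the
    80%/20% proportions.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# 100 placeholder sample ids: 80 go to training, 20 to testing.
train_ids, test_ids = train_test_split(range(100))
```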

When selecting representative samples, the classification probability of each sample was calculated using a random forest algorithm with 100 trees, and the optimal sample proportion was 0.5. When training the random forest model, ten-fold cross-validation was used for parameter tuning: in turn, 80% of the training set was taken as training data and 20% as validation data to optimize the model parameters. The number of trees was 100, and the number of features randomly selected by each decision tree was 241.
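The selection step can be illustrated with a short sketch that ranks samples by the information entropy of their predicted class probability and keeps the most uncertain fraction. It assumes a two-class setting, a base-2 logarithm (the base does not affect the ranking) and that the probabilities have already been produced by a classifier such as the random forest mentioned above; the function names are hypothetical.

```python
import math


def binary_entropy(p_malware):
    """Shannon entropy in bits of a two-class prediction P(malware) = p_malware."""
    entropy = 0.0
    for p in (p_malware, 1.0 - p_malware):
        if p > 0.0:  # treat 0 * log(0) as 0
            entropy -= p * math.log2(p)
    return entropy


def select_representative(probabilities, alpha=0.5):
    """Return indices of the highest-entropy samples, keeping a fraction alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be greater than 0 and smaller than 1")
    keep = max(1, int(alpha * len(probabilities)))
    # Descending order of entropy: the most uncertain samples come first.
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: binary_entropy(probabilities[i]),
                    reverse=True)
    return ranked[:keep]


# Toy malware probabilities for six samples, for illustration only.
chosen = select_representative([0.5, 0.99, 0.6, 0.05, 0.45, 0.9], alpha=0.5)
```

Samples whose probability is close to 0.5 carry the highest entropy, so with a proportion of 0.5 the three most ambiguous samples are kept and the confidently classified ones are dropped.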

When the rules are extracted from the detection model based on the AdaBoost algorithm, a ten-fold cross validation method is adopted for parameter adjustment, and 80% of data in a training set are taken as training data and 20% of data are taken as validation data in turn to optimize model parameters. Wherein the rule minimum frequency threshold is 5e-04, the error rate threshold is 0.04, the rule length threshold is 3, and the number of trees of the AdaBoost algorithm is 100.
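One way to read the three thresholds is as a filter over candidate rules, sketched below. The direction of each comparison (keep rules whose frequency is at least the minimum, whose error rate is at most the threshold and whose length is at most the threshold) is an assumption, as are the `Rule` data structure and the function name `filter_rules`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rule:
    conditions: Tuple[str, ...]  # one entry per condition in the rule
    frequency: float             # fraction of samples the rule covers
    error_rate: float            # fraction of covered samples misclassified


def filter_rules(rules: List[Rule],
                 min_frequency: float = 5e-4,
                 max_error_rate: float = 0.04,
                 max_length: int = 3) -> List[Rule]:
    """Keep rules that are frequent enough, accurate enough and short enough."""
    return [r for r in rules
            if r.frequency >= min_frequency
            and r.error_rate <= max_error_rate
            and len(r.conditions) <= max_length]


# Hypothetical candidate rules, for illustration only.
candidates = [
    Rule(("a > 0.5", "b <= 0.5"), frequency=0.02, error_rate=0.01),
    Rule(("a > 0.5", "b > 0.5", "c > 0.5", "d > 0.5"), 0.02, 0.01),  # too long
    Rule(("a > 0.5",), frequency=1e-4, error_rate=0.01),             # too rare
    Rule(("a > 0.5", "c > 0.5"), frequency=0.01, error_rate=0.10),   # too inaccurate
]
kept = filter_rules(candidates)
```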

Second, the test set was used to calculate the evaluation indexes, namely the interpretation degree and the number of rules, of the interpretable Android malware detection rule extraction method (RBE) and the RFRULES method (2019).

Finally, the FakeInstaller family (925 malware samples) and the Down family (3385 malware samples) in the DREBIN data set were each selected as a new data set, the above process was repeated, and the interpretation results of the interpretable Android malware detection rule extraction method (RBE) and the EBBAM method (2018) were output for each.

TABLE 5 malware interpretability comparison experiment results

TABLE 6 malware interpretability comparison experiment results (Fake Installer family)

TABLE 7 malware interpretability comparison experimental results (Down family)

The experimental results show that:

The interpretable malware detection rule extraction method (RBE method) is more interpretable than the RFRULES method. The RBE method produces 34 rules, 221 fewer than the 255 rules of the RFRULES method, and its interpretation degree (99.73%) is 1.04% higher than that of the comparison method.

The interpretable malware detection rule extraction method (RBE method) is more interpretable than the EBBAM method. Comparative analysis was performed using the FakeInstaller malware family and the Down malware family as examples.

The FakeInstaller malware family exhibits two malicious behaviors: (1) fees are charged without the user's permission; and (2) the user's phone is remotely controlled through a program backdoor. As can be seen from the experimental results in Table 6, the rule "If Permission:SEND_SMS > 0.5 AND android.intent.action.VIEW ≤ 0.5" shows that this family obtains the SMS permission without asking for the user's consent, whereas the EBBAM method explains the behavior using the SEND_SMS permission alone. That permission is also commonly requested by benign software, so it only indicates that the feature contributes strongly to the classifier; the feature cannot be used directly to distinguish malicious software from benign software. The rule "android.intent.test > 0.5 AND android.app.KeyguardManager.exitKeyguardSecurely ≤ 0.5" shows that when the Android system is in test mode and unlocked, a remote server may control it, whereas the EBBAM method explains the behavior using only the READ_PHONE_STATE permission, from which it is difficult to establish a direct logical relationship between the feature and the detection result.

The Down malware family is advertising malware that is bundled with certain applications. The family exhibits two malicious behaviors: (1) the device continuously downloads and installs other malware, which slows down the user's phone and interferes with normal use; and (2) the user's device information is sent to a remote server. As can be seen from the experimental results in Table 7, the rule "android.app.DownloadManager.addCompletedDownload > 1.5 AND android.intent.action.PACKAGE_ADDED < 0.5" indicates that the application downloads malware without the user's consent. The rule "android.app.notificationartificial > 4" shows that the malware displays messages in the notification bar more than 4 times. Furthermore, the rule "android.telephony.TelephonyManager.getSimSerialNumber > 1.5 AND java.net.URL.openStream < 0.5" shows that the malware obtains the user's number and sends it to a certain link. In the EBBAM method, by contrast, the 4 most important features have no direct relation to these two behaviors and are all features that normal software also uses.
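Rules of the kind quoted above are conjunctions of threshold conditions on feature values, so checking one against a sample is straightforward. The sketch below is only an illustration: the tuple representation, the treatment of missing features as 0, the function name `rule_matches` and the normalized feature spellings are assumptions, not the detector's actual implementation.

```python
import operator

# Comparison operators allowed in a rule condition.
OPS = {">": operator.gt, ">=": operator.ge,
       "<": operator.lt, "<=": operator.le}


def rule_matches(rule, features):
    """Return True when every (feature, op, threshold) condition holds.

    Features missing from the sample are treated as 0 (an assumption).
    """
    return all(OPS[op](features.get(name, 0), threshold)
               for name, op, threshold in rule)


# Illustrative rule modeled on the download rule discussed for the Down family.
download_rule = [
    ("android.app.DownloadManager.addCompletedDownload", ">", 1.5),
    ("android.intent.action.PACKAGE_ADDED", "<", 0.5),
]

# Hypothetical feature counts for two applications.
silent_downloader = {"android.app.DownloadManager.addCompletedDownload": 3}
ordinary_app = {"android.app.DownloadManager.addCompletedDownload": 3,
                "android.intent.action.PACKAGE_ADDED": 1}
```

Because each rule names the features and thresholds it depends on, a match can be read directly as an explanation of the behavior, which is the property the comparison with the EBBAM method relies on.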

In summary, the malware detector training method in the foregoing embodiment can reflect causal relationships between interactions between features and detection results, and has lower rule complexity and higher interpretability.

EXAMPLE six

FIG. 7 illustrates a schematic diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.

As shown in fig. 7, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM)12, a Random Access Memory (RAM)13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM)12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.

A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as the malware detector training method.

In some embodiments, the malware detector training method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the malware detector training method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the malware detector training method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.

The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired result of the technical solution of the present invention can be achieved.

The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A malware detector training method, comprising:

obtaining an original sample data set and obtaining an original malware detection rate of the original sample data set, wherein the original sample data set comprises a plurality of original samples;

acquiring a characteristic parameter of each original sample, wherein the characteristic parameter is used for representing the uncertainty of the original sample being malicious software;

according to the characteristic parameters, selecting a representative sample data set with a total sample proportion of alpha from the original sample data set, and obtaining the malware detection rate of the representative sample data set, wherein alpha is larger than 0 and smaller than 1, and the difference value between the malware detection rate and the original malware detection rate is within a first preset range;

and inputting the representative sample data set into a preset training model for training to obtain the malicious software detector.

2. The malware detector training method of claim 1, wherein the characteristic parameter is information entropy;

the obtaining of the characteristic parameters of each original sample comprises:

inputting a plurality of original samples into a preset training classifier to obtain the probability that each original sample is classified into malicious software or benign software;

the information entropy is obtained according to the following formula:

$$H(y) = -\sum_{i=1}^{n} p(y_i)\,\log p(y_i)$$

where n is the number of original samples, i is the index of the original sample, p(y_i) is the probability that the original sample is classified as malware or benign software, and H(y) is the information entropy.

3. The malware detector training method of claim 1, wherein selecting a representative sample data set with a total sample proportion α from the original sample data set according to the feature parameters comprises:

according to the magnitude of the information entropy, performing descending order arrangement on a plurality of original samples in the original sample data set;

and selecting, from the plurality of original samples, the samples that have the largest information entropy and account for a proportion alpha of the total samples as the representative sample data set.

4. The malware detector training method of any one of claims 1-3, further comprising, after obtaining the malware detection rate for the representative sample dataset:

judging whether the difference value between the malicious software detection rate and the original malicious software detection rate is within the first preset range or not;

when the difference value is judged to be within the first preset range, inputting the representative sample data set into the preset training model;

and when the judgment result is not in the first preset range, adjusting the size of alpha to obtain a new malicious software detection rate until the difference value between the new malicious software detection rate and the original malicious software detection rate is in the first preset range.

5. The malware detector training method of claim 1, wherein the inputting the representative sample data set into a preset training model for training to obtain a malware detector comprises:

inputting the representative sample data set into a detection model based on an AdaBoost algorithm, and extracting an initial detection rule, wherein the initial detection rule is a characteristic expression connected by a plurality of logic connection words;

removing redundant logic connection words in each initial detection rule, and taking the initial detection rule without the redundant logic connection words as a simplified detection rule;

and constructing the malicious software detector according to the simplified detection rule.

6. The malware detector training method of claim 5, wherein the removing redundant logical connection words in each of the initial detection rules comprises:

acquiring an initial error rate of the initial detection rule and a plurality of error rates of the initial detection rule after each logic connection word is removed;

respectively judging whether the difference value between each error rate and the initial error rate is within a second preset range; when the judgment result is not in the second preset range, taking the logic connection word corresponding to the error rate as the redundant logic connection word;

and removing the redundant logic connection words.

7. The malware detector training method of claim 5, further comprising, prior to constructing the malware detector according to the simplified detection rule:

obtaining evaluation index parameters of the simplified detection rule;

removing redundant rules in the simplified detection rules according to the evaluation index parameters;

the constructing the malware detector according to the simplified detection rule comprises:

and constructing the malicious software detector according to the simplified detection rule after the redundant rule is removed.

8. The malware detector training method of claim 7, wherein the evaluation index parameter comprises one or any combination of the following:

rule frequency of occurrence, rule error rate, and rule length.

9. A malware detector, comprising:

an original sample detection rate obtaining module, configured to obtain an original sample data set and obtain an original malware detection rate of the original sample data set, wherein the original sample data set comprises a plurality of original samples;

a characteristic parameter obtaining module, configured to obtain a characteristic parameter of each original sample, where the characteristic parameter is used to characterize an uncertainty degree that the original sample is malware;

a representative sample detection rate obtaining module, configured to select, according to the characteristic parameters, a representative sample data set whose total sample proportion is α from the original sample data set, and obtain a malware detection rate of the representative sample data set, where α is greater than 0 and smaller than 1, and a difference between the malware detection rate and the original malware detection rate is within a first preset range;

and the detector training module is used for inputting the representative sample data set into a preset training model for training to obtain the malicious software detector.

10. An electronic device, characterized in that the electronic device comprises:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the malware detector training method of any one of claims 1-8.

11. A computer-readable storage medium storing computer instructions for causing a processor to implement the malware detector training method of any one of claims 1-8 when executed.