CN111753299A - Unbalanced malicious software detection method based on packet integration - Google Patents


Info

Publication number
CN111753299A
Authority
CN
China
Prior art keywords
samples
malicious
data set
sample
unbalanced
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010571828.8A
Other languages
Chinese (zh)
Inventor
严海升
李强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Arts and Sciences
Original Assignee
Chongqing University of Arts and Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Arts and Sciences filed Critical Chongqing University of Arts and Sciences
Priority to CN202010571828.8A priority Critical patent/CN111753299A/en
Publication of CN111753299A publication Critical patent/CN111753299A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2113Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/259Fusion by voting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033Test or assess software

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Virology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention belongs to the technical field of information security and relates in particular to an unbalanced malicious software detection method based on packet integration, comprising the following steps: S1, feature extraction: extracting permission information and API call information from experimental samples to form a feature vector set, where the experimental samples comprise normal samples and malicious samples and the normal samples outnumber the malicious samples; S2, feature optimization: screening the feature vector set with an information gain algorithm to remove redundant features and obtain an unbalanced data set; and S3, detecting the unbalanced data set with a grouping integration detection algorithm so as to classify the normal samples and the malicious samples. The method addresses the difficulty of guaranteeing accurate and stable malware detection when the data set is unbalanced.

Description

Unbalanced malicious software detection method based on packet integration
Technical Field
The invention belongs to the technical field of information security, and particularly relates to an unbalanced malicious software detection method based on packet integration.
Background
The Android platform is popular with many mobile phone manufacturers because it is open source; according to the latest IDC statistics, Android phones hold 87% of the market. The same openness makes the platform vulnerable to malicious software, and 97% of discovered mobile malware is related to Android. Attackers use malware to steal users' private information, make unauthorized charges, and so on. The mobile security situation is therefore severe, and malware detection has become a research focus in the field of information security.
Machine learning has been highly successful in information security applications such as spam filtering. Researchers have applied machine learning algorithms to Android malware detection, proposed a number of detection algorithms, and verified their effectiveness on the malware detection problem. One approach provides a lightweight detection scheme based on sensitive permissions: it analyzes how permissions differ across sample types, removes redundant permissions, and finally uses a nearest neighbor classification algorithm to identify malware. Zhang et al. proposed an Android malware detection strategy based on naive Bayes, which identifies malware using permission abuse and the combined use of sensitive permissions as feature attributes. Yang et al. extracted the permission and component intent information of Android software as features and detected malware with a random forest algorithm optimized by weighted voting. Although many malware detection methods have been proposed, most of them assume that the training data contain similar numbers of malicious and normal software. In practice, normal samples can be crawled in bulk from third-party markets, whereas malware samples are costly and difficult to collect, so normal samples far outnumber malicious ones. The resulting imbalance in the training data makes the accuracy and stability of malware detection difficult to guarantee.
Disclosure of Invention
In order to solve the problem of low detection and classification precision of malicious software caused by data imbalance, the invention provides an unbalanced malicious software detection method based on packet integration, and the effectiveness of the method in solving the problem of data imbalance is verified on a real data set.
In order to achieve the purpose, the invention adopts the following technical scheme:
An unbalanced malware detection method based on packet integration comprises the following steps:
S1, feature extraction: extracting permission information and API call information from experimental samples to form a feature vector set; the experimental samples comprise normal samples and malicious samples, and the normal samples outnumber the malicious samples;
S2, feature optimization: screening the feature vector set with an information gain algorithm to remove redundant features and obtain an unbalanced data set;
S3, detecting the unbalanced data set with a grouping integration detection algorithm so as to classify the normal samples and the malicious samples.
Preferably, the step S3 specifically includes the following steps:
S31, randomly extracting three data sets from the unbalanced data set to serve as a training data set, a verification data set and a test data set respectively; the numbers of normal samples and malicious samples in the training data set are denoted b and m respectively;
S32, randomly extracting, without replacement, m samples from the normal samples of the training data set and combining them with the m malicious samples to form a new data set D_i; extracting k times to form k balanced data sets, where k = b/m;
S33, training a decision tree on each data set D_i, giving k decision tree classifiers; testing the classification performance of each decision tree classifier t in turn on the verification data set and recording its recall as r_t; for the decision tree classifier t, if a malicious sample is misclassified, adding the misclassified sample to the training of the next decision tree classifier; k base classifiers are formed in this way;
S34, combining the k base classifiers into an integrated decision tree classifier C by weighted voting;
S35, for each test sample x in the test data set, inputting x into the k base classifiers of the integrated decision tree classifier C and calculating the weighted voting result of the k base classifiers, with the calculation formula:
$$C(x) = \arg\max_{c} \sum_{t=1}^{k} r_t \, T_{c,x}(x)$$
where r_t is the recall of decision tree classifier t;
T_{c,x}(x) is defined as follows:
$$T_{c,x}(x) = \begin{cases} 1, & \text{if decision tree classifier } t \text{ assigns } x \text{ to class } c \\ 0, & \text{if decision tree classifier } t \text{ assigns } x \text{ to class non-}c \end{cases}$$
when counting the votes for the malicious class, class c is the malicious sample and class non-c is the normal sample;
when counting the votes for the normal class, class c is the normal sample and class non-c is the malicious sample;
the total votes for the malicious class and for the normal class are calculated, and the class with the most votes is selected as the final class of sample x.
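The sequential training in S33 can be sketched roughly as follows. This is only one reading of the step, assuming that the misclassified malicious samples come from the verification data set and are appended to the next classifier's training data. The names `train_base_classifiers` and `fit` are hypothetical; `fit` stands in for any decision tree learner that takes a list of (x, label) pairs and returns a prediction function.

```python
def train_base_classifiers(groups, validation, fit):
    """Train one base classifier per balanced group (sketch of S33).

    groups:     list of balanced data sets, each a list of (x, label) pairs,
                with label 1 = malicious and 0 = normal.
    validation: list of (x, label) pairs; assumed to contain malicious samples.
    fit:        hypothetical learner: fit(dataset) -> predict(x) -> 0 or 1.
    """
    malicious_val = [(x, y) for x, y in validation if y == 1]
    classifiers, recalls = [], []
    carried = []  # malicious samples missed by the previous classifier
    for group in groups:
        predict = fit(list(group) + carried)
        missed = [(x, y) for x, y in malicious_val if predict(x) != 1]
        # Recall r_t on the verification set: share of malicious samples caught.
        recalls.append(1 - len(missed) / len(malicious_val))
        carried = missed  # passed on to the next classifier's training data
        classifiers.append(predict)
    return classifiers, recalls
```

The returned recalls are the weights r_t used later in the weighted vote.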
Preferably, the step S1 specifically includes the following steps:
S11, writing a Python program to read the permission and API call information in the experimental samples to form a feature set;
S12, de-duplicating the feature set to form a new feature set FS;
S13, for every sample, checking whether it contains each element of the new feature set FS; if the sample contains the feature in FS, the corresponding element of its feature vector is 1; otherwise the corresponding element is 0; traversing all samples forms the feature vector set FVS.
Preferably, the step S1 further includes:
and adding a flag bit at the end of each feature vector, wherein 0 represents a normal sample, and 1 represents a malicious sample.
Preferably, the proportion of the training data set in the unbalanced data set is greater than 50%.
Preferably, the proportion of the training data set in the unbalanced data set is 60%, the proportion of the verification data set in the unbalanced data set is 20%, and the proportion of the test data set in the unbalanced data set is 20%.
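A minimal sketch of the 60%/20%/20% split is shown below. It assumes the split is applied separately to the normal samples and to the malicious samples, so that each part keeps the original class ratio; the function name `split_60_20_20` and the `seed` parameter are illustrative and not taken from the patent.

```python
import random

def split_60_20_20(samples, seed=0):
    """Shuffle one class's samples and split them 60/20/20."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10   # 60% for training
    n_val = len(shuffled) * 2 // 10     # 20% for verification
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])  # remainder for testing
```

Calling it once on the normal samples and once on the malicious samples, then concatenating the matching parts, gives the three data sets.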
Preferably, in step S2, the information gain algorithm calculates a difference between the entropy value of the feature and the conditional entropy thereof to obtain an IG value of the feature, and the larger the IG value is, the more important the feature is.
Preferably, in step S2, two indexes, recall and G-mean, are used as the metrics for screening, as follows:
if a sample is predicted to be malicious and is actually malicious, the prediction is correct, and the number of such samples is recorded as TP;
if a sample is predicted to be normal but is actually malicious, the prediction is wrong, and the number of such samples is recorded as FN;
if a sample is predicted to be malicious but is actually normal, the prediction is wrong, and the number of such samples is recorded as FP;
if a sample is predicted to be normal and is actually normal, the prediction is correct, and the number of such samples is recorded as TN;
$$\text{recall} = \frac{TP}{TP + FN}$$
$$\text{G-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}$$
compared with the prior art, the invention has the beneficial effects that:
the unbalanced malicious software detection method based on the grouping integration utilizes the grouping integration detection algorithm, namely, the normal software samples are divided into a plurality of groups through a random sampling technology, then the normal software samples in each group and all the malicious software samples are used for training a classification model, and finally all the classification models are fused through the integration method, so that the defects that the accuracy and the stability of malicious software detection are difficult to guarantee due to the unbalanced data set are overcome.
Drawings
FIG. 1 is a block flow diagram of an unbalanced malware detection method based on packet integration according to an embodiment of the present invention;
FIG. 2 is a flow chart of a packet integration detection algorithm of an embodiment of the present invention;
FIG. 3 shows the usage ranking of selected permission information in the experimental samples according to an embodiment of the present invention;
FIG. 4 shows the usage ranking of selected API call information in the experimental samples according to an embodiment of the present invention;
FIG. 5 shows simulation results of the detection method of an embodiment of the present invention for different numbers of features in the feature subset.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention, the following description will explain the embodiments of the present invention with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
The invention provides an unbalanced malicious software detection method based on grouping integration for solving the problem of low detection and classification precision of malicious software caused by data imbalance, and verifies the effectiveness of the method in solving the unbalanced problem on a real data set.
Specifically, as shown in fig. 1, the unbalanced malware detection method based on packet integration according to the embodiment of the present invention is implemented by a feature extraction module, a feature optimization module and a classification detection module. The feature extraction module uses a Python program to extract permission and API call information from the experimental samples and form a feature vector set. The feature optimization module addresses feature redundancy: it removes redundant features with an information gain algorithm, which prevents overfitting and improves detection efficiency. The classification detection module uses a grouping integration detection algorithm to detect unbalanced data sets, addressing the bias of traditional classifiers toward the majority class.
As shown in fig. 2, the packet integration-based unbalanced malware detection method according to the embodiment of the present invention specifically includes:
S1, feature extraction: extracting permission information and API call information from the experimental samples to form a feature vector set; the experimental samples comprise normal samples and malicious samples, and the normal samples outnumber the malicious samples.
The method specifically comprises the following steps:
Step 1: writing a Python program to read the permission and API call information in the samples to form a feature set; custom Android permissions belong only to specific samples and are not counted.
Step 2: de-duplicating the features extracted in step 1 to form a new feature set FS.
Step 3: for every sample, checking whether it contains each element of FS; if the sample contains the feature in FS, the corresponding element of its feature vector is 1; otherwise the corresponding element is 0. Traversing all samples forms the feature vector set FVS. In addition, a flag bit is appended to the end of each feature vector, where 0 denotes a normal sample and 1 denotes a malicious sample.
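Steps 2 and 3 can be illustrated with a short Python sketch. It assumes that step 1 has already produced, for each sample, a list of permission and API call names; the function name `build_feature_vectors` and the example permission strings are illustrative only.

```python
def build_feature_vectors(samples):
    """Build FS and FVS from already-extracted features.

    samples: list of (features, is_malicious) pairs, where features is an
             iterable of permission / API call names for one application.
    """
    # Step 2: de-duplicate every observed feature into the feature set FS.
    fs = sorted({f for feats, _ in samples for f in feats})
    # Step 3: one 0/1 element per feature in FS, plus the flag bit at the end.
    fvs = []
    for feats, is_malicious in samples:
        present = set(feats)
        vector = [1 if f in present else 0 for f in fs]
        vector.append(1 if is_malicious else 0)  # 0 = normal, 1 = malicious
        fvs.append(vector)
    return fs, fvs

samples = [
    (["android.permission.INTERNET", "android.permission.SEND_SMS"], True),
    (["android.permission.INTERNET"], False),
]
fs, fvs = build_feature_vectors(samples)
```

Sorting FS is a choice made here only so that the column order is reproducible; the source does not prescribe an order.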
S2, feature optimization: and screening the characteristic vector set by adopting an information gain algorithm to remove redundant characteristics and obtain an unbalanced data set.
Android software involves many permission features and API call features, so the feature space is high-dimensional and prone to the curse of dimensionality. To reduce the influence of redundant features on detection and to improve detection efficiency, the embodiment of the invention uses an information gain (IG) algorithm to select, from the original feature space, the features with high class-discriminating ability.
(1) Information gain based feature correlation analysis
In information theory, information gain measures the difference between two probability distributions. Applied to feature selection, it measures how much a feature contributes to classification: the algorithm obtains the IG value of a feature as the difference between the entropy and the conditional entropy given the feature, and the larger the value, the higher the correlation.
The calculation formulas are as follows:
$$IG(t) = H(C) - H(C \mid t)$$
$$H(C) = -\sum_{i=1}^{m} P(C_i) \log_2 P(C_i)$$
$$H(C \mid t) = -P(t) \sum_{i=1}^{m} P(C_i \mid t) \log_2 P(C_i \mid t) - P(\bar{t}) \sum_{i=1}^{m} P(C_i \mid \bar{t}) \log_2 P(C_i \mid \bar{t})$$
where m is the total number of classes; P(t) is the probability that feature t occurs; P(C_i) is the probability that class C_i occurs; P(C_i | t) represents the contribution of feature t to class C_i; $P(\bar{t})$ is the probability that the extracted features do not include feature t; and $P(C_i \mid \bar{t})$ is the conditional probability that a training sample belongs to class C_i when its extracted features do not include feature t.
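As an illustration, the information gain of one binary feature can be computed as in the sketch below. It follows the standard definition (class entropy minus conditional entropy given the feature, in bits) and assumes each feature is a column of 0/1 values as produced in S1; the function names are illustrative.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero probabilities contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def information_gain(feature_column, labels):
    """IG of one 0/1 feature with respect to the class labels."""
    n = len(labels)
    classes = sorted(set(labels))
    h_c = entropy([labels.count(c) / n for c in classes])
    h_c_given_t = 0.0
    for value in (1, 0):  # feature present (t), then feature absent (not t)
        subset = [y for v, y in zip(feature_column, labels) if v == value]
        if not subset:
            continue
        weight = len(subset) / n  # P(t) or P(not t)
        h_c_given_t += weight * entropy(
            [subset.count(c) / len(subset) for c in classes])
    return h_c - h_c_given_t
```

A feature that separates the two classes perfectly in a balanced sample has an IG of 1 bit, while a feature that is independent of the class has an IG of 0; ranking features by this value and keeping the top ones gives the screened subset.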
(2) Evaluation and selection of feature subsets
Two indexes, recall and G-mean, are used as the evaluation metrics. Both can be calculated from the confusion matrix shown in Table 1.
TABLE 1 Confusion matrix
                      Predicted positive    Predicted negative
Actually positive     TP                    FN
Actually negative     FP                    TN
Recall and G-mean are calculated by the following formulas.
$$\text{recall} = \frac{TP}{TP + FN}$$
$$\text{G-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}$$
Recall is the proportion of malicious samples that are predicted correctly. G-mean reflects the overall performance of the classifier on both positive and negative samples, and its value depends on the detection rates of both malware and normal software. The screening of the feature vector set is therefore adjusted according to the final detection rates of malware and normal software.
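The two metrics can be checked numerically with a few lines of Python. The sketch below uses the standard definitions, with malicious samples as the positive class; the counts passed in are the ones reported for the proposed algorithm in Table 3 of this document, and the function name is illustrative.

```python
import math

def recall_and_g_mean(tp, fn, fp, tn):
    """Recall on the positive (malicious) class and the G-mean of both classes."""
    recall = tp / (tp + fn)        # detection rate on malicious samples
    specificity = tn / (tn + fp)   # detection rate on normal samples
    return recall, math.sqrt(recall * specificity)

recall, g_mean = recall_and_g_mean(tp=1035, fn=77, fp=938, tn=24659)
print(round(recall, 3), round(g_mean, 3))
```

Because G-mean multiplies the two detection rates, a classifier that ignores the minority class scores close to zero even when its overall accuracy is high.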
And S3, detecting the unbalanced data set by utilizing a grouping integration detection algorithm so as to classify the normal samples and the malicious samples.
Most traditional machine learning algorithms assume balanced samples. On an unbalanced data set their classification results are biased toward the majority class and prediction for the minority class is poor, yet the minority-class prediction matters more: if a malicious sample is predicted to be normal, the user suffers a loss. The invention therefore provides an unbalanced malware detection algorithm based on packet integration for unbalanced data sets, with a decision tree as the base classifier.
(1) Data grouping
The unbalanced data set is split into a training data set, a verification data set and a test data set, and k balanced data sets are constructed from the training data set by random sampling. Then k decision tree classifiers are trained on the balanced data sets; the verification set is used to test the classification performance of each decision tree base classifier, and the malicious data it misclassifies are added to the training of the next decision tree base classifier. Finally, the k decision tree base classifiers are combined by weighted voting into a decision tree integrated classifier C, whose output is taken as the classification result of a test sample during classification detection.
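The construction of the k balanced data sets can be sketched as follows. The sketch assumes that k = b/m is rounded down and that the normal samples are partitioned, so no normal sample appears in two groups; the function name `make_balanced_groups` and the `seed` parameter are illustrative.

```python
import random

def make_balanced_groups(normal, malicious, seed=0):
    """Build k balanced data sets from b normal and m malicious samples.

    normal, malicious: lists of training samples; malicious must be non-empty.
    """
    rng = random.Random(seed)
    m = len(malicious)
    k = len(normal) // m   # k = b/m, rounded down (an assumption)
    pool = list(normal)
    rng.shuffle(pool)      # shuffle once, then slice: sampling without replacement
    groups = []
    for i in range(k):
        chunk = pool[i * m:(i + 1) * m]          # m normal samples
        groups.append(chunk + list(malicious))   # plus all m malicious samples
    return groups
```

Every group reuses all malicious samples, so each base classifier sees a balanced data set while the ensemble as a whole still covers almost all of the normal samples.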
The embodiment of the invention uses the decision tree as the base classifier because the decision tree is a weak classifier, and as a base classifier in an ensemble it gives better classification performance than stronger classifiers.
(2) Ensemble learning
After the data are grouped and the base classifiers are trained, their classifications are combined by weighted voting.
Specifically, the flow of the packet integration detection algorithm is shown in fig. 2, and the detailed steps are as follows:
Step 1: randomly extracting 60% of the samples as the training data set, 20% as the verification data set and 20% as the test data set, from the normal data set and the malicious data set respectively (together, the unbalanced data set); in the training data set, the numbers of normal samples and malicious samples are denoted b and m respectively.
Step 2: randomly extracting, without replacement, m samples from the normal training samples and combining them with the m malicious samples to form a new data set D_i; extracting k times in total, where k = b/m, to form k balanced data sets.
Step 3: training a decision tree on each data set D_i, giving k decision tree classifiers in total; testing the classification performance of decision tree classifier t on the verification data set and recording its recall as r_t. For decision tree classifier t, if a malicious sample is misclassified, the misclassified sample is added to the training of the next decision tree classifier. This yields k base classifiers in total, which are combined into an integrated decision tree classifier C by weighted voting.
Step 4: for each test sample x in the test data set, inputting x into the k base classifiers of the integrated decision tree classifier C and calculating the weighted voting result of the k base classifiers, with the formula:
$$C(x) = \arg\max_{c} \sum_{t=1}^{k} r_t \, T_{c,x}(x)$$
where r_t is the recall of decision tree classifier t;
T_{c,x}(x) is defined as follows:
$$T_{c,x}(x) = \begin{cases} 1, & \text{if decision tree classifier } t \text{ assigns } x \text{ to class } c \\ 0, & \text{if decision tree classifier } t \text{ assigns } x \text{ to class non-}c \end{cases}$$
When counting the votes for the malicious class, class c is the malicious sample and class non-c is the normal sample; when counting the votes for the normal class, class c is the normal sample and class non-c is the malicious sample. That is, when counting the votes for the normal class, T_{c,x}(x) = 1 if the base classifier judges sample x to be normal, and T_{c,x}(x) = 0 if it judges sample x to be malicious. When counting the votes for the malicious class, T_{c,x}(x) = 1 if the base classifier judges sample x to be malicious, and T_{c,x}(x) = 0 if it judges sample x to be normal.
The total votes for the malicious class and for the normal class are calculated, and the class with the most votes is selected as the final class of sample x.
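The recall-weighted vote of step 4 can be sketched in a few lines. The sketch assumes each base classifier is a function returning 0 (normal) or 1 (malicious); the name `weighted_vote` is illustrative, and resolving a tie in favour of the normal class is an assumption, since the source does not say how ties are handled.

```python
def weighted_vote(classifiers, recalls, x):
    """Return 1 (malicious) or 0 (normal) for sample x by recall-weighted voting.

    classifiers: list of predict(x) -> 0 or 1 functions (the k base classifiers).
    recalls:     list of weights r_t, one per classifier.
    """
    votes = {0: 0.0, 1: 0.0}
    for predict, r_t in zip(classifiers, recalls):
        # The indicator T is 1 only for the class this classifier chose,
        # so its weight r_t is added to that class's total.
        votes[predict(x)] += r_t
    return max(votes, key=votes.get)  # ties fall to class 0 (an assumption)
```

Weighting by recall gives more influence to the base classifiers that detected more of the malicious samples on the verification set.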
The packet integration-based unbalanced malware detection method of the embodiment of the invention was applied as follows:
The experimental data come from the Drebin website and contain 123453 normal samples and 5560 malicious samples; each sample has 545 features. The usage ranking of selected permission information and API call information in the malicious samples and the normal samples is shown in figs. 3 and 4.
The IG value of each feature was calculated by the information gain algorithm; the larger the IG value, the more important the feature. The top-ranked feature attributes are shown in Table 2.
TABLE 2 Features and their corresponding IG values
[Table 2 is an image in the source; its contents are not reproduced here.]
After screening, an optimal feature subset (i.e., the unbalanced data set) is obtained, and the packet integration detection algorithm provided by the invention is simulated with different numbers of features in the subset; the simulation result is shown in fig. 5.
On the experimental data set, with the top 70 attributes ranked by the IG algorithm as input features, the classification detection strategy of the invention was compared with the kNN, SVM and RF algorithms provided by the sklearn package in Python; the results are shown in Table 3.
TABLE 3 Comparison of the packet integration detection algorithm of the invention with other existing algorithms
Metric    kNN      SVM      RF       Algorithm of the invention
TP        877      855      940      1035
TN        25511    25554    25525    24659
FP        86       43       72       938
FN        235      257      172      77
recall    0.788    0.768    0.846    0.931
G-mean    0.866    0.866    0.918    0.947
The foregoing has outlined rather broadly the preferred embodiments and principles of the present invention and it will be appreciated that those skilled in the art may devise variations of the present invention that are within the spirit and scope of the appended claims.

Claims (8)

1. An unbalanced malware detection method based on packet integration is characterized by comprising the following steps:
S1, feature extraction: extracting permission information and API call information from experimental samples to form a feature vector set; the experimental samples comprise normal samples and malicious samples, and the normal samples outnumber the malicious samples;
S2, feature optimization: screening the feature vector set with an information gain algorithm to remove redundant features and obtain an unbalanced data set;
S3, detecting the unbalanced data set with a grouping integration detection algorithm so as to classify the normal samples and the malicious samples.
2. The unbalanced malware detection method based on packet integration as claimed in claim 1, wherein the step S3 specifically comprises the following steps:
S31, randomly extracting three data sets from the unbalanced data set to serve as a training data set, a verification data set and a test data set respectively; the numbers of normal samples and malicious samples in the training data set are denoted b and m respectively;
S32, randomly extracting, without replacement, m samples from the normal samples of the training data set and combining them with the m malicious samples to form a new data set D_i; extracting k times to form k balanced data sets, where k = b/m;
S33, training a decision tree on each data set D_i, giving k decision tree classifiers; testing the classification performance of each decision tree classifier t in turn on the verification data set and recording its recall as r_t; for the decision tree classifier t, if a malicious sample is misclassified, adding the misclassified sample to the training of the next decision tree classifier; k base classifiers are formed in this way;
S34, combining the k base classifiers into an integrated decision tree classifier C by weighted voting;
S35, for each test sample x in the test data set, inputting x into the k base classifiers of the integrated decision tree classifier C and calculating the weighted voting result of the k base classifiers, with the calculation formula:
$$C(x) = \arg\max_{c} \sum_{t=1}^{k} r_t \, T_{c,x}(x)$$
where r_t is the recall of decision tree classifier t;
T_{c,x}(x) is defined as follows:
$$T_{c,x}(x) = \begin{cases} 1, & \text{if decision tree classifier } t \text{ assigns } x \text{ to class } c \\ 0, & \text{if decision tree classifier } t \text{ assigns } x \text{ to class non-}c \end{cases}$$
when counting the votes for the malicious class, class c is the malicious sample and class non-c is the normal sample;
when counting the votes for the normal class, class c is the normal sample and class non-c is the malicious sample;
the total votes for the malicious class and for the normal class are calculated, and the class with the most votes is selected as the final class of sample x.
3. The unbalanced malware detection method based on packet integration as claimed in claim 2, wherein the step S1 specifically comprises the following steps:
S11, writing a Python program to read the permission and API call information in the experimental samples to form a feature set;
S12, de-duplicating the feature set to form a new feature set FS;
S13, for every sample, checking whether it contains each element of the new feature set FS; if the sample contains the feature in FS, the corresponding element of its feature vector is 1; otherwise the corresponding element is 0; traversing all samples forms the feature vector set FVS.
4. The unbalanced malware detection method based on packet integration as claimed in claim 3, wherein the step S1 further comprises:
adding a flag bit at the end of each feature vector, wherein 0 represents a normal sample and 1 represents a malicious sample.
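The feature-vector construction of steps S11 to S13, together with the trailing flag bit, might look like the following Python sketch. The function names and the feature strings (`perm:INTERNET`, `api:sendTextMessage`, and so on) are hypothetical placeholders; the actual extraction of permissions and API calls from a sample is not shown, and sorting FS is only an assumption made to give the vector a fixed element order.

```python
def build_feature_set(sample_features):
    """Collect every permission / API name seen in any sample and de-duplicate it (S11-S12)."""
    return sorted({f for feats in sample_features for f in feats})


def to_feature_vector(feats, feature_set, is_malicious):
    """Encode one sample as 0/1 presence flags over FS, plus a trailing flag bit (S13)."""
    present = set(feats)
    vector = [1 if f in present else 0 for f in feature_set]
    vector.append(1 if is_malicious else 0)  # 1 = malicious, 0 = normal
    return vector


# Hypothetical extracted features for two samples: (features, is_malicious).
samples = [
    (["perm:INTERNET", "perm:SEND_SMS", "api:sendTextMessage", "perm:INTERNET"], True),
    (["perm:INTERNET"], False),
]
FS = build_feature_set(feats for feats, _ in samples)
FVS = [to_feature_vector(feats, FS, label) for feats, label in samples]
```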
5. The method according to claim 2, wherein the proportion of the training data set in the unbalanced data set is greater than 50%.
6. The method according to claim 2, wherein the proportion of the training data set in the unbalanced data set is 60%, the proportion of the verification data set in the unbalanced data set is 20%, and the proportion of the test data set in the unbalanced data set is 20%.
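A 60/20/20 division into training, verification, and test data sets can be sketched in Python as follows. The claims do not say how samples are assigned to each subset, so the random shuffle, the fixed seed, and the helper name `split_dataset` are all assumptions made for illustration.

```python
import random


def split_dataset(samples, train=0.6, val=0.2, seed=0):
    """Shuffle and split into training / verification / test subsets (60/20/20 by default)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# 100 placeholder sample ids stand in for the unbalanced data set.
train_set, val_set, test_set = split_dataset(range(100))
```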
7. The method as claimed in claim 1, wherein in step S2 the information gain algorithm calculates, for each feature, the difference between the entropy and the conditional entropy to obtain the IG value of that feature, and the larger the IG value, the more important the feature.
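The information gain described here can be illustrated with a short Python sketch. It assumes the usual definition IG = H(Y) - H(Y | X), where Y is the class label and X is one discrete (here 0/1) feature; the function names are hypothetical and the exact variant used in the patent is not stated.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(Y) of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_column, labels):
    """IG = H(Y) - H(Y | X) for one discrete feature X."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_column):
        subset = [y for x, y in zip(feature_column, labels) if x == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


labels = [1, 1, 0, 0]
ig_perfect = information_gain([1, 1, 0, 0], labels)  # feature separates the classes
ig_useless = information_gain([1, 0, 1, 0], labels)  # feature is independent of the class
```

A feature that perfectly separates the two classes reaches the maximum IG for this label distribution (1 bit), while a feature that is independent of the class has an IG of 0, which is why ranking by IG favours the more discriminative features.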
8. The unbalanced malware detection method based on packet integration as claimed in claim 1, wherein in step S2 two indexes, the recall rate (Recall) and the G-mean, are used as metrics for screening, specifically as follows:
if a sample is predicted to be malicious and is actually malicious, the number of such correctly predicted malicious samples is recorded as TP;
if a sample is predicted to be normal but is actually malicious, the number of such wrongly predicted malicious samples is recorded as FN;
if a sample is predicted to be malicious but is actually normal, the number of such wrongly predicted normal samples is recorded as FP;
if a sample is predicted to be normal and is actually normal, the number of such correctly predicted normal samples is recorded as TN;
Figure FDA0002549596440000031
Figure FDA0002549596440000032
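The two formulas are given only as figures. Assuming the commonly used definitions, with malicious as the positive class, recall is the fraction of malicious samples that are correctly detected, and G-mean is the geometric mean of that recall and the fraction of normal samples that are correctly identified. The following Python sketch computes both under that assumption; the function name is hypothetical and the counts are taken directly from true and predicted labels.

```python
import math


def recall_and_gmean(y_true, y_pred, malicious=1):
    """Recall on the malicious class and G-mean = sqrt(recall * specificity)."""
    mal_total = sum(1 for t in y_true if t == malicious)
    norm_total = len(y_true) - mal_total
    mal_hit = sum(1 for t, p in zip(y_true, y_pred) if t == malicious and p == malicious)
    norm_hit = sum(1 for t, p in zip(y_true, y_pred) if t != malicious and p != malicious)
    recall = mal_hit / mal_total          # assumes at least one malicious sample
    specificity = norm_hit / norm_total   # assumes at least one normal sample
    return recall, math.sqrt(recall * specificity)


# Toy labels: 2 malicious and 4 normal samples, one error in each class.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
recall, g_mean = recall_and_gmean(y_true, y_pred)
```

Because G-mean multiplies the per-class rates, a classifier that labels everything as normal scores 0 even though its plain accuracy on an unbalanced data set would look high.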

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010571828.8A CN111753299A (en) 2020-06-22 2020-06-22 Unbalanced malicious software detection method based on packet integration

Publications (1)

Publication Number Publication Date
CN111753299A (en) 2020-10-09

Family

ID=72675578

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010571828.8A Withdrawn CN111753299A (en) 2020-06-22 2020-06-22 Unbalanced malicious software detection method based on packet integration

Country Status (1)

Country Link
CN (1) CN111753299A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560435A (en) * 2020-12-18 2021-03-26 北京声智科技有限公司 Text corpus processing method, device, equipment and storage medium
CN112560435B (en) * 2020-12-18 2022-03-11 北京声智科技有限公司 Text corpus processing method, device, equipment and storage medium
CN112764791A (en) * 2021-01-25 2021-05-07 济南大学 Incremental updating malicious software detection method and system
CN112764791B (en) * 2021-01-25 2023-08-08 济南大学 Incremental update malicious software detection method and system
CN112800426A (en) * 2021-02-09 2021-05-14 北京工业大学 Malicious code data unbalanced processing method based on group intelligent algorithm and cGAN
CN112800426B (en) * 2021-02-09 2024-03-22 北京工业大学 Malicious code data unbalanced processing method based on group intelligent algorithm and cGAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20201009