CN111835707B - Malicious program identification method based on improved support vector machine - Google Patents

Publication number
CN111835707B
Authority
CN
China
Prior art keywords
feature
data
classification
algorithm
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010459366.0A
Other languages
Chinese (zh)
Other versions
CN111835707A (en)
Inventor
陈锦富
殷上
张祖法
黄如兵
杨健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN202010459366.0A priority Critical patent/CN111835707B/en
Publication of CN111835707A publication Critical patent/CN111835707A/en
Application granted granted Critical
Publication of CN111835707B publication Critical patent/CN111835707B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441 Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/145 Countermeasures against malicious traffic the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a malicious program identification method based on an improved support vector machine, comprising the following steps: collecting network traffic data through NetFlow and normalizing the collected data packets; extracting features, which is required to identify malicious programs; reducing the dimensionality of the feature attributes to eliminate redundant features, and normalizing the result; performing classification training with the OFSVM algorithm; and finally constructing a network traffic identification model with the NTMI identification algorithm, thereby identifying malicious programs in the network traffic.

Description

Malicious program identification method based on improved support vector machine
Technical Field
The invention belongs to the field of malicious program detection in network flow, and relates to a malicious program identification method based on an improved support vector machine.
Background
As the population grows, networks expand day by day and network traffic fills with all kinds of complex data. Some attackers seeking illicit gain exploit vulnerabilities in the network, causing leaks of important information and unauthorized access; in more serious cases enterprise systems are paralyzed, which causes great trouble in people's lives.
Within this huge volume of network traffic, malicious attackers can release phishing websites or worm viruses to steal users' important information and then exploit vulnerabilities to turn normal programs into malicious ones, so that the user's host is controlled or crashed by a hacker, causing huge economic losses and disturbing social order.
Before malicious programs can be detected, network traffic must be classified and identified; once harmful malicious programs have been separated out, those aimed at buffer overflows are examined further. As technology develops, classification and identification techniques have diversified, and each existing method has advantages and disadvantages. Teufl et al. proposed a framework that simplifies empirical model selection and feature extraction: it analyzes network traffic to check whether the data violate given rules and extracts an optimal feature set from which a traffic classification model is built, thereby classifying and identifying network traffic. Shrivastav et al. analyzed and implemented a semi-supervised network traffic classification method: the training data consist of labeled and unlabeled flows containing both attack and normal data, and the labeled data are divided into clusters for classification and identification; compared with an SVM-based classifier, experiments showed that the method has better accuracy. After analyzing a large amount of network data, Yang et al. found that application-layer parameters, such as payload size and the information entropy of each packet, differ between protocols; they trained and classified with a decision tree algorithm based on minimum partition distance, and experiments showed that intercepting only the first four or six data packets shortens classification time while keeping accuracy high.
These techniques scan for malicious attack behavior that may appear in the network and analyze the resulting data, so their latency is high, and the final classification and identification results differ considerably from the expected results. A malicious program identification method based on an improved support vector machine is therefore of great significance.
Disclosure of Invention
In the prior art, both the detection accuracy for malicious programs in network traffic and the classification accuracy are low. To solve these problems, the invention provides a malicious program identification method based on an improved support vector machine.
The invention provides a malicious program identification method based on an improved support vector machine, which comprises the following steps:
step 1, acquiring network traffic data through NetFlow and standardizing the acquired data packets;
step 2, extracting features, which is required to identify malicious programs;
step 3, reducing the dimensionality of the feature attributes to eliminate redundant features, and normalizing the result;
step 4, performing classification training with the OFSVM algorithm;
step 5, constructing a network traffic identification model with the NTMI identification algorithm, thereby identifying malicious programs in the network traffic.
In a first aspect, the step 2 specifically includes:
The correlation between sample class and feature attribute is evaluated on the processed data set: the higher the correlation, the larger the weight becomes. A threshold is then set; a feature attribute whose weight exceeds the threshold is retained, otherwise it is not selected. In addition, if several feature attributes are found for a data packet during extraction, the one that occurs most frequently is used in its place. The specific feature selection process is as follows: some samples s are drawn from the data set D by stratified random sampling; the y samples r nearest to s are then selected from the same class $D_a$, and y samples t are selected from a different class $D_b$; finally, the distances $D_{sr}$ and $D_{st}$ between s and the samples r and t are calculated. If $D_{sr} > D_{st}$, the feature attribute is problematic and cannot be used for classification, so it is given a smaller weight; otherwise the feature attribute separates the classes well and is given a larger weight.
In a second aspect, the step 3 specifically includes:
First, the extracted feature attributes are added to a set S. Building on earlier methods, a Filter feature dimension reduction method is proposed: the information gain of the feature attribute set S is evaluated with an information gain algorithm, and the effect of each feature attribute on subsequent classification determines whether the evaluation value and the set S are updated. The feature attributes are then ranked with a heuristic search strategy to obtain a feature attribute set S1, and this process is repeated until a specified number of iterations is reached. On this basis, a Wrapper method performs a second round of feature selection using heuristic sequential forward search to obtain a feature attribute set S2. Feature dimension reduction not only shortens the running time and reduces computational complexity but also improves the classification effect.
In a third aspect, the OFSVM algorithm includes:
For parameter optimization, the optimal parameter combination is sought within a limited search, and the SVM algorithm is improved with grid-search parameter optimization. The fuzzy factor is derived from distance: for each sample point $s_i$ there is a corresponding fuzzy factor $e_i$ representing the uncertainty of the sample distribution, where $0 \le e_i \le 1$. With $R^{+}$ and $R^{-}$ denoting the mean points of the positive and negative samples, the normal vector can be expressed as
Figure BDA0002510453720000031
and the corresponding hyperplane can be expressed as $(s-R)^{2}\cos\alpha^{T} = 0$, so that the distance from a sample point to the hyperplane is
Figure BDA0002510453720000032
The maximum distance $d_1$ from a positive sample point to the hyperplane is obtained if and only if $R = R^{+}$; likewise, when $R = R^{-}$, $d_2$ is the maximum distance from a negative sample point to the hyperplane. The adjustment factor
Figure BDA0002510453720000034
is then used to ensure $0 < e_i \le 1$, which gives the fuzzy factor
Figure BDA0002510453720000033
where d takes the value $d_1$ or $d_2$ for positive or negative samples respectively. Feature validity is then introduced to eliminate the influence of redundant features on classification accuracy, and finally the classifier model is generated with the experimentally verified radial basis kernel function.
In a fourth aspect, the NTMI identification algorithm specifically includes: the collected network traffic data are sampled and normalized to obtain a data set that is more valuable for the experiment and from which features are easier to extract. Features of the data packets in the network traffic are then extracted with the ReliefF algorithm. The extracted features still contain redundant attributes that greatly reduce the accuracy of network traffic classification, so the extracted feature set is reduced in dimension: the features are evaluated with information gain, the feature set is ranked, and a second round of feature selection using heuristic sequential forward search computes the correlation between features to complete the reduction. The resulting feature subset is then normalized: all feature attributes are converted to numerical values and placed in a matrix, the minimum Euclidean distance is calculated, and training with the OFSVM algorithm yields a classifier with good classification performance. The remaining network traffic test set is used as input, the classifier separates normal programs from malicious programs, and malicious programs in the network traffic are thereby identified.
The invention has the beneficial effects that:
1. The OFSVM algorithm improves the classification accuracy of network traffic: grid search widens the search range; a fuzzy factor designed from the distance between a sample and the classification hyperplane reduces the influence of the shape of the classification plane on accuracy; feature weights are measured by feature validity; and a radial basis kernel function reduces complexity, which together improve classification training performance.
2. The NTMI identification algorithm applies feature extraction, feature dimension reduction and normalization to the collected data packets and feeds the result to the OFSVM classification algorithm, producing a classifier with better classification performance; a malicious program identification model for network traffic is then constructed and the identification is completed.
3. The relevant data traffic is collected effectively from the network traffic, enabling real-time monitoring; features of the data packets are extracted; redundant features are handled by feature dimension reduction, which improves classification performance; normalization makes the features easier to process and better suited as classifier input; the OFSVM algorithm performs the classification training for malicious programs; the NTMI algorithm identifies whether malicious programs are present in the network traffic. Experimental results show that the method is effective to a certain degree at identifying malicious programs in network traffic and helps ensure network security.
Drawings
FIG. 1 is a flow diagram of feature dimension reduction;
FIG. 2 is a flow chart of the malicious program identification method based on the improved support vector machine of the invention;
FIG. 3 is a flow diagram of a malware identification model in network traffic;
FIG. 4 is a schematic diagram of feature attributes after feature extraction;
FIG. 5 is a diagram of feature attributes after feature dimensionality reduction;
FIG. 6 is a graph comparing accuracy on CAIDA for five methods;
FIG. 7 is a graph comparing false alarm rates on CAIDA for five methods.
Detailed Description
The invention will be further elucidated by means of the figures and the specific steps.
The invention aims to provide a malicious program identification method based on an improved support vector machine that targets malicious programs exploiting vulnerabilities in network traffic. It identifies such programs effectively, proposes the NTMI identification algorithm, and is supported by extensive experiments that demonstrate the feasibility and effectiveness of the method.
As shown in fig. 2, the method for identifying malicious programs based on an improved support vector machine of the present invention includes:
step 201, acquiring network traffic data through NetFlow and normalizing the acquired data packets;
step 202, extracting features, which is required to identify malicious programs;
step 203, reducing the dimensionality of the feature attributes to eliminate redundant features, and normalizing the result;
step 204, performing classification training with the OFSVM algorithm;
step 205, constructing a network traffic identification model with the NTMI identification algorithm, thereby identifying malicious programs in the network traffic.
In the step 201, the specific steps are as follows:
(1) Data acquisition
First, network traffic data must be collected with NetFlow. The tool can also analyze network traffic to help troubleshoot network faults, but its efficiency at identifying malicious programs that attackers write for many vulnerability types is low. In addition, the network equipment must support NetFlow, and users must distinguish normal traffic from malicious traffic themselves.
(2) Data normalization
Before the collected network traffic packets are normalized, data sampling is performed to select a better data set. Data sampling selects part of the whole experimental data set as a subset and examines it; because the subset shares the characteristics of the original set, it allows the whole network traffic data set to be judged well. The main sampling modes are systematic sampling, random sampling and stratified sampling. Systematic sampling sorts the original samples and then, starting from the beginning, draws a specified number of samples at fixed intervals; random sampling selects samples at random from the whole sample set; stratified sampling first divides the whole sample set into layers according to a specified rule and then draws samples at random within each layer. Stratified sampling is adopted here to assess the quality of the entire data set.
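As an illustration of the stratified sampling described above, the following minimal sketch groups flow records by label and draws the same fraction from each layer. The record layout, the `label_of` callback and the 20% fraction are assumptions made for the example, not values taken from the patent.

```python
import random
from collections import defaultdict

def stratified_sample(records, label_of, fraction, seed=0):
    """Split records into layers by label, then draw `fraction` of each layer at random."""
    rng = random.Random(seed)
    layers = defaultdict(list)
    for rec in records:
        layers[label_of(rec)].append(rec)
    sample = []
    for label in sorted(layers):
        group = layers[label]
        k = max(1, round(len(group) * fraction))  # keep at least one record per layer
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical flow records: every fifth flow is labeled malicious.
flows = [{"id": i, "label": "malicious" if i % 5 == 0 else "normal"} for i in range(100)]
subset = stratified_sample(flows, lambda r: r["label"], fraction=0.2)
```

Because each layer is sampled separately, the subset keeps the class proportions of the full data set, which is the property the text relies on when it judges the whole set from the sample.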
For step 202, the main steps of extracting the features of the data packets in the network traffic are as follows:
(1) The correlation between sample class and feature attribute is evaluated on the processed data set: the higher the correlation, the larger the weight becomes. A threshold is then set; a feature attribute whose weight exceeds the threshold is retained, otherwise it is not selected. In addition, if several feature attributes are found for a data packet during extraction, the one that occurs most frequently is used in its place.
(2) The specific feature selection process is as follows: some samples s are drawn from the data set D by stratified random sampling; the y samples r nearest to s are then selected from the same class $D_a$, and y samples t are selected from a different class $D_b$; finally, the distances $D_{sr}$ and $D_{st}$ between s and the samples r and t are calculated. If $D_{sr} > D_{st}$, the feature attribute is problematic and cannot be used for classification, so it is given a smaller weight; otherwise the feature attribute separates the classes well and is given a larger weight. The feature weight is calculated following the existing literature, where D(x, r, t) is the corresponding Euclidean distance, w(x) is the corresponding weight, $D_j$ denotes the j-th sample in the data set, and n is the number of samples over which the weights are calculated. This process is executed in a loop; the final weight is compared with the set threshold, and the feature is retained if it meets the requirement and discarded otherwise, which yields the final extracted feature attribute set S. The final extracted features are shown in FIG. 4.
Figure BDA0002510453720000051
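The weighting procedure of step 202 resembles the Relief family of algorithms. The sketch below is a simplified two-class Relief-style update, not the patent's exact formula (which is given only as a figure): the per-feature update rule, the function name `relief_weights`, the toy data and the threshold of 0.1 are all assumptions for illustration.

```python
import math
import random

def relief_weights(X, y, n_iter=20, k=2, seed=0):
    """Relief-style weights: reward features that differ on nearest misses
    (other class) and penalize features that differ on nearest hits (same class)."""
    rng = random.Random(seed)
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(n_iter):
        i = rng.randrange(len(X))
        s = X[i]
        others = [j for j in range(len(X)) if j != i]
        by_dist = sorted(others, key=lambda j: math.dist(s, X[j]))
        hits = [j for j in by_dist if y[j] == y[i]][:k]      # samples r
        misses = [j for j in by_dist if y[j] != y[i]][:k]    # samples t
        for f in range(n_features):
            d_hit = sum(abs(s[f] - X[j][f]) for j in hits) / len(hits)
            d_miss = sum(abs(s[f] - X[j][f]) for j in misses) / len(misses)
            w[f] += (d_miss - d_hit) / n_iter
    return w

# Toy data: feature 0 separates the two classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.05, 0.6],
     [1.0, 0.2], [0.9, 0.8], [0.8, 0.4], [0.95, 0.7]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w = relief_weights(X, y)
threshold = 0.1  # hypothetical cut-off
selected = [f for f, weight in enumerate(w) if weight > threshold]
```

On this toy data the discriminative feature receives a clearly positive weight and the noisy one does not, so only feature 0 passes the threshold.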
For step 203, in order to eliminate the problem of redundant features, feature attribute dimension reduction is performed, and normalization processing is performed, which includes the following specific steps:
(1) First, the extracted feature attributes are added to a set S. Building on earlier methods, a Filter feature dimension reduction method is proposed. The information gain algorithm $E_{IG} = \mathrm{evaluate}(F_{filter}, S)$ evaluates the information gain of the feature attribute set S, and the effect of each feature attribute on subsequent classification determines whether $E_{IG}$ and the feature attribute set S are updated. The feature attributes are then ranked with a heuristic search strategy to obtain a feature attribute set S1, and this process is repeated until the specified number of iterations is reached. On this basis, a Wrapper method performs a second round of feature selection using heuristic sequential forward search to obtain a feature attribute set S2; the specific flow chart is shown in FIG. 3. Feature dimension reduction shortens the running time, reduces computational complexity and improves the classification effect.
(2) With the Wrapper method, the second round of selection computes the correlation of the traffic feature attributes with the following formula from the existing literature, where n represents the number of initially selected feature attributes,
Figure BDA0002510453720000061
represents the coefficient of the feature attribute, $m_{ri}$ represents the average value of the traffic feature attribute of the i-th data packet,
Figure BDA0002510453720000062
is the corresponding variance, and $m_r$ represents the average value of the traffic feature attribute r. The final feature attributes after feature dimension reduction are shown in FIG. 5.
Figure BDA0002510453720000063
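The two-stage reduction above (an information gain filter producing S1, followed by a wrapper with sequential forward search producing S2) can be sketched as follows. This is a minimal illustration under stated assumptions: features are discrete, the wrapper score is the training accuracy of a simple majority-vote lookup (`lookup_accuracy`, a hypothetical stand-in for the classifier and for the patent's correlation formula), and the toy data are invented.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(column, labels):
    """Information gain of one discrete feature column with respect to the labels."""
    n = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [l for v, l in zip(column, labels) if v == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def filter_rank(X, y, keep):
    """Filter stage: rank features by information gain and keep the top `keep` (set S1)."""
    gains = [info_gain([row[f] for row in X], y) for f in range(len(X[0]))]
    ranked = sorted(range(len(gains)), key=lambda f: gains[f], reverse=True)
    return ranked[:keep]

def forward_select(candidates, score):
    """Wrapper stage: greedily add the feature that most improves `score` (set S2)."""
    chosen, best = [], float("-inf")
    improved = True
    while improved:
        improved = False
        for f in candidates:
            if f in chosen:
                continue
            s = score(chosen + [f])
            if s > best:
                best, pick, improved = s, f, True
        if improved:
            chosen.append(pick)
    return chosen

def lookup_accuracy(X, y, feats):
    """Hypothetical wrapper score: accuracy of majority vote per feature-value tuple."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(tuple(row[f] for f in feats), []).append(label)
    return sum(Counter(g).most_common(1)[0][1] for g in groups.values()) / len(y)

# Toy data: f0 is informative, f1 duplicates f0, f2 is noise, f3 is constant.
X = [[0, 0, 1, 7], [0, 0, 0, 7], [1, 1, 1, 7], [1, 1, 0, 7], [0, 0, 1, 7], [1, 1, 0, 7]]
y = [0, 0, 1, 1, 0, 1]
S1 = filter_rank(X, y, keep=3)
S2 = forward_select(S1, lambda feats: lookup_accuracy(X, y, feats))
```

The filter stage discards the constant feature cheaply, and the wrapper stage then drops the redundant copy because adding it does not improve the score, which mirrors the role the text assigns to the second round of selection.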
(3) Data normalization plays an important role in data mining. Different evaluation indexes have different units of measurement, and data analysis cannot be carried out in that state, so normalization is applied to make the data comparable and operable. After processing, the data become dimensionless pure values of the same order of magnitude, which simplifies subsequent processing and evaluation; normalization also speeds up convergence and improves classification accuracy. The specific process uses the dispersion normalization method proposed in the existing literature, also called min-max normalization, which maps the target data set into the range 0 to 1 by a linear transformation of the acquired feature subsets. The transfer function is:
$$x^{*} = \frac{x - \min}{\max - \min}$$
In this formula, min is the minimum value of the sample data and max is the maximum value. A drawback is that adding data during the transformation changes max and min and therefore the normalization criterion, so the data set must be kept fixed before normalization is performed.
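A minimal sketch of min-max normalization follows. It separates fitting (computing min and max once on a fixed data set) from applying, to make the caveat above concrete; the function names and the sample values are assumptions for the example.

```python
def fit_min_max(rows):
    """Compute per-column minimum and maximum on a fixed data set."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]

def apply_min_max(rows, mins, maxs):
    """Scale each column with (x - min) / (max - min); constant columns map to 0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]

# Hypothetical features: packet count and mean payload size.
train = [[10, 200], [20, 400], [30, 300]]
mins, maxs = fit_min_max(train)
scaled = apply_min_max(train, mins, maxs)

# A later sample outside the fitted range falls outside [0, 1].
late = apply_min_max([[40, 200]], mins, maxs)
```

The last line shows why the data set should stay fixed: a value larger than the fitted maximum is mapped above 1 unless min and max are recomputed, which would in turn change every previously scaled value.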
For step 204, an OFSVM algorithm is then used for classification training, and the specific steps are as follows:
With the rapid development of the economy, networks have spread widely and network traffic keeps growing in scale. For the existing SVM classification method, a real network environment contains a lot of noise and the sample data contain many redundant features, so SVM classification accuracy is low. In addition, when sample data are used to train the classifier, they must be labeled manually, which takes a great deal of effort and makes human error hard to avoid.
To solve these problems, the SVM algorithm is improved mainly from the perspective of parameter optimization. SVM parameter optimization uses a search strategy to find an approximately optimal solution within a limited number of searches over a multi-parameter space, and two important parameters must be considered: the kernel function parameter and the penalty parameter. The penalty parameter determines the generalization ability of the SVM hyperplane and mainly expresses the fault tolerance allowed when the hyperplane is constructed; the kernel function parameter determines the range of influence and thereby also affects the generalization ability of the SVM.
(1) To find the optimal parameter combination within a limited search, the SVM algorithm is improved with grid-search parameter optimization. Grid search works as follows. First, the k-dimensional parameter space is divided along its k parameters, with grid nodes representing candidate parameter values. Next, samples are taken at a specified step size to generate the corresponding set $P(c_i) = \{P(c_1) \times P(c_2) \times \dots \times P(c_k)\}$, and the parameters $c_i$ are set so as to generate grids in different directions. Finally, each grid node $c_i$ is evaluated with the designated evaluation method and the final approximately optimal solution is output. In this process, the incremental step is first set to t times the default step q, namely $q \cdot t$, to reduce the search time and the density of the generated grid; a traversal search is then performed, and once all sample data have been processed the optimal parameter combination is obtained. To express the fault tolerance for sample data when the classification plane is constructed, a penalty parameter P is introduced and compared with a preset overfitting critical value f. When P is smaller than f, the search space is reduced, the step is set to half the initial step and the search is repeated; the smaller step increases the grid density and makes the search more precise. If P exceeds f, the search space is expanded and the search direction is adjusted before searching again. The aim is to optimize the parameters while preventing overfitting. The procedure is repeated over the sample data until P falls within the critical range, at which point execution stops and the optimal parameter combination is output.
The algorithm has a large searchable space, its nodes are mutually independent, it is highly general, and it helps achieve the minimum classification error.
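The coarse-then-fine idea in the grid search above (start with an enlarged step, then halve the step to densify the grid around the best node) can be sketched as below. This is an illustration under assumptions: `cv_error` is a hypothetical stand-in for a cross-validated SVM error over log2 of the penalty and kernel parameters, and the check against the overfitting critical value f is not modeled.

```python
import itertools

def axis(center, step, radius):
    """Candidate values along one parameter: center plus or minus up to radius steps."""
    return [center + i * step for i in range(-radius, radius + 1)]

def coarse_to_fine(error, start, steps, rounds=4, radius=2):
    """Evaluate every grid node, move to the best one, halve the step, repeat."""
    best, best_err = tuple(start), error(*start)
    steps = list(steps)
    for _ in range(rounds):
        axes = [axis(c, s, radius) for c, s in zip(best, steps)]
        for node in itertools.product(*axes):  # grid nodes are evaluated independently
            err = error(*node)
            if err < best_err:
                best, best_err = node, err
        steps = [s / 2 for s in steps]  # smaller step, denser grid around the best node
    return best, best_err

def cv_error(log2_c, log2_gamma):
    """Hypothetical error surface with its minimum at C = 2**3, gamma = 2**-5."""
    return (log2_c - 3) ** 2 + (log2_gamma + 5) ** 2

best, err = coarse_to_fine(cv_error, start=(0, 0), steps=(4, 4))
```

In practice the error function would train and validate an SVM at each node; since nodes do not depend on each other, that evaluation loop is straightforward to parallelize.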
(2) Next, to improve classification accuracy, a fuzzy factor is introduced. Some existing studies propose computing the distance between each sample and its class as the fuzzy factor, but this cannot yield the optimal classification hyperplane, and it reduces the effect of the support vectors on the classification hyperplane. This study instead uses the distance from each sample to the classification hyperplane to set and calculate the fuzzy factor, which reduces the influence of the shape of the classification plane on classification accuracy. On this basis, the corresponding classification hyperplane is constructed first, and the distance from each sample node to the hyperplane is then calculated, so that the effect of redundant noise on classification accuracy can be eliminated by means of the fuzzy factors. For each sample point $s_i$ there is a corresponding fuzzy factor $e_i$, which represents the uncertainty of the sample distribution, where $0 \le e_i \le 1$. With $R^+$ and $R^-$ denoting the mean points of the positive and negative samples, the normal vector can be expressed as
Figure BDA0002510453720000071
With reference to methods in the prior art, the corresponding hyperplane can be represented as (s-R) 2 cosα T =0, so that the distance from a sample point to the hyperplane is obtained as
Figure BDA0002510453720000081
The maximum distance $d_1$ from a positive sample point to the hyperplane is then obtained if and only if $R = R^+$; likewise, when $R = R^-$, $d_2$ is the maximum distance from a negative sample point to the hyperplane. An adjustment factor
Figure BDA0002510453720000086
is then used to ensure that $0 < e_i \le 1$, and the fuzzy factor is
Figure BDA0002510453720000082
where d takes the value $d_1$ or $d_2$ depending on whether the sample is positive or negative. Different fuzzy factors thus eliminate the influence of redundant noise on classification accuracy, but they do not account for the influence of different features on classification; feature validity is therefore introduced next to eliminate the influence of weakly correlated features on classification accuracy.
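The formulas for the normal vector, the distance, the adjustment factor, and the fuzzy factor appear only as images in this text, so the following is a hedged sketch rather than the patent's own computation. It assumes the normal vector is $R^+ - R^-$, that the hyperplane for each class passes through that class's mean point, and that the fuzzy factor has the commonly used form $e_i = 1 - d_i / (d_{max} + \delta)$ with a small adjustment factor $\delta$. All function names are hypothetical.

```python
import math


def mean_point(samples):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def hyperplane_distance(s, r, normal):
    """Distance from point s to the hyperplane through r with the given normal."""
    dot = sum((si - ri) * ni for si, ri, ni in zip(s, r, normal))
    norm = math.sqrt(sum(ni * ni for ni in normal))
    return abs(dot) / norm


def fuzzy_factors(positives, negatives, delta=1e-3):
    """Return (e_pos, e_neg), lists of fuzzy factors in (0, 1].

    ASSUMPTION: e_i = 1 - d_i / (d_max + delta), where d_max plays the role
    of d_1 (positive class) or d_2 (negative class). The class means must
    differ, otherwise the assumed normal vector is zero.
    """
    r_pos, r_neg = mean_point(positives), mean_point(negatives)
    normal = [p - q for p, q in zip(r_pos, r_neg)]  # assumed normal: R+ - R-

    def factors(samples, r):
        dists = [hyperplane_distance(s, r, normal) for s in samples]
        d_max = max(dists)
        return [1.0 - d / (d_max + delta) for d in dists]

    return factors(positives, r_pos), factors(negatives, r_neg)


e_pos, e_neg = fuzzy_factors(
    [[2.0, 0.0], [4.0, 0.0], [3.0, 0.0]],
    [[-1.0, 0.0], [-3.0, 0.0], [-2.0, 0.0]],
)
```

Under these assumptions a sample lying on its class hyperplane receives the factor 1, and the farthest sample of each class receives a small positive factor, so outlying or noisy samples contribute less to training.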
(3) Following the method for calculating feature validity proposed in the existing literature, each sample data feature i has a corresponding feature validity
Figure BDA0002510453720000083
which indicates the degree of influence that a given feature has on classification; when the classification ability of feature i is strong, the feature validity is high
Figure BDA0002510453720000084
The classification effect of each feature is judged by calculating the reinforcement learning ability of each feature in the feature set S. Assuming that the training sample set S contains |S| samples in total and that each sample has p feature attributes, the feature validity can be expressed as
Figure BDA0002510453720000085
When a feature i has a large reinforcement learning value, its feature validity is large, that is, its contribution to classification is high. Finally, given the importance of the kernel function parameters to classification performance, this study optimizes the SVM classification algorithm from the perspective of selecting a suitable kernel function.
(4) The kernel function is mainly used to map the original nonlinear sample data into a feature space, where the constructed optimal classification plane turns the nonlinear problem into a linearly separable one; this avoids the huge amount of computation that a high-dimensional feature space would otherwise cause. Assume an input space $P \in R^n$ with corresponding feature space F. When a mapping function $\gamma: Y \to P$ exists such that $K(y_i, y_j) = \gamma(y_i)^T \gamma(y_j)$ holds for any $y_i$ and $y_j$ belonging to Y, the kernel function K exists. The kernel function must satisfy Mercer's theorem: for any vectors of the input space, the corresponding kernel matrix must be positive semidefinite. Once a suitable kernel function has been selected, linear classification is completed without increasing complexity, so the classification performance of the SVM depends heavily on the kernel function. This study uses the radial basis kernel function, which performs well in local regions and classifies the sample points in the data set efficiently. Because it is not limited by the number of samples or the feature dimension, it is widely applicable; it also has few parameters, and since the complexity of a kernel function is generally related to its number of parameters, its complexity is low. Improving SVM classification in this way yields a relatively small error and greatly improves the ability to classify and identify malicious programs in network traffic.
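As an illustration of the radial basis kernel and the Mercer condition mentioned above, the sketch below evaluates $K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$, builds a kernel matrix, and evaluates the quadratic form $v^T K v$, which is non-negative for every v when K is positive semidefinite. The value of `gamma`, the sample points, and the function names are assumptions chosen for the example; the patent does not give them.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Radial basis kernel exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def kernel_matrix(points, gamma=0.5):
    """Kernel (Gram) matrix of a list of points."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]


def quadratic_form(matrix, v):
    """Return v^T K v; non-negative for all v if K is positive semidefinite."""
    n = len(v)
    return sum(v[i] * matrix[i][j] * v[j] for i in range(n) for j in range(n))


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]  # hypothetical samples
K = kernel_matrix(points)
q = quadratic_form(K, [1.0, -1.0, 0.5])
```

Checking a single vector does not prove positive semidefiniteness; it only shows what the condition means. The single parameter `gamma` is what the text refers to when it says the radial basis kernel has few parameters.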
For step 205, a network traffic identification model is constructed with the NTMI identification algorithm, finally realizing the identification of malicious programs in the network traffic. The specific steps are as follows:
(1) The first problem to solve is accurately classifying the programs in network traffic. To this end, the network traffic is first collected with NetFlow technology. The whole collection procedure consists of three steps. Step one: obtain the network card list with a low-level network access tool, and monitor in real time all traffic passing through the network cards. Step two: select a network card for detection and set the network card obtained in step one to promiscuous mode. Step three: merge the data packets in the traffic, that is, extract and merge the packets of the traffic data passing through the network within a certain period of time, to obtain the collected network traffic data.
(2) Data sampling and normalization are performed on the collected network traffic data to obtain a data set that is more valuable for the experiment and from which features are easier to extract. The ReliefF algorithm is then used to extract features from the data packets in the network traffic. The extracted features may still contain redundant attribute features, which would greatly reduce the accuracy of network traffic classification, so the feature set is further reduced in dimension: each feature is calculated and evaluated with the information gain technique, the feature set is ranked, and a secondary feature selection is carried out with a wrapper method, using a heuristic sequential forward search and calculating the correlation of the features, finally achieving dimension reduction of the features.
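The information gain ranking step can be sketched as below. This is a minimal example for discrete features, assuming the standard definition $IG = H(\text{labels}) - H(\text{labels} \mid \text{feature})$; the feature names `port_class` and `proto`, the labels, and the function names are hypothetical, and the ReliefF extraction and the wrapper-based forward search are not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(labels) - H(labels | feature) for a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(columns, labels):
    """columns: dict of feature name -> list of discrete values.

    Returns (names sorted by gain, highest first; dict of gains).
    """
    gains = {name: information_gain(vals, labels)
             for name, vals in columns.items()}
    return sorted(gains, key=gains.get, reverse=True), gains


labels = ["malicious", "malicious", "normal", "normal"]   # hypothetical
columns = {
    "port_class": ["high", "high", "low", "low"],         # separates classes
    "proto": ["tcp", "udp", "tcp", "udp"],                # uninformative
}
order, gains = rank_features(columns, labels)
```

A feature that splits the classes perfectly has a gain equal to the label entropy, while a feature independent of the label has a gain of zero; features near zero are the candidates for removal during dimension reduction.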
(3) The obtained feature subset is normalized: all feature attributes are converted into numerical values and placed in a matrix array, and the minimum Euclidean distance is calculated. Training is then carried out with the OFSVM algorithm to obtain a classifier with good classification performance. The remaining network traffic test set is used as input, and the classifier separates normal programs from malicious programs in the network traffic, finally realizing the identification of malicious programs in the network traffic and completing the construction of the identification model.
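The normalization and distance step might look like the following sketch. The text does not say which normalization is used or how the minimum Euclidean distance enters training, so this assumes min-max scaling per column and reads "minimum Euclidean distance" as finding the row closest to a query vector. The function names and the sample matrix are hypothetical.

```python
import math


def min_max_normalize(rows):
    """Scale each column of a numeric matrix to [0, 1].

    Constant columns are mapped to 0.0 to avoid division by zero.
    """
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [[(v - lo) / span if span else 0.0
             for v, lo, span in zip(row, lows, spans)]
            for row in rows]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_row(rows, query):
    """Index of the row with the minimum Euclidean distance to query."""
    return min(range(len(rows)), key=lambda i: euclidean(rows[i], query))


raw = [[10.0, 200.0], [20.0, 400.0], [30.0, 300.0]]  # hypothetical features
norm = min_max_normalize(raw)
closest = nearest_row(norm, [0.9, 0.6])
```

Scaling the columns first keeps a feature with a large numeric range, such as a byte count, from dominating the distance, which matters for the radial basis kernel as well.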
The NTMI identification method provided by the present invention is compared with four existing methods, as shown in fig. 6 and fig. 7. Because the public data set is large, 10% of it is selected for training and 10% for testing, so that the data set used for testing contains close to 40,000 entries. The figures show that the accuracy of the NTMI algorithm proposed in this study remains good and that, as the number of data packets increases on the larger-scale public network traffic data set, the false alarm rate of the NTMI algorithm is lower than that of the other four algorithms and stabilizes at about 6%, which demonstrates that the present invention is feasible.
In the description of the present specification, reference to the description of "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (3)

1. A malicious program identification method based on an improved support vector machine is characterized by comprising the following steps:
step 1, acquiring data in network flow through Netflow, and carrying out data standardization on an acquired data packet;
step 2, in order to complete the identification of the malicious program, feature extraction is required;
step 3, in order to eliminate the problem of redundant features, feature attribute dimension reduction is carried out, and normalization processing is carried out;
step 4, carrying out classification training by adopting an OFSVM algorithm;
the OFSVM algorithm of step 4 includes:
in the parameter optimization, the optimal parameter combination is found within a limited search, and the SVM algorithm is improved by grid search parameter optimization; meanwhile, the distance between each sample and the class is calculated as a fuzzy factor: for each sample point $s_i$ there is a corresponding fuzzy factor $e_i$, which represents the uncertainty of the sample distribution, where $0 \le e_i \le 1$; with $R^+$ and $R^-$ denoting the mean points of the positive and negative samples, the normal vector can be expressed as
Figure FDA0003784035200000011
and the corresponding hyperplane can be represented as (s-R) 2 cosα T =0, so that the distance from a sample point to the hyperplane is obtained as
Figure FDA0003784035200000012
the maximum distance $d_1$ from a positive sample point to the hyperplane is then obtained if and only if $R = R^+$; likewise, when $R = R^-$, $d_2$ is the maximum distance from a negative sample point to the hyperplane; an adjustment factor
Figure FDA0003784035200000013
is then used to ensure that $0 < e_i \le 1$, and the fuzzy factor is
Figure FDA0003784035200000014
where d takes the value $d_1$ or $d_2$ depending on whether the sample is positive or negative; feature validity is constructed to eliminate the influence of redundant features on classification accuracy, and finally a classifier model is generated using an experimentally verified radial basis kernel function;
step 5, constructing a network traffic identification model by using an NTMI (network transfer model) identification algorithm, and finally realizing identification of malicious programs in the network traffic;
the NTMI recognition algorithm of step 5 specifically includes:
data sampling and normalization are performed on the acquired network traffic data to obtain a data set that is more valuable for the experiment and from which features are easier to extract; the ReliefF algorithm is then used to extract features from the data packets in the network traffic; each feature is calculated and evaluated with the information gain technique, the feature set is then ranked, and a secondary feature selection is carried out using a heuristic sequential forward search, calculating the correlation of the features and finally achieving dimension reduction of the features; the obtained feature subset is then normalized, all feature attributes are converted into numerical values and placed in a matrix array, the minimum Euclidean distance is calculated, and training is carried out with the OFSVM algorithm to obtain a classifier with good classification performance; the remaining network traffic test set is used as input, and the classifier classifies normal programs and malicious programs in the network traffic, finally realizing the identification of malicious programs in the network traffic.
2. The method according to claim 1, wherein the step 2 specifically comprises:
for the processed data set, the correlation between the sample class and each feature attribute is compared, and the weight value is increased as the correlation becomes higher; a threshold is set, and a feature attribute is kept if its weight exceeds the threshold and is otherwise not selected; meanwhile, if a plurality of feature attributes of a certain data packet are found in the extraction process, the data packet with the highest frequency of occurrence is selected for substitution; the specific feature selection process is as follows: a number of samples s are randomly selected from the data set D in a stratified manner; y samples r nearest to the sample s are then selected from the same class Da, and y samples t are selected from the different class Db; finally, the distances from the sample s to the samples r and t are calculated as Dsr and Dst respectively; if Dsr > Dst, the feature attribute is problematic and cannot be used for classification, and a smaller weight is set; conversely, if the feature attribute makes classification easy, a larger weight is set.
3. The method according to claim 1, wherein the step 3 specifically comprises:
first, the extracted feature attributes are added to a set S; on this basis, a Filter feature dimension reduction method is provided, in which the information gain of the feature attribute set S is evaluated with an information gain algorithm, and the effect of each feature attribute on subsequent classification is assessed to decide whether to update its value and whether to update the feature attribute set S; the feature attributes are then ranked with a heuristic search strategy to obtain a feature attribute set S1, and this process is repeated until the specified number of iterations is reached; on this basis, a wrapper method is used for secondary feature selection with a heuristic sequential forward search to obtain a feature attribute set S2; after feature dimension reduction, not only are the time and the computational complexity reduced, but the classification effect is also improved.
CN202010459366.0A 2020-05-27 2020-05-27 Malicious program identification method based on improved support vector machine Active CN111835707B (en)

Publications (2)

Publication Number Publication Date
CN111835707A CN111835707A (en) 2020-10-27
CN111835707B CN111835707B (en) 2022-12-16

Family

ID=72914111

Country Status (1)

Country Link
CN (1) CN111835707B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant