CN112906786A - Data classification improvement method based on naive Bayes model

Data classification improvement method based on naive Bayes model

Info

Publication number
CN112906786A
Authority
CN
China
Prior art keywords
naive bayes
model
attribute
attributes
bayes model
Prior art date
Legal status
Pending
Application number
CN202110182847.6A
Other languages
Chinese (zh)
Inventor
魏光杏
李华
邹军国
戴月
陈银燕
苗孟君
Current Assignee
Chuzhou Vocational and Technical College
Original Assignee
Chuzhou Vocational and Technical College
Priority date
Filing date
Publication date
Application filed by Chuzhou Vocational and Technical College
Priority to CN202110182847.6A
Publication of CN112906786A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155Bayesian classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data classification improvement method based on a naive Bayes model, which comprises the following steps: Step 1: determining an intrusion detection method based on improved weighted hidden naive Bayes; Step 2: performing data processing on the acquired data, determining attribute weights, and establishing a basic model; Step 3: performing attribute selection and data discretization during data processing; Step 4: after the attribute weights are determined, computing the weighting coefficients and extending the basic model. The invention optimizes the naive Bayes model in terms of attribute selection. To address the "conditional independence assumption" of naive Bayes, attributes are selected to obtain an optimal attribute subset in which the overall correlation between attributes is minimized, thereby counteracting part of the influence of the "conditional independence assumption".

Description

Data classification improvement method based on naive Bayes model
Technical Field
The invention relates to the technical field of data classification improvement methods, and in particular to a data classification improvement method based on a naive Bayes model.
Background
The naive Bayes method is a classification method based on Bayes' theorem and the assumption of conditional independence among features. The two most widely used classification models are the decision tree model and the naive Bayes model (NBM). Compared with the decision tree model, the naive Bayes classifier (NBC) originates from classical mathematical theory and therefore has a solid mathematical foundation and stable classification efficiency. At the same time, the NBC model requires few parameters to be estimated, is not sensitive to missing data, and uses a simple algorithm. In theory, the NBC model has the smallest error rate among classification methods. In practice, however, this is not always the case, because the NBC model assumes that attributes are mutually independent, an assumption that often does not hold in real applications; this affects the classification accuracy of the NBC model.
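For reference, the classification rule of the naive Bayes model under the conditional independence assumption can be written as follows. This is a generic textbook formulation, with the symbols a_i (attribute values) and c (class label) introduced here only for illustration:

```latex
% Naive Bayes decision rule: pick the class maximizing the posterior,
% with the likelihood factorized by the conditional independence assumption.
\hat{c} = \arg\max_{c \in C} P(c \mid a_1,\ldots,a_n)
        = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(a_i \mid c)
```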
With the advent of the big data era, the types of data actually collected have become increasingly diverse, and such data often contain attributes that are irrelevant to the task or redundant, which negatively affects classification results. Therefore, a new technical solution needs to be provided.
Disclosure of Invention
The invention aims to provide a data classification improvement method based on a naive Bayes model, which addresses the problem that, with the arrival of the big data era, the types of data actually collected are increasingly diverse and often contain attributes that are irrelevant or redundant, negatively affecting classification results.
In order to achieve the above purpose, the invention provides the following technical scheme: a data classification improvement method based on a naive Bayes model, comprising the following steps:
Step 1: determining an intrusion detection method based on improved weighted hidden naive Bayes;
Step 2: performing data processing on the acquired data, determining attribute weights, and establishing a basic model;
Step 3: performing attribute selection and data discretization during data processing;
Step 4: after the attribute weights are determined, computing the weighting coefficients and extending the basic model;
Step 5: after the weighting coefficients are confirmed in step 4, establishing the weighted hidden naive Bayes model by combining the weighting coefficients with the extended basic model;
Step 6: after the weighted hidden naive Bayes model is confirmed in step 5, applying it to intrusion detection and comparing it with existing mature algorithms.
As a preferred embodiment of the present invention, the basic model in step 2 is the NB model of the naive Bayes family.
As a preferred embodiment of the present invention, the basic model is extended to a hidden naive Bayes model.
As a preferred embodiment of the present invention, the weighting coefficient in step 4 uses the Fisher score of each attribute as that attribute's weighting coefficient.
As a preferred embodiment of the present invention, the attribute selection algorithm in step 3 is an improved version of CFS, which obtains an optimal attribute subset in which the overall correlation between attributes is minimized, thereby counteracting part of the influence of the "conditional independence assumption".
Compared with the prior art, the invention has the following beneficial effects:
the invention optimizes the naive Bayes model in the aspect of attribute selection. For the "conditional independence assumption" of naive bayes, attributes can be selected to arrive at an optimal subset of attributes such that the overall correlation between attributes in the subset of attributes is minimized. This can be done to counteract some of the effects of the "conditional independence assumption". The structure of the naive Bayes model is improved. With the idea of attribute weighting, the fisher score value of an attribute is used as a weighting coefficient for each attribute. For an attribute, if the class has stronger discrimination ability for the final classification, then under a specific class, the variance of the attribute should be as small as possible in all data instances of the class, and the fisher score is an index for measuring the criterion. The improved attribute selection algorithm and the weighted hidden naive Bayes model are combined and applied to big data intrusion detection, and are compared with a mature algorithm.
Drawings
FIG. 1 is a schematic view of the overall improvement process of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a technical solution: a data classification improvement method based on a naive Bayes model, comprising the following steps:
Step 1: determining an intrusion detection method based on improved weighted hidden naive Bayes;
Step 2: performing data processing on the acquired data, determining attribute weights, and establishing a basic model;
Step 3: performing attribute selection and data discretization during data processing;
Step 4: after the attribute weights are determined, computing the weighting coefficients and extending the basic model;
Step 5: after the weighting coefficients are confirmed in step 4, establishing the weighted hidden naive Bayes model by combining the weighting coefficients with the extended basic model;
Step 6: after the weighted hidden naive Bayes model is confirmed in step 5, applying it to intrusion detection and comparing it with existing mature algorithms.
In a further improvement, the basic model in step 2 is the NB model of the naive Bayes family.
In a further improvement, the basic model is extended to a hidden naive Bayes model, because the attributes obtained after attribute selection cannot be guaranteed to be completely independent of each other. This work therefore further relaxes the limitation of the "conditional independence assumption" of the NB model by adopting the relatively mature hidden naive Bayes model. The hidden naive Bayes model adds a hidden parent node to each attribute on the basis of the naive Bayes model; the hidden parent node represents the combined degree of correlation between that attribute and all other attributes (excluding the class attribute), so the hidden naive Bayes model relaxes the "conditional independence assumption" of naive Bayes to a certain extent. However, hidden naive Bayes also has a shortcoming: each attribute contributes differently to the final classification result, and hidden naive Bayes does not take this into account. To improve its practical effect, the idea of attribute weighting is applied, and the Fisher score of each attribute is used as that attribute's weighting coefficient.
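As an illustration, a common formulation of the hidden naive Bayes model from the literature is sketched below; the hidden parent of each attribute aggregates its correlation with the remaining attributes through weights W_ij. The exact definition of W_ij used by this application is not spelled out in the text (conditional mutual information is one standard choice), so the formula should be read as a sketch rather than the application's exact model:

```latex
% Hidden naive Bayes (sketch): each attribute a_i receives a hidden parent a_{hp_i}.
P(c \mid a_1,\ldots,a_n) \propto P(c) \prod_{i=1}^{n} P(a_i \mid a_{hp_i}, c),
\qquad
P(a_i \mid a_{hp_i}, c) = \sum_{j=1,\, j \neq i}^{n} W_{ij}\, P(a_i \mid a_j, c)
```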
In a further improvement, the weighting coefficient in step 4 is the Fisher score of each attribute, which improves the structure of the naive Bayes model. Following the idea of attribute weighting, the Fisher score of an attribute is used as that attribute's weighting coefficient. If an attribute has strong discriminative ability for the final classification, then within any given class the variance of that attribute over all instances of the class should be as small as possible; the Fisher score is an index that measures this criterion.
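A minimal sketch of the Fisher score computation that this weighting scheme relies on is given below. It follows the standard definition (between-class scatter of an attribute divided by its within-class scatter); the exact normalization used by the application may differ:

```python
import numpy as np

def fisher_scores(X, y):
    """Standard Fisher score per attribute, used here as attribute weights.

    X: (n_samples, n_attributes) numeric array; y: class labels.
    F_k = sum_c n_c * (mean_{k,c} - mean_k)^2 / sum_c n_c * var_{k,c}
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])   # between-class scatter per attribute
    within = np.zeros(X.shape[1])    # within-class scatter per attribute
    for c in np.unique(y):
        Xc = X[y == c]
        n_c = len(Xc)
        between += n_c * (Xc.mean(axis=0) - overall_mean) ** 2
        within += n_c * Xc.var(axis=0)
    return between / np.maximum(within, 1e-12)  # guard against zero variance
```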
In a further improvement, the attribute selection algorithm in step 3 is an improved version of CFS, which obtains an optimal attribute subset in which the overall correlation between attributes is minimized, thereby counteracting part of the influence of the "conditional independence assumption". The evaluation function of the attribute subset is modified: because mutual information can only be used on datasets containing exclusively discrete attributes, mutual information is not used in the evaluation criterion of the attribute subset, and the Spearman correlation coefficient is used instead. In addition, the CFS algorithm does not consider the case in which the degrees of correlation between the attributes in the subset and the class attribute differ too greatly, so a correlation-degree variance is introduced to limit this situation.
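The following sketch illustrates one plausible reading of the modified subset evaluation function: the classic CFS merit is kept as the base score, Spearman correlation replaces mutual information, and a variance term penalizes subsets whose attribute-class correlations are very uneven. The function name and the `var_penalty` weight are assumptions introduced for illustration; the application does not give the exact formula:

```python
import numpy as np
from scipy.stats import spearmanr

def subset_merit(X, y, subset, var_penalty=1.0):
    """CFS-style merit of an attribute subset using Spearman correlations.

    Base score: k * mean|rho(attr, class)| / sqrt(k + k*(k-1)*mean|rho(attr, attr)|),
    minus a penalty on the variance of the attribute-class correlations.
    """
    subset = list(subset)
    k = len(subset)
    # relevance: |Spearman| correlation of each selected attribute with the class
    r_cf = np.array([abs(spearmanr(X[:, j], y).correlation) for j in subset])
    # redundancy: |Spearman| correlation between pairs of selected attributes
    r_ff = [abs(spearmanr(X[:, a], X[:, b]).correlation)
            for i, a in enumerate(subset) for b in subset[i + 1:]]
    mean_ff = float(np.mean(r_ff)) if r_ff else 0.0
    merit = k * r_cf.mean() / np.sqrt(k + k * (k - 1) * mean_ff)
    return merit - var_penalty * r_cf.var()  # discourage very uneven relevance
```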
The improvement of the attribute selection algorithm (CFS) in the invention targets the "conditional independence assumption" of naive Bayes and optimizes the naive Bayes model from the perspective of attribute selection, so as to obtain an optimal attribute subset in which the overall correlation between attributes is minimized, thereby counteracting part of the influence of the "conditional independence assumption". To address the shortcomings of the commonly used CFS attribute selection algorithm, it is improved to a certain extent; the specific idea is to modify the evaluation function of the attribute subset. Because mutual information can only be used on datasets containing exclusively discrete attributes, mutual information is not used in the evaluation criterion of the attribute subset, and the Spearman correlation coefficient is used instead. In addition, the CFS algorithm does not consider the case in which the degrees of correlation between the attributes in the subset and the class attribute differ too greatly, so a correlation-degree variance is introduced to limit this situation. The hidden naive Bayes model is then improved with a weighting method, because the attributes obtained after attribute selection cannot be guaranteed to be completely independent of each other. The relatively mature hidden naive Bayes model is used to further relax the limitation of the "conditional independence assumption" of the NB model. The hidden naive Bayes model adds a hidden parent node to each attribute on the basis of the naive Bayes model; the hidden parent node represents the combined degree of correlation between that attribute and all other attributes (excluding the class attribute), so the hidden naive Bayes model relaxes the "conditional independence assumption" of naive Bayes to a certain extent. However, hidden naive Bayes also has a shortcoming: each attribute contributes differently to the final classification result, and hidden naive Bayes does not take this into account. To improve its practical effect, the idea of attribute weighting is applied, and the Fisher score of each attribute is used as that attribute's weighting coefficient.
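Putting the pieces together, the overall procedure (steps 2 to 6) can be summarized by the following workflow sketch. It assumes the `fisher_scores` and `subset_merit` helpers sketched above, uses equal-frequency binning as one common discretization choice, and treats `WeightedHiddenNB` as a hypothetical placeholder for the weighted hidden naive Bayes model; none of these names come from the application itself:

```python
from sklearn.preprocessing import KBinsDiscretizer

def improved_nb_pipeline(X_train, y_train, X_test):
    # Step 3: data discretization (equal-frequency binning as one possible choice)
    disc = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
    X_train_d = disc.fit_transform(X_train)
    X_test_d = disc.transform(X_test)

    # Step 3: attribute selection with the improved CFS-style criterion,
    # using a simple greedy forward search over attributes
    selected, remaining = [], list(range(X_train_d.shape[1]))
    while remaining:
        best = max(remaining,
                   key=lambda j: subset_merit(X_train_d, y_train, selected + [j]))
        if selected and (subset_merit(X_train_d, y_train, selected + [best])
                         <= subset_merit(X_train_d, y_train, selected)):
            break  # adding another attribute no longer improves the merit
        selected.append(best)
        remaining.remove(best)

    # Step 4: Fisher-score weighting coefficients for the selected attributes
    weights = fisher_scores(X_train_d[:, selected], y_train)

    # Steps 5-6: build the weighted hidden naive Bayes model and classify
    model = WeightedHiddenNB(weights=weights)   # hypothetical model class
    model.fit(X_train_d[:, selected], y_train)
    return model.predict(X_test_d[:, selected])
```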
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention as defined in the following claims. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A data classification improvement method based on a naive Bayes model, characterized in that the data classification improvement method comprises the following steps:
Step 1: determining an intrusion detection method based on improved weighted hidden naive Bayes;
Step 2: performing data processing on the acquired data, determining attribute weights, and establishing a basic model;
Step 3: performing attribute selection and data discretization during data processing;
Step 4: after the attribute weights are determined, computing the weighting coefficients and extending the basic model;
Step 5: after the weighting coefficients are confirmed in step 4, establishing the weighted hidden naive Bayes model by combining the weighting coefficients with the extended basic model;
Step 6: after the weighted hidden naive Bayes model is confirmed in step 5, applying it to intrusion detection and comparing it with existing mature algorithms.
2. The naive Bayes model-based data classification improvement method according to claim 1, wherein: the basic model in step 2 is the NB model of the naive Bayes family.
3. The naive Bayes model-based data classification improvement method according to claim 1, wherein: the basic model is extended to a hidden naive Bayes model.
4. The naive Bayes model-based data classification improvement method according to claim 1, wherein: the weighting coefficient in step 4 uses the Fisher score of each attribute as that attribute's weighting coefficient.
5. The naive Bayes model-based data classification improvement method according to claim 1, wherein: the attribute selection algorithm in step 3 is an improved version of CFS that obtains an optimal attribute subset in which the overall correlation between attributes is minimized, thereby counteracting part of the influence of the "conditional independence assumption".
CN202110182847.6A (priority date 2021-02-07, filing date 2021-02-07): Data classification improvement method based on naive Bayes model. Status: Pending. Published as CN112906786A.

Applications Claiming Priority (1)

Application Number: CN202110182847.6A
Priority Date: 2021-02-07
Filing Date: 2021-02-07
Title: Data classification improvement method based on naive Bayes model
Publication: CN112906786A

Publications (1)

Publication Number: CN112906786A
Publication Date: 2021-06-04

Family

ID=76123373

Family Applications (1)

Application Number: CN202110182847.6A
Title: Data classification improvement method based on naive Bayes model
Status: Pending
Publication: CN112906786A

Country Status (1)

Country Link
CN (1) CN112906786A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376261A (en) * 2014-11-27 2015-02-25 南京大学 Method for automatically detecting malicious process under forensics scene
CN108491719A (en) * 2018-03-15 2018-09-04 重庆邮电大学 A kind of Android malware detection methods improving NB Algorithm
CN108874927A (en) * 2018-05-31 2018-11-23 桂林电子科技大学 Intrusion detection method based on hypergraph and random forest
CN110276195A (en) * 2019-04-25 2019-09-24 北京邮电大学 A kind of smart machine intrusion detection method, equipment and storage medium
CN110222744A (en) * 2019-05-23 2019-09-10 成都信息工程大学 A kind of Naive Bayes Classification Model improved method based on attribute weight
CN110245850A (en) * 2019-05-31 2019-09-17 中国地质大学(武汉) A kind of sintering process operating mode's switch method and system considering timing
US20210021641A1 (en) * 2019-07-16 2021-01-21 Cisco Technology, Inc. Tls fingerprinting for process identification

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
江泽涛 et al.: "A two-level hybrid intrusion detection method based on feature selection", Computer Engineering and Design *
王和勇 et al.: "High-Dimensional Data Mining Technology for Big Data", 1 March 2018, Xidian University Press *
秦怀强 et al.: "Hidden naive Bayes algorithm based on attribute-value weighting", Journal of Shandong University of Science and Technology (Natural Science Edition) *
贾娴 et al.: "Research on naive Bayes intrusion forensics based on improved attribute weighting", Computer Engineering and Applications *

Similar Documents

Publication Publication Date Title
WO2021155706A1 (en) Method and device for training business prediction model by using unbalanced positive and negative samples
CN110084610B (en) Network transaction fraud detection system based on twin neural network
CN104869126B (en) A kind of network intrusions method for detecting abnormality
US20160004963A1 (en) Information processing apparatus, information processing method, and non-transitory computer readable medium
CN109902740B (en) Re-learning industrial control intrusion detection method based on multi-algorithm fusion parallelism
CN109754258B (en) Online transaction fraud detection method based on individual behavior modeling
CN110674865B (en) Rule learning classifier integration method oriented to software defect class distribution unbalance
CN113283909B (en) Ether house phishing account detection method based on deep learning
CN108509492B (en) Big data processing and system based on real estate industry
CN110581840B (en) Intrusion detection method based on double-layer heterogeneous integrated learner
KR102336035B1 (en) Unsupervised learning method and learning device for fraud detection system based on graph, and testing method and testing device using the same
CN113949549B (en) Real-time traffic anomaly detection method for intrusion and attack defense
Choi et al. Machine learning based approach to financial fraud detection process in mobile payment system
CN113343123B (en) Training method and detection method for generating confrontation multiple relation graph network
CN108491719A (en) A kind of Android malware detection methods improving NB Algorithm
CN114254738A (en) Double-layer evolvable dynamic graph convolution neural network model construction method and application
WO2021244105A1 (en) Feature vector dimension compression method and apparatus, and device and medium
Zhou et al. Credit card fraud identification based on principal component analysis and improved AdaBoost algorithm
CN112906786A (en) Data classification improvement method based on naive Bayes model
CN112422546A (en) Network anomaly detection method based on variable neighborhood algorithm and fuzzy clustering
CN116304518A (en) Heterogeneous graph convolution neural network model construction method and system for information recommendation
CN114492569B (en) Typhoon path classification method based on width learning system
CN115861625A (en) Self-label modifying method for processing noise label
CN115114951A (en) Genetic algorithm and clustering algorithm composite nested abnormal sound detection algorithm
CN114254758A (en) Domain adaptation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210604