CN112906786A - Data classification improvement method based on naive Bayes model - Google Patents
Data classification improvement method based on naive Bayes model
- Publication number
- CN112906786A
- Authority
- CN
- China
- Prior art keywords
- naive bayes
- model
- attribute
- attributes
- bayes model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/24155—Bayesian classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data classification improvement method based on a naive Bayes model, which comprises the following steps: step 1: determining an intrusion detection method based on improved weighted hidden naive Bayes; step 2: processing the acquired data, determining attribute weights, and establishing a basic model; step 3: performing an attribute selection algorithm and data discretization during data processing; and step 4: after the attribute weights are determined, computing the weighting coefficients and extending the basic model. The invention optimizes the naive Bayes model from the aspect of attribute selection. To address the "conditional independence assumption" of naive Bayes, attributes are selected to obtain an optimal attribute subset in which the overall correlation between attributes is minimized, which offsets part of the influence of the "conditional independence assumption".
Description
Technical Field
The invention relates to the technical field of data classification improvement methods, in particular to a data classification improvement method based on a naive Bayesian model.
Background
The naive Bayes method is a classification method based on Bayes' theorem and the assumption that features are conditionally independent given the class. The two most widely used classification models are the decision tree model and the naive Bayes model (NBM). Compared with the decision tree model, the naive Bayes classifier (NBC) originates from classical mathematical theory, has a solid mathematical foundation, and offers stable classification efficiency. The NBC model also requires few estimated parameters, is not very sensitive to missing data, and uses a simple algorithm. In theory, the NBC model has the lowest error rate compared with other classification methods. In practice this is not always the case, because the NBC model assumes that the attributes are independent of one another given the class; this assumption often does not hold in practical applications, which degrades the classification accuracy of the NBC model.
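For reference, the standard naive Bayes decision rule that this background paragraph describes can be written as follows. The symbols are notation assumed here and do not appear in the original text: c ranges over the class values, and a_1, ..., a_n are the attribute values of the instance being classified.

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(a_i \mid c)
```

The product over attributes is where the conditional independence assumption enters: each attribute contributes its own factor P(a_i | c), regardless of the values of the other attributes.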
With the advent of the big data age, the types of actually collected data are more and more diversified, and the data often have attributes which are irrelevant to research or redundant, so that the classification results are negatively affected. Therefore, a new technical solution needs to be provided.
Disclosure of Invention
The invention aims to provide a data classification improvement method based on a naive Bayes model that addresses the following problem: with the arrival of the big data era, the types of actually collected data are increasingly diverse, and the data often contain attributes that are irrelevant to the research or redundant, which negatively affects classification results.
In order to achieve the purpose, the invention provides the following technical scheme: a data classification improvement method based on a naive Bayesian model comprises the following steps:
step 1: determining an intrusion detection method based on improved weighted hidden naive Bayes;
step 2: processing the acquired data, determining attribute weights, and establishing a basic model;
step 3: performing an attribute selection algorithm and data discretization during data processing;
step 4: after the attribute weights are determined, computing the weighting coefficients and extending the basic model;
step 5: after the weighting coefficients are confirmed in step 4, establishing a weighted hidden naive Bayes model from the weighting coefficients and the extended basic model;
step 6: after the weighted hidden naive Bayes model is confirmed in step 5, applying it to intrusion detection and comparing it with existing mature algorithms.
As a preferred embodiment of the present invention, the basic model of step 2 is the NB (naive Bayes) model.
As a preferred embodiment of the present invention, the basic model is extended to a hidden naive Bayes model.
As a preferred embodiment of the present invention, the weighting coefficient computed in step 4 is the Fisher score of each attribute.
As a preferred embodiment of the present invention, the step 3 attribute selection algorithm is a modified version of CFS to obtain an optimal attribute subset, so that the overall correlation between attributes in this attribute subset is minimized. This can be done to counteract some of the effects of the "conditional independence assumption".
Compared with the prior art, the invention has the following beneficial effects:
The invention optimizes the naive Bayes model from the aspect of attribute selection. To address the "conditional independence assumption" of naive Bayes, attributes are selected to obtain an optimal attribute subset in which the overall correlation between attributes is minimized, which offsets part of the influence of the "conditional independence assumption". The invention also improves the structure of the naive Bayes model: following the idea of attribute weighting, the Fisher score of each attribute is used as its weighting coefficient. If an attribute has strong discriminating ability for the final classification, then within any given class the variance of that attribute across all data instances of the class should be as small as possible; the Fisher score is an index that measures this criterion. The improved attribute selection algorithm and the weighted hidden naive Bayes model are combined, applied to big data intrusion detection, and compared with mature algorithms.
Drawings
FIG. 1 is a schematic diagram of the overall improved process of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a technical solution: a data classification improvement method based on a naive Bayesian model comprises the following steps:
step 1: determining an intrusion detection method based on improved weighted hidden naive Bayes;
step 2: processing the acquired data, determining attribute weights, and establishing a basic model;
step 3: performing an attribute selection algorithm and data discretization during data processing;
step 4: after the attribute weights are determined, computing the weighting coefficients and extending the basic model;
step 5: after the weighting coefficients are confirmed in step 4, establishing a weighted hidden naive Bayes model from the weighting coefficients and the extended basic model;
step 6: after the weighted hidden naive Bayes model is confirmed in step 5, applying it to intrusion detection and comparing it with existing mature algorithms.
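The data discretization mentioned in step 3 is not specified further in the text. As a minimal sketch, assuming simple equal-width binning (one common choice, not necessarily the one intended by the invention), a continuous attribute could be mapped to integer bin indices as follows. The function name `equal_width_bins` and the parameter `n_bins` are hypothetical.

```python
def equal_width_bins(values, n_bins=5):
    """Map numeric values to bin indices 0..n_bins-1 using equal-width bins.

    Assumption: equal-width binning is used only for illustration; the
    source text does not say which discretization method is applied.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute falls entirely into a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


if __name__ == "__main__":
    durations = [0.0, 2.5, 5.0, 7.5, 10.0]
    print(equal_width_bins(durations, n_bins=4))
```

After discretization, every attribute takes a small number of discrete values, so the conditional probabilities of a naive Bayes model can be estimated by counting.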
In a further improvement, the basic model of step 2 is the NB (naive Bayes) model.
In a further improvement, the basic model is extended to a hidden naive Bayes model, because the attributes obtained after attribute selection cannot be guaranteed to be completely independent of one another. The invention therefore adopts the relatively mature hidden naive Bayes model to further relax the "conditional independence assumption" of the NB model. The hidden naive Bayes model adds a hidden parent node to each attribute on the basis of the naive Bayes model; the hidden parent node represents the combined degree of correlation between that attribute and the other attributes (excluding the class attribute), which relaxes the "conditional independence assumption" of naive Bayes to a certain extent. However, hidden naive Bayes also has a shortcoming: each attribute contributes differently to the final classification result, and hidden naive Bayes does not take this into account. To improve its effectiveness, the invention applies the idea of attribute weighting and uses the Fisher score of each attribute as that attribute's weighting coefficient.
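The text gives no formulas for the hidden naive Bayes model. As a sketch, the formulation commonly published for hidden naive Bayes can be written as below; the notation is assumed here, with a_{hp_i} denoting the hidden parent of attribute A_i, W_{ij} the weight of attribute A_j in that hidden parent, and I_P the conditional mutual information. The final line, which places a per-attribute weight w_i in the exponent, is one plausible way to combine attribute weighting with this model and is an assumption rather than a statement of the invention's exact formula.

```latex
% Hidden naive Bayes: each attribute depends on the class and on its hidden parent
c(E) = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(a_i \mid a_{hp_i}, c)

% The hidden parent is a weighted mixture of one-dependence estimators
P(a_i \mid a_{hp_i}, c) = \sum_{j=1,\, j \neq i}^{n} W_{ij} \, P(a_i \mid a_j, c),
\qquad
W_{ij} = \frac{I_P(A_i; A_j \mid C)}{\sum_{j=1,\, j \neq i}^{n} I_P(A_i; A_j \mid C)}

% Assumed attribute-weighted variant, with w_i the weighting coefficient of attribute A_i
c_w(E) = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(a_i \mid a_{hp_i}, c)^{\,w_i}
```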
In a further improvement, the weighting coefficient in step 4 is the Fisher score of each attribute, which improves the structure of the naive Bayes model. Following the idea of attribute weighting, the Fisher score of each attribute is used as its weighting coefficient. If an attribute has strong discriminating ability for the final classification, then within any given class the variance of that attribute across all data instances of the class should be as small as possible; the Fisher score is an index that measures this criterion.
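The Fisher score of a single numeric attribute can be sketched as the ratio of between-class scatter to within-class scatter, which is the standard definition; whether the invention normalizes or rescales the scores before using them as weights is not stated, so none is applied here. The function name `fisher_score` and the sample data are illustrative assumptions.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def fisher_score(values, labels):
    """Fisher score of one attribute: between-class scatter / within-class scatter.

    A high score means class means are far apart while the attribute varies
    little inside each class, i.e. the attribute discriminates well.
    """
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[y].append(v)
    overall_mean = fmean(values)
    between = sum(len(g) * (fmean(g) - overall_mean) ** 2 for g in groups.values())
    within = sum(len(g) * pvariance(g) for g in groups.values())
    if within == 0:
        # Degenerate case: no variation inside any class.
        return float("inf") if between > 0 else 0.0
    return between / within


if __name__ == "__main__":
    labels = [0, 0, 0, 1, 1, 1]
    informative = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]  # tight within class, far apart between
    noisy = [1.0, 5.0, 1.0, 5.0, 1.0, 5.0]        # large variance inside each class
    print(fisher_score(informative, labels), fisher_score(noisy, labels))
```

In this toy data the first attribute receives a much larger score than the second, matching the criterion above that a discriminating attribute should have small variance within each class.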
In a further improvement, the attribute selection algorithm of step 3 is a modified version of CFS (correlation-based feature selection) that obtains an optimal attribute subset in which the overall correlation between attributes is minimized, which offsets part of the influence of the "conditional independence assumption". The modification changes the evaluation function of the attribute subset. Because mutual information can only be used on datasets that contain only discrete attributes, the evaluation criterion of the attribute subset uses the Spearman correlation coefficient instead of mutual information. In addition, the CFS algorithm does not consider the case in which the degrees of correlation between the individual attributes in the subset and the class attribute differ too widely, so a variance of these correlation degrees is introduced to limit this case.
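The text does not give the modified evaluation function itself. The sketch below is one possible reading under stated assumptions: it starts from the standard CFS merit, k * r_cf / sqrt(k + k(k-1) * r_ff), computes both correlations with the Spearman coefficient, and subtracts a penalty proportional to the variance of the attribute-class correlations. The subtraction form, the `penalty` parameter, the function names, and the numeric encoding of the class attribute are all hypothetical.

```python
from math import sqrt
from statistics import fmean, pvariance


def ranks(xs):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(a, b):
    ma, mb = fmean(a), fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def spearman(a, b):
    """Spearman coefficient = Pearson correlation of the rank vectors."""
    return pearson(ranks(a), ranks(b))


def merit(columns, target, penalty=1.0):
    """Assumed modified CFS merit for a subset given as a list of columns.

    target holds numeric class codes (an assumption, e.g. 0 = normal, 1 = attack).
    """
    k = len(columns)
    r_cf = [abs(spearman(col, target)) for col in columns]
    r_ff = [abs(spearman(columns[i], columns[j]))
            for i in range(k) for j in range(i + 1, k)]
    mean_ff = fmean(r_ff) if r_ff else 0.0
    base = k * fmean(r_cf) / sqrt(k + k * (k - 1) * mean_ff)
    # Hypothetical penalty: discourage subsets whose relevances differ widely.
    return base - penalty * pvariance(r_cf)


if __name__ == "__main__":
    y = [0, 0, 0, 1, 1, 1]
    x1 = [1, 2, 3, 4, 5, 6]
    x2 = [2, 4, 6, 8, 10, 12]  # redundant copy of x1
    print(merit([x1], y), merit([x1, x2], y))
```

With these toy columns, adding the fully redundant attribute `x2` does not raise the merit, which illustrates why minimizing inter-attribute correlation favors smaller, less correlated subsets.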
The improved attribute selection algorithm (CFS) of the invention addresses the "conditional independence assumption" of naive Bayes and optimizes the naive Bayes model from the aspect of attribute selection, obtaining an optimal attribute subset in which the overall correlation between attributes is minimized. This offsets part of the influence of the "conditional independence assumption". The commonly used CFS attribute selection algorithm has shortcomings, and the invention improves on it by modifying the evaluation function of the attribute subset. Because mutual information can only be used on datasets that contain only discrete attributes, the evaluation criterion of the attribute subset uses the Spearman correlation coefficient instead of mutual information. In addition, the CFS algorithm does not consider the case in which the degrees of correlation between the individual attributes in the subset and the class attribute differ too widely, so a variance of these correlation degrees is introduced to limit this case.
The hidden naive Bayes model is then improved by weighting. The attributes obtained after attribute selection cannot be guaranteed to be completely independent of one another, so the relatively mature hidden naive Bayes model is used to further relax the "conditional independence assumption" of the NB model. The hidden naive Bayes model adds a hidden parent node to each attribute on the basis of the naive Bayes model; the hidden parent node represents the combined degree of correlation between that attribute and the other attributes (excluding the class attribute), which relaxes the "conditional independence assumption" of naive Bayes to a certain extent. However, hidden naive Bayes also has a shortcoming: each attribute contributes differently to the final classification result, and hidden naive Bayes does not take this into account. To improve its effectiveness, the idea of attribute weighting is applied, and the Fisher score of each attribute is used as that attribute's weighting coefficient.
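To make the effect of attribute weighting concrete, here is a minimal sketch of an attribute-weighted naive Bayes classifier over discrete attributes. It is a simplification and not the invention's full model: it uses plain per-attribute conditionals instead of hidden parent nodes, applies Laplace smoothing, and assumes the weights enter as exponents on the conditional probabilities (multipliers on the log-probabilities). The names `train`, `predict`, and the toy records are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def train(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    cond = defaultdict(Counter)  # (attribute index, class) -> value counts
    values = defaultdict(set)    # attribute index -> distinct values seen
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, y)][v] += 1
            values[i].add(v)
    return class_counts, cond, values


def predict(model, row, weights):
    """Return argmax_c of log P(c) + sum_i w_i * log P(a_i | c)."""
    class_counts, cond, values = model
    n = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, nc in class_counts.items():
        # Laplace-smoothed prior and conditionals.
        score = log((nc + 1) / (n + len(class_counts)))
        for i, v in enumerate(row):
            p = (cond[(i, c)][v] + 1) / (nc + len(values[i]))
            score += weights[i] * log(p)  # assumed weighting form
        if score > best_score:
            best, best_score = c, score
    return best


if __name__ == "__main__":
    rows = [("tcp", "low"), ("tcp", "low"), ("udp", "high"),
            ("udp", "high"), ("tcp", "high")]
    labels = ["normal", "normal", "attack", "attack", "attack"]
    model = train(rows, labels)
    print(predict(model, ("udp", "low"), [1.0, 1.0]))
    print(predict(model, ("udp", "low"), [1.0, 0.0]))
```

With equal weights the record `("udp", "low")` is classified as `normal`; setting the second attribute's weight to zero changes the prediction to `attack`, which shows how per-attribute weights such as Fisher scores can shift the decision toward the more discriminating attributes.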
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention as defined in the following claims. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (5)
1. A data classification improvement method based on a naive Bayes model, characterized in that the method comprises the following steps:
step 1: determining an intrusion detection method based on improved weighted hidden naive Bayes;
step 2: processing the acquired data, determining attribute weights, and establishing a basic model;
step 3: performing an attribute selection algorithm and data discretization during data processing;
step 4: after the attribute weights are determined, computing the weighting coefficients and extending the basic model;
step 5: after the weighting coefficients are confirmed in step 4, establishing a weighted hidden naive Bayes model from the weighting coefficients and the extended basic model;
step 6: after the weighted hidden naive Bayes model is confirmed in step 5, applying it to intrusion detection and comparing it with existing mature algorithms.
2. The naive Bayes model-based data classification improvement method according to claim 1, wherein: the basic model of step 2 is the NB (naive Bayes) model.
3. The naive Bayes model-based data classification improvement method according to claim 1, wherein: the basic model is extended to a hidden naive Bayes model.
4. The naive Bayes model-based data classification improvement method according to claim 1, wherein: the weighting coefficient computed in step 4 is the Fisher score of each attribute.
5. The naive Bayes model-based data classification improvement method according to claim 1, wherein: the attribute selection algorithm of step 3 is an improved version of CFS that obtains an optimal attribute subset in which the overall correlation between attributes is minimized, which offsets part of the influence of the "conditional independence assumption".
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110182847.6A CN112906786A (en) | 2021-02-07 | 2021-02-07 | Data classification improvement method based on naive Bayes model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110182847.6A CN112906786A (en) | 2021-02-07 | 2021-02-07 | Data classification improvement method based on naive Bayes model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112906786A (en) | 2021-06-04 |
Family
ID=76123373
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110182847.6A Pending CN112906786A (en) | 2021-02-07 | 2021-02-07 | Data classification improvement method based on naive Bayes model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112906786A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104376261A (en) * | 2014-11-27 | 2015-02-25 | 南京大学 | Method for automatically detecting malicious process under forensics scene |
CN108491719A (en) * | 2018-03-15 | 2018-09-04 | 重庆邮电大学 | A kind of Android malware detection methods improving NB Algorithm |
CN108874927A (en) * | 2018-05-31 | 2018-11-23 | 桂林电子科技大学 | Intrusion detection method based on hypergraph and random forest |
CN110222744A (en) * | 2019-05-23 | 2019-09-10 | 成都信息工程大学 | A kind of Naive Bayes Classification Model improved method based on attribute weight |
CN110245850A (en) * | 2019-05-31 | 2019-09-17 | 中国地质大学(武汉) | A kind of sintering process operating mode's switch method and system considering timing |
CN110276195A (en) * | 2019-04-25 | 2019-09-24 | 北京邮电大学 | A kind of smart machine intrusion detection method, equipment and storage medium |
US20210021641A1 (en) * | 2019-07-16 | 2021-01-21 | Cisco Technology, Inc. | Tls fingerprinting for process identification |
Non-Patent Citations (4)
Title |
---|
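Jiang Zetao et al.: "Two-level hybrid intrusion detection method based on feature selection", Computer Engineering and Design (《计算机工程与设计》) * |
Wang Heyong et al.: "High-Dimensional Data Mining Techniques for Big Data" (《面向大数据的高维数据挖掘技术》), 1 March 2018, Xidian University Press (西安电子科技大学出版社) * |
Qin Huaiqiang et al.: "Hidden naive Bayes algorithm based on attribute value weighting", Journal of Shandong University of Science and Technology (Natural Science) (《山东科技大学学报(自然科学版)》) * |
Jia Xian et al.: "Research on naive Bayes intrusion forensics based on improved attribute weighting", Computer Engineering and Applications (《计算机工程与应用》) * |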
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021155706A1 (en) | Method and device for training business prediction model by using unbalanced positive and negative samples | |
CN110084610B (en) | Network transaction fraud detection system based on twin neural network | |
CN104869126B (en) | A kind of network intrusions method for detecting abnormality | |
US20160004963A1 (en) | Information processing apparatus, information processing method, and non-transitory computer readable medium | |
CN109902740B (en) | Re-learning industrial control intrusion detection method based on multi-algorithm fusion parallelism | |
CN109754258B (en) | Online transaction fraud detection method based on individual behavior modeling | |
CN110674865B (en) | Rule learning classifier integration method oriented to software defect class distribution unbalance | |
CN113283909B (en) | Ether house phishing account detection method based on deep learning | |
CN108509492B (en) | Big data processing and system based on real estate industry | |
CN110581840B (en) | Intrusion detection method based on double-layer heterogeneous integrated learner | |
KR102336035B1 (en) | Unsupervised learning method and learning device for fraud detection system based on graph, and testing method and testing device using the same | |
CN113949549B (en) | Real-time traffic anomaly detection method for intrusion and attack defense | |
Choi et al. | Machine learning based approach to financial fraud detection process in mobile payment system | |
CN113343123B (en) | Training method and detection method for generating confrontation multiple relation graph network | |
CN108491719A (en) | A kind of Android malware detection methods improving NB Algorithm | |
CN114254738A (en) | Double-layer evolvable dynamic graph convolution neural network model construction method and application | |
WO2021244105A1 (en) | Feature vector dimension compression method and apparatus, and device and medium | |
Zhou et al. | Credit card fraud identification based on principal component analysis and improved AdaBoost algorithm | |
CN112906786A (en) | Data classification improvement method based on naive Bayes model | |
CN112422546A (en) | Network anomaly detection method based on variable neighborhood algorithm and fuzzy clustering | |
CN116304518A (en) | Heterogeneous graph convolution neural network model construction method and system for information recommendation | |
CN114492569B (en) | Typhoon path classification method based on width learning system | |
CN115861625A (en) | Self-label modifying method for processing noise label | |
CN115114951A (en) | Genetic algorithm and clustering algorithm composite nested abnormal sound detection algorithm | |
CN114254758A (en) | Domain adaptation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210604 |