CN112906786A - Data classification improvement method based on naive Bayes model

Data classification improvement method based on naive Bayes model

Info

Publication number
CN112906786A
Authority
CN
China
Prior art keywords
naive bayes
model
attribute
attributes
bayes model
Prior art date
Legal status
Pending
Application number
CN202110182847.6A
Other languages
Chinese (zh)
Inventor
魏光杏
李华
邹军国
戴月
陈银燕
苗孟君
Current Assignee
Chuzhou Vocational and Technical College
Original Assignee
Chuzhou Vocational and Technical College
Priority date
Filing date
Publication date
Application filed by Chuzhou Vocational and Technical College
Priority to CN202110182847.6A
Publication of CN112906786A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155Bayesian classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data classification improvement method based on a naive Bayes model, which comprises the following steps: Step 1: determining an intrusion detection method based on improved weighted hidden naive Bayes; Step 2: performing data processing on the acquired data, determining attribute weights, and establishing a basic model; Step 3: performing attribute selection and data discretization during data processing; Step 4: after the attribute weights are determined, computing the weighting coefficients and extending the basic model. The invention optimizes the naive Bayes model in terms of attribute selection. To address the "conditional independence assumption" of naive Bayes, attributes are selected to obtain an optimal attribute subset in which the overall correlation between attributes is minimized, thereby counteracting part of the influence of the "conditional independence assumption".

Description

Data classification improvement method based on naive Bayes model
Technical Field
The invention relates to the technical field of data classification improvement methods, and in particular to a data classification improvement method based on a naive Bayes model.
Background
The naive Bayes method is a classification method based on Bayes' theorem and the assumption of conditional independence among features. The two most widely used classification models are the decision tree model and the naive Bayes model (NBM). Compared with the decision tree model, the naive Bayes classifier (NBC) originates from classical mathematical theory and therefore has a solid mathematical foundation and stable classification efficiency. At the same time, the NBC model requires few parameters to be estimated, is not sensitive to missing data, and uses a simple algorithm. In theory, the NBC model has the smallest error rate among classification methods. In practice, however, this is not always the case, because the NBC model assumes that attributes are mutually independent, an assumption that often does not hold in real applications; this affects the classification accuracy of the NBC model.
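For reference, the classification rule of the naive Bayes model under the conditional independence assumption can be written as follows. This is a generic textbook formulation, with the symbols a_i (attribute values) and c (class label) introduced here only for illustration:

```latex
% Naive Bayes decision rule: pick the class maximizing the posterior,
% with the likelihood factorized by the conditional independence assumption.
\hat{c} = \arg\max_{c \in C} P(c \mid a_1,\ldots,a_n)
        = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(a_i \mid c)
```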
With the advent of the big data era, the types of data actually collected have become increasingly diverse, and such data often contain attributes that are irrelevant to the task or redundant, which negatively affects classification results. Therefore, a new technical solution needs to be provided.
Disclosure of Invention
The invention aims to provide a data classification improvement method based on a naive Bayes model, which addresses the problem that, with the arrival of the big data era, the types of data actually collected are increasingly diverse and often contain attributes that are irrelevant or redundant, negatively affecting classification results.
In order to achieve the above purpose, the invention provides the following technical scheme: a data classification improvement method based on a naive Bayes model, comprising the following steps:
Step 1: determining an intrusion detection method based on improved weighted hidden naive Bayes;
Step 2: performing data processing on the acquired data, determining attribute weights, and establishing a basic model;
Step 3: performing attribute selection and data discretization during data processing;
Step 4: after the attribute weights are determined, computing the weighting coefficients and extending the basic model;
Step 5: after the weighting coefficients are confirmed in step 4, establishing the weighted hidden naive Bayes model by combining the weighting coefficients with the extended basic model;
Step 6: after the weighted hidden naive Bayes model is confirmed in step 5, applying it to intrusion detection and comparing it with existing mature algorithms.
As a preferred embodiment of the present invention, the basic model in step 2 is the NB model of the naive Bayes family.
As a preferred embodiment of the present invention, the basic model is extended to a hidden naive Bayes model.
As a preferred embodiment of the present invention, the weighting coefficient in step 4 uses the Fisher score of each attribute as that attribute's weighting coefficient.
As a preferred embodiment of the present invention, the attribute selection algorithm in step 3 is an improved version of CFS, which obtains an optimal attribute subset in which the overall correlation between attributes is minimized, thereby counteracting part of the influence of the "conditional independence assumption".
Compared with the prior art, the invention has the following beneficial effects:
the invention optimizes the naive Bayes model in the aspect of attribute selection. For the "conditional independence assumption" of naive bayes, attributes can be selected to arrive at an optimal subset of attributes such that the overall correlation between attributes in the subset of attributes is minimized. This can be done to counteract some of the effects of the "conditional independence assumption". The structure of the naive Bayes model is improved. With the idea of attribute weighting, the fisher score value of an attribute is used as a weighting coefficient for each attribute. For an attribute, if the class has stronger discrimination ability for the final classification, then under a specific class, the variance of the attribute should be as small as possible in all data instances of the class, and the fisher score is an index for measuring the criterion. The improved attribute selection algorithm and the weighted hidden naive Bayes model are combined and applied to big data intrusion detection, and are compared with a mature algorithm.
Drawings
FIG. 1 is a schematic view of the overall improvement process of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a technical solution: a data classification improvement method based on a naive Bayes model, comprising the following steps:
Step 1: determining an intrusion detection method based on improved weighted hidden naive Bayes;
Step 2: performing data processing on the acquired data, determining attribute weights, and establishing a basic model;
Step 3: performing attribute selection and data discretization during data processing;
Step 4: after the attribute weights are determined, computing the weighting coefficients and extending the basic model;
Step 5: after the weighting coefficients are confirmed in step 4, establishing the weighted hidden naive Bayes model by combining the weighting coefficients with the extended basic model;
Step 6: after the weighted hidden naive Bayes model is confirmed in step 5, applying it to intrusion detection and comparing it with existing mature algorithms.
In a further improvement, the basic model in step 2 is the NB model of the naive Bayes family.
In a further improvement, the basic model is extended to a hidden naive Bayes model, because the attributes obtained after attribute selection cannot be guaranteed to be completely independent of each other. This work therefore further relaxes the limitation of the "conditional independence assumption" of the NB model by adopting the relatively mature hidden naive Bayes model. The hidden naive Bayes model adds a hidden parent node to each attribute on the basis of the naive Bayes model; the hidden parent node represents the combined degree of correlation between that attribute and all other attributes (excluding the class attribute), so the hidden naive Bayes model relaxes the "conditional independence assumption" of naive Bayes to a certain extent. However, hidden naive Bayes also has a shortcoming: each attribute contributes differently to the final classification result, and hidden naive Bayes does not take this into account. To improve its practical effect, the idea of attribute weighting is applied, and the Fisher score of each attribute is used as that attribute's weighting coefficient.
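As an illustration, a common formulation of the hidden naive Bayes model from the literature is sketched below; the hidden parent of each attribute aggregates its correlation with the remaining attributes through weights W_ij. The exact definition of W_ij used by this application is not spelled out in the text (conditional mutual information is one standard choice), so the formula should be read as a sketch rather than the application's exact model:

```latex
% Hidden naive Bayes (sketch): each attribute a_i receives a hidden parent a_{hp_i}.
P(c \mid a_1,\ldots,a_n) \propto P(c) \prod_{i=1}^{n} P(a_i \mid a_{hp_i}, c),
\qquad
P(a_i \mid a_{hp_i}, c) = \sum_{j=1,\, j \neq i}^{n} W_{ij}\, P(a_i \mid a_j, c)
```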
In a further improvement, the weighting coefficient in step 4 is the Fisher score of each attribute, which improves the structure of the naive Bayes model. Following the idea of attribute weighting, the Fisher score of an attribute is used as that attribute's weighting coefficient. If an attribute has strong discriminative ability for the final classification, then within any given class the variance of that attribute over all instances of the class should be as small as possible; the Fisher score is an index that measures this criterion.
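A minimal sketch of the Fisher score computation that this weighting scheme relies on is given below. It follows the standard definition (between-class scatter of an attribute divided by its within-class scatter); the exact normalization used by the application may differ:

```python
import numpy as np

def fisher_scores(X, y):
    """Standard Fisher score per attribute, used here as attribute weights.

    X: (n_samples, n_attributes) numeric array; y: class labels.
    F_k = sum_c n_c * (mean_{k,c} - mean_k)^2 / sum_c n_c * var_{k,c}
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])   # between-class scatter per attribute
    within = np.zeros(X.shape[1])    # within-class scatter per attribute
    for c in np.unique(y):
        Xc = X[y == c]
        n_c = len(Xc)
        between += n_c * (Xc.mean(axis=0) - overall_mean) ** 2
        within += n_c * Xc.var(axis=0)
    return between / np.maximum(within, 1e-12)  # guard against zero variance
```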
In a further improvement, the attribute selection algorithm in step 3 is an improved version of CFS, which obtains an optimal attribute subset in which the overall correlation between attributes is minimized, thereby counteracting part of the influence of the "conditional independence assumption". The evaluation function of the attribute subset is modified: because mutual information can only be used on datasets containing exclusively discrete attributes, mutual information is not used in the evaluation criterion of the attribute subset, and the Spearman correlation coefficient is used instead. In addition, the CFS algorithm does not consider the case in which the degrees of correlation between the attributes in the subset and the class attribute differ too greatly, so a correlation-degree variance is introduced to limit this situation.
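The following sketch illustrates one plausible reading of the modified subset evaluation function: the classic CFS merit is kept as the base score, Spearman correlation replaces mutual information, and a variance term penalizes subsets whose attribute-class correlations are very uneven. The function name and the `var_penalty` weight are assumptions introduced for illustration; the application does not give the exact formula:

```python
import numpy as np
from scipy.stats import spearmanr

def subset_merit(X, y, subset, var_penalty=1.0):
    """CFS-style merit of an attribute subset using Spearman correlations.

    Base score: k * mean|rho(attr, class)| / sqrt(k + k*(k-1)*mean|rho(attr, attr)|),
    minus a penalty on the variance of the attribute-class correlations.
    """
    subset = list(subset)
    k = len(subset)
    # relevance: |Spearman| correlation of each selected attribute with the class
    r_cf = np.array([abs(spearmanr(X[:, j], y).correlation) for j in subset])
    # redundancy: |Spearman| correlation between pairs of selected attributes
    r_ff = [abs(spearmanr(X[:, a], X[:, b]).correlation)
            for i, a in enumerate(subset) for b in subset[i + 1:]]
    mean_ff = float(np.mean(r_ff)) if r_ff else 0.0
    merit = k * r_cf.mean() / np.sqrt(k + k * (k - 1) * mean_ff)
    return merit - var_penalty * r_cf.var()  # discourage very uneven relevance
```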
The improvement of the attribute selection algorithm (CFS) in the invention targets the "conditional independence assumption" of naive Bayes and optimizes the naive Bayes model from the perspective of attribute selection, so as to obtain an optimal attribute subset in which the overall correlation between attributes is minimized, thereby counteracting part of the influence of the "conditional independence assumption". To address the shortcomings of the commonly used CFS attribute selection algorithm, it is improved to a certain extent; the specific idea is to modify the evaluation function of the attribute subset. Because mutual information can only be used on datasets containing exclusively discrete attributes, mutual information is not used in the evaluation criterion of the attribute subset, and the Spearman correlation coefficient is used instead. In addition, the CFS algorithm does not consider the case in which the degrees of correlation between the attributes in the subset and the class attribute differ too greatly, so a correlation-degree variance is introduced to limit this situation. The hidden naive Bayes model is then improved with a weighting method, because the attributes obtained after attribute selection cannot be guaranteed to be completely independent of each other. The relatively mature hidden naive Bayes model is used to further relax the limitation of the "conditional independence assumption" of the NB model. The hidden naive Bayes model adds a hidden parent node to each attribute on the basis of the naive Bayes model; the hidden parent node represents the combined degree of correlation between that attribute and all other attributes (excluding the class attribute), so the hidden naive Bayes model relaxes the "conditional independence assumption" of naive Bayes to a certain extent. However, hidden naive Bayes also has a shortcoming: each attribute contributes differently to the final classification result, and hidden naive Bayes does not take this into account. To improve its practical effect, the idea of attribute weighting is applied, and the Fisher score of each attribute is used as that attribute's weighting coefficient.
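Putting the pieces together, the overall procedure (steps 2 to 6) can be summarized by the following workflow sketch. It assumes the `fisher_scores` and `subset_merit` helpers sketched above, uses equal-frequency binning as one common discretization choice, and treats `WeightedHiddenNB` as a hypothetical placeholder for the weighted hidden naive Bayes model; none of these names come from the application itself:

```python
from sklearn.preprocessing import KBinsDiscretizer

def improved_nb_pipeline(X_train, y_train, X_test):
    # Step 3: data discretization (equal-frequency binning as one possible choice)
    disc = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
    X_train_d = disc.fit_transform(X_train)
    X_test_d = disc.transform(X_test)

    # Step 3: attribute selection with the improved CFS-style criterion,
    # using a simple greedy forward search over attributes
    selected, remaining = [], list(range(X_train_d.shape[1]))
    while remaining:
        best = max(remaining,
                   key=lambda j: subset_merit(X_train_d, y_train, selected + [j]))
        if selected and (subset_merit(X_train_d, y_train, selected + [best])
                         <= subset_merit(X_train_d, y_train, selected)):
            break  # adding another attribute no longer improves the merit
        selected.append(best)
        remaining.remove(best)

    # Step 4: Fisher-score weighting coefficients for the selected attributes
    weights = fisher_scores(X_train_d[:, selected], y_train)

    # Steps 5-6: build the weighted hidden naive Bayes model and classify
    model = WeightedHiddenNB(weights=weights)   # hypothetical model class
    model.fit(X_train_d[:, selected], y_train)
    return model.predict(X_test_d[:, selected])
```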
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention as defined in the following claims. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A data classification improvement method based on a naive Bayes model, characterized in that the data classification improvement method comprises the following steps:
Step 1: determining an intrusion detection method based on improved weighted hidden naive Bayes;
Step 2: performing data processing on the acquired data, determining attribute weights, and establishing a basic model;
Step 3: performing attribute selection and data discretization during data processing;
Step 4: after the attribute weights are determined, computing the weighting coefficients and extending the basic model;
Step 5: after the weighting coefficients are confirmed in step 4, establishing the weighted hidden naive Bayes model by combining the weighting coefficients with the extended basic model;
Step 6: after the weighted hidden naive Bayes model is confirmed in step 5, applying it to intrusion detection and comparing it with existing mature algorithms.
2. The naive Bayes model-based data classification improvement method according to claim 1, wherein: the basic model in step 2 is the NB model of the naive Bayes family.
3. The naive Bayes model-based data classification improvement method according to claim 1, wherein: the basic model is extended to a hidden naive Bayes model.
4. The naive Bayes model-based data classification improvement method according to claim 1, wherein: the weighting coefficient in step 4 uses the Fisher score of each attribute as that attribute's weighting coefficient.
5. The naive Bayes model-based data classification improvement method according to claim 1, wherein: the attribute selection algorithm in step 3 is an improved version of CFS that obtains an optimal attribute subset in which the overall correlation between attributes is minimized, thereby counteracting part of the influence of the "conditional independence assumption".
CN202110182847.6A (priority date 2021-02-07, filing date 2021-02-07): Data classification improvement method based on naive Bayes model. Status: Pending. Published as CN112906786A.

Applications Claiming Priority (1)

Application Number: CN202110182847.6A
Priority Date: 2021-02-07
Filing Date: 2021-02-07
Title: Data classification improvement method based on naive Bayes model
Publication: CN112906786A

Publications (1)

Publication Number: CN112906786A
Publication Date: 2021-06-04

Family

ID=76123373

Family Applications (1)

Application Number: CN202110182847.6A
Title: Data classification improvement method based on naive Bayes model
Status: Pending
Publication: CN112906786A

Country Status (1)

Country Link
CN (1) CN112906786A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376261A (en) * 2014-11-27 2015-02-25 南京大学 Method for automatically detecting malicious process under forensics scene
CN108491719A (en) * 2018-03-15 2018-09-04 重庆邮电大学 A kind of Android malware detection methods improving NB Algorithm
CN108874927A (en) * 2018-05-31 2018-11-23 桂林电子科技大学 Intrusion detection method based on hypergraph and random forest
CN110276195A (en) * 2019-04-25 2019-09-24 北京邮电大学 A kind of smart machine intrusion detection method, equipment and storage medium
CN110222744A (en) * 2019-05-23 2019-09-10 成都信息工程大学 A kind of Naive Bayes Classification Model improved method based on attribute weight
CN110245850A (en) * 2019-05-31 2019-09-17 中国地质大学(武汉) A kind of sintering process operating mode's switch method and system considering timing
US20210021641A1 (en) * 2019-07-16 2021-01-21 Cisco Technology, Inc. Tls fingerprinting for process identification

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
江泽涛 et al.: "A two-level hybrid intrusion detection method based on feature selection", Computer Engineering and Design *
王和勇 et al.: "High-Dimensional Data Mining Technology for Big Data", 1 March 2018, Xidian University Press *
秦怀强 et al.: "Hidden naive Bayes algorithm based on attribute-value weighting", Journal of Shandong University of Science and Technology (Natural Science Edition) *
贾娴 et al.: "Research on naive Bayes intrusion forensics based on improved attribute weighting", Computer Engineering and Applications *

Similar Documents

Publication Publication Date Title
WO2021155706A1 (en) Method and device for training business prediction model by using unbalanced positive and negative samples
CN110084610B (en) Network transaction fraud detection system based on twin neural network
CN104869126B (en) A kind of network intrusions method for detecting abnormality
US20160004963A1 (en) Information processing apparatus, information processing method, and non-transitory computer readable medium
CN109902740B (en) Re-learning industrial control intrusion detection method based on multi-algorithm fusion parallelism
CN109754258B (en) Online transaction fraud detection method based on individual behavior modeling
CN110674865B (en) Rule learning classifier integration method oriented to software defect class distribution unbalance
CN113283909B (en) Ether house phishing account detection method based on deep learning
CN108509492B (en) Big data processing and system based on real estate industry
CN110581840B (en) Intrusion detection method based on double-layer heterogeneous integrated learner
KR102336035B1 (en) Unsupervised learning method and learning device for fraud detection system based on graph, and testing method and testing device using the same
CN113949549B (en) Real-time traffic anomaly detection method for intrusion and attack defense
Choi et al. Machine learning based approach to financial fraud detection process in mobile payment system
CN113343123B (en) Training method and detection method for generating confrontation multiple relation graph network
CN108491719A (en) A kind of Android malware detection methods improving NB Algorithm
CN114254738A (en) Double-layer evolvable dynamic graph convolution neural network model construction method and application
WO2021244105A1 (en) Feature vector dimension compression method and apparatus, and device and medium
Zhou et al. Credit card fraud identification based on principal component analysis and improved AdaBoost algorithm
CN112906786A (en) Data classification improvement method based on naive Bayes model
CN112422546A (en) Network anomaly detection method based on variable neighborhood algorithm and fuzzy clustering
CN116304518A (en) Heterogeneous graph convolution neural network model construction method and system for information recommendation
CN114492569B (en) Typhoon path classification method based on width learning system
CN115861625A (en) Self-label modifying method for processing noise label
CN115114951A (en) Genetic algorithm and clustering algorithm composite nested abnormal sound detection algorithm
CN114254758A (en) Domain adaptation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210604