CN109145308A - Secret-related text recognition method based on improved naive Bayes - Google Patents
Secret-related text recognition method based on improved naive Bayes
- Publication number
- CN109145308A (application CN201811134941.9A / CN201811134941A; also published as CN 109145308 A)
- Authority
- CN
- China
- Prior art keywords
- feature
- secret-related
- text
- naive Bayes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention discloses a secret-related text recognition method based on an improved naive Bayes model, comprising the following steps: S1. build a naive Bayes model and perform incremental learning; S2. load the naive Bayes model obtained by incremental learning; S3. read the text to be recognized; S4. recognize the text with the naive Bayes model and mark its corresponding classification level. In the present invention, the weighted naive Bayes model makes learning more reasonable, and the proposed incremental learning scheme for feature weights substantially improves the accuracy of secret-related text detection; incremental learning driven by changes in the secret-related feature space simply and effectively handles both the addition of new secret-related features and the lowering of the classification level of old ones.
Description
Technical field
The present invention relates to secret-related text recognition, and in particular to a secret-related text recognition method based on an improved naive Bayes model.
Background technique
With the development of information technology, information systems that support large-scale integrated office, research, and production workflows have gradually entered social life and the workplace, and these systems store large amounts of sensitive data and information. How to prevent classified information from leaking to the outside world through the Internet is a problem that urgently needs to be solved.
Automatic detection of secret-related text is an effective technical means to solve the above problem. Following the Bell-LaPadula model, classified information is generally divided into four grades: public, secret, confidential, and top secret. When a secret-related text circulates on the network (for example as an official document or an e-mail), the method can detect the classification level to which the text belongs. Once that level is detected, it is compared with the label declared by the user to determine whether the information flow of the secret-related text is legal. For example, if the user labels a text "public" but the automatic detection algorithm classifies it as "secret", the transfer can be judged illegal.
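The legality check described above can be sketched in a few lines of code; the function and level names here are illustrative assumptions, not part of the patent:

```python
# Hypothetical sketch: compare the user-declared level of a document with the
# level the classifier detects, using the four Bell-LaPadula grades in
# ascending order of sensitivity.
LEVELS = ["public", "secret", "confidential", "top secret"]

def flow_is_legal(user_label: str, detected_label: str) -> bool:
    """A transfer is illegal when the detected level exceeds the declared one."""
    return LEVELS.index(detected_label) <= LEVELS.index(user_label)
```

With the grades ordered from public to top secret, a document the user labeled "public" but the detector classified as "secret" fails this check, matching the example in the text.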
Naive Bayes is currently one of the mainstream approaches in the field of text detection. However, realizing automatic detection of secret-related text with naive Bayes requires solving two hard problems: (1) because of the special nature of classified documents (they cannot be inspected freely), it is difficult to obtain a complete labeled sample set with which to train the naive Bayes model; (2) the secret-related features in a text (secret-related keywords) drift over time: a keyword that was not secret-related before may become a new secret-related feature, while the classification level of a word that used to be a secret-related feature may gradually decline, and no existing method solves this problem.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art and provide a secret-related text recognition method based on an improved naive Bayes model.
The object of the present invention is achieved through the following technical solution: a secret-related text recognition method based on an improved naive Bayes model, characterized by comprising the following steps:
S1. build a naive Bayes model and perform incremental learning;
S2. load the naive Bayes model obtained by incremental learning;
S3. read the text to be recognized;
S4. recognize the text with the naive Bayes model and mark its corresponding classification level.
Further, the secret-related text recognition method also includes a result-uploading step: the recognition result of step S4 is uploaded to a unified control center.
Further, step S1 includes the following sub-steps:
S101. build a naive Bayes model and recognize the samples carrying user-annotated labels;
S102. an administrator at the unified control center compares the recognized label with the user-annotated label; if the recognition is wrong, the sample and its correct label are added to the sample database;
S103. build the weighted naive Bayes model;
S104. when a new secret-related feature is added to the secret-related feature space, or the classification level of an old secret-related feature changes, perform incremental learning based on the change of the secret-related feature space;
S105. perform incremental learning according to the changes of the sample database and the secret-related feature database;
S106. write the learned model into the naive Bayes model and notify the system to reload it.
More specifically, step S101 includes:
First, build the naive Bayes model:
Let the sample space D of secret-related texts be composed of a feature space W = {w1, w2, ..., wn} and a category space C = {c1, c2, ..., cm}; the sample space D is the words contained in the texts, and the category space C is the classification levels of secret-related text. For a given text d = {w1, w2, ..., wl}, the naive Bayes model computes the posterior probability of the text belonging to each category and assigns the text to the category with the largest posterior probability; the discriminant is:

c* = argmax(ci ∈ C) P(ci) · ∏(j = 1..l) P(wj|ci)

where P(ci) is the prior probability of the category and P(wj|ci) is the probability of feature wj occurring under category ci, estimated with Laplace smoothing:

P(ci) = (1 + count(ci)) / (|C| + |D|),   P(wj|ci) = (1 + count(wj∧ci)) / (|W| + count(ci))

where |C|, |D| and |W| are the sizes of the category space, sample space and feature space respectively; count(ci) is the number of samples belonging to category ci, and count(wj∧ci) is the number of samples of category ci that contain feature wj.
Second, recognize the samples carrying user-annotated labels with the naive Bayes model and obtain the recognition result of each sample.
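As a concrete illustration of the naive Bayes model of step S101, the following is a minimal sketch assuming Laplace-smoothed estimates of P(ci) and P(wj|ci); all class and variable names are mine, not the patent's:

```python
import math
from collections import Counter, defaultdict

# Minimal sketch of the naive Bayes model of step S101 (names are assumptions).
# Priors and conditionals use the Laplace-smoothed counts count(c_i),
# count(w_j ∧ c_i), |C|, |D|, |W| described in the text.

class NaiveBayes:
    def __init__(self):
        self.doc_count = defaultdict(int)       # count(c_i): docs per class
        self.feat_count = defaultdict(Counter)  # count(w_j ∧ c_i): docs of c_i containing w_j
        self.vocab = set()                      # feature space W
        self.n_docs = 0                         # |D|

    def fit(self, samples):
        """samples: iterable of (word_list, label) pairs."""
        for words, label in samples:
            self.n_docs += 1
            self.doc_count[label] += 1
            for w in set(words):
                self.feat_count[label][w] += 1
                self.vocab.add(w)

    def predict(self, words):
        """Return the class with the largest (log) posterior."""
        n_classes, n_feats = len(self.doc_count), len(self.vocab)

        def log_posterior(c):
            lp = math.log((1 + self.doc_count[c]) / (n_classes + self.n_docs))
            for w in words:
                lp += math.log((1 + self.feat_count[c][w]) / (n_feats + self.doc_count[c]))
            return lp

        return max(self.doc_count, key=log_posterior)
```

Log-probabilities are used instead of the raw product to avoid numerical underflow on long texts; the argmax is unchanged.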
Step S103 includes:
First, build the weighted naive Bayes model:
λj,i denotes the weight with which the j-th feature of the feature space belongs to the i-th category; following the Bell-LaPadula model, each feature has four weights, corresponding to public, secret, confidential, and top secret:

λj,i = TFi(wj) · IDFi(wj)

where TFi(wj) is the term frequency of text feature wj in the texts of category ci, and IDFi(wj) is an improved inverse document frequency: the more documents of the class contain the feature and the fewer documents of other classes contain it, the larger the weight.
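A rough sketch of the per-class weights λj,i follows. The patent's exact "improved IDF" formula is not spelled out in this text, so the idf expression below is an assumption that merely mirrors the stated property: the weight grows with the in-class term frequency TFi(wj) and shrinks with the number of out-of-class documents containing the feature.

```python
import math
from collections import defaultdict

# Sketch of per-class feature weights lambda[j][i] of step S103.
# The idf form below is an ASSUMED variant, not the patent's exact formula.

def class_weights(docs_by_class):
    """docs_by_class: {label: [token lists]} -> {(word, label): weight}."""
    tf = defaultdict(lambda: defaultdict(int))  # TF_i(w_j): term freq per class
    df = defaultdict(lambda: defaultdict(int))  # document freq per class
    for c, docs in docs_by_class.items():
        for words in docs:
            for w in set(words):
                df[w][c] += 1
            for w in words:
                tf[w][c] += 1

    # number of documents outside each class
    n_other = {c: sum(len(d) for cc, d in docs_by_class.items() if cc != c)
               for c in docs_by_class}
    weights = {}
    for w, per_class in tf.items():
        for c, t in per_class.items():
            out_df = sum(df[w][cc] for cc in docs_by_class if cc != c)
            idf = math.log((n_other[c] + 1) / (out_df + 1)) + 1  # assumed form
            weights[(w, c)] = t * idf
    return weights
```

Under this form, a feature that appears often inside one class but in no documents of the other classes receives a markedly larger weight than a feature spread evenly across classes, which is the behavior the text describes.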
Step S104 includes:
When a new secret-related feature is added to the secret-related feature space, or the classification level of an old secret-related feature changes:
For the addition of a new feature: first select, among the other features of the same category as the new feature, the feature with the largest P(tj|ci) value, copy all of its information to the new feature, and re-estimate the weights λj,i and conditional probabilities P(wj|ci) of all features under that category according to step S103; then select, among the features of every other category, the feature with the smallest P(tj|ci) value, copy all of its information to the new feature, and again re-estimate the weights λj,i and conditional probabilities P(wj|ci) of all features under those categories according to step S103.
The case of a change in the classification level of an old secret-related feature is handled in the same way: first select, among the other features of the same category as the changed feature, the feature with the largest P(tj|ci) value, copy all of its information to the changed feature, and re-estimate the weights λj,i and conditional probabilities P(wj|ci) of all features under that category according to step S103; then select, among the features of the other categories, the feature with the smallest P(tj|ci) value, copy all of its information to the changed feature, and again re-estimate according to step S103.
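The copy-and-re-estimate strategy of step S104 can be sketched as follows; the model layout (cond_prob, stats, reestimate) is an assumed data structure, not the patent's:

```python
# Sketch of the copy strategy of step S104 (model layout is an assumption).
# A new feature inherits the statistics of the strongest feature of its own
# class (largest P(t_j|c_i)) and of the weakest feature of every other class
# (smallest P(t_j|c_i)); weights and conditionals are then re-estimated.

def add_new_feature(model, new_feat, target_class):
    for c, probs in model.cond_prob.items():   # {class: {feature: P(w|c)}}
        if c == target_class:
            donor = max(probs, key=probs.get)  # strongest in its own class
        else:
            donor = min(probs, key=probs.get)  # weakest in other classes
        model.stats[(new_feat, c)] = dict(model.stats[(donor, c)])
        probs[new_feat] = probs[donor]
    model.reestimate()  # recompute lambda_{j,i} and P(w_j|c_i) as in step S103
```

The changed-level case of S104 follows the same pattern, with the changed feature taking the place of new_feat.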
Step S105 includes:
The feature weights realize incremental learning along two dimensions, the sample space and the feature space: the statistics TF′i(·) and count′(·) computed on the sample increment set are merged into the base statistics, e.g.

TFi(wj) ← TFi(wj) + TF′i(wj),   count(ci) ← count(ci) + count′(ci)

and the incremental learning based on the feature weights then yields the incremental-learning results for P(ci) and P(wj|ci) by re-deriving them from the merged statistics.
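Step S105's merge of increment statistics into the base statistics might look like this minimal sketch (the dictionary layout is an assumption):

```python
from collections import Counter

# Sketch of step S105 (dictionary layout is an assumption): statistics from
# the sample increment set -- the count'() and TF'() of the text -- are folded
# into the base statistics, so P(c_i) and P(w_j|c_i) can be re-derived from
# the merged counts instead of relearning from scratch.

def merge_increment(base, inc):
    """Both arguments: {'n': |D|, 'docs': Counter per class, 'feat': {class: Counter}}."""
    base['n'] += inc['n']
    base['docs'].update(inc['docs'])  # Counter.update adds counts
    for c, ctr in inc['feat'].items():
        base['feat'].setdefault(c, Counter()).update(ctr)
    return base
```

Because the smoothed estimates in step S101 are pure functions of these counts, re-deriving P(ci) and P(wj|ci) after the merge gives the same result as retraining on the union of the two sample sets, at the cost of scanning only the increment.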
The beneficial effects of the present invention are: the weighted naive Bayes model makes learning more reasonable, and the proposed incremental learning scheme for feature weights substantially improves the accuracy of secret-related text detection; incremental learning based on changes of the secret-related feature space simply and effectively solves the problems of newly added secret-related features and of old secret-related features whose classification level declines.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the present invention;
Fig. 2 is a flowchart of the incremental learning of the naive Bayes model.
Specific embodiments
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings, but the scope of protection of the present invention is not limited to what is described below.
As shown in Fig. 1, a secret-related text recognition method based on an improved naive Bayes model comprises the following steps:
S1. build a naive Bayes model and perform incremental learning;
S2. load the naive Bayes model obtained by incremental learning;
S3. read the text to be recognized;
S4. recognize the text with the naive Bayes model and mark its corresponding classification level.
In embodiments of the present application, the secret-related text recognition method also includes a result-uploading step: the recognition result of step S4 is uploaded to a unified control center.
As shown in Fig. 2, step S1 includes the following sub-steps:
S101. build a naive Bayes model and recognize the samples carrying user-annotated labels;
S102. an administrator at the unified control center compares the recognized label with the user-annotated label; if the recognition is wrong, the sample and its correct label are added to the sample database;
S103. build the weighted naive Bayes model;
S104. when a new secret-related feature is added to the secret-related feature space, or the classification level of an old secret-related feature changes, perform incremental learning based on the change of the secret-related feature space;
S105. perform incremental learning according to the changes of the sample database and the secret-related feature database;
S106. write the learned model into the naive Bayes model and notify the system to reload it.
Wherein, step S101 includes:
First, build the naive Bayes model:
Let the sample space D of secret-related texts be composed of a feature space W = {w1, w2, ..., wn} and a category space C = {c1, c2, ..., cm}; the sample space D is the words contained in the texts, and the category space C is the classification levels of secret-related text. For a given text d = {w1, w2, ..., wl}, the naive Bayes model computes the posterior probability of the text belonging to each category and assigns the text to the category with the largest posterior probability; the discriminant is:

c* = argmax(ci ∈ C) P(ci) · ∏(j = 1..l) P(wj|ci)

where P(ci) is the prior probability of the category and P(wj|ci) is the probability of feature wj occurring under category ci, estimated with Laplace smoothing:

P(ci) = (1 + count(ci)) / (|C| + |D|),   P(wj|ci) = (1 + count(wj∧ci)) / (|W| + count(ci))

where |C|, |D| and |W| are the sizes of the category space, sample space and feature space respectively; count(ci) is the number of samples belonging to category ci, and count(wj∧ci) is the number of samples of category ci that contain feature wj.
Second, recognize the samples carrying user-annotated labels with the naive Bayes model and obtain the recognition result of each sample.
Step S103 includes:
First, build the weighted naive Bayes model:
λj,i denotes the weight with which the j-th feature of the feature space belongs to the i-th category; following the Bell-LaPadula model, each feature has four weights, corresponding to public, secret, confidential, and top secret:

λj,i = TFi(wj) · IDFi(wj)

where TFi(wj) is the term frequency of text feature wj in the texts of category ci, and IDFi(wj) is an improved inverse document frequency: the more documents of the class contain the feature and the fewer documents of other classes contain it, the larger the weight.
Secret-related text detection is a very special application scenario: as time passes, certain keywords that were not secret-related before may become secret-related features, while the classification level of features that used to be secret-related may gradually decline. A learning algorithm that can adapt to this kind of change is therefore needed. Clearly, a newly added secret-related feature must come with a specified classification level (for example the code name of an operation); in other words, the confidence that this text feature belongs to that category is very high. The case where the classification level of an old secret-related feature is lowered (for example downgraded from confidential to secret) is similar. The present invention therefore proposes a very simple strategy; specifically, step S104 includes:
When a new secret-related feature is added to the secret-related feature space, or the classification level of an old secret-related feature changes:
For the addition of a new feature: first select, among the other features of the same category as the new feature, the feature with the largest P(tj|ci) value, copy all of its information to the new feature, and re-estimate the weights λj,i and conditional probabilities P(wj|ci) of all features under that category according to step S103; then select, among the features of every other category, the feature with the smallest P(tj|ci) value, copy all of its information to the new feature, and again re-estimate the weights λj,i and conditional probabilities P(wj|ci) of all features under those categories according to step S103.
The case of a change in the classification level of an old secret-related feature is handled in the same way: first select, among the other features of the same category as the changed feature, the feature with the largest P(tj|ci) value, copy all of its information to the changed feature, and re-estimate the weights λj,i and conditional probabilities P(wj|ci) of all features under that category according to step S103; then select, among the features of the other categories, the feature with the smallest P(tj|ci) value, copy all of its information to the changed feature, and again re-estimate according to step S103.
Step S105 includes:
The feature weights realize incremental learning along two dimensions, the sample space and the feature space: the statistics TF′i(·) and count′(·) computed on the sample increment set are merged into the base statistics, e.g.

TFi(wj) ← TFi(wj) + TF′i(wj),   count(ci) ← count(ci) + count′(ci)

and the incremental learning based on the feature weights then yields the incremental-learning results for P(ci) and P(wj|ci) by re-deriving them from the merged statistics.
The most common feature-weight learning method is TF-IDF, but the traditional TF-IDF weight does not consider how a text feature is distributed across different categories and within the same category. For example, a secret-related text feature may occur many times in one category yet rarely or never in the others; or the feature may occur many times in a small number of documents of one category (such as the secret class) and not at all in the other texts of that category. The weighted naive Bayes model of the present invention handles this better, making the learning of the naive Bayes model more reasonable and substantially improving the accuracy of secret-related text detection. At the same time, the present invention lets the feature weights realize incremental learning along the two dimensions of sample space and feature space according to the changes of the sample database and the secret-related feature database. In addition, the incremental learning based on changes of the secret-related feature space simply and effectively solves the problems of newly added secret-related features and of old secret-related features whose classification level declines.
The above is a preferred embodiment of the present invention. It should be understood that the present invention is not limited to the forms described herein, which should not be viewed as excluding other embodiments; the invention can be used in other combinations, modifications and environments, and can be modified within the scope contemplated herein through the above teachings or through the technology or knowledge of the related field. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the present invention shall all fall within the scope of protection of the appended claims.
Claims (7)
1. A secret-related text recognition method based on an improved naive Bayes model, characterized by comprising the following steps:
S1. build a naive Bayes model and perform incremental learning;
S2. load the naive Bayes model obtained by incremental learning;
S3. read the text to be recognized;
S4. recognize the text with the naive Bayes model and mark its corresponding classification level.
2. The secret-related text recognition method based on an improved naive Bayes model according to claim 1, characterized in that it further includes a result-uploading step: the recognition result of step S4 is uploaded to a unified control center.
3. The secret-related text recognition method based on an improved naive Bayes model according to claim 1, characterized in that step S1 includes the following sub-steps:
S101. build a naive Bayes model and recognize the samples carrying user-annotated labels;
S102. an administrator at the unified control center compares the recognized label with the user-annotated label; if the recognition is wrong, the sample and its correct label are added to the sample database;
S103. build the weighted naive Bayes model;
S104. when a new secret-related feature is added to the secret-related feature space, or the classification level of an old secret-related feature changes, perform incremental learning based on the change of the secret-related feature space;
S105. perform incremental learning according to the changes of the sample database and the secret-related feature database;
S106. write the learned model into the naive Bayes model and notify the system to reload it.
4. The secret-related text recognition method based on an improved naive Bayes model according to claim 3, characterized in that step S101 includes:
first, building the naive Bayes model:
let the sample space D of secret-related texts be composed of a feature space W = {w1, w2, ..., wn} and a category space C = {c1, c2, ..., cm}; the sample space D is the words contained in the texts, and the category space C is the classification levels of secret-related text; for a given text d = {w1, w2, ..., wl}, the naive Bayes model computes the posterior probability of the text belonging to each category and assigns the text to the category with the largest posterior probability, the discriminant being

c* = argmax(ci ∈ C) P(ci) · ∏(j = 1..l) P(wj|ci)

where P(ci) is the prior probability of the category and P(wj|ci) is the probability of feature wj occurring under category ci, estimated with Laplace smoothing as

P(ci) = (1 + count(ci)) / (|C| + |D|),   P(wj|ci) = (1 + count(wj∧ci)) / (|W| + count(ci))

where |C|, |D| and |W| are the sizes of the category space, sample space and feature space respectively, count(ci) is the number of samples belonging to category ci, and count(wj∧ci) is the number of samples of category ci that contain feature wj;
second, recognizing the samples carrying user-annotated labels with the naive Bayes model and obtaining the recognition result of each sample.
5. The secret-related text recognition method based on an improved naive Bayes model according to claim 3, characterized in that step S103 includes:
first, building the weighted naive Bayes model:
λj,i denotes the weight with which the j-th feature of the feature space belongs to the i-th category; following the Bell-LaPadula model, each feature has four weights, corresponding to public, secret, confidential, and top secret:

λj,i = TFi(wj) · IDFi(wj)

where TFi(wj) is the term frequency of text feature wj in the texts of category ci, and IDFi(wj) is an improved inverse document frequency: the more documents of the class contain the feature and the fewer documents of other classes contain it, the larger the weight.
6. The secret-related text recognition method based on an improved naive Bayes model according to claim 3, characterized in that step S104 includes:
when a new secret-related feature is added to the secret-related feature space, or the classification level of an old secret-related feature changes:
for the addition of a new feature: first selecting, among the other features of the same category as the new feature, the feature with the largest P(tj|ci) value, copying all of its information to the new feature, and re-estimating the weights λj,i and conditional probabilities P(wj|ci) of all features under that category according to step S103; then selecting, among the features of every other category, the feature with the smallest P(tj|ci) value, copying all of its information to the new feature, and again re-estimating the weights λj,i and conditional probabilities P(wj|ci) of all features under those categories according to step S103;
the case of a change in the classification level of an old secret-related feature being handled in the same way: first selecting, among the other features of the same category as the changed feature, the feature with the largest P(tj|ci) value, copying all of its information to the changed feature, and re-estimating the weights λj,i and conditional probabilities P(wj|ci) of all features under that category according to step S103; then selecting, among the features of the other categories, the feature with the smallest P(tj|ci) value, copying all of its information to the changed feature, and again re-estimating according to step S103.
7. The secret-related text recognition method based on an improved naive Bayes model according to claim 3, characterized in that step S105 includes:
the feature weights realizing incremental learning along the two dimensions of sample space and feature space, the statistics TF′i(·) and count′(·) computed on the sample increment set being merged into the base statistics, e.g.

TFi(wj) ← TFi(wj) + TF′i(wj),   count(ci) ← count(ci) + count′(ci)

after which the incremental learning based on the feature weights yields the incremental-learning results for P(ci) and P(wj|ci).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811134941.9A CN109145308B (en) | 2018-09-28 | 2018-09-28 | Secret-related text recognition method based on improved naive Bayes |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811134941.9A CN109145308B (en) | 2018-09-28 | 2018-09-28 | Secret-related text recognition method based on improved naive Bayes |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109145308A true CN109145308A (en) | 2019-01-04 |
CN109145308B CN109145308B (en) | 2022-07-12 |
Family
ID=64813077
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811134941.9A Active CN109145308B (en) | 2018-09-28 | 2018-09-28 | Secret-related text recognition method based on improved naive Bayes |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109145308B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111783873A (en) * | 2020-06-30 | 2020-10-16 | 中国工商银行股份有限公司 | Incremental naive Bayes model-based user portrait method and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2000026795A1 (en) * | 1998-10-30 | 2000-05-11 | Justsystem Pittsburgh Research Center, Inc. | Method for content-based filtering of messages by analyzing term characteristics within a message |
CN107480123A (en) * | 2017-06-28 | 2017-12-15 | 武汉斗鱼网络科技有限公司 | A kind of recognition methods, device and the computer equipment of rubbish barrage |
CN107908649A (en) * | 2017-10-11 | 2018-04-13 | 北京智慧星光信息技术有限公司 | A kind of control method of text classification |
-
2018
- 2018-09-28 CN CN201811134941.9A patent/CN109145308B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2000026795A1 (en) * | 1998-10-30 | 2000-05-11 | Justsystem Pittsburgh Research Center, Inc. | Method for content-based filtering of messages by analyzing term characteristics within a message |
CN107480123A (en) * | 2017-06-28 | 2017-12-15 | 武汉斗鱼网络科技有限公司 | A kind of recognition methods, device and the computer equipment of rubbish barrage |
CN107908649A (en) * | 2017-10-11 | 2018-04-13 | 北京智慧星光信息技术有限公司 | A kind of control method of text classification |
Non-Patent Citations (3)
Title |
---|
HAN JOON KIM et al.: "Integrating Incremental Feature Weighting into Naïve Bayes Text Classifier", 2007 International Conference on Machine Learning and Cybernetics *
HOU KAI: "Research on Chinese Text Classification Based on Weighted Bayesian Incremental Learning", China Master's Theses Full-text Database *
RAO LILI et al.: "Improved Weighted Naive Bayes Classification Algorithm Based on Feature Correlation", Journal of Xiamen University (Natural Science) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111783873A (en) * | 2020-06-30 | 2020-10-16 | 中国工商银行股份有限公司 | Incremental naive Bayes model-based user portrait method and device |
CN111783873B (en) * | 2020-06-30 | 2023-08-25 | 中国工商银行股份有限公司 | User portrait method and device based on increment naive Bayes model |
Also Published As
Publication number | Publication date |
---|---|
CN109145308B (en) | 2022-07-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20200279105A1 (en) | Deep learning engine and methods for content and context aware data classification | |
Hashemi et al. | Query intent detection using convolutional neural networks | |
Sebastiani | Text categorization | |
Liu et al. | Adaptive co-training SVM for sentiment classification on tweets | |
Halgaš et al. | Catching the Phish: Detecting phishing attacks using recurrent neural networks (RNNs) | |
US10637826B1 (en) | Policy compliance verification using semantic distance and nearest neighbor search of labeled content | |
US20110004573A1 (en) | Identifying training documents for a content classifier | |
CN103455545A (en) | Location estimation of social network users | |
CN111758098B (en) | Named entity identification and extraction using genetic programming | |
Akhter et al. | Supervised ensemble learning methods towards automatically filtering Urdu fake news within social media | |
CN110990676A (en) | Social media hotspot topic extraction method and system | |
CN110532390A (en) | A kind of news keyword extracting method based on NER and Complex Networks Feature | |
CN106294861B (en) | Method and system for aggregating and displaying texts in large-scale-data-oriented intelligence channels | |
CN110321707A (en) | A kind of SQL injection detection method based on big data algorithm | |
CN114595689A (en) | Data processing method, data processing device, storage medium and computer equipment | |
CN109145308A (en) | Secret-related text recognition method based on improved naive Bayes | |
CN109543038A (en) | A kind of sentiment analysis method applied to text data | |
US20230281306A1 (en) | System and method for detecting leaked documents on a computer network | |
Prilepok et al. | Spam detection using data compression and signatures | |
Chai et al. | Automatically measuring the quality of user generated content in forums | |
CN116578708A (en) | Paper data name disambiguation algorithm based on graph neural network | |
CN111368092A (en) | Knowledge graph construction method based on trusted webpage resources | |
Jahnavi et al. | A cogitate study on text mining | |
CN107491424B (en) | Chinese document gene matching method based on multi-weight system | |
CN112434126B (en) | Information processing method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |