CN111782811A - E-government affair sensitive text detection method based on convolutional neural network and support vector machine - Google Patents
- Publication number
- CN111782811A
- Authority
- CN
- China
- Prior art keywords
- text
- sensitive
- neural network
- convolutional neural
- policy
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to an e-government sensitive text detection method based on a convolutional neural network and a support vector machine. The invention mainly provides: (1) an e-government sensitive text detection model based on a convolutional neural network and a support vector machine; (2) a sensitive-field text classification model based on TFIDF and a support vector machine; and (3) a policy document recognition model based on word vectors and a convolutional neural network.
Description
Technical Field
The invention relates to the technical field of machine learning, in particular to an electronic government affair sensitive text detection method based on a convolutional neural network and a support vector machine.
Background
With the rapid development of internet and computer technologies, network information technology is applied ever more widely in social life. Because networks are open and shared, these technologies improve the efficiency of government work but also raise content security problems and pose threats to the information security of government departments. Sensitive information and files of government departments may be leaked to the internet through e-government platforms. In particular, policy documents in sensitive fields such as religion, the military, and politics mostly contain sensitive information, and once that information is leaked and spread it can cause great harm to national security. Therefore, in order to protect state secrets, accurately and quickly detecting e-government sensitive information leaked onto the network, while reducing the false alarm rate and the missed alarm rate, has become a major challenge.
To protect sensitive text from leakage, it must first be determined whether the text content contains sensitive information. At present, most sensitive text detection relies on customized rules; however, as sensitive electronic documents grow in number and complexity, the existing detection means can no longer meet the requirements of efficiency and convenience. Finding a more efficient detection solution, so that sensitive information leaked to internet portal websites is discovered promptly and comprehensively, is therefore a problem that cannot be ignored. Currently, there are two main detection techniques. The first is detection based on keyword matching, whose core is sensitive word matching, generally implemented with a string matching algorithm. This approach ignores the relationship between deformed (variant) words and the original words, and so has low accuracy. The second, which arose with the development of machine learning, uses text classification to detect sensitive texts; methods based on traditional machine learning have low accuracy because few sensitive texts are available for training.
Therefore, to solve the above problems, the present invention combines the characteristics of e-government sensitive text (content relating to policy guidelines in sensitive fields, etc.) and provides an e-government sensitive text detection method based on a convolutional neural network and a support vector machine.
Disclosure of Invention
The invention provides an electronic government affair sensitive text detection method based on a convolutional neural network and a support vector machine, which mainly comprises the following three contents:
1. providing a sensitive field text classification model based on a TFIDF and a support vector machine;
2. providing a policy document identification model based on word vectors and a convolutional neural network;
3. providing an electronic government affair sensitive text detection model based on sensitive field text classification and policy document identification.
The specific contents are as follows:
1. a sensitive field text classification model based on a TFIDF and a support vector machine is provided
The method mainly adopts a TFIDF weighting technology to construct a text vector, adopts a support vector machine algorithm, and constructs a sensitive field text classification model through continuous machine learning training, wherein the model is used for judging whether the text belongs to the sensitive field.
(1) The field text data set is converted to text vectors using the TFIDF weighting technique. For each text in the data set, the semantics of the text are represented by a vector; each dimension of the vector corresponds to a word, and its value is the TFIDF value of that word in the text. TFIDF evaluates how important a word is to one text within a corpus.
The weights are calculated with the TFIDF weighting technique as follows. The formula consists of two parts, term frequency (TF) and inverse document frequency (IDF):
$w_{ij} = tf_{ij} \cdot idf_{ij}$ (3)
where $n_{ij}$ is the number of times the i-th feature word occurs in the j-th text; $N_j$ is the total number of words in the j-th text; $N_i$ is the number of texts containing the i-th feature word; $N$ is the total number of texts; and $w_{ij}$ is the TFIDF value of the i-th feature word in the j-th text.
Because computing TFIDF over a large-scale corpus yields a very large number of words, the invention limits the feature set to 500 feature words for reasons of time and space efficiency, preferring words with high word frequency. After the vectors are constructed, we obtain $X = \{x_1, x_2, \dots, x_i\}$, where $x_i$ is the vector corresponding to the i-th text in the text training set D.
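The vector construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: it assumes TF is computed as n_ij / N_j and IDF as log(N / N_i), the standard forms that match the symbol definitions given with formula (3). The function name `build_tfidf_vectors`, the natural logarithm, the alphabetical tie-breaking, and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter

def build_tfidf_vectors(docs, max_features=500):
    """docs: list of token lists. Returns (vocabulary, list of TF-IDF vectors)."""
    # Corpus-wide counts, used to keep only the most frequent words as features.
    corpus_counts = Counter(tok for doc in docs for tok in doc)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(corpus_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [word for word, _ in ranked[:max_features]]

    n_docs = len(docs)  # N: total number of texts
    # N_i: number of texts that contain feature word i.
    doc_freq = {w: sum(1 for doc in docs if w in doc) for w in vocab}

    vectors = []
    for doc in docs:
        counts = Counter(doc)      # n_ij for every word in this text
        total = len(doc) or 1      # N_j, guarded against empty texts
        vec = []
        for w in vocab:
            tf = counts[w] / total                 # assumed TF form
            idf = math.log(n_docs / doc_freq[w])   # assumed IDF form
            vec.append(tf * idf)                   # formula (3)
        vectors.append(vec)
    return vocab, vectors

# Toy corpus of already-segmented texts (illustrative only).
docs = [
    ["religion", "policy", "notice"],
    ["military", "policy", "policy", "plan"],
    ["weather", "notice"],
]
vocab, X = build_tfidf_vectors(docs, max_features=500)
```

Each row of `X` is one text vector whose dimensions follow the order of `vocab`. With a real corpus the 500-word cap would actually truncate the vocabulary; here it only shows where the limit applies.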
(2) The text data set is trained with a support vector machine algorithm to obtain the sensitive-field text classification model. The process is as follows:
Modeling: Given a training sample $T = \{(v_1, y_1), (v_2, y_2), \dots, (v_n, y_n)\}$, where $v_1, \dots, v_n$ are n text vectors and $y_1, \dots, y_n$ are the sensitive-field label values of the corresponding training texts: the label of a text belonging to the sensitive field is +1, and the label of a text not belonging to the sensitive field is -1. We need to find a hyperplane that separates the training instances into the two classes. The hyperplane is $w \cdot x + b = 0$, and the classification decision model is $f(x) = \mathrm{sign}(w \cdot x + b)$, where sign is the sign function, w is the weight vector of the model, and b is the bias. To obtain the maximum-margin hyperplane that completely separates the sample points in the training set, the following constrained optimization problem is solved, where $x_i$ denotes the i-th training text vector:
$\min_{w,b} \ \frac{1}{2}\|w\|^2$ (4)
$\text{s.t.} \ y_i (w \cdot x_i + b) - 1 \ge 0, \quad i = 1, 2, \dots, n$ (5)
Solving (4) and (5) for the optimal w and b yields the sensitive-field classification decision model f(x).
Detection: After modeling is complete, a text vector to be detected is input, and the output value is the classification label of the text: +1 denotes the positive class (the text belongs to the sensitive field) and -1 denotes the negative class (the text does not belong to the sensitive field).
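To make the modeling and detection steps concrete, here is a small pure-Python sketch of a linear classifier with decision rule f(x) = sign(w·x + b). Note the substitution: instead of solving the constrained problem exactly with a quadratic programming solver, it runs subgradient descent on a regularized hinge loss, a common approximation that is swapped in here for brevity. The function names, the learning rate `lr`, the regularization weight `lam`, the epoch count, and the two-dimensional toy vectors are assumptions for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=2000):
    """Subgradient descent on lam/2 * ||w||^2 + hinge loss (approximate stand-in)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # True when the margin constraint y_i (w.x_i + b) >= 1 is not yet met.
            violated = yi * (dot(w, xi) + b) < 1
            # The regularizer always shrinks w slightly (keeps ||w|| small).
            w = [wk - lr * lam * wk for wk in w]
            if violated:
                # A violating sample pulls the hyperplane toward classifying it correctly.
                w = [wk + lr * yi * xk for wk, xk in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    """Decision model f(x) = sign(w.x + b); +1 = sensitive field, -1 = not."""
    return 1 if dot(w, x) + b >= 0 else -1

# Toy, linearly separable "text vectors" with labels +1 / -1 (illustrative only).
X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

In practice the inputs would be the 500-dimensional TFIDF vectors, and a library SVM would normally replace this hand-written loop.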
2. A policy document recognition model based on word vectors and a convolutional neural network is provided
Word2vec is used to train word vectors on the segmented word sequences, giving a word vector for each word. The word vectors serve as the input data of a convolutional neural network, on which a policy document identification model is built. The model judges whether a text is a policy document and mainly comprises an input layer, a convolutional layer, a pooling layer, and a fully connected layer.
(1) The first layer is the input layer. The input is an n × m matrix, denoted A, where n is the number of words in a text word sequence and m is the dimension of the word vector of each word. The invention uses padding to keep all text word sequences the same length, and uses Word2vec to train the word vectors, mapping each word to an m-dimensional word vector.
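The input layer can be illustrated with a short sketch that pads a word sequence to a fixed length n and looks up an m-dimensional vector for each word. The lookup table below is a hand-written stand-in for trained Word2vec vectors. The `<PAD>` token, the all-zero vector for padding and unknown words, and the truncation of over-long sequences are assumptions the source does not specify.

```python
def pad_sequence(tokens, n, pad_token="<PAD>"):
    """Truncate or right-pad a token list so that it has exactly n entries."""
    return tokens[:n] + [pad_token] * max(0, n - len(tokens))

def to_matrix(tokens, embeddings, n, m):
    """Build the n x m input matrix A; padding/unknown words map to a zero vector."""
    zero = [0.0] * m
    return [list(embeddings.get(tok, zero)) for tok in pad_sequence(tokens, n)]

# Hypothetical 3-dimensional word vectors standing in for Word2vec output.
embeddings = {
    "policy": [0.1, 0.2, 0.3],
    "notice": [0.4, 0.5, 0.6],
}
A = to_matrix(["policy", "notice"], embeddings, n=4, m=3)
A_long = to_matrix(["policy"] * 6, embeddings, n=4, m=3)
```

Every text then yields a matrix of the same shape, which is what the convolutional layer expects.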
(2) The second layer is the convolutional layer. Convolution kernels of different sizes are applied to the matrix. The width of each kernel equals the word vector dimension m and its height is h, so a kernel is an h × m matrix t. The kernel slides downward with a stride of 1, and a convolution is computed for each h × m window, producing a new feature value $c_i$. One kernel thus yields a feature map c with n-h+1 features in total, calculated as:
$c_i = f(t \cdot A[i:i+h-1] + b), \quad i = 1, 2, \dots, n-h+1$ (6)
where b is the bias term and f is the activation function.
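As a worked illustration of formula (6), the sketch below slides one h × m kernel down an n × m matrix with stride 1 and returns the n-h+1 feature values. ReLU is used as the activation f purely as an assumption, since the source does not name the activation function; the matrix and kernel values are made up.

```python
def relu(z):
    return max(0.0, z)

def conv_feature_map(A, t, b=0.0, f=relu):
    """A: n x m matrix (list of rows); t: h x m kernel. Returns n-h+1 features."""
    n, h = len(A), len(t)
    features = []
    for i in range(n - h + 1):          # stride 1, one window per position
        window = A[i:i + h]             # rows i .. i+h-1 of A
        # Element-wise product of kernel and window, summed over all entries.
        s = sum(t[r][k] * window[r][k]
                for r in range(h) for k in range(len(t[r])))
        features.append(f(s + b))
    return features

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]   # n = 4, m = 2
t = [[1.0, -1.0], [0.5, 0.5]]                          # h = 2
c = conv_feature_map(A, t, b=0.0)
```

With n = 4 and h = 2 the feature map has 4 - 2 + 1 = 3 entries, matching the count stated for one kernel.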
(3) The third layer is the pooling layer. Because convolution kernels of different sizes produce feature maps of different sizes, the invention applies the 1-max-pooling function to each feature map so that the resulting feature dimensions are consistent. 1-max-pooling takes the maximum of the values in a feature map.
(4) The fourth layer is the fully connected layer, which performs classification: the features extracted by the convolutional and pooling layers are input to a softmax function for classification training, giving the policy document identification model.
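The pooling and classification layers can be sketched together. The example below applies 1-max pooling to feature maps of different lengths, then passes the pooled features through one fully connected layer and a softmax. The weight matrix `W`, the bias `b`, and the two-class output layout are hypothetical values chosen for illustration; a trained model would learn them.

```python
import math

def one_max_pool(feature_maps):
    """Keep the single largest value of each feature map."""
    return [max(fm) for fm in feature_maps]

def dense(x, W, b):
    """Fully connected layer: one output per row of W."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

def softmax(z):
    m = max(z)                               # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Two feature maps of different lengths, e.g. from kernels of different heights.
feature_maps = [[1.5, 0.0, 1.0], [0.2, 0.7]]
pooled = one_max_pool(feature_maps)          # one value per kernel

# Hypothetical weights for classes [policy document, non-policy document].
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
probs = softmax(dense(pooled, W, b))
```

Whatever the kernel heights, pooling leaves exactly one feature per kernel, so the fully connected layer always receives a vector of fixed size.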
3. An electronic government affair sensitive text detection model based on sensitive field text classification and policy document identification is provided
Since policy documents in sensitive e-government fields mostly involve sensitive content, determining whether a text is a sensitive text requires detecting both whether it belongs to a sensitive field and whether it is a policy document. The sensitive-field text classification model first detects whether the text belongs to a sensitive field; for texts that do, the policy document identification model then judges whether the text is a policy document.
(1) Sensitive-field text classification. First, a text vector is constructed for the text to be detected; then the sensitive-field classification model built with the support vector machine algorithm computes the classification result. The input of the model is the text to be detected, and the output is its sensitive-field classification result, i.e. whether the text belongs to a sensitive field.
(2) Policy document identification. First, word vectors are constructed with Word2vec; then the model built with the convolutional neural network computes the policy document identification result. The input of the model is the text to be detected, and the output is its policy document identification result, i.e. whether the text is a policy document.
Finally, the models from the steps above are integrated to obtain the e-government sensitive text detection model based on sensitive-field text classification and policy document identification.
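The integration described here is a two-stage cascade, which can be sketched as follows. The two keyword-based functions are hypothetical stubs standing in for the trained SVM and CNN models, and the call counter exists only to show that the second stage runs solely for texts that pass the first.

```python
def detect_sensitive(text, is_sensitive_domain, is_policy_document):
    """Two-stage cascade: stage 2 runs only for texts that pass stage 1."""
    if not is_sensitive_domain(text):
        return False                       # not in a sensitive field: stop here
    return bool(is_policy_document(text))  # sensitive field AND policy document

# Hypothetical keyword stubs standing in for the trained SVM and CNN models.
calls = {"cnn": 0}

def stub_domain_classifier(text):
    return any(k in text for k in ("religion", "military", "politics"))

def stub_policy_classifier(text):
    calls["cnn"] += 1
    return "policy" in text

results = [
    detect_sensitive(t, stub_domain_classifier, stub_policy_classifier)
    for t in ("military policy notice",
              "military parade photos",
              "weather policy notice")
]
```

Only the first text is flagged. The third text is rejected by the first stage, so the policy document model is never invoked for it.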
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention discloses an electronic government affair sensitive text detection method based on a convolutional neural network and a support vector machine. The method mainly comprises the following steps:
Step 1: Preprocess the text data. First, the data set prepared for the invention is cleaned to remove useless parts of the text; then the text is segmented with a Chinese word segmentation technique to obtain the word sequence of each text.
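A rough sketch of this preprocessing step follows. The source does not say which cleaning rules or which segmentation tool are used, so both parts are assumptions: cleaning here means stripping HTML tags and collapsing whitespace, and segmentation uses simple forward maximum matching against a tiny hypothetical lexicon, a stand-in for a production Chinese word segmenter.

```python
import re

def clean_text(raw):
    """Remove HTML tags and collapse runs of whitespace (assumed cleaning rules)."""
    text = re.sub(r"<[^>]+>", " ", raw)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def forward_max_match(text, lexicon, max_len=4):
    """Greedy segmentation: at each position take the longest lexicon word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no lexicon word matches.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

lexicon = {"宗教", "政策", "文件"}           # hypothetical mini-lexicon
cleaned = clean_text("<p>宗教  政策</p>")
tokens = forward_max_match("宗教政策文件", lexicon)
```

The resulting token list is the word sequence that the TFIDF and Word2vec steps consume.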
Step 2: Establish the sensitive-field text classification model. The weights of the words in each text are calculated with TFIDF, and each text word sequence is converted to the corresponding text vector. A field classification model is then built with the support vector machine algorithm; its inputs are the text vectors of sensitive-field and non-sensitive-field texts together with their classification labels, and the model is trained repeatedly to obtain a better classification result.
Step 3: Establish the policy document identification model. First, word vectors are trained on the text with a Word2vec model, each word is represented by a word vector, and each text is converted to the corresponding matrix as the input data of the convolutional neural network. Second, convolution kernels of different sizes are applied to the input matrix to obtain several feature maps. Then the 1-max-pooling function is applied to each feature map, outputting the maximum value of its features. Finally, the extracted features are input to a softmax function for classification, giving the policy document identification result of the text.
Step 4: Detection. First, the text to be detected is converted to a text vector, and the sensitive-field text classification model established in Step 2 detects whether it belongs to a sensitive field. Then, for a text belonging to a sensitive field, the policy document identification model established in Step 3 judges whether it is a policy document. Finally, because most policy documents in sensitive e-government fields are sensitive texts, a text that belongs to a sensitive field and is a policy document is detected as a sensitive text.
Claims (4)
1. A method for detecting E-government affair sensitive texts based on a convolutional neural network and a support vector machine is characterized by comprising the following steps:
(1) providing a sensitive field text classification model based on a TFIDF and a support vector machine;
(2) providing a policy document identification model based on word vectors and a convolutional neural network;
(3) providing an electronic government affair sensitive text detection model based on sensitive field text classification and policy document identification.
2. The TFIDF and support vector machine based domain text classification model of claim 1, wherein: the weights of words in sensitive-field texts and non-sensitive-field texts are calculated with the TFIDF technique to construct two types of text vectors; and a support vector machine algorithm takes the two types of text vectors and their classification labels as input and output, and is trained iteratively to obtain a final converged sensitive-field text classification model.
3. The word vector and convolutional neural network based policy document identification model of claim 1, wherein: each keyword in the texts of policy documents and non-policy documents is expressed as a vector by a word vector algorithm, and a word vector matrix of the text is obtained according to the word sequence and used as the input of a convolutional neural network; convolution kernels of different sizes are applied to the input word vector matrices to obtain a plurality of feature maps; and a pooling function reduces the dimension of the features of each feature map, which are finally input to a softmax classifier layer for classification training to obtain a policy document identification model.
4. The sensitive text detection model based on religious domain text classification and policy document identification according to claim 1, characterized in that: since policy documents in sensitive fields of e-government (e.g., religion, military, politics, etc.) mostly contain sensitive content, in order to determine whether a text is a sensitive text, it is necessary to detect whether the text belongs to a sensitive field and whether the text is a policy document; the sensitive-field text classification model detects whether the text belongs to a sensitive field, and for a text belonging to a sensitive field the policy document identification model judges whether the text is a policy document.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010629592.9A CN111782811A (en) | 2020-07-03 | 2020-07-03 | E-government affair sensitive text detection method based on convolutional neural network and support vector machine |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010629592.9A CN111782811A (en) | 2020-07-03 | 2020-07-03 | E-government affair sensitive text detection method based on convolutional neural network and support vector machine |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111782811A true CN111782811A (en) | 2020-10-16 |
Family
ID=72759199
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010629592.9A Pending CN111782811A (en) | 2020-07-03 | 2020-07-03 | E-government affair sensitive text detection method based on convolutional neural network and support vector machine |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111782811A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112487149A (en) * | 2020-12-10 | 2021-03-12 | 浙江诺诺网络科技有限公司 | Text auditing method, model, equipment and storage medium |
CN113723737A (en) * | 2021-05-11 | 2021-11-30 | 天元大数据信用管理有限公司 | Enterprise portrait-based policy matching method, device, equipment and medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101106539A (en) * | 2007-08-03 | 2008-01-16 | 浙江大学 | Filtering method for spam based on supporting vector machine |
CN108520030A (en) * | 2018-03-27 | 2018-09-11 | 深圳中兴网信科技有限公司 | File classification method, Text Classification System and computer installation |
CN109543084A (en) * | 2018-11-09 | 2019-03-29 | 西安交通大学 | A method of establishing the detection model of the hidden sensitive text of network-oriented social media |
CN109657243A (en) * | 2018-12-17 | 2019-04-19 | 江苏满运软件科技有限公司 | Sensitive information recognition methods, system, equipment and storage medium |
WO2019105134A1 (en) * | 2017-11-30 | 2019-06-06 | 阿里巴巴集团控股有限公司 | Word vector processing method, apparatus and device |
CN110489749A (en) * | 2019-08-07 | 2019-11-22 | 北京航空航天大学 | Intelligent Office-Automation System Work Flow Optimizing |
CN110955776A (en) * | 2019-11-16 | 2020-04-03 | 中电科大数据研究院有限公司 | Construction method of government affair text classification model |
Non-Patent Citations (2)
Title |
---|
林学峰 等: "基于卷积神经网络的敏感文件检测方法", 《计算机与现代化》, no. 07, pages 28 - 32 * |
王思迪 等: "基于文本分类的政府网站信箱自动转递方法研究", 《数据分析与知识发现》, vol. 4, no. 06, pages 51 - 59 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109918505B (en) | Network security event visualization method based on text processing | |
CN110516074B (en) | Website theme classification method and device based on deep learning | |
CN106095928A (en) | A kind of event type recognition methods and device | |
CN110175221B (en) | Junk short message identification method by combining word vector with machine learning | |
CN111753087B (en) | Public opinion text classification method, apparatus, computer device and storage medium | |
CN111984791A (en) | Long text classification method based on attention mechanism | |
CN110348227A (en) | A kind of classification method and system of software vulnerability | |
CN112417153A (en) | Text classification method and device, terminal equipment and readable storage medium | |
CN116527357A (en) | Web attack detection method based on gate control converter | |
CN113515742A (en) | Internet of things malicious code detection method based on behavior semantic fusion extraction | |
CN111782811A (en) | E-government affair sensitive text detection method based on convolutional neural network and support vector machine | |
CN115718792A (en) | Sensitive information extraction method based on natural semantic processing and deep learning | |
CN115098690A (en) | Multi-data document classification method and system based on cluster analysis | |
Madasu et al. | Effectiveness of self normalizing neural networks for text classification | |
Jo | Inverted index based modified version of k-means algorithm for text clustering | |
CN114881172A (en) | Software vulnerability automatic classification method based on weighted word vector and neural network | |
CN111144453A (en) | Method and equipment for constructing multi-model fusion calculation model and method and equipment for identifying website data | |
CN103514168B (en) | Data processing method and device | |
CN107368610A (en) | Big text CRF and rule classification method and system based on full text | |
CN115795037B (en) | Multi-label text classification method based on label perception | |
CN110489759A (en) | Text feature weighting and short text similarity calculation method, system and medium based on word frequency | |
Zhu et al. | Chinese texts classification system | |
CN115203400A (en) | Method, device and medium for generating title abstract of commodity | |
Tao et al. | A multi-label text classification method based on labels vector fusion | |
Reshma et al. | Supervised methods for domain classification of tamil documents |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | | Application publication date: 20201016 |