CN111782811A - E-government affair sensitive text detection method based on convolutional neural network and support vector machine - Google Patents

Info

Publication number
CN111782811A
Authority
CN
China
Prior art keywords
text
sensitive
neural network
convolutional neural
policy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010629592.9A
Other languages
Chinese (zh)
Inventor
王婷
秦拯
张吉昕
胡玉鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202010629592.9A priority Critical patent/CN111782811A/en
Publication of CN111782811A publication Critical patent/CN111782811A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for detecting sensitive e-government text based on a convolutional neural network and a support vector machine. The invention mainly comprises: (1) an e-government sensitive text detection model based on a convolutional neural network and a support vector machine; (2) a sensitive-field text classification model based on TFIDF and a support vector machine; and (3) a policy document recognition model based on word vectors and a convolutional neural network.

Description

E-government affair sensitive text detection method based on convolutional neural network and support vector machine
Technical Field
The invention relates to the technical field of machine learning, in particular to an electronic government affair sensitive text detection method based on a convolutional neural network and a support vector machine.
Background
With the rapid development of internet and computer technologies, network information technology is applied ever more widely in social life. Because networks are open and shared, internet and computer technologies improve the efficiency of government work but also introduce content security problems and pose threats to the information security of government departments. Sensitive information and files of government departments may be leaked to the internet through e-government platforms. In particular, policy documents in sensitive fields such as religion, the military, and politics mostly contain sensitive information, and once that information is leaked and spread, it can cause great damage to national security. How to detect sensitive e-government information leaked onto the network accurately and quickly, while reducing the false alarm rate and the missed alarm rate, has therefore become a major challenge in protecting state secrets.
To protect sensitive text from leakage, it must first be determined whether the text content contains sensitive information. At present, most sensitive text detection relies on customized rules. However, as sensitive electronic documents grow in number and complexity, existing detection means can no longer meet the requirements of efficiency and convenience. Finding a more efficient detection solution, so that sensitive information leaked to internet portal websites is discovered promptly and comprehensively, is a problem that cannot be ignored. There are currently two main detection techniques. The first is detection based on keyword matching, whose core is sensitive word matching, generally implemented with a string matching algorithm. This approach ignores the relationship between deformed words and the original words, so its accuracy is low. The second, which emerged with the development of machine learning, uses text classification to detect sensitive text. Sensitive content detection based on traditional machine learning also has low accuracy, because few sensitive texts are available for training.
Therefore, to solve the above problems, the invention provides an e-government sensitive text detection method based on a convolutional neural network and a support vector machine that takes into account the characteristics of e-government sensitive text (for example, that it concerns policy guidelines in sensitive fields).
Disclosure of Invention
The invention provides an e-government sensitive text detection method based on a convolutional neural network and a support vector machine, which mainly comprises the following three parts:
1. a sensitive-field text classification model based on TFIDF and a support vector machine;
2. a policy document identification model based on word vectors and a convolutional neural network;
3. an e-government sensitive text detection model based on sensitive-field text classification and policy document identification.
The specific contents are as follows:
1. A sensitive-field text classification model based on TFIDF and a support vector machine
The model builds text vectors with the TFIDF weighting technique and is trained with a support vector machine algorithm. It is used to judge whether a text belongs to a sensitive field.
(1) The field text data set is converted to text vectors using the TFIDF weighting technique. Each text in the data set is represented by a vector; each dimension of the vector corresponds to a word, and its value is the TFIDF value of that word in the text. TFIDF evaluates how important a word is to one text within a corpus.
The weights are calculated with the TFIDF weighting technique as follows.
The formula consists of two parts, term frequency (TF) and inverse document frequency (IDF). Equations (1) and (2) are given here in their standard form, consistent with the symbol definitions below:
$tf_{ij} = \frac{n_{ij}}{N_j}$ (1)
$idf_{ij} = \log \frac{N}{N_i}$ (2)
$w_{ij} = tf_{ij} \cdot idf_{ij}$ (3)
where $n_{ij}$ is the number of times the $i$th feature word occurs in the $j$th text; $N_j$ is the total number of words in the $j$th text; $N_i$ is the number of texts containing the $i$th feature word; $N$ is the total number of texts; and $w_{ij}$ is the TFIDF value of the $i$th feature word in the $j$th text.
Because training TFIDF on a large-scale corpus yields a very large vocabulary, the invention limits the selection to 500 feature words for reasons of time and space efficiency, preferring words with high word frequency. Constructing the vectors gives $X = \{x_1, x_2, \ldots, x_i\}$, where $x_i$ denotes the vector corresponding to the $i$th text in the text training set $D$.
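The weighting and feature selection described above can be sketched in a few lines of Python. This is an illustrative sketch, not the patent's implementation: it assumes tf = n_ij / N_j and idf = log(N / N_i) with a natural logarithm and no smoothing, and the function name `tfidf_vectors` and the alphabetical tie-breaking rule are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(texts, max_features=500):
    """Build TF-IDF vectors for already-segmented texts.

    texts: list of word lists, one list per text.
    Returns (vocabulary, vectors) with vectors[j][i] = w_ij.
    """
    n_texts = len(texts)  # N
    # Overall word frequency, used to choose the feature words.
    total_counts = Counter(word for text in texts for word in text)
    # Keep the most frequent words; ties are broken alphabetically (assumption).
    ranked = sorted(total_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    vocabulary = [word for word, _ in ranked[:max_features]]
    # N_i: number of texts that contain feature word i.
    text_sets = [set(text) for text in texts]
    doc_freq = {w: sum(1 for s in text_sets if w in s) for w in vocabulary}

    vectors = []
    for text in texts:
        counts = Counter(text)  # n_ij for every word of this text
        length = len(text)      # N_j
        row = []
        for word in vocabulary:
            tf = counts[word] / length if length else 0.0
            idf = math.log(n_texts / doc_freq[word])
            row.append(tf * idf)
        vectors.append(row)
    return vocabulary, vectors
```

A word that occurs in every text gets an IDF of zero under this unsmoothed form, so it contributes nothing to the vector even if it is frequent.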
(2) The text data set is trained with a support vector machine algorithm to obtain the sensitive-field text classification model. The process is as follows:
Modeling: given a training sample $T = \{(v_1, y_1), (v_2, y_2), \ldots, (v_n, y_n)\}$, where $v_1, \ldots, v_n$ are $n$ text vectors and $y_1, \ldots, y_n$ are the field labels of the corresponding training texts. The label of a text belonging to the sensitive field is +1, and the label of a text not belonging to the sensitive field is -1. A hyperplane is sought that separates the training instances into the two classes. The hyperplane is $w \cdot x + b = 0$ and the classification decision model is $f(x) = \mathrm{sign}(w \cdot x + b)$, where sign is the sign function, $w$ is the weight vector of the model, and $b$ is the bias. To obtain the maximum-margin hyperplane that completely separates the sample points of the training set, the following constrained optimization problem is solved (here $x_i$ is the $i$th training vector $v_i$):
$\min_{w,b} \frac{1}{2} \lVert w \rVert^2$ (4)
$\text{s.t. } y_i (w \cdot x_i + b) - 1 \ge 0, \quad i = 1, 2, \ldots, n$ (5)
Solving (4) and (5) for the optimal $w$ and $b$ gives the final sensitive-field classification decision model $f(x)$.
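Why minimizing the norm of w yields the maximum-margin hyperplane can be seen from a short derivation. This is the standard hard-margin support vector machine argument, added here for illustration; the symbols d_i and \gamma are introduced for this sketch and do not appear in the original.

```latex
% Distance from a sample x_i to the hyperplane w \cdot x + b = 0:
d_i = \frac{\lvert w \cdot x_i + b \rvert}{\lVert w \rVert}

% Under the constraint y_i (w \cdot x_i + b) \ge 1, the samples closest to the
% hyperplane satisfy y_i (w \cdot x_i + b) = 1, so the gap between the classes is
\gamma = \frac{2}{\lVert w \rVert}

% Maximizing \gamma is therefore equivalent to
\min_{w,b} \; \frac{1}{2} \lVert w \rVert^2
\quad \text{s.t.} \quad y_i (w \cdot x_i + b) - 1 \ge 0, \; i = 1, 2, \ldots, n
```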
Detection: after modeling is completed, a text vector to be detected is input, and the output value is the classification label of the text: +1 represents the positive class and indicates that the text belongs to the sensitive field, and -1 represents the negative class and indicates that the text does not belong to the sensitive field.
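The modeling and detection steps can be illustrated with a small linear classifier. The sketch below is a stand-in, not the solver the patent uses: instead of solving the hard-margin problem exactly, it trains a soft-margin linear support vector machine by subgradient descent on the regularized hinge loss. The function names and the hyperparameters `lam`, `epochs`, and `lr` are hypothetical.

```python
def train_linear_svm(samples, labels, lam=0.01, epochs=200, lr=0.1):
    """Fit w and b by subgradient descent on the regularized hinge loss.

    samples: list of feature vectors; labels: +1 (sensitive) or -1 (not).
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularization term shrinks w at every step.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1:
                # The sample violates the margin: move the hyperplane toward it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def classify(w, b, x):
    """Decision model f(x) = sign(w . x + b), returning +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1
```

On linearly separable data this converges to a separating hyperplane close to the maximum-margin one; a production system would more likely rely on an established SVM library.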
2. A policy document recognition model based on word vectors and a convolutional neural network
The word sequence obtained after word segmentation is trained with the Word2vec technique to obtain the word vector of each word. The word vectors serve as the input data of a convolutional neural network, on which the policy document identification model is built. The model judges whether a text is a policy document and mainly consists of an input layer, a convolutional layer, a pooling layer, and a fully connected layer.
(1) The first layer is the input layer. The input is an n × m matrix, denoted A, where n is the number of words in a text word sequence; the invention uses padding so that all text word sequences have the same length. m is the dimension of the word vector of each word; the Word2vec technique is used to train the word vectors and map each word to an m-dimensional word vector.
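How a word sequence becomes the n × m input matrix A can be sketched as follows. The patent does not say how padding is done, so this sketch makes several assumptions: zero rows are appended at the end, longer texts are truncated to n words, unknown words map to zero vectors, and a plain dictionary `embeddings` stands in for a trained Word2vec lookup. The function name is hypothetical.

```python
def text_to_matrix(words, embeddings, n, m):
    """Map a word sequence to an n x m matrix, truncating or zero-padding to n rows."""
    zero = [0.0] * m
    # One row per word; words missing from the lookup become zero vectors.
    rows = [list(embeddings.get(word, zero)) for word in words[:n]]
    # Append zero rows until the matrix has exactly n rows.
    rows += [list(zero) for _ in range(n - len(rows))]
    return rows
```

Every text then yields a matrix of the same shape, which is what the convolutional layer requires.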
(2) The second layer is the convolutional layer. Convolution is performed on the matrix with convolution kernels of different sizes. The width of a kernel equals the word vector dimension m and its height is h, so a kernel is an h × m matrix t. The kernel slides down the matrix with stride 1, and a convolution is computed at each h × m window, producing a new feature value $c_i$. One kernel thus yields a feature map c with n - h + 1 features in total. The calculation formula is:
$c_i = f(t \cdot A[i : i+h-1] + b), \quad i = 1, 2, \ldots, n-h+1$ (6)
where b is the bias term and f is the activation function.
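The sliding-window computation in (6) can be written out directly. In this sketch the activation f defaults to ReLU, which is an assumption because the patent does not name the activation function; the function name is hypothetical, and the code uses 0-based window indices where the formula uses 1-based ones.

```python
def conv_feature_map(A, t, b=0.0, f=lambda z: max(0.0, z)):
    """Slide an h x m kernel t down an n x m matrix A with stride 1.

    Returns the feature map c, which has n - h + 1 values.
    """
    n, h, m = len(A), len(t), len(t[0])
    feature_map = []
    for i in range(n - h + 1):
        # Element-wise product of the kernel with rows i .. i+h-1 of A, summed.
        z = sum(t[r][c] * A[i + r][c] for r in range(h) for c in range(m))
        feature_map.append(f(z + b))
    return feature_map
```

A kernel of height h applied to a text of n words produces n - h + 1 features, so kernels of different heights give feature maps of different lengths.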
(3) The third layer is the pooling layer. Because kernels of different sizes produce feature maps of different sizes, the invention applies the pooling function 1-max-pooling to each feature map so that the resulting feature dimensions are consistent. 1-max-pooling takes the maximum of the values in a feature map.
(4) The fourth layer is the fully connected layer. It is used for classification: the features extracted by the convolutional and pooling layers are fed into a softmax function for classification training, yielding the policy document identification model.
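The pooling and classification layers can be sketched as below. This covers only the forward pass under stated assumptions: the weights and biases of the fully connected layer are passed in rather than learned, and all function names are hypothetical.

```python
import math


def one_max_pool(feature_maps):
    """Keep the single largest value of each feature map (1-max-pooling)."""
    return [max(feature_map) for feature_map in feature_maps]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pooled(features, weights, biases):
    """Fully connected layer followed by softmax; one weight row per class."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + bias
        for row, bias in zip(weights, biases)
    ]
    return softmax(scores)
```

Whatever the lengths of the individual feature maps, pooling yields exactly one value per kernel, so the classifier always receives a vector of fixed size.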
3. An e-government sensitive text detection model based on sensitive-field text classification and policy document identification
Because policy documents in sensitive e-government fields mostly involve sensitive content, determining whether a text is sensitive requires detecting both whether it belongs to a sensitive field and whether it is a policy document. The sensitive-field text classification model detects whether the text belongs to a sensitive field, and for texts that do, the policy document identification model judges whether the text is a policy document.
(1) Sensitive-field text classification. A text vector is first constructed for the text to be detected, and the sensitive-field classification model built with the support vector machine algorithm then computes the classification result. The input of the model is the text to be detected, and the output is its sensitive-field classification result, that is, whether the text belongs to a sensitive field.
(2) Policy document identification. Word vectors are first constructed with the word2vec technique, and the model built with the convolutional neural network then computes the policy document recognition result. The input of the model is the text to be detected, and the output is its policy document identification result, that is, whether the text is a policy document.
Finally, the models from the above steps are integrated to obtain the e-government sensitive text detection model based on sensitive-field text classification and policy document identification.
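The integration of the two models amounts to a two-stage cascade, sketched below. The two classifiers are passed in as plain callables; `detect_sensitive` and its parameter names are hypothetical and stand for the trained sensitive-field model and the trained policy document model.

```python
def detect_sensitive(text, in_sensitive_field, is_policy_document):
    """Flag a text as sensitive only if it passes both stages.

    in_sensitive_field: callable returning True for sensitive-field texts.
    is_policy_document: callable returning True for policy documents.
    """
    # Stage 1: texts outside the sensitive fields are never flagged.
    if not in_sensitive_field(text):
        return False
    # Stage 2: only sensitive-field texts reach the policy document model.
    return bool(is_policy_document(text))
```

Because the second stage runs only on texts that pass the first, the more expensive convolutional model is skipped for texts outside the sensitive fields.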
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention discloses an electronic government affair sensitive text detection method based on a convolutional neural network and a support vector machine. The method mainly comprises the following steps:
Step 1: Preprocess the text data. The data set prepared for the invention is first cleaned to remove useless parts of the text, and the text is then segmented with a Chinese word segmentation technique to obtain the word sequence of each text.
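A minimal preprocessing sketch follows. The patent does not define which parts of a text are useless, so the cleaning rules here (dropping HTML tags and collapsing whitespace) are assumptions, and the word segmenter is injected as a callable `segment` rather than tied to a specific Chinese word segmentation library. Both function names are hypothetical.

```python
import re


def clean_text(raw):
    """Remove HTML tags and redundant whitespace (assumed cleaning rules)."""
    text = re.sub(r"<[^>]+>", " ", raw)  # drop anything that looks like a tag
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    return text.strip()


def preprocess(raw, segment):
    """Clean a raw text and split it into a word sequence.

    segment: callable that takes a string and returns an iterable of words.
    """
    return [word for word in segment(clean_text(raw)) if word.strip()]
```

For Chinese text, `segment` would be the tokenizing function of a word segmentation tool; `str.split` is enough to try the sketch on space-delimited text.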
Step 2: Establish the sensitive-field text classification model. The weights of the words in each text are calculated with the TFIDF technique, and the text word sequence is converted into the corresponding text vector. A field classification model is then established with a support vector machine algorithm; the inputs are the text vectors of sensitive-field and non-sensitive-field texts together with their classification labels, and the model is trained until it gives good classification results.
Step 3: Establish the policy document identification model. First, word vectors are trained on the text with a Word2vec model so that each word is represented by a word vector, and each text is converted into the corresponding matrix as the input data of the convolutional neural network. Second, convolution is computed on the input matrix with convolution kernels of different sizes to obtain several feature maps. Then the pooling function 1-max-pooling is applied to each feature map, outputting its maximum feature value. Finally, the extracted features are fed into a softmax function for classification to obtain the policy document identification result of the text.
Step 4: Detection. The text to be detected is first converted into a text vector, and the sensitive-field text classification model established in step 2 detects whether the text belongs to a sensitive field. For a text that belongs to a sensitive field, the policy document identification model established in step 3 then judges whether the text is a policy document. Finally, because most policy documents in sensitive e-government fields are sensitive texts, a text identified as a policy document in a sensitive field is detected as sensitive text.

Claims (4)

1. A method for detecting E-government affair sensitive texts based on a convolutional neural network and a support vector machine is characterized by comprising the following steps:
(1) providing a sensitive field text classification model based on a TFIDF and a support vector machine;
(2) providing a policy document identification model based on word vectors and a convolutional neural network;
(3) providing an e-government sensitive text detection model based on sensitive-field text classification and policy document identification.
2. The TFIDF and support vector machine based domain text classification model of claim 1, wherein: the weights of words in sensitive-field texts and non-sensitive-field texts are calculated with the TFIDF technique to construct two types of text vectors; and a support vector machine algorithm takes the two types of text vectors and their classification labels as input and output and is trained iteratively to obtain a final, converged sensitive-field text classification model.
3. The word vector and convolutional neural network based policy document identification model of claim 1, wherein: each keyword in the policy document texts and the non-policy document texts is represented as a vector by a word vector algorithm, and the word vector matrix of a text is obtained according to the word sequence and used as the input of a convolutional neural network; convolution is computed on the input word vector matrices with convolution kernels of different sizes to obtain several feature maps; and a pooling function reduces the dimension of the features of each feature map, which are finally fed into a softmax classifier layer for classification training to obtain the policy document identification model.
4. The sensitive text detection model based on religious domain text classification and policy document identification according to claim 1, characterized in that: since policy documents in sensitive e-government fields (e.g., religion, the military, politics) mostly contain sensitive content, determining whether a text is sensitive requires detecting both whether the text belongs to a sensitive field and whether it is a policy document; the sensitive-field text classification model detects whether the text belongs to a sensitive field, and for a text that does, the policy document identification model judges whether the text is a policy document.
CN202010629592.9A 2020-07-03 2020-07-03 E-government affair sensitive text detection method based on convolutional neural network and support vector machine Pending CN111782811A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010629592.9A CN111782811A (en) 2020-07-03 2020-07-03 E-government affair sensitive text detection method based on convolutional neural network and support vector machine

Publications (1)

Publication Number Publication Date
CN111782811A true CN111782811A (en) 2020-10-16

Family

ID=72759199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010629592.9A Pending CN111782811A (en) 2020-07-03 2020-07-03 E-government affair sensitive text detection method based on convolutional neural network and support vector machine

Country Status (1)

Country Link
CN (1) CN111782811A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487149A (en) * 2020-12-10 2021-03-12 浙江诺诺网络科技有限公司 Text auditing method, model, equipment and storage medium
CN113723737A (en) * 2021-05-11 2021-11-30 天元大数据信用管理有限公司 Enterprise portrait-based policy matching method, device, equipment and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101106539A (en) * 2007-08-03 2008-01-16 浙江大学 Filtering method for spam based on supporting vector machine
CN108520030A (en) * 2018-03-27 2018-09-11 深圳中兴网信科技有限公司 File classification method, Text Classification System and computer installation
CN109543084A (en) * 2018-11-09 2019-03-29 西安交通大学 A method of establishing the detection model of the hidden sensitive text of network-oriented social media
CN109657243A (en) * 2018-12-17 2019-04-19 江苏满运软件科技有限公司 Sensitive information recognition methods, system, equipment and storage medium
WO2019105134A1 (en) * 2017-11-30 2019-06-06 阿里巴巴集团控股有限公司 Word vector processing method, apparatus and device
CN110489749A (en) * 2019-08-07 2019-11-22 北京航空航天大学 Intelligent Office-Automation System Work Flow Optimizing
CN110955776A (en) * 2019-11-16 2020-04-03 中电科大数据研究院有限公司 Construction method of government affair text classification model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
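Lin Xuefeng (林学峰) et al.: "Sensitive file detection method based on convolutional neural network" (基于卷积神经网络的敏感文件检测方法), Computer and Modernization (《计算机与现代化》), no. 07, pages 28 - 32 *
Wang Sidi (王思迪) et al.: "Research on an automatic forwarding method for government website mailboxes based on text classification" (基于文本分类的政府网站信箱自动转递方法研究), Data Analysis and Knowledge Discovery (《数据分析与知识发现》), vol. 4, no. 06, pages 51 - 59 *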

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201016