CN111782811A

CN111782811A - An E-government Sensitive Text Detection Method Based on Convolutional Neural Network and Support Vector Machine

Info

Publication number: CN111782811A
Application number: CN202010629592.9A
Authority: CN
Inventors: 王婷; 秦拯; 张吉昕; 胡玉鹏
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2020-07-03
Filing date: 2020-07-03
Publication date: 2020-10-16

Abstract

The invention relates to a method for detecting sensitive e-government text based on a convolutional neural network and a support vector machine. The invention mainly comprises: (1) an e-government sensitive text detection model based on a convolutional neural network and a support vector machine; (2) a sensitive-field text classification model based on TFIDF and a support vector machine; and (3) a policy document recognition model based on word vectors and a convolutional neural network.

Description

An E-government Sensitive Text Detection Method Based on a Convolutional Neural Network and a Support Vector Machine

Technical Field

The invention relates to the technical field of machine learning, and in particular to a method for detecting sensitive e-government text based on a convolutional neural network and a support vector machine.

Background

With the rapid development of the Internet and computer technology, network information technology is being applied ever more widely in social life. Because networks are open and shared, the Internet and computer technology improve the efficiency of government work but also introduce content security problems that threaten the information security of government departments. Sensitive information and documents of government departments may be leaked onto the Internet through e-government platforms. In particular, policy documents in sensitive fields such as religion, the military, and politics mostly contain sensitive information, and once leaked and spread they can cause great harm to national security. How to detect leaked e-government sensitive information accurately and quickly, reduce the false positive and false negative rates, and protect state secrets has therefore become a major challenge.

To protect sensitive text from being leaked, it is first necessary to determine whether the text content contains sensitive information. At present, a large share of sensitive text detection relies on manually defined rules. However, as sensitive electronic text documents grow in number and complexity, existing detection methods struggle to be efficient and convenient. Discovering sensitive information leaked to Internet portals promptly and comprehensively requires a more efficient detection technique. There are currently two main detection techniques. The first is detection based on keyword matching, whose core is sensitive word matching, generally implemented with a string matching algorithm. This approach ignores the association between a deformed word and its original word, so its accuracy is low. The second, which emerged with the development of machine learning, treats sensitive text detection as a text classification task. Detection methods based on traditional machine learning also have low accuracy, because few sensitive texts are available for training.

Therefore, to solve the above problems, the invention takes into account the characteristics of e-government sensitive texts (which involve content such as policies and guidelines in sensitive fields) and proposes an e-government sensitive text detection method based on a convolutional neural network and a support vector machine.

Summary of the Invention

The invention proposes an e-government sensitive text detection method based on a convolutional neural network and a support vector machine, which comprises three main parts:

1. A sensitive-field text classification model based on TFIDF and a support vector machine;

2. A policy document recognition model based on word vectors and a convolutional neural network;

3. An e-government sensitive text detection model based on sensitive-field text classification and policy document recognition.

The details are as follows:

1. A sensitive-field text classification model based on TFIDF and a support vector machine

The invention uses the TFIDF weighting technique to construct text vectors and the support vector machine algorithm to build, through repeated training, a sensitive-field text classification model. The model is used to judge whether a text belongs to a sensitive field.

(1) The TFIDF weighting technique is used to convert the domain text dataset into text vectors. Each text in the dataset is represented by a vector that expresses its semantics; each dimension of the vector corresponds to a word, and its value is the TFIDF value of that word in the text. TFIDF measures how important a word is to one text within a document collection.

The weights are calculated with the TFIDF weighting technique as follows.

The formula consists of two parts, term frequency (TF) and inverse document frequency (IDF):

$w_{ij} = tf_{ij} \cdot idf_{ij}$ (3)

where $n_{ij}$ is the number of occurrences of the $i$-th feature word in the $j$-th text; $N_j$ is the total number of words in the $j$-th text; $N_i$ is the number of texts containing the $i$-th feature word; $N$ is the total number of texts; and $w_{ij}$ is the TFIDF value of the $i$-th feature word in the $j$-th text.
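Equations (1) and (2) are not reproduced in this text. Assuming the standard term frequency and inverse document frequency definitions, which match the symbols defined above, they would take the following form. The logarithm base and the absence of any smoothing term are assumptions.

```latex
\begin{align}
tf_{ij}  &= \frac{n_{ij}}{N_j} \tag{1} \\
idf_{ij} &= \log \frac{N}{N_i} \tag{2}
\end{align}
```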

Because training TFIDF on a large-scale corpus yields a very large number of words, the invention limits the selection to 500 feature words for reasons of time and space efficiency, giving priority to words with high word frequency. Constructing the vectors yields $X = \{x_1, x_2, \ldots, x_i\}$, where $x_i$ is the vector corresponding to the $i$-th text in the text training set $D$.
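A minimal Python sketch of this vectorization step, using only the standard library. The function name `build_tfidf_vectors`, the natural logarithm, and the unsmoothed IDF are assumptions of the sketch; the text specifies only the 500-word limit and the preference for high-frequency words.

```python
import math
from collections import Counter

def build_tfidf_vectors(docs, max_features=500):
    """docs: list of token lists (already segmented). Returns (vocab, vectors)."""
    # Corpus-wide counts, used to keep the most frequent words as features.
    corpus_counts = Counter(tok for doc in docs for tok in doc)
    vocab = [w for w, _ in corpus_counts.most_common(max_features)]

    # N_i: number of texts that contain feature word i.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    n_docs = len(docs)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty texts
        # One dimension per feature word: tf * idf (assumed unsmoothed).
        vec = [
            (counts[w] / total) * math.log(n_docs / doc_freq[w])
            for w in vocab
        ]
        vectors.append(vec)
    return vocab, vectors
```

Each text becomes a vector of length `len(vocab)`, at most `max_features`. A word that occurs in every text receives weight 0 under this unsmoothed IDF.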

(2) The support vector machine algorithm is trained on the text dataset to obtain the sensitive-field text classification model. The process is as follows:

Modeling: Given training samples $T = \{(v_1, y_1), (v_2, y_2), \ldots, (v_n, y_n)\}$, where $v_1$ to $v_n$ are $n$ text vectors and $y_1$ to $y_n$ are the sensitive-field labels of the training texts: a text belonging to a sensitive field has label 1, and a text belonging to a non-sensitive field has label -1. The goal is to find a hyperplane that separates the training instances into the two classes. The hyperplane is $w \cdot x + b = 0$ and the classification decision model is $f(x) = \mathrm{sign}(w \cdot x + b)$, where $\mathrm{sign}$ is the sign function, $w$ is the weight vector of the model, $b$ is the bias, and $x$ is a text vector (the $v_i$ of the training samples). To obtain the maximum-margin hyperplane that completely separates the sample points in the training set, the following constrained optimization problem is solved:

s.t. $y_i (w \cdot x_i + b) - 1 \ge 0, \quad i = 1, 2, \ldots, n$ (5)

Solving for the optimal $w$ and $b$ gives the final sensitive-field classification decision model $f(x)$.
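The objective, equation (4), is not reproduced in this text. Assuming the standard hard-margin support vector machine formulation to which constraint (5) belongs, the complete problem would read:

```latex
\begin{align}
\min_{w,\,b} \quad & \frac{1}{2} \lVert w \rVert^{2} \tag{4} \\
\text{s.t.} \quad & y_i \,(w \cdot x_i + b) - 1 \ge 0, \quad i = 1, 2, \ldots, n \tag{5}
\end{align}
```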

Detection: Once modeling is complete, the text vector to be checked is fed to the model, and the output value is the classification label of the text. A value of +1 is the positive class and means the text belongs to a sensitive field; -1 is the negative class and means it does not.
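The modeling and detection steps can be sketched in plain Python. This sketch does not solve the hard-margin problem exactly; it swaps in a soft-margin linear SVM trained by sub-gradient descent on the hinge loss, which is simpler to write without a QP solver. The function names and the hyperparameters `lam`, `epochs`, and `lr` are assumptions of the sketch.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """X: list of feature vectors, y: labels in {+1, -1}. Returns (w, b)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The regularization term shrinks w slightly on every step.
            w = [wj - lr * lam * wj for wj in w]
            if margin < 1:
                # Sample violates the margin: move the hyperplane toward
                # classifying it correctly.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    """Decision model f(x) = sign(w . x + b); returns +1 or -1."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1
```

A text vector is labeled with `predict(w, b, x)`: +1 for a sensitive-field text and -1 otherwise.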

2. A policy document recognition model based on word vectors and a convolutional neural network

The invention uses Word2vec to train word vectors on the word sequences obtained from text segmentation, giving a word vector for each word. These vectors serve as the input to a convolutional neural network, on which the policy document recognition model is built. The model judges whether a text is a policy document and mainly comprises an input layer, a convolutional layer, a pooling layer, and a fully connected layer.

(1) The first layer is the input layer. The input is an $n \times m$ matrix, denoted $A$. Here $n$ is the number of words in a text word sequence; the invention uses padding to give all text word sequences the same length. $m$ is the dimension of the word vector of each word; the invention trains word vectors with Word2vec, mapping each word to an $m$-dimensional vector.
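A minimal sketch of building the input matrix, under stated assumptions. The `embeddings` dictionary stands in for trained Word2vec vectors, and zero-vector padding, truncation to `n` words, and the zero vector for unknown words are choices of this sketch that the text does not specify.

```python
def build_input_matrix(tokens, embeddings, n, m):
    """Return an n x m matrix: one m-dimensional word vector per row."""
    zero = [0.0] * m
    # Unknown words fall back to the zero vector (an assumption of this sketch).
    rows = [list(embeddings.get(tok, zero)) for tok in tokens[:n]]
    # Pad short sequences with zero rows so every text has exactly n rows.
    rows.extend(list(zero) for _ in range(n - len(rows)))
    return rows
```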

(2) The second layer is the convolutional layer. Convolution kernels of different sizes are applied to the matrix. The width of a kernel equals the word vector dimension $m$ and its height is $h$, so a kernel is an $h \times m$ matrix $t$. The kernel slides downward with stride 1 and performs a convolution at each $h \times m$ window, producing a new feature value $c_i$. Each kernel thus yields a feature map $c$ with $n-h+1$ features in total, computed as follows:

$c_i = f(t \cdot A[i:i+h-1] + b), \quad i = 1, 2, \ldots, n-h+1$ (6)

where $b$ is the bias term and $f$ is the activation function.
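The sliding-window computation of equation (6) can be sketched in plain Python, yielding the $n-h+1$ features stated above. The text does not name the activation function; ReLU is used here as an assumption, and the function names are hypothetical.

```python
def relu(x):
    return x if x > 0 else 0.0

def conv_feature_map(A, t, b=0.0, f=relu):
    """Slide an h x m kernel t down the n x m matrix A with stride 1."""
    n, h = len(A), len(t)
    feature_map = []
    for i in range(n - h + 1):
        # Element-wise product of the kernel with rows i .. i+h-1, summed.
        s = sum(
            t[r][c] * A[i + r][c]
            for r in range(h)
            for c in range(len(t[r]))
        )
        feature_map.append(f(s + b))
    return feature_map
```

Running this once per kernel gives one feature map per kernel. Kernels with different heights `h` produce maps of different lengths.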

(3) The third layer is the pooling layer. Because kernels of different sizes produce feature maps of different sizes, the invention applies the 1-max-pooling function to each feature map so that their dimensions become the same. 1-max-pooling takes the single largest value among the values of a feature map.
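A short sketch of 1-max-pooling in Python. The function name `one_max_pool` is hypothetical; the point is that feature maps of any length each reduce to one value, so the pooled vector has one entry per kernel.

```python
def one_max_pool(feature_maps):
    """Keep the single largest value of each feature map."""
    return [max(fm) for fm in feature_maps]
```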

(4) The fourth layer is the fully connected layer, which performs classification. The features extracted by the convolutional and pooling layers are fed into the softmax function for classification training, giving the policy document recognition model.
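The classification step can be sketched as a forward pass in Python. Only inference is shown. The weights `W` and `bias` are assumed to come from training, which the sketch omits, and the two-class layout is an assumption.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, W, bias):
    """features: pooled vector; W: one weight row per class; bias: one per class."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(W, bias)
    ]
    probs = softmax(logits)
    # Return the most probable class index together with all probabilities.
    return probs.index(max(probs)), probs
```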

3. An e-government sensitive text detection model based on sensitive-field text classification and policy document recognition

Because policy documents in sensitive e-government fields mostly involve sensitive content, judging whether a text is sensitive requires checking both whether it belongs to a sensitive field and whether it is a policy document. The sensitive-field text classification model first checks whether the text belongs to a sensitive field. The policy document recognition model is then applied to texts that do, to judge whether they are policy documents.

(1) Sensitive-field text classification. The text vector of the text to be checked is constructed first, and the sensitive-field classification model built with the support vector machine algorithm then computes the classification result. The input of the model is the text to be checked, and the output is its sensitive-field classification result, which indicates whether the text belongs to a sensitive field.

(2) Policy document recognition. Word vectors are first constructed with Word2vec, and the model built with the convolutional neural network then computes the policy document recognition result. The input of the model is the text to be checked, and the output is its recognition result, which indicates whether the text is a policy document.

Finally, the models from the above steps are combined into an e-government sensitive text detection model based on sensitive-field text classification and policy document recognition.
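The two-stage combination can be sketched as follows. The two classifiers are passed in as callables, and the names `detect_sensitive`, `is_sensitive_field`, and `is_policy_document` are hypothetical.

```python
def detect_sensitive(text, is_sensitive_field, is_policy_document):
    """Two-stage check; both classifiers are passed in as callables."""
    # Stage 1: texts outside the sensitive fields are not examined further.
    if not is_sensitive_field(text):
        return False
    # Stage 2: only sensitive-field texts reach the policy document model.
    return bool(is_policy_document(text))
```

With this ordering, the convolutional model runs only on texts that the support vector machine has already placed in a sensitive field.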

Description of the Drawings

Fig. 1 is the workflow diagram of the invention.

Detailed Description

The invention is an e-government sensitive text detection method based on a convolutional neural network and a support vector machine. It mainly comprises the following steps:

Step 1: Preprocess the text data. The dataset prepared for the invention is first cleaned to remove the useless parts of each text. Chinese word segmentation is then applied to obtain the segmented word sequence of each text.
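A minimal sketch of this preprocessing step. The text does not say which parts count as useless or which segmentation tool is used. The sketch assumes markup tags and extra whitespace are removed, and takes the segmenter as a parameter, `segment`, a hypothetical callable that maps a string to a list of words.

```python
import re

def preprocess(raw, segment):
    """Clean raw text and return its word sequence.

    `segment` is any callable mapping a string to a list of words; it stands
    in for the Chinese word segmentation tool, which the text does not name.
    """
    text = re.sub(r"<[^>]+>", " ", raw)       # drop HTML-like tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return [w for w in segment(text) if w.strip()]
```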

Step 2: Build the sensitive-field text classification model. TFIDF is used to compute the weights of the words in each text and convert the word sequence into the corresponding text vector. The support vector machine algorithm is then used to build the field classification model. Its inputs are the text vectors of sensitive-field and non-sensitive-field texts together with their classification labels, and repeated training brings the model to good classification results.

Step 3: Build the policy document recognition model. First, the Word2vec model is used to train word vectors on the texts. Each word is represented by a word vector, and each text is converted into the corresponding matrix, which serves as the input to the convolutional neural network. Next, convolution kernels of different sizes are applied to the input matrix to obtain multiple feature maps. The 1-max-pooling function then extracts the features of each feature map by outputting its maximum value. Finally, the extracted features are fed into the softmax function for classification, giving the policy document recognition result for the text.

Step 4: Detection. First, the text to be checked is converted into a text vector, and the sensitive-field text classification model built in step 2 checks whether the text belongs to a sensitive field. Then, for a text that belongs to a sensitive field, the policy document recognition model built in step 3 judges whether it is a policy document. The policy documents in sensitive e-government fields detected in this way are, for the most part, sensitive texts.

Claims

1. A method for detecting E-government affair sensitive texts based on a convolutional neural network and a support vector machine is characterized by comprising the following steps:

(1) providing a sensitive field text classification model based on a TFIDF and a support vector machine;

(2) providing a policy document identification model based on word vectors and a convolutional neural network;

(3) providing an e-government sensitive text detection model based on sensitive field text classification and policy document identification.

2. The sensitive field text classification model based on TFIDF and a support vector machine according to claim 1, characterized in that: the weights of words in sensitive field texts and non-sensitive field texts are calculated using the TFIDF technique to construct two classes of text vectors; and a support vector machine algorithm, taking the two classes of text vectors and their classification labels as input and output, is trained iteratively to obtain the final converged sensitive field text classification model.

3. The policy document identification model based on word vectors and a convolutional neural network according to claim 1, characterized in that: each keyword in the policy document texts and non-policy document texts is represented as a vector using a word vector algorithm, and the word vector matrix of each text is obtained according to its word sequence and used as the input of the convolutional neural network; convolution kernels of different sizes are applied to the input word vector matrices to obtain a plurality of feature maps; and a pooling function reduces the dimension of the features of each feature map, which are finally fed into a softmax classifier layer for classification training to obtain the policy document identification model.

4. The sensitive text detection model based on religious domain text classification and policy document identification according to claim 1, characterized in that: since policy documents in sensitive e-government fields (e.g., religion, the military, and politics) mostly contain sensitive content, determining whether a text is a sensitive text requires detecting whether the text belongs to a sensitive field and whether it is a policy document; the sensitive field text classification model is used to detect whether the text belongs to a sensitive field, and for a text belonging to a sensitive field, the policy document identification model is used to judge whether the text is a policy document.