CN111782811A - An E-government Sensitive Text Detection Method Based on Convolutional Neural Network and Support Vector Machine

Info

Publication number
CN111782811A
Authority
CN
China
Prior art keywords
text
sensitive
neural network
convolutional neural
policy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010629592.9A
Other languages
Chinese (zh)
Inventor
王婷
秦拯
张吉昕
胡玉鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN202010629592.9A
Publication of CN111782811A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for detecting sensitive e-government text based on a convolutional neural network and a support vector machine. The invention mainly comprises: (1) an e-government sensitive text detection model based on a convolutional neural network and a support vector machine; (2) a sensitive field text classification model based on TFIDF and a support vector machine; and (3) a policy document recognition model based on word vectors and a convolutional neural network.

Description

An E-government Sensitive Text Detection Method Based on Convolutional Neural Network and Support Vector Machine

Technical Field

The invention relates to the technical field of machine learning, and in particular to a method for detecting sensitive e-government text based on a convolutional neural network and a support vector machine.

Background

With the rapid development of the Internet and computer technology, network information technology is applied ever more widely in social life. Because networks are open and shared, the Internet and computer technology improve the efficiency of government work but also introduce content security problems that threaten the information security of government departments. Sensitive government information and documents may be leaked to the Internet through e-government platforms. In particular, policy documents in sensitive fields such as religion, the military, and politics mostly contain sensitive information, and once leaked and spread they can cause great damage to national security. How to detect leaked e-government sensitive information accurately and quickly, reduce the false positive and false negative rates, and protect state secrets has therefore become a major challenge.

To protect sensitive text from being leaked, it is first necessary to determine whether the text contains sensitive information. At present, a large share of sensitive text detection relies on manually crafted rules. However, as sensitive electronic documents grow in number and complexity, existing detection methods struggle to be efficient and convenient. To discover sensitive information leaked to Internet portal websites promptly and comprehensively, a more efficient detection solution is needed. There are currently two main detection techniques. The first is keyword matching, whose core is sensitive word matching, generally implemented with a string matching algorithm. Keyword matching ignores the relationship between a deformed word and its original form, so its accuracy is low. The second, which emerged with the development of machine learning, treats sensitive text detection as a text classification task. Sensitive content detection based on traditional machine learning also has low accuracy, because few sensitive texts are available for training.

Therefore, to solve the above problems, the present invention takes into account the characteristics of sensitive e-government text (which involves content such as policies and guidelines in sensitive fields) and proposes an e-government sensitive text detection method based on a convolutional neural network and a support vector machine.

Summary of the Invention

The present invention proposes an e-government sensitive text detection method based on a convolutional neural network and a support vector machine, which mainly comprises three parts:

1. A sensitive field text classification model based on TFIDF and a support vector machine;

2. A policy document recognition model based on word vectors and a convolutional neural network;

3. An e-government sensitive text detection model based on sensitive field text classification and policy document recognition.

The details are as follows:

1. A sensitive field text classification model based on TFIDF and a support vector machine

The present invention mainly uses the TFIDF weighting technique to construct text vectors and uses the support vector machine algorithm to build a sensitive field text classification model through repeated training. The model is used to judge whether a text belongs to a sensitive field.

(1) The TFIDF weighting technique is used to convert the domain text dataset into text vectors. Each text in the dataset is represented by a vector; each dimension of the vector corresponds to a word, and its value is the TFIDF value of that word in the text. TFIDF evaluates how important a word is to one text in a document collection.

The weights are calculated with the TFIDF weighting technique as follows.

The calculation consists of two parts, term frequency (TF) and inverse document frequency (IDF). The formulas are as follows:

Figure BDA0002567981350000021

Figure BDA0002567981350000022

$w_{ij} = tf_{ij} \cdot idf_{ij}$ (3)

where $n_{ij}$ is the number of times the i-th feature word occurs in the j-th text; $N_j$ is the total number of words in the j-th text; $N_i$ is the number of texts containing the i-th feature word; $N$ is the total number of texts; and $w_{ij}$ is the TFIDF value of the i-th feature word in the j-th text.
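The weighting described above can be sketched in a few lines of Python. This is a minimal sketch rather than the patented implementation: because the formula images for the TF and IDF parts are not reproduced in this text, it assumes the common forms tf = n_ij / N_j and idf = log(N / N_i), which are consistent with the symbol definitions above. The function name `tfidf_vectors` and the toy corpus are illustrative assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute TFIDF vectors for a list of tokenized texts.

    Assumed forms (not confirmed by the source): tf = n_ij / N_j and
    idf = log(N / N_i), combined as w_ij = tf * idf as in formula (3).
    """
    n_docs = len(docs)                                   # N
    vocab = sorted({w for d in docs for w in d})
    # N_i: number of texts that contain word i (each text counted once).
    doc_freq = Counter(w for d in docs for w in set(d))
    vectors = []
    for d in docs:
        counts = Counter(d)                              # n_ij
        total = len(d)                                   # N_j
        vec = []
        for w in vocab:
            tf = counts[w] / total if total else 0.0
            idf = math.log(n_docs / doc_freq[w])
            vec.append(tf * idf)
        vectors.append(vec)
    return vocab, vectors


# Hypothetical, already segmented toy corpus.
docs = [["policy", "religion", "policy"], ["sports", "news"], ["policy", "news"]]
vocab, vectors = tfidf_vectors(docs)
```

Each row of `vectors` has one entry per vocabulary word, so a word that does not occur in a text simply gets weight 0.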

Because training TFIDF on a large corpus yields a very large vocabulary, the present invention, for reasons of time and space efficiency, limits the selection to 500 feature words, preferring words with high frequency. Constructing the vectors then gives $X = \{x_1, x_2, \ldots, x_i\}$, where $x_i$ is the vector corresponding to the i-th text in the text training set D.
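As an illustration of the feature limit, the sketch below keeps only the most frequent words. It assumes that "high frequency" means the total count of a word across the whole corpus, which the text does not state explicitly; the function name `select_features` and the alphabetical tie-break are illustrative choices.

```python
from collections import Counter


def select_features(docs, k=500):
    """Return the k most frequent words across all tokenized texts."""
    freq = Counter(w for d in docs for w in d)
    # Sort by descending frequency, then alphabetically so ties are stable.
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return [w for w, _ in ranked[:k]]


# Hypothetical toy corpus; k is reduced from 500 to 2 for readability.
docs = [["policy", "religion", "policy"], ["sports", "news"], ["policy", "news"]]
features = select_features(docs, k=2)
```

With a fixed feature list, every text vector has the same length k regardless of how many distinct words the corpus contains.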

(2) The support vector machine algorithm is trained on the text dataset to obtain the sensitive field text classification model. The process is as follows:

Modeling: Given the training samples $T = \{(v_1, y_1), (v_2, y_2), \ldots, (v_n, y_n)\}$, where $v_1$ to $v_n$ are n text vectors and $y_1$ to $y_n$ are the sensitive field labels of the training texts. Texts in a sensitive field are labeled 1 and texts in a non-sensitive field are labeled -1. A hyperplane is sought that separates the training instances into the two classes. The hyperplane is wx+b=0 and the classification decision model is f(x)=sign(wx+b), where sign is the sign function, w is the weight vector of the model, and b is the bias. To obtain the maximum-margin hyperplane that completely separates the sample points in the training set, the following constrained optimization problem must be solved:

Figure BDA0002567981350000023

$\text{s.t.}\ y_i (w \cdot x_i + b) - 1 \ge 0,\quad i = 1, 2, \ldots, n$ (5)

Solving for the optimal w and b finally gives the sensitive field classification decision model f(x).
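The training step can be illustrated with a small linear classifier. Note the substitution: instead of solving the hard-margin problem under constraint (5) exactly, this sketch minimizes the regularized hinge loss by subgradient descent, a common soft-margin approximation. The hyperparameters `lam`, `lr`, and `epochs`, the function names, and the two-dimensional toy data are all assumptions made for illustration.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit w and b by subgradient descent on the regularized hinge loss.

    This is a soft-margin stand-in, not an exact hard-margin solver.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularization step: shrink w toward zero.
            w = [wi - lr * lam * wi for wi in w]
            if margin < 1:
                # The sample violates the margin, so move the hyperplane.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Decision model f(x) = sign(w.x + b), returning +1 or -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable toy vectors: +1 sensitive, -1 non-sensitive.
samples = [(2.0, 2.0), (3.0, 1.0), (-2.0, -1.0), (-1.0, -3.0)]
labels = [1, 1, -1, -1]
w, b = train_linear_svm(samples, labels)
predictions = [predict(w, b, x) for x in samples]
```

In practice the inputs would be the 500-dimensional TFIDF vectors rather than two-dimensional points, and an established SVM library would normally replace this hand-written loop.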

Detection: After modeling is complete, the vector of the text to be detected is input, and the output is the classification label of the text. +1 is the positive class and means the text belongs to a sensitive field; -1 is the negative class and means the text does not belong to a sensitive field.

2. A policy document recognition model based on word vectors and a convolutional neural network

The present invention uses Word2vec to train word vectors on the word sequences obtained by segmenting the texts, giving a word vector for each word. These vectors serve as the input to the convolutional neural network, on which a policy document recognition model is built. The model judges whether a text is a policy document and mainly comprises an input layer, a convolutional layer, a pooling layer, and a fully connected layer.

(1) The first layer is the input layer. The input is an n*m matrix denoted A, where n is the number of words in a text word sequence; the present invention uses padding to give all text word sequences the same length. m is the dimension of the word vector of each word; the present invention trains word vectors with Word2vec, mapping each word to an m-dimensional word vector.
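A possible way to build the n*m input matrix is sketched below. The text only says that padding keeps all sequences the same length, so the details here are assumptions: shorter sequences are padded with zero vectors, longer ones are truncated to n words, and words missing from the embedding table map to zero vectors. The dictionary `embeddings` stands in for trained Word2vec vectors.

```python
def build_input_matrix(tokens, embeddings, n, m):
    """Map a token list to an n x m matrix by truncating or zero-padding."""
    zero = [0.0] * m
    # Look up each of the first n tokens; unknown words become zero vectors.
    rows = [list(embeddings.get(tok, zero)) for tok in tokens[:n]]
    # Pad with zero rows until the matrix has exactly n rows.
    rows += [list(zero) for _ in range(n - len(rows))]
    return rows


# Hypothetical 3-dimensional toy word vectors.
embeddings = {"policy": [0.1, 0.2, 0.3], "document": [0.4, 0.5, 0.6]}
A = build_input_matrix(["policy", "document"], embeddings, n=4, m=3)
```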

(2) The second layer is the convolutional layer. Convolution is applied to the matrix with kernels of different sizes. The width of a kernel equals the word vector dimension m and its height is h, so a kernel is an h*m matrix t. The kernel slides downward with stride 1 and performs a convolution on each h*m window, producing a new feature value $c_i$. One kernel thus yields a feature map c containing n-h+1 features in total, computed as follows:

$c_i = f(t \cdot A[i:i+h-1] + b),\quad i = 1, 2, \ldots, n-h+1$ (6)

where b is the bias term and f is the activation function.
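The sliding-window computation in formula (6) can be sketched as follows. The source does not name the activation function f, so ReLU is assumed here; the function names and the small matrices are illustrative. The code uses 0-based indices, so window i covers rows i to i+h-1 of A.

```python
def relu(x):
    """Assumed activation function; the source does not specify f."""
    return x if x > 0 else 0.0


def conv_feature_map(A, t, b=0.0, f=relu):
    """Slide an h x m kernel t down the n x m matrix A with stride 1."""
    n, h, m = len(A), len(t), len(t[0])
    feature_map = []
    for i in range(n - h + 1):
        window = A[i:i + h]
        # Element-wise product of kernel and window, summed to one value.
        s = sum(t[r][c] * window[r][c] for r in range(h) for c in range(m))
        feature_map.append(f(s + b))
    return feature_map


# Hypothetical 4 x 2 input matrix and 2 x 2 kernel.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
t = [[1.0, 1.0], [1.0, -1.0]]
c = conv_feature_map(A, t)
```

With n = 4 and h = 2 the feature map has n-h+1 = 3 entries, one per window position.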

(3) The third layer is the pooling layer. Because kernels of different sizes produce feature maps of different sizes, the present invention applies the 1-max-pooling function to each feature map so that the extracted features have the same dimension. 1-max-pooling takes the single maximum value from a set of values.
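A minimal sketch of 1-max-pooling is shown below; the function name and the example feature maps are illustrative. Each feature map, whatever its length, contributes exactly one value, so the pooled vector has one entry per kernel.

```python
def one_max_pool(feature_maps):
    """Keep only the largest value of each feature map."""
    return [max(fm) for fm in feature_maps]


# Hypothetical feature maps of different lengths, e.g. from kernels with h=2 and h=3.
feature_maps = [[0.0, 1.0, 4.0], [2.5, 0.5]]
pooled = one_max_pool(feature_maps)
```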

(4) The fourth layer is the fully connected layer. The fully connected layer performs classification: the features extracted by the convolutional and pooling layers are fed into the softmax function for classification training, yielding the policy document recognition model.
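The classification step can be sketched as a single dense layer followed by softmax. Only the forward pass is shown; the weights and biases would normally be learned during training, and the values below are made up for illustration. The class order (index 0 for policy document, index 1 for other text) is also an assumption.

```python
import math


def softmax(z):
    """Convert logits to probabilities, shifting by the maximum for stability."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def dense_softmax(features, weights, biases):
    """Fully connected layer (one weight row per class) followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Hypothetical pooled features and untrained example parameters.
pooled = [4.0, 2.5]
weights = [[0.5, -0.2], [-0.3, 0.4]]
biases = [0.0, 0.1]
probs = dense_softmax(pooled, weights, biases)
label = probs.index(max(probs))
```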

3. An e-government sensitive text detection model based on sensitive field text classification and policy document recognition

Because policy documents in sensitive e-government fields mostly involve sensitive content, judging whether a text is sensitive requires detecting whether the text belongs to a sensitive field and whether it is a policy document. The sensitive field text classification model first detects whether the text belongs to a sensitive field; for texts that do, the policy document recognition model then judges whether the text is a policy document.

(1) Sensitive field text classification. The text vector of the content to be detected is constructed first, and the sensitive field classification model built with the support vector machine algorithm then computes the classification result. The input of the model is the text to be detected and the output is its sensitive field classification result, which indicates whether the text belongs to a sensitive field.

(2) Policy document recognition. Word vectors are first constructed with word2vec, and the model built with the convolutional neural network then computes the policy document recognition result. The input of the model is the text to be detected and the output is its policy document recognition result, which indicates whether the text is a policy document.

Finally, the models from the above steps are combined into an e-government sensitive text detection model based on sensitive field text classification and policy document recognition.
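The combination of the two models amounts to a two-stage cascade, which the sketch below illustrates. The two classifiers are passed in as plain functions; the keyword-based stand-ins are hypothetical placeholders for the trained SVM and CNN models and are not part of the described method.

```python
def detect_sensitive(text, is_sensitive_field, is_policy_document):
    """Flag a text only if it is in a sensitive field and is a policy document."""
    if not is_sensitive_field(text):
        # Stage 1 failed, so the policy document model is never consulted.
        return False
    return is_policy_document(text)


# Hypothetical stand-ins for the trained SVM and CNN models.
def toy_field_classifier(text):
    return "religion" in text


def toy_policy_classifier(text):
    return "regulation" in text


result = detect_sensitive(
    "religion regulation notice", toy_field_classifier, toy_policy_classifier
)
```

Running the field classifier first means the costlier policy document model is only applied to texts that already belong to a sensitive field.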

Description of Drawings

Fig. 1 is the workflow diagram of the present invention.

Detailed Description

The present invention is an e-government sensitive text detection method based on a convolutional neural network and a support vector machine. It mainly comprises the following steps:

Step 1: Preprocess the text data. First, the dataset prepared for the present invention is cleaned to remove useless parts of the text. Chinese word segmentation is then applied to obtain the segmented word sequence of each text.

Step 2: Build the sensitive field text classification model. TFIDF is used to calculate the weights of the words in each text and convert the text word sequence into the corresponding text vector. A field classification model is then built with the support vector machine algorithm; its input consists of the text vectors of sensitive field and non-sensitive field texts together with their classification labels, and it is trained repeatedly until it classifies well.

Step 3: Build the policy document recognition model. First, a Word2vec model is used to train word vectors on the texts, so that each word is represented by a word vector and each text is converted into a corresponding matrix that serves as the input to the convolutional neural network. Next, kernels of different sizes are convolved with the input matrix to obtain multiple feature maps. Then the 1-max-pooling function is applied to each feature map to output its maximum feature value. Finally, the extracted features are fed into the softmax function for classification, giving the policy document recognition result for the text.

Step 4: Detection. First, the text to be detected is converted into a text vector, and the sensitive field text classification model built in step 2 detects whether the text belongs to a sensitive field. Then, for texts that belong to a sensitive field, the policy document recognition model built in step 3 judges whether the text is a policy document. In the end, the method detects policy documents in sensitive e-government fields, which are mostly sensitive texts.

Claims (4)

1. A method for detecting sensitive e-government text based on a convolutional neural network and a support vector machine, characterized by comprising:
(1) providing a sensitive field text classification model based on TFIDF and a support vector machine;
(2) providing a policy document recognition model based on word vectors and a convolutional neural network;
(3) providing an e-government sensitive text detection model based on sensitive field text classification and policy document recognition.
2. The sensitive field text classification model based on TFIDF and a support vector machine according to claim 1, characterized in that: the weights of words in sensitive field texts and non-sensitive field texts are calculated with the TFIDF technique to construct the two classes of text vectors; a support vector machine algorithm takes the two classes of text vectors and their classification labels as input and output and is trained iteratively to obtain the final converged sensitive field text classification model.
3. The policy document recognition model based on word vectors and a convolutional neural network according to claim 1, characterized in that: each keyword in the policy document texts and non-policy document texts is represented as a vector with a word vector algorithm, and the word vector matrix of each text is obtained according to its word sequence and used as the input of the convolutional neural network; kernels of different sizes are convolved with the input word vector matrices to obtain multiple feature maps; a pooling function reduces the dimension of the features of each feature map, and the features are finally fed into a softmax classifier layer for classification training to obtain the policy document recognition model.
4. The sensitive text detection model based on religious domain text classification and policy document recognition according to claim 1, characterized in that: because policy documents in sensitive e-government fields (e.g., religion, the military, and politics) mostly contain sensitive content, determining whether a text is a sensitive text requires detecting whether the text belongs to a sensitive field and whether the text is a policy document; the sensitive field text classification model detects whether the text belongs to a sensitive field, and for texts that belong to a sensitive field the policy document recognition model judges whether the text is a policy document.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010629592.9A CN111782811A (en) 2020-07-03 2020-07-03 An E-government Sensitive Text Detection Method Based on Convolutional Neural Network and Support Vector Machine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010629592.9A CN111782811A (en) 2020-07-03 2020-07-03 An E-government Sensitive Text Detection Method Based on Convolutional Neural Network and Support Vector Machine

Publications (1)

Publication Number Publication Date
CN111782811A true CN111782811A (en) 2020-10-16

Family

ID=72759199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010629592.9A Pending CN111782811A (en) 2020-07-03 2020-07-03 An E-government Sensitive Text Detection Method Based on Convolutional Neural Network and Support Vector Machine

Country Status (1)

Country Link
CN (1) CN111782811A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487149A (en) * 2020-12-10 2021-03-12 浙江诺诺网络科技有限公司 Text auditing method, model, equipment and storage medium
CN113723737A (en) * 2021-05-11 2021-11-30 天元大数据信用管理有限公司 Enterprise portrait-based policy matching method, device, equipment and medium
CN114386408A (en) * 2022-01-14 2022-04-22 中国建设银行股份有限公司 Government affair sensitive information identification method, device, equipment, medium and program product

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101106539A (en) * 2007-08-03 2008-01-16 浙江大学 Spam Filtering Method Based on Support Vector Machine
CN108520030A (en) * 2018-03-27 2018-09-11 深圳中兴网信科技有限公司 File classification method, Text Classification System and computer installation
CN109543084A (en) * 2018-11-09 2019-03-29 西安交通大学 A method of establishing the detection model of the hidden sensitive text of network-oriented social media
CN109657243A (en) * 2018-12-17 2019-04-19 江苏满运软件科技有限公司 Sensitive information recognition methods, system, equipment and storage medium
WO2019105134A1 (en) * 2017-11-30 2019-06-06 阿里巴巴集团控股有限公司 Word vector processing method, apparatus and device
CN110489749A (en) * 2019-08-07 2019-11-22 北京航空航天大学 Intelligent Office-Automation System Work Flow Optimizing
CN110955776A (en) * 2019-11-16 2020-04-03 中电科大数据研究院有限公司 A Construction Method of Government Affairs Text Classification Model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
林学峰 等: "基于卷积神经网络的敏感文件检测方法", 《计算机与现代化》, no. 07, pages 28 - 32 *
王思迪 等: "基于文本分类的政府网站信箱自动转递方法研究", 《数据分析与知识发现》, vol. 4, no. 06, pages 51 - 59 *

Similar Documents

Publication Publication Date Title
CN110309331B (en) A self-supervised cross-modal deep hash retrieval method
CN109189925B (en) Word vector model based on point mutual information and text classification method based on CNN
CN107463658B (en) Text classification method and device
CN108537240A (en) Commodity image semanteme marking method based on domain body
CN112711953A (en) Text multi-label classification method and system based on attention mechanism and GCN
CN109918505B (en) Network security event visualization method based on text processing
CN107563444A (en) A kind of zero sample image sorting technique and system
CN107016409A (en) A kind of image classification method and system based on salient region of image
CN110348227B (en) Software vulnerability classification method and system
CN112149420A (en) Entity recognition model training method, threat intelligence entity extraction method and device
CN107844533A (en) A kind of intelligent Answer System and analysis method
CN109471944A (en) Training method, device and readable storage medium for text classification model
CN111782811A (en) An E-government Sensitive Text Detection Method Based on Convolutional Neural Network and Support Vector Machine
CN109299270A (en) A kind of text data unsupervised clustering based on convolutional neural networks
CN106682089A (en) RNNs-based method for automatic safety checking of short message
CN113987188B (en) A kind of short text classification method, device and electronic equipment
CN107346327A (en) The zero sample Hash picture retrieval method based on supervision transfer
CN114881172B (en) An automatic classification method for software vulnerabilities based on weighted word vectors and neural networks
CN110502742A (en) A complex entity extraction method, device, medium and system
CN114792246B (en) Product typical feature mining method and system based on topic integrated clustering
CN107609113A (en) A kind of Automatic document classification method
CN111008530A (en) A complex semantic recognition method based on document word segmentation
CN110196945A (en) A kind of microblog users age prediction technique merged based on LSTM with LeNet
CN110008699A (en) A kind of software vulnerability detection method and device based on neural network
CN107977456A (en) A kind of multi-source big data analysis method based on multitask depth network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201016