CN108197337A - Text classification method and device - Google Patents

Text classification method and device

Info

Publication number
CN108197337A
Authority
CN
China
Prior art keywords
data
classification
text
length
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810262316.6A
Other languages
Chinese (zh)
Other versions
CN108197337B (en)
Inventor
陈嘉慧
刘海龙
郭亚南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sohu New Media Information Technology Co Ltd
Original Assignee
Beijing Sohu New Media Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sohu New Media Information Technology Co Ltd
Priority to CN201810262316.6A
Publication of CN108197337A
Application granted
Publication of CN108197337B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/355 - Creation or modification of classes or clusters
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a text classification method and device. Building on a pre-established CNN classification model, the method improves the way the weights of the convolutional layer are initialized, namely by initializing the weights from a Gaussian distribution. Compared with existing CNN-based text classification methods, this improves the accuracy of the classification results; it also improves classification accuracy relative to machine learning algorithms such as Naive Bayes and SVM.

Description

Text classification method and device

Technical Field

The present invention relates to the technical field of classification, and in particular to a text classification method and device.

Background Art

In the prior art, text classification is performed by extracting text features from the text to be classified and classifying it according to those features.

With the text classification methods disclosed in the prior art, classifying news is problematic: because news articles are long texts, extracting text features from the news to be classified typically requires a large investment of manpower and time to design effective features that aid classification, which is both time-consuming and laborious.

Deep learning, which can learn text features automatically, can resolve the difficulty of feature extraction when classifying long texts such as news. The convolutional neural network (CNN) is a model commonly used in deep learning.

A method of text classification using a CNN model includes: preprocessing the text to be classified to obtain several sentences; feeding each sentence through the convolutional layer and sampling layer of a trained CNN model; and feeding the output of the sampling layer into an SVM classifier to classify the text.

However, the inventors found that existing CNN-based text classification methods suffer from low accuracy.

Summary of the Invention

In view of this, the object of the present invention is to provide a text classification method and device to solve the low accuracy of prior-art CNN-based text classification methods.

The technical solution is as follows:

The present invention provides a text classification method, comprising:

preprocessing the text to be classified to obtain multiple sentences;

inputting the sentences into the input layer of a pre-established CNN classification model;

extracting word2vec features of the sentences to obtain an input matrix;

inputting the input matrix into a convolutional layer and extracting features through convolution operations, wherein the weights of the convolutional layer are initialized with values drawn from a Gaussian distribution; and

inputting the features into a classifier for classification.

Preferably, preprocessing the text to be classified to obtain multiple sentences comprises:

judging whether the length of the text to be classified is greater than a preset length;

if the length of the text to be classified is greater than the preset length, truncating the text to be classified at the preset length to obtain multiple sentences;

judging whether the length of a sentence is less than the preset length; and

if the length of the sentence is less than the preset length, appending content after the sentence in the order in which it appears in the text to be classified, until the length of the new sentence formed by splicing equals the preset length.

Preferably, the training method of the CNN classification model comprises:

acquiring a data set and uncleaned data, wherein the uncleaned data carries preset classification labels;

performing initial training of the CNN classification model with the data set;

performing classification prediction on the uncleaned data with the initially trained CNN classification model to obtain a predicted classification label and a prediction probability for the uncleaned data;

judging whether the prediction probability of the uncleaned data is greater than a preset probability value;

if the prediction probability of the uncleaned data is greater than the preset probability value, judging whether the predicted classification label of the uncleaned data is the same as its preset classification label;

if the predicted classification label of the uncleaned data differs from its preset classification label, changing the preset classification label of the uncleaned data to the predicted classification label to obtain cleaned data; and

training the CNN classification model with the cleaned data.

Preferably, changing the preset classification label of the uncleaned data to the predicted classification label comprises:

selecting, according to a preset rule, the uncleaned data to be processed from among the uncleaned data whose predicted classification label is judged to differ from its preset classification label; and

changing the preset classification label of the uncleaned data to be processed to the predicted classification label.

Preferably, the training method of the CNN classification model further comprises:

if the number of training samples is less than a preset sample number, using a pre-trained CNN classification model; and

training the pre-trained CNN classification model with the training samples.

The present invention further provides a text classification device, comprising:

a preprocessing unit, configured to preprocess the text to be classified to obtain multiple sentences;

an input unit, configured to input the sentences into the input layer of a pre-established CNN classification model;

a first processing unit, configured to extract word2vec features of the sentences to obtain an input matrix;

a second processing unit, configured to input the input matrix into a convolutional layer and extract features through convolution operations, wherein the weights of the convolutional layer are initialized with values drawn from a Gaussian distribution; and

a classification unit, configured to input the features into a classifier for classification.

Preferably, the preprocessing unit comprises:

a first judging unit, configured to judge whether the length of the text to be classified is greater than a preset length;

a truncation unit, configured to, when the first judging unit judges that the length of the text to be classified is greater than the preset length, truncate the text to be classified at the preset length to obtain multiple sentences;

a second judging unit, configured to judge whether the length of a sentence is less than the preset length; and

a splicing unit, configured to, when the second judging unit judges that the length of the sentence is less than the preset length, append content after the sentence in the order in which it appears in the text to be classified, until the length of the new sentence formed by splicing equals the preset length.

Preferably, the device further comprises:

an acquiring unit, configured to acquire a data set and uncleaned data, wherein the uncleaned data carries preset classification labels;

a training unit, configured to perform initial training of the CNN classification model with the data set;

a prediction unit, configured to perform classification prediction on the uncleaned data with the initially trained CNN classification model to obtain a predicted classification label and a prediction probability for the uncleaned data;

a third judging unit, configured to judge whether the prediction probability of the uncleaned data is greater than a preset probability value;

a fourth judging unit, configured to, when the third judging unit judges that the prediction probability of the uncleaned data is greater than the preset probability value, judge whether the predicted classification label of the uncleaned data is the same as its preset classification label; and

a modifying unit, configured to, when the fourth judging unit judges that the predicted classification label of the uncleaned data differs from its preset classification label, change the preset classification label of the uncleaned data to the predicted classification label to obtain cleaned data;

wherein the training unit is further configured to train the CNN classification model with the cleaned data.

Preferably, the modifying unit comprises:

a selecting subunit, configured to select, according to a preset rule, the uncleaned data to be processed from among the uncleaned data whose predicted classification label is judged to differ from its preset classification label; and

a modifying subunit, configured to change the preset classification label of the uncleaned data to be processed to the predicted classification label.

Preferably, the device further comprises:

a reuse unit, configured to, if the number of training samples is less than a preset sample number, reuse a pre-trained CNN classification model;

wherein the training unit is further configured to train the pre-trained CNN classification model with the training samples.

Compared with the prior art, the above technical solution provided by the present invention has the following advantages:

As can be seen from the above technical solution, on the basis of a pre-established CNN classification model, the present application improves the way the weights of the convolutional layer are initialized, namely by initializing the weights from a Gaussian distribution. Compared with existing CNN-based text classification methods, this improves the accuracy of the classification results; it also improves classification accuracy relative to machine learning algorithms such as Naive Bayes and SVM.

Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a flowchart of a text classification method provided by an embodiment of the present invention;

FIG. 2 is a flowchart of another text classification method provided by an embodiment of the present invention;

FIG. 3 is a flowchart of a training method for the CNN classification model provided by an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a text classification device provided by an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of another text classification device provided by an embodiment of the present invention.

Detailed Description

To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

This embodiment discloses a text classification method applied to long-text classification scenarios, for example news classification. Referring to FIG. 1, the embodiment includes the following steps:

S101: Preprocess the text to be classified to obtain multiple sentences.

When the text to be classified is a long text, it first needs to be truncated into multiple texts of a preset length; each text of that length is one sentence, yielding multiple sentences. The preset length can be set according to actual needs.

S102: Input the sentences into the input layer of a pre-established CNN classification model.

In this embodiment, a CNN classification model is established and trained in advance. The CNN classification model has two parallel input layers, each corresponding to one input channel.

The input layer corresponding to the first input channel is initialized with the word2vec algorithm; it does not participate in the training of the CNN classification model and only serves to classify the content fed through the first input channel. The input layer corresponding to the second input channel is initialized randomly; it not only classifies the content fed through the second input channel but also participates in the training of the CNN classification model.

The sentences are fed through the input channels into the input layers of the CNN classification model.
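
The two-channel input layer described above can be sketched roughly as follows. This is a minimal PyTorch illustration under assumed sizes; the vocabulary size, embedding dimension, placeholder vectors, and function names are assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the patent does not specify them.
VOCAB_SIZE, EMBED_DIM = 50000, 300

# In practice this would hold pre-trained word2vec vectors; random placeholder here.
pretrained_w2v = torch.randn(VOCAB_SIZE, EMBED_DIM)

# Channel 1: word2vec-initialized and frozen (does not participate in training).
static_channel = nn.Embedding.from_pretrained(pretrained_w2v, freeze=True)

# Channel 2: randomly initialized and updated during training.
trainable_channel = nn.Embedding(VOCAB_SIZE, EMBED_DIM)

def embed_sentence(token_ids: torch.Tensor) -> torch.Tensor:
    # Stack the two channels like image channels: (batch, 2, seq_len, EMBED_DIM).
    return torch.stack([static_channel(token_ids),
                        trainable_channel(token_ids)], dim=1)
```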

S103: Extract the word2vec features of the sentences to obtain an input matrix.

S104: Input the input matrix into the convolutional layer and extract features through convolution operations, wherein the weights of the convolutional layer are initialized with values drawn from a Gaussian distribution.

The weights of the convolutional layer are initialized from a Gaussian distribution with zero mean and a small standard deviation.

S105: Input the features into a classifier for classification.

In this embodiment, the extracted features are 256-dimensional and the classifier is a softmax classifier. The external softmax classifier performs classification based on the extracted 256-dimensional features.
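
A minimal sketch of S104 and S105, continuing the PyTorch sketch above. The patent fixes only the 256-dimensional feature output and the Gaussian initialization; the kernel shape, standard deviation, and pooling are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 300    # embedding dimension from the sketch above
NUM_CLASSES = 101  # the embodiment numbers news categories from 1 to 101

conv = nn.Conv2d(in_channels=2, out_channels=256, kernel_size=(3, EMBED_DIM))
# S104: initialize the convolutional weights from a zero-mean, small-std Gaussian.
nn.init.normal_(conv.weight, mean=0.0, std=0.01)
nn.init.zeros_(conv.bias)

classifier = nn.Linear(256, NUM_CLASSES)

def classify(x: torch.Tensor) -> torch.Tensor:
    # x: (batch, 2, seq_len, EMBED_DIM) from the two-channel input layer.
    h = torch.relu(conv(x)).squeeze(3)          # (batch, 256, seq_len - 2)
    h = F.max_pool1d(h, h.size(2)).squeeze(2)   # (batch, 256) feature vector
    return torch.softmax(classifier(h), dim=1)  # S105: class probabilities
```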

As can be seen from the above technical solution, this embodiment improves, on the basis of a pre-established CNN classification model, the way the weights of the convolutional layer are initialized, namely by initializing the weights from a Gaussian distribution. Compared with existing CNN-based text classification methods, this improves the accuracy of the classification results; it also improves classification accuracy relative to machine learning algorithms such as Naive Bayes and SVM.

This embodiment discloses another text classification method and describes in detail how the text to be classified is preprocessed. Referring to FIG. 2, the embodiment includes the following steps:

S201: Judge whether the length of the text to be classified is greater than a preset length, the text to be classified comprising a title and a body.

If the length of the text to be classified is greater than the preset length, go to step S202.

S202: Truncate the text to be classified at the preset length to obtain multiple sentences.

For example, if the text to be classified is "shenqingwenjian" and the preset length is 4, truncating it yields "shen", "qing", "wenj", and "ian". Each truncated piece is one sentence, so performing the truncation operation on the text to be classified yields multiple sentences.

S203: Judge whether the length of the sentence is less than the preset length.

If the length of the sentence is less than the preset length, go to step S204;

otherwise, go to step S205.

Performing the truncation operation on the text to be classified yields multiple sentences. Because the length of the text to be classified may not be an integer multiple of the preset length, the last sentence obtained by truncation ("ian") may be shorter than the preset length; in that case, step S204 is performed on the last sentence ("ian").

In this embodiment, because the text to be classified is truncated at the preset length, only the last of the resulting sentences can have a length other than the preset length, and a truncated piece can never exceed it. Therefore, the judging step only needs to check whether the sentence is shorter than the preset length.

S204: Append content after the sentence in the order in which it appears in the text to be classified, until the length of the new sentence formed by splicing equals the preset length.

The length of "ian" is less than the preset length 4, so a splicing operation is needed. "ian" can be padded with the content of the text to be classified, "shenqingwenjian", giving the result "ians"; alternatively, it can be padded with its own content, giving the result "iani". After splicing, the length of the sentence equals the preset length.

The object of the splicing operation in this step is a sentence obtained by truncating the text to be classified. However, when step S201 judges that the length of the text to be classified is not greater than the preset length, the text to be classified may itself be shorter than the preset length; in that case no truncation is needed, but a splicing operation is. The splicing operation is implemented similarly to the splicing performed on truncation results.

Specifically, when the length of the text to be classified is less than the preset length, the splicing operation pads the text to be classified into a new text whose length equals the preset length.

For example, if the text to be classified is "shenqing" with length 8 and the preset length is 10, content is appended after "shenqing" in the order in which it appears in the text to be classified.

The order of "shenqing" is s, h, e, n, q, i, n, g. Splicing proceeds by first appending "s" to "shenqing", giving the new text to be classified "shenqings" of length 9, which is still less than the preset length, so splicing continues; appending "h" to "shenqings" gives "shenqingsh" of length 10, equal to the preset length, so the splicing operation is complete. The final new text to be classified is "shenqingsh".

With a preset length of 20, the above splicing method yields the new text to be classified "shenqingshenqingshen".
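
The truncate-and-pad preprocessing of S201 through S204 can be condensed into a short sketch. This is a character-level toy matching the examples above (in practice the unit of length might be words or tokens); it implements the variant that pads from the start of the full text, and the function name is an assumption:

```python
def preprocess(text: str, preset_len: int) -> list[str]:
    """Truncate a text into preset_len-sized sentences and pad the
    remainder cyclically with the text's own content."""
    # S201/S202: truncate at the preset length.
    sentences = [text[i:i + preset_len] for i in range(0, len(text), preset_len)]
    # S203/S204: only the last piece can be short; pad it cyclically.
    last, i = sentences[-1], 0
    while len(last) < preset_len:
        last += text[i % len(text)]
        i += 1
    sentences[-1] = last
    return sentences

# Reproduces the examples from the description:
assert preprocess("shenqingwenjian", 4) == ["shen", "qing", "wenj", "ians"]
assert preprocess("shenqing", 10) == ["shenqingsh"]
assert preprocess("shenqing", 20) == ["shenqingshenqingshen"]
```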

In addition, when step S201 judges that the length of the text to be classified is not greater than the preset length, the text may also be exactly the preset length; in that case, neither truncation nor splicing is needed.

S205: Input the sentences into the input layer of the pre-established CNN classification model.

S206: Extract the word2vec features of the sentences to obtain an input matrix.

S207: Input the input matrix into the convolutional layer and extract features through convolution operations, wherein the weights of the convolutional layer are initialized with values drawn from a Gaussian distribution.

S208: Input the features into a classifier for classification.

In this embodiment, steps S205 through S208 are implemented similarly to steps S102 through S105 of the previous embodiment and are not described again here.

As can be seen from the above technical solution, this embodiment preprocesses the input to the pre-established CNN classification model: text to be classified whose length exceeds the preset length is truncated at the preset length, and text shorter than the preset length is cyclically padded with its own content. The preprocessed content is fed into the CNN classification model, and the initialization of the convolutional-layer weights is improved by drawing them from a Gaussian distribution. Compared with existing CNN-based text classification methods, this improves the accuracy of the classification results.

The text classification methods disclosed in the above embodiments rely on a pre-established and trained CNN classification model. The training method of the CNN classification model is described in detail below. Referring to FIG. 3, the training method includes the following steps:

S301: Acquire a data set and uncleaned data, wherein the uncleaned data carries preset classification labels.

The data set consists of three types of data: original training data, manually labeled data, and external-site data, each accounting for a different share of the total. Preferably, original training data accounts for 80% of the data set, manually labeled data for 5%, and external-site data for 15%. The total size of the data set is set to about 300,000 items.

In this embodiment, the original training data refers to text already stored by the user's website, for example Sohu's own stored news, which is kept in a distributed cluster.

Manually labeled data refers to data whose classification labels were assigned by human editors.

External-site data refers to text crawled from websites or public accounts other than the user's website.

In this embodiment, the uncleaned data may be sampled directly from the acquired data set; it may also be acquired separately from the data set. In either case, the uncleaned data must already carry preset classification labels.

In practice, when the amount of uncleaned data is large, filtering rules can be applied to it, for example a keyword blacklist; the filtering rule selects a subset of the uncleaned data, and the subsequent cleaning steps are performed on the selected subset.
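
A minimal sketch of such a filtering rule, assuming each item is a dict with a "text" field; the blacklist contents, the field name, and the choice to drop blacklist hits (rather than select them) are all assumptions, since the patent only names the mechanism:

```python
BLACKLIST = {"advert", "lottery"}  # hypothetical keywords; the patent lists none

def filter_uncleaned(uncleaned: list[dict]) -> list[dict]:
    """Drop items whose text contains a blacklisted keyword; the remaining
    subset goes through the subsequent cleaning steps."""
    return [item for item in uncleaned
            if not any(kw in item["text"] for kw in BLACKLIST)]
```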

S302: Perform initial training of the CNN classification model with the data set.

S303: Perform classification prediction on the uncleaned data with the initially trained CNN classification model to obtain a predicted classification label and a prediction probability for the uncleaned data.

Taking news classification as an example, the predicted classification label is the number of the category to which the news belongs. The category numbers can be set according to the actual business scenario; in this embodiment they run from 1 to 101, with different numbers corresponding to different news categories. For example, sports news corresponds to 1, financial news to 2, and entertainment news to 3.

The prediction probability is the likelihood that the CNN classification model's prediction for the uncleaned data is correct. For example, if classification prediction on news item A, a piece of data to be cleaned, yields the predicted classification label 1 (sports news) with a prediction probability of 0.9, the CNN classification model judges with 90% confidence that news item A is sports news.

S304: Judge whether the prediction probability of the uncleaned data is greater than a preset probability value.

If the prediction probability of the uncleaned data is greater than the preset probability value, go to step S305.

If the prediction probability of the uncleaned data is not greater than the preset probability value, no processing is performed, meaning the result of this prediction is not trusted.

The preset probability value is 99.5%; that is, when the prediction probability produced by the CNN classification model exceeds 99.5%, the prediction is considered trustworthy and step S305 is performed.

If the prediction probability produced by the CNN classification model does not exceed 99.5%, the prediction is not trustworthy, and the predicted classification label it produced can be ignored.

S305: Judge whether the predicted classification label of the uncleaned data is the same as its preset classification label.

If the predicted classification label of the uncleaned data differs from its preset classification label, go to step S306.

If the predicted classification label of the uncleaned data is the same as its preset classification label, the preset classification label of the uncleaned data is not modified.

When the CNN classification model's prediction is trustworthy, judge whether the predicted classification label of the uncleaned data matches its preset classification label.

For uncleaned data, if the predicted classification label matches the preset classification label, the original preset classification label is correct and needs no modification; if they differ, the original preset classification label is incorrect, and step S306 is performed.

S306: Change the preset classification label of the uncleaned data to the predicted classification label to obtain the cleaned data.

Changing the classification label of the uncleaned data to the predicted classification label completes the cleaning step and yields the cleaned data.

For example, if the predicted classification label of news item B is 2 with a prediction probability greater than 99.5%, the CNN classification model's prediction for news item B is trustworthy and the classification label of news item B should be 2. But the preset classification label of news item B is 1, which differs from the predicted label 2, indicating that its preset label is noisy and wrong. The preset classification label of news item B is therefore changed to the predicted label, and the modified label of news item B is 2.
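
The relabeling rule of S304 through S306 reduces to a few lines. A sketch under assumed interfaces: the per-item "text"/"label" keys and the model.predict signature are illustrative, not the patent's:

```python
PRESET_PROB = 0.995  # the embodiment's preset probability value (99.5%)

def clean(uncleaned: list[dict], model) -> list[dict]:
    """Adopt the model's predicted label when the prediction is trusted
    (probability above the threshold) yet disagrees with the preset label."""
    for item in uncleaned:
        pred_label, pred_prob = model.predict(item["text"])  # S303
        if pred_prob > PRESET_PROB and pred_label != item["label"]:
            item["label"] = pred_label  # S306: replace the noisy preset label
    return uncleaned
```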

S307: Train the CNN classification model with the cleaned data.

The data that has gone through the cleaning step is used to update the training set of the CNN classification model, increasing the number of training samples; data is also drawn in the same proportion from the pre-update training set, placed in the test set, and classified with the CNN classification model. In other words, the initial CNN classification model is used to clean the data, and the cleaned data in turn supports further training of the CNN classification model, so the data is cleaned layer by layer in a progressive fashion.

In this embodiment, to further improve classification accuracy, not every news item whose prediction probability exceeds 99.5% and whose predicted classification label differs from its preset classification label has its label modified. Instead, according to a preset rule, the uncleaned data to be processed is selected from among the uncleaned data whose prediction probability exceeds 99.5% and whose predicted classification label differs from its preset classification label, and the classification labels of the selected uncleaned data are modified.

Preferably, the preset rule is a uniform distribution.

With the above training method for the CNN classification model, when a large number of training samples is used to train the model, it is possible to judge whether the classification labels of those samples are correct and to change incorrect labels to correct ones. Training the CNN classification model with correctly labeled training samples then yields a model with high classification accuracy.

In actual training of the CNN classification model, the number of training samples may also be small. When it is, training the CNN classification model directly on the small-scale training samples easily leads to overfitting and hence to erroneous classification results.

For this reason, in this embodiment, if the number of training samples is judged to be less than a preset sample number, a pre-trained CNN classification model is reused, and the training samples are then used to train that pre-trained CNN classification model.

Here, the pre-trained CNN classification model is a CNN classification model that has been trained in a scenario similar to that of the training samples.

For example, in this embodiment the long texts to be classified are news, so a CNN classification model able to classify news must be trained, but the number of news items with accurate classification labels is small. However, a CNN classification model for classifying novels, another kind of long text, has already been trained. Since news and novels are both long texts whose overall structure and feature representations differ little, the pre-trained CNN classification model for classifying novels can be reused and then trained with the accurately labeled news. By reusing the novel-classification CNN model, it is unnecessary to relearn all of the model's parameters; the small-scale news samples are only used to adjust some of the reused model's parameters during the iterations, which suffices to obtain a CNN classification model for news classification. This solves the overfitting problem caused by training a model on an insufficient number of samples.
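
A minimal sketch of the reuse step, continuing the earlier PyTorch sketches. Freezing the convolutional layer and fine-tuning only the classifier head is one plausible reading of "adjusting some of the parameters", not a split the patent prescribes; extract_features is a hypothetical helper wrapping the conv-and-pool stage of the earlier sketch:

```python
import torch

# Suppose `conv` and `classifier` carry weights from the pre-trained novel classifier.
for p in conv.parameters():
    p.requires_grad = False  # reuse the learned feature extractor unchanged

optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

def fine_tune_step(x: torch.Tensor, y: torch.Tensor) -> float:
    # x: embedded news batch, y: class labels; only `classifier` is updated.
    optimizer.zero_grad()
    logits = classifier(extract_features(x))  # hypothetical conv+pool wrapper
    loss = loss_fn(logits, y)
    loss.backward()
    optimizer.step()
    return loss.item()
```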

Corresponding to the above text classification method, this embodiment further discloses a text classification device, whose schematic structural diagram is shown in FIG. 4. In this embodiment, the text classification device includes:

a preprocessing unit 401, an input unit 402, a first processing unit 403, a second processing unit 404, and a classification unit 405.

The preprocessing unit 401 is configured to preprocess the text to be classified to obtain multiple sentences.

The preprocessing unit 401 includes:

a first judging unit, a truncation unit, a second judging unit, and a splicing unit.

The first judging unit is configured to judge whether the length of the text to be classified is greater than a preset length.

The truncation unit is configured to, when the first judging unit judges that the length of the text to be classified is greater than the preset length, truncate the text to be classified at the preset length to obtain multiple sentences.

The second judging unit is configured to judge whether the length of a sentence is less than the preset length.

The splicing unit is configured to, when the second judging unit judges that the length of the sentence is less than the preset length, append content after the sentence in the order in which it appears in the text to be classified, until the length of the new sentence formed by splicing equals the preset length.

The input unit 402 is configured to input the sentences into the input layer of a pre-established CNN classification model.

The first processing unit 403 is configured to extract word2vec features of the sentences to obtain an input matrix.

The second processing unit 404 is configured to input the input matrix into a convolutional layer and extract features through convolution operations, wherein the weights of the convolutional layer are initialized with values drawn from a Gaussian distribution.

The classification unit 405 is configured to input the features into a classifier for classification.

As can be seen from the above technical solution, this embodiment preprocesses the input to the pre-established CNN classification model: text to be classified whose length exceeds the preset length is truncated at the preset length, and text shorter than the preset length is cyclically padded with its own content. The preprocessed content is fed into the CNN classification model, and the initialization of the convolutional-layer weights is improved by drawing them from a Gaussian distribution. Compared with existing CNN-based text classification methods, this improves the accuracy of the classification results.

On the basis of the text classification device disclosed in the previous embodiment, this embodiment discloses another text classification device, whose schematic structural diagram is shown in FIG. 5. In this embodiment, the text classification device further includes:

an acquiring unit 501, a training unit 502, a prediction unit 503, a third judging unit 504, a fourth judging unit 505, a modifying unit 506, and a reuse unit 507.

The acquiring unit 501 is configured to acquire a data set and uncleaned data, wherein the uncleaned data carries preset classification labels.

The training unit 502 is configured to perform initial training of the CNN classification model with the data set.

The prediction unit 503 is configured to perform classification prediction on the uncleaned data with the initially trained CNN classification model to obtain a predicted classification label and a prediction probability for the uncleaned data.

The third judging unit 504 is configured to judge whether the prediction probability of the uncleaned data is greater than a preset probability value.

The fourth judging unit 505 is configured to, when the third judging unit judges that the prediction probability of the uncleaned data is greater than the preset probability value, judge whether the predicted classification label of the uncleaned data is the same as its preset classification label.

The modifying unit 506 is configured to, when the fourth judging unit judges that the predicted classification label of the uncleaned data differs from its preset classification label, change the preset classification label of the uncleaned data to the predicted classification label to obtain the cleaned data.

If the predicted classification label of the uncleaned data is judged to be the same as its preset classification label, the preset classification label of the uncleaned data is not modified.

The training unit 502 is further configured to train the CNN classification model with the cleaned data.

The modifying unit 506 includes:

a selecting subunit and a modifying subunit.

The selecting subunit is configured to select, according to a preset rule, the uncleaned data to be processed from among the uncleaned data whose predicted classification label is judged to differ from its preset classification label.

The modifying subunit is configured to change the preset classification label of the uncleaned data to be processed to the predicted classification label.

The reuse unit 507 is configured to, if the number of training samples is less than a preset sample number, reuse a pre-trained CNN classification model.

The training unit 502 is further configured to train the pre-trained CNN classification model with the training samples.

After the CNN classification model is trained, it can be used for text classification.

As can be seen from the above technical solution, this embodiment preprocesses the input to the pre-established CNN classification model: text to be classified whose length exceeds the preset length is truncated at the preset length, and text shorter than the preset length is cyclically padded with its own content. The preprocessed content is fed into the CNN classification model, and the initialization of the convolutional-layer weights is improved by drawing them from a Gaussian distribution. Compared with existing CNN-based text classification methods, this improves the accuracy of the classification results. Moreover, when training the CNN classification model, the large set of training samples is cleaned first and the cleaned samples are then used to train the model, yielding an accurate CNN classification model; and when the number of training samples is small, an already trained CNN classification model is reused and then trained with the training samples, which avoids overfitting.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and for the parts that the embodiments have in common, they can be referred to one another. Since the device provided in the embodiments corresponds to the method provided in the embodiments, its description is relatively brief; for the relevant details, see the description of the method.

It should be noted that, in this document, the terms "comprise", "include", or any of their variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes that element.

The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. a kind of file classification method, which is characterized in that including:
It treats classifying text to be pre-processed, obtains multiple sentences;
The input layer of CNN disaggregated models that the sentence inputting is pre-established;
The word2vec features of the sentence are extracted, obtain input matrix;
The input matrix is inputted into convolutional layer, feature is extracted by convolution operation;Wherein, the weight of the convolutional layer is using high This distributed constant value is initialized;
Feature input grader is classified.
2. file classification method according to claim 1, which is characterized in that the classifying text for the treatment of is pre-processed, Multiple sentences are obtained to include:
Judge whether the length of the text to be sorted is more than preset length;
Judge the text to be sorted length be more than preset length, then by the text to be sorted according to the preset length into Row blocks, and obtains multiple sentences;
Judge whether the length of the sentence is less than preset length;
Judge that the length of the sentence is less than preset length, then the content order splicing included according to the text to be sorted exists Behind the sentence, until the length of new sentence formed after splicing is equal to the preset length.
3. The text classification method according to claim 1 or 2, wherein the training method of the CNN classification model comprises:
obtaining a data set and uncleaned data, wherein the uncleaned data carries preset classification labels;
performing initial training of the CNN classification model using the data set;
performing classification prediction on the uncleaned data using the initially trained CNN classification model to obtain predicted classification labels and prediction probabilities of the uncleaned data;
judging whether the prediction probability of the uncleaned data exceeds a preset probability value;
if the prediction probability of the uncleaned data is judged to exceed the preset probability value, judging whether the predicted classification label of the uncleaned data is identical to the preset classification label of the uncleaned data;
if the predicted classification label of the uncleaned data is judged to differ from the preset classification label of the uncleaned data, revising the preset classification label of the uncleaned data to the predicted classification label to obtain cleaned data;
training the CNN classification model using the cleaned data.
4. The text classification method according to claim 3, wherein revising the preset classification label of the uncleaned data to the predicted classification label comprises:
selecting, according to preset rules, to-be-processed uncleaned data from the uncleaned data whose predicted classification label has been judged to differ from its preset classification label;
revising the preset classification label of the to-be-processed uncleaned data to the predicted classification label.
5. The text classification method according to claim 3, wherein the training method of the CNN classification model further comprises:
if the number of training samples is less than a preset sample number, reusing a pre-trained CNN classification model;
training the pre-trained CNN classification model using the training samples.
6. A text classification device, characterized by comprising:
a preprocessing unit, configured to preprocess a text to be classified to obtain a plurality of sentences;
an input unit, configured to input the sentences into the input layer of a pre-established CNN classification model;
a first processing unit, configured to extract word2vec features of the sentences to obtain an input matrix;
a second processing unit, configured to input the input matrix into a convolutional layer and to extract features through convolution operations, wherein the weights of the convolutional layer are initialized using Gaussian distribution parameter values;
a classification unit, configured to input the features into a classifier for classification.
7. The text classification device according to claim 6, wherein the preprocessing unit comprises:
a first judging unit, configured to judge whether the length of the text to be classified exceeds a preset length;
a truncation unit, configured to truncate the text to be classified according to the preset length to obtain a plurality of sentences when the first judging unit judges that the length of the text to be classified exceeds the preset length;
a second judging unit, configured to judge whether the length of a sentence is less than the preset length;
a splicing unit, configured to splice content included in the text to be classified, in order, behind the sentence when the second judging unit judges that the length of the sentence is less than the preset length, until the length of the new sentence formed by the splicing equals the preset length.
8. The text classification device according to claim 6 or 7, further comprising:
an acquisition unit, configured to obtain a data set and uncleaned data, wherein the uncleaned data carries preset classification labels;
a training unit, configured to perform initial training of the CNN classification model using the data set;
a prediction unit, configured to perform classification prediction on the uncleaned data using the initially trained CNN classification model to obtain predicted classification labels and prediction probabilities of the uncleaned data;
a third judging unit, configured to judge whether the prediction probability of the uncleaned data exceeds a preset probability value;
a fourth judging unit, configured to judge, when the third judging unit judges that the prediction probability of the uncleaned data exceeds the preset probability value, whether the predicted classification label of the uncleaned data is identical to the preset classification label of the uncleaned data;
a revision unit, configured to revise the preset classification label of the uncleaned data to the predicted classification label when the fourth judging unit judges that the predicted classification label of the uncleaned data differs from the preset classification label of the uncleaned data, thereby obtaining cleaned data;
wherein the training unit is further configured to train the CNN classification model using the cleaned data.
9. The text classification device according to claim 8, wherein the revision unit comprises:
a selection subunit, configured to select, according to preset rules, to-be-processed uncleaned data from the uncleaned data whose predicted classification label has been judged to differ from its preset classification label;
a revision subunit, configured to revise the preset classification label of the to-be-processed uncleaned data to the predicted classification label.
10. The text classification device according to claim 8, further comprising:
a reuse unit, configured to reuse a pre-trained CNN classification model if the number of training samples is less than a preset sample number;
wherein the training unit is further configured to train the pre-trained CNN classification model using the training samples.
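
By way of illustration, the pipeline recited in claim 1 (word2vec input matrix, convolution with Gaussian-initialized weights, classifier) could be sketched in Python with PyTorch as follows. This is a minimal sketch, not the patented implementation: the layer sizes, kernel widths, and the standard deviation of the Gaussian are assumed values that the claim does not fix.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    # Minimal text CNN in the spirit of claim 1 (hypothetical sizes).
    def __init__(self, embed_dim=128, num_classes=10,
                 kernel_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, num_filters, (k, embed_dim)) for k in kernel_sizes]
        )
        # Claim 1: convolutional-layer weights initialized from a
        # Gaussian distribution (std=0.1 is an assumed value).
        for conv in self.convs:
            nn.init.normal_(conv.weight, mean=0.0, std=0.1)
            nn.init.zeros_(conv.bias)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):
        # x: (batch, sentence_length, embed_dim) word2vec input matrix
        x = x.unsqueeze(1)                                    # add channel dim
        feats = [F.relu(conv(x)).squeeze(3) for conv in self.convs]
        pooled = [F.max_pool1d(f, f.size(2)).squeeze(2) for f in feats]
        return self.fc(torch.cat(pooled, dim=1))              # classifier scores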
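
Claim 2's length normalization can likewise be read as: truncate an over-length text into preset-length chunks, then pad any short chunk by splicing content of the original text behind it. A minimal sketch, assuming character-level lengths and a wrap-around splicing policy (both assumptions the claim leaves open):

def preprocess(text, preset_len=50):
    # Truncate texts longer than preset_len into preset_len-sized chunks.
    if len(text) > preset_len:
        sentences = [text[i:i + preset_len]
                     for i in range(0, len(text), preset_len)]
    else:
        sentences = [text]
    # Pad short chunks by splicing the text's content, in order, behind them.
    padded = []
    for s in sentences:
        i = 0
        while len(s) < preset_len and text:
            s += text[i % len(text)]
            i += 1
        padded.append(s)
    return padded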
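
The cleaning procedure of claim 3 revises a preset label only when the initially trained model is both confident (prediction probability above a preset value) and in disagreement with that label. A sketch under an assumed model.predict(text) -> (label, probability) interface; the 0.9 threshold is likewise an assumed value:

def clean_labels(model, unclean_data, prob_threshold=0.9):
    # unclean_data: iterable of (text, preset_label) pairs.
    cleaned = []
    for text, preset_label in unclean_data:
        pred_label, prob = model.predict(text)   # hypothetical API
        if prob > prob_threshold and pred_label != preset_label:
            cleaned.append((text, pred_label))   # revise to predicted label
        else:
            cleaned.append((text, preset_label))
    return cleaned                               # then retrain on this data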
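
Finally, claim 5's provision for scarce data can be read as falling back to a pre-trained CNN and fine-tuning it on the available samples. A sketch reusing the TextCNN above; the sample threshold and checkpoint path are hypothetical:

MIN_SAMPLES = 10_000  # preset sample number (assumed value)

def build_model(train_samples, num_classes=10):
    model = TextCNN(num_classes=num_classes)
    if len(train_samples) < MIN_SAMPLES:
        # Too few samples: reuse a pre-trained CNN and fine-tune it.
        # "pretrained_textcnn.pt" is a hypothetical checkpoint file.
        model.load_state_dict(torch.load("pretrained_textcnn.pt"))
    return model  # subsequently trained on train_samples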
CN201810262316.6A 2018-03-28 2018-03-28 A text classification method and device Active CN108197337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810262316.6A CN108197337B (en) 2018-03-28 2018-03-28 A text classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810262316.6A CN108197337B (en) 2018-03-28 2018-03-28 A text classification method and device

Publications (2)

Publication Number Publication Date
CN108197337A true CN108197337A (en) 2018-06-22
CN108197337B CN108197337B (en) 2020-09-29

Family

ID=62596274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810262316.6A Active CN108197337B (en) 2018-03-28 2018-03-28 A text classification method and device

Country Status (1)

Country Link
CN (1) CN108197337B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937434A (en) * 2009-06-29 2011-01-05 天津一度搜索网络科技有限公司 Search method with fault-tolerant function
CN106408522A (en) * 2016-06-27 2017-02-15 深圳市未来媒体技术研究院 Image de-noising method based on convolution pair neural network
CN106445919A (en) * 2016-09-28 2017-02-22 上海智臻智能网络科技股份有限公司 Sentiment classifying method and device
CN107835496A (en) * 2017-11-24 2018-03-23 北京奇虎科技有限公司 A spam message recognition method, device and server

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZJ_IMPROVE: "Removing mislabeled data", HTTPS://BLOG.CSDN.NET/JUNJUN_ZHAO/ARTICLE/DETAILS/79167410 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020073507A1 (en) * 2018-10-11 2020-04-16 平安科技(深圳)有限公司 Text classification method and terminal
CN111612023A (en) * 2019-02-25 2020-09-01 北京嘀嘀无限科技发展有限公司 A method and device for constructing a classification model
CN112288015A (en) * 2020-10-30 2021-01-29 国网四川省电力公司电力科学研究院 Distribution network electrical topology identification method and system based on edge calculation improved KNN
CN112733544A (en) * 2021-04-02 2021-04-30 中国电子科技网络信息安全有限公司 Target character activity track information extraction method, computer device and storage medium
CN112733544B (en) * 2021-04-02 2021-07-09 中国电子科技网络信息安全有限公司 Target character activity track information extraction method, computer device and storage medium

Also Published As

Publication number Publication date
CN108197337B (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN111177326B Key information extraction method and device based on finely labeled text, and storage medium
CN113051356B (en) Open relation extraction method and device, electronic equipment and storage medium
CN104572958B A sensitive information monitoring method based on event extraction
CN110188195B (en) Text intention recognition method, device and equipment based on deep learning
CN104598535B An event extraction method based on maximum entropy
CN108197337A A text classification method and device
CN113590764B (en) Training sample construction method and device, electronic equipment and storage medium
CN108304468A A text classification method and text classification device
CN103336766A Short-text spam identification and modeling method and device
CN108664512B (en) Text object classification method and device
CN107004141A (en) Efficient labeling of large sample groups
CN110750974A (en) Structured processing method and system for referee document
CN108900905A A video clipping method and device
CN109784368A A method and apparatus for determining application program classification
CN113688232B (en) Method and device for classifying bid-inviting text, storage medium and terminal
CN106528538A (en) Method and device for intelligent emotion recognition
CN108733675A Sentiment evaluation method and device based on massive sample data
CN105608075A (en) Related knowledge point acquisition method and system
CN111309859A (en) A method and device for sentiment analysis of online word-of-mouth in scenic spots
CN105975497A (en) Automatic microblog topic recommendation method and device
CN111506785A (en) Method and system for identifying topics of network public opinion based on social text
CN114416981A (en) A long text classification method, device, equipment and storage medium
CN115114408B (en) Multi-mode emotion classification method, device, equipment and storage medium
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN114357164B (en) Emotion-reason pair extraction method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant