CN109376241B - DenseNet-based telephone appeal text classification algorithm for power field - Google Patents
- Publication number
- CN109376241B CN109376241B CN201811208673.0A CN201811208673A CN109376241B CN 109376241 B CN109376241 B CN 109376241B CN 201811208673 A CN201811208673 A CN 201811208673A CN 109376241 B CN109376241 B CN 109376241B
- Authority
- CN
- China
- Prior art keywords
- text
- telephone
- appeal
- classified
- classification algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a DenseNet-based telephone appeal text classification algorithm for the power field, and belongs to the technical field of text classification algorithms. A text classifier is obtained by preprocessing the text to be classified, performing data augmentation, establishing a vocabulary dictionary, matching word-vector ids, reducing word-vector dimensionality, splicing feature values, and randomly permuting and recombining the spliced feature values; the classifier is then used to classify the text. The algorithm effectively remedies the shortcomings of traditional algorithms and adapts well to the characteristics of power appeal texts, such as strong domain specificity, large differences in length, and mixed characters and numbers. It reduces model complexity while preserving classification accuracy, achieves fast and accurate classification of telephone appeal texts in the power field, and well satisfies the classification requirements.
Description
Technical Field
The invention relates to the technical field of text classification algorithms, in particular to a telephone appeal text classification algorithm facing the power field based on DenseNet.
Background
With the expansion and improvement of power grid construction, the number of power grid users keeps growing. To guarantee a stable power supply and improve user satisfaction, grid companies have built telephone feedback platforms through which users can consult on services, report power faults, evaluate the grid company, and raise opinions or complaints. To make better use of this platform for grid construction and service, the telephone appeal texts need to be classified. Existing methods generally classify texts with a convolutional neural network model, but such methods require a fairly comprehensive corpus and produce only a single kind of output feature, which is a serious drawback when classifying short texts such as power-field telephone appeals. To mitigate these drawbacks, the feature output can be enriched by adding max-pooling layers and using filters of different sizes, but this again requires a larger corpus, and the additional filter sizes increase the number of trainable parameters. Alternatively, the flow of text features can be changed: densely connected convolutional networks let shallow features flow into deep layers, increasing the diversity of feature learning and improving the classification effect. However, this deepens the network, requires training a huge number of parameters, is sensitive to the sparsity of text features, and classifies slowly, so it cannot well satisfy the requirement of classifying telephone appeal texts in the power field.
Disclosure of Invention
In order to overcome the defects and shortcomings in the prior art, the invention provides a DenseNet-based telephone appeal text classification algorithm facing the power field, which is low in model complexity and good in classification effect.
In order to achieve the technical purpose, the telephone appeal text classification algorithm facing the power field based on the DenseNet comprises the following steps,
s1, obtaining a telephone appeal text to be classified;
s2, preprocessing the telephone appeal text acquired in the step S1;
s3, performing data augmentation according to the telephone appeal text preprocessed in the step S2;
s4, establishing a vocabulary dictionary according to the data augmented in the step S3;
s5, performing word vector id matching according to the vocabulary dictionary established in the step S4;
s6, performing word vector dimension reduction on the word vectors matched in the step S5;
s7, adopting ResNet and DenseNet-BC to perform 1 x 1 convolution layer processing on the word vector subjected to dimension reduction in the step S6, and splicing eigenvalues with the same size obtained after convolution layer processing;
s8, randomly arranging the characteristic values spliced in the step S7 to obtain high-level characteristics;
and S9, classifying the telephone appeal texts by using the high-level features obtained in the step S8 to achieve the purpose of classification.
Preferably, the preprocessing performed on the telephone appeal text to be classified in the step S2 includes deduplication, denoising, stop-word removal and text word segmentation.
Preferably, in the step S2, the telephone appeal text to be classified is deduplicated using the Euclidean distance.
Preferably, in the step S2, the telephone appeal text to be classified is denoised based on the hash value of the DOM tree.
Preferably, in the step S2, stop-word removal from the telephone appeal text to be classified is realized by creating a stop-word lexicon dedicated to the power field.
Preferably, in the step S2, the jieba language model is used to segment the telephone appeal text to be classified so as to realize text word segmentation.
Preferably, in step S4, the vocabulary dictionary is built by using a double array trie tree method.
Preferably, in step S6, principal component analysis dimensionality reduction is performed on the one-hot form word vector.
Preferably, in step S7, the eigenvalues are spliced by the formula

C_k = [R_k, D_k],  x_{k+1} = H(C_k)

wherein R_k is the feature value obtained after processing with the 1 × 1 convolutional layer of ResNet, D_k is the feature value obtained after processing with the 1 × 1 convolutional layer of DenseNet-BC, C_k denotes the spliced feature value, x_{k+1} denotes the input of the (k + 1)-th layer, and H denotes the activation function.
After the technical scheme is adopted, the telephone appeal text classification algorithm facing the power field based on the DenseNet has the following advantages:
1. the DenseNet-based telephone appeal text classification algorithm facing the power field can effectively make up for the defects of the traditional algorithm, well adapt to the characteristics of strong speciality, large length difference, mixed characters and numbers and the like of the power appeal text, reduce the complexity of a model on the premise of ensuring the classification accuracy, realize the rapid and accurate classification of the telephone appeal text in the power field, and well meet the classification requirement.
The preprocessing mainly includes cleaning and normalization and aims to improve the quality of the text data, and thereby the execution efficiency of the classification. Data augmentation transforms the original data so that the amount of training data grows even when little raw data is available, alleviating the sparse-feature problem of power-field telephone appeal texts. Building the vocabulary dictionary from the augmented data effectively improves space utilization and efficiency and shortens training time. Word-vector id matching against the established dictionary assigns one word vector to each word, avoiding repeated training of word vectors and thus effectively reducing the parameters, complexity and training time of the network. Word-vector dimension reduction lowers the dimensionality of the word vectors, avoiding the excessive model parameters caused by over-high dimensionality and reducing both the parameter learning and the complexity of the model. Splicing the two groups of processed feature values of the same size realizes edge-feature expression and shallow-feature flow while reducing the flow of redundant features, cutting unnecessary feature learning and parameter iteration. Randomly recombining the spliced features prevents overfitting of the model, and using the resulting high-level features as input improves the classification accuracy of the model. Finally, the mixed high-level features serve as the input of the neural network to classify the telephone appeal texts, effectively improving classification speed and accuracy.
2. The preprocessing of the telephone appeal text to be classified comprises deduplication, denoising, stop-word removal and text word segmentation. Deduplication is realized with the Euclidean distance: the distance between texts is computed, and of any pair of near-duplicate texts only one is kept, improving deduplication accuracy. Denoising removes the parts of the text irrelevant to classification as noise, which helps improve classification accuracy. For stop-word removal, the words in the text are compared one by one with the words in the stop-word lexicon, and stop words are deleted from the text, improving data quality. The jieba language model is used to segment the text into words, so that reasonable data augmentation can later be performed on the segmented words.
3. Since a one-hot word vector has as many dimensions as there are words in the vocabulary, dimension reduction must be applied to word vectors of this form to avoid a dimension explosion. Principal component analysis realizes this by computing the eigenvalues of the covariance matrix of the word vectors, selecting the several largest eigenvalues as principal components, and multiplying the original word vectors by the eigenvector matrix corresponding to the selected eigenvalues to obtain the reduced word vectors.
Drawings
Fig. 1 is a schematic flowchart of a telephone appeal text classification algorithm for the power domain based on DenseNet according to an embodiment of the present invention;
FIG. 2 is a time-error rate line graph of EPCT text classification by several models in an embodiment of the present invention;
FIG. 3 is a time-error rate line graph of several models classifying THUCNews text in an embodiment of the present invention;
FIG. 4 is a graph of error rate versus training data size for several models training EPCT text in accordance with an embodiment of the present invention;
FIG. 5 is a graph of error rate versus training data size for training the THUCNews text by several models in an embodiment of the present invention;
FIG. 6 is a histogram of the computation time for classifying EPCT texts by several models according to the embodiment of the present invention;
fig. 7 is a histogram of operation time for several models to classify the THUCNews text according to the embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific examples. It is to be understood that the following terms "upper," "lower," "left," "right," "longitudinal," "lateral," "inner," "outer," "vertical," "horizontal," "top," "bottom," and the like are used merely to indicate an orientation or positional relationship relative to one another as illustrated in the drawings, merely to facilitate describing and simplifying the invention, and are not intended to indicate or imply that the device/component so referred to must have a particular orientation or be constructed and operated in a particular orientation, and therefore are not to be considered limiting of the invention.
Example one
As shown in fig. 1, a DenseNet-based telephone appeal text classification algorithm for the power field according to an embodiment of the present invention includes the following steps,
s1, obtaining a telephone appeal text to be classified;
s2, preprocessing the telephone appeal text acquired in the step S1;
s3, performing data augmentation according to the telephone appeal text preprocessed in the step S2;
s4, establishing a vocabulary dictionary according to the data augmented in the step S3;
s5, performing word vector id matching according to the vocabulary dictionary established in the step S4;
s6, performing word vector dimension reduction on the word vectors matched in the step S5;
s7, adopting ResNet and DenseNet-BC to perform 1 x 1 convolution layer processing on the word vector subjected to dimension reduction in the step S6, and splicing eigenvalues with the same size obtained after convolution layer processing;
s8, randomly arranging the characteristic values spliced in the step S7 to obtain high-level characteristics;
and S9, classifying the telephone appeal texts by using the high-level features obtained in the step S8 to achieve the purpose of classification.
In the step S1, the telephone appeal text to be classified may be obtained from the telephone feedback platform or by similar means.
In the step S2, the preprocessing of the telephone appeal text to be classified includes the following steps,
Step S201, deduplication: the Euclidean distance between texts is computed, and of any pair of near-duplicate texts only one is kept, improving deduplication accuracy;
Step S202, denoising: hash values based on the DOM tree are used to denoise the telephone appeal text to be classified, removing the parts of the text irrelevant to classification as noise;
Step S203, stop-word removal: a stop-word lexicon dedicated to the power field is created, the words in the text are compared one by one with the words in the lexicon, and stop words are deleted from the text, improving data quality;
Step S204, text word segmentation: the jieba language model is used to segment the telephone appeal text to be classified, so that reasonable data augmentation can be performed in the subsequent step S3 on the words obtained by segmentation.
In step S3, specialized vocabulary of the power field is added to the data so as to increase the model's ability to generalize over the data.
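The patent does not specify the augmentation operation beyond adding specialized power-field vocabulary; one common realization, sketched here under that assumption, is synonym substitution driven by a hypothetical domain term map:

```python
import random

def augment(tokens, domain_synonyms, n_copies=2, seed=0):
    # Step S3 sketch: generate extra training samples by replacing words
    # with power-domain synonyms wherever the (hypothetical) map has an
    # entry; words without an entry are kept unchanged.
    rng = random.Random(seed)
    copies = [list(tokens)]
    for _ in range(n_copies):
        copies.append([rng.choice(domain_synonyms.get(t, [t])) for t in tokens])
    return copies
```

Each original text thus yields `n_copies + 1` training samples, which is one way to achieve the goal stated above of enlarging the training data from little raw data.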
In the step S4, the vocabulary dictionary is built from the augmented data using the double-array trie method, which effectively improves space utilization and efficiency and shortens training time.
In the step S5, word-vector id matching is performed against the established vocabulary dictionary, i.e. one word vector is matched to each word; this avoids repeated training of word vectors and thus effectively reduces the parameters, complexity and training time of the network.
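Steps S4 and S5 can be illustrated with a plain dict standing in for the double-array trie (the trie yields the same word → id mapping with better space efficiency; the `<PAD>`/`<UNK>` ids and the padding behavior below are assumptions, not taken from the patent):

```python
def build_vocab(token_lists, pad="<PAD>", unk="<UNK>"):
    # Step S4 sketch: assign a unique integer id to every word seen in
    # the corpus. The patent stores the dictionary in a double-array
    # trie; a dict shows the same id-assignment logic.
    vocab = {pad: 0, unk: 1}
    for toks in token_lists:
        for t in toks:
            vocab.setdefault(t, len(vocab))
    return vocab

def to_ids(tokens, vocab, max_len=600):
    # Step S5 sketch: map each word to its id, truncating/padding to
    # max_len (600 matches the sentence-length limit in Table 2).
    ids = [vocab.get(t, vocab["<UNK>"]) for t in tokens[:max_len]]
    return ids + [vocab["<PAD>"]] * (max_len - len(ids))
```

Because every word is looked up rather than re-embedded from scratch, each word's vector is trained only once, which is the parameter saving the step describes.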
In step S6, since a one-hot word vector has as many dimensions as there are words in the vocabulary, dimension reduction must be applied to word vectors of this form to avoid a dimension explosion. Principal component analysis realizes this by computing the eigenvalues of the covariance matrix of the word vectors, selecting the several largest eigenvalues as principal components, and multiplying the original word vectors by the eigenvector matrix corresponding to the selected eigenvalues to obtain the reduced word vectors. Reducing the dimensionality of the word vectors avoids the excessive model parameters caused by over-high dimensionality, reduces the parameter learning of the model, and lowers model complexity.
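The principal-component reduction just described can be sketched directly in NumPy — centre the data, eigendecompose the covariance matrix, and project onto the top-k eigenvectors:

```python
import numpy as np

def pca_reduce(X, k):
    # Step S6 sketch: X holds one word vector per row.
    Xc = X - X.mean(axis=0)                          # centre the data
    cov = np.cov(Xc, rowvar=False)                   # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # k largest components
    return Xc @ top                                  # reduced word vectors
```

The choice of k (the number of retained principal components) is not given in the patent and would be tuned against the embedding size in Table 2.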
In the above step S7, the eigenvalues are spliced by the formula

C_k = [R_k, D_k],  x_{k+1} = H(C_k)

wherein R_k is the feature value obtained after processing with the 1 × 1 convolutional layer of ResNet, D_k is the feature value obtained after processing with the 1 × 1 convolutional layer of DenseNet-BC, C_k denotes the spliced feature value, x_{k+1} denotes the input of the (k + 1)-th layer, and H denotes the activation function.
Splicing the two groups of processed feature values of the same size realizes edge-feature expression and shallow-feature flow while reducing the flow of redundant features, cutting unnecessary feature learning and parameter iteration.
In step S8, the features after the stitching are randomly combined to prevent overfitting of the model, and the classification accuracy of the model can be improved by using the obtained high-level features as input.
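Steps S7 and S8 can be sketched together in NumPy. The 1 × 1 convolutions themselves are omitted: the sketch starts from two same-size feature maps R_k and D_k, and ReLU is assumed for the activation H (the patent only says "activation function"):

```python
import numpy as np

def splice_and_shuffle(R_k, D_k, seed=0):
    # Step S7: C_k = [R_k, D_k] -- concatenate the two same-size feature
    # maps along the channel axis, then apply H (assumed ReLU here),
    # giving x_{k+1} = H(C_k).
    C_k = np.concatenate([R_k, D_k], axis=-1)
    x_next = np.maximum(C_k, 0.0)
    # Step S8: randomly permute the spliced feature channels to obtain
    # the high-level features used as classifier input.
    rng = np.random.default_rng(seed)
    return x_next[..., rng.permutation(x_next.shape[-1])]
```

The fixed `seed` is only for reproducibility of the sketch; in training the permutation would be drawn fresh as a regularizer.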
In the step S9, a text classifier is formed from the high-level features obtained in the step S8; the mixed high-level features are used as the input of the neural network to classify the telephone appeal texts, effectively improving classification speed and accuracy.
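A softmax output layer is one straightforward realization of the final classifier in step S9 (the weight matrix W and bias b below are hypothetical trained parameters, not values from the patent):

```python
import numpy as np

def classify(features, W, b):
    # Step S9 sketch: linear layer + softmax over the class scores,
    # returning the most probable class index for each text.
    logits = features @ W + b
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    probs = e / e.sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1)
```

For the EPCT data set this layer would have 7 output columns, matching the 7 classes in Table 1.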
To examine the effect of the classification algorithm of the present embodiment, the present embodiment also designed the following experiment.
The hardware configuration of the experimental environment is 4 GB RAM and an Nvidia GeForce GTX 970M with 3 GB of video memory; the integrated environment is Anaconda3 (64-bit) + Python 3.6 + Spyder, and the experimental framework is TensorFlow 1.1.0.
Experimental data: to evaluate the model more thoroughly, data sets differing in field, scale and number of classes were selected for the experiment; their characteristics are shown in Table 1. THUCNews is a standard news text classification data set, while EPCT (electric power complaint text) consists of annual acceptance text data from the 95598 service hotline.
TABLE 1 data set characteristic information Table
| Name | Number of classes | Number of texts | Average text length | Training/validation/test | Field |
|---|---|---|---|---|---|
| THUCNews | 20 | 20000 | 236 | 12000/4000/4000 | News |
| EPCT | 7 | 5000 | 93 | 12000/4000/4000 | Power-field appeals |
Model parameter configuration: because the classification algorithm of the invention performs splicing on the premise that the feature values are of the same size, a 1 × 1 convolutional layer is added after the 3 × 3 convolutional layer and the 2 × 2 average-pooling layer to change the size of the feature map; the relevant model parameter values are set out in Table 2.
TABLE 2 parameter value settings of the model
| Parameter name | Parameter value |
|---|---|
| Embedding layer size | 64 |
| Sentence length upper limit | 600 |
| Number of words | 500 |
| Hidden layer size | 128 |
| Batch size | 64 |
| Number of … | 10 |
Evaluation indexes: the error rate, the F1 score and the model running time are adopted as evaluation indexes so as to evaluate the model from multiple angles.
Model comparison: the performance of the one-hot and word2vec word-vector models and of different combined models is first evaluated in terms of error rate; the comparison is given in Table 3. As Table 3 shows, the classification algorithm of this embodiment obtains better results than the other algorithms on both data sets; on the EPCT data in particular the error rate is as low as 7.63%.
TABLE 3 error Rate comparison of several model Process datasets
| Model combination | THUCNews (%) | EPCT (%) |
|---|---|---|
| one-hot + CNN | 11.47 | 9.5 |
| word2vec + CNN | 8.46 | 8.21 |
| one-hot + DenseNet | 8.34 | 7.92 |
| word2vec + DenseNet | 8.21 | 7.75 |
| Classification algorithm of this embodiment | 8.06 | 7.63 |
Next, for the splicing operation that this embodiment improves, the best-performing F1 scores before and after splicing are selected as the evaluation result, as shown in Table 4. As Table 4 shows, the model using the splicing operation achieves better results in multiple categories.
TABLE 4 comparison of F1 scores before and after splicing
In addition, as can be seen from fig. 2 and fig. 3, the improved classification algorithm of this embodiment also performs well in terms of model efficiency: its training error rate is as low as 7.5% when classifying the EPCT text and as low as 8.6% when classifying the THUCNews text.
The trend graph of the error rate and the scale of the training data for the training sets of different scales is shown in fig. 4 and 5. As can be seen from fig. 4 and 5, the model provided by the present invention has significant advantages in both data sets, and particularly, when processing an EPCT data set, a good effect can be obtained even when the training data size is not large.
Finally, the efficiency of the model is evaluated using its running time as the index. As shown in fig. 6 and fig. 7, compared with the one-hot + DenseNet model, the model of this embodiment shortens the running time by about 40% when processing the EPCT text and by about 35% when processing the THUCNews text. The classification algorithm of this embodiment can therefore classify telephone appeal texts in the power field quickly, accurately and efficiently, and better satisfies the classification requirements.
It can be understood that, in the present embodiment, reference may be made to the prior art for a specific method for performing deduplication processing on a to-be-classified telephone appeal text by using the euclidean distance.
It can be understood that, in the present embodiment, reference may be made to the prior art for a specific method for denoising a to-be-classified telephone appeal text by using a hash value based on a DOM tree.
It can be understood that, in the present embodiment, the prior art may be referred to for a specific method for segmenting the words of the telephone appeal text to be classified by using the jieba language model.
It is understood that, in the present embodiment, reference may be made to the prior art for a specific method for building a vocabulary dictionary by using the double-array trie method.
It is understood that, in the present embodiment, the specific method for performing principal component analysis dimension reduction on the word vector in the one-hot form may refer to the prior art.
Besides the preferred embodiments described above, the invention has other embodiments; various changes and modifications made by those skilled in the art without departing from the spirit of the invention shall fall within the scope of the invention as defined by the claims.
Claims (8)
1. A DenseNet-based telephone appeal text classification algorithm facing the power field is characterized by comprising the following steps,
s1, obtaining a telephone appeal text to be classified;
s2, preprocessing the telephone appeal text acquired in the step S1;
s3, performing data augmentation according to the telephone appeal text preprocessed in the step S2;
s4, establishing a vocabulary dictionary according to the data augmented in the step S3;
s5, performing word vector id matching according to the vocabulary dictionary established in the step S4;
s6, performing word vector dimension reduction on the word vectors matched in the step S5;
s7, adopting ResNet and DenseNet-BC to perform 1 × 1 convolution layer processing on the word vector after dimension reduction in the step S6, and splicing the eigenvalues of the same size obtained after convolution layer processing by Formula I:

C_k = [R_k, D_k],  x_{k+1} = H(C_k)   (Formula I)

wherein R_k is the feature value obtained after processing with the 1 × 1 convolutional layer of ResNet, D_k is the feature value obtained after processing with the 1 × 1 convolutional layer of DenseNet-BC, C_k denotes the spliced feature value, x_{k+1} denotes the input of the (k + 1)-th layer, and H denotes the activation function;
s8, randomly arranging the characteristic values spliced in the step S7 to obtain high-level characteristics;
and S9, classifying the telephone appeal texts by using the high-level features obtained in the step S8 to achieve the purpose of classification.
2. The telephone appeal text classification algorithm according to claim 1, wherein the preprocessing of the telephone appeal text to be classified in the step S2 includes deduplication, denoising, stop-word removal and text word segmentation.
3. The telephone appeal text classification algorithm according to claim 2, wherein in the step S2, the Euclidean distance is used to deduplicate the telephone appeal text to be classified.
4. The telephone appeal text classification algorithm according to claim 2, wherein in the step S2, the telephone appeal text to be classified is denoised using a hash value based on the DOM tree.
5. The telephone appeal text classification algorithm according to claim 2, wherein in the step S2, stop-word removal from the telephone appeal text to be classified is realized by creating a stop-word lexicon dedicated to the power field.
6. The telephone appeal text classification algorithm according to claim 2, wherein in the step S2, the jieba language model is adopted to segment the telephone appeal text to be classified so as to realize text word segmentation.
7. The telephony appeal text classification algorithm of claim 1, wherein the vocabulary dictionary is built in the step S4 by using a double array trie method.
8. The telephony appeal text classification algorithm according to claim 1, wherein in the step S6, principal component analysis dimensionality reduction is performed on the one-hot form word vector.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811208673.0A CN109376241B (en) | 2018-10-17 | 2018-10-17 | DenseNet-based telephone appeal text classification algorithm for power field |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811208673.0A CN109376241B (en) | 2018-10-17 | 2018-10-17 | DenseNet-based telephone appeal text classification algorithm for power field |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109376241A CN109376241A (en) | 2019-02-22 |
CN109376241B true CN109376241B (en) | 2020-09-18 |
Family
ID=65400603
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811208673.0A Active CN109376241B (en) | 2018-10-17 | 2018-10-17 | DenseNet-based telephone appeal text classification algorithm for power field |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109376241B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111050315B (en) * | 2019-11-27 | 2021-04-13 | 北京邮电大学 | Wireless transmitter identification method based on multi-core bidirectional network
CN113553844B (en) * | 2021-08-11 | 2023-07-25 | 四川长虹电器股份有限公司 | Domain identification method based on prefix tree features and convolutional neural network |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105975573A (en) * | 2016-05-04 | 2016-09-28 | 北京广利核系统工程有限公司 | KNN-based text classification method |
CN108009284A (en) * | 2017-12-22 | 2018-05-08 | 重庆邮电大学 | Legal text classification method using semi-supervised convolutional neural networks
CN108563791A (en) * | 2018-04-29 | 2018-09-21 | 华中科技大学 | Construction quality complaint text classification method and system
CN108596329A (en) * | 2018-05-11 | 2018-09-28 | 北方民族大学 | Three-dimensional model classification method based on end-to-end deep ensemble learning network
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1049030A1 (en) * | 1999-04-28 | 2000-11-02 | SER Systeme AG Produkte und Anwendungen der Datenverarbeitung | Classification method and apparatus |
- 2018-10-17: CN application CN201811208673.0A filed (granted as CN109376241B, legal status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN109376241A (en) | 2019-02-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104167208B (en) | Speaker recognition method and device | |
CN105022754B (en) | Object classification method and device based on social network | |
CN110175221B (en) | Spam SMS identification method combining word vectors with machine learning | |
AU2017243270A1 (en) | Method and device for extracting core words from commodity short text | |
CN107895000B (en) | Cross-domain semantic information retrieval method based on convolutional neural network | |
CN111460148A (en) | Text classification method and device, terminal equipment and storage medium | |
CN104102919A (en) | Image classification method capable of effectively preventing convolutional neural network from being overfit | |
CN110826618A (en) | Personal credit risk assessment method based on random forest | |
CN104538035B (en) | Fisher supervector-based speaker recognition method and system | |
CN107818173B (en) | Vector space model-based Chinese false comment filtering method | |
CN108846047A (en) | Convolutional-feature-based image retrieval method and system | |
CN113239690A (en) | Chinese text intention identification method based on integration of Bert and fully-connected neural network | |
CN109376241B (en) | DenseNet-based telephone appeal text classification algorithm for power field | |
CN112347246B (en) | Self-adaptive document clustering method and system based on spectrum decomposition | |
CN107526792A (en) | Rapid keyword extraction method for Chinese question sentences | |
CN112989052B (en) | Chinese news long text classification method based on combination-convolution neural network | |
CN111782804A (en) | TextCNN-based same-distribution text data selection method, system and storage medium | |
CN115456043A (en) | Classification model processing method, intent recognition method, device and computer equipment | |
CN114420151B (en) | Speech emotion recognition method based on parallel tensor decomposition convolutional neural network | |
CN111245820A (en) | Phishing website detection method based on deep learning | |
CN114266249A (en) | Mass text clustering method based on birch clustering | |
CN109858035A (en) | Sentiment classification method, device, electronic equipment and readable storage medium | |
CN113743079A (en) | Text similarity calculation method and device based on co-occurrence entity interaction graph | |
WO2023147299A1 (en) | Systems and methods for short text similarity based clustering | |
CN111125304A (en) | Word2vec-based automatic patent text classification method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||