CN107609113A - Automatic text classification method - Google Patents

Automatic text classification method

Info

Publication number
CN107609113A
CN107609113A (application CN201710822309.2A)
Authority
CN
China
Prior art keywords
text
noise reduction
feature
neural network
classification method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710822309.2A
Other languages
Chinese (zh)
Inventor
张媛钰
阿孜古丽
谢永红
张德政
栗辉
李春苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB filed Critical University of Science and Technology Beijing USTB
Priority to CN201710822309.2A priority Critical patent/CN107609113A/en
Publication of CN107609113A publication Critical patent/CN107609113A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides an automatic text classification method that can improve the accuracy and noise robustness of text classification. The method includes: obtaining text to be classified; building a denoising deep neural network model using denoising autoencoders and restricted Boltzmann machines; performing feature extraction on the obtained text to be classified using the built denoising deep neural network model; and, according to the feature extraction result, classifying automatically with a Softmax regression algorithm. The present invention relates to the field of text classification.

Description

Automatic text classification method
Technical field
The present invention relates to the field of text classification, and in particular to an automatic text classification method.
Background technology
Among networked information, text occupies a central position as the main carrier of information. Text classification (Text Classification, TC) uses a computer to automatically label and sort a text collection, or other entities and objects, according to a given classification system or standard. At present, deep learning has been successfully applied to all kinds of pattern classification problems, and methods based on deep learning can better mine the complex semantic relations contained in text.
In the prior art, however, text is typically classified with a single method: its feature extraction ability is weak and its handling of noisy data is poor, so the accuracy of the classification results is low.
Summary of the invention
The technical problem to be solved by the present invention is to provide an automatic text classification method, so as to solve the poor noise handling and weak feature extraction ability present in the prior art.
To solve the above technical problem, an embodiment of the present invention provides an automatic text classification method, including:
obtaining text to be classified;
building a denoising deep neural network model using denoising autoencoders and restricted Boltzmann machines;
performing feature extraction on the obtained text to be classified using the built denoising deep neural network model;
classifying automatically with a Softmax regression algorithm according to the feature extraction result.
Further, before performing feature extraction on the obtained text to be classified with the built denoising deep neural network model, the method also includes:
removing the noise data from the obtained text to be classified, wherein the noise data includes useless information and/or punctuation marks and special characters in the text.
Further, after removing the noise data from the obtained text to be classified, the method also includes:
performing word segmentation on the text data from which the noise data has been removed.
Further, after performing word segmentation on the text data from which the noise data has been removed, the method also includes:
removing stop words from the text data according to the word segmentation result, wherein the removed stop words are feature words without discriminative or predictive ability.
Further, after removing stop words from the text data, the method also includes:
collecting the feature words remaining after stop-word removal into a vocabulary;
calculating the weight of each feature word in the feature vocabulary and recording it in the feature vocabulary, wherein the feature vocabulary includes the correspondence among the texts, the feature words in each text, and the weights of those feature words;
representing each text in turn as a feature vector according to the resulting feature vocabulary.
Further, representing each text in turn as a feature vector according to the resulting feature vocabulary includes:
judging, according to a preset rule, whether a first text is a short text;
if so, expanding the features of the first text according to a short-text feature expansion algorithm, and representing the first text as a feature vector based on the expansion result;
if not, directly representing the first text as a feature vector according to the resulting feature vocabulary.
Further, after each text has been represented in turn as a feature vector according to the resulting feature vocabulary, the method also includes:
normalizing each value of the feature vector representation.
Further, the denoising deep neural network model includes:
a first denoising autoencoder at the bottom of the denoising deep neural network model, a second denoising autoencoder above the first denoising autoencoder, a first restricted Boltzmann machine above the second denoising autoencoder, and a second restricted Boltzmann machine above the first restricted Boltzmann machine.
Further, the first denoising autoencoder and the second denoising autoencoder form a denoising module, and the denoising module is used to denoise the feature vectors input to the denoising deep neural network model; wherein the layer holding the second denoising autoencoder is the output layer of the denoising module and, at the same time, the input layer of the first restricted Boltzmann machine;
the second restricted Boltzmann machine is the output layer of the denoising deep neural network model, and the output of the output layer is the feature representation of the text to be classified.
Further, the input of the denoising deep neural network model is a feature vector of fixed dimension.
The above technical scheme of the present invention has the following beneficial effects:
In the above scheme, a denoising deep neural network model is built using denoising autoencoders and restricted Boltzmann machines; feature extraction is performed on the obtained text to be classified using the built denoising deep neural network model; and classification is carried out automatically with a Softmax regression algorithm according to the feature extraction result. In this way, the features of the text to be classified are extracted by a denoising deep neural network model built from denoising autoencoders, which have strong noise resistance, and restricted Boltzmann machines, which have strong feature extraction ability, so the accuracy and noise robustness of text classification can be improved.
Brief description of the drawings
Fig. 1 is a flow diagram of the automatic text classification method provided by an embodiment of the present invention;
Fig. 2 is a flow diagram, provided by an embodiment of the present invention, of representing the obtained text to be classified as a feature vector;
Fig. 3 is a topology diagram of the denoising deep neural network model provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the working principle of the denoising deep neural network model provided by an embodiment of the present invention.
Detailed description of embodiments
To make the technical problem to be solved by the present invention, the technical scheme, and the advantages clearer, a detailed description is given below in conjunction with the accompanying drawings and specific embodiments.
Aiming at the poor noise handling and weak feature extraction ability of existing methods, the present invention provides an automatic text classification method.
As shown in Fig. 1, the automatic text classification method provided by an embodiment of the present invention includes:
S101, obtaining text to be classified;
S102, building a denoising deep neural network model (Denoising Deep Neural Network, DDNN) using denoising autoencoders (Denoising Auto Encoder, DAE) and restricted Boltzmann machines (Restricted Boltzmann Machine, RBM);
S103, performing feature extraction on the obtained text to be classified using the built denoising deep neural network model;
S104, classifying automatically with a Softmax regression algorithm according to the feature extraction result.
In the automatic text classification method described in the embodiment of the present invention, a denoising deep neural network model is built using denoising autoencoders and restricted Boltzmann machines; feature extraction is performed on the obtained text to be classified using the built model; and classification is carried out automatically with a Softmax regression algorithm according to the feature extraction result. In this way, the features of the text to be classified are extracted by a denoising deep neural network model built from denoising autoencoders with strong noise resistance and restricted Boltzmann machines with strong feature extraction ability, so the accuracy and noise robustness of text classification can be improved.
In a specific embodiment of the aforementioned automatic text classification method, further, before performing feature extraction on the obtained text to be classified with the built denoising deep neural network model, the method also includes:
removing the noise data from the obtained text to be classified, wherein the noise data includes useless information and/or punctuation marks and special characters in the text.
As shown in Fig. 2, in this embodiment, removing the noise data from the obtained text to be classified mainly means rejecting useless information, for example the author, edition number, and date that often appear in news text, the poster, posting date, relay station, and source that appear in web forums, and the various punctuation marks and special characters in the text.
In a specific embodiment of the aforementioned automatic text classification method, further, after removing the noise data from the obtained text to be classified, the method also includes:
performing word segmentation on the text data from which the noise data has been removed.
In this embodiment, Chinese text differs from English text: English words are separated by spaces, while Chinese is separated by punctuation only between sentences. Therefore, to extract word features, word segmentation is performed on the Chinese text from which the noise data has been removed.
As shown in Fig. 2, in this embodiment, the ICTCLAS word segmentation system of the Chinese Academy of Sciences, after secondary development, can be used for segmentation; the system can provide service in the language selected by the developer.
In a specific embodiment of the aforementioned automatic text classification method, further, after performing word segmentation on the text data from which the noise data has been removed, the method also includes:
removing stop words from the text data according to the word segmentation result, wherein the removed stop words are feature words without discriminative or predictive ability.
As shown in Fig. 2, in this embodiment, after word segmentation the text can still contain many useless feature words (also called stop words) that have no discriminative or predictive ability, for example auxiliary words, articles, conjunctions, pronouns, and prepositions; these useless feature words are therefore removed to reduce the dimensionality of the feature words. A minimal code sketch covering this step and the previous ones is given below.
In a specific embodiment of the aforementioned automatic text classification method, further, after removing stop words from the text data, the method also includes:
collecting the feature words remaining after stop-word removal into a vocabulary;
calculating the weight of each feature word in the feature vocabulary and recording it in the feature vocabulary, wherein the feature vocabulary includes the correspondence among the texts, the feature words in each text, and the weights of those feature words;
representing each text in turn as a feature vector according to the resulting feature vocabulary.
As shown in Fig. 2, in this embodiment, the feature words remaining after stop-word removal are collected into a vocabulary, and the weight of each feature word in the feature vocabulary is calculated and recorded in the feature vocabulary.
In this embodiment, the term frequency-inverse document frequency (Term Frequency-Inverse Document Frequency, TF-IDF) algorithm can be used to calculate the weight of each feature word in the feature vocabulary. The TF-IDF algorithm is expressed as:

TF_IDF = (TF / N_i) * lg(N / DF)   (1)

In formula (1), TF_IDF is the weight, TF is the term frequency of the feature word in the text, N_i is the total number of feature words in the text, N is the total number of texts, and DF is the number of texts containing the feature word. A code sketch of this weighting follows.
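A short sketch of formula (1) over a segmented corpus; the function and variable names are ours, not the patent's:

```python
# Compute the TF-IDF weight of formula (1), TF_IDF = (TF / Ni) * lg(N / DF),
# for every feature word of every segmented text.
import math
from collections import Counter

def tfidf_weights(docs):
    N = len(docs)                      # N: total number of texts
    df = Counter()                     # DF: number of texts containing the word
    for doc in docs:
        df.update(set(doc))
    tables = []
    for doc in docs:
        tf = Counter(doc)              # TF: term frequency in this text
        Ni = len(doc)                  # Ni: total feature words in this text
        tables.append({w: (tf[w] / Ni) * math.log10(N / df[w]) for w in tf})
    return tables

docs = [["体育", "比赛", "冠军"], ["经济", "市场", "比赛"]]
print(tfidf_weights(docs))
```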
According to the resulting feature vocabulary, each text is represented in turn as a feature vector, as shown in Table 1, the feature vector space model of the texts; in Table 1, d_m denotes a single text, t_j denotes a feature word, and w_ij denotes the weight of feature word t_j in text d_i.
Table 1. Feature vector space model of the texts

      t_1    ...    t_j    ...    t_n
d_1   w_11   ...    w_1j   ...    w_1n
...   ...    ...    ...    ...    ...
d_i   w_i1   ...    w_ij   ...    w_in
...   ...    ...    ...    ...    ...
d_m   w_m1   ...    w_mj   ...    w_mn
As shown in Fig. 2 in the embodiment of aforementioned texts automatic classification method, further, the basis obtains The feature vocabulary arrived, each text is represented sequentially as to the form of characteristic vector to be included:
According to default rule, judge whether the first text is short text;
If so, then expanding algorithm according to short essay eigen, feature expansion is carried out to first text, and feature based expands Result is filled, first text representation is characterized to the form of vector;
If it is not, first text representation is directly then characterized according to obtained feature vocabulary by vectorial form.
In the present embodiment, short text can be determined whether it is according to the size of text, if for example, the size of text is less than Default threshold value, then the text is short text..
In this embodiment, suppose the training data set of texts is D = {d_i}, on which the classification algorithm is built, where d_i = {t_k}; in a short text the number of feature words t_k is typically small. The short-text feature expansion algorithm is broadly divided into two steps (a code sketch of both steps is given after their description below):
A1. First, features that are highly indicative of a class are selected to construct the required feature space T, thereby reducing the dimensionality of the original feature space D.
When constructing the feature space T, the features in T should as far as possible be distributed across every short text, i.e. every sample space should be directly related to the built feature space T; and, to keep the feature distribution uniform and avoid sparsity, the selected features should be contained in as many short texts as possible.
In summary, the construction of the feature space T must consider both the difference among classes in the number of texts and the degree of correlation between features and classes; in each class, the features contributing most to class discrimination are selected to characterize the corresponding class. For this purpose, the within-class variance DI_ic, which characterizes the distribution of a feature t_k within class C_i, can be used as the measure:
DI_ic = sqrt( (1/m) * Σ_{j=1..m} ( f(t_ij) - f̄(t_ij) )^2 )   (2)

In formula (2), m is the total number of texts in class C_i, f(t_ij) is the number of occurrences of feature t_ij in the j-th text of class C_i, and f̄(t_ij) is the average TF-IDF value of feature t_ij over all texts of class C_i.
The smaller the within-class variance DI_ic of feature t_ij, the more uniform its distribution within the class, and the better its ability to distinguish the classes. Next, the DI_ic values of each class are sorted from large to small, the top K features are extracted according to a proportion, and finally the non-duplicated features of all classes are merged to form the feature space T.
A2. Then, for any d_i, features with high similarity to its t_k are selected from the feature space T for expansion.
On the basis of the built feature space T, the short-text features can be expanded. The principle is to expand a short text using the features that have the greatest degree of correlation with the features t it already contains. The common method for calculating the degree of correlation between features mainly uses mutual information, which can directly and intuitively reflect the correlation between features and classes; its drawback, however, is its sensitivity to the inaccuracies brought by sparse data, which may make the mutual information between features negative and cause trouble for later processing.
In this embodiment, a modified calculation formula based on mutual information is used, which to a certain extent avoids the problem that the binary mutual information of two-tuples formed by low-frequency words is higher than that of two-tuples formed by high-frequency words, and which weakens the influence of data sparsity on the feature degree of correlation:
In formula (3), R (ti,tj) represent feature ti、tjBetween the degree of correlation, P (ti,tj) represent in data set, feature ti And tjThe probability occurred simultaneously, P (ti) represent feature tiThe probability occurred in data set, P (tj) represent feature tjIn data set The probability of middle appearance.
In a specific embodiment of the aforementioned automatic text classification method, further, after each text has been represented in turn as a feature vector according to the resulting feature vocabulary, the method also includes:
normalizing each value of the feature vector representation.
As shown in Fig. 2 in the present embodiment, because gap of the input data on the order of magnitude, the data of input can be caused to go out Existing incompatibility problem, so needing the vector characteristics numerical value for inputting the noise reduction deep neural network model of structure being normalized Processing, specifically:The each numerical value for being expressed as vector characteristics form is normalized according to formula (4):
In formula (4), xi、ViRepresent to normalize forward and backward characteristic value respectively, V represents the characteristic vector after normalization, xminWith xmaxIt is the minimum value and maximum for the vector characteristics intermediate value for inputting noise reduction deep neural network model respectively.
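As a one-function sketch, formula (4) is ordinary min-max scaling:

```python
# Min-max normalization of formula (4): Vi = (xi - xmin) / (xmax - xmin).
import numpy as np

def min_max_normalize(x):
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

print(min_max_normalize([2.0, 5.0, 8.0]))   # -> [0.  0.5 1. ]
```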
In a specific embodiment of the aforementioned automatic text classification method, further, the denoising deep neural network model includes:
a first denoising autoencoder at the bottom of the denoising deep neural network model, a second denoising autoencoder above the first denoising autoencoder, a first restricted Boltzmann machine above the second denoising autoencoder, and a second restricted Boltzmann machine above the first restricted Boltzmann machine.
Further, the first denoising autoencoder and the second denoising autoencoder form a denoising module used to denoise the feature vectors input to the denoising deep neural network model; the layer holding the second denoising autoencoder is the output layer of the denoising module and, at the same time, the input layer of the first restricted Boltzmann machine;
the second restricted Boltzmann machine is the output layer of the denoising deep neural network model, and the output of the output layer is the feature representation of the text to be classified.
Further, the input of the denoising deep neural network model is a feature vector of fixed dimension.
In this embodiment, the automatic text classification method described herein was studied experimentally on the Chinese text corpus compiled by Fudan University. The corpus contains nearly ten thousand documents in 20 categories, including sports, politics, medicine, art, military affairs, and economics; the distribution of the training and test sets in the data set is shown in Table 2.
Table 2. Distribution of the training and test sets in the data set
Category   Class name    Training set   Test set
C1         Computer      900            300
C2         Environment   900            300
C3         Agriculture   900            300
C4         Economy       900            300
C5         Politics      900            300
C6         Sports        900            300
...        ...           ...            ...
In this embodiment, the denoising deep neural network model serves as the main sub-module of the feature extraction module; it is built mainly from two major components, the denoising autoencoder (DAE) and the restricted Boltzmann machine (RBM).
After weighing the complexity of training against the efficiency of the model, this embodiment uses 2 layers of denoising autoencoders (DAE) and 2 layers of restricted Boltzmann machines (RBM). The topology of the denoising deep neural network model is shown in Fig. 3; the model includes: a first denoising autoencoder (DAE1) at the bottom of the model, a second denoising autoencoder (DAE2) above the first denoising autoencoder, a first restricted Boltzmann machine (RBM1) above the second denoising autoencoder, and a second restricted Boltzmann machine (RBM2) above the first restricted Boltzmann machine.
In this embodiment, the first denoising autoencoder (DAE1) and the second denoising autoencoder (DAE2) form a denoising module used to denoise the feature vectors input to the model; the layer holding the second denoising autoencoder (DAE2) is the output layer of the denoising module and, at the same time, the input layer of the first restricted Boltzmann machine (RBM1); the second restricted Boltzmann machine (RBM2) is the output layer of the model, and its output is the feature representation of the text to be classified.
In this embodiment, the feature extraction module first denoises the input original feature vector with the denoising module, which sits at the bottom of the whole model so as to make full use of the denoising property of the denoising autoencoder: its unsupervised learning ability reconstructs the input original feature vector, achieving a denoising of the input signal, so that the signal entering the network after the denoising autoencoders is purer, reducing the influence of noisy data on the subsequently built classifier.
In this embodiment, the first and second restricted Boltzmann machines have strong feature extraction ability; located on the upper layers of the model, they can learn the complicated regularities in the data, so that the extracted high-level features are more representative. After further feature extraction by the RBMs, the more representative features finally extracted are fed into the classifier, which is expected to give the best classification results.
In this embodiment, a suitable corruption ratio and learning rate are selected through experiments to improve the performance of the feature extraction module.
In this embodiment, the working process of the denoising deep neural network model (DDNN) is shown in Fig. 4. The model comprises four layers in total: DAE1, DAE2, RBM1, and RBM2. v is the visible layer and at the same time the input layer of the model; every text is represented by a vector of fixed dimension. W_1, W_2, W_3, and W_4 denote the connection weights between the layers, and h_1, h_2, h_3, and h_4 denote the hidden layers, corresponding to DAE1, DAE2, RBM1, and RBM2. There are no connections between nodes of the same layer, but the nodes of every two adjacent layers are fully connected.
In this embodiment, the input of the model is a vector of fixed dimension; the denoising module composed of the two layers DAE1 and DAE2 is trained first, and the DAE2 layer is the output layer of the denoising module as well as the input layer of the subsequent RBM1 layer. RBM2 is the output layer of the model and yields the feature representation of the text: in contrast to the visible layer, which is the low-level feature representation of the text data, this layer is the high-level feature representation on which all subsequent text classification tasks are computed. A sketch of this layer-wise construction follows.
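The four-layer DDNN layout can be sketched as below. This is a minimal NumPy illustration of the idea, with tied-weight denoising autoencoders, CD-1 restricted Boltzmann machines, and greedy layer-wise pretraining; the layer sizes, learning rates, and corruption ratio are assumptions, not the patent's tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DAE:
    """Denoising autoencoder with tied weights: corrupt the input, then
    reconstruct the clean input (cross-entropy gradient, sigmoid output)."""
    def __init__(self, n_in, n_hid, noise=0.3, lr=0.1):
        self.W = rng.normal(0.0, 0.01, (n_in, n_hid))
        self.b = np.zeros(n_hid)   # hidden bias
        self.c = np.zeros(n_in)    # visible bias
        self.noise, self.lr = noise, lr

    def encode(self, x):
        return sigmoid(x @ self.W + self.b)

    def train_step(self, x):
        x_noisy = x * (rng.random(x.shape) > self.noise)   # masking corruption
        h = self.encode(x_noisy)
        x_rec = sigmoid(h @ self.W.T + self.c)
        d_out = x_rec - x                                  # output-layer delta
        d_hid = (d_out @ self.W) * h * (1.0 - h)           # hidden-layer delta
        B = len(x)
        self.W -= self.lr * (d_out.T @ h + x_noisy.T @ d_hid) / B
        self.c -= self.lr * d_out.mean(axis=0)
        self.b -= self.lr * d_hid.mean(axis=0)

class RBM:
    """Restricted Boltzmann machine trained with one step of contrastive
    divergence (CD-1)."""
    def __init__(self, n_vis, n_hid, lr=0.1):
        self.W = rng.normal(0.0, 0.01, (n_vis, n_hid))
        self.b = np.zeros(n_hid)
        self.c = np.zeros(n_vis)
        self.lr = lr

    def encode(self, v):
        return sigmoid(v @ self.W + self.b)

    def train_step(self, v0):
        h0 = self.encode(v0)
        h0_sample = (h0 > rng.random(h0.shape)).astype(float)
        v1 = sigmoid(h0_sample @ self.W.T + self.c)        # reconstruction
        h1 = self.encode(v1)
        B = len(v0)
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / B
        self.b += self.lr * (h0 - h1).mean(axis=0)
        self.c += self.lr * (v0 - v1).mean(axis=0)

def pretrain_ddnn(X, sizes=(1000, 600, 300, 150, 50), epochs=10):
    """Greedy layer-wise pretraining: DAE1 -> DAE2 (denoising module), then
    RBM1 -> RBM2; the final activations are the high-level text features."""
    layers = [DAE(sizes[0], sizes[1]), DAE(sizes[1], sizes[2]),
              RBM(sizes[2], sizes[3]), RBM(sizes[3], sizes[4])]
    data = X
    for layer in layers:
        for _ in range(epochs):
            layer.train_step(data)
        data = layer.encode(data)
    return layers, data

X = rng.random((8, 1000))   # 8 illustrative 1000-dim TF-IDF text vectors
layers, features = pretrain_ddnn(X, epochs=2)
print(features.shape)       # -> (8, 50): the high-level text features
```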
In this embodiment, the recognition and classification module classifies with a Softmax regression algorithm whose input is the high-level features output by the denoising deep neural network model (DDNN).
In this embodiment, suppose the text data set contains n texts from k classes, and the training set is denoted {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n-1), y^(n-1)), (x^(n), y^(n))}, where x^(i) is the i-th training text and the class label y can take several different values, y^(i) ∈ {1, 2, ..., k-1, k}. The main purpose of the Softmax regression algorithm is, for a given training sample x, to be able to compute the probability that x belongs to each class label. The hypothesis function is formula (5):

h_θ(x^(i)) = (1 / Σ_{j=1..k} e^{θ_j^T x^(i)}) * [ e^{θ_1^T x^(i)}, e^{θ_2^T x^(i)}, ..., e^{θ_k^T x^(i)} ]^T   (5)

In formula (5), each component of the vector h_θ(x^(i)) is the probability that text x^(i) belongs to the corresponding class; so that the components of the vector sum to 1, the probability values are normalized. θ_1, θ_2, ..., θ_{k-1}, θ_k ∈ R^{n+1}, where R^{n+1} is the (n+1)-dimensional real space; each θ_j here is an (n+1)-dimensional vector, the parameters of Softmax itself, which weight every dimension of a sample's attributes to obtain a number θ_j^T x^(i); the superscript T denotes transposition.
The cost function used in the Softmax regression algorithm is formula (6):

J(θ) = -(1/n) [ Σ_{i=1..n} Σ_{j=1..k} 1{y^(i) = j} * log( e^{θ_j^T x^(i)} / Σ_{l=1..k} e^{θ_l^T x^(i)} ) ] + (λ/2) Σ_{i=1..k} Σ_{j=0..n} θ_ij^2   (6)

In formula (6), 1{ } is the indicator function: when the expression in the braces is true, the value of the function is 1; conversely, when the expression in the braces is false, the value of the function is 0. θ_ij is the j-th dimension of the i-th parameter vector of Softmax, and (λ/2) Σ Σ θ_ij^2 is a penalty term. Because the cost function before the plus sign is not a strictly convex function, a weight decay term is added after it to prevent multiple extrema from appearing. When the Softmax regression model parameter λ > 0, the cost function becomes a strictly convex function, which prevents overfitting to the training samples and finally yields the global optimal solution.
The extremum of the cost function is found by gradient descent; the gradient of the cost function is formula (7):

∇_{θ_j} J(θ) = -(1/n) Σ_{i=1..n} [ x^(i) * ( 1{y^(i) = j} - P(y^(i) = j | x^(i); θ) ) ] + λ θ_j   (7)

Once θ is obtained, the previously assumed function h_θ(x) is obtained, so that the probability of each class to which text x belongs can be computed from h_θ(x); the class with the largest probability value is the final class predicted by the Softmax regression algorithm. A code sketch of this classifier follows.
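A compact NumPy sketch of formulas (5)-(7), with batch gradient descent on the weight-decayed cost; the learning rate and λ are illustrative assumptions:

```python
# Softmax regression: hypothesis (5), regularized cost gradient (7),
# prediction by the largest class probability.
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)       # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)    # rows of formula (5)

def train_softmax(X, y, k, lr=0.5, lam=1e-4, epochs=200):
    n = len(X)
    Xb = np.hstack([np.ones((n, 1)), X])       # bias column: theta in R^(d+1)
    Theta = np.zeros((k, Xb.shape[1]))
    Y = np.eye(k)[y]                           # one-hot labels: 1{y(i) = j}
    for _ in range(epochs):
        P = softmax(Xb @ Theta.T)              # P(y = j | x; theta)
        grad = -(Y - P).T @ Xb / n + lam * Theta   # formula (7)
        Theta -= lr * grad
    return Theta

def predict(Theta, X):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return np.argmax(Xb @ Theta.T, axis=1)     # class with largest probability

X = np.array([[0.1, 0.9], [0.8, 0.2], [0.2, 0.8], [0.9, 0.1]])
y = np.array([0, 1, 0, 1])
print(predict(train_softmax(X, y, k=2), X))    # -> [0 1 0 1]
```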
In this embodiment, with data without added noise as input, the automatic text classification method proposed in the present invention was compared with single-algorithm models; the resulting text classification accuracies are shown in Table 3.
Table 3. Classification accuracy (%) of different algorithms on data without added noise
With data with added noise as input, the automatic text classification method proposed in the present invention was compared with single-algorithm models; the resulting text classification accuracies are shown in Table 4.
Table 4. Classification accuracy (%) of different algorithms on data with added noise
In Tables 3 and 4, KNN, BPNN, and SVM denote the K-nearest neighbors, back-propagation neural network, and support vector machine algorithms, respectively.
It should be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relation or order between these entities or operations.
The above is the preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

  1. An automatic text classification method, characterized by including:
    obtaining text to be classified;
    building a denoising deep neural network model using denoising autoencoders and restricted Boltzmann machines;
    performing feature extraction on the obtained text to be classified using the built denoising deep neural network model;
    classifying automatically with a Softmax regression algorithm according to the feature extraction result.
  2. The automatic text classification method according to claim 1, characterized in that, before performing feature extraction on the obtained text to be classified with the built denoising deep neural network model, the method also includes:
    removing the noise data from the obtained text to be classified, wherein the noise data includes useless information and/or punctuation marks and special characters in the text.
  3. The automatic text classification method according to claim 2, characterized in that, after removing the noise data from the obtained text to be classified, the method also includes:
    performing word segmentation on the text data from which the noise data has been removed.
  4. The automatic text classification method according to claim 3, characterized in that, after performing word segmentation on the text data from which the noise data has been removed, the method also includes:
    removing stop words from the text data according to the word segmentation result, wherein the removed stop words are feature words without discriminative or predictive ability.
  5. The automatic text classification method according to claim 4, characterized in that, after removing stop words from the text data, the method also includes:
    collecting the feature words remaining after stop-word removal into a vocabulary;
    calculating the weight of each feature word in the feature vocabulary and recording it in the feature vocabulary, wherein the feature vocabulary includes the correspondence among the texts, the feature words in each text, and the weights of those feature words;
    representing each text in turn as a feature vector according to the resulting feature vocabulary.
  6. The automatic text classification method according to claim 5, characterized in that representing each text in turn as a feature vector according to the resulting feature vocabulary includes:
    judging, according to a preset rule, whether a first text is a short text;
    if so, expanding the features of the first text according to a short-text feature expansion algorithm, and representing the first text as a feature vector based on the expansion result;
    if not, directly representing the first text as a feature vector according to the resulting feature vocabulary.
  7. The automatic text classification method according to claim 5, characterized in that, after each text has been represented in turn as a feature vector according to the resulting feature vocabulary, the method also includes:
    normalizing each value of the feature vector representation.
  8. The automatic text classification method according to claim 1, characterized in that the denoising deep neural network model includes:
    a first denoising autoencoder at the bottom of the denoising deep neural network model, a second denoising autoencoder above the first denoising autoencoder, a first restricted Boltzmann machine above the second denoising autoencoder, and a second restricted Boltzmann machine above the first restricted Boltzmann machine.
  9. The automatic text classification method according to claim 8, characterized in that the first denoising autoencoder and the second denoising autoencoder form a denoising module used to denoise the feature vectors input to the denoising deep neural network model; wherein the layer holding the second denoising autoencoder is the output layer of the denoising module and, at the same time, the input layer of the first restricted Boltzmann machine;
    the second restricted Boltzmann machine is the output layer of the denoising deep neural network model, and the output of the output layer is the feature representation of the text to be classified.
  10. The automatic text classification method according to claim 1, characterized in that the input of the denoising deep neural network model is a feature vector of fixed dimension.
CN201710822309.2A 2017-09-13 2017-09-13 Automatic text classification method Pending CN107609113A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710822309.2A CN107609113A (en) 2017-09-13 2017-09-13 Automatic text classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710822309.2A CN107609113A (en) 2017-09-13 2017-09-13 Automatic text classification method

Publications (1)

Publication Number Publication Date
CN107609113A true CN107609113A (en) 2018-01-19

Family

ID=61063938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710822309.2A Pending CN107609113A (en) 2017-09-13 2017-09-13 Automatic text classification method

Country Status (1)

Country Link
CN (1) CN107609113A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955856A (en) * 2012-11-09 2013-03-06 北京航空航天大学 Chinese short text classification method based on characteristic extension
CN104834747A (en) * 2015-05-25 2015-08-12 中国科学院自动化研究所 Short text classification method based on convolutional neural network
KR101681109B1 (en) * 2015-10-01 2016-11-30 한국외국어대학교 연구산학협력단 An automatic method for classifying documents by using presentative words and similarity
CN105912716A (en) * 2016-04-29 2016-08-31 国家计算机网络与信息安全管理中心 Short text classification method and apparatus
CN106372640A (en) * 2016-08-19 2017-02-01 中山大学 Character frequency text classification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
周超 (Zhou Chao): "Research on Text Classification Based on a Deep Learning Hybrid Model", China Master's Theses Full-text Database *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108447565A (en) * 2018-03-23 2018-08-24 北京工业大学 A small-for-gestational-age infant disease prediction method based on an improved denoising autoencoder
CN108447565B (en) * 2018-03-23 2021-10-08 北京工业大学 Small-for-gestational-age infant prediction method based on an improved denoising autoencoder
US11488055B2 (en) 2018-07-26 2022-11-01 International Business Machines Corporation Training corpus refinement and incremental updating
CN109308471A (en) * 2018-09-29 2019-02-05 河海大学常州校区 An EMG feature extraction method
CN111310451A (en) * 2018-12-10 2020-06-19 北京沃东天骏信息技术有限公司 Sensitive dictionary generation method and device, storage medium and electronic equipment
CN109829054A (en) * 2019-01-17 2019-05-31 齐鲁工业大学 A text classification method and system
CN112214598A (en) * 2020-09-27 2021-01-12 中润普达(十堰)大数据中心有限公司 Cognitive system based on hair condition
CN112905795A (en) * 2021-03-11 2021-06-04 证通股份有限公司 Text intention classification method, device and readable medium

Similar Documents

Publication Publication Date Title
CN107609113A (en) Automatic text classification method
CN108304468B (en) Text classification method and text classification device
Kadhim et al. Text document preprocessing and dimension reduction techniques for text document clustering
CN105389379A (en) Rubbish article classification method based on distributed feature representation of text
CN107301171A (en) A kind of text emotion analysis method and system learnt based on sentiment dictionary
CN101714135B (en) Emotional orientation analytical method of cross-domain texts
Romanov et al. Application of natural language processing algorithms to the task of automatic classification of Russian scientific texts
Dasari et al. Text categorization and machine learning methods: current state of the art
CN106570170A (en) Text classification and naming entity recognition integrated method and system based on depth cyclic neural network
CN113157859A (en) Event detection method based on upper concept information
Hossain et al. Authorship classification in a resource constraint language using convolutional neural networks
Balli et al. Sentimental analysis of Twitter users from Turkish content with natural language processing
Ong et al. Sentiment analysis of informal Malay tweets with deep learning
CN107463715A (en) English social media account number classification method based on information gain
Nguyen et al. An ensemble of shallow and deep learning algorithms for Vietnamese sentiment analysis
Soni et al. A comprehensive study for the Hindi language to implement supervised text classification techniques
Mamoun et al. Arabic text stemming: Comparative analysis
Dhar et al. Bengali news headline categorization using optimized machine learning pipeline
Zobeidi et al. Effective text classification using multi-level fuzzy neural network
CN110348497A (en) A kind of document representation method based on the building of WT-GloVe term vector
KR101240330B1 (en) System and method for mutidimensional document classification
CN113761123A (en) Keyword acquisition method and device, computing equipment and storage medium
Alharbi et al. Neural networks based on Latent Dirichlet Allocation for news web page classifications
Susmitha et al. Performance assessment using supervised machine learning algorithms of opinion mining on social media dataset
Wikarsa et al. Automatic Generation Of Word-Emotion Lexicon For Multiple Sentiment Polarities On Social Media Texts

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180119