CN107609113A - Automatic text classification method
- Publication number
- CN107609113A CN107609113A CN201710822309.2A CN201710822309A CN107609113A CN 107609113 A CN107609113 A CN 107609113A CN 201710822309 A CN201710822309 A CN 201710822309A CN 107609113 A CN107609113 A CN 107609113A
- Authority
- CN
- China
- Prior art keywords
- text
- noise reduction
- feature
- neural network
- classification method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06N3/088—Non-supervised learning, e.g. competitive learning
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/047—Probabilistic or stochastic networks
(all under G—Physics › G06—Computing; calculating or counting › G06N—Computing arrangements based on specific computational models › G06N3/00—Computing arrangements based on biological models › G06N3/02—Neural networks)
Abstract
The present invention provides an automatic text classification method that can improve the accuracy and noise robustness of text classification. The method comprises: obtaining a text to be classified; building a denoising deep neural network model from a denoising autoencoder and a restricted Boltzmann machine; performing feature extraction on the obtained text to be classified with the constructed denoising deep neural network model; and classifying automatically with a Softmax regression algorithm according to the feature extraction result. The present invention relates to the field of text classification.
Description
Technical field
The present invention relates to the field of text classification, and in particular to an automatic text classification method.
Background technology
Among network information, text occupies a central position as the main carrier of information. Text classification (TC) uses a computer to automatically label a text collection, or other entities and objects, according to a given classification system or standard. Deep learning has now been successfully applied to a variety of pattern classification problems, and methods based on deep learning can better mine the complex semantic relations contained in text.
In the prior art, however, text is typically classified with a single method whose feature extraction ability is weak and whose handling of noisy data is poor, so the classification results have relatively low accuracy.
Summary of the invention
The technical problem to be solved by the present invention is to provide an automatic text classification method that addresses the weak noise handling and weak feature extraction abilities of the prior art.
To solve the above technical problem, an embodiment of the present invention provides an automatic text classification method, comprising:
obtaining a text to be classified;
building a denoising deep neural network model from a denoising autoencoder and a restricted Boltzmann machine;
performing feature extraction on the obtained text to be classified with the constructed denoising deep neural network model;
classifying automatically with a Softmax regression algorithm according to the feature extraction result.
Further, before performing feature extraction on the obtained text to be classified with the constructed denoising deep neural network model, the method also comprises:
removing the noise data from the obtained text to be classified, wherein the noise data comprises useless information and/or punctuation marks and special characters in the text.
Further, after removing the noise data from the obtained text to be classified, the method also comprises:
performing word segmentation on the text data from which the noise data has been removed.
Further, after performing word segmentation on the text data from which the noise data has been removed, the method also comprises:
removing stop words from the text data according to the word segmentation result, wherein the removed stop words are feature words with no discriminative or predictive ability.
Further, after removing stop words from the text data, the method also comprises:
turning the feature words obtained after stop-word removal into a vocabulary;
calculating the weight of each feature word in the feature vocabulary and recording it there, wherein the feature vocabulary comprises the feature words of each text and the correspondence between each feature word and its weight in each text;
representing each text in turn as a feature vector according to the obtained feature vocabulary.
Further, representing each text in turn as a feature vector according to the obtained feature vocabulary comprises:
judging, according to a preset rule, whether a first text is a short text;
if so, expanding the features of the first text according to a short-text feature expansion algorithm, and representing the first text as a feature vector based on the expansion result;
if not, representing the first text as a feature vector directly according to the obtained feature vocabulary.
Further, after representing each text in turn as a feature vector according to the obtained feature vocabulary, the method also comprises:
normalizing each value of the feature-vector representation.
Further, the denoising deep neural network model comprises:
a first denoising autoencoder at the bottom of the denoising deep neural network model, a second denoising autoencoder above the first denoising autoencoder, a first restricted Boltzmann machine above the second denoising autoencoder, and a second restricted Boltzmann machine above the first restricted Boltzmann machine.
Further, the first denoising autoencoder and the second denoising autoencoder form a denoising module, which denoises the feature vectors input to the denoising deep neural network model; the layer of the second denoising autoencoder is both the output layer of the denoising module and the input layer of the first restricted Boltzmann machine;
the second restricted Boltzmann machine is the output layer of the denoising deep neural network model, and the output of this layer is the feature representation of the text to be classified.
Further, the input of the denoising deep neural network model is a feature vector of fixed dimension.
The above technical scheme of the present invention has the following beneficial effects:
in the above scheme, a denoising deep neural network model is built from a denoising autoencoder and a restricted Boltzmann machine; features are extracted from the obtained text to be classified with the constructed denoising deep neural network model; and classification is performed automatically with a Softmax regression algorithm according to the feature extraction result. Because the features of the text to be classified are extracted by a denoising deep neural network model built from denoising autoencoders, which have strong noise resistance, and restricted Boltzmann machines, which have strong feature extraction ability, the accuracy and noise robustness of text classification can be improved.
Brief description of the drawings
Fig. 1 is a flow diagram of the automatic text classification method provided by an embodiment of the present invention;
Fig. 2 is a flow diagram, provided by an embodiment of the present invention, of representing the obtained text to be classified as a feature vector;
Fig. 3 is a topological diagram of the denoising deep neural network model provided by an embodiment of the present invention;
Fig. 4 illustrates the principle of the denoising deep neural network model provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the technical problem to be solved, the technical scheme and the advantages of the present invention clearer, they are described in detail below with reference to the accompanying drawings and specific embodiments.
The present invention addresses the weak noise handling and weak feature extraction abilities of existing methods by providing an automatic text classification method.
As shown in Fig. 1, the automatic text classification method provided by an embodiment of the present invention comprises:
S101, obtaining a text to be classified;
S102, building a denoising deep neural network (DDNN) model from a denoising autoencoder (DAE) and a restricted Boltzmann machine (RBM);
S103, performing feature extraction on the obtained text to be classified with the constructed denoising deep neural network model;
S104, classifying automatically with a Softmax regression algorithm according to the feature extraction result.
In the automatic text classification method of the embodiment of the present invention, a denoising deep neural network model is built from a denoising autoencoder and a restricted Boltzmann machine; feature extraction is performed on the obtained text to be classified with the constructed denoising deep neural network model; and classification is performed automatically with a Softmax regression algorithm according to the feature extraction result. In this way, because the features of the text to be classified are extracted by a denoising deep neural network model built from denoising autoencoders, which have strong noise resistance, and restricted Boltzmann machines, which have strong feature extraction ability, the accuracy and noise robustness of text classification can be improved.
In an embodiment of the aforementioned automatic text classification method, further, before performing feature extraction on the obtained text to be classified with the constructed denoising deep neural network model, the method also comprises:
removing the noise data from the obtained text to be classified, wherein the noise data comprises useless information and/or punctuation marks and special characters in the text.
As shown in Fig. 2 in the present embodiment, the noise data in the text to be sorted of the acquisition is rejected, is mainly picked
Except some useless information, for example, the garbage of the similar author often occurred in newsletter archive, version number, date etc, net
Stand the similar addresser occurred in forum, transmit the date, transmit the useless information such as station, source, and various punctuates in text
The useless information such as symbol and spcial character.
In an embodiment of the aforementioned automatic text classification method, further, after removing the noise data from the obtained text to be classified, the method also comprises:
performing word segmentation on the text data from which the noise data has been removed.
In this embodiment, Chinese text differs from English text: English words are separated by spaces, whereas Chinese has punctuation only between sentences. To extract word features, word segmentation is therefore performed on the Chinese text from which the noise data has been removed.
As shown in Fig. 2, in this embodiment, the ICTCLAS word segmentation system of the Chinese Academy of Sciences, after secondary development, can be used for segmentation; the system can provide service in the language selected by the developer.
In an embodiment of the aforementioned automatic text classification method, further, after performing word segmentation on the text data from which the noise data has been removed, the method also comprises:
removing stop words from the text data according to the word segmentation result, wherein the removed stop words are feature words with no discriminative or predictive ability.
As shown in Fig. 2, in this embodiment, after word segmentation the text may contain many useless feature words (also called stop words) that have no discriminative or predictive ability, for example auxiliary words, articles, conjunctions, pronouns and prepositions. These useless feature words are therefore removed to reduce the dimensionality of the feature words.
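The stop-word filtering step can be sketched in a few lines (the stop-word list and tokens below are illustrative examples only, not the list used in the embodiment):

```python
# Minimal sketch of stop-word filtering after word segmentation.
# The stop-word set below is a tiny illustrative sample.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "it"}

def remove_stop_words(tokens):
    """Keep only feature words, i.e. tokens not in the stop-word set."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["the", "network", "model", "of", "a", "classifier"]
print(remove_stop_words(tokens))  # ['network', 'model', 'classifier']
```

Dropping such tokens before weighting directly shrinks the dimensionality of the feature vocabulary, as the embodiment notes.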
In an embodiment of the aforementioned automatic text classification method, further, after removing stop words from the text data, the method also comprises:
turning the feature words obtained after stop-word removal into a vocabulary;
calculating the weight of each feature word in the feature vocabulary and recording it there, wherein the feature vocabulary comprises the feature words of each text and the correspondence between each feature word and its weight in each text;
representing each text in turn as a feature vector according to the obtained feature vocabulary.
As shown in Fig. 2, in this embodiment, the feature words obtained after stop-word removal are turned into a vocabulary, and the weight of each feature word in the vocabulary is calculated and recorded there.
In this embodiment, the term frequency-inverse document frequency (TF-IDF) algorithm can be used to calculate the weight of each feature word in the feature vocabulary. The TF-IDF algorithm is expressed as:
TF_IDF = (TF / N_i) * lg(N / DF)    (1)
In formula (1), TF_IDF is the weight, TF is the frequency of the particular feature word in the text, N_i is the total number of feature words in the text, N is the total number of texts, and DF is the number of texts containing the feature word.
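Formula (1) can be sketched directly in code (the corpus is a toy example; note that this variant divides the raw term frequency by the text's total word count and uses a base-10 logarithm, as in the formula):

```python
import math

def tf_idf(term, doc, corpus):
    """Weight of `term` in `doc` per formula (1): (TF / N_i) * lg(N / DF)."""
    tf = doc.count(term)                       # frequency of the term in the text
    n_i = len(doc)                             # total feature words in the text
    n = len(corpus)                            # total number of texts
    df = sum(1 for d in corpus if term in d)   # texts containing the term
    return (tf / n_i) * math.log10(n / df)

corpus = [["sport", "match", "team"],
          ["team", "economy", "market"],
          ["market", "price", "economy"]]
print(tf_idf("sport", corpus[0], corpus))
```

A word that appears in every text gets weight 0, since lg(N/DF) = lg(1) = 0, which is exactly the behaviour the weighting scheme intends.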
Each text is then represented in turn as a feature vector according to the obtained feature vocabulary, as shown in Table 1, the feature vector space model of the texts. In Table 1, d_m denotes a single text, t_j a feature word, and w_ij the weight of a feature word.
Table 1. Feature vector space model of the texts
     | t1  | ... | tj  | ... | tn
d1   | w11 | ... | w1j | ... | w1n
...  | ... | ... | ... | ... | ...
di   | wi1 | ... | wij | ... | win
...  | ... | ... | ... | ... | ...
dm   | wm1 | ... | wmj | ... | wmn
As shown in Fig. 2 in the embodiment of aforementioned texts automatic classification method, further, the basis obtains
The feature vocabulary arrived, each text is represented sequentially as to the form of characteristic vector to be included:
According to default rule, judge whether the first text is short text;
If so, then expanding algorithm according to short essay eigen, feature expansion is carried out to first text, and feature based expands
Result is filled, first text representation is characterized to the form of vector;
If it is not, first text representation is directly then characterized according to obtained feature vocabulary by vectorial form.
In the present embodiment, short text can be determined whether it is according to the size of text, if for example, the size of text is less than
Default threshold value, then the text is short text..
In this embodiment, suppose the training data set of texts is D = {d_i}; the classification algorithm is built on this data set, where d_i = {t_k}, and in a short text the number of feature words t_k is typically small. The short-text feature expansion algorithm is broadly divided into two steps:
A1. First, features with high indicative power for the classes are selected to construct the required feature space T, thereby reducing the dimensionality of the original feature space D.
When constructing the feature space T, the features in T should be distributed across the short texts as evenly as possible, i.e. every sample space should be directly related to the constructed feature space T; to keep the feature distribution even and avoid sparsity, the selected features should be contained in many short texts.
In summary, the construction of the feature space T should focus on the difference between classes in the number of texts and on the degree of correlation between features and classes; in each class, the features that contribute most to distinguishing that class are selected to characterize it. This can be measured by the within-class variance DI_ic of the distribution of a feature t_k within a class:
DI_ic = (1/m) * Σ_{j=1}^{m} ( f(t_ij) − f̄(t_ij) )²    (2)
In formula (2), m is the total number of texts in class C_i, f(t_ij) is the number of occurrences of feature t_ij in the j-th text of class C_i, and f̄(t_ij) is the average TF-IDF value of feature t_ij over all texts of class C_i.
The smaller the within-class variance DI_ic of feature t_ij, the more evenly it is distributed within the class, and the better it can distinguish the classes. Next, the DI_ic of each class are sorted, the top K features are extracted according to a ratio, and finally the non-duplicated features of all classes are merged to form the feature space T.
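Step A1 can be sketched as follows. The exact formula (2) is not reproduced in this text, so an ordinary within-class variance is assumed, and features are ranked by the stated rule that a smaller DI (a more uniform in-class distribution) is better; the per-class DI values below are hypothetical:

```python
def within_class_variance(counts, tfidf_mean):
    """DI_ic: mean squared deviation of a feature's per-text occurrence
    counts from its average TF-IDF in the class (assumed form of formula (2))."""
    m = len(counts)
    return sum((f - tfidf_mean) ** 2 for f in counts) / m

def build_feature_space(class_features, top_k):
    """Step A1: per class, rank features by DI (most uniform first) and
    merge the non-duplicated top-k features of all classes into T."""
    space = []
    for feats in class_features.values():
        ranked = sorted(feats, key=lambda name: feats[name])  # smaller DI first
        for name in ranked[:top_k]:
            if name not in space:
                space.append(name)
    return space

# Hypothetical per-class DI values, for illustration only.
di = {"C1": {"team": 0.1, "match": 0.3, "filler": 2.0},
      "C2": {"market": 0.2, "price": 0.4, "team": 1.5}}
print(build_feature_space(di, top_k=2))  # ['team', 'match', 'market', 'price']
```

Note how "team", selected for C1, is not duplicated when C2 is processed, matching the "non-duplicated features of all classes" merge.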
A2. Then, for any d_i, features with higher similarity to its feature words t_k are selected from the feature space T for expansion.
Once the feature space T has been constructed, short-text features can be expanded. The principle is to expand each feature t contained in the short text with the features that have the greatest degree of correlation with it. The common methods for calculating feature correlation mainly use mutual information, which directly reflects the correlation between a feature and a class, but it has a shortcoming: it is sensitive to the inaccuracies introduced by sparse data, which may make the mutual information between features negative and cause trouble for later processing.
In this embodiment, a modified calculation formula based on mutual information is used; it avoids, to a certain extent, the problem of the mutual information of a pair of low-frequency words being higher than that of a pair of high-frequency words, and weakens the influence of data sparsity on the feature correlation:
R(t_i, t_j) = P(t_i, t_j) * lg( P(t_i, t_j) / ( P(t_i) * P(t_j) ) )    (3)
In formula (3), R(t_i, t_j) is the degree of correlation between features t_i and t_j, P(t_i, t_j) is the probability that features t_i and t_j occur together in the data set, P(t_i) is the probability that feature t_i occurs in the data set, and P(t_j) is the probability that feature t_j occurs in the data set.
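The correlation measure can be sketched from document-level occurrence counts. Since the patent's exact modified formula is not reproduced in this text, the joint-probability-weighted form of pointwise mutual information is assumed here (the weight damps low-frequency pairs, consistent with the description); the documents are toy examples:

```python
import math

def correlation(t_i, t_j, docs):
    """R(t_i, t_j): co-occurrence probability times the log ratio of joint
    to independent occurrence (assumed weighted-PMI form of formula (3))."""
    n = len(docs)
    p_i = sum(1 for d in docs if t_i in d) / n
    p_j = sum(1 for d in docs if t_j in d) / n
    p_ij = sum(1 for d in docs if t_i in d and t_j in d) / n
    if p_ij == 0:
        return 0.0          # never co-occur: no expansion candidate
    return p_ij * math.log10(p_ij / (p_i * p_j))

docs = [{"team", "match"}, {"team", "match"}, {"team", "price"}, {"price"}]
print(correlation("team", "match", docs))
```

For each short-text feature t, the features of T with the largest R(t, ·) would then be appended to the text before it is vectorized.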
In an embodiment of the aforementioned automatic text classification method, further, after representing each text in turn as a feature vector according to the obtained feature vocabulary, the method also comprises:
normalizing each value of the feature-vector representation.
As shown in Fig. 2, in this embodiment, differences of magnitude in the input data can cause incompatibility problems among the inputs, so the feature values input to the constructed denoising deep neural network model need to be normalized. Specifically, each value of the feature-vector representation is normalized according to formula (4):
V_i = (x_i − x_min) / (x_max − x_min)    (4)
In formula (4), x_i and V_i are the feature value before and after normalization respectively, V is the feature vector after normalization, and x_min and x_max are the minimum and maximum of the feature values input to the denoising deep neural network model.
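The min-max normalization of formula (4) is a one-liner; a sketch on illustrative values:

```python
def normalize(values):
    """Formula (4): V_i = (x_i - x_min) / (x_max - x_min), mapping the
    feature values into [0, 1] before they enter the network."""
    x_min, x_max = min(values), max(values)
    return [(x - x_min) / (x_max - x_min) for x in values]

print(normalize([2.0, 4.0, 10.0]))  # [0.0, 0.25, 1.0]
```

After this step every component lies in [0, 1], so no single large-magnitude feature dominates the input signal.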
In an embodiment of the aforementioned automatic text classification method, further, the denoising deep neural network model comprises:
a first denoising autoencoder at the bottom of the denoising deep neural network model, a second denoising autoencoder above the first denoising autoencoder, a first restricted Boltzmann machine above the second denoising autoencoder, and a second restricted Boltzmann machine above the first restricted Boltzmann machine.
In an embodiment of the aforementioned automatic text classification method, further, the first denoising autoencoder and the second denoising autoencoder form a denoising module, which denoises the feature vectors input to the denoising deep neural network model; the layer of the second denoising autoencoder is both the output layer of the denoising module and the input layer of the first restricted Boltzmann machine;
the second restricted Boltzmann machine is the output layer of the denoising deep neural network model, and the output of this layer is the feature representation of the text to be classified.
In an embodiment of the aforementioned automatic text classification method, further, the input of the denoising deep neural network model is a feature vector of fixed dimension.
In this embodiment, the automatic text classification method described here is studied experimentally on the Chinese text corpus compiled by Fudan University. The corpus contains nearly ten thousand documents in 20 categories, including sports, politics, medicine, art, military and economics; the distribution of the training and test sets in the data set is shown in Table 2.
Table 2. Distribution of training and test sets in the data set
Category | Class name  | Training set | Test set
C1       | Computer    | 900          | 300
C2       | Environment | 900          | 300
C3       | Agriculture | 900          | 300
C4       | Economy     | 900          | 300
C5       | Politics    | 900          | 300
C6       | Sports      | 900          | 300
...      | ...         | ...          | ...
In this embodiment, the denoising deep neural network model is the main submodule of the feature extraction module; it is built mainly from two components, the denoising autoencoder (DAE) and the restricted Boltzmann machine (RBM).
After weighing training complexity against model efficiency, this embodiment uses two layers of denoising autoencoders (DAE) and two layers of restricted Boltzmann machines (RBM). The topology of the denoising deep neural network model is shown in Fig. 3. The model comprises: a first denoising autoencoder (DAE1) at the bottom of the model, a second denoising autoencoder (DAE2) above the first denoising autoencoder, a first restricted Boltzmann machine (RBM1) above the second denoising autoencoder, and a second restricted Boltzmann machine (RBM2) above the first restricted Boltzmann machine.
In this embodiment, the first denoising autoencoder (DAE1) and the second denoising autoencoder (DAE2) form a denoising module that denoises the feature vectors input to the denoising deep neural network model; the layer of the second denoising autoencoder (DAE2) is both the output layer of the denoising module and the input layer of the first restricted Boltzmann machine (RBM1); the second restricted Boltzmann machine (RBM2) is the output layer of the denoising deep neural network model, and its output is the feature representation of the text to be classified.
In this embodiment, the feature extraction module first denoises the input original feature vector with the denoising module. The denoising module sits at the bottom of the whole denoising deep neural network model so as to make full use of the denoising property of the denoising autoencoder: through its unsupervised learning ability, the denoising autoencoder reconstructs the input original feature vector, which amounts to denoising the input signal, so that the signal entering the rest of the network after the denoising autoencoders is purer, reducing the influence of noisy data on the subsequently built classifier.
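The corrupt-then-reconstruct idea of the denoising autoencoder can be sketched with toy dimensions and fixed weights (a real DAE learns its weights by minimizing reconstruction error against the clean input; all values below are illustrative):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def corrupt(v, noise_ratio, rng):
    """Masking noise: set a fraction of the input components to zero."""
    return [0.0 if rng.random() < noise_ratio else x for x in v]

def dae_forward(v, w_enc, w_dec):
    """Encode the (corrupted) input to a hidden code, then decode it.
    Training would adjust w_enc/w_dec so the decoding matches the CLEAN v."""
    h = [sigmoid(sum(wi * xi for wi, xi in zip(row, v))) for row in w_enc]
    return [sigmoid(sum(wi * hi for wi, hi in zip(row, h))) for row in w_dec]

rng = random.Random(0)
v = [0.9, 0.1, 0.8, 0.3]                                    # clean input
w_enc = [[0.5, -0.2, 0.3, 0.1], [-0.1, 0.4, 0.2, -0.3]]     # 2 hidden units
w_dec = [[0.6, -0.4], [0.2, 0.5], [-0.3, 0.1], [0.4, 0.2]]
print(dae_forward(corrupt(v, noise_ratio=0.25, rng=rng), w_enc, w_dec))
```

Because the target of training is the clean vector while the input is the corrupted one, the learned mapping is forced to strip noise, which is exactly the property the denoising module exploits.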
In this embodiment, the first restricted Boltzmann machine and the second restricted Boltzmann machine have strong feature extraction ability. The first restricted Boltzmann machine (RBM1) and the second restricted Boltzmann machine (RBM2) sit in the upper layers of the denoising deep neural network model and can learn complex regularities in the data, so that the extracted high-level features are more representative; after further feature extraction by the RBMs, the more representative extracted features are fed into the final classifier, which is expected to give the best classification results.
In this embodiment, a suitable corruption ratio and learning rate are selected by experiment to improve the performance of the feature extraction module.
In an embodiment, the operation of the denoising deep neural network model (DDNN) is shown in Fig. 4. The model comprises four layers in total: DAE1, DAE2, RBM1 and RBM2. v is the visible layer and also the input layer of the DDNN; in this embodiment, every text is represented by a vector of fixed dimension. W1, W2, W3 and W4 are the connection weights between the layers, and h1, h2, h3 and h4 are the hidden layers corresponding to DAE1, DAE2, RBM1 and RBM2 respectively. There are no connections between nodes of the same layer, but the nodes of every two adjacent layers are fully connected.
In this embodiment, the input of the denoising deep neural network model (DDNN) is a vector of fixed dimension. The denoising module formed by the two layers DAE1 and DAE2 is trained first; the DAE2 layer is the output layer of the denoising module and also the input layer of the subsequent RBM1 layer. RBM2 is the output layer of the DDNN and yields the feature representation of the text: in contrast to the visible layer, which is the low-level feature representation of the text data, this layer is the high-level feature representation, and all subsequent text classification tasks are computed on this high-level feature.
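Once trained, extracting the high-level feature is a forward pass of the fixed-dimension input through the four stacked layers. A sketch with toy weight matrices W1–W4 (illustrative values only; the real weights come from the layer-wise DAE/RBM training described above):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(v, w):
    """One fully connected layer: no intra-layer links, adjacent layers fully linked."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(row, v))) for row in w]

def ddnn_forward(v, weights):
    """Pass the input v through h1..h4 (DAE1, DAE2, RBM1, RBM2); the last
    activation plays the role of the high-level feature fed to the classifier."""
    for w in weights:
        v = layer(v, w)
    return v

# Toy weights for a 3-2-2-2-2 stack.
W = [[[0.2, -0.1, 0.4], [0.3, 0.1, -0.2]],
     [[0.5, -0.3], [0.1, 0.2]],
     [[-0.2, 0.4], [0.3, 0.1]],
     [[0.2, 0.2], [-0.1, 0.5]]]
feature = ddnn_forward([0.7, 0.1, 0.9], W)
print(feature)
```

The returned vector is what the recognition and classification module below receives as input.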
In this embodiment, the recognition and classification module classifies with the Softmax regression algorithm; its input is the high-level feature output by the denoising deep neural network model (DDNN).
In this embodiment, suppose the text data set contains n texts from k classes, and the training set is expressed as {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n-1), y^(n-1)), (x^(n), y^(n))}, where x^(i) is the i-th training text and y takes one of several different values, y^(i) ∈ {1, 2, ..., k-1, k}. The main purpose of the Softmax regression algorithm is, for a given training input x, to be able to calculate the probability that x belongs to each class label. The hypothesis function is formula (5):
h_θ(x^(i)) = (1 / Σ_{j=1}^{k} e^(θ_j^T x^(i))) · [ e^(θ_1^T x^(i)), e^(θ_2^T x^(i)), ..., e^(θ_k^T x^(i)) ]^T    (5)
In formula (5), each component of the vector h_θ(x^(i)) is the probability that text x^(i) belongs to one of the classes; so that the components of the vector sum to 1, the probabilities are normalized. θ_1, θ_2, ..., θ_{k-1}, θ_k ∈ R^{n+1}, where R^{n+1} is the real space of dimension n+1; each θ here is a vector of dimension n+1, namely the parameters used by Softmax itself, which weight every dimension of a sample's attributes to obtain the number θ_j^T x^(i); the superscript T denotes transposition.
The cost function used in the Softmax regression algorithm is formula (6):
J(θ) = −(1/n) [ Σ_{i=1}^{n} Σ_{j=1}^{k} 1{y^(i) = j} · lg( e^(θ_j^T x^(i)) / Σ_{l=1}^{k} e^(θ_l^T x^(i)) ) ] + (λ/2) Σ_{i=1}^{k} Σ_{j=0}^{n} θ_ij²    (6)
In formula (6), 1{·} is the indicator function: when the expression in the braces is true, the value of the function is 1; conversely, when the expression in the braces is false, the value of the function is 0. θ_ij is the j-th dimension of the i-th parameter vector of Softmax, and (λ/2) Σ_i Σ_j θ_ij² is a penalty term. Because the original cost function, the part before the plus sign, is not strictly convex, a weight decay term is added after it to prevent multiple extrema from appearing. When the parameter λ > 0 in the Softmax regression model, the cost function becomes strictly convex, preventing overfitting to the training samples and finally yielding the globally optimal solution.
The extremum of the cost function is found by gradient descent; the gradient of the cost function is formula (7):
∇_{θ_j} J(θ) = −(1/n) Σ_{i=1}^{n} [ x^(i) · ( 1{y^(i) = j} − P(y^(i) = j | x^(i); θ) ) ] + λ·θ_j    (7)
Once θ is obtained, the previously assumed function h_θ(x) is obtained, from which the probability of each class for a text x can be calculated; the class with the largest probability is the final class predicted by the Softmax regression algorithm.
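The prediction step of formula (5) can be sketched as follows. The parameter vectors here are hypothetical "learned" values for k = 3 classes over a 2-D feature with a bias term appended as a third input fixed to 1:

```python
import math

def softmax_predict(x, thetas):
    """h_theta(x) per formula (5): normalized exponentials of theta_j^T x;
    the predicted class is the index with the largest probability."""
    scores = [sum(t * xi for t, xi in zip(theta, x)) for theta in thetas]
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    return probs, probs.index(max(probs))

thetas = [[1.0, -0.5, 0.1], [-0.2, 0.8, 0.0], [0.1, 0.1, -0.4]]  # hypothetical
x = [0.9, 0.2, 1.0]                                              # feature + bias 1
probs, label = softmax_predict(x, thetas)
print(label, probs)
```

Subtracting the maximum score before exponentiating does not change the probabilities but avoids overflow; the normalization by the sum is exactly what makes the components of h_θ(x) sum to 1.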
In the present embodiment, noise-free data is used as input, and the automatic text classification method proposed in the present invention is compared against single-algorithm models; the resulting text classification accuracies are shown in Table 3.
Table 3: Classification accuracy (%) of different algorithms on noise-free data
With noise-added data as input, the automatic text classification method proposed in the present invention is again compared against single-algorithm models; the resulting text classification accuracies are shown in Table 4.
Table 4: Classification accuracy (%) of different algorithms on noise-added data
In Tables 3 and 4, KNN, BPNN, and SVM denote the K-nearest-neighbor algorithm, the back-propagation neural network, and the support vector machine, respectively.
It should be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between those entities or operations.
The above describes preferred embodiments of the present invention. It should be noted that those skilled in the art may make further improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (10)
- 1. An automatic text classification method, characterized by comprising: obtaining a text to be classified; constructing a denoising deep neural network model using denoising autoencoders and restricted Boltzmann machines; performing feature extraction on the obtained text to be classified using the constructed denoising deep neural network model; and performing automatic classification with a Softmax regression algorithm according to the feature extraction result.
- 2. The automatic text classification method according to claim 1, characterized in that, before performing feature extraction on the obtained text to be classified using the constructed denoising deep neural network model, the method further comprises: removing noise data from the obtained text to be classified, wherein the noise data comprises useless information and/or punctuation marks and special characters in the text.
- 3. The automatic text classification method according to claim 2, characterized in that, after removing the noise data from the obtained text to be classified, the method further comprises: performing word segmentation on the text data from which the noise data has been removed.
- 4. The automatic text classification method according to claim 3, characterized in that, after performing word segmentation on the text data from which the noise data has been removed, the method further comprises: removing stop words from the text data according to the word segmentation result, wherein the removed stop words are feature words without discriminative or predictive ability.
- 5. The automatic text classification method according to claim 4, characterized in that, after removing stop words from the text data, the method further comprises: compiling the feature words remaining after stop-word removal into a vocabulary; calculating the weight of each feature word in the feature vocabulary and recording it in the feature vocabulary, wherein the feature vocabulary comprises the texts, the feature words in each text, and the correspondence between each feature word in a text and its weight; and representing each text in turn as a feature vector according to the obtained feature vocabulary.
- 6. The automatic text classification method according to claim 5, characterized in that representing each text in turn as a feature vector according to the obtained feature vocabulary comprises: judging, according to a preset rule, whether a first text is a short text; if so, performing feature expansion on the first text according to a short-text feature expansion algorithm, and representing the first text as a feature vector based on the feature expansion result; if not, directly representing the first text as a feature vector according to the obtained feature vocabulary.
- 7. The automatic text classification method according to claim 5, characterized in that, after each text is represented in turn as a feature vector according to the obtained feature vocabulary, the method further comprises: normalizing each numerical value of the feature-vector representation.
- 8. The automatic text classification method according to claim 1, characterized in that the denoising deep neural network model comprises: a first denoising autoencoder at the bottom of the denoising deep neural network model, a second denoising autoencoder on the layer above the first denoising autoencoder, a first restricted Boltzmann machine on the layer above the second denoising autoencoder, and a second restricted Boltzmann machine on the layer above the first restricted Boltzmann machine.
- 9. The automatic text classification method according to claim 8, characterized in that the first denoising autoencoder and the second denoising autoencoder form a denoising module, the denoising module being configured to denoise the feature vectors input to the denoising deep neural network model; wherein the layer of the second denoising autoencoder is both the output layer of the denoising module and the input layer of the first restricted Boltzmann machine; and the second restricted Boltzmann machine is the output layer of the denoising deep neural network model, the output of which is the feature representation of the text to be classified.
- 10. The automatic text classification method according to claim 1, characterized in that the input of the denoising deep neural network model is a feature vector of a fixed dimension.
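The preprocessing and vectorization steps of claims 2 through 7 can be sketched in Python; the stop-word list, whitespace segmentation, plain term-frequency weights, and unit-length normalization below are illustrative assumptions, since the claims do not fix these particular choices:

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of"}  # illustrative stop-word list (assumption)

def remove_noise(text):
    """Claim 2: strip punctuation marks and special characters."""
    return re.sub(r"[^\w\s]", " ", text)

def segment(text):
    """Claim 3: word segmentation (simple whitespace split, for English)."""
    return text.lower().split()

def remove_stop_words(words):
    """Claim 4: drop feature words with no discriminative or predictive ability."""
    return [w for w in words if w not in STOP_WORDS]

def build_vocabulary(docs):
    """Claim 5: a feature vocabulary shared by all texts."""
    vocab = sorted({w for doc in docs for w in doc})
    return {w: i for i, w in enumerate(vocab)}

def to_vector(words, vocab):
    """Claim 5: represent a text as a feature vector (term-frequency weights)."""
    counts = Counter(words)
    return [counts.get(w, 0) for w in vocab]

def normalize(vec):
    """Claim 7: scale every value, here to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def preprocess(texts):
    """Run the full pipeline of claims 2-5 and 7 over a corpus."""
    docs = [remove_stop_words(segment(remove_noise(t))) for t in texts]
    vocab = build_vocabulary(docs)
    return [normalize(to_vector(d, vocab)) for d in docs]
```

For example, `preprocess(["The cat sat.", "A dog ran!"])` yields two unit-length term-frequency vectors over the shared vocabulary; the short-text feature expansion of claim 6 is omitted here, as the claims do not specify the expansion algorithm.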
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710822309.2A CN107609113A (en) | 2017-09-13 | 2017-09-13 | A kind of Automatic document classification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107609113A (en) | 2018-01-19 |
Family
ID=61063938
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710822309.2A Pending CN107609113A (en) | 2017-09-13 | 2017-09-13 | A kind of Automatic document classification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107609113A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108447565A (en) * | 2018-03-23 | 2018-08-24 | 北京工业大学 | Small-for-gestational-age infant disease prediction method based on an improved denoising autoencoder |
CN109308471A (en) * | 2018-09-29 | 2019-02-05 | 河海大学常州校区 | EMG feature extraction method |
CN109829054A (en) * | 2019-01-17 | 2019-05-31 | 齐鲁工业大学 | Text classification method and system |
CN111310451A (en) * | 2018-12-10 | 2020-06-19 | 北京沃东天骏信息技术有限公司 | Sensitive dictionary generation method and device, storage medium and electronic equipment |
CN112214598A (en) * | 2020-09-27 | 2021-01-12 | 中润普达(十堰)大数据中心有限公司 | Cognitive system based on hair condition |
CN112905795A (en) * | 2021-03-11 | 2021-06-04 | 证通股份有限公司 | Text intention classification method, device and readable medium |
US11488055B2 (en) | 2018-07-26 | 2022-11-01 | International Business Machines Corporation | Training corpus refinement and incremental updating |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102955856A (en) * | 2012-11-09 | 2013-03-06 | 北京航空航天大学 | Chinese short text classification method based on feature extension |
CN104834747A (en) * | 2015-05-25 | 2015-08-12 | 中国科学院自动化研究所 | Short text classification method based on convolutional neural network |
CN105912716A (en) * | 2016-04-29 | 2016-08-31 | 国家计算机网络与信息安全管理中心 | Short text classification method and apparatus |
KR101681109B1 (en) * | 2015-10-01 | 2016-11-30 | 한국외국어대학교 연구산학협력단 | An automatic method for classifying documents by using presentative words and similarity |
CN106372640A (en) * | 2016-08-19 | 2017-02-01 | 中山大学 | Character frequency text classification method |
Non-Patent Citations (1)
Title |
---|
Zhou Chao: "Research on Text Classification Based on a Deep Learning Hybrid Model", China Master's Theses Full-text Database *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107609113A (en) | A kind of Automatic document classification method | |
CN108304468B (en) | Text classification method and text classification device | |
Kadhim et al. | Text document preprocessing and dimension reduction techniques for text document clustering | |
CN105389379A (en) | Rubbish article classification method based on distributed feature representation of text | |
CN107301171A (en) | Text sentiment analysis method and system based on sentiment dictionary learning | |
CN101714135B (en) | Emotional orientation analytical method of cross-domain texts | |
Romanov et al. | Application of natural language processing algorithms to the task of automatic classification of Russian scientific texts | |
Dasari et al. | Text categorization and machine learning methods: current state of the art | |
CN106570170A (en) | Integrated text classification and named entity recognition method and system based on a deep recurrent neural network | |
CN113157859A (en) | Event detection method based on upper concept information | |
Hossain et al. | Authorship classification in a resource constraint language using convolutional neural networks | |
Balli et al. | Sentimental analysis of Twitter users from Turkish content with natural language processing | |
Ong et al. | Sentiment analysis of informal Malay tweets with deep learning | |
CN107463715A (en) | English social media account number classification method based on information gain | |
Nguyen et al. | An ensemble of shallow and deep learning algorithms for Vietnamese sentiment analysis | |
Soni et al. | A comprehensive study for the Hindi language to implement supervised text classification techniques | |
Mamoun et al. | Arabic text stemming: Comparative analysis | |
Dhar et al. | Bengali news headline categorization using optimized machine learning pipeline | |
Zobeidi et al. | Effective text classification using multi-level fuzzy neural network | |
CN110348497A (en) | Document representation method based on WT-GloVe word vector construction | |
KR101240330B1 (en) | System and method for mutidimensional document classification | |
CN113761123A (en) | Keyword acquisition method and device, computing equipment and storage medium | |
Alharbi et al. | Neural networks based on Latent Dirichlet Allocation for news web page classifications | |
Susmitha et al. | Performance assessment using supervised machine learning algorithms of opinion mining on social media dataset | |
Wikarsa et al. | Automatic Generation Of Word-Emotion Lexicon For Multiple Sentiment Polarities On Social Media Texts |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180119 |