CN107609113A - Automatic text classification method - Google Patents

Automatic text classification method

Info

Publication number
CN107609113A
CN107609113A (application CN201710822309.2A)
Authority
CN
China
Prior art keywords
text
noise reduction
feature
neural network
classification method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710822309.2A
Other languages
Chinese (zh)
Inventor
张媛钰
阿孜古丽
谢永红
张德政
栗辉
李春苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB filed Critical University of Science and Technology Beijing USTB
Priority to CN201710822309.2A priority Critical patent/CN107609113A/en
Publication of CN107609113A publication Critical patent/CN107609113A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides an automatic text classification method that can improve the accuracy and noise robustness of text classification. The method includes: obtaining text to be classified; building a denoising deep neural network model using denoising autoencoders and restricted Boltzmann machines; performing feature extraction on the obtained text to be classified using the built denoising deep neural network model; and, according to the feature extraction result, classifying automatically with a Softmax regression algorithm. The present invention relates to the field of text classification.

Description

Automatic text classification method
Technical field
The present invention relates to the field of text classification, and in particular to an automatic text classification method.
Background technology
Among networked information, text occupies a central position as the main carrier of information. Text classification (Text Classification, TC) uses a computer to automatically label and sort a text collection, or other entities and objects, according to a given classification system or standard. At present, deep learning has been successfully applied to all kinds of pattern classification problems, and methods based on deep learning can better mine the complex semantic relations contained in text.
In the prior art, however, text is typically classified with a single method: its feature extraction ability is weak and its handling of noisy data is poor, so the accuracy of the classification results is low.
Summary of the invention
The technical problem to be solved by the present invention is to provide an automatic text classification method, so as to solve the poor noise handling and weak feature extraction ability present in the prior art.
To solve the above technical problem, an embodiment of the present invention provides an automatic text classification method, including:
obtaining text to be classified;
building a denoising deep neural network model using denoising autoencoders and restricted Boltzmann machines;
performing feature extraction on the obtained text to be classified using the built denoising deep neural network model;
classifying automatically with a Softmax regression algorithm according to the feature extraction result.
Further, before performing feature extraction on the obtained text to be classified with the built denoising deep neural network model, the method also includes:
removing the noise data from the obtained text to be classified, wherein the noise data includes useless information and/or punctuation marks and special characters in the text.
Further, after removing the noise data from the obtained text to be classified, the method also includes:
performing word segmentation on the text data from which the noise data has been removed.
Further, after performing word segmentation on the text data from which the noise data has been removed, the method also includes:
removing stop words from the text data according to the word segmentation result, wherein the removed stop words are feature words without discriminative or predictive ability.
Further, after removing stop words from the text data, the method also includes:
collecting the feature words remaining after stop-word removal into a vocabulary;
calculating the weight of each feature word in the feature vocabulary and recording it in the feature vocabulary, wherein the feature vocabulary includes the correspondence among the texts, the feature words in each text, and the weights of those feature words;
representing each text in turn as a feature vector according to the resulting feature vocabulary.
Further, representing each text in turn as a feature vector according to the resulting feature vocabulary includes:
judging, according to a preset rule, whether a first text is a short text;
if so, expanding the features of the first text according to a short-text feature expansion algorithm, and representing the first text as a feature vector based on the expansion result;
if not, directly representing the first text as a feature vector according to the resulting feature vocabulary.
Further, after each text has been represented in turn as a feature vector according to the resulting feature vocabulary, the method also includes:
normalizing each value of the feature vector representation.
Further, the denoising deep neural network model includes:
a first denoising autoencoder at the bottom of the denoising deep neural network model, a second denoising autoencoder above the first denoising autoencoder, a first restricted Boltzmann machine above the second denoising autoencoder, and a second restricted Boltzmann machine above the first restricted Boltzmann machine.
Further, the first denoising autoencoder and the second denoising autoencoder form a denoising module, and the denoising module is used to denoise the feature vectors input to the denoising deep neural network model; wherein the layer holding the second denoising autoencoder is the output layer of the denoising module and, at the same time, the input layer of the first restricted Boltzmann machine;
the second restricted Boltzmann machine is the output layer of the denoising deep neural network model, and the output of the output layer is the feature representation of the text to be classified.
Further, the input of the denoising deep neural network model is a feature vector of fixed dimension.
The above technical scheme of the present invention has the following beneficial effects:
In the above scheme, a denoising deep neural network model is built using denoising autoencoders and restricted Boltzmann machines; feature extraction is performed on the obtained text to be classified using the built denoising deep neural network model; and classification is carried out automatically with a Softmax regression algorithm according to the feature extraction result. In this way, the features of the text to be classified are extracted by a denoising deep neural network model built from denoising autoencoders, which have strong noise resistance, and restricted Boltzmann machines, which have strong feature extraction ability, so the accuracy and noise robustness of text classification can be improved.
Brief description of the drawings
Fig. 1 is a flow diagram of the automatic text classification method provided by an embodiment of the present invention;
Fig. 2 is a flow diagram, provided by an embodiment of the present invention, of representing the obtained text to be classified as a feature vector;
Fig. 3 is a topology diagram of the denoising deep neural network model provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the working principle of the denoising deep neural network model provided by an embodiment of the present invention.
Detailed description of embodiments
To make the technical problem to be solved by the present invention, the technical scheme, and the advantages clearer, a detailed description is given below in conjunction with the accompanying drawings and specific embodiments.
Aiming at the poor noise handling and weak feature extraction ability of existing methods, the present invention provides an automatic text classification method.
As shown in Fig. 1, the automatic text classification method provided by an embodiment of the present invention includes:
S101, obtaining text to be classified;
S102, building a denoising deep neural network model (Denoising Deep Neural Network, DDNN) using denoising autoencoders (Denoising Auto Encoder, DAE) and restricted Boltzmann machines (Restricted Boltzmann Machine, RBM);
S103, performing feature extraction on the obtained text to be classified using the built denoising deep neural network model;
S104, classifying automatically with a Softmax regression algorithm according to the feature extraction result.
In the automatic text classification method described in the embodiment of the present invention, a denoising deep neural network model is built using denoising autoencoders and restricted Boltzmann machines; feature extraction is performed on the obtained text to be classified using the built model; and classification is carried out automatically with a Softmax regression algorithm according to the feature extraction result. In this way, the features of the text to be classified are extracted by a denoising deep neural network model built from denoising autoencoders with strong noise resistance and restricted Boltzmann machines with strong feature extraction ability, so the accuracy and noise robustness of text classification can be improved.
In a specific embodiment of the aforementioned automatic text classification method, further, before performing feature extraction on the obtained text to be classified with the built denoising deep neural network model, the method also includes:
removing the noise data from the obtained text to be classified, wherein the noise data includes useless information and/or punctuation marks and special characters in the text.
As shown in Fig. 2, in this embodiment, removing the noise data from the obtained text to be classified mainly means rejecting useless information, for example the author, edition number, and date that often appear in news text, the poster, posting date, relay station, and source that appear in web forums, and the various punctuation marks and special characters in the text.
In a specific embodiment of the aforementioned automatic text classification method, further, after removing the noise data from the obtained text to be classified, the method also includes:
performing word segmentation on the text data from which the noise data has been removed.
In this embodiment, Chinese text differs from English text: English words are separated by spaces, while Chinese is separated by punctuation only between sentences. Therefore, to extract word features, word segmentation is performed on the Chinese text from which the noise data has been removed.
As shown in Fig. 2, in this embodiment, the ICTCLAS word segmentation system of the Chinese Academy of Sciences, after secondary development, can be used for segmentation; the system can provide service in the language selected by the developer.
In a specific embodiment of the aforementioned automatic text classification method, further, after performing word segmentation on the text data from which the noise data has been removed, the method also includes:
removing stop words from the text data according to the word segmentation result, wherein the removed stop words are feature words without discriminative or predictive ability.
As shown in Fig. 2, in this embodiment, after word segmentation the text can still contain many useless feature words (also called stop words) that have no discriminative or predictive ability, for example auxiliary words, articles, conjunctions, pronouns, and prepositions; these useless feature words are therefore removed to reduce the dimensionality of the feature words. A minimal code sketch covering this step and the previous ones is given below.
In a specific embodiment of the aforementioned automatic text classification method, further, after removing stop words from the text data, the method also includes:
collecting the feature words remaining after stop-word removal into a vocabulary;
calculating the weight of each feature word in the feature vocabulary and recording it in the feature vocabulary, wherein the feature vocabulary includes the correspondence among the texts, the feature words in each text, and the weights of those feature words;
representing each text in turn as a feature vector according to the resulting feature vocabulary.
As shown in Fig. 2, in this embodiment, the feature words remaining after stop-word removal are collected into a vocabulary, and the weight of each feature word in the feature vocabulary is calculated and recorded in the feature vocabulary.
In this embodiment, the term frequency-inverse document frequency (Term Frequency-Inverse Document Frequency, TF-IDF) algorithm can be used to calculate the weight of each feature word in the feature vocabulary. The TF-IDF algorithm is expressed as:

TF_IDF = (TF / N_i) * lg(N / DF)   (1)

In formula (1), TF_IDF is the weight, TF is the term frequency of the feature word in the text, N_i is the total number of feature words in the text, N is the total number of texts, and DF is the number of texts containing the feature word. A code sketch of this weighting follows.
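A short sketch of formula (1) over a segmented corpus; the function and variable names are ours, not the patent's:

```python
# Compute the TF-IDF weight of formula (1), TF_IDF = (TF / Ni) * lg(N / DF),
# for every feature word of every segmented text.
import math
from collections import Counter

def tfidf_weights(docs):
    N = len(docs)                      # N: total number of texts
    df = Counter()                     # DF: number of texts containing the word
    for doc in docs:
        df.update(set(doc))
    tables = []
    for doc in docs:
        tf = Counter(doc)              # TF: term frequency in this text
        Ni = len(doc)                  # Ni: total feature words in this text
        tables.append({w: (tf[w] / Ni) * math.log10(N / df[w]) for w in tf})
    return tables

docs = [["体育", "比赛", "冠军"], ["经济", "市场", "比赛"]]
print(tfidf_weights(docs))
```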
According to the resulting feature vocabulary, each text is represented in turn as a feature vector, as shown in Table 1, the feature vector space model of the texts; in Table 1, d_m denotes a single text, t_j denotes a feature word, and w_ij denotes the weight of feature word t_j in text d_i.
Table 1. Feature vector space model of the texts

      t_1    ...    t_j    ...    t_n
d_1   w_11   ...    w_1j   ...    w_1n
...   ...    ...    ...    ...    ...
d_i   w_i1   ...    w_ij   ...    w_in
...   ...    ...    ...    ...    ...
d_m   w_m1   ...    w_mj   ...    w_mn
As shown in Fig. 2 in the embodiment of aforementioned texts automatic classification method, further, the basis obtains The feature vocabulary arrived, each text is represented sequentially as to the form of characteristic vector to be included:
According to default rule, judge whether the first text is short text;
If so, then expanding algorithm according to short essay eigen, feature expansion is carried out to first text, and feature based expands Result is filled, first text representation is characterized to the form of vector;
If it is not, first text representation is directly then characterized according to obtained feature vocabulary by vectorial form.
In the present embodiment, short text can be determined whether it is according to the size of text, if for example, the size of text is less than Default threshold value, then the text is short text..
In this embodiment, suppose the training data set of texts is D = {d_i}, on which the classification algorithm is built, where d_i = {t_k}; in a short text the number of feature words t_k is typically small. The short-text feature expansion algorithm is broadly divided into two steps (a code sketch of both steps is given after their description below):
A1. First, features that are highly indicative of a class are selected to construct the required feature space T, thereby reducing the dimensionality of the original feature space D.
When constructing the feature space T, the features in T should as far as possible be distributed across every short text, i.e. every sample space should be directly related to the built feature space T; and, to keep the feature distribution uniform and avoid sparsity, the selected features should be contained in as many short texts as possible.
In summary, the construction of the feature space T must consider both the difference among classes in the number of texts and the degree of correlation between features and classes; in each class, the features contributing most to class discrimination are selected to characterize the corresponding class. For this purpose, the within-class variance DI_ic, which characterizes the distribution of a feature t_k within class C_i, can be used as the measure:
DI_ic = sqrt( (1/m) * Σ_{j=1..m} ( f(t_ij) - f̄(t_ij) )^2 )   (2)

In formula (2), m is the total number of texts in class C_i, f(t_ij) is the number of occurrences of feature t_ij in the j-th text of class C_i, and f̄(t_ij) is the average TF-IDF value of feature t_ij over all texts of class C_i.
The smaller the within-class variance DI_ic of feature t_ij, the more uniform its distribution within the class, and the better its ability to distinguish the classes. Next, the DI_ic values of each class are sorted from large to small, the top K features are extracted according to a proportion, and finally the non-duplicated features of all classes are merged to form the feature space T.
A2. Then, for any d_i, features with high similarity to its t_k are selected from the feature space T for expansion.
On the basis of the built feature space T, the short-text features can be expanded. The principle is to expand a short text using the features that have the greatest degree of correlation with the features t it already contains. The common method for calculating the degree of correlation between features mainly uses mutual information, which can directly and intuitively reflect the correlation between features and classes; its drawback, however, is its sensitivity to the inaccuracies brought by sparse data, which may make the mutual information between features negative and cause trouble for later processing.
In this embodiment, a modified calculation formula based on mutual information is used, which to a certain extent avoids the problem that the binary mutual information of two-tuples formed by low-frequency words is higher than that of two-tuples formed by high-frequency words, and which weakens the influence of data sparsity on the feature degree of correlation:
In formula (3), R (ti,tj) represent feature ti、tjBetween the degree of correlation, P (ti,tj) represent in data set, feature ti And tjThe probability occurred simultaneously, P (ti) represent feature tiThe probability occurred in data set, P (tj) represent feature tjIn data set The probability of middle appearance.
In a specific embodiment of the aforementioned automatic text classification method, further, after each text has been represented in turn as a feature vector according to the resulting feature vocabulary, the method also includes:
normalizing each value of the feature vector representation.
As shown in Fig. 2 in the present embodiment, because gap of the input data on the order of magnitude, the data of input can be caused to go out Existing incompatibility problem, so needing the vector characteristics numerical value for inputting the noise reduction deep neural network model of structure being normalized Processing, specifically:The each numerical value for being expressed as vector characteristics form is normalized according to formula (4):
In formula (4), xi、ViRepresent to normalize forward and backward characteristic value respectively, V represents the characteristic vector after normalization, xminWith xmaxIt is the minimum value and maximum for the vector characteristics intermediate value for inputting noise reduction deep neural network model respectively.
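As a one-function sketch, formula (4) is ordinary min-max scaling:

```python
# Min-max normalization of formula (4): Vi = (xi - xmin) / (xmax - xmin).
import numpy as np

def min_max_normalize(x):
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

print(min_max_normalize([2.0, 5.0, 8.0]))   # -> [0.  0.5 1. ]
```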
In a specific embodiment of the aforementioned automatic text classification method, further, the denoising deep neural network model includes:
a first denoising autoencoder at the bottom of the denoising deep neural network model, a second denoising autoencoder above the first denoising autoencoder, a first restricted Boltzmann machine above the second denoising autoencoder, and a second restricted Boltzmann machine above the first restricted Boltzmann machine.
Further, the first denoising autoencoder and the second denoising autoencoder form a denoising module used to denoise the feature vectors input to the denoising deep neural network model; the layer holding the second denoising autoencoder is the output layer of the denoising module and, at the same time, the input layer of the first restricted Boltzmann machine;
the second restricted Boltzmann machine is the output layer of the denoising deep neural network model, and the output of the output layer is the feature representation of the text to be classified.
Further, the input of the denoising deep neural network model is a feature vector of fixed dimension.
In this embodiment, the automatic text classification method described herein was studied experimentally on the Chinese text corpus compiled by Fudan University. The corpus contains nearly ten thousand documents in 20 categories, including sports, politics, medicine, art, military affairs, and economics; the distribution of the training and test sets in the data set is shown in Table 2.
Table 2. Distribution of the training and test sets in the data set
Category   Class name    Training set   Test set
C1         Computer      900            300
C2         Environment   900            300
C3         Agriculture   900            300
C4         Economy       900            300
C5         Politics      900            300
C6         Sports        900            300
...        ...           ...            ...
In this embodiment, the denoising deep neural network model serves as the main sub-module of the feature extraction module; it is built mainly from two major components, the denoising autoencoder (DAE) and the restricted Boltzmann machine (RBM).
After weighing the complexity of training against the efficiency of the model, this embodiment uses 2 layers of denoising autoencoders (DAE) and 2 layers of restricted Boltzmann machines (RBM). The topology of the denoising deep neural network model is shown in Fig. 3; the model includes: a first denoising autoencoder (DAE1) at the bottom of the model, a second denoising autoencoder (DAE2) above the first denoising autoencoder, a first restricted Boltzmann machine (RBM1) above the second denoising autoencoder, and a second restricted Boltzmann machine (RBM2) above the first restricted Boltzmann machine.
In this embodiment, the first denoising autoencoder (DAE1) and the second denoising autoencoder (DAE2) form a denoising module used to denoise the feature vectors input to the model; the layer holding the second denoising autoencoder (DAE2) is the output layer of the denoising module and, at the same time, the input layer of the first restricted Boltzmann machine (RBM1); the second restricted Boltzmann machine (RBM2) is the output layer of the model, and its output is the feature representation of the text to be classified.
In this embodiment, the feature extraction module first denoises the input original feature vector with the denoising module, which sits at the bottom of the whole model so as to make full use of the denoising property of the denoising autoencoder: its unsupervised learning ability reconstructs the input original feature vector, achieving a denoising of the input signal, so that the signal entering the network after the denoising autoencoders is purer, reducing the influence of noisy data on the subsequently built classifier.
In this embodiment, the first and second restricted Boltzmann machines have strong feature extraction ability; located on the upper layers of the model, they can learn the complicated regularities in the data, so that the extracted high-level features are more representative. After further feature extraction by the RBMs, the more representative features finally extracted are fed into the classifier, which is expected to give the best classification results.
In this embodiment, a suitable corruption ratio and learning rate are selected through experiments to improve the performance of the feature extraction module.
In this embodiment, the working process of the denoising deep neural network model (DDNN) is shown in Fig. 4. The model comprises four layers in total: DAE1, DAE2, RBM1, and RBM2. v is the visible layer and at the same time the input layer of the model; every text is represented by a vector of fixed dimension. W_1, W_2, W_3, and W_4 denote the connection weights between the layers, and h_1, h_2, h_3, and h_4 denote the hidden layers, corresponding to DAE1, DAE2, RBM1, and RBM2. There are no connections between nodes of the same layer, but the nodes of every two adjacent layers are fully connected.
In this embodiment, the input of the model is a vector of fixed dimension; the denoising module composed of the two layers DAE1 and DAE2 is trained first, and the DAE2 layer is the output layer of the denoising module as well as the input layer of the subsequent RBM1 layer. RBM2 is the output layer of the model and yields the feature representation of the text: in contrast to the visible layer, which is the low-level feature representation of the text data, this layer is the high-level feature representation on which all subsequent text classification tasks are computed. A sketch of this layer-wise construction follows.
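The four-layer DDNN layout can be sketched as below. This is a minimal NumPy illustration of the idea, with tied-weight denoising autoencoders, CD-1 restricted Boltzmann machines, and greedy layer-wise pretraining; the layer sizes, learning rates, and corruption ratio are assumptions, not the patent's tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DAE:
    """Denoising autoencoder with tied weights: corrupt the input, then
    reconstruct the clean input (cross-entropy gradient, sigmoid output)."""
    def __init__(self, n_in, n_hid, noise=0.3, lr=0.1):
        self.W = rng.normal(0.0, 0.01, (n_in, n_hid))
        self.b = np.zeros(n_hid)   # hidden bias
        self.c = np.zeros(n_in)    # visible bias
        self.noise, self.lr = noise, lr

    def encode(self, x):
        return sigmoid(x @ self.W + self.b)

    def train_step(self, x):
        x_noisy = x * (rng.random(x.shape) > self.noise)   # masking corruption
        h = self.encode(x_noisy)
        x_rec = sigmoid(h @ self.W.T + self.c)
        d_out = x_rec - x                                  # output-layer delta
        d_hid = (d_out @ self.W) * h * (1.0 - h)           # hidden-layer delta
        B = len(x)
        self.W -= self.lr * (d_out.T @ h + x_noisy.T @ d_hid) / B
        self.c -= self.lr * d_out.mean(axis=0)
        self.b -= self.lr * d_hid.mean(axis=0)

class RBM:
    """Restricted Boltzmann machine trained with one step of contrastive
    divergence (CD-1)."""
    def __init__(self, n_vis, n_hid, lr=0.1):
        self.W = rng.normal(0.0, 0.01, (n_vis, n_hid))
        self.b = np.zeros(n_hid)
        self.c = np.zeros(n_vis)
        self.lr = lr

    def encode(self, v):
        return sigmoid(v @ self.W + self.b)

    def train_step(self, v0):
        h0 = self.encode(v0)
        h0_sample = (h0 > rng.random(h0.shape)).astype(float)
        v1 = sigmoid(h0_sample @ self.W.T + self.c)        # reconstruction
        h1 = self.encode(v1)
        B = len(v0)
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / B
        self.b += self.lr * (h0 - h1).mean(axis=0)
        self.c += self.lr * (v0 - v1).mean(axis=0)

def pretrain_ddnn(X, sizes=(1000, 600, 300, 150, 50), epochs=10):
    """Greedy layer-wise pretraining: DAE1 -> DAE2 (denoising module), then
    RBM1 -> RBM2; the final activations are the high-level text features."""
    layers = [DAE(sizes[0], sizes[1]), DAE(sizes[1], sizes[2]),
              RBM(sizes[2], sizes[3]), RBM(sizes[3], sizes[4])]
    data = X
    for layer in layers:
        for _ in range(epochs):
            layer.train_step(data)
        data = layer.encode(data)
    return layers, data

X = rng.random((8, 1000))   # 8 illustrative 1000-dim TF-IDF text vectors
layers, features = pretrain_ddnn(X, epochs=2)
print(features.shape)       # -> (8, 50): the high-level text features
```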
In this embodiment, the recognition and classification module classifies with a Softmax regression algorithm whose input is the high-level features output by the denoising deep neural network model (DDNN).
In this embodiment, suppose the text data set contains n texts from k classes, and the training set is denoted {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n-1), y^(n-1)), (x^(n), y^(n))}, where x^(i) is the i-th training text and the class label y can take several different values, y^(i) ∈ {1, 2, ..., k-1, k}. The main purpose of the Softmax regression algorithm is, for a given training sample x, to be able to compute the probability that x belongs to each class label. The hypothesis function is formula (5):

h_θ(x^(i)) = (1 / Σ_{j=1..k} e^{θ_j^T x^(i)}) * [ e^{θ_1^T x^(i)}, e^{θ_2^T x^(i)}, ..., e^{θ_k^T x^(i)} ]^T   (5)

In formula (5), each component of the vector h_θ(x^(i)) is the probability that text x^(i) belongs to the corresponding class; so that the components of the vector sum to 1, the probability values are normalized. θ_1, θ_2, ..., θ_{k-1}, θ_k ∈ R^{n+1}, where R^{n+1} is the (n+1)-dimensional real space; each θ_j here is an (n+1)-dimensional vector, the parameters of Softmax itself, which weight every dimension of a sample's attributes to obtain a number θ_j^T x^(i); the superscript T denotes transposition.
The cost function used in the Softmax regression algorithm is formula (6):

J(θ) = -(1/n) [ Σ_{i=1..n} Σ_{j=1..k} 1{y^(i) = j} * log( e^{θ_j^T x^(i)} / Σ_{l=1..k} e^{θ_l^T x^(i)} ) ] + (λ/2) Σ_{i=1..k} Σ_{j=0..n} θ_ij^2   (6)

In formula (6), 1{ } is the indicator function: when the expression in the braces is true, the value of the function is 1; conversely, when the expression in the braces is false, the value of the function is 0. θ_ij is the j-th dimension of the i-th parameter vector of Softmax, and (λ/2) Σ Σ θ_ij^2 is a penalty term. Because the cost function before the plus sign is not a strictly convex function, a weight decay term is added after it to prevent multiple extrema from appearing. When the Softmax regression model parameter λ > 0, the cost function becomes a strictly convex function, which prevents overfitting to the training samples and finally yields the global optimal solution.
The extremum of the cost function is found by gradient descent; the gradient of the cost function is formula (7):

∇_{θ_j} J(θ) = -(1/n) Σ_{i=1..n} [ x^(i) * ( 1{y^(i) = j} - P(y^(i) = j | x^(i); θ) ) ] + λ θ_j   (7)

Once θ is obtained, the previously assumed function h_θ(x) is obtained, so that the probability of each class to which text x belongs can be computed from h_θ(x); the class with the largest probability value is the final class predicted by the Softmax regression algorithm. A code sketch of this classifier follows.
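A compact NumPy sketch of formulas (5)-(7), with batch gradient descent on the weight-decayed cost; the learning rate and λ are illustrative assumptions:

```python
# Softmax regression: hypothesis (5), regularized cost gradient (7),
# prediction by the largest class probability.
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)       # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)    # rows of formula (5)

def train_softmax(X, y, k, lr=0.5, lam=1e-4, epochs=200):
    n = len(X)
    Xb = np.hstack([np.ones((n, 1)), X])       # bias column: theta in R^(d+1)
    Theta = np.zeros((k, Xb.shape[1]))
    Y = np.eye(k)[y]                           # one-hot labels: 1{y(i) = j}
    for _ in range(epochs):
        P = softmax(Xb @ Theta.T)              # P(y = j | x; theta)
        grad = -(Y - P).T @ Xb / n + lam * Theta   # formula (7)
        Theta -= lr * grad
    return Theta

def predict(Theta, X):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return np.argmax(Xb @ Theta.T, axis=1)     # class with largest probability

X = np.array([[0.1, 0.9], [0.8, 0.2], [0.2, 0.8], [0.9, 0.1]])
y = np.array([0, 1, 0, 1])
print(predict(train_softmax(X, y, k=2), X))    # -> [0 1 0 1]
```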
In this embodiment, with data without added noise as input, the automatic text classification method proposed in the present invention was compared with single-algorithm models; the resulting text classification accuracies are shown in Table 3.
Table 3. Classification accuracy (%) of different algorithms on data without added noise
With data with added noise as input, the automatic text classification method proposed in the present invention was compared with single-algorithm models; the resulting text classification accuracies are shown in Table 4.
Table 4. Classification accuracy (%) of different algorithms on data with added noise
In Tables 3 and 4, KNN, BPNN, and SVM denote the K-nearest neighbors, back-propagation neural network, and support vector machine algorithms, respectively.
It should be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relation or order between these entities or operations.
The above is the preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

  1. An automatic text classification method, characterized by including:
    obtaining text to be classified;
    building a denoising deep neural network model using denoising autoencoders and restricted Boltzmann machines;
    performing feature extraction on the obtained text to be classified using the built denoising deep neural network model;
    classifying automatically with a Softmax regression algorithm according to the feature extraction result.
  2. The automatic text classification method according to claim 1, characterized in that, before performing feature extraction on the obtained text to be classified with the built denoising deep neural network model, the method also includes:
    removing the noise data from the obtained text to be classified, wherein the noise data includes useless information and/or punctuation marks and special characters in the text.
  3. The automatic text classification method according to claim 2, characterized in that, after removing the noise data from the obtained text to be classified, the method also includes:
    performing word segmentation on the text data from which the noise data has been removed.
  4. The automatic text classification method according to claim 3, characterized in that, after performing word segmentation on the text data from which the noise data has been removed, the method also includes:
    removing stop words from the text data according to the word segmentation result, wherein the removed stop words are feature words without discriminative or predictive ability.
  5. The automatic text classification method according to claim 4, characterized in that, after removing stop words from the text data, the method also includes:
    collecting the feature words remaining after stop-word removal into a vocabulary;
    calculating the weight of each feature word in the feature vocabulary and recording it in the feature vocabulary, wherein the feature vocabulary includes the correspondence among the texts, the feature words in each text, and the weights of those feature words;
    representing each text in turn as a feature vector according to the resulting feature vocabulary.
  6. The automatic text classification method according to claim 5, characterized in that representing each text in turn as a feature vector according to the resulting feature vocabulary includes:
    judging, according to a preset rule, whether a first text is a short text;
    if so, expanding the features of the first text according to a short-text feature expansion algorithm, and representing the first text as a feature vector based on the expansion result;
    if not, directly representing the first text as a feature vector according to the resulting feature vocabulary.
  7. The automatic text classification method according to claim 5, characterized in that, after each text has been represented in turn as a feature vector according to the resulting feature vocabulary, the method also includes:
    normalizing each value of the feature vector representation.
  8. The automatic text classification method according to claim 1, characterized in that the denoising deep neural network model includes:
    a first denoising autoencoder at the bottom of the denoising deep neural network model, a second denoising autoencoder above the first denoising autoencoder, a first restricted Boltzmann machine above the second denoising autoencoder, and a second restricted Boltzmann machine above the first restricted Boltzmann machine.
  9. The automatic text classification method according to claim 8, characterized in that the first denoising autoencoder and the second denoising autoencoder form a denoising module used to denoise the feature vectors input to the denoising deep neural network model; wherein the layer holding the second denoising autoencoder is the output layer of the denoising module and, at the same time, the input layer of the first restricted Boltzmann machine;
    the second restricted Boltzmann machine is the output layer of the denoising deep neural network model, and the output of the output layer is the feature representation of the text to be classified.
  10. The automatic text classification method according to claim 1, characterized in that the input of the denoising deep neural network model is a feature vector of fixed dimension.
CN201710822309.2A 2017-09-13 2017-09-13 Automatic text classification method Pending CN107609113A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710822309.2A CN107609113A (en) 2017-09-13 2017-09-13 Automatic text classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710822309.2A CN107609113A (en) 2017-09-13 2017-09-13 Automatic text classification method

Publications (1)

Publication Number Publication Date
CN107609113A true CN107609113A (en) 2018-01-19

Family

ID=61063938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710822309.2A Pending CN107609113A (en) 2017-09-13 2017-09-13 Automatic text classification method

Country Status (1)

Country Link
CN (1) CN107609113A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955856A (en) * 2012-11-09 2013-03-06 北京航空航天大学 Chinese short text classification method based on characteristic extension
CN104834747A (en) * 2015-05-25 2015-08-12 中国科学院自动化研究所 Short text classification method based on convolutional neural network
KR101681109B1 (en) * 2015-10-01 2016-11-30 한국외국어대학교 연구산학협력단 An automatic method for classifying documents by using presentative words and similarity
CN105912716A (en) * 2016-04-29 2016-08-31 国家计算机网络与信息安全管理中心 Short text classification method and apparatus
CN106372640A (en) * 2016-08-19 2017-02-01 中山大学 Character frequency text classification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
周超 (Zhou Chao): "Research on Text Classification Based on a Deep Learning Hybrid Model", China Master's Theses Full-text Database *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108447565A (en) * 2018-03-23 2018-08-24 北京工业大学 A small-for-gestational-age infant disease prediction method based on an improved denoising autoencoder
CN108447565B (en) * 2018-03-23 2021-10-08 北京工业大学 Small-for-gestational-age infant prediction method based on an improved denoising autoencoder
US11488055B2 (en) 2018-07-26 2022-11-01 International Business Machines Corporation Training corpus refinement and incremental updating
CN109308471A (en) * 2018-09-29 2019-02-05 河海大学常州校区 An EMG feature extraction method
CN111310451A (en) * 2018-12-10 2020-06-19 北京沃东天骏信息技术有限公司 Sensitive dictionary generation method and device, storage medium and electronic equipment
CN109829054A (en) * 2019-01-17 2019-05-31 齐鲁工业大学 A text classification method and system
CN112214598A (en) * 2020-09-27 2021-01-12 中润普达(十堰)大数据中心有限公司 Cognitive system based on hair condition
CN112905795A (en) * 2021-03-11 2021-06-04 证通股份有限公司 Text intention classification method, device and readable medium

Similar Documents

Publication Publication Date Title
CN107609113A (en) Automatic text classification method
CN108304468B (en) Text classification method and text classification device
Kadhim et al. Text document preprocessing and dimension reduction techniques for text document clustering
CN105389379A (en) Rubbish article classification method based on distributed feature representation of text
CN107301171A (en) A kind of text emotion analysis method and system learnt based on sentiment dictionary
CN101714135B (en) Emotional orientation analytical method of cross-domain texts
Romanov et al. Application of natural language processing algorithms to the task of automatic classification of Russian scientific texts
Dasari et al. Text categorization and machine learning methods: current state of the art
CN106570170A (en) Text classification and naming entity recognition integrated method and system based on depth cyclic neural network
CN113157859A (en) Event detection method based on upper concept information
Hossain et al. Authorship classification in a resource constraint language using convolutional neural networks
Balli et al. Sentimental analysis of Twitter users from Turkish content with natural language processing
Ong et al. Sentiment analysis of informal Malay tweets with deep learning
CN107463715A (en) English social media account number classification method based on information gain
Nguyen et al. An ensemble of shallow and deep learning algorithms for Vietnamese sentiment analysis
Soni et al. A comprehensive study for the Hindi language to implement supervised text classification techniques
Mamoun et al. Arabic text stemming: Comparative analysis
Dhar et al. Bengali news headline categorization using optimized machine learning pipeline
Zobeidi et al. Effective text classification using multi-level fuzzy neural network
CN110348497A (en) A kind of document representation method based on the building of WT-GloVe term vector
KR101240330B1 (en) System and method for mutidimensional document classification
CN113761123A (en) Keyword acquisition method and device, computing equipment and storage medium
Alharbi et al. Neural networks based on Latent Dirichlet Allocation for news web page classifications
Susmitha et al. Performance assessment using supervised machine learning algorithms of opinion mining on social media dataset
Wikarsa et al. Automatic Generation Of Word-Emotion Lexicon For Multiple Sentiment Polarities On Social Media Texts

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180119