CN108573047A - Training method and device for a Chinese text classification model - Google Patents

Training method and device for a Chinese text classification model Download PDF

Info

Publication number
CN108573047A
CN108573047A (application CN201810350019.7A)
Authority
CN
China
Prior art keywords
training
text
module
training text
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810350019.7A
Other languages
Chinese (zh)
Inventor
刘怡俊
林裕鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN201810350019.7A priority Critical patent/CN108573047A/en
Publication of CN108573047A publication Critical patent/CN108573047A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a training method and device for a Chinese text classification model, addressing the technical problem that the feature items produced by traditional text representation methods are mutually independent and the resulting data are sparse, leading to a heavy computational load. The method includes: S1, obtaining labeled training text; S2, preprocessing the training text to obtain segmented training text; S3, inputting the segmented training text into a word2vec model to convert it into a set of word vectors; S4, inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function; S5, judging whether the loss value is below a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, returning to step S1.

Description

Training method and device for a Chinese text classification model
Technical field
The present invention relates to the field of text classification technology, and in particular to a training method and device for a Chinese text classification model.
Background technology
Since the 1990s, with the spread of the Internet and the continuous improvement of network technology, the Internet has become the world's largest and richest information resource. According to the latest CNNIC statistics as of the end of December 2016, the number of Chinese web pages had reached the hundreds of billions and the number of Chinese Internet users had exceeded 688 million; the Internet has become a basic resource of people's daily lives. The openness of the Internet allows all kinds of information to be published online, but this also results in irregular and redundant information. How to effectively organize and manage massive amounts of unstructured text, and to locate information precisely for users, is a major challenge facing the field of information science and technology today. One successful approach is to classify information automatically according to its content.
Automatic classification technology developed from traditional manual classification of information. As an effective means of information processing, it organizes information according to a taxonomy and largely solves the problem of information clutter. Although traditional manual classification is quite mature, it is clearly unsuited to Internet information that is updated constantly. In the 1980s, knowledge engineering was used to guide text classification: expert knowledge was manually encoded as a set of rules, and documents were classified according to those rules for a given set of categories. Since the 1990s, machine learning has gradually become the mainstream technology for text classification: starting from a set of manually labeled documents, an inductive learning process extracts the category features of interest, and machine learning techniques are then used to build an automatic text classifier. Chinese has more users than any other language in the world, and with the arrival of the information age and the globalization of the knowledge economy, the effectiveness of Chinese text classification has become very important.
In recent years, deep learning models have achieved remarkable results in computer vision and speech recognition. In natural language processing, using neural networks for feature learning and text classification has become a cutting-edge technique. Existing classification methods mainly include rule-based classification models and machine learning models; well-known text classification methods include decision trees (Decision Tree), random forests (Random Forest), Bayesian classifiers (Bayes), linear classifiers (logistic regression), support vector machines (SVM), and maximum entropy classifiers. All of these rely on machine learning, performing text classification through manual feature engineering and shallow classification models.
The task of text classification (Text Classification) is to automatically assign a predefined class label to a document according to its content or topic. Classifying a document generally requires two steps: text representation and classifier learning. How to represent a document as structured data that an algorithm can process is a key link in text classification. Traditional representations are all discrete, such as one-hot encoding, which uses an N-bit status register to encode N states, each state having its own register bit with only one bit active at any time. Although this representation gives each word a unique index, it discards the ordering relations between words in a sentence; moreover, the larger the vocabulary, the longer the codes and the sparser the data. Later came the bag-of-words model (Bag of Words), which represents a document vector simply by summing the word vectors of its words, and the n-gram model, which encodes n adjacent words together so as to capture word order; but the vocabulary dimension grows with the corpus, word sequences grow even faster, and the data sparsity problem remains.
In these traditional text representation methods, the feature items are mutually independent and the data are sparse, resulting in the technical problem of a heavy computational load.
Summary of the invention
The present invention provides a training method and device for a Chinese text classification model, solving the technical problem that the feature items produced by traditional text representation methods are mutually independent and the data are sparse, which results in a heavy computational load.
The present invention provides a training method for a Chinese text classification model, including:
S1, obtaining labeled training text;
S2, preprocessing the training text to obtain segmented training text;
S3, inputting the segmented training text into a word2vec model to convert it into a set of word vectors;
S4, inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function;
S5, judging whether the loss value is below a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, returning to step S1.
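Steps S1 to S5 can be sketched as a simple training loop. The following is a minimal illustration only; every helper function, the sample data, and the threshold value are invented stand-ins, not the patent's implementation:

```python
# Minimal sketch of the S1-S5 training loop described above.
# All helpers are illustrative stubs, not the patent's implementation.

def get_labeled_text():
    # S1: obtain labeled training text (stubbed with a fixed sample)
    return [("体育 新闻 比赛", 0), ("财经 股票 市场", 1)]

def preprocess(texts):
    # S2: word segmentation (the sample is already space-delimited)
    return [(t.split(), y) for t, y in texts]

def to_vectors(tokens):
    # S3: map each token to a dense vector (stand-in for word2vec)
    return [[float(hash(tok) % 7)] for tok in tokens]

def train_step(batch):
    # S4: one CNN training step returning a loss value
    # (stubbed as a loss that halves every iteration)
    train_step.loss *= 0.5
    return train_step.loss
train_step.loss = 1.0

def train(threshold=0.1):
    # S5: repeat until the loss falls below the preset threshold
    while True:
        batch = [(to_vectors(toks), y)
                 for toks, y in preprocess(get_labeled_text())]
        loss = train_step(batch)
        if loss < threshold:
            # network considered converged; parameters would be saved here
            return loss

final_loss = train()
```

With the stubbed halving loss, the loop runs four iterations (0.5, 0.25, 0.125, 0.0625) before the threshold of 0.1 is crossed.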
Optionally, step S2 specifically includes:
segmenting the training text with a preset knowledge-based word segmentation model to obtain the segmented training text.
Optionally, step S2 further includes:
extracting the feature words in the training text by the term frequency-inverse document frequency (TF-IDF) method, and removing meaningless words from the training text;
computing the feature weight corresponding to each feature word.
Optionally, after step S3 and before step S4, the method further includes:
according to the feature weight of each feature word, increasing the weight that the corresponding word vector carries within the word-vector set.
The present invention also provides a training device for a Chinese text classification model, including:
an acquiring unit for obtaining labeled training text;
a preprocessing unit for preprocessing the training text to obtain segmented training text;
a vector conversion unit for inputting the segmented training text into a word2vec model and converting it into a set of word vectors;
a training unit for inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function;
a judging unit for judging whether the loss value is below a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, jumping back to the acquiring unit.
Optionally, the preprocessing unit specifically includes:
a segmentation subunit for segmenting the training text with a preset knowledge-based word segmentation model to obtain the segmented training text.
Optionally, the preprocessing unit further includes:
a feature extraction subunit for extracting the feature words in the training text by the TF-IDF method and removing meaningless words from the training text;
a feature weight computation subunit for computing the feature weight corresponding to each feature word.
Optionally, the training device for a Chinese text classification model provided by the present invention further includes:
a weight adjustment unit for increasing, according to the feature weight of each feature word, the weight that the corresponding word vector carries within the word-vector set.
The present invention also provides a classification method for Chinese text, based on a Chinese text classification model obtained by any of the training methods described above, including:
obtaining a text to be classified;
inputting the text to be classified into the Chinese text classification model obtained by any of the training methods described above, and obtaining the classification result of the text to be classified.
The present invention also provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement any of the methods described above.
As can be seen from the above technical solutions, the present invention has the following advantages:
The present invention provides a training method for a Chinese text classification model, including: S1, obtaining labeled training text; S2, preprocessing the training text to obtain segmented training text; S3, inputting the segmented training text into a word2vec model to convert it into a set of word vectors; S4, inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function; S5, judging whether the loss value is below a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, returning to step S1.
By using a word2vec model to convert the training text into a set of word vectors, the present invention enables text to be represented as continuous, dense data similar to images and speech. A convolutional neural network is then used, in a form similar to image processing, to train the network parameters through its convolutional layers, pooling layers, and nonlinear transformations, so that correct classification can be obtained. This solves the technical problem that the feature items produced by traditional text representation methods are mutually independent and the data are sparse, resulting in a heavy computational load.
Description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of one embodiment of a training method for a Chinese text classification model provided by the present invention;
Fig. 2 is a flow diagram of another embodiment of the training method for a Chinese text classification model provided by the present invention;
Fig. 3 is a structural diagram of one embodiment of a training device for a Chinese text classification model provided by the present invention.
Detailed description of the embodiments
An embodiment of the present invention provides a training method and device for a Chinese text classification model, solving the technical problem that the feature items produced by traditional text representation methods are mutually independent and the data are sparse, resulting in a heavy computational load.
To make the purpose, features, and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the drawings. Obviously, the embodiments described below are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative work fall within the scope of protection of the present invention.
Referring to Fig. 1, the present invention provides one embodiment of a training method for a Chinese text classification model, including:
101, obtaining labeled training text;
102, preprocessing the training text to obtain segmented training text;
103, inputting the segmented training text into a word2vec model and converting it into a set of word vectors;
104, inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function;
105, judging whether the loss value is below a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, returning to step 101.
By using a word2vec model to convert the training text into a set of word vectors, this embodiment enables text to be represented as continuous, dense data similar to images and speech. A convolutional neural network is then used, in a form similar to image processing, to train the network parameters through its convolutional layers, pooling layers, and nonlinear transformations, so that correct classification can be obtained. This solves the technical problem that the feature items produced by traditional text representation methods are mutually independent and the data are sparse, resulting in a heavy computational load.
The above describes one embodiment of the training method for a Chinese text classification model provided by the present invention; another embodiment is described below.
Referring to Fig. 2, the present invention provides another embodiment of the training method for a Chinese text classification model, including:
201, obtaining labeled training text;
It should be noted that, before training, labeled training text — that is, training text whose classification result is known — must first be obtained.
202, extracting the feature words in the training text by the TF-IDF method, and removing meaningless words from the training text;
It should be noted that feature processing extracts the feature words that reflect the topic from the training text and determines their weights; it corresponds to feature word extraction and feature weight computation. Feature word extraction scores and ranks the candidate words independently according to some evaluation index, selects the highest-scoring words, and filters out the rest. The TF-IDF (term frequency-inverse document frequency) algorithm is used here; its idea is that the importance of a word is proportional to its frequency within a category and inversely proportional to the number of categories in which it appears. This filters out words that occur frequently in every document but contribute little discriminative power, so that important text features can be selected.
Depending on its source, a text typically carries labels unrelated to its content. These may be markup that controls display appearance, functional symbols such as punctuation, other media such as images, sound, or animation, or garbled characters. None of these help classification, so they should be removed.
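The TF-IDF scoring described above can be illustrated in a few lines. This is a minimal unsmoothed sketch (the sample documents are invented); production systems typically use a library implementation with smoothing:

```python
import math

def tf_idf(docs):
    """Compute raw TF-IDF weights for each word in each document.

    A minimal, unsmoothed illustration of the scoring idea: weight is
    proportional to in-document frequency and inversely proportional
    (via log) to the number of documents containing the word.
    """
    n = len(docs)
    df = {}  # document frequency of each word
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    weights = []
    for doc in docs:
        tf = {w: doc.count(w) / len(doc) for w in set(doc)}
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights

docs = [["股票", "市场", "股票"], ["比赛", "市场"], ["股票", "比赛"]]
w = tf_idf(docs)
```

In the first document, "股票" occurs twice while "市场" occurs once and both appear in two of the three documents, so "股票" receives the higher weight — exactly the "representative and discriminative" behavior the method aims for.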
203, computing the feature weight corresponding to each feature word;
It should be noted that the main idea of feature weight computation is that the importance of a word is proportional to its frequency within a category (representativeness) and inversely proportional to the number of categories in which it appears (discrimination). When feature extraction is performed mathematically, the main factor determining its effectiveness is the quality of the evaluation function.
204, segmenting the training text with a preset knowledge-based word segmentation model to obtain segmented training text;
It should be noted that in text information processing, words, terms, or phrases are usually chosen as the feature items of a text. Although a phrase carries more information, phrases occur rarely in text, so using phrases as feature items makes the feature vectors sparse and loses much important information. Therefore, to extract Chinese terms, relatively complex word segmentation of the Chinese text is required. A knowledge-based segmentation method is used here; it treats segmentation as a knowledge reasoning process requiring syntactic and semantic analysis, so the algorithm must be guided by a large amount of linguistic knowledge and information, allowing it to disambiguate words using the information provided by context. The content of a text is mainly conveyed by content words such as nouns, verbs, and adjectives; function words and the high-frequency words that appear in all kinds of text are meaningless for classification, so these common words can be filtered out.
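As a rough illustration of dictionary-driven Chinese segmentation, the sketch below uses forward maximum matching — a much simpler method than the knowledge-based model described above, which additionally applies syntactic and semantic analysis. The vocabulary is invented:

```python
def fmm_segment(text, vocab, max_len=4):
    """Forward maximum matching: at each position, take the longest
    dictionary word that matches, falling back to a single character.

    A simple dictionary-based stand-in for the knowledge-based
    segmentation model described in the text.
    """
    result, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                result.append(piece)
                i += size
                break
    return result

vocab = {"训练", "文本", "分类", "文本分类"}
tokens = fmm_segment("训练文本分类", vocab)
```

Because matching is greedy from the left, "文本分类" is taken as one word rather than split into "文本" and "分类" — the kind of ambiguity a knowledge-based model resolves with context rather than by word length alone.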
205, inputting the segmented training text into a word2vec model and converting it into a set of word vectors;
It should be noted that the distributed text representation produced by word2vec-style models is an important foundation of deep learning methods. The basic idea of distributed representation is to represent each word as an n-dimensional dense, continuous real-valued vector; its great advantage is its powerful representational capacity — an n-dimensional vector with k values per dimension can represent k^n concepts. With this word-vector representation, text data changes from a high-dimensional sparse form that neural networks handle poorly into continuous dense data similar to images and speech, allowing deep learning algorithms to be transferred to the text domain.
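The conversion of step 205 amounts to looking each token up in a trained embedding table. The tiny table and three-dimensional vectors below are invented for illustration; in practice the vectors come from a word2vec model trained on the corpus:

```python
# Toy embedding table standing in for a trained word2vec model.
EMBEDDINGS = {
    "股票": [0.9, 0.1, 0.0],
    "市场": [0.8, 0.2, 0.1],
    "比赛": [0.0, 0.9, 0.3],
}
UNK = [0.0, 0.0, 0.0]  # vector for out-of-vocabulary tokens

def text_to_vectors(tokens):
    """Convert a segmented text into a sequence of dense word vectors."""
    return [EMBEDDINGS.get(tok, UNK) for tok in tokens]

matrix = text_to_vectors(["股票", "市场", "未知"])
```

The result is a dense real-valued matrix (one row per word), which is exactly the image-like input the convolutional network of step 207 operates on.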
206, according to the feature weight of each feature word, increasing the weight that the corresponding word vector carries within the word-vector set;
It should be noted that after the feature words in the training text have been determined and their feature weights computed, and before the word-vector set converted from the training text is used for model training, the weight of each feature word's word vector within the word-vector set is increased according to its feature weight, to further improve the classification accuracy on the training text.
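One plausible reading of step 206 is to scale each feature word's vector by its feature weight; the patent does not specify the exact combination, so the sketch below is an assumption:

```python
def reweight(tokens, vectors, feature_weights):
    """Scale each feature word's vector by its feature weight.

    Non-feature words keep a neutral weight of 1.0. This is one
    assumed interpretation of 'increasing the weight of the
    corresponding word vector within the word-vector set'.
    """
    out = []
    for tok, vec in zip(tokens, vectors):
        w = feature_weights.get(tok, 1.0)
        out.append([w * x for x in vec])
    return out

tokens = ["股票", "的"]
vectors = [[1.0, 2.0], [1.0, 1.0]]
weighted = reweight(tokens, vectors, {"股票": 2.0})
```

The feature word "股票" is amplified while the function word "的" is left unchanged, so the network's convolution responses are biased toward the topic-bearing words.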
207, inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function;
It should be noted that on the basis of the word-vector set obtained from the word2vec model, a convolutional neural network (CNN) is trained as the final text classifier. The main idea of the CNN text classification model is to perform convolution operations on input text in word-vector form. CNNs were originally used to process image data; unlike image processing, which convolves over a two-dimensional region, text-oriented convolution is performed over the terms in a fixed sliding window. After the convolutional layers, pooling layers, and nonlinear transformation layers, the CNN obtains a text feature vector for classification learning. The advantage of the CNN is that useful word-order information is effectively retained while computing the text feature vector. It automatically learns a multilayer neural network that maps input feature vectors to the corresponding class labels, and by introducing nonlinear activation layers the model can realize a nonlinear discriminative classifier. The high-quality initial feature representation of text provided by the word2vec model is a necessary condition for an effective classification model.
The loss value of the convolutional neural network obtained from each round of training is computed with a preset cost function.
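A single convolution filter with max-over-time pooling — the core operation described above — can be sketched in plain Python. Real CNN text classifiers use many learned filters, several window sizes, a nonlinearity, and a softmax output layer; everything below is a hand-built illustration:

```python
def conv_max_pool(matrix, kernel, window=2):
    """One convolution filter over a word-vector sequence, followed by
    max-over-time pooling.

    `matrix` is the word-vector sequence (one row per word); `kernel`
    has window * dim weights. A minimal single-filter sketch.
    """
    feats = []
    for i in range(len(matrix) - window + 1):
        # dot product of the kernel with the flattened window of word vectors
        region = [x for vec in matrix[i:i + window] for x in vec]
        feats.append(sum(k * x for k, x in zip(kernel, region)))
    return max(feats)  # max pooling keeps the strongest n-gram response

matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 words, 2-dim vectors
kernel = [1.0, 0.0, 0.0, 1.0]                  # window=2, dim=2 -> 4 weights
feature = conv_max_pool(matrix, kernel)
```

Sliding the window over adjacent word pairs is how the filter retains local word-order information, and the max pool reduces the variable-length text to one fixed feature per filter.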
208, judging whether the loss value is below a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, returning to step 201.
It should be noted that after each round of training, the corresponding loss value is computed and compared with the preset threshold. If the loss value is below the threshold, the convolutional neural network has converged; its parameters are saved and the Chinese text classification model can be generated. If not, new labeled training text must be obtained and training repeated.
By using a word2vec model to convert the training text into a set of word vectors, this embodiment enables text to be represented as continuous, dense data similar to images and speech; a convolutional neural network is then used, in a form similar to image processing, to train the network parameters through its convolutional layers, pooling layers, and nonlinear transformations, so that correct classification can be obtained. This solves the technical problem that the feature items produced by traditional text representation methods are mutually independent and the data are sparse, resulting in a heavy computational load. Furthermore, by extracting the feature words in the training text and increasing the weight of the corresponding word vectors in the word-vector set according to their feature weights, this embodiment improves the accuracy of text classification and reduces the model training time.
The above describes another embodiment of the training method for a Chinese text classification model provided by the present invention; an embodiment of the training device is described below.
Referring to Fig. 3, the present invention provides one embodiment of a training device for a Chinese text classification model, including:
an acquiring unit 301 for obtaining labeled training text;
a preprocessing unit 302 for preprocessing the training text to obtain segmented training text;
the preprocessing unit 302 specifically includes:
a feature extraction subunit 3021 for extracting the feature words in the training text by the TF-IDF method and removing meaningless words from the training text;
a feature weight computation subunit 3022 for computing the feature weight corresponding to each feature word;
a segmentation subunit 3023 for segmenting the training text with a preset knowledge-based word segmentation model to obtain the segmented training text;
a vector conversion unit 303 for inputting the segmented training text into a word2vec model and converting it into a set of word vectors;
a weight adjustment unit 304 for increasing, according to the feature weight of each feature word, the weight that the corresponding word vector carries within the word-vector set;
a training unit 305 for inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function;
a judging unit 306 for judging whether the loss value is below a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, jumping back to the acquiring unit 301.
The above describes an embodiment of the training apparatus for a Chinese text classification model provided by the present invention. An embodiment of the method for classifying Chinese text provided by the present invention is described below.
The present invention provides a method for classifying Chinese text, based on a Chinese text classification model obtained by the training method of any one of Embodiment 1 and Embodiment 2, comprising:
obtaining text to be classified;
inputting the text to be classified into the Chinese text classification model obtained by the training method of any one of Embodiment 1 and Embodiment 2, to obtain the classification result of the text to be classified.
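The two classification steps can be read as a small pipeline: segment the input, convert it to vectors, score it with the trained network, and take the highest-scoring label. The sketch below is a hypothetical skeleton; `segment`, `to_vectors`, and `model` are stand-ins for the segmentation model, the word2vec conversion, and the trained convolutional network, none of which the patent pins down concretely.

```python
def classify(text, segment, to_vectors, model, labels):
    """Run one piece of text through the trained pipeline and return its label."""
    tokens = segment(text)              # word segmentation
    vectors = to_vectors(tokens)        # word2vec conversion
    scores = model(vectors)             # one score per candidate class
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best]

# toy stand-ins, for demonstration only
label = classify(
    "体育新闻",
    segment=lambda t: list(t),
    to_vectors=lambda toks: [[1.0]] * len(toks),
    model=lambda vecs: [0.2, 0.7, 0.1],
    labels=["finance", "sports", "tech"],
)
```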
The above describes an embodiment of the method for classifying Chinese text provided by the present invention. An embodiment of the computer-readable storage medium provided by the present invention is described below.
The present invention provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the method of any one of Embodiment 1 and Embodiment 2.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that modifications may still be made to the technical solutions recorded in the foregoing embodiments, or equivalent replacements may be made to some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A training method for a Chinese text classification model, characterized by comprising:
S1, obtaining labeled training text;
S2, preprocessing the training text to obtain segmented training text;
S3, inputting the segmented training text into a word2vec model and converting the segmented training text into a set of word vectors;
S4, inputting the word vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function;
S5, judging whether the loss value is less than a preset threshold; if so, determining that the convolutional neural network has converged, saving the parameters of the convolutional neural network, and generating the trained Chinese text classification model; if not, returning to step S1.
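Steps S1 to S5 amount to a loop that repeats until the loss drops below the preset threshold. The sketch below shows only that control flow; `get_labeled_batch` and `train_step` are placeholder names standing in for the data source and the convolutional network update, neither of which is specified here.

```python
def train_until_converged(get_labeled_batch, train_step, threshold, max_iters=1000):
    """Loop over steps S1-S5: fetch labeled text, run one training pass,
    and stop once the loss value falls below the preset threshold."""
    for _ in range(max_iters):
        batch = get_labeled_batch()      # S1: obtain labeled training text
        loss = train_step(batch)         # S2-S4: preprocess, embed, train, compute loss
        if loss < threshold:             # S5: convergence test
            return loss                  # converged; caller saves network parameters
    raise RuntimeError("did not converge within max_iters")

# stub: a "loss" sequence standing in for real CNN training
losses = iter([0.9, 0.45, 0.2, 0.08])
final = train_until_converged(lambda: None, lambda batch: next(losses), threshold=0.1)
```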
2. The training method for a Chinese text classification model according to claim 1, characterized in that step S2 specifically comprises:
segmenting the training text with a preset knowledge-based word segmentation model to obtain the segmented training text.
3. The training method for a Chinese text classification model according to claim 2, characterized in that step S2 further comprises:
extracting feature words from the training text by the term frequency-inverse document frequency method, and removing meaningless words from the training text;
computing the feature weight corresponding to each feature word.
4. The training method for a Chinese text classification model according to claim 3, characterized in that, after step S3 and before step S4, the method further comprises:
increasing, according to the feature weight corresponding to each feature word, the weight of the word vector corresponding to that feature word within the word vector set.
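Claim 4 increases the weight of a feature word's vector within the vector set but does not state the scaling rule. One plausible reading, shown below as an assumption, multiplies each feature word's vector by (1 + alpha * weight) and leaves non-feature words untouched; `boost_feature_vectors` and `alpha` are names introduced here for illustration.

```python
def boost_feature_vectors(vectors, tokens, feature_weights, alpha=1.0):
    """Scale the word vector of each feature word by (1 + alpha * weight),
    so feature words carry more influence in the vector set.

    vectors: one word vector (list of floats) per token
    tokens: the segmented words, aligned with `vectors`
    feature_weights: TF-IDF weight per feature word; others default to 0
    """
    boosted = []
    for vec, tok in zip(vectors, tokens):
        w = feature_weights.get(tok, 0.0)
        boosted.append([x * (1.0 + alpha * w) for x in vec])
    return boosted

vecs = boost_feature_vectors(
    vectors=[[1.0, 2.0], [1.0, 2.0]],
    tokens=["分类", "的"],
    feature_weights={"分类": 0.5},  # "的" is not a feature word
)
```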
5. A training apparatus for a Chinese text classification model, characterized by comprising:
an acquiring unit, configured to obtain labeled training text;
a preprocessing unit, configured to preprocess the training text to obtain segmented training text;
a vector conversion unit, configured to input the segmented training text into a word2vec model and convert the segmented training text into a set of word vectors;
a training unit, configured to input the word vector set and the labels of the training text into a convolutional neural network for training, and to compute the loss value of the convolutional neural network with a preset cost function;
a judging unit, configured to judge whether the loss value is less than a preset threshold; if so, to determine that the convolutional neural network has converged, save the parameters of the convolutional neural network, and generate the trained Chinese text classification model; if not, control returns to the acquiring unit.
6. The training apparatus for a Chinese text classification model according to claim 5, characterized in that the preprocessing unit specifically comprises:
a segmentation subunit, configured to segment the training text with a preset knowledge-based word segmentation model to obtain the segmented training text.
7. The training apparatus for a Chinese text classification model according to claim 6, characterized in that the preprocessing unit further comprises:
a feature extraction subunit, configured to extract feature words from the training text by the term frequency-inverse document frequency method, and to remove meaningless words from the training text;
a feature weight computation subunit, configured to compute the feature weight corresponding to each feature word.
8. The training apparatus for a Chinese text classification model according to claim 7, characterized by further comprising:
a weight enhancement unit, configured to increase, according to the feature weight corresponding to each feature word, the weight of the word vector corresponding to that feature word within the word vector set.
9. A method for classifying Chinese text, based on a Chinese text classification model obtained by the training method for a Chinese text classification model according to any one of claims 1 to 4, characterized by comprising:
obtaining text to be classified;
inputting the text to be classified into the Chinese text classification model obtained by the training method for a Chinese text classification model according to any one of claims 1 to 4, to obtain the classification result of the text to be classified.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions which, when executed by a processor, implement the method according to any one of claims 1 to 4.
CN201810350019.7A 2018-04-18 2018-04-18 A kind of training method and device of Module of Automatic Chinese Documents Classification Pending CN108573047A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810350019.7A CN108573047A (en) 2018-04-18 2018-04-18 A kind of training method and device of Module of Automatic Chinese Documents Classification


Publications (1)

Publication Number Publication Date
CN108573047A true CN108573047A (en) 2018-09-25

Family

ID=63575162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810350019.7A Pending CN108573047A (en) 2018-04-18 2018-04-18 A kind of training method and device of Module of Automatic Chinese Documents Classification

Country Status (1)

Country Link
CN (1) CN108573047A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100583101C (en) * 2008-06-12 2010-01-20 昆明理工大学 Text categorization feature selection and weight computation method based on field knowledge
CN107102989A (en) * 2017-05-24 2017-08-29 南京大学 A kind of entity disambiguation method based on term vector, convolutional neural networks
US20170308790A1 (en) * 2016-04-21 2017-10-26 International Business Machines Corporation Text classification by ranking with convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
蓝雯飞, 徐蔚, 王涛: "基于卷积神经网络的中文新闻文本分类", 《中南民族大学学报(自然科学版)》 *

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492764A (en) * 2018-10-24 2019-03-19 平安科技(深圳)有限公司 Training method, relevant device and the medium of production confrontation network
CN109299468A (en) * 2018-10-25 2019-02-01 四川长虹电器股份有限公司 Short text classification method based on conditional entropy and convolutional neural networks
CN109376244A (en) * 2018-10-25 2019-02-22 山东省通信管理局 A kind of swindle website identification method based on tagsort
CN109582784A (en) * 2018-10-26 2019-04-05 阿里巴巴集团控股有限公司 File classification method and device
CN109684476A (en) * 2018-12-07 2019-04-26 中科恒运股份有限公司 A kind of file classification method, document sorting apparatus and terminal device
CN109684476B (en) * 2018-12-07 2023-10-17 中科恒运股份有限公司 Text classification method, text classification device and terminal equipment
CN111324831A (en) * 2018-12-17 2020-06-23 中国移动通信集团北京有限公司 Method and device for detecting fraudulent website
CN109858035A (en) * 2018-12-29 2019-06-07 深兰科技(上海)有限公司 A kind of sensibility classification method, device, electronic equipment and readable storage medium storing program for executing
CN109815334A (en) * 2019-01-25 2019-05-28 武汉斗鱼鱼乐网络科技有限公司 A kind of barrage file classification method, storage medium, equipment and system
CN109960726A (en) * 2019-02-13 2019-07-02 平安科技(深圳)有限公司 Textual classification model construction method, device, terminal and storage medium
CN109960726B (en) * 2019-02-13 2024-01-23 平安科技(深圳)有限公司 Text classification model construction method, device, terminal and storage medium
CN109948665A (en) * 2019-02-28 2019-06-28 中国地质大学(武汉) Physical activity genre classification methods and system based on long Memory Neural Networks in short-term
CN109948665B (en) * 2019-02-28 2020-11-27 中国地质大学(武汉) Human activity type classification method and system based on long-time and short-time memory neural network
CN109933667A (en) * 2019-03-19 2019-06-25 中国联合网络通信集团有限公司 Textual classification model training method, file classification method and equipment
CN110008342A (en) * 2019-04-12 2019-07-12 智慧芽信息科技(苏州)有限公司 Document classification method, apparatus, equipment and storage medium
CN110188798A (en) * 2019-04-28 2019-08-30 阿里巴巴集团控股有限公司 A kind of object classification method and model training method and device
CN110188798B (en) * 2019-04-28 2023-08-08 创新先进技术有限公司 Object classification method and model training method and device
CN110009064A (en) * 2019-04-30 2019-07-12 广东电网有限责任公司 A kind of semantic model training method and device based on electrical network field
CN110413773A (en) * 2019-06-20 2019-11-05 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium
WO2020253043A1 (en) * 2019-06-20 2020-12-24 平安科技(深圳)有限公司 Intelligent text classification method and apparatus, and computer-readable storage medium
CN110413773B (en) * 2019-06-20 2023-09-22 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium
CN110232128A (en) * 2019-06-21 2019-09-13 华中师范大学 Topic file classification method and device
CN110427480A (en) * 2019-06-28 2019-11-08 平安科技(深圳)有限公司 Personalized text intelligent recommendation method, apparatus and computer readable storage medium
CN110427480B (en) * 2019-06-28 2022-10-11 平安科技(深圳)有限公司 Intelligent personalized text recommendation method and device and computer readable storage medium
CN110879832A (en) * 2019-10-23 2020-03-13 支付宝(杭州)信息技术有限公司 Target text detection method, model training method, device and equipment
CN110909164A (en) * 2019-11-22 2020-03-24 科大国创软件股份有限公司 Text enhancement semantic classification method and system based on convolutional neural network
CN113010667A (en) * 2019-12-20 2021-06-22 王道维 Training method for machine learning decision model by using natural language corpus
CN113742479A (en) * 2020-05-29 2021-12-03 北京沃东天骏信息技术有限公司 Method and device for screening target text
CN112560427B (en) * 2020-12-16 2023-09-22 平安银行股份有限公司 Problem expansion method, device, electronic equipment and medium
CN112560427A (en) * 2020-12-16 2021-03-26 平安银行股份有限公司 Problem expansion method, device, electronic equipment and medium
CN112380350B (en) * 2021-01-14 2021-05-07 北京育学园健康管理中心有限公司 Text classification method and device
CN112380350A (en) * 2021-01-14 2021-02-19 北京崔玉涛儿童健康管理中心有限公司 Text classification method and device
CN113360657A (en) * 2021-06-30 2021-09-07 安徽商信政通信息技术股份有限公司 Intelligent document distribution and handling method and device and computer equipment
CN113360657B (en) * 2021-06-30 2023-10-24 安徽商信政通信息技术股份有限公司 Intelligent document distribution handling method and device and computer equipment
CN116186271A (en) * 2023-04-19 2023-05-30 北京亚信数据有限公司 Medical term classification model training method, classification method and device

Similar Documents

Publication Publication Date Title
CN108573047A (en) A kind of training method and device of Module of Automatic Chinese Documents Classification
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN110097085B (en) Lyric text generation method, training method, device, server and storage medium
CN102332028B (en) Webpage-oriented unhealthy Web content identifying method
CN109376251A (en) A kind of microblogging Chinese sentiment dictionary construction method based on term vector learning model
CN109977416A (en) A kind of multi-level natural language anti-spam text method and system
CN108363790A (en) For the method, apparatus, equipment and storage medium to being assessed
CN107229610A (en) The analysis method and device of a kind of affection data
CN103577989B (en) A kind of information classification approach and information classifying system based on product identification
CN111709242B (en) Chinese punctuation mark adding method based on named entity recognition
CN108090099B (en) Text processing method and device
CN112434164B (en) Network public opinion analysis method and system taking topic discovery and emotion analysis into consideration
CN110188195A (en) A kind of text intension recognizing method, device and equipment based on deep learning
CN113590764B (en) Training sample construction method and device, electronic equipment and storage medium
CN110287314A (en) Long text credibility evaluation method and system based on Unsupervised clustering
CN112528031A (en) Work order intelligent distribution method and system
CN103678318B (en) Multi-word unit extraction method and equipment and artificial neural network training method and equipment
CN107463703A (en) English social media account number classification method based on information gain
CN107357895A (en) A kind of processing method of the text representation based on bag of words
CN106570170A (en) Text classification and naming entity recognition integrated method and system based on depth cyclic neural network
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN110019776A (en) Article classification method and device, storage medium
CN111475651A (en) Text classification method, computing device and computer storage medium
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN111191029B (en) AC construction method based on supervised learning and text classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180925