CN108573047A - Training method and device for a Chinese text classification model - Google Patents
Training method and device for a Chinese text classification model
- Publication number: CN108573047A (application number CN201810350019.7A)
- Authority
- CN
- China
- Prior art keywords
- training
- text
- module
- training text
- convolutional neural
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a training method and device for a Chinese text classification model, which solve the technical problem that the feature items produced by traditional text representation methods are mutually independent and the resulting data are sparse, leading to a heavy computational load. The method includes: S1, obtaining labeled training text; S2, preprocessing the training text to obtain segmented training text; S3, inputting the segmented training text into a word2vec model and converting it into a set of word vectors; S4, inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function; S5, judging whether the loss value is less than a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, returning to step S1.
Description
Technical field
The present invention relates to the field of text classification, and in particular to a training method and device for a Chinese text classification model.
Background art
Since the 1990s, with the spread of the Internet and the continuous improvement of network technology, the Internet has become the largest and richest information resource repository in the world. According to the latest CNNIC statistics, by the end of December 2016 the number of Chinese web pages had reached the hundreds of billions and the number of Chinese Internet users had exceeded 688 million; the Internet has become a basic resource of people's daily lives. The openness of the Internet allows all kinds of information to be published online, but this same openness also leads to disorder and redundancy in the information it carries. How to effectively organize and manage massive amounts of unstructured text, and how to locate information precisely for users, is one of the great challenges facing the field of information science and technology today; one successful answer is to classify information automatically according to its content.
Automatic classification technology developed on the basis of traditional manual classification. As an effective means of information processing, it organizes diverse information according to a given taxonomy and largely resolves the problem of information clutter. Although traditional manual classification techniques are quite mature, they are clearly unsuited to Internet information that is updated constantly. In the 1980s, "knowledge engineering" (Knowledge Engineering) was used to guide text classification: expert knowledge was manually encoded as a set of rules, and documents were classified against those rules for a given set of categories. Since the 1990s, "machine learning" (Machine Learning) has gradually become the mainstream technology for text classification: from a set of documents labeled in advance by hand, an inductive learning process extracts the features of the categories of interest, and machine learning techniques are then used to build an automatic text classifier. Chinese is the language with the largest number of users in the world; with the arrival of the information age and the globalization of the knowledge economy, Chinese text classification has become highly important.
In recent years, deep learning models have achieved remarkable results in computer vision and speech recognition. In natural language processing, using neural networks to perform feature learning and text classification on natural language text has likewise become the cutting edge of text classification. Existing approaches mainly comprise rule-based classification models and classification models based on machine learning; well-known text classification methods include decision trees (Decision Tree), random forests (Random Forest), Bayesian classifiers (Bayes), linear classifiers (logistic regression), support vector machines (Support Vector Machine, SVM), and maximum entropy classifiers. All of these rely on machine learning methods, performing text classification through manual feature engineering and shallow classification models.
The task of text classification (Text Classification) is to automatically assign predefined category labels to a document according to its content or topic. Classifying a document generally requires two steps: text representation and classifier learning. How to represent a document as structured data that an algorithm can process is undoubtedly a crucial link in text classification. Traditional representations of text are all discrete. One-hot encoding, for example, uses an N-bit state register to encode N states; each state has its own independent register bit, and at any time only one bit is active. Although this representation gives each word a unique index, it discards any relationship between the order of words in a sentence; moreover, the larger the vocabulary, the longer the encoding, and the sparser the data become. The later bag-of-words model (Bag of Words) represents a document simply by summing the vectors of its constituent words. The n-gram model encodes n adjacent words together, so that word order is taken into account, but the vocabulary dimension explodes as the corpus grows, the number of word sequences grows even faster, and the data-sparsity problem remains.
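The discrete representations discussed above can be sketched in a few lines. The toy vocabulary and sentence below are invented for illustration; the point is only how long and how sparse the vectors become, and how word order is lost.

```python
# Minimal sketch of one-hot encoding and the bag-of-words sum built
# from it. The toy tokens below are invented for illustration.

def build_vocab(tokens):
    """Assign each distinct word a unique index."""
    vocab = {}
    for t in tokens:
        if t not in vocab:
            vocab[t] = len(vocab)
    return vocab

def one_hot(word, vocab):
    """N-bit state register with exactly one active bit."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def bag_of_words(tokens, vocab):
    """Document vector as the sum of its words' one-hot vectors."""
    vec = [0] * len(vocab)
    for t in tokens:
        vec[vocab[t]] += 1
    return vec

tokens = ["machine", "learning", "for", "text", "classification",
          "of", "text"]
vocab = build_vocab(tokens)

hot = one_hot("text", vocab)
bow = bag_of_words(tokens, vocab)

# Each vector is as long as the vocabulary and almost entirely zero,
# and word order ("text classification" vs "classification text") is
# lost entirely -- exactly the sparsity problem described above.
print(sum(hot), len(hot))
print(bow)
```

With a real corpus the vocabulary runs to hundreds of thousands of entries, so every vector is that long with almost every component zero.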
In these traditional text representation methods the feature items are mutually independent and the data are sparse, which leads to the technical problem of a heavy computational load.
Summary of the invention
The present invention provides a training method and device for a Chinese text classification model, which solve the technical problem that the feature items produced by traditional text representation methods are mutually independent and the resulting data are sparse, leading to a heavy computational load.
The present invention provides a training method for a Chinese text classification model, including:
S1, obtaining labeled training text;
S2, preprocessing the training text to obtain segmented training text;
S3, inputting the segmented training text into a word2vec model and converting it into a set of word vectors;
S4, inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function;
S5, judging whether the loss value is less than a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, returning to step S1.
Optionally, step S2 specifically includes:
segmenting the training text with a preset knowledge-based word segmentation model to obtain the segmented training text.
Optionally, step S2 further includes:
extracting the feature words in the training text by the term frequency-inverse document frequency (TF-IDF) method, and removing meaningless words from the training text;
calculating the feature weight corresponding to each feature word.
Optionally, after step S3 and before step S4, the method further includes:
according to the feature weight corresponding to each feature word, increasing the weight that the word vector of that feature word carries within the word-vector set.
The present invention provides a training device for a Chinese text classification model, including:
an acquiring unit for obtaining labeled training text;
a preprocessing unit for preprocessing the training text to obtain segmented training text;
a vector conversion unit for inputting the segmented training text into a word2vec model and converting it into a set of word vectors;
a training unit for inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function;
a judging unit for judging whether the loss value is less than a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, jumping back to the acquiring unit.
Optionally, the preprocessing unit specifically includes:
a segmentation subunit for segmenting the training text with a preset knowledge-based word segmentation model to obtain the segmented training text.
Optionally, the preprocessing unit further includes:
a feature extraction subunit for extracting the feature words in the training text by the TF-IDF method and removing meaningless words from the training text;
a feature weight computation subunit for calculating the feature weight corresponding to each feature word.
Optionally, the training device for a Chinese text classification model provided by the present invention further includes:
a weight boosting unit for increasing, according to the feature weight corresponding to each feature word, the weight that the word vector of that feature word carries within the word-vector set.
The present invention provides a classification method for Chinese text, based on the Chinese text classification model obtained by the training method described in any of the above, including:
obtaining text to be classified;
inputting the text to be classified into the Chinese text classification model obtained by the training method described in any of the above, and obtaining the classification result of the text to be classified.
The present invention provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the method described in any of the above.
As can be seen from the above technical solutions, the present invention has the following advantages:
The present invention provides a training method for a Chinese text classification model, including: S1, obtaining labeled training text; S2, preprocessing the training text to obtain segmented training text; S3, inputting the segmented training text into a word2vec model and converting it into a set of word vectors; S4, inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function; S5, judging whether the loss value is less than a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, returning to step S1.
By using a word2vec model to convert the training text into a set of word vectors, the present invention enables text to be expressed as continuous, dense data similar to images and speech. A convolutional neural network is then used, in a form similar to image processing, to train the network parameters through its convolutional layers, pooling layers, and nonlinear transformations, so that correct classification can be obtained. This solves the technical problem that the feature items produced by traditional text representation methods are mutually independent and the resulting data are sparse, leading to a heavy computational load.
Description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of one embodiment of the training method for a Chinese text classification model provided by the present invention;
Fig. 2 is a flow diagram of another embodiment of the training method for a Chinese text classification model provided by the present invention;
Fig. 3 is a structural diagram of one embodiment of the training device for a Chinese text classification model provided by the present invention.
Detailed description of the embodiments
Embodiments of the present invention provide a training method and device for a Chinese text classification model, which solve the technical problem that the feature items produced by traditional text representation methods are mutually independent and the resulting data are sparse, leading to a heavy computational load.
To make the purpose, features, and advantages of the present invention more obvious and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the embodiments disclosed below are only some, not all, of the embodiments of the present invention. Based on these embodiments, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present invention.
Referring to Fig. 1, the present invention provides one embodiment of the training method for a Chinese text classification model, including:
101, obtaining labeled training text;
102, preprocessing the training text to obtain segmented training text;
103, inputting the segmented training text into a word2vec model and converting it into a set of word vectors;
104, inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function;
105, judging whether the loss value is less than a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, returning to step 101.
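The control flow of steps 101-105 can be sketched as a loop. The data source, training step, and cost function below are illustrative stubs, not part of the claimed method; only the structure — train, compute the loss, compare against a preset threshold, repeat — reflects the steps above.

```python
# Sketch of the training loop in steps 101-105. The stubs fake a loss
# that shrinks each round; a real implementation would segment the
# text, build word vectors, and update CNN parameters here.

LOSS_THRESHOLD = 0.05  # the "preset threshold" of step 105 (assumed value)

def get_labeled_batch(step):
    """Step 101 stub: obtain labeled training text."""
    return [("some segmented text", "sports")]

def train_one_round(params, batch):
    """Steps 102-104 stub: pretend each round reduces the loss."""
    params["rounds"] += 1
    loss = 1.0 / params["rounds"]      # stand-in for the cost function
    return params, loss

def train_until_converged(max_rounds=1000):
    params = {"rounds": 0}
    for step in range(max_rounds):
        batch = get_labeled_batch(step)          # step 101
        params, loss = train_one_round(params, batch)  # steps 102-104
        if loss < LOSS_THRESHOLD:                # step 105: converged
            return params, loss                  # save params -> trained model
    return params, loss                          # else return to step 101

params, final_loss = train_until_converged()
print(params["rounds"], final_loss)
```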
By using a word2vec model to convert the training text into a set of word vectors, this embodiment of the present invention enables text to be expressed as continuous, dense data similar to images and speech. A convolutional neural network is then used, in a form similar to image processing, to train the network parameters through its convolutional layers, pooling layers, and nonlinear transformations, so that correct classification can be obtained. This solves the technical problem that the feature items produced by traditional text representation methods are mutually independent and the resulting data are sparse, leading to a heavy computational load.
The above describes one embodiment of the training method for a Chinese text classification model provided by the present invention; another embodiment of that training method is described below.
Referring to Fig. 2, the present invention provides another embodiment of the training method for a Chinese text classification model, including:
201, obtaining labeled training text;
It should be noted that before training it is first necessary to obtain labeled training text, i.e., training text whose classification result is already known.
202, extracting the feature words in the training text by the TF-IDF method, and removing meaningless words from the training text;
It should be noted that feature processing extracts from the training text the feature words that reflect its topic and determines the weight of each feature word; it comprises feature-word extraction and feature-weight calculation. Feature-word extraction scores and ranks the original feature words independently according to some evaluation index, selects the highest-scoring feature words, and filters out the rest. Here we use the TF-IDF (term frequency-inverse document frequency) algorithm, whose idea is that the importance of a word is proportional to its frequency within a category and inversely proportional to the number of categories in which it appears. This filters out words whose document frequency is high everywhere but whose discriminating power is small, so that important text features are selected.
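The TF-IDF scoring described above can be sketched directly. The toy corpus is invented for illustration; in this method the input would be the segmented Chinese training texts, and a particular weighting formula (plain tf x log-idf here) is assumed since the text does not fix one.

```python
# Minimal TF-IDF sketch: a word's score is proportional to its
# frequency in a document and inversely proportional to how many
# documents contain it.
import math

def tf_idf(corpus):
    """corpus: list of token lists. Returns one {word: score} dict per doc."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each word occur?
    df = {}
    for doc in corpus:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    scores = []
    for doc in corpus:
        doc_scores = {}
        for word in set(doc):
            tf = doc.count(word) / len(doc)          # term frequency
            idf = math.log(n_docs / df[word])        # inverse document freq.
            doc_scores[word] = tf * idf
        scores.append(doc_scores)
    return scores

corpus = [
    ["football", "match", "goal", "match"],
    ["stock", "market", "goal"],
    ["stock", "price", "market"],
]
scores = tf_idf(corpus)

# "match" is frequent in doc 0 and appears nowhere else, so it scores
# high there; "goal" appears in two of three docs, so its idf is lower.
print(scores[0]["match"] > scores[0]["goal"])
```

Words that occur in every document get an idf of log(1) = 0, which is exactly the filtering of undiscriminating high-frequency words described above.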
Depending on its source, a text typically carries labels unrelated to its content. These may be markup controlling display appearance; functional symbols such as punctuation marks; other media such as images, sound, or animation; or garbled characters. None of these help classification, so they should be removed.
203, calculating the feature weight corresponding to each feature word;
It should be noted that the main idea of feature-weight calculation is that the importance of a word is proportional to its frequency within a category (representativeness) and inversely proportional to the number of categories in which it appears (discrimination). When feature extraction is performed mathematically, the main factor determining the quality of text feature extraction is the quality of the evaluation function.
204, segmenting the training text with a preset knowledge-based word segmentation model to obtain the segmented training text;
It should be noted that in text information processing, characters, words, or phrases are generally chosen as the feature items of a text. Although phrases carry ample information, they occur rarely in text; using phrases as feature items would make the feature vectors sparse and lose much important information. Therefore, to extract Chinese terms, the Chinese text must undergo fairly complex word segmentation. Here we use a knowledge-based segmentation method, which treats segmentation as a process of knowledge-based reasoning requiring syntactic and semantic analysis; it guides the segmentation algorithm with large amounts of linguistic knowledge and information so that words can be delimited using the information provided by context. The content of a text is mainly expressed by content words such as nouns, verbs, and adjectives; function words and the high-frequency words that recur across all kinds of texts carry no meaning for classification, so such common words can be filtered out.
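The text assumes a knowledge-based segmentation model without fixing one. Forward maximum matching is one classic dictionary-driven sketch of the idea: at each position, take the longest dictionary word that matches. The tiny dictionary below is invented; practical segmenters combine a much richer lexicon with the syntactic and semantic knowledge described above, or use an off-the-shelf tool.

```python
# Forward-maximum-matching segmentation over a toy dictionary.
# Unknown single characters pass through as their own tokens.

DICTIONARY = {"中文", "文本", "文本分类", "分类", "模型", "训练"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def forward_max_match(sentence):
    """Greedy longest-match segmentation against the dictionary."""
    tokens, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first, down to a single character.
        for size in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in DICTIONARY:
                tokens.append(piece)
                i += size
                break
    return tokens

# Longest match prefers "文本分类" over "文本" + "分类".
print(forward_max_match("训练中文文本分类模型"))
```

The greedy rule is exactly where ambiguity arises, which is why the method above supplements the dictionary with contextual knowledge.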
205, inputting the segmented training text into a word2vec model and converting it into a set of word vectors;
It should be noted that the distributed text representation produced by models such as word2vec is an important foundation of deep learning methods. The basic idea of distributed representation is to express each word as an n-dimensional dense, continuous real-valued vector. Its great advantage is its powerful representational capacity: for example, an n-dimensional vector whose dimensions each take k values can represent k^n concepts. Through this word-vector representation, text data change from a high-dimensional, sparse form that neural networks handle poorly into continuous, dense data similar to images and speech, which allows deep learning algorithms to be transferred to the text domain.
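word2vec itself learns the vectors from corpus statistics (gensim's implementation is a common choice); the sketch below only illustrates the property the paragraph describes — each word as a short dense real vector, where geometric closeness can encode relatedness. The 4-dimensional vectors are hand-picked toy values, not trained embeddings.

```python
# Dense word vectors vs. one-hot: related words end up close in the
# vector space, measured here by cosine similarity.
import math

embeddings = {
    "足球": [0.9, 0.8, 0.1, 0.0],   # "football" (toy values)
    "比赛": [0.8, 0.9, 0.2, 0.1],   # "match"    (toy values)
    "股票": [0.1, 0.0, 0.9, 0.8],   # "stock"    (toy values)
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# A text maps to a short dense sequence of such vectors instead of a
# sparse |V|-dimensional one-hot matrix -- the "image-like" input the
# convolutional network consumes in the following steps.
sim_related = cosine(embeddings["足球"], embeddings["比赛"])
sim_unrelated = cosine(embeddings["足球"], embeddings["股票"])
print(sim_related > sim_unrelated)
```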
206, according to the feature weight corresponding to each feature word, increasing the weight that the word vector of that feature word carries within the word-vector set;
It should be noted that once the feature words in the training text have been determined and their feature weights calculated, the weight of each feature word's vector within the word-vector set is increased according to that feature weight before the word-vector set converted from the training text is used for model training, further improving the classification accuracy for the training text.
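Step 206 can be sketched as a scaling of the feature words' vectors. The boost rule used here (multiply by 1 + weight) is an assumed example — the text only states that the feature word's share of the word-vector set is increased, not how.

```python
# Scale the word vectors of TF-IDF feature words before they enter
# the network; non-feature words keep their vectors unchanged.

def boost_feature_vectors(tokens, vectors, feature_weights):
    """Scale each token's vector by (1 + its feature weight), if any."""
    boosted = []
    for token, vec in zip(tokens, vectors):
        w = feature_weights.get(token, 0.0)
        boosted.append([x * (1.0 + w) for x in vec])
    return boosted

tokens = ["足球", "的", "比赛"]
vectors = [[0.5, 0.5], [0.2, 0.2], [0.4, 0.6]]
feature_weights = {"足球": 1.0, "比赛": 0.5}   # from step 203 (toy values)

boosted = boost_feature_vectors(tokens, vectors, feature_weights)
print(boosted)
# "足球" is doubled; "的" has no feature weight and is unchanged.
```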
207, inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function;
It should be noted that on the basis of the word-vector set obtained from the word2vec model, a convolutional neural network is trained as the final text classifier. The main idea of the CNN text classification model is to perform convolution operations on the word-vector input. CNNs were originally used to process image data; unlike image processing, which convolves over a two-dimensional region, text-oriented convolution operates on the words within a fixed sliding window. After convolutional layers, pooling layers, and nonlinear transformation layers, the CNN obtains text feature vectors for classification learning. The advantage of the CNN is that useful word-order information is effectively retained while the text feature vectors are computed. It can automatically learn a multilayer neural network that maps input feature vectors onto the corresponding class labels; by introducing nonlinear activation layers, the model can realize nonlinear discriminative classification. High-quality initial feature representation of the text by the word2vec model is a necessary condition for an effective classification model.
In the convolutional neural network, the loss value of the network obtained from each round of training is computed with a preset cost function.
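The forward pass just described can be pared down to a few functions: a filter slides over a fixed window of word vectors (convolution), max-pooling over time keeps the strongest response per filter, and a softmax over the pooled features yields class probabilities whose cross-entropy plays the role of the preset cost function. All weights below are fixed toy values; a real implementation would learn them by backpropagation in a deep learning framework.

```python
# Pared-down text-CNN forward pass: conv over word windows,
# ReLU, max-pool-over-time, softmax, cross-entropy loss.
import math

def conv1d(word_vectors, filt, window=2):
    """Slide one filter over every window of adjacent word vectors."""
    outs = []
    for i in range(len(word_vectors) - window + 1):
        patch = [x for vec in word_vectors[i:i + window] for x in vec]
        act = sum(w * x for w, x in zip(filt, patch))
        outs.append(max(0.0, act))          # ReLU nonlinearity
    return outs

def forward(word_vectors, filters, window=2):
    """Convolution -> max-pool-over-time -> softmax probabilities."""
    pooled = [max(conv1d(word_vectors, f, window)) for f in filters]
    exps = [math.exp(p) for p in pooled]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(probs, label_index):
    """The loss value compared against the threshold in step 208."""
    return -math.log(probs[label_index])

# Toy input: a 3-word text with 2-dimensional word vectors, and one
# filter per class (a 2-word window => 4 filter weights each).
text = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
filters = [
    [1.0, 0.0, 1.0, 0.0],   # responds to "class 0"-like windows
    [0.0, 1.0, 0.0, 1.0],   # responds to "class 1"-like windows
]
probs = forward(text, filters)
loss = cross_entropy(probs, label_index=0)
print(probs, loss)
```

Because the window spans adjacent words, the filter responses depend on word order — the property the paragraph above credits to the CNN.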
208, judging whether the loss value is less than a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, returning to step 201.
It should be noted that after each round of training, the corresponding loss value of the convolutional neural network is computed and compared against the preset threshold. If the loss value is below the threshold, the convolutional neural network has converged; its parameters are saved, yielding the Chinese text classification model. If not, labeled training text must be obtained again and training repeated.
By using a word2vec model to convert the training text into a set of word vectors, this embodiment of the present invention enables text to be expressed as continuous, dense data similar to images and speech. A convolutional neural network is then used, in a form similar to image processing, to train the network parameters through its convolutional layers, pooling layers, and nonlinear transformations, so that correct classification can be obtained. This solves the technical problem that the feature items produced by traditional text representation methods are mutually independent and the resulting data are sparse, leading to a heavy computational load. Furthermore, by extracting the feature words in the training text and increasing the weight of each feature word's vector within the word-vector set according to its feature weight, this embodiment improves the accuracy of text classification and reduces model training time.
The above describes another embodiment of the training method for a Chinese text classification model provided by the present invention; an embodiment of the training device for a Chinese text classification model is described below.
Referring to Fig. 3, the present invention provides one embodiment of the training device for a Chinese text classification model, including:
an acquiring unit 301 for obtaining labeled training text;
a preprocessing unit 302 for preprocessing the training text to obtain segmented training text;
the preprocessing unit 302 specifically includes:
a feature extraction subunit 3021 for extracting the feature words in the training text by the TF-IDF method and removing meaningless words from the training text;
a feature weight computation subunit 3022 for calculating the feature weight corresponding to each feature word;
a segmentation subunit 3023 for segmenting the training text with a preset knowledge-based word segmentation model to obtain the segmented training text;
a vector conversion unit 303 for inputting the segmented training text into a word2vec model and converting it into a set of word vectors;
a weight boosting unit 304 for increasing, according to the feature weight corresponding to each feature word, the weight that the word vector of that feature word carries within the word-vector set;
a training unit 305 for inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function;
a judging unit 306 for judging whether the loss value is less than a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, jumping back to the acquiring unit 301.
The above describes one embodiment of the training device for a Chinese text classification model provided by the present invention; an embodiment of the classification method for Chinese text provided by the present invention is described below.
The present invention provides a classification method for Chinese text, based on the Chinese text classification model obtained by the training method of either embodiment one or embodiment two, including:
obtaining text to be classified;
inputting the text to be classified into the Chinese text classification model obtained by the training method of either embodiment one or embodiment two, and obtaining the classification result of the text to be classified.
The above describes an embodiment of the method for classifying Chinese text provided by the present invention; an embodiment of the computer-readable storage medium provided by the present invention is described below.
The present invention provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the method of any one of Embodiment One and Embodiment Two.
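The TF-IDF feature-weighting step referenced in the embodiments above (extract feature words by term frequency-inverse document frequency, then boost their word vectors) can be sketched with a plain textbook TF-IDF computation. This is a generic formulation shown only to illustrate the kind of feature weight the embodiments compute; the patent does not specify the exact formula.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document TF-IDF scores over tokenized documents; the highest-scoring
    words would serve as feature words, and their word vectors would be
    up-weighted within the word-vector set."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))  # document frequency
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: (c / len(doc)) * math.log(n / df[w]) for w, c in tf.items()})
    return scores
```

A word that appears in every document gets an IDF of zero, which is how common, uninformative words drop out of the feature-word set.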
Those skilled in the art will clearly appreciate that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Further, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or of other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that they may still modify the technical solutions recorded in the foregoing embodiments, or make equivalent replacements for some of the technical features therein; such modifications or replacements do not depart the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A training method for a Chinese text classification model, characterized by comprising:
S1, obtaining labelled training text;
S2, preprocessing the training text to obtain segmented training text;
S3, inputting the segmented training text into a word2vec model to convert the segmented training text into a word-vector set;
S4, inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing a loss value of the convolutional neural network with a preset cost function;
S5, judging whether the loss value is below a preset threshold; if so, determining that the convolutional neural network has converged, saving the parameters of the convolutional neural network, and generating a trained Chinese text classification model; if not, returning to step S1.
2. The training method for a Chinese text classification model according to claim 1, characterized in that step S2 specifically comprises:
segmenting the training text with a preset knowledge-based word-segmentation model to obtain the segmented training text.
3. The training method for a Chinese text classification model according to claim 2, characterized in that step S2 further comprises:
extracting feature words from the training text by the term frequency-inverse document frequency (TF-IDF) method and removing meaningless words from the training text; and
calculating the feature weight of each feature word.
4. The training method for a Chinese text classification model according to claim 3, characterized in that, after step S3 and before step S4, the method further comprises:
increasing, according to the feature weight of each feature word, the weight that the feature word's word vector carries within the word-vector set.
5. A training device for a Chinese text classification model, characterized by comprising:
an acquiring unit, configured to obtain labelled training text;
a preprocessing unit, configured to preprocess the training text to obtain segmented training text;
a vector conversion unit, configured to input the segmented training text into a word2vec model, converting the segmented training text into a word-vector set;
a training unit, configured to input the word-vector set and the labels of the training text into a convolutional neural network for training, and to compute a loss value of the convolutional neural network with a preset cost function; and
a judging unit, configured to judge whether the loss value is below a preset threshold; if so, to determine that the convolutional neural network has converged, save the parameters of the convolutional neural network, and generate a trained Chinese text classification model; if not, to return control to the acquiring unit.
6. The training device for a Chinese text classification model according to claim 5, characterized in that the preprocessing unit specifically comprises:
a segmentation subunit, configured to segment the training text with a preset knowledge-based word-segmentation model to obtain the segmented training text.
7. The training device for a Chinese text classification model according to claim 6, characterized in that the preprocessing unit further comprises:
a feature extraction subunit, configured to extract feature words from the training text by the term frequency-inverse document frequency (TF-IDF) method and to remove meaningless words from the training text; and
a feature weight computation subunit, configured to calculate the feature weight of each feature word.
8. The training device for a Chinese text classification model according to claim 7, characterized by further comprising:
a weight boosting unit, configured to increase, according to the feature weight of each feature word, the weight that the feature word's word vector carries within the word-vector set.
9. A method for classifying Chinese text, based on a Chinese text classification model obtained by the training method for a Chinese text classification model according to any one of claims 1 to 4, characterized by comprising:
obtaining a text to be classified; and
inputting the text to be classified into the Chinese text classification model obtained by the training method according to any one of claims 1 to 4, to obtain a classification result for the text to be classified.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions which, when executed by a processor, implement the method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810350019.7A CN108573047A (en) | 2018-04-18 | 2018-04-18 | A kind of training method and device of Module of Automatic Chinese Documents Classification |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108573047A true CN108573047A (en) | 2018-09-25 |
Family
ID=63575162
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810350019.7A Pending CN108573047A (en) | 2018-04-18 | 2018-04-18 | A kind of training method and device of Module of Automatic Chinese Documents Classification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108573047A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100583101C (en) * | 2008-06-12 | 2010-01-20 | 昆明理工大学 | Text categorization feature selection and weight computation method based on field knowledge |
US20170308790A1 (en) * | 2016-04-21 | 2017-10-26 | International Business Machines Corporation | Text classification by ranking with convolutional neural networks |
CN107102989A (en) * | 2017-05-24 | 2017-08-29 | 南京大学 | A kind of entity disambiguation method based on term vector, convolutional neural networks |
Non-Patent Citations (1)
Title |
---|
蓝雯飞, 徐蔚, 王涛: "基于卷积神经网络的中文新闻文本分类" ("Chinese news text classification based on convolutional neural networks"), 《中南民族大学学报(自然科学版)》 (Journal of South-Central Minzu University, Natural Science Edition) * |
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109492764A (en) * | 2018-10-24 | 2019-03-19 | 平安科技(深圳)有限公司 | Training method, relevant device and the medium of production confrontation network |
CN109299468A (en) * | 2018-10-25 | 2019-02-01 | 四川长虹电器股份有限公司 | Short text classification method based on conditional entropy and convolutional neural networks |
CN109376244A (en) * | 2018-10-25 | 2019-02-22 | 山东省通信管理局 | A kind of swindle website identification method based on tagsort |
CN109582784A (en) * | 2018-10-26 | 2019-04-05 | 阿里巴巴集团控股有限公司 | File classification method and device |
CN109684476A (en) * | 2018-12-07 | 2019-04-26 | 中科恒运股份有限公司 | A kind of file classification method, document sorting apparatus and terminal device |
CN109684476B (en) * | 2018-12-07 | 2023-10-17 | 中科恒运股份有限公司 | Text classification method, text classification device and terminal equipment |
CN111324831A (en) * | 2018-12-17 | 2020-06-23 | 中国移动通信集团北京有限公司 | Method and device for detecting fraudulent website |
CN109858035A (en) * | 2018-12-29 | 2019-06-07 | 深兰科技(上海)有限公司 | A kind of sensibility classification method, device, electronic equipment and readable storage medium storing program for executing |
CN109815334A (en) * | 2019-01-25 | 2019-05-28 | 武汉斗鱼鱼乐网络科技有限公司 | A kind of barrage file classification method, storage medium, equipment and system |
CN109960726A (en) * | 2019-02-13 | 2019-07-02 | 平安科技(深圳)有限公司 | Textual classification model construction method, device, terminal and storage medium |
CN109960726B (en) * | 2019-02-13 | 2024-01-23 | 平安科技(深圳)有限公司 | Text classification model construction method, device, terminal and storage medium |
CN109948665A (en) * | 2019-02-28 | 2019-06-28 | 中国地质大学(武汉) | Physical activity genre classification methods and system based on long Memory Neural Networks in short-term |
CN109948665B (en) * | 2019-02-28 | 2020-11-27 | 中国地质大学(武汉) | Human activity type classification method and system based on long-time and short-time memory neural network |
CN109933667A (en) * | 2019-03-19 | 2019-06-25 | 中国联合网络通信集团有限公司 | Textual classification model training method, file classification method and equipment |
CN110008342A (en) * | 2019-04-12 | 2019-07-12 | 智慧芽信息科技(苏州)有限公司 | Document classification method, apparatus, equipment and storage medium |
CN110188798A (en) * | 2019-04-28 | 2019-08-30 | 阿里巴巴集团控股有限公司 | A kind of object classification method and model training method and device |
CN110188798B (en) * | 2019-04-28 | 2023-08-08 | 创新先进技术有限公司 | Object classification method and model training method and device |
CN110009064A (en) * | 2019-04-30 | 2019-07-12 | 广东电网有限责任公司 | A kind of semantic model training method and device based on electrical network field |
CN110413773A (en) * | 2019-06-20 | 2019-11-05 | 平安科技(深圳)有限公司 | Intelligent text classification method, device and computer readable storage medium |
WO2020253043A1 (en) * | 2019-06-20 | 2020-12-24 | 平安科技(深圳)有限公司 | Intelligent text classification method and apparatus, and computer-readable storage medium |
CN110413773B (en) * | 2019-06-20 | 2023-09-22 | 平安科技(深圳)有限公司 | Intelligent text classification method, device and computer readable storage medium |
CN110232128A (en) * | 2019-06-21 | 2019-09-13 | 华中师范大学 | Topic file classification method and device |
CN110427480A (en) * | 2019-06-28 | 2019-11-08 | 平安科技(深圳)有限公司 | Personalized text intelligent recommendation method, apparatus and computer readable storage medium |
CN110427480B (en) * | 2019-06-28 | 2022-10-11 | 平安科技(深圳)有限公司 | Intelligent personalized text recommendation method and device and computer readable storage medium |
CN110879832A (en) * | 2019-10-23 | 2020-03-13 | 支付宝(杭州)信息技术有限公司 | Target text detection method, model training method, device and equipment |
CN110909164A (en) * | 2019-11-22 | 2020-03-24 | 科大国创软件股份有限公司 | Text enhancement semantic classification method and system based on convolutional neural network |
CN113010667A (en) * | 2019-12-20 | 2021-06-22 | 王道维 | Training method for machine learning decision model by using natural language corpus |
CN113742479A (en) * | 2020-05-29 | 2021-12-03 | 北京沃东天骏信息技术有限公司 | Method and device for screening target text |
CN112560427B (en) * | 2020-12-16 | 2023-09-22 | 平安银行股份有限公司 | Problem expansion method, device, electronic equipment and medium |
CN112560427A (en) * | 2020-12-16 | 2021-03-26 | 平安银行股份有限公司 | Problem expansion method, device, electronic equipment and medium |
CN112380350B (en) * | 2021-01-14 | 2021-05-07 | 北京育学园健康管理中心有限公司 | Text classification method and device |
CN112380350A (en) * | 2021-01-14 | 2021-02-19 | 北京崔玉涛儿童健康管理中心有限公司 | Text classification method and device |
CN113360657A (en) * | 2021-06-30 | 2021-09-07 | 安徽商信政通信息技术股份有限公司 | Intelligent document distribution and handling method and device and computer equipment |
CN113360657B (en) * | 2021-06-30 | 2023-10-24 | 安徽商信政通信息技术股份有限公司 | Intelligent document distribution handling method and device and computer equipment |
CN116186271A (en) * | 2023-04-19 | 2023-05-30 | 北京亚信数据有限公司 | Medical term classification model training method, classification method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108573047A (en) | A kind of training method and device of Module of Automatic Chinese Documents Classification | |
CN107609121B (en) | News text classification method based on LDA and word2vec algorithm | |
CN110097085B (en) | Lyric text generation method, training method, device, server and storage medium | |
CN102332028B (en) | Webpage-oriented unhealthy Web content identifying method | |
CN109376251A (en) | A kind of microblogging Chinese sentiment dictionary construction method based on term vector learning model | |
CN109977416A (en) | A kind of multi-level natural language anti-spam text method and system | |
CN108363790A (en) | For the method, apparatus, equipment and storage medium to being assessed | |
CN107229610A (en) | The analysis method and device of a kind of affection data | |
CN103577989B (en) | A kind of information classification approach and information classifying system based on product identification | |
CN111709242B (en) | Chinese punctuation mark adding method based on named entity recognition | |
CN108090099B (en) | Text processing method and device | |
CN112434164B (en) | Network public opinion analysis method and system taking topic discovery and emotion analysis into consideration | |
CN110188195A (en) | A kind of text intension recognizing method, device and equipment based on deep learning | |
CN113590764B (en) | Training sample construction method and device, electronic equipment and storage medium | |
CN110287314A (en) | Long text credibility evaluation method and system based on Unsupervised clustering | |
CN112528031A (en) | Work order intelligent distribution method and system | |
CN103678318B (en) | Multi-word unit extraction method and equipment and artificial neural network training method and equipment | |
CN107463703A (en) | English social media account number classification method based on information gain | |
CN107357895A (en) | A kind of processing method of the text representation based on bag of words | |
CN106570170A (en) | Text classification and naming entity recognition integrated method and system based on depth cyclic neural network | |
CN111782793A (en) | Intelligent customer service processing method, system and equipment | |
CN110019776A (en) | Article classification method and device, storage medium | |
CN111475651A (en) | Text classification method, computing device and computer storage medium | |
CN113486143A (en) | User portrait generation method based on multi-level text representation and model fusion | |
CN111191029B (en) | AC construction method based on supervised learning and text classification |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20180925 |