CN108573047A - Training method and device for a Chinese text classification model - Google Patents

Training method and device for a Chinese text classification model Download PDF

Info

Publication number
CN108573047A
CN108573047A (application CN201810350019.7A)
Authority
CN
China
Prior art keywords
training
text
module
training text
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810350019.7A
Other languages
Chinese (zh)
Inventor
刘怡俊
林裕鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN201810350019.7A priority Critical patent/CN108573047A/en
Publication of CN108573047A publication Critical patent/CN108573047A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a training method and device for a Chinese text classification model, addressing the technical problem that the feature items produced by traditional text representation methods are mutually independent and the resulting data are sparse, leading to a heavy computational load. The method includes: S1, obtaining labeled training text; S2, preprocessing the training text to obtain segmented training text; S3, inputting the segmented training text into a word2vec model to convert it into a set of word vectors; S4, inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function; S5, judging whether the loss value is below a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, returning to step S1.

Description

Training method and device for a Chinese text classification model
Technical field
The present invention relates to the field of text classification technology, and in particular to a training method and device for a Chinese text classification model.
Background technology
Since the 1990s, with the spread of the Internet and the continuous improvement of network technology, the Internet has become the world's largest and richest information resource. According to the latest CNNIC statistics as of the end of December 2016, the number of Chinese web pages had reached the hundreds of billions and the number of Chinese Internet users had exceeded 688 million; the Internet has become a basic resource of people's daily lives. The openness of the Internet allows all kinds of information to be published online, but this also results in irregular and redundant information. How to effectively organize and manage massive amounts of unstructured text, and to locate information precisely for users, is a major challenge facing the field of information science and technology today. One successful approach is to classify information automatically according to its content.
Automatic classification technology developed from traditional manual classification of information. As an effective means of information processing, it organizes information according to a taxonomy and largely solves the problem of information clutter. Although traditional manual classification is quite mature, it is clearly unsuited to Internet information that is updated constantly. In the 1980s, knowledge engineering was used to guide text classification: expert knowledge was manually encoded as a set of rules, and documents were classified according to those rules for a given set of categories. Since the 1990s, machine learning has gradually become the mainstream technology for text classification: starting from a set of manually labeled documents, an inductive learning process extracts the category features of interest, and machine learning techniques are then used to build an automatic text classifier. Chinese has more users than any other language in the world, and with the arrival of the information age and the globalization of the knowledge economy, the effectiveness of Chinese text classification has become very important.
In recent years, deep learning models have achieved remarkable results in computer vision and speech recognition. In natural language processing, using neural networks for feature learning and text classification has become a cutting-edge technique. Existing classification methods mainly include rule-based classification models and machine learning models; well-known text classification methods include decision trees (Decision Tree), random forests (Random Forest), Bayesian classifiers (Bayes), linear classifiers (logistic regression), support vector machines (SVM), and maximum entropy classifiers. All of these rely on machine learning, performing text classification through manual feature engineering and shallow classification models.
The task of text classification (Text Classification) is to automatically assign a predefined class label to a document according to its content or topic. Classifying a document generally requires two steps: text representation and classifier learning. How to represent a document as structured data that an algorithm can process is a key link in text classification. Traditional representations are all discrete, such as one-hot encoding, which uses an N-bit status register to encode N states, each state having its own register bit with only one bit active at any time. Although this representation gives each word a unique index, it discards the ordering relations between words in a sentence; moreover, the larger the vocabulary, the longer the codes and the sparser the data. Later came the bag-of-words model (Bag of Words), which represents a document vector simply by summing the word vectors of its words, and the n-gram model, which encodes n adjacent words together so as to capture word order; but the vocabulary dimension grows with the corpus, word sequences grow even faster, and the data sparsity problem remains.
In these traditional text representation methods, the feature items are mutually independent and the data are sparse, resulting in the technical problem of a heavy computational load.
Summary of the invention
The present invention provides a training method and device for a Chinese text classification model, solving the technical problem that the feature items produced by traditional text representation methods are mutually independent and the data are sparse, which results in a heavy computational load.
The present invention provides a training method for a Chinese text classification model, including:
S1, obtaining labeled training text;
S2, preprocessing the training text to obtain segmented training text;
S3, inputting the segmented training text into a word2vec model to convert it into a set of word vectors;
S4, inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function;
S5, judging whether the loss value is below a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, returning to step S1.
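Steps S1 to S5 can be sketched as a simple training loop. The following is a minimal illustration only; every helper function, the sample data, and the threshold value are invented stand-ins, not the patent's implementation:

```python
# Minimal sketch of the S1-S5 training loop described above.
# All helpers are illustrative stubs, not the patent's implementation.

def get_labeled_text():
    # S1: obtain labeled training text (stubbed with a fixed sample)
    return [("体育 新闻 比赛", 0), ("财经 股票 市场", 1)]

def preprocess(texts):
    # S2: word segmentation (the sample is already space-delimited)
    return [(t.split(), y) for t, y in texts]

def to_vectors(tokens):
    # S3: map each token to a dense vector (stand-in for word2vec)
    return [[float(hash(tok) % 7)] for tok in tokens]

def train_step(batch):
    # S4: one CNN training step returning a loss value
    # (stubbed as a loss that halves every iteration)
    train_step.loss *= 0.5
    return train_step.loss
train_step.loss = 1.0

def train(threshold=0.1):
    # S5: repeat until the loss falls below the preset threshold
    while True:
        batch = [(to_vectors(toks), y)
                 for toks, y in preprocess(get_labeled_text())]
        loss = train_step(batch)
        if loss < threshold:
            # network considered converged; parameters would be saved here
            return loss

final_loss = train()
```

With the stubbed halving loss, the loop runs four iterations (0.5, 0.25, 0.125, 0.0625) before the threshold of 0.1 is crossed.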
Optionally, step S2 specifically includes:
segmenting the training text with a preset knowledge-based word segmentation model to obtain the segmented training text.
Optionally, step S2 further includes:
extracting the feature words in the training text by the term frequency-inverse document frequency (TF-IDF) method, and removing meaningless words from the training text;
computing the feature weight corresponding to each feature word.
Optionally, after step S3 and before step S4, the method further includes:
according to the feature weight of each feature word, increasing the weight that the corresponding word vector carries within the word-vector set.
The present invention also provides a training device for a Chinese text classification model, including:
an acquiring unit for obtaining labeled training text;
a preprocessing unit for preprocessing the training text to obtain segmented training text;
a vector conversion unit for inputting the segmented training text into a word2vec model and converting it into a set of word vectors;
a training unit for inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function;
a judging unit for judging whether the loss value is below a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, jumping back to the acquiring unit.
Optionally, the preprocessing unit specifically includes:
a segmentation subunit for segmenting the training text with a preset knowledge-based word segmentation model to obtain the segmented training text.
Optionally, the preprocessing unit further includes:
a feature extraction subunit for extracting the feature words in the training text by the TF-IDF method and removing meaningless words from the training text;
a feature weight computation subunit for computing the feature weight corresponding to each feature word.
Optionally, the training device for a Chinese text classification model provided by the present invention further includes:
a weight adjustment unit for increasing, according to the feature weight of each feature word, the weight that the corresponding word vector carries within the word-vector set.
The present invention also provides a classification method for Chinese text, based on a Chinese text classification model obtained by any of the training methods described above, including:
obtaining a text to be classified;
inputting the text to be classified into the Chinese text classification model obtained by any of the training methods described above, and obtaining the classification result of the text to be classified.
The present invention also provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement any of the methods described above.
As can be seen from the above technical solutions, the present invention has the following advantages:
The present invention provides a training method for a Chinese text classification model, including: S1, obtaining labeled training text; S2, preprocessing the training text to obtain segmented training text; S3, inputting the segmented training text into a word2vec model to convert it into a set of word vectors; S4, inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function; S5, judging whether the loss value is below a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, returning to step S1.
By using a word2vec model to convert the training text into a set of word vectors, the present invention enables text to be represented as continuous, dense data similar to images and speech. A convolutional neural network is then used, in a form similar to image processing, to train the network parameters through its convolutional layers, pooling layers, and nonlinear transformations, so that correct classification can be obtained. This solves the technical problem that the feature items produced by traditional text representation methods are mutually independent and the data are sparse, resulting in a heavy computational load.
Description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of one embodiment of a training method for a Chinese text classification model provided by the present invention;
Fig. 2 is a flow diagram of another embodiment of the training method for a Chinese text classification model provided by the present invention;
Fig. 3 is a structural diagram of one embodiment of a training device for a Chinese text classification model provided by the present invention.
Detailed description of the embodiments
An embodiment of the present invention provides a training method and device for a Chinese text classification model, solving the technical problem that the feature items produced by traditional text representation methods are mutually independent and the data are sparse, resulting in a heavy computational load.
To make the purpose, features, and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the drawings. Obviously, the embodiments described below are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative work fall within the scope of protection of the present invention.
Referring to Fig. 1, the present invention provides one embodiment of a training method for a Chinese text classification model, including:
101, obtaining labeled training text;
102, preprocessing the training text to obtain segmented training text;
103, inputting the segmented training text into a word2vec model and converting it into a set of word vectors;
104, inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function;
105, judging whether the loss value is below a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, returning to step 101.
By using a word2vec model to convert the training text into a set of word vectors, this embodiment enables text to be represented as continuous, dense data similar to images and speech. A convolutional neural network is then used, in a form similar to image processing, to train the network parameters through its convolutional layers, pooling layers, and nonlinear transformations, so that correct classification can be obtained. This solves the technical problem that the feature items produced by traditional text representation methods are mutually independent and the data are sparse, resulting in a heavy computational load.
The above describes one embodiment of the training method for a Chinese text classification model provided by the present invention; another embodiment is described below.
Referring to Fig. 2, the present invention provides another embodiment of the training method for a Chinese text classification model, including:
201, obtaining labeled training text;
It should be noted that, before training, labeled training text — that is, training text whose classification result is known — must first be obtained.
202, extracting the feature words in the training text by the TF-IDF method, and removing meaningless words from the training text;
It should be noted that feature processing extracts the feature words that reflect the topic from the training text and determines their weights; it corresponds to feature word extraction and feature weight computation. Feature word extraction scores and ranks the candidate words independently according to some evaluation index, selects the highest-scoring words, and filters out the rest. The TF-IDF (term frequency-inverse document frequency) algorithm is used here; its idea is that the importance of a word is proportional to its frequency within a category and inversely proportional to the number of categories in which it appears. This filters out words that occur frequently in every document but contribute little discriminative power, so that important text features can be selected.
Depending on its source, a text typically carries labels unrelated to its content. These may be markup that controls display appearance, functional symbols such as punctuation, other media such as images, sound, or animation, or garbled characters. None of these help classification, so they should be removed.
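The TF-IDF scoring described above can be illustrated in a few lines. This is a minimal unsmoothed sketch (the sample documents are invented); production systems typically use a library implementation with smoothing:

```python
import math

def tf_idf(docs):
    """Compute raw TF-IDF weights for each word in each document.

    A minimal, unsmoothed illustration of the scoring idea: weight is
    proportional to in-document frequency and inversely proportional
    (via log) to the number of documents containing the word.
    """
    n = len(docs)
    df = {}  # document frequency of each word
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    weights = []
    for doc in docs:
        tf = {w: doc.count(w) / len(doc) for w in set(doc)}
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights

docs = [["股票", "市场", "股票"], ["比赛", "市场"], ["股票", "比赛"]]
w = tf_idf(docs)
```

In the first document, "股票" occurs twice while "市场" occurs once and both appear in two of the three documents, so "股票" receives the higher weight — exactly the "representative and discriminative" behavior the method aims for.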
203, computing the feature weight corresponding to each feature word;
It should be noted that the main idea of feature weight computation is that the importance of a word is proportional to its frequency within a category (representativeness) and inversely proportional to the number of categories in which it appears (discrimination). When feature extraction is performed mathematically, the main factor determining its effectiveness is the quality of the evaluation function.
204, segmenting the training text with a preset knowledge-based word segmentation model to obtain segmented training text;
It should be noted that in text information processing, words, terms, or phrases are usually chosen as the feature items of a text. Although a phrase carries more information, phrases occur rarely in text, so using phrases as feature items makes the feature vectors sparse and loses much important information. Therefore, to extract Chinese terms, relatively complex word segmentation of the Chinese text is required. A knowledge-based segmentation method is used here; it treats segmentation as a knowledge reasoning process requiring syntactic and semantic analysis, so the algorithm must be guided by a large amount of linguistic knowledge and information, allowing it to disambiguate words using the information provided by context. The content of a text is mainly conveyed by content words such as nouns, verbs, and adjectives; function words and the high-frequency words that appear in all kinds of text are meaningless for classification, so these common words can be filtered out.
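As a rough illustration of dictionary-driven Chinese segmentation, the sketch below uses forward maximum matching — a much simpler method than the knowledge-based model described above, which additionally applies syntactic and semantic analysis. The vocabulary is invented:

```python
def fmm_segment(text, vocab, max_len=4):
    """Forward maximum matching: at each position, take the longest
    dictionary word that matches, falling back to a single character.

    A simple dictionary-based stand-in for the knowledge-based
    segmentation model described in the text.
    """
    result, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                result.append(piece)
                i += size
                break
    return result

vocab = {"训练", "文本", "分类", "文本分类"}
tokens = fmm_segment("训练文本分类", vocab)
```

Because matching is greedy from the left, "文本分类" is taken as one word rather than split into "文本" and "分类" — the kind of ambiguity a knowledge-based model resolves with context rather than by word length alone.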
205, inputting the segmented training text into a word2vec model and converting it into a set of word vectors;
It should be noted that the distributed text representation produced by word2vec-style models is an important foundation of deep learning methods. The basic idea of distributed representation is to represent each word as an n-dimensional dense, continuous real-valued vector; its great advantage is its powerful representational capacity — an n-dimensional vector with k values per dimension can represent k^n concepts. With this word-vector representation, text data changes from a high-dimensional sparse form that neural networks handle poorly into continuous dense data similar to images and speech, allowing deep learning algorithms to be transferred to the text domain.
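The conversion of step 205 amounts to looking each token up in a trained embedding table. The tiny table and three-dimensional vectors below are invented for illustration; in practice the vectors come from a word2vec model trained on the corpus:

```python
# Toy embedding table standing in for a trained word2vec model.
EMBEDDINGS = {
    "股票": [0.9, 0.1, 0.0],
    "市场": [0.8, 0.2, 0.1],
    "比赛": [0.0, 0.9, 0.3],
}
UNK = [0.0, 0.0, 0.0]  # vector for out-of-vocabulary tokens

def text_to_vectors(tokens):
    """Convert a segmented text into a sequence of dense word vectors."""
    return [EMBEDDINGS.get(tok, UNK) for tok in tokens]

matrix = text_to_vectors(["股票", "市场", "未知"])
```

The result is a dense real-valued matrix (one row per word), which is exactly the image-like input the convolutional network of step 207 operates on.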
206, according to the feature weight of each feature word, increasing the weight that the corresponding word vector carries within the word-vector set;
It should be noted that after the feature words in the training text have been determined and their feature weights computed, and before the word-vector set converted from the training text is used for model training, the weight of each feature word's word vector within the word-vector set is increased according to its feature weight, to further improve the classification accuracy on the training text.
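One plausible reading of step 206 is to scale each feature word's vector by its feature weight; the patent does not specify the exact combination, so the sketch below is an assumption:

```python
def reweight(tokens, vectors, feature_weights):
    """Scale each feature word's vector by its feature weight.

    Non-feature words keep a neutral weight of 1.0. This is one
    assumed interpretation of 'increasing the weight of the
    corresponding word vector within the word-vector set'.
    """
    out = []
    for tok, vec in zip(tokens, vectors):
        w = feature_weights.get(tok, 1.0)
        out.append([w * x for x in vec])
    return out

tokens = ["股票", "的"]
vectors = [[1.0, 2.0], [1.0, 1.0]]
weighted = reweight(tokens, vectors, {"股票": 2.0})
```

The feature word "股票" is amplified while the function word "的" is left unchanged, so the network's convolution responses are biased toward the topic-bearing words.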
207, inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function;
It should be noted that on the basis of the word-vector set obtained from the word2vec model, a convolutional neural network (CNN) is trained as the final text classifier. The main idea of the CNN text classification model is to perform convolution operations on input text in word-vector form. CNNs were originally used to process image data; unlike image processing, which convolves over a two-dimensional region, text-oriented convolution is performed over the terms in a fixed sliding window. After the convolutional layers, pooling layers, and nonlinear transformation layers, the CNN obtains a text feature vector for classification learning. The advantage of the CNN is that useful word-order information is effectively retained while computing the text feature vector. It automatically learns a multilayer neural network that maps input feature vectors to the corresponding class labels, and by introducing nonlinear activation layers the model can realize a nonlinear discriminative classifier. The high-quality initial feature representation of text provided by the word2vec model is a necessary condition for an effective classification model.
The loss value of the convolutional neural network obtained from each round of training is computed with a preset cost function.
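A single convolution filter with max-over-time pooling — the core operation described above — can be sketched in plain Python. Real CNN text classifiers use many learned filters, several window sizes, a nonlinearity, and a softmax output layer; everything below is a hand-built illustration:

```python
def conv_max_pool(matrix, kernel, window=2):
    """One convolution filter over a word-vector sequence, followed by
    max-over-time pooling.

    `matrix` is the word-vector sequence (one row per word); `kernel`
    has window * dim weights. A minimal single-filter sketch.
    """
    feats = []
    for i in range(len(matrix) - window + 1):
        # dot product of the kernel with the flattened window of word vectors
        region = [x for vec in matrix[i:i + window] for x in vec]
        feats.append(sum(k * x for k, x in zip(kernel, region)))
    return max(feats)  # max pooling keeps the strongest n-gram response

matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 words, 2-dim vectors
kernel = [1.0, 0.0, 0.0, 1.0]                  # window=2, dim=2 -> 4 weights
feature = conv_max_pool(matrix, kernel)
```

Sliding the window over adjacent word pairs is how the filter retains local word-order information, and the max pool reduces the variable-length text to one fixed feature per filter.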
208, judging whether the loss value is below a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, returning to step 201.
It should be noted that after each round of training, the corresponding loss value is computed and compared with the preset threshold. If the loss value is below the threshold, the convolutional neural network has converged; its parameters are saved and the Chinese text classification model can be generated. If not, new labeled training text must be obtained and training repeated.
By using a word2vec model to convert the training text into a set of word vectors, this embodiment enables text to be represented as continuous, dense data similar to images and speech; a convolutional neural network is then used, in a form similar to image processing, to train the network parameters through its convolutional layers, pooling layers, and nonlinear transformations, so that correct classification can be obtained. This solves the technical problem that the feature items produced by traditional text representation methods are mutually independent and the data are sparse, resulting in a heavy computational load. Furthermore, by extracting the feature words in the training text and increasing the weight of the corresponding word vectors in the word-vector set according to their feature weights, this embodiment improves the accuracy of text classification and reduces the model training time.
The above describes another embodiment of the training method for a Chinese text classification model provided by the present invention; an embodiment of the training device is described below.
Referring to Fig. 3, the present invention provides one embodiment of a training device for a Chinese text classification model, including:
an acquiring unit 301 for obtaining labeled training text;
a preprocessing unit 302 for preprocessing the training text to obtain segmented training text;
the preprocessing unit 302 specifically includes:
a feature extraction subunit 3021 for extracting the feature words in the training text by the TF-IDF method and removing meaningless words from the training text;
a feature weight computation subunit 3022 for computing the feature weight corresponding to each feature word;
a segmentation subunit 3023 for segmenting the training text with a preset knowledge-based word segmentation model to obtain the segmented training text;
a vector conversion unit 303 for inputting the segmented training text into a word2vec model and converting it into a set of word vectors;
a weight adjustment unit 304 for increasing, according to the feature weight of each feature word, the weight that the corresponding word vector carries within the word-vector set;
a training unit 305 for inputting the word-vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function;
a judging unit 306 for judging whether the loss value is below a preset threshold; if so, determining that the convolutional neural network has converged, saving its parameters, and generating the trained Chinese text classification model; if not, jumping back to the acquiring unit 301.
The above describes an embodiment of the training apparatus for a Chinese text classification model provided by the present invention. An embodiment of the method for classifying Chinese text provided by the present invention is described below.
The present invention provides a method for classifying Chinese text, based on a Chinese text classification model obtained by the training method of any one of Embodiment 1 and Embodiment 2, comprising:
obtaining text to be classified;
inputting the text to be classified into the Chinese text classification model obtained by the training method of any one of Embodiment 1 and Embodiment 2, to obtain the classification result of the text to be classified.
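The two classification steps can be read as a small pipeline: segment the input, convert it to vectors, score it with the trained network, and take the highest-scoring label. The sketch below is a hypothetical skeleton; `segment`, `to_vectors`, and `model` are stand-ins for the segmentation model, the word2vec conversion, and the trained convolutional network, none of which the patent pins down concretely.

```python
def classify(text, segment, to_vectors, model, labels):
    """Run one piece of text through the trained pipeline and return its label."""
    tokens = segment(text)              # word segmentation
    vectors = to_vectors(tokens)        # word2vec conversion
    scores = model(vectors)             # one score per candidate class
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best]

# toy stand-ins, for demonstration only
label = classify(
    "体育新闻",
    segment=lambda t: list(t),
    to_vectors=lambda toks: [[1.0]] * len(toks),
    model=lambda vecs: [0.2, 0.7, 0.1],
    labels=["finance", "sports", "tech"],
)
```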
The above describes an embodiment of the method for classifying Chinese text provided by the present invention. An embodiment of the computer-readable storage medium provided by the present invention is described below.
The present invention provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the method of any one of Embodiment 1 and Embodiment 2.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that modifications may still be made to the technical solutions recorded in the foregoing embodiments, or equivalent replacements may be made to some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A training method for a Chinese text classification model, characterized by comprising:
S1, obtaining labeled training text;
S2, preprocessing the training text to obtain segmented training text;
S3, inputting the segmented training text into a word2vec model and converting the segmented training text into a set of word vectors;
S4, inputting the word vector set and the labels of the training text into a convolutional neural network for training, and computing the loss value of the convolutional neural network with a preset cost function;
S5, judging whether the loss value is less than a preset threshold; if so, determining that the convolutional neural network has converged, saving the parameters of the convolutional neural network, and generating the trained Chinese text classification model; if not, returning to step S1.
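Steps S1 to S5 amount to a loop that repeats until the loss drops below the preset threshold. The sketch below shows only that control flow; `get_labeled_batch` and `train_step` are placeholder names standing in for the data source and the convolutional network update, neither of which is specified here.

```python
def train_until_converged(get_labeled_batch, train_step, threshold, max_iters=1000):
    """Loop over steps S1-S5: fetch labeled text, run one training pass,
    and stop once the loss value falls below the preset threshold."""
    for _ in range(max_iters):
        batch = get_labeled_batch()      # S1: obtain labeled training text
        loss = train_step(batch)         # S2-S4: preprocess, embed, train, compute loss
        if loss < threshold:             # S5: convergence test
            return loss                  # converged; caller saves network parameters
    raise RuntimeError("did not converge within max_iters")

# stub: a "loss" sequence standing in for real CNN training
losses = iter([0.9, 0.45, 0.2, 0.08])
final = train_until_converged(lambda: None, lambda batch: next(losses), threshold=0.1)
```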
2. The training method for a Chinese text classification model according to claim 1, characterized in that step S2 specifically comprises:
segmenting the training text with a preset knowledge-based word segmentation model to obtain the segmented training text.
3. The training method for a Chinese text classification model according to claim 2, characterized in that step S2 further comprises:
extracting feature words from the training text by the term frequency-inverse document frequency method, and removing meaningless words from the training text;
computing the feature weight corresponding to each feature word.
4. The training method for a Chinese text classification model according to claim 3, characterized in that, after step S3 and before step S4, the method further comprises:
increasing, according to the feature weight corresponding to each feature word, the weight of the word vector corresponding to that feature word within the word vector set.
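Claim 4 increases the weight of a feature word's vector within the vector set but does not state the scaling rule. One plausible reading, shown below as an assumption, multiplies each feature word's vector by (1 + alpha * weight) and leaves non-feature words untouched; `boost_feature_vectors` and `alpha` are names introduced here for illustration.

```python
def boost_feature_vectors(vectors, tokens, feature_weights, alpha=1.0):
    """Scale the word vector of each feature word by (1 + alpha * weight),
    so feature words carry more influence in the vector set.

    vectors: one word vector (list of floats) per token
    tokens: the segmented words, aligned with `vectors`
    feature_weights: TF-IDF weight per feature word; others default to 0
    """
    boosted = []
    for vec, tok in zip(vectors, tokens):
        w = feature_weights.get(tok, 0.0)
        boosted.append([x * (1.0 + alpha * w) for x in vec])
    return boosted

vecs = boost_feature_vectors(
    vectors=[[1.0, 2.0], [1.0, 2.0]],
    tokens=["分类", "的"],
    feature_weights={"分类": 0.5},  # "的" is not a feature word
)
```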
5. A training apparatus for a Chinese text classification model, characterized by comprising:
an acquiring unit, configured to obtain labeled training text;
a preprocessing unit, configured to preprocess the training text to obtain segmented training text;
a vector conversion unit, configured to input the segmented training text into a word2vec model and convert the segmented training text into a set of word vectors;
a training unit, configured to input the word vector set and the labels of the training text into a convolutional neural network for training, and to compute the loss value of the convolutional neural network with a preset cost function;
a judging unit, configured to judge whether the loss value is less than a preset threshold; if so, to determine that the convolutional neural network has converged, save the parameters of the convolutional neural network, and generate the trained Chinese text classification model; if not, control returns to the acquiring unit.
6. The training apparatus for a Chinese text classification model according to claim 5, characterized in that the preprocessing unit specifically comprises:
a segmentation subunit, configured to segment the training text with a preset knowledge-based word segmentation model to obtain the segmented training text.
7. The training apparatus for a Chinese text classification model according to claim 6, characterized in that the preprocessing unit further comprises:
a feature extraction subunit, configured to extract feature words from the training text by the term frequency-inverse document frequency method, and to remove meaningless words from the training text;
a feature weight computation subunit, configured to compute the feature weight corresponding to each feature word.
8. The training apparatus for a Chinese text classification model according to claim 7, characterized by further comprising:
a weight enhancement unit, configured to increase, according to the feature weight corresponding to each feature word, the weight of the word vector corresponding to that feature word within the word vector set.
9. A method for classifying Chinese text, based on a Chinese text classification model obtained by the training method for a Chinese text classification model according to any one of claims 1 to 4, characterized by comprising:
obtaining text to be classified;
inputting the text to be classified into the Chinese text classification model obtained by the training method for a Chinese text classification model according to any one of claims 1 to 4, to obtain the classification result of the text to be classified.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions which, when executed by a processor, implement the method according to any one of claims 1 to 4.
CN201810350019.7A 2018-04-18 2018-04-18 A kind of training method and device of Module of Automatic Chinese Documents Classification Pending CN108573047A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810350019.7A CN108573047A (en) 2018-04-18 2018-04-18 A kind of training method and device of Module of Automatic Chinese Documents Classification


Publications (1)

Publication Number Publication Date
CN108573047A true CN108573047A (en) 2018-09-25

Family

ID=63575162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810350019.7A Pending CN108573047A (en) 2018-04-18 2018-04-18 A kind of training method and device of Module of Automatic Chinese Documents Classification

Country Status (1)

Country Link
CN (1) CN108573047A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100583101C (en) * 2008-06-12 2010-01-20 昆明理工大学 Text categorization feature selection and weight computation method based on field knowledge
CN107102989A (en) * 2017-05-24 2017-08-29 南京大学 A kind of entity disambiguation method based on term vector, convolutional neural networks
US20170308790A1 (en) * 2016-04-21 2017-10-26 International Business Machines Corporation Text classification by ranking with convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
蓝雯飞, 徐蔚, 王涛: "基于卷积神经网络的中文新闻文本分类", 《中南民族大学学报(自然科学版)》 *

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492764A (en) * 2018-10-24 2019-03-19 平安科技(深圳)有限公司 Training method, relevant device and the medium of production confrontation network
CN109299468A (en) * 2018-10-25 2019-02-01 四川长虹电器股份有限公司 Short text classification method based on conditional entropy and convolutional neural networks
CN109376244A (en) * 2018-10-25 2019-02-22 山东省通信管理局 A kind of swindle website identification method based on tagsort
CN109582784A (en) * 2018-10-26 2019-04-05 阿里巴巴集团控股有限公司 File classification method and device
CN109684476A (en) * 2018-12-07 2019-04-26 中科恒运股份有限公司 A kind of file classification method, document sorting apparatus and terminal device
CN109684476B (en) * 2018-12-07 2023-10-17 中科恒运股份有限公司 Text classification method, text classification device and terminal equipment
CN111324831A (en) * 2018-12-17 2020-06-23 中国移动通信集团北京有限公司 Method and device for detecting fraudulent website
CN109858035A (en) * 2018-12-29 2019-06-07 深兰科技(上海)有限公司 A kind of sensibility classification method, device, electronic equipment and readable storage medium storing program for executing
CN109815334A (en) * 2019-01-25 2019-05-28 武汉斗鱼鱼乐网络科技有限公司 A kind of barrage file classification method, storage medium, equipment and system
CN109960726A (en) * 2019-02-13 2019-07-02 平安科技(深圳)有限公司 Textual classification model construction method, device, terminal and storage medium
CN109960726B (en) * 2019-02-13 2024-01-23 平安科技(深圳)有限公司 Text classification model construction method, device, terminal and storage medium
CN109948665A (en) * 2019-02-28 2019-06-28 中国地质大学(武汉) Physical activity genre classification methods and system based on long Memory Neural Networks in short-term
CN109948665B (en) * 2019-02-28 2020-11-27 中国地质大学(武汉) Human activity type classification method and system based on long-time and short-time memory neural network
CN109933667A (en) * 2019-03-19 2019-06-25 中国联合网络通信集团有限公司 Textual classification model training method, file classification method and equipment
CN110008342A (en) * 2019-04-12 2019-07-12 智慧芽信息科技(苏州)有限公司 Document classification method, apparatus, equipment and storage medium
CN110188798A (en) * 2019-04-28 2019-08-30 阿里巴巴集团控股有限公司 A kind of object classification method and model training method and device
CN110188798B (en) * 2019-04-28 2023-08-08 创新先进技术有限公司 Object classification method and model training method and device
CN110009064A (en) * 2019-04-30 2019-07-12 广东电网有限责任公司 A kind of semantic model training method and device based on electrical network field
CN110413773A (en) * 2019-06-20 2019-11-05 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium
WO2020253043A1 (en) * 2019-06-20 2020-12-24 平安科技(深圳)有限公司 Intelligent text classification method and apparatus, and computer-readable storage medium
CN110413773B (en) * 2019-06-20 2023-09-22 平安科技(深圳)有限公司 Intelligent text classification method, device and computer readable storage medium
CN110232128A (en) * 2019-06-21 2019-09-13 华中师范大学 Topic file classification method and device
CN110427480A (en) * 2019-06-28 2019-11-08 平安科技(深圳)有限公司 Personalized text intelligent recommendation method, apparatus and computer readable storage medium
CN110427480B (en) * 2019-06-28 2022-10-11 平安科技(深圳)有限公司 Intelligent personalized text recommendation method and device and computer readable storage medium
CN110879832A (en) * 2019-10-23 2020-03-13 支付宝(杭州)信息技术有限公司 Target text detection method, model training method, device and equipment
CN110909164A (en) * 2019-11-22 2020-03-24 科大国创软件股份有限公司 Text enhancement semantic classification method and system based on convolutional neural network
CN113010667A (en) * 2019-12-20 2021-06-22 王道维 Training method for machine learning decision model by using natural language corpus
CN113742479A (en) * 2020-05-29 2021-12-03 北京沃东天骏信息技术有限公司 Method and device for screening target text
CN112560427B (en) * 2020-12-16 2023-09-22 平安银行股份有限公司 Problem expansion method, device, electronic equipment and medium
CN112560427A (en) * 2020-12-16 2021-03-26 平安银行股份有限公司 Problem expansion method, device, electronic equipment and medium
CN112380350B (en) * 2021-01-14 2021-05-07 北京育学园健康管理中心有限公司 Text classification method and device
CN112380350A (en) * 2021-01-14 2021-02-19 北京崔玉涛儿童健康管理中心有限公司 Text classification method and device
CN113360657A (en) * 2021-06-30 2021-09-07 安徽商信政通信息技术股份有限公司 Intelligent document distribution and handling method and device and computer equipment
CN113360657B (en) * 2021-06-30 2023-10-24 安徽商信政通信息技术股份有限公司 Intelligent document distribution handling method and device and computer equipment
CN116186271A (en) * 2023-04-19 2023-05-30 北京亚信数据有限公司 Medical term classification model training method, classification method and device

Similar Documents

Publication Publication Date Title
CN108573047A (en) A kind of training method and device of Module of Automatic Chinese Documents Classification
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN110097085B (en) Lyric text generation method, training method, device, server and storage medium
CN102332028B (en) Webpage-oriented unhealthy Web content identifying method
CN109376251A (en) A kind of microblogging Chinese sentiment dictionary construction method based on term vector learning model
CN109977416A (en) A kind of multi-level natural language anti-spam text method and system
CN108363790A (en) For the method, apparatus, equipment and storage medium to being assessed
CN107229610A (en) The analysis method and device of a kind of affection data
CN103577989B (en) A kind of information classification approach and information classifying system based on product identification
CN111709242B (en) Chinese punctuation mark adding method based on named entity recognition
CN108090099B (en) Text processing method and device
CN112434164B (en) Network public opinion analysis method and system taking topic discovery and emotion analysis into consideration
CN110188195A (en) A kind of text intension recognizing method, device and equipment based on deep learning
CN113590764B (en) Training sample construction method and device, electronic equipment and storage medium
CN110287314A (en) Long text credibility evaluation method and system based on Unsupervised clustering
CN112528031A (en) Work order intelligent distribution method and system
CN103678318B (en) Multi-word unit extraction method and equipment and artificial neural network training method and equipment
CN107463703A (en) English social media account number classification method based on information gain
CN107357895A (en) A kind of processing method of the text representation based on bag of words
CN106570170A (en) Text classification and naming entity recognition integrated method and system based on depth cyclic neural network
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN110019776A (en) Article classification method and device, storage medium
CN111475651A (en) Text classification method, computing device and computer storage medium
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN111191029B (en) AC construction method based on supervised learning and text classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180925