CN109325122A - Vocabulary generation method, text classification method, device, equipment and storage medium - Google Patents
Vocabulary generation method, text classification method, device, equipment and storage medium
Info
- Publication number
- CN109325122A (application CN201811080887.4A)
- Authority
- CN
- China
- Prior art keywords
- vocabulary
- sample
- algorithm
- word
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a vocabulary generation method, a text classification method, and a corresponding device, equipment, and storage medium. The generation method includes: obtaining multiple training samples, each training sample including text content and a text label; performing data processing on the multiple training samples; obtaining a preset number of iterations for the Labeled-LDA algorithm; according to the preset number of iterations, iteratively training the data-processed training samples with the Labeled-LDA algorithm to generate a vocabulary for the text label; and calculating the weight of every word in the vocabulary of the text label, gathering all words whose weight share exceeds a first preset value into a first vocabulary for the text label, and gathering all words whose weight share falls below a second preset value into a second vocabulary for the text label. Implementing this scheme addresses the low classification accuracy and low classification efficiency of existing text classification methods.
Description
Technical field
The present invention relates to the field of text classification technology, and more particularly to a vocabulary generation method, a text classification method, and a corresponding device, equipment, and storage medium.
Background technique
Text classification has been widely applied in fields such as search engines, personalized recommendation systems, and public opinion monitoring, and is a key link in efficiently managing and accurately locating massive amounts of information. However, the field of text classification currently faces two problems that have not yet been solved. First, most text classification algorithm models with high accuracy are black-box models, for example, neural network algorithms and support vector machine algorithms. Although neural network algorithms are known to achieve very high accuracy in text classification, their complex structure and unexplainable internals mean that those skilled in the art cannot know how that high accuracy is obtained. Because of this black-box effect, technicians must perform a large amount of labeling work when using a neural network algorithm model for text classification. A person of ordinary skill can establish a set of rules from significant label-related wording and filter out qualified samples by those rules for training, but such a method becomes harder to sustain over time: for the documents hit by the rules, technicians find it difficult to search and extract them from a massive sample pool, and a great deal of time is consumed on checking. If the bulk of the labeling is done manually, it requires substantial human and material resources and greatly increases the cost of text classification; if the labeling is done automatically by an algorithm model, a second problem easily arises, namely that the reliability of the labels is hard to guarantee. At present, automatic labeling mainly relies on unsupervised algorithm models. Although technicians can use an unsupervised model to label samples, the reliability of those labels cannot be guaranteed; applying samples carrying unreliable labels to the training of a text classification algorithm model easily confuses the model's results and degrades its accuracy. In addition, because the feature dimensionality of text is so large, the effectiveness of text classification cannot be guaranteed.
Summary of the invention
Embodiments of the present invention provide a vocabulary generation method, a text classification method, and a corresponding device, equipment, and storage medium, intended to improve the accuracy and efficiency of text classification.
To solve the above technical problem, in a first aspect, an embodiment of the invention provides a vocabulary generation method, including: obtaining multiple training samples, each training sample including text content and a text label, where the multiple training samples share the same text label; performing data processing on the multiple training samples; according to a preset number of iterations, iteratively training the data-processed training samples with the Labeled-LDA algorithm to generate a vocabulary for the text label; and calculating the weight of every word in the vocabulary of the text label, gathering all words whose weight share exceeds a first preset value into a first vocabulary for the text label, and gathering all words whose weight share falls below a second preset value into a second vocabulary for the text label.
In a second aspect, an embodiment of the invention further provides a text classification method, including: obtaining a sample to be classified, the sample including text content and a text label, where the text label of the sample to be classified is identical to the text label of the first aspect; performing data processing on the sample to be classified; obtaining the first vocabulary and the second vocabulary generated by the method of the first aspect; matching every word in the data-processed sample one by one against the first vocabulary and the second vocabulary to obtain a matching result for each word; performing secondary data processing on the sample according to the matching result; and training on the secondarily processed sample with a text classification algorithm so as to classify the sample.
In a third aspect, an embodiment of the invention further provides a device comprising units for executing the method of the first or second aspect.
In a fourth aspect, an embodiment of the invention further provides a computer equipment comprising a memory and a processor, the memory storing a computer program, where the processor implements the method of the first or second aspect when executing the computer program.
In a fifth aspect, an embodiment of the invention further provides a storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, implement the method of the first or second aspect.
Embodiments of the invention provide a vocabulary generation method, a text classification method, and a corresponding device, equipment, and storage medium. To solve the low efficiency and high cost of text classification caused by the large amount of labeling work required during data preprocessing, which stems from the black-box effect of text classification algorithms, and to fully improve the efficiency of the labeling work, the embodiments introduce the Labeled-Latent Dirichlet Allocation (Labeled-LDA) algorithm into the data preprocessing that precedes classifying the documents to be classified, where the Labeled-LDA algorithm adds a label layer on top of the LDA algorithm. Implementing the embodiments of the invention can effectively improve the efficiency of text labeling, and by applying data processing to the noise words in a text, the influence of noise words on text classification can be effectively excluded and the accuracy of text classification effectively improved.
Detailed description of the invention
To explain the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the vocabulary generation method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of the text classification method provided by an embodiment of the present invention;
Fig. 3 is a schematic block diagram of a device provided by an embodiment of the present invention;
Fig. 4 is a schematic block diagram of another device provided by an embodiment of the present invention; and
Fig. 5 is a schematic block diagram of a computer equipment provided by an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the protection scope of the invention.
It should be understood that when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terminology used in this description of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the description of the invention and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in the description of the invention and the appended claims refers to, and includes, any and all possible combinations of one or more of the associated listed items.
Referring to Fig. 1, which is a schematic flowchart of a vocabulary generation method provided by an embodiment of the present invention. The vocabulary generation method applies to scenarios that require text classification, such as search engines, personalized recommendation systems, and public opinion monitoring. As shown in the figure, the method may include steps S110 to S150.
S110: Obtain multiple training samples, each training sample including text content and a text label, where the multiple training samples share the same text label.
Specifically, before the step of obtaining multiple training samples, data preprocessing must be performed on the training samples; that is, the labeling work must be carried out on the multiple training samples to obtain multiple training samples carrying text labels. Each training sample may be a document, a passage of text, or several documents on the same topic.
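As an illustrative aside (not part of the patent text), the labeled training sample described in step S110 can be sketched as a simple record pairing text content with its text label; the field names here are assumptions chosen for demonstration:

```python
# Minimal sketch of a labeled training sample; the field names
# ("text", "label") are illustrative assumptions, not the patent's terms.
from dataclasses import dataclass

@dataclass
class TrainingSample:
    text: str   # text content: a document, a passage, or several same-topic documents
    label: str  # the text label shared by this training set

samples = [
    TrainingSample(text="gold prices rose on market fears", label="finance"),
    TrainingSample(text="central bank adjusts interest rates", label="finance"),
]

# All samples in one training set share a single text label, as in S110.
assert all(s.label == "finance" for s in samples)
```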
S120: Perform data processing on the multiple training samples.
Specifically, the step of performing data processing on the multiple training samples includes the following step A and step B:
Step A: Perform word segmentation on the multiple training samples using a segmentation algorithm.
Specifically, the segmentation algorithm cuts the text content of each training sample into n words with no ordering relation between them; for example, a segmentation algorithm may cut the short sentence "我去上学" ("I go to school") into the four words "我", "去", "上", "学". The segmentation algorithms include string-matching-based segmentation, understanding-based segmentation, and statistics-based segmentation. String-matching-based segmentation, also known as mechanical segmentation, matches the Chinese character string under analysis against the entries of a "sufficiently large" machine dictionary according to certain rules; if a string identical to the character string is found in the dictionary, the match succeeds. In one embodiment, string-matching segmentation algorithms can be divided by scanning direction into forward matching algorithms and reverse matching algorithms; in another embodiment, they can be divided by length-priority rule into maximum (longest) matching algorithms and minimum (shortest) matching algorithms; in still other embodiments, according to whether they are combined with part-of-speech tagging, they can be divided into simple segmentation algorithms and integrated algorithms that combine segmentation with tagging. Understanding-based segmentation makes the computer simulate a person's understanding of a sentence to identify words and thereby segment the document's text content: syntactic and semantic analysis is carried out alongside segmentation, and syntactic and semantic information is used to resolve ambiguity. An understanding-based segmenter generally consists of three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Under the coordination of the master control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences to judge segmentation ambiguities, i.e., it simulates a person's process of understanding a sentence. Statistics-based segmentation counts the frequency of each adjacent character combination co-occurring in the sample corpus and computes their mutual information, which reflects how tightly the characters combine; when the tightness exceeds some threshold, the character group may be considered to constitute a word. Formally, a word is a stable combination of characters: the more often adjacent characters appear together in context, the more likely they form a word, so the frequency or probability of adjacent character co-occurrence reflects well the credibility of forming a word.
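As a concrete illustration of the string-matching (mechanical) family described above, the sketch below implements forward maximum matching against a toy dictionary; the dictionary contents are an assumption for demonstration, not part of the patent:

```python
# A minimal forward-maximum-matching segmenter, illustrating the
# string-matching segmentation family. The tiny dictionary below is an
# illustrative assumption; real systems use a "sufficiently large" lexicon.
def forward_max_match(text, dictionary, max_len=4):
    """Greedily match the longest dictionary entry at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try candidates from longest to shortest at position i.
        for j in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + j]
            if j == 1 or cand in dictionary:
                words.append(cand)  # unmatched single chars pass through as-is
                i += j
                break
    return words

dictionary = {"我", "去", "上学"}  # toy lexicon
print(forward_max_match("我去上学", dictionary))  # ['我', '去', '上学']
```

With an empty dictionary the same routine degrades to character-level cutting, which mirrors the four-token example in the text.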
Step B: Remove the noise words from the multiple training samples after word segmentation.
Specifically, the noise words include stop words and common words. The stop words include English characters, digits, mathematical symbols, punctuation marks, extremely frequent single Chinese characters, and determiners; the common words include proper nouns and everyday words from various technical fields. In one embodiment, a preset noise vocabulary can be obtained, and the noise vocabulary is matched one by one against all words in the segmented training samples; on a successful match, the word is removed from the multiple training samples.
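The noise-removal step above can be sketched as a set-membership filter; the noise list here is an illustrative assumption standing in for the preset noise vocabulary:

```python
# Sketch of step B: removing noise words (stop words and common words)
# after segmentation. The noise entries below are illustrative assumptions.
NOISE_WORDS = {"的", "了", ",", "。", "123", "the"}

def remove_noise(tokens, noise=NOISE_WORDS):
    """Drop every token that matches an entry of the preset noise vocabulary."""
    return [t for t in tokens if t not in noise]

tokens = ["黄金", "的", "价格", "上涨", "了", "。"]
print(remove_noise(tokens))  # ['黄金', '价格', '上涨']
```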
S130: Obtain the preset number of iterations for the Labeled-LDA algorithm.
Specifically, in one embodiment the preset number of iterations may be 500; in other embodiments the preset number of iterations may be customized according to the user's actual application, or the iteration count with the best effect across multiple training runs may be chosen as the preset number of iterations.
S140: According to the obtained preset number of iterations, iteratively train the data-processed multiple training samples with the Labeled-LDA algorithm to generate the vocabulary of the text label.
Specifically, the data-processed training samples are iteratively trained with the Labeled-LDA algorithm for the preset number of iterations, yielding a text label-word probability matrix, that is, the probabilities over a vocabulary of n words corresponding to the text label. The text label-word probability matrix generated by the Labeled-LDA algorithm is the probability distribution produced after the LDA algorithm is combined with text labels; the words extracted through LDA's multilayer Bayesian structure yield the text label-word probability matrix, i.e., words highly relevant to the text label, and these words are precisely the latent keywords that manual labeling tends to miss. The vocabulary of the text label generated by the Labeled-LDA algorithm therefore no longer depends on whether a word appears in the text content of the multiple training samples. The Labeled-LDA algorithm adds a label layer on top of the LDA algorithm and belongs to the semi-supervised models. The LDA algorithm model, also called a three-layer Bayesian probability generative model, comprises the three-layer structure of word, topic, and document: documents obey a latent Dirichlet distribution over topics, and topics obey a multinomial distribution over words; that is, a document represents a probability distribution over k topics, and each topic represents a probability distribution over m words. The LDA algorithm is an unsupervised model and needs no labeling work during training, but the topics it generates for the text content of the training samples are muddled and cannot explain the text content, so it is difficult to apply directly to text classification. After a label layer is added on top of the LDA algorithm, however, the word sampling for the implicit topics is no longer drawn from the whole bag of words but from the bag of words of each label; once this information is exploited, the interpretability of each topic increases considerably.
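For orientation only, the end product of this step — a text label-word probability matrix P(word | label) — can be approximated with smoothed per-label counts. This is a deliberate simplification and an assumption on my part: real Labeled-LDA runs many Gibbs-sampling iterations over latent topics rather than direct counting.

```python
# Highly simplified stand-in for the label-word probability matrix that
# Labeled-LDA training yields. Real Labeled-LDA infers it via iterative
# Gibbs sampling over a latent topic layer; here smoothed counts only
# illustrate the shape of the output, {label: {word: probability}}.
from collections import Counter, defaultdict

def label_word_matrix(samples, beta=0.01):
    """samples: list of (label, tokens). Returns {label: {word: prob}}."""
    counts = defaultdict(Counter)
    for label, tokens in samples:
        counts[label].update(tokens)
    matrix = {}
    for label, c in counts.items():
        total = sum(c.values()) + beta * len(c)  # Dirichlet-style smoothing
        matrix[label] = {w: (n + beta) / total for w, n in c.items()}
    return matrix

samples = [("finance", ["gold", "price", "market"]),
           ("finance", ["market", "rate", "bank"])]
m = label_word_matrix(samples)
assert abs(sum(m["finance"].values()) - 1.0) < 1e-9  # valid distribution
```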
S150: Calculate the weight of every word in the vocabulary of the text label; gather all words whose weight share exceeds a first preset value into a first vocabulary for the text label, and gather all words whose weight share falls below a second preset value into a second vocabulary for the text label.
Specifically, in one embodiment the first preset value is 90% and the second preset value is 25%. From the text label-word probability matrix obtained by Labeled-LDA training, the weight of every word in the vocabulary of the text label is calculated; all words whose weight share exceeds 90% are gathered into the first vocabulary of the text label, and all words whose weight share falls below 25% are gathered into the second vocabulary of the text label. It follows that all words in the first vocabulary are keywords highly relevant to the text label, i.e., the feature vocabulary, and all words in the second vocabulary are non-keywords irrelevant to the text label, i.e., the invalid vocabulary.
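The thresholding itself is straightforward once weights are in hand. The text does not fully specify how "weight share" is normalized; the sketch below assumes each word's weight relative to the label's largest weight, with the 90% and 25% thresholds from the embodiment — treat both choices as assumptions:

```python
# Sketch of step S150: splitting a label's vocabulary by weight share.
# Normalizing against the largest weight for the label is an assumption;
# the patent only states "weight share" thresholds of 90% and 25%.
def split_vocabulary(weights, hi=0.90, lo=0.25):
    """weights: {word: weight}. Returns (first_vocab, second_vocab)."""
    top = max(weights.values())
    first = {w for w, v in weights.items() if v / top > hi}   # feature words
    second = {w for w, v in weights.items() if v / top < lo}  # invalid words
    return first, second

weights = {"gold": 1.0, "price": 0.95, "market": 0.5, "the": 0.1}
first, second = split_vocabulary(weights)
print(sorted(first), sorted(second))  # ['gold', 'price'] ['the']
```

Words between the two thresholds ("market" above) belong to neither vocabulary and are simply left alone.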
In the above embodiments, the data-processed training samples are iteratively trained multiple times with the Labeled-LDA algorithm to obtain the probability distribution between the text label of the multiple training samples and the words. Through the words extracted by LDA's multilayer Bayesian structure, keywords highly relevant to the text label can be obtained, and the generated vocabulary of the text label can then be added to the subsequent text classification process. Because the features of text classification are exactly words, the text label-word probability distribution exposes the weight of each word with respect to the text label, which allows a degree of explanation of the black-box model of text classification. Moreover, training the multiple samples with the Labeled-LDA algorithm surfaces latent words that cannot be obtained during neural network training, which helps establish new labeling rules from the obtained latent words and achieves the goal of rapid labeling.
Referring to Fig. 2, which is a schematic flowchart of a text classification method provided by an embodiment of the present invention. The text classification method applies to scenarios that require text classification, such as search engines, personalized recommendation systems, and public opinion monitoring. As shown in Fig. 2, the method may include steps S210 to S260.
S210: Obtain a sample to be classified, the sample including text content and a text label, where the text label of the sample to be classified is identical to the text label in steps S110-S150.
Specifically, the premise for adding the vocabulary of the text label generated in steps S110-S150 to text classification is that the text label of the obtained sample to be classified and the text label corresponding to the generated vocabulary are one and the same text label.
S220: Perform data processing on the sample to be classified.
Specifically, the step of performing data processing on the sample to be classified includes the following step C and step D:
Step C: Perform word segmentation on the sample to be classified using a segmentation algorithm.
Specifically, the segmentation algorithms include string-matching-based segmentation, understanding-based segmentation, and statistics-based segmentation; the string-matching-based segmentation algorithms include forward matching algorithms, reverse matching algorithms, maximum matching algorithms, minimum matching algorithms, simple segmentation algorithms, and integrated algorithms that combine segmentation with tagging.
Step D: Remove the noise words from the sample to be classified after word segmentation.
Specifically, the noise words include stop words and common words. The stop words include English characters, digits, mathematical symbols, punctuation marks, extremely frequent single Chinese characters, and determiners; the common words include proper nouns and everyday words from various technical fields.
The specific implementation and effect of step C and step D can be found in the description of step A and step B in the preceding method embodiment; for brevity, they are not repeated here.
S230: Obtain the first vocabulary and the second vocabulary of the text label generated in steps S110-S150.
Specifically, the first vocabulary of the text label generated in steps S110-S150 gathers all words whose weight share exceeds 90%, and the second vocabulary gathers all words whose weight share falls below 25%; that is, all words in the first vocabulary are keywords highly relevant to the text label, and all words in the second vocabulary are non-keywords irrelevant to the text label.
S240: Match every word in the data-processed sample to be classified one by one against the first vocabulary and the second vocabulary to obtain a matching result for each word.
Specifically, all words in the data-processed sample to be classified are traversed and matched one by one against the first vocabulary and the second vocabulary. For example, in one embodiment, a certain word of the sample to be classified is matched against all words in the first vocabulary to check whether the word exists in the first vocabulary; if it does not, the word is then matched against all words in the second vocabulary to check whether it exists there, thereby obtaining the matching result for the word.
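The matching just described reduces to two set-membership lookups per word; the sketch below shows this under assumed toy vocabularies:

```python
# Sketch of step S240: matching each word of the preprocessed sample
# against the first (feature) and second (invalid) vocabularies.
# Storing the vocabularies as sets makes each lookup O(1).
def match_words(tokens, first_vocab, second_vocab):
    """Return {word: 'feature' | 'invalid' | 'none'} for each token."""
    result = {}
    for t in tokens:
        if t in first_vocab:
            result[t] = "feature"     # found in the first vocabulary
        elif t in second_vocab:
            result[t] = "invalid"     # found in the second vocabulary
        else:
            result[t] = "none"        # in neither vocabulary
    return result

tokens = ["gold", "price", "the", "report"]
print(match_words(tokens, {"gold", "price"}, {"the"}))
# {'gold': 'feature', 'price': 'feature', 'the': 'invalid', 'report': 'none'}
```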
S250: Perform secondary data processing on the sample to be classified according to the matching result.
Specifically, the step of performing secondary data processing on the sample to be classified according to the matching result includes the following step E and step F:
Step E: If the matching result is that an identical word is matched in the first vocabulary, increase the weight of that word in the sample to be classified.
Specifically, if a certain word of the text content of the data-processed sample to be classified finds a matching word in the first vocabulary, then since all words in the first vocabulary are keywords highly relevant to the text label, the word is a feature word of the sample to be classified. To improve the text classification accuracy of the sample, the weight of that word can be increased automatically, i.e., the word's weight in the sample to be classified is raised. In one embodiment, the feature words matched in the first vocabulary are spliced in order onto the end of the original sample's text content to increase their weight.
Step F: If the matching result is that an identical word is matched in the second vocabulary, delete that word from the sample to be classified.
Specifically, if a certain word of the text content of the data-processed sample to be classified finds a matching word in the second vocabulary, then since all words in the second vocabulary are non-keywords highly irrelevant to the text label, the word is an invalid word of the sample to be classified. To improve the text classification accuracy of the sample, the invalid word can be deleted directly from the sample's text content; deleting the many noise words that contribute nothing to the classification training of the sample effectively improves the utilization of text label information. In addition, if a certain word of the text content of the data-processed sample cannot be matched in either the first vocabulary or the second vocabulary, the word is retained and no processing is done to it.
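Steps E and F together can be sketched as one pass over the tokens: invalid words are dropped, feature words are duplicated onto the end of the text (the weight-boosting embodiment above), and unmatched words pass through unchanged. The toy vocabularies are assumptions:

```python
# Sketch of steps E and F of the secondary data processing:
#   step F: delete words matched in the second (invalid) vocabulary;
#   step E: splice words matched in the first (feature) vocabulary onto
#           the end of the text to raise their weight.
def secondary_process(tokens, first_vocab, second_vocab):
    kept = [t for t in tokens if t not in second_vocab]   # step F: delete
    boost = [t for t in tokens if t in first_vocab]       # step E: duplicate
    return kept + boost                                   # append to the end

tokens = ["gold", "rises", "the", "market"]
print(secondary_process(tokens, {"gold"}, {"the"}))
# ['gold', 'rises', 'market', 'gold']
```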
S260: Train on the secondarily processed sample to be classified with a text classification algorithm so as to classify the sample.
Specifically, the text classification algorithms include neural network algorithms and conventional machine learning classification algorithms. The neural network algorithms include convolutional neural network algorithms, recurrent neural network algorithms, and the Fasttext algorithm; the conventional machine learning classification algorithms include generalized linear regression algorithms, tree-based classification algorithms, and support vector machine algorithms.
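The patent trains Fasttext, CNN, and RNN models at this step. As a stdlib-only stand-in to make the training step concrete, the sketch below fits a multinomial naive Bayes classifier on bag-of-words features — a conventional machine learning classifier in the spirit of the list above, not the patent's own algorithm:

```python
# Stand-in for the classification-training step: a multinomial naive
# Bayes classifier over bag-of-words tokens, with add-one smoothing.
# This is NOT the patent's Fasttext/CNN/RNN training, only a minimal
# illustration of training on (secondarily processed) token lists.
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # per-label word counts
        self.label_counts = Counter(labels)
        for tokens, y in zip(docs, labels):
            self.word_counts[y].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        best, best_lp = None, float("-inf")
        n = sum(self.label_counts.values())
        for y, ny in self.label_counts.items():
            lp = math.log(ny / n)  # log prior
            total = sum(self.word_counts[y].values()) + len(self.vocab)
            for t in tokens:       # log likelihood with add-one smoothing
                lp += math.log((self.word_counts[y][t] + 1) / total)
            if lp > best_lp:
                best, best_lp = y, lp
        return best

docs = [["gold", "price", "market"], ["goal", "match", "team"]]
labels = ["finance", "sports"]
clf = NaiveBayes().fit(docs, labels)
print(clf.predict(["market", "price"]))  # finance
```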
Specifically, in one embodiment, 2000 news items are chosen across 10 classes, each class being one text label, and each text label takes 100 items as the multiple training samples. The multiple training samples are put into the Labeled-LDA algorithm for multiple iterations of training to generate the first vocabulary and the second vocabulary of the text label, the first vocabulary being the feature vocabulary and the second vocabulary being the invalid vocabulary. Then 100 items carrying the text label are chosen as samples to be classified, and every word in the data-processed samples is matched one by one against the first vocabulary and the second vocabulary to obtain the matching result for each word, shown in the left frame of Table 1 below, where the font of words matching the first vocabulary is set to italics and the font of words matching the second vocabulary is set to underline. Secondary data processing is then performed on the samples according to the matching result, i.e., the underlined words are deleted and the weight of the italicized words is increased automatically; the result of the secondary data processing is shown in the right frame of Table 1 below:
Table 1
The secondarily processed samples to be classified are put into the text classification algorithm for training so as to classify them. In this embodiment, the Fasttext algorithm is used to train the secondarily processed samples, while the samples without secondary processing are separately put into the convolutional neural network algorithm (Convolutional Neural Network, CNN), the recurrent neural network algorithm (Recurrent Neural Networks, RNN), and the Fasttext algorithm for training. The resulting classification performance is shown in Table 2 below:
            CNN      RNN     Fasttext   LLDA-Fasttext
Precision   0.8866   0.95    0.9633     0.9894
Recall      0.89     0.94    0.9633     0.9894
Table 2
Here Precision is the accuracy rate and Recall is the recall rate. As Table 2 shows, using the first and second vocabularies of the text label generated by Labeled-LDA training to perform secondary data processing on the samples to be classified, and then feeding the secondarily processed samples into the Fasttext algorithm for training, yields a text classification accuracy far higher than that of the CNN and RNN algorithms, and two percentage points higher than training directly with the Fasttext algorithm, which intuitively illustrates the validity and significance of this scheme.
In the above embodiments, before the samples to be classified undergo text classification training, the first and second vocabularies of the text label generated by Labeled-LDA training are added. The secondary data processing consists mainly of matching every word of the sample to be classified one by one against the first and second vocabularies to obtain each word's matching result, then actively adding feature words and rejecting invalid words according to that result. For samples with little text content, actively adding feature words helps raise the classification accuracy; for samples whose text content contains many noise words, actively rejecting invalid words effectively eliminates the influence of noise words on the classification algorithm. The text classification method proposed by the embodiments of the present invention, being based on the feature vocabulary and invalid vocabulary generated by the Labeled-LDA algorithm, can be widely applied to various text classification tasks; the data preprocessing of samples to be classified is simple and fast, the overall running time and performance are outstanding, and the accuracy of text classification is effectively improved.
Referring to Fig. 3, which is a schematic block diagram of a device 300 provided by an embodiment of the present invention. As shown in Fig. 3, the device 300 corresponds to the vocabulary generation method shown in Fig. 1 and includes units for executing that method; it can be configured in terminals such as desktop computers, tablet computers and laptop computers. Specifically, referring to Fig. 3, the device 300 includes a first acquisition unit 301, a first data processing unit 302, a second acquisition unit 303, a training unit 304 and a computing unit 305.
The first acquisition unit 301 is used to obtain multiple training samples, each of which includes text content and a text label, wherein the multiple training samples share one and the same text label.
The first data processing unit 302 is used to perform data processing on the multiple training samples. Specifically, the first data processing unit 302 includes a first word-segmentation unit 3021 and a first clearing unit 3022.
The first word-segmentation unit 3021 is used to perform word segmentation on the multiple training samples using a segmentation algorithm.
Specifically, the segmentation algorithms include string-matching-based segmentation algorithms, understanding-based segmentation algorithms and statistics-based segmentation algorithms; the string-matching-based segmentation algorithms include the forward matching algorithm, the reverse matching algorithm, the maximum matching algorithm, the minimum matching algorithm, the simple segmentation algorithm and the integrated algorithm combining segmentation with tagging.
The first clearing unit 3022 is used to remove the noise words of the multiple training samples after word segmentation. Specifically, the noise words include stop words and common words; the stop words include English characters, digits, mathematical characters, punctuation marks, extremely frequent single Chinese characters and determiners, and the common words include proper nouns and the everyday words of the various technical fields.
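This data-processing step (segment, then strip noise words) can be illustrated as follows. A production pipeline for Chinese text would use a real segmenter such as jieba; the regex tokenizer and the tiny word lists below are assumptions for demonstration only.

```python
import re

STOP_WORDS = {"the", "a", "of", "is", "and"}     # assumed stop words
COMMON_WORDS = {"company", "system"}             # assumed everyday domain words

def clean(text):
    # crude tokenizer standing in for a proper segmentation algorithm
    tokens = re.findall(r"[a-z]+", text.lower())
    # remove noise words: stop words plus common words
    return [t for t in tokens if t not in STOP_WORDS | COMMON_WORDS]

print(clean("The credit system of a company is fast!"))
# -> ['credit', 'fast']
```

The same cleaning function is applied both to the training samples here and to the samples to be classified later, so that the vocabularies and the samples are matched in the same token space.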
The second acquisition unit 303 is used to obtain the preset number of iterations of the Labeled-LDA algorithm. Specifically, in one embodiment, the preset number of iterations may be 500.
The training unit 304 is used to iteratively train the data-processed multiple training samples with the Labeled-LDA algorithm according to the obtained preset number of iterations, so as to generate the word table of the text label.
Specifically, the Labeled-LDA algorithm adds a label layer on top of the LDA algorithm and belongs to the semi-supervised models. The LDA model is also called a three-layer Bayesian probability generative model, with a word-topic-document three-layer structure in which the document-to-topic distribution follows a latent Dirichlet distribution and the topic-to-word distribution is multinomial.
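For context, the standard LDA generative process behind this three-layer structure can be written out explicitly; the notation below is conventional and is not taken from the patent itself. With a Dirichlet prior with parameter α over topic mixtures and topic-word distributions φ_k, for each document d and word position n:

```latex
\theta_d \sim \operatorname{Dirichlet}(\alpha), \qquad
z_{d,n} \mid \theta_d \sim \operatorname{Multinomial}(\theta_d), \qquad
w_{d,n} \mid z_{d,n} \sim \operatorname{Multinomial}(\varphi_{z_{d,n}})
```

Labeled-LDA adds the label layer by restricting each document's admissible topics z_{d,n} to that document's observed labels, so that each text label accumulates its own topic-word distribution, which is the word table generated by the training unit 304.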
The computing unit 305 is used to calculate the weights of all words in the word table of the text label, to aggregate all words whose weight proportion is greater than a first preset value into the first vocabulary of the text label, and to aggregate all words whose weight proportion is less than a second preset value into the second vocabulary of the text label.
Specifically, in one embodiment the first preset value is 90% and the second preset value is 25%: from the text label-word probability matrix obtained by Labeled-LDA training, the weights of all words in the word table of the text label are calculated; all words whose weight proportion is greater than 90% are aggregated into the first vocabulary of the text label, and all words whose weight proportion is less than 25% are aggregated into the second vocabulary of the text label.
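The patent does not pin down exactly how the "weight proportion" is derived from the label-word probability matrix. The sketch below reads it as a cumulative share of a label's total word weight, which is one plausible interpretation; the function name, thresholds semantics and data are illustrative assumptions only.

```python
def split_vocabulary(word_weights, hi=0.90, lo=0.25):
    """Split one label's word table into a first and second vocabulary.

    word_weights -- {word: weight}, one row of the label-word matrix
    hi           -- words inside the top `hi` cumulative share of total
                    weight form the first (feature-word) vocabulary
    lo           -- words falling in the bottom `lo` share form the
                    second (invalid-word) vocabulary
    """
    total = sum(word_weights.values())
    ranked = sorted(word_weights.items(), key=lambda kv: kv[1], reverse=True)
    first, second, cumulative = [], [], 0.0
    for word, weight in ranked:
        cumulative += weight / total
        if cumulative <= hi:
            first.append(word)
        elif cumulative > 1 - lo:
            second.append(word)
    return first, second

print(split_vocabulary({"loan": 50, "credit": 30, "rate": 15, "misc": 5}))
# -> (['loan', 'credit'], ['rate', 'misc'])
```

Under this reading, a word can fall in neither vocabulary (between the two thresholds), which matches the patent's use of strict "greater than"/"less than" comparisons.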
It should be noted that, as is clear to those skilled in the art, for the specific implementation and effects of the above device 300 and its units, reference may be made to the corresponding descriptions in the foregoing method embodiments; for convenience and brevity of description, they are not repeated here.
Referring to Fig. 4, which is a schematic block diagram of another device 400 provided by an embodiment of the present invention. As shown in Fig. 4, the device 400 corresponds to the text classification method shown in Fig. 2 and includes units for executing that method; it can be configured in terminals such as desktop computers, tablet computers and laptop computers. Specifically, referring to Fig. 4, the device 400 includes a third acquisition unit 401, a second data processing unit 402, a fourth acquisition unit 403, a matching unit 404, a third data processing unit 405 and a text training unit 406.
The third acquisition unit 401 is used to obtain a sample to be classified, which includes text content and a text label, wherein the text label of the sample to be classified is identical to the text label in steps S110-S150. Specifically, the text label of the sample to be classified obtained by the third acquisition unit 401 and the text label corresponding to the generated word table are one and the same text label.
The second data processing unit 402 is used to perform data processing on the sample to be classified. Specifically, the second data processing unit 402 includes a second word-segmentation unit 4021 and a second clearing unit 4022.
The second word-segmentation unit 4021 is used to perform word segmentation on the sample to be classified using a segmentation algorithm.
The second clearing unit 4022 is used to remove the noise words of the sample to be classified after word segmentation.
The fourth acquisition unit 403 is used to obtain the first vocabulary and second vocabulary of the text label generated in steps S110-S150.
The matching unit 404 is used to match every word in the data-processed sample to be classified against the first vocabulary and the second vocabulary one by one to obtain the matching result of each word.
The third data processing unit 405 is used to perform secondary data processing on the sample to be classified according to the matching result; it includes a weight-increasing unit 4051 and a deletion unit 4052.
The weight-increasing unit 4051 is used to increase the weight of a word in the sample to be classified if the matching result shows a match with an identical word in the first vocabulary.
The deletion unit 4052 is used to delete a word from the sample to be classified if the matching result shows a match with an identical word in the second vocabulary.
The text training unit 406 is used to train the secondarily processed sample to be classified with a text classification algorithm so as to perform text classification on the sample.
Specifically, the text classification algorithms include neural-network algorithms and conventional machine-learning classification algorithms, wherein the neural-network algorithms include the convolutional neural network algorithm, the recurrent neural network algorithm and the FastText algorithm, and the conventional machine-learning classification algorithms include generalized linear regression algorithms, tree-based classification algorithms and the support vector machine algorithm.
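The classifier itself is interchangeable; the patent's experiments use FastText, whose Python binding is an external dependency. As a self-contained stand-in, the toy bag-of-words scorer below shows how feature words duplicated by the secondary data processing tilt a simple count-based prediction. All names, the training data and the scoring rule are illustrative assumptions, not the patent's classifier.

```python
from collections import Counter

def train(samples):
    """samples: list of (tokens, label) pairs -> per-label word counts."""
    model = {}
    for tokens, label in samples:
        model.setdefault(label, Counter()).update(tokens)
    return model

def predict(model, tokens):
    # score each label by summed token counts; feature words that the
    # secondary processing duplicated naturally count several times
    return max(model, key=lambda label: sum(model[label][t] for t in tokens))

model = train([
    (["loan", "loan", "credit", "rate"], "finance"),
    (["goal", "match", "team"], "sports"),
])
print(predict(model, ["credit", "loan", "loan"]))
# -> finance
```

Any of the algorithms listed above could replace this scorer; the vocabulary-based preprocessing is what the patent claims, not a particular classifier.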
It should be noted that, as is clear to those skilled in the art, for the specific implementation and effects of the above device 400 and its units, reference may be made to the corresponding descriptions in the foregoing method embodiments; for convenience and brevity of description, they are not repeated here.
The above devices may be implemented in the form of a computer program that can run on a computer device as shown in Fig. 5.
Referring to Fig. 5, which is a schematic block diagram of a computer device provided by an embodiment of the present invention. The computer device 600 may be a terminal or a server, where a terminal may be an electronic device such as a smartphone, tablet computer, laptop computer, desktop computer or personal digital assistant, and a server may be a standalone server or a server cluster composed of multiple servers.
Referring to Fig. 5, the computer device 600 includes a processor 602, a memory and a network interface 605 connected through a system bus 601, wherein the memory may include a non-volatile storage medium 603 and an internal memory 604.
The non-volatile storage medium 603 can store an operating system 6031 and a computer program 6032. The computer program 6032 includes program instructions which, when executed, may cause the processor 602 to execute a vocabulary generation method and a text classification method.
The processor 602 is used to provide computing and control capabilities to support the operation of the entire computer device 600.
The internal memory 604 provides an environment for running the computer program 6032 stored in the non-volatile storage medium 603; when the computer program 6032 is executed by the processor 602, the processor 602 may be caused to execute a vocabulary generation method and a text classification method.
The network interface 605 is used for network communication with other devices. Those skilled in the art will understand that the structure shown in Fig. 5 is only a block diagram of the part of the structure relevant to the present solution and does not limit the computer device 600 to which the solution is applied; a specific computer device 600 may include more or fewer components than shown, combine certain components, or have a different component layout.
The processor 602 is used to run the computer program 6032 stored in the memory so as to implement the following steps:
In one embodiment, when realizing the vocabulary generation method, the processor 602 specifically implements the following steps: obtaining multiple training samples, each including text content and a text label, wherein the multiple training samples share one and the same text label; performing data processing on the multiple training samples; obtaining a preset number of iterations of the Labeled-LDA algorithm; iteratively training the data-processed multiple training samples with the Labeled-LDA algorithm according to the obtained preset number of iterations to generate the word table of the text label; and calculating the weights of all words in the word table of the text label, aggregating all words whose weight proportion is greater than a first preset value into the first vocabulary of the text label, and aggregating all words whose weight proportion is less than a second preset value into the second vocabulary of the text label.
In one embodiment, when realizing the step of performing data processing on the multiple training samples, the processor 602 specifically implements the following steps: performing word segmentation on the multiple training samples using a segmentation algorithm; and removing the noise words of the multiple training samples after word segmentation.
In one embodiment, when realizing the text classification method, the processor 602 specifically implements the following steps: obtaining a sample to be classified that includes text content and a text label, wherein the text label of the sample to be classified is identical to the text label of the multiple training samples in the vocabulary generation method; performing data processing on the sample to be classified; obtaining the first vocabulary and second vocabulary of the text label generated by the vocabulary generation method; matching every word in the data-processed sample to be classified against the first vocabulary and the second vocabulary one by one to obtain the matching result of each word; performing secondary data processing on the sample to be classified according to the matching result; and training the secondarily processed sample to be classified with a text classification algorithm so as to perform text classification on the sample.
In one embodiment, when realizing the step of performing secondary data processing on the sample to be classified according to the matching result, the processor 602 specifically implements the following steps: if the matching result is a match with an identical word in the first vocabulary, increasing the weight of that word in the sample to be classified; and if the matching result is a match with an identical word in the second vocabulary, deleting that word from the sample to be classified.
It should be understood that in the embodiments of the present invention the processor 602 may be a central processing unit (Central Processing Unit, CPU), and may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, etc.
Those of ordinary skill in the art will appreciate that all or part of the flow in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program. The computer program includes program instructions and can be stored in a storage medium. The program instructions are executed by at least one processor in the computer system to realize the flow steps of the above method embodiments.
Therefore, the present invention also provides a storage medium. The storage medium is a computer-readable storage medium that stores a computer program whose program instructions, when executed by a processor, cause the processor to execute the vocabulary generation method and the text classification method described above.
The storage medium may be any of various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk or an optical disc.
Those of ordinary skill in the art may realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. A skilled professional may use different methods to realize the described functions for each specific application, but such realization should not be considered beyond the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed device and method may be realized in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other division manners are possible in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
The steps in the embodiments of the present invention may be reordered, merged and deleted according to actual needs, and the units in the devices of the embodiments may be combined, divided and deleted according to actual needs. In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may exist separately and physically, or two or more units may be integrated into one unit.
If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, etc.) to execute all or part of the steps of the methods of the embodiments of the present invention.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the technical field can easily conceive of various equivalent modifications or replacements within the technical scope disclosed by the present invention, and these modifications or replacements shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A vocabulary generation method, characterized by comprising:
obtaining multiple training samples, each training sample including text content and a text label, wherein the multiple training samples share one and the same text label;
performing data processing on the multiple training samples;
obtaining a preset number of iterations of the Labeled-LDA algorithm;
iteratively training the data-processed multiple training samples with the Labeled-LDA algorithm according to the obtained preset number of iterations to generate a word table of the text label; and
calculating the weights of all words in the word table of the text label, aggregating all words whose weight proportion is greater than a first preset value into a first vocabulary of the text label, and aggregating all words whose weight proportion is less than a second preset value into a second vocabulary of the text label.
2. The vocabulary generation method according to claim 1, characterized in that performing data processing on the multiple training samples comprises:
performing word segmentation on the multiple training samples using a segmentation algorithm; and
removing the noise words of the multiple training samples after word segmentation.
3. The vocabulary generation method according to claim 2, characterized in that the segmentation algorithms include string-matching-based segmentation algorithms, understanding-based segmentation algorithms and statistics-based segmentation algorithms, wherein the string-matching-based segmentation algorithms include the forward matching algorithm, the reverse matching algorithm, the maximum matching algorithm, the minimum matching algorithm, the simple segmentation algorithm and the integrated algorithm combining segmentation with tagging.
4. The vocabulary generation method according to claim 1, characterized in that the first preset value is 90% and the second preset value is 25%.
5. A text classification method, characterized by comprising:
obtaining a sample to be classified, the sample to be classified including text content and a text label, wherein the text label of the sample to be classified is identical to the text label in any one of claims 1-4;
performing data processing on the sample to be classified;
obtaining the first vocabulary and the second vocabulary of the text label generated in any one of claims 1-4;
matching every word in the data-processed sample to be classified against the first vocabulary and the second vocabulary one by one to obtain the matching result of each word;
performing secondary data processing on the sample to be classified according to the matching result; and
training the secondarily processed sample to be classified with a text classification algorithm so as to perform text classification on the sample.
6. The text classification method according to claim 5, characterized in that performing secondary data processing on the sample to be classified according to the matching result comprises:
if the matching result is a match with an identical word in the first vocabulary, increasing the weight of that word in the sample to be classified; and
if the matching result is a match with an identical word in the second vocabulary, deleting that word from the sample to be classified.
7. The text classification method according to claim 5, characterized in that the text classification algorithms include neural-network algorithms and conventional machine-learning classification algorithms, wherein the neural-network algorithms include the convolutional neural network algorithm, the recurrent neural network algorithm and the FastText algorithm, and the conventional machine-learning classification algorithms include generalized linear regression algorithms, tree-based classification algorithms and the support vector machine algorithm.
8. A device, characterized by comprising units for executing the method according to any one of claims 1-7.
9. A computer device, characterized in that the computer device includes a memory and a processor, the memory stores a computer program, and the processor, when executing the computer program, realizes the method according to any one of claims 1-7.
10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, can realize the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811080887.4A CN109325122A (en) | 2018-09-17 | 2018-09-17 | Vocabulary generation method, file classification method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109325122A true CN109325122A (en) | 2019-02-12 |
Family
ID=65265457
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811080887.4A Withdrawn CN109325122A (en) | 2018-09-17 | 2018-09-17 | Vocabulary generation method, file classification method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109325122A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110795558A (en) * | 2019-09-03 | 2020-02-14 | 腾讯科技(深圳)有限公司 | Label acquisition method and device, storage medium and electronic device |
CN111191011A (en) * | 2020-04-17 | 2020-05-22 | 郑州工程技术学院 | Search matching method, device and equipment for text label and storage medium |
CN111797234A (en) * | 2020-06-16 | 2020-10-20 | 北京北大软件工程股份有限公司 | Method and system for multi-label distributed learning in natural language processing classification model |
CN111930944A (en) * | 2020-08-12 | 2020-11-13 | 中国银行股份有限公司 | File label classification method and device |
CN111966830A (en) * | 2020-06-30 | 2020-11-20 | 北京来也网络科技有限公司 | Text classification method, device, equipment and medium combining RPA and AI |
CN113255337A (en) * | 2021-05-21 | 2021-08-13 | 广州欢聚时代信息科技有限公司 | Word list construction method, machine translation method, device, equipment and medium thereof |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20190212 |