CN109325122A - Vocabulary generation method, text classification method, device, equipment and storage medium - Google Patents
Vocabulary generation method, text classification method, device, equipment and storage medium
Info
- Publication number
- CN109325122A (application CN201811080887.4A)
- Authority
- CN
- China
- Prior art keywords
- vocabulary
- sample
- algorithm
- word
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a vocabulary generation method, a text classification method, and a corresponding device, equipment, and storage medium. The generation method includes: obtaining multiple training samples, each training sample including text content and a text label; performing data processing on the multiple training samples; obtaining a preset number of iterations for the Labeled-LDA algorithm; according to the preset number of iterations, iteratively training the data-processed training samples with the Labeled-LDA algorithm to generate a vocabulary for the text label; and calculating the weight of every word in the vocabulary of the text label, gathering all words whose weight share exceeds a first preset value into a first vocabulary for the text label, and gathering all words whose weight share falls below a second preset value into a second vocabulary for the text label. Implementing this scheme addresses the low classification accuracy and low classification efficiency of existing text classification methods.
Description
Technical field
The present invention relates to the field of text classification technology, and more particularly to a vocabulary generation method, a text classification method, and a corresponding device, equipment, and storage medium.
Background technique
Text classification has been widely applied in fields such as search engines, personalized recommendation systems, and public opinion monitoring, and is a key link in efficiently managing and accurately locating massive amounts of information. However, the field of text classification currently faces two problems that have not yet been solved. First, most text classification algorithm models with high accuracy are black-box models, for example, neural network algorithms and support vector machine algorithms. Although neural network algorithms are known to achieve very high accuracy in text classification, their complex structure and unexplainable internals mean that those skilled in the art cannot know how that high accuracy is obtained. Because of this black-box effect, technicians must perform a large amount of labeling work when using a neural network algorithm model for text classification. A person of ordinary skill can establish a set of rules from significant label-related wording and filter out qualified samples by those rules for training, but such a method becomes harder to sustain over time: for the documents hit by the rules, technicians find it difficult to search and extract them from a massive sample pool, and a great deal of time is consumed on checking. If the bulk of the labeling is done manually, it requires substantial human and material resources and greatly increases the cost of text classification; if the labeling is done automatically by an algorithm model, a second problem easily arises, namely that the reliability of the labels is hard to guarantee. At present, automatic labeling mainly relies on unsupervised algorithm models. Although technicians can use an unsupervised model to label samples, the reliability of those labels cannot be guaranteed; applying samples carrying unreliable labels to the training of a text classification algorithm model easily confuses the model's results and degrades its accuracy. In addition, because the feature dimensionality of text is so large, the effectiveness of text classification cannot be guaranteed.
Summary of the invention
Embodiments of the present invention provide a vocabulary generation method, a text classification method, and a corresponding device, equipment, and storage medium, intended to improve the accuracy and efficiency of text classification.
To solve the above technical problem, in a first aspect, an embodiment of the invention provides a vocabulary generation method, including: obtaining multiple training samples, each training sample including text content and a text label, where the multiple training samples share the same text label; performing data processing on the multiple training samples; according to a preset number of iterations, iteratively training the data-processed training samples with the Labeled-LDA algorithm to generate a vocabulary for the text label; and calculating the weight of every word in the vocabulary of the text label, gathering all words whose weight share exceeds a first preset value into a first vocabulary for the text label, and gathering all words whose weight share falls below a second preset value into a second vocabulary for the text label.
In a second aspect, an embodiment of the invention further provides a text classification method, including: obtaining a sample to be classified, the sample including text content and a text label, where the text label of the sample to be classified is identical to the text label of the first aspect; performing data processing on the sample to be classified; obtaining the first vocabulary and the second vocabulary generated by the method of the first aspect; matching every word in the data-processed sample one by one against the first vocabulary and the second vocabulary to obtain a matching result for each word; performing secondary data processing on the sample according to the matching result; and training on the secondarily processed sample with a text classification algorithm so as to classify the sample.
In a third aspect, an embodiment of the invention further provides a device comprising units for executing the method of the first or second aspect.
In a fourth aspect, an embodiment of the invention further provides a computer equipment comprising a memory and a processor, the memory storing a computer program, where the processor implements the method of the first or second aspect when executing the computer program.
In a fifth aspect, an embodiment of the invention further provides a storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, implement the method of the first or second aspect.
Embodiments of the invention provide a vocabulary generation method, a text classification method, and a corresponding device, equipment, and storage medium. To solve the low efficiency and high cost of text classification caused by the large amount of labeling work required during data preprocessing, which stems from the black-box effect of text classification algorithms, and to fully improve the efficiency of the labeling work, the embodiments introduce the Labeled-Latent Dirichlet Allocation (Labeled-LDA) algorithm into the data preprocessing that precedes classifying the documents to be classified, where the Labeled-LDA algorithm adds a label layer on top of the LDA algorithm. Implementing the embodiments of the invention can effectively improve the efficiency of text labeling, and by applying data processing to the noise words in a text, the influence of noise words on text classification can be effectively excluded and the accuracy of text classification effectively improved.
Detailed description of the invention
To explain the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the vocabulary generation method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of the text classification method provided by an embodiment of the present invention;
Fig. 3 is a schematic block diagram of a device provided by an embodiment of the present invention;
Fig. 4 is a schematic block diagram of another device provided by an embodiment of the present invention; and
Fig. 5 is a schematic block diagram of a computer equipment provided by an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the protection scope of the invention.
It should be understood that when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terminology used in this description of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the description of the invention and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in the description of the invention and the appended claims refers to, and includes, any and all possible combinations of one or more of the associated listed items.
Referring to Fig. 1, which is a schematic flowchart of a vocabulary generation method provided by an embodiment of the present invention. The vocabulary generation method applies to scenarios that require text classification, such as search engines, personalized recommendation systems, and public opinion monitoring. As shown in the figure, the method may include steps S110 to S150.
S110: Obtain multiple training samples, each training sample including text content and a text label, where the multiple training samples share the same text label.
Specifically, before the step of obtaining multiple training samples, data preprocessing must be performed on the training samples; that is, the labeling work must be carried out on the multiple training samples to obtain multiple training samples carrying text labels. Each training sample may be a document, a passage of text, or several documents on the same topic.
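As an illustrative aside (not part of the patent text), the labeled training sample described in step S110 can be sketched as a simple record pairing text content with its text label; the field names here are assumptions chosen for demonstration:

```python
# Minimal sketch of a labeled training sample; the field names
# ("text", "label") are illustrative assumptions, not the patent's terms.
from dataclasses import dataclass

@dataclass
class TrainingSample:
    text: str   # text content: a document, a passage, or several same-topic documents
    label: str  # the text label shared by this training set

samples = [
    TrainingSample(text="gold prices rose on market fears", label="finance"),
    TrainingSample(text="central bank adjusts interest rates", label="finance"),
]

# All samples in one training set share a single text label, as in S110.
assert all(s.label == "finance" for s in samples)
```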
S120: Perform data processing on the multiple training samples.
Specifically, the step of performing data processing on the multiple training samples includes the following step A and step B:
Step A: Perform word segmentation on the multiple training samples using a segmentation algorithm.
Specifically, the segmentation algorithm cuts the text content of each training sample into n words with no ordering relation between them; for example, a segmentation algorithm may cut the short sentence "我去上学" ("I go to school") into the four words "我", "去", "上", "学". The segmentation algorithms include string-matching-based segmentation, understanding-based segmentation, and statistics-based segmentation. String-matching-based segmentation, also known as mechanical segmentation, matches the Chinese character string under analysis against the entries of a "sufficiently large" machine dictionary according to certain rules; if a string identical to the character string is found in the dictionary, the match succeeds. In one embodiment, string-matching segmentation algorithms can be divided by scanning direction into forward matching algorithms and reverse matching algorithms; in another embodiment, they can be divided by length-priority rule into maximum (longest) matching algorithms and minimum (shortest) matching algorithms; in still other embodiments, according to whether they are combined with part-of-speech tagging, they can be divided into simple segmentation algorithms and integrated algorithms that combine segmentation with tagging. Understanding-based segmentation makes the computer simulate a person's understanding of a sentence to identify words and thereby segment the document's text content: syntactic and semantic analysis is carried out alongside segmentation, and syntactic and semantic information is used to resolve ambiguity. An understanding-based segmenter generally consists of three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Under the coordination of the master control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences to judge segmentation ambiguities, i.e., it simulates a person's process of understanding a sentence. Statistics-based segmentation counts the frequency of each adjacent character combination co-occurring in the sample corpus and computes their mutual information, which reflects how tightly the characters combine; when the tightness exceeds some threshold, the character group may be considered to constitute a word. Formally, a word is a stable combination of characters: the more often adjacent characters appear together in context, the more likely they form a word, so the frequency or probability of adjacent character co-occurrence reflects well the credibility of forming a word.
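As a concrete illustration of the string-matching (mechanical) family described above, the sketch below implements forward maximum matching against a toy dictionary; the dictionary contents are an assumption for demonstration, not part of the patent:

```python
# A minimal forward-maximum-matching segmenter, illustrating the
# string-matching segmentation family. The tiny dictionary below is an
# illustrative assumption; real systems use a "sufficiently large" lexicon.
def forward_max_match(text, dictionary, max_len=4):
    """Greedily match the longest dictionary entry at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try candidates from longest to shortest at position i.
        for j in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + j]
            if j == 1 or cand in dictionary:
                words.append(cand)  # unmatched single chars pass through as-is
                i += j
                break
    return words

dictionary = {"我", "去", "上学"}  # toy lexicon
print(forward_max_match("我去上学", dictionary))  # ['我', '去', '上学']
```

With an empty dictionary the same routine degrades to character-level cutting, which mirrors the four-token example in the text.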
Step B: Remove the noise words from the multiple training samples after word segmentation.
Specifically, the noise words include stop words and common words. The stop words include English characters, digits, mathematical symbols, punctuation marks, extremely frequent single Chinese characters, and determiners; the common words include proper nouns and everyday words from various technical fields. In one embodiment, a preset noise vocabulary can be obtained, and the noise vocabulary is matched one by one against all words in the segmented training samples; on a successful match, the word is removed from the multiple training samples.
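The noise-removal step above can be sketched as a set-membership filter; the noise list here is an illustrative assumption standing in for the preset noise vocabulary:

```python
# Sketch of step B: removing noise words (stop words and common words)
# after segmentation. The noise entries below are illustrative assumptions.
NOISE_WORDS = {"的", "了", ",", "。", "123", "the"}

def remove_noise(tokens, noise=NOISE_WORDS):
    """Drop every token that matches an entry of the preset noise vocabulary."""
    return [t for t in tokens if t not in noise]

tokens = ["黄金", "的", "价格", "上涨", "了", "。"]
print(remove_noise(tokens))  # ['黄金', '价格', '上涨']
```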
S130: Obtain the preset number of iterations for the Labeled-LDA algorithm.
Specifically, in one embodiment the preset number of iterations may be 500; in other embodiments the preset number of iterations may be customized according to the user's actual application, or the iteration count with the best effect across multiple training runs may be chosen as the preset number of iterations.
S140: According to the obtained preset number of iterations, iteratively train the data-processed multiple training samples with the Labeled-LDA algorithm to generate the vocabulary of the text label.
Specifically, the data-processed training samples are iteratively trained with the Labeled-LDA algorithm for the preset number of iterations, yielding a text label-word probability matrix, that is, the probabilities over a vocabulary of n words corresponding to the text label. The text label-word probability matrix generated by the Labeled-LDA algorithm is the probability distribution produced after the LDA algorithm is combined with text labels; the words extracted through LDA's multilayer Bayesian structure yield the text label-word probability matrix, i.e., words highly relevant to the text label, and these words are precisely the latent keywords that manual labeling tends to miss. The vocabulary of the text label generated by the Labeled-LDA algorithm therefore no longer depends on whether a word appears in the text content of the multiple training samples. The Labeled-LDA algorithm adds a label layer on top of the LDA algorithm and belongs to the semi-supervised models. The LDA algorithm model, also called a three-layer Bayesian probability generative model, comprises the three-layer structure of word, topic, and document: documents obey a latent Dirichlet distribution over topics, and topics obey a multinomial distribution over words; that is, a document represents a probability distribution over k topics, and each topic represents a probability distribution over m words. The LDA algorithm is an unsupervised model and needs no labeling work during training, but the topics it generates for the text content of the training samples are muddled and cannot explain the text content, so it is difficult to apply directly to text classification. After a label layer is added on top of the LDA algorithm, however, the word sampling for the implicit topics is no longer drawn from the whole bag of words but from the bag of words of each label; once this information is exploited, the interpretability of each topic increases considerably.
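For orientation only, the end product of this step — a text label-word probability matrix P(word | label) — can be approximated with smoothed per-label counts. This is a deliberate simplification and an assumption on my part: real Labeled-LDA runs many Gibbs-sampling iterations over latent topics rather than direct counting.

```python
# Highly simplified stand-in for the label-word probability matrix that
# Labeled-LDA training yields. Real Labeled-LDA infers it via iterative
# Gibbs sampling over a latent topic layer; here smoothed counts only
# illustrate the shape of the output, {label: {word: probability}}.
from collections import Counter, defaultdict

def label_word_matrix(samples, beta=0.01):
    """samples: list of (label, tokens). Returns {label: {word: prob}}."""
    counts = defaultdict(Counter)
    for label, tokens in samples:
        counts[label].update(tokens)
    matrix = {}
    for label, c in counts.items():
        total = sum(c.values()) + beta * len(c)  # Dirichlet-style smoothing
        matrix[label] = {w: (n + beta) / total for w, n in c.items()}
    return matrix

samples = [("finance", ["gold", "price", "market"]),
           ("finance", ["market", "rate", "bank"])]
m = label_word_matrix(samples)
assert abs(sum(m["finance"].values()) - 1.0) < 1e-9  # valid distribution
```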
S150: Calculate the weight of every word in the vocabulary of the text label; gather all words whose weight share exceeds a first preset value into a first vocabulary for the text label, and gather all words whose weight share falls below a second preset value into a second vocabulary for the text label.
Specifically, in one embodiment the first preset value is 90% and the second preset value is 25%. From the text label-word probability matrix obtained by Labeled-LDA training, the weight of every word in the vocabulary of the text label is calculated; all words whose weight share exceeds 90% are gathered into the first vocabulary of the text label, and all words whose weight share falls below 25% are gathered into the second vocabulary of the text label. It follows that all words in the first vocabulary are keywords highly relevant to the text label, i.e., the feature vocabulary, and all words in the second vocabulary are non-keywords irrelevant to the text label, i.e., the invalid vocabulary.
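The thresholding itself is straightforward once weights are in hand. The text does not fully specify how "weight share" is normalized; the sketch below assumes each word's weight relative to the label's largest weight, with the 90% and 25% thresholds from the embodiment — treat both choices as assumptions:

```python
# Sketch of step S150: splitting a label's vocabulary by weight share.
# Normalizing against the largest weight for the label is an assumption;
# the patent only states "weight share" thresholds of 90% and 25%.
def split_vocabulary(weights, hi=0.90, lo=0.25):
    """weights: {word: weight}. Returns (first_vocab, second_vocab)."""
    top = max(weights.values())
    first = {w for w, v in weights.items() if v / top > hi}   # feature words
    second = {w for w, v in weights.items() if v / top < lo}  # invalid words
    return first, second

weights = {"gold": 1.0, "price": 0.95, "market": 0.5, "the": 0.1}
first, second = split_vocabulary(weights)
print(sorted(first), sorted(second))  # ['gold', 'price'] ['the']
```

Words between the two thresholds ("market" above) belong to neither vocabulary and are simply left alone.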
In the above embodiments, the data-processed training samples are iteratively trained multiple times with the Labeled-LDA algorithm to obtain the probability distribution between the text label of the multiple training samples and the words. Through the words extracted by LDA's multilayer Bayesian structure, keywords highly relevant to the text label can be obtained, and the generated vocabulary of the text label can then be added to the subsequent text classification process. Because the features of text classification are exactly words, the text label-word probability distribution exposes the weight of each word with respect to the text label, which allows a degree of explanation of the black-box model of text classification. Moreover, training the multiple samples with the Labeled-LDA algorithm surfaces latent words that cannot be obtained during neural network training, which helps establish new labeling rules from the obtained latent words and achieves the goal of rapid labeling.
Referring to Fig. 2, which is a schematic flowchart of a text classification method provided by an embodiment of the present invention. The text classification method applies to scenarios that require text classification, such as search engines, personalized recommendation systems, and public opinion monitoring. As shown in Fig. 2, the method may include steps S210 to S260.
S210: Obtain a sample to be classified, the sample including text content and a text label, where the text label of the sample to be classified is identical to the text label in steps S110-S150.
Specifically, the premise for adding the vocabulary of the text label generated in steps S110-S150 to text classification is that the text label of the obtained sample to be classified and the text label corresponding to the generated vocabulary are one and the same text label.
S220: Perform data processing on the sample to be classified.
Specifically, the step of performing data processing on the sample to be classified includes the following step C and step D:
Step C: Perform word segmentation on the sample to be classified using a segmentation algorithm.
Specifically, the segmentation algorithms include string-matching-based segmentation, understanding-based segmentation, and statistics-based segmentation; the string-matching-based segmentation algorithms include forward matching algorithms, reverse matching algorithms, maximum matching algorithms, minimum matching algorithms, simple segmentation algorithms, and integrated algorithms that combine segmentation with tagging.
Step D: Remove the noise words from the sample to be classified after word segmentation.
Specifically, the noise words include stop words and common words. The stop words include English characters, digits, mathematical symbols, punctuation marks, extremely frequent single Chinese characters, and determiners; the common words include proper nouns and everyday words from various technical fields.
The specific implementation and effect of step C and step D can be found in the description of step A and step B in the preceding method embodiment; for brevity, they are not repeated here.
S230: Obtain the first vocabulary and the second vocabulary of the text label generated in steps S110-S150.
Specifically, the first vocabulary of the text label generated in steps S110-S150 gathers all words whose weight share exceeds 90%, and the second vocabulary gathers all words whose weight share falls below 25%; that is, all words in the first vocabulary are keywords highly relevant to the text label, and all words in the second vocabulary are non-keywords irrelevant to the text label.
S240: Match every word in the data-processed sample to be classified one by one against the first vocabulary and the second vocabulary to obtain a matching result for each word.
Specifically, all words in the data-processed sample to be classified are traversed and matched one by one against the first vocabulary and the second vocabulary. For example, in one embodiment, a certain word of the sample to be classified is matched against all words in the first vocabulary to check whether the word exists in the first vocabulary; if it does not, the word is then matched against all words in the second vocabulary to check whether it exists there, thereby obtaining the matching result for the word.
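The matching just described reduces to two set-membership lookups per word; the sketch below shows this under assumed toy vocabularies:

```python
# Sketch of step S240: matching each word of the preprocessed sample
# against the first (feature) and second (invalid) vocabularies.
# Storing the vocabularies as sets makes each lookup O(1).
def match_words(tokens, first_vocab, second_vocab):
    """Return {word: 'feature' | 'invalid' | 'none'} for each token."""
    result = {}
    for t in tokens:
        if t in first_vocab:
            result[t] = "feature"     # found in the first vocabulary
        elif t in second_vocab:
            result[t] = "invalid"     # found in the second vocabulary
        else:
            result[t] = "none"        # in neither vocabulary
    return result

tokens = ["gold", "price", "the", "report"]
print(match_words(tokens, {"gold", "price"}, {"the"}))
# {'gold': 'feature', 'price': 'feature', 'the': 'invalid', 'report': 'none'}
```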
S250: Perform secondary data processing on the sample to be classified according to the matching result.
Specifically, the step of performing secondary data processing on the sample to be classified according to the matching result includes the following step E and step F:
Step E: If the matching result is that an identical word is matched in the first vocabulary, increase the weight of that word in the sample to be classified.
Specifically, if a certain word of the text content of the data-processed sample to be classified finds a matching word in the first vocabulary, then since all words in the first vocabulary are keywords highly relevant to the text label, the word is a feature word of the sample to be classified. To improve the text classification accuracy of the sample, the weight of that word can be increased automatically, i.e., the word's weight in the sample to be classified is raised. In one embodiment, the feature words matched in the first vocabulary are spliced in order onto the end of the original sample's text content to increase their weight.
Step F: If the matching result is that an identical word is matched in the second vocabulary, delete that word from the sample to be classified.
Specifically, if a certain word of the text content of the data-processed sample to be classified finds a matching word in the second vocabulary, then since all words in the second vocabulary are non-keywords highly irrelevant to the text label, the word is an invalid word of the sample to be classified. To improve the text classification accuracy of the sample, the invalid word can be deleted directly from the sample's text content; deleting the many noise words that contribute nothing to the classification training of the sample effectively improves the utilization of text label information. In addition, if a certain word of the text content of the data-processed sample cannot be matched in either the first vocabulary or the second vocabulary, the word is retained and no processing is done to it.
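Steps E and F together can be sketched as one pass over the tokens: invalid words are dropped, feature words are duplicated onto the end of the text (the weight-boosting embodiment above), and unmatched words pass through unchanged. The toy vocabularies are assumptions:

```python
# Sketch of steps E and F of the secondary data processing:
#   step F: delete words matched in the second (invalid) vocabulary;
#   step E: splice words matched in the first (feature) vocabulary onto
#           the end of the text to raise their weight.
def secondary_process(tokens, first_vocab, second_vocab):
    kept = [t for t in tokens if t not in second_vocab]   # step F: delete
    boost = [t for t in tokens if t in first_vocab]       # step E: duplicate
    return kept + boost                                   # append to the end

tokens = ["gold", "rises", "the", "market"]
print(secondary_process(tokens, {"gold"}, {"the"}))
# ['gold', 'rises', 'market', 'gold']
```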
S260: Train on the secondarily processed sample to be classified with a text classification algorithm so as to classify the sample.
Specifically, the text classification algorithms include neural network algorithms and conventional machine learning classification algorithms. The neural network algorithms include convolutional neural network algorithms, recurrent neural network algorithms, and the Fasttext algorithm; the conventional machine learning classification algorithms include generalized linear regression algorithms, tree-based classification algorithms, and support vector machine algorithms.
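The patent trains Fasttext, CNN, and RNN models at this step. As a stdlib-only stand-in to make the training step concrete, the sketch below fits a multinomial naive Bayes classifier on bag-of-words features — a conventional machine learning classifier in the spirit of the list above, not the patent's own algorithm:

```python
# Stand-in for the classification-training step: a multinomial naive
# Bayes classifier over bag-of-words tokens, with add-one smoothing.
# This is NOT the patent's Fasttext/CNN/RNN training, only a minimal
# illustration of training on (secondarily processed) token lists.
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # per-label word counts
        self.label_counts = Counter(labels)
        for tokens, y in zip(docs, labels):
            self.word_counts[y].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        best, best_lp = None, float("-inf")
        n = sum(self.label_counts.values())
        for y, ny in self.label_counts.items():
            lp = math.log(ny / n)  # log prior
            total = sum(self.word_counts[y].values()) + len(self.vocab)
            for t in tokens:       # log likelihood with add-one smoothing
                lp += math.log((self.word_counts[y][t] + 1) / total)
            if lp > best_lp:
                best, best_lp = y, lp
        return best

docs = [["gold", "price", "market"], ["goal", "match", "team"]]
labels = ["finance", "sports"]
clf = NaiveBayes().fit(docs, labels)
print(clf.predict(["market", "price"]))  # finance
```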
Specifically, in one embodiment, 2000 news items are chosen across 10 classes, each class being one text label, and each text label takes 100 items as the multiple training samples. The multiple training samples are put into the Labeled-LDA algorithm for multiple iterations of training to generate the first vocabulary and the second vocabulary of the text label, the first vocabulary being the feature vocabulary and the second vocabulary being the invalid vocabulary. Then 100 items carrying the text label are chosen as samples to be classified, and every word in the data-processed samples is matched one by one against the first vocabulary and the second vocabulary to obtain the matching result for each word, shown in the left frame of Table 1 below, where the font of words matching the first vocabulary is set to italics and the font of words matching the second vocabulary is set to underline. Secondary data processing is then performed on the samples according to the matching result, i.e., the underlined words are deleted and the weight of the italicized words is increased automatically; the result of the secondary data processing is shown in the right frame of Table 1 below:
Table 1
The secondarily processed samples to be classified are put into the text classification algorithm for training so as to classify them. In this embodiment, the Fasttext algorithm is used to train the secondarily processed samples, while the samples without secondary processing are separately put into the convolutional neural network algorithm (Convolutional Neural Network, CNN), the recurrent neural network algorithm (Recurrent Neural Networks, RNN), and the Fasttext algorithm for training. The resulting classification performance is shown in Table 2 below:
            CNN      RNN     Fasttext   LLDA-Fasttext
Precision   0.8866   0.95    0.9633     0.9894
Recall      0.89     0.94    0.9633     0.9894
Table 2
Here Precision is the accuracy rate and Recall is the recall rate. As Table 2 shows, using the first and second vocabularies of the text label generated by Labeled-LDA training to perform secondary data processing on the samples to be classified, and then feeding the secondarily processed samples into the Fasttext algorithm for training, yields a text classification accuracy far higher than that of the CNN and RNN algorithms, and two percentage points higher than training directly with the Fasttext algorithm, which intuitively illustrates the validity and significance of this scheme.
In the above embodiments, before the samples to be classified undergo text classification training, the first and second vocabularies of the text label generated by Labeled-LDA training are added. The secondary data processing consists mainly of matching every word of the sample to be classified one by one against the first and second vocabularies to obtain each word's matching result, then actively adding feature words and rejecting invalid words according to that result. For samples with little text content, actively adding feature words helps raise the classification accuracy; for samples whose text content contains many noise words, actively rejecting invalid words effectively eliminates the influence of noise words on the classification algorithm. The text classification method proposed by the embodiments of the present invention, being based on the feature vocabulary and invalid vocabulary generated by the Labeled-LDA algorithm, can be widely applied to various text classification tasks; the data preprocessing of samples to be classified is simple and fast, the overall running time and performance are outstanding, and the accuracy of text classification is effectively improved.
Referring to Fig. 3, which is a schematic block diagram of a device 300 provided by an embodiment of the present invention. As shown in Fig. 3, the device 300 corresponds to the vocabulary generation method shown in Fig. 1 and includes units for executing that method; it can be configured in terminals such as desktop computers, tablet computers and laptop computers. Specifically, referring to Fig. 3, the device 300 includes a first acquisition unit 301, a first data processing unit 302, a second acquisition unit 303, a training unit 304 and a computing unit 305.
The first acquisition unit 301 is used to obtain multiple training samples, each of which includes text content and a text label, wherein the multiple training samples share one and the same text label.
The first data processing unit 302 is used to perform data processing on the multiple training samples. Specifically, the first data processing unit 302 includes a first word-segmentation unit 3021 and a first clearing unit 3022.
The first word-segmentation unit 3021 is used to perform word segmentation on the multiple training samples using a segmentation algorithm.
Specifically, the segmentation algorithms include string-matching-based segmentation algorithms, understanding-based segmentation algorithms and statistics-based segmentation algorithms; the string-matching-based segmentation algorithms include the forward matching algorithm, the reverse matching algorithm, the maximum matching algorithm, the minimum matching algorithm, the simple segmentation algorithm and the integrated algorithm combining segmentation with tagging.
The first clearing unit 3022 is used to remove the noise words of the multiple training samples after word segmentation. Specifically, the noise words include stop words and common words; the stop words include English characters, digits, mathematical characters, punctuation marks, extremely frequent single Chinese characters and determiners, and the common words include proper nouns and the everyday words of the various technical fields.
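This data-processing step (segment, then strip noise words) can be illustrated as follows. A production pipeline for Chinese text would use a real segmenter such as jieba; the regex tokenizer and the tiny word lists below are assumptions for demonstration only.

```python
import re

STOP_WORDS = {"the", "a", "of", "is", "and"}     # assumed stop words
COMMON_WORDS = {"company", "system"}             # assumed everyday domain words

def clean(text):
    # crude tokenizer standing in for a proper segmentation algorithm
    tokens = re.findall(r"[a-z]+", text.lower())
    # remove noise words: stop words plus common words
    return [t for t in tokens if t not in STOP_WORDS | COMMON_WORDS]

print(clean("The credit system of a company is fast!"))
# -> ['credit', 'fast']
```

The same cleaning function is applied both to the training samples here and to the samples to be classified later, so that the vocabularies and the samples are matched in the same token space.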
The second acquisition unit 303 is used to obtain the preset number of iterations of the Labeled-LDA algorithm. Specifically, in one embodiment, the preset number of iterations may be 500.
The training unit 304 is used to iteratively train the data-processed multiple training samples with the Labeled-LDA algorithm according to the obtained preset number of iterations, so as to generate the word table of the text label.
Specifically, the Labeled-LDA algorithm adds a label layer on top of the LDA algorithm and belongs to the semi-supervised models. The LDA model is also called a three-layer Bayesian probability generative model, with a word-topic-document three-layer structure in which the document-to-topic distribution follows a latent Dirichlet distribution and the topic-to-word distribution is multinomial.
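For context, the standard LDA generative process behind this three-layer structure can be written out explicitly; the notation below is conventional and is not taken from the patent itself. With a Dirichlet prior with parameter α over topic mixtures and topic-word distributions φ_k, for each document d and word position n:

```latex
\theta_d \sim \operatorname{Dirichlet}(\alpha), \qquad
z_{d,n} \mid \theta_d \sim \operatorname{Multinomial}(\theta_d), \qquad
w_{d,n} \mid z_{d,n} \sim \operatorname{Multinomial}(\varphi_{z_{d,n}})
```

Labeled-LDA adds the label layer by restricting each document's admissible topics z_{d,n} to that document's observed labels, so that each text label accumulates its own topic-word distribution, which is the word table generated by the training unit 304.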
The computing unit 305 is used to calculate the weights of all words in the word table of the text label, to aggregate all words whose weight proportion is greater than a first preset value into the first vocabulary of the text label, and to aggregate all words whose weight proportion is less than a second preset value into the second vocabulary of the text label.
Specifically, in one embodiment the first preset value is 90% and the second preset value is 25%: from the text label-word probability matrix obtained by Labeled-LDA training, the weights of all words in the word table of the text label are calculated; all words whose weight proportion is greater than 90% are aggregated into the first vocabulary of the text label, and all words whose weight proportion is less than 25% are aggregated into the second vocabulary of the text label.
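The patent does not pin down exactly how the "weight proportion" is derived from the label-word probability matrix. The sketch below reads it as a cumulative share of a label's total word weight, which is one plausible interpretation; the function name, thresholds semantics and data are illustrative assumptions only.

```python
def split_vocabulary(word_weights, hi=0.90, lo=0.25):
    """Split one label's word table into a first and second vocabulary.

    word_weights -- {word: weight}, one row of the label-word matrix
    hi           -- words inside the top `hi` cumulative share of total
                    weight form the first (feature-word) vocabulary
    lo           -- words falling in the bottom `lo` share form the
                    second (invalid-word) vocabulary
    """
    total = sum(word_weights.values())
    ranked = sorted(word_weights.items(), key=lambda kv: kv[1], reverse=True)
    first, second, cumulative = [], [], 0.0
    for word, weight in ranked:
        cumulative += weight / total
        if cumulative <= hi:
            first.append(word)
        elif cumulative > 1 - lo:
            second.append(word)
    return first, second

print(split_vocabulary({"loan": 50, "credit": 30, "rate": 15, "misc": 5}))
# -> (['loan', 'credit'], ['rate', 'misc'])
```

Under this reading, a word can fall in neither vocabulary (between the two thresholds), which matches the patent's use of strict "greater than"/"less than" comparisons.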
It should be noted that, as is clear to those skilled in the art, for the specific implementation and effects of the above device 300 and its units, reference may be made to the corresponding descriptions in the foregoing method embodiments; for convenience and brevity of description, they are not repeated here.
Referring to Fig. 4, which is a schematic block diagram of another device 400 provided by an embodiment of the present invention. As shown in Fig. 4, the device 400 corresponds to the text classification method shown in Fig. 2 and includes units for executing that method; it can be configured in terminals such as desktop computers, tablet computers and laptop computers. Specifically, referring to Fig. 4, the device 400 includes a third acquisition unit 401, a second data processing unit 402, a fourth acquisition unit 403, a matching unit 404, a third data processing unit 405 and a text training unit 406.
The third acquisition unit 401 is used to obtain a sample to be classified, which includes text content and a text label, wherein the text label of the sample to be classified is identical to the text label in steps S110-S150. Specifically, the text label of the sample to be classified obtained by the third acquisition unit 401 and the text label corresponding to the generated word table are one and the same text label.
The second data processing unit 402 is used to perform data processing on the sample to be classified. Specifically, the second data processing unit 402 includes a second word-segmentation unit 4021 and a second clearing unit 4022.
The second word-segmentation unit 4021 is used to perform word segmentation on the sample to be classified using a segmentation algorithm.
The second clearing unit 4022 is used to remove the noise words of the sample to be classified after word segmentation.
The fourth acquisition unit 403 is used to obtain the first vocabulary and second vocabulary of the text label generated in steps S110-S150.
The matching unit 404 is used to match every word in the data-processed sample to be classified against the first vocabulary and the second vocabulary one by one to obtain the matching result of each word.
The third data processing unit 405 is used to perform secondary data processing on the sample to be classified according to the matching result; it includes a weight-increasing unit 4051 and a deletion unit 4052.
The weight-increasing unit 4051 is used to increase the weight of a word in the sample to be classified if the matching result shows a match with an identical word in the first vocabulary.
The deletion unit 4052 is used to delete a word from the sample to be classified if the matching result shows a match with an identical word in the second vocabulary.
The text training unit 406 is used to train the secondarily processed sample to be classified with a text classification algorithm so as to perform text classification on the sample.
Specifically, the text classification algorithms include neural-network algorithms and conventional machine-learning classification algorithms, wherein the neural-network algorithms include the convolutional neural network algorithm, the recurrent neural network algorithm and the FastText algorithm, and the conventional machine-learning classification algorithms include generalized linear regression algorithms, tree-based classification algorithms and the support vector machine algorithm.
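The classifier itself is interchangeable; the patent's experiments use FastText, whose Python binding is an external dependency. As a self-contained stand-in, the toy bag-of-words scorer below shows how feature words duplicated by the secondary data processing tilt a simple count-based prediction. All names, the training data and the scoring rule are illustrative assumptions, not the patent's classifier.

```python
from collections import Counter

def train(samples):
    """samples: list of (tokens, label) pairs -> per-label word counts."""
    model = {}
    for tokens, label in samples:
        model.setdefault(label, Counter()).update(tokens)
    return model

def predict(model, tokens):
    # score each label by summed token counts; feature words that the
    # secondary processing duplicated naturally count several times
    return max(model, key=lambda label: sum(model[label][t] for t in tokens))

model = train([
    (["loan", "loan", "credit", "rate"], "finance"),
    (["goal", "match", "team"], "sports"),
])
print(predict(model, ["credit", "loan", "loan"]))
# -> finance
```

Any of the algorithms listed above could replace this scorer; the vocabulary-based preprocessing is what the patent claims, not a particular classifier.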
It should be noted that, as is clear to those skilled in the art, for the specific implementation and effects of the above device 400 and its units, reference may be made to the corresponding descriptions in the foregoing method embodiments; for convenience and brevity of description, they are not repeated here.
The above devices may be implemented in the form of a computer program that can run on a computer device as shown in Fig. 5.
Referring to Fig. 5, which is a schematic block diagram of a computer device provided by an embodiment of the present invention. The computer device 600 may be a terminal or a server, where a terminal may be an electronic device such as a smartphone, tablet computer, laptop computer, desktop computer or personal digital assistant, and a server may be a standalone server or a server cluster composed of multiple servers.
Referring to Fig. 5, the computer device 600 includes a processor 602, a memory and a network interface 605 connected through a system bus 601, wherein the memory may include a non-volatile storage medium 603 and an internal memory 604.
The non-volatile storage medium 603 can store an operating system 6031 and a computer program 6032. The computer program 6032 includes program instructions which, when executed, may cause the processor 602 to execute a vocabulary generation method and a text classification method.
The processor 602 is used to provide computing and control capabilities to support the operation of the entire computer device 600.
The internal memory 604 provides an environment for running the computer program 6032 stored in the non-volatile storage medium 603; when the computer program 6032 is executed by the processor 602, the processor 602 may be caused to execute a vocabulary generation method and a text classification method.
The network interface 605 is used for network communication with other devices. Those skilled in the art will understand that the structure shown in Fig. 5 is only a block diagram of the part of the structure relevant to the present solution and does not limit the computer device 600 to which the solution is applied; a specific computer device 600 may include more or fewer components than shown, combine certain components, or have a different component layout.
The processor 602 is used to run the computer program 6032 stored in the memory so as to implement the following steps:
In one embodiment, when realizing the vocabulary generation method, the processor 602 specifically implements the following steps: obtaining multiple training samples, each including text content and a text label, wherein the multiple training samples share one and the same text label; performing data processing on the multiple training samples; obtaining a preset number of iterations of the Labeled-LDA algorithm; iteratively training the data-processed multiple training samples with the Labeled-LDA algorithm according to the obtained preset number of iterations to generate the word table of the text label; and calculating the weights of all words in the word table of the text label, aggregating all words whose weight proportion is greater than a first preset value into the first vocabulary of the text label, and aggregating all words whose weight proportion is less than a second preset value into the second vocabulary of the text label.
In one embodiment, when realizing the step of performing data processing on the multiple training samples, the processor 602 specifically implements the following steps: performing word segmentation on the multiple training samples using a segmentation algorithm; and removing the noise words of the multiple training samples after word segmentation.
In one embodiment, when realizing the text classification method, the processor 602 specifically implements the following steps: obtaining a sample to be classified that includes text content and a text label, wherein the text label of the sample to be classified is identical to the text label of the multiple training samples in the vocabulary generation method; performing data processing on the sample to be classified; obtaining the first vocabulary and second vocabulary of the text label generated by the vocabulary generation method; matching every word in the data-processed sample to be classified against the first vocabulary and the second vocabulary one by one to obtain the matching result of each word; performing secondary data processing on the sample to be classified according to the matching result; and training the secondarily processed sample to be classified with a text classification algorithm so as to perform text classification on the sample.
In one embodiment, when realizing the step of performing secondary data processing on the sample to be classified according to the matching result, the processor 602 specifically implements the following steps: if the matching result is a match with an identical word in the first vocabulary, increasing the weight of that word in the sample to be classified; and if the matching result is a match with an identical word in the second vocabulary, deleting that word from the sample to be classified.
It should be understood that in the embodiments of the present invention the processor 602 may be a central processing unit (Central Processing Unit, CPU), and may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, etc.
Those of ordinary skill in the art will appreciate that all or part of the flow in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program. The computer program includes program instructions and can be stored in a storage medium. The program instructions are executed by at least one processor in the computer system to realize the flow steps of the above method embodiments.
Therefore, the present invention also provides a storage medium. The storage medium is a computer-readable storage medium that stores a computer program whose program instructions, when executed by a processor, cause the processor to execute the vocabulary generation method and the text classification method described above.
The storage medium may be any of various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk or an optical disc.
Those of ordinary skill in the art may realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. A skilled professional may use different methods to realize the described functions for each specific application, but such realization should not be considered beyond the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed device and method may be realized in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other division manners are possible in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
The steps in the embodiments of the present invention may be reordered, merged and deleted according to actual needs, and the units in the devices of the embodiments may be combined, divided and deleted according to actual needs. In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may exist separately and physically, or two or more units may be integrated into one unit.
If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, etc.) to execute all or part of the steps of the methods of the embodiments of the present invention.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the technical field can easily conceive of various equivalent modifications or replacements within the technical scope disclosed by the present invention, and these modifications or replacements shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A vocabulary generation method, characterized by comprising:
obtaining multiple training samples, each training sample including text content and a text label, wherein the multiple training samples share one and the same text label;
performing data processing on the multiple training samples;
obtaining a preset number of iterations of the Labeled-LDA algorithm;
iteratively training the data-processed multiple training samples with the Labeled-LDA algorithm according to the obtained preset number of iterations to generate a word table of the text label; and
calculating the weights of all words in the word table of the text label, aggregating all words whose weight proportion is greater than a first preset value into a first vocabulary of the text label, and aggregating all words whose weight proportion is less than a second preset value into a second vocabulary of the text label.
2. The vocabulary generation method according to claim 1, characterized in that performing data processing on the multiple training samples comprises:
performing word segmentation on the multiple training samples using a segmentation algorithm; and
removing the noise words of the multiple training samples after word segmentation.
3. The vocabulary generation method according to claim 2, characterized in that the segmentation algorithms include string-matching-based segmentation algorithms, understanding-based segmentation algorithms and statistics-based segmentation algorithms, wherein the string-matching-based segmentation algorithms include the forward matching algorithm, the reverse matching algorithm, the maximum matching algorithm, the minimum matching algorithm, the simple segmentation algorithm and the integrated algorithm combining segmentation with tagging.
4. The vocabulary generation method according to claim 1, characterized in that the first preset value is 90% and the second preset value is 25%.
5. A text classification method, characterized by comprising:
obtaining a sample to be classified, the sample to be classified including text content and a text label, wherein the text label of the sample to be classified is identical to the text label in any one of claims 1-4;
performing data processing on the sample to be classified;
obtaining the first vocabulary and the second vocabulary of the text label generated in any one of claims 1-4;
matching every word in the data-processed sample to be classified against the first vocabulary and the second vocabulary one by one to obtain the matching result of each word;
performing secondary data processing on the sample to be classified according to the matching result; and
training the secondarily processed sample to be classified with a text classification algorithm so as to perform text classification on the sample.
6. The text classification method according to claim 5, characterized in that performing secondary data processing on the sample to be classified according to the matching result comprises:
if the matching result is a match with an identical word in the first vocabulary, increasing the weight of that word in the sample to be classified; and
if the matching result is a match with an identical word in the second vocabulary, deleting that word from the sample to be classified.
7. The text classification method according to claim 5, characterized in that the text classification algorithms include neural-network algorithms and conventional machine-learning classification algorithms, wherein the neural-network algorithms include the convolutional neural network algorithm, the recurrent neural network algorithm and the FastText algorithm, and the conventional machine-learning classification algorithms include generalized linear regression algorithms, tree-based classification algorithms and the support vector machine algorithm.
8. A device, characterized by comprising units for executing the method according to any one of claims 1-7.
9. A computer device, characterized in that the computer device includes a memory and a processor, the memory stores a computer program, and the processor, when executing the computer program, realizes the method according to any one of claims 1-7.
10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, can realize the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811080887.4A CN109325122A (en) | 2018-09-17 | 2018-09-17 | Vocabulary generation method, file classification method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109325122A true CN109325122A (en) | 2019-02-12 |
Family
ID=65265457
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811080887.4A Withdrawn CN109325122A (en) | 2018-09-17 | 2018-09-17 | Vocabulary generation method, file classification method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109325122A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110795558A (en) * | 2019-09-03 | 2020-02-14 | 腾讯科技(深圳)有限公司 | Label acquisition method and device, storage medium and electronic device |
CN111191011A (en) * | 2020-04-17 | 2020-05-22 | 郑州工程技术学院 | Search matching method, device and equipment for text label and storage medium |
CN111797234A (en) * | 2020-06-16 | 2020-10-20 | 北京北大软件工程股份有限公司 | Method and system for multi-label distributed learning in natural language processing classification model |
CN111930944A (en) * | 2020-08-12 | 2020-11-13 | 中国银行股份有限公司 | File label classification method and device |
CN111966830A (en) * | 2020-06-30 | 2020-11-20 | 北京来也网络科技有限公司 | Text classification method, device, equipment and medium combining RPA and AI |
CN113255337A (en) * | 2021-05-21 | 2021-08-13 | 广州欢聚时代信息科技有限公司 | Word list construction method, machine translation method, device, equipment and medium thereof |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20190212 |