CN109543032A - Text classification method, apparatus, computer device and storage medium - Google Patents
Text classification method, apparatus, computer device and storage medium
- Publication number
- CN109543032A (application CN201811258359.3A)
- Authority
- CN
- China
- Prior art keywords
- text
- feature
- classifier
- sorted
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The present application relates to a text classification method based on a classification model, an apparatus, a computer device and a storage medium. The method includes: selecting a text feature combination from a preset text feature library; extracting, from the text to be classified, the fusion feature corresponding to the text feature combination; selecting, according to the text feature combination, multiple pre-trained classifiers from a preset classifier library; obtaining an integrated classifier from the selected classifiers; inputting the fusion feature into the integrated classifier to obtain the probabilities of multiple preset labels output by the integrated classifier, each preset label corresponding to a text type; and determining the text type of the text to be classified according to the preset label with the highest probability. The method can improve the accuracy of text classification.
Description
Technical field
The present application relates to the field of computer technology, and in particular to a text classification method, apparatus, computer device and storage medium.
Background art
Text classification refers to the technique of assigning natural-language text to a specified category, and it is widely used in the field of Internet technology. When news is pushed, news texts can be screened by text classification. Specifically, before a news text is pushed to a designated platform, it needs to be obtained from each news source and then published on the designated platform for platform visitors to read. To guarantee the quality of the news texts published on the platform, the news texts need to be audited. Taking a government finance platform as an example, the news to be published must be finance news; after news texts are obtained from each news source, their content must be audited. The audit mainly covers whether the content is credible, whether it contains advertisements, whether the main content relates to finance, and whether the article concerns a topic of social interest, so as to decide whether to publish the news text on the platform. To guarantee the efficiency of news pushing, existing algorithm models can be used to classify the news texts; however, the accuracy of classification with existing algorithm models can hardly meet the requirements of news pushing.
Summary of the invention
On this basis, in view of the above technical problems, it is necessary to provide a text classification method, apparatus, computer device and storage medium that can solve the problem of low classification accuracy when news texts are pushed.
A text classification method, the method comprising:
selecting a text feature combination from a preset text feature library, and extracting, from the text to be classified, the fusion feature corresponding to the text feature combination;
selecting, according to the text feature combination, multiple pre-trained classifiers from a preset classifier library;
obtaining an integrated classifier according to the selected classifiers;
inputting the fusion feature into the integrated classifier to obtain the probabilities of multiple preset labels, each preset label corresponding to a text type; and
determining the text type of the text to be classified according to the preset label with the highest probability.
In one of the embodiments, the step of training a classifier comprises: selecting labeled texts from a preset corpus; training the classifier according to the target labels of the labeled texts and a preset termination condition; and obtaining the trained classifier when the probabilities the classifier outputs for the target labels all satisfy the termination condition.
In one of the embodiments, the method further includes: extracting the multiple text feature combinations corresponding to the labeled texts; inputting each text feature combination in turn into each trained classifier in the classifier library; ranking the probabilities each trained classifier outputs for the target labels, filtering out the classifiers that satisfy a preset condition, and establishing the corresponding relationship between the text feature combination and the multiple classifiers. Selecting multiple pre-trained classifiers from the preset classifier library according to the text feature combination then comprises: querying the corresponding relationship with the text feature combination, and selecting the multiple pre-trained classifiers from the preset classifier library.
In one of the embodiments, the text feature library includes: a text length feature, a keyword word-frequency feature, a word vector similarity feature, a TF-IDF weight feature, a probability distribution feature of an LDA model, and an information source feature. The method further includes: selecting two or more of these features from the text feature library to obtain the text feature combination; extracting each text feature in the text feature combination from the text to be classified; and combining the extracted text features to obtain the fusion feature.
In one of the embodiments, the text to be classified includes a title text and a body text, and the method further includes: obtaining the title text length and the body text length of the text to be classified, obtaining a title length vector and a body length vector from them respectively, and splicing the two vectors to obtain the text length feature of the text to be classified. Or: obtaining a preset keyword list, matching the title text and the body text against the keyword list to obtain the frequencies of the listed keywords in the text to be classified, and vectorizing the frequencies to obtain the keyword word-frequency feature. Or: obtaining the title feature vector of the title text and the body feature vector of the body text, and splicing them to obtain the word vector similarity feature. Or: obtaining the TF-IDF weight of each keyword of the text to be classified in a preset corpus, taking the mean of these TF-IDF weights as the average TF-IDF weight of the text, and vectorizing the average TF-IDF weight to obtain the TF-IDF weight feature of the text to be classified. Or: inputting the text to be classified into a preset LDA model to obtain the probability distribution of the text over the preset topics, and vectorizing the probability distribution to obtain the probability distribution feature of the LDA model. Or: obtaining the information source of the text to be classified, obtaining the source number of the information source, and vectorizing the source number according to a preset coding rule to obtain the information source feature.
In one of the embodiments, the method further includes: calculating the weight of each selected classifier according to a preset weighting algorithm, and weighting the classifiers according to these weights to obtain the integrated classifier.
In one of the embodiments, the method further includes: segmenting the title text and the body text respectively to obtain a first feature word set of the title text and a second feature word set of the body text; obtaining, according to a preset positive/negative keyword library and a preset word vector tool, the first word vector of each feature word in the first feature word set and the second word vector of each feature word in the second feature word set; and averaging the first word vectors to obtain the title feature vector, and averaging the second word vectors to obtain the body feature vector.
A text classification apparatus, the apparatus comprising:
a feature fusion module, configured to select a text feature combination from a preset text feature library and extract, from the text to be classified, the fusion feature corresponding to the text feature combination;
a classifier selection module, configured to select, according to the text feature combination, multiple pre-trained classifiers from a preset classifier library;
a classifier fusion module, configured to obtain an integrated classifier according to the selected classifiers;
an output module, configured to input the fusion feature into the integrated classifier to obtain the probabilities of multiple preset labels, each preset label corresponding to a text type; and
a classification module, configured to determine the text type of the text to be classified according to the preset label with the highest probability.
A computer device, including a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the following steps:
selecting a text feature combination from a preset text feature library, and extracting, from the text to be classified, the fusion feature corresponding to the text feature combination;
selecting, according to the text feature combination, multiple pre-trained classifiers from a preset classifier library;
obtaining an integrated classifier according to the selected classifiers;
inputting the fusion feature into the integrated classifier to obtain the probabilities of multiple preset labels, each preset label corresponding to a text type; and
determining the text type of the text to be classified according to the preset label with the highest probability.
A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the following steps:
selecting a text feature combination from a preset text feature library, and extracting, from the text to be classified, the fusion feature corresponding to the text feature combination;
selecting, according to the text feature combination, multiple pre-trained classifiers from a preset classifier library;
obtaining an integrated classifier according to the selected classifiers;
inputting the fusion feature into the integrated classifier to obtain the probabilities of multiple preset labels, each preset label corresponding to a text type; and
determining the text type of the text to be classified according to the preset label with the highest probability.
With the above text classification method, apparatus, computer device and storage medium, building a text feature library makes it possible to adaptively select different text feature combinations for different categories of text to be classified, which improves the accuracy of feature selection. In addition, the text feature combination, taken as the feature of the text to be classified, is input into the preset classifier library, which can correspondingly select a combination of classifiers to perform classification prediction on the text feature combination, guaranteeing that the optimal classifiers are selected. The whole process requires no manual operation and still yields accurate classification prediction for the text.
Brief description of the drawings
Fig. 1 is an application scenario diagram of the text classification method in one embodiment;
Fig. 2 is a schematic flowchart of the text classification method in one embodiment;
Fig. 3 is a schematic flowchart of the fusion feature extraction step in one embodiment;
Fig. 4 is a schematic flowchart of the text classification method in another embodiment;
Fig. 5 is a schematic flowchart of the text classification method in yet another embodiment;
Fig. 6 is a structural block diagram of the text classification apparatus in one embodiment;
Fig. 7 is an internal structure diagram of the computer device in one embodiment.
Specific embodiment
To make the objects, technical solutions and advantages of the present application clearer, the application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the application, not to limit it.
The text classification method provided by the present application can be applied in the application environment shown in Fig. 1, in which a terminal 102 communicates with a server 104 through a network. The terminal 102 can be, but is not limited to, any of various personal computers or laptops, and the server 104 can be implemented as an independent server or as a server cluster composed of multiple servers. The terminal 102 can obtain the text to be classified from the server 104 through an HTTP request; the text to be classified can be a microblog passage, a public-account article, a blog, news-platform channel information, and so on. After obtaining the texts to be classified, the terminal 102 can store each of them in its database.
Further, before a text to be classified in the terminal 102 is pushed to the platform for publication, it needs to be classified, and only the texts to be classified that satisfy the preset regulatory requirements are sent to the platform, thereby completing the supervision of the platform content.
Specifically, when performing text classification, the terminal 102 extracts the fusion feature of the text to be classified, then selects the corresponding classifiers according to the fusion feature and fuses them to obtain an integrated classifier, and then inputs the fusion feature into the integrated classifier. Since the classifiers in the integrated classifier are trained according to the regulatory requirements of the platform, the integrated classifier can output, for the fusion feature, the probability of each preset label, and each preset label corresponds to a text type, so the text type of the text to be classified can be determined from the probabilities of the preset labels. The terminal 102 can therefore push the texts whose types satisfy the regulatory requirements to the platform for publication, completing the supervision of the platform content.
In one embodiment, as shown in Fig. 2, a text classification method is provided. Taking its application to the terminal in Fig. 1 as an example, the method includes the following steps:
Step 202: select a text feature combination from the preset text feature library, and extract from the text to be classified the fusion feature corresponding to the text feature combination.
The text feature library contains multiple pre-built text features. When a text to be classified is input and the terminal makes its decision, it selects the pre-built text features in the corresponding text feature library and can then output the text features of the text to be classified. The text features can therefore be selected by terminal decision; for example, for a text to be classified that is a news headline, the decision preferably selects text features such as the text length feature, the keyword word-frequency feature and the word vector similarity feature. In this way, the accuracy of classifier prediction can be further improved.
Further, the text feature library can be trained with a preset feature decision model. Specifically, during classification the terminal feeds its input to the feature decision model, which outputs several text feature combinations. The training logic of the feature decision model can depend on the category of the text to be classified, such as news, narrative or commentary, selecting suitable text features to ensure classification accuracy. The terminal can identify the type of the text to be classified and automatically output the text feature combination accordingly; viewed as a whole, the scheme of this embodiment stacks the model in two layers, which improves the prediction efficiency of the model.
Specifically, once each text feature in the text feature combination has been extracted from the text to be classified, the multiple text features can be fused into the fusion feature by means of feature fusion.
Step 204: select, according to the text feature combination, multiple pre-trained classifiers from the preset classifier library.
The classifier library contains classifiers of multiple different types. According to the preset regulatory requirements, text types are set for the different requirements, and different preset labels correspond to the different text types; after the classifiers in the library are trained, they can classify the input text to be classified. Since each classifier in the library performs differently on different text features, multiple classifiers can be chosen to classify the input fusion feature, which improves the accuracy of the classification.
Further, the corresponding relationship between the text feature combinations in the fusion feature and the classifiers in the classifier library is pre-established in the terminal; by recognizing a text feature combination, the corresponding classifiers can be selected from the classifier library automatically.
It is worth noting that the classifier library and the text feature library are tools stored in the terminal in advance, and the terminal can invoke them according to the corresponding logic.
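The pre-established "corresponding relationship" can be pictured as a simple lookup table from a feature combination to classifier names. The concrete keys and classifier names below are hypothetical, since the patent only states that the mapping exists:

```python
# Hypothetical pre-established mapping; keys are stored sorted so lookup
# is insensitive to the order features appear in the combination.
feature_to_classifiers = {
    ("lda", "length", "tfidf"): ["random_forest", "logistic_regression"],
    ("keyword_freq", "word2vec"): ["gbdt", "fully_connected_network"],
}

def lookup_classifiers(feature_combination):
    """Return the pre-trained classifiers registered for a combination."""
    key = tuple(sorted(feature_combination))
    return feature_to_classifiers.get(key, [])

selected = lookup_classifiers(["tfidf", "length", "lda"])
```

An unknown combination simply yields an empty list here; the patent leaves that fallback behavior unspecified.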
Step 206: obtain an integrated classifier according to the selected classifiers.
The integrated classifier can be obtained by structural fusion of the classifiers, in which the outputs of the individual classifiers are merged. Another way is to leave the classifiers unchanged: the terminal collects the output of each classifier and then computes the final result itself, thereby obtaining the integrated classifier.
Step 208: input the fusion feature into the integrated classifier to obtain the probabilities of the multiple preset labels output by the integrated classifier.
During classifier training, each preset label corresponds to a text type; for example, a violating text corresponds to one preset label, and when the classifier outputs a probability of 20% for that preset label, the probability that the text to be classified is a violating text is 20%.
Specifically, the classifier output can be produced through softmax, so the probability of each preset label is available, which facilitates accurate classification of the text.
Step 210: determine the text type of the text to be classified according to the preset label with the highest probability.
Once the probability of each preset label has been obtained, the label with the highest probability can be determined by sorting, and the text type of the text to be classified is then determined from that preset label.
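Steps 208 and 210 together amount to a softmax over the ensemble's scores followed by an argmax over the resulting label probabilities. A self-contained sketch (the label names and scores are hypothetical):

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_type(logits, labels):
    """Return the label with the highest probability, and all probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best], probs

labels = ["finance", "advertisement", "violation"]  # hypothetical preset labels
label, probs = predict_type([2.0, 0.5, 0.1], labels)
```

The returned `label` plays the role of the preset label with the highest probability in step 210.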
In the above text classification method, building a text feature library makes it possible to adaptively select different text feature combinations for different categories of text to be classified, which improves the accuracy of feature selection. In addition, the text feature combination, taken as the feature of the text to be classified, is input into the preset classifier library, which can correspondingly select a combination of classifiers to perform classification prediction on the text feature combination, guaranteeing that the optimal classifiers are selected. The whole process requires no human intervention and still yields accurate classification prediction for the text.
In one embodiment, as shown in Fig. 3, a schematic flowchart of the fusion feature extraction step is provided, where the text feature library includes the text length feature, the keyword word-frequency feature, the word vector similarity feature, the TF-IDF weight feature, the probability distribution feature of the LDA model, and the information source feature. The specific steps are as follows:
Step 302, select text size feature, keyword words-frequency feature, term vector similarity special from text feature library
Two or more in sign, TF-IDF weight feature, the Probability Characteristics of LDA model and informed source feature, obtains text
Feature combination.
Step 304: extract each text feature in the text feature combination from the text to be classified.
Step 306: combine the extracted text features to obtain the fusion feature.
In this embodiment, by providing multiple text features, features can be extracted accurately for all kinds of texts to be classified, which improves the accuracy of text classification.
For the text to be classified mentioned in Fig. 3, in one embodiment the text includes a title text and a body text. Therefore, the title text length and the body text length of the text to be classified can be obtained; a title length vector and a body length vector are obtained from them respectively and spliced to obtain the text length feature of the text to be classified. By obtaining a preset keyword list and matching the title text and the body text against it, the frequencies of the listed keywords in the text to be classified are obtained and vectorized into the keyword word-frequency feature. By obtaining the title feature vector of the title text and the body feature vector of the body text and splicing them, the word vector similarity feature is obtained. Or, by obtaining the TF-IDF weight of each keyword of the text to be classified in the preset corpus and taking the mean of these weights, the average TF-IDF weight of the text is obtained and vectorized into the TF-IDF weight feature of the text to be classified. Or, by inputting the text to be classified into the preset LDA model, the probability distribution of the text over the preset topics is obtained and vectorized into the probability distribution feature of the LDA model. Or, by obtaining the information source of the text to be classified and its source number, the source number is vectorized according to the preset coding rule into the information source feature.
In this embodiment, since the text feature combination includes at least two of the above text features, when the text to be classified is obtained, its title text and body text must first be parsed out, and feature extraction is then performed with each text feature tool.
In one embodiment, the step of training a classifier includes: selecting labeled texts from the preset corpus, and training the classifier according to the target labels of the labeled texts and the preset termination condition; when the probabilities the classifier outputs for the target labels satisfy the termination condition, the trained classifier is obtained.
In another embodiment, the classifier library includes: a decision tree, a random forest, extra trees, gradient boosted trees, logistic regression, a fully-connected network and an adaptive connection tree; the classifier library is obtained by training these classifiers.
In another embodiment, the multiple text feature combinations corresponding to the labeled texts are extracted; each text feature combination is input in turn into each trained classifier in the classifier library; the probabilities each classifier outputs for the target labels are ranked, the classifiers satisfying the preset condition are filtered out, and the corresponding relationship between the text feature combination and the multiple classifiers is established. The step of selecting multiple pre-trained classifiers from the preset classifier library according to the text feature combination then includes: querying the corresponding relationship with the text feature combination and selecting the multiple pre-trained classifiers from the preset classifier library.
Combining the above embodiments, in another embodiment, as shown in Fig. 4, the fusion feature is formed by fusing the text length feature, the word vector similarity feature and the probability distribution feature of the LDA model, and the integrated classifier is formed by fusing a decision tree, a random forest and logistic regression; Fig. 4 clearly shows the classification flow of this embodiment of the invention.
In one embodiment, the step of obtaining the integrated classifier can be: calculating the weight of each of the multiple classifiers according to the preset weighting algorithm, and weighting the classifiers according to these weights to obtain the integrated classifier.
Specifically, the workflow of the weighting algorithm is as follows: the fusion feature of a labeled text is extracted, an initial weight is assigned to each classifier, and the fusion feature is input into each classifier; the final probability of the preset label is calculated according to the initial weights and compared with the target label. If the difference is greater than a preset value, the initial weights are adjusted until the difference is less than the preset value, so that the weight of each classifier is obtained; the classifiers are then weighted with these weights to obtain the integrated classifier.
It is worth noting that the weights differ when different combinations of classifiers are fused; therefore, in the training stage, the fusion weights need to be calculated separately for every classifier combination.
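The adjust-until-the-difference-is-small loop can be sketched as below for a single preset label. The update rule (shrink the weights of classifiers on the wrong side of the target) is a toy assumption; the patent only specifies comparing the combined probability with the target and adjusting until the difference falls under a preset value:

```python
def weighted_ensemble(probabilities, weights):
    """Weighted average of per-classifier probabilities for one preset label."""
    total = sum(weights)
    return sum(p * w for p, w in zip(probabilities, weights)) / total

def fit_weights(probabilities, target, step=0.1, tol=0.05, max_iter=1000):
    """Adjust weights until the combined probability is within `tol` of
    the target value — a toy version of the patent's adjustment loop."""
    weights = [1.0] * len(probabilities)  # initial weights
    for _ in range(max_iter):
        diff = weighted_ensemble(probabilities, weights) - target
        if abs(diff) <= tol:
            break
        # Nudge down the weight of classifiers on the wrong side of the target.
        for i, p in enumerate(probabilities):
            if (p - target) * diff > 0:
                weights[i] = max(0.01, weights[i] - step)
    return weights
```

With per-classifier probabilities `[0.9, 0.2, 0.4]` and a target of `0.3`, the loop settles on weights whose combined output lies within the tolerance band.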
In addition, in one embodiment, the step of obtaining the title feature vector of the title text and the body feature vector of the body text can be: segmenting the title text and the body text respectively to obtain the first feature word set of the title text and the second feature word set of the body text; obtaining, according to the preset positive/negative keyword library and the preset word vector tool, the first word vector of each feature word in the first feature word set and the second word vector of each feature word in the second feature word set; and averaging the first word vectors to obtain the title feature vector, and the second word vectors to obtain the body feature vector.
In this embodiment, the positive and negative keywords strengthen feature word matching: not only can positive results be matched, but by providing the corresponding negative words, the negative word for a specific word can still be matched when the feature word itself is not matched. This improves the matching efficiency of the feature words and therefore makes the constructed feature vectors more accurate.
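Averaging the word vectors into a title (or body) feature vector, as in the embodiment above, is a single mean per dimension. A sketch with hypothetical 3-dimensional embeddings:

```python
def mean_vector(word_vectors):
    """Average per-word embeddings into a single feature vector."""
    dim = len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / len(word_vectors)
            for i in range(dim)]

# Hypothetical embeddings for two feature words of a title text:
title_vecs = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
title_feature = mean_vector(title_vecs)  # one vector per title
```

The same function applied to the body's word vectors yields the body feature vector; splicing the two gives the word vector similarity feature described earlier.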
In one embodiment, as shown in Fig. 5, a schematic flowchart of a platform news pushing scheme based on the text classification method is provided. The specific steps are as follows:
Step 502: receive the news text to be pushed, the news text including a news title and a body.
The news text sources, such as Sina or www.xinhuanet.com, can be preset, and each news article is then saved in the terminal as one news text.
Step 504, extract the text size feature of newsletter archive, keyword words-frequency feature, term vector similarity feature,
TF-IDF weight feature, the Probability Characteristics of LDA model and informed source feature.
Step 506: obtain the fusion feature of the news text according to the text length feature, keyword word-frequency feature, word vector similarity feature, TF-IDF weight feature, probability distribution feature of the LDA model and information source feature.
As the fusion method, each text feature can first be vectorized, and the vectors are then spliced to obtain the fusion feature.
Step 508: input the fusion feature into the classifier library, rank the classifiers according to the probability each classifier in the library outputs for the preset label, and fuse the top three classifiers to obtain the integrated classifier.
Wherein it is possible to be merged by the way of weighting, weight is arranged in as each classifier, to classifier output
As a result it is weighted.
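The rank-then-weight fusion of step 508 might look like the following sketch. The interfaces are assumptions (each classifier is modeled as a callable returning its probability for the preset label), and weighting by relative confidence is one possible weighting scheme, not necessarily the patent's:

```python
# Rank classifiers by the probability they output for the preset label,
# keep the top three, and combine them as a confidence-weighted sum.
def build_fused_scorer(classifiers, feature, top_k=3):
    scored = sorted(classifiers, key=lambda c: c(feature), reverse=True)
    top = scored[:top_k]
    probs = [c(feature) for c in top]
    total = sum(probs)
    weights = [p / total for p in probs]  # weight by relative confidence

    def fused(x):
        return sum(w * c(x) for w, c in zip(weights, top))
    return fused

# Toy library: constant-output stand-ins for trained classifiers.
library = [lambda x: 0.9, lambda x: 0.4, lambda x: 0.7, lambda x: 0.6]
fused = build_fused_scorer(library, feature=None)
print(round(fused(None), 3))
```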
Step 510: perform classification prediction on the news text according to the output of the fusion classifier. If the predicted category of the news text meets the platform's regulatory requirements, the news text is published on the platform; if it does not, the news text is not published.
In this embodiment, classifying news texts enables monitoring of the news published on the platform and guarantees the quality of the platform's news.
In another embodiment, a correction strategy may also be configured when the news text is pushed. The correction strategy may be sensitive-word filtering: by detecting whether the news text contains sensitive words, it is determined whether the news text is pushed to the platform.
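A minimal sketch of such a sensitive-word filter, with an illustrative placeholder word list (the actual list and tokenization are not specified by the embodiment):

```python
# Push a news text only if it contains none of the sensitive words.
SENSITIVE_WORDS = {"violence", "gambling"}  # placeholder list

def may_push(news_tokens):
    """True if the tokenized news text shares no word with the list."""
    return SENSITIVE_WORDS.isdisjoint(news_tokens)

print(may_push(["market", "news"]))       # → True
print(may_push(["illegal", "gambling"]))  # → False
```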
It should be understood that although the steps in the flowcharts of Figs. 2, 3 and 5 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated. Unless expressly stated otherwise herein, there is no strict ordering constraint on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in Figs. 2, 3 and 5 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and their execution order is not necessarily sequential: they may be executed in turn or alternately with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in Fig. 6, a text classification apparatus is provided, comprising a feature fusion module 602, a classifier selection module 604, a classifier fusion module 606, an output module 608 and a classification module 610, in which:
the feature fusion module 602 is configured to select a text feature combination from a preset text feature library and extract, from the text to be classified, the fusion feature corresponding to the text feature combination;
the classifier selection module 604 is configured to select multiple pre-trained classifiers from a preset classifier library according to the text feature combination;
the classifier fusion module 606 is configured to obtain a fusion classifier according to the classifiers;
the output module 608 is configured to input the fusion feature into the fusion classifier to obtain the probabilities of multiple preset labels, each preset label corresponding to one text type;
the classification module 610 is configured to determine the text type of the text to be classified according to the preset label with the highest probability.
In one embodiment, labeled texts are selected from a preset corpus; a classifier is trained according to the target labels of the labeled texts and a preset termination condition; and when the probability the classifier outputs for the target label meets the termination condition, the trained classifier is obtained.
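The train-until-termination loop can be sketched as follows. Everything here is an assumption for illustration: the learner interface (`train_one_round`, `prob_of_target`), the threshold form of the termination condition, and the toy classifier whose confidence simply grows each round:

```python
class ToyClassifier:
    """Stand-in learner: its target-label probability rises each round."""
    def __init__(self):
        self.p = 0.0

    def train_one_round(self, labeled_texts):
        self.p = min(1.0, self.p + 0.25)

    def prob_of_target(self, target_label):
        return self.p

def train_until(classifier, labeled_texts, target_label,
                threshold=0.9, max_rounds=100):
    """Train until the classifier's probability for the target label
    meets the preset termination condition (here: a threshold)."""
    for _ in range(max_rounds):
        classifier.train_one_round(labeled_texts)
        if classifier.prob_of_target(target_label) >= threshold:
            break  # termination condition met
    return classifier

clf = train_until(ToyClassifier(), [("text", "sports")], "sports")
print(clf.prob_of_target("sports"))  # → 1.0
```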
In one embodiment, the classifier selection module 604 is further configured to extract the multiple text feature combinations corresponding to the labeled texts; input each text feature combination in turn into each trained classifier in the classifier library; rank the trained classifiers according to the probability they output for the target label, filter out the classifiers that meet a preset condition, and establish the correspondence between the text feature combination and the multiple classifiers; and query the correspondence according to the text feature combination to select multiple pre-trained classifiers from the preset classifier library.
In one embodiment, the text feature library includes a text length feature, a keyword word-frequency feature, a word-vector similarity feature, a TF-IDF weight feature, an LDA-model probability distribution feature and a news source feature. The feature fusion module 602 is further configured to select two or more of these features from the text feature library to obtain a text feature combination; extract each text feature in the combination from the text to be classified; and combine the extracted text features to obtain the fusion feature.
In one embodiment, the text to be classified includes a title text and a body text. The feature fusion module 602 is further configured to: obtain the title text length and body text length of the text to be classified, obtain a title length vector and a body length vector from them respectively, and concatenate the two vectors to obtain the text length feature of the text to be classified; or obtain a preset keyword table, match the title text and the body text against the keyword table to obtain the frequencies in the text to be classified of the keywords in the table, and vectorize the frequencies to obtain the keyword word-frequency feature; or obtain the title feature vector of the title text and the body feature vector of the body text, and concatenate the two to obtain the word-vector similarity feature; or obtain the TF-IDF weight in a preset corpus of each keyword in the text to be classified, obtain the mean TF-IDF weight of the text to be classified from the mean of these weights, and vectorize the mean to obtain the TF-IDF weight feature of the text to be classified; or input the text to be classified into a preset LDA model to obtain the probability distribution of the text over the preset topics, and vectorize the distribution to obtain the LDA-model probability distribution feature of the text to be classified; or obtain the news source of the text to be classified, obtain its source number according to a preset coding rule, and vectorize the source number to obtain the news source feature.
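Two of these alternative features can be sketched concretely. All data, names and the keyword table below are made up for illustration; the keyword word-frequency feature counts preset-table keywords across title and body, and the TF-IDF feature averages precomputed corpus weights:

```python
from collections import Counter

def keyword_freq_feature(title_tokens, body_tokens, keyword_table):
    """Vectorize, in table order, how often each table keyword occurs
    in the title plus the body."""
    counts = Counter(title_tokens) + Counter(body_tokens)
    return [counts[k] for k in keyword_table]

def mean_tfidf_feature(tfidf_weights):
    """Vectorize the mean of the keywords' corpus TF-IDF weights."""
    return [sum(tfidf_weights) / len(tfidf_weights)]

table = ["finance", "stock"]
print(keyword_freq_feature(["stock", "news"], ["stock", "finance"], table))  # → [1, 2]
print(mean_tfidf_feature([0.2, 0.4, 0.6]))
```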
In one embodiment, the output module 608 is further configured to calculate the weight of each of the multiple classifiers according to a preset weighting algorithm, and to weight the classifiers by these weights to obtain the fusion classifier.
In one embodiment, the feature fusion module 602 is further configured to segment the title text and the body text separately to obtain the first feature-word set of the title text and the second feature-word set of the body text; obtain, according to a preset positive/negative keyword library and a preset word-vector tool, the first word vector of each feature word in the first feature-word set and the second word vector of each feature word in the second feature-word set; and average the first word vectors to obtain the title feature vector and average the second word vectors to obtain the body feature vector.
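The averaging step can be sketched as follows; the two-dimensional word vectors are made up (real word-vector tools produce much higher dimensions):

```python
import numpy as np

# Average the word vectors of the segmented feature words to get one
# fixed-size feature vector for the title (or body).
def avg_vector(word_vectors):
    return np.mean(np.stack(word_vectors), axis=0)

title_vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(avg_vector(title_vecs))  # → [0.5 0.5]
```

Averaging makes the feature vector independent of the number of words, so titles and bodies of any length map to vectors of the same dimension.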
For the specific limitations of the text classification apparatus, refer to the limitations of the text classification method above, which are not repeated here. Each module of the apparatus may be implemented wholly or partly in software, hardware, or a combination of the two. The modules may be embedded in, or independent of, the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke them to perform the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Fig. 7. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor provides computing and control capability. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database stores the text data to be classified. The network interface communicates with external terminals through a network connection. The computer program, when executed by the processor, implements a text classification method.
Those skilled in the art will understand that the structure shown in Fig. 7 is only a block diagram of the part of the structure relevant to the present solution and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program, and the processor, when executing the computer program, performs the following steps:
selecting a text feature combination from a preset text feature library, and extracting, from the text to be classified, the fusion feature corresponding to the text feature combination;
selecting multiple pre-trained classifiers from a preset classifier library according to the text feature combination;
obtaining a fusion classifier according to the classifiers;
inputting the fusion feature into the fusion classifier to obtain the probabilities of multiple preset labels, each preset label corresponding to one text type;
determining the text type of the text to be classified according to the preset label with the highest probability.
In one embodiment, the processor, when executing the computer program, further performs the steps of: selecting labeled texts from a preset corpus; training a classifier according to the target labels of the labeled texts and a preset termination condition; and obtaining the trained classifier when the probability the classifier outputs for the target label meets the termination condition.
In one embodiment, the processor, when executing the computer program, further performs the steps of: extracting the multiple text feature combinations corresponding to the labeled texts; inputting each text feature combination in turn into each trained classifier in the classifier library; ranking the trained classifiers according to the probability they output for the target label, filtering out the classifiers that meet a preset condition, and establishing the correspondence between the text feature combination and the multiple classifiers; and querying the correspondence according to the text feature combination to select multiple pre-trained classifiers from the preset classifier library.
In one embodiment, the text feature library includes a text length feature, a keyword word-frequency feature, a word-vector similarity feature, a TF-IDF weight feature, an LDA-model probability distribution feature and a news source feature. The processor, when executing the computer program, further performs the steps of: selecting two or more of these features from the text feature library to obtain a text feature combination; extracting each text feature in the combination from the text to be classified; and combining the extracted text features to obtain the fusion feature.
In one embodiment, the text to be classified includes a title text and a body text. The processor, when executing the computer program, further performs the steps of: obtaining the title text length and body text length of the text to be classified, obtaining a title length vector and a body length vector from them respectively, and concatenating the two vectors to obtain the text length feature of the text to be classified; or obtaining a preset keyword table, matching the title text and the body text against the keyword table to obtain the frequencies in the text to be classified of the keywords in the table, and vectorizing the frequencies to obtain the keyword word-frequency feature; or obtaining the title feature vector of the title text and the body feature vector of the body text, and concatenating the two to obtain the word-vector similarity feature; or obtaining the TF-IDF weight in a preset corpus of each keyword in the text to be classified, obtaining the mean TF-IDF weight of the text to be classified from the mean of these weights, and vectorizing the mean to obtain the TF-IDF weight feature of the text to be classified; or inputting the text to be classified into a preset LDA model to obtain the probability distribution of the text over the preset topics, and vectorizing the distribution to obtain the LDA-model probability distribution feature of the text to be classified; or obtaining the news source of the text to be classified, obtaining its source number according to a preset coding rule, and vectorizing the source number to obtain the news source feature.
In one embodiment, the processor, when executing the computer program, further performs the steps of: calculating the weight of each of the multiple classifiers according to a preset weighting algorithm; and weighting the classifiers by these weights to obtain the fusion classifier.
In one embodiment, the processor, when executing the computer program, further performs the steps of: segmenting the title text and the body text separately to obtain the first feature-word set of the title text and the second feature-word set of the body text; obtaining, according to a preset positive/negative keyword library and a preset word-vector tool, the first word vector of each feature word in the first feature-word set and the second word vector of each feature word in the second feature-word set; and averaging the first word vectors to obtain the title feature vector and averaging the second word vectors to obtain the body feature vector.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. The computer program, when executed by a processor, performs the following steps:
selecting a text feature combination from a preset text feature library, and extracting, from the text to be classified, the fusion feature corresponding to the text feature combination;
selecting multiple pre-trained classifiers from a preset classifier library according to the text feature combination;
obtaining a fusion classifier according to the classifiers;
inputting the fusion feature into the fusion classifier to obtain the probabilities of multiple preset labels, each preset label corresponding to one text type;
determining the text type of the text to be classified according to the preset label with the highest probability.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: selecting labeled texts from a preset corpus; training a classifier according to the target labels of the labeled texts and a preset termination condition; and obtaining the trained classifier when the probability the classifier outputs for the target label meets the termination condition.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: extracting the multiple text feature combinations corresponding to the labeled texts; inputting each text feature combination in turn into each trained classifier in the classifier library; ranking the trained classifiers according to the probability they output for the target label, filtering out the classifiers that meet a preset condition, and establishing the correspondence between the text feature combination and the multiple classifiers; and querying the correspondence according to the text feature combination to select multiple pre-trained classifiers from the preset classifier library.
In one embodiment, the text feature library includes a text length feature, a keyword word-frequency feature, a word-vector similarity feature, a TF-IDF weight feature, an LDA-model probability distribution feature and a news source feature. The computer program, when executed by the processor, further performs the steps of: selecting two or more of these features from the text feature library to obtain a text feature combination; extracting each text feature in the combination from the text to be classified; and combining the extracted text features to obtain the fusion feature.
In one embodiment, the text to be classified includes a title text and a body text. The computer program, when executed by the processor, further performs the steps of: obtaining the title text length and body text length of the text to be classified, obtaining a title length vector and a body length vector from them respectively, and concatenating the two vectors to obtain the text length feature of the text to be classified; or obtaining a preset keyword table, matching the title text and the body text against the keyword table to obtain the frequencies in the text to be classified of the keywords in the table, and vectorizing the frequencies to obtain the keyword word-frequency feature; or obtaining the title feature vector of the title text and the body feature vector of the body text, and concatenating the two to obtain the word-vector similarity feature; or obtaining the TF-IDF weight in a preset corpus of each keyword in the text to be classified, obtaining the mean TF-IDF weight of the text to be classified from the mean of these weights, and vectorizing the mean to obtain the TF-IDF weight feature of the text to be classified; or inputting the text to be classified into a preset LDA model to obtain the probability distribution of the text over the preset topics, and vectorizing the distribution to obtain the LDA-model probability distribution feature of the text to be classified; or obtaining the news source of the text to be classified, obtaining its source number according to a preset coding rule, and vectorizing the source number to obtain the news source feature.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: calculating the weight of each of the multiple classifiers according to a preset weighting algorithm; and weighting the classifiers by these weights to obtain the fusion classifier.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: segmenting the title text and the body text separately to obtain the first feature-word set of the title text and the second feature-word set of the body text; obtaining, according to a preset positive/negative keyword library and a preset word-vector tool, the first word vector of each feature word in the first feature-word set and the second word vector of each feature word in the second feature-word set; and averaging the first word vectors to obtain the title feature vector and averaging the second word vectors to obtain the body feature vector.
Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium, and when executed it may include the processes of the above method embodiments. Any reference to memory, storage, a database or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the present application patent shall be subject to the appended claims.
Claims (10)
1. A text classification method, the method comprising:
selecting a text feature combination from a preset text feature library, and extracting, from a text to be classified, a fusion feature corresponding to the text feature combination;
selecting multiple pre-trained classifiers from a preset classifier library according to the text feature combination;
obtaining a fusion classifier according to the classifiers;
inputting the fusion feature into the fusion classifier to obtain probabilities of multiple preset labels, each preset label corresponding to one text type;
determining the text type of the text to be classified according to the preset label with the highest probability.
2. The method according to claim 1, wherein the step of training a classifier comprises:
selecting labeled texts from a preset corpus;
training a classifier according to target labels of the labeled texts and a preset termination condition;
obtaining the trained classifier when the probability the classifier outputs for the target label meets the termination condition.
3. The method according to claim 2, wherein the method further comprises:
extracting multiple text feature combinations corresponding to the labeled texts;
inputting each text feature combination in turn into each trained classifier in the classifier library;
ranking the trained classifiers according to the probability they output for the target label, filtering out classifiers that meet a preset condition, and establishing a correspondence between the text feature combination and the multiple classifiers;
and wherein the selecting multiple pre-trained classifiers from a preset classifier library according to the text feature combination comprises:
querying the correspondence according to the text feature combination to select multiple pre-trained classifiers from the preset classifier library.
4. The method according to claim 1, wherein the text feature library comprises a text length feature, a keyword word-frequency feature, a word-vector similarity feature, a TF-IDF weight feature, an LDA-model probability distribution feature and a news source feature;
the selecting a text feature combination from a preset text feature library, and extracting, from the text to be classified, a fusion feature corresponding to the text feature combination comprises:
selecting two or more of the text length feature, keyword word-frequency feature, word-vector similarity feature, TF-IDF weight feature, LDA-model probability distribution feature and news source feature from the text feature library to obtain a text feature combination;
extracting each text feature in the text feature combination from the text to be classified;
combining the extracted text features to obtain the fusion feature.
5. The method according to claim 4, wherein the text to be classified comprises a title text and a body text;
the extracting, from the text to be classified, a fusion feature corresponding to the text feature combination comprises:
obtaining the title text length and the body text length of the text to be classified; obtaining a title length vector and a body length vector according to the title text length and the body text length respectively; concatenating the title length vector and the body length vector to obtain the text length feature of the text to be classified;
or,
obtaining a preset keyword table, matching the title text and the body text according to the keyword table to obtain the frequencies, in the text to be classified, of the keywords in the keyword table; vectorizing the frequencies to obtain the keyword word-frequency feature;
or,
obtaining the title feature vector of the title text and the body feature vector of the body text, and concatenating the title feature vector and the body feature vector to obtain the word-vector similarity feature;
or,
obtaining the TF-IDF weight, in a preset corpus, of each keyword in the text to be classified; obtaining the mean TF-IDF weight of the text to be classified according to the mean of the TF-IDF weights of the keywords; vectorizing the mean TF-IDF weight to obtain the TF-IDF weight feature of the text to be classified;
or,
inputting the text to be classified into a preset LDA model to obtain the probability distribution of the text to be classified over the preset topics; vectorizing the probability distribution to obtain the LDA-model probability distribution feature of the text to be classified;
or,
obtaining the news source of the text to be classified; obtaining the source number of the news source according to a preset coding rule; vectorizing the source number to obtain the news source feature.
6. The method according to any one of claims 1 to 5, wherein the obtaining a fusion classifier according to the classifiers comprises:
calculating a weight of each of the classifiers according to a preset weighting algorithm;
weighting the classifiers by the weights to obtain the fusion classifier.
7. The method according to claim 5, wherein the obtaining the title feature vector of the title text and the body feature vector of the body text comprises:
segmenting the title text and the body text separately to obtain a first feature-word set of the title text and a second feature-word set of the body text;
obtaining, according to a preset positive/negative keyword library and a preset word-vector tool, a first word vector of each feature word in the first feature-word set and a second word vector of each feature word in the second feature-word set;
averaging the first word vectors to obtain the title feature vector, and averaging the second word vectors to obtain the body feature vector.
8. A text classification apparatus, wherein the apparatus comprises:
a feature fusion module, configured to select a text feature combination from a preset text feature library and extract, from a text to be classified, a fusion feature corresponding to the text feature combination;
a classifier selection module, configured to select multiple pre-trained classifiers from a preset classifier library according to the text feature combination;
a classifier fusion module, configured to obtain a fusion classifier according to the classifiers;
an output module, configured to input the fusion feature into the fusion classifier to obtain probabilities of multiple preset labels, each preset label corresponding to one text type;
a classification module, configured to determine the text type of the text to be classified according to the preset label with the highest probability.
9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
Priority Applications (2)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811258359.3A CN109543032A (en) | 2018-10-26 | 2018-10-26 | Text classification method, apparatus, computer device and storage medium |
| PCT/CN2018/123353 WO2020082569A1 (en) | 2018-10-26 | 2018-12-25 | Text classification method, apparatus, computer device and storage medium |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN109543032A (en) | 2019-03-29 |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112613501A (en) * | 2020-12-21 | 2021-04-06 | 深圳壹账通智能科技有限公司 | Information auditing classification model construction method and information auditing method |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3392780A3 (en) * | 2017-04-19 | 2018-11-07 | Tata Consultancy Services Limited | Systems and methods for classification of software defect reports |
CN107545038B (en) * | 2017-07-31 | 2019-12-10 | 中国农业大学 | Text classification method and equipment |
CN108520030B (en) * | 2018-03-27 | 2022-02-11 | 深圳中兴网信科技有限公司 | Text classification method, text classification system and computer device |
CN108595632B (en) * | 2018-04-24 | 2022-05-24 | 福州大学 | Hybrid neural network text classification method fusing abstract and main body characteristics |
- 2018-10-26: CN application CN201811258359.3A filed; published as CN109543032A (en), status Pending
- 2018-12-25: PCT application PCT/CN2018/123353 filed; published as WO2020082569A1 (en), status Application Filing
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105373800A (en) * | 2014-08-28 | 2016-03-02 | 百度在线网络技术(北京)有限公司 | Classification method and device |
US20160132788A1 (en) * | 2014-11-07 | 2016-05-12 | Xerox Corporation | Methods and systems for creating a classifier capable of predicting personality type of users |
CN104951542A (en) * | 2015-06-19 | 2015-09-30 | 百度在线网络技术(北京)有限公司 | Method and device for recognizing class of social contact short texts and method and device for training classification models |
CN107908715A (en) * | 2017-11-10 | 2018-04-13 | 中国民航大学 | Microblog emotional polarity discriminating method based on Adaboost and grader Weighted Fusion |
CN108171280A (en) * | 2018-01-31 | 2018-06-15 | 国信优易数据有限公司 | A kind of grader construction method and the method for prediction classification |
CN108388914A (en) * | 2018-02-26 | 2018-08-10 | 中译语通科技股份有限公司 | A kind of grader construction method, grader based on semantic computation |
Non-Patent Citations (1)
Title |
---|
TANG Chunsheng, JIN Yihui: "A Multi-Classifier Ensemble Method Based on the Full Information Matrix", Journal of Software (软件学报), no. 06, 23 June 2003 (2003-06-23), pages 1103 - 1109 *
Cited By (53)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110134785A (en) * | 2019-04-15 | 2019-08-16 | 平安普惠企业管理有限公司 | Management method, device, storage medium and the equipment of forum's article |
WO2020215563A1 (en) * | 2019-04-24 | 2020-10-29 | 平安科技(深圳)有限公司 | Training sample generation method and device for text classification, and computer apparatus |
CN110795558A (en) * | 2019-09-03 | 2020-02-14 | 腾讯科技(深圳)有限公司 | Label acquisition method and device, storage medium and electronic device |
CN110795558B (en) * | 2019-09-03 | 2023-09-29 | 腾讯科技(深圳)有限公司 | Label acquisition method and device, storage medium and electronic device |
CN110569361B (en) * | 2019-09-06 | 2021-10-19 | 腾讯科技(深圳)有限公司 | Text recognition method and equipment |
CN110569361A (en) * | 2019-09-06 | 2019-12-13 | 腾讯科技(深圳)有限公司 | Text recognition method and equipment |
CN110750643A (en) * | 2019-09-29 | 2020-02-04 | 上证所信息网络有限公司 | Method and device for classifying non-periodic announcements of listed companies and storage medium |
CN110750643B (en) * | 2019-09-29 | 2024-02-09 | 上证所信息网络有限公司 | Method, device and storage medium for classifying non-periodic announcements of marketing companies |
CN111008329A (en) * | 2019-11-22 | 2020-04-14 | 厦门美柚股份有限公司 | Page content recommendation method and device based on content classification |
CN110969208A (en) * | 2019-11-29 | 2020-04-07 | 支付宝(杭州)信息技术有限公司 | Fusion method and device for multiple model results |
CN110969208B (en) * | 2019-11-29 | 2022-04-12 | 支付宝(杭州)信息技术有限公司 | Fusion method and device for multiple model results |
CN111078878A (en) * | 2019-12-06 | 2020-04-28 | 北京百度网讯科技有限公司 | Text processing method, device and equipment and computer readable storage medium |
CN111078878B (en) * | 2019-12-06 | 2023-07-04 | 北京百度网讯科技有限公司 | Text processing method, device, equipment and computer readable storage medium |
CN111191004A (en) * | 2019-12-27 | 2020-05-22 | 咪咕文化科技有限公司 | Text label extraction method and device and computer readable storage medium |
CN111191004B (en) * | 2019-12-27 | 2023-09-22 | 咪咕文化科技有限公司 | Text label extraction method, text label extraction device and computer readable storage medium |
CN111143568A (en) * | 2019-12-31 | 2020-05-12 | 郑州工程技术学院 | Method, device and equipment for buffering during paper classification and storage medium |
CN111353301A (en) * | 2020-02-24 | 2020-06-30 | 成都网安科技发展有限公司 | Auxiliary secret fixing method and device |
CN111309914B (en) * | 2020-03-03 | 2023-05-09 | 支付宝(杭州)信息技术有限公司 | Classification method and device for multi-round conversations based on multiple model results |
CN111309914A (en) * | 2020-03-03 | 2020-06-19 | 支付宝(杭州)信息技术有限公司 | Method and device for classifying multiple rounds of conversations based on multiple model results |
CN111401040B (en) * | 2020-03-17 | 2021-06-18 | 上海爱数信息技术股份有限公司 | Keyword extraction method suitable for word text |
CN111401040A (en) * | 2020-03-17 | 2020-07-10 | 上海爱数信息技术股份有限公司 | Keyword extraction method suitable for word text |
CN111475651A (en) * | 2020-04-08 | 2020-07-31 | 掌阅科技股份有限公司 | Text classification method, computing device and computer storage medium |
CN111475651B (en) * | 2020-04-08 | 2023-04-07 | 掌阅科技股份有限公司 | Text classification method, computing device and computer storage medium |
CN111581381B (en) * | 2020-04-29 | 2023-10-10 | 北京字节跳动网络技术有限公司 | Method and device for generating training set of text classification model and electronic equipment |
CN111581381A (en) * | 2020-04-29 | 2020-08-25 | 北京字节跳动网络技术有限公司 | Method and device for generating training set of text classification model and electronic equipment |
CN111666748B (en) * | 2020-05-12 | 2022-09-13 | 武汉大学 | Construction method of automatic classifier and decision recognition method |
CN111666748A (en) * | 2020-05-12 | 2020-09-15 | 武汉大学 | Construction method of automatic classifier and method for recognizing decision from software development text product |
CN111680502B (en) * | 2020-05-14 | 2023-09-22 | 深圳平安通信科技有限公司 | Text processing method and related device |
CN111680502A (en) * | 2020-05-14 | 2020-09-18 | 深圳平安通信科技有限公司 | Text processing method and related device |
CN111611801A (en) * | 2020-06-02 | 2020-09-01 | 腾讯科技(深圳)有限公司 | Method, device, server and storage medium for identifying text region attribute |
CN111797229A (en) * | 2020-06-10 | 2020-10-20 | 南京擎盾信息科技有限公司 | Text representation method and device and text classification method |
CN111966830A (en) * | 2020-06-30 | 2020-11-20 | 北京来也网络科技有限公司 | Text classification method, device, equipment and medium combining RPA and AI |
CN111651566A (en) * | 2020-08-10 | 2020-09-11 | 四川大学 | Multi-task small sample learning-based referee document dispute focus extraction method |
CN111651566B (en) * | 2020-08-10 | 2020-12-01 | 四川大学 | Multi-task small sample learning-based referee document dispute focus extraction method |
CN112749558B (en) * | 2020-09-03 | 2023-11-24 | 腾讯科技(深圳)有限公司 | Target content acquisition method, device, computer equipment and storage medium |
CN112749558A (en) * | 2020-09-03 | 2021-05-04 | 腾讯科技(深圳)有限公司 | Target content acquisition method and device, computer equipment and storage medium |
CN112328787B (en) * | 2020-11-04 | 2024-02-20 | 中国平安人寿保险股份有限公司 | Text classification model training method and device, terminal equipment and storage medium |
CN112328787A (en) * | 2020-11-04 | 2021-02-05 | 中国平安人寿保险股份有限公司 | Text classification model training method and device, terminal equipment and storage medium |
CN112347255A (en) * | 2020-11-06 | 2021-02-09 | 天津大学 | Text classification method based on title and text combination of graph network |
CN112905793A (en) * | 2021-02-23 | 2021-06-04 | 山西同方知网数字出版技术有限公司 | Case recommendation method and system based on Bilstm + Attention text classification |
CN112905793B (en) * | 2021-02-23 | 2023-06-20 | 山西同方知网数字出版技术有限公司 | Case recommendation method and system based on bilstm+attention text classification |
CN112966766A (en) * | 2021-03-18 | 2021-06-15 | 北京三快在线科技有限公司 | Article classification method, apparatus, server and storage medium |
CN113064993A (en) * | 2021-03-23 | 2021-07-02 | 南京视察者智能科技有限公司 | Design method, optimization method and labeling method of automatic text classification labeling system based on big data |
CN113064993B (en) * | 2021-03-23 | 2023-07-21 | 南京视察者智能科技有限公司 | Design method, optimization method and labeling method of automatic text classification labeling system based on big data |
CN113239200A (en) * | 2021-05-20 | 2021-08-10 | 东北农业大学 | Content identification and classification method, device and system and storage medium |
CN113157927B (en) * | 2021-05-27 | 2023-10-31 | 中国平安人寿保险股份有限公司 | Text classification method, apparatus, electronic device and readable storage medium |
CN113157927A (en) * | 2021-05-27 | 2021-07-23 | 中国平安人寿保险股份有限公司 | Text classification method and device, electronic equipment and readable storage medium |
CN113935307A (en) * | 2021-09-16 | 2022-01-14 | 有米科技股份有限公司 | Method and device for extracting features of advertisement case |
CN116468037A (en) * | 2023-03-17 | 2023-07-21 | 北京深维智讯科技有限公司 | NLP-based data processing method and system |
CN116304717B (en) * | 2023-05-09 | 2023-12-15 | 北京搜狐新媒体信息技术有限公司 | Text classification method and device, storage medium and electronic equipment |
CN116304717A (en) * | 2023-05-09 | 2023-06-23 | 北京搜狐新媒体信息技术有限公司 | Text classification method and device, storage medium and electronic equipment |
CN117236329A (en) * | 2023-11-15 | 2023-12-15 | 阿里巴巴达摩院(北京)科技有限公司 | Text classification method and device and related equipment |
CN117236329B (en) * | 2023-11-15 | 2024-02-06 | 阿里巴巴达摩院(北京)科技有限公司 | Text classification method and device and related equipment |
Also Published As
Publication number | Publication date |
---|---|
WO2020082569A1 (en) | 2020-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109543032A (en) | File classification method, device, computer equipment and storage medium | |
CN110147445A (en) | Intension recognizing method, device, equipment and storage medium based on text classification | |
CN110489550A (en) | File classification method, device and computer equipment based on combination neural net | |
CN110377730A (en) | Case is by classification method, device, computer equipment and storage medium | |
CN108509482A (en) | Question classification method, device, computer equipment and storage medium | |
CN109522406A (en) | Text semantic matching process, device, computer equipment and storage medium | |
CN110209805A (en) | File classification method, device, storage medium and computer equipment | |
CN108491406B (en) | Information classification method and device, computer equipment and storage medium | |
CN110399609A (en) | Intension recognizing method, device, equipment and computer readable storage medium | |
CN105787025A (en) | Network platform public account classifying method and device | |
CN108629693A (en) | Automatically generate method, apparatus, computer equipment and the storage medium of suggestion for investment | |
CN114240101A (en) | Risk identification model verification method, device and equipment | |
CN107679209B (en) | Classification expression generation method and device | |
CN111400449B (en) | Regular expression extraction method and device | |
CN111899027A (en) | Anti-fraud model training method and device | |
CN113705188B (en) | Intelligent evaluation method for customs import and export commodity specification declaration | |
CN110532359A (en) | Legal provision query method, apparatus, computer equipment and storage medium | |
CN113220885A (en) | Text processing method and system | |
US20170364827A1 (en) | Scenario Analytics System | |
CN108595568A (en) | A kind of text sentiment classification method based on very big unrelated multivariate logistic regression | |
CN112685639A (en) | Activity recommendation method and device, computer equipment and storage medium | |
CN116186257A (en) | Method and system for classifying short texts based on mixed features | |
CN109522407A (en) | Business connection prediction technique, device, computer equipment and storage medium | |
CN114049215A (en) | Abnormal transaction identification method, device and application | |
CN114443803A (en) | Text information mining method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||