CN109543032A - Text classification method, apparatus, computer device and storage medium - Google Patents
Text classification method, apparatus, computer device and storage medium
- Publication number
- CN109543032A (application CN201811258359.3A)
- Authority
- CN
- China
- Prior art keywords
- text
- feature
- classifier
- sorted
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The present application relates to a text classification method based on a classification model, an apparatus, a computer device and a storage medium. The method includes: selecting a text feature combination from a preset text feature library; extracting, from the text to be classified, the fusion feature corresponding to the text feature combination; selecting, according to the text feature combination, multiple pre-trained classifiers from a preset classifier library; obtaining an integrated classifier from the selected classifiers; inputting the fusion feature into the integrated classifier to obtain the probabilities of multiple preset labels output by the integrated classifier, each preset label corresponding to a text type; and determining the text type of the text to be classified according to the preset label with the highest probability. The method can improve the accuracy of text classification.
Description
Technical field
The present application relates to the field of computer technology, and in particular to a text classification method, apparatus, computer device and storage medium.
Background art
Text classification refers to the technique of assigning natural-language text to a specified category, and it is widely used in the field of Internet technology. When news is pushed, news texts can be screened by text classification. Specifically, before a news text is pushed to a designated platform, it needs to be obtained from each news source and then published on the designated platform for platform visitors to read. To guarantee the quality of the news texts published on the platform, the news texts need to be audited. Taking a government finance platform as an example, the news to be published must be finance news; after news texts are obtained from each news source, their content must be audited. The audit mainly covers whether the content is credible, whether it contains advertisements, whether the main content relates to finance, and whether the article concerns a topic of social interest, so as to decide whether to publish the news text on the platform. To guarantee the efficiency of news pushing, existing algorithm models can be used to classify the news texts; however, the accuracy of classification with existing algorithm models can hardly meet the requirements of news pushing.
Summary of the invention
On this basis, in view of the above technical problems, it is necessary to provide a text classification method, apparatus, computer device and storage medium that can solve the problem of low classification accuracy when news texts are pushed.
A text classification method, the method comprising:
selecting a text feature combination from a preset text feature library, and extracting, from the text to be classified, the fusion feature corresponding to the text feature combination;
selecting, according to the text feature combination, multiple pre-trained classifiers from a preset classifier library;
obtaining an integrated classifier according to the selected classifiers;
inputting the fusion feature into the integrated classifier to obtain the probabilities of multiple preset labels, each preset label corresponding to a text type; and
determining the text type of the text to be classified according to the preset label with the highest probability.
In one of the embodiments, the step of training a classifier comprises: selecting labeled texts from a preset corpus; training the classifier according to the target labels of the labeled texts and a preset termination condition; and obtaining the trained classifier when the probabilities the classifier outputs for the target labels all satisfy the termination condition.
In one of the embodiments, the method further includes: extracting the multiple text feature combinations corresponding to the labeled texts; inputting each text feature combination in turn into each trained classifier in the classifier library; ranking the probabilities each trained classifier outputs for the target labels, filtering out the classifiers that satisfy a preset condition, and establishing the corresponding relationship between the text feature combination and the multiple classifiers. Selecting multiple pre-trained classifiers from the preset classifier library according to the text feature combination then comprises: querying the corresponding relationship with the text feature combination, and selecting the multiple pre-trained classifiers from the preset classifier library.
In one of the embodiments, the text feature library includes: a text length feature, a keyword word-frequency feature, a word vector similarity feature, a TF-IDF weight feature, a probability distribution feature of an LDA model, and an information source feature. The method further includes: selecting two or more of these features from the text feature library to obtain the text feature combination; extracting each text feature in the text feature combination from the text to be classified; and combining the extracted text features to obtain the fusion feature.
In one of the embodiments, the text to be classified includes a title text and a body text, and the method further includes: obtaining the title text length and the body text length of the text to be classified, obtaining a title length vector and a body length vector from them respectively, and splicing the two vectors to obtain the text length feature of the text to be classified. Or: obtaining a preset keyword list, matching the title text and the body text against the keyword list to obtain the frequencies of the listed keywords in the text to be classified, and vectorizing the frequencies to obtain the keyword word-frequency feature. Or: obtaining the title feature vector of the title text and the body feature vector of the body text, and splicing them to obtain the word vector similarity feature. Or: obtaining the TF-IDF weight of each keyword of the text to be classified in a preset corpus, taking the mean of these TF-IDF weights as the average TF-IDF weight of the text, and vectorizing the average TF-IDF weight to obtain the TF-IDF weight feature of the text to be classified. Or: inputting the text to be classified into a preset LDA model to obtain the probability distribution of the text over the preset topics, and vectorizing the probability distribution to obtain the probability distribution feature of the LDA model. Or: obtaining the information source of the text to be classified, obtaining the source number of the information source, and vectorizing the source number according to a preset coding rule to obtain the information source feature.
In one of the embodiments, the method further includes: calculating the weight of each selected classifier according to a preset weighting algorithm, and weighting the classifiers according to these weights to obtain the integrated classifier.
In one of the embodiments, the method further includes: segmenting the title text and the body text respectively to obtain a first feature word set of the title text and a second feature word set of the body text; obtaining, according to a preset positive/negative keyword library and a preset word vector tool, the first word vector of each feature word in the first feature word set and the second word vector of each feature word in the second feature word set; and averaging the first word vectors to obtain the title feature vector, and averaging the second word vectors to obtain the body feature vector.
A text classification apparatus, the apparatus comprising:
a feature fusion module, configured to select a text feature combination from a preset text feature library and extract, from the text to be classified, the fusion feature corresponding to the text feature combination;
a classifier selection module, configured to select, according to the text feature combination, multiple pre-trained classifiers from a preset classifier library;
a classifier fusion module, configured to obtain an integrated classifier according to the selected classifiers;
an output module, configured to input the fusion feature into the integrated classifier to obtain the probabilities of multiple preset labels, each preset label corresponding to a text type; and
a classification module, configured to determine the text type of the text to be classified according to the preset label with the highest probability.
A computer device, including a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the following steps:
selecting a text feature combination from a preset text feature library, and extracting, from the text to be classified, the fusion feature corresponding to the text feature combination;
selecting, according to the text feature combination, multiple pre-trained classifiers from a preset classifier library;
obtaining an integrated classifier according to the selected classifiers;
inputting the fusion feature into the integrated classifier to obtain the probabilities of multiple preset labels, each preset label corresponding to a text type; and
determining the text type of the text to be classified according to the preset label with the highest probability.
A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the following steps:
selecting a text feature combination from a preset text feature library, and extracting, from the text to be classified, the fusion feature corresponding to the text feature combination;
selecting, according to the text feature combination, multiple pre-trained classifiers from a preset classifier library;
obtaining an integrated classifier according to the selected classifiers;
inputting the fusion feature into the integrated classifier to obtain the probabilities of multiple preset labels, each preset label corresponding to a text type; and
determining the text type of the text to be classified according to the preset label with the highest probability.
With the above text classification method, apparatus, computer device and storage medium, building a text feature library makes it possible to adaptively select different text feature combinations for different categories of text to be classified, which improves the accuracy of feature selection. In addition, the text feature combination, taken as the feature of the text to be classified, is input into the preset classifier library, which can correspondingly select a combination of classifiers to perform classification prediction on the text feature combination, guaranteeing that the optimal classifiers are selected. The whole process requires no manual operation and still yields accurate classification prediction for the text.
Brief description of the drawings
Fig. 1 is an application scenario diagram of the text classification method in one embodiment;
Fig. 2 is a schematic flowchart of the text classification method in one embodiment;
Fig. 3 is a schematic flowchart of the fusion feature extraction step in one embodiment;
Fig. 4 is a schematic flowchart of the text classification method in another embodiment;
Fig. 5 is a schematic flowchart of the text classification method in yet another embodiment;
Fig. 6 is a structural block diagram of the text classification apparatus in one embodiment;
Fig. 7 is an internal structure diagram of the computer device in one embodiment.
Specific embodiment
To make the objects, technical solutions and advantages of the present application clearer, the application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the application, not to limit it.
The text classification method provided by the present application can be applied in the application environment shown in Fig. 1, in which a terminal 102 communicates with a server 104 through a network. The terminal 102 can be, but is not limited to, any of various personal computers or laptops, and the server 104 can be implemented as an independent server or as a server cluster composed of multiple servers. The terminal 102 can obtain the text to be classified from the server 104 through an HTTP request; the text to be classified can be a microblog passage, a public-account article, a blog, news-platform channel information, and so on. After obtaining the texts to be classified, the terminal 102 can store each of them in its database.
Further, before a text to be classified in the terminal 102 is pushed to the platform for publication, it needs to be classified, and only the texts to be classified that satisfy the preset regulatory requirements are sent to the platform, thereby completing the supervision of the platform content.
Specifically, when performing text classification, the terminal 102 extracts the fusion feature of the text to be classified, then selects the corresponding classifiers according to the fusion feature and fuses them to obtain an integrated classifier, and then inputs the fusion feature into the integrated classifier. Since the classifiers in the integrated classifier are trained according to the regulatory requirements of the platform, the integrated classifier can output, for the fusion feature, the probability of each preset label, and each preset label corresponds to a text type, so the text type of the text to be classified can be determined from the probabilities of the preset labels. The terminal 102 can therefore push the texts whose types satisfy the regulatory requirements to the platform for publication, completing the supervision of the platform content.
In one embodiment, as shown in Fig. 2, a text classification method is provided. Taking its application to the terminal in Fig. 1 as an example, the method includes the following steps:
Step 202: select a text feature combination from the preset text feature library, and extract from the text to be classified the fusion feature corresponding to the text feature combination.
The text feature library contains multiple pre-built text features. When a text to be classified is input and the terminal makes its decision, it selects the pre-built text features in the corresponding text feature library and can then output the text features of the text to be classified. The text features can therefore be selected by terminal decision; for example, for a text to be classified that is a news headline, the decision preferably selects text features such as the text length feature, the keyword word-frequency feature and the word vector similarity feature. In this way, the accuracy of classifier prediction can be further improved.
Further, the text feature library can be trained with a preset feature decision model. Specifically, during classification the terminal feeds its input to the feature decision model, which outputs several text feature combinations. The training logic of the feature decision model can depend on the category of the text to be classified, such as news, narrative or commentary, selecting suitable text features to ensure classification accuracy. The terminal can identify the type of the text to be classified and automatically output the text feature combination accordingly; viewed as a whole, the scheme of this embodiment stacks the model in two layers, which improves the prediction efficiency of the model.
Specifically, once each text feature in the text feature combination has been extracted from the text to be classified, the multiple text features can be fused into the fusion feature by means of feature fusion.
Step 204: select, according to the text feature combination, multiple pre-trained classifiers from the preset classifier library.
The classifier library contains classifiers of multiple different types. According to the preset regulatory requirements, text types are set for the different requirements, and different preset labels correspond to the different text types; after the classifiers in the library are trained, they can classify the input text to be classified. Since each classifier in the library performs differently on different text features, multiple classifiers can be chosen to classify the input fusion feature, which improves the accuracy of the classification.
Further, the corresponding relationship between the text feature combinations in the fusion feature and the classifiers in the classifier library is pre-established in the terminal; by recognizing a text feature combination, the corresponding classifiers can be selected from the classifier library automatically.
It is worth noting that the classifier library and the text feature library are tools stored in the terminal in advance, and the terminal can invoke them according to the corresponding logic.
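The pre-established "corresponding relationship" can be pictured as a simple lookup table from a feature combination to classifier names. The concrete keys and classifier names below are hypothetical, since the patent only states that the mapping exists:

```python
# Hypothetical pre-established mapping; keys are stored sorted so lookup
# is insensitive to the order features appear in the combination.
feature_to_classifiers = {
    ("lda", "length", "tfidf"): ["random_forest", "logistic_regression"],
    ("keyword_freq", "word2vec"): ["gbdt", "fully_connected_network"],
}

def lookup_classifiers(feature_combination):
    """Return the pre-trained classifiers registered for a combination."""
    key = tuple(sorted(feature_combination))
    return feature_to_classifiers.get(key, [])

selected = lookup_classifiers(["tfidf", "length", "lda"])
```

An unknown combination simply yields an empty list here; the patent leaves that fallback behavior unspecified.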
Step 206: obtain an integrated classifier according to the selected classifiers.
The integrated classifier can be obtained by structural fusion of the classifiers, in which the outputs of the individual classifiers are merged. Another way is to leave the classifiers unchanged: the terminal collects the output of each classifier and then computes the final result itself, thereby obtaining the integrated classifier.
Step 208: input the fusion feature into the integrated classifier to obtain the probabilities of the multiple preset labels output by the integrated classifier.
During classifier training, each preset label corresponds to a text type; for example, a violating text corresponds to one preset label, and when the classifier outputs a probability of 20% for that preset label, the probability that the text to be classified is a violating text is 20%.
Specifically, the classifier output can be produced through softmax, so the probability of each preset label is available, which facilitates accurate classification of the text.
Step 210: determine the text type of the text to be classified according to the preset label with the highest probability.
Once the probability of each preset label has been obtained, the label with the highest probability can be determined by sorting, and the text type of the text to be classified is then determined from that preset label.
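Steps 208 and 210 together amount to a softmax over the ensemble's scores followed by an argmax over the resulting label probabilities. A self-contained sketch (the label names and scores are hypothetical):

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_type(logits, labels):
    """Return the label with the highest probability, and all probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best], probs

labels = ["finance", "advertisement", "violation"]  # hypothetical preset labels
label, probs = predict_type([2.0, 0.5, 0.1], labels)
```

The returned `label` plays the role of the preset label with the highest probability in step 210.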
In the above text classification method, building a text feature library makes it possible to adaptively select different text feature combinations for different categories of text to be classified, which improves the accuracy of feature selection. In addition, the text feature combination, taken as the feature of the text to be classified, is input into the preset classifier library, which can correspondingly select a combination of classifiers to perform classification prediction on the text feature combination, guaranteeing that the optimal classifiers are selected. The whole process requires no human intervention and still yields accurate classification prediction for the text.
In one embodiment, as shown in Fig. 3, a schematic flowchart of the fusion feature extraction step is provided, where the text feature library includes the text length feature, the keyword word-frequency feature, the word vector similarity feature, the TF-IDF weight feature, the probability distribution feature of the LDA model, and the information source feature. The specific steps are as follows:
Step 302, select text size feature, keyword words-frequency feature, term vector similarity special from text feature library
Two or more in sign, TF-IDF weight feature, the Probability Characteristics of LDA model and informed source feature, obtains text
Feature combination.
Step 304: extract each text feature in the text feature combination from the text to be classified.
Step 306: combine the extracted text features to obtain the fusion feature.
In this embodiment, by providing multiple text features, features can be extracted accurately for all kinds of texts to be classified, which improves the accuracy of text classification.
For the text to be classified mentioned in Fig. 3, in one embodiment the text includes a title text and a body text. Therefore, the title text length and the body text length of the text to be classified can be obtained; a title length vector and a body length vector are obtained from them respectively and spliced to obtain the text length feature of the text to be classified. By obtaining a preset keyword list and matching the title text and the body text against it, the frequencies of the listed keywords in the text to be classified are obtained and vectorized into the keyword word-frequency feature. By obtaining the title feature vector of the title text and the body feature vector of the body text and splicing them, the word vector similarity feature is obtained. Or, by obtaining the TF-IDF weight of each keyword of the text to be classified in the preset corpus and taking the mean of these weights, the average TF-IDF weight of the text is obtained and vectorized into the TF-IDF weight feature of the text to be classified. Or, by inputting the text to be classified into the preset LDA model, the probability distribution of the text over the preset topics is obtained and vectorized into the probability distribution feature of the LDA model. Or, by obtaining the information source of the text to be classified and its source number, the source number is vectorized according to the preset coding rule into the information source feature.
In this embodiment, since the text feature combination includes at least two of the above text features, when the text to be classified is obtained, its title text and body text must first be parsed out, and feature extraction is then performed with each text feature tool.
In one embodiment, the step of training a classifier includes: selecting labeled texts from the preset corpus, and training the classifier according to the target labels of the labeled texts and the preset termination condition; when the probabilities the classifier outputs for the target labels satisfy the termination condition, the trained classifier is obtained.
In another embodiment, the classifier library includes: a decision tree, a random forest, extra trees, gradient boosted trees, logistic regression, a fully-connected network and an adaptive connection tree; the classifier library is obtained by training these classifiers.
In another embodiment, the multiple text feature combinations corresponding to the labeled texts are extracted; each text feature combination is input in turn into each trained classifier in the classifier library; the probabilities each classifier outputs for the target labels are ranked, the classifiers satisfying the preset condition are filtered out, and the corresponding relationship between the text feature combination and the multiple classifiers is established. The step of selecting multiple pre-trained classifiers from the preset classifier library according to the text feature combination then includes: querying the corresponding relationship with the text feature combination and selecting the multiple pre-trained classifiers from the preset classifier library.
Combining the above embodiments, in another embodiment, as shown in Fig. 4, the fusion feature is formed by fusing the text length feature, the word vector similarity feature and the probability distribution feature of the LDA model, and the integrated classifier is formed by fusing a decision tree, a random forest and logistic regression; Fig. 4 clearly shows the classification flow of this embodiment of the invention.
In one embodiment, the step of obtaining the integrated classifier can be: calculating the weight of each of the multiple classifiers according to the preset weighting algorithm, and weighting the classifiers according to these weights to obtain the integrated classifier.
Specifically, the workflow of the weighting algorithm is as follows: the fusion feature of a labeled text is extracted, an initial weight is assigned to each classifier, and the fusion feature is input into each classifier; the final probability of the preset label is calculated according to the initial weights and compared with the target label. If the difference is greater than a preset value, the initial weights are adjusted until the difference is less than the preset value, so that the weight of each classifier is obtained; the classifiers are then weighted with these weights to obtain the integrated classifier.
It is worth noting that the weights differ when different combinations of classifiers are fused; therefore, in the training stage, the fusion weights need to be calculated separately for every classifier combination.
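The adjust-until-the-difference-is-small loop can be sketched as below for a single preset label. The update rule (shrink the weights of classifiers on the wrong side of the target) is a toy assumption; the patent only specifies comparing the combined probability with the target and adjusting until the difference falls under a preset value:

```python
def weighted_ensemble(probabilities, weights):
    """Weighted average of per-classifier probabilities for one preset label."""
    total = sum(weights)
    return sum(p * w for p, w in zip(probabilities, weights)) / total

def fit_weights(probabilities, target, step=0.1, tol=0.05, max_iter=1000):
    """Adjust weights until the combined probability is within `tol` of
    the target value — a toy version of the patent's adjustment loop."""
    weights = [1.0] * len(probabilities)  # initial weights
    for _ in range(max_iter):
        diff = weighted_ensemble(probabilities, weights) - target
        if abs(diff) <= tol:
            break
        # Nudge down the weight of classifiers on the wrong side of the target.
        for i, p in enumerate(probabilities):
            if (p - target) * diff > 0:
                weights[i] = max(0.01, weights[i] - step)
    return weights
```

With per-classifier probabilities `[0.9, 0.2, 0.4]` and a target of `0.3`, the loop settles on weights whose combined output lies within the tolerance band.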
In addition, in one embodiment, the step of obtaining the title feature vector of the title text and the body feature vector of the body text can be: segmenting the title text and the body text respectively to obtain the first feature word set of the title text and the second feature word set of the body text; obtaining, according to the preset positive/negative keyword library and the preset word vector tool, the first word vector of each feature word in the first feature word set and the second word vector of each feature word in the second feature word set; and averaging the first word vectors to obtain the title feature vector, and the second word vectors to obtain the body feature vector.
In this embodiment, the positive and negative keywords strengthen feature word matching: not only can positive results be matched, but by providing the corresponding negative words, the negative word for a specific word can still be matched when the feature word itself is not matched. This improves the matching efficiency of the feature words and therefore makes the constructed feature vectors more accurate.
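Averaging the word vectors into a title (or body) feature vector, as in the embodiment above, is a single mean per dimension. A sketch with hypothetical 3-dimensional embeddings:

```python
def mean_vector(word_vectors):
    """Average per-word embeddings into a single feature vector."""
    dim = len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / len(word_vectors)
            for i in range(dim)]

# Hypothetical embeddings for two feature words of a title text:
title_vecs = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
title_feature = mean_vector(title_vecs)  # one vector per title
```

The same function applied to the body's word vectors yields the body feature vector; splicing the two gives the word vector similarity feature described earlier.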
In one embodiment, as shown in Fig. 5, a schematic flowchart of a platform news pushing scheme based on the text classification method is provided. The specific steps are as follows:
Step 502: receive the news text to be pushed, the news text including a news title and a body.
The news text sources, such as Sina or www.xinhuanet.com, can be preset, and each news article is then saved in the terminal as one news text.
Step 504, extract the text size feature of newsletter archive, keyword words-frequency feature, term vector similarity feature,
TF-IDF weight feature, the Probability Characteristics of LDA model and informed source feature.
Step 506: obtain the fusion feature of the news text according to the text length feature, keyword word-frequency feature, word vector similarity feature, TF-IDF weight feature, probability distribution feature of the LDA model and information source feature.
As the fusion method, each text feature can first be vectorized, and the vectors are then spliced to obtain the fusion feature.
Step 508: input the fusion feature into the classifier library, rank the classifiers according to the probability each classifier in the library outputs for the preset label, and fuse the top three classifiers to obtain the integrated classifier.
Wherein it is possible to be merged by the way of weighting, weight is arranged in as each classifier, to classifier output
As a result it is weighted.
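The rank-then-weight fusion of step 508 might look like the following sketch. The interfaces are assumptions (each classifier is modeled as a callable returning its probability for the preset label), and weighting by relative confidence is one possible weighting scheme, not necessarily the patent's:

```python
# Rank classifiers by the probability they output for the preset label,
# keep the top three, and combine them as a confidence-weighted sum.
def build_fused_scorer(classifiers, feature, top_k=3):
    scored = sorted(classifiers, key=lambda c: c(feature), reverse=True)
    top = scored[:top_k]
    probs = [c(feature) for c in top]
    total = sum(probs)
    weights = [p / total for p in probs]  # weight by relative confidence

    def fused(x):
        return sum(w * c(x) for w, c in zip(weights, top))
    return fused

# Toy library: constant-output stand-ins for trained classifiers.
library = [lambda x: 0.9, lambda x: 0.4, lambda x: 0.7, lambda x: 0.6]
fused = build_fused_scorer(library, feature=None)
print(round(fused(None), 3))
```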
Step 510: perform classification prediction on the news text according to the output of the fusion classifier. If the predicted category of the news text meets the platform's regulatory requirements, the news text is published on the platform; if it does not, the news text is not published.
In this embodiment, classifying news texts enables monitoring of the news published on the platform and guarantees the quality of the platform's news.
In another embodiment, a correction strategy may also be configured when the news text is pushed. The correction strategy may be sensitive-word filtering: by detecting whether the news text contains sensitive words, it is determined whether the news text is pushed to the platform.
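A minimal sketch of such a sensitive-word filter, with an illustrative placeholder word list (the actual list and tokenization are not specified by the embodiment):

```python
# Push a news text only if it contains none of the sensitive words.
SENSITIVE_WORDS = {"violence", "gambling"}  # placeholder list

def may_push(news_tokens):
    """True if the tokenized news text shares no word with the list."""
    return SENSITIVE_WORDS.isdisjoint(news_tokens)

print(may_push(["market", "news"]))       # → True
print(may_push(["illegal", "gambling"]))  # → False
```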
It should be understood that although the steps in the flowcharts of Figs. 2, 3 and 5 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated. Unless expressly stated otherwise herein, there is no strict ordering constraint on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in Figs. 2, 3 and 5 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and their execution order is not necessarily sequential: they may be executed in turn or alternately with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in Fig. 6, a text classification apparatus is provided, comprising a feature fusion module 602, a classifier selection module 604, a classifier fusion module 606, an output module 608 and a classification module 610, in which:
the feature fusion module 602 is configured to select a text feature combination from a preset text feature library and extract, from the text to be classified, the fusion feature corresponding to the text feature combination;
the classifier selection module 604 is configured to select multiple pre-trained classifiers from a preset classifier library according to the text feature combination;
the classifier fusion module 606 is configured to obtain a fusion classifier according to the classifiers;
the output module 608 is configured to input the fusion feature into the fusion classifier to obtain the probabilities of multiple preset labels, each preset label corresponding to one text type;
the classification module 610 is configured to determine the text type of the text to be classified according to the preset label with the highest probability.
In one embodiment, labeled texts are selected from a preset corpus; a classifier is trained according to the target labels of the labeled texts and a preset termination condition; and when the probability the classifier outputs for the target label meets the termination condition, the trained classifier is obtained.
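The train-until-termination loop can be sketched as follows. Everything here is an assumption for illustration: the learner interface (`train_one_round`, `prob_of_target`), the threshold form of the termination condition, and the toy classifier whose confidence simply grows each round:

```python
class ToyClassifier:
    """Stand-in learner: its target-label probability rises each round."""
    def __init__(self):
        self.p = 0.0

    def train_one_round(self, labeled_texts):
        self.p = min(1.0, self.p + 0.25)

    def prob_of_target(self, target_label):
        return self.p

def train_until(classifier, labeled_texts, target_label,
                threshold=0.9, max_rounds=100):
    """Train until the classifier's probability for the target label
    meets the preset termination condition (here: a threshold)."""
    for _ in range(max_rounds):
        classifier.train_one_round(labeled_texts)
        if classifier.prob_of_target(target_label) >= threshold:
            break  # termination condition met
    return classifier

clf = train_until(ToyClassifier(), [("text", "sports")], "sports")
print(clf.prob_of_target("sports"))  # → 1.0
```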
In one embodiment, the classifier selection module 604 is further configured to extract the multiple text feature combinations corresponding to the labeled texts; input each text feature combination in turn into each trained classifier in the classifier library; rank the trained classifiers according to the probability they output for the target label, filter out the classifiers that meet a preset condition, and establish the correspondence between the text feature combination and the multiple classifiers; and query the correspondence according to the text feature combination to select multiple pre-trained classifiers from the preset classifier library.
In one embodiment, the text feature library includes a text length feature, a keyword word-frequency feature, a word-vector similarity feature, a TF-IDF weight feature, an LDA-model probability distribution feature and a news source feature. The feature fusion module 602 is further configured to select two or more of these features from the text feature library to obtain a text feature combination; extract each text feature in the combination from the text to be classified; and combine the extracted text features to obtain the fusion feature.
In one embodiment, the text to be classified includes a title text and a body text. The feature fusion module 602 is further configured to: obtain the title text length and body text length of the text to be classified, obtain a title length vector and a body length vector from them respectively, and concatenate the two vectors to obtain the text length feature of the text to be classified; or obtain a preset keyword table, match the title text and the body text against the keyword table to obtain the frequencies in the text to be classified of the keywords in the table, and vectorize the frequencies to obtain the keyword word-frequency feature; or obtain the title feature vector of the title text and the body feature vector of the body text, and concatenate the two to obtain the word-vector similarity feature; or obtain the TF-IDF weight in a preset corpus of each keyword in the text to be classified, obtain the mean TF-IDF weight of the text to be classified from the mean of these weights, and vectorize the mean to obtain the TF-IDF weight feature of the text to be classified; or input the text to be classified into a preset LDA model to obtain the probability distribution of the text over the preset topics, and vectorize the distribution to obtain the LDA-model probability distribution feature of the text to be classified; or obtain the news source of the text to be classified, obtain its source number according to a preset coding rule, and vectorize the source number to obtain the news source feature.
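Two of these alternative features can be sketched concretely. All data, names and the keyword table below are made up for illustration; the keyword word-frequency feature counts preset-table keywords across title and body, and the TF-IDF feature averages precomputed corpus weights:

```python
from collections import Counter

def keyword_freq_feature(title_tokens, body_tokens, keyword_table):
    """Vectorize, in table order, how often each table keyword occurs
    in the title plus the body."""
    counts = Counter(title_tokens) + Counter(body_tokens)
    return [counts[k] for k in keyword_table]

def mean_tfidf_feature(tfidf_weights):
    """Vectorize the mean of the keywords' corpus TF-IDF weights."""
    return [sum(tfidf_weights) / len(tfidf_weights)]

table = ["finance", "stock"]
print(keyword_freq_feature(["stock", "news"], ["stock", "finance"], table))  # → [1, 2]
print(mean_tfidf_feature([0.2, 0.4, 0.6]))
```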
In one embodiment, the output module 608 is further configured to calculate the weight of each of the multiple classifiers according to a preset weighting algorithm, and to weight the classifiers by these weights to obtain the fusion classifier.
In one embodiment, the feature fusion module 602 is further configured to segment the title text and the body text separately to obtain the first feature-word set of the title text and the second feature-word set of the body text; obtain, according to a preset positive/negative keyword library and a preset word-vector tool, the first word vector of each feature word in the first feature-word set and the second word vector of each feature word in the second feature-word set; and average the first word vectors to obtain the title feature vector and average the second word vectors to obtain the body feature vector.
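The averaging step can be sketched as follows; the two-dimensional word vectors are made up (real word-vector tools produce much higher dimensions):

```python
import numpy as np

# Average the word vectors of the segmented feature words to get one
# fixed-size feature vector for the title (or body).
def avg_vector(word_vectors):
    return np.mean(np.stack(word_vectors), axis=0)

title_vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(avg_vector(title_vecs))  # → [0.5 0.5]
```

Averaging makes the feature vector independent of the number of words, so titles and bodies of any length map to vectors of the same dimension.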
For the specific limitations of the text classification apparatus, refer to the limitations of the text classification method above, which are not repeated here. Each module of the apparatus may be implemented wholly or partly in software, hardware, or a combination of the two. The modules may be embedded in, or independent of, the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke them to perform the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Fig. 7. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor provides computing and control capability. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database stores the text data to be classified. The network interface communicates with external terminals through a network connection. The computer program, when executed by the processor, implements a text classification method.
Those skilled in the art will understand that the structure shown in Fig. 7 is only a block diagram of the part of the structure relevant to the present solution and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program, and the processor, when executing the computer program, performs the following steps:
selecting a text feature combination from a preset text feature library, and extracting, from the text to be classified, the fusion feature corresponding to the text feature combination;
selecting multiple pre-trained classifiers from a preset classifier library according to the text feature combination;
obtaining a fusion classifier according to the classifiers;
inputting the fusion feature into the fusion classifier to obtain the probabilities of multiple preset labels, each preset label corresponding to one text type;
determining the text type of the text to be classified according to the preset label with the highest probability.
In one embodiment, the processor, when executing the computer program, further performs the steps of: selecting labeled texts from a preset corpus; training a classifier according to the target labels of the labeled texts and a preset termination condition; and obtaining the trained classifier when the probability the classifier outputs for the target label meets the termination condition.
In one embodiment, the processor, when executing the computer program, further performs the steps of: extracting the multiple text feature combinations corresponding to the labeled texts; inputting each text feature combination in turn into each trained classifier in the classifier library; ranking the trained classifiers according to the probability they output for the target label, filtering out the classifiers that meet a preset condition, and establishing the correspondence between the text feature combination and the multiple classifiers; and querying the correspondence according to the text feature combination to select multiple pre-trained classifiers from the preset classifier library.
In one embodiment, the text feature library includes a text length feature, a keyword word-frequency feature, a word-vector similarity feature, a TF-IDF weight feature, an LDA-model probability distribution feature and a news source feature. The processor, when executing the computer program, further performs the steps of: selecting two or more of these features from the text feature library to obtain a text feature combination; extracting each text feature in the combination from the text to be classified; and combining the extracted text features to obtain the fusion feature.
In one embodiment, the text to be classified includes a title text and a body text. The processor, when executing the computer program, further performs the steps of: obtaining the title text length and body text length of the text to be classified, obtaining a title length vector and a body length vector from them respectively, and concatenating the two vectors to obtain the text length feature of the text to be classified; or obtaining a preset keyword table, matching the title text and the body text against the keyword table to obtain the frequencies in the text to be classified of the keywords in the table, and vectorizing the frequencies to obtain the keyword word-frequency feature; or obtaining the title feature vector of the title text and the body feature vector of the body text, and concatenating the two to obtain the word-vector similarity feature; or obtaining the TF-IDF weight in a preset corpus of each keyword in the text to be classified, obtaining the mean TF-IDF weight of the text to be classified from the mean of these weights, and vectorizing the mean to obtain the TF-IDF weight feature of the text to be classified; or inputting the text to be classified into a preset LDA model to obtain the probability distribution of the text over the preset topics, and vectorizing the distribution to obtain the LDA-model probability distribution feature of the text to be classified; or obtaining the news source of the text to be classified, obtaining its source number according to a preset coding rule, and vectorizing the source number to obtain the news source feature.
In one embodiment, the processor, when executing the computer program, further performs the steps of: calculating the weight of each of the multiple classifiers according to a preset weighting algorithm; and weighting the classifiers by these weights to obtain the fusion classifier.
In one embodiment, the processor, when executing the computer program, further performs the steps of: segmenting the title text and the body text separately to obtain the first feature-word set of the title text and the second feature-word set of the body text; obtaining, according to a preset positive/negative keyword library and a preset word-vector tool, the first word vector of each feature word in the first feature-word set and the second word vector of each feature word in the second feature-word set; and averaging the first word vectors to obtain the title feature vector and averaging the second word vectors to obtain the body feature vector.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. The computer program, when executed by a processor, performs the following steps:
selecting a text feature combination from a preset text feature library, and extracting, from the text to be classified, the fusion feature corresponding to the text feature combination;
selecting multiple pre-trained classifiers from a preset classifier library according to the text feature combination;
obtaining a fusion classifier according to the classifiers;
inputting the fusion feature into the fusion classifier to obtain the probabilities of multiple preset labels, each preset label corresponding to one text type;
determining the text type of the text to be classified according to the preset label with the highest probability.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: selecting labeled texts from a preset corpus; training a classifier according to the target labels of the labeled texts and a preset termination condition; and obtaining the trained classifier when the probability the classifier outputs for the target label meets the termination condition.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: extracting the multiple text feature combinations corresponding to the labeled texts; inputting each text feature combination in turn into each trained classifier in the classifier library; ranking the trained classifiers according to the probability they output for the target label, filtering out the classifiers that meet a preset condition, and establishing the correspondence between the text feature combination and the multiple classifiers; and querying the correspondence according to the text feature combination to select multiple pre-trained classifiers from the preset classifier library.
In one embodiment, the text feature library includes a text length feature, a keyword word-frequency feature, a word-vector similarity feature, a TF-IDF weight feature, an LDA-model probability distribution feature and a news source feature. The computer program, when executed by the processor, further performs the steps of: selecting two or more of these features from the text feature library to obtain a text feature combination; extracting each text feature in the combination from the text to be classified; and combining the extracted text features to obtain the fusion feature.
In one embodiment, the text to be classified includes a title text and a body text. The computer program, when executed by the processor, further performs the steps of: obtaining the title text length and body text length of the text to be classified, obtaining a title length vector and a body length vector from them respectively, and concatenating the two vectors to obtain the text length feature of the text to be classified; or obtaining a preset keyword table, matching the title text and the body text against the keyword table to obtain the frequencies in the text to be classified of the keywords in the table, and vectorizing the frequencies to obtain the keyword word-frequency feature; or obtaining the title feature vector of the title text and the body feature vector of the body text, and concatenating the two to obtain the word-vector similarity feature; or obtaining the TF-IDF weight in a preset corpus of each keyword in the text to be classified, obtaining the mean TF-IDF weight of the text to be classified from the mean of these weights, and vectorizing the mean to obtain the TF-IDF weight feature of the text to be classified; or inputting the text to be classified into a preset LDA model to obtain the probability distribution of the text over the preset topics, and vectorizing the distribution to obtain the LDA-model probability distribution feature of the text to be classified; or obtaining the news source of the text to be classified, obtaining its source number according to a preset coding rule, and vectorizing the source number to obtain the news source feature.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: calculating the weight of each of the multiple classifiers according to a preset weighting algorithm; and weighting the classifiers by these weights to obtain the fusion classifier.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: segmenting the title text and the body text separately to obtain the first feature-word set of the title text and the second feature-word set of the body text; obtaining, according to a preset positive/negative keyword library and a preset word-vector tool, the first word vector of each feature word in the first feature-word set and the second word vector of each feature word in the second feature-word set; and averaging the first word vectors to obtain the title feature vector and averaging the second word vectors to obtain the body feature vector.
Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium, and when executed it may include the processes of the above method embodiments. Any reference to memory, storage, a database or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the present application patent shall be subject to the appended claims.
Claims (10)
1. A text classification method, the method comprising:
selecting a text feature combination from a preset text feature library, and extracting, from a text to be classified, a fusion feature corresponding to the text feature combination;
selecting multiple pre-trained classifiers from a preset classifier library according to the text feature combination;
obtaining a fusion classifier according to the classifiers;
inputting the fusion feature into the fusion classifier to obtain probabilities of multiple preset labels, each preset label corresponding to one text type;
determining the text type of the text to be classified according to the preset label with the highest probability.
2. The method according to claim 1, wherein the step of training a classifier comprises:
selecting labeled texts from a preset corpus;
training a classifier according to target labels of the labeled texts and a preset termination condition;
obtaining the trained classifier when the probability the classifier outputs for the target label meets the termination condition.
3. The method according to claim 2, wherein the method further comprises:
extracting multiple text feature combinations corresponding to the labeled texts;
inputting each text feature combination in turn into each trained classifier in the classifier library;
ranking the trained classifiers according to the probability they output for the target label, filtering out classifiers that meet a preset condition, and establishing a correspondence between the text feature combination and the multiple classifiers;
and wherein the selecting multiple pre-trained classifiers from a preset classifier library according to the text feature combination comprises:
querying the correspondence according to the text feature combination to select multiple pre-trained classifiers from the preset classifier library.
4. The method according to claim 1, wherein the text feature library comprises a text length feature, a keyword word-frequency feature, a word-vector similarity feature, a TF-IDF weight feature, an LDA-model probability distribution feature and a news source feature;
the selecting a text feature combination from a preset text feature library, and extracting, from the text to be classified, a fusion feature corresponding to the text feature combination comprises:
selecting two or more of the text length feature, keyword word-frequency feature, word-vector similarity feature, TF-IDF weight feature, LDA-model probability distribution feature and news source feature from the text feature library to obtain a text feature combination;
extracting each text feature in the text feature combination from the text to be classified;
combining the extracted text features to obtain the fusion feature.
5. The method according to claim 4, wherein the text to be classified comprises a title text and a body text;
the extracting, from the text to be classified, a fusion feature corresponding to the text feature combination comprises:
obtaining the title text length and the body text length of the text to be classified; obtaining a title length vector and a body length vector according to the title text length and the body text length respectively; concatenating the title length vector and the body length vector to obtain the text length feature of the text to be classified;
or,
obtaining a preset keyword table, matching the title text and the body text according to the keyword table to obtain the frequencies, in the text to be classified, of the keywords in the keyword table; vectorizing the frequencies to obtain the keyword word-frequency feature;
or,
obtaining the title feature vector of the title text and the body feature vector of the body text, and concatenating the title feature vector and the body feature vector to obtain the word-vector similarity feature;
or,
obtaining the TF-IDF weight, in a preset corpus, of each keyword in the text to be classified; obtaining the mean TF-IDF weight of the text to be classified according to the mean of the TF-IDF weights of the keywords; vectorizing the mean TF-IDF weight to obtain the TF-IDF weight feature of the text to be classified;
or,
inputting the text to be classified into a preset LDA model to obtain the probability distribution of the text to be classified over the preset topics; vectorizing the probability distribution to obtain the LDA-model probability distribution feature of the text to be classified;
or,
obtaining the news source of the text to be classified; obtaining the source number of the news source according to a preset coding rule; vectorizing the source number to obtain the news source feature.
6. The method according to any one of claims 1 to 5, wherein the obtaining a fusion classifier according to the classifiers comprises:
calculating a weight of each of the classifiers according to a preset weighting algorithm;
weighting the classifiers by the weights to obtain the fusion classifier.
7. The method according to claim 5, wherein the obtaining the title feature vector of the title text and the body feature vector of the body text comprises:
segmenting the title text and the body text separately to obtain a first feature-word set of the title text and a second feature-word set of the body text;
obtaining, according to a preset positive/negative keyword library and a preset word-vector tool, a first word vector of each feature word in the first feature-word set and a second word vector of each feature word in the second feature-word set;
averaging the first word vectors to obtain the title feature vector, and averaging the second word vectors to obtain the body feature vector.
8. A text classification apparatus, wherein the apparatus comprises:
a feature fusion module, configured to select a text feature combination from a preset text feature library and extract, from a text to be classified, a fusion feature corresponding to the text feature combination;
a classifier selection module, configured to select multiple pre-trained classifiers from a preset classifier library according to the text feature combination;
a classifier fusion module, configured to obtain a fusion classifier according to the classifiers;
an output module, configured to input the fusion feature into the fusion classifier to obtain probabilities of multiple preset labels, each preset label corresponding to one text type;
a classification module, configured to determine the text type of the text to be classified according to the preset label with the highest probability.
9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
Priority Applications (2)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811258359.3A CN109543032A (en) | 2018-10-26 | 2018-10-26 | Text classification method, apparatus, computer device and storage medium |
| PCT/CN2018/123353 WO2020082569A1 (en) | 2018-10-26 | 2018-12-25 | Text classification method, apparatus, computer device and storage medium |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN109543032A (en) | 2019-03-29 |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112613501A (en) * | 2020-12-21 | 2021-04-06 | 深圳壹账通智能科技有限公司 | Information auditing classification model construction method and information auditing method |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3392780A3 (en) * | 2017-04-19 | 2018-11-07 | Tata Consultancy Services Limited | Systems and methods for classification of software defect reports |
CN107545038B (en) * | 2017-07-31 | 2019-12-10 | 中国农业大学 | Text classification method and equipment |
CN108520030B (en) * | 2018-03-27 | 2022-02-11 | 深圳中兴网信科技有限公司 | Text classification method, text classification system and computer device |
CN108595632B (en) * | 2018-04-24 | 2022-05-24 | 福州大学 | Hybrid neural network text classification method fusing abstract and main body characteristics |
- 2018-10-26: CN application CN201811258359.3A filed; published as CN109543032A (en), status Pending
- 2018-12-25: PCT application PCT/CN2018/123353 filed; published as WO2020082569A1 (en), status Application Filing
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105373800A (en) * | 2014-08-28 | 2016-03-02 | 百度在线网络技术(北京)有限公司 | Classification method and device |
US20160132788A1 (en) * | 2014-11-07 | 2016-05-12 | Xerox Corporation | Methods and systems for creating a classifier capable of predicting personality type of users |
CN104951542A (en) * | 2015-06-19 | 2015-09-30 | 百度在线网络技术(北京)有限公司 | Method and device for recognizing class of social contact short texts and method and device for training classification models |
CN107908715A (en) * | 2017-11-10 | 2018-04-13 | 中国民航大学 | Microblog emotional polarity discriminating method based on Adaboost and grader Weighted Fusion |
CN108171280A (en) * | 2018-01-31 | 2018-06-15 | 国信优易数据有限公司 | A kind of grader construction method and the method for prediction classification |
CN108388914A (en) * | 2018-02-26 | 2018-08-10 | 中译语通科技股份有限公司 | A kind of grader construction method, grader based on semantic computation |
Non-Patent Citations (1)
Title |
---|
TANG Chunsheng, JIN Yihui: "A Multi-Classifier Ensemble Method Based on the Full Information Matrix", Journal of Software (软件学报), no. 06, 23 June 2003 (2003-06-23), pages 1103 - 1109 *
Cited By (53)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110134785A (en) * | 2019-04-15 | 2019-08-16 | 平安普惠企业管理有限公司 | Management method, device, storage medium and the equipment of forum's article |
WO2020215563A1 (en) * | 2019-04-24 | 2020-10-29 | 平安科技(深圳)有限公司 | Training sample generation method and device for text classification, and computer apparatus |
CN110795558A (en) * | 2019-09-03 | 2020-02-14 | 腾讯科技(深圳)有限公司 | Label acquisition method and device, storage medium and electronic device |
CN110795558B (en) * | 2019-09-03 | 2023-09-29 | 腾讯科技(深圳)有限公司 | Label acquisition method and device, storage medium and electronic device |
CN110569361B (en) * | 2019-09-06 | 2021-10-19 | 腾讯科技(深圳)有限公司 | Text recognition method and equipment |
CN110569361A (en) * | 2019-09-06 | 2019-12-13 | 腾讯科技(深圳)有限公司 | Text recognition method and equipment |
CN110750643A (en) * | 2019-09-29 | 2020-02-04 | 上证所信息网络有限公司 | Method and device for classifying non-periodic announcements of listed companies and storage medium |
CN110750643B (en) * | 2019-09-29 | 2024-02-09 | 上证所信息网络有限公司 | Method, device and storage medium for classifying non-periodic announcements of marketing companies |
CN111008329A (en) * | 2019-11-22 | 2020-04-14 | 厦门美柚股份有限公司 | Page content recommendation method and device based on content classification |
CN110969208A (en) * | 2019-11-29 | 2020-04-07 | 支付宝(杭州)信息技术有限公司 | Fusion method and device for multiple model results |
CN110969208B (en) * | 2019-11-29 | 2022-04-12 | 支付宝(杭州)信息技术有限公司 | Fusion method and device for multiple model results |
CN111078878A (en) * | 2019-12-06 | 2020-04-28 | 北京百度网讯科技有限公司 | Text processing method, device and equipment and computer readable storage medium |
CN111078878B (en) * | 2019-12-06 | 2023-07-04 | 北京百度网讯科技有限公司 | Text processing method, device, equipment and computer readable storage medium |
CN111191004A (en) * | 2019-12-27 | 2020-05-22 | 咪咕文化科技有限公司 | Text label extraction method and device and computer readable storage medium |
CN111191004B (en) * | 2019-12-27 | 2023-09-22 | 咪咕文化科技有限公司 | Text label extraction method, text label extraction device and computer readable storage medium |
CN111143568A (en) * | 2019-12-31 | 2020-05-12 | 郑州工程技术学院 | Method, device and equipment for buffering during paper classification and storage medium |
CN111353301A (en) * | 2020-02-24 | 2020-06-30 | 成都网安科技发展有限公司 | Auxiliary secret fixing method and device |
CN111309914B (en) * | 2020-03-03 | 2023-05-09 | 支付宝(杭州)信息技术有限公司 | Classification method and device for multi-round conversations based on multiple model results |
CN111309914A (en) * | 2020-03-03 | 2020-06-19 | 支付宝(杭州)信息技术有限公司 | Method and device for classifying multiple rounds of conversations based on multiple model results |
CN111401040B (en) * | 2020-03-17 | 2021-06-18 | 上海爱数信息技术股份有限公司 | Keyword extraction method suitable for word text |
CN111401040A (en) * | 2020-03-17 | 2020-07-10 | 上海爱数信息技术股份有限公司 | Keyword extraction method suitable for word text |
CN111475651A (en) * | 2020-04-08 | 2020-07-31 | 掌阅科技股份有限公司 | Text classification method, computing device and computer storage medium |
CN111475651B (en) * | 2020-04-08 | 2023-04-07 | 掌阅科技股份有限公司 | Text classification method, computing device and computer storage medium |
CN111581381B (en) * | 2020-04-29 | 2023-10-10 | 北京字节跳动网络技术有限公司 | Method and device for generating training set of text classification model and electronic equipment |
CN111581381A (en) * | 2020-04-29 | 2020-08-25 | 北京字节跳动网络技术有限公司 | Method and device for generating training set of text classification model and electronic equipment |
CN111666748B (en) * | 2020-05-12 | 2022-09-13 | 武汉大学 | Construction method of automatic classifier and decision recognition method |
CN111666748A (en) * | 2020-05-12 | 2020-09-15 | 武汉大学 | Construction method of automatic classifier and method for recognizing decision from software development text product |
CN111680502B (en) * | 2020-05-14 | 2023-09-22 | 深圳平安通信科技有限公司 | Text processing method and related device |
CN111680502A (en) * | 2020-05-14 | 2020-09-18 | 深圳平安通信科技有限公司 | Text processing method and related device |
CN111611801A (en) * | 2020-06-02 | 2020-09-01 | 腾讯科技(深圳)有限公司 | Method, device, server and storage medium for identifying text region attribute |
CN111797229A (en) * | 2020-06-10 | 2020-10-20 | 南京擎盾信息科技有限公司 | Text representation method and device and text classification method |
CN111966830A (en) * | 2020-06-30 | 2020-11-20 | 北京来也网络科技有限公司 | Text classification method, device, equipment and medium combining RPA and AI |
CN111651566A (en) * | 2020-08-10 | 2020-09-11 | 四川大学 | Multi-task small sample learning-based referee document dispute focus extraction method |
CN111651566B (en) * | 2020-08-10 | 2020-12-01 | 四川大学 | Multi-task small sample learning-based referee document dispute focus extraction method |
CN112749558B (en) * | 2020-09-03 | 2023-11-24 | 腾讯科技(深圳)有限公司 | Target content acquisition method, device, computer equipment and storage medium |
CN112749558A (en) * | 2020-09-03 | 2021-05-04 | 腾讯科技(深圳)有限公司 | Target content acquisition method and device, computer equipment and storage medium |
CN112328787B (en) * | 2020-11-04 | 2024-02-20 | 中国平安人寿保险股份有限公司 | Text classification model training method and device, terminal equipment and storage medium |
CN112328787A (en) * | 2020-11-04 | 2021-02-05 | 中国平安人寿保险股份有限公司 | Text classification model training method and device, terminal equipment and storage medium |
CN112347255A (en) * | 2020-11-06 | 2021-02-09 | 天津大学 | Text classification method based on title and text combination of graph network |
CN112905793A (en) * | 2021-02-23 | 2021-06-04 | 山西同方知网数字出版技术有限公司 | Case recommendation method and system based on Bilstm + Attention text classification |
CN112905793B (en) * | 2021-02-23 | 2023-06-20 | 山西同方知网数字出版技术有限公司 | Case recommendation method and system based on bilstm+attention text classification |
CN112966766A (en) * | 2021-03-18 | 2021-06-15 | 北京三快在线科技有限公司 | Article classification method, apparatus, server and storage medium |
CN113064993A (en) * | 2021-03-23 | 2021-07-02 | 南京视察者智能科技有限公司 | Design method, optimization method and labeling method of automatic text classification labeling system based on big data |
CN113064993B (en) * | 2021-03-23 | 2023-07-21 | 南京视察者智能科技有限公司 | Design method, optimization method and labeling method of automatic text classification labeling system based on big data |
CN113239200A (en) * | 2021-05-20 | 2021-08-10 | 东北农业大学 | Content identification and classification method, device and system and storage medium |
CN113157927B (en) * | 2021-05-27 | 2023-10-31 | 中国平安人寿保险股份有限公司 | Text classification method, apparatus, electronic device and readable storage medium |
CN113157927A (en) * | 2021-05-27 | 2021-07-23 | 中国平安人寿保险股份有限公司 | Text classification method and device, electronic equipment and readable storage medium |
CN113935307A (en) * | 2021-09-16 | 2022-01-14 | 有米科技股份有限公司 | Method and device for extracting features of advertisement case |
CN116468037A (en) * | 2023-03-17 | 2023-07-21 | 北京深维智讯科技有限公司 | NLP-based data processing method and system |
CN116304717B (en) * | 2023-05-09 | 2023-12-15 | 北京搜狐新媒体信息技术有限公司 | Text classification method and device, storage medium and electronic equipment |
CN116304717A (en) * | 2023-05-09 | 2023-06-23 | 北京搜狐新媒体信息技术有限公司 | Text classification method and device, storage medium and electronic equipment |
CN117236329A (en) * | 2023-11-15 | 2023-12-15 | 阿里巴巴达摩院(北京)科技有限公司 | Text classification method and device and related equipment |
CN117236329B (en) * | 2023-11-15 | 2024-02-06 | 阿里巴巴达摩院(北京)科技有限公司 | Text classification method and device and related equipment |
Also Published As
Publication number | Publication date |
---|---|
WO2020082569A1 (en) | 2020-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109543032A (en) | File classification method, device, computer equipment and storage medium | |
CN110147445A (en) | Intension recognizing method, device, equipment and storage medium based on text classification | |
CN110489550A (en) | File classification method, device and computer equipment based on combination neural net | |
CN110377730A (en) | Case is by classification method, device, computer equipment and storage medium | |
CN108509482A (en) | Question classification method, device, computer equipment and storage medium | |
CN109522406A (en) | Text semantic matching process, device, computer equipment and storage medium | |
CN110209805A (en) | File classification method, device, storage medium and computer equipment | |
CN108491406B (en) | Information classification method and device, computer equipment and storage medium | |
CN110399609A (en) | Intension recognizing method, device, equipment and computer readable storage medium | |
CN105787025A (en) | Network platform public account classifying method and device | |
CN108629693A (en) | Automatically generate method, apparatus, computer equipment and the storage medium of suggestion for investment | |
CN114240101A (en) | Risk identification model verification method, device and equipment | |
CN107679209B (en) | Classification expression generation method and device | |
CN111400449B (en) | Regular expression extraction method and device | |
CN111899027A (en) | Anti-fraud model training method and device | |
CN113705188B (en) | Intelligent evaluation method for customs import and export commodity specification declaration | |
CN110532359A (en) | Legal provision query method, apparatus, computer equipment and storage medium | |
CN113220885A (en) | Text processing method and system | |
US20170364827A1 (en) | Scenario Analytics System | |
CN108595568A (en) | A kind of text sentiment classification method based on very big unrelated multivariate logistic regression | |
CN112685639A (en) | Activity recommendation method and device, computer equipment and storage medium | |
CN116186257A (en) | Method and system for classifying short texts based on mixed features | |
CN109522407A (en) | Business connection prediction technique, device, computer equipment and storage medium | |
CN114049215A (en) | Abnormal transaction identification method, device and application | |
CN114443803A (en) | Text information mining method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||