CN105787004A - Text classification method and device - Google Patents

Text classification method and device

Info

Publication number
CN105787004A
CN105787004A CN201610096316.4A
Authority
CN
China
Prior art keywords
classification
characteristic vector
word
text
training text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610096316.4A
Other languages
Chinese (zh)
Inventor
王茂帅
高峰
柳廷娜
于文才
甄教明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Co Ltd
Original Assignee
Inspur Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Co Ltd
Priority to CN201610096316.4A
Publication of CN105787004A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Abstract

The invention provides a text classification method and device. The method comprises: obtaining multiple training texts, wherein the multiple training texts belong to multiple categories; determining a feature vector for each category according to the training texts the category contains; performing a dimensionality-reduction operation on the feature vector of each category; calculating, according to each category after the dimensionality-reduction operation, the probability that a text to be classified belongs to each category; and assigning the text to be classified to the category with the highest probability. In this scheme, after the feature vector of each category is determined, a dimensionality-reduction operation is performed on it, so that each category is simplified and the efficiency of text classification can be improved.

Description

Text classification method and device
Technical field
The present invention relates to the field of computer technology, and in particular to a text classification method and device.
Background art
With the rapid development of computer technology, retrieving targeted information from vast repositories has become a basic need of modern society, and the information-processing technologies this relies on have become indispensable tools. Automatic text classification is the process of automatically deciding, given a set of categories, which category a text to be classified belongs to based on its content. Because a text is composed of many words, the corpus generated from a collection of texts contains a very large number of words, so when a text is represented as a vector its dimensionality is huge, which degrades computational performance. A text classification method that improves computational performance is therefore needed.
Summary of the invention
Embodiments of the present invention provide a text classification method and device to reduce the vector dimensionality of texts.
In a first aspect, an embodiment of the present invention provides a text classification method, including:
obtaining multiple training texts, wherein the multiple training texts belong to multiple categories;
determining a feature vector for each category according to the training texts the category contains;
performing a dimensionality-reduction operation on the feature vector of each category;
calculating, according to each category after the dimensionality-reduction operation, the probability that a text to be classified belongs to each category; and
assigning the text to be classified to the category with the highest probability.
Preferably, determining the feature vector of each category includes:
for each current training text in each current category, performing the following operations respectively: performing word segmentation on the current training text; and calculating the feature weight of each word in the current training text; and
forming the feature vector of the current category according to the feature weights of all words included in the current category.
Preferably, calculating the feature weight of each word in the current training text includes:
calculating the feature weight w(t, d_i) of word t in training text d_i by the following formula (a normalized TF-IDF weight, reconstructed to be consistent with the quantities defined below):

$$w(t, d_i) = \frac{tf(t, d_i)\cdot\log(N/n_t)}{\sqrt{\sum_{t'\in d_i}\left[tf(t', d_i)\cdot\log(N/n_{t'})\right]^2}}$$

where tf(t, d_i) is the word frequency of word t in training text d_i, N is the total number of training texts in the current category, n_t is the number of training texts in the current category that contain word t, and the denominator is a normalization factor.
Preferably, performing the dimensionality-reduction operation on the feature vector of each category includes:
for the feature vector of each current category, performing the following operations: calculating the mutual information A between the word corresponding to each feature component and the current category, and selecting, from the feature vector of the current category, a predetermined number of components with the largest mutual information to form the current category, thereby realizing the dimensionality-reduction operation on the feature vector of the current category; wherein

$$A = \log\left(\frac{P(W\mid C_j)}{P(W)}\right)$$

$$P(W\mid C_j) = \frac{1 + \sum_{i=1}^{|D1|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D1|} N(W_s, d_i)}$$

$$P(W) = \frac{1 + \sum_{i=1}^{|D2|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D2|} N(W_s, d_i)}$$

where P(W|C_j) is the proportion of occurrences of word W in category C_j, |D1| is the number of training texts in category C_j, N(W, d_i) is the word frequency of word W in training text d_i, |V| is the size of the vocabulary, and the double sum over |D1| is the total word frequency of all words in category C_j; |D2| is the number of training texts across all categories, and the double sum over |D2| is the total word frequency of all words across all categories.
Preferably, calculating the probability that the text to be classified belongs to each category includes:
calculating the probability that text to be classified d_m belongs to category C_j by the following formula:

$$P(C_j\mid d_m;\hat\theta) = \frac{P(C_j\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_j;\hat\theta)^{N(W_k, d_m)}}{\sum_{r=1}^{|C|} P(C_r\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_r;\hat\theta)^{N(W_k, d_m)}}$$

where |C| is the number of categories, N(W_k, d_m) is the word frequency of feature word W_k in text d_m, and n is the total number of feature words.
In a second aspect, an embodiment of the present invention provides a text classification device, including:
an acquiring unit, configured to obtain multiple training texts, wherein the multiple training texts belong to multiple categories, and to send the obtained training texts to a determining unit;
the determining unit, configured to determine the feature vector of each category according to the training texts the category contains, and to send the feature vector of each category to a dimensionality-reduction unit;
the dimensionality-reduction unit, configured to perform a dimensionality-reduction operation on the feature vector of each category, and to send each category after the dimensionality-reduction operation to a calculating unit;
the calculating unit, configured to calculate, according to each category after the dimensionality-reduction operation, the probability that a text to be classified belongs to each category, and to send the calculated probabilities to an allocating unit; and
the allocating unit, configured to assign the text to be classified to the category with the highest probability.
Preferably, the determining unit is specifically configured to perform the following operations for each current training text in each current category: perform word segmentation on the current training text; calculate the feature weight of each word in the current training text; and form the feature vector of the current category according to the feature weights of all words included in the current category.
Preferably, the determining unit is specifically configured to calculate the feature weight w(t, d_i) of word t in training text d_i, where tf(t, d_i) is the word frequency of word t in training text d_i, N is the total number of training texts in the current category, n_t is the number of training texts in the current category that contain word t, and the denominator of the formula is a normalization factor.
Preferably, the dimensionality-reduction unit is specifically configured to perform the following operations for the feature vector of each current category: calculate the mutual information A between the word corresponding to each feature component and the current category, and select, from the feature vector of the current category, the predetermined number of components with the largest mutual information to form the current category, thereby realizing the dimensionality-reduction operation on the feature vector of the current category; wherein

$$A = \log\left(\frac{P(W\mid C_j)}{P(W)}\right)$$

$$P(W\mid C_j) = \frac{1 + \sum_{i=1}^{|D1|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D1|} N(W_s, d_i)}$$

$$P(W) = \frac{1 + \sum_{i=1}^{|D2|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D2|} N(W_s, d_i)}$$

where P(W|C_j) is the proportion of occurrences of word W in category C_j, |D1| is the number of training texts in category C_j, N(W, d_i) is the word frequency of word W in training text d_i, |V| is the size of the vocabulary, and the double sum over |D1| is the total word frequency of all words in category C_j; |D2| is the number of training texts across all categories, and the double sum over |D2| is the total word frequency of all words across all categories.
Preferably, the calculating unit is specifically configured to calculate the probability that text to be classified d_m belongs to category C_j by the following formula:

$$P(C_j\mid d_m;\hat\theta) = \frac{P(C_j\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_j;\hat\theta)^{N(W_k, d_m)}}{\sum_{r=1}^{|C|} P(C_r\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_r;\hat\theta)^{N(W_k, d_m)}}$$

where |C| is the number of categories, N(W_k, d_m) is the word frequency of feature word W_k in text d_m, and n is the total number of feature words.
Embodiments of the present invention provide a text classification method and device in which, after the feature vector of each category is determined, a dimensionality-reduction operation is performed on it, so that each category is simplified and the efficiency of text classification can be improved.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below are merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of another method provided by an embodiment of the present invention;
Fig. 3 is a hardware structure diagram of the equipment in which a device provided by an embodiment of the present invention resides;
Fig. 4 is a schematic structural diagram of a device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a text classification method, which may include the following steps:
Step 101: obtain multiple training texts, wherein the multiple training texts belong to multiple categories;
Step 102: determine the feature vector of each category according to the training texts the category contains;
Step 103: perform a dimensionality-reduction operation on the feature vector of each category;
Step 104: calculate, according to each category after the dimensionality-reduction operation, the probability that the text to be classified belongs to each category;
Step 105: assign the text to be classified to the category with the highest probability.
According to the scheme provided by the above embodiment, after the feature vector of each category is determined, a dimensionality-reduction operation is performed on it, so that each category is simplified and the efficiency of text classification can be improved.
In an embodiment of the present invention, the feature vector of each category is an important parameter used to decide whether a new text belongs to that category. Determining the feature vector of each category may be implemented as follows: for each current training text in each current category, perform the following operations respectively: perform word segmentation on the current training text; calculate the feature weight of each word in the current training text; and form the feature vector of the current category according to the feature weights of all words included in the current category.
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 2, an embodiment of the present invention provides a text classification method, which may include the following steps:
Step 201: obtain multiple training texts, wherein the multiple training texts belong to multiple categories.
In this embodiment, in order to determine the category of a new text, multiple categories must be established, each containing multiple training texts; the training texts in each category serve as reference samples for deciding whether a new text belongs to that category.
Step 202: for each training text in each category, run word segmentation, remove stop words, merge words such as numerals and personal names, and count word frequencies.
In this embodiment, some words in the training texts carry no meaning for classification, for example stop words, and need to be removed. The literal weight of certain other words in a training text, for example numerals or personal names, may lead to classification errors, so such words need to be merged so that the weights calculated for them are meaningful.
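The patent provides no code for step 202; the following Python sketch illustrates the preprocessing under simplifying assumptions. A real system would use a proper Chinese word segmenter and a named-entity recognizer to merge personal names; here a regex tokenizer stands in for segmentation and only numerals are merged. `preprocess` and `STOP_WORDS` are illustrative names, not from the patent.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is"}   # illustrative stop-word list

def preprocess(text):
    """Step 202 sketch: tokenize a text, drop stop words, merge all
    numerals into one placeholder token, and count word frequencies."""
    tokens = re.findall(r"[A-Za-z]+|\d+", text.lower())
    cleaned = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue                 # stop words carry no class signal
        if tok.isdigit():
            tok = "<NUM>"            # merge numerals so they share one weight
        cleaned.append(tok)
    return Counter(cleaned)
```

The returned frequency table is exactly the per-document word-frequency statistic the later steps consume.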
Step 203: for each category, calculate the feature weight of each word in each training text.
In this embodiment, the feature weight w(t, d_i) of word t in training text d_i may be calculated by the following formula (1) (a normalized TF-IDF weight, reconstructed to be consistent with the quantities defined below):

$$w(t, d_i) = \frac{tf(t, d_i)\cdot\log(N/n_t)}{\sqrt{\sum_{t'\in d_i}\left[tf(t', d_i)\cdot\log(N/n_{t'})\right]^2}} \quad (1)$$

where tf(t, d_i) is the word frequency of word t in training text d_i, N is the total number of training texts in the current category, n_t is the number of training texts in the current category that contain word t, and the denominator is a normalization factor.
Step 204: form the feature vector of each current category according to the feature weights of all words included in that category.
The feature vector of each current category consists of the feature weights corresponding to the words that occur in that category.
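Steps 203-204 can be sketched as follows, under the assumption stated above that the weight is a cosine-normalized TF-IDF; `feature_weight` and its parameters are illustrative names. Each training text is represented as a `Counter` of word frequencies, as produced by a preprocessing step.

```python
import math
from collections import Counter

def feature_weight(term, doc_counts, class_docs):
    """Normalized TF-IDF weight of `term` in one training text (formula (1)).
    doc_counts: word-frequency Counter for that text;
    class_docs: list of Counters, one per training text in the category.
    Assumes `term` occurs in at least one text of the category."""
    N = len(class_docs)

    def raw(t):
        n_t = sum(1 for d in class_docs if t in d)   # texts containing t
        return doc_counts[t] * math.log(N / n_t)     # tf * idf

    norm = math.sqrt(sum(raw(t) ** 2 for t in doc_counts))
    return raw(term) / norm if norm else 0.0
```

The per-word weights of one text, collected over all words of the category, form that category's feature vector.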
Step 205: for each current category, calculate the mutual information A between the word corresponding to each feature component and the current category.
In this embodiment, the mutual information A may be calculated by the following formulas (2), (3) and (4):

$$A = \log\left(\frac{P(W\mid C_j)}{P(W)}\right) \quad (2)$$

$$P(W\mid C_j) = \frac{1 + \sum_{i=1}^{|D1|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D1|} N(W_s, d_i)} \quad (3)$$

$$P(W) = \frac{1 + \sum_{i=1}^{|D2|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D2|} N(W_s, d_i)} \quad (4)$$

where P(W|C_j) is the proportion of occurrences of word W in category C_j, |D1| is the number of training texts in category C_j, N(W, d_i) is the word frequency of word W in training text d_i, |V| is the size of the vocabulary, and the double sum over |D1| is the total word frequency of all words in category C_j; |D2| is the number of training texts across all categories, and the double sum over |D2| is the total word frequency of all words across all categories.
Step 206: select, from the feature vector of the current category, the predetermined number of components with the largest mutual information to form the current category, thereby realizing the dimensionality-reduction operation on the feature vector of the current category.
Because each category contains a large number of words, the dimensionality of its feature vector is high; to reduce it, the feature components with large mutual information may be selected to form the category.
The predetermined number may generally be set to a few thousand, for example 2000. An optimal value may also be determined through experiments and statistics.
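Formulas (2)-(4) and the top-k selection of step 206 can be sketched in Python as follows; documents are word-frequency `Counter`s, and the function names are illustrative. Note that the smoothing in (3) and (4) adds 1 to the word's count and the vocabulary size |V| to the total count, so both probabilities stay strictly positive.

```python
import math
from collections import Counter

def mutual_information(word, class_docs, all_docs, vocab):
    """A = log(P(W|Cj) / P(W)) with the smoothing of formulas (2)-(4).
    class_docs: Counters for the texts of category Cj;
    all_docs:   Counters for the texts of all categories;
    vocab:      the full vocabulary (|V| distinct words)."""
    V = len(vocab)

    def smoothed_prob(docs):
        occurrences = 1 + sum(d[word] for d in docs)           # 1 + sum N(W, d_i)
        total_words = V + sum(sum(d.values()) for d in docs)   # |V| + sum over all words
        return occurrences / total_words

    return math.log(smoothed_prob(class_docs) / smoothed_prob(all_docs))

def select_features(class_docs, all_docs, vocab, k=2000):
    """Step 206: keep the k words with the largest mutual information."""
    ranked = sorted(vocab,
                    key=lambda w: mutual_information(w, class_docs, all_docs, vocab),
                    reverse=True)
    return ranked[:k]
```

Words concentrated in the category score above zero, words spread across other categories score below zero, so keeping the top k retains the most category-indicative features.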
Step 207: calculate, according to each category after the dimensionality-reduction operation, the probability that the text to be classified belongs to each category.
In this embodiment, the probability that text to be classified d_m belongs to category C_j may be calculated by the following formula:

$$P(C_j\mid d_m;\hat\theta) = \frac{P(C_j\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_j;\hat\theta)^{N(W_k, d_m)}}{\sum_{r=1}^{|C|} P(C_r\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_r;\hat\theta)^{N(W_k, d_m)}}$$

where |C| is the number of categories, N(W_k, d_m) is the word frequency of feature word W_k in text d_m, and n is the total number of feature words.
Step 208: assign the text to be classified to the category with the highest probability.
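Steps 207-208 amount to a multinomial naive Bayes decision. Because the denominator of the posterior is the same for every category, it does not affect which category wins, so the sketch below compares only the log of each numerator (working in log space also avoids floating-point underflow for long texts). `classify` and its parameter names are illustrative; `class_word_probs[cj]` maps each retained feature word to P(W_k|C_j).

```python
import math
from collections import Counter

def classify(doc_counts, class_priors, class_word_probs):
    """Steps 207-208 sketch: score each category Cj by
    log P(Cj) + sum_k N(Wk, dm) * log P(Wk|Cj)
    (the log-numerator of the posterior) and return the best category."""
    best_class, best_score = None, float("-inf")
    for cj, prior in class_priors.items():
        score = math.log(prior)
        for word, freq in doc_counts.items():
            p = class_word_probs[cj].get(word)
            if p is not None:        # words outside the reduced feature set are skipped
                score += freq * math.log(p)
        if score > best_score:
            best_class, best_score = cj, score
    return best_class
```

Skipping words outside the reduced feature set is where the dimensionality reduction of step 206 pays off: only a few thousand probabilities per category need to be stored and consulted.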
As shown in Fig. 3 and Fig. 4, an embodiment of the present invention provides a text classification device. The device embodiment may be implemented by software, by hardware, or by a combination of the two. From the hardware perspective, Fig. 3 is a hardware structure diagram of the equipment in which the text classification device provided by the embodiment resides; besides the processor, memory, network interface and non-volatile storage shown in Fig. 3, the equipment may also include other hardware, such as a forwarding chip responsible for processing packets. As a software implementation, as shown in Fig. 4, the device in the logical sense is formed by the CPU of its host equipment reading the corresponding computer program instructions from non-volatile storage into memory and running them. The text classification device provided by this embodiment includes:
an acquiring unit 401, configured to obtain multiple training texts, wherein the multiple training texts belong to multiple categories, and to send the obtained training texts to a determining unit 402;
the determining unit 402, configured to determine the feature vector of each category according to the training texts the category contains, and to send the feature vector of each category to a dimensionality-reduction unit 403;
the dimensionality-reduction unit 403, configured to perform a dimensionality-reduction operation on the feature vector of each category, and to send each category after the dimensionality-reduction operation to a calculating unit 404;
the calculating unit 404, configured to calculate, according to each category after the dimensionality-reduction operation, the probability that the text to be classified belongs to each category, and to send the calculated probabilities to an allocating unit 405;
the allocating unit 405, configured to assign the text to be classified to the category with the highest probability.
The determining unit 402 is specifically configured to perform the following operations for each current training text in each current category: perform word segmentation on the current training text; calculate the feature weight of each word in the current training text; and form the feature vector of the current category according to the feature weights of all words included in the current category.
The determining unit 402 is specifically configured to calculate the feature weight w(t, d_i) of word t in training text d_i by the formula given above, where tf(t, d_i) is the word frequency of word t in training text d_i, N is the total number of training texts in the current category, n_t is the number of training texts in the current category that contain word t, and the denominator of the formula is a normalization factor.
The dimensionality-reduction unit 403 is specifically configured to perform the following operations for the feature vector of each current category: calculate the mutual information A between the word corresponding to each feature component and the current category, and select, from the feature vector of the current category, the predetermined number of components with the largest mutual information to form the current category, thereby realizing the dimensionality-reduction operation on the feature vector of the current category; wherein

$$A = \log\left(\frac{P(W\mid C_j)}{P(W)}\right)$$

$$P(W\mid C_j) = \frac{1 + \sum_{i=1}^{|D1|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D1|} N(W_s, d_i)}$$

$$P(W) = \frac{1 + \sum_{i=1}^{|D2|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D2|} N(W_s, d_i)}$$

where P(W|C_j) is the proportion of occurrences of word W in category C_j, |D1| is the number of training texts in category C_j, N(W, d_i) is the word frequency of word W in training text d_i, |V| is the size of the vocabulary, and the double sum over |D1| is the total word frequency of all words in category C_j; |D2| is the number of training texts across all categories, and the double sum over |D2| is the total word frequency of all words across all categories.
The calculating unit 404 is specifically configured to calculate the probability that text to be classified d_m belongs to category C_j by the following formula:

$$P(C_j\mid d_m;\hat\theta) = \frac{P(C_j\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_j;\hat\theta)^{N(W_k, d_m)}}{\sum_{r=1}^{|C|} P(C_r\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_r;\hat\theta)^{N(W_k, d_m)}}$$

where |C| is the number of categories, N(W_k, d_m) is the word frequency of feature word W_k in text d_m, and n is the total number of feature words.
In summary, the embodiments of the present invention can achieve at least the following beneficial effects:
1. After the feature vector of each category is determined, a dimensionality-reduction operation is performed on it, so that each category is simplified and the efficiency of text classification can be improved.
2. Texts are represented as vectors with words as features, and the resulting feature vectors are given a reduced-dimension representation, which both retains the important information of the texts and facilitates subsequent computation; classification criteria are summarized through training, and new texts are classified automatically according to these criteria; by reserving interfaces in the cloud, secure and controllable open API services can be provided externally.
3. After large-scale text information is effectively classified, a targeted personal search engine can be built, improving the precision of the system and allowing users to retrieve target information quickly and effectively.
The exchange of information between the units in the above device and their execution processes are based on the same concept as the method embodiments of the present invention; for details, refer to the description of the method embodiments, which is not repeated here.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the statement "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes that element.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by hardware driven by program instructions; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; and the storage medium includes various media capable of storing program code, such as ROM, RAM, a magnetic disk or an optical disc.
Finally, it should be understood that the above are merely preferred embodiments of the present invention, intended only to illustrate the technical solutions of the present invention and not to limit its protection scope. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A text classification method, characterized by comprising:
obtaining multiple training texts, wherein the multiple training texts belong to multiple categories;
determining a feature vector for each category according to the training texts the category contains;
performing a dimensionality-reduction operation on the feature vector of each category;
calculating, according to each category after the dimensionality-reduction operation, the probability that a text to be classified belongs to each category; and
assigning the text to be classified to the category with the highest probability.
2. The method according to claim 1, characterized in that determining the feature vector of each category includes:
for each current training text in each current category, performing the following operations respectively: performing word segmentation on the current training text; and calculating the feature weight of each word in the current training text; and
forming the feature vector of the current category according to the feature weights of all words included in the current category.
3. The method according to claim 2, characterized in that calculating the feature weight of each word in the current training text includes:
calculating the feature weight w(t, d_i) of word t in training text d_i by the formula given in the description,
where tf(t, d_i) is the word frequency of word t in training text d_i, N is the total number of training texts in the current category, n_t is the number of training texts in the current category that contain word t, and the denominator of the formula is a normalization factor.
4. The method according to claim 1, characterized in that performing the dimensionality-reduction operation on the feature vector of each category includes:
for the feature vector of each current category, performing the following operations: calculating the mutual information A between the word corresponding to each feature component and the current category, and selecting, from the feature vector of the current category, a predetermined number of components with the largest mutual information to form the current category, thereby realizing the dimensionality-reduction operation on the feature vector of the current category; wherein

$$A = \log\left(\frac{P(W\mid C_j)}{P(W)}\right)$$

$$P(W\mid C_j) = \frac{1 + \sum_{i=1}^{|D1|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D1|} N(W_s, d_i)}$$

$$P(W) = \frac{1 + \sum_{i=1}^{|D2|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D2|} N(W_s, d_i)}$$

where P(W|C_j) is the proportion of occurrences of word W in category C_j, |D1| is the number of training texts in category C_j, N(W, d_i) is the word frequency of word W in training text d_i, |V| is the size of the vocabulary, and the double sum over |D1| is the total word frequency of all words in category C_j; |D2| is the number of training texts across all categories, and the double sum over |D2| is the total word frequency of all words across all categories.
5. The method according to any one of claims 1 to 4, characterized in that calculating the probability that the text to be classified belongs to each category includes:
calculating the probability that text to be classified d_m belongs to category C_j by the following formula:

$$P(C_j\mid d_m;\hat\theta) = \frac{P(C_j\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_j;\hat\theta)^{N(W_k, d_m)}}{\sum_{r=1}^{|C|} P(C_r\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_r;\hat\theta)^{N(W_k, d_m)}}$$

where |C| is the number of categories, N(W_k, d_m) is the word frequency of feature word W_k in text d_m, and n is the total number of feature words.
6. A text classification device, characterized by comprising:
an acquiring unit, configured to obtain multiple training texts, wherein the multiple training texts belong to multiple categories, and to send the obtained training texts to a determining unit;
the determining unit, configured to determine the feature vector of each category according to the training texts the category contains, and to send the feature vector of each category to a dimensionality-reduction unit;
the dimensionality-reduction unit, configured to perform a dimensionality-reduction operation on the feature vector of each category, and to send each category after the dimensionality-reduction operation to a calculating unit;
the calculating unit, configured to calculate, according to each category after the dimensionality-reduction operation, the probability that a text to be classified belongs to each category, and to send the calculated probabilities to an allocating unit; and
the allocating unit, configured to assign the text to be classified to the category with the highest probability.
7. The text classification device according to claim 6, wherein the determining unit is specifically configured to perform the following operations for each current training text in each current category: performing word segmentation on the current training text; calculating the feature weight of each word in the current training text; and forming the characteristic vector of the current category according to the feature weights of all words contained in the current category.
8. The text classification device according to claim 7, wherein the determining unit is specifically configured to calculate the feature weight W(t, d) of word t in training text d by the following formula:

$$W(t, d) = \frac{tf(t, d) \times \log(N / n_t + 0.01)}{\sqrt{\sum_{t \in d} \left[ tf(t, d) \times \log(N / n_t + 0.01) \right]^2}}$$

wherein tf(t, d) characterizes the frequency of word t in training text d, N characterizes the total number of training texts in the current category, n_t is the number of training texts in the current category that contain word t, and the denominator is a normalization factor.
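Claim 8 describes a TF-IDF-style weight with a normalizing denominator; the formula image itself did not survive extraction, so this sketch assumes the classic variant tf(t, d) × log(N / n_t + 0.01) with cosine normalization, which matches the stated definitions of tf, N, and n_t:

```python
import math

def feature_weights(doc_tokens, n_docs, doc_freq):
    """TF-IDF-style weights with cosine normalization (assumed variant).

    n_docs:   N, total number of training texts in the current category
    doc_freq: {word: nt}, number of category texts containing the word
    """
    raw = {}
    for t in set(doc_tokens):
        tf = doc_tokens.count(t)                        # tf(t, d)
        raw[t] = tf * math.log(n_docs / doc_freq[t] + 0.01)
    norm = math.sqrt(sum(v * v for v in raw.values()))  # normalization factor
    return {t: v / norm for t, v in raw.items()}

# Hypothetical example: 10 texts in the category; "apple" occurs in 5, "pie" in 2.
w = feature_weights(["apple", "apple", "pie"], 10, {"apple": 5, "pie": 2})
```

After normalization the weight vector has unit length; the rarer word "pie" outweighs the more common "apple" despite its lower raw frequency, which is the intended IDF effect.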
9. The text classification device according to claim 6, wherein the dimensionality reduction unit is specifically configured to perform the following operations for the characteristic vector of each current category: calculating the mutual information A between the word corresponding to each feature and the current category; and selecting, from the features of the current category, the predetermined number of features having the largest mutual information to form the characteristic vector of the current category, thereby realizing the dimensionality reduction operation on the characteristic vector of the current category; wherein:

$$A = \log\left(\frac{P(W \mid C_j)}{P(W)}\right)$$

$$P(W \mid C_j) = \frac{1 + \sum_{i=1}^{|D_1|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D_1|} N(W_s, d_i)}$$

$$P(W) = \frac{1 + \sum_{i=1}^{|D_2|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D_2|} N(W_s, d_i)}$$

Wherein P(W | C_j) is the proportion with which word W appears in category C_j; |D_1| is the number of training texts in category C_j; N(W, d_i) is the frequency of word W in training text d_i; |V| is the total number of words in the vocabulary V; the sum over s and i in the first denominator is the total frequency of all words in category C_j; |D_2| is the number of training texts over all categories; and the sum over s and i in the second denominator is the total frequency of all words over all categories.
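The selection step of claim 9 reduces to scoring each word by A = log(P(W | C_j) / P(W)) and keeping the top-k. A short sketch; the word probabilities below are hypothetical values, not computed from real data:

```python
import math

def mutual_information(p_w_given_c, p_w):
    """A = log(P(W|Cj) / P(W)): positive when the word is more
    frequent inside the category than in the corpus overall."""
    return math.log(p_w_given_c / p_w)

def select_features(word_scores, k):
    """Keep the k words with the largest mutual information."""
    return sorted(word_scores, key=word_scores.get, reverse=True)[:k]

# Hypothetical scores: "apple" is category-specific, "car" is anti-correlated.
scores = {
    "apple": mutual_information(0.3, 0.1),   # log 3  > 0
    "the":   mutual_information(0.1, 0.1),   # log 1  = 0
    "car":   mutual_information(0.05, 0.1),  # log .5 < 0
}
selected = select_features(scores, 2)
```

Because both probabilities are Laplace-smoothed (claim 9's formulas), the ratio is always defined; words distributed like the background corpus score near zero and are the first to be pruned.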
10. The text classification device according to any one of claims 6-9, wherein the computing unit is specifically configured to calculate the probability that the text to be classified d_m belongs to category C_j by the following formula:

$$P(C_j \mid d_m; \hat{\theta}) = \frac{P(C_j \mid \hat{\theta}) \prod_{k=1}^{n} P(W_k \mid C_j; \hat{\theta})^{N(W_k, d_m)}}{\sum_{r=1}^{|C|} P(C_r \mid \hat{\theta}) \prod_{k=1}^{n} P(W_k \mid C_r; \hat{\theta})^{N(W_k, d_m)}}$$

wherein |C| is the number of categories, N(W_k, d_m) is the frequency of feature word W_k in the text to be classified d_m, and n is the total number of features in the characteristic vector.
CN201610096316.4A 2016-02-22 2016-02-22 Text classification method and device Pending CN105787004A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610096316.4A CN105787004A (en) 2016-02-22 2016-02-22 Text classification method and device

Publications (1)

Publication Number Publication Date
CN105787004A true CN105787004A (en) 2016-07-20

Family

ID=56403606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610096316.4A Pending CN105787004A (en) 2016-02-22 2016-02-22 Text classification method and device

Country Status (1)

Country Link
CN (1) CN105787004A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101290626A (en) * 2008-06-12 2008-10-22 昆明理工大学 Text categorization feature selection and weight computation method based on field knowledge
CN102930063A (en) * 2012-12-05 2013-02-13 电子科技大学 Feature item selection and weight calculation based text classification method
CN103593418A (en) * 2013-10-30 2014-02-19 中国科学院计算技术研究所 Distributed subject finding method and system for big data
US20150113388A1 (en) * 2013-10-22 2015-04-23 Qualcomm Incorporated Method and apparatus for performing topic-relevance highlighting of electronic text
CN105159998A (en) * 2015-09-08 2015-12-16 海南大学 Keyword calculation method based on document clustering

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhao Jie: "3.7.1 Feature Selection", in "Search Engine Technology" *
Chen Huifang: "Dimensionality Reduction Methods for the Feature Vector Space in Text Classification", in "China Master's Theses Full-text Database, Information Science and Technology" *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372117A (en) * 2016-08-23 2017-02-01 电子科技大学 Word co-occurrence-based text classification method and apparatus
CN106372117B (en) * 2016-08-23 2019-06-14 电子科技大学 A kind of file classification method and its device based on Term co-occurrence
CN107329999A (en) * 2017-06-09 2017-11-07 江西科技学院 Document classification method and device
CN107329999B (en) * 2017-06-09 2020-10-20 江西科技学院 Document classification method and device
CN109284377A (en) * 2018-09-13 2019-01-29 云南电网有限责任公司 A kind of file classification method and device based on vector space
CN109408636A (en) * 2018-09-29 2019-03-01 新华三大数据技术有限公司 File classification method and device

Similar Documents

Publication Publication Date Title
CN107844559A (en) A kind of file classifying method, device and electronic equipment
US20220147023A1 (en) Method and device for identifying industry classification of enterprise and particular pollutants of enterprise
CN106033416A (en) A string processing method and device
CN105787004A (en) Text classification method and device
CN110334356A (en) Article matter method for determination of amount, article screening technique and corresponding device
CN108241867B (en) Classification method and device
CN109918658A (en) A kind of method and system obtaining target vocabulary from text
CN106991090A (en) The analysis method and device of public sentiment event entity
US11288266B2 (en) Candidate projection enumeration based query response generation
CN113934848B (en) Data classification method and device and electronic equipment
CN107908649B (en) Text classification control method
CN106997340A (en) The generation of dictionary and the Document Classification Method and device using dictionary
CN114611850A (en) Service analysis method and device and electronic equipment
CN112818110A (en) Text filtering method, text filtering equipment and computer storage medium
CN110019556A (en) A kind of topic news acquisition methods, device and its equipment
CN115563268A (en) Text abstract generation method and device, electronic equipment and storage medium
CN110895703A (en) Legal document routing identification method and device
CN105677677A (en) Information classification and device
CN105512145A (en) Method and device for information classification
CN108520012A (en) Mobile Internet user comment method for digging based on machine learning
JP5310196B2 (en) Classification system revision support program, classification system revision support device, and classification system revision support method
CN114021716A (en) Model training method and system and electronic equipment
CN110321435B (en) Data source dividing method, device, equipment and storage medium
CN113468339A (en) Label extraction method, system, electronic device and medium based on knowledge graph
Zou et al. An improved model for spam user identification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160720