CN105787004A - Text classification method and device - Google Patents

Text classification method and device

Info

Publication number
CN105787004A
CN105787004A CN201610096316.4A
Authority
CN
China
Prior art keywords
classification
characteristic vector
word
text
training text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610096316.4A
Other languages
Chinese (zh)
Inventor
王茂帅
高峰
柳廷娜
于文才
甄教明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Co Ltd
Original Assignee
Inspur Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Co Ltd
Priority to CN201610096316.4A
Publication of CN105787004A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Abstract

The invention provides a text classification method and device. The method comprises: obtaining multiple training texts, wherein the multiple training texts belong to multiple categories; determining a feature vector for each category according to the training texts the category contains; performing a dimensionality-reduction operation on the feature vector of each category; calculating, according to each category after the dimensionality-reduction operation, the probability that a text to be classified belongs to each category; and assigning the text to be classified to the category with the highest probability. In this scheme, after the feature vector of each category is determined, a dimensionality-reduction operation is performed on it, so that each category is simplified and the efficiency of text classification can be improved.

Description

Text classification method and device
Technical field
The present invention relates to the field of computer technology, and in particular to a text classification method and device.
Background art
With the rapid development of computer technology, retrieving targeted information from vast repositories has become a basic need of modern society, and the information-processing technologies this relies on have become indispensable tools. Automatic text classification is the process of automatically deciding, given a set of categories, which category a text to be classified belongs to based on its content. Because a text is composed of many words, the corpus generated from a collection of texts contains a very large number of words, so when a text is represented as a vector its dimensionality is huge, which degrades computational performance. A text classification method that improves computational performance is therefore needed.
Summary of the invention
Embodiments of the present invention provide a text classification method and device to reduce the vector dimensionality of texts.
In a first aspect, an embodiment of the present invention provides a text classification method, including:
obtaining multiple training texts, wherein the multiple training texts belong to multiple categories;
determining a feature vector for each category according to the training texts the category contains;
performing a dimensionality-reduction operation on the feature vector of each category;
calculating, according to each category after the dimensionality-reduction operation, the probability that a text to be classified belongs to each category; and
assigning the text to be classified to the category with the highest probability.
Preferably, determining the feature vector of each category includes:
for each current training text in each current category, performing the following operations respectively: performing word segmentation on the current training text; and calculating the feature weight of each word in the current training text; and
forming the feature vector of the current category according to the feature weights of all words included in the current category.
Preferably, calculating the feature weight of each word in the current training text includes:
calculating the feature weight w(t, d_i) of word t in training text d_i by the following formula (a normalized TF-IDF weight, reconstructed to be consistent with the quantities defined below):

$$w(t, d_i) = \frac{tf(t, d_i)\cdot\log(N/n_t)}{\sqrt{\sum_{t'\in d_i}\left[tf(t', d_i)\cdot\log(N/n_{t'})\right]^2}}$$

where tf(t, d_i) is the word frequency of word t in training text d_i, N is the total number of training texts in the current category, n_t is the number of training texts in the current category that contain word t, and the denominator is a normalization factor.
Preferably, performing the dimensionality-reduction operation on the feature vector of each category includes:
for the feature vector of each current category, performing the following operations: calculating the mutual information A between the word corresponding to each feature component and the current category, and selecting, from the feature vector of the current category, a predetermined number of components with the largest mutual information to form the current category, thereby realizing the dimensionality-reduction operation on the feature vector of the current category; wherein

$$A = \log\left(\frac{P(W\mid C_j)}{P(W)}\right)$$

$$P(W\mid C_j) = \frac{1 + \sum_{i=1}^{|D1|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D1|} N(W_s, d_i)}$$

$$P(W) = \frac{1 + \sum_{i=1}^{|D2|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D2|} N(W_s, d_i)}$$

where P(W|C_j) is the proportion of occurrences of word W in category C_j, |D1| is the number of training texts in category C_j, N(W, d_i) is the word frequency of word W in training text d_i, |V| is the size of the vocabulary, and the double sum over |D1| is the total word frequency of all words in category C_j; |D2| is the number of training texts across all categories, and the double sum over |D2| is the total word frequency of all words across all categories.
Preferably, calculating the probability that the text to be classified belongs to each category includes:
calculating the probability that text to be classified d_m belongs to category C_j by the following formula:

$$P(C_j\mid d_m;\hat\theta) = \frac{P(C_j\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_j;\hat\theta)^{N(W_k, d_m)}}{\sum_{r=1}^{|C|} P(C_r\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_r;\hat\theta)^{N(W_k, d_m)}}$$

where |C| is the number of categories, N(W_k, d_m) is the word frequency of feature word W_k in text d_m, and n is the total number of feature words.
In a second aspect, an embodiment of the present invention provides a text classification device, including:
an acquiring unit, configured to obtain multiple training texts, wherein the multiple training texts belong to multiple categories, and to send the obtained training texts to a determining unit;
the determining unit, configured to determine the feature vector of each category according to the training texts the category contains, and to send the feature vector of each category to a dimensionality-reduction unit;
the dimensionality-reduction unit, configured to perform a dimensionality-reduction operation on the feature vector of each category, and to send each category after the dimensionality-reduction operation to a calculating unit;
the calculating unit, configured to calculate, according to each category after the dimensionality-reduction operation, the probability that a text to be classified belongs to each category, and to send the calculated probabilities to an allocating unit; and
the allocating unit, configured to assign the text to be classified to the category with the highest probability.
Preferably, the determining unit is specifically configured to perform the following operations for each current training text in each current category: perform word segmentation on the current training text; calculate the feature weight of each word in the current training text; and form the feature vector of the current category according to the feature weights of all words included in the current category.
Preferably, the determining unit is specifically configured to calculate the feature weight w(t, d_i) of word t in training text d_i, where tf(t, d_i) is the word frequency of word t in training text d_i, N is the total number of training texts in the current category, n_t is the number of training texts in the current category that contain word t, and the denominator of the formula is a normalization factor.
Preferably, the dimensionality-reduction unit is specifically configured to perform the following operations for the feature vector of each current category: calculate the mutual information A between the word corresponding to each feature component and the current category, and select, from the feature vector of the current category, the predetermined number of components with the largest mutual information to form the current category, thereby realizing the dimensionality-reduction operation on the feature vector of the current category; wherein

$$A = \log\left(\frac{P(W\mid C_j)}{P(W)}\right)$$

$$P(W\mid C_j) = \frac{1 + \sum_{i=1}^{|D1|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D1|} N(W_s, d_i)}$$

$$P(W) = \frac{1 + \sum_{i=1}^{|D2|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D2|} N(W_s, d_i)}$$

where P(W|C_j) is the proportion of occurrences of word W in category C_j, |D1| is the number of training texts in category C_j, N(W, d_i) is the word frequency of word W in training text d_i, |V| is the size of the vocabulary, and the double sum over |D1| is the total word frequency of all words in category C_j; |D2| is the number of training texts across all categories, and the double sum over |D2| is the total word frequency of all words across all categories.
Preferably, the calculating unit is specifically configured to calculate the probability that text to be classified d_m belongs to category C_j by the following formula:

$$P(C_j\mid d_m;\hat\theta) = \frac{P(C_j\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_j;\hat\theta)^{N(W_k, d_m)}}{\sum_{r=1}^{|C|} P(C_r\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_r;\hat\theta)^{N(W_k, d_m)}}$$

where |C| is the number of categories, N(W_k, d_m) is the word frequency of feature word W_k in text d_m, and n is the total number of feature words.
Embodiments of the present invention provide a text classification method and device in which, after the feature vector of each category is determined, a dimensionality-reduction operation is performed on it, so that each category is simplified and the efficiency of text classification can be improved.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below are merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of another method provided by an embodiment of the present invention;
Fig. 3 is a hardware structure diagram of the equipment in which a device provided by an embodiment of the present invention resides;
Fig. 4 is a schematic structural diagram of a device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a text classification method, which may include the following steps:
Step 101: obtain multiple training texts, wherein the multiple training texts belong to multiple categories;
Step 102: determine the feature vector of each category according to the training texts the category contains;
Step 103: perform a dimensionality-reduction operation on the feature vector of each category;
Step 104: calculate, according to each category after the dimensionality-reduction operation, the probability that the text to be classified belongs to each category;
Step 105: assign the text to be classified to the category with the highest probability.
According to the scheme provided by the above embodiment, after the feature vector of each category is determined, a dimensionality-reduction operation is performed on it, so that each category is simplified and the efficiency of text classification can be improved.
In an embodiment of the present invention, the feature vector of each category is an important parameter used to decide whether a new text belongs to that category. Determining the feature vector of each category may be implemented as follows: for each current training text in each current category, perform the following operations respectively: perform word segmentation on the current training text; calculate the feature weight of each word in the current training text; and form the feature vector of the current category according to the feature weights of all words included in the current category.
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 2, an embodiment of the present invention provides a text classification method, which may include the following steps:
Step 201: obtain multiple training texts, wherein the multiple training texts belong to multiple categories.
In this embodiment, in order to determine the category of a new text, multiple categories must be established, each containing multiple training texts; the training texts in each category serve as reference samples for deciding whether a new text belongs to that category.
Step 202: for each training text in each category, run word segmentation, remove stop words, merge words such as numerals and personal names, and count word frequencies.
In this embodiment, some words in the training texts carry no meaning for classification, for example stop words, and need to be removed. The literal weight of certain other words in a training text, for example numerals or personal names, may lead to classification errors, so such words need to be merged so that the weights calculated for them are meaningful.
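The patent provides no code for step 202; the following Python sketch illustrates the preprocessing under simplifying assumptions. A real system would use a proper Chinese word segmenter and a named-entity recognizer to merge personal names; here a regex tokenizer stands in for segmentation and only numerals are merged. `preprocess` and `STOP_WORDS` are illustrative names, not from the patent.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is"}   # illustrative stop-word list

def preprocess(text):
    """Step 202 sketch: tokenize a text, drop stop words, merge all
    numerals into one placeholder token, and count word frequencies."""
    tokens = re.findall(r"[A-Za-z]+|\d+", text.lower())
    cleaned = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue                 # stop words carry no class signal
        if tok.isdigit():
            tok = "<NUM>"            # merge numerals so they share one weight
        cleaned.append(tok)
    return Counter(cleaned)
```

The returned frequency table is exactly the per-document word-frequency statistic the later steps consume.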
Step 203: for each category, calculate the feature weight of each word in each training text.
In this embodiment, the feature weight w(t, d_i) of word t in training text d_i may be calculated by the following formula (1) (a normalized TF-IDF weight, reconstructed to be consistent with the quantities defined below):

$$w(t, d_i) = \frac{tf(t, d_i)\cdot\log(N/n_t)}{\sqrt{\sum_{t'\in d_i}\left[tf(t', d_i)\cdot\log(N/n_{t'})\right]^2}} \quad (1)$$

where tf(t, d_i) is the word frequency of word t in training text d_i, N is the total number of training texts in the current category, n_t is the number of training texts in the current category that contain word t, and the denominator is a normalization factor.
Step 204: form the feature vector of each current category according to the feature weights of all words included in that category.
The feature vector of each current category consists of the feature weights corresponding to the words that occur in that category.
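Steps 203-204 can be sketched as follows, under the assumption stated above that the weight is a cosine-normalized TF-IDF; `feature_weight` and its parameters are illustrative names. Each training text is represented as a `Counter` of word frequencies, as produced by a preprocessing step.

```python
import math
from collections import Counter

def feature_weight(term, doc_counts, class_docs):
    """Normalized TF-IDF weight of `term` in one training text (formula (1)).
    doc_counts: word-frequency Counter for that text;
    class_docs: list of Counters, one per training text in the category.
    Assumes `term` occurs in at least one text of the category."""
    N = len(class_docs)

    def raw(t):
        n_t = sum(1 for d in class_docs if t in d)   # texts containing t
        return doc_counts[t] * math.log(N / n_t)     # tf * idf

    norm = math.sqrt(sum(raw(t) ** 2 for t in doc_counts))
    return raw(term) / norm if norm else 0.0
```

The per-word weights of one text, collected over all words of the category, form that category's feature vector.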
Step 205: for each current category, calculate the mutual information A between the word corresponding to each feature component and the current category.
In this embodiment, the mutual information A may be calculated by the following formulas (2), (3) and (4):

$$A = \log\left(\frac{P(W\mid C_j)}{P(W)}\right) \quad (2)$$

$$P(W\mid C_j) = \frac{1 + \sum_{i=1}^{|D1|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D1|} N(W_s, d_i)} \quad (3)$$

$$P(W) = \frac{1 + \sum_{i=1}^{|D2|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D2|} N(W_s, d_i)} \quad (4)$$

where P(W|C_j) is the proportion of occurrences of word W in category C_j, |D1| is the number of training texts in category C_j, N(W, d_i) is the word frequency of word W in training text d_i, |V| is the size of the vocabulary, and the double sum over |D1| is the total word frequency of all words in category C_j; |D2| is the number of training texts across all categories, and the double sum over |D2| is the total word frequency of all words across all categories.
Step 206: select, from the feature vector of the current category, the predetermined number of components with the largest mutual information to form the current category, thereby realizing the dimensionality-reduction operation on the feature vector of the current category.
Because each category contains a large number of words, the dimensionality of its feature vector is high; to reduce it, the feature components with large mutual information may be selected to form the category.
The predetermined number may generally be set to a few thousand, for example 2000. An optimal value may also be determined through experiments and statistics.
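Formulas (2)-(4) and the top-k selection of step 206 can be sketched in Python as follows; documents are word-frequency `Counter`s, and the function names are illustrative. Note that the smoothing in (3) and (4) adds 1 to the word's count and the vocabulary size |V| to the total count, so both probabilities stay strictly positive.

```python
import math
from collections import Counter

def mutual_information(word, class_docs, all_docs, vocab):
    """A = log(P(W|Cj) / P(W)) with the smoothing of formulas (2)-(4).
    class_docs: Counters for the texts of category Cj;
    all_docs:   Counters for the texts of all categories;
    vocab:      the full vocabulary (|V| distinct words)."""
    V = len(vocab)

    def smoothed_prob(docs):
        occurrences = 1 + sum(d[word] for d in docs)           # 1 + sum N(W, d_i)
        total_words = V + sum(sum(d.values()) for d in docs)   # |V| + sum over all words
        return occurrences / total_words

    return math.log(smoothed_prob(class_docs) / smoothed_prob(all_docs))

def select_features(class_docs, all_docs, vocab, k=2000):
    """Step 206: keep the k words with the largest mutual information."""
    ranked = sorted(vocab,
                    key=lambda w: mutual_information(w, class_docs, all_docs, vocab),
                    reverse=True)
    return ranked[:k]
```

Words concentrated in the category score above zero, words spread across other categories score below zero, so keeping the top k retains the most category-indicative features.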
Step 207: calculate, according to each category after the dimensionality-reduction operation, the probability that the text to be classified belongs to each category.
In this embodiment, the probability that text to be classified d_m belongs to category C_j may be calculated by the following formula:

$$P(C_j\mid d_m;\hat\theta) = \frac{P(C_j\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_j;\hat\theta)^{N(W_k, d_m)}}{\sum_{r=1}^{|C|} P(C_r\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_r;\hat\theta)^{N(W_k, d_m)}}$$

where |C| is the number of categories, N(W_k, d_m) is the word frequency of feature word W_k in text d_m, and n is the total number of feature words.
Step 208: assign the text to be classified to the category with the highest probability.
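Steps 207-208 amount to a multinomial naive Bayes decision. Because the denominator of the posterior is the same for every category, it does not affect which category wins, so the sketch below compares only the log of each numerator (working in log space also avoids floating-point underflow for long texts). `classify` and its parameter names are illustrative; `class_word_probs[cj]` maps each retained feature word to P(W_k|C_j).

```python
import math
from collections import Counter

def classify(doc_counts, class_priors, class_word_probs):
    """Steps 207-208 sketch: score each category Cj by
    log P(Cj) + sum_k N(Wk, dm) * log P(Wk|Cj)
    (the log-numerator of the posterior) and return the best category."""
    best_class, best_score = None, float("-inf")
    for cj, prior in class_priors.items():
        score = math.log(prior)
        for word, freq in doc_counts.items():
            p = class_word_probs[cj].get(word)
            if p is not None:        # words outside the reduced feature set are skipped
                score += freq * math.log(p)
        if score > best_score:
            best_class, best_score = cj, score
    return best_class
```

Skipping words outside the reduced feature set is where the dimensionality reduction of step 206 pays off: only a few thousand probabilities per category need to be stored and consulted.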
As shown in Fig. 3 and Fig. 4, an embodiment of the present invention provides a text classification device. The device embodiment may be implemented by software, by hardware, or by a combination of the two. From the hardware perspective, Fig. 3 is a hardware structure diagram of the equipment in which the text classification device provided by the embodiment resides; besides the processor, memory, network interface and non-volatile storage shown in Fig. 3, the equipment may also include other hardware, such as a forwarding chip responsible for processing packets. As a software implementation, as shown in Fig. 4, the device in the logical sense is formed by the CPU of its host equipment reading the corresponding computer program instructions from non-volatile storage into memory and running them. The text classification device provided by this embodiment includes:
an acquiring unit 401, configured to obtain multiple training texts, wherein the multiple training texts belong to multiple categories, and to send the obtained training texts to a determining unit 402;
the determining unit 402, configured to determine the feature vector of each category according to the training texts the category contains, and to send the feature vector of each category to a dimensionality-reduction unit 403;
the dimensionality-reduction unit 403, configured to perform a dimensionality-reduction operation on the feature vector of each category, and to send each category after the dimensionality-reduction operation to a calculating unit 404;
the calculating unit 404, configured to calculate, according to each category after the dimensionality-reduction operation, the probability that the text to be classified belongs to each category, and to send the calculated probabilities to an allocating unit 405;
the allocating unit 405, configured to assign the text to be classified to the category with the highest probability.
The determining unit 402 is specifically configured to perform the following operations for each current training text in each current category: perform word segmentation on the current training text; calculate the feature weight of each word in the current training text; and form the feature vector of the current category according to the feature weights of all words included in the current category.
The determining unit 402 is specifically configured to calculate the feature weight w(t, d_i) of word t in training text d_i by the formula given above, where tf(t, d_i) is the word frequency of word t in training text d_i, N is the total number of training texts in the current category, n_t is the number of training texts in the current category that contain word t, and the denominator of the formula is a normalization factor.
The dimensionality-reduction unit 403 is specifically configured to perform the following operations for the feature vector of each current category: calculate the mutual information A between the word corresponding to each feature component and the current category, and select, from the feature vector of the current category, the predetermined number of components with the largest mutual information to form the current category, thereby realizing the dimensionality-reduction operation on the feature vector of the current category; wherein

$$A = \log\left(\frac{P(W\mid C_j)}{P(W)}\right)$$

$$P(W\mid C_j) = \frac{1 + \sum_{i=1}^{|D1|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D1|} N(W_s, d_i)}$$

$$P(W) = \frac{1 + \sum_{i=1}^{|D2|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D2|} N(W_s, d_i)}$$

where P(W|C_j) is the proportion of occurrences of word W in category C_j, |D1| is the number of training texts in category C_j, N(W, d_i) is the word frequency of word W in training text d_i, |V| is the size of the vocabulary, and the double sum over |D1| is the total word frequency of all words in category C_j; |D2| is the number of training texts across all categories, and the double sum over |D2| is the total word frequency of all words across all categories.
The calculating unit 404 is specifically configured to calculate the probability that text to be classified d_m belongs to category C_j by the following formula:

$$P(C_j\mid d_m;\hat\theta) = \frac{P(C_j\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_j;\hat\theta)^{N(W_k, d_m)}}{\sum_{r=1}^{|C|} P(C_r\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_r;\hat\theta)^{N(W_k, d_m)}}$$

where |C| is the number of categories, N(W_k, d_m) is the word frequency of feature word W_k in text d_m, and n is the total number of feature words.
In summary, the embodiments of the present invention can achieve at least the following beneficial effects:
1. After the feature vector of each category is determined, a dimensionality-reduction operation is performed on it, so that each category is simplified and the efficiency of text classification can be improved.
2. Texts are represented as vectors with words as features, and the resulting feature vectors are given a reduced-dimension representation, which both retains the important information of the texts and facilitates subsequent computation; classification criteria are summarized through training, and new texts are classified automatically according to these criteria; by reserving interfaces in the cloud, secure and controllable open API services can be provided externally.
3. After large-scale text information is effectively classified, a targeted personal search engine can be built, improving the precision of the system and allowing users to retrieve target information quickly and effectively.
The exchange of information between the units in the above device and their execution processes are based on the same concept as the method embodiments of the present invention; for details, refer to the description of the method embodiments, which is not repeated here.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the statement "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes that element.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by hardware driven by program instructions; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; and the storage medium includes various media capable of storing program code, such as ROM, RAM, a magnetic disk or an optical disc.
Finally, it should be understood that the above are merely preferred embodiments of the present invention, intended only to illustrate the technical solutions of the present invention and not to limit its protection scope. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A text classification method, characterized by comprising:
obtaining multiple training texts, wherein the multiple training texts belong to multiple categories;
determining a feature vector for each category according to the training texts the category contains;
performing a dimensionality-reduction operation on the feature vector of each category;
calculating, according to each category after the dimensionality-reduction operation, the probability that a text to be classified belongs to each category; and
assigning the text to be classified to the category with the highest probability.
2. The method according to claim 1, characterized in that determining the feature vector of each category includes:
for each current training text in each current category, performing the following operations respectively: performing word segmentation on the current training text; and calculating the feature weight of each word in the current training text; and
forming the feature vector of the current category according to the feature weights of all words included in the current category.
3. The method according to claim 2, characterized in that calculating the feature weight of each word in the current training text includes:
calculating the feature weight w(t, d_i) of word t in training text d_i by the formula given in the description,
where tf(t, d_i) is the word frequency of word t in training text d_i, N is the total number of training texts in the current category, n_t is the number of training texts in the current category that contain word t, and the denominator of the formula is a normalization factor.
4. The method according to claim 1, characterized in that performing the dimensionality-reduction operation on the feature vector of each category includes:
for the feature vector of each current category, performing the following operations: calculating the mutual information A between the word corresponding to each feature component and the current category, and selecting, from the feature vector of the current category, a predetermined number of components with the largest mutual information to form the current category, thereby realizing the dimensionality-reduction operation on the feature vector of the current category; wherein

$$A = \log\left(\frac{P(W\mid C_j)}{P(W)}\right)$$

$$P(W\mid C_j) = \frac{1 + \sum_{i=1}^{|D1|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D1|} N(W_s, d_i)}$$

$$P(W) = \frac{1 + \sum_{i=1}^{|D2|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|}\sum_{i=1}^{|D2|} N(W_s, d_i)}$$

where P(W|C_j) is the proportion of occurrences of word W in category C_j, |D1| is the number of training texts in category C_j, N(W, d_i) is the word frequency of word W in training text d_i, |V| is the size of the vocabulary, and the double sum over |D1| is the total word frequency of all words in category C_j; |D2| is the number of training texts across all categories, and the double sum over |D2| is the total word frequency of all words across all categories.
5. The method according to any one of claims 1 to 4, characterized in that calculating the probability that the text to be classified belongs to each category includes:
calculating the probability that text to be classified d_m belongs to category C_j by the following formula:

$$P(C_j\mid d_m;\hat\theta) = \frac{P(C_j\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_j;\hat\theta)^{N(W_k, d_m)}}{\sum_{r=1}^{|C|} P(C_r\mid\hat\theta)\prod_{k=1}^{n} P(W_k\mid C_r;\hat\theta)^{N(W_k, d_m)}}$$

where |C| is the number of categories, N(W_k, d_m) is the word frequency of feature word W_k in text d_m, and n is the total number of feature words.
6. A text classification device, characterized by comprising:
an acquiring unit, configured to obtain multiple training texts, wherein the multiple training texts belong to multiple categories, and to send the obtained training texts to a determining unit;
the determining unit, configured to determine the feature vector of each category according to the training texts the category contains, and to send the feature vector of each category to a dimensionality-reduction unit;
the dimensionality-reduction unit, configured to perform a dimensionality-reduction operation on the feature vector of each category, and to send each category after the dimensionality-reduction operation to a calculating unit;
the calculating unit, configured to calculate, according to each category after the dimensionality-reduction operation, the probability that a text to be classified belongs to each category, and to send the calculated probabilities to an allocating unit; and
the allocating unit, configured to assign the text to be classified to the category with the highest probability.
7. The text classification device according to claim 6, wherein the determining unit is specifically configured to perform the following operations for each current training text in each current category: performing word segmentation on the current training text; calculating the feature weight of each word in the current training text; and forming the characteristic vector of the current category according to the feature weights of all words contained in the current category.
8. The text classification device according to claim 7, wherein the determining unit is specifically configured to calculate the feature weight W(t, d) of word t in training text d by the following formula:

$$W(t, d) = \frac{tf(t, d) \times \log(N / n_t + 0.01)}{\sqrt{\sum_{t \in d} \left[ tf(t, d) \times \log(N / n_t + 0.01) \right]^2}}$$

wherein tf(t, d) characterizes the frequency of word t in training text d, N characterizes the total number of training texts in the current category, n_t is the number of training texts in the current category that contain word t, and the denominator is a normalization factor.
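Claim 8 describes a TF-IDF-style weight with a normalizing denominator; the formula image itself did not survive extraction, so this sketch assumes the classic variant tf(t, d) × log(N / n_t + 0.01) with cosine normalization, which matches the stated definitions of tf, N, and n_t:

```python
import math

def feature_weights(doc_tokens, n_docs, doc_freq):
    """TF-IDF-style weights with cosine normalization (assumed variant).

    n_docs:   N, total number of training texts in the current category
    doc_freq: {word: nt}, number of category texts containing the word
    """
    raw = {}
    for t in set(doc_tokens):
        tf = doc_tokens.count(t)                        # tf(t, d)
        raw[t] = tf * math.log(n_docs / doc_freq[t] + 0.01)
    norm = math.sqrt(sum(v * v for v in raw.values()))  # normalization factor
    return {t: v / norm for t, v in raw.items()}

# Hypothetical example: 10 texts in the category; "apple" occurs in 5, "pie" in 2.
w = feature_weights(["apple", "apple", "pie"], 10, {"apple": 5, "pie": 2})
```

After normalization the weight vector has unit length; the rarer word "pie" outweighs the more common "apple" despite its lower raw frequency, which is the intended IDF effect.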
9. The text classification device according to claim 6, wherein the dimensionality reduction unit is specifically configured to perform the following operations for the characteristic vector of each current category: calculating the mutual information A between the word corresponding to each feature and the current category; and selecting, from the features of the current category, the predetermined number of features having the largest mutual information to form the characteristic vector of the current category, thereby realizing the dimensionality reduction operation on the characteristic vector of the current category; wherein:

$$A = \log\left(\frac{P(W \mid C_j)}{P(W)}\right)$$

$$P(W \mid C_j) = \frac{1 + \sum_{i=1}^{|D_1|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D_1|} N(W_s, d_i)}$$

$$P(W) = \frac{1 + \sum_{i=1}^{|D_2|} N(W, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D_2|} N(W_s, d_i)}$$

Wherein P(W | C_j) is the proportion with which word W appears in category C_j; |D_1| is the number of training texts in category C_j; N(W, d_i) is the frequency of word W in training text d_i; |V| is the total number of words in the vocabulary V; the sum over s and i in the first denominator is the total frequency of all words in category C_j; |D_2| is the number of training texts over all categories; and the sum over s and i in the second denominator is the total frequency of all words over all categories.
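The selection step of claim 9 reduces to scoring each word by A = log(P(W | C_j) / P(W)) and keeping the top-k. A short sketch; the word probabilities below are hypothetical values, not computed from real data:

```python
import math

def mutual_information(p_w_given_c, p_w):
    """A = log(P(W|Cj) / P(W)): positive when the word is more
    frequent inside the category than in the corpus overall."""
    return math.log(p_w_given_c / p_w)

def select_features(word_scores, k):
    """Keep the k words with the largest mutual information."""
    return sorted(word_scores, key=word_scores.get, reverse=True)[:k]

# Hypothetical scores: "apple" is category-specific, "car" is anti-correlated.
scores = {
    "apple": mutual_information(0.3, 0.1),   # log 3  > 0
    "the":   mutual_information(0.1, 0.1),   # log 1  = 0
    "car":   mutual_information(0.05, 0.1),  # log .5 < 0
}
selected = select_features(scores, 2)
```

Because both probabilities are Laplace-smoothed (claim 9's formulas), the ratio is always defined; words distributed like the background corpus score near zero and are the first to be pruned.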
10. The text classification device according to any one of claims 6-9, wherein the computing unit is specifically configured to calculate the probability that the text to be classified d_m belongs to category C_j by the following formula:

$$P(C_j \mid d_m; \hat{\theta}) = \frac{P(C_j \mid \hat{\theta}) \prod_{k=1}^{n} P(W_k \mid C_j; \hat{\theta})^{N(W_k, d_m)}}{\sum_{r=1}^{|C|} P(C_r \mid \hat{\theta}) \prod_{k=1}^{n} P(W_k \mid C_r; \hat{\theta})^{N(W_k, d_m)}}$$

wherein |C| is the number of categories, N(W_k, d_m) is the frequency of feature word W_k in the text to be classified d_m, and n is the total number of features in the characteristic vector.
CN201610096316.4A 2016-02-22 2016-02-22 Text classification method and device Pending CN105787004A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610096316.4A CN105787004A (en) 2016-02-22 2016-02-22 Text classification method and device

Publications (1)

Publication Number Publication Date
CN105787004A true CN105787004A (en) 2016-07-20

Family

ID=56403606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610096316.4A Pending CN105787004A (en) 2016-02-22 2016-02-22 Text classification method and device

Country Status (1)

Country Link
CN (1) CN105787004A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101290626A (en) * 2008-06-12 2008-10-22 昆明理工大学 Text categorization feature selection and weight computation method based on field knowledge
CN102930063A (en) * 2012-12-05 2013-02-13 电子科技大学 Feature item selection and weight calculation based text classification method
CN103593418A (en) * 2013-10-30 2014-02-19 中国科学院计算技术研究所 Distributed subject finding method and system for big data
US20150113388A1 (en) * 2013-10-22 2015-04-23 Qualcomm Incorporated Method and apparatus for performing topic-relevance highlighting of electronic text
CN105159998A (en) * 2015-09-08 2015-12-16 海南大学 Keyword calculation method based on document clustering

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhao Jie: "3.7.1 Feature Selection", in "Search Engine Technology" *
Chen Huifang: "Dimensionality Reduction Methods for the Feature Vector Space in Text Classification", in "China Master's Theses Full-text Database, Information Science and Technology" *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372117A (en) * 2016-08-23 2017-02-01 电子科技大学 Word co-occurrence-based text classification method and apparatus
CN106372117B (en) * 2016-08-23 2019-06-14 电子科技大学 A kind of file classification method and its device based on Term co-occurrence
CN107329999A (en) * 2017-06-09 2017-11-07 江西科技学院 Document classification method and device
CN107329999B (en) * 2017-06-09 2020-10-20 江西科技学院 Document classification method and device
CN109284377A (en) * 2018-09-13 2019-01-29 云南电网有限责任公司 A kind of file classification method and device based on vector space
CN109408636A (en) * 2018-09-29 2019-03-01 新华三大数据技术有限公司 File classification method and device

Similar Documents

Publication Publication Date Title
CN107844559A (en) A kind of file classifying method, device and electronic equipment
US20220147023A1 (en) Method and device for identifying industry classification of enterprise and particular pollutants of enterprise
CN106033416A (en) A string processing method and device
CN105787004A (en) Text classification method and device
CN110334356A (en) Article matter method for determination of amount, article screening technique and corresponding device
CN108241867B (en) Classification method and device
CN109918658A (en) A kind of method and system obtaining target vocabulary from text
CN106991090A (en) The analysis method and device of public sentiment event entity
US11288266B2 (en) Candidate projection enumeration based query response generation
CN113934848B (en) Data classification method and device and electronic equipment
CN107908649B (en) Text classification control method
CN106997340A (en) The generation of dictionary and the Document Classification Method and device using dictionary
CN114611850A (en) Service analysis method and device and electronic equipment
CN112818110A (en) Text filtering method, text filtering equipment and computer storage medium
CN110019556A (en) A kind of topic news acquisition methods, device and its equipment
CN115563268A (en) Text abstract generation method and device, electronic equipment and storage medium
CN110895703A (en) Legal document routing identification method and device
CN105677677A (en) Information classification and device
CN105512145A (en) Method and device for information classification
CN108520012A (en) Mobile Internet user comment method for digging based on machine learning
JP5310196B2 (en) Classification system revision support program, classification system revision support device, and classification system revision support method
CN114021716A (en) Model training method and system and electronic equipment
CN110321435B (en) Data source dividing method, device, equipment and storage medium
CN113468339A (en) Label extraction method, system, electronic device and medium based on knowledge graph
Zou et al. An improved model for spam user identification

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160720