CN103793385A - Textual feature extracting method and device

Info

Publication number: CN103793385A
Application number: CN201210419624.8A
Authority: CN (China)
Prior art keywords: feature word, label, feature, target document, sample
Other languages: Chinese (zh)
Inventors: 邹维, 尹华彬, 周畅, 杨俊松, 宫建涛, 吴振宇, 宁合军
Assignee: Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority date: 2012-10-29
Filing date: 2012-10-29
Publication date: 2014-05-14
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)


Classifications

    • G06F40/216 Parsing using statistical methods (G Physics; G06 Computing, calculating or counting; G06F Electric digital data processing; G06F40/00 Handling natural language data; G06F40/20 Natural language analysis; G06F40/205 Parsing)
    • G06F40/279 Recognition of textual entities (G06F40/20 Natural language analysis)


Abstract

The invention discloses a text feature extraction method and device. The method comprises: for a feature word F_i in a feature word library, determining the mutual information between F_i and each label in a label library according to the number of times F_i occurs in the samples of a pre-established sample library that contain F_i and the labels carried by those samples; performing word segmentation on a target document to obtain all feature words appearing in the target document; determining, based on the mutual information between each feature word in the target document and each label, the weight of each feature word with respect to each label, and summing the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words with respect to that label; and, according to the total weight of each label, determining from the labels a target label as the text feature of the target document. The method and device can improve the accuracy of text feature extraction.

Description

Text feature extraction method and device
Technical field
The present application relates to the field of information technology, and in particular to a text feature extraction method and device.
Background technology
In the field of text classification, a single document contains a great many feature words. How to extract the key feature words from a document, that is, how to extract the text features of a document, has therefore become an important technical problem in text classification.
Conventional text classification based on probability models has become one of the most widely used text classification approaches because its principle is simple to implement and its accuracy is high. Among such approaches, text feature extraction based on mutual information (Mutual Information, MI) is a typical probability-model-based text classification method.
Mutual information refers to the correlation between two sets of events.
Specifically, the mutual information of two events X and Y is defined by formula 1:

    MI(X, Y) = \log \frac{p(X, Y)}{p(X) \times p(Y)}

where p(X) and p(Y) denote the probabilities that event X and event Y occur, respectively, and p(X, Y) denotes the probability that events X and Y occur simultaneously.
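As an illustration (not part of the patent text), formula 1 can be computed directly from event counts; the following Python sketch uses hypothetical names and assumes probabilities are estimated as relative frequencies:

```python
import math

def mutual_information(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Formula 1: MI(X, Y) = log( p(X, Y) / (p(X) * p(Y)) ),
    with probabilities estimated as relative frequencies over `total` trials."""
    p_x = count_x / total
    p_y = count_y / total
    p_xy = count_xy / total
    return math.log(p_xy / (p_x * p_y))

# Example: X and Y co-occur in 30 of 1000 samples; X occurs in 100, Y in 120.
print(mutual_information(30, 100, 120, 1000))  # log(0.03 / 0.012) = log 2.5 ≈ 0.916
```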
In mutual-information-based text feature extraction, formula 1 evolves into formula 2:

    MI(t, X_i) = \log \frac{p(t, X_i)}{p(t) \times p(X_i)}

where t denotes a keyword obtained from a document by word segmentation, X_i denotes the i-th category in a known set of text categories, p(t) denotes the probability of obtaining keyword t from a document by word segmentation, p(X_i) denotes the probability that a document is classified into category X_i, and p(t, X_i) denotes the probability that keyword t is obtained from a document by word segmentation and the document is classified into category X_i. MI(t, X_i) denotes the mutual information between obtaining keyword t from a document and classifying the document into X_i; it characterizes the weight of keyword t with respect to text category X_i.
Since mutual information can characterize the weight of a feature word with respect to a text category, a label representing a text feature can be treated as a category when extracting text features from a document, and formula 2 can be transformed into formula 3:

    MI(t, T_i) = \log \frac{p(t, T_i)}{p(t) \times p(T_i)}

where T_i denotes the i-th label in the label library T, p(t) denotes the probability of obtaining keyword t from a document by word segmentation, p(T_i) denotes the probability that a document is classified into class T_i, and p(t, T_i) denotes the probability that keyword t is obtained from a document by word segmentation and the document is classified into class T_i.
Specifically, a document sample library is established in advance, and every document in the sample library is given labels, for example manually. Then p(t) is the number of documents in the sample library containing feature word t divided by the total number of documents in the sample library, p(T_i) is the number of documents of class T_i in the sample library divided by the total number of documents in the sample library, and p(t, T_i) is the number of documents in the sample library that contain feature word t and belong to class T_i divided by the total number of documents in the sample library.
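For illustration only, these document-level estimates and formula 3 might be computed as in the following Python sketch; the corpus layout (a list of (feature-word set, label set) pairs) and all names are assumptions, not the patent's code:

```python
import math

def mi_formula_3(samples, word, label):
    """Formula 3 over a labelled sample library.

    samples: list of (words, labels) pairs; words is the set of feature words
    a document contains, labels is the set of labels the document carries.
    p(t), p(T_i) and p(t, T_i) are document counts divided by the library size.
    """
    total = len(samples)
    p_word = sum(1 for w, _ in samples if word in w) / total
    p_label = sum(1 for _, l in samples if label in l) / total
    p_both = sum(1 for w, l in samples if word in w and label in l) / total
    return math.log(p_both / (p_word * p_label))

samples = [
    ({"android", "chat"}, {"WeChat"}),
    ({"android", "QR code"}, {"WeChat"}),
    ({"stock", "market"}, {"finance"}),
    ({"chat", "voice intercom"}, {"WeChat"}),
]
print(mi_formula_3(samples, "android", "WeChat"))  # log((2/4) / ((2/4) * (3/4))) = log(4/3)
```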
It can be seen that formula 3 yields the mutual information between every feature word and every label in the label library T. Then, when text features need to be extracted from a document d, the weight with which text feature T_i is extracted from document d (in other words, with which document d is given label T_i) can be obtained by formula 4:

    p(d, T_i) = \sum_{x=1}^{N} MI(t_x, T_i)

where p(d, T_i) is the weight with which document d can be given label T_i, that is, the weight with which text feature T_i can be extracted from document d, N is the number of feature words in document d, and t_x is the x-th feature word in document d.
For example, word segmentation is performed on an article d, and the extracted feature words, each with its number of occurrences in the article, are: (android, 2), (chat, 1), (voice intercom, 2), (QR code, 1), where the number in each pair is the number of times that feature word occurs in article d. Suppose a possible label for article d is "WeChat" in the label library; the weight with which document d can be given the label "WeChat" is then:

p(d, WeChat) = MI(android, WeChat) + MI(chat, WeChat) + MI(voice intercom, WeChat) + MI(QR code, WeChat)

where MI(android, WeChat), MI(chat, WeChat), MI(voice intercom, WeChat) and MI(QR code, WeChat) are calculated according to formula 3 over the document sample library established in advance.
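A minimal sketch of this scoring step (formula 4), with made-up mutual information values standing in for ones computed from a real sample library:

```python
# Hypothetical MI values for the label "WeChat"; in practice they come from
# formula 3 evaluated over the pre-established document sample library.
mi_wechat = {"android": 1.2, "chat": 0.4, "voice intercom": 1.5, "QR code": 0.9}

def label_weight(doc_words, mi_table):
    """Formula 4: p(d, T_i) is the sum of MI(t_x, T_i) over the document's feature words."""
    return sum(mi_table.get(word, 0.0) for word in doc_words)

doc_words = ["android", "chat", "voice intercom", "QR code"]
print(label_weight(doc_words, mi_wechat))  # 1.2 + 0.4 + 1.5 + 0.9 = 4.0
```

As defined, the sum runs over the document's feature words without regard to how often each occurs; the two occurrences of "android" thus contribute no more than one would, which is exactly the limitation the patent discusses next.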
It can be seen that, at present, the mutual information between a feature word and each label (that is, each text feature) is determined only from the number of documents in the sample library in which the feature word appears and the co-occurrence of the feature word with each label (see formula 3); the frequency with which the feature word occurs within a single document carrying the label is not considered. In the example above, the feature words "android" and "voice intercom" each occur twice; compared with feature words such as "QR code" that occur only once, they contribute more to the label "WeChat" and should have a larger mutual information value with that label. The current way of determining mutual information, however, cannot express this difference. Its accuracy is therefore low: it cannot accurately reflect the correlation between feature words and labels, and correspondingly, text classification based on mutual information determined this way is also less accurate.
In addition, no matter how the mutual information is determined, current text feature extraction only considers which feature words appear in the target document, not how many times a given feature word appears in it. In fact, if a feature word occurs frequently in the target document, that feature word should contribute more to the extraction of the target document's text features. From this perspective, too, the accuracy of current text feature extraction methods is low.
Summary of the invention
The present application provides a text feature extraction method and device, which can improve the accuracy of text feature extraction.
A text feature extraction method, the method comprising:
for a feature word F_i in a feature word library, determining the mutual information between F_i and each label in a label library according to the number of times F_i occurs in the samples of a pre-established sample library that contain F_i and the labels carried by those samples;
performing word segmentation on a target document to obtain all feature words appearing in the target document;
determining, based on the mutual information between each feature word in the target document and each label, the weight of each feature word in the target document with respect to each label, and summing the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label;
according to the total weight of each label, determining from the labels a target label as the text feature of the target document.
A text feature extraction method, the method comprising:
performing word segmentation on a target document to obtain all feature words appearing in the target document;
determining the weight of each feature word in the target document with respect to each label, and summing the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label;
according to the total weight of each label, determining from the labels a target label as the text feature of the target document;
wherein determining the weight of each feature word in the target document with respect to each label comprises:
determining the weight of feature word F_i with respect to label T_j according to the mutual information MI(F_i, T_j) between F_i and T_j, the number of times TF(F_i) that F_i occurs in the target document, and the importance degree IDF(F_i) of F_i, wherein the more samples in the pre-established sample library contain F_i, the lower the importance degree IDF(F_i) of F_i.
A text feature extraction device, the device comprising a mutual information determination module and a text feature extraction module;
the mutual information determination module is configured to, for a feature word F_i in a feature word library, determine the mutual information between F_i and each label in a label library according to the number of times F_i occurs in the samples of a pre-established sample library that contain F_i and the labels carried by those samples;
the text feature extraction module is configured to perform word segmentation on a target document to obtain all feature words appearing in the target document; determine, based on the mutual information between each feature word in the target document and each label, the weight of each feature word in the target document with respect to each label; sum the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label; and, according to the total weight of each label, determine from the labels a target label as the text feature of the target document.
A text feature extraction device, the device comprising a word segmentation module, a weight determination module and a text feature extraction module;
the word segmentation module is configured to perform word segmentation on a target document to obtain all feature words appearing in the target document;
the weight determination module is configured to determine the weight of each feature word in the target document with respect to each label, and to sum the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label;
the text feature extraction module is configured to determine, according to the total weight of each label, a target label from the labels as the text feature of the target document;
wherein the weight determination module is configured to determine the weight of feature word F_i with respect to label T_j according to the mutual information MI(F_i, T_j) between F_i and T_j, the number of times TF(F_i) that F_i occurs in the target document, and the importance degree IDF(F_i) of F_i, wherein the more samples in the pre-established sample library contain F_i, the lower the importance degree IDF(F_i) of F_i.
It can be seen from the above scheme that, when determining mutual information, the present invention considers not only whether a sample in the sample library contains a given feature word, but also the number of times the feature word occurs in that sample. Since the more often a feature word occurs in a sample, the stronger the correlation generally is between the feature word and the labels carried by the sample, the technical scheme of the present invention for determining mutual information can reflect the correlation between feature words and labels comparatively accurately, and text feature extraction based on this mutual information correspondingly improves the accuracy of text feature extraction.
In addition, when extracting text features, the present invention considers not only whether a given feature word appears in the target document, but also the number of times the feature word occurs in the target document and the number of samples in the pre-established sample library that contain it. The number of occurrences in the target document reflects how likely it is that a label related to the feature word is a text feature of the target document, while the number of samples containing the feature word reflects the importance of the feature word; this technical scheme therefore also improves the accuracy of text feature extraction.
Brief description of the drawings
Fig. 1 is a flowchart of the mutual information determination method provided by the present invention.
Fig. 2 is a flowchart of the text feature extraction method provided by the present invention.
Fig. 3 is a first structural diagram of the text feature extraction device provided by the present invention.
Fig. 4 is a second structural diagram of the text feature extraction device provided by the present invention.
Detailed description
Fig. 1 is a flowchart of the mutual information determination method provided by the present invention.
As shown in Fig. 1, the flow comprises:
Step 101: for a feature word F_i in the feature word library and a label T_j in the label library, determine, from the sample information in the pre-established sample library, the number n of samples that contain the feature word and carry the label, the number of times Num_k that the feature word occurs in the k-th of the samples that carry the label, the value p(F_i), which is the total number of times the feature word occurs in all samples of the sample library divided by the total number of occurrences of all feature words in all samples of the sample library, and the value p(T_j), which is the number of samples carrying the label divided by the total number of samples in the sample library.
Step 102: determine the mutual information between the feature word and the label according to the information determined in step 101 and the total number N of samples in the sample library.
It can be seen that, when determining mutual information, the method shown in Fig. 1 considers not only whether a sample in the sample library contains a given feature word, but also the number of times the feature word occurs in that sample. Since the more often a feature word occurs in a sample, the stronger the correlation generally is between the feature word and the labels carried by the sample, determining mutual information by the method of Fig. 1 reflects the correlation between feature words and labels comparatively accurately. Text feature extraction based on mutual information determined by the method of Fig. 1 correspondingly improves the accuracy of text feature extraction.
Specifically, the present invention proposes that the mutual information MI(F_i, T_j) between feature word F_i and label T_j can be defined as:

    MI(F_i, T_j) = \log \frac{\sum_{k=0}^{n} \log(e - 1 + Num_k)}{N \times p(F_i) \times p(T_j)}
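For illustration, a Python sketch of this occurrence-aware mutual information, assuming samples are stored as (word-count, label-set) pairs and that p(F_i) and p(T_j) are estimated as described in step 101; all names are hypothetical:

```python
import math
from collections import Counter

def mi_occurrence_aware(samples, word, label):
    """MI(F_i, T_j) = log( sum_k log(e - 1 + Num_k) / (N * p(F_i) * p(T_j)) ),
    where Num_k is the occurrence count of the word in the k-th sample that
    contains the word and carries the label."""
    N = len(samples)
    nums = [counts[word] for counts, labels in samples
            if counts.get(word, 0) > 0 and label in labels]
    # p(F_i): occurrences of the word over occurrences of all feature words.
    total_word = sum(counts.get(word, 0) for counts, _ in samples)
    total_all = sum(sum(counts.values()) for counts, _ in samples)
    # p(T_j): samples carrying the label over all samples.
    n_label = sum(1 for _, labels in samples if label in labels)
    numerator = sum(math.log(math.e - 1 + num) for num in nums)
    return math.log(numerator / (N * (total_word / total_all) * (n_label / N)))

samples = [
    (Counter({"android": 2, "chat": 1}), {"WeChat"}),
    (Counter({"android": 1, "stock": 3}), {"finance"}),
    (Counter({"voice intercom": 2, "android": 1}), {"WeChat"}),
]
print(mi_occurrence_aware(samples, "android", "WeChat"))
```

Assuming natural logarithms, each summand log(e - 1 + Num_k) equals 1 when Num_k = 1 and grows with the occurrence count, so a word that appears repeatedly in a labelled sample raises the mutual information; this is the difference the patent argues formula 3 cannot capture.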
Based on this idea of taking the occurrence counts of feature words into account, the present invention also provides a text feature extraction method; see Fig. 2.
Fig. 2 is a flowchart of the text feature extraction method provided by the present invention.
As shown in Fig. 2, the flow comprises:
Step 201: perform word segmentation on the target document to obtain all feature words appearing in the target document.
Step 202: determine the weight of each feature word in the target document with respect to each label, according to the mutual information between each feature word and each label, the number of times each feature word occurs in the target document, and the importance of each feature word.
Here, the more samples in the pre-established sample library contain a given feature word, the lower the importance of that feature word.
Step 203: sum the weights of all feature words in the target document with respect to the same label, to obtain the total weight of all feature words in the target document with respect to that label.
Step 204: according to the total weight of each label, determine from the labels a target label as the text feature of the target document.
It can be seen that, when extracting text features, the method shown in Fig. 2 considers not only whether a given feature word appears in the target document, but also the number of times the feature word occurs in the target document and the number of samples in the pre-established sample library that contain it. The number of occurrences in the target document reflects how likely it is that a label related to the feature word is a text feature of the target document, while the number of samples containing the feature word reflects the importance of the feature word; extracting text features by the method of Fig. 2 therefore improves the accuracy of text feature extraction.
Specifically, the present invention also proposes that the weight p(F_i, T_j) of feature word F_i with respect to label T_j can be defined as:

    p(F_i, T_j) = MI(F_i, T_j) × TF(F_i) × IDF(F_i)

where MI(F_i, T_j) is the mutual information between feature word F_i and label T_j, TF(F_i) is the number of times F_i occurs in the target document, and IDF(F_i) is the importance degree of F_i; the more samples in the pre-established sample library contain F_i, the lower the importance degree IDF(F_i).
Further, the importance degree IDF(F_i) of feature word F_i can be:

    IDF(F_i) = \log \left( 1 + \frac{N}{N_{F_i}} \right)

where N is the total number of samples in the sample library and N_{F_i} is the number of samples in the sample library in which feature word F_i appears.
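For illustration, a short Python sketch of this TF × IDF × MI weighting; the MI value is assumed to have been computed beforehand (for example by the occurrence-aware formula above), and all names are hypothetical:

```python
import math

def idf(n_samples_total: int, n_samples_with_word: int) -> float:
    """Importance degree IDF(F_i) = log(1 + N / N_Fi)."""
    return math.log(1 + n_samples_total / n_samples_with_word)

def word_label_weight(mi: float, tf: int, idf_value: float) -> float:
    """Weight p(F_i, T_j) = MI(F_i, T_j) * TF(F_i) * IDF(F_i)."""
    return mi * tf * idf_value

# "android" occurs twice in the target document, appears in 40 of the 1000
# library samples, and has a made-up MI of 1.2 with the label "WeChat".
print(word_label_weight(1.2, 2, idf(1000, 40)))  # 1.2 * 2 * log(26) ≈ 7.82
```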
To further improve the accuracy of text feature extraction, the mutual information determination method proposed above can additionally be adopted within the text feature extraction method provided by the present invention; that is, the mutual information MI(F_i, T_j) between feature word F_i and label T_j is:

    MI(F_i, T_j) = \log \frac{\sum_{k=0}^{n} \log(e - 1 + Num_k)}{N \times p(F_i) \times p(T_j)}

where n is the number of samples in the sample library that contain feature word F_i and carry label T_j, Num_k is the number of times F_i occurs in the k-th sample that contains F_i and carries T_j, N is the total number of samples in the sample library, p(F_i) is the total number of times F_i occurs in all samples of the sample library divided by the total number of occurrences of all feature words in all samples of the sample library, and p(T_j) is the number of samples carrying label T_j divided by the total number of samples in the sample library.
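Putting the pieces together, the following sketch aggregates the per-word weights into a total weight per label and picks the label with the largest total as the text feature (steps 203 and 204); the MI and IDF tables are assumed precomputed, and all names are illustrative:

```python
from collections import Counter

def extract_text_feature(doc_words, labels, mi, idf):
    """Total weight per label: sum over feature words of MI * TF * IDF;
    the target label is the one with the largest total weight.

    doc_words: feature words from word segmentation, repeats included.
    mi: dict mapping (word, label) to mutual information.
    idf: dict mapping word to its importance degree.
    """
    tf = Counter(doc_words)  # TF(F_i): occurrences of each word in the document
    totals = {label: sum(mi.get((word, label), 0.0) * count * idf.get(word, 0.0)
                         for word, count in tf.items())
              for label in labels}
    return max(totals, key=totals.get), totals

mi = {("android", "WeChat"): 1.2, ("chat", "WeChat"): 0.4,
      ("android", "finance"): 0.1, ("stock", "finance"): 1.8}
idf = {"android": 3.3, "chat": 2.1, "stock": 2.9}
target, totals = extract_text_feature(["android", "android", "chat"],
                                      ["WeChat", "finance"], mi, idf)
print(target, totals)  # WeChat {'WeChat': 8.76, 'finance': 0.66}
```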
In accordance with the methods above, the present invention also provides two text feature extraction devices; see Fig. 3 and Fig. 4.
Fig. 3 is a first structural diagram of the text feature extraction device provided by the present invention.
As shown in Fig. 3, the device comprises a mutual information determination module 301 and a text feature extraction module 302.
The mutual information determination module 301 is configured to, for a feature word F_i in a feature word library, determine the mutual information between F_i and each label in a label library according to the number of times F_i occurs in the samples of a pre-established sample library that contain F_i and the labels carried by those samples.
The text feature extraction module 302 is configured to perform word segmentation on a target document to obtain all feature words appearing in the target document; determine, based on the mutual information between each feature word in the target document and each label, the weight of each feature word in the target document with respect to each label; sum the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label; and, according to the total weight of each label, determine from the labels a target label as the text feature of the target document.
The mutual information determination module 301 can be configured to define the mutual information MI(F_i, T_j) between feature word F_i and label T_j as:

    MI(F_i, T_j) = \log \frac{\sum_{k=0}^{n} \log(e - 1 + Num_k)}{N \times p(F_i) \times p(T_j)}

where n is the number of samples in the sample library that contain feature word F_i and carry label T_j, Num_k is the number of times F_i occurs in the k-th sample that contains F_i and carries T_j, N is the total number of samples in the sample library, p(F_i) is the total number of times F_i occurs in all samples of the sample library divided by the total number of occurrences of all feature words in all samples of the sample library, and p(T_j) is the number of samples carrying label T_j divided by the total number of samples in the sample library.
Fig. 4 is a second structural diagram of the text feature extraction device provided by the present invention.
As shown in Fig. 4, the text feature extraction device comprises a word segmentation module 401, a weight determination module 402 and a text feature extraction module 403.
The word segmentation module 401 is configured to perform word segmentation on the target document to obtain all feature words appearing in the target document.
The weight determination module 402 is configured to determine the weight of each feature word in the target document with respect to each label, and to sum the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label.
The text feature extraction module 403 is configured to determine, according to the total weight of each label, a target label from the labels as the text feature of the target document.
The weight determination module 402 is configured to determine the weight of feature word F_i with respect to label T_j according to the mutual information MI(F_i, T_j) between F_i and T_j, the number of times TF(F_i) that F_i occurs in the target document, and the importance degree IDF(F_i) of F_i, wherein the more samples in the pre-established sample library contain F_i, the lower the importance degree IDF(F_i).
The weight determination module 402 can be configured to determine the weight p(F_i, T_j) of feature word F_i with respect to label T_j according to p(F_i, T_j) = MI(F_i, T_j) × TF(F_i) × IDF(F_i), and to determine the total weight p(F, T_j) of the set F of all feature words in the target document with respect to label T_j according to:

    p(F, T_j) = \sum_{i=0}^{m} MI(F_i, T_j) \times TF(F_i) \times IDF(F_i)

where m is the number of feature words in the target document.
The importance degree IDF(F_i) of feature word F_i can be:

    IDF(F_i) = \log \left( 1 + \frac{N}{N_{F_i}} \right)

where N is the total number of samples in the sample library and N_{F_i} is the number of samples in the sample library in which feature word F_i appears.
The weight determination module 402 can also be configured to take the mutual information MI(F_i, T_j) between feature word F_i and label T_j to be:

    MI(F_i, T_j) = \log \frac{\sum_{k=0}^{n} \log(e - 1 + Num_k)}{N \times p(F_i) \times p(T_j)}

where n is the number of samples in the sample library that contain feature word F_i and carry label T_j, Num_k is the number of times F_i occurs in the k-th sample that contains F_i and carries T_j, p(F_i) is the total number of times F_i occurs in all samples of the sample library divided by the total number of occurrences of all feature words in all samples of the sample library, and p(T_j) is the number of samples carrying label T_j divided by the total number of samples in the sample library.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (12)

1. A text feature extraction method, characterized in that the method comprises:
for a feature word F_i in a feature word library, determining the mutual information between F_i and each label in a label library according to the number of times F_i occurs in the samples of a pre-established sample library that contain F_i and the labels carried by those samples;
performing word segmentation on a target document to obtain all feature words appearing in the target document;
determining, based on the mutual information between each feature word in the target document and each label, the weight of each feature word in the target document with respect to each label, and summing the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label;
according to the total weight of each label, determining from the labels a target label as the text feature of the target document.
2. The method according to claim 1, characterized in that determining the mutual information between feature word F_i and each label in the label library comprises:
defining the mutual information between feature word F_i and label T_j in the label library as:

    MI(F_i, T_j) = \log \frac{\sum_{k=0}^{n} \log(e - 1 + Num_k)}{N \times p(F_i) \times p(T_j)}

where n is the number of samples in the pre-established sample library that contain feature word F_i and carry label T_j, Num_k is the number of times F_i occurs in the k-th sample that contains F_i and carries T_j, N is the total number of samples in the sample library, p(F_i) is the total number of times F_i occurs in all samples of the sample library divided by the total number of occurrences of all feature words in all samples of the sample library, and p(T_j) is the number of samples carrying label T_j divided by the total number of samples in the sample library.
3. A text feature extraction method, characterized in that the method comprises:
performing word segmentation on a target document to obtain all feature words appearing in the target document;
determining the weight of each feature word in the target document with respect to each label, and summing the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label;
according to the total weight of each label, determining from the labels a target label as the text feature of the target document;
wherein determining the weight of each feature word in the target document with respect to each label comprises:
determining the weight of feature word F_i with respect to label T_j according to the mutual information MI(F_i, T_j) between F_i and T_j, the number of times TF(F_i) that F_i occurs in the target document, and the importance degree IDF(F_i) of F_i, wherein the more samples in the pre-established sample library contain F_i, the lower the importance degree IDF(F_i) of F_i.
4. The method according to claim 3, characterized in that determining the weight of feature word F_i with respect to label T_j according to the mutual information MI(F_i, T_j), the number of times TF(F_i) that F_i occurs in the target document and the importance degree IDF(F_i) of F_i comprises:
defining the weight p(F_i, T_j) of feature word F_i with respect to label T_j as:

    p(F_i, T_j) = MI(F_i, T_j) × TF(F_i) × IDF(F_i);

and summing the weights of all feature words in the target document with respect to the same label to obtain the total weight with respect to that label comprises:
defining the total weight p(F, T_j) of the set F of all feature words in the target document with respect to label T_j as:

    p(F, T_j) = \sum_{i=0}^{m} MI(F_i, T_j) \times TF(F_i) \times IDF(F_i)

where m is the number of feature words in the target document.
5. The method according to claim 4, characterized in that the importance degree IDF(F_i) of feature word F_i is:

    IDF(F_i) = \log \left( 1 + \frac{N}{N_{F_i}} \right)

where N is the total number of samples in the sample library and N_{F_i} is the number of samples in the sample library in which feature word F_i appears.
6. The method according to claim 3, 4 or 5, characterized in that the mutual information MI(F_i, T_j) between feature word F_i and label T_j is:

    MI(F_i, T_j) = \log \frac{\sum_{k=0}^{n} \log(e - 1 + Num_k)}{N \times p(F_i) \times p(T_j)}

where n is the number of samples in the sample library that contain feature word F_i and carry label T_j, Num_k is the number of times F_i occurs in the k-th sample that contains F_i and carries T_j, N is the total number of samples in the sample library, p(F_i) is the total number of times F_i occurs in all samples of the sample library divided by the total number of occurrences of all feature words in all samples of the sample library, and p(T_j) is the number of samples carrying label T_j divided by the total number of samples in the sample library.
7. A text feature extraction device, characterized in that the device comprises a mutual information determination module and a text feature extraction module;
the mutual information determination module is configured to, for a feature word F_i in a feature word library, determine the mutual information between F_i and each label in a label library according to the number of times F_i occurs in the samples of a pre-established sample library that contain F_i and the labels carried by those samples;
the text feature extraction module is configured to perform word segmentation on a target document to obtain all feature words appearing in the target document; determine, based on the mutual information between each feature word in the target document and each label, the weight of each feature word in the target document with respect to each label; sum the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label; and, according to the total weight of each label, determine from the labels a target label as the text feature of the target document.
8. The device according to claim 7, characterized in that
the mutual information determination module is configured to define the mutual information MI(F_i, T_j) between feature word F_i and label T_j as:

    MI(F_i, T_j) = \log \frac{\sum_{k=0}^{n} \log(e - 1 + Num_k)}{N \times p(F_i) \times p(T_j)}

where n is the number of samples in the sample library that contain feature word F_i and carry label T_j, Num_k is the number of times F_i occurs in the k-th sample that contains F_i and carries T_j, N is the total number of samples in the sample library, p(F_i) is the total number of times F_i occurs in all samples of the sample library divided by the total number of occurrences of all feature words in all samples of the sample library, and p(T_j) is the number of samples carrying label T_j divided by the total number of samples in the sample library.
9. A text feature extraction device, characterized in that the device comprises a word segmentation module, a weight determination module and a text feature extraction module;
the word segmentation module is configured to perform word segmentation on a target document to obtain all feature words appearing in the target document;
the weight determination module is configured to determine the weight of each feature word in the target document with respect to each label, and to sum the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label;
the text feature extraction module is configured to determine, according to the total weight of each label, a target label from the labels as the text feature of the target document;
wherein the weight determination module is configured to determine the weight of feature word F_i with respect to label T_j according to the mutual information MI(F_i, T_j) between F_i and T_j, the number of times TF(F_i) that F_i occurs in the target document, and the importance degree IDF(F_i) of F_i, wherein the more samples in the pre-established sample library contain F_i, the lower the importance degree IDF(F_i) of F_i.
10. The device according to claim 9, characterized in that
the weight determination module is configured to determine the weight p(F_i, T_j) of feature word F_i with respect to label T_j according to p(F_i, T_j) = MI(F_i, T_j) × TF(F_i) × IDF(F_i), and to determine the total weight p(F, T_j) of the set F of all feature words in the target document with respect to label T_j according to:

    p(F, T_j) = \sum_{i=0}^{m} MI(F_i, T_j) \times TF(F_i) \times IDF(F_i)

where m is the number of feature words in the target document.
11. The device according to claim 10, characterized in that the importance degree IDF(F_i) of feature word F_i is:

    IDF(F_i) = \log \left( 1 + \frac{N}{N_{F_i}} \right)

where N is the total number of samples in the sample library and N_{F_i} is the number of samples in the sample library in which feature word F_i appears.
12. The device according to claim 9, 10 or 11, characterized in that
the weight determination module is configured to take the mutual information MI(F_i, T_j) between feature word F_i and label T_j to be:

    MI(F_i, T_j) = \log \frac{\sum_{k=0}^{n} \log(e - 1 + Num_k)}{N \times p(F_i) \times p(T_j)}

where n is the number of samples in the sample library that contain feature word F_i and carry label T_j, Num_k is the number of times F_i occurs in the k-th sample that contains F_i and carries T_j, N is the total number of samples in the sample library, p(F_i) is the total number of times F_i occurs in all samples of the sample library divided by the total number of occurrences of all feature words in all samples of the sample library, and p(T_j) is the number of samples carrying label T_j divided by the total number of samples in the sample library.
CN201210419624.8A (priority date 2012-10-29, filed 2012-10-29): Textual feature extracting method and device. Pending. Published as CN103793385A (en).

Priority Applications (1)

Application Number: CN201210419624.8A
Priority Date: 2012-10-29
Filing Date: 2012-10-29
Title: Textual feature extracting method and device

Publications (1)

Publication Number: CN103793385A
Publication Date: 2014-05-14

Family ID: 50669070

Family Applications (1)

Application Number: CN201210419624.8A (pending, published as CN103793385A (en))
Title: Textual feature extracting method and device

Country Status (1)

CN: CN103793385A (en)

Cited By (6)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN105677677A * | 2014-11-20 | 2016-06-15 | 阿里巴巴集团控股有限公司 | Information classification and device
CN105701084A * | 2015-12-28 | 2016-06-22 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Characteristic extraction method of text classification on the basis of mutual information
CN107562928A * | 2017-09-15 | 2018-01-09 | 南京大学 | A kind of CCMI text feature selections method
CN107562928B | 2017-09-15 | 2019-11-15 | 南京大学 | A kind of CCMI text feature selection method
CN114331766A * | 2022-01-05 | 2022-04-12 | 中国科学技术信息研究所 | Method and device for determining patent technology core degree, electronic equipment and storage medium
CN114331766B | 2022-01-05 | 2022-07-08 | 中国科学技术信息研究所 | Method and device for determining patent technology core degree, electronic equipment and storage medium


Legal Events

C06 / PB01: Publication
C02 / WD01: Deemed withdrawal of patent application after publication (patent law 2001); invention patent application deemed withdrawn after publication

Application publication date: 2014-05-14