CN103793385A - Textual feature extracting method and device - Google Patents
- Publication number: CN103793385A
- Application number: CN201210419624.8A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06F40/216: Parsing using statistical methods (G06F40/20, Natural language analysis; G06F40/00, Handling natural language data)
- G06F40/279: Recognition of textual entities (G06F40/20, Natural language analysis; G06F40/00, Handling natural language data)
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text feature extraction method and device. The method comprises: for a feature word Fi in a feature word library, determining the mutual information between Fi and each label in a label library according to the number of times Fi occurs in each sample of a pre-built sample library that contains Fi, and the labels carried by those samples; performing word segmentation on a target document to obtain all feature words occurring in it; determining, based on the mutual information between each feature word in the target document and each label, the weight of each feature word with respect to each label, and summing the weights of all feature words with respect to the same label to obtain that label's total weight; and determining, according to the total weights of all labels, a target label from among the labels as the text feature of the target document. The method and device can improve the accuracy of text feature extraction.
Description
Technical field
The application relates to the field of information technology, and in particular to a text feature extraction method and device.
Background
In the field of text classification, a single document contains far too many feature words, so how to extract the key feature words from a document, that is, how to extract its text feature, has become an important technical problem.
Text classification based on probability models is one of the most widely used approaches, because its underlying principle is simple to implement and its accuracy is high. Among these approaches, text feature extraction based on mutual information (MI) is a typical probability-model-based text classification method.
Mutual information refers to the correlation between two events.
Specifically, the mutual information of two events X and Y is defined as formula 1:
MI(X, Y) = log( p(X, Y) / (p(X) × p(Y)) )     (formula 1)
where p(X) and p(Y) are the probabilities that events X and Y occur, respectively, and p(X, Y) is the probability that X and Y occur simultaneously.
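As a quick illustration (not part of the patent), formula 1 can be computed directly from the three probabilities; the numbers below are made-up values:

```python
import math

def mutual_information(p_x: float, p_y: float, p_xy: float) -> float:
    """Formula 1: MI(X, Y) = log(p(X, Y) / (p(X) * p(Y)))."""
    return math.log(p_xy / (p_x * p_y))

# Independent events give MI = 0; positive correlation gives MI > 0.
print(round(mutual_information(0.5, 0.4, 0.2), 6))  # independent events
print(round(mutual_information(0.5, 0.4, 0.3), 4))  # positively correlated
```

Independent events make the ratio inside the logarithm equal to 1, so the mutual information is zero.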
In mutual-information-based text feature extraction, formula 1 develops into formula 2:
MI(t, Xi) = log( p(t, Xi) / (p(t) × p(Xi)) )     (formula 2)
where t is a keyword obtained from a document by word segmentation, Xi is the i-th category in a known set of text categories, p(t) and p(Xi) are respectively the probability of obtaining keyword t from a document by word segmentation and the probability that a document is classified into category Xi, and p(t, Xi) is the probability of obtaining keyword t from a document and classifying that document into category Xi. MI(t, Xi) is the mutual information between these two events; it characterizes the weight of keyword t with respect to text category Xi.
Since mutual information can characterize the weight of a feature word with respect to a text category, when extracting text features from a document, a label representing a text feature can be treated as a category, and formula 2 can be transformed into formula 3:
MI(t, Ti) = log( p(t, Ti) / (p(t) × p(Ti)) )     (formula 3)
where Ti is the i-th label in label library T, p(t) is the probability of obtaining keyword t from a document by word segmentation, p(Ti) is the probability that a document is classified into class Ti, and p(t, Ti) is the probability of obtaining keyword t from a document and classifying that document into class Ti.
Specifically, a document sample library is built in advance, and every document in it is labeled, for example manually. Then p(t) is the number of documents in the sample library that contain feature word t divided by the total number of documents in the library; p(Ti) is the number of documents of class Ti divided by the total number of documents; and p(t, Ti) is the number of documents that contain feature word t and belong to class Ti divided by the total number of documents.
It can be seen that formula 3 yields the mutual information between every feature word and every label in label library T. Then, when the text feature of a document d needs to be extracted, the weight with which text feature Ti can be extracted from d (that is, with which d can be tagged with label Ti) is obtained by formula 4:
P(d, Ti) = Σ_{x=1..N} MI(tx, Ti)     (formula 4)
where P(d, Ti) is the weight with which document d can be tagged with label Ti, N is the number of feature words in d, and tx is the x-th feature word in d.
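To make formulas 3 and 4 concrete, the following sketch estimates the probabilities from a tiny hand-made sample library and sums the per-word mutual information into a document-to-label weight. The corpus, labels, and function names are illustrative assumptions, not taken from the patent:

```python
import math

# Toy sample library: each sample is (set_of_feature_words, set_of_labels).
samples = [
    ({"android", "chat"},    {"wechat"}),
    ({"android", "qr_code"}, {"wechat"}),
    ({"chat", "voice"},      {"wechat", "social"}),
    ({"kernel", "driver"},   {"linux"}),
]

def mi(word: str, label: str) -> float:
    """Formula 3: MI(t, Ti) = log(p(t, Ti) / (p(t) * p(Ti))),
    with every probability estimated as a document count / total documents."""
    n = len(samples)
    p_t  = sum(word in ws for ws, _ in samples) / n
    p_l  = sum(label in ls for _, ls in samples) / n
    p_tl = sum(word in ws and label in ls for ws, ls in samples) / n
    if p_tl == 0:
        return float("-inf")  # word and label never co-occur
    return math.log(p_tl / (p_t * p_l))

def score(doc_words: set, label: str) -> float:
    """Formula 4: P(d, Ti) sums MI(t_x, Ti) over the document's words."""
    return sum(mi(w, label) for w in doc_words
               if mi(w, label) != float("-inf"))

print(round(score({"android", "chat"}, "wechat"), 4))
```

Both "android" and "chat" appear in 2 of 4 samples and always co-occur with "wechat", so each contributes log(4/3) and the total score is 2·log(4/3).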
For example, suppose word segmentation of an article d yields the following feature words and occurrence counts: (android, 2), (chat, 1), (voice intercom, 2), (QR code, 1), where the number in each pair is the number of times that feature word occurs in article d. Suppose one possible label for article d is "WeChat" in the label library; then the weight with which document d can be tagged "WeChat" is:
P(d, WeChat) = MI(android, WeChat) + MI(chat, WeChat) + MI(voice intercom, WeChat) + MI(QR code, WeChat)
where MI(android, WeChat), MI(chat, WeChat), MI(voice intercom, WeChat) and MI(QR code, WeChat) are calculated according to formula 3 from the pre-built document sample library.
It can be seen that the current approach determines the mutual information between a feature word and each label (i.e. text feature) only from the number of documents in the sample library that contain the feature word and the number of documents in which the feature word and each label co-occur (see formula 3); it does not consider how many times the feature word occurs within a single document. In the example above, the feature words "android" and "voice intercom" each occur twice; relative to feature words such as "QR code" that occur only once, they contribute more to the label "WeChat" and should have a larger mutual information value with it. The current method of determining mutual information, however, cannot reflect this difference. Its accuracy is therefore low: it cannot accurately reflect the correlation between feature words and labels, and, correspondingly, text classification based on it is also less accurate.
In addition, regardless of how the mutual information is determined, current text feature extraction considers only which feature words occur in the target document, not how many times each occurs. In fact, if a feature word occurs frequently in the target document, it should contribute more to the extraction of that document's text feature; from this angle too, the accuracy of current text feature extraction methods is low.
Summary of the invention
The application provides a text feature extraction method and device that can improve the accuracy of text feature extraction.
A text feature extraction method, the method comprising:
for a feature word Fi in a feature word library, determining the mutual information between Fi and each label in a label library according to the number of times Fi occurs in each sample of a pre-built sample library that contains Fi, and the labels carried by those samples;
performing word segmentation on a target document to obtain all feature words occurring in the target document;
determining, based on the mutual information between each feature word in the target document and each label, the weight of each feature word with respect to each label, and summing the weights of all feature words with respect to the same label to obtain the total weight of all feature words with respect to that label;
determining, according to the total weight of each label, a target label from among the labels as the text feature of the target document.
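The steps above can be sketched as follows, assuming the mutual information values have already been determined and stored in a lookup table (all names and numbers here are hypothetical):

```python
# Hypothetical precomputed MI table {(feature_word, label): MI value}.
MI = {
    ("android", "wechat"): 0.8, ("chat", "wechat"): 0.5,
    ("android", "linux"):  0.1, ("kernel", "linux"): 0.9,
}
LABELS = {"wechat", "linux"}

def extract_text_feature(target_words: list) -> str:
    """Per-word weight to each label is its MI value; a label's total
    weight sums the weights of all words in the target document; the
    label with the largest total weight is the document's text feature."""
    totals = {
        label: sum(MI.get((w, label), 0.0) for w in target_words)
        for label in LABELS
    }
    return max(totals, key=totals.get)

print(extract_text_feature(["android", "chat", "chat"]))
```

Here "wechat" totals 0.8 + 0.5 + 0.5 = 1.8 against 0.1 for "linux", so "wechat" is selected.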
A text feature extraction method, the method comprising:
performing word segmentation on a target document to obtain all feature words occurring in the target document;
determining the weight of each feature word in the target document with respect to each label, and summing the weights of all feature words with respect to the same label to obtain the total weight of all feature words with respect to that label;
determining, according to the total weight of each label, a target label from among the labels as the text feature of the target document;
wherein determining the weight of each feature word in the target document with respect to each label comprises:
determining the weight of feature word Fi with respect to label Tj according to the mutual information MI(Fi, Tj) between Fi and Tj, the number of times TF(Fi) that Fi occurs in the target document, and the importance degree IDF(Fi) of Fi, where the more samples in the pre-built sample library contain Fi, the lower the importance degree IDF(Fi).
A text feature extraction device, the device comprising a mutual information determination module and a text feature extraction module:
the mutual information determination module is configured, for a feature word Fi in a feature word library, to determine the mutual information between Fi and each label in a label library according to the number of times Fi occurs in each sample of a pre-built sample library that contains Fi, and the labels carried by those samples;
the text feature extraction module is configured to perform word segmentation on a target document to obtain all feature words occurring in it; to determine, based on the mutual information between each feature word in the target document and each label, the weight of each feature word with respect to each label; to sum the weights of all feature words with respect to the same label to obtain that label's total weight; and to determine, according to the total weight of each label, a target label from among the labels as the text feature of the target document.
A text feature extraction device, the device comprising a word segmentation module, a weight determination module and a text feature extraction module:
the word segmentation module is configured to perform word segmentation on a target document to obtain all feature words occurring in the target document;
the weight determination module is configured to determine the weight of each feature word in the target document with respect to each label, and to sum the weights of all feature words with respect to the same label to obtain that label's total weight;
the text feature extraction module is configured to determine, according to the total weight of each label, a target label from among the labels as the text feature of the target document;
wherein the weight determination module determines the weight of feature word Fi with respect to label Tj according to the mutual information MI(Fi, Tj) between Fi and Tj, the number of times TF(Fi) that Fi occurs in the target document, and the importance degree IDF(Fi) of Fi, where the more samples in the pre-built sample library contain Fi, the lower the importance degree IDF(Fi).
It can be seen from the above scheme that, when determining mutual information, the present invention considers not only whether a feature word occurs in a sample of the sample library but also how many times it occurs there. Since the more times a feature word occurs in a sample, the larger the correlation generally is between that feature word and the labels the sample carries, the technical scheme of the present invention for determining mutual information can reflect the correlation between feature words and labels comparatively accurately; performing text feature extraction based on this mutual information can in turn improve the accuracy of text feature extraction.
In addition, when extracting a text feature, the present invention considers not only whether a feature word occurs in the target document but also how many times it occurs there, as well as the number of samples in the pre-built sample library that contain it. The number of occurrences in the target document reflects how likely the labels related to the feature word are to be the document's text feature, while the number of samples containing the feature word reflects its importance. This technical scheme can therefore also improve the accuracy of text feature extraction.
Brief description of the drawings
Fig. 1 is a flowchart of the mutual information determination method provided by the invention.
Fig. 2 is a flowchart of the text feature extraction method provided by the invention.
Fig. 3 is the first structural diagram of the text feature extraction device provided by the invention.
Fig. 4 is the second structural diagram of the text feature extraction device provided by the invention.
Embodiments
Fig. 1 is a flowchart of the mutual information determination method provided by the invention.
As shown in Fig. 1, the flow comprises:
It can be seen that, when determining mutual information, the method of Fig. 1 considers not only whether a feature word occurs in a sample of the sample library but also how many times it occurs there. Since the more times a feature word occurs in a sample, the larger the correlation generally is between that feature word and the labels the sample carries, determining mutual information by the method of Fig. 1 can reflect the correlation between feature words and labels comparatively accurately. Performing text feature extraction based on mutual information so determined can in turn improve the accuracy of text feature extraction.
Specifically, the present invention proposes that the mutual information MI(Fi, Tj) between feature word Fi and label Tj be defined as:
Based on this idea of taking the occurrence counts of feature words into account, the present invention also provides a text feature extraction method; see Fig. 2.
Fig. 2 is a flowchart of the text feature extraction method provided by the invention.
As shown in Fig. 2, the flow comprises:
The more samples in the pre-built sample library contain a given feature word, the lower that feature word's importance degree.
Step 204: according to the total weight of each label, determine a target label from among the labels as the text feature of the target document.
It can be seen that, when extracting a text feature, the method of Fig. 2 considers not only whether a feature word occurs in the target document but also how many times it occurs there, as well as the number of samples in the pre-built sample library that contain it. The number of occurrences in the target document reflects how likely the labels related to the feature word are to be the document's text feature, while the number of samples containing the feature word reflects its importance. Extracting text features by the method of Fig. 2 can therefore improve the accuracy of text feature extraction.
Specifically, the present invention further proposes that the weight p(Fi, Tj) of feature word Fi with respect to label Tj be defined as:
p(Fi, Tj) = MI(Fi, Tj) × TF(Fi) × IDF(Fi)
where MI(Fi, Tj) is the mutual information between feature word Fi and label Tj, TF(Fi) is the number of times Fi occurs in the target document, and IDF(Fi) is the importance degree of Fi; the more samples in the pre-built sample library contain Fi, the lower the importance degree IDF(Fi).
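The product form of this weight can be sketched directly; the numeric inputs below are placeholders, not values from the patent:

```python
def word_label_weight(mi: float, tf: int, idf: float) -> float:
    """p(Fi, Tj) = MI(Fi, Tj) * TF(Fi) * IDF(Fi): a word that appears more
    often in the target document (higher TF) and in fewer samples of the
    sample library (higher IDF) contributes more weight to label Tj."""
    return mi * tf * idf

print(word_label_weight(0.5, 2, 1.5))
```

Doubling either the term frequency or the importance degree doubles the word's contribution to the label, which is exactly the behavior the scheme argues the plain-MI approach lacks.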
Further, the importance degree IDF(Fi) of feature word Fi can be:
where N is the total number of samples in the sample library and NFi is the number of samples in the sample library in which feature word Fi occurs.
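The source does not reproduce the formula image for IDF(Fi), so the sketch below assumes the conventional inverse-document-frequency form log(N / NFi), which matches the stated property that more containing samples means a lower importance degree:

```python
import math

def idf(total_samples: int, samples_with_word: int) -> float:
    """Assumed form IDF(Fi) = log(N / NFi): the more samples of the
    sample library contain the word, the lower its importance degree."""
    return math.log(total_samples / samples_with_word)

print(round(idf(1000, 10), 4))  # a rare word: high importance
print(idf(100, 100))            # a word in every sample: importance 0
```

A word that appears in every sample carries no discriminating information, and this form assigns it an importance degree of zero.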
To further improve the accuracy of text feature extraction, the text feature extraction method provided by the invention can further adopt the mutual information determination method proposed by the invention. In that case, the mutual information MI(Fi, Tj) between feature word Fi and label Tj is:
where n is the number of samples in the sample library that contain feature word Fi and carry label Tj; Num_k is the number of times Fi occurs in the k-th such sample; p(Fi) is the total number of times Fi occurs in all samples of the sample library divided by the total number of times all feature words occur in all samples; and p(Tj) is the number of samples carrying label Tj divided by the total number of samples in the sample library.
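The exact formula image is likewise not reproduced in the source, so the sketch below is only one plausible reading of the textual definitions: the joint term accumulates the occurrence counts Num_k instead of merely counting co-occurring samples, and the normalizations follow the textual definitions of p(Fi) and p(Tj). Every name and number here is an assumption:

```python
import math

def count_aware_mi(num_k: list,            # occurrences of Fi in each of the
                                           # n samples that also carry Tj
                   fi_occurrences: int,    # total occurrences of Fi anywhere
                   all_occurrences: int,   # total occurrences of all words
                   samples_with_label: int,
                   total_samples: int) -> float:
    """One plausible count-aware reading: the joint term sums Num_k over
    the co-occurring samples, so a word appearing twice in a labeled
    sample contributes more than a word appearing once."""
    joint = sum(num_k) / all_occurrences
    p_fi = fi_occurrences / all_occurrences
    p_tj = samples_with_label / total_samples
    return math.log(joint / (p_fi * p_tj))

print(round(count_aware_mi([2, 1], 4, 100, 5, 20), 4))
```

With these placeholder counts the joint term is 3/100, p(Fi) is 4/100 and p(Tj) is 5/20, so the result is log(3).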
According to the above methods, the present invention also provides two text feature extraction devices; see Fig. 3 and Fig. 4.
Fig. 3 is the first structural diagram of the text feature extraction device provided by the invention.
As shown in Fig. 3, the device comprises a mutual information determination module 301 and a text feature extraction module 302.
The mutual information determination module 301 is configured, for a feature word Fi in the feature word library, to determine the mutual information between Fi and each label in the label library according to the number of times Fi occurs in each sample of the pre-built sample library that contains Fi, and the labels carried by those samples.
The text feature extraction module 302 is configured to perform word segmentation on a target document to obtain all feature words occurring in it; to determine, based on the mutual information between each feature word in the target document and each label, the weight of each feature word with respect to each label; to sum the weights of all feature words with respect to the same label to obtain that label's total weight; and to determine, according to the total weight of each label, a target label from among the labels as the text feature of the target document.
The mutual information determination module 301 can be configured to define the mutual information MI(Fi, Tj) between feature word Fi and label Tj as:
where n is the number of samples in the sample library that contain feature word Fi and carry label Tj; Num_k is the number of times Fi occurs in the k-th such sample; p(Fi) is the total number of times Fi occurs in all samples of the sample library divided by the total number of times all feature words occur in all samples; and p(Tj) is the number of samples carrying label Tj divided by the total number of samples in the sample library.
Fig. 4 is the second structural diagram of the text feature extraction device provided by the invention.
As shown in Fig. 4, the text feature extraction device comprises a word segmentation module 401, a weight determination module 402 and a text feature extraction module 403.
The word segmentation module 401 is configured to perform word segmentation on a target document to obtain all feature words occurring in the target document.
The weight determination module 402 is configured to determine the weight of each feature word in the target document with respect to each label, and to sum the weights of all feature words with respect to the same label to obtain that label's total weight.
The text feature extraction module 403 is configured to determine, according to the total weight of each label, a target label from among the labels as the text feature of the target document.
The weight determination module 402 determines the weight of feature word Fi with respect to label Tj according to the mutual information MI(Fi, Tj) between Fi and Tj, the number of times TF(Fi) that Fi occurs in the target document, and the importance degree IDF(Fi) of Fi, where the more samples in the pre-built sample library contain Fi, the lower the importance degree IDF(Fi).
The importance degree IDF(Fi) of feature word Fi can be:
where N is the total number of samples in the sample library and NFi is the number of samples in the sample library in which feature word Fi occurs.
The weight determination module 402 can further be configured to take the mutual information MI(Fi, Tj) between feature word Fi and label Tj as:
where n is the number of samples in the sample library that contain feature word Fi and carry label Tj; Num_k is the number of times Fi occurs in the k-th such sample; p(Fi) is the total number of times Fi occurs in all samples of the sample library divided by the total number of times all feature words occur in all samples; and p(Tj) is the number of samples carrying label Tj divided by the total number of samples in the sample library.
The foregoing is only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.
Claims (12)
1. A text feature extraction method, characterized in that the method comprises:
for a feature word Fi in a feature word library, determining the mutual information between Fi and each label in a label library according to the number of times Fi occurs in each sample of a pre-built sample library that contains Fi, and the labels carried by those samples;
performing word segmentation on a target document to obtain all feature words occurring in the target document;
determining, based on the mutual information between each feature word in the target document and each label, the weight of each feature word with respect to each label, and summing the weights of all feature words with respect to the same label to obtain the total weight of all feature words with respect to that label;
determining, according to the total weight of each label, a target label from among the labels as the text feature of the target document.
2. The method according to claim 1, characterized in that determining the mutual information between feature word Fi and each label in the label library comprises:
defining the mutual information between feature word Fi and label Tj in the label library as:
where n is the number of samples in the pre-built sample library that contain feature word Fi and carry label Tj; Num_k is the number of times Fi occurs in the k-th such sample; p(Fi) is the total number of times Fi occurs in all samples of the sample library divided by the total number of times all feature words occur in all samples; and p(Tj) is the number of samples carrying label Tj divided by the total number of samples in the sample library.
3. A text feature extraction method, characterized in that the method comprises:
performing word segmentation on a target document to obtain all feature words occurring in the target document;
determining the weight of each feature word in the target document with respect to each label, and summing the weights of all feature words with respect to the same label to obtain the total weight of all feature words with respect to that label;
determining, according to the total weight of each label, a target label from among the labels as the text feature of the target document;
wherein determining the weight of each feature word in the target document with respect to each label comprises:
determining the weight of feature word Fi with respect to label Tj according to the mutual information MI(Fi, Tj) between Fi and Tj, the number of times TF(Fi) that Fi occurs in the target document, and the importance degree IDF(Fi) of Fi, where the more samples in the pre-built sample library contain Fi, the lower the importance degree IDF(Fi).
4. The method according to claim 3, characterized in that determining the weight of feature word Fi with respect to label Tj according to the mutual information MI(Fi, Tj), the number of times TF(Fi) that Fi occurs in the target document and the importance degree IDF(Fi) of Fi comprises:
defining the weight p(Fi, Tj) of feature word Fi with respect to label Tj as:
p(Fi, Tj) = MI(Fi, Tj) × TF(Fi) × IDF(Fi);
and summing the weights of all feature words in the target document with respect to the same label to obtain the total weight of that label comprises:
defining the total weight p(F, Tj) of the set F of all feature words in the target document with respect to label Tj as:
p(F, Tj) = Σ_{i=1..m} p(Fi, Tj)
where m is the number of all feature words in the target document.
5. The method according to claim 4, characterized in that the importance degree IDF(Fi) of feature word Fi is:
where N is the total number of samples in the sample library and NFi is the number of samples in the sample library in which feature word Fi occurs.
6. The method according to claim 3, 4 or 5, characterized in that the mutual information MI(Fi, Tj) between feature word Fi and label Tj is:
where n is the number of samples in the sample library that contain feature word Fi and carry label Tj; Num_k is the number of times Fi occurs in the k-th such sample; p(Fi) is the total number of times Fi occurs in all samples of the sample library divided by the total number of times all feature words occur in all samples; and p(Tj) is the number of samples carrying label Tj divided by the total number of samples in the sample library.
7. A text feature extraction device, characterized in that the device comprises a mutual information determination module and a text feature extraction module:
the mutual information determination module is configured, for a feature word Fi in a feature word library, to determine the mutual information between Fi and each label in a label library according to the number of times Fi occurs in each sample of a pre-built sample library that contains Fi, and the labels carried by those samples;
the text feature extraction module is configured to perform word segmentation on a target document to obtain all feature words occurring in it; to determine, based on the mutual information between each feature word in the target document and each label, the weight of each feature word with respect to each label; to sum the weights of all feature words with respect to the same label to obtain that label's total weight; and to determine, according to the total weight of each label, a target label from among the labels as the text feature of the target document.
8. The device according to claim 7, characterized in that the mutual information determination module is configured to define the mutual information MI(Fi, Tj) between feature word Fi and label Tj as:
where n is the number of samples in the sample library that contain feature word Fi and carry label Tj; Num_k is the number of times Fi occurs in the k-th such sample; p(Fi) is the total number of times Fi occurs in all samples of the sample library divided by the total number of times all feature words occur in all samples; and p(Tj) is the number of samples carrying label Tj divided by the total number of samples in the sample library.
9. A text feature extraction device, wherein the device comprises a word segmentation module, a weight determination module, and a text feature extraction module;
The word segmentation module is configured to segment the target document into words to obtain all feature words occurring in the target document;
The weight determination module is configured to determine the weight of each feature word in the target document with respect to each label, and to sum the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words with respect to that label;
The text feature extraction module is configured to select, according to the total weight of each label, a target label from the labels as the text feature of the target document;
wherein the weight determination module is configured to determine the weight of the feature word Fi with respect to the label Tj according to the mutual information MI(Fi, Tj) between Fi and Tj, the number of times TF(Fi) that Fi occurs in the target document, and the importance degree IDF(Fi) of Fi, where the more samples in the pre-established sample library contain Fi, the lower the importance degree IDF(Fi).
10. The device according to claim 9, wherein the weight determination module is configured to determine the weight p(Fi, Tj) of the feature word Fi with respect to the label Tj according to

p(Fi, Tj) = MI(Fi, Tj) × TF(Fi) × IDF(Fi)

and to determine the total weight p(F, Tj) of the set F of all feature words in the target document with respect to the label Tj as the sum of p(Fi, Tj) over i = 1, …, m, where m is the number of feature words in the target document.
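Claims 10 and 11 together define the per-word weight and its per-label sum. A minimal sketch follows; note that the log(N / N_Fi) form of IDF is an assumption chosen to match the stated behavior (more samples containing Fi gives a lower IDF), since the patent's formula image is not reproduced.

```python
import math

def idf(n_total, n_containing):
    """Importance degree IDF(Fi). The log(N / N_Fi) form is an
    assumption consistent with the claim that IDF(Fi) decreases
    as more samples contain Fi; the patent's image is not reproduced."""
    return math.log(n_total / n_containing)

def total_weight(mi_for_label, tf, idf_vals):
    """p(F, Tj): sum over the document's m feature words of
    p(Fi, Tj) = MI(Fi, Tj) * TF(Fi) * IDF(Fi)."""
    return sum(mi_for_label[w] * tf[w] * idf_vals[w] for w in mi_for_label)
```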
11. The device according to claim 10, wherein the importance degree IDF(Fi) of the feature word Fi is:

(formula given as an image in the original and not reproduced here)

where N is the total number of samples in the sample library and N_Fi is the number of samples in the sample library in which the feature word Fi appears.
12. The device according to claim 9, 10, or 11, wherein the weight determination module is configured to take the mutual information MI(Fi, Tj) between the feature word Fi and the label Tj as:

(formula given as an image in the original and not reproduced here)

where n is the number of samples in the sample library that contain the feature word Fi and carry the label Tj; Num_k is the number of times Fi occurs in the k-th such sample; p(Fi) is the total number of occurrences of Fi in all samples of the sample library divided by the total number of occurrences of all feature words in all samples of the sample library; and p(Tj) is the number of samples carrying the label Tj divided by the total number of samples in the sample library.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210419624.8A CN103793385A (en) | 2012-10-29 | 2012-10-29 | Textual feature extracting method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103793385A true CN103793385A (en) | 2014-05-14 |
Family
ID=50669070
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210419624.8A Pending CN103793385A (en) | 2012-10-29 | 2012-10-29 | Textual feature extracting method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103793385A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105677677A (en) * | 2014-11-20 | 2016-06-15 | 阿里巴巴集团控股有限公司 | Information classification and device |
CN105701084A (en) * | 2015-12-28 | 2016-06-22 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Characteristic extraction method of text classification on the basis of mutual information |
CN107562928A (en) * | 2017-09-15 | 2018-01-09 | 南京大学 | A kind of CCMI text feature selections method |
CN107562928B (en) * | 2017-09-15 | 2019-11-15 | 南京大学 | A kind of CCMI text feature selection method |
CN114331766A (en) * | 2022-01-05 | 2022-04-12 | 中国科学技术信息研究所 | Method and device for determining patent technology core degree, electronic equipment and storage medium |
CN114331766B (en) * | 2022-01-05 | 2022-07-08 | 中国科学技术信息研究所 | Method and device for determining patent technology core degree, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104408093B (en) | A kind of media event key element abstracting method and device | |
CN108717406B (en) | Text emotion analysis method and device and storage medium | |
CN109325165B (en) | Network public opinion analysis method, device and storage medium | |
CN104572958B (en) | A kind of sensitive information monitoring method based on event extraction | |
CN106250513B (en) | Event modeling-based event personalized classification method and system | |
US9424524B2 (en) | Extracting facts from unstructured text | |
CN109145216A (en) | Network public-opinion monitoring method, device and storage medium | |
Tromp et al. | Graph-based n-gram language identification on short texts | |
CN104598535B (en) | A kind of event extraction method based on maximum entropy | |
US8688690B2 (en) | Method for calculating semantic similarities between messages and conversations based on enhanced entity extraction | |
CN104199972A (en) | Named entity relation extraction and construction method based on deep learning | |
CN104317784A (en) | Cross-platform user identification method and cross-platform user identification system | |
CN105183717A (en) | OSN user emotion analysis method based on random forest and user relationship | |
CN103294664A (en) | Method and system for discovering new words in open fields | |
US10282467B2 (en) | Mining product aspects from opinion text | |
CN104731958A (en) | User-demand-oriented cloud manufacturing service recommendation method | |
Bam et al. | Named entity recognition for nepali text using support vector machines | |
CN110298039B (en) | Event place identification method, system, equipment and computer readable storage medium | |
CN104142912A (en) | Accurate corpus category marking method and device | |
US20130282727A1 (en) | Unexpectedness determination system, unexpectedness determination method and program | |
CN110741376A (en) | Automatic document analysis for different natural languages | |
CN103793385A (en) | Textual feature extracting method and device | |
CN103605690A (en) | Device and method for recognizing advertising messages in instant messaging | |
CN109857869A (en) | A kind of hot topic prediction technique based on Ap increment cluster and network primitive | |
CN105426379A (en) | Keyword weight calculation method based on position of word |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20140514 |