CN103793385A - Textual feature extracting method and device

Info

Publication number: CN103793385A
Application number: CN201210419624.8A
Authority: CN (China)
Prior art keywords: feature word, label, feature, target document, sample
Other languages: Chinese (zh)
Inventors: 邹维, 尹华彬, 周畅, 杨俊松, 宫建涛, 吴振宇, 宁合军
Assignee: Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority date: 2012-10-29
Filing date: 2012-10-29
Publication date: 2014-05-14
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)


Classifications

    • G06F40/216 Parsing using statistical methods (G Physics; G06 Computing, calculating or counting; G06F Electric digital data processing; G06F40/00 Handling natural language data; G06F40/20 Natural language analysis; G06F40/205 Parsing)
    • G06F40/279 Recognition of textual entities (G06F40/20 Natural language analysis)


Abstract

The invention discloses a text feature extraction method and device. The method comprises: for a feature word F_i in a feature word library, determining the mutual information between F_i and each label in a label library according to the number of times F_i occurs in the samples of a pre-established sample library that contain F_i and the labels carried by those samples; performing word segmentation on a target document to obtain all feature words appearing in the target document; determining, based on the mutual information between each feature word in the target document and each label, the weight of each feature word with respect to each label, and summing the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words with respect to that label; and, according to the total weight of each label, determining from the labels a target label as the text feature of the target document. The method and device can improve the accuracy of text feature extraction.

Description

Text feature extraction method and device
Technical field
The present application relates to the field of information technology, and in particular to a text feature extraction method and device.
Background technology
In the field of text classification, a single document contains a great many feature words. How to extract the key feature words from a document, that is, how to extract the text features of a document, has therefore become an important technical problem in text classification.
Conventional text classification based on probability models has become one of the most widely used text classification approaches because its principle is simple to implement and its accuracy is high. Among such approaches, text feature extraction based on mutual information (Mutual Information, MI) is a typical probability-model-based text classification method.
Mutual information refers to the correlation between two sets of events.
Specifically, the mutual information of two events X and Y is defined by formula 1:

    MI(X, Y) = \log \frac{p(X, Y)}{p(X) \times p(Y)}

where p(X) and p(Y) denote the probabilities that event X and event Y occur, respectively, and p(X, Y) denotes the probability that events X and Y occur simultaneously.
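As an illustration (not part of the patent text), formula 1 can be computed directly from event counts; the following Python sketch uses hypothetical names and assumes probabilities are estimated as relative frequencies:

```python
import math

def mutual_information(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Formula 1: MI(X, Y) = log( p(X, Y) / (p(X) * p(Y)) ),
    with probabilities estimated as relative frequencies over `total` trials."""
    p_x = count_x / total
    p_y = count_y / total
    p_xy = count_xy / total
    return math.log(p_xy / (p_x * p_y))

# Example: X and Y co-occur in 30 of 1000 samples; X occurs in 100, Y in 120.
print(mutual_information(30, 100, 120, 1000))  # log(0.03 / 0.012) = log 2.5 ≈ 0.916
```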
In mutual-information-based text feature extraction, formula 1 evolves into formula 2:

    MI(t, X_i) = \log \frac{p(t, X_i)}{p(t) \times p(X_i)}

where t denotes a keyword obtained from a document by word segmentation, X_i denotes the i-th category in a known set of text categories, p(t) denotes the probability of obtaining keyword t from a document by word segmentation, p(X_i) denotes the probability that a document is classified into category X_i, and p(t, X_i) denotes the probability that keyword t is obtained from a document by word segmentation and the document is classified into category X_i. MI(t, X_i) denotes the mutual information between obtaining keyword t from a document and classifying the document into X_i; it characterizes the weight of keyword t with respect to text category X_i.
Since mutual information can characterize the weight of a feature word with respect to a text category, a label representing a text feature can be treated as a category when extracting text features from a document, and formula 2 can be transformed into formula 3:

    MI(t, T_i) = \log \frac{p(t, T_i)}{p(t) \times p(T_i)}

where T_i denotes the i-th label in the label library T, p(t) denotes the probability of obtaining keyword t from a document by word segmentation, p(T_i) denotes the probability that a document is classified into class T_i, and p(t, T_i) denotes the probability that keyword t is obtained from a document by word segmentation and the document is classified into class T_i.
Specifically, a document sample library is established in advance, and every document in the sample library is given labels, for example manually. Then p(t) is the number of documents in the sample library containing feature word t divided by the total number of documents in the sample library, p(T_i) is the number of documents of class T_i in the sample library divided by the total number of documents in the sample library, and p(t, T_i) is the number of documents in the sample library that contain feature word t and belong to class T_i divided by the total number of documents in the sample library.
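For illustration only, these document-level estimates and formula 3 might be computed as in the following Python sketch; the corpus layout (a list of (feature-word set, label set) pairs) and all names are assumptions, not the patent's code:

```python
import math

def mi_formula_3(samples, word, label):
    """Formula 3 over a labelled sample library.

    samples: list of (words, labels) pairs; words is the set of feature words
    a document contains, labels is the set of labels the document carries.
    p(t), p(T_i) and p(t, T_i) are document counts divided by the library size.
    """
    total = len(samples)
    p_word = sum(1 for w, _ in samples if word in w) / total
    p_label = sum(1 for _, l in samples if label in l) / total
    p_both = sum(1 for w, l in samples if word in w and label in l) / total
    return math.log(p_both / (p_word * p_label))

samples = [
    ({"android", "chat"}, {"WeChat"}),
    ({"android", "QR code"}, {"WeChat"}),
    ({"stock", "market"}, {"finance"}),
    ({"chat", "voice intercom"}, {"WeChat"}),
]
print(mi_formula_3(samples, "android", "WeChat"))  # log((2/4) / ((2/4) * (3/4))) = log(4/3)
```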
It can be seen that formula 3 yields the mutual information between every feature word and every label in the label library T. Then, when text features need to be extracted from a document d, the weight with which text feature T_i is extracted from document d (in other words, with which document d is given label T_i) can be obtained by formula 4:

    p(d, T_i) = \sum_{x=1}^{N} MI(t_x, T_i)

where p(d, T_i) is the weight with which document d can be given label T_i, that is, the weight with which text feature T_i can be extracted from document d, N is the number of feature words in document d, and t_x is the x-th feature word in document d.
For example, word segmentation is performed on an article d, and the extracted feature words, each with its number of occurrences in the article, are: (android, 2), (chat, 1), (voice intercom, 2), (QR code, 1), where the number in each pair is the number of times that feature word occurs in article d. Suppose a possible label for article d is "WeChat" in the label library; the weight with which document d can be given the label "WeChat" is then:

p(d, WeChat) = MI(android, WeChat) + MI(chat, WeChat) + MI(voice intercom, WeChat) + MI(QR code, WeChat)

where MI(android, WeChat), MI(chat, WeChat), MI(voice intercom, WeChat) and MI(QR code, WeChat) are calculated according to formula 3 over the document sample library established in advance.
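A minimal sketch of this scoring step (formula 4), with made-up mutual information values standing in for ones computed from a real sample library:

```python
# Hypothetical MI values for the label "WeChat"; in practice they come from
# formula 3 evaluated over the pre-established document sample library.
mi_wechat = {"android": 1.2, "chat": 0.4, "voice intercom": 1.5, "QR code": 0.9}

def label_weight(doc_words, mi_table):
    """Formula 4: p(d, T_i) is the sum of MI(t_x, T_i) over the document's feature words."""
    return sum(mi_table.get(word, 0.0) for word in doc_words)

doc_words = ["android", "chat", "voice intercom", "QR code"]
print(label_weight(doc_words, mi_wechat))  # 1.2 + 0.4 + 1.5 + 0.9 = 4.0
```

As defined, the sum runs over the document's feature words without regard to how often each occurs; the two occurrences of "android" thus contribute no more than one would, which is exactly the limitation the patent discusses next.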
It can be seen that, at present, the mutual information between a feature word and each label (that is, each text feature) is determined only from the number of documents in the sample library in which the feature word appears and the co-occurrence of the feature word with each label (see formula 3); the frequency with which the feature word occurs within a single document carrying the label is not considered. In the example above, the feature words "android" and "voice intercom" each occur twice; compared with feature words such as "QR code" that occur only once, they contribute more to the label "WeChat" and should have a larger mutual information value with that label. The current way of determining mutual information, however, cannot express this difference. Its accuracy is therefore low: it cannot accurately reflect the correlation between feature words and labels, and correspondingly, text classification based on mutual information determined this way is also less accurate.
In addition, no matter how the mutual information is determined, current text feature extraction only considers which feature words appear in the target document, not how many times a given feature word appears in it. In fact, if a feature word occurs frequently in the target document, that feature word should contribute more to the extraction of the target document's text features. From this perspective, too, the accuracy of current text feature extraction methods is low.
Summary of the invention
The present application provides a text feature extraction method and device, which can improve the accuracy of text feature extraction.
A text feature extraction method, the method comprising:
for a feature word F_i in a feature word library, determining the mutual information between F_i and each label in a label library according to the number of times F_i occurs in the samples of a pre-established sample library that contain F_i and the labels carried by those samples;
performing word segmentation on a target document to obtain all feature words appearing in the target document;
determining, based on the mutual information between each feature word in the target document and each label, the weight of each feature word in the target document with respect to each label, and summing the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label;
according to the total weight of each label, determining from the labels a target label as the text feature of the target document.
A text feature extraction method, the method comprising:
performing word segmentation on a target document to obtain all feature words appearing in the target document;
determining the weight of each feature word in the target document with respect to each label, and summing the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label;
according to the total weight of each label, determining from the labels a target label as the text feature of the target document;
wherein determining the weight of each feature word in the target document with respect to each label comprises:
determining the weight of feature word F_i with respect to label T_j according to the mutual information MI(F_i, T_j) between F_i and T_j, the number of times TF(F_i) that F_i occurs in the target document, and the importance degree IDF(F_i) of F_i, wherein the more samples in the pre-established sample library contain F_i, the lower the importance degree IDF(F_i) of F_i.
A text feature extraction device, the device comprising a mutual information determination module and a text feature extraction module;
the mutual information determination module is configured to, for a feature word F_i in a feature word library, determine the mutual information between F_i and each label in a label library according to the number of times F_i occurs in the samples of a pre-established sample library that contain F_i and the labels carried by those samples;
the text feature extraction module is configured to perform word segmentation on a target document to obtain all feature words appearing in the target document; determine, based on the mutual information between each feature word in the target document and each label, the weight of each feature word in the target document with respect to each label; sum the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label; and, according to the total weight of each label, determine from the labels a target label as the text feature of the target document.
A text feature extraction device, the device comprising a word segmentation module, a weight determination module and a text feature extraction module;
the word segmentation module is configured to perform word segmentation on a target document to obtain all feature words appearing in the target document;
the weight determination module is configured to determine the weight of each feature word in the target document with respect to each label, and to sum the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label;
the text feature extraction module is configured to determine, according to the total weight of each label, a target label from the labels as the text feature of the target document;
wherein the weight determination module is configured to determine the weight of feature word F_i with respect to label T_j according to the mutual information MI(F_i, T_j) between F_i and T_j, the number of times TF(F_i) that F_i occurs in the target document, and the importance degree IDF(F_i) of F_i, wherein the more samples in the pre-established sample library contain F_i, the lower the importance degree IDF(F_i) of F_i.
It can be seen from the above scheme that, when determining mutual information, the present invention considers not only whether a sample in the sample library contains a given feature word, but also the number of times the feature word occurs in that sample. Since the more often a feature word occurs in a sample, the stronger the correlation generally is between the feature word and the labels carried by the sample, the technical scheme of the present invention for determining mutual information can reflect the correlation between feature words and labels comparatively accurately, and text feature extraction based on this mutual information correspondingly improves the accuracy of text feature extraction.
In addition, when extracting text features, the present invention considers not only whether a given feature word appears in the target document, but also the number of times the feature word occurs in the target document and the number of samples in the pre-established sample library that contain it. The number of occurrences in the target document reflects how likely it is that a label related to the feature word is a text feature of the target document, while the number of samples containing the feature word reflects the importance of the feature word; this technical scheme therefore also improves the accuracy of text feature extraction.
Brief description of the drawings
Fig. 1 is a flowchart of the mutual information determination method provided by the present invention.
Fig. 2 is a flowchart of the text feature extraction method provided by the present invention.
Fig. 3 is a first structural diagram of the text feature extraction device provided by the present invention.
Fig. 4 is a second structural diagram of the text feature extraction device provided by the present invention.
Detailed description
Fig. 1 is a flowchart of the mutual information determination method provided by the present invention.
As shown in Fig. 1, the flow comprises:
Step 101: for a feature word F_i in the feature word library and a label T_j in the label library, determine, from the sample information in the pre-established sample library, the number n of samples that contain the feature word and carry the label, the number of times Num_k that the feature word occurs in the k-th of the samples that carry the label, the value p(F_i), which is the total number of times the feature word occurs in all samples of the sample library divided by the total number of occurrences of all feature words in all samples of the sample library, and the value p(T_j), which is the number of samples carrying the label divided by the total number of samples in the sample library.
Step 102: determine the mutual information between the feature word and the label according to the information determined in step 101 and the total number N of samples in the sample library.
It can be seen that, when determining mutual information, the method shown in Fig. 1 considers not only whether a sample in the sample library contains a given feature word, but also the number of times the feature word occurs in that sample. Since the more often a feature word occurs in a sample, the stronger the correlation generally is between the feature word and the labels carried by the sample, determining mutual information by the method of Fig. 1 reflects the correlation between feature words and labels comparatively accurately. Text feature extraction based on mutual information determined by the method of Fig. 1 correspondingly improves the accuracy of text feature extraction.
Specifically, the present invention proposes that the mutual information MI(F_i, T_j) between feature word F_i and label T_j can be defined as:

    MI(F_i, T_j) = \log \frac{\sum_{k=0}^{n} \log(e - 1 + Num_k)}{N \times p(F_i) \times p(T_j)}
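For illustration, a Python sketch of this occurrence-aware mutual information, assuming samples are stored as (word-count, label-set) pairs and that p(F_i) and p(T_j) are estimated as described in step 101; all names are hypothetical:

```python
import math
from collections import Counter

def mi_occurrence_aware(samples, word, label):
    """MI(F_i, T_j) = log( sum_k log(e - 1 + Num_k) / (N * p(F_i) * p(T_j)) ),
    where Num_k is the occurrence count of the word in the k-th sample that
    contains the word and carries the label."""
    N = len(samples)
    nums = [counts[word] for counts, labels in samples
            if counts.get(word, 0) > 0 and label in labels]
    # p(F_i): occurrences of the word over occurrences of all feature words.
    total_word = sum(counts.get(word, 0) for counts, _ in samples)
    total_all = sum(sum(counts.values()) for counts, _ in samples)
    # p(T_j): samples carrying the label over all samples.
    n_label = sum(1 for _, labels in samples if label in labels)
    numerator = sum(math.log(math.e - 1 + num) for num in nums)
    return math.log(numerator / (N * (total_word / total_all) * (n_label / N)))

samples = [
    (Counter({"android": 2, "chat": 1}), {"WeChat"}),
    (Counter({"android": 1, "stock": 3}), {"finance"}),
    (Counter({"voice intercom": 2, "android": 1}), {"WeChat"}),
]
print(mi_occurrence_aware(samples, "android", "WeChat"))
```

Assuming natural logarithms, each summand log(e - 1 + Num_k) equals 1 when Num_k = 1 and grows with the occurrence count, so a word that appears repeatedly in a labelled sample raises the mutual information; this is the difference the patent argues formula 3 cannot capture.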
Based on this idea of taking the occurrence counts of feature words into account, the present invention also provides a text feature extraction method; see Fig. 2.
Fig. 2 is a flowchart of the text feature extraction method provided by the present invention.
As shown in Fig. 2, the flow comprises:
Step 201: perform word segmentation on the target document to obtain all feature words appearing in the target document.
Step 202: determine the weight of each feature word in the target document with respect to each label, according to the mutual information between each feature word and each label, the number of times each feature word occurs in the target document, and the importance of each feature word.
Here, the more samples in the pre-established sample library contain a given feature word, the lower the importance of that feature word.
Step 203: sum the weights of all feature words in the target document with respect to the same label, to obtain the total weight of all feature words in the target document with respect to that label.
Step 204: according to the total weight of each label, determine from the labels a target label as the text feature of the target document.
It can be seen that, when extracting text features, the method shown in Fig. 2 considers not only whether a given feature word appears in the target document, but also the number of times the feature word occurs in the target document and the number of samples in the pre-established sample library that contain it. The number of occurrences in the target document reflects how likely it is that a label related to the feature word is a text feature of the target document, while the number of samples containing the feature word reflects the importance of the feature word; extracting text features by the method of Fig. 2 therefore improves the accuracy of text feature extraction.
Specifically, the present invention also proposes that the weight p(F_i, T_j) of feature word F_i with respect to label T_j can be defined as:

    p(F_i, T_j) = MI(F_i, T_j) × TF(F_i) × IDF(F_i)

where MI(F_i, T_j) is the mutual information between feature word F_i and label T_j, TF(F_i) is the number of times F_i occurs in the target document, and IDF(F_i) is the importance degree of F_i; the more samples in the pre-established sample library contain F_i, the lower the importance degree IDF(F_i).
Further, the importance degree IDF(F_i) of feature word F_i can be:

    IDF(F_i) = \log \left( 1 + \frac{N}{N_{F_i}} \right)

where N is the total number of samples in the sample library and N_{F_i} is the number of samples in the sample library in which feature word F_i appears.
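For illustration, a short Python sketch of this TF × IDF × MI weighting; the MI value is assumed to have been computed beforehand (for example by the occurrence-aware formula above), and all names are hypothetical:

```python
import math

def idf(n_samples_total: int, n_samples_with_word: int) -> float:
    """Importance degree IDF(F_i) = log(1 + N / N_Fi)."""
    return math.log(1 + n_samples_total / n_samples_with_word)

def word_label_weight(mi: float, tf: int, idf_value: float) -> float:
    """Weight p(F_i, T_j) = MI(F_i, T_j) * TF(F_i) * IDF(F_i)."""
    return mi * tf * idf_value

# "android" occurs twice in the target document, appears in 40 of the 1000
# library samples, and has a made-up MI of 1.2 with the label "WeChat".
print(word_label_weight(1.2, 2, idf(1000, 40)))  # 1.2 * 2 * log(26) ≈ 7.82
```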
To further improve the accuracy of text feature extraction, the mutual information determination method proposed above can additionally be adopted within the text feature extraction method provided by the present invention; that is, the mutual information MI(F_i, T_j) between feature word F_i and label T_j is:

    MI(F_i, T_j) = \log \frac{\sum_{k=0}^{n} \log(e - 1 + Num_k)}{N \times p(F_i) \times p(T_j)}

where n is the number of samples in the sample library that contain feature word F_i and carry label T_j, Num_k is the number of times F_i occurs in the k-th sample that contains F_i and carries T_j, N is the total number of samples in the sample library, p(F_i) is the total number of times F_i occurs in all samples of the sample library divided by the total number of occurrences of all feature words in all samples of the sample library, and p(T_j) is the number of samples carrying label T_j divided by the total number of samples in the sample library.
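Putting the pieces together, the following sketch aggregates the per-word weights into a total weight per label and picks the label with the largest total as the text feature (steps 203 and 204); the MI and IDF tables are assumed precomputed, and all names are illustrative:

```python
from collections import Counter

def extract_text_feature(doc_words, labels, mi, idf):
    """Total weight per label: sum over feature words of MI * TF * IDF;
    the target label is the one with the largest total weight.

    doc_words: feature words from word segmentation, repeats included.
    mi: dict mapping (word, label) to mutual information.
    idf: dict mapping word to its importance degree.
    """
    tf = Counter(doc_words)  # TF(F_i): occurrences of each word in the document
    totals = {label: sum(mi.get((word, label), 0.0) * count * idf.get(word, 0.0)
                         for word, count in tf.items())
              for label in labels}
    return max(totals, key=totals.get), totals

mi = {("android", "WeChat"): 1.2, ("chat", "WeChat"): 0.4,
      ("android", "finance"): 0.1, ("stock", "finance"): 1.8}
idf = {"android": 3.3, "chat": 2.1, "stock": 2.9}
target, totals = extract_text_feature(["android", "android", "chat"],
                                      ["WeChat", "finance"], mi, idf)
print(target, totals)  # WeChat {'WeChat': 8.76, 'finance': 0.66}
```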
In accordance with the methods above, the present invention also provides two text feature extraction devices; see Fig. 3 and Fig. 4.
Fig. 3 is a first structural diagram of the text feature extraction device provided by the present invention.
As shown in Fig. 3, the device comprises a mutual information determination module 301 and a text feature extraction module 302.
The mutual information determination module 301 is configured to, for a feature word F_i in a feature word library, determine the mutual information between F_i and each label in a label library according to the number of times F_i occurs in the samples of a pre-established sample library that contain F_i and the labels carried by those samples.
The text feature extraction module 302 is configured to perform word segmentation on a target document to obtain all feature words appearing in the target document; determine, based on the mutual information between each feature word in the target document and each label, the weight of each feature word in the target document with respect to each label; sum the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label; and, according to the total weight of each label, determine from the labels a target label as the text feature of the target document.
The mutual information determination module 301 can be configured to define the mutual information MI(F_i, T_j) between feature word F_i and label T_j as:

    MI(F_i, T_j) = \log \frac{\sum_{k=0}^{n} \log(e - 1 + Num_k)}{N \times p(F_i) \times p(T_j)}

where n is the number of samples in the sample library that contain feature word F_i and carry label T_j, Num_k is the number of times F_i occurs in the k-th sample that contains F_i and carries T_j, N is the total number of samples in the sample library, p(F_i) is the total number of times F_i occurs in all samples of the sample library divided by the total number of occurrences of all feature words in all samples of the sample library, and p(T_j) is the number of samples carrying label T_j divided by the total number of samples in the sample library.
Fig. 4 is a second structural diagram of the text feature extraction device provided by the present invention.
As shown in Fig. 4, the text feature extraction device comprises a word segmentation module 401, a weight determination module 402 and a text feature extraction module 403.
The word segmentation module 401 is configured to perform word segmentation on the target document to obtain all feature words appearing in the target document.
The weight determination module 402 is configured to determine the weight of each feature word in the target document with respect to each label, and to sum the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label.
The text feature extraction module 403 is configured to determine, according to the total weight of each label, a target label from the labels as the text feature of the target document.
The weight determination module 402 is configured to determine the weight of feature word F_i with respect to label T_j according to the mutual information MI(F_i, T_j) between F_i and T_j, the number of times TF(F_i) that F_i occurs in the target document, and the importance degree IDF(F_i) of F_i, wherein the more samples in the pre-established sample library contain F_i, the lower the importance degree IDF(F_i).
The weight determination module 402 can be configured to determine the weight p(F_i, T_j) of feature word F_i with respect to label T_j according to p(F_i, T_j) = MI(F_i, T_j) × TF(F_i) × IDF(F_i), and to determine the total weight p(F, T_j) of the set F of all feature words in the target document with respect to label T_j according to:

    p(F, T_j) = \sum_{i=0}^{m} MI(F_i, T_j) \times TF(F_i) \times IDF(F_i)

where m is the number of feature words in the target document.
The importance degree IDF(F_i) of feature word F_i can be:

    IDF(F_i) = \log \left( 1 + \frac{N}{N_{F_i}} \right)

where N is the total number of samples in the sample library and N_{F_i} is the number of samples in the sample library in which feature word F_i appears.
The weight determination module 402 can also be configured to take the mutual information MI(F_i, T_j) between feature word F_i and label T_j to be:

    MI(F_i, T_j) = \log \frac{\sum_{k=0}^{n} \log(e - 1 + Num_k)}{N \times p(F_i) \times p(T_j)}

where n is the number of samples in the sample library that contain feature word F_i and carry label T_j, Num_k is the number of times F_i occurs in the k-th sample that contains F_i and carries T_j, p(F_i) is the total number of times F_i occurs in all samples of the sample library divided by the total number of occurrences of all feature words in all samples of the sample library, and p(T_j) is the number of samples carrying label T_j divided by the total number of samples in the sample library.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (12)

1. A text feature extraction method, characterized in that the method comprises:
for a feature word F_i in a feature word library, determining the mutual information between F_i and each label in a label library according to the number of times F_i occurs in the samples of a pre-established sample library that contain F_i and the labels carried by those samples;
performing word segmentation on a target document to obtain all feature words appearing in the target document;
determining, based on the mutual information between each feature word in the target document and each label, the weight of each feature word in the target document with respect to each label, and summing the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label;
according to the total weight of each label, determining from the labels a target label as the text feature of the target document.
2. The method according to claim 1, characterized in that determining the mutual information between feature word F_i and each label in the label library comprises:
defining the mutual information between feature word F_i and label T_j in the label library as:

    MI(F_i, T_j) = \log \frac{\sum_{k=0}^{n} \log(e - 1 + Num_k)}{N \times p(F_i) \times p(T_j)}

where n is the number of samples in the pre-established sample library that contain feature word F_i and carry label T_j, Num_k is the number of times F_i occurs in the k-th sample that contains F_i and carries T_j, N is the total number of samples in the sample library, p(F_i) is the total number of times F_i occurs in all samples of the sample library divided by the total number of occurrences of all feature words in all samples of the sample library, and p(T_j) is the number of samples carrying label T_j divided by the total number of samples in the sample library.
3. A text feature extraction method, characterized in that the method comprises:
performing word segmentation on a target document to obtain all feature words appearing in the target document;
determining the weight of each feature word in the target document with respect to each label, and summing the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label;
according to the total weight of each label, determining from the labels a target label as the text feature of the target document;
wherein determining the weight of each feature word in the target document with respect to each label comprises:
determining the weight of feature word F_i with respect to label T_j according to the mutual information MI(F_i, T_j) between F_i and T_j, the number of times TF(F_i) that F_i occurs in the target document, and the importance degree IDF(F_i) of F_i, wherein the more samples in the pre-established sample library contain F_i, the lower the importance degree IDF(F_i) of F_i.
4. The method according to claim 3, characterized in that determining the weight of feature word F_i with respect to label T_j according to the mutual information MI(F_i, T_j), the number of times TF(F_i) that F_i occurs in the target document and the importance degree IDF(F_i) of F_i comprises:
defining the weight p(F_i, T_j) of feature word F_i with respect to label T_j as:

    p(F_i, T_j) = MI(F_i, T_j) × TF(F_i) × IDF(F_i);

and summing the weights of all feature words in the target document with respect to the same label to obtain the total weight with respect to that label comprises:
defining the total weight p(F, T_j) of the set F of all feature words in the target document with respect to label T_j as:

    p(F, T_j) = \sum_{i=0}^{m} MI(F_i, T_j) \times TF(F_i) \times IDF(F_i)

where m is the number of feature words in the target document.
5. The method according to claim 4, characterized in that the importance degree IDF(F_i) of feature word F_i is:

    IDF(F_i) = \log \left( 1 + \frac{N}{N_{F_i}} \right)

where N is the total number of samples in the sample library and N_{F_i} is the number of samples in the sample library in which feature word F_i appears.
6. The method according to claim 3, 4 or 5, characterized in that the mutual information MI(F_i, T_j) between feature word F_i and label T_j is:

    MI(F_i, T_j) = \log \frac{\sum_{k=0}^{n} \log(e - 1 + Num_k)}{N \times p(F_i) \times p(T_j)}

where n is the number of samples in the sample library that contain feature word F_i and carry label T_j, Num_k is the number of times F_i occurs in the k-th sample that contains F_i and carries T_j, N is the total number of samples in the sample library, p(F_i) is the total number of times F_i occurs in all samples of the sample library divided by the total number of occurrences of all feature words in all samples of the sample library, and p(T_j) is the number of samples carrying label T_j divided by the total number of samples in the sample library.
7. A text feature extraction device, characterized in that the device comprises a mutual information determination module and a text feature extraction module;
the mutual information determination module is configured to, for a feature word F_i in a feature word library, determine the mutual information between F_i and each label in a label library according to the number of times F_i occurs in the samples of a pre-established sample library that contain F_i and the labels carried by those samples;
the text feature extraction module is configured to perform word segmentation on a target document to obtain all feature words appearing in the target document; determine, based on the mutual information between each feature word in the target document and each label, the weight of each feature word in the target document with respect to each label; sum the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label; and, according to the total weight of each label, determine from the labels a target label as the text feature of the target document.
8. The device according to claim 7, characterized in that
the mutual information determination module is configured to define the mutual information MI(F_i, T_j) between feature word F_i and label T_j as:

    MI(F_i, T_j) = \log \frac{\sum_{k=0}^{n} \log(e - 1 + Num_k)}{N \times p(F_i) \times p(T_j)}

where n is the number of samples in the sample library that contain feature word F_i and carry label T_j, Num_k is the number of times F_i occurs in the k-th sample that contains F_i and carries T_j, N is the total number of samples in the sample library, p(F_i) is the total number of times F_i occurs in all samples of the sample library divided by the total number of occurrences of all feature words in all samples of the sample library, and p(T_j) is the number of samples carrying label T_j divided by the total number of samples in the sample library.
9. A text feature extraction device, characterized in that the device comprises a word segmentation module, a weight determination module and a text feature extraction module;
the word segmentation module is configured to perform word segmentation on a target document to obtain all feature words appearing in the target document;
the weight determination module is configured to determine the weight of each feature word in the target document with respect to each label, and to sum the weights of all feature words in the target document with respect to the same label to obtain the total weight of all feature words in the target document with respect to that label;
the text feature extraction module is configured to determine, according to the total weight of each label, a target label from the labels as the text feature of the target document;
wherein the weight determination module is configured to determine the weight of feature word F_i with respect to label T_j according to the mutual information MI(F_i, T_j) between F_i and T_j, the number of times TF(F_i) that F_i occurs in the target document, and the importance degree IDF(F_i) of F_i, wherein the more samples in the pre-established sample library contain F_i, the lower the importance degree IDF(F_i) of F_i.
10. The device according to claim 9, characterized in that
the weight determination module is configured to determine the weight p(F_i, T_j) of feature word F_i with respect to label T_j according to p(F_i, T_j) = MI(F_i, T_j) × TF(F_i) × IDF(F_i), and to determine the total weight p(F, T_j) of the set F of all feature words in the target document with respect to label T_j according to:

    p(F, T_j) = \sum_{i=0}^{m} MI(F_i, T_j) \times TF(F_i) \times IDF(F_i)

where m is the number of feature words in the target document.
11. The device according to claim 10, characterized in that the importance degree IDF(F_i) of feature word F_i is:

    IDF(F_i) = \log \left( 1 + \frac{N}{N_{F_i}} \right)

where N is the total number of samples in the sample library and N_{F_i} is the number of samples in the sample library in which feature word F_i appears.
12. The device according to claim 9, 10 or 11, characterized in that
the weight determination module is configured to take the mutual information MI(F_i, T_j) between feature word F_i and label T_j to be:

    MI(F_i, T_j) = \log \frac{\sum_{k=0}^{n} \log(e - 1 + Num_k)}{N \times p(F_i) \times p(T_j)}

where n is the number of samples in the sample library that contain feature word F_i and carry label T_j, Num_k is the number of times F_i occurs in the k-th sample that contains F_i and carries T_j, N is the total number of samples in the sample library, p(F_i) is the total number of times F_i occurs in all samples of the sample library divided by the total number of occurrences of all feature words in all samples of the sample library, and p(T_j) is the number of samples carrying label T_j divided by the total number of samples in the sample library.
CN201210419624.8A (priority date 2012-10-29, filed 2012-10-29): Textual feature extracting method and device. Pending. Published as CN103793385A (en).

Priority Applications (1)

Application Number: CN201210419624.8A
Priority Date: 2012-10-29
Filing Date: 2012-10-29
Title: Textual feature extracting method and device

Publications (1)

Publication Number: CN103793385A
Publication Date: 2014-05-14

Family ID: 50669070

Family Applications (1)

Application Number: CN201210419624.8A (pending, published as CN103793385A (en))
Title: Textual feature extracting method and device

Country Status (1)

CN: CN103793385A (en)

Cited By (6)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN105677677A * | 2014-11-20 | 2016-06-15 | 阿里巴巴集团控股有限公司 | Information classification and device
CN105701084A * | 2015-12-28 | 2016-06-22 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Characteristic extraction method of text classification on the basis of mutual information
CN107562928A * | 2017-09-15 | 2018-01-09 | 南京大学 | A kind of CCMI text feature selections method
CN107562928B | 2017-09-15 | 2019-11-15 | 南京大学 | A kind of CCMI text feature selection method
CN114331766A * | 2022-01-05 | 2022-04-12 | 中国科学技术信息研究所 | Method and device for determining patent technology core degree, electronic equipment and storage medium
CN114331766B | 2022-01-05 | 2022-07-08 | 中国科学技术信息研究所 | Method and device for determining patent technology core degree, electronic equipment and storage medium


Legal Events

C06 / PB01: Publication
C02 / WD01: Deemed withdrawal of patent application after publication (patent law 2001); invention patent application deemed withdrawn after publication

Application publication date: 2014-05-14