CN108090088A - Feature extraction method and device - Google Patents
Feature extraction method and device
- Publication number: CN108090088A (application CN201611042411.2A)
- Authority: CN (China)
- Prior art keywords: sample, word, candidate feature, feature word, sample size
- Prior art date: 2016-11-23
- Legal status: Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a feature extraction method. First, the candidate feature words in a sample set are determined. For each candidate feature word, the method counts the four sample quantities in the sample set corresponding to the four combinations of the two conditions "contains the candidate feature word" and "belongs to the target category". The frequency of the candidate feature word is then obtained, and a chi-square value is calculated from the four sample quantities and the frequency. Finally, the words to be trained are determined according to the chi-square values and trained with a preset classification algorithm to obtain the features contained in the target category. As described above, this feature extraction method adds the frequency of the feature word to the chi-square calculation. The frequency is the average number of occurrences of the candidate feature word per sample; in other words, it increases the weight of samples that contain the candidate feature word many times and decreases the weight of samples that contain it few times, thereby improving the accuracy of feature extraction.
Description
Technical field
The present invention relates to the field of computers, and in particular to a feature extraction method and device.
Background
Classification means labeling objects according to some standard and then grouping them by label to distinguish them; a classification algorithm is an algorithm that completes this process automatically. Feature extraction selects reasonable features for the classification algorithm to learn from, and the extracted features directly affect the training of the classification model, so improving the accuracy of feature extraction is essential.
Sentiment classification of articles performs sentiment classification on data crawled from different platforms, for example WeChat public accounts, Weibo, news sites, forums, and web pages, where the sentiment categories may include positive, neutral, and negative. At present, chi-square feature extraction first segments the text samples with a 2-gram model, then calculates the chi-square value of each segmented word with the common chi-square formula, and selects the top 500 word groups by chi-square value as the extracted features. The chi-square value is a statistic used in non-parametric tests, mainly in non-parametric discrimination methods, and its role is to test the correlation of data; for example, if the significance of the chi-square value (Sig.) is less than 0.05, the two variables are significantly correlated. However, the accuracy of the features extracted in this way is relatively low.
Summary of the invention
In view of the above problems, the present application proposes a feature extraction method and device that overcome the above problems or at least partly solve them. The technical solution is as follows:
In a first aspect, the present application provides a feature extraction method, including:
obtaining the candidate feature words in a sample set;
for any one of the candidate feature words, separately counting a first sample quantity, a second sample quantity, a third sample quantity, and a fourth sample quantity in the sample set, where the first sample quantity is the number of samples that contain the candidate feature word and belong to the target category, the second sample quantity is the number of samples that do not contain the candidate feature word and belong to the target category, the third sample quantity is the number of samples that contain the candidate feature word and do not belong to the target category, and the fourth sample quantity is the number of samples that neither contain the candidate feature word nor belong to the target category;
obtaining the frequency with which the candidate feature word occurs in the samples belonging to the target category;
calculating the chi-square value between the candidate feature word and the target category from the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency;
determining the words to be trained for the target category according to the chi-square values; and
training the words to be trained with a preset classification algorithm to obtain the features contained in the target category.
Optionally, obtaining the frequency with which the candidate feature word occurs in the samples belonging to the target category includes:
counting the number of occurrences of the candidate feature word in the samples belonging to the target category and the total number of samples contained in the target category; and
calculating the ratio of that number to the total number of samples contained in the target category to obtain the frequency.
Optionally, calculating the chi-square value between the candidate feature word and the target category from the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency includes:
calculating the chi-square value according to the formula
x²(t, I) = α(t, I) × N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)] when AD − BC > 0, and x²(t, I) = 0 otherwise;
where N is the total number of samples in the sample set, A is the first sample quantity, B is the second sample quantity, C is the third sample quantity, D is the fourth sample quantity, I is the target category, t is the candidate feature word, and α(t, I) is the frequency of the candidate feature word.
Optionally, obtaining the candidate feature words in the sample set includes:
adding specified stop words to the segmentation dictionary of a preset segmentation algorithm, segmenting the text content in the sample set according to the updated segmentation dictionary, and deleting the specified stop words contained in the text content to obtain the candidate feature words.
Optionally, obtaining the candidate feature words in the sample set includes:
adding specified candidate words to the segmentation dictionary of a preset segmentation algorithm, and segmenting the text content in the sample set according to the updated segmentation dictionary to obtain the candidate feature words, the candidate feature words containing the specified candidate words.
In a second aspect, the present application provides a feature extraction device, including:
a first acquisition module for obtaining the candidate feature words in a sample set;
a statistics module for, for any one of the candidate feature words, separately counting a first sample quantity, a second sample quantity, a third sample quantity, and a fourth sample quantity in the sample set, where the first sample quantity is the number of samples that contain the candidate feature word and belong to the target category, the second sample quantity is the number of samples that do not contain the candidate feature word and belong to the target category, the third sample quantity is the number of samples that contain the candidate feature word and do not belong to the target category, and the fourth sample quantity is the number of samples that neither contain the candidate feature word nor belong to the target category;
a second acquisition module for obtaining the frequency with which the candidate feature word occurs in the samples belonging to the target category;
a chi-square calculation module for calculating the chi-square value between the candidate feature word and the target category from the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency;
a words-to-be-trained determining module for determining the words to be trained for the target category according to the chi-square values; and
a feature training module for training the words to be trained with a preset classification algorithm to obtain the features contained in the target category.
Optionally, the second acquisition module includes:
a statistics submodule for counting the number of occurrences of the candidate feature word in the samples belonging to the target category and the total number of samples contained in the target category; and
a frequency calculation submodule for calculating the ratio of that number to the total number of samples to obtain the frequency.
Optionally, the chi-square calculation module is specifically configured to:
calculate the chi-square value according to the formula
x²(t, I) = α(t, I) × N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)] when AD − BC > 0, and x²(t, I) = 0 otherwise;
where N is the total number of samples in the sample set, A is the first sample quantity, B is the second sample quantity, C is the third sample quantity, D is the fourth sample quantity, I is the target category, t is the candidate feature word, and α(t, I) is the frequency of the candidate feature word.
Optionally, the first acquisition module includes:
a first acquisition submodule for adding specified stop words to the segmentation dictionary of a preset segmentation algorithm, segmenting the text content in the sample set according to the updated segmentation dictionary, and deleting the specified stop words contained in the text content to obtain the candidate feature words.
Optionally, the first acquisition module includes:
a second acquisition submodule for adding specified candidate words to the segmentation dictionary of a preset segmentation algorithm, and segmenting the text content in the sample set according to the updated segmentation dictionary to obtain the candidate feature words, the candidate feature words containing the specified candidate words.
Through the above technical solution, the feature extraction method provided by the invention first determines the candidate feature words in the sample set; it then counts the four sample quantities corresponding to the four combinations of the two conditions "contains the candidate feature word" and "belongs to the target category"; next it obtains the frequency of the candidate feature word, calculates the chi-square value from the four sample quantities and the frequency, and finally determines the features contained in the target category according to the chi-square values. As described above, this feature extraction method adds the frequency of the feature word to the chi-square calculation. The frequency is the average number of occurrences of the candidate feature word per sample; in other words, the method increases the weight of samples that contain the candidate feature word many times and decreases the weight of samples that contain it few times, thereby improving the accuracy of feature extraction.
The above description is only an overview of the technical solution of the present invention. In order that the technical means of the invention may be understood more clearly and practiced according to the content of the specification, and in order that the above and other objects, features, and advantages of the invention may be more comprehensible, specific embodiments of the invention are set out below.
Description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only for the purpose of showing the preferred embodiments and are not to be considered a limitation of the invention. The same reference numerals denote the same parts throughout the drawings. In the drawings:
Fig. 1 shows a flowchart of a feature extraction method according to an embodiment of the invention;
Fig. 2 shows a flowchart of another feature extraction method according to an embodiment of the invention;
Fig. 3 shows a block diagram of a feature extraction device according to an embodiment of the invention;
Fig. 4 shows a block diagram of another feature extraction device according to an embodiment of the invention;
Fig. 5 shows a block diagram of yet another feature extraction device according to an embodiment of the invention.
Specific embodiments
Exemplary embodiments of the disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be realized in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope conveyed completely to those skilled in the art.
Referring to Fig. 1, which shows a flowchart of a feature extraction method according to an embodiment of the invention, the method is applied in a terminal device or a server. As shown in Fig. 1, the method may include the following steps:
S110: obtain the candidate feature words in the sample set.
The sample set consists of samples whose classification categories are known and generally includes positive and negative samples. For example, if the category to be trained is class A, the positive samples are those that belong to class A and the negative samples are those that do not. The application is mainly used for feature extraction in text classification, so each sample is a text file.
The text content of a sample can be segmented with a segmentation algorithm, after which the candidate feature words are determined. Any existing segmentation algorithm suitable for Chinese may be selected; the application does not limit this.
S120: separately count the first sample quantity, the second sample quantity, the third sample quantity, and the fourth sample quantity in the sample set.
The sample quantities are classified as shown in Table 1:

Table 1

| | Contains feature X | Does not contain feature X |
|---|---|---|
| Belongs to class I | A | B |
| Does not belong to class I | C | D |

For example, with feature X as the candidate feature word and class I as the target category in Table 1, the first sample quantity is A, the number of samples in the sample set that contain feature X and belong to class I; the second sample quantity is B, the number of samples that do not contain feature X and belong to class I; the third sample quantity is C, the number of samples that contain feature X and do not belong to class I; and the fourth sample quantity is D, the number of samples that neither contain feature X nor belong to class I.
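As an illustrative sketch of this counting step (not taken from the patent), the following Python function assumes each sample is a pair of a token list and a class label; all names are assumptions:

```python
def contingency_counts(samples, feature_word, target_class):
    """Count the four sample quantities A, B, C, D of Table 1."""
    A = B = C = D = 0
    for tokens, label in samples:
        contains = feature_word in tokens
        in_class = (label == target_class)
        if contains and in_class:
            A += 1  # first sample quantity
        elif not contains and in_class:
            B += 1  # second sample quantity
        elif contains and not in_class:
            C += 1  # third sample quantity
        else:
            D += 1  # fourth sample quantity
    return A, B, C, D
```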
S130: obtain the frequency with which the candidate feature word occurs in the samples belonging to the target category.
The frequency of a candidate feature word is the ratio of the number of occurrences of the feature word to the total number of texts of the category. The calculation formula is as follows:

α(t, I) = n(t, I) / (A + B)    (formula 1)

where α(t, I) is the frequency between the candidate feature word t and class I. The denominator in formula 1 is the number of texts belonging to class I in Table 1, namely A + B, and the numerator n(t, I) is the number of occurrences of feature word t in the A samples that contain it.
The target category is the category currently being trained. In the application scenario of sentiment classification of automotive articles, with the three sentiment categories positive, neutral, and negative, the target category may be any one of the three.
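A minimal Python sketch of formula 1 under the same assumptions as above (samples as token-list and label pairs; names are illustrative):

```python
def frequency(samples, feature_word, target_class):
    """alpha(t, I): occurrences of the word in class-I samples divided by A + B."""
    occurrences = 0
    class_size = 0  # A + B, the number of samples belonging to class I
    for tokens, label in samples:
        if label == target_class:
            class_size += 1
            occurrences += tokens.count(feature_word)
    return occurrences / class_size if class_size else 0.0
```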
S140: calculate the chi-square value between the candidate feature word and the target category from the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency.

x²(t, I) = α(t, I) × N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)], when AD − BC > 0    (formula 2)

In formula 2, x²(t, I) denotes the chi-square value and N is the total number of samples; A is the first sample quantity, B is the second sample quantity, C is the third sample quantity, and D is the fourth sample quantity; α(t, I) denotes the frequency of feature t with respect to category I.
When AD − BC > 0, the candidate feature word is strongly correlated with the target category. When AD − BC ≤ 0, the candidate feature word is more strongly correlated with the non-target categories, so x²(t, I) = 0: setting the chi-square value to 0 excludes the influence of such candidate feature words on the target category.
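A minimal Python sketch of formula 2 as reconstructed above, including the zeroing rule for AD − BC ≤ 0; the function name is an assumption:

```python
def weighted_chi_square(A, B, C, D, alpha):
    """x^2(t, I) per formula 2, weighted by the frequency alpha(t, I)."""
    if A * D - B * C <= 0:
        return 0.0  # word correlated with the non-target categories: exclude
    N = A + B + C + D
    denominator = (A + B) * (C + D) * (A + C) * (B + D)
    if denominator == 0:
        return 0.0  # degenerate table: no usable signal
    return alpha * N * (A * D - B * C) ** 2 / denominator
```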
S150: determine the words to be trained for the target category according to the chi-square values.
After the chi-square values of all the candidate feature words have been calculated, they are sorted by value, and a preset number of candidate feature words are selected in descending order of chi-square value as the words to be trained for the category. The preset number can be chosen according to actual demand; for example, 500 may be chosen.
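The selection in S150 reduces to sorting and slicing; an illustrative sketch, assuming the chi-square values are held in a dict keyed by word:

```python
def select_words_to_train(chi_by_word, preset_quantity=500):
    """Keep the preset number of candidate words with the largest chi-square values."""
    ranked = sorted(chi_by_word.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:preset_quantity]]
```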
S160: train the words to be trained with the preset classification algorithm to obtain the features contained in the target category.
The selected words to be trained must be trained with the preset classification algorithm; training on the preset number of candidate feature words yields the features that truly characterize the attributes of the target category. For example, the preset classification algorithm may be the SVM (Support Vector Machine) algorithm; of course, other classification algorithms may also be used, and the application does not limit this.
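An illustrative sketch of S160 with scikit-learn, assuming the preset classification algorithm is a linear SVM and the texts have already been segmented into space-joined tokens; the patent names SVM only as one possible algorithm:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def train_features(segmented_texts, labels, words_to_train):
    # Restrict the bag-of-words space to the selected candidate feature words.
    vectorizer = CountVectorizer(vocabulary=words_to_train)
    X = vectorizer.fit_transform(segmented_texts)
    classifier = LinearSVC()
    classifier.fit(X, labels)
    return vectorizer, classifier
```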
In the feature extraction method provided by this embodiment, the candidate feature words in the sample set are first determined; for each candidate feature word, the sample quantities corresponding to the four combinations of the two conditions "contains the candidate feature word" and "belongs to the target category" are counted; the frequency of the candidate feature word is then obtained; the chi-square value is calculated from the four sample quantities and the frequency; finally, the words to be trained are determined according to the chi-square values and trained with the preset classification algorithm to obtain the features contained in the target category. As described above, this feature extraction method adds the frequency of the feature word when evaluating a candidate feature word. The frequency is the average number of occurrences of the candidate feature word per sample; in other words, it increases the weight of samples that contain the candidate feature word many times and decreases the weight of samples that contain it few times, thereby improving the accuracy of feature extraction.
Referring to Fig. 2, which shows a flowchart of another feature extraction method according to an embodiment of the invention, this method introduces stop words when selecting the candidate feature words. As shown in Fig. 2, the method includes the following steps:
S210: add the specified stop words and the specified candidate words to the segmentation dictionary of the preset segmentation algorithm to obtain the updated segmentation dictionary.
The specified stop words are words without specific semantics, for example modal particles, measure words, and conjunctions, such as "of", "a", and "but".
In one possible implementation of the invention, texts about vehicles are classified. In this application scenario, the specified candidate words may be vehicle-model words, for example Passat or Lavida; whenever a vehicle-model word occurs in a text sample, the text is segmented according to the vehicle-model word, which is kept whole and determined to be a candidate feature word.
S220: segment the text content in the sample set according to the updated segmentation dictionary to obtain the candidate feature words.
The candidate feature words do not contain the specified stop words, but they do contain the specified candidate words.
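An illustrative sketch of S210 and S220 using the jieba segmenter as the preset segmentation algorithm (an assumption; the patent does not name one), with illustrative stop-word and candidate-word lists:

```python
import jieba

SPECIFIED_STOP_WORDS = {"的", "一个", "但是"}   # "of", "a", "but"
SPECIFIED_CANDIDATE_WORDS = {"帕萨特", "朗逸"}  # Passat, Lavida

# Add the specified candidate words so each vehicle model stays one token.
for word in SPECIFIED_CANDIDATE_WORDS:
    jieba.add_word(word)

def candidate_feature_words(text):
    """Segment the text and delete the specified stop words."""
    return [t for t in jieba.lcut(text) if t not in SPECIFIED_STOP_WORDS]
```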
S230: for any one candidate feature word, separately count the first sample quantity, the second sample quantity, the third sample quantity, and the fourth sample quantity in the sample set.
Here the first sample quantity is the number of samples that contain the candidate feature word and belong to the target category, the second sample quantity is the number of samples that do not contain the candidate feature word and belong to the target category, the third sample quantity is the number of samples that contain the candidate feature word and do not belong to the target category, and the fourth sample quantity is the number of samples that neither contain the candidate feature word nor belong to the target category.
S240: count the number of occurrences of the candidate feature word in the samples belonging to the target category and the total number of samples contained in the target category.
This step counts the occurrences of the candidate feature word in the samples of the first sample quantity and, at the same time, counts the total number of samples contained in the target category.
S250: divide that number by the total number of samples to obtain the frequency of the candidate feature word.
For example, if the candidate feature word occurs n times in the samples contained in the target category and the total number of samples contained in the target category is A + B (see Table 1), the frequency of the candidate feature word is n / (A + B).
S260: calculate the chi-square value of the candidate feature word according to formula 2, which is not repeated here.
Steps S230 to S260 are repeated for each candidate feature word until the chi-square values of all the candidate feature words in the sample set have been obtained.
S270: select a preset number of candidate feature words in descending order of chi-square value, and determine them as the words to be trained for the target category.
Specifically, after the chi-square values of all the candidate feature words in the sample set have been calculated, they are sorted in descending order, and the first preset number of candidate feature words are determined as the words to be trained.
S280: train the words to be trained with the preset classification algorithm to obtain the features contained in the target category.
In the feature extraction method provided by this embodiment, when the candidate feature words in the sample set are determined, the specified stop words and the specified candidate words are added to the segmentation dictionary; when the text samples are segmented, the words matching the specified stop words are deleted from the text, and the segmentation result is the candidate feature words. The frequency of each candidate feature word is then calculated, and the chi-square value is calculated with the frequency of the candidate feature word; the words to be trained are determined according to the chi-square values, and finally the preset classification algorithm trains the words to be trained to obtain the features contained in the target category. Because the method adds the specified stop words to the segmentation dictionary, the stop words occurring in the text samples are deleted directly during segmentation, which reduces the number of candidate feature words and thus improves the efficiency of selecting the words to be trained. Moreover, adding the specified candidate words to the segmentation dictionary keeps the character strings matching a specified candidate word together as single words, which improves the accuracy of segmentation and in turn the accuracy of feature extraction.
Corresponding to the above feature extraction method embodiments, the invention also provides feature extraction device embodiments.
Referring to Fig. 3, which shows a block diagram of a feature extraction device according to an embodiment of the invention, the device can be applied in a terminal device or a server. As shown in Fig. 3, the device may include: a first acquisition module 310, a statistics module 320, a second acquisition module 330, a chi-square calculation module 340, a words-to-be-trained determining module 350, and a feature training module 360.
The first acquisition module 310 is configured to obtain the candidate feature words in the sample set. The text in the sample set is segmented with a segmentation algorithm, and the segmentation result is the candidate feature words.
The statistics module 320 is configured to, for any one candidate feature word, separately count the first sample quantity, the second sample quantity, the third sample quantity, and the fourth sample quantity in the sample set, that is, the numbers of samples matching the four cases shown in Table 1. The first sample quantity is the number of samples that contain the candidate feature word and belong to the target category; the second sample quantity is the number of samples that do not contain the candidate feature word and belong to the target category; the third sample quantity is the number of samples that contain the candidate feature word and do not belong to the target category; and the fourth sample quantity is the number of samples that neither contain the candidate feature word nor belong to the target category.
The second acquisition module 330 is configured to obtain the frequency with which the candidate feature word occurs in the samples belonging to the target category. The frequency of the candidate feature word is calculated with formula 1 and is not repeated here.
The chi-square calculation module 340 is configured to calculate the chi-square value between the candidate feature word and the target category from the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency. The chi-square value is calculated with formula 2 and is not repeated here.
The words-to-be-trained determining module 350 is configured to determine the words to be trained for the target category according to the chi-square values. After the chi-square values of all the candidate feature words have been calculated, they are sorted by value, and a preset number of candidate feature words are selected in descending order of chi-square value as the words to be trained for the category. The preset number can be chosen according to actual demand; for example, 500 may be chosen.
The feature training module 360 is configured to train the words to be trained with the preset classification algorithm to obtain the features contained in the target category. The selected words to be trained must be trained with the preset classification algorithm before the features contained in the target category are obtained. For example, the preset classification algorithm may be the SVM (Support Vector Machine) algorithm; of course, other classification algorithms may also be used, and the application does not limit this.
In the feature extraction device provided by this embodiment, the first acquisition module determines the candidate feature words in the sample set; for any one candidate feature word, the statistics module counts the sample quantities corresponding to the four combinations of the two conditions "contains the candidate feature word" and "belongs to the target category"; the second acquisition module then obtains the frequency of the candidate feature word; the chi-square calculation module calculates the chi-square value from the four sample quantities and the frequency; finally, the words-to-be-trained determining module determines the words to be trained according to the chi-square values, and the feature training module trains them with the preset classification algorithm to obtain the features contained in the target category. As described above, the device adds the frequency of the feature word when evaluating a candidate feature word. The frequency is the average number of occurrences of the candidate feature word per sample; in other words, it increases the weight of samples that contain the candidate feature word many times and decreases the weight of samples that contain it few times, thereby improving the accuracy of feature extraction.
Referring to Fig. 4, which shows a block diagram of another feature extraction device according to an embodiment of the invention, this embodiment refines some of the modules of the embodiment shown in Fig. 3. As shown in Fig. 4, the device may include: a first acquisition module 410, a statistics module 420, a statistics submodule 430, a frequency calculation submodule 440, a chi-square calculation module 450, a words-to-be-trained determining module 460, and a feature training module 470.
The first acquisition module 410 is configured to add the specified stop words and the specified candidate words to the segmentation dictionary of the preset segmentation algorithm, segment the text content in the sample set according to the updated segmentation dictionary, delete the specified stop words contained in the text content, and group the character strings matching the specified candidate words into single words to obtain the candidate feature words.
In this embodiment, the specified stop words and the specified candidate words are added to the original segmentation dictionary of the segmentation algorithm. If words matching the specified stop words occur in a text sample, they are deleted; if character strings matching the specified candidate words occur, they are segmented as single words. In this way the total number of candidate feature words is reduced and the accuracy of the candidate feature words is improved.
The statistics module 420 is configured to, for any one of the candidate feature words, separately count the first sample quantity, the second sample quantity, the third sample quantity, and the fourth sample quantity in the sample set.
Here the first sample quantity is the number of samples that contain the candidate feature word and belong to the target category, the second sample quantity is the number of samples that do not contain the candidate feature word and belong to the target category, the third sample quantity is the number of samples that contain the candidate feature word and do not belong to the target category, and the fourth sample quantity is the number of samples that neither contain the candidate feature word nor belong to the target category.
The statistics submodule 430 is configured to count the number of occurrences of the candidate feature word in the samples belonging to the target category and the total number of samples contained in the target category. The statistics submodule counts the occurrences of the candidate feature word in the samples of the first sample quantity and, at the same time, counts the total number of samples contained in the target category.
The frequency calculation submodule 440 is configured to calculate the ratio of that number to the total number of samples to obtain the frequency. The frequency calculation submodule calculates with formula 1, which is not repeated here.
The chi-square calculation module 450 is configured to calculate the chi-square value between the candidate feature word and the target category from the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency. The chi-square value is calculated with formula 2, which is not repeated here.
The words-to-be-trained determining module 460 is configured to determine the words to be trained for the target category according to the chi-square values. Specifically, after the chi-square values of all the candidate feature words in the sample set have been calculated, they are sorted in descending order, and the first preset number of candidate feature words are determined as the words to be trained.
The feature training module 470 is configured to train the words to be trained with the preset classification algorithm to obtain the features contained in the target category. For example, the preset classification algorithm may be the SVM algorithm; training on the preset number of candidate feature words yields the features that truly characterize the attributes of the target category.
In the feature extraction device provided by this embodiment, when the candidate feature words in the sample set are determined, the specified stop words and the specified candidate words are added to the segmentation dictionary; when the text samples are segmented, the words matching the specified stop words are deleted from the text, and the segmentation result is the candidate feature words. The frequency of each candidate feature word is then calculated, and the chi-square value is calculated with the frequency of the candidate feature word; the words to be trained are determined according to the chi-square values, and finally the preset classification algorithm trains the words to be trained to obtain the features contained in the target category. Because the device adds the specified stop words to the segmentation dictionary, the stop words occurring in the text samples are deleted directly during segmentation, which reduces the number of candidate feature words and thus improves the efficiency of selecting the words to be trained. Moreover, adding the specified candidate words to the segmentation dictionary keeps the character strings matching a specified candidate word together as single words, which improves the accuracy of segmentation and in turn the accuracy of feature extraction.
Referring to Fig. 5, which shows a block diagram of yet another feature extraction device according to an embodiment of the invention, the device includes a processor 510 and a memory 520.
The first acquisition module 310, statistics module 320, second acquisition module 330, chi-square calculation module 340, words-to-be-trained determining module 350, feature training module 360, and so on described above are all stored in the memory 520 as program units, and the processor 510 executes the program units stored in the memory 520 to realize the corresponding functions.
The processor 510 contains a kernel, and the kernel fetches the corresponding program units from the memory 520. One or more kernels can be configured, and the accuracy of feature extraction is improved by adjusting the kernel parameters.
The memory 520 may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM); the memory 520 includes at least one memory chip.
In the feature extraction device provided by this embodiment, the processor fetches the corresponding program units from the memory to complete the following process: determining the candidate feature words in the sample set; for each candidate feature word, counting the sample quantities corresponding to the four combinations of the two conditions "contains the candidate feature word" and "belongs to the target category"; obtaining the frequency of the candidate feature word; calculating the chi-square value from the four sample quantities and the frequency; and finally determining the words to be trained according to the chi-square values and training them with the preset classification algorithm to obtain the features contained in the target category. This feature extraction process adds the frequency of the feature word when evaluating a candidate feature word. The frequency is the average number of occurrences of the candidate feature word per sample; in other words, it increases the weight of samples that contain the candidate feature word many times and decreases the weight of samples that contain it few times, thereby improving the accuracy of feature extraction.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code that initializes the following method steps:
obtaining the candidate feature words in a sample set;
for any one of the candidate feature words, separately counting a first sample quantity, a second sample quantity, a third sample quantity, and a fourth sample quantity in the sample set, where the first sample quantity is the number of samples that contain the candidate feature word and belong to the target category, the second sample quantity is the number of samples that do not contain the candidate feature word and belong to the target category, the third sample quantity is the number of samples that contain the candidate feature word and do not belong to the target category, and the fourth sample quantity is the number of samples that neither contain the candidate feature word nor belong to the target category;
obtaining the frequency with which the candidate feature word occurs in the samples belonging to the target category;
calculating the chi-square value between the candidate feature word and the target category from the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency;
determining the words to be trained for the target category according to the chi-square values; and
training the words to be trained with a preset classification algorithm to obtain the features contained in the target category.
Those skilled in the art should understand that the embodiments of the application may be provided as a method, a system, or a computer program product. Therefore, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The application is described with reference to flowcharts and/or block diagrams of the method, the device (system), and the computer program product according to the embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be realized by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that realizes the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operating steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, a network interface, and memory.
The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined here, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
The above are only embodiments of the application and are not intended to limit the application. Various modifications and variations of the application are possible for those skilled in the art. Any modification, equivalent substitution, improvement, and the like made within the spirit and principle of the application shall be included within the scope of the claims of the application.
Claims (10)
1. A feature extraction method, characterized by including:
obtaining the candidate feature words in a sample set;
for any one of the candidate feature words, separately counting a first sample quantity, a second sample quantity, a third sample quantity, and a fourth sample quantity in the sample set, where the first sample quantity is the number of samples that contain the candidate feature word and belong to the target category, the second sample quantity is the number of samples that do not contain the candidate feature word and belong to the target category, the third sample quantity is the number of samples that contain the candidate feature word and do not belong to the target category, and the fourth sample quantity is the number of samples that neither contain the candidate feature word nor belong to the target category;
obtaining the frequency with which the candidate feature word occurs in the samples belonging to the target category;
calculating the chi-square value between the candidate feature word and the target category from the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency;
determining the words to be trained for the target category according to the chi-square values; and
training the words to be trained with a preset classification algorithm to obtain the features contained in the target category.
2. The method according to claim 1, characterized in that obtaining the frequency with which the candidate feature word occurs in the samples belonging to the target category includes:
counting the number of occurrences of the candidate feature word in the samples belonging to the target category and the total number of samples contained in the target category; and
calculating the ratio of that number to the total number of samples contained in the target category to obtain the frequency.
3. The method according to claim 1, characterized in that calculating the chi-square value between the candidate feature word and the target category from the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency includes:
calculating the chi-square value according to the formula
x²(t, I) = α(t, I) × N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)] when AD − BC > 0, and x²(t, I) = 0 otherwise;
where N is the total number of samples in the sample set, A is the first sample quantity, B is the second sample quantity, C is the third sample quantity, D is the fourth sample quantity, I is the target category, t is the candidate feature word, and α(t, I) is the frequency of the candidate feature word.
4. The method according to any one of claims 1 to 3, characterized in that obtaining the candidate feature words in the sample set includes:
adding specified stop words to the segmentation dictionary of a preset segmentation algorithm, segmenting the text content in the sample set according to the updated segmentation dictionary, and deleting the specified stop words contained in the text content to obtain the candidate feature words.
5. The method according to any one of claims 1 to 3, characterized in that obtaining the candidate feature words in the sample set includes:
adding specified candidate words to the segmentation dictionary of a preset segmentation algorithm, and segmenting the text content in the sample set according to the updated segmentation dictionary to obtain the candidate feature words, the candidate feature words containing the specified candidate words.
6. A feature extraction device, characterized by including:
a first acquisition module for obtaining the candidate feature words in a sample set;
a statistics module for, for any one of the candidate feature words, separately counting a first sample quantity, a second sample quantity, a third sample quantity, and a fourth sample quantity in the sample set, where the first sample quantity is the number of samples that contain the candidate feature word and belong to the target category, the second sample quantity is the number of samples that do not contain the candidate feature word and belong to the target category, the third sample quantity is the number of samples that contain the candidate feature word and do not belong to the target category, and the fourth sample quantity is the number of samples that neither contain the candidate feature word nor belong to the target category;
a second acquisition module for obtaining the frequency with which the candidate feature word occurs in the samples belonging to the target category;
a chi-square calculation module for calculating the chi-square value between the candidate feature word and the target category from the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency;
a words-to-be-trained determining module for determining the words to be trained for the target category according to the chi-square values; and
a feature training module for training the words to be trained with a preset classification algorithm to obtain the features contained in the target category.
7. The device according to claim 6, characterized in that the second acquisition module includes:
a statistics submodule for counting the number of occurrences of the candidate feature word in the samples belonging to the target category and the total number of samples contained in the target category; and
a frequency calculation submodule for calculating the ratio of that number to the total number of samples to obtain the frequency.
8. The device according to claim 6, characterized in that the chi-square calculation module is specifically configured to:
calculate the chi-square value according to the formula
x²(t, I) = α(t, I) × N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)] when AD − BC > 0, and x²(t, I) = 0 otherwise;
where N is the total number of samples in the sample set, A is the first sample quantity, B is the second sample quantity, C is the third sample quantity, D is the fourth sample quantity, I is the target category, t is the candidate feature word, and α(t, I) is the frequency of the candidate feature word.
9. The device according to any one of claims 6 to 8, characterized in that the first acquisition module includes:
a first acquisition submodule for adding specified stop words to the segmentation dictionary of a preset segmentation algorithm, segmenting the text content in the sample set according to the updated segmentation dictionary, and deleting the specified stop words contained in the text content to obtain the candidate feature words.
10. The device according to any one of claims 6 to 8, characterized in that the first acquisition module includes:
a second acquisition submodule for adding specified candidate words to the segmentation dictionary of a preset segmentation algorithm, and segmenting the text content in the sample set according to the updated segmentation dictionary to obtain the candidate feature words, the candidate feature words containing the specified candidate words.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201611042411.2A | 2016-11-23 | 2016-11-23 | Feature extraction method and device |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN108090088A | 2018-05-29 |

Family
ID=62171034

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201611042411.2A (CN108090088A, pending) | Feature extraction method and device | 2016-11-23 | 2016-11-23 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN108090088A (en) |
Patent Citations (4)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR101429397B1 | 2013-04-11 | 2014-08-14 | 전북대학교산학협력단 | Method and system for extracting core events based on message analysis in social network service |
| CN103886108A | 2014-04-13 | 2014-06-25 | 北京工业大学 | Feature selection and weight calculation method of imbalance text set |
| CN103914551A | 2014-04-13 | 2014-07-09 | 北京工业大学 | Method for extending semantic information of microblogs and selecting features thereof |
| CN105512311A | 2015-12-14 | 2016-04-20 | 北京工业大学 | Chi square statistic based self-adaption feature selection method |

Cited By (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110543577A | 2019-09-03 | 2019-12-06 | 南京工程学院 | Picture retrieval method for underwater operation scene |
| CN114443849A | 2022-02-09 | 2022-05-06 | 北京百度网讯科技有限公司 | Method and device for selecting marked sample, electronic equipment and storage medium |
| CN114443849B | 2022-02-09 | 2023-10-27 | 北京百度网讯科技有限公司 | Labeling sample selection method and device, electronic equipment and storage medium |
Similar Documents

| Publication | Title |
|---|---|
| CN107291723A | The method and apparatus of web page text classification, the method and apparatus of web page text identification |
| CN104915327A | Text information processing method and device |
| CN113052577B | Class speculation method and system for block chain digital currency virtual address |
| CN107145560A | A kind of file classification method and device |
| CN109271517A | IG TF-IDF Text eigenvector generates and file classification method |
| CN106960040A | A kind of URL classification determines method and device |
| CN110019785A | A kind of file classification method and device |
| CN108241662A | The optimization method and device of data mark |
| CN111475651B | Text classification method, computing device and computer storage medium |
| CN110502902A | A kind of vulnerability classification method, device and equipment |
| CN103246686A | Method and device for text classification, and method and device for characteristic processing of text classification |
| CN108090088A | Feature extraction method and device |
| CN114896398A | Text classification system and method based on feature selection |
| CN108229507A | Data classification method and device |
| US9053434B2 | Determining an obverse weight |
| CN114722198A | Method, system and related device for determining product classification code |
| CN105787004A | Text classification method and device |
| CN116029280A | Method, device, computing equipment and storage medium for extracting key information of document |
| CN108228869A | The method for building up and device of a kind of textual classification model |
| CN111241269B | Short message text classification method and device, electronic equipment and storage medium |
| CN104699707A | Data clustering method and device |
| CN110688481A | Text classification feature selection method based on chi-square statistic and IDF |
| CN110807159A | Data marking method and device, storage medium and electronic equipment |
| CN108108371A | A kind of file classification method and device |
| CN108647335A | Internet public opinion analysis method and apparatus |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB02 | Change of applicant information | Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing; applicant after: Beijing Guoshuang Technology Co.,Ltd. Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu, Haidian District, Beijing; applicant before: Beijing Guoshuang Technology Co.,Ltd. |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 2018-05-29 |