CN108090088A - Feature extraction method and device

Info

Publication number
CN108090088A
CN108090088A (Application No. CN201611042411.2A)
Authority
CN
China
Prior art keywords
sample
word
candidate feature
feature word
sample size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611042411.2A
Other languages
Chinese (zh)
Inventor
雷婷睿 (Lei Tingrui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201611042411.2A priority Critical patent/CN108090088A/en
Publication of CN108090088A publication Critical patent/CN108090088A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a feature extraction method. First, the candidate feature words in a sample set are determined. For each candidate feature word, the numbers of samples corresponding to the four combinations of the two conditions "contains the candidate feature word or not" and "belongs to the target category or not" are counted separately. Then the frequency of the candidate feature word is obtained, and a chi-square value is calculated from the four sample quantities and the frequency. Finally, the feature words to be trained are determined according to the chi-square values and trained with a preset classification algorithm to obtain the features of the target category. As described above, this feature extraction method incorporates the frequency of a feature word when calculating its chi-square value. The frequency is the average number of occurrences of the candidate feature word per sample; in other words, it increases the weight of samples in which the candidate feature word occurs more often and decreases the weight of samples in which it occurs less often, thereby improving the accuracy of feature extraction.

Description

Feature extraction method and device
Technical field
The present invention relates to the field of computers, and in particular to a feature extraction method and device.
Background technology
Classification means labeling objects according to some standard and then grouping them by label to distinguish them; a classification algorithm is an algorithm that completes this process automatically. Feature extraction selects reasonable features for a classification algorithm to learn from, and the extracted features directly affect the training of the classification model; improving the accuracy of feature extraction is therefore essential.
Sentiment classification of articles classifies data crawled from different platforms, for example WeChat public accounts, Weibo, news sites, forums, and web pages, where the sentiment categories may include positive, neutral, and negative. At present, chi-square feature extraction first segments the text samples with a 2-gram model, then calculates the chi-square value of each token with the common chi-square formula, and selects the top 500 tokens by chi-square value as the extracted features. The chi-square value is a statistic in non-parametric testing, mainly used in non-parametric discrimination methods; its role is to test the correlation of data. For example, if the significance of the chi-square value (i.e., SIG.) is less than 0.05, the two variables are significantly correlated. However, the accuracy of the features extracted in this way is relatively low.
Summary of the invention
In view of the above problems, the present application proposes a feature extraction method and device that overcome, or at least partially solve, the above problems. The technical solution is as follows:
In a first aspect, the present application provides a feature extraction method, including:
obtaining the candidate feature words in a sample set;
for any one of the candidate feature words, separately counting a first sample quantity, a second sample quantity, a third sample quantity, and a fourth sample quantity in the sample set, where the first sample quantity is the number of samples that contain the candidate feature word and belong to the target category, the second sample quantity is the number of samples that do not contain the candidate feature word and belong to the target category, the third sample quantity is the number of samples that contain the candidate feature word and do not belong to the target category, and the fourth sample quantity is the number of samples that neither contain the candidate feature word nor belong to the target category;
obtaining the frequency with which the candidate feature word occurs in the samples belonging to the target category;
calculating the chi-square value between the candidate feature word and the target category according to the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency;
determining the feature words to be trained for the target category according to the chi-square values; and
training the feature words to be trained with a preset classification algorithm to obtain the features of the target category.
Optionally, obtaining the frequency with which the candidate feature word occurs in the samples belonging to the target category includes:
counting the number of occurrences of the candidate feature word in the samples belonging to the target category, and the total number of samples in the target category; and
calculating the ratio of that number to the total number of samples in the target category to obtain the frequency.
Optionally, calculating the chi-square value between the candidate feature word and the target category according to the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency includes:
calculating the chi-square value according to the formula
χ²(t, I) = α(t, I) · N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)];
where N is the total number of samples in the sample set, A is the first sample quantity, B is the second sample quantity, C is the third sample quantity, and D is the fourth sample quantity; I is the target category, t is the candidate feature word, and α(t, I) is the frequency of the candidate feature word.
Optionally, obtaining the candidate feature words in the sample set includes:
adding specified stop words to the segmentation dictionary of a preset segmentation algorithm, performing word segmentation on the text content in the sample set according to the updated segmentation dictionary, and deleting the specified stop words contained in the text content to obtain the candidate feature words.
Optionally, obtaining the candidate feature words in the sample set includes:
adding specified candidate words to the segmentation dictionary of a preset segmentation algorithm, and performing word segmentation on the text content in the sample set according to the updated segmentation dictionary to obtain the candidate feature words, where the candidate feature words contain the specified candidate words.
In a second aspect, the present application provides a feature extraction device, including:
a first acquisition module, configured to obtain the candidate feature words in a sample set;
a statistics module, configured to, for any one of the candidate feature words, separately count a first sample quantity, a second sample quantity, a third sample quantity, and a fourth sample quantity in the sample set, where the first sample quantity is the number of samples that contain the candidate feature word and belong to the target category, the second sample quantity is the number of samples that do not contain the candidate feature word and belong to the target category, the third sample quantity is the number of samples that contain the candidate feature word and do not belong to the target category, and the fourth sample quantity is the number of samples that neither contain the candidate feature word nor belong to the target category;
a second acquisition module, configured to obtain the frequency with which the candidate feature word occurs in the samples belonging to the target category;
a chi-square value calculation module, configured to calculate the chi-square value between the candidate feature word and the target category according to the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency;
a to-be-trained feature determination module, configured to determine the feature words to be trained for the target category according to the chi-square values; and
a feature training module, configured to train the feature words to be trained with a preset classification algorithm to obtain the features of the target category.
Optionally, the second acquisition module includes:
a statistics submodule, configured to count the number of occurrences of the candidate feature word in the samples belonging to the target category, and the total number of samples in the target category; and
a frequency calculation submodule, configured to calculate the ratio of that number to the total number of samples to obtain the frequency.
Optionally, the chi-square value calculation module is specifically configured to:
calculate the chi-square value according to the formula
χ²(t, I) = α(t, I) · N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)];
where N is the total number of samples in the sample set, A is the first sample quantity, B is the second sample quantity, C is the third sample quantity, and D is the fourth sample quantity; I is the target category, t is the candidate feature word, and α(t, I) is the frequency of the candidate feature word.
Optionally, the first acquisition module includes:
a first acquisition submodule, configured to add specified stop words to the segmentation dictionary of a preset segmentation algorithm, perform word segmentation on the text content in the sample set according to the updated segmentation dictionary, and delete the specified stop words contained in the text content to obtain the candidate feature words.
Optionally, the first acquisition module includes:
a second acquisition submodule, configured to add specified candidate words to the segmentation dictionary of a preset segmentation algorithm, and perform word segmentation on the text content in the sample set according to the updated segmentation dictionary to obtain the candidate feature words, where the candidate feature words contain the specified candidate words.
Through the above technical solution, the feature extraction method provided by the present invention first determines the candidate feature words to be processed from the sample set; for each candidate feature word, it separately counts the four sample quantities corresponding to the four combinations of the two conditions "contains the candidate feature word or not" and "belongs to the target category or not"; it then obtains the frequency of the candidate feature word, calculates the chi-square value from the four sample quantities and the frequency, and finally determines the features of the target category according to the chi-square values. As described above, this feature extraction method incorporates the frequency of a feature word when calculating its chi-square value. The frequency is the average number of occurrences of the candidate feature word per sample; in other words, it increases the weight of samples in which the candidate feature word occurs more often and decreases the weight of samples in which it occurs less often, thereby improving the accuracy of feature extraction.
The above description is only an overview of the technical solution of the present invention. To make the technical means of the present invention clearer and implementable according to the specification, and to make the above and other objects, features, and advantages of the present invention more comprehensible, specific embodiments of the present invention are set forth below.
Description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only for the purpose of showing the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a feature extraction method according to an embodiment of the present invention;
Fig. 2 shows a flowchart of another feature extraction method according to an embodiment of the present invention;
Fig. 3 shows a block diagram of a feature extraction device according to an embodiment of the present invention;
Fig. 4 shows a block diagram of another feature extraction device according to an embodiment of the present invention;
Fig. 5 shows a block diagram of yet another feature extraction device according to an embodiment of the present invention.
Specific embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
Referring to Fig. 1, a flowchart of a feature extraction method according to an embodiment of the present invention is shown. The method is applied to a terminal device or a server and, as shown in Fig. 1, may include the following steps:
S110, obtaining the candidate feature words in a sample set.
The sample set consists of samples whose classification categories are known and generally includes positive samples and negative samples. For example, if the category to be trained is class A, the positive samples are the samples belonging to class A and the negative samples are the samples not belonging to class A. The present application is mainly applied to feature extraction in text classification, so each sample is a text file.
The text content of a sample can be segmented with a segmentation algorithm, after which the candidate feature words are determined. Any existing segmentation algorithm suitable for Chinese can be selected; the present application does not limit this.
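To make this step concrete, the following is a minimal sketch in Python, assuming the jieba segmenter stands in for the existing Chinese segmentation algorithm; the sample texts, labels, and field names are illustrative, not part of the invention:

```python
import jieba

# Illustrative labeled text samples (each sample is one text file).
samples = [
    {"text": "这款车的动力很强", "label": "positive"},
    {"text": "内饰做工比较差", "label": "negative"},
]

# Segment every sample; the union of tokens gives the candidate feature words.
for s in samples:
    s["tokens"] = jieba.lcut(s["text"])

candidate_words = {tok for s in samples for tok in s["tokens"]}
```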
S120, separately counting the first sample quantity, second sample quantity, third sample quantity, and fourth sample quantity in the sample set.
The sample quantities are classified as shown in Table 1:
Table 1
                             Contains feature X    Does not contain feature X
Belongs to class I                   A                         B
Does not belong to class I           C                         D
With feature X being the candidate feature word and class I being the target category in Table 1, the first sample quantity is A, the number of samples in the sample set that contain feature X and belong to class I; the second sample quantity is B, the number of samples that do not contain feature X and belong to class I; the third sample quantity is C, the number of samples that contain feature X and do not belong to class I; and the fourth sample quantity is D, the number of samples that neither contain feature X nor belong to class I.
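A sketch of this counting step, continuing the sample representation assumed above (a "tokens" list and a "label" per sample):

```python
def count_abcd(samples, word, target_label):
    """Count the four sample quantities of Table 1 for one candidate word."""
    A = B = C = D = 0
    for s in samples:
        contains = word in s["tokens"]
        in_class = s["label"] == target_label
        if contains and in_class:
            A += 1  # contains feature X and belongs to class I
        elif not contains and in_class:
            B += 1  # lacks feature X but belongs to class I
        elif contains and not in_class:
            C += 1  # contains feature X but is outside class I
        else:
            D += 1  # neither contains feature X nor belongs to class I
    return A, B, C, D
```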
S130, obtaining the frequency with which the candidate feature word occurs in the samples belonging to the target category.
The frequency of a candidate feature word is the ratio of the number of times the feature word occurs to the total number of texts in the category. The calculation formula is as follows:
α(t, I) = n / (A + B)    (Formula 1)
where α(t, I) is the frequency between the candidate feature word t and class I. The denominator of Formula 1, A + B, is the number of texts belonging to class I in Table 1, and the numerator n is the total number of occurrences of feature word X in the A samples.
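Formula 1 might be computed as follows under the same assumed sample representation; the function name is this example's, not the patent's:

```python
def frequency(samples, word, target_label):
    """alpha(t, I): total occurrences of the word in class-I texts over A + B."""
    occurrences = sum(s["tokens"].count(word)
                      for s in samples if s["label"] == target_label)
    class_size = sum(1 for s in samples if s["label"] == target_label)  # A + B
    return occurrences / class_size if class_size else 0.0
```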
The target category is the category currently being trained. For the application scenario of sentiment classification of automotive articles, which includes the three categories of positive, neutral, and negative sentiment, the target category can be any one of these three categories.
S140, calculating the chi-square value between the candidate feature word and the target category according to the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency.
χ²(t, I) = α(t, I) · N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)]    (Formula 2)
In Formula 2, χ²(t, I) denotes the chi-square value, N is the total number of samples, A is the first sample quantity, B the second sample quantity, C the third sample quantity, and D the fourth sample quantity; α(t, I) denotes the frequency of feature t with respect to category I.
When AD − BC > 0, the candidate feature word is strongly correlated with the target category; when AD − BC ≤ 0, the candidate feature word is more strongly correlated with the non-target categories, and χ²(t, I) is set to 0 so as to exclude such candidate feature words from influencing the target category.
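Combining Formula 2 with the AD − BC ≤ 0 cutoff described above gives the following sketch (names illustrative):

```python
def chi_square(N, A, B, C, D, alpha):
    """Frequency-weighted chi-square of Formula 2, zeroed when AD - BC <= 0."""
    if A * D - B * C <= 0:
        return 0.0  # word correlates with non-target categories: exclude it
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    if denom == 0:
        return 0.0  # degenerate contingency table
    return alpha * N * (A * D - B * C) ** 2 / denom
```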
S150, determining the feature words to be trained for the target category according to the chi-square values.
After the chi-square values of all candidate feature words have been calculated, the words are sorted by chi-square value, and a preset number of candidate feature words are selected in descending order of chi-square value as the features of the category.
The preset number can be chosen according to actual demand; for example, 500 can be chosen.
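Putting the previous sketches together, S150 could be realized as below, with top_n = 500 matching the example figure; this is an illustration, not the patent's reference implementation:

```python
def select_features(samples, candidate_words, target_label, top_n=500):
    """Rank candidate words by chi-square value and keep the top preset number."""
    N = len(samples)
    scored = []
    for word in candidate_words:
        A, B, C, D = count_abcd(samples, word, target_label)
        alpha = frequency(samples, word, target_label)
        scored.append((word, chi_square(N, A, B, C, D, alpha)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:top_n]]
```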
S160, training the feature words to be trained with a preset classification algorithm to obtain the features of the target category.
The selected feature words to be trained need to be trained with a preset classification algorithm; from the preset number of candidate feature words, training yields the features that truly characterize the attributes of the target category. For example, the preset classification algorithm can be the SVM (Support Vector Machine) algorithm; other classification algorithms can of course also be used, and the present application does not limit this.
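As one possible realization of this step, the sketch below assumes scikit-learn's LinearSVC as the preset SVM algorithm and a simple term-count representation over the selected feature words; both choices are assumptions for illustration:

```python
from sklearn.svm import LinearSVC

def train_on_features(samples, feature_words, target_label):
    """Fit an SVM on bag-of-words counts restricted to the selected features."""
    X = [[s["tokens"].count(w) for w in feature_words] for s in samples]
    y = [1 if s["label"] == target_label else 0 for s in samples]
    clf = LinearSVC()
    clf.fit(X, y)
    return clf
```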
The feature extraction method provided in this embodiment first determines the candidate feature words in the sample set; for any candidate feature word, it separately counts the sample quantities corresponding to the four combinations of the two conditions "contains the candidate feature word or not" and "belongs to the target category or not"; it then obtains the frequency of the candidate feature word, calculates the chi-square value from the four sample quantities and the frequency, determines the feature words to be trained according to the chi-square values, and finally trains them with a preset classification algorithm to obtain the features of the target category. As described above, this feature extraction method incorporates the frequency of a feature word when calculating its chi-square value. The frequency is the average number of occurrences of the candidate feature word per sample; in other words, it increases the weight of samples in which the candidate feature word occurs more often and decreases the weight of samples in which it occurs less often, thereby improving the accuracy of feature extraction.
Referring to Fig. 2, a flowchart of another feature extraction method according to an embodiment of the present invention is shown. This method introduces stop words when selecting candidate feature words. As shown in Fig. 2, the method includes the following steps:
S210, adding specified stop words and specified candidate words to the segmentation dictionary of a preset segmentation algorithm to obtain an updated segmentation dictionary.
Specified stop words are words without specific semantics, for example modal particles, quantifiers, and conjunctions, such as words meaning "of", "a/one", or "but".
In one possible implementation of the present invention, texts about vehicles are classified. In such an application scenario, the specified candidate words can be vehicle-model words, for example Passat and Lavida. Whenever a vehicle-model word occurs in a text sample, the text is segmented according to the vehicle-model word, which is determined to be a candidate feature word.
S220, performing word segmentation on the text content in the sample set according to the updated segmentation dictionary to obtain the candidate feature words.
The candidate feature words do not contain the specified stop words, but they do contain the specified candidate words.
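A minimal sketch of S210–S220, again assuming jieba; the listed vehicle-model words and stop words are illustrative placeholders:

```python
import jieba

specified_candidates = ["帕萨特", "朗逸"]    # illustrative vehicle-model words
specified_stopwords = {"的", "一个", "但是"}  # illustrative stop words

# Update the segmentation dictionary so model names stay whole words.
for w in specified_candidates:
    jieba.add_word(w)

def tokenize(text):
    """Segment a text and drop the specified stop words."""
    return [tok for tok in jieba.lcut(text) if tok not in specified_stopwords]
```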
S230, for any one candidate feature word, counting the first sample quantity, second sample quantity, third sample quantity, and fourth sample quantity in the sample set.
The first sample quantity is the number of samples that contain the candidate feature word and belong to the target category, the second sample quantity is the number of samples that do not contain the candidate feature word and belong to the target category, the third sample quantity is the number of samples that contain the candidate feature word and do not belong to the target category, and the fourth sample quantity is the number of samples that neither contain the candidate feature word nor belong to the target category.
S240, counting the number of occurrences of the candidate feature word in the samples belonging to the target category, and the total number of samples in the target category.
This step counts the occurrences of the candidate feature word in the samples of the first sample quantity and, at the same time, counts the total number of samples in the target category.
S250, dividing that number by the total quantity to obtain the frequency of the candidate feature word.
For example, if the candidate feature word occurs n times in the samples of the target category and the target category contains A + B samples (see Table 1), the frequency of the candidate feature word is n / (A + B).
S260, calculating the chi-square value of the candidate feature word.
The chi-square value of the candidate feature word is calculated according to Formula 2 and is not described again here.
S210–S260 are repeated until the chi-square values of all candidate feature words in the sample set have been obtained.
S270, selecting a preset number of candidate feature words in descending order of chi-square value, thereby determining the feature words to be trained for the target category.
Specifically, after the chi-square values of all candidate feature words in the sample set have been calculated, the words are sorted in descending order, and the first preset number of candidate feature words are selected and determined to be the feature words to be trained.
In the feature extraction method provided in this embodiment, when the candidate feature words in the sample set are determined, specified stop words and specified candidate words are added to the segmentation dictionary. When a text sample is segmented, the words in the text that match the specified stop words are deleted, and the resulting segmentation output is the set of candidate feature words. Then the frequency of each candidate feature word is calculated, and the chi-square value is calculated in combination with that frequency; the feature words to be trained are determined according to the chi-square values, and finally the preset classification algorithm trains the feature words to be trained to obtain the features of the target category. Because this method adds the specified stop words to the segmentation dictionary, the stop words occurring in text samples are deleted directly during segmentation, which reduces the number of candidate feature words and thus improves the efficiency of selecting the feature words to be trained. Moreover, the specified candidate words are added to the segmentation dictionary, and character strings that match a specified candidate word are kept together as one word, which improves the accuracy of segmentation and in turn the accuracy of feature extraction.
Corresponding to the above feature extraction method embodiments, the present invention also provides feature extraction device embodiments.
Referring to Fig. 3, a block diagram of a feature extraction device according to an embodiment of the present invention is shown. The device can be applied to a terminal device or a server. As shown in Fig. 3, the device may include: a first acquisition module 310, a statistics module 320, a second acquisition module 330, a chi-square value calculation module 340, a to-be-trained feature determination module 350, and a feature training module 360.
The first acquisition module 310 is configured to obtain the candidate feature words in a sample set.
The text in the sample set is segmented with a segmentation algorithm, and the resulting segmentation output is the set of candidate feature words.
The statistics module 320 is configured to, for any one candidate feature word, separately count the first sample quantity, second sample quantity, third sample quantity, and fourth sample quantity in the sample set.
That is, it counts the numbers of samples in the sample set that fall into each of the four situations shown in Table 1.
The first sample quantity is the number of samples that contain the candidate feature word and belong to the target category, the second sample quantity is the number of samples that do not contain the candidate feature word and belong to the target category, the third sample quantity is the number of samples that contain the candidate feature word and do not belong to the target category, and the fourth sample quantity is the number of samples that neither contain the candidate feature word nor belong to the target category.
The second acquisition module 330 is configured to obtain the frequency with which the candidate feature word occurs in the samples belonging to the target category.
The frequency of the candidate feature word is calculated with Formula 1 and is not described again here.
The chi-square value calculation module 340 is configured to calculate the chi-square value between the candidate feature word and the target category according to the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency.
The chi-square value is calculated with Formula 2 and is not described again here.
The to-be-trained feature determination module 350 is configured to determine the feature words to be trained for the target category according to the chi-square values.
After the chi-square values of all candidate feature words have been calculated, the words are sorted by chi-square value, and a preset number of candidate feature words are selected in descending order of chi-square value as the features of the category. The preset number can be chosen according to actual demand; for example, 500 can be chosen.
The feature training module 360 is configured to train the feature words to be trained with a preset classification algorithm to obtain the features of the target category.
The selected feature words to be trained need to be trained with a preset classification algorithm before the features of the target category are obtained. For example, the preset classification algorithm can be the SVM (Support Vector Machine) algorithm; other classification algorithms can of course also be used, and the present application does not limit this.
In the feature extraction device provided in this embodiment, the first acquisition module determines the candidate feature words in the sample set; for any candidate feature word, the statistics module separately counts the sample quantities corresponding to the four combinations of the two conditions "contains the candidate feature word or not" and "belongs to the target category or not"; the second acquisition module then obtains the frequency of the candidate feature word; the chi-square value calculation module calculates the chi-square value from the four sample quantities and the frequency; the to-be-trained feature determination module determines the feature words to be trained according to the chi-square values; and the feature training module trains them with a preset classification algorithm to obtain the features of the target category. As described above, this device incorporates the frequency of a feature word when calculating its chi-square value. The frequency is the average number of occurrences of the candidate feature word per sample; in other words, it increases the weight of samples in which the candidate feature word occurs more often and decreases the weight of samples in which it occurs less often, thereby improving the accuracy of feature extraction.
Referring to Fig. 4, a block diagram of another feature extraction device according to an embodiment of the present invention is shown. This embodiment refines some of the modules in the embodiment shown in Fig. 3. As shown in Fig. 4, the device may include: a first acquisition module 410, a statistics module 420, a statistics submodule 430, a frequency calculation submodule 440, a chi-square value calculation module 450, a to-be-trained feature determination module 460, and a feature training module 470.
The first acquisition module 410 is configured to add specified stop words and specified candidate words to the segmentation dictionary of a preset segmentation algorithm, perform word segmentation on the text content in the sample set according to the updated segmentation dictionary, delete the specified stop words contained in the text content, and keep character strings that match a specified candidate word together as one word, thereby obtaining the candidate feature words.
In this embodiment, the specified stop words and specified candidate words are added to the original segmentation dictionary of the segmentation algorithm. If a word matching a specified stop word occurs in a text sample, it is deleted; if a word matching a specified candidate word occurs in a text sample, it is kept as one word. In this way, the total number of candidate feature words is reduced and the accuracy of the candidate feature words is improved.
The statistics module 420 is configured to, for any one of the candidate feature words, separately count the first sample quantity, second sample quantity, third sample quantity, and fourth sample quantity in the sample set.
The first sample quantity is the number of samples that contain the candidate feature word and belong to the target category, the second sample quantity is the number of samples that do not contain the candidate feature word and belong to the target category, the third sample quantity is the number of samples that contain the candidate feature word and do not belong to the target category, and the fourth sample quantity is the number of samples that neither contain the candidate feature word nor belong to the target category.
The statistics submodule 430 is configured to count the number of occurrences of the candidate feature word in the samples belonging to the target category, and the total number of samples in the target category.
The statistics submodule counts the occurrences of the candidate feature word in the samples of the first sample quantity and, at the same time, counts the total number of samples in the target category.
The frequency calculation submodule 440 is configured to calculate the ratio of that number to the total number of samples to obtain the frequency.
The frequency calculation submodule calculates with Formula 1, which is not described again here.
The chi-square value calculation module 450 is configured to calculate the chi-square value between the candidate feature word and the target category according to the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency.
The chi-square value is calculated with Formula 2, which is not described again here.
The to-be-trained feature determination module 460 is configured to determine the feature words to be trained for the target category according to the chi-square values.
Specifically, after the chi-square values of all candidate feature words in the sample set have been calculated, the words are sorted in descending order, and the first preset number of candidate feature words are selected and determined to be the feature words to be trained.
The feature training module 470 is configured to train the feature words to be trained with a preset classification algorithm to obtain the features of the target category.
For example, the preset classification algorithm can be the SVM algorithm; training on the preset number of candidate feature words yields the features that truly characterize the attributes of the target category.
In the feature extraction device provided in this embodiment, when the candidate feature words in the sample set are determined, specified stop words and specified candidate words are added to the segmentation dictionary. When a text sample is segmented, the words in the text that match the specified stop words are deleted, and the resulting segmentation output is the set of candidate feature words. Then the frequency of each candidate feature word is calculated, and the chi-square value is calculated in combination with that frequency; the feature words to be trained are determined according to the chi-square values, and finally the preset classification algorithm trains the feature words to be trained to obtain the features of the target category. Because the device adds the specified stop words to the segmentation dictionary, the stop words occurring in text samples are deleted directly during segmentation, which reduces the number of candidate feature words and thus improves the efficiency of selecting the feature words to be trained. Moreover, the specified candidate words are added to the segmentation dictionary, and character strings that match a specified candidate word are kept together as one word, which improves the accuracy of segmentation and in turn the accuracy of feature extraction.
Referring to Fig. 5, a block diagram of yet another feature extraction device according to an embodiment of the present invention is shown. The device includes a processor 510 and a memory 520.
The above first acquisition module 310, statistics module 320, second acquisition module 330, chi-square value calculation module 340, to-be-trained feature determination module 350, feature training module 360, and so on are stored in the memory 520 as program units, and the processor 510 executes these program units stored in the memory 520 to realize the corresponding functions.
The processor 510 contains a kernel, and the kernel retrieves the corresponding program units from the memory 520. One or more kernels can be provided, and the accuracy of feature extraction is improved by adjusting kernel parameters.
The memory 520 may include volatile memory, random access memory (RAM), and/or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM); the memory 520 includes at least one memory chip.
In the feature extraction device provided in this embodiment, the processor retrieves the corresponding program units from the memory to complete the following process: determining the candidate feature words in a sample set; for any candidate feature word, separately counting the sample quantities corresponding to the four combinations of the two conditions "contains the candidate feature word or not" and "belongs to the target category or not"; then obtaining the frequency of the candidate feature word; calculating the chi-square value from the four sample quantities and the frequency; and finally determining the feature words to be trained according to the chi-square values and training them with a preset classification algorithm to obtain the features of the target category. This feature extraction process incorporates the frequency of a feature word when calculating its chi-square value. The frequency is the average number of occurrences of the candidate feature word per sample; in other words, it increases the weight of samples in which the candidate feature word occurs more often and decreases the weight of samples in which it occurs less often, thereby improving the accuracy of feature extraction.
The present invention also provides a computer program product which, when executed on a data processing device, is adapted to execute program code that initializes the following method steps:
obtaining the candidate feature words in a sample set;
for any one of the candidate feature words, separately counting a first sample quantity, a second sample quantity, a third sample quantity, and a fourth sample quantity in the sample set, where the first sample quantity is the number of samples that contain the candidate feature word and belong to the target category, the second sample quantity is the number of samples that do not contain the candidate feature word and belong to the target category, the third sample quantity is the number of samples that contain the candidate feature word and do not belong to the target category, and the fourth sample quantity is the number of samples that neither contain the candidate feature word nor belong to the target category;
obtaining the frequency with which the candidate feature word occurs in the samples belonging to the target category;
calculating the chi-square value between the candidate feature word and the target category according to the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency;
determining the feature words to be trained for the target category according to the chi-square values; and
training the feature words to be trained with a preset classification algorithm to obtain the features of the target category.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, the instruction device realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPU), an input/output interface, a network interface, and memory.
The memory may include volatile memory, random access memory (RAM), and/or non-volatile memory in computer-readable media, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, and any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
The above are only embodiments of the present application and are not intended to limit the present application. To those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent substitution, improvement, and the like made within the spirit and principle of the present application shall be included within the scope of the claims of the present application.

Claims (10)

1. A feature extraction method, characterized by including:
obtaining the candidate feature words in a sample set;
for any one of the candidate feature words, separately counting a first sample quantity, a second sample quantity, a third sample quantity, and a fourth sample quantity in the sample set, where the first sample quantity is the number of samples that contain the candidate feature word and belong to the target category, the second sample quantity is the number of samples that do not contain the candidate feature word and belong to the target category, the third sample quantity is the number of samples that contain the candidate feature word and do not belong to the target category, and the fourth sample quantity is the number of samples that neither contain the candidate feature word nor belong to the target category;
obtaining the frequency with which the candidate feature word occurs in the samples belonging to the target category;
calculating the chi-square value between the candidate feature word and the target category according to the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency;
determining the feature words to be trained for the target category according to the chi-square values; and
training the feature words to be trained with a preset classification algorithm to obtain the features of the target category.
2. The method according to claim 1, characterized in that obtaining the frequency with which the candidate feature word occurs in the samples belonging to the target category includes:
counting the number of occurrences of the candidate feature word in the samples belonging to the target category, and the total number of samples in the target category; and
calculating the ratio of that number to the total number of samples in the target category to obtain the frequency.
3. The method according to claim 1, characterized in that calculating the chi-square value between the candidate feature word and the target category according to the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency includes:
calculating the chi-square value according to the formula
χ²(t, I) = α(t, I) · N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)];
where N is the total number of samples in the sample set, A is the first sample quantity, B is the second sample quantity, C is the third sample quantity, and D is the fourth sample quantity; I is the target category, t is the candidate feature word, and α(t, I) is the frequency of the candidate feature word.
4. The method according to any one of claims 1 to 3, characterized in that obtaining the candidate feature words in the sample set includes:
adding specified stop words to the segmentation dictionary of a preset segmentation algorithm, performing word segmentation on the text content in the sample set according to the updated segmentation dictionary, and deleting the specified stop words contained in the text content to obtain the candidate feature words.
5. The method according to any one of claims 1 to 3, characterized in that obtaining the candidate feature words in the sample set includes:
adding specified candidate words to the segmentation dictionary of a preset segmentation algorithm, and performing word segmentation on the text content in the sample set according to the updated segmentation dictionary to obtain the candidate feature words, where the candidate feature words contain the specified candidate words.
6. A feature extraction device, characterized by including:
a first acquisition module, configured to obtain the candidate feature words in a sample set;
a statistics module, configured to, for any one of the candidate feature words, separately count a first sample quantity, a second sample quantity, a third sample quantity, and a fourth sample quantity in the sample set, where the first sample quantity is the number of samples that contain the candidate feature word and belong to the target category, the second sample quantity is the number of samples that do not contain the candidate feature word and belong to the target category, the third sample quantity is the number of samples that contain the candidate feature word and do not belong to the target category, and the fourth sample quantity is the number of samples that neither contain the candidate feature word nor belong to the target category;
a second acquisition module, configured to obtain the frequency with which the candidate feature word occurs in the samples belonging to the target category;
a chi-square value calculation module, configured to calculate the chi-square value between the candidate feature word and the target category according to the first sample quantity, the second sample quantity, the third sample quantity, the fourth sample quantity, and the frequency;
a to-be-trained feature determination module, configured to determine the feature words to be trained for the target category according to the chi-square values; and
a feature training module, configured to train the feature words to be trained with a preset classification algorithm to obtain the features of the target category.
7. The device according to claim 6, characterized in that the second acquisition module includes:
a statistics submodule, configured to count the number of occurrences of the candidate feature word in the samples belonging to the target category, and the total number of samples in the target category; and
a frequency calculation submodule, configured to calculate the ratio of that number to the total number of samples to obtain the frequency.
8. The device according to claim 6, characterized in that the chi-square value calculation module is specifically configured to:
calculate the chi-square value according to the formula
χ²(t, I) = α(t, I) · N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)];
where N is the total number of samples in the sample set, A is the first sample quantity, B is the second sample quantity, C is the third sample quantity, and D is the fourth sample quantity; I is the target category, t is the candidate feature word, and α(t, I) is the frequency of the candidate feature word.
9. The device according to any one of claims 6 to 8, characterized in that the first acquisition module includes:
a first acquisition submodule, configured to add specified stop words to the segmentation dictionary of a preset segmentation algorithm, perform word segmentation on the text content in the sample set according to the updated segmentation dictionary, and delete the specified stop words contained in the text content to obtain the candidate feature words.
10. The device according to any one of claims 6 to 8, characterized in that the first acquisition module includes:
a second acquisition submodule, configured to add specified candidate words to the segmentation dictionary of a preset segmentation algorithm, and perform word segmentation on the text content in the sample set according to the updated segmentation dictionary to obtain the candidate feature words, where the candidate feature words contain the specified candidate words.
CN201611042411.2A 2016-11-23 2016-11-23 Feature extracting method and device Pending CN108090088A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611042411.2A CN108090088A (en) 2016-11-23 2016-11-23 Feature extracting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611042411.2A CN108090088A (en) 2016-11-23 2016-11-23 Feature extracting method and device

Publications (1)

Publication Number Publication Date
CN108090088A true CN108090088A (en) 2018-05-29

Family

ID=62171034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611042411.2A Pending CN108090088A (en) 2016-11-23 2016-11-23 Feature extracting method and device

Country Status (1)

Country Link
CN (1) CN108090088A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543577A (en) * 2019-09-03 2019-12-06 南京工程学院 Picture retrieval method for underwater operation scene
CN114443849A (en) * 2022-02-09 2022-05-06 北京百度网讯科技有限公司 Method and device for selecting marked sample, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103886108A (en) * 2014-04-13 2014-06-25 北京工业大学 Feature selection and weight calculation method of imbalance text set
CN103914551A (en) * 2014-04-13 2014-07-09 北京工业大学 Method for extending semantic information of microblogs and selecting features thereof
KR101429397B1 (en) * 2013-04-11 2014-08-14 전북대학교산학협력단 Method and system for extracting core events based on message analysis in social network service
CN105512311A (en) * 2015-12-14 2016-04-20 北京工业大学 Chi square statistic based self-adaption feature selection method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101429397B1 (en) * 2013-04-11 2014-08-14 전북대학교산학협력단 Method and system for extracting core events based on message analysis in social network service
CN103886108A (en) * 2014-04-13 2014-06-25 北京工业大学 Feature selection and weight calculation method of imbalance text set
CN103914551A (en) * 2014-04-13 2014-07-09 北京工业大学 Method for extending semantic information of microblogs and selecting features thereof
CN105512311A (en) * 2015-12-14 2016-04-20 北京工业大学 Chi square statistic based self-adaption feature selection method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543577A (en) * 2019-09-03 2019-12-06 南京工程学院 Picture retrieval method for underwater operation scene
CN114443849A (en) * 2022-02-09 2022-05-06 北京百度网讯科技有限公司 Method and device for selecting marked sample, electronic equipment and storage medium
CN114443849B (en) * 2022-02-09 2023-10-27 北京百度网讯科技有限公司 Labeling sample selection method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107291723A (en) The method and apparatus of web page text classification, the method and apparatus of web page text identification
CN104915327A (en) Text information processing method and device
CN113052577B (en) Class speculation method and system for block chain digital currency virtual address
CN107145560A (en) A kind of file classification method and device
CN109271517A (en) IG TF-IDF Text eigenvector generates and file classification method
CN106960040A (en) A kind of URL classification determines method and device
CN110019785A (en) A kind of file classification method and device
CN108241662A (en) The optimization method and device of data mark
CN111475651B (en) Text classification method, computing device and computer storage medium
CN110502902A (en) A kind of vulnerability classification method, device and equipment
CN103246686A (en) Method and device for text classification, and method and device for characteristic processing of text classification
CN108090088A (en) Feature extracting method and device
CN114896398A (en) Text classification system and method based on feature selection
CN108229507A (en) Data classification method and device
US9053434B2 (en) Determining an obverse weight
CN114722198A (en) Method, system and related device for determining product classification code
CN105787004A (en) Text classification method and device
CN116029280A (en) Method, device, computing equipment and storage medium for extracting key information of document
CN108228869A (en) The method for building up and device of a kind of textual classification model
CN111241269B (en) Short message text classification method and device, electronic equipment and storage medium
CN104699707A (en) Data clustering method and device
CN110688481A (en) Text classification feature selection method based on chi-square statistic and IDF
CN110807159A (en) Data marking method and device, storage medium and electronic equipment
CN108108371A (en) A kind of file classification method and device
CN108647335A (en) Internet public opinion analysis method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20180529