CN105893388A - Text feature extracting method based on inter-class distinctness and intra-class high representation degree - Google Patents
- Publication number
- CN105893388A CN105893388A CN201510014438.XA CN201510014438A CN105893388A CN 105893388 A CN105893388 A CN 105893388A CN 201510014438 A CN201510014438 A CN 201510014438A CN 105893388 A CN105893388 A CN 105893388A
- Authority
- CN
- China
- Prior art keywords
- class
- feature
- feature words
- text
- degree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text feature extracting method based on inter-class distinctness and intra-class high representation degree. The method comprises the following steps: preprocessing a training set text; calculating the class distinctness of each feature word through an improved feature selecting method so as to select feature words with more class representation, wherein the selected feature words are of high distinctness among different classes; and further screening the selected feature words which are of high class distinctness based on the intra-class distribution rate and information gain (IG) of the feature words. With the adoption of the method, the feature selection is carried out twice to select the feature words which are of high intra-class information entropy and high intra-class distribution rate, and thus the classifying efficiency and accuracy can be improved; in addition, the calculation is simple, so that the text classifying speed and accuracy can be improved.
Description
Technical field
The invention belongs to the field of text mining technology, and in particular relates to a text feature extraction method based on inter-class discrimination and intra-class high representation degree.
Background technology
In the current era of rapidly growing Internet information resources, text classification technology has emerged as an important means of organizing and managing text information, so that needed information and resources can be found more quickly and effectively. Text classification refers to the technique of assigning a text to one or more predefined categories according to its content or attributes. In the field of text classification, the most popular approach is to represent texts with the vector space model (VSM). To avoid the "curse of dimensionality" among feature items that arises when building the VSM space, an effective feature selection algorithm becomes especially important.
In text classification, traditional feature selection algorithms include the following. The DF algorithm (document frequency) focuses only on high-frequency words; it misses low-frequency words with high information entropy, and the words it selects do not necessarily characterize a particular category. The IG algorithm (information gain), owing to the particularity of its computation, often cannot select a sufficient number of feature words. The CHI algorithm (χ² statistic) considers the influence of a feature word on a particular category, but its computational cost is very large. The MI algorithm (mutual information) performs unstably in experimental environments.
Therefore, it is necessary to design a feature word selection algorithm that can select words with both strong inter-class discrimination and high representativeness within the category they belong to, while keeping the computational cost low.
Summary of the invention
To overcome the deficiencies of existing text classification feature selection methods, which cannot select feature words with high category representativeness and which incur a large computational cost, the present invention provides a text feature extraction method based on inter-class discrimination and intra-class high representation degree, with a small computational cost. The scheme comprises the following steps:
Step 1: obtain text collections of different categories as the corpus training set.
Step 2: preprocess the texts of the corpus training set, including Chinese word segmentation and stop-word removal;
Step 3: apply the text feature extraction method based on inter-class discrimination and intra-class high representation degree to perform feature selection, selecting N features (N is a preset threshold) as the text feature set of the corpus training set.
The text feature extraction method based on inter-class discrimination and intra-class high representation degree is specifically as follows:
First, calculate the class discrimination degree of each feature word and select the feature words with high class discrimination degree, comprising the following steps:
Step (1): determine the degree of association between each feature word and each preset category, computed as follows:
R_jk = (number of documents in category c_j that contain feature word t_k) / (number of documents in category c_j)
where R_jk denotes the degree of association between feature word t_k and text category c_j; the numerator is the number of documents in category c_j that contain t_k, and the denominator is the number of documents in category c_j.
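As an illustrative sketch (not part of the patent text), the association degree R_jk can be computed from a corpus in which each document is represented as a set of words; the structure `docs_by_class` is a hypothetical name introduced here:

```python
def association_degree(docs_by_class, term):
    """R_jk for one term: the fraction of documents in each class containing it.

    docs_by_class: dict mapping class label -> list of documents (sets of words).
    """
    return {cls: sum(1 for doc in docs if term in doc) / len(docs)
            for cls, docs in docs_by_class.items()}
```

For example, if 5 of 10 class-A documents contain a word, the returned value for class A is 0.5, matching the fractions in Table 2 below.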
Step (2): calculate the value of the class discrimination ability of feature word t_k within text category c_j, computed as follows:
Diff_jk = min(R_jk − R_ik)  (i ≠ j, i = 1~s, where s is the total number of categories)
Note that Diff_jk can be negative here; a negative value indicates that feature word t_k is less distributed in text category c_j than in some text category c_i.
Step (3): calculate the class discrimination degree of feature word t_k, computed as follows:
Diff_k = max{Diff_jk}  (j = 1~s)
and record the value of j for the Diff_jk corresponding to Diff_k, i.e., record the text category c_j that feature word t_k characterizes.
Step (4): set a preset threshold Q1 and take the feature words with Diff_k ≥ Q1 for further screening.
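The two quantities above can be sketched as follows (a minimal illustration, assuming the association degrees R_jk for one feature word have already been computed into a dict keyed by class):

```python
def class_discrimination(assoc):
    """Given assoc: dict class -> R_jk for one feature word, return
    (Diff_jk per class, the characterized class c_j, Diff_k)."""
    # Diff_jk = min over i != j of (R_jk - R_ik); may be negative
    diff = {j: min(assoc[j] - assoc[i] for i in assoc if i != j)
            for j in assoc}
    best = max(diff, key=diff.get)   # class c_j achieving Diff_k = max Diff_jk
    return diff, best, diff[best]
```

For word 3 of Table 2 below (R_A = 1.0, R_B = 0.3, R_C = 0.1) this yields Diff_k = 0.7, characterizing class A.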
Further, the selected feature words with high class discrimination degree are screened again by combining the intra-class distribution rate and information gain (IG) of each feature word, so as to select the feature words with high intra-class representation degree, comprising the following steps:
Step (1): for each selected feature word with high class discrimination degree, calculate its distribution rate in the category it characterizes. Assuming feature word t_k characterizes text category c_j, the distribution rate of t_k is computed as follows:
w_tk = (word frequency of t_k in category c_j) / (number of documents in category c_j)
Step (2): set a preset threshold Q2; when w_tk ≥ Q2, feature word t_k is treated as a high-frequency word and proceeds to step (3). Set a preset threshold Q3; when w_tk ≤ Q3, feature word t_k is a low-frequency word and proceeds to step (4) for further judgment.
Step (3): compute IG for the feature words with w_tk ≥ Q2 and set a preset threshold Q4; when IG(t_k) < Q4, feature word t_k is eliminated and is not selected into the text feature set of the corpus training set.
Step (4): compute IG for the feature words with w_tk ≤ Q3 and set a threshold Q5; when IG(t_k) ≥ Q5, t_k is an effective low-frequency word and is selected into the text feature set of the corpus training set.
Step (5): assuming the dimension of the text feature set of the corpus training set is N, if the number of feature words selected above is less than N, select additional feature words from those with Q3 < w_tk < Q2, from the highest weight downward, until the set is full.
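The decision logic of steps (2) through (5) can be sketched as below; the IG value is taken as an input since the patent uses standard information gain without restating its formula, and the function name is introduced here for illustration only:

```python
def second_stage_decision(w, ig, Q2, Q3, Q4, Q5):
    """Classify one candidate feature word by distribution rate w and IG.

    Returns 'selected', 'eliminated', or 'alternate' (mid-band words are
    alternates, used in step (5) to fill the feature set up to dimension N).
    """
    if w >= Q2:                       # high-frequency word: step (3)
        return "eliminated" if ig < Q4 else "selected"
    if w <= Q3:                       # low-frequency word: step (4)
        return "selected" if ig >= Q5 else "eliminated"
    return "alternate"                # Q3 < w < Q2: step (5)
```

The choice of returning a three-way label mirrors the method's structure: only mid-band words are deferred to the weight-ranked fill-up of step (5).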
The technical scheme provided by the present invention has the following benefits: it first calculates the class discrimination degree of each feature word to select the feature words with greater category representativeness, so that the selected words are highly discriminative between different categories; it then further screens the selected high-discrimination feature words by combining their intra-class distribution rate and information gain (IG), selecting the feature words with both high intra-class information entropy and high intra-class distribution rate. In addition, the computation of this scheme is simple, which can improve the speed and efficiency of text classification.
Brief description of the drawings
Fig. 1 is the flow chart of the text feature extraction method based on inter-class discrimination and intra-class high representation degree of the present invention.
Fig. 2 is the detailed algorithm flow diagram of selecting feature words with high inter-class discrimination in the present invention.
Fig. 3 is the detailed algorithm flow diagram of selecting feature words with high intra-class representation degree from the selected high-discrimination feature words in the present invention.
Detailed description of the embodiments
To make the purpose, technical scheme, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and actual cases.
Fig. 1 is the flow chart of the text feature extraction method based on inter-class discrimination and intra-class high representation degree of the present invention; the concrete functions are implemented as follows:
Step 1: first use a web crawler, or collect manually from the Internet, a certain number of representative articles from multiple fields; analyze and organize these articles and place them into the corpus training set by category, as the training sample set of the text classification system.
Step 2: to extract the words that can represent text features from the texts, perform word segmentation, stop-word removal, and other preprocessing.
Step 3: from the preprocessed texts, select the feature words with high class discrimination degree, as follows:
Fig. 2 is the detailed algorithm flow diagram of selecting feature words with high inter-class discrimination in the present invention; the algorithm is illustrated below with an example, as follows:
Assume there are 3 preset categories, class A, class B, and class C, each containing 10 articles belonging to that category. Assume feature word 1 appears in 5 of the 10 articles of class A, and also appears in 5 of the 10 articles of class B and in 5 of those of class C. Feature word 2 appears in 9 of the 10 articles of class A, in 8 of the 10 articles of class B, and in 1 of the 10 articles of class C. Feature word 3 appears in 10 of the 10 articles of class A, in 3 of the 10 articles of class B, and in 1 of the 10 articles of class C, as shown in Table 1 below:
| | A class (10) | B class (10) | C class (10) |
|---|---|---|---|
| Word 1 | 5 | 5 | 5 |
| Word 2 | 9 | 8 | 1 |
| Word 3 | 10 | 3 | 1 |

Table 1
Calculate the degree of association R_jk between each word and each preset category according to the association formula:
R_jk = (number of documents in category c_j that contain feature word t_k) / (number of documents in category c_j)
where R_jk denotes the degree of association between feature word t_k and text category c_j. The results are shown in Table 2 below.
| R_jk | A class | B class | C class |
|---|---|---|---|
| Word 1 | R_A1 = 1/2 | R_B1 = 1/2 | R_C1 = 1/2 |
| Word 2 | R_A2 = 9/10 | R_B2 = 8/10 | R_C2 = 1/10 |
| Word 3 | R_A3 = 10/10 | R_B3 = 3/10 | R_C3 = 1/10 |

Table 2
Calculate the value of the class discrimination ability of feature word t_k within text category c_j:
Diff_jk = min(R_jk − R_ik)  (i ≠ j, i = 1~s, where s is the total number of categories)
Diff_A1 = min{(1/2 − 1/2), (1/2 − 1/2)} = 0
Similarly, Diff_B1 = 0 and Diff_C1 = 0; the remaining Diff_jk are computed in the same way, as shown in Table 3 below:

| Diff_jk | A class | B class | C class |
|---|---|---|---|
| Word 1 | 0 | 0 | 0 |
| Word 2 | 1/10 | −1/10 | −8/10 |
| Word 3 | 7/10 | −7/10 | −9/10 |

Table 3
Calculate the class discrimination degree of feature word t_k, computed as follows:
Diff_k = max{Diff_jk}  (j = 1~s)
According to Table 3:
Diff_1 = Diff_A1 = Diff_B1 = Diff_C1 = 0
Diff_2 = Diff_A2 = 1/10
Diff_3 = Diff_A3 = 7/10
Assuming the preset threshold Q1 is 1/2, feature words 1 and 2 are now eliminated, while feature word 3 is selected and the category it represents is recorded, i.e., feature word 3 can represent class A.
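The example above can be reproduced with a short script (the variable names are introduced here for illustration); it yields exactly the fractions of Tables 2 and 3 and selects only word 3 under Q1 = 1/2:

```python
from fractions import Fraction as F

# Documents containing each word per class (Table 1), out of 10 per class
counts = {"word1": {"A": 5, "B": 5, "C": 5},
          "word2": {"A": 9, "B": 8, "C": 1},
          "word3": {"A": 10, "B": 3, "C": 1}}
n_docs = 10

# Table 2: association degrees R_jk
R = {w: {c: F(k, n_docs) for c, k in per.items()} for w, per in counts.items()}

# Diff_k = max over j of min over i != j of (R_jk - R_ik)
diff = {w: max(min(Rw[j] - Rw[i] for i in Rw if i != j) for j in Rw)
        for w, Rw in R.items()}

Q1 = F(1, 2)
selected = [w for w, d in diff.items() if d >= Q1]
print(diff["word3"], selected)   # 7/10 ['word3']
```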
Step 4: combine the intra-class distribution rate and information gain (IG) of each feature word to further screen the selected feature words with high class discrimination degree, and select the feature words with high intra-class representation degree.
Fig. 3 is the detailed algorithm flow diagram of selecting feature words with high intra-class representation degree from the selected high-discrimination feature words; the algorithm is illustrated below with an example, as follows:
Assume feature words 1, 2, and 3 are all feature words selected in step 3 that represent class A (class A contains 10 articles). Assume feature word 1 appears 100 times in total in the 10 articles of class A, feature word 2 appears 50 times in total, and feature word 3 appears 30 times in total.
Calculate the distribution rate of each feature word t_k according to the formula:
w_tk = (word frequency of t_k in category c_j) / (number of documents in category c_j)
that is:
w_1 = 100/10 = 10
w_2 = 50/10 = 5
w_3 = 30/10 = 3
Assuming the preset threshold Q2 is 7 and the preset threshold Q3 is 4:
For feature word 1 (w_1 = 10 ≥ Q2), compute IG and judge whether it is less than the preset threshold Q4; if so, it is eliminated, otherwise it is kept.
For feature word 2 (Q3 < w_2 = 5 < Q2), it is directly taken as an alternate.
For feature word 3 (w_3 = 3 ≤ Q3), compute IG and judge whether it is greater than or equal to the preset threshold Q5; if so, select the feature word, otherwise eliminate it.
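The banding in this example can be sketched as follows (hypothetical function name, with the IG test left abstract since its thresholds Q4 and Q5 are applied afterwards):

```python
def frequency_band(w, Q2, Q3):
    """Band a candidate feature word by its intra-class distribution rate w:
    high-frequency words face the IG-vs-Q4 test, low-frequency words the
    IG-vs-Q5 test, and mid-band words become alternates."""
    if w >= Q2:
        return "high"
    if w <= Q3:
        return "low"
    return "mid"

# Distribution rates of the three words from the example above
rates = {"word1": 100 / 10, "word2": 50 / 10, "word3": 30 / 10}
bands = {t: frequency_band(r, Q2=7, Q3=4) for t, r in rates.items()}
```

With Q2 = 7 and Q3 = 4 this reproduces the three branches described above: word 1 high, word 2 mid (alternate), word 3 low.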
Step 5: based on the above method, select N features (N is a preset threshold) as the text feature set of the corpus training set.
The process above is now illustrated with an application example in which the parameters are fixed.
Embodiment 1:
Assume there are 3 preset categories, class A, class B, and class C, each containing 10 articles belonging to that category. Assume feature word 1 appears in 5 of the 10 articles of class A, and also appears in 5 of the 10 articles of class B and of class C respectively. The distribution of the remaining feature words in each category is shown in Table 4 below:
| | A class (10) | B class (10) | C class (10) |
|---|---|---|---|
| Feature word 1 | 5 | 5 | 5 |
| Feature word 2 | 2 | 8 | 9 |
| Feature word 3 | 10 | 3 | 1 |
| Feature word 4 | 5 | 2 | 7 |
| Feature word 5 | 1 | 6 | 8 |
| Feature word 6 | 2 | 7 | 3 |

Table 4
According to Table 4, calculate the degree of association R_jk between each word and each preset category; the results are shown in Table 5 below:
| R_jk | A class (10) | B class (10) | C class (10) |
|---|---|---|---|
| Feature word 1 | R_A1 = 1/2 | R_B1 = 1/2 | R_C1 = 1/2 |
| Feature word 2 | R_A2 = 2/10 | R_B2 = 8/10 | R_C2 = 9/10 |
| Feature word 3 | R_A3 = 10/10 | R_B3 = 3/10 | R_C3 = 1/10 |
| Feature word 4 | R_A4 = 5/10 | R_B4 = 2/10 | R_C4 = 7/10 |
| Feature word 5 | R_A5 = 1/10 | R_B5 = 6/10 | R_C5 = 8/10 |
| Feature word 6 | R_A6 = 2/10 | R_B6 = 7/10 | R_C6 = 3/10 |

Table 5
Calculate the value Diff_jk of the class discrimination ability of each feature word t_k within each text category c_j; the results are shown in Table 6 below:

| Diff_jk | A class | B class | C class |
|---|---|---|---|
| Feature word 1 | 0 | 0 | 0 |
| Feature word 2 | −7/10 | −1/10 | 1/10 |
| Feature word 3 | 7/10 | −7/10 | −9/10 |
| Feature word 4 | −2/10 | −5/10 | 2/10 |
| Feature word 5 | −7/10 | −2/10 | 2/10 |
| Feature word 6 | −5/10 | 4/10 | −4/10 |

Table 6
Calculate the class discrimination degree of each feature word t_k. According to Table 6:
Diff_1 = Diff_A1 = Diff_B1 = Diff_C1 = 0
Diff_2 = Diff_C2 = 1/10
Diff_3 = Diff_A3 = 7/10
Diff_4 = Diff_C4 = 2/10
Diff_5 = Diff_C5 = 2/10
Diff_6 = Diff_B6 = 4/10
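Embodiment 1 up to this point can be checked end-to-end with a short script (the names are introduced here for illustration); with Q1 = 1/20 it eliminates feature word 1 and assigns words 2, 4, and 5 to class C, word 3 to class A, and word 6 to class B:

```python
from fractions import Fraction as F

# Documents containing each feature word per class (Table 4)
counts = {"w1": {"A": 5, "B": 5, "C": 5}, "w2": {"A": 2, "B": 8, "C": 9},
          "w3": {"A": 10, "B": 3, "C": 1}, "w4": {"A": 5, "B": 2, "C": 7},
          "w5": {"A": 1, "B": 6, "C": 8}, "w6": {"A": 2, "B": 7, "C": 3}}

survivors = {}   # feature word -> class it characterizes
for t, per in counts.items():
    R = {c: F(k, 10) for c, k in per.items()}                       # Table 5
    diff = {j: min(R[j] - R[i] for i in R if i != j) for j in R}    # Table 6
    best = max(diff, key=diff.get)                                  # Diff_k
    if diff[best] >= F(1, 20):                                      # threshold Q1
        survivors[t] = best
```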
Assuming the threshold Q1 is 1/20, feature word 1 is now eliminated, and the remaining feature words enter the next round of selection. In particular, feature words 2, 4, and 5 enter the next round as candidate feature words representing class C.
Assume feature word 2 appears 9 times in total in the 10 articles of class C, feature word 4 appears 40 times in total in the 10 articles of class C, and feature word 5 appears 20 times in total in the 10 articles of class C.
Calculate the distribution rate of each feature word t_k according to the formula, i.e.:
w_2 = 9/10 = 0.9
w_4 = 40/10 = 4
w_5 = 20/10 = 2
Assuming the preset threshold Q2 is 3 and the preset threshold Q3 is 1:
For feature word 2 (w_2 = 0.9 ≤ Q3), compute IG and judge whether it is greater than or equal to the preset threshold Q5; if so, select the feature word, otherwise eliminate it.
For feature word 4 (w_4 = 4 ≥ Q2), compute IG and judge whether it is less than the preset threshold Q4; if so, it is eliminated, otherwise it is kept.
For feature word 5 (Q3 < w_5 = 2 < Q2), it is directly taken as an alternate.
Assume that feature word 4 is now selected as a feature word representing class C. In the same way, feature words representing the other classes are selected; assume feature word 3 is selected to represent class A and feature word 6 to represent class B. If the preset VSM space dimension is now 3, the feature set used for training is full; if the VSM space dimension is 4, another word is selected from the alternate feature words.
The technical scheme provided by the embodiments of the present invention can select feature words with greater category representativeness and higher intra-class information content, and improve the speed of text classification.
Through the above description of the embodiments, those skilled in the art can understand the implementation of the present invention. The present invention can be realized by software programming, and the corresponding software program can be stored in a readable storage medium, such as an optical disc, a hard disk, or a removable storage medium.
The above are specific embodiments of the present invention, but they are not intended to limit the present invention. For those skilled in the art, any modification, equivalent substitution, improvement, and the like made without departing from the principle of the present invention shall be included within the protection scope of the present invention.
Claims (4)
1. A text feature extraction method based on inter-class discrimination and intra-class high representation degree, characterized by comprising the following steps:
Step 1: obtain text collections of different categories as the corpus training set.
Step 2: preprocess the texts of the corpus training set, including Chinese word segmentation and stop-word removal;
Step 3: apply the text feature extraction method based on inter-class discrimination and intra-class high representation degree to perform feature selection, selecting N features (N is a preset threshold) as the text feature set of the corpus training set.
2. The text feature extraction method based on inter-class discrimination and intra-class high representation degree as claimed in claim 1, characterized in that the method of step 3 comprises:
first calculating the class discrimination degree of each feature word and selecting the feature words with high class discrimination degree;
then combining the intra-class distribution rate and information gain (IG) of each feature word to further screen the selected feature words with high class discrimination degree, selecting the feature words with high intra-class representation degree.
3. The method as claimed in claim 2, characterized in that calculating the class discrimination degree of each feature word and selecting the feature words with high class discrimination degree specifically comprises the following steps:
Step (1): determine the degree of association between each feature word and each preset category, computed as follows:
R_jk = (number of documents in category c_j that contain feature word t_k) / (number of documents in category c_j)
where R_jk denotes the degree of association between feature word t_k and text category c_j.
Step (2): calculate the value of the class discrimination ability of feature word t_k within text category c_j, computed as follows:
Diff_jk = min(R_jk − R_ik)  (i ≠ j, i = 1~s, where s is the total number of categories)
Note that Diff_jk can be negative here; a negative value indicates that feature word t_k is less distributed in category c_j than in some category c_i.
Step (3): calculate the class discrimination degree of feature word t_k, computed as follows:
Diff_k = max{Diff_jk}  (j = 1~s)
and record the value of j for the Diff_jk corresponding to Diff_k, i.e., record the text category c_j that feature word t_k characterizes.
Step (4): set a preset threshold Q1 and take the feature words with Diff_k ≥ Q1 for further screening.
4. The method as claimed in claim 2, characterized in that combining the intra-class distribution rate and information gain (IG) of each feature word to further screen the selected feature words with high class discrimination degree specifically comprises the following steps:
Step (1): for each selected feature word with high class discrimination degree, calculate its distribution rate in the category it characterizes. Assuming feature word t_k characterizes text category c_j, the distribution rate of t_k is computed as follows:
w_tk = (word frequency of t_k in category c_j) / (number of documents in category c_j)
Step (2): set a preset threshold Q2; when w_tk ≥ Q2, feature word t_k is treated as a high-frequency word and proceeds to step (3). Set a preset threshold Q3; when w_tk ≤ Q3, feature word t_k is a low-frequency word and proceeds to step (4) for further judgment.
Step (3): compute IG for the feature words with w_tk ≥ Q2 and set a preset threshold Q4; when IG(t_k) < Q4, feature word t_k is eliminated and is not selected into the text feature set of the corpus training set.
Step (4): compute IG for the feature words with w_tk ≤ Q3 and set a threshold Q5; when IG(t_k) ≥ Q5, t_k is an effective low-frequency word and is selected into the text feature set of the corpus training set.
Step (5): assuming the dimension of the text feature set of the corpus training set is N, if the number of feature words selected above is less than N, select additional feature words from those with Q3 < w_tk < Q2, from the highest weight downward, until the set is full.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510014438.XA CN105893388B (en) | 2015-01-01 | 2015-01-01 | A kind of text feature based on characterization degree high in discrimination between class and class |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510014438.XA CN105893388B (en) | 2015-01-01 | 2015-01-01 | A kind of text feature based on characterization degree high in discrimination between class and class |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105893388A true CN105893388A (en) | 2016-08-24 |
CN105893388B CN105893388B (en) | 2019-08-23 |
Family
ID=56999237
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510014438.XA Active CN105893388B (en) | 2015-01-01 | 2015-01-01 | A kind of text feature based on characterization degree high in discrimination between class and class |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105893388B (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106503146A (en) * | 2016-10-21 | 2017-03-15 | 江苏理工学院 | The feature selection approach of computer version, characteristic of division system of selection and system |
CN106611057A (en) * | 2016-12-27 | 2017-05-03 | 上海利连信息科技有限公司 | Text classification feature selection approach for importance weighing |
CN106897428A (en) * | 2017-02-27 | 2017-06-27 | 腾讯科技(深圳)有限公司 | Text classification feature extracting method, file classification method and device |
CN107193804A (en) * | 2017-06-02 | 2017-09-22 | 河海大学 | A kind of refuse messages text feature selection method towards word and portmanteau word |
CN107590163A (en) * | 2016-07-06 | 2018-01-16 | 北京京东尚科信息技术有限公司 | The methods, devices and systems of text feature selection |
CN108346474A (en) * | 2018-03-14 | 2018-07-31 | 湖南省蓝蜻蜓网络科技有限公司 | The electronic health record feature selection approach of distribution within class and distribution between class based on word |
CN108363810A (en) * | 2018-03-09 | 2018-08-03 | 南京工业大学 | A kind of file classification method and device |
CN108491429A (en) * | 2018-02-09 | 2018-09-04 | 湖北工业大学 | A kind of feature selection approach based on document frequency and word frequency statistics between class in class |
CN108763344A (en) * | 2018-05-15 | 2018-11-06 | 南京邮电大学 | Based on information gain and maximal correlation minimal redundancy two-stage feature selection approach |
CN111061779A (en) * | 2019-12-16 | 2020-04-24 | 延安大学 | Data processing method and device based on big data platform |
US11526754B2 (en) | 2020-02-07 | 2022-12-13 | Kyndryl, Inc. | Feature generation for asset classification |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040042665A1 (en) * | 2002-08-30 | 2004-03-04 | Lockheed Martin Corporation | Method and computer program product for automatically establishing a classifiction system architecture |
CN102567308A (en) * | 2011-12-20 | 2012-07-11 | 上海电机学院 | Information processing feature extracting method |
- 2015-01-01 CN CN201510014438.XA patent/CN105893388B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040042665A1 (en) * | 2002-08-30 | 2004-03-04 | Lockheed Martin Corporation | Method and computer program product for automatically establishing a classifiction system architecture |
CN102567308A (en) * | 2011-12-20 | 2012-07-11 | 上海电机学院 | Information processing feature extracting method |
Non-Patent Citations (1)
Title |
---|
易军凯等: "基于类别区分度的文本特征选择算法研究", 《北京化工大学学报(自然科学版)》 * |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107590163A (en) * | 2016-07-06 | 2018-01-16 | 北京京东尚科信息技术有限公司 | The methods, devices and systems of text feature selection |
CN107590163B (en) * | 2016-07-06 | 2019-07-02 | 北京京东尚科信息技术有限公司 | The methods, devices and systems of text feature selection |
CN106503146A (en) * | 2016-10-21 | 2017-03-15 | 江苏理工学院 | The feature selection approach of computer version, characteristic of division system of selection and system |
CN106503146B (en) * | 2016-10-21 | 2019-06-07 | 江苏理工学院 | The feature selection approach of computer version |
CN106611057A (en) * | 2016-12-27 | 2017-05-03 | 上海利连信息科技有限公司 | Text classification feature selection approach for importance weighing |
CN106611057B (en) * | 2016-12-27 | 2019-08-13 | 上海利连信息科技有限公司 | The text classification feature selection approach of importance weighting |
CN106897428B (en) * | 2017-02-27 | 2022-08-09 | 腾讯科技(深圳)有限公司 | Text classification feature extraction method and text classification method and device |
CN106897428A (en) * | 2017-02-27 | 2017-06-27 | 腾讯科技(深圳)有限公司 | Text classification feature extracting method, file classification method and device |
CN107193804B (en) * | 2017-06-02 | 2019-03-29 | 河海大学 | A kind of refuse messages text feature selection method towards word and portmanteau word |
CN107193804A (en) * | 2017-06-02 | 2017-09-22 | 河海大学 | A kind of refuse messages text feature selection method towards word and portmanteau word |
CN108491429A (en) * | 2018-02-09 | 2018-09-04 | 湖北工业大学 | A kind of feature selection approach based on document frequency and word frequency statistics between class in class |
CN108363810B (en) * | 2018-03-09 | 2022-02-15 | 南京工业大学 | Text classification method and device |
CN108363810A (en) * | 2018-03-09 | 2018-08-03 | 南京工业大学 | A kind of file classification method and device |
CN108346474A (en) * | 2018-03-14 | 2018-07-31 | 湖南省蓝蜻蜓网络科技有限公司 | The electronic health record feature selection approach of distribution within class and distribution between class based on word |
CN108346474B (en) * | 2018-03-14 | 2021-09-28 | 湖南省蓝蜻蜓网络科技有限公司 | Electronic medical record feature selection method based on word intra-class distribution and inter-class distribution |
CN108763344B (en) * | 2018-05-15 | 2021-12-14 | 南京邮电大学 | Information gain and maximum correlation minimum redundancy two-stage feature selection method |
CN108763344A (en) * | 2018-05-15 | 2018-11-06 | 南京邮电大学 | Based on information gain and maximal correlation minimal redundancy two-stage feature selection approach |
CN111061779A (en) * | 2019-12-16 | 2020-04-24 | 延安大学 | Data processing method and device based on big data platform |
US11526754B2 (en) | 2020-02-07 | 2022-12-13 | Kyndryl, Inc. | Feature generation for asset classification |
US11748621B2 (en) | 2020-02-07 | 2023-09-05 | Kyndryl, Inc. | Methods and apparatus for feature generation using improved term frequency-inverse document frequency (TF-IDF) with deep learning for accurate cloud asset tagging |
Also Published As
Publication number | Publication date |
---|---|
CN105893388B (en) | 2019-08-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105893388A (en) | Text feature extracting method based on inter-class distinctness and intra-class high representation degree | |
Chen et al. | Application of one-class support vector machine to quickly identify multivariate anomalies from geochemical exploration data | |
CN104142918B (en) | Short text clustering and focus subject distillation method based on TF IDF features | |
WO2015085916A1 (en) | Data mining method | |
CN105893380A (en) | Improved text classification characteristic selection method | |
CN109165301A (en) | Video cover selection method, device and computer readable storage medium | |
CN110457577B (en) | Data processing method, device, equipment and computer storage medium | |
CN105760889A (en) | Efficient imbalanced data set classification method | |
CN104317784A (en) | Cross-platform user identification method and cross-platform user identification system | |
WO2018040997A1 (en) | System, method, and device for evaluating node of funnel model | |
CN104424308A (en) | Web page classification standard acquisition method and device and web page classification method and device | |
TW201833851A (en) | Risk control event automatic processing method and apparatus | |
CN110069545B (en) | Behavior data evaluation method and device | |
CN108090178A (en) | A kind of text data analysis method, device, server and storage medium | |
CN106294529A (en) | A kind of identification user's abnormal operation method and apparatus | |
CN105468731B (en) | A kind of preposition processing method of text emotion analysis signature verification | |
Ando et al. | Globalization and domestic operations: Applying the JC/JD method to Japanese manufacturing firms | |
CN110232156B (en) | Information recommendation method and device based on long text | |
CN105117466A (en) | Internet information screening system and method | |
SJ et al. | Impact of financial crisis in Asia | |
CN105574480A (en) | Information processing method and apparatus and terminal | |
CN105843608B (en) | A kind of APP user interface design pattern recommended method and system based on cluster | |
CN109800215A (en) | Method, apparatus, computer storage medium and the terminal of a kind of pair of mark processing | |
CN105224954A (en) | A kind of topic discover method removing the impact of little topic based on Single-pass | |
CN104751234B (en) | A kind of prediction technique and device of user's assets |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
PE01 | Entry into force of the registration of the contract for pledge of patent right | ||
PE01 | Entry into force of the registration of the contract for pledge of patent right |
Denomination of invention: A text feature extraction method based on inter class discrimination and high representation within class Effective date of registration: 20220407 Granted publication date: 20190823 Pledgee: Chengdu SME financing Company Limited by Guarantee Pledgor: CHENGDU WANGAN TECHNOLOGY DEVELOPMENT Co.,Ltd. Registration number: Y2022980003814 |