CN105893388A - Text feature extracting method based on inter-class distinctness and intra-class high representation degree - Google Patents

Text feature extracting method based on inter-class distinctness and intra-class high representation degree

Info

Publication number
CN105893388A
CN105893388A
Authority
CN
China
Prior art keywords
class
feature
feature words
text
degree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510014438.XA
Other languages
Chinese (zh)
Other versions
CN105893388B (en)
Inventor
黄筱聪
朱永强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHENGDU WANGAN TECHNOLOGY DEVELOPMENT Co Ltd
Original Assignee
CHENGDU WANGAN TECHNOLOGY DEVELOPMENT Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHENGDU WANGAN TECHNOLOGY DEVELOPMENT Co Ltd filed Critical CHENGDU WANGAN TECHNOLOGY DEVELOPMENT Co Ltd
Priority to CN201510014438.XA priority Critical patent/CN105893388B/en
Publication of CN105893388A publication Critical patent/CN105893388A/en
Application granted granted Critical
Publication of CN105893388B publication Critical patent/CN105893388B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text feature extraction method based on inter-class distinctness and high intra-class representation degree. The method comprises the following steps: preprocessing the training-set text; calculating the class distinctness of each feature word through an improved feature selection method so as to select feature words that are more representative of their classes, the selected feature words being highly distinct between different classes; and further screening the selected feature words of high class distinctness based on their intra-class distribution rate and information gain (IG). With this method, feature selection is carried out twice so as to select feature words of high information entropy and a high distribution rate within their class, which improves classification efficiency and accuracy; in addition, the calculation is simple, so the speed and accuracy of text classification can be improved.

Description

A text feature extraction method based on inter-class discrimination degree and high intra-class representation degree
Technical field
The invention belongs to the field of text mining technology, and in particular relates to a text feature extraction method based on inter-class discrimination degree and high intra-class representation degree.
Background technology
In the current era of rapidly growing Internet information resources, text classification technology has arisen as an important means of effectively organizing and managing text information, so that needed information and resources can be found more quickly and effectively. Text classification refers to the technique of assigning a text to one or more predefined categories according to its content or attributes. In the field of text classification, the most popular approach is to represent texts with a VSM vector space; to avoid the "curse of dimensionality" of the feature items produced when building the VSM space, the feature selection algorithm becomes all the more important.
In text classification, the traditional feature selection algorithms include the following. The DF algorithm (document frequency) focuses only on high-frequency words, so it can miss low-frequency words of high information entropy, and the words it selects do not possess the characteristic of representing a particular category. The IG algorithm (information gain), owing to the particularity of its calculation, tends not to select a sufficient number of feature words. The CHI algorithm (the χ² statistic) considers the influence of a feature word on a given category, but its amount of calculation is very large. The MI algorithm (mutual information) has the drawback that its performance is unstable under experimental conditions.
Therefore, it is necessary to design a feature word selection algorithm that can select words having both a strong inter-class discrimination degree and a high representation degree within their own class, and that requires only a small amount of calculation.
Summary of the invention
To address the deficiencies of existing text classification feature selection methods, which cannot select feature words of a high class representation degree and which involve a large amount of calculation, the present invention provides a text feature extraction method based on inter-class discrimination degree and high intra-class representation degree that requires less computation. The scheme comprises the following steps:
Step 1: obtain text collections of different categories, as the corpus training set.
Step 2: preprocess the text of the corpus training set, including Chinese word segmentation and stop-word removal;
Step 3: apply the text feature extraction method based on inter-class discrimination degree and high intra-class representation degree to perform feature selection on the text, selecting N features (N is a preset threshold) as the text feature set of the above corpus training set.
The text feature extraction method based on inter-class discrimination degree and high intra-class representation degree is specifically as follows:
First, calculate the class discrimination degree of each feature word and choose the feature words with a high class discrimination degree, comprising the following steps:
Step (1), determine the degree of association between each feature word and each preset category, calculated as follows:
R_jk = |{ d_i : t_k ∈ d_i, d_i ∈ c_j }| / |c_j|
where R_jk denotes the degree of association between feature word t_k and text category c_j; the numerator is the number of documents in category c_j that contain feature word t_k, and the denominator is the number of documents contained in category c_j.
Step (2), calculate the value of the class discrimination ability of feature word t_k within text category c_j, computed as follows:
Diff_jk = min(R_jk - R_ik)  (i != j, with i taking 1~s, where s is the total number of categories)
Note that Diff_jk can be negative here; a negative value means that feature word t_k is distributed less in text category c_j than in some other text category c_i.
Step (3), calculate the class discrimination degree of feature word t_k, computed as follows:
Diff_k = max{ Diff_jk }  (j taking 1~s)
and record the value of j of the Diff_jk corresponding to Diff_k, i.e., record the text category c_j that feature word t_k characterizes.
Step (4), set a preset threshold Q1 and take the feature words with Diff_k >= Q1 for further screening, as sketched below.
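For illustration only, the first stage described above can be sketched in Python as follows. The function names and the corpus layout (a mapping from class label to a list of tokenized documents) are assumptions made for this sketch, not part of the claimed method.

    from collections import defaultdict

    def association_degrees(corpus):
        """Compute R_jk: the fraction of documents in class c_j that contain
        feature word t_k. `corpus` maps a class label to a list of documents,
        each document being a list of its words."""
        R = defaultdict(dict)
        for cls, docs in corpus.items():
            doc_count = defaultdict(int)
            for doc in docs:
                for word in set(doc):       # count each document at most once
                    doc_count[word] += 1
            for word, df in doc_count.items():
                R[word][cls] = df / len(docs)
        for word in R:                      # a word absent from a class has R_jk = 0
            for cls in corpus:
                R[word].setdefault(cls, 0.0)
        return R

    def class_discrimination(R):
        """Diff_k = max_j Diff_jk, where Diff_jk = min over i != j of (R_jk - R_ik).
        Returns {word: (Diff_k, the class the word characterizes)}."""
        out = {}
        for word, r in R.items():
            diff = {j: min(r[j] - r[i] for i in r if i != j) for j in r}
            best = max(diff, key=diff.get)
            out[word] = (diff[best], best)
        return out

    def first_stage(corpus, q1):
        """Keep the feature words whose class discrimination degree Diff_k >= Q1."""
        diffs = class_discrimination(association_degrees(corpus))
        return {word: cls for word, (d, cls) in diffs.items() if d >= q1}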
Further, in combination with each feature word's intra-class distribution rate and information gain IG, the selected feature words of high class discrimination degree are screened again, to choose the feature words of high intra-class representation degree, specifically comprising the following steps:
Step (1), for each selected feature word of high class discrimination degree, calculate the word's distribution rate within the class it characterizes. Assuming feature word t_k characterizes text category c_j, the distribution rate of t_k is computed as follows:
w_tk = (word frequency of t_k in class c_j) / (number of documents contained in class c_j)
Step (2), set a preset threshold Q2; when w_tk >= Q2, feature word t_k is treated as a high-frequency word and proceeds to step (3). Also set a preset threshold Q3; when w_tk <= Q3, feature word t_k is a low-frequency word and proceeds to step (4) for further judgment.
Step (3), compute IG for the feature words with w_tk >= Q2 and set a preset threshold Q4; when IG(t_k) < Q4, feature word t_k is eliminated and is not selected into the text feature set of the corpus training set.
Step (4), compute IG for the feature words with w_tk <= Q3 and set a threshold Q5; when IG(t_k) >= Q5, t_k is an informative low-frequency word and is selected into the text feature set of the corpus training set.
Step (5), assume the dimension of the text feature set of the corpus training set is N; if the number of feature words taken out above is less than N, select from the feature words with Q3 < w_tk < Q2, in descending order of weight, until the set is full. A sketch of this second stage follows.
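Continuing the sketch, the second stage can be written as below. The patent does not restate the IG formula, so the standard text-classification information gain, IG(t) = H(C) - P(t)H(C|t) - P(~t)H(C|~t), is assumed here; treating high-frequency words that pass the Q4 test as selected follows the reading of steps (3)-(5) above.

    import math

    def information_gain(word, corpus):
        """Assumed standard IG: IG(t) = H(C) - P(t)H(C|t) - P(~t)H(C|~t)."""
        def entropy(probs):
            return -sum(p * math.log2(p) for p in probs if p > 0)

        n_total = sum(len(docs) for docs in corpus.values())
        n_with = {c: sum(1 for d in docs if word in d) for c, docs in corpus.items()}
        total_with = sum(n_with.values())
        total_without = n_total - total_with
        h_c = entropy([len(docs) / n_total for docs in corpus.values()])
        h_with = (entropy([n_with[c] / total_with for c in corpus])
                  if total_with else 0.0)
        h_without = (entropy([(len(corpus[c]) - n_with[c]) / total_without
                              for c in corpus]) if total_without else 0.0)
        return (h_c - total_with / n_total * h_with
                    - total_without / n_total * h_without)

    def second_stage(stage_one, corpus, q2, q3, q4, q5):
        """Screen the first-stage survivors ({word: class}) by intra-class
        distribution rate and IG. Returns (selected, candidate_pool); the pool
        holds the Q3 < w_tk < Q2 words, sorted by w_tk descending, and is used
        to pad the feature set up to dimension N."""
        selected, pool = [], []
        for word, cls in stage_one.items():
            docs = corpus[cls]
            w_tk = sum(doc.count(word) for doc in docs) / len(docs)
            if w_tk >= q2:                                # high-frequency word
                if information_gain(word, corpus) >= q4:  # eliminated if IG < Q4
                    selected.append(word)
            elif w_tk <= q3:                              # low-frequency word
                if information_gain(word, corpus) >= q5:  # informative low-frequency word
                    selected.append(word)
            else:                                         # Q3 < w_tk < Q2: candidate
                pool.append((w_tk, word))
        pool.sort(reverse=True)
        return selected, [word for _, word in pool]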
The beneficial effects of the technical scheme provided by the present invention are as follows:
First, by calculating the class discrimination degree of each feature word, the feature words that are more representative of their categories are chosen, so that they are highly distinguishable between the different classes; then, by screening the selected feature words of high class discrimination degree further in combination with their intra-class distribution rate and information gain IG, the feature words of high information entropy and a high distribution rate within their class are selected. In addition, the calculation of this technical scheme is simple, which improves the speed and efficiency of text classification.
Accompanying drawing explanation
Fig. 1 is the flow chart of the text feature extraction method based on inter-class discrimination degree and high intra-class representation degree of the present invention.
Fig. 2 is a detailed flow diagram of the algorithm of the present invention for selecting words of high inter-class discrimination.
Fig. 3 is a detailed flow diagram of the algorithm of the present invention for selecting words of high intra-class representation degree from among the selected feature words of high inter-class discrimination.
Detailed description of the invention
To make the purpose, technical scheme and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and a concrete case.
Fig. 1 is the flow chart of the text feature extraction method of the present invention based on inter-class discrimination degree and high intra-class representation degree; the concrete functions are realized as follows:
Step 1: first, use a web crawler, or collect manually, a certain quantity of representative articles in multiple fields from the Internet; analyze and sort these articles and place them into the corpus training set by category, as the training sample set of the text classification system.
Step 2: in order to extract words that can represent the text's features from the text, perform word segmentation, stop-word removal and similar processing on it.
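As a minimal illustration of this preprocessing step, the sketch below uses the jieba segmenter and a user-supplied stop-word file; both are assumed choices, since the patent only requires Chinese word segmentation and stop-word removal.

    import jieba  # a common Chinese word segmentation library (an assumed choice)

    def preprocess(text, stopwords):
        """Segment a Chinese text and drop stop words and whitespace tokens."""
        return [tok for tok in jieba.cut(text)
                if tok.strip() and tok not in stopwords]

    # Hypothetical usage, building the tokenized corpus used by the later steps:
    # stopwords = set(open("stopwords.txt", encoding="utf-8").read().split())
    # corpus = {label: [preprocess(article, stopwords) for article in articles]
    #           for label, articles in raw_articles.items()}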
Step 3: from the preprocessed text, choose the feature words with a high class discrimination degree, specifically as follows:
Fig. 2 is a detailed flow diagram of the algorithm for selecting words of high inter-class discrimination; the algorithm is explained below with reference to the drawing and an example, specifically as follows:
Assume there are 3 preset categories, namely class A, class B and class C, each of which contains 10 articles belonging to its own category. Assume that feature word 1 appears in 5 of the 10 articles belonging to class A, and also appears in 5 of the 10 articles belonging to each of classes B and C. Feature word 2 appears in 9 of the 10 articles belonging to class A, in 8 of the 10 articles belonging to class B, and in 1 of the 10 articles belonging to class C. Feature word 3 appears in all 10 of the 10 articles belonging to class A, in 3 of the 10 articles belonging to class B, and in 1 of the 10 articles belonging to class C, as shown in Table 1 below:
         A class (10)   B class (10)   C class (10)
Word 1   5              5              5
Word 2   9              8              1
Word 3   10             3              1
Table 1
The degree of association R_jk between each word and each preset category is calculated according to the following formula:
R_jk = |{ d_i : t_k ∈ d_i, d_i ∈ c_j }| / |c_j|
where R_jk denotes the degree of association between feature word t_k and text category c_j; the numerator is the number of documents in category c_j that contain feature word t_k, and the denominator is the number of documents contained in category c_j.
The calculation results are shown in Table 2 below.
R_jk     A class        B class        C class
Word 1   R_A1 = 1/2     R_B1 = 1/2     R_C1 = 1/2
Word 2   R_A2 = 9/10    R_B2 = 8/10    R_C2 = 1/10
Word 3   R_A3 = 10/10   R_B3 = 3/10    R_C3 = 1/10
Table 2
Calculate the value of the class discrimination ability of feature word t_k within text category c_j, computed as follows:
Diff_jk = min(R_jk - R_ik)  (i != j, with i taking 1~s, where s is the total number of categories)
Diff_A1 = min{ (1/2 - 1/2), (1/2 - 1/2) } = 0
Similarly, Diff_B1 = 0 and Diff_C1 = 0; proceeding in the same way, the Diff_jk values are calculated as shown in Table 3 below:
Diff_jk   A class           B class           C class
Word 1    Diff_A1 = 0       Diff_B1 = 0       Diff_C1 = 0
Word 2    Diff_A2 = 1/10    Diff_B2 = -1/10   Diff_C2 = -8/10
Word 3    Diff_A3 = 7/10    Diff_B3 = -7/10   Diff_C3 = -9/10
Table 3
Calculate the class discrimination degree of feature word t_k, computed as follows:
Diff_k = max{ Diff_jk }  (j taking 1~s)
According to Table 3:
Diff_1 = Diff_A1 = Diff_B1 = Diff_C1 = 0
Diff_2 = Diff_A2 = 1/10
Diff_3 = Diff_A3 = 7/10
Assuming the preset threshold Q1 is 1/2, feature words 1 and 2 are now eliminated and feature word 3 is selected, and the class it represents is recorded, i.e., feature word 3 can represent class A.
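The figures in Tables 1-3 can be checked mechanically; the self-contained snippet below (an illustration of ours, not part of the patent) reproduces R_jk, Diff_jk and Diff_k for the three words.

    # Document frequencies from Table 1: word -> {class: documents containing it}
    table1 = {"word 1": {"A": 5, "B": 5, "C": 5},
              "word 2": {"A": 9, "B": 8, "C": 1},
              "word 3": {"A": 10, "B": 3, "C": 1}}

    for word, counts in table1.items():
        R = {c: n / 10 for c, n in counts.items()}                    # Table 2
        diff = {j: min(R[j] - R[i] for i in R if i != j) for j in R}  # Table 3
        best = max(diff, key=diff.get)
        print(f"{word}: Diff_k = {diff[best]:+.1f}, characterizes class {best}")

    # word 1: 0.0; word 2: +0.1 (class A); word 3: +0.7 (class A).
    # With Q1 = 1/2, only word 3 survives, representing class A.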
Step 4: in combination with each feature word's intra-class distribution rate and information gain IG, screen the selected feature words of high class discrimination degree further, to choose the feature words of high intra-class representation degree.
Fig. 3 is a detailed flow diagram of the algorithm for selecting words of high intra-class representation degree from among the selected feature words of high inter-class discrimination; the algorithm is explained below with reference to the drawing and an example, specifically as follows:
Assume that feature word 1, feature word 2 and feature word 3 are all feature words selected in step 3 as representing class A (class A comprises 10 articles). Assume that feature word 1 appears 100 times altogether in the 10 articles of class A, feature word 2 appears 50 times altogether in the 10 articles of class A, and feature word 3 appears 30 times altogether in the 10 articles of class A.
The distribution rate of feature word t_k is calculated according to the formula:
w_tk = (word frequency of t_k in class c_j) / (number of documents contained in class c_j)
i.e. w_1 = 100/10 = 10
w_2 = 50/10 = 5
w_3 = 30/10 = 3
Assume the preset threshold Q2 is 7 and the preset threshold Q3 is 4:
For feature word 1 (w_1 >= Q2), compute IG and judge whether it is less than the preset threshold Q4; if so, it is eliminated, otherwise it is selected.
For feature word 2 (Q3 < w_2 < Q2), it directly becomes a candidate.
For feature word 3 (w_3 <= Q3), compute IG and judge whether it is greater than or equal to the preset threshold Q5; if so, this feature word is selected, otherwise it is eliminated.
Step 5: based on the above method, select N features (N is a preset threshold) as the text feature set of the above corpus training set, padding from the candidates where necessary, as sketched below.
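A sketch of this padding rule, reusing the output of the second-stage sketch above (the candidate pool is assumed to be already sorted by distribution rate, descending):

    def final_feature_set(selected, candidate_pool, n):
        """Pad the selected feature words from the candidate pool until the
        text feature set reaches dimension N (or the pool runs out)."""
        features = list(selected)
        features.extend(candidate_pool[:max(0, n - len(features))])
        return features[:n]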
Below, the above process is illustrated through an application example with specific parameter values.
Embodiment 1:
Assume there are 3 preset categories, namely class A, class B and class C, each of which contains 10 articles belonging to its own category. Assume that feature word 1 appears in 5 of the 10 articles belonging to class A, and also appears in 5 of the 10 articles belonging to each of classes B and C. The distribution of the remaining feature words in the various categories is shown in Table 4 below:
                 A class (10)   B class (10)   C class (10)
Feature word 1   5              5              5
Feature word 2   2              8              9
Feature word 3   10             3              1
Feature word 4   5              2              7
Feature word 5   1              6              8
Feature word 6   2              7              3
Table 4
According to Table 4, the degree of association R_jk between each word and each preset category is calculated; the calculation results are shown in Table 5 below:
R_jk             A class (10)   B class (10)   C class (10)
Feature word 1   R_A1 = 1/2     R_B1 = 1/2     R_C1 = 1/2
Feature word 2   R_A2 = 2/10    R_B2 = 8/10    R_C2 = 9/10
Feature word 3   R_A3 = 10/10   R_B3 = 3/10    R_C3 = 1/10
Feature word 4   R_A4 = 5/10    R_B4 = 2/10    R_C4 = 7/10
Feature word 5   R_A5 = 1/10    R_B5 = 6/10    R_C5 = 8/10
Feature word 6   R_A6 = 2/10    R_B6 = 7/10    R_C6 = 3/10
Table 5
Calculate the value Diff_jk of the class discrimination ability of feature word t_k within text category c_j; the calculation results are shown in Table 6 below:
Diff_jk          A class           B class           C class
Feature word 1   Diff_A1 = 0       Diff_B1 = 0       Diff_C1 = 0
Feature word 2   Diff_A2 = -7/10   Diff_B2 = -1/10   Diff_C2 = 1/10
Feature word 3   Diff_A3 = 7/10    Diff_B3 = -7/10   Diff_C3 = -9/10
Feature word 4   Diff_A4 = -2/10   Diff_B4 = -5/10   Diff_C4 = 2/10
Feature word 5   Diff_A5 = -7/10   Diff_B5 = -2/10   Diff_C5 = 2/10
Feature word 6   Diff_A6 = -5/10   Diff_B6 = 4/10    Diff_C6 = -4/10
Table 6
Calculate the class discrimination degree of feature word t_k; according to Table 6:
Diff_1 = Diff_A1 = Diff_B1 = Diff_C1 = 0
Diff_2 = Diff_C2 = 1/10
Diff_3 = Diff_A3 = 7/10
Diff_4 = Diff_C4 = 2/10
Diff_5 = Diff_C5 = 2/10
Diff_6 = Diff_B6 = 4/10
Assuming the threshold Q1 is 1/20, feature word 1 is now eliminated and the remaining words enter the next feature word selection step; in particular, feature words 2, 4 and 5 enter it as candidate feature words representing class C.
Assume that feature word 2 appears 9 times altogether in the 10 articles of class C, feature word 4 appears 40 times altogether in the 10 articles of class C, and feature word 5 appears 20 times altogether in the 10 articles of class C.
The distribution rate of feature word t_k is calculated according to the formula, i.e.
w_2 = 9/10 = 0.9
w_4 = 40/10 = 4
w_5 = 20/10 = 2
Assume the preset threshold Q2 is 3 and the preset threshold Q3 is 1:
For feature word 2 (w_2 <= Q3), compute IG and judge whether it is greater than or equal to the preset threshold Q5; if so, this feature word is selected, otherwise it is eliminated.
For feature word 4 (w_4 >= Q2), compute IG and judge whether it is less than the preset threshold Q4; if so, it is eliminated, otherwise it is selected.
For feature word 5 (Q3 < w_5 < Q2), it directly becomes a candidate.
Assume that feature word 4 is now selected as a feature word representing class C. The same method is applied to the other classes to select the feature words representing their categories; assume that feature word 3 is selected to represent class A and feature word 6 is selected to represent class B. If the preset VSM space dimension is 3, the training text feature set is now full; if the VSM space dimension is 4, one more word is selected from among the candidate feature words.
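As with the first example, the arithmetic of Embodiment 1 can be reproduced with a short self-contained snippet (again an illustration of ours); it recomputes Tables 5 and 6 and the Diff_k values, and applies Q1 = 1/20.

    # Document frequencies from Table 4: word -> {class: documents containing it}
    table4 = {1: {"A": 5, "B": 5, "C": 5}, 2: {"A": 2, "B": 8, "C": 9},
              3: {"A": 10, "B": 3, "C": 1}, 4: {"A": 5, "B": 2, "C": 7},
              5: {"A": 1, "B": 6, "C": 8}, 6: {"A": 2, "B": 7, "C": 3}}

    Q1 = 1 / 20
    for k, counts in table4.items():
        R = {c: n / 10 for c, n in counts.items()}                    # Table 5
        diff = {j: min(R[j] - R[i] for i in R if i != j) for j in R}  # Table 6
        best = max(diff, key=diff.get)
        status = "kept" if diff[best] >= Q1 else "eliminated"
        print(f"feature word {k}: Diff_k = {diff[best]:+.2f} ({best} class), {status}")

    # Feature word 1 is eliminated; words 2, 4 and 5 represent class C,
    # word 3 represents class A, and word 6 represents class B.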
The technical scheme provided by the embodiments of the present invention can select feature words of greater category representativeness and of higher information content within their class, and it improves the speed of text classification.
From the above description of the embodiments, those skilled in the art can understand the implementation of the present invention. The present invention can be realized by software programming, and the corresponding software program can be stored in a readable storage medium, such as an optical disc, a hard disk or a removable storage medium.
The above are specific embodiments of the present invention, but they are not intended to limit the present invention; for those skilled in the art, any modification, equivalent substitution, improvement and the like made without departing from the principle of the present invention shall be included within the protection scope of the present invention.

Claims (4)

1. A text feature extraction method based on inter-class discrimination degree and high intra-class representation degree, characterized in that it specifically comprises the following steps:
Step 1: obtain text collections of different categories, as the corpus training set.
Step 2: preprocess the text of the corpus training set, including Chinese word segmentation and stop-word removal;
Step 3: apply the text feature extraction method based on inter-class discrimination degree and high intra-class representation degree to perform feature selection on the text, selecting N features (N is a preset threshold) as the text feature set of the above corpus training set.
2. The text feature extraction method based on inter-class discrimination degree and high intra-class representation degree as claimed in claim 1, wherein step 3 performs feature selection on the text and selects N features (N is a preset threshold) as the text feature set of the above corpus training set, characterized in that the method comprises:
first calculating the class discrimination degree of each feature word and choosing the feature words with a high class discrimination degree;
then, in combination with each feature word's intra-class distribution rate and information gain IG, screening the selected feature words of high class discrimination degree further, to choose the feature words of high intra-class representation degree.
3. The method as claimed in claim 2, characterized in that calculating the class discrimination degree of each feature word and choosing the feature words with a high class discrimination degree specifically comprises the following steps:
Step (1), determine the degree of association between each feature word and each preset category, calculated as follows:
R_jk = |{ d_i : t_k ∈ d_i, d_i ∈ c_j }| / |c_j|
where R_jk denotes the degree of association between feature word t_k and text category c_j; the numerator is the number of documents in category c_j that contain feature word t_k, and the denominator is the number of documents contained in category c_j.
Step (2), calculate the value of the class discrimination ability of feature word t_k within text category c_j, computed as follows:
Diff_jk = min(R_jk - R_ik)  (i != j, with i taking 1~s, where s is the total number of categories)
Note that Diff_jk can be negative here; a negative value means that feature word t_k is distributed less in text category c_j than in some other text category c_i.
Step (3), calculate the class discrimination degree of feature word t_k, computed as follows:
Diff_k = max{ Diff_jk }  (j taking 1~s)
and record the value of j of the Diff_jk corresponding to Diff_k, i.e., record the text category c_j that feature word t_k characterizes.
Step (4), set a preset threshold Q1 and take the feature words with Diff_k >= Q1 for further screening.
4. The method as claimed in claim 2, characterized in that screening the selected feature words of high class discrimination degree further, in combination with each feature word's intra-class distribution rate and information gain IG, to choose the feature words of high intra-class representation degree specifically comprises the following steps:
Step (1), for each selected feature word of high class discrimination degree, calculate the word's distribution rate within the class it characterizes; assuming feature word t_k characterizes text category c_j, the distribution rate of t_k is computed as follows:
w_tk = (word frequency of t_k in class c_j) / (number of documents contained in class c_j)
Step (2), set a preset threshold Q2; when w_tk >= Q2, feature word t_k is treated as a high-frequency word and proceeds to step (3). Also set a preset threshold Q3; when w_tk <= Q3, feature word t_k is a low-frequency word and proceeds to step (4) for further judgment.
Step (3), compute IG for the feature words with w_tk >= Q2 and set a preset threshold Q4; when IG(t_k) < Q4, feature word t_k is eliminated and is not selected into the text feature set of the corpus training set.
Step (4), compute IG for the feature words with w_tk <= Q3 and set a threshold Q5; when IG(t_k) >= Q5, t_k is an informative low-frequency word and is selected into the text feature set of the corpus training set.
Step (5), assume the dimension of the text feature set of the corpus training set is N; if the number of feature words taken out above is less than N, select from the feature words with Q3 < w_tk < Q2, in descending order of weight, until the set is full.
CN201510014438.XA 2015-01-01 2015-01-01 A text feature extraction method based on inter-class discrimination degree and high intra-class representation degree Active CN105893388B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510014438.XA CN105893388B (en) 2015-01-01 2015-01-01 A text feature extraction method based on inter-class discrimination degree and high intra-class representation degree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510014438.XA CN105893388B (en) 2015-01-01 2015-01-01 A text feature extraction method based on inter-class discrimination degree and high intra-class representation degree

Publications (2)

Publication Number Publication Date
CN105893388A true CN105893388A (en) 2016-08-24
CN105893388B CN105893388B (en) 2019-08-23

Family

ID=56999237

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510014438.XA Active CN105893388B (en) 2015-01-01 2015-01-01 A text feature extraction method based on inter-class discrimination degree and high intra-class representation degree

Country Status (1)

Country Link
CN (1) CN105893388B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503146A (en) * 2016-10-21 2017-03-15 江苏理工学院 Feature selection method for computer text, classification feature selection method and system
CN106611057A (en) * 2016-12-27 2017-05-03 上海利连信息科技有限公司 Text classification feature selection method with importance weighting
CN106897428A (en) * 2017-02-27 2017-06-27 腾讯科技(深圳)有限公司 Text classification feature extraction method, text classification method and device
CN107193804A (en) * 2017-06-02 2017-09-22 河海大学 Spam message text feature selection method oriented to words and compound words
CN107590163A (en) * 2016-07-06 2018-01-16 北京京东尚科信息技术有限公司 Method, device and system for text feature selection
CN108346474A (en) * 2018-03-14 2018-07-31 湖南省蓝蜻蜓网络科技有限公司 Electronic health record feature selection method based on intra-class and inter-class distribution of words
CN108363810A (en) * 2018-03-09 2018-08-03 南京工业大学 Text classification method and device
CN108491429A (en) * 2018-02-09 2018-09-04 湖北工业大学 Feature selection method based on intra-class and inter-class document frequency and word frequency statistics
CN108763344A (en) * 2018-05-15 2018-11-06 南京邮电大学 Two-stage feature selection method based on information gain and maximum relevance minimum redundancy
CN111061779A (en) * 2019-12-16 2020-04-24 延安大学 Data processing method and device based on big data platform
US11526754B2 (en) 2020-02-07 2022-12-13 Kyndryl, Inc. Feature generation for asset classification

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040042665A1 (en) * 2002-08-30 2004-03-04 Lockheed Martin Corporation Method and computer program product for automatically establishing a classification system architecture
CN102567308A (en) * 2011-12-20 2012-07-11 上海电机学院 Information processing feature extracting method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040042665A1 (en) * 2002-08-30 2004-03-04 Lockheed Martin Corporation Method and computer program product for automatically establishing a classification system architecture
CN102567308A (en) * 2011-12-20 2012-07-11 上海电机学院 Information processing feature extracting method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
易军凯 et al.: "Research on a text feature selection algorithm based on category discrimination degree" (基于类别区分度的文本特征选择算法研究), 《北京化工大学学报(自然科学版)》 (Journal of Beijing University of Chemical Technology, Natural Science Edition) *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590163A (en) * 2016-07-06 2018-01-16 北京京东尚科信息技术有限公司 Method, device and system for text feature selection
CN107590163B (en) * 2016-07-06 2019-07-02 北京京东尚科信息技术有限公司 Method, device and system for text feature selection
CN106503146A (en) * 2016-10-21 2017-03-15 江苏理工学院 Feature selection method for computer text, classification feature selection method and system
CN106503146B (en) * 2016-10-21 2019-06-07 江苏理工学院 Feature selection method for computer text
CN106611057A (en) * 2016-12-27 2017-05-03 上海利连信息科技有限公司 Text classification feature selection method with importance weighting
CN106611057B (en) * 2016-12-27 2019-08-13 上海利连信息科技有限公司 Text classification feature selection method with importance weighting
CN106897428B (en) * 2017-02-27 2022-08-09 腾讯科技(深圳)有限公司 Text classification feature extraction method and text classification method and device
CN106897428A (en) * 2017-02-27 2017-06-27 腾讯科技(深圳)有限公司 Text classification feature extraction method, text classification method and device
CN107193804B (en) * 2017-06-02 2019-03-29 河海大学 Spam message text feature selection method oriented to words and compound words
CN107193804A (en) * 2017-06-02 2017-09-22 河海大学 Spam message text feature selection method oriented to words and compound words
CN108491429A (en) * 2018-02-09 2018-09-04 湖北工业大学 Feature selection method based on intra-class and inter-class document frequency and word frequency statistics
CN108363810B (en) * 2018-03-09 2022-02-15 南京工业大学 Text classification method and device
CN108363810A (en) * 2018-03-09 2018-08-03 南京工业大学 Text classification method and device
CN108346474A (en) * 2018-03-14 2018-07-31 湖南省蓝蜻蜓网络科技有限公司 Electronic health record feature selection method based on intra-class and inter-class distribution of words
CN108346474B (en) * 2018-03-14 2021-09-28 湖南省蓝蜻蜓网络科技有限公司 Electronic medical record feature selection method based on word intra-class distribution and inter-class distribution
CN108763344B (en) * 2018-05-15 2021-12-14 南京邮电大学 Information gain and maximum correlation minimum redundancy two-stage feature selection method
CN108763344A (en) * 2018-05-15 2018-11-06 南京邮电大学 Two-stage feature selection method based on information gain and maximum relevance minimum redundancy
CN111061779A (en) * 2019-12-16 2020-04-24 延安大学 Data processing method and device based on big data platform
US11526754B2 (en) 2020-02-07 2022-12-13 Kyndryl, Inc. Feature generation for asset classification
US11748621B2 (en) 2020-02-07 2023-09-05 Kyndryl, Inc. Methods and apparatus for feature generation using improved term frequency-inverse document frequency (TF-IDF) with deep learning for accurate cloud asset tagging

Also Published As

Publication number Publication date
CN105893388B (en) 2019-08-23

Similar Documents

Publication Publication Date Title
CN105893388A (en) Text feature extracting method based on inter-class distinctness and intra-class high representation degree
Chen et al. Application of one-class support vector machine to quickly identify multivariate anomalies from geochemical exploration data
CN104142918B (en) Short text clustering and hot topic extraction method based on TF-IDF features
WO2015085916A1 (en) Data mining method
CN105893380A (en) An improved text classification feature selection method
CN109165301A (en) Video cover selection method, device and computer readable storage medium
CN110457577B (en) Data processing method, device, equipment and computer storage medium
CN105760889A (en) Efficient imbalanced data set classification method
CN104317784A (en) Cross-platform user identification method and cross-platform user identification system
WO2018040997A1 (en) System, method, and device for evaluating node of funnel model
CN104424308A (en) Web page classification standard acquisition method and device and web page classification method and device
TW201833851A (en) Risk control event automatic processing method and apparatus
CN110069545B (en) Behavior data evaluation method and device
CN108090178A (en) Text data analysis method, device, server and storage medium
CN106294529A (en) Method and apparatus for identifying abnormal user operations
CN105468731B (en) Preprocessing method for feature verification in text sentiment analysis
Ando et al. Globalization and domestic operations: Applying the JC/JD method to Japanese manufacturing firms
CN110232156B (en) Information recommendation method and device based on long text
CN105117466A (en) Internet information screening system and method
SJ et al. Impact of financial crisis in Asia
CN105574480A (en) Information processing method and apparatus and terminal
CN105843608B (en) Clustering-based APP user interface design pattern recommendation method and system
CN109800215A (en) Method, apparatus, computer storage medium and terminal for mark processing
CN105224954A (en) A Single-pass-based topic discovery method that removes the influence of small topics
CN104751234B (en) Method and device for predicting user assets

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A text feature extraction method based on inter-class discrimination and high intra-class representation

Effective date of registration: 20220407

Granted publication date: 20190823

Pledgee: Chengdu SME financing Company Limited by Guarantee

Pledgor: CHENGDU WANGAN TECHNOLOGY DEVELOPMENT Co.,Ltd.

Registration number: Y2022980003814