CN103106275A - Text classification character screening method based on character distribution information - Google Patents
- Publication number
- CN103106275A (application CN201310050583A)
- Authority
- CN
- China
- Prior art keywords
- feature
- character
- classification
- document
- class
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses a text classification feature screening method based on feature distribution information, addressing the poor accuracy of existing feature screening methods. The technical scheme is: first preprocess each document in the document set; represent the whole document collection as a vector space model (VSM); construct a feature dictionary; count the document frequency DF(t, Ci), i.e. the number of documents in each class Ci that contain feature t; compute the normalized tf*idf value of each feature for each class Ci, then compute its intra-class dispersion DIntra and average inter-class dispersion DInterAvg in each class Ci; compute the weight wi(t) of each feature tk in each class Ci of the text feature space; finally, sort all features in descending order of their weight over the whole document set and preferentially retain the top-ranked features during feature screening. By improving a feature distribution scheme and applying it to the feature screening process, the method improves text classification efficiency and accuracy.
Description
Technical field
The present invention relates to a text classification feature selection method, and in particular to a text classification feature selection method based on feature distribution information.
Background technology
With the development of communication technology and networks, large volumes of electronic documents such as news, e-mail and microblogs are generated on the Internet every day. Automatic text classification, as an efficient method for managing large document collections by category, is widely used in many fields.
With the explosive growth of information, the main problem facing automatic text classification is how to handle the high-dimensional text vector feature space produced by large volumes of text data. An overly high-dimensional feature space harms text classification methods in two ways: (1) many mature methods cannot be optimized in a high-dimensional space and therefore cannot be applied to text classification; (2) because the classifier is trained on a training set, an overly high-dimensional vector space inevitably causes overfitting [1]. Moreover, in the text vector space most dimensions are irrelevant to classification, and some are even noise that degrades classification precision [2].
Text feature screening selects, according to some feature selection algorithm, a more representative subset of text features from the original feature space to form a new, lower-dimensional feature space, thereby achieving dimensionality reduction. It is an effective way to address the excessive dimensionality of the text vector feature space in classification. The purpose of text feature screening is to improve both the efficiency of text classification and the execution efficiency of the algorithm. Many experiments have shown that, in most cases, aggressively reducing the feature space yields a large performance gain at a small loss in classification precision [3].
Existing feature selection algorithms for text classification mainly include document frequency (DF), information gain (IG), information gain ratio (GR), the chi-square test (CHI), mutual information (MI) and the Gini index [3,4]. Several of the techniques that perform well in text classification are briefly introduced below:
Document frequency (DF): for a given feature t, the document frequency is the number of documents in the collection that contain t. Its basic assumption is that rare features are unhelpful for class prediction, or at least do not affect overall performance. Advantages: it is simple to implement and cheap to compute, so feature selection is fast, and it works well in practice. Disadvantage: a feature that is rare overall may not be rare within a particular class and may still carry important class information; simply discarding it can hurt classification, so DF should not be used to discard features aggressively.
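As a reference point, document frequency can be sketched in a few lines; the toy documents below are illustrative and not taken from the patent:

```python
from collections import defaultdict

def document_frequency(docs):
    """Count, for each term t, the number of documents containing t."""
    df = defaultdict(int)
    for doc in docs:
        for term in set(doc.split()):  # set(): count each document at most once
            df[term] += 1
    return dict(df)

docs = ["basketball game", "basketball sport", "music concert"]
df = document_frequency(docs)
# "basketball" appears in 2 documents, "music" in 1
```

A DF-based screen would then keep only the terms whose count exceeds some threshold, which is exactly where the rare-but-informative features mentioned above get lost.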
Information gain (IG): an entropy-based evaluation method. For a given feature t, compare the amount of information in the system when t is considered and when it is not; the difference is the amount of information the feature contributes to the system, i.e. the gain [5]. Information gain considers both the presence and the absence of a feature. However, experiments show that on imbalanced data sets, for rare classes, the contribution of a feature's absence to judging the text category is often much smaller than the interference introduced by the case where the feature is present.
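That definition can be sketched directly; the toy corpus and labels below are illustrative, not from the patent:

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class-count vector."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t) * H(C | t present) - P(not t) * H(C | t absent)."""
    classes = sorted(set(labels))
    with_t = [l for d, l in zip(docs, labels) if term in d.split()]
    without_t = [l for d, l in zip(docs, labels) if term not in d.split()]
    h_c = entropy([labels.count(c) for c in classes])
    h_with = entropy([with_t.count(c) for c in classes]) if with_t else 0.0
    h_without = entropy([without_t.count(c) for c in classes]) if without_t else 0.0
    n = len(docs)
    return h_c - len(with_t) / n * h_with - len(without_t) / n * h_without

docs = ["basketball game", "basketball sport", "music concert", "music art"]
labels = ["PE", "PE", "MUSIC", "MUSIC"]
# "basketball" splits the two classes perfectly, so IG equals H(C) = 1 bit
```

Note that both conditional entropies are weighted in, which is exactly the "absence" term the paragraph above says can hurt on imbalanced data.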
Information gain ratio (GR): information gain has been shown to be biased in many results. Because attributes with many distinct values are too richly represented in the training set, information gain tends to prefer such attributes; the information gain ratio corrects this shortcoming of information gain [6].
Chi-square test (CHI): the chi-square test is a method commonly used in mathematical statistics to test the independence of two variables; its most basic idea is to judge the correctness of a hypothesis by the deviation between observed and theoretical values [7,8].
Text classification experiments show that, used for feature selection, the chi-square test performs among the best. However, it only records whether feature t occurs in a text, not how many times t occurs, so it tends to exaggerate the importance of low-frequency words: the well-known "low-frequency word defect" of the chi-square test.
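A minimal sketch of the CHI score from a 2x2 contingency table makes the defect visible: only document counts enter the formula, never within-document frequencies. The variable names are illustrative:

```python
def chi_square(a, b, c, d):
    """CHI(t, cls) from a 2x2 contingency table of document counts:
    a: docs in cls containing t,     b: docs outside cls containing t,
    c: docs in cls without t,        d: docs outside cls without t.
    Note that only presence/absence counts appear: term frequency is ignored,
    which is the source of the "low-frequency word defect"."""
    n = a + b + c + d
    num = n * (a * d - c * b) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

# Perfect association between term and class gives the maximum score N:
perfect = chi_square(3, 0, 0, 3)     # = 6.0 for N = 6
# A term distributed independently of the class scores 0:
independent = chi_square(1, 1, 1, 1)  # = 0.0
```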
The present invention improves the inter-class dispersion computation on the basis of the feature distribution scheme [9] and applies this scheme to the feature screening process.
References:
[1] Jieming Yang, Yuanning Liu, Xiaodong Zhu et al., A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization, Information Processing & Management, Volume 48, Issue 4, 2012, pp. 741-754.
[2] Wenqian Shang, Houkuan Huang, Haibin Zhu et al., A novel feature selection algorithm for text classification, Expert Systems with Applications, Volume 33, Issue 1, 2007, pp. 1-5.
[3] Monica Rogati and Yiming Yang, High-performing feature selection for text classification. In Proceedings of the eleventh international conference on Information and knowledge management (CIKM '02), ACM, New York, NY, USA, 2002, pp. 659-661.
[4] Yang, Y., Pedersen, J.O., A Comparative Study on Feature Selection in Text Classification. In Proceedings of the 14th international conference on machine learning, Nashville, USA, 1997, pp. 412-420.
[5] Forman, G., An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research, 3, 2003, pp. 1289-1305.
[6] Tatsunori Mori, Miwa Kikuchi and Kazufumi Yoshida, Term Weighting Method based on Information Gain Ratio for Summarizing Documents retrieved by IR systems. Journal of Natural Language Processing, 9(4), 2001, pp. 3-32.
[7] Zheng, Z., Srihari, R., Optimally Combining Positive and Negative Features for Text Classification. ICML 2003 Workshop on Learning from Imbalanced Data Sets, 2003.
[8] Luigi Galavotti, Fabrizio Sebastiani et al., Feature Selection and Negative Evidence in Automated Text Classification. In Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries (ECDL '00), 2000.
[9] V. Lertnattee, T. Theeramunkong, Improving centroid-based text classification using term-distribution-based weighting and feature selection. In Proceedings of INTECH-01, 2nd International Conference on Intelligent Technologies, Bangkok, Thailand, 2001, pp. 349-355.
Summary of the invention
To overcome the poor accuracy of existing text classification feature screening methods, the invention provides a feature screening method based on feature distribution information. On the basis of the feature distribution scheme, the method improves the inter-class dispersion computation and applies the scheme to the feature screening process. The method makes full use of the tf*idf information and the intra-class and inter-class distribution information of text features, reflects the importance of a feature term in the text more objectively, and thus selects the feature terms that best represent the text, achieving the goal of feature screening and improving text classification efficiency and accuracy. The method reaches a high classification accuracy while selecting fewer feature terms and converges quickly; thanks to the improved inter-class distribution measure, it also applies to skewed data sets.
The technical solution adopted by the invention to solve the technical problem is a text classification feature screening method based on feature distribution information, characterized by comprising the following steps:
1. Perform word segmentation, stop-word removal and stemming on each document in the document set.
2. Represent the whole document collection as a vector space model.
3. Extract all feature words from the document collection and construct a feature dictionary.
4. For each feature word t in the text feature space, count its frequency TF(t, dj) in every document dj and its frequency TF(t, Ci) in every class Ci; at the same time, count for each class Ci the number of documents DF(t, Ci) that contain t.
5. Using the statistics from step 4, for each feature word tk, first compute its normalized tf*idf value for each class Ci, then compute its intra-class dispersion DIntra and average inter-class dispersion DInterAvg for each class Ci.
6. Using the information from steps 4 and 5, compute the weight wi(t) of each feature word tk in each class Ci of the text feature space:
wi(t) = tf*idf * DInterAvg * (1 - DIntra)
Summing the weights of feature word tk over all classes gives its weight in the whole document set, i.e. its TDFS value.
7. Sort all feature words in descending order of their weight in the whole document set; during feature screening, preferentially retain the top-ranked feature words.
The beneficial effects of the invention are as follows: because the method improves the inter-class dispersion computation on the basis of the feature distribution scheme and applies that scheme to the feature screening process, it makes full use of the tf*idf information and the intra-class and inter-class distribution information of text features, reflects the importance of feature terms in the text more objectively, selects the feature terms that best represent the text, achieves the goal of feature screening, and improves text classification efficiency and accuracy. The method reaches a high classification accuracy while selecting fewer feature terms and converges quickly; thanks to the improved inter-class distribution measure, it also applies to skewed data sets.
The invention is described in detail below with reference to the drawings and embodiments.
Description of drawings
Fig. 1 is a flowchart of the text classification feature screening method based on feature distribution information of the present invention.
Embodiment
The concrete steps of the inventive method are as follows:
1. Concepts relevant to the invention.
Tf*idf (term frequency-inverse document frequency): a statistical measure that assesses how important a word is to a document within a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus.
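A plain, unnormalized tf*idf sketch of that definition (the patent's own normalized variant with the constant L appears later in the embodiment; this simpler form is for illustration only):

```python
import math

def tf_idf(term, doc_tokens, corpus_tokens):
    """Plain tf*idf of a term in one tokenized document of a corpus."""
    tf = doc_tokens.count(term)                        # raw term frequency
    df = sum(1 for d in corpus_tokens if term in d)    # document frequency
    idf = math.log(len(corpus_tokens) / df) if df else 0.0
    return tf * idf

corpus = [["basketball", "game"], ["basketball", "sport"], ["music", "concert"]]
# "music" is rarer across the corpus than "basketball", so its idf is higher
```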
Intra-class dispersion (intra-class distribution): the distribution of a feature word over the documents of a given class. If the word is evenly distributed over the documents of the class, its intra-class dispersion in that class is low; conversely, if its occurrences are concentrated in a few documents and it does not occur in the rest, its intra-class dispersion in that class is high.
Inter-class dispersion (inter-class distribution): the distribution of a feature word over the classes of the whole document set. If the word is evenly distributed over the documents of all classes, its inter-class dispersion in the document set is low; conversely, if it is concentrated in one or a few classes and does not occur in the others, its inter-class dispersion in the document set is high.
Average inter-class dispersion (average inter-class distribution): a concept proposed by the present invention that improves on inter-class dispersion. Inter-class dispersion uses a feature word's total frequency in the documents of each class to measure its distribution over classes; when the numbers of documents per class differ greatly, i.e. the data set is skewed, feature words of classes with few documents are drowned out by classes with many documents. The improved average inter-class dispersion instead measures the distribution using the word's average frequency per document in each class, so it is unaffected by data skew and accurately reflects the inter-class distribution of the feature word.
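The exact formula is not reproduced in this text, so the sketch below uses one plausible reading of the definition above: the spread (coefficient-of-variation style) of the per-document average frequency of a term across classes. The function name and the normalization choice are assumptions for illustration, not the patent's formula:

```python
import statistics

def avg_interclass_dispersion(tf_per_class, docs_per_class):
    """Sketch: dispersion of the per-document *average* frequency of a term
    across classes. High when those averages differ strongly between classes.
    One plausible reading of the patent's definition, not its exact formula."""
    avg = [tf / n for tf, n in zip(tf_per_class, docs_per_class)]
    mean = statistics.mean(avg)
    return statistics.pstdev(avg) / mean if mean else 0.0

# Class A: 2 occurrences over 100 docs; class B: 2 occurrences over 2 docs.
# A raw total-frequency measure sees equal totals (2 vs 2) and reports no
# dispersion; the averaged version sees 0.02 vs 1.0 and reports it as high.
skewed = avg_interclass_dispersion([2, 2], [100, 2])
uniform = avg_interclass_dispersion([2, 2], [2, 2])
```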
2. Properties relevant to the invention.
Property 1: the more often a feature word occurs in the documents of a class, the better it indicates the class of a document, and the larger its weight.
Table 1. Sample original document set with noise such as forms removed

Numbering | Original document | Classification |
---|---|---|
1 | Yao has great talent in basketball games. | PE |
2 | We are playing a game about basketball in the playground. | PE |
3 | We are enjoying the music at the concert. | MUSIC |
4 | Music is an art and everybody may enjoy it. | MUSIC |
5 | Playing basketball is my favorite sport. | PE |
6 | Listening to the music is my hobby. | MUSIC |
For example, in the document set of Table 1, the feature word basketball occurs 3 times in PE documents and receives a large weight, while talent occurs only once in that class and receives a smaller weight.
Property 2: the lower a feature word's intra-class dispersion, the better it indicates the class of a document, and the larger its weight.
Table 2. The document collection after text preprocessing

Numbering | Training document | Classification |
---|---|---|
1 | yao ha great talent basketbal game | PE |
2 | we plai game about basketbal playground | PE |
3 | we enjoi music concert | MUSIC |
4 | music art everybodi mai enjoi | MUSIC |
5 | plai basketbal my favorit sport | PE |
6 | listen music my hobbi | MUSIC |
Table 3. Feature words of the sample document collection sorted by TDFS value in descending order

Feature word | TDFS value | Feature word | TDFS value |
---|---|---|---|
music | 0.554 | hobbi | 0.158 |
basketbal | 0.489 | listen | 0.158 |
enjoi | 0.489 | great | 0.140 |
game | 0.394 | ha | 0.140 |
plai | 0.394 | talent | 0.140 |
concert | 0.158 | yao | 0.140 |
art | 0.158 | playground | 0.140 |
everybodi | 0.158 | favorit | 0.140 |
mai | 0.158 | sport | 0.140 |
For example, in the document set of Table 1, basketball is evenly distributed over the PE documents, with one occurrence per document, so its intra-class dispersion is low: the word is widely and evenly present in PE documents, indicates the class well, and receives a large weight. talent occurs in only one PE document and is absent from the other two, so its intra-class dispersion is high; a document containing it is a weaker indicator of the PE class, and its computed weight is correspondingly low.
Property 3: the higher a feature word's inter-class dispersion, the better it indicates the class of a document, and the larger its weight.
For example, in the document set of Table 1, basketball occurs only in PE documents and never in MUSIC documents, so its inter-class dispersion is very high; it indicates the class well and receives a large weight. The feature word my occurs once in each of the PE and MUSIC classes, an even inter-class distribution, so its inter-class dispersion is very low; it does not represent either class well, and its weight is correspondingly low.
Property 4: the higher a feature word's average inter-class dispersion, the better it indicates the class of a document, and the larger its weight.
Table 4. Example 1 of the average inter-class dispersion property
Table 5. Example 2 of the average inter-class dispersion property
For example, Tables 4 and 5 give two cases. Suppose the distribution of a feature word t over the document set is as shown in Table 4: because the total frequency of t is the same in classes A and B (both 2), its inter-class dispersion is 0, yet t is clearly still somewhat representative of class B; it is drowned out only because the document counts of A and B differ too much, the feature word t of class B with few documents being swamped by class A with many documents. In the example of Table 5, the inter-class dispersion of t again computes to 0 (the total frequency of t is 1000 in both A and B), although t is obviously important for distinguishing the two classes. This shows that on a skewed data set, measuring feature words by inter-class dispersion fails to highlight representative feature words in classes with few documents. Measured by average inter-class dispersion instead, the distribution of t over A and B is very uneven and the average inter-class dispersion is high, so t receives a reasonable weight; the adverse effect of data skew is eliminated while the original function of inter-class dispersion is retained.
For a given document set D, the invention screens the attributes of the document set as follows:
1. Parse all documents in the set, discard useless structural markup and the like, and extract the main information such as title and content from each document.
A document may contain structural markup (see the table below) that appears identically in every document; this markup, together with content irrelevant to text classification such as timestamps, is filtered out first.
2. Preprocess the text content and extract feature terms to form the text feature space.
After the parsing of step 1, the content information of each document is available; see Table 1. Each document in the set is then preprocessed: after word segmentation (tokenizing), stop-word removal and stemming, a set of words remains. Each word in the set is called a text feature term, and all feature terms together form the text feature space (term space). Table 2 shows the result of preprocessing the documents of Table 1.
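The preprocessing step can be sketched as follows. The stop-word list and suffix rules here are illustrative stand-ins: Table 2's stems ("plai", "basketbal") come from a Porter-style stemmer, while this sketch uses a much cruder one:

```python
# Minimal preprocessing sketch: tokenize, remove stop words, crude stemming.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "at", "to", "and", "it", "may", "we"}

def crude_stem(word):
    """Strip a couple of common suffixes; a stand-in for Porter stemming."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = [w.strip(".,!?").lower() for w in text.split()]
    return [crude_stem(w) for w in tokens if w and w not in STOP_WORDS]

terms = preprocess("We are playing a game about basketball in the playground.")
# -> ["play", "game", "about", "basketball", "playground"]
```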
3. Extract all feature words from the document collection and construct the feature dictionary.
After all documents in the set have been processed as in step 2, all feature words occurring in the document set are collected into a feature dictionary, which serves as the basis for feature screening.
4. For each feature word t in the text feature space, count its frequency TF(t, dj) in every document dj and its frequency TF(t, Ci) in every class Ci; at the same time, count for each class Ci the number of documents DF(t, Ci) that contain t.
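The counting in step 4 can be sketched as follows, on the preprocessed tokens of Table 2 (the data below is a trimmed illustration):

```python
from collections import defaultdict

def collect_statistics(docs, labels):
    """Step 4 sketch: per-document term frequency TF(t, d_j), per-class term
    frequency TF(t, C_i), and per-class document frequency DF(t, C_i)."""
    tf_doc = []                                        # TF(t, d_j) per document
    tf_class = defaultdict(lambda: defaultdict(int))   # TF(t, C_i)
    df_class = defaultdict(lambda: defaultdict(int))   # DF(t, C_i)
    for tokens, label in zip(docs, labels):
        counts = defaultdict(int)
        for t in tokens:
            counts[t] += 1
            tf_class[label][t] += 1
        for t in set(tokens):                          # each doc counted once
            df_class[label][t] += 1
        tf_doc.append(dict(counts))
    return tf_doc, tf_class, df_class

docs = [["basketbal", "game"], ["basketbal", "sport"], ["music", "concert"]]
labels = ["PE", "PE", "MUSIC"]
tf_doc, tf_class, df_class = collect_statistics(docs, labels)
```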
5. Using the statistics from step 4, compute for each feature word its normalized tf*idf value, intra-class dispersion and average inter-class dispersion.
(1) tf*idf: computed as follows:
In the formula, n denotes the number of feature words occurring in class Ci. L is a constant determined experimentally, usually 0.1 or 0.01. Normalizing the tf*idf value avoids the computational bias introduced by very long documents.
(2) Intra-class dispersion DIntra: computed as follows:
(3) Average inter-class dispersion DInterAvg: computed as follows:
6. Using the results of step 5, compute the weight of feature term t in each class as follows:
wi(t) = tf*idf * DInterAvg * (1 - DIntra)
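The weight formula itself is trivial to state in code; the numeric inputs below are illustrative, and the assumption that all three factors lie in [0, 1] follows from the normalization described in step 5:

```python
def term_weight(tfidf, d_inter_avg, d_intra):
    """Patent's per-class weight: w_i(t) = tfidf * DInterAvg * (1 - DIntra).
    All three inputs are assumed normalized to [0, 1]."""
    return tfidf * d_inter_avg * (1 - d_intra)

# A term that is frequent in the class (high tfidf), unevenly distributed
# across classes (high DInterAvg) and evenly spread within the class
# (low DIntra) receives a high weight; the opposite profile scores low.
good = term_weight(0.8, 0.9, 0.1)   # 0.8 * 0.9 * 0.9 = 0.648
poor = term_weight(0.8, 0.2, 0.7)   # 0.8 * 0.2 * 0.3 = 0.048
```

Note the (1 - DIntra) factor: low intra-class dispersion (Property 2) raises the weight, while the DInterAvg factor rewards high average inter-class dispersion (Property 4).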
7. Summing the weights of feature term t over all classes gives the weight of the term in the whole document set, i.e. its TDFS value:
TDFS(t) = Σi wi(t)
8. Sort the TDFS values of all feature terms in the document set in descending order; the higher a term ranks, the larger its value over the document set and the greater its role in document classification.
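Steps 7 and 8 together can be sketched as follows. The per-class weights below are illustrative numbers, not values computed from the patent's tables:

```python
def tdfs_rank(weights_per_class):
    """Steps 7-8 sketch: sum each term's per-class weights w_i(t) into its
    TDFS value, then sort terms in descending order of TDFS.
    `weights_per_class` maps class -> {term: w_i(t)}."""
    tdfs = {}
    for class_weights in weights_per_class.values():
        for term, w in class_weights.items():
            tdfs[term] = tdfs.get(term, 0.0) + w
    return sorted(tdfs.items(), key=lambda kv: kv[1], reverse=True)

ranked = tdfs_rank({
    "PE": {"basketbal": 0.40, "game": 0.30, "my": 0.02},
    "MUSIC": {"music": 0.45, "enjoi": 0.30, "my": 0.02},
})
top_terms = [t for t, _ in ranked[:2]]
```

Feature screening then keeps the top-k terms of `ranked`, which is exactly the "preferentially retain the top-ranked feature words" rule of the claims.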
Claims (1)
1. A text classification feature screening method based on feature distribution information, characterized by comprising the following steps:
(1) performing word segmentation, stop-word removal and stemming on each document in the document set;
(2) representing the whole document collection as a vector space model;
(3) extracting all feature words from the document collection and constructing a feature dictionary;
(4) counting, for each feature word t in the text feature space, its frequency TF(t, dj) in every document dj and its frequency TF(t, Ci) in every class Ci, and at the same time counting for each class Ci the number of documents DF(t, Ci) that contain t;
(5) using the information from step (4), computing for each feature word tk first its normalized tf*idf value for each class Ci, then its intra-class dispersion DIntra and average inter-class dispersion DInterAvg for each class Ci;
(6) using the information from steps (4) and (5), computing the weight wi(t) of each feature word tk in each class Ci of the text feature space as
wi(t) = tf*idf * DInterAvg * (1 - DIntra)
and summing the weights of feature word tk over all classes to obtain its weight in the whole document set, i.e. its TDFS value;
(7) sorting all feature words in descending order of their weight in the whole document set, and preferentially retaining the top-ranked feature words during feature screening.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310050583.4A CN103106275B (en) | 2013-02-08 | 2013-02-08 | The text classification Feature Selection method of feature based distributed intelligence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310050583.4A CN103106275B (en) | 2013-02-08 | 2013-02-08 | The text classification Feature Selection method of feature based distributed intelligence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103106275A true CN103106275A (en) | 2013-05-15 |
CN103106275B CN103106275B (en) | 2016-02-10 |
Family
ID=48314130
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310050583.4A Expired - Fee Related CN103106275B (en) | 2013-02-08 | 2013-02-08 | The text classification Feature Selection method of feature based distributed intelligence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103106275B (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104462556A (en) * | 2014-12-25 | 2015-03-25 | 北京奇虎科技有限公司 | Method and device for recommending question and answer page related questions |
CN104915327A (en) * | 2014-03-14 | 2015-09-16 | 腾讯科技(深圳)有限公司 | Text information processing method and device |
CN105045812A (en) * | 2015-06-18 | 2015-11-11 | 上海高欣计算机系统有限公司 | Text topic classification method and system |
CN106054857A (en) * | 2016-05-27 | 2016-10-26 | 大连楼兰科技股份有限公司 | Maintenance decision tree/word vector-based fault remote diagnosis platform |
CN106055439A (en) * | 2016-05-27 | 2016-10-26 | 大连楼兰科技股份有限公司 | Remote fault diagnostic system and method based on maintenance and decision trees/term vectors |
CN106227768A (en) * | 2016-07-15 | 2016-12-14 | 国家计算机网络与信息安全管理中心 | A kind of short text opining mining method based on complementary language material |
CN106940703A (en) * | 2016-01-04 | 2017-07-11 | 腾讯科技(北京)有限公司 | Pushed information roughing sort method and device |
CN106997345A (en) * | 2017-03-31 | 2017-08-01 | 成都数联铭品科技有限公司 | The keyword abstraction method of word-based vector sum word statistical information |
CN106997344A (en) * | 2017-03-31 | 2017-08-01 | 成都数联铭品科技有限公司 | Keyword abstraction system |
CN107329999A (en) * | 2017-06-09 | 2017-11-07 | 江西科技学院 | Document classification method and device |
CN107844553A (en) * | 2017-10-31 | 2018-03-27 | 山东浪潮通软信息科技有限公司 | A kind of file classification method and device |
CN108153872A (en) * | 2017-12-25 | 2018-06-12 | 佛山市车品匠汽车用品有限公司 | A kind of method and apparatus of the Internet web page information filtering |
CN108491429A (en) * | 2018-02-09 | 2018-09-04 | 湖北工业大学 | A kind of feature selection approach based on document frequency and word frequency statistics between class in class |
CN108776654A (en) * | 2018-05-30 | 2018-11-09 | 昆明理工大学 | One kind being based on improved simhash transcription comparison methods |
CN110210559A (en) * | 2019-05-31 | 2019-09-06 | 北京小米移动软件有限公司 | Object screening technique and device, storage medium |
CN110442678A (en) * | 2019-07-24 | 2019-11-12 | 中智关爱通(上海)科技股份有限公司 | A kind of text words weighing computation method and system, storage medium and terminal |
CN111881668A (en) * | 2020-08-06 | 2020-11-03 | 成都信息工程大学 | Improved TF-IDF calculation model based on chi-square statistics and TF-CRF |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101290626A (en) * | 2008-06-12 | 2008-10-22 | 昆明理工大学 | Text categorization feature selection and weight computation method based on field knowledge |
CN101587493A (en) * | 2009-06-29 | 2009-11-25 | 中国科学技术大学 | Text classification method |
CN102622373A (en) * | 2011-01-31 | 2012-08-01 | 中国科学院声学研究所 | Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm |
2013
- 2013-02-08: CN application CN201310050583.4A filed; granted as patent CN103106275B; status: not active (Expired - Fee Related)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101290626A (en) * | 2008-06-12 | 2008-10-22 | 昆明理工大学 | Text categorization feature selection and weight computation method based on field knowledge |
CN101587493A (en) * | 2009-06-29 | 2009-11-25 | 中国科学技术大学 | Text classification method |
CN102622373A (en) * | 2011-01-31 | 2012-08-01 | 中国科学院声学研究所 | Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm |
Non-Patent Citations (3)
Title |
---|
刘海峰 (Liu Haifeng) et al.: "An improved feature selection method in text classification", 《情报科学》 (Information Science) * |
张瑜 (Zhang Yu) et al.: "An improved feature weighting algorithm", 《计算机工程》 (Computer Engineering) * |
徐凤亚 (Xu Fengya) et al.: "Research on improving feature weighting algorithms in automatic text classification", 《计算机工程与应用》 (Computer Engineering and Applications) * |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104915327A (en) * | 2014-03-14 | 2015-09-16 | 腾讯科技(深圳)有限公司 | Text information processing method and device |
WO2015135452A1 (en) * | 2014-03-14 | 2015-09-17 | Tencent Technology (Shenzhen) Company Limited | Text information processing method and apparatus |
US10262059B2 (en) | 2014-03-14 | 2019-04-16 | Tencent Technology (Shenzhen) Company Limited | Method, apparatus, and storage medium for text information processing |
CN104915327B (en) * | 2014-03-14 | 2019-01-29 | 腾讯科技(深圳)有限公司 | A kind of processing method and processing device of text information |
CN104462556B (en) * | 2014-12-25 | 2018-02-23 | 北京奇虎科技有限公司 | Question and answer page relevant issues recommend method and apparatus |
CN104462556A (en) * | 2014-12-25 | 2015-03-25 | 北京奇虎科技有限公司 | Method and device for recommending question and answer page related questions |
CN105045812A (en) * | 2015-06-18 | 2015-11-11 | 上海高欣计算机系统有限公司 | Text topic classification method and system |
CN105045812B (en) * | 2015-06-18 | 2019-01-29 | 上海高欣计算机系统有限公司 | The classification method and system of text subject |
CN106940703B (en) * | 2016-01-04 | 2020-09-11 | 腾讯科技(北京)有限公司 | Pushed information rough selection sorting method and device |
CN106940703A (en) * | 2016-01-04 | 2017-07-11 | 腾讯科技(北京)有限公司 | Pushed information roughing sort method and device |
CN106055439B (en) * | 2016-05-27 | 2019-09-27 | 大连楼兰科技股份有限公司 | Based on maintenance decision tree/term vector Remote Fault Diagnosis system and method |
CN106055439A (en) * | 2016-05-27 | 2016-10-26 | 大连楼兰科技股份有限公司 | Remote fault diagnostic system and method based on maintenance and decision trees/term vectors |
CN106054857A (en) * | 2016-05-27 | 2016-10-26 | 大连楼兰科技股份有限公司 | Maintenance decision tree/word vector-based fault remote diagnosis platform |
CN106054857B (en) * | 2016-05-27 | 2019-12-24 | 大连楼兰科技股份有限公司 | Maintenance decision tree/word vector-based fault remote diagnosis platform |
CN106227768B (en) * | 2016-07-15 | 2019-09-03 | 国家计算机网络与信息安全管理中心 | A kind of short text opinion mining method based on complementary corpus |
CN106227768A (en) * | 2016-07-15 | 2016-12-14 | 国家计算机网络与信息安全管理中心 | A kind of short text opinion mining method based on complementary corpus |
CN106997344A (en) * | 2017-03-31 | 2017-08-01 | 成都数联铭品科技有限公司 | Keyword abstraction system |
CN106997345A (en) * | 2017-03-31 | 2017-08-01 | 成都数联铭品科技有限公司 | The keyword abstraction method of word-based vector sum word statistical information |
CN107329999B (en) * | 2017-06-09 | 2020-10-20 | 江西科技学院 | Document classification method and device |
CN107329999A (en) * | 2017-06-09 | 2017-11-07 | 江西科技学院 | Document classification method and device |
CN107844553A (en) * | 2017-10-31 | 2018-03-27 | 山东浪潮通软信息科技有限公司 | A kind of file classification method and device |
CN108153872A (en) * | 2017-12-25 | 2018-06-12 | 佛山市车品匠汽车用品有限公司 | A kind of method and apparatus of the Internet web page information filtering |
CN108491429A (en) * | 2018-02-09 | 2018-09-04 | 湖北工业大学 | A kind of feature selection approach based on document frequency and word frequency statistics between class in class |
CN108776654A (en) * | 2018-05-30 | 2018-11-09 | 昆明理工大学 | A kind of text duplication comparison method based on improved simhash |
CN110210559A (en) * | 2019-05-31 | 2019-09-06 | 北京小米移动软件有限公司 | Object screening technique and device, storage medium |
CN110210559B (en) * | 2019-05-31 | 2021-10-08 | 北京小米移动软件有限公司 | Object screening method and device and storage medium |
CN110442678A (en) * | 2019-07-24 | 2019-11-12 | 中智关爱通(上海)科技股份有限公司 | A kind of text words weighing computation method and system, storage medium and terminal |
CN110442678B (en) * | 2019-07-24 | 2022-03-29 | 中智关爱通(上海)科技股份有限公司 | Text word weight calculation method and system, storage medium and terminal |
CN111881668A (en) * | 2020-08-06 | 2020-11-03 | 成都信息工程大学 | Improved TF-IDF calculation model based on chi-square statistics and TF-CRF |
CN111881668B (en) * | 2020-08-06 | 2023-06-30 | 成都信息工程大学 | TF-IDF computing device based on chi-square statistics and TF-CRF improvement |
Also Published As
Publication number | Publication date |
---|---|
CN103106275B (en) | 2016-02-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103106275A (en) | Text classification character screening method based on character distribution information | |
CN103778214B (en) | A kind of item property clustering method based on user comment | |
CN102622373B (en) | Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm | |
CN101587493B (en) | Text classification method | |
US9875294B2 (en) | Method and apparatus for classifying object based on social networking service, and storage medium | |
CN106156372B (en) | A kind of classification method and device of internet site | |
CN103473262B (en) | A kind of Web comment viewpoint automatic classification system based on correlation rule and sorting technique | |
CN106095996A (en) | Method for text classification | |
CN105550269A (en) | Product comment analyzing method and system with learning supervising function | |
CN102332025A (en) | Intelligent vertical search method and system | |
CN103995876A (en) | Text classification method based on chi square statistics and SMO algorithm | |
CN101937436B (en) | Text classification method and device | |
CN103886108B (en) | The feature selecting and weighing computation method of a kind of unbalanced text set | |
CN105975518B (en) | Expectation cross entropy feature selecting Text Classification System and method based on comentropy | |
CN104361037B (en) | Microblogging sorting technique and device | |
US10387805B2 (en) | System and method for ranking news feeds | |
CN102929873A (en) | Method and device for extracting searching value terms based on context search | |
CN103810264A (en) | Webpage text classification method based on feature selection | |
CN106446931A (en) | Feature extraction and classification method and system based on support vector data description | |
CN107133282B (en) | Improved evaluation object identification method based on bidirectional propagation | |
CN104463601A (en) | Method for detecting users who score maliciously in online social media system | |
CN101763431A (en) | PL clustering method based on massive network public sentiment information | |
CN106484919A (en) | A kind of industrial sustainability sorting technique based on webpage autonomous word and system | |
CN105389505A (en) | Shilling attack detection method based on stack type sparse self-encoder | |
CN105787662A (en) | Mobile application software performance prediction method based on attributes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20160210; Termination date: 20200208 |