CN105701084A - Characteristic extraction method of text classification on the basis of mutual information - Google Patents

Characteristic extraction method of text classification on the basis of mutual information

Info

Publication number
CN105701084A
CN105701084A (application CN201511018702.3A)
Authority
CN
China
Prior art keywords
text
classification
feature
document
log
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201511018702.3A
Other languages
Chinese (zh)
Inventor
赵秉新
印鉴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SYSU CMU Shunde International Joint Research Institute
National Sun Yat Sen University
Original Assignee
SYSU CMU Shunde International Joint Research Institute
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SYSU CMU Shunde International Joint Research Institute, National Sun Yat Sen University filed Critical SYSU CMU Shunde International Joint Research Institute
Priority to CN201511018702.3A priority Critical patent/CN105701084A/en
Publication of CN105701084A publication Critical patent/CN105701084A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a feature extraction method for text classification based on mutual information. Text preprocessing mainly comprises removing document markup, removing stop words, performing word segmentation, part-of-speech tagging, word-frequency statistics and data cleaning, after which feature words are extracted according to a feature-selection algorithm. In the text classification stage, model parameters are trained on the vectorized training set with a support vector machine algorithm, and texts to be classified are then classified by the trained model. With the scheme of the invention, noise features can be effectively kept out of the machine learning process during feature extraction for text classification, the precision of text classification is improved, the scale of the feature library is greatly reduced, and memory occupation is lowered.

Description

A feature extraction method for text classification based on mutual information
Technical field
The invention belongs to the technical field of natural language processing, and is specifically a feature extraction method for text classification based on mutual information.
Background technology
With the rapid development of the Internet, multimedia and storage technology, more and more information (particularly multimedia information) is generated, propagated and accumulated. The Internet makes information dissemination easier, and individual users can easily find and download the information they want. Large-capacity hard disks can store more information: even without counting the resources on the World Wide Web, the documents accumulated on a personal computer may amount to tens of gigabytes. How to effectively manage and conveniently use this information is a major problem for individual users. Although multimedia information on the Internet keeps growing, text will remain the most important information source for the foreseeable future; accordingly, text information processing technology has not stagnated because of the rapid growth of multimedia, but on the contrary shows a flourishing trend. Text classification technology is a powerful means of organizing and managing text information. Text classification appeared as early as the 1960s, but only became a research hotspot after the 1990s. Machine learning has increasingly become the main processing method: it can automatically learn the characteristics of each class from a pre-classified text set and build an automatic classifier, saving manpower while achieving good results. Therefore, most current research focuses on text classification methods based on machine learning.
The basic task of text classification is to determine the relation between a document and a given set of classes according to the content of the document, i.e. to find, in the given class set, the class best suited to the current document. The connection between documents and classes can be regarded as a mapping; since a document may belong to multiple classes, the mapping can be either one-to-one or one-to-many. The mapping rule is determined by learning from a given training document set and class set, and differs with the learning method. When the system encounters a new document, the corresponding class is determined by the mapping rule. The difficulty of text classification lies in the fact that the content of text is natural language, which makes it hard for a computer to process text semantically. At present, researchers apply methods from statistical analysis, machine learning and data mining to perform content-based classification of text information and automatically build user-friendly text classification systems, which can substantially reduce the human resources spent on organizing documents and help users quickly find the information they need. Therefore, how to effectively keep noise features out of the machine learning process is one of the most important research directions for improving the precision of text classification.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art and provide a feature extraction method for text classification based on mutual information that can effectively keep noise features out of the machine learning process and improve the precision of text classification.
In order to solve the above technical problem, the present invention adopts the following technical solution: a feature extraction method for text classification based on mutual information, comprising the following steps:
(a) Preprocess the training text:
Build a stop-word dictionary and a training text set, segment the training texts in the data set into words, filter out stop words according to the stop-word dictionary after segmentation, and perform part-of-speech tagging on the segmented text;
(b) Perform feature extraction on the preprocessed text:
For the text preprocessed in step (a), calculate the mutual information between each remaining term and each class according to formulas (1) and (2),
Formula (1) is:
I(U;C) = Σ_{et∈{1,0}} Σ_{ec∈{1,0}} P(U=et, C=ec) · log2[ P(U=et, C=ec) / (P(U=et)·P(C=ec)) ]
where U is the term and C is the class; U and C are binary random variables: when a document contains term t, U takes the value et = 1, otherwise et = 0; when a document belongs to class c, C takes the value ec = 1, otherwise ec = 0;
if maximum likelihood estimation is used, the probabilities above are all estimated from document counts, and the practical calculation formula is as follows:
Formula (2) is:
I(U;C) = (N11/N)·log2(N·N11/(N1.·N.1)) + (N01/N)·log2(N·N01/(N0.·N.1)) + (N10/N)·log2(N·N10/(N1.·N.0)) + (N00/N)·log2(N·N00/(N0.·N.0))
where Nxy denotes the number of documents with et = x and ec = y;
for each class, compute its mutual information with every term and select the k terms with the largest values;
delete terms repeated between classes; the remaining screened terms are the feature words;
(c) Assign weights to the feature words:
For the feature words obtained in step (b), count the frequency with which each feature word occurs in a document, the total number of documents, and the number of documents containing each feature word, and calculate the weight of each feature according to formula (3),
Formula (3) is the TF-IDF calculation formula: wi(d) = tfi(d) · log(N/ni + 0.01)
where tfi(d) is the frequency of feature (term) ti in document d, N is the total number of documents, ni is the number of documents containing term ti, 0.01 is a constant, and log(N/ni + 0.01) is the inverse document frequency; the denominator of the normalized form is a normalization factor; based on the training text set, each feature word t is scored with the feature evaluation function TF-IDF;
(d) SVM model training and prediction
Vectorize each document, converting it into a feature vector; the first dimension of the vector represents the class of the document, and dimensions 2 to K represent the feature words and their weights; feed these vectors into the SVM model to train the model parameters, and then perform text prediction.
Detailed description of the invention
The following describes a specific embodiment of the present invention.
The feature extraction method for text classification based on mutual information provided by the invention comprises the following steps:
1) Obtain a number of articles of each class from the Internet with a web crawler, as the training data set of the text classification system;
2) Preprocess the training text: segment the texts of the training data set into words. The segmentation tool used is jieba, an open-source Chinese word segmentation module developed in Python. Stop words are then filtered out according to the stop-word dictionary, and the segmented text is part-of-speech tagged with the jieba module.
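The stop-word filtering of step 2), together with the noun/verb screening applied in the next step, can be sketched as follows. The (word, POS-tag) pairs are shown pre-segmented so the sketch stays self-contained; in the actual pipeline they would come from jieba's part-of-speech tagger (`jieba.posseg.lcut`). The example tokens and the tiny stop-word list are illustrative only.

```python
# Sketch of the preprocessing step, assuming (word, POS) pairs of the kind
# jieba.posseg.lcut(text) returns; the stop-word list is a toy example.
STOPWORDS = {"的", "了", "在", "是"}

def preprocess(tagged_tokens, keep_pos=("n", "v")):
    """Drop stop words; keep only words whose POS tag starts with n (noun) or v (verb)."""
    return [word for word, pos in tagged_tokens
            if word not in STOPWORDS and pos.startswith(keep_pos)]

tagged = [("足球", "n"), ("的", "uj"), ("比赛", "n"), ("在", "p"), ("举行", "v")]
print(preprocess(tagged))  # -> ['足球', '比赛', '举行']
```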
3) Perform feature extraction on the preprocessed text: from the text preprocessed in step 2), keep only the words whose part of speech is noun or verb; this is the initial feature extraction. Then calculate the mutual information between each remaining term and each class according to formulas (1) and (2),
I(U;C) = Σ_{et∈{1,0}} Σ_{ec∈{1,0}} P(U=et, C=ec) · log2[ P(U=et, C=ec) / (P(U=et)·P(C=ec)) ] …… (1)
where U is the term and C is the class. U and C are binary random variables: when a document contains term t, U takes the value et = 1, otherwise et = 0; when a document belongs to class c, C takes the value ec = 1, otherwise ec = 0. When maximum likelihood estimation is used, the probabilities above are all computed by counting the documents that contain the term and belong to the class. The practical calculation formula is then as follows:
I(U;C) = (N11/N)·log2(N·N11/(N1.·N.1)) + (N01/N)·log2(N·N01/(N0.·N.1)) + (N10/N)·log2(N·N10/(N1.·N.0)) + (N00/N)·log2(N·N00/(N0.·N.0)) …… (2)
where Nxy denotes the number of documents with et = x and ec = y. For example, N10 denotes the documents that contain term t (et = 1) but do not belong to class c (ec = 0); N1. = N10 + N11 is the number of all documents containing term t; N.1 = N11 + N01 is the number of all documents belonging to class c; and N = N00 + N01 + N10 + N11 is the total number of documents.
For each class, compute its mutual information with every term and select the k terms with the largest values. Two classes may of course select identical feature words, so repeated terms are removed. The result is the finally selected set of feature words.
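Under the maximum-likelihood estimate of formula (2), the mutual information of a term and a class reduces to arithmetic on the four document counts, after which each class keeps its k highest-scoring terms. A minimal sketch, with illustrative function names and toy counts:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Formula (2): MI of term t and class c from a 2x2 table of document counts.
    n11: docs containing t and in c; n10: containing t, not in c;
    n01: not containing t, in c;    n00: neither."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # docs with / without the term
    n_1, n_0 = n11 + n01, n10 + n00   # docs in / not in the class

    def term(nxy, nx, ny):
        # a zero count contributes 0 (the limit of x*log x as x -> 0)
        return 0.0 if nxy == 0 else (nxy / n) * math.log2(n * nxy / (nx * ny))

    return (term(n11, n1_, n_1) + term(n01, n0_, n_1)
            + term(n10, n1_, n_0) + term(n00, n0_, n_0))

# Per-class selection of the k highest-MI terms (toy scores: "足球" is
# concentrated in the class, "今天" is independent of it).
scores = {"足球": mutual_information(49, 1, 1, 49),
          "今天": mutual_information(25, 25, 25, 25)}
k = 1
top_k = sorted(scores, key=scores.get, reverse=True)[:k]
print(top_k)  # -> ['足球']
```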
4) Assign weights to the feature words: for the feature words obtained, count the frequency with which each feature word occurs in a document, the total number of documents, and the number of documents containing each feature word, and calculate the weight of each feature according to TF-IDF. TF-IDF is a statistical method for assessing the importance of a word to one document within a corpus of N documents.
The TF-IDF calculation formula:
wi(d) = tfi(d) · log(N/ni + 0.01) ………… (3)
where tfi(d) is the frequency of feature (term) ti in document d, N is the total number of documents, ni is the number of documents containing term ti, 0.01 is a constant, and log(N/ni + 0.01) is the inverse document frequency; the denominator of the normalized form is a normalization factor. Based on the training text set, each feature word t is scored with the feature evaluation function TF-IDF.
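The weighting of formula (3), as reconstructed here (tf · log(N/n + 0.01)), can be sketched as follows; the base of the logarithm is not fixed by the text, so the natural logarithm is assumed, and the normalization denominator is omitted:

```python
import math

def tfidf_weight(tf, n_docs, doc_freq):
    """Formula (3): weight of term t_i in document d.
    tf: frequency of t_i in d; n_docs: total number of documents N;
    doc_freq: documents containing t_i (n_i); 0.01 is the smoothing constant."""
    return tf * math.log(n_docs / doc_freq + 0.01)

# At equal term frequency, a rarer term receives a larger weight.
print(tfidf_weight(3, 1000, 10), tfidf_weight(3, 1000, 500))
```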
5) SVM model training and prediction: the support vector machine method is built on the VC-dimension theory and the structural risk minimization principle of statistical learning theory; with limited sample information, it seeks the optimal compromise between model complexity and learning capacity so as to obtain the best generalization ability.
Vectorize each document, converting it into a feature vector. The first dimension of the vector represents the class of the document, and dimensions 2 to K represent the feature words and their weights (as described in step 3). Feed these vectors into the libSVM model to train the model parameters, and then perform text prediction. The model returns two results, label and score: label is the predicted class label, and score is the degree of membership of the sample in that class; the larger the score, the higher the confidence that the sample belongs to that class.
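The vectors fed to libSVM are conventionally written in its sparse text format, one document per line: the class label followed by ascending `index:value` pairs for the non-zero feature weights. A minimal sketch of that conversion (the feature indices and weights are illustrative):

```python
def to_libsvm_line(label, weights):
    """Render one document as a libSVM sparse-format line.
    label: integer class label; weights: {feature_index: tfidf_weight} with
    1-based indices; zero weights are omitted, indices sorted ascending."""
    pairs = " ".join(f"{i}:{w:g}" for i, w in sorted(weights.items()) if w != 0)
    return f"{label} {pairs}"

# A document of class 2 with non-zero weights on features 1 and 3.
print(to_libsvm_line(2, {3: 0.25, 1: 0.5, 7: 0.0}))  # -> 2 1:0.5 3:0.25
```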
According to the disclosure and teaching of the above description, those skilled in the art to which the invention pertains can also modify and revise the above embodiment. Therefore, the invention is not limited to the specific embodiment disclosed and described above; some modifications and changes of the invention should also fall within the protection scope of the claims of the invention. In addition, although some specific terms are used in this specification, these terms are merely for convenience of explanation and do not constitute any limitation on the invention.

Claims (1)

1. A feature extraction method for text classification based on mutual information, characterized by comprising the following steps:
(a) Preprocess the training text:
Build a stop-word dictionary and a training text set, segment the training texts in the data set into words, filter out stop words according to the stop-word dictionary after segmentation, and perform part-of-speech tagging on the segmented text;
(b) Perform feature extraction on the preprocessed text:
For the text preprocessed in step (a), calculate the mutual information between each remaining term and each class according to formulas (1) and (2),
Formula (1) is:
I(U;C) = Σ_{et∈{1,0}} Σ_{ec∈{1,0}} P(U=et, C=ec) · log2[ P(U=et, C=ec) / (P(U=et)·P(C=ec)) ]
where U is the term and C is the class; U and C are binary random variables: when a document contains term t, U takes the value et = 1, otherwise et = 0; when a document belongs to class c, C takes the value ec = 1, otherwise ec = 0;
if maximum likelihood estimation is used, the probabilities above are all estimated from document counts, and the practical calculation formula is as follows:
Formula (2) is:
I(U;C) = (N11/N)·log2(N·N11/(N1.·N.1)) + (N01/N)·log2(N·N01/(N0.·N.1)) + (N10/N)·log2(N·N10/(N1.·N.0)) + (N00/N)·log2(N·N00/(N0.·N.0))
where Nxy denotes the number of documents with et = x and ec = y;
for each class, compute its mutual information with every term and select the k terms with the largest values;
delete terms repeated between classes; the remaining screened terms are the feature words;
(c) Assign weights to the feature words:
For the feature words obtained in step (b), count the frequency with which each feature word occurs in a document, the total number of documents, and the number of documents containing each feature word, and calculate the weight of each feature according to formula (3),
Formula (3) is:
the TF-IDF calculation formula: wi(d) = tfi(d) · log(N/ni + 0.01)
where tfi(d) is the frequency of feature (term) ti in document d, N is the total number of documents, ni is the number of documents containing term ti, 0.01 is a constant, and log(N/ni + 0.01) is the inverse document frequency; the denominator of the normalized form is a normalization factor; based on the training text set, each feature word t is scored with the feature evaluation function TF-IDF;
(d) SVM model training and prediction
Vectorize each document, converting it into a feature vector; the first dimension of the vector represents the class of the document, and dimensions 2 to K represent the feature words and their weights; feed these vectors into the SVM model to train the model parameters, and then perform text prediction.
CN201511018702.3A 2015-12-28 2015-12-28 Characteristic extraction method of text classification on the basis of mutual information Pending CN105701084A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511018702.3A CN105701084A (en) 2015-12-28 2015-12-28 Characteristic extraction method of text classification on the basis of mutual information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201511018702.3A CN105701084A (en) 2015-12-28 2015-12-28 Characteristic extraction method of text classification on the basis of mutual information

Publications (1)

Publication Number Publication Date
CN105701084A true CN105701084A (en) 2016-06-22

Family

ID=56225995

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511018702.3A Pending CN105701084A (en) 2015-12-28 2015-12-28 Characteristic extraction method of text classification on the basis of mutual information

Country Status (1)

Country Link
CN (1) CN105701084A (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294542A (en) * 2016-07-25 2017-01-04 北京市信访矛盾分析研究中心 A kind of letters and calls data mining methods of marking and system
CN106502394A (en) * 2016-10-18 2017-03-15 哈尔滨工业大学深圳研究生院 Term vector computational methods and device based on EEG signals
CN106557465A (en) * 2016-11-15 2017-04-05 科大讯飞股份有限公司 A kind of preparation method and device of word weight classification
CN106709370A (en) * 2016-12-31 2017-05-24 北京明朝万达科技股份有限公司 Long word identification method and system based on text contents
CN106776562A (en) * 2016-12-20 2017-05-31 上海智臻智能网络科技股份有限公司 A kind of keyword extracting method and extraction system
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 A kind of file classification method based on LDA
CN106951498A (en) * 2017-03-15 2017-07-14 国信优易数据有限公司 Text clustering method
CN107092644A (en) * 2017-03-07 2017-08-25 重庆邮电大学 A kind of Chinese Text Categorization based on MPI and Adaboost.MH
CN107193804A (en) * 2017-06-02 2017-09-22 河海大学 A kind of refuse messages text feature selection method towards word and portmanteau word
CN107562814A (en) * 2017-08-14 2018-01-09 中国农业大学 A kind of earthquake emergency and the condition of a disaster acquisition of information sorting technique and system
CN107562928A (en) * 2017-09-15 2018-01-09 南京大学 A kind of CCMI text feature selections method
CN107633882A (en) * 2017-09-11 2018-01-26 合肥工业大学 Mix the minimally invasive medical service system and its aid decision-making method under cloud framework
CN107766323A (en) * 2017-09-06 2018-03-06 淮阴工学院 A kind of text feature based on mutual information and correlation rule
CN108108346A (en) * 2016-11-25 2018-06-01 广东亿迅科技有限公司 The theme feature word abstracting method and device of document
CN108874832A (en) * 2017-05-15 2018-11-23 腾讯科技(深圳)有限公司 Target, which is commented on, determines method and device
CN109165284A (en) * 2018-08-22 2019-01-08 重庆邮电大学 A kind of financial field human-computer dialogue intension recognizing method based on big data
CN109873755A (en) * 2019-03-02 2019-06-11 北京亚鸿世纪科技发展有限公司 A kind of refuse messages classification engine based on variant word identification technology
CN110069630A (en) * 2019-03-20 2019-07-30 重庆信科设计有限公司 A kind of improved mutual information feature selection approach
CN110413789A (en) * 2019-07-31 2019-11-05 广西师范大学 A kind of exercise automatic classification method based on SVM
CN111104449A (en) * 2019-12-18 2020-05-05 福州市勘测院 Multisource city space-time standard address fusion method based on geographic space portrait mining
CN113157912A (en) * 2020-12-24 2021-07-23 航天科工网络信息发展有限公司 Text classification method based on machine learning

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101067808A (en) * 2007-05-24 2007-11-07 上海大学 Text key word extracting method
CN101404036A (en) * 2008-11-07 2009-04-08 西安交通大学 Keyword abstraction method for PowerPoint electronic demonstration draft
CN101777347A (en) * 2009-12-07 2010-07-14 中国科学院自动化研究所 Model complementary Chinese accent identification method and system
CN102033964A (en) * 2011-01-13 2011-04-27 北京邮电大学 Text classification method based on block partition and position weight
CN102662923A (en) * 2012-04-23 2012-09-12 天津大学 Entity instance leading method based on machine learning
CN103279478A (en) * 2013-04-19 2013-09-04 国家电网公司 Method for extracting features based on distributed mutual information documents
CN103559174A (en) * 2013-09-30 2014-02-05 东软集团股份有限公司 Semantic emotion classification characteristic value extraction method and system
CN103678274A (en) * 2013-04-15 2014-03-26 南京邮电大学 Feature extraction method for text categorization based on improved mutual information and entropy
CN103793385A (en) * 2012-10-29 2014-05-14 深圳市世纪光速信息技术有限公司 Textual feature extracting method and device
CN105183813A (en) * 2015-08-26 2015-12-23 山东省计算中心(国家超级计算济南中心) Mutual information based parallel feature selection method for document classification

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101067808A (en) * 2007-05-24 2007-11-07 上海大学 Text key word extracting method
CN101404036A (en) * 2008-11-07 2009-04-08 西安交通大学 Keyword abstraction method for PowerPoint electronic demonstration draft
CN101777347A (en) * 2009-12-07 2010-07-14 中国科学院自动化研究所 Model complementary Chinese accent identification method and system
CN102033964A (en) * 2011-01-13 2011-04-27 北京邮电大学 Text classification method based on block partition and position weight
CN102662923A (en) * 2012-04-23 2012-09-12 天津大学 Entity instance leading method based on machine learning
CN103793385A (en) * 2012-10-29 2014-05-14 深圳市世纪光速信息技术有限公司 Textual feature extracting method and device
CN103678274A (en) * 2013-04-15 2014-03-26 南京邮电大学 Feature extraction method for text categorization based on improved mutual information and entropy
CN103279478A (en) * 2013-04-19 2013-09-04 国家电网公司 Method for extracting features based on distributed mutual information documents
CN103559174A (en) * 2013-09-30 2014-02-05 东软集团股份有限公司 Semantic emotion classification characteristic value extraction method and system
CN105183813A (en) * 2015-08-26 2015-12-23 山东省计算中心(国家超级计算济南中心) Mutual information based parallel feature selection method for document classification

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YAN XU ET AL: "A study on mutual information-based feature selection for text categorization", Journal of Computational Information Systems *
LIU Haifeng et al.: "An improved text feature selection based on mutual information", Computer Engineering and Applications *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294542B (en) * 2016-07-25 2018-03-30 北京市信访矛盾分析研究中心 A kind of letters and calls data mining methods of marking and system
CN106294542A (en) * 2016-07-25 2017-01-04 北京市信访矛盾分析研究中心 A kind of letters and calls data mining methods of marking and system
CN106502394A (en) * 2016-10-18 2017-03-15 哈尔滨工业大学深圳研究生院 Term vector computational methods and device based on EEG signals
CN106502394B (en) * 2016-10-18 2019-06-25 哈尔滨工业大学深圳研究生院 Term vector calculation method and device based on EEG signals
CN106557465A (en) * 2016-11-15 2017-04-05 科大讯飞股份有限公司 A kind of preparation method and device of word weight classification
CN106557465B (en) * 2016-11-15 2020-06-02 科大讯飞股份有限公司 Method and device for obtaining word weight categories
CN108108346B (en) * 2016-11-25 2021-12-24 广东亿迅科技有限公司 Method and device for extracting theme characteristic words of document
CN108108346A (en) * 2016-11-25 2018-06-01 广东亿迅科技有限公司 The theme feature word abstracting method and device of document
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 A kind of file classification method based on LDA
CN106844424B (en) * 2016-12-09 2020-11-03 宁波大学 LDA-based text classification method
CN106776562A (en) * 2016-12-20 2017-05-31 上海智臻智能网络科技股份有限公司 A kind of keyword extracting method and extraction system
CN106776562B (en) * 2016-12-20 2020-07-28 上海智臻智能网络科技股份有限公司 Keyword extraction method and extraction system
CN106709370A (en) * 2016-12-31 2017-05-24 北京明朝万达科技股份有限公司 Long word identification method and system based on text contents
CN106709370B (en) * 2016-12-31 2019-10-29 北京明朝万达科技股份有限公司 A kind of long word recognition method and system based on content of text
CN107092644A (en) * 2017-03-07 2017-08-25 重庆邮电大学 A kind of Chinese Text Categorization based on MPI and Adaboost.MH
CN106951498A (en) * 2017-03-15 2017-07-14 国信优易数据有限公司 Text clustering method
CN108874832A (en) * 2017-05-15 2018-11-23 腾讯科技(深圳)有限公司 Target, which is commented on, determines method and device
CN107193804A (en) * 2017-06-02 2017-09-22 河海大学 A kind of refuse messages text feature selection method towards word and portmanteau word
CN107193804B (en) * 2017-06-02 2019-03-29 河海大学 A kind of refuse messages text feature selection method towards word and portmanteau word
CN107562814A (en) * 2017-08-14 2018-01-09 中国农业大学 A kind of earthquake emergency and the condition of a disaster acquisition of information sorting technique and system
CN107766323A (en) * 2017-09-06 2018-03-06 淮阴工学院 A kind of text feature based on mutual information and correlation rule
CN107766323B (en) * 2017-09-06 2021-08-31 淮阴工学院 Text feature extraction method based on mutual information and association rule
CN107633882B (en) * 2017-09-11 2019-05-14 合肥工业大学 Mix the minimally invasive medical service system and its aid decision-making method under cloud framework
CN107633882A (en) * 2017-09-11 2018-01-26 合肥工业大学 Mix the minimally invasive medical service system and its aid decision-making method under cloud framework
CN107562928B (en) * 2017-09-15 2019-11-15 南京大学 A kind of CCMI text feature selection method
CN107562928A (en) * 2017-09-15 2018-01-09 南京大学 A kind of CCMI text feature selections method
CN109165284A (en) * 2018-08-22 2019-01-08 重庆邮电大学 A kind of financial field human-computer dialogue intension recognizing method based on big data
CN109873755A (en) * 2019-03-02 2019-06-11 北京亚鸿世纪科技发展有限公司 A kind of refuse messages classification engine based on variant word identification technology
CN109873755B (en) * 2019-03-02 2021-01-01 北京亚鸿世纪科技发展有限公司 Junk short message classification engine based on variant word recognition technology
CN110069630A (en) * 2019-03-20 2019-07-30 重庆信科设计有限公司 A kind of improved mutual information feature selection approach
CN110413789A (en) * 2019-07-31 2019-11-05 广西师范大学 A kind of exercise automatic classification method based on SVM
CN111104449A (en) * 2019-12-18 2020-05-05 福州市勘测院 Multisource city space-time standard address fusion method based on geographic space portrait mining
CN113157912A (en) * 2020-12-24 2021-07-23 航天科工网络信息发展有限公司 Text classification method based on machine learning

Similar Documents

Publication Publication Date Title
CN105701084A (en) Characteristic extraction method of text classification on the basis of mutual information
CN107463607B (en) Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning
US9195647B1 (en) System, methods, and data structure for machine-learning of contextualized symbolic associations
Fatima et al. Text Document categorization using support vector machine
Alghamdi et al. Arabic web pages clustering and annotation using semantic class features
CN104573030A (en) Textual emotion prediction method and device
CN107463715A (en) English social media account number classification method based on information gain
Banik et al. Survey on text-based sentiment analysis of bengali language
Dung Natural language understanding
Rabbimov et al. Uzbek news categorization using word embeddings and convolutional neural networks
Sigit et al. Comparison of Classification Methods on Sentiment Analysis of Political Figure Electability Based on Public Comments on Online News Media Sites
Liu Automatic argumentative-zoning using word2vec
Chader et al. Sentiment analysis in google play store: Algerian reviews case
CN110705285A (en) Government affair text subject word bank construction method, device, server and readable storage medium
Zhang et al. Grasp the implicit features: Hierarchical emotion classification based on topic model and SVM
CN115713085A (en) Document theme content analysis method and device
Patra et al. Multimodal mood classification-a case study of differences in hindi and western songs
CN103793491B (en) Chinese news story segmentation method based on flexible semantic similarity measurement
Imran et al. Twitter Sentimental Analysis using Machine Learning Approaches for SemeVal Dataset
Rohman et al. Automatic detection of argument components in text using multinomial Nave Bayes clasiffier
Jiménez et al. On Extracting Information from Semi-structured Deep Web Documents
Yu et al. Automatic Sentiment Analysis System for Myanmar News
Li et al. Predicting abstract keywords by word vectors
Franciscus et al. Beyond word-cloud: A graph model derived from beliefs
Chen et al. Incremental Patent Semantic Annotation Based on Keyword Extraction and List Extraction

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160622

RJ01 Rejection of invention patent application after publication