CN105701084A - Characteristic extraction method of text classification on the basis of mutual information - Google Patents
- Publication number
- CN105701084A CN105701084A CN201511018702.3A CN201511018702A CN105701084A CN 105701084 A CN105701084 A CN 105701084A CN 201511018702 A CN201511018702 A CN 201511018702A CN 105701084 A CN105701084 A CN 105701084A
- Authority
- CN
- China
- Prior art keywords
- text
- classification
- feature
- document
- log
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a feature extraction method for text classification based on mutual information. Text preprocessing mainly comprises removing document markup, removing stop words, word segmentation, part-of-speech tagging, word-frequency statistics, data cleaning, and the like, after which feature words are extracted according to a feature algorithm. In the text classification stage, model parameters are trained on the vectorized training set by a support vector machine algorithm, and the texts to be classified undergo machine-learning classification. By applying the scheme of the invention, noise features can be effectively kept out of the machine-learning pipeline during feature extraction for text classification, the precision of text classification is improved, the scale of the feature library is greatly reduced, and memory occupation is lowered.
Description
Technical field
The invention belongs to the technical field of natural language processing, and specifically relates to a feature extraction method for text classification based on mutual information.
Background technology
With the rapid development of the Internet, multimedia, and storage technology, more and more information (especially multimedia information) is generated, propagated, and accumulated. The Internet makes information dissemination easier, and individual users can easily find and download the information they want; large-capacity hard disks can store ever more of it. Even without counting the resources on the World Wide Web, the number of documents accumulated on a personal computer may reach tens of gigabytes. How to manage this information effectively and use it conveniently is a major problem for individual users. Statistics show that although multimedia information on the Internet keeps growing, text will remain the most important information source for the foreseeable future. Accordingly, the development of text information processing technology has not stagnated because of the rapid growth of multimedia information but, on the contrary, is flourishing. Text classification technology is a powerful means of organizing and managing text information. Text classification appeared as early as the 1960s, but only after the 1990s did it gradually become a research hotspot. Machine learning has become the main processing method: it automatically learns the characteristics of each class from a pre-classified text set and builds an automatic classifier, saving labor while achieving good results. Most current research therefore focuses on machine-learning-based text classification methods.
The basic task of text classification is to determine the relation between a document and the given classes according to the document's content, that is, to find in the given class set the class best suited to the current document. The connection between document and class can be regarded as a mapping; since a document may belong to several classes, the mapping can be one-to-one or one-to-many. The mapping rule is determined by learning from a given training document set and class set, and it varies with the learning method. When the system encounters a new document, the mapping rule determines the class the document belongs to. The difficulty of text classification is that the content of text is natural language, which makes it hard for a computer to process text semantically. At present, scholars apply methods from statistical analysis, machine learning, data mining, and related fields: by classifying text information based on its content, they automatically build user-friendly text classification systems, which can greatly reduce the human resources an organization spends on sorting documents and help users find the required information quickly. How to effectively keep noise features out of the machine-learning pipeline is therefore one of the most important research directions for improving the precision of text classification.
Summary of the invention
It is an object of the invention to overcome the deficiencies of the prior art and provide a feature extraction method for mutual-information-based text classification that can effectively keep noise features out of the machine-learning pipeline and improve the precision of text classification.
To solve the above technical problem, the invention adopts the following technical solution: a feature extraction method for text classification based on mutual information, comprising the following steps:
(a) Preprocess the training text:
Build a stop-word dictionary and a training text set; segment the training texts in the data set into words; after segmentation, filter out the stop words according to the stop-word dictionary; and POS-tag the segmented text;
(b) Perform feature extraction on the preprocessed text:
From the text preprocessed in step (a), compute the mutual information between each remaining term and each class according to formulas (1) and (2).
Formula (1) is:
I(U;C) = Σ_{et∈{1,0}} Σ_{ec∈{1,0}} P(U=et, C=ec) · log2[ P(U=et, C=ec) / (P(U=et) · P(C=ec)) ] …………(1)
where U is a term variable and C a class variable, both binary random variables: U takes the value et = 1 when the document contains term t and et = 0 otherwise; C takes the value ec = 1 when the document belongs to class c and ec = 0 otherwise.
Under maximum-likelihood estimation, all the probabilities above are computed from document counts; the practical formula is then as follows.
Formula (2) is:
I(U;C) = (N11/N)·log2[(N·N11)/(N1.·N.1)] + (N01/N)·log2[(N·N01)/(N0.·N.1)] + (N10/N)·log2[(N·N10)/(N1.·N.0)] + (N00/N)·log2[(N·N00)/(N0.·N.0)] …………(2)
where Nxy denotes the number of documents with et = x and ec = y, N1. = N10 + N11, N.1 = N11 + N01, and N = N00 + N01 + N10 + N11;
For each class, compute the mutual information between the class and each term, and select the k terms with the largest values;
Delete the terms repeated across classes; this screening yields the feature words;
(c) Assign weights to the feature words:
With the feature words obtained in step (b), count the frequency of each feature word in each document, the total number of documents, and the number of documents containing each feature word, and compute the weight of each feature according to formula (3).
Formula (3), the TF-IDF computing formula, is:
w(ti, d) = tf(ti, d) · log(N/ni + 0.01) / sqrt( Σj [tf(tj, d) · log(N/nj + 0.01)]² ) …………(3)
where tf(ti, d) is the frequency of feature (term) ti in document d, N is the total number of documents, ni is the number of documents containing term ti, 0.01 is a constant (its value generally taken as 0.01), log(N/ni + 0.01) is the inverse document frequency, and the denominator is a normalization factor; based on the training text set, the feature evaluation function TF-IDF scores each feature word ti;
(d) SVM model training and prediction
Vectorize each document, converting it into a term vector; the first dimension of the vector represents the document's class, and dimensions 2 through K hold the feature words and their weights; feed these vectors into the SVM model, train the model parameters, and then perform text prediction.
Detailed description of the invention
A specific embodiment of the invention is described below.
The feature extraction method for mutual-information-based text classification provided by the invention comprises the following steps:
1) Crawl a number of articles of each class from the Internet as the training data set of the text classification system;
2) Preprocess the training text: segment the training data set into words. The segmentation tool used is jieba ("stutter" segmentation), an open-source Chinese word segmentation module developed in Python. Then filter out stop words according to the stop-word dictionary, and POS-tag the segmented text with the jieba module.
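The preprocessing above can be sketched in Python. The patent names jieba for segmentation and POS tagging; the sketch below assumes the tokens have already been produced as (word, POS-tag) pairs (e.g. by jieba.posseg.cut) and shows only the stop-word filtering plus the noun/verb screen used in step 3). The function name and sample data are illustrative, not from the patent.

```python
def filter_tokens(tagged_tokens, stopwords, keep_pos=("n", "v")):
    """Drop stop words and keep only tokens whose POS tag starts with
    one of keep_pos (jieba tags nouns as 'n*' and verbs as 'v*')."""
    return [word for word, pos in tagged_tokens
            if word not in stopwords and pos[:1] in keep_pos]

# Hypothetical tagged output for a short sentence; with jieba installed
# this list would come from: jieba.posseg.cut(text)
tagged = [("文本", "n"), ("的", "uj"), ("分类", "n"), ("进行", "v"), ("是", "v")]
print(filter_tokens(tagged, stopwords={"的", "是"}))  # ['文本', '分类', '进行']
```

The particle "的" is dropped by the POS screen, and the stop word "是" by the dictionary, leaving only content-bearing nouns and verbs as feature candidates.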
3) Perform feature extraction on the preprocessed text: from the text preprocessed in step 2), keep only the words whose part of speech is noun or verb; this is the initial feature extraction. Then compute the mutual information between each remaining term and each class according to formulas (1) and (2) above,
where U is a term variable and C a class variable, both binary random variables: U takes the value et = 1 when the document contains term t and et = 0 otherwise, and C takes the value ec = 1 when the document belongs to class c and ec = 0 otherwise. Under maximum-likelihood estimation, all the probabilities are computed from the counts of terms over documents and classes, giving the practical formula (2),
where Nxy denotes the number of documents with et = x and ec = y. For example, N10 counts documents that contain term t (et = 1) but do not belong to class c (ec = 0); N1. = N10 + N11 is the number of documents containing term t; N.1 = N11 + N01 is the number of documents belonging to class c; and N = N00 + N01 + N10 + N11 is the total number of documents.
For each class, compute the mutual information between the class and each term and select the k terms with the largest values. Two classes may of course choose the same feature word, so repeated terms are removed. The result is the final set of feature words.
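The maximum-likelihood estimate of formula (2) can be computed directly from the four document counts. The sketch below is a minimal illustration; the function and variable names are the editor's own, not from the patent.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """I(U;C) estimated from document counts (formula (2)):
    n11 = docs containing the term and in the class,
    n10 = containing the term, not in the class,
    n01 = not containing the term, in the class,
    n00 = neither."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # marginals over et
    n_1, n_0 = n11 + n01, n10 + n00   # marginals over ec
    mi = 0.0
    for nxy, nx, ny in ((n11, n1_, n_1), (n01, n0_, n_1),
                        (n10, n1_, n_0), (n00, n0_, n_0)):
        if nxy:  # 0 * log(0) is taken as 0
            mi += (nxy / n) * math.log2(n * nxy / (nx * ny))
    return mi

# A term occurring in exactly the documents of one balanced class
# carries a full bit of information about that class:
print(mutual_information(50, 0, 0, 50))    # 1.0
# A term independent of the class carries none:
print(mutual_information(25, 25, 25, 25))  # 0.0
```

Per step 3), this score would be evaluated for every (term, class) pair; the k highest-scoring terms per class are kept and then deduplicated across classes.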
4) Assign weights to the feature words: with the feature words obtained, count the frequency of each feature word in each document, the total number of documents, and the number of documents containing each feature word, then compute the weight of each feature according to TF-IDF. TF-IDF is a statistical method for assessing the importance of a word to one document within a collection of N documents or a corpus.
The TF-IDF computing formula is:
w(ti, d) = tf(ti, d) · log(N/ni + 0.01) / sqrt( Σj [tf(tj, d) · log(N/nj + 0.01)]² ) …………(3)
where tf(ti, d) is the frequency of feature (term) ti in document d, N is the total number of documents, ni is the number of documents containing term ti, 0.01 is a constant (its value generally taken as 0.01), log(N/ni + 0.01) is the inverse document frequency, and the denominator is a normalization factor. Based on the training text set, the feature evaluation function TF-IDF scores each feature word ti.
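Formula (3) can be sketched as follows: weight each feature by term frequency times smoothed inverse document frequency, then length-normalize the document vector (the normalization denominator described above). Function and parameter names are illustrative.

```python
import math

def tfidf_weights(tfs, dfs, n_docs):
    """Weight the features of one document by formula (3):
    tf * log(N/df + 0.01), then normalize the vector to unit length.
    tfs[i] = frequency of feature i in this document,
    dfs[i] = number of documents containing feature i."""
    raw = [tf * math.log(n_docs / df + 0.01) for tf, df in zip(tfs, dfs)]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm else raw

# A frequent, rare term (tf=3, df=5) outweighs a common one (tf=1, df=80):
w = tfidf_weights(tfs=[3, 1, 0], dfs=[5, 80, 40], n_docs=100)
```

The 0.01 constant keeps the weight of a term appearing in every document slightly above zero rather than exactly zero, which is the usual motivation for this smoothed IDF variant.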
5) SVM model training and prediction: the support vector machine method is built on the VC-dimension theory and structural-risk-minimization principle of statistical learning theory. It seeks the optimal trade-off between model complexity and learning capacity given the limited sample information, so as to obtain the best generalization ability.
Vectorize each document, converting it into a term vector. The first dimension of the vector represents the document's class, and dimensions 2 through K hold the feature words and their weights (as described in step 3). Feed these vectors into the libSVM model, train the model parameters, and then perform text prediction. The model returns two results, label and score: label is the predicted class label, and score is the degree of membership of the sample in that class; the larger the score, the higher the confidence that the sample belongs to the class.
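The vector layout described above (class label first, then feature-index:weight pairs) matches the sparse text format that libSVM reads for training. A minimal serializer, with illustrative names, might look like:

```python
def to_libsvm_line(label, weights):
    """Serialize one document as a libSVM line: '<label> <i>:<w> ...'
    with 1-based feature indices; zero weights are omitted (sparse)."""
    pairs = " ".join(f"{i}:{w:g}" for i, w in enumerate(weights, 1) if w)
    return f"{label} {pairs}".rstrip()

print(to_libsvm_line(2, [0.5, 0.0, 0.25]))  # 2 1:0.5 3:0.25
```

Lines in this format can be written to a file and passed to libSVM's svm-train; the training and the label/score prediction described above are then handled by the libSVM tools themselves.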
In light of the disclosure and teachings of the above description, those skilled in the art may modify and amend the above embodiment. The invention is therefore not limited to the specific embodiment disclosed and described above; some modifications and changes to the invention shall also fall within the scope of the claims of the invention. In addition, although certain specific terms are used in this specification, they are intended only for convenience of description and do not constitute any limitation on the invention.
Claims (1)
1. A feature extraction method for text classification based on mutual information, characterized in that it comprises the following steps:
(a) Preprocess the training text:
Build a stop-word dictionary and a training text set; segment the training texts in the data set into words; after segmentation, filter out the stop words according to the stop-word dictionary; and POS-tag the segmented text;
(b) Perform feature extraction on the preprocessed text:
From the text preprocessed in step (a), compute the mutual information between each remaining term and each class according to formulas (1) and (2).
Formula (1) is:
I(U;C) = Σ_{et∈{1,0}} Σ_{ec∈{1,0}} P(U=et, C=ec) · log2[ P(U=et, C=ec) / (P(U=et) · P(C=ec)) ]
where U is a term variable and C a class variable, both binary random variables: U takes the value et = 1 when the document contains term t and et = 0 otherwise; C takes the value ec = 1 when the document belongs to class c and ec = 0 otherwise;
Under maximum-likelihood estimation, all the probabilities above are computed from document counts; the practical formula is then as follows.
Formula (2) is:
I(U;C) = (N11/N)·log2[(N·N11)/(N1.·N.1)] + (N01/N)·log2[(N·N01)/(N0.·N.1)] + (N10/N)·log2[(N·N10)/(N1.·N.0)] + (N00/N)·log2[(N·N00)/(N0.·N.0)]
where Nxy denotes the number of documents with et = x and ec = y, and N = N00 + N01 + N10 + N11;
For each class, compute the mutual information between the class and each term, and select the k terms with the largest values;
Delete the terms repeated across classes; this screening yields the feature words;
(c) Assign weights to the feature words:
With the feature words obtained in step (b), count the frequency of each feature word in each document, the total number of documents, and the number of documents containing each feature word, and compute the weight of each feature according to formula (3).
Formula (3), the TF-IDF computing formula, is:
w(ti, d) = tf(ti, d) · log(N/ni + 0.01) / sqrt( Σj [tf(tj, d) · log(N/nj + 0.01)]² )
where tf(ti, d) is the frequency of feature (term) ti in document d, N is the total number of documents, ni is the number of documents containing term ti, 0.01 is a constant (its value generally taken as 0.01), log(N/ni + 0.01) is the inverse document frequency, and the denominator is a normalization factor; based on the training text set, the feature evaluation function TF-IDF scores each feature word ti;
(d) SVM model training and prediction
Vectorize each document, converting it into a term vector; the first dimension of the vector represents the document's class, and dimensions 2 through K hold the feature words and their weights; feed these vectors into the SVM model, train the model parameters, and then perform text prediction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511018702.3A CN105701084A (en) | 2015-12-28 | 2015-12-28 | Characteristic extraction method of text classification on the basis of mutual information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105701084A true CN105701084A (en) | 2016-06-22 |
Family
ID=56225995
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201511018702.3A Pending CN105701084A (en) | 2015-12-28 | 2015-12-28 | Characteristic extraction method of text classification on the basis of mutual information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105701084A (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106294542A (en) * | 2016-07-25 | 2017-01-04 | 北京市信访矛盾分析研究中心 | A kind of letters and calls data mining methods of marking and system |
CN106502394A (en) * | 2016-10-18 | 2017-03-15 | 哈尔滨工业大学深圳研究生院 | Term vector computational methods and device based on EEG signals |
CN106557465A (en) * | 2016-11-15 | 2017-04-05 | 科大讯飞股份有限公司 | A kind of preparation method and device of word weight classification |
CN106709370A (en) * | 2016-12-31 | 2017-05-24 | 北京明朝万达科技股份有限公司 | Long word identification method and system based on text contents |
CN106776562A (en) * | 2016-12-20 | 2017-05-31 | 上海智臻智能网络科技股份有限公司 | A kind of keyword extracting method and extraction system |
CN106844424A (en) * | 2016-12-09 | 2017-06-13 | 宁波大学 | A kind of file classification method based on LDA |
CN106951498A (en) * | 2017-03-15 | 2017-07-14 | 国信优易数据有限公司 | Text clustering method |
CN107092644A (en) * | 2017-03-07 | 2017-08-25 | 重庆邮电大学 | A kind of Chinese Text Categorization based on MPI and Adaboost.MH |
CN107193804A (en) * | 2017-06-02 | 2017-09-22 | 河海大学 | A kind of refuse messages text feature selection method towards word and portmanteau word |
CN107562814A (en) * | 2017-08-14 | 2018-01-09 | 中国农业大学 | A kind of earthquake emergency and the condition of a disaster acquisition of information sorting technique and system |
CN107562928A (en) * | 2017-09-15 | 2018-01-09 | 南京大学 | A kind of CCMI text feature selections method |
CN107633882A (en) * | 2017-09-11 | 2018-01-26 | 合肥工业大学 | Mix the minimally invasive medical service system and its aid decision-making method under cloud framework |
CN107766323A (en) * | 2017-09-06 | 2018-03-06 | 淮阴工学院 | A kind of text feature based on mutual information and correlation rule |
CN108108346A (en) * | 2016-11-25 | 2018-06-01 | 广东亿迅科技有限公司 | The theme feature word abstracting method and device of document |
CN108874832A (en) * | 2017-05-15 | 2018-11-23 | 腾讯科技(深圳)有限公司 | Target, which is commented on, determines method and device |
CN109165284A (en) * | 2018-08-22 | 2019-01-08 | 重庆邮电大学 | A kind of financial field human-computer dialogue intension recognizing method based on big data |
CN109873755A (en) * | 2019-03-02 | 2019-06-11 | 北京亚鸿世纪科技发展有限公司 | A kind of refuse messages classification engine based on variant word identification technology |
CN110069630A (en) * | 2019-03-20 | 2019-07-30 | 重庆信科设计有限公司 | A kind of improved mutual information feature selection approach |
CN110413789A (en) * | 2019-07-31 | 2019-11-05 | 广西师范大学 | A kind of exercise automatic classification method based on SVM |
CN111104449A (en) * | 2019-12-18 | 2020-05-05 | 福州市勘测院 | Multisource city space-time standard address fusion method based on geographic space portrait mining |
CN113157912A (en) * | 2020-12-24 | 2021-07-23 | 航天科工网络信息发展有限公司 | Text classification method based on machine learning |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101067808A (en) * | 2007-05-24 | 2007-11-07 | 上海大学 | Text key word extracting method |
CN101404036A (en) * | 2008-11-07 | 2009-04-08 | 西安交通大学 | Keyword abstraction method for PowerPoint electronic demonstration draft |
CN101777347A (en) * | 2009-12-07 | 2010-07-14 | 中国科学院自动化研究所 | Model complementary Chinese accent identification method and system |
CN102033964A (en) * | 2011-01-13 | 2011-04-27 | 北京邮电大学 | Text classification method based on block partition and position weight |
CN102662923A (en) * | 2012-04-23 | 2012-09-12 | 天津大学 | Entity instance leading method based on machine learning |
CN103279478A (en) * | 2013-04-19 | 2013-09-04 | 国家电网公司 | Method for extracting features based on distributed mutual information documents |
CN103559174A (en) * | 2013-09-30 | 2014-02-05 | 东软集团股份有限公司 | Semantic emotion classification characteristic value extraction method and system |
CN103678274A (en) * | 2013-04-15 | 2014-03-26 | 南京邮电大学 | Feature extraction method for text categorization based on improved mutual information and entropy |
CN103793385A (en) * | 2012-10-29 | 2014-05-14 | 深圳市世纪光速信息技术有限公司 | Textual feature extracting method and device |
CN105183813A (en) * | 2015-08-26 | 2015-12-23 | 山东省计算中心(国家超级计算济南中心) | Mutual information based parallel feature selection method for document classification |
- 2015-12-28 CN CN201511018702.3A patent/CN105701084A/en active Pending
Non-Patent Citations (2)
Title |
---|
YAN XU ET AL: "A study on mutual information-based feature selection for text categorization", Journal of Computational Information Systems *
LIU Haifeng et al.: "An improved text feature selection based on mutual information" (in Chinese), Computer Engineering and Applications *
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106294542B (en) * | 2016-07-25 | 2018-03-30 | 北京市信访矛盾分析研究中心 | A kind of letters and calls data mining methods of marking and system |
CN106294542A (en) * | 2016-07-25 | 2017-01-04 | 北京市信访矛盾分析研究中心 | A kind of letters and calls data mining methods of marking and system |
CN106502394A (en) * | 2016-10-18 | 2017-03-15 | 哈尔滨工业大学深圳研究生院 | Term vector computational methods and device based on EEG signals |
CN106502394B (en) * | 2016-10-18 | 2019-06-25 | 哈尔滨工业大学深圳研究生院 | Term vector calculation method and device based on EEG signals |
CN106557465A (en) * | 2016-11-15 | 2017-04-05 | 科大讯飞股份有限公司 | A kind of preparation method and device of word weight classification |
CN106557465B (en) * | 2016-11-15 | 2020-06-02 | 科大讯飞股份有限公司 | Method and device for obtaining word weight categories |
CN108108346B (en) * | 2016-11-25 | 2021-12-24 | 广东亿迅科技有限公司 | Method and device for extracting theme characteristic words of document |
CN108108346A (en) * | 2016-11-25 | 2018-06-01 | 广东亿迅科技有限公司 | The theme feature word abstracting method and device of document |
CN106844424A (en) * | 2016-12-09 | 2017-06-13 | 宁波大学 | A kind of file classification method based on LDA |
CN106844424B (en) * | 2016-12-09 | 2020-11-03 | 宁波大学 | LDA-based text classification method |
CN106776562A (en) * | 2016-12-20 | 2017-05-31 | 上海智臻智能网络科技股份有限公司 | A kind of keyword extracting method and extraction system |
CN106776562B (en) * | 2016-12-20 | 2020-07-28 | 上海智臻智能网络科技股份有限公司 | Keyword extraction method and extraction system |
CN106709370A (en) * | 2016-12-31 | 2017-05-24 | 北京明朝万达科技股份有限公司 | Long word identification method and system based on text contents |
CN106709370B (en) * | 2016-12-31 | 2019-10-29 | 北京明朝万达科技股份有限公司 | A kind of long word recognition method and system based on content of text |
CN107092644A (en) * | 2017-03-07 | 2017-08-25 | 重庆邮电大学 | A kind of Chinese Text Categorization based on MPI and Adaboost.MH |
CN106951498A (en) * | 2017-03-15 | 2017-07-14 | 国信优易数据有限公司 | Text clustering method |
CN108874832A (en) * | 2017-05-15 | 2018-11-23 | 腾讯科技(深圳)有限公司 | Target, which is commented on, determines method and device |
CN107193804A (en) * | 2017-06-02 | 2017-09-22 | 河海大学 | A kind of refuse messages text feature selection method towards word and portmanteau word |
CN107193804B (en) * | 2017-06-02 | 2019-03-29 | 河海大学 | A kind of refuse messages text feature selection method towards word and portmanteau word |
CN107562814A (en) * | 2017-08-14 | 2018-01-09 | 中国农业大学 | A kind of earthquake emergency and the condition of a disaster acquisition of information sorting technique and system |
CN107766323A (en) * | 2017-09-06 | 2018-03-06 | 淮阴工学院 | A kind of text feature based on mutual information and correlation rule |
CN107766323B (en) * | 2017-09-06 | 2021-08-31 | 淮阴工学院 | Text feature extraction method based on mutual information and association rule |
CN107633882B (en) * | 2017-09-11 | 2019-05-14 | 合肥工业大学 | Mix the minimally invasive medical service system and its aid decision-making method under cloud framework |
CN107633882A (en) * | 2017-09-11 | 2018-01-26 | 合肥工业大学 | Mix the minimally invasive medical service system and its aid decision-making method under cloud framework |
CN107562928B (en) * | 2017-09-15 | 2019-11-15 | 南京大学 | A kind of CCMI text feature selection method |
CN107562928A (en) * | 2017-09-15 | 2018-01-09 | 南京大学 | A kind of CCMI text feature selections method |
CN109165284A (en) * | 2018-08-22 | 2019-01-08 | 重庆邮电大学 | A kind of financial field human-computer dialogue intension recognizing method based on big data |
CN109873755A (en) * | 2019-03-02 | 2019-06-11 | 北京亚鸿世纪科技发展有限公司 | A kind of refuse messages classification engine based on variant word identification technology |
CN109873755B (en) * | 2019-03-02 | 2021-01-01 | 北京亚鸿世纪科技发展有限公司 | Junk short message classification engine based on variant word recognition technology |
CN110069630A (en) * | 2019-03-20 | 2019-07-30 | 重庆信科设计有限公司 | A kind of improved mutual information feature selection approach |
CN110413789A (en) * | 2019-07-31 | 2019-11-05 | 广西师范大学 | A kind of exercise automatic classification method based on SVM |
CN111104449A (en) * | 2019-12-18 | 2020-05-05 | 福州市勘测院 | Multisource city space-time standard address fusion method based on geographic space portrait mining |
CN113157912A (en) * | 2020-12-24 | 2021-07-23 | 航天科工网络信息发展有限公司 | Text classification method based on machine learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105701084A (en) | Characteristic extraction method of text classification on the basis of mutual information | |
CN107463607B (en) | Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning | |
US9195647B1 (en) | System, methods, and data structure for machine-learning of contextualized symbolic associations | |
Fatima et al. | Text Document categorization using support vector machine | |
Alghamdi et al. | Arabic web pages clustering and annotation using semantic class features | |
CN104573030A (en) | Textual emotion prediction method and device | |
CN107463715A (en) | English social media account number classification method based on information gain | |
Banik et al. | Survey on text-based sentiment analysis of bengali language | |
Dung | Natural language understanding | |
Rabbimov et al. | Uzbek news categorization using word embeddings and convolutional neural networks | |
Sigit et al. | Comparison of Classification Methods on Sentiment Analysis of Political Figure Electability Based on Public Comments on Online News Media Sites | |
Liu | Automatic argumentative-zoning using word2vec | |
Chader et al. | Sentiment analysis in google play store: Algerian reviews case | |
CN110705285A (en) | Government affair text subject word bank construction method, device, server and readable storage medium | |
Zhang et al. | Grasp the implicit features: Hierarchical emotion classification based on topic model and SVM | |
CN115713085A (en) | Document theme content analysis method and device | |
Patra et al. | Multimodal mood classification-a case study of differences in hindi and western songs | |
CN103793491B (en) | Chinese news story segmentation method based on flexible semantic similarity measurement | |
Imran et al. | Twitter Sentimental Analysis using Machine Learning Approaches for SemeVal Dataset | |
Rohman et al. | Automatic detection of argument components in text using multinomial Nave Bayes clasiffier | |
Jiménez et al. | On Extracting Information from Semi-structured Deep Web Documents | |
Yu et al. | Automatic Sentiment Analysis System for Myanmar News | |
Li et al. | Predicting abstract keywords by word vectors | |
Franciscus et al. | Beyond word-cloud: A graph model derived from beliefs | |
Chen et al. | Incremental Patent Semantic Annotation Based on Keyword Extraction and List Extraction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20160622 |
|
RJ01 | Rejection of invention patent application after publication |