CN107766323B - A text feature extraction method based on mutual information and association rules - Google Patents
- Publication number
- CN107766323B (application CN201710796425.1A)
- Authority
- CN
- China
- Prior art keywords
- text
- word
- term
- entering
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710796425.1A CN107766323B (zh) | 2017-09-06 | 2017-09-06 | A text feature extraction method based on mutual information and association rules |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710796425.1A CN107766323B (zh) | 2017-09-06 | 2017-09-06 | A text feature extraction method based on mutual information and association rules |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107766323A (zh) | 2018-03-06 |
CN107766323B (zh) | 2021-08-31 |
Family
ID=61265086
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710796425.1A Active CN107766323B (zh) | A text feature extraction method based on mutual information and association rules | 2017-09-06 | 2017-09-06 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107766323B (zh) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
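CN109240258A (zh) * | 2018-07-09 | 2019-01-18 | 上海万行信息科技有限公司 | Word-vector-based intelligent auxiliary diagnosis method and system for automobile faults |
CN109684462B (zh) * | 2018-12-30 | 2022-12-06 | 广西财经学院 | Method for mining association rules between text words based on weight comparison and chi-square analysis |
CN109739953B (zh) * | 2018-12-30 | 2021-07-20 | 广西财经学院 | Text retrieval method based on a chi-square analysis-confidence framework and consequent expansion |
CN109857866B (zh) * | 2019-01-14 | 2021-05-25 | 中国科学院信息工程研究所 | Keyword extraction method for event query suggestion, event query suggestion generation method, and retrieval system |
CN112818146B (zh) * | 2021-01-26 | 2022-12-02 | 山西三友和智慧信息技术股份有限公司 | Recommendation method based on product image style |
CN113704447B (zh) * | 2021-03-03 | 2024-05-03 | 腾讯科技(深圳)有限公司 | Text information recognition method and related apparatus |
CN113807456B (zh) * | 2021-09-26 | 2024-04-09 | 大连交通大学 | Multi-label classification method using mutual-information-based feature screening and association rules |
CN116644184B (zh) * | 2023-07-27 | 2023-10-20 | 浙江厚雪网络科技有限公司 | Human resource information management system based on data clustering |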
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
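CN103279478A (zh) * | 2013-04-19 | 2013-09-04 | 国家电网公司 | Document feature extraction method based on distributed mutual information |
CN103678274A (zh) * | 2013-04-15 | 2014-03-26 | 南京邮电大学 | Text classification feature extraction method based on improved mutual information and entropy |
CN105335785A (zh) * | 2015-10-30 | 2016-02-17 | 西华大学 | Association rule mining method based on vector operations |
CN105631462A (zh) * | 2014-10-28 | 2016-06-01 | 北京交通大学 | Spatio-temporal-context-based behavior recognition method combining confidence and contribution |
CN105701084A (zh) * | 2015-12-28 | 2016-06-22 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Feature extraction method for text classification based on mutual information |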
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
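CN103678274A (zh) * | 2013-04-15 | 2014-03-26 | 南京邮电大学 | Text classification feature extraction method based on improved mutual information and entropy |
CN103279478A (zh) * | 2013-04-19 | 2013-09-04 | 国家电网公司 | Document feature extraction method based on distributed mutual information |
CN105631462A (zh) * | 2014-10-28 | 2016-06-01 | 北京交通大学 | Spatio-temporal-context-based behavior recognition method combining confidence and contribution |
CN105335785A (zh) * | 2015-10-30 | 2016-02-17 | 西华大学 | Association rule mining method based on vector operations |
CN105701084A (zh) * | 2015-12-28 | 2016-06-22 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Feature extraction method for text classification based on mutual information |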
Non-Patent Citations (2)
Title |
---|
"Unsupervised Data Driven Feature Extraction by Means of Mutual Information Maximization"; Marinoni A et al.; IEEE Transactions on Computational Imaging; 2017-06-15; Vol. 3, No. 2; full text * |
"Document clustering based on association relations between terms"; 任建华 et al.; 计算机工程与应用; 2014-12-11; Vol. 52, No. 7; p. 87, column 2, paragraphs 4-7 * |
Also Published As
Publication number | Publication date |
---|---|
CN107766323A (zh) | 2018-03-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
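CN107766323B (zh) | A text feature extraction method based on mutual information and association rules | |
CN107609121B (zh) | News text classification method based on LDA and word2vec algorithms | |
CN107705066B (zh) | Information entry method for goods warehousing, and electronic device | |
CN107633007B (zh) | System and method for labeling product review data based on hierarchical AP clustering | |
KR102019194B1 (ko) | System and method for extracting core keywords in a document | |
Mandal et al. | Supervised learning methods for bangla web document categorization | |
CN109408743B (zh) | Text link embedding method | |
US7469246B1 (en) | Method and system for classifying or clustering one item into multiple categories | |
CN106844407B (zh) | Method and system for generating a label network based on dataset correlation | |
CN111159485B (zh) | Tail entity linking method, apparatus, server, and storage medium | |
CN108647322B (zh) | Method for identifying the similarity of large volumes of Web text information based on a word network | |
CN111046282B (zh) | Text label setting method, apparatus, medium, and electronic device | |
WO2021253873A1 (zh) | Similar document retrieval method and apparatus | |
CN107506472A (zh) | Method for classifying web pages browsed by students | |
CN112699232A (zh) | Text label extraction method, apparatus, device, and storage medium | |
CN114818674A (zh) | Product title keyword extraction method and apparatus, device, medium, and product thereof | |
CN113032556A (zh) | Method for forming a user profile based on natural language processing | |
Wei et al. | Online education recommendation model based on user behavior data analysis | |
Perez-Tellez et al. | On the difficulty of clustering microblog texts for online reputation management | |
Qingyun et al. | Keyword extraction method for complex nodes based on TextRank algorithm | |
Senthilkumar et al. | A Survey On Feature Selection Method For Product Review | |
Godara et al. | Support vector machine classifier with principal component analysis and k mean for sarcasm detection | |
CN110020439B (zh) | Multi-domain text implicit feature extraction method based on a hidden association network | |
Han et al. | The application of support vector machine (SVM) on the sentiment analysis of internet posts | |
CN115827990A (zh) | Search method and apparatus |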
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
EE01 | Entry into force of recordation of patent licensing contract | Application publication date: 20180306; Assignee: Fanyun software (Nanjing) Co.,Ltd.; Assignor: HUAIYIN INSTITUTE OF TECHNOLOGY; Contract record no.: X2021980010526; Denomination of invention: A text feature extraction method based on mutual information and association rules; Granted publication date: 20210831; License type: Common License; Record date: 20211011 |
TR01 | Transfer of patent right | Effective date of registration: 20240506; Address after: 230000 b-1018, Woye Garden commercial office building, 81 Ganquan Road, Shushan District, Hefei City, Anhui Province; Patentee after: HEFEI WISDOM DRAGON MACHINERY DESIGN Co.,Ltd.; Country or region after: China; Address before: 223005 Jiangsu Huaian economic and Technological Development Zone, 1 East Road.; Patentee before: HUAIYIN INSTITUTE OF TECHNOLOGY; Country or region before: China |
TR01 | Transfer of patent right | Effective date of registration: 20240510; Address after: Room 212, Building 3, No. 2959 Gudai Road, Minhang District, Shanghai, 201199; Patentee after: Shanghai Zhutong Information Technology Co.,Ltd.; Country or region after: China; Address before: 230000 b-1018, Woye Garden commercial office building, 81 Ganquan Road, Shushan District, Hefei City, Anhui Province; Patentee before: HEFEI WISDOM DRAGON MACHINERY DESIGN Co.,Ltd.; Country or region before: China |
EC01 | Cancellation of recordation of patent licensing contract | Assignee: Fanyun software (Nanjing) Co.,Ltd.; Assignor: HUAIYIN INSTITUTE OF TECHNOLOGY; Contract record no.: X2021980010526; Date of cancellation: 20240516 |