CN100412869C - An improved document similarity measurement method based on document structure - Google Patents
- Publication number
- CN100412869C (also listed as CNB2006100725887A, CN200610072588A)
- Authority
- CN
- China
- Prior art keywords
- sub
- document
- topics
- sigma
- file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Metric | Cosine | PivotedVSM | BM25 | Optimal matching | This invention |
---|---|---|---|---|---|
MAP | 0.82 | 0.723 | 0.757 | 0.85 | 0.87 |
P@5 | 0.83 | 0.81 | 0.82 | 0.87 | 0.88 |
P@10 | 0.72 | 0.71 | 0.72 | 0.773 | 0.773 |
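The table reports MAP, P@5, and P@10 without stating how they were computed. The sketch below shows the standard information-retrieval definitions of these metrics. It assumes binary relevance judgments and one ranked list per query. The function names and the sample document IDs are hypothetical, and the patent's actual evaluation protocol is not given in this record.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant (P@k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Sum of precision at each rank holding a relevant document,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of average precision over all queries (MAP)."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical single-query example: two of three relevant documents retrieved.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
p_at_5 = precision_at_k(ranked, relevant, 5)
ap = average_precision(ranked, relevant)
```

Under these definitions, a relevant document that is never retrieved (here `d6`) lowers average precision but does not affect P@k, which is one reason the two kinds of score can rank methods differently.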
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2006100725887A CN100412869C (zh) | 2006-04-13 | 2006-04-13 | An improved document similarity measurement method based on document structure |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2006100725887A CN100412869C (zh) | 2006-04-13 | 2006-04-13 | An improved document similarity measurement method based on document structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1828610A (zh) | 2006-09-06 |
CN100412869C (zh) | 2008-08-20 |
Family
ID=36947002
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2006100725887A Expired - Fee Related CN100412869C (zh) | 2006-04-13 | 2006-04-13 | An improved document similarity measurement method based on document structure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN100412869C (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11176186B2 (en) | 2020-03-27 | 2021-11-16 | International Business Machines Corporation | Construing similarities between datasets with explainable cognitive methods |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
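CN101013421B (zh) * | 2007-02-02 | 2012-06-27 | 清华大学 | Rule-based automatic analysis method for Chinese basic chunks |
CN102789452A (zh) * | 2011-05-16 | 2012-11-21 | 株式会社日立制作所 | Similar content extraction method |
CN102279893B (zh) * | 2011-09-19 | 2015-07-22 | 索意互动(北京)信息技术有限公司 | Many-to-many automatic analysis of document groups |
CN103389987A (zh) * | 2012-05-09 | 2013-11-13 | 阿里巴巴集团控股有限公司 | Text similarity comparison method and system |
CN103049569A (zh) * | 2012-12-31 | 2013-04-17 | 武汉传神信息技术有限公司 | Text similarity matching method based on the vector space model |
CN103399900B (zh) * | 2013-07-25 | 2016-12-28 | 北京京东尚科信息技术有限公司 | Picture recommendation method based on location services |
CN103823838B (zh) * | 2013-12-18 | 2018-07-20 | 国网江苏省电力有限公司常州供电分公司 | A method for entering and comparing multi-format documents |
CN107644079A (zh) * | 2015-05-22 | 2018-01-30 | 广东欧珀移动通信有限公司 | An application recommendation method and device and related media product |
CN105955965A (zh) * | 2016-06-21 | 2016-09-21 | 上海智臻智能网络科技股份有限公司 | Question information processing method and device |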
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5835893A (en) * | 1996-02-15 | 1998-11-10 | Atr Interpreting Telecommunications Research Labs | Class-based word clustering for speech recognition using a three-level balanced hierarchical similarity |
CN1403957A (zh) * | 2001-09-06 | 2003-03-19 | 联想(北京)有限公司 | Method for correcting vector-space-model-based text similarity calculation using subject terms |
US6542889B1 (en) * | 2000-01-28 | 2003-04-01 | International Business Machines Corporation | Methods and apparatus for similarity text search based on conceptual indexing |
US6578031B1 (en) * | 1998-09-30 | 2003-06-10 | Canon Kabushiki Kaisha | Apparatus and method for retrieving vector format data from database in accordance with similarity with input vector |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5835893A (en) * | 1996-02-15 | 1998-11-10 | Atr Interpreting Telecommunications Research Labs | Class-based word clustering for speech recognition using a three-level balanced hierarchical similarity |
US6578031B1 (en) * | 1998-09-30 | 2003-06-10 | Canon Kabushiki Kaisha | Apparatus and method for retrieving vector format data from database in accordance with similarity with input vector |
US6542889B1 (en) * | 2000-01-28 | 2003-04-01 | International Business Machines Corporation | Methods and apparatus for similarity text search based on conceptual indexing |
CN1403957A (zh) * | 2001-09-06 | 2003-03-19 | 联想(北京)有限公司 | Method for correcting vector-space-model-based text similarity calculation using subject terms |
Non-Patent Citations (2)
Title |
---|
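A similarity-based soft clustering algorithm for Web document clustering. 姜亚莉, 关泽群. 计算机工程 (Computer Engineering), Vol. 32, No. 2. 2006 |
A similarity-based soft clustering algorithm for Web document clustering. 姜亚莉, 关泽群. 计算机工程 (Computer Engineering), Vol. 32, No. 2. 2006 * |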
Also Published As
Publication number | Publication date |
---|---|
CN1828610A (zh) | 2006-09-06 |
Similar Documents
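Publication | Title | Publication Date |
---|---|---|
CN100412869C (zh) | An improved document similarity measurement method based on document structure | |
CN101231634B (zh) | A multi-document automatic summarization method | |
CN104391942B (zh) | Short-text feature expansion method based on a semantic graph | |
CN104699763B (zh) | Text similarity measurement system with multi-feature fusion | |
CN105653706B (zh) | A multi-layer citation recommendation method based on a knowledge graph of document content | |
CN100543735C (zh) | Document similarity measurement method based on document structure | |
CN105095477A (zh) | A recommendation algorithm based on multi-index scoring | |
CN106250412A (zh) | Knowledge graph construction method based on multi-source entity fusion | |
CN103235772A (zh) | A method for automatically extracting character relationships from a text collection | |
CN104408153A (zh) | A short-text hash learning method based on a multi-granularity topic model | |
CN109670039A (zh) | Semi-supervised sentiment analysis method for e-commerce reviews based on tripartite graphs and cluster analysis | |
CN102402561B (zh) | A search method and device | |
CN104484380A (zh) | Personalized search method and device | |
CN103970730A (zh) | A method for extracting multiple topic words from a single Chinese text | |
CN104636325B (zh) | A method for determining document similarity based on maximum likelihood estimation | |
CN101882136A (zh) | Text sentiment orientation analysis method | |
CN104317838A (zh) | A cross-media hash indexing method based on coupled discriminative dictionaries | |
CN101382962B (zh) | A shallow-analysis automatic document summarization method that considers the abstraction level of concepts | |
CN103034726A (zh) | Text filtering system and method | |
CN106095791A (zh) | A context-based abstract sample information retrieval system and its abstract sample feature representation method | |
CN102737112A (zh) | Concept relatedness calculation method based on expressed semantic analysis | |
CN104008187A (zh) | A semi-structured text matching method based on minimum edit distance | |
CN107391482A (zh) | A method for fuzzy matching and pruning based on sentence patterns | |
CN107301169A (zh) | Off-topic essay detection method, device, and terminal equipment | |
CN114139634A (zh) | A multi-label feature selection method based on pairwise label weights |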
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | Effective date of registration: 2022-09-14; Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031; Patentee after: New founder holdings development Co.,Ltd.; Patentee after: Peking University; Patentee after: PEKING University FOUNDER R & D CENTER; Address before: 100871, fangzheng building, 298 Fu Cheng Road, Beijing, Haidian District; Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.; Patentee before: Peking University; Patentee before: PEKING University FOUNDER R & D CENTER |
TR01 | Transfer of patent right | Effective date of registration: 2023-04-03; Address after: 100871 No. 5, the Summer Palace Road, Beijing, Haidian District; Patentee after: Peking University; Address before: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031; Patentee before: New founder holdings development Co.,Ltd.; Patentee before: Peking University; Patentee before: PEKING University FOUNDER R & D CENTER |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 2008-08-20 |