CN108038202A - Method for determining document similarity - Google Patents

Method for determining document similarity

Info

Publication number
CN108038202A
Authority
CN
China
Prior art keywords
hash values
sequence string
documents
vocabulary
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201711326082.9A
Other languages
English (en)
Inventor
王祝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yixiang (dalian) Science And Technology Co Ltd
Original Assignee
Yixiang (dalian) Science And Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yixiang (dalian) Science And Technology Co Ltd
Priority to CN201711326082.9A
Publication of CN108038202A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

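The invention discloses a method for determining document similarity. Step 1, word segmentation: split the sentences in a document into basic words. Step 2, word statistics: compute a hash value for each feature vector (word) with a hash function, the hash value being 64 bits wide; record the number of occurrences N of each word and build the count-times-word data, i.e. N * hash value. Step 3, merging: accumulate the weighted results of all feature vectors into a single sequence string. Step 4, dimensionality reduction: for each bit position of the accumulated result, set the bit to 1 if the value is greater than 0 and to 0 otherwise, yielding the sequence string of the whole document. Step 5, comparison: compare the sequence strings of two documents to obtain the number of differing bits; if that number is less than or equal to 3, the documents are judged to be similar.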

Description

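Method for determining document similarity
Technical field
The present invention relates to the field of text processing, and in particular to a method for determining document similarity.
Background
As awareness of intellectual property protection grows in China, copyright holders increasingly take the initiative to defend their rights. Judging infringement by hand involves too much work, however. Faced with massive amounts of data, computer tools are needed as an aid to screen the data first, after which a manual judgment is made.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention provides a method for determining document similarity.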
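Step 1, word segmentation: split the sentences in the document into basic words;
Step 2, word statistics: compute a hash value for each feature vector (word) with a hash function, the hash value being 64 bits wide; record the number of occurrences N of each word and build the count-times-word data, i.e. N * hash value;
Step 3, merging: accumulate the weighted results of all feature vectors into a single sequence string;
Step 4, dimensionality reduction: for each bit position of the accumulated result, set the bit to 1 if the value is greater than 0 and to 0 otherwise, yielding the sequence string of the whole document;
Step 5, comparison: compare the sequence strings of two documents to obtain the number of differing bits; if that number is less than or equal to 3, the documents are judged to be similar.
Beneficial effect: the invention hashes the words of the entire document, weights each hash by the number of times the word occurs, and finally compares the resulting sequences of 1s and 0s to obtain a similarity comparison result.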
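The five steps can be sketched in Python as follows. This is a minimal illustration under several assumptions that the source does not state: the regex tokenizer is only a stand-in for a real word segmenter, BLAKE2b is an arbitrary choice of 64-bit hash, and the weighting follows the usual SimHash convention in which a 1 bit contributes +N and a 0 bit contributes -N (the source says only "N * hash value"). All function names are hypothetical.

```python
import hashlib
import re
from collections import Counter

HASH_BITS = 64          # hash width stated in the source
MAX_DIFFERING_BITS = 3  # similarity threshold stated in the source


def tokenize(text):
    """Step 1 stand-in: split on non-word characters (assumption, not a real segmenter)."""
    return [w.lower() for w in re.findall(r"\w+", text)]


def hash64(word):
    """Step 2: map a word to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(word.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def fingerprint(text):
    """Steps 2-4: weight each word hash by its count N, accumulate, reduce to bits."""
    counts = Counter(tokenize(text))
    totals = [0] * HASH_BITS
    for word, n in counts.items():
        h = hash64(word)
        for bit in range(HASH_BITS):
            # Assumed SimHash convention: a 1 bit adds N, a 0 bit subtracts N.
            totals[bit] += n if (h >> bit) & 1 else -n
    result = 0
    for bit, total in enumerate(totals):
        if total > 0:  # greater than 0 -> 1, otherwise 0
            result |= 1 << bit
    return result


def differing_bits(a, b):
    """Step 5: number of bit positions where the two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_similar(text_a, text_b):
    """Apply the source's threshold: at most 3 differing bits means similar."""
    return differing_bits(fingerprint(text_a), fingerprint(text_b)) <= MAX_DIFFERING_BITS
```

Calling `is_similar(doc_a, doc_b)` on two strings returns the similar / not similar decision. For Chinese text, `tokenize` would need to be replaced by a proper word segmenter, since the regex treats an unbroken run of characters as one word.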
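Detailed description
Embodiment:
Step 1, word segmentation: split the sentences in the document into basic words;
Step 2, word statistics: compute a hash value for each feature vector (word) with a hash function, the hash value being 64 bits wide; record the number of occurrences N of each word and build the count-times-word data, i.e. N * hash value;
Step 3, merging: accumulate the weighted results of all feature vectors into a single sequence string;
Step 4, dimensionality reduction: for each bit position of the accumulated result, set the bit to 1 if the value is greater than 0 and to 0 otherwise, yielding the sequence string of the whole document;
Step 5, comparison: compare the sequence strings of two documents to obtain the number of differing bits; if that number is less than or equal to 3, the documents are judged to be similar.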
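To make the arithmetic of steps 2 to 5 concrete, here is a toy worked example. It uses 4-bit hash strings instead of 64-bit ones so the sums can be checked by hand. The hash values, word counts, and function names are invented for illustration, and the signed weighting (+N for a 1 bit, -N for a 0 bit) is the common SimHash convention, assumed here rather than taken from the source.

```python
def toy_fingerprint(weighted_hashes):
    """Accumulate signed counts per position, then threshold at 0."""
    width = len(weighted_hashes[0][0])
    totals = [0] * width
    for bits, n in weighted_hashes:
        for i, ch in enumerate(bits):
            # Assumed convention: '1' contributes +N, '0' contributes -N.
            totals[i] += n if ch == "1" else -n
    sequence = "".join("1" if t > 0 else "0" for t in totals)
    return totals, sequence


def count_differences(seq_a, seq_b):
    """Number of positions at which the two sequence strings differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Document A: word X (toy hash 1011) occurs twice, word Y (toy hash 0110) once.
totals_a, seq_a = toy_fingerprint([("1011", 2), ("0110", 1)])
# totals_a == [1, -1, 3, 1], seq_a == "1011"

# Document B: the same words with the counts swapped.
totals_b, seq_b = toy_fingerprint([("1011", 1), ("0110", 2)])
# totals_b == [-1, 1, 3, -1], seq_b == "0110"

print(count_differences(seq_a, seq_b))  # prints 3
```

The threshold of 3 differing bits in the source applies to 64-bit sequence strings; with only 4 bits, as in this toy case, the count is shown purely to illustrate the comparison step.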

Claims (1)

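1. A method for determining document similarity, comprising the following steps:
Step 1, word segmentation: split the sentences in the document into basic words;
Step 2, word statistics: compute a hash value for each feature vector (word) with a hash function, the hash value being 64 bits wide; record the number of occurrences N of each word and build the count-times-word data, i.e. N * hash value;
Step 3, merging: accumulate the weighted results of all feature vectors into a single sequence string;
Step 4, dimensionality reduction: for each bit position of the accumulated result, set the bit to 1 if the value is greater than 0 and to 0 otherwise, yielding the sequence string of the whole document;
Step 5, comparison: compare the sequence strings of two documents to obtain the number of differing bits; if that number is less than or equal to 3, the documents are judged to be similar.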
CN201711326082.9A 2017-12-13 2017-12-13 Method for determining document similarity Withdrawn CN108038202A (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711326082.9A CN108038202A (zh) 2017-12-13 2017-12-13 Method for determining document similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711326082.9A CN108038202A (zh) 2017-12-13 2017-12-13 Method for determining document similarity

Publications (1)

Publication Number Publication Date
CN108038202A (zh) 2018-05-15

Family

ID=62103008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711326082.9A Withdrawn CN108038202A (zh) 2017-12-13 2017-12-13 Method for determining document similarity

Country Status (1)

Country Link
CN (1) CN108038202A (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
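CN104636325A (zh) * 2015-02-06 2015-05-20 Central South University Method for determining document similarity based on maximum likelihood estimation
CN106873964A (zh) * 2016-12-23 2017-06-20 Zhejiang University of Technology Improved SimHash code similarity detection method
CN107229939A (zh) * 2016-03-24 2017-10-03 Peking University Founder Group Co., Ltd. Method and device for determining similar documents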

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20180515