CN104376024A - Document similarity detecting method based on seed words - Google Patents

Document similarity detecting method based on seed words

Info

Publication number
CN104376024A
CN104376024A (application CN201310359673.1A)
Authority
CN
China
Prior art keywords
word
document
sentence
sim
limitation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310359673.1A
Other languages
Chinese (zh)
Other versions
CN104376024B (en)
Inventor
张琳波
王枫
胡明
石磊
梁龙
郭瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Academy of Transportation Sciences
Original Assignee
China Academy of Transportation Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Academy of Transportation Sciences filed Critical China Academy of Transportation Sciences
Priority to CN201310359673.1A priority Critical patent/CN104376024B/en
Publication of CN104376024A publication Critical patent/CN104376024A/en
Application granted granted Critical
Publication of CN104376024B publication Critical patent/CN104376024B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/93 - Document management systems

Abstract

The invention relates to a document similarity detection method based on seed words. The method comprises: Chinese word segmentation and part-of-speech tagging, dictionary calculation, representation of each document by a limited sentence set, and document comparison. The method has the following advantages: only a small number of sentences need to be processed when documents are compared, which increases processing speed; when the representative limited sentence set is selected, sentences containing words with high term frequency-inverse document frequency (TF-IDF) values are chosen, so the selected limited sentence set is representative of the document and the influence of irrelevant or common-sense sentences on the similarity judgment is eliminated; two sentences are compared word by word, which reduces the effect of similar content being expressed in different ways; and identical words are weighted by inverse document frequency, giving higher weight to more discriminative words and improving comparison performance.

Description

A document similarity detection method based on seed words
Technical field
The present invention relates to a document-copy detection method in the fields of pattern recognition and information processing, and in particular to a document similarity detection method based on seed words.
Background technology
At the present stage, China's science and technology base is continuously improving and investment in science and technology is gradually increasing, which greatly promotes technological progress in industry. The management of science and technology projects is an important link in the macroscopic science and technology management system of an industry. At present, phenomena such as repeated project approval and multiple applications for the same project exist in project declarations, which seriously affect the rational use of public research funds and lower the overall level of industrial research. Therefore, checking the documents submitted at the project initiation and acceptance stages is an important means of ensuring the scientific soundness and rationality of project approval and of improving the efficiency and quality of project management.
As the number of project declarations grows, manually auditing the similarity of project documents can no longer meet the demand. Using computerized information processing to judge the similarity of declared content automatically has become an urgent problem in project management, and document similarity detection is the most direct and effective means to this end. Document similarity detection must detect not only verbatim copying but also rearrangement of passages, synonym substitution, and restatement of the same content.
At present, document-copy detection methods fall roughly into two categories: methods based on string comparison and methods based on word-frequency statistics. Among the string-matching methods, the COPS prototype system for document-copy detection was developed by the Digital laboratory at Stanford University in 1995. COPS takes punctuation marks as boundaries and decomposes a document into a sequence of sentences; it then counts the number of identical sentences in two documents and uses the ratio of this count to the total number of sentences in the two documents as the measure of their similarity. The similarity of document A and document B is computed as follows:
sim(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|
where S(A) and S(B) denote the fingerprint (sentence) sets of document A and document B, respectively. The COPS system is effective for detecting large-scale document copying and is relatively fast, but it cannot find cases where sentences overlap only partially.
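A minimal sketch of this COPS-style set comparison, assuming sentences are split on common Chinese and Western sentence-ending punctuation (the splitting rule and the function names are illustrative, not taken from the COPS system itself):

```python
import re

def sentence_set(document: str) -> set[str]:
    """Split a document into a set of sentences, using punctuation as boundaries."""
    # Split on Chinese and Western sentence-ending punctuation (illustrative rule).
    sentences = re.split(r"[。；！？.;!?]", document)
    return {s.strip() for s in sentences if s.strip()}

def cops_similarity(doc_a: str, doc_b: str) -> float:
    """Ratio of shared sentences to all sentences, as in the formula above."""
    sa, sb = sentence_set(doc_a), sentence_set(doc_b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```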
Improved methods based on the string-comparison idea also include copy-detection algorithms based on sentence similarity. Such a method splits the whole document into a sequence of sentences at sentence-ending punctuation (commas, periods, semicolons, etc.), applies Chinese word segmentation to each sentence, removes meaningless words such as conjunctions and auxiliary words to form a keyword sequence, and then computes the similarity between sentences of the two documents with a formula based on the longest common subsequence (LCS) algorithm. In this method, document similarity is computed on the basis of sentence similarity. It handles partial sentence overlap well and improves detection performance. However, given the characteristics of Chinese documents, sentences with identical content can be written with different word orders; computing the longest common subsequence at the sentence level over-emphasizes word order and is therefore unfavorable for detecting content that is the same but expressed in different ways. In addition, for long documents the method requires a long running time and its efficiency is low.
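An illustrative sketch of LCS-based sentence similarity over keyword sequences; normalizing by the longer sequence length is an assumption here, since the cited work's exact formula is not given:

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two keyword sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a, 1):
        for j, wb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if wa == wb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcs_sentence_similarity(a: list[str], b: list[str]) -> float:
    """LCS length normalized by the longer keyword sequence (assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))
```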
Among the methods based on word-frequency statistics, the most widely used representation is the vector space model (VSM). Its basic assumption is that words are independent of one another, so a text can be represented as a vector and the model becomes computable. The VSM treats a document as composed of independent terms (T_1, T_2, T_3, ..., T_n); each term is given a weight according to its importance in the document, giving W = (w_1, w_2, w_3, ..., w_n). The terms (T_1, T_2, T_3, ..., T_n) are regarded as the axes of an n-dimensional coordinate system and (w_1, w_2, w_3, ..., w_n) as the corresponding coordinates, so the set of orthogonal term vectors obtained by decomposing (T_1, T_2, T_3, ..., T_n) constitutes a document vector space.
The cosine function is commonly used to compute similarity; it defines the similarity as:
S(D_i, D_j) = Σ_{k=1..n} (w_ik · w_jk) / ( sqrt(Σ_{k=1..n} w_ik^2) · sqrt(Σ_{k=1..n} w_jk^2) )
where D_i and D_j are the i-th and j-th documents in the document collection, and w_ik and w_jk are the coordinates of D_i and D_j on axis T_k. The essence of this class of methods is to compute the cosine of the angle between document vectors in an n-dimensional space.
The TF-IDF method takes into account both the frequency of a word across the texts (the TF value) and the word's ability to distinguish different texts (the IDF value), and is widely used to compute the similarity between texts. Its formula is as follows:
W_td = TF_td × IDF_t    (2)
where W_td denotes the importance of feature term t in document d, TF_td is the number of times t occurs in document d, and IDF_t reflects the distribution of t over the whole document collection. To some extent, IDF_t captures the discriminative power of term t, while TF_td reflects the distribution of the term inside the document. This kind of algorithm can exclude high-frequency, low-discrimination words and is an effective way to define weights.
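A minimal sketch combining the VSM cosine formula above with TF-IDF weights; the unsmoothed IDF lookup and the sparse-vector representation are assumptions made for illustration:

```python
import math
from collections import Counter

def tfidf_vector(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Build a TF-IDF weighted term vector: W_td = TF_td * IDF_t."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: (c / total) * idf.get(t, 0.0) for t, c in counts.items()}

def cosine_similarity(v1: dict[str, float], v2: dict[str, float]) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)
```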
Methods based on word-frequency statistics are fast and particularly suitable for computing the similarity of large documents. However, they partly or entirely ignore the positions of words within a document, whereas in practice the same word, appearing in different sentence contexts of the same document, can have quite different effects.
Summarizing the two classes of methods: methods based on string comparison must compare every sentence of the documents pairwise, so for long documents the computational complexity is high and detection efficiency is low. Moreover, long documents often contain statements that do not affect the essential content but are very similar across documents, such as the common-sense, formulaic passage "In recent years, XXX has developed rapidly; surveying these methods, they can roughly be divided into ...". Although such content does not affect the core ideas of an article, it easily degrades the performance of document similarity detection algorithms. On the other hand, simple statistics such as term frequency (TF) or inverse document frequency (IDF) in a statistics-based vector space model cannot effectively reflect the importance of a word or the distribution of feature words, so they cannot adjust weights well and the precision of such methods is not very high. In addition, most statistics-based algorithms do not exploit positional information: a feature word reflects the content of an article differently in different sentences, so word statistics only have a clear effect at the sentence level.
In view of the defects of existing document-copy detection methods, the inventors, drawing on many years of practical experience and professional knowledge in this field, have devised a new document similarity detection method based on seed words that improves on existing document-copy detection methods and is more practical.
Summary of the invention
The main purpose of the present invention is to overcome the defects of existing document-copy detection methods and to provide a new document similarity detection method based on seed words. The technical problem to be solved is to increase processing speed and to eliminate the influence of irrelevant or formulaic sentences on the document similarity judgment, making the method well suited to practical use.
Another purpose of the present invention is to provide a new document similarity detection method based on seed words whose technical problem to be solved is to improve comparison performance and accuracy, making it more practical.
The objects of the invention are achieved by the following technical solution. The document similarity detection method based on seed words proposed by the present invention comprises the following steps:
Step 1: Chinese word segmentation and tagging
Use an open-source Chinese word segmentation toolkit to cut the Chinese character sequence of a document into individual words, perform part-of-speech tagging on each word, and save the result;
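A minimal sketch of this step; the embodiment below uses the ICTCLAS toolkit, but here the open-source jieba package is used as a stand-in, so the exact splits and tags will differ from the ICTCLAS output shown later:

```python
import jieba.posseg as pseg  # open-source Chinese segmentation with POS tagging

def segment_and_tag(text: str) -> list[tuple[str, str]]:
    """Cut a Chinese character sequence into (word, part-of-speech) pairs."""
    return [(word, flag) for word, flag in pseg.cut(text)]

# Example: segment a short fragment and keep the tagged words for the later steps.
tagged = segment_and_tag("沥青路面工程质量过程控制研究")
```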
Step 2: dictionary calculation
Step 21: select words: traverse all documents in the database document set, remove numerals, measure words, pronouns, prepositions, auxiliary words, conjunctions, adverbs, and other words and punctuation marks that do not affect the understanding of the document content; the remaining words together form a set, and deleting duplicate words from this set yields the dictionary, denoted C = {c_1, c_2, c_3, ..., c_Nc}, where Nc is the number of words in the dictionary;
Step 22: calculate the inverse document frequency IDF of each word in the dictionary:
for any word c_n in the dictionary, the inverse document frequency is computed as IDF(c_n) = lg(Nt/N(c_n)), where Nt is the total number of documents in the database document set, N(c_n) is the number of documents whose content contains the word c_n, and lg denotes the base-10 logarithm;
Step 23: save the dictionary and the inverse document frequency of each word in it;
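A minimal sketch of Steps 21-23, assuming each document is already segmented and POS-tagged as in Step 1; the set of POS tags treated as non-content categories is illustrative:

```python
import math

# POS tags filtered out as non-content words (illustrative set: numerals, measure
# words, pronouns, prepositions, auxiliaries, conjunctions, adverbs, punctuation).
STOP_TAGS = {"m", "q", "r", "p", "u", "c", "d", "w"}

def content_words(tagged_doc: list[tuple[str, str]]) -> list[str]:
    """Keep only words whose POS tag is not in the stop categories."""
    return [w for w, tag in tagged_doc if tag[:1] not in STOP_TAGS]

def build_dictionary_and_idf(tagged_docs: list[list[tuple[str, str]]]):
    """Steps 21-23: dictionary C and IDF(c_n) = lg(Nt / N(c_n))."""
    nt = len(tagged_docs)
    doc_word_sets = [set(content_words(d)) for d in tagged_docs]
    dictionary = set().union(*doc_word_sets) if doc_word_sets else set()
    idf = {}
    for word in dictionary:
        n_cn = sum(1 for s in doc_word_sets if word in s)  # N(c_n)
        idf[word] = math.log10(nt / n_cn)
    return dictionary, idf
```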
Step 3: represent any document D_i in the database document set by a limited sentence set
Step 31: compute the term frequency-inverse document frequency of each dictionary word in document D_i; for any word c_n in the dictionary, its TF-IDF(c_n) value in document D_i is computed as follows:
Step 311: compute the term frequency TF(c_n) of word c_n in document D_i: divide the number of occurrences of c_n in D_i by the total number of words in D_i;
Step 312: look up the inverse document frequency IDF(c_n) of word c_n in the dictionary C obtained in Step 2;
Step 313: compute the TF-IDF value of word c_n in document D_i:
TF-IDF(c_n) = TF(c_n) * IDF(c_n);
Step 32: calculate the seed word set
Sort all words in the dictionary in descending order of their TF-IDF values in document D_i; suppose document D_i contains M content words in total, and select the top m content words as the seed word set of document D_i, denoted S = {s_1, s_2, ..., s_m}, where m is the smallest integer not less than M*k and k is the content-word selection ratio, which can be adjusted according to the observed detection performance;
Step 33: represent document D_i by the limited sentence set P_i
Traverse the full text of D_i and select every sentence that contains at least one word of the seed word set S to represent document D_i; if there are Ti sentences P_i1, P_i2, ..., P_iTi that contain a word of the seed word set S, then D_i is represented by the limited sentence set P_i = {P_i1, P_i2, ..., P_iTi};
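A minimal sketch of Steps 31-33, assuming a document is given as a list of sentences already reduced to their content words and reusing an idf mapping like the one from the sketch after Step 23; interpreting M as the number of distinct content words in the document is an assumption:

```python
import math
from collections import Counter

def seed_words(doc_words: list[str], idf: dict[str, float], k: float) -> set[str]:
    """Steps 31-32: rank content words by TF-IDF in this document and keep the top m."""
    if not doc_words:
        return set()
    total = len(doc_words)
    counts = Counter(doc_words)
    tfidf = {w: (c / total) * idf.get(w, 0.0) for w, c in counts.items()}
    ranked = sorted(tfidf, key=tfidf.get, reverse=True)
    # m is the smallest integer not less than M*k; M counts distinct content words here.
    m = math.ceil(len(ranked) * k)
    return set(ranked[:m])

def limited_sentence_set(sentences: list[list[str]], seeds: set[str]) -> list[list[str]]:
    """Step 33: keep every sentence that contains at least one seed word."""
    return [s for s in sentences if any(w in seeds for w in s)]
```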
Step 4: document comparison
Using the method of Step 3, represent document D_i by the limited sentence set P_i = {P_i1, P_i2, ..., P_iTi} and document D_j by the limited sentence set P_j = {P_j1, P_j2, ..., P_jTj}, where Ti is the number of limited sentences in P_i and Tj is the number of limited sentences in P_j; document D_i and document D_j are compared as follows:
Step 41: compute the similarity between the first sentence P_i1 of the limited sentence set P_i and every sentence P_j1, P_j2, ..., P_jTj of the limited sentence set P_j, denoted {sim_1, sim_2, ..., sim_Tj};
Step 42: compute the maximum of the similarities {sim_1, sim_2, ..., sim_Tj} between the first sentence P_i1 of P_i and all sentences of P_j, denoted sim_max;
Step 43: if sim_max from Step 42 is greater than t, the similarity Sen_Sim_1 between the first sentence P_i1 of P_i and document D_j is sim_max; otherwise Sen_Sim_1 is 0; the threshold t is tuned according to the data characteristics of the database document set (G);
Step 44: apply the same procedure as Steps 41-43 to compute the similarities between the remaining sentences P_i2, P_i3, ..., P_iTi of P_i and document D_j, denoted Sen_Sim_2, Sen_Sim_3, ..., Sen_Sim_Ti;
Step 45: sum the similarities between all sentences P_i1, P_i2, ..., P_iTi of P_i and document D_j, then divide by the sum of the sentence count Ti of P_i and the sentence count Tj of P_j, obtaining the similarity Doc_Sim_ij of documents D_i and D_j: Doc_Sim_ij = (Sen_Sim_1 + Sen_Sim_2 + ... + Sen_Sim_Ti)/(Ti + Tj).
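A minimal sketch of Steps 41-45; it assumes a sentence_similarity function implementing Steps 411-415, which is sketched after the description of those steps below:

```python
def document_similarity(p_i: list[list[str]], p_j: list[list[str]],
                        idf: dict[str, float], t: float) -> float:
    """Steps 41-45: Doc_Sim = sum of per-sentence similarities / (Ti + Tj)."""
    if not p_i or not p_j:
        return 0.0
    sen_sims = []
    for sentence in p_i:
        # Steps 41-42: best match of this sentence against every sentence of P_j.
        sim_max = max(sentence_similarity(sentence, other, idf) for other in p_j)
        # Step 43: keep the best match only if it exceeds the threshold t.
        sen_sims.append(sim_max if sim_max > t else 0.0)
    # Step 45: normalize by the total number of limited sentences in both documents.
    return sum(sen_sims) / (len(p_i) + len(p_j))
```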
In the document similarity detection method based on seed words described above, the similarity between the first sentence P_i1 of the limited sentence set P_i and every sentence P_j1, P_j2, ..., P_jTj of the limited sentence set P_j in Step 41 is computed as follows:
Step 411: apply the Chinese word segmentation and tagging method of Step 1 to the first sentence P_i1 of P_i and the first sentence P_j1 of P_j, remove the numerals, measure words, pronouns, prepositions, auxiliary words, conjunctions, adverbs, and punctuation marks that do not affect the understanding of the document content, and represent each sentence by the set of its remaining words; sentence P_i1 is represented as the word group W_i1 = {word_i11, word_i12, ..., word_i1Q1} and sentence P_j1 as the word group W_j1 = {word_j11, word_j12, ..., word_j1R1}, where Q1 is the number of words in W_i1 and R1 is the number of words in W_j1;
Step 412: compute the similarity word_sim_1 between word word_i11 of W_i1 and the word group W_j1: if W_j1 contains a word identical to word_i11, then word_sim_1 is IDF(word_i11); if W_j1 contains no word identical to word_i11, then word_sim_1 is 0; the inverse document frequency IDF(word_i11) is looked up in the per-word IDF values of the dictionary computed in Step 22;
Step 413: the similarities word_sim_2, word_sim_3, ..., word_sim_Q1 between the remaining words word_i12, word_i13, ..., word_i1Q1 of W_i1 and the word group W_j1 are computed as in Step 412;
Step 414: sum the similarities between all words word_i11, word_i12, ..., word_i1Q1 of W_i1 and the word group W_j1, then divide by the sum of the word count Q1 of W_i1 and the word count R1 of W_j1, obtaining the similarity sim_1 between the first sentence P_i1 of P_i and the first sentence P_j1 of P_j; the computing formula is as follows:
sim_1 = (word_sim_1 + word_sim_2 + ... + word_sim_Q1)/(R1 + Q1);
Step 415: the similarities between the first sentence P_i1 of P_i and the other sentences P_j2, ..., P_jTj of P_j are computed by the same method as Steps 411-414.
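A minimal sketch of Steps 411-415, assuming both sentences are already reduced to their content-word groups; note that the word-by-word match iterates over W_i1 in the numerator but divides by the word counts of both groups, following the formula above:

```python
def sentence_similarity(w_i: list[str], w_j: list[str],
                        idf: dict[str, float]) -> float:
    """Steps 411-414: IDF-weighted word overlap, normalized by Q1 + R1."""
    if not w_i or not w_j:
        return 0.0
    w_j_set = set(w_j)
    # Steps 412-413: a word scores its IDF if the other word group contains it, else 0.
    word_sims = [idf.get(word, 0.0) if word in w_j_set else 0.0 for word in w_i]
    # Step 414: divide the summed word similarities by the total word count of both groups.
    return sum(word_sims) / (len(w_i) + len(w_j))
```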
Through the above technical solution, the document similarity detection method based on seed words of the present invention has at least the following advantages:
1. The invention represents each document by a limited sentence set, so only a small number of sentence comparisons are needed when documents are compared, which increases processing speed;
2. When selecting the representative limited sentence set, sentences containing words with high term frequency-inverse document frequency (TF-IDF) values are chosen, so the selected sentences are representative of the document, and the influence of irrelevant or common-sense sentences on the document similarity judgment is eliminated;
3. Two sentences are compared word by word, which reduces the effect of similar content being narrated in different ways, and identical words are weighted by their inverse document frequency IDF, giving higher weight to more discriminative words and improving comparison performance.
The concrete method of the present invention is described in detail in the following embodiment.
Wherein:
C: dictionary  G: database document set
IDF: inverse document frequency  TF-IDF: term frequency-inverse document frequency
S: seed word set
Embodiment
To further explain the technical means adopted by the present invention to achieve the intended purpose and the effects achieved, the embodiment, method, steps, features and effects of the document similarity detection method based on seed words proposed by the present invention are described in detail below with reference to a preferred embodiment:
Step 1: Chinese word segmentation and tagging
Use an open-source Chinese word segmentation toolkit to cut the Chinese character sequence of the original document into individual words and perform part-of-speech tagging on each word. This embodiment uses the open-source ICTCLAS Chinese word segmentation toolkit developed by the Chinese Academy of Sciences to perform segmentation and part-of-speech tagging.
Below is an example of applying the ICTCLAS toolkit to segment a Chinese fragment of the document "Research on Quality Process Control of Asphalt Pavement Engineering":
Original document fragment:
" this promotion project is summed up colleague both at home and abroad and, on the basis of Asphalt Pavement Construction Quality detection and control technical research achievement, is carried out a large amount of detections and data variation analysis for the production of asphalt pavement material and asphalt and bituminous pavement forming process conscientiously absorbing; Study novel Fast nondestructive evaluation technology and bituminous pavement proceeding in quality control novel device and process control effects; Unique bituminous coat project proceeding in quality control index system is built from material, equipment, personnel, technique four aspects; Optimizing process control evaluation software and proceeding in quality control guide, reach the object to construction quality monitoring.”
Segmentation result:
" basis/r popularization/v project/n /p conscientious/ad absorptions/v summarys/v both at home and abroad/s colleague/n /p bituminous road/n face/n construction/vn quality/n detections/vn and/c control/vn technology/n research/vn achievement/n /u basis/n is upper/f ,/w for/p bituminous road/n face/q material/n and/c pitch/n mixing/vd material/v /u production/vn and/c bituminous road/n face/n be shaping/vn process/n carries out/v in a large number/m /u detection/vn and/c data/n variability/n analysis/vn; / w research/v is novel/b fast/b is harmless/vn detections/vn technology/n and/c bituminous road/n face/q quality/n process/n control/v is novel/b equipment/n and/c process/n control/vn effect/n; / w from/p material/n ,/w equipment/n ,/w personnel/n ,/w technique/n tetra-/m/q aspect/n structure/v uniqueness/a /u bituminous road/n face/q engineering/n quality/n process/n control/vn index/n system/n; / w optimization/v process/n control/vn evaluation/vn software/n and/c quality/n process/n control/vn guide/n ,/w reach/v is right/p engineering/n quality/n monitoring/vn /u object/n./w”
Wherein, "/* " represents part of speech, and such as: "/n " represents noun, "/v " represents verb.
Step 2: dictionary calculation
Step 21: select words
Traverse all documents in the database document set G, remove numerals, measure words, pronouns, prepositions, auxiliary words, conjunctions, adverbs, and other words and punctuation marks that do not affect the understanding of the document content; the remaining words together form a set, and deleting duplicate words from this set yields the dictionary C, denoted C = {c_1, c_2, c_3, ..., c_Nc}, where Nc is the number of words in dictionary C;
In this embodiment the database document set G consists of the 2,191 documents of a transportation science and technology project management platform. Step 21 yields 43,784 words, which together form a set; deleting duplicate words from this set gives the dictionary C. Some of its words are listed below:
..., salt lake, salt, severe cold, harsh, research and development, research, development, rock, karst ...
Step 22: calculate the inverse document frequency (IDF) of each word in the dictionary C
For any word c_n in dictionary C, the inverse document frequency is computed as IDF(c_n) = lg(Nt/N(c_n)), where Nt is the total number of documents in the database document set, N(c_n) is the number of documents whose content contains the word c_n, and lg denotes the base-10 logarithm;
Take the word "salt lake" in this embodiment as an example: as stated in Step 21, the database document set G contains 2,191 documents, so Nt = 2191; according to the statistics, "salt lake" occurs in 43 of them, so IDF("salt lake") = lg(2191/43) = 1.7072. The IDF of every other word is computed in the same way;
Step 23: save the dictionary C and the inverse document frequency IDF of each word in it;
Step 3: represent document D_i by a limited sentence set
Step 31: compute the term frequency-inverse document frequency (TF-IDF) of each word of dictionary C in document D_i
For any word c_n in dictionary C, its TF-IDF(c_n) value in document D_i is computed as follows:
Step 311: compute the term frequency TF(c_n) of word c_n in document D_i: divide the number of occurrences of c_n in D_i by the total number of words in D_i;
For example, "salt lake" occurs 3 times in the document "Research on Quality Process Control of Asphalt Pavement Engineering", which contains 6,202 words in total, so TF("salt lake") = 3/6202 = 0.00048 in that document;
Step 312: look up the inverse document frequency IDF(c_n) of word c_n in the dictionary C obtained in Step 2;
In this embodiment, looking up "salt lake" in the dictionary C gives IDF("salt lake") = 1.7072;
Step 313: compute the TF-IDF(c_n) value of word c_n in document D_i:
TF-IDF(c_n) = TF(c_n) * IDF(c_n);
In this embodiment, the TF-IDF of the word "salt lake" computed with the above formula is:
TF-IDF("salt lake") = 0.00048 * 1.7072 = 0.00082;
Step 32: calculate the seed word set
Sort all words in dictionary C in descending order of their TF-IDF values in document D_i; suppose document D_i contains M content words in total, and select the top m content words as the seed word set S of D_i, denoted S = {s_1, s_2, ..., s_m}, where m is the smallest integer not less than M*k and k is the content-word selection ratio, which can be tuned according to the observed detection performance. This embodiment uses k = 1/2^8 = 1/256 ≈ 0.004;
The whole document "Research on Quality Process Control of Asphalt Pavement Engineering" described in Step 1 contains 1,165 content words in total, and 1165*0.004 = 4.66, so m = 5. The selected seed word set is S = {bituminous road, pitch, control, variability, harmless}, and the corresponding TF-IDF values are {6.2435, 6.1628, 4.1924, 3.8871, 3.2274};
Step 33: represent document D_i by a limited sentence set
Traverse the full text of D_i and select every sentence that contains at least one word of the seed word set S to represent document D_i; if there are Ti sentences P_i1, P_i2, ..., P_iTi that contain a word of the seed word set S, then D_i is represented by the limited sentence set P_i = {P_i1, P_i2, ..., P_iTi};
In this embodiment, denote the fragment of "Research on Quality Process Control of Asphalt Pavement Engineering" quoted in Step 1 as document D_i. The 1st sentence of the fragment contains {bituminous road, pitch, control, variability} from the seed word set S = {bituminous road, pitch, control, variability, harmless}, so it is recorded as P_i1 and placed in the limited sentence set P_i. Likewise, the 2nd sentence contains {harmless, bituminous road, control}, the 3rd sentence contains {bituminous road, control}, and the 4th sentence contains {control}, so they are recorded as P_i2, P_i3 and P_i4 and placed in the limited sentence set P_i.
The original text of "Research on Quality Process Control of Asphalt Pavement Engineering" also contains the sentence "the author has shared his practical experience with this technology and offers some inspiration for its application"; because this sentence contains none of the words of the seed word set S = {bituminous road, pitch, control, variability, harmless}, it is not placed in the limited sentence set P_i.
Step 4: document comparison
Using the method of Step 3, represent document D_i by the limited sentence set P_i = {P_i1, P_i2, ..., P_iTi} and document D_j by the limited sentence set P_j = {P_j1, P_j2, ..., P_jTj}, where Ti is the number of limited sentences in P_i and Tj is the number of limited sentences in P_j.
Because the full documents are long, fragments are selected from the two documents "Research on Quality Process Control of Asphalt Pavement Engineering" and "Research on Kir Resource Development and Pavement Performance", denoted D_1 and D_2 respectively, and the comparison of D_1 and D_2 is then described.
Following the procedure described above, T1 = 4 and D_1 is represented by the limited sentence set P_1 as follows:
P_1: {
P_11: this promotion project is summed up colleague both at home and abroad and, on the basis of Asphalt Pavement Construction Quality detection and control technical research achievement, is carried out a large amount of detections and data variation analysis for the production of asphalt pavement material and asphalt and bituminous pavement forming process conscientiously absorbing;
P_12: study novel Fast nondestructive evaluation technology and bituminous pavement proceeding in quality control novel device and process control effects;
P_13: build unique bituminous coat project proceeding in quality control index system from material, equipment, personnel, technique four aspects;
P_14: optimizing process control evaluation software and proceeding in quality control guide, reach the object to construction quality monitoring.
}
Note " kir development of resources and pavement performance research " document fragment is D 2, utilize and D 1identical processing procedure, can obtain T2=3, and document fragment is D 2by limitation sentence collection P 2be expressed as follows:
P_2: {
P_21: the many technological achievements of west China logistics reach international most advanced level, and wherein "the composite modified series of products of kir" obtain national inventing patent and State Torch Program certificate;
P_22: by applying the high temperature anti-rut behavior of integrated lifting China Bituminous Pavement of Running Overload Vehicles with subsequent technology;
P_23: promote Xinjiang kir development of resources, promote local economic development, drive local employment;
}
Document D_1 and document D_2 are compared as follows:
Step 41: compute the similarity between the first sentence P_11 of limited sentence set P_1 and every sentence P_21, P_22, P_23 of limited sentence set P_2, denoted {sim_1, sim_2, sim_3}; the computed similarities are {0.1104, 0.1211, 0.0946}; the detailed computation is given in Steps 411-415.
Step 42: compute the maximum of the similarities {0.1104, 0.1211, 0.0946} between the first limited sentence P_11 of document D_1 and all sentences {P_21, P_22, P_23} of the limited sentence representation of document D_2: sim_max = 0.1211;
Step 43: if sim_max from Step 42 is greater than t, the similarity Sen_Sim_1 between the first sentence P_i1 of P_i and document D_j is sim_max; otherwise Sen_Sim_1 is 0; the threshold t is tuned according to the data characteristics of the database document set G;
This embodiment uses t = 0.5; since sim_max = 0.1211 < 0.5, the similarity between the first sentence P_11 of P_1 and document D_2 is Sen_Sim_1 = 0;
Step 44: apply the same procedure as Steps 41-43 to compute the similarities Sen_Sim_2, Sen_Sim_3, Sen_Sim_4 between the remaining sentences P_12, P_13, P_14 of P_1 and document D_2; in this embodiment, Sen_Sim_2, Sen_Sim_3 and Sen_Sim_4 are all 0;
Step 45: sum the similarities between all sentences P_11, P_12, P_13, P_14 of P_1 and document D_2, then divide by the sum of the sentence counts of P_1 and P_2, obtaining the similarity Doc_Sim_12 of documents D_1 and D_2: Doc_Sim_12 = (Sen_Sim_1 + Sen_Sim_2 + Sen_Sim_3 + Sen_Sim_4)/(T1 + T2) = (0 + 0 + 0 + 0)/(4 + 3) = 0.
In the present invention, the similarity between the first sentence P_11 of limited sentence set P_1 and every sentence P_21, P_22, P_23 of limited sentence set P_2 (Step 41) is computed as follows:
Step 411: apply the Chinese word segmentation and tagging method of Step 1 to the first sentence P_11 of P_1 and the first sentence P_21 of P_2, remove the numerals, measure words, pronouns, prepositions, auxiliary words, conjunctions, adverbs, and punctuation marks that do not affect the understanding of the document content, and represent each sentence by the set of its remaining words; sentence P_11 is represented as the word group W_11 = {word_111, word_112, ..., word_11Q1} and sentence P_21 as the word group W_21 = {word_211, word_212, ..., word_21R1}, where Q1 is the number of words in W_11 and R1 is the number of words in W_21;
In this embodiment Q1 = 33 and R1 = 24. W_11 = {word_111, word_112, ..., word_11(33)} = {popularization, project, conscientious, absorb, sum up, both at home and abroad, colleague, bituminous road, face, construction, quality, detection, control, technology, research, achievement, basis, bituminous road, face, material, pitch, mixing, material, production, bituminous road, face, shaping, process, carry out, detect, data, variability, analysis}; W_21 = {word_211, word_212, ..., word_21(24)} = {western part, project, many, technology, achievement, reach, the world, advanced person, level, rock, pitch, compound, change, property, series, product, acquisition, country, invention, patent, country, torch, plan, certificate};
Step 412: compute the similarity word_sim_1 between word word_111 of W_11 and the word group W_21: if W_21 contains a word identical to word_111, then word_sim_1 is IDF(word_111); if W_21 contains no word identical to word_111, then word_sim_1 is 0; the inverse document frequency IDF(word_111) is looked up in the per-word IDF values of the dictionary computed in Step 22;
In this embodiment, word_111 of W_11 is "popularization"; W_21 does not contain "popularization", so word_sim_1 = 0;
Step 413: the similarities {word_sim_2, word_sim_3, ..., word_sim_33} between the remaining words word_112, word_113, ..., word_11(33) of W_11 and the word group W_21 = {word_211, word_212, ..., word_21(24)} are computed as in Step 412;
In this embodiment, the 2nd word of W_11, word_112 = "project", occurs in W_21, so word_sim_2 = IDF("project") = 0.0023. W_11 contains 33 words in total; processing them in the same way as the 1st and 2nd words gives: {word_sim_3, ..., word_sim_33} = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0045, 0, 0.0127, 0, 0, 0, 0, 6.1628, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
Step 414: sum the similarities between all words word_111, word_112, ..., word_11(33) of W_11 and the word group W_21, then divide by the sum of the word count of W_11 (33) and the word count of W_21 (24), obtaining the similarity sim_1 between the first sentence P_11 of P_1 and the first sentence P_21 of P_2; the computing formula is as follows:
sim_1 = (word_sim_1 + word_sim_2 + ... + word_sim_33)/(33 + 24) = 6.1823/57 = 0.1104;
Step 415: the similarities between the first sentence P_11 of P_1 and the other sentences P_22, P_23 of P_2 are computed by the same method as Steps 411-414.
Through the above process, the similarity between any two documents can be calculated.
In this embodiment, the main task is to compute the similarity between a new document and the existing documents in the database. The results can be used in two ways:
(1) According to the characteristics of the documents and the plagiarism-checking requirements of the embodiment, a document similarity threshold Thres = 0.4 is set: if the similarity of two documents is greater than 0.4, the two documents are judged to be similar. Users can set the value of Thres according to the characteristics of their industry and their own standard for document similarity.
(2) The user sorts the existing documents by the computed similarity and then judges subjectively, based on experience, whether the existing document with the greatest similarity is indeed similar to the new document.
The above is only a preferred embodiment of the present invention and does not limit the invention in any form. Although the invention has been disclosed above by way of a preferred embodiment, it is not thereby limited. Any person skilled in the art may, without departing from the scope of the technical solution of the invention, use the disclosed technical content to make minor changes or to modify it into an equivalent embodiment; any simple modification, equivalent change or modification made to the above embodiment according to the technical essence of the invention, as long as it does not depart from the content of the technical solution of the invention, still falls within the scope of the technical solution of the invention.

Claims (2)

1. A document similarity detection method based on seed words, characterized in that it comprises the following steps:
Step 1: Chinese word segmentation and tagging
Use an open-source Chinese word segmentation toolkit to cut the Chinese character sequence in a document into individual words, perform part-of-speech tagging on each word, and save the result;
Step 2: dictionary calculation
Step 21: select words: traverse all documents in the database document set (G), remove numerals, measure words, pronouns, prepositions, auxiliary words, conjunctions, adverbs, and other words and punctuation marks that do not affect the understanding of the document content; the remaining words together form a set, and deleting duplicate words from this set yields the dictionary (C), denoted C = {c_1, c_2, c_3, ..., c_Nc}, where Nc is the number of words in the dictionary (C);
Step 22: calculate the inverse document frequency (IDF) of each word in the dictionary (C):
for any word c_n in the dictionary (C), the inverse document frequency is computed as IDF(c_n) = lg(Nt/N(c_n)), where Nt is the total number of documents in the database document set, N(c_n) is the number of documents whose content contains the word c_n, and lg denotes the base-10 logarithm;
Step 23: save the dictionary (C) and the inverse document frequency (IDF) of each word in it;
Step 3: represent any document D_i by the limited sentence set P_i
Step 31: compute the term frequency-inverse document frequency (TF-IDF) of each word of the dictionary (C) in document D_i
For any word c_n in the dictionary (C), its TF-IDF(c_n) value in document D_i is computed as follows:
Step 311: compute the term frequency TF(c_n) of word c_n in document D_i: divide the number of occurrences of c_n in D_i by the total number of words in D_i;
Step 312: look up the inverse document frequency IDF(c_n) of word c_n in the dictionary (C) obtained in Step 2;
Step 313: compute the TF-IDF value of word c_n in document D_i:
TF-IDF(c_n) = TF(c_n) * IDF(c_n);
Step 32: calculate the seed word set
Sort all words in the dictionary (C) in descending order of their TF-IDF values in document D_i; suppose document D_i contains M content words in total, and select the top m content words as the seed word set (S) of document D_i, denoted S = {s_1, s_2, ..., s_m}, where m is the smallest integer not less than M*k and k is the content-word selection ratio, which can be adjusted according to the observed detection performance;
Step 33: represent document D_i by the limited sentence set P_i
Traverse the full text of D_i and select every sentence that contains at least one word of the seed word set (S) to represent document D_i; if there are Ti sentences P_i1, P_i2, ..., P_iTi that contain a word of the seed word set (S), then D_i is represented by the limited sentence set P_i = {P_i1, P_i2, ..., P_iTi};
Step 4: document comparison
Using the method of Step 3, represent document D_i by the limited sentence set P_i = {P_i1, P_i2, ..., P_iTi} and document D_j by the limited sentence set P_j = {P_j1, P_j2, ..., P_jTj}, where Ti is the number of limited sentences in P_i and Tj is the number of limited sentences in P_j; document D_i and document D_j are compared as follows:
Step 41: compute the similarity between the first sentence P_i1 of the limited sentence set P_i and every sentence P_j1, P_j2, ..., P_jTj of the limited sentence set P_j, denoted {sim_1, sim_2, ..., sim_Tj};
Step 42: compute the maximum of the similarities {sim_1, sim_2, ..., sim_Tj} between the first sentence P_i1 of P_i and all sentences of P_j, denoted sim_max;
Step 43: if sim_max from Step 42 is greater than t, the similarity Sen_Sim_1 between the first sentence P_i1 of P_i and document D_j is sim_max; otherwise Sen_Sim_1 is 0; the threshold t is tuned according to the data characteristics of the database document set (G);
Step 44: apply the same procedure as Steps 41-43 to compute the similarities between the remaining sentences P_i2, P_i3, ..., P_iTi of P_i and document D_j, denoted Sen_Sim_2, Sen_Sim_3, ..., Sen_Sim_Ti;
Step 45: sum the similarities between all sentences P_i1, P_i2, ..., P_iTi of P_i and document D_j, then divide by the sum of the sentence count Ti of P_i and the sentence count Tj of P_j, obtaining the similarity Doc_Sim_ij of documents D_i and D_j: Doc_Sim_ij = (Sen_Sim_1 + Sen_Sim_2 + ... + Sen_Sim_Ti)/(Ti + Tj).
2. The document similarity detection method based on seed words according to claim 1, characterized in that the similarity between the first sentence P_i1 of the limited sentence set P_i and every sentence P_j1, P_j2, ..., P_jTj of the limited sentence set P_j in Step 41 is computed as follows:
Step 411: apply the Chinese word segmentation and tagging method of Step 1 to the first sentence P_i1 of the limited sentence set P_i and the first sentence P_j1 of the limited sentence set P_j, remove the numerals, measure words, pronouns, prepositions, auxiliary words, conjunctions, adverbs, and punctuation marks that do not affect the understanding of the document content, and represent each sentence by the set of its remaining words; sentence P_i1 is represented as the word group W_i1 = {word_i11, word_i12, ..., word_i1Q1} and sentence P_j1 as the word group W_j1 = {word_j11, word_j12, ..., word_j1R1}, where Q1 is the number of words in W_i1 and R1 is the number of words in W_j1;
Step 412: compute the similarity word_sim_1 between word word_i11 of W_i1 and the word group W_j1: if W_j1 contains a word identical to word_i11, then word_sim_1 is IDF(word_i11); if W_j1 contains no word identical to word_i11, then word_sim_1 is 0; the inverse document frequency IDF(word_i11) is looked up in the per-word IDF values of the dictionary computed in Step 22;
Step 413: the similarities word_sim_2, word_sim_3, ..., word_sim_Q1 between the remaining words word_i12, word_i13, ..., word_i1Q1 of W_i1 and the word group W_j1 are computed as in Step 412;
Step 414: sum the similarities between all words word_i11, word_i12, ..., word_i1Q1 of W_i1 and the word group W_j1, then divide by the sum of the word count Q1 of W_i1 and the word count R1 of W_j1, obtaining the similarity sim_1 between the first sentence P_i1 of P_i and the first sentence P_j1 of P_j; the computing formula is as follows:
sim_1 = (word_sim_1 + word_sim_2 + ... + word_sim_Q1)/(R1 + Q1);
Step 415: the similarities between the first sentence P_i1 of P_i and the other sentences P_j2, ..., P_jTj of P_j are computed by the same method as Steps 411-414.
CN201310359673.1A 2013-08-16 2013-08-16 A document similarity detection method based on seed words Active CN104376024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310359673.1A CN104376024B (en) 2013-08-16 2013-08-16 A document similarity detection method based on seed words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310359673.1A CN104376024B (en) 2013-08-16 2013-08-16 A document similarity detection method based on seed words

Publications (2)

Publication Number Publication Date
CN104376024A true CN104376024A (en) 2015-02-25
CN104376024B CN104376024B (en) 2017-12-15

Family

ID=52554938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310359673.1A Active CN104376024B (en) 2013-08-16 2013-08-16 A document similarity detection method based on seed words

Country Status (1)

Country Link
CN (1) CN104376024B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126497A (en) * 2016-06-21 2016-11-16 同方知网数字出版技术股份有限公司 A kind of automatic mining correspondence executes leader section and the method for cited literature textual content fragment
CN107229939A (en) * 2016-03-24 2017-10-03 北大方正集团有限公司 The decision method and device of similar document
CN109002508A (en) * 2018-07-01 2018-12-14 东莞市华睿电子科技有限公司 A kind of text information crawling method based on web crawlers
CN109271641A (en) * 2018-11-20 2019-01-25 武汉斗鱼网络科技有限公司 A kind of Text similarity computing method, apparatus and electronic equipment
CN112307738A (en) * 2020-11-11 2021-02-02 北京沃东天骏信息技术有限公司 Method and device for processing text

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398814A (en) * 2007-09-26 2009-04-01 北京大学 Method and system for simultaneously abstracting document summarization and key words
CN101655866A (en) * 2009-08-14 2010-02-24 北京中献电子技术开发中心 Automatic decimation method of scientific and technical terminology
CN101710317A (en) * 2009-11-17 2010-05-19 上海第二工业大学 Word partial weight calculating method based on word distribution
CN101963989A (en) * 2010-09-30 2011-02-02 大连理工大学 Word elimination process for extracting domain ontology concept
US8473279B2 (en) * 2008-05-30 2013-06-25 Eiman Al-Shammari Lemmatizing, stemming, and query expansion method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398814A (en) * 2007-09-26 2009-04-01 北京大学 Method and system for simultaneously abstracting document summarization and key words
US8473279B2 (en) * 2008-05-30 2013-06-25 Eiman Al-Shammari Lemmatizing, stemming, and query expansion method and system
CN101655866A (en) * 2009-08-14 2010-02-24 北京中献电子技术开发中心 Automatic decimation method of scientific and technical terminology
CN101710317A (en) * 2009-11-17 2010-05-19 上海第二工业大学 Word partial weight calculating method based on word distribution
CN101963989A (en) * 2010-09-30 2011-02-02 大连理工大学 Word elimination process for extracting domain ontology concept

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229939A (en) * 2016-03-24 2017-10-03 北大方正集团有限公司 The decision method and device of similar document
CN107229939B (en) * 2016-03-24 2020-12-04 北大方正集团有限公司 Similar document judgment method and device
CN106126497A (en) * 2016-06-21 2016-11-16 同方知网数字出版技术股份有限公司 A kind of automatic mining correspondence executes leader section and the method for cited literature textual content fragment
CN109002508A (en) * 2018-07-01 2018-12-14 东莞市华睿电子科技有限公司 A kind of text information crawling method based on web crawlers
CN109002508B (en) * 2018-07-01 2021-08-06 上海众引文化传播股份有限公司 Text information crawling method based on web crawler
CN109271641A (en) * 2018-11-20 2019-01-25 武汉斗鱼网络科技有限公司 A kind of Text similarity computing method, apparatus and electronic equipment
CN109271641B (en) * 2018-11-20 2023-09-08 广西三方大供应链技术服务有限公司 Text similarity calculation method and device and electronic equipment
CN112307738A (en) * 2020-11-11 2021-02-02 北京沃东天骏信息技术有限公司 Method and device for processing text

Also Published As

Publication number Publication date
CN104376024B (en) 2017-12-15

Similar Documents

Publication Publication Date Title
CN104834747B (en) Short text classification method based on convolutional neural networks
CN105488024B (en) The abstracting method and device of Web page subject sentence
CN103617157B (en) Based on semantic Text similarity computing method
CN102945228B (en) A kind of Multi-document summarization method based on text segmentation technology
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN103164540B (en) A kind of patent hotspot finds and trend analysis
CN104598611B (en) The method and system being ranked up to search entry
CN106126620A (en) Method of Chinese Text Automatic Abstraction based on machine learning
CN104376024A (en) Document similarity detecting method based on seed words
CN104915448A (en) Substance and paragraph linking method based on hierarchical convolutional network
CN107832457A (en) Power transmission and transforming equipment defect dictionary method for building up and system based on TextRank algorithm
CN105045812A (en) Text topic classification method and system
CN104598813A (en) Computer intrusion detection method based on integrated study and semi-supervised SVM
Le et al. Text classification: Naïve bayes classifier with sentiment Lexicon
CN103631858A (en) Science and technology project similarity calculation method
CN104484380A (en) Personalized search method and personalized search device
CN103324700A (en) Noumenon concept attribute learning method based on Web information
CN105389505A (en) Shilling attack detection method based on stack type sparse self-encoder
CN105654144A (en) Social network body constructing method based on machine learning
CN105095430A (en) Method and device for setting up word network and extracting keywords
CN104572631A (en) Training method and system for language model
CN104361059A (en) Harmful information identification and web page classification method based on multi-instance learning
CN104933032A (en) Method for extracting keywords of blog based on complex network
CN104809105A (en) Method and system for identifying event argument and argument role based on maximum entropy
CN105243053A (en) Method and apparatus for extracting key sentence of document

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant