CN104376024A - Document similarity detecting method based on seed words - Google Patents

Document similarity detecting method based on seed words

Info

Publication number
CN104376024A
CN104376024A (application CN201310359673.1A)
Authority
CN
China
Prior art keywords
word
document
sentence
sim
limitation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310359673.1A
Other languages
Chinese (zh)
Other versions
CN104376024B (en)
Inventor
张琳波
王枫
胡明
石磊
梁龙
郭瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Academy of Transportation Sciences
Original Assignee
China Academy of Transportation Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Academy of Transportation Sciences filed Critical China Academy of Transportation Sciences
Priority to CN201310359673.1A priority Critical patent/CN104376024B/en
Publication of CN104376024A publication Critical patent/CN104376024A/en
Application granted granted Critical
Publication of CN104376024B publication Critical patent/CN104376024B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/93 - Document management systems

Abstract

The invention relates to a document similarity detection method based on seed words. The method comprises: Chinese word segmentation and part-of-speech tagging, dictionary calculation, representation of each document by a limited sentence set, and document comparison. The method has the following advantages: only a small number of sentences need to be processed when documents are compared, which increases processing speed; when the representative limited sentence set is selected, sentences containing words with high term frequency-inverse document frequency (TF-IDF) values are chosen, so the selected limited sentence set is representative of the document and the influence of irrelevant or common-sense sentences on the similarity judgment is eliminated; two sentences are compared word by word, which reduces the effect of similar content being expressed in different ways; and identical words are weighted by inverse document frequency, giving higher weight to more discriminative words and improving comparison performance.

Description

A document similarity detection method based on seed words
Technical field
The present invention relates to a document-copy detection method in the fields of pattern recognition and information processing, and in particular to a document similarity detection method based on seed words.
Background technology
At the present stage, China's science and technology base is continuously improving and investment in science and technology is gradually increasing, which greatly promotes technological progress in industry. The management of science and technology projects is an important link in the macroscopic science and technology management system of an industry. At present, phenomena such as repeated project approval and multiple applications for the same project exist in project declarations, which seriously affect the rational use of public research funds and lower the overall level of industrial research. Therefore, checking the documents submitted at the project initiation and acceptance stages is an important means of ensuring the scientific soundness and rationality of project approval and of improving the efficiency and quality of project management.
As the number of project declarations grows, manually auditing the similarity of project documents can no longer meet the demand. Using computerized information processing to judge the similarity of declared content automatically has become an urgent problem in project management, and document similarity detection is the most direct and effective means to this end. Document similarity detection must detect not only verbatim copying but also rearrangement of passages, synonym substitution, and restatement of the same content.
At present, document-copy detection methods fall roughly into two categories: methods based on string comparison and methods based on word-frequency statistics. Among the string-matching methods, the COPS prototype system for document-copy detection was developed by the Digital laboratory at Stanford University in 1995. COPS takes punctuation marks as boundaries and decomposes a document into a sequence of sentences; it then counts the number of identical sentences in two documents and uses the ratio of this count to the total number of sentences in the two documents as the measure of their similarity. The similarity of document A and document B is computed as follows:
sim(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|
where S(A) and S(B) denote the fingerprint (sentence) sets of document A and document B, respectively. The COPS system is effective for detecting large-scale document copying and is relatively fast, but it cannot find cases where sentences overlap only partially.
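A minimal sketch of this COPS-style set comparison, assuming sentences are split on common Chinese and Western sentence-ending punctuation (the splitting rule and the function names are illustrative, not taken from the COPS system itself):

```python
import re

def sentence_set(document: str) -> set[str]:
    """Split a document into a set of sentences, using punctuation as boundaries."""
    # Split on Chinese and Western sentence-ending punctuation (illustrative rule).
    sentences = re.split(r"[。；！？.;!?]", document)
    return {s.strip() for s in sentences if s.strip()}

def cops_similarity(doc_a: str, doc_b: str) -> float:
    """Ratio of shared sentences to all sentences, as in the formula above."""
    sa, sb = sentence_set(doc_a), sentence_set(doc_b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```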
Improved methods based on the string-comparison idea also include copy-detection algorithms based on sentence similarity. Such a method splits the whole document into a sequence of sentences at sentence-ending punctuation (commas, periods, semicolons, etc.), applies Chinese word segmentation to each sentence, removes meaningless words such as conjunctions and auxiliary words to form a keyword sequence, and then computes the similarity between sentences of the two documents with a formula based on the longest common subsequence (LCS) algorithm. In this method, document similarity is computed on the basis of sentence similarity. It handles partial sentence overlap well and improves detection performance. However, given the characteristics of Chinese documents, sentences with identical content can be written with different word orders; computing the longest common subsequence at the sentence level over-emphasizes word order and is therefore unfavorable for detecting content that is the same but expressed in different ways. In addition, for long documents the method requires a long running time and its efficiency is low.
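An illustrative sketch of LCS-based sentence similarity over keyword sequences; normalizing by the longer sequence length is an assumption here, since the cited work's exact formula is not given:

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two keyword sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a, 1):
        for j, wb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if wa == wb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcs_sentence_similarity(a: list[str], b: list[str]) -> float:
    """LCS length normalized by the longer keyword sequence (assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))
```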
Among the methods based on word-frequency statistics, the most widely used representation is the vector space model (VSM). Its basic assumption is that words are independent of one another, so a text can be represented as a vector and the model becomes computable. The VSM treats a document as composed of independent terms (T_1, T_2, T_3, ..., T_n); each term is given a weight according to its importance in the document, giving W = (w_1, w_2, w_3, ..., w_n). The terms (T_1, T_2, T_3, ..., T_n) are regarded as the axes of an n-dimensional coordinate system and (w_1, w_2, w_3, ..., w_n) as the corresponding coordinates, so the set of orthogonal term vectors obtained by decomposing (T_1, T_2, T_3, ..., T_n) constitutes a document vector space.
The cosine function is commonly used to compute similarity; it defines the similarity as:
S(D_i, D_j) = Σ_{k=1..n} (w_ik · w_jk) / ( sqrt(Σ_{k=1..n} w_ik^2) · sqrt(Σ_{k=1..n} w_jk^2) )
where D_i and D_j are the i-th and j-th documents in the document collection, and w_ik and w_jk are the coordinates of D_i and D_j on axis T_k. The essence of this class of methods is to compute the cosine of the angle between document vectors in an n-dimensional space.
The TF-IDF method takes into account both the frequency of a word across the texts (the TF value) and the word's ability to distinguish different texts (the IDF value), and is widely used to compute the similarity between texts. Its formula is as follows:
W_td = TF_td × IDF_t    (2)
where W_td denotes the importance of feature term t in document d, TF_td is the number of times t occurs in document d, and IDF_t reflects the distribution of t over the whole document collection. To some extent, IDF_t captures the discriminative power of term t, while TF_td reflects the distribution of the term inside the document. This kind of algorithm can exclude high-frequency, low-discrimination words and is an effective way to define weights.
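A minimal sketch combining the VSM cosine formula above with TF-IDF weights; the unsmoothed IDF lookup and the sparse-vector representation are assumptions made for illustration:

```python
import math
from collections import Counter

def tfidf_vector(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Build a TF-IDF weighted term vector: W_td = TF_td * IDF_t."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: (c / total) * idf.get(t, 0.0) for t, c in counts.items()}

def cosine_similarity(v1: dict[str, float], v2: dict[str, float]) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)
```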
Methods based on word-frequency statistics are fast and particularly suitable for computing the similarity of large documents. However, they partly or entirely ignore the positions of words within a document, whereas in practice the same word, appearing in different sentence contexts of the same document, can have quite different effects.
Summarizing the two classes of methods: methods based on string comparison must compare every sentence of the documents pairwise, so for long documents the computational complexity is high and detection efficiency is low. Moreover, long documents often contain statements that do not affect the essential content but are very similar across documents, such as the common-sense, formulaic passage "In recent years, XXX has developed rapidly; surveying these methods, they can roughly be divided into ...". Although such content does not affect the core ideas of an article, it easily degrades the performance of document similarity detection algorithms. On the other hand, simple statistics such as term frequency (TF) or inverse document frequency (IDF) in a statistics-based vector space model cannot effectively reflect the importance of a word or the distribution of feature words, so they cannot adjust weights well and the precision of such methods is not very high. In addition, most statistics-based algorithms do not exploit positional information: a feature word reflects the content of an article differently in different sentences, so word statistics only have a clear effect at the sentence level.
In view of the defects of existing document-copy detection methods, the inventors, drawing on many years of practical experience and professional knowledge in this field, have devised a new document similarity detection method based on seed words that improves on existing document-copy detection methods and is more practical.
Summary of the invention
The main purpose of the present invention is to overcome the defects of existing document-copy detection methods and to provide a new document similarity detection method based on seed words. The technical problem to be solved is to increase processing speed and to eliminate the influence of irrelevant or formulaic sentences on the document similarity judgment, making the method well suited to practical use.
Another purpose of the present invention is to provide a new document similarity detection method based on seed words whose technical problem to be solved is to improve comparison performance and accuracy, making it more practical.
The objects of the invention are achieved by the following technical solution. The document similarity detection method based on seed words proposed by the present invention comprises the following steps:
Step 1: Chinese word segmentation and tagging
Use an open-source Chinese word segmentation toolkit to cut the Chinese character sequence of a document into individual words, perform part-of-speech tagging on each word, and save the result;
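A minimal sketch of this step; the embodiment below uses the ICTCLAS toolkit, but here the open-source jieba package is used as a stand-in, so the exact splits and tags will differ from the ICTCLAS output shown later:

```python
import jieba.posseg as pseg  # open-source Chinese segmentation with POS tagging

def segment_and_tag(text: str) -> list[tuple[str, str]]:
    """Cut a Chinese character sequence into (word, part-of-speech) pairs."""
    return [(word, flag) for word, flag in pseg.cut(text)]

# Example: segment a short fragment and keep the tagged words for the later steps.
tagged = segment_and_tag("沥青路面工程质量过程控制研究")
```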
Step 2: dictionary calculation
Step 21: select words: traverse all documents in the database document set, remove numerals, measure words, pronouns, prepositions, auxiliary words, conjunctions, adverbs, and other words and punctuation marks that do not affect the understanding of the document content; the remaining words together form a set, and deleting duplicate words from this set yields the dictionary, denoted C = {c_1, c_2, c_3, ..., c_Nc}, where Nc is the number of words in the dictionary;
Step 22: calculate the inverse document frequency IDF of each word in the dictionary:
for any word c_n in the dictionary, the inverse document frequency is computed as IDF(c_n) = lg(Nt/N(c_n)), where Nt is the total number of documents in the database document set, N(c_n) is the number of documents whose content contains the word c_n, and lg denotes the base-10 logarithm;
Step 23: save the dictionary and the inverse document frequency of each word in it;
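A minimal sketch of Steps 21-23, assuming each document is already segmented and POS-tagged as in Step 1; the set of POS tags treated as non-content categories is illustrative:

```python
import math

# POS tags filtered out as non-content words (illustrative set: numerals, measure
# words, pronouns, prepositions, auxiliaries, conjunctions, adverbs, punctuation).
STOP_TAGS = {"m", "q", "r", "p", "u", "c", "d", "w"}

def content_words(tagged_doc: list[tuple[str, str]]) -> list[str]:
    """Keep only words whose POS tag is not in the stop categories."""
    return [w for w, tag in tagged_doc if tag[:1] not in STOP_TAGS]

def build_dictionary_and_idf(tagged_docs: list[list[tuple[str, str]]]):
    """Steps 21-23: dictionary C and IDF(c_n) = lg(Nt / N(c_n))."""
    nt = len(tagged_docs)
    doc_word_sets = [set(content_words(d)) for d in tagged_docs]
    dictionary = set().union(*doc_word_sets) if doc_word_sets else set()
    idf = {}
    for word in dictionary:
        n_cn = sum(1 for s in doc_word_sets if word in s)  # N(c_n)
        idf[word] = math.log10(nt / n_cn)
    return dictionary, idf
```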
Step 3: represent any document D_i in the database document set by a limited sentence set
Step 31: compute the term frequency-inverse document frequency of each dictionary word in document D_i; for any word c_n in the dictionary, its TF-IDF(c_n) value in document D_i is computed as follows:
Step 311: compute the term frequency TF(c_n) of word c_n in document D_i: divide the number of occurrences of c_n in D_i by the total number of words in D_i;
Step 312: look up the inverse document frequency IDF(c_n) of word c_n in the dictionary C obtained in Step 2;
Step 313: compute the TF-IDF value of word c_n in document D_i:
TF-IDF(c_n) = TF(c_n) * IDF(c_n);
Step 32: calculate the seed word set
Sort all words in the dictionary in descending order of their TF-IDF values in document D_i; suppose document D_i contains M content words in total, and select the top m content words as the seed word set of document D_i, denoted S = {s_1, s_2, ..., s_m}, where m is the smallest integer not less than M*k and k is the content-word selection ratio, which can be adjusted according to the observed detection performance;
Step 33: represent document D_i by the limited sentence set P_i
Traverse the full text of D_i and select every sentence that contains at least one word of the seed word set S to represent document D_i; if there are Ti sentences P_i1, P_i2, ..., P_iTi that contain a word of the seed word set S, then D_i is represented by the limited sentence set P_i = {P_i1, P_i2, ..., P_iTi};
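A minimal sketch of Steps 31-33, assuming a document is given as a list of sentences already reduced to their content words and reusing an idf mapping like the one from the sketch after Step 23; interpreting M as the number of distinct content words in the document is an assumption:

```python
import math
from collections import Counter

def seed_words(doc_words: list[str], idf: dict[str, float], k: float) -> set[str]:
    """Steps 31-32: rank content words by TF-IDF in this document and keep the top m."""
    if not doc_words:
        return set()
    total = len(doc_words)
    counts = Counter(doc_words)
    tfidf = {w: (c / total) * idf.get(w, 0.0) for w, c in counts.items()}
    ranked = sorted(tfidf, key=tfidf.get, reverse=True)
    # m is the smallest integer not less than M*k; M counts distinct content words here.
    m = math.ceil(len(ranked) * k)
    return set(ranked[:m])

def limited_sentence_set(sentences: list[list[str]], seeds: set[str]) -> list[list[str]]:
    """Step 33: keep every sentence that contains at least one seed word."""
    return [s for s in sentences if any(w in seeds for w in s)]
```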
Step 4: document comparison
Using the method of Step 3, represent document D_i by the limited sentence set P_i = {P_i1, P_i2, ..., P_iTi} and document D_j by the limited sentence set P_j = {P_j1, P_j2, ..., P_jTj}, where Ti is the number of limited sentences in P_i and Tj is the number of limited sentences in P_j; document D_i and document D_j are compared as follows:
Step 41: compute the similarity between the first sentence P_i1 of the limited sentence set P_i and every sentence P_j1, P_j2, ..., P_jTj of the limited sentence set P_j, denoted {sim_1, sim_2, ..., sim_Tj};
Step 42: compute the maximum of the similarities {sim_1, sim_2, ..., sim_Tj} between the first sentence P_i1 of P_i and all sentences of P_j, denoted sim_max;
Step 43: if sim_max from Step 42 is greater than t, the similarity Sen_Sim_1 between the first sentence P_i1 of P_i and document D_j is sim_max; otherwise Sen_Sim_1 is 0; the threshold t is tuned according to the data characteristics of the database document set (G);
Step 44: apply the same procedure as Steps 41-43 to compute the similarities between the remaining sentences P_i2, P_i3, ..., P_iTi of P_i and document D_j, denoted Sen_Sim_2, Sen_Sim_3, ..., Sen_Sim_Ti;
Step 45: sum the similarities between all sentences P_i1, P_i2, ..., P_iTi of P_i and document D_j, then divide by the sum of the sentence count Ti of P_i and the sentence count Tj of P_j, obtaining the similarity Doc_Sim_ij of documents D_i and D_j: Doc_Sim_ij = (Sen_Sim_1 + Sen_Sim_2 + ... + Sen_Sim_Ti)/(Ti + Tj).
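A minimal sketch of Steps 41-45; it assumes a sentence_similarity function implementing Steps 411-415, which is sketched after the description of those steps below:

```python
def document_similarity(p_i: list[list[str]], p_j: list[list[str]],
                        idf: dict[str, float], t: float) -> float:
    """Steps 41-45: Doc_Sim = sum of per-sentence similarities / (Ti + Tj)."""
    if not p_i or not p_j:
        return 0.0
    sen_sims = []
    for sentence in p_i:
        # Steps 41-42: best match of this sentence against every sentence of P_j.
        sim_max = max(sentence_similarity(sentence, other, idf) for other in p_j)
        # Step 43: keep the best match only if it exceeds the threshold t.
        sen_sims.append(sim_max if sim_max > t else 0.0)
    # Step 45: normalize by the total number of limited sentences in both documents.
    return sum(sen_sims) / (len(p_i) + len(p_j))
```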
In the document similarity detection method based on seed words described above, the similarity between the first sentence P_i1 of the limited sentence set P_i and every sentence P_j1, P_j2, ..., P_jTj of the limited sentence set P_j in Step 41 is computed as follows:
Step 411: apply the Chinese word segmentation and tagging method of Step 1 to the first sentence P_i1 of P_i and the first sentence P_j1 of P_j, remove the numerals, measure words, pronouns, prepositions, auxiliary words, conjunctions, adverbs, and punctuation marks that do not affect the understanding of the document content, and represent each sentence by the set of its remaining words; sentence P_i1 is represented as the word group W_i1 = {word_i11, word_i12, ..., word_i1Q1} and sentence P_j1 as the word group W_j1 = {word_j11, word_j12, ..., word_j1R1}, where Q1 is the number of words in W_i1 and R1 is the number of words in W_j1;
Step 412: compute the similarity word_sim_1 between word word_i11 of W_i1 and the word group W_j1: if W_j1 contains a word identical to word_i11, then word_sim_1 is IDF(word_i11); if W_j1 contains no word identical to word_i11, then word_sim_1 is 0; the inverse document frequency IDF(word_i11) is looked up in the per-word IDF values of the dictionary computed in Step 22;
Step 413: the similarities word_sim_2, word_sim_3, ..., word_sim_Q1 between the remaining words word_i12, word_i13, ..., word_i1Q1 of W_i1 and the word group W_j1 are computed as in Step 412;
Step 414: sum the similarities between all words word_i11, word_i12, ..., word_i1Q1 of W_i1 and the word group W_j1, then divide by the sum of the word count Q1 of W_i1 and the word count R1 of W_j1, obtaining the similarity sim_1 between the first sentence P_i1 of P_i and the first sentence P_j1 of P_j; the computing formula is as follows:
sim_1 = (word_sim_1 + word_sim_2 + ... + word_sim_Q1)/(R1 + Q1);
Step 415: the similarities between the first sentence P_i1 of P_i and the other sentences P_j2, ..., P_jTj of P_j are computed by the same method as Steps 411-414.
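A minimal sketch of Steps 411-415, assuming both sentences are already reduced to their content-word groups; note that the word-by-word match iterates over W_i1 in the numerator but divides by the word counts of both groups, following the formula above:

```python
def sentence_similarity(w_i: list[str], w_j: list[str],
                        idf: dict[str, float]) -> float:
    """Steps 411-414: IDF-weighted word overlap, normalized by Q1 + R1."""
    if not w_i or not w_j:
        return 0.0
    w_j_set = set(w_j)
    # Steps 412-413: a word scores its IDF if the other word group contains it, else 0.
    word_sims = [idf.get(word, 0.0) if word in w_j_set else 0.0 for word in w_i]
    # Step 414: divide the summed word similarities by the total word count of both groups.
    return sum(word_sims) / (len(w_i) + len(w_j))
```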
Through the above technical solution, the document similarity detection method based on seed words of the present invention has at least the following advantages:
1. The invention represents each document by a limited sentence set, so only a small number of sentence comparisons are needed when documents are compared, which increases processing speed;
2. When selecting the representative limited sentence set, sentences containing words with high term frequency-inverse document frequency (TF-IDF) values are chosen, so the selected sentences are representative of the document, and the influence of irrelevant or common-sense sentences on the document similarity judgment is eliminated;
3. Two sentences are compared word by word, which reduces the effect of similar content being narrated in different ways, and identical words are weighted by their inverse document frequency IDF, giving higher weight to more discriminative words and improving comparison performance.
The concrete method of the present invention is described in detail in the following embodiment.
Wherein:
C: dictionary  G: database document set
IDF: inverse document frequency  TF-IDF: term frequency-inverse document frequency
S: seed word set
Embodiment
To further explain the technical means adopted by the present invention to achieve the intended purpose and the effects achieved, the embodiment, method, steps, features and effects of the document similarity detection method based on seed words proposed by the present invention are described in detail below with reference to a preferred embodiment:
Step 1: Chinese word segmentation and tagging
Use an open-source Chinese word segmentation toolkit to cut the Chinese character sequence of the original document into individual words and perform part-of-speech tagging on each word. This embodiment uses the open-source ICTCLAS Chinese word segmentation toolkit developed by the Chinese Academy of Sciences to perform segmentation and part-of-speech tagging.
Below is an example of applying the ICTCLAS toolkit to segment a Chinese fragment of the document "Research on Quality Process Control of Asphalt Pavement Engineering":
Original document fragment:
" this promotion project is summed up colleague both at home and abroad and, on the basis of Asphalt Pavement Construction Quality detection and control technical research achievement, is carried out a large amount of detections and data variation analysis for the production of asphalt pavement material and asphalt and bituminous pavement forming process conscientiously absorbing; Study novel Fast nondestructive evaluation technology and bituminous pavement proceeding in quality control novel device and process control effects; Unique bituminous coat project proceeding in quality control index system is built from material, equipment, personnel, technique four aspects; Optimizing process control evaluation software and proceeding in quality control guide, reach the object to construction quality monitoring.”
Segmentation result:
" basis/r popularization/v project/n /p conscientious/ad absorptions/v summarys/v both at home and abroad/s colleague/n /p bituminous road/n face/n construction/vn quality/n detections/vn and/c control/vn technology/n research/vn achievement/n /u basis/n is upper/f ,/w for/p bituminous road/n face/q material/n and/c pitch/n mixing/vd material/v /u production/vn and/c bituminous road/n face/n be shaping/vn process/n carries out/v in a large number/m /u detection/vn and/c data/n variability/n analysis/vn; / w research/v is novel/b fast/b is harmless/vn detections/vn technology/n and/c bituminous road/n face/q quality/n process/n control/v is novel/b equipment/n and/c process/n control/vn effect/n; / w from/p material/n ,/w equipment/n ,/w personnel/n ,/w technique/n tetra-/m/q aspect/n structure/v uniqueness/a /u bituminous road/n face/q engineering/n quality/n process/n control/vn index/n system/n; / w optimization/v process/n control/vn evaluation/vn software/n and/c quality/n process/n control/vn guide/n ,/w reach/v is right/p engineering/n quality/n monitoring/vn /u object/n./w”
Wherein, "/* " represents part of speech, and such as: "/n " represents noun, "/v " represents verb.
Step 2: dictionary calculation
Step 21: select words
Traverse all documents in the database document set G, remove numerals, measure words, pronouns, prepositions, auxiliary words, conjunctions, adverbs, and other words and punctuation marks that do not affect the understanding of the document content; the remaining words together form a set, and deleting duplicate words from this set yields the dictionary C, denoted C = {c_1, c_2, c_3, ..., c_Nc}, where Nc is the number of words in dictionary C;
In this embodiment the database document set G consists of the 2,191 documents of a transportation science and technology project management platform. Step 21 yields 43,784 words, which together form a set; deleting duplicate words from this set gives the dictionary C. Some of its words are listed below:
..., salt lake, salt, severe cold, harsh, research and development, research, development, rock, karst ...
Step 22: calculate the inverse document frequency (IDF) of each word in the dictionary C
For any word c_n in dictionary C, the inverse document frequency is computed as IDF(c_n) = lg(Nt/N(c_n)), where Nt is the total number of documents in the database document set, N(c_n) is the number of documents whose content contains the word c_n, and lg denotes the base-10 logarithm;
Take the word "salt lake" in this embodiment as an example: as stated in Step 21, the database document set G contains 2,191 documents, so Nt = 2191; according to the statistics, "salt lake" occurs in 43 of them, so IDF("salt lake") = lg(2191/43) = 1.7072. The IDF of every other word is computed in the same way;
Step 23: save the dictionary C and the inverse document frequency IDF of each word in it;
Step 3: represent document D_i by a limited sentence set
Step 31: compute the term frequency-inverse document frequency (TF-IDF) of each word of dictionary C in document D_i
For any word c_n in dictionary C, its TF-IDF(c_n) value in document D_i is computed as follows:
Step 311: compute the term frequency TF(c_n) of word c_n in document D_i: divide the number of occurrences of c_n in D_i by the total number of words in D_i;
For example, "salt lake" occurs 3 times in the document "Research on Quality Process Control of Asphalt Pavement Engineering", which contains 6,202 words in total, so TF("salt lake") = 3/6202 = 0.00048 in that document;
Step 312: look up the inverse document frequency IDF(c_n) of word c_n in the dictionary C obtained in Step 2;
In this embodiment, looking up "salt lake" in the dictionary C gives IDF("salt lake") = 1.7072;
Step 313: compute the TF-IDF(c_n) value of word c_n in document D_i:
TF-IDF(c_n) = TF(c_n) * IDF(c_n);
In this embodiment, the TF-IDF of the word "salt lake" computed with the above formula is:
TF-IDF("salt lake") = 0.00048 * 1.7072 = 0.00082;
Step 32: calculate the seed word set
Sort all words in dictionary C in descending order of their TF-IDF values in document D_i; suppose document D_i contains M content words in total, and select the top m content words as the seed word set S of D_i, denoted S = {s_1, s_2, ..., s_m}, where m is the smallest integer not less than M*k and k is the content-word selection ratio, which can be tuned according to the observed detection performance. This embodiment uses k = 1/2^8 = 1/256 ≈ 0.004;
The whole document "Research on Quality Process Control of Asphalt Pavement Engineering" described in Step 1 contains 1,165 content words in total, and 1165*0.004 = 4.66, so m = 5. The selected seed word set is S = {bituminous road, pitch, control, variability, harmless}, and the corresponding TF-IDF values are {6.2435, 6.1628, 4.1924, 3.8871, 3.2274};
Step 33: represent document D_i by a limited sentence set
Traverse the full text of D_i and select every sentence that contains at least one word of the seed word set S to represent document D_i; if there are Ti sentences P_i1, P_i2, ..., P_iTi that contain a word of the seed word set S, then D_i is represented by the limited sentence set P_i = {P_i1, P_i2, ..., P_iTi};
In this embodiment, denote the fragment of "Research on Quality Process Control of Asphalt Pavement Engineering" quoted in Step 1 as document D_i. The 1st sentence of the fragment contains {bituminous road, pitch, control, variability} from the seed word set S = {bituminous road, pitch, control, variability, harmless}, so it is recorded as P_i1 and placed in the limited sentence set P_i. Likewise, the 2nd sentence contains {harmless, bituminous road, control}, the 3rd sentence contains {bituminous road, control}, and the 4th sentence contains {control}, so they are recorded as P_i2, P_i3 and P_i4 and placed in the limited sentence set P_i.
The original text of "Research on Quality Process Control of Asphalt Pavement Engineering" also contains the sentence "the author has shared his practical experience with this technology and offers some inspiration for its application"; because this sentence contains none of the words of the seed word set S = {bituminous road, pitch, control, variability, harmless}, it is not placed in the limited sentence set P_i.
Step 4: document comparison
Using the method of Step 3, represent document D_i by the limited sentence set P_i = {P_i1, P_i2, ..., P_iTi} and document D_j by the limited sentence set P_j = {P_j1, P_j2, ..., P_jTj}, where Ti is the number of limited sentences in P_i and Tj is the number of limited sentences in P_j.
Because the full documents are long, fragments are selected from the two documents "Research on Quality Process Control of Asphalt Pavement Engineering" and "Research on Kir Resource Development and Pavement Performance", denoted D_1 and D_2 respectively, and the comparison of D_1 and D_2 is then described.
Following the procedure described above, T1 = 4 and D_1 is represented by the limited sentence set P_1 as follows:
P_1: {
P_11: this promotion project is summed up colleague both at home and abroad and, on the basis of Asphalt Pavement Construction Quality detection and control technical research achievement, is carried out a large amount of detections and data variation analysis for the production of asphalt pavement material and asphalt and bituminous pavement forming process conscientiously absorbing;
P_12: study novel Fast nondestructive evaluation technology and bituminous pavement proceeding in quality control novel device and process control effects;
P_13: build unique bituminous coat project proceeding in quality control index system from material, equipment, personnel, technique four aspects;
P_14: optimizing process control evaluation software and proceeding in quality control guide, reach the object to construction quality monitoring.
}
Note " kir development of resources and pavement performance research " document fragment is D 2, utilize and D 1identical processing procedure, can obtain T2=3, and document fragment is D 2by limitation sentence collection P 2be expressed as follows:
P_2: {
P_21: the many technological achievements of west China logistics reach international most advanced level, and wherein "the composite modified series of products of kir" obtain national inventing patent and State Torch Program certificate;
P_22: by applying the high temperature anti-rut behavior of integrated lifting China Bituminous Pavement of Running Overload Vehicles with subsequent technology;
P_23: promote Xinjiang kir development of resources, promote local economic development, drive local employment;
}
Document D_1 and document D_2 are compared as follows:
Step 41: compute the similarity between the first sentence P_11 of limited sentence set P_1 and every sentence P_21, P_22, P_23 of limited sentence set P_2, denoted {sim_1, sim_2, sim_3}; the computed similarities are {0.1104, 0.1211, 0.0946}; the detailed computation is given in Steps 411-415.
Step 42: compute the maximum of the similarities {0.1104, 0.1211, 0.0946} between the first limited sentence P_11 of document D_1 and all sentences {P_21, P_22, P_23} of the limited sentence representation of document D_2: sim_max = 0.1211;
Step 43: if sim_max from Step 42 is greater than t, the similarity Sen_Sim_1 between the first sentence P_i1 of P_i and document D_j is sim_max; otherwise Sen_Sim_1 is 0; the threshold t is tuned according to the data characteristics of the database document set G;
This embodiment uses t = 0.5; since sim_max = 0.1211 < 0.5, the similarity between the first sentence P_11 of P_1 and document D_2 is Sen_Sim_1 = 0;
Step 44: apply the same procedure as Steps 41-43 to compute the similarities Sen_Sim_2, Sen_Sim_3, Sen_Sim_4 between the remaining sentences P_12, P_13, P_14 of P_1 and document D_2; in this embodiment, Sen_Sim_2, Sen_Sim_3 and Sen_Sim_4 are all 0;
Step 45: sum the similarities between all sentences P_11, P_12, P_13, P_14 of P_1 and document D_2, then divide by the sum of the sentence counts of P_1 and P_2, obtaining the similarity Doc_Sim_12 of documents D_1 and D_2: Doc_Sim_12 = (Sen_Sim_1 + Sen_Sim_2 + Sen_Sim_3 + Sen_Sim_4)/(T1 + T2) = (0 + 0 + 0 + 0)/(4 + 3) = 0.
In the present invention, the similarity between the first sentence P_11 of limited sentence set P_1 and every sentence P_21, P_22, P_23 of limited sentence set P_2 (Step 41) is computed as follows:
Step 411: apply the Chinese word segmentation and tagging method of Step 1 to the first sentence P_11 of P_1 and the first sentence P_21 of P_2, remove the numerals, measure words, pronouns, prepositions, auxiliary words, conjunctions, adverbs, and punctuation marks that do not affect the understanding of the document content, and represent each sentence by the set of its remaining words; sentence P_11 is represented as the word group W_11 = {word_111, word_112, ..., word_11Q1} and sentence P_21 as the word group W_21 = {word_211, word_212, ..., word_21R1}, where Q1 is the number of words in W_11 and R1 is the number of words in W_21;
In this embodiment Q1 = 33 and R1 = 24. W_11 = {word_111, word_112, ..., word_11(33)} = {popularization, project, conscientious, absorb, sum up, both at home and abroad, colleague, bituminous road, face, construction, quality, detection, control, technology, research, achievement, basis, bituminous road, face, material, pitch, mixing, material, production, bituminous road, face, shaping, process, carry out, detect, data, variability, analysis}; W_21 = {word_211, word_212, ..., word_21(24)} = {western part, project, many, technology, achievement, reach, the world, advanced person, level, rock, pitch, compound, change, property, series, product, acquisition, country, invention, patent, country, torch, plan, certificate};
Step 412: compute the similarity word_sim_1 between word word_111 of W_11 and the word group W_21: if W_21 contains a word identical to word_111, then word_sim_1 is IDF(word_111); if W_21 contains no word identical to word_111, then word_sim_1 is 0; the inverse document frequency IDF(word_111) is looked up in the per-word IDF values of the dictionary computed in Step 22;
In this embodiment, word_111 of W_11 is "popularization"; W_21 does not contain "popularization", so word_sim_1 = 0;
Step 413: the similarities {word_sim_2, word_sim_3, ..., word_sim_33} between the remaining words word_112, word_113, ..., word_11(33) of W_11 and the word group W_21 = {word_211, word_212, ..., word_21(24)} are computed as in Step 412;
In this embodiment, the 2nd word of W_11, word_112 = "project", occurs in W_21, so word_sim_2 = IDF("project") = 0.0023. W_11 contains 33 words in total; processing them in the same way as the 1st and 2nd words gives: {word_sim_3, ..., word_sim_33} = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0045, 0, 0.0127, 0, 0, 0, 0, 6.1628, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
Step 414: sum the similarities between all words word_111, word_112, ..., word_11(33) of W_11 and the word group W_21, then divide by the sum of the word count of W_11 (33) and the word count of W_21 (24), obtaining the similarity sim_1 between the first sentence P_11 of P_1 and the first sentence P_21 of P_2; the computing formula is as follows:
sim_1 = (word_sim_1 + word_sim_2 + ... + word_sim_33)/(33 + 24) = 6.1823/57 = 0.1104;
Step 415: the similarities between the first sentence P_11 of P_1 and the other sentences P_22, P_23 of P_2 are computed by the same method as Steps 411-414.
Through the above process, the similarity between any two documents can be calculated.
In this embodiment, the main task is to compute the similarity between a new document and the existing documents in the database. The results can be used in two ways:
(1) According to the characteristics of the documents and the plagiarism-checking requirements of the embodiment, a document similarity threshold Thres = 0.4 is set: if the similarity of two documents is greater than 0.4, the two documents are judged to be similar. Users can set the value of Thres according to the characteristics of their industry and their own standard for document similarity.
(2) The user sorts the existing documents by the computed similarity and then judges subjectively, based on experience, whether the existing document with the greatest similarity is indeed similar to the new document.
The above is only a preferred embodiment of the present invention and does not limit the invention in any form. Although the invention has been disclosed above by way of a preferred embodiment, it is not thereby limited. Any person skilled in the art may, without departing from the scope of the technical solution of the invention, use the disclosed technical content to make minor changes or to modify it into an equivalent embodiment; any simple modification, equivalent change or modification made to the above embodiment according to the technical essence of the invention, as long as it does not depart from the content of the technical solution of the invention, still falls within the scope of the technical solution of the invention.

Claims (2)

1. A document similarity detection method based on seed words, characterized in that it comprises the following steps:
Step 1: Chinese word segmentation and tagging
Use an open-source Chinese word segmentation toolkit to cut the Chinese character sequence in a document into individual words, perform part-of-speech tagging on each word, and save the result;
Step 2: dictionary calculation
Step 21: select words: traverse all documents in the database document set (G), remove numerals, measure words, pronouns, prepositions, auxiliary words, conjunctions, adverbs, and other words and punctuation marks that do not affect the understanding of the document content; the remaining words together form a set, and deleting duplicate words from this set yields the dictionary (C), denoted C = {c_1, c_2, c_3, ..., c_Nc}, where Nc is the number of words in the dictionary (C);
Step 22: calculate the inverse document frequency (IDF) of each word in the dictionary (C):
for any word c_n in the dictionary (C), the inverse document frequency is computed as IDF(c_n) = lg(Nt/N(c_n)), where Nt is the total number of documents in the database document set, N(c_n) is the number of documents whose content contains the word c_n, and lg denotes the base-10 logarithm;
Step 23: save the dictionary (C) and the inverse document frequency (IDF) of each word in it;
Step 3: represent any document D_i by the limited sentence set P_i
Step 31: compute the term frequency-inverse document frequency (TF-IDF) of each word of the dictionary (C) in document D_i
For any word c_n in the dictionary (C), its TF-IDF(c_n) value in document D_i is computed as follows:
Step 311: compute the term frequency TF(c_n) of word c_n in document D_i: divide the number of occurrences of c_n in D_i by the total number of words in D_i;
Step 312: look up the inverse document frequency IDF(c_n) of word c_n in the dictionary (C) obtained in Step 2;
Step 313: compute the TF-IDF value of word c_n in document D_i:
TF-IDF(c_n) = TF(c_n) * IDF(c_n);
Step 32: calculate the seed word set
Sort all words in the dictionary (C) in descending order of their TF-IDF values in document D_i; suppose document D_i contains M content words in total, and select the top m content words as the seed word set (S) of document D_i, denoted S = {s_1, s_2, ..., s_m}, where m is the smallest integer not less than M*k and k is the content-word selection ratio, which can be adjusted according to the observed detection performance;
Step 33: represent document D_i by the limited sentence set P_i
Traverse the full text of D_i and select every sentence that contains at least one word of the seed word set (S) to represent document D_i; if there are Ti sentences P_i1, P_i2, ..., P_iTi that contain a word of the seed word set (S), then D_i is represented by the limited sentence set P_i = {P_i1, P_i2, ..., P_iTi};
Step 4: document comparison
Using the method of Step 3, represent document D_i by the limited sentence set P_i = {P_i1, P_i2, ..., P_iTi} and document D_j by the limited sentence set P_j = {P_j1, P_j2, ..., P_jTj}, where Ti is the number of limited sentences in P_i and Tj is the number of limited sentences in P_j; document D_i and document D_j are compared as follows:
Step 41: compute the similarity between the first sentence P_i1 of the limited sentence set P_i and every sentence P_j1, P_j2, ..., P_jTj of the limited sentence set P_j, denoted {sim_1, sim_2, ..., sim_Tj};
Step 42: compute the maximum of the similarities {sim_1, sim_2, ..., sim_Tj} between the first sentence P_i1 of P_i and all sentences of P_j, denoted sim_max;
Step 43: if sim_max from Step 42 is greater than t, the similarity Sen_Sim_1 between the first sentence P_i1 of P_i and document D_j is sim_max; otherwise Sen_Sim_1 is 0; the threshold t is tuned according to the data characteristics of the database document set (G);
Step 44: apply the same procedure as Steps 41-43 to compute the similarities between the remaining sentences P_i2, P_i3, ..., P_iTi of P_i and document D_j, denoted Sen_Sim_2, Sen_Sim_3, ..., Sen_Sim_Ti;
Step 45: sum the similarities between all sentences P_i1, P_i2, ..., P_iTi of P_i and document D_j, then divide by the sum of the sentence count Ti of P_i and the sentence count Tj of P_j, obtaining the similarity Doc_Sim_ij of documents D_i and D_j: Doc_Sim_ij = (Sen_Sim_1 + Sen_Sim_2 + ... + Sen_Sim_Ti)/(Ti + Tj).
2. The document similarity detection method based on seed words according to claim 1, characterized in that the similarity between the first sentence P_i1 of the limited sentence set P_i and every sentence P_j1, P_j2, ..., P_jTj of the limited sentence set P_j in Step 41 is computed as follows:
Step 411: apply the Chinese word segmentation and tagging method of Step 1 to the first sentence P_i1 of the limited sentence set P_i and the first sentence P_j1 of the limited sentence set P_j, remove the numerals, measure words, pronouns, prepositions, auxiliary words, conjunctions, adverbs, and punctuation marks that do not affect the understanding of the document content, and represent each sentence by the set of its remaining words; sentence P_i1 is represented as the word group W_i1 = {word_i11, word_i12, ..., word_i1Q1} and sentence P_j1 as the word group W_j1 = {word_j11, word_j12, ..., word_j1R1}, where Q1 is the number of words in W_i1 and R1 is the number of words in W_j1;
Step 412: compute the similarity word_sim_1 between word word_i11 of W_i1 and the word group W_j1: if W_j1 contains a word identical to word_i11, then word_sim_1 is IDF(word_i11); if W_j1 contains no word identical to word_i11, then word_sim_1 is 0; the inverse document frequency IDF(word_i11) is looked up in the per-word IDF values of the dictionary computed in Step 22;
Step 413: the similarities word_sim_2, word_sim_3, ..., word_sim_Q1 between the remaining words word_i12, word_i13, ..., word_i1Q1 of W_i1 and the word group W_j1 are computed as in Step 412;
Step 414: sum the similarities between all words word_i11, word_i12, ..., word_i1Q1 of W_i1 and the word group W_j1, then divide by the sum of the word count Q1 of W_i1 and the word count R1 of W_j1, obtaining the similarity sim_1 between the first sentence P_i1 of P_i and the first sentence P_j1 of P_j; the computing formula is as follows:
sim_1 = (word_sim_1 + word_sim_2 + ... + word_sim_Q1)/(R1 + Q1);
Step 415: the similarities between the first sentence P_i1 of P_i and the other sentences P_j2, ..., P_jTj of P_j are computed by the same method as Steps 411-414.
CN201310359673.1A 2013-08-16 2013-08-16 A document similarity detection method based on seed words Active CN104376024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310359673.1A CN104376024B (en) 2013-08-16 2013-08-16 A document similarity detection method based on seed words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310359673.1A CN104376024B (en) 2013-08-16 2013-08-16 A document similarity detection method based on seed words

Publications (2)

Publication Number Publication Date
CN104376024A true CN104376024A (en) 2015-02-25
CN104376024B CN104376024B (en) 2017-12-15

Family

ID=52554938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310359673.1A Active CN104376024B (en) 2013-08-16 2013-08-16 A document similarity detection method based on seed words

Country Status (1)

Country Link
CN (1) CN104376024B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126497A (en) * 2016-06-21 2016-11-16 同方知网数字出版技术股份有限公司 A kind of automatic mining correspondence executes leader section and the method for cited literature textual content fragment
CN107229939A (en) * 2016-03-24 2017-10-03 北大方正集团有限公司 The decision method and device of similar document
CN109002508A (en) * 2018-07-01 2018-12-14 东莞市华睿电子科技有限公司 A kind of text information crawling method based on web crawlers
CN109271641A (en) * 2018-11-20 2019-01-25 武汉斗鱼网络科技有限公司 A kind of Text similarity computing method, apparatus and electronic equipment
CN112307738A (en) * 2020-11-11 2021-02-02 北京沃东天骏信息技术有限公司 Method and device for processing text

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398814A (en) * 2007-09-26 2009-04-01 北京大学 Method and system for simultaneously abstracting document summarization and key words
CN101655866A (en) * 2009-08-14 2010-02-24 北京中献电子技术开发中心 Automatic decimation method of scientific and technical terminology
CN101710317A (en) * 2009-11-17 2010-05-19 上海第二工业大学 Word partial weight calculating method based on word distribution
CN101963989A (en) * 2010-09-30 2011-02-02 大连理工大学 Word elimination process for extracting domain ontology concept
US8473279B2 (en) * 2008-05-30 2013-06-25 Eiman Al-Shammari Lemmatizing, stemming, and query expansion method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398814A (en) * 2007-09-26 2009-04-01 北京大学 Method and system for simultaneously abstracting document summarization and key words
US8473279B2 (en) * 2008-05-30 2013-06-25 Eiman Al-Shammari Lemmatizing, stemming, and query expansion method and system
CN101655866A (en) * 2009-08-14 2010-02-24 北京中献电子技术开发中心 Automatic decimation method of scientific and technical terminology
CN101710317A (en) * 2009-11-17 2010-05-19 上海第二工业大学 Word partial weight calculating method based on word distribution
CN101963989A (en) * 2010-09-30 2011-02-02 大连理工大学 Word elimination process for extracting domain ontology concept

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229939A (en) * 2016-03-24 2017-10-03 北大方正集团有限公司 The decision method and device of similar document
CN107229939B (en) * 2016-03-24 2020-12-04 北大方正集团有限公司 Similar document judgment method and device
CN106126497A (en) * 2016-06-21 2016-11-16 同方知网数字出版技术股份有限公司 A kind of automatic mining correspondence executes leader section and the method for cited literature textual content fragment
CN109002508A (en) * 2018-07-01 2018-12-14 东莞市华睿电子科技有限公司 A kind of text information crawling method based on web crawlers
CN109002508B (en) * 2018-07-01 2021-08-06 上海众引文化传播股份有限公司 Text information crawling method based on web crawler
CN109271641A (en) * 2018-11-20 2019-01-25 武汉斗鱼网络科技有限公司 A kind of Text similarity computing method, apparatus and electronic equipment
CN109271641B (en) * 2018-11-20 2023-09-08 广西三方大供应链技术服务有限公司 Text similarity calculation method and device and electronic equipment
CN112307738A (en) * 2020-11-11 2021-02-02 北京沃东天骏信息技术有限公司 Method and device for processing text

Also Published As

Publication number Publication date
CN104376024B (en) 2017-12-15

Similar Documents

Publication Publication Date Title
CN104834747B (en) Short text classification method based on convolutional neural networks
CN105488024B (en) The abstracting method and device of Web page subject sentence
CN103617157B (en) Based on semantic Text similarity computing method
CN102945228B (en) A kind of Multi-document summarization method based on text segmentation technology
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN103164540B (en) A kind of patent hotspot finds and trend analysis
CN104598611B (en) The method and system being ranked up to search entry
CN106126620A (en) Method of Chinese Text Automatic Abstraction based on machine learning
CN104376024A (en) Document similarity detecting method based on seed words
CN104915448A (en) Substance and paragraph linking method based on hierarchical convolutional network
CN107832457A (en) Power transmission and transforming equipment defect dictionary method for building up and system based on TextRank algorithm
CN105045812A (en) Text topic classification method and system
CN104598813A (en) Computer intrusion detection method based on integrated study and semi-supervised SVM
Le et al. Text classification: Naïve bayes classifier with sentiment Lexicon
CN103631858A (en) Science and technology project similarity calculation method
CN104484380A (en) Personalized search method and personalized search device
CN103324700A (en) Noumenon concept attribute learning method based on Web information
CN105389505A (en) Shilling attack detection method based on stack type sparse self-encoder
CN105654144A (en) Social network body constructing method based on machine learning
CN105095430A (en) Method and device for setting up word network and extracting keywords
CN104572631A (en) Training method and system for language model
CN104361059A (en) Harmful information identification and web page classification method based on multi-instance learning
CN104933032A (en) Method for extracting keywords of blog based on complex network
CN104809105A (en) Method and system for identifying event argument and argument role based on maximum entropy
CN105243053A (en) Method and apparatus for extracting key sentence of document

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant