WO2014206241A1 - Document similarity calculation method, and approximate duplicate document detection method and device - Google Patents

Document similarity calculation method, and approximate duplicate document detection method and device

Info

Publication number
WO2014206241A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
similarity
word segmentation
word
detected
Prior art date
Application number
PCT/CN2014/080318
Other languages
English (en)
Chinese (zh)
Inventor
李国良
冯建华
魏建生
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2014206241A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing

Definitions

  • The present invention relates to the field of computer technologies, and in particular to a document similarity calculation method and to an approximate duplicate document detection method and apparatus.
  • Background
  • Approximate duplicate detection over massive text data faces two main problems. The first is accuracy: because data passes through different input and processing pipelines, approximately duplicate documents are not necessarily identical. For example, the same news item, even with exactly the same content, does not necessarily overlap completely once it is published on different websites, and a web crawler can only capture all of this information without being able to identify which items are real news. If the text is matched exactly, two documents that are essentially the same are treated as different entities. The second is efficiency: the explosion of information brought about by the development of Internet technology has increased the complexity of discovering duplicate content in massive data.
  • Existing similarity calculation methods fall roughly into two categories: set-based methods and string-based methods.
  • A set-based similarity calculation method treats a document as a set of word segments and determines directly whether two documents are similar by calculating the similarity between the sets.
  • A string-based similarity calculation method treats a document as one complete string and measures the degree of similarity between the strings.
  • The typical measure is edit distance.
  • The edit distance is the minimum number of elementary edit operations (insert, delete, replace) required to change one string into another, and it is used as a measure of the similarity between two strings.
  • The existing set-based similarity calculation methods cannot perceive the edit similarity of text at the word segment level.
  • A set-based similarity function treats two word segments that share the same origin as different instances. The resulting document similarity is therefore smaller than the actual value, which reduces detection accuracy.
  • String-based similarity calculation methods have high computational complexity, and their accuracy is susceptible to word segmentation. Internet data is massive and interactive, and editing errors are widespread in the generation and propagation of documents, so the computed edit similarity falls below the actual value and detection accuracy suffers.
  • Summary of the invention
  • The object of the present invention is to provide a document similarity calculation method and an approximate duplicate document detection method and device that effectively recognize approximately duplicated text containing word segment editing errors, improve the accuracy of approximate duplicate document detection, reduce computational complexity, and optimize computational efficiency.
  • A first aspect of the present invention provides a document similarity calculation method, the method comprising:
  • The calculating of the maximum weighted matching value of the weighted bipartite graph includes:
  • The calculating, by using the maximum weighted matching value, of the similarity between the to-be-detected documents includes:
  • $F(s_1, s_2) = 2 \times \delta(T_1, T_2) / (|T_1| + |T_2|)$;
  • where $F(s_1, s_2)$ represents the similarity between the to-be-detected documents $s_1$ and $s_2$, $\delta(T_1, T_2)$ represents the maximum weighted matching value of the weighted bipartite graph of the corresponding word segment sets $T_1$ and $T_2$, $|T_1|$ represents the cardinality of the word segment set $T_1$, and $|T_2|$ represents the cardinality of the word segment set $T_2$.
  • The edit similarity satisfies the requirement when it is greater than or equal to a preset edit similarity threshold.
  • The present invention further provides an approximate duplicate document detection method, the method comprising: performing word segmentation on each to-be-detected document to obtain a word segment set of each to-be-detected document; dividing each word segment in the word segment set into substrings shorter than the word segment, and using the substrings to form a signature set of the word segment;
  • The method further includes:
  • numbering the word segments obtained by the word segmentation and recording the word segment numbers, where a word segment number indicates the order in which the word segment appears in the to-be-detected document;
  • sorting the substrings in the word segment set, and composing the substrings that satisfy the requirement into the document signature of the to-be-detected document.
  • The sorting of the substrings in the word segment set and the composing of the substrings that satisfy the requirement into the document signature of the to-be-detected document includes:
  • stopping the deletion, where M is a preset number threshold;
  • composing the remaining substrings into the document signature of the to-be-detected document.
  • The dividing of each word segment in the word segment set into substrings shorter than the word segment includes:
  • using the q-gram method to obtain all contiguous substrings of length q in the original word segment.
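The q-gram division described above can be illustrated with a minimal Python sketch. The function name `qgrams` and the default `q = 2` are assumptions made for this illustration; the source only states that all contiguous substrings of length q are taken from the original word segment.

```python
def qgrams(token: str, q: int = 2) -> list[str]:
    """Return every contiguous substring of length q in token, in order.

    A token shorter than q yields an empty list.
    """
    if q <= 0:
        raise ValueError("q must be a positive integer")
    return [token[i:i + q] for i in range(len(token) - q + 1)]


# "tracy" has four 2-grams.
print(qgrams("tracy", 2))  # ['tr', 'ra', 'ac', 'cy']
```

A word segment of length L yields L - q + 1 substrings, which is why a later filtering step that keeps only some of them reduces storage.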
  • A fourth possible implementation of the second aspect includes:
  • forming the signature set of the word segment from the retained substrings.
  • The calculating of the similarity of the document pairs in the candidate set includes:
  • The calculating of the maximum weighted matching value $\delta(T_1, T_2)$ of the weighted bipartite graph includes:
  • With reference to the fifth possible implementation of the second aspect, in a seventh possible implementation of the second aspect, the calculating, by using the maximum weighted matching value $\delta(T_1, T_2)$, of the similarity $F(s_1, s_2)$ between the documents $s_1$, $s_2$ corresponding to the word segment sets $T_1$, $T_2$ includes:
  • $F(s_1, s_2) = \delta(T_1, T_2) / (|T_1| \times |T_2|)^{1/2}$;
  • $F(s_1, s_2) = 2 \times \delta(T_1, T_2) / (|T_1| + |T_2|)$;
  • where $|T_2|$ indicates the cardinality of the word segment set $T_2$.
  • The edit similarity satisfies the requirement when it is greater than or equal to a preset edit similarity threshold.
  • Before the calculating of the similarity of the document pairs in the candidate set, the method further includes:
  • The calculating of the similarity of the document pairs in the candidate set includes:
  • calculating the similarity of the document pairs that remain in the candidate set after filtering.
  • The filtering of the document pairs in the candidate set by using a pruning policy includes:
  • extracting the document pairs in the candidate set in turn and evaluating an upper limit of the similarity between the documents; if the upper limit is less than the preset similarity threshold, pruning the corresponding document pair.
  • The present invention further provides a document similarity calculation device, where the device includes:
  • a word segmentation module, configured to perform word segmentation on two to-be-detected documents respectively, to obtain the respective word segment sets of the to-be-detected documents;
  • a first calculation module, configured to calculate the edit similarity of all word segment pairs in the two word segment sets obtained by the word segmentation module; a weighted bipartite graph establishing module, configured to establish an edge between the word segment pairs whose edit similarity, as obtained by the first calculation module, satisfies the requirement, with the edit similarity as the weight of the edge of the corresponding word segment pair, to obtain a weighted bipartite graph;
  • a second calculation module, configured to calculate the maximum weighted matching value of the weighted bipartite graph obtained by the weighted bipartite graph establishing module;
  • a third calculation module, configured to calculate the similarity between the to-be-detected documents by using the maximum weighted matching value obtained by the second calculation module.
  • The second calculation module is specifically configured to find, in the weighted bipartite graph obtained by the weighted bipartite graph establishing module, the edge set with the largest sum of weights in which no two edges share a vertex, and to use that sum of weights as the maximum weighted matching value of the weighted bipartite graph.
  • The third calculation module is specifically configured to calculate the similarity between the to-be-detected documents, by using the maximum weighted matching value, according to any of the formulas listed below:
  • $F(s_1, s_2) = 2 \times \delta(T_1, T_2) / (|T_1| + |T_2|)$;
  • where $F(s_1, s_2)$ represents the similarity between the to-be-detected documents $s_1$ and $s_2$, $\delta(T_1, T_2)$ represents the maximum weighted matching value of the weighted bipartite graph of the corresponding word segment sets $T_1$ and $T_2$, $|T_1|$ represents the cardinality of the word segment set $T_1$, and $|T_2|$ represents the cardinality of the word segment set $T_2$.
  • The edit similarity satisfies the requirement when it is greater than or equal to a preset edit similarity threshold.
  • The present invention further provides an apparatus for detecting approximately duplicate documents, the apparatus comprising: a word segmentation processing module, configured to perform word segmentation on each to-be-detected document to obtain a word segment set of each to-be-detected document;
  • a substring processing module, configured to divide each word segment in the word segment set obtained by the word segmentation processing module into substrings shorter than the word segment, and to form a signature set of the word segment from the substrings; a document signature generation module, configured to combine the signature sets of all word segments in the word segment set of the same to-be-detected document, to generate a document signature of that document;
  • an indexing module, configured to build an inverted index on the document signatures, and to pair the documents corresponding to any two word segment sets that appear in the index entry of the same substring into a document pair that is added to the candidate set;
  • a calculation module, configured to calculate the similarity of the document pairs in the candidate set;
  • a judging module, configured to identify the document pairs whose similarity, as calculated by the calculation module, meets the requirement as approximate duplicate documents.
  • The word segmentation processing module is further configured to, after performing word segmentation on each to-be-detected document, number the resulting word segments and record the word segment numbers, where a word segment number indicates the order in which the word segment appears in the to-be-detected document;
  • the document signature generation module specifically includes a merging unit and a sorting unit;
  • the merging unit is configured to combine the signature sets of all word segments in the word segment set of the same to-be-detected document, and to record the number of the word segment in which each substring is located;
  • the sorting unit is configured to sort the substrings in the word segment set and to compose the substrings that satisfy the requirement into the document signature of the to-be-detected document.
  • The sorting unit is further configured to, after sorting the substrings in the word segment set, delete the substrings in the same word segment set from back to front and use a data structure table to record the word segment numbers corresponding to the deleted substrings; to stop deleting when the number of elements in the data structure table reaches M, where M is a preset number threshold; and to compose the remaining substrings into the document signature of the to-be-detected document.
  • The substring processing module is specifically configured to divide each word segment in the word segment set by the q-gram method, to obtain all contiguous substrings of length q in the original word segment.
  • The substring processing module is further configured to sort the contiguous substrings and retain the first N substrings, where N is a preset positive integer, and to form the signature set of the word segment from the retained substrings.
  • The calculation module includes:
  • a first calculating unit, configured to calculate, for the two word segment sets $T_1$, $T_2$ of a document pair in the candidate set obtained by the indexing module, the edit similarity of all word segment pairs $(t_{1,i}, t_{2,j})$, where $t_{1,i} \in T_1$, $t_{2,j} \in T_2$, $1 \le i \le m$, $1 \le j \le n$, and $m$ and $n$ are the cardinalities of the word segment sets $T_1$ and $T_2$, respectively;
  • a weighted bipartite graph establishing unit, configured to establish an edge between the word segment pairs whose edit similarity, as obtained by the first calculating unit, satisfies the requirement, with the edit similarity as the weight of the edge of the corresponding word segment pair, to obtain a weighted bipartite graph;
  • a second calculating unit, configured to calculate the maximum weighted matching value $\delta(T_1, T_2)$ of the weighted bipartite graph obtained by the weighted bipartite graph establishing unit;
  • a third calculating unit, configured to calculate, by using the maximum weighted matching value $\delta(T_1, T_2)$ obtained by the second calculating unit, the similarity $F(s_1, s_2)$ between the documents $s_1$, $s_2$ corresponding to the word segment sets $T_1$, $T_2$.
  • The second calculating unit is specifically configured to find, in the weighted bipartite graph obtained by the weighted bipartite graph establishing unit, the edge set with the largest sum of weights in which no two edges share a vertex, and to use that sum of weights as the maximum weighted matching value $\delta(T_1, T_2)$ of the weighted bipartite graph.
  • The third calculating unit is specifically configured to calculate, by using the maximum weighted matching value $\delta(T_1, T_2)$ and according to any of the formulas listed below, the similarity $F(s_1, s_2)$ between the documents $s_1$, $s_2$ corresponding to the word segment sets $T_1$, $T_2$:
  • $F(s_1, s_2) = \delta(T_1, T_2) / (|T_1| \times |T_2|)^{1/2}$;
  • $F(s_1, s_2) = 2 \times \delta(T_1, T_2) / (|T_1| + |T_2|)$;
  • The edit similarity satisfies the requirement when it is greater than or equal to a preset edit similarity threshold.
  • The device further includes a filtering module, connected to the calculation module and configured to filter, by using a pruning policy, the document pairs in the candidate set obtained by the indexing module;
  • the calculation module calculates the similarity of the document pairs in the candidate set after filtering by the filtering module.
  • The filtering module is specifically configured to extract the document pairs in the candidate set obtained by the indexing module in turn and to evaluate an upper limit of the similarity between the documents; if the upper limit is smaller than the preset similarity threshold, the corresponding document pair is pruned.
  • The document similarity calculation method and the approximate duplicate document detection method and device provided by the invention use the edit similarity of word segments to calculate document similarity. They can effectively identify approximately duplicated text containing word segment editing errors, improve the accuracy of approximate duplicate document detection, reduce computational complexity, and optimize computational efficiency.
  • FIG. 1 is a flowchart of a method for calculating a similarity degree of a document according to Embodiment 1 of the present invention
  • FIG. 2 is a flowchart of a method for detecting an approximately duplicate document according to Embodiment 2 of the present invention
  • FIG. 3 is a schematic diagram of word segment sets and the corresponding weighted bipartite graph according to Embodiment 2 of the present invention;
  • FIG. 4 is a schematic diagram of a document similarity calculation device according to Embodiment 3 of the present invention.
  • FIG. 5 is a schematic diagram of an apparatus for detecting an approximate duplicate document according to Embodiment 4 of the present invention.
  • FIG. 6 is a schematic diagram of a computing module according to Embodiment 4 of the present invention.
  • Detailed description
  • The document similarity calculation method and the approximate duplicate document detection method and device provided by the embodiments of the present invention are applicable to approximate duplicate detection over massive text data in a computer system, and are particularly suitable for data such as Internet text, Web application system logs, or text records in a database.
  • FIG. 1 is a flowchart of a document similarity calculation method provided by this embodiment. As shown in FIG. 1, the document similarity calculation method of the present invention includes:
  • S10: perform word segmentation on each of the two to-be-detected documents to obtain the respective word segment sets of the to-be-detected documents.
  • The two to-be-detected documents are each segmented to obtain the word segment sets $T_1$, $T_2$.
  • Edit distance is the minimum number of meta-editing operations (insert, delete, replace) required to change one string to another.
  • the corresponding number of edits is the edit distance. It is not difficult to find that the fewer editing operations required, the smaller the editing distance, the more similar the strings.
  • the editing similarity satisfaction requirement includes: the editing similarity is greater than or equal to a preset editing similarity threshold.
  • Jaccard similarity formula: $F(s_1, s_2) = \delta(T_1, T_2) / (|T_1| + |T_2| - \delta(T_1, T_2))$.
  • Cosine similarity formula: $F(s_1, s_2) = \delta(T_1, T_2) / (|T_1| \times |T_2|)^{1/2}$.
  • Here $\delta(T_1, T_2)$ represents the maximum weighted matching value of the weighted bipartite graph of the word segment sets $T_1$, $T_2$ corresponding to the to-be-detected documents $s_1$, $s_2$,
  • $|T_1|$ indicates the cardinality of the word segment set $T_1$, and
  • $|T_2|$ indicates the cardinality of the word segment set $T_2$.
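As a hedged illustration of how the maximum weighted matching value feeds the document-level formulas, the sketch below evaluates the Jaccard and cosine forms given here, plus the form 2 × δ / (|T1| + |T2|) that appears among the formulas elsewhere in this document. The function names and the sample value of `delta` are assumptions for the example, not part of the source.

```python
import math


def jaccard_similarity(delta: float, n1: int, n2: int) -> float:
    """delta / (|T1| + |T2| - delta)."""
    return delta / (n1 + n2 - delta)


def cosine_similarity(delta: float, n1: int, n2: int) -> float:
    """delta / sqrt(|T1| * |T2|)."""
    return delta / math.sqrt(n1 * n2)


def dice_style_similarity(delta: float, n1: int, n2: int) -> float:
    """2 * delta / (|T1| + |T2|)."""
    return 2 * delta / (n1 + n2)


# Hypothetical input: two sets of 3 word segments whose maximum
# weighted matching value is 2.675.
delta, n1, n2 = 2.675, 3, 3
for f in (jaccard_similarity, cosine_similarity, dice_style_similarity):
    print(f.__name__, round(f(delta, n1, n2), 4))
```

When every matched pair is identical (all weights equal 1), delta equals the size of the set intersection and the three functions reduce to their ordinary set-based counterparts.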
  • FIG. 2 is a flowchart of the method for detecting an approximately duplicate document according to the embodiment. As shown in FIG. 2, the method for detecting an approximate duplicate document of the present invention includes:
  • S20: perform word segmentation on each to-be-detected document to obtain the respective word segment sets of the to-be-detected documents. An existing word segmentation method may be used, for example segmentation by recognizing specific non-English characters (such as punctuation and digits),
  • the forward maximum matching method, and so on, to segment each to-be-detected document and obtain its word segment set.
  • The word segments obtained by the word segmentation are numbered and the word segment numbers are recorded; a word segment number indicates the order in which the word segment appears in the to-be-detected document.
  • Each word segment in the word segment set may be divided by, but is not limited to, the q-gram method, which obtains all contiguous substrings of length q in the original word segment.
  • Each word segment $t_{1,i} \in T_1$, $t_{2,j} \in T_2$ ($1 \le i \le m$, $1 \le j \le n$, where $m$ and $n$ are the cardinalities of the word segment sets $T_1$ and $T_2$, respectively) is divided into shorter substrings to obtain its substring set; for example, q-gram division obtains all contiguous substrings of length q in the original word segment.
  • Token ""
  • its corresponding 2_gram set is ⁇
  • an appropriate filtering method may be selected according to the splitting strategy of the substring to delete part of the substring, to reduce the substring set and reduce the storage overhead.
  • Forming the signature set of the word segment from the substrings includes: sorting the contiguous substrings and retaining the first N substrings, where N is a preset positive integer; and forming the signature set of the word segment from the retained substrings.
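A minimal sketch of the sort-and-retain step follows. It assumes lexicographic ordering and removal of repeated substrings, and it treats N as a plain parameter; the function name `token_signature` is hypothetical, and how N should be chosen is not shown here.

```python
def token_signature(token: str, q: int = 2, n: int = 3) -> list[str]:
    """Sort the distinct q-grams of token and keep the first n of them.

    Lexicographic order and de-duplication are assumptions of this sketch.
    """
    grams = sorted({token[i:i + q] for i in range(len(token) - q + 1)})
    return grams[:n]


print(token_signature("tracy"))      # ['ac', 'cy', 'ra']
print(token_signature("macgrady"))   # ['ac', 'ad', 'cg']
print(token_signature("mvp", n=1))   # ['mv']
```

Keeping only a sorted prefix means two word segments need to share one of a few substrings to be compared at all, which shrinks the index at the cost of a carefully chosen N.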
  • S203 combines the signature sets of all the word segments in the word segmentation set, and records the word segment numbers in which each substring is located.
  • S2032: sort the substrings in the word segment set, and compose the substrings that satisfy the requirement into the document signature of the to-be-detected document.
  • This includes: sorting the substrings in the word segment set; deleting the substrings in the same word segment set from back to front, and using a data structure table to record the word segment numbers corresponding to the deleted substrings; stopping the deletion when the number of elements in the data structure table reaches M, where M is a preset number threshold; and composing the remaining substrings into the document signature of the to-be-detected document.
  • each substring corresponds to at least one word segment set that appears. If two word segment sets appear in the index table corresponding to the same substring, they are paired and added to the candidate set.
  • S205: calculate the edit similarity of all word segment pairs $(t_{1,i}, t_{2,j})$ in the two word segment sets $T_1$, $T_2$ of a document pair in the candidate set, where $t_{1,i} \in T_1$, $t_{2,j} \in T_2$, $1 \le i \le m$, $1 \le j \le n$, and $m$ and $n$ are the cardinalities of the word segment sets $T_1$ and $T_2$, respectively.
  • In the edit similarity, $|t|$ denotes the string length of a word segment, and $\max\{|t_{1,i}|, |t_{2,j}|\}$ takes the larger of $|t_{1,i}|$ and $|t_{2,j}|$.
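The text refers to the string lengths and to the larger of the two lengths, which is consistent with the commonly used normalization 1 - ed(a, b) / max(|a|, |b|). The sketch below assumes that normalization; it is an illustration, not a confirmed transcription of the patent's formula. The edit distance is the standard dynamic-programming computation over insert, delete, and replace operations.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insert, delete, and replace operations turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # replace or keep
        prev = cur
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - ed(a, b) / max(len(a), len(b))."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest


print(edit_similarity("macgrady", "mcgrady"))  # 0.875
```

Under this normalization the value lies in [0, 1], so it can be compared directly with a preset edit similarity threshold.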
  • S2052: establish an edge between the word segment pairs whose edit similarity satisfies the requirement, with the edit similarity as the weight of the edge of the corresponding word segment pair, to obtain a weighted bipartite graph.
  • The edit similarity satisfies the requirement when it is greater than or equal to a preset edit similarity threshold.
  • In the weighted bipartite graph, the set of edges that share no vertices and has the largest sum of weights is found, and that sum of weights is used as the maximum weighted matching value $\delta(T_1, T_2)$ of the weighted bipartite graph.
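For small word segment sets, the maximum weighted matching value can be computed by exhaustive search with memoization, as in the sketch below. This is an illustrative stand-in rather than the method the source prescribes; a production system would more likely use a polynomial-time assignment algorithm. The function name and the edge weights in the example are assumptions.

```python
from functools import lru_cache


def max_weight_matching(weights: dict[tuple[int, int], float], m: int, n: int) -> float:
    """Largest total weight of a set of edges in which no two edges share a vertex.

    weights[(i, j)] is the weight of the edge between left vertex i (0..m-1)
    and right vertex j (0..n-1); absent keys mean there is no edge.
    Runs in time exponential in n, so it is only suitable for small graphs.
    """
    @lru_cache(maxsize=None)
    def best(i: int, used: int) -> float:
        if i == m:
            return 0.0
        result = best(i + 1, used)  # leave left vertex i unmatched
        for j in range(n):
            w = weights.get((i, j))
            if w is not None and not used & (1 << j):
                result = max(result, w + best(i + 1, used | (1 << j)))
        return result

    return best(0, 0)


# Hypothetical graph: the heaviest single edge (0.9) is not part of the optimum,
# because choosing it blocks both other edges.
edges = {(0, 0): 0.9, (0, 1): 0.8, (1, 0): 0.85}
print(max_weight_matching(edges, 2, 2))
```

The example shows why a greedy choice of the heaviest edge is not sufficient: the optimum uses the edges weighted 0.8 and 0.85 instead.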
  • the similarity between the documents to be detected is calculated according to the Jaccard formula or the cosine similarity formula or the distance formula or the like. Specifically:
  • $F(s_1, s_2)$ represents the similarity between the to-be-detected documents $s_1$ and $s_2$, $\delta(T_1, T_2)$ represents the maximum weighted matching value of the weighted bipartite graph of the corresponding word segment sets $T_1$, $T_2$, $|T_1|$ indicates the cardinality of the word segment set $T_1$, and $|T_2|$ indicates the cardinality of the word segment set $T_2$.
  • Identifying the document pairs whose similarity meets the requirement as approximate duplicate documents includes: identifying a document pair whose document similarity $F(s_1, s_2)$ is greater than a preset document similarity threshold as approximate duplicate documents.
  • The method further includes: filtering the document pairs in the candidate set by using a pruning policy; in S205, the similarity is then calculated only for the document pairs that remain in the candidate set after filtering.
  • The filtering includes: extracting the document pairs in the candidate set in turn and evaluating an upper limit of the similarity between the documents; if the upper limit is less than the preset similarity threshold, the corresponding document pair is pruned.
  • A weighted bipartite graph is established for each pair of word segment sets $(T_1, T_2)$ in the candidate set; for each vertex in the word segment set $T_1$ or $T_2$, the edge with the largest weight incident to it is selected, and the weights of all selected edges are accumulated.
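One way to read the upper-bound estimate is sketched below: summing, for every vertex on one side, the weight of its heaviest incident edge cannot be smaller than the maximum weighted matching value, because each matched edge uses a distinct vertex on that side. Taking the smaller of the two per-side sums and plugging it into the Jaccard form are assumptions of this sketch, as are the function names.

```python
def matching_upper_bound(weights: dict[tuple[int, int], float], m: int, n: int) -> float:
    """Upper bound on the maximum weighted matching value of a bipartite graph."""
    best_left = [0.0] * m
    best_right = [0.0] * n
    for (i, j), w in weights.items():
        best_left[i] = max(best_left[i], w)
        best_right[j] = max(best_right[j], w)
    # Each per-side sum bounds the matching value, so the smaller one does too.
    return min(sum(best_left), sum(best_right))


def can_prune(weights: dict[tuple[int, int], float], m: int, n: int, threshold: float) -> bool:
    """True if even the optimistic Jaccard-style similarity stays below threshold."""
    if m == 0 or n == 0:
        return threshold > 0  # an empty set gives similarity 0
    bound = matching_upper_bound(weights, m, n)
    return bound / (m + n - bound) < threshold


edges = {(0, 0): 0.9, (0, 1): 0.8, (1, 0): 0.85}
print(can_prune(edges, 2, 2, 0.8))  # True: the optimistic similarity is about 0.74
print(can_prune(edges, 2, 2, 0.7))  # False: the exact similarity must be computed
```

The bound costs one pass over the edges, so pairs that cannot reach the threshold are discarded without running the more expensive matching computation.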
  • Step 2: generate a signature set for each word segment.
  • The signature set of a word segment consists of substrings contained in the word segment. Different ways of dividing the substrings may be chosen as needed.
  • The q-gram substrings of each word segment are obtained and sorted, and only a prefix of the sorted substrings is retained as the signature set.
  • Take 2-gram division as an example:
  • sig(mvp) = {mv}
  • sig(tracy) = {ac, cy, ra}
  • sig(trcy) = {cy, rc, tr}
  • sig(macgrady) = {ac, cg, ad}.
  • Step 3: generate a document signature. Combine the signature sets of all word segments in the same document, and record the numbers of the word segments they come from. For example, {ad, cg, dy} is decomposed from the third word segment "mcgrady" in the original word segment set $T_2$, so when it is added to $SIG(T_2)$ its source is marked, that is, {ad_3, cg_3, dy_3}.
  • $SIG(T_1)$ = {ac_2, ad_2, cg_2, cy_1, mv_3, rc_1, tr_1}
  • $SIG(T_2)$ = {ac_2, ad_3, cg_3, cy_2, dy_3, mv_1, ra_2}.
  • the document signatures corresponding to each word segment set are sorted in lexicographic order, and deleted from the back until the deleted substring appears in " ⁇ '- a different participle.
  • the value of ⁇ ' is similar to the width of the similarity
  • An inverted index is built on the document signatures; each substring maps to the word segment sets in which it appears, which filters out low-similarity document pairs. For example: {ac: $T_1$, $T_2$; ad: $T_1$, $T_2$; cg: $T_1$, $T_2$; cy: $T_2$; dy: $T_2$}. If two word segment sets appear in the index entry of the same substring, they are paired and added to the candidate set. In this step, $(T_1, T_2)$ is identified as a high-similarity document pair and added to the candidate set.
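The candidate generation step can be sketched as follows. The signatures for `T1` and `T2` are taken from the index example above; `T3` is a hypothetical third document added to show a set that shares no substring with the others. The function name `candidate_pairs` is an assumption.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures: dict[str, set[str]]) -> set[tuple[str, str]]:
    """Pair every two documents whose signatures share at least one substring."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, signature in signatures.items():
        for gram in signature:
            index[gram].add(doc_id)          # inverted index: substring -> documents

    pairs: set[tuple[str, str]] = set()
    for docs in index.values():
        for a, b in combinations(sorted(docs), 2):
            pairs.add((a, b))                # each pair is stored once, in sorted order
    return pairs


signatures = {
    "T1": {"ac", "ad", "cg"},
    "T2": {"ac", "ad", "cg", "cy", "dy"},
    "T3": {"zz"},                            # hypothetical, shares nothing
}
print(candidate_pairs(signatures))  # {('T1', 'T2')}
```

Only documents that co-occur in some index entry are ever compared, so the number of exact similarity computations depends on signature overlap rather than on the square of the collection size.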
  • Step 5: verify the high-similarity document pairs by calculating their similarity. After all index entries have been processed, the document pairs in the candidate set are taken out in turn and the similarity between the documents is evaluated. Before the exact similarity is calculated, an upper bound on the similarity between the two documents is estimated by a certain strategy. If the upper bound is less than the threshold, the actual similarity between the two documents cannot be greater than the threshold.
  • The specific calculation method is as follows: in the corresponding weighted bipartite graph, for each vertex belonging to the same word segment set ($T_1$ or $T_2$), the incident edge with the largest weight is selected, and these weights are accumulated.
  • the approximate duplicate document detection method provided by the present invention utilizes the "first filter, post check" approximate duplicate document detection algorithm, which can improve the calculation efficiency while improving the accuracy of the approximate duplicate document detection, and is therefore suitable for approximate repeated detection of massive text data.
  • The document similarity calculation apparatus of the present invention includes: a word segmentation module 401, a first calculation module 402, a weighted bipartite graph establishing module 403, a second calculation module 404, and a third calculation module 405.
  • The word segmentation module 401 is configured to perform word segmentation on the two to-be-detected documents to obtain the respective word segment sets of the to-be-detected documents.
  • The word segmentation module 401 segments each of the two to-be-detected documents to obtain the word segment sets $T_1$, $T_2$.
  • The first calculation module 402 is configured to calculate the edit similarity of all word segment pairs in the two word segment sets obtained by the word segmentation module 401. Specifically, the first calculation module 402 builds a bipartite graph from all word segments $t_{1,i} \in T_1$, $t_{2,j} \in T_2$ in the word segment sets $T_1$ and $T_2$, where $1 \le i \le m$, $1 \le j \le n$, and $m$ and $n$ are the cardinalities of the word segment sets $T_1$ and $T_2$, respectively, and calculates the edit similarity of every word segment pair $(t_{1,i}, t_{2,j})$.
  • Here $|t|$ denotes the string length of a word segment, and $\max\{|t_{1,i}|, |t_{2,j}|\}$ takes the larger of $|t_{1,i}|$ and $|t_{2,j}|$.
  • The weighted bipartite graph establishing module 403 is configured to establish an edge between the word segment pairs whose edit similarity, as obtained by the first calculation module 402, satisfies the requirement, with the edit similarity as the weight of the edge of the corresponding word segment pair, to obtain a weighted bipartite graph.
  • The edit similarity satisfies the requirement when it is greater than or equal to a preset edit similarity threshold.
  • The weighted bipartite graph establishing module 403 establishes an edge between $t_{1,i}$ and $t_{2,j}$ and assigns the edit similarity of $(t_{1,i}, t_{2,j})$ as the weight of that edge, obtaining the weighted bipartite graph.
  • the second calculation module 404 is configured to calculate the maximum weighted matching value of the weighted bipartite graph obtained by the weighted bipartite graph creation module 403.
  • the second calculation module 404 is specifically configured to find, in that weighted bipartite graph, the set of edges that share no common vertex and whose weights have the largest sum, and to take that sum as the maximum weighted matching value of the weighted bipartite graph.
  • that is, the second calculation module 404 finds an edge set without common vertices in the weighted bipartite graph $G_{weight}$ whose sum of weights is the largest among all edge sets without common vertices; this sum is called the maximum weighted matching of $T_1$ and $T_2$.
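The source does not name an algorithm for finding the maximum weighted matching. As a small illustration of the definition only, the sketch below searches exhaustively with memoisation; it is exponential in the size of the right-hand vertex set and is meant for toy inputs. The function name and the edge dictionary format are assumptions made here.

```python
from functools import lru_cache
from typing import Dict, Tuple


def max_weighted_matching(edges: Dict[Tuple[int, int], float],
                          m: int, n: int) -> float:
    """Largest total weight of a set of edges that share no vertex.

    edges maps (i, j) to a weight, with 0 <= i < m on the left side and
    0 <= j < n on the right side.
    """
    @lru_cache(maxsize=None)
    def best(i: int, used: int) -> float:
        # used is a bitmask of right-hand vertices already matched.
        if i == m:
            return 0.0
        result = best(i + 1, used)  # leave left vertex i unmatched
        for j in range(n):
            if (i, j) in edges and not used & (1 << j):
                result = max(result,
                             edges[(i, j)] + best(i + 1, used | (1 << j)))
        return result

    return best(0, 0)
```

For larger graphs a polynomial-time assignment solver would normally be used instead, for example the Hungarian method as provided by `scipy.optimize.linear_sum_assignment` with `maximize=True` on a dense weight matrix in which absent edges have weight 0.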
  • the third calculation module 405 is configured to calculate the similarity between the documents to be detected by using the maximum weighted matching value obtained by the second calculation module 404.
  • the third calculation module 405 is specifically configured to calculate the similarity between the documents to be detected from the maximum weighted matching value according to a Jaccard formula, a cosine similarity formula, a distance formula, or the like.
  • in these formulas, the maximum weighted matching value is that of the weighted bipartite graph of the word segment sets $T_1$ and $T_2$ corresponding to the documents to be detected $s_1$ and $s_2$, $|T_1|$ denotes the cardinality of the word segment set $T_1$, and $|T_2|$ denotes the cardinality of the word segment set $T_2$.
  • the range of each similarity function is [0, 1], and the closer the similarity is to 1, the more similar the two documents are.
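The similarity formulas are not legible in this text. The sketch below assumes the usual set-similarity forms with the intersection size replaced by the maximum weighted matching value; the Dice-style variant is included only as one example of "or the like" and is not claimed to be the distance formula the source mentions. All function names are illustrative.

```python
import math


def fuzzy_jaccard(matching: float, size1: int, size2: int) -> float:
    """Assumed form: M / (|T1| + |T2| - M)."""
    if size1 == 0 or size2 == 0:
        return 0.0
    return matching / (size1 + size2 - matching)


def fuzzy_cosine(matching: float, size1: int, size2: int) -> float:
    """Assumed form: M / sqrt(|T1| * |T2|)."""
    if size1 == 0 or size2 == 0:
        return 0.0
    return matching / math.sqrt(size1 * size2)


def fuzzy_dice(matching: float, size1: int, size2: int) -> float:
    """Assumed form: 2 * M / (|T1| + |T2|)."""
    if size1 == 0 or size2 == 0:
        return 0.0
    return 2.0 * matching / (size1 + size2)
```

Because every edge weight is at most 1 and a matching uses at most $\min(|T_1|, |T_2|)$ edges, each of these values stays within [0, 1], which agrees with the stated range.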
  • in the document similarity calculation apparatus, character-based similarity is taken into account when the maximum weighted matching value is calculated, so the resulting similarity function combines the advantages of the character-based similarity function and the set-based similarity function.
  • FIG. 5 is a schematic diagram of an apparatus for detecting an approximate duplicate document according to the embodiment.
  • the apparatus for detecting approximate duplicate documents of the present invention includes: a word segmentation processing module 501, a string processing module 502, a document signature generation module 503, an index module 504, a calculation module 505, and a judging module 506.
  • the word segmentation processing module 501 is configured to perform word segmentation processing on each of the to-be-detected documents to obtain a word segmentation set of each of the to-be-detected documents.
  • the word segmentation processing module 501 uses an existing word segmentation method, for example splitting at specific non-English characters (such as punctuation marks and digits) or the forward maximum matching method, to segment each document to be detected and obtain its word segment set.
  • the word segmentation processing module 501 is further configured to number the word segments obtained by segmentation and to record the word segment numbers, where a word segment number indicates the order in which the word segment appears in the document to be detected.
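A minimal sketch of segmentation with numbering, assuming the simplest of the methods mentioned: splitting at non-English characters. The function name `segment` and the choice of numbering from 1 are illustrative assumptions.

```python
import re
from typing import List, Tuple


def segment(document: str) -> List[Tuple[int, str]]:
    """Split at runs of non-letter characters and number the word segments.

    The number attached to each word segment records the order in which
    it appears in the document.
    """
    words = [w for w in re.split(r"[^A-Za-z]+", document) if w]
    return list(enumerate(words, start=1))
```

For instance, `segment("Schwarzenegger, terminator 2!")` yields the word segments "Schwarzenegger" and "terminator" numbered 1 and 2; the digit and punctuation act only as separators.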
  • the string processing module 502 is configured to split each word segment in the word segment set obtained by the word segmentation processing module 501 into substrings shorter than the word segment, and to form the signature set of the word segment from those substrings.
  • the string processing module 502 is specifically configured to extract, for example but without limitation, all contiguous substrings of length $q$ from each word segment in the word segment set obtained by the word segmentation processing module 501.
  • the string processing module 502 may select an appropriate filtering method, according to the substring splitting strategy, to delete some of the substrings, which shrinks the substring set and reduces the storage overhead.
  • the string processing module 502 is further configured, after word segmentation has been performed on each document to be detected, to sort the contiguous substrings and retain the first N substrings, where N is a preset positive integer; the retained substrings form the signature set of the word segment.
  • for example, the string processing module 502 can sort the substrings in lexicographic order and retain only the first $q \times \tau' + 1$ substrings; the reduced substring set is called the signature of the word segment.
  • here $\tau'$ is an estimate of the preset edit distance $\tau$, determined from the preset edit similarity threshold and the string length $|t_{i,j}|$ of the word segment $t_{i,j}$.
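The substring extraction and truncation can be sketched as below. The parameter `tau_est` stands for the estimated edit distance and is passed in directly, since its defining formula is not legible in this text; the function names are illustrative.

```python
from typing import List


def qgrams(token: str, q: int) -> List[str]:
    """All contiguous substrings of length q (empty if the token is shorter)."""
    return [token[k:k + q] for k in range(len(token) - q + 1)]


def token_signature(token: str, q: int, tau_est: int) -> List[str]:
    """Sort the q-grams lexicographically and keep the first q * tau_est + 1."""
    grams = sorted(qgrams(token, q))
    return grams[: q * tau_est + 1]
```

With `q = 2` and `tau_est = 1`, the nine 2-grams of "terminator" are reduced to the first three in lexicographic order: "at", "er", "in". The intuition is that a single edit operation can affect at most `q` substrings, so keeping `q * tau_est + 1` of them leaves at least one untouched when the edit distance is within the estimate.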
  • the document signature generation module 503 is configured to combine the signature sets of all the word segments in the word segment set corresponding to the same document to be detected, and generate a document signature of the document to be detected.
  • the document signature generation module 503 specifically includes: a merging unit and a sorting unit (not shown).
  • the merging unit is used to merge the signature sets of all the word segments in the word segment set corresponding to the same document to be detected, and record the word segment numbers in which each substring is located.
  • the sorting unit is configured to sort the substrings of the word segment set and to form the document signature of the document to be detected from the substrings that satisfy the requirements.
  • after sorting the substrings of the word segment set, the sorting unit is further configured to delete substrings from back to front.
  • the word segment number corresponding to the currently deleted substring does not appear in the hash table.
  • $M$ is a preset count threshold.
  • the reduction uses an estimate of the maximum weighted matching threshold, which is obtained from the preset similarity threshold and $|T_1|$, the cardinality of the word segment set $T_1$. Because the estimate depends only on $T_1$, using it improves the efficiency of the reduction.
  • the same method is used to process the signature of $T_2$, and the reduced sets $sig_b(T_1)$ and $sig_b(T_2)$ are used as the document signatures of the documents $s_1$ and $s_2$ corresponding to the word segment sets $T_1$ and $T_2$, respectively.
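The merging and sorting performed by the document signature generation module can be sketched as follows. The back-to-front deletion is deliberately left out, because this sketch does not assume a stopping rule for it. The function name and the input format (word segment number mapped to that segment's signature) are illustrative.

```python
from typing import Dict, List, Tuple


def merge_signatures(token_sigs: Dict[int, List[str]]) -> List[Tuple[str, int]]:
    """Merge per-segment signatures into sorted (substring, segment number) pairs.

    Keeping the segment number next to each substring records which word
    segment the substring came from.
    """
    merged = [(gram, number)
              for number, grams in token_sigs.items()
              for gram in grams]
    merged.sort()
    return merged
```

For example, merging `{1: ["er", "at"], 2: ["at"]}` gives `[("at", 1), ("at", 2), ("er", 1)]`.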
  • the indexing module 504 is configured to build an inverted index over the document signatures generated by the document signature generation module 503, to pair up the documents corresponding to any two word segment sets that appear in the index entry of the same substring, and to add each such document pair to the candidate set.
  • that is, the indexing module 504 builds an inverted index over the document signatures, in which each substring maps to the word segment sets (at least one) in which it appears. If two word segment sets appear in the index entry of the same substring, they are paired and added to the candidate set.
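Candidate generation through an inverted index can be sketched as below. Document identifiers such as "s1" and the function name `candidate_pairs` are illustrative; the input maps each document to the substrings of its document signature.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def candidate_pairs(
    doc_signatures: Dict[str, Iterable[str]]
) -> Set[Tuple[str, str]]:
    """Pair every two documents whose signatures share at least one substring."""
    index = defaultdict(set)  # substring -> ids of documents containing it
    for doc_id, grams in doc_signatures.items():
        for gram in grams:
            index[gram].add(doc_id)

    candidates: Set[Tuple[str, str]] = set()
    for doc_ids in index.values():
        # Sorting makes each pair appear in one canonical order only.
        for a, b in combinations(sorted(doc_ids), 2):
            candidates.add((a, b))
    return candidates
```

Documents that share no signature substring never meet in an index entry, so they are never compared at all, which is where the saving comes from.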
  • the calculation module 505 is configured to calculate the similarity of the document pairs in the candidate set.
  • the calculation module 505 includes a first calculating unit 5051, a weighted bipartite graph establishing unit 5052, a second calculating unit 5053, and a third calculating unit 5054.
  • the first calculating unit 5051 is connected to the indexing module 504 and is configured to calculate the edit similarity of all word segment pairs $(t_{1,i}, t_{2,j})$ in the two word segment sets $T_1$, $T_2$ of each document pair in the candidate set obtained by the indexing module 504, where $t_{1,i} \in T_1$, $t_{2,j} \in T_2$, $1 \le i \le m$, $1 \le j \le n$, and $m$ and $n$ are the cardinalities of the word segment sets $T_1$ and $T_2$, respectively.
  • specifically, the first calculating unit 5051 builds a bipartite graph whose vertices are all the word segments $t_{1,i} \in T_1$ and $t_{2,j} \in T_2$, and calculates the edit similarity of every word segment pair $(t_{1,i}, t_{2,j})$.
  • here the function $\max\{|t_{1,i}|, |t_{2,j}|\}$ returns the larger of $|t_{1,i}|$ and $|t_{2,j}|$.
  • the weighted bipartite graph establishing unit 5052 is configured to establish an edge between each word segment pair, among those processed by the first calculating unit 5051, whose edit similarity meets the requirement, with the edit similarity serving as the weight of the edge for that pair, thereby obtaining a weighted bipartite graph.
  • the edit similarity meets the requirement when it is greater than or equal to a preset edit similarity threshold.
  • for each such pair, the weighted bipartite graph establishing unit 5052 establishes an edge between $t_{1,i}$ and $t_{2,j}$ and assigns the edit similarity of $(t_{1,i}, t_{2,j})$ as the weight of the edge, obtaining the weighted bipartite graph $G_{weight}$.
  • the second calculating unit 5053 is configured to calculate the maximum weighted matching value of the weighted bipartite graph obtained by the weighted bipartite graph establishing unit 5052.
  • the second calculating unit 5053 is specifically configured to find, in that weighted bipartite graph, the set of edges that share no common vertex and whose weights have the largest sum, and to take that sum as the maximum weighted matching value of the weighted bipartite graph.
  • that is, the second calculating unit 5053 finds an edge set without common vertices in the weighted bipartite graph $G_{weight}$ whose sum of weights is the largest among all edge sets without common vertices; this sum is the maximum weighted matching of $G_{weight}$.
  • the third calculating unit 5054 is configured to calculate, by using the maximum weighted matching value obtained by the second calculating unit 5053, the similarity $F_b(s_1, s_2)$ between the documents $s_1$, $s_2$ corresponding to the word segment sets $T_1$, $T_2$.
  • the third calculating unit 5054 calculates the similarity between the documents to be detected from the maximum weighted matching value according to the Jaccard formula, the cosine similarity formula, the distance formula, or the like. Specifically:
  • $F_b(s_1, s_2)$ represents the similarity between the documents to be detected $s_1$ and $s_2$, the maximum weighted matching value is that of the weighted bipartite graph of the corresponding word segment sets $T_1$ and $T_2$, $|T_1|$ denotes the cardinality of the word segment set $T_1$, and $|T_2|$ denotes the cardinality of the word segment set $T_2$.
  • the range of each similarity function is [0, 1], and the closer the similarity is to 1, the more similar the two documents are.
  • the judging module 506 is configured to identify document pairs whose similarity meets the requirement as approximate duplicate documents.
  • the judging module 506 is specifically configured to identify a document pair whose document similarity $F_b(s_1, s_2)$ is greater than the preset document similarity threshold as approximate duplicate documents.
  • the apparatus for detecting approximate duplicate documents may further include a filtering module (not shown), which is connected to the calculation module 505 and is configured to filter, using a pruning strategy, the document pairs in the candidate set obtained by the indexing module 504.
  • the calculation module 505 then calculates the similarity of the document pairs that remain in the candidate set after filtering by the filtering module.
  • the filtering module is specifically configured to take the document pairs in the candidate set obtained by the indexing module one at a time and estimate an upper bound on the similarity between the two documents; if the upper bound is smaller than the preset similarity threshold, the corresponding document pair is pruned.
  • for example, the filtering module builds a weighted bipartite graph for each pair of candidate word segment sets $(T_1, T_2)$ in the candidate set, selects for each vertex of $T_1$ or $T_2$ the incident edge with the largest weight, and uses the accumulated weight of all selected edges as the estimate of the maximum weighted matching of $(T_1, T_2)$.
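The pruning step can be sketched as follows. Summing the heaviest incident edge of each vertex on one side can never be less than the true maximum weighted matching, since a matching uses at most one edge per vertex. Taking the smaller of the two sides and using a Jaccard-style similarity form are assumptions made here for illustration, as are the function names.

```python
from typing import Dict, Tuple


def matching_upper_bound(edges: Dict[Tuple[int, int], float],
                         m: int, n: int) -> float:
    """Upper bound on the maximum weighted matching of a bipartite graph."""
    best_left = [0.0] * m    # heaviest edge at each left vertex
    best_right = [0.0] * n   # heaviest edge at each right vertex
    for (i, j), weight in edges.items():
        best_left[i] = max(best_left[i], weight)
        best_right[j] = max(best_right[j], weight)
    return min(sum(best_left), sum(best_right))


def should_prune(edges: Dict[Tuple[int, int], float],
                 m: int, n: int, threshold: float) -> bool:
    """Prune when even the optimistic similarity falls below the threshold."""
    bound = matching_upper_bound(edges, m, n)
    denominator = m + n - bound
    upper_similarity = bound / denominator if denominator > 0 else 0.0
    return upper_similarity < threshold
```

The bound costs one pass over the edges, whereas an exact matching is considerably more expensive, so pairs rejected here never reach the exact computation.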
  • the approximate duplicate document detecting apparatus uses a "filter first, verify afterwards" detection algorithm, which improves computational efficiency while also improving the accuracy of approximate duplicate document detection, and is therefore suitable for approximate duplicate detection over massive text data.
  • the document similarity calculation method and apparatus use the edit similarity between word segments to calculate document similarity, can effectively recognize approximately duplicated text that contains editing errors in word segments, and are particularly suitable for cleaning and analyzing Internet data.
  • the document similarity calculation method and apparatus provided by the invention can also be applied to text records in an Internet application system record or a database.
  • terminal is different from "terminater", and such spelling mistakes are very common in everyday life. It can be seen that both the first two queries and the last two queries relate to an actor called "Schwarzenegger" and his movie called "terminator". These strings differ little in form, so they need to be treated as the same query record. Using the similarity calculation method of the present invention, the advantages of the two functions can be combined, giving high accuracy.
  • the set-based similarity function can more accurately reflect whether two names correspond to the same entity.
  • because the data in the system is mostly entered manually, editing errors are inevitable, so the character-based similarity function is also of great value.
  • the similarity calculation method provided by the invention combines the advantages of the set-based similarity function and the character-based similarity function, and can greatly help enterprises integrate multi-source information.
  • RAM random access memory
  • ROM read-only memory
  • EPROM electrically programmable ROM
  • EEPROM electrically erasable programmable ROM
  • registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the technical field.

Abstract

The present invention relates to a method for calculating document similarity, and to a method and a device for detecting approximately duplicated documents. The calculation method comprises: performing word segmentation on each of two documents to be detected to obtain the respective word segment sets of the documents to be detected; calculating the edit similarity of all word segment pairs of the two word segment sets, the two word segments of each pair coming respectively from the two word segment sets; establishing an edge between the word segment pairs whose edit similarity meets the requirements among all the word segment pairs, the edit similarity being the weight of the edge corresponding to the word segment pair, and thereby obtaining a weighted bipartite graph; calculating the maximum weighted matching value of the weighted bipartite graph; and using the maximum weighted matching value to calculate the similarity between the documents to be detected. The document similarity calculation method and the method and device for detecting approximately duplicated documents according to the present invention have a high accuracy rate and can effectively identify approximately duplicated documents containing incorrectly edited word segments, which improves the detection accuracy for approximately duplicated documents, reduces computational complexity, and improves computational efficiency.
PCT/CN2014/080318 2013-06-26 2014-06-19 Method for calculating document similarity, and method and device for detecting approximately duplicated documents WO2014206241A1

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310260390.1 2013-06-26
CN201310260390.1A CN104252445B (zh) 2013-06-26 2013-06-26 Method and device for detecting approximately duplicated documents

Publications (1)

Publication Number Publication Date
WO2014206241A1 true WO2014206241A1 (fr) 2014-12-31

Family

ID=52141044

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/080318 WO2014206241A1 2013-06-26 2014-06-19 Method for calculating document similarity, and method and device for detecting approximately duplicated documents

Country Status (2)

Country Link
CN (1) CN104252445B (fr)
WO (1) WO2014206241A1 (fr)

Also Published As

Publication number Publication date
CN104252445B (zh) 2017-11-24
CN104252445A (zh) 2014-12-31

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14816981

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14816981

Country of ref document: EP

Kind code of ref document: A1