CN104462582A - Web data similarity detection method based on two-stage filtration of structure and content - Google Patents

Web data similarity detection method based on two-stage filtration of structure and content Download PDF

Info

Publication number
CN104462582A
CN104462582A CN201410843460.0A CN201410843460A CN104462582A CN 104462582 A CN104462582 A CN 104462582A CN 201410843460 A CN201410843460 A CN 201410843460A CN 104462582 A CN104462582 A CN 104462582A
Authority
CN
China
Prior art keywords
node
tree
nodes
vector
character string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410843460.0A
Other languages
Chinese (zh)
Other versions
CN104462582B (en
Inventor
李石君
吴岳廷
范珊珊
张健
李宇轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201410843460.0A priority Critical patent/CN104462582B/en
Publication of CN104462582A publication Critical patent/CN104462582A/en
Application granted granted Critical
Publication of CN104462582B publication Critical patent/CN104462582B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/93Document management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Web data similarity detection method based on two-stage filtration of a structure and content. On the basis of a traditional universal similarity detection method, the distribution characteristics of the structure and the content of Web data are dug out, and detected document sets are subjected to two-stage filtration; the first-stage filtration of the two-stage filtration is structure similarity filtration, wherein each Web document is modeled into a Tag tree structure so as to remove the document sets dissimilar in structure, the remaining documents are subjected to key content extraction, key content is expressed in the form of tuple vectors, and key messages are connected to generate character string sets; the second-stage filtration of the two-stage filtration is to conduct Trie tree structure modeling on the character string sets generated after the first-stage filtration, and similar character strings are connected to obtain a final result. Multiple experiments prove that by the adoption of the method, the efficiency of data similarity detection in the web field can be improved remarkably.

Description

Web data similarity detection method based on structure and content secondary filtering
Technical Field
The invention belongs to the field of similarity detection of Web data structures and contents, and particularly relates to a Web data similarity detection method based on structure and content secondary filtering.
Background
Web data is typically stored and exchanged in XML or HTML, etc., and the data is organized by various tags. Because the Web data has high transmission speed and the format of the html webpage is flexible, the probability that the webpage with wide transmission is transferred or changed is higher, and the data organization forms which belong to the same theme may be different from each other, but the similarity is higher. Therefore, the detection and the exclusion of the similarity data are very significant in the fields of data classification and clustering, data mining, focused search engines, pattern recognition, article detection and the like.
The traditional Web data similarity detection method mainly focuses on two fields of Web data structure similarity detection and content character string similarity detection. For structural similarity detection, an index tree is mainly established first based on data contained in documents. Currently proposed index tree based K-nearest neighbor search algorithms such as R-tree (see document 3), K-D tree (see document 4), SR tree (see document 5) return accurate results, but are not time efficient for high dimensional data. Document 2 proposes a method of a location sensitive hash function LSH, and the basic method is to use a location sensitive hash function family to hash an object in a high dimensional space into an adjacent bucket. When similarity search is performed, the indexing method hashes a query object into a bucket, takes the value in the hashed bucket as a candidate result set, and then sorts the candidate result set based on similarity distance measurement.
Web documents in the candidate dataset may be modeled as an ordered tagged root tree (see documents 1, 2). The elements and attributes in a document can be viewed as structural nodes or internal nodes. The tree edit distance is used as a standard for measuring the similarity between TAG trees, wherein the tree edit distance refers to the minimum node operated by converting one tree into another tree by using operations such as deleting, inserting, renaming labels and the like. Two trees are considered similar when their edit distance is within a certain threshold.
In recent studies, the scholars Panigrahy proposed an entropy-based LSH method (see document 6) that randomly entropy-processes objects near a query object in addition to the query object, and then uses the intersection of all the results as a candidate set. This method trades time for space. However, in the case of a large Web data set, the time is not practical and the practicability is not high. Although various improved LSH algorithms exist, the LSH algorithm itself has the defects that it is difficult to find a hash function that varies with the scale and distribution characteristics of given data, and a large amount of memory resources are required to be consumed, and the hash function cannot be released in time, so that the solving process is inefficient, and the final result is not likely to be an absolute approximation, and it is difficult to obtain the expected effect only by improving the LSH algorithm.
The existing algorithm generally adopts a framework of filtering and then verifying the character string similarity of the document content (see document 7), and in the filtering step, a signature is generally generated for each character string in a character string set, and an unverified approximate pair is generated by using the characteristics between the signatures seen by the character strings. Then, in the subsequent verification process, the approximate pair is verified for correctness. After the given conditions are satisfied, the final similar results are output, and the Binary branch (see reference 7), the Min-Hashing based LSH (see reference 8) and the All-Pairs-ED (see reference 9) methods are similarity detection methods in this framework. Although the algorithms can meet the requirement of partial character string similarity detection, the algorithms cannot adapt to the condition that the average character string length in a character string set is small or large, so that the generated signature is inaccurate, the similarity detection range is easy to be too wide, a large number of similar pairs are generated, the pressure of the subsequent verification process is increased, no verification method which is suitable for any condition exists, and finally the generated result is too much and inaccurate.
The existing similarity detection methods are generally improved in one aspect of algorithm efficiency and result accuracy, and do not deeply explore the application field. For example, for documents and data in a Web space, if similarity mining is performed according to a general existing similarity detection method, it is difficult to efficiently find out approximate content blocks because structural features of the Web data and distribution region characteristics of character string contents are not fully utilized.
Document 1: andrea Tagarelli, Sergio greco.
Document 2: giovanni Mella, Elena Ferrari, Elisa Bertini, Yunhua Koglin. controlled and collaborative updates of XML documents in byzantine and failure-protein distributed systems. in TISSEC, 2006.
Document 3: a dynamic index structure for spatial search. InSIGMOD, 1984.
Document 4: L.Bentley.K-D trees for semi-dynamic points sets. in SCG, 1990.
Document 5: N.Katayama and S.Satoh.the SR-tree: An index structure for high-dimensional neighbor queries.in SIGMOD, 1997.
Document 6: r. panigrahy. control based near search in high dimensions. in SODA, 2006.
Document 7: rui Yang, Panos Kalnis, and Anthony K.H.Tung.silicon evaluation tree-structured data. in SIGMOD Conference, 2005.
Document 8: r. panigrahy. control based near search in high dimensions. in SODA, 2006.
Document 9: r.j.bayardo, y.ma, and r.srikant.scaling up all pages similarity WWW, 2007.
Disclosure of Invention
Aiming at the problems of the existing algorithm, the invention provides a Web data similarity detection method based on structure and content secondary filtering.
The technical scheme adopted by the invention is as follows: a Web data similarity detection method based on structure and content secondary filtering is characterized in that: on the basis of a traditional general similarity detection method, the characteristics of Web data structure and content distribution are discovered, and two-stage filtering is carried out on a detected document set; the first stage of filtering in the two-stage filtering is structural similarity filtering, each Web document is modeled into a Tag tree structure, so that document sets which are dissimilar in structure are removed, the key contents of the rest documents are extracted and expressed into a tuple vector form, and key information is connected to generate a character string set; and the second-stage filtration in the two-stage filtration is used for carrying out Trie tree structure modeling on the character string set generated after the first-stage filtration, and connecting similar character strings to obtain a final result.
Preferably, the first stage filtering specifically comprises the following steps:
step A1: preprocessing an input Web document set to be detected, extracting a Web document structure model tag.tree, and establishing an effective node index;
step A2: constructing binary branch vectors (BinaryBranch Vector) of all tags in parallel by combining MapReduce;
step A3: establishing a multi-pointer LSH index for a binary branch vector of the tag.tree;
step A4: structure similarity Web document set S for inquiring given inquiry q1
Step A5: obtaining a set of Web documents S1Based on a content six-tuple vector set W extracted by a Tag tree, and based on the W, a character string connection algorithm is used for obtaining a similarity result set S of a query q2
Preferably, the specific implementation of step a1 includes the following sub-steps:
step A1.1: combining MapReduce, processing a Web document data set stored in a distributed document system (HDFS) in parallel, and extracting a Web document structure model tag.tree by using a Tag tree modeling mode based on a path Tag;
step A1.2: newly building a MapReduce task, wherein the input key of the Map task is a file name, the value is tag.
Preferably, the specific implementation of step a2 includes the following sub-steps:
step A2.1: newly building a MapReduce task, calculating binary node vector sets S of all tags, and then distributing the universe set S to each DataNode in the system by using a distributedPath function of Hadoop;
step A2.2: constructing a binary branch vector BBV by using a MapReduce task according to the binary node vector of the S and each tag.tree; key of map task is file name, value is binary node vector of tag.tree; the key of the reduce task is a file name, and the value is BBV corresponding to tag.tree;
step A2.3: the task writes the BBV of each tag.
Preferably, the specific implementation of step a3 includes the following sub-steps:
step A3.1: initializing L hash tables for establishing a multi-pointer LSH index, and initializing M hash functions and r hash buckets in each table; calculating indexes of BBV of all tags by using a MapReduce task;
step A3.2: assuming that a binary branch vector of a certain Web document X is M, firstly, M times of hash functions are calculated for M, and g is seti(m) is the hash value of query m in the ith hash table, i.e. g (m) ═ h1(m),…,hM(m));
Step A3.3: initializing an L-dimensional perturbation vector Δ, wherein Δ ═ Δ1,…,ΔL),gi(m)+Δ1Is to interfere with the vector delta1Is applied to gi(m) obtaining a first-order pointer to the hash bucket, the first-order pointer pointing to another hash bucket in the hash table, each perturbation vector being mapped to a unique set of hash values, and the pointer pointing to a hash bucket at most once;
step A3.4: applying the perturbation vector in delta to g in turni(m), obtaining i (i ═ 1,2, …, L) order pointers, pointing to different hash buckets, and establishing a multi-pointer index;
step A3.5: according to the pointer construction condition of all tag.tree, a hash bucket within the second order gives a group number BandID; keys of the MapReduce task are a group number BandID and a barrel number BucketID, and values are corresponding Web file names.
Preferably, the specific implementation of step a4 includes the following sub-steps:
step A4.1: extracting a binary branch vector of a given query q, quickly retrieving a top-k proximity point of the q based on a multi-pointer LSH index, and screening out a binary branch distance BDist (T) in the binary branch vectori,Tq) Satisfies the conditionsT tree set S1As a secondary index input;
step A4.2: quickly retrieving a top-k proximity point of q based on the index s;
computing M hash functions of query q, let g (q) be the hash value of query q in the ith hash table, i.e., where g (q) ═ h1(q),…,hM(q)),gi(q)+Δ1Is to interfere with the vector delta1Is applied to giAnd (q) obtaining all hash buckets within two orders away from the given point q according to the index s by the first-order pointer of the hash bucket, namely the top-k adjacent point of the query q.
Preferably, the specific implementation of step a5 includes the following sub-steps:
step A5.1: based on step A1, a Web document set S is acquired1Tree set;
step A5.2: obtaining the contents in the leaves of all effective nodes according to the effective index of the tag of the Web document X;
step A5.3: and after judging the leaf node type according to the father node label, adding a six-tuple vector of X, wherein the six-tuple vector consists of the number (id) of X, a title (title), content (content), a KeyWord (KeyWord), a link (URL) and release time (PageTime).
Preferably, the second-stage filtration is realized by the following steps:
step B1: extracting a character string set U (U) from the six-element group vector generated by the first-stage filtration1,u2,u3,…uk) And V (V)1,v2,v3,…vl) The two sets of strings have the exact same possibility; each character string is a result of serial connection of a number (id), a title (title), content (content), a KeyWord (KeyWord), a link (URL) and release time (PageTime) in each tuple vector; determining a character string editing distance threshold tau according to an actual application situation;
step B2: sorting the character strings in the character string sets U and V according to a dictionary to form an ordered set, and establishing a Trie tree H based on the ordered setu∪vThe Trie tree consists of nodes corresponding to each character in the U and the V, and the nodes of the Trie tree store node numbers, mapping relations between the nodes and the characters, character string sets to which the nodes belong and structural relations between the nodes;
for example, Trie Tree Hu∪vIn the method, each node is numbered according to the prior traversal, and each node is labeled (only belongs to U, or only belongs to V, or belongs to U U.U.V) based on the set to which the node belongs. Meanwhile, the nodes need to store the position relationships of parents, children, brothers and the like with other nodes so as to facilitate traversal and search.
Step B3: creating an accessed node stack data structure for storing accessed Trie tree nodes; the stack is initialized to be only a root node and an active node set corresponding to the root node;
for example, a length | H is created from the maximum depth of the Trie treeU∪VEach element in the "visited node stack" ST, ST of | includes the number of the node and the active node set of the node.
Step B4: establishing a dynamic Trie tree H' only with root node root, solving an active node set of the root node root, and determining a next node for solving the active node set, which is called a destination node;
step B5: creating a virtual Trie-Tree H 'for the next destination node'u∪v
Step B6: solving an active node set of a destination node through an active node set of an element positioned at the uppermost layer of a stack ST based on the theorem 1;
wherein the introduction 1: given Trie Tree HxOf a node l and its parent p, and a string edit distance threshold τ for any node l' ∈ Qτ(l) There must be one node p ' that is the ancestor of l ' and that satisfies p ' e.Qτ(p);
Step B7: updating an active node set of elements, the distance between the elements and the topmost element of the stack ST is within tau, based on the description of the symmetry of the active nodes in the Trie tree active node definition;
step B8: judging whether the node p currently processed is a certain leaf node of the Trie tree structure, if so, determining whether the node p currently processed is a certain leaf node of the Trie tree structureOutputting similar string pairings<o,p>;
Step B9: in Trie Tree H'u∪vAnd taking out the newest destination node of the child node or the brother node of the destination node belonging to the same character set, and repeating the steps B5 to B8 until the stack ST is empty.
Preferably, the specific implementation of step B4 includes the following sub-steps:
step B4.1: active node set Q for solving root node based on character string editing distanceτ(root), pushing the root node and the corresponding set of active nodes into the stack ST;
the character string editing distance specifies two character strings s(s)1s2s3…sm) And t (t)1t2t3…tn) S and t character string edit distanceExpressed as ED (s, t), expressed as the minimum operation times that the character string s is converted into t by three operation modes of replacing, inserting and deleting in a single character unit; if s is empty, it is denoted ED (s, 0);
for ED (s, t), there is a degree of incremental:
①ED(si0)=i,ED(tj,0)=j,1≤i≤m,1≤j≤n;
②ED(si,tj)≥ED(si-1,tj)+1,1<i≤m,1≤j≤n;
③ED(si,tj)≥ED(si,tj-1)+1,1≤i≤m,1<j≤n;
④ED(si,tj)=ED(si-1,tj-1),1<i≤m,1<j≤n,si=tj
⑤ED(si,tj)≥ED(si-1,tj-1)+1,1<i≤m,1<j≤n,si≠tj
step B4.2: and extracting a node set N only belonging to the character string set U from the active node set of the root node, and selecting the node N' closest to the root node.
Preferably, the specific implementation of step B5 includes the following sub-steps:
step B5.1: in Trie tree H based on lemma A and lemma Bu∪vScreening out a node area irrelevant to the active node set of the solving node n';
the theory A: given a string s and t, and a string edit distance threshold τ, if | | s | - | t | > τ, then there are: ED (s, t) > τ, where | s | and | t | represent the number of characters contained in the character strings s and t, respectively;
the theory B is as follows: given Trie tree structure HxAnd two nodes u and v therein, and a string edit distance threshold τ, ifOne of the following conditions exists:
(1) v is the ancestor node of u, and subtrees with v and u as root nodes respectively contain the same leaf node;
(2) if only one string exists and v and u are simultaneously used as prefix strings, Q is calculatedτ(u) all subsequent nodes having v as parent or ancestor are not considered;
step B5.2: gray marking is carried out on the node area to form a tree H based on the original Trieu∪vIs dedicated to a virtual Trie-Tree H 'of n'u∪v
Preferably, the specific implementation of step B6 includes the following sub-steps:
step B6.1: if the destination node belongs to the two sets, the destination node is set to belong to one set and is in a virtual Trie tree H'u∪vExcluding nodes belonging only to the set; if the destination node only belongs to a certain set, the same is carried out in the virtual Trie tree H'u∪vExcluding nodes belonging only to the set;
step B6.2: h'u∪vAccording to an active node set of a parent node at the uppermost layer on a stack ST, an active node set of a target node is obtained;
step B6.3: and pressing the destination node into the stack ST, wherein the uppermost element is the number of the destination node and the corresponding active node set.
The method has the advantage that the structural characteristics and distribution characteristics of the Internet semi-structured data are fully utilized in the similarity detection process. On one hand, the structure modeling and the content processing are respectively carried out on the input Web document set based on the extracted similarity characteristics, and the Web data is detected from the two aspects of structure construction and content distribution through two-stage filtering, so that the precision of Web data similarity detection is improved; on the other hand, some impurity data which cannot belong to the similarity set can be removed in advance through the first-level structure similarity detection filtering, and the number of elements of a subsequent similarity matching candidate set is reduced. Meanwhile, a multi-pointer LSH index and a Tree modeling mechanism are respectively adopted in the similarity process, so that the efficiency and the practicability of similarity detection are improved. Multiple experiments prove that the method provided by the invention can obviously improve the data similarity detection effect in the Web field.
Drawings
FIG. 1: the invention is a framework diagram of the parallel processing flow of the Web document similarity detection.
FIG. 2: the invention discloses a Tag tree modeling diagram based on path tags, wherein part (a) is a Web document, and part (b) shows a Tag which is converted from the Web document.
FIG. 3: is a diagram illustrating an example of the conversion between the Tag tree and the binary tree according to the embodiment of the present invention.
FIG. 4: is an example of a binary node vector of an embodiment of the present invention.
FIG. 5: is a flow framework diagram of the multi-pointer LSH algorithm of the embodiment of the invention.
FIG. 6: the distribution diagram of the distance of the closest barrel to top-K in the embodiment of the invention is shown, wherein part (a) is a statistical diagram of single different hash values, and part (b) is a statistical diagram of the number indicated by different order disturbance vectors.
FIG. 7: the present invention is a Trie structure established based on an example string set.
FIG. 8: is an edit distance matrix diagram of two example character strings of an embodiment of the present invention.
FIG. 9: the Trie tree structure established based on the two character string sets is disclosed by the embodiment of the invention.
FIG. 10: the method is characterized by comprising the following steps of cutting a node set area irrelevant to solving of an active node set.
FIG. 11: the invention discloses a schematic diagram for showing the accessed sequence of Trie tree nodes.
FIG. 12: the method and the device for performing string similarity connection between different string sets in the embodiment of the invention are shown in the drawing.
FIG. 13: the method based on different parameters in the embodiment of the invention is a comparison graph of the operation effect of the traditional method.
FIG. 14: the embodiment of the invention utilizes the method in different data sets to compare the operation effect with the traditional method, wherein part (a) is the method operation effect graph provided by the invention, and part (b) is the traditional method operation effect graph.
FIG. 15: the accuracy comparison graph of the method based on different parameters and the traditional method is shown in the embodiment of the invention.
FIG. 16: the invention is a flow chart of a similarity detection method based on structure and content secondary filtering.
Detailed Description
In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail with reference to the accompanying drawings and examples, it is to be understood that the embodiments described herein are merely illustrative and explanatory of the present invention and are not restrictive thereof.
Referring to fig. 1, the technical scheme adopted by the invention is as follows: a Web data similarity detection method based on structure and content two-stage filtering is characterized in that on the basis of a traditional general similarity detection method, the characteristics of Web data structure and content distribution are discovered, and two-stage filtering is carried out on a detected document set; the invention considers that the first structures of the documents containing similar Web data should be similar, and if the structures of the two documents are greatly different, the significance of further solving the similarity of the contents is not great. Therefore, the first-stage filtering in the two-stage filtering is structural similarity filtering, each Web document is modeled into a Tag tree structure, so that document sets which are dissimilar in structure are removed, key contents of the rest documents are extracted and expressed into a tuple vector form, and key information is connected to generate a character string set; and the second-stage filtration in the two-stage filtration is used for carrying out Trie tree structure modeling on the character string set generated after the first-stage filtration, and connecting similar character strings to obtain a final result.
The Tag tree is an important data structure invented herein for tree modeling of the structure of a Web document, and decomposes a Web document according to the path of elements or attributes in the Web document.
Under a large-scale Web document data set, a non-text area in the tag.tree is converted into a binary tree, so that a large amount of storage space is wasted, and the accuracy of judging the structural similarity of the document is influenced. In order to solve the problem, a method for establishing an effective node index for the tag.tree is adopted, only an index area of the tag.tree is converted, and the conversion efficiency of the binary tree is improved.
Definition 1(Web document tag node tree): a tree is an ordered tagged root tree of a Web document X, with each node of the tree being associated with an element or attribute of X. The id of each node is related to its order in the X document, and the id is edited in a sequence number hierarchical manner. For each node, there is a sequence number path (tag)0,tag1,…,tagn)。
Referring to fig. 2, part (a) of fig. 2 is a Web document, and the tags and the content between the tags are abstracted into simple letters to emphasize the distribution of the tag structure. (b) Partially showing the tag of the Web document, through a path hierarchical mode, the parent node sequence of a certain id can be found quickly. Tree can be seen to have the following properties:
leaf nodes have no sibling nodes.
The leaf nodes are corresponding elements in the label, and are marked with NULL when no element exists.
Definition 2(tag. tree valid node index): and traversing the tag, and if the element in the leaf node accessed under a certain path is not empty and is not noise, all the nodes on the path are effective nodes. All valid nodes constitute the valid node index of the tag.
Please see table 1, all valid nodes in fig. 2 are marked, and the marking process is shown in table 1.
TABLE 1 example of creating valid node index for Tag tree
Key Traversed node id Active node
(abc,w) 1,1.1,1.1.1 a,b,c
(abde,null) 1,1.1,1.1.2,1.1.2.1
(abdf,null) 1,1.1,1.1.2,1.1.2.2
(agh,null) 1,1.2,1.2.1
(agik,null) 1,1.2,1.2.2,1.2.2.1 a,g,i,k
(agil,null) 1,1.2,1.2.2,1.2.2.2 a,g,i,l
(agim,null) 1,1.2,1.2.2,1.2.2.3 a,g,i,m
(agil,null) 1,1.2,1.2.2,1.2.2.4 a,g,i,n
(agj,null) 1,1.2,1.2.3
Definition 3 (path edit distance): let T be tag, pre (T) and post (T) of document X, which respectively represent the first-order traversal operation and the last-order traversal operation of tree T. The traversal paths of pre (T) and post (T) are represented as strings s1, s2, then ed (s1, s2) is the path edit distance between s1, s 2. The tree edit distance TD (T1, T2) has the following relationship with the path edit distance ed (s1, s 2):
ed(pre(T1),pre(T2))≤TD(T1,T2)
ed(post(T1),post(T2))≤TD(T1,T2)
a tree and its binary tree may be transformed into each other, with any node in the tree, in the process of transforming into a binary tree, its left child becoming the left node of the binary tree and its right sibling becoming the right node of the left child.
Please see fig. 3, let T12、T21Conversion to binary tree B (T)12)、B(T21) In the invention, B (T)12)、B(T21) And converting into a full binary tree, and replacing the full binary tree with a star when the nodes are insufficient. As can be seen from the figure, T12Root node u1Left child u2Is B (T)12) Left child of u2The right brother of (A) is B (T)12) Left child u2The right node of (c).
Definition 4 (Binary Branch): let B be a binary tree and let B be a binary tree,there is one binary branch, node u, and its two child nodes.
Referring to FIG. 3, a binary tree B (T) is shown12)、B(T21) All binary branches of (1). As can be seen from the figure, for B (T)12)、B(T21) There is a binary branch for each node in, e.g., B (T)12) Parent node with binary branch as u1And child points u2,*。
Definition 5 (Binary Node Vector): let B be a binary tree, representing all binary branches of B as a vector V ═ V (V)1,v2,…,vn),viIs the ith branch of the tree T, let viLabel is a, its child node label is a, then vi ═ a, a.
Referring to FIG. 4, a binary tree B (T) is shown12)、B(T21) And a binary node vector corresponding to all the binary nodes. For example B (T)12) Parent node with binary branch as u1And child points u2And the corresponding binary node vector is (a, a).
Definition 6 (Binary Branch Vector): let the binary branch vector BBV (T) of tree T be (b)1,b2,…,b|B|) Wherein b isiRefers to the sequence number of the ith node in the binary tree B of the tree T, and | B | represents the size of the binary tree B.
Please refer to table 2, which lists the construction method of the binary branch vector in fig. 5, where a binary node vector exists in T in the binary tree and is marked as 1, and when the binary node vector does not exist, is marked as 0.
Table 2 constructs binary branch vectors
BBV(T1) BBV(T2)
(a,a,*) 1 1
(a,*,e) 1 0
(a,f,b) 0 1
(b,*,c) 1 0
(b,*,X) 0 1
(b,*,y) 0 1
(c,*,g) 1 0
(e,*,b) 1 0
(f,*,b) 0 1
(g,*,*) 1 0
(x,*,*) 0 1
(y,*,*) 0 1
Definition 7 (Binary Branch Distance): let BBV (T1) ═ B1, B2, …, B |, BBV (T2) ═ B1 ', B2 ', …, B | ') be binary of tree T1, T2Branch vectors whose binary branch distance isFor any two trees T1, T2 satisfies <math> <mrow> <mfrac> <mrow> <mi>BDist</mi> <mrow> <mo>(</mo> <mi>T</mi> <mn>1</mn> <mo>,</mo> <mi>T</mi> <mn>2</mn> <mo>)</mo> </mrow> </mrow> <mn>5</mn> </mfrac> <mo>&le;</mo> <mi>TD</mi> <mrow> <mo>(</mo> <mi>T</mi> <mn>1</mn> <mo>,</mo> <mi>T</mi> <mn>2</mn> <mo>)</mo> </mrow> <mo>.</mo> </mrow> </math>
Define 8 (Hash interference Vectors): let m dimension vector delta ═ great face1,…,m},gi(q) is the hash value of query q in the ith hash table. Applying the interference vector delta to the hash bucket to obtain a hash bucket pointer gi(q) + Δ, which points to a set of hash values. By using multiple interference vectors, the hash bucket closest to the query object hash value can be quickly located.
For a given query q, the hash bucket is retrieved using the basic LSH algorithm to get the hash value g (q) ═ h1(q),…,hM(q)). The interference vector Δ is applied to g (q) resulting in a set of hash buckets with pointers g (q) + Δ. Wherein the LSH function is of the formIf W is large enough, there is a high probability that similar objects have the same or adjacent values, and therefore the interference vector Δ may be used to point to those buckets that contain adjacent values, where W is large enoughi∈{-1,0,1}。
Applying the perturbation vector directly to the hash value of the query object avoids point perturbations and the overhead of computing the hash value using entropy-based LSH methods. A set of perturbation vector sequences is designed, where each perturbation vector maps to a unique set of hash values and does not point to a hash bucket more than once.
Please see fig. 5, set gi(q) is the hash value of query q in the ith hash table, (Δ)1,…,ΔL) Is a sequence of pointers, gi(q)+Δ1Is to interfere with the vector delta1Is applied to gi(q) the new hash value obtained, which points to another hash bucket in the hash table. According to the principle, the n-order perturbation vector points to all first-order buckets first, then points to second-order buckets and so on. Based on L hash tables and LSH index of M hash functions in each table, the total number of n-order buckets is <math> <mrow> <mi>L</mi> <mo>&times;</mo> <mfenced open='(' close=')'> <mtable> <mtr> <mtd> <mi>M</mi> </mtd> </mtr> <mtr> <mtd> <mi>n</mi> </mtd> </mtr> </mtable> </mfenced> <mo>&times;</mo> <msup> <mn>2</mn> <mi>n</mi> </msup> <mo>,</mo> </mrow> </math> The total number of buckets in s order is <math> <mrow> <mi>L</mi> <mo>&times;</mo> <msubsup> <mi>&Sigma;</mi> <mrow> <mi>n</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>s</mi> </msubsup> <mfenced open='(' close=')'> <mtable> <mtr> <mtd> <mi>M</mi> </mtd> </mtr> <mtr> <mtd> <mi>n</mi> </mtd> </mtr> </mtable> </mfenced> <mo>&times;</mo> <msup> <mn>2</mn> <mi>n</mi> </msup> <mo>.</mo> </mrow> </math> By using multiple interference vectors, the present invention locates more hash buckets whose hash values approximate the hash value of the query object.
The core idea of the multi-pointer LSH algorithm is to use a sequence of pointers to examine a large number of buckets that may contain points nearest to the query object q. Based on the nature of the location sensitive hash function, the present invention understands that if an object is close to query point q, but is not hashed to q in the same bucket, but in a bucket adjacent to q, i.e. the hash values of the two buckets are only slightly different. The goal of the present invention is to locate these adjacent buckets, thus increasing the chance of finding the adjacent points of q.
Definition 9(top-K nearest neighbor distribution): let n-th order hash interference vector delta not be zero, delta is used to detect the bucket closest to the given query hash value. Based on the attributes of the location sensitive hash function, the first order probe pointer points to a bucket that is closer to the query point than the second order probe pointer.
FIG. 6 shows the distribution of the nearest bucket distances to top-K, where (a) is a histogram of different hash values, (b) is a histogram of the numbers of perturbation vectors of different orders, and (a) is a curve showing a single hash valueiIn contrast, part of the curve (b) shows that the number of hash values is different from the hash value of the query point. It can be seen from the graph that the hash values of all K nearest neighbors in the set are either the same (c:)i0) or differenti+1 ori-1. Similarly, most K nearest neighbors are hashed into hash buckets that are within second order of the hash bucket from a given point.
Please refer to fig. 16, which is an overall flow chart of the two-stage filtering based on structure and content proposed by the present invention, wherein the left half illustrates the algorithm process of the first-stage structure filtering. The method specifically comprises the following steps:
step 1: decomposing the Web document according to the paths of the elements and the attributes in the Web document, extracting a structure model tag of the Web document, and establishing an effective index.
Substep 1: referring to fig. 2, a tree is constructed according to the order of nodes in a Web document by using a sequence number hierarchical mode.
Substep 2: calculating effective nodes of the tag.
Step 2: and acquiring a binary branch vector of the tag.
Substep 1: see fig. 3, which converts the index region structure tree of the tag.
Substep 2: referring to fig. 4, all binary branches of the binary tree B are obtained, and all binary branches of the binary tree B are converted into binary node vectors.
Substep 3: and acquiring a binary branch vector of the binary tree B according to the binary node vector set S of all the Web documents.
And step 3: a multi-pointer LSH index is established for the binary branch vector of the tag.
Substep 1: according to the step 1, the binary branch vector m of the Web document X is obtained in the step 2.
Substep 2: initializing L hash tables, initializing M hash functions and r hash buckets in each table, calculating M hash functions for binary branch vector M of X, and setting gi(m) is the hash value of query m in the ith hash table, i.e. g (m) ═ h1(m),…,hM(m))。
Substep 3: initializing an n-dimensional perturbation vector Δ, wherein Δ ═ Δ1,…,ΔL)。
Substep 4: applying the perturbation vector in delta to g in turni(m) to obtain an i (i ═ 1,2, …, n) order index, e.g. gi(m)+Δ1Is to interfere with the vector delta1Is applied to gi(m) obtaining a first order pointer to the hash bucket that points to another set of hash buckets in the hash table, each perturbation vector mapping to a unique set of hash values, and the pointer pointing to a hash bucket at most once.
Substep 5: for each document X in the Web document set, a multi-pointer LSH index is built according to substeps 1,2, 3, 4.
And 4, step 4: structure similarity Web document set S for inquiring given inquiry q1
Substep 1: and extracting a binary branch vector of the given query q, and quickly retrieving a top-k proximity point of the q based on a multi-pointer LSH index retrieval algorithm.
Referring to fig. 5, the detailed process of quickly retrieving the top-k proximity point of q based on the index s includes the following steps:
calculating M times of hash function of query q, and setting g (q) as the hash value of query q in the ith hash table, i.e. where g (q) ═ h1(q),…,hm(q))。
② initializing an L-dimensional interference vector Delta, wherein Delta is (Delta)1,…,ΔL) And Δi={1,…,mWill interfere with the vector Δ1Is applied to gi(q) the first order pointer g of the hash bucket obtainedi(q)+Δ1And by analogy, acquiring each-order bucket pointer of q.
And thirdly, obtaining all vectors in the hash bucket within two orders of q from the given point according to the LSH multi-pointer index, wherein the vectors are top-k adjacent points of the query q.
Substep 2: screening out binary branch distance BDist (T) from top-k proximity point of qi,Tq) Satisfies the conditionsT tree set S1And is input as a secondary index.
A second level of content string similarity filtering follows, and before elaborating, a number of key definitions that play a key role in the algorithmic specification are first given.
Definition 10 (string edit distance): specifying two character strings s(s)1s2s3…sm) And t (t)1t2t3…tn) The string edit distance of s and t is denoted as ED (s, t). Is expressed as that the character string s is converted by three operation modes of replacement, insertion and deletion in a single character unitThe minimum number of operations of t.
For example, the string edit distance ED (tobe, buffer) between the string "tobe" and the string "buffer" is 4, that is, the string "tobe" can be changed into the string "buffer" by three character replacing operations and one character deleting operation, which are t- > u, o- > f, b- > f, respectively, and the character f is deleted.
If s is empty, it is denoted ED (s, 0). For ED (s, t), it is easy to conclude that there is a certain degree of incremental of ED (s, t):
①ED(si0)=i,ED(tj,0)=j,1≤i≤m,1≤j≤n),
②ED(si,tj)≥ED(si-1,tj)+1,1<i≤m,1≤j≤n
③ED(si,tj)≥ED(si,tj-1)+1,1≤i≤m,1<j≤n
④ED(si,tj)=ED(si-1,tj-1),1<i≤m,1<j≤n,si=tj
⑤ED(si,tj)≥ED(si-1,tj-1)+1,1<i≤m,1<j≤n,si≠tj
definition 11 (string set similarity connection is similar to string): given two character string sets S and T and a character string editing distance threshold tau, the invention considers that the criterion for judging the similarity of the character S and the character string T is that the character string editing distance of the S and the T is not more than the set threshold tau, namely, the proposition is determinedED(s,t)<A conclusion s ≈ t can be derived, "≈" indicates string similarity. The similarity connection of the character string sets S and T can be understood as that one character string is taken from each of the two sets to form a matching pair of similar character strings. Noting that the result of the connection isIs shown as
For example, the character strings included in the two character string sets S and T are: s { "interlace", "ideology", "align", "expand", "primary", "association", T {, "immigrant", "interactive", "interaction", "thermal", "deletion", "primary", "extension", "compensation", and if a string edit distance threshold τ is specified to 1, the S and T similarity connection results in the case where the string edit distance threshold τ is specified to 1 If τ is 2, the result of the affinity join is
In the present invention, if the string edit distance threshold τ is not specifically described, τ is set to 1 as a default.
Definition 12 (string prefix): let s be s1s2s3…snThen, thens1s2s3...sm(1≤m<n) is the character s in siThe previous one string prefix s'. If the string s is considered as a set of characters in the original order, s' is a subset of s. s' must maintain s original positionThe order of the characters and one of the characters cannot be arbitrarily missing. The prefix string preceding the character c (c e s) in the string s is denoted s<c
For example, "pre", "prereq", and "prerequis" are all prefixes of s for a string s ═ prerequisite ".
Definition 13(Trie structure): the Trie structure is a tree-like data structure proposed by the present invention for second-level content string filtering. Wherein, the root of the tree is at the top layer, and the middle node and the leaf node are below and are in a tree structure from top to bottom. The specific meanings of each moiety in the Tie structure are shown below
The root node is root and represents an empty character string phi.
The intermediate node represents a prefix character string of a character string and at least comprises one character.
And the node only has one father node (the root node has no father node), and 0 or more child nodes.
And the path from the root to the node represents a substring or a complete character string formed by the characters from top to bottom. The path corresponds to its last node.
The leaf node represents a path from the root node root to the leaf node and corresponds to a complete character string.
Sixthly, all child nodes of a certain node in the tree have the same prefix character string.
Referring to fig. 7, the Trie structure of several strings is shown, which are "barren", "beam", "bean", "bear", "exend", "instance" and "thorough", respectively. The number marked next to each node represents its number ID, where the ID of the root node root is 0. The number ID of the middle node represents the character substring corresponding to the node, and the number ID of the leaf node represents the complete character string corresponding to the node. The intermediate node numbered 15 in FIG. 7 represents the substring "exde"And the leaf node numbered 6 is denoted "barren". Conversely, a character sub-string or complete character string can uniquely identify its ID, for example, the sub-string "bea" of "bean" corresponds to node a with ID 8, and the character string "instant" uniquely identifies node t with ID 21. I.e. the nodes represented by the IDs correspond to character sub-strings or complete character strings. Specifying a set of strings S, Trie structure denoted as Hx(x ∈ S), and | H'xI represents the depth of a node x in the Trie structure, | HxL represents the maximum depth of the Trie structure, and defines a substring or a complete character string represented by a path from a root node root to an intermediate node or a leaf node behind a node x as hx<
Wherein S ═ { S ═ S1,s2,s3,…skThe expression string set S is composed of k character strings, the length of each character string, namely the number of contained characters, is expressed as | SiI (1 ≦ i ≦ k), S ≦ represents the number of character strings included in the character string set S, and S ≦ k, forAll can be in Trie structure HxCorresponds to a node hxThe Trie structure is formed by the nodes according to a tree structure and has Hx={hxI x belongs to S, letIs the maximum length of each character string in S, having <math> <mrow> <mo>&dtri;</mo> <mo>|</mo> <mi>S</mi> <mo>|</mo> <mo>=</mo> <mi>max</mi> <mo>|</mo> <msub> <mi>s</mi> <mi>i</mi> </msub> <mo>|</mo> <mo>,</mo> <mn>1</mn> <mo>&le;</mo> <mi>i</mi> <mo>&le;</mo> <mi>k</mi> <mo>,</mo> <msub> <mi>s</mi> <mi>i</mi> </msub> <mo>&Element;</mo> <mi>S</mi> <mo>)</mo> <mo>.</mo> <mo>|</mo> <msub> <mi>H</mi> <mi>x</mi> </msub> <mo>|</mo> <mo>=</mo> <mo>&dtri;</mo> <mo>|</mo> <mi>S</mi> <mo>|</mo> <mo>,</mo> </mrow> </math> At this time <math> <mrow> <mi>x</mi> <mo>&Element;</mo> <mo>{</mo> <msub> <mi>s</mi> <mi>j</mi> </msub> <mo>|</mo> <mo>&dtri;</mo> <mo>|</mo> <mi>S</mi> <mo>|</mo> <mo>=</mo> <mo>|</mo> <msub> <mi>s</mi> <mi>j</mi> </msub> <mo>|</mo> <mo>,</mo> <mn>1</mn> <mo>&le;</mo> <mi>j</mi> <mo>&le;</mo> <mi>k</mi> <mo>}</mo> <mo>;</mo> </mrow> </math>
Let | x | denote as a character string in which a character corresponding to the x node is locatedThe length of a character string consisting of all characters from the first character to the character x according to the original sequence, or the path length from the root node to the x. With | x | ═ JxIf x is the last character in J, then | x | ═ J |.
For example, in the Trie shown in fig. 7, the longest path is 0(root) → 22(t) → 23(H) → 24(o) → 25(r) → 26(o) → 27(u) → 28(g) → 29(H), and the length of the path is 8, so | H → 22(t) → 23(H), and therefore | H → 27(u) → 28(g) → 29(H)xAnd 8. If x represents a node numbered 9, | H'9(m)|=|beam|=4。
In order to distinguish the nodes of the Trie tree structure from the represented character strings, the representation of the nodes is defined as follows: if the node number ID is known, if the node is represented by the character c, representing a character string w composed from the root node to the character c, it is represented as ID (c). In other cases where confusion is not likely to arise, node c is used to represent it.
Definition 14 (edit distance matrix): given two strings s and t, the value s ═ s, respectively1s2s3…sm,t=t1t2t3…tnA data structure is constructed to store values of string edit distances between all the substrings of s and all the substrings of t, i.e., an edit distance matrix Mm+1,n+1,For any character string r ═ r1r2r3…rnIts substring rsub={φ∪r1r2...rlL is more than or equal to 1 and less than or equal to n, and the total number is 1+ n.
Edit distance matrix Mm+1,n+1The dimensions of (a) respectively correspond to the total number of sub-strings of the character strings s and t. The total number of substrings of s is 1+ M, the total number of substrings of t is 1+ n, Mm+1,n+1Is (1+ m) (1+ n), the total number of edit distances calculated for each substring of s and for each substring of t, and each element of the matrix is ED(s)sub,tsub). FIG. 8 is a matrix of edit distances between the character string "immatural" and the character string "latitude".
The calculation of each element of the edit distance matrix can be derived from the associated formula defining the edit distance in 14, see formula:
<math> <mrow> <msub> <mo>&dtri;</mo> <mrow> <mi>i</mi> <mo>,</mo> <mi>j</mi> </mrow> </msub> <mo>=</mo> <mi>min</mi> <mrow> <mo>(</mo> <mi>M</mi> <mo>[</mo> <mi>i</mi> <mo>,</mo> <mi>j</mi> <mo>-</mo> <mn>1</mn> <mo>]</mo> <mo>+</mo> <mn>1</mn> <mo>,</mo> <mi>M</mi> <mo>[</mo> <mi>i</mi> <mo>-</mo> <mn>1</mn> <mo>,</mo> <mi>j</mi> <mo>]</mo> <mo>+</mo> <mn>1</mn> <mo>,</mo> <mi>M</mi> <mo>[</mo> <mi>i</mi> <mo>-</mo> <mn>1</mn> <mo>,</mo> <mi>j</mi> <mo>-</mo> <mn>1</mn> <mo>]</mo> <mo>+</mo> <mi>&xi;</mi> <mo>)</mo> </mrow> </mrow> </math>
where ξ is denoted as siCharacter string composed of single character and tjThe string edit distance between strings composed of individual characters is naturally:
<math> <mrow> <mi>&xi;</mi> <mo>=</mo> <mfenced open='{' close=''> <mtable> <mtr> <mtd> <mn>0</mn> </mtd> <mtd> <msub> <mi>s</mi> <mi>i</mi> </msub> <mo>=</mo> <msub> <mi>t</mi> <mi>j</mi> </msub> </mtd> </mtr> <mtr> <mtd> <mn>1</mn> </mtd> <mtd> <mi>otherwise</mi> </mtd> </mtr> </mtable> </mfenced> <mo>,</mo> </mrow> </math>
<math> <mrow> <mi>M</mi> <mo>[</mo> <mi>i</mi> <mo>,</mo> <mi>j</mi> <mo>]</mo> <mo>=</mo> <mfenced open='{' close=''> <mtable> <mtr> <mtd> <mi>j</mi> </mtd> <mtd> <mi>i</mi> <mo>=</mo> <mn>0,0</mn> <mo>&le;</mo> <mi>j</mi> <mo>&le;</mo> <mi>n</mi> <mo>,</mo> </mtd> </mtr> <mtr> <mtd> <mi>i</mi> </mtd> <mtd> <mi>j</mi> <mo>=</mo> <mn>0,0</mn> <mo>&le;</mo> <mi>i</mi> <mo>&le;</mo> <mi>m</mi> <mo>,</mo> </mtd> </mtr> <mtr> <mtd> <mo>&dtri;</mo> <mi>i</mi> <mo>,</mo> <mi>j</mi> </mtd> <mtd> <mn>0</mn> <mo>&lt;</mo> <mi>i</mi> <mo>&le;</mo> <mi>m</mi> <mo>,</mo> <mn>0</mn> <mo>&lt;</mo> <mi>j</mi> <mo>&le;</mo> <mi>n</mi> </mtd> </mtr> </mtable> </mfenced> </mrow> </math>
for example, in FIG. 8, M [0, k ]]=M[k,0]=k,(0≤k≤8),M[2,3]=3,M[1,3]=3,M[2,2]=2,M[1,2]2 in the illustrated edit distance matrix, s2=m,t3T due to s2≠t3According to the computational formula of xi, xi is obtained as 1, and M [2,3 ] is satisfied]=min(M[1,3]+1,M[2,2]+1,M[1,2]+1). For si=tjIn the case of (1), M6, 5 is satisfied]=min(M[5,5]+1,M[6,4]+1,M[5,4]) And ξ is 0.
Definition 15 (matrix approximation and Trie tree active nodes): according to the definitions 13 and 14, the matrix approximation term is defined as M [ k, r ] in the present invention, and M [ k, r ] is equal to or less than τ, where k is equal to or less than 0 and is equal to or less than M +1, and r is equal to or less than 0 and is equal to or less than n +1, as can be seen from the definition 14 of the string edit distance, due to the incremental character edit distance, M [ k '> k, r ] ≧ τ, and M [ k, r' > r ] > τ exist, and the active node set of all nodes of the Trie tree of FIG. 11 is shown in Table 3.
Table 3 active node set for all nodes of the Trie of fig. 11
The shaded portion shown in fig. 8 indicates the matrix approximation term when τ is 1, and the solid circled portion indicates the matrix approximation term when τ is 3. It is readily seen that the number of matrix approximation entries grows with the increase in τ. That is, the wider the range of the determination approximation, the easier the value of the approximation result is taken.
Since the character strings s and t can be expressed asThe edit distance matrix, in turn, can be represented as a Trie structure, thus giving a set of stringsTrie tree H corresponding to character string set Sx(x belongs to S) and a character string t, wherein the Trie tree path corresponding to the character string S is hυ(. nu. e.s) forIf ED (upsilon, t) is less than or equal to tau, a node corresponding to upsilon is called as a Trie tree active node of the character string t. All active nodes of Trie tree of character string t form active node set of Trie tree, and is marked as Qτ(t), it can be concluded that if ED (t, e) ≦ τ, then e ∈ Qτ(t) of (d). The direction of the arrows marked in fig. 11 shows the sequence of solving the active node sets of the Trie in the general case, i.e., preferentially solving the active node sets of the parent nodes and then solving the active node sets of the child nodes or the sibling nodes. Wherein, the set marked by the root node is an active node set with root being tau-1, Q1(root) {0,1,9,13 }. Active node sets of nodes with the Trie tree number range of 0-17 under two conditions of tau being 1 and tau being 2 are shown in a table 3.
Given a node n and a string s, t, defining s as follows:
if it is notThen h isn<S is not always true. Wherein s is<cDenoted as the prefix string (by definition 13) preceding the c character in the string s. h isn<Expressed as a substring or string represented by a node in the Trie in which the string t is located after the node n (see definition 13), this conclusion indicates that if the node n is not an active node of the Trie of any prefix string of the string s, the substring or string represented by the node in the Trie after the node n is unlikely to be similar to s.
The active nodes of the Trie tree have symmetry, i.e. arbitrary to the Trie tree structureTwo nodes u and v, if u ∈ Qτ(v) Then there must be v ∈ Qτ(u). For example, active node set Q of node 4(b) in FIG. 111(4(b)) {3,4,5}, and an active node set Q for node 5(i)1(5(i)) {4,5}, and an active node set Q of node 3(m)1(3(m)) {2,3,4,8,11 }. Because 3 ∈ Q1(4(b)),5∈Q1(4(b)), so there is 4 ∈ Q1(3(m)),4∈Q1(5(i))。
Definition 16 (node belongs to string set): the definition that the node n belongs to the character string set is described as follows:
given a set of strings S (S)1,s2,s3,..sm) Trie Tree Hx(x belongs to S and delta) and one node n in the (x belongs to S and delta) represents other character string sets corresponding to the node set in the Trie tree, if one character string S exists, S belongs to S, and the substring S (n) corresponding to the node n is a prefix character string of the character string S. Then n e S can be derived.
The active node set plays a very crucial role in solving similar string pairs. To find the active node sets of all nodes of the Trie, the active node sets of root nodes root of the Trie may be first computed, and then the active node sets of the children nodes of the root nodes may be computed. If traversing all nodes to find the corresponding active node set is very resource-consuming, some rules should be explored to improve the efficiency of solving the active node sets of all nodes in the Trie. Therefore, in the invention, several directions for improving the solving efficiency are given firstly, then corresponding reasoning is proposed, and finally, examples are given and the reasoning is actually applied to the calculation process of the active node set.
Several ways to improve the computational efficiency are as follows:
the structural characteristics of the Trie tree are fully utilized. The Trie tree is to arrange all characters from top to bottom according to the original sequence of all characters of the character string, and the path from the root node root to each leaf node is unique and represents a complete character string. A group of strings with the same prefix shares nodes in a Trie. The active node set of the intermediate nodes or leaf nodes in the tree may be computed using the active node set of the parent node. In this way, the number of computations for certain common substrings is reduced by using a common prefix.
From the definition 15 of the matrix approximation item and the active nodes of the Trie, it can be seen that the active nodes of the Trie for solving the character string t only use the Trie to map the nodes of each character of the character string s, and do not perform the transformation of the Trie on the character string t. For the character string sets S and T, for any character string therein, the advantage of sharing the prefix should be fully utilized, and the calculation amount of the active node is relatively small. Therefore, a Trie structure should be constructed separately for both string sets S and T. Fig. 9 shows a Trie established based on S and T, and all the strings are represented as paths of the Trie, so that the characteristics of the dual-set expressed in the Trie can be fully utilized to improve the calculation efficiency of the active node set.
And thirdly, discovering the mode characteristics of a node set irrelevant to solving the active node set of a certain node in the Trie tree structure, and cutting a node area set conforming to certain characteristics by utilizing the mode characteristics in the process of calculating the active node set of each node from top to bottom, thereby reducing the number of nodes of the Trie tree and reducing the calculation of the unorthodous nodes.
And fourthly, solving the active node set of the child node according to the obtained active node set of the father node, deleting the active node sets of some unnecessary nodes in a certain program for calculation, and reducing the calculation times, but if the active node sets known for all the nodes spend resources such as memory and the like for storage, a large amount of memory space is occupied, and the method is not suitable for the condition that the character string is long or the Trie tree is large in scale. Therefore, the characteristic that the calculation of some node active node sets only depends on part of nodes is utilized to timely release part of node sets which are invalid for solving the follow-up nodes, the use amount of system resources is greatly reduced, and the practicability of the algorithm is improved.
Some reasoning is specifically proposed below for the above proposed method for improving the solution efficiency of the active node set.
Introduction 1: given Trie Tree HxOf a node l and its parent p, and a string edit distance threshold τ for any node l' ∈ Qτ(l) There must be one node p ' that is the ancestor of l ' and that satisfies p ' e.Qτ(p)。
For example, in fig. 11, it is assumed that the character string corresponding to the node 14(o) is "so", and the character string corresponding to the node 15(a) is "wa", and w is a child node of e. To calculate Qτ(w), only any node t ∈ Q needs to be judgedτ(e) And whether all child nodes of t are active nodes of the node w or not is sufficient. As can be seen from table 3, when τ is 1, Q is presentτ(e) Q is calculated as {13,14,15}τ(w) whenever it is checked whether the nodes 13,14,15 and all their children belong to Qτ(w), the node set passing the final check is Qτ(w) has Qτ(w)={14,15,16,17}。
2, leading: given Trie tree structure HxAnd two of the nodes u and v, and a string edit distance threshold τ, if for any ancestor node v' of v, there isAnd for any ancestor node u' of u, all satisfyThe substring or string s (u) represented by the node after the node u in the Trie and the substring or string t (v) represented by the node after the node v are unlikely to be similar, i.e. s (v) is not true.
For example: in fig. 11, looking at the node 1(b), the node 6(e), the node 13(s) and the node 14(o), the character strings represented by the four nodes are "b", "be", "s" and "so", respectively, since the four nodes are represented by the characters "b", "be", "s" and "so" respectively <math> <mrow> <mn>6</mn> <mo>&NotElement;</mo> <mrow> <mo>(</mo> <msub> <mi>Q</mi> <mi>&tau;</mi> </msub> <mrow> <mo>(</mo> <mi>root</mi> <mo>)</mo> </mrow> <mo>&cup;</mo> <msub> <mi>Q</mi> <mi>&tau;</mi> </msub> <mrow> <mo>(</mo> <mn>13</mn> <mrow> <mo>(</mo> <mi>s</mi> <mo>)</mo> </mrow> <mo>)</mo> </mrow> <mo>&cup;</mo> <msub> <mi>Q</mi> <mi>&tau;</mi> </msub> <mrow> <mo>(</mo> <mn>14</mn> <mrow> <mo>(</mo> <mi>o</mi> <mo>)</mo> </mrow> <mo>)</mo> </mrow> <mo>)</mo> </mrow> <mo>,</mo> </mrow> </math> And is also provided with <math> <mrow> <mn>14</mn> <mo>&NotElement;</mo> <mrow> <mo>(</mo> <msub> <mi>Q</mi> <mi>&tau;</mi> </msub> <mrow> <mo>(</mo> <mi>root</mi> <mo>)</mo> </mrow> <mo>&cup;</mo> <msub> <mi>Q</mi> <mi>&tau;</mi> </msub> <mrow> <mo>(</mo> <mn>1</mn> <mrow> <mo>(</mo> <mi>b</mi> <mo>)</mo> </mrow> <mo>)</mo> </mrow> <mo>&cup;</mo> <msub> <mi>Q</mi> <mi>&tau;</mi> </msub> <mrow> <mo>(</mo> <mn>6</mn> <mrow> <mo>(</mo> <mi>e</mi> <mo>)</mo> </mrow> <mo>)</mo> </mrow> <mo>)</mo> </mrow> <mo>,</mo> </mrow> </math> All subsequent nodes with node 6(e) and node 14(o) may not represent similar strings, i.e., "bed" and "soak", "bem" and "soak", "bed" and "soap", and "bem" and "soap" may not be similar.
And 3, introduction: given a string s and t, and a string edit distance threshold τ, if | | s | - | t | > τ, then there are: ED (s, t) > τ. Where | s | and | t | represent the number of characters contained in the character strings s and t, respectively.
In part (a) of fig. 10, a character string length section is marked next to the node s and the node tWherein,indicating the shortest path length in the sub Trie tree structure with node e as the "root node",the longest path length. In the figure [1,4 ]]The length of the character string representing the subsequent node representation of the node s is in the range of 1-4, and [7, 9%]The length of the character string represented by the nodes subsequent to the node s is in the range of 7-9. Due to the fact thatThat is, the length difference between the two groups of character strings is greater than the character string editing distance threshold, so that Q can be solvedτIn the process of (t), all subsequent nodes of the node s may be ignored. In a sub-Trie structure with the root node root of the actual Trie tree as the root node and the s and t nodes as the leaf nodes, l min root = l max root = 1 , there are any subsequent nodes μ to the node, there are <math> <mrow> <mi>s</mi> <mo>&Element;</mo> <msub> <mi>Q</mi> <mi>&tau;</mi> </msub> <mi>t</mi> <mo>^</mo> <mi>&mu;</mi> <mo>&NotElement;</mo> <msub> <mi>Q</mi> <mi>&tau;</mi> </msub> <mrow> <mo>(</mo> <mi>t</mi> <mo>)</mo> </mrow> <mo>.</mo> </mrow> </math>
For example, in calculating the active node set of node 1(b) of FIG. 11; using the lemma to delete a portion and calculate Qτ(1(b)) an unrelated node. The character string length intervals of the nodes 1(b), 12(e) and 22(t) are [4,6 ] respectively],[6,6]And [8,8 ]]Since the difference between the string lengths of node 12(e) and node 1(b) is [0,2 ]]And node 22(t) differs from node 1(b) by [2,4 ]]When τ is 1, the node 22(t) may be deleted, but the node 12(e) may not be deleted.
And (4) introduction: given Trie tree structure HxAnd two of the nodes u and v, and a string edit distance threshold τ, if one of the following conditions exists:
(1) v is the ancestor node of u, and subtrees with v and u as root nodes respectively contain the same leaf node;
(2) there is only one string prefixing v and u simultaneously,
then Q is calculatedτ(u) all successor nodes with v as parent or ancestor are not considered.
Fig. 10 shows two embodiments of the conditions in part (b) and part (c). (b) In the case that the nodes t and s have the same leaf node and t is the parent node of s, Q is calculatedτ(s) subsequent nodes to node t will not be considered. (c) In (2), since there is only one character string "sto" with the node s (representing the character string "s") and the node t (representing the character string "st") as prefix strings at the same time, Q is calculatedτ(s) node o is not considered.
Lemma 3 and lemma 4 can be combined as an important method to improve the computational efficiency of the active node set. Because these two arguments can reduce the amount of computation of the active node set by deleting irrelevant nodes. For example, solving for the active node set Q of node 4(r) in FIG. 7τ(4(r)), Q can be obtained by theorem 3τ(4(r)) {3,4,5}, and Q was obtained using theorem 4τ(4(r)) - {4,5}, and the final result is Qτ(4(r))={3,4,5}。
For the comparison of the similarity of character strings between two character string sets, the influence of the relationship of the character string sets on the algorithm process should be considered, and the method should be applicable to different types of character string sets. The invention firstly creates a Trie tree structure based on the shared prefix of the character string and the structural characteristics among the character strings. Preprocessing is performed before similarity calculation, namely, in order to avoid repeated calculation, a node area which is not related to the calculation of the active node set is deleted by using the interaction between the nodes of the Trie tree and the lemma related to the active node set. In order to improve the efficiency of the algorithm, the characteristics of the Trie tree structure are fully utilized, the active node set of the child node is solved through the known active node set of the father node, and meanwhile, the unknown node of the active node set is subjected to dynamic incremental access, so that on one hand, each node is not required to be accessed and calculated, and on the other hand, the whole Trie tree structure is prevented from being traversed for solving the active node set of a certain node. In this way, the computational efficiency can be significantly increased in a manner that reduces the amount of access. Meanwhile, in order to avoid the algorithm from occupying a large amount of memory resources, the invention stores the accessed nodes which are only effective to solve the active node set of the nodes to be accessed instead of storing all the accessed node information by introducing the data structure of the access node stack. By improving the calculation efficiency and limiting the use of algorithm system resources, the algorithm provided by the invention has good practicability and has advantages compared with the existing method.
Based on the above-mentioned text description of the overall process of the algorithm, the following is further illustrated with specific algorithm flows and accompanying drawings.
Fig. 16 is an overall flow chart of the two-stage filtering based on structure and content, wherein the right half part illustrates an algorithm process of the second-stage content filtering, and the specific implementation includes the following steps:
step 1: extracting a character string set U (U) from the six-element group vector generated by the first-stage structural similarity filtering1,u2,u3,…uk) And V (V)1,v2,v3,…vl) There is a possibility that both sets of character strings are identical. Each character string is the result of the concatenation of the title (title), the KeyWord (KeyWord) and the content (content) in each tuple vector. And determining the character string editing distance threshold tau according to the practical application situation.
Step 2: sorting the character strings in the character string sets U and V according to a dictionary to form an ordered set, and establishing a Trie tree H based on the ordered setu∪vThe Trie tree is composed of nodes corresponding to each character in the U and the V, and the nodes of the Trie tree store node numbers, mapping relations between the nodes and the characters, character string sets to which the nodes belong and structural relations between the nodes.
For example, Trie Tree Hu∪vIn the method, each node is numbered according to the prior traversal, and each node is labeled (only belongs to U, or only belongs to V, or belongs to U U.U.V) based on the set to which the node belongs. Meanwhile, the nodes need to store the position relationships of parents, children, brothers and the like with other nodes so as to facilitate traversal and search.
And step 3: an "accessed node stack" data structure ST is created for storing the Trie nodes that have been accessed. The stack is initialized to have only the root node and its corresponding set of active nodes.
For example, a length | H is created from the maximum depth of the Trie treeU∪VEach element in the "visited node stack" ST, ST of | includes the number of the node and the active node set of the node.
And 4, step 4: and establishing a dynamic Trie tree H' only with root nodes and roots, solving an active node set of the root nodes and roots, and determining the next node for solving the active node set, which is called a destination node. Comprising the following substeps.
Substep 1: incremental character string edit distance-based definition 10 (incremental character) method for solving active node set Q of root nodeτ(root), the root node and corresponding set of active nodes are pushed onto the stack ST.
Substep 2: and extracting a node set N only belonging to the character string set U from the active node set of the root node, and selecting the node N' closest to the root node.
And 5: creating a virtual Trie-Tree H 'for the next destination node'u∪v. Comprises the following steps.
Substep 1: exploring in Trie tree H based on theory 3 and theory 4u∪vThe node area of (a) that is not relevant to solving the active node set of node n',
substep 2: gray marking is carried out on the node area to form a tree H based on the original Trieu∪vIs dedicated to a virtual Trie-Tree H 'of n'u∪v
Step 6: the active node set of the destination node is found from the active node set of the uppermost element of the stack ST based on lemma 1.
Substep 1: if the destination node belongs to the two sets, the destination node is set to belong to one set and is in a virtual Trie tree H'u∪vExcluding nodes belonging only to the set; if the destination node only belongs to a certain set, the same is carried out in the virtual Trie tree H'u∪vExcluding nodes that belong only to the set.
Substep 2: h'u∪vIn the method, the active node set of the destination node is obtained according to the active node set of the parent node at the uppermost layer on the stack ST
Substep 3: the destination node is pushed onto the stack ST. The uppermost layer element is the number of the destination node and its corresponding active node set.
And 7: and updating the active node set of the elements within the distance tau from the uppermost element of the stack in the stack ST based on the description of the symmetry of the active nodes in the Trie tree active node definition.
And 8: judging whether the node p currently processed is a certain leaf node of the Trie tree structure, if so, determining whether the node p currently processed is a certain leaf node of the Trie tree structureOutputting similar string pairings<o,p>And pops up the uppermost node in the stack ST.
And step 9: in Trie Tree H'u∪vAnd (4) taking out the newest destination node of the child node or the brother node of the destination node belonging to the same character set, and repeating the steps from 5 to 8 until the stack ST is empty.
Fig. 12 is an example of the implementation of the above algorithm flow, and the following first analyzes the influence of the relationship between the character string sets U and V on the algorithm, and then gives a specific algorithm execution example in conjunction with fig. 12:
let U and V be two sets of strings, where U ═ U1,u2,u3,…um},V={v1,v2,v3,…vn}. Tau is string edit distance threshold, R is output similar string pairing set, HU∪VRepresenting the Trie structure built on the character string sets U and V, and root representing the root node of the Trie structure.
The relationship of U and V affects the execution process and the result of the algorithm. If U and V are the same set, i.e., U { < U, w > | U, w ∈ U ^ ED (U, w) ≦ τ }. It can be understood that if a similar character string pairing set is solved in the same set, the result set is composed of tuples formed by two similar character strings in the same character string set. Otherwise, R { < U, V > | U ∈ U, V ∈ V, and ED (U, V) ≦ τ }. The specific implementation of the algorithm is as follows:
(1) the number of a node o is ID (o), ST represents an accessed node stack, and the length | ST | ≦ H of the node o is set in the Trie treeU∪VEach element storing the number ID of the accessed Trie-tree node o and the corresponding extended active node set Qτ' (o) is represented by<ID(o),Qτ'(o)>ST is initialized to<ID(root),Qτ'(root)>Is initialized to Qτ' (root) ← phi, wait for step (4) to pair Qτ' (root) update.
(2) Each node o in the Trie may be denoted as o (id (o), char (o), charset, string fo, bel). Wherein char (o) represents the character corresponding to the node o, charset represents the character string set to which the node o belongs, strucfo represents the structural relationship, such as father, son or brother, between the node o and other adjacent nodes, stored by the node o, bel represents the relationship between the node o and the character sets U and V,dictionary ordering for any character string s ∈ U ≧ V yielding ordered set U Θ V ═ r | r1>r2>...>rm+nAnd r belongs to U and V. ">" indicates that the character string preceding the symbol is arranged before the character string following the symbol in the dictionary ordering.
(3) Building Trie tree H based on ordered set U theta VU∪VFor nodes belonging to different sets, bel is initialized to different values respectively.
(4) Establishing a dynamic Trie tree H 'based on a root node root, initializing as H' ← root, and traversing HU∪VSolving for Qτ(root) ← { q | q ∈ U ^ V | ≦ τ }, and element ∈ U ^ q | ≦ τ }, and its preparation method<ID(root),Qτ'(root)>The stack ST is pushed, marked as visited node.
Wherein, for node r, Q'τ(r) Trie Tree active node set Q for rτ(r) extended. Q'τ(r) at QτAnd (r) includes influence factors of the relationship of the initial character string set on the algorithm process. Solving for QτThe method of' (r) is as follows:
(5) defining dn as destination node, and dn 'as next destination node, initializing as dn ← root.1stchild, dn' ← dn.1stchild, satisfyingWherein C isS PRepresents the complement of P in the corpus S, S ← { U, V, U ≦ V }.
(6) If stack ST is empty, go to (15), otherwise go to (7).
(7) If dn is not null and dn ∈ U, go (8), otherwise go to (13).
(8) Let element tp be the uppermost element of stack ST, tp (ID (tp), Qτ' (tp)) ═ st.tope (); clipping the nodes conforming to the lemma to form a virtual Trie tree H'U∪V,H'U∪V←HU∪VDs, ds is the set of nodal regions that conform to the lemma.
(9) Active node set (dn, Q) by element tpτ' (tp)) solving for Qτ' (dn) if dn is a leaf node, go to (10), otherwise go to (12).
(10) Let s (node) represent the substring represented by node or the complete character string, and for any Trie tree node l ∈ Qτ' (dn) ^ l ≠ n ^ l.child ← phi, has s (dn) ≈ s (l), combines output results into a result set R, and calculates R { (dn, l) };
(11) in stack ST, let | t | be the position of the stack element corresponding to t, and the node corresponding to the uppermost element of stack ST be dn, with | dn |>If the node t meets the condition that the value of | dn | - | t | ≦ tau, dn is added to Qτ' (t) calculating Qτ'(t)∪={dn}。
(12) Push(s) to push the destination node dn onto the stack ST<dn,Qτ'(dn)>) Solving for the next destination node dn',
(13) if dn is not null, derive dn' ← dn. next finding.dn. next finding that represents the next sibling of node dn, go (7); otherwise, go to (14).
(14) Post ST pops up the top element, tp (ID (tp), Qτ'(tp)) ═ st.pop (), and define the sibling node of the node corresponding to the element as the object to be accessed next, dn' ← tp.nextsabling.
(15) And outputting a final result R, observing an actual matching program of the character string similarity pair in the R, and ending the algorithm.
The experimental data of the invention come from the Yahoo official website and the webpage data in the China Daily, crawl in the set theme through tools such as web crawlers, and form the Yahoo data sets and the China Daily data sets with different character string lengths through processing. Fig. 13,14 and 15 are comparative graphs of new and old methods plotted based on experimental result data.
Referring to fig. 13, the new method based on the two-stage filtering is obviously superior to the conventional method in comparison of the operation efficiency of the same data set. In fig. 13, it can be seen that the time for the operation of the conventional method and the method proposed in the present invention is first decreased, then slowly increased, and finally gradually decreased with the increase of the average string length in the string set. In the case of short average string length, the old method takes more runtime than the new method because the new method has one more layer of structural similarity filtering. The advantages of the new method are shown as the average string length increases. When the string edit distance threshold τ is 10, the drop is greater than τ 20, because the smaller the string edit distance threshold τ is, the more strict the definition of the similarity is, the higher the hierarchy of deletion of the nodes that do not satisfy the definition of the similarity is, and the less the nodes that need to be traversed is, the shorter the running time is. It can also be seen from the figure that the new method is more efficient than the old method both when τ is 10 and τ is 20.
Referring to fig. 14, the new method based on the two-stage filtering is obviously superior to the conventional method in comparison of the operating efficiency of different data sets. Fig. 14 is a graph showing a comparison of results plotted in the case where the method of the present invention and the conventional method are performed in the case where the average string lengths are different in two different data sets, respectively, the Yahoo data set and the China daisy data set, which are divided into 4 groups, and the average string lengths are 300, 500, 700, and 900, respectively. It can be found that the same regularity is present in both data sets, presenting a similar trend to that of fig. 14.
As shown in fig. 15, the new method based on the two-stage filtering proposed by the present invention is significantly superior to the conventional method in terms of accuracy. The three fold lines in fig. 15 are: the new method provided by the invention is a traditional similarity detection method and the relation between the number of the similar character string pairs and the size of the character string set under the actual condition. It can be seen from the figure that the method of secondary filtering proposed in the present invention has a smaller gap from the actual situation than the conventional method, and therefore, the new method has higher accuracy.
It should be understood that parts of the present invention not specifically set forth are within the prior art.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (11)

1. A Web data similarity detection method based on structure and content secondary filtering is characterized in that: on the basis of a traditional general similarity detection method, the characteristics of Web data structure and content distribution are discovered, and two-stage filtering is carried out on a detected document set; the first stage of filtering in the two-stage filtering is structural similarity filtering, each Web document is modeled into a Tag tree structure, so that document sets which are dissimilar in structure are removed, the key contents of the rest documents are extracted and expressed into a tuple vector form, and key information is connected to generate a character string set; and the second-stage filtration in the two-stage filtration is used for carrying out Trie tree structure modeling on the character string set generated after the first-stage filtration, and connecting similar character strings to obtain a final result.
2. The method for detecting similarity of Web data based on structure and content secondary filtering as claimed in claim 1, wherein the first-level filtering specifically comprises the following steps:
step A1: preprocessing an input Web document set to be detected, extracting a Web document structure model tag.tree, and establishing an effective node index;
step A2: constructing binary branch vectors (BinaryBranch Vector) of all tags in parallel by combining MapReduce;
step A3: establishing a multi-pointer LSH index for a binary branch vector of the tag.tree;
step A4: structure similarity Web document set S for inquiring given inquiry q1
Step A5: obtaining a set of Web documents S1Based on a content six-tuple vector set W extracted by a Tag tree, and based on the W, a character string connection algorithm is used for obtaining a similarity result set S of a query q2
3. The Web data similarity detection method based on structure and content secondary filtering as claimed in claim 2, wherein the specific implementation of step a1 includes the following sub-steps:
step A1.1: combining MapReduce, processing a Web document data set stored in a distributed document system (HDFS) in parallel, and extracting a Web document structure model tag.tree by using a Tag tree modeling mode based on a path Tag;
step A1.2: newly building a MapReduce task, wherein the input key of the Map task is a file name, the value is tag.
4. The Web data similarity detection method based on structure and content secondary filtering as claimed in claim 3, wherein the specific implementation of step A2 includes the following sub-steps:
step A2.1: newly building a MapReduce task, calculating binary node vector sets S of all tags, and then distributing the universe set S to each DataNode in the system by using a distributedPath function of Hadoop;
step A2.2: constructing a binary branch vector BBV by using a MapReduce task according to the binary node vector of the S and each tag.tree; key of map task is file name, value is binary node vector of tag.tree; the key of the reduce task is a file name, and the value is BBV corresponding to tag.tree;
step A2.3: the task writes the BBV of each tag.
5. The Web data similarity detection method based on structure and content secondary filtering as claimed in claim 4, wherein the specific implementation of step A3 includes the following sub-steps:
step A3.1: initializing L hash tables for establishing a multi-pointer LSH index, and initializing M hash functions and r hash buckets in each table; calculating indexes of BBV of all tags by using a MapReduce task;
step A3.2: assuming that a binary branch vector of a certain Web document X is M, firstly, M times of hash functions are calculated for M, and g is seti(m) is the hash value of query m in the ith hash table, i.e. g (m) ═ h1(m),…,hM(m));
Step A3.3: initializing an L-dimensional perturbation vector delta, wherein delta is equal to delta1,…,△L),gi(m)+△1Is to interfere with the vector delta1Is applied to gi(m) obtaining a first-order pointer to the hash bucket, the first-order pointer pointing to another hash bucket in the hash table, each perturbation vector being mapped to a unique set of hash values, and the pointer pointing to a hash bucket at most once;
step A3.4: applying the perturbation vector in delta to g in turni(m), obtaining i (i ═ 1,2, …, L) order pointers, pointing to different hash buckets, and establishing a multi-pointer index;
step A3.5: according to the pointer construction condition of all tag.tree, a hash bucket within the second order gives a group number BandID; keys of the MapReduce task are a group number BandID and a barrel number BucketID, and values are corresponding Web file names.
6. The Web data similarity detection method based on structure and content secondary filtering as claimed in claim 5, wherein the specific implementation of step A4 includes the following sub-steps:
step A4.1: extracting a binary branch vector of a given query q, quickly retrieving a top-k proximity point of the q based on a multi-pointer LSH index, and screening out a binary branch distance BDist (T) in the binary branch vectori,Tq) Satisfies the conditionsT tree set S1As a secondary index input;
step A4.2: quickly retrieving a top-k proximity point of q based on the index s;
computing M hash functions of query q, let g (q) be the hash value of query q in the ith hash table, i.e., where g (q) ═ h1(q),…,hM(q)),gi(q)+△1Is to interfere with the vector delta1Is applied to giAnd (q) obtaining all hash buckets within two orders away from the given point q according to the index s by the first-order pointer of the hash bucket, namely the top-k adjacent point of the query q.
7. The Web data similarity detection method based on structure and content secondary filtering as claimed in claim 5, wherein the specific implementation of step A5 includes the following sub-steps:
step A5.1: based on step A1, a Web document set S is acquired1Tree set;
step A5.2: obtaining the contents in the leaves of all effective nodes according to the effective index of the tag of the Web document X;
step A5.3: and after judging the leaf node type according to the father node label, adding a six-tuple vector of X, wherein the six-tuple vector consists of the number (id) of X, a title (title), content (content), a KeyWord (KeyWord), a link (URL) and release time (PageTime).
8. The method for detecting similarity of Web data based on secondary filtering of structure and content as claimed in claim 7, wherein the secondary filtering specifically includes the following steps:
step B1: extracting a character string set U (U) from the six-element group vector generated by the first-stage filtration1,u2,u3,…uk) And V (V)1,v2,v3,…vl) The two sets of strings have the exact same possibility; each character string is a result of serial connection of a number (id), a title (title), content (content), a KeyWord (KeyWord), a link (URL) and release time (PageTime) in each tuple vector; determining a character string editing distance threshold tau according to an actual application situation;
step B2: sorting the character strings in the character string sets U and V according to a dictionary to form an ordered set, and establishing a Trie tree H based on the ordered setu∪vThe Trie tree consists of nodes corresponding to each character in the U and the V, and the nodes of the Trie tree store node numbers, mapping relations between the nodes and the characters, character string sets to which the nodes belong and structural relations between the nodes;
step B3: creating an accessed node stack data structure for storing accessed Trie tree nodes; the stack is initialized to be only a root node and an active node set corresponding to the root node;
step B4: establishing a dynamic Trie tree H' only with root node root, solving an active node set of the root node root, and determining a next node for solving the active node set, which is called a destination node;
step B5: creating a virtual Trie-Tree H 'for the next destination node'u∪v
Step B6: solving an active node set of a destination node through an active node set of an element positioned at the uppermost layer of a stack ST based on the theorem 1;
wherein the introduction 1: given aTrie Tree HxOf a node l and its parent p, and a string edit distance threshold τ for any node l' ∈ Qτ(l) There must be one node p ' that is the ancestor of l ' and that satisfies p ' e.Qτ(p);
Step B7: updating an active node set of elements, the distance between the elements and the topmost element of the stack ST is within tau, based on the description of the symmetry of the active nodes in the Trie tree active node definition;
step B8: judging whether the node p currently processed is a certain leaf node of the Trie tree structure, if so, determining whether the node p currently processed is a certain leaf node of the Trie tree structureOutputting similar string pairings<o,p>;
Step B9: in Trie Tree H'u∪vAnd taking out the newest destination node of the child node or the brother node of the destination node belonging to the same character set, and repeating the steps B5 to B8 until the stack ST is empty.
9. The Web data similarity detection method based on structure and content secondary filtering as claimed in claim 8, wherein the specific implementation of step B4 includes the following sub-steps:
step B4.1: active node set Q for solving root node based on character string editing distanceτ(root), pushing the root node and the corresponding set of active nodes into the stack ST;
the character string editing distance specifies two character strings s(s)1s2s3…sm) And t (t)1t2t3…tn) The string edit distance of s and t is represented as ED (s, t), and is represented as the minimum operation times of converting the string s into t by three operation modes of replacing, inserting and deleting with a single character as a unit; if s is empty, it is denoted ED (s, 0);
for ED (s, t), there is a degree of incremental:
①ED(si0)=i,ED(tj,0)=j,1≤i≤m,1≤j≤n;
②ED(si,tj)≥ED(si-1,tj)+1,1<i≤m,1≤j≤n;
③ED(si,tj)≥ED(si,tj-1)+1,1≤i≤m,1<j≤n;
④ED(si,tj)=ED(si-1,tj-1),1<i≤m,1<j≤n,si=tj
⑤ED(si,tj)≥ED(si-1,tj-1)+1,1<i≤m,1<j≤n,si≠tj
step B4.2: and extracting a node set N only belonging to the character string set U from the active node set of the root node, and selecting the node N' closest to the root node.
10. The Web data similarity detection method based on structure and content secondary filtering of claim 9, wherein the concrete implementation of step B5 includes the following sub-steps:
step B5.1: in Trie tree H based on lemma A and lemma Bu∪vScreening out a node area irrelevant to the active node set of the solving node n';
the theory A: given a string s and t, and a string edit distance threshold τ, if | | s | - | t | > τ, then there are: ED (s, t) > τ, where | s | and | t | represent the number of characters contained in the character strings s and t, respectively;
the theory B is as follows: given Trie tree structure HxAnd two of the nodes u and v, and a string edit distance threshold τ, if one of the following conditions exists:
(1) v is the ancestor node of u, and subtrees with v and u as root nodes respectively contain the same leaf node;
(2) if only one string exists and v and u are simultaneously used as prefix strings, Q is calculatedτ(u) all subsequent nodes having v as parent or ancestor are not considered;
step B5.2: gray marking is carried out on the node area to form a tree H based on the original Trieu∪vIs specially designed forVirtual Trie-Tree H 'for n'u∪v
11. The Web data similarity detection method based on structure and content secondary filtering as claimed in claim 10, wherein the specific implementation of step B6 includes the following sub-steps:
step B6.1: if the destination node belongs to the two sets, the destination node is set to belong to one set and is in a virtual Trie tree H'u∪vExcluding nodes belonging only to the set; if the destination node only belongs to a certain set, the same is carried out in the virtual Trie tree H'u∪vExcluding nodes belonging only to the set;
step B6.2: h'u∪vAccording to an active node set of a parent node at the uppermost layer on a stack ST, an active node set of a target node is obtained;
step B6.3: and pressing the destination node into the stack ST, wherein the uppermost element is the number of the destination node and the corresponding active node set.
CN201410843460.0A 2014-12-30 2014-12-30 A kind of web data similarity detection method based on structure and content secondary filtration Expired - Fee Related CN104462582B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410843460.0A CN104462582B (en) 2014-12-30 2014-12-30 A kind of web data similarity detection method based on structure and content secondary filtration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410843460.0A CN104462582B (en) 2014-12-30 2014-12-30 A kind of web data similarity detection method based on structure and content secondary filtration

Publications (2)

Publication Number Publication Date
CN104462582A true CN104462582A (en) 2015-03-25
CN104462582B CN104462582B (en) 2017-07-11

Family

ID=52908617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410843460.0A Expired - Fee Related CN104462582B (en) 2014-12-30 2014-12-30 A kind of web data similarity detection method based on structure and content secondary filtration

Country Status (1)

Country Link
CN (1) CN104462582B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975547A (en) * 2016-04-29 2016-09-28 武汉大学 Approximate web document detection method based on content and position features
CN106102140A (en) * 2016-05-27 2016-11-09 北京灵龄科技有限责任公司 The power consumption optimization method of wireless senser and device
CN106326217A (en) * 2015-06-15 2017-01-11 华东师范大学 A stochastic algorithm-based distributed entity matching method
CN106844481A (en) * 2016-12-23 2017-06-13 北京信息科技大学 Font similarity and font replacement method
CN108021692A (en) * 2017-12-18 2018-05-11 北京天融信网络安全技术有限公司 A kind of method of web page monitored, server and computer-readable recording medium
CN109359701A (en) * 2018-11-28 2019-02-19 重庆邮电大学 A kind of three-dimensional modeling data analytic method of extracted with high accuracy and Fast Classification
CN109684438A (en) * 2018-12-26 2019-04-26 成都科来软件有限公司 A method of data are retrieved with father and son's hierarchical structure
CN110059085A (en) * 2019-03-18 2019-07-26 浙江工业大学 A kind of parsing of JSON data and modeling method of web oriented 2.0
CN110263570A (en) * 2019-05-10 2019-09-20 电子科技大学 A kind of gene data desensitization method for realizing efficient similarity query and access control
CN110502629A (en) * 2019-08-27 2019-11-26 桂林电子科技大学 A kind of filtering verifying character string similarity join method based on LSH
CN110598190A (en) * 2019-09-06 2019-12-20 湖南天河国云科技有限公司 Method for determining authority of text data on chain based on block chain
CN110968555A (en) * 2018-09-30 2020-04-07 北京国双科技有限公司 Dimension data processing method and device
CN114780541A (en) * 2022-04-01 2022-07-22 港珠澳大桥管理局 Data partitioning method, device, equipment and medium in micro-batch stream processing system
CN116737256A (en) * 2016-02-22 2023-09-12 斯诺普西斯公司 Method for automatically adjusting computing system, computing system and computer readable medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101661468A (en) * 2008-08-29 2010-03-03 中国科学院计算技术研究所 Method for extracting post metadata from forum post list pages

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101661468A (en) * 2008-08-29 2010-03-03 中国科学院计算技术研究所 Method for extracting post metadata from forum post list pages

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘桂峰 等: ""一种自动抽取We撇据对象的方法"", 《计算机应用与软件》 *
袁培森: ""基于LSH的Web数据相似性查询研究"", 《万方学位论文:复旦大学》 *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106326217A (en) * 2015-06-15 2017-01-11 华东师范大学 A stochastic algorithm-based distributed entity matching method
CN116737256A (en) * 2016-02-22 2023-09-12 斯诺普西斯公司 Method for automatically adjusting computing system, computing system and computer readable medium
CN105975547A (en) * 2016-04-29 2016-09-28 武汉大学 Approximate web document detection method based on content and position features
CN105975547B (en) * 2016-04-29 2019-06-25 武汉大学 Based on content web document detection method approximate with position feature
CN106102140A (en) * 2016-05-27 2016-11-09 北京灵龄科技有限责任公司 The power consumption optimization method of wireless senser and device
CN106102140B (en) * 2016-05-27 2022-03-22 集道成科技(北京)有限公司 Power consumption optimization method and device of wireless sensor
CN106844481A (en) * 2016-12-23 2017-06-13 北京信息科技大学 Font similarity and font replacement method
CN106844481B (en) * 2016-12-23 2021-01-05 北京信息科技大学 Font similarity and font replacement method
CN108021692A (en) * 2017-12-18 2018-05-11 北京天融信网络安全技术有限公司 A kind of method of web page monitored, server and computer-readable recording medium
CN108021692B (en) * 2017-12-18 2022-03-11 北京天融信网络安全技术有限公司 Method for monitoring webpage, server and computer readable storage medium
CN110968555A (en) * 2018-09-30 2020-04-07 北京国双科技有限公司 Dimension data processing method and device
CN110968555B (en) * 2018-09-30 2023-07-04 北京国双科技有限公司 Dimension data processing method and device
CN109359701A (en) * 2018-11-28 2019-02-19 重庆邮电大学 A kind of three-dimensional modeling data analytic method of extracted with high accuracy and Fast Classification
CN109684438A (en) * 2018-12-26 2019-04-26 成都科来软件有限公司 A method of data are retrieved with father and son's hierarchical structure
CN110059085A (en) * 2019-03-18 2019-07-26 浙江工业大学 A kind of parsing of JSON data and modeling method of web oriented 2.0
CN110263570B (en) * 2019-05-10 2020-09-25 电子科技大学 Gene data desensitization method for realizing efficient similarity query and access control
CN110263570A (en) * 2019-05-10 2019-09-20 电子科技大学 A kind of gene data desensitization method for realizing efficient similarity query and access control
CN110502629A (en) * 2019-08-27 2019-11-26 桂林电子科技大学 A kind of filtering verifying character string similarity join method based on LSH
CN110598190A (en) * 2019-09-06 2019-12-20 湖南天河国云科技有限公司 Method for determining authority of text data on chain based on block chain
CN110598190B (en) * 2019-09-06 2024-03-08 湖南天河国云科技有限公司 Method for determining right of text data on chain based on block chain
CN114780541A (en) * 2022-04-01 2022-07-22 港珠澳大桥管理局 Data partitioning method, device, equipment and medium in micro-batch stream processing system
CN114780541B (en) * 2022-04-01 2024-04-12 港珠澳大桥管理局 Data partitioning method, device, equipment and medium in micro batch flow processing system

Also Published As

Publication number Publication date
CN104462582B (en) 2017-07-11

Similar Documents

Publication Publication Date Title
CN104462582B (en) A kind of web data similarity detection method based on structure and content secondary filtration
Schenker Graph-theoretic techniques for web content mining
Leung Uncertain frequent pattern mining
CN106528648B (en) In conjunction with the distributed RDF keyword proximity search method of Redis memory database
Ferragina et al. Learned data structures
Schenker Graph-theoretic techniques for web content mining
JP5241738B2 (en) Method and apparatus for building tree structure data from tables
WO2015010509A1 (en) One-dimensional liner space-based method for implementing trie tree dictionary search
O’Hare et al. A review of unsupervised and semi-supervised blocking methods for record linkage
Adamu et al. A survey on big data indexing strategies
CN115982390B (en) Industrial chain construction and iterative expansion development method
Jiménez et al. Frequent tree pattern mining: A survey
Wu et al. Modeling and extracting deep-web query interfaces
Fang et al. Mining bucket order-preserving submatrices in gene expression data
CN107657067B (en) Cosine distance-based leading-edge scientific and technological information rapid pushing method and system
Priya et al. Entity resolution for high velocity streams using semantic measures
Wang et al. Research on a frequent maximal induced subtrees mining method based on the compression tree sequence
Endres et al. Index structures for preference database queries
Li et al. A novel approach for mining probabilistic frequent itemsets over uncertain data streams
Abd El-Ghafar et al. An Efficient Multi-Phase Blocking Strategy for Entity Resolution in Big Data
Zhang et al. On efficient and effective association rule mining from XML data
Yan et al. Fuzzy XML Queries and Index
Sasikala et al. Mining association rules from XML document using modified index table
Li et al. An Incremental Partitioning Graph Similarity Search Based on Tree Structure Index
Sridevi et al. Finding Frequent Patterns Based On Quantitative Binary Attributes Using FP-Growth Algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170711

Termination date: 20171230