CN1719442A - Network content grading index structure based on CPat-Tree and cutting method - Google Patents

Info

Publication number
CN1719442A
Authority
CN
China
Prior art keywords
node
treemap
array
nodemap
bit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 200510027784
Other languages
Chinese (zh)
Inventor
赵泽宇
薛向阳
石静
许源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN 200510027784 priority Critical patent/CN1719442A/en
Publication of CN1719442A publication Critical patent/CN1719442A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a content grading index structure model, based on CPat-Tree, for URLs and the network entities they point to, together with a trimming algorithm for that structure. The grading index structure uses several arrays (TreeMap, NodeMap, EList and IFArray) to record the URLs of entities and their grading information. The invention also gives the concrete steps of the trimming algorithm. The invention can greatly reduce the storage size of the CPat-Tree index structure, reduce the number of disk accesses and the CPU cost during queries, and achieve high query efficiency.

Description

Network content grading index structure based on CPat-Tree and trimming method
Technical field
The invention belongs to the fields of document information retrieval and Internet technology, and specifically relates to an index structure for classifying and grading Internet content entities and to a trimming method applied to this structure.
Technical background
With the rapid development of network technology, the Internet has become deeply embedded in people's work and daily life and is now an important channel for publishing and obtaining information. However, because publishing and obtaining information on the Internet is highly anonymous, highly private, highly interactive and not bound to any region, it is difficult to manage and restrict content of uneven quality effectively, and this has had a serious negative effect on public life and on social production.
In public life, the Internet currently carries a large amount of harmful information whose main content is violence, pornography, and anti-government or antisocial material. It seriously misleads the public and adversely affects the healthy development of society. According to 2003 statistics from the U.S. company N2H2, roughly 8% of web pages worldwide are pornographic, and a quarter of the requests submitted to search engines every day are related to pornography; websites, web pages and e-mail with anti-government or antisocial content are likewise widespread.
In social production, misuse of computer networks often has a serious negative effect [1] on an enterprise's corporate culture, production costs and efficiency. First, employees frequently use the Internet during working hours for activities unrelated to work, which takes up working time and directly lowers productivity. Second, work-unrelated Internet use by employees consumes a large share of the outbound bandwidth to the Internet, and many malicious websites bring worms and spam, for which the enterprise must pay substantial additional costs. The malicious code, Trojan horse programs and spyware contained in unsafe web pages can also steal or destroy an enterprise's commercial and research information. A report by IDC points out that 30%-40% of an enterprise's outbound network bandwidth is consumed for employees' private purposes; a report by a North American management federation points out that 27% of Fortune 500 companies have been caught up in scandals involving employees' improper Internet use and the spreading of pornographic information by e-mail.
The negative effect of the network is larger, and the range of harmful content wider, than people anticipated. To ensure the content safety of Internet information, to open up a "clean" network world for the public, and to give enterprises a technical guarantee for using network resources safely and reasonably, researching and developing network content security tools has important practical significance [7].
Common network content security techniques at present mainly include label filtering, keyword filtering, URL filtering, category filtering and content filtering [2] [3]. Because these filtering techniques suffer, respectively, from being difficult to supervise and enforce, from difficulty in keeping grading information up to date, and from high implementation complexity, network content filtering systems that have been developed and put into operation usually combine several of them.
URL filtering is one of the most commonly used techniques in network content filtering systems. It decides whether to block a user's request mainly by comparing the request with predefined grading information. The index data structures used by traditional URL filtering are mainly the Trie and the hash table [5]. The advantage of a Trie-based index is very high query efficiency; its disadvantages are that it consumes a large amount of memory and that serialization and deserialization are complex, so that complicated conversions are needed for disk access and network transmission. The advantage of a hash-table-based index is likewise very high query efficiency, and because a linear array represents the index structure, serialization and deserialization are simpler. Its disadvantages are that choosing and using a hash function adds implementation complexity and that the storage space of the hash table remains relatively large.
The innovation of the present invention is a network content grading index structure implemented on the CPat-Tree [6] structure, together with a trimming method implemented directly on the storage arrays. With this index structure and trimming algorithm, the storage size of the content grading index can be brought down to a very small level while query efficiency is also improved significantly.
The present invention can be used not only for URL blocking in content filtering systems but also in other areas of information retrieval, for example content delivery networks (CDN, Content Delivery Network) [8] and multicast routing tree (Multicast Routing Tree) [9] modeling.
List of references
1. Jacob Palme. Information Filtering. http://cmc.dsv.su.se/select/information-filtering.pdf, 1998-06-01.
2. Jonathan Zittrain, Benjamin Edelman. Internet Filtering in China. IEEE Internet Computing, 2003, 7(2): 70-77.
3. Justin Basilico, Thomas Hofmann. A joint framework for collaborative and content filtering. 27th Annual International ACM SIGIR Conference. NY: ACM Press, 2004: 550-551.
4. Menahem Friedman and Abraham Kandel. Introduction to Pattern Recognition - Statistical, Structural, Neural And Fuzzy Logic Approaches. World Scientific, 1999.
5. Zornitza Genova Prodanoff. Performance Evaluation of URL Routing for Content Distribution Networks. SF: University of South Florida, 2003.
6. M. Shishibori, M. Okada, T. Sumitomo and J. Aoe. Design of a Compact Data Structure for the Patricia Trie. IEICE Transactions on Information and Systems, 1998, Vol. E81-D, No. 4, pp. 364-371.
7. Chen Ding, Chi-Hung Chi, Jing Deng, and Chun-Lei Dong. Centralized Content-Based Web Filtering and Blocking: How Far Can It Go? In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC), 1999.
8. Survey of Content Delivery Networks (CDNs), http://cgi.di.uoa.gr/~grad0377/cdnsurvey.pdf.
9. Gaurav Sharma. Internet topology and tomography. https://engineering.purdue.edu/people/gaurav.sharma.3/Reports/Modeling.ppt, 2005-04.
Symbol table (meaning of the symbols used throughout this document)
TreeMap: bit array recording the CPat-Tree node structure in preorder
NodeMap: bit array recording, in preorder, the number of bits each CPat-Tree node contains
EList: bit array recording, in preorder, the bit values each CPat-Tree node contains
IFArray: the information vectors of the CPat-Tree leaf nodes, recorded in preorder
{L_0, L_1, L_2, ..., L_u}: the set of information vectors of the leaf nodes in the CPat-Tree structure
C: mapping from cluster center points to their corresponding leaf nodes
C_j: the center of the j-th cluster
γ: the fixed radius of a cluster sphere
L: the current node
B: the sibling node of the current node
F: the parent node of the current node
TreePos: cursor of the TreeMap array
valid, father, brother, ifpos: elements of a NodeInfo structure. valid indicates whether the current node is still valid; father and brother hold the subscripts, in the TreeMap and NodeInfo arrays, of the parent node and the sibling node of the current node; ifpos holds the position of the current node's information vector in IFArray.
nipos: cursor of the NodeInfo array
nodepos: cursor of the NodeMap array
TreeMap', NodeMap', EList', IFArray': the storage arrays after reorganization
Summary of the invention
The object of the invention is to propose a network content grading index structure model implemented on CPat-Tree, and a trimming algorithm based on this structure, so that the storage space of the index structure is significantly reduced.
The concept of a URL is introduced first.
URL is the abbreviation of Uniform Resource Locator. Its format is: protocol://hostname:port/directory-path/filename.
A URL corresponds to a concrete data object on a website or server. For example, a URL may correspond to a portal site or a BBS server, or to one particular picture under a directory of a website. Therefore, to prevent users from accessing a certain website, server or data object, it is enough to stop network users from sending requests for that URL.
The protocol part states the type of Internet resource; for example, http denotes the Hypertext Transfer Protocol, i.e. the WWW. Other protocols include ftp (File Transfer Protocol), telnet (remote login), news (newsgroups), mailto (e-mail) and mms (streaming media).
The hostname part gives the name of the Internet server, for example www.fudan.edu.cn. The directory-path part gives the location of a file or group of files on the server. Directory levels are separated by a forward slash (/).
The filename part is the actual name of the document, image or script to be accessed, for example index.html, logo.gif, script.cgi. The port number, directory path and filename are all optional components of a URL.
Some example URLs are given below:
http://www.w3.org/index.html: this URL corresponds to a website
http://10.64.130.4/images/advice.gif: this URL corresponds to a picture
ftp://10.11.3.8: this URL corresponds to an FTP server
mms://10.11.4.6/abc.avi: this URL is used to request an audio/video program on demand
telnet://bbs.fudan.edu.cn: this URL corresponds to a BBS server
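As an illustration of the URL components described above, the following is a minimal sketch that splits a URL into protocol, hostname, port, directory path and filename using Python's standard `urllib.parse`. The function name `split_url` and the dictionary keys are illustrative assumptions and do not appear in the patent.

```python
from urllib.parse import urlsplit

def split_url(url):
    """Split a URL into the components named in the text (hypothetical helper)."""
    parts = urlsplit(url)
    # Everything before the last "/" of the path is treated as the directory path,
    # everything after it as the filename; both may be empty.
    directory, _, filename = parts.path.rpartition("/")
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,        # None when the URL gives no port number
        "directory": directory,
        "filename": filename,
    }

picture = split_url("http://10.64.130.4/images/advice.gif")
server = split_url("ftp://10.11.3.8")
```

For the picture URL the optional directory path and filename are present, while for the FTP server URL both come back empty, which matches the statement that these components are optional.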
The network content grading index structure proposed by the present invention is implemented on CPat-Tree. The model stores the whole CPat-Tree in several data structures: TreeMap, NodeMap, EList and IFArray. As defined for CPat-Tree [6], TreeMap and NodeMap are bit arrays that record the tree structure in preorder. EList and IFArray are arrays introduced by the present invention to record auxiliary information: EList records the bit sequences that label the nodes, and IFArray records the information vector of each leaf node. Specifically,
(1) the grading index structure is made up jointly of the TreeMap, NodeMap, EList and IFArray arrays; wherein,
(2) the TreeMap bit array records the tree node structure of the CPat-Tree in preorder, marking an internal node with bit 0 and an external (leaf) node with bit 1;
(3) the NodeMap bit array records, in preorder, the number of bits each node contains, marking each node with one bit 0 followed by some number of bits 1; the total number of these bits equals the number of bits the node contains;
(4) the EList bit array stores, in preorder, the bit values each node contains;
(5) the IFArray array records, in preorder, the information vectors carried by the leaf nodes.
The trimming algorithm proposed by the present invention operates directly on the storage arrays of the index structure. Its main idea is that the network entities under many websites or directories on the Internet have identical or similar information vectors and share the same URL prefix. URLs that share a prefix and have identical or similar information vectors can be replaced by their common prefix, which reduces the storage space of the index structure. The basic steps are as follows:
Clustering process: each leaf node in the CPat-Tree is mapped, according to its information vector, to a point in a vector space. Using spatial pattern recognition, all points are divided into a number of non-overlapping clusters. Each cluster is represented by a sphere of fixed radius, and the points of a cluster all lie within its sphere. Leaf nodes that fall in the same sphere are considered to have similar (or identical) information vectors; the information vector corresponding to the center of the sphere serves as the composite information vector of the cluster.
Merging process: according to the merging rule, the leaf nodes in each cluster are merged upward in turn: the leaf nodes are deleted and their parent node becomes a new leaf node. The information vector of the new leaf node is replaced by the composite information vector of the cluster center.
Reorganization process: the nodes that the temporary Boolean array marks as merged away are removed and the trimmed storage arrays are regenerated. Once the reorganization process has finished, trimming is complete and the index still keeps the CPat-Tree structure.
In the present invention, the clustering process obtains the information vectors of the leaf nodes from the IFArray array, clusters them in the vector space, and computes the information vector corresponding to each cluster center as the composite information vector of that cluster. Whether leaf nodes are similar is determined as follows: the information each leaf carries is mapped to a point in the vector space, and the TOD method is used to cluster the leaf nodes.
The merging process merges adjacent leaf nodes that have similar information vectors: the leaf nodes are deleted and the parent node is set as a new leaf node. The merging process can be completed in a single traversal of the CPat-Tree nodes in reverse preorder. To avoid the frequent bit-shift operations that frequent merging would cause, the merging process uses a temporary Boolean array, equal in length to the TreeMap array, to record the deleted nodes; the trimmed storage arrays are then generated in one pass during the subsequent reorganization process. A True value means the corresponding node in the TreeMap array has not been deleted, and a False value means it has been deleted.
Description of drawings
Fig. 1: schematic diagram of the tree form of the CPat-Tree index model.
Fig. 2: schematic diagram of the storage arrays of the CPat-Tree index model.
Fig. 3: comparison of the storage size of CPat-Tree indexes built from 160000/320000 URLs under various similarity values (logarithmic plot).
Fig. 4: ratio of TreeMap remaining after trimming, at similarity 0.3-0.9, for CPat-Trees built from 32000/96000/160000/320000 URLs.
Fig. 5: trend of query efficiency, at similarity 0.3-0.9, for trimmed CPat-Trees built from 32000/96000/160000/320000 URLs.
Embodiment
The CPat-Tree index structure
The present invention uses an improved CPat-Tree to index the encoded URLs that uniquely identify network entities, together with their classification and grading information. Each encoded URL is inserted into the CPat-Tree as the key that uniquely identifies an entity, and the key generation rule (the URL encoding rule) guarantees that no key is a prefix of another key. The binary sequence encountered on the path from the root node to a leaf node corresponds to the bit sequence of the key. The model stores the whole CPat-Tree in several data structures: TreeMap, NodeMap, EList and IFArray. As defined for CPat-Tree, TreeMap and NodeMap are bit arrays that record the tree structure in preorder. EList and IFArray are arrays introduced by the present invention to record auxiliary information: EList records the bit sequences that label the nodes, and IFArray records the information vector of each leaf node.
Fig. 1 shows the structure of a simple improved CPat-Tree. Each node has one marker bit and several implicit bits. A marker bit of 0 indicates that the node is the left child of its parent; otherwise it is the right child. The implicit bits correspond to the chains of single-child nodes in a Full-Trie or Ordinary Trie. Dark nodes are nodes that carry an implicit bit sequence; light nodes carry no implicit bits. The number of implicit bits may be 0 or any positive integer. The root node is an empty node with neither a marker bit nor implicit bits. Fig. 2, corresponding to Fig. 1, shows the storage data structures of the improved CPat-Tree. The data structures are defined as follows:
(1) TreeMap bit array: marks the tree nodes of the CPat-Tree in preorder, with bit 0 for an internal node and bit 1 for a leaf node.
(2) NodeMap bit array: marks, in preorder, the number of bits each tree node contains. One bit 0 marks the node's marker bit and each bit 1 marks an implicit bit, so the number of 1 bits equals the number of implicit bits.
(3) EList bit array: records, in preorder, the bit values each tree node carries. The EList and NodeMap arrays have the same length, and each bit in NodeMap corresponds to the bit value at the same position in EList.
(4) IFArray array: stores the vectors of the keys in the information space. It records the information vectors of the keys in the order in which preorder traversal visits the leaf nodes. The number of vectors in IFArray equals the number of 1 bits (leaves) in the TreeMap array.
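The four array definitions above can be sketched as a preorder encoder. This is a minimal illustration under stated assumptions, not the patent's implementation: the `Node` class and `encode` function are hypothetical names, every internal node is assumed to have exactly two children, and the empty root is assumed to contribute a single placeholder 0 entry to NodeMap and EList so that every node owns exactly one run (the patent text does not say how the root is stored).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    marker: int                                   # 0 = left child, 1 = right child
    implicit: List[int] = field(default_factory=list)  # implicit bit values
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    info: Optional[Tuple[float, ...]] = None      # information vector, leaves only

def encode(root):
    """Serialize a tree into TreeMap, NodeMap, EList and IFArray in preorder."""
    tree_map, node_map, elist, if_array = [], [], [], []

    def visit(node):
        is_leaf = node.left is None and node.right is None
        tree_map.append(1 if is_leaf else 0)              # 0 internal, 1 leaf
        node_map.extend([0] + [1] * len(node.implicit))   # one 0, then a 1 per implicit bit
        elist.extend([node.marker] + node.implicit)       # bit values, aligned with NodeMap
        if is_leaf:
            if_array.append(node.info)
        else:
            visit(node.left)
            visit(node.right)

    visit(root)
    return tree_map, node_map, elist, if_array

# Small example tree: a root with one leaf and one internal child holding two leaves.
root = Node(0, [],
            left=Node(0, [1, 0], info=(0.0,)),
            right=Node(1, [],
                       left=Node(0, [1], info=(9.0,)),
                       right=Node(1, [], info=(5.0,))))
tree_map, node_map, elist, if_array = encode(root)
```

On this example the invariants stated in the text hold: EList and NodeMap have the same length, and the number of vectors in IFArray equals the number of 1 bits in TreeMap.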
Trimming algorithm
Clustering process:
Based on the information vectors of the leaf nodes, the clustering process maps each leaf node to a point in a vector space and clusters the points with a clustering method from pattern recognition. As a result, the leaf nodes of the CPat-Tree, which correspond to different URLs, are divided into different clusters, and the information vectors of all leaf nodes in a cluster are replaced by the information vector corresponding to the cluster center. The present invention mainly implements the clustering process with the TOD (Threshold Order-Dependent Clustering Algorithm) [4] method. Its main idea is as follows:
Define the set of sample points as all the leaf node information vectors {L_0, L_1, L_2, ..., L_u} in the IFArray array; a sufficiently small sphere radius γ as the decision boundary for whether a sample point belongs to a cluster; and an array C that stores the cluster centers and the leaf nodes they contain. First, L_0 is taken as the center C_0 of the first cluster. Then each sample point L_i (1 ≤ i ≤ u) is compared with the cluster centers in turn. If there is some C_j satisfying $\|L_i - C_j\| = \min_k \|L_i - C_k\|$ and $\|L_i - C_j\| < \gamma$, then L_i is assigned to the cluster centered at C_j and the center of cluster C_j is recomputed. If no cluster center satisfies this condition, L_i becomes the center of a new cluster.
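The TOD procedure described above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and assuming that "recomputing the center" means taking the running mean of the cluster's members; the patent specifies neither. The function name `tod_cluster` and its return shape are hypothetical.

```python
import math

def tod_cluster(vectors, gamma):
    """Threshold order-dependent clustering sketch.

    Returns (centers, members): centers[j] is the centroid of cluster j,
    members[j] lists the indices of the vectors assigned to it.
    """
    centers, members = [], []
    for i, v in enumerate(vectors):
        # Find the nearest existing cluster center, if any.
        best, best_dist = None, math.inf
        for j, c in enumerate(centers):
            d = math.dist(v, c)
            if d < best_dist:
                best, best_dist = j, d
        if best is not None and best_dist < gamma:
            # Join the nearest cluster and update its center as a running mean.
            members[best].append(i)
            n = len(members[best])
            centers[best] = tuple(c + (x - c) / n for c, x in zip(centers[best], v))
        else:
            # No center within radius gamma: start a new cluster at this point.
            centers.append(tuple(v))
            members.append([i])
    return centers, members

centers, members = tod_cluster([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)], gamma=1.0)
```

As the name suggests, the result depends on the order in which the sample points are visited, since each point is compared only with the centers that exist at that moment.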
Merging process:
To avoid the frequent shift operations that merging leaf nodes would otherwise cause, the merging process defines a temporary array of structures, equal in length to the TreeMap array, that marks the validity of each element of the TreeMap array after merging.
The data structures used by the merging process are defined as follows:
(1) the node currently being processed, L; the sibling node of L, B; the parent node of L and B, F;
(2) the cursor TreePos of the TreeMap array;
(3) NodeInfo, an array of structures equal in length to the TreeMap array, in which each structure holds the information of the corresponding node in TreeMap. Each structure contains the elements valid, father, brother and ifpos. The valid element indicates whether the current node is valid: true means valid and false means invalid. The father and brother elements hold the subscripts, in the TreeMap and NodeInfo arrays, of the parent node and the sibling node of the current node. The ifpos element holds the position of the current node's information vector in IFArray.
The steps of the merging process are as follows:
Step 1. {initialization} Initialize each element of the NodeInfo array, with the valid element initialized to true. Set TreePos ← TreeMap.size-1 so that it points to the last element of TreeMap.
Step 2. {check the validity of the current node} If TreePos ≤ 0, the node traversal is finished; go to step 7. Otherwise let the current node be L; if NodeInfo[TreePos].valid = false, L has already been deleted, so decrement TreePos by 1 and go to step 2.
Step 3. {locate a leaf node} If the TreeMap value that TreePos points to is 0, the node is a non-leaf node, so decrement TreePos by 1 and go to step 2. Otherwise the node is a leaf node; go to step 4.
Step 4. {locate the parent and sibling nodes} Use NodeInfo[TreePos].father and NodeInfo[TreePos].brother to locate the parent node F and the sibling node B of L. If B is marked 0 in TreeMap, the sibling is an internal node, so give up merging the current leaf, decrement TreePos by 1 and go to step 2. Otherwise go to step 5.
Step 5. {check whether the merging condition is satisfied} Use NodeInfo[TreePos].ifpos and NodeInfo[NodeInfo[TreePos].brother].ifpos to locate the information vectors of L and B. If L and B belong to the same cluster, go to step 6. Otherwise decrement TreePos by 1 and go to step 2.
Step 6. {merge the nodes} Delete the leaf nodes by setting the valid values of L and B in the NodeInfo array to false (node invalid). Make the parent node the new leaf node by setting the TreeMap bit corresponding to F to 1. In NodeInfo, point the information vector of the new leaf node at the position of L in IFArray, and write the composite information vector of L's cluster there. Take the new leaf node as the node for the next round by setting TreePos ← the TreeMap subscript of F, and go to step 2.
Step 7. {finish} The merging process ends.
In the above process, if the sibling node B of leaf node L is an internal node, or B is a leaf node but is not in the same cluster as L, then L and B cannot be merged. In that case the length of node L in NodeMap is reset to 1, which reduces storage space.
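Steps 1 to 7 above can be sketched as follows. This is a minimal illustration under stated assumptions: `build_node_info` derives the father, brother and ifpos fields from a preorder TreeMap of a tree in which every internal node has exactly two children; `cluster_of` (cluster id per IFArray position) and `centers` (composite vector per cluster id) are hypothetical inputs standing in for the clustering result; and the NodeMap length reset for unmergeable leaves is omitted.

```python
def build_node_info(tree_map):
    """Derive valid/father/brother/ifpos for each node from a preorder TreeMap."""
    info = [{"valid": True, "father": -1, "brother": -1, "ifpos": -1} for _ in tree_map]
    stack = []        # internal nodes still waiting for children: (index, children seen)
    leaves_seen = 0
    for i, bit in enumerate(tree_map):
        if stack:
            parent, kids = stack[-1]
            info[i]["father"] = parent
            kids.append(i)
            if len(kids) == 2:                 # both children known: link them as siblings
                a, b = kids
                info[a]["brother"], info[b]["brother"] = b, a
                stack.pop()
        if bit == 1:                           # leaf: takes the next IFArray slot
            info[i]["ifpos"] = leaves_seen
            leaves_seen += 1
        else:                                  # internal node: expects two children
            stack.append((i, []))
    return info

def merge_leaves(tree_map, if_array, cluster_of, centers):
    """Single reverse-preorder pass that merges sibling leaves of the same cluster."""
    tree_map, if_array = list(tree_map), list(if_array)
    info = build_node_info(tree_map)           # step 1
    pos = len(tree_map) - 1
    while pos > 0:                             # step 2: stop at the root
        node = info[pos]
        if not node["valid"] or tree_map[pos] == 0:     # steps 2-3: skip deleted/internal
            pos -= 1
            continue
        sib = node["brother"]                  # step 4
        if tree_map[sib] == 0:                 # sibling is internal: cannot merge
            pos -= 1
            continue
        lp, bp = node["ifpos"], info[sib]["ifpos"]      # step 5
        if cluster_of[lp] != cluster_of[bp]:
            pos -= 1
            continue
        parent = node["father"]                # step 6
        node["valid"] = info[sib]["valid"] = False
        tree_map[parent] = 1                   # parent becomes the new leaf
        info[parent]["ifpos"] = lp
        if_array[lp] = centers[cluster_of[lp]] # store the cluster's composite vector
        pos = parent                           # continue from the new leaf
    return tree_map, if_array, info

new_map, new_if, node_info = merge_leaves(
    [0, 1, 0, 1, 1], [(0.0,), (9.0,), (5.0,)], cluster_of=[0, 1, 1], centers=[(0.0,), (7.0,)])
```

In the example, the two rightmost leaves share a cluster and are merged into their parent, while the remaining leaf belongs to a different cluster and stays separate. Reusing L's IFArray slot for the new leaf keeps `cluster_of` valid without any reindexing.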
Reorganization process:
In the merging process, a merge can be carried out only when the current node and its sibling node are both leaf nodes and are in the same cluster. As a result of merging, some nodes are deleted. The valid element of the NodeInfo structure array marks whether a node has been deleted, and the reorganization process regenerates the storage arrays from this information.
Define nipos and nodepos as the cursors of the NodeInfo array and the NodeMap array respectively, and TreeMap', NodeMap', EList' and IFArray' as the storage arrays after reorganization. The steps of the reorganization process are as follows:
Step 1. {initialization} Initialize the cursors: nipos ← 0, nodepos ← 0.
Step 2. {skip deleted nodes} For the structure in the NodeInfo array that nipos points to, if NodeInfo[nipos].valid = false, the node has been deleted: increment nipos by 1, increment nodepos until NodeMap[nodepos] is 0 again, and go to step 2. Otherwise the node has not been deleted; go to step 3.
Step 3. {determine the type of node to output} If TreeMap[nipos] = 0, go to step 4 to output an internal node; otherwise go to step 5 to output a leaf node.
Step 4. {output an internal node} Set i = 0 and carry out sub-steps (a), (b) and (c) below to output the stored information of this node to TreeMap', NodeMap' and EList' respectively, then go to step 6.
(a) Output bit 0 to TreeMap'.
(b) Output NodeMap[nodepos+i] to NodeMap' and increment i by 1; repeat until NodeMap[nodepos+i] is 0 again, and record the output length as nodelen = i.
(c) Output to EList' the bit sequence of length nodelen starting at EList[nodepos].
Step 5. {output a leaf node} Set i = 0 and carry out sub-steps (a), (b), (c) and (d) below to output the stored information of this node to TreeMap', NodeMap', EList' and IFArray' respectively, then go to step 6.
(a) Output bit 1 to TreeMap'.
(b) Output NodeMap[nodepos+i] to NodeMap' and increment i by 1; repeat until NodeMap[nodepos+i] is 0 again, and record the output length as nodelen = i+1.
(c) Output to EList' the bit sequence of length nodelen starting at EList[nodepos].
(d) Output IFArray[NodeInfo[nipos].ifpos] to IFArray'.
Step 6. {move to the next node} Increment nipos by 1. If nipos ≥ TreeMap.size, all nodes have been traversed; go to step 7. Otherwise move to the next node: increment nodepos until NodeMap[nodepos] is 0 again, and go to step 2.
Step 7. {finish} The reorganization process ends.
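The reorganization steps above can be sketched as a single pass over the nodes. This is a minimal illustration under stated assumptions: the `valid` and `ifpos` lists stand in for the corresponding NodeInfo fields, every node (including the root) is assumed to own one NodeMap run consisting of a 0 followed by zero or more 1s, and the run length is computed once per node rather than with the separate nodelen bookkeeping of steps 4 and 5. The function name `regroup` is hypothetical.

```python
def regroup(tree_map, node_map, elist, if_array, valid, ifpos):
    """Copy the surviving nodes into fresh TreeMap', NodeMap', EList', IFArray'."""
    new_tree, new_node, new_elist, new_if = [], [], [], []
    nodepos = 0                                   # start of the current node's NodeMap run
    for nipos in range(len(tree_map)):
        # The run is the leading 0 plus all following 1s.
        end = nodepos + 1
        while end < len(node_map) and node_map[end] == 1:
            end += 1
        if valid[nipos]:                          # deleted nodes are simply skipped
            new_tree.append(tree_map[nipos])
            new_node.extend(node_map[nodepos:end])
            new_elist.extend(elist[nodepos:end])
            if tree_map[nipos] == 1:              # leaves also emit their information vector
                new_if.append(if_array[ifpos[nipos]])
        nodepos = end                             # advance to the next node's run
    return new_tree, new_node, new_elist, new_if

# Arrays as they might look after a merge that deleted the last two nodes.
t2, n2, e2, f2 = regroup(
    tree_map=[0, 1, 1, 1, 1],
    node_map=[0, 0, 1, 1, 0, 0, 1, 0],
    elist=[0, 0, 1, 0, 1, 0, 1, 1],
    if_array=[(0.0,), (9.0,), (7.0,)],
    valid=[True, True, True, False, False],
    ifpos=[-1, 0, 2, 1, 2],
)
```

Because the output arrays are built in one forward pass, no bits are ever shifted in place, which is the reason the merging process only marks nodes as invalid instead of removing them immediately.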
On the one hand, this trimming algorithm merges leaf nodes that are adjacent in the CPat-Tree structure and whose information vectors are in the same cluster; on the other hand, it cuts down the bit sequences that leaf nodes carry. Experiments show that the algorithm reduces the storage space of the CPat-Tree structure and improves query efficiency. Figs. 3, 4 and 5 give the test results of the trimming algorithm. Fig. 3 compares, under various similarity values, the storage size of the trimmed CPat-Tree index built from 160000 and 320000 URLs with its storage size before trimming. As can be seen, for CPat-Tree structures built from different numbers of URLs, the ratio of the trimmed storage size to the untrimmed storage size is essentially the same, and it reaches 5%-15% as the similarity varies from 0.3 to 0.9. Figs. 4 and 5 show, at similarity 0.3-0.9, the ratio of remaining nodes to original nodes for trimmed CPat-Trees built from 32000, 96000, 160000 and 320000 URLs, and the ratio by which query efficiency improves. As can be seen, the number of remaining nodes falls steadily with similarity, and query efficiency improves by 30%-60%.
In summary, the present invention proposes a network content grading index structure model based on CPat-Tree and a trimming algorithm implemented on that basis, which together can significantly reduce the storage size of the index structure and improve query efficiency.

Claims (7)

  1. A trimmable network content grading index structure implemented on CPat-Tree, characterized in that:
    (1) the grading index structure is made up jointly of the TreeMap, NodeMap, EList and IFArray arrays; wherein,
    (2) the TreeMap bit array records the tree node structure of the CPat-Tree in preorder, marking an internal node with bit 0 and an external node with bit 1;
    (3) the NodeMap bit array records, in preorder, the number of bits each node contains, marking each node with one bit 0 followed by some number of bits 1, the total number of which equals the number of bits the node contains;
    (4) the EList bit array stores, in preorder, the bit values each node contains;
    (5) the IFArray array records, in preorder, the information vectors carried by the leaf nodes.
  2. A trimming algorithm for a network content grading index structure based on CPat-Tree, characterized in that the basic steps are as follows:
    Clustering process: each leaf node in the CPat-Tree is mapped, according to its information vector, to a point in a vector space; using spatial pattern recognition, all points are divided into a number of non-overlapping clusters; each cluster is represented by a sphere of fixed radius, and the points of a cluster all lie within its sphere; leaf nodes that fall in the same sphere are considered to have similar information vectors, and the information vector corresponding to the center of the sphere serves as the composite information vector of the cluster;
    Merging process: according to the merging rule, the leaf nodes in each cluster are merged upward in turn, the leaf nodes being deleted and their parent node becoming a new leaf node; the information vector of the new leaf node is replaced by the composite information vector of the cluster center;
    Reorganization process: the nodes that the temporary Boolean array marks as merged away are removed and the trimmed storage arrays are regenerated; once the reorganization process has finished, trimming is complete and the index still keeps the CPat-Tree structure.
  3. The trimming method according to claim 2, characterized in that whether leaf nodes are similar is determined as follows: the information each leaf carries is mapped to a point in the vector space, and the TOD method is used to cluster the leaf nodes.
  4. The trimming method according to claim 2, characterized in that the merging process can be completed in a single traversal of the nodes of the CPat-Tree data structure in reverse preorder.
  5. The trimming method according to claim 2, characterized in that the merging process uses a temporary Boolean array, equal in length to the TreeMap array, to record the deleted nodes; a True value means the corresponding node in the TreeMap array has not been deleted, and a False value means it has been deleted.
  6. The cutting method according to claim 2, characterized in that the processing steps of the merging process are as follows:
    The data structures used by the merging process are defined as:
    (1) the node currently being processed, L; the sibling node B of L; and the parent node F of L and B;
    (2) the cursor TreePos into the TreeMap array;
    (3) NodeInfo, an array of structures with the same length as the TreeMap array, in which each structure records the information of the corresponding node in TreeMap. Each structure comprises the elements valid, father, brother and ifpos. The valid element indicates whether the current node is valid: true means valid and false means invalid. The father and brother elements hold the subscripts of the current node's parent node and sibling node in the TreeMap and NodeInfo arrays. The ifpos element holds the starting position of the current node in IFArray;
    Step 1. {Initialization} Initialize each element of the NodeInfo array, with the valid element initialized to true; set TreePos ← TreeMap.size-1 so that it points to the last element of TreeMap;
    Step 2. {Check the validity of the current node} If TreePos ≤ 0, the node traversal is finished; go to step 7. Otherwise let the current node be L; if NodeInfo[TreePos].valid = false, L has already been deleted: decrement TreePos by 1 and go to step 2; otherwise continue with step 3;
    Step 3. {Locate a leaf node} If the TreeMap value that TreePos points to is 0, the node is a non-leaf node: decrement TreePos by 1 and go to step 2. Otherwise the node is a leaf node; go to step 4;
    Step 4. {Locate the parent and sibling nodes} Use NodeInfo[TreePos].father and NodeInfo[TreePos].brother to locate the parent node F and the sibling node B of L. If B is marked 0 in TreeMap, the sibling node is an internal node, so the merge of the current leaf is abandoned: decrement TreePos by 1 and go to step 2. Otherwise go to step 5;
    Step 5. {Check whether the merging condition is satisfied} Use NodeInfo[TreePos].ifpos and NodeInfo[NodeInfo[TreePos].brother].ifpos to locate the information vectors of L and B. If L and B belong to the same cluster, go to step 6; otherwise decrement TreePos by 1 and go to step 2;
    Step 6. {Merge the nodes} Cancel the leaf nodes by setting the valid values of L and B in the NodeInfo array to false (node invalid). Make the parent node the new leaf node by setting the TreeMap bit corresponding to F to 1. In NodeInfo, point the information vector of the new leaf node at L's position in IFArray, and write the integrated information vector of the cluster containing L to that position. Take the new leaf node as the node for the next round of computation, TreePos ← the position of F in TreeMap, and go to step 2;
    Step 7. {Finish} The merging process ends.
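Steps 1 to 7 of the merging process can be sketched as follows. This is an illustrative reading of the claim under stated assumptions, not a definitive implementation: TreeMap is taken to be a preorder bit list of a full binary tree (0 = internal node, 1 = leaf node), IFArray is taken to hold one entry per leaf in preorder, and the clustering result is supplied through two hypothetical inputs, `cluster_of` (cluster id per IFArray entry) and `cluster_vector` (integrated vector per cluster id). The helper `build_node_info` is also hypothetical; the claim only says that NodeInfo is initialized.

```python
from dataclasses import dataclass


@dataclass
class NodeInfo:
    valid: bool = True   # False once the node has been merged away
    father: int = -1     # index of the parent node in TreeMap (-1 for the root)
    brother: int = -1    # index of the sibling node in TreeMap (-1 for the root)
    ifpos: int = -1      # position of the node's vector in IFArray (leaves only)


def build_node_info(tree_map):
    """Derive father/brother/ifpos from a preorder bit list (assumed layout)."""
    info = [NodeInfo() for _ in tree_map]
    open_parents = []  # [parent index, index of its first child or None]
    leaves_seen = 0
    for pos, bit in enumerate(tree_map):
        if open_parents:
            parent = open_parents[-1]
            info[pos].father = parent[0]
            if parent[1] is None:
                parent[1] = pos                 # pos is the first child
            else:
                info[pos].brother = parent[1]   # pos is the second child
                info[parent[1]].brother = pos
                open_parents.pop()
        if bit == 0:
            open_parents.append([pos, None])
        else:
            info[pos].ifpos = leaves_seen
            leaves_seen += 1
    return info


def merge(tree_map, if_array, cluster_of, cluster_vector):
    """Merge sibling leaves that fall in the same cluster (steps 1 to 7)."""
    tree_map, if_array = list(tree_map), list(if_array)   # work on copies
    info = build_node_info(tree_map)                      # step 1
    pos = len(tree_map) - 1
    while pos > 0:                                        # step 2
        node = info[pos]
        if not node.valid or tree_map[pos] == 0:          # steps 2 and 3
            pos -= 1
            continue
        sibling = info[node.brother]                      # step 4
        if tree_map[node.brother] == 0:                   # sibling is internal
            pos -= 1
            continue
        if cluster_of[node.ifpos] != cluster_of[sibling.ifpos]:   # step 5
            pos -= 1
            continue
        node.valid = sibling.valid = False                # step 6
        tree_map[node.father] = 1                         # parent becomes a leaf
        info[node.father].ifpos = node.ifpos
        if_array[node.ifpos] = cluster_vector[cluster_of[node.ifpos]]
        pos = node.father                                 # continue from F
    return tree_map, info, if_array                       # step 7


tree_map, info, if_array = merge(
    [0, 0, 1, 1, 1], ["a", "b", "c"], [0, 0, 1], {0: "ab", 1: "c"}
)
print(tree_map)                  # [0, 1, 1, 1, 1]
print([n.valid for n in info])   # [True, True, False, False, True]
```

In the example, the two leaves at indices 2 and 3 share cluster 0, so they are cancelled and their parent at index 1 becomes a leaf carrying the cluster vector. That new leaf is then compared with its own sibling at index 4, which belongs to a different cluster, so merging stops there.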
  7. The cutting method according to claim 2, characterized in that the processing steps of the regrouping process are as follows:
    Define nipos and nodepos as the cursors of the NodeInfo array and the NodeMap array respectively; TreeMap', NodeMap', EList' and IFArray' are the corresponding storage arrays after regrouping;
    Step 1. {Initialization} Initialize the cursors: nipos ← 0, nodepos ← 0;
    Step 2. {Skip cancelled nodes} For the structure in the NodeInfo array marked by nipos, if NodeInfo[nipos].valid = false, the node has been cancelled: increment nipos by 1, increment nodepos until NodeMap[nodepos] is 0 again, and go to step 2. Otherwise the node has not been cancelled; go to step 3;
    Step 3. {Determine the type of the node to output} If TreeMap[nipos] = 0, go to step 4 to output an internal node; otherwise go to step 5 to output a leaf node;
    Step 4. {Output an internal node} Set i = 0 and carry out the following three sub-steps (a), (b) and (c) to output the stored information of this node to TreeMap', NodeMap' and EList' respectively, then go to step 6;
    (a) output bit 0 to TreeMap';
    (b) output NodeMap[nodepos+i] to NodeMap' and increment i by 1; repeat until NodeMap[nodepos+i] is 0 again, and record the output length as nodelen = i;
    (c) output to EList' the bit sequence of length nodelen that begins at EList[nodepos].
    Step 5. {Output a leaf node} Set i = 0 and carry out the following four sub-steps (a), (b), (c) and (d) to output the stored information of this node to TreeMap', NodeMap', EList' and IFArray' respectively, then go to step 6;
    (a) output bit 1 to TreeMap';
    (b) output NodeMap[nodepos+i] to NodeMap' and increment i by 1; repeat until NodeMap[nodepos+i] is 0 again, and record the output length as nodelen = i+1;
    (c) output to EList' the bit sequence of length nodelen that begins at EList[nodepos];
    (d) output IFArray[NodeInfo[nipos].ifpos] to IFArray';
    Step 6. {Move to the next node} Increment nipos by 1. If nipos ≥ TreeMap.size, all nodes have been traversed; go to step 7. Otherwise move to the next node: increment nodepos until NodeMap[nodepos] is 0 again, and go to step 2;
    Step 7. {Finish} The regrouping process ends.
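The regrouping steps can be sketched as a single forward scan that copies only the surviving nodes. The sketch rests on assumptions that the claim does not spell out: each node's NodeMap segment is taken to start with a 0 bit and to run until the next 0 bit (or the end of the array), EList is taken to be aligned position by position with NodeMap, and the same segment length is used for internal and leaf nodes (the claim records nodelen = i for internal nodes and nodelen = i+1 for leaf nodes). The names `segment_end` and `regroup` are hypothetical, and `valid` and `ifpos` stand for the corresponding NodeInfo fields passed as plain lists.

```python
def segment_end(node_map, start):
    """Return the index just past the NodeMap segment that begins at start."""
    end = start + 1
    while end < len(node_map) and node_map[end] != 0:
        end += 1
    return end


def regroup(tree_map, valid, ifpos, node_map, e_list, if_array):
    """Copy the records of surviving nodes into fresh arrays, in preorder."""
    new_tree, new_node, new_elist, new_if = [], [], [], []
    nodepos = 0
    for nipos, bit in enumerate(tree_map):
        end = segment_end(node_map, nodepos)
        if valid[nipos]:                                # step 2: skip cancelled nodes
            new_tree.append(bit)                        # sub-step (a)
            new_node.extend(node_map[nodepos:end])      # sub-step (b)
            new_elist.extend(e_list[nodepos:end])       # sub-step (c)
            if bit == 1:                                # sub-step (d), leaves only
                new_if.append(if_array[ifpos[nipos]])
        nodepos = end                                   # step 6: next node's segment
    return new_tree, new_node, new_elist, new_if


result = regroup(
    tree_map=[0, 1, 1, 1, 1],
    valid=[True, True, False, False, True],
    ifpos=[-1, 1, 0, 1, 2],
    node_map=[0, 1, 0, 0, 1, 1, 0, 0, 1],
    e_list=[10, 11, 20, 30, 31, 32, 40, 50, 51],
    if_array=["a", "ab", "c"],
)
print(result)
```

The inputs in the example correspond to a five-node tree in which the nodes at indices 2 and 3 were cancelled by a preceding merge, so only three node records reach the output arrays.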
CN 200510027784 2005-07-15 2005-07-15 Network content grading index structure based on CPat-Tree and cutting method Pending CN1719442A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200510027784 CN1719442A (en) 2005-07-15 2005-07-15 Network content grading index structure based on CPat-Tree and cutting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200510027784 CN1719442A (en) 2005-07-15 2005-07-15 Network content grading index structure based on CPat-Tree and cutting method

Publications (1)

Publication Number Publication Date
CN1719442A true CN1719442A (en) 2006-01-11

Family

ID=35931276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200510027784 Pending CN1719442A (en) 2005-07-15 2005-07-15 Network content grading index structure based on CPat-Tree and cutting method

Country Status (1)

Country Link
CN (1) CN1719442A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101063972B (en) * 2006-04-28 2010-05-12 国际商业机器公司 Method and apparatus for enhancing visuality of image tree

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication