CN1719442A - Network content grading index structure based on CPat-Tree and cutting method - Google Patents
- Publication number
- CN1719442A (application CN 200510027784)
- Authority
- CN
- China
- Prior art keywords
- node
- treemap
- array
- nodemap
- bit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a hierarchical content index structure model for URLs and the network entities they point to, together with a pruning algorithm based on CPat-Tree. The hierarchical index structure uses the TreeMap, NodeMap, EList and IFArray arrays to record the URLs of entities and their rating information. The invention also provides the concrete steps of the pruning algorithm. The invention can greatly reduce the storage size of the CPat-Tree index structure, reduce the number of disk accesses and the CPU cost during queries, and achieve high query efficiency.
Description
Technical field
The invention belongs to the fields of document information retrieval and Internet technology, and specifically relates to a classification and rating index structure for Internet content entities and a pruning method operating on this structure.
Technical background
With the rapid development of network technology, the Internet has become ever more deeply embedded in people's work and lives and is now an important channel for publishing and obtaining information. However, because publishing and obtaining information on the Internet is highly anonymous, highly private, highly interactive and unconstrained by geography, it is difficult to manage and restrict content of uneven quality effectively, which has had a serious negative effect on public life and social production.
In public life, the Internet today is full of harmful information whose main content is violence, pornography, and anti-government and antisocial material; it disturbs what people see and hear and adversely affects the healthy development of society. According to 2003 statistics from the U.S. company N2H2, roughly 8% of web pages worldwide are pornographic, and one quarter of the requests submitted to search engines each day relate to pornography; websites, web pages and e-mail with anti-government or antisocial content are likewise pervasive.
In social production, the computer network often has a serious negative effect on an enterprise's corporate culture, production costs and efficiency [1]. First, employees often use the Internet during working hours for activities unrelated to work, which takes up working time and directly lowers productivity. Second, work-unrelated Internet use by employees consumes a large share of the enterprise's outbound Internet bandwidth, and many malicious websites bring worms and spam, for which the enterprise must pay substantial additional costs. The malicious code, Trojan horse programs and spyware contained in unsafe web pages can also steal and destroy an enterprise's commercial and research information. A report from IDC points out that 30%-40% of enterprise outbound network bandwidth is consumed for employees' private purposes; a report from the North American management federation points out that 27% of Fortune 500 companies have been caught up in scandals involving employees' improper Internet use and the spreading of pornographic information by e-mail.
The negative effect of the network is greater, and the range of harmful content wider, than people expected. To ensure the content security of Internet information, to open up a "clean" network world for the public, and to give enterprises a technical guarantee for the safe and reasonable use of Internet resources, researching and developing network content security tools has important practical significance [7].
Common network content security techniques currently include label filtering, keyword filtering, URL filtering, category filtering and content filtering [2] [3]. Because these techniques variously suffer from being difficult to supervise and enforce, difficulty in keeping rating information up to date, and high implementation complexity, the network content filtering systems that have been developed and put into operation usually combine several of them.
URL filtering is one of the most commonly used techniques in network content filtering systems. It decides whether to intercept a user request mainly by comparing the request against predefined rating information. The index structures used by traditional URL filtering are mainly the Trie and the hash table [5]. The advantage of a Trie-based index is very high query efficiency; its drawbacks are that an implementation consumes a large amount of memory and that serialization and deserialization are relatively complex, requiring complicated conversion for disk access and network transmission. The advantage of a hash table is likewise very high query efficiency, and because the index is represented as a linear array, serialization and deserialization are simpler. Its drawbacks are that selecting and using a hash function adds implementation complexity and that the storage space of the hash table remains relatively high.
The innovation of the present invention is a network content hierarchical index structure built on the CPat-Tree [6] structure, together with a pruning method implemented directly on the storage arrays. With this index structure and pruning algorithm, the storage size of the content rating index can be made very small, and query efficiency is also significantly improved.
The present invention can be used not only for URL interception in content filtering systems but also in other areas of information retrieval, for example content delivery networks (CDN, Content Delivery Network) [8] and multicast routing tree (Multicast Routing Tree) [9] modeling.
References
1. Jacob Palme. Information Filtering. http://cmc.dsv.su.se/select/information-filtering.pdf, 1998-06-01.
2. Jonathan Zittrain, Benjamin Edelman. Internet Filtering in China. IEEE Internet Computing, 2003, 7(2): 70-77.
3. Justin Basilico, Thomas Hofmann. A joint framework for collaborative and content filtering. 27th Annual International ACM SIGIR Conference. NY: ACM Press, 2004: 550-551.
4. Menahem Friedman and Abraham Kandel. Introduction to Pattern Recognition - Statistical, Structural, Neural And Fuzzy Logic Approaches. World Scientific, 1999.
5. Zornitza Genova Prodanoff. Performance Evaluation of URL Routing for Content Distribution Networks. SF: University of South Florida, 2003.
6. M. Shishibori, M. Okada, T. Sumitomo and J. Aoe. Design of a Compact Data Structure for the Patricia Trie. IEICE Transactions on Information and Systems, 1998, Vol. E81-D, No. 4, pp. 364-371.
7. Chen Ding, Chi-Hung Chi, Jing Deng, and Chun-Lei Dong. Centralized Content-Based Web Filtering and Blocking: How Far Can It Go? In Proceeding of IEEE International Conference on Systems, Man and Cybernetics (SMC), 1999.
8. Survey of Content Delivery Networks (CDNs), http://cgi.di.uoa.gr/~grad0377/cdnsurvey.pdf.
9. Gaurav Sharma. Internet topology and tomography. https://engineering.purdue.edu/people/gaurav.sharma.3/Reports/Modeling.ppt, 2005-04.
Symbol table (meaning of the symbols used throughout this document)
TreeMap: bit array recording the node structure of the CPat-Tree in preorder
NodeMap: bit array recording, in preorder, the number of bits each CPat-Tree node contains
EList: bit array recording, in preorder, the bit values each CPat-Tree node contains
IFArray: information vectors of the CPat-Tree leaf nodes, recorded in preorder
$\{L_0, L_1, L_2, \ldots, L_u\}$: set of information vectors of the leaf nodes in the CPat-Tree structure
C: mapping from cluster center points to the corresponding leaf nodes
$C_j$: center of the j-th cluster
$\gamma$: fixed radius of the cluster spheres
L: the current node
B: the sibling node of the current node
F: the parent node of the current node
TreePos: cursor into the TreeMap array
valid, father, brother, ifpos: the valid element indicates whether the current node is valid; the father and brother elements mark the indices of the current node's parent and sibling in the TreeMap and NodeInfo arrays; the ifpos element marks the starting position of the current node in IFArray
nipos: cursor into the NodeInfo array
nodepos: cursor into the NodeMap array
TreeMap', NodeMap', EList', IFArray': the storage arrays after regrouping
Summary of the invention
The object of the present invention is to propose a network content hierarchical index structure model based on CPat-Tree, and a pruning algorithm based on this structure, so that the storage space of the index structure is significantly reduced.
The concept of a URL is introduced first.
URL is the abbreviation of Uniform Resource Locator. Its structure is: protocol://hostname:port/directory path/filename.
A URL corresponds to a concrete data object on a website or server: for example, a URL may correspond to a portal site or a BBS server, or to a particular picture under a directory of a website. Therefore, to stop a user from accessing a certain website, server or data object, it suffices to stop the network user from sending the request for that URL.
The protocol segment indicates the type of Internet resource; for example, http denotes the Hypertext Transfer Protocol, i.e. the WWW. Other protocols include ftp (File Transfer Protocol), telnet (remote login), news (newsgroups), mailto (e-mail) and mms (streaming media).
The hostname segment gives the name of the Internet server, for example www.fudan.edu.cn. The directory path segment indicates the location of a file on the server. The levels of the directory path are separated by a forward slash (/).
The filename segment is the actual name of the document, image or script to be accessed, for example index.html, logo.gif or script.cgi. The port number, directory path and filename are all optional parts of a URL.
Some example URLs are given below:
http://www.w3.org/index.html: this URL corresponds to a website
http://10.64.130.4/images/advice.gif: this URL corresponds to a picture
ftp://10.11.3.8: this URL corresponds to an FTP server
mms://10.11.4.6/abc.avi: this URL is used to play an audio/video program on demand
telnet://bbs.fudan.edu.en: this URL corresponds to a BBS server
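As an illustration of the URL segments described above, the following minimal Python sketch splits a URL into protocol, host, port, directory path and filename using the standard library. The helper name `url_parts` and the dictionary keys are assumptions made for this example; they are not part of the invention.

```python
from urllib.parse import urlsplit


def url_parts(url: str) -> dict:
    """Split a URL into the segments named in the text (hypothetical helper)."""
    parts = urlsplit(url)
    # Everything before the last "/" is the directory path, the rest the filename.
    directory, _, filename = parts.path.rpartition("/")
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,        # None when the URL omits the port
        "directory": directory,
        "filename": filename,      # empty when the URL names no file
    }


print(url_parts("http://10.64.130.4/images/advice.gif"))
print(url_parts("ftp://10.11.3.8"))
```

The optional segments (port, directory path, filename) simply come back as `None` or empty strings when a URL leaves them out.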
The network content hierarchical index structure proposed by the present invention is based on CPat-Tree. The model stores the whole CPat-Tree in several data structures: TreeMap, NodeMap, EList and IFArray. As defined for CPat-Tree [6], TreeMap and NodeMap are bit arrays that record the tree structure in preorder; EList and IFArray are auxiliary arrays introduced by the present invention: EList records the bit sequences of the node labels, and IFArray records the information vectors corresponding to the nodes. Specifically:
(1) the hierarchical index structure is made up of the TreeMap, NodeMap, EList and IFArray arrays, where
(2) the TreeMap bit array records the tree node structure of the CPat-Tree in preorder, marking internal nodes with bit 0 and external nodes with bit 1;
(3) the NodeMap bit array records, in preorder, the number of bits each node contains, marking each node with a combination of one bit 0 and some number of bits 1 whose total count equals the number of bits the node contains;
(4) the EList bit array stores, in preorder, the bit values each node contains;
(5) the IFArray array records, in preorder, the information vectors carried by the leaf nodes.
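To make the relationship between NodeMap and EList concrete, here is a small Python sketch that cuts the two bit arrays into per-node pieces. It assumes that each node's run in NodeMap starts with its single 0 bit (the marker position) and is followed by its 1 bits, and that EList has the same length as NodeMap; the function name `split_nodes` is hypothetical.

```python
def split_nodes(node_map, e_list):
    """Return one (marker_bit, implicit_bits) pair per node, in preorder.

    Assumption of this sketch: a node's run in node_map is a single 0
    followed by zero or more 1s, and e_list holds the matching bit values.
    """
    assert len(node_map) == len(e_list)
    nodes, start = [], 0
    for pos in range(1, len(node_map) + 1):
        # A run ends where the next 0 bit begins, or at the end of the array.
        if pos == len(node_map) or node_map[pos] == 0:
            bits = e_list[start:pos]
            nodes.append((bits[0], bits[1:]))
            start = pos
    return nodes


node_map = [0, 1, 1, 0, 0, 1]
e_list = [0, 1, 0, 1, 0, 1]
print(split_nodes(node_map, e_list))
# [(0, [1, 0]), (1, []), (0, [1])]
```

Because the 0 bits act as separators, no per-node length field is needed: the length of each run is the number of bits that node carries.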
The pruning algorithm proposed by the present invention is implemented directly on the storage arrays of the index structure. Its main idea is that the network entities under many websites or directories on the Internet have identical or similar information vectors and share the same URL prefix. URLs with the same prefix and identical or similar information vectors can be replaced by their common prefix, which reduces the storage space of the index structure. The basic steps are as follows:
Cluster process: each leaf node in the CPat-Tree is mapped to a point in the vector space according to its information vector. Using spatial pattern recognition methods, all points are divided into a number of non-overlapping clusters. Each cluster is identified by a sphere of fixed radius, and the points of a cluster all lie inside its sphere. Leaf nodes that fall in the same sphere are considered to have similar (or identical) information vectors; the information vector corresponding to the center of the sphere serves as the integrated information vector of the cluster.
Merging process: according to the merge rules, the leaf nodes in each cluster are merged upward in turn: the leaf nodes are deleted and their parent node becomes a new leaf node. The information vector of the new leaf node is replaced with the integrated information vector of the cluster center.
Regrouping process: the nodes marked as merged away in the temporary Boolean array are removed and the pruned storage arrays are regenerated. After the regrouping process the pruning is complete, and the index still retains the CPat-Tree structure.
In the present invention, the cluster process obtains the information vectors of the leaf nodes from the IFArray array, clusters them in the vector space, and computes the information vector corresponding to each cluster's center as the integrated information vector of that cluster. Whether leaf nodes are similar is determined from the points in the vector space to which their information vectors map; the TOD method is used to cluster the leaf nodes.
The merging process merges adjacent leaf nodes with similar information vectors: the leaf nodes are cancelled and their parent node becomes a new leaf node. The merging process can be completed in a single traversal of the nodes of the CPat-Tree data structure in reverse preorder. To avoid the frequent bit shifting that frequent merge operations would cause, the merging process uses a temporary Boolean array of the same length as the TreeMap array to record the cancelled nodes, and the pruned storage arrays are generated in one pass in the subsequent regrouping process. A True value indicates that the corresponding node in the TreeMap array has not been cancelled; a False value indicates that it has been cancelled.
Description of drawings
Fig. 1: tree diagram of the CPat-Tree index model.
Fig. 2: storage arrays of the CPat-Tree index model.
Fig. 3: comparison of the storage size of CPat-Tree indexes built from 160000/320000 URLs under various similarities (logarithmic plot).
Fig. 4: remaining ratio of TreeMap after pruning for CPat-Trees built from 32000/96000/160000/320000 URLs at similarities 0.3-0.9.
Fig. 5: trend of query efficiency of the pruned CPat-Trees generated from 32000/96000/160000/320000 URLs at similarities 0.3-0.9.
Embodiment
The CPat-Tree index structure
The present invention uses an improved CPat-Tree to index the URL encodings that uniquely identify network entities together with their classification and rating information. Each URL encoding is inserted into the CPat-Tree as the key that uniquely identifies an entity, and the key generation rule (the URL encoding rule) guarantees that no key is a prefix of another key. The binary sequence traversed from the root node to a leaf node corresponds to the bit sequence of the key. The model stores the whole CPat-Tree in several data structures: TreeMap, NodeMap, EList and IFArray. As defined for CPat-Tree, TreeMap and NodeMap are bit arrays that record the tree structure in preorder; EList and IFArray are auxiliary arrays introduced by the present invention: EList records the bit sequences of the node labels, and IFArray records the information vectors corresponding to the nodes.
Figure 1 shows the structure of a simple improved CPat-Tree. Each node has one marker bit and some number of implicit bits. A marker bit of 0 indicates that the node is the left child of its parent; otherwise it is the right child. The implicit bits correspond to the single-child node sequences of a Full-Trie or Ordinary Trie. Dark nodes are nodes that have an implicit bit sequence; light nodes have no implicit bits. The number of implicit bits may be 0 or a positive integer. The root node is an empty node with neither a marker bit nor implicit bits. Figure 2 shows the storage data structures of the improved CPat-Tree corresponding to Figure 1. The data structures are defined as follows:
(1) TreeMap bit array: marks the tree nodes of the CPat-Tree in preorder, with bit 0 marking an internal node and bit 1 marking a leaf node.
(2) NodeMap bit array: marks, in preorder, the number of bits each tree node contains. One bit 0 marks the marker bit of a node and bits 1 mark its implicit bits; the number of 1 bits equals the number of implicit bits.
(3) EList bit array: records, in preorder, the bit values of each tree node. The EList and NodeMap arrays have the same length, and each bit in NodeMap corresponds to one bit value in EList.
(4) IFArray array: stores the vectors of the keys in the information space. It records the information vectors of the keys in the order in which the leaf nodes are visited by a preorder traversal. The number of vectors in IFArray equals the number of 1 bits (leaves) in the TreeMap array.
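The four array definitions above can be illustrated with a short Python sketch that serializes a small binary tree in preorder. The `Node` class, its field names and the `serialize` function are assumptions introduced for this example; following the description of the root as an empty node, the sketch lets the root contribute a TreeMap bit but no NodeMap or EList bits.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    marker: Optional[int] = None   # 0 = left child, 1 = right child, None = root
    implicit: List[int] = field(default_factory=list)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    info: Optional[tuple] = None   # information vector, leaves only


def serialize(root: Node):
    """Emit TreeMap, NodeMap, EList and IFArray by preorder traversal."""
    tree_map, node_map, e_list, if_array = [], [], [], []
    stack = [root]
    while stack:
        n = stack.pop()
        is_leaf = n.left is None and n.right is None
        tree_map.append(1 if is_leaf else 0)
        if n.marker is not None:                  # the empty root carries no bits
            node_map += [0] + [1] * len(n.implicit)
            e_list += [n.marker] + n.implicit
        if is_leaf:
            if_array.append(n.info)
        else:
            # Push right first so the left child is visited first (preorder).
            stack.extend(c for c in (n.right, n.left) if c is not None)
    return tree_map, node_map, e_list, if_array


root = Node(
    left=Node(marker=0, implicit=[1, 0], info=(1, 0)),
    right=Node(marker=1,
               left=Node(marker=0, implicit=[1], info=(0, 1)),
               right=Node(marker=1, info=(0, 1))),
)
tree_map, node_map, e_list, if_array = serialize(root)
print(tree_map)   # [0, 1, 0, 1, 1]
print(node_map)   # [0, 1, 1, 0, 0, 1, 0]
print(e_list)     # [0, 1, 0, 1, 0, 1, 1]
print(if_array)   # [(1, 0), (0, 1), (0, 1)]
```

In the output, NodeMap and EList have the same length, and IFArray has one entry per 1 bit in TreeMap, matching items (3) and (4).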
Pruning algorithm
Cluster process:
According to the information vectors of the leaf nodes, the cluster process maps each leaf node to a point in the vector space and clusters the points using a clustering method from pattern recognition. As a result of the clustering, the leaf nodes of the CPat-Tree corresponding to different URLs are divided into different clusters, and the information vectors of all leaf nodes in a cluster are replaced with the information vector corresponding to the cluster center. The present invention implements the cluster process mainly on the basis of the TOD (Threshold Order-Dependent Clustering Algorithm) method [4]. Its main idea is as follows:
The sample point set is defined as all the leaf node information vectors $\{L_0, L_1, L_2, \ldots, L_u\}$ in the IFArray array; a sufficiently small sphere radius $\gamma$ serves as the decision boundary for whether a sample point belongs to a given cluster; the array C stores the cluster center points and the leaf nodes they contain. First, $L_0$ is taken as the center $C_0$ of the first cluster. Then each sample point $L_i$ ($1 \le i \le u$) is compared in order with the cluster centers. If there exists some $C_j$ that satisfies [formula missing] and $\|L_i - C_j\| < \gamma$, then $L_i$ is assigned to the cluster centered on $C_j$ and the center of cluster $C_j$ is recomputed; if no cluster center satisfies the above conditions, $L_i$ becomes the center of a new cluster.
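The threshold, order-dependent clustering just described can be sketched in Python as follows. This is only an illustration under stated assumptions: among the centers closer than γ the sketch picks the nearest one, and it recomputes a center as the mean of its members; neither choice is confirmed by the text. The function name `tod_cluster` and its return layout are hypothetical.

```python
import math


def tod_cluster(vectors, gamma):
    """Order-dependent threshold clustering (illustrative sketch).

    Returns (centers, members), where members[j] lists the indices of the
    vectors assigned to cluster j.
    """
    centers, members = [], []
    for i, v in enumerate(vectors):
        best, best_d = None, gamma
        for j, c in enumerate(centers):
            d = math.dist(v, c)
            if d < best_d:                 # strictly inside the sphere, and nearer
                best, best_d = j, d
        if best is None:
            centers.append(list(v))        # v becomes the center of a new cluster
            members.append([i])
        else:
            members[best].append(i)
            pts = [vectors[k] for k in members[best]]
            # Recompute the center as the mean of the cluster's members (assumption).
            centers[best] = [sum(col) / len(pts) for col in zip(*pts)]
    return centers, members


vectors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2)]
centers, members = tod_cluster(vectors, gamma=1.0)
print(members)   # [[0, 1], [2, 3]]
```

Because points are processed in order and centers move as members are added, a different input order can yield different clusters, which is what "order-dependent" refers to.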
Merging process:
To avoid the frequent shift operations that merging leaf nodes would produce, the merging process defines a temporary array of structures with the same length as the TreeMap array, used to mark the validity of each element of the TreeMap array after merging.
The data structures used by the merging process are defined as follows:
(1) the node currently being processed, L; the sibling node B of L; the parent node F of L and B;
(2) the cursor TreePos of the TreeMap array;
(3) NodeInfo, an array of structures with the same length as the TreeMap array, in which each structure holds the information of the corresponding node in TreeMap. Each structure contains the elements valid, father, brother and ifpos. The valid element indicates whether the current node is valid: true means valid and false means invalid. The father and brother elements mark the indices of the current node's parent and sibling in the TreeMap and NodeInfo arrays. The ifpos element marks the starting position of the current node in IFArray.
The steps of the merging process are as follows:
Step 1. {Initialization} Initialize each element of the NodeInfo array, with the valid element initialized to true; set TreePos ← TreeMap.size-1 so that it points to the last element of TreeMap.
Step 2. {Check the validity of the current node} If TreePos ≤ 0, the node traversal is finished; go to step 7. Otherwise let the current node be L; if NodeInfo[TreePos].valid=false, L has been deleted: decrement TreePos by 1 and go to step 2. Otherwise continue with step 3.
Step 3. {Locate a leaf node} If the value of TreeMap at TreePos is 0, the node pointed to is a non-leaf node: decrement TreePos by 1 and go to step 2. Otherwise it is a leaf node; go to step 4.
Step 4. {Locate the parent and sibling} Locate the positions of L's parent F and sibling B from NodeInfo[TreePos].father and NodeInfo[TreePos].brother. If B is marked 0 in TreeMap, the sibling is an internal node, so the merge of the current leaf is abandoned: decrement TreePos by 1 and go to step 2. Otherwise go to step 5.
Step 5. {Check the merge condition} Locate the information vectors of L and B from NodeInfo[TreePos].ifpos and NodeInfo[NodeInfo[TreePos].brother].ifpos. If L and B are in the same cluster, go to step 6; otherwise decrement TreePos by 1 and go to step 2.
Step 6. {Merge the nodes} Cancel the leaf nodes: set the valid values of L and B in the NodeInfo array to false (node invalid). Make the parent the new leaf node: set the TreeMap bit corresponding to F to 1. Point the information vector of the new leaf node in NodeInfo to L's position in IFArray and store there the integrated information vector of L's cluster. Take the new leaf node as the node for the next round: set TreePos to the position of F in TreeMap, and go to step 2.
Step 7. {End} The merging process ends.
In the above process, if the sibling B of leaf node L is an internal node, or B is a leaf node but not in the same cluster as L, then L and B cannot be merged. In that case the length of node L in NodeMap can be reset to 1, which reduces storage space.
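The merging steps above can be sketched in Python as follows. Several things here are assumptions made for illustration: the text does not say how NodeInfo is filled in, so `build_node_info` derives `father`, `brother` and `ifpos` from TreeMap on the assumption that every internal node has exactly two children; `cluster_of` maps each IFArray slot to a cluster index and `centers` holds the integrated vectors; and the optional shortening of unmerged leaves in NodeMap is not covered.

```python
def build_node_info(tree_map):
    """Derive father/brother/ifpos per node (assumes a full binary tree)."""
    info = [{"valid": True, "father": -1, "brother": -1, "ifpos": -1}
            for _ in tree_map]
    open_parents = []                  # entries: [index, first_child or None]
    leaves_seen = 0
    for i, bit in enumerate(tree_map):
        if open_parents:
            parent = open_parents[-1]
            info[i]["father"] = parent[0]
            if parent[1] is None:
                parent[1] = i                      # i is the first child
            else:
                info[i]["brother"] = parent[1]     # i is the second child
                info[parent[1]]["brother"] = i
                open_parents.pop()
        if bit == 1:
            info[i]["ifpos"] = leaves_seen
            leaves_seen += 1
        else:
            open_parents.append([i, None])
    return info


def merge(tree_map, info, if_array, cluster_of, centers):
    """Reverse-preorder merge; mutates tree_map, info and if_array in place."""
    pos = len(tree_map) - 1                              # step 1
    while pos > 0:                                       # step 2
        node = info[pos]
        if not node["valid"] or tree_map[pos] == 0:      # steps 2 and 3
            pos -= 1
            continue
        b, f = node["brother"], node["father"]           # step 4
        if tree_map[b] == 0:                             # sibling is internal
            pos -= 1
            continue
        if cluster_of[node["ifpos"]] != cluster_of[info[b]["ifpos"]]:  # step 5
            pos -= 1
            continue
        node["valid"] = info[b]["valid"] = False         # step 6
        tree_map[f] = 1
        info[f]["ifpos"] = node["ifpos"]
        if_array[node["ifpos"]] = centers[cluster_of[node["ifpos"]]]
        pos = f


tree_map = [0, 1, 0, 1, 1]
if_array = [(1, 0), (0, 1), (0, 1)]
cluster_of = [0, 1, 1]                 # cluster index of each IFArray slot
centers = [(1, 0), (0, 1)]
info = build_node_info(tree_map)
merge(tree_map, info, if_array, cluster_of, centers)
print(tree_map)                        # [0, 1, 1, 1, 1]
print([n["valid"] for n in info])      # [True, True, True, False, False]
```

Note that the arrays keep their original length after merging; the cancelled nodes are only flagged, which is what lets the later regrouping pass remove them all at once.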
Regrouping process:
In the merging process, a merge is carried out only when the current node and its sibling are both leaf nodes and lie in the same cluster. The result of the merging process is that some nodes are cancelled; the valid element of the NodeInfo structure array marks whether a node has been cancelled, and the regrouping process regenerates the storage arrays from this information.
nipos and nodepos are defined as the cursors of the NodeInfo array and the NodeMap array respectively; TreeMap', NodeMap', EList' and IFArray' are the storage arrays after regrouping. The steps of the regrouping process are as follows:
Step 1. {Initialization} Initialize the cursors: nipos ← 0, nodepos ← 0.
Step 2. {Skip cancelled nodes} For the structure in the NodeInfo array marked by nipos, if NodeInfo[nipos].valid=false, the node has been cancelled: increment nipos by 1, advance nodepos until NodeMap[nodepos] is 0 again, and go to step 2. Otherwise the node has not been cancelled; go to step 3.
Step 3. {Determine the type of node to output} If TreeMap[nipos]=0, go to step 4 to output an internal node; otherwise go to step 5 to output a leaf node.
Step 4. {Output an internal node} Let i=0 and carry out the following three sub-steps (a), (b), (c) to output the stored information of this node to TreeMap', NodeMap' and EList' respectively, then go to step 6.
(a) Output bit 0 to TreeMap'.
(b) Output NodeMap[nodepos+i] to NodeMap' and increment i by 1; repeat until NodeMap[nodepos+i] is 0 again, and record the output length as nodelen=i.
(c) Output to EList' the bit sequence of length nodelen starting at EList[nodepos].
Step 5. {Output a leaf node} Let i=0 and carry out the following four sub-steps (a), (b), (c), (d) to output the stored information of this node to TreeMap', NodeMap', EList' and IFArray' respectively, then go to step 6.
(a) Output bit 1 to TreeMap'.
(b) Output NodeMap[nodepos+i] to NodeMap' and increment i by 1; repeat until NodeMap[nodepos+i] is 0 again, and record the output length as nodelen=i+1.
(c) Output to EList' the bit sequence of length nodelen starting at EList[nodepos].
(d) Output IFArray[NodeInfo[nipos].ifpos] to IFArray'.
Step 6. {Move to the next node} Increment nipos by 1. If nipos ≥ TreeMap.size, all nodes have been traversed; go to step 7. Otherwise move to the next node: advance nodepos until NodeMap[nodepos] is 0 again, and go to step 2.
Step 7. {End} The regrouping process ends.
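The regrouping pass can be sketched in Python as a single compaction over the arrays. This sketch departs from the step listing in one respect, which is an assumption of the example: instead of advancing nodepos bit by bit, it first computes where each node's run starts in NodeMap (at every 0 bit). It also assumes that any nodes without a run, such as an empty root, come first in TreeMap. The function name `regroup` and the parallel `valid` and `ifpos` lists (standing in for the NodeInfo elements) are hypothetical.

```python
def regroup(tree_map, node_map, e_list, if_array, valid, ifpos):
    """Rebuild the storage arrays, dropping nodes whose valid flag is False."""
    # A node's run in NodeMap begins at each 0 bit; add a sentinel at the end.
    starts = [p for p, bit in enumerate(node_map) if bit == 0] + [len(node_map)]
    # Leading nodes that carry no bits at all (assumed: at most the root).
    offset = len(tree_map) - (len(starts) - 1)
    new_tree, new_node, new_e, new_if = [], [], [], []
    for i, bit in enumerate(tree_map):
        if not valid[i]:
            continue                       # cancelled during merging: skip
        new_tree.append(bit)
        if i >= offset:
            lo, hi = starts[i - offset], starts[i - offset + 1]
            new_node += node_map[lo:hi]    # copy this node's run unchanged
            new_e += e_list[lo:hi]
        if bit == 1:
            new_if.append(if_array[ifpos[i]])
    return new_tree, new_node, new_e, new_if


# State after one merge: nodes 3 and 4 were cancelled and node 2 became a leaf.
tree_map = [0, 1, 1, 1, 1]
node_map = [0, 1, 1, 0, 0, 1, 0]
e_list = [0, 1, 0, 1, 0, 1, 1]
if_array = [(1, 0), (0, 1), (0, 1)]
valid = [True, True, True, False, False]
ifpos = [-1, 0, 2, 1, 2]
print(regroup(tree_map, node_map, e_list, if_array, valid, ifpos))
# ([0, 1, 1], [0, 1, 1, 0], [0, 1, 0, 1], [(1, 0), (0, 1)])
```

Doing the removal in one pass is the point of the temporary validity flags: each surviving bit is copied exactly once, with no shifting of the original arrays.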
On the one hand, this pruning algorithm merges leaf nodes in the CPat-Tree structure that are adjacent and whose information vectors lie in the same cluster; on the other hand, it trims the bit sequences carried by leaf nodes. Experiments show that the algorithm reduces the storage space of the CPat-Tree structure and improves query efficiency. Figures 3, 4 and 5 give the test results of the pruning algorithm. Figure 3 compares the storage size of the pruned CPat-Tree indexes built from 160000 and 320000 URLs, at various similarities, with the storage size before pruning. As can be seen, as the similarity varies from 0.3 to 0.9, the ratio of pruned to unpruned storage is essentially the same for CPat-Tree structures built from different numbers of URLs, reaching 5%-15%. Figures 4 and 5 show, for the pruned CPat-Trees generated from 32000, 96000, 160000 and 320000 URLs at similarities 0.3-0.9, the ratio of remaining nodes to original nodes and the improvement in query efficiency, respectively. The number of remaining nodes declines steadily with similarity, and query efficiency improves by 30%-60%.
In summary, the present invention proposes a network content hierarchical index structure model based on CPat-Tree and a pruning algorithm implemented on that basis, which can significantly reduce the storage size of the index structure and improve query efficiency.
Claims (7)
- 1. A prunable network content hierarchical index structure based on CPat-Tree, characterized in that: (1) the hierarchical index structure is made up of the TreeMap, NodeMap, EList and IFArray arrays, where (2) the TreeMap bit array records the tree node structure of the CPat-Tree in preorder, marking internal nodes with bit 0 and external nodes with bit 1; (3) the NodeMap bit array records, in preorder, the number of bits each node contains, marking each node with a combination of one bit 0 and some number of bits 1 whose total count equals the number of bits the node contains; (4) the EList bit array stores, in preorder, the bit values each node contains; (5) the IFArray array records, in preorder, the information vectors carried by the leaf nodes.
- 2. A pruning algorithm for a network content hierarchical index structure based on CPat-Tree, characterized in that the basic steps are as follows: Cluster process: each leaf node in the CPat-Tree is mapped to a point in the vector space according to its information vector; using spatial pattern recognition methods, all points are divided into a number of non-overlapping clusters; each cluster is identified by a sphere of fixed radius, and the points of a cluster all lie inside its sphere; leaf nodes that fall in the same sphere are considered to have similar information vectors, and the information vector corresponding to the center of the sphere serves as the integrated information vector of the cluster. Merging process: according to the merge rules, the leaf nodes in each cluster are merged upward in turn: the leaf nodes are deleted and their parent node becomes a new leaf node; the information vector of the new leaf node is replaced with the integrated information vector of the cluster center. Regrouping process: the nodes marked as merged away in the temporary Boolean array are removed and the pruned storage arrays are regenerated; after the regrouping process the pruning is complete, and the index still retains the CPat-Tree structure.
- 3. The pruning method according to claim 2, characterized in that whether leaf nodes are similar is determined from the points in the vector space to which the information they carry is mapped, and the TOD method is used to cluster the leaf nodes.
- 4. The pruning method according to claim 2, characterized in that the merging process can be completed in a single traversal of the nodes of the CPat-Tree data structure in reverse preorder.
- 5. The pruning method according to claim 2, characterized in that the merging process uses a temporary Boolean array of the same length as the TreeMap array to record the cancelled nodes; a True value indicates that the corresponding node in the TreeMap array has not been cancelled, and a False value indicates that it has been cancelled.
- 6. The pruning method according to claim 2, characterized in that the steps of the merging process are as follows: The data structures used by the merging process are defined as: (1) the node currently being processed, L; the sibling node B of L; the parent node F of L and B; (2) the cursor TreePos of the TreeMap array; (3) NodeInfo, an array of structures with the same length as the TreeMap array, in which each structure holds the information of the corresponding node in TreeMap; each structure contains the elements valid, father, brother and ifpos; the valid element indicates whether the current node is valid, true meaning valid and false meaning invalid; the father and brother elements mark the indices of the current node's parent and sibling in the TreeMap and NodeInfo arrays; the ifpos element marks the starting position of the current node in IFArray; Step 1. {Initialization} Initialize each element of the NodeInfo array, with the valid element initialized to true; set TreePos ← TreeMap.size-1 so that it points to the last element of TreeMap; Step 2. {Check the validity of the current node} If TreePos ≤ 0, the node traversal is finished; go to step 7; otherwise let the current node be L; if NodeInfo[TreePos].valid=false, L has been deleted: decrement TreePos by 1 and go to step 2; Step 3. {Locate a leaf node} If the value of TreeMap at TreePos is 0, the node pointed to is a non-leaf node: decrement TreePos by 1 and go to step 2; otherwise it is a leaf node; go to step 4; Step 4. {Locate the parent and sibling} Locate the positions of L's parent F and sibling B from NodeInfo[TreePos].father and NodeInfo[TreePos].brother; if B is marked 0 in TreeMap, the sibling is an internal node, so the merge of the current leaf is abandoned: decrement TreePos by 1 and go to step 2; otherwise go to step 5; Step 5. {Check the merge condition} Locate the information vectors of L and B from NodeInfo[TreePos].ifpos and NodeInfo[NodeInfo[TreePos].brother].ifpos; if L and B are in the same cluster, go to step 6; otherwise decrement TreePos by 1 and go to step 2; Step 6. {Merge the nodes} Cancel the leaf nodes: set the valid values of L and B in the NodeInfo array to false (node invalid); make the parent the new leaf node: set the TreeMap bit corresponding to F to 1; point the information vector of the new leaf node in NodeInfo to L's position in IFArray and store there the integrated information vector of L's cluster; take the new leaf node as the node for the next round: set TreePos to the position of F in TreeMap, and go to step 2; Step 7. {End} The merging process ends.
- 7. The cutting method according to claim 2, characterized in that the regrouping process comprises the following steps:
  Let nipos and nodepos be the cursors into the NodeInfo array and the NodeMap array respectively, and let TreeMap', NodeMap', EList' and IFArray' be the storage arrays after regrouping.
  Step 1. {Initialization} Initialize the cursors: nipos ← 0, nodepos ← 0.
  Step 2. {Skip cancelled nodes} For the structure in the NodeInfo array indicated by nipos, if NodeInfo[nipos].valid = false, the node has been cancelled; increment nipos by 1, increment nodepos until NodeMap[nodepos] is 0 again, and go to step 2. Otherwise the node has not been cancelled; go to step 3.
  Step 3. {Determine the type of the node to output} If TreeMap[nipos] = 0, go to step 4 to output an internal node. Otherwise go to step 5 to output a leaf node.
  Step 4. {Output an internal node} Set i = 0 and perform sub-steps a) to c) below, which output the stored information of this node to TreeMap', NodeMap' and EList' respectively; then go to step 6.
  a) Output bit 0 to TreeMap'.
  b) Output NodeMap[nodepos+i] to NodeMap' and increment i by 1; repeat until NodeMap[nodepos+i] is 0 again, and record the output length as nodelen = i.
  c) Output to EList' the bit sequence of length nodelen that starts at EList[nodepos].
  Step 5. {Output a leaf node} Set i = 0 and perform sub-steps a) to d) below, which output the stored information of this node to TreeMap', NodeMap', EList' and IFArray' respectively; then go to step 6.
  a) Output bit 1 to TreeMap'.
  b) Output NodeMap[nodepos+i] to NodeMap' and increment i by 1; repeat until NodeMap[nodepos+i] is 0 again, and record the output length as nodelen = i+1.
  c) Output to EList' the bit sequence of length nodelen that starts at EList[nodepos].
  d) Output IFArray[NodeInfo[nipos].ifpos] to IFArray'.
  Step 6. {Move to the next node} Increment nipos by 1. If nipos ≥ TreeMap.size, all nodes have been traversed; go to step 7. Otherwise move to the next node: increment nodepos until NodeMap[nodepos] is 0 again, and go to step 2.
  Step 7. {Finish} The regrouping process ends.
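The merging steps of claim 6 can be sketched in Python as follows. This is only an illustration under several assumptions that are not in the claims: nodes are stored in preorder, so a parent always has a smaller index than its children; IFArray is modelled as a plain list with one entry per information vector, so ifpos is a list index rather than a bit offset; and `same_cluster` and `cluster_vector` are hypothetical callables standing in for the cluster test and the integrated cluster vector. The `valid` field plays the role of the temporary Boolean array of claim 5. The `compact` helper covers only the TreeMap' and IFArray' outputs of claim 7; NodeMap and EList are left out because their per-node layout is defined elsewhere in the description.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class NodeInfo:
    valid: bool = True   # False once the node has been cancelled (claim 5)
    father: int = -1     # index of the parent node, -1 for the root (assumed convention)
    brother: int = -1    # index of the sibling node, -1 if there is none
    ifpos: int = -1      # index of the node's information vector in if_array


def merge_leaves(
    tree_map: List[int],
    info: List[NodeInfo],
    if_array: List[Any],
    same_cluster: Callable[[Any, Any], bool],
    cluster_vector: Callable[[Any], Any] = lambda v: v,
) -> None:
    """Merge sibling leaves of the same cluster in place (claim 6, steps 1-7)."""
    pos = len(tree_map) - 1                        # step 1: start at the last node
    while pos > 0:                                 # step 2: stop when TreePos <= 0
        node = info[pos]
        if not node.valid or tree_map[pos] == 0:   # steps 2-3: cancelled or internal
            pos -= 1
            continue
        f, b = node.father, node.brother           # step 4: locate F and B
        if f < 0 or b < 0 or not info[b].valid or tree_map[b] == 0:
            pos -= 1                               # sibling missing or internal
            continue
        vec_l = if_array[node.ifpos]               # step 5: compare the vectors
        vec_b = if_array[info[b].ifpos]
        if not same_cluster(vec_l, vec_b):
            pos -= 1
            continue
        node.valid = info[b].valid = False         # step 6: cancel L and B
        tree_map[f] = 1                            # F becomes a leaf
        info[f].ifpos = node.ifpos                 # F reuses L's slot in if_array
        if_array[node.ifpos] = cluster_vector(vec_l)
        pos = f                                    # continue from the new leaf


def compact(
    tree_map: List[int], info: List[NodeInfo], if_array: List[Any]
) -> Tuple[List[int], List[Any]]:
    """Simplified regrouping: emit TreeMap' and IFArray' for surviving nodes only."""
    new_tree: List[int] = []
    new_if: List[Any] = []
    for i, bit in enumerate(tree_map):
        if not info[i].valid:                      # skip cancelled nodes
            continue
        new_tree.append(bit)
        if bit == 1:                               # only leaves carry a vector
            new_if.append(if_array[info[i].ifpos])
    return new_tree, new_if
```

Because the cursor jumps back to F after each merge, a newly created leaf can itself be merged with its sibling in a later iteration, so a whole subtree whose leaves share one cluster collapses into a single leaf.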
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200510027784 CN1719442A (en) | 2005-07-15 | 2005-07-15 | Network content grading index structure based on CPat-Tree and cutting method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN1719442A true CN1719442A (en) | 2006-01-11 |
Family
ID=35931276
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 200510027784 Pending CN1719442A (en) | 2005-07-15 | 2005-07-15 | Network content grading index structure based on CPat-Tree and cutting method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN1719442A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101063972B (en) * | 2006-04-28 | 2010-05-12 | International Business Machines Corporation | Method and apparatus for enhancing visuality of image tree |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |