CN101807188A - Perfect hash function construction method and system for dictionary with random scale - Google Patents

Perfect hash function construction method and system for dictionary with random scale

Info

Publication number
CN101807188A
Authority
CN
China
Prior art keywords
item
class
module
merging
elementary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200910077767A
Other languages
Chinese (zh)
Inventor
王晓春
王亚军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JINYUANJIAN COMPUTER TECHNOLOGY Co Ltd BEIJING
Original Assignee
JINYUANJIAN COMPUTER TECHNOLOGY Co Ltd BEIJING
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JINYUANJIAN COMPUTER TECHNOLOGY Co Ltd BEIJING filed Critical JINYUANJIAN COMPUTER TECHNOLOGY Co Ltd BEIJING
Priority to CN200910077767A priority Critical patent/CN101807188A/en
Publication of CN101807188A publication Critical patent/CN101807188A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for constructing a perfect hash function for a dictionary of arbitrary size, comprising the following steps: (1) constructing a TRIE tree from the dictionary or a keyword set; (2) determining the word correlation length k; (3) starting from layer k of the TRIE tree and proceeding up to layer 0, calculating layer by layer, from the bottom up, the correlation value of each character of the dictionary on each layer; and (4) realizing a perfect mapping of the dictionary onto a hash table according to the calculated word correlation length and the correlation value of each character on each layer.

Description

Perfect hash function construction method and system for a dictionary of arbitrary size
Technical field
The present invention relates to single-comparison retrieval of words and keywords in a dictionary, and more specifically to a perfect hash function construction method and system for a dictionary of arbitrary size.
Background art
Dictionary indexing and lookup is one of the most basic problems in natural language processing. It is widely applied in fields such as the recognition of keywords and predefined identifiers in compilers, keyword coloring in IDEs, spell checking in editors, locating the posting list of a keyword in search engines, stop-word detection, and Chinese word segmentation.
Research on dictionary indexing and lookup has a long history. Commonly used indexing methods include binary (halving) search on linear lists, various search trees, and hashing. Binary search requires the linear list to be ordered, and its search efficiency depends on the length of the list. Typical search trees include the B+ tree, the B-tree, and the TRIE tree; their time complexity is the depth of the tree, but the index generally occupies a great deal of storage. Hashing usually needs only one probe and is the most efficient lookup method, so it is preferred on most occasions. Its key technique is the design of the hash function: a reasonable scheme controls the uniformity of the data distribution, reduces collisions, and improves space utilization. Once mapping collisions occur, however, search efficiency drops. Researchers call a hash function that produces no collisions a perfect hash function.
In research on perfect hash functions, Cichelli first proposed a feasible algorithm for constructing a minimal perfect hash function, but the method cannot distinguish words that have the same length and the same initial letter, and construction is slow. Its most serious shortcoming is that it is essentially unable to find a hash function once the dictionary holds more than 45 words. Because the Cichelli algorithm could not solve the perfect hashing problem for the reserved words of the Ada language, David Alan Wolverton proposed a method for computing word hash values, but it can still only construct hash functions for small word sets. Nick Cercone and others did much further work on hash functions on this basis, but still could not solve the dictionary problem. Chang used the Chinese remainder theorem to propose a method for constructing an order-preserving minimal perfect hash function over a sequence of primes. Because its prime generator can produce only about 40 primes, and because it cannot construct a hash function for character strings, the method is of very limited practical use and has essentially only mathematical significance. The Mincycle algorithm proposed by Sager can in principle handle large dictionaries, but in some cases it is not necessarily effective even for small word sets. gperf, the well-known perfect hash function generator on Linux systems, cannot guarantee that every run finds a perfect hash function for the dictionary, and when the number of entries is too large (more than 10000) the load factor becomes very low. In the related art, perfect hash function construction mainly relies on constructing a smooth function for the dictionary, but the coefficients of that function are difficult to choose, and when there are too many entries problems arise such as an overly long correlation table and a very low load factor. It can be seen that constructing a perfect hash function for a dictionary is very difficult work.
Summary of the invention
To solve the above technical problem, the invention provides a perfect hash function construction method for a dictionary of arbitrary size, comprising: step 1: building a TRIE tree from a dictionary or keyword set; step 2: determining the word correlation length k; step 3: starting from layer k of the TRIE tree and proceeding up to layer 0, calculating layer by layer, from the bottom up, the correlation value of each character of the dictionary on each layer; and step 4: realizing a perfect mapping of the dictionary onto a hash table according to the calculated word correlation length and the correlation value of each character on each layer.
According to one embodiment of the invention, step 3 further comprises: step 31: obtaining all nodes of layer l of the TRIE tree; step 32: initializing the space occupancy range of each child node of all nodes of layer l; step 33: converting the space occupancy ranges of the child nodes of all nodes of layer l into elementary items; step 34: arranging all the elementary items in a linear space according to an assigned priority, superimposing them on one another and merging them into a single whole so that, while ensuring that non-empty positions do not overlap, the space occupied after merging is minimal; step 35: calculating the character correlation values and the space occupancy range of each node of layer l, and passing the space occupancy range of each layer-l node up to its parent node on layer l-1 of the TRIE tree. Here l is an integer running from k down to 0, and elementary items and group items are collectively called items.
According to one embodiment of the invention, in step 33, when the nodes of layer l are converted into elementary items, the space occupancy of the child nodes of all layer-l nodes of the TRIE tree is represented by an n × m matrix A, where m is the number of branches of a TRIE node and n is the number of nodes in layer l. Matrix A is built by collecting, from left to right, the child-node space occupancy of all layer-l nodes: child nodes representing the same character are placed in the same column, and each node is entered into A as one row, from left to right and top to bottom. Each column is regarded as one elementary item, giving m elementary items.
According to one embodiment of the invention, elementary items have the following features: null item: all space in some row of the elementary item is unoccupied; non-null item: every row of the elementary item contains occupied space; flat item: every Back_Idle and Front_Idle in the elementary item is 0; convex-concave item: any elementary item other than a flat item; single-column item: the elementary item consists of one column; and multi-column item: the elementary item comprises several columns.
Step 34 further comprises: step 341: dividing the elementary items into four classes, class I, class II, class III and class IV, according to their features, where class I items are multi-column, non-null and convex-concave, class II items are multi-column and null, class III items are single-column and null, and class IV items are single-column and non-null; step 342: processing the items in class I by merging any two items whose encapsulation number is 0 to generate a group item, recording the merging relation between the elementary items, and adding the generated group item to class I so that it continues to take part in merging; step 343: adding the remaining items of class I to class II; step 344: according to the preferential merging indices, progressively merging the items in class II whose merging degree is greater than 0, recording the merging relation between the elementary items, recording the idle cells encapsulated during merging in single-column form as items to be filled, and recording the positions of those items to be filled within the merged item; step 345: sorting the remaining items of class II so that merging them in that order encloses the fewest idle cells, then merging the class II items in that order into a temporary item T, and continuing to generate or merge items to be filled according to T; step 346: merging, by position, the generated items to be filled with the free space not encapsulated by T, and sorting the items to be filled in ascending order of the number of their cells that T encloses; step 347: using the items to be filled, merging the items of class III into the items of class II as far as possible, and recording the merging relation between the elementary items; step 348: merging the remaining elementary items of class III into single-column items, recording the merging relation between the elementary items, and ensuring that the resulting single-column items cannot be merged any further; and step 349: sorting the remaining items of class II, class III and class IV, and merging them in that order into one item so that the enclosed free space is minimal.
According to one embodiment of the invention, in step 35, the correlation value of the character corresponding to each elementary item is calculated from the merging relations and the width of each elementary item.
The invention also provides a perfect hash function construction system for a dictionary of arbitrary size, comprising: a TRIE tree construction module, connected to the dictionary file, for building a TRIE tree from a dictionary or keyword set; a word correlation length determination module, connected to the TRIE tree construction module, for determining the word correlation length k; a layered correlation value setting module, connected to the word correlation length determination module, for calculating layer by layer, from the bottom up starting at layer k of the TRIE tree, the correlation value of each character of the dictionary on each layer, and storing the correlation values in a correlation table; and a perfect mapping module, connected to the layered correlation value setting module, for realizing a perfect mapping of the dictionary onto a hash table according to the word correlation length calculated by the word correlation length determination module and the per-layer character correlation values calculated by the layered correlation value setting module.
According to one embodiment of the invention, the layered correlation value setting module comprises: an initialization module for initializing the space occupancy information of all nodes of layer l of the TRIE tree; an elementary item generation module for generating elementary items from the space occupancy information of each node of layer l; a merging module for arranging all elementary items generated by the elementary item generation module in a linear space according to an assigned priority, superimposing them and merging them into a single whole so that, while ensuring that non-empty positions do not overlap, the space occupied after merging is minimal; a correlation value calculation module for calculating, from the final merging result of the merging module, the correlation values of the dictionary's characters on this layer; and a space occupancy upload module for passing the space occupancy of the layer-l nodes up to their parent nodes on layer l-1.
According to one embodiment of the invention, the merging module comprises: an elementary item classification module, connected to the elementary item generation module, for calculating the feature values of all elementary items of layer l of the TRIE tree and dividing the elementary items, according to their features, into four classes, class I, class II, class III and class IV, where class I items are multi-column, non-null and convex-concave, class II items are multi-column and null, class III items are single-column and null, and class IV items are single-column and non-null; a class I self-merging module, connected to the elementary item classification module, for merging those items of class I whose encapsulation number with each other is 0, recording the start position of each elementary item in the merged item, adding the merged item to class I to take part in further merging, and, once the encapsulation number between every two items of class I is greater than 0, adding the remaining items of class I to class II; a class II self-merging module, connected to the class I self-merging module, for merging, in priority order according to the preferential merging indices, the items of class II whose merging degree is greater than 0, returning each merged group item to class II to continue merging until the merging degree of every two items is 0, and recording during merging the corresponding items to be filled and the start position of each elementary item in the merged item; a class II item filling module, connected to the class II self-merging module, for merging the elementary items of class III into the idle cells contained in the items of class II as far as possible, until no more elementary items of class III can be merged into the items of class II; a class III self-merging module, connected to the class II item filling module, for merging the elementary items of class III whose merging degree is greater than 0, ensuring that the number of columns occupied after merging is minimal, and recording the merging relation between the elementary items; an outlier sorting module, connected to the class III self-merging module, for sorting all items of class II, class III and class IV so that merging them in that order encloses the least free space; and an outlier merging module, connected to the outlier sorting module, for merging the items of class II, class III and class IV in the order provided by the outlier sorting module and recording the merging relation between the elementary items.
According to one embodiment of the invention, the class II item filling module comprises: an outlier sorting module for sorting the items of class II so that merging them in that order encloses the least free space; an outlier merging module for merging the items of class II into one temporary item in the order provided by the outlier sorting module; a final to-be-filled sequence generation module for merging all idle cells of the class II items with the items to be filled, generating new items to be filled, and sorting these items to be filled in ascending order of the number of their cells not enclosed inside the temporary item generated by the outlier merging module; a filling relation generation module for taking each item to be filled provided by the final to-be-filled sequence generation module, in sorted order, and finding which elementary items of class III should compose it so that its remaining free space is minimal, thereby generating the filling relations; and an item-by-item filling module for merging the elementary items of class III into the corresponding items of class II according to the filling relations generated by the filling relation generation module and the relation between the items to be filled and the items of class II, and recording the merging relation between the elementary items.
The perfect hash function construction method and system for a dictionary of arbitrary size according to the invention construct a better perfect hash function for a dictionary and realize a perfect mapping of the words in the dictionary onto a hash table. Whether a word belongs to the dictionary can be determined quickly by adding correlation values, and by increasing the space utilization of the hash table the invention reduces the space complexity of the dictionary index. Because the hash function constructed by the invention occupies relatively little memory compared with similar indexes, it can be applied in systems with demanding speed and space requirements.
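The lookup described here can be sketched as follows. This is a minimal illustration, assuming that the hash address of a word is the sum of the per-layer correlation values of its first k characters and that the hash table stores the words themselves for one final comparison; this section does not spell out the exact formula. The names `K`, `CORR`, `TABLE` and `lookup`, and the three-word dictionary, are hypothetical.

```python
# Hypothetical data for the dictionary {"ab", "ac", "ba"} with k = 2.
K = 2
CORR = [
    {"a": 0, "b": 2},          # correlation values of characters on layer 0
    {"a": 0, "b": 0, "c": 1},  # correlation values of characters on layer 1
]
TABLE = ["ab", "ac", "ba"]     # hash table; address -> stored word


def lookup(word):
    """Return True if word is in the dictionary, using one table comparison."""
    address = 0
    for layer, ch in enumerate(word[:K]):
        value = CORR[layer].get(ch)
        if value is None:      # character never occurs on this layer
            return False
        address += value       # assumed: hash address is the sum of the values
    return address < len(TABLE) and TABLE[address] == word
```

With these values, "ab", "ac" and "ba" map to addresses 0, 1 and 2, so the table is fully loaded; a non-member such as "bb" reaches address 2 and is rejected by the single comparison with the stored word.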
Brief description of the drawings
The above and other aspects, features and advantages of the invention will be understood more clearly from the following detailed description taken together with the accompanying drawings.
Fig. 1 is a schematic diagram showing, in visual form, an item and a position in the hash table;
Fig. 2 is a schematic diagram of encapsulation and solid columns;
Fig. 3 is a flowchart of the perfect hash function construction method for a dictionary of arbitrary size according to the invention;
Fig. 4 is a structural diagram of the perfect hash function construction system for a dictionary of arbitrary size according to the invention;
Fig. 5 is a structural diagram of a perfect hash function construction system according to one embodiment of the invention;
Fig. 6 is a structural schematic of the layered correlation value setting module according to one embodiment of the invention;
Fig. 7 is a structural schematic of the merging module according to one embodiment of the invention;
Fig. 8 is a schematic diagram of class I storage and merging according to one embodiment of the invention;
Fig. 9 is a structural schematic of the class II item filling module according to one embodiment of the invention;
Fig. 10 shows an example of items to be filled according to one embodiment of the invention;
Fig. 11 shows an example of the sorting process in the outlier sorting module according to one embodiment of the invention;
Fig. 12 shows the search for the optimal ordering of outliers according to one embodiment of the invention; and
Fig. 13 is a schematic diagram of correlation value calculation according to one embodiment of the invention.
Detailed description of the embodiments
Exemplary embodiments of the invention are described in detail below with reference to the drawings.
To make the invention easier to understand, the following definitions and explanations are given:
Definition 1: in the invention, a position in the hash table is represented visually as a rectangle with two states, idle and occupied:
Idle: the position does not yet have a mapping relation with any word in the dictionary;
Occupied: the position already has a mapping relation with some word in the dictionary.
Definition 2: item: an item can be pictured as a building-block structure, shown visually in Fig. 1. It is made up of two kinds of cells, idle and occupied, corresponding to idle and occupied space in the hash table, that is, to whether a mapping relation exists with some word in the dictionary. Items are of two kinds, elementary items and group items:
Elementary item: an item obtained directly, via the corresponding matrix A, from the space occupied by the child nodes of the layer-l nodes; it corresponds to a character of the dictionary;
Group item: an item formed by merging elementary items or group items.
Definition 3: merging: two items are overlapped in the linear space to form one item, with no overlap allowed between occupied cells.
Definition 4: merging degree: when two items are merged in the linear space, the length by which the latter is embedded in the former.
Definition 5: outliers: two items whose merging degree is 0.
Definition 6: item to be filled: idle cells encapsulated during merging are recorded in the form of an item, so that other items can later fill them and turn them into the occupied state.
Definition 7: encapsulation number: the number of idle cells linearly enclosed by occupied cells (see Fig. 2).
Definition 8: solid column: a column of an item that contains no idle cells (see Fig. 2).
Definition 9: single-column item: an item with the single-column feature.
To explain and illustrate the invention, some basic variables are defined:
Front_Idle: the number of idle cells at the head of a row of an item (Front_Idle is not computed for empty rows);
Back_Idle: the number of idle cells at the tail of a row of an item (Back_Idle is not computed for empty rows).
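The variables above and the encapsulation number of Definition 7 can be illustrated on a single row of an item. The sketch below assumes a row is stored as a list of 0/1 flags (1 = occupied); that representation and the function names are illustrative choices, not taken from the patent.

```python
def front_idle(row):
    """Idle cells before the first occupied cell; None for an empty row."""
    if 1 not in row:
        return None
    return row.index(1)


def back_idle(row):
    """Idle cells after the last occupied cell; None for an empty row."""
    if 1 not in row:
        return None
    return row[::-1].index(1)


def encapsulation_number(row):
    """Idle cells enclosed between the first and last occupied cells of a row."""
    if 1 not in row:
        return 0
    first = row.index(1)
    last = len(row) - 1 - row[::-1].index(1)
    return row[first:last + 1].count(0)
```

For the row `[0, 1, 0, 0, 1, 0]`, one cell is idle at each end and two idle cells are enclosed between the occupied ones.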
Fig. 3 is a flowchart of the perfect hash function construction method for a dictionary of arbitrary size according to the invention. As shown in Fig. 3, the method comprises: step S302: building a TRIE tree from a dictionary or keyword set; step S304: determining the word correlation length k; step S306: starting from layer k of the TRIE tree and proceeding up to layer 0, calculating layer by layer, from the bottom up, the correlation value of each character of the dictionary on each layer; and step S308: realizing a perfect mapping of the dictionary onto a hash table according to the calculated word correlation length and the correlation value of each character on each layer.
According to one embodiment of the invention, step S306 further comprises:
Step 31: obtaining all nodes of layer l of the TRIE tree;
Step 32: initializing the space occupancy range of each child node of the layer-l nodes;
Step 33: converting the child-node space occupancy ranges of the layer-l nodes into elementary items;
Step 34: merging all the items;
Step 35: setting the character correlation values of layer l and passing the space occupancy range of each layer-l node up to its parent node on layer l-1 of the TRIE tree, where l is an integer running from k down to 0, and elementary items and group items are collectively called items.
Step 33 comprises: when the layer-l nodes are converted into elementary items, the space occupancy of the child nodes of the layer-l nodes of the TRIE tree is represented by an n × m matrix A, where m is the number of branches of a tree node and n is the number of nodes in layer l. Matrix A is generated by collecting, from left to right, the child-node space occupancy of the layer-l nodes of the TRIE tree: child nodes representing the same character are placed in the same column, and each node is entered into A as one row, from left to right and top to bottom. Each column is then regarded as one item, giving m items.
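The construction of matrix A can be sketched as below. The sketch assumes each layer-l node is given as a mapping from a character to the width of the space occupied by that child (0 when the child is absent); the real matrix entries may carry more occupancy information than a single width. The function names and the sample data are hypothetical.

```python
def build_matrix(layer_nodes, alphabet):
    """One row per layer-l node, one column per character of the alphabet."""
    return [[node.get(ch, 0) for ch in alphabet] for node in layer_nodes]


def elementary_items(matrix, alphabet):
    """Each column of A is the elementary item of one character."""
    return {ch: [row[j] for row in matrix] for j, ch in enumerate(alphabet)}


# Two layer-l nodes: the first has children 'a' and 'b', the second 'b' and 'c'.
layer_nodes = [{"a": 1, "b": 1}, {"b": 1, "c": 1}]
A = build_matrix(layer_nodes, "abc")
items = elementary_items(A, "abc")
```

Here n = 2 and m = 3, so A has two rows and yields three elementary items, one per character.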
In step 34, items are likened to building blocks: all items are arranged in a linear space according to a certain priority, superimposed on one another and merged into a single whole so that, while ensuring that non-empty positions do not overlap, the space occupied after merging is minimal.
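The basic merging operation on two building blocks can be sketched as follows. This is a simplified greedy illustration, not the priority-driven procedure of the patent: an item is assumed to be a list of rows of 0/1 flags (1 = occupied), the second item is slid to the smallest non-negative offset at which no occupied cells collide, and the merging degree is taken to be the number of columns of the second item that land inside the first. All names are hypothetical.

```python
def fits(a, b, offset):
    """True if b, shifted right by offset, has no occupied cell on top of a's."""
    for row_a, row_b in zip(a, b):
        for j, cell in enumerate(row_b):
            pos = offset + j
            if cell and pos < len(row_a) and row_a[pos]:
                return False
    return True


def merge(a, b):
    """Merge b into a at the leftmost fitting offset.

    Returns (merged_item, offset, merging_degree).
    """
    width_a, width_b = len(a[0]), len(b[0])
    # offset == width_a always fits, because b then lies entirely after a
    offset = next(o for o in range(width_a + 1) if fits(a, b, o))
    width = max(width_a, offset + width_b)
    merged = []
    for row_a, row_b in zip(a, b):
        row = row_a + [0] * (width - width_a)
        for j, cell in enumerate(row_b):
            if cell:
                row[offset + j] = 1
        merged.append(row)
    return merged, offset, min(width_b, width_a - offset)
```

A merging degree of 0 means the second item could only be appended after the first, which is the situation Definition 5 calls outliers.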
Step 35 further comprises: using the merging relations (including merging order and merging degree) and the width of each elementary item to set the correlation value of the character corresponding to each elementary item.
According to one embodiment of the invention, step 34 further comprises:
Step 341: dividing the elementary items into four classes, I, II, III and IV;
Step 342: processing the items in class I by merging any two items whose encapsulation number is 0 to generate a group item, recording the merging relation between the elementary items, and adding the generated group item to class I so that it continues to take part in merging;
Step 343: adding the remaining items of class I to class II;
Step 344: according to the preferential merging indices, progressively merging the items of class II whose merging degree is greater than 0 to generate group items, recording the merging relation between the elementary items, recording the idle cells encapsulated during merging in single-column form as items to be filled, and recording the positions of those items to be filled within the merged item; each generated group item is added to class II so that it continues to take part in merging;
Step 345: sorting the remaining items of class II so that merging them in that order encloses the fewest idle cells, then merging the class II items in that order into a temporary item T, and continuing to generate or merge items to be filled according to T;
Step 346: merging, by position, the generated items to be filled with the free space not encapsulated by T, and sorting the items to be filled in ascending order of the number of their cells that T encloses;
Step 347: using the items to be filled, merging the items of class III into the items of class II as far as possible, and recording the merging relation between the elementary items;
Step 348: merging the remaining items of class III into several single-column items, ensuring that these single-column items cannot be merged any further, and recording the merging relation between the elementary items;
Step 349: sorting the remaining items of class II, class III and class IV, and merging them in that order into one item so that the enclosed free space is minimal.
In step 341, the elementary item features used for classification are:
Feature 1: null item or non-null item.
Null item: all space in some row of the item is unoccupied;
Non-null item: every row of the item contains occupied space;
Feature 2: flat item or convex-concave item.
Flat item: every Back_Idle and Front_Idle in the item is 0;
Convex-concave item: any item other than a flat item;
Feature 3: single-column item or multi-column item.
Single-column item: an item consisting of one column;
Multi-column item: an item containing several columns.
Based on these features, elementary items are divided into the following classes:
Class I: multi-column, non-null, convex-concave;
Class II: multi-column, null;
Class III: single-column, null;
Class IV: single-column, non-null.
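The classification can be sketched directly from the three features. The sketch assumes an item is a list of equal-length rows of 0/1 flags (1 = occupied) and that empty rows are skipped when testing flatness, since Front_Idle and Back_Idle are not computed for them. The text lists no class for a multi-column, non-null, flat item, so the hypothetical `classify` below returns None in that case.

```python
def classify(item):
    """Return 'I', 'II', 'III' or 'IV' for an item, or None if no class applies."""
    width = len(item[0])
    has_empty_row = any(1 not in row for row in item)        # null item
    flat = all(row[0] == 1 and row[-1] == 1                  # Front/Back_Idle == 0
               for row in item if 1 in row)
    if width == 1:
        return "III" if has_empty_row else "IV"
    if has_empty_row:
        return "II"
    return None if flat else "I"
```

For example, `[[1, 0], [0, 1]]` is multi-column, has no empty row and has idle cells at a row end, so it falls into class I.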
In step 344, the preferential merging indices are:
Merged encapsulation number: the number of idle cells linearly enclosed by occupied space after two items are merged;
Merge gain: the number of solid columns gained after two items are merged;
Post-merge compression number: the reduction in free space after two items are merged.
The priority of these three indices decreases from top to bottom.
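One way to apply three indices of decreasing priority is a lexicographic sort key, sketched below. The direction of each comparison is an assumption (fewer enclosed idle cells, more new solid columns and a larger reduction in free space are each taken to be better); the text states only the order of priority. `Candidate`, `priority_key` and `best_candidate` are hypothetical names.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    pair: tuple          # identifiers of the two items that would be merged
    encapsulated: int    # merged encapsulation number
    solid_gain: int      # solid columns gained by the merge
    compression: int     # reduction in free space after the merge


def priority_key(c):
    # Tuples compare left to right, which gives the decreasing priority.
    return (c.encapsulated, -c.solid_gain, -c.compression)


def best_candidate(candidates):
    """Pick the pair of items to merge next."""
    return min(candidates, key=priority_key)
```

The later indices only break ties among candidates that are equal on the earlier ones.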
According to the perfect hash function construction method for a dictionary of arbitrary size of the invention, the TRIE tree and a mechanism of fitting building blocks together in a linear space are used to determine the dictionary's word correlation length and the correlation value of each character at each position, thereby realizing a perfect mapping of the dictionary onto a hash table.
Fig. 4 is a structural diagram of the perfect hash function construction system for a dictionary of arbitrary size according to the invention. As shown in Fig. 4, the system comprises: a TRIE tree construction module 402, connected to the dictionary file, for building a TRIE tree from a dictionary or keyword set; a word correlation length determination module 404, connected to the TRIE tree construction module, for determining the word correlation length k; a layered correlation value setting module 406, connected to the word correlation length determination module, for calculating layer by layer, from the bottom up starting at layer k of the TRIE tree, the correlation value of each character of the dictionary on each layer; and a perfect mapping module 408, connected to the layered correlation value setting module, for realizing a perfect mapping of the dictionary onto a hash table according to the word correlation length calculated by the word correlation length determination module and the per-layer character correlation values calculated by the layered correlation value setting module.
Fig. 5 is a structural diagram of a perfect hash function construction system according to one embodiment of the invention. As shown in Fig. 5, the system comprises a TRIE tree construction module, a word correlation length determination module, and a layered correlation value setting module.
TRIE tree construction module 504: connected to the dictionary file 502; reads the entries of the dictionary and builds a suitable TRIE tree based on the number of basic elements that make up the dictionary. For a dictionary or keyword set with few basic elements, the space wasted by using the TRIE tree directly as the storage structure is within an acceptable range. For an electronic dictionary of a language such as Chinese, the number of basic characters is 6763 or more (GB2312 includes 6763 simplified Chinese characters, GBK includes 20912 Chinese characters, the more recent GB18030 includes 27533 Chinese characters, and the BIG5 code includes 13053 Chinese characters), so the pointer space of the TRIE tree structure for such a dictionary would be far too wasteful. In this case, representing the child nodes of each internal node of the TRIE tree with a linked list or a variable-length array greatly reduces the space occupied.
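The space argument above can be made concrete with a small sketch. It assumes a Python dict per node in place of the linked list or variable-length array named in the text (a deliberate substitution for brevity); the names `TrieNode`, `build_trie` and `contains` are hypothetical and not taken from the patent.

```python
class TrieNode:
    """Trie node that stores only the children actually present."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}    # character -> TrieNode; no slot for absent characters
        self.is_word = False  # True if a dictionary entry ends at this node


def build_trie(words):
    """Insert every entry of the dictionary (or keyword set) into a fresh trie."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def contains(root, word):
    """Follow the characters of word; succeed only if an entry ends there."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word


root = build_trie(["中国", "中文", "ab"])
```

With a fixed pointer array, each node of a GB2312 trie would reserve 6763 slots; here the root holds two entries only.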
Word correlation length determination module 506: traverses the TRIE tree breadth-first and finds the deepest layer k at which a node has more than one child. This k is the word correlation length, and the product of k and the number of characters that make up the dictionary or keyword set is the length of the relating value table. The nodes between layer k and the leaf nodes are then cut away so that layer k+1 consists entirely of leaf nodes; the TRIE tree now ends at layer k+1.
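A minimal sketch of this step follows. It assumes the root is layer 0 and uses nested dicts as a stand-in trie; both choices, and the function names, are illustrative assumptions rather than the patent's own data structures.

```python
from collections import deque


def build_trie(words):
    """Nested-dict trie: each node maps a character to its child node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def correlation_length(root):
    """Breadth-first search for the deepest layer holding a node with >1 child."""
    k = 0
    queue = deque([(root, 0)])
    while queue:
        node, layer = queue.popleft()
        if len(node) > 1:
            k = max(k, layer)
        for child in node.values():
            queue.append((child, layer + 1))
    return k


def prune_below(root, k):
    """Cut everything under layer k+1 so that layer k+1 holds only leaves."""
    queue = deque([(root, 0)])
    while queue:
        node, layer = queue.popleft()
        for child in node.values():
            if layer == k:
                child.clear()  # child sits on layer k+1; make it a leaf
            else:
                queue.append((child, layer + 1))


root = build_trie(["abc", "abd", "xyzzy"])
k = correlation_length(root)  # branching at "ab" (layer 2) is the deepest
prune_below(root, k)
```

After pruning, the tail "zy" of "xyzzy" is gone, because beyond layer k every word follows a single path and needs no further distinguishing.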
Layered relating value setting module 508: responsible for calculating, layer by layer from the bottom up starting at layer k of the TRIE tree, the relating value of each character of the dictionary on each layer.
As shown in Fig. 6, the layered relating value setting module 508 comprises:
Initialization module 602, for initializing the space occupancy information of all nodes of layer l of the TRIE tree. When a node of layer k of the TRIE tree is processed, if a child of the node is a leaf node, the space occupancy range of that child is 1 and its front free space is 0; if the child is not a leaf node, its space occupancy range is the occupancy passed up from its own child nodes; if the child is empty, its occupancy range is 0 and its front free space is also 0.
Elementary item generation module 604, for generating elementary items from the space occupancy information of each node of layer l of the TRIE tree. The elementary items of layer l can be generated in the following five steps:
Step 1: add the child nodes of the layer-l nodes of the TRIE tree to matrix A from left to right; each row corresponds to one node and each column to the occupancy of one child node, and the space occupancy of every child node starts from 0;
Step 2: find the rows of matrix A that contain more than 2 elements, and add the columns that contain elements in those rows to the sequence List;
Step 3: remove the columns that are not included in List;
Step 4: remove the rows that contain no elements; and
Step 5: each column of matrix A forms one elementary item; record the relation between the elementary items and the basic characters of the dictionary.
Merge module 606, for arranging all elementary items generated by the elementary item generation module 604 in linear space according to a specified priority, superimposing them on one another and merging them into a single whole, so that the space occupied after merging is minimal while non-empty positions are guaranteed not to overlap.
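The constraint stated here, that occupied positions must never collide while the merged width stays small, can be illustrated with a plain first-fit placement. This is only a sketch of the goal, not the class-based procedure described in the remainder of this section; the function name `first_fit_merge` and the grid representation (one list of 0/1 cells per TRIE node) are assumptions made for illustration.

```python
def first_fit_merge(items):
    """Place each item at the smallest offset where no occupied cells collide.

    items: dict mapping a character to its elementary item, given as a list of
    rows (one per trie node); each row is a list of 0/1 cells.
    Returns (offsets, width): the start column of every item and the merged width.
    """
    n_rows = len(next(iter(items.values())))
    occupied = [set() for _ in range(n_rows)]  # occupied columns, per row
    offsets = {}
    for name, rows in items.items():
        # Column indices of the occupied cells in each row of this item.
        cells = [[c for c, v in enumerate(row) if v] for row in rows]
        offset = 0
        while any(offset + c in occupied[r]
                  for r in range(n_rows) for c in cells[r]):
            offset += 1  # slide right until nothing overlaps
        for r in range(n_rows):
            occupied[r].update(offset + c for c in cells[r])
        offsets[name] = offset
    width = 1 + max((max(cols) for cols in occupied if cols), default=-1)
    return offsets, width


items = {
    "a": [[1], [0], [1]],  # character "a" is a child of nodes 0 and 2
    "b": [[0], [1], [0]],  # character "b" is a child of node 1
    "c": [[1], [1], [0]],  # character "c" is a child of nodes 0 and 1
}
offsets, width = first_fit_merge(items)  # offsets: a=0, b=0, c=1; width: 2
```

In this toy case "a" and "b" share column 0 because their occupied rows are disjoint, so three characters fit in two columns. The start offsets play the role that the text later assigns to relating values.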
As shown in Fig. 7, the merge module 606 comprises:
Elementary item classifier 702: first analyzes three features of each elementary item, then assigns the item to one of four buckets, I, II, III and IV, according to these features.
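The three features and the four classes are spelled out in claims 4 and 5 (null or non-null, flat or convex-concave, single-column or multi-column). A minimal sketch of the classifier under that reading follows. It assumes an item is a list of rows of 0/1 cells and that Front_Idle and Back_Idle count the free cells before the first and after the last occupied cell of a row; the source does not define them precisely, so this is an interpretation.

```python
def front_idle(row):
    """Free cells before the first occupied cell (whole row if none occupied)."""
    return row.index(1) if 1 in row else len(row)


def back_idle(row):
    """Free cells after the last occupied cell (whole row if none occupied)."""
    return row[::-1].index(1) if 1 in row else len(row)


def classify(item):
    """Return "I", "II", "III" or "IV" for an item given as rows of 0/1 cells.

    Returns None for a multi-column, non-null, flat item: the claims do not
    name a class for that combination.
    """
    multi_column = len(item[0]) > 1
    null_item = any(1 not in row for row in item)  # some row is entirely free
    flat = all(front_idle(r) == 0 and back_idle(r) == 0 for r in item)
    if multi_column:
        if null_item:
            return "II"
        return None if flat else "I"
    return "III" if null_item else "IV"


examples = {
    "I": [[1, 0], [0, 1]],    # multi-column, non-null, convex-concave
    "II": [[1, 1], [0, 0]],   # multi-column, null
    "III": [[1], [0]],        # single-column, null
    "IV": [[1], [1]],         # single-column, non-null
}
```

Each entry of `examples` is classified under its own key by `classify`.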
Class I self-merge module 704: uses an algorithm similar to merge sort to merge those items of class I whose merged encapsulation number is 0. With reference to Fig. 8, it can proceed (without being limited to this) in the following steps:
Step 1: store pointers to all items of class I in two buffers, both of which share the same underlying entities. Sort the two buffers by Front_Idle_Total and Back_Idle_Total respectively.
Step 2: merge each pair of distinct items in which the Front_Idle_Total of one equals the Back_Idle_Total of the other, store the merge result in a result buffer, and mark both items as used. Items marked as used no longer take part in merging.
Step 3: traverse the actual element sequence. If it contains no item marked as used, merging is finished; otherwise delete all items marked as used from the actual element sequence, add the items of the merge result buffer to it, and return to Step 1.
Here Front_Idle_Total denotes the sum of all Front_Idle values, and Back_Idle_Total denotes the sum of all Back_Idle values.
Class II self-merge module 706: performs three tasks: progressively merging, according to the priority merge index, the items of class II whose merge degree is greater than 0; during merging, recording in single-column form the free cells that become enclosed inside merged items, forming a series of to-be-filled items; and recording the position of each elementary item within the merged item.
Because merge priority must be taken into account, the following method can be used in implementation to speed up merging (it is not the only possible method): first build the maximum-cost spanning tree among the items, then use this spanning tree to merge as many optimal items as possible in a single pass, recording the to-be-filled items.
The maximum-cost spanning tree is built on the graph that takes every item as a vertex and the priority merge index as edge weight. That is, for each vertex of the graph, only the incident edge with the best priority merge index is chosen; once this selection has been made for all vertices, all other edges are removed.
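The edge selection described above (each vertex keeps only its best incident edge) can be sketched as follows. The weights are made-up numbers standing in for the priority merge index, whose formula the text does not give, and `best_edge_selection` is a hypothetical name. Note that keeping one best edge per vertex generally yields a forest rather than a single tree.

```python
def best_edge_selection(weights):
    """Keep, for every vertex, only its incident edge with the highest weight.

    weights: dict mapping frozenset({u, v}) -> priority merge index.
    Returns the set of surviving edges.
    """
    best = {}  # vertex -> best incident edge seen so far
    for edge, w in weights.items():
        for vertex in edge:
            if vertex not in best or w > weights[best[vertex]]:
                best[vertex] = edge
    return set(best.values())


weights = {
    frozenset({"a", "b"}): 5,
    frozenset({"b", "c"}): 3,
    frozenset({"c", "d"}): 4,
    frozenset({"a", "d"}): 1,
}
kept = best_edge_selection(weights)  # edges a-b and c-d survive
```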
When the spanning tree is used to find and merge optimal items, the following method can be used. Let N = {V, {E}} be a maximum-cost spanning tree of the item set, where V is the set of items to be merged and {E} is the set of optimal merge relations between the items of V. Let i = 0, and let T_j.V denote the set of nodes that make up item T_j. Then carry out the following steps:
Step 1: if {E} is empty, the search ends; otherwise take the edge (u, v) of maximum weight in the current E, i.e. the merge of u and v, and delete it from E;
Step 2: if neither u nor v belongs to any T_j.V, generate item T_i from (u, v), add u and v to T_i.V, set the state W_i = unFinished, and go to Step 1;
Step 3: if u and v are both in the same T_k, go to Step 1;
Step 4: if u and v are in T_k.V and T_j.V respectively, then, if W_k and W_j are both unFinished and the result of merging T_k and T_j is greater than the maximum edge in {E}, merge T_k and T_j, set the state after merging to unFinished, and merge T_k.V with T_j.V; otherwise set both W_k and W_j to Finished. Go to Step 1;
Step 5: if only one of u and v is in some T, the two cases are handled identically. Taking u in T_k as an example: if W_k = unFinished and the result of merging T_k with v is greater than the maximum edge in {E}, merge T_k with v and add v to T_k.V. Go to Step 1.
When items of class II are merged and the encapsulation number is greater than 0, the free cells enclosed inside are recorded in single-column form as to-be-filled items, and the position of each to-be-filled item within its containing item is recorded. As shown in Fig. 10, (a) is the item after merging, and (b) and (c) show the free cells of (a) in the form of single-column items; blank indicates occupied and gray indicates unoccupied.
During merging, to-be-filled items may also be superimposed on one another. Whether two to-be-filled items need to be superimposed depends mainly on whether they lie in the same column; if they do, the union of their sets is taken.
Class II item filling module 708, connected to the class II self-merge module 706, for merging the elementary items of class III as far as possible into the free cells contained in the items of class II, until no elementary item of class III can be merged into an item of class II. As shown in Fig. 9, the class II item filling module 708 comprises:
Outlier ordering module 902: when the free space of an item set can no longer be reduced by spatial superposition, the only remaining option is to order the items so that as few free cells as possible become enclosed inside during merging. To minimize the enclosed free cells through ordering, the number of unenclosed free cells on the two sides must be maximized. Based on this idea, the implementation searches for the optimal ordering progressively from both the head and the tail toward the middle, as shown in Fig. 11. The specific method is as follows:
A stack and two search trees are designed to assist the heuristic search (see Fig. 12). The stack records and controls the order in which nodes are processed; each entry states which tree is being processed and which node. The root nodes of the two search trees are the head node and the tail node of the sequence respectively.
Each of the two search trees has a current expansion node with which all of its child nodes are associated; the child nodes are ordered from left to right by the degree of null-item crossing between each child and its parent (the current expansion node).
Push rule: from the children of the current expansion node of each search tree, select the child that has the largest crossing degree and does not yet appear in the stack as the candidate current expansion node. If all children of the current expansion node have been searched, select from its siblings a node that has not been visited and does not appear in the stack as the candidate. Once the two candidates have been selected, the one that yields the larger number of free cells becomes the current expansion node of its search tree and is pushed onto the stack;
Backtracking rule: if Cur_Idle_Num + Expect_Idle_Num + RowNum * ElseWidth <= Max_Idle_Num, all branches under the current expansion node of the search tree to which the stack-top element belongs are pruned. That is, the stack-top element is popped, and among its unvisited siblings the node that has the largest crossing degree with its parent and does not appear in the stack is made the candidate current expansion node; otherwise the stack-top node continues to be popped. Here Cur_Idle_Num is the number of free cells obtained so far in the current search; Expect_Idle_Num is the number of free cells that may still be gained; Max_Idle_Num is the number of free cells of the best ordering found so far; RowNum is the number of rows in the current search sequence for which both the front and back sides are still null items; and ElseWidth is the total width of all items that have not yet been pushed onto the stack. Setting the current optimal ordering: during pushing, when the current number of free cells no longer increases, set Max_Idle_Num = Cur_Idle_Num and set the optimal ordering of the current outliers according to the push order.
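The backtracking rule is a branch-and-bound test: the left-hand side is an optimistic upper bound on the free cells a branch can still reach. A minimal sketch of just that test is shown below; the function name and the snake_case parameter names are illustrative renderings of the quantities named in the rule, and the surrounding stack-and-tree search is not reproduced.

```python
def should_prune(cur_idle_num, expect_idle_num, row_num, else_width, max_idle_num):
    """Return True when a branch cannot beat the best ordering found so far.

    The bound adds the free cells already obtained, those that may still be
    gained, and row_num * else_width for rows whose two sides are still open.
    """
    upper_bound = cur_idle_num + expect_idle_num + row_num * else_width
    return upper_bound <= max_idle_num


# Bound 4 + 2 + 1 * 3 = 9 does not exceed the best known 10: prune.
prune_first = should_prune(4, 2, 1, 3, 10)
# Bound 4 + 2 + 2 * 3 = 12 exceeds 10: keep exploring.
prune_second = should_prune(4, 2, 2, 3, 10)
```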
Outlier merge module 904: for an item set, merges all items progressively from left to right into a single item according to the ordering given by the outlier ordering module.
Final to-be-filled item sequence generation module 906: corresponding to-be-filled items were already generated during the class II self-merge stage. After the items of class II have been ordered by the outlier ordering module, a new item T is generated according to this ordering, and the free cells enclosed inside T are added to the corresponding to-be-filled items or form new to-be-filled items. The free cells not enclosed inside T are then also merged by position into the corresponding to-be-filled items, producing the final to-be-filled items. These to-be-filled items are then sorted in ascending order of the number of free cells they contain that are not enclosed inside T, forming the final to-be-filled item sequence.
Filling relation generation module 908: for the generated to-be-filled item sequence, uses a greedy search algorithm to determine, one item at a time in sequence order, which items of class III should fill each to-be-filled item so that its remaining free space is minimal. One specific method given by the present invention is as follows:
Step 1: if the to-be-filled item sequence S is empty, finish;
Step 2: take the first item S_1 of the to-be-filled item sequence S, delete S_1 from the sequence, and add it to the result set;
Step 3: using a greedy algorithm and the containment relation, search the remaining subclass sequence of class III for the longest item U such that S_1 contains U; if no such item exists, go to Step 1;
Step 4: set S_1 = S_1 - U, delete U from its subclass, and add U to the set Candidate; if that subclass is now empty, delete it from the subclass sequence; go to Step 3.
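The four steps above amount to a greedy set-packing loop for one to-be-filled item. The sketch below assumes that both the to-be-filled item and the class III items are represented as sets of row numbers, and that "longest" means the set with the most elements; `greedy_fill` is a hypothetical name. Because the free set only shrinks, one pass over the candidates in descending size order finds the same items as repeatedly searching for the longest contained one.

```python
def greedy_fill(slot, candidates):
    """Fill one to-be-filled item with class III items, largest first.

    slot: set of free row numbers of the to-be-filled item.
    candidates: list of sets, the occupied rows of each class III item.
    Returns (chosen, remaining): the items placed and the rows left free.
    """
    remaining = set(slot)
    chosen = []
    for item in sorted(candidates, key=len, reverse=True):
        if item and item <= remaining:  # the slot still contains this item
            chosen.append(item)
            remaining -= item           # S_1 = S_1 - U
    return chosen, remaining


slot = {0, 1, 2, 3, 5}
candidates = [{0, 1}, {1, 2, 3}, {5}, {4}]
chosen, remaining = greedy_fill(slot, candidates)
# chosen is [{1, 2, 3}, {5}] and row 0 stays free
```

The greedy choice is not guaranteed to be optimal: here taking {0, 1} first would have blocked {1, 2, 3}, and the reverse also leaves row 0 unused.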
Item-by-item filling module 910: for each item T of class II, according to the to-be-filled items it contains and the filling relations between those to-be-filled items and the items of class III, merges the corresponding class III items into T and records the starting position of each class III item within T.
Class III self-merge module 710, connected to the class II item filling module 708, for merging the elementary items of class III whose merge degree is greater than 0 while ensuring that the number of columns occupied after merging is minimal. Each item of class III can be represented by a 0-1 vector, in which 0 means that the space in the corresponding row is free and 1 means that it is occupied. To save space, this vector can be compressed by dropping the 0 components and expressing occupancy by row number, so that an item is represented as a set.
To make the method given by the present invention easier to understand, three concepts are introduced:
Row universe: the set of all rows occupied by the items of class III;
Negation: the rows that an item does not occupy, obtained with respect to the universe, arranged in ascending order of row number and expressed in set form;
Containment: once items are expressed as sets, one item being contained in another is called item containment.
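With Python sets, the compressed representation and the three concepts map onto built-in operations. The helper names below are illustrative, not taken from the patent, and negation is returned as a sorted list to make the ascending order explicit.

```python
def to_row_set(vector):
    """Compress a 0-1 occupancy vector into the set of occupied row numbers."""
    return {row for row, bit in enumerate(vector) if bit}


def row_universe(items):
    """All rows occupied by at least one item."""
    return set().union(*items)


def negate(item, universe):
    """Rows of the universe that the item leaves free, in ascending order."""
    return sorted(universe - item)


def contains(outer, inner):
    """Item containment: every row of inner is also a row of outer."""
    return inner <= outer


items = [to_row_set([1, 0, 1, 0]), to_row_set([0, 1, 0, 0]), to_row_set([0, 0, 1, 0])]
universe = row_universe(items)       # rows 0, 1 and 2; row 3 is used by no item
free_rows = negate(items[0], universe)  # the first item leaves row 1 free
```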
The class III self-merge process according to an embodiment of the present invention is as follows:
Step 1: classify all items of class III by their starting row position; order the classes in ascending order of starting position to form the subclass sequence, and within each class order the items in descending order of the number of elements in each set;
Step 2: if the subclass sequence is empty, the merge process ends; otherwise go to Step 3;
Step 3: obtain the row universe of the current class III;
Step 4: negate every item of the first subclass of the subclass sequence with respect to the universe to form the candidate merge set S, and delete the first subclass from the subclass sequence;
Step 5: if the set S is not empty, carry out Step 6; otherwise go to Step 2;
Step 6: take an item S_j from the set S;
Step 7: using a greedy algorithm and the containment relation, search the remaining subclass sequence of class III for the longest item U such that S_j contains U; if no such item exists, go to Step 6;
Step 8: set S_j = S_j - U and delete U from its subclass; if that subclass is now empty, delete it from the subclass sequence; carry out Step 9;
Step 9: merge the item corresponding to S_j and all of its potential merge items into one item, add it to the merge result sequence, and go to Step 5.
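A simplified sketch of this process follows. It assumes each class III item is a non-empty set of occupied rows, and it handles one seed item at a time rather than a whole subclass, so it illustrates the negate-then-fill idea and is not a literal transcription of the nine steps. `merge_single_columns` is a hypothetical name.

```python
def merge_single_columns(items):
    """Greedily pack single-column items into as few shared columns as it can.

    items: list of non-empty sets of occupied row numbers.
    Returns a list of groups; the items of a group occupy disjoint rows and
    can therefore share one column.
    """
    # Order by starting row, then by descending size, as in Step 1.
    pending = sorted(items, key=lambda s: (min(s), -len(s)))
    groups = []
    while pending:
        universe = set().union(*pending)  # row universe of what is left
        seed = pending.pop(0)
        free = universe - seed            # negation of the seed item
        group = [seed]
        for item in sorted(pending, key=len, reverse=True):
            if item <= free:              # the free rows contain this item
                group.append(item)
                free -= item
                pending.remove(item)
        groups.append(group)
    return groups


items = [{0, 2}, {1}, {0, 1}, {3}, {2, 3}]
groups = merge_single_columns(items)
# Two columns suffice: [{0, 2}, {1}, {3}] and [{0, 1}, {2, 3}]
```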
Returning to Fig. 7, the outlier ordering module 712, connected to the class III self-merge module, is for ordering all items of class II, class III and class IV so that, after the items are merged in this order, the number of free cells enclosed inside is minimal, and for recording the merge relations between elementary items; and
the outlier merge module 714, connected to the outlier ordering module, is for merging the items of class II, class III and class IV according to the ordering given by the outlier ordering module (see Fig. 11).
Returning to Fig. 6, the relating value calculation module 608 is for calculating, from the final merge result of the merge module 606, the relating value of each constituent character of the dictionary on this layer. When all items of a layer of the TRIE tree have been merged into one item, the merged item is traversed and the starting position of each elementary item within it is set as the relating value of the corresponding character on that layer; the space occupancy range of each node is then the occupancy of its row. For example, Fig. 13 shows the final merge result for the nodes of one layer of the TRIE tree: 1, 2 and 3 denote three nodes; each colored segment corresponds to the space occupied by the words under one child node (one element); E_1, E_2 and E_3 are the starting positions of the space occupied by the three elementary items, i.e. the relating values of the corresponding elements; and L_1, L_2 and L_3 describe the space occupancy ranges of the nodes after merging. For a column deleted in the simplification stage, the relating value of the corresponding element is 0. For a deleted row, the space occupancy range of the corresponding node equals the space occupancy range of its non-empty child node. And
Space occupancy information upload module 610, for passing the space occupancy of the nodes of this layer up to the layer above. When all items of a layer have been merged into one item, the space occupancy of each row of the item is passed to the parent of the node corresponding to that row and stored in the corresponding child node component.
By adding relating values, the system can quickly determine whether a word belongs to the dictionary, and by raising the space utilization of the hash table the present invention reduces the space complexity of dictionary index design. Because the hash function construction method given by the present invention occupies relatively little memory compared with similar indexes, it can be applied in systems with demanding speed and space requirements.
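The text does not spell out the lookup procedure, so the following is only one plausible reading, offered as a sketch: the hash index of a word is the sum of the relating values of its leading characters, one per layer, and the table slot is compared with the word to confirm membership. The per-layer values in `assoc`, the stored-word comparison, and the function names are all assumptions for illustration.

```python
def hash_index(word, assoc):
    """Add the per-layer relating values of the leading characters of word.

    assoc: list of dicts, one per layer, mapping character -> relating value.
    Returns None if a character has no relating value on its layer.
    """
    total = 0
    for ch, layer in zip(word, assoc):
        if ch not in layer:
            return None
        total += layer[ch]
    return total


def is_member(word, assoc, table):
    """Membership test: compute the index, then compare the stored entry."""
    idx = hash_index(word, assoc)
    return idx is not None and idx < len(table) and table[idx] == word


# Hand-made relating values for the toy dictionary {"ab", "ac", "bb"}.
assoc = [{"a": 0, "b": 2}, {"b": 0, "c": 1}]
table = ["ab", "ac", "bb"]  # every word lands in its own slot, none is empty
```

Here "ab", "ac" and "bb" map to slots 0, 1 and 2, while "bc" maps to 3, which lies outside the table, so it is rejected without any string comparison.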
Although various embodiments of the present invention have been described, further embodiments and implementations within the scope of the invention are possible for those skilled in the art. Any variation or modification made according to the invention falls within the scope of protection of the claims.

Claims (10)

1. A perfect hash function construction method for a dictionary of arbitrary scale, characterized by comprising:
Step 1: building a TRIE tree from the dictionary or keyword set;
Step 2: determining the word correlation length k;
Step 3: starting at layer k of said TRIE tree and proceeding up to layer 0, calculating layer by layer from the bottom up the relating value of each character of said dictionary on each layer; and
Step 4: realizing the perfect mapping from said dictionary to the hash table according to the determined word correlation length and said relating value of each character on each layer.
2. The perfect hash function construction method according to claim 1, characterized in that said Step 3 further comprises:
Step 31: obtaining all nodes of layer l of said TRIE tree;
Step 32: initializing the space occupancy range of each child node of all nodes of said layer l;
Step 33: converting the space occupancy ranges of the child nodes of all nodes of said layer l into elementary items;
Step 34: arranging all said elementary items in linear space according to a specified priority, superimposing them on one another and merging them into a single whole, so that the space occupied after merging is minimal while non-empty positions are guaranteed not to overlap;
Step 35: calculating the character relating values and the space occupancy range of each node of said layer l, and passing the space occupancy range of each node of layer l to its parent node in layer l-1 of said TRIE tree;
wherein l is an integer from k to 0, and elementary items and combined items are collectively referred to as items.
3. The perfect hash function construction method according to claim 2, characterized in that, in said Step 33, in converting all nodes of said layer l into elementary items, the space occupancy of the child nodes of all nodes of said layer l of said TRIE tree is represented by a matrix A of size n × m, where m is the number of branches of a node of said TRIE tree and n is the number of nodes in layer l; matrix A is obtained by collecting, from left to right, the child-node space occupancy of all nodes of said layer l of said TRIE tree, with child nodes that represent the same character placed in the same column, and each node is placed in matrix A as one row, from left to right and from top to bottom; each column is regarded as one elementary item, giving m elementary items.
4. The perfect hash function construction method according to claim 2, characterized in that said elementary items have the following features:
Null item: all the space in some row of the elementary item is unoccupied;
Non-null item: every row of the elementary item contains occupied space;
Flat item: all Back_Idle and Front_Idle values of the elementary item are 0;
Convex-concave item: every elementary item other than a flat item;
Single-column item: the elementary item consists of one column; and
Multi-column item: the elementary item comprises a plurality of columns.
5. The perfect hash function construction method according to claim 4, characterized in that said Step 34 further comprises:
Step 341: dividing said elementary items into four classes, class I, class II, class III and class IV, according to the features of the elementary items, wherein said class I has the features multi-column, non-null and convex-concave; said class II has the features multi-column and null; said class III has the features single-column and null; and said class IV has the features single-column and non-null;
Step 342: processing the items of said class I by merging any two items whose encapsulation number is 0 to generate combined items, and recording the merge relations between elementary items; the generated combined items are added to class I and continue to take part in merging;
Step 343: adding the remaining items of said class I to said class II;
Step 344: according to the priority merge index, progressively merging the items of said class II whose merge degree is greater than 0, and recording the merge relations between elementary items; the free cells enclosed inside during merging are also recorded in single-column form as to-be-filled items, together with the positions of these to-be-filled items within the merged item; the combined items after merging are put into class II and continue to take part in merging;
Step 345: ordering the remaining items of said class II so that the fewest free cells are enclosed inside when they are merged in this order, then merging the items of said class II into a temporary item T according to said ordering, and continuing to generate or merge to-be-filled items according to said temporary item T;
Step 346: merging the generated to-be-filled items by position with the free space not enclosed by said temporary item T, and sorting the to-be-filled items in ascending order of the number of free cells not enclosed inside said temporary item T;
Step 347: merging the items of said class III as far as possible into the items of said class II according to the to-be-filled items;
Step 348: merging the remaining elementary items of said class III into single-column items, recording the merge relations between elementary items, and ensuring that these merged single-column items cannot be merged any further; and
Step 349: ordering the remaining items of said class II, said class III and said class IV and merging them into one item according to this ordering, so that the free space enclosed inside is minimal.
6. The perfect hash function construction method according to claim 2, characterized in that, in Step 35, the relating value of the character corresponding to each elementary item is calculated from the merge relations and the width of each elementary item.
7. A perfect hash function construction system for a dictionary of arbitrary scale, characterized by comprising:
a TRIE tree construction module, connected to the dictionary file, for building a TRIE tree from the dictionary or keyword set;
a word correlation length determination module, connected to said TRIE tree construction module, for determining the word correlation length k;
a layered relating value setting module, connected to said word correlation length determination module, for calculating, layer by layer from the bottom up starting at layer k of said TRIE tree, the relating value of each character of said dictionary on each layer, and for storing the relating values in the relating value table; and
a perfect mapping module, connected to said layered relating value setting module, for realizing the perfect mapping from said dictionary to the hash table according to said word correlation length calculated by said word correlation length determination module and said relating value of each character on each layer calculated by said layered relating value setting module.
8. The perfect hash function construction system according to claim 7, characterized in that said layered relating value setting module comprises:
an initialization module, for initializing the space occupancy information of all nodes of layer l of said TRIE tree;
an elementary item generation module, for generating elementary items from the space occupancy information of each node of said layer l of said TRIE tree;
a merge module, for arranging all said elementary items generated by said elementary item generation module in linear space according to a specified priority, superimposing them on one another and merging them into a single whole, so that the space occupied after merging is minimal while non-empty positions are guaranteed not to overlap;
a relating value calculation module, for calculating, from the final merge result of said merge module, the relating value of each constituent character of said dictionary on this layer; and
a space occupancy information upload module, for passing the space occupancy of the nodes of layer l to the corresponding parent nodes of layer l-1.
9. The perfect hash function construction system according to claim 8, characterized in that said merge module comprises:
an elementary item classification module, connected to said elementary item generation module, for calculating the feature values of all elementary items of said layer l of said TRIE tree and dividing the elementary items, according to the features of each, into four classes, class I, class II, class III and class IV, wherein said class I has the features multi-column, non-null and convex-concave; said class II has the features multi-column and null; said class III has the features single-column and null; and said class IV has the features single-column and non-null;
a class I self-merge module, connected to said elementary item classification module, for merging those items of said class I whose mutual encapsulation number is 0 and recording the starting position of each elementary item within the merged item, the merged items being added to class I to take part in further merging; when the encapsulation number between any two items of class I is greater than 0, all items of said class I are added to said class II;
a class II self-merge module, connected to said class I self-merge module, for merging by priority, according to the priority merge index, the items of said class II whose merge degree is greater than 0, the combined items after merging being put into class II to continue taking part in merging until the merge degree of any two items is 0, and for recording during merging the corresponding to-be-filled items and the starting position of each elementary item within the merged item;
a class II item filling module, connected to said class II self-merge module, for merging the elementary items of said class III as far as possible into the free cells contained in the items of said class II, until no elementary item of said class III can be merged into an item of class II;
a class III self-merge module, connected to said class II item filling module, for merging the elementary items of said class III whose merge degree is greater than 0, ensuring that the number of columns occupied after merging is minimal, and recording the merge relations between elementary items;
an outlier ordering module, connected to said class III self-merge module, for ordering all items of said class II, said class III and said class IV so that, after the items are merged in this order, the number of free cells enclosed inside is minimal; and
an outlier merge module, connected to said outlier ordering module, for merging the items of said class II, said class III and said class IV according to the ordering given by said outlier ordering module.
10. The perfect hash function construction system according to claim 8, characterized in that the class II item filling module comprises:
Outlier sorting module, configured to sort the items of class II so that, after the items are merged in this order, the number of free cells enclosed inside the merged result is minimal;
Outlier merging module, configured to merge the items of class II into one temporary item according to the order provided by the outlier sorting module;
Final to-be-filled sequence generation module, configured to merge all free cells of the class II items with the to-be-filled items to generate new to-be-filled items, and to sort these to-be-filled items in ascending order of the extent to which they are not enclosed inside the temporary item generated by the outlier merging module;
Filling relation generation module, configured to determine, one by one in sorted order, for each to-be-filled item provided by the final to-be-filled sequence generation module, which elementary items of class III should compose it so that its remaining free space is minimal, thereby generating a filling relation; and
Item-by-item filling module, configured to merge the elementary items of class III into the corresponding items of class II according to the filling relation generated by the filling relation generation module and the relation between the to-be-filled items and the items of class II, and to record the merging relations between the elementary items.
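The gap-filling step of claim 10 (choosing, for each to-be-filled item in sorted order, the class III elementary items that leave the least free space) can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed method: gaps and elementary items are reduced to plain cell counts, cell positions are ignored, and the selection is done with a 0/1 subset-sum search. The names `best_fill` and `fill_gaps` and the sample data are hypothetical.

```python
def best_fill(capacity, sizes):
    """Return indices of `sizes` whose sum is as large as possible
    without exceeding `capacity` (each index used at most once)."""
    # reachable[total] = list of indices that add up to `total`
    reachable = {0: []}
    for i, size in enumerate(sizes):
        # Iterate over a snapshot so item i is added at most once per sum.
        for total, chosen in list(reachable.items()):
            new_total = total + size
            if new_total <= capacity and new_total not in reachable:
                reachable[new_total] = chosen + [i]
    return reachable[max(reachable)]


def fill_gaps(gaps, elementary):
    """Assign elementary items to gaps, smallest gap first.

    gaps: dict gap name -> number of free cells (assumed model).
    elementary: dict item name -> number of cells it needs (assumed model).
    Returns (relation, remaining): the filling relation and unused items.
    """
    remaining = dict(elementary)
    relation = {}
    for gap in sorted(gaps, key=lambda g: (gaps[g], g)):
        names = sorted(remaining)
        picked = best_fill(gaps[gap], [remaining[n] for n in names])
        relation[gap] = [names[i] for i in picked]
        for n in relation[gap]:
            del remaining[n]
    return relation, remaining


gaps = {"g1": 3, "g2": 5}
elementary = {"w": 1, "x": 2, "y": 3, "z": 4}
relation, remaining = fill_gaps(gaps, elementary)
print(relation)   # {'g1': ['w', 'x'], 'g2': ['z']}
print(remaining)  # {'y': 3}
```

Processing gaps one at a time is greedy: each gap is filled as tightly as possible from what is left, which does not guarantee the best overall packing. In the sample, giving `y` to `g1` would have let `w` and `z` fill `g2` completely.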
CN200910077767A 2009-02-18 2009-02-18 Perfect hash function construction method and system for dictionary with random scale Pending CN101807188A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910077767A CN101807188A (en) 2009-02-18 2009-02-18 Perfect hash function construction method and system for dictionary with random scale

Publications (1)

Publication Number Publication Date
CN101807188A true CN101807188A (en) 2010-08-18

Family

ID=42608986

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910077767A Pending CN101807188A (en) 2009-02-18 2009-02-18 Perfect hash function construction method and system for dictionary with random scale

Country Status (1)

Country Link
CN (1) CN101807188A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114201972A (en) * 2021-12-14 2022-03-18 长安银行股份有限公司 Financing product data processing method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1066570B1 (en) * 1998-02-26 2003-10-29 Sap Ag Fast string searching and indexing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王全礼: ""基于HASH机制的分词词典的设计与实现"", 《电子科技大学硕士论文》 *

Similar Documents

Publication Publication Date Title
Snoeyink Point location
Kumar et al. Improved algorithms and data structures for solving graph problems in external memory
CN103765381B (en) Parallel work-flow to B+ tree
Arge Efficient external-memory data structures and applications
Prasad et al. GPU-based Parallel R-tree Construction and Querying
CN106919769A (en) A kind of hierarchy type FPGA placement-and-routings method based on Hierarchy Method and empowerment hypergraph
CN102750328B (en) A kind of construction and storage method of data structure
JP5241738B2 (en) Method and apparatus for building tree structure data from tables
CN110147377A (en) General polling algorithm based on secondary index under extensive spatial data environment
CN109889205A (en) Encoding method and system, decoding method and system, and encoding and decoding method and system
CN101477555B (en) Fast retrieval and generation display method for task tree based on SQL database
CN108021702A (en) Classification storage method, device, OLAP database system and medium based on LSM-tree
CN101788990A (en) Global optimization and construction method and system of TRIE double-array
Talcott An actor rewriting theory
Arge et al. Cache-oblivious data structures
Roumelis et al. Bulk-loading and bulk-insertion algorithms for xBR+-trees in Solid State Drives
Kaplan et al. Purely functional, real-time deques with catenation
Han et al. Parallel integer sorting is more efficient than parallel comparison sorting on exclusive write PRAMs
CN101807188A (en) Perfect hash function construction method and system for dictionary with random scale
Edelkamp et al. Optimizing binary heaps
CN107273443A (en) A kind of hybrid index method based on big data model metadata
Grygiel et al. Cube diagram bundles: a new representation of strongly unspecified multiple-valued functions and relations
CN104268092B (en) File storage system and file storage method
CN110162531B (en) Distributed concurrent data processing task decision method
CN110110270B (en) Parallel processing generation method and device for large genealogy lineage diagram

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20100818