CN1889080A - Method for searching character string - Google Patents


Info

Publication number
CN1889080A
Authority
CN
China
Prior art keywords
character string
node
character
multiway tree
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 200610052710
Other languages
Chinese (zh)
Inventor
陈纯
卜佳俊
刘康苗
陈伟
赵梦
潘照明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN 200610052710 priority Critical patent/CN1889080A/en
Publication of CN1889080A publication Critical patent/CN1889080A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for searching character strings. It records the character-sequence information of each string in a multiway tree and uses that information for retrieval, which makes it suitable for exact lookup as well as for prefix and suffix search. It manages the memory allocation of multiway-tree nodes layer by layer and compresses the tree, which saves memory and keeps retrieval efficient.

Description

A method for searching character strings
Technical field
The present invention relates to the management of large sets of character strings, and in particular to a method for searching character strings.
Background art
In recent years, as the number of computer users has grown, new computer applications have kept emerging, and many of them must manage a very large set of character strings efficiently. File management, bibliographic search, geographic information indexing, and search engines that process large volumes of text on the Web all deal with very large string sets. With the rise of IPv6, fast lookup within an enormous number of IP addresses has also become a concern.
Many data structures for string management already exist. The binary search tree (BST) is the simplest of them. Its performance depends mainly on the shape of the tree, which in turn depends on the order of the input strings. If the input is well distributed, a BST is quite fast. If the input is poorly distributed, and especially if it is already sorted, the BST degenerates into a single-branch tree and its worst-case lookup cost is O(n). A plain BST is therefore particularly unsuitable for input that arrives in sorted order. The BST has many variants, such as the AVL tree and the red-black tree. Unlike the plain BST, they restructure the tree while strings are inserted so that it stays approximately balanced and never degenerates into a single-branch tree. The BST and its variants occupy little space but are slow, so they are not the ideal data structure. Two other data structures commonly used in string processing are the hash table and the B-tree. A hash table computes the storage address of a record directly, mapping a key (the string) straight to a memory location. In applications that need the strings kept in order, such as index maintenance and prefix search, a hash table is of limited value. The B-tree has the same problem when strings must be retrieved quickly.
If a general data storage and access method can be defined around the internal structure of the data to be accessed, storage efficiency improves and data access time falls.
Summary of the invention
The object of the present invention is to provide a method for retrieving character strings efficiently.
The technical solution adopted by the invention is as follows:
1) Record the structural information of the character strings in a multiway tree, and manage the memory allocation of the tree's nodes layer by layer
Strings are stored in the form of a multiway tree. The first layer of the tree is the root node, and each subsequent layer represents, in order, one character of the string. Each tree node stores the current character, the id of the string formed by the path from the root to this node, and a pointer to its successor nodes. The id identifies a unique string; by default the program assigns ids sequentially starting from 1, or the user may specify them. For Chinese strings, the first layer of the multiway tree is the root node, and the second layer is identified by the GB2312 code of the string's first Chinese character. It has more than 3,000 branches, one for each of more than 3,000 Chinese characters. The leftmost branch has the smallest GB2312 code and represents the character "啊"; every other character is indexed by the difference between its GB2312 code and that of "啊". The third layer represents the second Chinese character of the string. Because Chinese vocabulary follows fixed combinations and collocations, not every pair of characters forms a word, so the characters that can follow a given character are limited and are not the full character set. From the third layer on, therefore, if the number of successor characters that may appear in the layer below a character node is greater than or equal to half the number of distinct Chinese characters, the successors are indexed by the difference between their GB2312 codes and that of "啊"; otherwise, only the information for the successor characters that may actually appear below the node is stored. The remaining layers follow the same rule, which saves a great deal of storage space;
For English strings, the first layer of the multiway tree is also the root node, and the second layer represents the first letter of the string. It has 26 branches, one for each of the 26 English letters. Because the alphabet is small and takes little storage space, every layer below the second represents the successive letters of the string in the same way, so every layer has 26 branches;
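The indexing rule described above can be sketched in Python. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, and the dense row-and-column arithmetic is an assumption about how the "difference from 啊" is turned into a branch number (GB2312 places 啊 at 0xB0A1, and each row of the code table holds 94 positions).

```python
def gb2312_offset(ch: str) -> int:
    """Branch index of a GB2312 Chinese character, counted from '啊' (0xB0A1).

    Assumption: the index is dense, i.e. row distance * 94 + column distance,
    so consecutive characters in the code table get consecutive indices.
    """
    hi, lo = ch.encode("gb2312")  # every GB2312 Chinese character is two bytes
    return (hi - 0xB0) * 94 + (lo - 0xA1)


def english_offset(ch: str) -> int:
    """Branch index of an English letter: its position in the alphabet, from 0."""
    return ord(ch.lower()) - ord("a")


print(gb2312_offset("啊"), gb2312_offset("阿"), english_offset("c"))
```

With this numbering the text's "(a+1)-th branch" corresponds to index `a` in a zero-based array.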
2) Inserting a character string into the multiway tree
A) For a Chinese character string:
First step: the difference between the GB2312 code of the string's first Chinese character and the GB2312 code of "啊" is used to index a branch node in the second layer of the multiway tree. This branch node stores a pointer to its lower-layer node, that is, to the information for the string's second Chinese character;
Second step: if the number of possible successor characters of the first Chinese character is greater than or equal to half the number of distinct Chinese characters, the third-layer branch node is indexed by the difference between the successor character's GB2312 code and that of "啊". Otherwise, 16 lower-layer node storage cells are pre-allocated for the first character's node (cells are allocated 16 at a time so that memory need not be reallocated every time a string is added). If space for the lower-layer (third-layer) nodes has already been allocated for this branch node, one of its cells is assigned to the second Chinese character; if the 16 cells allocated last time are used up, another 16 are allocated. A third-layer node cell stores the second Chinese character, the id of the string added so far, and a pointer to the information for the third Chinese character;
Third step: likewise, if the number of possible successor characters of the second Chinese character is greater than or equal to half the number of distinct Chinese characters, the fourth-layer branch node is indexed by the difference between the successor character's GB2312 code and that of "啊". Otherwise, 16 lower-layer node storage cells are pre-allocated for the third-layer node; a cell stores the third Chinese character, the id of the string added so far, and a pointer to the information for the fourth Chinese character. Storage for each node's lower-layer nodes is allocated in the same way, layer by layer, until the end of the string is reached, at which point the string has been added to the multiway tree;
B) For an English character string:
First step: the position of the string's first letter in the alphabet is used to index a branch node in the second layer of the multiway tree. This branch node stores a pointer to the third-layer node, that is, to the information for the string's second letter;
Second step: likewise, the position of the string's second letter in the alphabet is used to index a branch node in the third layer. The process continues in the same way until the end of the string is reached, at which point the string has been added to the multiway tree;
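The English insertion steps can be sketched as follows. This is a simplified illustration under stated assumptions: the class and attribute names are hypothetical, only lowercase letters are handled, and an id of 0 is used to mean that no string ends at a node.

```python
class Node:
    def __init__(self):
        self.children = [None] * 26  # one slot per English letter
        self.id = 0                  # assumption: 0 means "no string ends here"


class Trie:
    def __init__(self):
        self.root = Node()
        self.next_id = 1             # ids are assigned in order, starting from 1

    def insert(self, word, string_id=None):
        """Walk one layer per letter, creating missing nodes, then record the id."""
        node = self.root
        for ch in word:
            i = ord(ch) - ord("a")   # position of the letter in the alphabet
            if node.children[i] is None:
                node.children[i] = Node()
            node = node.children[i]
        if node.id == 0:             # string not stored yet
            if string_id is None:
                string_id = self.next_id
                self.next_id += 1
            node.id = string_id
        return node.id


trie = Trie()
print(trie.insert("search"), trie.insert("seek"), trie.insert("search"))
```

Inserting a string that is already present returns its existing id instead of assigning a new one; the text does not say what should happen in that case, so this is a design choice of the sketch.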
3) Compressing the string set
When the multiway tree is built, some layers may have many branches (the second layer of the Chinese tree alone has more than 3,000), and some of those branches may never be used by the string set. In layers with fewer branches, node storage cells are pre-allocated 16 at a time, so some of the allocated space may also go unused. The multiway tree structure is therefore first written to external storage; when it is read back into memory, the empty, unused cells are released and only the cells that hold data are loaded, in their original structure. This achieves a substantial compression of the string set;
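The effect of the compression step can be illustrated with a small sketch. It is an assumption-laden model rather than the patent's storage format: nodes are plain dictionaries, every node pre-allocates cells in blocks of 16, JSON stands in for external storage, and all function names are hypothetical.

```python
import json

BLOCK = 16  # cells are pre-allocated 16 at a time


def new_node():
    return {"id": 0, "slots": [None] * BLOCK}


def child(node, ch):
    """Return the child cell for ch, taking the first free slot if it is new."""
    slots = node["slots"]
    for slot in slots:
        if slot is not None and slot[0] == ch:
            return slot[1]
    if None not in slots:            # block used up: allocate another 16 cells
        slots.extend([None] * BLOCK)
    free = slots.index(None)
    slots[free] = [ch, new_node()]
    return slots[free][1]


def insert(root, word, string_id):
    node = root
    for ch in word:
        node = child(node, ch)
    node["id"] = string_id


def compact(node):
    """Copy the tree, keeping only the cells that actually hold data."""
    used = [s for s in node["slots"] if s is not None]
    return {"id": node["id"], "slots": [[s[0], compact(s[1])] for s in used]}


def count_slots(node):
    used = [s for s in node["slots"] if s is not None]
    return len(node["slots"]) + sum(count_slots(s[1]) for s in used)


root = new_node()
insert(root, "ab", 1)
stored = json.dumps(compact(root))   # "write to external storage"
reloaded = json.loads(stored)        # "read back only the cells with data"
print(count_slots(root), count_slots(reloaded))
```

The reloaded tree keeps the same parent-child structure, but the empty pre-allocated cells are gone.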
4) Exact retrieval of a character string in the multiway tree
A) For a Chinese character string:
First step: the difference between the GB2312 code of the string's first Chinese character and the GB2312 code of "啊" is used to find that character's branch node in the second layer of the multiway tree;
Second step: following this branch downward, the third-layer branch node is found from the second Chinese character of the string being retrieved; from the third-layer branch node, the fourth-layer branch node is found from the third Chinese character. The process continues in the same way until the end of the string. The node at the end records the id of the string, and returning this id completes the exact retrieval. If the branches of the multiway tree cannot reach the end of the string, the string is not in the tree;
B) For an English character string:
First step: the position of the string's first letter in the alphabet is used to find that letter's branch node in the second layer of the multiway tree;
Second step: following this branch downward, the position of the string's second letter in the alphabet is used to find that letter's branch node in the third layer. The process continues in the same way until the end of the string. The node at the end records the id of the string, and returning this id completes the exact retrieval. If the branches of the multiway tree cannot reach the end of the string, the string is not in the tree;
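Exact retrieval can be sketched in a few lines. To keep the example short, child nodes are held in Python dictionaries instead of the indexed arrays described in the text; the function names and the dictionary layout are assumptions of this sketch.

```python
def insert(root, word, string_id):
    node = root
    for ch in word:
        node = node.setdefault("next", {}).setdefault(ch, {})
    node["id"] = string_id


def search(root, word):
    """Return the id stored at the end of word's path, or None if it is absent."""
    node = root
    for ch in word:
        node = node.get("next", {}).get(ch)
        if node is None:          # the branch cannot reach the end of the string
            return None
    return node.get("id")         # None when the path exists but no string ends here


root = {}
insert(root, "互联网", 2)
print(search(root, "互联网"), search(root, "互联"), search(root, "互联网络"))
```

A path that exists only as the prefix of a longer string (here 互联) returns no id, matching the rule that only the end node of a stored string carries one.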
5) Fuzzy retrieval of character strings in the multiway tree
Fuzzy retrieval means that, given a prefix or a suffix, all strings in the multiway tree that contain this prefix or suffix are retrieved. For a prefix, the procedure is the same as for exact retrieval: starting from the root node, the nodes on the prefix's branch are found layer by layer until the end of the prefix. All lower-layer branches starting from the end node are then collected. These branches represent the possible continuations of the prefix, and the prefix combined with each continuation gives the results of the fuzzy retrieval. If the branches of the multiway tree cannot reach the end of the prefix, the retrieval fails;
For a suffix, the search runs in the opposite direction. Starting from the bottom-layer nodes of the multiway tree, the branch nodes corresponding to the characters of the suffix are found layer by layer upward, matching from the last character of the suffix toward its first. Once the first character of the suffix is reached, all upper-layer branches of the corresponding node are collected. These branches represent the strings that precede the suffix, and each of them concatenated with the suffix gives a result of the fuzzy retrieval. If, while searching upward, the branches of the multiway tree cannot reach the first character of the suffix, the retrieval fails;
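Both directions of fuzzy retrieval can be sketched as follows. The sketch assumes each node keeps a pointer to its parent so that the upward walk is possible, and it starts the suffix match from every node that carries an id rather than only from leaf nodes; both choices, like all names here, are assumptions rather than details given in the text.

```python
class Node:
    def __init__(self, ch="", parent=None):
        self.ch, self.parent = ch, parent
        self.children = {}
        self.id = 0                          # 0 means "no string ends here"


def insert(root, word, string_id):
    node = root
    for ch in word:
        if ch not in node.children:
            node.children[ch] = Node(ch, node)
        node = node.children[ch]
    node.id = string_id


def walk(node):
    """Yield node and every node below it."""
    yield node
    for c in node.children.values():
        yield from walk(c)


def spell(node):
    """Rebuild the string for a node by following parent pointers to the root."""
    chars = []
    while node.parent is not None:
        chars.append(node.ch)
        node = node.parent
    return "".join(reversed(chars))


def by_prefix(root, prefix):
    node = root
    for ch in prefix:                        # same walk as exact retrieval
        node = node.children.get(ch)
        if node is None:
            return []                        # retrieval fails
    return sorted(spell(n) for n in walk(node) if n.id)


def by_suffix(root, suffix):
    hits = []
    for end in walk(root):
        if not end.id:
            continue
        node = end
        for ch in reversed(suffix):          # match from the last character upward
            if node.parent is None or node.ch != ch:
                break
            node = node.parent
        else:
            hits.append(spell(end))
    return sorted(hits)


root = Node()
for i, w in enumerate(["search", "seek", "web", "internet"], start=1):
    insert(root, w, i)
print(by_prefix(root, "se"), by_suffix(root, "net"))
```

The suffix search visits every stored string once, so it costs more than the prefix search, which only touches the subtree below the prefix.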
6) Deleting a character string from the multiway tree
First step: the branch of the string to be deleted is found in the multiway tree by the exact retrieval method;
Second step: if the branch node at the end of the string is a bottom-layer node of the multiway tree, that is, the string has no successor characters, the id of the string is deleted from the end node and the storage occupied by the end node is released. If the branch node at the end of the string is not a bottom-layer node, that is, the end node still has lower-layer branches, only the id of the string is deleted from the end node and everything else is left unchanged. The deletion then succeeds.
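The two deletion cases can be sketched as below. Nodes are modelled as dictionaries and an id of 0 marks a node where no string ends; these representation choices and the function names are assumptions. As in the text, only the end node is released; ancestors that become empty are left in place.

```python
def new_node():
    return {"id": 0, "next": {}}


def insert(root, word, string_id):
    node = root
    for ch in word:
        node = node["next"].setdefault(ch, new_node())
    node["id"] = string_id


def find(root, word):
    node = root
    for ch in word:
        node = node["next"].get(ch)
        if node is None:
            return None
    return node


def delete(root, word):
    """Clear the id at the end node; free the node only if it has no branches."""
    path = [root]
    for ch in word:                      # step 1: locate the branch
        nxt = path[-1]["next"].get(ch)
        if nxt is None:
            return False
        path.append(nxt)
    end = path[-1]
    if end["id"] == 0:                   # path exists but no string ends here
        return False
    end["id"] = 0                        # step 2: delete the id
    if not end["next"]:                  # bottom-layer node: release its cell
        del path[-2]["next"][word[-1]]
    return True


root = new_node()
insert(root, "search", 1)
insert(root, "searches", 2)
print(delete(root, "search"), find(root, "searches")["id"])
```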
Compared with the background art, the invention has the following beneficial effects:
The invention is a method for efficiently retrieving Chinese and English character strings in environments with large string sets.
(1) The invention is a new indexing method that stores, inserts, retrieves, and deletes Chinese and English strings efficiently. It is particularly suited to large-scale string processing, where both fast lookup and the various forms of fuzzy retrieval, such as prefix and suffix search, outperform the traditional B-tree index in time and space efficiency.
(2) The invention also manages the storage of the multiway tree layer by layer, combines static and dynamic allocation for the memory of multiway-tree nodes, and applies appropriate compression to the tree that is finally generated, which effectively saves memory. It thereby overcomes the shortcomings of traditional string retrieval methods: low retrieval efficiency, or high retrieval efficiency at the cost of excessive memory consumption, and no support for fuzzy queries.
Description of drawings
Fig. 1 is a schematic diagram of the overall structure of the system;
Fig. 2 is a logical schematic diagram of the multiway tree after four strings (分词 "word segmentation", 互联网 "internet", and two strings that both translate as "search") have been inserted;
Fig. 3 is a schematic diagram of the internal structure of the multiway tree after the same four strings have been inserted;
Embodiment
In an application system based on large-scale string processing, the multiway-tree index access method provided by the invention enables efficient string retrieval. Taking the word-segmentation dictionary of a search engine as an example, the overall structure of the system is shown in Fig. 1, and the concrete implementation steps are as follows:
1. Create the string storage structure: initialize it, allocate the necessary memory, and decide how many layers use statically allocated memory and how many use dynamically allocated memory.
2. Call the procedure that inserts a string into the multiway tree. The detailed process is as follows:
A) For a Chinese character string:
First step: the difference between the GB2312 code of the string's first Chinese character and the GB2312 code of "啊" is used to index a branch node in the second layer of the multiway tree. This branch node stores a pointer to its lower-layer node, that is, to the information for the string's second Chinese character;
Second step: if the number of possible successor characters of the first Chinese character is greater than or equal to half the number of distinct Chinese characters, the third-layer branch node is indexed by the difference between the successor character's GB2312 code and that of "啊". Otherwise, 16 lower-layer node storage cells are pre-allocated for the first character's node (cells are allocated 16 at a time so that memory need not be reallocated every time a string is added). If space for the lower-layer (third-layer) nodes has already been allocated for this branch node, one of its cells is assigned to the second Chinese character; if the 16 cells allocated last time are used up, another 16 are allocated. A third-layer node cell stores the second Chinese character, the id of the string added so far, and a pointer to the information for the third Chinese character;
Third step: likewise, if the number of possible successor characters of the second Chinese character is greater than or equal to half the number of distinct Chinese characters, the fourth-layer branch node is indexed by the difference between the successor character's GB2312 code and that of "啊". Otherwise, 16 lower-layer node storage cells are pre-allocated for the third-layer node; a cell stores the third Chinese character, the id of the string added so far, and a pointer to the information for the fourth Chinese character. Storage for each node's lower-layer nodes is allocated in the same way, layer by layer, until the end of the string is reached, at which point the string has been added to the multiway tree;
B) For an English character string:
First step: the position of the string's first letter in the alphabet is used to index a branch node in the second layer of the multiway tree. This branch node stores a pointer to the third-layer node, that is, to the information for the string's second letter;
Second step: likewise, the position of the string's second letter in the alphabet is used to index a branch node in the third layer. The process continues in the same way until the end of the string is reached, at which point the string has been added to the multiway tree;
For example, the string 互联网 ("internet") is inserted into the multiway tree as follows:
(1) Subtract the GB2312 code of "啊" from the GB2312 code of "互". If the difference is a, the character is inserted at branch (a+1) of the second layer of the multiway tree. This branch node stores a pointer to "联", the successor character of "互"; the pointer refers to the storage cell allocated for "联".
(2) If the number of successor characters of "互" is greater than or equal to half the number of distinct Chinese characters, subtract the GB2312 code of "啊" from the GB2312 code of "联". If the difference is b, "联" goes into branch (b+1) of the third layer of the multiway tree, and this branch node stores the id of the string "互联" and a pointer to "网", the successor character of "联". If the number of successor characters of "互" is less than half the number of distinct Chinese characters, 16 lower-layer node storage cells are allocated for "互", and one of them stores the id of the string "互联" and the pointer to the successor character "网".
(3) Likewise, if the number of successor characters of "联" is greater than or equal to half the number of distinct Chinese characters, subtract the GB2312 code of "啊" from the GB2312 code of "网". If the difference is c, "网" goes into branch (c+1) of the fourth layer of the multiway tree, and this branch node stores the id of the string 互联网. If the number of successor characters of "联" is less than half the number of distinct Chinese characters, 16 lower-layer node storage cells are allocated for "联", and one of them stores the id of the string 互联网. This completes the insertion of 互联网.
After the four strings (分词, 互联网, and the two "search" strings) have been inserted, the logical structure of the multiway tree is as shown in Fig. 2. A black node indicates that the path from the root to that node forms a complete string, which is marked with its own id. The corresponding internal storage structure after insertion is shown in Fig. 3.
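The choice between the two child layouts in steps (2) and (3) can be written as a small rule. This sketch is illustrative only: the function names are hypothetical, and the alphabet size of 3755 (the number of level-1 Chinese characters in GB2312) is an assumption standing in for the text's "more than 3,000".

```python
def layout(successors: int, alphabet: int) -> str:
    """'dense' = one slot per character, indexed by code offset from 啊;
    'sparse' = only the successors that occur, in cells allocated 16 at a time."""
    return "dense" if 2 * successors >= alphabet else "sparse"


def cells_allocated(successors: int, alphabet: int, block: int = 16) -> int:
    """Number of child cells held in memory before compression."""
    if layout(successors, alphabet) == "dense":
        return alphabet
    blocks = -(-successors // block)   # ceiling division: whole blocks of 16
    return blocks * block


ALPHABET = 3755  # assumption: GB2312 level-1 characters
for n in (3, 17, 2000):
    print(n, layout(n, ALPHABET), cells_allocated(n, ALPHABET))
```

A character with only a handful of successors therefore costs 16 cells instead of several thousand, which is where most of the space saving comes from.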
3. Call the string-set compression procedure. The detailed process is as follows:
In the storage structure that holds the four strings inserted above (分词, 互联网, and the two "search" strings), the compression procedure releases the surplus pre-allocated null pointers in the last two layers of the multiway tree, thereby compressing memory.
4. Call the exact retrieval procedure. The detailed process is as follows:
A) For a Chinese character string:
First step: the difference between the GB2312 code of the string's first Chinese character and the GB2312 code of "啊" is used to find that character's branch node in the second layer of the multiway tree;
Second step: following this branch downward, binary search is used to find the third-layer branch node from the second Chinese character of the string being retrieved; from the third-layer branch node, the fourth-layer branch node is found from the third Chinese character. The process continues in the same way until the end of the string. The node at the end records the id of the string, and returning this id completes the exact retrieval. If the branches of the multiway tree cannot reach the end of the string, the string is not in the tree;
B) For an English character string:
First step: the position of the string's first letter in the alphabet is used to find that letter's branch node in the second layer of the multiway tree;
Second step: following this branch downward, binary search on the position of the string's second letter in the alphabet is used to find that letter's branch node in the third layer. The process continues in the same way until the end of the string. The node at the end records the id of the string, and returning this id completes the exact retrieval. If the branches of the multiway tree cannot reach the end of the string, the string is not in the tree;
For example, the string 互联网 is looked up in the multiway tree that holds the four strings (分词, 互联网, and the two "search" strings) as follows:
(1) Subtract the GB2312 code of "啊" from the GB2312 code of "互". If the difference is a, go to branch node (a+1) of the second layer of the multiway tree. This branch node stores a pointer to the lower-layer nodes of "互".
(2) Among the lower-layer nodes of "互", use binary search to locate the branch node of "联" quickly. It stores a pointer to the lower-layer nodes of "联".
(3) Likewise, among the lower-layer nodes of "联", use binary search to locate the node of "网" quickly, and return the id of the string 互联网. The lookup succeeds.
Here the id of the string 互联网 is 2, because it was the second string inserted into the multiway tree. When no id is specified externally, the program assigns ids automatically, starting from 1.
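Binary search over a node's child cells requires the cells to be kept in sorted order. The sketch below assumes the occupied cells of a sparse node are stored as two parallel lists sorted by character; that layout and the function name are assumptions, and Python's standard `bisect` module supplies the binary search.

```python
import bisect


def find_child(keys, nodes, ch):
    """Binary-search the sorted child characters; return the matching node or None."""
    i = bisect.bisect_left(keys, ch)
    if i < len(keys) and keys[i] == ch:
        return nodes[i]
    return None


# Hypothetical successors of one character, kept sorted by code point.
keys = sorted(["联", "动", "相"])
nodes = [k + "-node" for k in keys]
print(find_child(keys, nodes, "联"), find_child(keys, nodes, "网"))
```

With at most a few dozen cells per sparse node, each layer costs only a handful of comparisons.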
5. Call the fuzzy retrieval procedure. The detailed process is as follows:
Fuzzy retrieval means that, given a prefix or a suffix, all strings in the multiway tree that contain this prefix or suffix are retrieved. For a prefix, the procedure is the same as for exact retrieval: starting from the root node, the nodes on the prefix's branch are found layer by layer until the end of the prefix. All lower-layer branches starting from the end node are then collected. These branches represent the possible continuations of the prefix, and the prefix combined with each continuation gives the results of the fuzzy retrieval. If the branches of the multiway tree cannot reach the end of the prefix, the retrieval fails;
For a suffix, the search runs in the opposite direction. Starting from the bottom-layer nodes of the multiway tree, the branch nodes corresponding to the characters of the suffix are found layer by layer upward, matching from the last character of the suffix toward its first. Once the first character of the suffix is reached, all upper-layer branches of the corresponding node are collected. These branches represent the strings that precede the suffix, and each of them concatenated with the suffix gives a result of the fuzzy retrieval. If, while searching upward, the branches of the multiway tree cannot reach the first character of the suffix, the retrieval fails;
For example, fuzzy retrieval in the multiway tree that holds the four strings (分词, 互联网, and the two "search" strings) proceeds as follows:
(1) If the prefix is "搜", first find the branch node of "搜" in the same way as in exact retrieval, together with all lower-layer branches of "搜". Concatenating "搜" with each of these branch nodes yields the two "search" strings.
(2) If the suffix is "词", start from the bottom-layer nodes of the multiway tree and find all branches that contain "词", together with all upper-layer branches of "词". Concatenating these branches with "词" yields the string 分词.
(3) If the prefix is "互" and the suffix is "网", first find the branch node of "互" in the same way as in exact retrieval, together with all lower-layer branches of "互". Within these lower-layer branches, start from the bottom-layer nodes and find all branches that contain "网". The string corresponding to the path from the node of "互" to the node of "网" is the result, which here is 互联网.
6. Carry out the cleanup work by calling the procedure that deletes a string from the multiway tree. The detailed process is as follows:
First step: the branch of the string to be deleted is found in the multiway tree by the exact retrieval method;
Second step: if the branch node at the end of the string is a bottom-layer node of the multiway tree, that is, the string has no successor characters, the id of the string is deleted from the end node and the storage occupied by the end node is released. If the branch node at the end of the string is not a bottom-layer node, that is, the end node still has lower-layer branches, only the id of the string is deleted from the end node and everything else is left unchanged. The deletion then succeeds.
For example, the string 搜索 ("search") is deleted from the multiway tree that holds the four strings (分词, 互联网, and the two "search" strings) as follows:
(1) Look up 搜索 in the multiway tree by the exact retrieval method and find its end node "索";
(2) Delete the id of 搜索 from the end node and release all the memory the node occupies.

Claims (1)

1. method that is used for searching character string is characterized in that:
1) writes down the character string structural information of character string with the storage mode of multiway tree, simultaneously the Memory Allocation of multiway tree node has been carried out layer-management
Character strings are stored in the form of a multiway tree. The first layer of the tree is the root node, and each subsequent layer represents the next character of the string in turn. Each tree node stores the current character, the numbering id of the character string formed along the path from the root node to this node, and pointers to its successor nodes. The numbering id uniquely identifies a character string; by default the program assigns ids sequentially starting from 1, or they may be specified by the user. For Chinese character strings, the first layer of the multiway tree is the root node, and the second layer is indexed by the GB2312 code of the first Chinese character of the string. It has more than 3,000 branches, each representing one Chinese character. The leftmost branch has the smallest GB2312 code and represents the character "啊" ("Ah"); every other Chinese character is indexed by the difference between its GB2312 code and that of "啊". The third layer represents the second Chinese character of the string. Because Chinese words follow fixed combinations and collocations, not every pair of Chinese characters forms a word, so the characters that may follow a given character are a limited subset rather than the full character set. Therefore, from the third layer onward, if the number of Chinese characters that may follow a given character node in the next layer is greater than or equal to half the total number of distinct Chinese characters, the following characters are indexed by the difference between their GB2312 codes and that of "啊"; otherwise, only the information of the characters that can actually follow is stored under that node. The same rule applies to every deeper layer, which saves a large amount of storage space;
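The indexing rule for Chinese characters might be sketched as follows in Python. The names `gb2312_offset`, `choose_layout` and `TOTAL_HANZI` are illustrative assumptions, and the value 3755 (the size of GB2312 level 1) is also an assumption, since the text only says "more than 3,000".

```python
# Illustrative sketch only; function and constant names are assumptions.
FIRST_HANZI_CODE = int.from_bytes("啊".encode("gb2312"), "big")  # code of "啊"

def gb2312_offset(ch: str) -> int:
    """Return the difference between ch's GB2312 code and the code of '啊'."""
    code = int.from_bytes(ch.encode("gb2312"), "big")
    return code - FIRST_HANZI_CODE

# Assumed size of the character set; the text only says "more than 3,000".
TOTAL_HANZI = 3755

def choose_layout(num_followers: int, total: int = TOTAL_HANZI) -> str:
    """Pick how the children of a node are stored from the third layer down."""
    # Direct indexing when at least half of all characters can follow,
    # otherwise store only the characters that actually occur.
    return "indexed" if 2 * num_followers >= total else "sparse"
```

Note that a raw code difference leaves unused gaps between GB2312 rows; a denser row/column index would be a possible variation that the text does not describe.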
For English character strings, the first layer of the multiway tree is likewise the root node, and the second layer represents the first letter of the string, with 26 branches corresponding to the 26 English letters. Because the English alphabet is small and does not take up much storage space, every layer below the second layer represents the successive letters of the string in the same way; that is, each layer has 26 branches;
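A node of the English tree could look roughly like the sketch below. The class name `EnglishNode`, the helper `letter_index`, and the decision to fold upper case to lower case are assumptions made for illustration; the text only specifies the stored fields and the 26 branches.

```python
from typing import List, Optional

class EnglishNode:
    """One tree node: the current letter, an optional id, and 26 child slots."""

    def __init__(self, char: str = "") -> None:
        self.char = char                       # current character ("" for the root)
        self.string_id: Optional[int] = None   # id of the string ending at this node
        self.children: List[Optional["EnglishNode"]] = [None] * 26

def letter_index(letter: str) -> int:
    """Sequence number of a letter in the alphabet: 0 for 'a' up to 25 for 'z'."""
    index = ord(letter.lower()) - ord("a")     # case folding is an assumption
    if not 0 <= index < 26:
        raise ValueError(f"not an English letter: {letter!r}")
    return index
```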
2) Insert a character string into the multiway tree
A) For a Chinese character string:
The first step: use the difference between the GB2312 code of the first Chinese character of the string and that of "啊" ("Ah") to index the branch node in the second layer of the multiway tree. This branch node stores a pointer to its lower-layer node, that is, to the information of the second Chinese character of the string;
The second step: if the number of Chinese characters that may follow the first Chinese character is greater than or equal to half the total number of distinct Chinese characters, the difference between the GB2312 code of the following character and that of "啊" is used to index the branch node in the third layer. Otherwise, 16 lower-layer node storage units are preallocated for the node of the first Chinese character (units are preallocated 16 at a time so that storage need not be reallocated every time a string is added). If space for the lower-layer (third-layer) nodes has already been allocated to this branch node, one of those units is assigned to the second Chinese character; if the 16 units allocated last have been used up, another 16 are allocated. Each third-layer node unit stores the second Chinese character, the numbering id of the string added so far, and a pointer to the information of the third Chinese character;
The third step: likewise, if the number of Chinese characters that may follow the second Chinese character is greater than or equal to half the total number of distinct Chinese characters, the difference between the GB2312 code of the following character and that of "啊" is used to index the branch node in the fourth layer. Otherwise, 16 lower-layer node storage units are preallocated for the third-layer node; each unit stores the third Chinese character, the numbering id of the string added so far, and a pointer to the information of the fourth Chinese character. Storage for the lower-layer nodes of each node is allocated in the same way, layer by layer, until the end of the string, at which point the string has been added to the multiway tree;
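The block-wise preallocation described above can be sketched as follows. This is a simplification under stated assumptions: it uses the sparse 16-unit layout at every layer (including the second layer, which the text indexes directly by GB2312 code), and the names `SparseNode`, `add_child` and `insert` are hypothetical.

```python
BLOCK = 16  # lower-layer units preallocated at a time (value from the text)

class SparseNode:
    def __init__(self, char=""):
        self.char = char
        self.string_id = None
        self.slots = []   # preallocated child units; None marks an unused unit
        self.used = 0     # number of units currently occupied

    def child(self, ch):
        """Return the child holding ch, or None if there is none."""
        for i in range(self.used):
            if self.slots[i].char == ch:
                return self.slots[i]
        return None

    def add_child(self, ch):
        """Return the child for ch, taking a free unit if it does not exist."""
        node = self.child(ch)
        if node is not None:
            return node
        if self.used == len(self.slots):
            self.slots.extend([None] * BLOCK)  # allocate 16 more units at once
        node = SparseNode(ch)
        self.slots[self.used] = node
        self.used += 1
        return node

def insert(root, text, string_id):
    """Walk down from the root, creating nodes as needed, and record the id."""
    node = root
    for ch in text:
        node = node.add_child(ch)
    node.string_id = string_id
```

For example, inserting "中国" and "中文" allocates one block of 16 units under the root and one under the node for "中", of which two units are used.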
B) For an English character string:
The first step: use the sequence number in the alphabet of the first letter of the string to index the branch node in the second layer of the multiway tree. This branch node stores a pointer to the third-layer node, that is, to the information of the second letter of the string;
The second step: likewise, use the sequence number in the alphabet of the second letter of the string to index the branch node in the third layer of the multiway tree. Continue in the same way until the end of the string, at which point the string has been added to the multiway tree;
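A minimal sketch of English insertion, assuming lower-case input restricted to a-z. The class names `Node` and `Trie`, and the choice to keep the existing id when a string is inserted twice, are illustrative assumptions; the default id counter starting at 1 follows the text.

```python
class Node:
    def __init__(self):
        self.string_id = None          # id of the string ending here, if any
        self.children = [None] * 26    # one slot per English letter

class Trie:
    def __init__(self):
        self.root = Node()
        self.next_id = 1               # ids are assigned in order starting from 1

    def insert(self, word, string_id=None):
        """Add word to the tree and return its id."""
        node = self.root
        for letter in word:
            i = ord(letter) - ord("a")  # assumes lower-case a-z input
            if node.children[i] is None:
                node.children[i] = Node()
            node = node.children[i]
        if node.string_id is None:      # assumption: keep the id of an existing string
            if string_id is None:
                string_id = self.next_id
                self.next_id += 1
            node.string_id = string_id
        return node.string_id
```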
3) Compress the character string tree
When the multiway tree is built, some layers may have a large number of branches (the second layer of the Chinese multiway tree, for example, has more than 3,000), and some of these branches may never be used. In layers with fewer branches, 16 lower-layer node storage units are preallocated for a node each time, so some of the allocated space may also go unused. Therefore, the multiway tree is first written to external storage; when it is then read back from external storage into memory, the empty, unused units are released, and only the units that contain data are read into memory according to the original structure. This greatly compresses the character string tree;
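One way to picture the write-out and reload is sketched below. The JSON file format, the record keys, and the names `dump`, `load` and `compress` are assumptions; the text does not specify how the tree is laid out in external storage.

```python
import json

class Node:
    def __init__(self, char="", string_id=None, slots=None):
        self.char = char
        self.string_id = string_id
        # Child units; None entries are preallocated but unused units.
        self.slots = slots if slots is not None else []

def dump(node):
    """Convert a node to a plain record, skipping unused (None) units."""
    return {"c": node.char, "id": node.string_id,
            "k": [dump(child) for child in node.slots if child is not None]}

def load(record):
    """Rebuild a node whose child list holds exactly the units with data."""
    return Node(record["c"], record["id"], [load(r) for r in record["k"]])

def compress(root, path):
    """Write the tree to external storage, then read back only the used units."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(dump(root), f, ensure_ascii=False)
    with open(path, encoding="utf-8") as f:
        return load(json.load(f))
```

The tree returned by `compress` has the same shape as the original, but no node carries empty child units.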
4) Exact retrieval of a character string in the multiway tree
A) For a Chinese character string:
The first step: use the difference between the GB2312 code of the first Chinese character of the string and that of "啊" ("Ah") to find the branch node of this character in the second layer of the multiway tree;
The second step: follow this branch downward and find the third-layer branch node according to the second Chinese character of the string to be retrieved; from the third-layer branch node, find the fourth-layer branch node according to the third Chinese character of the string to be retrieved. Continue in the same way until the end of the string. The node at the end records the numbering id of the string, and returning this id completes the exact retrieval. If the branch of the multiway tree cannot reach the end of the string, the string is not in the multiway tree;
B) For an English character string:
The first step: use the sequence number in the alphabet of the first letter of the string to find the branch node of this letter in the second layer of the multiway tree;
The second step: follow this branch downward and use the sequence number in the alphabet of the second letter of the string to find the branch node of this letter in the third layer of the multiway tree. Continue in the same way until the end of the string. The node at the end records the numbering id of the string, and returning this id completes the exact retrieval. If the branch of the multiway tree cannot reach the end of the string, the string is not in the multiway tree;
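Exact retrieval can be sketched as below. For brevity a Python dict stands in for both child layouts (the indexed array and the 16-unit blocks), which is an assumption for illustration; `Node`, `insert` and `find_exact` are hypothetical names.

```python
class Node:
    def __init__(self):
        self.string_id = None   # id of the string ending here, if any
        self.children = {}      # character -> child node (stand-in for either layout)

def insert(root, text, string_id):
    node = root
    for ch in text:
        node = node.children.setdefault(ch, Node())
    node.string_id = string_id

def find_exact(root, text):
    """Return the id stored at the end of text, or None if text is not stored."""
    node = root
    for ch in text:
        node = node.children.get(ch)
        if node is None:
            return None          # the branch ends before the string does
    return node.string_id        # None when text is only part of a longer string
```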
5) Fuzzy retrieval of character strings in the multiway tree
Fuzzy retrieval means that, given a prefix or a suffix, all character strings in the multiway tree that contain this prefix or suffix are retrieved. To retrieve by prefix, proceed in the same way as for exact retrieval: starting from the root node of the multiway tree, find the nodes on the branch of the prefix layer by layer until the end of the prefix, then find all lower-layer branches that start from the end node. These lower-layer branches represent the character strings that follow the prefix, and the prefix combined with each of these following strings is a result of the fuzzy retrieval. If the branch of the multiway tree cannot reach the end of the prefix, the retrieval fails;
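Prefix retrieval might be sketched as follows, again with a dict standing in for the child storage. Returning only continuations that carry an id, including the prefix itself when it is a stored string, and sorting the results are assumptions for illustration.

```python
class Node:
    def __init__(self):
        self.string_id = None
        self.children = {}   # character -> child node (stand-in for either layout)

def insert(root, text, string_id):
    node = root
    for ch in text:
        node = node.children.setdefault(ch, Node())
    node.string_id = string_id

def find_by_prefix(root, prefix):
    """Return sorted (string, id) pairs for all stored strings starting with prefix."""
    node = root
    for ch in prefix:                      # same walk as exact retrieval
        node = node.children.get(ch)
        if node is None:
            return []                      # the prefix is not in the tree
    results = []
    stack = [(node, prefix)]
    while stack:                           # visit every lower-layer branch
        current, text = stack.pop()
        if current.string_id is not None:
            results.append((text, current.string_id))
        for ch, child in current.children.items():
            stack.append((child, text + ch))
    return sorted(results)
```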
To retrieve by suffix, the search runs in the opposite direction: starting from the bottom-layer nodes of the multiway tree, it moves upward layer by layer, finding the branch nodes that correspond to the characters of the suffix, matching from the last character of the suffix toward its first character. Once the first character of the suffix is reached, all upper-layer branches above the node for that first character are found; these upper-layer branches represent the character strings that precede the suffix, and each of them concatenated with the suffix is a result of the fuzzy retrieval. If, during the upward search, the branch of the multiway tree cannot reach the first character of the suffix, the retrieval fails;
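The upward search can be sketched as below. The text does not say how the starting nodes are located or how a node reaches its parent, so the parent pointer and the `terminals` list of end nodes are assumptions, as is starting from every node that carries an id rather than only from leaves.

```python
class Node:
    def __init__(self, char="", parent=None):
        self.char = char
        self.parent = parent      # assumed back-pointer used for the upward walk
        self.string_id = None
        self.children = {}

class Trie:
    def __init__(self):
        self.root = Node()
        self.terminals = []       # assumed list of nodes where a stored string ends

    def insert(self, text, string_id):
        node = self.root
        for ch in text:
            if ch not in node.children:
                node.children[ch] = Node(ch, node)
            node = node.children[ch]
        if node.string_id is None:
            self.terminals.append(node)
        node.string_id = string_id

    def find_by_suffix(self, suffix):
        """Return sorted (string, id) pairs for stored strings ending with suffix."""
        results = []
        for end in self.terminals:
            node, matched = end, True
            for ch in reversed(suffix):        # match from the last character upward
                if node.parent is None or node.char != ch:
                    matched = False
                    break
                node = node.parent
            if not matched:
                continue
            head = []                          # characters in front of the suffix
            while node.parent is not None:
                head.append(node.char)
                node = node.parent
            results.append(("".join(reversed(head)) + suffix, end.string_id))
        return sorted(results)
```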
6) Delete a character string from the multiway tree
The first step: find the branch of the string to be deleted in the multiway tree using the exact retrieval method;
The second step: if the branch node at the end of the string is a bottom-layer node of the multiway tree, that is, no characters follow this string, delete the numbering id of the string from the end node and release the storage space occupied by the end node. If the branch node at the end of the string is not a bottom-layer node, that is, the end node still has lower-layer branches, delete only the numbering id of the string from the end node and leave everything else unchanged. The deletion then succeeds.
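The two deletion cases can be sketched as follows, with a dict standing in for the child storage. As in the text, only the end node itself is released; ancestors that become empty are left in place. The names `Node`, `insert` and `delete` and the boolean return value are illustrative assumptions.

```python
class Node:
    def __init__(self):
        self.string_id = None
        self.children = {}   # character -> child node (stand-in for either layout)

def insert(root, text, string_id):
    node = root
    for ch in text:
        node = node.children.setdefault(ch, Node())
    node.string_id = string_id

def delete(root, text):
    """Remove text from the tree; return True on success, False if not stored."""
    parent, node = None, root
    for ch in text:                       # step 1: locate the branch exactly
        parent, node = node, node.children.get(ch)
        if node is None:
            return False
    if node.string_id is None:
        return False
    node.string_id = None                 # step 2: delete the id in both cases
    if not node.children and parent is not None:
        del parent.children[text[-1]]     # bottom node: also release the node
    return True
```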
CN 200610052710 2006-07-31 2006-07-31 Method for searching character string Pending CN1889080A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200610052710 CN1889080A (en) 2006-07-31 2006-07-31 Method for searching character string

Publications (1)

Publication Number Publication Date
CN1889080A true CN1889080A (en) 2007-01-03

Family

ID=37578358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200610052710 Pending CN1889080A (en) 2006-07-31 2006-07-31 Method for searching character string

Country Status (1)

Country Link
CN (1) CN1889080A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication