CN108509505A

CN108509505A - String retrieval method and device based on a partitioned double-array Trie

Info

Publication number: CN108509505A
Application number: CN201810179880.1A
Authority: CN
Inventors: 陈文焰; 贾连印; 丁家满; 李孟娟; 游进国; 章露露; 吕晓伟
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2018-03-05
Filing date: 2018-03-05
Publication date: 2018-09-07
Anticipated expiration: 2038-03-05
Also published as: CN108509505B

Abstract

The present invention relates to a string retrieval method and device based on a partitioned double-array Trie, and belongs to the field of database technology. The invention comprises a data preprocessing step, which sorts the strings and counts the number of strings for each distinct first character; an index creation step, which divides the data into partitions according to the input number of partitions N, generates a partition mapping table, and creates an independent double-array Trie index structure for each partition; and a retrieval step, which takes the string to be retrieved as input and retrieves it on the partitioned double-array Trie index structure. By building partitioned double arrays, the invention effectively reduces the conflicts that arise while building a traditional double array and the cost of resolving them, and can greatly improve the efficiency of both index creation and retrieval.

Description

String retrieval method and device based on a partitioned double-array Trie

Technical field

The present invention relates to a string retrieval method and device based on a partitioned double-array Trie, and belongs to the field of database technology.

Background

In recent years, the database field has carried out a large amount of research on string retrieval. A Trie is an ordered tree structure that stores characters on its edges, and it is widely used in lexical analyzers, bibliographic search, spell-checking dictionaries, language model implementation, IP routing address lookup, and similarity search and joins over strings or sets. Two storage schemes for Tries are currently common: 1) matrix storage and 2) linked storage. Both require the complete Trie to be built, which leads to a large storage overhead, especially when the data set is sparse. To reduce the space overhead of the Trie, Aoe, Yata et al. proposed the double-array Trie data structure. This structure uses two arrays, BASE and CHECK, to store the prefix part of each string that distinguishes it from all other strings, and uses a further character array, TAIL, to store the suffix part of each string. Retrieval involves only two kinds of operations, array access and addition, so it is efficient.

Nevertheless, the double-array Trie (DAT) still has problems such as high space overhead and high index creation cost, and a number of optimizations of the DAT have been studied. For example, nodes with more branches can be processed first during index creation; this strategy improves space utilization, but comparing branch counts introduces extra overhead and thereby lowers the index creation efficiency of the double-array Trie. The CDA (Compression Double Array, a more space-compact double-array Trie) stores character information in the CHECK array to compress storage, but this method needs extra overhead to keep BASE values unique, which also lowers index creation efficiency. Building on the CDA, some researchers have proposed a single-array Trie structure that removes the BASE array and uses only the CHECK array to store character information, but this method is mainly applicable to fixed-length strings such as postal codes.

The above optimizations of the DAT mostly focus on compressing storage as far as possible, which often greatly reduces creation efficiency.

An analysis of how a double-array Trie index is built shows that index creation is accompanied by conflicts (when a string is inserted into a double-array Trie, a conflict occurs when different parent positions compete for the same child position), and that the number of conflicts grows as the number of strings increases.

Summary of the invention

The technical problem to be solved by the present invention is to provide a string retrieval method and device based on a partitioned double-array Trie. The aim is to reduce the number of conflicts during DAT index creation and the cost of resolving them, thereby improving the efficiency of building the DAT index structure; to effectively solve the sharp drop in DAT index creation efficiency caused by growing data volume; and at the same time to improve the retrieval efficiency of the DAT.

The technical solution of the present invention is a string retrieval method based on a partitioned double-array Trie, comprising the following steps:

Data preprocessing step: sort the string data set and count the number of strings for each distinct first character;

Index creation step: divide the data into partitions according to the input number of partitions N, then generate the partition mapping table (hereinafter PMT; a PMT is a mapping between prefixes and partitions) and create an independent double-array Trie (hereinafter DAT) index structure for each partition;

Retrieval step: take the string to be retrieved as input and retrieve it on the partitioned DAT index structure.

The data preprocessing step consists of two steps:

Step 110: sort the string data set in ascending lexicographic order;

Step 111: count the number of strings for each distinct first character.
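A minimal sketch of steps 110 and 111, using the five-string data set from Embodiment 1. The function name `preprocess` is an assumption made for illustration; the patent does not name any routine. Python's built-in `sorted` orders strings by code point, which coincides with lexicographic order for the lowercase ASCII strings and the "#" terminator used here.

```python
def preprocess(strings):
    """Sort strings lexicographically and count strings per first character."""
    ordered = sorted(strings)              # Step 110: ascending lexicographic order
    counts = {}
    for s in ordered:                      # Step 111: count by first character
        counts[s[0]] = counts.get(s[0], 0) + 1
    return ordered, counts


K = ["bachelor#", "jar#", "badge#", "baby#", "jack#"]
ordered, counts = preprocess(K)
print(ordered)
print(counts)
```

The per-character counts are what the later partitioning step needs in order to place dividing lines without scanning the strings again.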

The index creation step is executed as follows:

Step 210: divide the data into partitions;

Step 220: generate the PMT;

Step 230: create the partitioned DAT index structure.

Step 210 is executed as follows:

Step 211: for a given number of partitions N, with N < m, where m is the number of distinct first characters, determine N-1 dividing lines that split the data set evenly into N partitions;

Step 212: adjust the dividing lines according to the common-prefix property. If a dividing line splits a group of data that shares a common prefix (for example, all data whose first character is "b") into two parts, move that dividing line to the nearest edge of that group, so that strings with the same first character fall in the same partition.

Step 220 is executed as follows:

Build the PMT according to how the data set was actually divided. Each entry in the mapping table has the form <first character of the string, partition number>; that is, the first character of a string is mapped to its partition.
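Steps 211, 212 and 220 can be sketched together as follows. This is one possible reading, not the patented implementation: the function name `build_pmt`, the use of evenly spaced ideal cut positions, and the tie-breaking toward the lower edge are all assumptions. If two dividing lines snap to the same edge, this sketch simply ends up with fewer than N partitions, a case the text does not discuss.

```python
from itertools import groupby


def build_pmt(sorted_strings, n):
    """Return a {first character: partition number} map for sorted input."""
    # Sizes of the first-character groups, in sorted order.
    groups = [(ch, len(list(g)))
              for ch, g in groupby(sorted_strings, key=lambda s: s[0])]
    total = len(sorted_strings)

    # edges[i] = number of strings that precede group i (plus the final total).
    edges = [0]
    for _, size in groups:
        edges.append(edges[-1] + size)

    # Step 211: N-1 ideal dividing lines that split the data evenly.
    ideal = [total * i / n for i in range(1, n)]

    # Step 212: move each dividing line to the nearest group edge.
    cuts = sorted({min(edges, key=lambda e: abs(e - c)) for c in ideal})
    cuts = [c for c in cuts if 0 < c < total]

    # Step 220: map each first character to its partition number (1-based).
    pmt, part = {}, 1
    for (ch, _), start in zip(groups, edges):
        while part <= len(cuts) and start >= cuts[part - 1]:
            part += 1
        pmt[ch] = part
    return pmt


pmt = build_pmt(["baby#", "bachelor#", "badge#", "jack#", "jar#"], 2)
print(pmt)
```

For the five example strings and N = 2, the ideal dividing line at 2.5 falls inside the "b" group and is moved to position 3, so all "b" strings land in partition 1 and all "j" strings in partition 2.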

Step 230 is executed as follows:

Step 231: for a string to be inserted into the partitioned DAT, look up its first character in the PMT to obtain the partition into which it is to be inserted;

Step 232: insert the string into that partition according to the formulas for creating a DAT index. For an inserted character "c" that moves from state s to state t, the formulas are:

BASE[s] + CODE[c] = t (1)

CHECK[t] = s (2)

where CODE[c] denotes the numeric code of character c. For English characters, the characters "#", "a", "b", "c", …, "z" are encoded as 1, 2, 3, 4, …, 27 respectively.
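Relations (1) and (2) and the character coding can be written as a short sketch. The names `code` and `transition` and the choice of dictionaries for the sparse BASE and CHECK arrays are assumptions for illustration; missing dictionary keys stand for the value 0, i.e. an empty slot.

```python
def code(ch):
    """CODE[c]: '#' -> 1, 'a' -> 2, ..., 'z' -> 27."""
    if ch == "#":
        return 1
    if "a" <= ch <= "z":
        return ord(ch) - ord("a") + 2
    raise ValueError(f"unsupported character: {ch!r}")


def transition(base, check, s, ch):
    """Follow character ch from state s; return the next state or None."""
    t = base.get(s, 0) + code(ch)                # relation (1)
    return t if check.get(t, 0) == s else None   # relation (2)


# State 1 is the root with BASE[1] = 1; state 4 was reached from state 1.
base = {1: 1}
check = {4: 1}
print(transition(base, check, 1, "b"))  # BASE[1] + CODE['b'] = 1 + 3 = 4
print(transition(base, check, 1, "a"))  # slot 3 is empty, so no transition
```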

The retrieval step consists of two steps:

Step 310: given a string to be retrieved, look up its first character in the PMT to obtain the corresponding partition;

Step 320: execute the DAT retrieval algorithm in that partition and return the retrieval result.
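The two-step routing can be sketched as below. To keep the example short, each partition's DAT search is replaced by a stand-in: a membership test on a plain Python set. The function name `retrieve` and the `partition_search` dictionary are hypothetical; in a real implementation each entry would be the search routine of that partition's own double array.

```python
def retrieve(query, pmt, partition_search):
    """Route a query to its partition, then search only that partition."""
    if not query:
        return False
    part = pmt.get(query[0])               # Step 310: first character -> partition
    if part is None:
        return False                       # no partition holds this first character
    return partition_search[part](query)   # Step 320: search inside that partition


pmt = {"b": 1, "j": 2}
partition_search = {
    1: {"baby#", "bachelor#", "badge#"}.__contains__,  # stand-in for DAT no. 1
    2: {"jack#", "jar#"}.__contains__,                 # stand-in for DAT no. 2
}
print(retrieve("bachelor#", pmt, partition_search))
print(retrieve("zebra#", pmt, partition_search))
```

Because the PMT lookup touches only one character, the routing cost is constant and each query searches a smaller, independent double array.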

A string retrieval device based on a partitioned double-array Trie, comprising:

Data preprocessing module: sorts the string data set and counts the number of strings for each distinct first character;

Index creation module: divides the data into partitions according to the input number of partitions N, then generates the partition mapping table and creates an independent double-array Trie index structure for each partition;

Retrieval module: takes the string to be retrieved as input and retrieves it on the partitioned DAT index structure.

The beneficial effect of the present invention is that conflicts during double-array Trie creation are greatly reduced, which improves the efficiency of both index creation and string queries.

Description of the drawings

Fig. 1 is a functional block diagram of indexing and retrieval based on the partitioned double-array Trie of the present invention

Fig. 2 is the partition mapping table for "bachelor#", "badge#", "baby#", "jack#" and "jar#"

Fig. 3 shows the initialization of the double-array Trie

Fig. 4 is a schematic diagram of the reduced trie and double array after inserting "baby#"

Fig. 5 is a schematic diagram of the reduced trie and double array after inserting "bachelor#"

Fig. 6 is a schematic diagram of the reduced trie and double array after inserting "badge#"

Fig. 7 is a schematic diagram of the reduced tries and double arrays created after partitioning

Fig. 8 compares the index creation time of DAT and DO-DAT

Fig. 9 compares the amount of data moved by DAT and DO-DAT

Fig. 10 shows the effect of the number of partitions on index creation time

Fig. 11 shows the effect of the number of partitions on the number of conflicts

Fig. 12 shows the effect of the number of partitions on the BASE value probing length

Fig. 13 compares the creation time of different index structures

Fig. 14 compares the retrieval time of different index structures

Fig. 15 compares the storage space of different index structures

Detailed description

The invention is further described below with reference to the drawings and specific embodiments.

Embodiment 1: a string retrieval method based on a partitioned double-array Trie, comprising:

Data preprocessing step: sort the data set in ascending lexicographic order. Take the data set K = {"bachelor#", "jar#", "badge#", "baby#", "jack#"}. To distinguish strings such as "the" and "then", where one is a prefix of the other, "#" is appended to each string as an end marker. The sorted set is Ko = {"baby#", "bachelor#", "badge#", "jack#", "jar#"}.

Partition division: partitioning by first character yields two partitions, K1 = {"baby#", "bachelor#", "badge#"} and K2 = {"jack#", "jar#"}.

The partition mapping table generated at this point is shown in Fig. 2.

An ordinary Trie structure needs to store the entire string set K, which requires a large storage overhead. To reduce the space overhead, a DAT only needs to store, for each string, the prefix part that distinguishes it from all other strings (for example, the prefix "bac" of "bachelor#" distinguishes it from all the other strings), i.e. the reduced trie. A DAT consists of two one-dimensional integer arrays of equal length, BASE and CHECK, and a character array TAIL that stores the suffix parts. The BASE array stores the base values for state transitions, and the CHECK array stores check values used to verify that a state transition is correct. In a DAT, BASE and CHECK hold the indexed prefix part of each string, and TAIL holds the unindexed suffix part. An input character "c" that moves from state s to state t must satisfy the following two relations:

BASE[s] + CODE[c] = t (1)

CHECK[t] = s (2)

where CODE[c] denotes the numeric code of character "c". For English characters, the characters "#", "a", "b", "c", …, "z" are encoded as 1, 2, 3, 4, …, 27 respectively.

For an array index i, BASE[i] and CHECK[i] both being 0 indicates that the position is empty. A negative BASE value marks a terminal state, and its absolute value points to the position of the string's suffix part in the TAIL array.

Creation of the partitioned DAT index structure: looking up the strings to be inserted in the PMT shown in Fig. 2 shows that K1 = {"baby#", "bachelor#", "badge#"} should be inserted into partition 1 and K2 = {"jack#", "jar#"} into partition 2.

The creation of the double-array Trie for set K1 is described in detail below.

The initial state of the double-array Trie is shown in Fig. 3, where the value of POS indicates the current position at which characters are inserted into the TAIL array.

The string "baby#" is inserted into partition 1 in the following steps:

Step A1: index creation starts from position 1 of the BASE array. The code of "b" is 3, so:

BASE[1] + "b" = BASE[1] + 3 = 1 + 3 = 4, and CHECK[4] = 0 ≠ 1

Step A2: a CHECK value of 0 indicates that the remaining string is to be inserted into the TAIL array. Inserting "b" is enough at this point to identify "baby#" uniquely, so the remaining part "aby#" is inserted into the TAIL array sequentially starting at POS = 1.

Step A3: set

BASE[4] ← -POS = -1

which indicates that the position in the TAIL array from which the remaining string is read is the absolute value of BASE[4].

Update

POS = 1 + length("aby#") = 1 + 4 = 5

Then update

CHECK[4] = 1

which indicates that node 4 is reached from node 1, i.e. node 4 is a child of node 1. The reduced trie and double array constructed at this point are shown in Fig. 4.
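Steps A1 to A3 cover the simplest insertion case, in which the first transition lands on a free slot. The sketch below handles only that case and deliberately raises an error otherwise; the function name `insert_into_free_slot` and the dictionary representation of BASE, CHECK and TAIL are assumptions for illustration.

```python
def code(ch):
    """CODE[c]: '#' -> 1, 'a' -> 2, ..., 'z' -> 27."""
    return 1 if ch == "#" else ord(ch) - ord("a") + 2


def insert_into_free_slot(base, check, tail, pos, word):
    """Insert word when its first transition hits an empty slot (steps A1-A3).

    Returns the updated POS. Shared prefixes and conflicts are not handled.
    """
    s = 1                                   # Step A1: start at the root, position 1
    t = base[s] + code(word[0])
    if check.get(t, 0) != 0:
        raise NotImplementedError("slot occupied: needs the conflict-handling path")
    rest = word[1:]
    tail[pos] = rest                        # Step A2: store the suffix in TAIL at POS
    base[t] = -pos                          # Step A3: negative BASE points into TAIL
    check[t] = s                            #          node t is a child of node s
    return pos + len(rest)                  #          advance POS past the suffix


base, check, tail = {1: 1}, {}, {}
pos = insert_into_free_slot(base, check, tail, 1, "baby#")
print(base, check, tail, pos)
```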

The string "bachelor#" is inserted into partition 1:

Step B1: index creation starts from position 1 of the BASE array. The code of "b" is 3, so:

BASE[1] + "b" = BASE[1] + 3 = 1 + 3 = 4, and CHECK[4] = 1

The non-zero CHECK value indicates that an edge from node 1 to node 4 already exists.

Step B2: more characters must be indexed into the double array to distinguish the two strings, so node 4 will serve as the base of a state transition; however, BASE[4] = -1 at this point, which indicates that the lookup in the arrays has ended. The current value of BASE[4] is saved in a temporary variable TEMP, and the function X_CHECK(LIST) is called to find a new base value for BASE[4]. (X_CHECK(LIST) returns the smallest integer q such that q > 0 and CHECK[q + c] = 0 for every character c in LIST, i.e. it finds empty positions; q is incremented starting from 1.)

TEMP ← BASE[4] = -1

Step B3: find a new base value for BASE[4]. The new base value must allow the character "a" to be inserted at an empty position. The code of "a" is 2, so X_CHECK(LIST) is called with LIST containing the characters to be indexed into the double array:

CHECK[q + "a"] = CHECK[1 + 2] = CHECK[3] = BASE[3] = 0

An empty position is found, and the returned value of q is 1:

BASE[4] = 1

By relation (2), the new state 3 reached from state 4 has CHECK[3] = 4.

Step B4: index the characters "b" and "c" into the double array to distinguish "baby#" from "bachelor#". X_CHECK(LIST) is called to find suitable empty positions for "b" and "c", i.e. to find a suitable base value for BASE[3] for the state transitions:

CHECK[q + "b"] = CHECK[1 + 3] = CHECK[4] ≠ 0, so q = 1 is unavailable

CHECK[q + "b"] = CHECK[2 + 3] = CHECK[5] = 0, so q = 2 is available for "b"

CHECK[2 + "c"] = CHECK[2 + 4] = CHECK[6] = 0, so q = 2 is available for "c"; therefore

BASE[3] = 2

Step B5: index "b" into the double array:

BASE[3] + "b" = 2 + 3 = 5

Set

CHECK[5] = 3

BASE[5] ← TEMP = -1

Index "c" into the double array:

BASE[3] + "c" = 2 + 4 = 6

Set

BASE[6] ← -POS = -5

CHECK[6] = 3

Step B6: then update

POS = 5 + length("helor#") = 5 + 6 = 11

The reduced trie and double array constructed at this point are shown in Fig. 5.
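The X_CHECK(LIST) probe used in steps B2 to B4 can be sketched as a linear search for the smallest base value whose target slots are all empty. The Python name `x_check` and the dictionary form of CHECK (missing key means 0, i.e. empty) are assumptions for illustration.

```python
def code(ch):
    """CODE[c]: '#' -> 1, 'a' -> 2, ..., 'z' -> 27."""
    return 1 if ch == "#" else ord(ch) - ord("a") + 2


def x_check(check, chars):
    """Return the smallest q > 0 with CHECK[q + CODE[c]] == 0 for every c in chars."""
    q = 1
    while not all(check.get(q + code(c), 0) == 0 for c in chars):
        q += 1                      # slot taken for some character: try the next base
    return q


# State after inserting "baby#": only CHECK[4] = 1 is occupied.
check = {4: 1}
q_a = x_check(check, ["a"])         # Step B3: CHECK[1 + 2] is empty
check[3] = 4                        # state 3 is now a child of state 4
q_bc = x_check(check, ["b", "c"])   # Step B4: q = 1 collides at CHECK[4]
print(q_a, q_bc)
```

The number of values of q tried before success is the probing length whose reduction the experiments later attribute to partitioning: smaller partitions leave fewer occupied slots to skip.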

The string "badge#" is inserted into partition 1:

Step C1: index creation starts from position 1 of the BASE array. The code of "b" is 3, so:

BASE[1] + "b" = 1 + 3 = 4, and CHECK[4] = 1

BASE[4] + "a" = 1 + 2 = 3, and CHECK[3] = 4

BASE[3] + "d" = 2 + 5 = 7, and CHECK[7] = 0 ≠ 3

Step C2: a CHECK value of 0 indicates that the remaining string is to be inserted into the TAIL array; "ge#" is inserted into the TAIL array sequentially starting at POS = 11.

Step C3: set

BASE[7] ← -POS = -11

CHECK[7] = 3

Step C4: then update

POS = 11 + length("ge#") = 11 + 3 = 14

The reduced trie and double array constructed at this point are shown in Fig. 6.

The DAT for set K2 is created in the same way as the DAT for set K1 above. The reduced tries and double arrays created for sets K1 and K2 are shown in Fig. 7.

Retrieval on the partitioned DAT is described next, taking the retrieval of "bachelor#" as an example.

Take the specified prefix, which in this embodiment is the first character "b", and look it up in the PMT shown in Fig. 2; this shows that retrieval should be carried out in partition 1.

Step D1: start retrieval from the root node, i.e. at position BASE[1] of the double array.

Step D2: retrieve the first character "b"

BASE[1] + "b" = 1 + 3 = 4, and CHECK[4] = 1

Step D3: retrieve the second character "a"

BASE[4] + "a" = 1 + 2 = 3, and CHECK[3] = 4

Step D4: retrieve the third character "c"

BASE[3] + "c" = 2 + 4 = 6, and CHECK[6] = 3

and BASE[6] = -5 < 0

A negative value is retrieved, which indicates that the lookup in the BASE and CHECK arrays has ended. It then only remains to read the suffix part of the string at the corresponding position in the TAIL array, i.e. to read the remaining part "helor#" of "bachelor#" at position -BASE[6] = 5 in the TAIL array.
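Steps D1 to D4 can be replayed in code against the BASE and CHECK values derived in the insertion walk-through for partition 1. This is a sketch under assumptions: the function name `search` and the dictionary representation are illustrative, and the TAIL dictionary lists only the suffixes stored at positions 5 and 11, because the text does not spell out how the suffix of "baby#" at position 1 is updated once two more of its characters move into the arrays.

```python
def code(ch):
    """CODE[c]: '#' -> 1, 'a' -> 2, ..., 'z' -> 27."""
    return 1 if ch == "#" else ord(ch) - ord("a") + 2


# Partition 1 after inserting "baby#", "bachelor#" and "badge#".
BASE = {1: 1, 3: 2, 4: 1, 5: -1, 6: -5, 7: -11}
CHECK = {3: 4, 4: 1, 5: 3, 6: 3, 7: 3}
TAIL = {5: "helor#", 11: "ge#"}       # suffix start position -> suffix


def search(word):
    """Walk BASE/CHECK from the root, then compare the rest against TAIL."""
    s = 1                                      # Step D1: start at the root
    for i, ch in enumerate(word):
        t = BASE.get(s, 0) + code(ch)          # relation (1)
        if CHECK.get(t, 0) != s:               # relation (2) fails: not indexed
            return False
        s = t
        if BASE.get(s, 0) < 0:                 # terminal state: the rest is in TAIL
            return TAIL.get(-BASE[s]) == word[i + 1:]
    return False


print(search("bachelor#"))
print(search("badge#"))
print(search("bad#"))
```

For "bachelor#" the loop visits states 4, 3 and 6, exactly as in steps D2 to D4, and then compares "helor#" with the TAIL entry at position 5.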

The string retrieval device based on the partitioned double-array Trie comprises:

Index creation module: divides the data into partitions according to the input number of partitions N, then generates the partition mapping table and creates an independent double-array Trie (DAT) index structure for each partition;

To illustrate the effectiveness of the invention, this embodiment uses the time overhead of index creation to compare the effect of sorting on DAT index creation, and uses the time overhead of index creation and the retrieval time as metrics to compare the effect of the number of partitions on the invention, as described below.

Experimental data set: 183,361 distinct strings extracted from the titles in the DBLP data set are used to create the index. The minimum string length is 1, the maximum is 49, and the average is 8.6. The same data set is used for retrieval to examine retrieval efficiency.

Experimental results: the index creation time overhead without and with sorting is shown in Fig. 8 (where DO-DAT denotes the DAT built after sorting in ascending lexicographic order). As the figure shows, the index creation time of the DAT after sorting is about 12.4% lower than that of the unsorted DAT. The main reason is that sorting reduces the amount of data that has to be moved, as shown in Fig. 9.

The time overhead of partitioned index creation is shown in Fig. 10. As the figure shows, the time overhead of index creation drops substantially as the number of partitions increases. At the extremes, index creation takes about 105 s without partitioning but only 6.7 s with 20 partitions, an improvement of more than 15 times (105 / 6.7 ≈ 15.7). The main reason is that the partitioned DAT reduces the number of conflicts and the cost of resolving them, i.e. the BASE probing length; the results are shown in Fig. 11 and Fig. 12 respectively. With 10 partitions a good experimental result is already obtained, and the decline in index creation time levels off.

For convenience, the partitioned double-array Trie index structure is referred to as DO-PDAT. To verify the efficiency of DO-PDAT, it is compared with the index structures DAT, DO-DAT and DO-CDA (the CDA built after sorting in ascending lexicographic order) in terms of index creation time, query time and storage space. Indexes are created from the first 50,000, 100,000 and 150,000 strings of the DBLP data set and from the full string set, and the partition parameter of DO-PDAT is set to 10.

The index creation times are shown in Fig. 13. DO-CDA needs extra overhead to guarantee the uniqueness of BASE values, so its index creation is less efficient. DO-PDAT takes full advantage of both lexicographic ordering and partitioning, and therefore has the highest efficiency.

For query time, the strings used to create the index are used as queries to examine query efficiency. The experimental results are shown in Fig. 14. The CDA does not change the search algorithm of the DAT, so its query efficiency is almost the same as that of the DAT, while the query efficiency of DO-PDAT is far higher than that of the other index structures.

For storage space, the experimental results are shown in Fig. 15. The CDA optimizes the DAT from the perspective of space compression and therefore compresses better, while DO-PDAT and DO-DAT do not improve significantly on the DAT in space utilization.

The specific embodiments of the present invention have been explained in detail above with reference to the drawings, but the present invention is not limited to these embodiments; within the knowledge of a person of ordinary skill in the art, various changes may also be made without departing from the concept of the present invention.

Claims

1. A string retrieval method based on a partitioned double-array Trie, characterized in that it comprises the following steps:

Index creation step: divide the data into partitions according to the input number of partitions N, then generate the partition mapping table, abbreviated PMT, and create an independent double-array Trie index structure, abbreviated DAT index structure, for each partition;

2. The string retrieval method based on a partitioned double-array Trie according to claim 1, characterized in that the data preprocessing step consists of two steps:

Step 111: count the number of strings for each distinct first character.

3. The string retrieval method based on a partitioned double-array Trie according to claim 1, characterized in that the index creation step is executed as follows:

Step 210: divide the data into partitions;

Step 220: generate the PMT;

Step 230: create the partitioned DAT index structure.

4. The string retrieval method based on a partitioned double-array Trie according to claim 3, characterized in that step 210 is executed as follows:

Step 211: for a given number of partitions N, with N < m, where m is the number of distinct first characters, determine N-1 dividing lines that split the data set evenly into N partitions;

Step 212: adjust the dividing lines according to the common-prefix property. If a dividing line splits a group of data that shares a common prefix into two parts, move that dividing line to the nearest edge of that group, so that strings with the same first character fall in the same partition.

5. The string retrieval method based on a partitioned double-array Trie according to claim 3, characterized in that step 220 is executed as follows:

Build the PMT according to how the data set was actually divided. Each entry in the mapping table has the form <first character of the string, partition number>; that is, the first character of a string is mapped to its partition.

6. The string retrieval method based on a partitioned double-array Trie according to claim 3, characterized in that step 230 is executed as follows:

Step 231: for a string to be inserted into the partitioned DAT, look up its first character in the PMT to obtain the partition into which it is to be inserted;

Step 232: insert the string into that partition according to the formulas for creating a DAT index. For an inserted character "c" that moves from state s to state t, the formulas are:

BASE[s] + CODE[c] = t (1)

CHECK[t] = s (2)

where CODE[c] denotes the numeric code of character c. For English characters, the characters "#", "a", "b", "c", …, "z" are encoded as 1, 2, 3, 4, …, 27 respectively.

7. The string retrieval method based on a partitioned double-array Trie according to claim 1, characterized in that the retrieval step consists of two steps:

Step 310: given a string to be retrieved, look up its first character in the PMT to obtain the corresponding partition;

8. A string retrieval device based on a partitioned double-array Trie, characterized in that it comprises:

Index creation module: divides the data into partitions according to the input number of partitions N, then generates the partition mapping table and creates an independent double-array Trie index structure for each partition;