CN101464899B - Commercial scale dictionary storage method and query method with low search error rate - Google Patents

Publication number
CN101464899B
Authority
CN
China
Prior art keywords
index
hash
dictionary
entry
drop rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN200910003645XA
Other languages
Chinese (zh)
Other versions
CN101464899A (en)
Inventor
孙海涛
孙健
侯磊
张勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN200910003645XA
Publication of CN101464899A
Priority to HK09110585.1A (HK1130547A1)
Application granted
Publication of CN101464899B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a large-scale dictionary storage method with a low false-positive rate, and a corresponding query method. The storage method compresses and stores the dictionary data required for machine translation and comprises the following steps: first, the existing dictionary data needed during machine translation is arranged in a set format; next, the dictionary data is read and a Bloom filter index is generated using the Bloom filter technique; finally, the dictionary data is read and a perfect hash index is generated using a perfect hash algorithm. The invention thereby further reduces the rate of erroneous lookups.

Description

Large-scale dictionary storage method and query method with a low false-positive rate
Technical field
The application relates to the field of computer storage and querying of dictionaries, and in particular to computer-executed methods for storing and querying a large-scale dictionary with a low false-positive rate.
Background
With the rapid development of computers, using a computer to translate between different languages has long been well known. Just as a person must master the vocabulary and grammar of both languages before translating, a computer must store a machine dictionary and a machine grammar in its memory before it can translate. During translation, the computer first looks up the source-language dictionary to obtain the translation codes and syntactic information for the input source text, and then uses those codes and that information to find the required translation in the target-language dictionary.
For example, Chinese patent application No. 200610136705.1 proposes a method and device for precise machine translation. As shown in Fig. 1, the method comprises the following steps:
S101: taking the sentence as the unit and the predicate verb as the core, classify the sentence elements precisely and thereby establish a "universal grammar formula".
S103: build a source-language database and a target-language database.
S105: analyze a concrete source-language sentence and derive its symbolic-numeric formula.
S107: convert the symbolic-numeric formula of the source-language sentence into the symbolic-numeric formula of the target-language sentence.
S109: retrieve the corresponding words and phrases one by one from the source-language database and the target-language database, and thereby obtain the concrete target-language sentence.
The above method can greatly improve the accuracy, expressiveness and synchronicity of translation. However, whatever the machine translation technique, dictionary lookup is unavoidable. For a large-scale dictionary, lookup speed directly determines the efficiency of machine translation.
In the prior art, storing keywords with a hash function is a common way to speed up dictionary lookup: a keyword is stored after being processed by a hash function, which also compresses the dictionary. Collisions, however, are unavoidable when data is stored this way. For example, two keywords keyA and keyB with different contents may still yield the same value after the hash function is applied, that is, hash(keyA) = hash(keyB) holds.
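The collision described above can be reproduced with a deliberately weak hash. The sketch below is only an illustration: `toy_hash` and its table size are hypothetical and are not the hash function of any system mentioned in this document.

```python
def toy_hash(key: str, table_size: int = 8) -> int:
    """Hypothetical toy hash: sum of the character codes modulo the table size."""
    return sum(ord(ch) for ch in key) % table_size


# Two different keys made of the same characters in a different order.
key_a, key_b = "ab", "ba"

# The sums of the character codes are identical, so both keys land in the same slot.
print(toy_hash(key_a), toy_hash(key_b))
```

Any hash that maps a large key space onto a smaller range has such pairs; the toy function merely makes one easy to find.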
To solve this problem, perfect hash functions were proposed. A perfect hash function is a hash function under which no two entries in the dictionary collide; algorithms that construct such functions are called perfect hashing. Fig. 2 is a flow chart of an existing dictionary query that uses a perfect hash function. The whole process can be divided into an entry-index construction process S200 and a query process S204.
The entry-index construction process S200 comprises the following steps:
S201: perform ordered matching on the dictionary to find several hash functions such that every entry in the dictionary has a corresponding hash function.
S203: process the data produced by the ordered matching and generate the perfect hash function index.
The query process S204 comprises the following steps:
S205: query the perfect hash function index and use the hash function stored in it to compute a value for the entry being queried.
S207: judge whether the computed value is greater than the maximum entry value in the dictionary; if it is not, the queried entry is deemed to be present in the dictionary.
Such existing perfect hash functions are usually constructed on the assumption that the entry values in the dictionary are continuous and evenly distributed. When judging whether an entry exists, they therefore compare its value with the maximum or minimum entry value in the dictionary. If the value is greater than the maximum or smaller than the minimum, the entry is deemed absent; if the value lies within the value range of the dictionary, the entry is deemed present. The values in a real dictionary, however, are not necessarily continuous: a few common words may occur far more often than the others, while a few rare words occur far less often. In e-commerce, for example, the phrase "supply ** product" occurs especially often. When an existing perfect hash function is applied to a dictionary with such an uneven distribution, false positives can occur.
Take the number of times an entry occurs in the dictionary as an example. Suppose the entries (n-grams) in the dictionary are unevenly distributed: $\text{n-gram}_1$ occurs 10 times, $\text{n-gram}_2$ through $\text{n-gram}_{100000}$ each occur between 500 and 1500 times, and $\text{n-gram}_{100001}$ occurs 10000 times, so the value of $\text{n-gram}_{100001}$ is the maximum of this dictionary. If the existing perfect hash function yields $g(\text{n-gram}_i) = 8000$ for some queried entry, with $2 < i < 100000$, then a judgment by value range gives 8000 < 10000, and $\text{n-gram}_i$ is deemed present in the dictionary. In fact, no such $\text{n-gram}_i$ need exist among $\text{n-gram}_2$ through $\text{n-gram}_{100000}$: there may be no entry at all that occurs 8000 times. The result is a false positive.
When a dictionary index built with an existing perfect hash function is queried for an entry that is in the dictionary, the returned value is certainly correct. When it is queried for an entry that is not in the dictionary, however, it may still return a value, which is a false positive. In other words, a perfect hash function is "perfect" only for the entries of the known dictionary, not for unknown words. When such a function is used in machine translation and the entry being translated is not stored in the computer's memory, a wrong translation may result.
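The weakness of the value-range check can be shown with a small stand-in for a perfect hash index. In this sketch every name is an assumption made for illustration: the seeded CRC-32 hash, the brute-force seed search and the three sample entries are not taken from the prior-art system. The index stores only values, not keys, so an unknown entry still lands on some slot and inherits that slot's value.

```python
import zlib

# Known dictionary entries and their occurrence counts (sample data).
counts = {"supply^camera": 7000, "supply^PVC": 9000, "household^camera": 3800}
n = len(counts)


def h(seed: int, entry: str) -> int:
    """Hypothetical seeded hash: CRC-32 of "seed:entry", reduced to a slot number."""
    return zlib.crc32(f"{seed}:{entry}".encode("utf-8")) % n


# Brute-force a seed under which the known entries occupy distinct slots,
# which makes h "perfect" for exactly these entries.
seed = next(s for s in range(10_000) if len({h(s, e) for e in counts}) == n)

values = [0] * n
for entry, count in counts.items():
    values[h(seed, entry)] = count


def exists_by_range(entry: str) -> bool:
    """Prior-art style check: accept when the looked-up value is within the value range."""
    return values[h(seed, entry)] <= max(values)


unknown = "telescope^tripod"       # not in the dictionary
print(exists_by_range(unknown))    # the range check cannot tell it apart from a known entry
```

Because every slot holds a value no greater than the maximum, the range check accepts any query, which is exactly the failure mode described above.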
Summary of the invention
The application provides a large-scale dictionary storage method with a low false-positive rate, to solve the technical problem in existing machine translation methods that storing the dictionary with a perfect hash function can lead to wrong translations.
The application provides a large-scale dictionary storage and query method with a low false-positive rate, to solve the technical problem in existing machine translation methods that querying the dictionary with a perfect hash function can lead to wrong translations.
The application provides a large-scale dictionary storage method with a low false-positive rate, to solve the technical problem in existing dictionary storage technology that a dictionary stored with a perfect hash function may produce false positives.
The application provides a large-scale dictionary query method with a low false-positive rate, to solve the technical problem in existing dictionary query technology that querying a dictionary with a perfect hash function may produce false positives.
The application proposes a large-scale dictionary storage method with a low false-positive rate, for compressing and storing the dictionary data required for machine translation, comprising the following steps. First, the existing dictionary data required for machine translation is arranged in a set format. Second, the dictionary data is read and a Bloom filter index is generated using the Bloom filter technique. Finally, the dictionary data is read and a perfect hash index is generated using a perfect hash algorithm.
In a preferred embodiment of this storage method, the number of hash functions used and the parameters of each corresponding hash function are stored when the perfect hash index is generated.
In a preferred embodiment of this storage method, generating the Bloom filter index comprises the following steps. First, an array is set up in the computer's memory. Then, each entry in the dictionary data is processed by several mutually independent hash functions and mapped onto the array. Finally, a flag is set at every array position to which a hash value is mapped, and the Bloom filter index is generated.
In a preferred embodiment of this storage method, setting a flag at the positions to which hash values are mapped specifically means setting the bits at those positions of the array to 1.
In a preferred embodiment of this storage method, before each entry in the existing dictionary data is hashed and mapped onto the array, the method further comprises determining the size of the array from the size of the dictionary data and the expected false-positive rate of the Bloom filter index.
In a preferred embodiment of this storage method, after the size of the array has been determined, the method further comprises determining the number of hash functions used by the Bloom filter index from the size of the dictionary data and the size of the array.
In a preferred embodiment of this storage method, generating the perfect hash index comprises the following steps. First, ordered matching is performed on the entries in the dictionary data so that every entry has one corresponding hash function, and each entry together with the parameters of its corresponding hash function is stored as one element. Then, the elements are processed according to the order in which they were stored, and the perfect hash index is generated.
In a preferred embodiment of this storage method, no more than ten hash functions are used for the ordered matching of the dictionary.
The application also proposes a query method for a dictionary formed by the above storage method, for querying the entry to be translated when a computer performs language translation, comprising the following steps. First, several candidate translations of the entry to be translated are obtained. Next, the Bloom filter index is queried to judge whether each candidate translation exists. Then, the perfect hash index is queried, and the score of each existing candidate translation is computed with the hash function obtained from the perfect hash index. Finally, the candidate translation with the highest score is taken as the translation result for the entry.
In a preferred embodiment of this query method, computing the score of a candidate translation comprises the following steps. First, the parameters of the hash function corresponding to the candidate translation are read from the perfect hash index. Second, the hash function is restored from those parameters. Finally, the candidate translation is processed by its corresponding hash function to obtain its score.
In a preferred embodiment of this query method, judging whether a candidate translation exists comprises the following steps. First, the candidate translation is processed by the same hash functions that were used to generate the Bloom filter index and mapped onto the array. Then, it is judged whether every array position to which a hash value is mapped carries a flag. Finally, if every such position carries a flag, the candidate translation is reported as existing.
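The query steps for translation (filter the candidates, score the survivors, keep the best) can be sketched as follows. This is a minimal sketch under stated assumptions: `best_translation` is a hypothetical helper, and a plain Python dict stands in for both the Bloom filter index and the perfect hash index, which the real method would query instead.

```python
from typing import Callable, Iterable, Optional


def best_translation(
    candidates: Iterable[str],
    might_contain: Callable[[str], bool],   # stand-in for the Bloom filter query
    score_of: Callable[[str], int],         # stand-in for the perfect hash lookup
) -> Optional[str]:
    """Return the highest-scoring candidate that passes the existence check."""
    existing = [c for c in candidates if might_contain(c)]
    if not existing:
        return None                          # no candidate is in the dictionary
    return max(existing, key=score_of)


# Hypothetical dictionary contents used only for this demonstration.
known = {"supply^camera": 7000, "supply^PVC": 9000}

result = best_translation(
    ["supply^camera", "supply^PVC", "supply^radio"],
    might_contain=known.__contains__,
    score_of=known.__getitem__,
)
print(result)
```

The existence check runs first so that the score lookup is never asked about a candidate the dictionary does not hold.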
The application proposes a large-scale dictionary storage method with a low false-positive rate, for compressing and storing existing dictionary data, comprising the following steps. First, the entries in the existing dictionary data are arranged in a set format. Second, the dictionary data is read and a Bloom filter index is generated using the Bloom filter technique. Finally, the dictionary data is read and a perfect hash index is generated using a perfect hash algorithm.
In a preferred embodiment of this storage method, the number of hash functions used and the parameters of each corresponding hash function are stored when the perfect hash index is generated.
In a preferred embodiment of this storage method, generating the Bloom filter index comprises the following steps. First, an array is set up in the computer's memory. Second, each entry in the dictionary data is processed by several mutually independent hash functions and mapped onto the array. Finally, a flag is set at every array position to which a hash value is mapped, and the Bloom filter index is generated.
In a preferred embodiment of this storage method, setting a flag at the positions to which hash values are mapped specifically means setting the bits at those positions of the array to 1.
In a preferred embodiment of this storage method, before each entry in the existing dictionary data is hashed and mapped onto the array, the method further comprises determining the size of the array from the size of the dictionary data and the expected false-positive rate of the Bloom filter index.
In a preferred embodiment of this storage method, after the size of the array has been determined, the method further comprises determining the number of hash functions used by the Bloom filter index from the size of the dictionary data and the size of the array.
In a preferred embodiment of this storage method, generating the perfect hash index comprises the following steps. First, ordered matching is performed on the entries in the dictionary data so that every entry has one corresponding hash function, and each entry together with the parameters of its corresponding hash function is stored as one element. Then, the elements are processed according to the order in which they were stored, and the perfect hash index is generated.
In a preferred embodiment of this storage method, no more than ten hash functions are used for the ordered matching of the dictionary.
The application also proposes a query method for a dictionary formed by the above storage method, comprising the following steps. First, the Bloom filter index is queried to judge whether the queried entry exists. Then, if the entry exists, the perfect hash index is queried to obtain the value of the entry.
In a preferred embodiment of this query method, querying the Bloom filter index comprises the following steps. First, the queried entry is processed by the same hash functions that were used to generate the Bloom filter index and mapped onto the array. Then, it is judged whether every array position to which a hash value is mapped carries a flag. Finally, if every such position carries a flag, the queried entry is reported as existing.
In a preferred embodiment of this query method, querying the perfect hash index comprises the following steps. First, the parameters of the hash function corresponding to the queried entry are read from the perfect hash index. Then, the hash function is restored from those parameters. Finally, the hash value of the queried entry is obtained by applying its corresponding hash function.
Compared with the prior art, the application has the following advantages. Before the dictionary is looked up with perfect hashing, the Bloom filter technique is first used to judge whether the entry itself is present in the dictionary. Judging by the existence of the entry itself replaces the prior-art practice of judging existence by value range, which lowers the false-positive rate of dictionary queries and improves the query results. In addition, because the hash functions stored in the perfect hash index take up very little space, the stored dictionary is compressed to a great extent and the computer's storage space is saved.
Of course, a product implementing the invention need not achieve all of the above advantages at the same time.
Brief description of the drawings
Fig. 1 is a flow chart of the precise machine translation method of an embodiment proposed in Chinese patent application No. 200610136705.1;
Fig. 2 is a flow chart of an existing dictionary query that uses a perfect hash function;
Fig. 3 is a flow chart of a large-scale dictionary storage method with a low false-positive rate according to an embodiment of the application;
Fig. 4 is a flow chart of a large-scale dictionary query method with a low false-positive rate according to an embodiment of the application;
Fig. 5 is a flow chart of the construction of a Bloom filter index according to an embodiment of the application;
Fig. 6 is a flow chart of the construction of a perfect hash index according to an embodiment of the application;
Fig. 7 is a schematic diagram of an ordered matching process according to an embodiment of the application;
Fig. 8 is a flow chart of judging whether a candidate translation exists according to an embodiment of the application;
Fig. 9 is a flow chart of a large-scale dictionary storage method with a low false-positive rate according to an embodiment of the application;
Fig. 10 is a flow chart of a large-scale dictionary query method with a low false-positive rate according to an embodiment of the application.
Detailed description of the embodiments
The main idea of the application is to combine the Bloom filter technique with perfect hashing in both dictionary storage and dictionary lookup. For storage, the entries in the existing dictionary data are first arranged in a set format. Second, the dictionary data is read and a Bloom filter index is generated using the Bloom filter technique. Finally, the dictionary data is read and a perfect hash index is generated using a perfect hash algorithm. For lookup, entries that are not in the dictionary are first filtered out by the Bloom filter, and only the remaining entries are queried through the perfect hash function. The Bloom filter referred to in the application is a highly space-efficient probabilistic data structure that is commonly used to test whether an element is a member of a given set.
The invention is described in detail below with reference to the drawings.
The application proposes a large-scale dictionary storage method with a low false-positive rate, for compressing and storing the dictionary data required for machine translation. Fig. 3 is a flow chart of such a storage method according to an embodiment of the application, which comprises the following steps:
S301: arrange the existing dictionary data in a set format. The existing dictionary data can be obtained from the user log of a website, which records the historical query terms of the users who logged in to the website.
The data can be arranged using a bigram language model, as shown in Table 1:
Entry                      Occurrences
supply^camera              7000
supply^PVC                 9000
children^digital camera    3000
telescope^camera           2600
disposable^camera          3200
household^camera           3800
......
Table 1
In Table 1, "supply^camera", "supply^PVC" and so on denote entries in the existing dictionary; the "^" symbol indicates that the entry is in bigram form, that is, composed of two words. "7000", "9000" and so on denote the number of times the corresponding entry occurs in the dictionary. For example, Table 1 shows that the entry consisting of "supply" and "camera" occurs 7000 times. Other arrangements can also be used, such as a trigram or 4-gram language model.
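A line in the format of Table 1 can be split into its words and its count with a few lines of code. The sketch below assumes a textual layout of "word1^word2 count" with the count as the last whitespace-separated field; the source does not specify the on-disk format, and `parse_line` is a hypothetical helper.

```python
from typing import Tuple


def parse_line(line: str) -> Tuple[Tuple[str, ...], int]:
    """Split 'word1^word2 count' into (words, count).

    The count is taken to be the last whitespace-separated field, so a word
    that itself contains a space (for example 'digital camera') is preserved.
    """
    entry, count = line.rsplit(maxsplit=1)
    words = tuple(word.strip() for word in entry.split("^"))
    return words, int(count)


print(parse_line("supply^camera 7000"))
print(parse_line("children^digital camera 3000"))
```

Splitting on "^" also covers trigram or 4-gram entries, which simply yield longer tuples.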
S303: read the dictionary data, generate the Bloom filter index using the Bloom filter technique, and store it.
The basic idea of generating the Bloom filter index is to map each entry in the dictionary into an array with a series of mutually independent hash functions and to set a flag at each of the resulting positions. Fig. 5 is a flow chart of the construction of a Bloom filter index according to an embodiment of the application.
S501: set up an array in the computer's storage. The array is generally placed in main memory.
S503: determine the size of the array from the size of the existing dictionary and the expected false-positive rate of the Bloom filter index.
S505: determine the number of hash functions used by the Bloom filter index from the size of the existing dictionary and the size of the array.
S507: process each entry in the existing dictionary with several mutually independent hash functions and map it onto the array. The hash functions used here can be chosen at random.
S509: set a flag at every array position to which a hash value is mapped, and generate the Bloom filter index.
For ease of understanding, pseudocode for building the Bloom filter index is given below.
Notation:
Hash function family: $h_i(x) \to [0, M_1)$, where $0 \le i < K_1$
A_B: the bit index array of size $1 \times M_1$; each cell occupies 1 bit, and there are $M_1$ such cells in total
S: the set of n-grams, with elements $\langle \text{n-gram}_j, v(\text{n-gram}_j) \rangle$, where $0 \le v(\text{n-gram}_j) < 2^{N_1}$
Pseudocode for building the Bloom filter index (// marks a comment):
Input: S, the hash function family
Output: the index array A_B
// initialize A_B
For i ∈ [0, M_1) do
    A_B[i] = 0
End For
For i ∈ [0, |S|) do
    For j ∈ [0, K_1) do
        A_B[h_j(n-gram_i)] = 1
    End For
End For
In the pseudocode above, $M_1$ is the size of the Bloom filter index and S is the set of entries in the dictionary. The n-gram is a language model commonly used in natural language processing; here it refers to an entry in S. The pseudocode sets to 1 every position of the index array A_B to which a hash value is mapped and leaves all other positions at 0. The resulting index structure is shown in Table 2:
Number of values    1 1 0 1 0 0 0 1 1 0 1 0 0 ......
Table 2
"Number of values" is the number of values in the Bloom filter index and equals $M_1$. In the series of values that follows, "1" means the bit is set and "0" means it is not set.
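The construction pseudocode, together with the membership test described for the query step, can be sketched in runnable form. The assumptions here are illustrative only: the K_1 hash functions are simulated by salting SHA-256 with the function index, and the array uses one byte per cell for readability where the source uses one bit.

```python
import hashlib
from typing import Iterable, List


def positions(entry: str, k: int, m: int) -> List[int]:
    """Positions h_0(entry) .. h_{k-1}(entry) in [0, m).

    Hypothetical hash family: SHA-256 of "j:entry", reduced modulo m.
    """
    result = []
    for j in range(k):
        digest = hashlib.sha256(f"{j}:{entry}".encode("utf-8")).digest()
        result.append(int.from_bytes(digest[:8], "big") % m)
    return result


def build_bloom_index(entries: Iterable[str], k: int, m: int) -> bytearray:
    """Build the index array A_B: every mapped position is set to 1."""
    a_b = bytearray(m)                      # initialise A_B[i] = 0 for all i
    for entry in entries:
        for pos in positions(entry, k, m):
            a_b[pos] = 1
    return a_b


def might_contain(a_b: bytearray, entry: str, k: int) -> bool:
    """True if every position mapped from the entry is set (false positives possible)."""
    return all(a_b[pos] for pos in positions(entry, k, len(a_b)))


dictionary = ["supply^camera", "supply^PVC", "household^camera"]
index = build_bloom_index(dictionary, k=3, m=64)
print(might_contain(index, "supply^camera", k=3))
```

An entry that was inserted is always reported as present; an entry that was never inserted is usually rejected but can occasionally be accepted, which is the false-positive rate that the sizing step controls.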
In practice, before the Bloom filter index is built, the number of hash functions $K_1$ and the expected false-positive rate $p$ of the Bloom filter index are determined first. The two formulas $K_1 = \ln 2 \cdot M_1 / |S|$ and $p = (1-(1-1/m)^{kn})^k \approx (1-e^{-kn/m})^k$ then give the array size. Substituting $k = \ln 2 \cdot m/n$ into the approximation yields $p \approx 2^{-k}$, so $k = \log_2(1/p)$ and therefore $M_1 = \frac{\log_2(1/p)}{\ln 2}\,|S|$. The size $|S|$ of the existing dictionary can be computed when the dictionary is stored in the computer's memory, so $M_1$ can be adjusted at any time according to the expected false-positive rate $p$. In the expression for $p$, $k$ is the number of hash functions, $n$ is the number of entries in the dictionary, and $m$ is the number of bits in the array. Once the array size and the error rate of the Bloom filter index have been determined, the index can be built.
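The sizing relations can be checked numerically. The sketch below assumes the standard Bloom filter relations quoted in the text, K_1 = ln 2 · M_1 / |S| and p ≈ (1 − e^(−kn/m))^k; the function names and the sample figures (one million entries, a 1% target rate) are hypothetical.

```python
import math
from typing import Tuple


def bloom_parameters(n_entries: int, p: float) -> Tuple[int, int]:
    """Return (array size m in bits, number of hash functions k) for a target rate p."""
    m = math.ceil(n_entries * math.log2(1 / p) / math.log(2))
    k = max(1, round(math.log(2) * m / n_entries))
    return m, k


def expected_false_positive_rate(m: int, k: int, n: int) -> float:
    """Approximate false-positive rate (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k


m, k = bloom_parameters(1_000_000, 0.01)
print(m, k, expected_false_positive_rate(m, k, 1_000_000))
```

For these sample figures the array needs a little under ten bits per entry, which shows why the index stays small compared with the dictionary itself.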
S305: read the dictionary data and generate the perfect hash index using a perfect hash algorithm. The perfect hash index contains the number of hash functions used and the parameters of each corresponding hash function.
After the Bloom filter index has been built, the perfect hash index must also be built. Fig. 6 is a flow chart of the construction of a perfect hash index according to an embodiment of the application, which comprises the following steps:
S601: perform ordered matching on the entries in the existing dictionary so that every entry has one corresponding hash function, and store each entry together with the parameters of its corresponding hash function as one element.
S603: process the elements according to the order in which they were stored, and generate the perfect hash index.
The purpose of ordered matching is to find several hash functions such that, for every entry $\text{n-gram}_j$ ($0 \le j < |S|$) in the entry set S, some corresponding hash function $hash_i$ ($0 \le i < K$) computes the hash value $hash_i(\text{n-gram}_j)$. Different entries may correspond to different functions, for example $\text{n-gram}_1$ to $hash_2$ and $\text{n-gram}_2$ to $hash_1$. The requirement is that all $|S|$ hash values are unique.
At the same time, the order in which the entries are processed and the hash function corresponding to each entry are saved in matched. Here matched is a queue that holds, in order, a record of each entry and its corresponding hash function. If ordered matching fails, the hash functions are replaced and the process is tried again until hash functions satisfying the condition have been found for all entries. Practice has shown that the probability of failure is small when $M \ge 1.23|S|$.
For ease of understanding, pseudocode for the ordered matching process is given below (// marks a comment):
Input: the set S of n-grams; the hash function family h_i(x) → [0, M), where 0 ≤ i < K
Output: the ordered match list matched, or failure
// initialize matched to empty
matched ← ∅
// initialize every element of r2l to empty
For i ∈ [0, M) do
    r2l_i ← ∅
End For
// collect statistics
For i ∈ [0, |S|) do
    l2r_i ← ∅
    For j ∈ [0, K) do
        l2r_i ← l2r_i ∪ h_j(n-gram_i)
        r2l_{h_j(n-gram_i)} ← r2l_{h_j(n-gram_i)} ∪ i
    End For
End For
// extract the positions whose r2l_i holds exactly one element, i.e. whose in-degree is 1
degree_one ← { i | 0 ≤ i < M, |r2l_i| = 1 }
While (|degree_one| ≥ 1) do
    rhs ← POP(degree_one)
    lhs ← POP(r2l_rhs)
    PUSH (lhs, rhs) onto matched
    For all rhs′ ∈ l2r_lhs do
        POP(r2l_rhs′)
        If (|r2l_rhs′| == 1) then
            degree_one ← degree_one ∪ rhs′
        End If
    End For
End While
If |matched| == |S| then
    Return matched
Else
    Return false
End If
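The ordered matching pseudocode can be turned into a short runnable sketch. Several details are assumptions: the hash functions are passed in as callables, the sample hash values for entries a to d are hypothetical (only loosely modelled on the four-entry, three-function example of Fig. 7), and the sketch skips a queued position whose last competing entry has already been matched elsewhere, a case the pseudocode leaves implicit.

```python
from collections import defaultdict


def ordered_matching(entries, hash_functions):
    """Return [(entry_index, slot), ...] in matching order, or None on failure."""
    # l2r: entry index -> candidate slots; r2l: slot -> entries still competing for it.
    l2r = [{h(entry) for h in hash_functions} for entry in entries]
    r2l = defaultdict(set)
    for i, slots in enumerate(l2r):
        for slot in slots:
            r2l[slot].add(i)

    degree_one = [slot for slot, owners in r2l.items() if len(owners) == 1]
    matched = []
    while degree_one:
        rhs = degree_one.pop()
        if len(r2l[rhs]) != 1:          # its only entry was matched via another slot
            continue
        lhs = r2l[rhs].pop()
        matched.append((lhs, rhs))
        for other in l2r[lhs]:          # release the entry's remaining slots
            if other == rhs:
                continue
            r2l[other].discard(lhs)
            if len(r2l[other]) == 1:
                degree_one.append(other)
    return matched if len(matched) == len(entries) else None


# Hypothetical hash values of three hash functions for four entries.
TABLE = {"a": (1, 2, 3), "b": (2, 6, 7), "c": (3, 5, 6), "d": (1, 4, 7)}
hash_functions = [lambda entry, j=j: TABLE[entry][j] for j in range(3)]
entries = list(TABLE)
matched = ordered_matching(entries, hash_functions)
print(matched)
```

Each recorded pair assigns an entry a slot that no later-matched entry touches, which is what allows the values to be filled in afterwards in reverse order.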
The ordered matching process is described further with reference to Fig. 7, a schematic diagram of an ordered matching process according to an embodiment of the application. Here a, b, c and d denote entries in the dictionary. Each of them is processed by 3 hash algorithms (the lines connected to a, b, c and d), as in panel (c, 5) of Fig. 7, and the entries yield the values 1, 2, 3, 4, 5, 6 and 7. Values 4 and 5 each have an in-degree of one (the in-degree is the number of lines linked to a node, here the lines linked to 4 and 5). This shows that hash values 4 and 5 are unique, so the hash functions and entries corresponding to them satisfy the requirement of ordered matching. During ordered matching, one of these two nodes can therefore be picked at random, stored and then deleted; the lines connected to the deleted node are removed as well, and the step is recorded. This embodiment deletes node 5 and records entry c, hash value 5 and the corresponding hash function in matched. Panel (d, 4) of Fig. 7 shows the state after node 5 has been deleted. In the same way, removing node 4 gives panel (b, 2) of Fig. 7, and entry d, hash value 4 and the corresponding hash function are recorded. Nodes 2 and 1 are then removed and recorded in the same way. This completes the ordered matching process.
Each element stored in matched comprises an entry and the parameters of its corresponding hash function. After ordered matching has finished, the elements of matched are processed in first-in, last-out order to obtain the array A of the perfect hash index. The processing can follow the formula
$$A[h_{r(j)}(\text{n-gram}_j)] = v(\text{n-gram}_j) \otimes \bigotimes_{i=0,\ i \neq r(j)}^{K_2-1} A[h_i(\text{n-gram}_j)].$$
The perfect hash index thus comprises two parts, the hash function index and the value index, as shown in Table 3:
Figure G200910003645XD00122
Table 3
The hash function parameters in Table 3 generally have two components: the function type and the function seed. The function type specifies the hash algorithm, for example the hash algorithm of Justin Sobel or the hash algorithm of Peter J. Weinberger. The function seed is the initial reference value used by the hash algorithm. These two parameters are enough to reconstruct the hash function, which avoids the cost of storing the hash function itself.
For example, the hash algorithm of Justin Sobel is as follows:
unsigned int CJSHash::hash(const char* key, size_t keyLen)
{
    unsigned int hash = m_seed;
    for (std::size_t i = 0; i < keyLen; i++)
    {
        hash ^= ((hash << 5) + key[i] + (hash >> 2));
    }
    return hash;
}
Its seed is the m_seed in the program.
The ordered matching process used in this application needs only a few hash functions (only three are used in the embodiment of Fig. 7), so the hash function parameters stored in the perfect hash index occupy very little storage space.
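To illustrate how a stored (function type, function seed) pair can be turned back into a callable hash function, here is a minimal C++ sketch. The names `HashType`, `HashParams` and `makeHash` are hypothetical, and seeding the Weinberger (PJW) hash through its initial value is an assumption of this sketch; the source states only that a type and a seed are stored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

enum class HashType : std::uint8_t { JS = 0, PJW = 1 };

// The two stored parameters of one hash function.
struct HashParams {
    HashType type;
    std::uint32_t seed;
};

typedef std::function<std::uint32_t(const std::string&)> HashFn;

// Rebuilds the hash function described by `p`.
HashFn makeHash(const HashParams& p) {
    const std::uint32_t seed = p.seed;
    if (p.type == HashType::JS) {
        return [seed](const std::string& key) -> std::uint32_t {
            std::uint32_t h = seed;
            for (std::size_t i = 0; i < key.size(); ++i)
                h ^= ((h << 5) + static_cast<unsigned char>(key[i]) + (h >> 2));
            return h;
        };
    }
    assert(p.type == HashType::PJW);
    return [seed](const std::string& key) -> std::uint32_t {
        std::uint32_t h = seed;
        for (std::size_t i = 0; i < key.size(); ++i) {
            h = (h << 4) + static_cast<unsigned char>(key[i]);
            const std::uint32_t high = h & 0xF0000000u;  // top four bits
            if (high != 0) h = (h ^ (high >> 24)) & 0x0FFFFFFFu;
        }
        return h;
    };
}
```

Storing one small type code and one seed per function, instead of the function itself, is what keeps the hash function index compact.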
The value index records the number of values and the specific value stored at each position. The specific values are determined according to
$$A[h_{r(j)}(\text{n-gram}_j)] = v(\text{n-gram}_j) \otimes \bigotimes_{i=0,\ i \neq r(j)}^{K_2-1} A[h_i(\text{n-gram}_j)].$$
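As a sketch of how this formula fills the array, the following C++ fragment processes `matched` in first-in, last-out order and then recomputes a value by combining the array cells of an entry. It assumes that ⊗ denotes bitwise exclusive-or (the usual choice for this kind of construction), that the K_2 positions of one entry are pairwise distinct, and that the hash positions are supplied precomputed in `slots`; all identifiers are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Match {
    std::size_t entry;  // index j of the entry
    std::size_t fn;     // r(j): index of the hash function matched to it
};

// slots[j][i] holds h_i(entry j); values[j] holds v(entry j).
std::vector<std::uint32_t> buildValueArray(
        const std::vector<std::vector<std::size_t> >& slots,
        const std::vector<std::uint32_t>& values,
        const std::vector<Match>& matched,
        std::size_t m) {
    std::vector<std::uint32_t> A(m, 0);
    // First in, last out: the entry matched last is written first.
    for (std::size_t n = matched.size(); n-- > 0;) {
        const Match& mt = matched[n];
        const std::vector<std::size_t>& s = slots[mt.entry];
        assert(mt.fn < s.size());
        std::uint32_t x = values[mt.entry];
        for (std::size_t i = 0; i < s.size(); ++i)
            if (i != mt.fn) x ^= A[s[i]];
        A[s[mt.fn]] = x;  // chosen so that all K_2 cells combine to v(entry)
    }
    return A;
}

// g(entry): exclusive-or of A[h_i(entry)] over all hash functions.
std::uint32_t lookup(const std::vector<std::uint32_t>& A,
                     const std::vector<std::size_t>& entrySlots) {
    std::uint32_t g = 0;
    for (std::size_t i = 0; i < entrySlots.size(); ++i) g ^= A[entrySlots[i]];
    return g;
}
```

For example, with positions {0,1,2}, {1,2,3}, {2,3,4} and {3,5,6} for four entries a, b, c, d, values 10, 20, 30, 40, and the peeling order (c, function 2), (d, function 1), (b, function 2), (a, function 0), `lookup` returns the stored value for every entry. The reverse order matters because an entry peeled early may share cells with entries peeled later, so those cells must already be final when it is written.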
In practice, the scale M_2 of the perfect hash index is set slightly larger than the scale |S| of the dictionary. The closer M_2 is to |S|, the more space the perfect hash index saves, but the harder it becomes to find a suitable family of hash functions. Generally M_2 = 1.1|S| is used.
Corresponding to the large-scale dictionary storage method with a low false drop rate introduced above, this application also proposes a large-scale dictionary query method with a low false drop rate. It is used when a computer performs language translation, to query the entry to be translated in the dictionary built by the storage method. As shown in Fig. 4, it comprises the following steps:
S401: obtain several candidate translations of the entry to be translated. This step can be carried out with existing computer grammar-conversion techniques. For example, an entry meaning "to like" may be translated into English as "like" or "love"; "like" and "love" are then the candidate translations of that entry.
S403: query the Bloom filter index and determine whether each candidate translation exists.
As described above, when the Bloom filter index is generated, a flag is set at every array position to which a hash value is mapped. Determining whether a candidate translation is present in the dictionary is therefore equivalent to checking whether every array position to which the hash functions map the candidate carries a flag. As shown in Fig. 8, determining whether a candidate translation exists comprises the following steps:
S801: apply to the candidate translation the same hash functions that were used when the index was built, and map the results onto the array of the Bloom filter index.
S803: determine whether all array positions to which hash values are mapped carry a flag. If one or more of these positions carry no flag, the candidate translation does not exist, and the computer returns the information that the candidate translation does not exist.
S805: if all of these positions carry a flag, return that the candidate translation exists.
For ease of understanding, pseudocode for determining whether a candidate translation exists by querying the Bloom filter index is given below:
Input: n-gram
Output: whether the n-gram exists
For j ∈ [0, K_1) do
    If (A_B[h_j(n-gram)] == 0) then
        Return "does not exist"
    End If
End For
Return "exists"
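The pseudocode above, together with the flag-setting step used when the index is built, can be sketched in C++ as follows. This is an illustrative sketch only: the K_1 independent hash functions are approximated by a seeded JS hash with K_1 different seeds, which is an assumption, and the class name `BloomIndex` and its members are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Seeded JS hash (assumed hash family for this sketch).
unsigned int jsHash(const std::string& key, unsigned int seed) {
    unsigned int hash = seed;
    for (std::size_t i = 0; i < key.size(); ++i)
        hash ^= ((hash << 5) + key[i] + (hash >> 2));
    return hash;
}

class BloomIndex {
public:
    BloomIndex(std::size_t size, std::vector<unsigned int> seeds)
        : bits_(size, false), seeds_(std::move(seeds)) {
        assert(size > 0);
    }

    // Index construction: set the flag at every position the entry maps to.
    void add(const std::string& entry) {
        for (std::size_t j = 0; j < seeds_.size(); ++j)
            bits_[position(entry, j)] = true;
    }

    // Query: a single missing flag proves that the entry was never added.
    bool mayContain(const std::string& entry) const {
        for (std::size_t j = 0; j < seeds_.size(); ++j)
            if (!bits_[position(entry, j)]) return false;
        return true;
    }

private:
    std::size_t position(const std::string& entry, std::size_t j) const {
        return jsHash(entry, seeds_[j]) % bits_.size();
    }

    std::vector<bool> bits_;           // the array A_B
    std::vector<unsigned int> seeds_;  // one seed per hash function h_j
};
```

A `true` result only means that all flags are set, so it can occasionally be a false drop; a `false` result is always correct.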
S405: query the perfect hash index, and compute the score of each existing candidate translation using the hash functions obtained from the perfect hash index.
When the perfect hash index is queried, the hash function index part is read first to obtain the hash function parameters, and the hash functions are reconstructed from them. These hash functions are then used to compute the score g(n-gram) of each existing candidate translation:
$$g(\text{n-gram}) = \bigotimes_{i=0}^{K_2-1} A[h_i(\text{n-gram})].$$
Here the score of a candidate translation is the number of times the candidate translation occurs in the existing dictionary.
S407: take the candidate translation with the highest score as the translation result of the entry to be translated.
Take the translation of a sentence meaning "I like eating apple" as an example. Suppose the computer obtains, through existing grammar-conversion techniques, the candidate translations "I like eating apple" and "I love eating apple". To determine the correct translation, a Bloom filter index query is performed on each of the two candidates. Suppose both candidates are found to exist in the dictionary; a perfect hash index query and the hash function computation are then performed on each of them. If the score obtained for "I like eating apple" is 1000 and the score for "I love eating apple" is 10, then "I like eating apple" occurs 1000 times in the dictionary and "I love eating apple" occurs 10 times. Since 1000 > 10, the translation result is "I like eating apple".
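Steps S403 to S407 amount to filtering the candidates and keeping the one with the highest score. A minimal C++ sketch follows; the two indexes are abstracted as callbacks so that the fragment stays self-contained, and `pickTranslation`, `exists` and `score` are hypothetical names rather than part of the application.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Returns the position of the best candidate, or -1 if the filter rejects all of them.
int pickTranslation(
        const std::vector<std::string>& candidates,
        const std::function<bool(const std::string&)>& exists,            // Bloom filter query
        const std::function<std::uint32_t(const std::string&)>& score) {  // perfect hash query
    assert(exists && score);  // both callbacks must be set
    int best = -1;
    std::uint32_t bestScore = 0;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        if (!exists(candidates[i])) continue;          // S403: skip rejected candidates
        const std::uint32_t s = score(candidates[i]);  // S405: occurrence count
        if (best < 0 || s > bestScore) {               // S407: keep the highest score
            best = static_cast<int>(i);
            bestScore = s;
        }
    }
    return best;
}
```

With the scores from the example (1000 and 10), the function selects "I like eating apple". The score callback is never invoked for a candidate the filter rejects, which mirrors the order of steps S403 and S405.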
The idea proposed in this application of combining a Bloom filter index with a perfect hash index applies not only to querying the occurrence count of entries during language translation in the machine translation field, but also to dictionary queries over (word, value) pairs in any other field. This application therefore also proposes a large-scale dictionary storage and query method with a low false drop rate, used to compress existing dictionary data and to query entries. It is divided into a storage process and a query process.
See Fig. 9, a flowchart of a large-scale dictionary storage method with a low false drop rate according to an embodiment of this application.
S901: arrange the entries in the existing dictionary data according to a set format. The arrangement can be chosen according to actual needs and is not limited to ordering by the number of times an entry occurs in the dictionary. For example, the entries can be arranged by entry code or by entry length.
S903: read the dictionary data and generate a Bloom filter index using the Bloom filter technique. The basic idea is to map each entry in the dictionary into an array with a series of mutually independent hash functions and to set a flag at the resulting positions; the flag is usually set by setting the position to 1. This process is the same as the Bloom filter index construction described above and is not detailed again here.
S905: read the dictionary data and generate a perfect hash index using a perfect hash algorithm, where the perfect hash index includes the number of hash functions used and the parameters of each corresponding hash function.
When the perfect hash index is built, the entries in the existing dictionary are first matched in order so that each entry has one corresponding hash function, and each entry together with the parameters of its corresponding hash function is stored as one element. The elements are then processed according to the order in which they were stored, and the perfect hash index is generated.
Here, the value computed for a dictionary entry through the hash functions of the index may be the number of times the entry occurs in the dictionary, or any other integer value.
See Fig. 10, a flowchart of a large-scale dictionary query method with a low false drop rate according to an embodiment of this application.
S1001: query the Bloom filter index and determine whether the entry to be queried exists.
When the Bloom filter index is queried, the entry is mapped onto the array of the Bloom filter index using the same hash functions that were used to build the index, and it is determined whether all array positions to which hash values are mapped carry a flag. If all of them do, the entry is present in the dictionary. This step is the same as the Bloom filter index query described above and is not detailed again here.
S1003: if the entry exists, query the perfect hash index to obtain the value of the entry.
When the perfect hash index is queried, the parameters of the hash functions corresponding to the entry to be queried are first read from the perfect hash index. The hash functions are then reconstructed from these parameters. Finally, the hash value of the entry to be queried is obtained by computing with the corresponding hash functions.
A false drop can occur only when the entry to be queried is not in the dictionary; if the entry is in the dictionary, no false drop can occur. The purpose of this application is to reduce the likelihood of such false drops as far as possible. For example, when the entry "digital camera" is queried and filtered with the Bloom filter technique, if every value obtained after applying all hash functions of the Bloom filter index satisfies A_B[h_j(digital camera)] == 1, the entry "digital camera" is taken to be in the dictionary and the perfect hash operation is entered. If any value obtained through a hash function of the Bloom filter index is not 1, the entry "digital camera" is not in the dictionary, and the perfect hash operation is not entered.
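For orientation, the likelihood of such a false drop can be estimated with the standard Bloom filter approximation. The symbol M_1 for the length of the Bloom filter array is an assumption of this note (the text above names only K_1 and A_B), and the estimate treats the K_1 hash functions as independent and uniform.

```latex
% Probability that an entry absent from the dictionary passes all K_1 flag checks
p \approx \left(1 - e^{-K_1 |S| / M_1}\right)^{K_1}

% Choosing K_1 = (M_1 / |S|) \ln 2 minimizes p:
p_{\min} \approx \left(\tfrac{1}{2}\right)^{K_1} \approx 0.6185^{\,M_1/|S|}

% Worked example: M_1 / |S| = 10 array positions per entry
K_1 = 10 \ln 2 \approx 6.93 \;\Rightarrow\; K_1 = 7, \qquad
p \approx \left(1 - e^{-0.7}\right)^{7} \approx 0.0082
```

Under these assumptions, ten array positions per entry and seven hash functions let fewer than 1% of absent entries through to the perfect hash stage.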
Compared with the prior art, this application has the following advantages. Before the dictionary is looked up with the perfect hash technique, the Bloom filter technique is first used to determine whether the entry itself is present in the dictionary. Checking the entry itself replaces the prior-art approach of judging from the value range whether the entry exists, which reduces the false drop rate of dictionary queries and improves query quality. In addition, because the hash functions stored in the perfect hash index of this application occupy very little space, the dictionary storage is compressed to a great extent, saving computer storage space.
The above discloses only several specific embodiments of the present invention, but the present invention is not limited thereto; any variation that a person skilled in the art can conceive shall fall within the protection scope of the present invention.

Claims (20)

1. A large-scale dictionary query method with a low false drop rate, for querying an entry to be translated when a computer performs language translation, characterized in that it comprises the following steps:
1) compressing and storing the dictionary data required for machine translation using a large-scale dictionary storage method with a low false drop rate, which further comprises:
arranging the existing dictionary data required for machine translation according to a set format;
reading the dictionary data, and generating a Bloom filter index using the Bloom filter technique;
reading the dictionary data, and generating a perfect hash index using a perfect hash algorithm;
2) obtaining several candidate translations of the entry to be translated;
3) querying said Bloom filter index, and determining whether said candidate translations exist;
4) querying said perfect hash index, and computing the score of each said candidate translation that exists using hash functions obtained from said perfect hash index;
5) taking the candidate translation with the highest score as the translation result of the entry to be translated.
2. The large-scale dictionary query method with a low false drop rate as claimed in claim 1, characterized in that,
when said perfect hash index is generated, the number of hash functions used and the parameters of each corresponding hash function are stored.
3. The large-scale dictionary query method with a low false drop rate as claimed in claim 1, characterized in that generating said Bloom filter index comprises the following steps:
setting up an array in the memory of the computer;
applying several mutually independent hash functions to each entry in the dictionary data and mapping the results onto said array;
setting a flag at each position of said array to which a hash value is mapped, and generating said Bloom filter index.
4. The large-scale dictionary query method with a low false drop rate as claimed in claim 3, characterized in that setting a flag at each position of said array to which a hash value is mapped is specifically: setting to 1 each position of said array to which a hash value is mapped.
5. The large-scale dictionary query method with a low false drop rate as claimed in claim 3, characterized in that, before said hash functions are applied to each entry in the existing dictionary data and the results are mapped onto said array, the method further comprises the following step:
determining the size of said array from the scale of the dictionary data and the expected false drop rate of said Bloom filter index.
6. The large-scale dictionary query method with a low false drop rate as claimed in claim 5, characterized in that, after the size of said array has been determined, the method further comprises the step:
determining the number of said hash functions used by said Bloom filter index according to the scale of the dictionary data and the size of said array.
7. The large-scale dictionary query method with a low false drop rate as claimed in claim 1, characterized in that generating said perfect hash index comprises the following steps:
matching the entries in the dictionary data in order so that each entry has one corresponding hash function, and storing each said entry together with the parameters of its corresponding hash function as one element;
processing said elements according to the order in which they were stored, and generating said perfect hash index.
8. The large-scale dictionary query method with a low false drop rate as claimed in claim 7, characterized in that, during ordered matching, no more than ten hash functions are used to match the dictionary in order.
9. The large-scale dictionary query method with a low false drop rate as claimed in claim 1, characterized in that computing the score of said candidate translation comprises the following steps:
reading, from said perfect hash index, the parameters of the hash functions corresponding to said candidate translation;
reconstructing the hash functions from the parameters of said hash functions;
computing said candidate translation through the corresponding hash functions to obtain the score of each candidate translation.
10. The large-scale dictionary query method with a low false drop rate as claimed in claim 1, characterized in that determining whether said candidate translation exists comprises the following steps:
mapping the candidate translation onto said array using the same hash functions that were used to generate the Bloom filter index;
determining whether all positions of said array to which hash values are mapped carry said flag;
if all of them carry said flag, returning that the candidate translation exists.
11. A large-scale dictionary query method with a low false drop rate, characterized in that it comprises the following steps:
1) compressing and storing existing dictionary data using a large-scale dictionary storage method with a low false drop rate, which further comprises:
arranging the entries in the existing dictionary data according to a set format;
reading the dictionary data, and generating a Bloom filter index using the Bloom filter technique;
reading the dictionary data, and generating a perfect hash index using a perfect hash algorithm;
2) querying said Bloom filter index, and determining whether the entry to be queried exists;
3) if the entry exists, querying said perfect hash index to obtain the value of the entry.
12. The large-scale dictionary query method with a low false drop rate as claimed in claim 11, characterized in that, when said perfect hash index is generated, the number of hash functions used and the parameters of each corresponding hash function are stored.
13. The large-scale dictionary query method with a low false drop rate as claimed in claim 11, characterized in that generating said Bloom filter index comprises the following steps:
setting up an array in the memory of the computer;
applying several mutually independent hash functions to each entry in the dictionary data and mapping the results onto said array;
setting a flag at each position of said array to which a hash value is mapped, and generating said Bloom filter index.
14. The large-scale dictionary query method with a low false drop rate as claimed in claim 13, characterized in that setting a flag at each position of said array to which a hash value is mapped is specifically: setting to 1 each position of said array to which a hash value is mapped.
15. The large-scale dictionary query method with a low false drop rate as claimed in claim 13, characterized in that, before said hash functions are applied to each entry in the existing dictionary data and the results are mapped onto said array, the method further comprises the following step:
determining the size of said array from the scale of the dictionary data and the expected false drop rate of said Bloom filter index.
16. The large-scale dictionary query method with a low false drop rate as claimed in claim 15, characterized in that, after the size of said array has been determined, the method further comprises the step:
determining the number of said hash functions used by said Bloom filter index according to the scale of the dictionary data and the size of said array.
17. The large-scale dictionary query method with a low false drop rate as claimed in claim 11, characterized in that generating said perfect hash index comprises the following steps:
matching the entries in the dictionary data in order so that each entry has one corresponding hash function, and storing each said entry together with the parameters of its corresponding hash function as one element;
processing said elements according to the order in which they were stored, and generating said perfect hash index.
18. The large-scale dictionary query method with a low false drop rate as claimed in claim 17, characterized in that, during ordered matching, no more than ten hash functions are used to match the dictionary in order.
19. The large-scale dictionary query method with a low false drop rate as claimed in claim 11, characterized in that querying said Bloom filter index comprises the following steps:
mapping the entry to be queried onto said array using the same hash functions that were used to generate the Bloom filter index;
determining whether all positions of said array to which hash values are mapped carry said flag;
if all of them carry said flag, returning that the entry to be queried exists.
20. The large-scale dictionary query method with a low false drop rate as claimed in claim 11, characterized in that querying said perfect hash index comprises the following steps:
reading, from said perfect hash index, the parameters of the hash functions corresponding to the entry to be queried;
reconstructing the hash functions from the parameters of the hash functions;
obtaining the hash value of the entry to be queried by computing with the hash functions corresponding to the entry to be queried.
CN200910003645XA 2009-01-13 2009-01-13 Commercial scale dictionary storage method and query method with low search error rate Active CN101464899B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN200910003645XA CN101464899B (en) 2009-01-13 2009-01-13 Commercial scale dictionary storage method and query method with low search error rate
HK09110585.1A HK1130547A1 (en) 2009-01-13 2009-11-12 Method for storing and querying a large scale dictionary with low rate of error query

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910003645XA CN101464899B (en) 2009-01-13 2009-01-13 Commercial scale dictionary storage method and query method with low search error rate

Publications (2)

Publication Number Publication Date
CN101464899A CN101464899A (en) 2009-06-24
CN101464899B true CN101464899B (en) 2012-05-23

Family

ID=40805474

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910003645XA Active CN101464899B (en) 2009-01-13 2009-01-13 Commercial scale dictionary storage method and query method with low search error rate

Country Status (2)

Country Link
CN (1) CN101464899B (en)
HK (1) HK1130547A1 (en)

Also Published As

Publication number Publication date
CN101464899A (en) 2009-06-24
HK1130547A1 (en) 2009-12-31

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1130547

Country of ref document: HK

C14 Grant of patent or utility model
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: GR

Ref document number: 1130547

Country of ref document: HK