CN101464899A - Commercial scale dictionary storage method and query method with low search error rate - Google Patents


Info

Publication number
CN101464899A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA200910003645XA
Other languages
Chinese (zh)
Other versions
CN101464899B (en)
Inventor
孙海涛
孙健
侯磊
张勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN200910003645XA (CN101464899B)
Publication of CN101464899A
Priority to HK09110585.1A (HK1130547A1)
Application granted
Publication of CN101464899B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a storage method and a query method for large-scale dictionaries with a low false drop rate. The storage method compresses and stores the dictionary data required for machine translation and comprises the following steps: first, the existing dictionary data needed for machine translation is arranged in a set format; next, the dictionary data is read and a Bloom filter index is generated using the Bloom filter technique; finally, the dictionary data is read and a perfect hash index is generated using a perfect hash algorithm. The invention thereby further reduces the rate of erroneous query results.

Description

Large-scale dictionary storage method and query method with a low false drop rate
Technical field
The application relates to the computer storage and querying of dictionaries, and in particular to computer-executed methods for storing and querying large-scale dictionaries with a low false drop rate.
Background
With the rapid development of computers, using a computer to translate between different languages has long been well known. Just as a person must master the vocabulary and grammar of two languages before translating, a computer must store a machine dictionary and a machine grammar in its memory before it can translate. During translation, the computer first looks up the source-language dictionary to obtain the translation code and syntactic information for the input text, and then uses that code and information to find the corresponding translation in the target-language dictionary.
For example, Chinese patent application No. 200610136705.1 proposes a precise machine translation method and device. As shown in Fig. 1, the method comprises the following steps:
S101: taking the sentence as the unit and the predicate verb as the core, classify the sentence elements precisely and thereby establish a "universal grammar formula".
S103: build a source-language database and a target-language database.
S105: analyze the concrete source-language sentence and derive the numberized formula of that sentence.
S107: convert the numberized formula of the source-language sentence into the numberized formula of the concrete target-language sentence.
S109: retrieve the corresponding words and phrases one by one from the source-language database and the target-language database, and thereby obtain the corresponding concrete target-language sentence.
The precise machine translation method above can greatly improve translation accuracy, expressiveness and synchronism. However, every machine translation technique inevitably involves dictionary lookup, and for a large-scale dictionary the lookup speed directly determines the efficiency of machine translation.
In the prior art, a common way to improve dictionary lookup speed is to store keywords using a hash function: each keyword is passed through the hash function and the result is stored, which also compresses the dictionary. Storage based on hash functions inevitably runs into collisions, however. For example, two keywords keyA and keyB with different contents may still produce the same value under the hash function, i.e. hash(keyA) = hash(keyB).
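The collision described above can be illustrated with a small sketch. The function `toyHash` and the bucket count are assumptions made for illustration only; they are not the hash functions used in the application, and were chosen because their collisions are easy to see (any two keys with the same bytes in a different order collide).

```cpp
#include <cstddef>
#include <string>

// Hypothetical toy hash: sums the bytes of the key and reduces the sum
// modulo the number of buckets.
std::size_t toyHash(const std::string& key, std::size_t buckets) {
    std::size_t sum = 0;
    for (unsigned char c : key) {
        sum += c;
    }
    return sum % buckets;
}

// Two keys collide when their contents differ but their hash values match,
// i.e. hash(keyA) == hash(keyB) with keyA != keyB.
bool collides(const std::string& keyA, const std::string& keyB,
              std::size_t buckets) {
    return keyA != keyB && toyHash(keyA, buckets) == toyHash(keyB, buckets);
}
```

With this toy function, `collides("ab", "ba", 101)` is true because both keys sum to the same byte total, while identical keys never count as a collision.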
To address this problem, strategies for constructing perfect hash functions were proposed. A perfect hash function is a hash function under which no two entries in the dictionary collide; algorithms that construct such functions are called perfect hashing. Fig. 2 is a flowchart of an existing dictionary query method that uses a perfect hash function. The whole process can be divided into an entry-index construction process S200 and a query process S204.
The entry-index construction process S200 comprises the following steps:
S201: perform ordered matching on the dictionary to find several hash functions such that every entry in the dictionary has a corresponding hash function.
S203: process the data produced by the ordered matching and generate the perfect hash function index.
The query process S204 comprises the following steps:
S205: look up the perfect hash function index and apply the hash function from the index to the entry to be queried.
S207: determine whether the computed value is greater than the maximum entry value in the dictionary; if it is not, the entry to be queried is deemed present in the dictionary.
Such existing perfect hash functions are usually constructed on the assumption that entry values in the dictionary are continuous and uniformly distributed, so when deciding whether an entry exists they typically compare against the maximum or minimum entry value in the dictionary. If the entry's value is greater than the maximum or less than the minimum, the entry is deemed absent; if it falls within the dictionary's value range, the entry is deemed present. In a real dictionary, however, values are not necessarily continuous: a few common words may occur far more often than other words, while a few rare words occur far less often. In e-commerce, for example, the term "supply *product" occurs especially often. When such a perfect hash function is applied to the entries of an unevenly distributed dictionary, false drops (erroneous query hits) can occur.
Take querying the number of times an entry occurs in the dictionary as an example. Suppose the entries (n-grams) in the dictionary are unevenly distributed: n-gram_1 occurs 10 times, n-gram_2 through n-gram_100000 each occur between 500 and 1500 times, and n-gram_100001 occurs 10000 times, so the value of n-gram_100001 is the maximum in this dictionary. If the existing perfect hash function then yields g(n-gram_i) = 8000 for some queried n-gram_i with 2 < i < 100000, a value-range check finds 8000 < 10000 and concludes that n-gram_i is in the dictionary. In fact, no such n-gram_i need exist among n-gram_2 to n-gram_100000, because no entry at all may occur 8000 times, so a false drop results.
When a dictionary index built with an existing perfect hash function is queried for an entry that exists in the dictionary, the returned value is certainly correct. When it is queried for an entry that is not in the dictionary, however, it may still return a value, which is a false drop. In other words, a perfect hash function is "perfect" only for the known entries of the dictionary, not for unknown words. When such a perfect hash function is used in machine translation, translating an entry that does not exist in the computer's memory may therefore produce a wrong translation.
Summary of the invention
The application provides a large-scale dictionary storage method with a low false drop rate, to solve the technical problem in existing machine translation techniques that storing a dictionary with a perfect hash function can lead to wrong translations.
The application provides a large-scale dictionary storage and query method with a low false drop rate, to solve the technical problem in existing machine translation techniques that querying a dictionary with a perfect hash function can lead to wrong translations.
The application provides a large-scale dictionary storage method with a low false drop rate, to solve the technical problem in existing dictionary storage techniques that a dictionary stored with a perfect hash function is liable to false drops.
The application provides a large-scale dictionary query method with a low false drop rate, to solve the technical problem in existing dictionary query techniques that querying a dictionary with a perfect hash function is liable to false drops.
The application proposes a large-scale dictionary storage method with a low false drop rate, for compressing and storing the dictionary data needed for machine translation, comprising the following steps. First, the existing dictionary data needed for machine translation is arranged in a set format. Second, the dictionary data is read and a Bloom filter index is generated using the Bloom filter technique. Finally, the dictionary data is read and a perfect hash index is generated using a perfect hash algorithm.
In a preferred embodiment of this storage method, generating the perfect hash index includes storing the number of hash functions used and the parameters of each of those hash functions.
In a preferred embodiment of this storage method, generating the Bloom filter index comprises the following steps. First, an array is set up in the computer's memory. Then, several mutually independent hash functions are applied to each entry in the dictionary data and the results are mapped onto the array. Finally, a flag is set at each array position to which a hash value is mapped, and the Bloom filter index is generated.
In a preferred embodiment of this storage method, setting a flag at the array positions to which hash values are mapped specifically means setting to 1 the bit at each such position.
In a preferred embodiment of this storage method, before the hash functions are applied to each entry in the existing dictionary data and the results mapped onto the array, the method further comprises determining the size of the array from the scale of the dictionary data and the expected false drop rate of the Bloom filter index.
In a preferred embodiment of this storage method, after the size of the array has been determined, the method further comprises determining the number of hash functions used by the Bloom filter index from the scale of the dictionary data and the size of the array.
In a preferred embodiment of this storage method, generating the perfect hash index comprises the following steps. First, ordered matching is performed on the entries in the dictionary data so that every entry has a corresponding hash function, and each entry together with the parameters of its corresponding hash function is stored as one element. Then, the elements are processed according to the order in which they were stored, and the perfect hash index is generated.
In a preferred embodiment of this storage method, no more than ten hash functions are used to perform ordered matching on the dictionary.
The application proposes a method for querying the dictionary produced by the above storage method, used to look up an entry to be translated when the computer performs language translation, comprising the following steps. First, several candidate translations of the entry to be translated are obtained. Then, the Bloom filter index is queried to determine whether each candidate translation exists. Next, the perfect hash index is queried, and the score of each existing candidate translation is computed with the hash function obtained from the perfect hash index. Finally, the candidate translation with the highest score is taken as the translation result of the entry to be translated.
In a preferred embodiment of this query method, computing the score of a candidate translation comprises the following steps. First, the parameters of the hash function corresponding to the candidate translation are read from the perfect hash index. Second, the hash function is restored from those parameters. Finally, the candidate translation is passed through its corresponding hash function to obtain its score.
In a preferred embodiment of this query method, determining whether a candidate translation exists comprises the following steps. First, the candidate translation is passed through the same hash functions that were used to generate the Bloom filter index and mapped onto the array. Then, it is determined whether every array position to which a hash value is mapped has its flag set. Finally, if all flags are set, the candidate translation is reported as existing.
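The candidate-selection steps above can be sketched as follows. This is only an illustration under stated assumptions: `exists` is a hypothetical callback standing in for the Bloom filter index lookup, `score` is a hypothetical callback standing in for the perfect hash index lookup, and `bestCandidate` is a name introduced here, not one used in the application.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Returns the index of the highest-scoring candidate that passes the
// existence check, or -1 when no candidate passes.
int bestCandidate(const std::vector<std::string>& candidates,
                  const std::function<bool(const std::string&)>& exists,
                  const std::function<unsigned(const std::string&)>& score) {
    int best = -1;
    unsigned bestScore = 0;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        if (!exists(candidates[i])) {
            continue;  // filtered out before the score lookup is attempted
        }
        unsigned s = score(candidates[i]);
        if (best < 0 || s > bestScore) {
            best = static_cast<int>(i);
            bestScore = s;
        }
    }
    return best;
}
```

Passing the two lookups as callbacks keeps the sketch independent of how the indexes are stored; a candidate rejected by `exists` never reaches `score`, which is the point of placing the existence check first.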
The application proposes a large-scale dictionary storage method with a low false drop rate, for compressing and storing existing dictionary data, comprising the following steps. First, the entries in the existing dictionary data are arranged in a set format. Second, the dictionary data is read and a Bloom filter index is generated using the Bloom filter technique. Finally, the dictionary data is read and a perfect hash index is generated using a perfect hash algorithm.
In a preferred embodiment of this storage method, generating the perfect hash index includes storing the number of hash functions used and the parameters of each of those hash functions.
In a preferred embodiment of this storage method, generating the Bloom filter index comprises the following steps. First, an array is set up in the computer's memory. Second, several mutually independent hash functions are applied to each entry in the dictionary data and the results are mapped onto the array. Finally, a flag is set at each array position to which a hash value is mapped, and the Bloom filter index is generated.
In a preferred embodiment of this storage method, setting a flag at the array positions to which hash values are mapped specifically means setting to 1 the bit at each such position.
In a preferred embodiment of this storage method, before the hash functions are applied to each entry in the existing dictionary data and the results mapped onto the array, the method further comprises determining the size of the array from the scale of the dictionary data and the expected false drop rate of the Bloom filter index.
In a preferred embodiment of this storage method, after the size of the array has been determined, the method further comprises determining the number of hash functions used by the Bloom filter index from the scale of the dictionary data and the size of the array.
In a preferred embodiment of this storage method, generating the perfect hash index comprises the following steps. First, ordered matching is performed on the entries in the dictionary data so that every entry has a corresponding hash function, and each entry together with the parameters of its corresponding hash function is stored as one element. Then, the elements are processed according to the order in which they were stored, and the perfect hash index is generated.
In a preferred embodiment of this storage method, no more than ten hash functions are used to perform ordered matching on the dictionary.
The application proposes a method for querying the dictionary produced by the above storage method, comprising the following steps. First, the Bloom filter index is queried to determine whether the entry to be queried exists. Then, if the entry exists, the perfect hash index is queried to obtain the value of the entry.
In a preferred embodiment of this query method, querying the Bloom filter index comprises the following steps. First, the entry to be queried is passed through the same hash functions that were used to generate the Bloom filter index and mapped onto the array. Then, it is determined whether every array position to which a hash value is mapped has its flag set. Finally, if all flags are set, the entry to be queried is reported as existing.
In a preferred embodiment of this query method, querying the perfect hash index comprises the following steps. First, the parameters of the hash function corresponding to the entry to be queried are read from the perfect hash index. Then, the hash function is restored from those parameters. Finally, the entry to be queried is passed through its corresponding hash function to obtain its hash value.
Compared with the prior art, the application has the following advantages. Before the dictionary is looked up with perfect hashing, the Bloom filter technique is first used to determine whether the entry itself is in the dictionary. Checking the existence of the entry itself replaces the prior-art check based on the value range, which reduces the false drop rate of dictionary queries and improves query quality. In addition, because the hash functions stored in the perfect hash index take up very little space, the dictionary is compressed to a great extent and computer storage space is saved.
Of course, a product implementing the present invention need not achieve all of the above advantages at the same time.
Description of drawings
Fig. 1 is a flowchart of a precise machine translation method according to an embodiment proposed in Chinese patent application No. 200610136705.1;
Fig. 2 is a flowchart of an existing dictionary query method that uses a perfect hash function;
Fig. 3 is a flowchart of a large-scale dictionary storage method with a low false drop rate according to an embodiment of the application;
Fig. 4 is a flowchart of a large-scale dictionary query method with a low false drop rate according to an embodiment of the application;
Fig. 5 is a flowchart of building a Bloom filter index according to an embodiment of the application;
Fig. 6 is a flowchart of building a perfect hash index according to an embodiment of the application;
Fig. 7 is a schematic diagram of an ordered matching process according to an embodiment of the application;
Fig. 8 is a flowchart of determining whether a candidate translation exists according to an embodiment of the application;
Fig. 9 is a flowchart of a large-scale dictionary storage method with a low false drop rate according to an embodiment of the application;
Fig. 10 is a flowchart of a large-scale dictionary query method with a low false drop rate according to an embodiment of the application.
Detailed description
The main idea of the application is to combine the Bloom filter technique with perfect hashing in both dictionary storage and dictionary lookup. For storage, first the entries in the existing dictionary data are arranged in a set format; second, the dictionary data is read and a Bloom filter index is generated using the Bloom filter technique; finally, the dictionary data is read and a perfect hash index is generated using a perfect hash algorithm. For lookup, entries that are not in the dictionary are first filtered out by the Bloom filter, and only the remaining entries are queried through the perfect hash function. The Bloom filter referred to in the application is a highly space-efficient probabilistic data structure that is commonly used to test whether an element belongs to a given set.
The invention is described in detail below with reference to the accompanying drawings.
The application proposes a large-scale dictionary storage method with a low false drop rate, for compressing and storing the dictionary data needed for machine translation. Fig. 3 is a flowchart of this storage method according to an embodiment of the application; it comprises the following steps:
S301: arrange the existing dictionary data in a set format. The existing dictionary data can be obtained from the user log of a website, which records the historical query terms of the users who logged in to that website.
The data can be arranged using a bigram language model, as shown in Table 1:
Entry                      Occurrences
supply^camera              7000
supply^PVC                 9000
children^digital camera    3000
telescope^camera           2600
disposable^camera          3200
household^camera           3800
……
Table 1
In Table 1, "supply^camera", "supply^PVC" and so on are entries of the existing dictionary. The "^" symbol indicates that the entry is in bigram form, i.e. made up of two words. Numbers such as "7000" and "9000" give the number of times the corresponding entry occurs in the dictionary. For example, Table 1 shows that the entry made up of "supply" and "camera" occurs 7000 times. Other arrangements can also be used, such as a trigram or 4-gram language model.
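As an illustration of the Table 1 layout, the sketch below parses one line of the form `word^word count`. The line format (a `^`-joined key, whitespace, then a decimal count), the `DictEntry` structure and the `parseEntry` function are assumptions introduced here; the application does not specify a file format.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One dictionary record: the words of the n-gram and its occurrence count.
struct DictEntry {
    std::vector<std::string> words;
    unsigned long count;
};

// Parses a line such as "supply^camera 7000". The count is taken to be the
// last whitespace-separated token, so words may themselves contain spaces.
// Returns false when no count or no key can be extracted.
bool parseEntry(const std::string& line, DictEntry& out) {
    std::size_t split = line.find_last_of(" \t");
    if (split == std::string::npos || split + 1 >= line.size()) {
        return false;
    }
    std::istringstream countStream(line.substr(split + 1));
    unsigned long count = 0;
    if (!(countStream >> count)) {
        return false;
    }
    out.words.clear();
    std::istringstream keyStream(line.substr(0, split));
    std::string word;
    while (std::getline(keyStream, word, '^')) {
        out.words.push_back(word);
    }
    out.count = count;
    return !out.words.empty();
}
```

Splitting on `^` rather than assuming exactly two words means the same routine would also accept trigram or 4-gram entries.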
S303: read the dictionary data, generate a Bloom filter index using the Bloom filter technique, and store it.
The basic idea of generating the Bloom filter index is to map each entry in the dictionary into an array with a series of mutually independent hash functions and to set a flag at each mapped position. Fig. 5 is a flowchart of building a Bloom filter index according to an embodiment of the application.
S501: set up an array in the computer's storage. The array is generally placed in main memory.
S503: determine the size of the array from the scale of the existing dictionary and the expected false drop rate of the Bloom filter index.
S505: determine the number of hash functions used by the Bloom filter index from the scale of the existing dictionary and the size of the array.
S507: apply several mutually independent hash functions to each entry in the existing dictionary and map the results onto the array. The hash functions used here can be chosen at random.
S509: set a flag at each array position to which a hash value is mapped, and generate the Bloom filter index.
To aid understanding, pseudocode for building the Bloom filter index is given below.
Symbol description:
Hash function family: $h_i(x) \to [0, M_1)$, where $0 \le i < K_1$.
A_B: bit index array of size $1 \times M_1$, i.e. $M_1$ cells of 1 bit each.
S: the set of n-grams, with elements $\langle \text{n-gram}_j, v(\text{n-gram}_j) \rangle$, where $0 \le v(\text{n-gram}_j) < 2^{N_1}$.
Pseudocode for building the Bloom filter index (// marks a comment):
Input: S, hash function family
Output: index array A_B
// initialize A_B
For i ∈ [0, M_1) do
    A_B[i] = 0
End For
For i ∈ [0, |S|) do
    For j ∈ [0, K_1) do
        A_B[h_j(n-gram_i)] = 1
    End For
End For
In the pseudocode above, M_1 is the size of the Bloom filter index, S is the set of entries in the dictionary, and n-gram, a language model commonly used for natural language processing dictionaries, here refers to an entry in S. The pseudocode sets to 1 the bits of the index array A_B to which hash values are mapped and leaves all other bits at 0. The resulting index structure is shown in Table 2:
Number of values | 1 1 0 1 0 0 0 1 1 0 1 0 0 ……
Table 2
The "number of values" field gives the number of values in the Bloom filter index and equals M_1. In the series of values that follows, "1" means the bit is set and "0" means it is not set.
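The index construction and membership test can be sketched in C++ as follows. This is a minimal sketch, not the application's implementation: the class name `BloomIndex`, the function `seededHash` and the way the K_1 functions are derived (one hash algorithm run with K_1 different, arbitrarily chosen seeds) are assumptions made for illustration, and the sketch assumes M_1 > 0.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Seeded string hash used to derive the K1 hash functions. Treating
// different seeds as independent functions is an assumption of this sketch.
unsigned int seededHash(const std::string& key, unsigned int seed) {
    unsigned int hash = seed;
    for (unsigned char c : key) {
        hash ^= ((hash << 5) + c + (hash >> 2));
    }
    return hash;
}

class BloomIndex {
public:
    // m1: number of bits (M1, must be > 0); k1: number of hash functions (K1).
    BloomIndex(std::size_t m1, unsigned int k1) : bits_(m1, false), k1_(k1) {}

    // Sets the K1 bits selected by the entry, i.e. A_B[h_j(entry)] = 1.
    void add(const std::string& entry) {
        for (unsigned int j = 0; j < k1_; ++j) {
            bits_[position(entry, j)] = true;
        }
    }

    // Returns false if the entry is definitely absent, true if it may exist.
    bool mayContain(const std::string& entry) const {
        for (unsigned int j = 0; j < k1_; ++j) {
            if (!bits_[position(entry, j)]) {
                return false;
            }
        }
        return true;
    }

private:
    // Position of the entry under the j-th function; the seed schedule is a
    // hypothetical choice.
    std::size_t position(const std::string& entry, unsigned int j) const {
        return seededHash(entry, 0x9E3779B9u * (j + 1)) % bits_.size();
    }

    std::vector<bool> bits_;
    unsigned int k1_;
};
```

An entry that was added always passes `mayContain`; an entry that was never added is rejected unless all of its K_1 bits happen to have been set by other entries, which is the residual false drop rate that the array sizing controls.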
In practice, before the Bloom filter index is built, the number of hash functions $K_1$ and the expected false drop rate $p$ of the index must be determined. The relevant formulas are $K_1 = \ln 2 \cdot M_1 / |S|$ and $p = \left(1 - (1 - 1/m)^{kn}\right)^k \approx \left(1 - e^{-kn/m}\right)^k$. Substituting the first into the second gives $p \approx 2^{-K_1}$, from which the array size of the Bloom filter index follows as $M_1 = \log_2(1/p) / \ln 2 \cdot |S|$. The scale $|S|$ of the existing dictionary can be computed when the dictionary is stored in the computer's memory, so $M_1$ can be adjusted at any time according to the expected false drop rate $p$. In the expression for $p$, $k$ is the number of hash functions, $n$ is the number of entries in the dictionary, and $m$ is the number of bits in the bit array. Once the array size and the error rate of the Bloom filter index have been determined, the index can be built.
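The sizing relations above can be turned into a few helper functions. The sketch below is illustrative only: the function names are invented here, the rounding choices (round the bit count up, round the hash count to the nearest integer and keep it at least 1) are assumptions, and the inputs are assumed to satisfy n > 0 and 0 < p < 1.

```cpp
#include <cmath>
#include <cstddef>

// Bits needed for n entries at expected false drop rate p:
// M1 = log2(1/p) / ln 2 * |S|, rounded up.
std::size_t bloomBits(std::size_t n, double p) {
    const double ln2 = std::log(2.0);
    return static_cast<std::size_t>(std::ceil(std::log2(1.0 / p) / ln2 * n));
}

// Number of hash functions: K1 = ln 2 * M1 / |S|, rounded to the nearest
// integer and never less than 1.
unsigned int bloomHashCount(std::size_t m, std::size_t n) {
    double k = std::log(2.0) * static_cast<double>(m) / static_cast<double>(n);
    unsigned int rounded = static_cast<unsigned int>(std::lround(k));
    return rounded == 0 ? 1u : rounded;
}

// False drop rate predicted by p ~ (1 - e^{-kn/m})^k for a given setup.
double bloomFalseDropRate(std::size_t m, std::size_t n, unsigned int k) {
    double kd = k;
    return std::pow(1.0 - std::exp(-kd * n / m), kd);
}
```

For a dictionary of 1000 entries and p = 0.01, these helpers give 9586 bits and 7 hash functions, and the predicted rate for that configuration stays close to 0.01; the small deviation comes from rounding the number of hash functions to an integer.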
S305: read the dictionary data and generate a perfect hash index using a perfect hash algorithm; the perfect hash index includes the number of hash functions used and the parameters of each of them.
After the Bloom filter index has been built, the perfect hash index must also be built. Fig. 6 is a flowchart of building a perfect hash index according to an embodiment of the application; it comprises the following steps:
S601: perform ordered matching on the entries in the existing dictionary so that every entry has a corresponding hash function, and store each entry together with the parameters of its corresponding hash function as one element.
S603: process the elements according to the order in which they were stored, and generate the perfect hash index.
The purpose of ordered matching is to find several hash functions such that each entry n-gram_j (0 ≤ j < |S|) in the entry set S uses a corresponding hash function hash_i (0 ≤ i < K) to compute a hash value hash_i(n-gram_j). Different entries n-gram_j may correspond to different functions hash_i, for example n-gram_1 to hash_2 and n-gram_2 to hash_1. The requirement is that these |S| hash values are all unique.
At the same time, the order in which the n-gram_j are processed and the hash function corresponding to each n-gram_j are saved in matched. matched is a queue that stores, in order, a record of each entry and its corresponding hash function. If ordered matching fails, the hash functions are replaced and the attempt is repeated until hash functions satisfying the condition are found. Practice shows that when M ≥ 1.23|S| the probability of failure is small.
To aid understanding, pseudocode for the ordered matching process is given below (// marks a comment):
Input: the set S of n-grams; hash function family h_i(x) → [0, M), where 0 ≤ i < K
Output: the ordered matching set matched, or failure
// initialize matched to empty
matched ⇐ ∅
// initialize every element of r2l to empty
For i ∈ [0, M) do
    r2l_i ⇐ ∅
End For
// collect statistics
For i ∈ [0, |S|) do
    l2r_i ⇐ ∅
    For j ∈ [0, K) do
        l2r_i ⇐ l2r_i ∪ h_j(n-gram_i)
        r2l_{h_j(n-gram_i)} ⇐ r2l_{h_j(n-gram_i)} ∪ i
    End For
End For
// extract the i whose r2l_i has exactly one element, i.e. in-degree 1
degree_one ⇐ { i | 0 ≤ i < M, |r2l_i| = 1 }
While (|degree_one| ≥ 1) do
    rhs ⇐ POP(degree_one)
    lhs ⇐ POP(r2l_rhs)
    PUSH (lhs, rhs) onto matched
    For all rhs′ ∈ l2r_lhs do
        POP(r2l_rhs′)
        If (|r2l_rhs′| == 1) then
            degree_one ⇐ degree_one ∪ rhs′
        End If
    End For
End While
If |matched| == |S| then
    Return matched
Else
    Return false
End If
Ordered matching is further described below with reference to Fig. 7, a schematic diagram of an ordered matching process according to an embodiment of the application. Here a, b, c and d are entries in the dictionary. Three hash algorithms are applied to each of a, b, c and d (the lines connected to a, b, c and d), as in the panel labelled (c, 5) of Fig. 7, and yield the values 1, 2, 3, 4, 5, 6 and 7. Values 4 and 5 each have an in-degree of one (the in-degree is the number of lines connected to a node, here the lines connected to 4 and 5). This shows that hash values 4 and 5 are unique, so the hash function and entry corresponding to each of them meet the requirement of ordered matching. One of these two nodes can therefore be picked at random, recorded and deleted, and the lines connected to the entry of the deleted node are removed as well. This embodiment deletes node 5 and records entry c, hash value 5 and the corresponding hash function in matched. The result after deleting node 5 is the panel labelled (d, 4) of Fig. 7. Removing node 4 in the same way gives the panel labelled (b, 2) of Fig. 7, and entry d, hash value 4 and the corresponding hash function are recorded. Nodes 2 and 1 are then removed and recorded in the same way, which completes the ordered matching process.
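The ordered matching procedure can be sketched in C++ as follows. The sketch makes several assumptions that are not in the application: hash values are precomputed and passed in as `positions` (one list of K slots per entry), the K slots of one entry are distinct, entries are identified by their index, and the names `orderedMatch`, `positions` and `degreeOne` are introduced here for illustration. A degree-one slot whose entry has already been matched through another slot is skipped rather than popped blindly.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// positions[i] holds the slots h_0(e_i) ... h_{K-1}(e_i) of entry i, each in
// [0, m). On success, `matched` lists (entry, slot) pairs in the order they
// were peeled off and the function returns true.
bool orderedMatch(const std::vector<std::vector<std::size_t>>& positions,
                  std::size_t m,
                  std::vector<std::pair<std::size_t, std::size_t>>& matched) {
    matched.clear();
    // r2l[slot] = entries that still reference the slot.
    std::vector<std::vector<std::size_t>> r2l(m);
    for (std::size_t i = 0; i < positions.size(); ++i) {
        for (std::size_t slot : positions[i]) {
            r2l[slot].push_back(i);
        }
    }
    // Slots referenced by exactly one entry (in-degree 1).
    std::vector<std::size_t> degreeOne;
    for (std::size_t slot = 0; slot < m; ++slot) {
        if (r2l[slot].size() == 1) {
            degreeOne.push_back(slot);
        }
    }
    while (!degreeOne.empty()) {
        std::size_t rhs = degreeOne.back();
        degreeOne.pop_back();
        if (r2l[rhs].size() != 1) {
            continue;  // stale: the slot's entry was matched via another slot
        }
        std::size_t lhs = r2l[rhs][0];
        matched.push_back(std::make_pair(lhs, rhs));
        // Remove the matched entry from every slot it references.
        for (std::size_t slot : positions[lhs]) {
            std::vector<std::size_t>& refs = r2l[slot];
            refs.erase(std::remove(refs.begin(), refs.end(), lhs), refs.end());
            if (refs.size() == 1) {
                degreeOne.push_back(slot);
            }
        }
    }
    return matched.size() == positions.size();
}
```

When the function returns false, the caller would pick a different hash function family and retry, as the description states.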
Each element stored in matched comprises an entry and the parameters of its corresponding hash function. After ordered matching is complete, the elements in matched are processed in first-in, last-out order to obtain the array A of the perfect hash index. The processing may follow the formula:
$$A[h_{r(j)}(\text{n-gram}_j)] = v(\text{n-gram}_j) \otimes \bigotimes_{i=0,\ i \neq r(j)}^{K_2-1} A[h_i(\text{n-gram}_j)]$$
That is, the perfect hash index comprises two parts, the hash function index and the value index, as shown in Table 3:
Figure A200910003645D00172
Table 3
The hash function parameters in Table 3 generally have two components: the function type and the function seed. The function type specifies the hash algorithm, such as the hash algorithm of Justin Sobel or the hash algorithm of Peter J. Weinberger. The function seed is the initial reference value used by the hash algorithm. These two parameters are sufficient to restore the hash function, and they avoid the cost of storing the hash function itself.
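For example, the hash algorithm of Justin Sobel is as follows:
#include <cstddef>  // std::size_t

// Justin Sobel's bitwise hash, started from the stored seed m_seed.
unsigned int CJSHash::hash(const char* key, std::size_t keyLen)
{
    unsigned int hash = m_seed;
    for (std::size_t i = 0; i < keyLen; i++)
    {
        // Mix the next byte of the key into the running hash value.
        hash ^= ((hash << 5) + key[i] + (hash >> 2));
    }
    return hash;
}
Its seed is the m_seed in the program.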
The ordered matching process used in the application needs only a few hash functions (only three are used in the embodiment of Fig. 7), so the hash function parameters stored in the perfect hash index occupy very little storage space.
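How a hash function might be restored from its stored type and seed can be sketched as follows. This is only an illustration under assumptions: the names HashType, HashParams, restoreHash and slotsFor are hypothetical, the encoding of the hash function index is not specified in the text, and the seeded Weinberger variant shown here is one common 32-bit formulation rather than the one the application necessarily uses. Bytes are read as unsigned char, which matches the listing above for ASCII keys.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical encoding of one row of the hash function index.
enum class HashType { JS, PJW };
struct HashParams {
    HashType type;       // function type
    std::uint32_t seed;  // function seed
};

using HashFn = std::function<std::uint32_t(const std::string&)>;

// Justin Sobel's hash, started from the stored seed.
std::uint32_t jsHash(const std::string& key, std::uint32_t seed)
{
    std::uint32_t hash = seed;
    for (unsigned char c : key) hash ^= (hash << 5) + c + (hash >> 2);
    return hash;
}

// Peter J. Weinberger's hash (a common 32-bit form), started from the seed.
std::uint32_t pjwHash(const std::string& key, std::uint32_t seed)
{
    std::uint32_t hash = seed;
    for (unsigned char c : key) {
        hash = (hash << 4) + c;
        std::uint32_t high = hash & 0xF0000000u;
        if (high != 0) hash = (hash ^ (high >> 24)) & ~0xF0000000u;
    }
    return hash;
}

// Rebuild a callable hash function from its two stored parameters.
HashFn restoreHash(const HashParams& p)
{
    switch (p.type) {
    case HashType::JS:
        return [p](const std::string& k) { return jsHash(k, p.seed); };
    case HashType::PJW:
        return [p](const std::string& k) { return pjwHash(k, p.seed); };
    }
    assert(false && "unknown hash type");
    return HashFn();
}

// Array positions of an entry: one per stored hash function, reduced mod M2.
std::vector<std::size_t> slotsFor(const std::string& entry,
                                  const std::vector<HashParams>& hashIndex,
                                  std::size_t M2)
{
    assert(M2 > 0);
    std::vector<std::size_t> slots;
    for (const HashParams& p : hashIndex) {
        slots.push_back(restoreHash(p)(entry) % M2);
    }
    return slots;
}
```

Storing only a (type, seed) pair per function keeps the hash function index to a few bytes per function, which is consistent with the space argument made in the text.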
The value index records the number of values and the specific content of each value. The content of each value is determined according to:
$$A[h_{r(j)}(\text{n-gram}_j)] = v(\text{n-gram}_j) \otimes \bigotimes_{i=0,\ i \neq r(j)}^{K_2-1} A[h_i(\text{n-gram}_j)]$$
In practice, the size M_2 of the perfect hash index is set slightly larger than the size |S| of the dictionary. The closer M_2 is to |S|, the more space the perfect hash index saves, but the harder it becomes to find a suitable family of hash functions. Generally, M_2 = 1.1|S| is used.
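The assignment of the array A can be sketched as below. The sketch assumes that the ⊗ operator in the formula denotes bitwise XOR, which the text does not state explicitly, and that the matched pairs and the per-entry array positions are already available. The names buildValueArray, lookupValue, Match and slots are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// (entry index j, array position reserved for entry j during ordered matching)
using Match = std::pair<std::size_t, std::size_t>;

// Fill A by walking `matched` in first-in, last-out order.
// slots[j][i] is h_i(entry j); values[j] is v(entry j); M2 is the array size.
std::vector<std::uint32_t> buildValueArray(
    const std::vector<Match>& matched,
    const std::vector<std::vector<std::size_t>>& slots,
    const std::vector<std::uint32_t>& values,
    std::size_t M2)
{
    std::vector<std::uint32_t> A(M2, 0);
    for (auto it = matched.rbegin(); it != matched.rend(); ++it) {
        const std::size_t j = it->first;
        const std::size_t reserved = it->second;
        assert(j < slots.size() && j < values.size() && reserved < M2);
        // Combine v(entry j) with every other position the entry touches
        // (assumption: the combining operator is XOR).
        std::uint32_t acc = values[j];
        for (std::size_t s : slots[j]) {
            assert(s < M2);
            if (s != reserved) acc ^= A[s];
        }
        A[reserved] = acc;
    }
    return A;
}

// Reading a value back: combine all positions of the entry.
std::uint32_t lookupValue(const std::vector<std::uint32_t>& A,
                          const std::vector<std::size_t>& entrySlots)
{
    std::uint32_t g = 0;
    for (std::size_t s : entrySlots) g ^= A[s];
    return g;
}
```

The reverse order matters: when an entry was matched, no entry matched after it touched its reserved position, so writing that position later cannot disturb values that were already assigned. In a toy run with four entries, positions {{1, 2, 3}, {1, 2, 6}, {3, 5, 7}, {4, 6, 7}}, matched order (2, 5), (3, 7), (1, 6), (0, 2) and values {10, 20, 1000, 7}, lookupValue returns each entry's original value. The toy array size of 8 is chosen for readability and does not follow the 1.1|S| guideline.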
Corresponding to the large-scale dictionary storage method with a low false drop rate introduced above, the application also proposes a large-scale dictionary query method with a low false drop rate. When a computer performs language translation, the method queries the entry to be translated in the dictionary built by the storage method. As shown in Fig. 4, it comprises the following steps:
S401: obtain several translation candidates for the entry to be translated. This can be done with existing computer grammar conversion techniques. For example, an entry meaning "to like" may be translated into English as "like" or "love"; here "like" and "love" are the translation candidates of that entry.
S403: query the Bloomfilter index and judge whether each translation candidate exists.
As described above, when the Bloomfilter index is generated, a flag is set at every array position to which a hash value is mapped. Judging whether a translation candidate is present in the dictionary is therefore equivalent to judging whether the array positions onto which the hash functions map the candidate all carry the flag. As shown in Fig. 8, judging whether a translation candidate exists comprises the following steps:
S801: map the translation candidate onto the array of the Bloomfilter index using the same hash functions that were used to build the index.
S803: judge whether the array positions to which the hash values are mapped all carry the flag. If one or more of these positions carry no flag, the translation candidate does not exist, and the computer returns the information that the translation candidate does not exist.
S805: if all of these positions carry the flag, return that the translation candidate exists.
To aid understanding, the following pseudocode judges whether a translation candidate exists by querying the Bloomfilter index:
Input: n-gram
Output: whether the n-gram exists
For j ∈ [0, K_1) do
If (A_B[h_j(n-gram)] == 0) then
Return does not exist
End If
End For
Return exists
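The pseudocode above can be turned into a small C++ sketch. It is an illustration under assumptions rather than the application's implementation: the class name BloomIndex and its members are hypothetical, and the K_1 hash functions are simulated by seeding one Justin Sobel style hash with K_1 arbitrary seeds, which is weaker than the independent hash functions the text calls for.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

class BloomIndex {
public:
    // arraySize is the length of the array A_B; numHashes plays the role of K_1.
    BloomIndex(std::size_t arraySize, std::size_t numHashes)
        : flags_(arraySize, false), numHashes_(numHashes)
    {
        assert(arraySize > 0 && numHashes > 0);
    }

    // Index construction: set the flag at every position the entry maps to.
    void add(const std::string& entry)
    {
        for (std::size_t j = 0; j < numHashes_; ++j) {
            flags_[position(entry, j)] = true;
        }
    }

    // Query: the entry is reported as existing only if every position is set.
    bool exists(const std::string& entry) const
    {
        for (std::size_t j = 0; j < numHashes_; ++j) {
            if (!flags_[position(entry, j)]) return false;
        }
        return true;
    }

private:
    // h_j(entry): a seeded hash reduced to an array position (illustrative).
    std::size_t position(const std::string& entry, std::size_t j) const
    {
        std::uint32_t hash = 0x9E3779B9u * static_cast<std::uint32_t>(j + 1);
        for (unsigned char c : entry) hash ^= (hash << 5) + c + (hash >> 2);
        return hash % flags_.size();
    }

    std::vector<bool> flags_;
    std::size_t numHashes_;
};
```

For example, a freshly constructed BloomIndex(1024, 3) reports every entry as not existing, and after add("I like eating apple") the same entry is reported as existing. An entry that was never added can still be reported as existing when all of its positions happen to be set by other entries; that residual case is the false drop whose rate the array size and K_1 control.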
S405: query the perfect hash index, and compute the score of each translation candidate that exists using the hash functions obtained from the perfect hash index.
When the perfect hash index is queried, the parameters of the hash functions are first obtained by reading the hash function index part, and the hash functions are constructed from them. These hash functions are then used to compute the score g(n-gram) of each translation candidate that exists:
$$g(\text{n-gram}) = \bigotimes_{i=0}^{K_2-1} A[h_i(\text{n-gram})]$$
Here, the score of a translation candidate is the number of times the candidate occurs in the existing dictionary.
S407: take the translation candidate with the highest score as the translation result of the entry to be translated.
Take the translation of a sentence meaning "I like eating apple" as an example. Suppose the computer, using existing grammar conversion techniques, obtains two translation candidates for this sentence: "I like eating apple" and "I love eating apple". To determine the correct translation, a Bloomfilter index query is performed on each of the two candidates. Suppose both candidates are found to exist in the dictionary; a perfect hash index query and the hash function computation are then performed on each of them. If the score obtained for "I like eating apple" is 1000 and the score obtained for "I love eating apple" is 10, then "I like eating apple" occurs 1000 times in the dictionary and "I love eating apple" occurs 10 times. Since 1000 > 10, the translation result is "I like eating apple".
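The selection in steps S403 to S407 can be sketched as follows. The function name pickTranslation and its two callback parameters are hypothetical; they stand in for the Bloomfilter index query and the perfect hash index query, whose concrete interfaces the text does not define.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Returns the existing candidate with the highest score, or an empty string
// when no candidate exists in the dictionary.
std::string pickTranslation(
    const std::vector<std::string>& candidates,
    const std::function<bool(const std::string&)>& exists,           // S403
    const std::function<std::uint32_t(const std::string&)>& score)   // S405
{
    std::string best;
    std::uint32_t bestScore = 0;
    bool found = false;
    for (const std::string& candidate : candidates) {
        // Candidates rejected by the Bloomfilter index are never scored.
        if (!exists(candidate)) continue;
        const std::uint32_t s = score(candidate);
        if (!found || s > bestScore) {
            best = candidate;
            bestScore = s;
            found = true;
        }
    }
    return best;  // S407
}
```

With an exists callback that accepts both candidates of the example and a score callback that returns 1000 for "I like eating apple" and 10 for "I love eating apple", the function returns "I like eating apple". Skipping the score computation for rejected candidates matters because a perfect hash lookup on an absent entry would still return some value.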
The idea proposed in the application of combining a Bloomfilter index with a perfect hash index is applicable not only to querying entry occurrence counts during language translation in the field of machine translation, but also to dictionary queries of any (word, value) kind in other fields. The application therefore further proposes a large-scale dictionary storage and query method with a low false drop rate, used to compress existing dictionary data and to query entries. It is divided into a storage process and a query process.
Refer to Fig. 9, a flowchart of a large-scale dictionary storage method with a low false drop rate according to an embodiment of the present application.
S901: arrange the entries in the existing dictionary data according to a set format. The arrangement can be chosen according to actual needs and is not limited to the number of times an entry occurs in the dictionary. For example, entries can be arranged by entry code or by entry length.
S903: read the dictionary data and generate the Bloomfilter index using the Bloomfilter technique. The basic idea is to map each entry in the dictionary into an array with a series of mutually independent hash functions and to set a flag at each of these positions, usually by setting the corresponding bit. This process is the same as the Bloomfilter index construction process described above and is not described in detail again here.
S905: read the dictionary data and generate the perfect hash index using a perfect hash algorithm, where the perfect hash index includes the number of hash functions used and the parameters of each corresponding hash function.
When the perfect hash index is built, the entries in the existing dictionary are first matched in order, so that each entry has one corresponding hash function, and each entry together with the parameters of its corresponding hash function is stored as one element. The elements are then processed according to the order in which they were stored, and the perfect hash index is generated.
Here, the value computed for a dictionary entry by the hash functions of the index may be the number of times the entry occurs in the dictionary, or any other integer value.
Refer to Fig. 10, a flowchart of a large-scale dictionary query method with a low false drop rate according to an embodiment of the present application.
S1001: query the Bloomfilter index and judge whether the entry to be queried exists.
When the Bloomfilter index is queried, the entry is mapped onto the array of the Bloomfilter index using the same hash functions that were used to build the index, and it is judged whether the array positions to which the hash values are mapped all carry the flag. If they all carry the flag, the entry is present in the dictionary. This step is the same as the Bloomfilter index query process described above and is not described in detail again here.
S1003: if the entry exists, query the perfect hash index to obtain the value of the entry.
When the perfect hash index is queried, the parameters of the hash functions corresponding to the entry to be queried are first read from the perfect hash index. The hash functions are then restored from these parameters. Finally, the hash value of the entry to be queried is obtained by applying the corresponding hash functions to it.
A false drop can occur only when the entry being looked up is not in the dictionary; if the entry is in the dictionary, no false drop can occur. The purpose of the application is to reduce the likelihood of such false drops as far as possible. For example, when the entry "digital camera" is queried, it is first filtered with the Bloomfilter technique. If every value obtained in the index after applying all the hash functions of the Bloomfilter index satisfies A_B[h_j(digital camera)] == 1, the entry "digital camera" is in the dictionary, and the perfect hash operation is entered. If any value obtained after applying the hash functions of the Bloomfilter index is not 1, the entry "digital camera" is not in the dictionary, and the perfect hash operation is not entered.
Compared with the prior art, the application has the following advantages. Before the dictionary is looked up with the perfect hash technique, the Bloomfilter technique is first used to judge whether the entry itself is present in the dictionary. Judging by the presence of the entry itself replaces the prior-art practice of judging by the value range, which reduces the false drop rate of dictionary queries and improves the query result. In addition, because the hash functions stored in the perfect hash index of the application take up very little space, the storage of the dictionary is compressed to a very great extent, saving computer storage space.
The above discloses only several specific embodiments of the present invention. The invention is not limited to them, and any variation that a person skilled in the art can conceive shall fall within the protection scope of the invention.

Claims (22)

1. A large-scale dictionary storage method with a low false drop rate, for compressing and storing the dictionary data needed during machine translation, characterized by comprising the following steps:
arranging the existing dictionary data needed during machine translation according to a set format;
reading the dictionary data, and generating a Bloomfilter index using the Bloomfilter technique;
reading the dictionary data, and generating a perfect hash index using a perfect hash algorithm.
2. The large-scale dictionary storage method with a low false drop rate according to claim 1, characterized in that, when the perfect hash index is generated, the number of hash functions used and the parameters of each corresponding hash function are stored.
3. The large-scale dictionary storage method with a low false drop rate according to claim 1, characterized in that generating the Bloomfilter index comprises the following steps:
setting up an array in the memory of a computer;
applying several mutually independent hash functions to each entry in the dictionary data and mapping the results onto the array;
setting a flag at each position of the array to which a hash value is mapped, and generating the Bloomfilter index.
4. The large-scale dictionary storage method with a low false drop rate according to claim 3, characterized in that setting a flag at each position of the array to which a hash value is mapped is specifically: setting the bit at each position of the array to which a hash value is mapped.
5. The large-scale dictionary storage method with a low false drop rate according to claim 3, characterized in that, before the hash functions are applied to each entry in the existing dictionary data and the results are mapped onto the array, the method further comprises the following step:
determining the size of the array from the scale of the dictionary data and the expected false drop rate of the Bloomfilter index.
6. The large-scale dictionary storage method with a low false drop rate according to claim 5, characterized in that, after the size of the array has been determined, the method further comprises the step of:
determining the number of hash functions used by the Bloomfilter index from the scale of the dictionary data and the size of the array.
7. The large-scale dictionary storage method with a low false drop rate according to claim 1, characterized in that generating the perfect hash index comprises the following steps:
matching the entries in the dictionary data in order, so that each entry has one corresponding hash function, and storing each entry together with the parameters of its corresponding hash function as one element;
processing the elements according to the order in which they were stored, and generating the perfect hash index.
8. The large-scale dictionary storage method with a low false drop rate according to claim 7, characterized in that, during ordered matching, no more than ten hash functions are used to match the dictionary in order.
9. A query method for a dictionary formed by the large-scale dictionary storage method with a low false drop rate according to any one of claims 1 to 8, for querying an entry to be translated when a computer performs language translation, characterized by comprising the following steps:
obtaining several translation candidates for the entry to be translated;
querying the Bloomfilter index, and judging whether each translation candidate exists;
querying the perfect hash index, and computing the score of each translation candidate that exists using the hash functions obtained from the perfect hash index;
taking the translation candidate with the highest score as the translation result of the entry to be translated.
10. The large-scale dictionary query method with a low false drop rate according to claim 9, characterized in that computing the score of the translation candidates comprises the following steps:
reading the parameters of the hash functions corresponding to the translation candidate from the perfect hash index;
restoring the hash functions from the parameters of the hash functions;
applying the corresponding hash functions to the translation candidate to obtain the score of each translation candidate.
11. The large-scale dictionary query method with a low false drop rate according to claim 9, characterized in that judging whether the translation candidate exists comprises the following steps:
mapping the translation candidate onto the array using the same hash functions as were used to generate the Bloomfilter index;
judging whether the positions of the array to which hash values are mapped all carry the flag;
if they all carry the flag, returning that the translation candidate exists.
12. A large-scale dictionary storage method with a low false drop rate, for compressing and storing existing dictionary data, characterized by comprising the following steps:
arranging the entries in the existing dictionary data according to a set format;
reading the dictionary data, and generating a Bloomfilter index using the Bloomfilter technique;
reading the dictionary data, and generating a perfect hash index using a perfect hash algorithm.
13. The large-scale dictionary storage method with a low false drop rate according to claim 12, characterized in that, when the perfect hash index is generated, the number of hash functions used and the parameters of each corresponding hash function are stored.
14. The large-scale dictionary storage method with a low false drop rate according to claim 12, characterized in that generating the Bloomfilter index comprises the following steps:
setting up an array in the memory of a computer;
applying several mutually independent hash functions to each entry in the dictionary data and mapping the results onto the array;
setting a flag at each position of the array to which a hash value is mapped, and generating the Bloomfilter index.
15. The large-scale dictionary storage method with a low false drop rate according to claim 14, characterized in that setting a flag at each position of the array to which a hash value is mapped is specifically: setting the bit at each position of the array to which a hash value is mapped.
16. The large-scale dictionary storage method with a low false drop rate according to claim 14, characterized in that, before the hash functions are applied to each entry in the existing dictionary data and the results are mapped onto the array, the method further comprises the following step:
determining the size of the array from the scale of the dictionary data and the expected false drop rate of the Bloomfilter index.
17. The large-scale dictionary storage method with a low false drop rate according to claim 16, characterized in that, after the size of the array has been determined, the method further comprises the step of:
determining the number of hash functions used by the Bloomfilter index from the scale of the dictionary data and the size of the array.
18. The large-scale dictionary storage method with a low false drop rate according to claim 12, characterized in that generating the perfect hash index comprises the following steps:
matching the entries in the dictionary data in order, so that each entry has one corresponding hash function, and storing each entry together with the parameters of its corresponding hash function as one element;
processing the elements according to the order in which they were stored, and generating the perfect hash index.
19. The large-scale dictionary storage method with a low false drop rate according to claim 18, characterized in that, during ordered matching, no more than ten hash functions are used to match the dictionary in order.
20. A query method for a dictionary formed by the large-scale dictionary storage method with a low false drop rate according to any one of claims 12 to 19, characterized by comprising the following steps:
querying the Bloomfilter index, and judging whether the entry to be queried exists;
if the entry exists, querying the perfect hash index to obtain the value of the entry.
21. The large-scale dictionary query method with a low false drop rate according to claim 20, characterized in that querying the Bloomfilter index comprises the following steps:
mapping the entry to be queried onto the array using the same hash functions as were used to generate the Bloomfilter index;
judging whether the positions of the array to which hash values are mapped all carry the flag;
if they all carry the flag, returning that the entry to be queried exists.
22. The large-scale dictionary query method with a low false drop rate according to claim 20, characterized in that querying the perfect hash index comprises the following steps:
reading the parameters of the hash functions corresponding to the entry to be queried from the perfect hash index;
restoring the hash functions from the parameters of the hash functions;
obtaining the hash value of the entry to be queried by applying the corresponding hash functions to it.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN200910003645XA CN101464899B (en) 2009-01-13 2009-01-13 Commercial scale dictionary storage method and query method with low search error rate
HK09110585.1A HK1130547A1 (en) 2009-01-13 2009-11-12 Method for storing and querying a large scale dictionary with low rate of error query

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910003645XA CN101464899B (en) 2009-01-13 2009-01-13 Commercial scale dictionary storage method and query method with low search error rate

Publications (2)

Publication Number Publication Date
CN101464899A true CN101464899A (en) 2009-06-24
CN101464899B CN101464899B (en) 2012-05-23

Family

ID=40805474

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910003645XA Active CN101464899B (en) 2009-01-13 2009-01-13 Commercial scale dictionary storage method and query method with low search error rate

Country Status (2)

Country Link
CN (1) CN101464899B (en)
HK (1) HK1130547A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184245A (en) * 2011-05-18 2011-09-14 华北电力大学 Method for fast searching massive text data keywords
CN108228875A (en) * 2018-01-18 2018-06-29 北京奇安信科技有限公司 Daily record analysis method and device based on perfect Hash
CN109684439A (en) * 2018-12-28 2019-04-26 语联网(武汉)信息技术有限公司 The method and device of prefix index is carried out during participle

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184245A (en) * 2011-05-18 2011-09-14 华北电力大学 Method for fast searching massive text data keywords
CN102184245B (en) * 2011-05-18 2013-03-06 华北电力大学 Method for fast searching massive text data keywords
CN108228875A (en) * 2018-01-18 2018-06-29 北京奇安信科技有限公司 Daily record analysis method and device based on perfect Hash
CN108228875B (en) * 2018-01-18 2021-12-14 奇安信科技集团股份有限公司 Log analysis method and device based on perfect hash
CN109684439A (en) * 2018-12-28 2019-04-26 语联网(武汉)信息技术有限公司 The method and device of prefix index is carried out during participle
CN109684439B (en) * 2018-12-28 2020-10-30 语联网(武汉)信息技术有限公司 Method and device for indexing prefix in word segmentation process

Also Published As

Publication number Publication date
CN101464899B (en) 2012-05-23
HK1130547A1 (en) 2009-12-31


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 1130547; Country of ref document: HK)
C14 Grant of patent or utility model
GR01 Patent grant
REG Reference to a national code (Ref country code: HK; Ref legal event code: GR; Ref document number: 1130547; Country of ref document: HK)