CN101464899A

CN101464899A - Commercial scale dictionary storage method and query method with low search error rate

Info

Publication number: CN101464899A
Application number: CNA200910003645XA
Authority: CN
Inventors: 孙海涛; 孙健; 侯磊; 张勤
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2009-01-13
Filing date: 2009-01-13
Publication date: 2009-06-24
Anticipated expiration: 2029-01-13
Also published as: CN101464899B; HK1130547A1

Abstract

The invention provides a large-scale dictionary storage method with a low false positive rate and a corresponding query method. The storage method compresses and stores the dictionary data required for machine translation and comprises the following steps: first, the existing dictionary data required for machine translation is arranged in a set format; next, the dictionary data is read and a Bloom filter index is generated using the Bloom filter technique; finally, the dictionary data is read and a perfect hash index is generated using a perfect hash algorithm. The invention further reduces the rate of query errors.

Description

Large-scale dictionary storage method and query method with a low false positive rate

Technical field

The application relates to the field of computer storage and querying of dictionaries, and in particular to a computer-implemented large-scale dictionary storage method and query method with a low false positive rate.

Background technology

With the rapid development of computers, the use of computers to translate between different languages has long been well known. Just as a person must master the vocabulary and grammar of two languages before translating, a computer must store a machine dictionary and a machine grammar in its memory before it can translate. During translation, the computer first looks up the source-language dictionary to obtain the translation codes and syntactic information of the input source text, and then uses those translation codes and syntactic information to find the corresponding translation in the target-language dictionary.

For example, the document with application number 200610136705.1 filed with the Chinese national patent office proposes a precise machine translation method and device. As shown in Fig. 1, the method comprises the following steps:

S101: taking the sentence as the unit and the predicate verb as the core, sentence elements are precisely classified to establish a "universal grammar formula".

S103: a source language database and a target language database are established.

S105: a concrete sentence of the source language is analyzed to derive its symbolized formula.

S107: the symbolized formula of the source-language sentence is converted into the symbolized formula of the corresponding target-language sentence.

S109: the corresponding words and phrases are retrieved one by one from the source language database and the target language database to obtain the concrete sentence in the target language.

The above precise machine translation method can greatly improve translation accuracy, expressiveness, and synchronism. However, every machine translation technique inevitably involves dictionary lookup. For a large-scale dictionary, lookup speed directly determines the efficiency of machine translation.

In the prior art, storing keywords using a hash function is a common way to speed up dictionary lookup: each keyword is processed by a hash function before being stored, which also compresses the dictionary. However, storing with a hash function inevitably leads to collisions. For example, two keywords keyA and keyB may have different content yet yield the same value under the hash function, i.e., hash(keyA) = hash(keyB) holds.

To address this problem, perfect hash function construction strategies have been proposed. A perfect hash function is one for which no collision occurs among the entries of the dictionary after the hash computation; such algorithms are called perfect hashing. Fig. 2 is a flowchart of an existing dictionary query using a perfect hash function. The whole process can be divided into an entry index construction process S200 and a query process S204.

The entry index construction process S200 comprises the following steps:

S201: ordered matching is performed on the dictionary to find several hash functions such that every entry in the dictionary has a corresponding hash function.

S203: the data from the ordered matching process is processed and a perfect hash function index is generated.

The query process S204 comprises the following steps:

S205: the perfect hash function index is queried, and the hash function in the index is used to compute a value for the entry to be queried.

S207: it is determined whether the computed value is greater than the maximum entry value in the dictionary; if it is not, the entry to be queried is considered present in the dictionary.

Such existing perfect hash functions are usually constructed on the assumption that the entry values in the dictionary are continuous and uniformly distributed, so that, when judging whether an entry exists, they typically compare its value with the maximum or minimum entry value in the dictionary. If the value is greater than the maximum or less than the minimum, the entry is considered absent; if the value falls within the value range of the dictionary, the entry is considered present. However, the values in a real dictionary are not necessarily continuous: a few common words may occur far more often than other words, while a few rare words occur far less often. In e-commerce, for instance, the entry "supply^*product" occurs especially often. When an existing perfect hash function of this kind is applied to such an unevenly distributed dictionary, false positive queries can occur.

Take querying the number of times an entry occurs in the dictionary as an example. Suppose the entries n-gram of a dictionary are unevenly distributed: n-gram₁ occurs 10 times, n-gram₂ through n-gram₁₀₀₀₀₀ each occur between 500 and 1500 times, and n-gram₁₀₀₀₀₁ occurs 10000 times, so that the value of n-gram₁₀₀₀₀₁ is the maximum of the dictionary. Suppose further that the existing perfect hash function computes g(n-gram_i) = 8000 for some n-gram_i, 2 < i < 100000. Judged by value range, 8000 < 10000, so n-gram_i is taken to be present in the dictionary. In fact, however, no such n-gram_i necessarily exists among n-gram₂ through n-gram₁₀₀₀₀₀; that is, there may be no entry at all that occurs 8000 times, and the query yields a false positive.

When querying a dictionary index built with an existing perfect hash function, the value returned for an entry that is in the dictionary is certainly correct. But a query for an entry that is not in the dictionary may also return a value, which is a false positive. In other words, a perfect hash function is "perfect" only for the known entries of the dictionary, not for unknown words. When such a perfect hash function is used in machine translation, translating an entry that does not exist in the computer's memory may therefore produce a wrong translation.
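The behaviour described above can be illustrated with a minimal sketch. It assumes a toy dictionary of three entries, a seeded hash built on `hashlib.blake2b` as a stand-in for whatever hash family a real index would use, and a brute-force seed search in place of a proper perfect hash construction; all function and variable names are hypothetical.

```python
import hashlib

def seeded_hash(key: str, seed: int, m: int) -> int:
    """Map key into [0, m) with a seeded hash (stand-in for the real hash family)."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little") % m

def find_perfect_seed(keys, m, max_seed=10_000):
    """Brute-force a seed under which no two known keys collide."""
    for seed in range(max_seed):
        if len({seeded_hash(k, seed, m) for k in keys}) == len(keys):
            return seed
    raise ValueError("no collision-free seed found")

# Toy dictionary: entry -> occurrence count (illustrative values).
counts = {"supply^camera": 7000, "supply^PVC": 9000, "telescope^camera": 2600}
M = len(counts)                       # every slot ends up occupied
seed = find_perfect_seed(list(counts), M)

table = [0] * M
for entry, value in counts.items():
    table[seeded_hash(entry, seed, M)] = value

def lookup(entry: str) -> int:
    """Return the value stored at the entry's slot; no membership check is made."""
    return table[seeded_hash(entry, seed, M)]

known_ok = all(lookup(e) == v for e, v in counts.items())
unknown_value = lookup("no^such-entry")   # a real count comes back anyway
```

Every known entry gets its own value back, but the unknown entry also lands on an occupied slot and returns the count of some other entry, which is the false positive the background section describes.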

Summary of the invention

The application provides a large-scale dictionary storage method with a low false positive rate, to solve the technical problem in existing machine translation methods that storing a dictionary with a perfect hash function can lead to wrong translations.

The application provides a large-scale dictionary query method with a low false positive rate, to solve the technical problem in existing machine translation methods that querying a dictionary with a perfect hash function can lead to wrong translations.

The application provides a large-scale dictionary storage method with a low false positive rate, to solve the technical problem in existing dictionary storage techniques that a dictionary stored with a perfect hash function is prone to false positive queries.

The application provides a large-scale dictionary query method with a low false positive rate, to solve the technical problem in existing dictionary query techniques that querying a dictionary with a perfect hash function is prone to false positive queries.

The application proposes a large-scale dictionary storage method with a low false positive rate for compressing and storing the dictionary data required for machine translation, comprising the following steps: first, the existing dictionary data required for machine translation is arranged in a set format; second, the dictionary data is read and a Bloom filter index is generated using the Bloom filter technique; finally, the dictionary data is read and a perfect hash index is generated using a perfect hash algorithm.

In the storage method according to a preferred embodiment of the application, when the perfect hash index is generated, the number of hash functions used and the parameters of each corresponding hash function are stored.

In the storage method according to a preferred embodiment of the application, generating the Bloom filter index comprises the following steps: first, an array is set up in the storage of the computer; then, several mutually independent hash functions are applied to each entry in the dictionary data and the results are mapped onto the array; finally, a flag is set at each array position that receives a hash value, and the Bloom filter index is generated.

In the storage method according to a preferred embodiment of the application, setting a flag at each array position that receives a hash value specifically means setting the bit at that position.

In the storage method according to a preferred embodiment of the application, before each entry in the existing dictionary data is hashed and mapped onto the array, the method further comprises: determining the size of the array from the scale of the dictionary data and the expected false positive rate of the Bloom filter index.

In the storage method according to a preferred embodiment of the application, after the size of the array has been determined, the method further comprises: determining the number of hash functions used by the Bloom filter index from the scale of the dictionary data and the size of the array.

In the storage method according to a preferred embodiment of the application, generating the perfect hash index comprises the following steps: first, ordered matching is performed on the entries in the dictionary data so that every entry has a corresponding hash function, and each entry together with the parameters of its corresponding hash function is stored as one element; then, the elements are processed according to the order in which they were stored, and the perfect hash index is generated.

In the storage method according to a preferred embodiment of the application, no more than ten hash functions are used for the ordered matching of the dictionary.

The application proposes a query method for a dictionary formed by the above large-scale dictionary storage method with a low false positive rate, used to look up an entry to be translated when the computer performs language translation, comprising the following steps: first, several candidate translations of the entry to be translated are obtained; then, the Bloom filter index is queried to determine whether each candidate translation exists; next, the perfect hash index is queried, and the scores of the candidate translations that exist are computed with the hash functions obtained from the perfect hash index; finally, the candidate translation with the highest score is taken as the translation result of the entry to be translated.

In the query method according to a preferred embodiment of the application, computing the score of a candidate translation comprises the following steps: first, the parameters of the hash function corresponding to the candidate translation are read from the perfect hash index; next, the hash function is restored from its parameters; finally, the candidate translation is processed by its corresponding hash function to obtain the score of each candidate translation.

In the query method according to a preferred embodiment of the application, determining whether a candidate translation exists comprises the following steps: first, the candidate translation is processed by the same hash functions that were used to generate the Bloom filter index and the results are mapped onto the array; then, it is determined whether all array positions that receive a hash value are flagged; finally, if all of them are flagged, the candidate translation is reported as existing.

The application proposes a large-scale dictionary storage method with a low false positive rate for compressing and storing existing dictionary data, comprising the following steps: first, the entries in the existing dictionary data are arranged in a set format; second, the dictionary data is read and a Bloom filter index is generated using the Bloom filter technique; finally, the dictionary data is read and a perfect hash index is generated using a perfect hash algorithm.

In the storage method according to a preferred embodiment of the application, generating the Bloom filter index comprises the following steps: first, an array is set up in the storage of the computer; second, several mutually independent hash functions are applied to each entry in the dictionary data and the results are mapped onto the array; finally, a flag is set at each array position that receives a hash value, and the Bloom filter index is generated.

In the storage method according to a preferred embodiment of the application, after the size of the array has been determined, the method further comprises: determining the number of hash functions used by the Bloom filter index from the scale of the dictionary data and the size of the array.

The application proposes a query method for a dictionary formed by the above large-scale dictionary storage method with a low false positive rate, comprising the following steps: first, the Bloom filter index is queried to determine whether the entry to be queried exists; then, if the entry exists, the perfect hash index is queried to obtain the value of the entry.

In the query method according to a preferred embodiment of the application, querying the Bloom filter index comprises the following steps: first, the entry to be queried is processed by the same hash functions that were used to generate the Bloom filter index and the results are mapped onto the array; then, it is determined whether all array positions that receive a hash value are flagged; finally, if all of them are flagged, the entry to be queried is reported as existing.

In the query method according to a preferred embodiment of the application, querying the perfect hash index comprises the following steps: first, the parameters of the hash function corresponding to the entry to be queried are read from the perfect hash index; then, the hash function is restored from its parameters; finally, the hash value of the entry to be queried is obtained by applying its corresponding hash function.

Compared with the prior art, the application has the following advantages. Before the dictionary is looked up with the perfect hashing technique, the Bloom filter technique is first used to determine whether the entry itself is in the dictionary. Checking the presence of the entry itself replaces the prior-art check against the value range, which reduces the false positive rate of dictionary queries and improves query quality. In addition, because the hash functions stored in the perfect hash index of the application occupy very little space, the storage of the dictionary is compressed to a great extent and the storage space of the computer is saved.

Of course, a product implementing the application need not achieve all of the above advantages at the same time.

Description of drawings

Fig. 1 is a flowchart of the precise machine translation method of an embodiment proposed in the document with application number 200610136705.1 filed with the Chinese national patent office;

Fig. 2 is a flowchart of an existing dictionary query using a perfect hash function;

Fig. 3 is a flowchart of a large-scale dictionary storage method with a low false positive rate according to an embodiment of the application;

Fig. 4 is a flowchart of a large-scale dictionary query method with a low false positive rate according to an embodiment of the application;

Fig. 5 is a flowchart of the construction of a Bloom filter index according to an embodiment of the application;

Fig. 6 is a flowchart of the construction of a perfect hash index according to an embodiment of the application;

Fig. 7 is a schematic diagram of an ordered matching process according to an embodiment of the application;

Fig. 8 is a flowchart of determining whether a candidate translation exists according to an embodiment of the application;

Fig. 9 is a flowchart of a large-scale dictionary storage method with a low false positive rate according to an embodiment of the application;

Fig. 10 is a flowchart of a large-scale dictionary query method with a low false positive rate according to an embodiment of the application.

Embodiment

The main idea of the application is to combine the Bloom filter technique with the perfect hashing technique in both dictionary storage and dictionary lookup. For dictionary storage, the entries in the existing dictionary data are first arranged in a set format; the dictionary data is then read and a Bloom filter index is generated using the Bloom filter technique; finally, the dictionary data is read and a perfect hash index is generated using a perfect hash algorithm. For dictionary lookup, entries that are not in the dictionary are first filtered out by the Bloom filter, and only the remaining entries are looked up with the perfect hash function. The Bloom filter described in the application is a highly space-efficient probabilistic data structure that is commonly used to test whether an element belongs to a given set.

The invention is described in detail below with reference to the accompanying drawings.

The application proposes a large-scale dictionary storage method with a low false positive rate for compressing and storing the dictionary data required for machine translation. Fig. 3 is a flowchart of a large-scale dictionary storage method with a low false positive rate according to an embodiment of the application, which comprises the following steps:

S301: the existing dictionary data is arranged in a set format. The existing dictionary data may be obtained from the user log of a website, which records the historical query terms of the users who have logged in to the website.

The arrangement may use a bigram language model, as shown in Table 1:

supply^camera 7000	supply^PVC 9000
children^digital camera 3000	telescope^camera 2600
disposable^camera 3200	household^camera 3800
……

Table 1

In Table 1, "supply^camera", "supply^PVC", and so on denote entries in the existing dictionary; the "^" symbol indicates that the entry is in bigram form, i.e., consists of two words. The numbers "7000", "9000", and so on denote how many times the corresponding entry occurs in the dictionary. For example, Table 1 shows that the entry consisting of "supply" and "camera" occurs 7000 times. Other arrangements may also be used, such as a trigram or 4-gram language model.
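As a rough illustration of the set format, the sketch below parses lines of the form `word^word count` into a word tuple and an integer count. The exact file layout is not specified in the text, so the one-entry-per-line layout, the whitespace handling, and the function names are assumptions.

```python
def parse_entry(line: str):
    """Split 'word^word count' into a tuple of words and an integer count."""
    term, count = line.rsplit(maxsplit=1)          # the count is the last field
    words = tuple(w.strip() for w in term.split("^"))
    return words, int(count)

def load_dictionary(lines):
    """Build {words: count} from an iterable of lines, skipping blank lines."""
    return dict(parse_entry(line) for line in lines if line.strip())

sample = [
    "supply^camera 7000",
    "supply^PVC 9000",
    "children^digital camera 3000",
]
dictionary = load_dictionary(sample)
```

Because the words are split on "^", the same parser also accepts trigram or 4-gram entries without change.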

S303: the dictionary data is read, and a Bloom filter index is generated using the Bloom filter technique and stored.

The basic idea of generating a Bloom filter index is to map each entry in the dictionary into an array with a series of mutually independent hash functions and to set a flag at each of the resulting positions. Fig. 5 is a flowchart of the construction of a Bloom filter index according to an embodiment of the application.

S501: an array is set up in the storage of the computer. This array is generally placed in main memory.

S503: the size of the array is determined from the scale of the existing dictionary and the expected false positive rate of the Bloom filter index.

S505: the number of hash functions used by the Bloom filter index is determined from the scale of the existing dictionary and the size of the array.

S507: several mutually independent hash functions are applied to each entry in the existing dictionary and the results are mapped onto the array. The hash functions used here may be chosen at random.

S509: a flag is set at each array position that receives a hash value, and the Bloom filter index is generated.

For ease of understanding, pseudocode for building the Bloom filter index is given below.

Notation:

Hash function family: h_i(x) → [0, M₁), where 0 ≤ i < K₁

A_B: bit index array of size 1 × M₁, i.e., M₁ units of 1 bit each

S: set of n-grams, <n-gram_j, v(n-gram_j)>, with

0 \leq v(n\text{-gram}_{j}) < 2^{N_{1}}

Bloom filter index construction pseudocode (// denotes a comment):

Input: S, hash function family

Output: index array A_B

// initialize A_B
For i ∈ [0, M₁) do
    A_B[i] = 0
End For

// set the bit at every hashed position of every entry
For i ∈ [0, |S|) do
    For j ∈ [0, K₁) do
        A_B[h_j(n-gram_i)] = 1
    End For
End For

In the pseudocode above, M₁ denotes the size of the Bloom filter index, S denotes the set of entries in the dictionary, and n-gram, a language model commonly used in natural language processing dictionaries, here refers to an entry in S. The pseudocode sets the bits of the index array A_B at the positions that receive a hash value and leaves the other bits cleared. The resulting index structure is shown in Table 2:

Number of values	1	0	1	0	1	0	1	0	……

Table 2

The number of values is the number of values in the Bloom filter index and equals M₁. In the series of values that follows, "1" indicates that the bit is set and "0" that it is not.
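The construction pseudocode can be sketched in Python as follows. The hash family is an assumption: the text only requires K₁ independent functions h_j(x) → [0, M₁), so salted BLAKE2b digests from the standard library are used here as a stand-in, and one byte per flag is used instead of one bit to keep the sketch readable.

```python
import hashlib

def positions(entry: str, k: int, m: int):
    """Positions h_0(entry) .. h_{k-1}(entry) in [0, m).

    Salted BLAKE2b is an assumed stand-in for the hash family of the text.
    """
    data = entry.encode("utf-8")
    return [
        int.from_bytes(
            hashlib.blake2b(data, digest_size=8,
                            salt=j.to_bytes(8, "little")).digest(),
            "little") % m
        for j in range(k)
    ]

def build_bloom_index(entries, m: int, k: int) -> bytearray:
    """Return the index array A_B with a 1 at every hashed position."""
    a_b = bytearray(m)                 # initialised to all zeros
    for entry in entries:
        for pos in positions(entry, k, m):
            a_b[pos] = 1
    return a_b

entries = ["supply^camera", "supply^PVC", "telescope^camera"]
a_b = build_bloom_index(entries, m=64, k=3)
```

A production index would pack eight flags per byte so that each unit really occupies one bit, as the notation above specifies.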

In practice, the number K₁ of hash functions and the expected false positive rate p of the Bloom filter index must be determined before the index is built. With

K_{1} = \ln 2 \cdot M_{1} / |S|

and

p = (1 - (1 - 1/m)^{kn})^{k} \approx (1 - e^{-kn/m})^{k}

the array size of the Bloom filter index can be derived as

M_{1} = \log_{2}(1/p) / \ln 2 \cdot |S|

In the expression for the expected false positive rate p, k is the number of hash functions, n is the number of entries in the dictionary, and m is the number of bits in the array. Substituting k = \ln 2 \cdot m / n into the approximation gives p ≈ (1/2)^k, from which the expression for M₁ follows. The scale |S| of the existing dictionary can be counted when the dictionary is stored in the computer, so M₁ can be adjusted at any time according to the expected false positive rate p. Once the array size and the error rate of the Bloom filter index have been determined, the index can be built.
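The sizing step can be worked through numerically. The sketch below assumes the standard Bloom filter relations m = n · log₂(1/p) / ln 2 and k = ln 2 · m / n, with n = |S|; the function names and the sample figures (100000 entries, 1% target rate) are illustrative only.

```python
import math

def bloom_parameters(n: int, p: float):
    """Return (m, k): array size in bits and number of hash functions."""
    m = math.ceil(n * math.log2(1 / p) / math.log(2))
    k = max(1, round(math.log(2) * m / n))
    return m, k

def expected_false_positive_rate(n: int, m: int, k: int) -> float:
    """Approximate rate (1 - e^{-kn/m})^k for n entries, m bits, k functions."""
    return (1 - math.exp(-k * n / m)) ** k

n_entries = 100_000
m_bits, k_funcs = bloom_parameters(n_entries, 0.01)
achieved = expected_false_positive_rate(n_entries, m_bits, k_funcs)
```

For these figures the array needs a little under ten bits per entry and seven hash functions, and the achieved rate stays close to the 1% target even though k is rounded to an integer.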

S305: the dictionary data is read and a perfect hash index is generated using a perfect hash algorithm, where the perfect hash index comprises the number of hash functions used and the parameters of each corresponding hash function.

After the Bloom filter index has been built, the perfect hash index must also be built. Fig. 6 is a flowchart of the construction of a perfect hash index according to an embodiment of the application, which comprises the following steps:

S601: ordered matching is performed on the entries in the existing dictionary so that every entry has a corresponding hash function, and each entry together with the parameters of its corresponding hash function is stored as one element.

S603: the elements are processed according to the order in which they were stored, and the perfect hash index is generated.

The purpose of ordered matching is to find several hash functions such that, for each entry n-gram_j (0 ≤ j < |S|) in the entry set S, some corresponding hash function hash_i (0 ≤ i < K) computes the hash value hash_i(n-gram_j). Different entries n-gram_j may correspond to different functions hash_i; for example, n-gram₁ may correspond to hash₂ and n-gram₂ to hash₁. The requirement is that these |S| hash values are all unique.

At the same time, the order in which the n-gram_j are processed and the hash function corresponding to each n-gram_j are saved in matched. Here matched is a queue that stores, in sequence, a record of each entry and its corresponding hash function. If ordered matching fails, the hash functions are replaced and the attempt is repeated until a set of hash functions satisfying the condition is found. Practice shows that the probability of failure is small when M ≥ 1.23|S|.

For ease of understanding, pseudocode for the ordered matching process is given below (// denotes a comment):

Input: set S of n-grams; hash function family h_i(x) → [0, M), where 0 ≤ i < K

Output: ordered matching set matched, or failure

// initialize matched to empty
matched ⇐ Φ

// initialize every element of r2l to empty
For i ∈ [0, M) do
    r2l_i ⇐ Φ
End For

// collect statistics
For i ∈ [0, |S|) do
    l2r_i ⇐ Φ
    For j ∈ [0, K) do
        l2r_i ⇐ l2r_i ∪ h_j(n-gram_i)
        r2l_{h_j(n-gram_i)} ⇐ r2l_{h_j(n-gram_i)} ∪ i
    End For
End For

// extract the positions i whose r2l_i has exactly one element, i.e., in-degree 1
degree_one ⇐ {i | 0 ≤ i < M, |r2l_i| = 1}

While (|degree_one| ≥ 1) do
    rhs ⇐ POP(degree_one)
    lhs ⇐ POP(r2l_rhs)
    PUSH (lhs, rhs) onto matched
    For all rhs′ ∈ l2r_lhs do
        POP(r2l_rhs′)    // remove lhs from r2l_rhs′
        If (|r2l_rhs′| == 1) then
            degree_one ⇐ degree_one ∪ rhs′
        End If
    End For
End While

If |matched| == |S| then
    Return matched
Else
    Return false
End If

Ordered matching is further described below with reference to Fig. 7, a schematic diagram of an ordered matching process according to an embodiment of the application. Here a, b, c, and d denote entries in the dictionary, each processed by three hash algorithms (the lines connected to a, b, c, and d). As in (c, 5) of Fig. 7, the hash algorithms map the entries a, b, c, and d to the values 1, 2, 3, 4, 5, 6, and 7. The values 4 and 5 each have an in-degree of one (the in-degree is the number of lines attached to a node, here the lines attached to 4 and 5), which shows that the hash values 4 and 5 are unique, so the corresponding hash function and entry satisfy the requirement of ordered matching. During ordered matching, one of these two nodes can therefore be picked at random, stored, and deleted; the lines connected to the deleted node are removed, and the match is recorded. In this embodiment node 5 is deleted, and entry c, hash value 5, and the corresponding hash function are recorded in matched. The state after node 5 is deleted is shown in (d, 4) of Fig. 7. Likewise, removing node 4 gives (b, 2) of Fig. 7, and entry d, hash value 4, and the corresponding hash function are recorded. Nodes 2 and 1 are then removed and recorded in the same way, which completes the ordered matching.
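The peeling procedure described above can be sketched in Python. This is an illustration under assumptions rather than the patented implementation: the hash family is salted BLAKE2b from the standard library, sets replace the POP operations so that duplicate positions are handled, and `find_matching` is a hypothetical helper that replaces the hash functions (by changing the seed) whenever matching fails.

```python
import hashlib

def hashes(entry: str, k: int, m: int, seed: int = 0):
    """k positions in [0, m) for entry; salted BLAKE2b is an assumed hash family."""
    data = entry.encode("utf-8")
    out = []
    for j in range(k):
        salt = (seed * k + j).to_bytes(8, "little")
        digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
        out.append(int.from_bytes(digest, "little") % m)
    return out

def ordered_matching(entries, k: int, m: int, seed: int = 0):
    """Return [(entry index, slot), ...] in peeling order, or None on failure."""
    l2r = [hashes(e, k, m, seed) for e in entries]   # entry -> its slots
    r2l = [set() for _ in range(m)]                  # slot -> entries hashing to it
    for i, slots in enumerate(l2r):
        for slot in slots:
            r2l[slot].add(i)
    degree_one = [s for s in range(m) if len(r2l[s]) == 1]
    matched = []
    while degree_one:
        rhs = degree_one.pop()
        if len(r2l[rhs]) != 1:        # stale: the slot lost its last entry meanwhile
            continue
        lhs = r2l[rhs].pop()
        matched.append((lhs, rhs))
        for other in l2r[lhs]:        # remove the matched entry from the graph
            r2l[other].discard(lhs)
            if len(r2l[other]) == 1:
                degree_one.append(other)
    return matched if len(matched) == len(entries) else None

def find_matching(entries, k: int, m: int, max_seed: int = 1000):
    """Retry with different hash functions (seeds) until matching succeeds."""
    for seed in range(max_seed):
        matched = ordered_matching(entries, k, m, seed)
        if matched is not None:
            return seed, matched
    raise RuntimeError("no matching found")

entries = ["a", "b", "c", "d"]
seed, matched = find_matching(entries, k=3, m=7)
```

The example mirrors the shape of Fig. 7 (four entries, three hash functions, seven slots), although the concrete hash values will differ from those in the figure.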

Each element stored in matched contains an entry and the parameters of its corresponding hash function. After ordered matching is complete, the elements in matched are processed in first-in, last-out order to obtain the array A of the perfect hash index. The processing may follow the formula:

A[h_{r(j)}(n\text{-gram}_{j})] = v(n\text{-gram}_{j}) \otimes \bigotimes_{i = 0,\, i \neq r(j)}^{K_{2} - 1} A[h_{i}(n\text{-gram}_{j})]

That is, the perfect hash index comprises two parts, a hash function index and a value index, as shown in Table 3:

Table 3

The hash function parameters in Table 3 generally have two components: the function type and the function seed. The function type specifies the hash algorithm, such as the hash algorithm of Justin Sobel or the hash algorithm of Peter J. Weinberger. The function seed is the base value used by the hash algorithm. These two parameters suffice to restore a hash function, and they avoid the cost of storing the hash function itself.

For example, the hash algorithm of Justin Sobel is as follows:

unsigned int CJSHash::hash(const char* key, size_t keyLen)
{
    // start from the stored seed so the function can be restored from its parameters
    unsigned int hash = m_seed;
    for (std::size_t i = 0; i < keyLen; i++)
    {
        hash ^= ((hash << 5) + key[i] + (hash >> 2));
    }
    return hash;
}

Its seed is the m_seed in the program.

The ordered matching used in the application needs only a few hash functions (the embodiment of Fig. 7 uses only three), so the hash function parameters stored in the perfect hash index occupy very little storage space.
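To illustrate how a (function type, function seed) pair is enough to restore a hash function, the sketch below ports the JS hash above to Python with explicit 32-bit wrapping and looks it up in a small registry. The registry, the type name "js", and `restore_hash` are hypothetical; the layout of the stored parameters is not specified in the text.

```python
from functools import partial

def js_hash(key: bytes, seed: int) -> int:
    """Port of the JS hash with 32-bit unsigned arithmetic.

    Bytes are treated as unsigned, so results match the C++ version for
    ASCII input; bytes above 127 may differ where char is signed.
    """
    h = seed & 0xFFFFFFFF
    for byte in key:
        h ^= ((h << 5) + byte + (h >> 2)) & 0xFFFFFFFF
    return h

# Hypothetical registry: function type -> implementation.
HASH_TYPES = {"js": js_hash}

def restore_hash(function_type: str, seed: int):
    """Rebuild a one-argument hash function from its stored parameters."""
    return partial(HASH_TYPES[function_type], seed=seed)

stored_parameters = ("js", 0)          # what the hash function index would keep
h = restore_hash(*stored_parameters)
value = h(b"ab")
```

Only the short type tag and the integer seed need to be stored per hash function, which is why the hash function index stays small.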

The value index records the number of values and the specific value of each. Each specific value is determined according to

A[h_{r(j)}(n\text{-gram}_{j})] = v(n\text{-gram}_{j}) \otimes \bigotimes_{i = 0,\, i \neq r(j)}^{K_{2} - 1} A[h_{i}(n\text{-gram}_{j})]

In practice, the size M₂ of the perfect hash index is set slightly larger than the scale |S| of the dictionary. The closer M₂ is to |S|, the more space the perfect hash index saves, but the harder it becomes to find a suitable hash function family. Generally M₂ = 1.1|S| is used.
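The value-index formula can be illustrated with a small worked example. The sketch assumes that the ⊗ operator denotes bitwise XOR, that the hash positions of four entries have already been computed (the `POSITIONS` table is hypothetical, not taken from Fig. 7), and that ordered matching produced the order in `MATCHED`.

```python
from functools import reduce
from operator import xor

# Hypothetical precomputed positions h_0..h_2 of four entries in an array of size 8.
POSITIONS = {"a": (1, 2, 3), "b": (2, 6, 7), "c": (3, 5, 6), "d": (4, 6, 7)}
VALUES = {"a": 7000, "b": 9000, "c": 3000, "d": 2600}
# Ordered matching result: (entry, matched slot) in the order the entries were removed.
MATCHED = [("c", 5), ("d", 4), ("b", 7), ("a", 1)]

def build_value_index(matched, positions, values, m):
    """Fill array A in first-in, last-out order, assuming the operator is XOR."""
    a = [0] * m
    for entry, slot in reversed(matched):
        others = set(positions[entry]) - {slot}
        # A[matched slot] = v XOR (XOR of A at the entry's other slots)
        a[slot] = reduce(xor, (a[p] for p in others), values[entry])
    return a

def lookup(a, entry_positions):
    """Recover the value as the XOR of A over the entry's distinct slots."""
    return reduce(xor, (a[p] for p in set(entry_positions)), 0)

A = build_value_index(MATCHED, POSITIONS, VALUES, m=8)
recovered = {entry: lookup(A, POSITIONS[entry]) for entry in POSITIONS}
```

Processing in reverse matching order works because a slot matched to one entry is never touched by entries that were matched after it, so values written earlier are not disturbed. A lookup for positions that belong to no stored entry still returns some number, which is why the Bloom filter check comes first.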

For the large-scale dictionary storage method with a low false positive rate described above, the application also proposes a large-scale dictionary query method with a low false positive rate. When the computer performs language translation, it uses the dictionary built by the storage method to look up the entry to be translated. As shown in Fig. 4, the method comprises the following steps:

S401: several candidate translations of the entry to be translated are obtained. This can be done with existing computer grammar conversion techniques. For example, an entry meaning "to like" may be translated into English as "like" or "love"; "like" and "love" are then the candidate translations of that entry.

S403: the Bloom filter index is queried to determine whether each candidate translation exists.

As described above, when the Bloom filter index is generated, a flag is set at each array position that receives a hash value. Determining whether a candidate translation is in the dictionary is therefore equivalent to determining whether the array positions to which the hash functions map the candidate translation are all flagged. As shown in Fig. 8, determining whether a candidate translation exists comprises the following steps:

S801: the candidate translation is processed by the same hash functions that were used to build the index, and the results are mapped onto the array of the Bloom filter index.

S803: it is determined whether all array positions that receive a hash value are flagged. If one or more of these positions is not flagged, the candidate translation does not exist, and the computer reports that it does not exist.

S805: if all of the positions are flagged, the candidate translation is reported as existing.

For ease of understanding, pseudocode for judging whether a candidate translation exists by querying the Bloomfilter index is given below:

Input: n-gram
Output: whether the n-gram exists

For j ∈ [0, K₁) do
    If (A_B[h_j(n-gram)] == 0) then
        Return "does not exist"
    End If
End For
Return "exists"
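The pseudocode above can be turned into a small runnable sketch. The class name BloomFilter, the use of std::vector<bool> for the array A_B, and the choice of K₁ hash functions as the Justin Sobel hash seeded with j + 1 are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

class BloomFilter {
public:
    // bits must be greater than zero; num_hashes plays the role of K1.
    BloomFilter(std::size_t bits, std::size_t num_hashes)
        : bits_(bits, false), num_hashes_(num_hashes) {}

    // Index construction: set the flag at every mapped position.
    void insert(const std::string& key) {
        for (std::size_t j = 0; j < num_hashes_; ++j) {
            bits_[position(key, j)] = true;
        }
    }

    // Query: a single unset flag proves that the key was never inserted.
    bool contains(const std::string& key) const {
        for (std::size_t j = 0; j < num_hashes_; ++j) {
            if (!bits_[position(key, j)]) {
                return false;
            }
        }
        return true;  // all flags set: present, or a false drop
    }

private:
    // h_j: Justin Sobel's hash seeded with j + 1 (an assumption of this sketch).
    std::size_t position(const std::string& key, std::size_t j) const {
        std::uint32_t hash = static_cast<std::uint32_t>(j + 1);
        for (std::size_t i = 0; i < key.size(); ++i) {
            hash ^= (hash << 5) + static_cast<unsigned char>(key[i]) + (hash >> 2);
        }
        return hash % bits_.size();
    }

    std::vector<bool> bits_;
    std::size_t num_hashes_;
};
```

An inserted key is always reported as present, while a key that was never inserted may occasionally be reported as present; that residual case is the false drop the application aims to keep rare.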

S405: Query the perfect hash index, and compute the score of each existing candidate translation with the hash functions obtained from the perfect hash index.

When querying the perfect hash index, first read the hash function index part to obtain the hash function parameters and rebuild the hash functions, then use these hash functions to compute the score g(n-gram) of each existing candidate translation:

$$g(n\text{-gram}) = \bigotimes_{i=0}^{K_2-1} A[h_i(n\text{-gram})]$$

Here, the score of a candidate translation is the number of times that candidate occurs in the existing dictionary.

S407: Take the candidate translation with the highest score as the translation result of the entry to be translated.

Take the translation of a sentence meaning "I like eating apple" as an example. Suppose the candidate translations that the computer obtains for this sentence by existing grammar conversion techniques are "I like eating apple" and "I love eating apple". To determine the correct translation, the Bloomfilter index is queried for each of the two candidates. Suppose both candidates are found to exist in the dictionary. The perfect hash index is then queried for each of them and the hash function calculation is performed. If the resulting score of "I like eating apple" is 1000 and the score of "I love eating apple" is 10, then "I like eating apple" occurs 1000 times in the dictionary and "I love eating apple" occurs 10 times. Since 1000 > 10, the translation result is "I like eating apple".
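Steps S403 to S407 can be sketched as a single selection routine. This is a minimal sketch: best_candidate is a hypothetical name, and the two callables stand in for the Bloomfilter index query and the perfect hash index query, which are not implemented here.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Returns the existing candidate with the highest score, or an empty string
// if no candidate passes the existence check.
inline std::string best_candidate(
    const std::vector<std::string>& candidates,
    const std::function<bool(const std::string&)>& exists,
    const std::function<std::uint32_t(const std::string&)>& score) {
    std::string best;
    std::uint32_t best_score = 0;
    bool found = false;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        const std::string& c = candidates[i];
        if (!exists(c)) {
            continue;  // S403: filtered out by the Bloomfilter index
        }
        const std::uint32_t s = score(c);  // S405: score from the perfect hash index
        if (!found || s > best_score) {    // S407: keep the highest score
            best = c;
            best_score = s;
            found = true;
        }
    }
    return best;
}
```

With scores of 1000 for "I like eating apple" and 10 for "I love eating apple", the routine returns "I like eating apple", matching the example above.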

The idea proposed in the application of combining a Bloomfilter index with a perfect hash index applies not only to querying entry occurrence counts during language translation in the machine translation field, but also to dictionary queries over (entry, value) pairs in any other field. The application therefore further proposes a large-scale dictionary storage and query method with a low false drop rate, for compressing existing dictionary data and querying entries. It is divided into a storage process and a query process.

Refer to Fig. 9, which is a flowchart of a large-scale dictionary storage method with a low false drop rate according to an embodiment of the application.

S901: Arrange the entries in the existing dictionary data according to a set format. The arrangement can be chosen according to actual needs and is not limited to ordering by the number of times an entry occurs in the dictionary. For example, entries can be arranged by entry code or by entry length.

S903: Read the dictionary data and generate the Bloomfilter index using the Bloomfilter technique. The basic idea is to map each entry in the dictionary into an array with a series of mutually independent hash functions and to set a flag at each mapped position, usually by setting the bit to 1. This process is the same as the Bloomfilter index construction described above and is not repeated here.

S905: Read the dictionary data and generate the perfect hash index using a perfect hash algorithm, where the perfect hash index contains the number of hash functions used and the parameters of each hash function.

When building the perfect hash index, first match the entries in the existing dictionary in order so that each entry has a corresponding hash function, and store each entry together with the parameters of its corresponding hash function as one element. Then process the elements according to the order in which they were stored, and generate the perfect hash index.

Here, the value that the hash functions of the index compute for an entry in the dictionary can be the number of times the entry occurs in the dictionary, or any other integer value.

Refer to Fig. 10, which is a flowchart of a large-scale dictionary query method with a low false drop rate according to an embodiment of the application.

S1001: Query the Bloomfilter index and judge whether the entry to be queried exists.

When querying the Bloomfilter index, the entry is mapped onto the array of the Bloomfilter index using the same hash functions that were used when the index was built, and it is judged whether all array positions to which hash values are mapped carry a flag. If all of them carry a flag, the entry is taken to be present in the dictionary. This step is the same as the Bloomfilter index query described above and is not repeated here.

S1003: If the entry exists, query the perfect hash index to obtain the value of the entry.

When querying the perfect hash index, first read the parameters of the hash functions corresponding to the entry to be queried from the perfect hash index. Next, restore the hash functions from these parameters. Finally, obtain the value of the entry to be queried by applying the corresponding hash functions to it.
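The three query steps (read the stored parameters, restore the hash functions, compute the value) can be sketched as follows. This is a minimal sketch under several assumptions: only seeds are stored because every function is taken to be the Justin Sobel hash, the per-position values are combined with bitwise XOR, and the names SeededIndex and lookup_value are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical index: one stored seed per hash function plus the value array.
struct SeededIndex {
    std::vector<std::uint32_t> seeds;   // hash function index (seeds only)
    std::vector<std::uint32_t> values;  // value index, must be non-empty
};

// Justin Sobel's hash with an explicit seed.
inline std::uint32_t js_hash_seeded(const std::string& key, std::uint32_t seed) {
    std::uint32_t hash = seed;
    for (std::size_t i = 0; i < key.size(); ++i) {
        hash ^= (hash << 5) + static_cast<unsigned char>(key[i]) + (hash >> 2);
    }
    return hash;
}

// Restore each hash function from its seed, map the entry to one position
// per function, and combine the cells found there (XOR is assumed).
inline std::uint32_t lookup_value(const SeededIndex& index, const std::string& entry) {
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < index.seeds.size(); ++i) {
        const std::size_t pos = js_hash_seeded(entry, index.seeds[i]) % index.values.size();
        value ^= index.values[pos];
    }
    return value;
}
```

Note that lookup_value returns some number for any input string, including entries that were never stored, which is why the Bloomfilter check in S1001 is performed first.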

A false drop can occur only when the entry being queried is not in the dictionary; if the entry is in the dictionary, no false drop is possible. The purpose of the application is to reduce the likelihood of such false drops as far as possible. For example, when the entry "digital camera" is queried and filtered with the Bloomfilter technique, if every hash function in the Bloomfilter index yields A_B[h_j(digital camera)] == 1, the entry "digital camera" is taken to be in the dictionary and proceeds to the perfect hash computation. If the value obtained from any hash function in the Bloomfilter index is not 1, the entry "digital camera" is not in the dictionary, and the perfect hash computation is not entered.

Compared with the prior art, the application has the following advantages. Before the dictionary is consulted with the perfect hash technique, the Bloomfilter technique is first used to judge whether the entry itself is present in the dictionary. Judging existence from the entry itself replaces the prior-art approach of judging existence from the value range, which reduces the false drop rate of dictionary queries and improves the query results. In addition, because the hash functions stored in the perfect hash index of the application take up very little space, the dictionary storage is compressed to a great extent and computer storage space is saved.

The above discloses only several specific embodiments of the invention, but the invention is not limited to them; any variation that a person skilled in the art can conceive shall fall within the protection scope of the invention.

Claims

1. A large-scale dictionary storage method with a low false drop rate, for compressing and storing the dictionary data required for machine translation, characterized by comprising the following steps:

arranging the existing dictionary data required for machine translation according to a set format;

reading the dictionary data and generating a Bloomfilter index using the Bloomfilter technique;

reading the dictionary data and generating a perfect hash index using a perfect hash algorithm.

2. The large-scale dictionary storage method with a low false drop rate according to claim 1, characterized in that, when the perfect hash index is generated, the number of hash functions used and the parameters of each hash function are stored.

3. The large-scale dictionary storage method with a low false drop rate according to claim 1, characterized in that generating the Bloomfilter index comprises the following steps:

providing an array in the memory of the computer;

mapping each entry in the dictionary data onto the array by computing it with several mutually independent hash functions;

setting a flag at each array position to which a hash value is mapped, and generating the Bloomfilter index.

4. The large-scale dictionary storage method with a low false drop rate according to claim 3, characterized in that setting a flag at each array position to which a hash value is mapped is specifically: setting to 1 each array position to which a hash value is mapped.

5. The large-scale dictionary storage method with a low false drop rate according to claim 3, characterized in that, before each entry in the existing dictionary data is mapped onto the array using the hash functions, the method further comprises the following step:

determining the size of the array from the scale of the dictionary data and the expected false drop rate of the Bloomfilter index.

6. The large-scale dictionary storage method with a low false drop rate according to claim 5, characterized in that, after the size of the array is determined, the method further comprises the step of:

determining the number of hash functions used by the Bloomfilter index from the scale of the dictionary data and the size of the array.

7. The large-scale dictionary storage method with a low false drop rate according to claim 1, characterized in that generating the perfect hash index comprises the following steps:

matching the entries in the dictionary data in order so that each entry has a corresponding hash function, and storing each entry together with the parameters of its corresponding hash function as one element;

processing the elements according to the order in which they were stored, and generating the perfect hash index.

8. The large-scale dictionary storage method with a low false drop rate according to claim 7, characterized in that, in the ordered matching process, no more than ten hash functions are used to match the dictionary in order.

9. A method for querying a dictionary formed by the large-scale dictionary storage method with a low false drop rate according to any one of claims 1 to 8, for querying an entry to be translated when a computer performs language translation, characterized by comprising the following steps:

obtaining several candidate translations of the entry to be translated;

querying the Bloomfilter index and judging whether each candidate translation exists;

querying the perfect hash index, and computing the score of each existing candidate translation with the hash functions obtained from the perfect hash index;

taking the candidate translation with the highest score as the translation result of the entry to be translated.

10. The large-scale dictionary query method with a low false drop rate according to claim 9, characterized in that computing the score of a candidate translation comprises the following steps:

reading the parameters of the hash functions corresponding to the candidate translation from the perfect hash index;

restoring the hash functions from the parameters of the hash functions;

computing the candidate translation with the corresponding hash functions to obtain the score of each candidate translation.

11. The large-scale dictionary query method with a low false drop rate according to claim 9, characterized in that judging whether a candidate translation exists comprises the following steps:

mapping the candidate translation onto the array using the same hash functions that were used to generate the Bloomfilter index;

judging whether all array positions to which hash values are mapped carry the flag;

if all of them carry the flag, returning that the candidate translation exists.

12. A large-scale dictionary storage method with a low false drop rate, for compressing and storing existing dictionary data, characterized by comprising the following steps:

arranging the entries in the existing dictionary data according to a set format;

13. The large-scale dictionary storage method with a low false drop rate according to claim 12, characterized in that, when the perfect hash index is generated, the number of hash functions used and the parameters of each hash function are stored.

14. The large-scale dictionary storage method with a low false drop rate according to claim 12, characterized in that generating the Bloomfilter index comprises the following steps:

providing an array in the memory of the computer;

15. The large-scale dictionary storage method with a low false drop rate according to claim 14, characterized in that setting a flag at each array position to which a hash value is mapped is specifically: setting to 1 each array position to which a hash value is mapped.

16. The large-scale dictionary storage method with a low false drop rate according to claim 14, characterized in that, before each entry in the existing dictionary data is mapped onto the array using the hash functions, the method further comprises the following step:

17. The large-scale dictionary storage method with a low false drop rate according to claim 16, characterized in that, after the size of the array is determined, the method further comprises the step of:

18. The large-scale dictionary storage method with a low false drop rate according to claim 12, characterized in that generating the perfect hash index comprises the following steps:

19. The large-scale dictionary storage method with a low false drop rate according to claim 18, characterized in that, in the ordered matching process, no more than ten hash functions are used to match the dictionary in order.

20. A method for querying a dictionary formed by the large-scale dictionary storage method with a low false drop rate according to any one of claims 12 to 19, characterized by comprising the following steps:

querying the Bloomfilter index and judging whether the entry to be queried exists;

if the entry exists, querying the perfect hash index to obtain the value of the entry.

21. The large-scale dictionary query method with a low false drop rate according to claim 20, characterized in that querying the Bloomfilter index comprises the following steps:

mapping the entry to be queried onto the array using the same hash functions that were used to generate the Bloomfilter index;

if all mapped positions carry the flag, returning that the entry to be queried exists.

22. The large-scale dictionary query method with a low false drop rate according to claim 20, characterized in that querying the perfect hash index comprises the following steps:

reading the parameters of the hash functions corresponding to the entry to be queried from the perfect hash index;

restoring the hash functions from the parameters of the hash functions;

obtaining the hash value of the entry to be queried by computing it with the corresponding hash functions.