Summary of the Invention
The object of the present invention is to provide a language model training method and system that implement a distributed version of the Katz algorithm, so that a language model can be trained efficiently on massive data while the data-sparseness problem is effectively alleviated, thereby supporting higher recognition accuracy in speech recognition.
To address the above problem, the present invention provides a language model training method, comprising:
performing one round of MapReduce on a corpus to count the word-frequency statistics of N-grams, where N is a positive integer greater than or equal to 2;
performing one round of MapReduce on the word-frequency statistics of the N-grams to obtain the count-of-counts (COC) statistics of the N-grams;
performing one round of MapReduce on the word-frequency statistics of the N-grams to obtain the probability values of the N-grams;
performing multiple rounds of MapReduce to compute the backoff coefficients of the unigrams through the m-grams respectively, where m = N-1; and
aggregating the probability values and backoff coefficients to obtain a language model in ARPA format.
Optionally, in the method, the step of performing one round of MapReduce on the corpus to count the word-frequency statistics of the N-grams comprises:
a Map step that emits each N-gram with its first word as the key; and
a shuffle step that distributes the N-grams with different keys to different Reduce nodes, where each Reduce node counts the number of occurrences of each N-gram it receives as the word-frequency statistic of that N-gram.
Optionally, in the method, the step of performing one round of MapReduce on the word-frequency statistics of the N-grams to obtain the COC statistics of the N-grams comprises:
presetting a threshold K for discount calculation, where K is a positive integer;
a Map step that emits each N-gram word-frequency statistic that is less than or equal to K, with the frequency as the key;
a shuffle step that distributes the word-frequency statistics with different keys to different Reduce nodes; and
aggregating the word-frequency statistics to obtain the COC statistics of the N-grams.
Optionally, in the method, the step of performing one round of MapReduce on the word-frequency statistics of the N-grams to obtain the probability values of the unigrams through the N-grams comprises:
a Map step that emits each N-gram word-frequency statistic with the first word of the corresponding N-gram as the key;
a shuffle step that distributes the N-gram frequencies with different keys to different Reduce nodes;
loading the COC statistics on every Reduce node and computing the discount coefficients for the corresponding N-gram word-frequency statistics; and
computing the probability values of the unigrams through the N-grams according to the discount coefficients.
Optionally, in the method, the step of performing multiple rounds of MapReduce to compute the backoff coefficients of the unigrams through the m-grams respectively comprises:
performing one round of MapReduce to compute the backoff coefficients of the unigrams; and
performing multiple rounds of MapReduce to compute the backoff coefficients of the bigrams through the m-grams respectively.
Optionally, in the method, the step of performing one round of MapReduce to compute the backoff coefficients of the unigrams comprises:
distributing the data of all unigrams and bigrams across the Reduce nodes;
emitting each unigram or bigram with its first word as the key;
a shuffle step that distributes the unigrams and bigrams with different keys to different Reduce nodes; and
computing, on each node, the backoff coefficients of the unigrams corresponding to the unigram and bigram data on that node, using the Katz smoothing algorithm and the word-frequency statistics of the unigrams and bigrams.
Optionally, in the method, the step of computing the backoff coefficients of the bigrams through the m-grams respectively comprises:
distributing the data of all m-grams and (m+1)-grams across the nodes;
emitting each m-gram or (m+1)-gram with its second-to-last word as the key;
a shuffle step that distributes the m-grams and (m+1)-grams with different keys to different nodes; and
computing, on each node, the backoff coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on that node, according to the word-frequency statistics of the m-grams and (m+1)-grams.
According to another aspect of the present invention, a language model training system is provided, comprising:
a word-frequency module, configured to perform one round of MapReduce on a corpus to count the word-frequency statistics of N-grams, where N is a positive integer greater than or equal to 2;
a COC module, configured to perform one round of MapReduce on the word-frequency statistics of the N-grams to obtain the COC statistics of the N-grams;
a probability module, configured to perform one round of MapReduce on the word-frequency statistics of the N-grams to obtain the probability values of the N-grams;
a backoff-coefficient module, configured to perform multiple rounds of MapReduce to compute the backoff coefficients of the unigrams through the m-grams respectively, where m = N-1; and
an aggregation module, configured to aggregate the probability values and backoff coefficients into a language model in ARPA format.
Optionally, in the system, the word-frequency module performs a Map step that emits each N-gram with its first word as the key; the shuffle step distributes the N-grams with different keys to different Reduce nodes; and each Reduce node counts the number of occurrences of each N-gram it receives as the word-frequency statistic of that N-gram.
Optionally, in the system, the COC module presets a threshold K for discount calculation, where K is a positive integer; the Map step emits each N-gram word-frequency statistic that is less than or equal to K, with the frequency as the key; the shuffle step distributes the word-frequency statistics with different keys to different Reduce nodes; and the word-frequency statistics are aggregated to obtain the COC statistics of the N-grams.
Optionally, in the system, the probability module performs a Map step that emits each N-gram word-frequency statistic with the first word of the corresponding N-gram as the key; the shuffle step distributes the N-gram frequencies with different keys to different Reduce nodes; the COC statistics are loaded on every Reduce node to compute the discount coefficients for the corresponding N-gram word-frequency statistics; and the probability values of the unigrams through the N-grams are computed according to the discount coefficients.
Optionally, in the system, the backoff-coefficient module comprises:
a unigram unit, configured to perform one round of MapReduce to compute the backoff coefficients of the unigrams; and
a multi-gram unit, configured to perform multiple rounds of MapReduce to compute the backoff coefficients of the bigrams through the m-grams respectively.
Optionally, in the system, the unigram unit distributes the data of all unigrams and bigrams across the Reduce nodes; emits each unigram or bigram with its first word as the key; lets the shuffle step distribute the unigrams and bigrams with different keys to different Reduce nodes; and then, on each node, computes the backoff coefficients of the unigrams corresponding to the unigram and bigram data on that node, using the Katz smoothing algorithm and the word-frequency statistics of the unigrams and bigrams.
Optionally, in the system, the multi-gram unit distributes the data of all m-grams and (m+1)-grams across the nodes; emits each m-gram or (m+1)-gram with its second-to-last word as the key; lets the shuffle step distribute the m-grams and (m+1)-grams with different keys to different nodes; and then, on each node, computes the backoff coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on that node, according to the word-frequency statistics of the m-grams and (m+1)-grams.
Compared with the prior art, the present invention adopts a data structure based on hash prefix trees to split massive data and distribute it to the nodes of a cluster, where the relevant statistics are computed concurrently to obtain a language model based on massive data. This realizes a distributed version of the Katz algorithm that trains a language model efficiently on massive data while effectively alleviating the data-sparseness problem, thereby improving recognition accuracy.
Embodiments
The language model training method and system proposed by the present invention are further described below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 1 and Fig. 2, the present invention provides a language model training method that uses MapReduce as the tool for distributed cluster management and implements a distributed version of the Katz smoothing algorithm, comprising the following steps:
Step S1: perform one round of MapReduce (also written Map/Reduce) on the corpus to count the word-frequency statistics (WordCount) of the N-grams, where N is a positive integer greater than or equal to 2. The Map step emits each N-gram with its first word as the key, and the shuffle step assigns the N-grams with different keys to different Reduce nodes; for example, the unigram "we" and the bigram "we do-not" are assigned to the same Reduce node. Each Reduce node counts the occurrences of each N-gram it receives as that N-gram's word-frequency statistic. Specifically, MapReduce is a programming model: Google introduced its MapReduce system in a paper published in 2004; in 2005 the open-source web-crawler project Nutch implemented a MapReduce system, and in 2006 MapReduce became an independent open-source project. See "Hadoop: The Definitive Guide" by Tom White (Chinese translation by Zhou Minqi et al., Tsinghua University Press);
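As an illustration of this counting round, the following minimal Python sketch simulates the Map, shuffle, and Reduce steps of step S1 in a single process; the token glosses and the four-node cluster size are assumptions made for the example, not details from the patent.

```python
from collections import defaultdict

def map_ngrams(tokens, n_max=3):
    """Map step: emit (key, ngram) pairs, keyed by the n-gram's first word."""
    pairs = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n])
            pairs.append((ngram[0], ngram))  # first word is the partition key
    return pairs

def shuffle(pairs, num_nodes):
    """Shuffle step: route every n-gram with the same key to one Reduce node."""
    nodes = [[] for _ in range(num_nodes)]
    for key, ngram in pairs:
        nodes[hash(key) % num_nodes].append(ngram)
    return nodes

def reduce_count(ngrams):
    """Reduce step: count occurrences of each n-gram received by this node."""
    counts = defaultdict(int)
    for g in ngrams:
        counts[g] += 1
    return dict(counts)

corpus = [["today", "I", "want", "go-out"],
          ["today", "de", "weather", "not-bad"],
          ["I", "want", "go-out", "go", "wash-car"]]
pairs = [p for sent in corpus for p in map_ngrams(sent)]
node_counts = [reduce_count(node) for node in shuffle(pairs, num_nodes=4)]
merged = {g: c for counts in node_counts for g, c in counts.items()}
```

Because every n-gram sharing a first word lands on the same node, each per-node count is already global, so the node results can simply be unioned; this is the property the allocation scheme relies on.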
Step S2: perform one round of MapReduce on the word-frequency statistics of the N-grams to obtain the COC statistics (count of counts, i.e. the number of N-grams sharing each word frequency). A threshold K for discount calculation is preset, where K is a positive integer; the Map step emits each N-gram word-frequency statistic that is less than or equal to K, with the frequency as the key; the shuffle step distributes the word-frequency statistics with different keys to different Reduce nodes; and the word-frequency statistics are aggregated to obtain the COC statistics of the N-grams. Specifically, a COC statistic is the number of N-grams whose word frequency is a given value from 1 to K, where K is the threshold for discount calculation; word frequencies greater than K need not be discounted, and the COC statistics are used in the discount computation of the word frequencies;
Step S3: perform one round of MapReduce on the word-frequency statistics of the N-grams to compute the probability values of the N-grams. The Map step emits each N-gram word-frequency statistic with the first word of the corresponding N-gram as the key, and the shuffle step assigns the N-gram frequencies with different keys to different Reduce nodes. On every Reduce node, the COC statistics are first loaded to compute the discounting coefficients for the corresponding N-gram word-frequency statistics; the probability values of the unigrams through the N-grams are then computed according to the discount coefficients; and finally all N-gram probabilities are gathered;
Step S4: perform one round of MapReduce to compute the unigram backoff coefficients. The data of all unigrams and bigrams is first distributed across the nodes; each unigram or bigram is then emitted with its first word as the key; the shuffle step assigns the unigrams and bigrams with different keys to different Reduce nodes; and finally, on each node, the Katz smoothing algorithm is applied to the word-frequency statistics of the unigrams and bigrams to compute the backoff coefficients of the unigrams corresponding to the unigram and bigram data on that node. Specifically, computing a unigram backoff coefficient requires the word-frequency statistics of both unigrams and bigrams, so the data allocation differs from step S1: every node must be allocated unigram and bigram data, assigned to a given node in the cluster according to the first word of each unigram or bigram. This guarantees that the unigram backoff coefficient computed on each node is global, i.e. identical to the value that would be computed with all the data placed on a single node;
Step S5: perform one round of MapReduce to compute the m-gram backoff coefficients (starting with m = 2). The data of all m-grams and (m+1)-grams is first distributed across the Reduce nodes; each m-gram or (m+1)-gram is then emitted with its second-to-last word as the key; the shuffle step assigns the m-grams and (m+1)-grams with different keys to different nodes; and finally, on each node, the backoff coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on that node are computed from the word-frequency statistics of the m-grams and (m+1)-grams. Specifically, every node must be allocated m-gram and (m+1)-gram data, and the allocation is keyed by the second-to-last word of each m-gram or (m+1)-gram; for example, "we do-not", "rely-on us strive-for" and "inspire our" (all sharing the same second-to-last word) are assigned to the same node, which guarantees that each node has sufficient data to compute the m-gram backoff coefficients;
Increment m by one, and judge whether m is less than or equal to N-1 (step S6 of Fig. 1);
if so, repeat step S5 until m = N-1 and all required N-gram backoff coefficients have been computed;
if not, aggregate the probability values and backoff coefficients to obtain a language model in ARPA format, and exit the computation (step S7 in Fig. 1).
In steps S5 and S6 of this method, the second-to-last word of each N-gram is used as the key for distributing the data. This guarantees that, when the order-m backoff coefficients are computed, all the order-m and order-(m+1) data they depend on is assigned to the same node, which ensures the correctness and validity of the data allocation. Compared with a single machine: suppose there are H order-m and order-(m+1) N-grams and X words in the dictionary (common words in Chinese can number in the hundreds of thousands); this distribution guarantees that each node in the cluster holds only about H/X N-grams on average, giving full play to the advantage of distributed computation.
This method proposes a new way of training a natural language model on big data. It overcomes the memory limits of a single machine under the present state of the art, and the size of the language model scales as the cluster grows. When training a natural language model on Web corpora, this method can process big data very effectively and train a large-scale language model; in the ideal case, the data-processing capacity can scale to hundreds of thousands of times that of a single machine (in practice, the processing capacity is affected by factors such as how evenly the data is distributed and the size of the server cluster). As data accumulates, the data-sparseness problem is relatively alleviated, so the method provides speech recognition with a language model close to the real environment and effectively improves recognition accuracy.
A simple example is given below to illustrate how the distributed training of a trigram language model is realized.
Suppose the input corpus has three sentences: "Today I want to go out", "Today's weather is not bad", "I want to go out to wash the car".
After word segmentation, the unigram-to-trigram raw data obtained is (one n-gram per line, glossed word by word):
Sentence one (segmented as today / I / want / go-out):
"today"
"today I"
"today I want"
"I"
"I want"
"I want go-out"
"want"
"want go-out"
"go-out"
Sentence two (segmented as today / de / weather / not-bad, where "de" glosses the Chinese possessive particle):
"today"
"today de"
"today de weather"
"de"
"de weather"
"de weather not-bad"
"weather"
"weather not-bad"
"not-bad"
Sentence three (segmented as I / want / go-out / go / wash-car):
"I"
"I want"
"I want go-out"
"want"
"want go-out"
"want go-out go"
"go-out"
"go-out go"
"go-out go wash-car"
"go"
"go wash-car"
"wash-car"
As shown in Fig. 1 and Fig. 2, the concrete distributed training flow of the trigram language model is as follows:
Step S1:
After the Map and shuffle steps, the N-grams whose key is "today" are assigned to one Reduce node, namely:
"today"
"today I"
"today I want"
"today"
"today de"
"today de weather"
After the Reduce count, the following word-frequency statistics are obtained:
"today" 2
"today I" 1
"today I want" 1
"today de" 1
"today de weather" 1
The remaining N-grams are processed in the same way.
Step S2:
Suppose the COC statistic for N-grams of order N with word frequency k is denoted COC[N][k].
After the Map and shuffle steps, the entries whose key is the word frequency "2" are assigned to the same Reduce node, namely:
"today" 2
"I" 2
"I want" 2
"I want go-out" 2
"want" 2
"want go-out" 2
"go-out" 2
The Reduce count finds 4 unigrams with word frequency 2, so COC[1][2] = 4; 2 bigrams, so COC[2][2] = 2; and 1 trigram, so COC[3][2] = 1. All other COC statistics can be obtained by the same procedure.
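The COC round of step S2 can be sketched as follows; this is a single-process simulation of the Reduce-side tally, using the frequency-2 entries of the worked example (the threshold K = 5 is an assumption for illustration):

```python
from collections import defaultdict

def coc_statistics(ngram_counts, K):
    """COC[n][k] = number of order-n n-grams whose word frequency equals k,
    recorded only for k <= K (frequencies above K are not discounted)."""
    coc = defaultdict(lambda: defaultdict(int))
    for ngram, freq in ngram_counts.items():
        if freq <= K:
            coc[len(ngram)][freq] += 1  # keyed by (order, frequency)
    return coc

# Frequency-2 entries of the worked example.
counts = {("today",): 2, ("I",): 2, ("want",): 2, ("go-out",): 2,
          ("I", "want"): 2, ("want", "go-out"): 2,
          ("I", "want", "go-out"): 2}
coc = coc_statistics(counts, K=5)
# coc[1][2] == 4, coc[2][2] == 2, coc[3][2] == 1, matching the text.
```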
Step S3:
Before the distributed MapReduce computation starts, the discount coefficient corresponding to each word frequency must be computed. The Katz smoothing discount coefficient is calculated by the Good-Turing method:

d_r = ((r+1) n_{r+1} / (r n_r) - (k+1) n_{k+1} / n_1) / (1 - (k+1) n_{k+1} / n_1)

where r is the word-frequency statistic (the count of an N-gram), d_r denotes the discount coefficient corresponding to word frequency r, n_r denotes the COC statistic, i.e. the number of N-grams whose word frequency is r, and k is the threshold for discount calculation; word frequencies greater than k need no discounting. Suppose the discount coefficient computed for the order-N word-frequency statistic k is denoted D[N][k]. After the Map and shuffle steps, the N-gram word-frequency statistics whose key is "today" are assigned to one Reduce node, namely:
" today " 2
" today I " 1
" today, I wanted " 1
" today " 1
" weather of today " 1
Taking the probability of "today" as an example:
P("today") = Count("today") × D[1][2] / (the sum of all unigram word frequencies),
and the probability that "I" occurs after "today":
P("I" | "today") = Count("today I") × D[2][1] / Count("today").
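These two probability computations can be sketched as follows; the discount values D[1][2] and D[2][1] are illustrative placeholders, since the patent does not give their numeric values:

```python
def discounted_prob(count_ngram, count_history, discount):
    """Discounted probability as in the worked example:
    P(w | history) = Count(history + w) * D / Count(history)."""
    return count_ngram * discount / count_history

# Placeholder discount coefficients (assumptions for illustration).
D_1_2, D_2_1 = 0.8, 0.5
total_unigram_freq = 13  # 4 + 4 + 5 tokens in the three example sentences
p_today = discounted_prob(2, total_unigram_freq, D_1_2)   # Count("today") = 2
p_I_given_today = discounted_prob(1, 2, D_2_1)            # Count("today I") = 1
```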
Step S4:
After the Map and shuffle steps, the data held on one node consists of all the unigrams together with the n-gram frequencies whose key is "today":
"today" 2
"today I" 1
"today I want" 1
"today de" 1
"today de weather" 1
"I" 2
"want" 2
"go-out" 2
"de" 1
"weather" 1
"not-bad" 1
"go" 1
"wash-car" 1
On this node, the Katz smoothing algorithm is used to compute the backoff coefficient of the unigram "today"; correspondingly, every node computes the backoff coefficients of the unigrams corresponding to its keys.
Step S5:
After the Map and shuffle steps, the data held on a certain node is the bigram and trigram word-frequency data whose key is "I":
"I want" 2
"today I want" 1
On each node, the backoff coefficients corresponding to the first two words of the trigrams can be computed; on this node that history is "today I".
The backoff coefficient formula is as follows:

α(xy) = (1 - Σ_{z: C(xyz)>0} P_Katz(z | xy)) / (1 - Σ_{z: C(xyz)>0} P_Katz(z | y))

where α(xy) denotes the backoff coefficient corresponding to the bigram xy, C(xyz) denotes the word frequency of the trigram xyz, and P_Katz denotes the N-gram probability after the discount computation. It can be seen that the numerator requires the total probability of all trigrams with prefix xy whose word frequency is greater than zero, and the denominator requires the total probability of all bigrams with prefix y such that the word frequency of the corresponding trigram xyz (with suffix z) is greater than zero.
Hence, computing the backoff coefficient corresponding to "today I" requires all trigrams with prefix "today I" together with the word-frequency statistics of all bigrams with prefix "I"; with this allocation scheme, all the corresponding data is placed on exactly the same node.
With more sentences as input, this node can compute the backoff coefficients of all bigrams ending in "I", such as "today I", "tomorrow I", "it-is I", and so on.
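Under the assumptions of the worked example, the backoff computation of step S5 can be sketched as follows; the discounted probabilities P_Katz fed to the function are illustrative numbers, not values derived in the patent:

```python
def backoff_coefficient(p_katz_tri, p_katz_bi, trigram_counts, x, y):
    """Katz backoff coefficient alpha(x y) for the bigram history (x, y).
    p_katz_tri[(x, y, z)] is the discounted P_Katz(z | x y);
    p_katz_bi[(y, z)] is the discounted P_Katz(z | y);
    the sums run over every z observed after the history (x, y)."""
    seen = [z for (a, b, z), c in trigram_counts.items()
            if (a, b) == (x, y) and c > 0]
    numerator = 1.0 - sum(p_katz_tri[(x, y, z)] for z in seen)
    denominator = 1.0 - sum(p_katz_bi[(y, z)] for z in seen)
    return numerator / denominator

# Illustrative probabilities (assumptions, not values from the patent):
tri_counts = {("today", "I", "want"): 1}
p_tri = {("today", "I", "want"): 0.4}
p_bi = {("I", "want"): 0.5}
alpha = backoff_coefficient(p_tri, p_bi, tri_counts, "today", "I")
# (1 - 0.4) / (1 - 0.5) == 1.2
```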
Step S6:
Increment m by one and judge whether m is less than or equal to N-1. In a trigram language model, trigrams have no backoff coefficient, so the flow enters step S7 directly. The computation of trigram backoff coefficients would be analogous to that of bigram backoff coefficients: trigram and four-gram data would be distributed with the second-to-last word of the trigram as the key.
Step S7:
ARPA is the currently recognized storage format standard for N-gram language models and is not described further here.
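For concreteness, a minimal sketch of writing such a model in ARPA format follows; the field layout (log10 probability, n-gram, optional log10 backoff weight) follows the common ARPA convention, and the example entries are illustrative:

```python
def write_arpa(path, unigrams, bigrams, trigrams):
    """Write a minimal trigram language model in ARPA format.
    Each entry is a tuple: (log10 prob, "n-gram words"[, log10 backoff])."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\\data\\\n")
        for order, entries in ((1, unigrams), (2, bigrams), (3, trigrams)):
            f.write(f"ngram {order}={len(entries)}\n")
        for order, entries in ((1, unigrams), (2, bigrams), (3, trigrams)):
            f.write(f"\n\\{order}-grams:\n")
            for entry in entries:
                f.write("\t".join(str(field) for field in entry) + "\n")
        f.write("\n\\end\\\n")
```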
As shown in Fig. 3, according to another aspect of the present invention, a language model training system is also provided, comprising a word-frequency module 1, a COC module 2, a probability module 3, a backoff-coefficient module 4 and an aggregation module 5.
The word-frequency module 1 is configured to perform one round of MapReduce on the corpus to count the word-frequency statistics of the N-grams, where N is a positive integer greater than or equal to 2. The word-frequency module 1 performs a Map step that emits each N-gram with its first word as the key; the shuffle step assigns the N-grams with different keys to different Reduce nodes; and each Reduce node counts the number of occurrences of each N-gram it receives as the word-frequency statistic of that N-gram.
The COC module 2 is configured to perform one round of MapReduce on the word-frequency statistics of the N-grams to obtain the COC statistics. The COC module 2 presets a threshold K for discount calculation, where K is a positive integer; the Map step emits each N-gram word-frequency statistic that is less than or equal to K, with the frequency as the key; the shuffle step distributes the word-frequency statistics with different keys to different Reduce nodes; and the word-frequency statistics are aggregated to obtain the COC statistics of the N-grams.
The probability module 3 is configured to perform one round of MapReduce on the word-frequency statistics of the N-grams to compute the probability values of the N-grams. The probability module 3 performs a Map step that emits each N-gram word-frequency statistic with the first word of the corresponding N-gram as the key; the shuffle step assigns the N-gram frequencies with different keys to different Reduce nodes; the COC statistics are first loaded on every Reduce node to compute the discount coefficients for the corresponding N-gram word-frequency statistics; the probability values of the unigrams through the N-grams are then computed according to the discount coefficients; and finally all N-gram probabilities are gathered.
The backoff-coefficient module 4 is configured to perform multiple rounds of MapReduce to compute the backoff coefficients of the unigrams through the m-grams respectively, where m = N-1; the backoff-coefficient module 4 comprises:
a unigram unit 41, configured to perform one round of MapReduce to compute the backoff coefficients of the unigrams (m = 1). The unigram unit 41 first distributes the data of all unigrams and bigrams across the Reduce nodes; then emits each unigram or bigram with its first word as the key; the shuffle step assigns the unigrams and bigrams with different keys to different Reduce nodes; and finally the Katz smoothing algorithm is applied on each node, using the word-frequency statistics of the unigrams and bigrams, to compute the backoff coefficients of the unigrams corresponding to the unigram and bigram data on that node. Specifically, every node must be allocated unigram and bigram data, assigned according to the first word of each unigram or bigram, which guarantees that the unigram backoff coefficient computed on each node is global, i.e. identical to the value that would be computed with all the data placed on a single node.
a multi-gram unit 42, which performs multiple rounds of MapReduce to compute the backoff coefficients of the bigrams through the m-grams respectively (2 <= m <= N-1). The multi-gram unit 42 first distributes the data of all m-grams and (m+1)-grams across the nodes; then emits each m-gram or (m+1)-gram with its second-to-last word as the key; the shuffle step assigns the m-grams and (m+1)-grams with different keys to different nodes; and finally the backoff coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on each node are computed from the word-frequency statistics of the m-grams and (m+1)-grams. Specifically, the allocation is keyed by the second-to-last word of each m-gram or (m+1)-gram; for example, "we do-not", "rely-on us strive-for" and "inspire our" (all sharing the same second-to-last word) are assigned to the same node, which guarantees that each node has sufficient data to compute the m-gram backoff coefficients.
The aggregation module 5 is configured to aggregate the probability values and backoff coefficients into a language model in ARPA format.
The present invention adopts a data structure based on hash prefix trees to split massive data and distribute it to the nodes of a cluster, where the relevant statistics are computed concurrently to obtain a language model based on massive data. This realizes a distributed version of the Katz algorithm that trains a language model efficiently on massive data while effectively alleviating the data-sparseness problem, thereby improving recognition accuracy.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and for the identical or similar parts the embodiments may be referred to one another. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant parts, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are carried out in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be regarded as going beyond the scope of the present invention.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to encompass them.