CN102509549A - Language model training method and system - Google Patents

Language model training method and system

Info

Publication number
CN102509549A
CN102509549A (application CN201110301029A)
Authority
CN
China
Prior art keywords
tuple
word frequency
key value
language model
statistic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201110301029XA
Other languages
Chinese (zh)
Other versions
CN102509549B (en)
Inventor
孙宏亮
蔡洪滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI GEAK ELECTRONICS Co.,Ltd.
Original Assignee
Shengle Information Technology (Shanghai) Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shengle Information Technology (Shanghai) Co., Ltd.
Priority to CN 201110301029 priority Critical patent/CN102509549B/en
Publication of CN102509549A publication Critical patent/CN102509549A/en
Application granted granted Critical
Publication of CN102509549B publication Critical patent/CN102509549B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a language model training method and system. The method comprises the following steps: performing one round of MapReduce over the training corpus to count the word frequencies of the N-grams; performing one round of MapReduce over the N-gram word frequencies to obtain the count-of-counts (COC) statistics of the N-grams; performing one round of MapReduce over the N-gram word frequencies to obtain the probability values of the N-grams; performing multiple rounds of MapReduce to compute the back-off coefficients of the unigrams through the m-grams; and aggregating the probability values and back-off coefficients into a language model in the ARPA format. The invention adopts a data structure based on a hash prefix tree, skillfully splits and combines the mass data and distributes it across the nodes of a cluster, where the corresponding statistics are counted and the computation is performed concurrently, so that a language model based on mass data is obtained. The method realizes a distributed version of the Katz algorithm, trains a language model based on mass data effectively, can solve the data-sparseness problem effectively, and improves the recognition rate.

Description

Language model training method and system
Technical field
The present invention relates to a language model training method and system.
Background art
Jelinek and his team pioneered the application of statistical language models, and the N-gram language model has proved simple and effective over time. The N-gram language model has a fundamental data-sparseness problem: the training data is never sufficient for the actual application environment, so when the probability of an N-gram that did not occur in the training data is predicted, a zero-probability problem always arises. The corresponding solution is a smoothing algorithm. Among these, the Katz algorithm is a classical smoothing algorithm proposed in 1987; for a detailed description of the algorithm see: Katz, Slava M. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35(3): 400-401, March.
With the rapid development of the Internet, the accumulation and processing of mass data has gradually become very important, and distributed computing has developed rapidly in this period. The development of speech recognition and machine translation requires language models backed by big data to improve the recognition rate, and a single machine or a small number of machines can no longer meet the needs of current applications: the computing and memory resources of a single machine are limited, processing mass data and training a language model on it is time-consuming and laborious, and the data-sparseness problem cannot be solved effectively at the same time.
Therefore, a language model training method and system are urgently needed that can train a language model based on mass data effectively, can solve the data-sparseness problem effectively, and provide support for the recognition rate of speech recognition.
Summary of the invention
The object of the present invention is to provide a language model training method and system that realize a distributed version of the Katz algorithm, can train a language model based on mass data effectively, can solve the data-sparseness problem effectively, and provide support for the recognition rate of speech recognition.
To address the above problem, the present invention provides a language model training method, comprising:
performing one round of MapReduce over the training corpus to count the word frequencies of the N-grams, where N is a positive integer greater than or equal to 2;
performing one round of MapReduce over the N-gram word frequencies to obtain the COC statistics of the N-grams;
performing one round of MapReduce over the N-gram word frequencies to obtain the probability values of the N-grams;
performing multiple rounds of MapReduce to compute the back-off coefficients of the unigrams through the m-grams, where m = N-1; and
aggregating the probability values and back-off coefficients into a language model in the ARPA format.
Optionally, in the method, the step of performing one round of MapReduce over the corpus and counting the N-gram word frequencies comprises:
the Map operation outputs the first word of each N-gram as the key;
the Shuffle operation assigns the N-grams with different keys to different Reduce nodes, and on each Reduce node the Reduce operation counts the number of occurrences of each N-gram as its word frequency.
Optionally, in the method, the step of performing one round of MapReduce over the N-gram word frequencies and obtaining the COC statistics of the N-grams comprises:
presetting a boundary value K for the discount computation, where K is a positive integer;
the Map operation outputs the word frequencies of the N-grams whose frequency is less than or equal to K as keys;
the Shuffle operation assigns the N-gram word frequencies with different keys to different Reduce nodes;
the word frequencies are aggregated to obtain the COC statistics of the N-grams.
Optionally, in the method, the step of performing one round of MapReduce over the N-gram word frequencies and obtaining the probability values of the unigrams through the N-grams comprises:
the Map operation outputs the first word of the N-gram corresponding to each N-gram word frequency as the key;
the Shuffle operation assigns the N-gram word frequencies with different keys to different Reduce nodes;
on every Reduce node the COC statistics are loaded and the discount factors of the corresponding N-gram word frequencies are computed;
the probability values of the unigrams through the N-grams are computed from the discount factors.
Optionally, in the method, the step of performing multiple rounds of MapReduce and computing the back-off coefficients of the unigrams through the m-grams comprises:
performing one round of MapReduce to compute the unigram back-off coefficients;
performing multiple rounds of MapReduce to compute the back-off coefficients of the bigrams through the m-grams.
Optionally, in the method, the step of performing one round of MapReduce and computing the unigram back-off coefficients comprises:
distributing the data of all the unigrams and the data of the bigrams to each Reduce node;
outputting the first word of each unigram or bigram as the key;
the Shuffle operation assigns the unigrams and bigrams with different keys to different Reduce nodes;
on each node, the Katz smoothing algorithm is used to compute, from the word frequencies of the unigrams and bigrams, the back-off coefficients of the unigrams corresponding to the unigram and bigram data on that node.
Optionally, in the method, the step of computing the back-off coefficients of the bigrams through the m-grams comprises:
distributing the data of all the m-grams and the data of the (m+1)-grams to each node;
outputting the second-to-last word of each m-gram or (m+1)-gram as the key;
the Shuffle operation assigns the m-grams and (m+1)-grams with different keys to different nodes;
on each node, the back-off coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on that node are computed from the word frequencies of the m-grams and (m+1)-grams.
According to another aspect of the present invention, a language model training system is provided, comprising:
a word frequency module, configured to perform one round of MapReduce over the corpus and count the N-gram word frequencies, where N is a positive integer greater than or equal to 2;
a COC module, configured to perform one round of MapReduce over the N-gram word frequencies to obtain the COC statistics of the N-grams;
a probability module, configured to perform one round of MapReduce over the N-gram word frequencies to obtain the probability values of the N-grams;
a back-off coefficient module, configured to perform multiple rounds of MapReduce and compute the back-off coefficients of the unigrams through the m-grams, where m = N-1; and
a summarizing module, configured to aggregate the probability values and back-off coefficients into a language model in the ARPA format.
Optionally, in the system, the word frequency module performs the Map operation to output the first word of each N-gram as the key; the Shuffle operation assigns the N-grams with different keys to different Reduce nodes; and on each Reduce node the Reduce operation counts the number of occurrences of each N-gram as its word frequency.
Optionally, in the system, the COC module presets a boundary value K for the discount computation, where K is a positive integer; the Map operation outputs the word frequencies of the N-grams whose frequency is less than or equal to K as keys; the Shuffle operation assigns the N-gram word frequencies with different keys to different Reduce nodes; and the word frequencies are aggregated to obtain the COC statistics of the N-grams.
Optionally, in the system, the probability module performs the Map operation to output the first word of the N-gram corresponding to each N-gram word frequency as the key; the Shuffle operation assigns the N-gram word frequencies with different keys to different Reduce nodes; on every Reduce node the COC statistics are loaded and the discount factors of the corresponding N-gram word frequencies are computed; and the probability values of the unigrams through the N-grams are computed from the discount factors.
Optionally, in the system, the back-off coefficient module comprises:
a unigram unit, configured to perform one round of MapReduce and compute the unigram back-off coefficients;
a multi-gram unit, configured to perform multiple rounds of MapReduce and compute the back-off coefficients of the bigrams through the m-grams.
Optionally, in the system, the unigram unit distributes the data of all the unigrams and the data of the bigrams to each Reduce node, outputs the first word of each unigram or bigram as the key, the Shuffle operation assigns the unigrams and bigrams with different keys to different Reduce nodes, and on each node the Katz smoothing algorithm is used to compute, from the word frequencies of the unigrams and bigrams, the back-off coefficients of the unigrams corresponding to the unigram and bigram data on that node.
Optionally, in the system, the multi-gram unit distributes the data of all the m-grams and the data of the (m+1)-grams to each node, outputs the second-to-last word of each m-gram or (m+1)-gram as the key, the Shuffle operation assigns the m-grams and (m+1)-grams with different keys to different nodes, and on each node the back-off coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on that node are computed from the word frequencies of the m-grams and (m+1)-grams.
Compared with the prior art, the present invention adopts a data structure based on a hash prefix tree, skillfully splits and combines the mass data, distributes the data to the nodes of a cluster, counts the corresponding statistics and performs the computation concurrently, and thereby obtains a language model based on mass data. The invention realizes a distributed version of the Katz algorithm, trains a language model based on mass data effectively, can solve the data-sparseness problem effectively at the same time, and improves the recognition rate.
Description of drawings
Fig. 1 is a flow chart of the language model training method of one embodiment of the present invention;
Fig. 2 is a flow chart of the distributed training of the trigram language model of one embodiment of the present invention;
Fig. 3 is a functional block diagram of the language model training system of one embodiment of the present invention.
Detailed description of the embodiments
The language model training method and system proposed by the present invention are further described below in conjunction with the accompanying drawings and specific embodiments.
As shown in Fig. 1 and Fig. 2, the present invention provides a language model training method. The method uses MapReduce as the tool for distributed management of the cluster and implements a distributed version of the Katz smoothing algorithm, comprising the following steps:
Step S1: one round of MapReduce (also written Map/Reduce) is performed over the corpus to count the word frequencies (word count) of the N-grams, where N is a positive integer greater than or equal to 2. The Map operation outputs the first word of each N-gram as the key; the Shuffle operation assigns the N-grams with different keys to different Reduce nodes (for example, the unigram "we" and the bigram "we not" are assigned to the same Reduce node); and on each Reduce node the Reduce operation counts the number of occurrences of each N-gram as its word frequency. Specifically, MapReduce is a programming model: in 2004 Google published a paper introducing its MapReduce system, in 2005 the web crawler project Nutch implemented an open-source MapReduce system, and in 2006 MapReduce became an independent open-source project; see the reference "Hadoop: The Definitive Guide" by Tom White, Chinese translation by Zhou Minqi et al., Tsinghua University Press.
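As an illustration only, the word-frequency round of step S1 can be sketched in Python as the following map/reduce pair (a minimal sketch assuming a streaming-style interface in which the mapper reads segmented sentences, one per line, from standard input; the function and variable names are illustrative and not part of the invention):

    import sys
    from collections import defaultdict

    N = 3  # order of the language model (trigram)

    def mapper(lines):
        """Emit (first_word, ngram) pairs for every 1- to N-gram of each segmented sentence."""
        for line in lines:
            words = line.split()
            for i in range(len(words)):
                for n in range(1, N + 1):
                    if i + n <= len(words):
                        # the first word is the key, so "today" and "today I" reach the same Reduce node
                        yield words[i], " ".join(words[i:i + n])

    def reducer(pairs):
        """Count how many times each N-gram occurs among the pairs routed to this node."""
        counts = defaultdict(int)
        for _key, ngram in pairs:
            counts[ngram] += 1
        return counts

    if __name__ == "__main__":
        for ngram, count in reducer(mapper(sys.stdin)).items():
            print(ngram + "\t" + str(count))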
Step S2: one round of MapReduce is performed over the N-gram word frequencies to obtain the COC statistics (count of counts, the number of N-grams sharing the same word frequency). A boundary value K for the discount computation is preset, where K is a positive integer; the Map operation outputs the word frequencies of the N-grams whose frequency is less than or equal to K as keys; the Shuffle operation assigns the N-gram word frequencies with different keys to different Reduce nodes; and the word frequencies are aggregated to obtain the COC statistics of the N-grams. Specifically, the COC statistic is the number of N-grams whose word frequency is 1 to K, where K is the boundary value of the discount computation; word frequencies exceeding K-1 need not be discounted, and the COC statistics are used in the discount computation of the word frequencies.
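For example, the count-of-counts round of step S2 could be sketched as follows (a minimal sketch assuming the (N-gram, word frequency) pairs produced by step S1 are available as an iterable; names are illustrative):

    from collections import defaultdict

    K = 5  # preset boundary value for the discount computation

    def coc_mapper(ngram_counts):
        """Emit the word frequency itself as the key for every N-gram whose frequency is <= K."""
        for ngram, count in ngram_counts:
            if count <= K:
                yield count, len(ngram.split())  # (frequency, order of the N-gram)

    def coc_reducer(pairs):
        """coc[n][r] = number of N-grams of order n whose word frequency is r."""
        coc = defaultdict(lambda: defaultdict(int))
        for count, order in pairs:
            coc[order][count] += 1
        return coc

    # With the toy corpus of the embodiment below, coc[1][2] == 4, coc[2][2] == 2 and coc[3][2] == 1.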
Step S3: one round of MapReduce is performed over the N-gram word frequencies to compute the probability values of the N-grams. The Map operation outputs the first word of the N-gram corresponding to each N-gram word frequency as the key; the Shuffle operation assigns the N-gram word frequencies with different keys to different Reduce nodes; on every Reduce node the COC statistics are first loaded and the discount factors (discounting) of the corresponding N-gram word frequencies are computed; then the probability values of the unigrams through the N-grams are computed from the discount factors; finally all the N-gram probabilities are aggregated.
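A minimal sketch of the per-node computation of step S3, assuming the COC statistics are available on every Reduce node as a nested dictionary coc[order][frequency] and the word frequencies as a dictionary counts (names are illustrative; the discount formula is the Good-Turing formula given later in the embodiment):

    def discount_factor(r, order, coc, k):
        """Katz discount d_r for an N-gram of the given order seen r times (1 <= r <= k)."""
        n_r = coc[order][r]
        n_r1 = coc[order].get(r + 1, 0)
        n_1 = coc[order][1]
        n_k1 = coc[order].get(k + 1, 0)
        numerator = (r + 1) * n_r1 / (r * n_r) - (k + 1) * n_k1 / n_1
        denominator = 1 - (k + 1) * n_k1 / n_1
        return numerator / denominator

    def katz_probability(ngram, counts, coc, k):
        """Discounted probability of an N-gram given its (N-1)-word history."""
        words = ngram.split()
        r = counts[ngram]
        d = discount_factor(r, len(words), coc, k) if r <= k else 1.0
        if len(words) == 1:
            total = sum(c for g, c in counts.items() if len(g.split()) == 1)
            return d * r / total
        return d * r / counts[" ".join(words[:-1])]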
Step S4: one round of MapReduce is performed to compute the unigram back-off coefficients. The data of all the unigrams and the data of the bigrams are first distributed to each node; then the first word of each unigram or bigram is output as the key; the Shuffle operation assigns the unigrams and bigrams with different keys to different Reduce nodes; finally, on each node the Katz smoothing algorithm is used to compute, from the word frequencies of the unigrams and bigrams, the back-off coefficients of the unigrams corresponding to the unigram and bigram data on that node. Specifically, computing the unigram back-off coefficients requires the word frequencies of the unigrams and bigrams, and the data allocation differs from that of step S1: each node needs to be allocated the data of all the unigrams and the data of the bigrams, which are assigned to a certain node of the cluster according to the first word of the unigram or bigram; this guarantees that the unigram back-off coefficients computed on each node are global, i.e. the same as the values that would be computed if all the data were placed on a single node.
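As an illustration only, the per-node unigram back-off computation of step S4 can be sketched as follows (a minimal sketch assuming the discounted probabilities from step S3 are already available as dictionaries on the node; names are illustrative):

    def unigram_backoff(x, bigram_counts, p_katz_bigram, p_katz_unigram):
        """Katz back-off coefficient alpha(x) of the unigram x from the data routed to this node.

        bigram_counts  : dict "x z" -> word frequency, for the bigrams beginning with x
        p_katz_bigram  : dict "x z" -> discounted P(z | x)
        p_katz_unigram : dict "z"   -> discounted P(z)
        """
        seen = [bg.split()[1] for bg, c in bigram_counts.items()
                if c > 0 and bg.split()[0] == x]
        numerator = 1.0 - sum(p_katz_bigram[x + " " + z] for z in seen)
        denominator = 1.0 - sum(p_katz_unigram[z] for z in seen)
        return numerator / denominator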
Step S5: one round of MapReduce is performed to compute the m-gram back-off coefficients (starting from m = 2). The data of all the m-grams and the data of the (m+1)-grams are first distributed to each Reduce node; then the second-to-last word of each m-gram or (m+1)-gram is output as the key; the Shuffle operation assigns the m-grams and (m+1)-grams with different keys to different nodes; finally, on each node the back-off coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on that node are computed from the word frequencies of the m-grams and (m+1)-grams. Specifically, each node needs to be allocated the m-gram and (m+1)-gram data, and the data are allocated according to the second-to-last word of the m-gram or (m+1)-gram; for example, n-grams that share the same second-to-last word, such as "we not", "rely on us to strive for" and "inspiring our", are assigned to the same node, which guarantees that every node has sufficient data to compute the m-gram back-off coefficients.
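For example, the key assignment of step S5 could be sketched as follows (a minimal sketch; names are illustrative):

    def backoff_mapper(ngram_counts, m):
        """Key the m-grams and (m+1)-grams by their second-to-last word, as in step S5."""
        for ngram, count in ngram_counts:
            words = ngram.split()
            if len(words) in (m, m + 1):
                yield words[-2], (ngram, count)

    # With m = 2, the bigram "I want" and the trigram "today I want" are both keyed by "I",
    # so they land on the same Reduce node -- exactly the data needed for the back-off
    # coefficient of the bigram "today I" in the example below.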
m is incremented by one, and it is judged whether m is less than or equal to N-1 (step S6 of Fig. 1).
If yes, step S5 is repeated until m = N-1 and all the N-gram back-off coefficients that need to be computed have been computed.
If not, the probability values and back-off coefficients are aggregated into a language model in the ARPA format and the computation exits (step S7 of Fig. 1).
In steps S5 and S6 of this method, the second-to-last word of each N-gram is used as the key for data allocation. This guarantees that, when the m-th-order back-off coefficients are computed, the m-th-order and (m+1)-th-order data that need to be used are assigned to the same node, thereby ensuring the correctness and validity of the data allocation. Compared with a single machine: suppose the number of m-th-order and (m+1)-th-order N-grams is H and the dictionary contains X words (the common words of Chinese can reach hundreds of thousands); such an allocation guarantees that each node of the cluster holds on average only H/X of the corresponding N-grams, giving full play to the advantage of distributed computation.
This method proposes a new way of training a natural language model on big data. It can overcome the memory limits of a single machine under the present state of the art, and the scale of the language model is extensible as the cluster grows. In the training of a natural language model on Web corpora, this method can process big data very effectively and train a large-scale language model; in the ideal case the data processing capacity can be expanded to hundreds of thousands of times that of a single machine (in practice the processing capacity is affected by factors such as the balance of the data distribution and the scale of the server cluster). As the amount of data accumulates, the data-sparseness problem is relatively alleviated, so that a language model close to the real environment is provided for speech recognition and its recognition rate is improved effectively.
A simple example is used below to describe how the distributed training of a trigram language model is realized.
Suppose the input corpus consists of three sentences: "I am going out today", "The weather today is nice" and "I am going out to wash the car", whose segmented word sequences are "today I want go-out", "today 's weather nice" and "I want go-out go wash-car" (each hyphenated item is a single segmented word).
After the word segmentation operation, the unigram to trigram raw data obtained are:
Sentence one:
"today"
"today I"
"today I want"
"I"
"I want"
"I want go-out"
"want"
"want go-out"
"go-out"
Sentence two:
"today"
"today 's"
"today 's weather"
"'s"
"'s weather"
"'s weather nice"
"weather"
"weather nice"
"nice"
Sentence three:
"I"
"I want"
"I want go-out"
"want"
"want go-out"
"want go-out go"
"go-out"
"go-out go"
"go-out go wash-car"
"go"
"go wash-car"
"wash-car"
As shown in Fig. 1 and Fig. 2, the concrete distributed training flow of the trigram language model is as follows:
Step S1:
After Map and Shuffle, the N-grams whose key is "today" are assigned to one Reduce node, namely:
"today"
"today I"
"today I want"
"today"
"today 's"
"today 's weather"
After the Reduce statistics, the word frequencies are obtained:
"today" 2
"today I" 1
"today I want" 1
"today 's" 1
"today 's weather" 1
The other N-grams are handled similarly.
Step S2:
Suppose the COC statistic of the N-grams of order N whose word frequency is k is denoted COC[N][k].
After Map and Shuffle, the word frequencies whose key is "2" are assigned to the same Reduce node, namely:
"today" 2
"I" 2
"I want" 2
"I want go-out" 2
"want" 2
"want go-out" 2
"go-out" 2
The Reduce statistics show that there are 4 unigrams with word frequency 2, so COC[1][2] = 4; there are 2 such bigrams, i.e. COC[2][2] = 2; and there is 1 such trigram, i.e. COC[3][2] = 1. All other COC statistics can be obtained in the same way.
Step S3:
The discount factor corresponding to each word frequency needs to be computed before the MapReduce distributed computation starts. The Katz smoothing discount factor is computed by the Good-Turing method:
d_r = \frac{\dfrac{(r+1)\,n_{r+1}}{r\,n_r} - \dfrac{(k+1)\,n_{k+1}}{n_1}}{1 - \dfrac{(k+1)\,n_{k+1}}{n_1}}
where r is the word frequency (the count of an N-gram), d_r denotes the discount factor corresponding to word frequency r, n_r denotes the COC statistic, i.e. the number of N-grams whose word frequency is r, and k is the boundary value of the discount computation; word frequencies greater than k need not be discounted. Suppose the discount factor of an N-gram of order N with word frequency k is denoted D[N][k]. After Map and Shuffle, the N-gram word frequencies whose key is "today" are assigned to one Reduce node, namely:
"today" 2
"today I" 1
"today I want" 1
"today 's" 1
"today 's weather" 1
Taking the probability of "today" as an example:
P("today") = Count("today") × D[1][2] / Count(sum of the word frequencies of all unigrams),
and the probability of "I" appearing after "today":
P("I" | "today") = Count("today I") × D[2][1] / Count("today").
Step S4:
After Map and Shuffle, the data stored on one node are all the unigrams together with the N-gram word frequencies whose key is "today":
"today" 2
"today I" 1
"today I want" 1
"today 's" 1
"today 's weather" 1
"I" 2
"want" 2
"go-out" 2
"'s" 1
"weather" 1
"nice" 1
"go" 1
"wash-car" 1
On this node the Katz smoothing algorithm is used to compute the back-off coefficient of the unigram "today"; correspondingly, each node computes the back-off coefficients of the unigrams corresponding to its own keys.
Step S5:
After Map and Shuffle, the data stored on a certain node are the bigram and trigram word frequencies whose key is "I":
"I want" 2
"today I want" 1
On each node, the back-off coefficients of the first two words of the trigrams can be computed; on this node that corresponds to "today I".
The back-off coefficient formula is as follows:
\alpha(xy) = \frac{1 - \sum_{z:\,C(xyz) > 0} P_{Katz}(z \mid xy)}{1 - \sum_{z:\,C(xyz) > 0} P_{Katz}(z \mid y)}
where α(xy) denotes the back-off coefficient corresponding to the bigram xy, C(xyz) denotes the word frequency of the trigram xyz, and P_Katz denotes the N-gram probability after the discount computation. It follows that the numerator requires the total probability of all trigrams with prefix xy whose word frequency is greater than zero, and the denominator requires the total probability of all bigrams with prefix y and suffix z whose corresponding trigram xyz has a word frequency greater than zero.
It also follows that computing the back-off coefficient corresponding to "today I" requires all trigrams with prefix "today I" and all bigram word frequencies with prefix "I"; with this allocation scheme, all the corresponding data happen to be placed on the same node.
When more sentences are used as input, this node can compute the back-off coefficients of all bigrams ending in "I", such as "today I", "tomorrow I", "it is I", and so on.
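As an illustration only, the computation on the node keyed by "I" can be sketched as follows (a minimal sketch assuming the discounted probabilities from step S3 are available as dictionaries; names are illustrative):

    def bigram_backoff(xy, trigram_counts, p_katz_tri, p_katz_bi):
        """Katz back-off coefficient alpha(xy) of the bigram xy = "x y".

        trigram_counts : dict "x y z" -> word frequency (trigrams keyed by their second-to-last word y)
        p_katz_tri     : dict "x y z" -> discounted P(z | x y)
        p_katz_bi      : dict "y z"   -> discounted P(z | y)
        """
        y = xy.split()[-1]
        seen = [tg.split()[2] for tg, c in trigram_counts.items()
                if c > 0 and tg.startswith(xy + " ")]
        numerator = 1.0 - sum(p_katz_tri[xy + " " + z] for z in seen)
        denominator = 1.0 - sum(p_katz_bi[y + " " + z] for z in seen)
        return numerator / denominator

    # On the node keyed "I", the trigram "today I want" gives seen = ["want"], so
    # alpha("today I") = (1 - P("want" | "today I")) / (1 - P("want" | "I")).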
Step S6:
m is incremented by one and it is judged whether m is less than or equal to N-1. In a trigram language model the trigrams have no back-off coefficients, so the flow proceeds directly to step S7. The computation of the trigram back-off coefficients would be similar to that of the bigram back-off coefficients: the trigram and 4-gram data would be distributed with the second-to-last word of the trigram as the key.
Step S7:
ARPA is the currently accepted standard storage format for N-gram language models and is not described further here.
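For reference, the aggregated output of step S7 follows the usual ARPA layout sketched below; the n-gram counts in the header correspond to the toy corpus above, while the log10 probabilities and back-off coefficients shown are illustrative placeholders, not values computed from that corpus:

    \data\
    ngram 1=9
    ngram 2=8
    ngram 3=6

    \1-grams:
    -0.8129 today   -0.2553
    -0.8129 I       -0.3010
    ...

    \2-grams:
    -0.1761 today I -0.1249
    ...

    \3-grams:
    -0.1249 today I want
    ...

    \end\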
As shown in Fig. 3, according to another aspect of the present invention, a language model training system is also provided, comprising a word frequency module 1, a COC module 2, a probability module 3, a back-off coefficient module 4 and a summarizing module 5.
The word frequency module 1 is configured to perform one round of MapReduce over the corpus and count the N-gram word frequencies, where N is a positive integer greater than or equal to 2. The word frequency module 1 performs the Map operation to output the first word of each N-gram as the key; the Shuffle operation assigns the N-grams with different keys to different Reduce nodes; and on each Reduce node the Reduce operation counts the number of occurrences of each N-gram as its word frequency.
The COC module 2 is configured to perform one round of MapReduce over the N-gram word frequencies and obtain the COC statistics. The COC module 2 presets a boundary value K for the discount computation, where K is a positive integer; the Map operation outputs the word frequencies of the N-grams whose frequency is less than or equal to K as keys; the Shuffle operation assigns the N-gram word frequencies with different keys to different Reduce nodes; and the word frequencies are aggregated to obtain the COC statistics of the N-grams.
The probability module 3 is configured to perform one round of MapReduce over the N-gram word frequencies and compute the N-gram probability values. The probability module 3 performs the Map operation to output the first word of the N-gram corresponding to each N-gram word frequency as the key; the Shuffle operation assigns the N-gram word frequencies with different keys to different Reduce nodes; on every Reduce node the COC statistics are first loaded and the discount factors of the corresponding N-gram word frequencies are computed, then the probability values of the unigrams through the N-grams are computed from the discount factors, and finally all the N-gram probabilities are aggregated.
The back-off coefficient module 4 is configured to perform multiple rounds of MapReduce and compute the back-off coefficients of the unigrams through the m-grams, where m = N-1. The back-off coefficient module 4 comprises:
a unigram unit 41, configured to perform one round of MapReduce and compute the unigram back-off coefficients (m = 1). The unigram unit 41 first distributes the data of all the unigrams and the data of the bigrams to each Reduce node, then outputs the first word of each unigram or bigram as the key; the Shuffle operation assigns the unigrams and bigrams with different keys to different Reduce nodes; finally, on each node the Katz smoothing algorithm is used to compute, from the word frequencies of the unigrams and bigrams, the back-off coefficients of the unigrams corresponding to the unigram and bigram data on that node. Specifically, each node needs to be allocated the data of all the unigrams and the data of the bigrams, which are assigned to a certain node of the cluster according to the first word of the unigram or bigram; this guarantees that the unigram back-off coefficients computed on each node are global, i.e. the same as the values that would be computed if all the data were placed on a single node.
a multi-gram unit 42, configured to perform multiple rounds of MapReduce and compute the back-off coefficients of the bigrams through the m-grams (2 <= m <= N-1). The multi-gram unit 42 first distributes the data of all the m-grams and the data of the (m+1)-grams to each node, then outputs the second-to-last word of each m-gram or (m+1)-gram as the key; the Shuffle operation assigns the m-grams and (m+1)-grams with different keys to different nodes; finally, on each node the back-off coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on that node are computed from the word frequencies of the m-grams and (m+1)-grams. Specifically, here the data are allocated according to the second-to-last word of the m-gram or (m+1)-gram; for example, n-grams that share the same second-to-last word, such as "we not", "rely on us to strive for" and "inspiring our", are assigned to the same node, which guarantees that every node has sufficient data to compute the m-gram back-off coefficients.
The summarizing module 5 is configured to aggregate the probability values and back-off coefficients into a language model in the ARPA format.
The present invention adopts a data structure based on a hash prefix tree, skillfully splits and combines the mass data, distributes the data to the nodes of a cluster, counts the corresponding statistics and performs the computation concurrently, and obtains a language model based on mass data. It realizes a distributed version of the Katz algorithm, trains a language model based on mass data effectively, can solve the data-sparseness problem effectively at the same time, and improves the recognition rate.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the identical or similar parts the embodiments may refer to one another. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively simple, and for the relevant parts reference may be made to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled persons may implement the described functions in different ways for each particular application, but such implementations should not be considered to exceed the scope of the present invention.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include these changes and modifications.

Claims (14)

1. A language model training method, characterized in that it comprises:
performing one round of MapReduce over a corpus to count the word frequencies of the N-grams, where N is a positive integer greater than or equal to 2;
performing one round of MapReduce over the N-gram word frequencies to obtain the COC statistics of the N-grams;
performing one round of MapReduce over the N-gram word frequencies to obtain the probability values of the N-grams;
performing multiple rounds of MapReduce to compute the back-off coefficients of the unigrams through the m-grams, where m = N-1; and
aggregating the probability values and back-off coefficients into a language model in the ARPA format.
2. The language model training method of claim 1, characterized in that the step of performing one round of MapReduce over the corpus and counting the N-gram word frequencies comprises:
the Map operation outputs the first word of each N-gram as the key;
the Shuffle operation assigns the N-grams with different keys to different Reduce nodes, and on each Reduce node the Reduce operation counts the number of occurrences of each N-gram as its word frequency.
3. The language model training method of claim 2, characterized in that the step of performing one round of MapReduce over the N-gram word frequencies and obtaining the COC statistics of the N-grams comprises:
presetting a boundary value K for the discount computation, where K is a positive integer;
the Map operation outputs the word frequencies of the N-grams whose frequency is less than or equal to K as keys;
the Shuffle operation assigns the N-gram word frequencies with different keys to different Reduce nodes;
the word frequencies are aggregated to obtain the COC statistics of the N-grams.
4. The language model training method of claim 3, characterized in that the step of performing one round of MapReduce over the N-gram word frequencies and obtaining the probability values of the unigrams through the N-grams comprises:
the Map operation outputs the first word of the N-gram corresponding to each N-gram word frequency as the key;
the Shuffle operation assigns the N-gram word frequencies with different keys to different Reduce nodes;
on every Reduce node the COC statistics are loaded and the discount factors of the corresponding N-gram word frequencies are computed;
the probability values of the unigrams through the N-grams are computed from the discount factors.
5. The language model training method of claim 2, characterized in that the step of performing multiple rounds of MapReduce and computing the back-off coefficients of the unigrams through the m-grams comprises:
performing one round of MapReduce to compute the unigram back-off coefficients;
performing multiple rounds of MapReduce to compute the back-off coefficients of the bigrams through the m-grams.
6. The language model training method of claim 5, characterized in that the step of performing one round of MapReduce and computing the unigram back-off coefficients comprises:
distributing the data of all the unigrams and the data of the bigrams to each Reduce node;
outputting the first word of each unigram or bigram as the key;
the Shuffle operation assigns the unigrams and bigrams with different keys to different Reduce nodes;
on each node, the Katz smoothing algorithm is used to compute, from the word frequencies of the unigrams and bigrams, the back-off coefficients of the unigrams corresponding to the unigram and bigram data on that node.
7. The language model training method of claim 5, characterized in that the step of computing the back-off coefficients of the bigrams through the m-grams comprises:
distributing the data of all the m-grams and the data of the (m+1)-grams to each node;
outputting the second-to-last word of each m-gram or (m+1)-gram as the key;
the Shuffle operation assigns the m-grams and (m+1)-grams with different keys to different nodes;
on each node, the back-off coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on that node are computed from the word frequencies of the m-grams and (m+1)-grams.
8. A language model training system, characterized in that it comprises:
a word frequency module, configured to perform one round of MapReduce over a corpus and count the N-gram word frequencies, where N is a positive integer greater than or equal to 2;
a COC module, configured to perform one round of MapReduce over the N-gram word frequencies to obtain the COC statistics of the N-grams;
a probability module, configured to perform one round of MapReduce over the N-gram word frequencies to obtain the probability values of the N-grams;
a back-off coefficient module, configured to perform multiple rounds of MapReduce and compute the back-off coefficients of the unigrams through the m-grams, where m = N-1; and
a summarizing module, configured to aggregate the probability values and back-off coefficients into a language model in the ARPA format.
9. The language model training system of claim 8, characterized in that the word frequency module performs the Map operation to output the first word of each N-gram as the key; the Shuffle operation assigns the N-grams with different keys to different Reduce nodes; and on each Reduce node the Reduce operation counts the number of occurrences of each N-gram as its word frequency.
10. The language model training system of claim 9, characterized in that the COC module presets a boundary value K for the discount computation, where K is a positive integer; the Map operation outputs the word frequencies of the N-grams whose frequency is less than or equal to K as keys; the Shuffle operation assigns the N-gram word frequencies with different keys to different Reduce nodes; and the word frequencies are aggregated to obtain the COC statistics of the N-grams.
11. The language model training system of claim 10, characterized in that the probability module performs the Map operation to output the first word of the N-gram corresponding to each N-gram word frequency as the key; the Shuffle operation assigns the N-gram word frequencies with different keys to different Reduce nodes; on every Reduce node the COC statistics are loaded and the discount factors of the corresponding N-gram word frequencies are computed; and the probability values of the unigrams through the N-grams are computed from the discount factors.
12. The language model training system of claim 9, characterized in that the back-off coefficient module comprises:
a unigram unit, configured to perform one round of MapReduce and compute the unigram back-off coefficients;
a multi-gram unit, configured to perform multiple rounds of MapReduce and compute the back-off coefficients of the bigrams through the m-grams.
13. The language model training system of claim 12, characterized in that the unigram unit distributes the data of all the unigrams and the data of the bigrams to each Reduce node, outputs the first word of each unigram or bigram as the key, the Shuffle operation assigns the unigrams and bigrams with different keys to different Reduce nodes, and on each node the Katz smoothing algorithm is used to compute, from the word frequencies of the unigrams and bigrams, the back-off coefficients of the unigrams corresponding to the unigram and bigram data on that node.
14. The language model training system of claim 12, characterized in that the multi-gram unit distributes the data of all the m-grams and the data of the (m+1)-grams to each node, outputs the second-to-last word of each m-gram or (m+1)-gram as the key, the Shuffle operation assigns the m-grams and (m+1)-grams with different keys to different nodes, and on each node the back-off coefficients of the m-grams corresponding to the m-gram and (m+1)-gram data on that node are computed from the word frequencies of the m-grams and (m+1)-grams.
CN 201110301029 2011-09-28 2011-09-28 Language model training method and system Active CN102509549B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110301029 CN102509549B (en) 2011-09-28 2011-09-28 Language model training method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110301029 CN102509549B (en) 2011-09-28 2011-09-28 Language model training method and system

Publications (2)

Publication Number Publication Date
CN102509549A true CN102509549A (en) 2012-06-20
CN102509549B CN102509549B (en) 2013-08-14

Family

ID=46221624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110301029 Active CN102509549B (en) 2011-09-28 2011-09-28 Language model training method and system

Country Status (1)

Country Link
CN (1) CN102509549B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514230A (en) * 2012-06-29 2014-01-15 北京百度网讯科技有限公司 Method and device used for training language model according to corpus sequence
CN103631771A (en) * 2012-08-28 2014-03-12 株式会社东芝 Method and device for improving linguistic model
CN103871404A (en) * 2012-12-13 2014-06-18 北京百度网讯科技有限公司 Language model training method, query method and corresponding device
CN104112447A (en) * 2014-07-28 2014-10-22 科大讯飞股份有限公司 Method and system for improving statistical language model accuracy
CN105679317A (en) * 2014-12-08 2016-06-15 三星电子株式会社 Method and apparatus for language model training and speech recognition
CN106055543A (en) * 2016-05-23 2016-10-26 南京大学 Spark-based training method of large-scale phrase translation model
CN106156010A (en) * 2015-04-20 2016-11-23 阿里巴巴集团控股有限公司 Translation training method, device, system and translation on line method and device
CN106257441A (en) * 2016-06-30 2016-12-28 电子科技大学 A kind of training method of skip language model based on word frequency
CN106649269A (en) * 2016-12-16 2017-05-10 广州视源电子科技股份有限公司 Extraction method and device of colloquial sentence
CN107436865A (en) * 2016-05-25 2017-12-05 阿里巴巴集团控股有限公司 A kind of word alignment training method, machine translation method and system
CN108021712A (en) * 2017-12-28 2018-05-11 中南大学 The method for building up of N-Gram models
CN110379416A (en) * 2019-08-15 2019-10-25 腾讯科技(深圳)有限公司 A kind of neural network language model training method, device, equipment and storage medium
CN112862662A (en) * 2021-03-12 2021-05-28 云知声智能科技股份有限公司 Method and equipment for distributed training of transform-xl language model

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229286A (en) * 2017-05-27 2018-06-29 北京市商汤科技开发有限公司 Language model generates and application process, device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1673997A (en) * 2004-03-26 2005-09-28 微软公司 Representation of a deleted interpolation n-gram language model in ARPA standard format
US20080243481A1 (en) * 2007-03-26 2008-10-02 Thorsten Brants Large Language Models in Machine Translation
CN101604522A (en) * 2009-07-16 2009-12-16 北京森博克智能科技有限公司 The embedded Chinese and English mixing voice recognition methods and the system of unspecified person
CN101764835A (en) * 2008-12-25 2010-06-30 华为技术有限公司 Task allocation method and device based on MapReduce programming framework


Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514230A (en) * 2012-06-29 2014-01-15 北京百度网讯科技有限公司 Method and device used for training language model according to corpus sequence
CN103514230B (en) * 2012-06-29 2018-06-05 北京百度网讯科技有限公司 A kind of method and apparatus being used for according to language material sequence train language model
CN103631771A (en) * 2012-08-28 2014-03-12 株式会社东芝 Method and device for improving linguistic model
CN103871404A (en) * 2012-12-13 2014-06-18 北京百度网讯科技有限公司 Language model training method, query method and corresponding device
CN103871404B (en) * 2012-12-13 2017-04-12 北京百度网讯科技有限公司 Language model training method, query method and corresponding device
CN104112447B (en) * 2014-07-28 2017-08-25 安徽普济信息科技有限公司 Method and system for improving accuracy of statistical language model
CN104112447A (en) * 2014-07-28 2014-10-22 科大讯飞股份有限公司 Method and system for improving statistical language model accuracy
CN105679317A (en) * 2014-12-08 2016-06-15 三星电子株式会社 Method and apparatus for language model training and speech recognition
CN105679317B (en) * 2014-12-08 2020-11-17 三星电子株式会社 Method and apparatus for training language models and recognizing speech
CN106156010A (en) * 2015-04-20 2016-11-23 阿里巴巴集团控股有限公司 Translation training method, device, system and translation on line method and device
CN106055543A (en) * 2016-05-23 2016-10-26 南京大学 Spark-based training method of large-scale phrase translation model
CN106055543B (en) * 2016-05-23 2019-04-09 南京大学 The training method of extensive phrase translation model based on Spark
CN107436865A (en) * 2016-05-25 2017-12-05 阿里巴巴集团控股有限公司 A kind of word alignment training method, machine translation method and system
CN107436865B (en) * 2016-05-25 2020-10-16 阿里巴巴集团控股有限公司 Word alignment training method, machine translation method and system
CN106257441A (en) * 2016-06-30 2016-12-28 电子科技大学 A kind of training method of skip language model based on word frequency
CN106649269A (en) * 2016-12-16 2017-05-10 广州视源电子科技股份有限公司 Extraction method and device of colloquial sentence
CN108021712A (en) * 2017-12-28 2018-05-11 中南大学 The method for building up of N-Gram models
CN110379416A (en) * 2019-08-15 2019-10-25 腾讯科技(深圳)有限公司 A kind of neural network language model training method, device, equipment and storage medium
CN110379416B (en) * 2019-08-15 2021-10-22 腾讯科技(深圳)有限公司 Neural network language model training method, device, equipment and storage medium
CN112862662A (en) * 2021-03-12 2021-05-28 云知声智能科技股份有限公司 Method and equipment for distributed training of transform-xl language model

Also Published As

Publication number Publication date
CN102509549B (en) 2013-08-14

Similar Documents

Publication Publication Date Title
CN102509549B (en) Language model training method and system
US9864807B2 (en) Identifying influencers for topics in social media
CN102123172B (en) Implementation method of Web service discovery based on neural network clustering optimization
CN107608953B (en) Word vector generation method based on indefinite-length context
CN107609141A (en) It is a kind of that quick modelling method of probabilistic is carried out to extensive renewable energy source data
CN109871502A (en) A kind of flow data canonical matching process based on Storm
Wu et al. A deadline-aware estimation of distribution algorithm for resource scheduling in fog computing systems
CN104881399A (en) Event identification method and system based on probability soft logic PSL
CN104915396A (en) Knowledge retrieving method
Yang et al. Node importance ranking in complex networks based on multicriteria decision making
CN104636454B (en) A kind of joint clustering method towards large scale scale heterogeneous data
Kastrati et al. An improved concept vector space model for ontology based classification
CN107436865A (en) A kind of word alignment training method, machine translation method and system
CN105023170A (en) Processing method and device of click stream data
JP2017524175A (en) Method and apparatus for generating instances of technical indicators
JP6261669B2 (en) Query calibration system and method
Chen et al. Investigation of back-off based interpolation between recurrent neural network and n-gram language models
CN112579775B (en) Method for classifying unstructured text and computer-readable storage medium
CN106610945A (en) Improved ontology concept semantic similarity computing method
CN110751161B (en) Spark-based node similarity calculation method, device and terminal
CN106611039A (en) Calculation method for hybrid solution of semantic similarity of ontology concept
CN106934489B (en) Time sequence link prediction method for complex network
CN104965869A (en) Mobile application sorting and clustering method based on heterogeneous information network
Chen Spam message filtering recognition system based on tensorflow
CN114936220B (en) Search method and device for Boolean satisfiability problem solution, electronic equipment and medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: SHANGHAI GUOKE ELECTRONIC CO., LTD.

Free format text: FORMER OWNER: SHENGYUE INFORMATION TECHNOLOGY (SHANGHAI) CO., LTD.

Effective date: 20140919

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20140919

Address after: 201203, room 1, building 380, 108 Yin Yin Road, Shanghai, Pudong New Area

Patentee after: Shanghai Guoke Electronic Co., Ltd.

Address before: 201203 Shanghai Guo Shou Jing Road, Zhangjiang High Tech Park of Pudong New Area No. 356 building 3 Room 102

Patentee before: Shengle Information Technology (Shanghai) Co., Ltd.

CP03 Change of name, title or address
CP03 Change of name, title or address

Address after: Room 127, building 3, 356 GuoShouJing Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 200120

Patentee after: SHANGHAI GEAK ELECTRONICS Co.,Ltd.

Address before: Room 108, building 1, 380 Yinbei Road, Pudong New Area, Shanghai 201203

Patentee before: Shanghai Nutshell Electronics Co.,Ltd.