CN110019832A - Method and device for obtaining a language model - Google Patents

Method and device for obtaining a language model

Info

Publication number
CN110019832A
CN110019832A
Authority
CN
China
Prior art keywords
corpus
language
language model
perplexity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710910635.9A
Other languages
Chinese (zh)
Other versions
CN110019832B (en)
Inventor
郑昊
唐璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201710910635.9A priority Critical patent/CN110019832B/en
Publication of CN110019832A publication Critical patent/CN110019832A/en
Application granted granted Critical
Publication of CN110019832B publication Critical patent/CN110019832B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and a device for obtaining a language model. The method includes: obtaining a first corpus and a second corpus, where the first corpus is randomly collected language text and the second corpus is language text selected under a preset context; using a first language model trained on the second corpus, computing the perplexity of the first corpus and screening out a third corpus, where the perplexity between the third corpus and the second corpus is below a preset threshold; and interpolating a second language model trained on the first corpus with a third language model trained on the third corpus to obtain the language model to be used. The invention solves the technical problem of poor language model performance.

Description

Method and device for obtaining a language model
Technical field
The present invention relates to the field of language models, and in particular to a method and a device for obtaining a language model.
Background technique
At present, the language model is an important link in the whole speech recognition pipeline, and it is just as important in natural language understanding, so it has a far-reaching influence on recognition performance. However, a language model is very sensitive to how well the corpus matches the target data; for a specific field, whether the corpus matches can severely constrain the performance of the language model and, in turn, of the whole system.
Traditional language model training often simply piles up corpora. When corpora are insufficient, the quantity of the corpus influences language model performance far more than its quality does; while quantity still cannot meet demand, there is no spare capacity to improve corpus quality in a targeted way. With the maturing of the language model training process and the spread of the Internet, acquiring massive corpus data is no longer an impossible task. On the basis of such mass data, corpora that fit the field of the practical application more closely can be searched for a specific task. During this search, developers select task-relevant corpora manually, based only on their own understanding of the task.
At present, when obtaining a language model, a topic model of the development set is obtained by clustering the development-set word vectors; the distance between each sentence in the massive corpus and the topic is computed sentence by sentence, and a threshold is set to screen the data. However, this method has the following defects:
(1) Very large computing resources are needed to train the word-vector mapping network. For a large-vocabulary, massive corpus, the computation needed to train one word-vector model far exceeds that of training one N-gram language model, which is no small expense in the language modeling field. In addition, the quality of the word-vector network directly affects the performance of the whole system;
(2) The number of cluster centers is hard to determine, and clustering easily falls into local optima. The word-vector method realizes topic modeling through a clustering strategy, but classical, fast clustering algorithms, for example K-Means, often get trapped in local optima and can hardly discriminate well in the complete absence of supervision, which degrades the performance of the language model;
(3) Averaging word vectors to obtain a sentence vector is to some extent unreasonable. The word-vector network converts words into vectors, and the clustering method based on word vectors takes the mean of the word vectors in a sentence as the vector of that sentence, which is then used for the topic model of the next step. However, function words often occur at a high rate in a sentence without contributing clear semantics; simple averaging over-weights them even though most function words cannot characterize the semantics well, while the keywords that should matter in the sentence are ignored, which degrades the performance of the language model;
(4) The engineering implementation is relatively complex. Although open-source tools exist for word-vector networks, the subsequent word-and-sentence vector conversion, topic-center clustering, sentence-by-sentence cosine distance and the like are hard to realize directly with open-source tools; engineers must write their own code, which makes obtaining a language model inconvenient.
iFLYTEK published a paper at the INTERSPEECH 2016 international conference that judges relevance by the cosine distance of word vectors; it suffers from the same problems.
For the above problem of poor language model performance, no effective solution has yet been proposed.
Summary of the invention
Embodiments of the present invention provide a method and a device for obtaining a language model, so as to at least solve the technical problem of poor language model performance.
According to one aspect of the embodiments of the present invention, a method for obtaining a language model is provided. The method includes: obtaining a first corpus and a second corpus, where the first corpus is randomly collected language text and the second corpus is language text selected under a preset context; using a first language model trained on the second corpus, computing the perplexity of the first corpus and screening out a third corpus, where the perplexity between the third corpus and the second corpus is below a preset threshold; and interpolating a second language model trained on the first corpus with a third language model trained on the third corpus to obtain the language model to be used.
According to another aspect of the embodiments of the present invention, a device for obtaining a language model is further provided. The device includes: an acquisition module, configured to obtain a first corpus and a second corpus, where the first corpus is randomly collected language text and the second corpus is language text selected under a preset context; a computing module, configured to compute, using a first language model trained on the second corpus, the perplexity of the first corpus and screen out a third corpus, where the perplexity between the third corpus and the second corpus is below a preset threshold; and a processing module, configured to interpolate a second language model trained on the first corpus with a third language model trained on the third corpus to obtain the language model to be used.
In the embodiments of the present invention, a first corpus and a second corpus are obtained, where the first corpus is randomly collected language text and the second corpus is language text selected under a preset context; a first language model trained on the second corpus is used to compute the perplexity of the first corpus and screen out a third corpus, where the perplexity between the third corpus and the second corpus is below a preset threshold; and a second language model trained on the first corpus is interpolated with a third language model trained on the third corpus to obtain the language model to be used. Because the first language model, trained on text chosen under the preset context, is used to compute the perplexity of the randomly collected first corpus, the third corpus with low perplexity against the second corpus is screened out, the third language model is trained on the third corpus, and interpolating the second and third language models yields the language model to be used, the purpose of obtaining a language model is achieved; the technical effect of improving the performance of the language model is thereby realized, and the technical problem of poor language model performance is solved.
Detailed description of the invention
The drawings described herein are provided for a further understanding of the present invention and constitute a part of this application; the illustrative embodiments of the present invention and their description are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a schematic diagram of a system for obtaining a language model according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method for obtaining a language model according to an embodiment of the present invention;
Fig. 3 is a schematic flow diagram of a method for obtaining a language model according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a device for obtaining a language model according to an embodiment of the present invention; and
Fig. 5 is a structural block diagram of a terminal according to an embodiment of the present invention.
Specific embodiment
In order to enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below in conjunction with the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and the like in the specification, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the present invention described herein can be implemented in orders other than those illustrated or described herein. Moreover, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or are inherent to the process, method, product or device.
Embodiment 1
An embodiment of the present invention provides a system for obtaining a language model. Fig. 1 is a schematic diagram of a system for obtaining a language model according to an embodiment of the present invention. As shown in Fig. 1, the system 100 for obtaining a language model includes: an input device 102, a processor 104 and an output device 106.
The input device 102 is configured to input the first corpus and the second corpus to the processor 104, where the first corpus is randomly collected language text and the second corpus is language text selected under a preset context.
A corpus is language text; it can be randomly collected language text, or language text selected under a preset context. As the training-set corpus of the language model to be used, the corpus can come from all kinds of sources in daily life; that is, it can be drawn from various channels covering many aspects of life, for example, corpora from information annotation, corpora from web scraping, corpora from open-source libraries, and effective in-domain corpora provided by users, with rich sources and a large data volume. The corpus can correspond to a certain number of tasks in a field; that is, corpora similar to some targeted task can be found within a corpus of very heterogeneous origin.
In this embodiment, the first corpus is randomly collected language text used for training the language model; that is, the first corpus is the training set, its data volume can be very large, it can be the whole training-set corpus, its composition is complex, and it is kept in a pool. The second corpus is language text selected under a preset context, used to tune the language model toward a specific target; for example, the second corpus is a passage of text that a preset user provides under the preset context.
The processor 104 is connected to the input device 102 and is configured to: obtain the first corpus and the second corpus; compute, using the first language model trained on the second corpus, the perplexity of the first corpus, and screen out the third corpus, where the perplexity between the third corpus and the second corpus is below the preset threshold; and interpolate the second language model trained on the first corpus with the third language model trained on the third corpus to obtain the language model to be used.
Optionally, the processor 104 obtains the first corpus and the second corpus. After obtaining the first corpus and the second corpus, the processor 104 trains the second language model on the first corpus. The second language model, i.e., the base language model or training-set language model, can be obtained by training an N-gram language model on the first corpus, where the standard training procedure of an N-gram model is to estimate the probability of each n-tuple, i.e., dividing the number of its occurrences by the total, and then smoothing over the results that did not occur. Optionally, the processor 104 trains one large N-gram language model on the whole training-set corpus as the above base language model. The first language model, i.e., the development-set language model, which can satisfy the needs of in-domain testing, is obtained by training an N-gram language model on the second corpus.
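As a minimal sketch of the standard N-gram training procedure just described, counting each n-tuple and dividing by the count of its history, the following Python fragment estimates trigram probabilities by maximum likelihood; the toy corpus and function name are illustrative assumptions, and a production model would also smooth over unseen n-tuples:

```python
from collections import Counter

def train_ngram_mle(sentences, n=3):
    """Estimate P(word | history) by maximum likelihood: the count of
    each n-tuple divided by the count of its (n-1)-word history."""
    ngram_counts, history_counts = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
        for i in range(n - 1, len(padded)):
            history = tuple(padded[i - n + 1:i])
            ngram_counts[history + (padded[i],)] += 1
            history_counts[history] += 1
    return {gram: count / history_counts[gram[:-1]]
            for gram, count in ngram_counts.items()}

# Toy stand-in for a corpus; real training sets are massive.
corpus = [["i", "want", "to", "file", "a", "patent"],
          ["i", "want", "to", "file", "an", "application"]]
model = train_ngram_mle(corpus)
print(model[("to", "file", "a")])  # 0.5: "a" follows "to file" half the time
```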
After the processor 104 has trained the second language model on the first corpus and the first language model on the second corpus, the processor 104 computes the perplexity of the first corpus using the first language model and screens out the third corpus. The third corpus is the targeted corpus filtered out of the first corpus according to the first language model. Optionally, the first language model is used to compute the perplexity (PPL) of the data in the first corpus, and the data with relatively low perplexity are selected as the third corpus; that is, the data in the first corpus are scored with the first language model and the data with relatively low scores are selected as the third corpus. For example, each sentence in the first corpus is scored with the first language model and the sentences with relatively low scores are selected as the third corpus. The lower the perplexity, the better the corresponding sentence fits the distribution of the model; the higher the perplexity, the worse the corresponding sentence fits the distribution of the model.
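The screening step itself can be sketched as follows, assuming a dev_lm object exposing a perplexity(sentence) method (an assumed interface rather than any specific library's API):

```python
def screen_third_corpus(first_corpus, dev_lm, threshold):
    """Score each sentence of the first corpus with the development-set
    (first) language model and keep the low-perplexity ones: a lower
    perplexity means a closer fit to the target distribution."""
    return [sentence for sentence in first_corpus
            if dev_lm.perplexity(sentence) < threshold]
```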
The perplexity between the third corpus and the second corpus of this embodiment is below a preset threshold, which can be a preset value or a computed value. The preset threshold can be a preset value, for example, a value preset according to experience; it can also be a computed value. For example, when computing the preset threshold, the data in the first corpus can be ranked by perplexity, data up to a quantity not exceeding a preset proportion are selected in order from the ranked first corpus, and the perplexity corresponding to the preset proportion is determined as the preset threshold. Optionally, each sentence in the first corpus is ranked by score, sentences up to a quantity not exceeding the preset proportion are selected in order from the ranked first corpus, and the score of the sentence corresponding to the preset proportion is determined as the preset threshold.
Preferably, this embodiment derives the threshold from a preset proportion of the data. For example, if 5% of the total data volume is needed, the perplexity value bounding that 5% of the data is determined as the preset threshold; for another example, to select the best-scoring 20% of the data, the perplexity corresponding to the data point at the 20% position is computed (looked up) and determined as the preset threshold, so that the data volume below the preset threshold accounts for exactly 20%. This method is not limited by the N-gram order or the number of n-tuples, and can therefore better meet the needs of various tasks.
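A sketch of deriving the threshold from a preset proportion, under the same assumed dev_lm.perplexity interface: all scores are sorted ascending and the value at the ratio cutoff is returned, so that roughly that share of the data falls below the threshold.

```python
def threshold_from_ratio(first_corpus, dev_lm, ratio=0.2):
    """Return the perplexity value at the ratio cutoff, so that roughly
    `ratio` of the first corpus scores below the returned threshold."""
    scores = sorted(dev_lm.perplexity(s) for s in first_corpus)
    cut = min(len(scores) - 1, int(len(scores) * ratio))
    return scores[cut]
```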
When the processor 104 selects the data with relatively low perplexity, the processor 104 judges whether the perplexity is below the preset threshold; if the perplexity is below the preset threshold, the data corresponding to that perplexity are retained, and if the perplexity is greater than or equal to the preset threshold, the data corresponding to that perplexity are discarded. When obtaining the language model, this embodiment performs no explicit purpose modeling for the language model; instead, directly on the basis of the second corpus selected under the preset context, it searches the massive, randomly collected first corpus for the third corpus, whose perplexity against the second corpus is low. After the processor 104 has computed the perplexity of the first corpus with the first language model and screened out the third corpus, the processor 104 trains the third language model on the third corpus; the third language model, also called the screening language model, can be obtained by training an N-gram language model on the third corpus.
After the perplexity of the first corpus has been computed with the first language model, the third corpus screened out, and the third language model trained on the third corpus, the processor 104 interpolates the second language model trained on the first corpus with the third language model trained on the third corpus, so that the second language model and the third language model are interpolated to obtain the language model to be used, which satisfies the demands of generalization ability and in-domain fit at the same time.
Optionally, the processor 104 performs interpolation processing on the second language model and the third language model to obtain interpolation coefficients, and merges the multiple N-gram language models into one unified language model according to the interpolation coefficients, obtaining the language model to be used, whose performance can be tested on a test set. Optionally, the interpolation coefficients are estimated; they can be estimated according to experience.
Optionally, the processor 104 trains one large N-gram language model on the whole training-set corpus as the base language model, and at the same time trains the screening language model corresponding to the screened data to meet the needs of in-domain testing; the base language model and the screening language model are combined by a weighted average with certain weights, yielding a final language model that satisfies the demands of generalization ability and in-domain fit at the same time.
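The interpolation itself can be sketched as a weighted sum of the component models' conditional probabilities, assuming both models expose a prob(word, history) method (an assumed interface):

```python
class InterpolatedLM:
    """Linear interpolation of the base (training-set) model and the
    screening model: P(w|h) = w1 * P_base(w|h) + w2 * P_screened(w|h),
    with w1 + w2 = 1."""
    def __init__(self, base_lm, screened_lm, w_base):
        self.base, self.screened, self.w = base_lm, screened_lm, w_base

    def prob(self, word, history):
        return self.w * self.base.prob(word, history) \
             + (1.0 - self.w) * self.screened.prob(word, history)
```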
The output device 106 is connected to the processor 104 and is configured to output the language model to be used. The output device 106 may include, but is not limited to, devices such as a display and a printer.
As an optional implementation, the above processor 104 includes: a data preparation module 108, a model training module 110, a perplexity computing module 112 and a data screening module 114.
The data preparation module 108 is configured to prepare the data for obtaining the language model to be used, obtaining the first corpus and the second corpus.
The model training module 110 is configured to train the second language model on the first corpus, and to train the first language model on the second corpus.
The perplexity computing module 112 is configured to determine the language segmentation units in the first corpus, and to compute, with the first language model, the perplexity of the word sequence contained in each language segmentation unit of the first corpus, successively obtaining a calculation result corresponding to each language segmentation unit, where the calculation result corresponding to each unit indicates the perplexity between the word sequence contained in that unit and the first language model. Optionally, for each word in the word sequence contained in each unit, the word probability conditioned on the preceding N-1 words is successively obtained, where the value of N is predetermined according to the first language model; the occurrence probability is obtained by multiplying the word probabilities corresponding to the word sequence, and maximum likelihood estimation can be used to compute each word probability; the occurrence probability is used to obtain the cross entropy corresponding to the word sequence contained in each unit; and the cross entropy is set as the exponent with a preset value as the base to perform exponentiation, obtaining the calculation result corresponding to each language segmentation unit.
A language segmentation unit is the unit over which perplexity is computed; it can be each sentence in the first corpus, or each paragraph in the first corpus. A language segmentation unit contains a word sequence and can be composed of multiple words. The perplexity computing module 112 obtains the word sequence contained in each language segmentation unit of the first corpus and computes its perplexity with the first language model, successively obtaining a calculation result corresponding to each unit; for example, the development-set language model obtained by training is used to compute the perplexity of each sentence in the training set. Here perplexity can be used to describe the degree of similarity of a passage, a sentence or a word to an N-gram language model. The lower the perplexity, the lower the computed mismatch between the word sequence and the first language model; that is, the higher the similarity to the test-set distribution and the better the corresponding language segmentation unit matches the expectation of the model's distribution. The higher the perplexity, the worse the corresponding language segmentation unit matches the distribution of the model.
Optionally, the perplexity computing module 112 computes the occurrence probability, in the first corpus, of the word sequence contained in each language segmentation unit. For example, for the N-th word W in the word sequence contained in a language segmentation unit, the word probability of W is uniquely determined only by the N-1 words before it, and maximum likelihood estimation (MLE) can be used to compute the word probability corresponding to each word sequence. Multiplying the word probabilities corresponding to the word sequence gives the occurrence probability, in the first corpus, of the word sequence contained in each language segmentation unit, i.e., P(W) = P(W | W_1 W_2 … W_{N-1}), where W_i is the i-th word before the word W.
After the perplexity computing module 112 obtains the occurrence probability, in the first corpus, of the word sequence contained in each language segmentation unit, the perplexity computing module 112 uses the occurrence probability to obtain the cross entropy corresponding to the word sequence contained in each unit. For example, for a passage containing the word sequence W_1, W_2 … W_N, the cross entropy of the word sequence is H = -(1/N) * sum_{i=1..N} log2 P(W_i | W_1 … W_{i-1}), where N is the number of words the language segmentation unit contains. The perplexity computing module 112 can set the cross entropy as the exponent of the exponentiation and a preset value as the base of the exponentiation, and perform the exponentiation according to the cross entropy and the preset value, obtaining the calculation result corresponding to each language segmentation unit. For example, with 2 determined as the base of the exponentiation, 2^H is the perplexity of the word sequence W_1, W_2 … W_N. The smaller 2^H, the higher the similarity of the corresponding unit to the test-set distribution and the better it matches the expectation of the model's distribution; the larger 2^H, the worse the corresponding language segmentation unit matches the distribution of the model.
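Putting the above chain of steps into code, conditional word probabilities, their product taken as a sum of logarithms, the cross entropy H, and the final exponentiation 2^H, a hedged Python sketch under the same assumed lm.prob(word, history) interface:

```python
import math

def perplexity(lm, tokens, n=3):
    """PPL = 2**H with H = -(1/N) * sum(log2 P(w_i | preceding n-1 words));
    summing logs instead of multiplying raw probabilities avoids underflow."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    log_sum, count = 0.0, 0
    for i in range(n - 1, len(padded)):
        log_sum += math.log2(lm.prob(padded[i], tuple(padded[i - n + 1:i])))
        count += 1
    return 2 ** (-log_sum / count)
```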
The data screening module 114 is configured to screen, according to the preset threshold, the computed calculation results corresponding to the language segmentation units, obtaining the third corpus.
After the perplexity computing module 112 has successively obtained the calculation result corresponding to each language segmentation unit, the data screening module 114 screens the computed calculation results corresponding to the language segmentation units according to the preset threshold, obtaining the third corpus. When a calculation result is below the preset threshold, the data screening module 114 retains the language segmentation unit corresponding to that result; when a calculation result is greater than or equal to the preset threshold, the data screening module 114 discards the language segmentation unit corresponding to that result.
Optionally, the processor 104 obtains the language model to be used by interpolating the second language model and the third language model. Multiple groups of weight value combinations are determined, where each group of weight values includes: a first weight value corresponding to the second language model and a second weight value corresponding to the third language model. Using the currently chosen weight value combination, a weighted average of the second language model and the third language model is computed, obtaining a candidate language model. The candidate language model is used to compute the perplexity of the word sequence contained in each language segmentation unit of the second corpus, obtaining a perplexity evaluation result corresponding to the second corpus. It is then judged whether there is a weight value combination not yet chosen among the groups; if so, the method returns to computing the weighted average of the second language model and the third language model with the currently chosen weight value combination to obtain a candidate language model; if not, the comparison step is performed. Comparison step: the perplexity evaluation results corresponding to the groups of weight values are comprehensively compared, and the candidate language model with the lowest perplexity evaluation result is chosen as the language model to be used.
In this embodiment, the input device 102 inputs the first corpus and the second corpus to the processor 104, where the first corpus is randomly collected language text and the second corpus is language text selected under a preset context; the processor 104, connected to the input device 102, obtains the first corpus and the second corpus, trains the second language model on the first corpus and the first language model on the second corpus, computes the perplexity of the first corpus with the first language model and screens out the third corpus, trains the third language model on the third corpus, where the perplexity between the third corpus and the second corpus is below the preset threshold, and interpolates the second language model with the third language model to obtain the language model to be used; the output device 106, connected to the processor 104, outputs the language model to be used. Because the first language model trained on text chosen under the preset context is used to compute the perplexity of the randomly collected first corpus, the third corpus with low perplexity against the second corpus is screened out, the third language model is trained on the third corpus, and the second and third language models are interpolated to obtain the language model to be used, the purpose of obtaining a language model is achieved; the technical effect of improving the performance of the language model is thereby realized, and the technical problem of poor language model performance is solved.
Embodiment 2
According to an embodiment of the present invention, an embodiment of a method for obtaining a language model is further provided. It should be noted that the steps shown in the flowchart of the drawings can be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described can be executed in an order different from the one herein.
Fig. 2 is a flowchart of a method for obtaining a language model according to an embodiment of the present invention. As shown in Fig. 2, the method for obtaining a language model includes the following steps:
Step S202: obtain the first corpus and the second corpus.
In the technical solution provided by the above step S202 of the present invention, the first corpus and the second corpus are obtained, where the first corpus is randomly collected language text and the second corpus is language text selected under a preset context.
A corpus can be randomly collected language text, or language text selected under a preset context. As the training-set corpus of the language model to be used, the corpus can come from all kinds of sources in daily life, for example, corpora from information annotation, corpora from web scraping, corpora from open-source libraries, and effective in-domain corpora provided by users, with rich sources and a large data volume. The corpus can correspond to a certain number of tasks in a field.
In this embodiment, the data for obtaining the language model to be used are prepared, and the first corpus and the second corpus are obtained, where the first corpus is randomly collected language text used for training the language model; that is, the first corpus is the training set, its data volume can be very large, it can be the whole training-set corpus, its composition is complex, and it is kept in a pool. The second corpus is language text selected under a preset context, used to tune the language model toward a specific target; for example, the second corpus is a passage of text that a preset user provides under the preset context.
Step S204: using the first language model trained on the second corpus, compute the perplexity of the first corpus, and screen out the third corpus.
In the technical solution provided by the above step S204 of the present invention, the first language model trained on the second corpus is used to compute the perplexity of the first corpus, and the third corpus is screened out, where the perplexity between the third corpus and the second corpus is below the preset threshold.
In this embodiment, after the first corpus and the second corpus are obtained, the second language model is trained on the first corpus. The second language model, i.e., the base language model or training-set language model, can be obtained by training an N-gram language model on the first corpus, where the standard training procedure of an N-gram model is to estimate the probability of each n-tuple, i.e., dividing the number of its occurrences by the total, and then smoothing over the results that did not occur. Optionally, one large N-gram language model is trained on the whole training-set corpus as the above base language model. Optionally, the first language model, i.e., the development-set language model, which can satisfy the needs of in-domain testing, is obtained by training on the second corpus.
The first language model is used to compute the perplexity of the first corpus, the third corpus is screened out, and the third language model is obtained by training on the third corpus, where the perplexity between the third corpus and the second corpus is below the preset threshold.
After the second language model has been trained on the first corpus and the first language model on the second corpus, the perplexity of the first corpus is computed with the first language model and the third corpus is screened out. The third corpus is the targeted corpus filtered out of the first corpus according to the first language model. Optionally, the first language model is used to compute the perplexity (PPL) of the data in the first corpus, and the data with relatively low perplexity are selected as the third corpus; that is, the data in the first corpus are scored with the first language model and the data with relatively low scores are selected as the third corpus. For example, each sentence in the first corpus is scored with the first language model and the sentences with relatively low scores are selected as the third corpus. The lower the perplexity, the better the corresponding sentence fits the distribution of the model; the higher the perplexity, the worse the corresponding sentence fits the distribution of the model.
The perplexity between the third corpus and the second corpus of this embodiment is below a preset threshold. The preset threshold can be a preset value, for example, a value preset according to experience; it can also be a computed value. Optionally, when computing the preset threshold, the data in the first corpus can be ranked by perplexity, data up to a quantity not exceeding a preset proportion are selected in order from the ranked first corpus, and the perplexity corresponding to the preset proportion is determined as the preset threshold; for example, each sentence in the first corpus can be ranked by score, sentences up to a quantity not exceeding the preset proportion are selected in order from the ranked first corpus, and the score of the sentence corresponding to the preset proportion is determined as the preset threshold. For example, if 5% of the total data volume is needed, the perplexity value bounding that 5% of the data is determined as the preset threshold; for another example, to select the best-scoring 20% of the data, the perplexity corresponding to the data point at the 20% position is computed (looked up) and determined as the preset threshold, so that the data volume below the preset threshold accounts for exactly 20%. This method is not limited by the N-gram order or the number of n-tuples, and can therefore better meet the needs of various tasks.
Optionally, when selecting the data with relatively low perplexity, whether the perplexity is below the preset threshold is judged; if the perplexity is below the above preset threshold, the data corresponding to that perplexity are retained, and if the perplexity is greater than or equal to the above preset threshold, the data corresponding to that perplexity are discarded.
The above method of computing the perplexity of the first corpus with the first language model and screening out the third corpus is not limited by the N-gram order or the number of n-tuples, so it can better meet the needs of various tasks; moreover, it requires no explicit purpose modeling for the language model: directly on the basis of the second corpus selected under the preset context, the third corpus with low perplexity against the second corpus is searched out of the massive, randomly collected first corpus.
After the perplexity of the first corpus has been computed with the first language model and the third corpus screened out, the third language model is obtained by training on the third corpus; it can be obtained by training an N-gram language model on the third corpus, and the third language model is also the screening language model.
Optionally, this embodiment makes no topical assumption, so there is no clustering operation, and the problem of falling into local optima is also avoided.
Step S206: interpolate the second language model trained on the first corpus with the third language model trained on the third corpus to obtain the language model to be used.
In the technical solution provided by the above step S206 of the present invention, after the perplexity of the first corpus has been computed with the first language model, the third corpus screened out, and the third language model trained on the third corpus, the second language model and the third language model are interpolated to obtain the language model to be used, which satisfies the demands of generalization ability and in-domain fit at the same time.
Optionally, interpolation is performed on the second language model and the third language model to obtain interpolation coefficients, and the multiple N-gram language models are merged into one unified language model according to the interpolation coefficients, obtaining the language model to be used, whose performance can be tested on a test set. Optionally, the interpolation coefficients are estimated; they can be estimated according to experience.
Optionally, one large N-gram language model is trained on the whole training-set corpus as the base language model, while the screening language model corresponding to the screened data is trained to meet the needs of in-domain testing; the base language model and the screening language model are combined by a weighted average with certain weights, yielding a final language model that satisfies the demands of generalization ability and in-domain fit at the same time.
Through the above steps S202 to S206, the first corpus and the second corpus are obtained, where the first corpus is randomly collected language text and the second corpus is language text selected under a preset context; the first language model trained on the second corpus is used to compute the perplexity of the first corpus and screen out the third corpus, where the perplexity between the third corpus and the second corpus is below the preset threshold; and the second language model trained on the first corpus is interpolated with the third language model trained on the third corpus to obtain the language model to be used. Because the perplexity of the first corpus is computed with the first language model to screen out the third corpus, the third language model is trained on the third corpus, and the second and third language models are interpolated to obtain the language model to be used, the purpose of obtaining a language model is achieved; the technical effect of improving the performance of the language model is thereby realized, and the technical problem of poor language model performance is solved.
As an optional implementation, using the first language model trained on the second corpus to compute the perplexity of the first corpus and screen out the third corpus includes: determining the language segmentation units in the first corpus; computing, with the first language model, the perplexity of the word sequence contained in each language segmentation unit of the first corpus, successively obtaining a calculation result corresponding to each language segmentation unit, where the calculation result corresponding to each unit indicates the perplexity between the word sequence contained in that unit and the first language model; and screening the computed calculation results corresponding to the language segmentation units according to the preset threshold, obtaining the third corpus.
When the perplexity of the first corpus is computed with the first language model and the third corpus is screened out, the language segmentation units in the first corpus are determined. A language segmentation unit is the unit over which perplexity is computed; it can be each sentence in the first corpus, or each paragraph in the first corpus. A language segmentation unit contains a word sequence and can be composed of multiple words. The word sequence contained in each language segmentation unit of the first corpus is obtained, its perplexity is computed with the first language model, and a calculation result corresponding to each unit is successively obtained; for example, the development-set language model obtained by training is used to compute the perplexity of each sentence in the training set. Here perplexity can be used to describe the degree of similarity of a passage, a sentence or a word to an N-gram language model. The lower the perplexity, the higher the similarity between the computed word sequence and the first language model; that is, the higher the similarity to the test-set distribution and the better the corresponding language segmentation unit matches the expectation of the model's distribution. The higher the perplexity, the worse the corresponding language segmentation unit matches the distribution of the model.
Optionally, there are many open-source tools related to N-grams, and this embodiment can use an open-source tool supporting N-gram models to determine the language segmentation units in the first corpus, compute, with the first language model, the perplexity of the word sequence contained in each language segmentation unit of the first corpus, and successively obtain the calculation result corresponding to each unit. The open-source tool can be SRILM; the above function can be implemented through the ngram command of SRILM, and with only a little script-level work the entire screening process can be completed, which is very convenient for engineering implementation and for integration with various systems.
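As a hedged illustration of how little scripting this takes with SRILM (assuming its binaries are installed and on the PATH; the file names are placeholders), training the development-set model and scoring the first corpus reduce to two commands:

```python
import subprocess

# Train the development-set (first) language model.
subprocess.run(["ngram-count", "-order", "3",
                "-text", "dev.txt", "-lm", "dev.lm"], check=True)

# Score the first corpus; with -debug 1, ngram prints per-sentence
# statistics from which a screening script can read each perplexity.
result = subprocess.run(["ngram", "-lm", "dev.lm", "-ppl", "train.txt",
                         "-debug", "1"],
                        capture_output=True, text=True, check=True)
print(result.stdout)
```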
The smaller the perplexity, the higher the similarity between the word sequence contained in a language segmentation unit and the first language model, so the word sequence contained in that unit is estimated to be similar to the first language model and is selected as in-domain corpus. After the calculation result corresponding to each language segmentation unit has been successively obtained, the computed calculation results corresponding to the language segmentation units are screened according to the preset threshold, obtaining the third corpus. When a calculation result is below the threshold, the language segmentation unit corresponding to that result is retained; when a calculation result is greater than or equal to the threshold, the language segmentation unit corresponding to that result is discarded.
By determining the language segmentation units in the first corpus, computing, with the first language model, the perplexity of the word sequence contained in each language segmentation unit of the first corpus, successively obtaining the calculation result corresponding to each unit, and screening the computed calculation results according to the preset threshold, this embodiment obtains the third corpus, so that computing the perplexity of the first corpus with the first language model and screening out the third corpus is realized; the third language model is then obtained by training on the third corpus, and the second language model and the third language model are interpolated to obtain the language model to be used. Since the perplexity-based screening only needs one basic N-gram model trained on the development-set data, its computation is negligible compared with a word-vector network that needs large-scale corpus training; the heavy computing resources needed to train a word-vector network are avoided, and the technical effect of improving the performance of the language model is realized.
As an optional implementation, computing, with the first language model, the perplexity of the word sequence contained in each language segmentation unit of the first corpus, and successively obtaining the calculation result corresponding to each unit, includes: computing the occurrence probability, in the first corpus, of the word sequence contained in each language segmentation unit; using the occurrence probability to obtain the cross entropy corresponding to the word sequence contained in each unit; and setting the cross entropy as the exponent with a preset value as the base, performing the exponentiation to obtain the calculation result corresponding to each unit.
The occurrence probability, in the first corpus, of the word sequence contained in each language segmentation unit characterizes how frequently that word sequence occurs in the first corpus. When the first language model is used to compute the perplexity of the word sequence contained in each language segmentation unit of the first corpus and the calculation result corresponding to each unit is successively obtained, the occurrence probability, in the first corpus, of the word sequence contained in each unit is computed. After the occurrence probability of the word sequence contained in each unit in the first corpus is obtained, the occurrence probability is used to obtain the cross entropy corresponding to the word sequence contained in each unit.
For example, for a passage containing the word sequence W_1, W_2 … W_N, the cross entropy of the word sequence is H = -(1/N) * sum_{i=1..N} log2 P(W_i | W_1 … W_{i-1}), where N is the number of words the language segmentation unit contains.
After the occurrence probability has been used to obtain the cross entropy corresponding to the word sequence contained in each language segmentation unit, exponentiation is performed on the cross entropy, obtaining the calculation result corresponding to each unit. The cross entropy can be set as the exponent of the exponentiation and a preset value as the base of the exponentiation, and the exponentiation is carried out according to the cross entropy and the preset value, obtaining the calculation result corresponding to each language segmentation unit.
For example, for the above passage containing the word sequence W_1, W_2 … W_N, H is set as the exponent of the exponentiation and 2 is determined as the base; then 2^H is the perplexity of the word sequence W_1, W_2 … W_N. The smaller 2^H, the higher the similarity of the corresponding unit to the test-set distribution and the better it matches the expectation of the model's distribution; the larger 2^H, the worse the corresponding language segmentation unit matches the distribution of the model.
By computing the occurrence probability, in the first corpus, of the word sequence contained in each language segmentation unit, using the occurrence probability to obtain the cross entropy corresponding to the word sequence contained in each unit, and setting the cross entropy as the exponent with a preset value as the base to perform the exponentiation, this embodiment obtains the calculation result corresponding to each unit, so that computing the perplexity of the word sequence contained in each language segmentation unit with the first language model is realized; the computed calculation results are then screened according to the preset threshold to obtain the third corpus, the third language model is obtained by training on the third corpus, and the second language model and the third language model are interpolated to obtain the language model to be used, realizing the technical effect of improving the performance of the language model.
As an optional implementation, computing the occurrence probability, in the first corpus, of the word sequence contained in each language segmentation unit includes: successively obtaining, for each word in the word sequence contained in each unit, the word probability conditioned on the preceding N-1 words, where the value of N is predetermined according to the first language model; and obtaining the occurrence probability by multiplying the word probabilities corresponding to the word sequence.
The model training of this embodiment is carried out according to N-gram language model training. When computing the occurrence probability, in the first corpus, of the word sequence contained in each language segmentation unit, optionally, it is assumed that for the N-th word W in the word sequence contained in a language segmentation unit, the word probability of W is uniquely determined only by the N-1 words before it; multiplying the word probabilities corresponding to the word sequence gives the occurrence probability, in the first corpus, of the word sequence contained in each unit, i.e., P(W) = P(W | W_1 W_2 … W_{N-1}), where W_i is the i-th word before the word W and the value of N is predetermined according to the first language model.
By successively obtaining, for each word in the word sequence contained in each unit, the word probability conditioned on the preceding N-1 words, and obtaining the occurrence probability by multiplying the word probabilities corresponding to the word sequence, this embodiment computes the occurrence probability, in the first corpus, of the word sequence contained in each language segmentation unit without converting sentences into sentence-level vectors, which is very convenient for engineering implementation and for integration with various systems and reduces the complexity of the engineering implementation.
As an optional implementation, maximum likelihood estimation (MLE) is used to compute the word probability corresponding to each word sequence.
In the training process, the probability P of each tuple's occurrence is estimated with maximum likelihood estimation, where a tuple is a unit composed of multiple words, for example, an X-tuple composed of the words X1, X2 and X3. The number of occurrences in the corpus of each X-tuple (W_1, W_2 … W_X) is counted, where X ≤ N; the individual count is then divided by the total of the corresponding tuples, giving P(W_X | W_{X-1} … W_1), so that the word probability corresponding to each word sequence is obtained.
The size of a tuple is assumed in advance. For example, in "I want to apply for a patent", the probability of "patent" should be the probability that the combination "I want to apply for a patent" occurs, or be obtained from P(frequency of "patent" | frequency of "I want to apply"); if only the influence of the preceding 3 words is considered, the probability of "patent" occurring is just P(frequency of "patent" | "I want to apply").
Optionally, by smooth operation to make up the cavity in statistical counting, thus the further standard of lift scheme estimation True property.
Optionally, above-mentioned function is realized using the ngram-count order of SRILM Open-Source Tools.
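As a sketch only, SRILM's ngram-count command could be invoked along the following lines; the file names and the choice of smoothing flag are assumptions for the example rather than settings taken from this embodiment.

import subprocess

# Train a trigram LM from a tokenized text file (one sentence per line).
# -kndiscount selects modified Kneser-Ney smoothing, one common way to make
# up for holes in the statistical counts.
subprocess.run(
    ["ngram-count",
     "-text", "train.txt",
     "-order", "3",
     "-kndiscount",
     "-lm", "model.lm"],  # output model in ARPA format
    check=True,
)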
As an alternative embodiment, performing interpolation processing on the second language model and the third language model to obtain the language model ready for use comprises: a determining step: determining multiple groups of weight value combinations, wherein each group of weight value combinations includes a first weight value corresponding to the second language model and a second weight value corresponding to the third language model; a calculating step: performing weighted average calculation on the second language model and the third language model using the currently chosen weight value combination, obtaining an alternative language model; a processing step: performing degree of aliasing calculation separately, using the alternative language model, on the word sequence contained in each language segmentation unit in the second corpus, obtaining a degree of aliasing assessment result corresponding to the second corpus, and judging whether there is a weight value combination not yet chosen among the multiple groups of weight value combinations: if so, returning to the calculating step; if not, continuing to a comparison step; the comparison step: comprehensively comparing the degree of aliasing assessment results corresponding to the respective groups of weight value combinations, and choosing the alternative language model with the lowest degree of aliasing assessment result as the language model ready for use.
That is, when the second language model and the third language model are subjected to interpolation processing to obtain the language model ready for use, this can be realized through the determining step, the calculating step, the processing step and the comparison step.
In the determining step, multiple groups of weight value combinations are determined; each group includes a first weight value corresponding to the second language model and a second weight value corresponding to the third language model, that is, a first weight value corresponding to the training-set language model and a second weight value corresponding to the screening language model. The first weight and the second weight are the interpolation coefficients, estimated algorithmically, i.e. measured on the development set.
In the calculating step, weighted average calculation is performed on the second language model and the third language model using the currently chosen weight value combination to obtain an alternative language model; that is, weighted average calculation is performed on the training-set language model and the screening language model using the currently chosen weight value combination, obtaining the alternative language model.
In the processing step, degree of aliasing calculation is performed separately, using the alternative language model, on the word sequence contained in each language segmentation unit in the second corpus, obtaining a degree of aliasing assessment result corresponding to the second corpus. If there remains a weight value combination not yet chosen among the groups of weight value combinations, the flow returns to the calculating step: a weight value combination is chosen again and weighted average calculation is performed on the second language model and the third language model to obtain a new alternative language model.
In the comparison step, when no weight value combination remains unchosen among the groups of weight value combinations, the degree of aliasing assessment results corresponding to the respective groups are comprehensively compared, and the alternative language model with the lowest degree of aliasing assessment result is chosen as the language model ready for use.
In this embodiment, because an N-GRAM language model is used as the first (screening) language model, there are inevitably some differences between the first language model and the test set; moreover, because its domain specificity is too strong, its generalization ability is insufficient. The method of model interpolation is therefore used to obtain the language model ready for use. That is, the whole training-set corpus is used to train a larger second language model as the bottoming model, and the screened data is used to train the corresponding third language model to meet the needs of in-domain testing. Optionally, the whole training-set corpus is used to train one N-GRAM language model as the bottoming language model, while the screened data is used to train the corresponding screening language model to meet the needs of in-domain testing; the two are combined by weighted average according to certain weights, finally obtaining a language model that satisfies both generalization ability and in-domain demand.
The screening of this embodiment, being based on degree of aliasing, only requires training one basic N-Gram model on the development-set data; its computation, compared with a word-vector network that requires large-scale corpus training, is negligible. The embodiment of the present invention makes no topicality assumption, so there is no clustering operation and the problem of falling into a local optimum is avoided. The hypothesis underlying the N-GRAM model is the conditional probability distribution of N-grams, so no sentence-level vector conversion is needed. There are many open-source tools related to N-GRAM; the embodiment of the present invention can use any open-source tool that supports N-GRAM models, and only a few script-level operations are needed to complete the entire screening process, which greatly facilitates engineering implementation and integration with various systems, thereby improving the performance of the language model.
Based on the development-set N-GRAM language model, this embodiment can predict the sample distribution of the test set to a certain extent; the degree of aliasing can well characterize, sentence by sentence, an estimate of the degree of approximation between the text and the test set, and the basic computational element can be changed from sentence to article, thereby providing targeted optimization for each product line.
Embodiment 3
The technical solution of the present invention is illustrated below with reference to a preferred embodiment. Specifically, the illustration takes the first corpus as the training-set corpus, the second corpus as the development-set corpus, the first language model as the development-set language model, the second language model as the bottoming language model, and the third language model as the screening language model.
Fig. 3 is a flow diagram of a method for acquiring a language model according to an embodiment of the present invention. As shown in Fig. 3, the method for acquiring the language model comprises the following steps:
Step S301: obtain the training-set corpus.
In the process of acquiring the language model, text corpora from a large number of various sources are collated as the training-set corpus; the training-set corpus is language text of random acquisition.
Step S302: obtain the development-set corpus.
A certain amount of corpus chosen according to the domain task is determined as the development-set corpus; the development-set corpus is language text chosen under the default context.
Step S303: obtain the bottoming language model by performing model training on the training-set corpus.
After the training-set corpus is obtained, the bottoming language model is obtained by performing model training on the training-set corpus; the training can be carried out as an N-GRAM language model.
Step S304: obtain the development-set language model by performing model training on the development-set corpus.
After the development-set corpus is obtained, the development-set language model is obtained by performing model training on the development-set corpus; the training can likewise be carried out as an N-GRAM language model.
Step S305: perform degree of aliasing calculation separately, using the development-set language model, on the word sequence contained in each language segmentation unit in the training-set corpus, obtaining in turn the calculated result corresponding to each language segmentation unit.
Degree of aliasing (perplexity) is used to describe the degree of similarity between a piece of text (a sentence, or a word) and an N-GRAM language model. After the development-set corpus is obtained, the language segmentation units in the training-set corpus are determined, and degree of aliasing calculation is performed separately, using the development-set language model, on the word sequence contained in each language segmentation unit in the training-set corpus. That is, the degree of aliasing of every sentence in the training-set corpus can be calculated under the development-set language model; where the degree of aliasing is low, the similarity between the corresponding sentence and the test set is high. For an N-GRAM language model, the lower the degree of aliasing, the greater the probability of occurrence, that is, the closer the text is to the training corpus of that N-GRAM language model.
Optionally, the word probability conditioned on the preceding N-1 word sequences is successively sought for each word sequence contained in each language segmentation unit; the word probability corresponding to each word sequence can be calculated using a maximum likelihood estimation algorithm, wherein the value of N is predetermined according to the development-set language model. The probability of occurrence is obtained by performing a product operation on the word probabilities corresponding to the respective word sequences. After the probability of occurrence is obtained, the cross entropy corresponding to the word sequence contained in each language segmentation unit is derived from the probability of occurrence; the cross entropy is set as an exponent and a default value is set as the base to perform an exponential operation, obtaining the calculated result corresponding to each language segmentation unit.
Optionally, in the model training, this embodiment uses an N-GRAM language model. The model assumes that for any word W, its probability is uniquely determined by the N-1 words preceding it, i.e. P(W) = P(W | W1 W2 … WN-1), where Wi is the i-th word preceding W.
To this end, during training, the probability P of occurrence of each tuple is estimated by maximum likelihood: the number of occurrences in the corpus of each X-tuple (W1, W2, … WX) is counted, where X≤N, and the count of the tuple is divided by the count of its (X-1)-word prefix, serving as P(WX | W1 … WX-1).
In order to further improve the accuracy of the model estimation, a smoothing operation can be used to make up for holes in the statistical counts.
Optionally, the ngram-count command of the SRILM open-source toolkit is used to realize the above function.
Optionally, for a piece of content containing the word sequence W1, W2, … WN, define H = -(1/N) · log2 P(W1 W2 … WN) as the cross entropy of the word sequence; 2^H is then the degree of aliasing of the sequence. The lower the degree of aliasing, the better the text fits the distribution of the model; the higher the degree of aliasing, the less the text fits the distribution of the model.
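For illustration, a minimal Python sketch of this cross entropy and degree of aliasing (perplexity) computation, reusing the hypothetical train_ngram_mle probabilities sketched above; unseen n-grams are simply skipped here for brevity, whereas a real system would smooth.

import math

def sentence_perplexity(tokens, probs, n):
    # H = -(1/N) * sum(log2 P(W_i | preceding N-1 words)); perplexity = 2 ** H.
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    log_prob, counted = 0.0, 0
    for i in range(n - 1, len(padded)):
        p = probs.get(tuple(padded[i - n + 1:i + 1]))
        if p is None:
            continue  # unseen n-gram; smoothing would assign it a probability
        log_prob += math.log2(p)
        counted += 1
    h = -log_prob / max(counted, 1)  # cross entropy of the word sequence
    return 2 ** h                    # degree of aliasing of the sequence

print(sentence_perplexity(["I", "want", "to", "file", "a", "claim"], probs, n=3))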
Optionally, this above function is realized using the ngram command of the SRILM open-source toolkit.
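A sketch of how the ngram command could report a per-sentence degree of aliasing for the screening step; the file names are assumptions, and the exact output format should be checked against the SRILM documentation.

import subprocess

# Score each candidate sentence against the development-set model; with
# -debug 1, ngram prints statistics (including ppl) for every sentence.
result = subprocess.run(
    ["ngram",
     "-lm", "devset.lm",
     "-ppl", "trainset.txt",
     "-debug", "1"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # the per-sentence "ppl=" lines feed the screening step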
Step S306: screen, according to the preset threshold, the calculated results corresponding to the respective language segmentation units, obtaining the screening corpus.
After the calculated result corresponding to each language segmentation unit has been obtained in turn, the calculated results corresponding to the respective language segmentation units are screened according to the preset threshold to obtain the screening corpus.
When screening, the preset threshold is calculated or specified from the degrees of aliasing obtained, and the screening corpus consists of the items whose degree of aliasing is smaller than the threshold. Screening is performed by way of the threshold: language segmentation units below the preset threshold are retained, and language segmentation units greater than or equal to the threshold are discarded.
Optionally, the preset threshold can be preset empirically; this embodiment preferably uses the method of screening data by proportion. For example, if the task is considered, in the statistical sense, to need roughly 5% of the total amount of data, then the degree of aliasing corresponding to that 5% of the data is taken as the preset threshold. This method is not limited by the N-GRAM order or the number of tuples, and can better meet the needs of various tasks.
For example, the first language segmentation unit is "I think this matter is our mistake", with a corresponding first degree of aliasing of 12.67; the second language segmentation unit is "you decide", with a corresponding second degree of aliasing of 22.85; the third language segmentation unit is "who is he", with a corresponding third degree of aliasing of 87.99; and the fourth language segmentation unit is "I don't know yet", with a corresponding fourth degree of aliasing of 44.16. If the selected threshold is a degree of aliasing of 44, the degrees of aliasing corresponding to the respective language segmentation units are screened against the preset threshold: sentences with a degree of aliasing less than 44 are retained and sentences with a degree of aliasing greater than or equal to 44 are discarded, leaving the first language segmentation unit "I think this matter is our mistake" and the second language segmentation unit "you decide".
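A minimal sketch of this proportional screening, under the assumption that each sentence has already been scored with a degree of aliasing as above:

def screen_corpus(scored, keep_fraction=0.05):
    # scored: (sentence, degree of aliasing) pairs under the development-set LM.
    # The preset threshold is set at the keep_fraction percentile of the scores,
    # so roughly the lowest-scoring fraction of sentences is retained.
    ranked = sorted(scored, key=lambda pair: pair[1])
    cutoff = max(1, int(len(ranked) * keep_fraction))
    threshold = ranked[cutoff - 1][1]
    return [s for s, ppl in scored if ppl <= threshold], threshold

scored = [("I think this matter is our mistake", 12.67),
          ("you decide", 22.85),
          ("who is he", 87.99),
          ("I don't know yet", 44.16)]
kept, threshold = screen_corpus(scored, keep_fraction=0.5)
print(kept)  # the two sentences with the lowest degree of aliasing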
Step S307: obtain the screening language model by training on the screening corpus.
After the calculated results corresponding to the respective language segmentation units have been screened according to the preset threshold and the screening corpus obtained, the screening language model is obtained by training on the screening corpus.
The test set is used to test the performance of the language model ready for use. A word sequence with a smaller degree of aliasing is closer to the development-set distribution, and it is estimated that such a word sequence should also be similar to the test set (on the assumption that the development set is close to the test set); such a sequence should therefore be preferentially selected as in-domain corpus.
Step S308: perform model interpolation on the bottoming language model and the screening language model to obtain the final language model.
After the bottoming language model has been obtained by performing model training on the training-set corpus and the screening language model has been obtained by training on the screening corpus, model interpolation is performed on the bottoming language model and the screening language model to obtain the final language model.
Optionally, multiple groups of weight value combinations are determined, wherein each group of weight value combinations includes a first weight value corresponding to the training-set language model and a second weight value corresponding to the screening language model; weighted average calculation is performed on the training-set language model and the screening language model using the currently chosen weight value combination, obtaining an alternative language model; degree of aliasing calculation is performed separately, using the alternative language model, on the word sequence contained in each language segmentation unit in the development-set corpus, obtaining a degree of aliasing assessment result corresponding to the development-set corpus; whether there remains a weight value combination not yet chosen among the multiple groups of weight value combinations is judged, and if so, a weight value combination is chosen again and weighted average calculation is performed on the training-set language model and the screening language model to obtain a new alternative language model; if not, the degree of aliasing assessment results corresponding to the respective groups of weight value combinations are comprehensively compared, and the alternative language model with the lowest degree of aliasing assessment result is chosen as the language model ready for use.
For example, the bottoming language model M1 is trained on 1 billion items of data, and the screening model M2 is trained on 100 million items of data. If the weight of the bottoming language model M1 is 0.1 and the weight of the screening model is 0.9, then the probability of each word in M1 multiplied by 0.1 plus the probability of that word in M2 multiplied by 0.9 serves as the alternative language model M_0.1; degree of aliasing calculation is performed separately with M_0.1 on the word sequence contained in each language segmentation unit in the development-set corpus, obtaining the degree of aliasing assessment result corresponding to the development-set corpus. Whether there remains a weight value combination not yet chosen among the groups is judged; if so, a weight value combination is chosen again and weighted average calculation is performed on the training-set language model and the screening language model to obtain a new alternative language model, for example with a first weight of 0.2 and a second weight of 0.8, yielding a new alternative language model M_0.2; if not, the degree of aliasing assessment results corresponding to the respective groups are comprehensively compared, and the alternative language model with the lowest degree of aliasing assessment result is chosen as the language model ready for use.
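A sketch of this weight search under assumed interfaces: interpolate mixes two word-probability tables linearly, and corpus_perplexity averages the per-sentence degree of aliasing (sentence_perplexity above) over the development set; neither helper is taken from the embodiment itself.

def interpolate(p1, p2, w1):
    # Weighted average of two n-gram probability tables: w1*P1 + (1-w1)*P2.
    grams = set(p1) | set(p2)
    return {g: w1 * p1.get(g, 0.0) + (1 - w1) * p2.get(g, 0.0) for g in grams}

def corpus_perplexity(sentences, probs, n):
    return sum(sentence_perplexity(s, probs, n) for s in sentences) / len(sentences)

def choose_interpolation(bottoming, screening, dev_sentences, n):
    # Try each weight value combination, assess the alternative model on the
    # development set, and keep the model with the lowest degree of aliasing.
    best = None
    for w1 in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
        candidate = interpolate(bottoming, screening, w1)
        ppl = corpus_perplexity(dev_sentences, candidate, n)
        if best is None or ppl < best[0]:
            best = (ppl, w1, candidate)
    return best  # (development-set perplexity, first weight value, chosen model)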
Model interpolation can fuse multiple N-GRAM language models into one unified language model by weight according to certain interpolation coefficients; the interpolated language model can take both components into account by weight, so as to obtain more comprehensive modeling ability. Because the development set has been used to train the screening N-GRAM language model, and the development set more or less differs from the test set, and because its domain specificity is too strong, its generalization ability is insufficient. For this reason the strategy of model interpolation is used: the full corpus is used to train one larger N-GRAM language model as the bottoming model, while the screened data is used to train the corresponding screening model to meet the needs of in-domain testing; the two are combined by weighted average according to certain weights, finally obtaining a language model that satisfies both generalization ability and in-domain demand.
In this embodiment, experiments confirm that the influence of the interpolation coefficient on the degree of aliasing is not especially large. Since, in practical operation, the development-set corpus has already been used to choose data from the training-set corpus, the interpolation coefficient can optionally be estimated with reference to the development set, or estimated empirically.
The screening of this embodiment, being based on degree of aliasing, only requires training one basic N-Gram model on the development-set data; its computation, compared with a word-vector network requiring large-scale corpus training, is negligible. The embodiment makes no topicality assumption, so there is no clustering operation and the problem of falling into a local optimum is avoided; the hypothesis underlying the N-GRAM model is the conditional probability distribution of N-grams, so no sentence-level vector conversion is needed; the embodiment can use any open-source tool supporting N-GRAM models, and only a few script-level operations are needed to complete the entire screening process, which greatly facilitates engineering implementation and integration with various systems. In addition, the embodiment can predict the sample distribution of the test set to a certain extent: the degree of aliasing of each language segmentation unit can well characterize an estimate of the degree of approximation between that language segmentation unit and the test set, and the basic computational element can be changed from sentence to article, providing targeted optimization for each product line.
It should be noted that, for the sake of simple description, each of the foregoing method embodiments is expressed as a series of action combinations; however, those skilled in the art should understand that the present invention is not limited by the described sequence of actions, because according to the present invention some steps may be performed in other sequences or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be realized by means of software plus the necessary general hardware platform, and of course also by hardware, but in many cases the former is the preferable implementation. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk or optical disc) and including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the method of each embodiment of the present invention.
Embodiment 4
According to an embodiment of the present invention, there is additionally provided a device for acquiring a language model, for implementing the above method for acquiring a language model. Fig. 4 is a schematic diagram of a device for acquiring a language model according to an embodiment of the present invention. As shown in Fig. 4, the device 400 for acquiring a language model includes: an obtaining module 402, a computing module 404 and a processing module 406.
The obtaining module 402 is configured to obtain the first corpus and the second corpus, wherein the first corpus is language text of random acquisition and the second corpus is language text chosen under the default context.
The computing module 404 is configured to perform degree of aliasing calculation on the first corpus using the first language model obtained by training on the second corpus, so as to obtain the third corpus by screening, wherein the degree of aliasing between the third corpus and the second corpus is less than the preset threshold.
The processing module 406 is configured to perform interpolation processing on the second language model obtained by training on the first corpus and the third language model obtained by training on the third corpus, to obtain the language model ready for use.
Optionally, the computing module 404 includes: a first determination unit 408, configured to determine the language segmentation units in the first corpus; a first computing unit 410, configured to perform degree of aliasing calculation separately, using the first language model, on the word sequence contained in each language segmentation unit in the first corpus, obtaining in turn the calculated result corresponding to each language segmentation unit, wherein the calculated result corresponding to each language segmentation unit is used to show the degree of aliasing between the word sequence contained in that language segmentation unit and the first language model; and a screening unit 412, configured to screen, according to the preset threshold, the calculated results corresponding to the respective language segmentation units, obtaining the third corpus.
Optionally, the first computing unit 410 includes: a first computation subunit 414, configured to calculate the probability of occurrence in the first corpus of the word sequence contained in each language segmentation unit; a second computation subunit 416, configured to derive, using the probability of occurrence, the cross entropy corresponding to the word sequence contained in each language segmentation unit; and a third computation subunit 418, configured to set the cross entropy as an exponent and set a default value as the base to perform an exponential operation, obtaining the calculated result corresponding to each language segmentation unit.
Optionally, the first computation subunit 414 is configured to successively seek, for each word sequence contained in each language segmentation unit, the word probability conditioned on the preceding N-1 word sequences, and to obtain the probability of occurrence by performing a product operation on the word probabilities corresponding to the respective word sequences, wherein the value of N is predetermined according to the first language model.
Optionally, the first computation subunit 414 is configured to calculate the word probability corresponding to each word sequence using a maximum likelihood estimation algorithm.
Optionally, the processing module 406 includes: a second determination unit 420, configured to determine multiple groups of weight value combinations, wherein each group of weight value combinations includes a first weight value corresponding to the second language model and a second weight value corresponding to the third language model; a second computing unit 422, configured to perform weighted average calculation on the second language model and the third language model using the currently chosen weight value combination, obtaining an alternative language model; a processing unit 424, configured to perform degree of aliasing calculation separately, using the alternative language model, on the word sequence contained in each language segmentation unit in the second corpus, obtaining a degree of aliasing assessment result corresponding to the second corpus, and to judge whether there remains a weight value combination not yet chosen among the multiple groups of weight value combinations, returning to the second computing unit if so and continuing to a comparing unit 426 if not; and the comparing unit 426, configured to comprehensively compare the degree of aliasing assessment results corresponding to the respective groups of weight value combinations and to choose the alternative language model with the lowest degree of aliasing assessment result as the language model ready for use.
In this embodiment, the obtaining module 402 obtains the first corpus and the second corpus, wherein the first corpus is language text of random acquisition and the second corpus is language text chosen under the default context; the computing module 404 performs degree of aliasing calculation on the first corpus using the first language model obtained by training on the second corpus, obtaining the third corpus by screening, wherein the degree of aliasing between the third corpus and the second corpus is less than the preset threshold; and the processing module 406 performs interpolation processing on the second language model obtained by training on the first corpus and the third language model obtained by training on the third corpus, obtaining the language model ready for use. Since the first language model, trained under the default context, performs degree of aliasing calculation on the randomly acquired first corpus so as to screen out the third corpus whose degree of aliasing relative to the second corpus is low, the third corpus is trained to obtain the third language model, and interpolation processing is performed on the second language model and the third language model to obtain the language model ready for use, the purpose of acquiring a language model is achieved, the technical effect of improving the performance of the language model is realized, and the technical problem of degraded performance of the language model is solved.
It should be noted that, for preferred implementations of this embodiment, reference may be made to the related descriptions in the foregoing embodiments, which will not be repeated here.
Embodiment 5
An embodiment of the present invention can provide a terminal, which can be any one terminal device in a terminal group. Optionally, in this embodiment, the above terminal may also be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the above terminal can be located in at least one of multiple network devices of a computer network.
In this embodiment, the above terminal can execute the program code of the following steps in the method for acquiring a language model of an application: obtaining the first corpus and the second corpus, wherein the first corpus is language text of random acquisition and the second corpus is language text chosen under the default context; performing degree of aliasing calculation on the first corpus using the first language model obtained by training on the second corpus, so as to obtain the third corpus by screening, wherein the degree of aliasing between the third corpus and the second corpus is less than the preset threshold; and performing interpolation processing on the second language model obtained by training on the first corpus and the third language model obtained by training on the third corpus, obtaining the language model ready for use.
Optionally, Fig. 5 is a structural block diagram of a terminal according to an embodiment of the present invention. As shown in Fig. 5, the computer terminal A may include: one or more processors 502 (only one is shown in the figure), a memory 504 and a transmission module 506.
The memory can be used to store software programs and modules, such as the program instructions/modules corresponding to the method and device for acquiring a language model in the embodiments of the present invention; the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby realizing the above method for acquiring a language model. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory remotely located relative to the processor, and these remote memories can be connected to the terminal A through a network. Examples of the above network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network and combinations thereof.
The processor can call the information and the application program stored in the memory through the transmission module to execute the following steps: obtaining the first corpus and the second corpus, wherein the first corpus is language text of random acquisition and the second corpus is language text chosen under the default context; performing degree of aliasing calculation on the first corpus using the first language model obtained by training on the second corpus, so as to obtain the third corpus by screening, wherein the degree of aliasing between the third corpus and the second corpus is less than the preset threshold; and performing interpolation processing on the second language model obtained by training on the first corpus and the third language model obtained by training on the third corpus, obtaining the language model ready for use.
Optionally, the above processor can also execute the program code of the following steps: determining the language segmentation units in the first corpus; performing degree of aliasing calculation separately, using the first language model, on the word sequence contained in each language segmentation unit in the first corpus, obtaining in turn the calculated result corresponding to each language segmentation unit, wherein the calculated result corresponding to each language segmentation unit is used to show the degree of aliasing between the word sequence contained in that language segmentation unit and the first language model; and screening, according to the preset threshold, the calculated results corresponding to the respective language segmentation units, obtaining the third corpus.
Optionally, the above processor can also execute the program code of the following steps: calculating the probability of occurrence in the first corpus of the word sequence contained in each language segmentation unit; deriving, using the probability of occurrence, the cross entropy corresponding to the word sequence contained in each language segmentation unit; and setting the cross entropy as an exponent and setting a default value as the base to perform an exponential operation, obtaining the calculated result corresponding to each language segmentation unit.
Optionally, the above processor can also execute the program code of the following steps: successively seeking, for each word sequence contained in each language segmentation unit, the word probability conditioned on the preceding N-1 word sequences, wherein the value of N is predetermined according to the first language model; and obtaining the probability of occurrence by performing a product operation on the word probabilities corresponding to the respective word sequences.
The above processor can also execute the program code of the following step: calculating the word probability corresponding to each word sequence using a maximum likelihood estimation algorithm.
Optionally, the above processor can also execute the program code of the following steps: a determining step: determining multiple groups of weight value combinations, wherein each group of weight value combinations includes a first weight value corresponding to the second language model and a second weight value corresponding to the third language model; a calculating step: performing weighted average calculation on the second language model and the third language model using the currently chosen weight value combination, obtaining an alternative language model; a processing step: performing degree of aliasing calculation separately, using the alternative language model, on the word sequence contained in each language segmentation unit in the second corpus, obtaining a degree of aliasing assessment result corresponding to the second corpus, and judging whether there remains a weight value combination not yet chosen among the multiple groups of weight value combinations, returning to the calculating step if so and continuing to a comparison step if not; the comparison step: comprehensively comparing the degree of aliasing assessment results corresponding to the respective groups of weight value combinations, and choosing the alternative language model with the lowest degree of aliasing assessment result as the language model ready for use.
With the embodiment of the present invention, a scheme of a method for acquiring a language model is provided: obtaining the first corpus and the second corpus, wherein the first corpus is language text of random acquisition and the second corpus is language text chosen under the default context; performing degree of aliasing calculation on the first corpus using the first language model obtained by training on the second corpus, obtaining the third corpus by screening, wherein the degree of aliasing between the third corpus and the second corpus is less than the preset threshold; and performing interpolation processing on the second language model and the third language model, obtaining the language model ready for use. Since degree of aliasing calculation is performed on the first corpus using the first language model so as to screen out the third corpus, the third corpus is trained to obtain the third language model, and interpolation processing is performed on the second language model obtained by training on the first corpus and the third language model obtained by training on the third corpus to obtain the language model ready for use, the purpose of acquiring a language model is achieved, the technical effect of improving the performance of the language model is realized, and the technical problem of degraded performance of the language model is solved.
Those skilled in the art can understand that the structure shown in Fig. 5 is only illustrative; the terminal can also be a terminal device such as a smart phone (e.g. an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (Mobile Internet Devices, MID) or a PAD. Fig. 5 does not limit the structure of the above electronic device. For example, the computer terminal A may also include more or fewer components than those shown in Fig. 5 (such as a network interface or a display device), or have a configuration different from that shown in Fig. 5.
Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing hardware related to a terminal device; the program can be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
Embodiment 6
An embodiment of the present invention also provides a storage medium. Optionally, in this embodiment, the above storage medium can be used to save the program code executed by the method for acquiring a language model provided by Embodiment 1 above.
Optionally, in this embodiment, the above storage medium can be located in any one terminal of a terminal group in a computer network, or in any one mobile terminal of a mobile terminal group.
Optionally, in this embodiment, the storage medium is arranged to store the program code for executing the following steps: obtaining the first corpus and the second corpus, wherein the first corpus is language text of random acquisition and the second corpus is language text chosen under the default context; performing degree of aliasing calculation on the first corpus using the first language model obtained by training on the second corpus, so as to obtain the third corpus by screening, wherein the degree of aliasing between the third corpus and the second corpus is less than the preset threshold; and performing interpolation processing on the second language model obtained by training on the first corpus and the third language model obtained by training on the third corpus, obtaining the language model ready for use.
Optionally, the storage medium is also arranged to store the program code for executing the following steps: determining the language segmentation units in the first corpus; performing degree of aliasing calculation separately, using the first language model, on the word sequence contained in each language segmentation unit in the first corpus, obtaining in turn the calculated result corresponding to each language segmentation unit, wherein the calculated result corresponding to each language segmentation unit is used to show the degree of aliasing between the word sequence contained in that language segmentation unit and the first language model; and screening, according to the preset threshold, the calculated results corresponding to the respective language segmentation units, obtaining the third corpus.
Optionally, the storage medium is also arranged to store the program code for executing the following steps: calculating the probability of occurrence in the first corpus of the word sequence contained in each language segmentation unit; deriving, using the probability of occurrence, the cross entropy corresponding to the word sequence contained in each language segmentation unit; and setting the cross entropy as an exponent and setting a default value as the base to perform an exponential operation, obtaining the calculated result corresponding to each language segmentation unit.
Optionally, the storage medium is also arranged to store the program code for executing the following steps: successively seeking, for each word sequence contained in each language segmentation unit, the word probability conditioned on the preceding N-1 word sequences, wherein the value of N is predetermined according to the first language model; and obtaining the probability of occurrence by performing a product operation on the word probabilities corresponding to the respective word sequences.
The storage medium is also arranged to store the program code for executing the following step: calculating the word probability corresponding to each word sequence using a maximum likelihood estimation algorithm.
The storage medium is also arranged to store the program code for executing the following steps: a determining step: determining multiple groups of weight value combinations, wherein each group of weight value combinations includes a first weight value corresponding to the second language model and a second weight value corresponding to the third language model; a calculating step: performing weighted average calculation on the second language model and the third language model using the currently chosen weight value combination, obtaining an alternative language model; a processing step: performing degree of aliasing calculation separately, using the alternative language model, on the word sequence contained in each language segmentation unit in the second corpus, obtaining a degree of aliasing assessment result corresponding to the second corpus, and judging whether there remains a weight value combination not yet chosen among the multiple groups of weight value combinations, returning to the calculating step if so and continuing to a comparison step if not; the comparison step: comprehensively comparing the degree of aliasing assessment results corresponding to the respective groups of weight value combinations, and choosing the alternative language model with the lowest degree of aliasing assessment result as the language model ready for use.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the advantages or disadvantages of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for a part not described in detail in a certain embodiment, reference can be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content can be realized in other ways. The device embodiments described above are merely exemplary; for example, the division of the units is only a logical function division, and there may be other division manners in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or in other forms.
The units described as separate members may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may physically exist alone, or two or more units may be integrated into one unit. The above integrated unit can be realized either in the form of hardware or in the form of a software functional unit.
If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the method of each embodiment of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a mobile hard disk, a magnetic disk or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that, for those of ordinary skill in the art, various improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (12)

1. A method for acquiring a language model, characterized by comprising:
obtaining a first corpus and a second corpus, wherein the first corpus is language text of random acquisition and the second corpus is language text chosen under a default context;
performing degree of aliasing calculation on the first corpus using a first language model obtained by training on the second corpus, so as to obtain a third corpus by screening, wherein the degree of aliasing between the third corpus and the second corpus is less than a preset threshold;
performing interpolation processing on a second language model obtained by training on the first corpus and a third language model obtained by training on the third corpus, to obtain a language model ready for use.
2. The method according to claim 1, characterized in that performing degree of aliasing calculation on the first corpus using the first language model obtained by training on the second corpus, so as to obtain the third corpus by screening, includes:
determining language segmentation units in the first corpus;
performing degree of aliasing calculation separately, using the first language model, on the word sequence contained in each language segmentation unit in the first corpus, obtaining in turn a calculated result corresponding to each language segmentation unit, wherein the calculated result corresponding to each language segmentation unit is used to show the degree of aliasing between the word sequence contained in that language segmentation unit and the first language model;
screening, according to the preset threshold, the calculated results corresponding to the respective language segmentation units, obtaining the third corpus.
3. The method according to claim 2, characterized in that performing degree of aliasing calculation separately, using the first language model, on the word sequence contained in each language segmentation unit in the first corpus, obtaining in turn the calculated result corresponding to each language segmentation unit, includes:
calculating the probability of occurrence in the first corpus of the word sequence contained in each language segmentation unit;
deriving, using the probability of occurrence, the cross entropy corresponding to the word sequence contained in each language segmentation unit;
setting the cross entropy as an exponent and setting a default value as the base to perform an exponential operation, obtaining the calculated result corresponding to each language segmentation unit.
4. The method according to claim 3, characterized in that calculating the probability of occurrence in the first corpus of the word sequence contained in each language segmentation unit includes:
successively seeking, for each word sequence contained in each language segmentation unit, the word probability conditioned on the preceding N-1 word sequences, wherein the value of N is predetermined according to the first language model;
obtaining the probability of occurrence by performing a product operation on the word probabilities corresponding to the respective word sequences.
5. The method according to claim 4, characterized in that the word probability corresponding to each word sequence is calculated using a maximum likelihood estimation algorithm.
6. The method according to claim 1, characterized in that performing interpolation processing on the second language model and the third language model to obtain the language model ready for use includes:
a determining step: determining multiple groups of weight value combinations, wherein each group of weight value combinations includes a first weight value corresponding to the second language model and a second weight value corresponding to the third language model;
a calculating step: performing weighted average calculation on the second language model and the third language model using the currently chosen weight value combination, obtaining an alternative language model;
a processing step: performing degree of aliasing calculation separately, using the alternative language model, on the word sequence contained in each language segmentation unit in the second corpus, obtaining a degree of aliasing assessment result corresponding to the second corpus; judging whether there is a weight value combination not yet chosen among the multiple groups of weight value combinations; if so, returning to the calculating step; if not, continuing to execute a comparison step;
the comparison step: comprehensively comparing the degree of aliasing assessment results corresponding to the respective groups of weight value combinations, and choosing the alternative language model with the lowest degree of aliasing assessment result as the language model ready for use.
7. A device for acquiring a language model, characterized by comprising:
an obtaining module, configured to obtain a first corpus and a second corpus, wherein the first corpus is language text of random acquisition and the second corpus is language text chosen under a default context;
a computing module, configured to perform degree of aliasing calculation on the first corpus using a first language model obtained by training on the second corpus, so as to obtain a third corpus by screening, wherein the degree of aliasing between the third corpus and the second corpus is less than a preset threshold;
a processing module, configured to perform interpolation processing on a second language model obtained by training on the first corpus and a third language model obtained by training on the third corpus, to obtain a language model ready for use.
8. The device according to claim 7, characterized in that the computing module includes:
a first determination unit, configured to determine language segmentation units in the first corpus;
a first computing unit, configured to perform degree of aliasing calculation separately, using the first language model, on the word sequence contained in each language segmentation unit in the first corpus, obtaining in turn a calculated result corresponding to each language segmentation unit, wherein the calculated result corresponding to each language segmentation unit is used to show the degree of aliasing between the word sequence contained in that language segmentation unit and the first language model;
a screening unit, configured to screen, according to the preset threshold, the calculated results corresponding to the respective language segmentation units, obtaining the third corpus.
9. The device according to claim 8, characterized in that the first computing unit includes:
a first computation subunit, configured to calculate the probability of occurrence in the first corpus of the word sequence contained in each language segmentation unit;
a second computation subunit, configured to derive, using the probability of occurrence, the cross entropy corresponding to the word sequence contained in each language segmentation unit;
a third computation subunit, configured to set the cross entropy as an exponent and set a default value as the base to perform an exponential operation, obtaining the calculated result corresponding to each language segmentation unit.
10. The device according to claim 9, characterized in that the first computation subunit is configured to successively seek, for each word sequence contained in each language segmentation unit, the word probability conditioned on the preceding N-1 word sequences, and to obtain the probability of occurrence by performing a product operation on the word probabilities corresponding to the respective word sequences, wherein the value of N is predetermined according to the first language model.
11. The device according to claim 10, characterized in that the first computation subunit is configured to calculate the word probability corresponding to each word sequence using a maximum likelihood estimation algorithm.
12. The device according to claim 7, characterized in that the processing module includes:
a second determination unit, configured to determine multiple groups of weight value combinations, wherein each group of weight value combinations includes a first weight value corresponding to the second language model and a second weight value corresponding to the third language model;
a second computing unit, configured to perform weighted average calculation on the second language model and the third language model using the currently chosen weight value combination, obtaining an alternative language model;
a processing unit, configured to perform degree of aliasing calculation separately, using the alternative language model, on the word sequence contained in each language segmentation unit in the second corpus, obtaining a degree of aliasing assessment result corresponding to the second corpus; and to judge whether there is a weight value combination not yet chosen among the multiple groups of weight value combinations; if so, returning to the second computing unit; if not, continuing to execute a comparing unit;
the comparing unit, configured to comprehensively compare the degree of aliasing assessment results corresponding to the respective groups of weight value combinations, and to choose the alternative language model with the lowest degree of aliasing assessment result as the language model ready for use.
CN201710910635.9A 2017-09-29 2017-09-29 Method and device for acquiring language model Active CN110019832B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710910635.9A CN110019832B (en) 2017-09-29 2017-09-29 Method and device for acquiring language model

Publications (2)

Publication Number Publication Date
CN110019832A true CN110019832A (en) 2019-07-16
CN110019832B CN110019832B (en) 2023-02-24

Family

ID=67186451

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014117555A1 (en) * 2013-01-29 2014-08-07 Tencent Technology (Shenzhen) Company Limited Method and system for automatic speech recognition
US20150199339A1 (en) * 2014-01-14 2015-07-16 Xerox Corporation Semantic refining of cross-lingual information retrieval results
CN104572614A (en) * 2014-12-03 2015-04-29 北京捷通华声语音技术有限公司 Training method and system for language model
CN106294307A (en) * 2015-05-15 2017-01-04 北京国双科技有限公司 Corpus screening method and device
CN106328147A (en) * 2016-08-31 2017-01-11 中国科学技术大学 Speech recognition method and device
CN106372061A (en) * 2016-09-12 2017-02-01 电子科技大学 Short text similarity calculation method based on semantics

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110491394A (en) * 2019-09-12 2019-11-22 北京百度网讯科技有限公司 Method and device for acquiring wake-up corpus
CN110491394B (en) * 2019-09-12 2022-06-17 北京百度网讯科技有限公司 Awakening corpus obtaining method and device
CN111143518A (en) * 2019-12-30 2020-05-12 北京明朝万达科技股份有限公司 Cross-domain language model training method and device, electronic equipment and storage medium
CN111241813A (en) * 2020-04-29 2020-06-05 同盾控股有限公司 Corpus expansion method, apparatus, device and medium
CN112151021A (en) * 2020-09-27 2020-12-29 北京达佳互联信息技术有限公司 Language model training method, speech recognition device and electronic equipment

Also Published As

Publication number Publication date
CN110019832B (en) 2023-02-24

Similar Documents

Publication Publication Date Title
CN108446374B (en) User's Intention Anticipation method, apparatus, electronic equipment, storage medium
CN109388743A (en) The determination method and apparatus of language model
CN110019832A (en) The acquisition methods and device of language model
CN105701120B (en) The method and apparatus for determining semantic matching degree
CN110489755A (en) Document creation method and device
CN108363790A (en) For the method, apparatus, equipment and storage medium to being assessed
CN104143005B (en) A kind of related search system and method
CN109522556A (en) A kind of intension recognizing method and device
CN110366734A (en) Optimization neural network framework
CN108108821A (en) Model training method and device
CN108734338A (en) Credit risk forecast method and device based on LSTM models
CN109299344A (en) The generation method of order models, the sort method of search result, device and equipment
CN107832432A (en) A kind of search result ordering method, device, server and storage medium
CN110033022A (en) Processing method, device and the storage medium of text
CN108122122A (en) Advertisement placement method and system
CN102486922B (en) Speaker recognition method, device and system
JP7405775B2 (en) Computer-implemented estimating methods, estimating devices, electronic equipment and storage media
CN109829775A (en) A kind of item recommendation method, device, equipment and readable storage medium storing program for executing
CN109299420A (en) Social media account processing method, device, equipment and readable storage medium storing program for executing
CN106462803A (en) Augmenting neural networks with external memory
CN110427560A (en) A kind of model training method and relevant apparatus applied to recommender system
CN108491389A (en) Click bait title language material identification model training method and device
CN106484777A (en) A kind of multimedia data processing method and device
CN108304373A (en) Construction method, device, storage medium and the electronic device of semantic dictionary
CN107818491A (en) Electronic installation, Products Show method and storage medium based on user's Internet data

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK
Ref legal event code: DE
Ref document number: 40010852
Country of ref document: HK

GR01 Patent grant