CN108829657A - smoothing processing method and system - Google Patents

Smoothing processing method and system

Info

Publication number
CN108829657A
CN108829657A
Authority
CN
China
Prior art keywords
occurrence
frequency
word
probability
missing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810344157.4A
Other languages
Chinese (zh)
Other versions
CN108829657B (en
Inventor
李贤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Thecnology Co Ltd filed Critical Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority to CN201810344157.4A priority Critical patent/CN108829657B/en
Publication of CN108829657A publication Critical patent/CN108829657A/en
Application granted granted Critical
Publication of CN108829657B publication Critical patent/CN108829657B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Abstract

The present invention relates to the field of natural language processing, and in particular to a smoothing method and system. The method comprises the following steps: counting a first occurrence count of a missing word in a target corpus, wherein a missing word is a word whose occurrence count in the original corpus is 0; computing a normalized frequency index of the missing word according to the first occurrence count; and computing a smooth probability of the missing word according to the normalized frequency index and a remaining probability, and smoothing the missing word according to the smooth probability, wherein the remaining probability is the sum of the occurrence probabilities, seized (i.e. discounted) from the original corpus, of the words whose occurrence count is less than or equal to k, k being a positive integer. The method and system solve the problem that traditional smoothing performs poorly: they distinguish between the two situations a missing word may arise from, namely the word itself being erroneous and the coverage of the corpus being insufficient, smooth the missing word accordingly, reduce misjudgments, and enhance the smoothing effect.

Description

Smoothing processing method and system
Technical field
The present invention relates to the field of natural language processing, and more particularly to a smoothing method and system.
Background technique
A language model is an abstract mathematical model of a language, built from objective linguistic facts, used in processing natural language. Language models suffer from missing data, and missing data is addressed by smoothing algorithms. A smoothing algorithm seizes (discounts) part of the probability of the words that have occurred, obtaining a remaining probability available for redistribution, and assigns that probability to the missing words according to some rule; the probability assigned to a missing word is called its smooth probability.
The inventor found the following problem in the conventional technology. Taking the Good-Turing smoothing algorithm as an example: Good-Turing smooths missing data by distributing the residual frequency evenly, but an even distribution may not match the true situation, because missing data arises from two situations. First, the word itself may be erroneous; a nonsense word cannot occur in any corpus. Second, the coverage of the corpus may simply be insufficient, so that unregistered (out-of-vocabulary) words appear. The traditional Good-Turing smoothing algorithm cannot distinguish these two situations, so in practical applications it causes a large number of misjudgments.
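For reference, the following is a minimal sketch (Python; the function name, the toy interface, and the even-split policy are illustrative, not taken from the patent) of classic Good-Turing discounting with the even redistribution over unseen words that the passage above criticizes:

    from collections import Counter

    def good_turing_even_split(counts, unseen_words):
        # Classic Good-Turing: discount observed counts, then split the
        # leftover probability mass evenly over the unseen words.
        N = sum(counts.values())                        # total observed tokens
        n = Counter(counts.values())                    # n[r]: number of words seen exactly r times
        probs = {}
        for word, r in counts.items():
            r_star = (r + 1) * n.get(r + 1, 0) / n[r]   # discounted count r*
            probs[word] = r_star / N
        p0 = n.get(1, 0) / N                            # mass freed for unseen words
        for word in unseen_words:
            probs[word] = p0 / len(unseen_words)        # even split: every unseen word
        return probs                                    # gets the same smooth probability

The last assignment is exactly the weakness this invention targets: a misspelling and a genuinely valid but uncovered word receive the same smooth probability.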
In conclusion, traditional smoothing performs poorly.
Summary of the invention
Based on this, it is necessary to provide a smoothing method and system that address the problem of the poor effect of traditional smoothing.
A smoothing method comprises the following steps: counting a first occurrence count of a missing word in a target corpus, wherein the missing word is a word whose occurrence count in the original corpus is 0; computing a normalized frequency index of the missing word according to the first occurrence count; and computing a smooth probability of the missing word according to the normalized frequency index and a remaining probability, and smoothing the missing word according to the smooth probability, wherein the remaining probability is the sum of the occurrence probabilities, seized from the original corpus, of the words whose occurrence count is less than or equal to k, k being a positive integer.
In the above smoothing method, the first occurrence count of the missing word in the target corpus is counted, the normalized frequency index of the missing word is computed according to the first occurrence count, the smooth probability of the missing word is computed according to the normalized frequency index and the remaining probability, and the missing word is smoothed according to the smooth probability. By introducing the first occurrence count of the missing word in the target corpus, computing the normalized frequency index of the missing word, and distributing the remaining probability according to the normalized frequency index, the two situations a missing word may arise from, namely the word itself being erroneous and the coverage of the corpus being insufficient, are distinguished; a smooth probability reflecting the true situation of the missing word is obtained, the missing word is smoothed, misjudgments are reduced, and the smoothing effect is enhanced.
Further, there are multiple missing words, and the step of computing the normalized frequency index of each missing word according to its first occurrence count comprises the following steps: computing the logarithm of each first occurrence count; summing the logarithms to obtain a logarithm sum; and dividing the logarithm of each first occurrence count by the logarithm sum to obtain the normalized frequency index corresponding to each missing word.
The above smoothing method obtains the normalized frequency indices by numerically processing each first occurrence count, which makes normalizing the first occurrence counts easy; in addition, the logarithm processing suits the case of large occurrence counts.
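A minimal sketch (Python; variable and function names are illustrative) of this log-normalization step, using the base-10 logarithm that matches the worked example later in the description:

    import math

    def normalized_frequency_index(first_counts):
        # first_counts: one first occurrence count per missing word, each
        # assumed > 1 here; the +N variant below handles counts of 0 and 1.
        logs = [math.log10(c) for c in first_counts]  # logarithm of each count
        total = sum(logs)                             # the logarithm sum
        return [f / total for f in logs]              # indices sum to 1

With the counts 1000, 90000, 85000, 6000 and 450 used in the embodiment below, the logarithm sum is 19.3150 and the first index is 3/19.3150 ≈ 0.155.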
Further, the step of computing the logarithm of each first occurrence count comprises the following steps: increasing each first occurrence count by a value N to obtain the absolute occurrence counts; and computing the logarithm of each absolute occurrence count and taking it as the logarithm of the corresponding first occurrence count, wherein N is a positive integer greater than 1.
In the above smoothing method, taking the logarithm after increasing each chosen first occurrence count by the value N avoids the situation where a first occurrence count of 0 cannot be processed logarithmically.
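Continuing the sketch above, the same computation with the +N offset (N = 2 here, matching the embodiment; base-10 logarithm assumed):

    import math

    def normalized_frequency_index_offset(first_counts, N=2):
        # Add N (> 1) to every count first, so counts of 0 and 1 still
        # yield a well-defined, non-zero logarithm.
        logs = [math.log10(c + N) for c in first_counts]
        total = sum(logs)
        return [f / total for f in logs]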
Further, the step of computing the smooth probability of the missing word according to the normalized frequency index and the remaining probability comprises the following step: taking the product of the missing word's normalized frequency index and the remaining probability as the missing word's smooth probability.
In the above smoothing method, the remaining probability is distributed according to the normalized frequency indices of the missing words through a multiplication, and the missing words are smoothed, enhancing the smoothing effect.
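The allocation step itself is a single multiplication; a sketch (Python, names illustrative):

    def smooth_probabilities(indices, remaining_probability):
        # Smooth probability of each missing word: its normalized
        # frequency index times the remaining probability mass.
        return [i * remaining_probability for i in indices]

Because the indices sum to 1, the smooth probabilities sum exactly to the remaining probability, so the smoothed distribution still sums to 1.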
Further, k is a positive integer greater than 1, and before the step of computing the smooth probability of the missing word according to the normalized frequency index and the remaining probability, the method further comprises the following steps: obtaining the second occurrence count of each n-gram in the original corpus, and computing the sum of the second occurrence counts; counting the third occurrence count of each n-gram whose second occurrence count is less than or equal to k, and computing the sum of the third occurrence counts; and computing the remaining probability according to the sum of the second occurrence counts and the sum of the third occurrence counts.
In the above smoothing method, the remaining probability is computed from the third occurrence counts of the words whose second occurrence count is less than or equal to k and from the second occurrence counts of the n-grams in the original corpus; the missing words are smoothed, enhancing the smoothing effect.
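A sketch (Python; counts maps each observed n-gram to its second occurrence count) of this remaining-probability computation for a general k:

    def remaining_probability_k(counts, k):
        # Remaining probability: the share of tokens contributed by words
        # whose occurrence count in the original corpus is <= k.
        N = sum(counts.values())                            # sum of second occurrence counts
        seized = sum(r for r in counts.values() if r <= k)  # sum of third occurrence counts
        return seized / N

For k = 2 this reduces to (2n2 + n1)/N, the quotient given in the embodiment below.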
Further, the remaining probability is the sum of the occurrence probabilities, seized from the original corpus, of the words whose occurrence count equals 1; before the step of computing the smooth probability of the missing word according to the normalized frequency index and the remaining probability, the method further comprises the following steps: obtaining the second occurrence count of each n-gram in the original corpus and computing the sum of the second occurrence counts; counting the number of singleton words whose second occurrence count is 1; and computing the remaining probability according to the sum of the second occurrence counts and the singleton word count.
In the above smoothing method, the remaining probability is computed by counting the singleton words whose second occurrence count is 1 and computing the sum of the second occurrence counts; the missing words are smoothed, enhancing the smoothing effect.
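For this k = 1 case the remaining probability is the classic Good-Turing mass n1/N; a sketch:

    def remaining_probability_singletons(counts):
        # Good-Turing remaining mass n1/N: the fraction of probability
        # freed by the words seen exactly once in the original corpus.
        N = sum(counts.values())
        n1 = sum(1 for r in counts.values() if r == 1)  # singleton word count
        return n1 / N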
Further, after the step of computing the smooth probability of the missing word according to the normalized frequency index and the remaining probability, the method further comprises the following steps: obtaining the second occurrence count of each n-gram in the original corpus; computing the first occurrence probability of each occurred word after discounting according to the second occurrence count; and training the n-gram model where the missing word is located according to the smooth probabilities and the first occurrence probabilities.
In the above smoothing method, the first occurrence probabilities are computed and the n-gram model where the missing words are located is trained according to the smooth probabilities and the first occurrence probabilities, establishing the smoothed n-gram model; the n-gram model is smoothed, enhancing the smoothing effect.
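Putting the pieces together, a sketch (Python; the function and parameter names and the offset value are illustrative) of assembling the full smoothed distribution: discounted probabilities for observed n-grams, index-weighted smooth probabilities for missing ones:

    import math
    from collections import Counter

    def build_smoothed_model(counts, missing_first_counts, offset=2):
        # counts: observed n-gram -> second occurrence count.
        # missing_first_counts: missing n-gram -> first occurrence count
        # (e.g. a search-engine result count).
        N = sum(counts.values())
        n = Counter(counts.values())
        probs = {}
        for word, r in counts.items():
            r_star = (r + 1) * n.get(r + 1, 0) / n[r]  # discounted count r*
            probs[word] = r_star / N                   # first occurrence probability
        p0 = n.get(1, 0) / N                           # remaining probability
        logs = {w: math.log10(c + offset) for w, c in missing_first_counts.items()}
        total = sum(logs.values())                     # logarithm sum
        for word, f in logs.items():
            probs[word] = (f / total) * p0             # smooth probability
        return probs

The observed probabilities sum to 1 - n1/N and the smooth probabilities to n1/N, so the resulting distribution sums to 1 and can be used to train the n-gram model.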
Further, the target corpus is an internet corpus entered through a search engine website, and the first occurrence count is the number of relevant results after the missing word is searched on the search engine website.
In the above smoothing method, searching for the missing words in the databases of the internet through an internet search engine expands the coverage of the corpus, and the first occurrence counts obtained can match the true situation; the two situations that produce missing data are distinguished, enhancing the smoothing effect.
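A deliberately hedged sketch of this step: search_result_count is a hypothetical callable standing in for whatever search interface is used, since the patent names no concrete engine, API, or query protocol:

    def first_occurrence_counts(missing_words, search_result_count):
        # Take the search engine's reported number of relevant results
        # for each missing word as its first occurrence count.
        return {word: search_result_count(word) for word in missing_words}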
A smoothing system comprises: a count statistics module for counting a first occurrence count of a missing word in a target corpus, wherein the missing word is a word whose occurrence count in the original corpus is 0; a normalized frequency index computing module for computing a normalized frequency index of the missing word according to the first occurrence count; and a smoothing processing module for computing a smooth probability of the missing word according to the normalized frequency index and a remaining probability, and smoothing the missing word according to the smooth probability, wherein the remaining probability is the sum of the occurrence probabilities, seized from the original corpus, of the words whose occurrence count is less than or equal to k, k being a positive integer.
In the above smoothing system, the first occurrence count of the missing word in the target corpus is counted, the normalized frequency index of the missing word is computed according to the first occurrence count, the smooth probability of the missing word is computed according to the normalized frequency index and the remaining probability, and the missing word is smoothed according to the smooth probability. By introducing the first occurrence count of the missing word in the target corpus, computing the normalized frequency index of the missing word, and distributing the remaining probability according to the normalized frequency index, the two situations a missing word may arise from, namely the word itself being erroneous and the coverage of the corpus being insufficient, are distinguished; a smooth probability reflecting the true situation of the missing word is obtained, the missing word is smoothed, misjudgments are reduced, and the smoothing effect is enhanced.
A computer device comprises a memory, a processor, and a computer program stored on the memory and runnable on the processor; the processor implements the above smoothing method when executing the computer program.
A computer storage medium stores a computer program; the program implements the above smoothing method when executed by a processor.
Detailed description of the invention
Fig. 1 is a flowchart of the smoothing method of one embodiment of the present invention;
Fig. 2 is a structural schematic diagram of the smoothing system of one embodiment of the present invention;
Fig. 3 is a flowchart of the smoothing method of a specific embodiment of the present invention.
Specific embodiment
To facilitate understanding of the present invention, a more comprehensive description is given below with reference to the relevant drawings.
Referring to Fig. 1, which is a flowchart of the smoothing method of one embodiment of the invention, the smoothing method in this embodiment comprises the following steps:
Step S110: counting a first occurrence count of the missing word in the target corpus, wherein the missing word is a word whose occurrence count in the original corpus is 0.
In this step, since a missing word is a word whose occurrence count in the original corpus is 0, the process of distributing smooth probability cannot distinguish the two situations behind missing data, namely the word itself being erroneous and the coverage of the corpus being insufficient; a parameter that distinguishes the two situations therefore needs to be introduced to distribute the smooth probability. Accordingly, the occurrence count of the missing word is counted in the target corpus to obtain the first occurrence count. If the missing word occurs in the target corpus, this reflects insufficient coverage of the original corpus; if the missing word does not occur in the target corpus, or occurs only rarely, this reflects that the word itself is erroneous. In a target corpus with a search or lookup function, the missing word can be searched or looked up and the number of results taken as the first occurrence count. Here a missing word is an n-gram, where an n-gram is a word composed of n words in an n-gram model, an (n-1)-gram is a word composed of n-1 words in the n-gram model, and n is a positive integer greater than 1.
In one embodiment, the target corpus is a corpus with a large data volume.
Optionally, the target corpus may be a database of the internet, a knowledge-domain database such as a patent database or an academic literature database, or a self-built database such as a database storing business documents; the target corpus may also be the content of books or texts, such as technical manuals, dictionaries, or the classic works of various industries. A missing word may be a common word in some knowledge domain: since different corpora cover different domains, missing data can arise, so the first occurrence count can be obtained from a target corpus for a specific knowledge domain, and the missing words of that domain can be smoothed. Moreover, smoothing targeted at a specific knowledge domain can enhance the application effect of the corresponding language model in that domain and improve its language processing ability.
Step S120: computing the normalized frequency index of the missing word according to the first occurrence count.
In this step, in order to allocate the remaining probability proportionally, the first occurrence counts are normalized to obtain the normalized frequency indices, according to which the remaining probability is distributed in subsequent processing. The normalized frequency index of a missing word is computed from the proportion its first occurrence count takes among all first occurrence counts. After normalization, the normalized frequency indices of all missing words sum to 1.
Step S130: computing the smooth probability of the missing word according to the normalized frequency index and the remaining probability, and smoothing the missing word according to the smooth probability, wherein the remaining probability is the sum of the occurrence probabilities, seized from the original corpus, of the words whose occurrence count is less than or equal to k, k being a positive integer.
The occurrence probabilities of the words that have occurred in the original corpus are reduced proportionally, so that the probabilities of all occurred words sum to less than 1 and a surplus appears in the probability distribution; this surplus is the remaining probability, namely the sum of the occurrence probabilities, seized from the original corpus, of the words whose occurrence count is less than or equal to k, k being a positive integer. Suppose the probabilities of the occurred words A, B, C, D and E in the original corpus are 0.5, 0.4, 0.3, 0.2 and 0.1, and they are reduced by 10%, 20%, 30%, 40% and 50% respectively; the surplus is 0.5 × 0.1 + 0.4 × 0.2 + 0.3 × 0.3 + 0.2 × 0.4 + 0.1 × 0.5 = 0.35, i.e. the remaining probability is 0.35.
In this step, the remaining probability is distributed according to the normalized frequency indices to obtain the smooth probability of each missing word: the remaining probability is allocated in the proportions of the normalized frequency indices, and the smooth probabilities are obtained after the calculation.
In the above smoothing method, the first occurrence count of the missing word in the target corpus is counted, the normalized frequency index of the missing word is computed according to the first occurrence count, the smooth probability of the missing word is computed according to the normalized frequency index and the remaining probability, and the missing word is smoothed according to the smooth probability. By introducing the first occurrence count of the missing word in the target corpus, computing the normalized frequency index of the missing word, and distributing the remaining probability according to the normalized frequency index, the two situations a missing word may arise from, namely the word itself being erroneous and the coverage of the corpus being insufficient, are distinguished; a smooth probability reflecting the true situation of the missing word is obtained, the missing word is smoothed, misjudgments are reduced, and the smoothing effect is enhanced.
Further, there are multiple missing words, and the step of computing the normalized frequency index of each missing word according to its first occurrence count comprises the following steps:
Step S121: computing the logarithm of each first occurrence count. Step S122: summing the logarithms to obtain the logarithm sum. Step S123: dividing the logarithm of each first occurrence count by the logarithm sum to obtain the normalized frequency index corresponding to each missing word.
If the first occurrence counts of the missing words AA, AB, BC, BD and CE are 1000, 90000, 85000, 6000 and 450 respectively, then the logarithm of the first occurrence count of the missing word AA is f_AA = log10(1000) = 3, the logarithm sum is 19.3150, and the normalized frequency index of AA is 3/19.3150 ≈ 0.155.
The above smoothing method obtains the normalized frequency indices by numerically processing each first occurrence count, which makes normalizing the first occurrence counts easy; in addition, the logarithm processing suits the case of large occurrence counts.
Further, the step of computing the logarithm of each first occurrence count comprises the following steps:
Step S1211: increasing each first occurrence count by the value N to obtain the absolute occurrence counts. Step S1212: computing the logarithm of each absolute occurrence count and taking it as the logarithm of the corresponding first occurrence count, wherein N is a positive integer greater than 1.
Consider an unseen bigram w_j w_i whose first occurrence count is c(w_j w_i); to handle the cases c(w_j w_i) = 0 and c(w_j w_i) = 1, define the absolute occurrence count as c(w_j w_i) + N. If the first occurrence counts of the missing words AA, AB, BC, BD and CE are 1000, 90000, 85000, 6000 and 450 and N = 2, then the logarithm of the first occurrence count of AA is f_AA = log10(1000 + 2) = 3.0009, the logarithm sum is 19.3180, and the normalized frequency index of AA is 3.0009/19.3180 ≈ 0.155.
In the above smoothing method, taking the logarithm after increasing each chosen first occurrence count by the value N avoids the situation where a first occurrence count of 0 cannot be processed logarithmically.
Further, the step of computing the smooth probability of the missing word according to the normalized frequency index and the remaining probability comprises the following step:
taking the product of the missing word's normalized frequency index and the remaining probability as the missing word's smooth probability.
If the remaining probability is 0.43 and the normalized frequency index of the missing word AA is 0.155, then the smooth probability of AA is 0.155 × 0.43 ≈ 0.0665.
In this step, the product of the missing word's normalized frequency index and the remaining probability is taken as its smooth probability; through this multiplication, the remaining probability is distributed according to the normalized frequency indices of the missing words.
In the above smoothing method, the remaining probability is distributed according to the normalized frequency indices of the missing words through a multiplication, and the missing words are smoothed, enhancing the smoothing effect.
Further, k is a positive integer greater than 1.
Before the step of computing the smooth probability of the missing word according to the normalized frequency index and the remaining probability, the method further comprises the following steps:
Step S141: obtaining the second occurrence count of each n-gram in the original corpus, and computing the sum of the second occurrence counts. Step S142: counting the third occurrence count of each n-gram whose second occurrence count is less than or equal to k, and computing the sum of the third occurrence counts. Step S143: computing the remaining probability according to the sum of the second occurrence counts and the sum of the third occurrence counts.
If k equals 2, the second occurrence count of each n-gram in the original corpus is counted and the second occurrence counts are summed, giving the total number of occurrences of the occurred words. Counting the number n2 of occurred words whose second occurrence count is 2 and the number n1 whose second occurrence count is 1, the third occurrence count of the words occurring twice is 2n2 and that of the words occurring once is 1·n1, so 2n2 + n1 is the sum of the third occurrence counts, and the remaining probability is the quotient of the sum of the third occurrence counts divided by the sum of the second occurrence counts.
In the above smoothing method, the remaining probability is computed from the third occurrence counts of the words whose second occurrence count is less than or equal to k and from the second occurrence counts of the n-grams in the original corpus; the missing words are smoothed, enhancing the smoothing effect.
Further, the remaining probability is the sum of the occurrence probabilities, seized from the original corpus, of the words whose occurrence count equals 1.
Before the step of computing the smooth probability of the missing word according to the normalized frequency index and the remaining probability, the method further comprises the following steps:
Step S144: obtaining the second occurrence count of each n-gram in the original corpus and computing the sum of the second occurrence counts. Step S145: counting the number of singleton words whose second occurrence count is 1. Step S146: computing the remaining probability according to the sum of the second occurrence counts and the singleton word count.
Taking the n-gram model as an example, let r be the second occurrence count of an n-gram in the original corpus, n_r the number of n-grams occurring exactly r times in the original corpus, and n_(r+1) the number occurring exactly r+1 times. The count r is discounted according to n_r, giving the discounted count r*. Discounting r amounts to seizing probability from the occurrence probabilities of the occurred words in the original corpus: the occurrence probabilities of the occurred words are reduced proportionally, so that the probabilities of all occurred words sum to less than 1 and a surplus appears in the distribution. The discounted count r* satisfies the formula:
r* = (r + 1) · n_(r+1) / n_r
For an n-gram whose second occurrence count is r, the first occurrence probability after the proportional reduction in the original corpus is p_r = r*/N, where N = Σ_r r·n_r, i.e. N is the sum of the second occurrence counts. The occurrence probabilities of the discounted n-gram model sum to 1 − n_1/N, so the remaining probability is p_0 = n_1/N, where n_1 is the number of singleton occurred words whose second occurrence count is 1. Therefore, by obtaining the second occurrence counts r of the n-grams in the original corpus, computing their sum N, and counting the number n_1 of singletons whose second occurrence count is 1, the remaining probability p_0 = n_1/N can be calculated from n_1 and N.
In the above smoothing method, the remaining probability is computed by counting the singleton words whose second occurrence count is 1 and computing the sum of the second occurrence counts; the missing words are smoothed, enhancing the smoothing effect.
Further, after the step of computing the smooth probability of the missing word according to the normalized frequency index and the remaining probability, the method further comprises the following steps:
Step S151: obtaining the second occurrence count of the n-gram in the original corpus. Step S152: computing the first occurrence probability of each occurred word after discounting according to the second occurrence count. Step S153: training the n-gram model where the missing word is located according to the smooth probabilities and the first occurrence probabilities.
Taking the bigram model (n = 2) as an example, before the smoothing method of this embodiment is performed, the second occurrence count r of each n-gram in the original corpus is obtained, n_r being the number of n-grams occurring exactly r times in the original corpus; the count r is discounted according to n_r, giving the discounted count r*. Discounting r amounts to seizing probability from the occurrence probabilities of the occurred words in the original corpus: the occurrence probabilities are reduced proportionally, so that the probabilities of all occurred words sum to less than 1 and a surplus appears in the distribution. The discounted count r* satisfies:
r* = (r + 1) · n_(r+1) / n_r
For an n-gram whose second occurrence count is r, the first occurrence probability after the proportional reduction in the original corpus is p_r = r*/N, where N = Σ_r r·n_r is the sum of the second occurrence counts.
The n-gram model where the missing word is located is then trained according to the smooth probabilities and the first occurrence probabilities, establishing the smoothed n-gram model.
In the above smoothing method, the first occurrence probabilities are computed and the n-gram model where the missing words are located is trained according to the smooth probabilities and the first occurrence probabilities, establishing the smoothed n-gram model; the n-gram model is smoothed, enhancing the smoothing effect.
Further, the target corpus is an internet corpus entered through a search engine website, and the first occurrence count is the number of relevant results after the missing word is searched on the search engine website.
The target corpus may be a database of the internet entered through a search engine website, and the number of relevant results after searching on the internet search engine website may be taken as the first occurrence count of the missing word in the target corpus.
In the above smoothing method, searching for the missing words in the databases of the internet through an internet search engine expands the coverage of the corpus, and the first occurrence counts obtained can match the true situation; the two situations that produce missing data are distinguished, enhancing the smoothing effect.
Referring to Fig. 2, which is a structural schematic diagram of the smoothing system of one embodiment of the invention, the smoothing system in this embodiment comprises:
a count statistics module 210 for counting the first occurrence count of the missing word in the target corpus, wherein the missing word is a word whose occurrence count in the original corpus is 0;
a normalized frequency index computing module 220 for computing the normalized frequency index of the missing word according to the first occurrence count;
a smoothing processing module 230 for computing the smooth probability of the missing word according to the normalized frequency index and the remaining probability, and smoothing the missing word according to the smooth probability, wherein the remaining probability is the sum of the occurrence probabilities, seized from the original corpus, of the words whose occurrence count is less than or equal to k, k being a positive integer.
In the above smoothing system, the first occurrence count of the missing word in the target corpus is counted, the normalized frequency index of the missing word is computed according to the first occurrence count, the smooth probability of the missing word is computed according to the normalized frequency index and the remaining probability, and the missing word is smoothed according to the smooth probability. By introducing the first occurrence count of the missing word in the target corpus, computing the normalized frequency index of the missing word, and distributing the remaining probability according to the normalized frequency index, the two situations a missing word may arise from, namely the word itself being erroneous and the coverage of the corpus being insufficient, are distinguished; a smooth probability reflecting the true situation of the missing word is obtained, the missing word is smoothed, misjudgments are reduced, and the smoothing effect is enhanced.
Further, there are multiple missing words, and the normalized frequency index computing module 220 computes the logarithm of each first occurrence count, sums the logarithms to obtain the logarithm sum, and divides the logarithm of each first occurrence count by the logarithm sum to obtain the normalized frequency index corresponding to each missing word.
The above smoothing system obtains the normalized frequency indices by numerically processing each first occurrence count, which makes normalizing the first occurrence counts easy; in addition, the logarithm processing suits the case of large occurrence counts.
Further, the normalized frequency index computing module 220 increases each first occurrence count by the value N to obtain the absolute occurrence counts, computes the logarithm of each absolute occurrence count, and takes it as the logarithm of the corresponding first occurrence count, wherein N is a positive integer greater than 1.
In the above smoothing system, taking the logarithm after increasing each chosen first occurrence count by the value N avoids the situation where a first occurrence count of 0 cannot be processed logarithmically.
Further, the smoothing processing module 230 takes the product of the missing word's normalized frequency index and the remaining probability as the missing word's smooth probability.
In the above smoothing system, the remaining probability is distributed according to the normalized frequency indices of the missing words through a multiplication, and the missing words are smoothed, enhancing the smoothing effect.
Further, k is a positive integer greater than 1, and the smoothing processing module 230 obtains the second occurrence count of each n-gram in the original corpus and computes the sum of the second occurrence counts; counts the third occurrence count of each n-gram whose second occurrence count is less than or equal to k and computes the sum of the third occurrence counts; and computes the remaining probability according to the sum of the second occurrence counts and the sum of the third occurrence counts.
In the above smoothing system, the remaining probability is computed from the third occurrence counts of the words whose second occurrence count is less than or equal to k and from the second occurrence counts of the n-grams in the original corpus; the missing words are smoothed, enhancing the smoothing effect.
Further, the remaining probability is the sum of the occurrence probabilities, seized from the original corpus, of the words whose occurrence count equals 1; the smoothing processing module 230 obtains the second occurrence count of each n-gram in the original corpus and computes the sum of the second occurrence counts, counts the number of singleton words whose second occurrence count is 1, and computes the remaining probability according to the sum of the second occurrence counts and the singleton word count.
In the above smoothing system, the remaining probability is computed by counting the singleton words whose second occurrence count is 1 and computing the sum of the second occurrence counts; the missing words are smoothed, enhancing the smoothing effect.
Further, the smoothing processing module 230 obtains the second occurrence count of the n-gram in the original corpus, computes the first occurrence probability of each occurred word after discounting according to the second occurrence count, and trains the n-gram model where the missing word is located according to the smooth probabilities and the first occurrence probabilities.
In the above smoothing system, the first occurrence probabilities are computed and the n-gram model where the missing words are located is trained according to the smooth probabilities and the first occurrence probabilities, establishing the smoothed n-gram model; the n-gram model is smoothed, enhancing the smoothing effect.
Further, in the count statistics module 210, the target corpus is an internet corpus entered through a search engine website, and the first occurrence count is the number of relevant results after the missing word is searched on the search engine website.
In the above smoothing system, searching for the missing words in the databases of the internet through an internet search engine expands the coverage of the corpus, and the first occurrence counts obtained can match the true situation; the two situations that produce missing data are distinguished, enhancing the smoothing effect.
A computer device comprises a memory, a processor, and a computer program stored on the memory and runnable on the processor; the processor implements the above smoothing method when executing the computer program.
A computer storage medium stores a computer program; the program implements the above smoothing method when executed by a processor.
In accordance with the smoothing method of the present invention described above, the present invention also provides a computer device and a computer storage medium for implementing the above smoothing method by means of a program.
Referring to Fig. 3, which is a flowchart of the smoothing method of a specific embodiment of the invention, the smoothing method in this embodiment comprises the following steps:
Computing the remaining probability. In the n-gram model, the occurrence probabilities of the occurred words are reduced proportionally so that the probabilities of all occurred words sum to less than 1; a surplus appears in the distribution, giving the remaining probability. The second occurrence count r of each n-gram in the original corpus is obtained, n_r being the number of n-grams occurring exactly r times in the original corpus and n_(r+1) the number occurring exactly r+1 times; the count r is discounted according to n_r, giving the discounted count r*. Discounting r amounts to seizing probability from the occurrence probabilities of the occurred words in the original corpus, reducing them proportionally. The discounted count r* satisfies the formula:
r* = (r + 1) · n_(r+1) / n_r
For an n-gram whose second occurrence count is r, the first occurrence probability after the proportional reduction in the original corpus is p_r = r*/N, where N = Σ_r r·n_r, i.e. N is the sum of the second occurrence counts. The occurrence probabilities of the discounted n-gram model sum to 1 − n_1/N, so the remaining probability is p_0 = n_1/N, where n_1 is the number of singleton words whose second occurrence count is 1. Therefore the second occurrence counts r of the n-grams in the original corpus are obtained, their sum N is computed, the singleton count n_1 is counted, and the remaining probability p_0 = n_1/N is calculated from n_1 and N.
Counting the missing words. The missing words are submitted to the target corpus, i.e. input into the search engine, and the number of results for each missing word is obtained; the number of results is taken as the first occurrence count. There may be multiple missing words.
Computing the normalized frequency index corresponding to each missing word according to the first occurrence counts. The logarithm of each chosen first occurrence count is computed; the logarithms are summed to obtain the logarithm sum; and the logarithm of each chosen first occurrence count is divided by the logarithm sum to obtain the normalized frequency index corresponding to each missing word.
For a missing word w_j w_i, the normalized frequency index is
rate(w_j w_i) = log(c(w_j w_i) + N) / Σ_{m=1..n_0} log(c_m + N)
where c(w_j w_i) is the first occurrence count of w_j w_i, the sum runs over the first occurrence counts c_m of all missing words, and n_0 is the number of missing words.
Computing the smooth probability of each missing word according to the normalized frequency index and the remaining probability, and smoothing each missing word according to its smooth probability. The product of each missing word's normalized frequency index and the remaining probability is taken as its smooth probability:
p_smooth(w_j w_i) = rate(w_j w_i) × p_0
where p_smooth(w_j w_i) is the smooth probability of the missing word w_j w_i.
As shown in Table 1, Table 1 gives the relationship between the occurrence count of a bigram in the original corpus and the number of such bigrams.
Table 1. Occurrence counts of bigrams in the original corpus
r	n_r
1	2053
2	458
3	191
4	107
5	69
6	48
7	36
According to r* = (r + 1) · n_(r+1) / n_r and p_r = r*/N, we have:
Table 2. Frequency distribution of bigrams in the original corpus
r	n_r	r*	p_r
1	2053	0.44618	9.190×10^-5
2	458	1.25109	2.577×10^-4
3	191	2.24084	4.616×10^-4
4	107	3.22430	6.641×10^-4
5	69	4.17391	8.597×10^-4
6	48	5.25000	1.081×10^-3
7	36	-	-
As shown in Table 2, which gives the frequency distribution of the bigrams in the original corpus, the number of singletons with second occurrence count 1 is n_1 = 2053, the sum of the second occurrence counts is N = 4855, and the remaining probability is p_0 = n_1/N = 2053/4855 ≈ 0.4229.
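A minimal sketch (Python) that reproduces Table 2 and the remaining probability from the n_r values of Table 1:

    n = {1: 2053, 2: 458, 3: 191, 4: 107, 5: 69, 6: 48, 7: 36}
    N = sum(r * nr for r, nr in n.items())        # 4855, sum of second occurrence counts
    for r in range(1, 7):                         # r = 7 has no n_8, hence no r* in Table 2
        r_star = (r + 1) * n[r + 1] / n[r]        # discounted count r*
        print(r, n[r], round(r_star, 5), f"{r_star / N:.3e}")
    print("remaining probability:", round(n[1] / N, 4))  # 0.4229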
For the unseen bigram "Wen Liang" ("text amount"), the first occurrence count obtained is 939000, and the logarithm of the first occurrence count is:
f = log10(939000) ≈ 5.97
Assuming that the sum of the logarithms of the first occurrence counts of all missing bigrams is 5000, the smooth probability of the missing word "Wen Liang" is (5.97/5000) × 0.4229 ≈ 5.05×10^-4.
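The arithmetic, under the assumptions stated in the text (a logarithm sum of 5000 and the remaining probability 0.4229 computed above):

    import math

    f = math.log10(939000)     # ≈ 5.97, logarithm of the first occurrence count
    index = f / 5000           # assumed logarithm sum over all missing bigrams
    print(index * 0.4229)      # ≈ 5.05e-04, the smooth probability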
The smoothing system of the invention corresponds one-to-one with the smoothing method of the invention; the technical features and advantages described in the embodiments of the smoothing method above apply equally to the embodiments of the smoothing system, which is hereby stated.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terms used in the specification of the present invention are for the purpose of describing specific embodiments only and are not intended to limit the invention. The term "and/or" used herein includes any and all combinations of one or more of the associated listed items.
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification. Those of ordinary skill in the art will appreciate that all or part of the steps in the above embodiment methods can be completed by instructing the relevant hardware through a program; the program can be stored in a computer-readable storage medium and, when executed, includes the steps of the above method. The storage medium may be, for example, a ROM/RAM, a magnetic disk, or an optical disc.
The embodiments described above express only several implementations of the present invention, and their description is specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the concept of the invention, and these all belong to the protection scope of the invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (11)

1. A smoothing method, characterized by comprising the following steps:
counting a first occurrence count of a missing word in a target corpus, wherein the missing word is a word whose occurrence count in an original corpus is 0;
computing a normalized frequency index of the missing word according to the first occurrence count;
computing a smooth probability of the missing word according to the normalized frequency index and a remaining probability, and smoothing the missing word according to the smooth probability, wherein the remaining probability is the sum of occurrence probabilities, seized from the original corpus, of words whose occurrence count is less than or equal to k, k being a positive integer.
2. The smoothing method according to claim 1, characterized in that there are multiple missing words;
the step of computing the normalized frequency index of each missing word according to its first occurrence count comprises the following steps:
computing the logarithm of each first occurrence count;
summing the logarithms to obtain a logarithm sum;
dividing the logarithm of each first occurrence count by the logarithm sum to obtain the normalized frequency index corresponding to each missing word.
3. The smoothing method according to claim 2, characterized in that the step of computing the logarithm of each first occurrence count comprises the following steps:
increasing each first occurrence count by a value N to obtain absolute occurrence counts;
computing the logarithm of each absolute occurrence count and taking it as the logarithm of the corresponding first occurrence count, wherein N is a positive integer greater than 1.
4. The smoothing method according to claim 1, characterized in that the step of computing the smooth probability of the missing word according to the normalized frequency index and the remaining probability comprises the following step:
taking the product of the normalized frequency index of the missing word and the remaining probability as the smooth probability of the missing word.
5. The smoothing method according to claim 1, characterized in that k is a positive integer greater than 1;
before the step of computing the smooth probability of the missing word according to the normalized frequency index and the remaining probability, the method further comprises the following steps:
obtaining a second occurrence count of each n-gram in the original corpus, and computing the sum of the second occurrence counts;
counting a third occurrence count of each n-gram whose second occurrence count is less than or equal to k, and computing the sum of the third occurrence counts;
computing the remaining probability according to the sum of the second occurrence counts and the sum of the third occurrence counts.
6. The smoothing method according to claim 1, characterized in that the remaining probability is the sum of occurrence probabilities, seized from the original corpus, of words whose occurrence count equals 1;
before the step of computing the smooth probability of the missing word according to the normalized frequency index and the remaining probability, the method further comprises the following steps:
obtaining a second occurrence count of each n-gram in the original corpus and computing the sum of the second occurrence counts;
counting the number of singleton words whose second occurrence count is 1;
computing the remaining probability according to the sum of the second occurrence counts and the singleton word count.
7. The smoothing method according to claim 1, characterized in that after the step of computing the smooth probability of the missing word according to the normalized frequency index and the remaining probability, the method further comprises the following steps:
obtaining a second occurrence count of the n-gram in the original corpus;
computing a first occurrence probability of each occurred word after discounting according to the second occurrence count;
training the n-gram model where the missing word is located according to the smooth probability and the first occurrence probability.
8. The smoothing method according to claim 1, characterized in that the target corpus is an internet corpus entered through a search engine website, and the first occurrence count is the number of relevant results after the missing word is searched on the search engine website.
9. A smoothing system, characterized by comprising:
a count statistics module for counting a first occurrence count of a missing word in a target corpus, wherein the missing word is a word whose occurrence count in an original corpus is 0;
a normalized frequency index computing module for computing a normalized frequency index of the missing word according to the first occurrence count;
a smoothing processing module for computing a smooth probability of the missing word according to the normalized frequency index and a remaining probability, and smoothing the missing word according to the smooth probability, wherein the remaining probability is the sum of occurrence probabilities, seized from the original corpus, of words whose occurrence count is less than or equal to k, k being a positive integer.
10. A computer device comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, characterized in that the processor implements the smoothing method according to any one of claims 1 to 8 when executing the computer program.
11. A computer storage medium on which a computer program is stored, characterized in that the program implements the smoothing method according to any one of claims 1 to 8 when executed by a processor.
CN201810344157.4A 2018-04-17 2018-04-17 Smoothing method and system Active CN108829657B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810344157.4A CN108829657B (en) 2018-04-17 2018-04-17 Smoothing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810344157.4A CN108829657B (en) 2018-04-17 2018-04-17 Smoothing method and system

Publications (2)

Publication Number Publication Date
CN108829657A true CN108829657A (en) 2018-11-16
CN108829657B CN108829657B (en) 2022-05-03

Family

ID=64154406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810344157.4A Active CN108829657B (en) 2018-04-17 2018-04-17 Smoothing method and system

Country Status (1)

Country Link
CN (1) CN108829657B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040249628A1 (en) * 2003-06-03 2004-12-09 Microsoft Corporation Discriminative training of language models for text and speech classification
US20060259480A1 (en) * 2005-05-10 2006-11-16 Microsoft Corporation Method and system for adapting search results to personal information needs
CN101295294A (en) * 2008-06-12 2008-10-29 昆明理工大学 Improved Bayes acceptation disambiguation method based on information gain
US20170177563A1 (en) * 2010-09-24 2017-06-22 National University Of Singapore Methods and systems for automated text correction
CN103116578A (en) * 2013-02-07 2013-05-22 北京赛迪翻译技术有限公司 Translation method integrating syntactic tree and statistical machine translation technology and translation device
CN103488629A (en) * 2013-09-24 2014-01-01 南京大学 Method for extracting translation unit table in machine translation
US20160092434A1 (en) * 2014-09-29 2016-03-31 Apple Inc. Integrated word n-gram and class m-gram language models
CN104408087A (en) * 2014-11-13 2015-03-11 百度在线网络技术(北京)有限公司 Method and system for identifying cheating text
CN106649269A (en) * 2016-12-16 2017-05-10 广州视源电子科技股份有限公司 Extraction method and device of colloquial sentence

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FUCHUN PENG, DALE SCHUURMANS: "Combining Naive Bayes and n-Gram Language Models for Text Classification", ECIR 2003: Advances in Information Retrieval *
WEN JUAN: "Research and Application of Statistical Language Models" (统计语言模型的研究与应用), China Doctoral Dissertations Full-text Database, Information Science and Technology *
CHU YANLING: "Research on Language Models Based on Data Clustering" (基于数据聚类的语言模型研究), China Master's Theses Full-text Database, Information Science and Technology *
HUANG YONGWEN: "Smoothing Techniques for Statistical Language Models Based on Mutual Information" (基于互信息的统计语言模型平滑技术), China Master's Theses Full-text Database, Information Science and Technology *

Also Published As

Publication number Publication date
CN108829657B (en) 2022-05-03

Similar Documents

Publication Publication Date Title
CN105786991A (en) Chinese emotion new word recognition method and system in combination with user emotion expression ways
CN103279478B (en) A kind of based on distributed mutual information file characteristics extracting method
CN111971669A (en) System and method for providing feedback of natural language queries
CN108833458B (en) Application recommendation method, device, medium and equipment
CN108073568A (en) keyword extracting method and device
Bates et al. Counting clusters in twitter posts
CN107357812A (en) A kind of data query method and device
CN107291939A (en) The clustering match method and system of hotel information
CN111738843B (en) Quantitative risk evaluation system and method using running water data
CN105373853A (en) Stock public opinion index prediction method and device
CN109635084A (en) A kind of real-time quick De-weight method of multi-source data document and system
CN105373546A (en) Information processing method and system for knowledge services
CN103455534A (en) Document clustering method and device
CN108182531A (en) Shale gas development evaluation method, apparatus and terminal device
CN109117475A (en) A kind of method and relevant device of text rewriting
CN108182182A (en) Document matching process, device and computer readable storage medium in translation database
CN109471953A (en) A kind of speech data retrieval method and terminal device
CN108829657A (en) smoothing processing method and system
CN105677664A (en) Compactness determination method and device based on web search
CN114138743A (en) ETL task automatic configuration method and device based on machine learning
Mesiarova-Zemankova et al. Averaging operators in fuzzy classification systems
US20160371331A1 (en) Computer-implemented method of performing a search using signatures
Partyka et al. Semantic schema matching without shared instances
Boulkrinat et al. Towards recommender systems based on a fuzzy preference aggregation
CN108733824B (en) Interactive theme modeling method and device considering expert knowledge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant