CN108073565A - Method and apparatus for word normalization, and machine translation method and apparatus

Info

Publication number: CN108073565A
Application number: CN201610989788.2A
Authority: CN
Inventors: 王晓利; 钟延; 张驰; 陈岚; 徐蔚然; 申站; 姜欣; 姜一欣; 武市真知
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2016-11-10
Filing date: 2016-11-10
Publication date: 2018-05-25
Also published as: JP7120751B2; JP2018077850A

Abstract

A method and apparatus for word normalization and a machine translation method and apparatus are provided. The word normalization method includes: obtaining a target word to be normalized; retrieving, using a web search engine, sentences that explain the target word, and determining words in those sentences that are related to the target word as a first group of candidate words representing normalization results of the target word; calculating, based on word vectors, the similarity between the target word and each candidate word in the first group of candidate words, and ranking the candidate words according to the similarity; and determining the normalization result of the target word according to the ranking result. The above word normalization and machine translation techniques use an unsupervised scheme to normalize a non-standard word according to its meaning. They can therefore obtain normalization results for non-standard words formed by meaning modification, and improve machine translation performance on sentences containing such words.

Description

Method and apparatus for word normalization, and machine translation method and apparatus

Technical field

The present disclosure relates generally to natural language processing, and in particular to a method and apparatus for normalizing non-standard words and to a machine translation method and apparatus.

Background

The presence of non-standard words causes many difficulties for natural language processing, and the development of Internet slang has made the problem worse. For example, in a question answering system, given the question "recommend a place that is suitable for small basin friends to play", the system cannot understand the non-standard word "small basin friend" contained in it, so the answer it gives is "no answer". As another example, in machine translation, the source-language sentence is "this small Loli sprouts very much" and the machine translation result is "This little Lolita is too Meng"; the translation fails to translate "small Loli" and "sprouting" correctly. To solve these problems, non-standard words need to be normalized so as to improve the performance of natural language processing.

Existing word normalization methods fall into two broad classes: unsupervised schemes and supervised schemes. In supervised schemes, word normalization is treated as a classification problem, but this requires a massive amount of manual annotation, so the labor cost is very high. In unsupervised schemes, word normalization is treated as a translation problem, but existing unsupervised schemes can only handle phonetic modification and cannot handle meaning modification. More specifically, existing unsupervised word normalization methods can normalize phonetically deformed non-standard words such as "cup" and "pear", but they cannot be applied to non-standard words formed by meaning modification. For example, for new words such as "emperorship", "Hai Tao" and "counteroffensive", and for old words that have taken on new meanings such as "dog blood", "eunuch" and "catching a cold", there is currently no effective normalization method.

Summary of the invention

The present disclosure is proposed in view of at least the above problems.

According to one embodiment of the present disclosure, a word normalization method is provided, including: obtaining a target word to be normalized; retrieving, using a web search engine, sentences that explain the target word, and determining words in those sentences that are related to the target word as a first group of candidate words representing normalization results of the target word; calculating, based on word vectors, the similarity between the target word and each candidate word in the first group of candidate words, and ranking the candidate words according to the similarity; and determining the normalization result of the target word according to the ranking result.

According to another embodiment of the present disclosure, a word normalization apparatus is provided, including: an obtaining component configured to obtain a target word to be normalized; a candidate word determining component configured to retrieve, using a web search engine, sentences that explain the target word, and to determine words in those sentences that are related to the target word as a first group of candidate words representing normalization results of the target word; a similarity ranking component configured to calculate, based on word vectors, the similarity between the target word and each candidate word in the first group of candidate words, and to rank the candidate words according to the similarity; and a normalization component configured to determine the normalization result of the target word according to the ranking result.

According to another embodiment of the present disclosure, a word normalization apparatus is provided, including: a processor; a memory; and computer program instructions stored in the memory, the computer program instructions, when run by the processor, performing the following steps: obtaining a target word to be normalized; retrieving, using a web search engine, sentences that explain the target word, and determining words in those sentences that are related to the target word as a first group of candidate words representing normalization results of the target word; calculating, based on word vectors, the similarity between the target word and each candidate word in the first group of candidate words, and ranking the candidate words according to the similarity; and determining the normalization result of the target word according to the ranking result.

According to another embodiment of the present disclosure, a word normalization method is provided, including: obtaining a target word to be normalized and a set of candidate words representing normalization results of the target word; calculating, based on word vectors, the similarity between the target word and each candidate word in the set, and ranking the candidate words according to the similarity; determining the confidence of the candidate word with the highest similarity among the candidate words; and, if the confidence is greater than a third threshold, taking the candidate word with the highest similarity as the normalization result of the target word.

According to another embodiment of the present disclosure, a word normalization apparatus is provided, including: an obtaining component configured to obtain a target word to be normalized and a set of candidate words representing normalization results of the target word; a similarity ranking component configured to calculate, based on word vectors, the similarity between the target word and each candidate word in the set, and to rank the candidate words according to the similarity; a confidence determining component configured to determine the confidence of the candidate word with the highest similarity among the candidate words; and a normalization component configured to take the candidate word with the highest similarity as the normalization result of the target word if the confidence is greater than a third threshold.

According to another embodiment of the present disclosure, a word normalization apparatus is provided, including: a processor; a memory; and computer program instructions stored in the memory, the computer program instructions, when run by the processor, performing the following steps: obtaining a target word to be normalized and a set of candidate words representing normalization results of the target word; calculating, based on word vectors, the similarity between the target word and each candidate word in the set, and ranking the candidate words according to the similarity; determining the confidence of the candidate word with the highest similarity among the candidate words; and, if the confidence is greater than a third threshold, taking the candidate word with the highest similarity as the normalization result of the target word.

According to yet another embodiment of the present disclosure, a machine translation method is provided, including: detecting a non-standard word in the source language; obtaining, with the non-standard word as the target word, a set of candidate words representing normalization results of the target word; calculating, based on word vectors, the similarity between the target word and each candidate word in the set, and ranking the candidate words according to the similarity; determining the confidence of the candidate word with the highest similarity among the candidate words; and, if the confidence is greater than a third threshold, taking the candidate word with the highest similarity as the normalized word and translating it into the target language.

According to yet another embodiment of the present disclosure, a machine translation apparatus is provided, including: a detection component configured to detect a non-standard word in the source language; a candidate word determining component configured to obtain, with the non-standard word as the target word, a set of candidate words representing normalization results of the target word; a similarity ranking component configured to calculate, based on word vectors, the similarity between the target word and each candidate word in the set, and to rank the candidate words according to the similarity; a confidence determining component configured to determine the confidence of the candidate word with the highest similarity among the candidate words; and a translation component configured to take the candidate word with the highest similarity as the normalized word and translate it into the target language if the confidence is greater than a third threshold.

According to yet another embodiment of the present disclosure, a machine translation apparatus is provided, including: a processor; a memory; and computer program instructions stored in the memory, the computer program instructions, when run by the processor, performing the following steps: detecting a non-standard word in the source language; obtaining, with the non-standard word as the target word, a set of candidate words representing normalization results of the target word; calculating, based on word vectors, the similarity between the target word and each candidate word in the set, and ranking the candidate words according to the similarity; determining the confidence of the candidate word with the highest similarity among the candidate words; and, if the confidence is greater than a third threshold, taking the candidate word with the highest similarity as the normalized word and translating it into the target language.

The word normalization and machine translation techniques according to the embodiments of the present disclosure use an unsupervised scheme to normalize a non-standard word according to its meaning. They can therefore obtain normalization results for non-standard words formed by meaning modification, and improve machine translation performance on sentences containing such words. In addition, the techniques assess the confidence of the normalized word obtained and decide whether to accept it according to that confidence, thereby ensuring the correctness of the normalized word.

Brief description of the drawings

The above and other objects, features and advantages of the present disclosure will become more apparent from the more detailed description of embodiments of the present disclosure given in conjunction with the accompanying drawings. The drawings are provided for further understanding of the embodiments, form part of the specification, and serve to explain the disclosure together with the embodiments; they do not limit the disclosure. In the drawings, the same reference numerals generally denote the same components or steps.

Fig. 1 schematically shows a flowchart of the word normalization method according to the first embodiment of the present disclosure.

Fig. 2 shows a web page obtained by searching for a non-standard word on Baidu Baike.

Fig. 3 shows a web page obtained by searching for a non-standard word on Baidu Zhidao.

Fig. 4 shows a flowchart of the method for determining the confidence of the candidate word with the highest similarity in the first group of candidate words, in the word normalization method according to the first embodiment of the present disclosure.

Fig. 5 illustrates the candidate word scores determined for the candidate words in the first group by the word normalization method according to the first embodiment of the present disclosure, and the result of ranking the candidate words by those scores.

Fig. 6 schematically shows a flowchart of the word normalization method according to the second embodiment of the present disclosure.

Fig. 7 shows a flowchart of the method for determining the second group of candidate words in the word normalization method according to the second embodiment of the present disclosure.

Fig. 8 schematically shows a flowchart of the word normalization method according to the third embodiment of the present disclosure.

Fig. 9 shows a functional block diagram of a word normalization apparatus according to an embodiment of the invention.

Fig. 10 shows a functional block diagram of a word normalization apparatus according to another embodiment of the invention.

Fig. 11 shows a schematic block diagram of a computing device that can be used to implement the word normalization apparatus according to the embodiments of the present disclosure.

Detailed description

To make the objects, technical solutions and advantages of the present disclosure more apparent, example embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure, and it should be understood that the present disclosure is not limited by the example embodiments described herein. All other embodiments obtained by those skilled in the art from the embodiments described in the present disclosure without creative effort shall fall within the protection scope of the present disclosure.

The word normalization method according to the first embodiment of the present disclosure is described in detail below with reference to Fig. 1. Fig. 1 schematically shows a flowchart of the word normalization method according to this embodiment.

As shown in Fig. 1, in step S110, a target word to be normalized is obtained.

The target word to be normalized can be obtained in various ways. For example, it can be entered directly by a user, or it can be detected in a sentence containing it by an existing new-word detection method.

In step S120, sentences that explain the target word are retrieved using a web search engine, and the words in those sentences that are related to the target word are determined as a first group of candidate words representing normalization results of the target word.

In this step, an existing web search engine can be used to retrieve web pages about the target word. Each sentence in the retrieved pages is then matched against predefined templates, and the sentences that match a template are taken as the sentences that explain the target word. The predefined templates are sentence patterns used to explain or define a target word; they can be set in advance from experience, and multiple templates can be defined. A sentence in a retrieved page is regarded as a sentence explaining the target word as long as it matches at least one of the templates. For ease of understanding, the above processing is described below taking Baidu Zhidao and Baidu Baike as example web search engines.

For example, taking "dog blood" as the target word to be normalized, Fig. 2 shows the web page retrieved through Baidu Baike and Fig. 3 shows the web page retrieved through Baidu Zhidao. By matching each sentence in the retrieved pages shown in Figs. 2 and 3 against the predefined templates, sentences explaining the target word "dog blood" such as the following are determined:

a. by extension, it has the meanings of chat, exaggeration and inconceivable; the so-called dog blood

b. the unusual meaning of old stuff

c. it means chat

d. it describes those similar plots that often occur, clumsy imitation, or very exaggerated and very false performance

e. it is exactly the plot in TV dramas that is continuously reproduced and imitated

f. it refers to a performer soaping the audience on stage with a performance that overdoes it

g. it means that a performer does not say sense of propriety when performing
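The template matching described for step S120 can be sketched as follows. This is a minimal illustration, not the templates used in the disclosure: the English patterns, the `TEMPLATES` list and the function name `find_explaining_sentences` are all assumptions, and a real system would use Chinese definition patterns.

```python
import re

# Hypothetical definition templates (English stand-ins). The named group
# marks the position where the explanatory words are expected to appear.
TEMPLATES = [
    re.compile(r"^it means (?P<explanation>.+)$"),
    re.compile(r"^it refers to (?P<explanation>.+)$"),
    re.compile(r"^that is to say,? (?P<explanation>.+)$"),
]


def find_explaining_sentences(sentences):
    """Return (sentence, template_index, explanation) for every sentence
    that matches at least one predefined template."""
    matches = []
    for sentence in sentences:
        normalized = sentence.strip().lower()
        for index, template in enumerate(TEMPLATES):
            found = template.match(normalized)
            if found:
                matches.append((sentence, index, found.group("explanation")))
                break  # matching one template is enough
    return matches


page_sentences = [
    "It means chat",
    "Dog blood is a popular word",
    "It refers to a performer overdoing the performance",
]
explaining = find_explaining_sentences(page_sentences)
```

Recording which template matched is a deliberate choice in this sketch, because later steps use the template both to locate related words and to look up a template score.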

After the sentences explaining the target word have been retrieved as above, the words in those sentences that are related to the target word are further determined as the first group of candidate words representing normalization results of the target word. The words related to the target word can be determined in various appropriate ways.

In a basic implementation, each sentence can be divided into words by word segmentation; the words related to the target word are determined among the segmented words according to the syntactic structure of the template that the sentence matches; stop words and duplicate words are then filtered out of the related words, and the remaining words are taken as the first group of candidate words. As mentioned above, a template is a sentence pattern that explains or defines a target word, so syntactic analysis of the template reveals the positions in the template of the words that are highly related to the target word being explained or defined. For example, for the template "it means ...", the words after "it means" are the highly related words; for the template "that is to say ...", the words after "that is to say" are the highly related words; and so on. Therefore, in this implementation, for a retrieved sentence, the positions of the highly related words in the template matching that sentence are determined, the words at those positions among the segmented words are taken as the words related to the target word, stop words and duplicate words are filtered out of them, and the remaining words are taken as the first group of candidate words. Stop words are words that occur very frequently in natural text but have no material effect on the meaning of an article or page, such as " ", " " and " " in Chinese; they are used often but contribute very little to the semantics.

For ease of understanding, this implementation is described below using the "dog blood" example above. For the retrieved sentences a-g shown above, word segmentation is performed on each sentence, and words related to the target word "dog blood", such as "chat", "exaggeration", "inconceivable", "old stuff", "imitation", "performer", "spectators" and "saying sense of propriety", are obtained according to the syntactic structure. Stop words (such as " ") and duplicate words are then filtered out (for example, "chat" appears in the segmentation results of two sentences, so one of the repeated occurrences is removed), and the remaining words are taken as the first group of candidate words.
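The filtering of stop words and duplicate words described above can be sketched as follows. The stop-word set uses English stand-ins and the function name `build_candidate_group` is hypothetical; the input is assumed to be the related words already extracted from the segmented sentences.

```python
def build_candidate_group(related_words, stop_words):
    """Drop stop words and repeated words, keeping first-seen order."""
    seen = set()
    candidates = []
    for word in related_words:
        if word in stop_words or word in seen:
            continue
        seen.add(word)
        candidates.append(word)
    return candidates


# English stand-ins for stop words; a real list would contain Chinese
# function words.
STOP_WORDS = {"the", "of", "a"}

related = ["chat", "exaggeration", "of", "inconceivable",
           "old stuff", "chat", "imitation"]
first_group = build_candidate_group(related, STOP_WORDS)
```

Here the second "chat" and the stop word "of" are removed, and the remaining words form the first group of candidate words.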

Sometimes the words obtained by word segmentation cannot explain the target word well, whereas extending a segmented word into a phrase explains it better. For example, for the target word "ripe female", a retrieved sentence explaining it may be "it means ripe women"; word segmentation yields "women", which does not explain "ripe female" well, while "ripe women" matches its meaning much better. In addition, if a negative modifier such as "no" or "not having" appears before a segmented word, combining the negative modifier with that word usually matches Chinese usage better. In view of this, in an optional implementation, after a sentence has been divided into words by word segmentation and the words related to the target word have been determined, at least one of the related words is extended based on dependency relations and/or negative modifiers; stop words and duplicate words are then filtered out of the extended words and the other related words, and the remaining words are taken as the first group of candidate words. A dependency relation describes the semantic modification relationship between the constituents of a sentence; in this implementation, the dependency relation may be any one or a combination of the attribute-head relation, the verb-object relation and the adverbial-head structure.

For example, still using the "dog blood" example above, for the retrieved sentences a-g, after word segmentation has been performed on each sentence and words such as "chat", "exaggeration", "inconceivable", "old stuff", "imitation", "performer", "spectators" and "saying sense of propriety" have been obtained according to the syntactic structure, "imitation" can be extended to "clumsy imitation", "spectators" to "soaping the audience", "saying sense of propriety" to "not saying sense of propriety", and so on, based on dependency relations and/or negative modifiers. Stop words and duplicate words are then filtered out of the extended words and the other related words, and the remaining words are taken as the first group of candidate words.
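The extension of related words into phrases can be sketched as follows. This is only an illustration under stated assumptions: the dependency relations are assumed to come from an external parser and are passed in as a hypothetical `modifier_of` mapping, the negative modifiers are English stand-ins, and the function name `extend_related_words` is not from the disclosure.

```python
# English stand-ins for negative modifiers.
NEGATION_WORDS = {"not", "no"}


def extend_related_words(tokens, related_indexes, modifier_of=None):
    """Extend related words into phrases.

    tokens: the segmented sentence.
    related_indexes: positions of the words related to the target word.
    modifier_of: optional {head_index: modifier_index} mapping, assumed to
        be produced by a dependency parser that is not shown here.
    """
    modifier_of = modifier_of or {}
    extended = []
    for index in related_indexes:
        phrase = tokens[index]
        if index in modifier_of:
            # Prepend the dependent word, e.g. an attribute of a noun.
            phrase = tokens[modifier_of[index]] + " " + phrase
        elif index > 0 and tokens[index - 1] in NEGATION_WORDS:
            # Merge a preceding negative modifier with the word.
            phrase = tokens[index - 1] + " " + phrase
        extended.append(phrase)
    return extended


tokens = ["clumsy", "imitation", "not", "saying sense of propriety"]
extended = extend_related_words(tokens, [1, 3], modifier_of={1: 0})
```

With these inputs, "imitation" becomes "clumsy imitation" through the assumed dependency relation, and "saying sense of propriety" becomes "not saying sense of propriety" through the negative modifier.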

In step S130, the similarity between the target word and each candidate word in the first group of candidate words is calculated based on word vectors, and the candidate words are ranked according to the similarity.

As is known in the art, any word can be represented by a word vector, and the closer two word vectors are, the more similar the two words they represent. In this step, the similarity between the target word and each candidate word in the first group is calculated from their word vectors, that is, how similar each candidate word is to the target word is determined, and the candidate words are then ranked by similarity.

Specifically, in this step a word vector can be determined for the target word and for each candidate word, and the similarity (for example, the cosine distance) between the word vector of the target word and that of each candidate word is then calculated as the similarity between the target word and that candidate word.

When determining the word vectors of the target word and the candidate words, in one implementation they can be determined directly with an existing tool such as word embedding. In another implementation, the target word and each candidate word are decomposed into characters, the character vector of each character is determined with an existing tool, and the character vectors of the characters in a word are then summed to obtain the word vector of the target word or candidate word. Optionally, if the character vector of some character cannot be determined, it can be set to zero.
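The character-based variant of step S130 can be sketched as follows: character vectors are summed into a word vector, unknown characters contribute a zero vector, and candidates are ranked by cosine similarity. The two-dimensional toy vectors and all function names are assumptions made for illustration; real vectors would come from a trained embedding model.

```python
import math


def word_vector_from_chars(word, char_vectors, dim):
    """Sum the vectors of the characters in a word; unknown characters
    contribute a zero vector."""
    total = [0.0] * dim
    for char in word:
        vector = char_vectors.get(char, [0.0] * dim)
        total = [a + b for a, b in zip(total, vector)]
    return total


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # no usable vector, treat as dissimilar
    return dot / (norm_u * norm_v)


def rank_candidates(target, candidates, char_vectors, dim):
    """Return (candidate, similarity) pairs, most similar first."""
    target_vector = word_vector_from_chars(target, char_vectors, dim)
    scored = [
        (c, cosine_similarity(
            target_vector, word_vector_from_chars(c, char_vectors, dim)))
        for c in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy character vectors, invented for the example.
CHAR_VECTORS = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranking = rank_candidates("ac", ["ab", "b", "a", "zz"], CHAR_VECTORS, 2)
```

The candidate "zz" has no known characters, so its vector is all zeros and it ends up last with a similarity of 0.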

In step S140, the normalization result of the target word is determined according to the ranking result.

In this step, the normalization result of the target word can be determined from the ranking result according to a predetermined rule. For example, in a basic implementation, the candidate word ranked highest in similarity can be taken directly as the normalization result of the target word.

In an optional implementation, the confidence of the candidate word with the highest similarity in the first group of candidate words can be calculated. If the confidence is greater than a first predetermined threshold, that candidate word is taken as the normalization result of the target word; otherwise, even the candidate word with the highest similarity is considered unable to represent the target word well, that is, no usable normalization result has been obtained for the target word. The first predetermined threshold can be set as needed; as one example, its value can be 0.45.
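The acceptance rule of this optional implementation can be sketched as follows. The function name `select_normalization` is hypothetical, and the confidence value is assumed to be supplied by a separate confidence computation; only the example threshold 0.45 comes from the text.

```python
FIRST_THRESHOLD = 0.45  # example value mentioned in the text


def select_normalization(ranked_candidates, confidence,
                         threshold=FIRST_THRESHOLD):
    """Return the top-ranked candidate if its confidence exceeds the
    threshold, otherwise None (no usable normalization result)."""
    if not ranked_candidates:
        return None
    return ranked_candidates[0] if confidence > threshold else None


accepted = select_normalization(["exaggeration", "chat"], confidence=0.6)
rejected = select_normalization(["exaggeration", "chat"], confidence=0.3)
```

Returning `None` rather than a low-confidence word mirrors the case in which no usable normalization result is obtained for the target word.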

The processing for determining the confidence of the candidate word with the highest similarity in the first group of candidate words is described below with reference to Fig. 4.

As shown in Fig. 4, in step S1401, a candidate word score is calculated for each candidate word in the first group of candidate words.

In this step, the candidate word score of each candidate word can be determined in various appropriate ways. As one example, it can be determined from the frequency of occurrence of the candidate word and the quality of the templates associated with it, so that the higher the frequency and the better the template, the higher the score. Specifically, in this example, for each candidate word: the frequency of occurrence of the candidate word in the retrieved sentences explaining the target word is calculated; the sentences containing the candidate word among those sentences are determined; the predefined templates matching each of those sentences are determined; the predetermined score of each of those templates is determined; and the candidate word score is determined from the highest predetermined score and the frequency of occurrence.

For ease of understanding, the above processing is described below taking the target word "dog blood" above and the candidate word "exaggeration" as an example. "Exaggeration" occurs twice in the retrieved sentences a-g, in sentence a and in sentence d. Assuming that the template matching sentence a has a predetermined score of 4 and the template matching sentence d has a predetermined score of 4.5, the candidate word score of "exaggeration" can be determined from the highest predetermined score, 4.5, and the 2 occurrences in various appropriate ways, such as calculating a weighted average.
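One possible weighted combination for the candidate word score is sketched below. The text does not fix the weights or the exact formula, so the equal weights of 0.5, the per-sentence counting of occurrences and the function name `candidate_word_score` are assumptions made purely for illustration.

```python
def candidate_word_score(candidate, explaining_sentences,
                         template_weight=0.5, frequency_weight=0.5):
    """Combine the best template score and the occurrence count.

    explaining_sentences: list of (words_in_sentence, template_score)
        pairs, one per retrieved sentence that explains the target word.
    The weights are illustrative assumptions, not values from the text.
    """
    template_scores = [
        score for words, score in explaining_sentences if candidate in words
    ]
    if not template_scores:
        return 0.0
    frequency = len(template_scores)  # counted once per sentence
    return (template_weight * max(template_scores)
            + frequency_weight * frequency)


sentences = [
    ({"chat", "exaggeration", "inconceivable"}, 4.0),  # like sentence a
    ({"old stuff"}, 3.0),                              # invented score
    ({"imitation", "exaggeration"}, 4.5),              # like sentence d
]
exaggeration_score = candidate_word_score("exaggeration", sentences)
```

With the template scores 4 and 4.5 from the example and two occurrences, this particular weighting gives 0.5 × 4.5 + 0.5 × 2 = 3.25.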

In step S1402, the candidate words in the first group of candidate words are ranked according to their candidate word scores.

For example, Fig. 5 illustrates the result of determining candidate word scores for the candidate words of "dog blood" and ranking them by score through the example processing in steps S1401-S1402.

In step S1403, the difference between the candidate word scores of each pair of adjacent candidate words is calculated.

In this step, for the candidate words ranked by candidate word score, the difference between the scores of each pair of adjacent candidate words is calculated. For example, for the ranking shown in Fig. 5, the score difference between "exaggeration" and "chat" is calculated, then the score difference between "chat" and "imitation", and so on, up to the score difference between "overdoing" and "not saying sense of propriety".
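Steps S1402 and S1403 can be sketched together as follows. The scores are invented for the example and are not the values shown in Fig. 5; the function name `adjacent_score_gaps` is hypothetical.

```python
def adjacent_score_gaps(candidate_scores):
    """Rank candidates by score (highest first) and return the ranking
    together with the score difference of each adjacent pair."""
    ranked = sorted(candidate_scores.items(),
                    key=lambda item: item[1], reverse=True)
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    return ranked, gaps


# Invented scores, for illustration only.
scores = {"chat": 3.0, "exaggeration": 3.25, "imitation": 2.5}
ranked, gaps = adjacent_score_gaps(scores)
largest_gap = max(gaps)
```

The largest of these differences is one of the quantities that the subsequent confidence calculation takes as input.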

In step S1404, the confidence of the candidate word with the highest similarity is calculated with a trained classifier, based at least on the highest candidate word score, the largest candidate word score difference, and the number of candidate words in the first group.

In this step, the highest candidate word score, the largest candidate word score difference and the number of candidate words in the first group are used as the parameters of the classifier. It should be understood that this is only an example, and other variables can also be selected as parameters of the classifier. For example, in addition to these three variables, a parameter such as the second-highest candidate word score can be added.

In addition, the classifier used in this step is not restricted; various trained classifiers, such as a logistic regression classifier, can be used to calculate the confidence of the candidate word with the highest similarity.
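A logistic regression classifier of the kind mentioned above can be sketched as follows. The weights and bias are hypothetical placeholders; in practice they would be learned from labelled examples, and the text does not disclose any trained values.

```python
import math

# Hypothetical parameters standing in for a trained model.
WEIGHTS = {"highest_score": 0.8, "largest_gap": 1.5, "candidate_count": -0.1}
BIAS = -2.0


def confidence(highest_score, largest_gap, candidate_count):
    """Logistic-regression style confidence in the range (0, 1)."""
    z = (BIAS
         + WEIGHTS["highest_score"] * highest_score
         + WEIGHTS["largest_gap"] * largest_gap
         + WEIGHTS["candidate_count"] * candidate_count)
    return 1.0 / (1.0 + math.exp(-z))


example_confidence = confidence(highest_score=3.25, largest_gap=0.5,
                                candidate_count=8)
```

The resulting value can then be compared with the first predetermined threshold to decide whether the top-ranked candidate word is accepted.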

The processing for determining the confidence of the candidate word with the highest similarity in the first group of candidate words according to the first embodiment of the present disclosure has been described above with reference to Fig. 4. It should be understood that this is only an example and does not limit the present invention; the confidence of the candidate word with the highest similarity can be determined in other ways depending on the circumstances.

The word normalization method according to the first embodiment of the present disclosure has been described in detail above. The method uses an unsupervised scheme to normalize a non-standard word according to its meaning, and can therefore obtain normalization results for non-standard words formed by meaning modification.

In the first embodiment above, a non-standard word is normalized according to its meaning only, so normalization results can be obtained for non-standard words formed by meaning modification. In the second embodiment, the pronunciation of the non-standard word is considered in addition to its meaning, so normalization results can be obtained for non-standard words formed by either phonetic modification or meaning modification. In the following description, only the parts of this embodiment that differ from the first embodiment are described in detail; the parts that are the same as in the first embodiment are not described again.

The word normalization method according to this embodiment is described in detail below with reference to Fig. 6. Fig. 6 schematically shows a flowchart of the word normalization method according to the second embodiment of the present disclosure.

As shown in Fig. 6, in step S610, a target word to be normalized is obtained; in step S620, sentences that explain the target word are retrieved using a web search engine, and the words in those sentences that are related to the target word are determined as a first group of candidate words representing normalization results of the target word; in step S630, the similarity between the target word and each candidate word in the first group of candidate words is calculated based on word vectors, and the candidate words are ranked according to the similarity.

The processing in steps S610-S630 is the same as the processing in steps S110-S130 of the first embodiment, respectively, and is not described in detail here.

Returning to Fig. 6, in step S640, a second group of candidate words representing the standardization result of the target word is determined based on the edit distance to the pinyin of the target word and the frequency of occurrence in a corpus. The processing in this step is described below with reference to Fig. 7.

As shown in Fig. 7, in step S6401, the pinyin of the target word is determined.

In step S6402, alternative words are determined whose pinyin has an edit distance to the pinyin of the target word that is smaller than an alternative threshold.

The edit distance between two character strings is the minimum number of edit operations needed to change one into the other. In general, the smaller the edit distance, the more similar the two strings. In this step, pinyin is treated as a character string; by calculating, one by one, the edit distance between the pinyin of each word in a dictionary and the pinyin of the target word, each alternative word whose pinyin has an edit distance to the pinyin of the target word smaller than the preset alternative threshold can be obtained. How to calculate the edit distance is well known in the art and is not described in detail here. In addition, the present embodiment places no limitation on the dictionary, and any suitable dictionary may be selected as needed; nor does it limit the value of the alternative threshold, which may be set to an appropriate value as needed.
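As an illustration only, the filtering in step S6402 could be sketched as follows. The sketch assumes the standard Levenshtein distance (insertion, deletion, substitution, each costing 1) and a hypothetical dictionary that already maps each word to its pinyin; the function names and the threshold value are assumptions and do not come from the embodiment. The distance function accepts any sequence, so it can compare pinyin letter by letter or, if the pinyin is first split into initials and finals, unit by unit (the description elsewhere counts "x" versus "sh" as a distance of 1, which corresponds to the unit-by-unit case).

```python
from typing import Dict, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences (strings or lists of units)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, unit_a in enumerate(a, start=1):
        curr = [i]
        for j, unit_b in enumerate(b, start=1):
            cost = 0 if unit_a == unit_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def alternatives_by_pinyin(target_pinyin: str,
                           pinyin_dictionary: Dict[str, str],
                           threshold: int) -> Dict[str, int]:
    """Return {word: distance} for dictionary words whose pinyin is closer
    to the target pinyin than the (hypothetical) alternative threshold."""
    result = {}
    for word, pinyin in pinyin_dictionary.items():
        distance = edit_distance(pinyin, target_pinyin)
        if distance < threshold:
            result[word] = distance
    return result
```

The distances returned here can be carried forward to the later scoring step, so they do not need to be recomputed.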

In step S6403, the frequency of occurrence in a corpus is calculated for each alternative word.

The present embodiment places no limitation on the corpus; any existing corpus may be selected as needed. Once a corpus has been selected, the number of times each alternative word occurs in it can be determined and used as the frequency of occurrence of that alternative word.
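A minimal sketch of this counting step is given below. It assumes the corpus has already been segmented into a list of words, which the embodiment does not specify; the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def occurrence_counts(corpus_tokens: Iterable[str],
                      alternative_words: List[str]) -> Dict[str, int]:
    """Count how often each alternative word occurs in a pre-segmented corpus."""
    counts = Counter(corpus_tokens)
    # Counter returns 0 for words that never occur in the corpus.
    return {word: counts[word] for word in alternative_words}
```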

In step S6404, a candidate word score is determined for each alternative word based on the edit distance and the frequency of occurrence.

In this step, the candidate word score of an alternative word may be determined from the edit distance and the frequency of occurrence in any appropriate way, provided that the smaller the edit distance and the higher the frequency of occurrence of the alternative word, the higher its candidate word score. As one example:

candidate word score = (word frequency / maximum word frequency) × n + [1 − (edit distance / (word length × a))] × m

where the word frequency is the frequency of occurrence of the alternative word; the maximum word frequency is the maximum of the frequencies of occurrence of all alternative words; the word length is the number of characters that make up the alternative word (for example, the word rendered here as "children's footwear" has a word length of 2); the edit distance is the edit distance between the alternative word and the target word; n and m are weights; and a is an empirically determined adjustment factor. For example, a may take the maximum value of the edit distance, 4.1.
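The example formula can be written out as a small function. This is a sketch only: the function name is hypothetical, the default weights n = m = 0.5 are assumptions chosen for illustration, and a = 4.1 is the example value given in the text.

```python
def candidate_score(word_freq: float, max_word_freq: float,
                    distance: float, word_length: int,
                    n: float = 0.5, m: float = 0.5, a: float = 4.1) -> float:
    """Example score: higher frequency and smaller edit distance give a higher score.

    n and m are weights (0.5 each is an assumed value, not from the source);
    a is the empirical adjustment factor (4.1 is the example value in the text).
    """
    if max_word_freq <= 0 or word_length <= 0:
        raise ValueError("max_word_freq and word_length must be positive")
    frequency_term = (word_freq / max_word_freq) * n
    distance_term = (1 - distance / (word_length * a)) * m
    return frequency_term + distance_term
```

With these assumed weights, an alternative word that has the maximum frequency and an edit distance of 0 receives a score of 1.0.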

In step S6405, the alternative words whose candidate word score is greater than a candidate word threshold are taken as the second group of candidate words representing the standardization result of the target word.

The present embodiment places no limitation on the value of the candidate word threshold, which may be set to an appropriate value as needed. In this step, by comparing each candidate word score with the candidate word threshold, the alternative words whose candidate word score exceeds the candidate word threshold can be determined as the second group of candidate words.

The processing according to the present embodiment of determining the second group of candidate words based on the edit distance to the pinyin of the target word and the frequency of occurrence in a corpus has been described above with reference to Fig. 7. It should be understood that the above description covers only an illustrative basic approach. Optionally, after step S6402 has determined the alternative words whose pinyin has an edit distance to the pinyin of the target word smaller than the alternative threshold, the edit distance between each alternative word and the target word may be adjusted according to whether each syllable of the alternative word is similar to the corresponding syllable of the target word. Specifically, as mentioned above, step S6402 calculates the edit distance using techniques well known in the art, and according to the techniques currently used in the art, the edit distance between two syllables is calculated only according to whether the two syllables are identical or different. For example, for two identical syllables such as "s" and "s", the edit distance is 0, and for two different syllables such as "x" and "sh", the edit distance is 1. However, it can be appreciated that the edit distance between syllables that are different but similar, such as "s" and "sh" or "en" and "eng", should be smaller than the edit distance between two syllables that are different and dissimilar.

Based on the above understanding, in an optional implementation, each syllable of the alternative word is compared with the corresponding syllable of the target word. If N syllables of the alternative word are different from but similar to the N corresponding syllables of the target word, the edit distance between the alternative word and the target word is reduced by N times a first distance, where N is a natural number. In this optional implementation, which syllables count as similar may be preset, for example based on experience.
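The adjustment in this optional implementation could look roughly like the sketch below. The set of similar pairs contains only the two pairs named in the text ("s"/"sh" and "en"/"eng"); a real preset table would be larger. The value 0.5 for the first distance and all function names are assumptions, and the sketch further assumes that the two syllable lists are already aligned position by position.

```python
from typing import List

# Preset table of "different but similar" syllables; only the two pairs
# mentioned in the description are listed here.
SIMILAR_PAIRS = {frozenset(pair) for pair in [("s", "sh"), ("en", "eng")]}


def is_similar(x: str, y: str) -> bool:
    """True if the two syllables differ but are listed as similar."""
    return x != y and frozenset((x, y)) in SIMILAR_PAIRS


def adjust_distance(distance: float,
                    alt_syllables: List[str],
                    target_syllables: List[str],
                    first_distance: float = 0.5) -> float:
    """Reduce the edit distance by first_distance for each similar syllable pair.

    first_distance = 0.5 is an assumed value; the source leaves it open.
    """
    n = sum(1 for x, y in zip(alt_syllables, target_syllables) if is_similar(x, y))
    return distance - n * first_distance
```

A fractional first distance makes the adjusted edit distance a non-integer, which is consistent with the example adjustment factor of 4.1 being described as a maximum edit distance.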

In another optional implementation, if all syllables of one character of the alternative word are similar to the corresponding syllables of the corresponding character of the target word, the edit distance between the alternative word and the target word is reduced by a second distance.

In another optional implementation, if all syllables of one character of the alternative word are both different from and dissimilar to the corresponding syllables of the corresponding character of the target word, the edit distance between the alternative word and the target word is increased by a third distance.

Returning to Fig. 6, in step S650, the similarity between the target word and each candidate word in the second group of candidate words is calculated based on word vectors, and the candidate word with the highest similarity to the target word is determined.

The processing in this step is similar to the processing in step S130 of the first embodiment and is not described in detail here.

In step S660, the standardization result of the target word is determined according to the ranking result.

In this step, the standardization result of the target word may be determined from the ranking result according to a predetermined rule. For example, in a basic implementation, the similarity of the candidate word with the highest similarity in the first group of candidate words determined in step S630 (hereinafter, the first preferred word) is compared with the similarity of the candidate word with the highest similarity in the second group of candidate words determined in step S650 (hereinafter, the second preferred word), and whichever of the two has the higher similarity is taken as the standardization result of the target word.

In an optional implementation, if the similarity of the second preferred word is not higher than that of the first preferred word, the confidence of the first preferred word is determined based on the candidate word scores of the candidate words in the first group of candidate words; if the confidence is greater than a first predetermined threshold, the first preferred word is taken as the standardization result of the target word. The processing of determining the confidence of the first preferred word has been described above with reference to Fig. 4 and is not repeated here. Optionally, if the first preferred word is also present in the second group of candidate words, the confidence of the first preferred word is set directly to the maximum value.

In an optional implementation, if the similarity of the second preferred word is higher than that of the first preferred word, the confidence of the second preferred word is calculated based on the candidate word scores of the candidate words in the second group of candidate words; if the confidence is greater than a second predetermined threshold, the second preferred word is taken as the standardization result of the target word. In this implementation, the confidence of the second preferred word may be calculated in any appropriate way from the candidate word scores of the second group of candidate words calculated in step S6404. As one example, the candidate words in the second group may be sorted by candidate word score, and the M highest scores (M being a natural number not greater than the number of candidate words in the second group) are then summed and divided by the number of candidate words in the second group; the result is used as the confidence of the second preferred word. Optionally, if the second preferred word is also present in the first group of candidate words, the confidence of the second preferred word is set directly to the maximum value. The second predetermined threshold may be set as needed, and the present embodiment places no limitation on it.
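The selection logic of the two optional implementations, together with the example confidence calculation for the second preferred word, might be sketched as follows. All function and parameter names are hypothetical, and the threshold values are left to the caller because the embodiment does not fix them. The optional shortcut of setting the confidence to the maximum value when a preferred word appears in both groups is omitted for brevity.

```python
from typing import List, Optional, Tuple


def top_m_confidence(scores: List[float], m: int) -> float:
    """Sum of the M highest candidate word scores divided by the group size."""
    if not scores or not 1 <= m <= len(scores):
        raise ValueError("need a non-empty score list and 1 <= m <= len(scores)")
    top = sorted(scores, reverse=True)[:m]
    return sum(top) / len(scores)


def choose_result(first: Tuple[str, float], second: Tuple[str, float],
                  first_confidence: float, second_confidence: float,
                  first_threshold: float, second_threshold: float) -> Optional[str]:
    """Pick the standardization result from the two preferred words.

    `first` and `second` are (word, similarity) pairs. Returns None when the
    selected preferred word does not reach its confidence threshold.
    """
    if second[1] > first[1]:
        return second[0] if second_confidence > second_threshold else None
    return first[0] if first_confidence > first_threshold else None
```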

The word standardization method according to the second embodiment of the present disclosure has been described in detail above. The method considers both the pronunciation and the meaning of the non-standard word, so a standardization result can be obtained for non-standard words that are phonetic variants as well as for those that are meaning-based variants. In addition, when considering the pronunciation of the non-standard word, the method adjusts the edit distance between each alternative word and the target word according to whether their syllables are similar, which improves the standardization results for non-standard words that are phonetic variants.

It should be noted that, although the word standardization method according to the present embodiment has been explained above in the order of steps S610 to S660, this is only an example, and steps S610 to S660 need not be performed in the order described. For example, steps S620 and S630 may be performed in order after steps S640 and S650 have been performed in order, or steps S640 and S650 may be performed in parallel with steps S620 and S630, and so on.

<3rd embodiment>

In the preceding embodiments, the meaning and the pronunciation of the non-standard word are considered in order to determine the candidate words representing the standardization result of the non-standard word. In the word standardization method according to the present embodiment, the way of determining the candidate words representing the standardization result of the non-standard word is not limited; after the candidate words have been determined, the confidence of the best candidate is evaluated, and whether the candidate word is acceptable is decided according to that confidence.

The word standardization method according to the third embodiment of the present disclosure is described below with reference to Fig. 8. Fig. 8 schematically shows a flowchart of the word standardization method according to the third embodiment of the present disclosure.

As shown in Fig. 8, in step S810, a target word to be standardized and a set of candidate words representing the standardization result of the target word are obtained.

As mentioned above, in the word standardization method according to the present embodiment, the way of obtaining the candidate words representing the standardization result of the target word is not limited. For example, the candidate word set may be obtained based on the meaning of the target word as described in the first embodiment, based on both the pronunciation and the meaning of the target word as described in the second embodiment, or based on the pronunciation of the target word alone; any other appropriate method in the art may also be used.

In step S820, the similarity between the target word and each candidate word in the candidate word set is calculated based on word vectors, and the candidate words are ranked according to the similarity.

The processing in this step is identical to the processing in step S130 of the first embodiment and is not repeated here.

In step S830, the confidence of the candidate word with the highest similarity among the candidate words is determined.

In this step, the confidence of the candidate word with the highest similarity may be determined in any appropriate way. For example, for a candidate word set obtained based on the meaning of the target word and/or based on the pinyin of the target word, the confidence of the candidate word with the highest similarity may be determined as described in the first and second embodiments of the present disclosure, which is not repeated here.

In step S840, if the confidence is greater than a third threshold, the candidate word with the highest similarity is taken as the standardization result of the target word.

If the confidence of the candidate word with the highest similarity is determined to be greater than the preset third threshold, the candidate word is considered to represent the target word well and can therefore be taken as the standardization result of the non-standard target word; otherwise, it is considered that no acceptable standardization result has been obtained for the target word. The third threshold may be set based on experience and actual needs; as one example, its value may be 0.6.
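Steps S820 to S840 can be summarized in a few lines. In this sketch the similarity and confidence calculations are passed in as functions, because the embodiment deliberately leaves them open; the function name and signature are hypothetical, and the default of 0.6 is the example value of the third threshold given in the text.

```python
from typing import Callable, List, Optional


def normalize_with_threshold(target: str,
                             candidates: List[str],
                             similarity: Callable[[str, str], float],
                             confidence: Callable[[str, List[str]], float],
                             third_threshold: float = 0.6) -> Optional[str]:
    """Rank candidates by similarity, then accept the best one only if its
    confidence exceeds the third threshold; otherwise return None."""
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda c: similarity(target, c), reverse=True)
    best = ranked[0]
    return best if confidence(best, ranked) > third_threshold else None
```

Returning None corresponds to the case in which no acceptable standardization result is obtained for the target word.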

The word standardization method according to the third embodiment of the present disclosure has been described in detail above. This method does not limit the way of determining the candidate words representing the standardization result of the non-standard word; after the candidate words have been determined, it evaluates the confidence of the best candidate and decides according to that confidence whether the candidate word is acceptable, thereby helping to ensure the correctness of the standardized word.

In addition, the word standardization method according to the present embodiment can be applied to machine translation. More specifically, the present embodiment in effect also provides a machine translation method comprising the following steps: (i) detecting a non-standard word in the source language as a target word; (ii) obtaining a set of candidate words representing the standardization result of the target word; (iii) calculating, based on word vectors, the similarity between the target word and each candidate word in the candidate word set, and ranking the candidate words according to the similarity; (iv) determining the confidence of the candidate word with the highest similarity among the candidate words; and (v) if the confidence is greater than the third threshold, translating the candidate word with the highest similarity into the target language as the standardized word. In step (i), the non-standard word may be detected in a sentence containing it by an existing new-word detection method or the like. In step (v), the standardized word may be translated into the target language by any common machine translation method. The processing in the remaining steps (ii) to (iv) is similar to the processing in the corresponding steps of the word standardization method according to the present embodiment and is not repeated here.

Fig. 9 shows a functional block diagram of a word standardization device 900 according to an embodiment of the present invention.

As shown in Fig. 9, the word standardization device 900 includes an acquisition component 910, a candidate word determination component 920, a similarity ranking component 930 and a standardization component 940. The specific functions and operations of these components are essentially the same as those described above with reference to Figs. 1 to 7; therefore, to avoid repetition, only a brief description of the device is given below, and detailed description of the identical details is omitted.

The acquisition component 910 is configured to obtain a target word to be standardized. The acquisition component 910 may obtain the target word in various ways; for example, the target word may be input directly by a user, or it may be detected in a sentence containing it by an existing new-word detection method or the like.

The candidate word determination component 920 is configured to use a web search engine to retrieve sentences that explain the target word, and to determine words in those sentences that are related to the target word as a first group of candidate words representing the standardization result of the target word.

Specifically, the candidate word determination component 920 may use an existing web search engine to retrieve web pages related to the target word, match each sentence in the retrieved web pages against pre-defined templates, and take the sentences that match a template as the sentences explaining the target word. The pre-defined templates are template sentence patterns used to explain or define a target word; they may be preset based on experience, and multiple templates may be defined. As long as a sentence in the web pages retrieved by the candidate word determination component 920 matches at least one of the templates, the sentence is considered to be a sentence explaining the target word.

After the sentences explaining the target word have been retrieved as described above, the candidate word determination component 920 determines, in any appropriate way, the words in those sentences that are related to the target word as the first group of candidate words representing the standardization result of the target word. For example, in a basic implementation, the candidate word determination component 920 may divide a sentence into words by word segmentation, determine which of the resulting words are related to the target word according to the syntactic structure of the template that the sentence matches, filter stop words and duplicate words out of the related words thus determined, and take the remaining words as the first group of candidate words. In an optional implementation, after dividing the sentence into words by word segmentation and determining the words related to the target word, the candidate word determination component 920 may extend at least one of the related words based on dependency relations and/or negation modifiers, then filter stop words and duplicate words out of the extended words and the other related words that were not extended, and take the remaining words as the first group of candidate words.

The similarity ranking component 930 is configured to calculate, based on word vectors, the similarity between the target word and each candidate word in the first group of candidate words, and to rank the candidate words according to the similarity.

As is known in the art, any word can be represented by a word vector, and the closer two word vectors are to each other, the more similar the two words they represent. The similarity ranking component 930 uses word vectors to calculate the similarity between each candidate word in the first group of candidate words and the target word, and then ranks the candidate words by similarity. Specifically, the similarity ranking component 930 may determine a word vector for the target word and for each candidate word, and then calculate the similarity between the word vector of the target word and the word vector of each candidate word as the similarity between the target word and that candidate word. When determining the word vectors, in one implementation, the similarity ranking component 930 may determine the word vectors of the target word and of each candidate word directly with an existing tool such as word embedding. In another implementation, the similarity ranking component 930 may decompose the target word and each candidate word into characters, determine the character vector of each character with an existing tool, and finally sum the character vectors of the characters contained in a word to obtain the word vector of the target word or candidate word. Optionally, if the character vector of some character cannot be determined, that character vector may be set to zero.
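The character-based variant described above might be sketched as follows. The sketch assumes that character vectors are available in a plain dictionary (in practice they would come from a trained embedding model) and uses cosine similarity as the similarity measure; the text only speaks of a similarity between vectors, so cosine is an assumption, as are all function names.

```python
import math
from typing import Dict, List, Tuple


def word_vector(word: str, char_vectors: Dict[str, List[float]], dim: int) -> List[float]:
    """Sum the vectors of the characters in the word; unknown characters count as zero."""
    vec = [0.0] * dim
    for ch in word:
        char_vec = char_vectors.get(ch, [0.0] * dim)
        vec = [v + c for v, c in zip(vec, char_vec)]
    return vec


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def rank_candidates(target: str, candidates: List[str],
                    char_vectors: Dict[str, List[float]],
                    dim: int) -> List[Tuple[str, float]]:
    """Return (candidate, similarity) pairs sorted from most to least similar."""
    target_vec = word_vector(target, char_vectors, dim)
    scored = [(c, cosine(target_vec, word_vector(c, char_vectors, dim)))
              for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The first element of the returned list is the candidate word with the highest similarity.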

The standardization component 940 is configured to determine the standardization result of the target word according to the ranking result.

The standardization component 940 may determine the standardization result of the target word from the ranking result according to a predetermined rule. For example, in a basic implementation, the standardization component 940 may directly take the candidate word ranked highest in similarity as the standardization result of the target word.

In an optional implementation, the standardization component 940 may calculate the confidence of the candidate word with the highest similarity in the first group of candidate words. If the confidence is greater than a first predetermined threshold, the candidate word with the highest similarity is taken as the standardization result of the target word; otherwise, it is considered that even the candidate word with the highest similarity does not represent the target word well, that is, no usable standardization result has been obtained for the target word. The first predetermined threshold may be set as needed; as one example, its value may be 0.45. In this implementation, the standardization component 940 may include a first scoring unit 9401 (not shown), a sorting unit 9402 (not shown), an adjacent difference calculation unit 9403 (not shown) and a classifier unit 9404 (not shown).

The first scoring unit 9401 is configured to calculate a candidate word score for each candidate word in the first group of candidate words. The first scoring unit 9401 may determine the candidate word score of each candidate word in any appropriate way. As one example, the candidate word score may be determined from the frequency of occurrence of the candidate word and the quality of the templates related to the candidate word, so that the higher the frequency of occurrence and the better the template, the higher the candidate word score. Specifically, according to this example, for each candidate word: the frequency of occurrence of the candidate word in the retrieved sentences explaining the target word is calculated; each sentence containing the candidate word among those sentences is determined; the pre-defined templates that match each of those sentences are determined; the predetermined score of each of those templates is determined; and the candidate word score of the candidate word is determined based on the highest predetermined score and the frequency of occurrence.

The sorting unit 9402 is configured to sort the candidate words in the first group of candidate words according to their candidate word scores.

The adjacent difference calculation unit 9403 is configured to calculate the difference in candidate word score between each pair of adjacent candidate words.

The classifier unit 9404 is configured to calculate the confidence of the candidate word with the highest similarity using a trained classifier, based at least on the highest candidate word score, the maximum difference in candidate word score, and the number of candidate words in the first group.

Here, the classifier unit 9404 uses the highest candidate word score, the maximum difference in candidate word score and the number of candidate words in the first group as the parameters of the classifier. It should be understood that this is only an example, and other variables may also be selected as parameters of the classifier. For example, in addition to these three variables, a parameter such as the second-highest candidate word score may be added.
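To make the three classifier parameters concrete, the sketch below builds them from a list of candidate word scores and feeds them to a logistic function. The weights and bias would come from training, which is not shown; the function names are hypothetical, and a logistic model is used only because logistic regression is named in the description as one possible classifier.

```python
import math
from typing import List


def confidence_features(scores: List[float]) -> List[float]:
    """Build [highest score, largest adjacent difference, number of candidates]."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, reverse=True)
    # Differences between each pair of neighbours in the sorted list.
    diffs = [hi - lo for hi, lo in zip(ranked, ranked[1:])]
    max_diff = max(diffs) if diffs else 0.0
    return [ranked[0], max_diff, float(len(ranked))]


def logistic_confidence(features: List[float], weights: List[float], bias: float) -> float:
    """Confidence in (0, 1) from a logistic model with already-trained parameters."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

Adding a further parameter, such as the second-highest candidate word score, would only require appending it to the feature list and training one more weight.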

In addition, the classifier used by the classifier unit 9404 is not restricted; any trained classifier, such as a logistic regression classifier, may be used to calculate the confidence of the candidate word with the highest similarity.

Optionally, the candidate word determination component 920 may be further configured to determine a second group of candidate words representing the standardization result of the target word, based on the edit distance to the pinyin of the target word and the frequency of occurrence in a corpus. Specifically, the candidate word determination component 920 may be further configured to include: a pinyin determination unit 9201 (not shown), configured to determine the pinyin of the target word; an alternative word determination unit 9202 (not shown), configured to determine alternative words whose pinyin has an edit distance to the pinyin of the target word smaller than an alternative threshold; a frequency determination unit 9203 (not shown), configured to calculate, for each alternative word, its frequency of occurrence in a corpus; a second scoring unit 9204 (not shown), configured to determine the candidate word score of each alternative word based on the edit distance and the frequency of occurrence; a pinyin candidate word determination unit 9205 (not shown), configured to take the alternative words whose candidate word score is greater than a candidate word threshold as the second group of candidate words representing the standardization result of the target word; and an adjustment unit 9206 (not shown), configured to adjust, for each alternative word determined by the alternative word determination unit 9202, the edit distance between the alternative word and the target word according to whether each syllable of the alternative word is similar to the corresponding syllable of the target word. The specific functions and operations of these units are the same as those described above with reference to Fig. 7 and are not described in detail here.

Optionally, the similarity ranking component 930 may be further configured to calculate, based on word vectors, the similarity between the target word and each candidate word in the second group of candidate words, and to determine the candidate word with the highest similarity to the target word.

Optionally, the standardization component 940 may be further configured to determine the standardization result of the target word from the ranking result according to a predetermined rule, once the similarity ranking component 930 has determined the candidate word with the highest similarity in the first group of candidate words (hereinafter, the first preferred word) and the candidate word with the highest similarity in the second group of candidate words (hereinafter, the second preferred word).

For example, in a basic implementation, the standardization component 940 may compare the similarity of the first preferred word with that of the second preferred word and take whichever of the two has the higher similarity as the standardization result of the target word.

In an optional implementation, if the similarity of the second preferred word is not higher than that of the first preferred word, the standardization component 940 determines the confidence of the first preferred word based on the candidate word scores of the candidate words in the first group of candidate words; if the confidence is greater than the first predetermined threshold, the first preferred word is taken as the standardization result of the target word. Optionally, if the first preferred word is also present in the second group of candidate words, the confidence of the first preferred word is set directly to the maximum value.

In an optional implementation, if the similarity of the second preferred word is higher than that of the first preferred word, the standardization component 940 calculates the confidence of the second preferred word based on the candidate word scores of the candidate words in the second group of candidate words; if the confidence is greater than the second predetermined threshold, the second preferred word is taken as the standardization result of the target word. Here, the confidence of the second preferred word may be calculated in any appropriate way from the candidate word scores of the second group of candidate words calculated by the candidate word determination component 920. As one example, the candidate words in the second group may be sorted by candidate word score, and the M highest scores (M being a natural number not greater than the number of candidate words in the second group) are then summed and divided by the number of candidate words in the second group; the result is used as the confidence of the second preferred word. Optionally, if the second preferred word is also present in the first group of candidate words, the confidence of the second preferred word is set directly to the maximum value.

The word standardization device according to the embodiment of the present disclosure has been described in detail above. The device can standardize a non-standard word according to its meaning, so a standardization result can be obtained for non-standard words that are meaning-based variants. The device can also consider both the pronunciation and the meaning of the non-standard word when standardizing it, so a standardization result can be obtained for non-standard words that are phonetic variants as well as for those that are meaning-based variants. In addition, when considering the pronunciation of the non-standard word, the device adjusts the edit distance between each alternative word and the target word according to whether their syllables are similar, which improves the standardization results for non-standard words that are phonetic variants.

Fig. 10 shows a functional block diagram of a word standardization device 1000 according to another embodiment of the present invention.

As shown in Fig. 10, the word standardization device 1000 includes an acquisition component 1010, a similarity ranking component 1020, a confidence determination component 1030 and a standardization component 1040. The specific functions and operations of these components are essentially the same as those described above with reference to Fig. 8; therefore, to avoid repetition, only a brief description of the device is given below, and detailed description of the identical details is omitted.

Obtaining widget 1010 is configured to the time for the standardization result for obtaining target word to be standardized and representing the target word Select set of words.

The acquisition component 1010 can obtain the target word to be standardized in various ways: for example, the word can be input directly by the user, or detected in a sentence containing it by an existing new-word detection method. As noted earlier, the word standardization method according to this embodiment places no restriction on how the candidate words representing the standardization result of the target word are obtained. For example, the acquisition component 1010 may obtain the candidate word set based on the meaning of the target word in the manner described in the first embodiment, based on both the pronunciation and the meaning of the target word in the manner described in the second embodiment, or based only on the pronunciation of the target word; any other appropriate method in the art may also be used.

The similarity ranking component 1020 is configured to calculate, based on term vectors, the similarity between the target word and each candidate word in the candidate word set, and to rank the candidate words according to the similarity.
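A minimal sketch of this ranking step is given below. It assumes cosine similarity as the similarity measure, which is one common choice and is not mandated by this passage, and it builds each term vector by summing per-character vectors, as in the variant recited in the claims. The function names and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Mapping, Sequence, Tuple


def term_vector(word: str, char_vectors: Mapping[str, Sequence[float]]) -> List[float]:
    """Sum the vectors of the characters of a word (raises KeyError for unknown characters)."""
    dim = len(next(iter(char_vectors.values())))
    total = [0.0] * dim
    for ch in word:
        for i, x in enumerate(char_vectors[ch]):
            total[i] += x
    return total


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(
    target_vec: Sequence[float],
    candidate_vecs: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (candidate word, similarity) pairs, most similar first."""
    scored = [(w, cosine_similarity(target_vec, vec)) for w, vec in candidate_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The first element of the returned list corresponds to the candidate word with the highest similarity, whose confidence is then assessed.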

The confidence determination component 1030 is configured to determine the confidence of the candidate word with the highest similarity among the candidate words. Here, the confidence determination component 1030 may use any appropriate way to determine the confidence of the highest-similarity candidate word (hereinafter, the preferred term). For example, for a candidate word set obtained based on the meaning of the target word, or one obtained based on the pinyin of the target word, the confidence of the preferred term can be determined as described in the first and second embodiments of the present invention.

The standardization component 1040 is configured to take the highest-similarity candidate word as the standardization result of the target word if the confidence is greater than a third threshold.

If the standardization component 1040 determines that the confidence of the preferred term is greater than the preset third threshold, the preferred term is considered to represent the target word well, and the standardization component 1040 can therefore take it as the standardization result of the non-standard target word. Otherwise, no acceptable standardization result is considered to have been obtained for the target word.
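The accept-or-reject decision made by the standardization component can be sketched as follows. The function name, the `confidence_of` callable (standing in for whatever confidence calculation is used), and the example threshold are assumptions of this sketch.

```python
from typing import Callable, List, Optional, Tuple


def choose_standardization(
    ranked: List[Tuple[str, float]],        # (candidate word, similarity), best first
    confidence_of: Callable[[str], float],  # hypothetical confidence function
    third_threshold: float,
) -> Optional[str]:
    """Return the preferred term if its confidence exceeds the third threshold.

    Returns None when there is no candidate or the confidence is too low,
    meaning no acceptable standardization result was obtained.
    """
    if not ranked:
        return None
    preferred, _similarity = ranked[0]
    if confidence_of(preferred) > third_threshold:
        return preferred
    return None


# Hypothetical usage: a constant confidence of 0.8 against a threshold of 0.5.
result = choose_standardization([("a", 0.9), ("b", 0.1)], lambda w: 0.8, 0.5)
```

Returning `None` rather than a low-confidence candidate mirrors the text: when the threshold is not exceeded, the target word is left without a standardization result.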

The word standardization device according to this embodiment has been described in detail above. This device places no restriction on how the candidate words representing the standardization result of a non-standard word are determined; once the candidate words have been determined, it assesses their confidence and decides according to that confidence whether a candidate word is acceptable, thereby ensuring the correctness of the standardized word.

In the following, a schematic block diagram of a computing device that can be used to implement the word standardization device of the embodiments of the present disclosure is described with reference to Figure 11.

As shown in Figure 11, the computing device 1100 includes one or more processors 1102, a storage device 1104, an input device 1106, and an output device 1108, which are interconnected by a bus system 1110 and/or another form of connection mechanism (not shown). It should be noted that the components and structure of the computing device 1100 shown in Figure 11 are illustrative rather than restrictive; the computing device 1100 may have other components and structures as needed.

The processor 1102 may be a central processing unit (CPU) or another form of processing unit with data processing capability and/or instruction execution capability, and may control the other components of the computing device 1100 to perform desired functions.

The storage device 1104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, or flash memory. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 1102 may run those program instructions to implement the functions of the embodiments of the present disclosure described above and/or other desired functions. Various application programs and data may also be stored on the computer-readable storage medium, for example the items mentioned above: the target word to be standardized, the sentences explaining the target word, the first group of candidate words, the second group of candidate words, the similarity of each candidate word, the predefined sentence templates, the term vector of each candidate word, the pinyin of the target word, the edit distance of each candidate word, the candidate word scores, the confidence of the preferred term, and the various thresholds.

The input device 1106 receives input information from the user, such as the target word to be standardized, and may include various input devices such as a wired/wireless network card, a keyboard, a mouse, a touch screen, and a microphone.

The output device 1108 can output various information to the outside, such as the standardization result of a non-standard word, and may include various output devices such as a wired/wireless network card, a display, a projector, and a television.

The basic principles of the present disclosure have been described above in connection with specific embodiments. It should be noted, however, that the merits, advantages, and effects mentioned in the present disclosure are only examples and not limitations, and must not be regarded as prerequisites for every embodiment of the disclosure. In addition, the specific details disclosed above serve only for illustration and ease of understanding and are not limiting; they do not restrict the disclosure to being implemented with those specific details.

The block diagrams of components, apparatuses, devices, and systems in the present disclosure are only illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown in the block diagrams. As those skilled in the art will recognize, these components, apparatuses, devices, and systems may be connected, arranged, or configured in any manner. Words such as "including", "comprising", and "having" are open-ended terms that mean "including but not limited to" and can be used interchangeably with that phrase. The words "or" and "and" as used herein mean "and/or" and can be used interchangeably with it, unless the context clearly indicates otherwise. The phrase "such as" as used herein means "such as, but not limited to" and can be used interchangeably with it.

In addition, as used herein, "or" in an enumeration of items beginning with "at least one of" indicates a disjunctive enumeration, so that an enumeration such as "at least one of A, B, or C" means A or B or C, or AB or AC or BC, or ABC (that is, A and B and C). Furthermore, the word "exemplary" does not mean that the described example is preferred or better than other examples.

It should also be noted that, in the systems and methods of the present disclosure, the components or steps can be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalents of the present disclosure.

Various changes, substitutions, and alterations may be made to the techniques described herein without departing from the teachings defined by the appended claims. Moreover, the scope of the claims of the present disclosure is not limited to the specific aspects of the process, machine, manufacture, composition of matter, means, methods, and actions described above. Processes, machines, manufactures, compositions of matter, means, methods, or actions, presently existing or later to be developed, that perform substantially the same function or achieve substantially the same result as the corresponding aspects described herein may be utilized. Accordingly, the appended claims include within their scope such processes, machines, manufactures, compositions of matter, means, methods, or actions.

The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects without departing from the scope of the present disclosure. Therefore, the present disclosure is not intended to be limited to the aspects shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The above description has been presented for purposes of illustration and description. Furthermore, this description is not intended to restrict the embodiments of the present disclosure to the forms disclosed herein. Although a number of exemplary aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, changes, additions, and sub-combinations thereof.

Claims

1. A method of word standardization, comprising:

obtaining a target word to be standardized;

retrieving, using a web search engine, sentences that explain the target word, and determining words in the sentences that are related to the target word as a first group of candidate words representing the standardization result of the target word;

calculating, based on term vectors, the similarity between the target word and each candidate word in the first group of candidate words, and ranking the candidate words according to the similarity; and

determining the standardization result of the target word according to the result of the ranking.

2. The method of word standardization according to claim 1, wherein retrieving, using a web search engine, sentences that explain the target word comprises:

retrieving web pages related to the target word using the web search engine; and

matching each sentence in the retrieved web pages against predefined templates, and taking the sentences that match a template as the sentences that explain the target word.

3. The method of word standardization according to claim 2, wherein determining words in the sentences that are related to the target word as the first group of candidate words representing the standardization result of the target word comprises:

segmenting the sentence into words;

determining, according to the syntactic structure of the template matched by the sentence, the words related to the target word among the words obtained by segmentation; and

filtering stop words and duplicate words out of the determined related words, and taking the remaining words as the first group of candidate words.

4. The method of word standardization according to claim 2, wherein determining words in the sentences that are related to the target word as the first group of candidate words representing the standardization result of the target word comprises:

segmenting the sentence into words;

expanding at least one of the determined related words based on dependency relations and/or negation modifiers; and

filtering stop words and duplicate words out of the expanded words and the related words other than the expanded words, and taking the remaining words as the first group of candidate words.

5. The method of word standardization according to claim 4, wherein the dependency relations include at least one of an attributive-head relation, a verb-object relation, and an adverbial-head structure.

6. The method of word standardization according to any one of claims 1-5, wherein calculating, based on term vectors, the similarity between the target word and each candidate word in the first group of candidate words comprises:

determining a respective term vector for the target word and for each candidate word; and

calculating the similarity between the term vector of the target word and the term vector of each candidate word as the similarity between the target word and that candidate word.

7. The method of word standardization according to claim 6, wherein determining a respective term vector for the target word and for each candidate word comprises:

decomposing the target word into characters, and determining the character vector corresponding to each character;

summing the character vectors to obtain the term vector of the target word;

decomposing each candidate word into characters, and determining the character vector corresponding to each character; and

for each candidate word, summing the character vectors of its characters to obtain the term vector of that candidate word.

8. The method of word standardization according to claim 2, wherein determining the standardization result of the target word according to the result of the ranking comprises:

determining, based on the candidate word score of each candidate word in the first group of candidate words, the confidence of the candidate word with the highest similarity in the first group of candidate words; and

if the confidence is greater than a first predetermined threshold, taking the candidate word with the highest similarity as the standardization result of the target word.

9. The method of word standardization according to claim 8, wherein determining the confidence of the candidate word with the highest similarity in the first group of candidate words comprises:

calculating a candidate word score for each candidate word in the first group of candidate words;

sorting the candidate words in the first group of candidate words according to the candidate word score;

calculating the difference in candidate word score between each pair of adjacent candidate words; and

calculating the confidence of the candidate word with the highest similarity using a trained classifier, based at least on the highest candidate word score, the largest difference in candidate word score, and the number of candidate words in the first group.

10. The method of word standardization according to claim 9, wherein calculating a candidate word score for each candidate word in the first group of candidate words comprises:

calculating the frequency with which the candidate word occurs in the retrieved sentences that explain the target word;

determining each sentence, among the sentences that explain the target word, that contains the candidate word;

determining the templates, among the predefined templates, that respectively match each of those sentences;

determining the respective predetermined score of each of those templates; and

determining the candidate word score of the candidate word based on the highest predetermined score and the frequency of occurrence.

11. The method of word standardization according to claim 1, further comprising:

determining the pinyin of the target word;

determining alternative words whose pinyin has an edit distance from the pinyin of the target word that is less than an alternative threshold;

calculating, for each alternative word, its frequency of occurrence in a corpus;

determining a candidate word score for each alternative word based on the edit distance and the frequency of occurrence; and

taking the alternative words whose candidate word score is greater than a candidate word threshold as a second group of candidate words representing the standardization result of the target word.

12. The method of word standardization according to claim 11, further comprising:

for each alternative word, adjusting the edit distance between the alternative word and the target word according to whether each of its syllables is similar to the corresponding syllable of the target word.

13. The method of word standardization according to claim 12, wherein adjusting, for each alternative word, the edit distance between the alternative word and the target word according to whether each of its syllables is similar to the corresponding syllable of the target word comprises:

comparing each syllable in the alternative word with the corresponding syllable in the target word; and

if N syllables in the alternative word are different from but similar to the N corresponding syllables in the target word, reducing the edit distance between the alternative word and the target word by N first distances, where N is a natural number.

14. The method of word standardization according to claim 13, wherein adjusting, for each alternative word, the edit distance between the alternative word and the target word according to whether each of its syllables is similar to the corresponding syllable of the target word further comprises:

if all syllables of a character of the alternative word are similar to all corresponding syllables of the corresponding character of the target word, reducing the edit distance between the alternative word and the target word by a second distance; and

if all syllables of a character of the alternative word are both different from and not similar to all corresponding syllables of the corresponding character of the target word, increasing the edit distance between the alternative word and the target word by a third distance.

15. The method of word standardization according to any one of claims 11-14, further comprising:

calculating, based on term vectors, the similarity between the target word and each candidate word in the second group of candidate words, and determining the candidate word among them that has the highest similarity to the target word.

16. The method of word standardization according to claim 15, wherein determining the standardization result of the target word according to the result of the ranking comprises:

determining the candidate word with the highest similarity in the first group of candidate words;

if the similarity of the highest-similarity candidate word in the second group of candidate words is higher than that of the highest-similarity candidate word in the first group of candidate words, calculating the confidence of the highest-similarity candidate word in the second group based on the candidate word score of each candidate word in the second group of candidate words; and

if the confidence is greater than a second predetermined threshold, taking the highest-similarity candidate word in the second group of candidate words as the standardization result of the target word.

17. The method of word standardization according to claim 16, wherein determining the standardization result of the target word according to the result of the ranking further comprises:

if the similarity of the highest-similarity candidate word in the second group of candidate words is not higher than that of the highest-similarity candidate word in the first group of candidate words, calculating the confidence of the highest-similarity candidate word in the first group based on the candidate word score of each candidate word in the first group of candidate words; and

if the confidence is greater than a first predetermined threshold, taking the highest-similarity candidate word in the first group of candidate words as the standardization result of the target word.

18. The method of word standardization according to claim 17, further comprising:

if the highest-similarity candidate word in the first group of candidate words also exists in the second group of candidate words, setting the confidence of the highest-similarity candidate word in the first group of candidate words to the maximum value; and

if the highest-similarity candidate word in the second group of candidate words also exists in the first group of candidate words, setting the confidence of the highest-similarity candidate word in the second group of candidate words to the maximum value.

19. A device for word standardization, comprising:

an acquisition component configured to obtain a target word to be standardized;

a candidate word determination component configured to retrieve, using a web search engine, sentences that explain the target word, and to determine words in the sentences that are related to the target word as a first group of candidate words representing the standardization result of the target word;

a similarity ranking component configured to calculate, based on term vectors, the similarity between the target word and each candidate word in the first group of candidate words, and to rank the candidate words according to the similarity; and

a standardization component configured to determine the standardization result of the target word according to the result of the ranking.

20. A device for word standardization, comprising:

a processor;

a memory; and

computer program instructions stored in the memory, which, when run by the processor, perform the following steps:

obtaining a target word to be standardized;

21. A method of word standardization, comprising:

obtaining a target word to be standardized and a candidate word set representing the standardization result of the target word;

calculating, based on term vectors, the similarity between the target word and each candidate word in the candidate word set, and ranking the candidate words according to the similarity;

determining the confidence of the candidate word with the highest similarity among the candidate words; and

if the confidence is greater than a third threshold, taking the candidate word with the highest similarity as the standardization result of the target word.

22. A machine translation method, comprising:

detecting a non-standard word in the source language;

obtaining a candidate word set representing the standardization result of the target word;

if the confidence is greater than a third threshold, taking the candidate word with the highest similarity as the standardized word, and translating it into the target language.