CN108073565A - Method and apparatus for word normalization, and machine translation method and device
- Publication number
- CN108073565A (application CN201610989788.2A)
- Authority
- CN
- China
- Prior art keywords
- word
- candidate
- candidate word
- target
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Provided are a word normalization method and apparatus and a machine translation method and device. The word normalization method includes: obtaining a target word to be normalized; retrieving, using a web search engine, sentences that explain the target word, and determining words in those sentences that are related to the target word as a first group of candidate words representing normalization results of the target word; computing, based on word vectors, the similarity between the target word and each candidate word in the first group, and ranking the candidate words by that similarity; and determining the normalization result of the target word from the ranking. The word normalization technique and machine translation technique use an unsupervised scheme to normalize a non-standard word according to its meaning. They can therefore obtain normalization results for non-standard words formed by semantic variation, and improve machine translation of sentences that contain such words.
Description
Technical field
The present disclosure relates generally to natural language processing, and in particular to a method and apparatus for normalizing non-standard words and to a machine translation method and device.
Background
The existence of non-standard words causes many difficulties for natural language processing, and the growth of internet slang has aggravated the problem. For example, in a question answering system given the question "recommend a place suitable for small basin friends to play", the system cannot understand the non-standard word "small basin friend" contained in the question, so the answer it returns is "no answer". As another example, in machine translation, the source sentence is "this small Loli sprouts very much" and the machine translation result is "This little Lolita is too Meng"; the result fails to translate "small Loli" and "sprouts" correctly. To solve these problems, non-standard words need to be normalized so as to improve the performance of natural language processing.
Existing word normalization methods fall into two broad classes: supervised schemes and unsupervised schemes. Supervised schemes treat word normalization as a classification problem, but this requires a massive amount of manual annotation, so the labor cost is very high. Unsupervised schemes treat word normalization as a translation problem, but existing unsupervised schemes can only handle phonetic variation and cannot handle semantic variation. More specifically, existing unsupervised methods can normalize phonetically deformed non-standard words such as "cup" and "pear", but cannot be applied to non-standard words formed by semantic variation. For example, for new words such as "emperorship", "Hai Tao" and "counteroffensive", and for old words that have taken on new meanings such as "dog blood", "eunuch" and "sense emit", there is currently no effective normalization method.
Summary of the invention
The present disclosure is proposed in view of at least the problems above.
According to one embodiment of the present disclosure, a word normalization method is provided, including: obtaining a target word to be normalized; retrieving, using a web search engine, sentences that explain the target word, and determining words in those sentences that are related to the target word as a first group of candidate words representing normalization results of the target word; computing, based on word vectors, the similarity between the target word and each candidate word in the first group, and ranking the candidate words by that similarity; and determining the normalization result of the target word from the ranking.
According to another embodiment of the present disclosure, a word normalization apparatus is provided, including: an obtaining component configured to obtain a target word to be normalized; a candidate word determining component configured to retrieve, using a web search engine, sentences that explain the target word, and to determine words in those sentences that are related to the target word as a first group of candidate words representing normalization results of the target word; a similarity ranking component configured to compute, based on word vectors, the similarity between the target word and each candidate word in the first group, and to rank the candidate words by that similarity; and a normalization component configured to determine the normalization result of the target word from the ranking.
According to another embodiment of the present disclosure, a word normalization apparatus is provided, including: a processor; a memory; and computer program instructions stored in the memory which, when run by the processor, perform the following steps: obtaining a target word to be normalized; retrieving, using a web search engine, sentences that explain the target word, and determining words in those sentences that are related to the target word as a first group of candidate words representing normalization results of the target word; computing, based on word vectors, the similarity between the target word and each candidate word in the first group, and ranking the candidate words by that similarity; and determining the normalization result of the target word from the ranking.
According to another embodiment of the present disclosure, a word normalization method is provided, including: obtaining a target word to be normalized and a set of candidate words representing normalization results of the target word; computing, based on word vectors, the similarity between the target word and each candidate word in the set, and ranking the candidate words by that similarity; determining the confidence of the candidate word with the highest similarity; and, if the confidence is greater than a third threshold, taking the candidate word with the highest similarity as the normalization result of the target word.
According to another embodiment of the present disclosure, a word normalization apparatus is provided, including: an obtaining component configured to obtain a target word to be normalized and a set of candidate words representing normalization results of the target word; a similarity ranking component configured to compute, based on word vectors, the similarity between the target word and each candidate word in the set, and to rank the candidate words by that similarity; a confidence determining component configured to determine the confidence of the candidate word with the highest similarity; and a normalization component configured to take the candidate word with the highest similarity as the normalization result of the target word if the confidence is greater than a third threshold.
According to another embodiment of the present disclosure, a word normalization apparatus is provided, including: a processor; a memory; and computer program instructions stored in the memory which, when run by the processor, perform the following steps: obtaining a target word to be normalized and a set of candidate words representing normalization results of the target word; computing, based on word vectors, the similarity between the target word and each candidate word in the set, and ranking the candidate words by that similarity; determining the confidence of the candidate word with the highest similarity; and, if the confidence is greater than a third threshold, taking the candidate word with the highest similarity as the normalization result of the target word.
According to yet another embodiment of the present disclosure, a machine translation method is provided, including: detecting a non-standard word in the source language as a target word; obtaining a set of candidate words representing normalization results of the target word; computing, based on word vectors, the similarity between the target word and each candidate word in the set, and ranking the candidate words by that similarity; determining the confidence of the candidate word with the highest similarity; and, if the confidence is greater than a third threshold, taking the candidate word with the highest similarity as the normalized word and translating it into the target language.
According to yet another embodiment of the present disclosure, a machine translation device is provided, including: a detection component configured to detect a non-standard word in the source language as a target word; a candidate word determining component configured to obtain a set of candidate words representing normalization results of the target word; a similarity ranking component configured to compute, based on word vectors, the similarity between the target word and each candidate word in the set, and to rank the candidate words by that similarity; a confidence determining component configured to determine the confidence of the candidate word with the highest similarity; and a translation component configured to take the candidate word with the highest similarity as the normalized word and translate it into the target language if the confidence is greater than a third threshold.
According to yet another embodiment of the present disclosure, a machine translation device is provided, including: a processor; a memory; and computer program instructions stored in the memory which, when run by the processor, perform the following steps: detecting a non-standard word in the source language as a target word; obtaining a set of candidate words representing normalization results of the target word; computing, based on word vectors, the similarity between the target word and each candidate word in the set, and ranking the candidate words by that similarity; determining the confidence of the candidate word with the highest similarity; and, if the confidence is greater than a third threshold, taking the candidate word with the highest similarity as the normalized word and translating it into the target language.
The word normalization technique and machine translation technique according to embodiments of the present disclosure use an unsupervised scheme to normalize a non-standard word according to its meaning. They can therefore obtain normalization results for non-standard words formed by semantic variation, and improve machine translation of sentences that contain such words. In addition, the techniques assess the confidence of the normalized word obtained and decide from that confidence whether the normalized word is acceptable, which helps ensure that the normalized word is correct.
Brief description of the drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following more detailed description of embodiments of the disclosure taken in conjunction with the accompanying drawings. The drawings provide a further understanding of the embodiments, form part of the specification, serve together with the embodiments to explain the disclosure, and do not limit the disclosure. In the drawings, the same reference numerals generally denote the same components or steps.
Fig. 1 schematically shows a flow chart of the word normalization method according to the first embodiment of the present disclosure.
Fig. 2 shows the web page obtained by retrieving a non-standard word through Baidupedia.
Fig. 3 shows the web page obtained by retrieving a non-standard word through Baidu Knows.
Fig. 4 shows a flow chart of the method for determining the confidence of the candidate word with the highest similarity in the first group of candidate words, in the word normalization method according to the first embodiment of the present disclosure.
Fig. 5 illustrates the candidate word scores determined for each candidate word in the first group by the word normalization method according to the first embodiment of the present disclosure, and the result of ranking the candidate words by those scores.
Fig. 6 schematically shows a flow chart of the word normalization method according to the second embodiment of the present disclosure.
Fig. 7 shows a flow chart of the method for determining a second group of candidate words in the word normalization method according to the second embodiment of the present disclosure.
Fig. 8 schematically shows a flow chart of the word normalization method according to the third embodiment of the present disclosure.
Fig. 9 shows a functional block diagram of a word normalization apparatus according to an embodiment of the invention.
Fig. 10 shows a functional block diagram of a word normalization apparatus according to another embodiment of the invention.
Fig. 11 shows a schematic block diagram of a computing device that can be used to implement the word normalization apparatus according to embodiments of the present disclosure.
Detailed description
To make the objects, technical solutions and advantages of the present disclosure clearer, example embodiments of the disclosure are described in detail below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the disclosure, and the disclosure is not limited by the example embodiments described here. All other embodiments obtained by those skilled in the art on the basis of the embodiments described in this disclosure, without creative effort, shall fall within the scope of protection of the disclosure.
<First embodiment>
The word normalization method according to the first embodiment of the present disclosure is described in detail below with reference to Fig. 1.
Fig. 1 schematically shows a flow chart of the word normalization method according to the present embodiment.
As shown in Fig. 1, in step S110, a target word to be normalized is obtained.
The target word can be obtained in various ways. For example, it can be input directly by a user, or it can be detected in a sentence containing it by an existing new-word detection method.
In step S120, sentences that explain the target word are retrieved using a web search engine, and words in those sentences that are related to the target word are determined as a first group of candidate words representing normalization results of the target word.
In this step, an existing web search engine can be used to retrieve web pages about the target word. Each sentence in the retrieved pages is then matched against predefined templates, and sentences that match a template are taken as the sentences that explain the target word. A predefined template is a sentence pattern used for explaining or defining a target word; templates can be set in advance from experience, and multiple templates can be defined. A sentence in a retrieved page is treated as explaining the target word as long as it matches at least one of the templates. For ease of understanding, this processing is described below taking Baidu Knows and Baidupedia as the web search engines.
For example, taking "dog blood" as the target word to be normalized, Fig. 2 shows the web page retrieved through Baidupedia and Fig. 3 shows the web page retrieved through Baidu Knows. By matching each sentence in the retrieved pages shown in Figs. 2 and 3 against the predefined templates, sentences such as the following are determined as explaining the target word "dog blood":
A. amplification has the mysterious so-called dog blood of the meaning of chat exaggeration
B. the unusual meaning of old stuff
C. it is meant to chat
D. describe those similar plots often occurred, clumsy imitation or exaggerate very much very false performance
E. it is exactly the plot imitated in TV play by continuous reproduction
F. refer to performer to be soaped the audience with the performance to overdo on platform
g. that is, not saying sense of propriety during the performance
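The template matching that selects explanatory sentences can be sketched as follows. This is a minimal illustration, not the patent's implementation: the English patterns in `TEMPLATES` and the function name `explanatory_sentences` are assumptions standing in for the experience-based templates the text mentions.

```python
import re

# Hypothetical explanation templates. The text only says templates are
# predefined from experience, so these English patterns are stand-ins.
TEMPLATES = [
    re.compile(r"^it means (?P<body>.+)$", re.IGNORECASE),
    re.compile(r"^refers to (?P<body>.+)$", re.IGNORECASE),
    re.compile(r"^that is to say,? (?P<body>.+)$", re.IGNORECASE),
]


def explanatory_sentences(sentences):
    """Keep the sentences that match at least one template.

    Returns (sentence, template_index, explained_part) tuples.
    """
    matched = []
    for sentence in sentences:
        text = sentence.strip()
        for index, template in enumerate(TEMPLATES):
            hit = template.match(text)
            if hit:
                matched.append((text, index, hit.group("body")))
                break  # one matching template is enough
    return matched
```

Recording the template index alongside each sentence keeps the link between a sentence and its template, which later steps can use to look up a per-template score.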
After the sentences that explain the target word have been retrieved as above, words in those sentences that are related to the target word are determined as the first group of candidate words representing normalization results of the target word. The words related to the target word can be determined in any appropriate way.
In a basic implementation, each sentence is divided into words by word segmentation; the words related to the target word are determined among the segmented words according to the syntactic structure of the template that the sentence matches; stop words and duplicate words are then filtered out of the related words, and the remaining words are taken as the first group of candidate words. As noted above, a template is a sentence pattern for explaining or defining a target word, so syntactic analysis of the template reveals the positions in the template of the words that are highly related to the target word being explained or defined. For example, for the template "it means ...", the words after "it means" can be determined to be the highly related words; for the template "that is to say ...", the words after "that is to say" are the highly related words; and so on. In this implementation, therefore, for each retrieved sentence, the positions of the highly related words in the template matching the sentence are determined, the segmented words at those positions are taken as the words related to the target word, and stop words and duplicate words are then filtered out, leaving the first group of candidate words. Stop words are words that occur very frequently in natural text but have no material effect on the meaning of an article or page, such as common Chinese function words; they are used often but contribute very little to semantics.
For ease of understanding, this implementation is described below using the "dog blood" example above. For the retrieved sentences a to g shown above, each sentence is segmented into words, and words related to the target word "dog blood" are then obtained according to syntactic structure, such as "chat", "exaggeration", "inconceivable", "old stuff", "imitation", "performer", "spectators" and "saying sense of propriety". Stop words and duplicate words are then filtered out (for example, "chat" appears in the segmentation results of two sentences, so one repeated "chat" is removed), and the remaining words are taken as the first group of candidate words.
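The filtering of stop words and duplicates can be sketched in a few lines. The stop-word list below is a hypothetical placeholder, and `first_candidate_group` is an illustrative name; the related words are assumed to have been extracted already.

```python
# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "of", "a"}


def first_candidate_group(related_words):
    """Drop stop words and repeated words, keeping first-seen order."""
    seen = set()
    candidates = []
    for word in related_words:
        if word in STOP_WORDS or word in seen:
            continue
        seen.add(word)
        candidates.append(word)
    return candidates


related = ["chat", "exaggeration", "of", "old stuff", "chat", "imitation"]
print(first_candidate_group(related))
# prints ['chat', 'exaggeration', 'old stuff', 'imitation']
```

Counting how often each word occurred before deduplication is still useful, because the occurrence frequency feeds into the candidate word score later on.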
Sometimes a word obtained by segmentation does not explain the target word well, while expanding it into a phrase explains the target word better. For example, for the target word "ripe female", a retrieved explanatory sentence may be "it means a ripe woman". Segmentation yields "woman", which does not explain "ripe female" well, whereas "ripe woman" fits its meaning much better. In addition, if a negation such as "no" or "not having" appears immediately before a segmented word, combining the negation with that word usually matches Chinese usage better. In view of this, in an optional implementation, after a sentence has been segmented and the words related to the target word have been determined, at least one of the related words is expanded based on dependency relations and/or negation; stop words and duplicate words are then filtered out of the expanded words together with the other, unexpanded related words, and the remaining words are taken as the first group of candidate words. A dependency relation describes the semantic modification relation between the constituents of a sentence. In this implementation, the dependency relation can be any one or a combination of the attributive-head relation, the verb-object relation, and the verbal endocentric phrase.
For example, continuing with "dog blood", after each of the retrieved sentences a to g has been segmented and words such as "chat", "exaggeration", "inconceivable", "old stuff", "imitation", "performer", "spectators" and "saying sense of propriety" have been obtained according to syntactic structure, dependency relations and/or negation can be used to expand, for example, "imitation" into "clumsy imitation", "spectators" into "soaping the audience", and "saying sense of propriety" into "not saying sense of propriety". Stop words and duplicate words are then filtered out of the expanded words together with the other, unexpanded related words, and the remaining words are taken as the first group of candidate words.
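A simplified sketch of this expansion step is given below. It assumes the sentence is already segmented into tokens and that a dependency parser has produced a map from a head token index to the index of a token modifying it; that map, the `NEGATIONS` set and the function name `expand` are all illustrative assumptions, and only modifiers and negations directly adjacent to the word are handled.

```python
# Hypothetical negation words, for illustration only.
NEGATIONS = {"not", "no", "never"}


def expand(tokens, index, modifier_of=None):
    """Expand tokens[index] into a short phrase.

    modifier_of maps a head token index to the index of one token that
    modifies it (assumed to come from a dependency parser). Only a
    modifier or negation immediately before the word is attached.
    """
    modifier_of = modifier_of or {}
    start = index
    # Attach an adjacent modifier, e.g. "clumsy" + "imitation".
    if modifier_of.get(index) == index - 1:
        start = index - 1
    # Attach a negation that directly precedes the phrase.
    if start > 0 and tokens[start - 1] in NEGATIONS:
        start -= 1
    return " ".join(tokens[start:index + 1])


print(expand(["clumsy", "imitation"], 1, {1: 0}))
print(expand(["does", "not", "observe", "propriety"], 2))
```

A real system would follow the parser's relation labels (for instance restricting expansion to selected relation types) rather than relying on adjacency alone.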
In step S130, the similarity between the target word and each candidate word in the first group is computed based on word vectors, and the candidate words are ranked by that similarity.
It is known in the art that any word can be represented by a word vector, and that the closer two word vectors are, the more similar the two words they represent. In this step, the similarity between the target word and each candidate word in the first group is computed from their word vectors, which determines how similar each candidate word is to the target word; the candidate words are then ranked from the highest similarity to the lowest.
Specifically, in this step a word vector is determined for the target word and for each candidate word, and the similarity (for example, based on the cosine distance) between the word vector of the target word and the word vector of each candidate word is computed as the similarity between the target word and that candidate word.
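As an illustration of the cosine-based comparison, the following sketch ranks candidate words by cosine similarity to the target word. The toy vectors and the function names are assumptions; real word vectors would come from an embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector carries no direction
    return dot / (norm_u * norm_v)


def rank_candidates(target_vec, candidate_vecs):
    """Return (word, similarity) pairs, most similar first."""
    scored = [
        (word, cosine_similarity(target_vec, vec))
        for word, vec in candidate_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy two-dimensional vectors, purely illustrative.
ranked = rank_candidates([1.0, 0.0], {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]})
print([word for word, _ in ranked])
```

Returning 0.0 for a zero vector is a design choice made here so that words without a usable vector simply rank last instead of raising an error.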
When determining the word vectors, in one implementation the word vectors of the target word and of each candidate word can be obtained directly with an existing word embedding tool. In another implementation, the target word and each candidate word are decomposed into characters, the character vector of each character is obtained with an existing tool, and the character vectors of the characters in a word are summed to give the word vector of the target word or candidate word. Optionally, if the character vector of some character cannot be determined, it can be set to zero.
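The character-based composition can be sketched as follows, assuming a lookup table `char_vectors` of pre-trained character vectors with a fixed dimension `dim` (both names are illustrative). Characters missing from the table contribute a zero vector, as the text allows.

```python
def word_vector(word, char_vectors, dim):
    """Sum the character vectors of a word; unknown characters count as zero."""
    total = [0.0] * dim
    for ch in word:
        vec = char_vectors.get(ch, [0.0] * dim)
        total = [t + x for t, x in zip(total, vec)]
    return total


# Toy character vectors, purely illustrative.
char_vectors = {"a": [1.0, 2.0], "b": [0.5, 0.5]}
print(word_vector("ab?", char_vectors, 2))
```

Summation means longer words tend to have larger vectors, which is one reason a direction-based measure such as cosine is a natural companion to this composition.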
In step S140, the normalization result of the target word is determined from the ranking.
In this step, the normalization result of the target word can be determined from the ranking according to a predetermined rule. For example, in a basic implementation, the top-ranked candidate word, that is, the one with the highest similarity, is taken directly as the normalization result of the target word.
In an optional implementation, the confidence of the candidate word with the highest similarity in the first group is calculated. If the confidence is greater than a first predetermined threshold, that candidate word is taken as the normalization result of the target word. Otherwise, even the candidate word with the highest similarity is considered unable to represent the target word well, that is, no usable normalization result has been obtained for the target word. The first predetermined threshold can be set as needed; as one example, its value can be 0.45.
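The acceptance rule can be sketched as below. The threshold value 0.45 is the example from the text; the function name and the convention of returning `None` when no usable result exists are assumptions made for illustration.

```python
FIRST_THRESHOLD = 0.45  # example value given in the text


def normalization_result(ranked, confidence, threshold=FIRST_THRESHOLD):
    """Return the top-ranked candidate if its confidence exceeds the threshold.

    ranked is a list of (candidate, similarity) pairs sorted with the most
    similar candidate first. None means no usable normalization result.
    """
    if not ranked:
        return None
    best, _similarity = ranked[0]
    return best if confidence > threshold else None
```

The comparison is strict, following the wording "greater than" the threshold, so a confidence of exactly 0.45 is rejected.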
The processing for determining the confidence of the candidate word with the highest similarity in the first group is described below with reference to Fig. 4.
As shown in Fig. 4, in step S1401, a candidate word score is calculated for each candidate word in the first group.
In this step, the candidate word score can be determined in any appropriate way. As one example, it can be determined from the occurrence frequency of the candidate word and the quality of the templates associated with it, so that the higher the frequency and the better the template, the higher the score. Specifically, in this example, for each candidate word: the occurrence frequency of the candidate word in the retrieved sentences that explain the target word is calculated; the sentences among them that contain the candidate word are determined; the predefined templates matched by each of those sentences are determined; the predetermined score of each such template is determined; and the candidate word score is determined from the highest of those predetermined scores and the occurrence frequency.
For ease of understanding, this processing is described below for the target word "dog blood" discussed above, taking "exaggeration" as the candidate word whose score is calculated. "Exaggeration" occurs twice in the retrieved sentences a to g, in sentence a and in sentence d. Assuming the template matched by sentence a has a predetermined score of 4 and the template matched by sentence d has a predetermined score of 4.5, the candidate word score of "exaggeration" can be determined from the highest predetermined score, 4.5, and the occurrence count, 2, by any appropriate method such as a weighted average.
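One possible way to combine the two quantities is sketched below. The text leaves the combination open ("weighted average etc."), so the linear formula and the weights 0.7 and 0.3 are assumptions chosen only for illustration, as are the function and parameter names.

```python
def candidate_score(candidate, explanations, template_scores,
                    w_template=0.7, w_freq=0.3):
    """Score a candidate from its best template and its occurrence count.

    explanations: list of (words_in_sentence, template_id) pairs.
    template_scores: predetermined score for each template_id.
    The weights are hypothetical; the text does not fix a formula.
    """
    hits = [tid for words, tid in explanations if candidate in words]
    if not hits:
        return 0.0
    frequency = len(hits)  # number of explanatory sentences containing it
    best_template = max(template_scores[tid] for tid in hits)
    return w_template * best_template + w_freq * frequency


# "exaggeration" appears in sentences a and d, whose templates score 4 and 4.5.
explanations = [
    ({"chat", "exaggeration"}, "a"),
    ({"imitation", "exaggeration"}, "d"),
    ({"chat"}, "c"),
]
template_scores = {"a": 4.0, "d": 4.5, "c": 3.0}
print(candidate_score("exaggeration", explanations, template_scores))
```

With these assumed weights the score for "exaggeration" is about 0.7 × 4.5 + 0.3 × 2 = 3.75; other weightings or normalizations of the frequency would serve equally well as long as higher frequency and better templates raise the score.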
In step S1402, the candidate words in the first group are ranked by candidate word score.
For example, Fig. 5 illustrates the result of determining candidate word scores for the candidate words of "dog blood" through the example processing of steps S1401 and S1402 and ranking them by those scores.
In step S1403, the difference in candidate word score between each pair of adjacent candidate words is calculated.
In this step, with the candidate words ranked by candidate word score, the score difference between each pair of adjacent candidate words is calculated. For example, for the ranking shown in Fig. 5, the score difference between "exaggeration" and "chat" is calculated, then the difference between "chat" and "imitation", and so on, up to the difference between "overdoing" and "not saying sense of propriety".
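The adjacent differences can be computed in one pass over the sorted scores. The function name is illustrative, and the scores are assumed to be already sorted from highest to lowest.

```python
def adjacent_score_gaps(sorted_scores):
    """Differences between each pair of neighbouring scores.

    sorted_scores must be in descending order, so every gap is >= 0.
    """
    return [high - low for high, low in zip(sorted_scores, sorted_scores[1:])]


gaps = adjacent_score_gaps([5.0, 4.0, 2.5])
print(gaps)                    # prints [1.0, 1.5]
print(max(gaps, default=0.0))  # prints 1.5
```

The largest gap is the quantity of interest for the next step; `default=0.0` covers the case of a single candidate, where no adjacent pair exists.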
In step S1404, the confidence level of the candidate word with the highest similarity is calculated with a trained classifier, based at least on the highest candidate word score, the largest difference between candidate word scores, and the number of candidate words in the first group.
In this step, the highest candidate word score, the largest candidate word score difference, and the number of candidate words in the first group are used as parameters of the classifier. It should be understood that this is only an example, and other variables may also be selected as parameters of the classifier. For example, in addition to these 3 variables, a parameter such as the second-highest candidate word score may be added.
In addition, the classifier used in this step is not restricted; various trained classifiers, such as a logistic regression classifier, may be used to calculate the confidence level of the candidate word with the highest similarity.
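The processing of steps S1402-S1404 can be sketched as follows. This is only an illustration: the function names are hypothetical, and the weights and bias stand in for the parameters of a trained logistic regression classifier, which the description does not give.

```python
import math


def confidence_features(scores):
    """Build the three features of step S1404 from a non-empty list of scores."""
    ranked = sorted(scores, reverse=True)                  # step S1402
    # Differences between each pair of adjacent candidates (step S1403).
    gaps = [a - b for a, b in zip(ranked, ranked[1:])]
    largest_gap = max(gaps) if gaps else 0.0
    return [ranked[0], largest_gap, float(len(ranked))]


def logistic_confidence(features, weights, bias):
    """Logistic-regression style confidence in the open interval (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


features = confidence_features([3.75, 2.0, 1.5, 1.4])
# Hypothetical weights; a real system would learn them from labelled data.
confidence = logistic_confidence(features, weights=[0.8, 0.5, -0.1], bias=-2.0)
```

Adding a further feature, such as the second-highest score, only requires extending the feature list and the weight vector by one entry.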
The processing of determining the confidence level of the candidate word with the highest similarity in the first group of candidate words according to the first embodiment of the present disclosure has been described above with reference to Fig. 4. It should be understood that this is only an example and does not limit the present invention; the confidence level of the candidate word with the highest similarity may be determined in other ways as the case requires.
The word standardization method according to the first embodiment of the present disclosure has been described in detail above. The method uses an unsupervised scheme to standardize a non-standard word according to its meaning, so a standardization result can be obtained for non-standard words formed by semantic variation.
<Second embodiment>
In the first embodiment above, a non-standard word is standardized according to its meaning only, so a standardization result can be obtained for non-standard words formed by semantic variation. In the present embodiment, the pronunciation of the non-standard word is considered in addition to its meaning, so a standardization result can be obtained for non-standard words formed by phonetic variation as well as by semantic variation. In the following description, only the parts of the present embodiment that differ from the first embodiment are described in detail, and the parts identical to the first embodiment are not described again.
The word standardization method according to the present embodiment is described in detail below with reference to Fig. 6. Fig. 6 schematically shows a flow chart of the word standardization method according to the second embodiment of the present disclosure.
As shown in Fig. 6, in step S610, a target word to be standardized is obtained; in step S620, sentences explaining the target word are retrieved using a network search engine, and words in those sentences that are related to the target word are determined as a first group of candidate words representing the standardization result of the target word; in step S630, the similarity between the target word and each candidate word in the first group is calculated based on term vectors, and the candidate words are ranked according to the similarity.
The processing in steps S610-S630 is the same as the processing in steps S110-S130 of the first embodiment, respectively, and is not described in detail here.
Returning to Fig. 6, in step S640, a second group of candidate words representing the standardization result of the target word is determined based on the editing distance from the pinyin of the target word and the frequency of occurrence in a corpus. The processing in this step is described below with reference to Fig. 7.
As shown in Fig. 7, in step S6401, the pinyin of the target word is determined.
In step S6402, alternative words whose pinyin has an editing distance from the pinyin of the target word smaller than an alternative threshold are determined.
The editing distance between two character strings is the minimum number of edit operations needed to change one into the other. Generally, the smaller the editing distance, the higher the similarity of the two character strings. In this step, pinyin is treated as a character string; by calculating, one by one, the editing distance between the pinyin of each word in a dictionary and the pinyin of the target word, each alternative word whose pinyin has an editing distance from the pinyin of the target word smaller than the preset alternative threshold can be obtained. How to calculate the editing distance is well known in the art and is not described in detail here. In addition, the present embodiment does not limit the dictionary, and any suitable dictionary may be selected as needed; nor does it limit the value of the alternative threshold, which may be set to an appropriate value as needed.
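For illustration, step S6402 can be sketched with the classic Levenshtein distance computed over pinyin strings. The toy dictionary, its toneless pinyin spellings, and the threshold value are assumptions made for this example and are not taken from the description.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def alternative_words(target_pinyin, dictionary, threshold):
    """Return dictionary words whose pinyin is fewer than `threshold` edits away."""
    return [word for word, pinyin in dictionary.items()
            if edit_distance(pinyin, target_pinyin) < threshold]


# Hypothetical toy dictionary mapping words to toneless pinyin strings.
toy_dictionary = {"同学": "tongxue", "童鞋": "tongxie", "天气": "tianqi"}
matches = alternative_words("tongxie", toy_dictionary, threshold=2)
```

Scanning a whole dictionary this way is linear in its size; the description leaves the choice of dictionary and threshold open.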
In step S6403, the frequency of occurrence in the corpus is calculated for each alternative word.
The present embodiment does not limit the corpus, and any existing corpus may be selected as needed. Once the corpus is selected, the number of times each alternative word occurs in it can be determined and used as the frequency of occurrence of that alternative word.
In step S6404, the candidate word score of each alternative word is determined based on the editing distance and the frequency of occurrence.
In this step, the candidate word score of an alternative word may be determined from the editing distance and the frequency of occurrence in various appropriate ways, provided that the smaller the editing distance and the higher the frequency of occurrence of an alternative word, the higher its candidate word score. As one example, candidate word score = (word frequency / maximum word frequency) × n + [1 - (editing distance / (word length × a))] × m, where the maximum word frequency is the maximum of the frequencies of occurrence of the alternative words, the word length is the number of characters that make up the alternative word (for example, the word length of the word rendered here as "children's footwear" is 2), the editing distance is the editing distance between the alternative word and the target word, the word frequency is the frequency of occurrence of the alternative word, n and m are weights, and a is an adjustment factor that can be determined empirically; for example, a may be set to 4.1, the maximum value of the editing distance.
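The example formula of step S6404 can be written out directly. In this sketch the function name and the default weights n = m = 0.5 are assumptions; a = 4.1 is the example value given above, and both the maximum word frequency and the word length are assumed to be positive.

```python
def pinyin_candidate_score(word_freq, max_freq, distance, word_length,
                           n=0.5, m=0.5, a=4.1):
    """Candidate word score from word frequency and pinyin editing distance.

    score = (word_freq / max_freq) * n + (1 - distance / (word_length * a)) * m
    """
    frequency_term = word_freq / max_freq
    distance_term = 1.0 - distance / (word_length * a)
    return frequency_term * n + distance_term * m


# A two-character alternative word, one edit away, at half the maximum frequency.
example_score = pinyin_candidate_score(word_freq=50, max_freq=100,
                                       distance=1, word_length=2)
```

The score rises with word frequency and falls with editing distance, which is the only property the description requires of the scoring function.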
In step S6405, the alternative words whose candidate word score is greater than a candidate word threshold are taken as the second group of candidate words representing the standardization result of the target word.
The present embodiment does not limit the value of the candidate word threshold, which may be set to an appropriate value as needed. In this step, by comparing the candidate word scores with the candidate word threshold, the alternative words whose candidate word score is greater than the candidate word threshold are determined as the second group of candidate words.
The processing of determining the second group of candidate words based on the editing distance from the pinyin of the target word and the frequency of occurrence in the corpus according to the present embodiment has been described above with reference to Fig. 7. It should be understood that the above description is only an exemplary basic processing mode. Optionally, after step S6402 determines the alternative words whose pinyin has an editing distance from the pinyin of the target word smaller than the alternative threshold, the editing distance between each alternative word and the target word may be adjusted according to whether each syllable of the alternative word is similar to the corresponding syllable of the target word. Specifically, as mentioned above, step S6402 calculates the editing distance using techniques well known in the art, and under the techniques currently used in the art, the editing distance between two syllables is calculated only according to whether the two syllables are identical or different. For example, for two identical syllables such as "s" and "s", the editing distance is 0, and for two different syllables such as "x" and "sh", the editing distance is 1. However, the editing distance between syllables that are different but similar, such as "s" and "sh" or "en" and "eng", should be smaller than the editing distance between two syllables that are different and dissimilar.
Based on this observation, in an optional implementation, each syllable of the alternative word is compared with the corresponding syllable of the target word. If N syllables of the alternative word are different from but similar to the N corresponding syllables of the target word, the editing distance between the alternative word and the target word is reduced by N first distances, where N is a natural number. In this optional implementation, which syllables are similar may be preset, for example based on experience.
In another optional implementation, if all syllables of a character of the alternative word are similar to all corresponding syllables of the corresponding character of the target word, the editing distance between the alternative word and the target word is reduced by a second distance.
In another optional implementation, if all syllables of a character of the alternative word are both different from and dissimilar to all corresponding syllables of the corresponding character of the target word, the editing distance between the alternative word and the target word is increased by a third distance.
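The first of these optional adjustments can be sketched as follows. The pairs "s"/"sh" and "en"/"eng" come from the description; the remaining pairs, the value of the first distance, and the function name are assumptions made for illustration.

```python
# Unordered pairs of syllables treated as similar. Only s/sh and en/eng are
# named in the description; the other pairs are hypothetical additions.
SIMILAR_SYLLABLES = {
    frozenset(pair)
    for pair in [("s", "sh"), ("en", "eng"), ("z", "zh"), ("c", "ch"), ("in", "ing")]
}


def adjusted_distance(base_distance, alt_syllables, target_syllables,
                      first_distance=0.5):
    """Reduce the editing distance once per different-but-similar syllable pair."""
    similar_count = sum(
        1
        for alt, target in zip(alt_syllables, target_syllables)
        if alt != target and frozenset((alt, target)) in SIMILAR_SYLLABLES
    )
    return base_distance - similar_count * first_distance


# Two similar syllable pairs: the distance drops from 2 to 1.0.
reduced = adjusted_distance(2, ["s", "en"], ["sh", "eng"])
```

The second and third optional adjustments would add a per-character check on top of this per-syllable comparison and are not shown here.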
Returning to Fig. 6, in step S650, the similarity between the target word and each candidate word in the second group of candidate words is calculated based on term vectors, and the candidate word with the highest similarity to the target word is determined.
The processing in this step is similar to the processing in step S130 of the first embodiment and is not described in detail here.
In step S660, the standardization result of the target word is determined according to the ranking result.
In this step, the standardization result of the target word may be determined from the ranking result according to a predetermined rule. For example, in a basic implementation, the similarity of the candidate word with the highest similarity in the first group of candidate words determined in step S630 (hereinafter referred to as the first preferred term) is compared with the similarity of the candidate word with the highest similarity in the second group of candidate words determined in step S650 (hereinafter referred to as the second preferred term), and the one with the higher similarity is taken as the standardization result of the target word.
In an optional implementation, if the similarity of the second preferred term is not higher than that of the first preferred term, the confidence level of the first preferred term is determined based on the candidate word scores of the candidate words in the first group, and if the confidence level is greater than a first predetermined threshold, the first preferred term is taken as the standardization result of the target word. The processing of determining the confidence level of the first preferred term has been described above with reference to Fig. 4 and is not repeated here. Optionally, if the first preferred term also appears in the second group of candidate words, the confidence level of the first preferred term is directly set to the maximum value.
In an optional implementation, if the similarity of the second preferred term is higher than that of the first preferred term, the confidence level of the second preferred term is calculated based on the candidate word scores of the candidate words in the second group, and if the confidence level is greater than a second predetermined threshold, the second preferred term is taken as the standardization result of the target word. In this implementation, the confidence level of the second preferred term may be calculated in various appropriate ways from the candidate word scores of the candidate words in the second group calculated in step S6404. As one example, the candidate words in the second group may be ranked by candidate word score, and the M highest scores (M is a natural number not greater than the number of candidate words in the second group) are then summed and divided by the number of candidate words in the second group to give the confidence level of the second preferred term. Optionally, if the second preferred term also appears in the first group of candidate words, the confidence level of the second preferred term is directly set to the maximum value. The second predetermined threshold may be set as needed, and the present embodiment is not limited in this respect.
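The selection logic of step S660 in these optional implementations can be sketched as below. The function names and argument layout are hypothetical, and the optional rule that sets the confidence level to the maximum when a preferred term appears in both groups is left out for brevity.

```python
def top_m_confidence(scores, m):
    """Sum of the M highest candidate word scores divided by the group size."""
    if not scores:
        return 0.0
    ranked = sorted(scores, reverse=True)
    return sum(ranked[:m]) / len(ranked)


def choose_result(first, second, first_conf, second_conf,
                  first_threshold, second_threshold):
    """Pick the standardization result, or None if no term is trusted.

    `first` and `second` are (word, similarity) pairs for the first and
    second preferred terms.
    """
    if second[1] > first[1]:
        return second[0] if second_conf > second_threshold else None
    return first[0] if first_conf > first_threshold else None


second_conf = top_m_confidence([0.9, 0.5, 0.1], m=2)   # (0.9 + 0.5) / 3
result = choose_result(("A", 0.5), ("B", 0.6), first_conf=0.9,
                       second_conf=second_conf,
                       first_threshold=0.45, second_threshold=0.4)
```

Returning None models the case where no acceptable standardization result is obtained for the target word.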
The word standardization method according to the second embodiment of the present disclosure has been described in detail above. The method considers both the pronunciation and the meaning of a non-standard word, so a standardization result can be obtained for non-standard words formed by phonetic variation as well as by semantic variation. In addition, when considering the pronunciation of a non-standard word, the method adjusts the editing distance between each alternative word and the target word according to whether their syllables are similar, which improves the standardization result for non-standard words formed by phonetic variation.
It should be noted that, although the word standardization method according to the present embodiment has been described above in the order of steps S610 to S660, this is only an example, and steps S610 to S660 need not be performed in that order. For example, steps S620 and S630 may be performed in sequence after steps S640 and S650 have been performed in sequence, or steps S640 and S650 may be performed in parallel with steps S620 and S630, and so on.
<Third embodiment>
In the preceding embodiments, the meaning and the pronunciation of a non-standard word are considered in order to determine the candidate words representing its standardization result. In the word standardization method according to the present embodiment, the method used to determine the candidate words representing the standardization result of the non-standard word is not limited; after the candidate words are determined, a confidence level is evaluated, and whether a candidate word is acceptable is decided according to the confidence level.
The word standardization method according to the third embodiment of the present disclosure is described below with reference to Fig. 8. Fig. 8 schematically shows a flow chart of the word standardization method according to the third embodiment of the present disclosure.
As shown in Fig. 8, in step S810, a target word to be standardized and a group of candidate words representing the standardization result of the target word are obtained.
The target word to be standardized can be obtained in various ways; for example, it may be input directly by a user, or it may be detected from a sentence containing the target word to be standardized by an existing new word detection method or the like.
As mentioned above, in the word standardization method according to the present embodiment, the method of obtaining the candidate words representing the standardization result of the target word is not limited. For example, the group of candidate words may be obtained based on the meaning of the target word in the manner described in the first embodiment, based on both the pronunciation and the meaning of the target word in the manner described in the second embodiment, or based only on the pronunciation of the target word; any other appropriate method in the art may also be used.
In step S820, the similarity between the target word and each candidate word in the group of candidate words is calculated based on term vectors, and the candidate words are ranked according to the similarity.
The processing of this step is the same as the processing in step S130 of the first embodiment and is not repeated here.
In step S830, the confidence level of the candidate word with the highest similarity among the candidate words is determined.
In this step, the confidence level of the candidate word with the highest similarity may be determined in various appropriate ways. For example, for a group of candidate words obtained based on the meaning of the target word and/or based on the pinyin of the target word, the confidence level of the candidate word with the highest similarity may be determined as described in the first and second embodiments of the present invention, which is not repeated here.
In step S840, if the confidence level is greater than a third threshold, the candidate word with the highest similarity is taken as the standardization result of the target word.
If the confidence level of the candidate word with the highest similarity is determined to be greater than the preset third threshold, the candidate word is considered to represent the target word well and can therefore be taken as the standardization result of the non-standard target word; otherwise, it is considered that no acceptable standardization result has been obtained for the target word. The third threshold may be set based on experience and actual needs; as one example, its value may be 0.6.
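Steps S820-S840 can be sketched as a single function. The `similarity` and `confidence` callables are hypothetical placeholders for whichever term-vector similarity and confidence estimate an implementation chooses; 0.6 is the example threshold value given above.

```python
def standardize(target, candidates, similarity, confidence, threshold=0.6):
    """Return the best candidate if its confidence exceeds the threshold.

    similarity(target, word) -> float and confidence(best, ranked) -> float
    are supplied by the caller; both are assumptions of this sketch.
    """
    if not candidates:
        return None
    # Step S820: rank candidates by similarity to the target word.
    ranked = sorted(candidates, key=lambda word: similarity(target, word),
                    reverse=True)
    best = ranked[0]
    # Steps S830-S840: accept the best candidate only if it is trusted.
    return best if confidence(best, ranked) > threshold else None


toy_similarity = {"a": 0.2, "b": 0.9}
accepted = standardize("t", ["a", "b"],
                       similarity=lambda t, w: toy_similarity[w],
                       confidence=lambda best, ranked: 0.7)
```

A None return value corresponds to the case where no acceptable standardization result is obtained.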
The word standardization method according to the third embodiment of the present disclosure has been described in detail above. This method does not limit how the candidate words representing the standardization result of a non-standard word are determined; after the candidate words are determined, a confidence level is evaluated and whether a candidate word is acceptable is decided according to the confidence level, thereby ensuring the correctness of the standardized word.
In another aspect, the word standardization method according to the present embodiment can be applied to machine translation. More specifically, the present embodiment also provides a machine translation method comprising the following steps: (i) detecting a non-standard word in the source language as a target word; (ii) obtaining a group of candidate words representing the standardization result of the target word; (iii) calculating the similarity between the target word and each candidate word in the group based on term vectors, and ranking the candidate words according to the similarity; (iv) determining the confidence level of the candidate word with the highest similarity; (v) if the confidence level is greater than a third threshold, translating the candidate word with the highest similarity, as the standardized word, into the target language. In step (i), the non-standard word may be detected from a sentence containing it by an existing new word detection method or the like; in step (v), the standardized word may be translated into the target language by various common machine translation methods; the processing in the remaining steps (ii)-(iv) is similar to the processing of the corresponding steps in the word standardization method according to the present embodiment and is not repeated here.
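The way standardization feeds into translation can be sketched as below. All three callables are hypothetical placeholders (a new word detector, a standardizer that returns None when the confidence level is too low, and a machine translation backend), and keeping the original token when no result is accepted is an assumption of this sketch, since the description does not specify that case.

```python
def translate_with_standardization(tokens, is_nonstandard, standardize, translate):
    """Replace trusted non-standard words, then translate the sentence."""
    normalized = []
    for token in tokens:
        if is_nonstandard(token):                      # step (i)
            replacement = standardize(token)           # steps (ii)-(iv)
            # Step (v): substitute only when a result was accepted.
            normalized.append(replacement if replacement is not None else token)
        else:
            normalized.append(token)
    return translate(normalized)


output = translate_with_standardization(
    ["x", "slang"],
    is_nonstandard=lambda token: token == "slang",
    standardize=lambda token: "standard",
    translate=lambda words: " ".join(words),   # stand-in for a real translator
)
```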
<Overall configuration of the word standardization equipment>
Fig. 9 shows a functional configuration block diagram of word standardization equipment 900 according to an embodiment of the present invention.
As shown in Fig. 9, the word standardization equipment 900 includes an obtaining component 910, a candidate word determining component 920, a similarity ranking component 930, and a standardization component 940. The specific functions and operations of these components are essentially the same as those described above with reference to Figs. 1-7; therefore, to avoid repetition, the equipment is only briefly described below, and detailed descriptions of identical details are omitted.
The obtaining component 910 is configured to obtain a target word to be standardized. The obtaining component 910 can obtain the target word in various ways; for example, it may be input directly by a user, or it may be detected from a sentence containing the target word to be standardized by an existing new word detection method or the like.
The candidate word determining component 920 is configured to retrieve sentences explaining the target word using a network search engine, and to determine words in those sentences that are related to the target word as a first group of candidate words representing the standardization result of the target word.
Specifically, the candidate word determining component 920 may retrieve web pages related to the target word using an existing network search engine, match each sentence in the retrieved web pages against predefined templates, and take the sentences that match a template as sentences explaining the target word. The predefined templates are sentence patterns used to explain or define a target word; they may be preset based on experience, and a variety of templates may be defined. As long as a sentence in a web page retrieved by the candidate word determining component 920 matches at least one of the templates, the sentence is regarded as a sentence explaining the target word.
After the sentences explaining the target word have been retrieved as described above, the candidate word determining component 920 further determines, in various appropriate ways, the words in those sentences that are related to the target word as the first group of candidate words representing the standardization result of the target word. For example, in a basic implementation, the candidate word determining component 920 may divide a sentence into words by word segmentation, determine which of the resulting words are related to the target word according to the syntactic structure of the template matched by the sentence, and then filter stop words and duplicate words out of the related words, taking the remaining words as the first group of candidate words. In an optional implementation, after dividing the sentence into words by word segmentation and determining the words related to the target word, the candidate word determining component 920 may expand at least one of the related words based on dependency relations and/or negation modifiers, then filter stop words and duplicate words out of the expanded words and the other related words, and take the remaining words as the first group of candidate words.
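The final filtering step described above can be sketched as follows. The function name and the sample stop-word list are hypothetical; word segmentation and template-based selection of related words are assumed to have been done already.

```python
def filter_candidates(related_words, stop_words):
    """Drop stop words and duplicates while keeping first-seen order."""
    seen = set()
    first_group = []
    for word in related_words:
        if word in stop_words or word in seen:
            continue
        seen.add(word)
        first_group.append(word)
    return first_group


first_group = filter_candidates(
    ["exaggerated", "the", "imitation", "exaggerated"],
    stop_words={"the", "of", "a"},
)
```

Preserving the original order is a design choice of this sketch; the later similarity ranking reorders the candidates in any case.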
The similarity ranking component 930 is configured to calculate the similarity between the target word and each candidate word in the first group of candidate words based on term vectors, and to rank the candidate words according to the similarity.
As is known in the art, any word can be represented by a term vector, and the closer two term vectors are, the more similar the two words they represent. The similarity ranking component 930 calculates the similarity between each candidate word in the first group and the target word using term vectors, and then ranks the candidate words by similarity. Specifically, the similarity ranking component 930 may determine a term vector for the target word and for each candidate word, and then calculate the similarity between the term vector of the target word and the term vector of each candidate word as the similarity between the target word and that candidate word. To determine these term vectors, in one implementation, the similarity ranking component 930 may determine the term vectors of the target word and of each candidate word directly with an existing tool such as a word embedding tool; in another implementation, the similarity ranking component 930 may decompose the target word and each candidate word into characters, determine the character vector of each character with an existing tool, and finally sum the character vectors of the characters contained in a word to obtain the term vector of the target word and of each candidate word. Optionally, if the character vector of some character cannot be determined, that character vector may be set to zero.
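The character-based variant can be sketched as follows. The lookup table of character vectors is a hypothetical stand-in for the output of an embedding tool, and cosine similarity is an assumption: the description only requires some similarity between term vectors.

```python
import math


def term_vector(word, char_vectors, dim):
    """Sum the character vectors of a word; unknown characters count as zero."""
    total = [0.0] * dim
    for ch in word:
        vec = char_vectors.get(ch, [0.0] * dim)
        total = [t + v for t, v in zip(total, vec)]
    return total


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


# Hypothetical two-dimensional character vectors.
toy_char_vectors = {"a": [1.0, 0.0], "b": [0.0, 2.0]}
vector_ab = term_vector("ab", toy_char_vectors, dim=2)
```

Ranking the candidate words then amounts to sorting them by their similarity to the term vector of the target word.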
The standardization component 940 is configured to determine the standardization result of the target word according to the ranking result.
The standardization component 940 may determine the standardization result of the target word from the ranking result according to a predetermined rule. For example, in a basic implementation, the standardization component 940 may directly take the candidate word ranked highest in similarity as the standardization result of the target word.
In an optional implementation, the standardization component 940 may calculate the confidence level of the candidate word with the highest similarity in the first group of candidate words. If the confidence level is greater than a first predetermined threshold, the candidate word with the highest similarity is taken as the standardization result of the target word; otherwise, it is considered that even the candidate word with the highest similarity cannot represent the target word well, that is, no usable standardization result has been obtained for the target word. The first predetermined threshold may be set as needed; as one example, its value may be 0.45. In this implementation, the standardization component 940 may include a first scoring unit 9401 (not shown), a ranking unit 9402 (not shown), an adjacent difference computing unit 9403 (not shown), and a classifier unit 9404 (not shown).
The first scoring unit 9401 is configured to calculate a candidate word score for each candidate word in the first group of candidate words. The first scoring unit 9401 may determine the candidate word score of each candidate word in various appropriate ways. As one example, the candidate word score may be determined according to the frequency of occurrence of the candidate word and the quality of the templates associated with it, so that the higher the frequency of occurrence and the better the template, the higher the candidate word score. Specifically, according to this example, for each candidate word: the frequency of occurrence of the candidate word in the retrieved sentences explaining the target word is calculated; the sentences containing the candidate word among the sentences explaining the target word are determined; the predefined templates matched by each of those sentences are determined; the predetermined score of each of those templates is determined; and the candidate word score of the candidate word is determined based on the highest predetermined score and the frequency of occurrence.
The ranking unit 9402 is configured to rank the candidate words in the first group of candidate words according to their candidate word scores.
The adjacent difference computing unit 9403 is configured to calculate the difference between the candidate word scores of each pair of adjacent candidate words.
The classifier unit 9404 is configured to calculate the confidence level of the candidate word with the highest similarity with a trained classifier, based at least on the highest candidate word score, the largest difference between candidate word scores, and the number of candidate words in the first group.
Here, the classifier unit 9404 uses the highest candidate word score, the largest candidate word score difference, and the number of candidate words in the first group as parameters of the classifier. It should be understood that this is only an example, and other variables may also be selected as parameters of the classifier. For example, in addition to these 3 variables, a parameter such as the second-highest candidate word score may be added.
In addition, the classifier used by the classifier unit 9404 is not restricted; for example, various trained classifiers such as a logistic regression classifier may be used to calculate the confidence level of the candidate word with the highest similarity.
Optionally, the candidate word determining component 920 may be further configured to determine a second group of candidate words representing the standardization result of the target word based on the editing distance from the pinyin of the target word and the frequency of occurrence in a corpus. Specifically, the candidate word determining component 920 may be further configured to include: a pinyin determination unit 9201 (not shown), configured to determine the pinyin of the target word; an alternative word determination unit 9202 (not shown), configured to determine alternative words whose pinyin has an editing distance from the pinyin of the target word smaller than an alternative threshold; a frequency determination unit 9203 (not shown), configured to calculate, for each alternative word, its frequency of occurrence in the corpus; a second scoring unit 9204 (not shown), configured to determine the candidate word score of each alternative word based on the editing distance and the frequency of occurrence; a pinyin candidate word determination unit 9205 (not shown), configured to take the alternative words whose candidate word score is greater than a candidate word threshold as the second group of candidate words representing the standardization result of the target word; and an adjustment unit 9206 (not shown), configured to adjust, for each alternative word determined by the alternative word determination unit 9202, the editing distance between the alternative word and the target word according to whether each syllable of the alternative word is similar to the corresponding syllable of the target word. The specific functions and operations of these units are the same as those described above with reference to Fig. 7 and are not described in detail here.
Optionally, the similarity ranking component 930 may be further configured to calculate the similarity between the target word and each candidate word in the second group of candidate words based on term vectors, and to determine the candidate word with the highest similarity to the target word.
Optionally, the standardization component 940 may be further configured to determine the standardization result of the target word from the ranking result according to a predetermined rule, in the case where the similarity ranking component 930 has determined the candidate word with the highest similarity in the first group of candidate words (hereinafter referred to as the first preferred term) and the candidate word with the highest similarity in the second group of candidate words (hereinafter referred to as the second preferred term).
For example, in a basic implementation, the standardization component 940 may compare the similarity of the first preferred term with that of the second preferred term, and take the one with the higher similarity as the standardization result of the target word.
In an optional implementation manner, if the similarity of the second preferred term is not higher than the first candidate word, standardization
Component 940 determines the confidence level of the first preferred term based on the candidate word scoring of each candidate word in first group of candidate word, if this is put
Reliability is more than the first predetermined threshold, then using the first preferred term as the standardization result of target word.Optionally, if first is preferred
There is also be then directly set to maximum to word by the confidence level of first preferred term in second group of candidate word.
In an optional implementation, if the similarity of the second preferred word is higher than that of the first preferred word, the normalization component 940 calculates the confidence of the second preferred word based on the candidate word scores of the candidate words in the second group of candidate words; if this confidence exceeds a second predetermined threshold, the second preferred word is taken as the normalization result of the target word. Here, the confidence of the second preferred word may be calculated in any appropriate way from the candidate word scores of the candidate words in the second group, as calculated by the candidate word determination component 920. As one example, the candidate words in the second group may be sorted by candidate word score, and the M highest scores (M is a natural number no greater than the number of candidate words in the second group) may be summed and divided by the number of candidate words in the second group to give the confidence of the second preferred word. Optionally, if the second preferred word is also present in the first group of candidate words, the confidence of the second preferred word is set directly to the maximum value.
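The selection rule and the example confidence calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the names `top_m_confidence`, `Preferred` and `choose_normalization`, the value used for the maximum confidence, and the choice to return `None` when the threshold is not exceeded are all introduced here for illustration only. The confidence stored in `Preferred` is assumed to have been computed beforehand.

```python
from dataclasses import dataclass
from typing import Collection, Optional, Sequence

MAX_CONFIDENCE = 1.0  # assumed stand-in for the "maximum value" of the confidence


def top_m_confidence(scores: Sequence[float], m: int) -> float:
    """Sum the M highest candidate word scores and divide by the group size."""
    if not 1 <= m <= len(scores):
        raise ValueError("M must be between 1 and the number of candidate words")
    ranked = sorted(scores, reverse=True)
    return sum(ranked[:m]) / len(scores)


@dataclass(frozen=True)
class Preferred:
    """Hypothetical record for the most similar candidate word of one group."""
    word: str
    similarity: float   # word-vector similarity to the target word
    confidence: float   # confidence computed from the group's candidate word scores


def choose_normalization(
    first: Preferred,
    second: Preferred,
    first_group: Collection[str],
    second_group: Collection[str],
    first_threshold: float,
    second_threshold: float,
) -> Optional[str]:
    """Pick the preferred word with the higher similarity, then check its confidence."""
    if second.similarity > first.similarity:
        chosen, other_group, threshold = second, first_group, second_threshold
    else:
        chosen, other_group, threshold = first, second_group, first_threshold
    # A preferred word found in both groups gets the maximum confidence.
    confidence = MAX_CONFIDENCE if chosen.word in other_group else chosen.confidence
    # Returning None when the threshold is not exceeded is an assumption of this sketch.
    return chosen.word if confidence > threshold else None
```

In this sketch a preferred word that appears in both groups always passes any threshold below `MAX_CONFIDENCE`, which mirrors the optional rule of setting its confidence to the maximum value.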
The word normalization device according to an embodiment of the present disclosure has been described in detail above. The device can normalize a non-standard word according to its meaning, and can therefore obtain a normalization result for non-standard words that are variations in meaning. The device can also take both the pronunciation and the meaning of a non-standard word into account when normalizing it, and can therefore obtain normalization results for non-standard words that are phonetic variations as well as variations in meaning. In addition, when considering the pronunciation of a non-standard word, the device adjusts the edit distance between each alternative word and the target word according to whether their syllables are similar, which improves the normalization results for non-standard words that are phonetic variations.
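The syllable-based adjustment of the edit distance can be sketched roughly as follows, using the three adjustments set out in claims 13 and 14. Everything concrete in this sketch is an assumption made for illustration: each character is represented as a tuple of its syllables (here, hypothetically, the initial and final of its pinyin), the table `SIMILAR_PAIRS` of confusable syllables and the default values of the first, second and third distances are invented, and "similar" is taken to mean different but listed in the table.

```python
from typing import Sequence, Tuple

Character = Tuple[str, ...]  # the syllables of one character, e.g. ("zh", "ang")

# Hypothetical table of syllables treated as different but similar.
SIMILAR_PAIRS = {
    frozenset(pair)
    for pair in [("z", "zh"), ("c", "ch"), ("s", "sh"), ("n", "l"),
                 ("in", "ing"), ("en", "eng"), ("an", "ang")]
}


def is_similar(a: str, b: str) -> bool:
    """True when two syllables differ but are listed as similar."""
    return a != b and frozenset((a, b)) in SIMILAR_PAIRS


def adjusted_distance(
    base: float,
    alternative: Sequence[Character],
    target: Sequence[Character],
    first: float = 0.5,
    second: float = 0.5,
    third: float = 1.0,
) -> float:
    """Adjust a precomputed edit distance according to syllable similarity."""
    distance = base
    for alt_char, target_char in zip(alternative, target):
        pairs = list(zip(alt_char, target_char))
        similar = [is_similar(a, b) for a, b in pairs]
        # Each different-but-similar syllable pair reduces the distance by `first`.
        distance -= first * sum(similar)
        if pairs and all(similar):
            # Every syllable of this character is similar to its counterpart.
            distance -= second
        elif pairs and all(a != b and not s for (a, b), s in zip(pairs, similar)):
            # Every syllable of this character is neither identical nor similar.
            distance += third
    return distance
```

With these assumed values, an alternative word whose only difference from the target word is `z` versus `zh` ends up closer to the target word than its raw edit distance suggests, while a character with no syllable in common is pushed further away.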
Fig. 10 shows a functional block diagram of a word normalization device 1000 according to another embodiment of the present invention.
As shown in Fig. 10, the word normalization device 1000 includes an obtaining component 1010, a similarity ranking component 1020, a confidence determination component 1030, and a normalization component 1040. The specific functions and operations of these components are essentially the same as those described above with reference to Fig. 8. To avoid repetition, the device is therefore described only briefly below, and detailed descriptions of identical details are omitted.
The obtaining component 1010 is configured to obtain a target word to be normalized and a set of candidate words representing the normalization result of the target word.
The obtaining component 1010 may obtain the target word to be normalized in various ways; for example, the target word may be input directly by a user, or detected in a sentence containing it by an existing new-word detection method. As noted above, the word normalization method according to this embodiment places no restriction on how the candidate words representing the normalization result of the target word are obtained. For example, the obtaining component 1010 may obtain the candidate word set based on the meaning of the target word, as described in the first embodiment; based on both the pronunciation and the meaning of the target word, as described in the second embodiment; based on the pronunciation of the target word alone; or by any other appropriate method in the art.
The similarity ranking component 1020 is configured to calculate, based on word vectors, the similarity between the target word and each candidate word in the set of candidate words, and to rank the candidate words according to the similarity.
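As an illustration of this ranking step, the following sketch builds each word vector by summing per-character vectors, which is the variant recited in claim 7, and ranks the candidate words by similarity to the target word. The use of cosine similarity, the function names, and the toy two-dimensional vectors in the usage note are assumptions of this sketch; the text only requires a similarity computed from word vectors.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def word_vector(word: str, char_vectors: Dict[str, Vector]) -> List[float]:
    """Sum the vectors of the characters that make up the word."""
    dim = len(next(iter(char_vectors.values())))
    total = [0.0] * dim
    for ch in word:
        # A character without a vector raises KeyError in this sketch.
        for i, x in enumerate(char_vectors[ch]):
            total[i] += x
    return total


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(
    target: str, candidates: Sequence[str], char_vectors: Dict[str, Vector]
) -> List[Tuple[str, float]]:
    """Return (candidate, similarity) pairs, most similar first."""
    target_vec = word_vector(target, char_vectors)
    scored = [(c, cosine(target_vec, word_vector(c, char_vectors))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For example, with the toy vectors `{"a": [1, 0], "b": [0, 1], "c": [1, 1]}`, ranking the candidates `["a", "c", "b"]` against the target `"ab"` places `"c"` first, because its vector points in the same direction as the summed vector of `"ab"`.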
The confidence determination component 1030 is configured to determine the confidence of the candidate word with the highest similarity among the candidate words (hereinafter, the preferred word). The confidence determination component 1030 may determine this confidence in any appropriate way. For example, for a candidate word set obtained based on the meaning of the target word, or one obtained based on the pinyin of the target word, the confidence of the preferred word may be determined as described in the first and second embodiments of the present invention.
The normalization component 1040 is configured to take the candidate word with the highest similarity as the normalization result of the target word if the confidence exceeds a third threshold.
If the normalization component 1040 determines that the confidence of the preferred word exceeds the preset third threshold, the preferred word is considered to represent the target word well, and the normalization component 1040 may take it as the normalization result of the non-standard target word; otherwise, it is considered that no acceptable normalization result has been obtained for the target word.
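The overall flow of this embodiment (rank by similarity, assess the confidence of the preferred word, accept or reject against the third threshold) can be sketched as below. The function name `normalize`, the callback signatures, and the use of `None` to signal that no acceptable result was obtained are assumptions made for illustration; the similarity and confidence measures are deliberately passed in as functions because the text leaves them open.

```python
from typing import Callable, Optional, Sequence


def normalize(
    target: str,
    candidates: Sequence[str],
    similarity: Callable[[str, str], float],
    confidence: Callable[[str, Sequence[str]], float],
    third_threshold: float,
) -> Optional[str]:
    """Return the preferred word if its confidence exceeds the threshold, else None."""
    if not candidates:
        return None
    # Rank the candidate words by their similarity to the target word.
    ranked = sorted(candidates, key=lambda c: similarity(target, c), reverse=True)
    preferred = ranked[0]
    # Accept the preferred word only if its confidence is high enough.
    if confidence(preferred, ranked) > third_threshold:
        return preferred
    return None
```

A caller receiving `None` would leave the non-standard word unchanged rather than substitute a candidate that is unlikely to be correct.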
The word normalization device according to this embodiment has been described in detail above. The device places no restriction on how the candidate words representing the normalization result of a non-standard word are determined; once the candidate words have been determined, it assesses the confidence of the preferred candidate word and decides from that confidence whether the candidate word is acceptable, thereby ensuring the correctness of the normalized word.
<System hardware configuration>
A schematic block diagram of a computing device that can be used to implement the word normalization device of the embodiments of the present disclosure is described below with reference to Fig. 11.
As shown in Fig. 11, the computing device 1100 includes one or more processors 1102, a storage device 1104, an input device 1106, and an output device 1108, which are interconnected by a bus system 1110 and/or another form of connection mechanism (not shown). It should be noted that the components and structure of the computing device 1100 shown in Fig. 11 are illustrative rather than restrictive; the computing device 1100 may have other components and structures as required.
The processor 1102 may be a central processing unit (CPU) or another form of processing unit with data processing capability and/or instruction execution capability, and may control other components in the computing device 1100 to perform desired functions.
The storage device 1104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, and flash memory. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 1102 may run the program instructions to implement the functions of the embodiments of the present disclosure described above and/or other desired functions. Various application programs and various data may also be stored in the computer-readable storage medium, such as the target word to be normalized, the sentences explaining the target word, the first group of candidate words, the second group of candidate words, the similarity of each candidate word, the predefined sentence templates, the word vector of each candidate word, the pinyin of the target word, the edit distance of each candidate word, the candidate word scores, the confidence of the preferred word, and the various thresholds mentioned above.
The input device 1106 receives input information from the user, such as the target word to be normalized, and may include various input devices such as a wired/wireless network card, a keyboard, a mouse, a touch screen, and a microphone.
The output device 1108 may output various information to the outside, such as the normalization result of a non-standard word, and may include devices such as a wired/wireless network card, a display, a projector, and a television.
The basic principles of the present disclosure have been described above in connection with specific embodiments. It should be noted, however, that the advantages, benefits, and effects mentioned in the present disclosure are only examples and not limitations, and should not be regarded as prerequisites for every embodiment of the present disclosure. In addition, the specific details disclosed above are provided only for the purposes of illustration and ease of understanding, and are not limiting; they do not require the present disclosure to be implemented using those specific details.
The block diagrams of the components, apparatuses, devices, and systems involved in the present disclosure are only illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown in the block diagrams. As those skilled in the art will appreciate, these components, apparatuses, devices, and systems may be connected, arranged, and configured in any manner. Words such as "including", "comprising", and "having" are open-ended terms that mean "including but not limited to" and may be used interchangeably with that phrase. The words "or" and "and" as used herein mean "and/or" and may be used interchangeably with it, unless the context clearly indicates otherwise. The phrase "such as" as used herein means "such as, but not limited to" and may be used interchangeably with it.
In addition, as used herein, "or" used in a list of items beginning with "at least one of" indicates a disjunctive list, so that a list such as "at least one of A, B, or C" means A or B or C, or AB or AC or BC, or ABC (i.e., A and B and C). Furthermore, the word "exemplary" does not mean that the described example is preferred or better than other examples.
It should also be noted that, in the systems and methods of the present disclosure, the components or steps may be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalents of the present disclosure.
Various changes, substitutions, and alterations may be made to the techniques described herein without departing from the techniques taught as defined by the appended claims. Furthermore, the scope of the claims of the present disclosure is not limited to the specific aspects of the processes, machines, manufacture, compositions of matter, means, methods, and acts described above. Processes, machines, manufacture, compositions of matter, means, methods, or acts that currently exist or are later developed, and that perform substantially the same function or achieve substantially the same result as the corresponding aspects described herein, may be used. Accordingly, the appended claims include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or acts.
The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects without departing from the scope of the present disclosure. Therefore, the present disclosure is not intended to be limited to the aspects shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The above description has been presented for the purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the present disclosure to the forms disclosed herein. Although a number of example aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, changes, additions, and sub-combinations thereof.
Claims (22)
1. A word normalization method, comprising:
obtaining a target word to be normalized;
retrieving, using a web search engine, sentences that explain the target word, and determining words in the sentences that are related to the target word as a first group of candidate words representing the normalization result of the target word;
calculating, based on word vectors, the similarity between the target word and each candidate word in the first group of candidate words, and ranking the candidate words according to the similarity; and
determining the normalization result of the target word according to the ranking result.
2. The word normalization method of claim 1, wherein retrieving, using a web search engine, sentences that explain the target word comprises:
retrieving web pages related to the target word using the web search engine; and
matching each sentence in the retrieved web pages against predefined templates, and taking the sentences that match a template as the sentences that explain the target word.
3. The word normalization method of claim 2, wherein determining words in the sentences that are related to the target word as the first group of candidate words representing the normalization result of the target word comprises:
segmenting the sentence into words;
determining, according to the syntactic structure of the template matched by the sentence, which of the words obtained by segmentation are related to the target word; and
filtering out stop words and duplicate words from the determined related words, and taking the remaining words as the first group of candidate words.
4. The word normalization method of claim 2, wherein determining words in the sentences that are related to the target word as the first group of candidate words representing the normalization result of the target word comprises:
segmenting the sentence into words;
determining, according to the syntactic structure of the template matched by the sentence, which of the words obtained by segmentation are related to the target word;
expanding at least one of the determined related words based on dependency relations and/or negation modifiers; and
filtering out stop words and duplicate words from the expanded words and from the related words other than the expanded words, and taking the remaining words as the first group of candidate words.
5. The word normalization method of claim 4, wherein the dependency relations comprise at least one of an attributive-head relation, a verb-object relation, and an adverbial-head structure.
6. The word normalization method of any one of claims 1-5, wherein calculating, based on word vectors, the similarity between the target word and each candidate word in the first group of candidate words comprises:
determining a word vector for the target word and for each candidate word; and
calculating the similarity between the word vector of the target word and the word vector of each candidate word as the similarity between the target word and that candidate word.
7. The word normalization method of claim 6, wherein determining a word vector for the target word and for each candidate word comprises:
decomposing the target word into characters, and determining the character vector corresponding to each character;
summing the character vectors to obtain the word vector of the target word;
decomposing each candidate word into characters, and determining the character vector corresponding to each character; and
for each candidate word, summing the character vectors of its characters to obtain the word vector of that candidate word.
8. The word normalization method of claim 2, wherein determining the normalization result of the target word according to the ranking result comprises:
determining the confidence of the candidate word with the highest similarity in the first group of candidate words, based on the candidate word score of each candidate word in the first group of candidate words; and
if the confidence exceeds a first predetermined threshold, taking the candidate word with the highest similarity as the normalization result of the target word.
9. The word normalization method of claim 8, wherein determining the confidence of the candidate word with the highest similarity in the first group of candidate words comprises:
calculating a candidate word score for each candidate word in the first group of candidate words;
sorting the candidate words in the first group of candidate words by candidate word score;
calculating the difference in candidate word score between each pair of adjacent candidate words; and
calculating the confidence of the candidate word with the highest similarity using a trained classifier, based at least on the highest candidate word score, the largest difference in candidate word score, and the number of candidate words in the first group.
10. The word normalization method of claim 9, wherein calculating a candidate word score for each candidate word in the first group of candidate words comprises:
calculating the frequency with which the candidate word occurs in the retrieved sentences that explain the target word;
determining each sentence, among the sentences that explain the target word, that contains the candidate word;
determining, among the predefined templates, the template matched by each of those sentences;
determining the predetermined score of each of those templates; and
determining the candidate word score of the candidate word based on the highest predetermined score and the occurrence frequency.
11. The word normalization method of claim 1, further comprising:
determining the pinyin of the target word;
determining alternative words whose pinyin has an edit distance from the pinyin of the target word that is less than an alternative threshold;
calculating, for each alternative word, its occurrence frequency in a corpus;
determining a candidate word score for each alternative word based on the edit distance and the occurrence frequency; and
taking the alternative words whose candidate word score exceeds a candidate word threshold as a second group of candidate words representing the normalization result of the target word.
12. The word normalization method of claim 11, further comprising:
for each alternative word, adjusting the edit distance between the alternative word and the target word according to whether each of its syllables is similar to the corresponding syllable of the target word.
13. The word normalization method of claim 12, wherein, for each alternative word, adjusting the edit distance between the alternative word and the target word according to whether each of its syllables is similar to the corresponding syllable of the target word comprises:
comparing each syllable of the alternative word with the corresponding syllable of the target word; and
if N syllables of the alternative word are different from but similar to the N corresponding syllables of the target word, reducing the edit distance between the alternative word and the target word by N times a first distance, where N is a natural number.
14. The word normalization method of claim 13, wherein, for each alternative word, adjusting the edit distance between the alternative word and the target word according to whether each of its syllables is similar to the corresponding syllable of the target word further comprises:
if all syllables of a character of the alternative word are similar to all corresponding syllables of the corresponding character of the target word, reducing the edit distance between the alternative word and the target word by a second distance; and
if all syllables of a character of the alternative word are neither identical nor similar to all corresponding syllables of the corresponding character of the target word, increasing the edit distance between the alternative word and the target word by a third distance.
15. The word normalization method of any one of claims 11-14, further comprising:
calculating, based on word vectors, the similarity between the target word and each candidate word in the second group of candidate words, and determining the candidate word with the highest similarity to the target word.
16. The word normalization method of claim 15, wherein determining the normalization result of the target word according to the ranking result comprises:
determining the candidate word with the highest similarity in the first group of candidate words;
if the similarity of the candidate word with the highest similarity in the second group of candidate words is higher than that of the candidate word with the highest similarity in the first group of candidate words, calculating the confidence of the candidate word with the highest similarity in the second group of candidate words, based on the candidate word score of each candidate word in the second group of candidate words; and
if the confidence exceeds a second predetermined threshold, taking the candidate word with the highest similarity in the second group of candidate words as the normalization result of the target word.
17. The word normalization method of claim 16, wherein determining the normalization result of the target word according to the ranking result further comprises:
if the similarity of the candidate word with the highest similarity in the second group of candidate words is not higher than that of the candidate word with the highest similarity in the first group of candidate words, calculating the confidence of the candidate word with the highest similarity in the first group of candidate words, based on the candidate word score of each candidate word in the first group of candidate words; and
if the confidence exceeds a first predetermined threshold, taking the candidate word with the highest similarity in the first group of candidate words as the normalization result of the target word.
18. The word normalization method of claim 17, further comprising:
if the candidate word with the highest similarity in the first group of candidate words is also present in the second group of candidate words, setting the confidence of the candidate word with the highest similarity in the first group of candidate words to the maximum value; and
if the candidate word with the highest similarity in the second group of candidate words is also present in the first group of candidate words, setting the confidence of the candidate word with the highest similarity in the second group of candidate words to the maximum value.
19. A word normalization device, comprising:
an obtaining component configured to obtain a target word to be normalized;
a candidate word determination component configured to retrieve, using a web search engine, sentences that explain the target word, and to determine words in the sentences that are related to the target word as a first group of candidate words representing the normalization result of the target word;
a similarity ranking component configured to calculate, based on word vectors, the similarity between the target word and each candidate word in the first group of candidate words, and to rank the candidate words according to the similarity; and
a normalization component configured to determine the normalization result of the target word according to the ranking result.
20. A word normalization device, comprising:
a processor;
a memory; and
computer program instructions stored in the memory which, when run by the processor, perform the following steps:
obtaining a target word to be normalized;
retrieving, using a web search engine, sentences that explain the target word, and determining words in the sentences that are related to the target word as a first group of candidate words representing the normalization result of the target word;
calculating, based on word vectors, the similarity between the target word and each candidate word in the first group of candidate words, and ranking the candidate words according to the similarity; and
determining the normalization result of the target word according to the ranking result.
21. A word normalization method, comprising:
obtaining a target word to be normalized and a set of candidate words representing the normalization result of the target word;
calculating, based on word vectors, the similarity between the target word and each candidate word in the set of candidate words, and ranking the candidate words according to the similarity;
determining the confidence of the candidate word with the highest similarity among the candidate words; and
if the confidence exceeds a third threshold, taking the candidate word with the highest similarity as the normalization result of the target word.
22. A machine translation method, comprising:
detecting a non-standard word in a source language;
obtaining a set of candidate words representing the normalization result of the target word;
calculating, based on word vectors, the similarity between the target word and each candidate word in the set of candidate words, and ranking the candidate words according to the similarity;
determining the confidence of the candidate word with the highest similarity among the candidate words; and
if the confidence exceeds a third threshold, taking the candidate word with the highest similarity as the normalized word and translating it into the target language.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610989788.2A CN108073565A (en) | 2016-11-10 | 2016-11-10 | The method and apparatus and machine translation method and equipment of words criterion |
JP2017217389A JP7120751B2 (en) | 2016-11-10 | 2017-11-10 | Word normalization method, word normalization device and machine translation method, machine translation device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610989788.2A CN108073565A (en) | 2016-11-10 | 2016-11-10 | The method and apparatus and machine translation method and equipment of words criterion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108073565A true CN108073565A (en) | 2018-05-25 |
Family
ID=62150615
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610989788.2A Pending CN108073565A (en) | 2016-11-10 | 2016-11-10 | The method and apparatus and machine translation method and equipment of words criterion |
Country Status (2)
Country | Link |
---|---|
JP (1) | JP7120751B2 (en) |
CN (1) | CN108073565A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108804423A (en) * | 2018-05-30 | 2018-11-13 | 平安医疗健康管理股份有限公司 | Medical Text character extraction and automatic matching method and system |
CN111539853A (en) * | 2020-06-19 | 2020-08-14 | 支付宝(杭州)信息技术有限公司 | Standard case routing determination method, device and equipment |
CN111931477A (en) * | 2020-09-29 | 2020-11-13 | 腾讯科技(深圳)有限公司 | Text matching method and device, electronic equipment and storage medium |
CN113221557A (en) * | 2021-05-28 | 2021-08-06 | 中国工商银行股份有限公司 | Data cross-reference management method and device based on neural network |
CN116415582A (en) * | 2023-05-24 | 2023-07-11 | 中国医学科学院阜外医院 | Text processing method, text processing device, computer readable storage medium and electronic equipment |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109614463B (en) * | 2018-10-24 | 2023-02-03 | 创新先进技术有限公司 | Text matching processing method and device |
CN110852100B (en) * | 2019-10-30 | 2023-07-21 | 北京大米科技有限公司 | Keyword extraction method and device, electronic equipment and medium |
CN111581976B (en) * | 2020-03-27 | 2023-07-21 | 深圳平安医疗健康科技服务有限公司 | Medical term standardization method, device, computer equipment and storage medium |
CN111753147A (en) | 2020-06-27 | 2020-10-09 | 百度在线网络技术(北京)有限公司 | Similarity processing method, device, server and storage medium |
CN112463969B (en) * | 2020-12-08 | 2022-09-20 | 上海烟草集团有限责任公司 | Method, system, equipment and medium for detecting new words of cigarette brand and product rule words |
CN112559559A (en) * | 2020-12-24 | 2021-03-26 | 中国建设银行股份有限公司 | List similarity calculation method and device, computer equipment and storage medium |
CN112650791B (en) * | 2020-12-29 | 2023-12-26 | 招联消费金融有限公司 | Method, device, computer equipment and storage medium for processing field |
CN113657109A (en) * | 2021-08-31 | 2021-11-16 | 平安医疗健康管理股份有限公司 | Method, apparatus and computer device for standardization of model-based clinical terminology |
CN114201968A (en) * | 2021-11-29 | 2022-03-18 | 上海保链科技有限公司 | Data normalization processing method and device based on medical scene and Chinese character combination |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150186361A1 (en) * | 2013-12-25 | 2015-07-02 | Kabushiki Kaisha Toshiba | Method and apparatus for improving a bilingual corpus, machine translation method and apparatus |
CN105068998A (en) * | 2015-07-29 | 2015-11-18 | 百度在线网络技术(北京)有限公司 | Translation method and translation device based on neural network model |
CN105183720A (en) * | 2015-08-05 | 2015-12-23 | 百度在线网络技术(北京)有限公司 | Machine translation method and apparatus based on RNN model |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH01273171A (en) * | 1988-04-25 | 1989-11-01 | Nippon Telegr & Teleph Corp <Ntt> | Document rewriting system and automatic translating system |
US7689585B2 (en) | 2004-04-15 | 2010-03-30 | Microsoft Corporation | Reinforced clustering of multi-type data objects for search term suggestion |
JP4953459B2 (en) | 2008-03-11 | 2012-06-13 | ヤフー株式会社 | Abbreviation generation apparatus, method and program using character vectors |
JP5514760B2 (en) | 2011-03-28 | 2014-06-04 | Kddi株式会社 | Chinese input device |
-
2016
- 2016-11-10 CN CN201610989788.2A patent/CN108073565A/en active Pending
-
2017
- 2017-11-10 JP JP2017217389A patent/JP7120751B2/en active Active
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108804423A (en) * | 2018-05-30 | 2018-11-13 | 平安医疗健康管理股份有限公司 | Medical Text character extraction and automatic matching method and system |
CN108804423B (en) * | 2018-05-30 | 2023-09-08 | 深圳平安医疗健康科技服务有限公司 | Medical text feature extraction and automatic matching method and system |
CN111539853A (en) * | 2020-06-19 | 2020-08-14 | 支付宝(杭州)信息技术有限公司 | Standard case routing determination method, device and equipment |
CN111931477A (en) * | 2020-09-29 | 2020-11-13 | 腾讯科技(深圳)有限公司 | Text matching method and device, electronic equipment and storage medium |
CN113221557A (en) * | 2021-05-28 | 2021-08-06 | 中国工商银行股份有限公司 | Data cross-reference management method and device based on neural network |
CN116415582A (en) * | 2023-05-24 | 2023-07-11 | 中国医学科学院阜外医院 | Text processing method, text processing device, computer readable storage medium and electronic equipment |
CN116415582B (en) * | 2023-05-24 | 2023-08-25 | 中国医学科学院阜外医院 | Text processing method, text processing device, computer readable storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
JP7120751B2 (en) | 2022-08-17 |
JP2018077850A (en) | 2018-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108073565A (en) | The method and apparatus and machine translation method and equipment of words criterion | |
US10176804B2 (en) | Analyzing textual data | |
CN104484411B (en) | Construction method for a dictionary-based semantic knowledge base | |
CN108681574B (en) | Text abstract-based non-fact question-answer selection method and system | |
US20200183983A1 (en) | Dialogue System and Computer Program Therefor | |
US10496756B2 (en) | Sentence creation system | |
CN103324621B (en) | Thai text spelling correction method and device | |
WO2013125286A1 (en) | Non-factoid question answering system and computer program | |
Lipping et al. | Crowdsourcing a dataset of audio captions | |
KR20160026892A (en) | Non-factoid question-and-answer system and method | |
CN108509409A (en) | Method for automatically generating semantically similar sentence samples | |
CN106570180A (en) | Artificial intelligence based voice searching method and device | |
CN109614620B (en) | HowNet-based graph model word sense disambiguation method and system | |
Chen et al. | Automatic key term extraction from spoken course lectures using branching entropy and prosodic/semantic features | |
Gómez-Adorno et al. | A graph based authorship identification approach | |
Harwath et al. | Zero resource spoken audio corpus analysis | |
CN104750677A (en) | Speech translation apparatus, speech translation method and speech translation program | |
Lee et al. | Off-Topic Spoken Response Detection Using Siamese Convolutional Neural Networks. | |
Wang et al. | Summarizing decisions in spoken meetings | |
CN114116965A (en) | Opinion extraction method for comment text and electronic equipment | |
TWI659411B (en) | Multilingual mixed speech recognition method | |
CN103336803A (en) | Method for generating name-embedded spring festival scrolls through computer | |
Yoon et al. | Off-Topic Spoken Response Detection with Word Embeddings. | |
CN107562774A (en) | Generation method and system, and question-answering method and system, for a minority-language word embedding model | |
CN109002540B (en) | Method for automatically generating Chinese announcement document question answer pairs |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180525 |