WO2012026667A2 - Integrated decoding apparatus integrating token separation and translation processes, and method therefor - Google Patents
Integrated decoding apparatus integrating token separation and translation processes, and method therefor
- Publication number
- WO2012026667A2 (PCT/KR2011/003830)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- translation
- word
- token
- model
- candidate
- Prior art date
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/44—Statistical methods, e.g. probability models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/53—Processing of non-Latin text
Definitions
- The present invention relates to an integrated decoding apparatus and method that unify the token separation and translation processes, and more particularly, to an apparatus and method that generate all possible candidate tokens by performing token separation and translation together during decoding, so that input character sequences are decoded in a single integrated pass.
- By integrating the token separation and translation processes, the present invention can reduce translation errors and obtain an optimal translation result.
- SMT stands for statistical machine translation.
- Here, a phrase means a substring of several consecutive words.
- However, general syntax-based statistical machine translation systems still have some disadvantages.
- A typical system can reliably rearrange several consecutive words recognized through training, but most general translation systems do not account for long-distance word dependencies.
- The translation process uses hierarchical syntax; for example, it uses synchronous context-free grammars over both the source language and the target language. Fundamentally, because of errors in segmentation for translation, and errors in syntactic parsing and word alignment when learning translation rules, this approach suffers from a lack of translation accuracy whenever the correct translation rules cannot be applied.
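- For intuition, a synchronous context-free grammar pairs a source-side rewrite with a target-side rewrite in a single rule. A classic textbook hierarchical rule (an illustration, not a rule taken from this patent) is:

```latex
X \rightarrow \langle\, X_1 \ \text{de} \ X_2,\ \ X_2 \ \text{of} \ X_1 \,\rangle
```

which reorders the two sub-phrases around the Chinese particle "de" and the English "of" in one synchronized step.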
- The token separation process plays an important role in statistical machine translation, because the token separation of source sentences determines the basic unit of translation in a statistical machine translation system.
- FIG. 1 is a conceptual diagram of a token separation process and a translation process in a conventional statistical machine translation system.
- a conventional statistical machine translation system includes a token separator 110 and a decoder 120.
- The token separator 110 performs token separation as a pre-processing step.
- The token separator 110 receives a string and generates a token-separated string.
- The decoder 120 receives the token-separated string from the token separator 110 and finds an optimal translation for it.
- Segmenting words into morphological units is intended to improve translation performance by making morphemes, the minimum semantic units in many languages, the basic unit of translation. However, even a high-performance token separator is not 100% accurate, so there is a limit to how much translation quality can be improved. A token separation method is therefore needed that reduces the impact of token separation errors in statistical machine translation systems.
- The grid-based translation method improves translation performance by replacing 1-best tokenization with n-best token separation.
- However, word-grid-based translation methods still search for target phrases in a limited search space. That is, because token separation remains decoupled from decoding and the token grids are built from tokens filtered and preprocessed before decoding, the search space is still limited by the preprocessing step.
- The present invention was devised to solve the above problems. An object of the present invention is to provide an integrated decoding apparatus and method that integrate the token separation and translation processes so that, by performing token separation and translation together while decoding input character sequences, all possible candidate tokens can be generated, translation errors can be reduced, and an optimal translation result can be obtained.
- a candidate token generator for generating a plurality of candidate tokens by applying a maximum entropy model to the input character sequence
- a probability calculator for calculating a token separation probability of each of the generated candidate tokens using a language model
- an unregistered-word processor for processing unregistered words using word frequency and discount information for the unregistered words among the generated candidate tokens; and
- a translation unit for generating a target phrase corresponding to the input character sequence according to a translation model, using the calculated probability values of the candidate tokens and the unregistered-word processing result.
- The apparatus may further include a translation model database that stores a translation model trained on the token-separated data of a parallel corpus, and a language model database that stores a language model trained on a monolingual corpus.
- The translation model database stores a string-based translation model.
- The candidate token generator tags characters in the input character sequence, from a word-formation point of view, as at least one of: a character at the beginning of a word, a character appearing in the middle of a word, a character appearing at the end of a word, or a single-character word.
- The probability calculator calculates the token separation probability of each of the generated candidate tokens in combination with an N-gram language model.
- The unregistered-word processor adjusts the word frequency count for unregistered words among the generated candidate tokens.
- The translation unit generates a translation corresponding to the input character sequence by applying the calculated probability values of the candidate tokens and the unregistered-word processing result to a log-linear model.
- The translation model is a string-based translation model.
- In the candidate token generation step, each character in the input character sequence is tagged, from a word-formation point of view, as a character at the beginning of a word, a character in the middle of a word, a character at the end of a word, or a single-character word.
- The probability calculation step calculates the token separation probability of each of the generated candidate tokens in combination with an N-gram language model.
- The unregistered-word processing step adjusts the word frequency count for unregistered words among the generated candidate tokens.
- The translation step generates a translation corresponding to the input character sequence by applying the calculated probability values of the candidate tokens and the unregistered-word processing result to a log-linear model.
- The present invention has the effect of generating all possible candidate tokens, reducing translation errors, and obtaining an optimal translation result by performing token separation and translation together while decoding the input character sequence. That is, by searching token separations and target phrases for the source-language string within the decoding process of statistical machine translation, the present invention can improve translation performance and reduce segmentation errors.
- The present invention not only handles token separation and translation in a unified way, but also adopts a log-linear model with features designed to deal effectively with the unregistered-word problem. By integrating token separation and translation, it improves translation performance over translation methods that use 1-best token separation or grids, in both Korean and Chinese translation.
- By performing the token separation process on the source side and the translation process on the target side at the same time, the present invention can significantly improve performance, by more than 1.46 BLEU points on a large-scale Chinese-English translation task.
- The present invention has the effect of reducing Chinese word segmentation errors by about 8.7%.
- The present invention can also improve performance for Korean-Chinese translation.
- FIG. 1 is a conceptual diagram of a token separation process and a translation process in a conventional statistical machine translation system
- FIG. 2 is a conceptual diagram of an integrated decoding apparatus incorporating a token separation and translation process according to the present invention
- FIG. 3 is a block diagram of an integrated decoding apparatus in a statistical machine translation system according to the present invention.
- FIG. 4 is a diagram illustrating an embodiment of the token separation process for an input string as applied to the present invention
- FIG. 5 is a diagram illustrating an embodiment of the integrated decoding process in the integrated decoding apparatus according to the present invention.
- FIG. 6 is a flowchart illustrating an integrated decoding method incorporating a token separation and translation process according to the present invention.
- 310: token separator, 320: translation model learner
- FIG. 2 is a conceptual diagram of an integrated decoding apparatus incorporating a token separation and translation process according to the present invention.
- The integrated decoding apparatus 200 relates to statistical machine translation; by performing the token separation process and the translation process together, it solves the degradation of translation performance caused by wrong word units in translation between different languages.
- The integrated decoding apparatus 200 improves translation performance by processing the token separation process and the translation process together when decoding an input string.
- The integrated decoding apparatus 200 performs the token separation process and the translation process together at decoding time to find the best translation and an optimized tokenization for the input-language string.
- The integrated decoding apparatus 200 receives an input string from the source side.
- The integrated decoding apparatus 200 simultaneously performs the token separation process and the translation process.
- The integrated decoding apparatus 200 may output the token-separated string and the target-side target phrase through this integrated process.
- FIG. 3 is a block diagram of an integrated decoding apparatus in a statistical machine translation system according to the present invention.
- the statistical machine translation system includes a learning apparatus 300 and an integrated decoding apparatus 200.
- the learning apparatus 300 includes a token separator 310, a translation model learner 320, and a language model learner 330.
- the integrated decoding apparatus 200 includes a candidate token generator 210, a probability calculator 220, an unregistered word processor 230, and a translator 240.
- the integrated decoding apparatus 200 integrates the token separation process and the translation process into one process in the decoding process.
- The learning apparatus 300 builds a hierarchical syntax-based statistical machine translation model using synchronous context-free grammar.
- However, the statistical machine translation model applied to the present invention is not limited to a hierarchical syntax-based model.
- The learning apparatus 300 receives a parallel corpus and a monolingual corpus.
- the token separator 310 generates token-separated data by token-separating the input parallel corpus.
- the translation model learner 320 learns the translation model using the token-separated data generated by the token separator 310.
- the learned translation model may be stored in a translation model DB in a database form.
- The language model learner 330 generates a language model by training on the monolingual corpus input to the learning apparatus 300.
- the generated language model may be stored in the language model DB in the form of a database.
- The integrated decoding apparatus 200 receives as input a string that has not been token-separated.
- The integrated decoding apparatus 200 separates the source-side input string into tokens while searching for the corresponding target phrases on the target side. Since the rules for hierarchical syntax-based statistical machine translation, i.e., the translation model, include token separation information, the integrated decoding apparatus 200 can perform the translation process on the target side of the translation rules and the token separation process on their source side simultaneously.
- The integrated decoding apparatus 200 may include a translation model database that stores a translation model trained on the token-separated data of the parallel corpus, or a language model database that stores the language model trained on the monolingual corpus.
- the integrated decoding apparatus 200 may further include a string-based translation model database in which a string-based translation model is stored.
- the candidate token generator 210 generates a plurality of candidate tokens by applying a maximum entropy model to the input character sequence.
- The candidate token generator 210 may tag each character in the input character sequence as the start character of a word, a middle character of a word, the end character of a word, or a single-character word.
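- As a minimal sketch of this tagging scheme (an illustration, not the patent's implementation; the trained maximum entropy scorer is omitted), candidate tokens can be enumerated from b/m/e/s tag sequences as follows:

```python
from itertools import product

TAGS = ("b", "m", "e", "s")  # begin of word, middle, end, single-character word

def is_valid(tags):
    """A tag sequence is well formed if every word opened by 'b' is closed by 'e'."""
    prev = None
    for t in tags:
        if t in ("b", "s") and prev in ("b", "m"):
            return False  # previous word was left unfinished
        if t in ("m", "e") and prev not in ("b", "m"):
            return False  # continuation without a word start
        prev = t
    return prev in ("e", "s", None)

def tags_to_tokens(units, tags):
    tokens, word = [], ""
    for u, t in zip(units, tags):
        word += u
        if t in ("e", "s"):
            tokens.append(word)
            word = ""
    return tokens

def candidate_tokenizations(units):
    """Enumerate every candidate tokenization (exponential; a real decoder prunes)."""
    for tags in product(TAGS, repeat=len(units)):
        if is_valid(tags):
            yield tags_to_tokens(units, tags)

# candidate_tokenizations(["you", "wang"]) yields ["youwang"] (tags "b e") and
# ["you", "wang"] (tags "s s"), matching the "b e" labeling example below.
```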
- the probability calculator 220 combines the tokenization probability of each candidate token generated by the candidate token generator 210 with a language model to calculate a comprehensive tokenization probability.
- The probability calculator 220 calculates the token separation probability using the language model learned by the language model learner 330.
- The probability calculator 220 may use a language model stored in the language model DB.
- The probability calculator 220 calculates the token separation probability of each of the generated candidate tokens using an N-gram language model.
- The unregistered-word processor 230 processes unregistered words using word frequency and discount information for the unregistered words among the generated candidate tokens.
- The unregistered-word processor 230 may adjust the word frequency count for unregistered words among the generated candidate tokens.
- The translation unit 240 generates a target phrase corresponding to the input character sequence according to the translation model, using the probability values of the candidate tokens calculated by the probability calculator 220 and the unregistered-word processing result produced by the unregistered-word processor 230. The translation unit 240 generates this target phrase according to the translation model learned by the translation model learner 320, and may use a translation model stored in the translation model DB. Specifically, the translation unit 240 combines the calculated probability values of the candidate tokens and the unregistered-word processing result log-linearly to generate the target phrase corresponding to the input character sequence under the translation model.
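- As a rough sketch of this log-linear combination (the feature names and weights below are illustrative placeholders, not values from the patent), each derivation can be scored as a weighted sum of log feature values, and the decoder keeps the highest-scoring one:

```python
import math

def loglinear_score(features, weights):
    """score(D) = sum_i lambda_i * log(phi_i(D)) over positive feature values."""
    return sum(weights[k] * math.log(v) for k, v in features.items())

# Hypothetical feature values for one derivation:
features = {
    "phrase_translation": 0.02,  # from the translation model
    "target_lm": 0.001,          # target-side language model probability
    "tokenization_me": 0.30,     # maximum entropy token separation probability
    "source_token_lm": 0.05,     # source-side token language model probability
}
weights = {"phrase_translation": 1.0, "target_lm": 0.8,
           "tokenization_me": 0.5, "source_token_lm": 0.4}

score = loglinear_score(features, weights)  # higher is better
```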
- FIG. 4 is a diagram illustrating an embodiment of the token separation process for an input string as applied to the present invention.
- "401" shows an English string corresponding to a Chinese string: the Chinese string reads "tao fei ke you wang duo fen" in romanization, and the English string is "Taufik will have the chance to gain a point". The example "401" shows the alignment relationship between the Chinese string and the English string.
- Examples "402" and "403" show token separations produced by different token separation processes, and example "404" shows the token grid generated from the different separations of examples "402" and "403".
- FIG. 5 is an exemplary diagram of the integrated decoding process in the integrated decoding apparatus according to the present invention.
- The integrated decoding apparatus 200 can integrate token separation models as translation features within the framework by performing the token separation process during decoding.
- The integrated decoding apparatus 200 performs the token separation process and the translation process complementarily.
- The integrated decoding apparatus 200 provides the optimal token separation results from the token separation process to the translation process, and the translation process in turn helps remove ambiguity from the token separation process.
- The probability of a derivation D is represented by Equation 1 below.
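- The body of Equation 1 is not shown here. In the standard log-linear formulation that the surrounding description implies (an assumed reconstruction, not the patent's verbatim equation), the probability of a derivation D with feature functions φ_i and weights λ_i is:

```latex
% Eq. 1 (assumed form): log-linear model over derivations
P(D) \propto \prod_{i} \phi_i(D)^{\lambda_i}
```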
- the integrated decoding apparatus 200 uses 16 features.
- The eight general translation features include four translation model scores (e.g., direct and inverse phrase translation scores, and direct and inverse lexical translation scores), a language model score on the target side, and three penalties.
- The three penalties count the words, the trained translation rules, and the special glue rules applied in a derivation.
- The three token separation features include a maximum entropy model score, a language model score, and a word count on the source side.
- The five out-of-vocabulary (OOV) features are for handling unregistered words.
- The integrated decoding apparatus 200 uses one OOV character count (OCC) feature and four OOV discount (OD) features.
- the candidate token generator 210 generates a plurality of candidate tokens by applying a maximum entropy model (ME) to the input character sequence.
- The maximum entropy model for the token separation process is obtained by casting token separation as a tagging problem.
- The candidate token generator 210 assigns a boundary tag to each character using the following four types: a character at the beginning of a word (b), a character in the middle of a word (m), a character at the end of a word (e), and a single-character word (s).
- For example, the candidate token generator 210 generates the label sequence "b e" for "you-wang", a token over the string "you wang". The candidate token generator 210 then calculates the probability of this token separation as shown in Equation 2 below.
- The probability of tagging a character sequence with a tag sequence is calculated by Equation 3 below.
- The probability of assigning the tag "l" to the character "c" is represented by Equation 4 below.
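- The bodies of Equations 2 through 4 are not shown here. A standard maximum entropy tagging formulation consistent with the surrounding description (an assumed reconstruction, not the patent's verbatim equations) would give the token separation probability of Equation 2 via the tag sequence probability of Equation 3 and the per-character tag probability of Equation 4:

```latex
% Eq. 3 (assumed form): tagging character sequence C = c_1 ... c_n with tags L = l_1 ... l_n
P(L \mid C) = \prod_{j=1}^{n} P(l_j \mid c_j)

% Eq. 4 (assumed form): per-character maximum entropy tag probability
P(l \mid c) = \frac{\exp\big(\sum_k \theta_k\, h_k(l, c)\big)}
                   {\sum_{l'} \exp\big(\sum_k \theta_k\, h_k(l', c)\big)}
```

Here the h_k are binary feature functions over a tag and its character context, and the θ_k are trained parameters; Equation 2 scores a candidate token by the probability of its b/m/e/s tag sequence under this model.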
- The probability calculator 220 computes the probability of a token sequence containing L words using a simple but effective n-gram language model. The probability calculator 220 calculates the n-gram language model probability by Equation 5 below.
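- The body of Equation 5 is likewise not shown; the standard n-gram factorization the description points to, for a token sequence w_1 ... w_L, would be:

```latex
% Eq. 5 (assumed form): n-gram language model probability of a token sequence
P(w_1 \dots w_L) = \prod_{i=1}^{L} P\big(w_i \mid w_{i-n+1} \dots w_{i-1}\big)
```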
- The probability calculator 220 calculates the probability of the token sequence shown in example "402" of FIG. 4 under a 3-gram model by Equation 6 below.
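- As a concrete sketch of that 3-gram computation (the tokens and probabilities below are made up, since example "402"'s exact values are not reproduced in this text):

```python
import math

def trigram_logprob(tokens, lm):
    """Score a token sequence with a 3-gram model, padding with sentence markers.

    `lm` maps (w1, w2, w3) trigrams to probabilities; unseen trigrams get a
    crude probability floor rather than a real trained backoff.
    """
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    return sum(math.log(lm.get(tuple(padded[i-2:i+1]), 1e-7))
               for i in range(2, len(padded)))

# Hypothetical toy model over one candidate tokenization:
toy_lm = {
    ("<s>", "<s>", "taofeike"): 0.10,
    ("<s>", "taofeike", "youwang"): 0.20,
    ("taofeike", "youwang", "duofen"): 0.30,
    ("youwang", "duofen", "</s>"): 0.40,
}
score = trigram_logprob(["taofeike", "youwang", "duofen"], toy_lm)
```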
- The unregistered-word processor 230 counts the number of words in a token sequence using the word count (WC) feature.
- The language model tends to assign higher probabilities to short sentences in a biased manner. The word count feature can compensate the language model score by rewarding longer outputs.
- The unregistered-word processor 230 may tune the token units used for target phrases with this feature: if larger units are preferred in the translation process, it can use this feature to penalize tokenizations containing more words.
- The unregistered-word processor 230 also handles the unregistered-word problem.
- The unregistered-word processor 230 can produce possible tokens and target phrases using only the trained translation model.
- The trained translation model is used here in the same way as in decoding for the translation process.
- Using only the trained translation model, however, limits the search space of possible tokens. Consider the phrase "tao fei ke": the token "taofeike" raises an unregistered-word problem, since the token "taofeike" cannot be derived under this restriction.
- The unregistered-word processor 230 therefore needs to ensure that all possible tokens can be derived; the biased restriction above would degrade the performance of the integrated decoding apparatus 200.
- The unregistered-word processor 230 computes the OOV character count (OCC).
- OCC: OOV Character Count.
- The unregistered-word processor 230 counts the number of characters contained in unregistered words using the OCC feature.
- The unregistered-word processor 230 controls the number of unregistered-word characters with this feature. For example, "taofeike" is an unregistered word in the illustrated derivation, and the OCC feature for that derivation is "3".
- The unregistered-word processor 230 also applies an unregistered-word discount (OD).
- The unregistered-word processor 230 uses the unregistered-word discount features OD_i to distinguish a character sequence's chance of forming a word, exploiting the fact that unregistered words of different lengths occur with different frequencies.
- The unregistered-word discount feature OD_i indicates the number of unregistered words consisting of i characters.
- The unregistered-word processor 230 uses four unregistered-word discount features, with the length bins "1", "2", "3", and "4+". Without these discount features it is difficult to distinguish between different tokenizations that contain unregistered words.
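- To make the OCC and OD features concrete, here is a minimal hedged sketch (the function name and binning are assumptions consistent with the description above, not the patent's code):

```python
def oov_features(tokens, vocab):
    """Compute the five OOV features: one character count plus four discounts.

    OCC sums the characters covered by unregistered words; OD_1..OD_4+ count
    the unregistered words consisting of 1, 2, 3, and 4-or-more characters.
    """
    occ = 0
    od = {1: 0, 2: 0, 3: 0, 4: 0}  # the key 4 holds the "4+" bin
    for tok in tokens:
        if tok not in vocab:
            occ += len(tok)
            od[min(len(tok), 4)] += 1
    return {"OCC": occ, "OD_1": od[1], "OD_2": od[2],
            "OD_3": od[3], "OD_4+": od[4]}

# With vocab = {"youwang", "duofen"}, the token "taofeike" is unregistered:
# OD_4+ = 1, and OCC = 3 if lengths are measured in source characters
# (tao/fei/ke), matching the example above, or 8 in romanized letters.
```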
- FIG. 6 is a flowchart illustrating an integrated decoding method incorporating a token separation and translation process according to the present invention.
- a translation model and a language model are generated by the learning apparatus 300 in advance.
- The translation model is learned from the parallel corpus and may be a string-based translation model.
- The language model is learned from a monolingual corpus.
- the learned translation model may be stored in a translation model DB in a database form.
- the learned language model may be stored in the language model DB in the form of a database.
- The candidate token generator 210 generates a plurality of candidate tokens by applying a maximum entropy model to the input character sequence (S602). While generating the candidate tokens, the candidate token generator 210 tags each character in the input character sequence as the start character of a word, a middle character of a word, the end character of a word, or a single-character word.
- The probability calculator 220 calculates a comprehensive token separation probability by combining the token separation probability of each candidate token generated by the candidate token generator 210 with the language model (S604).
- the probability calculator 220 may calculate a probability of token separation using a pretrained language model.
- the probability calculator 220 may use a language model stored in advance in the language model DB.
- The unregistered-word processor 230 processes unregistered words using the frequency and discount information of the unregistered words among the generated candidate tokens (S606).
- The unregistered-word processor 230 may adjust the score according to the frequency of the unregistered word or its word discount information among the generated candidate tokens.
- The word frequency may mean the average frequency with which a word can be formed, and the unregistered-word processor 230 may determine whether to accept the unregistered word by adjusting it.
- The unregistered-word processor 230 may determine whether to accept an unregistered word by adjusting the discount according to the word discount information bins "1", "2", "3", and "4+".
- The translation unit 240 generates a target phrase corresponding to the input character sequence according to the translation model, using the probability values of the candidate tokens calculated by the probability calculator 220 and the unregistered-word processing result produced by the unregistered-word processor 230 (S608).
- the translation unit 240 may use a translation model stored in the translation model DB.
- The translation unit 240 generates the target phrase corresponding to the input character sequence based on a log-linear combination of the probability values of the candidate tokens calculated by the probability calculator 220 and the unregistered-word processing result of the unregistered-word processor 230.
- the present invention can be applied to various playback apparatuses by implementing the integrated decoding method as a software program and recording it on a computer-readable recording medium.
- Various playback devices may be PCs, laptops, portable terminals, and the like.
- The recording medium may be internal to each playback device, such as a hard disk, flash memory, RAM, or ROM, or external, such as an optical disc (e.g., a CD-R or CD-RW), a compact flash card, smart media, a memory stick, or a multimedia card.
- As described above, the program recorded on the computer-readable recording medium includes: a candidate token generation function for generating a plurality of candidate tokens by applying a maximum entropy model to the input character sequence; a probability calculation function for calculating the token separation probability of each of the generated candidate tokens using a language model; an unregistered-word processing function for processing unregistered words using word counts and discount information for the unregistered words among the generated candidate tokens; and a translation function for generating a target phrase corresponding to the input character sequence using the calculated probability values of the candidate tokens and the unregistered-word processing result.
- By searching token separations and target phrases for the source-language string within the decoding process of statistical machine translation, the present invention can improve translation performance and reduce segmentation errors.
Abstract
Description
Claims (14)
- An integrated decoding apparatus integrating token separation and translation processes, comprising: a candidate token generator for generating a plurality of candidate tokens by applying a maximum entropy model to an input character sequence; a probability calculator for calculating the token separation probability of each of the generated candidate tokens using a language model; an unregistered-word processor for processing unregistered words using word frequency and discount information for the unregistered words among the generated candidate tokens; and a translation unit for generating, according to a translation model, a target phrase corresponding to the input character sequence using the calculated probability values of the candidate tokens and the unregistered-word processing result.
- The integrated decoding apparatus of claim 1, further comprising: a translation model database storing a translation model trained on the token-separated data of a parallel corpus; and a language model database storing a language model trained on a monolingual corpus.
- The integrated decoding apparatus of claim 2, wherein the translation model database stores a string-based translation model.
- The integrated decoding apparatus of claim 1, wherein the candidate token generator tags, from a word-formation point of view, at least one character in the input character sequence as a character at the beginning of a word, a character appearing in the middle of a word, a character appearing at the end of a word, or a single-character word.
- The integrated decoding apparatus of claim 1, wherein the probability calculator calculates the token separation probability of each of the generated candidate tokens in combination with an N-gram language model.
- The integrated decoding apparatus of claim 1, wherein the unregistered-word processor adjusts the word frequency count for unregistered words among the generated candidate tokens.
- The integrated decoding apparatus of claim 1, wherein the translation unit generates a translation corresponding to the input character sequence by applying the calculated probability values of the candidate tokens and the unregistered-word processing result to a log-linear model.
- An integrated decoding method integrating token separation and translation processes, comprising: a candidate token generation step of generating a plurality of candidate tokens by applying a maximum entropy model to an input character sequence; a probability calculation step of calculating the token separation probability of each of the generated candidate tokens in combination with a language model; an unregistered-word processing step of processing unregistered words using word frequency and discount information for the unregistered words among the generated candidate tokens; and a translation step of generating a target phrase corresponding to the input character sequence using the calculated probability values of the candidate tokens and the unregistered-word processing result.
- The integrated decoding method of claim 8, wherein the translation model is a string-based translation model.
- The integrated decoding method of claim 8, wherein the candidate token generation step tags, from a word-formation point of view, at least one character in the input character sequence as a character at the beginning of a word, a character appearing in the middle of a word, a character appearing at the end of a word, or a single-character word.
- The integrated decoding method of claim 8, wherein the probability calculation step calculates the token separation probability of each of the generated candidate tokens in combination with an N-gram language model.
- The integrated decoding method of claim 8, wherein the unregistered-word processing step adjusts the word frequency count for unregistered words among the generated candidate tokens.
- The integrated decoding method of claim 8, wherein the translation step generates a translation corresponding to the input character sequence by applying the calculated probability values of the candidate tokens and the unregistered-word processing result to a log-linear model.
- A computer-readable recording medium on which a program for executing the process according to any one of claims 8 to 13 is recorded.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/813,463 US8543376B2 (en) | 2010-08-23 | 2011-05-25 | Apparatus and method for decoding using joint tokenization and translation |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020100081677A KR101682207B1 (ko) | 2010-08-23 | 2010-08-23 | Integrated decoding apparatus integrating token separation and translation processes, and method therefor |
KR10-2010-0081677 | 2010-08-23 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2012026667A2 true WO2012026667A2 (ko) | 2012-03-01 |
WO2012026667A3 WO2012026667A3 (ko) | 2012-04-19 |
Family
ID=45723875
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2011/003830 WO2012026667A2 (ko) | 2011-05-25 | Integrated decoding apparatus integrating token separation and translation processes, and method therefor |
Country Status (3)
Country | Link |
---|---|
US (1) | US8543376B2 (ko) |
KR (1) | KR101682207B1 (ko) |
WO (1) | WO2012026667A2 (ko) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110263304A (zh) * | 2018-11-29 | 2019-09-20 | Tencent Technology (Shenzhen) Company Limited | Sentence encoding method, sentence decoding method, apparatus, storage medium, and device |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101356417B1 (ko) * | 2010-11-05 | 2014-01-28 | Korea University Research and Business Foundation | Apparatus and method for building verb-phrase translation patterns using a parallel corpus |
JP2014078132A (ja) * | 2012-10-10 | 2014-05-01 | Toshiba Corp | Machine translation apparatus, method and program |
CN104216892B (zh) * | 2013-05-31 | 2018-01-02 | 亿览在线网络技术(北京)有限公司 | Non-semantic, non-phrase switching method in song search |
WO2016043539A1 (ko) * | 2014-09-18 | 2016-03-24 | Patent Firm Nam & Nam | Translation memory comprising sub-translation memories, reverse translation memory using the same, and computer-readable storage medium on which they are recorded |
US9953171B2 (en) * | 2014-09-22 | 2018-04-24 | Infosys Limited | System and method for tokenization of data for privacy |
CN106663092B (zh) * | 2014-10-24 | 2020-03-06 | Google LLC | Neural machine translation systems with rare word processing |
US9934203B2 (en) | 2015-03-10 | 2018-04-03 | International Business Machines Corporation | Performance detection and enhancement of machine translation |
US9940324B2 (en) * | 2015-03-10 | 2018-04-10 | International Business Machines Corporation | Performance detection and enhancement of machine translation |
US10140983B2 (en) * | 2015-08-28 | 2018-11-27 | International Business Machines Corporation | Building of n-gram language model for automatic speech recognition (ASR) |
US10180930B2 (en) * | 2016-05-10 | 2019-01-15 | Go Daddy Operating Company, Inc. | Auto completing domain names comprising multiple languages |
US10430485B2 (en) | 2016-05-10 | 2019-10-01 | Go Daddy Operating Company, LLC | Verifying character sets in domain name requests |
US10735736B2 (en) * | 2017-08-29 | 2020-08-04 | Google Llc | Selective mixing for entropy coding in video compression |
KR102069692B1 (ko) * | 2017-10-26 | 2020-01-23 | Electronics and Telecommunications Research Institute | Neural network machine translation method and apparatus |
KR20210037307A (ko) * | 2019-09-27 | 2021-04-06 | Samsung Electronics Co., Ltd. | Electronic apparatus and method for controlling the electronic apparatus |
US11797781B2 (en) | 2020-08-06 | 2023-10-24 | International Business Machines Corporation | Syntax-based multi-layer language translation |
KR20220093653A (ko) | 2020-12-28 | Samsung Electronics Co., Ltd. | Electronic apparatus and control method thereof |
US20240062021A1 (en) * | 2022-08-22 | 2024-02-22 | Oracle International Corporation | Calibrating confidence scores of a machine learning model trained as a natural language interface |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR19990001034A (ko) * | 1997-06-12 | 1999-01-15 | 윤종용 | Sentence extraction method using contextual information and local document form |
KR20000056245A (ko) * | 1999-02-18 | 2000-09-15 | 윤종용 | Method for selecting translation example sentences using discriminative similarity in example-based machine translation |
US20090248422A1 (en) * | 2008-03-28 | 2009-10-01 | Microsoft Corporation | Intra-language statistical machine translation |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8612205B2 (en) * | 2010-06-14 | 2013-12-17 | Xerox Corporation | Word alignment method and system for improved vocabulary coverage in statistical machine translation |
US9098488B2 (en) * | 2011-04-03 | 2015-08-04 | Microsoft Technology Licensing, Llc | Translation of multilingual embedded phrases |
-
2010
- 2010-08-23 KR KR1020100081677A patent/KR101682207B1/ko active IP Right Grant
-
2011
- 2011-05-25 WO PCT/KR2011/003830 patent/WO2012026667A2/ko active Application Filing
- 2011-05-25 US US13/813,463 patent/US8543376B2/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR19990001034A (ko) * | 1997-06-12 | 1999-01-15 | 윤종용 | Sentence extraction method using contextual information and local document form |
KR20000056245A (ko) * | 1999-02-18 | 2000-09-15 | 윤종용 | Method for selecting translation example sentences using discriminative similarity in example-based machine translation |
US20090248422A1 (en) * | 2008-03-28 | 2009-10-01 | Microsoft Corporation | Intra-language statistical machine translation |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110263304A (zh) * | 2018-11-29 | 2019-09-20 | Tencent Technology (Shenzhen) Company Limited | Sentence encoding method, sentence decoding method, apparatus, storage medium, and device |
CN110263304B (zh) * | 2018-11-29 | 2023-01-10 | Tencent Technology (Shenzhen) Company Limited | Sentence encoding method, sentence decoding method, apparatus, storage medium, and device |
Also Published As
Publication number | Publication date |
---|---|
KR101682207B1 (ko) | 2016-12-12 |
WO2012026667A3 (ko) | 2012-04-19 |
KR20120018687A (ko) | 2012-03-05 |
US8543376B2 (en) | 2013-09-24 |
US20130132064A1 (en) | 2013-05-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2012026667A2 (ko) | Integrated decoding apparatus integrating token separation and translation processes, and method therefor | |
WO2014069779A1 (ko) | Syntax parsing apparatus based on syntax preprocessing, and method therefor | |
WO2016010245A1 (en) | Method and system for robust tagging of named entities in the presence of source or translation errors | |
WO2014025135A1 (ko) | Method for detecting grammatical errors, error detection apparatus therefor, and computer-readable recording medium on which the method is recorded | |
US8548794B2 (en) | Statistical noun phrase translation | |
WO2012026668A2 (ko) | Statistical machine translation method using a dependency forest | |
WO2012060540A1 (ko) | Machine translation apparatus and machine translation method combining a syntactic transformation model and a lexical transformation model | |
Fu et al. | Chinese named entity recognition using lexicalized HMMs | |
KR20050027298A (ko) | Hybrid automatic translation apparatus and method mixing rule-based and translation-pattern methods, and computer-readable recording medium on which the program is recorded | |
WO2003056450A1 (fr) | Method and apparatus for syntactic analysis | |
WO2016208941A1 (ko) | Text preprocessing method and preprocessing system for performing the same | |
Huang et al. | Soft syntactic constraints for hierarchical phrase-based translation using latent syntactic distributions | |
WO2014030834A1 (ko) | Method for detecting grammatical errors, error detection apparatus therefor, and computer-readable recording medium on which the method is recorded | |
Lehal | A word segmentation system for handling space omission problem in urdu script |
WO2012008684A2 (ko) | Method and apparatus for translation-rule filtering and target-word generation in hierarchical-syntax-based statistical machine translation | |
KR100496873B1 (ko) | Apparatus and method for correcting statistical tagging errors based on representative morpheme lexical context | |
WO2012060534A1 (ko) | Apparatus and method for building verb-phrase translation patterns using a parallel corpus | |
WO2012030053A2 (ko) | Apparatus and method for recognizing idiomatic expressions using phrase alignment of a parallel corpus | |
Luo et al. | An iterative algorithm to build Chinese language models | |
Zitouni et al. | Cross-language information propagation for arabic mention detection | |
KR101742244B1 (ko) | Word alignment method using character alignment in statistical machine translation, and apparatus using the same | |
CN110688840B (zh) | Text conversion method and apparatus | |
Jindal et al. | A Framework for Grammatical Error Detection and Correction System for Punjabi Language Using Stochastic Approach | |
KR20090042201A (ko) | Method and apparatus for automatically extracting transliteration pairs from bilingual documents | |
KR20090041897A (ko) | Method and apparatus for automatically extracting transliteration pairs from bilingual documents | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 11820095 Country of ref document: EP Kind code of ref document: A2 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 13813463 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11/06/2013) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 11820095 Country of ref document: EP Kind code of ref document: A2 |