WO2015029241A1 - Word translation acquisition method - Google Patents

Word translation acquisition method

Info

Publication number
WO2015029241A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
words
word
monolingual
language
Prior art date
Application number
PCT/JP2013/073464
Other languages
English (en)
Inventor
Daniel Georg ANDRADE SILVA
Hironori Mizuguchi
Kai Ishikawa
Masaaki Tsuchida
Takashi Onishi
Original Assignee
Nec Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nec Corporation filed Critical Nec Corporation
Priority to JP2016510530A priority Critical patent/JP6090531B2/ja
Priority to PCT/JP2013/073464 priority patent/WO2015029241A1/fr
Publication of WO2015029241A1 publication Critical patent/WO2015029241A1/fr

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/40: Processing or translation of natural language
    • G06F 40/42: Data-driven translation
    • G06F 40/45: Example-based machine translation; Alignment

Definitions

  • The present invention relates to a method for acquiring translations of words using a seed dictionary.
  • In the following, a word refers to a single word (term) or to multiple words (a phrase).
  • The inventors of the present invention propose a system that uses monolingual clustering with cross-lingual constraints from a bilingual seed dictionary to improve the finding of new translations of a word (in the following called the query word). There is a constant need to update existing bilingual dictionaries with new words.
  • Non-Patent Document 1 suggests using a probabilistic model that assumes that the feature vectors of a source word s and a target word t are generated from the same underlying concept. Given a latent concept represented by a vector in the latent vector space, they assume that the feature vectors of source word s and target word t are generated from that latent concept. The feature vectors of s and t are each generated from a normal distribution whose mean vector is the latent concept vector transformed into the source and target vector space, respectively. Their model makes the simplifying assumption that there is a one-to-one mapping between a source word s and a target word t.
  • Non-Patent Document 2 proposes using a poly-lingual topic model to match documents and vocabulary across several languages. They assume a latent prior distribution over topics, from which per-language topic indicators are generated. A topic indicator in turn defines a probability distribution over the vocabulary. For training, they assume an initial set of aligned document pairs, obtained, for example, from cross-lingual links in Wikipedia. They show that, given a sufficiently high number of aligned document pairs as training data, the model can learn to align topics across languages; that is, the i-th topic indicator in language A and the i-th topic indicator in language B tend to represent the same topic. Therefore, the words that appear with high probability in aligned topics might be considered translations.
  • Non-Patent Document 5 suggests using an existing multilingual ontology which contains predefined clusters of words (synsets) in the source language and target language, respectively. It is assumed that the multilingual ontology contains a one-to-one mapping between each cluster in the source and target language. However, such multilingual ontologies are domain dependent, making them less available than (general domain) bilingual dictionaries. Furthermore, they make the limiting assumption that a new word (in the source language) necessarily belongs to an existing cluster that is predefined in the ontology. Obviously, this assumption does not hold if the new word is not related (e.g. not a synonym) to any of the words registered in the ontology.

Document of the Prior Art
  • Non-Patent Document 1 "Learning Bilingual Lexicons from Monolingual Corpora", Haghighi et al., 2008
  • Non-Patent Document 2 "Polylingual Topic Models", Mimno et al., 2009
  • Non-Patent Document 3 "Revisiting Context-based Projection Methods for Term-Translation Spotting in Comparable Corpora", Laroche et al., 2010
  • Non-Patent Document 4 "An efficient method for determining bilingual word classes", Och et al., 1999
  • Non-Patent Document 5 "An approach based on multilingual thesauri and model combination for bilingual lexicon extraction", Dejean et al., 2002

DISCLOSURE OF INVENTION
  • Non-Patent Document 1 requires the assumption that each word has only one translation. If the query word does not have synonyms, this assumption can improve translation accuracy, since it reduces the number of translation candidates.
  • An example is shown in FIG. 1, where the possible translation candidates for the query word バイク [baiku] (motorbike) can be limited to “car”, “motorbike”, “light”, and “lamp”. The words “automobile” and “seat” are not considered as translation candidates. This reduces the search space and can thereby improve accuracy.
  • However, the underlying assumption that there are no synonyms of the query word can be a too restrictive constraint and can also decrease accuracy. For example, let us assume that we additionally have the following translation pair in our seed dictionary: 自動二輪車 [jidounirinsya] - motorbike. Under the one-to-one assumption, “motorbike” is then no longer considered as a translation candidate for バイク [baiku], even though it is the correct translation.
  • After the monolingual clustering, we mutually exclusively align the monolingual clusters across languages so as to maximize our objective function.
  • The objective function is set such that it favors the cross-lingual alignment of clusters whose feature vectors are similar across languages.
  • The feature vector space used for the cross-lingual alignment and the feature vector space used for the monolingual clustering need not be the same, which, in general, allows richer feature vectors to be used for the latter.
  • FIG. 12 shows the initial monolingual clustering given by the bilingual dictionary entries.
  • The (monolingual) feature vector of バイク [baiku] (motorbike) is similar to the feature vector of 自動二輪車 [jidounirinsya] (motorbike), and therefore バイク [baiku] and 自動二輪車 [jidounirinsya] will form one cluster.
  • Analogously, 席 [seki] (seat) and シート [shīto] (seat) will form one cluster.
  • The result of the monolingual clustering is shown in FIG. 13.
  • In both situations our proposed method can improve translation accuracy when compared to previous methods, like Non-Patent Document 3 or Non-Patent Document 1.
  • The first type of method, like Non-Patent Document 3, considers all words in the target language as translation candidates and thus has an enlarged search space.
  • The second type of method, like Non-Patent Document 1, considers only words that do not already have a translation (in the dictionary), and therefore cannot find translations for a set of synonyms.
  • The present invention has the effect of limiting the translation candidates for a query word while allowing a many-to-many cross-lingual alignment of words.
  • FIG. 1 shows the translation candidates for the query バイク [baiku] (motorbike) when no clustering is performed; English (target language) words that occur in the dictionary are not considered as translation candidates.
  • FIG. 2 shows the translation candidates for the query バイク [baiku] (motorbike) when no clustering is performed; English (target language) words that occur in the dictionary are not considered as translation candidates (this assumes that the additional entry 自動二輪車 [jidounirinsya] - motorbike is in the dictionary).
  • FIG. 3 is a block diagram showing the functional structure of the system proposed by our invention with respect to the first embodiment.
  • FIG. 4 is a block diagram showing an example of the functional structure of Component 10.
  • FIG. 5 is a block diagram showing an example of the functional structure of
  • FIG. 6 is a block diagram showing an example of the functional structure of Component 30.
  • FIG. 7 shows an example of the bipartite graph induced by a bilingual dictionary.
  • FIG. 8 shows an example of the construction of the reduced bipartite graph.
  • FIG. 9 shows an example of the monolingual must-link constraints induced by a bilingual dictionary.
  • FIG. 10 shows an example of the monolingual cannot-link constraints induced by a bilingual dictionary.
  • FIG. 11 shows an example of the cross-lingual must-link constraints induced by a bilingual dictionary.
  • FIG. 12 shows the initial clustering of the source and target words; the initial clustering is induced by the monolingual must-link constraints.
  • FIG. 13 shows the final result of the clustering of the source and target words.
  • FIG. 14 shows the final result of the clustering of the source and target words; the cross-lingual must-link constraints and the resulting translation candidates are also shown.
  • FIG. 16 is a block diagram showing the functional structure of the system proposed by our invention with respect to the second embodiment.
  • The main architecture, usually realized by a computer system, is described in FIG. 3.
  • Our method requires as input the query word, a set of source words (including the query word) in the source language and their feature vectors, and a set of target words and their feature vectors.
  • The set of target words is assumed to include the correct translation of the query word.
  • For each word, two types of feature vectors are provided.
  • The first type we call the monolingual feature vector; it can be compared between two words in the same language (monolingual vector space).
  • The second type we call the cross-lingual feature vector; it can be compared across the two languages (common vector space).
  • The similarity of two feature vectors, e.g. measured by the cosine similarity, indicates the semantic similarity of the corresponding words.
  • The monolingual and cross-lingual feature vectors can be created, for example, from two corpora: one for the source language and one for the target language.
  • The two corpora are ideally two text collections with similar topics, so-called comparable corpora, for example, news articles from the same publication interval in Japanese and English.
  • From these corpora we extract context vectors for all relevant words, for example, all nouns that occur in the corpora.
  • The context vector of a noun contains in each dimension the tf-idf weight of a content word that co-occurs with the noun.
  • The monolingual feature vectors are richer, since they can contain all context words, whereas the cross-lingual feature vectors contain only the words that are listed in the bilingual dictionary 1.
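  • As a concrete illustration, the following is a minimal sketch of such tf-idf-weighted context vectors. The window size and the use of sentences as the idf "documents" are assumptions of this sketch, not the patent's exact procedure.

```python
import numpy as np
from collections import Counter, defaultdict

def context_vectors(sentences, nouns, window=5):
    """Build tf-idf-weighted context vectors for the given nouns.
    sentences: list of token lists; nouns: set of words of interest."""
    cooc = defaultdict(Counter)   # noun -> context-word counts (the "tf" part)
    df = Counter()                # context-word document frequency (per sentence)
    for toks in sentences:
        for i, w in enumerate(toks):
            if w in nouns:
                ctx = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
                cooc[w].update(ctx)
        df.update(set(toks))
    vocab = sorted({c for ctr in cooc.values() for c in ctr})
    idx = {c: j for j, c in enumerate(vocab)}
    vecs = {}
    for w, ctr in cooc.items():
        v = np.zeros(len(vocab))
        for c, tf in ctr.items():
            v[idx[c]] = tf * np.log(len(sentences) / (1 + df[c]))  # tf-idf weight
        vecs[w] = v
    return vecs, idx
```

The cross-lingual feature vectors can then be obtained from the same counts by keeping only the dimensions whose context words are listed in the bilingual dictionary.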
  • Other ways to create the feature vectors in the common vector space are possible. For example, we could map the words in the source language as well as the words in the target language to a latent vector space using canonical correlation analysis (CCA), as described in Non-Patent Document 1.
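  • A minimal sketch of this CCA mapping, assuming matrices S and T whose i-th rows are the monolingual context vectors of the i-th seed-dictionary pair (the random matrices are stand-ins for real data):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

S = np.random.rand(500, 300)   # stand-in: source-side vectors of 500 seed pairs
T = np.random.rand(500, 300)   # stand-in: target-side vectors of the same pairs

cca = CCA(n_components=50, scale=False)
cca.fit(S - S.mean(axis=0), T - T.mean(axis=0))

def source_to_common(x):
    # project a source-language word vector into the shared latent space
    return (x - S.mean(axis=0)) @ cca.x_rotations_

def target_to_common(y):
    # project a target-language word vector into the shared latent space
    return (y - T.mean(axis=0)) @ cca.y_rotations_
```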
  • The feature vectors in the monolingual vector space and in the common vector space are the input of Component 20 and Component 30, respectively.
  • Component 10 creates monolingual clustering constraints using the bilingual dictionary 1.
  • The work flow of Component 10 is described in FIG. 4.
  • A bilingual dictionary 1 induces a bipartite graph: all source words and all target words form two different partitions, and there is an (undirected) edge between source word s and target word t if and only if there is a translation from s to t.
  • FIG. 7 shows an example of a bipartite graph induced by a bilingual dictionary 1.
  • We create a reduced (bipartite) graph by removing the edge between words s and t if the similarity of s's and t's cross-lingual feature vectors is smaller than a certain threshold.
  • FIG. 8 shows the corresponding reduced bipartite graph: an edge (a, b) is removed if the similarity between the context vectors of a and b is smaller than a certain threshold (here 0.5).
  • A monolingual must-link constraint between word x and word y, in the same language, is set if and only if there is a path between x and y in the reduced bipartite graph.
  • A monolingual cannot-link constraint between word x and word y, in the same language, is set if and only if x and y are both listed in the bilingual dictionary 1 and there is no path between x and y in the induced bipartite graph.
  • An example is shown in FIG. 9 and FIG. 10.
  • the "must-link constraint" means that two word in the same language that are connected by a path in the reduced graph.
  • FIG. 10 the "must-link constraint" means that two word in the same language that are connected by a path in the reduced graph.
  • cannot-link constraint means that two words in the same language that are listed in the dictionary but that are not connected by a path in the induced bi-partite graph. Note that there is no must-link constraint between — h [shTto] (seat, sheet) and "sheet” since the reduced graph does not have a path between x and y. Furthermore, note that there is no cannot-link constraint between "seat” and “sheet”, since both are connected through a path in the induced bipartite graph.
  • Cross-lingual must-link constraints simply correspond to the edges of the reduced graph. An example is given in FIG. 11, and a sketch of the whole constraint extraction follows below.
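  • Putting Component 10 together, a minimal sketch of the constraint extraction; the predicate same_language is a hypothetical helper, and the connected-component formulation is our reading of the path conditions above.

```python
import networkx as nx
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def build_constraints(dictionary, cross_vecs, same_language, threshold=0.5):
    """dictionary: iterable of (source_word, target_word) pairs;
    cross_vecs: word -> cross-lingual feature vector;
    same_language(x, y): hypothetical predicate, True if x and y share a language."""
    pairs = list(dictionary)
    induced = nx.Graph(pairs)                        # full bipartite graph
    reduced = nx.Graph((s, t) for s, t in pairs
                       if cosine(cross_vecs[s], cross_vecs[t]) >= threshold)
    reduced.add_nodes_from(induced)                  # keep words whose edges were cut

    comp_red = {w: i for i, c in enumerate(nx.connected_components(reduced)) for w in c}
    comp_ind = {w: i for i, c in enumerate(nx.connected_components(induced)) for w in c}

    words = sorted(induced.nodes)
    must, cannot = [], []
    for i, x in enumerate(words):
        for y in words[i + 1:]:
            if not same_language(x, y):
                continue
            if comp_red[x] == comp_red[y]:            # path in the reduced graph
                must.append((x, y))
            elif comp_ind[x] != comp_ind[y]:          # no path even in the induced graph
                cannot.append((x, y))
    cross_must = list(reduced.edges)                  # edges of the reduced graph
    return must, cannot, cross_must
```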
  • In Component 20, we use these monolingual constraints together with the feature vectors from the monolingual vector space to cluster the words in the source and target language, respectively.
  • For the monolingual clustering we use a mixture probability model. The work flow of Component 20 is described in FIG. 5.
  • Let n_r be the total number of words (in the same language) that are listed in the bilingual dictionary 1 and for which a feature vector is provided as input.
  • The (monolingual) must-link constraints define a clustering over these n_r words: words w1 and w2 are in the same cluster if and only if there is a monolingual must-link constraint between them.
  • Let p be the resulting number of clusters.
  • Let n be the number of words of interest, i.e. the words for which a feature vector is provided.
  • α is a hyper-parameter that influences the number of clusters that are generated. It can be shown that the expected number of clusters is as follows (see, for example, "Nonparametric empirical Bayes for the Dirichlet process mixture model", Jon D. McAuliffe et al.):
  • $E[\text{number of clusters}] = \sum_{i=1}^{n} \frac{\alpha}{\alpha + i - 1}$   (3)
  • By setting the expected number of clusters in Equation (3) equal to p (Equation (2)), we can calculate an optimal initial α. Note that there is no analytic solution for α; however, we can solve for it using numeric approximation. For example, since E[number of clusters] is monotonically increasing with respect to α, we can use binary search, as sketched below.
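  • A minimal sketch of this calibration; the bracketing interval and tolerance are assumptions.

```python
def expected_num_clusters(alpha, n):
    # Equation (3): E[K] = sum_{i=1}^{n} alpha / (alpha + i - 1)
    return sum(alpha / (alpha + i - 1) for i in range(1, n + 1))

def solve_alpha(p, n, lo=1e-6, hi=1e6, tol=1e-9):
    """Binary search for the alpha with E[K] = p; valid because
    expected_num_clusters is monotonically increasing in alpha."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_num_clusters(mid, n) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```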
  • Each cluster is modeled with a von Mises-Fisher distribution over unit-length feature vectors, $f(x; \mu, \kappa) = C_d(\kappa) \exp(\kappa \mu^{\top} x)$, where the mean direction $\mu$ corresponds to the cluster mean.
  • The normalization constant is defined as follows: $C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2} I_{d/2-1}(\kappa)}$
  • $I_s$ denotes the modified Bessel function of the first kind with order s.
  • The cosine similarity measures the angle between two vectors, and the von Mises-Fisher distribution defines a probability distribution over the possible angles.
  • Equation (5) can be approximated as described, for example, in "A short note on parameter approximation for von Mises-Fisher distributions", Suvrit Sra.
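  • For illustration, a sketch of the vMF log-density and of Sra's concentration approximation; whether the latter matches the patent's Equation (5) exactly is an assumption.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel I_s

def vmf_log_density(x, mu, kappa):
    """log f(x; mu, kappa) for unit vectors x, mu in R^d."""
    d = x.shape[0]
    s = d / 2 - 1
    # log C_d(kappa); ive(s, k) = I_s(k) * exp(-k) keeps this numerically stable
    log_cd = s * np.log(kappa) - (d / 2) * np.log(2 * np.pi) \
             - (np.log(ive(s, kappa)) + kappa)
    return log_cd + kappa * float(mu @ x)

def estimate_kappa(X):
    """Sra's approximation: kappa ~ r(d - r^2) / (1 - r^2), r = ||mean of X||,
    for rows of X that are unit-length feature vectors."""
    n, d = X.shape
    r = np.linalg.norm(X.mean(axis=0))
    return r * (d - r ** 2) / (1 - r ** 2)
```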
  • In step 1, for each cluster l, we sample a new cluster mean parameter $\mu_l$ from the posterior distribution $p(\mu_l \mid x_1, \ldots, x_{n_l}, c_1, \ldots, c_n, \kappa)$, where $x_1, \ldots, x_{n_l}$ are the (monolingual) feature vectors of the words assigned to cluster l.
  • In step 2, we sample a new cluster allocation $c_1, \ldots, c_n$. Let $a_1, a_2, \ldots, a_q$ be a set of words that are connected through the must-link constraints, and let $b_1, b_2, \ldots, b_r$ be the words that have a cannot-link constraint with $a_1, a_2, \ldots, a_q$.
  • We sample the new cluster for the words $a_1, a_2, \ldots, a_q$ from the posterior probability given the cluster allocation of the remaining words, denoted $c_{u_1}, \ldots, c_{u_{n-q}}$.
  • For clusters containing any of the words $b_1, \ldots, b_r$, we set the posterior distribution (Equation (6)) to 0. If we sampled a new cluster id, then we create the mean vector of the new cluster by sampling from the posterior distribution $H_a$ that is based on the prior distribution $G_0$, given the observations $x_{a_1}, \ldots, x_{a_q}$.
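  • A minimal sketch of such a blocked, constraint-respecting sampling step; clusters, cannot, and loglik are hypothetical containers and callables of this sketch, not the patent's notation.

```python
import numpy as np

def resample_block(block, clusters, cannot, alpha, loglik, rng):
    """Jointly reassign a must-link block a_1..a_q (Gibbs step 2).
    clusters: cluster id -> set of member words (the block already removed);
    cannot: set of frozenset({a, b}) cannot-link pairs;
    loglik(word, cid): vMF log-density of the word under cluster cid's mean,
    or under a mean freshly drawn from the prior G0 when cid is None."""
    forbidden = {cid for cid, members in clusters.items()
                 if any(frozenset((a, b)) in cannot
                        for a in block for b in members)}   # Equation (6) set to 0
    ids = [cid for cid in clusters if cid not in forbidden]
    logw = [np.log(len(clusters[cid])) + sum(loglik(a, cid) for a in block)
            for cid in ids]
    logw.append(np.log(alpha) + sum(loglik(a, None) for a in block))  # new cluster
    w = np.exp(np.array(logw) - max(logw))
    w /= w.sum()
    k = rng.choice(len(w), p=w)
    return ids[k] if k < len(ids) else "NEW"   # "NEW": open a new cluster
```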
  • The maximum-a-posteriori (MAP) cluster allocations for the words in the source and target language are passed to Component 30.
  • The work flow of Component 30 is illustrated in FIG. 6.
  • For the combination we use the feature vectors from the cross-lingual vector space.
  • The feature vectors can, for example, be combined simply by summation. If the feature vectors are based on co-occurrence frequency in a corpus, another choice is to combine the underlying co-occurrence frequencies of the clustered words; a sketch of both options follows below.
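  • A minimal sketch; the summation option is named in the text, while the frequency-weighted variant is only an assumption, since the original formula is not reproduced in this excerpt.

```python
import numpy as np

def cluster_representative(words, cross_vecs, freq=None):
    """Combine the cross-lingual feature vectors of one cluster's words.
    freq (word -> corpus frequency) enables the hypothetical weighted variant."""
    vecs = np.array([cross_vecs[w] for w in words])
    if freq is None:
        return vecs.sum(axis=0)                        # plain summation
    w = np.array([freq[x] for x in words], dtype=float)
    return (w[:, None] * vecs).sum(axis=0) / w.sum()   # frequency-weighted mean
```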
  • The combined feature vectors of the clusters in the source and target language are used to calculate the alignment probabilities across the two languages.
  • We call these combined feature vectors the cluster representatives. Let us denote by x and x' the cluster representatives of clusters c and c', respectively.
  • The alignment probability of cluster c with cluster c' can then be calculated from the similarity of their representatives, where sim(·, ·) denotes a similarity measure between two vectors; an illustrative sketch follows below.
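  • The exact formula is not reproduced in this excerpt; a natural choice, normalizing the similarities of c over all target clusters, might look like:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def alignment_probs(rep_c, target_reps, sim=cosine):
    """p(align(c, c')) for every target cluster c', assumed here to be
    sim(x, x') normalized over all target clusters."""
    sims = np.array([max(sim(rep_c, r), 0.0) for r in target_reps])
    return sims / (sims.sum() + 1e-12)
```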
  • Cross-lingual must-link clusters are clusters that contain a pair of words $w_s$, in the source language, and $w_t$, in the target language, that have a cross-lingual must-link constraint.
  • $\mu_c$ is the monolingual cluster mean of cluster c, found by the maximum-a-posteriori solution in Component 20.
  • $p(q \mid \mu_c, \kappa)$ is the probability that word q belongs to cluster c, computed using the von Mises-Fisher distribution defined above.
  • The probability $p(\text{align}(c, c') \mid q \text{ in cluster } c)$ is calculated as described in Component 30, whereas word q is allocated to cluster c and the cluster representatives are updated accordingly; a sketch of the resulting scoring follows below.
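  • Combining the pieces, a minimal sketch of the final scoring of target clusters; the softmax-style normalization of the membership probabilities and the additive representative update are assumptions of this sketch.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def score_target_clusters(q_mono, q_cross, src_means, kappa, src_reps, tgt_reps):
    """score(c') = sum_c p(q in c | mu_c, kappa) * p(align(c, c') | q in c).
    q_mono / q_cross: the query word's monolingual and cross-lingual vectors;
    src_means: MAP vMF mean directions from Component 20;
    src_reps / tgt_reps: cluster representatives in the common vector space."""
    # vMF membership: f(q; mu, kappa) is proportional to exp(kappa * mu.q)
    logp = np.array([kappa * cosine(q_mono, mu) for mu in src_means])
    p_mem = np.exp(logp - logp.max())
    p_mem /= p_mem.sum()
    scores = np.zeros(len(tgt_reps))
    for c, rep in enumerate(src_reps):
        rep_q = rep + q_cross                 # allocate q to c and update its rep
        sims = np.array([max(cosine(rep_q, r), 0.0) for r in tgt_reps])
        scores += p_mem[c] * (sims / (sims.sum() + 1e-12))
    return scores
```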
  • The present invention can also be used to align expressions between two different domains in the same language. For example, words like スタバ [sutaba] (Starbucks®) are used in informal text found on Twitter or other SNS. On the other hand, in formal text like newspapers we will primarily find the expression スターバックス [sutābakkusu].
  • The common vector space and the intra-domain vector space can be the same, or very similar.
  • As feature vectors we can use, as described in the first exemplary embodiment, the context vectors extracted from the text collections, e.g. a collection of Tweets and newspaper articles, respectively.
  • Another choice is to encode the spelling of the words as feature vectors.
  • The encoding of the words can be done, for example, by using unigrams, bigrams or trigrams, as described, for example, in "Spelling correction for the telecommunications network for the deaf", Karen Kukich, Communications of the ACM.
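  • A minimal sketch of such a spelling encoding; the boundary markers are a common convention, not necessarily the patent's.

```python
from collections import Counter

def char_ngram_vector(word, n_values=(1, 2, 3)):
    """Encode a word's spelling as counts of character uni-, bi- and trigrams."""
    grams = Counter()
    padded = f"^{word}$"   # boundary markers (an assumption of this sketch)
    for n in n_values:
        grams.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams
```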
  • Another choice is to use WordNet, which can be additionally enriched by an existing (seed) list of mappings between informal and formal words.
  • Words which have the same spelling are likely to have the same meaning if their number of characters is sufficiently large. If the number is too small, say, for example, less than 4 characters, the word might be ambiguous and have different meanings across domains.
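  • As a one-line illustration of this heuristic (the threshold of 4 comes from the text's example):

```python
def likely_same_meaning(w1, w2, min_chars=4):
    # identical spelling plus sufficient length suggests a shared meaning
    return w1 == w2 and len(w1) >= min_chars
```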
  • The word translation acquisition method of the above exemplary embodiments may be realized by dedicated hardware, or may be configured by means of memory and a DSP (digital signal processor) or other computation and processing device.
  • The functions may be realized by execution of a program used to realize the steps of the word translation acquisition method.
  • A program to realize the steps of the word translation acquisition method may be recorded on computer-readable storage media, and the program recorded on this storage media may be read and executed by a computer system to perform the word translation acquisition processing.
  • A "computer system" may include an OS, peripheral equipment, or other hardware.
  • "Computer-readable storage media" means a flexible disk, magneto-optical disc, ROM, flash memory or other writable nonvolatile memory, CD-ROM or other removable media, or a hard disk or other storage system incorporated within a computer system.
  • “computer readable storage media” also includes members which hold the program for a fixed length of time, such as volatile memory (for example, DRAM (dynamic random access memory)) within a computer system serving as a server or client, when the program is transmitted via the Internet, other networks, telephone circuits, or other communication circuits.
  • The present invention allows the accurate acquisition of new term translations. This is necessary in order to update and extend existing bilingual dictionaries, which in turn are an indispensable resource for cross-lingual text mining, among other applications.
  • Another important application is the acquisition of synonyms across different linguistic styles.
  • Our invention makes it possible to map informal expressions from Twitter, or other SNS, to formal expressions in newspapers, which, for example, allows news to be tracked jointly on SNS and in newspapers.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a word translation acquisition method comprising the following steps: constraint extraction, which extracts monolingual must-link, monolingual cannot-link, and cross-lingual must-link constraints from a bilingual dictionary; clustering, which clusters the words in each language using the words' feature vectors such that the monolingual must-link and cannot-link constraints are respected; cluster-representation calculation, which combines the feature vectors of the words assigned to the same cluster into a cluster representation, the feature vectors combined into the cluster representation being comparable across languages; alignment, which aligns the monolingual clusters across languages using the cluster representations and the cross-lingual must-link constraints given by the bilingual dictionary; and scoring, which scores each target-language cluster using the probability that the query word is assigned to a source-language cluster and the probability that the source-language cluster is aligned with the target-language cluster.
PCT/JP2013/073464 2013-08-27 2013-08-27 Word translation acquisition method WO2015029241A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2016510530A JP6090531B2 (ja) 2013-08-27 2013-08-27 Word translation acquisition method
PCT/JP2013/073464 WO2015029241A1 (fr) 2013-08-27 2013-08-27 Word translation acquisition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2013/073464 WO2015029241A1 (fr) 2013-08-27 2013-08-27 Word translation acquisition method

Publications (1)

Publication Number Publication Date
WO2015029241A1 true WO2015029241A1 (fr) 2015-03-05

Family

ID=52585848

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2013/073464 WO2015029241A1 (fr) 2013-08-27 2013-08-27 Word translation acquisition method

Country Status (2)

Country Link
JP (1) JP6090531B2 (fr)
WO (1) WO2015029241A1 (fr)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000311169A * 1999-04-27 2000-11-07 Nec Corp Translation word selection device, translation word selection method, and recording medium
JP2005107705A * 2003-09-29 2005-04-21 Hitachi Ltd Document classification device and document classification method for multiple languages
JP2006190107A * 2005-01-06 2006-07-20 Nippon Hoso Kyokai <Nhk> Bilingual pair extraction device and bilingual pair extraction program

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017021523A (ja) * 2015-07-09 2017-01-26 日本電信電話株式会社 Term semantic code determination device, method, and program
WO2017035382A1 (fr) * 2015-08-25 2017-03-02 Alibaba Group Holding Limited Method and system for generation of candidate translations
US10860808B2 (en) 2015-08-25 2020-12-08 Alibaba Group Holding Limited Method and system for generation of candidate translations
US10255275B2 (en) 2015-08-25 2019-04-09 Alibaba Group Holding Limited Method and system for generation of candidate translations
US10268685B2 (en) 2015-08-25 2019-04-23 Alibaba Group Holding Limited Statistics-based machine translation method, apparatus and electronic device
US10810379B2 (en) 2015-08-25 2020-10-20 Alibaba Group Holding Limited Statistics-based machine translation method, apparatus and electronic device
FR3040808A1 (fr) * 2015-09-07 2017-03-10 Proxem Method for automatically constructing inter-language queries for a search engine
WO2017042161A1 (fr) * 2015-09-07 2017-03-16 Proxem Method for automatically constructing inter-language queries for a search engine
US11055370B2 (en) 2015-09-07 2021-07-06 Proxem Method for automatically constructing inter-language queries for a search engine
US9558182B1 (en) 2016-01-08 2017-01-31 International Business Machines Corporation Smart terminology marker system for a language translation system
US10380065B2 (en) 2016-05-10 2019-08-13 Beijing Information Science & Technology University Method for establishing a digitized interpretation base of dongba classic ancient books
WO2017193472A1 (fr) * 2016-05-10 2017-11-16 北京信息科技大学 Method for establishing a digitized interpretation base of Dongba classic ancient books
CN109271517A (zh) * 2018-09-29 2019-01-25 东北大学 IG TF-IDF text feature vector generation and text classification method
CN109271517B (zh) * 2018-09-29 2021-12-31 东北大学 IG TF-IDF text feature vector generation and text classification method
CN109726263A (zh) * 2018-12-30 2019-05-07 广西财经学院 Cross-language post-translation hybrid expansion method based on feature-word weighted association pattern mining
CN109726263B (zh) 2018-12-30 2021-07-02 广西财经学院 Cross-language post-translation hybrid expansion method based on feature-word weighted association pattern mining
CN111460804A (zh) * 2019-01-02 2020-07-28 阿里巴巴集团控股有限公司 Text processing method, apparatus and system
CN111460804B (zh) * 2019-01-02 2023-05-02 阿里巴巴集团控股有限公司 Text processing method, apparatus and system
WO2022062523A1 (fr) * 2020-09-22 2022-03-31 腾讯科技(深圳)有限公司 Artificial-intelligence-based text mining method, related apparatus and device
CN112633017A (zh) * 2020-12-24 2021-04-09 北京百度网讯科技有限公司 Translation model training and translation processing method, apparatus, device and storage medium
CN112633017B (zh) * 2020-12-24 2023-07-25 北京百度网讯科技有限公司 Translation model training and translation processing method, apparatus, device and storage medium
CN113220845A (zh) * 2021-05-26 2021-08-06 鲁东大学 Fine-grained precise alignment method for multilingual texts based on deep semantics
CN113220845B (zh) * 2021-05-26 2022-05-17 鲁东大学 Fine-grained precise alignment method for multilingual texts based on deep semantics

Also Published As

Publication number Publication date
JP6090531B2 (ja) 2017-03-08
JP2016532916A (ja) 2016-10-20

Similar Documents

Publication Publication Date Title
WO2015029241A1 (fr) Word translation acquisition method
Balahur et al. Comparative experiments using supervised learning and machine translation for multilingual sentiment analysis
US8706474B2 (en) Translation of entity names based on source document publication date, and frequency and co-occurrence of the entity names
Yuan et al. Generative biomedical entity linking via knowledge base-guided pre-training and synonyms-aware fine-tuning
Bassil et al. Context-sensitive spelling correction using google web 1t 5-gram information
Scherrer et al. Modernising historical Slovene words
KR20150017507A (ko) Apparatus and method for correcting context spelling errors using a Korean lexical semantic network
RU2018122648A (ru) Paraphrasing of clinical free text in electronic form using a reading device
Alam et al. A review of bangla natural language processing tasks and the utility of transformer models
Maučec et al. Slavic languages in phrase-based statistical machine translation: a survey
Alqahtani et al. Using optimal transport as alignment objective for fine-tuning multilingual contextualized embeddings
Joshi et al. Transliterated search using syllabification approach
Mon et al. SymSpell4Burmese: symmetric delete Spelling correction algorithm (SymSpell) for burmese spelling checking
Naz et al. Urdu part of speech tagging using transformation based error driven learning
Aghaebrahimian Deep neural networks at the service of multilingual parallel sentence extraction
Zhang et al. zNLP: Identifying parallel sentences in Chinese-English comparable corpora
Wu et al. Learning to find English to Chinese transliterations on the web
Singh et al. Urdu to Punjabi machine translation: An incremental training approach
Zarnoufi et al. Machine normalization: Bringing social media text from non-standard to standard form
Claeser et al. Token level code-switching detection using Wikipedia as a lexical resource
Taslimipoor et al. Bilingual contexts from comparable corpora to mine for translations of collocations
Shinohara et al. An easily implemented method for abbreviation expansion for the medical domain in Japanese text
CN113408302A (zh) Machine translation result evaluation method, apparatus, device and storage medium
Okita et al. Dcu terminology translation system for medical query subtask at wmt14
Lavergne et al. LIMSI’s participation to the 2013 shared task on Native Language Identification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13892584

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2016510530

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13892584

Country of ref document: EP

Kind code of ref document: A1