WO2015029241A1 - Word translation acquisition method - Google Patents
Word translation acquisition method
- Publication number: WO2015029241A1
- Application number: PCT/JP2013/073464
- Authority: WIPO (PCT)
- Prior art keywords: cluster, words, word, monolingual, language
- Prior art date: 2013-08-27
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/45—Example-based machine translation; Alignment
Definitions
- The present invention relates to a method for acquiring translations of words using a seed dictionary.
- Here, a word refers to a single word (term) or to multiple words (a phrase).
- There is a constant need to update existing bilingual dictionaries with new words. The inventors of the present invention therefore propose a system that uses monolingual clustering with cross-lingual constraints from a bilingual seed dictionary to improve the finding of new translations of a word (in the following called the query word).
- Non-Patent Document 1 suggests using a probabilistic model that assumes that the feature vectors of source word s and target word t are generated from the same underlying concept. Given a latent concept represented by a vector in the latent vector space, they assume that the feature vectors of source word s and target word t are generated from the same latent concept. The feature vectors of s and t are each generated from a normal distribution, wherein the mean vector of the normal distribution is set to the latent concept vector transformed to the source and target vector space, respectively. Their model makes the simplifying assumption that there is a one-to-one mapping between a source word s and a target word t.
- Non-Patent Document 2 proposes using a poly-lingual topic model to match documents and vocabulary across several languages. They assume a latent prior distribution of topics, from which language-specific topic indicators are generated. A topic indicator in turn defines a probability distribution over the vocabulary. For training, they assume an initial set of aligned document pairs, obtained, for example, from cross-lingual links in Wikipedia. They show that, given a sufficiently high number of aligned document pairs as training data, the model can learn to align topics across languages. That means the i-th topic indicator in language A and the i-th topic indicator in language B tend to represent the same topic. Therefore, the words that appear with high probability in the aligned topics might be considered as translations.
- Non-Patent Document 5 suggests using an existing multilingual ontology, which contains predefined clusters of words (synsets) in the source language and target language, respectively. It is assumed that the multilingual ontology contains a one-to-one mapping between each cluster in the source and target language. However, such multilingual ontologies are domain dependent, making them less available than (general-domain) bilingual dictionaries. Furthermore, they make the limiting assumption that a new word (in the source language) necessarily belongs to an existing cluster that is predefined in the ontology. Obviously, this assumption does not hold if the new word is not related (e.g., not a synonym) to any of the words registered in the ontology.

Document of the Prior Art
- Non-Patent Document 1: "Learning Bilingual Lexicons from Monolingual Corpora", Haghighi et al., 2008
- Non-Patent Document 2: "Polylingual Topic Models", Mimno et al., 2009
- Non-Patent Document 3: "Revisiting Context-based Projection Methods for Term-Translation Spotting in Comparable Corpora", Laroche et al., 2010
- Non-Patent Document 4: "An efficient method for determining bilingual word classes", Och et al., 1999
- Non-Patent Document 5: "An approach based on multilingual thesauri and model combination for bilingual lexicon extraction", Dejean et al., 2002

DISCLOSURE OF INVENTION
- Non-Patent Document 1 requires the assumption that each word has only one translation. If the query word does not have synonyms, then this assumption can improve translation accuracy since it reduces the number of translation candidates.
- An example is shown in FIG. 1, where the possible translation candidates for the query word バイク [baiku] (motorbike) can be limited to "car", "motorbike", "light", and "lamp". The words "automobile" and "seat" are not considered as translation candidates. This reduces the search space and in this way can improve accuracy.
- However, the underlying assumption that there are no synonyms of the query word can be too restrictive a constraint, and can also decrease accuracy. For example, let us assume that we additionally have the following translation pair in our seed dictionary: 自動二輪車 [jidounirinsya] – "motorbike".
- After the monolingual clustering, we mutually exclusively align the monolingual clusters across languages so as to maximize our objective function.
- The objective function is set such that it favors cross-lingual alignments of clusters whose feature vectors are similar across languages.
- The feature vector space used for the cross-lingual alignment and the feature vector space used for the monolingual clustering need not be the same, which, in general, allows richer feature vectors to be used for the latter.
- FIG. 12 shows the initial monolingual clustering given by the bilingual dictionary entries.
- The (monolingual) feature vector of バイク [baiku] (motorbike) is similar to the feature vector of 自動二輪車 [jidounirinsya] (motorbike), and therefore バイク [baiku] (motorbike) and 自動二輪車 [jidounirinsya] (motorbike) will form one cluster.
- Likewise, 席 [seki] (seat) and シート [shīto] (seat) will form one cluster.
- The result of the monolingual clustering is shown in FIG. 13.
- In both situations our proposed method can improve translation accuracy when compared to previous methods, like Non-Patent Document 3 or Non-Patent Document 1.
- The first type of method, like Non-Patent Document 3, considers all words in the target language as translation candidates and thus has an enlarged search space.
- The second type of method, like Non-Patent Document 1, considers only words that do not already have a translation (in the dictionary), and therefore cannot find translations for a set of synonyms.
- The present invention has the effect of limiting the translation candidates for a query word while allowing a many-to-many cross-lingual alignment of words.
- FIG. 1 shows the translation candidates for the query バイク [baiku] (motorbike) when no clustering is performed; an English (target-language) word that occurs in the dictionary is not considered as a translation candidate.
- FIG. 2 shows the translation candidates for the query バイク [baiku] (motorbike) when no clustering is performed; an English (target-language) word that occurs in the dictionary is not considered as a translation candidate (assumes that additionally the entry 自動二輪車 [jidounirinsya] – motorbike is in the dictionary).
- FIG. 3 is a block diagram showing the functional structure of the system proposed by our invention with respect to the first embodiment.
- FIG. 4 is a block diagram showing an example of the functional structure of Component 10.
- FIG. 5 is a block diagram showing an example of the functional structure of Component 20.
- FIG. 6 is a block diagram showing an example of the functional structure of Component 30.
- FIG. 7 shows an example of the bipartite graph induced by a bilingual dictionary.
- FIG. 8 shows an example of the construction of the reduced bipartite graph.
- FIG. 9 shows an example of the monolingual must-link constraints induced by a bilingual dictionary.
- FIG. 10 shows an example of the monolingual cannot-link constraints induced by a bilingual dictionary.
- FIG. 11 shows an example of the cross-lingual must-link constraints induced by a bilingual dictionary.
- FIG. 12 shows the initial clustering of the source and target words; the initial clustering is induced by the monolingual must-link constraints.
- FIG. 13 shows the final result of the clustering of the source and target words.
- FIG. 14 shows the final result of the clustering of the source and target words; the cross-lingual must-link constraints and the resulting translation candidates are also shown.
- FIG. 16 is a block diagram showing the functional structure of the system proposed by our invention with respect to the second embodiment.
- The main architecture, usually realized by a computer system, is described in FIG. 3.
- Our method requires as input the query word, a set of source words (including the query word) in the source language and their feature vectors, and a set of target words and their feature vectors.
- It is assumed that the set of target words includes the correct translation of the query word.
- For each word, two types of feature vectors are provided.
- The first type we call the monolingual feature vector; it can be compared between two words in the same language (monolingual vector space).
- The second type we call the cross-lingual feature vector; it can be compared across the two languages (common vector space).
- The similarity of two feature vectors, e.g., measured by the cosine similarity, indicates the semantic similarity of the words.
- There are several ways in which the monolingual and cross-lingual feature vectors can be created.
- For example, we can use two corpora, one for the source language and one for the target language.
- The two corpora are ideally two text collections with similar topics, so-called comparable corpora, for example, news articles from the same publication interval in Japanese and English.
- From these corpora we extract context vectors for all relevant words, for example, all nouns that occur in the corpora.
- The context vector of a noun contains in each dimension the tf-idf weight of a content word that co-occurs with the noun; a sketch of this construction is given below.
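The following is a minimal sketch of one way such context vectors might be built; the window size, the pre-tokenized corpus format, and all names are our own assumptions, not part of the patent.

```python
import math
from collections import Counter, defaultdict

def build_context_vectors(sentences, nouns, window=5):
    """Sketch: tf-idf weighted co-occurrence context vectors.

    sentences: list of token lists (one corpus, already tokenized)
    nouns:     set of words for which context vectors are wanted
    """
    cooc = defaultdict(Counter)  # noun -> co-occurring content word -> count
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok not in nouns:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    cooc[tok][tokens[j]] += 1
    # Document frequency of a context word = number of nouns it co-occurs with.
    df = Counter()
    for ctx in cooc.values():
        for w in ctx:
            df[w] += 1
    n = max(len(cooc), 1)
    return {noun: {w: tf * math.log(n / df[w]) for w, tf in ctx.items()}
            for noun, ctx in cooc.items()}
```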
- The monolingual feature vectors are richer, since they can contain all context words, whereas the cross-lingual feature vectors contain only the words that are listed in the bilingual dictionary 1.
- Other ways of creating the feature vectors in the common vector space are possible. For example, we could map the words in the source language as well as the words in the target language to a latent vector space using canonical correlation analysis (CCA), as described in Non-Patent Document 1.
- The feature vectors in the monolingual vector space and in the common vector space are the input of Component 20 and Component 30, respectively.
- Component 10 creates monolingual clustering constraints using the bilingual dictionary 1.
- The work flow of Component 10 is described in FIG. 4.
- A bilingual dictionary 1 induces a bipartite graph by considering all source words and target words as two different partitions, where there is an (undirected) edge between source word s and target word t if and only if there is a translation from s to t.
- FIG. 7 shows an example of a bipartite graph induced by a bilingual dictionary 1.
- We create a reduced (bipartite) graph by removing the edge between words s and t if the similarity of s's and t's cross-lingual feature vectors is smaller than a certain threshold.
- FIG. 8 shows the corresponding reduced bipartite graph, in which an edge (a, b) is removed if the similarity between the context vectors of a and b is smaller than a certain threshold (here 0.5).
- A monolingual must-link constraint between word x and word y, in the same language, is set if and only if there is a path between x and y in the reduced bipartite graph.
- A monolingual cannot-link constraint between word x and word y, in the same language, is set if and only if x and y are both listed in the bilingual dictionary 1 and there is no path between x and y in the induced bipartite graph.
- An example is shown in FIG. 9 and FIG. 10.
- the "must-link constraint" means that two word in the same language that are connected by a path in the reduced graph.
- FIG. 10 the "must-link constraint" means that two word in the same language that are connected by a path in the reduced graph.
- cannot-link constraint means that two words in the same language that are listed in the dictionary but that are not connected by a path in the induced bi-partite graph. Note that there is no must-link constraint between — h [shTto] (seat, sheet) and "sheet” since the reduced graph does not have a path between x and y. Furthermore, note that there is no cannot-link constraint between "seat” and “sheet”, since both are connected through a path in the induced bipartite graph.
- The cross-lingual must-link constraints simply correspond to the edges of the reduced graph. An example is given in FIG. 11. A sketch of the whole constraint extraction is given below.
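As an illustration of Component 10, here is a minimal sketch under our own naming and data-layout assumptions (translation pairs as tuples, vectors as numpy arrays); it is not the patent's implementation.

```python
import numpy as np

class UnionFind:
    """Tracks connected components of a graph."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def extract_constraints(dictionary, cross_vecs, threshold=0.5):
    """dictionary: list of (source_word, target_word) translation pairs.
    cross_vecs:  word -> cross-lingual feature vector."""
    # Reduced bipartite graph: keep an edge only if the pair's
    # cross-lingual feature vectors are similar enough.
    reduced = [(s, t) for s, t in dictionary
               if cosine(cross_vecs[s], cross_vecs[t]) >= threshold]

    uf_reduced, uf_full = UnionFind(), UnionFind()
    for s, t in reduced:
        uf_reduced.union(s, t)
    for s, t in dictionary:          # induced (full) bipartite graph
        uf_full.union(s, t)

    must, cannot = set(), set()
    for side in ({s for s, _ in dictionary}, {t for _, t in dictionary}):
        words = sorted(side)
        for i, x in enumerate(words):
            for y in words[i + 1:]:
                if uf_reduced.find(x) == uf_reduced.find(y):
                    must.add((x, y))    # path in the reduced graph
                elif uf_full.find(x) != uf_full.find(y):
                    cannot.add((x, y))  # no path even in the induced graph
    # Cross-lingual must-links are exactly the edges of the reduced graph.
    return must, cannot, set(reduced)
```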
- In Component 20, we use these monolingual constraints together with the feature vectors from the monolingual vector space to cluster the words in the source and target language, respectively.
- For the monolingual clustering we use a mixture probability model. The work flow of Component 20 is described in FIG. 5.
- Let $n_r$ be the total number of words (in the same language) that are listed in the bilingual dictionary 1 and for which a feature vector is provided as input.
- The (monolingual) must-link constraints define a clustering over these $n_r$ words: words $w_1$ and $w_2$ are in the same cluster if and only if there is a monolingual must-link constraint between them.
- Let $p$ be the resulting number of clusters.
- Let $n$ be the number of words that are of interest, i.e., the words for which a feature vector is provided.
- α is a hyper-parameter that influences the number of clusters that are generated. It can be shown that the expected number of clusters is as follows (see, for example, "Nonparametric empirical Bayes for the Dirichlet process mixture model", Jon D. McAuliffe et al.):
- Equation (3): $E[\text{number of clusters}] = \sum_{i=1}^{n} \frac{\alpha}{\alpha + i - 1} \;\approx\; \alpha \ln\!\left(1 + \frac{n}{\alpha}\right) \quad (3)$
- Using Equation (2), we can calculate an optimal initial α. Note that there is no analytic solution for α; however, we can solve for it using numeric approximations. For example, since E[number of clusters] is monotonically increasing with respect to α, we can use binary search, as in the sketch below.
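A minimal numeric sketch of this search, assuming the standard Dirichlet-process identity for the expected number of clusters from Equation (3); all names are ours:

```python
def expected_clusters(alpha, n):
    """E[number of clusters] of a Dirichlet process mixture with
    concentration alpha over n observations."""
    return sum(alpha / (alpha + i) for i in range(n))  # i = 0..n-1

def solve_alpha(target_clusters, n, lo=1e-6, hi=1e6, iters=100):
    """Binary search for alpha; valid because expected_clusters
    is monotonically increasing in alpha."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if expected_clusters(mid, n) < target_clusters:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Example: choose alpha so that about 10 clusters are expected a priori.
alpha0 = solve_alpha(target_clusters=10, n=5000)
```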
- The parameter μ corresponds to the cluster mean.
- The normalization constant is defined as follows: $C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\, I_{d/2-1}(\kappa)}$, where $d$ is the dimensionality of the feature vectors.
- $I_s$ denotes the modified Bessel function of the first kind with order $s$.
- The cosine similarity measures the angle between two vectors, and the von Mises-Fisher distribution defines a probability distribution over the possible angles.
- Equation (5) can be approximated as described, for example, in "A short note on parameter approximation for von Mises-Fisher distributions", Suvrit Sra; one common closed-form approximation is reproduced below.
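It is our assumption that the following widely used approximation is the one intended; here $\bar{r}$ is the mean resultant length of the $n$ feature vectors and $d$ their dimensionality:

$$\bar{r} = \frac{\bigl\lVert \sum_{i=1}^{n} x_i \bigr\rVert}{n}, \qquad \hat{\kappa} \approx \frac{\bar{r}\,(d - \bar{r}^{2})}{1 - \bar{r}^{2}}.$$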
- In step 1, for each cluster $l$, we sample a new cluster mean parameter $\mu_l$ from the posterior distribution $p(\mu_l \mid x_1, \ldots, x_{n_l}, c_1, \ldots, c_n, \kappa)$, where $x_1, \ldots, x_{n_l}$ are the (monolingual) feature vectors of the words assigned to cluster $l$.
- In step 2, we sample a new cluster allocation $c_1, \ldots, c_n$. Let $a_1, a_2, \ldots, a_q$ be a set of words that are connected through must-link constraints, and let $b_1, b_2, \ldots, b_r$ be the words that have a cannot-link constraint with $a_1, a_2, \ldots, a_q$.
- We sample the new cluster for the words $a_1, a_2, \ldots, a_q$ from the posterior probability given the cluster allocation of the remaining words, denoted as $c_{u_1}, \ldots, c_{u_{n-q}}$.
- For clusters that contain any of the words $b_1, \ldots, b_r$, we set the posterior distribution (Equation (6)) to 0. If we sample a new cluster id, then we create the mean vector of the new cluster by sampling from the posterior distribution $H_a$ that is based on the prior distribution $G_0$, given the observations $x_{a_1}, \ldots, x_{a_q}$. A schematic sketch of this step follows.
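The following is a schematic sketch of step 2 for one must-link group; the function names, the placeholder marginal likelihood for a new cluster, and the use of standard Chinese-restaurant-process weights are our assumptions, since the patent's exact Equation (6) is not reproduced here.

```python
import numpy as np

def sample_allocation(group, clusters, cannot_with, alpha, kappa,
                      vmf_logpdf, rng):
    """Resample the cluster of a must-link group a_1..a_q.

    group:       feature vectors of the must-linked words a_1..a_q
    clusters:    dict cluster_id -> {"mean": mu_l, "size": n_l}
                 (sizes/means computed without the group itself)
    cannot_with: ids of clusters containing a word b_j that has a
                 cannot-link constraint with the group
    """
    ids, logw = [], []
    for cid, c in clusters.items():
        if cid in cannot_with:
            continue  # cannot-link: posterior probability set to 0
        ll = sum(vmf_logpdf(x, c["mean"], kappa) for x in group)
        ids.append(cid)
        logw.append(np.log(c["size"]) + ll)  # CRP weight * likelihood
    # Option of opening a new cluster, whose mean would afterwards be
    # sampled from the posterior H_a based on the prior G_0.
    base_ll = 0.0  # placeholder: marginal likelihood of the group under G_0
    ids.append("new")
    logw.append(np.log(alpha) + base_ll)
    w = np.exp(np.array(logw) - max(logw))
    return ids[int(rng.choice(len(ids), p=w / w.sum()))]
```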
- The MAP cluster allocations for the words in the source and target language are passed to Component 30.
- Component 30 is illustrated in FIG. 6.
- For the combination we use the feature vectors from the cross-lingual vector space.
- The feature vectors can, for example, be combined simply by summation; if the feature vectors are based on co-occurrence frequencies in a corpus, another choice is to aggregate the underlying co-occurrence counts before re-weighting. A sketch of the summation variant follows.
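A minimal sketch of the summation variant (the normalization for later cosine comparison is our own addition):

```python
import numpy as np

def cluster_representative(word_ids, cross_vecs):
    """Combine the cross-lingual feature vectors of a cluster's words
    by summation into a single cluster representative."""
    rep = np.sum([cross_vecs[w] for w in word_ids], axis=0)
    return rep / (np.linalg.norm(rep) + 1e-12)  # unit length for cosine
```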
- The combined feature vectors of the clusters in the source and target language are used to calculate the alignment probabilities across the two languages.
- We call these combined feature vectors the cluster representatives. Let us denote by x and x' the cluster representatives of clusters c and c', respectively.
- The alignment probability of cluster c with cluster c' is then calculated from sim(x, x'), where sim(·, ·) denotes a similarity measure between two vectors.
- Cross-lingual must-link clusters are clusters that contain a pair of words w_s, in the source language, and w_t, in the target language, that have a cross-lingual must-link constraint.
- $\mu_c$ is the monolingual cluster mean of cluster c, found by the maximum-a-posteriori solution in Component 20.
- $p(q \mid \mu_c, \kappa)$ is the probability that word q belongs to cluster c, computed using the von Mises-Fisher distribution.
- The probability $p(\text{align}(c, c') \mid q \text{ in cluster } c)$ is calculated as described in Component 30, where word q is allocated to cluster c and the cluster representatives are updated accordingly. A sketch of the resulting scoring follows.
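Putting the two probabilities together, one plausible reading of the scoring step is the sketch below; the softmax normalization of the similarities and all names are our assumptions, since the exact form of the alignment probability is not reproduced here.

```python
import numpy as np

def score_target_clusters(q_mono, q_cross, src_clusters, tgt_reps,
                          kappa, vmf_pdf):
    """score(c') = sum_c p(q in c) * p(align(c, c') | q in c).

    src_clusters: list of (mu_c, cross_rep_c) for the source clusters
    tgt_reps:     cross-lingual representatives of the target clusters
    """
    # p(q in c): vMF densities, normalized over the source clusters.
    member = np.array([vmf_pdf(q_mono, mu, kappa) for mu, _ in src_clusters])
    member = member / member.sum()
    scores = np.zeros(len(tgt_reps))
    for p_c, (_, rep_c) in zip(member, src_clusters):
        rep_q = rep_c + q_cross  # representative updated with q allocated to c
        sims = np.array([float(rep_q @ r) /
                         (np.linalg.norm(rep_q) * np.linalg.norm(r) + 1e-12)
                         for r in tgt_reps])
        scores += p_c * (np.exp(sims) / np.exp(sims).sum())  # softmax
    return scores  # words of the highest-scoring target clusters are candidates
```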
- The present invention can also be used to align expressions between two different domains in the same language. For example, words like スタバ [sutaba] (Starbucks®) are used in informal text found on Twitter or other SNS. On the other hand, in formal text like newspapers we will primarily find the expression スターバックス [sutābakkusu] (Starbucks®).
- The common vector space and the intra-domain vector space can be the same, or very similar.
- As feature vectors we can use, as described in the first exemplary embodiment, the context vectors extracted from the text collections, e.g., a collection of tweets and a collection of newspaper articles, respectively.
- Another choice is to encode the spelling of the words as feature vectors.
- The encoding of the words can be done, for example, by using unigrams, bigrams or trigrams, as described, for example, in "Spelling correction for the telecommunications network for the deaf", Karen Kukich, Communications of the ACM. A sketch of such an encoding follows.
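A minimal sketch of a character n-gram spelling encoding (the boundary markers are our own convention):

```python
from collections import Counter

def ngram_vector(word, n=3):
    """Encode the spelling of a word as a bag of character n-grams,
    with boundary markers so prefixes and suffixes stay distinguishable."""
    padded = "^" + word.lower() + "$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

# Spelling variants across domains share many n-grams:
print(ngram_vector("colour") & ngram_vector("color"))  # common trigrams
```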
- As a seed dictionary we can use, for example, synonyms from WordNet, which can be additionally enriched by an existing (seed) list of mappings between informal and formal words.
- Words which have the same spelling are likely to have the same meaning if their number of characters is sufficiently large. If the number is too small, say, for example, less than 4 characters, the word might be ambiguous and have different meanings across domains.
- The word translation acquisition method of the above exemplary embodiments may be realized by dedicated hardware, or may be configured by means of memory and a DSP (digital signal processor) or other computation and processing device.
- The functions may be realized by execution of a program used to realize the steps of the word translation acquisition method.
- A program to realize the steps of the word translation acquisition method may be recorded on computer-readable storage media, and the program recorded on this storage media may be read and executed by a computer system to perform word translation acquisition processing.
- A "computer system" may include an OS, peripheral equipment, or other hardware.
- "Computer-readable storage media" means a flexible disk, magneto-optical disc, ROM, flash memory or other writable nonvolatile memory, CD-ROM or other removable media, or a hard disk or other storage system incorporated within a computer system.
- “computer readable storage media” also includes members which hold the program for a fixed length of time, such as volatile memory (for example, DRAM (dynamic random access memory)) within a computer system serving as a server or client, when the program is transmitted via the Internet, other networks, telephone circuits, or other communication circuits.
- The present invention allows the accurate acquisition of new term translations. This is necessary in order to update and extend existing bilingual dictionaries, which in turn are an indispensable resource for cross-lingual text mining, among other applications.
- Another important application is the acquisition of synonyms across different linguistic styles.
- Our invention makes it possible to map informal expressions from Twitter, or other SNS, to formal expressions in newspapers, which, for example, allows news to be tracked jointly on SNS and in newspapers.
Abstract
A word translation acquisition method comprising the following steps: constraint extraction, which extracts monolingual must-link, monolingual cannot-link, and cross-lingual must-link constraints from a bilingual dictionary; clustering, which clusters the words of each language using the words' feature vectors such that the monolingual must-link and cannot-link constraints are respected; cluster representative calculation, which combines the feature vectors of the words assigned to the same cluster into a cluster representative, where the feature vectors combined into the cluster representative are such that they can be compared across languages; alignment, which aligns the monolingual clusters across languages using the cluster representatives and the cross-lingual must-link constraints given by the bilingual dictionary; and scoring, which scores each target-language cluster using the probability that the query word is assigned to a source-language cluster and the probability that the source-language cluster is aligned with the target-language cluster.
Priority Applications (2)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2016510530A (JP6090531B2) | 2013-08-27 | 2013-08-27 | 単語訳取得方法 (Word translation acquisition method) |
| PCT/JP2013/073464 (WO2015029241A1) | 2013-08-27 | 2013-08-27 | Word translation acquisition method |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| WO2015029241A1 | 2015-03-05 |

Family ID: 52585848

Country Status (2)

| Country | Link |
|---|---|
| JP | JP6090531B2 |
| WO | WO2015029241A1 |
Also Published As

| Publication Number | Publication Date |
|---|---|
| JP6090531B2 | 2017-03-08 |
| JP2016532916A | 2016-10-20 |
Legal Events

| Code | Title | Details |
|---|---|---|
| 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 13892584; Country: EP; Kind code: A1 |
| ENP | Entry into the national phase | Ref document number: 2016510530; Country: JP; Kind code: A |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | EP: PCT application non-entry in European phase | Ref document number: 13892584; Country: EP; Kind code: A1 |