WO2015079591A1 - Crosslingual text classification method using expected frequencies - Google Patents

Crosslingual text classification method using expected frequencies

Info

Publication number
WO2015079591A1
Authority
WO
WIPO (PCT)
Application number
PCT/JP2013/082514
Other languages
French (fr)
Inventor
Silva Daniel Georg Andrade
Kai Ishikawa
Hironori Mizuguchi
Takashi Onishi
Original Assignee
Nec Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nec Corporation filed Critical Nec Corporation
Priority to PCT/JP2013/082514 priority Critical patent/WO2015079591A1/en
Publication of WO2015079591A1 publication Critical patent/WO2015079591A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/10Office automation; Time management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/42Data-driven translation
    • G06F40/44Statistical methods, e.g. probability models


Abstract

A method that, given a bag-of-words representation of a text snippet written in a source language, calculates an expected bag-of-words representation in a target language, includes: a step in which, for a source word in the input bag-of-words, a probability that the source word is translated into a target word is calculated by using given probabilities that the target word is translated into the source word and by using co-occurrence probabilities of two or more target words that are calculated from a corpus written in the target language; and a step in which the probability that the target word is a translation of the source word is summed up to denote an expected count of the target word, and to create a feature vector by using the expected counts; the resulting feature vector in the target language being considered as the expected bag-of-words representation that represents the input bag-of-words.

Description

DESCRIPTION
CROSSLINGUAL TEXT CLASSIFICATION METHOD USING EXPECTED FREQUENCIES
TECHNICAL FIELD
The present invention relates to a method that, given an input text written in a source language F, finds an expected bag-of-words representation of the input text in a target language E. The expected bag-of-words representation shows how often we expect that a word in language E occurs in the translation of the input text. The expected bag-of-words representation can be used, for example, for cross-lingual classification and translation acquisition.
BACKGROUND ART
The inventors of the present invention propose a method to translate an input text (in source language) into an expected bag-of-words representation using a bilingual dictionary. The expected bag-of-words representation shows how often we expect that a word in the target language occurs in the translation of the input text. For example, this expected bag-of-words can be used to create a feature vector for text classification: the resulting feature vector is used to classify the input by using a classifier trained on text written in the target language (often referred to as cross-lingual classification).
Another application is to use the resulting feature vector to create a context vector for a word q (source language) that is not listed in the bilingual dictionary.
Subsequently, the context vector of q can be compared to context vectors of words in the target language to find plausible translation candidates of q. For both applications, the order of the words is irrelevant, and only the expected number of times a word occurs in the translation is necessary. Therefore, no machine translation system with parallel corpora is needed, but only a bilingual dictionary.
However, a simple word-by-word translation using only a bilingual dictionary has the disadvantage of introducing words that are completely unrelated. This is due to the ambiguity of polysemic words. For example, given the text "plant worker", the word "plant" will be translated into all of the Japanese equivalents listed in the bilingual dictionary, covering both the "factory" sense and the "vegetation" sense. It is clear that in this context the "vegetation" sense is not a sensible translation and therefore should be ignored.
This problem is, for example, addressed in Non-Patent Document 2, work in the area of CLIR (Cross-lingual Information Retrieval). They suggest disambiguating a text (query) by using a target language corpus. They translate each word in the text separately and then weight each translation by taking into account the correlations of the translations in the target language corpus. For example, given the query "plant worker", the word "plant" can be translated into the Japanese words for "factory" and "vegetation", whereas the word "worker" can be translated into the Japanese word for "worker", using an ordinary bilingual dictionary. Since the degree of association between "factory" and "worker" is higher than that between "vegetation" and "worker", they give the translation "factory" more weight than the translation "vegetation". The degree of association (or correlation) of two words is calculated by using the co-occurrence frequencies of the two words in the target language corpus (here, in this example, Japanese). Their weighting scheme, however, does not take into account how often a word occurs in the query. This is reasonable, since a query normally contains only a few,
non-overlapping, words. Therefore, their method cannot be used to calculate the expected frequency of words of a (long) text which is likely to contain the same word more than once.
The method in Non-Patent Document 3 uses neither cross-lingually aligned documents, nor a machine translation system for mapping all texts into the same language. Instead, they use a bilingual dictionary that is organized into synsets (WordNet®) that are cross-lingually aligned. Furthermore, their method exploits words that have exactly the same spelling (cognates). Obviously, the latter holds only for language pairs that are closely related, like English and Italian. Each document is represented as a term vector, which can also be weighted using Inverse Document Frequencies (IDF). Additionally, if the synset id (which is language independent) of a word is available, they also add this information to the term vector. Next, the term vectors of the documents in both languages are combined into one matrix. The matrix is then decomposed using SVD to find a set of latent topics (they call them domains). For training and classification they map the term vectors into the latent topics, which can then be compared using the cosine similarity. Therefore, for classification, they use the cosine kernel. For their proposed method, they assume that the bilingual dictionary is organized into synsets (synonym sets), and choose only the monosemic (only one sense) words. To use their proposed method for polysemic words, it is necessary to use a WSD (Word Sense Disambiguation) system to find the correct sense of the word.
Non-Patent Document 1 describes an efficient method for learning word translation probabilities p(f|e) using a bilingual dictionary and a pair of comparable corpora (two corpora written in different languages which do not need to be translations of each other). Using these translation probabilities, they are able to create a word-by-word translation (FIG. 6A) of a sentence F. As the translation of F they use the best translation E*, which is:

E* = argmax_E p(E) · p(F|E)   (1)

where p(E) is calculated by using a language model of the target language. Assuming a bi-gram language model we get:

p(E) · p(F|E) = Π_i p(e_i | e_{i-1}) · p(f_i | e_i)

where f_i and e_i denote the single words occurring in F and E; the index i runs over the words in the sentence F (note that they assume that the length of the translated sentence is the same as that of F). Their method has the big advantage that no parallel corpora are needed for translation.
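For concreteness, the following is a minimal Python sketch of the kind of word-by-word decoding expressed by Equation (1) under a bi-gram language model, implemented as a simple Viterbi search. The dictionaries for the translation candidates, p(f|e) and the bigram probabilities are assumed to be given as plain Python dicts; all names are illustrative and not taken from the patent.

```python
def best_translation(f_words, candidates, p_f_given_e, p_bigram):
    """Viterbi search for E* = argmax_E prod_i p(e_i | e_{i-1}) * p(f_i | e_i).

    f_words:     list of source words (the sentence F)
    candidates:  dict source word -> list of target-word translation candidates
                 (assumed non-empty for every source word)
    p_f_given_e: dict (f, e) -> translation probability p(f|e)
    p_bigram:    dict (e_prev, e) -> bigram probability p(e|e_prev); "<s>" marks
                 the sentence start
    """
    # best[e] = (score of the best partial translation ending in e, that path)
    best = {"<s>": (1.0, [])}
    for f in f_words:
        new_best = {}
        for e in candidates[f]:
            emit = p_f_given_e.get((f, e), 0.0)
            # extend every previous hypothesis and keep the highest-scoring one
            score, path = max(
                ((s * p_bigram.get((e_prev, e), 1e-9) * emit, p + [e])
                 for e_prev, (s, p) in best.items()),
                key=lambda t: t[0],
            )
            new_best[e] = (score, path)
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]
```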
Document of the Prior Art
Non-Patent Document 1: "Estimating Word Translation Probabilities from Unrelated Monolingual Corpora Using the EM Algorithm", AAAI, 2000.
Non-Patent Document 2: "Using mutual information to resolve query translation ambiguities and query term weighting", ACL, 1999.
Non-Patent Document 3: "Exploiting Comparable Corpora and Bilingual Dictionaries for Cross-Language Text Categorization", ACL, 2006.
DISCLOSURE OF INVENTION
Problems to be Solved by the Invention
A straight-forward solution to create a feature vector, or context vector, in the target language, is to first translate the input text into the target language using a machine translation system; and in the second step, to create a feature vector, or context vector, using the translated text. The feature vector can then, for example, be used to classify the input text using the classifier trained with the training data in the target language.
However, in general the use of a machine translation system is expensive, since either translation rules need to be manually created, or a large collection of parallel corpora is necessary. Furthermore, a method like that of Non-Patent Document 3 has the disadvantage of requiring additional resources like a WSD system, or of requiring that the words in the source and target languages are organized into senses as in WordNet®.
Using the method described in Non-Patent Document 1, we can find a translation of the input text without the need for expensive resources like parallel corpora. The idea is displayed in FIG. 1, where Component 10 corresponds to the method described in Non-Patent Document 1.
However, we note that their method assumes that the word order in the two languages is the same. This is obviously not the case for language pairs like English and Japanese. We therefore show (in the fourth embodiment) how translation probabilities can be learned from comparable corpora where the word order is different.
Next, Component 20 counts the frequencies of each word and places them into a feature vector. Finally, the classifier in Component 30 can be any classifier, for example an SVM (support vector machine) classifier. However, it is in general disadvantageous to rely on only one translation that is then used for classification. For example, given a Japanese sentence meaning "The house was destroyed by a blaze.", plausible translations for the Japanese word for "blaze" are "blaze" and "fire". In that case Component 10 might, for example, choose the translation "blaze". The example is shown in FIG. 2. Furthermore, assume we want to classify the sentence as either belonging to class "emergency" or not. Training data instances labeled as "emergency" might contain the word "fire" (i.e., the training data shown in FIG. 2 having class "emergency" includes, for instance, "A fire broke out here at the train station.", "An earthquake broke out and fire everywhere.", and "The earthquake completely destroyed the dike."), but not contain the word "blaze". As a consequence, it is difficult to classify the text based on the uni-grams "house", "blaze" and "destroy".
A straight-forward solution is to use, for a word f in the input text, all translations that are listed in a bilingual dictionary. However, this has the disadvantage of introducing noise. For example, possible translations for the Japanese word for "destroyed" are "miss" and "destroyed", whereas it is unlikely that "miss" is an appropriate translation in this context.
Means for Solving the Problem
For creating the feature vector of an input text, we do not rely on one translation of the input text. Instead, we (implicitly) generate all translations, count for each translation the word occurrences, and then combine the counts by using the probability that the translation is correct. Mathematically speaking, we use the expected word counts with respect to the probability distribution over the possibly correct translations.
In order to make the method independent of a machine translation system, we suggest to use word-by-word translations using, for example, a given bilingual dictionary. We assume that we have access to word-by-word translation probabilities and denote these probabilities as p(f|e). Note that these translation probabilities are independent of the context of the input text. For example, they are simply calculated by assuming a uniform distribution over all the translations for a word e that are listed in a bilingual dictionary.
However, as mentioned before, these translation probabilities change depending on the context of the input text. In order to calculate word translation probabilities that consider the context of the input text, we additionally use the co-occurrence probabilities of two words e_i and e_j, denoted as p(e_i|e_j), calculated from a corpus written in the target language. (We use these co-occurrence probabilities either directly, i.e., by calculating p(e_i|e_j), or indirectly, by defining document clusters as described in the third exemplary embodiment.)
This way, for the i-th word in the input text F written in the source language, we can calculate the probability that this word is translated into a word e in the target language, given the context of the input text F. We denote this translation probability as p(e_i|F). By summing up over all positions i the probability that word e is a translation of the i-th word in the input text, i.e., Σ_i p(e_i = e|F), we can calculate the expected count of a word e. Using these expected counts, we are able to create a feature vector, which is in the word-vector space of the target language.
Effect of the Invention
The present invention has the effect of allowing word features in the target language to be extracted from a text written in the source language, without relying on a machine translation system and parallel corpora, but nevertheless being able to use the context of the input text to create an appropriate feature vector that can be used for text
classification, or to find word translations.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram showing the functional structure of a baseline system that is a trivial combination of previous work.
FIG. 2 illustrates the problem of the baseline system of the previous work.
FIG. 3 is a block diagram showing the functional structure of a system according to a first exemplary embodiment of the present invention.
FIG. 4 is a block diagram showing the functional structure of a system according to a second exemplary embodiment of the present invention.
FIGS. 5A to 5C show the data and the probabilities that are used for the examples explained in the first exemplary embodiment.
FIGS. 6A, 6B and 6C show the results of the examples, using a word-by-word translation, a baseline method and the proposed method according to the first exemplary embodiment, respectively.
FIG. 7 is a graphical model using a plate notation according to embodiments of the present invention.
FIG. 8 is a block diagram showing the functional structure of the system according to a fourth exemplary embodiment of the present invention.
FIG. 9 is a block diagram showing a method for estimating translation probabilities according to the fourth exemplary embodiment of the present invention.
FIG. 10 is a block diagram showing the functional structure of the system according to a related art.
FIG. 11 is a block diagram showing the functional structure of the system according to a fifth exemplary embodiment of the present invention.
EXEMPLARY EMBODIMENTS FOR CARRYING OUT THE INVENTION
<First Exemplary Embodiment>
First, we describe the proposed method for calculating the expected
bag-of-words representation of the translation of an input text, in the context of a cross-lingual text classification system. Often the most important features that are used for classification are uni-grams (single terms, or phrases).
Using uni-grams as features, we can represent a text T as a column vector where dimension r contains the number of times word w_r occurs in text T. Formally, we denote the number of times word w occurs in text T by count_T(w). We denote the language of the text we want to classify as the source language, and the language of the texts in the training data as the target language. If text T is in the source language we write F, if the text is in the target language we write E. We denote the i-th word of F as f_i. Furthermore, we denote as e_i the word in E that is the translation of f_i. Note that, for simplicity, we make the assumption that one word in F is translated into exactly one word in E. As a consequence the lengths of E and F are the same.
Furthermore, we assume that we have the translation probabilities p(f|e) given, corresponding to Resource 5 in FIG. 3. FIG. 3 shows a system, usually performed by a computer system, for calculating the expected bag-of-words representation in the target language. These word translation probabilities can be calculated, for example, using the method described in Non-Patent Document 1. Or, as an even simpler method, using an existing bilingual dictionary and assuming uniform translation probabilities over all translations listed for a word e. In cases where we need the translation probabilities p(e|f), we will simply assume p(e|f) ∝ p(f|e). Alternatively, we could also calculate them using Non-Patent Document 1. The set of all translations of a word f is denoted as φ(f), which we simply define as φ(f) := {e ∈ Ω | p(f|e) > 0}, where Ω is the set of words that occur in the training corpus (Training Data 25 in FIG. 3). The training data 25 is stored in a non-transitory computer storage medium such as a hard disk drive or a semiconductor memory.
Instead of using only one translation E, we implicitly generate all translations and weight them by the probability of each translation. More formally, instead of using count_E(e), we use the expected number of word occurrences, denoted by E[count_E(e)|F], as features. When we use a simple uni-gram language model in the source language we get:

E[count_E(e)|F] = Σ_{i=1}^{n} p(e_i = e | f_i)   (2)

where we might write F as (f_1, ..., f_n), f_i is the i-th word in F, and n is the number of words in F.
However, Equation (2) assumes that the translation of the word at position i in F depends only on f_i. That means:

p(e_i | F) = p(e_i | f_i)
We will relax this assumption by using the following assumption:

p(e_i | F, e_j) = p(e_i | f_i, e_j)   (3)

where e_j is the translation of word f_j, for j ∈ {1, 2, ..., i-1, i+1, ..., n}. In words, the assumption means that the probability that f_i is translated into e_i depends only on f_i and the j-th word in E. The word e_j can be interpreted as the word that we consider as important context information, and which helps us in finding an appropriate translation e_i. We do not make any restrictions on j ∈ {1, 2, ..., i-1, i+1, ..., n}. A possible choice is to select it randomly, or to choose j = i - 1, that is, the word previous to e_i. In case we do not want to restrict our choice to one j, we can average over several choices by using:

p(e_i | F) = Σ_{j ∈ {1, 2, ..., i-1, i+1, ..., n}} p_i(j) · p(e_i | F, j)   (4)

where p(e_i | F, j) is the model that uses the assumption that the j-th position in E is the important context information which helps us in finding translation e_i, and p_i(j) is the probability that position j is important. We might, for example, set p_i(j) such that it puts high weight on the words surrounding e_i, like:

p_i(j) = 1/2 if j ∈ {i-1, i+1}, and 0 else.
Using the assumption of Equation (3), we can calculate p(e_i | F, j) as follows:

p(e_i | F, j) = Σ_{e_j} p(e_i | F, e_j) · p(e_j | F) = Σ_{e_j} p(e_i | f_i, e_j) · p(e_j | F)   (5)

Furthermore, we assume that we can factorize the probability p(e_i, f_i, e_j) as follows:

p(e_i, f_i, e_j) = p(e_j) · p(e_i | e_j) · p(f_i | e_i)

where p(e_j) is the probability that word e_j occurs at position j of a text in the target language; recall that p(f_i | e_i) is the probability that word e_i is translated into f_i, and p(e_i | e_j) is the probability that word e_i occurs at position i of text T, if we know that word e_j occurs in T at position j. For example, we can choose for p(e_i | e_j), assuming j < i, the probability that a word e_i occurs in the target language if we know that word e_j occurred i - j words before; for j = i - 1 this is equal to the Markov assumption.
Another choice for p(e_i | e_j) is, for example, to use the probability that word e_i occurs in a text (or sentence), if we know that in this text (or sentence) the word e_j occurs.
Note that this factorization corresponds to the graphical model e_j -> e_i -> f_i. Therefore, it follows that:

p(e_i | f_i, e_j) = p(f_i | e_i) · p(e_i | e_j) / Σ_{e'} p(f_i | e') · p(e' | e_j)
Finally, combining the above formula with Equations (5) and (4), we get:

p(e_i | F) = Σ_{j ∈ {1, ..., i-1, i+1, ..., n}} p_i(j) · Σ_{e_j} [ p(f_i | e_i) · p(e_i | e_j) / Σ_{e'} p(f_i | e') · p(e' | e_j) ] · p(e_j | F)   (6)

Note that this formula is, in general, recursive, since we need p(e_j | F) to calculate p(e_i | F). One solution is to use the EM algorithm, where initially we set p(e_j | F) = p(e_j | f_j) for all j.
Another solution is, for example, to assume that p(e_1 | F) = p(e_1 | f_1) and then, for i > 1, to make e_i only dependent on e_{i-1}. That means setting, in formula (6), p_i(j) to 1 for j = i - 1, and to 0 otherwise. This results in the simplified formula (for i > 1):

p(e_i | F) = Σ_{e_{i-1}} [ p(f_i | e_i) · p(e_i | e_{i-1}) / Σ_{e'} p(f_i | e') · p(e' | e_{i-1}) ] · p(e_{i-1} | F)   (7)
As before, we assume that each word f_i is translated into e_i conditionally independently given text F. Furthermore, we also keep the assumption that one word in F is translated into exactly one word in E. Therefore, the translation of f_i into e_i is modeled as a Bernoulli distribution with probability p(e_i | F). Using the above assumptions we can calculate the expected frequency of a word e in text E:

E[count_E(e) | F] = Σ_{i=1}^{n} E[e_i = e | F] = Σ_{i=1}^{n} p(e_i = e | F)

where p(e_i = e | F) is calculated as in Equation (6).
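As an illustration of how the simplified chain model of Equation (7) and the expected counts can be computed, the following is a minimal Python sketch. It assumes the translation probabilities p(f|e), the per-word translation candidates, and the target-language co-occurrence probabilities p(e|e') are available as plain dictionaries; all function and variable names are illustrative, not part of the patent.

```python
def translation_distributions(f_words, candidates, p_f_given_e, p_cooc):
    """p(e_i|F) under the simplified chain model of Equation (7):
    p(e_1|F) = p(e_1|f_1), and for i > 1 each e_i depends only on e_{i-1}."""
    # first word: p(e|f_1) obtained by renormalising p(f_1|e), as in the text
    first = {e: p_f_given_e.get((f_words[0], e), 0.0) for e in candidates[f_words[0]]}
    z = sum(first.values()) or 1.0
    dists = [{e: v / z for e, v in first.items()}]

    for i in range(1, len(f_words)):
        f, dist = f_words[i], {}
        for e_prev, w_prev in dists[-1].items():
            # p(e | f_i, e_prev) is proportional to p(f_i|e) * p(e|e_prev)
            scores = {e: p_f_given_e.get((f, e), 0.0) * p_cooc.get((e, e_prev), 0.0)
                      for e in candidates[f]}
            z = sum(scores.values()) or 1.0
            for e, s in scores.items():
                dist[e] = dist.get(e, 0.0) + w_prev * s / z
        dists.append(dist)
    return dists


def expected_counts(dists):
    """E[count_E(e)|F] = sum_i p(e_i = e|F)."""
    counts = {}
    for dist in dists:
        for e, p in dist.items():
            counts[e] = counts.get(e, 0.0) + p
    return counts
```

The dictionary returned by expected_counts can be used directly as the feature vector in the word-vector space of the target language.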
Finally, we give an example using the model given in Equation (7), which demonstrates how our invention solves the problem of translation disambiguation for expected word counts. We assume the source language is Japanese, and the target language is English. We assume we want to classify a Japanese sentence meaning "The house was destroyed by a blaze". For simplicity we ignore functional words, which leaves us with a representation F that contains only the content words, i.e., the Japanese words corresponding to "house", "fire/blaze" and "destroyed". All the necessary probabilities are listed in FIGS. 5A, 5B and 5C. In FIG. 5C, we expect that "fire" and "blaze" often co-occur with "destroy", but not with "miss". Starting the calculation from the leftmost word to the rightmost word in F, we first obtain the translation probabilities for the first word and, in the next step, those for the second word, using the probabilities of FIGS. 5A to 5C. Finally, we calculate the translation probabilities for the rightmost word with Equation (7): p(miss|F) turns out to be small and, analogously, we get:

p(destroy|F) = 0.89
The result is shown in FIG. 6C as "Using the proposed method". Since our proposed method uses the co-occurrence information of the word "miss" with "fire" and of "miss" with "blaze", it is able to infer that the probability of the translation "miss" is small, and therefore the expected count is also small. As shown in FIG. 6C, it is possible to disambiguate the context and to obtain a better estimate of the expected counts for "destroy". It is therefore able to remedy the problem of the baseline system, which is not able to disambiguate between the translations "miss" and "destroy", as shown in FIG. 6B.
Concerning the application to cross-lingual classification, we note that it can be advantageous to calculate the expected bag-of-words representation of each training instance in the source language, i.e., the other way round from what is described above: First, for each training instance (in the target language) the expected bag-of-words representation in the source language is calculated and used to train a classifier in the source language. In the second step, the input text (in the source language) is classified by using the classifier from the previous step.
<Second Exemplary Embodiment>
As before, let us denote by count_E(e) the frequency of a word e in text E. In the first exemplary embodiment, we used the frequency count_E(e) for training and the expected frequency E[count_E(e)|F] for classification. However, it is well known that, instead of using the word frequencies directly, weighting the word frequencies, for example by the inverse document frequency (IDF), results in higher classification accuracy. We assume that these weights are given, as shown in Resource 35 of FIG. 4. FIG. 4 shows a system of the second exemplary embodiment, usually performed by a computer system. A non-transitory computer storage medium such as a hard disk drive or a semiconductor memory is used for the resource 35 for storing weight data. For example, using IDF, we can calculate a weight for word e, denoted by w_e, as follows:

w_e = log(N / df_e)

where N is the total number of documents in corpus 15 (the same as in FIG. 3), and df_e is the number of documents that contain word e in corpus 15.
We then weight the frequency counts by the weight w_e using:

g(count_E(e), w_e)

where g is a function for combining the frequency counts and the weight w_e. For example, using multiplication we get:

g(count_E(e), w_e) = count_E(e) · w_e

Instead of using the raw (expected) word frequencies, we suggest to use the weighted word frequency g(count_E(e), w_e) for training and the weighted expected word frequency g(E[count_E(e)|F], w_e) for classification.
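A small sketch of this weighting step follows, assuming the expected counts from the first embodiment and a document-frequency table are available as dictionaries, and taking g to be multiplication as in the example above; all names are illustrative.

```python
import math

def idf_weights(doc_freq, n_docs):
    """w_e = log(N / df_e) for every word e with a positive document frequency."""
    return {e: math.log(n_docs / df) for e, df in doc_freq.items() if df > 0}

def weighted_feature_vector(expected_counts, weights):
    """g(E[count_E(e)|F], w_e) with g chosen as multiplication."""
    return {e: c * weights.get(e, 0.0) for e, c in expected_counts.items()}
```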
<Third Exemplary Embodiment>
Here we show another possible method for calculating the expected frequency count of target words, given a text (document) in the source language. The method described here has the advantage that co-occurrences of more than two words in a text can be used. Furthermore, the word order is irrelevant, which makes the method applicable also to language pairs like English and Japanese.
The graphical model is depicted in FIG. 7 using plate notation. z ∈ {z_1, z_2, ..., z_k} denotes a topic, according to which the English words (target language words) are generated using a categorical distribution with parameters p(e|z). We denote the set of topics {z_1, z_2, ..., z_k} as Z. An English word e and a Japanese word (source language word) f are generated for each word in a document d. The part of the graphical model above node e classifies a bag-of-words (document) in English into one of k topics. For this task, we could also use a different generative model, like LDA (Latent Dirichlet Allocation). For illustration, we use the simpler method described here, which can be considered as a kind of Naive Bayes clustering.
For each English word e, a Japanese word f is generated using the probability p(f|e). We assume here that p(f|e) is given. It can, for example, be learned by using a bilingual dictionary and the method described in the fourth exemplary embodiment. We therefore assume that e is from a fixed vocabulary V (a set of words in English), and f is from a fixed vocabulary W (a set of words in Japanese). First, in the training phase, we learn the parameters p(e|z) and p(z) by using a document collection in English, denoted as D := {d_1, ..., d_m}. Ideally, this English collection contains a subset of documents that are of similar topic. For example, assuming we want to translate a Tweet in Japanese, we should choose a large collection of English Tweets for this document collection. The parameters p(e|z) and p(z) can, for example, be learned using the EM algorithm, as described in the following. Let p^t(e|z) and p^t(z) denote the estimates of p(e|z) and p(z) at step t, respectively. Furthermore, let Θ^t denote the set of parameters {p^t(e|z), p^t(z)}.
The expected number of times that we observe word e together with topic z is denoted as E[n(e, z)] and is calculated as follows:

E[n(e, z)] = Σ_{d ∈ D} n(d, e) · p(z | Θ^t, e_1, ..., e_{n_d})

where n(d, e) is the number of times word e occurs in document d, n_d is the total number of words in document d, and e_1, ..., e_{n_d} are the words that occur in document d. The probability p(z | Θ^t, e_1, ..., e_{n_d}) is calculated as follows:

p(z | Θ^t, e_1, ..., e_{n_d}) ∝ p^t(z) · Π_{i ∈ {1, ..., n_d}} p^t(e_i | z)

where the normalization constant is Σ_z p(z, e_1, ..., e_{n_d} | Θ^t).
The expected number of times that we observe topic z is denoted as E[n(z)] and is calculated as follows:

E[n(z)] = Σ_e E[n(e, z)] = Σ_{d ∈ D} n_d · p(z | Θ^t, e_1, ..., e_{n_d})

The M-step calculates the new parameters Θ^{t+1} using the expected counts:

p^{t+1}(e | z) = E[n(e, z)] / E[n(z)]

p^{t+1}(z) = E[n(z)] / Σ_{z'} E[n(z')]

This way, we can learn the parameters p(e|z) and p(z).
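A compact sketch of the E- and M-steps described above for learning p(e|z) and p(z) from the English document collection D is given below. Documents are assumed to be bags of words (dicts of counts); the random initialisation and the iteration count are illustrative choices, not prescribed by the text.

```python
import math
import random

def train_topic_model(docs, topics, vocab, iters=50, seed=0):
    """EM for the Naive-Bayes-style clustering of FIG. 7.

    docs:   list of dicts {target word e: count n(d, e)}
    topics: list of topic ids z_1 .. z_k
    vocab:  list of target-language words
    Returns (p_z, p_e_given_z).
    """
    rng = random.Random(seed)
    p_z = {z: 1.0 / len(topics) for z in topics}
    p_e_z = {z: {e: rng.random() for e in vocab} for z in topics}
    for z in topics:                                   # normalise the random start
        s = sum(p_e_z[z].values())
        p_e_z[z] = {e: v / s for e, v in p_e_z[z].items()}

    for _ in range(iters):
        exp_ez = {z: {e: 0.0 for e in vocab} for z in topics}   # E[n(e, z)]
        exp_z = {z: 0.0 for z in topics}                        # E[n(z)]
        for d in docs:
            # E-step: p(z | Theta, e_1..e_nd) proportional to p(z) * prod_i p(e_i|z)
            log_post = {}
            for z in topics:
                lp = math.log(p_z[z])
                for e, n in d.items():
                    lp += n * math.log(p_e_z[z][e] + 1e-12)
                log_post[z] = lp
            m = max(log_post.values())
            post = {z: math.exp(v - m) for z, v in log_post.items()}
            s = sum(post.values())
            post = {z: v / s for z, v in post.items()}
            n_d = sum(d.values())
            for z in topics:
                exp_z[z] += n_d * post[z]
                for e, n in d.items():
                    exp_ez[z][e] += n * post[z]
        # M-step: p(e|z) = E[n(e,z)] / E[n(z)],  p(z) = E[n(z)] / sum_z' E[n(z')]
        total = sum(exp_z.values())
        p_z = {z: exp_z[z] / total for z in topics}
        p_e_z = {z: {e: exp_ez[z][e] / (exp_z[z] + 1e-12) for e in vocab}
                 for z in topics}
    return p_z, p_e_z
```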
In the second step, we use the trained model to calculate the expected count E[count_E(e)|F] given a Japanese (source language) text F. We assume that for each word f_i in F there is an English word e_i. We denote the words in F by f_1, ..., f_{n_a} (we use the index a in n_a, since in later embodiments we will use the letter a to denote a document in the source language, instead of using the letter F).

E[count_E(e) | F] = Σ_{i=1}^{n_a} E[e_i = e | F] = Σ_{i=1}^{n_a} p(e_i = e | f_1, ..., f_{n_a})

where p(e_i = e | f_1, ..., f_{n_a}) is calculated as follows:

p(e_i = e | f_1, ..., f_{n_a}) ∝ Σ_{z ∈ Z} p(z) · p(e|z) · p(f_i|e) · Π_{j ≠ i} Σ_{e' ∈ V} p(e'|z) · p(f_j|e')

where the normalization constant is Σ_{e' ∈ V} p(e_i = e', f_1, ..., f_{n_a}).
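A sketch of this second step follows: it computes p(e_i = e | f_1, ..., f_{n_a}) and the expected counts from the trained topic model and p(f|e). It follows the factorisation written out above, and all container names are assumptions for illustration.

```python
def expected_counts_topic(f_words, p_z, p_e_z, p_f_e, vocab_e):
    """E[count_E(e)|F] = sum_i p(e_i = e | f_1..f_na) under the FIG. 7 model."""
    counts = {}
    # per-position, per-topic marginals: sum_e' p(e'|z) * p(f_j|e')
    word_z = [{z: sum(p_e_z[z][e] * p_f_e.get((f, e), 0.0) for e in vocab_e)
               for z in p_z} for f in f_words]
    for i, f in enumerate(f_words):
        # product over all other positions, kept separately per topic
        rest = {z: 1.0 for z in p_z}
        for j, wz in enumerate(word_z):
            if j != i:
                for z in p_z:
                    rest[z] *= wz[z]
        # unnormalised p(e_i = e, f_1..f_na)
        joint = {e: sum(p_z[z] * p_e_z[z][e] * p_f_e.get((f, e), 0.0) * rest[z]
                        for z in p_z)
                 for e in vocab_e}
        norm = sum(joint.values()) or 1.0
        for e, p in joint.items():
            counts[e] = counts.get(e, 0.0) + p / norm
    return counts
```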
Finally, we note that the topic distribution over z can also be used as features for classifying a source document. This can be achieved as follows:
First, for each training instance (a target language document), calculate the probability p(z | e_1, ..., e_{n_d}) and use this probability as a feature vector to train a classifier.
Second, for an input document F, calculate the probability p(z | f_1, ..., f_{n_a}) and use this probability to classify document F with the classifier trained in the previous step.
<Fourth Exemplary Embodiment>
In the previous embodiments, we assumed that the translation probabilities p(f|e) are given; for example, they can be calculated in advance with a method like the one described in Non-Patent Document 1. However, as we noted before, the method in Non-Patent Document 1 assumes that the word order in the two languages is the same. This is obviously not the case for language pairs like English and Japanese.
For the estimation of the probabilities we use the same model as in FIG. 7. We assume that the parameters p(z) and p(e|z) are estimated using the English corpus (target language corpus) D as described in the third embodiment. Furthermore, we assume we have a Japanese corpus (source language corpus), denoted as A := {a_1, ..., a_l}, which is comparable to the English corpus. For example, the two corpora can contain one year of Tweets in English and Japanese, respectively.
The translation probabilities p(f|e) can be estimated using A and the probabilities p(z) and p(e|z) as follows:

argmax_{p(f|e)} p(A | p(f|e), p(z), p(e|z)) = argmax_{p(f|e)} Π_{a ∈ A} Σ_z p(z) · Π_{i=1}^{n_a} Σ_{e_i} p(f_i | e_i) · p(e_i | z)

where f_i is the i-th word in (the bag of words of) document a, and document a contains 1, ..., n_a words.
This term can be optimized using, for example, the EM algorithm. As an initial setting, we set p(f|e) to the uniform distribution over all translations f that are listed in the bilingual dictionary. If, in the dictionary, a source word f is not listed as a translation of e, we set p(f|e) = 0. An overview of the input and output of the method is shown in FIG. 9. As an example application, we show in FIG. 8 how the estimation of translation probabilities can be combined with the cross-lingual classification described in the previous embodiments. FIG. 8 shows a system of the fourth exemplary embodiment, usually performed by a computer system.
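A sketch of how this EM optimisation could look is given below, keeping p(z) and p(e|z) fixed and re-estimating p(f|e) restricted to the dictionary entries; the data structures and the iteration count are illustrative assumptions rather than the patent's prescribed implementation.

```python
def estimate_p_f_given_e(src_docs, p_z, p_e_z, dict_pairs, vocab_e, iters=20):
    """EM sketch: maximise p(A | p(f|e), p(z), p(e|z)) over p(f|e),
    keeping p(z) and p(e|z) fixed.

    src_docs:   list of source-language documents, each a list (bag) of words f
    dict_pairs: set of (f, e) pairs listed in the bilingual dictionary
    """
    # initialise p(f|e) uniformly over the dictionary translations of each e
    trans_of = {}
    for f, e in dict_pairs:
        trans_of.setdefault(e, []).append(f)
    p_f_e = {(f, e): 1.0 / len(fs) for e, fs in trans_of.items() for f in fs}

    for _ in range(iters):
        c = {pair: 0.0 for pair in p_f_e}          # expected counts of (f, e)
        for doc in src_docs:
            # per-position, per-topic marginals: sum_e p(f|e) p(e|z)
            word_z = [{z: sum(p_f_e.get((f, e), 0.0) * p_e_z[z][e] for e in vocab_e)
                       for z in p_z} for f in doc]
            # posterior over topics: p(z|a) proportional to p(z) * prod_i word_z[i][z]
            post_z = {}
            for z in p_z:
                p = p_z[z]
                for wz in word_z:
                    p *= wz[z]
                post_z[z] = p
            s = sum(post_z.values()) or 1.0
            post_z = {z: v / s for z, v in post_z.items()}
            # accumulate expected (f, e) counts
            for i, f in enumerate(doc):
                for z in p_z:
                    if word_z[i][z] == 0.0:
                        continue
                    for e in vocab_e:
                        num = p_f_e.get((f, e), 0.0) * p_e_z[z][e]
                        if num:
                            c[(f, e)] += post_z[z] * num / word_z[i][z]
        # M-step: renormalise per target word e
        totals = {}
        for (f, e), v in c.items():
            totals[e] = totals.get(e, 0.0) + v
        p_f_e = {(f, e): v / totals[e] for (f, e), v in c.items()
                 if totals.get(e, 0) > 0}
    return p_f_e
```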
<Fifth Exemplary Embodiment>
Here we demonstrate that the proposed invention can also be used to improve the translation acquisition of new words. We denote by q the word that is not listed in the bilingual dictionary, and for which we want to find a (new) translation.
An overview of related previous work, disclosed in, for instance, "A Statistical View on Bilingual Lexicon Extraction", P. Fung, LNCS, 1998 (Non-Patent Document 4), is given in FIG. 10. We assume we are given two comparable corpora (Resources 7 and 15), from which we extract context vectors for each word. A context vector for word w contains in each dimension the co-occurrence frequency of word w with a word w', where w' is a word listed in the bilingual dictionary. Co-occurrence frequency is, for example, defined at the sentence level, i.e., how often two words occur in the same sentence. A context vector of a word e in English (target language) cannot be compared directly with a context vector of a word f in Japanese (source language), since the dimensions are different. Therefore, previous work maps, for example, the Japanese context vector to an English context vector using the bilingual dictionary, as described in Component 28. Finally, in Component 30, we (as well as previous work) compare the translated context vector of a word q with the context vectors of all translation candidates in the target language. This can be done, for example, by converting the raw co-occurrence counts into tf-idf weights and then using the cosine similarity, as described in Non-Patent Document 4.
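A minimal sketch of the comparison in Component 30 follows, assuming context vectors are sparse dicts keyed by dictionary words; the tf-idf weighting of Non-Patent Document 4 is omitted here for brevity, and all names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse context vectors given as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(query_vec, candidate_vecs):
    """Score each target-language candidate by cosine similarity with the
    (translated) context vector of the query word q; best candidates first."""
    return sorted(((cosine(query_vec, vec), e) for e, vec in candidate_vecs.items()),
                  reverse=True)
```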
The problem of the previous work is that, when translating the context vector, fixed translation probabilities are used. That is, as described in FIG. 6B, the frequencies are translated into the target language word by word, without considering the context, that is, the co-occurrence with other words. In order to overcome this problem, we suggest to apply the probabilistic model from the third embodiment to calculate the expected co-occurrence frequencies of a context word e (target language) with the query word q (source language). The idea is sketched in FIG. 11. FIG. 11 shows a system of the fifth exemplary embodiment, usually performed by a computer system.
First, given a sentence s in a document a (a document in the source language), we calculate the probability that word e occurs one or more times in the translated sentence. We denote this probability as p(e|s, a). Let S be the set of indices of the words in sentence s, i.e., S ⊆ {1, ..., n_a}.

p(e|s, a) = 1 - Π_{i ∈ S} (1 - p(e_i = e | f_1, ..., f_{n_a}))

where f_1, ..., f_{n_a} are the words in document a, and p(e_i = e | f_1, ..., f_{n_a}) is the probability that the i-th word is translated into word e. The probability p(e_i = e | f_1, ..., f_{n_a}) is calculated as described in the third embodiment.
We denote by E[count(e, q) | a] the expected co-occurrence frequency of word e with q for a document a. We can calculate it as follows:

E[count(e, q) | a] = Σ_{s ∈ a} I_s(q) · p(e|s, a)

where I_s(q) is 1 if q occurs in sentence s, and 0 otherwise.
Finally, we can calculate the expected co-occurrence frequency over the whole source language corpus A, denoted by E[count(e, q)], as follows:

E[count(e, q)] = Σ_{a ∈ A} E[count(e, q) | a]

The expected counts E[count(e, q)] are then used to create the context vector (in the target language) for word q.
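A sketch of this expected co-occurrence computation follows. It assumes the per-position translation distributions of the third embodiment are available through a callable p_word_trans; all names are illustrative.

```python
def expected_cooccurrence(src_docs, q, p_word_trans, vocab_e):
    """E[count(e, q)] = sum over documents a and sentences s containing q of
    p(e|s, a) = 1 - prod_{i in S} (1 - p(e_i = e | f_1..f_na)).

    src_docs:     list of documents, each a list of sentences (lists of words)
    p_word_trans: callable (doc_words, i) -> dict {e: p(e_i = e | f_1..f_na)},
                  e.g. the third-embodiment model
    """
    counts = {e: 0.0 for e in vocab_e}
    for doc in src_docs:
        doc_words = [w for sent in doc for w in sent]
        pos = 0
        for sent in doc:
            if q in sent:                               # I_s(q) = 1
                dists = [p_word_trans(doc_words, pos + i) for i in range(len(sent))]
                for e in vocab_e:
                    miss = 1.0
                    for d in dists:
                        miss *= 1.0 - d.get(e, 0.0)
                    counts[e] += 1.0 - miss             # p(e|s, a)
            pos += len(sent)
    return counts
```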
INDUSTRIAL APPLICABILITY
The present invention allows an input text to be classified even in the case where the training data is only available in a language different from that of the input text. For example, we might have plenty of Tweets in English that are annotated as either reporting about an emergency event, like an earthquake, or not reporting about any emergency. This kind of training data needs to be created manually, and is therefore expensive to create.
If we want to classify a Tweet in Japanese, we again need to manually create training data in Japanese. As an alternative, we could use a machine translation system to translate a text from Japanese to English. However, such a machine translation system itself needs training data of parallel text in Japanese and English which is expensive.
Our proposed method allows a feature vector of the Japanese input text to be created that can be classified by a classifier trained using only the English training data. This way, no new training data in Japanese is needed. Furthermore, since our approach uses readily available resources like a bilingual dictionary and monolingual corpora, it can be implemented at low cost and easily extended to other pairs of languages.
Another application is the translation of words which are not yet listed in a translation dictionary (fifth embodiment). This is especially helpful if there is only a small seed bilingual translation dictionary. For example, the new translations can be used to enrich the existing bilingual dictionary, and this way help to improve
cross-lingual classification.
The method for calculating the expected bag-of-words representation, the document classification method, and the word translation acquisition method of the above exemplary embodiments may be realized by dedicated hardware, or may be configured by means of memory and a DSP (digital signal processor) or other
computation and processing device. On the other hand, the functions may be realized by execution of a program used to realize the steps of the method for calculating the expected bag-of-words representation, the document classification method, and the word translation acquisition method.
Moreover, a program to realize the steps of the method for calculating the expected bag-of-words representation, the document classification method, and the word translation acquisition method may be recorded on computer-readable storage media, and the program recorded on this storage media may be read and executed by a computer system to perform the method for calculating the expected bag-of-words representation, the document classification, and the word translation acquisition processing. Here, a "computer system" may include an OS, peripheral equipment, or other hardware.
Further, "computer-readable storage media" means a flexible disk,
magneto-optical disc, ROM, flash memory or other writable nonvolatile memory,
CD-ROM or other removable media, or a hard disk or other storage system incorporated within a computer system.
Further, "computer readable storage media" also includes members which hold the program for a fixed length of time, such as volatile memory (for example, DRAM (dynamic random access memory)) within a computer system serving as a server or client, when the program is transmitted via the Internet, other networks, telephone circuits, or other communication circuits.

Claims

1. A method that, given a bag-of-words representation of a full text, paragraph, sentence or word-window written in a source language, calculates an expected bag-of-words representation in a target language, comprising:
a first step in which, for a source word f in the input bag-of-words F written in the source language, a probability p(e_i|F) that the source word f is translated into a target word e in the target language is calculated by using given probabilities p(f|e) that the target word e in the target language is translated into the source word f in the source language and by using co-occurrence probabilities of two or more target words in the target language that are calculated from a corpus written in the target language (for example by clustering the target words into several topics z, and then using the probabilities p(e|z)); and
a second step in which, for each target word e in the target language, the probability that the target word e is a translation of the source word f in the input text is summed up over all source words f in the input bag-of-words F so as to denote the resulting value Σ_i p(e_i = e|F) as an expected count of the target word e, and to create a feature vector by using the expected counts of each target word e in the target language; the resulting feature vector in the target language being considered as the expected bag-of-words representation, written in the target language, that represents the input bag-of-words F.
2. A cross-lingual document classification method comprising:
a third step in which the bag-of-words representation of the input document in the source language is converted into an expected bag-of-words representation in the target language to generate a feature vector using the method of claim 1; and a fourth step in which, given a classifier trained with the bag-of-words representation of each training instance or document in the target language, the input text is classified by using the feature vector generated in the third step.
3. A cross-lingual document classification method comprising:
a fifth step in which each training instance or document in the target language is converted into an expected bag-of-words representation in the source language to generate a feature vector using the method of claim 1;
a sixth step in which a classifier in the source language is trained using the feature vectors generated in the fifth step;
a seventh step in which the input document in the source language is classified by using the classifier trained in the sixth step.
4. The cross-lingual document classification method according to claim 2, wherein the input text is split into smaller text parts including paragraphs or sentences, and for each text part, the expected bag-of-words representation in the target language is calculated, comprising:
creating one feature vector by adding all text parts' expected bag-of-words representations in the target language, such that the expected number of occurrences of a word in the translated text is calculated by summing up the expected number of occurrences of the word in each sentence translation of the input text.
5. The cross-lingual document classification method according to claim 2, wherein: the second step includes calculating a weighted feature vector by combining the expected count for a target word e with a word weight w_e by using a monotonic function g to get a word-weighted count; and
the cross-lingual document classification method includes using a classifier trained using the weighted word counts of each training instance or document in the target language, where the weighted word counts are the word counts of the target word e combined with the word weight w_e by using the function g.
6. A word translation acquisition method comprising:
an eighth step in which, for each document in the source language, an expected bag-of-words representation in the target language is calculated using the method of claim 1;
a ninth step in which expected co-occurrences of two words, a word q that is in the source language and not listed in the dictionary and a target word e that is in the target language and listed in the dictionary, in a certain type of context including a sentence, a word-window, and a modifier, are calculated by using the expected bag-of-words representation in the target language calculated in the eighth step, and the co-occurrences are used to form the context vector for q;
a tenth step in which, for each translation candidate in the target language, a context vector is calculated using the co-occurrence frequencies between the translation candidate and a target word e that is in the target language and listed in the dictionary, using the target language corpus and the same type of context including a sentence context; and
an eleventh step in which each translation candidate is scored by comparing the context vector of the query word q with the context vector of each translation candidate.
7. The method according to claim 1, wherein the word translation probabilities p(f | e) are estimated using a bilingual dictionary and a pair of comparable corpora including a source and a target language corpus by:
a twelfth step in which each document in the target language corpus is clustered into one or more topics, and a probability p(e | z) that a target word e is generated from a topic z is learned;
a thirteenth step in which, using the probabilities p(e | z) and integrating over the topics z, the probabilities p(f | e) are found by maximizing the probability of observing the source language corpus.
8. A cross-lingual document classification method according to claim 2, comprising:
a fourteenth step in which all documents in the target language are clustered into several topics, and the topic distributions p(e | z) are learned;
a fifteenth step in which, using the topic distribution for each training instance (a target language document) as a feature vector, a classifier in the target language is trained; and a sixteenth step in which, using the translation probabilities p(f | e) and the topic distribution p(e | z), the topic distribution for an input document in the source language is calculated and used to classify the input document with the classifier trained in the fifteenth step.
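
The following Python sketch (not part of the claims; all names are illustrative) shows one way the first and second steps of claim 1 could be realized. It assumes that the topical context of the input bag F is summarized by a single topic mixture q(z) estimated from the dictionary translations of all words in F, and that p(e | f, F) is taken proportional to p(f | e) multiplied by the topic-smoothed probability of e; the claims themselves leave the exact combination open.

    from collections import defaultdict

    def expected_bag_of_words(source_bag, p_f_given_e, p_e_given_z, num_topics):
        # source_bag  : dict {f: count of f in the input bag F}
        # p_f_given_e : dict {(f, e): p(f | e)} from the bilingual dictionary
        # p_e_given_z : dict {(e, z): p(e | z)} learned from the target-language corpus
        # returns     : dict {e: expected count of e in the translation of F}

        # Illustrative assumption: summarize the topical context of F by a topic
        # mixture q(z), estimated from the dictionary translations of all words in F.
        q_z = [0.0] * num_topics
        for f, cnt in source_bag.items():
            for (ff, e), p_fe in p_f_given_e.items():
                if ff == f:
                    for z in range(num_topics):
                        q_z[z] += cnt * p_fe * p_e_given_z.get((e, z), 0.0)
        total = sum(q_z) or 1.0
        q_z = [v / total for v in q_z]

        expected = defaultdict(float)
        for f, cnt in source_bag.items():
            # First step: p(e | f, F) taken proportional to
            # p(f | e) * sum_z p(e | z) q(z), normalized over the candidates e of f.
            scores = {}
            for (ff, e), p_fe in p_f_given_e.items():
                if ff == f:
                    context = sum(p_e_given_z.get((e, z), 0.0) * q_z[z]
                                  for z in range(num_topics))
                    scores[e] = p_fe * context
            norm = sum(scores.values())
            if norm > 0.0:
                # Second step: sum p(e | f, F) over all occurrences of all f in F.
                for e, s in scores.items():
                    expected[e] += cnt * s / norm
        return dict(expected)

Fixing an ordering of the target vocabulary turns the returned expected counts directly into the feature vector of the second step.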
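
A companion sketch for claims 2 and 5, assuming scikit-learn is available; the word weights, the choice of the monotonic function g, and the toy training data are illustrative only.

    import math
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def weight_counts(counts, word_weights, g=math.sqrt):
        # Claim 5 sketch: combine an expected count with a word weight w_e via a
        # monotonic function g (here, illustratively, the square root of the product).
        return {e: g(c * word_weights.get(e, 1.0)) for e, c in counts.items()}

    # Claim 2 sketch (illustrative data): train on target-language bags, then
    # classify a source-language document via its expected bag-of-words vector.
    train_bags = [{"price": 3.0, "increase": 1.0}, {"soccer": 2.0, "goal": 1.0}]
    train_labels = ["economy", "sports"]
    word_weights = {"price": 2.0, "soccer": 2.0}   # e.g. idf-style weights

    vec = DictVectorizer()
    clf = LogisticRegression()
    X = vec.fit_transform([weight_counts(b, word_weights) for b in train_bags])
    clf.fit(X, train_labels)

    # `expected` would come from expected_bag_of_words(...) in the sketch above.
    expected = {"price": 0.8, "increase": 0.3}
    print(clf.predict(vec.transform([weight_counts(expected, word_weights)])))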
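
For claim 6, the comparison of context vectors (eleventh step) can be illustrated with a simple cosine similarity; the vectors, shown here as hand-written dictionaries, would in practice be built from the expected co-occurrences of the ninth step and the corpus co-occurrences of the tenth step.

    import math

    def cosine(u, v):
        # Cosine similarity between two sparse context vectors (dicts keyed by
        # dictionary words e, valued by (expected) co-occurrence counts).
        dot = sum(u[k] * v.get(k, 0.0) for k in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def rank_candidates(query_context, candidate_contexts):
        # Claim 6 sketch (eleventh step): score each translation candidate by
        # comparing its target-language context vector with the context vector
        # of the query word q.
        return sorted(candidate_contexts.items(),
                      key=lambda kv: cosine(query_context, kv[1]),
                      reverse=True)

    # Illustrative vectors (same-sentence co-occurrence counts):
    q_context = {"voltage": 2.4, "circuit": 1.1}
    candidates = {"capacitor": {"voltage": 5.0, "circuit": 3.0},
                  "referee":   {"goal": 4.0, "match": 2.0}}
    print(rank_candidates(q_context, candidates))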
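
Finally, a sketch for claim 8, under the simplifying assumption that the topic distribution of a document is obtained by a normalized fold-in of its word counts against p(e | z) rather than by full posterior inference; for a source-language input, the expected counts from the first sketch stand in for the target word counts.

    def topic_distribution(counts, p_e_given_z, num_topics):
        # Claim 8 sketch: approximate p(z | document) from word counts and p(e | z).
        # Assumption: a normalized fold-in (sum_e count(e) * p(e | z)) replaces
        # full posterior inference over topics.
        weights = [sum(c * p_e_given_z.get((e, z), 0.0) for e, c in counts.items())
                   for z in range(num_topics)]
        total = sum(weights) or 1.0
        return [w / total for w in weights]

The resulting fixed-length vector replaces the much higher-dimensional bag-of-words vector as the feature vector of the fifteenth and sixteenth steps.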

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2013/082514 WO2015079591A1 (en) 2013-11-27 2013-11-27 Crosslingual text classification method using expected frequencies

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2013/082514 WO2015079591A1 (en) 2013-11-27 2013-11-27 Crosslingual text classification method using expected frequencies

Publications (1)

Publication Number Publication Date
WO2015079591A1 true WO2015079591A1 (en) 2015-06-04

Family

ID=53198575

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2013/082514 WO2015079591A1 (en) 2013-11-27 2013-11-27 Crosslingual text classification method using expected frequencies

Country Status (1)

Country Link
WO (1) WO2015079591A1 (en)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Y. ISHIKAWA ET AL.: "Daikibo tangengo corpus to kihongo taiyaku jisho wo mochiita senmonyougo no yakugo kakutoku [Acquisition of translations for technical terms using a large-scale monolingual corpus and a basic bilingual dictionary]", PROCEEDINGS OF THE SIXTEENTH ANNUAL MEETING OF THE ASSOCIATION FOR NATURAL LANGUAGE PROCESSING, THE ASSOCIATION FOR NATURAL LANGUAGE PROCESSING, pages 700 - 703 *
Y. MATSUMOTO ET AL.: "Software Toolbox for Research Activity (1) Japanese Sentence Analysis by ChaSen and CaboCha - Using Syntactic Information for Sentence Role Classification", JOURNAL OF THE JAPANESE SOCIETY FOR ARTIFICIAL INTELLIGENCE, vol. 19, no. 3, pages 334 - 339 *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704456A (en) * 2016-08-09 2018-02-16 松下知识产权经营株式会社 Identify control method and identification control device
CN107704456B (en) * 2016-08-09 2023-08-29 松下知识产权经营株式会社 Identification control method and identification control device
CN108846033B (en) * 2018-05-28 2022-04-08 北京邮电大学 Method and device for discovering specific domain vocabulary and training classifier
CN108846033A (en) * 2018-05-28 2018-11-20 北京邮电大学 The discovery and classifier training method and apparatus of specific area vocabulary
CN109766545A (en) * 2018-12-24 2019-05-17 中国科学院合肥物质科学研究院 Text similarity computing method based on multi-feature fusion
CN109766545B (en) * 2018-12-24 2022-11-18 中国科学院合肥物质科学研究院 Text similarity calculation method based on multi-feature fusion
CN110414229A (en) * 2019-03-29 2019-11-05 腾讯科技(深圳)有限公司 Operational order detection method, device, computer equipment and storage medium
CN110414229B (en) * 2019-03-29 2023-12-12 腾讯科技(深圳)有限公司 Operation command detection method, device, computer equipment and storage medium
CN110298046A (en) * 2019-07-03 2019-10-01 科大讯飞股份有限公司 A kind of translation model training method, text interpretation method and relevant apparatus
CN110298046B (en) * 2019-07-03 2023-04-07 科大讯飞股份有限公司 Translation model training method, text translation method and related device
CN110532544A (en) * 2019-07-18 2019-12-03 中央民族大学 Low-resource text tour field construction of knowledge base method and system
CN110532544B (en) * 2019-07-18 2023-03-24 中央民族大学 Method and system for constructing low-resource word tourism field knowledge base
CN111553168A (en) * 2020-05-09 2020-08-18 识因智能科技(北京)有限公司 Bilingual short text matching method
CN112036485A (en) * 2020-08-31 2020-12-04 平安科技(深圳)有限公司 Method and device for topic classification and computer equipment
CN112036485B (en) * 2020-08-31 2023-10-24 平安科技(深圳)有限公司 Method, device and computer equipment for classifying topics
CN113032559A (en) * 2021-03-15 2021-06-25 新疆大学 Language model fine-tuning method for low-resource adhesion language text classification
CN113032559B (en) * 2021-03-15 2023-04-28 新疆大学 Language model fine tuning method for low-resource adhesive language text classification
CN113032565A (en) * 2021-03-23 2021-06-25 复旦大学 Cross-language supervision-based superior-inferior relation detection method
CN115544971A (en) * 2022-09-21 2022-12-30 中国科学院地理科学与资源研究所 Ancient climate reconstruction data processing method and device
CN115374779A (en) * 2022-10-25 2022-11-22 北京海天瑞声科技股份有限公司 Text language identification method, device, equipment and medium
CN115374779B (en) * 2022-10-25 2023-01-10 北京海天瑞声科技股份有限公司 Text language identification method, device, equipment and medium

Similar Documents

Publication Publication Date Title
WO2015079591A1 (en) Crosslingual text classification method using expected frequencies
Mikolov et al. Advances in pre-training distributed word representations
Potthast et al. Cross-language plagiarism detection
US8543563B1 (en) Domain adaptation for query translation
Brychcín et al. Uwb: Machine learning approach to aspect-based sentiment analysis
US20150006157A1 (en) Term synonym acquisition method and term synonym acquisition apparatus
Zouaghi et al. Combination of information retrieval methods with LESK algorithm for Arabic word sense disambiguation
Hasler et al. Dynamic topic adaptation for phrase-based mt
US20140350914A1 (en) Term translation acquisition method and term translation acquisition apparatus
Ni et al. Neural cross-lingual relation extraction based on bilingual word embedding mapping
Chen et al. Dynamically supporting unexplored domains in conversational interactions by enriching semantics with neural word embeddings
Fallgren et al. Towards a standard dataset of Swedish word vectors
Le-Hong et al. Using dependency analysis to improve question classification
Chowdhury et al. Drug-drug interaction extraction using composite kernels
de Melo et al. Constructing and utilizing wordnets using statistical methods
Konkol Uwb at semeval-2016 task 11: Exploring features for complex word identification
Chen et al. On the benefit of incorporating external features in a neural architecture for answer sentence selection
Apidianaki From word types to tokens and back: A survey of approaches to word meaning representation and interpretation
Boulares et al. Learning sign language machine translation based on elastic net regularization and latent semantic analysis
Sever et al. Evaluating cross-lingual textual similarity on dictionary alignment problem
Li et al. Computational linguistics literature and citations oriented citation linkage, classification and summarization
Klapaftis et al. Evaluating word sense induction and disambiguation methods
Nguyen et al. Text normalization for named entity recognition in Vietnamese tweets
Hong et al. Cross-lingual event-centered news clustering based on elements semantic correlations of different news
Bernhard Adding dialectal lexicalisations to linked open data resources: The example of Alsatian

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 13898374

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 13898374

Country of ref document: EP

Kind code of ref document: A1