WO2015079591A1 - Crosslingual text classification method using expected frequencies - Google Patents

Crosslingual text classification method using expected frequencies

Info

Publication number
WO2015079591A1
Authority
WO
WIPO (PCT)
Application number
PCT/JP2013/082514
Other languages
French (fr)
Inventor
Silva Daniel Georg Andrade
Kai Ishikawa
Hironori Mizuguchi
Takashi Onishi
Original Assignee
Nec Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nec Corporation filed Critical Nec Corporation
Priority to PCT/JP2013/082514 priority Critical patent/WO2015079591A1/en
Publication of WO2015079591A1 publication Critical patent/WO2015079591A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/10Office automation; Time management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/42Data-driven translation
    • G06F40/44Statistical methods, e.g. probability models


Abstract

A method that, given a bag-of-words representation of a text snippet written in a source language, calculates an expected bag-of-words representation in a target language, includes: a step in which, for a source word in the input bag-of-words, a probability that the source word is translated into a target word is calculated by using given probabilities that the target word is translated into the source word and by using co-occurrence probabilities of two or more target words that are calculated from a corpus written in the target language; and a step in which the probability that the target word is a translation of the source word is summed up to denote an expected count of the target word, and to create a feature vector by using the expected counts; the resulting feature vector in the target language being considered as the expected bag-of-words representation that represents the input bag-of-words.

Description

DESCRIPTION
CROSSLINGUAL TEXT CLASSIFICATION METHOD USING EXPECTED FREQUENCIES
TECHNICAL FIELD
The present invention relates to a method that, given an input text written in a source language F, finds an expected bag-of-words representation of the input text in a target language E. The expected bag-of-words representation shows how often we expect that a word in language E occurs in the translation of the input text. The expected bag-of-words representation can be used, for example, for cross-lingual classification and translation acquisition.
BACKGROUND ART
The inventors of the present invention propose a method to translate an input text (in source language) into an expected bag-of-words representation using a bilingual dictionary. The expected bag-of-words representation shows how often we expect that a word in the target language occurs in the translation of the input text. For example, this expected bag-of-words can be used to create a feature vector for text classification: the resulting feature vector is used to classify the input by using a classifier trained on text written in the target language (often referred to as cross-lingual classification).
Another application is to use the resulting feature vector to create a context vector for a word q (source language) that is not listed in the bilingual dictionary.
Subsequently, the context vector of q can be compared to context vectors of words in the target language to find plausible translation candidates of q. For both applications, the order of the words is irrelevant, and only the expected number of times a word occurs in the translation is necessary. Therefore, no machine translation system with parallel corpora is needed, but only a bilingual dictionary.
However, a simple word-by-word translation using only a bilingual dictionary has the disadvantage of introducing words that are completely unrelated. This is due to the ambiguity of polysemic words. For example, given the text "plant worker", the word "plant" will be translated into all of the Japanese equivalents listed in the bilingual dictionary, covering both the "factory" sense and the "vegetation" sense. It is clear that in this context the "vegetation" sense is not a sensible translation and therefore should be ignored.
This problem is, for example, addressed in Non-Patent Document 2, work in the area of CLIR (Cross-lingual Information Retrieval). They suggest disambiguating a text (query) by using a target language corpus. They translate each word in the text separately and then weight each translation by taking into account the correlations of the translations in the target language corpus. For example, given the query "plant worker", the word "plant" can be translated into the Japanese words for "factory" and "vegetation", whereas the word "worker" can be translated into the Japanese word for "worker", using an ordinary bilingual dictionary. Since the degree of association between "factory" and "worker" is higher than that between "vegetation" and "worker", they give the translation "factory" more weight than the translation "vegetation". The degree of association (or correlation) of two words is calculated by using the co-occurrence frequencies of the two words in the target language corpus (here, in this example, Japanese). Their weighting scheme, however, does not take into account how often a word occurs in the query. This is reasonable, since a query normally contains only a few,
non-overlapping, words. Therefore, their method cannot be used to calculate the expected frequency of words of a (long) text which is likely to contain the same word more than once.
The method in Non-Patent Document 3 uses neither cross-lingually aligned documents, nor a machine translation system for mapping all texts into the same language. Instead, they use a bilingual dictionary that is organized into synsets (WordNet®) that are cross-lingually aligned. Furthermore, their method exploits words that have exactly the same spelling (cognates). Obviously, the latter holds only for language pairs that are closely related, like English and Italian. Each document is represented as a term vector, which can also be weighted using Inverse Document Frequencies (IDF). Additionally, if the synset id (which is language independent) of a word is available, they also add this information to the term vector. Next, the term vectors of the documents in both languages are combined into one matrix. The matrix is then decomposed using SVD to find a set of latent topics (they call them domains). For training and classification they map the term vectors into the latent topics, which can then be compared using the cosine similarity. Therefore, for classification, they use the cosine kernel. For their proposed method, they assume that the bilingual dictionary is organized into synsets (synonym sets), and choose only the monosemic (only one sense) words. To use their proposed method for polysemic words, it is necessary to use a WSD (Word Sense Disambiguation) system to find the correct sense of the word.
Non-Patent Document 1 describes an efficient method for learning word translation probabilities p(f|e) using a bilingual dictionary and a pair of comparable corpora (two corpora written in different languages which do not need to be translations of each other). Using these translation probabilities, they are able to create a word-by-word translation (FIG. 6A) of a sentence F. As the translation of F they use the best translation E*, which is:

E* = argmax_E p(E) · p(F|E)   (1)

where p(E) is calculated by using a language model of the target language. Assuming a bi-gram language model we get:

p(E) · p(F|E) = Π_i p(e_i | e_{i-1}) · p(f_i | e_i)

where f_i and e_i denote the single words occurring in F and E; the index i runs over the words in the sentence F (note that they assume that the length of the translated sentence is the same as that of F). Their method has the big advantage that no parallel corpora are needed for translation.
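For concreteness, the following is a minimal Python sketch of the kind of word-by-word decoding expressed by Equation (1) under a bi-gram language model, implemented as a simple Viterbi search. The dictionaries for the translation candidates, p(f|e) and the bigram probabilities are assumed to be given as plain Python dicts; all names are illustrative and not taken from the patent.

```python
def best_translation(f_words, candidates, p_f_given_e, p_bigram):
    """Viterbi search for E* = argmax_E prod_i p(e_i | e_{i-1}) * p(f_i | e_i).

    f_words:     list of source words (the sentence F)
    candidates:  dict source word -> list of target-word translation candidates
                 (assumed non-empty for every source word)
    p_f_given_e: dict (f, e) -> translation probability p(f|e)
    p_bigram:    dict (e_prev, e) -> bigram probability p(e|e_prev); "<s>" marks
                 the sentence start
    """
    # best[e] = (score of the best partial translation ending in e, that path)
    best = {"<s>": (1.0, [])}
    for f in f_words:
        new_best = {}
        for e in candidates[f]:
            emit = p_f_given_e.get((f, e), 0.0)
            # extend every previous hypothesis and keep the highest-scoring one
            score, path = max(
                ((s * p_bigram.get((e_prev, e), 1e-9) * emit, p + [e])
                 for e_prev, (s, p) in best.items()),
                key=lambda t: t[0],
            )
            new_best[e] = (score, path)
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]
```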
Document of the Prior Art
Non-Patent Document 1: "Estimating Word Translation Probabilities from Unrelated Monolingual Corpora Using the EM Algorithm", AAAI, 2000.
Non-Patent Document 2: "Using mutual information to resolve query translation ambiguities and query term weighting", ACL, 1999.
Non-Patent Document 3: "Exploiting Comparable Corpora and Bilingual Dictionaries for Cross-Language Text Categorization", ACL, 2006.
DISCLOSURE OF INVENTION
Problems to be Solved by the Invention
A straight-forward solution to create a feature vector, or context vector, in the target language, is to first translate the input text into the target language using a machine translation system; and in the second step, to create a feature vector, or context vector, using the translated text. The feature vector can then, for example, be used to classify the input text using the classifier trained with the training data in the target language.
However, in general the use of a machine translation system is expensive, since either translation rules need to be manually created, or a large collection of parallel corpora is necessary. Furthermore, a method like that of Non-Patent Document 3 has the disadvantage of requiring additional resources like a WSD system, or of requiring that the words in the source and target languages are organized into senses as in WordNet®.
Using the method described in Non-Patent Document 1, we can find a translation of the input text without the need for expensive resources like parallel corpora. The idea is displayed in FIG. 1, where Component 10 corresponds to the method described in Non-Patent Document 1.
However, we note that their method assumes that the word order in the two languages is the same. This is obviously not the case for language pairs like English and Japanese. We therefore show (in the fourth embodiment) how translation probabilities can be learned from comparable corpora where the word order is different.
Next, Component 20 counts the frequencies of each word and places them into a feature vector. Finally, the classifier in Component 30 can be any classifier, for example an SVM (support vector machine) classifier. However, it is in general disadvantageous to rely on only one translation that is then used for classification. For example, given a Japanese sentence meaning "The house was destroyed by a blaze.", plausible translations for the Japanese word for "blaze" are "blaze" and "fire". In that case Component 10 might, for example, choose the translation "blaze". The example is shown in FIG. 2. Furthermore, assume we want to classify the sentence as either belonging to class "emergency" or not. Training data instances labeled as "emergency" might contain the word "fire" (i.e., the training data shown in FIG. 2 having class "emergency" includes, for instance, "A fire broke out here at the train station.", "An earthquake broke out and fire everywhere.", and "The earthquake completely destroyed the dike."), but not contain the word "blaze". As a consequence, it is difficult to classify the text based on the uni-grams "house", "blaze" and "destroy".
A straight-forward solution is to use, for a word f in the input text, all translations that are listed in a bilingual dictionary. However, this has the disadvantage of introducing noise. For example, possible translations for the Japanese word for "destroyed" are "miss" and "destroyed", whereas it is unlikely that "miss" is an appropriate translation in this context.
Means for Solving the Problem
For creating the feature vector of an input text, we do not rely on one translation of the input text. Instead, we (implicitly) generate all translations, count for each translation the word occurrences, and then combine the counts by using the probability that the translation is correct. Mathematically speaking, we use the expected word counts with respect to the probability distribution over the possibly correct translations.
In order to make the method independent of a machine translation system, we suggest to use word-by-word translations using, for example, a given bilingual dictionary. We assume that we have access to word-by-word translation probabilities and denote these probabilities as p(f|e). Note that these translation probabilities are independent of the context of the input text. For example, they are simply calculated by assuming a uniform distribution over all the translations for a word e that are listed in a bilingual dictionary.
However, as mentioned before, these translation probabilities change depending on the context of the input text. In order to calculate word translation probabilities that consider the context of the input text, we additionally use the co-occurrence probabilities of two words e_i and e_j, denoted as p(e_i|e_j), calculated from a corpus written in the target language. (We use these co-occurrence probabilities either directly, i.e., by calculating p(e_i|e_j), or indirectly, by defining document clusters as described in the third exemplary embodiment.)
This way, for the i-th word in the input text F written in the source language, we can calculate the probability that this word is translated into a word e in the target language, given the context of the input text F. We denote this translation probability as p(e_i|F). By summing up over all positions i the probability that word e is a translation of the i-th word in the input text, i.e., Σ_i p(e_i = e|F), we can calculate the expected count of a word e. Using these expected counts, we are able to create a feature vector, which is in the word-vector space of the target language.
Effect of the Invention
The present invention has the effect of allowing word features in the target language to be extracted from a text written in the source language, without relying on a machine translation system and parallel corpora, but nevertheless being able to use the context of the input text to create an appropriate feature vector that can be used for text
classification, or to find word translations.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram showing the functional structure of a baseline system that is a trivial combination of previous work.
FIG. 2 illustrates the problem of the baseline system of the previous work.
FIG. 3 is a block diagram showing the functional structure of a system according to a first exemplary embodiment of the present invention.
FIG. 4 is a block diagram showing the functional structure of a system according to a second exemplary embodiment of the present invention.
FIGS. 5A to 5C show the data and the probabilities that are used for the examples explained in the first exemplary embodiment.
FIGS. 6A, 6B and 6C show the results of the examples, using a word-by-word translation, a baseline method and the proposed method according to the first exemplary embodiment, respectively.
FIG. 7 is a graphical model using a plate notation according to embodiments of the present invention.
FIG. 8 is a block diagram showing the functional structure of the system according to a fourth exemplary embodiment of the present invention.
FIG. 9 is a block diagram showing a method for estimating translation probabilities according to the fourth exemplary embodiment of the present invention.
FIG. 10 is a block diagram showing the functional structure of the system according to a related art.
FIG. 11 is a block diagram showing the functional structure of the system according to a fifth exemplary embodiment of the present invention.
EXEMPLARY EMBODIMENTS FOR CARRYING OUT THE INVENTION
<First Exemplary Embodiment>
First, we describe the proposed method for calculating the expected
bag-of-words representation of the translation of an input text, in the context of a cross-lingual text classification system. Often the most important features that are used for classification are uni-grams (single terms, or phrases).
Using uni-grams as features, we can represent a text T as a column vector where dimension r contains the number of times word w_r occurs in text T. Formally, we denote the number of times word w occurs in text T by count_T(w). We denote the language of the text we want to classify as the source language, and the language of the texts in the training data as the target language. If text T is in the source language we write F, if the text is in the target language we write E. We denote the i-th word of F as f_i. Furthermore, we denote as e_i the word in E that is the translation of f_i. Note that, for simplicity, we make the assumption that one word in F is translated into exactly one word in E. As a consequence the lengths of E and F are the same.
Furthermore, we assume that we have the translation probabilities p(f|e) given, corresponding to Resource 5 in FIG. 3. FIG. 3 shows a system, usually performed by a computer system, for calculating the expected bag-of-words representation in the target language. These word translation probabilities can be calculated, for example, using the method described in Non-Patent Document 1. Or, as an even simpler method, using an existing bilingual dictionary and assuming uniform translation probabilities over all translations listed for a word e. In cases where we need the translation probabilities p(e|f), we will simply assume p(e|f) ∝ p(f|e). Alternatively, we could also calculate them using Non-Patent Document 1. The set of all translations of a word f is denoted as φ(f), which we simply define as φ(f) := {e ∈ Ω | p(f|e) > 0}, where Ω is the set of words that occur in the training corpus (Training Data 25 in FIG. 3). The training data 25 is stored in a non-transitory computer storage medium such as a hard disk drive or a semiconductor memory.
Instead of using only one translation E, we implicitly generate all translations and weight them by the probability of each translation. More formally, instead of using count_E(e), we use the expected number of word occurrences, denoted by E[count_E(e)|F], as features. When we use a simple uni-gram language model in the source language we get:

E[count_E(e)|F] = Σ_{i=1}^{n} p(e_i = e | f_i)   (2)

where we might write F as (f_1, ..., f_n), f_i is the i-th word in F, and n is the number of words in F.
However, Equation (2) assumes that the translation of the word at position i in F depends only on f_i. That means:

p(e_i | F) = p(e_i | f_i)
We will relax this assumption by using the following assumption:

p(e_i | F, e_j) = p(e_i | f_i, e_j)   (3)

where e_j is the translation of word f_j, for j ∈ {1, 2, ..., i-1, i+1, ..., n}. In words, the assumption means that the probability that f_i is translated into e_i depends only on f_i and the j-th word in E. The word e_j can be interpreted as the word that we consider as important context information, and which helps us in finding an appropriate translation e_i. We do not make any restrictions on j ∈ {1, 2, ..., i-1, i+1, ..., n}. A possible choice is to select it randomly, or to choose j = i - 1, that is, the word previous to e_i. In case we do not want to restrict our choice to one j, we can average over several choices by using:

p(e_i | F) = Σ_{j ∈ {1, 2, ..., i-1, i+1, ..., n}} p_i(j) · p(e_i | F, j)   (4)

where p(e_i | F, j) is the model that uses the assumption that the j-th position in E is the important context information which helps us in finding translation e_i, and p_i(j) is the probability that position j is important. We might, for example, set p_i(j) such that it puts high weight on the words surrounding e_i, like:

p_i(j) = 1/2 if j ∈ {i-1, i+1}, and 0 else.
Using the assumption of Equation (3), we can calculate p(e_i | F, j) as follows:

p(e_i | F, j) = Σ_{e_j} p(e_i | F, e_j) · p(e_j | F) = Σ_{e_j} p(e_i | f_i, e_j) · p(e_j | F)   (5)

Furthermore, we assume that we can factorize the probability p(e_i, f_i, e_j) as follows:

p(e_i, f_i, e_j) = p(e_j) · p(e_i | e_j) · p(f_i | e_i)

where p(e_j) is the probability that word e_j occurs at position j of a text in the target language; recall that p(f_i | e_i) is the probability that word e_i is translated into f_i, and p(e_i | e_j) is the probability that word e_i occurs at position i of text T, if we know that word e_j occurs in T at position j. For example, we can choose for p(e_i | e_j), assuming j < i, the probability that a word e_i occurs in the target language if we know that word e_j occurred i - j words before; for j = i - 1 this is equal to the Markov assumption.
Another choice for p(e_i | e_j) is, for example, to use the probability that word e_i occurs in a text (or sentence), if we know that in this text (or sentence) the word e_j occurs.
Note that this factorization corresponds to the graphical model e_j -> e_i -> f_i. Therefore, it follows that:

p(e_i | f_i, e_j) = p(f_i | e_i) · p(e_i | e_j) / Σ_{e'} p(f_i | e') · p(e' | e_j)
Finally, combining the above formula with Equations (5) and (4), we get:

p(e_i | F) = Σ_{j ∈ {1, ..., i-1, i+1, ..., n}} p_i(j) · Σ_{e_j} [ p(f_i | e_i) · p(e_i | e_j) / Σ_{e'} p(f_i | e') · p(e' | e_j) ] · p(e_j | F)   (6)

Note that this formula is, in general, recursive, since we need p(e_j | F) to calculate p(e_i | F). One solution is to use the EM algorithm, where initially we set p(e_j | F) = p(e_j | f_j) for all j.
Another solution is, for example, to assume that p(e_1 | F) = p(e_1 | f_1) and then, for i > 1, to make e_i only dependent on e_{i-1}. That means setting, in formula (6), p_i(j) to 1 for j = i - 1, and to 0 otherwise. This results in the simplified formula (for i > 1):

p(e_i | F) = Σ_{e_{i-1}} [ p(f_i | e_i) · p(e_i | e_{i-1}) / Σ_{e'} p(f_i | e') · p(e' | e_{i-1}) ] · p(e_{i-1} | F)   (7)
As before, we assume that each word f_i is translated into e_i conditionally independently given text F. Furthermore, we also keep the assumption that one word in F is translated into exactly one word in E. Therefore, the translation of f_i into e_i is modeled as a Bernoulli distribution with probability p(e_i | F). Using the above assumptions we can calculate the expected frequency of a word e in text E:

E[count_E(e) | F] = Σ_{i=1}^{n} E[e_i = e | F] = Σ_{i=1}^{n} p(e_i = e | F)

where p(e_i = e | F) is calculated as in Equation (6).
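As an illustration of how the simplified chain model of Equation (7) and the expected counts can be computed, the following is a minimal Python sketch. It assumes the translation probabilities p(f|e), the per-word translation candidates, and the target-language co-occurrence probabilities p(e|e') are available as plain dictionaries; all function and variable names are illustrative, not part of the patent.

```python
def translation_distributions(f_words, candidates, p_f_given_e, p_cooc):
    """p(e_i|F) under the simplified chain model of Equation (7):
    p(e_1|F) = p(e_1|f_1), and for i > 1 each e_i depends only on e_{i-1}."""
    # first word: p(e|f_1) obtained by renormalising p(f_1|e), as in the text
    first = {e: p_f_given_e.get((f_words[0], e), 0.0) for e in candidates[f_words[0]]}
    z = sum(first.values()) or 1.0
    dists = [{e: v / z for e, v in first.items()}]

    for i in range(1, len(f_words)):
        f, dist = f_words[i], {}
        for e_prev, w_prev in dists[-1].items():
            # p(e | f_i, e_prev) is proportional to p(f_i|e) * p(e|e_prev)
            scores = {e: p_f_given_e.get((f, e), 0.0) * p_cooc.get((e, e_prev), 0.0)
                      for e in candidates[f]}
            z = sum(scores.values()) or 1.0
            for e, s in scores.items():
                dist[e] = dist.get(e, 0.0) + w_prev * s / z
        dists.append(dist)
    return dists


def expected_counts(dists):
    """E[count_E(e)|F] = sum_i p(e_i = e|F)."""
    counts = {}
    for dist in dists:
        for e, p in dist.items():
            counts[e] = counts.get(e, 0.0) + p
    return counts
```

The dictionary returned by expected_counts can be used directly as the feature vector in the word-vector space of the target language.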
Finally, we give an example using the model given in Equation (7), which demonstrates how our invention solves the problem of translation disambiguation for expected word counts. We assume the source language is Japanese, and the target language is English. We assume we want to classify a Japanese sentence meaning "The house was destroyed by a blaze". For simplicity we ignore functional words, which leaves us with a representation F that contains only the content words, i.e., the Japanese words corresponding to "house", "fire/blaze" and "destroyed". All the necessary probabilities are listed in FIGS. 5A, 5B and 5C. In FIG. 5C, we expect that "fire" and "blaze" often co-occur with "destroy", but not with "miss". Starting the calculation from the leftmost word to the rightmost word in F, we first obtain the translation probabilities for the first word and, in the next step, those for the second word, using the probabilities of FIGS. 5A to 5C. Finally, we calculate the translation probabilities for the rightmost word with Equation (7): p(miss|F) turns out to be small and, analogously, we get:

p(destroy|F) = 0.89
The result is shown in FIG. 6C as "Using the proposed method". Since our proposed method uses the co-occurrence information of the word "miss" with "fire" and of "miss" with "blaze", it is able to infer that the probability of the translation "miss" is small, and therefore the expected count is also small. As shown in FIG. 6C, it is possible to disambiguate the context and to obtain a better estimate of the expected counts for "destroy". It is therefore able to remedy the problem of the baseline system, which is not able to disambiguate between the translations "miss" and "destroy", as shown in FIG. 6B.
Concerning the application to cross-lingual classification, we note that it can be advantageous to calculate the expected bag-of-words representation of each training instance in the source language, i.e., the other way round from what is described above: First, for each training instance (in the target language) the expected bag-of-words representation in the source language is calculated and used to train a classifier in the source language. In the second step, the input text (in the source language) is classified by using the classifier from the previous step.
<Second Exemplary Embodiment>
As before, let us denote by count_E(e) the frequency of a word e in text E. In the first exemplary embodiment, we used the frequency count_E(e) for training and the expected frequency E[count_E(e)|F] for classification. However, it is well known that, instead of using the word frequencies directly, weighting the word frequencies, for example by the inverse document frequency (IDF), results in higher classification accuracy. We assume that these weights are given, as shown in Resource 35 of FIG. 4. FIG. 4 shows a system of the second exemplary embodiment, usually performed by a computer system. A non-transitory computer storage medium such as a hard disk drive or a semiconductor memory is used for the resource 35 for storing weight data. For example, using IDF, we can calculate a weight for word e, denoted by w_e, as follows:

w_e = log(N / df_e)

where N is the total number of documents in corpus 15 (the same as in FIG. 3), and df_e is the number of documents that contain word e in corpus 15.
We then weight the frequency counts by the weight w_e using:

g(count_E(e), w_e)

where g is a function for combining the frequency counts and the weight w_e. For example, using multiplication we get:

g(count_E(e), w_e) = count_E(e) · w_e

Instead of using the raw (expected) word frequencies, we suggest to use the weighted word frequency g(count_E(e), w_e) for training and the weighted expected word frequency g(E[count_E(e)|F], w_e) for classification.
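A small sketch of this weighting step follows, assuming the expected counts from the first embodiment and a document-frequency table are available as dictionaries, and taking g to be multiplication as in the example above; all names are illustrative.

```python
import math

def idf_weights(doc_freq, n_docs):
    """w_e = log(N / df_e) for every word e with a positive document frequency."""
    return {e: math.log(n_docs / df) for e, df in doc_freq.items() if df > 0}

def weighted_feature_vector(expected_counts, weights):
    """g(E[count_E(e)|F], w_e) with g chosen as multiplication."""
    return {e: c * weights.get(e, 0.0) for e, c in expected_counts.items()}
```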
<Third Exemplary Embodiment>
Here we show another possible method for calculating the expected frequency count of target words, given a text (document) in the source language. The method described here has the advantage that co-occurrences of more than two words in a text can be used. Furthermore, the word order is irrelevant, which makes the method applicable also to language pairs like English and Japanese.
The graphical model is depicted in FIG. 7 using plate notation. z ∈ {z_1, z_2, ..., z_k} denotes a topic, according to which the English words (target language words) are generated using a categorical distribution with parameters p(e|z). We denote the set of topics {z_1, z_2, ..., z_k} as Z. An English word e and a Japanese word (source language word) f are generated for each word in a document d. The part of the graphical model above node e classifies a bag-of-words (document) in English into one of k topics. For this task, we could also use a different generative model, like LDA (Latent Dirichlet Allocation). For illustration, we use the simpler method described here, which can be considered as a kind of Naive Bayes clustering.
For each English word e, a Japanese word f is generated using the probability p(f|e). We assume here that p(f|e) is given. It can, for example, be learned by using a bilingual dictionary and the method described in the fourth exemplary embodiment. We therefore assume that e is from a fixed vocabulary V (a set of words in English), and f is from a fixed vocabulary W (a set of words in Japanese). First, in the training phase, we learn the parameters p(e|z) and p(z) by using a document collection in English, denoted as D := {d_1, ..., d_m}. Ideally, this English collection contains a subset of documents that are of similar topic. For example, assuming we want to translate a Tweet in Japanese, we should choose a large collection of English Tweets for this document collection. The parameters p(e|z) and p(z) can, for example, be learned using the EM algorithm, as described in the following. Let p^t(e|z) and p^t(z) denote the estimates of p(e|z) and p(z) at step t, respectively. Furthermore, let Θ^t denote the set of parameters {p^t(e|z), p^t(z)}.
The expected number of times that we observe word e together with topic z is denoted as E[n(e, z)] and is calculated as follows:

E[n(e, z)] = Σ_{d ∈ D} n(d, e) · p(z | Θ^t, e_1, ..., e_{n_d})

where n(d, e) is the number of times word e occurs in document d, n_d is the total number of words in document d, and e_1, ..., e_{n_d} are the words that occur in document d. The probability p(z | Θ^t, e_1, ..., e_{n_d}) is calculated as follows:

p(z | Θ^t, e_1, ..., e_{n_d}) ∝ p^t(z) · Π_{i ∈ {1, ..., n_d}} p^t(e_i | z)

where the normalization constant is Σ_z p(z, e_1, ..., e_{n_d} | Θ^t).
The expected number of times that we observe topic z is denoted as E[n(z)] and is calculated as follows:

E[n(z)] = Σ_e E[n(e, z)] = Σ_{d ∈ D} n_d · p(z | Θ^t, e_1, ..., e_{n_d})

The M-step calculates the new parameters Θ^{t+1} using the expected counts:

p^{t+1}(e | z) = E[n(e, z)] / E[n(z)]

p^{t+1}(z) = E[n(z)] / Σ_{z'} E[n(z')]

This way, we can learn the parameters p(e|z) and p(z).
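A compact sketch of the E- and M-steps described above for learning p(e|z) and p(z) from the English document collection D is given below. Documents are assumed to be bags of words (dicts of counts); the random initialisation and the iteration count are illustrative choices, not prescribed by the text.

```python
import math
import random

def train_topic_model(docs, topics, vocab, iters=50, seed=0):
    """EM for the Naive-Bayes-style clustering of FIG. 7.

    docs:   list of dicts {target word e: count n(d, e)}
    topics: list of topic ids z_1 .. z_k
    vocab:  list of target-language words
    Returns (p_z, p_e_given_z).
    """
    rng = random.Random(seed)
    p_z = {z: 1.0 / len(topics) for z in topics}
    p_e_z = {z: {e: rng.random() for e in vocab} for z in topics}
    for z in topics:                                   # normalise the random start
        s = sum(p_e_z[z].values())
        p_e_z[z] = {e: v / s for e, v in p_e_z[z].items()}

    for _ in range(iters):
        exp_ez = {z: {e: 0.0 for e in vocab} for z in topics}   # E[n(e, z)]
        exp_z = {z: 0.0 for z in topics}                        # E[n(z)]
        for d in docs:
            # E-step: p(z | Theta, e_1..e_nd) proportional to p(z) * prod_i p(e_i|z)
            log_post = {}
            for z in topics:
                lp = math.log(p_z[z])
                for e, n in d.items():
                    lp += n * math.log(p_e_z[z][e] + 1e-12)
                log_post[z] = lp
            m = max(log_post.values())
            post = {z: math.exp(v - m) for z, v in log_post.items()}
            s = sum(post.values())
            post = {z: v / s for z, v in post.items()}
            n_d = sum(d.values())
            for z in topics:
                exp_z[z] += n_d * post[z]
                for e, n in d.items():
                    exp_ez[z][e] += n * post[z]
        # M-step: p(e|z) = E[n(e,z)] / E[n(z)],  p(z) = E[n(z)] / sum_z' E[n(z')]
        total = sum(exp_z.values())
        p_z = {z: exp_z[z] / total for z in topics}
        p_e_z = {z: {e: exp_ez[z][e] / (exp_z[z] + 1e-12) for e in vocab}
                 for z in topics}
    return p_z, p_e_z
```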
In the second step, we use the trained model to calculate the expected count E[count_E(e)|F] given a Japanese (source language) text F. We assume that for each word f_i in F there is an English word e_i. We denote the words in F by f_1, ..., f_{n_a} (we use the index a in n_a, since in later embodiments we will use the letter a to denote a document in the source language, instead of using the letter F).

E[count_E(e) | F] = Σ_{i=1}^{n_a} E[e_i = e | F] = Σ_{i=1}^{n_a} p(e_i = e | f_1, ..., f_{n_a})

where p(e_i = e | f_1, ..., f_{n_a}) is calculated as follows:

p(e_i = e | f_1, ..., f_{n_a}) ∝ Σ_{z ∈ Z} p(z) · p(e|z) · p(f_i|e) · Π_{j ≠ i} Σ_{e' ∈ V} p(e'|z) · p(f_j|e')

where the normalization constant is Σ_{e' ∈ V} p(e_i = e', f_1, ..., f_{n_a}).
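A sketch of this second step follows: it computes p(e_i = e | f_1, ..., f_{n_a}) and the expected counts from the trained topic model and p(f|e). It follows the factorisation written out above, and all container names are assumptions for illustration.

```python
def expected_counts_topic(f_words, p_z, p_e_z, p_f_e, vocab_e):
    """E[count_E(e)|F] = sum_i p(e_i = e | f_1..f_na) under the FIG. 7 model."""
    counts = {}
    # per-position, per-topic marginals: sum_e' p(e'|z) * p(f_j|e')
    word_z = [{z: sum(p_e_z[z][e] * p_f_e.get((f, e), 0.0) for e in vocab_e)
               for z in p_z} for f in f_words]
    for i, f in enumerate(f_words):
        # product over all other positions, kept separately per topic
        rest = {z: 1.0 for z in p_z}
        for j, wz in enumerate(word_z):
            if j != i:
                for z in p_z:
                    rest[z] *= wz[z]
        # unnormalised p(e_i = e, f_1..f_na)
        joint = {e: sum(p_z[z] * p_e_z[z][e] * p_f_e.get((f, e), 0.0) * rest[z]
                        for z in p_z)
                 for e in vocab_e}
        norm = sum(joint.values()) or 1.0
        for e, p in joint.items():
            counts[e] = counts.get(e, 0.0) + p / norm
    return counts
```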
Finally, we note that the topic distribution over z can also be used as features for classifying a source document. This can be achieved as follows:
First, for each training instance (a target language document), calculate the probability p(z | e_1, ..., e_{n_d}) and use this probability as a feature vector to train a classifier.
Second, for an input document F, calculate the probability p(z | f_1, ..., f_{n_a}) and use this probability to classify document F with the classifier trained in the previous step.
<Fourth Exemplary Embodiment>
In the previous embodiments, we assumed that the translation probabilities p(f|e) are given; for example, they can be calculated in advance with a method like the one described in Non-Patent Document 1. However, as we noted before, the method in Non-Patent Document 1 assumes that the word order in the two languages is the same. This is obviously not the case for language pairs like English and Japanese.
For the estimation of the probabilities we use the same model as in FIG. 7. We assume that the parameters p(z) and p(e|z) are estimated using the English corpus (target language corpus) D as described in the third embodiment. Furthermore, we assume we have a Japanese corpus (source language corpus), denoted as A := {a_1, ..., a_l}, which is comparable to the English corpus. For example, the two corpora can contain one year of Tweets in English and Japanese, respectively.
The translation probabilities p(f|e) can be estimated using A and the probabilities p(z) and p(e|z) as follows:

argmax_{p(f|e)} p(A | p(f|e), p(z), p(e|z)) = argmax_{p(f|e)} Π_{a ∈ A} Σ_z p(z) · Π_{i=1}^{n_a} Σ_{e_i} p(f_i | e_i) · p(e_i | z)

where f_i is the i-th word in (the bag of words of) document a, and document a contains 1, ..., n_a words.
This term can be optimized using, for example, the EM algorithm. As an initial setting, we set p(f|e) to the uniform distribution over all translations f that are listed in the bilingual dictionary. If, in the dictionary, a source word f is not listed as a translation of e, we set p(f|e) = 0. An overview of the input and output of the method is shown in FIG. 9. As an example application, we show in FIG. 8 how the estimation of translation probabilities can be combined with the cross-lingual classification described in the previous embodiments. FIG. 8 shows a system of the fourth exemplary embodiment, usually performed by a computer system.
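A sketch of how this EM optimisation could look is given below, keeping p(z) and p(e|z) fixed and re-estimating p(f|e) restricted to the dictionary entries; the data structures and the iteration count are illustrative assumptions rather than the patent's prescribed implementation.

```python
def estimate_p_f_given_e(src_docs, p_z, p_e_z, dict_pairs, vocab_e, iters=20):
    """EM sketch: maximise p(A | p(f|e), p(z), p(e|z)) over p(f|e),
    keeping p(z) and p(e|z) fixed.

    src_docs:   list of source-language documents, each a list (bag) of words f
    dict_pairs: set of (f, e) pairs listed in the bilingual dictionary
    """
    # initialise p(f|e) uniformly over the dictionary translations of each e
    trans_of = {}
    for f, e in dict_pairs:
        trans_of.setdefault(e, []).append(f)
    p_f_e = {(f, e): 1.0 / len(fs) for e, fs in trans_of.items() for f in fs}

    for _ in range(iters):
        c = {pair: 0.0 for pair in p_f_e}          # expected counts of (f, e)
        for doc in src_docs:
            # per-position, per-topic marginals: sum_e p(f|e) p(e|z)
            word_z = [{z: sum(p_f_e.get((f, e), 0.0) * p_e_z[z][e] for e in vocab_e)
                       for z in p_z} for f in doc]
            # posterior over topics: p(z|a) proportional to p(z) * prod_i word_z[i][z]
            post_z = {}
            for z in p_z:
                p = p_z[z]
                for wz in word_z:
                    p *= wz[z]
                post_z[z] = p
            s = sum(post_z.values()) or 1.0
            post_z = {z: v / s for z, v in post_z.items()}
            # accumulate expected (f, e) counts
            for i, f in enumerate(doc):
                for z in p_z:
                    if word_z[i][z] == 0.0:
                        continue
                    for e in vocab_e:
                        num = p_f_e.get((f, e), 0.0) * p_e_z[z][e]
                        if num:
                            c[(f, e)] += post_z[z] * num / word_z[i][z]
        # M-step: renormalise per target word e
        totals = {}
        for (f, e), v in c.items():
            totals[e] = totals.get(e, 0.0) + v
        p_f_e = {(f, e): v / totals[e] for (f, e), v in c.items()
                 if totals.get(e, 0) > 0}
    return p_f_e
```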
<Fifth Exemplary Embodiment>
Here we demonstrate that the proposed invention can also be used to improve the translation acquisition of new words. We denote by q the word that is not listed in the bilingual dictionary, and for which we want to find a (new) translation.
An overview of related previous work, disclosed in, for instance, "A Statistical View on Bilingual Lexicon Extraction", P. Fung, LNCS, 1998 (Non-Patent Document 4), is given in FIG. 10. We assume we are given two comparable corpora (Resources 7 and 15), from which we extract context vectors for each word. A context vector for word w contains in each dimension the co-occurrence frequency of word w with a word w', where w' is a word listed in the bilingual dictionary. Co-occurrence frequency is, for example, defined at the sentence level, i.e., how often two words occur in the same sentence. A context vector of a word e in English (target language) cannot be compared directly with a context vector of a word f in Japanese (source language), since the dimensions are different. Therefore, previous work maps, for example, the Japanese context vector to an English context vector using the bilingual dictionary, as described in Component 28. Finally, in Component 30, we (as well as previous work) compare the translated context vector of a word q with the context vectors of all translation candidates in the target language. This can be done, for example, by converting the raw co-occurrence counts into tf-idf weights and then using the cosine similarity, as described in Non-Patent Document 4.
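A minimal sketch of the comparison in Component 30 follows, assuming context vectors are sparse dicts keyed by dictionary words; the tf-idf weighting of Non-Patent Document 4 is omitted here for brevity, and all names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse context vectors given as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(query_vec, candidate_vecs):
    """Score each target-language candidate by cosine similarity with the
    (translated) context vector of the query word q; best candidates first."""
    return sorted(((cosine(query_vec, vec), e) for e, vec in candidate_vecs.items()),
                  reverse=True)
```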
The problem of the previous work is that, when translating the context vector, fixed translation probabilities are used. That is, as described in FIG. 6B, the frequencies are translated into the target language word by word, without considering the context, that is, the co-occurrence with other words. In order to overcome this problem, we suggest to apply the probabilistic model from the third embodiment to calculate the expected co-occurrence frequencies of a context word e (target language) with the query word q (source language). The idea is sketched in FIG. 11. FIG. 11 shows a system of the fifth exemplary embodiment, usually performed by a computer system.
First, given a sentence s in a document a (a document in the source language), we calculate the probability that word e occurs one or more times in the translated sentence. We denote this probability as p(e|s, a). Let S be the set of indices of the words in sentence s, i.e., S ⊆ {1, ..., n_a}.

p(e|s, a) = 1 - Π_{i ∈ S} (1 - p(e_i = e | f_1, ..., f_{n_a}))

where f_1, ..., f_{n_a} are the words in document a, and p(e_i = e | f_1, ..., f_{n_a}) is the probability that the i-th word is translated into word e. The probability p(e_i = e | f_1, ..., f_{n_a}) is calculated as described in the third embodiment.
We denote by E[count(e, q) | a] the expected co-occurrence frequency of word e with q for a document a. We can calculate it as follows:

E[count(e, q) | a] = Σ_{s ∈ a} I_s(q) · p(e|s, a)

where I_s(q) is 1 if q occurs in sentence s, and 0 otherwise.
Finally, we can calculate the expected co-occurrence frequency over the whole source language corpus A, denoted by E[count(e, q)], as follows:

E[count(e, q)] = Σ_{a ∈ A} E[count(e, q) | a]

The expected counts E[count(e, q)] are then used to create the context vector (in the target language) for word q.
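A sketch of this expected co-occurrence computation follows. It assumes the per-position translation distributions of the third embodiment are available through a callable p_word_trans; all names are illustrative.

```python
def expected_cooccurrence(src_docs, q, p_word_trans, vocab_e):
    """E[count(e, q)] = sum over documents a and sentences s containing q of
    p(e|s, a) = 1 - prod_{i in S} (1 - p(e_i = e | f_1..f_na)).

    src_docs:     list of documents, each a list of sentences (lists of words)
    p_word_trans: callable (doc_words, i) -> dict {e: p(e_i = e | f_1..f_na)},
                  e.g. the third-embodiment model
    """
    counts = {e: 0.0 for e in vocab_e}
    for doc in src_docs:
        doc_words = [w for sent in doc for w in sent]
        pos = 0
        for sent in doc:
            if q in sent:                               # I_s(q) = 1
                dists = [p_word_trans(doc_words, pos + i) for i in range(len(sent))]
                for e in vocab_e:
                    miss = 1.0
                    for d in dists:
                        miss *= 1.0 - d.get(e, 0.0)
                    counts[e] += 1.0 - miss             # p(e|s, a)
            pos += len(sent)
    return counts
```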
INDUSTRIAL APPLICABILITY
The present invention allows an input text to be classified even in the case where the training data is only available in a language different from that of the input text. For example, we might have plenty of Tweets in English that are annotated as either reporting about an emergency event, like an earthquake, or not reporting about any emergency. This kind of training data needs to be created manually, and is therefore expensive to create.
If we want to classify a Tweet in Japanese, we again need to manually create training data in Japanese. As an alternative, we could use a machine translation system to translate a text from Japanese to English. However, such a machine translation system itself needs training data of parallel text in Japanese and English which is expensive.
Our proposed method allows a feature vector of the Japanese input text to be created that can be classified by a classifier trained using only the English training data. This way, no new training data in Japanese is needed. Furthermore, since our approach uses readily available resources like a bilingual dictionary and monolingual corpora, it can be implemented at low cost and easily extended to other pairs of languages.
Another application is the translation of words which are not yet listed in a translation dictionary (fifth embodiment). This is especially helpful if there is only a small seed bilingual translation dictionary. For example, the new translations can be used to enrich the existing bilingual dictionary, and this way help to improve
cross-lingual classification.
The method for calculating the expected bag-of-words representation, the document classification method, and the word translation acquisition method of the above exemplary embodiments may be realized by dedicated hardware, or may be configured by means of memory and a DSP (digital signal processor) or other
computation and processing device. On the other hand, the functions may be realized by execution of a program used to realize the steps of the method for calculating the expected bag-of-words representation, the document classification method, and the word translation acquisition method.
Moreover, a program to realize the steps of the method for calculating the expected bag-of-words representation, the document classification method, and the word translation acquisition method may be recorded on computer-readable storage media, and the program recorded on this storage media may be read and executed by a computer system to perform the method for calculating the expected bag-of-words representation, the document classification, and the word translation acquisition processing. Here, a "computer system" may include an OS, peripheral equipment, or other hardware.
Further, "computer-readable storage media" means a flexible disk,
magneto-optical disc, ROM, flash memory or other writable nonvolatile memory,
CD-ROM or other removable media, or a hard disk or other storage system incorporated within a computer system.
Further, "computer readable storage media" also includes members which hold the program for a fixed length of time, such as volatile memory (for example, DRAM (dynamic random access memory)) within a computer system serving as a server or client, when the program is transmitted via the Internet, other networks, telephone circuits, or other communication circuits.

Claims

1. A method that, given a bag-of-words representation of a full text, paragraph, sentence or word-window written in a source language, calculates an expected bag-of-words representation in a target language, comprising:
a first step in which, for a source word f in the input bag-of-words F written in the source language, a probability p(e_i|F) that the source word f is translated into a target word e in the target language is calculated by using given probabilities p(f|e) that the target word e in the target language is translated into the source word f in the source language and by using co-occurrence probabilities of two or more target words in the target language that are calculated from a corpus written in the target language (for example by clustering the target words into several topics z, and then using the probabilities p(e|z)); and
a second step in which, for each target word e in the target language, the probability that the target word e is a translation of the source word f in the input text is summed up over all source words f in the input bag-of-words F so as to denote the resulting value Σ_i p(e_i = e|F) as an expected count of the target word e, and to create a feature vector by using the expected counts of each target word e in the target language; the resulting feature vector in the target language being considered as the expected bag-of-words representation, written in the target language, that represents the input bag-of-words F.
2. A cross-lingual document classification method comprising:
a third step in which the bag-of-words representation of the input document in the source language is converted into an expected bag-of-words representation in the target language to generate a feature vector using the method of claim 1; and a fourth step in which, given a classifier trained with the bag-of-words representation of each training instance or document in the target language, the input text is classified by using the feature vector generated in the third step.
3. A cross-lingual document classification method comprising:
a fifth step in which each training instance or document in the target language is converted into an expected bag-of-words representation in the source language to generate a feature vector using the method of claim 1;
a sixth step in which a classifier in the source language is trained using the feature vectors generated in the fifth step;
a seventh step in which the input document in the source language is classified by using the classifier trained in the sixth step.
4. The cross-lingual document classification method according to claim 2, wherein the input text is split into smaller text parts including paragraphs or sentences, and for each text part, the expected bag-of-words representation in the target language is calculated, comprising:
creating one feature vector by adding all text parts' expected bag-of-words representations in the target language, such that the expected number of occurrences of a word in the translated text is calculated by summing up the expected number of occurrences of the word in each sentence translation of the input text.
5. The cross-lingual document classification method according to claim 2, wherein: the second step includes calculating a weighted feature vector by combining the expected count for a target word e with a word weight w_e by using a monotonic function g to get a word-weighted count; and
the cross-lingual document classification method includes using a classifier trained using the weighted word counts of each training instance or document in the target language, where the weighted word counts are the word counts of the target word e combined with the word weight w_e by using the function g.
6. A word translation acquisition method comprising:
an eighth step in which, for each document in the source language, an expected bag-of-words representation in the target language is calculated using the method of claim 1;
a ninth step in which expected co-occurrences of two words, a word q that is in the source language and not listed in the dictionary and a target word e that is in the target language and listed in the dictionary, in a certain type of context including a sentence, a word-window, and a modifier, are calculated by using the expected bag-of-words representation in the target language calculated in the eighth step, and the co-occurrences are used to form the context vector for q;
a tenth step in which, for each translation candidate in the target language, a context vector is calculated using the co-occurrence frequencies between the translation candidate and a target word e that is in the target language and listed in the dictionary, using the target language corpus and the same type of context including a sentence context; and
an eleventh step in which each translation candidate is scored by comparing the context vector of the query word q with the context vector of each translation candidate.
7. The method according to claim 1, wherein the word translation probabilities p(f | e) are estimated using a bilingual dictionary and a pair of comparable corpora including a source and a target language corpus by:
a twelfth step in which each document in the target language corpus is clustered into one or more topics, and a probability p(e | z) that a target word e is generated from a topic z is learned;
a thirteenth step in which, using the probabilities p(e | z) and integrating over the topics z, the probabilities p(f | e) are found by maximizing the probability of observing the source language corpus.
8. A cross-lingual document classification method according to claim 2, comprising:
a fourteenth step in which all documents in the target language are clustered into several topics, and the topic distributions p(e | z) are learned;
a fifteenth step in which, using the topic distribution for each training instance (a target language document) as a feature vector, a classifier in the target language is trained; and a sixteenth step in which, using the translation probabilities p(f | e) and the topic distribution p(e | z), the topic distribution for an input document in the source language is calculated and used to classify the input document with the classifier trained in the fifteenth step.
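
The following Python sketch (not part of the claims; all names are illustrative) shows one way the first and second steps of claim 1 could be realized. It assumes that the topical context of the input bag F is summarized by a single topic mixture q(z) estimated from the dictionary translations of all words in F, and that p(e | f, F) is taken proportional to p(f | e) multiplied by the topic-smoothed probability of e; the claims themselves leave the exact combination open.

    from collections import defaultdict

    def expected_bag_of_words(source_bag, p_f_given_e, p_e_given_z, num_topics):
        # source_bag  : dict {f: count of f in the input bag F}
        # p_f_given_e : dict {(f, e): p(f | e)} from the bilingual dictionary
        # p_e_given_z : dict {(e, z): p(e | z)} learned from the target-language corpus
        # returns     : dict {e: expected count of e in the translation of F}

        # Illustrative assumption: summarize the topical context of F by a topic
        # mixture q(z), estimated from the dictionary translations of all words in F.
        q_z = [0.0] * num_topics
        for f, cnt in source_bag.items():
            for (ff, e), p_fe in p_f_given_e.items():
                if ff == f:
                    for z in range(num_topics):
                        q_z[z] += cnt * p_fe * p_e_given_z.get((e, z), 0.0)
        total = sum(q_z) or 1.0
        q_z = [v / total for v in q_z]

        expected = defaultdict(float)
        for f, cnt in source_bag.items():
            # First step: p(e | f, F) taken proportional to
            # p(f | e) * sum_z p(e | z) q(z), normalized over the candidates e of f.
            scores = {}
            for (ff, e), p_fe in p_f_given_e.items():
                if ff == f:
                    context = sum(p_e_given_z.get((e, z), 0.0) * q_z[z]
                                  for z in range(num_topics))
                    scores[e] = p_fe * context
            norm = sum(scores.values())
            if norm > 0.0:
                # Second step: sum p(e | f, F) over all occurrences of all f in F.
                for e, s in scores.items():
                    expected[e] += cnt * s / norm
        return dict(expected)

Fixing an ordering of the target vocabulary turns the returned expected counts directly into the feature vector of the second step.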
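
A companion sketch for claims 2 and 5, assuming scikit-learn is available; the word weights, the choice of the monotonic function g, and the toy training data are illustrative only.

    import math
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def weight_counts(counts, word_weights, g=math.sqrt):
        # Claim 5 sketch: combine an expected count with a word weight w_e via a
        # monotonic function g (here, illustratively, the square root of the product).
        return {e: g(c * word_weights.get(e, 1.0)) for e, c in counts.items()}

    # Claim 2 sketch (illustrative data): train on target-language bags, then
    # classify a source-language document via its expected bag-of-words vector.
    train_bags = [{"price": 3.0, "increase": 1.0}, {"soccer": 2.0, "goal": 1.0}]
    train_labels = ["economy", "sports"]
    word_weights = {"price": 2.0, "soccer": 2.0}   # e.g. idf-style weights

    vec = DictVectorizer()
    clf = LogisticRegression()
    X = vec.fit_transform([weight_counts(b, word_weights) for b in train_bags])
    clf.fit(X, train_labels)

    # `expected` would come from expected_bag_of_words(...) in the sketch above.
    expected = {"price": 0.8, "increase": 0.3}
    print(clf.predict(vec.transform([weight_counts(expected, word_weights)])))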
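
For claim 6, the comparison of context vectors (eleventh step) can be illustrated with a simple cosine similarity; the vectors, shown here as hand-written dictionaries, would in practice be built from the expected co-occurrences of the ninth step and the corpus co-occurrences of the tenth step.

    import math

    def cosine(u, v):
        # Cosine similarity between two sparse context vectors (dicts keyed by
        # dictionary words e, valued by (expected) co-occurrence counts).
        dot = sum(u[k] * v.get(k, 0.0) for k in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def rank_candidates(query_context, candidate_contexts):
        # Claim 6 sketch (eleventh step): score each translation candidate by
        # comparing its target-language context vector with the context vector
        # of the query word q.
        return sorted(candidate_contexts.items(),
                      key=lambda kv: cosine(query_context, kv[1]),
                      reverse=True)

    # Illustrative vectors (same-sentence co-occurrence counts):
    q_context = {"voltage": 2.4, "circuit": 1.1}
    candidates = {"capacitor": {"voltage": 5.0, "circuit": 3.0},
                  "referee":   {"goal": 4.0, "match": 2.0}}
    print(rank_candidates(q_context, candidates))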
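
Finally, a sketch for claim 8, under the simplifying assumption that the topic distribution of a document is obtained by a normalized fold-in of its word counts against p(e | z) rather than by full posterior inference; for a source-language input, the expected counts from the first sketch stand in for the target word counts.

    def topic_distribution(counts, p_e_given_z, num_topics):
        # Claim 8 sketch: approximate p(z | document) from word counts and p(e | z).
        # Assumption: a normalized fold-in (sum_e count(e) * p(e | z)) replaces
        # full posterior inference over topics.
        weights = [sum(c * p_e_given_z.get((e, z), 0.0) for e, c in counts.items())
                   for z in range(num_topics)]
        total = sum(weights) or 1.0
        return [w / total for w in weights]

The resulting fixed-length vector replaces the much higher-dimensional bag-of-words vector as the feature vector of the fifteenth and sixteenth steps.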

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2013/082514 WO2015079591A1 (en) 2013-11-27 2013-11-27 Crosslingual text classification method using expected frequencies

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2013/082514 WO2015079591A1 (en) 2013-11-27 2013-11-27 Crosslingual text classification method using expected frequencies

Publications (1)

Publication Number Publication Date
WO2015079591A1 true WO2015079591A1 (en) 2015-06-04

Family

ID=53198575

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2013/082514 WO2015079591A1 (en) 2013-11-27 2013-11-27 Crosslingual text classification method using expected frequencies

Country Status (1)

Country Link
WO (1) WO2015079591A1 (en)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Y. ISHIKAWA ET AL.: "Daikibo tangengo corpus to kihongo taiyaku jisho wo mochiita senmonyougo no yakugo kakutoku [Acquisition of translations for technical terms using a large-scale monolingual corpus and a basic bilingual dictionary]", PROCEEDINGS OF THE SIXTEENTH ANNUAL MEETING OF THE ASSOCIATION FOR NATURAL LANGUAGE PROCESSING, THE ASSOCIATION FOR NATURAL LANGUAGE PROCESSING, pages 700 - 703 *
Y. MATSUMOTO ET AL.: "Software Toolbox for Research Activity (1) Japanese Sentence Analysis by ChaSen and CaboCha - Using Syntactic Information for Sentence Role Classification", JOURNAL OF THE JAPANESE SOCIETY FOR ARTIFICIAL INTELLIGENCE, vol. 19, no. 3, pages 334 - 339 *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704456A (en) * 2016-08-09 2018-02-16 松下知识产权经营株式会社 Identify control method and identification control device
CN107704456B (en) * 2016-08-09 2023-08-29 松下知识产权经营株式会社 Identification control method and identification control device
CN108846033B (en) * 2018-05-28 2022-04-08 北京邮电大学 Method and device for discovering specific domain vocabulary and training classifier
CN108846033A (en) * 2018-05-28 2018-11-20 北京邮电大学 The discovery and classifier training method and apparatus of specific area vocabulary
CN109766545A (en) * 2018-12-24 2019-05-17 中国科学院合肥物质科学研究院 Text similarity computing method based on multi-feature fusion
CN109766545B (en) * 2018-12-24 2022-11-18 中国科学院合肥物质科学研究院 Text similarity calculation method based on multi-feature fusion
CN110414229A (en) * 2019-03-29 2019-11-05 腾讯科技(深圳)有限公司 Operational order detection method, device, computer equipment and storage medium
CN110414229B (en) * 2019-03-29 2023-12-12 腾讯科技(深圳)有限公司 Operation command detection method, device, computer equipment and storage medium
CN110298046A (en) * 2019-07-03 2019-10-01 科大讯飞股份有限公司 A kind of translation model training method, text interpretation method and relevant apparatus
CN110298046B (en) * 2019-07-03 2023-04-07 科大讯飞股份有限公司 Translation model training method, text translation method and related device
CN110532544A (en) * 2019-07-18 2019-12-03 中央民族大学 Low-resource text tour field construction of knowledge base method and system
CN110532544B (en) * 2019-07-18 2023-03-24 中央民族大学 Method and system for constructing low-resource word tourism field knowledge base
CN111553168A (en) * 2020-05-09 2020-08-18 识因智能科技(北京)有限公司 Bilingual short text matching method
CN112036485A (en) * 2020-08-31 2020-12-04 平安科技(深圳)有限公司 Method and device for topic classification and computer equipment
CN112036485B (en) * 2020-08-31 2023-10-24 平安科技(深圳)有限公司 Method, device and computer equipment for classifying topics
CN113032559A (en) * 2021-03-15 2021-06-25 新疆大学 Language model fine-tuning method for low-resource adhesion language text classification
CN113032559B (en) * 2021-03-15 2023-04-28 新疆大学 Language model fine tuning method for low-resource adhesive language text classification
CN113032565A (en) * 2021-03-23 2021-06-25 复旦大学 Cross-language supervision-based superior-inferior relation detection method
CN115544971A (en) * 2022-09-21 2022-12-30 中国科学院地理科学与资源研究所 Ancient climate reconstruction data processing method and device
CN115374779A (en) * 2022-10-25 2022-11-22 北京海天瑞声科技股份有限公司 Text language identification method, device, equipment and medium
CN115374779B (en) * 2022-10-25 2023-01-10 北京海天瑞声科技股份有限公司 Text language identification method, device, equipment and medium

Similar Documents

Publication Publication Date Title
WO2015079591A1 (en) Crosslingual text classification method using expected frequencies
Mikolov et al. Advances in pre-training distributed word representations
Potthast et al. Cross-language plagiarism detection
US8543563B1 (en) Domain adaptation for query translation
Brychcín et al. Uwb: Machine learning approach to aspect-based sentiment analysis
US20150006157A1 (en) Term synonym acquisition method and term synonym acquisition apparatus
Zouaghi et al. Combination of information retrieval methods with LESK algorithm for Arabic word sense disambiguation
Hasler et al. Dynamic topic adaptation for phrase-based mt
US20140350914A1 (en) Term translation acquisition method and term translation acquisition apparatus
Ni et al. Neural cross-lingual relation extraction based on bilingual word embedding mapping
Chen et al. Dynamically supporting unexplored domains in conversational interactions by enriching semantics with neural word embeddings
Fallgren et al. Towards a standard dataset of Swedish word vectors
Le-Hong et al. Using dependency analysis to improve question classification
Chowdhury et al. Drug-drug interaction extraction using composite kernels
de Melo et al. Constructing and utilizing wordnets using statistical methods
Konkol Uwb at semeval-2016 task 11: Exploring features for complex word identification
Chen et al. On the benefit of incorporating external features in a neural architecture for answer sentence selection
Apidianaki From word types to tokens and back: A survey of approaches to word meaning representation and interpretation
Boulares et al. Learning sign language machine translation based on elastic net regularization and latent semantic analysis
Sever et al. Evaluating cross-lingual textual similarity on dictionary alignment problem
Li et al. Computational linguistics literature and citations oriented citation linkage, classification and summarization
Klapaftis et al. Evaluating word sense induction and disambiguation methods
Nguyen et al. Text normalization for named entity recognition in Vietnamese tweets
Hong et al. Cross-lingual event-centered news clustering based on elements semantic correlations of different news
Bernhard Adding dialectal lexicalisations to linked open data resources: The example of Alsatian

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 13898374

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 13898374

Country of ref document: EP

Kind code of ref document: A1