WO2015145981A1 - Multilingual document-similarity-degree learning device, multilingual document-similarity-degree determination device, multilingual document-similarity-degree learning method, multilingual document-similarity-degree determination method, and storage medium - Google Patents

Multilingual document-similarity-degree learning device, multilingual document-similarity-degree determination device, multilingual document-similarity-degree learning method, multilingual document-similarity-degree determination method, and storage medium Download PDF

Info

Publication number
WO2015145981A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
similarity
documents
matrix
multilingual
Prior art date
Application number
PCT/JP2015/001028
Other languages
French (fr)
Japanese (ja)
Inventor
定政 邦彦 (Kunihiko Sadamasa)
Original Assignee
日本電気株式会社 (NEC Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社 (NEC Corporation)
Priority to JP2016509952A, published as JPWO2015145981A1
Publication of WO2015145981A1 publication Critical patent/WO2015145981A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing

Definitions

  • the present invention relates to a technique for finding documents whose contents are similar to each other in a multilingual document group in which documents of different languages are mixed.
  • A technique related to such a problem is described in Patent Document 1.
  • In this related technique, a target document is first machine-translated into a reference language, for example English, and then documents having similar contents are grouped using a technique such as clustering.
  • Patent Document 2 proposes a framework called SSI (supervised semantic indexing) that learns a model representing the similarity of contents between two languages directly from a bilingual corpus without going through an intermediate result of machine translation. This related technique learns and determines the similarity of documents between two languages as follows.
  • first, in this related technique, each document d ij in each language is expressed as a bag-of-words vector over a document set that is in a parallel translation relationship across languages (the dimension numbers D 1 and D 2 for each language may be chosen freely).
  • the subscript i represents the language type.
  • the subscript j represents the document ID for each language.
  • this related technique then prepares a matrix W for learning the correspondence between the language pair. W is a D 1 × D 2 matrix (D 1 rows, D 2 columns).
  • since the number of parameters to be learned is large, U and V satisfying W = U T · V are learned instead, for dimensional compression.
  • U T represents the transposed matrix of U.
  • U and V are matrices of N × D 1 and N × D 2 , respectively.
  • a value such as N = 100 is applied.
  • An equation for calculating the score of the document pair is shown in the following equation (1).
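Equation (1) itself appears only as an image in the original publication. Based on the surrounding definitions (W = U T · V, with U and V of sizes N × D 1 and N × D 2), it most likely has the following form, given here as a reconstruction rather than a verbatim copy:

```latex
% Hedged reconstruction of equation (1); the original is shown only as an image.
f(d_{1j}, d_{2k}) \;=\; d_{1j}^{\top} \, W \, d_{2k} \;=\; d_{1j}^{\top} U^{\top} V \, d_{2k} \;=\; (U d_{1j})^{\top} (V d_{2k}) \qquad (1)
```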
  • Patent Document 3 describes another technique related to such a problem.
  • This related technique searches a plurality of documents described in different languages for those that are semantically approximate to a search request described in a certain language.
  • a word dictionary database is prepared in advance.
  • the word dictionary database associates a group of synonymous words Wi across natural languages A, B, C, D, ... with a single word feature vector Vi.
  • this related technique normalizes the sum of word feature vectors related to words included in each document, and calculates it as a document feature vector.
  • this related technique normalizes the sum of word feature vectors related to each word included in the search request and calculates it as a search request feature vector.
  • This related technique calculates the inner product of the search request feature vector and the document feature vector of each document as a semantic approximation.
  • This related technique searches for a document having a large semantic similarity as a document that approximates the search request.
  • however, since the related technology described in Patent Document 1 is a framework that goes through the intermediate state of machine translation results, there is a problem in that the accuracy of the grouping is not high when the accuracy of the machine translation itself is not necessarily high.
  • the related technology described in Patent Document 2 has no problem between two languages, but a problem arises with three or more languages: the matrix W = U T · V for determining similarity between languages must be prepared for each partner language, so matrices must be computed and held for every language pair, and when a document in one language is compared against documents in other languages, W · d must be computed once per comparison-target language, which raises the calculation cost.
  • Patent Document 3 does not describe how to learn the single word feature vector Vi for a group of synonymous words Wi. Also, considering the existence of polysemous words, the number of combinations of synonymous words Wi may become enormous. Therefore, with this related technology, the cost of maintaining and learning the word dictionary database increases. Furthermore, determining the degree of approximation of a document via word-level synonym groups is equivalent to going through word-level machine translation. Therefore, this related technique has a problem that the accuracy of the semantic approximation is not high when the accuracy of the synonym groups (machine translation) is not necessarily high.
  • an object of the present invention is to provide a technique for searching for similar documents more accurately at a lower cost even when there are three or more languages in a multilingual document group.
  • the multilingual document similarity learning device of the present invention includes: multilingual matrix storage means that holds a matrix for each target language; word vector acquisition means that acquires a word vector corresponding to a document; semantic vector creation means that creates a semantic vector of the document based on the word vector of the document and the matrix corresponding to the description language of the document; similarity calculation means that, for a set of documents, calculates a similarity based on the semantic vector of each document; and multilingual matrix learning means that, in a collection of documents each described in one of the target languages, adjusts and learns the values of the matrices corresponding to the respective target languages so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship.
  • the multilingual document similarity determination device of the present invention includes: multilingual matrix storage means that holds the matrix for each target language learned using the above multilingual document similarity learning device; word vector acquisition means that acquires a word vector corresponding to a document; semantic vector creation means that creates a semantic vector of the document based on the word vector of the document and the matrix corresponding to the description language of the document; similarity calculation means that, for a set of documents, calculates a similarity based on the semantic vector of each document; and similarity determination means that, in a collection of similarity determination target documents, determines the similarity between documents using the calculated similarity.
  • the multilingual document similarity learning method of the present invention creates a semantic vector of a document based on the word vector corresponding to the document and the matrix, held for each target language, that corresponds to the description language of the document, and calculates a similarity for a set of documents based on the semantic vector of each document; in a collection of documents each described in one of the target languages, learning is then performed by adjusting the values of the matrices corresponding to the respective target languages so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship.
  • the multilingual document similarity determination method of the present invention creates a semantic vector of a document based on the word vector corresponding to the document and the matrix, learned for each target language by the above multilingual document similarity learning method, that corresponds to the description language of the document, calculates a similarity for a set of documents based on the semantic vector of each document, and, in a collection of similarity determination target documents, determines the similarity between documents using the calculated similarity.
  • the storage medium of the present invention stores a multilingual document similarity learning program that causes a computer device to execute: a word vector acquisition step of acquiring a word vector corresponding to a document; a semantic vector creation step of creating a semantic vector of the document based on the word vector of the document and the matrix, held for each target language, that corresponds to the description language of the document; a similarity calculation step of calculating, for a set of documents, a similarity based on the semantic vector of each document; and a multilingual matrix learning step of adjusting and learning, in a collection of documents each described in one of the target languages, the values of the matrices corresponding to the respective target languages so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship.
  • another storage medium of the present invention stores a multilingual document similarity determination program that causes a computer device to execute: a word vector acquisition step of acquiring a word vector corresponding to a document; a semantic vector creation step of creating a semantic vector of the document based on the word vector of the document and the matrix, learned for each target language by executing the multilingual document similarity learning program stored in the above storage medium, that corresponds to the description language of the document; a similarity calculation step of calculating, for a set of documents, a similarity based on the semantic vector of each document; and a similarity determination step of determining, in a collection of similarity determination target documents, the similarity between documents using the calculated similarity.
  • the present invention can provide a technique for searching for similar documents more accurately at a lower cost even when there are three or more languages in a multilingual document group.
  • FIG. 1 is a diagram showing a functional block configuration of a multilingual document similarity learning apparatus 1 as a first embodiment of the present invention.
  • a multilingual document similarity learning device 1 includes a multilingual matrix storage unit 11, a word vector acquisition unit 12, a semantic vector creation unit 13, a similarity calculation unit 14, and a multilingual matrix learning unit 15.
  • FIG. 2 is a diagram illustrating an example of a hardware configuration of the multilingual document similarity learning apparatus 1.
  • the multilingual document similarity learning device 1 is configured by a computer device.
  • the computer device includes a CPU (Central Processing Unit) 1001, a RAM (Random Access Memory) 1002, a ROM (Read Only Memory) 1003, a storage device 1004, an input device 1005, and an output device 1006.
  • the multilingual matrix storage unit 11 is configured by the storage device 1004.
  • the word vector acquisition unit 12, the semantic vector creation unit 13, and the similarity calculation unit 14 are configured by a CPU 1001 that reads a computer program and various data stored in the ROM 1003 and the storage device 1004 into the RAM 1002 and executes them.
  • the multilingual matrix learning unit 15 includes an input device 1005 and a CPU 1001 that reads a computer program and various data stored in the ROM 1003 and the storage device 1004 into the RAM 1002 and executes them. Note that the hardware configuration of the multilingual document similarity learning device 1 and each functional block thereof is not limited to the above-described configuration.
  • the multilingual matrix storage unit 11 holds a matrix for each target language.
  • Each matrix is a weight matrix for converting a word vector of a document described in the target language into a semantic vector.
  • the word vector and the semantic vector will be described later.
  • each matrix may have the same number of columns. In that case, the number of columns is the number of dimensions of the semantic vector. In this case, the number of rows in each matrix may be the number of dimensions of a word vector described later.
  • the word vector acquisition unit 12 acquires a word vector corresponding to the document.
  • the word vector is a concept that is generally used when calculating the similarity of documents, and is an expression format that represents a document by a set of words included in the document.
  • the number of dimensions of the word vector may be, for example, the number of words used in a target language describing the document (hereinafter also referred to as a description language).
  • the word vector acquisition unit 12 may create a word vector by a known technique based on a given document.
  • alternatively, the word vector acquisition unit 12 may acquire a word vector generated in advance for the document from the storage device 1004, the input device 1005, or the like.
  • the semantic vector creation unit 13 creates the semantic vector of the document based on the word vector of the document and the matrix held in the multilingual matrix storage unit 11 corresponding to the description language of the document.
  • the semantic vector is information representing the semantic features of the document.
  • the semantic vector creation unit 13 may create a product of a word vector of a document and a matrix corresponding to the description language of the document as the semantic vector of the document.
  • the similarity calculation unit 14 calculates a similarity for a set of documents based on the semantic vector of each document. For example, the similarity calculation unit 14 may calculate the inner product of the semantic vectors of each document as the similarity of a set of documents.
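As a concrete illustration, the sketch below shows how a semantic vector and a pairwise similarity could be computed. NumPy is assumed, the matrix orientation follows the specific example given later (the matrix for language i is N × D i and the semantic vector is the product M i · d ij), and all sizes and names are illustrative assumptions rather than values taken from the text.

```python
import numpy as np

def semantic_vector(lang_matrix, word_vec):
    """Semantic vector of a document: product of the language matrix (N x D_i)
    and the document's word vector (length D_i). The result always has N
    dimensions, regardless of the description language."""
    return lang_matrix @ word_vec

def similarity(sem_a, sem_b):
    """Similarity of a pair of documents: inner product of their semantic vectors."""
    return float(sem_a @ sem_b)

# Illustrative usage with hypothetical vocabulary sizes.
N = 100                        # number of semantic-vector dimensions
D_en, D_ja = 50_000, 60_000    # D_i: number of words used in each language

rng = np.random.default_rng(0)
M = {"en": rng.normal(scale=0.01, size=(N, D_en)),
     "ja": rng.normal(scale=0.01, size=(N, D_ja))}

d_en = np.zeros(D_en); d_en[[10, 42, 99]] = 1.0   # toy bag-of-words vectors
d_ja = np.zeros(D_ja); d_ja[[7, 300]] = 1.0

score = similarity(semantic_vector(M["en"], d_en), semantic_vector(M["ja"], d_ja))
```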
  • the multilingual matrix learning unit 15 adjusts and learns, in a collection of documents each described in one of the target languages, the values of the matrices corresponding to the description languages of the documents, using the semantic vector creation unit 13 and the similarity calculation unit 14. Specifically, the multilingual matrix learning unit 15 learns each matrix so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship. For example, the multilingual matrix learning unit 15 may perform learning by adjusting the matrix values so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity between one document of the pair and another document that is not in a parallel translation relationship with it. Note that the multilingual matrix learning unit 15 preferably performs the learning of the matrices in parallel.
  • a set of documents used for learning by the multilingual matrix learning unit 15 will be described.
  • in this collection of documents, there may be three or more target languages describing the documents.
  • such a collection of documents includes, at least in part, pairs of documents that are in a parallel translation relationship, and also includes, at least in part, pairs of documents that are not in a parallel translation relationship.
  • the document set may be stored in advance in the storage device 1004.
  • a set of documents may be input from the outside via the input device 1005 or a network interface (not shown).
  • the multilingual matrix learning unit 15 may learn each matrix described above using the stochastic steepest gradient method.
  • in this case, the multilingual matrix learning unit 15 may randomly select, from the collection of documents, a pair of documents in a parallel translation relationship and a pair of documents not in a parallel translation relationship for each step of the stochastic steepest gradient method.
  • the operation of the multilingual document similarity learning device 1 configured as described above will be described with reference to FIG. 3. Note that the multilingual document similarity learning device 1 starts the following operation when a collection of documents, each described in one of the target languages, is input.
  • the word vector acquisition unit 12 acquires a corresponding word vector for each document in a set of documents each described in one of the target languages (step S1).
  • the multilingual matrix learning unit 15 instructs the semantic vector creation unit 13 to create a semantic vector for each document in a pair of documents in a parallel translation relationship and in a pair of documents not in a parallel translation relationship taken from the collection (step S2).
  • the semantic vector creation unit 13 creates a semantic vector based on a word vector of each document and a matrix corresponding to the description language.
  • the multilingual matrix learning unit 15 uses the similarity calculation unit 14 to calculate, based on the semantic vectors of the documents, the similarity of the pair of documents in a parallel translation relationship and the similarity of the pair of documents not in a parallel translation relationship (step S3).
  • the multilingual matrix learning unit 15 adjusts the matrix corresponding to each description language so that the similarity of the pair of documents in a parallel translation relationship becomes higher than the similarity of the pair of documents not in a parallel translation relationship (step S4).
  • if the adjustment of the matrices has converged (Yes in step S5), the multilingual matrix learning unit 15 ends the learning.
  • if the adjustment of the matrices has not converged (No in step S5), the operation of the multilingual document similarity learning device 1 returns to step S2, and the device repeats the operations from step S2 onward.
  • as described above, the multilingual document similarity learning device of this embodiment can learn the matrices used to create semantic vectors of documents for similar-document determination more accurately and at a lower cost, even when there are three or more languages in a multilingual document group.
  • the multilingual matrix storage unit holds a matrix for converting a word vector of a document into a semantic vector for each target language.
  • the word vector acquisition unit acquires a word vector corresponding to the document
  • the semantic vector creation unit creates a semantic vector based on the word vector of the document and a matrix corresponding to the description language of the document.
  • this is because the multilingual matrix learning unit performs the learning of the matrix corresponding to each target language so that, in a collection of documents each described in one of the target languages, the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship.
  • in this embodiment, a matrix for creating a semantic vector is prepared for each target language, not for each language pair. Therefore, this embodiment does not need to learn a matrix for each language pair; it only needs to learn one matrix per target language such that the meaning of each dimension of a document's semantic vector is the same regardless of the language pair. As a result, the matrix obtained by learning for each description language is independent of the partner language.
  • moreover, even when a matrix is learned for a target language for which the absolute number of documents in the collection is small, information can be obtained from pairs with documents in a plurality of other target languages. For this reason, a performance improvement can be expected compared to learning a matrix for each language pair. Furthermore, this embodiment can further improve the learning accuracy by learning the matrices of the target languages in parallel.
  • the translation relationship here does not need to be a complete translation relationship, a so-called parallel corpus.
  • the bilingual relationship may be a so-called comparable corpus that describes the same object in different languages.
  • a bilingual corpus used in statistical machine translation research may be used.
  • for example, the various language versions of Wikipedia may be used as such a set of documents.
  • in this specific example, the multilingual matrix storage unit 11 stores a matrix M i for each language i.
  • the matrix M i corresponding to language i is an N × D i matrix.
  • D i is the number of words used in language i.
  • D i may be a different value for each language.
  • as the initial value of each M i , for example, 0 is set.
  • the word vector acquisition unit 12 converts each document in the above-described document set into a word vector.
  • the word vector is a concept that is generally used when calculating the similarity of documents, and is an expression format that expresses a document by a set of words included in the document.
  • the simplest word vector is a vector (for example, an expression such as [1, 0, 1, 0]) composed of elements in which the presence or absence of each word is represented by 0 or 1.
  • Other word vectors include those based on TF (word appearance frequency: term frequency) * IDF (inverse document frequency) that weights each word from the viewpoint of calculating similarity.
  • a method of temporarily compressing a word vector using a method such as LSI (Latent Semantic Indexing) or LDA (Latent Dirichlet Allocation) is also known.
  • word N-GRAM or character N-GRAM may be used.
  • in this specific example, word TF * IDF is used as the word vector.
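As an illustration, a minimal word TF*IDF computation might look like the sketch below. scikit-learn is assumed purely for convenience (any equivalent TF*IDF implementation would do), the sample documents are invented, and keeping one vectorizer per language reflects the text's statement that D i may differ between languages.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One vectorizer per target language, so each language i has its own vocabulary size D_i.
docs_en = ["a multilingual document similarity example",
           "another english document about matrices and semantic vectors"]

vectorizer_en = TfidfVectorizer()                       # word TF*IDF
word_vectors_en = vectorizer_en.fit_transform(docs_en)  # sparse matrix, shape (num_docs, D_en)

print(word_vectors_en.shape)
```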
  • the semantic vector creation unit 13 calculates the product of the word vector of a document and the matrix corresponding to the description language of the document as the semantic vector of that document, under the control of the multilingual matrix learning unit 15. Specifically, for the j-th document in language i, the semantic vector creation unit 13 calculates M i · d ij , the product of the current matrix M i of language i and the corresponding word vector d ij , and uses it as the semantic vector. The number of dimensions of M i · d ij is N regardless of the document or language.
  • the multilingual matrix learning unit 15 performs the learning of each M i using the semantic vector creation unit 13 and the similarity calculation unit 14.
  • the basic idea of the learning is to adjust the matrices of the multiple description languages in parallel so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship.
  • specifically, from the collection of documents mentioned above, the multilingual matrix learning unit 15 takes a document q in language i q , a document d+ in language i+ that is in a parallel translation relationship with q, and a document d− in language i− that is not in a parallel translation relationship with q.
  • the matrices M iq , M i+ , and M i− are adjusted in parallel so as to satisfy the following expression (2).
  • the multilingual matrix learning unit 15 performs adjustment so as to minimize the loss function of the following equation (3) considering the margin.
  • here, R is the set of combinations of a document in the input document collection with a document that is in a parallel translation relationship with it and a document that is not in a parallel translation relationship with it.
  • f(q, d) represents the similarity between documents q and d. That is, if the language of q is i q and the language of d is i d , then f(q, d) = (M iq · q) T · (M id · d).
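Expressions (2) and (3) themselves appear only as images in the original publication. From the surrounding text (a margin-based ranking criterion over the set R), they most likely take the following form, given here as a hedged reconstruction:

```latex
% Hedged reconstruction of expressions (2) and (3); the originals are shown only as images.
f(q, d^{+}) > f(q, d^{-})  \qquad (2)

\sum_{(q,\, d^{+},\, d^{-}) \in R} \max\!\bigl(0,\; 1 - f(q, d^{+}) + f(q, d^{-})\bigr)  \qquad (3)
```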
  • One of the methods for minimizing the loss function described above is a method using a stochastic steepest gradient method.
  • the multilingual matrix learning unit 15 randomly selects a triple q, d+ and d− for each step of the stochastic steepest gradient method, and when 1 − f(q, d+) + f(q, d−) > 0, each matrix M (M iq , M i+ , M i− ) is updated as in the following equations (4) to (6).
  • the multilingual matrix learning unit 15 performs random document extraction and adjustment of the matrix M based on the extraction until convergence.
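The update equations (4) to (6) likewise appear only as images in the original. The sketch below shows one plausible reading, in which each of the three matrices is moved along the negative gradient of the hinge loss max(0, 1 − f(q, d+) + f(q, d−)) whenever that loss is positive. NumPy is assumed, and the learning rate, the outer-product update form, and the function names are illustrative assumptions rather than the patent's exact equations.

```python
import numpy as np

def f(M_q, q, M_d, d):
    """Similarity f(q, d) = (M_iq . q)^T (M_id . d), as defined in the text."""
    return float((M_q @ q) @ (M_d @ d))

def sgd_step(M, q, d_pos, d_neg, lang_q, lang_pos, lang_neg, lr=0.01):
    """One step of the stochastic steepest gradient method on the hinge loss.

    M is a dict mapping each language to its current N x D_i matrix. The three
    updates below are a plausible reading of equations (4)-(6): they are the
    negative gradients of 1 - f(q, d+) + f(q, d-) with respect to M_iq, M_i+,
    and M_i-, applied only when the margin condition is violated.
    """
    if 1.0 - f(M[lang_q], q, M[lang_pos], d_pos) + f(M[lang_q], q, M[lang_neg], d_neg) > 0:
        sem_q   = M[lang_q]   @ q
        sem_pos = M[lang_pos] @ d_pos
        sem_neg = M[lang_neg] @ d_neg
        M[lang_q]   += lr * np.outer(sem_pos - sem_neg, q)   # plausible reading of (4)
        M[lang_pos] += lr * np.outer(sem_q, d_pos)           # plausible reading of (5)
        M[lang_neg] -= lr * np.outer(sem_q, d_neg)           # plausible reading of (6)

# Per the text, a triple (q, d+, d-) is drawn at random from the document
# collection at every step, and the extraction and update are repeated until
# the matrices converge.
```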
  • the multilingual matrix storage unit 11 stores a matrix M i for converting a word vector of a document into a semantic vector for each language i.
  • in addition, the multilingual matrix learning unit 15 performs the learning of the matrices M i for the plurality of languages i in parallel, so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship. Thereby, learning can be performed so that the meaning of each dimension of the semantic vector is the same regardless of the language pair.
  • the matrix M i for the target language i obtained by learning becomes independent of the partner language.
  • in contrast, the related technique of Patent Document 2 has to hold and learn, for a given language a, a plurality of matrices such as M ab , M ac , ... for each partner language b, c, .... This is because the meaning of each dimension of the semantic vector (M · d) differs for each language pair.
  • in this specific example, on the other hand, the semantic vector (M · d) can be handled as a vector independent of the partner language.
  • this is because the multilingual matrix learning unit 15 learns each M i so that the meaning of each dimension of the semantic vector is the same regardless of the language pair.
  • as a result, the number of matrices M is reduced from n × (n − 1) to n.
  • further, even when learning the matrix of a language for which the absolute number of documents is small, the multilingual document similarity learning device in this specific example can obtain information from pairs formed between a document described in that language and documents described in each of a plurality of other languages. For this reason, the multilingual document similarity learning device in this specific example can improve learning performance.
  • FIG. 4 is a diagram showing a functional block configuration of the multilingual document similarity determination apparatus 2 according to the second embodiment of the present invention.
  • the multilingual document similarity determination device 2 includes a multilingual matrix storage unit 11, a word vector acquisition unit 12, a semantic vector creation unit 13, a similarity calculation unit 14, and a similarity determination unit 26.
  • the multilingual document similarity determination device 2 can be configured by the same hardware elements as the multilingual document similarity learning device 1 according to the first embodiment of the present invention, described with reference to FIG. 2.
  • the similarity determination unit 26 includes an output device 1006 and a CPU 1001 that reads a computer program and various data stored in the ROM 1003 and the storage device 1004 into the RAM 1002 and executes them.
  • the hardware configuration of the multilingual document similarity determination device 2 and each functional block thereof is not limited to the above-described configuration.
  • the multilingual matrix storage unit 11 holds a matrix for each target language learned by the multilingual document similarity learning device 1 according to the first embodiment of the present invention.
  • the word vector acquisition unit 12, the semantic vector creation unit 13, and the similarity calculation unit 14 are configured in the same manner as in the first embodiment of the present invention.
  • the similarity determination unit 26 determines the similarity of a document using the similarity calculated by the similarity calculation unit 14 in a set of documents to be subjected to similarity determination.
  • for example, the set of documents to be subjected to similarity determination may be a pair of documents.
  • the similarity determination unit 26 may determine that a set of documents to be determined is similar if the similarity is equal to or greater than a threshold value, and may determine that they are not similar if the similarity is less than the threshold value.
  • the set of documents to be subjected to similarity determination may be a set of three or more documents.
  • the similarity determination unit 26 may perform clustering of documents based on the similarity as the determination of the similarity in the set of documents to be determined. Further, for example, the similarity determination unit 26 may perform ranking of similar documents with respect to a certain document as determination of similarity in the set of documents to be determined.
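For illustration, the sketch below shows a threshold-based similarity judgement and a ranking of similar documents against one query document based on semantic-vector inner products. The semantic vectors are assumed to be NumPy arrays (or anything supporting the @ inner product), and the function names and the threshold value are illustrative assumptions, not values from the text.

```python
def is_similar(similarity_value, threshold=0.5):
    """Judge a pair of documents as similar when their similarity reaches the
    threshold (the value 0.5 is a hypothetical choice)."""
    return similarity_value >= threshold

def rank_similar(query_sem, candidate_sems):
    """Return candidate indices ordered by descending similarity (inner product
    of semantic vectors) to the query document."""
    sims = [float(query_sem @ c) for c in candidate_sems]
    return sorted(range(len(candidate_sems)), key=lambda i: sims[i], reverse=True)
```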
  • the similarity determination unit 26 may output the determination result to the output device 1006.
  • the set of documents to be subjected to similarity determination may be stored in the storage device 1004 in advance. Further, the set of documents to be subjected to similarity determination may be input from the outside via the input device 1005 or a network interface (not shown).
  • the multilingual document similarity determination apparatus 2 starts the following operation when a set of documents whose similarity is to be determined is input.
  • the word vector acquisition unit 12 acquires a word vector for each document in a set of documents whose similarity is to be determined (step S11).
  • the semantic vector creation unit 13 creates a semantic vector for each document based on a word vector of the document and a matrix corresponding to the description language of the document (step S12).
  • the similarity calculation unit 14 calculates the similarity of a set of arbitrary documents in the set of documents (step S13).
  • the similarity determination unit 26 determines the similarity based on the obtained similarity and outputs a determination result (step S14). As described above, the similarity determination unit 26 may output information indicating whether or not a set of arbitrary documents is similar by comparing the similarity with a threshold value. Further, the similarity determination unit 26 may perform clustering and ranking of documents using the similarity and output the result as a determination result.
  • the multilingual document similarity determination device 2 ends the operation.
  • as described above, the multilingual document similarity determination device of this embodiment can determine similar documents more accurately and at a lower cost, even when there are three or more languages in a multilingual document group.
  • this is because the multilingual matrix storage unit holds, for each target language, a learned matrix for converting a word vector of a document into a semantic vector.
  • the word vector acquisition unit acquires a word vector for a document.
  • the semantic vector creation unit creates a semantic vector for the document based on the word vector and a matrix corresponding to the description language.
  • the similarity calculation unit calculates the similarity based on the semantic vector for the document set. This is because the similarity determination unit performs similarity determination based on the similarity in a set of documents to be subjected to similarity determination.
  • furthermore, the similarity calculation unit does not need to calculate, for a given document, as many semantic vectors as there are description languages in the document group being compared.
  • the similarity calculation unit may calculate one semantic vector for a document regardless of the number of description languages of the document group to be compared. For this reason, the calculation cost for similarity determination becomes low. Further, such a semantic vector is created so that the meaning of each dimension is the same regardless of the language pair. For this reason, the similarity calculated based on the semantic vector has high accuracy.
  • the multilingual matrix storage unit 11 holds a matrix M i for each language learned in the specific example according to the first embodiment of the present invention.
  • a news article group collected from the web is input to the multilingual document similarity determination device 2 as a set of documents to be clustered (similarity determination target).
  • the word vector acquisition unit 12 converts each document in the set of documents to be clustered into a word vector.
  • the conversion method is the same as the specific example in the first embodiment of the present invention.
  • the semantic vector creation unit 13 creates the semantic vector of each document by taking the product of the created word vector and the matrix M i of the corresponding language stored in the multilingual matrix storage unit 11.
  • the creation method is the same as the specific example in the first embodiment of the present invention.
  • the similarity calculation unit 14 obtains the similarity by taking the inner product of the semantic vectors for each set of documents in the set of documents to be clustered.
  • the similarity determination unit 26 performs clustering by causing a set of documents whose similarity is equal to or greater than a threshold value to belong to the same cluster.
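A minimal sketch of that clustering step is shown below: documents whose pairwise semantic-vector inner product reaches the threshold are placed in the same cluster, transitively, via a simple union-find. The threshold and the single-linkage-style grouping are assumptions about one straightforward reading of the text, and the semantic vectors are assumed to be NumPy arrays.

```python
def cluster_by_threshold(sem_vectors, threshold):
    """Group documents so that any pair whose semantic-vector inner product is
    at or above `threshold` ends up in the same cluster (transitively)."""
    n = len(sem_vectors)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if float(sem_vectors[i] @ sem_vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```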
  • in this specific example, the multilingual matrix storage unit 11 holds, for each language, the matrix M i for converting a word vector of a document into a semantic vector. Since this matrix M i has been learned by the multilingual document similarity learning device 1 of the specific example of the first embodiment of the present invention, it is independent of the partner language. Therefore, when calculating the similarity of any pair of documents, the similarity calculation unit 14 can use the one semantic vector calculated for each document, and calculation for each language pair becomes unnecessary.
  • in this specific example, when a certain document d ij is compared against other document groups, only one semantic vector M i · d ij needs to be obtained, even if the comparison-target document group is described in a plurality of languages. Therefore, this specific example can reduce the calculation cost of the similarity.
  • in the above embodiments, each matrix held for each target language has been described mainly with an example in which the numbers of columns are equal to each other.
  • alternatively, the matrices may have equal numbers of rows.
  • in that case, the number of dimensions of the word vector of a document described in the corresponding language may be used as the number of columns of each matrix.
  • the semantic vector creation unit calculates the product of a word vector of a document and a matrix corresponding to the description language as a semantic vector.
  • the semantic vector creation unit may use other calculation methods for creating a semantic vector based on a word vector of a document and a matrix corresponding to the description language.
  • the description has been made mainly on the example in which the similarity calculation unit calculates the inner product of the semantic vectors of each document for the document set to obtain the similarity.
  • the similarity calculation unit may use another calculation method for calculating the similarity based on the semantic vector of each document.
  • the multilingual document similarity learning device and the multilingual document similarity determination device as the embodiments of the present invention described above may be realized on the same device.
  • in the above embodiments, the description has centered on examples in which each functional block of the multilingual document similarity learning device and the multilingual document similarity determination device is realized by a CPU executing a computer program stored in a storage device or ROM.
  • part, all, or a combination of each functional block may be realized by dedicated hardware.
  • the functional blocks of the multilingual document similarity learning device or the multilingual document similarity determination device may be distributed and implemented in a plurality of devices.
  • a computer program that causes a computer to perform the operations of the multilingual document similarity learning device and the multilingual document similarity determination device described with reference to the flowcharts is also a computer program of the present invention.
  • the present invention is also constituted by the code of such a computer program, or by a storage medium storing it.

Abstract

This invention provides a technology for searching for similar documents in a multilingual document group at lower cost and with higher precision, even if three or more languages are present. This multilingual document-similarity-degree learning device (1) comprises the following: a multilingual matrix storage unit (11) that holds a matrix for each target language; a word-vector acquisition unit (12) that acquires a word vector corresponding to a document; a meaning-vector creation unit (13) that creates a meaning vector for said document on the basis of the word vector for said document and the matrix corresponding to the language in which said document is written; a similarity-degree calculation unit (14) that calculates similarity degrees on the basis of meaning vectors for documents in a document group; and a multilingual matrix learning unit (15) that implements learning by adjusting values in the matrices corresponding to the respective target languages such that, within a set of documents each written in one of the target languages, the similarity degrees for groups of documents that exhibit source-translation relationships are higher than the similarity degrees for groups of documents that do not exhibit source-translation relationships.

Description

Multilingual document similarity learning device, multilingual document similarity determination device, multilingual document similarity learning method, multilingual document similarity determination method, and storage medium
 The present invention relates to a technique for finding documents whose contents are similar to each other in a multilingual document group in which documents of different languages are mixed.
 The Internet has become widespread, and various information has come to be transmitted in various languages. To collect more information, information written in more languages should be targeted. However, in this case, similar information written in different languages is collected separately and presented as different pieces of information, which is inefficient from the viewpoint of information collection.
 A technique related to such a problem is described in Patent Document 1. In this related technique, a target document is first machine-translated into a reference language, for example English, and then documents having similar contents are grouped using a technique such as clustering.
 Another technique related to such a problem is described in Patent Document 2. Patent Document 2 proposes a framework called SSI (supervised semantic indexing) that learns a model representing the similarity of contents between two languages directly from a bilingual corpus, without going through the intermediate result of machine translation. This related technique learns and determines the similarity of documents between two languages as follows.
 First, in this related technique, each document d ij in each language is expressed as a bag-of-words vector over a document set that is in a parallel translation relationship across languages (the dimension numbers D 1 and D 2 for each language may be chosen freely). Here, the subscript i represents the language type, and the subscript j represents the document ID within each language.
 This related technique then prepares a matrix W for learning the correspondence between the language pair. W is a D 1 × D 2 matrix (D 1 rows, D 2 columns). However, since the number of parameters to be learned is large, U and V satisfying W = U T · V are learned for dimensional compression. Here, U T represents the transposed matrix of U, and U and V are N × D 1 and N × D 2 matrices, respectively. A value such as N = 100 is applied. The equation for calculating the score of a document pair is shown as equation (1). [Equation (1) is presented as an image in the original publication.]
 At this time, U and V are learned so that the score of a document pair in a parallel translation relationship becomes higher than the scores of the other document pairs. This related technique then determines the similarity between documents d 1j and d 2k in different languages by equation (1), using the learned matrices U and V.
 Yet another technique related to such a problem is described in Patent Document 3. This related technique searches, among a plurality of documents described in different languages, for those that are semantically close to a search request described in a certain language. In this related technique, a word dictionary database is prepared in advance. The word dictionary database associates a group of synonymous words Wi across natural languages A, B, C, D, ... with a single word feature vector Vi. This related technique then normalizes the sum of the word feature vectors associated with the words included in each document and calculates it as a document feature vector. Likewise, it normalizes the sum of the word feature vectors associated with each word included in the search request and calculates it as a search request feature vector. This related technique then calculates the inner product of the search request feature vector and the document feature vector of each document as a semantic approximation degree, and retrieves documents with a large semantic approximation degree as documents close to the search request.
 Patent Document 1: JP 2013-84306 A
 Patent Document 2: US Pat. No. 8,359,282
 Patent Document 3: Japanese Patent Laid-Open No. H10-31677
 However, since the related technology described in Patent Document 1 is a framework that goes through the intermediate state of machine translation results, there is a problem in that the accuracy of the grouping is not high when the accuracy of the machine translation itself is not necessarily high.
 The related art described in Patent Document 2 has no problem in the case of two languages, but a problem arises in the case of three or more languages: the matrix W = U T · V for determining the similarity between languages must be prepared for each partner language. For example, if the number of languages is n, it is necessary to calculate and hold n × (n − 1) / 2 matrices, counting U and V together. Here, "/" represents division. Further, at the time of similarity determination, when a document d ij in a certain language is compared with other documents, W · d ij must be calculated for each of the languages being compared, which increases the calculation cost.
 Further, Patent Document 3 does not describe how to learn the single word feature vector Vi for a group of synonymous words Wi. Also, considering the existence of polysemous words, the number of combinations of synonymous words Wi may become enormous. Therefore, with this related technology, the cost of maintaining and learning the word dictionary database increases. Furthermore, determining the degree of approximation of a document via word-level synonym groups is equivalent to going through word-level machine translation. Therefore, this related technique has a problem that the accuracy of the semantic approximation is not high when the accuracy of the synonym groups (machine translation) is not necessarily high.
 The present invention has been made to solve the above-described problems. That is, an object of the present invention is to provide a technique for searching for similar documents more accurately and at a lower cost even when there are three or more languages in a multilingual document group.
 To achieve the above object, the multilingual document similarity learning device of the present invention includes: multilingual matrix storage means that holds a matrix for each target language; word vector acquisition means that acquires a word vector corresponding to a document; semantic vector creation means that creates a semantic vector of the document based on the word vector of the document and the matrix corresponding to the description language of the document; similarity calculation means that, for a set of documents, calculates a similarity based on the semantic vector of each document; and multilingual matrix learning means that, in a collection of documents each described in one of the target languages, adjusts and learns the values of the matrices corresponding to the respective target languages so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship.
 The multilingual document similarity determination device of the present invention includes: multilingual matrix storage means that holds the matrix for each target language learned using the above multilingual document similarity learning device; word vector acquisition means that acquires a word vector corresponding to a document; semantic vector creation means that creates a semantic vector of the document based on the word vector of the document and the matrix corresponding to the description language of the document; similarity calculation means that, for a set of documents, calculates a similarity based on the semantic vector of each document; and similarity determination means that, in a collection of similarity determination target documents, determines the similarity between documents using the calculated similarity.
 The multilingual document similarity learning method of the present invention creates a semantic vector of a document based on the word vector corresponding to the document and the matrix, held for each target language, that corresponds to the description language of the document, and calculates a similarity for a set of documents based on the semantic vector of each document; in a collection of documents each described in one of the target languages, learning is then performed by adjusting the values of the matrices corresponding to the respective target languages so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship.
 The multilingual document similarity determination method of the present invention creates a semantic vector of a document based on the word vector corresponding to the document and the matrix, learned for each target language by the above multilingual document similarity learning method, that corresponds to the description language of the document, calculates a similarity for a set of documents based on the semantic vector of each document, and, in a collection of similarity determination target documents, determines the similarity between documents using the calculated similarity.
 The storage medium of the present invention stores a multilingual document similarity learning program that causes a computer device to execute: a word vector acquisition step of acquiring a word vector corresponding to a document; a semantic vector creation step of creating a semantic vector of the document based on the word vector of the document and the matrix, held for each target language, that corresponds to the description language of the document; a similarity calculation step of calculating, for a set of documents, a similarity based on the semantic vector of each document; and a multilingual matrix learning step of adjusting and learning, in a collection of documents each described in one of the target languages, the values of the matrices corresponding to the respective target languages so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship.
 Another storage medium of the present invention stores a multilingual document similarity determination program that causes a computer device to execute: a word vector acquisition step of acquiring a word vector corresponding to a document; a semantic vector creation step of creating a semantic vector of the document based on the word vector of the document and the matrix, learned for each target language by executing the multilingual document similarity learning program stored in the above storage medium, that corresponds to the description language of the document; a similarity calculation step of calculating, for a set of documents, a similarity based on the semantic vector of each document; and a similarity determination step of determining, in a collection of similarity determination target documents, the similarity between documents using the calculated similarity.
 The present invention can provide a technique for searching for similar documents more accurately and at a lower cost even when there are three or more languages in a multilingual document group.
 FIG. 1 is a functional block diagram of a multilingual document similarity learning device as a first embodiment of the present invention.
 FIG. 2 is a hardware configuration diagram of the multilingual document similarity learning device as the first embodiment of the present invention.
 FIG. 3 is a flowchart explaining the operation of the multilingual document similarity learning device as the first embodiment of the present invention.
 FIG. 4 is a functional block diagram of a multilingual document similarity determination device as a second embodiment of the present invention.
 FIG. 5 is a flowchart explaining the operation of the multilingual document similarity determination device as the second embodiment of the present invention.
 Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
 (First embodiment)
 FIG. 1 is a diagram showing the functional block configuration of a multilingual document similarity learning device 1 as the first embodiment of the present invention. In FIG. 1, the multilingual document similarity learning device 1 includes a multilingual matrix storage unit 11, a word vector acquisition unit 12, a semantic vector creation unit 13, a similarity calculation unit 14, and a multilingual matrix learning unit 15.
 FIG. 2 is a diagram illustrating an example of the hardware configuration of the multilingual document similarity learning device 1. In FIG. 2, the multilingual document similarity learning device 1 is configured by a computer device. The computer device includes a CPU (Central Processing Unit) 1001, a RAM (Random Access Memory) 1002, a ROM (Read Only Memory) 1003, a storage device 1004, an input device 1005, and an output device 1006. In this case, the multilingual matrix storage unit 11 is configured by the storage device 1004. The word vector acquisition unit 12, the semantic vector creation unit 13, and the similarity calculation unit 14 are configured by the CPU 1001, which reads the computer program and various data stored in the ROM 1003 and the storage device 1004 into the RAM 1002 and executes them. The multilingual matrix learning unit 15 is configured by the input device 1005 and the CPU 1001, which reads the computer program and various data stored in the ROM 1003 and the storage device 1004 into the RAM 1002 and executes them. Note that the hardware configuration of the multilingual document similarity learning device 1 and its functional blocks is not limited to the above-described configuration.
 The multilingual matrix storage unit 11 holds a matrix for each target language. Each matrix is a weight matrix for converting a word vector of a document described in that target language into a semantic vector. The word vector and the semantic vector will be described later. For example, the matrices may have the same number of columns. In that case, the number of columns is the number of dimensions of the semantic vector, and the number of rows of each matrix may be the number of dimensions of the word vector described later.
 The word vector acquisition unit 12 acquires a word vector corresponding to a document. The word vector is a concept generally used when calculating the similarity of documents, and is an expression format that represents a document by the set of words included in it. The number of dimensions of the word vector may be, for example, the number of words used in the target language describing the document (hereinafter also referred to as the description language). For example, the word vector acquisition unit 12 may create a word vector by a known technique based on a given document. Alternatively, the word vector acquisition unit 12 may acquire a word vector generated in advance for the document from the storage device 1004, the input device 1005, or the like.
The semantic vector creation unit 13 creates a semantic vector of a document based on the word vector of the document and the matrix held in the multilingual matrix storage unit 11 for the description language of the document. Here, a semantic vector is information representing the semantic features of the document. For example, the semantic vector creation unit 13 may create, as the semantic vector of the document, the product of the word vector of the document and the matrix corresponding to the description language of the document.
The similarity calculation unit 14 calculates a similarity for a pair of documents based on the semantic vector of each document. For example, the similarity calculation unit 14 may calculate the inner product of the semantic vectors of the two documents as the similarity of the pair.
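As an illustration of these two operations, the following is a minimal Python/NumPy sketch of converting a word vector into a semantic vector and scoring a document pair by an inner product. The dimensionalities, language labels, and helper names are assumptions made for the example, not part of the embodiment itself.

```python
import numpy as np

N = 100                        # assumed semantic-vector dimensionality
D = {"en": 4000, "ja": 5000}   # assumed vocabulary size (word-vector dimensionality) per language

# One weight matrix per target language, as held by the multilingual matrix storage unit 11.
# Initialized to 0 here only as a placeholder; learned values are used in practice.
M = {lang: np.zeros((N, dim)) for lang, dim in D.items()}

def semantic_vector(word_vec, lang):
    # Semantic vector: product of the matrix for the description language and the word vector
    return M[lang] @ word_vec

def similarity(word_vec_a, lang_a, word_vec_b, lang_b):
    # Similarity of a document pair: inner product of the two semantic vectors
    return float(semantic_vector(word_vec_a, lang_a) @ semantic_vector(word_vec_b, lang_b))
```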
The multilingual matrix learning unit 15 uses the semantic vector creation unit 13 and the similarity calculation unit 14 on a set of documents, each written in one of the target languages, and learns by adjusting the values of the matrix corresponding to the description language of each document. Specifically, the multilingual matrix learning unit 15 learns each matrix so that the similarity of a pair of documents in a translation relationship becomes higher than the similarity of a pair of documents not in a translation relationship. For example, the multilingual matrix learning unit 15 may adjust the matrix values so that the similarity of a pair of documents in a translation relationship becomes higher than the similarity between one document of that pair and another document not in a translation relationship with it. Preferably, the multilingual matrix learning unit 15 learns the matrices in parallel.
Here, the set of documents used for learning by the multilingual matrix learning unit 15 is described. In the set of documents, there may be three or more target languages in which the documents are written. Such a set of documents is configured to include, at least in part, pairs of documents in a translation relationship, and also to include, at least in part, pairs of documents not in a translation relationship. The set of documents may be stored in advance in the storage device 1004, or may be input from the outside via the input device 1005, a network interface (not shown), or the like.
Further, for example, the multilingual matrix learning unit 15 may learn the matrices described above using stochastic gradient descent. In this case, at each step of stochastic gradient descent, the multilingual matrix learning unit 15 may randomly select from the set of documents a pair of documents in a translation relationship and a pair of documents not in a translation relationship.
The operation of the multilingual document similarity learning device 1 configured as described above will now be described with reference to FIG. 3. The multilingual document similarity learning device 1 starts the following operation when a set of documents, each written in one of the target languages, is input.
In FIG. 3, first, the word vector acquisition unit 12 acquires the corresponding word vector for each document in the set of documents, each written in one of the target languages (step S1).
Next, the multilingual matrix learning unit 15 instructs the semantic vector creation unit 13 to create a semantic vector for each document of the pairs in a translation relationship and the pairs not in a translation relationship in the set of documents (step S2). The semantic vector creation unit 13 creates each semantic vector based on the word vector of the document and the matrix corresponding to its description language.
Next, for each pair of documents in a translation relationship and each pair not in a translation relationship in the set of documents, the multilingual matrix learning unit 15 has the similarity calculation unit 14 calculate the similarity based on the semantic vectors of the documents (step S3).
Next, the multilingual matrix learning unit 15 adjusts the matrix corresponding to each description language so that the similarity of a pair of documents in a translation relationship becomes higher than the similarity of a pair of documents not in a translation relationship (step S4).
Next, if the matrix adjustment has converged (Yes in step S5), the multilingual matrix learning unit 15 ends the learning.
On the other hand, if the matrix adjustment has not converged (No in step S5), the operation of the multilingual document similarity learning device 1 returns to step S2, and the device repeats the operation from step S2 using the matrices adjusted in the previous step S4.
Next, effects of the first exemplary embodiment of the present invention will be described.
The multilingual document similarity learning device according to the first exemplary embodiment of the present invention can learn, at lower cost and with higher accuracy, the matrices used to create the semantic vectors of documents for similar-document determination in a multilingual document group, even when there are three or more languages.
The reason is that the multilingual matrix storage unit holds, for each target language, a matrix for converting a word vector of a document into a semantic vector. In addition, the word vector acquisition unit acquires a word vector corresponding to a document, and the semantic vector creation unit creates a semantic vector based on the word vector of the document and the matrix corresponding to the description language of the document. Then, the multilingual matrix learning unit learns the matrix corresponding to each target language so that, in a set of documents each written in one of the target languages, the similarity of a pair of documents in a translation relationship becomes higher than the similarity of a pair of documents not in a translation relationship.
In this way, the present exemplary embodiment prepares a matrix for creating semantic vectors for each target language, not for each language pair. Therefore, the present exemplary embodiment does not need to learn a matrix for each language pair. It only has to learn a matrix for each target language so that each dimension of a document's semantic vector carries the same meaning regardless of the language pair. As a result, in the present exemplary embodiment, the matrix obtained by learning for each description language is independent of the counterpart language.
Therefore, even when there are three or more target languages, the present exemplary embodiment does not need to learn a matrix for similarity determination for every combination of target languages; it only needs to learn one matrix per target language. As a result, the computational cost can be kept low.
Moreover, even when learning the matrix of a target language for which the absolute number of documents in the set is small, the present exemplary embodiment can obtain information from language pairs with a plurality of other target languages. A performance improvement can therefore be expected compared with learning a matrix for each language pair. Furthermore, the present exemplary embodiment can further improve learning accuracy by learning the matrices of the target languages in parallel.
Next, the operation of the first exemplary embodiment of the present invention will be illustrated with a specific example.
Here, the collection of information from news articles on the web is assumed. To make information collection more efficient, there is a need to group news articles with the same content into one, even when they are written in different languages. For this purpose, it is necessary to determine the similarity between news articles across languages. The learning of the matrices for calculating the similarity between news articles across languages is described below.
A large number of documents, some of which are in a translation relationship, are used for learning. The translation relationship here does not need to be a complete translation relationship, that is, a so-called parallel corpus. For example, the translation relationship may be a so-called comparable corpus, in which the same subject is described to some extent in different languages. As such a set of documents, a bilingual corpus used in statistical machine translation research may be used, or the various language editions of Wikipedia may be used.
The multilingual matrix storage unit 11 then stores as many matrices as there are target languages for which similarity is to be measured in the above set of documents. The matrix M_i corresponding to language i is an N × D_i matrix. N is the number of dimensions of the semantic vector. To make each dimension of the semantic vector carry the same meaning regardless of language, N is preferably the same size for all languages. Empirically, N of around 100 to several hundred works well. D_i is the number of words used in language i, and may differ from language to language. The initial value of each M_i is set to, for example, 0.
First, the word vector acquisition unit 12 converts each document in the above set of documents into a word vector. As described above, a word vector is a concept commonly used when computing document similarity, and is a representation that expresses a document by the set of words it contains. The simplest word vector is a vector whose elements indicate the presence or absence of each word by 0 or 1 (for example, a representation such as [1, 0, 1, 0]). Other word vectors are based on TF (Term Frequency) * IDF (Inverse Document Frequency), which weights each word from the viewpoint of computing similarity. Methods that first reduce the dimensionality of the word vector using techniques such as LSI (Latent Semantic Indexing) or LDA (Latent Dirichlet Allocation) are also known. Word N-grams or character N-grams may be used instead of words. Here, the TF*IDF of words is used as the word vector. As a result, for the j-th document in language i, a word vector d_ij with D_i dimensions is created.
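As one possible way to obtain such TF*IDF word vectors, the following sketch uses scikit-learn's TfidfVectorizer with one vectorizer, and hence one vocabulary of size D_i, per language. The corpora are hypothetical, and the Japanese text is assumed to have been tokenized in advance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical document collections; Japanese is assumed to be pre-tokenized and whitespace-separated
corpora = {
    "en": ["stock prices rose sharply today", "election results were announced tonight"],
    "ja": ["株価 が 今日 急騰 した", "選挙 の 結果 が 今夜 発表 された"],
}

vectorizers = {lang: TfidfVectorizer() for lang in corpora}
word_vectors = {lang: vectorizers[lang].fit_transform(docs) for lang, docs in corpora.items()}

# word_vectors[lang] is a (number of documents) x D_lang matrix;
# row j corresponds to the word vector d_ij of the j-th document in that language.
print(word_vectors["en"].shape)
```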
Under the control of the multilingual matrix learning unit 15, the semantic vector creation unit 13 calculates the product of the word vector of a document and the matrix corresponding to the description language of that document as the semantic vector of the document. Specifically, for the j-th document in language i, the semantic vector creation unit 13 calculates M_i · d_ij, the product of the corresponding word vector d_ij and the current matrix M_i of language i, and uses it as the semantic vector. The number of dimensions of M_i · d_ij is N, regardless of the document or the language.
The multilingual matrix learning unit 15 then learns each M_i using the semantic vector creation unit 13 and the similarity calculation unit 14. The basic idea of the learning is to adjust the matrices for a plurality of description languages in parallel so that the similarity of a pair of documents in a translation relationship becomes higher than the similarity of a pair of documents not in a translation relationship. Specifically, for a document q in language i_q, a document d+ in language i_+ that is in a translation relationship with q, and a document d− in language i_− that is not in a translation relationship with q, all taken from the above set of documents, the multilingual matrix learning unit 15 adjusts the matrices M_iq, M_i+, and M_i− in parallel so as to satisfy the following expression (2):

(M_{i_q} q)^T (M_{i_+} d^+) > (M_{i_q} q)^T (M_{i_-} d^-)   ... (2)
Here, it is generally known that performance improves when the adjustment makes the similarity between documents q and d+ larger than the similarity between documents q and d− by at least a certain margin. The multilingual matrix learning unit 15 therefore performs the adjustment so as to minimize the margin-based loss function of the following expression (3):

\sum_{(q, d^+, d^-) \in R} \max(0,\; 1 - f(q, d^+) + f(q, d^-))   ... (3)
Here, R is the set of combinations, taken from the input document set, of a document with a document that is in a translation relationship with it or a document that is not in a translation relationship with it. Further, f(q, d) denotes the similarity between documents q and d. That is, when the language of document q is i_q and the language of document d is i_d, f(q, d) = (M_iq · q)^T · (M_id · d).
One way to minimize the above loss function is to use stochastic gradient descent. In this case, at each step of stochastic gradient descent, the multilingual matrix learning unit 15 randomly selects a triple q, d+, d−, and if 1 − f(q, d+) + f(q, d−) > 0, updates the matrices M (M_iq, M_i+, M_i−) as in the following expressions (4) to (6), where λ denotes the learning rate:

M_{i_q} \leftarrow M_{i_q} + \lambda (M_{i_+} d^+ - M_{i_-} d^-) q^T   ... (4)

M_{i_+} \leftarrow M_{i_+} + \lambda (M_{i_q} q) (d^+)^T   ... (5)

M_{i_-} \leftarrow M_{i_-} - \lambda (M_{i_q} q) (d^-)^T   ... (6)
In this way, the multilingual matrix learning unit 15 repeats the random selection of documents and the resulting adjustment of the matrices M until convergence.
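A minimal NumPy sketch of one such update step is shown below. It assumes the reconstructed expressions (4) to (6) above, dense word vectors, three distinct languages in the sampled triple, and an assumed learning rate; it is an illustration under those assumptions, not the claimed procedure itself.

```python
import numpy as np

LAMBDA = 0.01  # assumed learning rate

def f(M, lang_q, q, lang_d, d):
    # f(q, d) = (M_iq q)^T (M_id d)
    return float((M[lang_q] @ q) @ (M[lang_d] @ d))

def sgd_step(M, lang_q, q, lang_pos, d_pos, lang_neg, d_neg):
    """One stochastic-gradient step on the margin loss for a randomly chosen triple (q, d+, d-)."""
    sq, sp, sn = M[lang_q] @ q, M[lang_pos] @ d_pos, M[lang_neg] @ d_neg
    if 1.0 - float(sq @ sp) + float(sq @ sn) <= 0.0:
        return  # margin already satisfied; no update for this triple
    # Updates corresponding to expressions (4) to (6)
    M[lang_q]   += LAMBDA * np.outer(sp - sn, q)
    M[lang_pos] += LAMBDA * np.outer(sq, d_pos)
    M[lang_neg] -= LAMBDA * np.outer(sq, d_neg)
```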
This completes the description of the specific example.
As described above, in this specific example, the multilingual matrix storage unit 11 stores, for each language i, a matrix M_i for converting a word vector of a document into a semantic vector. Furthermore, the multilingual matrix learning unit 15 learns the matrices M_i for a plurality of languages i in parallel so that the similarity of a pair of documents in a translation relationship becomes higher than the similarity of a pair of documents not in a translation relationship. This makes it possible to learn so that each dimension of the semantic vector carries the same meaning regardless of the language pair. As a result, in this specific example, the matrix M_i obtained by learning for a target language i is independent of the counterpart language.
On the other hand, the related technique described in Patent Document 2 calculates the similarity of a pair of documents D_aj and D_bk by expression (1) above. That is, putting U = M_ab and V = M_ba in expression (1), it can be written as

f(D_{aj}, D_{bk}) = (M_{ab} D_{aj})^T (M_{ba} D_{bk})

In Patent Document 2, for a language a, a plurality of matrices such as M_ab, M_ac, and so on had to be held and learned, one for each counterpart language b, c, and so on. This is because the meaning of each dimension of the semantic vector (M · d) differed for each language pair.
In contrast, the specific example of the present exemplary embodiment treats the semantic vector (M · d) as a vector that is independent of the counterpart language. Thus, this specific example prepares only one matrix M_i for each language i, and the multilingual matrix learning unit learns the M_i so that each dimension of the semantic vector carries the same meaning regardless of the language pair. This reduces the number of matrices M from n × (n − 1) to n. Consequently, the multilingual document similarity learning device in this specific example does not need to hold and learn a matrix for each language pair, and the computational cost can be kept low.
Moreover, even when learning the matrix of a language for which the absolute number of documents is small, the multilingual document similarity learning device in this specific example can obtain information from pairs of a document written in that language and documents written in each of a plurality of other languages. The multilingual document similarity learning device in this specific example can therefore improve learning performance.
This completes the description of the specific example of the operation according to the first exemplary embodiment of the present invention.
(Second Exemplary Embodiment)
Next, a second exemplary embodiment of the present invention will be described in detail with reference to the drawings. In the drawings referred to in the description of the present exemplary embodiment, the same reference signs are given to the same configurations and to steps that operate in the same manner as in the first exemplary embodiment of the present invention, and their detailed description is omitted in the present exemplary embodiment.
FIG. 4 is a diagram showing the functional block configuration of the multilingual document similarity determination device 2 according to the second exemplary embodiment of the present invention. In FIG. 4, the multilingual document similarity determination device 2 includes a multilingual matrix storage unit 11, a word vector acquisition unit 12, a semantic vector creation unit 13, a similarity calculation unit 14, and a similarity determination unit 26. Here, the multilingual document similarity determination device 2 can be configured with the same hardware elements as the multilingual document similarity learning device 1 according to the first exemplary embodiment of the present invention described with reference to FIG. 2. In this case, the similarity determination unit 26 is implemented by the output device 1006 and by the CPU 1001, which loads the computer programs and various data stored in the ROM 1003 and the storage device 1004 into the RAM 1002 and executes them. The hardware configuration of the multilingual document similarity determination device 2 and of its functional blocks is not limited to the configuration described above.
The multilingual matrix storage unit 11 holds the matrix for each target language learned by the multilingual document similarity learning device 1 according to the first exemplary embodiment of the present invention. The word vector acquisition unit 12, the semantic vector creation unit 13, and the similarity calculation unit 14 are configured in the same manner as in the first exemplary embodiment of the present invention.
The similarity determination unit 26 determines the similarity of documents in a set of documents subject to similarity determination, using the similarities calculated by the similarity calculation unit 14. The set of documents subject to similarity determination may be a pair of documents. In this case, the similarity determination unit 26 may determine that the pair of documents to be judged is similar if the similarity is equal to or greater than a threshold, and not similar if the similarity is below the threshold. The set of documents subject to similarity determination may also be a set of three or more documents. In this case, for example, the similarity determination unit 26 may perform clustering of the documents based on the similarities as the similarity determination over the set of documents to be judged. Alternatively, for example, the similarity determination unit 26 may rank, for a given document, the documents similar to it. The similarity determination unit 26 may output the determination result to the output device 1006.
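For example, the pairwise determination and the ranking described above might look like the following sketch, assuming that semantic vectors have already been created for each document as NumPy arrays and that the threshold value is an assumed choice.

```python
import numpy as np

def is_similar(sem_a, sem_b, threshold=0.5):
    # Pairwise determination: similar when the inner product reaches the threshold (value assumed)
    return float(sem_a @ sem_b) >= threshold

def rank_similar(query_sem, candidate_sems):
    # Ranking: order candidate documents by similarity to the query document, highest first
    scores = [float(query_sem @ c) for c in candidate_sems]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Toy usage with random semantic vectors
rng = np.random.default_rng(0)
docs = [rng.normal(size=100) for _ in range(4)]
print(is_similar(docs[0], docs[1]), rank_similar(docs[0], docs[1:]))
```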
The set of documents subject to similarity determination may be stored in advance in the storage device 1004, or may be input from the outside via the input device 1005, a network interface (not shown), or the like.
The operation of the multilingual document similarity determination device 2 configured as described above will be described with reference to FIG. 5. The multilingual document similarity determination device 2 starts the following operation when a set of documents whose similarity is to be determined is input.
In FIG. 5, first, the word vector acquisition unit 12 acquires a word vector for each document in the set of documents whose similarity is to be determined (step S11).
Next, for each document, the semantic vector creation unit 13 creates the semantic vector of the document based on the word vector of the document and the matrix corresponding to the description language of the document (step S12).
Next, the similarity calculation unit 14 calculates the similarity of an arbitrary pair of documents in the set of documents (step S13).
Next, the similarity determination unit 26 performs the similarity determination based on the obtained similarities and outputs the determination result (step S14). As described above, the similarity determination unit 26 may output, as the determination result, information indicating whether an arbitrary pair of documents is similar by comparing the similarity with a threshold. The similarity determination unit 26 may also perform clustering or ranking of the documents using the similarities and output the result as the determination result.
With this, the multilingual document similarity determination device 2 ends its operation.
Next, effects of the second exemplary embodiment of the present invention will be described.
The multilingual document similarity determination device according to the second exemplary embodiment of the present invention can determine similar documents in a multilingual document group more accurately and at lower cost, even when there are three or more languages.
The reason is that the multilingual matrix storage unit holds, for each target language, a matrix learned for converting a word vector of a document into a semantic vector. The word vector acquisition unit acquires a word vector for a document. The semantic vector creation unit creates a semantic vector for the document based on the word vector and the matrix corresponding to its description language. The similarity calculation unit calculates the similarity of a pair of documents based on their semantic vectors. Then, the similarity determination unit performs the similarity determination based on the similarities over the set of documents subject to similarity determination.
As described above, the matrix for each target language held in the multilingual matrix storage unit has been learned by the multilingual document similarity learning device according to the first exemplary embodiment of the present invention, and is therefore independent of the counterpart language. Accordingly, to find other documents similar to a given document, the similarity calculation unit does not need to compute as many semantic vectors for that document as there are description languages in the group of documents to be compared. In other words, the similarity calculation unit only has to compute one semantic vector per document, regardless of the number of description languages of the comparison documents. The computational cost of similarity determination is therefore low. In addition, such semantic vectors are created so that each dimension carries the same meaning regardless of the language pair, so the similarity calculated from the semantic vectors is highly accurate.
Next, the operation of the second exemplary embodiment of the present invention will be illustrated with a specific example.
Here, an example of clustering news articles (documents) on the web across languages will be described.
It is assumed that the multilingual matrix storage unit 11 holds the matrix M_i for each language learned in the specific example of the first exemplary embodiment of the present invention. It is also assumed that a group of news articles collected from the web is input to the multilingual document similarity determination device 2 as the set of documents to be clustered (the set subject to similarity determination).
First, the word vector acquisition unit 12 converts each document in the set of documents to be clustered into a word vector. The conversion method is the same as in the specific example of the first exemplary embodiment of the present invention.
Next, the semantic vector creation unit 13 creates the semantic vector of each document by taking the product of the created word vector and the matrix M_i for the corresponding language held in the multilingual matrix storage unit 11. The creation method is the same as in the specific example of the first exemplary embodiment of the present invention.
Next, the similarity calculation unit 14 obtains the similarity of every pair of documents in the set of documents to be clustered by taking the inner product of their semantic vectors.
Then, the similarity determination unit 26 performs clustering by assigning pairs of documents whose similarity is equal to or greater than a threshold to the same cluster.
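A minimal sketch of this threshold-based clustering is shown below. Grouping documents into connected components of the similarity graph is one possible reading of "assigning pairs above the threshold to the same cluster", and the threshold value is assumed; the semantic vectors are assumed to be NumPy arrays.

```python
def cluster_by_threshold(semantic_vectors, threshold):
    """Put every pair of documents whose inner-product similarity reaches the threshold
    into the same cluster (connected components of the similarity graph)."""
    n = len(semantic_vectors)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a in range(n):
        for b in range(a + 1, n):
            if float(semantic_vectors[a] @ semantic_vectors[b]) >= threshold:
                parent[find(a)] = find(b)  # merge the clusters of documents a and b

    clusters = {}
    for idx in range(n):
        clusters.setdefault(find(idx), []).append(idx)
    return list(clusters.values())
```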
As described above, in this specific example, the multilingual matrix storage unit 11 holds, for each language, a matrix M_i for converting a word vector of a document into a semantic vector. Since this matrix M_i has been learned by the multilingual document similarity learning device 1 in the specific example of the first exemplary embodiment of the present invention, it is independent of the counterpart language. Therefore, when calculating the similarity of every pair of documents, the similarity calculation unit 14 only has to use the one semantic vector computed for each document, and no per-language-pair computation is required.
On the other hand, in the related technique described in Patent Document 2 above, it was necessary to compute M_ix · d_ij for as many languages as there are among the comparison documents (x denotes the language to be compared against).
In contrast, in this specific example, when a document d_ij is compared with a group of other documents, only one semantic vector M_i · d_ij needs to be obtained, even if the comparison documents are written in a plurality of languages. This specific example can therefore keep the computational cost of the similarity calculation low.
This completes the description of the specific example of the operation according to the second exemplary embodiment of the present invention.
In each exemplary embodiment of the present invention described above, the matrices held for the target languages were described mainly using an example in which their numbers of columns are equal to each other. Alternatively, the matrices may have equal numbers of rows. In that case, the number of columns of each matrix may be set to the number of dimensions of the word vector of a document written in the corresponding language.
Further, in each exemplary embodiment of the present invention described above, the semantic vector creation unit was described mainly as calculating, as the semantic vector, the product of the word vector of a document and the matrix corresponding to its description language. Alternatively, the semantic vector creation unit may use another calculation method that creates the semantic vector based on the word vector of the document and the matrix corresponding to its description language. Likewise, the similarity calculation unit was described mainly as calculating, as the similarity of a pair of documents, the inner product of their semantic vectors. Alternatively, the similarity calculation unit may use another calculation method that calculates the similarity based on the semantic vectors of the documents.
The multilingual document similarity learning device and the multilingual document similarity determination device according to the exemplary embodiments of the present invention described above may be realized on the same device.
Further, in each exemplary embodiment of the present invention described above, the functional blocks of the multilingual document similarity learning device and the multilingual document similarity determination device were described mainly as being realized by a CPU executing computer programs stored in a storage device or a ROM. Alternatively, some, all, or a combination of the functional blocks may be realized by dedicated hardware.
Further, in each exemplary embodiment of the present invention described above, the functional blocks of the multilingual document similarity learning device or the multilingual document similarity determination device may be distributed across a plurality of devices.
Further, in each exemplary embodiment of the present invention described above, the operations of the multilingual document similarity learning device and the multilingual document similarity determination device described with reference to the flowcharts may be stored, as a computer program of the present invention, in a storage device (storage medium) of a computer device, and the CPU may read and execute that computer program. In such a case, the present invention is constituted by the code of the computer program or by the storage medium.
The exemplary embodiments described above may also be implemented in appropriate combination.
The present invention has been described above using the exemplary embodiments as exemplary examples. However, the present invention is not limited to the exemplary embodiments described above; various aspects that can be understood by those skilled in the art can be applied to the present invention within its scope.
This application claims priority based on Japanese Patent Application No. 2014-67359 filed on March 28, 2014, the entire disclosure of which is incorporated herein.
DESCRIPTION OF SYMBOLS
1  Multilingual document similarity learning device
2  Multilingual document similarity determination device
11  Multilingual matrix storage unit
12  Word vector acquisition unit
13  Semantic vector creation unit
14  Similarity calculation unit
15  Multilingual matrix learning unit
26  Similarity determination unit
1001  CPU
1002  RAM
1003  ROM
1004  Storage device
1005  Input device
1006  Output device

Claims (10)

1.  A multilingual document similarity learning device comprising:
    multilingual matrix storage means for holding a matrix for each target language;
    word vector acquisition means for acquiring a word vector corresponding to a document;
    semantic vector creation means for creating a semantic vector of the document based on the word vector of the document and the matrix corresponding to a description language of the document;
    similarity calculation means for calculating, for a pair of documents, a similarity based on the semantic vector of each document; and
    multilingual matrix learning means for learning, in a set of documents each written in one of the target languages, by adjusting values of the matrix corresponding to each of the target languages so that the similarity of a pair of documents in a translation relationship becomes higher than the similarity of a pair of documents not in a translation relationship.
2.  The multilingual document similarity learning device according to claim 1, wherein the multilingual matrix learning means performs the learning of the matrices corresponding to the respective target languages in parallel.
3.  The multilingual document similarity learning device according to claim 1 or 2, wherein
    the multilingual matrix storage means holds, as the matrices for the respective target languages, matrices whose numbers of rows or columns are equal to each other,
    the semantic vector creation means creates, as the semantic vector of the document, a product of the word vector of the document and the matrix corresponding to the description language of the document, and
    the similarity calculation means calculates, as the similarity of the pair of documents, an inner product of the semantic vectors of the documents.
4.  The multilingual document similarity learning device according to any one of claims 1 to 3, wherein the multilingual matrix learning means learns each of the matrices so that the similarity of the pair of documents in a translation relationship becomes higher than the similarity between one document of that pair and another document not in a translation relationship with that document.
5.  The multilingual document similarity learning device according to any one of claims 1 to 4, wherein the multilingual matrix learning means learns each of the matrices using stochastic gradient descent so that the similarity of the pair of documents in a translation relationship becomes higher than the similarity of the pair of documents not in a translation relationship, and, at each step of the stochastic gradient descent, randomly selects from the set of documents a pair of documents in a translation relationship and a pair of documents not in a translation relationship.
6.  A multilingual document similarity determination device comprising:
    multilingual matrix storage means for holding the matrix for each target language learned by using the multilingual document similarity learning device according to any one of claims 1 to 5;
    word vector acquisition means for acquiring a word vector corresponding to a document;
    semantic vector creation means for creating a semantic vector of the document based on the word vector of the document and the matrix corresponding to a description language of the document;
    similarity calculation means for calculating, for a pair of documents, a similarity based on the semantic vector of each document; and
    similarity determination means for determining, in a set of documents subject to similarity determination, a similarity between documents using the calculated similarity.
7.  A multilingual document similarity learning method comprising:
    creating a semantic vector of a document based on a word vector corresponding to the document and a matrix, held for each target language, corresponding to a description language of the document, and calculating, for a pair of documents, a similarity based on the semantic vector of each document; and
    thereby learning, in a set of documents each written in one of the target languages, by adjusting values of the matrix corresponding to each of the target languages so that the similarity of a pair of documents in a translation relationship becomes higher than the similarity of a pair of documents not in a translation relationship.
8.  A multilingual document similarity determination method comprising:
    creating a semantic vector of a document based on a word vector corresponding to the document and the matrix for each target language learned by the multilingual document similarity learning method according to claim 7 and corresponding to a description language of the document, and calculating, for a pair of documents, a similarity based on the semantic vector of each document; and
    thereby determining, in a set of documents subject to similarity determination, a similarity between documents using the calculated similarity.
9.  A storage medium storing a multilingual document similarity learning program for causing a computer device to execute, using a matrix held for each target language:
    a word vector acquisition step of acquiring a word vector corresponding to a document;
    a semantic vector creation step of creating a semantic vector of the document based on the word vector of the document and the matrix corresponding to a description language of the document;
    a similarity calculation step of calculating, for a pair of documents, a similarity based on the semantic vector of each document; and
    a multilingual matrix learning step of learning, in a set of documents each written in one of the target languages, by adjusting values of the matrix corresponding to each of the target languages so that the similarity of a pair of documents in a translation relationship becomes higher than the similarity of a pair of documents not in a translation relationship.
10.  A storage medium storing a multilingual document similarity determination program for causing a computer device to execute, using the matrix for each target language learned by executing the multilingual document similarity learning program stored in the storage medium according to claim 9:
    a word vector acquisition step of acquiring a word vector corresponding to a document;
    a semantic vector creation step of creating a semantic vector of the document based on the word vector of the document and the matrix corresponding to a description language of the document;
    a similarity calculation step of calculating, for a pair of documents, a similarity based on the semantic vector of each document; and
    a similarity determination step of determining, in a set of documents subject to similarity determination, a similarity between documents using the calculated similarity.