WO2015145981A1 - Multilingual document-similarity-degree learning device, multilingual document-similarity-degree determination device, multilingual document-similarity-degree learning method, multilingual document-similarity-degree determination method, and storage medium - Google Patents

Multilingual document-similarity-degree learning device, multilingual document-similarity-degree determination device, multilingual document-similarity-degree learning method, multilingual document-similarity-degree determination method, and storage medium Download PDF

Info

Publication number
WO2015145981A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
similarity
documents
matrix
multilingual
Prior art date
Application number
PCT/JP2015/001028
Other languages
French (fr)
Japanese (ja)
Inventor
定政 邦彦 (Kunihiko Sadamasa)
Original Assignee
日本電気株式会社 (NEC Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社 (NEC Corporation)
Priority to JP2016509952A, published as JPWO2015145981A1
Publication of WO2015145981A1 publication Critical patent/WO2015145981A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing

Definitions

  • the present invention relates to a technique for finding documents whose contents are similar to each other in a multilingual document group in which documents of different languages are mixed.
  • A technique related to such a problem is described in Patent Document 1.
  • In this related technique, a target document is first machine-translated into a reference language, for example English, and then documents having similar contents are grouped using a technique such as clustering.
  • Patent Document 2 proposes a framework called SSI (supervised semantic indexing) that learns a model representing the similarity of contents between two languages directly from a bilingual corpus without going through an intermediate result of machine translation. This related technique learns and determines the similarity of documents between two languages as follows.
  • first, in this related technique, each document d ij in each language is expressed as a bag-of-words vector over a document set that is in a parallel translation relationship across languages (the dimension numbers D 1 and D 2 for each language may be chosen freely).
  • the subscript i represents the language type.
  • the subscript j represents the document ID for each language.
  • this related technique then prepares a matrix W for learning the correspondence between the language pair. W is a D 1 × D 2 matrix (D 1 rows, D 2 columns).
  • since the number of parameters to be learned is large, U and V satisfying W = U T · V are learned instead, for dimensional compression.
  • U T represents the transposed matrix of U.
  • U and V are matrices of N × D 1 and N × D 2 , respectively.
  • a value such as N = 100 is applied.
  • An equation for calculating the score of the document pair is shown in the following equation (1).
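Equation (1) itself appears only as an image in the original publication. Based on the surrounding definitions (W = U T · V, with U and V of sizes N × D 1 and N × D 2), it most likely has the following form, given here as a reconstruction rather than a verbatim copy:

```latex
% Hedged reconstruction of equation (1); the original is shown only as an image.
f(d_{1j}, d_{2k}) \;=\; d_{1j}^{\top} \, W \, d_{2k} \;=\; d_{1j}^{\top} U^{\top} V \, d_{2k} \;=\; (U d_{1j})^{\top} (V d_{2k}) \qquad (1)
```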
  • Patent Document 3 describes another technique related to such a problem.
  • This related technique searches a plurality of documents described in different languages for those that are semantically approximate to a search request described in a certain language.
  • a word dictionary database is prepared in advance.
  • the word dictionary database associates a group of synonymous words Wi across natural languages A, B, C, D, ... with a single word feature vector Vi.
  • this related technique normalizes the sum of word feature vectors related to words included in each document, and calculates it as a document feature vector.
  • this related technique normalizes the sum of word feature vectors related to each word included in the search request and calculates it as a search request feature vector.
  • This related technique calculates the inner product of the search request feature vector and the document feature vector of each document as a semantic approximation.
  • This related technique searches for a document having a large semantic similarity as a document that approximates the search request.
  • however, since the related technology described in Patent Document 1 is a framework that goes through the intermediate state of machine translation results, there is a problem in that the accuracy of the grouping is not high when the accuracy of the machine translation itself is not necessarily high.
  • the related technology described in Patent Document 2 has no problem between two languages, but a problem arises with three or more languages: the matrix W = U T · V for determining similarity between languages must be prepared for each partner language, so matrices must be computed and held for every language pair, and when a document in one language is compared against documents in other languages, W · d must be computed once per comparison-target language, which raises the calculation cost.
  • Patent Document 3 does not describe how to learn the single word feature vector Vi for a group of synonymous words Wi. Also, considering the existence of polysemous words, the number of combinations of synonymous words Wi may become enormous. Therefore, with this related technology, the cost of maintaining and learning the word dictionary database increases. Furthermore, determining the degree of approximation of a document via word-level synonym groups is equivalent to going through word-level machine translation. Therefore, this related technique has a problem that the accuracy of the semantic approximation is not high when the accuracy of the synonym groups (machine translation) is not necessarily high.
  • an object of the present invention is to provide a technique for searching for similar documents more accurately at a lower cost even when there are three or more languages in a multilingual document group.
  • the multilingual document similarity learning device of the present invention includes: multilingual matrix storage means that holds a matrix for each target language; word vector acquisition means that acquires a word vector corresponding to a document; semantic vector creation means that creates a semantic vector of the document based on the word vector of the document and the matrix corresponding to the description language of the document; similarity calculation means that, for a set of documents, calculates a similarity based on the semantic vector of each document; and multilingual matrix learning means that, in a collection of documents each described in one of the target languages, adjusts and learns the values of the matrices corresponding to the respective target languages so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship.
  • the multilingual document similarity determination device of the present invention includes: multilingual matrix storage means that holds the matrix for each target language learned using the above multilingual document similarity learning device; word vector acquisition means that acquires a word vector corresponding to a document; semantic vector creation means that creates a semantic vector of the document based on the word vector of the document and the matrix corresponding to the description language of the document; similarity calculation means that, for a set of documents, calculates a similarity based on the semantic vector of each document; and similarity determination means that, in a collection of similarity determination target documents, determines the similarity between documents using the calculated similarity.
  • the multilingual document similarity learning method of the present invention creates a semantic vector of a document based on the word vector corresponding to the document and the matrix, held for each target language, that corresponds to the description language of the document, and calculates a similarity for a set of documents based on the semantic vector of each document; in a collection of documents each described in one of the target languages, learning is then performed by adjusting the values of the matrices corresponding to the respective target languages so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship.
  • the multilingual document similarity determination method of the present invention creates a semantic vector of a document based on the word vector corresponding to the document and the matrix, learned for each target language by the above multilingual document similarity learning method, that corresponds to the description language of the document, calculates a similarity for a set of documents based on the semantic vector of each document, and, in a collection of similarity determination target documents, determines the similarity between documents using the calculated similarity.
  • the storage medium of the present invention stores a multilingual document similarity learning program that causes a computer device to execute: a word vector acquisition step of acquiring a word vector corresponding to a document; a semantic vector creation step of creating a semantic vector of the document based on the word vector of the document and the matrix, held for each target language, that corresponds to the description language of the document; a similarity calculation step of calculating, for a set of documents, a similarity based on the semantic vector of each document; and a multilingual matrix learning step of adjusting and learning, in a collection of documents each described in one of the target languages, the values of the matrices corresponding to the respective target languages so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship.
  • another storage medium of the present invention stores a multilingual document similarity determination program that causes a computer device to execute: a word vector acquisition step of acquiring a word vector corresponding to a document; a semantic vector creation step of creating a semantic vector of the document based on the word vector of the document and the matrix, learned for each target language by executing the multilingual document similarity learning program stored in the above storage medium, that corresponds to the description language of the document; a similarity calculation step of calculating, for a set of documents, a similarity based on the semantic vector of each document; and a similarity determination step of determining, in a collection of similarity determination target documents, the similarity between documents using the calculated similarity.
  • the present invention can provide a technique for searching for similar documents more accurately at a lower cost even when there are three or more languages in a multilingual document group.
  • FIG. 1 is a diagram showing a functional block configuration of a multilingual document similarity learning apparatus 1 as a first embodiment of the present invention.
  • a multilingual document similarity learning device 1 includes a multilingual matrix storage unit 11, a word vector acquisition unit 12, a semantic vector creation unit 13, a similarity calculation unit 14, and a multilingual matrix learning unit 15.
  • FIG. 2 is a diagram illustrating an example of a hardware configuration of the multilingual document similarity learning apparatus 1.
  • the multilingual document similarity learning device 1 is configured by a computer device.
  • the computer device includes a CPU (Central Processing Unit) 1001, a RAM (Random Access Memory) 1002, a ROM (Read Only Memory) 1003, a storage device 1004, an input device 1005, and an output device 1006.
  • the multilingual matrix storage unit 11 is configured by the storage device 1004.
  • the word vector acquisition unit 12, the semantic vector creation unit 13, and the similarity calculation unit 14 are configured by a CPU 1001 that reads a computer program and various data stored in the ROM 1003 and the storage device 1004 into the RAM 1002 and executes them.
  • the multilingual matrix learning unit 15 includes an input device 1005 and a CPU 1001 that reads a computer program and various data stored in the ROM 1003 and the storage device 1004 into the RAM 1002 and executes them. Note that the hardware configuration of the multilingual document similarity learning device 1 and each functional block thereof is not limited to the above-described configuration.
  • the multilingual matrix storage unit 11 holds a matrix for each target language.
  • Each matrix is a weight matrix for converting a word vector of a document described in the target language into a semantic vector.
  • the word vector and the semantic vector will be described later.
  • each matrix may have the same number of columns. In that case, the number of columns is the number of dimensions of the semantic vector. In this case, the number of rows in each matrix may be the number of dimensions of a word vector described later.
  • the word vector acquisition unit 12 acquires a word vector corresponding to the document.
  • the word vector is a concept that is generally used when calculating the similarity of documents, and is an expression format that represents a document by a set of words included in the document.
  • the number of dimensions of the word vector may be, for example, the number of words used in a target language describing the document (hereinafter also referred to as a description language).
  • the word vector acquisition unit 12 may create a word vector by a known technique based on a given document.
  • alternatively, the word vector acquisition unit 12 may acquire a word vector generated in advance for the document from the storage device 1004, the input device 1005, or the like.
  • the semantic vector creation unit 13 creates the semantic vector of the document based on the word vector of the document and the matrix held in the multilingual matrix storage unit 11 corresponding to the description language of the document.
  • the semantic vector is information representing the semantic features of the document.
  • the semantic vector creation unit 13 may create a product of a word vector of a document and a matrix corresponding to the description language of the document as the semantic vector of the document.
  • the similarity calculation unit 14 calculates a similarity for a set of documents based on the semantic vector of each document. For example, the similarity calculation unit 14 may calculate the inner product of the semantic vectors of each document as the similarity of a set of documents.
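As a concrete illustration, the sketch below shows how a semantic vector and a pairwise similarity could be computed. NumPy is assumed, the matrix orientation follows the specific example given later (the matrix for language i is N × D i and the semantic vector is the product M i · d ij), and all sizes and names are illustrative assumptions rather than values taken from the text.

```python
import numpy as np

def semantic_vector(lang_matrix, word_vec):
    """Semantic vector of a document: product of the language matrix (N x D_i)
    and the document's word vector (length D_i). The result always has N
    dimensions, regardless of the description language."""
    return lang_matrix @ word_vec

def similarity(sem_a, sem_b):
    """Similarity of a pair of documents: inner product of their semantic vectors."""
    return float(sem_a @ sem_b)

# Illustrative usage with hypothetical vocabulary sizes.
N = 100                        # number of semantic-vector dimensions
D_en, D_ja = 50_000, 60_000    # D_i: number of words used in each language

rng = np.random.default_rng(0)
M = {"en": rng.normal(scale=0.01, size=(N, D_en)),
     "ja": rng.normal(scale=0.01, size=(N, D_ja))}

d_en = np.zeros(D_en); d_en[[10, 42, 99]] = 1.0   # toy bag-of-words vectors
d_ja = np.zeros(D_ja); d_ja[[7, 300]] = 1.0

score = similarity(semantic_vector(M["en"], d_en), semantic_vector(M["ja"], d_ja))
```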
  • the multilingual matrix learning unit 15 adjusts and learns, in a collection of documents each described in one of the target languages, the values of the matrices corresponding to the description languages of the documents, using the semantic vector creation unit 13 and the similarity calculation unit 14. Specifically, the multilingual matrix learning unit 15 learns each matrix so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship. For example, the multilingual matrix learning unit 15 may perform learning by adjusting the matrix values so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity between one document of the pair and another document that is not in a parallel translation relationship with it. Note that the multilingual matrix learning unit 15 preferably performs the learning of the matrices in parallel.
  • a set of documents used for learning by the multilingual matrix learning unit 15 will be described.
  • in this collection of documents, there may be three or more target languages describing the documents.
  • such a collection of documents includes, at least in part, pairs of documents that are in a parallel translation relationship, and also includes, at least in part, pairs of documents that are not in a parallel translation relationship.
  • the document set may be stored in advance in the storage device 1004.
  • a set of documents may be input from the outside via the input device 1005 or a network interface (not shown).
  • the multilingual matrix learning unit 15 may learn each matrix described above using the stochastic steepest gradient method.
  • in this case, the multilingual matrix learning unit 15 may randomly select, from the collection of documents, a pair of documents in a parallel translation relationship and a pair of documents not in a parallel translation relationship for each step of the stochastic steepest gradient method.
  • the operation of the multilingual document similarity learning device 1 configured as described above will be described with reference to FIG. 3. Note that the multilingual document similarity learning device 1 starts the following operation when a collection of documents, each described in one of the target languages, is input.
  • the word vector acquisition unit 12 acquires a corresponding word vector for each document in a set of documents each described in one of the target languages (step S1).
  • the multilingual matrix learning unit 15 instructs the semantic vector creation unit 13 to create a semantic vector for each document in a pair of documents in a parallel translation relationship and in a pair of documents not in a parallel translation relationship taken from the collection (step S2).
  • the semantic vector creation unit 13 creates a semantic vector based on a word vector of each document and a matrix corresponding to the description language.
  • the multilingual matrix learning unit 15 uses the similarity calculation unit 14 to calculate, based on the semantic vectors of the documents, the similarity of the pair of documents in a parallel translation relationship and the similarity of the pair of documents not in a parallel translation relationship (step S3).
  • the multilingual matrix learning unit 15 adjusts the matrix corresponding to each description language so that the similarity of the pair of documents in a parallel translation relationship becomes higher than the similarity of the pair of documents not in a parallel translation relationship (step S4).
  • if the adjustment of the matrices has converged (Yes in step S5), the multilingual matrix learning unit 15 ends the learning.
  • if the adjustment of the matrices has not converged (No in step S5), the operation of the multilingual document similarity learning device 1 returns to step S2, and the device repeats the operations from step S2 onward.
  • as described above, the multilingual document similarity learning device of this embodiment can learn the matrices used to create semantic vectors of documents for similar-document determination more accurately and at a lower cost, even when there are three or more languages in a multilingual document group.
  • the multilingual matrix storage unit holds a matrix for converting a word vector of a document into a semantic vector for each target language.
  • the word vector acquisition unit acquires a word vector corresponding to the document
  • the semantic vector creation unit creates a semantic vector based on the word vector of the document and a matrix corresponding to the description language of the document.
  • this is because the multilingual matrix learning unit performs the learning of the matrix corresponding to each target language so that, in a collection of documents each described in one of the target languages, the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship.
  • in this embodiment, a matrix for creating a semantic vector is prepared for each target language, not for each language pair. Therefore, this embodiment does not need to learn a matrix for each language pair; it only needs to learn one matrix per target language such that the meaning of each dimension of a document's semantic vector is the same regardless of the language pair. As a result, the matrix obtained by learning for each description language is independent of the partner language.
  • moreover, even when a matrix is learned for a target language for which the absolute number of documents in the collection is small, information can be obtained from pairs with documents in a plurality of other target languages. For this reason, a performance improvement can be expected compared to learning a matrix for each language pair. Furthermore, this embodiment can further improve the learning accuracy by learning the matrices of the target languages in parallel.
  • the translation relationship here does not need to be a complete translation relationship, a so-called parallel corpus.
  • the bilingual relationship may be a so-called comparable corpus that describes the same object in different languages.
  • a bilingual corpus used in statistical machine translation research may be used.
  • for example, the various language versions of Wikipedia may be used as such a set of documents.
  • in this specific example, the multilingual matrix storage unit 11 stores a matrix M i for each language i.
  • the matrix M i corresponding to language i is an N × D i matrix.
  • D i is the number of words used in language i.
  • D i may be a different value for each language.
  • as the initial value of each M i , for example, 0 is set.
  • the word vector acquisition unit 12 converts each document in the above-described document set into a word vector.
  • the word vector is a concept that is generally used when calculating the similarity of documents, and is an expression format that expresses a document by a set of words included in the document.
  • the simplest word vector is a vector (for example, an expression such as [1, 0, 1, 0]) composed of elements in which the presence or absence of each word is represented by 0 or 1.
  • Other word vectors include those based on TF (word appearance frequency: term frequency) * IDF (inverse document frequency) that weights each word from the viewpoint of calculating similarity.
  • a method of temporarily compressing a word vector using a method such as LSI (Latent Semantic Indexing) or LDA (Latent Dirichlet Allocation) is also known.
  • word N-GRAM or character N-GRAM may be used.
  • in this specific example, word TF * IDF is used as the word vector.
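As an illustration, a minimal word TF*IDF computation might look like the sketch below. scikit-learn is assumed purely for convenience (any equivalent TF*IDF implementation would do), the sample documents are invented, and keeping one vectorizer per language reflects the text's statement that D i may differ between languages.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One vectorizer per target language, so each language i has its own vocabulary size D_i.
docs_en = ["a multilingual document similarity example",
           "another english document about matrices and semantic vectors"]

vectorizer_en = TfidfVectorizer()                       # word TF*IDF
word_vectors_en = vectorizer_en.fit_transform(docs_en)  # sparse matrix, shape (num_docs, D_en)

print(word_vectors_en.shape)
```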
  • the semantic vector creation unit 13 calculates the product of the word vector of a document and the matrix corresponding to the description language of the document as the semantic vector of that document, under the control of the multilingual matrix learning unit 15. Specifically, for the j-th document in language i, the semantic vector creation unit 13 calculates M i · d ij , the product of the current matrix M i of language i and the corresponding word vector d ij , and uses it as the semantic vector. The number of dimensions of M i · d ij is N regardless of the document or language.
  • the multilingual matrix learning unit 15 performs the learning of each M i using the semantic vector creation unit 13 and the similarity calculation unit 14.
  • the basic idea of the learning is to adjust the matrices of the multiple description languages in parallel so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship.
  • specifically, from the collection of documents mentioned above, the multilingual matrix learning unit 15 takes a document q in language i q , a document d+ in language i+ that is in a parallel translation relationship with q, and a document d− in language i− that is not in a parallel translation relationship with q.
  • the matrices M iq , M i+ , and M i− are adjusted in parallel so as to satisfy the following expression (2).
  • the multilingual matrix learning unit 15 performs adjustment so as to minimize the loss function of the following equation (3) considering the margin.
  • here, R is the set of combinations of a document in the input document collection with a document that is in a parallel translation relationship with it and a document that is not in a parallel translation relationship with it.
  • f(q, d) represents the similarity between documents q and d. That is, if the language of q is i q and the language of d is i d , then f(q, d) = (M iq · q) T · (M id · d).
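Expressions (2) and (3) themselves appear only as images in the original publication. From the surrounding text (a margin-based ranking criterion over the set R), they most likely take the following form, given here as a hedged reconstruction:

```latex
% Hedged reconstruction of expressions (2) and (3); the originals are shown only as images.
f(q, d^{+}) > f(q, d^{-})  \qquad (2)

\sum_{(q,\, d^{+},\, d^{-}) \in R} \max\!\bigl(0,\; 1 - f(q, d^{+}) + f(q, d^{-})\bigr)  \qquad (3)
```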
  • One of the methods for minimizing the loss function described above is a method using a stochastic steepest gradient method.
  • the multilingual matrix learning unit 15 randomly selects a triple q, d+ and d− for each step of the stochastic steepest gradient method, and when 1 − f(q, d+) + f(q, d−) > 0, each matrix M (M iq , M i+ , M i− ) is updated as in the following equations (4) to (6).
  • the multilingual matrix learning unit 15 performs random document extraction and adjustment of the matrix M based on the extraction until convergence.
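The update equations (4) to (6) likewise appear only as images in the original. The sketch below shows one plausible reading, in which each of the three matrices is moved along the negative gradient of the hinge loss max(0, 1 − f(q, d+) + f(q, d−)) whenever that loss is positive. NumPy is assumed, and the learning rate, the outer-product update form, and the function names are illustrative assumptions rather than the patent's exact equations.

```python
import numpy as np

def f(M_q, q, M_d, d):
    """Similarity f(q, d) = (M_iq . q)^T (M_id . d), as defined in the text."""
    return float((M_q @ q) @ (M_d @ d))

def sgd_step(M, q, d_pos, d_neg, lang_q, lang_pos, lang_neg, lr=0.01):
    """One step of the stochastic steepest gradient method on the hinge loss.

    M is a dict mapping each language to its current N x D_i matrix. The three
    updates below are a plausible reading of equations (4)-(6): they are the
    negative gradients of 1 - f(q, d+) + f(q, d-) with respect to M_iq, M_i+,
    and M_i-, applied only when the margin condition is violated.
    """
    if 1.0 - f(M[lang_q], q, M[lang_pos], d_pos) + f(M[lang_q], q, M[lang_neg], d_neg) > 0:
        sem_q   = M[lang_q]   @ q
        sem_pos = M[lang_pos] @ d_pos
        sem_neg = M[lang_neg] @ d_neg
        M[lang_q]   += lr * np.outer(sem_pos - sem_neg, q)   # plausible reading of (4)
        M[lang_pos] += lr * np.outer(sem_q, d_pos)           # plausible reading of (5)
        M[lang_neg] -= lr * np.outer(sem_q, d_neg)           # plausible reading of (6)

# Per the text, a triple (q, d+, d-) is drawn at random from the document
# collection at every step, and the extraction and update are repeated until
# the matrices converge.
```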
  • the multilingual matrix storage unit 11 stores a matrix M i for converting a word vector of a document into a semantic vector for each language i.
  • in addition, the multilingual matrix learning unit 15 performs the learning of the matrices M i for the plurality of languages i in parallel, so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship. Thereby, learning can be performed so that the meaning of each dimension of the semantic vector is the same regardless of the language pair.
  • the matrix M i for the target language i obtained by learning becomes independent of the partner language.
  • in contrast, the related technique of Patent Document 2 has to hold and learn, for a given language a, a plurality of matrices such as M ab , M ac , ... for each partner language b, c, .... This is because the meaning of each dimension of the semantic vector (M · d) differs for each language pair.
  • in this specific example, on the other hand, the semantic vector (M · d) can be handled as a vector independent of the partner language.
  • this is because the multilingual matrix learning unit 15 learns each M i so that the meaning of each dimension of the semantic vector is the same regardless of the language pair.
  • as a result, the number of matrices M is reduced from n × (n − 1) to n.
  • further, even when learning the matrix of a language for which the absolute number of documents is small, the multilingual document similarity learning device in this specific example can obtain information from pairs formed between a document described in that language and documents described in each of a plurality of other languages. For this reason, the multilingual document similarity learning device in this specific example can improve learning performance.
  • FIG. 4 is a diagram showing a functional block configuration of the multilingual document similarity determination apparatus 2 according to the second embodiment of the present invention.
  • the multilingual document similarity determination device 2 includes a multilingual matrix storage unit 11, a word vector acquisition unit 12, a semantic vector creation unit 13, a similarity calculation unit 14, and a similarity determination unit 26.
  • the multilingual document similarity determination device 2 can be configured by the same hardware elements as the multilingual document similarity learning device 1 according to the first embodiment of the present invention, described with reference to FIG. 2.
  • the similarity determination unit 26 includes an output device 1006 and a CPU 1001 that reads a computer program and various data stored in the ROM 1003 and the storage device 1004 into the RAM 1002 and executes them.
  • the hardware configuration of the multilingual document similarity determination device 2 and each functional block thereof is not limited to the above-described configuration.
  • the multilingual matrix storage unit 11 holds a matrix for each target language learned by the multilingual document similarity learning device 1 according to the first embodiment of the present invention.
  • the word vector acquisition unit 12, the semantic vector creation unit 13, and the similarity calculation unit 14 are configured in the same manner as in the first embodiment of the present invention.
  • the similarity determination unit 26 determines the similarity of a document using the similarity calculated by the similarity calculation unit 14 in a set of documents to be subjected to similarity determination.
  • for example, the set of documents to be subjected to similarity determination may be a pair of documents.
  • the similarity determination unit 26 may determine that a set of documents to be determined is similar if the similarity is equal to or greater than a threshold value, and may determine that they are not similar if the similarity is less than the threshold value.
  • the set of documents to be subjected to similarity determination may be a set of three or more documents.
  • the similarity determination unit 26 may perform clustering of documents based on the similarity as the determination of the similarity in the set of documents to be determined. Further, for example, the similarity determination unit 26 may perform ranking of similar documents with respect to a certain document as determination of similarity in the set of documents to be determined.
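For illustration, the sketch below shows a threshold-based similarity judgement and a ranking of similar documents against one query document based on semantic-vector inner products. The semantic vectors are assumed to be NumPy arrays (or anything supporting the @ inner product), and the function names and the threshold value are illustrative assumptions, not values from the text.

```python
def is_similar(similarity_value, threshold=0.5):
    """Judge a pair of documents as similar when their similarity reaches the
    threshold (the value 0.5 is a hypothetical choice)."""
    return similarity_value >= threshold

def rank_similar(query_sem, candidate_sems):
    """Return candidate indices ordered by descending similarity (inner product
    of semantic vectors) to the query document."""
    sims = [float(query_sem @ c) for c in candidate_sems]
    return sorted(range(len(candidate_sems)), key=lambda i: sims[i], reverse=True)
```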
  • the similarity determination unit 26 may output the determination result to the output device 1006.
  • the set of documents to be subjected to similarity determination may be stored in the storage device 1004 in advance. Further, the set of documents to be subjected to similarity determination may be input from the outside via the input device 1005 or a network interface (not shown).
  • the multilingual document similarity determination apparatus 2 starts the following operation when a set of documents whose similarity is to be determined is input.
  • the word vector acquisition unit 12 acquires a word vector for each document in a set of documents whose similarity is to be determined (step S11).
  • the semantic vector creation unit 13 creates a semantic vector for each document based on a word vector of the document and a matrix corresponding to the description language of the document (step S12).
  • the similarity calculation unit 14 calculates the similarity of a set of arbitrary documents in the set of documents (step S13).
  • the similarity determination unit 26 determines the similarity based on the obtained similarity and outputs a determination result (step S14). As described above, the similarity determination unit 26 may output information indicating whether or not a set of arbitrary documents is similar by comparing the similarity with a threshold value. Further, the similarity determination unit 26 may perform clustering and ranking of documents using the similarity and output the result as a determination result.
  • the multilingual document similarity determination device 2 ends the operation.
  • as described above, the multilingual document similarity determination device of this embodiment can determine similar documents more accurately and at a lower cost, even when there are three or more languages in a multilingual document group.
  • this is because the multilingual matrix storage unit holds, for each target language, a learned matrix for converting a word vector of a document into a semantic vector.
  • the word vector acquisition unit acquires a word vector for a document.
  • the semantic vector creation unit creates a semantic vector for the document based on the word vector and a matrix corresponding to the description language.
  • the similarity calculation unit calculates the similarity based on the semantic vector for the document set. This is because the similarity determination unit performs similarity determination based on the similarity in a set of documents to be subjected to similarity determination.
  • furthermore, the similarity calculation unit does not need to calculate, for a given document, as many semantic vectors as there are description languages in the document group being compared.
  • the similarity calculation unit may calculate one semantic vector for a document regardless of the number of description languages of the document group to be compared. For this reason, the calculation cost for similarity determination becomes low. Further, such a semantic vector is created so that the meaning of each dimension is the same regardless of the language pair. For this reason, the similarity calculated based on the semantic vector has high accuracy.
  • the multilingual matrix storage unit 11 holds a matrix M i for each language learned in the specific example according to the first embodiment of the present invention.
  • a news article group collected from the web is input to the multilingual document similarity determination device 2 as a set of documents to be clustered (similarity determination target).
  • the word vector acquisition unit 12 converts each document in the set of documents to be clustered into a word vector.
  • the conversion method is the same as the specific example in the first embodiment of the present invention.
  • the semantic vector creation unit 13 creates the semantic vector of each document by taking the product of the created word vector and the matrix M i of the corresponding language stored in the multilingual matrix storage unit 11.
  • the creation method is the same as the specific example in the first embodiment of the present invention.
  • the similarity calculation unit 14 obtains the similarity by taking the inner product of the semantic vectors for each set of documents in the set of documents to be clustered.
  • the similarity determination unit 26 performs clustering by causing a set of documents whose similarity is equal to or greater than a threshold value to belong to the same cluster.
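A minimal sketch of that clustering step is shown below: documents whose pairwise semantic-vector inner product reaches the threshold are placed in the same cluster, transitively, via a simple union-find. The threshold and the single-linkage-style grouping are assumptions about one straightforward reading of the text, and the semantic vectors are assumed to be NumPy arrays.

```python
def cluster_by_threshold(sem_vectors, threshold):
    """Group documents so that any pair whose semantic-vector inner product is
    at or above `threshold` ends up in the same cluster (transitively)."""
    n = len(sem_vectors)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if float(sem_vectors[i] @ sem_vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```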
  • in this specific example, the multilingual matrix storage unit 11 holds, for each language, the matrix M i for converting a word vector of a document into a semantic vector. Since this matrix M i has been learned by the multilingual document similarity learning device 1 of the specific example of the first embodiment of the present invention, it is independent of the partner language. Therefore, when calculating the similarity of any pair of documents, the similarity calculation unit 14 can use the one semantic vector calculated for each document, and calculation for each language pair becomes unnecessary.
  • in this specific example, when a certain document d ij is compared against other document groups, only one semantic vector M i · d ij needs to be obtained, even if the comparison-target document group is described in a plurality of languages. Therefore, this specific example can reduce the calculation cost of the similarity.
  • in the above embodiments, each matrix held for each target language has been described mainly with an example in which the numbers of columns are equal to each other.
  • alternatively, the matrices may have equal numbers of rows.
  • in that case, the number of dimensions of the word vector of a document described in the corresponding language may be used as the number of columns of each matrix.
  • the semantic vector creation unit calculates the product of a word vector of a document and a matrix corresponding to the description language as a semantic vector.
  • the semantic vector creation unit may use other calculation methods for creating a semantic vector based on a word vector of a document and a matrix corresponding to the description language.
  • the description has been made mainly on the example in which the similarity calculation unit calculates the inner product of the semantic vectors of each document for the document set to obtain the similarity.
  • the similarity calculation unit may use another calculation method for calculating the similarity based on the semantic vector of each document.
  • the multilingual document similarity learning device and the multilingual document similarity determination device as the embodiments of the present invention described above may be realized on the same device.
  • in the above embodiments, the description has centered on examples in which each functional block of the multilingual document similarity learning device and the multilingual document similarity determination device is realized by a CPU executing a computer program stored in a storage device or ROM.
  • part, all, or a combination of each functional block may be realized by dedicated hardware.
  • the functional blocks of the multilingual document similarity learning device or the multilingual document similarity determination device may be distributed and implemented in a plurality of devices.
  • a computer program that causes a computer to perform the operations of the multilingual document similarity learning device and the multilingual document similarity determination device described with reference to the flowcharts is also a computer program of the present invention.
  • the present invention is also constituted by the code of such a computer program, or by a storage medium storing it.

Abstract

This invention provides a technology for searching for similar documents in a multilingual document group at lower cost and with higher precision, even if three or more languages are present. This multilingual document-similarity-degree learning device (1) comprises the following: a multilingual matrix storage unit (11) that holds a matrix for each target language; a word-vector acquisition unit (12) that acquires a word vector corresponding to a document; a meaning-vector creation unit (13) that creates a meaning vector for said document on the basis of the word vector for said document and the matrix corresponding to the language in which said document is written; a similarity-degree calculation unit (14) that calculates similarity degrees on the basis of meaning vectors for documents in a document group; and a multilingual matrix learning unit (15) that implements learning by adjusting values in the matrices corresponding to the respective target languages such that, within a set of documents each written in one of the target languages, the similarity degrees for groups of documents that exhibit source-translation relationships are higher than the similarity degrees for groups of documents that do not exhibit source-translation relationships.

Description

Multilingual document similarity learning device, multilingual document similarity determination device, multilingual document similarity learning method, multilingual document similarity determination method, and storage medium
 The present invention relates to a technique for finding documents whose contents are similar to each other in a multilingual document group in which documents of different languages are mixed.
 The Internet has become widespread, and various information has come to be transmitted in various languages. To collect more information, information written in more languages should be targeted. However, in this case, similar information written in different languages is collected separately and presented as different pieces of information, which is inefficient from the viewpoint of information collection.
 A technique related to such a problem is described in Patent Document 1. In this related technique, a target document is first machine-translated into a reference language, for example English, and then documents having similar contents are grouped using a technique such as clustering.
 Another technique related to such a problem is described in Patent Document 2. Patent Document 2 proposes a framework called SSI (supervised semantic indexing) that learns a model representing the similarity of contents between two languages directly from a bilingual corpus, without going through the intermediate result of machine translation. This related technique learns and determines the similarity of documents between two languages as follows.
 First, in this related technique, each document d ij in each language is expressed as a bag-of-words vector over a document set that is in a parallel translation relationship across languages (the dimension numbers D 1 and D 2 for each language may be chosen freely). Here, the subscript i represents the language type, and the subscript j represents the document ID within each language.
 This related technique then prepares a matrix W for learning the correspondence between the language pair. W is a D 1 × D 2 matrix (D 1 rows, D 2 columns). However, since the number of parameters to be learned is large, U and V satisfying W = U T · V are learned for dimensional compression. Here, U T represents the transposed matrix of U, and U and V are N × D 1 and N × D 2 matrices, respectively. A value such as N = 100 is applied. The equation for calculating the score of a document pair is shown as equation (1). [Equation (1) is presented as an image in the original publication.]
 At this time, U and V are learned so that the score of a document pair in a parallel translation relationship becomes higher than the scores of the other document pairs. This related technique then determines the similarity between documents d 1j and d 2k in different languages by equation (1), using the learned matrices U and V.
 Yet another technique related to such a problem is described in Patent Document 3. This related technique searches, among a plurality of documents described in different languages, for those that are semantically close to a search request described in a certain language. In this related technique, a word dictionary database is prepared in advance. The word dictionary database associates a group of synonymous words Wi across natural languages A, B, C, D, ... with a single word feature vector Vi. This related technique then normalizes the sum of the word feature vectors associated with the words included in each document and calculates it as a document feature vector. Likewise, it normalizes the sum of the word feature vectors associated with each word included in the search request and calculates it as a search request feature vector. This related technique then calculates the inner product of the search request feature vector and the document feature vector of each document as a semantic approximation degree, and retrieves documents with a large semantic approximation degree as documents close to the search request.
 Patent Document 1: JP 2013-84306 A
 Patent Document 2: US Pat. No. 8,359,282
 Patent Document 3: Japanese Patent Laid-Open No. H10-31677
 However, since the related technology described in Patent Document 1 is a framework that goes through the intermediate state of machine translation results, there is a problem in that the accuracy of the grouping is not high when the accuracy of the machine translation itself is not necessarily high.
 The related art described in Patent Document 2 has no problem in the case of two languages, but a problem arises in the case of three or more languages: the matrix W = U T · V for determining the similarity between languages must be prepared for each partner language. For example, if the number of languages is n, it is necessary to calculate and hold n × (n − 1) / 2 matrices, counting U and V together. Here, "/" represents division. Further, at the time of similarity determination, when a document d ij in a certain language is compared with other documents, W · d ij must be calculated for each of the languages being compared, which increases the calculation cost.
 Further, Patent Document 3 does not describe how to learn the single word feature vector Vi for a group of synonymous words Wi. Also, considering the existence of polysemous words, the number of combinations of synonymous words Wi may become enormous. Therefore, with this related technology, the cost of maintaining and learning the word dictionary database increases. Furthermore, determining the degree of approximation of a document via word-level synonym groups is equivalent to going through word-level machine translation. Therefore, this related technique has a problem that the accuracy of the semantic approximation is not high when the accuracy of the synonym groups (machine translation) is not necessarily high.
 The present invention has been made to solve the above-described problems. That is, an object of the present invention is to provide a technique for searching for similar documents more accurately and at a lower cost even when there are three or more languages in a multilingual document group.
 To achieve the above object, the multilingual document similarity learning device of the present invention includes: multilingual matrix storage means that holds a matrix for each target language; word vector acquisition means that acquires a word vector corresponding to a document; semantic vector creation means that creates a semantic vector of the document based on the word vector of the document and the matrix corresponding to the description language of the document; similarity calculation means that, for a set of documents, calculates a similarity based on the semantic vector of each document; and multilingual matrix learning means that, in a collection of documents each described in one of the target languages, adjusts and learns the values of the matrices corresponding to the respective target languages so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship.
 The multilingual document similarity determination device of the present invention includes: multilingual matrix storage means that holds the matrix for each target language learned using the above multilingual document similarity learning device; word vector acquisition means that acquires a word vector corresponding to a document; semantic vector creation means that creates a semantic vector of the document based on the word vector of the document and the matrix corresponding to the description language of the document; similarity calculation means that, for a set of documents, calculates a similarity based on the semantic vector of each document; and similarity determination means that, in a collection of similarity determination target documents, determines the similarity between documents using the calculated similarity.
 The multilingual document similarity learning method of the present invention creates a semantic vector of a document based on the word vector corresponding to the document and the matrix, held for each target language, that corresponds to the description language of the document, and calculates a similarity for a set of documents based on the semantic vector of each document; in a collection of documents each described in one of the target languages, learning is then performed by adjusting the values of the matrices corresponding to the respective target languages so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship.
 The multilingual document similarity determination method of the present invention creates a semantic vector of a document based on the word vector corresponding to the document and the matrix, learned for each target language by the above multilingual document similarity learning method, that corresponds to the description language of the document, calculates a similarity for a set of documents based on the semantic vector of each document, and, in a collection of similarity determination target documents, determines the similarity between documents using the calculated similarity.
 The storage medium of the present invention stores a multilingual document similarity learning program that causes a computer device to execute: a word vector acquisition step of acquiring a word vector corresponding to a document; a semantic vector creation step of creating a semantic vector of the document based on the word vector of the document and the matrix, held for each target language, that corresponds to the description language of the document; a similarity calculation step of calculating, for a set of documents, a similarity based on the semantic vector of each document; and a multilingual matrix learning step of adjusting and learning, in a collection of documents each described in one of the target languages, the values of the matrices corresponding to the respective target languages so that the similarity of a pair of documents in a parallel translation relationship becomes higher than the similarity of a pair of documents not in a parallel translation relationship.
 Another storage medium of the present invention stores a multilingual document similarity determination program that causes a computer device to execute: a word vector acquisition step of acquiring a word vector corresponding to a document; a semantic vector creation step of creating a semantic vector of the document based on the word vector of the document and the matrix, learned for each target language by executing the multilingual document similarity learning program stored in the above storage medium, that corresponds to the description language of the document; a similarity calculation step of calculating, for a set of documents, a similarity based on the semantic vector of each document; and a similarity determination step of determining, in a collection of similarity determination target documents, the similarity between documents using the calculated similarity.
 The present invention can provide a technique for searching for similar documents more accurately and at a lower cost even when there are three or more languages in a multilingual document group.
 FIG. 1 is a functional block diagram of a multilingual document similarity learning device as a first embodiment of the present invention.
 FIG. 2 is a hardware configuration diagram of the multilingual document similarity learning device as the first embodiment of the present invention.
 FIG. 3 is a flowchart explaining the operation of the multilingual document similarity learning device as the first embodiment of the present invention.
 FIG. 4 is a functional block diagram of a multilingual document similarity determination device as a second embodiment of the present invention.
 FIG. 5 is a flowchart explaining the operation of the multilingual document similarity determination device as the second embodiment of the present invention.
 Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
 (First embodiment)
 FIG. 1 is a diagram showing the functional block configuration of a multilingual document similarity learning device 1 as the first embodiment of the present invention. In FIG. 1, the multilingual document similarity learning device 1 includes a multilingual matrix storage unit 11, a word vector acquisition unit 12, a semantic vector creation unit 13, a similarity calculation unit 14, and a multilingual matrix learning unit 15.
 FIG. 2 is a diagram illustrating an example of the hardware configuration of the multilingual document similarity learning device 1. In FIG. 2, the multilingual document similarity learning device 1 is configured by a computer device. The computer device includes a CPU (Central Processing Unit) 1001, a RAM (Random Access Memory) 1002, a ROM (Read Only Memory) 1003, a storage device 1004, an input device 1005, and an output device 1006. In this case, the multilingual matrix storage unit 11 is configured by the storage device 1004. The word vector acquisition unit 12, the semantic vector creation unit 13, and the similarity calculation unit 14 are configured by the CPU 1001, which reads the computer program and various data stored in the ROM 1003 and the storage device 1004 into the RAM 1002 and executes them. The multilingual matrix learning unit 15 is configured by the input device 1005 and the CPU 1001, which reads the computer program and various data stored in the ROM 1003 and the storage device 1004 into the RAM 1002 and executes them. Note that the hardware configuration of the multilingual document similarity learning device 1 and its functional blocks is not limited to the above-described configuration.
 The multilingual matrix storage unit 11 holds a matrix for each target language. Each matrix is a weight matrix for converting a word vector of a document described in that target language into a semantic vector. The word vector and the semantic vector will be described later. For example, the matrices may have the same number of columns. In that case, the number of columns is the number of dimensions of the semantic vector, and the number of rows of each matrix may be the number of dimensions of the word vector described later.
 The word vector acquisition unit 12 acquires a word vector corresponding to a document. The word vector is a concept generally used when calculating the similarity of documents, and is an expression format that represents a document by the set of words included in it. The number of dimensions of the word vector may be, for example, the number of words used in the target language describing the document (hereinafter also referred to as the description language). For example, the word vector acquisition unit 12 may create a word vector by a known technique based on a given document. Alternatively, the word vector acquisition unit 12 may acquire a word vector generated in advance for the document from the storage device 1004, the input device 1005, or the like.
The semantic vector creation unit 13 creates a semantic vector of a document based on the word vector of the document and the matrix held in the multilingual matrix storage unit 11 for the description language of the document. Here, a semantic vector is information representing the semantic features of the document. For example, the semantic vector creation unit 13 may create, as the semantic vector of the document, the product of the word vector of the document and the matrix corresponding to the description language of the document.
The similarity calculation unit 14 calculates a similarity for a pair of documents based on the semantic vector of each document. For example, the similarity calculation unit 14 may calculate the inner product of the semantic vectors of the two documents as the similarity of the pair.
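As an illustration of these two operations, the following is a minimal Python/NumPy sketch of converting a word vector into a semantic vector and scoring a document pair by an inner product. The dimensionalities, language labels, and helper names are assumptions made for the example, not part of the embodiment itself.

```python
import numpy as np

N = 100                        # assumed semantic-vector dimensionality
D = {"en": 4000, "ja": 5000}   # assumed vocabulary size (word-vector dimensionality) per language

# One weight matrix per target language, as held by the multilingual matrix storage unit 11.
# Initialized to 0 here only as a placeholder; learned values are used in practice.
M = {lang: np.zeros((N, dim)) for lang, dim in D.items()}

def semantic_vector(word_vec, lang):
    # Semantic vector: product of the matrix for the description language and the word vector
    return M[lang] @ word_vec

def similarity(word_vec_a, lang_a, word_vec_b, lang_b):
    # Similarity of a document pair: inner product of the two semantic vectors
    return float(semantic_vector(word_vec_a, lang_a) @ semantic_vector(word_vec_b, lang_b))
```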
The multilingual matrix learning unit 15 uses the semantic vector creation unit 13 and the similarity calculation unit 14 on a set of documents, each written in one of the target languages, and learns by adjusting the values of the matrix corresponding to the description language of each document. Specifically, the multilingual matrix learning unit 15 learns each matrix so that the similarity of a pair of documents in a translation relationship becomes higher than the similarity of a pair of documents not in a translation relationship. For example, the multilingual matrix learning unit 15 may adjust the matrix values so that the similarity of a pair of documents in a translation relationship becomes higher than the similarity between one document of that pair and another document not in a translation relationship with it. Preferably, the multilingual matrix learning unit 15 learns the matrices in parallel.
Here, the set of documents used for learning by the multilingual matrix learning unit 15 is described. In the set of documents, there may be three or more target languages in which the documents are written. Such a set of documents is configured to include, at least in part, pairs of documents in a translation relationship, and also to include, at least in part, pairs of documents not in a translation relationship. The set of documents may be stored in advance in the storage device 1004, or may be input from the outside via the input device 1005, a network interface (not shown), or the like.
Further, for example, the multilingual matrix learning unit 15 may learn the matrices described above using stochastic gradient descent. In this case, at each step of stochastic gradient descent, the multilingual matrix learning unit 15 may randomly select from the set of documents a pair of documents in a translation relationship and a pair of documents not in a translation relationship.
The operation of the multilingual document similarity learning device 1 configured as described above will now be described with reference to FIG. 3. The multilingual document similarity learning device 1 starts the following operation when a set of documents, each written in one of the target languages, is input.
In FIG. 3, first, the word vector acquisition unit 12 acquires the corresponding word vector for each document in the set of documents, each written in one of the target languages (step S1).
Next, the multilingual matrix learning unit 15 instructs the semantic vector creation unit 13 to create a semantic vector for each document of the pairs in a translation relationship and the pairs not in a translation relationship in the set of documents (step S2). The semantic vector creation unit 13 creates each semantic vector based on the word vector of the document and the matrix corresponding to its description language.
Next, for each pair of documents in a translation relationship and each pair not in a translation relationship in the set of documents, the multilingual matrix learning unit 15 has the similarity calculation unit 14 calculate the similarity based on the semantic vectors of the documents (step S3).
Next, the multilingual matrix learning unit 15 adjusts the matrix corresponding to each description language so that the similarity of a pair of documents in a translation relationship becomes higher than the similarity of a pair of documents not in a translation relationship (step S4).
Next, if the matrix adjustment has converged (Yes in step S5), the multilingual matrix learning unit 15 ends the learning.
On the other hand, if the matrix adjustment has not converged (No in step S5), the operation of the multilingual document similarity learning device 1 returns to step S2, and the device repeats the operation from step S2 using the matrices adjusted in the previous step S4.
Next, effects of the first exemplary embodiment of the present invention will be described.
The multilingual document similarity learning device according to the first exemplary embodiment of the present invention can learn, at lower cost and with higher accuracy, the matrices used to create the semantic vectors of documents for similar-document determination in a multilingual document group, even when there are three or more languages.
The reason is that the multilingual matrix storage unit holds, for each target language, a matrix for converting a word vector of a document into a semantic vector. In addition, the word vector acquisition unit acquires a word vector corresponding to a document, and the semantic vector creation unit creates a semantic vector based on the word vector of the document and the matrix corresponding to the description language of the document. Then, the multilingual matrix learning unit learns the matrix corresponding to each target language so that, in a set of documents each written in one of the target languages, the similarity of a pair of documents in a translation relationship becomes higher than the similarity of a pair of documents not in a translation relationship.
In this way, the present exemplary embodiment prepares a matrix for creating semantic vectors for each target language, not for each language pair. Therefore, the present exemplary embodiment does not need to learn a matrix for each language pair. It only has to learn a matrix for each target language so that each dimension of a document's semantic vector carries the same meaning regardless of the language pair. As a result, in the present exemplary embodiment, the matrix obtained by learning for each description language is independent of the counterpart language.
Therefore, even when there are three or more target languages, the present exemplary embodiment does not need to learn a matrix for similarity determination for every combination of target languages; it only needs to learn one matrix per target language. As a result, the computational cost can be kept low.
Moreover, even when learning the matrix of a target language for which the absolute number of documents in the set is small, the present exemplary embodiment can obtain information from language pairs with a plurality of other target languages. A performance improvement can therefore be expected compared with learning a matrix for each language pair. Furthermore, the present exemplary embodiment can further improve learning accuracy by learning the matrices of the target languages in parallel.
Next, the operation of the first exemplary embodiment of the present invention will be illustrated with a specific example.
Here, the collection of information from news articles on the web is assumed. To make information collection more efficient, there is a need to group news articles with the same content into one, even when they are written in different languages. For this purpose, it is necessary to determine the similarity between news articles across languages. The learning of the matrices for calculating the similarity between news articles across languages is described below.
A large number of documents, some of which are in a translation relationship, are used for learning. The translation relationship here does not need to be a complete translation relationship, that is, a so-called parallel corpus. For example, the translation relationship may be a so-called comparable corpus, in which the same subject is described to some extent in different languages. As such a set of documents, a bilingual corpus used in statistical machine translation research may be used, or the various language editions of Wikipedia may be used.
The multilingual matrix storage unit 11 then stores as many matrices as there are target languages for which similarity is to be measured in the above set of documents. The matrix M_i corresponding to language i is an N × D_i matrix. N is the number of dimensions of the semantic vector. To make each dimension of the semantic vector carry the same meaning regardless of language, N is preferably the same size for all languages. Empirically, N of around 100 to several hundred works well. D_i is the number of words used in language i, and may differ from language to language. The initial value of each M_i is set to, for example, 0.
First, the word vector acquisition unit 12 converts each document in the above set of documents into a word vector. As described above, a word vector is a concept commonly used when computing document similarity, and is a representation that expresses a document by the set of words it contains. The simplest word vector is a vector whose elements indicate the presence or absence of each word by 0 or 1 (for example, a representation such as [1, 0, 1, 0]). Other word vectors are based on TF (Term Frequency) * IDF (Inverse Document Frequency), which weights each word from the viewpoint of computing similarity. Methods that first reduce the dimensionality of the word vector using techniques such as LSI (Latent Semantic Indexing) or LDA (Latent Dirichlet Allocation) are also known. Word N-grams or character N-grams may be used instead of words. Here, the TF*IDF of words is used as the word vector. As a result, for the j-th document in language i, a word vector d_ij with D_i dimensions is created.
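As one possible way to obtain such TF*IDF word vectors, the following sketch uses scikit-learn's TfidfVectorizer with one vectorizer, and hence one vocabulary of size D_i, per language. The corpora are hypothetical, and the Japanese text is assumed to have been tokenized in advance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical document collections; Japanese is assumed to be pre-tokenized and whitespace-separated
corpora = {
    "en": ["stock prices rose sharply today", "election results were announced tonight"],
    "ja": ["株価 が 今日 急騰 した", "選挙 の 結果 が 今夜 発表 された"],
}

vectorizers = {lang: TfidfVectorizer() for lang in corpora}
word_vectors = {lang: vectorizers[lang].fit_transform(docs) for lang, docs in corpora.items()}

# word_vectors[lang] is a (number of documents) x D_lang matrix;
# row j corresponds to the word vector d_ij of the j-th document in that language.
print(word_vectors["en"].shape)
```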
Under the control of the multilingual matrix learning unit 15, the semantic vector creation unit 13 calculates the product of the word vector of a document and the matrix corresponding to the description language of that document as the semantic vector of the document. Specifically, for the j-th document in language i, the semantic vector creation unit 13 calculates M_i · d_ij, the product of the corresponding word vector d_ij and the current matrix M_i of language i, and uses it as the semantic vector. The number of dimensions of M_i · d_ij is N, regardless of the document or the language.
The multilingual matrix learning unit 15 then learns each M_i using the semantic vector creation unit 13 and the similarity calculation unit 14. The basic idea of the learning is to adjust the matrices for a plurality of description languages in parallel so that the similarity of a pair of documents in a translation relationship becomes higher than the similarity of a pair of documents not in a translation relationship. Specifically, for a document q in language i_q, a document d+ in language i_+ that is in a translation relationship with q, and a document d− in language i_− that is not in a translation relationship with q, all taken from the above set of documents, the multilingual matrix learning unit 15 adjusts the matrices M_iq, M_i+, and M_i− in parallel so as to satisfy the following expression (2):

(M_{i_q} q)^T (M_{i_+} d^+) > (M_{i_q} q)^T (M_{i_-} d^-)   ... (2)
Here, it is generally known that performance improves when the adjustment makes the similarity between documents q and d+ larger than the similarity between documents q and d− by at least a certain margin. The multilingual matrix learning unit 15 therefore performs the adjustment so as to minimize the margin-based loss function of the following expression (3):

\sum_{(q, d^+, d^-) \in R} \max(0,\; 1 - f(q, d^+) + f(q, d^-))   ... (3)
Here, R is the set of combinations, taken from the input document set, of a document with a document that is in a translation relationship with it or a document that is not in a translation relationship with it. Further, f(q, d) denotes the similarity between documents q and d. That is, when the language of document q is i_q and the language of document d is i_d, f(q, d) = (M_iq · q)^T · (M_id · d).
One way to minimize the above loss function is to use stochastic gradient descent. In this case, at each step of stochastic gradient descent, the multilingual matrix learning unit 15 randomly selects a triple q, d+, d−, and if 1 − f(q, d+) + f(q, d−) > 0, updates the matrices M (M_iq, M_i+, M_i−) as in the following expressions (4) to (6), where λ denotes the learning rate:

M_{i_q} \leftarrow M_{i_q} + \lambda (M_{i_+} d^+ - M_{i_-} d^-) q^T   ... (4)

M_{i_+} \leftarrow M_{i_+} + \lambda (M_{i_q} q) (d^+)^T   ... (5)

M_{i_-} \leftarrow M_{i_-} - \lambda (M_{i_q} q) (d^-)^T   ... (6)
In this way, the multilingual matrix learning unit 15 repeats the random selection of documents and the resulting adjustment of the matrices M until convergence.
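A minimal NumPy sketch of one such update step is shown below. It assumes the reconstructed expressions (4) to (6) above, dense word vectors, three distinct languages in the sampled triple, and an assumed learning rate; it is an illustration under those assumptions, not the claimed procedure itself.

```python
import numpy as np

LAMBDA = 0.01  # assumed learning rate

def f(M, lang_q, q, lang_d, d):
    # f(q, d) = (M_iq q)^T (M_id d)
    return float((M[lang_q] @ q) @ (M[lang_d] @ d))

def sgd_step(M, lang_q, q, lang_pos, d_pos, lang_neg, d_neg):
    """One stochastic-gradient step on the margin loss for a randomly chosen triple (q, d+, d-)."""
    sq, sp, sn = M[lang_q] @ q, M[lang_pos] @ d_pos, M[lang_neg] @ d_neg
    if 1.0 - float(sq @ sp) + float(sq @ sn) <= 0.0:
        return  # margin already satisfied; no update for this triple
    # Updates corresponding to expressions (4) to (6)
    M[lang_q]   += LAMBDA * np.outer(sp - sn, q)
    M[lang_pos] += LAMBDA * np.outer(sq, d_pos)
    M[lang_neg] -= LAMBDA * np.outer(sq, d_neg)
```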
This completes the description of the specific example.
As described above, in this specific example, the multilingual matrix storage unit 11 stores, for each language i, a matrix M_i for converting a word vector of a document into a semantic vector. Furthermore, the multilingual matrix learning unit 15 learns the matrices M_i for a plurality of languages i in parallel so that the similarity of a pair of documents in a translation relationship becomes higher than the similarity of a pair of documents not in a translation relationship. This makes it possible to learn so that each dimension of the semantic vector carries the same meaning regardless of the language pair. As a result, in this specific example, the matrix M_i obtained by learning for a target language i is independent of the counterpart language.
On the other hand, the related technique described in Patent Document 2 calculates the similarity of a pair of documents D_aj and D_bk by expression (1) above. That is, putting U = M_ab and V = M_ba in expression (1), it can be written as

f(D_{aj}, D_{bk}) = (M_{ab} D_{aj})^T (M_{ba} D_{bk})

In Patent Document 2, for a language a, a plurality of matrices such as M_ab, M_ac, and so on had to be held and learned, one for each counterpart language b, c, and so on. This is because the meaning of each dimension of the semantic vector (M · d) differed for each language pair.
In contrast, the specific example of the present exemplary embodiment treats the semantic vector (M · d) as a vector that is independent of the counterpart language. Thus, this specific example prepares only one matrix M_i for each language i, and the multilingual matrix learning unit learns the M_i so that each dimension of the semantic vector carries the same meaning regardless of the language pair. This reduces the number of matrices M from n × (n − 1) to n. Consequently, the multilingual document similarity learning device in this specific example does not need to hold and learn a matrix for each language pair, and the computational cost can be kept low.
Moreover, even when learning the matrix of a language for which the absolute number of documents is small, the multilingual document similarity learning device in this specific example can obtain information from pairs of a document written in that language and documents written in each of a plurality of other languages. The multilingual document similarity learning device in this specific example can therefore improve learning performance.
This completes the description of the specific example of the operation according to the first exemplary embodiment of the present invention.
(Second Exemplary Embodiment)
Next, a second exemplary embodiment of the present invention will be described in detail with reference to the drawings. In the drawings referred to in the description of the present exemplary embodiment, the same reference signs are given to the same configurations and to steps that operate in the same manner as in the first exemplary embodiment of the present invention, and their detailed description is omitted in the present exemplary embodiment.
FIG. 4 is a diagram showing the functional block configuration of the multilingual document similarity determination device 2 according to the second exemplary embodiment of the present invention. In FIG. 4, the multilingual document similarity determination device 2 includes a multilingual matrix storage unit 11, a word vector acquisition unit 12, a semantic vector creation unit 13, a similarity calculation unit 14, and a similarity determination unit 26. Here, the multilingual document similarity determination device 2 can be configured with the same hardware elements as the multilingual document similarity learning device 1 according to the first exemplary embodiment of the present invention described with reference to FIG. 2. In this case, the similarity determination unit 26 is implemented by the output device 1006 and by the CPU 1001, which loads the computer programs and various data stored in the ROM 1003 and the storage device 1004 into the RAM 1002 and executes them. The hardware configuration of the multilingual document similarity determination device 2 and of its functional blocks is not limited to the configuration described above.
The multilingual matrix storage unit 11 holds the matrix for each target language learned by the multilingual document similarity learning device 1 according to the first exemplary embodiment of the present invention. The word vector acquisition unit 12, the semantic vector creation unit 13, and the similarity calculation unit 14 are configured in the same manner as in the first exemplary embodiment of the present invention.
The similarity determination unit 26 determines the similarity of documents in a set of documents subject to similarity determination, using the similarities calculated by the similarity calculation unit 14. The set of documents subject to similarity determination may be a pair of documents. In this case, the similarity determination unit 26 may determine that the pair of documents to be judged is similar if the similarity is equal to or greater than a threshold, and not similar if the similarity is below the threshold. The set of documents subject to similarity determination may also be a set of three or more documents. In this case, for example, the similarity determination unit 26 may perform clustering of the documents based on the similarities as the similarity determination over the set of documents to be judged. Alternatively, for example, the similarity determination unit 26 may rank, for a given document, the documents similar to it. The similarity determination unit 26 may output the determination result to the output device 1006.
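For example, the pairwise determination and the ranking described above might look like the following sketch, assuming that semantic vectors have already been created for each document as NumPy arrays and that the threshold value is an assumed choice.

```python
import numpy as np

def is_similar(sem_a, sem_b, threshold=0.5):
    # Pairwise determination: similar when the inner product reaches the threshold (value assumed)
    return float(sem_a @ sem_b) >= threshold

def rank_similar(query_sem, candidate_sems):
    # Ranking: order candidate documents by similarity to the query document, highest first
    scores = [float(query_sem @ c) for c in candidate_sems]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Toy usage with random semantic vectors
rng = np.random.default_rng(0)
docs = [rng.normal(size=100) for _ in range(4)]
print(is_similar(docs[0], docs[1]), rank_similar(docs[0], docs[1:]))
```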
The set of documents subject to similarity determination may be stored in advance in the storage device 1004, or may be input from the outside via the input device 1005, a network interface (not shown), or the like.
The operation of the multilingual document similarity determination device 2 configured as described above will be described with reference to FIG. 5. The multilingual document similarity determination device 2 starts the following operation when a set of documents whose similarity is to be determined is input.
In FIG. 5, first, the word vector acquisition unit 12 acquires a word vector for each document in the set of documents whose similarity is to be determined (step S11).
Next, for each document, the semantic vector creation unit 13 creates the semantic vector of the document based on the word vector of the document and the matrix corresponding to the description language of the document (step S12).
Next, the similarity calculation unit 14 calculates the similarity of an arbitrary pair of documents in the set of documents (step S13).
Next, the similarity determination unit 26 performs the similarity determination based on the obtained similarities and outputs the determination result (step S14). As described above, the similarity determination unit 26 may output, as the determination result, information indicating whether an arbitrary pair of documents is similar by comparing the similarity with a threshold. The similarity determination unit 26 may also perform clustering or ranking of the documents using the similarities and output the result as the determination result.
With this, the multilingual document similarity determination device 2 ends its operation.
Next, effects of the second exemplary embodiment of the present invention will be described.
The multilingual document similarity determination device according to the second exemplary embodiment of the present invention can determine similar documents in a multilingual document group more accurately and at lower cost, even when there are three or more languages.
The reason is that the multilingual matrix storage unit holds, for each target language, a matrix learned for converting a word vector of a document into a semantic vector. The word vector acquisition unit acquires a word vector for a document. The semantic vector creation unit creates a semantic vector for the document based on the word vector and the matrix corresponding to its description language. The similarity calculation unit calculates the similarity of a pair of documents based on their semantic vectors. Then, the similarity determination unit performs the similarity determination based on the similarities over the set of documents subject to similarity determination.
As described above, the matrix for each target language held in the multilingual matrix storage unit has been learned by the multilingual document similarity learning device according to the first exemplary embodiment of the present invention, and is therefore independent of the counterpart language. Accordingly, to find other documents similar to a given document, the similarity calculation unit does not need to compute as many semantic vectors for that document as there are description languages in the group of documents to be compared. In other words, the similarity calculation unit only has to compute one semantic vector per document, regardless of the number of description languages of the comparison documents. The computational cost of similarity determination is therefore low. In addition, such semantic vectors are created so that each dimension carries the same meaning regardless of the language pair, so the similarity calculated from the semantic vectors is highly accurate.
Next, the operation of the second exemplary embodiment of the present invention will be illustrated with a specific example.
Here, an example of clustering news articles (documents) on the web across languages will be described.
It is assumed that the multilingual matrix storage unit 11 holds the matrix M_i for each language learned in the specific example of the first exemplary embodiment of the present invention. It is also assumed that a group of news articles collected from the web is input to the multilingual document similarity determination device 2 as the set of documents to be clustered (the set subject to similarity determination).
First, the word vector acquisition unit 12 converts each document in the set of documents to be clustered into a word vector. The conversion method is the same as in the specific example of the first exemplary embodiment of the present invention.
Next, the semantic vector creation unit 13 creates the semantic vector of each document by taking the product of the created word vector and the matrix M_i for the corresponding language held in the multilingual matrix storage unit 11. The creation method is the same as in the specific example of the first exemplary embodiment of the present invention.
Next, the similarity calculation unit 14 obtains the similarity of every pair of documents in the set of documents to be clustered by taking the inner product of their semantic vectors.
Then, the similarity determination unit 26 performs clustering by assigning pairs of documents whose similarity is equal to or greater than a threshold to the same cluster.
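A minimal sketch of this threshold-based clustering is shown below. Grouping documents into connected components of the similarity graph is one possible reading of "assigning pairs above the threshold to the same cluster", and the threshold value is assumed; the semantic vectors are assumed to be NumPy arrays.

```python
def cluster_by_threshold(semantic_vectors, threshold):
    """Put every pair of documents whose inner-product similarity reaches the threshold
    into the same cluster (connected components of the similarity graph)."""
    n = len(semantic_vectors)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a in range(n):
        for b in range(a + 1, n):
            if float(semantic_vectors[a] @ semantic_vectors[b]) >= threshold:
                parent[find(a)] = find(b)  # merge the clusters of documents a and b

    clusters = {}
    for idx in range(n):
        clusters.setdefault(find(idx), []).append(idx)
    return list(clusters.values())
```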
As described above, in this specific example, the multilingual matrix storage unit 11 holds, for each language, a matrix M_i for converting a word vector of a document into a semantic vector. Since this matrix M_i has been learned by the multilingual document similarity learning device 1 in the specific example of the first exemplary embodiment of the present invention, it is independent of the counterpart language. Therefore, when calculating the similarity of every pair of documents, the similarity calculation unit 14 only has to use the one semantic vector computed for each document, and no per-language-pair computation is required.
On the other hand, in the related technique described in Patent Document 2 above, it was necessary to compute M_ix · d_ij for as many languages as there are among the comparison documents (x denotes the language to be compared against).
In contrast, in this specific example, when a document d_ij is compared with a group of other documents, only one semantic vector M_i · d_ij needs to be obtained, even if the comparison documents are written in a plurality of languages. This specific example can therefore keep the computational cost of the similarity calculation low.
This completes the description of the specific example of the operation according to the second exemplary embodiment of the present invention.
In each exemplary embodiment of the present invention described above, the matrices held for the target languages were described mainly using an example in which their numbers of columns are equal to each other. Alternatively, the matrices may have equal numbers of rows. In that case, the number of columns of each matrix may be set to the number of dimensions of the word vector of a document written in the corresponding language.
Further, in each exemplary embodiment of the present invention described above, the semantic vector creation unit was described mainly as calculating, as the semantic vector, the product of the word vector of a document and the matrix corresponding to its description language. Alternatively, the semantic vector creation unit may use another calculation method that creates the semantic vector based on the word vector of the document and the matrix corresponding to its description language. Likewise, the similarity calculation unit was described mainly as calculating, as the similarity of a pair of documents, the inner product of their semantic vectors. Alternatively, the similarity calculation unit may use another calculation method that calculates the similarity based on the semantic vectors of the documents.
The multilingual document similarity learning device and the multilingual document similarity determination device according to the exemplary embodiments of the present invention described above may be realized on the same device.
Further, in each exemplary embodiment of the present invention described above, the functional blocks of the multilingual document similarity learning device and the multilingual document similarity determination device were described mainly as being realized by a CPU executing computer programs stored in a storage device or a ROM. Alternatively, some, all, or a combination of the functional blocks may be realized by dedicated hardware.
Further, in each exemplary embodiment of the present invention described above, the functional blocks of the multilingual document similarity learning device or the multilingual document similarity determination device may be distributed across a plurality of devices.
Further, in each exemplary embodiment of the present invention described above, the operations of the multilingual document similarity learning device and the multilingual document similarity determination device described with reference to the flowcharts may be stored, as a computer program of the present invention, in a storage device (storage medium) of a computer device, and the CPU may read and execute that computer program. In such a case, the present invention is constituted by the code of the computer program or by the storage medium.
The exemplary embodiments described above may also be implemented in appropriate combination.
The present invention has been described above using the exemplary embodiments as exemplary examples. However, the present invention is not limited to the exemplary embodiments described above; various aspects that can be understood by those skilled in the art can be applied to the present invention within its scope.
This application claims priority based on Japanese Patent Application No. 2014-67359 filed on March 28, 2014, the entire disclosure of which is incorporated herein.
DESCRIPTION OF SYMBOLS
1  Multilingual document similarity learning device
2  Multilingual document similarity determination device
11  Multilingual matrix storage unit
12  Word vector acquisition unit
13  Semantic vector creation unit
14  Similarity calculation unit
15  Multilingual matrix learning unit
26  Similarity determination unit
1001  CPU
1002  RAM
1003  ROM
1004  Storage device
1005  Input device
1006  Output device

Claims (10)

1.  A multilingual document similarity learning device comprising:
    multilingual matrix storage means for holding a matrix for each target language;
    word vector acquisition means for acquiring a word vector corresponding to a document;
    semantic vector creation means for creating a semantic vector of the document based on the word vector of the document and the matrix corresponding to a description language of the document;
    similarity calculation means for calculating, for a pair of documents, a similarity based on the semantic vector of each document; and
    multilingual matrix learning means for learning, in a set of documents each written in one of the target languages, by adjusting values of the matrix corresponding to each of the target languages so that the similarity of a pair of documents in a translation relationship becomes higher than the similarity of a pair of documents not in a translation relationship.
2.  The multilingual document similarity learning device according to claim 1, wherein the multilingual matrix learning means performs the learning of the matrices corresponding to the respective target languages in parallel.
3.  The multilingual document similarity learning device according to claim 1 or 2, wherein
    the multilingual matrix storage means holds, as the matrices for the respective target languages, matrices whose numbers of rows or columns are equal to each other,
    the semantic vector creation means creates, as the semantic vector of the document, a product of the word vector of the document and the matrix corresponding to the description language of the document, and
    the similarity calculation means calculates, as the similarity of the pair of documents, an inner product of the semantic vectors of the documents.
4.  The multilingual document similarity learning device according to any one of claims 1 to 3, wherein the multilingual matrix learning means learns each of the matrices so that the similarity of the pair of documents in a translation relationship becomes higher than the similarity between one document of that pair and another document not in a translation relationship with that document.
5.  The multilingual document similarity learning device according to any one of claims 1 to 4, wherein the multilingual matrix learning means learns each of the matrices using stochastic gradient descent so that the similarity of the pair of documents in a translation relationship becomes higher than the similarity of the pair of documents not in a translation relationship, and, at each step of the stochastic gradient descent, randomly selects from the set of documents a pair of documents in a translation relationship and a pair of documents not in a translation relationship.
6.  A multilingual document similarity determination device comprising:
    multilingual matrix storage means for holding the matrix for each target language learned by using the multilingual document similarity learning device according to any one of claims 1 to 5;
    word vector acquisition means for acquiring a word vector corresponding to a document;
    semantic vector creation means for creating a semantic vector of the document based on the word vector of the document and the matrix corresponding to a description language of the document;
    similarity calculation means for calculating, for a pair of documents, a similarity based on the semantic vector of each document; and
    similarity determination means for determining, in a set of documents subject to similarity determination, a similarity between documents using the calculated similarity.
7.  A multilingual document similarity learning method comprising:
    creating a semantic vector of a document based on a word vector corresponding to the document and a matrix, held for each target language, corresponding to a description language of the document, and calculating, for a pair of documents, a similarity based on the semantic vector of each document; and
    thereby learning, in a set of documents each written in one of the target languages, by adjusting values of the matrix corresponding to each of the target languages so that the similarity of a pair of documents in a translation relationship becomes higher than the similarity of a pair of documents not in a translation relationship.
8.  A multilingual document similarity determination method comprising:
    creating a semantic vector of a document based on a word vector corresponding to the document and the matrix for each target language learned by the multilingual document similarity learning method according to claim 7 and corresponding to a description language of the document, and calculating, for a pair of documents, a similarity based on the semantic vector of each document; and
    thereby determining, in a set of documents subject to similarity determination, a similarity between documents using the calculated similarity.
9.  A storage medium storing a multilingual document similarity learning program for causing a computer device to execute, using a matrix held for each target language:
    a word vector acquisition step of acquiring a word vector corresponding to a document;
    a semantic vector creation step of creating a semantic vector of the document based on the word vector of the document and the matrix corresponding to a description language of the document;
    a similarity calculation step of calculating, for a pair of documents, a similarity based on the semantic vector of each document; and
    a multilingual matrix learning step of learning, in a set of documents each written in one of the target languages, by adjusting values of the matrix corresponding to each of the target languages so that the similarity of a pair of documents in a translation relationship becomes higher than the similarity of a pair of documents not in a translation relationship.
10.  A storage medium storing a multilingual document similarity determination program for causing a computer device to execute, using the matrix for each target language learned by executing the multilingual document similarity learning program stored in the storage medium according to claim 9:
    a word vector acquisition step of acquiring a word vector corresponding to a document;
    a semantic vector creation step of creating a semantic vector of the document based on the word vector of the document and the matrix corresponding to a description language of the document;
    a similarity calculation step of calculating, for a pair of documents, a similarity based on the semantic vector of each document; and
    a similarity determination step of determining, in a set of documents subject to similarity determination, a similarity between documents using the calculated similarity.