WO2015145981A1 - Multilingual document similarity learning device, multilingual document similarity determination device, multilingual document similarity learning method, multilingual document similarity determination method, and storage medium

Publication number
WO2015145981A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
similarity
documents
matrix
multilingual
Prior art date
Application number
PCT/JP2015/001028
Other languages
English (en)
Japanese (ja)
Inventor
定政 邦彦
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社
Priority to JP2016509952A
Publication of WO2015145981A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing

Definitions

  • The present invention relates to a technique for finding documents with similar contents in a multilingual document group in which documents in different languages are mixed.
  • A technique related to this problem is described in Patent Document 1.
  • In that related technique, a target document is first machine-translated into a reference language, for example English, and documents with similar contents are then collected using a technique such as clustering.
  • Patent Document 2 proposes a framework called SSI (supervised semantic indexing), which learns a model representing the similarity of contents between two languages directly from a bilingual corpus, without going through an intermediate machine translation result. This related technique learns and determines the similarity of documents between two languages as follows.
  • In a set of documents that are in a bilingual relationship across the two languages, each document $d_{ij}$ is expressed as a bag-of-words vector (the dimensionalities $D_1$ and $D_2$ for the two languages may be chosen freely).
  • The subscript $i$ represents the language type.
  • The subscript $j$ represents the document ID within each language.
  • $W$ is a $D_1 \times D_2$ matrix ($D_1$ rows, $D_2$ columns).
  • $U^T$ represents the transpose of $U$.
  • $U$ and $V$ are $N \times D_1$ and $N \times D_2$ matrices, respectively.
  • For example, $N = 100$ is used.
  • An equation for calculating the score of the document pair is shown in the following equation (1).
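  • Equation (1) itself is not reproduced in this text. Given the definitions above, a form consistent with the stated dimensions would be the following. The low-rank factorization $W = U^T V$ and the second document index $k$ are assumptions inferred from those dimensions, not text recovered from the source.

```latex
\begin{equation}
f(d_{1j}, d_{2k}) = d_{1j}^{T}\, W\, d_{2k}
                  = (U d_{1j})^{T} (V d_{2k}),
\qquad W = U^{T} V
\tag{1}
\end{equation}
```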
  • Patent Document 3 describes another technique related to such a problem.
  • This related technique searches a plurality of documents described in different languages for those that are semantically approximate to a search request described in a certain language.
  • In this related technique, a word dictionary database is prepared in advance.
  • The word dictionary database associates a group of synonymous words $W_i$ across natural languages A, B, C, D, ... with a single word feature vector $V_i$.
  • This related technique computes the document feature vector of each document by summing the word feature vectors of the words contained in the document and normalizing the result.
  • Likewise, it computes the search request feature vector by summing the word feature vectors of the words contained in the search request and normalizing the result.
  • This related technique calculates the inner product of the search request feature vector and the document feature vector of each document as a semantic approximation.
  • This related technique searches for a document having a large semantic similarity as a document that approximates the search request.
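  • The procedure attributed to Patent Document 3 above can be sketched as follows. This is only an illustration under stated assumptions: the tiny dictionary `WORD_VECTORS`, its two-dimensional vectors, and the function names are hypothetical and do not come from Patent Document 3.

```python
import math

# Hypothetical word dictionary: synonyms in different languages share one
# word feature vector (assumed values, for illustration only).
WORD_VECTORS = {
    "dog": [1.0, 0.0],
    "chien": [1.0, 0.0],
    "cat": [0.0, 1.0],
    "chat": [0.0, 1.0],
}


def feature_vector(words):
    """Sum the word feature vectors of the known words, then normalize."""
    dim = len(next(iter(WORD_VECTORS.values())))
    total = [0.0] * dim
    for word in words:
        for k, value in enumerate(WORD_VECTORS.get(word, [0.0] * dim)):
            total[k] += value
    norm = math.sqrt(sum(x * x for x in total))
    # Leave an all-zero vector unchanged to avoid dividing by zero.
    return [x / norm for x in total] if norm > 0 else total


def semantic_approximation(query_words, doc_words):
    """Inner product of the normalized query and document feature vectors."""
    q = feature_vector(query_words)
    d = feature_vector(doc_words)
    return sum(a * b for a, b in zip(q, d))
```

  • Because every score depends on the quality of the shared dictionary, the sketch also makes the weakness discussed later visible: a wrong or missing synonym entry directly changes the result.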
  • However, the related technique described in Patent Document 1 is a framework that goes through the intermediate state of machine translation results, so the collected results are inaccurate when the machine translation itself is inaccurate.
  • Patent Document 3 does not describe how to learn the single word feature vector $V_i$ for a group of synonymous words $W_i$. Moreover, once polysemy is taken into account, the number of combinations of synonymous words $W_i$ may become enormous, so the cost of maintaining and learning the word dictionary database increases. In addition, determining the degree of approximation between documents via word-level synonym groups is equivalent to using word-level machine translation. This related technique therefore has the problem that the semantic approximation is inaccurate when the synonym groups (that is, the machine translation) are inaccurate.
  • Accordingly, an object of the present invention is to provide a technique for searching for similar documents more accurately and at lower cost, even when a multilingual document group contains three or more languages.
  • The multilingual document similarity learning device of the present invention includes: multilingual matrix storage means for holding a matrix for each target language; word vector acquisition means for acquiring a word vector corresponding to a document; semantic vector creation means for creating a semantic vector of the document based on the word vector of the document and the matrix corresponding to the description language of the document; similarity calculation means for calculating, for a set of documents, a similarity based on the semantic vector of each document; and multilingual matrix learning means for learning by adjusting the values of the matrix corresponding to each target language so that, in a set of documents each described in one of the target languages, the similarity of a pair of documents in a bilingual relationship is higher than the similarity of a pair of documents not in a bilingual relationship.
  • The multilingual document similarity determination device of the present invention includes: multilingual matrix storage means for holding the matrix for each target language learned using the multilingual document similarity learning device described above; word vector acquisition means for acquiring a word vector corresponding to a document; semantic vector creation means for creating a semantic vector of the document based on the word vector of the document and the matrix corresponding to the description language of the document; similarity calculation means for calculating, for a set of documents, a similarity based on the semantic vector of each document; and similarity determination means for determining the similarity between documents, using the calculated similarity, in a set of documents subject to similarity determination.
  • The multilingual document similarity learning method of the present invention uses a matrix held for each target language, creates a semantic vector of a document based on the word vector corresponding to the document and the matrix corresponding to the description language of the document, and calculates a similarity for each set of documents based on the semantic vector of each document. In a set of documents each described in one of the target languages, it learns by adjusting the values of the matrix corresponding to each target language so that the similarity of a pair of documents in a bilingual relationship is higher than the similarity of a pair of documents not in a bilingual relationship.
  • The multilingual document similarity determination method of the present invention uses the matrix for each target language learned by the multilingual document similarity learning method described above, creates a semantic vector of a document based on the word vector corresponding to the document and the matrix corresponding to the description language of the document, and calculates, for a set of documents, a similarity based on the semantic vector of each document. It then determines the similarity between documents, using that similarity, in a set of documents subject to similarity determination.
  • The storage medium of the present invention stores a multilingual document similarity learning program that causes a computer device to execute, using a matrix held for each target language: a word vector acquisition step of acquiring a word vector corresponding to a document; a semantic vector creation step of creating a semantic vector of the document based on the word vector of the document and the matrix corresponding to the description language of the document; a similarity calculation step of calculating, for a set of documents, a similarity based on the semantic vector of each document; and a multilingual matrix learning step of learning by adjusting the values of the matrix corresponding to each target language so that, in a set of documents each described in one of the target languages, the similarity of a pair of documents in a bilingual relationship is higher than the similarity of a pair of documents not in a bilingual relationship.
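  • Another storage medium of the present invention stores a multilingual document similarity determination program that causes a computer device to execute, using the matrix for each target language learned by executing the multilingual document similarity learning program stored in the above-described storage medium: a word vector acquisition step of acquiring a word vector corresponding to a document; a semantic vector creation step of creating a semantic vector of the document based on the word vector of the document and the matrix corresponding to the description language of the document; a similarity calculation step of calculating, for a set of documents, a similarity based on the semantic vector of each document; and a similarity determination step of determining the similarity between documents, using that similarity, in a set of documents subject to similarity determination.
  • The present invention can provide a technique for searching for similar documents more accurately and at lower cost, even when a multilingual document group contains three or more languages.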
  • FIG. 1 is a diagram showing a functional block configuration of a multilingual document similarity learning apparatus 1 as a first embodiment of the present invention.
  • The multilingual document similarity learning device 1 includes a multilingual matrix storage unit 11, a word vector acquisition unit 12, a semantic vector creation unit 13, a similarity calculation unit 14, and a multilingual matrix learning unit 15.
  • FIG. 2 is a diagram illustrating an example of a hardware configuration of the multilingual document similarity learning apparatus 1.
  • the multilingual document similarity learning device 1 is configured by a computer device.
  • the computer device includes a CPU (Central Processing Unit) 1001, a RAM (Random Access Memory) 1002, a ROM (Read Only Memory) 1003, a storage device 1004, an input device 1005, and an output device 1006.
  • the multilingual matrix storage unit 11 is configured by the storage device 1004.
  • the word vector acquisition unit 12, the semantic vector creation unit 13, and the similarity calculation unit 14 are configured by a CPU 1001 that reads a computer program and various data stored in the ROM 1003 and the storage device 1004 into the RAM 1002 and executes them.
  • the multilingual matrix learning unit 15 includes an input device 1005 and a CPU 1001 that reads a computer program and various data stored in the ROM 1003 and the storage device 1004 into the RAM 1002 and executes them. Note that the hardware configuration of the multilingual document similarity learning device 1 and each functional block thereof is not limited to the above-described configuration.
  • the multilingual matrix storage unit 11 holds a matrix for each target language.
  • Each matrix is a weight matrix for converting a word vector of a document described in the target language into a semantic vector.
  • the word vector and the semantic vector will be described later.
  • each matrix may have the same number of columns. In that case, the number of columns is the number of dimensions of the semantic vector. In this case, the number of rows in each matrix may be the number of dimensions of a word vector described later.
  • the word vector acquisition unit 12 acquires a word vector corresponding to the document.
  • the word vector is a concept that is generally used when calculating the similarity of documents, and is an expression format that represents a document by a set of words included in the document.
  • the number of dimensions of the word vector may be, for example, the number of words used in a target language describing the document (hereinafter also referred to as a description language).
  • The word vector acquisition unit 12 may create a word vector from a given document using a known technique.
  • Alternatively, the word vector acquisition unit 12 may acquire a word vector that was created in advance.
  • the semantic vector creation unit 13 creates the semantic vector of the document based on the word vector of the document and the matrix held in the multilingual matrix storage unit 11 corresponding to the description language of the document.
  • the semantic vector is information representing the semantic features of the document.
  • the semantic vector creation unit 13 may create a product of a word vector of a document and a matrix corresponding to the description language of the document as the semantic vector of the document.
  • the similarity calculation unit 14 calculates a similarity for a set of documents based on the semantic vector of each document. For example, the similarity calculation unit 14 may calculate the inner product of the semantic vectors of each document as the similarity of a set of documents.
  • The multilingual matrix learning unit 15 uses the semantic vector creation unit 13 and the similarity calculation unit 14 to learn, for a set of documents each described in one of the target languages, by adjusting the values of the matrix corresponding to the description language of each document. Specifically, the multilingual matrix learning unit 15 learns each matrix so that the similarity of a pair of documents in a bilingual relationship is higher than the similarity of a pair of documents not in a bilingual relationship. For example, the multilingual matrix learning unit 15 may adjust the matrix values so that the similarity of a pair of documents in a bilingual relationship is higher than the similarity between one document of that pair and another document that is not in a bilingual relationship with it. The multilingual matrix learning unit 15 preferably learns the matrices in parallel.
  • a set of documents used for learning by the multilingual matrix learning unit 15 will be described.
  • In such a set of documents, there may be three or more target languages in which the individual documents are described.
  • Such a set of documents includes, at least in part, pairs of documents that are in a bilingual relationship. It also includes, at least in part, pairs of documents that are not in a bilingual relationship.
  • the document set may be stored in advance in the storage device 1004.
  • a set of documents may be input from the outside via the input device 1005 or a network interface (not shown).
  • The multilingual matrix learning unit 15 may learn each of the matrices described above using the stochastic steepest gradient method (stochastic gradient descent).
  • In that case, at each step of the stochastic steepest gradient method, the multilingual matrix learning unit 15 may randomly select, from the set of documents, a pair of documents in a bilingual relationship and a pair of documents not in a bilingual relationship.
  • The operation of the multilingual document similarity learning device 1 configured as described above is described next with reference to the drawings. The multilingual document similarity learning device 1 starts the following operation when a set of documents, each described in one of the target languages, is input.
  • First, the word vector acquisition unit 12 acquires the corresponding word vector for each document in the set of documents, each described in one of the target languages (step S1).
  • Next, the multilingual matrix learning unit 15 instructs the semantic vector creation unit 13 to create a semantic vector for each document in the pairs of documents in a bilingual relationship and the pairs of documents not in a bilingual relationship within the set of documents (step S2).
  • The semantic vector creation unit 13 creates each semantic vector based on the word vector of the document and the matrix corresponding to its description language.
  • Next, the multilingual matrix learning unit 15 uses the similarity calculation unit 14 to calculate, based on the semantic vector of each document, the similarity of each pair of documents in a bilingual relationship and of each pair of documents not in a bilingual relationship (step S3).
  • Next, the multilingual matrix learning unit 15 adjusts the matrix corresponding to each description language so that the similarity of a pair of documents in a bilingual relationship is higher than the similarity of a pair of documents not in a bilingual relationship (step S4).
  • If the adjustment of the matrices has converged (Yes in step S5), the multilingual matrix learning unit 15 ends the learning.
  • If the adjustment of the matrices has not converged (No in step S5), the operation of the multilingual document similarity learning device 1 returns to step S2, and the device repeats the operations from step S2.
  • The multilingual document similarity learning device according to this embodiment can learn, more accurately and at lower cost, the matrices used to create the semantic vectors of documents when determining similar documents, even when a multilingual document group contains three or more languages.
  • The reason is as follows. The multilingual matrix storage unit holds, for each target language, a matrix for converting a word vector of a document into a semantic vector.
  • The word vector acquisition unit acquires the word vector corresponding to a document.
  • The semantic vector creation unit creates a semantic vector based on the word vector of the document and the matrix corresponding to the description language of the document.
  • The multilingual matrix learning unit then learns the matrix corresponding to each target language so that, in a set of documents each described in one of the target languages, the similarity of a pair of documents in a bilingual relationship is higher than the similarity of a pair of documents not in a bilingual relationship.
  • In this embodiment, a matrix for creating a semantic vector is prepared for each target language, not for each language pair, so there is no need to learn a matrix for each language pair. It is sufficient to learn one matrix per target language so that each dimension of a document's semantic vector has the same meaning regardless of the language pair. As a result, the matrix obtained by learning for each description language is independent of the partner language.
  • In this embodiment, even when a matrix is learned for a target language with a small absolute number of documents in the set, information can be obtained from its language pairs with several other target languages. A performance improvement can therefore be expected compared with learning a matrix for each language pair. Learning the matrices of the target languages in parallel can improve the learning accuracy further.
  • the translation relationship here does not need to be a complete translation relationship, a so-called parallel corpus.
  • the bilingual relationship may be a so-called comparable corpus that describes the same object in different languages.
  • a bilingual corpus used in statistical machine translation research may be used.
  • For example, the editions of Wikipedia in the various languages may be used as such a set of documents.
  • The multilingual matrix storage unit 11 stores a matrix $M_i$ for each language $i$.
  • The matrix $M_i$ corresponding to language $i$ is an $N \times D_i$ matrix.
  • $D_i$ is the number of words used in language $i$.
  • $D_i$ may differ from language to language.
  • The initial value of each $M_i$ is set to, for example, 0.
  • the word vector acquisition unit 12 converts each document in the above-described document set into a word vector.
  • the word vector is a concept that is generally used when calculating the similarity of documents, and is an expression format that expresses a document by a set of words included in the document.
  • the simplest word vector is a vector (for example, an expression such as [1, 0, 1, 0]) composed of elements in which the presence or absence of each word is represented by 0 or 1.
  • Other word vectors include those based on TF-IDF, that is, term frequency (TF) multiplied by inverse document frequency (IDF), which weights each word from the viewpoint of similarity calculation.
  • A method of first compressing the word vector using a technique such as LSI (Latent Semantic Indexing) or LDA (Latent Dirichlet Allocation) is also known.
  • Word N-grams or character N-grams may also be used.
  • In this specific example, word-level TF-IDF is used as the word vector.
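  • As a rough illustration of a TF-IDF word vector, the sketch below uses one common weighting, raw term count multiplied by $\log(n / \mathrm{df})$. The exact weighting used in the patent is not stated, so this formula, the function name `tfidf_vectors`, and the fixed `vocabulary` argument are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocabulary):
    """Return one TF-IDF vector per tokenized document.

    docs: list of token lists, all in the same language.
    vocabulary: ordered word list; it fixes the dimension D_i of the vectors.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        vectors.append([
            tf[word] * math.log(n_docs / df[word]) if df[word] else 0.0
            for word in vocabulary
        ])
    return vectors
```

  • A word that occurs in every document gets weight 0 under this variant, which is the intended effect of the IDF factor: words that do not discriminate between documents contribute nothing to the similarity.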
  • Under the control of the multilingual matrix learning unit 15, the semantic vector creation unit 13 calculates the product of the word vector of a document and the matrix corresponding to the description language of the document as the semantic vector of the document. Specifically, for the $j$-th document in language $i$, the semantic vector creation unit 13 calculates $M_i \cdot d_{ij}$, the product of the current matrix $M_i$ of language $i$ and the corresponding word vector $d_{ij}$, and uses it as the semantic vector. The number of dimensions of $M_i \cdot d_{ij}$ is $N$ regardless of the document or language.
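  • A minimal sketch of this product, and of the inner-product similarity built on it, is shown below. The function names and the toy matrices are hypothetical; the point is only that vocabularies of different sizes map into the same $N$-dimensional space.

```python
def semantic_vector(matrix, word_vector):
    """Return matrix * word_vector, where matrix has N rows and D_i columns."""
    return [sum(m * x for m, x in zip(row, word_vector)) for row in matrix]


def similarity(sem_a, sem_b):
    """Inner product of two N-dimensional semantic vectors."""
    return sum(a * b for a, b in zip(sem_a, sem_b))


# Toy matrices (assumed values): N = 2, with vocabulary sizes 3 and 4.
M_EN = [[1.0, 0.0, 0.0],
        [0.0, 1.0, 1.0]]
M_FR = [[0.0, 1.0, 0.0, 0.0],
        [1.0, 0.0, 0.0, 1.0]]

sem_en = semantic_vector(M_EN, [2.0, 1.0, 0.0])       # 3-dim word vector
sem_fr = semantic_vector(M_FR, [1.0, 3.0, 0.0, 0.0])  # 4-dim word vector
score = similarity(sem_en, sem_fr)
```

  • Both semantic vectors have length 2 here even though the word vectors have lengths 3 and 4, so documents in different languages can be compared directly.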
  • The multilingual matrix learning unit 15 learns each $M_i$ using the semantic vector creation unit 13 and the similarity calculation unit 14.
  • The basic idea of the learning is to adjust the matrices for the multiple description languages in parallel so that the similarity of a pair of documents in a bilingual relationship is higher than the similarity of a pair of documents not in a bilingual relationship.
  • Specifically, for a document $q$ in language $i_q$, a document $d^+$ in language $i_+$ that is in a bilingual relationship with $q$, and a document $d^-$ in language $i_-$ that is not in a bilingual relationship with $q$, all taken from the set of documents described above, the multilingual matrix learning unit 15 adjusts the matrices $M_{i_q}$, $M_{i_+}$, and $M_{i_-}$ in parallel so as to satisfy the following expression (2).
  • the multilingual matrix learning unit 15 performs adjustment so as to minimize the loss function of the following equation (3) considering the margin.
  • $R$ is the set of triples, each consisting of a document in the input document set, a document that is in a bilingual relationship with it, and a document that is not in a bilingual relationship with it.
  • $f(q, d)$ represents the similarity between documents $q$ and $d$. That is, if the language of document $q$ is $i_q$ and the language of $d$ is $i_d$, then $f(q, d) = (M_{i_q} \cdot q)^T \cdot (M_{i_d} \cdot d)$.
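  • Expressions (2) and (3) are not reproduced in this text. Forms consistent with the surrounding description (a similarity ordering, a margin, and the condition $1 - f(q, d^+) + f(q, d^-) > 0$ used in the update step) would be the following. These are reconstructions under that assumption, not equations recovered from the source.

```latex
\begin{align}
f(q, d^{+}) &> f(q, d^{-}) \tag{2} \\
\sum_{(q,\, d^{+},\, d^{-}) \in R}
  \max\bigl(0,\; 1 - f(q, d^{+}) + f(q, d^{-})\bigr) \tag{3}
\end{align}
```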
  • One of the methods for minimizing the loss function described above is a method using a stochastic steepest gradient method.
  • The multilingual matrix learning unit 15 randomly selects a triple of $q$, $d^+$, and $d^-$ at each step of the stochastic steepest gradient method and, when $1 - f(q, d^+) + f(q, d^-) > 0$, updates the matrices $M_{i_q}$, $M_{i_+}$, and $M_{i_-}$ as in the following equations (4) to (6).
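  • Equations (4) to (6) are likewise missing from this text. Taking the gradient of $1 - f(q, d^+) + f(q, d^-)$ with $f(q, d) = (M_{i_q} q)^T (M_{i_d} d)$ gives the updates below. The learning rate $\lambda$ is an assumed symbol, and the whole block is a derivation from the stated definitions rather than the source's own equations.

```latex
\begin{align}
M_{i_q} &\leftarrow M_{i_q}
  + \lambda \bigl(M_{i_+} d^{+} - M_{i_-} d^{-}\bigr)\, q^{T} \tag{4} \\
M_{i_+} &\leftarrow M_{i_+}
  + \lambda \bigl(M_{i_q} q\bigr)\, (d^{+})^{T} \tag{5} \\
M_{i_-} &\leftarrow M_{i_-}
  - \lambda \bigl(M_{i_q} q\bigr)\, (d^{-})^{T} \tag{6}
\end{align}
```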
  • The multilingual matrix learning unit 15 repeats this random selection of documents and the corresponding adjustment of the matrices $M$ until convergence.
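  • The learning loop described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the update rule is the gradient step derived from the margin condition, and the learning rate `lr`, the fixed step count standing in for a convergence test, and all function names are assumptions. Documents are represented as `(language, word_vector)` pairs.

```python
import random


def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def add_outer(m, u, v, scale):
    """In place: m += scale * outer(u, v)."""
    for r, ur in enumerate(u):
        for c, vc in enumerate(v):
            m[r][c] += scale * ur * vc


def score(mats, q, d):
    """f(q, d) = (M_iq q)^T (M_id d) for (language, word_vector) documents."""
    return dot(matvec(mats[q[0]], q[1]), matvec(mats[d[0]], d[1]))


def sgd_step(mats, q, d_pos, d_neg, lr=0.1):
    """One margin-based update; returns True if the matrices were changed."""
    if 1.0 - score(mats, q, d_pos) + score(mats, q, d_neg) <= 0.0:
        return False  # margin already satisfied, nothing to adjust
    # Compute every semantic vector first so all updates use the old matrices.
    sq = matvec(mats[q[0]], q[1])
    sp = matvec(mats[d_pos[0]], d_pos[1])
    sn = matvec(mats[d_neg[0]], d_neg[1])
    diff = [a - b for a, b in zip(sp, sn)]
    add_outer(mats[q[0]], diff, q[1], lr)        # query-language matrix
    add_outer(mats[d_pos[0]], sq, d_pos[1], lr)  # pull the translation closer
    add_outer(mats[d_neg[0]], sq, d_neg[1], -lr) # push the non-translation away
    return True


def train(mats, triples, steps=1000, seed=0):
    """Repeat random triple selection and adjustment for a fixed step count."""
    rng = random.Random(seed)
    for _ in range(steps):
        sgd_step(mats, *rng.choice(triples))
    return mats
```

  • One practical point follows from this form of the update: if every matrix starts at exactly zero, every semantic vector is zero and each update adds nothing, so a sketch like this one needs small nonzero (for example random) initial values to make progress.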
  • the multilingual matrix storage unit 11 stores a matrix M i for converting a word vector of a document into a semantic vector for each language i.
  • The multilingual matrix learning unit 15 learns the matrices $M_i$ for the multiple languages $i$ in parallel so that the similarity of a pair of documents in a bilingual relationship is higher than the similarity of a pair of documents not in a bilingual relationship. Learning can thereby be performed so that each dimension of the semantic vector has the same meaning regardless of the language pair.
  • the matrix M i for the target language i obtained by learning becomes independent of the partner language.
  • By contrast, for a given language a, the related technique described in Patent Document 2 has to hold and learn multiple matrices such as $M_{ab}$, $M_{ac}$, ..., one for each partner language b, c, .... This is because the meaning of each dimension of the semantic vector ($M \cdot d$) differs for each language pair.
  • In this specific example, the semantic vector ($M \cdot d$) can be handled as a vector that is independent of the partner language.
  • This is because the multilingual matrix learning unit learns $M_i$ so that each dimension of the semantic vector has the same meaning regardless of the language pair.
  • As a result, where $n$ is the number of target languages, the number of matrices $M$ is reduced from $n \times (n - 1)$ to $n$.
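  • As a worked instance of this count, taking $n = 5$ target languages purely as an illustrative value:

```latex
\begin{align*}
\text{one matrix per ordered language pair:} \quad & n \times (n - 1) = 5 \times 4 = 20 \\
\text{one matrix per language:} \quad & n = 5
\end{align*}
```

  • The gap grows quadratically: adding a sixth language adds 10 matrices in the pairwise scheme but only one in the per-language scheme.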
  • Furthermore, when learning the matrix of a language for which the absolute number of documents is small, the multilingual document similarity learning device in this specific example can obtain information from pairs consisting of a document described in that language and documents described in each of several other languages. For this reason, the multilingual document similarity learning device in this specific example can improve learning performance.
  • FIG. 4 is a diagram showing a functional block configuration of the multilingual document similarity determination apparatus 2 according to the second embodiment of the present invention.
  • the multilingual document similarity determination device 2 includes a multilingual matrix storage unit 11, a word vector acquisition unit 12, a semantic vector creation unit 13, a similarity calculation unit 14, and a similarity determination unit 26.
  • The multilingual document similarity determination device 2 can be configured with the same hardware elements as the multilingual document similarity learning device 1 according to the first embodiment of the present invention described above.
  • the similarity determination unit 26 includes an output device 1006 and a CPU 1001 that reads a computer program and various data stored in the ROM 1003 and the storage device 1004 into the RAM 1002 and executes them.
  • the hardware configuration of the multilingual document similarity determination device 2 and each functional block thereof is not limited to the above-described configuration.
  • the multilingual matrix storage unit 11 holds a matrix for each target language learned by the multilingual document similarity learning device 1 according to the first embodiment of the present invention.
  • the word vector acquisition unit 12, the semantic vector creation unit 13, and the similarity calculation unit 14 are configured in the same manner as in the first embodiment of the present invention.
  • the similarity determination unit 26 determines the similarity of a document using the similarity calculated by the similarity calculation unit 14 in a set of documents to be subjected to similarity determination.
  • The set of documents to be subjected to similarity determination may be a pair of two documents.
  • the similarity determination unit 26 may determine that a set of documents to be determined is similar if the similarity is equal to or greater than a threshold value, and may determine that they are not similar if the similarity is less than the threshold value.
  • the set of documents to be subjected to similarity determination may be a set of three or more documents.
  • the similarity determination unit 26 may perform clustering of documents based on the similarity as the determination of the similarity in the set of documents to be determined. Further, for example, the similarity determination unit 26 may perform ranking of similar documents with respect to a certain document as determination of similarity in the set of documents to be determined.
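  • Ranking similar documents for a given document can be sketched as a sort by inner product. The function name `rank_similar` and the dictionary layout of `candidates` are hypothetical choices for this illustration.

```python
def rank_similar(query_vec, candidates):
    """Return candidate ids ordered from most to least similar to the query.

    query_vec: semantic vector of the reference document.
    candidates: dict mapping document id -> semantic vector.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return sorted(
        candidates,
        key=lambda doc_id: dot(query_vec, candidates[doc_id]),
        reverse=True,
    )
```

  • Because every document has a single language-independent semantic vector, the candidates may be written in any mix of target languages without changing this code.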
  • the similarity determination unit 26 may output the determination result to the output device 1006.
  • the set of documents to be subjected to similarity determination may be stored in the storage device 1004 in advance. Further, the set of documents to be subjected to similarity determination may be input from the outside via the input device 1005 or a network interface (not shown).
  • the multilingual document similarity determination apparatus 2 starts the following operation when a set of documents whose similarity is to be determined is input.
  • the word vector acquisition unit 12 acquires a word vector for each document in a set of documents whose similarity is to be determined (step S11).
  • the semantic vector creation unit 13 creates a semantic vector for each document based on a word vector of the document and a matrix corresponding to the description language of the document (step S12).
  • the similarity calculation unit 14 calculates the similarity of a set of arbitrary documents in the set of documents (step S13).
  • the similarity determination unit 26 determines the similarity based on the obtained similarity and outputs a determination result (step S14). As described above, the similarity determination unit 26 may output information indicating whether or not a set of arbitrary documents is similar by comparing the similarity with a threshold value. Further, the similarity determination unit 26 may perform clustering and ranking of documents using the similarity and output the result as a determination result.
  • the multilingual document similarity determination device 2 ends the operation.
  • The multilingual document similarity determination device according to this embodiment can determine similar documents more accurately and at lower cost, even when a multilingual document group contains three or more languages.
  • The reason is as follows. The multilingual matrix storage unit holds, for each target language, a learned matrix for converting a word vector of a document into a semantic vector.
  • The word vector acquisition unit acquires a word vector for each document.
  • The semantic vector creation unit creates a semantic vector for each document based on its word vector and the matrix corresponding to its description language.
  • The similarity calculation unit calculates the similarity of a set of documents based on their semantic vectors, and the similarity determination unit determines similarity, based on that similarity, in the set of documents subject to similarity determination.
  • the similarity calculation unit does not need to calculate a semantic vector for a certain document by the number corresponding to the description language of the document group to be compared.
  • the similarity calculation unit may calculate one semantic vector for a document regardless of the number of description languages of the document group to be compared. For this reason, the calculation cost for similarity determination becomes low. Further, such a semantic vector is created so that the meaning of each dimension is the same regardless of the language pair. For this reason, the similarity calculated based on the semantic vector has high accuracy.
  • the multilingual matrix storage unit 11 holds a matrix M i for each language learned in the specific example according to the first embodiment of the present invention.
  • a news article group collected from the web is input to the multilingual document similarity determination device 2 as a set of documents to be clustered (similarity determination target).
  • the word vector acquisition unit 12 converts each document in the set of documents to be clustered into a word vector.
  • the conversion method is the same as the specific example in the first embodiment of the present invention.
  • the semantic vector creation unit 13 creates a semantic vector for each document by taking the product of the created word vector and the matrix M_i for the document's language, stored in the multilingual matrix storage unit 11.
  • the creation method is the same as the specific example in the first embodiment of the present invention.
  • the similarity calculation unit 14 obtains the similarity of each pair of documents in the set of documents to be clustered by taking the inner product of their semantic vectors.
  • the similarity determination unit 26 performs clustering by assigning documents whose similarity is equal to or greater than a threshold value to the same cluster.
  • the multilingual matrix storage unit 11 holds, for each language, the matrix M_i that converts the word vector of a document into a semantic vector. Since this matrix M_i is learned by the multilingual document similarity learning device 1 in the specific example of the first embodiment of the present invention, it is independent of the partner language. Therefore, when calculating the similarity for any pair of documents, the similarity calculation unit 14 can use the single semantic vector calculated for each document, and no calculation per language pair is needed.
  • in this specific example, when a certain document d_ij is compared with other document groups, only one semantic vector M_i · d_ij needs to be computed, even if the comparison target document group is written in a plurality of languages. Therefore, this specific example can reduce the calculation cost of the similarity.
  • the embodiments above have been described mainly using an example in which the matrices held for the target languages all have the same number of columns.
  • alternatively, the matrices may all have the same number of rows.
  • in that case, the number of columns of each matrix may be set to the number of dimensions of the word vectors of documents written in the corresponding language.
  • the description has mainly used an example in which the semantic vector creation unit calculates the product of a document's word vector and the matrix corresponding to its language as the semantic vector.
  • however, the semantic vector creation unit may use another calculation method that creates a semantic vector from the word vector of a document and the matrix corresponding to its language.
  • likewise, the description has mainly used an example in which the similarity calculation unit obtains the similarity of a document pair as the inner product of the semantic vectors of its documents.
  • however, the similarity calculation unit may use another calculation method that calculates the similarity from the semantic vector of each document.
  • the multilingual document similarity learning device and the multilingual document similarity determination device as the embodiments of the present invention described above may be realized on the same device.
  • the description has centered on examples in which each functional block of the multilingual document similarity learning device and the multilingual document similarity determination device is realized by a CPU that executes a computer program stored in a storage device or ROM.
  • alternatively, some or all of the functional blocks, or a combination of them, may be realized by dedicated hardware.
  • the functional blocks of the multilingual document similarity learning device or the multilingual document similarity determination device may be distributed and implemented in a plurality of devices.
  • the operations of the multilingual document similarity learning device and the multilingual document similarity determination device described with reference to the flowcharts may be implemented as the computer program of the present invention.
  • in that case, the present invention is constituted by the code of that computer program or by a storage medium storing it.
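The determination steps described above (one semantic vector per document as the product M_i · d_ij, inner-product similarity, and threshold clustering) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the toy matrices and the language codes are hypothetical, the matrices use the "same number of rows" layout so that M_i · d_ij is well defined, and merging clusters transitively with a union-find structure is one possible reading of "belong to the same cluster", not something the description specifies.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]              # one row per shared semantic dimension
Document = Tuple[str, Sequence[float]]  # (language code, word vector)


def semantic_vector(matrix: Matrix, word_vector: Sequence[float]) -> Vector:
    """Project a word vector into the shared semantic space: M_i * d_ij."""
    if any(len(row) != len(word_vector) for row in matrix):
        raise ValueError("matrix columns must match the word-vector dimension")
    return [sum(m * w for m, w in zip(row, word_vector)) for row in matrix]


def similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two semantic vectors."""
    return sum(a * b for a, b in zip(u, v))


def cluster(docs: Sequence[Document], matrices: Dict[str, Matrix],
            threshold: float) -> List[List[int]]:
    """Group document indices whose pairwise similarity reaches the threshold.

    Assumption: clusters are merged transitively (single-link style).
    """
    # One semantic vector per document, whatever language it is compared with.
    vectors = [semantic_vector(matrices[lang], wv) for lang, wv in docs]

    parent = list(range(len(docs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if similarity(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy data: three languages with different vocabulary sizes (columns),
# all mapped into the same 2-dimensional semantic space (rows).
matrices = {
    "en": [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],
    "ja": [[1.0, 0.0], [0.0, 1.0]],
    "fr": [[0.0, 1.0], [1.0, 0.0]],
}
docs = [
    ("en", [1.0, 0.0, 0.0]),
    ("ja", [1.0, 0.0]),
    ("fr", [0.0, 1.0]),
    ("en", [0.0, 1.0, 1.0]),
]
print(cluster(docs, matrices, threshold=0.5))  # [[0, 1, 2], [3]]
```

Each document is projected exactly once, so comparing n documents in k languages needs n matrix-vector products rather than one per language pair, which is the cost saving the description points to.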

Abstract

The present invention provides a technology for finding similar documents in a multilingual document group at lower cost and with higher accuracy, even when three or more languages are present. The multilingual document similarity learning device (1) includes: a multilingual matrix storage unit (11) that stores a matrix for each target language; a word vector acquisition unit (12) that acquires a word vector for a document; a semantic vector creation unit (13) that creates a semantic vector for the document based on the document's word vector and the matrix corresponding to the language in which the document is written; a similarity calculation unit (14) that calculates similarities based on the semantic vectors of the documents in a document set; and a multilingual matrix learning unit (15) that performs learning by adjusting the values of the matrices for the respective target languages so that, within a collection of documents each written in one of the target languages, the similarities of document sets that are in a source-translation relationship are higher than the similarities of document sets that are not.
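The learning goal stated in the abstract (translation-related document sets should score higher than unrelated ones) can be illustrated with a small sketch. The abstract does not give the update rule, so everything here is an assumption made for illustration: a hinge (margin ranking) loss, a plain gradient step, and the hypothetical names `ranking_step`, `add_outer`, `margin` and `lr`.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]
Document = Tuple[str, Sequence[float]]  # (language code, word vector)


def semantic_vector(matrix: Matrix, word_vector: Sequence[float]) -> List[float]:
    """M_i * d_ij: project a word vector into the shared semantic space."""
    return [sum(m * w for m, w in zip(row, word_vector)) for row in matrix]


def similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two semantic vectors."""
    return sum(a * b for a, b in zip(u, v))


def add_outer(matrix: Matrix, left: Sequence[float],
              right: Sequence[float], scale: float) -> None:
    """In place: matrix += scale * outer(left, right)."""
    for r, l_val in enumerate(left):
        for c, r_val in enumerate(right):
            matrix[r][c] += scale * l_val * r_val


def ranking_step(matrices: Dict[str, Matrix], anchor: Document,
                 positive: Document, negative: Document,
                 margin: float = 1.0, lr: float = 0.1) -> float:
    """One assumed hinge-loss update.

    Pushes sim(anchor, positive) above sim(anchor, negative) by `margin`,
    where `positive` is a translation of `anchor` and `negative` is not.
    Returns the loss measured before the update (0.0 if already satisfied).
    """
    (la, da), (lp, dp), (ln, dn) = anchor, positive, negative
    va = semantic_vector(matrices[la], da)
    vp = semantic_vector(matrices[lp], dp)
    vn = semantic_vector(matrices[ln], dn)
    loss = margin - similarity(va, vp) + similarity(va, vn)
    if loss <= 0.0:
        return 0.0
    # Gradient descent on the loss; all vectors were computed before any
    # update, so the steps stay correct when two documents share a matrix.
    diff = [p - n for p, n in zip(vp, vn)]
    add_outer(matrices[la], diff, da, lr)   # anchor-language matrix
    add_outer(matrices[lp], va, dp, lr)     # raise similarity to translation
    add_outer(matrices[ln], va, dn, -lr)    # lower similarity to non-translation
    return loss
```

For example, calling `ranking_step` repeatedly on an English anchor, its Japanese translation and an unrelated Japanese document adjusts only the "en" and "ja" matrices; because each language keeps a single matrix, the same update applies unchanged when further languages are added.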
PCT/JP2015/001028 2014-03-28 2015-02-27 Dispositif d'apprentissage de degré de similarité de documents multilingues, dispositif de détermination de degré de similarité de documents multilingues, procédé d'apprentissage de degré de similarité de documents multilingues, procédé de détermination de degré de similarité de documents multilingues, et support de stockage WO2015145981A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2016509952A JPWO2015145981A1 (ja) 2014-03-28 2015-02-27 多言語文書類似度学習装置、多言語文書類似度判定装置、多言語文書類似度学習方法、多言語文書類似度判定方法、および、多言語文書類似度学習プログラム

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2014067359 2014-03-28
JP2014-067359 2014-03-28

Publications (1)

Publication Number Publication Date
WO2015145981A1 (fr) 2015-10-01

Family

ID=54194537

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2015/001028 WO2015145981A1 (fr) 2014-03-28 2015-02-27 Dispositif d'apprentissage de degré de similarité de documents multilingues, dispositif de détermination de degré de similarité de documents multilingues, procédé d'apprentissage de degré de similarité de documents multilingues, procédé de détermination de degré de similarité de documents multilingues, et support de stockage

Country Status (2)

Country Link
JP (1) JPWO2015145981A1 (fr)
WO (1) WO2015145981A1 (fr)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10430518B2 (en) 2017-01-22 2019-10-01 Alibaba Group Holding Limited Word vector processing for foreign languages
CN110795572A (zh) * 2019-10-29 2020-02-14 腾讯科技(深圳)有限公司 一种实体对齐方法、装置、设备及介质
WO2022113306A1 (fr) * 2020-11-27 2022-06-02 日本電信電話株式会社 Dispositif d'alignement, dispositif d'apprentissage, procédé d'alignement, procédé d'apprentissage et programme
JP7419961B2 (ja) 2020-05-12 2024-01-23 富士通株式会社 文書抽出プログラム、文書抽出装置、及び文書抽出方法
WO2024043355A1 (fr) * 2022-08-23 2024-02-29 주식회사 아카에이아이 Procédé de gestion de données linguistiques et serveur le mettant en œuvre

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100179933A1 (en) * 2009-01-12 2010-07-15 Nec Laboratories America, Inc. Supervised semantic indexing and its extensions

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JOHN C. PLATT ET AL.: "Translingual document representations from discriminative projections", EMNLP '10 PROCEEDINGS OF THE 2010 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, pages 251 - 261, XP055225060, Retrieved from the Internet <URL:http://dl.acm.org/citation.cfm?id=1870683> [retrieved on 20150326] *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10430518B2 (en) 2017-01-22 2019-10-01 Alibaba Group Holding Limited Word vector processing for foreign languages
US10878199B2 (en) 2017-01-22 2020-12-29 Advanced New Technologies Co., Ltd. Word vector processing for foreign languages
CN110795572A (zh) * 2019-10-29 2020-02-14 腾讯科技(深圳)有限公司 一种实体对齐方法、装置、设备及介质
CN110795572B (zh) * 2019-10-29 2022-05-17 腾讯科技(深圳)有限公司 一种实体对齐方法、装置、设备及介质
JP7419961B2 (ja) 2020-05-12 2024-01-23 富士通株式会社 文書抽出プログラム、文書抽出装置、及び文書抽出方法
WO2022113306A1 (fr) * 2020-11-27 2022-06-02 日本電信電話株式会社 Dispositif d'alignement, dispositif d'apprentissage, procédé d'alignement, procédé d'apprentissage et programme
WO2024043355A1 (fr) * 2022-08-23 2024-02-29 주식회사 아카에이아이 Procédé de gestion de données linguistiques et serveur le mettant en œuvre

Also Published As

Publication number Publication date
JPWO2015145981A1 (ja) 2017-04-13

Similar Documents

Publication Publication Date Title
CN108959246B (zh) 基于改进的注意力机制的答案选择方法、装置和电子设备
WO2015145981A1 (fr) Dispositif d'apprentissage de degré de similarité de documents multilingues, dispositif de détermination de degré de similarité de documents multilingues, procédé d'apprentissage de degré de similarité de documents multilingues, procédé de détermination de degré de similarité de documents multilingues, et support de stockage
JP2020500371A (ja) 意味的検索のための装置および方法
US9298693B2 (en) Rule-based generation of candidate string transformations
JP4711761B2 (ja) データ検索装置、データ検索方法、データ検索プログラムおよびコンピュータに読み取り可能な記録媒体
KR102059743B1 (ko) 딥러닝 기반의 지식 구조 생성 방법을 활용한 의료 문헌 구절 검색 방법 및 시스템
WO2013128684A1 (fr) Dispositif, procédé et programme de génération de dictionnaire
WO2019163642A1 (fr) Dispositif d'évaluation de résumé, procédé, programme et support de stockage
JP7388256B2 (ja) 情報処理装置及び情報処理方法
US11263251B2 (en) Method for determining output data for a plurality of text documents
JP2014010634A (ja) 対訳表現抽出装置、対訳表現抽出方法及び対訳表現抽出のためのコンピュータプログラム
JP5869948B2 (ja) パッセージ分割方法、装置、及びプログラム
US20220179890A1 (en) Information processing apparatus, non-transitory computer-readable storage medium, and information processing method
JP5533272B2 (ja) データ出力装置、データ出力方法およびデータ出力プログラム
CN112434134B (zh) 搜索模型训练方法、装置、终端设备及存储介质
KR102519955B1 (ko) 토픽 키워드의 추출 장치 및 방법
CN113302601A (zh) 含义关系学习装置、含义关系学习方法及含义关系学习程序
KR101592670B1 (ko) 인덱스를 이용하는 데이터 검색 장치 및 이를 이용하는 방법
CN110413956B (zh) 一种基于bootstrapping的文本相似度计算方法
JP2010009237A (ja) 多言語間類似文書検索装置及び方法及びプログラム及びコンピュータ読取可能な記録媒体
JP6746472B2 (ja) 生成装置、生成方法および生成プログラム
CN111694948B (zh) 文本的分类方法及系统、电子设备、存储介质
Ajeissh et al. An adaptive distributed approach of a self organizing map model for document clustering using ring topology
CN111581162B (zh) 一种基于本体的海量文献数据的聚类方法
JP2018085051A (ja) 類似度算出プログラム、類似度算出方法、および類似度算出装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15769056

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2016509952

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase
122 Ep: pct application non-entry in european phase

Ref document number: 15769056

Country of ref document: EP

Kind code of ref document: A1