CN103235833B - Answer search method and device by the aid of statistical machine translation - Google Patents
- Publication number
- CN103235833B (application CN201310180146.4A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses an answer retrieval method and device based on statistical machine translation. First, a statistical machine translation tool translates the candidate answers into several other languages, yielding several equivalent representations of each candidate answer. Second, matrix factorization reduces the dimensionality of these equivalent representations to obtain low-dimensional latent representations. Third, the query question is converted into a low-dimensional latent representation using the same statistical machine translation and matrix factorization steps. Finally, the similarity between the query question and each candidate answer is computed in the latent space, and the candidate answers with the highest similarity are returned as the answers to the query question. The proposed method effectively addresses the problems of vocabulary mismatch and lexical ambiguity. Experiments on a large-scale community question answering dataset show that answer retrieval performance improves by 29.36%.
Description
Technical Field
The invention belongs to the technical field of natural language processing and relates to an answer retrieval method and device based on statistical machine translation.
Background
With the rapid development of Internet technology, Internet services based on User-Generated Content (UGC) have become increasingly popular. Community question answering emerged against this background as a new kind of question-and-answer system for exchanging information and sharing knowledge; examples include Yahoo! Answers and Baidu Zhidao. Unlike automatic question answering systems, a community question answering site lets users ask any type of question and answer any type of question posed by other users. Answer retrieval is the basis of community question answering analysis and occupies an important position. The task of answer retrieval is to retrieve, from a large-scale candidate answer collection, the answers that are semantically identical or close to the query question, so that they can be used to answer it. Answer retrieval therefore has both theoretical significance and practical value.
The main challenges that answer retrieval currently faces are vocabulary mismatch and lexical ambiguity between query questions and candidate answers. Lexical ambiguity often causes an answer retrieval model to return many answers that do not match the user's query intent; the main reason is that both the questions and the answers in community question answering are written by users, whose query intents are highly diverse. For example, depending on the user, the word "interest" can mean either "curiosity" or "a charge for borrowing money". Vocabulary mismatch is also common between query questions and candidate answers: many words occur only rarely in the query question and the candidate answers, or do not occur in them at all, so traditional methods based on term matching cannot handle them.
One way to address the lexical ambiguity and lexical gap (vocabulary mismatch) problems described above is to use statistical machine translation, so that ambiguous words in the original language, and words whose surface forms differ, are represented by their corresponding translations. Using statistical machine translation in this way requires, first, a reasonable objective function that integrates the original language and its translations into a single framework; second, a way to minimize the noise introduced by statistical machine translation; and third, a fast method for solving that objective function. Simply adding the translated words to the original-language text greatly reduces the accuracy of answer retrieval, mainly because doing so greatly increases the computational complexity, and machine translation errors also introduce considerable noise.
The task of answer retrieval is, for a query question entered by a user, to retrieve from the collection of answer documents the answers that can answer it. The main difficulty is that the user's query question and the candidate answers use different wording to express the same or similar meanings, which easily leads to vocabulary mismatch and lexical ambiguity. Traditional methods rely mainly on mining word associations within a single language and ignore the semantic associations among multilingual information.
Summary of the Invention
To solve the above problems, the present invention first designs a reasonable objective function that effectively integrates the original language and its corresponding translations into a single framework, while constraining, within that framework, the effect of machine translation noise on answer retrieval. A fast solution method is then designed for the objective function and its constraints. Solving the objective function yields latent representations of the original language and its translations, and the similarity between the user query and the candidate answers is finally computed in the latent space. Following this approach, the invention targets the two main difficulties of answer retrieval and introduces statistical machine translation into the answer retrieval process. Experiments show that the method effectively improves the accuracy of answer retrieval.
The basic idea of the invention is to make full use of statistical machine translation, representing ambiguous words in the original language, and words whose surface forms differ, by their corresponding translations, thereby improving answer retrieval performance.
The invention discloses an answer retrieval method based on statistical machine translation, comprising the following steps:
Step 1. Translating all candidate answers expressed in the original language into several other languages with a statistical machine translation tool;
Step 2. Integrating the candidate answers expressed in each language, including the original language, into a framework based on non-negative matrix factorization;
Step 3. Solving the non-negative matrix factorization framework with a fast gradient descent algorithm based on least squares, to obtain low-dimensional representations of all candidate answers in each language;
Step 4. Translating the query question expressed in the original language into the several other languages with the statistical machine translation tool;
Step 5. Using the low-dimensional representations obtained in Step 3 to map the query question and its translations into the low-dimensional space;
Step 6. Computing, from the low-dimensional representations of the query question and its translations and of the corresponding candidate answers, the similarity between the query question (and each of its translations) and the corresponding candidate answers, and obtaining the final retrieval result from these similarities.
The invention also discloses an answer retrieval device based on statistical machine translation, comprising:
a candidate answer translation module, configured to translate the candidate answers into other languages;
a matrix decomposition module, configured to integrate the candidate answers expressed in each language, including the original language, into a framework based on non-negative matrix factorization;
an optimization solution module, configured to solve the non-negative matrix factorization framework with a fast gradient descent algorithm based on least squares, to obtain low-dimensional representations, in each language, of all candidate answers for each question;
a query question translation module, configured to translate the query question into other languages;
a low-dimensional-space similarity calculation module, configured to map the query question into the low-dimensional space and compute the similarity between the query question and the candidate answers in that space;
a result ranking learning module, configured to obtain the final retrieved answers according to the similarities computed by the similarity calculation module.
The invention uses statistical machine translation to improve answer retrieval performance. With the statistical machine translation tool Google Translate, ambiguous words in the original language, and words whose surface forms differ, are represented by their corresponding translations, which improves answer retrieval performance.
Brief Description of the Drawings
Fig. 1 shows the answer retrieval method based on statistical machine translation according to the invention.
Fig. 2 is a structural diagram of the answer retrieval device based on statistical machine translation according to the invention.
Detailed Description
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
The invention discloses an answer retrieval method and device based on statistical machine translation. It is divided into an offline process and an online process. The offline process is implemented by three modules: the candidate answer translation module, the matrix decomposition module, and the optimization solution module. The online process is also implemented by three modules: the query question translation module, the low-dimensional-space similarity calculation module, and the result ranking learning module.
Fig. 1 shows the answer retrieval method based on statistical machine translation proposed by the invention. As shown in Fig. 1, the method has two stages, an offline part and an online part. The offline process includes:
Step (1). Using a statistical machine translation tool to translate all candidate answers expressed in the original language $l_1$ (for example, English), obtaining equivalent representations in $L-1$ other languages $\{l_2, \ldots, l_L\}$, where $L$ is the total number of languages. Google Translate or a similar tool may be used as the statistical machine translation tool.
Step (2). Representing the set of candidate answers in each language as an $M_p \times N$ word-document matrix $\mathbf{D}_p$, where $M_p$ is the number of distinct words in the candidate answer set expressed in the $p$-th language and $N$ is the number of answers in the candidate answer set.
Step (3). Designing a new objective function that uses non-negative matrix factorization to integrate the candidate answers expressed in the $L$ different languages into a unified framework, and uses a regularization strategy to reduce the noise introduced by statistical machine translation.
Step (4). Designing a fast gradient descent algorithm based on least squares and solving the above objective function to obtain the low-dimensional representations for the $L$ languages, namely the coefficient matrices $\mathbf{U}_p$ and the reconstruction matrices $\mathbf{V}_p$.
The online process includes:
Step (1). Using a statistical machine translation tool to translate the query question expressed in the original language $l_1$ (for example, English) into equivalent representations in $L-1$ other languages. Google Translate or a similar tool may be used as the statistical machine translation tool.
Step (2). Using the coefficient matrices $\mathbf{U}_p$ obtained in offline step (4) to map the query question and its $L-1$ translations into the low-dimensional space.
Step (3). Computing the similarity between the query question and the candidate answers in the low-dimensional representation.
Step (4). Using a linear learning-to-rank strategy to fuse the similarities obtained in the low-dimensional space for the $L$ languages, and returning the highest-scoring candidate answers as the final answers.
Fig. 2 shows the answer retrieval device based on statistical machine translation proposed by the invention. As shown in Fig. 2, the retrieval device includes a candidate answer translation module, a matrix decomposition module, an optimization solution module, a query question translation module, a low-dimensional-space similarity calculation module, and a result ranking learning module.
The candidate answer translation module is used, in the offline stage, to translate all candidate answers expressed in the original language $l_1$ (for example, English) into equivalent representations in $L-1$ other languages $\{l_2, \ldots, l_L\}$, where $L$ is the total number of languages. That is, translating the candidate answer set $D_1$ yields the candidate answer sets $D_2, \ldots, D_L$ expressed in the other $L-1$ languages.
Candidate answer translation is one of the techniques of the invention. Translating the candidate answers from one language into $L-1$ other languages by hand is time-consuming and laborious; for a real task such as answer retrieval in community question answering, manually translating a large-scale candidate answer collection is clearly unrealistic. Fortunately, machine translation has developed well within natural language processing, even though translation quality is not yet fully satisfactory, and many free public translation tools now provide everyday translation services. The preferred embodiment of the invention uses Google Translate. This tool trains its translation models on large-scale parallel corpora using statistical machine learning, can take rich contextual information into account when translating from one language into another, and performs well among the many available translation tools. Translating the candidate answer set $D_1$ yields the candidate answer sets $D_2, \ldots, D_L$ expressed in the other $L-1$ languages.
The matrix decomposition module is used, in the offline stage, to represent the set of candidate answers in each language as an $M_p \times N$ word-document matrix $\mathbf{D}_p$, where $M_p$ is the number of distinct words in the candidate answer set expressed in the $p$-th language and $N$ is the number of answers in the candidate answer set.
The matrix decomposition module is one of the key techniques of the invention. Let $\{l_1, l_2, \ldots, l_L\}$ denote the set of languages used in the invention, where $L$ is the number of languages, $l_1$ is the original language (for example, English), and $l_2, \ldots, l_L$ are the other $L-1$ languages. Let $D_p = \{d_1^{(p)}, d_2^{(p)}, \ldots, d_N^{(p)}\}$ denote the set of candidate answers expressed in language $l_p$; in particular, $D_1$ is the set expressed in the original language $l_1$. Each candidate answer $d_i^{(p)} \in D_p$ can be represented as an $M_p$-dimensional vector $\mathbf{d}_i^{(p)}$, where each element of $\mathbf{d}_i^{(p)}$ corresponds to a word and indicates the importance of that word in the $i$-th candidate answer. The vector $\mathbf{d}_i^{(p)}$ can be computed with tf-idf, a statistical measure of how important a word is to one document in a document collection. $D_p$ can then be represented as an $M_p \times N$ word-document matrix $\mathbf{D}_p = [\mathbf{d}_1^{(p)}, \ldots, \mathbf{d}_N^{(p)}]$. In this matrix each row corresponds to a distinct word and each column to a candidate answer, where $M_p$ is the number of distinct words in $D_p$ and $N$ is the number of candidate answers in $D_p$.
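As an illustration of the word-document matrix described above, the following is a minimal sketch in plain Python. The patent text only says that tf-idf is used; the specific weighting here (term frequency normalized by answer length, idf as $\log(N/\mathrm{df})$), the whitespace tokenizer, and the function name `build_term_document_matrix` are assumptions made for the example.

```python
import math
from collections import Counter

def build_term_document_matrix(answers):
    """Return (vocabulary, matrix); matrix[i][j] is the tf-idf weight of
    vocabulary[i] in answers[j] (rows are words, columns are answers)."""
    tokenized = [answer.lower().split() for answer in answers]  # assumed tokenizer
    vocabulary = sorted({word for doc in tokenized for word in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many answers each word occurs.
    df = Counter(word for doc in tokenized for word in set(doc))
    matrix = [[0.0] * n_docs for _ in vocabulary]
    for j, doc in enumerate(tokenized):
        counts = Counter(doc)
        for i, word in enumerate(vocabulary):
            if counts[word]:
                tf = counts[word] / len(doc)          # assumed tf normalization
                idf = math.log(n_docs / df[word])     # assumed idf variant
                matrix[i][j] = tf * idf
    return vocabulary, matrix

answers = ["the bank charges interest", "interest in music", "music charges"]
vocab, D1 = build_term_document_matrix(answers)
```

Here `D1` has one row per distinct word ($M_1 = 6$) and one column per answer ($N = 3$); the same routine would be applied to each translated answer set to obtain the matrices for the other languages.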
Intuitively, the words in the translated candidate answer sets $D_2, \ldots, D_L$ could simply be added to the original candidate answer set $D_1$. The dimension of the matrix $\mathbf{D}_1$ corresponding to $D_1$ would then grow from $M_1 \times N$ to $\left(\sum_{p=1}^{L} M_p\right) \times N$. This approach has two drawbacks: (1) it causes data sparsity; (2) translation errors made by the statistical machine translation tool introduce noise. To solve these problems, the invention uses matrix factorization.
Suppose that each matrix $\mathbf{D}_p$ can be decomposed into two low-dimensional matrices $\mathbf{U}_p$ and $\mathbf{V}_p$, and that each matrix $\mathbf{D}_p$ is treated independently of the others. The following objective function is then obtained:

$$\mathcal{O}_1 = \min_{\mathbf{U}_p, \mathbf{V}_p} \sum_{p=1}^{L} \left\| \mathbf{D}_p - \mathbf{U}_p \mathbf{V}_p \right\|_F^2$$

where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, $\mathbf{U}_p \in \mathbb{R}^{M_p \times K}$ is the coefficient matrix obtained from the decomposition, $\mathbf{V}_p \in \mathbb{R}^{K \times N}$ is the reconstruction matrix obtained from the decomposition, and $K$ is the dimensionality of the latent space.
To reduce the noise caused by statistical machine translation errors, the invention assumes that the reconstruction matrix $\mathbf{V}_p$ obtained from the matrix $\mathbf{D}_p$ ($p \in [2, L]$) should be as close as possible to the reconstruction matrix $\mathbf{V}_1$ obtained from the matrix $\mathbf{D}_1$. The invention therefore minimizes the distance between each reconstruction matrix $\mathbf{V}_p$ ($p \in [2, L]$) and the reconstruction matrix $\mathbf{V}_1$:

$$\mathcal{O}_2 = \min_{\mathbf{V}_p} \sum_{p=2}^{L} \left\| \mathbf{V}_p - \mathbf{V}_1 \right\|_F^2$$
Combining the two objective functions above gives the following objective function:

$$\mathcal{O} = \min_{\mathbf{U}_p, \mathbf{V}_p} \sum_{p=1}^{L} \left\| \mathbf{D}_p - \mathbf{U}_p \mathbf{V}_p \right\|_F^2 + \sum_{p=2}^{L} \lambda_p \left\| \mathbf{V}_p - \mathbf{V}_1 \right\|_F^2$$

where the parameters $\lambda_p$ ($p \in [2, L]$) adjust the relative weight of the two parts. If $\lambda_p$ is set to a small value, the objective function $\mathcal{O}$ behaves like conventional non-negative matrix factorization (NMF); if $\lambda_p$ is set to a large value, the objective function $\mathcal{O}$ places more emphasis on the errors introduced by statistical machine translation.
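To make the role of the two terms concrete, the sketch below evaluates a combined objective of this kind with NumPy. It assumes the objective is the sum of squared Frobenius reconstruction errors $\|\mathbf{D}_p - \mathbf{U}_p \mathbf{V}_p\|_F^2$ over all languages plus $\lambda_p \|\mathbf{V}_p - \mathbf{V}_1\|_F^2$ for each translated language. The function name `objective` and the list layout (index 0 is the original language, `lams[p - 1]` is the weight for language `p`) are choices made for this example only.

```python
import numpy as np

def objective(D, U, V, lams):
    """D, U, V are lists of L matrices (index 0 = original language).
    lams holds the L-1 weights for the translated languages."""
    # Reconstruction error of every language (default matrix norm is Frobenius).
    recon = sum(np.linalg.norm(D[p] - U[p] @ V[p]) ** 2 for p in range(len(D)))
    # Penalty pulling each translated reconstruction matrix toward V[0].
    agree = sum(lams[p - 1] * np.linalg.norm(V[p] - V[0]) ** 2
                for p in range(1, len(D)))
    return recon + agree

# Two languages with different vocabulary sizes (4 and 5 words), 3 answers, K = 2.
V1 = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]])
U1 = np.ones((4, 2))
U2 = 2.0 * np.ones((5, 2))
D = [U1 @ V1, U2 @ V1]
exact = objective(D, [U1, U2], [V1, V1], [0.3])           # perfect fit, V_2 == V_1
shifted = objective(D, [U1, U2], [V1, V1 + 1.0], [0.3])   # V_2 moved away from V_1
```

With a perfect factorization and identical reconstruction matrices the value is zero; moving $\mathbf{V}_2$ away from $\mathbf{V}_1$ increases both the reconstruction error and the $\lambda$-weighted penalty.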
The optimization solution module is used to solve for the parameters of the matrix decomposition module, namely the coefficient matrices $\mathbf{U}_p$ and the reconstruction matrices $\mathbf{V}_p$. This module produces locally optimal coefficient matrices $\mathbf{U}_p$ and reconstruction matrices $\mathbf{V}_p$, which are the output of the offline part.
The optimization solution module is one of the core techniques of the invention. The objective function above takes both data sparsity and statistical machine translation errors into account. It contains $2L$ optimization variables, which come in pairs $(\mathbf{U}_p, \mathbf{V}_p)$, and when $\mathbf{U}_p$ and $\mathbf{V}_p$ are considered simultaneously it is difficult to find an algorithm that solves the minimization problem. The invention proposes a fast gradient descent algorithm based on least squares for finding a locally optimal solution: while one variable is being optimized, the other $2L-1$ variables are held fixed.
Holding $\mathbf{V}_1$ and $\mathbf{V}_p$ fixed, the iterative update of the coefficient matrix $\mathbf{U}_p$ turns the objective function above into the following optimization problem:

$$\min_{\mathbf{U}_p} \left\| \mathbf{D}_p - \mathbf{U}_p \mathbf{V}_p \right\|_F^2$$

Let $\bar{\mathbf{d}}_i^{(p)}$ be a column vector containing all elements of the $i$-th row of the matrix $\mathbf{D}_p$, and let $\bar{\mathbf{u}}_i^{(p)}$ be a column vector containing all elements of the $i$-th row of the coefficient matrix $\mathbf{U}_p$. The optimization problem above then decomposes into $M_p$ mutually independent subproblems, one for each row of the coefficient matrix $\mathbf{U}_p$:

$$\min_{\bar{\mathbf{u}}_i^{(p)}} \left\| \bar{\mathbf{d}}_i^{(p)} - \mathbf{V}_p^{T} \bar{\mathbf{u}}_i^{(p)} \right\|_2^2$$

for $i = 1, \ldots, M_p$, where $M_p$ is the number of distinct words in $D_p$.
Each subproblem is a standard least squares problem, whose solution is:

$$\bar{\mathbf{u}}_i^{(p)} = \left( \mathbf{V}_p \mathbf{V}_p^{T} \right)^{-1} \mathbf{V}_p \bar{\mathbf{d}}_i^{(p)}$$
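The row-wise least squares update of a coefficient matrix can be sketched with NumPy as follows. The sketch assumes each row $\bar{\mathbf{u}}$ of $\mathbf{U}_p$ minimizes $\|\bar{\mathbf{d}} - \mathbf{V}_p^{T} \bar{\mathbf{u}}\|_2^2$ for the matching row $\bar{\mathbf{d}}$ of $\mathbf{D}_p$, so that all $M_p$ rows can be solved in a single call. The function name `update_U` is hypothetical, and `np.linalg.lstsq` is used in place of an explicit inverse of $\mathbf{V}_p \mathbf{V}_p^{T}$ for numerical stability; both give the same result when $\mathbf{V}_p$ has full row rank.

```python
import numpy as np

def update_U(D_p, V_p):
    """Solve min_U ||D_p - U V_p||_F^2 for U (shape M_p x K).

    Transposing turns the row-wise problems into V_p^T u = d, one column
    of D_p^T per word, which lstsq solves for all words at once."""
    solution, *_ = np.linalg.lstsq(V_p.T, D_p.T, rcond=None)
    return solution.T

# Small check: if D_p was generated exactly from U_true and V_p, U_true is recovered.
V_p = np.array([[1.0, 0.0, 1.0, 2.0, 0.0],
                [0.0, 1.0, 1.0, 0.0, 3.0]])      # K = 2, N = 5
U_true = np.array([[1.0, 2.0], [0.5, 0.0], [3.0, 1.0], [0.0, 4.0]])  # M_p = 4
D_p = U_true @ V_p
U_est = update_U(D_p, V_p)
```

Note that this unconstrained solve does not by itself keep the entries of the coefficient matrix non-negative; how non-negativity is handled is not specified in the surviving text.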
Holding the coefficient matrix $\mathbf{U}_p$ and the remaining reconstruction matrices fixed, the iterative update of the reconstruction matrix $\mathbf{V}_p$ turns the objective function above into one of the following two optimization problems.
When $p \in [2, L]$, the objective function becomes:

$$\min_{\mathbf{V}_p} \left\| \mathbf{D}_p - \mathbf{U}_p \mathbf{V}_p \right\|_F^2 + \lambda_p \left\| \mathbf{V}_p - \mathbf{V}_1 \right\|_F^2$$

When $p = 1$, the objective function becomes:

$$\min_{\mathbf{V}_1} \left\| \mathbf{D}_1 - \mathbf{U}_1 \mathbf{V}_1 \right\|_F^2 + \sum_{p=2}^{L} \lambda_p \left\| \mathbf{V}_p - \mathbf{V}_1 \right\|_F^2$$

For the first case, let $\mathbf{d}_j^{(p)}$ be the $j$-th column vector of the matrix $\mathbf{D}_p$, and let $\mathbf{v}_j^{(p)}$ be the $j$-th column vector of the reconstruction matrix $\mathbf{V}_p$. The objective function of the first case then decomposes into $N$ mutually independent subproblems, one for each column of the reconstruction matrix $\mathbf{V}_p$:

$$\min_{\mathbf{v}_j^{(p)}} \left\| \mathbf{d}_j^{(p)} - \mathbf{U}_p \mathbf{v}_j^{(p)} \right\|_2^2 + \lambda_p \left\| \mathbf{v}_j^{(p)} - \mathbf{v}_j^{(1)} \right\|_2^2$$

for $j = 1, \ldots, N$, where $N$ is the number of candidate answers in the set $D_p$.
Each subproblem is a standard least squares problem with L2 regularization, whose solution is:

$$\mathbf{v}_j^{(p)} = \left( \mathbf{U}_p^{T} \mathbf{U}_p + \lambda_p \mathbf{I} \right)^{-1} \left( \mathbf{U}_p^{T} \mathbf{d}_j^{(p)} + \lambda_p \mathbf{v}_j^{(1)} \right)$$

where $p \in [2, L]$ indexes the $p$-th translated language and $\mathbf{I}$ is the $K \times K$ identity matrix.
The objective function of the second case can be solved in a similar way, and its solution is:

$$\mathbf{v}_j^{(1)} = \left( \mathbf{U}_1^{T} \mathbf{U}_1 + \lambda' \mathbf{I} \right)^{-1} \left( \mathbf{U}_1^{T} \mathbf{d}_j^{(1)} + \sum_{p=2}^{L} \lambda_p \mathbf{v}_j^{(p)} \right), \qquad \lambda' = \sum_{p=2}^{L} \lambda_p$$
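The alternating scheme described above, in which one matrix is updated while all others are held fixed, can be sketched as follows. The sketch assumes the objective $\sum_p \|\mathbf{D}_p - \mathbf{U}_p \mathbf{V}_p\|_F^2 + \sum_{p \ge 2} \lambda_p \|\mathbf{V}_p - \mathbf{V}_1\|_F^2$ and uses the closed-form solutions of the resulting least squares and L2-regularized least squares subproblems. All function names, the random initialization, the update order, and the fixed iteration count are assumptions for illustration, and the sketch requires $\lambda_p > 0$ so that the linear systems are well conditioned. It does not enforce non-negativity.

```python
import numpy as np

def objective(D, U, V, lams):
    recon = sum(np.linalg.norm(D[p] - U[p] @ V[p]) ** 2 for p in range(len(D)))
    agree = sum(lams[p - 1] * np.linalg.norm(V[p] - V[0]) ** 2
                for p in range(1, len(D)))
    return recon + agree

def update_U(D_p, V_p):
    # Every row of U_p solves a plain least squares problem against V_p^T.
    return np.linalg.lstsq(V_p.T, D_p.T, rcond=None)[0].T

def update_V_translated(D_p, U_p, V_1, lam_p):
    # Case p >= 2: columns solve (U^T U + lam I) v = U^T d + lam v_1.
    K = U_p.shape[1]
    return np.linalg.solve(U_p.T @ U_p + lam_p * np.eye(K),
                           U_p.T @ D_p + lam_p * V_1)

def update_V_original(D_1, U_1, V_others, lams):
    # Case p = 1: columns solve (U^T U + sum(lam) I) v = U^T d + sum_p lam_p v_p.
    K = U_1.shape[1]
    rhs = U_1.T @ D_1 + sum(l * V for l, V in zip(lams, V_others))
    return np.linalg.solve(U_1.T @ U_1 + sum(lams) * np.eye(K), rhs)

def factorize(D, K, lams, n_iters=20, seed=0):
    """Alternate the three updates; returns U, V and the objective per sweep."""
    rng = np.random.default_rng(seed)
    N = D[0].shape[1]
    U = [rng.random((D_p.shape[0], K)) for D_p in D]
    V = [rng.random((K, N)) for _ in D]
    history = []
    for _ in range(n_iters):
        for p in range(len(D)):
            U[p] = update_U(D[p], V[p])
        V[0] = update_V_original(D[0], U[0], V[1:], lams)
        for p in range(1, len(D)):
            V[p] = update_V_translated(D[p], U[p], V[0], lams[p - 1])
        history.append(objective(D, U, V, lams))
    return U, V, history

data_rng = np.random.default_rng(1)
D = [data_rng.random((6, 5)), data_rng.random((7, 5))]   # two languages, 5 answers
U, V, history = factorize(D, K=2, lams=[0.5], n_iters=10)
```

Because each update exactly minimizes the objective with respect to one block while the others are fixed, the recorded objective values should not increase from one sweep to the next, which is consistent with the method converging to a local optimum.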
The query question translation module is used, in the online stage, to translate the query question into equivalent representations in $L-1$ other languages with a statistical machine translation tool. Google Translate or a similar tool may be used as the statistical machine translation tool.
As with the candidate answer translation module, the invention uses the statistical machine translation tool Google Translate to translate the query question from one language into $L-1$ other languages. For a given query question $q$, translation yields the query questions $q_2, \ldots, q_L$ expressed in the other $L-1$ languages.
The low-dimensional-space similarity calculation module is used to compute the similarity between the query question and the candidate answers in the low-dimensional representation.
The low-dimensional-space similarity calculation module is one of the key techniques of the invention. A given query question $q$ and its translations $q_2, \ldots, q_L$ in the other $L-1$ languages need to be mapped into the low-dimensional space. For ease of notation, the symbol $q_1$ is used for the query question $q$ expressed in the original language, that is, $q = q_1$. The query $q_1$ can then be mapped into the low-dimensional space with the following formula:

$$\mathbf{v}_{q_1} = \arg\min_{\mathbf{v}} \left\| \mathbf{q}_1 - \mathbf{U}_1 \mathbf{v} \right\|_2^2$$

where $\mathbf{q}_1$ is the vector representation of the query question $q_1$, $\mathbf{v}_{q_1}$ is the vector representation of $q_1$ in the low-dimensional space (its counterpart of a column of the reconstruction matrix), and $\mathbf{U}_1$ is the coefficient matrix for the original language obtained by the optimization solution module. For a candidate answer $d_1$, the result of the low-dimensional conversion already performed by the matrix decomposition module can be used directly, namely $\mathbf{v}_{d_1}$, the corresponding column of $\mathbf{V}_1$. The similarity between the query question $q_1$ and the candidate answer $d_1$ in the low-dimensional space can be expressed as the cosine similarity:

$$s(q_1, d_1) = \frac{\langle \mathbf{v}_{q_1}, \mathbf{v}_{d_1} \rangle}{\left\| \mathbf{v}_{q_1} \right\|_2 \, \left\| \mathbf{v}_{d_1} \right\|_2}$$

where $s(q_1, d_1)$ denotes the similarity between the query question $q_1$ and the candidate answer $d_1$ in the low-dimensional space.
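A minimal NumPy sketch of this online step is given below. It assumes the query is mapped into the latent space as the least squares solution of $\mathbf{q} \approx \mathbf{U}_1 \mathbf{v}$, with $\mathbf{U}_1$ the learned coefficient matrix for the original language, and that each candidate answer is represented by its column of the learned reconstruction matrix. The function names `project_query` and `cosine`, and the toy matrices, are hypothetical.

```python
import numpy as np

def project_query(q_vec, U_p):
    """Latent representation of a query: argmin_v ||q_vec - U_p v||_2^2."""
    v, *_ = np.linalg.lstsq(U_p, q_vec, rcond=None)
    return v

def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy example: 3 words, K = 2 latent dimensions, 2 candidate answers.
U_1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # coefficient matrix
V_1 = np.array([[2.0, 0.0], [3.0, 1.0]])               # columns = answers
q_1 = np.array([2.0, 3.0, 5.0])                        # tf-idf vector of the query

v_q = project_query(q_1, U_1)
similarities = [cosine(v_q, V_1[:, j]) for j in range(V_1.shape[1])]
```

In this toy case the query vector equals $\mathbf{U}_1$ times the first answer's latent vector, so the first answer receives the highest similarity. The same two functions would be applied per language, using that language's coefficient and reconstruction matrices.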
For each translation $q_i$ ($i \in [2, L]$) of $q_1$, the following formula maps it into the low-dimensional space:

$$\mathbf{v}_{q_i} = \arg\min_{\mathbf{v}} \left\| \mathbf{q}_i - \mathbf{U}_i \mathbf{v} \right\|_2^2$$

where $\mathbf{q}_i$ is the vector representation of the query question $q_i$ and $\mathbf{U}_i$ is the coefficient matrix for the $i$-th language. Similarly, for the translation $d_i$ ($i \in [2, L]$) of the candidate answer $d_1$, the result of the low-dimensional conversion performed by the matrix decomposition module, $\mathbf{v}_{d_i}$, can be used directly. The similarity in the low-dimensional space between the translation $q_i$ of the query question $q_1$ and the translation $d_i$ of the candidate answer $d_1$ is computed with the same cosine similarity as above.
The result ranking learning module fuses the similarities computed in the low-dimensional space for the $L$ different languages and returns the highest-scoring candidate answers as the final answers. For a given query question $q_1$ and candidate answer $d_1$, the present invention designs the following ranking learning function: [formula not preserved in this copy]
where $\mathrm{Score}(q_1, d_1)$ is the final score of the query question $q_1$ and the candidate answer $d_1$; the weight parameter (its symbol is not preserved in this copy) gives the weights of the feature vector; and $\Phi(q_1, d_1) = \{s(q_1, d_1), s(q_2, d_2), \ldots, s(q_L, d_L)\}$ is the feature vector, holding the low-dimensional-space similarities between the query question $q_1$ and the candidate answer $d_1$ in each of the $L$ languages. The best value of the weight parameter is obtained by cross-validation, the strategy most commonly used in statistical machine learning. Finally, the candidate answers are sorted by $\mathrm{Score}(q_1, d_1)$ in descending order, and the highest-scoring ones are returned as the final answers.
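The fusion and ranking step can be illustrated with a short sketch. The exact form of the ranking function is not preserved in this text, so the code below assumes a plain weighted sum of the per-language similarities; the weights, candidate identifiers, and feature values are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Assumed linear fusion: weighted sum of the L per-language similarities."""
    if len(weights) != len(features):
        raise ValueError("need one weight per language")
    return sum(w * f for w, f in zip(weights, features))


def rank_candidates(
    weights: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score every candidate and return the top_k, highest score first."""
    scored = [(cid, score(weights, feats)) for cid, feats in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical setting with L = 3 languages; each feature vector is
# [s(q_1, d_1), s(q_2, d_2), s(q_3, d_3)] for one candidate answer.
weights = [0.5, 0.3, 0.2]
candidates = {
    "d_a": [0.9, 0.8, 0.7],
    "d_b": [0.4, 0.9, 0.9],
    "d_c": [0.2, 0.1, 0.3],
}
top = rank_candidates(weights, candidates, top_k=2)
```

In practice the weights would be chosen by cross-validation, as the text describes, rather than fixed by hand as in this example.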
To demonstrate the performance of the device, experiments were carried out to verify that statistical machine translation improves the performance of the answer retrieval system.
The experimental data come from the Yahoo! Answers community question answering system. In this historical question set, each question consists mainly of four parts: the question title, the question category, the question description, and the answers to the question. The data set contains 1,232 user category labels and 2,288,607 question-answer pairs. To evaluate the effectiveness of the method, a further 252 query questions were selected as the test set. For each query question in the test set, a language model was used to retrieve the top 20 results, which two annotators then labeled by hand. A returned candidate answer was labeled "relevant" if it was similar to the query question and "irrelevant" otherwise. Where the two annotators disagreed, a third person made the final decision. When judging whether a candidate answer was similar to the query question, the annotators knew only the question itself.
In the experiments the parameter is set to $L = 5$; that is, the English text is translated into four other languages (Chinese, French, Italian, and German).
Let $Q_t$ denote the set of test questions. The following two evaluation metrics are used:
Mean Average Precision (MAP), computed as:

$$\mathrm{MAP} = \frac{1}{|Q_t|} \sum_{q \in Q_t} \frac{1}{m_q} \sum_{k=1}^{m_q} \mathrm{Precision}(R_k)$$
where $m_q$ is the number of questions relevant to the query question $q$, $R_k$ is the set consisting of the $k$-th question in the retrieval results and all questions ranked before it, and $\mathrm{Precision}(R_k)$ is the proportion of questions in $R_k$ that are relevant to $q$. This metric reflects the overall average quality of the results.
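The MAP computation can be sketched as below. The sketch follows the common information-retrieval convention in which precision is accumulated at the rank of each relevant result; that reading of $R_k$, and the relevance lists used in the example, are assumptions made here rather than details taken from the patent.

```python
from typing import List, Sequence, Tuple


def average_precision(relevance: Sequence[bool], num_relevant: int) -> float:
    """Average precision for one query.

    relevance[i] is True when the result at rank i + 1 is relevant;
    num_relevant plays the role of m_q.
    """
    if num_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision of the ranking cut at this rank
    return total / num_relevant


def mean_average_precision(runs: List[Tuple[Sequence[bool], int]]) -> float:
    """Mean of the per-query average precision over the test set Q_t."""
    return sum(average_precision(rel, m) for rel, m in runs) / len(runs)


# Two hypothetical queries with their ranked relevance judgments and m_q.
runs = [
    ([True, False, True], 2),  # AP = (1/1 + 2/3) / 2
    ([False, True], 1),        # AP = (1/2) / 1
]
map_value = mean_average_precision(runs)
```

Relevant results that are never retrieved still count in `num_relevant`, so a system is penalized for missing them.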
Precision@n (P@n): the precision of the top $n$ results that the system returns for a query question. Precision@n for the whole test set is the mean of Precision@n over all questions in the test set, computed as follows: [formula not preserved in this copy]
where $k$ is the number of relevant questions among the top $n$ questions returned by the retrieval system, and $n$ is the total number of questions returned. Therefore, [expression not preserved in this copy]
Because users usually hope to find the information they need within the first few results, $n = 10$ is commonly used.
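Precision@n as described above can be sketched in a few lines. The relevance lists are hypothetical, and the example uses $n = 5$ only to keep it short; the experiments in the text use $n = 10$.

```python
from typing import List, Sequence


def precision_at_n(relevance: Sequence[bool], n: int) -> float:
    """Fraction of the top n results that are relevant (k / n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    k = sum(1 for is_relevant in relevance[:n] if is_relevant)
    return k / n


def mean_precision_at_n(runs: List[Sequence[bool]], n: int) -> float:
    """Mean of Precision@n over all queries in the test set."""
    return sum(precision_at_n(rel, n) for rel in runs) / len(runs)


# Hypothetical ranked relevance judgments for two queries.
runs = [
    [True, False, True, True, False],    # 3 relevant in the top 5
    [False, False, True, False, False],  # 1 relevant in the top 5
]
p_at_5 = mean_precision_at_n(runs, n=5)
```

Dividing by `n` even when fewer than `n` results are returned is a design choice of this sketch; it treats the missing positions as non-relevant.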
By means of statistical machine translation, the present invention represents query questions and candidate answers with translated words, which effectively addresses the "lexical ambiguity" and "lexical gap" problems between them. Table 1 reports the experimental results for answer retrieval with statistical machine translation.
Table 1: Experimental results for answer retrieval performance with statistical machine translation
In Table 1, TRLM denotes the conventional answer retrieval method based on monolingual translation, and SMT denotes the answer retrieval method with statistical machine translation proposed by the present invention. The comparison in Table 1 shows that the proposed method clearly improves answer retrieval performance: MAP improves by 29.36% and P@10 by 11.49%. The experimental results show that the present invention improves answer retrieval performance.
The experimental results in Table 1 show that the answer retrieval method with statistical machine translation performs well and that the method is effective.
The specific embodiments described above explain the purpose, technical solutions, and beneficial effects of the present invention in further detail. It should be understood that they are only specific embodiments of the present invention and do not limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (12)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310180146.4A CN103235833B (en) | 2013-05-15 | 2013-05-15 | Answer search method and device by the aid of statistical machine translation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103235833A (en) | 2013-08-07 |
CN103235833B (en) | 2017-02-08 |
Family
ID=48883874
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310180146.4A Active CN103235833B (en) | 2013-05-15 | 2013-05-15 | Answer search method and device by the aid of statistical machine translation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103235833B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111782789A (en) * | 2020-07-03 | 2020-10-16 | 江苏瀚涛软件科技有限公司 | Intelligent question and answer method and system |
CN112182439B (en) * | 2020-09-30 | 2023-05-23 | 中国人民大学 | Search result diversification method based on self-attention network |
US12027070B2 (en) | 2022-03-15 | 2024-07-02 | International Business Machines Corporation | Cognitive framework for identification of questions and answers |
Non-Patent Citations (3)
Title |
---|
《Phrase-Based Translation Model for Question Retrieval in Community Question Answer Archives》;Guangyou Zhou 等;《Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics》;20110624;第653-662页 * |
《互联网机器翻译》;王海峰,吴华,刘占一;《中文信息学报》;20111130;第25卷(第6期);第72-80页 * |
《非负矩阵分解及其应用现状分析》;徐泰燕,郝玉龙;《武汉工业学院学报》;20100331;第29卷(第1期);第109-114页 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |