CN103902652A - Automatic question-answering system - Google Patents

Automatic question-answering system

Info

Publication number
CN103902652A
Authority
CN
China
Prior art keywords
answer
question
answers
problem
unit
Prior art date
Application number
CN 201410068844
Other languages
Chinese (zh)
Inventor
郑海涛
古宁
江勇
夏树涛
赵从志
Original Assignee
深圳市智搜信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市智搜信息技术有限公司
Priority to CN 201410068844
Publication of CN103902652A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20 Handling natural language data
    • G06F17/27 Automatic analysis, e.g. parsing
    • G06F17/2785 Semantic analysis
    • G06F17/279 Discourse representation

Abstract

An embodiment of the invention discloses an automatic question-answering system. In the system, a user interaction unit receives a question input by a user; a question analysis unit extracts keywords from the question and expands them; an information retrieval unit then searches a frequently asked question (FAQ) bank for an answer according to the expanded keywords and returns related documents or answers; finally, an answer extraction unit extracts, according to the answer extraction rules corresponding to the question type, an answer that satisfies the rules from the related documents returned by the information retrieval unit, and sends the extracted answer to the user interaction unit to be fed back to the user. Compared with prior-art automatic question-answering systems, this system automatically expands the keywords and classifies the input question by type; each question is assigned a type and the answer is then searched for within that type, which improves the accuracy and diversity of the answers.

Description

Automatic Question-Answering System

Technical Field

[0001] The present invention relates to the field of computer software technology, and in particular to an automatic question-answering system.

Background

[0002] Since the 1990s, the Internet has developed rapidly worldwide, and the growing volume of information online provides people with rich information resources. This growth has greatly promoted the development of natural language processing technology and has also placed higher demands on it: people want to obtain the information they need quickly and accurately from a disorderly online world. Although many search engines on the Internet can help people search for information, current search engines still have many shortcomings and cannot meet the need to obtain information conveniently, quickly, and accurately. The shortcomings show in three respects. First, too much related information is returned. A traditional search engine returns so many related web pages that users can hardly locate the information they need quickly and accurately. For example, when a user enters a few keywords on Google, it may return thousands of web pages, and the user wastes a great deal of time looking through them. Second, retrieval needs are expressed as logical combinations of keywords. People's retrieval needs are often complex and specific and cannot be expressed by a simple combination of a few keywords; because users cannot state their retrieval intent clearly, the search engine cannot find an answer that satisfies them. Third, indexing and matching algorithms are keyword-based. Although such algorithms are simple and easy to implement, they stay at the surface of the language and do not touch semantics, so retrieval quality is hard to improve further.

[0003] Automatic question answering (QA) technology developed alongside semantic processing of natural language precisely to meet this need. Users can put ordinary questions to an automatic question-answering system; the system searches a knowledge base or the Internet for the corresponding answer and returns it directly to the user in concise form, rather than returning a pile of related web pages as a search engine does. Users can thus conveniently obtain the information they want. Automatic question answering combines knowledge representation, information retrieval, natural language processing, and related techniques. It lets users enter a question in natural language rather than as a combination of keywords, and it returns a concise, accurate answer rather than a set of related web pages. A question-answering system can therefore better meet users' retrieval needs and find the required answer more quickly. In this sense, the question-answering system is a new generation of search engine. Users do not need to break their question down into keywords; they can hand the whole question to the system. By combining natural language processing with an understanding of the question, the system can deliver the answer the user wants directly. A question-answering system is like a knowledgeable expert that can answer any question quickly and accurately. For example, if a user submits the question "What is the abbreviation for Shanghai?", the system directly answers "The abbreviation for Shanghai is Hu (沪)." A question-answering system is thus more convenient, faster, and more efficient than a traditional search engine.

[0004] At present there is no relatively mature automatic question-answering system in China. The prior-art "Chinese natural-language question-answering method based on a question-answer library" builds an FAQ library from specialist websites on the Internet, then segments and analyzes the user's query to find similar questions; that is, it retrieves questions by question and matches their answers. It mainly comprises building the question-answer library and computing question similarity. "An automatic question-answering system and method", proposed by Tencent Technology (Shenzhen) Co., Ltd., provides a keyword normalization unit so that keywords in the user's input sentence can be converted into the keywords commonly used in the reasoning knowledge base, which reduces the work of building that knowledge base. "A concept-based intelligent Chinese question-answering system", proposed by Huazhong University of Science and Technology, performs synonym expansion on the keyword string obtained after processing the user's question, so that the question is better understood for retrieval and recall is improved; it also gives a concept-based method for computing Chinese sentence similarity from word form, word order, and word length, which improves precision. "Implementation method of a Chinese FAQ question-answering system for the tourism domain", proposed by Kunming University of Science and Technology, comprises FAQ collection and organization, construction of a tourism domain knowledge base, user query, and answer extraction. Drawing on ontology, it builds a tourism domain knowledge base (a domain HowNet), uses the KDML language to define and describe tourism terms and relations, and proposes a method for computing the similarity of tourism questions. "An automatic question-answering method and system", proposed by Peking University Shenzhen Graduate School, analyzes the question, labels the segmented words with a question-point/condition-point recognition model, and uses the identified question points and condition points to query an SQL-structured information repository. "A method and device for forming questions and a server side of a knowledge question-answering system", proposed by Baidu Online Network Technology (Beijing) Co., Ltd., uses question templates to obtain key information from the user's input and form a question. "A multimedia question-answering system and method", proposed by Huawei Technologies Co., Ltd., parses the user's input question to obtain feature information and a semantic category, then looks up in a preset multimedia database the answer to the most similar question in that category.

[0005] The techniques above have the following problems:

[0006] First, in a question-answering system based on a frequently asked question (FAQ) bank, the size and scope of the bank affect answer accuracy, so building a reasonably comprehensive FAQ bank is the primary problem such systems must solve. Systems based on a question bank are also usually confined to one specialist field, and their extensibility is relatively poor. In addition, computing the similarity between the user's question and the sentences in the question bank is the core of the system; the accuracy and efficiency of that computation determine the accuracy and efficiency of the whole system.

[0007] Second, in an Internet-based automatic question-answering system, the retrieved information is highly redundant and may belong to several topics, so answer extraction is relatively complex and answer accuracy cannot be guaranteed.

[0008] Third, there are the problems of keyword matching and semantic expansion in retrieval. Because Chinese expression is flexible, the positions at which keywords appear in sentences with the same meaning are not fixed, so matching keywords in order often fails to meet retrieval requirements. Chinese also has a large number of synonyms, so entirely different keywords in the question and the answer may carry the same meaning, and retrieval may fail without semantic expansion. Semantic expansion, however, raises recall but may lower precision. If the type of the question can be identified intelligently, precision can be improved.

[0009] Fourth, most of the methods above use the input question to retrieve questions in the FAQ bank and return the answer to the most similar question. For open-ended questions, the answers are often confined to a single field; although one or more common answers can be obtained, such systems often cannot offer users a diverse set of answers for reference.

Summary of the Invention

[0010] In view of the deficiencies of the prior art, the object of the present invention is to provide an automatic question-answering system that automatically classifies the input question by type and searches for the answer within that type.

[0011] The technical solution of the present invention is as follows:

[0012] An automatic question-answering system, comprising:

[0013] a user interaction unit, configured to receive a question input by a user and to feed the answer to the question back to the user;

[0014] a question analysis unit, configured to extract keywords from the question input by the user, to expand the keywords, and to classify the question according to preset question classification criteria to obtain the type of the question;

[0015] a frequently asked question (FAQ) bank, configured to store questions frequently asked by users and their answers;

[0016] an information retrieval unit, configured to search the FAQ bank for an answer to the question according to the keywords expanded by the question analysis unit, and to return related documents or answers;

[0017] an answer extraction unit, configured to extract, according to the answer extraction rules corresponding to the type of the question, an answer that satisfies the rules from the related documents returned by the information retrieval unit, and to send the extracted answer to the user interaction unit.

[0018] Beneficial effects:

[0019] In the automatic question-answering system disclosed in the embodiment of the present invention, the user interaction unit receives a question input by the user, and the question analysis unit extracts keywords from the question and expands them. The information retrieval unit then searches the FAQ bank for an answer according to the expanded keywords. If an answer is found, it is sent directly to the user interaction unit and fed back to the user; otherwise the related documents are returned. Finally, the answer extraction unit extracts, according to the answer extraction rules corresponding to the question type, an answer that satisfies the rules from the related documents returned by the information retrieval unit, and sends the extracted answer to the user interaction unit to be fed back to the user. Compared with prior-art automatic question-answering systems, the system provided by the embodiment automatically expands keywords and classifies the input question by type; each question is assigned a type and the answer is then searched for within that type, which improves the accuracy and diversity of the answers. The input question is classified both by domain and by type. Domain classification restricts the search to that domain and so narrows the search scope; type classification supplies the matching rules for answer extraction, and the answer corresponding to the type is returned according to those rules, which improves the accuracy and diversity of the answers.

Brief Description of the Drawings

[0020] The present invention is further described below with reference to the drawing and the embodiments. In the drawing:

[0021] FIG. 1 is a schematic structural diagram of the automatic question-answering system provided by an embodiment of the present invention.

Detailed Description

[0022] To make the objects, technical solutions, and effects of the present invention clearer and more definite, the invention is described in further detail below. It should be understood that the specific embodiments described here merely explain the invention and are not intended to limit it.

[0023] Refer to FIG. 1, which is a schematic structural diagram of the automatic question-answering system provided by an embodiment of the present invention. As shown in FIG. 1, the automatic question-answering system comprises:

[0024] a user interaction unit 10, configured to receive a question input by a user and to feed the answer to the question back to the user;

[0025] a question analysis unit 20, configured to extract keywords from the question input by the user, to expand the keywords, and to classify the question according to preset question classification criteria to obtain the type of the question;

[0026] a frequently asked question (FAQ) bank 30, configured to store questions frequently asked by users and their answers;

[0027] an information retrieval unit 40, configured to search the FAQ bank for an answer to the question according to the keywords expanded by the question analysis unit 20, and to return related documents or answers;

[0028] an answer extraction unit 50, configured to extract, according to the answer extraction rules corresponding to the type of the question, an answer that satisfies the rules from the related documents returned by the information retrieval unit 40, and to send the extracted answer to the user interaction unit 10.

[0029] In the automatic question-answering system disclosed in the embodiment of the present invention, the user interaction unit 10 receives a question input by the user, and the question analysis unit 20 extracts keywords from the question and expands them. The information retrieval unit 40 then searches the FAQ bank 30 for an answer according to the expanded keywords and returns the related documents. Finally, the answer extraction unit 50 extracts, according to the answer extraction rules corresponding to the question type, an answer that satisfies the rules from the related documents returned by the information retrieval unit 40, and sends the extracted answer to the user interaction unit 10 to be fed back to the user. Compared with prior-art automatic question-answering systems, the system provided by the embodiment automatically expands keywords and classifies the input question by type; each question is assigned a type and the answer is then searched for within that type, which improves the accuracy and diversity of the answers.
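The flow among units 10 to 50 described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the two-entry FAQ bank, the stop-word and synonym tables, the two question types, the extraction rules, and every function name are hypothetical, and whitespace tokenization of English stands in for Chinese word segmentation.

```python
# Toy data: every table below is an assumption made for illustration only.
FAQ_BANK = [  # stands in for FAQ bank 30
    {"question": "what is the abbreviation for shanghai",
     "answer": "The abbreviation for Shanghai is Hu."},
    {"question": "where is harbin",
     "answer": "Harbin is located in Heilongjiang Province."},
]
STOP_WORDS = {"what", "where", "is", "the", "for"}
SYNONYMS = {"abbreviation": {"short"}}
EXTRACTION_RULES = {  # one answer extraction rule per question type
    "location": lambda text: "located" in text.lower(),
    "general": lambda text: True,
}


def analyze_question(question):
    """Question analysis unit 20: extract keywords, expand them, assign a type."""
    words = question.lower().rstrip("?").split()
    keywords = [w for w in words if w not in STOP_WORDS]
    expanded = set(keywords)
    for word in keywords:
        expanded |= SYNONYMS.get(word, set())
    question_type = "location" if words and words[0] == "where" else "general"
    return expanded, question_type


def retrieve(expanded_keywords):
    """Information retrieval unit 40: rank FAQ entries by keyword hits."""
    scored = []
    for entry in FAQ_BANK:
        text = (entry["question"] + " " + entry["answer"]).lower()
        hits = sum(1 for k in expanded_keywords if k in text)
        if hits:
            scored.append((hits, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored]


def extract_answer(question_type, documents):
    """Answer extraction unit 50: first document that satisfies the type's rule."""
    rule = EXTRACTION_RULES[question_type]
    for doc in documents:
        if rule(doc["answer"]):
            return doc["answer"]
    return None


def answer(question):
    """User interaction unit 10: accept a question, return the answer."""
    expanded, question_type = analyze_question(question)
    return extract_answer(question_type, retrieve(expanded))


print(answer("Where is Harbin?"))
```

Keeping the extraction rules in a table keyed by question type mirrors the text's point that the type chosen during analysis selects the rule applied during extraction.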

[0030] For a more detailed understanding of the automatic question-answering system provided by the embodiment of the present invention, each functional module of the system is described further below.

[0031] In the embodiment of the present invention, the user interaction unit 10 is the module through which the user enters queries and browses the answers. It is usually a browser, such as the commonly used Internet Explorer, Firefox, or Chrome. It passes the received user query to the question analysis unit for processing, and feeds the answer obtained from the FAQ bank back to the user.

[0032] The automatic question-answering system provided by the embodiment of the present invention further includes an ontology knowledge base 60, which provides a shared vocabulary to the question analysis unit 20. Introducing an ontology into the question-answering system lets the system analyze the user's query terms semantically and understand the query intent more fully, which effectively improves precision and recall. An ontology describes the concepts of a domain and the semantic relations among them, so the machine can operate on the basis of a deep understanding of the domain knowledge. Specifically, an ontology knowledge base is a formal, explicit, and detailed specification of a shared conceptualization. It provides a shared vocabulary, that is, the object types or concepts that exist in a specific domain together with their attributes and interrelations. It is a special kind of terminology set that is structured and better suited to use in computer systems. In effect, it is a formal representation of a set of concepts in a specific domain and of the relations among them.

[0033] Further, the system provided by the embodiment of the present invention uses a semantic ontology as its knowledge base. The ontology describes the concepts of a domain and the semantic relations among them, so the machine can operate on the basis of a deep understanding of the domain knowledge.

[0034] The system adopts a fully automatic ontology construction method. The method requires no manual intervention; in general, only a small number of keywords need to be defined for each concept. It can then run automatically, crawling information from the web and constructing the ontology. The training process for each ontology concept can be divided into four steps:

[0035] Step (1): query related documents with a search engine;

[0036] Step (2): generate candidate words from the documents of step (1) using the LDA model;

[0037] Step (3): score the candidate words with a semantic distance formula such as NGD or WebJaccard;

[0038] Step (4): add the scored words to the instances of the ontology concept.

[0039] The first step of the construction algorithm is to find related documents for each concept in the ontology. For example, recent popular news can be queried from Baidu News and used as the related documents.

[0040] The second step of the construction algorithm is to select candidate words from the documents obtained in the first step. The candidate words should be the relatively important words in the document set. Keyword extraction is performed here with the LDA (Latent Dirichlet Allocation) model. LDA is a probabilistic generative model of a document collection. Its basic idea is that each document is composed of a series of latent topics that follow a certain distribution, and the words that may appear in each topic also follow their own distribution. After the weights of all words have been computed, the words are sorted and the top n (n = 400) words are selected as the candidate words for the corresponding concept.

[0041] To score the selected candidate words more precisely, a formula based on the number of results returned by search-engine queries is used. The most commonly used formula is the Normalized Google Distance (NGD). NGD uses Google search result counts to evaluate how closely two words are related. In this model, given a word W and a concept C, computing the NGD between them requires choosing a word to represent the concept. An attribute named tag is therefore added to the ontology to hold the defining word of each concept.
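The NGD scoring step can be sketched as follows. The function implements the standard published NGD formula, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f counts the pages containing the term or terms and N is the total number of indexed pages. The text does not say how the hit counts are obtained, so the sketch takes them as plain numbers; the function names and the counts in the example are hypothetical.

```python
import math


def ngd(fx, fy, fxy, total_pages):
    """Normalized Google Distance computed from search hit counts.

    fx, fy: pages containing each term; fxy: pages containing both;
    total_pages: number of pages indexed by the search engine.
    Smaller values mean the two terms are more closely related.
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return float("inf")  # terms never co-occur: treat as unrelated
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    denominator = math.log(total_pages) - min(log_fx, log_fy)
    if denominator <= 0:
        return float("inf")  # degenerate counts: the distance is undefined
    return (max(log_fx, log_fy) - log_fxy) / denominator


def rank_candidates(tag_count, candidates, total_pages, top_n):
    """Order candidate words by NGD to the concept's tag word, closest first.

    candidates maps each word to (pages with the word, pages with word and tag).
    """
    ranked = sorted(
        candidates,
        key=lambda w: ngd(tag_count, candidates[w][0], candidates[w][1], total_pages),
    )
    return ranked[:top_n]


# Hypothetical hit counts, for illustration only.
counts = {"alpha": (1000, 900), "beta": (1000, 10)}
print(rank_candidates(1000, counts, 10**6, top_n=1))
```

Because NGD is a distance, the candidates that the text calls the highest scoring correspond to the smallest NGD values in this sketch.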

[0042] Finally, the top n candidate words with the highest scores after re-scoring are added to the instances of the ontology concept. The process above yields the concepts and their instances, but the relations between concepts, and between concepts and attributes, have not yet been added; this can be done with the help of HowNet. HowNet is a bilingual common-sense knowledge base whose basic organizational unit is the concept. Concepts are defined using sememes. The relations between concepts, between concepts and sememes, and among sememes constitute HowNet's networked knowledge system. HowNet defines eight kinds of relations: hypernym-hyponym, synonym, antonym, converse, attribute-host, part-whole, material-product, and event-role. Adding these relations completes the automatic construction of the ontology.

[0043] Question understanding is the analysis a question-answering system must perform before retrieval, and the quality of this analysis strongly affects the subsequent processing. Question understanding must accomplish the following: extract the keywords of the question; expand the keywords appropriately according to factors such as the question type; determine the category to which the question belongs; and apply predefined question patterns to the question to obtain the question type. For a Chinese question-answering system, the question must first be segmented into words and tagged with parts of speech.

[0044] The word is the smallest unit for expressing information. Unlike Western languages, Chinese has no delimiters (spaces) between the words of a sentence, so the words must be segmented. The question analysis unit 20 provided by the embodiment of the present invention includes a Chinese word processing module 21, configured to perform word segmentation and part-of-speech tagging on a Chinese question input by the user.

[0045] Chinese word segmentation is subject to segmentation ambiguity. For example, the sentence "使用户满意" ("make the user satisfied") can be segmented as "使/用户/满意", but may also be wrongly segmented as "使用/户/满意", so contextual knowledge of various kinds is needed to resolve the ambiguity. After segmentation, part-of-speech tagging is performed with rule-based and statistical (Markov chain) methods. The n-gram statistical analysis method, which is based on the Markov chain stochastic process, has been shown to achieve high accuracy in part-of-speech tagging. The segmenter used in the embodiment of the present invention is the word segmentation system developed by the Machine Translation Laboratory of the School of Computer Science, Harbin Institute of Technology. It separates the words of the input Chinese text and marks the part of speech of each word with a symbol after the word. For example, /ng denotes a general noun, /nx a Chinese surname, /vg a general verb, and so on. The following is an example sentence after segmentation and part-of-speech tagging:

[0046] 哈尔滨/nd 在/p 什么/r 地方/ng ?/wj ("Where is Harbin?")
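A tagged sentence in the word/tag form shown above can be split into (word, tag) pairs with a short regular expression. This is a sketch of one way to read that output format, not part of the segmenter itself; the function name is hypothetical, and the pattern assumes that tags are lowercase ASCII letters and that a word never begins with lowercase ASCII letters directly after a tag.

```python
import re

# A word is any run of characters other than "/" and whitespace;
# a tag is the run of lowercase ASCII letters after the slash.
TOKEN_PATTERN = re.compile(r"([^/\s]+)\s*/([a-z]+)")


def parse_tagged(sentence):
    """Return the (word, tag) pairs of a segmented, tagged sentence."""
    return TOKEN_PATTERN.findall(sentence)


print(parse_tagged("哈尔滨/nd 在/p 什么/r 地方/ng ?/wj"))
# [('哈尔滨', 'nd'), ('在', 'p'), ('什么', 'r'), ('地方', 'ng'), ('?', 'wj')]
```

The pattern also accepts output in which tokens run together without spaces, since a Chinese word following a tag cannot be mistaken for part of that tag.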

[0047] Further, the question analysis unit 20 provided by the embodiment of the present invention includes a keyword/concept extraction module 22, configured to extract keywords according to the parts of speech of the segmented words.

[0048] In the embodiment of the present invention, the keyword extraction module 22 extracts from the user's question the keywords that are useful to the subsequent retrieval system. Not every word in the question can be extracted as a retrieval keyword. For example, interrogative words and common function words such as "吧", "了", and "的" should be filtered out; a stop-word list is needed to filter these words.
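Keyword extraction by part of speech combined with stop-word filtering can be sketched as below. The stop-word list and the rule that tags beginning with "n", "v", or "a" mark keyword candidates are assumptions made for illustration; the text names the principle but gives neither the list nor the full tag set.

```python
# Hypothetical stop-word list: function words plus one interrogative word.
STOP_WORDS = {"吧", "了", "的", "什么", "?"}

# Assumed convention: tags starting with these letters mark
# nouns, verbs, and adjectives respectively.
KEYWORD_TAG_PREFIXES = ("n", "v", "a")


def extract_keywords(tagged_words):
    """Keep words that are not stop words and whose tag marks a content word."""
    keywords = []
    for word, tag in tagged_words:
        if word in STOP_WORDS:
            continue
        if tag.startswith(KEYWORD_TAG_PREFIXES):
            keywords.append(word)
    return keywords


tagged = [("哈尔滨", "nd"), ("在", "p"), ("什么", "r"), ("地方", "ng"), ("?", "wj")]
print(extract_keywords(tagged))  # ['哈尔滨', '地方']
```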

[0049] Keywords consist mainly of nouns, verbs, adjectives, and restrictive adverbs. They fall into two kinds: general keywords and "must-contain" keywords. A must-contain keyword must appear in the answer sentence, whereas a general keyword need not. Keywords are given different weights, and these weights are used to compute the weight of a sentence during retrieval. Nouns and restrictive adverbs usually receive relatively high weights. Must-contain keywords consist of proper nouns, restrictive adverbs (such as largest, highest, and fastest), and times (such as 1997). The must-contain principle is adopted because these keywords constrain the question so strongly that a sentence without them is almost certainly not the correct answer. For example, suppose the question is "Which mountain is the highest peak in the world?" but the retrieval result is "Mount Qogir is the world's second-highest peak" ("乔戈里山是世界第二高峰"). This is clearly not the result the user wants, and it arises because the very important keyword "highest" ("最高") does not appear in the answer sentence. With the must-contain restriction in place, this answer would not be retrieved, so these keywords can greatly improve retrieval accuracy.
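The must-contain rule and the keyword weights can be sketched as a sentence-scoring function. The weights and the function name are hypothetical; the text states only that weights differ by keyword and that a sentence lacking a must-contain keyword should not be retrieved.

```python
def score_sentence(sentence, must_contain, weighted_keywords):
    """Score a candidate answer sentence, or return None if it is rejected.

    must_contain: keywords that have to appear in the sentence.
    weighted_keywords: general keywords mapped to illustrative weights.
    """
    if any(keyword not in sentence for keyword in must_contain):
        return None  # a must-contain keyword is missing
    return sum(w for keyword, w in weighted_keywords.items() if keyword in sentence)


must = ["最高"]                         # "highest"
general = {"世界": 1.0, "山峰": 2.0}    # "world", "mountain peak"

print(score_sentence("乔戈里山是世界第二高峰", must, general))        # None
print(score_sentence("珠穆朗玛峰是世界上最高的山峰", must, general))  # 3.0
```

The first sentence is rejected because it lacks "最高", which is exactly the failure case in the text; the second passes the check and is scored from its general keywords.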

[0050] The Chinese question-answering system of this embodiment also performs keyword expansion. An answer sentence often contains not the question's original keywords but synonyms of them. For example, the question is "法国大革命哪一年发生?" ("In which year did the French Revolution take place?") and the answer sentence is "18世纪末,法国爆发的资产阶级大革命。" ("At the end of the 18th century, the bourgeois revolution broke out in France."). The question uses 发生 ("take place") while the answer uses 爆发 ("break out"), so the keyword query fails. The keywords therefore need to be expanded appropriately. To this end, the question analysis unit 20 of the automatic question-answering system further includes a keyword expansion module 23, which expands the extracted keywords with their synonyms.

[0051] Keyword expansion raises recall, but inappropriate expansion sharply lowers retrieval precision, so question-answering systems generally expand keywords cautiously. This embodiment expands keywords in two ways. First, the synonyms of all words are added as expanded keywords. Second, for certain question types the corresponding answers tend to contain words with a shared characteristic; for a question asking about a place, for instance, the answer often contains words such as 在 ("at"), 位于 ("located in") and 地处 ("situated in"). These words are also added as expanded keywords.
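The two expansion routes can be sketched as below. The tables `SYNONYMS` and `TYPE_CUES` are hand-written stand-ins for a lexical resource and are assumptions, as is the function name `expand_keywords`.

```python
# Hypothetical synonym table (an assumption); a real system would query a lexical resource.
SYNONYMS = {"发生": ["爆发"], "在": ["位于", "地处"]}
# Hypothetical cue words that tend to occur in answers to a given question type.
TYPE_CUES = {"location": ["在", "位于", "地处"]}

def expand_keywords(keywords, question_type=None):
    """Append synonyms of each keyword, then the cue words of the question type."""
    expanded = list(keywords)
    for kw in keywords:
        for syn in SYNONYMS.get(kw, []):
            if syn not in expanded:
                expanded.append(syn)
    for cue in TYPE_CUES.get(question_type, []):
        if cue not in expanded:
            expanded.append(cue)
    return expanded

result = expand_keywords(["法国", "大革命", "发生"])
place = expand_keywords(["哈尔滨"], "location")
```

Keeping the original keywords first makes it easy to weight expanded terms lower than the originals, which is one way to limit the precision loss the paragraph warns about.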

[0052] For expanding word senses in English, WordNet is commonly used. WordNet provides a network of relations between words and synonym sets and among the synonym sets themselves. Following a word's hypernym pointers reveals how closely two words are related. For example, the hypernym pointers of "ex-husband" and "ex-wife" both point to "ex-spouse", so the two words are semantically linked through "ex-spouse". WordNet covers English; for Chinese, HowNet (知网) can serve as the system's semantic knowledge resource. HowNet is a common-sense knowledge base that describes the concepts denoted by Chinese and English words and whose basic content is the relations between concepts and between the attributes of concepts. It is an organic, networked knowledge system.

[0053] The system also classifies questions by type. In this embodiment each question generally has a specific type, i.e. a domain. For example, "中华人民共和国是什么时候成立的?" ("When was the People's Republic of China founded?") belongs to the history domain, "姚明身高多少?" ("How tall is Yao Ming?") to the sports domain, and "长江有多长?" ("How long is the Yangtze River?") to the geography domain. Classifying the frequently asked questions in the FAQ library by domain in advance, then classifying the input question and searching only within its class, effectively improves retrieval precision and speed. Most current automatic question-answering systems classify according to predefined categories, but such classification has many shortcomings: it depends heavily on human judgment, and the categories are too coarse to fully meet practical requirements. To address this, the system of this embodiment uses a semantics-based automatic classification method. Accordingly, the question analysis unit 20 further includes a question type classification module 24, which determines the question's type either from preset question classification criteria and the expanded keywords, or automatically from the semantics of the expanded keywords.

[0054] The system also extracts question patterns; a Chinese question-answering system is taken as the example here. Question-answering systems generally extract a question's pattern from its interrogative phrase. The following table lists common question types:

[0055] Table 1. Common question types

Figure CN103902652AD00101

[0058] An answer extraction rule is defined for each question type so that the rules can be applied in the answer extraction stage. For a question asking about a place, for example, the rule can require that the answer contain location information. The system uses a semantics-based automatic pattern extraction method: a large number of questions is first collected as a training corpus, and a program then counts the interrogative phrases that occur frequently. If the statistics show that "什么颜色" ("what color") occurs often in questions, it can be treated as an interrogative phrase, and every question containing it is treated as one question class.
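The counting step can be sketched with token n-grams. This is an assumption about how the statistics might be gathered; the function name `frequent_phrases`, the thresholds and the tiny corpus are illustrative only.

```python
from collections import Counter

def frequent_phrases(segmented_questions, n=2, min_count=2):
    """Count token n-grams across questions; keep those seen at least min_count times."""
    counts = Counter()
    for tokens in segmented_questions:
        for i in range(len(tokens) - n + 1):
            counts["".join(tokens[i:i + n])] += 1
    return {p: c for p, c in counts.items() if c >= min_count}

# Made-up, pre-segmented training questions.
corpus = [
    ["天空", "是", "什么", "颜色"],
    ["熊猫", "是", "什么", "颜色"],
    ["哈尔滨", "在", "什么", "地方"],
]
phrases = frequent_phrases(corpus)
```

Raw counts also surface uninformative n-grams such as "是什么", so a real system would likely keep only phrases that contain an interrogative word.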

[0059] The FAQ library 30 is a library of frequently asked questions. It stores the questions users often ask together with their answers. For a question entered by a user, the system can first search the FAQ library for an identical question and, if one exists, return the corresponding answer directly. Frequently asked questions can thus be answered quickly without the complex downstream processing, which greatly improves system efficiency.

[0060] In this embodiment the automatic question-answering system includes the FAQ library 30 and an FAQ library update unit. When the information retrieval unit finds no answer in the FAQ library, the update unit searches the Internet for an answer to the question and adds the answer found to the FAQ library.
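The lookup-then-update behavior can be sketched as a simple cache. The class `FaqStore`, the function `answer_question` and the injected `web_search` callable are hypothetical names; exact-match lookup is a simplification of the matching the system performs.

```python
class FaqStore:
    """In-memory stand-in for the FAQ library (exact-match lookup only)."""

    def __init__(self):
        self._answers = {}

    def lookup(self, question):
        return self._answers.get(question)

    def add(self, question, answer):
        self._answers[question] = answer

def answer_question(question, faq, web_search):
    """Return the FAQ answer if present; otherwise search, store the result, and return it."""
    cached = faq.lookup(question)
    if cached is not None:
        return cached
    answer = web_search(question)   # fallback: retrieval plus answer extraction
    if answer is not None:
        faq.add(question, answer)   # update step: later identical questions hit the FAQ
    return answer

calls = []
def fake_search(question):
    """Test double standing in for Internet retrieval."""
    calls.append(question)
    return "黑龙江省"

faq = FaqStore()
first = answer_question("哈尔滨在什么地方?", faq, fake_search)
second = answer_question("哈尔滨在什么地方?", faq, fake_search)
```

The second call is served from the store, so the expensive fallback runs only once per new question.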

[0061] The initial FAQ library is built from resources on the Internet: a web crawler collects FAQs on web pages and question-and-answer data from Q&A communities, and the data are stored after being converted into structured form. Pages containing FAQs can be found with search engines such as Google and Baidu by using "inurl:faq" as the query; the pages returned are those containing FAQs. Existing Internet Q&A communities include Baidu Zhidao (百度知道) and Sina iAsk (新浪爱问). These open communities let ordinary users browse questions and their answers; questions and answers are laid out in a fixed way on the page and the correct answer is clearly marked, which makes extraction easy. The FAQ update unit handles questions that are not yet in the FAQ library: the system finds a matching answer on the Internet through its information retrieval and answer extraction stages, and the user's question and the answer obtained are then added to the FAQ library.

[0062] In this embodiment, the information retrieval unit 40 uses the previously extracted keywords to look up relevant documents or answers in the document collection, and returns the most relevant documents. The unit may also call an existing retrieval system directly, such as the SMART system, or an Internet search engine such as Google. Its input is generally a combination of keywords; for an English question-answering system, the keywords must also be stemmed.

[0063] Building the information retrieval unit 40 requires indexing the document collection so that documents containing given keywords can be found quickly. Before indexing, the corpus must be preprocessed, for example by removing duplicate documents; an English corpus must be stemmed, and a Chinese corpus must be word-segmented.
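A minimal sketch of such an index, assuming the documents have already been segmented (or stemmed). The function names `build_index` and `search` and the toy documents are assumptions for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index. docs: dict mapping doc_id -> list of tokens."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for tok in tokens:
            index[tok].add(doc_id)
    return index

def search(index, keywords):
    """Return the ids of documents that contain every keyword."""
    sets = [index.get(k, set()) for k in keywords]
    return set.intersection(*sets) if sets else set()

docs = {
    1: ["哈尔滨", "位于", "黑龙江"],
    2: ["长江", "位于", "中国"],
    3: ["哈尔滨", "冰雪", "节"],
}
index = build_index(docs)
hits = search(index, ["哈尔滨", "位于"])
```

The conjunctive lookup shown here is the strictest choice; a ranked retrieval would instead take the union of the posting sets and score each document.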

[0064] The key tasks of the information retrieval unit 40 are determining document weights and ranking the documents. Document weights are commonly computed with the TF-IDF algorithm, using the following formula:

Figure CN103902652AD00111

[0066] Here KWi is the weight assigned during question analysis to the i-th keyword contained in the document, TFi is the frequency of that keyword in this document, IDFi is the keyword's inverse document frequency, and D is the distribution density of the keywords in the document. The more often a keyword occurs in the document, the larger its TF; the more documents a keyword occurs in, the smaller its IDF, and vice versa; the more concentrated the keywords are within the document, the larger D. The TF*IDF value reflects one aspect of a keyword's importance: a word that occurs often in one document (large TF) but rarely in other documents (large IDF) carries more information and is therefore more important. In addition, the more densely the keywords are distributed in a document, the more likely the document contains a relevant answer and the larger its weight. Once the weights are computed, the documents are sorted by weight and those with the largest weights are returned to the answer extraction unit 50.
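The formula itself appears only as a figure and is not reproduced in the text, so the sketch below is an assumption rather than the patent's exact expression. It assumes the weight has the form D × Σ KWi · TFi · IDFi, with IDF taken as log(N / document frequency) and D taken as the number of keyword occurrences divided by the length of the span they cover. All names are hypothetical.

```python
import math

def doc_weight(tokens, kw_weights, doc_freq, n_docs):
    """Assumed form: D * sum_i(KW_i * TF_i * IDF_i).

    tokens:     the document as a list of tokens
    kw_weights: keyword -> weight from question analysis (KW_i)
    doc_freq:   keyword -> number of documents containing it
    n_docs:     total number of documents in the collection
    """
    positions = [i for i, t in enumerate(tokens) if t in kw_weights]
    if not positions:
        return 0.0
    total = 0.0
    for kw, kw_w in kw_weights.items():
        tf = tokens.count(kw) / len(tokens)
        idf = math.log(n_docs / doc_freq[kw]) if doc_freq.get(kw) else 0.0
        total += kw_w * tf * idf
    # Assumed density: keyword occurrences per token within the span they cover.
    span = positions[-1] - positions[0] + 1
    density = len(positions) / span
    return density * total

kw = {"a": 1.0, "b": 1.0}
df = {"a": 1, "b": 1}
dense = doc_weight(["a", "b", "x", "x"], kw, df, n_docs=10)
spread = doc_weight(["a", "x", "x", "b"], kw, df, n_docs=10)
```

Both toy documents have identical TF and IDF values, so the difference in weight comes only from the density factor, matching the qualitative behavior the paragraph describes.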

[0067] A general search engine returns a set of documents (web pages), whereas a question-answering system must return a short answer. The answer extraction unit 50 of the automatic question-answering system therefore includes:

[0068] a relevance ranking module, which ranks the relevant documents returned by the information retrieval unit 40 by relevance to obtain the highly relevant documents; a document extraction module 52, which extracts from the highly relevant documents one or more answers that satisfy the answer extraction rule corresponding to the question type; and an answer organizing module 53, which clusters the multiple answers extracted by the answer extraction unit 50 and sends the clustered answers to the user interaction unit. The purpose of clustering the answers is to let the system return answers that are as diverse as possible, so as to satisfy the user's query as fully as possible.

[0069] The relevance ranking module first ranks the returned documents (e.g. web pages) by relevance and passes the more relevant ones to the document extraction module 52 of the answer extraction unit 50, which distills one or more answers. An answer may take the form of a word or phrase, a sentence, a paragraph or a summary. Words, sentences and paragraphs are relatively simple to handle; a summary requires multi-document automatic summarization. After several candidate answers have been extracted, the answer organizing module 53 can organize them by clustering and feed them back to the user interaction unit 10 according to a trade-off between relevance and diversity.

[0070] Different questions often call for different answer forms and different extraction methods, so an answer extraction rule is defined for each question class. Depending on the question type, the answer may be a word or phrase, a sentence, a paragraph or a summary. For some question types the answer must also satisfy specific conditions. Table 2 gives the answer extraction rules corresponding to Table 1.

[0071] Table 2. Answer extraction rules

Figure CN103902652AD00121

[0073] Using sentences as answers is relatively simple: the document extraction module 52 splits the highly relevant documents into sentences, computes each sentence's weight with the TF-IDF algorithm, sorts by weight to obtain candidate answers, and then re-ranks the candidates according to the question type to obtain the answer. For questions about time or place, however, the answer is short and does not need a full sentence. For the question "中华人民共和国是什么时候成立的?" ("When was the People's Republic of China founded?"), the expected answer is a phrase, namely "1949年10月1日" ("1 October 1949"). Retrieval may instead return a sentence such as: "From the founding of the People's Republic of China on 1 October 1949 to the end of 1994, China established diplomatic relations with about 160 countries and developed economic, trade and cultural ties with even more countries and regions." The desired answer is only a small part of this sentence; returning the whole sentence would give the user too much redundant information, so the answer phrase needs to be extracted.

[0074] For ease of processing, many question-answering systems return a sentence as the answer. In such a system, answer extraction proceeds as follows:

[0075] (1) Split the retrieved documents into sentences.

[0076] (2) Compute the weight of each sentence with a chosen algorithm.

[0077] (3) Sort the sentences by weight.

[0078] (4) Re-rank the candidate answers according to the question type.

[0079] In step (2) the sentence weight is again computed with the TF-IDF algorithm described above. After sorting by weight, the candidate answers are re-ranked according to the question type. Each question class places particular requirements on the answer, so each has its own answer extraction rule, and the re-ranking in step (4) follows these rules. For a time-related question the answer must contain time information; for a quantity-related question it must contain numeric information; otherwise it cannot be the correct answer.
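The four steps can be sketched as one small function. Two simplifications are assumptions of this sketch: the sentence weight is a plain keyword-occurrence count standing in for TF-IDF, and step (4) is implemented by dropping sentences that violate the type rule. The names `rank_answers` and `TIME_PATTERN` and the sample text are hypothetical.

```python
import re

# Assumed time rule: the sentence must mention a four-digit year followed by 年.
TIME_PATTERN = re.compile(r"\d{4}年")

def rank_answers(text, keywords, question_type=None):
    # Step 1: split the text into sentences at Chinese sentence-final punctuation.
    sentences = re.findall(r"[^。!?]+[。!?]?", text)
    # Step 2: weight each sentence (keyword counts stand in for TF-IDF here).
    scored = [(sum(s.count(k) for k in keywords), s) for s in sentences]
    # Step 3: sort by weight, highest first (the sort is stable).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Step 4: apply the rule for the question type.
    if question_type == "time":
        scored = [pair for pair in scored if TIME_PATTERN.search(pair[1])]
    return [s for _, s in scored]

text = "中华人民共和国首都是北京。中华人民共和国成立于1949年。"
keywords = ["中华人民共和国", "成立"]
time_answers = rank_answers(text, keywords, "time")
all_answers = rank_answers(text, keywords)
```

Filtering is the strictest reading of step (4); a softer variant would move rule-satisfying sentences to the front instead of discarding the rest.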

[0080] Questions about time or place can be answered with a very short phrase, but some questions are hard to answer in a phrase or a sentence. For a question such as "9.11事件是怎么回事?" ("What was the 9/11 incident about?"), the Internet holds many related reports that may describe the event from different angles. Handing all of them to the user would cost the user a great deal of reading time. Condensing them into a short summary from which the user can learn the causes and consequences of the whole event is far more convenient. This requires multi-document automatic summarization: the document extraction module 52 extracts the content that the highly relevant documents have in common as the answer to the question. The multi-document summarization module turns the relevant documents retrieved by the information retrieval unit 40 into a summary that is returned to the user.

[0081] The basic idea of multi-document automatic summarization is to extract the main content shared by the documents, remove the redundant information repeated across them, and then generate a summary with a suitable algorithm. Topics of common concern can be found by clustering sentences: sentences that cluster together usually describe the same matter, so each cluster can be said to represent one topic. The more sentences a cluster contains, the more important its topic. The most representative sentences are then selected from the more important topics to form the summary.
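A rough sketch of cluster-based summarization follows. The patent does not name a clustering algorithm, so this uses a greedy single-pass grouping by Jaccard similarity over character bigrams, and it takes the first sentence of each cluster as its representative; both choices, the threshold and all names are assumptions.

```python
def bigrams(s):
    """Set of character bigrams, a crude similarity feature that needs no segmenter."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_sentences(sentences, threshold=0.3):
    """Greedy clustering: join the first cluster whose seed sentence is similar enough."""
    clusters = []
    for s in sentences:
        for c in clusters:
            if jaccard(bigrams(s), bigrams(c[0])) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

def summarize(sentences, max_topics=2):
    """Pick one sentence from each of the largest clusters (largest = most important topic)."""
    clusters = sorted(cluster_sentences(sentences), key=len, reverse=True)
    return [c[0] for c in clusters[:max_topics]]

sentences = ["哈尔滨位于黑龙江省", "哈尔滨市位于黑龙江省南部", "哈尔滨冬季举办冰雪节"]
summary = summarize(sentences)
```

Here the two location sentences fall into one cluster and contribute a single sentence, which is how the redundancy across documents is removed.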

[0082] For a more detailed understanding of the automatic question-answering system of this embodiment, its functional modules are further described below by way of example.

[0083] The user enters a query question through the user interaction unit 10, for example "哈尔滨在什么地方?" ("Where is Harbin?").

[0084] After the question analysis unit 20 receives the question, the words must be segmented: the word is the smallest unit for expressing information, and Chinese, unlike Western languages, has no delimiter (space) between the words of a sentence.

[0085] The question is first word-segmented and annotated. The segmenter used is the word segmentation system developed by the Machine Translation Laboratory of the School of Computer Science, Harbin Institute of Technology. It separates the words of the input Chinese text and appends to each word a symbol marking its part of speech; for example, /ng denotes a common noun, /nx a Chinese surname, and /vg a common verb. Below is an example sentence after segmentation and part-of-speech tagging:

[0086] 哈尔滨/nd 在/p 什么/r 地方/ng ?/wj

[0087] The keyword extraction module 22 of the question analysis unit 20 extracts keywords according to the part-of-speech tags. Keywords consist mainly of nouns, verbs, adjectives and restrictive adverbs. They fall into two kinds: general keywords and "must-contain" keywords. A must-contain keyword has to appear in the answer sentence, whereas a general keyword need not. Keywords are assigned different weights, and these weights are used to compute a sentence's weight when sentences are retrieved. Nouns and restrictive adverbs usually receive relatively high weights. Must-contain keywords comprise proper nouns, restrictive adverbs (e.g. 最大 "largest", 最高 "highest", 最快 "fastest") and time expressions (e.g. 1997年, "the year 1997"). The must-contain rule exists because these words constrain the question so strongly that a sentence lacking them is almost never the correct answer. In the question above, the keywords are 哈尔滨 ("Harbin"), 在 ("at") and 地方 ("place").
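A tagged line in the word/tag format of the example can be parsed and filtered as sketched below. The set `KEYWORD_TAGS` is an assumption chosen so that the result matches the three keywords named above; the patent does not list which tags are kept.

```python
def parse_tagged(line):
    """Split 'word/tag word/tag ...' into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

# Assumed set of tags whose words are kept as keywords.
KEYWORD_TAGS = {"nd", "ng", "vg", "p"}

tagged = "哈尔滨/nd 在/p 什么/r 地方/ng ?/wj"
pairs = parse_tagged(tagged)
keywords = [word for word, tag in pairs if tag in KEYWORD_TAGS]
```

Using `rpartition` splits on the last slash, so a word that itself contains a slash would still be separated from its tag correctly.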

[0088] The keyword expansion module 23 of the question analysis unit 20 expands the keywords with synonyms, looking up synonyms for certain keywords. For Chinese, HowNet (知网) can serve as the system's semantic knowledge resource. HowNet is a common-sense knowledge base that describes the concepts denoted by Chinese and English words and whose basic content is the relations between concepts and between the attributes of concepts. It is an organic, networked knowledge system. Among the keywords above, 在 and 位于 are synonyms, so synonym expansion applies; moreover, 位于 occurs more often than 在 in written language and expresses the meaning more clearly.

[0089] The question type classification module 24 of the question analysis unit 20 identifies the knowledge domain of the user's question from the extracted keywords. In this example, 哈尔滨 together with 在 / 位于 places the question in the geography domain. This classification narrows the range of questions to be searched in the FAQ library.

[0090] Question pattern extraction follows the predefined rules shown in Table 1: the phrase "什么地方" ("where") identifies the question type as a place query. The corresponding answer extraction rule in Table 2 then applies, under which the extracted answer must contain location information.

[0091] After the question analysis unit 20 has processed the question, the question is submitted to the FAQ library 30. The FAQ library 30 can be searched through an inverted index or as structured data, i.e. through a SQL database. In this embodiment the information retrieval unit 40 searches the FAQ library 30 using an inverted index or a SQL database. An example of the question table definition follows:

Figure CN103902652AD00141

[0093] If a corresponding question is found for the question above, either by matching the original question or by keyword matching, the answer stored for that question is returned to the user.
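The two matching routes can be sketched against a SQL table using Python's built-in `sqlite3` module. The table definition in the patent is given only as a figure, so the schema here (`question_id`, `question`, `keywords`, `answer`) and the function `find_answer` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; the patent's actual table definition is not reproduced in the text.
conn.execute(
    "CREATE TABLE faq ("
    "question_id INTEGER PRIMARY KEY, question TEXT, keywords TEXT, answer TEXT)"
)
conn.execute(
    "INSERT INTO faq (question, keywords, answer) VALUES (?, ?, ?)",
    ("哈尔滨在什么地方?", "哈尔滨 位于 地方", "哈尔滨位于黑龙江省。"),
)

def find_answer(conn, question, keywords):
    """Try an exact match on the question text, then fall back to keyword matching."""
    row = conn.execute(
        "SELECT answer FROM faq WHERE question = ?", (question,)
    ).fetchone()
    if row:
        return row[0]
    if not keywords:
        return None
    clause = " AND ".join("keywords LIKE ?" for _ in keywords)
    row = conn.execute(
        f"SELECT answer FROM faq WHERE {clause}", [f"%{k}%" for k in keywords]
    ).fetchone()
    return row[0] if row else None

exact = find_answer(conn, "哈尔滨在什么地方?", [])
by_keyword = find_answer(conn, "哈尔滨在哪里?", ["哈尔滨", "位于"])
missing = find_answer(conn, "长江有多长?", ["长江"])
```

A `None` result corresponds to the case where the question is not in the FAQ library and the system has to fall back to Internet retrieval.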

[0094] If neither original-question matching nor keyword matching finds a corresponding Question ID, i.e. the question is not in the FAQ library, the question is searched on the Internet through a search engine to obtain relevant documents; the required answer lies within these candidate documents.

[0095] The answer extraction unit 50 extracts the answer to the question from these documents.

[0096] First, the relevance ranking module of the answer extraction unit 50 ranks the retrieved documents by relevance, and a portion of the top-ranked documents is taken as the document set from which candidate answers are extracted.

[0097] The answer extraction unit 50 obtains the applicable extraction rule, as shown in Table 2, from the question type determined earlier by question pattern extraction. For the question "中华人民共和国是什么时候成立的" ("When was the People's Republic of China founded"), for example, Table 2 requires the answer to be a phrase containing time information, such as "1949年10月1日" ("1 October 1949"), or a sentence such as "中华人民共和国成立于1949年10月1日" ("The People's Republic of China was founded on 1 October 1949"). Phrases or sentences in the documents that satisfy this condition are extracted to form the candidate answer set.

[0098] As another example, for the question "9.11事件是怎么回事?" ("What was the 9/11 incident about?") there are many related reports on the Internet, which may describe the event from different angles. Handing all of them to the user would cost the user a great deal of reading time. Condensing them into a short summary from which the user can learn the causes and consequences of the whole event is far more convenient. This requires multi-document automatic summarization, which turns the relevant documents retrieved by the information retrieval module into a summary that serves as a candidate answer.

[0099] The candidate answer set may contain several possible answers, and clustering divides them into types. For a question with a unique answer, clustering generally yields a single cluster, and the answer can be returned to the user. For some questions, such as "苹果是哪年成立的?" ("In which year was Apple founded?"), the answer is not unique, because 苹果 ("Apple") may refer to Apple Inc. in the IT industry, to the Apple group company (苹果集团公司) in the apparel industry, or to Apple Records in the music industry. Such candidate answers form at least three clusters, and each answer may be the one the user expects, so the system should not return only one of them and can instead return a diverse set of answers.

[0100] The automatic question-answering system of this embodiment classifies questions according to an ontology, not only at the domain level but also at the concept level, for example basketball / NBA / NBA stars / Lakers stars.

[0101] In addition, the system extracts the key concepts contained in the question, and synonyms of these key concepts can be normalized with the help of the ontology.

[0102] The system proposes patterns for question extraction and rules for the corresponding answers, learns from these patterns and rules, and applies them in the subsequent answer extraction step to improve its accuracy. For example, to improve matching accuracy, the question point (what is being asked) and the condition points can be extracted from the question and combined into a question query pattern, and answers are then extracted according to that pattern.

[0103] The answers initially obtained by the system are documents, but they can be processed further: after extraction the answer may be a word or phrase, a sentence, or even a summary drawn from several documents. Multi-document automatic summarization is proposed mainly to save the user time in reading the answer.

[0104] Prior-art automatic question-answering systems generally do not explicitly consider the diversity of answers. The system of this embodiment can recompute the weights of the matched answers according to both relevance and diversity, producing a better ranking that satisfies both needs before the answers are fed back to the user. Diversification here means that diversity is already taken into account when the answer ranking is computed, rather than referring to the answer clustering step.
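The patent does not specify how relevance and diversity are combined. One well-known way to fold diversity into the ranking itself is maximal marginal relevance (MMR), sketched here as a stand-in technique rather than as the patent's method. The function name `mmr_rerank`, the parameter `lam` and the toy data are assumptions.

```python
def mmr_rerank(candidates, relevance, similarity, lam=0.7, k=3):
    """Greedy MMR: repeatedly pick the candidate that best balances relevance
    against its similarity to the answers already selected."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: A1 and A2 express the same answer sense, B1 a different one.
relevance = {"A1": 0.9, "A2": 0.85, "B1": 0.6}
same_sense = lambda x, y: 1.0 if x[0] == y[0] else 0.0

diverse = mmr_rerank(["A1", "A2", "B1"], relevance, same_sense, k=2)
relevance_only = mmr_rerank(["A1", "A2", "B1"], relevance, same_sense, lam=1.0, k=2)
```

With `lam=1.0` the ranking reduces to pure relevance, while lower values push near-duplicate answers down in favor of answers with a different sense.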

[0105] Compared with the prior art, this embodiment of the invention performs type classification and pattern extraction on both the input question and the questions in the FAQ bank, so the answers returned to the user cover a wide range of content. In form, answers can include phrases and sentences, and a summary can be produced with a multi-document summarization algorithm; candidate answers are processed by methods such as clustering, which improves the accuracy and diversity of the answers. In addition, by introducing an ontology into the automatic question-answering system, the embodiment enables semantic analysis of the user's query terms and a fuller understanding of the user's query intent, which effectively improves precision and recall.

[0106] It should be understood that the application of the invention is not limited to the examples above. A person of ordinary skill in the art can make improvements or variations based on the above description, and all such improvements and variations fall within the scope of protection of the appended claims.

Claims (10)

1. An automatic question-answering system, characterized by comprising: a user interaction unit, a question analysis unit, a frequently asked question (FAQ) bank, an information retrieval unit, and an answer extraction unit, wherein: the user interaction unit is configured to receive a question input by a user and to feed the answer to the question back to the user; the question analysis unit is configured to extract keywords from the question input by the user, to expand the keywords, and to classify the question according to preset question classification criteria to obtain the type of the question; the FAQ bank is configured to store questions frequently asked by users and their answers; the information retrieval unit is configured to search the FAQ bank for answers to the question according to the keywords expanded by the question analysis unit, and to return relevant documents or answers; and the answer extraction unit is configured to extract, from the relevant documents returned by the information retrieval unit, an answer that conforms to the answer extraction rule corresponding to the type of the question, and to send the extracted answer to the user interaction unit.
2. The automatic question-answering system according to claim 1, characterized in that the question analysis unit comprises: a Chinese word processing module, configured to perform word segmentation and part-of-speech tagging on the Chinese question input by the user; a keyword extraction module, configured to extract keywords according to the parts of speech of the segmented words; a keyword expansion module, configured to perform synonym expansion on the extracted keywords; and a question type classification module, configured to classify the question according to the preset question classification criteria and the expanded keywords, or to classify the question automatically according to the semantics of the expanded keywords.
3. The automatic question-answering system according to claim 2, characterized in that the answer extraction unit comprises: a relevance ranking module, configured to rank the relevant documents returned by the information retrieval unit by relevance and obtain the highly relevant documents; a document extraction module, configured to extract, from the highly relevant documents, one or more answers that conform to the answer extraction rule corresponding to the question type; and an answer organizing module, configured to cluster the multiple answers extracted by the answer extraction unit and send the clustered answers to the user interaction unit.
4. The automatic question-answering system according to claim 3, characterized in that, if the answer extraction rule corresponding to the question type is a sentence or paragraph, the document extraction module splits the highly relevant documents into sentences, computes the weight of each sentence with the TF-IDF algorithm, sorts the sentences by weight to obtain candidate answers, and ranks the candidate answers according to the question type to obtain the answer to the question.
5. The automatic question-answering system according to claim 3, characterized in that, if the answer extraction rule corresponding to the question type is a digest, the document extraction module extracts the content of common concern across the highly relevant documents as the answer to the question.
6. The automatic question-answering system according to claim 2, characterized in that the information retrieval unit searches the FAQ bank for answers to the question by means of an inverted index or structured data.
7. The automatic question-answering system according to any one of claims 1-6, characterized in that the system further comprises: an FAQ bank update unit, configured to search the Internet for an answer to the question when the information retrieval unit finds no answer in the FAQ bank, and to add the answer found to the FAQ bank.
8. The automatic question-answering system according to any one of claims 1-6, characterized in that the user interaction unit is a browser.
9. The automatic question-answering system according to any one of claims 1-6, characterized in that the keywords extracted by the question analysis unit include nouns, verbs, adjectives, or adverbs.
10. The automatic question-answering system according to any one of claims 1-6, characterized in that the system further comprises an ontology knowledge base, which provides a shared vocabulary for the question analysis unit.
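
The sentence weighting named in claim 4 can be illustrated with a small TF-IDF sketch. Several things here are assumptions, not details from the claims: sentences are tokenized on whitespace (Chinese text would first need the word segmentation of claim 2), each sentence is treated as one "document" for the IDF statistics, a smoothed IDF is used, and the name `rank_sentences` is hypothetical. Keywords are expected in lowercase.

```python
import math
from collections import Counter


def rank_sentences(sentences, keywords):
    """Weight each sentence by the TF-IDF of the query keywords it contains.

    Returns (weight, sentence) pairs sorted from highest to lowest weight.
    """
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)

    # Document frequency: in how many sentences each token occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scored = []
    for sentence, tokens in zip(sentences, tokenized):
        tf = Counter(tokens)
        weight = 0.0
        for kw in keywords:
            if tf[kw]:
                # Smoothed IDF; stays positive even if every sentence has kw.
                idf = math.log((1 + n) / (1 + df[kw])) + 1.0
                weight += (tf[kw] / len(tokens)) * idf
        scored.append((weight, sentence))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


sentences = [
    "the warranty lasts two years",
    "shipping takes three days",
    "warranty claims need a receipt",
]
ranked = rank_sentences(sentences, ["warranty", "years"])
```

In this example the first sentence matches both keywords and ranks highest, while the shipping sentence matches none and receives a weight of zero. The further ranking by question type that claim 4 describes is not modeled here.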

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201410068844 CN103902652A (en) 2014-02-27 2014-02-27 Automatic question-answering system

Publications (1)

Publication Number Publication Date
CN103902652A true CN103902652A (en) 2014-07-02

Family

ID=50993975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201410068844 CN103902652A (en) 2014-02-27 2014-02-27 Automatic question-answering system

Country Status (1)

Country Link
CN (1) CN103902652A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1928864A (en) * 2006-09-22 2007-03-14 浙江大学 FAQ based Chinese natural language ask and answer method
CN101097573A (en) * 2006-06-28 2008-01-02 腾讯科技(深圳)有限公司 Automatically request-answering system and method
CN101286161A (en) * 2008-05-28 2008-10-15 华中科技大学 Intelligent Chinese request-answering system based on concept
CN101373532A (en) * 2008-07-10 2009-02-25 昆明理工大学 FAQ Chinese request-answering system implementing method in tourism field
CN101377777A (en) * 2007-09-03 2009-03-04 北京百问百答网络技术有限公司 Automatic inquiring and answering method and system
CN103425640A (en) * 2012-05-14 2013-12-04 华为技术有限公司 Multimedia questioning-answering system and method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YONG JIANG ET AL.: ""Affiliation Disambiguation for Constructing Semantic Digital Libraries"", 《JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY》 *
刘里等: ""自动问答系统研究综述"", 《山东科技大学学报(自然科学版)》 *
郑实福等: ""自动问答综述"", 《中文信息学报》 *

Cited By (57)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102721A (en) * 2014-07-18 2014-10-15 百度在线网络技术(北京)有限公司 Method and device for recommending information
CN105335400A (en) * 2014-07-22 2016-02-17 阿里巴巴集团控股有限公司 Method and apparatus for obtaining answer information for questioning intention of user
CN105335400B (en) * 2014-07-22 2018-11-23 阿里巴巴集团控股有限公司 Enquirement for user is intended to obtain the method and device of answer information
CN104461525B (en) * 2014-11-27 2018-01-23 韩慧健 A kind of intelligent consulting platform generation system that can customize
CN104461525A (en) * 2014-11-27 2015-03-25 韩慧健 Intelligent user-defined consulting platform generating system
CN104679815B (en) * 2014-12-08 2018-02-23 北京云知声信息技术有限公司 It is a kind of to screen question and answer pair and the method and system in real-time update question and answer storehouse
CN104679815A (en) * 2014-12-08 2015-06-03 北京云知声信息技术有限公司 Method and system for screening question and answer pairs and updating question and answer database in real time
CN104536991A (en) * 2014-12-10 2015-04-22 乐娟 Answer extraction method and device
CN104536991B (en) * 2014-12-10 2017-12-08 乐娟 answer extracting method and device
CN104778257A (en) * 2015-04-20 2015-07-15 百度在线网络技术(北京)有限公司 Method and device for searching applied problems
CN104778257B (en) * 2015-04-20 2018-09-07 百度在线网络技术(北京)有限公司 Using topic searching method and device
CN104820694A (en) * 2015-04-28 2015-08-05 中国科学院自动化研究所 Automatic Q&A method and system based on multi-knowledge base and integral linear programming ILP
CN104820694B (en) * 2015-04-28 2019-03-15 中国科学院自动化研究所 Automatic question-answering method and system based on multiple knowledge base and integral linear programming ILP
CN104951433A (en) * 2015-06-24 2015-09-30 北京京东尚科信息技术有限公司 Method and system for intention recognition based on context
CN104951433B (en) * 2015-06-24 2018-01-23 北京京东尚科信息技术有限公司 The method and system of intention assessment is carried out based on context
CN105095195B (en) * 2015-07-03 2018-09-18 北京京东尚科信息技术有限公司 Nan-machine interrogation's method and system of knowledge based collection of illustrative plates
CN105095195A (en) * 2015-07-03 2015-11-25 北京京东尚科信息技术有限公司 Method and system for human-machine questioning and answering based on knowledge graph
CN105354180A (en) * 2015-08-26 2016-02-24 欧阳江 Method and system for realizing open semantic interaction service
CN105117387A (en) * 2015-09-21 2015-12-02 上海智臻智能网络科技股份有限公司 Intelligent robot interaction system
CN105117388A (en) * 2015-09-21 2015-12-02 上海智臻智能网络科技股份有限公司 Intelligent robot interaction system
CN106547785A (en) * 2015-09-22 2017-03-29 阿里巴巴集团控股有限公司 Method and system for acquiring information from knowledge base
CN105279274B (en) * 2015-10-30 2018-11-02 北京京东尚科信息技术有限公司 Answer synthesis based on naturally semantic question answering system and matched method and system
CN105279274A (en) * 2015-10-30 2016-01-27 北京京东尚科信息技术有限公司 Answer combining and matching method and system based on natural synthetic answer system
WO2017076263A1 (en) * 2015-11-03 2017-05-11 中兴通讯股份有限公司 Method and device for integrating knowledge bases, knowledge base management system and storage medium
CN105574133A (en) * 2015-12-15 2016-05-11 苏州贝多环保技术有限公司 Multi-mode intelligent question answering system and method
CN105630887A (en) * 2015-12-18 2016-06-01 北京中科汇联科技股份有限公司 Representation method for knowledge markup languages of Chinese question answering system and Chinese question answering system
CN105740310B (en) * 2015-12-21 2019-08-02 哈尔滨工业大学 A kind of automatic answer method of abstracting and system in question answering system
CN105740310A (en) * 2015-12-21 2016-07-06 哈尔滨工业大学 Automatic answer summarizing method and system for question answering system
CN105630917A (en) * 2015-12-22 2016-06-01 成都小多科技有限公司 Intelligent answering method and intelligent answering device
CN105630938A (en) * 2015-12-23 2016-06-01 深圳市智客网络科技有限公司 Intelligent question-answering system
CN105653619A (en) * 2015-12-25 2016-06-08 上海智臻智能网络科技股份有限公司 Update method and device of correct log library in intelligent question-answering system
CN105653619B (en) * 2015-12-25 2019-01-25 上海智臻智能网络科技股份有限公司 The update method and device in correct log library in intelligent Answer System
CN105512349B (en) * 2016-02-23 2019-03-26 首都师范大学 A kind of answering method and device for learner's adaptive learning
CN105512349A (en) * 2016-02-23 2016-04-20 首都师范大学 Question and answer method and question and answer device for adaptive learning of learners
CN105824933B (en) * 2016-03-18 2019-02-26 苏州大学 Automatically request-answering system and its implementation based on main rheme
CN105824933A (en) * 2016-03-18 2016-08-03 苏州大学 Automatic question-answering system based on theme-rheme positions and realization method of automatic question answering system
CN105893465A (en) * 2016-03-28 2016-08-24 北京京东尚科信息技术有限公司 Automatic question answering method and device
CN105893552A (en) * 2016-03-31 2016-08-24 成都小多科技有限公司 Method and device for processing data
CN105955976B (en) * 2016-04-15 2019-05-14 中国工商银行股份有限公司 A kind of automatic answering system and method
CN105955976A (en) * 2016-04-15 2016-09-21 中国工商银行股份有限公司 Automatic answering system and method
CN105912527A (en) * 2016-04-19 2016-08-31 北京高地信息技术有限公司 Method, device and system outputting answer according to natural language
CN105912697B (en) * 2016-04-25 2019-08-27 北京光年无限科技有限公司 A kind of optimization method and device of conversational system knowledge base
CN105912697A (en) * 2016-04-25 2016-08-31 北京光年无限科技有限公司 Optimization method and device of dialog system knowledge base
CN106095965A (en) * 2016-06-17 2016-11-09 上海智臻智能网络科技股份有限公司 Data processing method and device
CN106202288B (en) * 2016-06-30 2019-10-11 北京智能管家科技有限公司 A kind of optimization method and system of man-machine interactive system knowledge base
CN106202288A (en) * 2016-06-30 2016-12-07 北京智能管家科技有限公司 Optimization method and system for knowledge base of man-machine interactive system
CN106599297A (en) * 2016-12-28 2017-04-26 北京百度网讯科技有限公司 Method and device for searching question-type search terms on basis of deep questions and answers
WO2018121759A1 (en) * 2016-12-31 2018-07-05 深圳市优必选科技有限公司 Intelligent question and answer method and system
WO2018133518A1 (en) * 2017-01-20 2018-07-26 优酷网络技术(北京)有限公司 Method, system, and apparatus for information interaction in live webcast
CN106878819A (en) * 2017-01-20 2017-06-20 合网络技术(北京)有限公司 Method, system and device for information interaction in live webcast
CN106951470A (en) * 2017-03-03 2017-07-14 中兴耀维科技江苏有限公司 Intelligent question-answering system based on business knowledge graph retrieval
WO2018157805A1 (en) * 2017-03-03 2018-09-07 腾讯科技(深圳)有限公司 Automatic questioning and answering processing method and automatic questioning and answering system
CN106991161A (en) * 2017-03-31 2017-07-28 北京字节跳动科技有限公司 Method for automatically generating answers to open-ended questions
CN106991161B (en) * 2017-03-31 2019-02-19 北京字节跳动科技有限公司 A method of automatically generating open-ended question answer
CN107229675A (en) * 2017-04-28 2017-10-03 北京神州泰岳软件股份有限公司 Question and answer base construction method, method of answering, the apparatus and system of list type knowledge
WO2019062012A1 (en) * 2017-09-30 2019-04-04 平安科技(深圳)有限公司 Question association push method, electronic device and computer readable storage medium
EP3531301A1 (en) * 2018-02-27 2019-08-28 DTMS GmbH Computer-implemented method for querying data

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
RJ01 Rejection of invention patent application after publication