CN109726274B - Question generation method, device and storage medium - Google Patents
- Publication number: CN109726274B (application CN201811641895.1A)
- Authority
- CN
- China
- Prior art keywords
- question
- text
- document
- processed
- model
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Description
Technical Field
The present invention relates to the field of information technology, and in particular to a question generation method, a question generation apparatus, and a computer-readable storage medium.
Background
An FAQ (Frequently Asked Questions) system is currently the main means of providing online help on the web: a set of likely frequently asked question-answer pairs is organized in advance and published on a web page to provide users with a consulting service.
Prior-art FAQ implementations mainly include the following:
(1) General-purpose question answering systems, which provide retrieval-based or knowledge-based question answering services.
(2) Customized retrieval, which segments and tokenizes document content to build an index; alternatively, question-answer pairs are obtained through document structuring or manual screening.
(3) Question retrieval based on word matching or synonym matching.
The deficiencies of the prior art mainly include the following:
(1) Retrieval-based or knowledge-based general-purpose question answering systems cannot meet customization needs.
(2) For approaches that implement question answering by indexing document content: first, not all of the content is question-and-answer content, so storing the entire text wastes storage space; second, the accuracy of the questions generated this way is low, because a word hit does not mean that the current content is the answer; in addition, such approaches cannot determine answer boundaries and cannot produce a visualized FAQ document. Here, "visualized" means reading and deeply comprehending the text content so as to extract a number of question-answer pairs, making it convenient for users to look up a question and retrieve its answer. Existing techniques cannot deeply understand a passage or generate good questions for a text.
(3) Question retrieval based on synonym matching or word matching has poor generalization ability and a low recall rate.
Summary of the Invention
Embodiments of the present invention provide a question generation method, apparatus, and computer-readable storage medium, so as to solve at least one or more technical problems in the prior art.
In a first aspect, an embodiment of the present invention provides a question generation method, including:
identifying the text type of a document to be processed according to its text structure;
selecting a generation model corresponding to the text type, where the generation model includes at least one of an explicit question generation model, a structured and semi-structured question generation model, and a natural language question generation model; and
generating a question for the document to be processed by using the selected generation model.
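The three steps above can be sketched as a structure-based dispatcher. The detection heuristics below (a regex for explicit question markers, a check for heading or table markup) are illustrative assumptions for the sketch, not the patent's actual classifiers:

```python
import re

# Illustrative text-type detector and model dispatcher (a sketch, not
# the patent's actual implementation).
def detect_text_type(text: str) -> str:
    # Q/A structure: lines that look like explicit question/answer pairs.
    if re.search(r"^(Q[:：]|问[:：])", text, re.MULTILINE):
        return "explicit_faq"
    # Title/table structure: markdown-style headings or table rows.
    if re.search(r"^(#+\s|\|.+\|)", text, re.MULTILINE):
        return "structured"
    return "natural_language"

MODEL_FOR_TYPE = {
    "explicit_faq": "explicit question generation model",
    "structured": "structured and semi-structured question generation model",
    "natural_language": "natural language question generation model",
}

def select_model(text: str) -> str:
    return MODEL_FOR_TYPE[detect_text_type(text)]
```

Checking the question-and-answer structure before the title structure mirrors the order of the embodiments described below.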
In an embodiment, identifying the text type of the document to be processed according to the text structure includes: identifying whether the text structure of the document to be processed contains a question-and-answer structure;
selecting the generation model corresponding to the text type includes: if the text structure of the document to be processed contains a question-and-answer structure, taking the explicit question generation model as the generation model corresponding to the text type; and
generating a question for the document to be processed by using the selected generation model includes: generating a question for the document to be processed by using the explicit question generation model.
In an embodiment, generating a question for the document to be processed by using the explicit question generation model includes:
judging whether a question part in the question-and-answer structure matches its corresponding answer part, and screening out the portion of text corresponding to a successfully matched question-and-answer structure as a candidate text;
classifying the screened candidate texts by using a first recurrent neural network model, so as to identify explicit questions from the candidate texts; and
taking the explicit questions as the questions generated for the document to be processed.
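The explicit-question path above can be sketched as follows: pair question lines with their answer lines, then pass the candidates to a classifier. The classifier here is a pluggable callable standing in for the first recurrent neural network model; the `Q:`/`A:` line markers are an illustrative assumption:

```python
import re
from typing import Callable, List, Tuple

# Sketch of the explicit-question path (not the patent's implementation):
# pair Q/A lines, then keep only candidates the classifier accepts.
def extract_explicit_questions(
    lines: List[str],
    is_question: Callable[[str], bool],
) -> List[Tuple[str, str]]:
    pairs = []
    for i, line in enumerate(lines[:-1]):
        q = re.match(r"^Q[:：]\s*(.+)", line)
        a = re.match(r"^A[:：]\s*(.+)", lines[i + 1])
        if q and a:  # question part matched with its corresponding answer part
            pairs.append((q.group(1), a.group(1)))
    # stand-in for the RNN classifier: filter the candidate texts
    return [(q, a) for q, a in pairs if is_question(q)]

# trivial stand-in classifier: accepts texts that end with a question mark
naive_classifier = lambda text: text.rstrip().endswith(("?", "？"))
```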
In an embodiment, identifying the text type of the document to be processed according to the text structure includes: identifying whether the text structure of the document to be processed contains a title structure, where the title structure includes a title or a table;
selecting the generation model corresponding to the text type includes: if the text structure of the document to be processed contains a title structure, taking the structured and semi-structured question generation model as the generation model corresponding to the text type; and
generating a question for the document to be processed by using the selected generation model includes: generating a question for the document to be processed by using the structured and semi-structured question generation model.
In an embodiment, generating a question for the document to be processed by using the structured and semi-structured question generation model includes:
in a case where the text structure of the document to be processed contains a title, acquiring an attribute paraphrase related to the title; and
generating a question according to the attribute paraphrase.
In an embodiment, acquiring the attribute paraphrase related to the title includes:
acquiring search click-through logs related to the title;
performing data mining on the search click-through logs to obtain the attribute paraphrase related to the title; and
storing the attribute paraphrase in an attribute paraphrase table.
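A minimal sketch of this mining step, under the assumption that a log entry is a (query, clicked title) pair: for queries that led to clicks on a page about a title, the title is stripped from the query and the remaining attribute phrases are counted. The frequency threshold is an illustrative choice, not taken from the patent:

```python
from collections import Counter
from typing import Dict, List, Tuple

# Sketch of mining attribute paraphrases from search click logs and
# building the attribute paraphrase table (illustrative, not the
# patent's actual mining procedure).
def mine_attribute_paraphrases(
    click_log: List[Tuple[str, str]],  # (query, clicked title)
    min_count: int = 2,                # assumed noise threshold
) -> Dict[str, List[str]]:
    counts: Dict[str, Counter] = {}
    for query, title in click_log:
        if title in query:
            attribute = query.replace(title, "").strip()
            if attribute:
                counts.setdefault(title, Counter())[attribute] += 1
    # the returned dict plays the role of the attribute paraphrase table
    return {
        title: [a for a, c in ctr.most_common() if c >= min_count]
        for title, ctr in counts.items()
    }
```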
In an embodiment, generating the question according to the attribute paraphrase includes:
generating the question from the attribute paraphrase by using a first encoder-decoder model; or
querying the attribute paraphrase table for the attribute paraphrase related to the title, and generating the question according to the queried attribute paraphrase.
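The table-lookup branch can be sketched as below. A fixed question template stands in for the first encoder-decoder model; both the template wording and the table format are illustrative assumptions:

```python
from typing import Dict, List

# Sketch of the table-lookup branch: query the attribute paraphrase
# table for a title, then fill a question template (a stand-in for the
# encoder-decoder model described in the text).
def questions_from_paraphrase_table(
    title: str,
    paraphrase_table: Dict[str, List[str]],
) -> List[str]:
    attributes = paraphrase_table.get(title, [])
    return [f"What is the {attr} of {title}?" for attr in attributes]
```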
In an embodiment, identifying the text type of the document to be processed according to the text structure includes: identifying whether the text structure of the document to be processed contains a question-and-answer structure and a title structure, where the title structure includes a title or a table;
selecting the generation model corresponding to the text type includes: if the text structure of the document to be processed contains neither a question-and-answer structure nor a title structure, taking the natural language question generation model as the generation model corresponding to the text type; and
generating a question for the document to be processed by using the selected generation model includes: generating a question for the document to be processed by using the natural language question generation model.
In an embodiment, generating a question for the document to be processed by using the natural language question generation model includes:
screening out target sentences from the document to be processed by using a second recurrent neural network model, where the target sentences include semantically complete sentences;
selecting candidate answer segments from the target sentences by using a third recurrent neural network model; and
generating questions from the candidate answer segments by using a second encoder-decoder model.
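The three-stage natural-language path above can be sketched as a pipeline of pluggable callables. The crude defaults (word-count completeness check, numeric-span picker, template question) are illustrative stand-ins for the second recurrent neural network, the third recurrent neural network, and the second encoder-decoder model:

```python
import re
from typing import Callable, List

# Schematic natural-language pipeline: sentence filter -> answer-span
# selector -> question generator (stages are stand-ins, not trained models).
def nl_question_pipeline(
    document: str,
    is_complete: Callable[[str], bool],
    pick_answer: Callable[[str], str],
    make_question: Callable[[str, str], str],
) -> List[str]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    questions = []
    for sent in sentences:
        if not is_complete(sent):     # keep semantically complete sentences
            continue
        answer = pick_answer(sent)    # candidate answer segment
        if answer:
            questions.append(make_question(sent, answer))
    return questions

# crude illustrative defaults for the three stages
complete = lambda s: len(s.split()) >= 4 and s.endswith((".", "!", "?"))
pick_number = lambda s: (m.group(0) if (m := re.search(r"\d[\d.]*\b", s)) else "")
make_how_many = lambda sent, ans: sent.replace(ans, "how many", 1).rstrip(".") + "?"
```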
In an embodiment, the method further includes: locating an answer boundary for the generated question.
In an embodiment, locating the answer boundary for the generated question includes:
predicting the start and end positions of answer segments corresponding to the question by using a bi-directional attention flow network; and
ranking the answer segments by using a learning-to-rank model, and locating the answer boundary for the question according to the ranking result, where the features of the learning-to-rank model include the start and end positions of the answer segments.
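A minimal sketch of the boundary step: candidate spans (as a reading-comprehension model such as BiDAF would produce) are re-ranked, and the top-ranked span gives the answer boundary. The linear scorer with hand-set weights is an illustrative stand-in for a trained learning-to-rank model; as in the text, its features include the span's start and end positions:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Span:
    start: int        # predicted start position
    end: int          # predicted end position
    model_score: float  # confidence from the span-prediction model

# Sketch: rank candidate spans and return the best span's boundary.
def locate_answer_boundary(spans: List[Span]) -> Tuple[int, int]:
    def rank_score(s: Span) -> float:
        length = s.end - s.start
        # assumed weights: trust the model score, lightly penalize very
        # long spans and spans starting deep in the document
        return s.model_score - 0.01 * length - 0.001 * s.start
    best = max(spans, key=rank_score)
    return best.start, best.end
```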
In a second aspect, an embodiment of the present invention provides a question generation apparatus, including:
a text type identification unit, configured to identify the text type of a document to be processed according to its text structure;
a generation model selection unit, configured to select a generation model corresponding to the text type, where the generation model includes at least one of an explicit question generation model, a structured and semi-structured question generation model, and a natural language question generation model; and
a question generation unit, configured to generate a question for the document to be processed by using the selected generation model.
In an embodiment, the text type identification unit includes a first identification subunit configured to identify whether the text structure of the document to be processed contains a question-and-answer structure;
the generation model selection unit includes a first selection subunit configured to: if the text structure of the document to be processed contains a question-and-answer structure, take the explicit question generation model as the generation model corresponding to the text type; and
the question generation unit includes a first generation subunit configured to generate a question for the document to be processed by using the explicit question generation model.
In an embodiment, the first generation subunit is further configured to:
judge whether a question part in the question-and-answer structure matches its corresponding answer part, and screen out the portion of text corresponding to a successfully matched question-and-answer structure as a candidate text;
classify the screened candidate texts by using the first recurrent neural network model, so as to identify explicit questions from the candidate texts; and
take the explicit questions as the questions generated for the document to be processed.
In an embodiment, the text type identification unit includes a second identification subunit configured to identify whether the text structure of the document to be processed contains a title structure, where the title structure includes a title or a table;
the generation model selection unit includes a second selection subunit configured to: if the text structure of the document to be processed contains a title structure, take the structured and semi-structured question generation model as the generation model corresponding to the text type; and
the question generation unit includes a second generation subunit configured to generate a question for the document to be processed by using the structured and semi-structured question generation model.
In an embodiment, the second generation subunit includes:
a paraphrase acquisition subunit, configured to acquire an attribute paraphrase related to a title in a case where the text structure of the document to be processed contains the title; and
a paraphrase question generation subunit, configured to generate a question according to the attribute paraphrase.
In an embodiment, the paraphrase acquisition subunit is further configured to:
acquire search click-through logs related to the title;
perform data mining on the search click-through logs to obtain the attribute paraphrase related to the title; and
store the attribute paraphrase in an attribute paraphrase table.
In an embodiment, the paraphrase question generation subunit is further configured to:
generate the question from the attribute paraphrase by using the first encoder-decoder model; or
query the attribute paraphrase table for the attribute paraphrase related to the title, and generate the question according to the queried attribute paraphrase.
In an embodiment, the text type identification unit includes a third identification subunit configured to identify whether the text structure of the document to be processed contains a question-and-answer structure and a title structure, where the title structure includes a title or a table;
the generation model selection unit includes a third selection subunit configured to: if the text structure of the document to be processed contains neither a question-and-answer structure nor a title structure, take the natural language question generation model as the generation model corresponding to the text type; and
the question generation unit includes a third generation subunit configured to generate a question for the document to be processed by using the natural language question generation model.
In an embodiment, the third generation subunit is further configured to:
screen out target sentences from the document to be processed by using the second recurrent neural network model, where the target sentences include semantically complete sentences;
select candidate answer segments from the target sentences by using the third recurrent neural network model; and
generate questions from the candidate answer segments by using the second encoder-decoder model.
In an embodiment, the apparatus further includes an answer boundary locating unit, configured to locate an answer boundary for the generated question.
In an embodiment, the answer boundary locating unit is further configured to:
predict the start and end positions of answer segments corresponding to the question by using a bi-directional attention flow network; and
rank the answer segments by using a learning-to-rank model, and locate the answer boundary for the question according to the ranking result, where the features of the learning-to-rank model include the start and end positions of the answer segments.
In a third aspect, an embodiment of the present invention provides a question generation apparatus. The functions of the apparatus may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions.
In a possible design, the apparatus includes a processor and a memory, where the memory is configured to store a program that supports the apparatus in performing the above method, and the processor is configured to execute the program stored in the memory. The apparatus may further include a communication interface for communicating with other devices or a communication network.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any one of the methods described in the first aspect above.
One of the above technical solutions has the following advantage or beneficial effect: according to the characteristics of different text types, the most suitable generation model is selected, either for the entire document or for each part of its text, which improves the accuracy of the generated questions.
Another of the above technical solutions has the following advantage or beneficial effect: question answering technology can obtain the accurate boundary of the answer corresponding to a question, which further improves the accuracy of the generated FAQ document.
The above summary is provided for the purpose of the description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present invention will be readily apparent from the drawings and the following detailed description.
Brief Description of the Drawings
In the drawings, unless otherwise specified, the same reference numerals denote the same or similar parts or elements throughout the several figures. The drawings are not necessarily drawn to scale. It should be understood that these drawings depict only some embodiments disclosed in accordance with the present invention and should not be regarded as limiting the scope of the invention.
FIG. 1 is a flowchart of a question generation method provided by an embodiment of the present invention.
FIG. 2 is a schematic diagram of a sample target document for FAQ mining in the question generation method provided by an embodiment of the present invention.
FIG. 3 is a schematic diagram of text types in the question generation method provided by an embodiment of the present invention.
FIG. 4 is a flowchart of generating questions with the explicit question generation model in the question generation method provided by an embodiment of the present invention.
FIG. 5 is a flowchart of generating questions with the explicit question generation model in the question generation method provided by an embodiment of the present invention.
FIG. 6 is a flowchart of generating questions with the structured and semi-structured question generation model in the question generation method provided by an embodiment of the present invention.
FIG. 7 is a flowchart of generating questions with the structured and semi-structured question generation model in the question generation method provided by an embodiment of the present invention.
FIG. 8 is a flowchart of generating questions with the structured and semi-structured question generation model in the question generation method provided by an embodiment of the present invention.
FIG. 9 is a flowchart of generating questions with the natural language question generation model in the question generation method provided by an embodiment of the present invention.
FIG. 10 is a flowchart of generating questions with the natural language question generation model in the question generation method provided by an embodiment of the present invention.
FIG. 11 is a flowchart of a question generation method provided by an embodiment of the present invention.
FIG. 12 is a schematic diagram of the online part of the question generation method provided by an embodiment of the present invention.
FIG. 13 is a flowchart of answer boundary locating in the question generation method provided by an embodiment of the present invention.
FIG. 14 is a schematic diagram of the offline part of the question generation method provided by an embodiment of the present invention.
FIG. 15 is a structural block diagram of a question generation apparatus provided by an embodiment of the present invention.
FIG. 16 is a structural block diagram of a question generation apparatus provided by an embodiment of the present invention.
FIG. 17 is a structural block diagram of the second generation subunit of the question generation apparatus provided by an embodiment of the present invention.
FIG. 18 is a structural block diagram of a question generation apparatus provided by an embodiment of the present invention.
FIG. 19 is a structural block diagram of a question generation apparatus provided by an embodiment of the present invention.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will appreciate, the described embodiments may be modified in various different ways without departing from the spirit or scope of the present invention. Accordingly, the drawings and the description are to be regarded as illustrative in nature rather than restrictive.
FIG. 1 is a flowchart of a question generation method provided by an embodiment of the present invention. As shown in FIG. 1, the question generation method of this embodiment includes:
Step S110: identifying the text type of a document to be processed according to its text structure;
Step S120: selecting a generation model corresponding to the text type, where the generation model includes at least one of an explicit question generation model, a structured and semi-structured question generation model, and a natural language question generation model; and
Step S130: generating a question for the document to be processed by using the selected generation model.
FAQs are used on many websites to provide consulting services to users. An FAQ lists some questions users frequently ask and is a form of online help. When using the functions or services of some websites, users often encounter problems that look simple but may be hard to figure out without an explanation. Sometimes users are even lost because of such details. In many cases, these problems can be solved with a simple explanation, and that is the value of an FAQ.
The questions and answers in an FAQ must all be ones that users frequently ask and encounter. For example, in network marketing, FAQs are considered a common means of online customer service; a good FAQ system should be able to answer at least 80% of users' general and frequently asked questions. The use of FAQs is not only convenient for users, but also greatly reduces the pressure on website staff, saves substantial customer service costs, and increases user satisfaction. Therefore, in FAQ design it is crucial to use a scientifically sound method to generate questions with high accuracy.
FIG. 2 is a schematic diagram of a sample target document for FAQ mining in the question generation method provided by an embodiment of the present invention. Question-answer pairs can be automatically generated from the specified document shown in FIG. 2. Examples of the expected question-answer pairs are as follows:
Q: What are the harms of secondhand smoke to the body?
A: 1. It increases the chance of lung cancer...; 2. Harm to memory, as the nicotine in the smoke... 3. It causes childhood asthma, pneumonia, ear inflammation...
Q: How many times more likely are people exposed to secondhand smoke to develop lung cancer than normal people?
A: 2.6 to 6 times.
In the above, Q (question) denotes a question, and A (answer) denotes the answer corresponding to the question.
The embodiments of the present invention realize intelligent question answering through customized retrieval, and can process a specified document so as to achieve the purpose of customized retrieval, making the generated intelligent question answering more targeted. Through reading comprehension, question generation technology, and question answering technology, the document to be processed is analyzed and the questions users may ask are extracted, realizing a customized question answering service.
Specifically, the text of the specified document to be processed can be classified into three text types: an explicit FAQ text type, a structured and semi-structured text type, and a natural language text type. FIG. 3 is a schematic table of the text types in the question generation method provided by an embodiment of the present invention. The "Q" in FIG. 3 denotes the question generated for each text sample.
The first row of the table in FIG. 3 shows the explicit FAQ text type; the corresponding generated question is: "Can I order local and domestic data packages with the 98-yuan unlimited plan?"
The second row of the table in FIG. 3 shows the structured and semi-structured text type; the corresponding generated question is: "What are the ways to apply for the Baidu Goddess Card?" Here, "structured" may include the case where the document contains a table structure, and "semi-structured" may include the case where the document contains titles. The "Baidu Goddess Card" in the generated question can be obtained from the general title of the document or from entries and keywords related to the document.
The third row of the table in FIG. 3 shows the natural language text type; the corresponding generated question is: "How are new 4G Fuka users charged when they join the network?" Here, the "4G Fuka" in the generated question can be obtained from the general title of the document or from entries and keywords related to the document.
For documents of different text types, the embodiments of the present invention select different generation models corresponding to those text types, so that the generated questions are more targeted, more accurate, and better suited to the needs of a customized question answering service. The generated questions and their answers can support a customer service robot in realizing human-computer interaction.
In the above technical solution, an index can also be created for the questions to save storage space. For example, the index may be an inverted index or a key-value index. The inverted index arises from the practical need to look up records by an attribute's value: each entry in such an index table contains an attribute value and the addresses of the records having that attribute value. Because the position of a record is determined from the attribute value, rather than the attribute value from the record, it is called an inverted index. The recall accuracy of query answers based on a question index is higher. Further, a semantic index can be built over the generated question-answer pairs to support a customer service robot in realizing human-computer interaction.
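The inverted index described above can be sketched in a few lines: each term maps to the ids of the generated questions containing it, so a query term leads directly to candidate question-answer pairs. Whitespace tokenization is an illustrative simplification:

```python
from collections import defaultdict
from typing import Dict, Set

# Minimal inverted index over generated questions (a sketch): term ->
# set of question ids whose text contains that term.
def build_inverted_index(questions: Dict[int, str]) -> Dict[str, Set[int]]:
    index: Dict[str, Set[int]] = defaultdict(set)
    for qid, text in questions.items():
        for term in text.lower().rstrip("?").split():
            index[term].add(qid)
    return index

def lookup(index: Dict[str, Set[int]], term: str) -> Set[int]:
    return index.get(term.lower(), set())
```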
In one example, the entire document to be processed may belong to one of the three text types above, in which case the generation model corresponding to that text type is selected to generate questions for the whole document. In another example, the document to be processed may be divided into several parts, each of whose text belongs to one of the three text types. In that case, the generation model corresponding to each part's text type is selected, and questions are generated for each part separately.
In one embodiment, identifying the text type of the document to be processed according to the text structure includes: identifying, according to the text structure, the text type of each part of the text in the document to be processed;
selecting the generation model corresponding to the text type includes: selecting the generation model corresponding to the text type of each part of the text;
using the selected generation model to generate questions for the document to be processed includes: using the selected generation model to generate questions for each part of the text.
First, identify which text type the document to be processed, or each part of it, belongs to according to the text structure:
(1) Identify whether the text structure of the document to be processed contains a question-and-answer structure, i.e., questions and their answers. If so, use the explicit question generation model to generate questions for the parts of the document that have a question-and-answer structure.
(2) Identify whether the text structure of the document to be processed contains headings or tables. If so, use the structured and semi-structured question generation model to generate questions for the parts of the document that contain headings or tables.
(3) If a part of the document has neither a question-and-answer structure nor headings or tables, for example because it is in plain-text format, that part may be called the plain-text part. In this case, use the natural language question generation model to generate questions for that part.
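The three-way routing above can be sketched as a simple dispatcher. The structure-detection flags used here are hypothetical placeholders standing in for the document parsing described in the text:

```python
def detect_text_type(part):
    """Route a document part to one of the three text types by its structure."""
    if "?" in part.get("text", "") and part.get("has_answer"):  # question-and-answer structure
        return "explicit_faq"
    if part.get("has_title") or part.get("has_table"):          # heading or table present
        return "structured"
    return "natural_language"                                   # plain-text part

def select_model(text_type):
    # One generation model per text type, mirroring steps S110-S130.
    return {
        "explicit_faq": "explicit question generation model",
        "structured": "structured and semi-structured question generation model",
        "natural_language": "natural language question generation model",
    }[text_type]

part = {"text": "4G Fuka billing rules for new users.", "has_title": False, "has_table": False}
print(select_model(detect_text_type(part)))  # natural language question generation model
```

Each part of a multi-part document would be passed through the same dispatcher, so different parts of one document can be handled by different models.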
The above technical solution has the following advantage or beneficial effect: for each text type, whether for the whole document or for each part of it, the most suitable generation model is selected, which improves the accuracy of the generated questions.
FIG. 4 is a flowchart of generating questions with the explicit question generation model in the question generation method provided by an embodiment of the present invention. As shown in FIG. 4, in one embodiment, step S110 in FIG. 1, identifying the text type of the document to be processed according to the text structure, may specifically include step S210: identifying whether the text structure of the document to be processed contains a question-and-answer structure.
Step S120 in FIG. 1, selecting the generation model corresponding to the text type, may specifically include step S220: if the text structure of the document to be processed contains a question-and-answer structure, taking the explicit question generation model as the generation model corresponding to the text type.
Step S130 in FIG. 1, using the selected generation model to generate questions for the document to be processed, may specifically include step S230: using the explicit question generation model to generate questions for the document to be processed.
Referring to the example shown in the first row of the table in FIG. 3, if the text structure of the document to be processed contains a question-and-answer structure, the text type of that part belongs to the explicit FAQ text type. For the explicit FAQ text type, the corresponding explicit question generation model is selected to generate questions for that part of the text.
FIG. 5 is a flowchart of generating questions with the explicit question generation model in the question generation method provided by an embodiment of the present invention. As shown in FIG. 5, in one embodiment, step S230 in FIG. 4, using the explicit question generation model to generate questions for the document to be processed, may specifically include:
Step S310: judging whether the question part in the question-and-answer structure matches the corresponding answer part, and selecting the text corresponding to successfully matched question-and-answer structures as candidate text;
Step S320: using a first recurrent neural network model to classify the selected candidate text, so as to identify explicit questions from the candidate text;
Step S330: taking the explicit questions as the questions generated for the document to be processed.
In this embodiment, questions can be generated through document structure parsing and question identification. Document structure parsing may include using the text structure to pre-screen possible explicit questions. Question identification may include: using an RNN (Recurrent Neural Network) as a question classification model, taking words as features, and training the RNN model on manually annotated data to serve as the explicit question generation model.
Specifically, the processing of the explicit question generation model can be divided into the following steps:
(1) Document structure parsing: find the question parts and the corresponding answer parts in the text, and judge whether each question part matches its answer part. If they match, select the question part and answer part as a "possible explicit question".
For example, one way to "find the question parts and the corresponding answer parts in the text" is to locate a question mark, take the sentence before it as the question part, and take the passage after it as the answer part. Semantic understanding can be used to judge whether the question and answer match.
(2) Question identification: use the RNN model, i.e., the first recurrent neural network model above, to classify the pre-screened "possible explicit questions". The output of the first recurrent neural network model falls into two classes: items finally confirmed to be explicit questions, and items finally confirmed not to be.
Features used by the RNN model may include part-of-speech tags and word dependency relations. The text can be split into sentences and words, and different features can be extracted as needed for model training. For example, punctuation can be used for sentence segmentation, and NLP (Natural Language Processing) tools for word segmentation. One can try segmenting a sentence into patterns of, for example, 5 words, 3 words, or 1 word, and choose the pattern that works best.
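The two-step process above can be sketched as follows: question-mark pre-screening followed by classification. A trivial interrogative-word rule stands in for the trained RNN classifier, which is not reproduced here; the example text is hypothetical:

```python
import re

def prescreen_qa_pairs(text):
    """Step (1): the sentence ending in '?' is a possible question;
    the passage after it, up to the next question, is its answer."""
    pattern = re.compile(r"([^.?!]*\?)\s*([^?]*)")
    return [(q.strip(), a.strip()) for q, a in pattern.findall(text) if a.strip()]

def is_explicit_question(question):
    """Step (2): crude stand-in for the first RNN classification model."""
    return any(w in question.lower() for w in ("how", "what", "why", "when", "where", "who"))

doc = "How is billing handled? New users are billed monthly. Contact support for details."
for q, a in prescreen_qa_pairs(doc):
    if is_explicit_question(q):
        print((q, a))
```

In the patent's method, the final accept/reject decision comes from the trained RNN over word, part-of-speech, and dependency features rather than a keyword rule.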
FIG. 6 is a flowchart of generating questions with the structured and semi-structured question generation model in the question generation method provided by an embodiment of the present invention. As shown in FIG. 6, in one embodiment, step S110 in FIG. 1, identifying the text type of the document to be processed according to the text structure, may specifically include step S410: identifying whether the text structure of the document to be processed contains a heading structure, where the heading structure includes headings or tables.
Step S120 in FIG. 1, selecting the generation model corresponding to the text type, may specifically include step S420: if the text structure of the document to be processed contains a heading structure, taking the structured and semi-structured question generation model as the generation model corresponding to the text type.
Step S130 in FIG. 1, using the selected generation model to generate questions for the document to be processed, may specifically include step S430: using the structured and semi-structured question generation model to generate questions for the document to be processed.
Referring to the example shown in the second row of the table in FIG. 3, if the text structure of the document to be processed contains a heading structure, where the heading structure includes headings or tables, the text type of that part belongs to the structured and semi-structured text type. For this text type, the corresponding structured and semi-structured question generation model is selected to generate questions for that part of the text.
FIG. 7 is a flowchart of generating questions with the structured and semi-structured question generation model in the question generation method provided by an embodiment of the present invention. As shown in FIG. 7, in one embodiment, step S430 in FIG. 6, using the structured and semi-structured question generation model to generate questions for the document to be processed, may specifically include:
Step S510: when the text structure of the document to be processed contains a heading, obtaining attribute paraphrases related to the heading;
Step S520: generating questions according to the attribute paraphrases.
In this embodiment, search click-through logs from an FAQ retrieval system can be obtained, and attribute paraphrases can be mined from those logs. For example, "*billing method" and "*how is it billed" can be paraphrases of each other. Generation based on attribute paraphrases serves as the semi-structured and structured question generation model.
FIG. 8 is a flowchart of generating questions with the structured and semi-structured question generation model in the question generation method provided by an embodiment of the present invention. As shown in FIG. 8, in one embodiment, step S510 in FIG. 7, obtaining attribute paraphrases related to the heading, may specifically include:
Step S610: obtaining search click-through logs related to the heading;
Step S620: performing data mining on the search click-through logs to obtain attribute paraphrases related to the heading;
Step S630: storing the attribute paraphrases in an attribute paraphrase table.
In one embodiment, step S520 in FIG. 7, generating questions according to the attribute paraphrases, may specifically include:
generating questions with a first encoder-decoder model according to the attribute paraphrases; or,
querying the attribute paraphrase table for paraphrases related to the heading, and generating questions according to the paraphrases found.
Specifically, the structured and semi-structured generation model may include the following processing steps:
(1) Mine attribute paraphrases from the search click-through logs using data mining methods, and store the mined paraphrases in the attribute paraphrase table.
For example, if two users A and B search with different keywords but click the same URL (Uniform Resource Locator), the meanings expressed by the two keywords may be the same, and they can be treated as paraphrases of each other.
In one example, a user searches for "efficacy of bitter gourd". Here "bitter gourd" is an entity and "efficacy" is an attribute of that entity. Paraphrases of the attribute "efficacy" include "effect", "medicinal effect", and so on. That is, "efficacy of bitter gourd", "effect of bitter gourd", and "medicinal effect of bitter gourd" express the same meaning.
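The click-log mining idea in step (1) can be sketched by grouping queries by the URL they clicked; queries that led to the same URL become candidate paraphrases of each other. The log entries and URLs below are hypothetical:

```python
from collections import defaultdict

def mine_paraphrases(click_log):
    """Queries whose clicks landed on the same URL are candidate paraphrases."""
    by_url = defaultdict(set)
    for query, url in click_log:
        by_url[url].add(query)
    # Keep only URLs reached from at least two distinct queries.
    return {url: qs for url, qs in by_url.items() if len(qs) > 1}

log = [
    ("efficacy of bitter gourd", "https://example.com/bitter-gourd"),
    ("effect of bitter gourd", "https://example.com/bitter-gourd"),
    ("4g fuka billing", "https://example.com/fuka"),
]
table = mine_paraphrases(log)
print(sorted(table["https://example.com/bitter-gourd"]))
```

The resulting mapping plays the role of the attribute paraphrase table of step S630; a production system would add frequency thresholds and noise filtering that this sketch omits.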
(2) Generation based on attribute paraphrases, serving as the semi-structured and structured question generation model. Two implementations are possible:
Mode 1: use a Seq2Seq (Sequence to Sequence) model, i.e., the first encoder-decoder model above, to generate questions. The features used by the model include lexical and syntactic features, the answer start and end positions predicted by a sequence labeling model, and word features. The input to the model is a paragraph of the document to be processed, and the output is a question generated for that input. In one example, the text of the document to be processed is: "Beijing is the capital of China." The sequence labeling model can label "Beijing". Then "Beijing" and "Beijing is the capital of China" are fed as input to the seq2seq model to generate the question "Where is the capital of China?".
The Seq2Seq model, also called the Encoder-Decoder model, is an important variant of the RNN model. The Encoder-Decoder structure does not constrain the input and output sequence lengths, so it has a very wide range of applications, such as machine translation, text summarization, reading comprehension, and speech recognition.
Because the input and output sequences of a seq2seq model may have different lengths (hence "Sequence to Sequence"), it realizes a transformation from one sequence to another; for example, it can implement a chatbot dialogue model. The classic RNN model fixes the sizes of the input and output sequences, while the seq2seq model removes this limitation.
The encoder and decoder are two RNNs corresponding to the input sequence and the output sequence, respectively. The basic idea of the common encoder-decoder structure is to use two RNNs: one as the encoder and the other as the decoder. The encoder compresses the input sequence into a vector of a specified length, which can be regarded as the semantics of the sequence; this process is called encoding. The decoder generates the target sequence from the semantic vector; this process is called decoding.
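The encode-then-decode structure can be illustrated with a toy forward pass. The weights are random and untrained, so the output tokens are meaningless; the point is only to show the encoder folding a variable-length input into one fixed-length context vector that then conditions a decoder of a different output length:

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 8, 12                                   # hidden size, toy vocabulary size
Wxh = rng.normal(size=(H, V))                  # input-to-hidden weights
Whh = rng.normal(size=(H, H))                  # hidden-to-hidden weights
Why = rng.normal(size=(V, H))                  # hidden-to-output weights

def one_hot(i):
    v = np.zeros(V); v[i] = 1.0; return v

def encode(token_ids):
    """Encoder RNN: compress the whole input sequence into one context vector."""
    h = np.zeros(H)
    for t in token_ids:
        h = np.tanh(Wxh @ one_hot(t) + Whh @ h)
    return h

def decode(context, steps):
    """Decoder RNN: unroll from the context vector, emitting one token per step."""
    h, out = context, []
    for _ in range(steps):
        h = np.tanh(Whh @ h)
        out.append(int(np.argmax(Why @ h)))
    return out

context = encode([3, 1, 4, 1, 5])      # input length 5 ...
question = decode(context, steps=3)    # ... output length 3: lengths need not match
print(len(context), len(question))     # 8 3
```

A real seq2seq question generator would be trained end to end and would typically feed each decoded token back in as the next decoder input; this sketch omits both for brevity.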
Mode 2: query the attribute paraphrase table and select a relevant paraphrase to generate the question.
For example, among the retrieved paraphrases "efficacy of bitter gourd", "effect of bitter gourd", and "medicinal effect of bitter gourd", the most suitable one can be selected to generate the question. Methods such as semantic understanding and keyword matching can be used to compare and analyze each retrieved paraphrase against the expressions in the document to be processed, so as to determine the most suitable paraphrase for generating the question.
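As a minimal keyword-matching stand-in for "selecting the most suitable paraphrase", each candidate can be scored by word overlap with the document's own wording (the semantic-understanding comparison mentioned above is not reproduced):

```python
def best_paraphrase(candidates, document_text):
    """Pick the paraphrase sharing the most words with the document text."""
    doc_words = set(document_text.lower().split())
    return max(candidates, key=lambda c: len(set(c.lower().split()) & doc_words))

candidates = [
    "efficacy of bitter gourd",
    "effect of bitter gourd",
    "medicinal effect of bitter gourd",
]
doc = "The medicinal effect of bitter gourd has been studied widely."
print(best_paraphrase(candidates, doc))  # medicinal effect of bitter gourd
```

Overlap counting is the crudest form of keyword matching; a deployed system would also weigh word importance and semantic similarity.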
In another case, when the text structure of the document to be processed contains a table, methods such as semantic understanding and keyword matching can be used to identify the contents of the table header, the row records, and the column fields, and then to determine the generated questions and their corresponding answers. For example, one column of the table may supply the content of the generated questions while another column supplies the corresponding answers.
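For the table case, once the question-bearing and answer-bearing columns have been identified, question-answer pairs can be read off row by row. The column names, question template, and table contents below are all hypothetical:

```python
def qa_pairs_from_table(header, rows, q_col="item", a_col="billing"):
    """Treat one identified column as question subjects and another as their answers."""
    qi, ai = header.index(q_col), header.index(a_col)
    return [(f"How is {row[qi]} billed?", row[ai]) for row in rows]

header = ["item", "billing"]
rows = [["domestic calls", "0.1 yuan/min"], ["data", "5 yuan/GB"]]
for q, a in qa_pairs_from_table(header, rows):
    print(q, "->", a)
```

Identifying which column is which is the hard part in practice; this sketch assumes that identification has already been done.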
FIG. 9 is a flowchart of generating questions with the natural language question generation model in the question generation method provided by an embodiment of the present invention. As shown in FIG. 9, in one embodiment, step S110 in FIG. 1, identifying the text type of the document to be processed according to the text structure, may specifically include step S710: identifying whether the text structure of the document to be processed contains a question-and-answer structure and a heading structure, where the heading structure includes headings or tables;
Step S120 in FIG. 1, selecting the generation model corresponding to the text type, may specifically include step S720: if the text structure of the document to be processed contains neither a question-and-answer structure nor a heading structure, taking the natural language question generation model as the generation model corresponding to the text type;
Step S130 in FIG. 1, using the selected generation model to generate questions for the document to be processed, may specifically include step S730: using the natural language question generation model to generate questions for the document to be processed.
Referring to the example shown in the third row of the table in FIG. 3, if the text structure of the document to be processed contains neither a question-and-answer structure nor a heading structure, the text type of that part belongs to the natural language text type. For the natural language text type, the corresponding natural language question generation model is selected to generate questions for that part of the text.
FIG. 10 is a flowchart of generating questions with the natural language question generation model in the question generation method provided by an embodiment of the present invention. As shown in FIG. 10, in one embodiment, step S730 in FIG. 9, using the natural language question generation model to generate questions for the document to be processed, may specifically include:
Step S810: using a second recurrent neural network model to select target sentences from the document to be processed, where the target sentences are semantically complete sentences;
Step S820: using a third recurrent neural network model to select candidate answer segments from the target sentences;
Step S830: generating questions from the candidate answer segments with a second encoder-decoder model.
In this embodiment, an RNN model can first be used to classify and filter target sentences, RNN sequence labeling can then be applied to the filtered target sentences to select candidate answer segments, and a seq2seq model can finally be used to generate questions.
Specifically, the natural language generation model may include the following processing steps:
(1) First, use the second recurrent neural network model to classify and filter the target sentences, keeping the semantically complete sentences.
(2) For the filtered target sentences, use the third recurrent neural network model to select candidate answer segments, that is, to select the segments that may serve as answers to questions. The third recurrent neural network model can be trained with sequence labeling. Sequence labeling may include sentence-level annotation, i.e., marking the spans about which questions can be asked.
(3) Then use the seq2seq model, i.e., the second encoder-decoder model above, as the question generation model to generate questions.
In one example, the text of the document to be processed is: "Beijing is the capital of China." Sequence labeling can mark "Beijing". Then "Beijing" and "Beijing is the capital of China" are fed as input to the seq2seq model to generate the question "Where is the capital of China?".
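The three stages above can be sketched end to end with simple stand-ins: a length/punctuation rule for the second RNN's sentence filtering, a capitalized-token rule for the third RNN's sequence labeling, and a template for the seq2seq generator. None of the trained models are reproduced here:

```python
def filter_complete_sentences(sentences):
    """Stage (1) stand-in for the second RNN: keep sentences that look complete."""
    return [s for s in sentences if len(s.split()) >= 3 and s.endswith(".")]

def tag_answer_span(sentence):
    """Stage (2) stand-in for RNN sequence labeling: take the first capitalized token."""
    return next(w.rstrip(".") for w in sentence.split() if w[0].isupper())

def generate_question(answer, sentence):
    """Stage (3) stand-in for the second encoder-decoder model: a location template."""
    return "Where " + sentence.replace(answer, "", 1).strip().rstrip(".") + "?"

doc = ["Beijing is the capital of China.", "Thanks."]
for s in filter_complete_sentences(doc):
    a = tag_answer_span(s)
    print(generate_question(a, s), "->", a)  # Where is the capital of China? -> Beijing
```

The template only fits location-style answers, which is exactly the gap a trained seq2seq model closes by learning how to phrase a question for any tagged span.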
FIG. 11 is a flowchart of the question generation method provided by an embodiment of the present invention. As shown in FIG. 11, in one embodiment, the method further includes step S140: performing answer boundary localization for the generated questions.
The above technical solution has the following advantage or beneficial effect: through question-answering techniques such as answer boundary localization, the precise boundary of the answer corresponding to a question can be obtained, further improving the accuracy of the generated FAQ document.
In one example, the question generation method of the embodiment of the present invention includes two parts: an online part and an offline part. FIG. 12 is a schematic diagram of the online part of the question generation method provided by an embodiment of the present invention. The "target document" in FIG. 12 is the document to be processed. "General document parsing" in FIG. 12 includes identifying the document structure. Specifically, it may include identifying the document's topic (title), its subtitles, and the paragraphs and text under each subtitle; the identified document structure can be described with a tree structure. "General document parsing" also includes identifying which of the three text types above (explicit FAQ, structured and semi-structured, natural language) each part of the document to be processed belongs to. After the text type of each part is identified, the next step selects the corresponding question generation model to generate questions and locate the answer boundaries.
FIG. 13 is a flowchart of answer boundary localization in the question generation method provided by an embodiment of the present invention. As shown in FIG. 13, in one embodiment, step S140 in FIG. 11, performing answer boundary localization for the generated questions, may specifically include:
Step S910: using a bi-directional attention flow network to predict the start and end positions of the answer segments corresponding to the question;
Step S920: ranking the answer segments with a learning-to-rank model, and locating the answer boundary for the question according to the ranking result, where the features of the learning-to-rank model include the start and end positions of the answer segments.
Specifically, answer boundary localization may include the following processing steps:
(1) Use Bi-DAF (Bi-Directional Attention Flow network) as the reading comprehension model, which can accurately predict the start and end positions of the answer.
The bi-directional attention flow network is a hierarchical, multi-stage structure that can model context at different levels of granularity. It models context at the character level and the word level, and uses bi-directional attention flow to obtain a question-aware representation of the context.
An exemplary bi-directional attention flow network may include the following layers:
1. Character embedding layer
This layer maps each word to a fixed-size vector; it can be implemented with a character-level convolutional neural network (character-level CNN).
2. Word embedding layer
A pre-trained word embedding model can be used to map each word to a fixed-size vector.
3. Contextual embedding layer
This layer adds a contextual cue to each word; the first three layers are applied to both the question and the context.
4. Attention flow layer
This layer combines the question and context vectors to produce a set of question-aware feature vectors.
5. Modeling layer
A recurrent neural network can be used to scan the context.
6. Output layer
This layer provides the answer to the question.
(2) Use LTR (Learning to Rank) to rank the answer segments.
The start and end positions predicted in step (1) serve as one feature of the LTR in step (2). Step (2) uses LTR to find the answer corresponding to the question from the long passage of text that follows it, where the LTR model ranks the answer segments according to question-answer features.
Learning to rank is a supervised ranking method. With LTR, a relevance function can be constructed and results ranked by relevance. Traditional ranking methods find it hard to fuse multiple sources of information and are prone to overfitting. Learning to rank easily fuses many features, rests on a mature theoretical foundation, optimizes its parameters iteratively, and has well-established theory for handling problems such as sparsity and overfitting.
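A minimal linear scorer illustrates how the predicted span positions from step (1) become one feature among several in the step (2) ranker. The feature names and weights here are illustrative, not learned:

```python
def score(segment, weights):
    """Relevance score: weighted sum of question-answer features, including
    the start/end position confidences predicted by the reading-comprehension model."""
    return sum(weights[name] * value for name, value in segment.items())

weights = {"span_start_prob": 2.0, "span_end_prob": 2.0, "qp_match": 1.5, "content_quality": 0.5}
segments = [
    {"span_start_prob": 0.9, "span_end_prob": 0.8, "qp_match": 0.7, "content_quality": 0.6},
    {"span_start_prob": 0.2, "span_end_prob": 0.3, "qp_match": 0.9, "content_quality": 0.9},
]
ranked = sorted(segments, key=lambda s: score(s, weights), reverse=True)
print(score(ranked[0], weights) > score(ranked[1], weights))  # True
```

In a real LTR setup the weights (or a more expressive model such as a gradient-boosted ranker) would be fitted from labeled question-answer pairs rather than set by hand.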
在这种实施方式中,首先对目标文档分段,可识别自然段落或使用列表提取的方法进行分段。然后对段落提取特征并排序,如使用“领域特征”和/或“匹配特征”等相关工具提取特征。其中特征可包括以下几种:In this embodiment, the target document is first segmented, and natural paragraphs can be identified or segmented using a list extraction method. The paragraphs are then extracted and ranked, for example, using relevant tools such as "Domain Features" and/or "Match Features". Features may include the following:
问题答案匹配特征:对齐匹配技术、DNN(Deep Neural Networks,深度神经网络)QP匹配技术、结合知识图谱的QP匹配技术,其中Q表示问题(Query),P表示分段的段落(Paragrap),对齐匹配包括Q和P对齐;Question-answer matching features: alignment matching technology, DNN (Deep Neural Networks, deep neural network) QP matching technology, QP matching technology combined with knowledge graph, where Q represents question (Query), P represents segmented paragraph (Paragrap), alignment Matching includes Q and P alignment;
领域特征:实体问答特征、how why问答特征、是非问答特征、描述类问答特征;Domain features: entity question answering feature, how why question answering feature, yes/no question answering feature, descriptive question answering feature;
结构特征:列表结构特征;Structural features: list structural features;
文本特征:内容质量特征;Text features: content quality features;
交叉校验特征:文本聚合特征。Cross-validation features: Text aggregation features.
图14为本发明实施例提供的问题生成方法的离线部分的示意图。如图14所示,离线部分的主要生成模型和数据,可包括两个部分和五个模型。其中,两个部分包括文档标题数据和问答标注数据。五个模型包括显式问题生成模型、结构化和半结构化问题生成模型、自然语言问题生成模型、Bi-DAF模型(阅读理解模型)和LTR模型(答案片段排序模型)。FIG. 14 is a schematic diagram of an offline part of a problem generation method provided by an embodiment of the present invention. As shown in Figure 14, the main generation models and data of the offline part can include two parts and five models. Among them, two parts include document title data and question and answer annotation data. The five models include explicit question generation model, structured and semi-structured question generation model, natural language question generation model, Bi-DAF model (reading comprehension model) and LTR model (answer fragment ranking model).
FIG. 15 is a structural block diagram of a question generation apparatus provided by an embodiment of the present invention. As shown in FIG. 15, the question generation apparatus of the embodiment of the present invention includes:
a text type identification unit 100, configured to identify the text type of a document to be processed according to its text structure;
a generation model selection unit 200, configured to select a generation model corresponding to the text type, the generation model including at least one of an explicit question generation model, a structured and semi-structured question generation model, and a natural language question generation model; and
a question generation unit 300, configured to generate questions for the document to be processed using the selected generation model.
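The three-unit flow can be sketched as a simple dispatch: identify the text type (unit 100), map it to a generation model (unit 200), and invoke that model (unit 300). The structure checks and model names below are simplified assumptions for illustration only:

```python
# Sketch of the unit 100 -> 200 -> 300 flow. The structure markers
# ("Q:"/"A:", "#") and model names are hypothetical placeholders.

def identify_text_type(doc):
    """Unit 100: classify the document by its text structure."""
    if "Q:" in doc and "A:" in doc:
        return "qa_structure"
    if doc.lstrip().startswith("#"):   # assumed heading marker
        return "heading_structure"
    return "plain_text"

MODEL_BY_TYPE = {                      # Unit 200: type -> model
    "qa_structure": "explicit_question_model",
    "heading_structure": "structured_semi_structured_model",
    "plain_text": "natural_language_model",
}

def generate_questions(doc):
    """Unit 300: dispatch to the model chosen for the text type."""
    model = MODEL_BY_TYPE[identify_text_type(doc)]
    return model  # a real implementation would run the model here

chosen = generate_questions("Q: how to log in?\nA: use your email.")
```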
FIG. 16 is a structural block diagram of a question generation apparatus provided by an embodiment of the present invention. As shown in FIG. 16, in one embodiment, the text type identification unit 100 includes a first identification subunit 110, configured to identify whether the text structure of the document to be processed contains a question-answer structure;
the generation model selection unit 200 includes a first selection subunit 210, configured to select the explicit question generation model as the generation model corresponding to the text type if the text structure of the document to be processed contains a question-answer structure; and
the question generation unit 300 includes a first generation subunit 310, configured to generate questions for the document to be processed using the explicit question generation model.
In one embodiment, the first generation subunit 310 is further configured to:
determine whether the question part of the question-answer structure matches the corresponding answer part, and filter out the text corresponding to successfully matched question-answer structures as candidate text;
classify the filtered candidate text using a first recurrent neural network model to identify explicit questions in the candidate text; and
use the explicit questions as the questions generated for the document to be processed.
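The three steps above can be sketched as a small pipeline. The match check and the classifier below are toy rule-based stand-ins for the matching step and the first recurrent neural network model, which the patent does not specify in detail:

```python
# Sketch of the explicit-question pipeline in subunit 310:
# (1) keep Q/A pairs whose answer matches the question,
# (2) classify the question text, (3) emit explicit questions.
# Both rules are assumed stand-ins for the RNN-based components.

def qa_matches(question, answer):
    """Toy match check: the answer shares at least one question word."""
    q = set(question.lower().replace("?", "").split())
    return any(w in q for w in answer.lower().split())

def is_explicit_question(text):
    """Toy classifier standing in for the first RNN model."""
    return text.rstrip().endswith("?") and len(text.split()) >= 3

def extract_explicit_questions(qa_pairs):
    candidates = [q for q, a in qa_pairs if qa_matches(q, a)]
    return [q for q in candidates if is_explicit_question(q)]

pairs = [("How do I reset my password?", "Reset the password in settings."),
         ("Hello?", "This answer is unrelated text.")]
questions = extract_explicit_questions(pairs)
```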
In one embodiment, the text type identification unit 100 includes a second identification subunit 120, configured to identify whether the text structure of the document to be processed contains a heading structure, where the heading structure includes a heading or a table;
the generation model selection unit 200 includes a second selection subunit 220, configured to select the structured and semi-structured question generation model as the generation model corresponding to the text type if the text structure of the document to be processed contains a heading structure; and
the question generation unit 300 includes a second generation subunit 320, configured to generate questions for the document to be processed using the structured and semi-structured question generation model.
FIG. 17 is a structural block diagram of the second generation subunit of the question generation apparatus provided by an embodiment of the present invention. As shown in FIG. 17, in one embodiment, the second generation subunit 320 includes:
a paraphrase acquisition subunit 321, configured to acquire attribute paraphrases related to a heading when the text structure of the document to be processed contains the heading; and
a paraphrase question generation subunit 322, configured to generate questions according to the attribute paraphrases.
In one embodiment, the paraphrase acquisition subunit 321 is further configured to:
acquire search click-through logs related to the heading;
mine the search click-through logs to obtain attribute paraphrases related to the heading; and
store the attribute paraphrases in an attribute paraphrase table.
In one embodiment, the paraphrase question generation subunit 322 is further configured to:
generate questions from the attribute paraphrases using a first encoder-decoder model; or
query the attribute paraphrase table for attribute paraphrases related to the heading, and generate questions from the retrieved attribute paraphrases.
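The mining, table construction, and lookup steps can be sketched as follows. The log entries, frequency threshold, and question template are all illustrative assumptions; the patent's encoder-decoder generation is replaced here by a simple template:

```python
# Sketch of attribute-paraphrase mining and lookup (subunits 321/322):
# count attribute phrases that co-occur with a title in search click
# logs, store the frequent ones in a table, and turn a table lookup
# into questions. Data and the template are hypothetical.

from collections import Counter

def mine_paraphrases(click_logs, title, min_count=2):
    """Keep attribute phrases clicked with the title >= min_count times."""
    counts = Counter(attr for t, attr in click_logs if t == title)
    return [a for a, c in counts.items() if c >= min_count]

def build_table(click_logs, titles):
    return {t: mine_paraphrases(click_logs, t) for t in titles}

def questions_from_table(table, title):
    """Generate one question per stored paraphrase (template assumed)."""
    return [f"What is the {attr} of {title}?" for attr in table.get(title, [])]

logs = [("ProductX", "price"), ("ProductX", "price"),
        ("ProductX", "release date"), ("ProductX", "price")]
table = build_table(logs, ["ProductX"])
qs = questions_from_table(table, "ProductX")
```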
Referring to FIG. 16, in one embodiment, the text type identification unit 100 includes a third identification subunit 130, configured to identify whether the text structure of the document to be processed contains a question-answer structure or a heading structure, where the heading structure includes a heading or a table;
the generation model selection unit 200 includes a third selection subunit 230, configured to select the natural language question generation model as the generation model corresponding to the text type if the text structure of the document to be processed contains neither a question-answer structure nor a heading structure; and
the question generation unit 300 includes a third generation subunit 330, configured to generate questions for the document to be processed using the natural language question generation model.
In one embodiment, the third generation subunit 330 is further configured to:
filter target sentences from the document to be processed using a second recurrent neural network model, where the target sentences are semantically complete sentences;
select candidate answer segments from the target sentences using a third recurrent neural network model; and
generate questions from the candidate answer segments using a second encoder-decoder model.
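The three-stage pipeline above can be sketched with rule-based stand-ins. The filter, the span selector, and the template below replace, respectively, the second RNN, the third RNN, and the second encoder-decoder model, whose internals the patent does not specify:

```python
# Sketch of the natural-language pipeline in subunit 330:
# (1) filter semantically complete sentences, (2) pick a candidate
# answer segment, (3) generate a question from it. Each rule is an
# assumed stand-in for the corresponding neural model.

def filter_target_sentences(doc):
    """Stand-in for the second RNN: keep sentences that look complete."""
    sents = [s.strip() for s in doc.split(".") if s.strip()]
    return [s for s in sents if len(s.split()) >= 4]

def select_answer_segment(sentence):
    """Stand-in for the third RNN: take the clause after 'is', if any."""
    words = sentence.split()
    return " ".join(words[words.index("is") + 1:]) if "is" in words else sentence

def generate_question(segment):
    """Stand-in for the encoder-decoder: wrap the segment in a template."""
    return f"What is {segment}?"

doc = "Paris is the capital of France. Short note."
sentences = filter_target_sentences(doc)
question = generate_question(select_answer_segment(sentences[0]))
```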
FIG. 18 is a structural block diagram of a question generation apparatus provided by an embodiment of the present invention. As shown in FIG. 18, in one embodiment, the apparatus further includes an answer boundary localization unit 400, configured to localize answer boundaries for the generated questions.
In one embodiment, the answer boundary localization unit 400 is further configured to:
predict the start and end positions of the answer segments corresponding to the questions using a bidirectional attention flow network; and
rank the answer segments using a learning-to-rank model and localize the answer boundaries of the questions according to the ranking result, where the features of the learning-to-rank model include the start and end positions of the answer segments.
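This boundary-localization step can be sketched as ranking candidate spans whose features include the predicted start/end positions. The fixed candidate tuples below stand in for Bi-DAF outputs, and the linear weights are illustrative assumptions:

```python
# Sketch of answer-boundary localization (unit 400): each candidate
# span carries predicted start/end positions plus a span probability
# (here fixed numbers standing in for Bi-DAF outputs), and a simple
# linear ranker whose features include those positions picks the
# final boundary. Weights are hypothetical.

def rank_spans(spans, weights=(1.0, -0.01)):
    """Score = w_prob * span_prob + w_len * span_length; best first."""
    w_prob, w_len = weights
    def score(span):
        start, end, prob = span
        return w_prob * prob + w_len * (end - start)
    return sorted(spans, key=score, reverse=True)

# (start, end, probability) candidates for one question.
candidates = [(10, 40, 0.60), (12, 18, 0.58), (5, 90, 0.62)]
best = rank_spans(candidates)[0]
answer_boundary = (best[0], best[1])
```

With the length penalty, the short high-probability span wins over the long one despite its slightly lower raw probability.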
For the functions of the units of the question generation apparatus of the embodiment of the present invention, reference may be made to the corresponding description of the above method, which is not repeated here.
In one possible design, the question generation apparatus includes a processor and a memory, the memory storing a program that supports the apparatus in performing the above question generation method, and the processor being configured to execute the program stored in the memory. The question generation apparatus may further include a communication interface for communicating with other devices or a communication network.
FIG. 19 is a structural block diagram of a question generation apparatus provided by an embodiment of the present invention. As shown in FIG. 19, the apparatus includes a memory 101 and a processor 102, the memory 101 storing a computer program executable on the processor 102. The processor 102 implements the question generation method of the above embodiments when executing the computer program. There may be one or more memories 101 and processors 102.
The apparatus further includes:
a communication interface 103 for communicating with external devices and exchanging data.
The memory 101 may include high-speed RAM, and may also include non-volatile memory, such as at least one magnetic disk memory.
If the memory 101, the processor 102, and the communication interface 103 are implemented independently, they may be connected to and communicate with one another through a bus. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in FIG. 19, but this does not mean that there is only one bus or one type of bus.
Optionally, in a specific implementation, if the memory 101, the processor 102, and the communication interface 103 are integrated on a single chip, they may communicate with one another through internal interfaces.
In another aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements any one of the above question generation methods.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", "some examples", and the like means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. Moreover, the described specific features, structures, materials, or characteristics may be combined in a suitable manner in any one or more embodiments or examples. Furthermore, provided they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or as implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means two or more, unless otherwise expressly and specifically defined.
Any process or method description in the flowcharts, or otherwise described herein, may be understood to represent a module, segment, or portion of code including one or more executable instructions for implementing a specific logical function or step of the process. The scope of the preferred embodiments of the present invention includes additional implementations in which the functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in the flowcharts, or otherwise described herein, may for example be considered an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, apparatus, or device). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection with one or more wirings (an electronic device), a portable computer diskette (a magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber-optic device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, as the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that the various parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and so on.
Those of ordinary skill in the art will understand that all or part of the steps of the methods of the above embodiments may be completed by a program instructing the relevant hardware, the program being stored in a computer-readable storage medium and, when executed, including one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, or each unit may exist physically alone, or two or more units may be integrated in one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above are only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various changes or substitutions within the technical scope disclosed by the present invention, and these should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.
Claims (24)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811641895.1A CN109726274B (en) | 2018-12-29 | 2018-12-29 | Question generation method, device and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN109726274A CN109726274A (en) | 2019-05-07 |
| CN109726274B true CN109726274B (en) | 2021-04-30 |
Family
ID=66299312
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201811641895.1A Active CN109726274B (en) | 2018-12-29 | 2018-12-29 | Question generation method, device and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN109726274B (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2024233178A1 (en) * | 2023-05-05 | 2024-11-14 | Pryon Incorporated | Document processing for frequently-asked-questions detection in natural language content |
Families Citing this family (17)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111553159B (en) * | 2020-04-24 | 2021-08-06 | 中国科学院空天信息创新研究院 | A method and system for generating a question |
| CN113672708B (en) * | 2020-05-13 | 2024-10-08 | 武汉Tcl集团工业研究院有限公司 | Language model training method, question-answer pair generation method, device and equipment |
| CN111858883B (en) * | 2020-06-24 | 2025-01-24 | 北京百度网讯科技有限公司 | Method, device, electronic device and storage medium for generating triplet samples |
| CN111538825B (en) | 2020-07-03 | 2020-10-16 | 支付宝(杭州)信息技术有限公司 | Knowledge question and answer method, device, system, equipment and storage medium |
| CN114118072A (en) * | 2020-09-01 | 2022-03-01 | 上海智臻智能网络科技股份有限公司 | Document structuring method and device, electronic equipment and computer readable storage medium |
| CN112163076B (en) * | 2020-09-27 | 2022-09-13 | 北京字节跳动网络技术有限公司 | Knowledge question bank construction method, question and answer processing method, device, equipment and medium |
| CN112347229B (en) * | 2020-11-12 | 2021-07-20 | 润联软件系统(深圳)有限公司 | Answer extraction method and device, computer equipment and storage medium |
| CN112487139B (en) * | 2020-11-27 | 2023-07-14 | 平安科技(深圳)有限公司 | Text-based automatic question setting method and device and computer equipment |
| CN112800177B (en) * | 2020-12-31 | 2021-09-07 | 北京智源人工智能研究院 | Method and device for automatic generation of FAQ knowledge base based on complex data types |
| CN112800032B (en) * | 2021-02-24 | 2021-08-31 | 北京智源人工智能研究院 | Method and device for automatic construction of FAQ knowledge base based on tabular data |
| CN113268561B (en) * | 2021-04-25 | 2021-12-14 | 中国科学技术大学 | Problem generation method based on multi-task joint training |
| CN113919367A (en) * | 2021-09-09 | 2022-01-11 | 中国科学院自动化研究所 | Abstract acquisition method, device, equipment, medium and product |
| CN114065765A (en) * | 2021-10-29 | 2022-02-18 | 北京来也网络科技有限公司 | Weapon and equipment text processing method, device and electronic device combining AI and RPA |
| CN114491152B (en) * | 2021-12-02 | 2023-10-31 | 南京硅基智能科技有限公司 | A summary video generation method, storage medium, and electronic device |
| CN114611484B (en) * | 2022-02-17 | 2025-07-08 | 中国人民大学 | Text analysis method, system, equipment and medium based on text structure |
| CN116069910A (en) * | 2022-12-30 | 2023-05-05 | 阿里巴巴(中国)有限公司 | Dialog processing method, device and system |
| CN116992112B (en) * | 2023-06-30 | 2025-07-25 | 百度在线网络技术(北京)有限公司 | Data generation method and device, electronic equipment and medium |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108363743A (en) * | 2018-01-24 | 2018-08-03 | 清华大学深圳研究生院 | A kind of intelligence questions generation method, device and computer readable storage medium |
| CN108846130A (en) * | 2018-06-29 | 2018-11-20 | 北京百度网讯科技有限公司 | A kind of question text generation method, device, equipment and medium |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10503786B2 (en) * | 2015-06-16 | 2019-12-10 | International Business Machines Corporation | Defining dynamic topic structures for topic oriented question answer systems |
Non-Patent Citations (2)
| Title |
|---|
| Learning to Ask: Neural Question Generation for Reading Comprehension; Xinya Du et al.; ACL; 2017-12-31; pp. 1342-1352 * |
| Question Generation With Doubly Adversarial Nets; Junwei Bao et al.; IEEE; 2018-11-30; pp. 2230-2239 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN109726274A (en) | 2019-05-07 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||