CN113157885B

CN113157885B - An efficient intelligent question answering system for artificial intelligence domain knowledge

Info

Publication number: CN113157885B
Application number: CN202110392744.2A
Authority: CN
Inventors: 曲晨帆; 金连文; 林上港; 马骏; 谭濯; 刘振鑫
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2021-04-13
Filing date: 2021-04-13
Publication date: 2023-07-18
Anticipated expiration: 2041-04-13
Also published as: CN113157885A

Abstract

The invention relates to an efficient intelligent question answering system for knowledge in the field of artificial intelligence, comprising a preparation module and a question answering module. The preparation module comprises a data collection module, a model training module, and a question answering system knowledge structure construction module. The question answering module comprises an input preprocessing module, a knowledge-base-based question answering module, a text-base-based question answering module, and a knowledge-base-based question recommendation module. Through the preparation module and the question answering module, the invention greatly improves word segmentation accuracy for user questions, knowledge base questions, and text base questions, which in turn substantially improves the overall accuracy of the question answering system and the user experience, providing a low-cost, efficient knowledge question answering service.

Description

An efficient intelligent question answering system for artificial intelligence domain knowledge

Technical Field

The invention relates to the technical fields of artificial intelligence and natural language processing, and in particular to an efficient intelligent question answering system for knowledge in the field of artificial intelligence.

Background

In recent years, artificial intelligence technology has developed rapidly and has broad application prospects in education, medical care, agriculture, transportation, and other fields. However, acquiring knowledge in the field of artificial intelligence requires a certain professional foundation. Practitioners in many industries lack a convenient and accurate way to obtain this knowledge, which makes artificial intelligence technology hard to popularize in many fields and indirectly hinders the development of social productivity. Unstructured text in the field of artificial intelligence carries a large amount of the field's knowledge. A knowledge question answering system for this field based on text understanding could give people an efficient and convenient way to acquire knowledge and promote the further development of artificial intelligence technology.

Existing knowledge question answering systems have the following problems. First, the information extraction model lacks support for entity names and entity aliases. Without entity names, domain terms are segmented incorrectly, which degrades search engine performance; without aliases, the system cannot recognize synonymous questions, so subsequent search results are one-sided. Both harm the overall performance of the question answering system. Second, machine reading comprehension is a complex natural language processing task with high complexity and heavy computation, and knowledge base construction depends on unstructured text; building a knowledge base manually is time-consuming and laborious, and it is difficult to reach sufficient scale. Both factors restrict the practical deployment of question answering systems. Finally, existing question answering systems still cannot efficiently obtain accurate and comprehensive answers from different types of text across paragraphs, documents, and forms, and they cannot guide users to further explore related knowledge in the field.

Summary of the Invention

To solve the technical problems of the prior art, the invention provides an efficient intelligent question answering system for knowledge in the field of artificial intelligence. Through a preparation module and a question answering module, it greatly improves word segmentation accuracy for user questions, knowledge base questions, and text base questions, which in turn substantially improves the overall accuracy of the question answering system and the user experience, providing a low-cost, efficient knowledge question answering service.

The invention is realized by the following technical solution: an efficient intelligent question answering system for knowledge in the field of artificial intelligence, comprising a preparation module and a question answering module. The preparation module comprises a data collection module, a model training module, and a question answering system knowledge structure construction module. The question answering module comprises an input preprocessing module, a knowledge-base-based question answering module, a text-base-based question answering module, and a knowledge-base-based question recommendation module.

Through the data collection module, the preparation module annotates the collected unstructured knowledge text paragraphs in the field of artificial intelligence, which are used to train the information extraction model and the machine reading comprehension model of the model training module. It also collects or defines synonymous and non-synonymous questions in the field of artificial intelligence to train the short text matching model. Using the question answering system knowledge structure construction module, the trained information extraction model extracts knowledge triples, which are turned into question-answer pairs, and the extracted entity names and aliases are used to assist search. The construction method of the inverted indexes for the knowledge base and the text base is improved to give the search engine semantic information, and a knowledge base keyword index is built.

The question answering module preprocesses the user's question through the input preprocessing module and searches for an answer with the knowledge-base-based question answering module. If an answer is found, it is prepared for return; otherwise the preprocessed question is sent to the text-base-based question answering module, which finds and prepares an answer. The knowledge-base-based question recommendation module recommends questions to the user, and the answer and the recommended questions are finally returned to the user together.

Compared with the prior art, the invention has the following advantages and beneficial effects:

1. The invention supplements the jieba word segmentation dictionary with the entity names and aliases extracted by the information extraction model. This greatly improves word segmentation accuracy for user questions, knowledge base questions, and text base questions, which in turn substantially improves the overall accuracy of the question answering system and the user experience.

2. Using the correspondence between entity names and aliases extracted by the information extraction model, together with a synonym dictionary obtained from the Internet, the improved BM25 knowledge base coarse recall module ranks all content containing synonymous but differently worded keywords in a single retrieval with almost no added inference time. Ranking of document paragraphs is also more robust to variations in topic-word frequency and document length, so retrieval quality improves.

3. The invention uses information extraction to build the question answering knowledge base from unstructured text paragraphs, and preferably supplements it with other available semi-structured and structured text. This makes knowledge acquisition channels more diverse and flexible. A question that the user expresses in natural language can be matched whenever it is semantically consistent with knowledge already in the knowledge base, and the answers become richer.

4. The question answering system can recommend related questions, guiding users to explore the knowledge system further and prompting new questions, which has high social value and strong practical significance.

5. The invention has low demand for and consumption of computing resources.

Brief Description of the Drawings

Fig. 1 is a system structure diagram of the invention;

Fig. 2 is a diagram of the data collection module in the preparation module;

Fig. 3 is a diagram of the model training module in the preparation module;

Fig. 4 is a flowchart of HBT model training in the model training module;

Fig. 5 is a flowchart of ESIM model training in the model training module;

Fig. 6 is a diagram of the RoBERTa-QA model;

Fig. 7 is a diagram of the question answering system knowledge structure construction module in the preparation module;

Fig. 8 is a diagram of the knowledge base inverted index construction module within the knowledge structure construction module;

Fig. 9 is a diagram of the knowledge base keyword index construction module within the knowledge structure construction module;

Fig. 10 is a diagram of the text base inverted index construction module within the knowledge structure construction module;

Fig. 11 is an overall flowchart of the question answering module;

Fig. 12 is a flowchart of the coreference resolution method in the preprocessing module of the question answering module;

Fig. 13 is a flowchart of the coarse recall module in the knowledge-base-based question answering module;

Fig. 14 is a flowchart of the method for judging question synonymy with ESIM in the knowledge-base-based question answering module;

Fig. 15 is a flowchart of the coarse recall module in the text-base-based question answering module;

Fig. 16 is a diagram of the knowledge-base-based question recommendation module in the question answering module.

Detailed Description

The invention is described in further detail below with reference to the embodiment and the accompanying drawings, but the embodiments of the invention are not limited thereto.

Embodiment

As shown in Fig. 1, this embodiment is an efficient intelligent question answering system for knowledge in the field of artificial intelligence, comprising a preparation module and a question answering module. The preparation module comprises a data collection module, a model training module, and a question answering system knowledge structure construction module. The question answering module comprises an input preprocessing module, a knowledge-base-based question answering module, a text-base-based question answering module, and a knowledge-base-based question recommendation module.

Through the data collection module, the preparation module annotates the collected unstructured knowledge text paragraphs in the field of artificial intelligence, which are used to train the information extraction model and the machine reading comprehension model of the model training module. It also collects or defines synonymous and non-synonymous questions in the field of artificial intelligence to train the short text matching model. Using the question answering system knowledge structure construction module, the trained information extraction model extracts knowledge triples, which are turned into question-answer pairs, and the extracted entity names and aliases are used to make search more efficient. The construction method of the inverted indexes for the knowledge base and the text base is improved to give the search engine semantic information, and a knowledge base keyword index is built to support question recommendation.

As shown in Fig. 2, in this embodiment the data collection module is implemented as follows:

S21. Collect unstructured knowledge text paragraphs from sources such as scientific publications, literature, and online popular science material related to the field of artificial intelligence. Paragraphs that are too long are split at periods; specifically, each text paragraph is limited to no more than 480 characters.
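The splitting rule in S21 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `split_paragraph`, the greedy packing of whole sentences, and the use of the Chinese full stop `。` as the only delimiter are assumptions made for the example.

```python
from typing import List


def split_paragraph(text: str, max_len: int = 480, period: str = "。") -> List[str]:
    """Split `text` at periods into chunks of at most `max_len` characters.

    Assumption: a single sentence longer than `max_len` is kept whole,
    because the source only describes splitting at periods.
    """
    # Cut the text into sentences, keeping each period with its sentence.
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        if ch == period:
            sentences.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        sentences.append(text[start:])  # trailing text without a period

    # Greedily pack whole sentences into chunks that respect the limit.
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) > max_len:
            chunks.append(current)
            current = ""
        current += sent
    if current:
        chunks.append(current)
    return chunks
```

Packing whole sentences keeps every sentence in a chunk complete, which matches the requirement stated later in S41.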

S22. Define the types of key-information triples to be extracted by the information extraction model. Using the ordinary relation definition method, the following triple types are defined: entity-description-content, entity-proposer-content, entity-contains-content, entity-application-content, and entity-alias-content. The defined triple types are then annotated, with every triple referred to uniformly as subject-predicate-object. Using the brat text annotation tool on the collected unstructured AI knowledge text paragraphs, all subjects are first marked by selecting them, then all objects are marked by selecting them, and finally every correspondence between a subject and an object in the paragraph is marked by drawing a connecting line.

S23. For the machine reading comprehension model: after the unstructured knowledge text paragraphs have been collected in step S21, if questions already exist for a paragraph, mark the start and end positions of the answer content in the paragraph. Otherwise, define diverse questions about scientific knowledge in the field of artificial intelligence directly from parts of the paragraph, simulating real user scenarios, and mark the start and end positions of each answer in the paragraph.

S24. For the short text matching model: directly collect synonymous and non-synonymous questions from this field, each within 30 characters. Synonymous questions are paired two by two with label 1; non-synonymous questions are paired two by two with label 0.

As shown in Fig. 3, in this embodiment the model training module is implemented as follows:

S31. Build the information extraction model with the HBT model, initialize its parameters from the RoBERTa pre-trained model, and train it on the annotated triple data. The HBT training process is shown in Fig. 4. First, the special symbol [CLS] is added at the beginning of the input text paragraph and [SEP] at its end, and the sequence is padded with [PAD] to 512 characters. After the characters are converted to their ids via the vocabulary of the RoBERTa pre-trained model, the model's Embedding layer maps them to vectors of feature dimension 756, which pass through all Transformer layers of the RoBERTa pre-trained model to yield the output feature vectors. On one branch, the output feature vectors pass through a fully connected layer with Sigmoid activation that produces 2 channels, giving the predicted start and end probabilities of the subject. On the other branch, the output feature vectors at the start and end of one randomly selected subject in the paragraph are taken out, replicated, and passed through a fully connected feature fuser; the result is added to the output feature vectors and fed to a fully connected layer that produces a 10-channel feature vector, whose channels correspond to the start and end points of the 5 predicate types. After Sigmoid activation, this gives the predicted start and end probabilities of the predicate-object. The predicted subject and predicate-object probabilities are each compared with the annotations using binary cross-entropy loss, and the error is backpropagated to adjust the model parameters over repeated iterations.
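The loss computation at the end of S31 can be illustrated in plain Python. This is a sketch under stated assumptions: `span_tags` and `binary_cross_entropy` are hypothetical helper names, the loss is averaged over positions, and the network itself (RoBERTa encoder and fully connected heads) is omitted. Note that the source states a feature dimension of 756, whereas publicly released RoBERTa-base checkpoints use 768.

```python
import math
from typing import List, Tuple


def span_tags(length: int, spans: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Turn annotated (start, end) spans into 0/1 start and end tag vectors."""
    starts, ends = [0] * length, [0] * length
    for s, e in spans:
        starts[s] = 1
        ends[e] = 1
    return starts, ends


def binary_cross_entropy(probs: List[float], labels: List[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between Sigmoid outputs and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)
```

In training, one such loss would be computed for the subject start channel, one for the subject end channel, and likewise for each of the 10 predicate-object channels, before backpropagation.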

S32. Build an ESIM model to train the short text matching model, as shown in Fig. 5. First, Chinese character vectors are trained with word2vec on the full Chinese Wikipedia corpus. The ESIM model is then pre-trained on the Chinese translation of the Quora Question Pairs dataset. Starting from the pre-trained model, the ESIM parameters are fine-tuned on the open large-scale Chinese short text matching dataset LCQMC together with the collected short text matching dataset for the field of artificial intelligence; during this process the AI-domain dataset is resampled to a scale similar to that of LCQMC. Training uses binary cross-entropy loss between the ESIM model output and the 0 or 1 label.
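The resampling step in S32 is not specified further in the source. One simple possibility, shown here purely as an assumption, is oversampling by repetition: the hypothetical function `oversample` repeats the domain pairs and tops up with a random draw until a target size is reached.

```python
import random
from typing import List, Tuple

# (question_a, question_b, label); label 1 means synonymous, 0 means not.
Pair = Tuple[str, str, int]


def oversample(domain_pairs: List[Pair], target_size: int, seed: int = 0) -> List[Pair]:
    """Repeat `domain_pairs` until the result has `target_size` items.

    Assumption: plain oversampling by repetition; the source only says the
    domain dataset is resampled to a scale similar to the larger dataset.
    """
    if not domain_pairs:
        return []
    if target_size <= len(domain_pairs):
        return list(domain_pairs)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    full_copies, remainder = divmod(target_size, len(domain_pairs))
    return domain_pairs * full_copies + rng.sample(domain_pairs, remainder)
```

Bringing the two datasets to a similar size keeps the small domain dataset from being drowned out during fine-tuning.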

S33. Build the RoBERTa-QA model to train the machine reading comprehension model. The RoBERTa-QA model is shown in Fig. 6. Its input is the concatenation of a text paragraph and a question, encoded according to the RoBERTa vocabulary. Before encoding, the special character [CLS] is added at the very beginning, [SEP] is added between the text paragraph and the question, another [SEP] is added at the end of the question, and the sequence is padded with [PAD] to 512 characters. On top of RoBERTa, the RoBERTa-QA model adds, after the last Transformer layer, a fully connected layer with 756 input channels and 2 output channels followed by a Softmax layer, to predict the probability of each position being the start or the end of the answer in the passage. The model is further pre-trained on the large-scale open dataset DuReader starting from the Chinese pre-trained RoBERTa model, and then fine-tuned on the collected machine reading comprehension annotations for the field of artificial intelligence. Training uses binary cross-entropy loss to compute the error.
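The input layout described in S33 can be sketched as below. The function name `build_qa_input` is hypothetical, and character-level tokenization is an assumption made for illustration; a real pipeline would use the tokenizer that ships with the chosen RoBERTa checkpoint.

```python
from typing import List


def build_qa_input(passage: str, question: str, max_len: int = 512) -> List[str]:
    """Lay out tokens as [CLS] passage [SEP] question [SEP] [PAD]...

    Assumption: one token per character, which approximates how Chinese
    text is usually tokenized for BERT-style models.
    """
    tokens = ["[CLS]"] + list(passage) + ["[SEP]"] + list(question) + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("passage and question do not fit in max_len tokens")
    # Pad on the right so every input has the same fixed length.
    return tokens + ["[PAD]"] * (max_len - len(tokens))
```

The resulting tokens would then be mapped to ids through the model vocabulary before being fed to the Embedding layer.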

As shown in Fig. 7, in this embodiment the question answering system knowledge structure construction module is implemented as follows:

S41. Collect a large amount of unstructured knowledge text in the field of artificial intelligence, namely unstructured text paragraphs from sources such as scientific publications, literature, and online popular science material related to the field. Paragraphs that are too long are split at periods; specifically, each paragraph is limited to no more than 480 characters, and every sentence in a paragraph is complete.

S42. Use the information extraction model trained in step S31 to extract triples from the large collection of unstructured AI knowledge text paragraphs. [CLS] is added at the beginning of each input paragraph and [SEP] at its end, and the sequence is padded with [PAD] to 512 characters. After conversion to ids via the RoBERTa vocabulary, the Embedding layer of RoBERTa maps them to vectors of feature dimension 756, which pass through all Transformer layers of RoBERTa to yield the output feature vectors. The output feature vectors pass through a fully connected layer with Sigmoid activation that produces 2 channels, giving the predicted start and end probabilities of the subject. Positions whose start or end probability exceeds 0.5 are taken as start or end points, and each start point is paired with the nearest end point after it to form a subject position. For each subject, the output feature vectors at its start and end are taken out, replicated, and passed through the fully connected feature fuser; the result is added to the output feature vectors and fed to the fully connected object predictor, which produces a 10-channel feature vector whose channels correspond to the start and end points of the 5 predicate types. After Sigmoid activation, this gives the predicted predicate-object start and end probabilities; positions with probability greater than 0.5 are taken as start or end points, and each start point is paired with the nearest end point after it to form an object position. The subject content and the corresponding predicate-object content are then located in the text paragraph by these positions and combined into triples of the form subject-predicate-object.
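The span decoding rule in S42 (threshold at 0.5, then pair each start with the nearest following end) can be sketched as follows. The function name `decode_spans` is hypothetical, and allowing an end at the same position as its start (a one-character span) is an assumption.

```python
from typing import List, Sequence, Tuple


def decode_spans(
    start_probs: Sequence[float],
    end_probs: Sequence[float],
    threshold: float = 0.5,
) -> List[Tuple[int, int]]:
    """Pair each predicted start with the nearest predicted end at or after it."""
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    ends = [i for i, p in enumerate(end_probs) if p > threshold]
    spans = []
    for s in starts:
        following = [e for e in ends if e >= s]
        if following:  # a start with no later end is dropped
            spans.append((s, min(following)))
    return spans
```

The same routine would apply per predicate type to the 10-channel output: two channels (start and end) for each of the 5 predicates.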

S43. The entity names and entity aliases in the triples are extracted separately and added to the dictionary of the jieba word segmentation tool, which greatly improves word segmentation accuracy. In addition, the extracted triples are converted into question-answer key-value pairs according to fixed rules: entity-description-content becomes "What is {entity}: {content}"; entity-application-content becomes "What are the applications of {entity}: {content}"; entity-contains-content becomes "What does {entity} include: {content}"; entity-proposer-content becomes "Who proposed {entity}: {content}"; and entity-alias-content becomes "What is {entity} also called: {content}". Triple knowledge from Baidu Baike and Wikipedia is also crawled and converted into question-answer pairs in the format "What is the {attribute} of {entry name}: {content}" to supplement the knowledge base. Any existing high-quality question-answer pairs for this field are added to the knowledge base as well. Question-answer pairs are stored in the knowledge base as key-value pairs, with the question as the key and the answer as the value.
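The template rules in S43 can be sketched as a small mapping from predicate type to question template. The predicate keys, the English template wording, and the function names are illustrative assumptions. The collected entity terms could then be registered with jieba, for example through its `add_word` function, which is not called here so that the sketch has no external dependency.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# English renderings of the five templates; the keys are hypothetical names
# for the five predicate types defined in S22.
QUESTION_TEMPLATES = {
    "description": "What is {entity}",
    "application": "What are the applications of {entity}",
    "contains": "What does {entity} include",
    "proposer": "Who proposed {entity}",
    "alias": "What is {entity} also called",
}


def triples_to_qa(triples: Iterable[Triple]) -> Dict[str, str]:
    """Build a question -> answer dictionary from extracted triples."""
    kb = {}
    for subject, predicate, obj in triples:
        template = QUESTION_TEMPLATES.get(predicate)
        if template is None:
            continue  # unknown predicate types are skipped
        # A repeated question overwrites the earlier answer in this sketch.
        kb[template.format(entity=subject)] = obj
    return kb


def collect_entity_terms(triples: Iterable[Triple]) -> Set[str]:
    """Gather entity names and aliases to add to the segmentation dictionary."""
    terms = set()
    for subject, predicate, obj in triples:
        terms.add(subject)
        if predicate == "alias":
            terms.add(obj)
    return terms
```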

S44. Build an inverted index and a keyword index over all questions in the knowledge base, as shown in Fig. 8 and Fig. 9 respectively. To build the inverted index, the jieba word segmentation tool segments the question text of every question-answer pair in the knowledge base and stop words are removed, giving a set of words per question; the union of these sets over all questions is the vocabulary of the knowledge base. First, an index is built whose keys are all words in the knowledge base and whose value for each word is the frequency with which it occurs in each segment. The index is then adjusted using the synonym dictionary as follows. If word A and word B are synonyms, the value for key A in the adjusted index is obtained by finding every segment in the original index in which A or B occurs: if only one of the two occurs in a segment, the value is the frequency of that word in the segment; if both occur, the value is the sum of their frequencies in the segment. That is, the term frequency $f_i$ of a word $c_i$ in a given segment is:

$$f_i = \mathrm{freq}(c_i) + \sum \mathrm{freq}(p_i)$$

where $\mathrm{freq}$ counts the frequency of a word in that segment, and $p_i$ ranges over the synonyms of $c_i$.
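The synonym-adjusted index can be sketched as below. The function name `build_index` and the data shapes are assumptions for illustration: `segments` is a list of already segmented, stop-word-filtered word lists, and `synonyms` maps each word to the set of its synonyms.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set


def build_index(
    segments: List[List[str]],
    synonyms: Dict[str, Set[str]],
) -> Dict[str, Dict[int, int]]:
    """Build word -> {segment id: frequency}, merging synonym frequencies."""
    # Plain inverted index before adjustment.
    raw = defaultdict(dict)
    for seg_id, words in enumerate(segments):
        for word, count in Counter(words).items():
            raw[word][seg_id] = count

    # Adjusted index: add the per-segment frequencies of every synonym.
    merged = {}
    for word in raw:
        freqs = Counter(raw[word])
        for syn in synonyms.get(word, ()):
            freqs.update(raw.get(syn, {}))
        merged[word] = dict(freqs)
    return merged
```

Because the synonym frequencies are folded in at indexing time, a query for one word retrieves and ranks segments containing any of its synonyms without extra work at query time.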

To build the knowledge base keyword index, every question in the knowledge base is traversed and its words are checked for overlap with the set of entity names and aliases extracted by the information extraction model. For each overlapping entity name or alias, the question is appended to the value stored under that name or alias in the keyword index.
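A minimal sketch of the keyword index construction follows. The function name `build_keyword_index` and the input shape (each question paired with its segmented words) are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_keyword_index(
    segmented_questions: Iterable[Tuple[str, List[str]]],
    entity_terms: Set[str],
) -> Dict[str, List[str]]:
    """Map each entity name or alias to the questions that mention it."""
    index = defaultdict(list)
    for question, words in segmented_questions:
        # Intersect the question's words with the extracted entity terms.
        for term in sorted(set(words) & entity_terms):
            index[term].append(question)
    return dict(index)
```

Looking up an entity in this index returns every knowledge base question about it, which is what the question recommendation module needs.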

S45. The large collection of unstructured AI knowledge text is saved directly as the text base, and an inverted index is built for it, as shown in Fig. 10. The jieba word segmentation tool segments every text paragraph in the text base and stop words are removed, giving a set of words per paragraph; the union of these sets over all paragraphs is the vocabulary of the text base. First, an index is built whose keys are all words in the text base and whose value for each word is the frequency with which it occurs in each segment. The index is then adjusted using the synonym dictionary as follows. If word A and word B are synonyms, the value for key A in the adjusted index is obtained by finding every segment in the original index in which A or B occurs: if only one of the two occurs in a segment, the value is the frequency of that word in the segment; if both occur, the value is the sum of their frequencies in the segment. That is, the term frequency $f_i$ of a word $c_i$ in a given paragraph of the text base is:

$$f_i = \mathrm{freq}(c_i) + \sum \mathrm{freq}(p_i)$$

where $\mathrm{freq}$ counts the frequency of a word in that paragraph of the text base, and $p_i$ ranges over the synonyms of $c_i$.

As shown in FIG. 11, in this embodiment the question answering module operates as follows:

S51. Use the input preprocessing module to remove stop words from the user's question, perform word segmentation and part-of-speech tagging with the jieba word segmentation tool together with the new words obtained from the information extraction model, and perform anaphora resolution based on the segmentation and tagging results and the Harbin Institute of Technology LTP language model. In anaphora resolution, as shown in FIG. 12, the part-of-speech tagging of jieba and the syntactic analysis of the LTP model yield the segmentation result, the parts of speech and the dependency arc relations. First, determine whether the input has a subject. If it does not, locate the pronoun through the part-of-speech array and, provided its dependency arc relation is neither verb-object (VOB) nor attributive (ATT), replace the pronoun with the previously recorded subject. If it does, locate the pronoun and check whether its dependency arc relation is coordination (COO); if so, replace the pronoun with the subject of the input. If no pronoun is found, record the subject of the input for use in later anaphora resolution.

S52. The knowledge-base-based question answering module receives the output of the preprocessing module and first quickly checks whether the knowledge base contains a question identical to the user's; if so, the corresponding answer is obtained directly. This path is taken when the user accepts a question recommended by the system, and it substantially reduces response time and computing resource usage. When no exact match is found, the improved BM25 knowledge base coarse recall module first performs a coarse search for the preprocessed user question over all question texts in the knowledge base. The ESIM model then performs fine-grained semantic matching between the preprocessed user question and the coarse search results to find the semantically closest result. If that result's matching score exceeds the set threshold, its answer is taken as the answer to be returned to the user. How the ESIM model judges whether two questions are synonymous is shown in FIG. 14: the input is a batch of question pairs, and the output is the probability that each pair is synonymous, which serves as the matching score. If the BM25 knowledge base coarse recall module and the ESIM model find no result whose matching score exceeds the threshold, the preprocessed user question is passed to the text-library-based question answering module.
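The control flow of step S52 can be sketched as below. This is only an outline under stated assumptions: `coarse_recall` and `match_score` are hypothetical callables standing in for the improved BM25 module and the ESIM model, the toy implementations use plain word overlap, and the threshold of 0.5 is an arbitrary illustrative value that the text does not specify.

```python
from typing import Callable, Dict, List, Optional


def answer_from_knowledge_base(
    question: str,
    knowledge_base: Dict[str, str],
    coarse_recall: Callable[[str], List[str]],
    match_score: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed value, not given in the text
) -> Optional[str]:
    # 1. Exact lookup: the knowledge base is keyed by question text.
    if question in knowledge_base:
        return knowledge_base[question]
    # 2. Coarse recall over all question texts (BM25 in the text).
    candidates = coarse_recall(question)
    if not candidates:
        return None
    # 3. Fine semantic matching (ESIM in the text): keep the closest candidate.
    best = max(candidates, key=lambda cand: match_score(question, cand))
    if match_score(question, best) > threshold:
        return knowledge_base[best]
    # None signals that the text-library module should take over.
    return None


# Toy stand-ins for illustration only.
kb = {"what is overfitting": "answer A", "who proposed dropout": "answer B"}


def toy_recall(q: str) -> List[str]:
    return [k for k in kb if set(k.split()) & set(q.split())]


def toy_score(a: str, b: str) -> float:
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)
```

Returning `None` rather than raising keeps the fallback to the text-library module a simple conditional in the caller.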

Specifically, the improved BM25 knowledge base coarse recall module is shown in FIG. 13. The module first checks whether each word of the preprocessed user question exists in the set of all words in the knowledge base. If it does, the word is kept. If it does not and the word has synonyms, the module checks whether any of those synonyms exists in the set of all words in the knowledge base; if one does, the word is replaced by that synonym and kept. If no synonym is found in the knowledge base, or the word has no synonyms, the word is dropped. The result is the further-processed user question. Finally, the module computes the relevance score between each retained word and each question in the knowledge base, and sums the per-word scores to obtain the relevance score between the query and each question in the knowledge base.
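The word filtering and synonym substitution step can be sketched as follows. The function name `refine_query` and the sample vocabulary are hypothetical; picking the first synonym that occurs in the vocabulary is an assumption, since the text only requires that some synonym be present.

```python
def refine_query(words, vocabulary, synonyms):
    """Keep known words, swap unknown words for a known synonym, drop the rest."""
    kept = []
    for word in words:
        if word in vocabulary:
            kept.append(word)
            continue
        # Assumption: use the first synonym that appears in the vocabulary.
        replacement = next(
            (syn for syn in synonyms.get(word, ()) if syn in vocabulary), None
        )
        if replacement is not None:
            kept.append(replacement)
    return kept


# Hypothetical sample data.
vocabulary = {"network", "training", "loss"}
synonyms = {"net": ["mesh", "network"], "cost": ["price"]}
refined = refine_query(["net", "training", "cost", "banana"], vocabulary, synonyms)
```

Here `net` is replaced by `network`, `training` is kept, and `cost` and `banana` are dropped because neither they nor any of their synonyms occur in the vocabulary.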

The relevance score between the further-processed user question Q and the j-th question d_j in the knowledge base is:

where q_i is the i-th word of the further-processed user question; W_ij is the weight of the j-th question in the knowledge base with respect to the i-th word of the user question; IDF(q_i) is the inverse document frequency of the i-th word of the further-processed user question; n is the number of words in the user question; and i indexes the words of the user question.

Specifically,

where N is the total number of question-answer pairs in the knowledge base, and n(q_i) is the number of question-answer pairs in the knowledge base that contain the i-th word of the further-processed user question.

Specifically,

where k₁ is the term-frequency saturation parameter, set to 1.5 here; b is the passage-length normalization parameter, set to 0.75 here; dl_j is the string length of the j-th question in the knowledge base; and avgdl is the average string length of all questions in the knowledge base.
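The three expressions that the symbol definitions above refer to can be written out under the assumption that the module follows the standard Okapi BM25 formulation, which uses exactly these symbols. The term frequency f_ij of q_i in d_j is a symbol introduced here for illustration, and the original formulas may differ in detail, for example in how the IDF is smoothed.

```latex
\mathrm{Score}(Q, d_j) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot W_{ij}

\mathrm{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}

W_{ij} = \frac{f_{ij}\,(k_1 + 1)}{f_{ij} + k_1 \left(1 - b + b \cdot \dfrac{dl_j}{avgdl}\right)}
```

With k₁ = 1.5 and b = 0.75, a question of average length (dl_j = avgdl) that contains q_i once gets W_ij = 2.5 / 2.5 = 1.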

S53. If the knowledge-base-based question answering module finds no answer to the user's question, the text-library-based question answering module is used. The improved BM25 knowledge base coarse recall module first retrieves and coarsely recalls knowledge text paragraphs from the text library for the further-processed user question. The coarse recall results and the preprocessed user question are then fed to the RoBERTa-QA model, which predicts in parallel, for each recalled paragraph, the start and end positions of the answer to the user's question together with their probabilities. The product of the probabilities at the most probable start position and the most probable end position is the probability that the paragraph contains an answer. If this probability exceeds the set threshold for any paragraph, the text between the most probable start and end positions in that paragraph is taken as the answer returned to the user.
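The answer selection rule of step S53 can be sketched as below, taking the model's per-position probabilities as given. Several points are assumptions not stated in the text: the threshold value, choosing the highest-probability paragraph when more than one exceeds the threshold, and skipping a paragraph whose most probable end precedes its most probable start. The function name and sample numbers are hypothetical.

```python
def select_answer(paragraphs, start_probs, end_probs, threshold=0.5):
    """Return (answer_tokens, probability) for the best paragraph, or None.

    paragraphs: list of token lists (the coarsely recalled paragraphs).
    start_probs / end_probs: per-paragraph lists of per-token probabilities.
    """
    best = None
    for tokens, starts, ends in zip(paragraphs, start_probs, end_probs):
        start = max(range(len(starts)), key=starts.__getitem__)
        end = max(range(len(ends)), key=ends.__getitem__)
        # Probability that the paragraph contains an answer.
        prob = starts[start] * ends[end]
        # Assumption: an inverted span counts as "no answer".
        if end < start or prob <= threshold:
            continue
        if best is None or prob > best[1]:
            best = (tokens[start:end + 1], prob)
    return best


# Hypothetical model outputs for two recalled paragraphs.
recalled = [
    ["dropout", "randomly", "disables", "units"],
    ["BM25", "ranks", "documents"],
]
start_p = [[0.1, 0.7, 0.1, 0.1], [0.9, 0.05, 0.05]]
end_p = [[0.1, 0.1, 0.2, 0.6], [0.1, 0.1, 0.8]]
selected = select_answer(recalled, start_p, end_p, threshold=0.5)
```

The first paragraph scores 0.7 × 0.6 = 0.42 and is rejected; the second scores 0.9 × 0.8 = 0.72 and supplies the answer span.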

As shown in FIG. 15, in this embodiment the improved BM25 text library coarse recall module computes the relevance score between each word of the further-processed user question and every text paragraph in the text library, then sums the per-word scores to obtain the relevance score between the further-processed user question and each paragraph. That is, the relevance score between the further-processed user question Q and the j-th paragraph d_j in the text library is:

where q_i is the i-th word of the further-processed user question; W_ij is the weight of the j-th question in the knowledge base with respect to the i-th word of the user question; IDF(q_i) is the inverse document frequency of the i-th word of the further-processed user question; n is the number of words in the user question; and i indexes the words of the user question.

Specifically,

In W_ij, k₁ is the term-frequency saturation parameter, set to 1.5 here; b is the passage-length normalization parameter, set to 0.75 here; dl_j is the string length of the j-th question in the knowledge base; and avgdl is the average string length of all questions in the knowledge base.
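A minimal sketch of the coarse recall scoring follows. It assumes the standard Okapi BM25 form of IDF and term weight, which is an assumption since the text fixes only k₁ = 1.5 and b = 0.75. The index is taken to be a synonym-adjusted inverted index mapping each word to its per-paragraph frequencies, and lengths are character counts of the paragraphs. All names and sample data are hypothetical.

```python
import math


def bm25_scores(query_words, index, lengths, k1=1.5, b=0.75):
    """Score every paragraph against the query.

    index: word -> {paragraph_id: frequency}.
    lengths: character length of each paragraph, indexed by paragraph_id.
    """
    n_docs = len(lengths)
    avgdl = sum(lengths) / n_docs
    scores = [0.0] * n_docs
    for word in query_words:
        postings = index.get(word, {})
        if not postings:
            continue
        # Standard BM25 IDF; it can be negative for very common words.
        idf = math.log((n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for pid, freq in postings.items():
            norm = k1 * (1 - b + b * lengths[pid] / avgdl)
            scores[pid] += idf * freq * (k1 + 1) / (freq + norm)
    return scores


# Hypothetical index over four paragraphs.
toy_index = {"dropout": {0: 2}, "network": {0: 1, 2: 3}}
toy_lengths = [10, 20, 30, 20]
toy_scores = bm25_scores(["dropout", "unknown"], toy_index, toy_lengths)
```

Only the paragraphs with the highest scores would be passed on to the reading comprehension model.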

S54. The knowledge-base-based question recommendation module is shown in FIG. 16. The module first uses the knowledge base keyword index built in step S44 to find knowledge base questions that share an entity name or alias with the preprocessed user question. If any are found, one of them is selected at random and recommended to the user; otherwise a question is selected at random from the whole knowledge base. After selection, the module checks whether the question is identical to the preprocessed user question or whether its answer is identical to the answer to be returned to the user; if so, the preceding selection steps are repeated until a question that differs on both counts is obtained. The answer to the user's question and the recommended question are then returned and displayed to the user in turn.
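The recommendation step can be sketched as follows. Instead of re-drawing until an acceptable question turns up, the sketch filters out the disallowed questions first and then samples once, which gives the same uniform choice among acceptable questions while always terminating. Falling back to the whole knowledge base when every entity-matched question is disallowed is an assumption, as are the function name and sample data.

```python
import random


def recommend_question(user_words, user_question, returned_answer,
                       knowledge_base, keyword_index, rng=random):
    """Pick a knowledge base question to recommend, or None if none qualifies."""

    def allowed(question):
        # Must differ from the user's question and from the returned answer.
        return (question != user_question
                and knowledge_base[question] != returned_answer)

    # Questions sharing an entity name or alias with the user's question.
    related = {q for word in user_words for q in keyword_index.get(word, [])}
    pool = sorted(q for q in related if allowed(q))
    if not pool:
        # Fall back to the whole knowledge base.
        pool = sorted(q for q in knowledge_base if allowed(q))
    return rng.choice(pool) if pool else None


# Hypothetical sample data.
qa_pairs = {"what is CNN": "a1", "who proposed CNN": "a2", "what is dropout": "a3"}
kw_index = {
    "CNN": ["what is CNN", "who proposed CNN"],
    "dropout": ["what is dropout"],
}
suggestion = recommend_question(
    ["what", "is", "CNN"], "what is CNN", "a1", qa_pairs, kw_index
)
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible in tests.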

The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited to it. Any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention shall be regarded as an equivalent replacement and falls within the protection scope of the present invention.

Claims

1. An efficient intelligent question answering system for artificial intelligence domain knowledge, characterized in that it comprises a preparation module and a question answering module; the preparation module comprises a data collection module, a model training module and a question answering system knowledge structure construction module; the question answering module comprises an input preprocessing module, a knowledge-base-based question answering module, a text-library-based question answering module and a knowledge-base-based question recommendation module;

The preparation module, through the data collection module, annotates the collected unstructured knowledge text paragraphs in the field of artificial intelligence in order to train the information extraction model and the machine reading comprehension model of the model training module, and also collects or defines synonymous and non-synonymous questions in order to train the short text matching model; through the question answering system knowledge structure construction module, it extracts knowledge triples with the trained information extraction model and forms question-answer pairs, supplies semantic information to the retrieval engine by improving how the inverted indexes of the knowledge base and the text library are constructed, and builds a knowledge base keyword index;

The question answering module preprocesses the question input by the user through the input preprocessing module and searches for an answer with the knowledge-base-based question answering module; when no answer is found there, the text-library-based question answering module searches for and prepares the answer to be returned; the knowledge-base-based question recommendation module recommends a question to the user; finally, the answer and the recommended question are returned to the user;

The implementation process of the model training module is as follows:

S31. Build the information extraction model with the HBT model, initialize the model parameters from the RoBERTa pre-trained model, and train on the annotated triple data;

S32. Build an ESIM model as the short text matching model: train Chinese character vectors with word2vec on the full Chinese Wikipedia corpus, pre-train the ESIM model on the Chinese translation of the Quora Question Pairs dataset, and then, starting from the pre-trained model, fine-tune the parameters of the ESIM model on the open large-scale Chinese short text matching dataset LCMQC and on the collected short text matching dataset in the field of artificial intelligence;

S33. Build the RoBERTa-QA model as the machine reading comprehension model: starting from the Chinese pre-trained model RoBERTa, carry out further pre-training on the open dataset DuReader, and then fine-tune the parameters on the collected machine reading comprehension annotation data in the field of artificial intelligence;

The implementation process of the question answering system knowledge structure construction module is as follows:

S41. Collect unstructured knowledge text paragraphs in the field of artificial intelligence;

S42. Use the information extraction model trained in step S31 to extract triples from the unstructured knowledge text paragraphs in the field of artificial intelligence;

S43. Extract the entity names and entity aliases from the triples separately, form question-answer key-value pairs from the extracted triples, and likewise form question-answer pairs from triple knowledge crawled from Baidu Encyclopedia and Wikipedia and put them into the knowledge base; the question-answer pairs in the knowledge base are stored as key-value pairs with the question as the key and the answer as the value;

S44. Build an inverted index and a keyword index over all questions in the knowledge base. To build the inverted index, use the jieba word segmentation tool to segment the question text of each question-answer pair in the knowledge base and remove stop words, obtaining a set of words per question; the union of the words obtained from all questions processed this way is the set of all words in the knowledge base. To build the knowledge base keyword index, traverse each question in the knowledge base; if its words intersect the set of entity names and aliases extracted by the information extraction model, add the question to the value stored under the matching entity name or alias key in the knowledge base keyword index;

S45. Save the unstructured knowledge text paragraphs in the field of artificial intelligence directly as the text library and build an inverted index for it: use the jieba word segmentation tool to segment each text paragraph in the text library and remove stop words, obtaining a set of words per paragraph; the union of the words obtained from all paragraphs processed this way is the set of all words in the text library.

2. The efficient intelligent question answering system for artificial intelligence domain knowledge according to claim 1, characterized in that the implementation process of the data collection module is as follows:

S21. Collect unstructured knowledge text paragraphs from scientific publications, literature and online popular science material in the field of artificial intelligence, splitting the text into paragraphs at full stops;

S22. Define the types of key-information triples to be extracted by the information extraction model, using the common relation definition approach, as: entity-description-content, entity-proposer-content, entity-include-content, entity-application-content and entity-alias-content. Annotate the defined triple types in the collected unstructured knowledge text paragraphs in the field of artificial intelligence with the brat text annotation tool: select the subject label to mark all subjects, select the object label to mark all objects, and draw connecting lines to mark every subject-object correspondence in the paragraph;

S23. For the machine reading comprehension model: after the unstructured knowledge text paragraphs from scientific publications, literature and online popular science material in the field of artificial intelligence have been collected in step S21, if a paragraph has corresponding questions, mark the start and end positions in the paragraph of the content that answers each question; otherwise, define diverse science questions in the field of artificial intelligence based on part of the paragraph's content, simulating real user scenarios, and mark the start and end positions in the paragraph of the answer to each question;

S24. For the short text matching model, directly collect synonymous and non-synonymous questions; a pair of synonymous questions is labeled 1, and a pair of non-synonymous questions is labeled 0.

3. The efficient intelligent question answering system for artificial intelligence domain knowledge according to claim 1, characterized in that the implementation process of the input preprocessing module is: use the input preprocessing module to remove stop words from the user's question, perform word segmentation and part-of-speech tagging with the jieba word segmentation tool together with the new words obtained from the information extraction model, and perform anaphora resolution based on the segmentation and tagging results and the Harbin Institute of Technology LTP language model.

4. The efficient intelligent question answering system for artificial intelligence domain knowledge according to claim 1, characterized in that the implementation process of the knowledge-base-based question answering module is: after receiving the output of the input preprocessing module, check whether the knowledge base contains a question identical to the user's, and if so obtain the corresponding answer directly; if no exact match is found, the improved BM25 knowledge base coarse recall module performs a coarse search for the preprocessed user question over all question texts in the knowledge base, and the ESIM model performs fine-grained semantic matching between the preprocessed user question and the coarse search results to find the closest result; if the matching score of that result exceeds the set threshold, the answer corresponding to the result is taken as the answer to be returned to the user; if the BM25 knowledge base coarse recall module and the ESIM model find no result whose matching score exceeds the threshold, the preprocessed user question is passed to the text-library-based question answering module.

5. The efficient intelligent question answering system for artificial intelligence domain knowledge according to claim 1, characterized in that the implementation process of the text-library-based question answering module is as follows: the improved BM25 knowledge base coarse recall module first retrieves and coarsely recalls knowledge text paragraphs from the text library for the further-processed user question; the coarse recall results and the preprocessed user question are then input into the RoBERTa-QA model, which predicts in parallel, for each recalled paragraph, the start and end positions of the answer to the user's question and the corresponding probabilities; the product of the probabilities at the most probable start position and the most probable end position is the probability that the paragraph contains an answer; if this probability exceeds the set threshold for any paragraph, the text between the most probable start and end positions in that paragraph is taken as the answer to be returned to the user.

6. The efficient intelligent question answering system for artificial intelligence domain knowledge according to claim 1, characterized in that the implementation process of the knowledge-base-based question recommendation module is as follows: use the knowledge base keyword index built in step S44 to find knowledge base questions that share an entity name or alias with the preprocessed user question; if any are found, randomly select one of them and recommend it to the user; otherwise randomly select a question from the whole knowledge base; after selection, check whether the question is identical to the preprocessed user question or whether its answer is identical to the answer to be returned to the user; if so, repeat the preceding selection steps until a question that differs on both counts is obtained; the answer to the user's question and the recommended question are then returned and displayed to the user in turn.