WO2021159632A1 - Intelligent question answering method, apparatus, computer device, and computer storage medium - Google Patents

Intelligent question answering method, apparatus, computer device, and computer storage medium Download PDF

Info

Publication number
WO2021159632A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
word
sentence
question
question sentence
Prior art date
Application number
PCT/CN2020/092963
Other languages
English (en)
French (fr)
Inventor
陈秀玲
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021159632A1 publication Critical patent/WO2021159632A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/319Inverted lists

Definitions

  • This application relates to the field of artificial intelligence technology, in particular to intelligent question answering methods, devices, computer equipment and computer storage media.
  • The question answering system allows users to query information with natural-language question sentences: it can understand the question sentence entered by the user, analyze the user's search intention, and give a high-quality answer, which not only conforms to people's search habits but also improves the efficiency of information retrieval.
  • Open-domain intelligent question answering is one form of question answering system. It mainly relies on similarity matching against a question-answer pair corpus, retrieval over a knowledge graph, or generative question answering based on deep learning, and can handle basic question-and-answer exchanges.
  • The inventor realized that, in the face of ever-changing and diverse open-domain questions, both similarity matching against a question-answer pair corpus and knowledge-graph retrieval require maintaining a large-scale corpus of question-answer pairs or knowledge-graph triples. As a result, the knowledge base often suffers from incomplete coverage and untimely updates and cannot answer users' questions.
  • Consequently, the accuracy of generative question answering does not meet the requirements of a smooth conversation, and reasonably accurate answers cannot be given quickly.
  • In view of this, the present application provides an intelligent question answering method, apparatus, computer device, and computer storage medium, whose main purpose is to solve the problem that the accuracy of current generative question answering cannot support a smooth conversation.
  • According to one aspect, an intelligent question answering method includes: when a question sentence is received, obtaining, from a pre-organized knowledge base, the related documents whose matching degree with the question sentence ranks before a preset value; forming an input sentence from the question sentence and the parts extracted from each related document, and feeding it into a pre-trained reading comprehension model to predict, for each part extracted from the related document, the probability that it is the answer sentence; and using these probability values to generate the output answer sentence.
  • According to another aspect, an intelligent question answering device includes: an acquisition unit, configured to obtain, when a question sentence is received, from a pre-organized knowledge base, the related documents whose matching degree with the question sentence ranks before a preset value; a prediction unit, configured to form an input sentence from the question sentence and the parts extracted from each related document, feed it into a pre-trained reading comprehension model, and predict, for each part extracted from the related document, the probability that it is the answer sentence; and a generation unit, configured to use these probability values to generate the output answer sentence.
  • a computer device including a memory and a processor
  • the memory stores a computer program
  • The processor implements the following steps when executing the computer program: when a question sentence is received, obtaining, from a pre-organized knowledge base, the related documents whose matching degree with the question sentence ranks before a preset value; forming an input sentence from the question sentence and the parts extracted from each related document, and feeding it into a pre-trained reading comprehension model to predict, for each part extracted from the related document, the probability that it is the answer sentence; and using these probability values to generate the output answer sentence.
  • a computer storage medium on which a computer program is stored.
  • When the computer program is executed by a processor, the following steps are implemented: when a question sentence is received, obtaining, from a pre-organized knowledge base, the related documents whose matching degree with the question sentence ranks before a preset value; forming an input sentence from the question sentence and the parts extracted from each related document, and feeding it into a pre-trained reading comprehension model to predict, for each part extracted from the related document, the probability that it is the answer sentence; and using these probability values to generate the output answer sentence.
  • Through the above technical solutions, the present application provides an intelligent question answering method and device.
  • When a question sentence is received, the related documents whose matching degree with the question sentence ranks before a preset value are obtained from a pre-organized knowledge base.
  • The parts extracted from the related documents and the question sentence then form an input sentence, which is fed into a pre-trained reading comprehension model; the probability that each extracted part is the answer sentence is predicted, and the output answer sentence is generated from these probabilities.
  • The pre-organized knowledge base of this application records a collection of documents collated from various websites and thus provides a more complete question-and-answer database.
  • The pre-trained reading comprehension model can understand the question sentence entered by the user, analyze the user's search intention, predict the probability that each candidate part of a related document is the answer sentence, and give the user a high-quality answer, which improves the accuracy of generative question answering.
  • FIG. 1 shows a schematic flowchart of an intelligent question answering method provided by an embodiment of the present application
  • FIG. 2 shows a schematic flowchart of another intelligent question answering method provided by an embodiment of the present application
  • FIG. 3 shows a schematic diagram of a process of question answering over the question-and-answer corpus provided by an embodiment of the present application;
  • FIG. 4 shows a schematic structural diagram of an intelligent question answering device provided by an embodiment of the present application
  • Fig. 5 shows a schematic structural diagram of another intelligent question answering device provided by an embodiment of the present application.
  • the embodiment of the present application provides an intelligent question answering method, which can understand the question sentence input by the user, analyze the user's search intention, and give the user a high-quality answer sentence.
  • the method includes:
  • The pre-organized knowledge base may be, for example, the Wikipedia knowledge base.
  • Wikipedia is an encyclopedia website (comparable to Baidu Baike) whose content can be organized into an open-domain knowledge base containing a large collection of documents and data that can be downloaded and used when training various artificial intelligence algorithms.
  • When obtaining, from the pre-organized knowledge base, the related documents whose matching degree with the question sentence ranks before the preset value, note that the knowledge base records a collection of documents collated from various websites. The documents may therefore be ranked by the importance of the question sentence in each document, selecting the documents whose importance ranks before the preset value as related documents. Alternatively, the documents may be ranked by the number of times the terms of the question sentence appear in each document, selecting the documents whose count ranks before the preset value
  • as related documents; the ranking criterion is not limited here.
  • The pre-trained reading comprehension model is obtained by fine-tuning the BERT pre-trained model on a question-and-answer data set for the reading comprehension task.
  • The BERT pre-trained model used here is a language model built on a bidirectional Transformer architecture. Its training comprises a pre-training stage and a reading comprehension (fine-tuning) stage. In the pre-training stage, two kinds of tasks are trained: one masks 15% of the words in a document and predicts the masked words; the other predicts, for a sentence pair, whether the second sentence actually follows the first.
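The masked-word pre-training task described above can be illustrated with a minimal sketch. This is not the actual BERT implementation: the whitespace tokenizer, the helper name `mask_words`, and the fixed seed are simplifying assumptions; only the ~15% masking rate follows the text.

```python
import random

def mask_words(words, mask_prob=0.15, mask_token="[MASK]", seed=1):
    """Replace roughly 15% of the words with [MASK]; return the masked
    sequence and the (position, original word) targets to be predicted."""
    rng = random.Random(seed)
    masked, targets = [], []
    for i, w in enumerate(words):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets.append((i, w))
        else:
            masked.append(w)
    return masked, targets

words = "the model predicts the masked words from their context".split()
masked, targets = mask_words(words)
```

During pre-training, the model is then asked to recover each `(position, word)` target from the surrounding unmasked context, which is what lets the learned representation fuse context from both sides of a word.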
  • For each word in the question sentence and the associated document, a word vector, the position information of the word vector, and the semantic (segment) information of the word are generated.
  • Each extracted part may be the answer sentence corresponding to the question sentence. Based on these word vectors, position information, and semantic information, each part extracted from the associated document
  • is scored to obtain the probability that it is the answer sentence.
  • In the related solution, Wikipedia serves as the question-and-answer corpus.
  • The documents retrieved after text tracking are highly related to the question sentence and therefore usually contain the answer sentence, which is then predicted by the reading comprehension model. This solves the problem that open-domain topics demand a timeliness that a fixed corpus cannot meet, as well as the difficulty of balancing retrieval speed against the accuracy of semantic understanding when searching for answers over a large number of questions, improving both the speed and the accuracy of question answering.
  • the embodiment of the present application provides an intelligent question answering method.
  • When a question sentence is received, the related documents whose matching degree with the question sentence ranks before a preset value are obtained from a pre-organized knowledge base. The parts extracted from the related documents and the question sentence then form an input sentence, which is fed into a pre-trained reading comprehension model to predict the probability that each extracted part is the answer sentence, thereby generating the output answer sentence.
  • The pre-organized knowledge base of this application records a collection of documents collated from various websites and thus provides a more complete question-and-answer database.
  • The pre-trained reading comprehension model can understand the question sentence entered by the user, analyze the user's search intention, predict the probability that each candidate part of a related document is the answer sentence, and give the user a high-quality answer, which improves the accuracy of generative question answering.
  • the embodiment of the present application provides another intelligent question answering method, which can understand the question sentence input by the user, analyze the user's search intention, and give the user a high-quality answer sentence.
  • the method includes:
  • First, each document in the document collection of the pre-organized knowledge base is segmented into words, and then an inverted index from each word segment to the documents of the collection is established.
  • Word segmentation tools such as jieba, LTP, or HanLP can be used.
  • Each document and each word segment in the document collection is numbered, so that, based on the word-segment features a document contains, the documents related to the question sentence can be found quickly in a massive document collection.
  • For example, suppose the document collection contains 5 documents.
  • Each word segment contained in each document is obtained, each word segment has a corresponding number, and the numbers of the documents in which a word segment appears are recorded.
  • Suppose word segment A appears in documents 001 and 003, word segment B appears in document 004, word segment C appears in documents 001 and 004, word segment D appears in document 005, and so on.
  • Then the inverted list corresponding to word segment A is {001, 003}, the inverted list corresponding to word segment B is {004}, and the inverted list corresponding to word segment C is {001, 004}.
  • The bag-of-words model was originally used in the field of information retrieval. For a document, it ignores the order and grammar of the words and considers only whether each word appears in the document and how many times it appears (the word frequency). A document is thus characterized by the words that appear in it and the number of times each word appears.
  • The bag-of-words model can be used to compute word-frequency statistics over the word segments of each document in the collection. After the frequency of each word segment in each document is obtained, these frequencies can be added to the inverted lists of the inverted index from word segments to documents.
  • For example, if the inverted list corresponding to word segment A is {001, 003}, and word segment A appears once in document 001 and 4 times in document 003,
  • the inverted list of word segment A is updated to {(001; 1), (003; 4)}. In this way, the inverted list of each word segment records every document in which it appears together with its frequency in each of those documents.
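The construction of the inverted index with per-document word frequencies described above can be sketched as follows. The document IDs and word segments reproduce the example in the text; the helper name `build_inverted_index` is illustrative, and the input is assumed to be already word-segmented.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each word segment to its inverted list
    {document id: word frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, segments in docs.items():
        for seg, freq in Counter(segments).items():
            index[seg][doc_id] = freq
    return index

# Word-segmented documents, matching the example in the text:
docs = {
    "001": ["A", "C"],
    "003": ["A", "A", "A", "A"],
    "004": ["B", "C"],
    "005": ["D"],
}
index = build_inverted_index(docs)
# The inverted list of word segment A is {(001; 1), (003; 4)},
# i.e. index["A"] == {"001": 1, "003": 4}
```

At query time, only the inverted lists of the question sentence's word segments need to be consulted, which is what makes retrieval over a massive document collection fast.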
  • After that, the question sentence is segmented into words to obtain the word segments it contains. Based on the established inverted index from word segments to documents, the frequency of each of the question sentence's word segments in each document is obtained, and an evaluation value of the importance of each word segment in each document is then calculated.
  • This evaluation value can be calculated as the tf-idf value of the word segment in each document.
  • tf-idf is a statistical method for evaluating the importance of a word to one document in a document collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears across the corpus.
  • The inverted index records the number of documents in the document collection, the frequency of each of the question sentence's word segments in each document of the collection, and the number of word segments contained in each document.
  • From these quantities, the tf-idf of each word segment of the question sentence in each document is first calculated as tf-idf = tf × idf;
  • the tf-idf of the question sentence in a document is then the sum of the tf-idf values of all its word segments divided by the number of word segments in the question sentence.
  • For example, suppose the question sentence is divided into word segments A, B, and C, and word segment B appears in document 1; the tf-idf of each segment in each document is calculated and combined as above to score each document.
  • If too many related documents are selected, the subsequent recognition workload will be excessive, which affects the response speed of the intelligent dialogue.
  • the preferred number of related documents is 5-10.
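The document scoring just described can be sketched as follows. The text only states that importance is proportional to in-document frequency and inversely proportional to corpus frequency; the smoothed logarithmic idf used here is one common variant, assumed rather than taken from the source, and the helper names are illustrative.

```python
import math

def tfidf(seg, doc_id, index, doc_len, n_docs):
    """tf-idf of one word segment in one document, read straight off the
    inverted index: tf from the segment's inverted list, idf from the
    length of that list (its document frequency)."""
    tf = index.get(seg, {}).get(doc_id, 0) / doc_len[doc_id]
    df = len(index.get(seg, {}))
    idf = math.log(n_docs / (1 + df))  # smoothed idf; one common choice
    return tf * idf

def question_score(segments, doc_id, index, doc_len, n_docs):
    """Score of the whole question sentence in a document: sum of the
    segments' tf-idf values divided by the number of segments."""
    return sum(tfidf(s, doc_id, index, doc_len, n_docs)
               for s in segments) / len(segments)

# Toy inverted index with word frequencies: {segment: {doc id: freq}}
index = {"A": {"d1": 3}, "B": {"d1": 1, "d2": 1}}
doc_len = {"d1": 4, "d2": 1}  # word segments per document
scores = {d: question_score(["A", "B"], d, index, doc_len, n_docs=10)
          for d in doc_len}
```

The documents whose score ranks before the preset value (here, the preferred 5 to 10 documents) are then selected as the related documents.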
  • In the pre-training phase, part of the words of each associated document are masked and input to the reading comprehension model, which predicts the masked words; this yields, for each word segment in the question sentence and the associated document, the word vector, the position information of the word vector, and the semantic information of the word vector.
  • Specifically, the BERT pre-trained model masks some words in each associated document and then uses the context of a word segment to predict its original semantic information, so that the learned semantics fuse the context on both the left and the right of the word segment; the word vector, position information, and semantic information of each word segment in the question sentence and the associated document are then extracted.
  • In the reading comprehension stage, the word vector, the position information of the word vector, and the semantic information of the word vector of each word segment in the question sentence and the associated document are encoded to obtain a word encoding and a position encoding. The combination of the word encoding and the position encoding is input to the pre-trained reading comprehension model so that position information is added to the word encoding, and the relationship between the question sentence and each part extracted from the related document is obtained. Based on this relationship, the probability that each extracted part is the answer sentence is predicted.
  • In the encoding process, each word segment in the question sentence and the related document corresponds to a 768-dimensional word vector.
  • The position information is an integer index pre-assigned to each word segment, which is subsequently converted into a 768-dimensional vector.
  • The semantic information distinguishes the question sentence from the related documents inside the reading comprehension model: every word segment of the question sentence is marked 0, every word segment of the related documents is marked 1, and the 0/1 marks are then converted into 768-dimensional vectors.
  • In the reading comprehension stage, the input to the pre-trained reading comprehension model is the sum of the word vector, the position vector, and the semantic vector.
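The input construction just described — the element-wise sum of a 768-dimensional word vector, position vector, and semantic (segment) vector — can be sketched as follows. Randomly initialized embedding tables stand in for the learned ones, and the vocabulary size and sequence length are arbitrary toy values.

```python
import numpy as np

HIDDEN = 768  # dimensionality stated in the text

rng = np.random.default_rng(0)
word_emb = rng.normal(size=(100, HIDDEN))  # toy vocabulary of 100 ids
pos_emb = rng.normal(size=(32, HIDDEN))    # one vector per integer position
seg_emb = rng.normal(size=(2, HIDDEN))     # 0 = question, 1 = document

def model_input(word_ids, segment_ids):
    """Input to the reading comprehension stage: the element-wise sum of
    word vector, position vector, and semantic (segment) vector."""
    positions = np.arange(len(word_ids))
    return word_emb[word_ids] + pos_emb[positions] + seg_emb[segment_ids]

# Question word segments are marked 0, document word segments 1:
word_ids = np.array([4, 8, 15, 16, 23, 42])
segment_ids = np.array([0, 0, 1, 1, 1, 1])
x = model_input(word_ids, segment_ids)  # shape (6, 768)
```

The 0/1 segment marks are what let a single concatenated input sentence carry both the question and the candidate document part through the model.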
  • The answer sentence corresponding to the question sentence is a span of text extracted from the associated document. Assuming the answer sentence begins at some start point and ends at some end point in the associated document, the pre-trained reading comprehension model can predict, for each word segment of each part extracted from the document, the probability that it is the start point and the probability that it is the end point.
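Selecting the answer span from per-token start-point and end-point probabilities can be sketched as below. The softmax scoring, the exhaustive span search, and the toy logits are illustrative assumptions; the text does not specify how the model's probabilities are combined.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def best_answer_span(start_logits, end_logits, max_span=30):
    """Choose the (start point, end point) pair with the highest joint
    probability P(start) * P(end), requiring end >= start."""
    p_start, p_end = softmax(start_logits), softmax(end_logits)
    best, best_p = (0, 0), -1.0
    for s in range(len(p_start)):
        for e in range(s, min(s + max_span, len(p_end))):
            if p_start[s] * p_end[e] > best_p:
                best, best_p = (s, e), p_start[s] * p_end[e]
    return best, best_p

# Toy logits for a 6-token document part, peaked at start=2, end=4:
start_logits = np.array([0.1, 0.2, 5.0, 0.3, 0.1, 0.0])
end_logits = np.array([0.0, 0.1, 0.2, 0.3, 5.0, 0.1])
span, prob = best_answer_span(start_logits, end_logits)
# span == (2, 4): tokens 2 through 4 form the predicted answer sentence
```

The `end >= start` constraint and the cap on span length rule out degenerate answers; the surviving span with the highest joint probability is the candidate answer sentence.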
  • When entering the question sentence, the user may have in mind scene factors such as the practical application of the topic and the context.
  • The part extracted from the related documents with the highest probability of being the answer sentence may nevertheless
  • fail to satisfy these scene factors. Therefore, before generating the output answer sentence, a filtering instruction can be set to re-rank, in combination with the user's current scene factors, the probability values of the parts extracted from the associated documents, so as to select the part of the document more suitable for generating the output answer sentence; the scene factors are not limited here.
  • the specific intelligent question answering process can be as shown in Figure 3.
  • The top 5 related documents are selected by tracking, in real time, the documents associated with the question sentence in the Wikipedia knowledge base.
  • The question sentence and the sentences of the associated documents are input to the pre-trained reading comprehension model for short-text reading comprehension, so that the probability of each document sentence being the answer sentence is predicted, and the sentence with the highest probability is selected as the best answer.
  • an embodiment of the present application provides an intelligent question answering device.
  • the device includes: an acquisition unit 31, a prediction unit 32, and a generation unit 33.
  • the obtaining unit 31 may be configured to obtain, from a pre-arranged knowledge base, related documents whose matching degree with the question sentence is ranked before a preset value when a question sentence is received;
  • The prediction unit 32 can be used to form an input sentence from the question sentence and the parts extracted from each related document, input it into a pre-trained reading comprehension model, and predict the probability that each extracted part is the answer sentence;
  • The generating unit 33 may be used to generate the output answer sentence using the probability that each part extracted from the associated document is the answer sentence.
  • With this intelligent question answering device, when a question sentence is received, the related documents whose matching degree with the question sentence ranks before a preset value are obtained from a pre-organized knowledge base; the parts extracted from the related documents and the question sentence then form an input sentence, which is input to a pre-trained reading comprehension model, and the probability that each extracted part is the answer sentence is predicted to generate the output answer sentence.
  • The pre-organized knowledge base of this application records a collection of documents collated from various websites and thus provides a more complete question-and-answer database.
  • The pre-trained reading comprehension model can understand the question sentence entered by the user, analyze the user's search intention, predict the probability that each candidate part of a related document is the answer sentence, and give the user a high-quality answer, which improves the accuracy of generative question answering.
  • FIG. 5 is a schematic structural diagram of another intelligent question answering device according to an embodiment of the present application. As shown in FIG. 5, the device further includes:
  • The establishing unit 34 may be used, before the related documents whose matching degree with the question sentence ranks before the preset value are obtained from the pre-organized knowledge base upon receipt of a question sentence, to perform word segmentation on the document collection of the pre-organized knowledge base and to establish an inverted index from word segments to each document in the collection;
  • The statistical unit 35 may be configured to use the bag-of-words model to perform word-frequency statistics on the word segments of each document in the document collection, to obtain the frequency of each word segment in each document.
  • the acquiring unit 31 includes:
  • the calculation module 311 can be used to calculate the evaluation value of the importance of the question sentence in each document based on the inverted index of the word segmentation to each document in the document collection;
  • the selection module 312 may be used to sort the evaluation value from highest to lowest, and select the document whose evaluation value ranks before the preset value as the related document.
  • The calculation module 311 can be specifically used to perform word segmentation on the question sentence and, based on the inverted index from word segments to each document in the document collection, query the frequency of each of the question sentence's word segments in each document of the collection and the number of word segments contained in each document;
  • The calculation module 311 can also be specifically used to calculate, according to the number of documents in the document collection, the frequency of each of the question sentence's word segments in each document of the collection, and the number of word segments contained in each document, the evaluation value of the importance of each word segment of the question sentence in each document;
  • the calculation module 311 may also be specifically used to summarize the evaluation values of the importance of each word segment in the question sentence in each document to obtain the evaluation value of the importance of the question sentence in each document.
  • The pre-trained reading comprehension model is obtained by fine-tuning the BERT pre-trained model on a question-and-answer data set for the reading comprehension task, and comprises a pre-training phase and a reading comprehension phase.
  • the prediction unit 32 includes:
  • The first prediction module 321 can be used, in the pre-training phase, to mask part of the words of each associated document and input them into the reading comprehension model to predict the masked words, so as to obtain the word vector, the position information of the word vector, and the semantic information of the word vector of each word segment in the question sentence and the associated document;
  • The second prediction module 322 can be used, in the reading comprehension stage, to encode the word vector, the position information of the word vector, and the semantic information of the word vector of each word segment in the question sentence and the associated document, and to input the result into the pre-trained reading comprehension model to predict the probability that each part extracted from the related document is the answer sentence.
  • The second prediction module 322 can be specifically used, in the reading comprehension stage, to encode the word vector, the position information of the word vector, and the semantic information of the word vector in the question sentence and the associated document, to obtain the word encoding and the position encoding;
  • The second prediction module 322 can also be specifically used to input the combination of the word encoding and the position encoding into the pre-trained reading comprehension model so that position information is added to the word encoding, and to obtain the relationship between the question sentence and each part extracted from the related documents;
  • The second prediction module 322 may also be specifically used to predict, based on the relationship between the question sentence and the parts extracted from the related document, the probability that each extracted part is the answer sentence.
  • the generating unit 33 includes:
  • the sorting module 331 can be used for sorting the probability values of each part cut out from the related document as the answer sentence according to the filtering instruction;
  • The generating module 332 may be used to obtain, among the parts extracted from the associated document, the part of the document with the highest probability of being the answer sentence, and to generate the answer sentence.
  • This embodiment also provides a storage medium on which a computer program is stored.
  • When the program is executed by a processor, the intelligent question answering method shown in Figure 1 and Figure 2 above is implemented.
  • the technical solution of the present application can be embodied in the form of a software product.
  • The software product can be stored in a volatile storage medium (such as static RAM (SRAM) or dynamic RAM (DRAM)) or in a non-volatile storage medium.
  • The non-volatile storage medium can be a CD-ROM, a USB flash drive, a removable hard disk, and the like.
  • a computer device can be a personal computer, server, or network device, etc.
  • An embodiment of the present application also provides a computer device, which may be a personal computer, server, network device, etc.
  • the physical device includes a storage medium and a processor; the storage medium is used to store a computer program; the processor is used to execute the computer program to implement the above-mentioned intelligent question answering method as shown in FIG. 1 and FIG. 2.
  • the computer device may also include a user interface, a network interface, a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, a WI-FI module, and so on.
  • the user interface may include a display screen (Display), an input unit such as a keyboard (Keyboard), etc., and the optional user interface may also include a USB interface, a card reader interface, and the like.
  • the network interface can optionally include a standard wired interface, a wireless interface (such as a Bluetooth interface, a WI-FI interface), and so on.
  • The physical device structure of the intelligent question answering device does not constitute a limitation on the physical device; it may include more or fewer components, combine certain components, or use a different arrangement of components.
  • the storage medium may also include an operating system and a network communication module.
  • the operating system is a program that manages the hardware and software resources of the above-mentioned computer equipment, and supports the operation of information processing programs and other software and/or programs.
  • the network communication module is used to realize the communication between the various components in the storage medium and the communication with other hardware and software in the physical device.
  • The pre-organized knowledge base of this application records a collection of documents collated from various websites and provides a more complete question-and-answer database; by means of a pre-trained reading comprehension model, the question sentence input by the user can be understood, the user's search intention analyzed, the probability of each candidate answer in the related documents predicted, and a high-quality answer given to the user, improving the accuracy of generative question answering.

Abstract

一种智能问答方法、装置及计算机存储介质,涉及人工智能技术领域,能够对用户输入的问题语句进行理解,分析用户检索意图,给出用户优质的回答语句。所述方法包括:当接收到问题语句时,从预先整理的知识库中获取与所述问题语句匹配度排名在预设数值之前的关联文档(101);将每个关联文档中截取出来的各个部分与问题语句组成一条输入语句,输入至预先训练的阅读理解模型,预测从关联文档中截取出来的各个部分作为回答语句的概率值(102);利用所述从关联文档中截取出来的各个部分作为回答语句的概率值,生成输出的回答语句(103)。

Description

智能问答方法、装置、计算机设备及计算机存储介质
相关申请的交叉引用
本申请申明享有2020年02月13日递交的申请号为CN202010091180.4、名称为“智能问答方法、装置、计算机设备及计算机存储介质”的中国专利申请的优先权,该中国专利申请的整体内容以参考的方式结合在本申请中。
技术领域
本申请涉及人工智能技术领域,尤其是涉及到智能问答方法、装置、计算机设备及计算机存储介质。
背景技术
随着互联网的快速发展,基于关键词的传统搜索引擎由于准确率低、存在冗余信息以及需要用户对搜索结果进行甄别等缺陷,已不能很好地满足互联网信息检索需求。而问答系统允许用户使用自然语言问句进行信息查询,能够对用户输入的问句进行理解,分析用户检索意图,给出高质量的答案,不仅符合人们的检索习惯,而且提高了信息查询的效率。
开放域智能问答为问答系统的一种形式,主要采用基于问答对语料的相似度匹配,或者基于知识图谱检索,以及基于深度学习的生成式问答,能够实现基本语句的问答。然而,发明人意识到,在面对日新月异、种类繁多的开放域问答时,基于问答对语料的相似度匹配,或者基于知识图谱检索,需要维护一个大规模的问答对语料,或者知识图谱三元组语料,使得知识库经常出现覆盖不全、更新不及时而无法回答用户的问题,导致生成式问答的准确率并不能达到顺畅通话的要求,无法迅速给出较为准确的答案。
发明内容
有鉴于此,本申请提供了一种智能问答方法、装置、计算机设备及计算机存储介质,主要目的在于解决目前生成式问答的准确率不能达到顺畅通话的问题。
依据本申请一个方面,提供了一种智能问答方法,该方法包括:当接收到问题语句时,从预先整理的知识库中获取与所述问题语句匹配度排名在预设数值之前的关联文档;将每个关联文档中截取出来的各个部分与问题语句组成一条输入语句,输入至预先训练的阅读理解模型,预测从关联文档中截取出来的各个部分作为回答语句的概率值;利用所述从关联文档中截取出来的各个部分作为回答语句的概率值,生成输出的回答语句。
依据本申请另一个方面,提供了一种智能问答装置,所述装置包括:获取单元,用于当接收到问题语句时,从预先整理的知识库中获取与所述问题语句匹配度排名在预设数值之前的关联文档;预测单元,用于将每个关联文档中截取出来的各个部分与问题语句组成一条输入语句,输入至预先训练的阅读理解模型,预测从关联文档中截取出来的各个部分作为回答语句的概率值;生成单元,用于利用所述从关联文档中截取出来的各个部分作为回答语句的概率值,生成输出的回答语句。
依据本申请又一个方面,提供了一种计算机设备,包括存储器和处理器,所述存储器存储有计算机程序,所述处理器执行所述计算机程序时实现以下步骤:当接收到问题语句时,从预先整理的知识库中获取与所述问题语句匹配度排名在预设数值之前的关联文档;将每个关联文档中截取出来的各个部分与问题语句组成一条输入语句,输入至预先训练的阅读理解模型,预测从关联文档中截取出来的各个部分作为回答语句的概率值;利用所述从关联文档中截取出来的各个部分作为回答语句的概率值,生成输出的回答语句。
依据本申请再一个方面,提供了一种计算机存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现以下步骤:当接收到问题语句时,从预先整理的知识库中获取与所述问题语句匹配度排名在预设数值之前的关联文档;将每个关联文档中截取出来的各个部分与问题语句组成一条输入语句,输入至预先训练的阅读理解模型,预测从关联文档中截取出来的各个部分作为回答语句的概率值;利用所述从关联文档中截取出来的各个部分作为回答语句的概率值,生成输出的回答语句。
借由上述技术方案,本申请提供一种智能问答方法及装置,当接收到问题语句时,通过从预先整理的知识库中获取与问题语句匹配度排名在预设数值之前的关联文档,进一步将关联文档中截取出来的各个部分与问题语句组成一条输入语句,输入至预先训练的阅读理解模型,预测从关联文档中截取出来的各个部分作为回答语句的概率值,从而生成输出的回答语句。与现有技术中智能问答方法相比,本申请预先整理的知识库中记录有从各个网站上整理的文档集合,提供了更完善的问答数据库,利用预先训练的阅读理解模型,能够针对用户输入的问题语句进行理解,分析用户检索意图,预测出关联文档中作为回答语句的概率值,给出用户优质的回答语句,提高了生成式问答的准确率。
附图说明
通过阅读下文优选实施方式的详细描述,各种其他的优点和益处对于本领域普通技术人员将变得清楚明了。附图仅用于示出优选实施方式的目的,而并不认为是对本申请的限制。而且在整个附图中,用相同的参考符号表示相同的部件。在附图中:
图1示出了本申请实施例提供的一种智能问答方法的流程示意图;
图2示出了本申请实施例提供的另一种智能问答方法的流程示意图;
图3示出了本申请实施例提供的智能问答流程的示意图;
图4示出了本申请实施例提供的一种智能问答装置的结构示意图;
图5示出了本申请实施例提供的另一种智能问答装置的结构示意图。
具体实施方式
下面将参照附图更详细地描述本公开的示例性实施例。虽然附图中显示了本公开的示例性实施例,然而应当理解,可以以各种形式实现本公开而不应被这里阐述的实施例所限制。相反,提供这些实施例是为了能够更透彻地理解本公开,并且能够将本公开的范围完整的传达给本领域的技术人员。
本申请实施例提供了一种智能问答方法,能够对用户输入的问题语句进行理解,分析用户检索意图,给出用户优质的回答语句,如图1所示,该方法包括:
101、当接收到问题语句时,从预先整理的知识库中获取与所述问题语句匹配度排名在预设数值之前的关联文档。
其中,预先整理的知识库可以为维基百科知识库,维基百科知识库是一个类似于百度的网站,它会把网站内容整理成一个开放域的知识库,该知识库中包括各种文档以及数据集合,供各种人工智能算法训练时进行下载和使用。
可以理解的是,由于现有的问答领域中开放性问题太宽泛、话题所要求的时效性高,使得固定的语料库无法满足问答需求,本申请采用维基百科知识库作为问答语料,解决了在海量问题语句中搜索回答语句时兼顾搜索速度以及问题语句理解准确性的问题,给开放域智能问答提供一种可行性的思路。
具体在从预先整理的知识库中获取与问题语句匹配度排名在预设数值之前的关联文档时,由于预先整理的知识库中记录有从各个网站上整理的文档集合,这里可以根据问题语句在文档集合中各个文档中的重要程度,选取重要程度排名在预设数值之前的文档作为关联文档,当然还可以根据问题语句在文档集合中各个文档中出现的次数,选取次数排名在预设数值之前的文档作为关联文档,这里不进行限定。
102、将每个关联文档中截取出来的各个部分与问题语句组成一条输入语句,输入至预先训练的阅读理解模型,预测从关联文档中截取出来的各个部分作为回答语句的概率值。
其中,预先训练的阅读理解模型通过使用bert预训练模型对问答数据集进行阅读理解任务的fine-tune训练和预测。这里使用的bert预训练模型是一个使用双向transformer结构的语言模型,包括预训练阶段和阅读理解阶段,在预训练阶段对两类任务进行训练:一类是把文档中15%的词mask起来,在训练过程中对这些被mask的词进行预测;另一类是预测一个句子对中后面一个句子是否为前面一个句子的下一句话。
通过这两类任务的训练,生成问题语句以及关联文档中每个词的词向量、词向量的位置信息、词的语义信息。由于这些词向量包含自然语言文本的上下文语义信息,从关联文档中截取的各个部分都有可能是问题语句相应的回答语句,进一步根据问题语句以及关联文档中每个词的词向量、词向量的位置信息、词的语义信息,对从关联文档中截取出的各个部分进行预测,从而得到从关联文档中截取出来的各个部分作为回答语句的概率值。
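上述"把文档中15%的词mask起来并加以预测"的掩码语言模型任务,可以用如下示意性Python代码勾勒(掩码比例15%、[MASK]标记均为按BERT常见设定所做的假设,真实的BERT实现还会对被选中的词做保留原词、随机替换等处理,此处从略):

```python
import random

def mask_tokens(tokens, mask_ratio=0.15, mask_token="[MASK]", seed=42):
    """随机把约 mask_ratio 比例的词替换为 [MASK],
    返回被遮住后的序列,以及 {位置: 原词} 作为预测目标。"""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    labels = {}
    for p in positions:
        labels[p] = masked[p]   # 记录被遮住位置的原词
        masked[p] = mask_token  # 用 [MASK] 替换
    return masked, labels
```

训练时,模型需要根据上下文恢复labels中记录的原词;本示意仅展示数据构造,不含模型本身。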
103、利用所述从关联文档中截取出来的各个部分作为回答语句的概率值,生成输出的回答语句。
由于关联文档是采用维基百科作为问答语料、使用文本追踪后检索出的与问题语句关联程度较高的文档,所以其中通常都会记载有回答语句。通过阅读理解模型来预测关联文档中的回答语句,既解决了开放领域问题中话题时效性要求高而致使固定的语料库无法满足需要的问题,又解决了在海量问题中搜索答案时无法兼顾搜索速度和问题语义理解的准确性等问题,提高了问题回答的速度和准确性。
本申请实施例提供的一种智能问答方法,当接收到问题语句时,通过从预先整理的知识库中获取与问题语句匹配度排名在预设数值之前的关联文档,进一步将关联文档中截取出来的各个部分与问题语句组成一条输入语句,输入至预先训练的阅读理解模型,预测从关联文档中截取出来的各个部分作为回答语句的概率值,从而生成输出的回答语句。与现有技术中智能问答方法相比,本申请预先整理的知识库中记录有从各个网站上整理的文档集合,提供了更完善的问答数据库,利用预先训练的阅读理解模型,能够针对用户输入的问题语句进行理解,分析用户检索意图,预测出关联文档中作为回答语句的概率值,给出用户优质的回答语句,提高了生成式问答的准确率。
本申请实施例提供了另一种智能问答方法,能够对用户输入的问题语句进行理解,分析用户检索意图,给出用户优质的回答语句,如图2所示,所述方法包括:
201、对所述预先整理的知识库中的文档集合进行分词处理,建立分词到文档集合中各个文档的倒排索引。
可以理解的是,对于预先整理的知识库中的文档,为了提高文档追踪效率,在系统初始化时首先对预先整理的知识库中文档集合中的各个文档进行分词处理,然后建立分词到文档集合中各个文档的倒排索引。
应说明的是,本申请对分词处理的方式不进行限定,可以使用分词工具如结巴分词、LTP、HanLP等。
通过建立分词到文档集合中各个文档的倒排索引,对文档集合中各个文档以及分词进行编号,从而基于文档中包含的分词特征,能够迅速从海量的文档集合中查找出与问题语句相关的文档。例如,文档集合包含5个文档,通过对文档进行分词处理,得到文档所包含的各个分词,每个分词都有相应编号,同时记录分词出现所在的文档编号,分词A出现在文档001、003中,分词B出现在文档004中,分词C出现在文档001、004中,分词D出现在文档005中等等,相应的,分词A对应的倒排列表为{001、003},分词B对应的倒排列表为{004},分词C对应的倒排列表为{001、004}。
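上述"建立分词到文档集合中各个文档的倒排索引"的步骤,可以用如下示意性代码勾勒(文档编号与分词均沿用正文示例,实际的分词处理可使用结巴分词、LTP、HanLP等工具):

```python
def build_inverted_index(docs):
    """docs: {文档编号: 分词列表};
    返回 {分词: 出现该分词的文档编号有序列表},即倒排列表。"""
    index = {}
    for doc_id, words in docs.items():
        for w in set(words):  # 同一文档内去重,只记录"出现在哪些文档"
            index.setdefault(w, set()).add(doc_id)
    return {w: sorted(ids) for w, ids in index.items()}
```

例如按正文示例,分词A的倒排列表为["001", "003"],分词C的倒排列表为["001", "004"]。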
202、利用词袋模型对所述文档集合中各个文档的分词进行词频统计,得到分词在各个文档中出现的词频。
其中,Bag-of-words词袋模型最初被用在信息检索领域,对于一篇文档来说,假定不考虑文档内的词的顺序关系和语法,只考虑该文档是否出现过这个单词以及该单词出现的次数(词频)。这样一个文档的特征即表现为这个文档中所出现的单词以及每个单词出现的次数。
对于本申请实施例,具体在利用词袋模型对文档集合中各个文档的分词进行词频统计,得到各个分词在各个文档中出现的词频之后,还可以基于分词到文档集合中各个文档的倒排索引,将每个分词在文档集合中各个文档出现的词频加入到分词的倒排列表中。例如,分词A对应的倒排列表为{001、003},分词A在编号为001的文档中出现次数为1次,在编号为003的文档中出现4次,相应的,分词A的倒排列表更新为{(001;1)、(003;4)},得到分词在各个文档中的倒排列表,该倒排列表中记录有分词出现所在的各个文档以及在各个文档中的词频。
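在倒排列表中加入词频的步骤,可以在上述倒排索引的基础上按如下方式示意(数据与正文中分词A的示例一致,字典结构为本示意所做的假设):

```python
from collections import Counter

def build_tf_index(docs):
    """docs: {文档编号: 分词列表};
    返回 {分词: {文档编号: 词频}},
    即正文中形如 {(001;1)、(003;4)} 的带词频倒排列表。"""
    index = {}
    for doc_id, words in docs.items():
        for w, tf in Counter(words).items():  # 词袋模型:只统计词频,不考虑词序
            index.setdefault(w, {})[doc_id] = tf
    return index
```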
203、当接收到问题语句时,基于所述分词到文档集合中各个文档的倒排索引,计算问题语句在各个文档中重要程度的评估值。
具体在基于所述分词到文档集合中各个文档的倒排索引,计算问题语句在各个文档中重要程度的评估值的过程中,可以针对问题语句进行分词处理,得到问题语句所包含的分词,并基于建立的分词到文档集合中各个文档的倒排索引,获取问题语句所包含的分词在各个文档中的词频,进一步计算问题语句所包含的分词在各个文档中的重要程度的评估值。
对于本申请实施例,计算问题语句所包含的分词在各个文档中的重要程度的评估值,可以通过计算问题语句所包含的分词在各个文档中的tf-idf值实现。tf-idf是一种用于信息检索与文本挖掘的常用加权技术,也是一种统计方法,用以评估一字词对于一个文件集或一个语料库中的其中一份文件的重要程度。字词的重要性随着它在文件中出现的次数成正比增加,但同时会随着它在语料库中出现的频率成反比下降。
具体在基于分词到文档集合中各个文档的倒排索引,计算问题语句在各个文档中重要程度的评估值的过程中,倒排索引中记录有文档集合中文档数量、问题语句中各个分词在文档集合中各个文档出现的词频、各个文档中包含的分词量,首先根据文档集合中文档数量、问题语句中各个分词在文档集合中各个文档出现的词频、各个文档中包含的分词量,计算问题语句中各个分词在各个文档中的tf-idf值;然后汇总问题语句中各个分词在各个文档中的tf-idf值,得到问题语句在各个文档中的tf-idf值。
具体在计算问题语句中各个分词在各个文档中的tf-idf值过程中,首先获取问题语句中的分词在每个文档中出现的词频termFreq,然后获取每个文档中出现的总的分词数docTotalTerm,则tf=termFreq/docTotalTerm;获取总的文档数为docNum、包含问题语句中分词的文档数为wordIndocNum,则idf=1.0+log(docNum/(wordIndocNum+1));问题语句中每个分词的tf-idf=tf*idf;问题语句的tf-idf为该问题语句中所有分词的tf-idf之和除以该问题语句的分词数量。
示例性的,问题语句划分为分词A、分词B、分词C。分词A在文档1中出现的词频termFreq=10,在文档2中出现的词频termFreq=7;分词B在文档1中出现的词频termFreq=5,在文档2中出现的词频termFreq=0;分词C在文档1中出现的词频termFreq=20,在文档2中出现的词频termFreq=10;文档1中的分词量为100,文档2中的分词量为140。则分词A在文档1中的tf=10/100,在文档2中的tf=7/140;分词B在文档1中的tf=5/100,在文档2中的tf=0/140;分词C在文档1中的tf=20/100,在文档2中的tf=10/140。总文档数docNum=2,分词A出现在2个文档中,其idf=1.0+log(2/(2+1));分词B只出现在1个文档中,其idf=1.0+log(2/(1+1));分词C出现在2个文档中,其idf=1.0+log(2/(2+1))。那么每个分词在每个文档中的tf-idf值=该分词在该文档中的tf*该分词的idf,最终计算问题语句在文档1中的tf-idf值=(分词A在文档1中的tf-idf值+分词B在文档1中的tf-idf值+分词C在文档1中的tf-idf值)/3;问题语句在文档2中的tf-idf值=(分词A在文档2中的tf-idf值+分词B在文档2中的tf-idf值+分词C在文档2中的tf-idf值)/3。
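上述tf-idf公式及示例可以用如下示意性代码实现并验证(函数接口为本示意所做的假设,并非本申请限定的实现):

```python
import math

def sentence_tfidf(query_words, doc_words, all_docs):
    """按正文公式计算问题语句在某文档中的tf-idf:
    tf  = 分词在该文档中的词频 / 该文档总分词数
    idf = 1.0 + log(总文档数 / (包含该分词的文档数 + 1))
    句子得分 = 各分词tf-idf之和 / 分词数量。"""
    doc_num = len(all_docs)
    total = 0.0
    for w in query_words:
        tf = doc_words.count(w) / len(doc_words)
        word_in_doc_num = sum(1 for d in all_docs if w in d)
        idf = 1.0 + math.log(doc_num / (word_in_doc_num + 1))
        total += tf * idf
    return total / len(query_words)
```

以正文示例(分词A、B、C,文档1含100个分词、文档2含140个分词)代入,可算出问题语句在文档1中的得分高于在文档2中的得分,即文档1与问题语句匹配度更高。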
204、按照所述评估值由大到小排序,并选取评估值排名在预设数值之前的文档作为关联文档。
由于tf-idf值越大说明该文档对于问题语句越重要,与问题语句匹配度越高,所以评估值越大,说明问题语句与文档的关联程度越高,从而选取评估值排名在预设数值之前的文档作为关联文档。
可以理解的是,如果选取的关联文档过多,后续识别的工作量也会过大,影响智能对话的回答速度,这里优选的关联文档数量为5至10篇。
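按评估值由大到小排序并选取排名靠前文档的步骤,可示意如下(top_k=5对应正文中优选5至10篇的下限,仅为示例取值):

```python
def top_related_docs(scores, top_k=5):
    """scores: {文档编号: 评估值};
    按评估值由大到小排序,取前 top_k 篇作为关联文档。"""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]
```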
205、在预训练阶段,将每个关联文档的部分词遮住,输入至预先训练的阅读理解模型对遮住的部分词进行预测,得到问题语句以及关联文档中每个分词的词向量、词向量的位置信息、词向量的语义信息。
对于本申请实施例,在预训练阶段,bert预训练模型通过将每个关联文档中的部分词遮住,然后利用分词的上下文信息来预测这个分词原本的语义信息,使得学习到的语义信息可以融合一个分词左右两侧的上下文信息,进而提取问题语句以及关联文档中每个分词的词向量、词向量的位置信息、词向量的语义信息。
206、在阅读理解阶段,对问题语句以及关联文档中每个分词的词向量、词向量的位置信息、词向量的语义信息进行编码,输入至预先训练的阅读理解模型预测从关联文档中截取出来的各个部分作为回答语句的概率值。
对于本申请实施例,在阅读理解阶段,具体通过对问题语句以及关联文档中每个分词的词向量、词向量的位置信息、词向量的语义信息进行编码,得到词编码和位置编码,将词编码与位置编码之间的运算结果,输入至预先训练的阅读理解模型以使得位置信息补充到词编码之中,获取问题语句与从关联文档中截取出来的各个部分之间的关联关系,基于问题语句与从关联文档中截取出来的各个部分之间的关联关系,预测从关联文档中截取出来的各个部分作为回答语句的概率值。
具体在对问题语句以及关联文档中每个分词的词向量、词向量的位置信息、词向量的语义信息进行编码的过程中:词向量为每个分词对应的一个768维度的向量;位置信息为预先给每个分词标记的一个整数位,后续根据这个整数位转换为一个768维度的向量;语义信息用于在阅读理解模型中将问题语句与关联文档区分开,所有问题语句中的分词标注为0,所有关联文档中的分词标注为1,后续将0和1转换为一个768维度的向量。阅读理解阶段中预先训练的阅读理解模型的输入即为词向量、位置向量以及语义向量的相加。问题语句对应的回答语句为从关联文档中截取的一段文本,假设回答语句在关联文档中的开始位置为start-point、结束位置为end-point,通过预先训练的阅读理解模型可以预测出从文档中截取出来的各个部分中每个分词作为start-point和end-point的概率值。
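上述"词向量、位置向量与语义向量逐维相加作为模型输入,并预测start-point/end-point概率"的过程,可以用如下极度简化的示意代码勾勒(向量维度、softmax打分方式均为示意假设,真实的BERT模型由多层双向transformer产生start/end的logits):

```python
import math

def add_embeddings(token_vecs, pos_vecs, seg_vecs):
    """模型输入 = 词向量 + 位置向量 + 语义(segment)向量,逐维相加。"""
    return [[t + p + s for t, p, s in zip(tv, pv, sv)]
            for tv, pv, sv in zip(token_vecs, pos_vecs, seg_vecs)]

def softmax(logits):
    """把一组logits归一化为概率分布。"""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def span_probability(start_logits, end_logits, start, end):
    """答案片段[start, end]的概率 = start位置概率 * end位置概率。"""
    return softmax(start_logits)[start] * softmax(end_logits)[end]
```

据此即可对从关联文档中截取出来的各个片段逐一打分,概率值最高的片段即为候选回答语句。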
207、根据筛选指令对从关联文档中截取出来的各个部分作为回答语句的概率值进行排序。
可以理解的是,通常情况下,从关联文档中截取出来的各个部分作为回答语句的概率值越高,说明该部分的内容更适合作为回答语句,可以选取截取出来作为回答语句概率值最高的部分文档生成输出的回答语句,从而向用户提供更准确的回答内容。
进一步地,为了提高输出回答语句的灵活性:用户在输入问题语句的时候,可能考虑话题的实际应用场合以及语境等场景因素,而从关联文档中截取出来的各个部分中作为回答语句概率值最高的部分并不一定满足这些场景因素,所以在生成输出的回答语句之前,可以通过设置筛选指令,结合用户当前的场景因素对从关联文档中截取出来的各个部分作为回答语句的概率值进行排序,从而选取更适合的部分文档生成输出的回答语句,这里对场景因素不进行限定。
208、获取从关联文档中截取出来的各个部分作为回答语句概率值最高的部分文档,生成回答语句。
对于本申请实施例,具体智能问答的流程可以如图3所示:当输入用户的问题语句时,通过从维基百科知识库中实时追踪与问题语句相关联的文档,选取排名前5的关联文档,将问题语句与关联文档中的句子输入至预先训练的阅读理解模型进行短文本阅读理解,从而预测得到文档中的句子作为回答语句的概率值,并选取概率值靠前的回答句子作为最佳答案。
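图3所示的"文档追踪—阅读理解—答案选取"整体流程,可以串联为如下示意性流程(retrieve与read两个函数接口均为本示意所做的假设,分别对应前文的关联文档检索与阅读理解预测):

```python
def answer_question(question, knowledge_base, retrieve, read, top_k=5):
    """检索-阅读式问答的最简流程示意:
    retrieve(question, kb, top_k) -> 关联文档列表(对应文档追踪步骤)
    read(question, doc) -> [(候选回答片段, 概率值), ...](对应阅读理解步骤)
    返回概率值最高的片段作为输出的回答语句。"""
    candidates = []
    for doc in retrieve(question, knowledge_base, top_k):
        candidates.extend(read(question, doc))
    return max(candidates, key=lambda c: c[1])[0]
```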
进一步地,作为图1所述方法的具体实现,本申请实施例提供了一种智能问答装置,如图4所示,所述装置包括:获取单元31、预测单元32、生成单元33。
获取单元31,可以用于当接收到问题语句时,从预先整理的知识库中获取与所述问题语句匹配度排名在预设数值之前的关联文档;
预测单元32,可以用于将每个关联文档中截取出来的各个部分与问题语句组成一条输入语句,输入至预先训练的阅读理解模型,预测从关联文档中截取出来的各个部分作为回答语句的概率值;
生成单元33,可以用于利用所述从关联文档中截取出来的各个部分作为回答语句的概率值,生成输出的回答语句。
本申请实施例提供的一种智能问答装置,当接收到问题语句时,通过从预先整理的知识库中获取与问题语句匹配度排名在预设数值之前的关联文档,进一步将关联文档中截取出来的各个部分与问题语句组成一条输入语句,输入至预先训练的阅读理解模型,预测从关联文档中截取出来的各个部分作为回答语句的概率值,从而生成输出的回答语句。与现有技术中智能问答方法相比,本申请预先整理的知识库中记录有从各个网站上整理的文档集合,提供了更完善的问答数据库,利用预先训练的阅读理解模型,能够针对用户输入的问题语句进行理解,分析用户检索意图,预测出关联文档中作为回答语句的概率值,给出用户优质的回答语句,提高了生成式问答的准确率。
作为图4中所示智能问答装置的进一步说明,图5是根据本申请实施例另一种智能问答装置的结构示意图,如图5所示,所述预先整理的知识库中记录有从各个网站上整理的文档集合,所述装置还包括:
建立单元34,可以用于在所述当接收到问题语句时,从预先整理的知识库中获取与所述问题语句匹配度排名在预设数值之前的关联文档之前对所述预先整理的知识库中的文档集合进行分词处理,建立分词到文档集合中各个文档的倒排索引;
统计单元35,可以用于利用词袋模型对所述文档集合中各个文档的分词进行词频统计,得到分词在各个文档中出现的词频。
在具体应用场景中,如图5所示,所述获取单元31包括:
计算模块311,可以用于基于所述分词到文档集合中各个文档的倒排索引,计算问题语句在各个文档中重要程度的评估值;
选取模块312,可以用于按照所述评估值由大到小排序,并选取评估值排名在预设数值之前的文档作为关联文档。
在具体应用场景中,所述计算模块311,具体可以用于对所述问题语句进行分词处理,基于所述分词到文档集合中各个文档的倒排索引,查询问题语句中各个分词在文档集合中各个文档出现的词频、各个文档中包含的分词量;
所述计算模块311,具体还可以用于根据文档集合中文档数量、问题语句中各个分词在文档集合中各个文档出现的词频、各个文档中包含的分词量,计算问题语句中各个分词在各个文档中重要程度的评估值;
所述计算模块311,具体还可以用于汇总所述问题语句中各个分词在各个文档中重要程度的评估值,得到问题语句在各个文档中重要程度的评估值。
在具体应用场景中,如图5所示,所述预先训练的阅读理解模型通过使用bert预训练模型对问答数据集进行阅读理解任务的fine-tune训练和预测,包括预训练阶段和阅读理解阶段,所述预测单元32包括:
第一预测模块321,可以用于在预训练阶段,将每个关联文档的部分词遮住,输入至预先训练的阅读理解模型对遮住的部分词进行预测,得到问题语句以及关联文档中每个分词的词向量、词向量的位置信息、词向量的语义信息;
第二预测模块322,可以用于在阅读理解阶段,对问题语句以及关联文档中每个分词的词向量、词向量的位置信息、词向量的语义信息进行编码,输入至预先训练的阅读理解模型预测从关联文档中截取出来的各个部分作为回答语句的概率值。
进一步地,所述第二预测模块322,具体可以用于在阅读理解阶段,对问题语句以及关联文档中每个分词的词向量、词向量的位置信息、词向量的语义信息进行编码,得到词编码和位置编码;
所述第二预测模块322,具体还可以用于将词编码与位置编码之间的运算结果,输入至预先训练的阅读理解模型以使得位置信息补充到词编码之中,获取问题语句与从关联文档中截取出来的各个部分之间的关联关系;
所述第二预测模块322,具体还可以用于基于所述问题语句与从关联文档中截取出来的各个部分之间的关联关系,预测从关联文档中截取出来的各个部分作为回答语句的概率值。
在具体应用场景中,如图5所示,所述生成单元33包括:
排序模块331,可以用于根据筛选指令对从关联文档中截取出来的各个部分作为回答语句的概率值进行排序;
生成模块332,可以用于获取从关联文档中截取出来的各个部分作为回答语句概率值最高的部分文档,生成回答语句。
需要说明的是,本实施例提供的一种智能问答装置所涉及各功能单元的其他相应描述,可以参考图1-图2中的对应描述,在此不再赘述。
基于上述如图1、图2所示方法,相应的,本实施例还提供了一种存储介质,其上存储有计算机程序,该程序被处理器执行时实现上述如图1、图2所示的智能问答方法。
基于这样的理解,本申请的技术方案可以以软件产品的形式体现出来,该软件产品可以存储在一个易失性存储介质(例如静态RAM和动态内存DRAM等)中,也可以存储在一个非易失性存储介质(可以是CD-ROM,U盘,移动硬盘等)中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施场景所述的方法。
基于上述如图1、图2所示的方法,以及图4、图5所示的虚拟装置实施例,为了实现上述目的,本申请实施例还提供了一种计算机设备,具体可以为个人计算机、服务器、网络设备等,该实体设备包括存储介质和处理器;存储介质,用于存储计算机程序;处理器,用于执行计算机程序以实现上述如图1、图2所示的智能问答方法。
可选地,该计算机设备还可以包括用户接口、网络接口、摄像头、射频(Radio Frequency,RF)电路,传感器、音频电路、WI-FI模块等等。用户接口可以包括显示屏(Display)、输入单元比如键盘(Keyboard)等,可选用户接口还可以包括USB接口、读卡器接口等。网络接口可选的可以包括标准的有线接口、无线接口(如蓝牙接口、WI-FI接口)等。
本领域技术人员可以理解,本实施例提供的智能问答装置的实体设备结构并不构成对该实体设备的限定,可以包括更多或更少的部件,或者组合某些部件,或者不同的部件布置。
存储介质中还可以包括操作系统、网络通信模块。操作系统是管理上述计算机设备硬件和软件资源的程序,支持信息处理程序以及其它软件和/或程序的运行。网络通信模块用于实现存储介质内部各组件之间的通信,以及与该实体设备中其它硬件和软件之间通信。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到本申请可以借助软件加必要的通用硬件平台的方式来实现,也可以通过硬件实现。通过应用本申请的技术方案,与目前现有技术相比,本申请预先整理的知识库中记录有从各个网站上整理的文档集合,提供了更完善的问答数据库,利用预先训练的阅读理解模型,能够针对用户输入的问题语句进行理解,分析用户检索意图,预测出关联文档中作为回答语句的概率值,给出用户优质的回答语句,提高了生成式问答的准确率。
本领域技术人员可以理解附图只是一个优选实施场景的示意图,附图中的模块或流程并不一定是实施本申请所必须的。本领域技术人员可以理解实施场景中的装置中的模块可以按照实施场景描述进行分布于实施场景的装置中,也可以进行相应变化位于不同于本实施场景的一个或多个装置中。上述实施场景的模块可以合并为一个模块,也可以进一步拆分成多个子模块。
上述本申请序号仅仅为了描述,不代表实施场景的优劣。以上公开的仅为本申请的几个具体实施场景,但是,本申请并非局限于此,任何本领域的技术人员能思之的变化都应落入本申请的保护范围。

Claims (20)

  1. 一种智能问答方法,其中,所述方法包括:
    当接收到问题语句时,从预先整理的知识库中获取与所述问题语句匹配度排名在预设数值之前的关联文档;
    将每个关联文档中截取出来的各个部分与问题语句组成一条输入语句,输入至预先训练的阅读理解模型,预测从关联文档中截取出来的各个部分作为回答语句的概率值;
    利用所述从关联文档中截取出来的各个部分作为回答语句的概率值,生成输出的回答语句。
  2. 根据权利要求1所述的智能问答方法,其中,所述预先整理的知识库中记录有从各个网站上整理的文档集合,在所述当接收到问题语句时,从预先整理的知识库中获取与所述问题语句匹配度排名在预设数值之前的关联文档之前,所述方法还包括:
    对所述预先整理的知识库中的文档集合进行分词处理,建立分词到文档集合中各个文档的倒排索引;
    利用词袋模型对所述文档集合中各个文档的分词进行词频统计,得到分词在各个文档中出现的词频。
  3. 根据权利要求2所述的智能问答方法,其中,所述从预先整理的知识库中获取与所述问题语句匹配度排名在预设数值之前的关联文档,具体包括:
    基于所述分词到文档集合中各个文档的倒排索引,计算问题语句在各个文档中重要程度的评估值;
    按照所述评估值由大到小排序,并选取评估值排名在预设数值之前的文档作为关联文档。
  4. 根据权利要求3所述的智能问答方法,其中,所述基于所述分词到文档集合中各个文档的倒排索引,计算问题语句在各个文档中重要程度的评估值,具体包括:
    对所述问题语句进行分词处理,基于所述分词到文档集合中各个文档的倒排索引,查询问题语句中各个分词在文档集合中各个文档出现的词频、各个文档中包含的分词量;
    根据文档集合中文档数量、问题语句中各个分词在文档集合中各个文档出现的词频、各个文档中包含的分词量,计算问题语句中各个分词在各个文档中重要程度的评估值;
    汇总所述问题语句中各个分词在各个文档中重要程度的评估值,得到问题语句在各个文档中重要程度的评估值。
  5. 根据权利要求1所述的智能问答方法,其中,所述预先训练的阅读理解模型通过使用bert预训练模型对问答数据集进行阅读理解任务的fine-tune训练和预测,包括预训练阶段和阅读理解阶段,所述将每个关联文档中截取出来的各个部分与问题语句组成一条输入语句,输入至预先训练的阅读理解模型,预测从关联文档中截取出来的各个部分作为回答语句的概率值,具体包括:
    在预训练阶段,将每个关联文档的部分词遮住,输入至预先训练的阅读理解模型对遮住的部分词进行预测,得到问题语句以及关联文档中每个分词的词向量、词向量的位置信息、词向量的语义信息;
    在阅读理解阶段,对问题语句以及关联文档中每个分词的词向量、词向量的位置信息、词向量的语义信息进行编码,输入至预先训练的阅读理解模型预测从关联文档中截取出来的各个部分作为回答语句的概率值。
  6. 根据权利要求5所述的智能问答方法,其中,所述在阅读理解阶段,对问题语句以及关联文档中每个分词的词向量、词向量的位置信息、词向量的语义信息进行编码,输入至预先训练的阅读理解模型预测从关联文档中截取出来的各个部分作为回答语句的概率值,具体包括:
    在阅读理解阶段,对问题语句以及关联文档中每个分词的词向量、词向量的位置信息、词向量的语义信息进行编码,得到词编码和位置编码;
    将词编码与位置编码之间的运算结果,输入至预先训练的阅读理解模型以使得位置信息补充到词编码之中,获取问题语句与从关联文档中截取出来的各个部分之间的关联关系;
    基于所述问题语句与从关联文档中截取出来的各个部分之间的关联关系,预测从关联文档中截取出来的各个部分作为回答语句的概率值。
  7. 根据权利要求1所述的智能问答方法,其中,所述利用所述从关联文档中截取出来的各个部分作为回答语句的概率值,生成输出的回答语句,具体包括:
    根据筛选指令对从关联文档中截取出来的各个部分作为回答语句的概率值进行排序;
    获取从关联文档中截取出来的各个部分作为回答语句概率值最高的部分文档,生成回答语句。
  8. 一种智能问答装置,其中,所述装置包括:
    获取单元,用于当接收到问题语句时,从预先整理的知识库中获取与所述问题语句匹配度排名在预设数值之前的关联文档;
    预测单元,用于将每个关联文档中截取出来的各个部分与问题语句组成一条输入语句,输入至预先训练的阅读理解模型,预测从关联文档中截取出来的各个部分作为回答语句的概率值;
    生成单元,用于利用所述从关联文档中截取出来的各个部分作为回答语句的概率值,生成输出的回答语句。
  9. 根据权利要求8所述的智能问答装置,其中,所述预先整理的知识库中记录有从各个网站上整理的文档集合,所述智能问答装置还包括:
    建立单元,用于在所述当接收到问题语句时,从预先整理的知识库中获取与所述问题语句匹配度排名在预设数值之前的关联文档之前,对所述预先整理的知识库中的文档集合进行分词处理,建立分词到文档集合中各个文档的倒排索引;
    统计单元,用于利用词袋模型对所述文档集合中各个文档的分词进行词频统计,得到分词在各个文档中出现的词频。
  10. 根据权利要求9所述的智能问答装置,其中,所述获取单元包括:
    计算模块,用于基于所述分词到文档集合中各个文档的倒排索引,计算问题语句在各个文档中重要程度的评估值;
    选取模块,用于按照所述评估值由大到小排序,并选取评估值排名在预设数值之前的文档作为关联文档。
  11. 根据权利要求10所述的智能问答装置,其中,所述计算模块具体用于对所述问题语句进行分词处理,基于所述分词到文档集合中各个文档的倒排索引,查询问题语句中各个分词在文档集合中各个文档出现的词频、各个文档中包含的分词量;
    所述计算模块具体还用于根据文档集合中文档数量、问题语句中各个分词在文档集合中各个文档出现的词频、各个文档中包含的分词量,计算问题语句中各个分词在各个文档中重要程度的评估值;
    所述计算模块具体还用于汇总所述问题语句中各个分词在各个文档中重要程度的评估值,得到问题语句在各个文档中重要程度的评估值。
  12. 根据权利要求8所述的智能问答装置,其中,所述预先训练的阅读理解模型通过使用bert预训练模型对问答数据集进行阅读理解任务的fine-tune训练和预测,包括预训练阶段和阅读理解阶段,所述预测单元包括:
    第一预测模块,用于在预训练阶段,将每个关联文档的部分词遮住,输入至预先训练的阅读理解模型对遮住的部分词进行预测,得到问题语句以及关联文档中每个分词的词向量、词向量的位置信息、词向量的语义信息;
    第二预测模块,用于在阅读理解阶段,对问题语句以及关联文档中每个分词的词向量、词向量的位置信息、词向量的语义信息进行编码,输入至预先训练的阅读理解模型预测从关联文档中截取出来的各个部分作为回答语句的概率值。
  13. 根据权利要求12所述的智能问答装置,其中,所述第二预测模块具体用于在阅读理解阶段,对问题语句以及关联文档中每个分词的词向量、词向量的位置信息、词向量的语义信息进行编码,得到词编码和位置编码;
    所述第二预测模块具体还用于将词编码与位置编码之间的运算结果,输入至预先训练的阅读理解模型以使得位置信息补充到词编码之中,获取问题语句与从关联文档中截取出来的各个部分之间的关联关系;
    所述第二预测模块具体还用于基于所述问题语句与从关联文档中截取出来的各个部分之间的关联关系,预测从关联文档中截取出来的各个部分作为回答语句的概率值。
  14. 根据权利要求8所述的智能问答装置,其中,所述生成单元包括:
    排序模块,用于根据筛选指令对从关联文档中截取出来的各个部分作为回答语句的概率值进行排序;
    生成模块,用于获取从关联文档中截取出来的各个部分作为回答语句概率值最高的部分文档,生成回答语句。
  15. 一种计算机设备,包括存储器和处理器,所述存储器存储有计算机程序,其中,所述处理器执行所述计算机程序时实现以下步骤:
    当接收到问题语句时,从预先整理的知识库中获取与所述问题语句匹配度排名在预设数值之前的关联文档;
    将每个关联文档中截取出来的各个部分与问题语句组成一条输入语句,输入至预先训练的阅读理解模型,预测从关联文档中截取出来的各个部分作为回答语句的概率值;
    利用所述从关联文档中截取出来的各个部分作为回答语句的概率值,生成输出的回答语句。
  16. 根据权利要求15所述的计算机设备,其中,所述处理器执行所述计算机程序时还实现以下步骤:
    所述预先整理的知识库中记录有从各个网站上整理的文档集合,在所述当接收到问题语句时,从预先整理的知识库中获取与所述问题语句匹配度排名在预设数值之前的关联文档之前,对所述预先整理的知识库中的文档集合进行分词处理,建立分词到文档集合中各个文档的倒排索引;利用词袋模型对所述文档集合中各个文档的分词进行词频统计,得到分词在各个文档中出现的词频。
  17. 根据权利要求16所述的计算机设备,其中,所述从预先整理的知识库中获取与所述问题语句匹配度排名在预设数值之前的关联文档,具体包括:
    基于所述分词到文档集合中各个文档的倒排索引,计算问题语句在各个文档中重要程度的评估值;
    按照所述评估值由大到小排序,并选取评估值排名在预设数值之前的文档作为关联文档。
  18. 一种计算机存储介质,其上存储有计算机程序,其中,所述计算机程序被处理器执行时实现以下步骤:
    当接收到问题语句时,从预先整理的知识库中获取与所述问题语句匹配度排名在预设数值之前的关联文档;
    将每个关联文档中截取出来的各个部分与问题语句组成一条输入语句,输入至预先训练的阅读理解模型,预测从关联文档中截取出来的各个部分作为回答语句的概率值;
    利用所述从关联文档中截取出来的各个部分作为回答语句的概率值,生成输出的回答语句。
  19. 根据权利要求18所述的计算机存储介质,其中,所述计算机程序被处理器执行时还实现以下步骤:
    所述预先整理的知识库中记录有从各个网站上整理的文档集合,在所述当接收到问题语句时,从预先整理的知识库中获取与所述问题语句匹配度排名在预设数值之前的关联文档之前,对所述预先整理的知识库中的文档集合进行分词处理,建立分词到文档集合中各个文档的倒排索引;利用词袋模型对所述文档集合中各个文档的分词进行词频统计,得到分词在各个文档中出现的词频。
  20. 根据权利要求19所述的计算机存储介质,其中,所述从预先整理的知识库中获取与所述问题语句匹配度排名在预设数值之前的关联文档,具体包括:
    基于所述分词到文档集合中各个文档的倒排索引,计算问题语句在各个文档中重要程度的评估值;
    按照所述评估值由大到小排序,并选取评估值排名在预设数值之前的文档作为关联文档。
PCT/CN2020/092963 2020-02-13 2020-05-28 智能问答方法、装置、计算机设备及计算机存储介质 WO2021159632A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010091180.4A CN111368042A (zh) 2020-02-13 2020-02-13 智能问答方法、装置、计算机设备及计算机存储介质
CN202010091180.4 2020-02-13

Publications (1)

Publication Number Publication Date
WO2021159632A1 true WO2021159632A1 (zh) 2021-08-19

Family

ID=71206240

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/092963 WO2021159632A1 (zh) 2020-02-13 2020-05-28 智能问答方法、装置、计算机设备及计算机存储介质

Country Status (2)

Country Link
CN (1) CN111368042A (zh)
WO (1) WO2021159632A1 (zh)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100326B (zh) * 2020-08-28 2023-04-18 广州探迹科技有限公司 一种抗干扰的融合检索和机器阅读理解的问答方法及系统
CN112347223B (zh) * 2020-11-03 2023-09-22 平安科技(深圳)有限公司 文档检索方法、设备及计算机可读存储介质
CN112287085B (zh) * 2020-11-06 2023-12-05 中国平安财产保险股份有限公司 语义匹配方法、系统、设备及存储介质
CN112597291A (zh) * 2020-12-26 2021-04-02 中国农业银行股份有限公司 一种智能问答的实现方法、装置及设备
CN112685538B (zh) * 2020-12-30 2022-10-14 北京理工大学 一种结合外部知识的文本向量检索方法
CN112800202A (zh) * 2021-02-05 2021-05-14 北京金山数字娱乐科技有限公司 文档处理方法及装置
CN112883182A (zh) * 2021-03-05 2021-06-01 海信电子科技(武汉)有限公司 一种基于机器阅读的问答匹配方法及装置
CN113076431B (zh) * 2021-04-28 2022-09-02 平安科技(深圳)有限公司 机器阅读理解的问答方法、装置、计算机设备及存储介质
CN113239169B (zh) * 2021-06-01 2023-12-05 平安科技(深圳)有限公司 基于人工智能的回答生成方法、装置、设备及存储介质
CN113704408A (zh) * 2021-08-31 2021-11-26 工银科技有限公司 检索方法、装置、电子设备、存储介质和程序产品
CN113934825B (zh) * 2021-12-21 2022-03-08 北京云迹科技有限公司 一种问题回答方法、装置和电子设备
CN114444488B (zh) * 2022-01-26 2023-03-24 中国科学技术大学 一种少样本机器阅读理解方法、系统、设备及存储介质
CN114780672A (zh) * 2022-03-23 2022-07-22 清华大学 一种基于网络资源的医学问题问答处理方法及装置
CN115293132B (zh) * 2022-09-30 2022-12-30 腾讯科技(深圳)有限公司 虚拟场景的对话处理方法、装置、电子设备及存储介质
CN117474043B (zh) * 2023-12-27 2024-04-02 湖南三湘银行股份有限公司 一种基于训练模型的智能问答系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108446320A (zh) * 2018-02-09 2018-08-24 北京搜狗科技发展有限公司 一种数据处理方法、装置和用于数据处理的装置
US20180276525A1 (en) * 2015-12-03 2018-09-27 Huawei Technologies Co., Ltd. Method and neural network system for human-computer interaction, and user equipment
CN110309283A (zh) * 2019-06-28 2019-10-08 阿里巴巴集团控股有限公司 一种智能问答的答案确定方法及装置
CN110390003A (zh) * 2019-06-19 2019-10-29 北京百度网讯科技有限公司 基于医疗的问答处理方法及系统、计算机设备及可读介质
CN110502621A (zh) * 2019-07-03 2019-11-26 平安科技(深圳)有限公司 问答方法、问答装置、计算机设备及存储介质

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020010A (zh) * 2017-10-10 2019-07-16 阿里巴巴集团控股有限公司 数据处理方法、装置及电子设备
CN109918487A (zh) * 2019-01-28 2019-06-21 平安科技(深圳)有限公司 基于网络百科全书的智能问答方法和系统
CN110688491B (zh) * 2019-09-25 2022-05-10 暨南大学 基于深度学习的机器阅读理解方法、系统、设备及介质


Also Published As

Publication number Publication date
CN111368042A (zh) 2020-07-03

Similar Documents

Publication Publication Date Title
WO2021159632A1 (zh) 智能问答方法、装置、计算机设备及计算机存储介质
CN111753060B (zh) 信息检索方法、装置、设备及计算机可读存储介质
CN108647205B (zh) 细粒度情感分析模型构建方法、设备及可读存储介质
CN110147551B (zh) 多类别实体识别模型训练、实体识别方法、服务器及终端
CN111159485B (zh) 尾实体链接方法、装置、服务器及存储介质
WO2020077824A1 (zh) 异常问题的定位方法、装置、设备及存储介质
CN110795527B (zh) 候选实体排序方法、训练方法及相关装置
CN112287069B (zh) 基于语音语义的信息检索方法、装置及计算机设备
CN109710732B (zh) 信息查询方法、装置、存储介质和电子设备
CN112100326B (zh) 一种抗干扰的融合检索和机器阅读理解的问答方法及系统
CN110765247A (zh) 一种用于问答机器人的输入提示方法及装置
CN111767394A (zh) 一种基于人工智能专家系统的摘要提取方法及装置
CN112307164A (zh) 信息推荐方法、装置、计算机设备和存储介质
CN110727769A (zh) 语料库生成方法及装置、人机交互处理方法及装置
CN111061876A (zh) 事件舆情数据分析方法及装置
CN110659392A (zh) 检索方法及装置、存储介质
CN114330704A (zh) 语句生成模型更新方法、装置、计算机设备和存储介质
CN113569118A (zh) 自媒体推送方法、装置、计算机设备及存储介质
CN113033912A (zh) 问题解决人推荐方法及装置
CN113656579A (zh) 文本分类方法、装置、设备及介质
CN112330387A (zh) 一种应用于看房软件的虚拟经纪人
CN112148855A (zh) 一种智能客服问题检索方法、终端以及存储介质
CN115618968B (zh) 新意图发现方法、装置、电子设备及存储介质
CN114153947A (zh) 一种文档处理方法、装置、设备及存储介质
CN116186220A (zh) 信息检索方法、问答处理方法、信息检索装置及系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20918175

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20918175

Country of ref document: EP

Kind code of ref document: A1