CN116340502A - Information retrieval method and device based on semantic understanding - Google Patents
- Publication number
- CN116340502A
- Authority
- CN
- China
- Prior art keywords
- text information
- text
- information
- semantic
- retrieval
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An information retrieval method and device based on semantic understanding are disclosed. The information retrieval method includes: acquiring first text information and a plurality of pieces of second text information; determining the semantic similarity between the first text information and each piece of second text information; selecting, according to the semantic similarity between the first text information and each piece of second text information, at least one piece of text information to be retrieved from the plurality of pieces of second text information; extracting, from each of the at least one piece of text information to be retrieved, third text information semantically related to the first text information, so as to form a third text information set; obtaining a multi-text summary corresponding to at least two pieces of third text information in the third text information set; and determining, based on the multi-text summary, a retrieval result corresponding to the first text information. Through data processing operations such as information screening, extraction, and summarization, the information retrieval method according to some embodiments of the present application can efficiently and accurately complete advanced information retrieval tasks such as intelligent question answering.
Description
Technical Field
The present application relates to the field of natural language processing, and in particular to an information retrieval method and device based on semantic understanding, a computing device, a computer-readable storage medium, and a computer program product.
Background
With the rapid development of the Internet, more and more information can be retrieved from it through search engines, and search results are characterized by massive data volume, diverse forms, and comprehensive coverage. On the one hand, this raises the likelihood that users find what they are looking for. On the other hand, users are overwhelmed by the sheer number of search results and cannot obtain an accurate answer in a short time. For example, traditional search engines based on keyword matching and single-document summarization are limited to returning a list of web pages or documents related to the user's query and cannot give an accurate answer to the question itself (users must look up or work out the answer from the related web pages or documents, using information such as titles and summaries). Such engines cannot meet users' needs and expectations for obtaining information quickly.
As users' expectations of search engines grow, information retrieval is shifting from elementary forms, such as recalling a basic list of relevant web pages or documents, to advanced forms, such as intelligent question-answering retrieval. The purpose of intelligent question-answering retrieval is to answer users' questions in concise and accurate natural language, and it aims to provide a more effective tool for acquiring information. To realize advanced forms of information retrieval such as intelligent question answering, intelligent information retrieval technologies based on semantic understanding or machine reading comprehension have emerged. However, information retrieval methods in the related art have the following problems. First, retrieval based on keyword or character-string matching can only retrieve articles that are literally identical to the query and cannot retrieve information that is worded differently but has the same meaning. This easily leads to the loss of important information resources that are highly relevant to the retrieval question, so the breadth of the retrieval results is limited and their accuracy is low. Second, retrieval based on keyword extraction and comparison relies on relatively complex preset rules, which entails a large amount of computation and work and therefore low efficiency. Moreover, keywords cannot completely and accurately reflect the features of the whole retrieval question, so the accuracy of the retrieval results is low.
Summary of the Invention
In view of this, the present application provides an information retrieval method and device based on semantic understanding, a computing device, a computer-readable storage medium, and a computer program product, with the aim of alleviating or overcoming some or all of the above-mentioned defects as well as other possible defects.
According to a first aspect of the present application, an information retrieval method based on semantic understanding is provided, including: acquiring first text information indicating a retrieval target and a plurality of pieces of second text information indicating candidate retrieval objects; determining the semantic similarity between the first text information and each piece of second text information among the plurality of pieces of second text information; selecting, according to the semantic similarity between the first text information and each piece of second text information, at least one piece of text information to be retrieved from the plurality of pieces of second text information; extracting, from each of the at least one piece of text information to be retrieved, third text information semantically related to the first text information, so as to form a third text information set; obtaining a multi-text summary corresponding to at least two pieces of third text information in the third text information set; and determining, based on the multi-text summary, a retrieval result corresponding to the first text information.
In the information retrieval method according to some embodiments of the present application, the multi-text summary includes a generative multi-text summary.
In the information retrieval method according to some embodiments of the present application, determining the semantic similarity between the first text information and each piece of second text information among the plurality of pieces of second text information includes: obtaining a first semantic feature vector corresponding to the first text information and a plurality of second semantic feature vectors respectively corresponding to the plurality of pieces of second text information; calculating the similarity between the first semantic feature vector and each of the plurality of second semantic feature vectors; and determining, according to the similarity between the first semantic feature vector and each second semantic feature vector, the semantic similarity between the first text information and each piece of second text information.
In the information retrieval method according to some embodiments of the present application, calculating the similarity between the first semantic feature vector and each of the plurality of second semantic feature vectors includes: calculating, based on the distance between each second semantic feature vector and the first semantic feature vector, a first similarity between the first semantic feature vector and each second semantic feature vector; calculating, based on the cosine of the angle between each second semantic feature vector and the first semantic feature vector, a second similarity between the first semantic feature vector and each second semantic feature vector; and determining, based on at least one of the first similarity and the second similarity, the similarity between the first semantic feature vector and each second semantic feature vector.
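The two similarity measures described above can be illustrated with a minimal Python sketch. The text only states that the first similarity is based on a distance and the second on the cosine of the angle between the vectors; the use of Euclidean distance, the mapping `1 / (1 + distance)`, and the weighted combination in `combined_similarity` are assumptions made here for illustration and are not taken from the application.

```python
import math
from typing import Sequence


def distance_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """First similarity: Euclidean distance mapped into (0, 1].

    The mapping 1 / (1 + d) is an illustrative assumption; any monotonically
    decreasing function of the distance would serve the same purpose.
    """
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Second similarity: cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # the angle is undefined for a zero vector; treat as unrelated
    return dot / (norm_a * norm_b)


def combined_similarity(a: Sequence[float], b: Sequence[float],
                        weight: float = 0.5) -> float:
    """Hypothetical combination: a weighted sum of the two similarities."""
    return (weight * distance_similarity(a, b)
            + (1.0 - weight) * cosine_similarity(a, b))
```

Setting `weight` to 1.0 or 0.0 reduces the combination to a single measure, which matches the "at least one of" wording of the embodiment.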
In the information retrieval method according to some embodiments of the present application, obtaining the first semantic feature vector corresponding to the first text information and the plurality of second semantic feature vectors respectively corresponding to the plurality of pieces of second text information includes: determining the first semantic feature vector corresponding to the first text information by using a semantic understanding model; and obtaining the plurality of second semantic feature vectors from a preset semantic feature vector index library, in which the plurality of second semantic feature vectors, determined by using the same semantic understanding model, are stored.
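The point of the index library is that candidate vectors are computed once, ahead of time, with the same model that later encodes the query, so only the query needs to be encoded at retrieval time. The sketch below shows that division of work. The class `VectorIndex` and the function `toy_encode` are hypothetical names introduced here; `toy_encode` is a trivial stand-in and not the semantic understanding model of the application.

```python
from typing import Callable, Dict, List

Vector = List[float]


class VectorIndex:
    """Hypothetical in-memory index library of precomputed feature vectors."""

    def __init__(self, encode: Callable[[str], Vector]) -> None:
        self._encode = encode              # the shared encoder
        self._vectors: Dict[str, Vector] = {}

    def add(self, doc_id: str, text: str) -> None:
        """Offline step: encode a candidate text once and store its vector."""
        self._vectors[doc_id] = self._encode(text)

    def get(self, doc_id: str) -> Vector:
        """Online step: look up a stored vector without re-encoding."""
        return self._vectors[doc_id]

    def encode_query(self, query: str) -> Vector:
        """Online step: encode the query with the same encoder."""
        return self._encode(query)


def toy_encode(text: str) -> Vector:
    """Toy stand-in encoder (assumption): text length and number of spaces."""
    return [float(len(text)), float(text.count(" "))]


index = VectorIndex(toy_encode)
index.add("d1", "a b c")
query_vector = index.encode_query("x y")
```

Because both sides go through the same `encode` callable, the query vector and the stored vectors live in the same feature space, which is what makes the similarity comparison meaningful.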
In the information retrieval method according to some embodiments of the present application, extracting, from each of the at least one piece of text information to be retrieved, third text information semantically related to the first text information, so as to form a third text information set, includes: for each piece of text information to be retrieved, determining from it, by using a reading comprehension model, fourth text information indicating a candidate retrieval result corresponding to the first text information; extracting, from each piece of text information to be retrieved, third text information that contains the fourth text information; and constructing the third text information set based on the third text information extracted from each piece of text information to be retrieved.
In the information retrieval method according to some embodiments of the present application, extracting, from each piece of text information to be retrieved, the third text information that contains the fourth text information includes one of the following steps: extracting the sentence in which the fourth text information is located as the third text information; extracting the natural paragraph in which the fourth text information is located as the third text information; or extracting the fourth text information itself as the third text information.
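As an illustration of the first option (taking the sentence that contains the answer span), the following sketch widens a character span to sentence boundaries. The function name `sentence_containing` and the punctuation-based sentence splitter are assumptions; the naive splitter would, for instance, also break on the period in a decimal number.

```python
import re


def sentence_containing(text: str, start: int, end: int) -> str:
    """Return the sentence(s) covering characters [start, end) of `text`.

    Sentence boundaries are approximated by Western and Chinese
    sentence-ending punctuation followed by optional whitespace.
    """
    # Candidate boundaries: start of text plus the position after each terminator.
    bounds = [0] + [m.end() for m in re.finditer(r"[.!?。！？]\s*", text)]
    if bounds[-1] != len(text):
        bounds.append(len(text))
    left = max(b for b in bounds if b <= start)   # nearest boundary before the span
    right = min(b for b in bounds if b >= end)    # nearest boundary after the span
    return text[left:right].strip()


passage = "Paris is in France. The Eiffel Tower is 330 m tall. It opened in 1889."
answer_start = passage.index("330 m")
context = sentence_containing(passage, answer_start, answer_start + len("330 m"))
```

Here `context` is the full sentence around the answer span. The paragraph option could be handled the same way with paragraph breaks as boundaries.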
In the information retrieval method according to some embodiments of the present application, determining, for each piece of text information to be retrieved and by using the reading comprehension model, the fourth text information indicating the candidate retrieval result corresponding to the first text information includes performing the following steps for each piece of text information to be retrieved: forming first text information to be processed by concatenating the first text information and the text information to be retrieved; performing word segmentation on the first text information to be processed to obtain a token sequence, the token sequence including a first token sequence corresponding to the first text information and a second token sequence corresponding to the text information to be retrieved; inputting the token sequence into the reading comprehension model to obtain a first probability and a second probability for each token in the second token sequence, where the first probability of a token is the probability that the token is the start token of the fourth text information and the second probability of a token is the probability that the token is the end token of the fourth text information; determining, according to the first probability and the second probability of each token in the second token sequence, the start token and the end token of the fourth text information from the second token sequence; and determining, according to the start token and the end token, the fourth text information from the text information to be retrieved.
In the information retrieval method according to some embodiments of the present application, obtaining the multi-text summary corresponding to at least two pieces of third text information in the third text information set includes: for each piece of third text information in the third text information set, determining a retrieval matching degree for that third text information according to at least one of the first probability of the start token and the second probability of the end token of the fourth text information it contains; selecting at least two pieces of third text information from the third text information set according to the retrieval matching degree of each piece of third text information; concatenating the at least two pieces of third text information in descending order of their retrieval matching degrees to form second text information to be processed; and generating, by using a text summarization model, the multi-text summary corresponding to the second text information to be processed.
In the information retrieval method according to some embodiments of the present application, determining the retrieval matching degree for each piece of third text information in the third text information set, according to at least one of the first probability of the start token and the second probability of the end token of the fourth text information it contains, includes determining the retrieval matching degree based on at least one of the following values: the arithmetic mean of the first probability of the start token and the second probability of the end token; the geometric mean of the first probability of the start token and the second probability of the end token; the maximum of the first probability of the start token and the second probability of the end token; and the minimum of the first probability of the start token and the second probability of the end token.
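The four candidate values for the retrieval matching degree, and the ordering of candidates before summarization, can be sketched in a few lines of Python. The function names, the `mode` parameter, the `(text, p_start, p_end)` tuple layout, and the space separator are assumptions for illustration; the text summarization model itself is outside the scope of this sketch.

```python
import math
from typing import List, Tuple


def match_score(p_start: float, p_end: float, mode: str = "arithmetic") -> float:
    """Retrieval matching degree from the start and end token probabilities."""
    if mode == "arithmetic":
        return (p_start + p_end) / 2.0
    if mode == "geometric":
        return math.sqrt(p_start * p_end)
    if mode == "max":
        return max(p_start, p_end)
    if mode == "min":
        return min(p_start, p_end)
    raise ValueError(f"unknown mode: {mode}")


def build_summary_input(candidates: List[Tuple[str, float, float]],
                        top_k: int = 2, mode: str = "arithmetic",
                        sep: str = " ") -> str:
    """Keep the top_k candidates and join them in descending score order.

    Each candidate is a (text, p_start, p_end) tuple. The returned string is
    what would be handed to a text summarization model.
    """
    ranked = sorted(candidates,
                    key=lambda c: match_score(c[1], c[2], mode),
                    reverse=True)
    return sep.join(text for text, _, _ in ranked[:top_k])


candidates = [("A", 0.2, 0.2), ("B", 0.9, 0.9), ("C", 0.5, 0.5)]
summary_input = build_summary_input(candidates, top_k=2)
```

The minimum is the most conservative of the four choices, since a span scores highly only if both of its boundaries are confident; the maximum is the most permissive.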
In the information retrieval method according to some embodiments of the present application, determining, based on the multi-text summary, the retrieval result corresponding to the first text information includes: generating, based on the multi-text summary, a first retrieval result corresponding to the first text information; generating, based on the at least two pieces of third text information corresponding to the multi-text summary, a second retrieval result corresponding to the first text information; and determining, according to the first retrieval result and the second retrieval result, the retrieval result corresponding to the first text information, such that the retrieval result includes at least one of the first retrieval result and the second retrieval result.
In the information retrieval method according to some embodiments of the present application, selecting, according to the semantic similarity between the first text information and each piece of second text information, at least one piece of text information to be retrieved from the plurality of pieces of second text information includes: sorting the plurality of pieces of second text information in descending order of their semantic similarity to the first text information; and selecting the first M pieces of second text information in the sorted order as M pieces of text information to be retrieved, where M is a preset positive integer.
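The sort-and-take-first-M step can be written compactly with the standard library. In this sketch the function name `select_top_m` and the dictionary layout (candidate identifier mapped to its similarity score) are assumptions; `heapq.nlargest` returns the M highest-scoring keys already in descending order, which is equivalent to a full sort followed by truncation.

```python
import heapq
from typing import Dict, List


def select_top_m(similarities: Dict[str, float], m: int) -> List[str]:
    """Return the identifiers of the M most similar candidates, best first."""
    if m <= 0:
        raise ValueError("m must be a positive integer")
    return heapq.nlargest(m, similarities, key=similarities.get)


scores = {"doc_a": 0.2, "doc_b": 0.9, "doc_c": 0.5}
top_two = select_top_m(scores, 2)
```

If M exceeds the number of candidates, all candidates are returned in descending order of similarity.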
In the information retrieval method according to some embodiments of the present application, the retrieval target includes a question to be retrieved, and the retrieval result includes an answer corresponding to the question to be retrieved.
According to another aspect of the present application, an information retrieval device based on semantic understanding is provided, including: a first acquisition module configured to acquire first text information indicating a retrieval target and a plurality of pieces of second text information indicating candidate retrieval objects; a first determination module configured to determine the semantic similarity between the first text information and each piece of second text information among the plurality of pieces of second text information; a selection module configured to select, according to the semantic similarity between the first text information and each piece of second text information, at least one piece of text information to be retrieved from the plurality of pieces of second text information; an extraction module configured to extract, from each of the at least one piece of text information to be retrieved, third text information semantically related to the first text information, so as to form a third text information set; a second acquisition module configured to obtain a multi-text summary corresponding to at least two pieces of third text information in the third text information set; and a second determination module configured to determine, based on the multi-text summary, a retrieval result corresponding to the first text information.
According to yet another aspect of the present application, a computing device is provided, including a memory and a processor, where the memory stores a computer program that, when executed by the processor, causes the processor to perform the steps of the information retrieval method according to some embodiments of the present application.
According to still another aspect of the present application, a computer-readable storage medium is provided, on which computer-readable instructions are stored; when executed, the computer-readable instructions implement the information retrieval method according to some embodiments of the present application.
According to another aspect of the present application, a computer program product is provided, including computer instructions that, when executed by a processor, implement the information retrieval method according to some embodiments of the present application.
In the information retrieval method and device based on semantic understanding according to some embodiments of the present application, advanced information retrieval tasks such as intelligent question answering are completed efficiently and accurately through a three-stage information processing procedure: information screening based on semantic similarity, information extraction based on semantic relevance, and multi-text summarization.

First, a relatively small number of pieces of text information to be retrieved are screened out of a large number of pieces of second text information according to the semantic similarity (rather than a purely literal match) between the first text information (for example, the question to be retrieved) and the second text information (for example, candidate articles from which the answer to the question is to be obtained). This significantly improves overall efficiency without losing important retrieval resources that are highly relevant to the retrieval question, that is, while preserving retrieval breadth and accuracy. It also overcomes two problems of the related art: the loss of important retrieval resources caused by exact keyword matching, and the low efficiency caused by complex procedures and a huge amount of computation.

Second, for each piece of text information to be retrieved that was obtained in the first stage, third text information corresponding to the first text information (that is, text related to a candidate answer to the question to be retrieved) is extracted from it based on semantic relevance. The semantic features of the question and of the text to be retrieved are thus used a second time to extract the candidate answers and their related text (that is, the third text information set), which further ensures a high correlation between the third text information set and the first text information and a high accuracy of the final retrieval result.

Finally, a multi-text summary (for example, a generative text summary) of at least two pieces of third text information in the third text information set is generated as the final answer to the question to be retrieved. Because such a multi-text summary fuses several pieces of candidate-answer-related text with high retrieval matching degrees, it can further improve the quality and accuracy of the retrieval result.
These and other advantages of the present application will become apparent from, and are elucidated with reference to, the embodiments described hereinafter.
Brief Description of the Drawings
Embodiments of the present application will now be described in greater detail with reference to the accompanying drawings, in which:
FIG. 1 shows an exemplary application scenario of an information retrieval method based on semantic understanding according to some embodiments of the present application;
FIG. 2 shows an exemplary schematic block diagram of an information retrieval method based on semantic understanding according to some embodiments of the present application;
FIG. 3 shows a flowchart of an information retrieval method based on semantic understanding according to some embodiments of the present application;
FIG. 4 shows an example flowchart of the semantic similarity determination step in the information retrieval method based on semantic understanding according to some embodiments of the present application;
FIG. 5 shows a schematic diagram of determining semantic similarity using a semantic understanding model according to some embodiments of the present application;
FIG. 6 shows an example flowchart of the semantically related information extraction step in the information retrieval method based on semantic understanding according to some embodiments of the present application;
FIG. 7 shows a schematic diagram of extracting semantically related information using a reading comprehension model according to some embodiments of the present application;
FIG. 8 shows an example flowchart of the multi-text summary acquisition step in the information retrieval method based on semantic understanding according to some embodiments of the present application;
FIG. 9 shows a schematic diagram of generating a multi-text summary using a text summarization model according to some embodiments of the present application;
FIG. 10 is a schematic diagram of the complete process of an information retrieval method based on semantic understanding according to some embodiments of the present application;
FIG. 11 shows an exemplary structural block diagram of an information retrieval device based on semantic understanding according to some embodiments of the present application;
FIG. 12 schematically shows an example block diagram of a computing device according to some embodiments of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this application will be thorough and complete and will fully convey the concept of the example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts throughout the drawings, and repeated descriptions of them are therefore omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of the embodiments of the present application. Those skilled in the art will appreciate, however, that the technical solutions of the present application may be practiced without one or more of these specific details, or with other methods, components, devices, steps, and so on. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail, to avoid obscuring aspects of the present application.
The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flowcharts shown in the drawings are only exemplary illustrations; they need not include all of the contents and operations/steps, nor must the operations/steps be performed in the order described. For example, some operations/steps may be decomposed, and some may be combined or partly combined, so the actual order of execution may change according to the actual situation. It should be understood that although the terms first, second, third, and so on may be used herein to describe various components, these components should not be limited by these terms. As used herein, the term "and/or" and similar terms include any one of, several of, and all combinations of the associated listed items.
Those skilled in the art will understand that the accompanying drawings are only schematic diagrams of example embodiments, and that the modules or processes in the drawings are not necessarily required for implementing the present application; they therefore cannot be used to limit the protection scope of the present application.
Before the embodiments of the present application are described in detail, some related concepts are first explained for the sake of clarity.
1. Text summarization refers to extracting key information or main points from one or more source texts (for example, an article or a collection of articles) by various techniques, in order to summarize and present the main content or useful information of the source texts. Herein, text summaries are divided by input type into single-text summaries (the summary of a single text) and multi-text summaries (the summary of a text collection composed of multiple texts); by implementation technique, they are divided into extractive text summaries (for example, a summary composed of one or more fragments extracted directly from the source text) and generative text summaries (for example, a summary generated by understanding, summarizing, and reasoning about the source text).
2. A semantic understanding model, as used herein, refers to a neural network model for semantically understanding or semantically encoding natural-language text information. Its input may be a serialized representation of the text to be processed (for example, a word segmentation sequence), and its output may be the semantic feature vector corresponding to that text information. In the present application, the semantic understanding model may be a pre-trained natural language processing model, such as a BERT model, a RoBERTa model, or an ALBERT model.
3. A reading comprehension model, also called a machine reading comprehension model or a question answering model, as used herein refers to a neural network model for semantically understanding natural-language articles or corpora and answering related questions. Its input may be a serialized representation of the question text to be processed together with the corresponding text to be retrieved (for example, an article), and its output may be the answer or candidate answers corresponding to the question to be processed, or the probability of each word in the article being the start word and the end word of a candidate answer. In the present application, the reading comprehension model may likewise be a pre-trained natural language processing model, such as a BERT model, a RoBERTa model, or an ALBERT model.
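To make the last output form concrete, the following is a minimal sketch of how per-word start and end probabilities might be turned into a candidate answer span. The scoring rule (product of the start and end probabilities), the `max_len` limit, and the function name `best_answer_span` are illustrative assumptions and are not taken from the application.

```python
from typing import Sequence, Tuple


def best_answer_span(
    start_probs: Sequence[float],
    end_probs: Sequence[float],
    max_len: int = 10,
) -> Tuple[int, int, float]:
    """Return (start_index, end_index, score) of the highest-scoring span.

    Assumption: a span is scored by start_probs[i] * end_probs[j],
    with i <= j and at most max_len words in the span.
    """
    if not start_probs or len(start_probs) != len(end_probs):
        raise ValueError("start_probs and end_probs must be non-empty and equal in length")
    if max_len <= 0:
        raise ValueError("max_len must be positive")

    best = (0, 0, -1.0)
    for i, p_start in enumerate(start_probs):
        # Only consider end positions at or after the start, within max_len words.
        for j in range(i, min(i + max_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best[2]:
                best = (i, j, score)
    return best


if __name__ == "__main__":
    words = ["the", "capital", "is", "Paris", "."]
    start = [0.05, 0.05, 0.10, 0.70, 0.10]
    end = [0.05, 0.05, 0.05, 0.80, 0.05]
    i, j, score = best_answer_span(start, end)
    print(" ".join(words[i : j + 1]), score)
```

In a real system the two probability lists would come from the reading comprehension model itself; here they are hand-written toy values.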
4. A text summarization model refers to a neural network model for generating a content summary of input text. Its input may be one or more texts, and its output may be the corresponding text summary. Herein, the text summarization model may use a pre-trained encoder-decoder structure, for example a deep-neural-network-based framework that maps a (source text) sequence to a (summary text) sequence, in which the encoder (for example, a semantic encoder) converts the source text sequence into a corresponding sequence of semantic vectors, and the decoder (for example, a recurrent decoder) generates the summary text sequence from that sequence of semantic vectors (for example, through an attention mechanism and recurrent decoding).
To address the problems of related retrieval techniques, namely complex retrieval procedures, heavy computation, low efficiency, and limited retrieval breadth and precision, the present application proposes an information retrieval method based on semantic understanding. The information retrieval method according to the present application makes full use of analysis and comparison of the intrinsic semantic features of the question (the first text information, that is, the retrieval target) and the articles (the second text information, that is, the retrieval objects), in place of the external literal or keyword matching used in the related art. Using, for example, semantic similarity (a semantic understanding model) and semantic relevance (a question-answering reading comprehension model), it screens a large amount of original data to be retrieved (for example, articles) and extracts text information related to multiple candidate retrieval results (that is, the third text information set, or the set of candidate-answer-related information). It then uses automatic multi-text summarization (for example, through a text summarization model) to generate a multi-text summary from at least part of that candidate-result-related information, as at least part of the final retrieval result (for example, the answer to the question).
First, the candidate retrieval objects (a large amount of original data or articles) undergo a preliminary screening at the semantic level, for example by means of a semantic understanding model. That is, at least one piece of information to be retrieved that has a relatively high semantic similarity to the question to be retrieved (a small number of articles highly relevant to the question) is screened out. Because highly relevant articles are retained at the semantic level, the breadth of the retrieval resources is ensured and a basis for retrieval precision is provided. Second, for each piece of information to be retrieved obtained from the preliminary screening, a candidate retrieval result corresponding to the question (for example, a candidate answer to the question) is extracted at the semantic level (that is, on the basis of semantic understanding), for example by means of a reading comprehension model. This forms a set of candidate-result-related texts (the third text information set, formed by the texts containing the candidate retrieval results) and further improves the precision of the retrieval result. Finally, a multi-text summary is extracted from at least part of the texts in the set of candidate-result-related texts (for example, the texts corresponding to the candidate retrieval results with a relatively high retrieval match), to obtain at least part of the final retrieval result (for example, the final answer to the question to be retrieved). Such a multi-text summary can fuse the semantic features of multiple candidate retrieval results or answers that have a relatively high retrieval match, which again improves the robustness and accuracy of the retrieval result.
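FIG. 1 shows an exemplary application scenario 100 of an information retrieval method based on semantic understanding according to some embodiments of the present application. As shown in FIG. 1, the application scenario 100 may include a server 110 and, optionally, an external database 120, a network 130, and a terminal device 140, where the terminal device 140 may be controlled by a user 150.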
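The information retrieval method according to some embodiments of the present application may be deployed on and implemented by the server 110. The server 110 may be configured to: first, acquire first text information indicating a retrieval target and a plurality of pieces of second text information indicating candidate retrieval objects; second, determine the semantic similarity between the first text information and each of the plurality of pieces of second text information; third, select at least one piece of text information to be retrieved from the plurality of pieces of second text information according to those semantic similarities; next, extract, from each of the at least one piece of text information to be retrieved, third text information semantically related to the first text information, so as to form a third text information set; then, acquire a multi-text summary corresponding to at least two pieces of third text information in the third text information set; and finally, determine the retrieval result corresponding to the first text information based on the multi-text summary.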
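Exemplarily, the server 110 may store and run instructions that can perform the various methods described herein. The server 110 may be a single server or a server cluster, or it may be a cloud server or cloud server cluster capable of providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. It should be understood that a server as referred to herein is typically a server computer with a large amount of memory and processor resources, but other embodiments are also possible. In addition, the server 110 is shown only as an example; in practice, other devices or combinations of devices with computing and storage capabilities may be used instead or in addition to provide the corresponding services.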
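As shown in FIG. 1, the application scenario 100 may optionally further include an external database 120 and a network 130. The server 110 may be connected to the external database 120 through the network 130, for example to obtain the text to be processed from the database 120, including the first text information indicating the retrieval target (for example, the question to be retrieved) and the plurality of pieces of second text information indicating the candidate retrieval objects (for example, candidate articles corresponding to the question), and for example to store the obtained retrieval result or answer in the database 120. Exemplarily, the database 120 may be an independent data storage device or device group, or it may be a back-end data storage device or device group associated with other online services (such as online services providing intelligent customer service, voice assistants, and similar functions).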
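As shown in FIG. 1, the application scenario 100 may optionally further include a terminal device 140, which may be connected to the server 110 through the network 130. The user 150 of the terminal device 140 may access the server 110 through the terminal device 140 via the network 130 in order to obtain the services provided by the server 110. For example, the user 150 may enter instructions through a user interface provided by the terminal device 140, for example through a physical input device (such as a keyboard and/or mouse), virtual keys (such as a touch screen), or voice or gesture commands, in order to start the information retrieval scheme deployed on the server 110, send the first text information indicating the retrieval target or question (and/or the plurality of pieces of second text information indicating the candidate retrieval objects), receive the obtained retrieval result or answer, and so on.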
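Exemplarily, examples of the network 130 include a local area network (LAN), a wide area network (WAN), a personal area network (PAN), and/or a combination of communication networks such as the Internet. Each of the server 110, the database 120, and the terminal device 140 may include at least one communication interface (not shown) capable of communicating through the network 130. Such a communication interface may be one or more of the following: any type of network interface (for example, a network interface card (NIC)), a wired or wireless interface (such as an IEEE 802.11 wireless LAN (WLAN) interface), a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a near field communication (NFC) interface, and so on.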
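As shown in FIG. 1, the terminal device 140 may be any type of mobile computing device, including a mobile computer (for example, a personal digital assistant (PDA), a laptop computer, a notebook computer, a tablet computer, or a netbook), a mobile phone (for example, a cellular phone or a smartphone), a wearable computing device (for example, a smart watch or a head-mounted device, including smart glasses), or another type of mobile device. In some embodiments, the terminal device 140 may also be a stationary computing device, such as a desktop computer, a game console, or a smart TV. In addition, when the application scenario 100 includes multiple terminal devices 140, these terminal devices 140 may be computing devices of the same type or of different types.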
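Exemplarily, the terminal device 140 may include a display screen and a terminal application that can interact with the end user through the display screen. The terminal application may be a native application, a web application, or a mini program (a lightweight application). If the terminal application is a native application that requires installation, it may be installed on the user terminal. If the terminal application is a web application, it may be accessed through a browser. If the terminal application is a mini program, it may be opened directly on the terminal device 140 without installation, for example by searching for information about the terminal application (such as its name) or by scanning its graphic code (such as a barcode or QR code).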
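It should be understood that although the server 110, the database 120, and the terminal device 140 are shown and described herein as separate structures, they may also be different components of the same computing device. For example, the server 110 may provide back-end computing functions, the database 120 may provide data exchange, storage, and retrieval functions, and the terminal device 140 may provide front-end functions for interacting with the user, such as receiving user input and providing output to the user through the terminal application. Optionally, the information retrieval method based on semantic understanding according to some embodiments of the present application is not limited to implementation on the server side shown in FIG. 1; it may also be implemented on the terminal device side, or jointly on the terminal device side and the server side.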
FIG. 2 shows an exemplary schematic block diagram of an information retrieval method based on semantic understanding according to some embodiments of the present application. Note that the rounded boxes in FIG. 2 represent data, such as the first text information and the plurality of pieces of second text information to be processed, the text information to be retrieved, the third text information set, the multi-text summary, and the retrieval result; the square-cornered boxes represent processing operations on that data, including "information screening based on semantic similarity", "information extraction based on semantic relevance", and "multi-text summary acquisition" as shown in FIG. 2. As shown in FIG. 2, the information retrieval method according to some embodiments of the present application may use these three stages of information processing to determine a precise retrieval result corresponding to the first text information (that is, the retrieval target or retrieval question).
As shown in FIG. 2, in the information retrieval method based on semantic understanding according to some embodiments of the present application, in the first-stage information processing operation, the plurality of pieces of second text information 220 (a large amount of original candidate retrieval data) may be screened based on the semantic similarity between the first text information 210 and the second text information 220 (for example, by comparing semantic similarity with the aid of a semantic understanding model), so as to select at least one piece of text information to be retrieved 230, for example one with a relatively high semantic similarity to the first text information 210, for use in the subsequent retrieval process. In the second-stage information processing operation, third text information semantically related to the first text information 210 (for example, text information related to a candidate retrieval result) is retrieved or extracted from each piece of text information to be retrieved 230 (for example, with the aid of a reading comprehension model), so as to form the third text information set 240. In the third-stage information processing operation, a multi-text summary 250 is acquired for the third text information set 240 (for example, with the aid of a text summarization model), and the final retrieval result 260 is then obtained from the multi-text summary 250. It should be noted that the three stages shown in FIG. 2 (information screening, information extraction, and summary acquisition) can each be implemented with a corresponding neural network model; for example, they may use a semantic understanding model, a reading comprehension model, and a text summarization model, respectively. This is not limiting in the present application, however: the three-stage information processing operations may also be implemented in various ways other than neural network models.
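Because the three stages need not be neural network models, the data flow of FIG. 2 can be sketched with three interchangeable callables. The following is a minimal sketch under stated assumptions: the function name `retrieve`, its parameters, and the toy stand-ins in the usage example are hypothetical and only illustrate the flow 210/220 → 230 → 240 → 250; they are not the models described in the application.

```python
from typing import Callable, List, Sequence


def retrieve(
    question: str,
    candidates: Sequence[str],
    similarity: Callable[[str, str], float],
    extract: Callable[[str, str], str],
    summarize: Callable[[List[str]], str],
    top_m: int = 3,
) -> str:
    """Three-stage retrieval sketch: screen, extract, then summarize."""
    # Stage 1: screening by semantic similarity (second text information -> 230).
    ranked = sorted(candidates, key=lambda doc: similarity(question, doc), reverse=True)
    to_search = ranked[:top_m]

    # Stage 2: per-document extraction of question-related text (-> set 240).
    third_texts = [extract(question, doc) for doc in to_search]

    # Stage 3: multi-text summary over the extracted texts (-> 250).
    return summarize(third_texts)


if __name__ == "__main__":
    # Toy stand-ins (assumptions): word overlap instead of a semantic model,
    # identity extraction, and simple joining instead of a summarization model.
    docs = ["paris is the capital of france", "berlin is in germany", "cats sleep a lot"]
    overlap = lambda q, d: float(len(set(q.split()) & set(d.split())))
    print(retrieve("capital of france", docs, overlap, lambda q, d: d, " / ".join, top_m=2))
```

Passing the stages in as callables keeps the sketch neutral about how each stage is realized, which matches the statement above that neural network models are only one option.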
FIG. 3 is a flowchart of an information retrieval method based on semantic understanding according to some embodiments of the present application. The method shown in FIG. 3 is implemented in the application scenario shown in FIG. 1, and it may be executed by the server 110 shown in FIG. 1. Optionally, the information retrieval method according to some embodiments of the present application may also be executed on the terminal device 140 shown in FIG. 1, or jointly by the server 110 and the terminal device 140.
As shown in FIG. 3, the information retrieval method based on semantic understanding according to some embodiments of the present application may include the following steps:
S310, a text information acquisition step;
S320, a semantic similarity determination step;
S330, a to-be-retrieved information selection step;
S340, a semantically related information extraction step;
S350, a multi-text summary acquisition step;
S360, a retrieval result determination step.
The execution of each of the above steps S310 to S360 is described in detail below with reference to FIG. 2.
In step S310 (the text information acquisition step), first text information indicating a retrieval target and a plurality of pieces of second text information indicating candidate retrieval objects are acquired.
In some embodiments, the retrieval target is to obtain, through the information retrieval process, the answer to a question to be retrieved (or processed) and/or information related to a search keyword. The first text information may therefore include the text information corresponding to the question to be retrieved and/or the search keyword (for example, as entered by a user). The second text information may include candidate retrieval objects corresponding to the first text information (for example, articles corresponding to the question), that is, candidate retrieval objects that may contain the answer to the question and/or information related to the search keyword. Acquiring data in text form, such as the first text information and the second text information, facilitates the subsequent data processing, because the various natural language processing models used in the subsequent screening, extraction, and summary acquisition operations mostly take and produce data in text form, or even in serialized text form.
In some embodiments, the text information acquisition step S310 may include, for example, receiving from the terminal device a retrieval target or question entered by the user in text form as the first text information, and obtaining a plurality of candidate retrieval data items or articles in text form, for example from an external database, as the plurality of pieces of second text information. Optionally, the retrieval target or question entered by the user and the candidate retrieval data may also be data in non-text form (for example, voice data, video data, or tabular data). In that case, after receiving or obtaining the non-text information, the text information acquisition step needs to convert it into text information (for example, using a suitable format conversion tool) to obtain the corresponding first text information and plurality of pieces of second text information. Herein, the question to be retrieved may be any of various questions to be solved or processed, including but not limited to (simple and intuitive) common-sense questions, (complex and abstract) logical reasoning questions, numerical reasoning questions, and so on.
In step S320 (the semantic similarity determination step), the semantic similarity between the first text information and each of the plurality of pieces of second text information is determined.
According to the concept of the present application, in order to significantly reduce the retrieval volume and improve retrieval efficiency while preserving retrieval breadth and precision, a relatively small amount of text information to be retrieved may be screened out of a large amount of second text information (the source data or candidate retrieval objects) based on the degree of intrinsic semantic similarity between the first text information and each piece of second text information. The screened-out text information then serves as the retrieval object from which the retrieval result corresponding to the first text information (for example, the question to be retrieved) is retrieved.
In some embodiments, the semantic similarity of different pieces of text information may be determined by calculating the similarity of their corresponding semantic features; in other words, the similarity of the semantic features may be used to represent the intrinsic semantic similarity of the text information. Generally, for quantification purposes, the semantic features of text information may be expressed in vector form, that is, as a semantic feature vector. The semantic feature vector of a piece of text information refers to a vector extracted from that text information that characterizes its overall semantics, and it may be obtained, for example, using a semantic understanding model. In this way, determining the semantic similarity between the first text information and each piece of second text information is converted into calculating the similarity between the semantic feature vector of the first text information and the semantic feature vector of each piece of second text information. Optionally, the semantic similarity may also be calculated in other ways, for example by comparing the semantic (vector) sequence corresponding to the serialized representation (such as a word segmentation sequence) of the first text information with the semantic (vector) sequence corresponding to the serialized representation of each piece of second text information.
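As an illustration of comparing two semantic feature vectors, the following minimal sketch computes their cosine similarity. The choice of cosine similarity is an assumption made here for illustration; the passage above does not name a specific similarity measure, and the vectors in the usage example are hand-written stand-ins for the output of a semantic understanding model.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    question_vec = [0.2, 0.7, 0.1]   # stand-in for the first text information
    article_vec = [0.25, 0.65, 0.05]  # stand-in for one piece of second text information
    print(cosine_similarity(question_vec, article_vec))
```

Cosine similarity depends only on the direction of the vectors, not their length, which is one common reason it is used for comparing text embeddings.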
In the semantic similarity determination of step S320, the semantic features (vector) of the first text information cover the semantic information of the individual parts of the first text information (for example, segmented words) and the associations among them, and the semantic features (vector) of the second text information cover the semantic information of the individual parts of the second text information (for example, sentences and words) and their associations. The semantic similarity determined from these features can therefore express how similar or close the first and second text information are at an intrinsic, essential semantic level, rather than the mere consistency or matching of simple, superficial, local wording (for example, keywords, individual paragraphs, or sentences within the overall text). As a result, both retrieval breadth and retrieval precision can be significantly improved.
Take Chinese as an example. Because there are many synonyms, an article may still be semantically similar or close to the retrieval question even if it contains no words identical to at least some of the words (for example, keywords) in the question. Relying purely on keyword matching would therefore lose many articles that are semantically similar to the question, so that important retrieval resources are missed and retrieval breadth is limited. On the other hand, because Chinese also has polysemous words, even literally identical words may have very different meanings in different contexts, so keyword matching does not necessarily find or retrieve the information that actually matches the question, and retrieval precision is hard to guarantee. In the present application, information retrieval is realized through a semantic comparison of the first text information and the second text information (that is, the calculation of semantic similarity) and is not restricted by whether the literal wording is identical. This fundamentally removes the limited retrieval breadth and precision of the related art and efficiently achieves a preliminary screening of the plurality of pieces of second text information.
In step S330 (the to-be-retrieved information selection step), at least one piece of text information to be retrieved is selected from the plurality of pieces of second text information according to the semantic similarity between the first text information and each of the plurality of pieces of second text information.
According to the concept of the present application, after the semantic similarity between the first text information and each piece of second text information has been determined, the actual retrieval objects, for example at least one piece of text information to be retrieved, may be screened from the plurality of pieces of second text information based on that similarity. Generally, one or more pieces of second text information with a relatively high semantic similarity to the first text information may be selected as the text information to be retrieved, because a higher semantic similarity indicates that the corresponding second text information is closer to the first text information at the intrinsic semantic level. In other words, the higher the semantic similarity between a piece of second text information and the first text information, the more likely it is to be selected as text information to be retrieved; the number of pieces selected may be preset. Compared with the original candidate retrieval objects or data (the plurality of pieces of second text information), the amount of text information to be retrieved obtained through the similarity-based screening (which serves as the basis for subsequent retrieval) is greatly reduced, which significantly reduces the retrieval volume and improves retrieval efficiency.
In some embodiments, step S330 (the to-be-retrieved information selection step) may include: sorting the plurality of pieces of second text information in descending order of their semantic similarity to the first text information; and selecting the first M pieces of second text information from the sorted order as M pieces of text information to be retrieved, where M is a preset positive integer. As described above, during screening, the top M pieces of second text information with the highest similarity to the first text information can be selected as the retrieval basis or actual retrieval objects, and M can be preset according to the actual application scenario. For example, for a relatively complex or abstract question to be retrieved (i.e., the first text information), M can be set to a relatively large value to ensure the breadth and richness of the data to be retrieved, which helps retrieve a more accurate answer. On the other hand, to preserve retrieval efficiency, M should not be set too large; for example, it can be set to the smallest value that suits the specific application scenario.
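The sort-and-truncate selection described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `select_top_m` and the representation of similarities as a plain list of floats are assumptions made for the example.

```python
import heapq

def select_top_m(similarities, m):
    """Return the indices of the m most similar candidates, best first.

    `similarities[i]` is assumed to be the semantic similarity between the
    first text information and the i-th second text information.
    """
    if m <= 0:
        raise ValueError("m must be a positive integer")
    # nlargest is equivalent to sorting in descending order and keeping the
    # first m entries; it also copes with m larger than the candidate count.
    ranked = heapq.nlargest(m, enumerate(similarities), key=lambda pair: pair[1])
    return [index for index, _ in ranked]

scores = [0.12, 0.87, 0.45, 0.91, 0.30]
top_indices = select_top_m(scores, 3)
```

Using a heap-based selection instead of a full sort is only a convenience when M is much smaller than the number of candidates; a plain descending sort followed by slicing gives the same result.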
In step S340 (the semantically related information extraction step), third text information semantically related to the first text information is extracted from each of the at least one piece of text information to be retrieved, to form a third text information set.
According to the concept of the present application, after the first-stage preliminary screening (i.e., screening out, from the plurality of pieces of second text information, at least one piece of text information to be retrieved, such as an article to be retrieved, as the retrieval basis), the method can proceed to the second-stage information extraction process. In this stage, third text information semantically related to the first text information is extracted from each piece of text information to be retrieved (i.e., information related to a candidate retrieval result corresponding to the first text information, for example information related to a candidate answer to the question to be processed, extracted from an article to be retrieved). The pieces of third text information, which correspond one-to-one to the pieces of text information to be retrieved, form the third text information set, which serves as the basis for the subsequent third-stage multi-text summary acquisition process.
In some embodiments, the semantically related information extraction step S340 may extract, for each piece of text information to be retrieved, one piece of third text information semantically related to the first text information, thereby forming the third text information set; the pieces of third text information in the set may correspond one-to-one to the pieces of text information to be retrieved. Herein, semantic relatedness between the third text information and the first text information can be understood to mean that the third text information extracted from each piece of text information to be retrieved is information related to a candidate retrieval result corresponding to the first text information. For example, when the retrieval target indicated by the first text information is a question to be retrieved, the third text information may be information related to a candidate answer matching the question, extracted from an article to be retrieved (i.e., the text information to be retrieved). In some embodiments, the third text information may be at least a portion of the corresponding text information to be retrieved that contains the retrieval result corresponding to the first text information; for example, it may be a sentence or a paragraph of the article to be retrieved. The third text information can be extracted using a (machine) reading comprehension model (for example, a BERT-based neural network model) or a question-answering model. For example, the first text information (such as a question) and the text information to be retrieved (such as the text of an article) are first input to the model to determine the answer to the question within the article; then the information containing the answer (for example, the sentence or natural paragraph in which the answer is located, or the answer itself) is extracted from the article as the third text information. For the specific process, see FIG. 6 and FIG. 7 and the related descriptions.
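The second half of this procedure, pulling out the sentence that contains the answer, can be sketched as below. The sketch assumes the reading comprehension model has already returned the answer as a character span; the function name `sentence_containing`, the span convention, and the punctuation-based sentence splitting are all illustrative assumptions and not details given in the application.

```python
import re

def sentence_containing(article, answer_start, answer_end):
    """Return the sentence(s) of `article` covering [answer_start, answer_end).

    The span is assumed to come from a reading comprehension model; sentence
    boundaries are approximated by Chinese or Western end punctuation.
    """
    # Candidate boundaries: start of text and the position after each
    # sentence-ending punctuation mark.
    boundaries = [0] + [m.end() for m in re.finditer(r"[。!?.!?]", article)]
    if boundaries[-1] != len(article):
        boundaries.append(len(article))
    begin = max(b for b in boundaries if b <= answer_start)
    end = min(b for b in boundaries if b >= answer_end)
    return article[begin:end].strip()

article = "BERT is an encoder. It was released in 2018. Many models build on it."
start = article.index("2018")
third_text = sentence_containing(article, start, start + len("2018"))
```

Returning the enclosing paragraph instead of the sentence would follow the same pattern with paragraph breaks as boundaries.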
As the second-stage information extraction step, S340 further reduces the amount of information on the basis of the first-stage preliminary screening. Using, for example, a reading comprehension model, it screens out (i.e., extracts) from the information to be retrieved the information related to candidate retrieval results, which serves as the basis for the subsequent third-stage multi-text summarization. This further reduces the amount of information processed during retrieval and improves data processing efficiency while preserving the breadth and accuracy of the retrieval data.
In step S350 (the multi-text summary acquisition step), a multi-text summary corresponding to at least two pieces of third text information in the third text information set is acquired.
According to the concept of the present application, after the first-stage preliminary screening based on semantic similarity (i.e., screening out, from the plurality of pieces of second text information, at least one piece of text information to be retrieved, such as an article to be retrieved, as the retrieval basis) and the second-stage extraction of semantically related information (i.e., extracting third text information from the information to be retrieved to form the third text information set), the method can proceed to the third-stage multi-text summary acquisition process. First, at least two pieces of third text information (for example, those with a relatively high degree of relevance to, or retrieval matching with, the first text information) are selected from the third text information set. Then, a natural-language text summarization model (for example, an encoder-decoder model based on a deep neural network) is used to extract a multi-text summary from, or generate one based on, the at least two pieces of third text information, as at least part of the final retrieval result.
In some embodiments, the multi-text summary involved in the information retrieval method according to the present application (i.e., the text summary obtained from at least two pieces of third text information in the third text information set) may be a generative text summary. Because a generative text summary (for example, one generated by a text summarization model) is produced through semantic feature understanding, summarization, and reasoning, it can automatically and efficiently provide appropriate and accurate answers to more abstract and more complex questions. This broadens the applicable scenarios and scope of the retrieval method, improves the robustness and accuracy of the retrieval results, and improves the user experience. Optionally, the multi-text summary involved in step S350 may instead be an extractive text summary.
In some embodiments, a sequence-to-sequence language processing model can be used to obtain the multi-text summary, for example a model with an encoder-decoder structure based on a deep neural network (such as an RNN and/or a CNN). The input of the encoder is at least two pieces of third text information from the third text information set that have a relatively high matching degree or relevance to the first text information, and the output of the decoder is the multi-text summary. Running the model may include the following steps: first, the at least two pieces of third text information are concatenated in order of matching degree and segmented into an input word segmentation sequence, which is fed to the encoder to obtain the semantic vector sequence corresponding to the input text; then, an attention mechanism is combined with the input at each decoder time step to obtain a sequence of context vectors; finally, the decoder produces the decoded text sequence over the time steps (or cyclic decoding moments), which is the multi-text summary.
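Two of these steps, ordering the inputs by matching degree and computing a context vector at one decoder time step, can be sketched as follows. This is only an illustration under assumptions: the application does not specify the attention variant, so plain dot-product attention is used here, and the names `build_encoder_input` and `attention_context`, as well as the representation of states as lists of floats, are hypothetical.

```python
import math

def build_encoder_input(scored_texts):
    """Concatenate third texts in descending order of matching score."""
    ordered = sorted(scored_texts, key=lambda pair: pair[1], reverse=True)
    return " ".join(text for text, _ in ordered)

def attention_context(decoder_state, encoder_states):
    """Dot-product attention for a single decoder time step.

    Returns the attention weights over encoder positions and the resulting
    context vector (the weighted sum of encoder states).
    """
    scores = [sum(d * e for d, e in zip(decoder_state, state))
              for state in encoder_states]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

encoder_input = build_encoder_input([("second passage", 0.6), ("first passage", 0.9)])
weights, context = attention_context([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In a trained model the encoder states and decoder state would come from the network itself; the toy values above only show the arithmetic.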
In step S360 (the retrieval result determination step), the retrieval result corresponding to the first text information is determined based on the multi-text summary.
According to the concept of the present application, after the multi-text summary of at least two (highly relevant) pieces of third text information has been obtained, the multi-text summary can be used directly as the final retrieval result or as at least part of it. The multi-text summary is built from several pieces of third text information that are selected from the third text information set (i.e., the set of information related to candidate retrieval results corresponding to the first text information) and that are highly relevant to the first text information. It therefore fuses the content or intrinsic semantic features of several well-matched pieces of candidate retrieval result information, together with the semantic associations among them. In effect, the multi-text summary distills the essence of several pieces of candidate information that already have high retrieval accuracy, which further improves how precisely the final retrieval result (such as an answer) matches the first text information (such as the question to be processed).
In some embodiments, the retrieval result determination step may include, for example, presenting the multi-text summary as the retrieval result corresponding to the first text information. In some application scenarios, when the retrieval target includes a question to be retrieved, the retrieval result may include answer information corresponding to that question. In other words, the multi-text summary can be used directly as the final answer to the question to be processed. Because it fuses the semantic features of several candidate retrieval results (i.e., several candidate answers to the question) and the features relating them to one another, such a multi-text summary can give a direct and accurate answer to the question, which improves both the user experience and the accuracy of the retrieval result.
In some embodiments, the retrieval result determination step S360 may also include: generating a first retrieval result corresponding to the first text information based on the multi-text summary; generating a second retrieval result corresponding to the first text information based on the at least two pieces of third text information underlying the multi-text summary; and determining the retrieval result corresponding to the first text information according to the first and second retrieval results, such that the retrieval result includes at least one of them. For example, the retrieval result may include not only the first retrieval result based on the multi-text summary, as the (precise) main retrieval result, but also the second retrieval result generated from the at least two pieces of third text information (or the third text information set) underlying the summary, as an (optional) auxiliary retrieval result. The second retrieval result may be, for example, a list of at least two pieces of third text information that were obtained in the second-stage extraction process and that are highly relevant to the first text information. This dual-result form makes the retrieval results more diverse and selectable and can meet the individual needs of different users. For example, a user who is not satisfied with the precise result based on the multi-text summary can consult the optional auxiliary list of candidate retrieval results to look up the relevant information quickly.
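One way to represent such a dual-form result is sketched below. The class name `RetrievalResult`, its field names, and the `include_auxiliary` switch are illustrative assumptions; the application only requires that the result contain at least one of the two parts.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RetrievalResult:
    # First retrieval result: the multi-text summary (main result).
    primary: str
    # Second retrieval result: the third texts the summary was built from.
    auxiliary: List[str] = field(default_factory=list)

def build_result(summary, third_texts, include_auxiliary=True):
    """Combine the summary with an optional list of supporting passages."""
    supporting = list(third_texts) if include_auxiliary else []
    return RetrievalResult(primary=summary, auxiliary=supporting)

result = build_result("Summary of the answer.", ["Passage A.", "Passage B."])
```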
In the information retrieval method based on semantic understanding according to some embodiments of the present application, advanced information retrieval tasks such as intelligent question answering are completed efficiently and accurately through a three-stage information processing process. The first stage screens the second text information (i.e., the candidate retrieval objects, such as candidate articles from which an answer to the question is to be obtained) based on its semantic similarity to the first text information (such as the question to be retrieved). The second stage extracts the third text information set (such as text related to candidate answers) based on semantic relatedness. The third stage generates a multi-text summary based on the third text information set.
Specifically, first, a relatively small amount of text information to be retrieved is screened out of a large amount of second text information according to the semantic similarity between the first text information and the second text information (rather than purely literal matching). This significantly improves overall efficiency without losing important retrieval resources that are highly relevant to the question (i.e., while preserving retrieval breadth and accuracy), and overcomes two problems of the related art: missing important retrieval resources because of exact keyword matching, and low efficiency caused by complex workflows and heavy computation. Second, for each piece of text information to be retrieved obtained in the first stage, third text information corresponding to the first text information (such as the question to be retrieved), i.e., text information related to a candidate answer, is extracted from it based on semantic relatedness. This again exploits the intrinsic semantic features of the question, the semantic features of the information to be retrieved, and the associations between them to extract the candidate answers and their related text (i.e., the third text information set), which further ensures a high relevance between the third text information set and the first text information and a high accuracy of the final retrieval result. Finally, a multi-text summary (for example, a generative text summary) of at least two pieces of third text information in the third text information set is generated; that is, a summary is produced by summarizing and reasoning over, for example, several highly matching candidate-answer texts, and serves as the final answer to the question (for example, as part of the retrieval result). Because such a summary fuses the semantic features and associated features of several candidate-answer texts (i.e., third text information) with a high retrieval matching degree, it can further improve the quality and accuracy of the retrieval result.
FIG. 4 shows an example flowchart of the semantic similarity determination step of the information retrieval method according to some embodiments of the present application. FIG. 5 shows a schematic diagram of determining semantic similarity using a semantic understanding model according to some embodiments of the present application.
As shown in FIG. 4, the semantic similarity determination step S320 may include steps S320a to S320c. Each step shown in FIG. 4 is described in detail below with reference to FIG. 5.
In step S320a, a first semantic feature vector corresponding to the first text information and a plurality of second semantic feature vectors respectively corresponding to the plurality of pieces of second text information are acquired.
To quantify the semantic features of text information, a semantic feature vector can be used to represent them, so that the semantic similarity between different pieces of text information can be characterized by the similarity between the corresponding vectors (for example, the distance between the vectors and/or the cosine of the angle between them). In this way, determining the semantic similarity between the first text information and each piece of second text information is converted into calculating the similarity between the semantic feature vector of the first text information and the semantic feature vector of each piece of second text information.
The semantic feature vector of text information can be obtained using a semantic understanding model for natural language processing. A semantic understanding model, also called a semantic encoder, can be any of various pre-trained models for semantic understanding, such as a BERT, RoBERTa, or ALBERT neural network. It converts an input text sequence (for example, the word segmentation sequence corresponding to the first text information) and/or word vector sequence into a semantic feature vector that represents the overall semantics of the input text.
As shown in FIG. 5, in the semantic feature vector acquisition stage (i.e., the semantic encoding stage), the first text information 510 and each piece of second text information 520 are first input to the text preprocessor 530 for serialization (for example, word segmentation). This yields a first word segmentation sequence ([CLS], q(1), ..., q(k)) for the first text information 510 and a second word segmentation sequence ([CLS], p(1), ..., p(m)) for the second text information 520, where [CLS] denotes the text-start marker or global character, q(1) to q(k) denote the k segmented words of the first text information (for example, the question to be processed), p(1) to p(m) denote the m segmented words of the second text information (for example, an article), and k and m are positive integers (generally, m > k). The preprocessed first and second word segmentation sequences are then respectively input to the semantic understanding model 540, which, through semantic understanding or semantic encoding, produces the first semantic feature vector 550 corresponding to the first word segmentation sequence and the second semantic feature vector 560 corresponding to the second word segmentation sequence.
As shown in FIG. 5, the first word segmentation sequence ([CLS], q(1), ..., q(k)) is processed by the network layers of the neural network in the semantic understanding model 540. The output vector produced by the neural network unit corresponding to the global character [CLS] can represent the overall semantic features of the first text information 510, and this output vector is the first semantic feature vector 550 corresponding to the first word segmentation sequence, i.e., to the first text information 510. Likewise, the second word segmentation sequence ([CLS], p(1), ..., p(m)) is processed by the network layers of the semantic understanding model 540, and the output vector produced by the neural network unit corresponding to the global character [CLS] can represent the overall semantic features of the second text information 520; this vector is the second semantic feature vector 560 corresponding to the second word segmentation sequence, i.e., to the second text information 520.
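Exemplarily, the text preprocessor 530 shown in FIG. 5 may include a word segmentation tool for splitting the text to be processed into one or more words, i.e., into a word segmentation sequence or text sequence. The word segmentation tool may be, for example, Jieba, LTP, THULAC, or NLPIR. Besides using a word segmentation tool, the serialization or word segmentation of the text to be processed may optionally be implemented in other ways, such as manual annotation, random splitting, or splitting entirely into single characters. Generally, the same semantic understanding model 540 may be used for the first text information 510 and the second text information 520; optionally, different semantic understanding models may be used to perform semantic understanding or semantic encoding on the first text information 510 and the second text information 520, respectively.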
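The simplest of the serialization options mentioned above, splitting the text entirely into single characters and prepending the global [CLS] marker, can be sketched as follows. The function name `to_token_sequence` and the choice to drop whitespace are assumptions made for the example; a real system would more likely rely on a dedicated word segmentation tool.

```python
def to_token_sequence(text):
    """Split text into single characters and prepend the [CLS] marker.

    Whitespace is dropped (an assumption of this sketch), so the result has
    the form ([CLS], t(1), ..., t(n)).
    """
    tokens = [ch for ch in text if not ch.isspace()]
    return ["[CLS]"] + tokens

question_tokens = to_token_sequence("语义 检索")
```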
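Optionally, if the input format of the semantic understanding model 540 is a vector sequence, then after performing word segmentation on the first and second text information to obtain the corresponding first and second word segmentation sequences, the text preprocessor 530 needs to vectorize them to obtain first and second word vector sequences; the first and second word vector sequences are then respectively input to the semantic understanding model for semantic encoding. Exemplarily, the text preprocessor 530 may further include a word vector tool for converting text to word vectors, so as to vectorize the segmented text sequences. Word vector tools may include, for example, one-hot encoding, word2vec, and GloVe.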
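Of the vectorization options listed, one-hot encoding is the easiest to show in a few lines. The sketch below builds the vocabulary from the token sequence itself, which is an assumption made to keep the example self-contained; in practice the vocabulary would be fixed in advance, and the function name `one_hot_encode` is hypothetical.

```python
def one_hot_encode(tokens):
    """Map each token to a one-hot vector over the vocabulary of `tokens`."""
    # Sorted order gives every distinct token a stable index.
    vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
    vectors = []
    for tok in tokens:
        vec = [0] * len(vocab)
        vec[vocab[tok]] = 1
        vectors.append(vec)
    return vocab, vectors

vocab, word_vectors = one_hot_encode(["a", "b", "a"])
```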
In some embodiments, step S320a may include: determining the first semantic feature vector corresponding to the first text information using the semantic understanding model; and acquiring, from a preset semantic feature vector index library, the plurality of second semantic feature vectors respectively corresponding to the plurality of pieces of second text information, where the preset semantic feature vector index library stores the plurality of second semantic feature vectors determined using the semantic understanding model.
Retrieval objects such as articles (i.e., the second text information) are generally relatively fixed, known data stored in advance, for example in a server database, and their volume is often very large. To improve data processing efficiency, the semantic understanding operation (i.e., acquiring or calculating the semantic feature vectors or semantic encodings) for the large amount of second text information can therefore be performed offline in advance, and the resulting second semantic feature vectors can be stored in the preset semantic feature vector index library for use in online semantic similarity calculation (against first text information such as a question to be retrieved) when retrieval takes place. The semantic understanding or semantic encoding of the first text information, by contrast, can be performed online once retrieval starts. During retrieval, the second semantic feature vectors determined offline are simply read from the semantic feature vector index library, and there is no need to run the semantic understanding model online in real time over the huge amount of second text information. Especially in scenarios with a large volume of retrieval objects or articles, this greatly improves data processing efficiency and significantly optimizes the allocation and scheduling of network resources.
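The split between offline encoding of the candidates and online encoding of the query can be sketched as follows. The class `SemanticVectorIndex`, its method names, and the toy encoder `toy_encode` are hypothetical stand-ins: the toy encoder only counts characters and is not a semantic understanding model, and an in-memory dictionary takes the place of a real index library.

```python
class SemanticVectorIndex:
    """In-memory stand-in for the semantic feature vector index library."""

    def __init__(self, encode):
        self._encode = encode      # callable: text -> list of floats
        self._vectors = {}

    def build_offline(self, documents):
        """Encode every candidate once, ahead of any query."""
        for doc_id, text in documents.items():
            self._vectors[doc_id] = self._encode(text)

    def second_vectors(self):
        """Return the stored vectors without re-running the encoder."""
        return dict(self._vectors)

calls = []

def toy_encode(text):
    # Placeholder encoder used only to make the sketch runnable.
    calls.append(text)
    return [float(len(text)), float(text.count("a"))]

index = SemanticVectorIndex(toy_encode)
index.build_offline({"doc1": "banana", "doc2": "kiwi"})   # offline phase
query_vector = toy_encode("apple")                        # online phase
stored = index.second_vectors()
```

The `calls` list shows the point of the design: at query time the encoder runs once for the query, no matter how many candidates the index holds.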
In step S320b, the similarity between the first semantic feature vector and each of the plurality of second semantic feature vectors is calculated.
As described above, the semantic similarity between different pieces of text information can be characterized by the similarity between the corresponding semantic feature vectors. Therefore, as shown in FIG. 5, after the first semantic feature vector 550 corresponding to the first text information 510 and the second semantic feature vector 560 corresponding to each piece of second text information 520 have been acquired, the vector similarity between the two is calculated in order to determine the semantic similarity 570 between the first text information 510 and the second text information 520.
In some embodiments, the similarity of two vectors can be expressed either as how similar their directions are, for example cosine similarity, or as how close they are in distance, i.e., distance-based similarity. Specifically, step S320b may include: calculating a first similarity between the first semantic feature vector and each of the plurality of second semantic feature vectors based on the distance between them; calculating a second similarity between the first semantic feature vector and each of the plurality of second semantic feature vectors based on the cosine of the angle between them; and determining the similarity between the first semantic feature vector and each second semantic feature vector based on at least one of the first similarity and the second similarity.
Exemplarily, when the distance between two semantic feature vectors is calculated, a smaller distance indicates that the two vectors are more similar, i.e., the vector similarity is greater; conversely, a larger distance indicates that they are less similar, i.e., the vector similarity is smaller. Exemplarily, when the cosine of the angle between two semantic feature vectors is calculated, a smaller angle (and hence a larger cosine) indicates that the two vectors are more similar; conversely, a larger angle indicates that they are less similar. The similarity between the first and second semantic feature vectors can therefore be calculated from the vector distance alone, from the cosine of the angle alone, or from both together (i.e., as a weighted sum of the first similarity and the second similarity). Both calculations are simple, and semantic feature vectors can represent semantics accurately, so either approach meets the need for precise retrieval while also improving retrieval efficiency and saving retrieval time. These two properties complement each other, which is very beneficial in scenarios with a massive amount of second text information.
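The weighted combination of a distance-based similarity and a cosine similarity can be sketched as follows. The application does not say how a distance is turned into a similarity, so the mapping 1 / (1 + d) and the default weight of 0.5 are assumptions of this sketch, as are the function names.

```python
import math

def cosine_similarity(u, v):
    """Second similarity: cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm

def distance_similarity(u, v):
    """First similarity: decreases as the Euclidean distance grows.

    The mapping 1 / (1 + d) is an assumed choice, not taken from the source.
    """
    return 1.0 / (1.0 + math.dist(u, v))

def combined_similarity(u, v, weight=0.5):
    """Weighted sum of the distance-based and cosine similarities."""
    return weight * distance_similarity(u, v) + (1.0 - weight) * cosine_similarity(u, v)

same = combined_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Setting `weight` to 1.0 or 0.0 recovers the two single-measure variants described in the text.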
In some embodiments, the distance between the first semantic feature vector and the second semantic feature vector may be the Euclidean distance, the Manhattan distance, the Chebyshev distance, or the like. Optionally, besides cosine similarity and distance-based similarity, other methods may be used to calculate the similarity between the first and second semantic feature vectors, such as a correlation coefficient or a similarity coefficient.
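For reference, the three distances named above differ only in how the per-coordinate differences are aggregated. The sketch below uses plain Python lists as vectors; the function names are illustrative.

```python
import math

def euclidean(u, v):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def chebyshev(u, v):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(u, v))

u, v = [0.0, 0.0], [3.0, 4.0]
distances = (euclidean(u, v), manhattan(u, v), chebyshev(u, v))
```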
In step S320c, the semantic similarity between the first text information and each of the plurality of second text information is determined according to the similarity between the first semantic feature vector and each second semantic feature vector.
For example, after the similarity between the first semantic feature vector and each second semantic feature vector is obtained, that similarity may be used directly as the semantic similarity between the corresponding texts. Optionally, the similarity between semantic feature vectors may first undergo appropriate data processing (for example, increasing or decreasing its degree of discrimination, or normalization), and the processed value is then used as the semantic similarity.
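As one possible example of the optional data processing mentioned above, the sketch below applies min-max normalization to a list of vector similarities. The choice of min-max normalization and the function name are assumptions; other processing could equally be used.

```python
def min_max_normalize(scores):
    """Rescale similarities to [0, 1]; the choice of min-max is an assumption."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All candidates are equally similar; treat them all as full matches.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]
```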
Fig. 6 shows an example flowchart of the semantically related information extraction step in the information retrieval method according to some embodiments of the present application. Fig. 7 shows a schematic diagram of extracting semantically related information using a reading comprehension model according to some embodiments of the present application.
As shown in Fig. 6, the semantically related information extraction step S340 may include steps S340a to S340c. Each step shown in Fig. 6 is described in detail below with reference to Fig. 7.
In step S340a, for each piece of text information to be retrieved, a reading comprehension model is used to determine, from that text information, fourth text information indicating a candidate retrieval result corresponding to the first text information.
According to the concept of the present application, in the second-stage semantically related information extraction step S340, candidate-retrieval-result related information, that is, third text information, needs to be extracted from each piece of text information to be retrieved, so as to form a third text information set that serves as the basis for the subsequent multi-text summarization. For example, each third text information in the set may be defined as text information that is extracted from the corresponding text information to be retrieved and that contains, for example, the candidate retrieval result corresponding to the first text information (that is, the fourth text information). For example, when the first text information represents a question to be processed, the fourth text information may represent a candidate answer to that question, and the third text information may represent an article fragment related to the candidate answer. Therefore, to obtain the third text information set, the corresponding fourth text information (the candidate retrieval result) is first determined in each piece of text information to be retrieved, and the third text information containing that candidate retrieval result (the candidate-retrieval-result related information) is then extracted.
In some embodiments, the reading comprehension model is a neural network model that semantically understands natural-language articles or corpora and answers related questions. Its input may be a serialized representation of the first text information (for example, the text of the question to be processed) and the text information to be retrieved (for example, the corresponding article to be retrieved). Its output may be the candidate retrieval result (for example, an answer or candidate answer to the question to be processed), or a first probability and a second probability for each token (word segment) in the token sequence corresponding to the text information to be retrieved, where the first probability is the probability that the token is the start token of the fourth text information indicating the candidate retrieval result, and the second probability is the probability that the token is the end token of the fourth text information.
In some embodiments, step S340a may include performing the following steps for each piece of text information to be retrieved:
forming first text information to be processed by concatenating the first text information and the text information to be retrieved;
performing word segmentation on the first text information to be processed to obtain a token sequence, the token sequence including a first token sequence corresponding to the first text information and a second token sequence corresponding to the text information to be retrieved;
inputting the token sequence into the reading comprehension model to obtain a first probability and a second probability for each token in the second token sequence, where the first probability of a token is the probability that the token is the start token of the fourth text information, and the second probability of a token is the probability that the token is the end token of the fourth text information;
determining the start token and the end token of the fourth text information from the second token sequence according to the first probability and the second probability of each token in the second token sequence;
determining the fourth text information from the text information to be retrieved according to the start token and the end token of the fourth text information.
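As shown in Fig. 7, in the process of extracting the fourth text information (that is, the candidate retrieval result) from each piece of text information to be retrieved, the first text information 710 (for example, the question to be retrieved) and the corresponding text information to be retrieved 720 (for example, an article) are first input into a text preprocessor 730, which performs preprocessing operations such as concatenation and serialization to convert them into the input format required by the reading comprehension model 750, namely a text or token sequence 740. As shown in Fig. 7, the token sequence 740 includes, in order: the start marker "[CLS]" of the text information to be processed, the first token sequence 740a corresponding to the first text information 710 (that is, q(1), ..., q(k)), the text separator "[SEP]", and the second token sequence 740b (that is, p(1), ..., p(m)). The token sequence 740 is then input into the reading comprehension model 750. After processing and computation by the multi-layer neural network in the model, the model outputs a first probability and a second probability for each token p(1), ..., p(m) in the second token sequence 740b, where the first probability is the probability that the token is the start token of the fourth text information indicating the candidate retrieval result (for example, a candidate answer to the question), and the second probability is the probability that the token is the end token of the fourth text information. The start token and the end token are then determined according to the first and second probabilities. For example, the largest first probability and the largest second probability may be found among the first and second probabilities of the tokens in the second token sequence (p(1), ..., p(m)), and the tokens to which they correspond may be taken as the start token and the end token, respectively. As shown in Fig. 7, if, among the first and second probabilities of p(1) to p(m), p(m1) has the largest first probability, p(m2) has the largest second probability, and m1 < m2, then the m1-th token p(m1) of the text information to be retrieved 720 is determined to be the start token of the fourth text information, and the m2-th token p(m2) is determined to be its end token. Finally, the fourth text information is determined from the start token p(m1) and the end token p(m2): the text formed by the tokens from the m1-th token to the m2-th token of the text information to be retrieved (including p(m1) and p(m2)) is determined to be the fourth text information and serves as the output text of the reading comprehension model.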
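The input layout and the basic span selection described above can be sketched as follows. This is a minimal sketch, not the model itself: the probabilities are taken as given, the function names are hypothetical, and indices are 0-based rather than the 1-based indices used in the description.

```python
def build_input(question_tokens, passage_tokens):
    """Lay out the sequence as [CLS] q(1..k) [SEP] p(1..m)."""
    return ["[CLS]"] + list(question_tokens) + ["[SEP]"] + list(passage_tokens)


def select_span(passage_tokens, start_probs, end_probs):
    """Pick the tokens with the largest start and end probabilities.

    Returns (m1, m2, span_tokens) when m1 < m2, and None otherwise so that
    the caller can fall back to another selection strategy.
    """
    m1 = max(range(len(start_probs)), key=lambda i: start_probs[i])
    m2 = max(range(len(end_probs)), key=lambda i: end_probs[i])
    if m1 >= m2:
        return None
    # The span includes both the start token and the end token.
    return m1, m2, passage_tokens[m1:m2 + 1]
```

For instance, with a six-token passage whose start probability peaks at index 3 and whose end probability peaks at index 4, `select_span` returns the two tokens at those positions.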
Optionally, when the index m1 of the token p(m1) with the largest first probability is greater than or equal to the index m2 of the token p(m2) with the largest second probability, that is, when m1 >= m2, the start token and the end token of the fourth text information may be determined in either of the following two ways:
First way: the token p(m1) with the largest first probability in the second token sequence is taken as the start token of the fourth text information; then, among the tokens that follow p(m1) in the second token sequence, the token p(m3) with the largest second probability (so that m3 > m1) is found and taken as the end token of the fourth text information;
Second way: the token p(m2) with the largest second probability in the second token sequence is taken as the end token of the fourth text information; then, among the tokens that precede p(m2) in the second token sequence, the token p(m4) with the largest first probability (so that m4 < m2) is found and taken as the start token of the fourth text information.
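The two fallback strategies above can be sketched as follows, again with 0-based indices and hypothetical function names. Returning None when no token lies after p(m1) or before p(m2) is an assumption of this sketch; the description does not address that edge case.

```python
def argmax_range(values, lo, hi):
    """Index of the largest value among values[lo:hi] (first one on ties)."""
    return max(range(lo, hi), key=lambda i: values[i])


def fallback_keep_start(start_probs, end_probs):
    """First way: keep the best start m1, search the end among later tokens."""
    m1 = argmax_range(start_probs, 0, len(start_probs))
    if m1 + 1 >= len(end_probs):
        return None  # no token follows the start token
    m3 = argmax_range(end_probs, m1 + 1, len(end_probs))
    return m1, m3


def fallback_keep_end(start_probs, end_probs):
    """Second way: keep the best end m2, search the start among earlier tokens."""
    m2 = argmax_range(end_probs, 0, len(end_probs))
    if m2 == 0:
        return None  # no token precedes the end token
    m4 = argmax_range(start_probs, 0, m2)
    return m4, m2
```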
Optionally, the start token and the end token of the fourth text information may also be determined as follows: for every token pair in the second token sequence 740b, the sum of the first probability of the earlier token and the second probability of the later token is computed; the pair with the largest sum is then selected, with the earlier token taken as the start token and the later token taken as the end token. The start token and the end token of the fourth text information may also be determined in other ways.
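The pair-based selection above can be sketched in a single pass by tracking the best start candidate seen so far, which avoids enumerating all pairs explicitly. The sketch assumes that the earlier token must strictly precede the later one; the function name is hypothetical.

```python
def best_pair(start_probs, end_probs):
    """Return (score, start_index, end_index) maximizing
    start_probs[i] + end_probs[j] over all pairs with i < j,
    or None if the sequence has fewer than two tokens."""
    best = None
    best_start = 0  # index of the largest start probability among 0..j-1
    for j in range(1, len(end_probs)):
        if start_probs[j - 1] > start_probs[best_start]:
            best_start = j - 1
        score = start_probs[best_start] + end_probs[j]
        if best is None or score > best[0]:
            best = (score, best_start, j)
    return best
```

Because the running maximum is updated before each end position is scored, the loop visits each token once instead of comparing every pair.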
In step S340b, third text information containing the fourth text information is extracted from each piece of text information to be retrieved.
Based on the concept of the present application, after the fourth text information (that is, the candidate retrieval result) is determined in each piece of text information to be retrieved, third text information containing the fourth text information (that is, the candidate-retrieval-result related information) can be extracted from that text information; the third text information extracted from the respective pieces of text information to be retrieved can then form the third text information set.
As shown in Fig. 7, after the fourth text information 760 is obtained, the third text information 770 containing the fourth text information 760 can be extracted directly from the text information to be retrieved 720. For example, the third text information 770 may be the text corresponding to the token sequence (..., p(m1), ..., p(m2), ...), which contains the fourth text information 760 delimited by the start token p(m1) and the end token p(m2).
In some embodiments, step S340b may include one of the following: extracting, from each piece of text information to be retrieved, the sentence in which the fourth text information is located as the third text information; extracting, from each piece of text information to be retrieved, the natural paragraph in which the fourth text information is located as the third text information; or extracting, from each piece of text information to be retrieved, the fourth text information itself as the third text information. In other words, the third text information may be the natural sentence or natural paragraph in which the fourth text information is located in the corresponding text information to be retrieved, or the fourth text information itself. Optionally, the third text information may also be text information of another form in the corresponding text information to be retrieved that contains the fourth text information.
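As a rough illustration of the sentence and paragraph options above, the sketch below expands a character span to its enclosing sentence or paragraph. It assumes that the span is given as non-empty character offsets (end exclusive), that sentences end at one of a small set of terminator characters, and that paragraphs are separated by newlines. These are simplifying assumptions; a production system would need a proper sentence splitter.

```python
def containing_sentence(text, span_start, span_end, terminators=".!?。！？"):
    """Expand [span_start, span_end) to the enclosing sentence.

    Naive rule: a sentence ends at any terminator character, so
    abbreviations and decimal points are not handled.
    """
    left = span_start
    while left > 0 and text[left - 1] not in terminators:
        left -= 1
    right = span_end
    while right < len(text) and text[right - 1] not in terminators:
        right += 1
    return text[left:right].strip()


def containing_paragraph(text, span_start, span_end):
    """Expand [span_start, span_end) to the enclosing newline-delimited paragraph."""
    left = text.rfind("\n", 0, span_start) + 1  # rfind gives -1 when absent
    right = text.find("\n", span_end)
    if right == -1:
        right = len(text)
    return text[left:right].strip()
```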
In step S340c, the third text information set is constructed based on the third text information extracted from each piece of text information to be retrieved.
After the third text information containing the fourth text information is extracted from each piece of text information to be retrieved, the third text information extracted from the respective pieces may be used as elements to form the third text information set. Optionally, only a part of the extracted third text information may be selected to form the set. For example, a plurality of third text information having a relatively high retrieval matching degree or relevance with respect to the first text information may be selected, where the retrieval matching degree or relevance may be characterized by at least one of the first probability of the start token and the second probability of the end token of the fourth text information contained in the third text information. Selecting part of the third text information according to the retrieval matching degree further reduces the amount of data to be processed and improves retrieval efficiency while preserving retrieval accuracy.
In the embodiments shown in Fig. 6 and Fig. 7, the reading comprehension model analyzes the pieces of text information to be retrieved one by one and extracts semantically related information (that is, the third text information) from them. In each piece of text information to be retrieved, the candidate retrieval result (that is, the fourth text information, for example a candidate answer) corresponding to the first text information (for example, the question to be retrieved) is found, and the third text information containing that candidate retrieval result is extracted. This process is likewise based on semantic understanding and analysis of the first text information and the text information to be retrieved, so the retrieved candidate retrieval results and the related third text information have a high degree of relevance and retrieval matching with the first text information at the intrinsic semantic level. In addition, the reading comprehension model provides, for each token, the first probability of being the start token and the second probability of being the end token, which allows the third text information to be extracted flexibly and thus adapts to different information retrieval scenarios.
Fig. 8 shows an example flowchart of the multi-text summary acquisition step in the information retrieval method according to some embodiments of the present application. Fig. 9 shows a schematic diagram of generating a multi-text summary using a text summarization model according to some embodiments of the present application.
As shown in Fig. 8, the multi-text summary acquisition step S350 may include steps S350a to S350d, which are described in detail below with reference to Fig. 9.
In step S350a, for each third text information in the third text information set, the retrieval matching degree of the third text information is determined according to at least one of the first probability of the start token and the second probability of the end token of the fourth text information contained in the third text information.
According to the concept of the present application, in the multi-text summary acquisition operation of the third stage of the information retrieval process, the source material of the multi-text summary (that is, the objects to be summarized) may be selected from the third text information set (that is, the candidate-retrieval-result related information). To reduce the amount of data and computation while preserving retrieval accuracy, several third text information with a relatively high relevance or retrieval matching degree with respect to the first text information may be selected from the third text information set as the basis for the third-stage summary acquisition. Herein, the retrieval matching degree refers to the degree to which the third text information (that is, the candidate-retrieval-result related information) or the corresponding fourth text information (that is, the candidate retrieval result, for example an answer to the question) matches the first text information (for example, the question to be retrieved) in the semantic dimension. A higher retrieval matching degree indicates that the candidate retrieval result corresponding to the third text information better matches the retrieval target or question corresponding to the first text information, or that the candidate retrieval result is more accurate.
In some embodiments, the retrieval matching degree of each third text information in the third text information set may be determined according to at least one of the first probability of the start token and the second probability of the end token of the fourth text information contained in that third text information, because the magnitudes of the first and second probabilities reflect, to a certain extent, how well the fourth text information (the candidate retrieval result) or the third text information determined from them matches or relates to the first text information (the retrieval target).
Specifically, step S350a (for each third text information in the third text information set, determining the retrieval matching degree of the third text information according to at least one of the first probability of the start token and the second probability of the end token of the fourth text information contained in the third text information) may include determining the retrieval matching degree of the third text information based on at least one of the following values:
(1) the arithmetic mean of the first probability of the start token and the second probability of the end token;
(2) the geometric mean of the first probability of the start token and the second probability of the end token;
(3) the larger of the first probability of the start token and the second probability of the end token;
(4) the smaller of the first probability of the start token and the second probability of the end token.
Any one or more of the above four values may be used to determine the retrieval matching degree of the third text information or the fourth text information, so as to suit different situations. In addition, all four require little computation, which saves retrieval time while still determining the retrieval matching degree effectively and accurately.
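The four values listed above can be sketched as a single function. The function name and the `mode` parameter are assumptions introduced for this illustration.

```python
import math


def matching_degree(p_start, p_end, mode="arithmetic"):
    """Retrieval matching degree from the start and end probabilities.

    mode selects one of the four values: "arithmetic", "geometric",
    "max", or "min".
    """
    if mode == "arithmetic":
        return (p_start + p_end) / 2.0
    if mode == "geometric":
        return math.sqrt(p_start * p_end)
    if mode == "max":
        return max(p_start, p_end)
    if mode == "min":
        return min(p_start, p_end)
    raise ValueError(f"unknown mode: {mode}")
```

The minimum is the most conservative of the four, since a span scores highly only when both of its boundaries are confident, whereas the maximum rewards a span with a single confident boundary.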
Optionally, the retrieval matching degree of the third text information is not limited to the above four ways of determination; other ways may also be used, for example one based on the semantic similarity between the third text information and the first text information.
In step S350b, at least two third text information are selected from the third text information set according to the retrieval matching degree of each third text information in the set.
After the retrieval matching degree of each third text information in the third text information set is determined, a plurality of third text information with relatively high retrieval matching degrees may be selected from the set as the objects of the multi-text summary. In some embodiments, step S350b may include: sorting the third text information in the third text information set in descending order of their retrieval matching degree with respect to the first text information; and selecting the first N third text information from the sorted result, where N is a preset positive integer greater than 2. In this way, the N third text information with the highest retrieval matching degree with respect to the first text information are selected as the basis or objects of the multi-text summary, and N can be preset according to the actual application scenario. For example, for a relatively complex or abstract question to be retrieved (that is, the first text information), N may be set to a relatively large value to ensure the breadth and richness of the data being summarized, which helps to retrieve a more accurate retrieval result or answer. On the other hand, to keep the summarization process efficient, N should not be set too large; for example, it may be set to the smallest value suited to the specific application scenario that still ensures retrieval accuracy.
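The sort-and-select operation of step S350b can be sketched as follows. Representing each third text information as a `(text, matching_degree)` tuple is an assumption of this sketch.

```python
def select_top_n(fragments, n):
    """Return the n fragments with the highest retrieval matching degree.

    fragments is a list of (text, matching_degree) tuples. Python's sort
    is stable, so fragments with equal degrees keep their original order.
    """
    ranked = sorted(fragments, key=lambda item: item[1], reverse=True)
    return ranked[:n]
```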
In step S350c, the at least two third text information are concatenated in descending order of their respective retrieval matching degrees to form second text information to be processed.
In some embodiments, the multi-text summary may be obtained using a text summarization model. Before being input into the text summarization model, the data of the objects to be summarized (that is, the plurality of third text information) therefore need to be preprocessed to fit the input format of the model. In general, this preprocessing may include concatenating the plurality of third text information, for example concatenating the at least two third text information in descending order of retrieval matching degree to obtain the second text information to be processed. Optionally, the preprocessing may further include serializing the second text information to be processed to match the input format of the text summarization model (assuming that the model takes a text sequence or token sequence as input), for example performing word segmentation on the second text information to be processed to obtain the corresponding token sequence.
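The concatenation and serialization described above can be sketched as follows. The single-space separator and the whitespace tokenizer are placeholders chosen for this illustration; Chinese text in particular would need a real word segmenter, which the present application does not name.

```python
def build_summary_input(selected, separator=" "):
    """Concatenate fragments in descending matching degree, then tokenize.

    selected is a list of (text, matching_degree) tuples. Returns the
    concatenated text and its token sequence.
    """
    ordered = sorted(selected, key=lambda item: item[1], reverse=True)
    text = separator.join(fragment for fragment, _ in ordered)
    tokens = text.split()  # placeholder for a real word segmenter
    return text, tokens
```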
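In step S350d, a text summarization model is used to generate the multi-text summary corresponding to the second text information to be processed.
In some embodiments, the text summarization model may adopt a pre-trained encoder-decoder structure, for example a sequence-to-sequence framework based on a deep neural network that maps a source text sequence to a summary text sequence. As shown in Fig. 9, the text summarization model may include an encoder 910 and a decoder 920, where the encoder 910 (for example, a semantic encoder) converts the source text sequence into a corresponding semantic vector sequence, and the decoder 920 (for example, a recurrent decoder) generates the summary text sequence from the semantic vector sequence (for example, through an attention mechanism and recurrent decoding). Specifically, the deep neural network used by the text summarization model may include a convolutional neural network (CNN) or a recurrent neural network (RNN), and the encoder 910 and the decoder 920 may use one or more neural network structures that are the same or different, as long as the corresponding input and output functions can be realized.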
For example, the process of generating a multi-text summary of the second text information to be processed using the text summarization model is described below with reference to Fig. 9. As shown in Fig. 9, the first stage is preprocessing: the second text information to be processed 930 is input into a text preprocessor 940 for serialization, yielding the corresponding token sequence or text sequence (T_1, T_2, T_3, ..., T_n), where T_i (i = 1, 2, ..., n) denotes the i-th text unit in the text sequence and n denotes the total number of text units in the sequence.
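Next comes the encoding stage: the preprocessed text sequence (T_1, T_2, T_3, ..., T_n) is input into the encoder 910 of the text summarization model for semantic understanding, yielding a semantic vector sequence (H_1, H_2, H_3, ..., H_n) in one-to-one correspondence with the text sequence, where H_i (i = 1, 2, ..., n) denotes the i-th semantic vector in the sequence, corresponding to the i-th text unit or token T_i in the text sequence, and n denotes the total number of semantic vectors. The semantic vector sequence (H_1, H_2, H_3, ..., H_n) may also be called the hidden state vector sequence of the encoder 910. As shown in Fig. 9, the semantic encoding stage may use a bidirectional recurrent neural network to encode the text sequence, so that the encoding incorporates more of the semantic associations between different text units or tokens (for example, forward and backward associations) and thus reflects the overall semantic features of the second text information to be processed more accurately.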
The multi-text summarization process then enters the decoding stage. In some embodiments of the present application, the decoding stage uses a recurrent decoding process over multiple time steps, where the number of decoding time steps (decoding moments) may be predetermined according to the specific application scenario. The overall decoding process of the decoder 920 can be summarized as follows: first, an attention mechanism transforms the semantic vector sequence (H_1, H_2, H_3, ..., H_n) output by the encoder 910 into a content vector sequence (C_1, C_2, C_3, ..., C_r), where r denotes the number of decoding moments of the decoder 920; then the decoder 920 (for example, one with a recurrent neural network structure) performs recurrent decoding at each decoding moment to obtain a decoded text sequence (A_1, A_2, A_3, ..., A_r) as the multi-text summary sequence; finally, the final multi-text summary is obtained by concatenation.
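Fig. 9 shows the decoding process at the t-th decoding moment (decoding time step) of the decoding stage. As shown in Fig. 9, at the current decoding moment, that is, the t-th decoding moment, the attention distribution 950 (also called the attention weight sequence) for the current decoding moment is first determined from the decoder hidden state vector S_{t-1} output by the decoder 920 at the previous, (t-1)-th, decoding moment and the semantic vector sequence (H_1, H_2, H_3, ..., H_n). In Fig. 9, the black bars in the attention distribution 950 represent the attention weights in one-to-one correspondence with the semantic vectors in the sequence (H_1, H_2, H_3, ..., H_n); a longer bar indicates a larger attention weight. Next, as shown in Fig. 9, a weighted sum of the semantic vectors in the sequence (H_1, H_2, H_3, ..., H_n) is computed using the attention weight sequence (attention distribution 950) to obtain the content vector C_t for the current decoding moment. The decoder 920 then determines the decoder hidden state vector S_t for the current decoding moment from the content vector C_t, the decoder hidden state vector S_{t-1} of the previous decoding moment, and the decoded text A_{t-1} output by the decoder 920 at the previous decoding moment. Finally, the decoder 920 determines the (final) decoded text A_t for the current decoding moment from the content vector C_t, the current decoder hidden state vector S_t, and the decoded text A_{t-1} output at the previous decoding moment.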
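The attention computation at one decoding moment can be sketched as follows. The description does not specify how the attention weights are derived from S_{t-1} and the semantic vectors, so this sketch assumes dot-product scoring followed by a softmax, and assumes that S_{t-1} has the same dimension as each H_i. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_step(prev_state, encoder_states):
    """Compute the attention weights and content vector C_t.

    prev_state     : decoder hidden state S_{t-1} (list of floats)
    encoder_states : semantic vectors H_1..H_n (list of lists of floats)
    Scoring by dot product is an assumption of this sketch.
    """
    scores = [sum(s * h for s, h in zip(prev_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Content vector: attention-weighted sum of the semantic vectors.
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context
```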
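As shown in Fig. 9, in determining the (final) decoded text A_t for the current decoding moment, the decoder 920 (for example, its classification prediction function) may first predict the vocabulary probability distribution 960 for the current decoding moment, that is, the probability of each possible output text or word becoming the final decoded text A_t. The vocabulary probability distribution 960 in Fig. 9 is drawn as a series of bars, each representing the probability that one word becomes the decoded text at the current moment; the taller the bar, the more likely the word is to be the decoded text for the current decoding moment. Therefore, as shown in Fig. 9, the word or text corresponding to the tallest bar in the vocabulary probability distribution 960 may be selected as the decoded text A_t for the current (t-th) decoding moment.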
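Selecting the tallest bar at each decoding moment amounts to greedy decoding, which can be sketched as follows. The decoder itself is abstracted into a caller-supplied `step_fn`; the `<bos>` and `<eos>` markers and the early stop on `<eos>` are assumptions of this sketch, since the description only states that the number of decoding moments is predetermined.

```python
def pick_word(vocab_probs):
    """Return the word with the highest probability (the tallest bar)."""
    return max(vocab_probs, key=vocab_probs.get)


def greedy_decode(step_fn, max_steps, end_word="<eos>"):
    """Run up to max_steps decoding moments, choosing the top word each time.

    step_fn(previous_word) must return a dict mapping candidate words to
    probabilities; it stands in for the decoder, which in the description
    also uses C_t and S_t.
    """
    output = []
    previous = "<bos>"
    for _ in range(max_steps):
        previous = pick_word(step_fn(previous))
        if previous == end_word:
            break
        output.append(previous)
    return output
```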
In the decoding stage of the multi-text summary acquisition process according to some embodiments of the present application, the attention mechanism assigns a weight to each semantic vector in the semantic vector sequence to distinguish how critical the different semantic vectors are to the content vector computation and the subsequent decoding at the current decoding moment, thereby improving the efficiency and accuracy of the neural network. For example, in the decoding stage of the text summarization model, each vector in the input semantic vector sequence corresponds to a token of the second text information to be processed, but these tokens are related to the question to be processed to different degrees. The attention weights (attention distribution) therefore express the importance of the different semantic vectors in the decoding at the current decoding moment, which improves the efficiency and accuracy of the model's prediction and inference.
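In addition, in the decoding stage of the multi-text summary acquisition process according to some embodiments of the present application, when the decoded text for the current (t-th) decoding moment is computed, the input of the decoder 920 includes not only the content vector C_t of the current decoding moment and the decoder hidden state vector S_{t-1} output at the previous, (t-1)-th, decoding moment, but also the decoded text A_{t-1} of the previous decoding moment. The input of the decoder 920 thus incorporates more kinds of relevant features (for example, the contextual features associated with A_{t-1}), which helps the final decoded text A_t at the current decoding moment reflect the overall semantic features of the second text information to be processed more accurately, especially its contextual semantic relationships and associations. As a result, the final decoded text sequence and/or multi-text summary matches the question to be processed more precisely.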
Fig. 10 is a schematic diagram of the complete process of the information retrieval method according to some embodiments of the present application. In the embodiment shown in Fig. 10, the first text information is a question to be retrieved, the second text information is an article, the text information to be retrieved is an article to be retrieved, the third text information is an article fragment, and the fourth text information is a candidate answer. In Fig. 10, the part above the dotted line shows the online operation process, and the part below the dotted line shows the offline operation process.
As shown in FIG. 10, in the offline part, the semantic understanding model is used to determine the semantic feature vectors of the candidate retrieval objects, that is, of the multiple articles. A semantic feature vector index library is then built locally from these semantic feature vectors, or the vectors are stored in a preset semantic feature vector index library, so that they can be called directly during online retrieval. In scenarios with a large volume of articles to be retrieved, this greatly improves data processing efficiency and significantly optimizes network resource scheduling.
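The offline step can be sketched as encoding every article once and storing the result under its identifier. In this minimal example `toy_encode` is a placeholder for the semantic understanding model (a keyword counter, not a neural encoder), and the in-memory dictionary stands in for the semantic feature vector index library; both are assumptions for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so later similarity is a plain dot product."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def build_index(articles, encode):
    """Offline: encode each article once and key the vector by article id."""
    return {article_id: normalize(encode(text))
            for article_id, text in articles.items()}


def toy_encode(text):
    """Placeholder encoder (assumption): counts of three hand-picked keywords."""
    vocab = ("retrieval", "summary", "attention")
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]


if __name__ == "__main__":
    index = build_index(
        {"a": "retrieval retrieval summary", "b": "attention"}, toy_encode)
    print(index)
```

Because the index is built ahead of time, the online path only has to encode the incoming question.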
Then, as shown in FIG. 10, once the question to be retrieved has been obtained, online information retrieval begins. The online process can be divided into three stages: article screening, fragment extraction, and text summarization. In the article screening stage, the question to be retrieved is first input into the semantic understanding model to determine its first semantic feature vector. Next, to screen for semantically similar articles, the second semantic feature vector of each article is fetched from the semantic feature vector index library, and the similarity between the first semantic feature vector and each second semantic feature vector is computed. The articles are then sorted by their semantic similarity, and the top M articles are selected as the articles to be retrieved, where M is a positive integer. In the fragment extraction stage, a reading comprehension model extracts, from each of the M articles screened in the previous stage, an article fragment containing a candidate answer that matches the question to be retrieved. In the text summarization stage, a text summarization model obtains or generates a multi-text summary for several article fragments selected from the M article fragments (for example, the top N fragments by retrieval matching degree), and the multi-text summary is presented to the user as the final retrieval result or answer to the question.
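The article screening stage can be sketched as a top-M selection over precomputed vectors. The example assumes cosine similarity as the semantic similarity measure, which this passage does not specify; `top_m_articles` and the dictionary-shaped index are hypothetical names used only for illustration.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_m_articles(query_vec, index, m):
    """Rank articles by similarity to the query vector and keep the top M ids."""
    scored = ((cosine(query_vec, vec), article_id)
              for article_id, vec in index.items())
    return [article_id for _, article_id in heapq.nlargest(m, scored)]


if __name__ == "__main__":
    index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
    print(top_m_articles([1.0, 0.1], index, m=2))
```

`heapq.nlargest` avoids sorting the whole collection when M is much smaller than the number of articles; for very large libraries an approximate nearest-neighbour index would be the usual substitute.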
In the embodiment shown in FIG. 10, the entire information retrieval process is divided into an online part and an offline part, which complement each other to retrieve answers accurately and efficiently. On the one hand, retrieval objects such as articles (i.e., the second text information) are generally relatively fixed, known data stored in advance, for example in a server database, and their volume is often very large. Therefore, to improve data processing efficiency, the semantic understanding operation (i.e., obtaining or computing semantic feature vectors or semantic encodings) for the large amount of second text information is performed in advance, offline. The resulting second semantic feature vectors are stored in a preset semantic feature vector index library, so that when retrieval takes place the semantic similarity to the first text information (such as the question to be retrieved) can be computed online. This avoids running the semantic understanding model in real time over a huge amount of second text information, which, especially with a large volume of objects or articles to be retrieved, greatly improves data processing efficiency and significantly optimizes network resource allocation and scheduling. On the other hand, the online part uses three natural language processing models based on deep neural networks (namely, the semantic understanding model, the reading comprehension model, and the text summarization model) to perform article screening, fragment extraction, and multi-text summary generation, which helps ensure the robustness and accuracy of the retrieval results or answers.
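FIG. 11 is an exemplary structural block diagram of a semantic-understanding-based information retrieval device 1100 according to some embodiments of the present application. As shown in FIG. 11, the information retrieval device 1100 may include a first acquisition module 1110, a first determination module 1120, a selection module 1130, an extraction module 1140, a second acquisition module 1150, and a second determination module 1160.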
The first acquisition module 1110 may be configured to acquire first text information indicating a retrieval target and a plurality of pieces of second text information indicating candidate retrieval objects. The first determination module 1120 may be configured to determine the semantic similarity between the first text information and each of the plurality of pieces of second text information. The selection module 1130 may be configured to select at least one piece of to-be-retrieved text information from the plurality of pieces of second text information according to those semantic similarities. The extraction module 1140 may be configured to extract, from each of the at least one piece of to-be-retrieved text information, third text information semantically related to the first text information, so as to form a third text information set. The second acquisition module 1150 may be configured to acquire a multi-text summary corresponding to at least two pieces of third text information in the third text information set. The second determination module 1160 may be configured to determine, based on the multi-text summary, the retrieval result corresponding to the first text information.
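One way to picture how these modules fit together is a small pipeline object in which each model is injected as a callable. This is a sketch under stated assumptions: the class name `RetrievalDevice`, its field names, and the toy stand-in functions are hypothetical, and the returned result is simply the summary itself.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class RetrievalDevice:
    """Hypothetical composition of the modules; each model is a plain callable."""
    similarity: Callable[[str, str], float]    # role of the first determination module
    extract: Callable[[str, str], str]         # role of the extraction module
    summarize: Callable[[Sequence[str]], str]  # role of the second acquisition module
    top_m: int = 2

    def retrieve(self, query: str, candidates: List[str]) -> str:
        # Selection module: keep the top-M candidates by semantic similarity.
        ranked = sorted(candidates,
                        key=lambda doc: self.similarity(query, doc),
                        reverse=True)
        selected = ranked[: self.top_m]
        # Extraction module: one query-related fragment per selected candidate.
        fragments = [self.extract(query, doc) for doc in selected]
        # Second acquisition and determination modules: summary as the result.
        return self.summarize(fragments)


if __name__ == "__main__":
    def shared_words(query: str, doc: str) -> float:
        return float(len(set(query.split()) & set(doc.replace(".", "").split())))

    device = RetrievalDevice(
        similarity=shared_words,
        extract=lambda query, doc: doc.split(".")[0],  # toy: first sentence
        summarize=" | ".join,                           # toy: join fragments
    )
    docs = ["semantic retrieval works. extra",
            "cooking pasta. more",
            "retrieval index. tail"]
    print(device.retrieve("semantic retrieval", docs))
```

Injecting the models as callables mirrors the remark that modules may be realized in software, hardware, or a combination, since the pipeline does not depend on how each one is implemented.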
It should be noted that the modules described above may be implemented in software, in hardware, or in a combination of both. Several different modules may be implemented in the same software or hardware structure, or one module may be implemented by several different software or hardware structures.
In the information retrieval device according to some embodiments of the present application, first, a relatively small amount of to-be-retrieved text information is screened out of a large amount of second text information according to the semantic similarity between the first text information and the second text information, rather than by purely literal matching. This significantly improves overall efficiency without losing important retrieval resources that are highly relevant to the question (i.e., while preserving retrieval breadth and accuracy). It also overcomes two problems of the related art: the loss of important retrieval resources caused by exact keyword matching, and the inefficiency caused by complex procedures and a huge amount of computation. Second, for each piece of to-be-retrieved text information obtained in the first-stage screening, third text information corresponding to the first text information (such as the question to be retrieved) is extracted or retrieved from it on the basis of semantic relevance; the third text information is the text related to a candidate answer to the question. The intrinsic semantic features of the question and of the information to be retrieved are thus used again to extract the candidate answers and their related text (i.e., the third text information set), which further ensures a high relevance between the third text information set and the first text information and a high accuracy of the final retrieval result. Finally, a multi-text summary (for example, a generative text summary) is produced for at least two pieces of third text information in the third text information set. That is, a multi-text summary is generated by summarizing and reasoning over, for example, several highly matching candidate-answer-related texts, and serves as the final answer to the question (for example, as part of the retrieval result). Because such a summary fuses several pieces of candidate-answer-related text information (i.e., third text information) with high retrieval matching degree, it can further improve the quality and accuracy of the retrieval result.
FIG. 12 schematically shows an example block diagram of a computing device 1200 according to some embodiments of the present application. The computing device 1200 may represent a device used to implement the various apparatuses or modules described herein and/or to perform the various methods described herein. The computing device 1200 may be, for example, a server, a desktop computer, a laptop computer, a tablet, a smartphone, a smartwatch, a wearable device, or any other suitable computing device or computing system, ranging from full-resource devices with substantial storage and processing resources to low-resource devices with limited storage and/or processing resources. In some embodiments, the semantic-understanding-based information retrieval device 1100 described above with respect to FIG. 11 may be implemented in one or more computing devices 1200.
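As shown in FIG. 12, the example computing device 1200 includes a processing system 1201, one or more computer-readable media 1202, and one or more I/O interfaces 1203 that are communicatively coupled to one another. Although not shown, the computing device 1200 may also include a system bus or other data and command transfer system that couples the various components to one another. The system bus may include any one or a combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that uses any of a variety of bus architectures. Other examples, such as control and data lines, may also be included.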
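The processing system 1201 represents functionality to perform one or more operations using hardware. Accordingly, the processing system 1201 is illustrated as including hardware elements 1204 that may be configured as processors, functional blocks, and so on. This may include implementation in hardware as an application-specific integrated circuit or other logic device formed using one or more semiconductors. The hardware elements 1204 are not limited by the materials from which they are formed or the processing mechanisms employed therein. For example, a processor may be composed of semiconductor(s) and/or transistors (e.g., electronic integrated circuits (ICs)). In such a context, processor-executable instructions may be electronically executable instructions.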
The computer-readable media 1202 are illustrated as including memory/storage 1205. The memory/storage 1205 represents memory/storage associated with one or more computer-readable media. The memory/storage 1205 may include volatile media (such as random access memory (RAM)) and/or non-volatile media (such as read-only memory (ROM), flash memory, optical disks, magnetic disks, and so on). The memory/storage 1205 may include fixed media (e.g., RAM, ROM, a fixed hard drive, and so on) as well as removable media (e.g., flash memory, a removable hard drive, an optical disk, and so on). By way of example, the memory/storage 1205 may be used to store data such as the first text information, second text information, third text information, and retrieval results mentioned in the embodiments above. The computer-readable media 1202 may be configured in various other ways, as described further below.
The one or more I/O (input/output) interfaces 1203 represent functionality that allows a user to enter commands and information into the computing device 1200 and also allows information to be presented to the user and/or sent to other components or devices using various input/output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice input), a scanner, touch functionality (e.g., capacitive or other sensors configured to detect physical touch), a camera (e.g., one that may employ visible or non-visible wavelengths, such as infrared frequencies, to detect motion that does not involve touch as gestures), a network card, a receiver, and so on. Examples of output devices include a display device, speakers, a printer, a haptic response device, a network card, a transmitter, and so on.
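The computing device 1200 also includes an information retrieval strategy 1206. The information retrieval strategy 1206 may be stored as computer program instructions in the memory/storage 1205, or may be hardware or firmware. Together with the processing system 1201 and the like, the information retrieval strategy 1206 may implement all the functions of the modules of the semantic-understanding-based information retrieval device 1100 described with respect to FIG. 11.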
Various techniques may be described herein in the general context of software, hardware, elements, or program modules. Generally, such modules include routines, programs, objects, elements, components, data structures, and so on that perform particular tasks or implement particular abstract data types. The terms "module", "function", and the like as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
An implementation of the described modules and techniques may be stored on or transmitted across some form of computer-readable media. The computer-readable media may include a variety of media that can be accessed by the computing device 1200. By way of example and not limitation, computer-readable media may include "computer-readable storage media" and "computer-readable signal media".
In contrast to mere signal transmission, carrier waves, or signals per se, "computer-readable storage media" refers to media and/or devices that enable persistent storage of information, and/or to tangible storage. Computer-readable storage media therefore refers to non-signal-bearing media. Computer-readable storage media include hardware such as volatile and non-volatile, removable and non-removable media and/or storage devices implemented in a method or technology suitable for storing information such as computer-readable instructions, data structures, program modules, logic elements/circuits, or other data. Examples of computer-readable storage media may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, hard disks, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other storage devices, tangible media, or articles of manufacture suitable for storing the desired information and accessible by a computer.
"Computer-readable signal media" refers to signal-bearing media configured to transmit instructions to the hardware of the computing device 1200, such as via a network. Signal media typically may embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, data signal, or other transport mechanism. Signal media also include any information delivery media. By way of example and not limitation, signal media include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media.
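As previously described, the hardware elements 1204 and the computer-readable media 1202 represent instructions, modules, programmable device logic, and/or fixed device logic implemented in hardware form, which in some embodiments may be used to implement at least some aspects of the techniques described herein. Hardware elements may include components of an integrated circuit or system-on-chip, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a complex programmable logic device (CPLD), and other implementations in silicon or other hardware devices. In this context, a hardware element may operate as a processing device that performs program tasks defined by the instructions, modules, and/or logic embodied by the hardware element, as well as a hardware device used to store instructions for execution, for example, the computer-readable storage media described previously.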
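Combinations of the foregoing may also be used to implement the various techniques and modules described herein. Accordingly, software, hardware, or program modules and other program modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage medium and/or by one or more hardware elements 1204. The computing device 1200 may be configured to implement particular instructions and/or functions corresponding to software and/or hardware modules. Thus, implementing a module as a module executable by the computing device 1200 as software may be achieved at least partially in hardware, for example by using the computer-readable storage media and/or the hardware elements 1204 of the processing system. The instructions and/or functions may be executable or operable by, for example, one or more computing devices 1200 and/or processing systems 1201 to implement the techniques, modules, and examples described herein.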
The techniques described herein may be supported by these various configurations of the computing device 1200 and are not limited to the specific examples of the techniques described herein.
In particular, according to embodiments of the present application, the processes described above with reference to the flowcharts may be implemented as computer programs. For example, embodiments of the present application provide a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for performing at least one step of the method embodiments of the present application.
Some embodiments of the present application provide one or more computer-readable storage media storing computer-readable instructions that, when executed, implement the semantic-understanding-based information retrieval method according to some embodiments of the present application. The steps of that method may be converted into computer-readable instructions through programming and stored in a computer-readable storage medium. When such a computer-readable storage medium is read or accessed by a computing device or computer, the computer-readable instructions in it are executed by a processor of the computing device or computer to implement the method according to some embodiments of the present application.
In this specification, descriptions that use the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" mean that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present application. Such references do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine different embodiments or examples described in this specification, as well as the features of different embodiments or examples.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a custom logical function or process. The scope of the preferred embodiments of the present application includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as would be understood by those skilled in the art to which the embodiments of the present application belong.
The logic and/or steps represented in the flowcharts, or otherwise described herein, may for example be regarded as an ordered list of executable instructions for implementing logical functions. They may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device.
It should be understood that the parts of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the embodiments described above, multiple steps or methods may be implemented by software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, they may be implemented by any one or a combination of the following technologies known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments may be carried out by hardware under the direction of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present application may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
Claims (17)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310331474.3A CN116340502A (en) | 2023-03-30 | 2023-03-30 | Information retrieval method and device based on semantic understanding |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116340502A true CN116340502A (en) | 2023-06-27 |
Family
ID=86891110
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310331474.3A Pending CN116340502A (en) | 2023-03-30 | 2023-03-30 | Information retrieval method and device based on semantic understanding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116340502A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118093649A (en) * | 2024-04-23 | 2024-05-28 | 腾讯科技(深圳)有限公司 | Content query method and related device based on database |
CN118394892A (en) * | 2024-07-01 | 2024-07-26 | 浪潮电子信息产业股份有限公司 | Question answering method, device, equipment and computer readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080086298A1 (en) * | 2006-10-10 | 2008-04-10 | Anisimovich Konstantin | Method and system for translating sentences between langauges |
CN112100365A (en) * | 2020-08-31 | 2020-12-18 | 电子科技大学 | Two-stage text summarization method |
CN112445887A (en) * | 2019-08-29 | 2021-03-05 | 南京大学 | Method and device for realizing machine reading understanding system based on retrieval |
CN113051371A (en) * | 2021-04-12 | 2021-06-29 | 平安国际智慧城市科技股份有限公司 | Chinese machine reading understanding method and device, electronic equipment and storage medium |
CN113704421A (en) * | 2021-04-02 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Information retrieval method and device, electronic equipment and computer readable storage medium |
-
2023
- 2023-03-30 CN CN202310331474.3A patent/CN116340502A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11928434B2 (en) | Method for text generation, device and storage medium | |
Li et al. | A survey on deep learning for named entity recognition | |
Liu et al. | A survey on deep neural network-based image captioning | |
CN113434636B (en) | Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium | |
CN107066464B (en) | Semantic natural language vector space | |
KR101754473B1 (en) | Method and system for automatically summarizing documents to images and providing the image-based contents | |
CN111753060A (en) | Information retrieval method, device, equipment and computer readable storage medium | |
CN111324771B (en) | Video tag determination method and device, electronic equipment and storage medium | |
US11928418B2 (en) | Text style and emphasis suggestions | |
US20230386238A1 (en) | Data processing method and apparatus, computer device, and storage medium | |
US12326867B2 (en) | Method and system of using domain specific knowledge in retrieving multimodal assets | |
CN112101031B (en) | Entity identification method, terminal equipment and storage medium | |
CN112989208B (en) | Information recommendation method and device, electronic equipment and storage medium | |
CN116340502A (en) | Information retrieval method and device based on semantic understanding | |
CN110968725B (en) | Image content description information generation method, electronic device and storage medium | |
CN115203421A (en) | Method, device and equipment for generating label of long text and storage medium | |
CN110874408A (en) | Model training method, text recognition device and computing equipment | |
CN113761125B (en) | Dynamic summary determination method and device, computing device and computer storage medium | |
CN111061939A (en) | Scientific research academic news keyword matching recommendation method based on deep learning | |
CN117079298A (en) | Information extraction method, training method of information extraction system and information extraction system | |
CN114880436A (en) | Text processing method and device | |
CN112463914A (en) | Entity linking method, device and storage medium for internet service | |
CN114818727A (en) | Key sentence extraction method and device | |
CN114722837A (en) | A method, device and computer-readable storage medium for recognizing multi-round dialogue intent | |
US20240403339A1 (en) | Document recommendation using contextual embeddings |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||