WO2023236253A1

WO2023236253A1 - Document retrieval method and apparatus, and electronic device

Info

Publication number: WO2023236253A1
Application number: PCT/CN2022/100569
Authority: WO
Inventors: 段沛宸
Original assignee: 来也科技(北京)有限公司
Priority date: 2022-06-07
Filing date: 2022-06-22
Publication date: 2023-12-14
Also published as: CN114925174A

Abstract

The present disclosure relates to a document retrieval method and apparatus, and an electronic device. The method comprises: performing a query on the basis of a query sentence, so as to acquire, from among a plurality of content fragments comprised in at least one document, a plurality of candidate content fragments related to the query sentence; acquiring a first correlation between the query sentence and each candidate content fragment by using a correlation model in the field of NLP; and on the basis of each first correlation, acquiring from among the candidate content fragments a target content fragment that matches the query sentence. Automatic document retrieval is realized, and the labor cost and time cost required for document retrieval are reduced. Moreover, a target content fragment is acquired according to a correlation between each content fragment in a document, which is acquired on the basis of AI technology, and a query sentence, such that the specific content that can answer a user's question can be accurately determined from the document, thus laying a foundation for accurately providing an answer to the user's question. In the present disclosure, RPA and AI can also be combined to realize the acquisition of content fragments in a document in IA, further reducing the labor cost.

Description

Document retrieval method, device and electronic equipment

Cross-references to related applications

This application is filed based on a Chinese patent application with application number 2022106370191 and a filing date of June 7, 2022, and claims the priority of the Chinese patent application. The entire content of the Chinese patent application is hereby incorporated into this application as a reference.

Technical field

The present disclosure relates to the technical fields of robotic process automation and artificial intelligence, and specifically relates to a document retrieval method, device and electronic equipment.

Background

Robotic Process Automation (RPA) uses specific "robot software" to simulate human operations on a computer and automatically execute process tasks according to rules.

Artificial Intelligence (AI for short) is a technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence.

Intelligent Automation (IA) is a general term for a series of technologies ranging from robotic process automation to artificial intelligence. It combines RPA with AI technologies such as Optical Character Recognition (OCR), Intelligent Character Recognition (ICR), Process Mining, Deep Learning (DL), Machine Learning (ML), Natural Language Processing (NLP), Automatic Speech Recognition (ASR), Text To Speech (TTS) speech synthesis, and Computer Vision (CV) to create end-to-end business processes that can think, learn and adapt. It covers the entire process from process discovery and process automation, through automatic and continuous data collection and understanding the meaning of the data, to using the data to manage and optimize business processes.

At present, many business scenarios, such as power question-and-answer systems, require retrieving from a large number of documents the specific content that can answer a user's question, such as a particular sentence or the content of particular cells in a table, so that an accurate answer can be given based on that content. In the related art, after a user's question is obtained, either a large number of documents are queried manually to find the specific content that can answer the question, or conventional document-level retrieval is used to find documents that match the question through string matching. Manual querying consumes considerable labor and time, while conventional document-level retrieval can only retrieve documents that may answer the user's question and cannot accurately retrieve the specific content within those documents that answers it. Therefore, there is a need for a document retrieval method that can accurately retrieve the specific content in a document that can answer the user's question at low labor and time cost.

Summary of the invention

The present disclosure provides a document retrieval method, device and electronic equipment to solve the technical problems that existing document retrieval methods incur high labor and time costs and cannot accurately retrieve the specific content in a document that can answer a user's question.

An embodiment of the first aspect of the present disclosure provides a document retrieval method. The method includes: obtaining a query statement; performing a query based on the query statement to obtain, from multiple content fragments included in at least one document, multiple candidate content fragments related to the query statement; using a correlation model in the field of natural language processing (NLP) to obtain a first correlation between the query statement and each candidate content fragment; and, based on each first correlation, obtaining from the candidate content fragments a target content fragment that matches the query statement.

In some embodiments, performing the query based on the query statement to obtain, from multiple content fragments included in at least one document, multiple candidate content fragments related to the query statement includes: obtaining the content contained in each content fragment and the attribute information of each content fragment; obtaining, based on the content contained in each content fragment, a content correlation between the query statement and the corresponding content fragment, and obtaining, based on the attribute information of each content fragment, an attribute correlation between the query statement and the corresponding content fragment; and, based on the content correlation and attribute correlation between the query statement and each content fragment, obtaining from the multiple content fragments the multiple candidate content fragments related to the query statement.

In some embodiments, the content correlation has a corresponding first weight, and the attribute correlation has a corresponding second weight; obtaining, based on the content correlation and attribute correlation between the query statement and each content fragment, the multiple candidate content fragments related to the query statement from the multiple content fragments includes: determining a second correlation degree between the query statement and the corresponding content fragment based on each content correlation and the corresponding first weight, and each attribute correlation and the corresponding second weight; and, based on the second correlation degree between the query statement and each content fragment, obtaining from the multiple content fragments the multiple candidate content fragments related to the query statement.

In some embodiments, using a correlation model in the field of natural language processing (NLP) to obtain the first correlation between the query statement and each candidate content fragment includes: for each candidate content fragment, inputting the query statement and the candidate content fragment into the correlation model to obtain the first correlation between the query statement and the candidate content fragment.

In some embodiments, using a correlation model in the field of natural language processing (NLP) to obtain the first correlation between the query statement and each candidate content fragment includes: for each candidate content fragment, obtaining the corresponding attribute information and splicing the attribute information with the candidate content fragment to obtain a corresponding splicing result; and inputting the query statement and the splicing result corresponding to the candidate content fragment into the correlation model to obtain the first correlation between the query statement and the candidate content fragment.
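The splicing of attribute information with a candidate content fragment can be illustrated with a minimal sketch. The disclosure does not specify the splicing format, so the function name `splice`, the ordering of the attributes, and the separator used here are all assumptions made for illustration.

```python
from typing import List, Optional


def splice(content: str,
           document_name: str = "",
           chapter_title: str = "",
           parent_titles: Optional[List[str]] = None,
           sep: str = " > ") -> str:
    """Prefix a fragment's content with whatever attribute information exists.

    Hypothetical format: document name, then parent titles (outermost first),
    then the chapter title, joined by `sep`, followed by the content itself.
    """
    parts = [document_name, *(parent_titles or []), chapter_title]
    header = sep.join(p for p in parts if p)
    return f"{header}\n{content}" if header else content


spliced = splice(
    "The rated voltage is 220 kV.",
    document_name="Substation Manual",
    chapter_title="2.1 Ratings",
    parent_titles=["2 Equipment"],
)
```

The resulting string, rather than the bare fragment, would then be paired with the query statement as input to the correlation model.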

In some embodiments, before performing the query based on the query statement to obtain, from multiple content fragments included in at least one document, multiple candidate content fragments related to the query statement, the method further includes: recognizing each document based on optical character recognition (OCR) technology in the field of artificial intelligence to obtain a recognition result of each document; performing structured processing on each recognition result to obtain the multiple content fragments included in each document; and storing each content fragment in association with the corresponding content field.

In some embodiments, recognizing each document based on optical character recognition (OCR) technology in the field of artificial intelligence to obtain the recognition result of each document includes: calling an RPA robot to upload each document to a document processing platform, so that the document processing platform recognizes each document based on OCR technology, and obtaining the recognition result of each document returned by the document processing platform.

In some embodiments, the recognition results include text recognition results and/or table recognition results; performing structured processing on each recognition result to obtain the multiple content fragments included in each document includes: segmenting the text recognition results and/or table recognition results according to a preset segmentation method to obtain multiple segmented fragments; and aggregating the multiple segmented fragments according to a preset aggregation method to obtain multiple content fragments, wherein each content fragment is obtained by aggregating at least one segmented fragment.

In some embodiments, the attribute information includes at least one of a document name, a chapter title, and parent titles of various levels of the chapter title.

An embodiment of the second aspect of the present disclosure provides a document retrieval device. The device includes: a first acquisition module, used to acquire a query statement; a query module, used to perform a query based on the query statement to obtain, from multiple content fragments included in at least one document, multiple candidate content fragments related to the query statement; a second acquisition module, used to obtain a first correlation degree between the query statement and each candidate content fragment using a correlation model in the field of natural language processing (NLP); and a third acquisition module, used to acquire, based on each first correlation degree, a target content fragment that matches the query statement from the candidate content fragments.

In some embodiments, the query module includes: a first acquisition unit, used to acquire the content contained in each content fragment and the attribute information of each content fragment; a second acquisition unit, used to acquire, based on the content contained in each content fragment, the content correlation between the query statement and the corresponding content fragment, and to acquire, based on the attribute information of each content fragment, the attribute correlation between the query statement and the corresponding content fragment; and a third acquisition unit, used to obtain, based on the content correlation and attribute correlation between the query statement and each content fragment, multiple candidate content fragments related to the query statement from the multiple content fragments.

In some embodiments, the content correlation has a corresponding first weight, and the attribute correlation has a corresponding second weight; the third acquisition unit is used to: determine a second correlation degree between the query statement and the corresponding content fragment based on each content correlation and the corresponding first weight, and each attribute correlation and the corresponding second weight; and, based on the second correlation degree between the query statement and each content fragment, obtain from the multiple content fragments the multiple candidate content fragments related to the query statement.

In some embodiments, the second acquisition module includes: a fourth acquisition unit, configured to, for each candidate content fragment, input the query statement and the candidate content fragment into the correlation model to obtain the first correlation degree between the query statement and the candidate content fragment.

An embodiment of the third aspect of the present disclosure provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the method described in the embodiment of the first aspect of the present disclosure is implemented.

An embodiment of the fourth aspect of the present disclosure provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the method described in the embodiment of the first aspect of the present disclosure is implemented.

An embodiment of the fifth aspect of the present disclosure provides a computer program product, which includes a computer program. When executed by a processor, the computer program implements the method described in the embodiment of the first aspect of the present disclosure.

An embodiment of the sixth aspect of the present disclosure provides a computer program. The computer program includes computer program code. When the computer program code is run on a computer, it causes the computer to execute the method described in the embodiment of the first aspect of the present disclosure.

The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects:

Automatic document retrieval is realized, reducing the labor cost and time cost required for document retrieval. In addition, the target content fragment that matches the query statement is obtained according to the correlation, obtained based on AI technology, between each content fragment in the document and the query statement, so that the specific content that can answer the user's question can be accurately determined from the document, laying the foundation for accurately providing an answer to the user's question.

Additional aspects and advantages of the disclosure will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the disclosure.

Description of the drawings

In the drawings, unless otherwise specified, the same reference numbers refer to the same or similar parts or elements throughout the several figures. The drawings are not necessarily to scale. It should be understood that these drawings depict only some embodiments in accordance with the disclosure and are not to be considered limiting of the scope of the disclosure.

Figure 1 is a schematic flowchart of a document retrieval method according to the first embodiment of the present disclosure;

Figure 2 is a schematic flowchart of a document retrieval method according to a second embodiment of the present disclosure;

Figure 3 is a schematic flowchart of a document retrieval method according to a third embodiment of the present disclosure;

Figure 4 is an example diagram of an interactive interface provided by a document retrieval device according to a third embodiment of the present disclosure;

Figure 5 is an example diagram of candidate content segments and corresponding attribute information according to the third embodiment of the present disclosure;

Figure 6 is a schematic flowchart of a document retrieval method according to the fourth embodiment of the present disclosure;

Figure 7 is an example diagram of an interactive interface of a document processing platform and a recognition result of a document according to the fourth embodiment of the present disclosure;

Figure 8 is an example diagram of text recognition results and corresponding content fragments according to the fourth embodiment of the present disclosure;

Figure 9 is an example diagram of table recognition results and corresponding content fragments according to the fourth embodiment of the present disclosure;

Figure 10 is a schematic structural diagram of a document retrieval device according to the fifth embodiment of the present disclosure;

Figure 11 is a block diagram of an electronic device used to implement the document retrieval method of an embodiment of the present disclosure.

Detailed description

Embodiments of the present disclosure are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals throughout represent the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary and are only used to explain the present disclosure and are not to be construed as limitations of the present disclosure.

These and other aspects of embodiments of the present disclosure will become apparent with reference to the following description and accompanying drawings. In these descriptions and drawings, some specific implementations of the embodiments of the disclosure are specifically disclosed to represent some of the ways of implementing the principles of the embodiments of the disclosure, but it should be understood that the scope of the embodiments of the disclosure is not limited by this restriction. On the contrary, the disclosed embodiments include all changes, modifications and equivalents falling within the spirit and scope of the appended claims.

It should be noted that in the technical solution of this disclosure, the acquisition, storage and application of user personal information involved are in compliance with relevant laws and regulations and do not violate public order and good customs.

Embodiments of the present disclosure provide a document retrieval method that reduces the labor cost and time cost required for document retrieval by replacing manual retrieval with automatic retrieval. Specifically, after the user's query statement is obtained, a query can be performed based on the query statement to obtain, from multiple content fragments included in at least one document, multiple candidate content fragments related to the query statement. A correlation model in the field of natural language processing (NLP) is then used to obtain the first correlation degree between the query statement and each candidate content fragment, and the target content fragment matching the query statement is obtained from the candidate content fragments based on each first correlation degree. Because the target content fragment is obtained according to the correlation, obtained based on AI technology, between each content fragment in the document and the query statement, the specific content that can answer the user's question can be accurately determined from the document, laying the foundation for accurately providing an answer to the user's question.

In order to clearly explain the various embodiments of the present disclosure, technical terms involved in the embodiments of the present disclosure are first explained.

In the description of the present disclosure, the term "plurality" means two or more.

In the description of this disclosure, "RPA robot" refers to a software robot that can combine AI technology and RPA technology to automatically perform business processing. RPA robots have two characteristics: "connector" and "non-intrusion". By simulating human operation methods, they can extract, integrate and connect data from different systems in a non-intrusive way without changing the information system.

In the description of this disclosure, "query statement" refers to the statement input by the user for the query, that is, the question the user wants to ask. It can be a statement in text form or a statement in voice form, and this disclosure does not limit this.

In the description of this disclosure, a "document" is a document in electronic form from which specific content that can answer the user's question is retrieved. It can be a document in PDF (Portable Document Format) obtained by scanning a paper document, or a document edited on a smart device such as a computer or mobile phone, and this disclosure does not limit this.

In the description of this disclosure, a "content fragment" is a fragment composed of part of the content of a document. A content fragment can be one sentence or several sentences, a paragraph in the document, a table in the document, or part of the content of a table, and so on, and this disclosure does not limit this. In some embodiments of the present disclosure, the number of characters included in a content fragment can be set in advance, so that when all documents to be retrieved are processed, the content of all documents is divided into multiple content fragments, and the number of characters included in each content fragment is less than or equal to the preset number of characters.

In the description of this disclosure, "candidate content fragments" refer to content fragments related to the query statement obtained from all content fragments included in all documents. "Target content fragment" refers to the content fragment obtained from the candidate content fragment that matches the query statement, that is, the specific content that can answer the user's question.

In the description of this disclosure, "attribute information" is information that represents the attributes of a content fragment, such as the document name of the document where the content fragment is located, the chapter title corresponding to the content fragment, the parent titles of each level of the chapter title, etc.

In the description of this disclosure, "correlation degree" expresses how strongly two items are related. The "first correlation degree" is the correlation degree between the query statement and a candidate content fragment as determined by the correlation model, and indicates how strongly the query statement and the candidate content fragment are related.

In the description of this disclosure, a "correlation model" is any machine learning model used to determine the degree of correlation, such as BERT (Bidirectional Encoder Representations from Transformers) and other neural network models. The correlation model can be obtained by fine-tuning a pre-trained model in the NLP field.

In the description of the present disclosure, "content correlation" is the correlation between the query statement and a content fragment determined based on the content contained in the content fragment, and represents how strongly the content contained in the content fragment is related to the query statement.

In the description of this disclosure, "attribute correlation" is the correlation between the query statement and a content fragment determined based on the attribute information corresponding to the content fragment, and represents how strongly the attribute information corresponding to the content fragment is related to the query statement.

In the description of this disclosure, the "second correlation degree" is the correlation degree between the query statement and a content fragment determined based on the content correlation and the attribute correlation, and comprehensively represents how strongly the content contained in the content fragment and the corresponding attribute information are related to the query statement.
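As a minimal sketch of how the two correlations and their weights might be combined, the following assumes a plain weighted sum. The disclosure only states that the second correlation degree is determined from the two correlations and their respective weights, so the weighted-sum form, the function name `second_correlation`, and the default weight values are assumptions.

```python
def second_correlation(content_corr: float,
                       attribute_corr: float,
                       first_weight: float = 0.7,
                       second_weight: float = 0.3) -> float:
    """Combine content and attribute correlation into one score.

    Assumed form: first_weight * content_corr + second_weight * attribute_corr.
    The weight values are illustrative, not taken from the disclosure.
    """
    return first_weight * content_corr + second_weight * attribute_corr


# A fragment whose text matches well but whose titles match weakly.
score = second_correlation(content_corr=0.8, attribute_corr=0.5)
```

Raising the second weight makes fragments from well-matching chapters or documents rank higher even when their own text matches the query only partially.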

In the description of this disclosure, "segmented fragments" refer to the fragments obtained by segmenting a document. For example, after a document is divided into multiple sentences according to the punctuation marks used at the end of sentences, each sentence is a segmented fragment. Each content fragment in the embodiments of the present disclosure may include one or more segmented fragments.
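The relationship between segmented fragments and content fragments can be sketched as follows. This is only an illustration: the regular expression used to detect sentence endings and the greedy aggregation strategy are assumptions, since the disclosure leaves the preset segmentation and aggregation methods open.

```python
import re
from typing import List

# Sentence-ending punctuation, Western and CJK (an assumed set).
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def split_sentences(text: str) -> List[str]:
    """Split text into segmented fragments at sentence-ending punctuation.

    Naive on purpose: a decimal such as "3.5" would also be split.
    """
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def aggregate(segments: List[str], max_chars: int) -> List[str]:
    """Greedily merge consecutive segments into content fragments.

    Each fragment holds at least one segment and stays within max_chars,
    except that a single segment longer than max_chars is kept whole here;
    a real system would need to split it further.
    """
    fragments: List[str] = []
    current = ""
    for seg in segments:
        candidate = f"{current} {seg}" if current else seg
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                fragments.append(current)
            current = seg
    if current:
        fragments.append(current)
    return fragments


segments = split_sentences("Aa. Bb. Cc.")
fragments = aggregate(segments, max_chars=7)
```

Here three sentences become two content fragments, because adding the third sentence to the first fragment would exceed the character limit.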

In the description of this disclosure, a "document processing platform" is an intelligent automation platform for intelligently processing documents. Intelligent Document Processing (IDP) is one of the core capabilities of an intelligent automation platform. IDP is a new generation of automation technology that, based on AI technologies such as Optical Character Recognition (OCR), Computer Vision (CV), Natural Language Processing (NLP), and Knowledge Graph (KG), performs recognition, classification, element extraction, verification, comparison, and error correction on various types of documents, helping enterprises make document processing intelligent and automated.

In the description of this disclosure, a "content field" is a field composed of a single character or multiple consecutive characters. The content field can be understood as the attribute key, and the content contained in the content fragment can be understood as the attribute value. The content field and the corresponding content fragment together form a piece of structured data. In addition, the content field and the fields corresponding to the attribute information of the content fragment, such as a field named "Document Name", a field named "Chapter Title", and a field named "Parent Title at All Levels", can together form a structured record.
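For illustration only, one such structured record might look like the following. The key names and the sample values are hypothetical; the disclosure names the fields but does not prescribe a storage format.

```python
import json

# One structured record: the content field holds the fragment text,
# and the remaining fields hold its attribute information.
record = {
    "content": "The rated voltage is 220 kV.",
    "document_name": "Substation Manual",
    "chapter_title": "2.1 Ratings",
    "parent_titles": ["2 Equipment"],
}

# Serialized form, e.g. for saving to a retrieval engine's index.
serialized = json.dumps(record, ensure_ascii=False)
```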

The following describes document retrieval methods, devices, electronic devices and storage media according to embodiments of the present disclosure with reference to the accompanying drawings.

First, the document retrieval method in the embodiment of the present disclosure will be described with reference to the accompanying drawings.

Figure 1 is a flow chart of a document retrieval method according to the first embodiment of the present disclosure. As shown in Figure 1, the method may include the following steps: steps 101-104.

Step 101: Obtain the query statement.

It should be noted that the document retrieval method in the embodiment of the present disclosure can be executed by a document retrieval device. The document retrieval device can be implemented in software and/or hardware. It can be an electronic device, or it can be configured in an electronic device, to realize automatic document retrieval, thereby reducing the labor cost and time cost required for document retrieval and, based on AI technology, accurately determining from documents the specific content that can answer user questions. The electronic device may include but is not limited to a terminal device, a server, and so on, and the embodiment of the present disclosure does not specifically limit the electronic device.

In the embodiment of the present disclosure, the document retrieval device can provide an interactive interface, so that the user can input a query statement in the interactive interface to perform a query, and accordingly, the document retrieval device can obtain the query statement.

Step 102: Perform a query based on the query statement to obtain multiple candidate content fragments related to the query statement from multiple content fragments included in at least one document.

In the embodiment of the present disclosure, a large number of documents to be retrieved (that is, documents from which specific content that can answer user questions needs to be retrieved) can be processed in advance to obtain multiple content fragments, and the multiple content fragments are saved to a retrieval engine. After the query statement is obtained, the retrieval engine can be used to perform a query based on the query statement. The retrieval engine obtains multiple candidate content fragments related to the query statement from the multiple content fragments and returns them to the document retrieval device, so that the document retrieval device obtains the multiple candidate content fragments.

The retrieval engine can be any retrieval engine with a retrieval function, and this disclosure does not limit this. In addition, the retrieval engine may be configured in the document retrieval device, or the retrieval engine may be configured separately and connected to the document retrieval device through an interface, which is not limited by the present disclosure.

In the embodiment of the present disclosure, the number of candidate content fragments can be set in advance, so that the retrieval engine can obtain the correlation between the query statement and each content fragment, sort the content fragments in descending order of the corresponding correlation, and determine the preset number of top-ranked content fragments as the multiple candidate content fragments.

In the embodiment of the present disclosure, a first correlation threshold can alternatively be set in advance, so that the retrieval engine can obtain the correlation between the query statement and each content fragment and determine the content fragments whose correlation is greater than the first correlation threshold as the multiple candidate content fragments. The first correlation threshold can be set arbitrarily as needed, and this disclosure does not limit this.
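The two ways of selecting candidate content fragments described above, by preset number and by correlation threshold, can be sketched as follows. The function names and the sample scores are hypothetical, and the correlations are assumed to have been computed by the retrieval engine already.

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (content fragment, correlation with the query)


def top_k(scored: List[Scored], k: int) -> List[Scored]:
    """Keep the k fragments with the highest correlation, best first."""
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


def above_threshold(scored: List[Scored], threshold: float) -> List[Scored]:
    """Keep every fragment whose correlation is strictly greater than threshold."""
    return [item for item in scored if item[1] > threshold]


scored = [("fragment a", 0.2), ("fragment b", 0.9), ("fragment c", 0.5)]
by_count = top_k(scored, k=2)
by_threshold = above_threshold(scored, threshold=0.4)
```

A preset number bounds how many fragments the correlation model must later score, while a threshold lets the candidate set shrink or grow with how well the documents match the query.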

Step 103: Use the correlation model in the field of natural language processing NLP to obtain the first correlation between the query statement and each candidate content segment.

In the embodiment of the present disclosure, the correlation model can be trained in advance. The input of the correlation model is a candidate content fragment and the query statement, and the output is a correlation score (i.e., a confidence) between the candidate content fragment and the query statement. For each candidate content fragment, the query statement and the candidate content fragment can then be input into the trained correlation model, so that the correlation model determines the degree of correlation between the candidate content fragment and the query statement based on the content of the query statement and of the candidate content fragment and outputs the first correlation degree. The document retrieval device can thus obtain the first correlation degree between the query statement and the candidate content fragment from the output of the correlation model.
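The scoring loop over candidate content fragments can be sketched with the correlation model abstracted as a callable. The scorer below is a deliberately simple stand-in (word-set overlap) used only so the sketch runs on its own; it is not the model of the disclosure, where a fine-tuned pre-trained model such as BERT would take its place. All names are hypothetical.

```python
from typing import Callable, List

# Assumed interface: (query statement, candidate fragment) -> score.
CorrelationModel = Callable[[str, str], float]


def toy_correlation_model(query: str, fragment: str) -> float:
    """Stand-in scorer: Jaccard overlap of lowercase word sets."""
    q = set(query.lower().split())
    f = set(fragment.lower().split())
    union = q | f
    return len(q & f) / len(union) if union else 0.0


def first_correlations(query: str,
                       candidates: List[str],
                       model: CorrelationModel = toy_correlation_model) -> List[float]:
    """Score each candidate fragment against the query, one pair at a time."""
    return [model(query, candidate) for candidate in candidates]


candidates = ["the rated voltage is 220 kV", "maintenance schedule"]
scores = first_correlations("rated voltage", candidates)
```

Keeping the model behind a plain callable means the same loop works whether the scorer is this toy function or a wrapper around a trained neural network.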

Step 104: Based on each first correlation degree, obtain a target content segment that matches the query statement from each candidate content segment.

The number of target content segments may be one or multiple, and may be set as needed, and this disclosure does not limit this.

In this embodiment of the present disclosure, taking the case where the number of target content fragments is one as an example, the candidate content fragment with the highest first correlation degree with the query statement can be used as the target content fragment.
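A minimal sketch of this selection step, assuming the first correlation degrees are already available as a list aligned with the candidate content fragments. The function name `select_targets` and the parameter `n` are hypothetical; `n` covers the case where more than one target content fragment is wanted.

```python
from typing import List


def select_targets(candidates: List[str], scores: List[float], n: int = 1) -> List[str]:
    """Return the n candidates with the highest first correlation degree."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in ranked[:n]]


target = select_targets(["a", "b", "c"], [0.1, 0.8, 0.3])
top_two = select_targets(["a", "b", "c"], [0.1, 0.8, 0.3], n=2)
```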

Furthermore, based on the target content segment, an answer for answering the query statement can be obtained.

It should be noted that the document retrieval device can provide an interactive interface, so that the answer to the query statement can be displayed through the interactive interface. In addition, while obtaining the target content fragment, the document retrieval device can also obtain the attribute information of the target content fragment, and display the target content fragment, the corresponding attribute information, and the paragraph or table containing the target content fragment through the interactive interface, so that users can more clearly understand the source of the answer to the query statement.

In summary, the document retrieval method provided by the embodiment of the present disclosure obtains a query statement, performs a query based on the query statement to obtain multiple candidate content fragments related to the query statement from multiple content fragments included in at least one document, uses the correlation model in the field of natural language processing (NLP) to obtain the first correlation degree between the query statement and each candidate content fragment, and obtains the target content fragment matching the query statement from the candidate content fragments based on each first correlation degree. As a result, automatic document retrieval is realized, reducing the labor cost and time cost required for document retrieval. Moreover, because the target content fragment matching the query statement is obtained according to the correlation, determined by AI technology, between each content fragment in the document and the query statement, the specific content that can answer the user's question is accurately determined from the document, laying the foundation for accurately providing answers to user questions.

Next, with reference to Figure 2, the process of performing a query based on the query statement to obtain, from multiple content fragments included in at least one document, multiple candidate content fragments related to the query statement in the document retrieval method provided by the embodiment of the present disclosure is further described.

Figure 2 is a flow chart of a document retrieval method according to the second embodiment of the present disclosure. As shown in Figure 2, the method includes: steps 201-206.

Step 201: Obtain the query statement.

Step 202: Obtain the content contained in each content segment and the attribute information of each content segment.

The attribute information of the content fragment may include at least one of the document name of the document in which the content fragment is located, the chapter title corresponding to the content fragment, and the parent titles at all levels of the chapter title corresponding to the content fragment.

In this embodiment of the present disclosure, taking the attribute information including the document name, the chapter title, and the parent titles at all levels as an example, the content contained in each content fragment and the attribute information of the content fragment can be saved in the form of a structure. The fields in the structure may include a field named "Document Name", a field named "Chapter Title", a field named "Parent Titles at All Levels" and a field named "Content Fragment", so that the document retrieval device can obtain, from each structure, the content contained in the corresponding content fragment and the corresponding attribute information.
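One possible shape for such a structure is sketched below. The class name `ContentFragment`, the Python field names, and the `attributes` helper are assumptions chosen for illustration; the disclosure only names the four fields.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ContentFragment:
    """One stored content fragment together with its attribute information."""

    document_name: str                    # field "Document Name"
    chapter_title: str                    # field "Chapter Title"
    parent_titles: List[str] = field(default_factory=list)  # "Parent Titles at All Levels"
    content: str = ""                     # field "Content Fragment"

    def attributes(self) -> Dict[str, str]:
        """Return the attribute information as a name-to-text mapping."""
        return {
            "document_name": self.document_name,
            "chapter_title": self.chapter_title,
            "parent_titles": " ".join(self.parent_titles),
        }
```

Keeping the attribute information next to the content lets later steps score or splice either part without a second lookup.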

Step 203: Based on the content contained in each content fragment, obtain the content correlation between the query statement and the corresponding content fragment, and based on the attribute information of each content fragment, obtain the attribute correlation between the query statement and the corresponding content fragment.

When the attribute information of a content fragment includes multiple items of information, such as the document name, the chapter title, and the parent titles at all levels, the attribute correlation between the query statement and the content fragment can be obtained for each content fragment separately for each item of attribute information.

In the embodiment of the present disclosure, the query statement can be segmented into words, and the content correlation between the query statement and a content fragment can be determined based on the number of times each segmented word appears in the content contained in the content fragment. For example, the more times the segmented words appear in the content contained in a certain content fragment, the higher the content correlation between the query statement and that content fragment; the fewer times the segmented words appear in the content contained in a certain content fragment, the lower the content correlation between the query statement and that content fragment.

Similarly, the query statement can be segmented into words, and the attribute correlation between the query statement and a content fragment can be determined based on the number of times each segmented word appears in the attribute information of the content fragment. For example, the more times the segmented words appear in the document name of a certain content fragment, the higher the attribute correlation, with respect to the document name, between the query statement and that content fragment; the fewer times the segmented words appear in the document name, the lower the attribute correlation, with respect to the document name, between the query statement and that content fragment.

For example, assuming that the query statement is "transformer type" and the attribute information includes the document name and the chapter title, the query statement can be segmented into "transformer" and "type". The content correlation between the query statement "transformer type" and each content fragment is then determined according to the number of times "transformer" and "type" appear in the content contained in that content fragment. The attribute correlation, with respect to the document name, between the query statement "transformer type" and each content fragment is determined according to the number of times "transformer" and "type" appear in the name of the document where that content fragment is located. The attribute correlation, with respect to the chapter title, between the query statement "transformer type" and each content fragment is determined according to the number of times "transformer" and "type" appear in the chapter title corresponding to that content fragment.
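A minimal sketch of this occurrence-count scoring is given below. It assumes the query has already been segmented into words, and it uses the raw occurrence total as the correlation; the function name `occurrence_correlation` and the case-insensitive substring matching are illustrative choices rather than details from the disclosure.

```python
from typing import Iterable


def occurrence_correlation(segmented_words: Iterable[str], text: str) -> int:
    """Total number of times the segmented query words occur in `text`.

    Matching is case-insensitive and substring-based, so "type" also
    matches inside "types". A real system may normalise differently.
    """
    lowered = text.lower()
    return sum(lowered.count(word.lower()) for word in segmented_words)


# Example: the query "transformer type" segmented into two words.
words = ["transformer", "type"]
content_corr = occurrence_correlation(
    words, "The transformer type depends on the load. An oil-immersed transformer is common."
)
document_name_corr = occurrence_correlation(words, "Transformer Selection Guide")
chapter_title_corr = occurrence_correlation(words, "Transformer types")
```

The same function is applied once to the content and once per attribute, which yields one content correlation and one attribute correlation per item of attribute information.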

Step 204: Based on the content correlation and attribute correlation between the query statement and each content fragment, obtain multiple candidate content fragments related to the query statement from multiple content fragments.

In the embodiment of the present disclosure, a second correlation threshold corresponding to the content correlation and a third correlation threshold corresponding to the attribute correlation can be set, so that, among the multiple content fragments, the content fragments whose content correlation is greater than the second correlation threshold and/or whose attribute correlation is greater than the third correlation threshold are determined as the multiple candidate content fragments related to the query statement. The second correlation threshold and the third correlation threshold can be set as needed, and are not limited here.
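The threshold-based selection can be sketched as follows. The function name `select_by_thresholds`, the tuple layout, and the `require_both` flag (which switches between the "and" and the "or" reading of "and/or") are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def select_by_thresholds(
    scored: Sequence[Tuple[str, float, float]],
    second_threshold: float,
    third_threshold: float,
    require_both: bool = False,
) -> List[str]:
    """Pick candidate fragment ids from (id, content_corr, attribute_corr) tuples."""
    selected = []
    for fragment_id, content_corr, attribute_corr in scored:
        content_ok = content_corr > second_threshold
        attribute_ok = attribute_corr > third_threshold
        # "and/or": require both conditions, or accept either one.
        keep = (content_ok and attribute_ok) if require_both else (content_ok or attribute_ok)
        if keep:
            selected.append(fragment_id)
    return selected
```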

As a result, multiple candidate content segments that are highly relevant to the query statement can be accurately obtained from all content segments included in all documents.

In the embodiment of the present disclosure, content relevance and attribute relevance can also be set to have corresponding weights. In order to facilitate the distinction, the weight corresponding to the content relevance is called the first weight, and the weight corresponding to the attribute relevance is called the second weight. That is, the content relevance has a corresponding first weight, and the attribute relevance has a corresponding second weight. Wherein, when the attribute information of the content fragment includes multiple attributes, corresponding weights can be set respectively corresponding to the attribute correlation of each attribute information, and the weights corresponding to each attribute correlation can be the same or different, and are not limited here. The first weight and the second weight can be determined through experiments, experience, or other methods, and this disclosure does not limit this.

Correspondingly, step 204 can be implemented in the following manner: based on each content correlation and the corresponding first weight, and each attribute correlation and the corresponding second weight, determine the second correlation degree between the query statement and the corresponding content fragment; then, based on the second correlation degree between the query statement and each content fragment, obtain multiple candidate content fragments related to the query statement from the multiple content fragments.

For each content fragment, the weighted sum of the content correlation and the attribute correlation can be determined based on the content correlation and the corresponding first weight, and the attribute correlation and the corresponding second weight, and the weighted sum can be used as the second correlation degree between the query statement and the content fragment. Furthermore, among the multiple content fragments, the content fragments whose second correlation degree is greater than a fourth correlation threshold may be determined as candidate content fragments, or a preset number of content fragments with the highest second correlation degree (i.e., the preset number of content fragments ranked first after the content fragments are sorted from high to low by second correlation degree) may be determined as candidate content fragments.

Therefore, by setting a first weight for the content correlation and a second weight for the attribute correlation, and obtaining candidate content fragments from the multiple content fragments based on each content correlation and the corresponding first weight and each attribute correlation and the corresponding second weight, the way of determining the degree of correlation between content fragments and the query statement can be flexibly adjusted as needed.

In the embodiment of the present disclosure, steps 202-204 can be implemented by the document retrieval device, or can also be implemented based on a retrieval engine. For example, taking the attribute information including the document name, the chapter title, and the parent titles at all levels as an example, the content contained in each content fragment, as well as the document name, chapter title, and parent titles at all levels corresponding to each content fragment, can be saved in advance in the form of a structure. The fields in the structure can include a field named "Content Fragment", a field named "Document Name", a field named "Chapter Title", and a field named "Parent Titles at All Levels". After the document retrieval device obtains the query statement, it can segment the query statement into words and splice all the segmented words of the query statement with "Document Name", "Chapter Title", "Parent Titles at All Levels" and "Content Fragment" respectively to obtain retrieval conditions. The retrieval conditions are then input into the retrieval engine, so that the retrieval engine obtains the second correlation degree between the query statement and each content fragment in the manner shown in the above embodiment, obtains multiple candidate content fragments related to the query statement from the multiple content fragments, and returns them to the document retrieval device.
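The weighted sum and the two selection rules can be sketched as below. The function names, the dictionary layout, and the example weights are illustrative assumptions; the disclosure leaves the weights and thresholds to experiment or experience.

```python
from typing import Dict, List


def second_correlation(
    content_corr: float,
    attribute_corrs: Dict[str, float],
    first_weight: float,
    second_weights: Dict[str, float],
) -> float:
    """Weighted sum of the content correlation and each attribute correlation."""
    total = first_weight * content_corr
    for name, corr in attribute_corrs.items():
        total += second_weights[name] * corr
    return total


def above_threshold(scores: Dict[str, float], fourth_threshold: float) -> List[str]:
    """Fragments whose second correlation degree exceeds the fourth threshold."""
    return [fid for fid, score in scores.items() if score > fourth_threshold]


def top_n(scores: Dict[str, float], preset_number: int) -> List[str]:
    """The preset number of fragments with the highest second correlation degree."""
    return sorted(scores, key=scores.get, reverse=True)[:preset_number]
```

Giving each attribute its own weight, as in `second_weights`, makes it possible, for instance, to let a match in the chapter title count more than a match in the document name.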

Step 205: Use the correlation model in the field of natural language processing NLP to obtain the first correlation between the query statement and each candidate content segment.

Step 206: Based on each first correlation degree, obtain a target content segment that matches the query statement from each candidate content segment.

For the specific implementation process and principles of steps 205-206, reference can be made to the description of the above embodiments and will not be described again here.

In summary, the document retrieval method provided by the embodiment of the present disclosure obtains the query statement, obtains the content contained in each content fragment and the attribute information of each content fragment, obtains the content correlation between the query statement and the corresponding content fragment based on the content contained in each content fragment, and obtains the attribute correlation between the query statement and the corresponding content fragment based on the attribute information of each content fragment. Based on the content correlation and attribute correlation between the query statement and each content fragment, multiple candidate content fragments related to the query statement are obtained from the multiple content fragments. The correlation model in the field of natural language processing (NLP) is then used to obtain the first correlation degree between the query statement and each candidate content fragment, and, based on each first correlation degree, the target content fragment that matches the query statement is obtained from the candidate content fragments. As a result, automatic document retrieval is realized, reducing the labor cost and time cost required for document retrieval. Moreover, because the target content fragment matching the query statement is obtained according to the correlation, determined by AI technology, between each content fragment in the document and the query statement, the specific content that can answer the user's question is accurately determined from the document, laying the foundation for accurately providing answers to user questions.

The following further explains, with reference to Figure 3, the process of obtaining the first correlation degree between the query statement and each candidate content fragment by using the correlation model in the field of natural language processing (NLP) in the document retrieval method provided by the embodiment of the present disclosure.

Figure 3 is a flow chart of a document retrieval method according to the third embodiment of the present disclosure. As shown in Figure 3, the method includes: steps 301-305.

Step 301: Obtain the query statement.

Step 302: Perform a query based on the query statement to obtain multiple candidate content fragments related to the query statement from multiple content fragments included in at least one document.

For the specific implementation process and principles of steps 301-302, reference can be made to the description of the above embodiments and will not be described again here.

Step 303: For each candidate content segment, obtain the corresponding attribute information, and splice the attribute information and the candidate content segment to obtain the corresponding splicing result.

The attribute information of the candidate content fragment may include at least one of the name of the document in which the candidate content fragment is located, the chapter title corresponding to the candidate content fragment, and the parent titles of each level of the chapter title.

In this embodiment of the present disclosure, for each candidate content fragment, the corresponding document name, chapter title, and parent title of the chapter title can be obtained, and the document name, chapter title, and parent title of the chapter title can be spliced with the candidate content fragment, to obtain the corresponding splicing results.
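The splicing step can be sketched as follows. The order of the parts, the separator, and the function name `splice` are assumptions introduced for illustration, since the disclosure does not fix how the attribute information and the candidate content fragment are joined.

```python
from typing import Sequence


def splice(
    document_name: str,
    chapter_title: str,
    parent_titles: Sequence[str],
    fragment: str,
    separator: str = " | ",
) -> str:
    """Join the attribute information and the candidate fragment into one string."""
    # Assumed order: document name, parent titles (outermost first),
    # chapter title, then the fragment content. Empty parts are skipped.
    parts = [document_name, *parent_titles, chapter_title, fragment]
    return separator.join(part for part in parts if part)
```

The resulting string would then be paired with the query statement as the input to the correlation model.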

Step 304: Enter the splicing results corresponding to the query statement and the candidate content fragments into the correlation model to obtain the first correlation between the query statement and the candidate content fragments.

In the embodiment of the present disclosure, the query statement and the splicing result corresponding to the candidate content fragment can be input into the correlation model, so that the correlation model determines the degree of correlation between the candidate content fragment and the query statement based on the query statement and on the content and attribute information of the candidate content fragment itself, and outputs the first correlation degree. The document retrieval device can thus obtain the first correlation degree between the query statement and the candidate content fragment from the output of the correlation model.

Alternatively, for each candidate content fragment, only the query statement and the candidate content fragment can be input into the correlation model to obtain the first correlation between the query statement and the candidate content fragment.

Step 305: Based on each first correlation degree, obtain a target content segment that matches the query statement from each candidate content segment.

Further, based on the target content fragment, an answer for answering the query statement can be obtained.

Referring to Figure 4, the user can enter the question "Can the factory model parameters of the terminal support erasing and writing by the master station" in the interactive interface provided by the document retrieval device, and click the "Start Retrieval" button to start the document retrieval process. Correspondingly, the document retrieval device can obtain the query statement "Can the factory model parameters of the terminal support erasing and writing by the master station". After the document retrieval device obtains the multiple candidate content fragments related to the query statement shown in Figure 5 in the manner shown in the above embodiment, it can obtain the first correlation degree between the query statement and each candidate content fragment (i.e., each confidence in the confidence column in Figure 5) and obtain the attribute information of each candidate content fragment (i.e., each document number in the document number column, each document name in the document name column, each chapter serial number in the chapter serial number column, and each chapter title in the chapter title column in Figure 5). It can then determine the candidate content fragment with the highest first correlation degree (that is, the candidate content fragment with serial number 1) as the target content fragment, and display the target content fragment, the corresponding attribute information, and so on through the interactive interface shown in Figure 4.

Each serial number in the leftmost serial number column in Figure 5 is used to uniquely identify the corresponding candidate content fragment. Each document number in the document number column is used to uniquely identify the document in which the candidate content fragment is located. Each document name in the document name column is the name of the document in which the corresponding candidate content fragment is located. Each chapter serial number in the chapter serial number column is the serial number of the chapter where the corresponding candidate content fragment is located, and is used to uniquely identify that chapter. Each chapter title in the chapter title column is the title of the chapter where the corresponding candidate content fragment is located. Each entry in the candidate content fragment column is the content contained in the corresponding candidate content fragment. Each confidence in the confidence column is the first correlation degree, determined by the correlation model, between the query statement and the corresponding candidate content fragment.

In summary, the document retrieval method provided by the embodiment of the present disclosure obtains a query statement and performs a query based on the query statement to obtain multiple candidate content fragments related to the query statement from multiple content fragments included in at least one document. For each candidate content fragment, the corresponding attribute information is obtained, and the attribute information is spliced with the candidate content fragment to obtain the corresponding splicing result. The query statement and the splicing result corresponding to the candidate content fragment are input into the correlation model to obtain the first correlation degree between the query statement and the candidate content fragment, and, based on each first correlation degree, the target content fragment matching the query statement is obtained from the candidate content fragments. As a result, automatic document retrieval is realized, reducing the labor cost and time cost required for document retrieval. Moreover, because the target content fragment matching the query statement is obtained according to the correlation, determined by AI technology, between each content fragment in the document and the query statement, the specific content that can answer the user's question is accurately determined from the document, laying the foundation for accurately providing answers to user questions. In addition, because the correlation model in the field of natural language processing (NLP) determines the first correlation degree between each candidate content fragment and the query statement based on the query statement, the attribute information of the candidate content fragment, and the content contained in the candidate content fragment itself, the accuracy of the determined target content fragment is further improved.

From the above analysis, it can be seen that a large number of documents to be retrieved can be processed in advance to obtain multiple content fragments. Then, after the document retrieval device obtains the query statement, it can perform a query based on the query statement to obtain multiple candidate content fragments related to the query statement from the multiple content fragments. The following describes, with reference to Figure 6, the process of processing the documents to be retrieved to obtain multiple content fragments in the document retrieval method provided by the embodiment of the present disclosure.

Figure 6 is a flow chart of a document retrieval method according to the fourth embodiment of the present disclosure. As shown in Figure 6, based on the above embodiment, the method may also include the following steps 601-603.

Step 601: Recognize each document based on the optical character recognition OCR technology in the field of artificial intelligence AI to obtain the recognition results of each document.

In the embodiment of the present disclosure, the document retrieval device can recognize each document based on optical character recognition (OCR) technology to obtain the recognition results of each document.

In the embodiment of the present disclosure, the document retrieval device can also be connected to the document processing platform through an interface, and upload each document to the document processing platform, so that the document processing platform recognizes each document based on optical character recognition (OCR) technology; the document retrieval device then obtains the recognition result of each document returned by the document processing platform.

In the embodiment of the present disclosure, the document retrieval device can also call the RPA robot to upload each document to the document processing platform, so that the document processing platform recognizes each document based on optical character recognition (OCR) technology; the document retrieval device then obtains the recognition result of each document returned by the document processing platform. Therefore, when there are a large number of documents to be retrieved, the labor cost required for document uploading can be reduced by calling the RPA robot to upload the documents one by one to the document processing platform.

Referring to the left part of Figure 7, the document processing platform may provide an interactive interface, which may include an "Upload Document" button for uploading documents and a "Start Recognition" button for starting the document recognition process. The document retrieval device can call the RPA robot to simulate mouse operations: click the "Upload Document" button on the interactive interface to upload the documents to be processed to the document processing platform, and then click the "Start Recognition" button on the interactive interface to start the document recognition process on the document processing platform, and then obtain the document recognition results shown in the right part of Figure 7. In Figure 7, "cl_num" represents the chapter serial number, "cl_name" represents the chapter title, "cl_rank" represents the row where the chapter is located, and "cl_content" represents the content contained in the chapter.

Step 602: Perform structured processing on each recognition result to obtain multiple content fragments included in each document.

In this embodiment of the present disclosure, the document may include text and/or tables, and accordingly, the recognition results of the document may include text recognition results and/or table recognition results.

Accordingly, step 602 can be implemented in the following manner: segment the text recognition results and/or table recognition results according to a preset segmentation method to obtain multiple segmented segments; then aggregate the multiple segmented segments according to a preset aggregation method to obtain multiple content fragments, where each content fragment is obtained by aggregating at least one segmented segment.

The preset segmentation method is a method of dividing the recognition result of the document into multiple segmented segments, which can be determined according to the type of content contained in the document (such as text type, table type).

The preset aggregation method is a method of aggregating segmented segments to obtain content fragments, which can be determined according to the type of content contained in the document (such as text type, table type).

For example, assume that the recognition results of the document include text recognition results, and the text recognition results include chapter numbers and punctuation marks such as commas and periods. The document retrieval device can perform a first segmentation of the text recognition results based on the chapter numbers, and then perform a second segmentation on the results of the first segmentation based on punctuation marks (generally end-of-sentence punctuation marks such as periods), thereby dividing the text recognition results into multiple sentences. Each sentence is a segmented segment, and the segmented segments are arranged from front to back according to their positions in the document.

Furthermore, a specific length can be given, such as 200 characters, and the segmented segments are accumulated one by one starting from the first segmented segment. When the accumulated length exceeds 200 characters, the segmented segments accumulated before the current one are regarded as one content fragment, and the current segmented segment is used as the first segmented segment of the next content fragment. For example, if the accumulated length reaches 203 characters when the fifth sentence is added, while the accumulated length of the preceding four sentences is 197 characters, the four preceding sentences are regarded as one content fragment, and the fifth sentence is used as the first sentence of the next content fragment; subsequent sentences are then accumulated to determine the next content fragment.
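The accumulation rule can be sketched as follows, reading the rule as "close the current content fragment as soon as adding the next sentence would push the accumulated length past the limit". The function name `aggregate_sentences` and the choice to join sentences without a separator are assumptions made for illustration.

```python
from typing import List, Sequence


def aggregate_sentences(sentences: Sequence[str], max_length: int = 200) -> List[str]:
    """Group consecutive sentences into content fragments of at most `max_length` characters.

    A single sentence longer than `max_length` still forms its own fragment.
    """
    fragments: List[str] = []
    current: List[str] = []
    current_length = 0
    for sentence in sentences:
        # Adding this sentence would exceed the limit: close the current
        # fragment and let this sentence start the next one.
        if current and current_length + len(sentence) > max_length:
            fragments.append("".join(current))
            current, current_length = [], 0
        current.append(sentence)
        current_length += len(sentence)
    if current:
        fragments.append("".join(current))
    return fragments
```

With four sentences totalling 197 characters followed by a 6-character sentence, the first four form one content fragment and the fifth starts the next, which mirrors the 197/203 example in the text.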

Referring to Figure 8, by performing structured processing on the text recognition results shown in the left figure, multiple content fragments shown in the right figure of Figure 8 can be obtained.

Alternatively, assume that the recognition results of the document include table recognition results, and the table recognition results include delimiters used to distinguish different cells and the row numbers of the rows where the cells are located. The document retrieval device can perform a first segmentation of the table recognition results by row number, and then perform a second segmentation of the results of the first segmentation according to the delimiters, thereby dividing the table recognition results into the contents of multiple cells. The content of each cell is a segmented segment, and the segmented segments in each row are arranged from front to back according to their positions in the document. Furthermore, the segmented segments in each row can be spliced into one content fragment.
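A minimal sketch of this table handling is shown below. The input layout (a list of row-number and row-text pairs), the "|" delimiter, and the space used when splicing cells are all assumptions for illustration; the actual format of the table recognition results is not specified here.

```python
from typing import Dict, List, Sequence, Tuple


def table_to_fragments(
    recognized_rows: Sequence[Tuple[int, str]],
    delimiter: str = "|",
) -> List[str]:
    """Turn (row_number, row_text) pairs into one content fragment per table row."""
    rows: Dict[int, List[str]] = {}
    for row_number, row_text in recognized_rows:
        # Second segmentation: split the row into cell contents.
        cells = [cell.strip() for cell in row_text.split(delimiter)]
        rows.setdefault(row_number, []).extend(cell for cell in cells if cell)
    # Aggregation: splice the cells of each row, in row order.
    return [" ".join(rows[number]) for number in sorted(rows)]
```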

Referring to Figure 9, by performing structured processing on the table recognition results shown in the left figure, multiple content fragments shown in the right figure of Figure 9 can be obtained.

It should be noted that the above-mentioned ways of segmenting text recognition results or table recognition results, and the ways of aggregating multiple segmented segments obtained by segmentation are only illustrative descriptions and cannot be understood as limitations to the technical solution of the present disclosure. In practical applications, those skilled in the art can set a preset segmentation method for segmenting the recognition results of the document as needed, and a preset aggregation method for aggregating multiple segmented fragments, and this disclosure does not limit this.

Step 603: Save each content segment in correspondence with the corresponding content field.

In the embodiment of the present disclosure, the name of the content field can be set to "Content Fragment", and each content fragment can be saved in correspondence with the corresponding content field, so that when the content contained in a content fragment needs to be obtained later, it can be obtained through the content field.

In addition, in the embodiment of the present disclosure, the content contained in each content fragment and the document name, chapter title, and parent titles at all levels corresponding to each content fragment can also be saved in the form of a structure. The fields in the structure can include a field named "Content Fragment", a field named "Document Name", a field named "Chapter Title", and a field named "Parent Titles at All Levels".

Steps 601-603 may be executed before step 102, before step 202, or before step 302.

In summary, the document retrieval method provided by the embodiments of the present disclosure recognizes each document based on optical character recognition (OCR) technology to obtain the recognition result of each document, performs structured processing on each recognition result to obtain the multiple content fragments included in each document, and saves each content fragment in correspondence with the corresponding content field. The documents to be retrieved are thereby processed into multiple content fragments, laying the foundation for accurately determining from the documents the specific content that can answer a user's question and for accurately providing answers to user questions. In addition, by calling the RPA robot to upload each document to the document processing platform, the document processing platform can be used to recognize each document based on OCR technology in the field of artificial intelligence, the recognition result of each document returned by the document processing platform can be obtained, and each recognition result can then be structured to obtain the multiple content fragments included in each document. In this way, RPA and AI are combined to realize IA in obtaining the content fragments in the documents, further reducing the labor cost required for document retrieval.

In order to implement the above embodiments, the present disclosure also proposes a document retrieval device. Figure 10 is a schematic structural diagram of a document retrieval device according to the fifth embodiment of the present disclosure.

As shown in Figure 10, the document retrieval device 1000 includes: a first acquisition module 1001, a query module 1002, a second acquisition module 1003 and a third acquisition module 1004.

Among them, the first acquisition module 1001 is used to acquire query statements;

The query module 1002 is configured to perform a query based on the query statement to obtain multiple candidate content fragments related to the query statement from multiple content fragments included in at least one document;

The second acquisition module 1003 is used to obtain the first correlation between the query statement and each candidate content fragment using a correlation model in the field of natural language processing NLP;

The third acquisition module 1004 is configured to acquire the target content segment that matches the query statement from each candidate content segment based on each first correlation degree.
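The cooperation of modules 1002 to 1004 can be sketched as a two-stage pipeline. This is a minimal sketch under stated assumptions: the function names, the keyword-based recall, and the word-overlap "model" are hypothetical stand-ins, and choosing the candidate with the highest first correlation is one possible way of selecting the target content segment.

```python
from typing import Callable, List, Sequence


def retrieve_target_fragment(
    query: str,
    fragments: Sequence[str],
    recall: Callable[[str, Sequence[str]], List[str]],
    relevance_model: Callable[[str, str], float],
) -> str:
    """Recall candidates, score each one, and return the best-scoring candidate."""
    candidates = recall(query, fragments)                     # query module 1002
    if not candidates:
        raise ValueError("no candidate content fragments were recalled")
    scores = [relevance_model(query, c) for c in candidates]  # second acquisition module 1003
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]                             # third acquisition module 1004


# Toy stand-ins (assumptions): keyword recall and a word-overlap score.
def keyword_recall(query: str, fragments: Sequence[str]) -> List[str]:
    words = set(query.lower().split())
    return [f for f in fragments if words & set(f.lower().split())]


def overlap_model(query: str, fragment: str) -> float:
    q, f = set(query.lower().split()), set(fragment.lower().split())
    return len(q & f) / len(q | f)


fragments = [
    "annual leave is fifteen days per year",
    "sick leave requires a medical certificate",
    "the office opens at nine",
]
target = retrieve_target_fragment(
    "how many days of annual leave", fragments, keyword_recall, overlap_model
)
```

In a real system the recall step would typically be a search index and the relevance model a trained NLP model; both are passed in as callables here so the control flow stays visible.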

It should be noted that the document retrieval device 1000 in the embodiment of the present disclosure can execute the document retrieval method provided in the above embodiments. The document retrieval device 1000 can be implemented by software and/or hardware. The document retrieval device can be an electronic device, or can be configured in an electronic device, to realize automatic retrieval of documents, thereby reducing the labor cost and time cost required for document retrieval and, based on AI technology, accurately determining from documents the specific content that can answer user questions. The electronic device may include but is not limited to a terminal device, a server, etc., and the embodiment of the present disclosure does not specifically limit the electronic device.

In one embodiment of the present disclosure, query module 1002 includes:

The first acquisition unit is used to acquire the content contained in each content segment and the attribute information of each content segment;

The second acquisition unit is used to obtain the content correlation between the query statement and the corresponding content fragment based on the content contained in each content fragment, and to obtain the attribute correlation between the query statement and the corresponding content fragment based on the attribute information of each content fragment;

The third acquisition unit is used to acquire multiple candidate content segments related to the query statement from multiple content segments based on the content correlation and attribute correlation between the query statement and each content segment.

In one embodiment of the present disclosure, the content relevance has a corresponding first weight, and the attribute relevance has a corresponding second weight;

The third acquisition unit is used for:

Based on each content correlation and the corresponding first weight, and each attribute correlation and the corresponding second weight, determine the second correlation between the query statement and the corresponding content fragment;

Based on the second correlation between the query statement and each content fragment, multiple candidate content fragments related to the query statement are obtained from the plurality of content fragments.
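The weighting step can be sketched as follows. The disclosure does not fix the combination rule, so the weighted sum, the default weights 0.7 and 0.3, and the top-k cutoff are assumptions of this sketch, as are the function names and the sample scores.

```python
from typing import List, Tuple


def second_correlation(content_corr: float, attribute_corr: float,
                       first_weight: float, second_weight: float) -> float:
    """Weighted sum of content and attribute correlation (assumed combination rule)."""
    return first_weight * content_corr + second_weight * attribute_corr


def select_candidates(scored: List[Tuple[str, float, float]],
                      first_weight: float = 0.7, second_weight: float = 0.3,
                      top_k: int = 2) -> List[str]:
    """Return the ids of the top_k fragments ranked by second correlation."""
    ranked = sorted(
        scored,
        key=lambda item: second_correlation(item[1], item[2], first_weight, second_weight),
        reverse=True,
    )
    return [fragment_id for fragment_id, _, _ in ranked[:top_k]]


# (fragment id, content correlation, attribute correlation): made-up values
scored = [("a", 0.9, 0.1), ("b", 0.4, 0.8), ("c", 0.2, 0.2)]
candidates = select_candidates(scored)
```

Raising the second weight favors fragments whose document name or titles match the query even when the fragment body matches only loosely.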

In one embodiment of the present disclosure, the second acquisition module 1003 includes:

The fourth acquisition unit is configured to input the query statement and the candidate content fragment into the correlation model for each candidate content fragment, so as to obtain the first correlation between the query statement and the candidate content fragment.

The fifth acquisition unit is used to obtain the corresponding attribute information for each candidate content segment, and splice the attribute information with the candidate content segment to obtain the corresponding splicing result;

The sixth acquisition unit is used to input the splicing results corresponding to the query statement and the candidate content fragment into the correlation model to obtain the first correlation between the query statement and the candidate content fragment.
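The splicing performed by the fifth and sixth acquisition units can be sketched as below. The order of the spliced parts, the space separator, the dictionary keys, and the toy scoring function are all assumptions; the correlation model of the disclosure would be a trained NLP model supplied in place of `toy_model`.

```python
from typing import Any, Callable, Dict, List


def splice(attributes: Dict[str, Any], fragment_text: str, sep: str = " ") -> str:
    """Prefix the fragment with document name, parent titles and chapter title."""
    parts: List[str] = [str(attributes.get("document_name", ""))]
    parts.extend(attributes.get("parent_titles", []))
    parts.append(str(attributes.get("chapter_title", "")))
    parts.append(fragment_text)
    return sep.join(p for p in parts if p)  # skip attributes that are missing


def first_correlation(query: str, attributes: Dict[str, Any], fragment_text: str,
                      model: Callable[[str, str], float]) -> float:
    """Score the query against the spliced text with the supplied model."""
    return model(query, splice(attributes, fragment_text))


def toy_model(query: str, text: str) -> float:
    """Stand-in for the correlation model: share of query words found in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q)


attrs = {"document_name": "HR Handbook", "parent_titles": ["Benefits"],
         "chapter_title": "Annual leave"}
spliced = splice(attrs, "Fifteen days per year.")
with_attrs = first_correlation("annual leave days", attrs, "Fifteen days per year.", toy_model)
without_attrs = toy_model("annual leave days", "Fifteen days per year.")
```

The sample shows why splicing can help: the fragment body alone never mentions "annual leave", so the titles carry the context the query depends on.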

In one embodiment of the present disclosure, the document retrieval device 1000 also includes: a recognition module, used to recognize each document based on the optical character recognition OCR technology in the artificial intelligence field to obtain the recognition results of each document;

The processing module is used to perform structured processing on each recognition result to obtain multiple content fragments included in each document;

The saving module is used to save each content fragment correspondingly to the corresponding content field.

In one embodiment of the present disclosure, the recognition module includes:

The upload unit is used to call the RPA robot to upload each document to the document processing platform, so as to use the document processing platform to identify each document based on optical character recognition OCR technology;

The seventh acquisition unit is used to acquire the recognition results of each document returned by the document processing platform.

In one embodiment of the present disclosure, the recognition results include text recognition results and/or table recognition results;

The processing module includes:

A segmentation unit used to segment text recognition results and/or table recognition results according to a preset segmentation method to obtain multiple segmented segments;

The aggregation unit is used to aggregate multiple segmented segments according to a preset aggregation method to obtain multiple content segments, wherein each content segment is obtained by aggregating at least one segmented segment.
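Segmentation followed by aggregation can be sketched as below. The preset segmentation and aggregation methods are not specified here, so splitting on blank lines and greedily merging consecutive segments up to a character limit are assumptions chosen for illustration.

```python
from typing import List


def segment(recognized_text: str) -> List[str]:
    """Split recognized text into segments on blank lines (assumed segmentation rule)."""
    return [block.strip() for block in recognized_text.split("\n\n") if block.strip()]


def aggregate(segments: List[str], max_chars: int = 60) -> List[str]:
    """Greedily merge consecutive segments until max_chars would be exceeded."""
    fragments: List[str] = []
    current = ""
    for seg in segments:
        merged = f"{current} {seg}".strip()
        if current and len(merged) > max_chars:
            fragments.append(current)  # close the current fragment
            current = seg              # start a new one with this segment
        else:
            current = merged
    if current:
        fragments.append(current)
    return fragments


text = ("Leave policy.\n\nFifteen days per year.\n\n"
        "Unused days expire in March of the following year.")
segments = segment(text)
fragments = aggregate(segments)
```

Every resulting content fragment is built from at least one segment, and a segment longer than the limit simply becomes a fragment of its own.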

In one embodiment of the present disclosure, the attribute information includes at least one of a document name, a chapter title, and parent titles at various levels of the chapter title.

It should be noted that the foregoing explanation of the embodiment of the document retrieval method also applies to the document retrieval device of this embodiment. Details not disclosed in the embodiment of the document retrieval device of the present disclosure can be found in the method embodiment and will not be described again here.

In summary, the document retrieval device of the embodiment of the present disclosure acquires a query statement; performs a query based on the query statement to obtain multiple candidate content fragments related to the query statement from multiple content fragments included in at least one document; uses a correlation model in the field of natural language processing (NLP) to obtain the first correlation between the query statement and each candidate content fragment; and obtains the target content fragment matching the query statement from the candidate content fragments based on each first correlation. As a result, automatic document retrieval is realized, reducing the labor cost and time cost required for document retrieval. Moreover, because the target content fragment matching the query statement is obtained according to the correlation, obtained based on AI technology, between each content fragment in the document and the query statement, the specific content that can answer a user's question can be accurately determined from the document, laying the foundation for accurately providing answers to user questions.

In order to implement the above embodiments, embodiments of the present disclosure also provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the document retrieval method as described in any of the foregoing method embodiments is implemented.

In order to implement the above embodiments, embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the document retrieval method as described in any of the foregoing method embodiments is implemented. In some embodiments, the computer-readable storage medium is a non-transitory computer-readable storage medium.

In order to implement the above embodiments, embodiments of the present disclosure also provide a computer program product. When instructions in the computer program product are executed by a processor, the document retrieval method as described in any of the foregoing method embodiments is implemented.

In order to implement the above embodiments, an embodiment of the present disclosure also proposes a computer program. The computer program includes computer program code. When the computer program code is run on a computer, it causes the computer to execute the document retrieval method as described in any of the foregoing method embodiments.

FIG. 11 illustrates a block diagram of an exemplary electronic device suitable for implementing embodiments of the present disclosure. The electronic device 11 shown in FIG. 11 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.

As shown in Figure 11, electronic device 11 is embodied in the form of a general-purpose computing device. The components of electronic device 11 may include, but are not limited to: one or more processors or processing units 16, system memory 28, and a bus 18 connecting different system components (including memory 28 and processing unit 16).

Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (hereinafter referred to as: ISA) bus, the Micro Channel Architecture (hereinafter referred to as: MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (hereinafter referred to as: VESA) local bus, and the Peripheral Component Interconnect (hereinafter referred to as: PCI) bus.

Electronic device 11 typically includes a variety of computer system readable media. These media may be any available media that can be accessed by electronic device 11, including volatile and non-volatile media, removable and non-removable media.

The memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (hereinafter referred to as: RAM) 30 and/or cache memory 32. Electronic device 11 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in Figure 11, commonly referred to as a "hard drive"). Although not shown in FIG. 11, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk") and an optical disc drive for reading from and writing to a removable non-volatile optical disc (e.g., a compact disc read-only memory (hereinafter referred to as: CD-ROM), a digital video disc read-only memory (hereinafter referred to as: DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 through one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of embodiments of the present disclosure.

A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data, each of which, or some combination of which, may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods in the embodiments described in this disclosure.

Electronic device 11 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with electronic device 11, and/or with any device (e.g., network card, modem, etc.) that enables the electronic device 11 to communicate with one or more other computing devices. This communication may occur through input/output (I/O) interface 22. Moreover, the electronic device 11 can also communicate through the network adapter 20 with one or more networks, such as a local area network (hereinafter referred to as: LAN), a wide area network (hereinafter referred to as: WAN), and/or a public network such as the Internet. As shown in FIG. 11, the network adapter 20 communicates with other modules of the electronic device 11 through the bus 18. It should be understood that, although not shown in Figure 11, other hardware and/or software modules may be used in conjunction with electronic device 11, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

The processing unit 16 executes programs stored in the memory 28 to perform various functional applications and data processing, such as implementing the methods mentioned in the previous embodiments.

It should be noted that the foregoing explanations of the embodiments of the document retrieval method are also applicable to the electronic devices, computer-readable storage media, computer program products and computer programs of the embodiments of the present disclosure, and will not be described again here.

In the description of this specification, reference to the terms "one embodiment," "some embodiments," "an example," "specific examples," or "some examples" or the like means that the specific features, structures, materials, or characteristics described in connection with the embodiment or example are included in at least one embodiment or example of the present disclosure. In this specification, the schematic expressions of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine different embodiments or examples, and features of different embodiments or examples, described in this specification, provided they do not conflict with each other.

In addition, the terms “first” and “second” are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of indicated technical features. Therefore, features defined as "first" and "second" may explicitly or implicitly include at least one of these features. In the description of the present disclosure, "plurality" means at least two, such as two, three, etc., unless otherwise expressly and specifically limited.

Any process or method description in a flowchart or otherwise described herein may be understood to represent a module, segment, or portion of code that includes one or more executable instructions for implementing customized logical functions or steps of the process, and the scope of the preferred embodiments of the present disclosure includes additional implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order, depending on the functionality involved, as should be understood by those skilled in the art to which embodiments of the present disclosure belong.

The logic and/or steps represented in the flowcharts or otherwise described herein may, for example, be considered a sequenced list of executable instructions for implementing the logical functions, and may be embodied in any computer-readable medium for use by, or in combination with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any device that can contain, store, communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include the following: an electrical connection with one or more wires (electronic device), a portable computer disk cartridge (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner as necessary, and then stored in computer memory.

It should be understood that various parts of the present disclosure may be implemented in hardware, software, firmware, or combinations thereof. In the above embodiments, various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following technologies known in the art: discrete logic circuits having logic gate circuits for implementing logic functions on data signals, application specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGA), field programmable gate arrays (FPGA), etc.

Those of ordinary skill in the art can understand that all or part of the steps involved in implementing the methods of the above embodiments can be completed by instructing relevant hardware through a program. The program can be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.

In addition, each functional unit in various embodiments of the present disclosure may be integrated into one processing module, each unit may exist physically alone, or two or more units may be integrated into one module. The above integrated modules can be implemented in the form of hardware or software function modules. If the integrated module is implemented in the form of a software function module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.

The storage media mentioned above can be read-only memory, magnetic disks or optical disks, etc. Although the embodiments of the present disclosure have been shown and described above, it can be understood that the above-mentioned embodiments are illustrative and should not be construed as limitations of the present disclosure. Those of ordinary skill in the art can make changes, modifications, substitutions and variations to the above-mentioned embodiments within the scope of the present disclosure.

Claims

A document retrieval method including:

Get the query statement;

Perform a query based on the query statement to obtain a plurality of candidate content fragments related to the query statement from a plurality of content fragments included in at least one document;

Using a correlation model in the field of natural language processing NLP, obtain the first correlation between the query statement and each of the candidate content fragments;

Based on each of the first correlations, a target content segment matching the query statement is obtained from each of the candidate content segments.
The method of claim 1, wherein querying based on the query statement to obtain multiple candidate content fragments related to the query statement from multiple content fragments included in at least one document includes:

Obtain the content contained in each content segment and the attribute information of each content segment;

Based on the content included in each content segment, the content correlation between the query statement and the corresponding content segment is obtained, and based on the attribute information of each content segment, the attribute correlation between the query statement and the corresponding content segment is obtained;

Based on the content correlation and the attribute correlation between the query statement and each of the content fragments, multiple candidate content fragments related to the query statement are obtained from a plurality of the content fragments.
The method of claim 2, wherein the content relevance has a corresponding first weight, and the attribute relevance has a corresponding second weight;

Obtaining a plurality of candidate content fragments related to the query statement from a plurality of the content fragments, based on the content correlation and the attribute correlation between the query statement and each of the content fragments, includes:

Based on each content correlation and the corresponding first weight, and each attribute correlation and the corresponding second weight, determine the second correlation between the query statement and the corresponding content fragment;

Based on the second correlation between the query statement and each of the content fragments, a plurality of candidate content fragments related to the query statement are obtained from a plurality of the content fragments.
The method according to any one of claims 1 to 3, wherein obtaining the first correlation between the query statement and each of the candidate content segments by using a correlation model in the field of natural language processing (NLP) includes:

For each candidate content segment, the query statement and the candidate content segment are input into the correlation model to obtain a first correlation between the query statement and the candidate content segment.
The method according to any one of claims 1 to 4, wherein obtaining the first correlation between the query statement and each of the candidate content fragments by using a correlation model in the field of natural language processing (NLP) includes:

For each candidate content segment, obtain the corresponding attribute information, and splice the attribute information and the candidate content segment to obtain the corresponding splicing result;

The splicing result corresponding to the query statement and the candidate content segment is input into the correlation model to obtain the first correlation between the query statement and the candidate content segment.
The method according to any one of claims 1 to 5, wherein, before performing the query based on the query statement to obtain a plurality of candidate content fragments related to the query statement from a plurality of content fragments included in at least one document, the method further includes:

Based on the optical character recognition OCR technology in the field of artificial intelligence, identify each of the documents to obtain the recognition results of each of the documents;

Perform structured processing on each of the recognition results to obtain a plurality of content fragments included in each of the documents;

Each content segment is stored in correspondence with the corresponding content field.
The method according to claim 6, wherein recognizing each of the documents based on the optical character recognition (OCR) technology in the field of artificial intelligence (AI) to obtain the recognition results of each of the documents includes:

Call the RPA robot to upload each of the documents to the document processing platform, so as to use the document processing platform to identify each of the documents based on the optical character recognition OCR technology;

Obtain the recognition results of each document returned by the document processing platform.
The method according to claim 6 or 7, wherein the recognition results include text recognition results and/or table recognition results;

The step of performing structured processing on each of the recognition results to obtain a plurality of content fragments included in each of the documents includes:

Segment the text recognition result and/or the table recognition result according to a preset segmentation method to obtain multiple segmented segments;

A plurality of the segmented segments are aggregated according to a preset aggregation method to obtain a plurality of the content segments, wherein each content segment is obtained by aggregating at least one of the segmented segments.
The method according to any one of claims 1 to 8, wherein the attribute information includes at least one of a document name, a chapter title, and parent titles of each level of the chapter title.
A document retrieval device, including:

The first acquisition module is used to obtain query statements;

A query module configured to perform a query based on the query statement to obtain multiple candidate content fragments related to the query statement from multiple content fragments included in at least one document;

The second acquisition module is used to obtain the first correlation between the query statement and each of the candidate content fragments using a correlation model in the field of natural language processing NLP;

A third acquisition module is configured to acquire a target content segment that matches the query statement from each of the candidate content segments based on each of the first correlation degrees.
The device according to claim 10, wherein the query module includes:

A first acquisition unit, configured to acquire the content contained in each content segment and the attribute information of each content segment;

The second acquisition unit is configured to obtain the content correlation between the query statement and the corresponding content segment based on the content contained in each content segment, and obtain the attribute correlation between the query statement and the corresponding content segment based on the attribute information of each content segment;

A third acquisition unit configured to acquire, from a plurality of content segments, multiple candidate content segments related to the query statement based on the content correlation and the attribute correlation between the query statement and each of the content segments.
The device according to claim 11, wherein the content relevance has a corresponding first weight, and the attribute relevance has a corresponding second weight;

The third acquisition unit is used for:

Based on each content correlation and the corresponding first weight, and each attribute correlation and the corresponding second weight, determine the second correlation between the query statement and the corresponding content fragment;

Based on the second correlation between the query statement and each of the content fragments, a plurality of candidate content fragments related to the query statement are obtained from a plurality of the content fragments.
The device according to any one of claims 10 to 12, wherein the second acquisition module includes:

The fourth acquisition unit is configured to input the query statement and the candidate content fragment into the correlation model for each of the candidate content fragments, so as to obtain the first correlation between the query statement and the candidate content fragment.
An electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the method according to any one of claims 1 to 9 is implemented.
A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the method according to any one of claims 1 to 9 is implemented.
A computer program product, wherein the computer program product includes a computer program, and when the computer program is executed by a processor, the method according to any one of claims 1 to 9 is implemented.
A computer program, wherein the computer program includes computer program code, which when run on a computer causes the computer to perform the method according to any one of claims 1 to 9.