WO2023236253A1 - Document retrieval method and apparatus, and electronic device - Google Patents

Document retrieval method and apparatus, and electronic device

Publication number
WO2023236253A1
Authority
WO
WIPO (PCT)
Prior art keywords
content
correlation
query statement
document
fragments
Prior art date
Application number
PCT/CN2022/100569
Other languages
French (fr)
Chinese (zh)
Inventor
段沛宸
Original Assignee
来也科技(北京)有限公司
Application filed by 来也科技(北京)有限公司
Publication of WO2023236253A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables

Definitions

  • The present disclosure relates to the technical fields of robotic process automation and artificial intelligence, and specifically to a document retrieval method, a document retrieval device, and an electronic device.
  • Robotic Process Automation (RPA) uses specific "robot software" to simulate human operations on a computer and automatically execute process tasks according to rules.
  • AI: Artificial Intelligence.
  • Intelligent Automation is a general term for a series of technologies ranging from robotic process automation to artificial intelligence. It combines RPA with AI technologies such as Optical Character Recognition (OCR), Intelligent Character Recognition (ICR), Process Mining, Deep Learning (DL), Machine Learning (ML), Natural Language Processing (NLP), Automatic Speech Recognition (ASR), Text To Speech (TTS), and Computer Vision (CV) to create end-to-end business processes that can think, learn, and adapt. It covers the entire flow from process discovery and process automation to automatic and continuous data collection, understanding the meaning of data, and using data to manage and optimize business processes.
  • The present disclosure provides a document retrieval method, a document retrieval device, and an electronic device to solve two technical problems of existing document retrieval: high labor and time costs, and the inability to accurately retrieve the specific content in a document that can answer a user's question.
  • An embodiment of the first aspect of the present disclosure provides a document retrieval method.
  • The method includes: obtaining a query statement; performing a query based on the query statement to obtain multiple candidate content fragments related to the query statement from multiple content fragments included in at least one document; using a correlation model in the field of natural language processing (NLP) to obtain a first correlation between the query statement and each candidate content fragment; and, based on each first correlation, obtaining a target content fragment that matches the query statement from the candidate content fragments.
  • Performing a query based on the query statement to obtain multiple candidate content fragments related to the query statement from multiple content fragments included in at least one document includes: obtaining the content contained in each content fragment and the attribute information of each content fragment; obtaining, based on the content contained in each content fragment, the content correlation between the query statement and the corresponding content fragment, and obtaining, based on the attribute information of each content fragment, the attribute correlation between the query statement and the corresponding content fragment; and, based on the content correlation and attribute correlation between the query statement and each content fragment, obtaining multiple candidate content fragments related to the query statement from the multiple content fragments.
  • The content relevance has a corresponding first weight.
  • The attribute relevance has a corresponding second weight.
  • Obtaining multiple candidate content fragments related to the query statement from the multiple content fragments, based on the content relevance and attribute relevance between the query statement and each content fragment, includes: determining a second correlation degree between the query statement and the corresponding content fragment based on each content relevance and the corresponding first weight, and each attribute relevance and the corresponding second weight; and, based on the second correlation degree between the query statement and each content fragment, obtaining multiple candidate content fragments related to the query statement from the multiple content fragments.
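  • As an illustration of the weighting described above, the following minimal Python sketch combines the two scores into a second correlation degree and ranks fragments by it. The text does not state how the weights are applied; the weighted sum, the function names `second_correlation` and `rank_fragments`, and the default weights 0.7 and 0.3 are assumptions made only for this example.

```python
from typing import List, Tuple


def second_correlation(content_rel: float, attribute_rel: float,
                       first_weight: float, second_weight: float) -> float:
    """Combine the two scores. A weighted sum is an assumed combination rule."""
    return first_weight * content_rel + second_weight * attribute_rel


def rank_fragments(fragments: List[Tuple[str, float, float]],
                   first_weight: float = 0.7,
                   second_weight: float = 0.3) -> List[Tuple[str, float]]:
    """Rank (fragment_id, content_rel, attribute_rel) triples by second correlation."""
    scored = [
        (fragment_id, second_correlation(c, a, first_weight, second_weight))
        for fragment_id, c, a in fragments
    ]
    # Highest second correlation first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores: fragment-b scores roughly 0.59 and fragment-a roughly 0.41.
fragments = [("fragment-a", 0.2, 0.9), ("fragment-b", 0.8, 0.1)]
ranked = rank_fragments(fragments)
```

  • Raising the second weight relative to the first would favor fragments whose titles or document names match the query over fragments whose body text matches it.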
  • Using a correlation model in the field of natural language processing (NLP) to obtain the first correlation between the query statement and each candidate content fragment includes: for each candidate content fragment, inputting the query statement and the candidate content fragment into the correlation model to obtain the first correlation between the query statement and the candidate content fragment.
  • Using a correlation model in the field of natural language processing (NLP) to obtain the first correlation between the query statement and each candidate content fragment can include: for each candidate content fragment, obtaining the corresponding attribute information and splicing the attribute information with the candidate content fragment to obtain a corresponding splicing result; and inputting the query statement and the splicing result corresponding to the candidate content fragment into the correlation model to obtain the first correlation between the query statement and the candidate content fragment.
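  • The splicing step can be sketched as plain string concatenation. The ordering (document name, parent titles, chapter title, then the fragment text), the dictionary keys, and the separator are assumptions for illustration; the text only says that the attribute information is spliced with the candidate content fragment.

```python
from typing import Dict, List, Union

Attributes = Dict[str, Union[str, List[str]]]


def splice(attributes: Attributes, fragment: str, sep: str = " ") -> str:
    """Prefix a candidate fragment with its attribute information."""
    parts: List[str] = []
    parts.append(str(attributes.get("document_name", "")))
    parts.extend(attributes.get("parent_titles", []))  # outermost title first
    parts.append(str(attributes.get("chapter_title", "")))
    parts.append(fragment)
    # Skip attributes that are missing or empty.
    return sep.join(p for p in parts if p)


# Hypothetical attribute values for one candidate fragment.
attrs = {
    "document_name": "Manual",
    "parent_titles": ["Chapter 1"],
    "chapter_title": "1.1 Types",
}
spliced = splice(attrs, "Text.")
# The pair (query_statement, spliced) is what would be fed to the correlation model.
```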
  • Before performing a query based on the query statement to obtain multiple candidate content fragments related to the query statement from multiple content fragments included in at least one document, the method further includes: recognizing each document based on optical character recognition (OCR) technology in the field of artificial intelligence to obtain a recognition result for each document; performing structured processing on each recognition result to obtain the multiple content fragments included in each document; and storing each content fragment in association with a corresponding content field.
  • Recognizing each document to obtain a recognition result for each document includes: calling an RPA robot to upload each document to a document processing platform, so that the document processing platform recognizes each document based on OCR technology, and obtaining the recognition result of each document returned by the document processing platform.
  • The recognition results include text recognition results and/or table recognition results. Performing structured processing on each recognition result to obtain the multiple content fragments included in each document includes: segmenting the text recognition results and/or table recognition results according to a preset segmentation method to obtain multiple segmented fragments; and aggregating the multiple segmented fragments according to a preset aggregation method to obtain multiple content fragments, where each content fragment is obtained by aggregating at least one segmented fragment.
  • The attribute information includes at least one of a document name, a chapter title, and the parent titles at each level of the chapter title.
  • An embodiment of the second aspect of the present disclosure provides a document retrieval device.
  • The device includes: a first acquisition module for acquiring a query statement; a query module for performing a query based on the query statement to obtain multiple candidate content fragments related to the query statement from multiple content fragments included in at least one document; a second acquisition module for obtaining a first correlation degree between the query statement and each candidate content fragment using a correlation model in the field of natural language processing (NLP); and a third acquisition module for acquiring, based on each first correlation degree, a target content fragment that matches the query statement from the candidate content fragments.
  • The query module includes: a first acquisition unit for acquiring the content contained in each content fragment and the attribute information of each content fragment; a second acquisition unit for acquiring, based on the content contained in each content fragment, the content correlation between the query statement and the corresponding content fragment, and acquiring, based on the attribute information of each content fragment, the attribute correlation between the query statement and the corresponding content fragment; and a third acquisition unit for obtaining, based on the content correlation and attribute correlation between the query statement and each content fragment, multiple candidate content fragments related to the query statement from the multiple content fragments.
  • The content relevance has a corresponding first weight.
  • The attribute relevance has a corresponding second weight.
  • The third acquisition unit is used to: determine the second correlation degree between the query statement and the corresponding content fragment based on each content relevance and the corresponding first weight, and each attribute relevance and the corresponding second weight; and, based on the second correlation degree between the query statement and each content fragment, obtain multiple candidate content fragments related to the query statement from the multiple content fragments.
  • The second acquisition module includes: a fourth acquisition unit for inputting, for each candidate content fragment, the query statement and the candidate content fragment into the correlation model to obtain the first correlation degree between the query statement and the candidate content fragment.
  • An embodiment of the third aspect of the present disclosure provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • When the processor executes the computer program, the method described in the embodiment of the first aspect of the present disclosure is implemented.
  • An embodiment of the fourth aspect of the present disclosure provides a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by a processor, the method described in the embodiment of the first aspect of the present disclosure is implemented.
  • An embodiment of the fifth aspect of the present disclosure provides a computer program product, which includes a computer program. When executed by a processor, the computer program implements the method described in the embodiment of the first aspect of the present disclosure.
  • An embodiment of the sixth aspect of the present disclosure provides a computer program.
  • The computer program includes computer program code.
  • When the computer program code is run on a computer, it causes the computer to execute the method described in the embodiment of the first aspect of the present disclosure.
  • Figure 1 is a schematic flowchart of a document retrieval method according to the first embodiment of the present disclosure
  • Figure 2 is a schematic flowchart of a document retrieval method according to a second embodiment of the present disclosure
  • Figure 3 is a schematic flowchart of a document retrieval method according to a third embodiment of the present disclosure.
  • Figure 4 is an example diagram of an interactive interface provided by a document retrieval device according to a third embodiment of the present disclosure
  • Figure 5 is an example diagram of candidate content segments and corresponding attribute information according to the third embodiment of the present disclosure.
  • Figure 6 is a schematic flowchart of a document retrieval method according to the fourth embodiment of the present disclosure.
  • Figure 7 is an example diagram of an interactive interface of a document processing platform and a recognition result of a document according to the fourth embodiment of the present disclosure
  • Figure 8 is an example diagram of text recognition results and corresponding content fragments according to the fourth embodiment of the present disclosure.
  • Figure 9 is an example diagram of table recognition results and corresponding content fragments according to the fourth embodiment of the present disclosure.
  • Figure 10 is a schematic structural diagram of a document retrieval device according to the fifth embodiment of the present disclosure.
  • FIG. 11 is a block diagram of an electronic device used to implement the document retrieval method of an embodiment of the present disclosure.
  • Embodiments of the present disclosure provide a document retrieval method that reduces the labor and time costs of document retrieval by performing retrieval automatically instead of manually.
  • A query can be performed based on the query statement to obtain multiple candidate content fragments related to the query statement from multiple content fragments included in at least one document. A correlation model in the field of natural language processing (NLP) is then used to obtain the first correlation degree between the query statement and each candidate content fragment, and the target content fragment matching the query statement is obtained from the candidate content fragments based on each first correlation degree.
  • An "RPA robot" is a software robot that can combine AI technology and RPA technology to perform business processing automatically.
  • RPA robots have two characteristics: they act as a "connector" and they are "non-intrusive". By simulating human operations, they can extract, integrate, and connect data from different systems non-intrusively, without changing the information systems.
  • A "query statement" is the statement input by the user for a query, that is, the question the user wants to ask. It can be in text form or in voice form, and this disclosure does not limit this.
  • A "document" is a document in electronic form from which specific content that can answer the user's question is retrieved. It may be a PDF (Portable Document Format) document obtained by scanning a paper document, or a document edited on a smart device such as a computer or mobile phone, and this disclosure does not limit this.
  • A "content fragment" is a fragment composed of part of the content of a document.
  • A content fragment can be one or several sentences, a paragraph in the document, a table in the document, part of the content of a table, and so on, and this disclosure does not limit this.
  • The number of characters included in a content fragment can be set in advance, so that when all documents to be retrieved are processed, their content is divided into multiple content fragments, each containing no more than the preset number of characters.
  • "Candidate content fragments" are the content fragments related to the query statement that are obtained from all content fragments included in all documents.
  • The "target content fragment" is the content fragment, obtained from the candidate content fragments, that matches the query statement, that is, the specific content that can answer the user's question.
  • "Attribute information" is information that represents the attributes of a content fragment, such as the document name of the document where the content fragment is located, the chapter title corresponding to the content fragment, and the parent titles at each level of the chapter title.
  • A "correlation degree" expresses the magnitude of the degree of correlation.
  • The "first correlation degree" is the correlation degree between the query statement and a candidate content fragment as determined by the correlation model.
  • The first correlation degree indicates the degree of correlation between the query statement and the candidate content fragment.
  • A "correlation model" is any machine learning model used to determine the degree of correlation, such as BERT (Bidirectional Encoder Representations from Transformers) or other neural network models. The correlation model can be obtained by fine-tuning a pre-trained model in the NLP field.
  • "Content correlation" (content relevance) is the correlation between the query statement and a content fragment determined based on the content contained in the content fragment. It represents the degree of correlation between the content contained in the content fragment and the query statement.
  • "Attribute correlation" (attribute relevance) is the correlation between the query statement and a content fragment determined based on the attribute information corresponding to the content fragment. It represents the degree of correlation between the attribute information corresponding to the content fragment and the query statement.
  • The "second correlation degree" is the correlation degree between the query statement and a content fragment determined based on the content correlation and the attribute correlation. It comprehensively represents the degree of correlation between the query statement and both the content contained in the content fragment and the corresponding attribute information.
  • "Segmented fragments" are fragments composed of content obtained by dividing a document. For example, after a document is divided into sentences at sentence-ending punctuation marks, each sentence is a segmented fragment.
  • Each content fragment in the embodiments of the present disclosure may include one or more segmented fragments.
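  • The relationship between segmented fragments, content fragments, and the preset character limit can be sketched as follows. The sentence-splitting regular expression, the greedy aggregation strategy, and the function names are assumptions chosen for illustration; the text leaves the segmentation and aggregation methods as presets.

```python
import re
from typing import List

# Runs of text followed by any sentence-ending punctuation (Western or CJK).
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def split_into_segments(text: str) -> List[str]:
    """Divide text into sentence-level segmented fragments."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def aggregate(segments: List[str], max_chars: int) -> List[str]:
    """Greedily merge consecutive segments into content fragments.

    A fragment is closed when adding the next segment would exceed max_chars.
    A single segment longer than max_chars is kept whole in this sketch;
    a fuller version would split it further to honor the limit.
    """
    fragments: List[str] = []
    current = ""
    for segment in segments:
        candidate = segment if not current else current + " " + segment
        if current and len(candidate) > max_chars:
            fragments.append(current)
            current = segment
        else:
            current = candidate
    if current:
        fragments.append(current)
    return fragments


segments = split_into_segments("One. Two. Three.")
content_fragments = aggregate(segments, max_chars=10)
```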
  • A "document processing platform" is an intelligent automation platform for intelligently processing documents.
  • Intelligent Document Processing is one of the core capabilities of the intelligent automation platform.
  • Intelligent document processing is a new generation of automation technology based on AI technologies such as Optical Character Recognition (OCR), Computer Vision (CV), Natural Language Processing (NLP), and Knowledge Graph (KG). It identifies, classifies, extracts elements from, verifies, compares, and corrects errors in various types of documents, helping enterprises make document processing intelligent and automated.
  • A "content field" is a field composed of a single character or multiple consecutive characters.
  • The content field can be understood as the attribute key, and the content contained in the content fragment as the attribute value.
  • The content field and the corresponding content fragment together form a piece of structured data.
  • The content field and the fields corresponding to the attribute information of the content fragment, such as the fields named "Document Name", "Chapter Title", and "Parent Titles at All Levels", can together form a structure.
  • Figure 1 is a flow chart of a document retrieval method according to the first embodiment of the present disclosure. As shown in Figure 1, the method may include the following steps: steps 101-104.
  • Step 101: Obtain the query statement.
  • The document retrieval method in the embodiments of the present disclosure can be executed by a document retrieval device.
  • The document retrieval device can be implemented in software and/or hardware.
  • The document retrieval device can be an electronic device, or can be configured in an electronic device, to perform document retrieval automatically. This reduces the labor and time costs required for document retrieval and, based on AI technology, accurately determines from documents the specific content that can answer user questions.
  • the electronic device may include but is not limited to a terminal device, a server, etc., and the embodiment of the present disclosure does not specifically limit the electronic device.
  • the document retrieval device can provide an interactive interface, so that the user can input a query statement in the interactive interface to perform a query, and accordingly, the document retrieval device can obtain the query statement.
  • Step 102: Perform a query based on the query statement to obtain multiple candidate content fragments related to the query statement from multiple content fragments included in at least one document.
  • A large number of documents to be retrieved (that is, documents from which specific content that can answer user questions needs to be retrieved) can be processed in advance to obtain multiple content fragments, which are saved to a retrieval engine.
  • The retrieval engine can then be used to perform a query based on the query statement, obtaining multiple candidate content fragments related to the query statement from the multiple content fragments and returning them to the document retrieval device.
  • the document retrieval device can obtain multiple candidate content fragments.
  • the retrieval engine can be any retrieval engine with a retrieval function, and this disclosure does not limit this.
  • the retrieval engine may be configured in the document retrieval device, or the retrieval engine may be configured separately and connected to the document retrieval device through an interface, which is not limited by the present disclosure.
  • The number of candidate content fragments can be set in advance. The retrieval engine obtains the correlation between the query statement and each content fragment, sorts the content fragments in descending order of correlation, and determines the preset number of top-ranked content fragments as the multiple candidate content fragments.
  • Alternatively, a first correlation threshold can be set in advance. The retrieval engine obtains the correlation between the query statement and each content fragment and determines the content fragments whose correlation is greater than the first correlation threshold as the multiple candidate content fragments.
  • the first correlation threshold can be set arbitrarily as needed, and this disclosure does not limit this.
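  • The two selection strategies for step 102, a preset number of top-ranked fragments or a correlation threshold, can be sketched as follows. The function names and the sample scores are hypothetical; how the retrieval engine computes each score is outside this sketch.

```python
from typing import List, Tuple

Scored = List[Tuple[str, float]]  # (content fragment, correlation with the query)


def top_k(scored: Scored, k: int) -> List[str]:
    """Keep the k fragments with the highest correlation, best first."""
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [fragment for fragment, _ in ordered[:k]]


def above_threshold(scored: Scored, threshold: float) -> List[str]:
    """Keep every fragment whose correlation is strictly greater than the threshold."""
    return [fragment for fragment, score in scored if score > threshold]


# Hypothetical scores returned by a retrieval engine.
scored = [("fragment-a", 0.2), ("fragment-b", 0.9), ("fragment-c", 0.5)]
by_count = top_k(scored, k=2)
by_threshold = above_threshold(scored, threshold=0.4)
```

  • A fixed count bounds the work done by the later correlation model, while a threshold lets the number of candidates vary with how well the documents match the query.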
  • Step 103: Use the correlation model in the field of natural language processing (NLP) to obtain the first correlation between the query statement and each candidate content fragment.
  • The correlation model can be trained in advance.
  • The input of the correlation model is a candidate content fragment and the query statement, and the output is the correlation score (i.e., confidence) between them. For each candidate content fragment, the query statement and the candidate content fragment can be input into the trained correlation model, which determines the degree of correlation between them based on the content of the query statement and the candidate content fragment and outputs the first correlation. The document retrieval device thus obtains the first correlation between the query statement and the candidate content fragment from the output of the correlation model.
  • Step 104: Based on each first correlation degree, obtain a target content fragment that matches the query statement from the candidate content fragments.
  • The number of target content fragments may be one or more and may be set as needed, and this disclosure does not limit this.
  • Taking the case of one target content fragment as an example, based on the first correlation between the query statement and each candidate content fragment, the candidate content fragment with the highest first correlation can be used as the target content fragment.
  • Based on the target content fragment, an answer to the query statement can be obtained.
  • The document retrieval device can provide an interactive interface through which the answer to the query statement is displayed.
  • The document retrieval device can also obtain the attribute information of the target content fragment and display the target content fragment, the corresponding attribute information, and the paragraph or table containing the target content fragment through the interactive interface, so that users can more clearly understand the source of the answer to the query statement.
  • In summary, the document retrieval method obtains a query statement; performs a query based on the query statement to obtain multiple candidate content fragments related to the query statement from multiple content fragments included in at least one document; uses a correlation model in the field of natural language processing (NLP) to obtain the first correlation between the query statement and each candidate content fragment; and obtains the target content fragment matching the query statement from the candidate content fragments based on each first correlation.
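  • Steps 101 to 104 form a retrieve-then-rerank flow, sketched below with pluggable callables. The toy retrieval and scoring functions (word overlap and Jaccard similarity) are stand-ins invented for this example so that it runs without a trained model; they are not the retrieval engine or the fine-tuned correlation model that the text describes.

```python
from typing import Callable, List, Sequence


def retrieve_then_rerank(
    query: str,
    fragments: Sequence[str],
    retrieve: Callable[[str, Sequence[str]], List[str]],
    relevance_model: Callable[[str, str], float],
    num_targets: int = 1,
) -> List[str]:
    candidates = retrieve(query, fragments)                         # step 102
    scored = [(relevance_model(query, c), c) for c in candidates]   # step 103
    scored.sort(key=lambda pair: pair[0], reverse=True)             # step 104
    return [fragment for _, fragment in scored[:num_targets]]


def toy_retrieve(query: str, fragments: Sequence[str]) -> List[str]:
    """Stand-in retrieval: keep fragments sharing at least one word with the query."""
    words = set(query.lower().split())
    return [f for f in fragments if words & set(f.lower().split())]


def toy_relevance(query: str, fragment: str) -> float:
    """Stand-in for the correlation model: Jaccard similarity of word sets."""
    q, f = set(query.lower().split()), set(fragment.lower().split())
    union = q | f
    return len(q & f) / len(union) if union else 0.0


corpus = [
    "transformer type list",
    "cable length limits",
    "type of transformer oil used here",
]
targets = retrieve_then_rerank("transformer type", corpus, toy_retrieve, toy_relevance)
```

  • Keeping the two stages behind separate callables mirrors the split in the text: a cheap query narrows the fragments, and the more expensive model scores only the candidates.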
  • Figure 2 is a flow chart of a document retrieval method according to the second embodiment of the present disclosure. As shown in Figure 2, the method includes: steps 201-206.
  • Step 201: Obtain the query statement.
  • Step 202: Obtain the content contained in each content fragment and the attribute information of each content fragment.
  • The attribute information of a content fragment may include at least one of the document name of the document in which the content fragment is located, the chapter title corresponding to the content fragment, and the parent titles at all levels of that chapter title.
  • The content contained in each content fragment and the attribute information of the content fragment can be saved in the form of a structure. The fields in the structure may include a field named "Document Name", a field named "Chapter Title", a field named "Parent Titles at All Levels", and a field named "Content Fragment", so that the document retrieval device can obtain, from each structure, the content contained in the corresponding content fragment and the corresponding attribute information.
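  • One possible shape for such a structure is sketched below as a Python dataclass. The class name, the snake_case field names, and the sample values are assumptions for illustration; only the four field meanings come from the text.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class FragmentRecord:
    document_name: str                                       # "Document Name"
    chapter_title: str                                       # "Chapter Title"
    parent_titles: List[str] = field(default_factory=list)   # "Parent Titles at All Levels"
    content_fragment: str = ""                               # "Content Fragment"


# Hypothetical record for one content fragment.
record = FragmentRecord(
    document_name="Equipment Manual",
    chapter_title="1.1 Transformer types",
    parent_titles=["1 Transformers"],
    content_fragment="Dry and oil-immersed units are covered.",
)

# Key/value view of the record, e.g. for saving to a retrieval engine.
as_mapping = asdict(record)
```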
  • Step 203: Based on the content contained in each content fragment, obtain the content correlation between the query statement and the corresponding content fragment, and, based on the attribute information of each content fragment, obtain the attribute correlation between the query statement and the corresponding content fragment.
  • When the attribute information of a content fragment includes multiple items such as the document name, the chapter title, and the parent titles at each level, the attribute correlation between the query statement and the content fragment can be obtained separately for each item of attribute information.
  • the query statement can be segmented into words, and the content correlation between the query statement and the content segment can be determined based on the number of times each segment appears in the content contained in the content segment. For example, the more times each segment appears in the content contained in a certain content segment, the higher the content correlation between the query statement and the content segment is determined; when each segment appears in the content contained in a certain content segment, The fewer the occurrences in , the lower the content relevance between the query statement and the content fragment.
  • the query statement can be segmented, and the attribute correlation between the query statement and the content segment can be determined based on the number of times each segment appears in the attribute information of a certain content segment. For example, the more times each segment appears in the document name of a certain content fragment, the higher the attribute correlation of the corresponding document name between the query statement and the content fragment is determined; when each segment appears in the document name of a certain content fragment, The fewer the occurrences in the document name, the lower the correlation of the attribute corresponding to the document name between the query statement and the content fragment.
  • for example, the query statement "transformer type" can be segmented to obtain "transformer" and "type". The content correlation between the query statement "transformer type" and each content fragment is then determined from the number of times "transformer" and "type" appear in the content contained in that content fragment; the attribute correlation corresponding to the document name is determined from the number of times "transformer" and "type" appear in the name of the document where the content fragment is located; and the attribute correlation corresponding to the chapter title is determined from the number of times "transformer" and "type" appear in the chapter title corresponding to the content fragment.
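  • the counting scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the whitespace tokenizer stands in for a real word segmenter, and the names `tokenize`, `term_count_correlation`, and the dictionary keys are hypothetical, not taken from the disclosure.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; a real system would
    # use a proper word segmenter, especially for Chinese text.
    return text.lower().split()


def term_count_correlation(query, text):
    """Total number of times the query's segmented words occur in text."""
    counts = Counter(tokenize(text))
    return sum(counts[term] for term in tokenize(query))


# Hypothetical content fragment with two items of attribute information.
fragment = {
    "content": "The transformer type is listed on the nameplate of each transformer",
    "document_name": "Transformer selection guide",
    "chapter_title": "Type and rating",
}

query = "transformer type"
content_corr = term_count_correlation(query, fragment["content"])
attr_corr = {
    name: term_count_correlation(query, fragment[name])
    for name in ("document_name", "chapter_title")
}
```

  • raw counts are used here only because the text describes them; normalizing by fragment length would be a separate design choice.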
  • Step 204 Based on the content correlation and attribute correlation between the query statement and each content fragment, obtain multiple candidate content fragments related to the query statement from multiple content fragments.
  • a second correlation threshold corresponding to the content correlation and a third correlation threshold corresponding to the attribute correlation can be set, so that, among the multiple content fragments, those whose content correlation is greater than the second correlation threshold and/or whose attribute correlation is greater than the third correlation threshold are determined as the multiple candidate content fragments related to the query statement.
  • the second correlation threshold and the third correlation threshold can be set as needed, and are not limited here.
  • the content correlation and the attribute correlation can also each be assigned a weight. The weight corresponding to the content correlation is called the first weight, and the weight corresponding to the attribute correlation is called the second weight.
  • corresponding weights can be set respectively corresponding to the attribute correlation of each attribute information, and the weights corresponding to each attribute correlation can be the same or different, and are not limited here.
  • the first weight and the second weight can be determined through experiments, experience, or other methods, and this disclosure does not limit this.
  • step 204 can be implemented in the following manner: based on each content correlation and the corresponding first weight, and each attribute correlation and the corresponding second weight, determine the second correlation between the query statement and the corresponding content fragment; then, based on the second correlation between the query statement and each content fragment, obtain multiple candidate content fragments related to the query statement from the multiple content fragments.
  • the weighted sum of the content correlation and the attribute correlation can be determined based on the content correlation and the corresponding first weight, and the attribute correlation and the corresponding second weight, and the weighted sum can be used as the second correlation between the query statement and the content fragment. Content fragments whose second correlation exceeds a set threshold may then be determined as candidate content fragments, or a preset number of content fragments with the highest second correlation (i.e., the first preset number of content fragments after all content fragments are sorted from high to low by second correlation) may be determined as candidate content fragments.
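  • a minimal sketch of the weighted sum and the two selection rules follows. The weight values, fragment identifiers, and function names are assumptions made for illustration; the disclosure leaves the weights to experiment or experience.

```python
def second_correlation(content_corr, attr_corrs, first_weight, second_weights):
    """Weighted sum of the content correlation and each attribute correlation."""
    weighted_attrs = sum(
        second_weights[name] * value for name, value in attr_corrs.items()
    )
    return first_weight * content_corr + weighted_attrs


def select_candidates(scored, top_n=None, threshold=None):
    """Keep fragments above a threshold and/or the top_n highest-scoring ones.

    scored is a list of (fragment_id, second_correlation) pairs.
    """
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [item for item in ranked if item[1] > threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [fragment_id for fragment_id, _ in ranked]


# Hypothetical weights: attribute matches count more than body matches.
FIRST_WEIGHT = 1.0
SECOND_WEIGHTS = {"document_name": 2.0, "chapter_title": 1.5}

raw = {
    "A": (3, {"document_name": 1, "chapter_title": 1}),
    "B": (5, {"document_name": 0, "chapter_title": 0}),
    "C": (0, {"document_name": 1, "chapter_title": 0}),
}
scored = [
    (fid, second_correlation(c, a, FIRST_WEIGHT, SECOND_WEIGHTS))
    for fid, (c, a) in raw.items()
]
top_two = select_candidates(scored, top_n=2)
above_five = select_candidates(scored, threshold=5.0)
```

  • in this toy data, fragment A outranks fragment B despite fewer body matches, because its document name and chapter title also match the query.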
  • steps 202-204 are implemented through a document retrieval device, or can also be implemented based on a retrieval engine.
  • the content of each content fragment, as well as the document name, chapter title, and parent titles at each level corresponding to each content fragment, can be saved in advance in the form of a structure. The fields in the structure can include a field named "content fragment", a field named "document name", a field named "Chapter Title", and a field named "Level Parent Title".
  • the query statement can be segmented into words, and all the segmented words of the query statement can be spliced with "document name", "chapter title", "parent title at all levels" and "content fragment" respectively to obtain the retrieval conditions. The retrieval conditions are input into the retrieval engine, which obtains the second correlation between the query statement and each content fragment in the manner shown in the above embodiment, obtains multiple candidate content fragments related to the query statement from the multiple content fragments, and returns them to the document retrieval device.
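  • one way to picture the stored structure and the spliced retrieval conditions is sketched below. The disclosure does not name a retrieval engine or a condition syntax, so the `FragmentRecord` class, the field identifiers, and the (field, word) pair representation are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FragmentRecord:
    """One stored content fragment with its attribute information."""
    content_fragment: str
    document_name: str
    chapter_title: str
    parent_titles: List[str] = field(default_factory=list)


# Fields that each segmented word of the query is paired with.
SEARCH_FIELDS = ("document_name", "chapter_title", "parent_titles", "content_fragment")


def build_retrieval_conditions(query_terms: List[str]) -> List[Tuple[str, str]]:
    """Pair every segmented word with every searchable field."""
    return [(name, term) for name in SEARCH_FIELDS for term in query_terms]


record = FragmentRecord(
    content_fragment="The transformer type is listed on the nameplate.",
    document_name="Transformer selection guide",
    chapter_title="Type and rating",
    parent_titles=["General requirements"],
)
conditions = build_retrieval_conditions(["transformer", "type"])
```

  • a real retrieval engine would translate each pair into its own per-field query clause and apply the per-field weights when scoring.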
  • Step 205 Use the correlation model in the field of natural language processing NLP to obtain the first correlation between the query statement and each candidate content segment.
  • Step 206 Based on each first correlation degree, obtain a target content segment that matches the query statement from each candidate content segment.
  • the document retrieval method obtains the query statement, obtains the content contained in each content fragment and the attribute information of each content fragment, obtains the content correlation between the query statement and the corresponding content fragment based on the content contained in each content fragment, and obtains the attribute correlation between the query statement and the corresponding content fragment based on the attribute information of each content fragment.
  • Figure 3 is a flow chart of a document retrieval method according to the third embodiment of the present disclosure. As shown in Figure 3, the method includes: steps 301-305.
  • Step 301 Obtain the query statement.
  • Step 302 Perform a query based on the query statement to obtain multiple candidate content fragments related to the query statement from multiple content fragments included in at least one document.
  • Step 303 For each candidate content segment, obtain the corresponding attribute information, and splice the attribute information and the candidate content segment to obtain the corresponding splicing result.
  • the attribute information of the candidate content fragment may include at least one of the name of the document in which the candidate content fragment is located, the chapter title corresponding to the candidate content fragment, and the parent titles of each level of the chapter title.
  • the corresponding document name, chapter title, and parent title of the chapter title can be obtained, and the document name, chapter title, and parent title of the chapter title can be spliced with the candidate content fragment, to obtain the corresponding splicing results.
  • Step 304 Enter the splicing results corresponding to the query statement and the candidate content fragments into the correlation model to obtain the first correlation between the query statement and the candidate content fragments.
  • the query statement and the splicing result corresponding to the candidate content fragment can be input into the correlation model, so that the correlation model determines the degree of correlation between the candidate content fragment and the query statement based on the query statement and on the content and attribute information of the candidate content fragment, and outputs the first correlation degree. The document retrieval device can then obtain the first correlation degree between the query statement and the candidate content fragment from the output of the correlation model.
  • alternatively, only the query statement and the candidate content fragment may be input into the correlation model to obtain the first correlation between the query statement and the candidate content fragment.
  • Step 305 Based on each first correlation degree, obtain a target content segment that matches the query statement from each candidate content segment.
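  • steps 303 to 305 can be sketched as below. The disclosure does not specify the NLP correlation model, so `toy_correlation_model` is a deliberately simple word-overlap placeholder, not the model itself; the splicing order, the separator, and all names are likewise assumptions.

```python
def splice(candidate, separator=" "):
    """Join attribute information and fragment content into one string."""
    parts = [
        candidate["document_name"],
        *candidate["parent_titles"],
        candidate["chapter_title"],
        candidate["content"],
    ]
    return separator.join(part for part in parts if part)


def toy_correlation_model(query, text):
    # Placeholder scorer: fraction of query words that occur in the text.
    # A real system would call a trained NLP relevance model here.
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def pick_target(query, candidates, model=toy_correlation_model):
    """Score each spliced candidate and return (first_correlation, candidate)."""
    scored = [(model(query, splice(c)), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])


candidates = [
    {
        "document_name": "Terminal manual",
        "parent_titles": ["Parameters"],
        "chapter_title": "Factory model",
        "content": "the master station may erase and write these values",
    },
    {
        "document_name": "Cable guide",
        "parent_titles": [],
        "chapter_title": "Installation",
        "content": "route the cable away from heat",
    },
]
best_score, target = pick_target("factory model parameters", candidates)
```

  • in this toy data, the first candidate's content alone shares no words with the query; it is selected only because the spliced titles carry the matching words, which is the motivation for splicing.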
  • an answer for answering the query statement can be obtained.
  • the document retrieval device can provide an interactive interface, so that the answer to the query statement can be displayed through the interactive interface.
  • the document retrieval device can also obtain the attribute information of the target content fragment and display the target content fragment, the corresponding attribute information, and the paragraphs or tables containing the target content fragment through the interactive interface, so that users can more clearly understand the source of the answer to the query statement.
  • the user can enter the question "Can the factory model parameters of the terminal support erasing and writing by the master station" in the interactive interface provided by the document retrieval device, and click the "Start Retrieval" button to start the document retrieval process. The document retrieval device can then obtain the query statement "Can the factory model parameters of the terminal support erasing and writing by the master station?". After the document retrieval device obtains the multiple candidate content fragments related to the query statement shown in Figure 5 in the manner shown in the above embodiment, it can obtain the first correlation degree between the query statement and each candidate content fragment (shown in Figure 5) and the attribute information of each candidate content fragment (i.e., each document number in the document number column, each document name in the document name column, each chapter serial number in the chapter serial number column, and each chapter title in the chapter title column in Figure 5). It then determines the candidate content fragment with the highest first correlation degree (that is, the candidate content fragment with serial number 1) as the target content fragment, and displays the target content fragment, the corresponding attribute information, and so on through the interactive interface shown in Figure 4.
  • the document retrieval method obtains a query statement and performs a query based on the query statement to obtain multiple candidate content fragments related to the query statement from multiple content fragments included in at least one document. For each candidate content fragment, the corresponding attribute information is obtained and spliced with the candidate content fragment to obtain the corresponding splicing result. The query statement and the splicing result corresponding to the candidate content fragment are input into the correlation model to obtain the first correlation degree between the query statement and the candidate content fragment, and based on each first correlation degree, the target content fragment matching the query statement is obtained from the candidate content fragments.
  • obtaining the target content fragment in this way makes it possible to precisely determine, from the document, the specific content that can answer the user's question, laying the foundation for accurately providing answers to user questions.
  • using the correlation model in the field of natural language processing (NLP), the first correlation degree between each candidate content fragment and the query statement is determined based on the query statement, the attribute information of each candidate content fragment, and the content contained in the candidate content fragment itself, further improving the accuracy of the determined target content fragment.
  • Figure 6 is a flow chart of a document retrieval method according to the fourth embodiment of the present disclosure. As shown in Figure 6, based on the above embodiment, the method may also include the following steps 601-603.
  • Step 601 Recognize each document based on the optical character recognition OCR technology in the field of artificial intelligence AI to obtain the recognition results of each document.
  • the document processing platform may provide an interactive interface, which may include an "upload document” button for uploading documents and a "start recognition” button for starting the document recognition process.
  • the document retrieval device can call the RPA robot to simulate mouse operations: it clicks the "Upload Document" button on the interactive interface to upload the documents to be processed to the document processing platform, and then clicks the "Start Recognition" button on the interactive interface to start the document recognition process on the document processing platform, thereby obtaining the document recognition results shown on the right side of Figure 7.
  • in Figure 7, "cl_num" represents the chapter serial number, "cl_name" represents the chapter title, "cl_rank" represents the row where the chapter is located, and "cl_content" represents the content contained in the chapter.
  • the document may include text and/or tables, and accordingly, the recognition results of the document may include text recognition results and/or table recognition results.
  • the preset aggregation method is a method of aggregating segmented segments to obtain content fragments, and can be determined according to the type of content contained in the document (such as text or table).
  • the document recognition results include text recognition results, and the text recognition results include chapter numbers and punctuation marks such as commas and periods.
  • the document retrieval device can perform a first segmentation of the text recognition results based on chapter numbers, and then perform a second segmentation on the results of the first segmentation based on punctuation marks (generally end-of-sentence punctuation marks such as periods), thereby dividing the text recognition results into a plurality of sentences. Each sentence is a segmented segment, and the segmented segments are arranged from front to back according to their positions in the document.
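  • the two-pass text segmentation can be sketched as follows. The layout of the recognized text is an assumption: each chapter is taken to start on its own line with a dotted number followed by a title, and any body line that happens to begin with a number would be mistaken for a heading. The regular expressions and function names are illustrative only.

```python
import re

# Assumed heading layout: "1.2 Title" at the start of a line.
CHAPTER_RE = re.compile(r"^(\d+(?:\.\d+)*)[ \t]+(.*)$", re.MULTILINE)
# A sentence is a run of non-terminator characters plus an optional terminator.
SENTENCE_RE = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def split_by_chapter(text):
    """First segmentation: (chapter_number, chapter_title, body) per chapter."""
    matches = list(CHAPTER_RE.finditer(text))
    chapters = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        chapters.append((match.group(1), match.group(2).strip(), body))
    return chapters


def split_sentences(body):
    """Second segmentation: split a chapter body at end-of-sentence marks."""
    return [s.strip() for s in SENTENCE_RE.findall(body) if s.strip()]


def segment_text(text):
    """Return segmented segments in document order, tagged with their chapter."""
    return [
        (number, title, sentence)
        for number, title, body in split_by_chapter(text)
        for sentence in split_sentences(body)
    ]


recognized = (
    "1 Scope\n"
    "This manual covers terminals. It applies to all models.\n"
    "1.1 Limits\n"
    "Do not exceed the rating.\n"
)
segments = segment_text(recognized)
```

  • keeping the chapter number and title with each sentence preserves the attribute information needed later for the attribute correlation.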
  • the recognition results of the document include table recognition results, and the table recognition results include delimiter symbols used to distinguish different cells and the row numbers of the rows where the cells are located.
  • the document retrieval device can perform a first segmentation of the table recognition result by row number, and then perform a second segmentation of the first segmentation result according to the delimiter symbols, thereby dividing the table recognition result into multiple cell contents. Each cell content is a segmented segment, and the segmented segments in each row are arranged from front to back according to their positions in the document. The segmented segments in each row can then be spliced into one content fragment.
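  • the table case can be sketched in the same spirit. The input shape (a list of row-number and delimited-text pairs), the "|" delimiter, and the function name are assumptions for illustration, since the disclosure does not fix the format of the table recognition result.

```python
def table_to_fragments(recognized_rows, delimiter="|", joiner=" "):
    """Group cells by row number, then splice each row into one content fragment.

    recognized_rows is a list of (row_number, delimited_text) pairs.
    """
    rows = {}
    for row_number, raw in recognized_rows:
        # Second segmentation: split the row text into cell contents.
        cells = [cell.strip() for cell in raw.split(delimiter)]
        rows.setdefault(row_number, []).extend(cell for cell in cells if cell)
    # Aggregate: one fragment per row, rows ordered by row number.
    return [joiner.join(rows[number]) for number in sorted(rows)]


recognized_table = [
    (2, "S11 | 10 kV | oil-immersed"),
    (1, "Model | Rated voltage | Cooling"),
]
fragments = table_to_fragments(recognized_table)
```

  • splicing a whole row into one fragment keeps related cell values together, so a query that mentions several of them can match a single fragment.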
  • Step 603 Save each content segment in correspondence with the corresponding content field.
  • the name of the content field can be set to "content fragment", and each content fragment can be saved in correspondence with the corresponding content field, so that when the content contained in a content fragment is needed later, it can be obtained through the content field.
  • each content fragment and the document name, chapter title, and parent titles at each level corresponding to each content fragment can also be saved in the form of a structure. The fields in the structure can include a field named "Content Fragment", a field named "Document Name", a field named "Chapter Title", and a field named "Level Parent Title".
  • the document retrieval method recognizes each document based on optical character recognition (OCR) technology to obtain the recognition result of each document, performs structured processing on each recognition result to obtain the multiple content fragments included in each document, and saves each content fragment in correspondence with the corresponding content field. Processing the documents to be retrieved into multiple content fragments in this way lays the foundation for accurately determining, from the documents, the specific content that can answer the user's question and for accurately providing answers to user questions. In addition, by calling the RPA robot to upload each document to the document processing platform, the document processing platform can be used to recognize each document based on OCR technology in the field of artificial intelligence; the recognition results returned by the document processing platform can then be obtained and structured to obtain the multiple content fragments included in each document. Combining RPA and AI in this way implements IA for obtaining the content fragments in a document, further reducing the labor cost required for document retrieval.
  • Figure 10 is a schematic structural diagram of a document retrieval device according to the fifth embodiment of the present disclosure.
  • the document retrieval device 1000 includes: a first acquisition module 1001, a query module 1002, a second acquisition module 1003 and a third acquisition module 1004.
  • the first acquisition module 1001 is used to acquire the query statement;
  • the query module 1002 is configured to perform a query based on the query statement to obtain multiple candidate content fragments related to the query statement from multiple content fragments included in at least one document;
  • the second acquisition module 1003 is used to obtain the first correlation between the query statement and each candidate content fragment using a correlation model in the field of natural language processing NLP;
  • the third acquisition module 1004 is configured to acquire the target content segment that matches the query statement from each candidate content segment based on each first correlation degree.
  • the document retrieval device 1000 in the embodiment of the present disclosure can execute the document retrieval method provided in the above embodiment.
  • the document retrieval device 1000 can be implemented by software and/or hardware.
  • the document retrieval device can be an electronic device, or can be configured in an electronic device, to realize automatic retrieval of documents, thereby reducing the labor cost and time cost required for document retrieval and, based on AI technology, accurately determining from documents the specific content that can answer user questions.
  • the electronic device may include but is not limited to a terminal device, a server, etc., and the embodiment of the present disclosure does not specifically limit the electronic device.
  • query module 1002 includes:
  • the first acquisition unit is used to acquire the content contained in each content segment and the attribute information of each content segment;
  • the second acquisition unit is used to obtain the content correlation between the query statement and the corresponding content fragment based on the content contained in each content fragment, and to obtain the attribute correlation between the query statement and the corresponding content fragment based on the attribute information of each content fragment;
  • the third acquisition unit is used to acquire multiple candidate content segments related to the query statement from multiple content segments based on the content correlation and attribute correlation between the query statement and each content segment.
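  • the content correlation has a corresponding first weight, and the attribute correlation has a corresponding second weight; the third acquisition unit is used for:
  • determining the second correlation between the query statement and the corresponding content fragment based on each content correlation and the corresponding first weight, and each attribute correlation and the corresponding second weight;
  • obtaining, based on the second correlation between the query statement and each content fragment, multiple candidate content fragments related to the query statement from the multiple content fragments.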
  • the second acquisition module 1003 includes:
  • the fourth acquisition unit is configured to input the query statement and the candidate content fragment into the correlation model for each candidate content fragment, so as to obtain the first correlation between the query statement and the candidate content fragment.
  • the second acquisition module 1003 includes:
  • the fifth acquisition unit is used to obtain the corresponding attribute information for each candidate content segment, and splice the attribute information with the candidate content segment to obtain the corresponding splicing result;
  • the sixth acquisition unit is used to input the splicing results corresponding to the query statement and the candidate content fragment into the correlation model to obtain the first correlation between the query statement and the candidate content fragment.
  • the document retrieval device 1000 also includes: a recognition module, used to recognize each document based on the optical character recognition OCR technology in the artificial intelligence field to obtain the recognition results of each document;
  • the processing module is used to perform structured processing on each recognition result to obtain multiple content fragments included in each document;
  • the saving module is used to save each content fragment correspondingly to the corresponding content field.
  • the identification module includes:
  • the upload unit is used to call the RPA robot to upload each document to the document processing platform, so as to use the document processing platform to identify each document based on optical character recognition OCR technology;
  • the seventh acquisition unit is used to acquire the recognition results of each document returned by the document processing platform.
  • the recognition results include text recognition results and/or table recognition results; the processing module includes:
  • a segmentation unit, used to segment the text recognition results and/or table recognition results according to a preset segmentation method to obtain multiple segmented segments;
  • the aggregation unit is used to aggregate multiple segmented segments according to a preset aggregation method to obtain multiple content segments, wherein each content segment is obtained by aggregating at least one segmented segment.
  • the attribute information includes at least one of a document name, a chapter title, and parent titles at various levels of the chapter title.
  • the document retrieval device of the embodiment of the present disclosure acquires a query statement and performs a query based on the query statement to obtain multiple candidate content fragments related to the query statement from multiple content fragments included in at least one document, using natural language Process the correlation model in the NLP field, obtain the first correlation between the query statement and each candidate content fragment, and obtain the target content fragment matching the query statement from each candidate content fragment based on each first correlation.
  • embodiments of the present disclosure also provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the document retrieval method as described in any of the foregoing method embodiments is implemented.
  • embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the document retrieval method as described in any of the foregoing method embodiments is implemented.
  • the computer-readable storage medium is a non-transitory computer-readable storage medium.
  • embodiments of the present disclosure also provide a computer program product. When instructions in the computer program product are executed by a processor, the document retrieval method as described in any of the foregoing method embodiments is implemented.
  • an embodiment of the present disclosure also proposes a computer program. The computer program includes computer program code which, when run on a computer, causes the computer to execute the document retrieval method as described in any of the foregoing method embodiments.
  • FIG. 11 illustrates a block diagram of an exemplary electronic device suitable for implementing embodiments of the present disclosure.
  • the electronic device 11 shown in FIG. 11 is only an example and should not bring any limitations to the functions and scope of use of the embodiments of the present disclosure.
  • electronic device 11 is embodied in the form of a general computing device.
  • the components of electronic device 11 may include, but are not limited to: one or more processors or processing units 16, system memory 28, and a bus 18 connecting different system components (including memory 28 and processing unit 16).
  • Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
  • for example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
  • Electronic device 11 typically includes a variety of computer system readable media. These media may be any available media that can be accessed by electronic device 11, including volatile and non-volatile media, removable and non-removable media.
  • the memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32.
  • Electronic device 11 may further include other removable/non-removable, volatile/non-volatile computer system storage media.
  • storage system 34 may be used to read and write to non-removable, non-volatile magnetic media (not shown in Figure 11, commonly referred to as a "hard drive”).
  • a disk drive for reading and writing a removable non-volatile disk (e.g., a "floppy disk") and an optical disk drive for reading and writing a removable non-volatile optical disk (e.g., a Compact Disc Read Only Memory (CD-ROM) or a Digital Video Disc Read Only Memory (DVD-ROM)) may also be provided.
  • Memory 28 may include at least one program product having a set (eg, at least one) of program modules configured to perform the functions of embodiments of the present disclosure.
  • a program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in memory 28 , each of these examples or some combination may include the implementation of a network environment.
  • Program modules 42 generally perform functions and/or methods in the embodiments described in this disclosure.
  • Electronic device 11 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with electronic device 11, and/or with any device (e.g., network card, modem, etc.) that enables the electronic device 11 to communicate with one or more other computing devices. This communication may occur through input/output (I/O) interface 22.
  • the electronic device 11 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 20.
  • the network adapter 20 communicates with other modules of the electronic device 11 through the bus 18.
  • other hardware and/or software modules may be used in conjunction with electronic device 11, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
  • the processing unit 16 executes programs stored in the memory 28 to perform various functional applications and data processing, such as implementing the methods mentioned in the previous embodiments.
  • references to the terms "one embodiment," "some embodiments," "an example," "specific examples," or "some examples" or the like mean that specific features, structures, materials, or characteristics described in connection with the embodiment or example are included in at least one embodiment or example of the present disclosure. In this specification, the schematic expressions of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine different embodiments or examples, and features of different embodiments or examples, described in this specification unless they are inconsistent with each other.
  • first and second are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of indicated technical features. Therefore, features defined as “first” and “second” may explicitly or implicitly include at least one of these features.
  • “plurality” means at least two, such as two, three, etc., unless otherwise expressly and specifically limited.
  • a "computer-readable medium” may be any device that can contain, store, communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a non-exhaustive list of computer-readable media includes the following: an electrical connection with one or more wires (electronic device), a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), a fiber optic device, and a portable compact disc read-only memory (CDROM).
  • the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the paper or other medium may be optically scanned, for example, and then edited, interpreted, or otherwise processed in a suitable manner as necessary to obtain the program electronically, which is then stored in computer memory.
  • various parts of the present disclosure may be implemented in hardware, software, firmware, or combinations thereof.
  • various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system.
  • for example, if implemented in hardware, as in another embodiment, it can be implemented by any one or a combination of the following technologies known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGA), field programmable gate arrays (FPGA), etc.
  • the program can be stored in a computer-readable storage medium.
  • each functional unit in various embodiments of the present disclosure may be integrated into one processing module, each unit may exist physically alone, or two or more units may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or software function modules. If the integrated module is implemented in the form of a software function module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
  • the storage media mentioned above can be read-only memory, magnetic disks or optical disks, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a document retrieval method and apparatus, and an electronic device. The method comprises: performing a query on the basis of a query sentence, so as to acquire, from among a plurality of content fragments comprised in at least one document, a plurality of candidate content fragments related to the query sentence; acquiring a first correlation between the query sentence and each candidate content fragment by using a correlation model in the field of NLP; and on the basis of each first correlation, acquiring from among the candidate content fragments a target content fragment that matches the query sentence. Automatic document retrieval is realized, and the labor cost and time cost required for document retrieval are reduced. Moreover, a target content fragment is acquired according to a correlation between each content fragment in a document, which is acquired on the basis of AI technology, and a query sentence, such that the specific content that can answer a user's question can be accurately determined from the document, thus laying a foundation for accurately providing an answer to the user's question. In the present disclosure, RPA and AI can also be combined to realize the acquisition of content fragments in a document in IA, further reducing the labor cost.

Description

Document retrieval method, apparatus, and electronic device
Cross-reference to related applications
This application is based on, and claims priority to, Chinese patent application No. 2022106370191, filed on June 7, 2022, the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure relates to the technical fields of robotic process automation and artificial intelligence, and specifically to a document retrieval method, apparatus, and electronic device.
Background
Robotic process automation (RPA) uses dedicated "robot software" to simulate human operations on a computer and automatically execute process tasks according to rules.
Artificial intelligence (AI) is a technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence.
Intelligent automation (IA) is a general term for a series of technologies ranging from robotic process automation to artificial intelligence. It combines RPA with AI technologies such as optical character recognition (OCR), intelligent character recognition (ICR), process mining, deep learning (DL), machine learning (ML), natural language processing (NLP), automatic speech recognition (ASR), text-to-speech (TTS) synthesis, and computer vision (CV) to create end-to-end business processes that can think, learn, and adapt. It covers the entire journey from process discovery and process automation to automatic, continuous data collection, understanding the meaning of the data, and using the data to manage and optimize business processes.
At present, in many business scenarios, such as a question answering system for the electric power industry, a question raised by a user must be answered by retrieving, from a large number of documents, the specific content that can answer it, such as a particular sentence or the contents of certain cells in a table, so that an accurate answer can be given based on that content. In the related art, after a user's question is obtained, either a large number of documents are searched manually to find the specific content that can answer the question, or conventional document-level retrieval is used to find documents that match the question through string matching. Manual document retrieval wastes considerable labor and time, while conventional document-level retrieval can only locate documents that can answer the user's question and cannot precisely locate the specific content within those documents that answers it. A document retrieval method is therefore needed that can precisely retrieve the specific content in a document that answers the user's question at low labor and time cost.
Summary
The present disclosure provides a document retrieval method, apparatus, and electronic device to solve the technical problems that existing document retrieval methods incur high labor and time costs and cannot precisely retrieve the specific content in a document that answers a user's question.
An embodiment of a first aspect of the present disclosure provides a document retrieval method. The method includes: obtaining a query statement; performing a query based on the query statement to obtain, from a plurality of content fragments included in at least one document, a plurality of candidate content fragments related to the query statement; obtaining a first correlation between the query statement and each candidate content fragment by using a correlation model from the field of natural language processing (NLP); and obtaining, based on the first correlations, a target content fragment that matches the query statement from the candidate content fragments.
In some embodiments, performing a query based on the query statement to obtain, from a plurality of content fragments included in at least one document, a plurality of candidate content fragments related to the query statement includes: obtaining the content contained in each content fragment and the attribute information of each content fragment; obtaining, based on the content contained in each content fragment, a content correlation between the query statement and the corresponding content fragment, and obtaining, based on the attribute information of each content fragment, an attribute correlation between the query statement and the corresponding content fragment; and obtaining, based on the content correlation and the attribute correlation between the query statement and each content fragment, the plurality of candidate content fragments related to the query statement from the plurality of content fragments.
In some embodiments, the content correlation has a corresponding first weight and the attribute correlation has a corresponding second weight. Obtaining, based on the content correlation and the attribute correlation between the query statement and each content fragment, the plurality of candidate content fragments related to the query statement from the plurality of content fragments includes: determining a second correlation between the query statement and the corresponding content fragment based on each content correlation and the corresponding first weight, and each attribute correlation and the corresponding second weight; and obtaining, based on the second correlation between the query statement and each content fragment, the plurality of candidate content fragments related to the query statement from the plurality of content fragments.
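The weighting described above can be illustrated with a short Python sketch. It assumes that the second correlation is a linear weighted sum of the content correlation and the attribute correlation and that the candidates are the top-k fragments by that sum; the disclosure only says the second correlation is determined "based on" the correlations and their weights, so the formula, the default weights, and the names `ScoredFragment`, `second_correlation`, and `select_candidates` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredFragment:
    fragment_id: str
    content_correlation: float    # correlation of the fragment's content with the query
    attribute_correlation: float  # correlation of the fragment's attribute info with the query


def second_correlation(frag: ScoredFragment, w_content: float, w_attr: float) -> float:
    """Assumed form: weighted sum of the two correlations."""
    return w_content * frag.content_correlation + w_attr * frag.attribute_correlation


def select_candidates(fragments: List[ScoredFragment],
                      w_content: float = 0.7,
                      w_attr: float = 0.3,
                      top_k: int = 2) -> List[ScoredFragment]:
    """Rank fragments by second correlation and keep the top_k as candidates."""
    ranked = sorted(fragments,
                    key=lambda f: second_correlation(f, w_content, w_attr),
                    reverse=True)
    return ranked[:top_k]
```

A threshold on the second correlation could be used instead of a fixed `top_k`; the disclosure does not prescribe either choice.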
In some embodiments, obtaining the first correlation between the query statement and each candidate content fragment by using a correlation model from the field of natural language processing (NLP) includes: for each candidate content fragment, inputting the query statement and the candidate content fragment into the correlation model to obtain the first correlation between the query statement and the candidate content fragment.
In some embodiments, obtaining the first correlation between the query statement and each candidate content fragment by using a correlation model from the field of natural language processing (NLP) includes: for each candidate content fragment, obtaining the corresponding attribute information and splicing the attribute information with the candidate content fragment to obtain a corresponding splicing result; and inputting the query statement and the splicing result corresponding to the candidate content fragment into the correlation model to obtain the first correlation between the query statement and the candidate content fragment.
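A minimal Python sketch of the splicing step follows. It assumes the attribute information is prepended to the fragment in the order document name, parent titles, chapter title, joined by a visible separator; the ordering, the separator, and the function names `splice_attributes` and `build_model_inputs` are assumptions made for illustration, since the disclosure does not specify how the splicing is performed.

```python
from typing import Dict, List, Sequence, Tuple


def splice_attributes(content: str,
                      document_name: str = "",
                      parent_titles: Sequence[str] = (),
                      chapter_title: str = "",
                      sep: str = " > ") -> str:
    """Join the attribute information and the fragment content into one string.

    Empty attributes are skipped so that no dangling separators appear.
    """
    parts = [document_name, *parent_titles, chapter_title, content]
    return sep.join(p for p in parts if p)


def build_model_inputs(query: str, candidates: List[Dict]) -> List[Tuple[str, str]]:
    """Pair the query with each spliced candidate, ready for a correlation model."""
    return [(query, splice_attributes(**c)) for c in candidates]
```

Each `(query, spliced_text)` pair would then be scored by the correlation model to obtain the first correlation.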
In some embodiments, before performing the query based on the query statement to obtain, from the plurality of content fragments included in at least one document, the plurality of candidate content fragments related to the query statement, the method further includes: recognizing each document based on optical character recognition (OCR) technology from the field of artificial intelligence (AI) to obtain a recognition result for each document; structuring each recognition result to obtain the plurality of content fragments included in each document; and storing each content fragment in association with a corresponding content field.
In some embodiments, recognizing each document based on OCR technology from the field of AI to obtain a recognition result for each document includes: invoking an RPA robot to upload each document to a document processing platform, so that the document processing platform recognizes each document based on OCR technology; and obtaining the recognition result for each document returned by the document processing platform.
In some embodiments, the recognition result includes a text recognition result and/or a table recognition result. Structuring each recognition result to obtain the plurality of content fragments included in each document includes: segmenting the text recognition result and/or the table recognition result according to a preset segmentation method to obtain a plurality of segments; and aggregating the plurality of segments according to a preset aggregation method to obtain a plurality of content fragments, where each content fragment is obtained by aggregating at least one segment.
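The segmentation and aggregation of a text recognition result can be sketched in Python as follows. The sketch assumes that the preset segmentation method splits after sentence-final punctuation and that the preset aggregation method greedily packs consecutive segments up to a character limit; both choices, and the names `split_sentences` and `aggregate`, are hypothetical. Table recognition results are not handled here.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split text after sentence-final punctuation (Chinese or Western).

    Naive by design: a period inside a number such as "3.5" also splits.
    """
    pieces = re.split(r"(?<=[。！？.!?])\s*", text)
    return [p for p in pieces if p]


def aggregate(segments: List[str], max_chars: int, joiner: str = " ") -> List[str]:
    """Greedily merge consecutive segments into fragments of at most max_chars.

    A single segment longer than max_chars is kept as its own fragment.
    """
    fragments: List[str] = []
    current = ""
    for seg in segments:
        candidate = seg if not current else current + joiner + seg
        if current and len(candidate) > max_chars:
            fragments.append(current)  # close the current fragment
            current = seg              # start a new one with this segment
        else:
            current = candidate
    if current:
        fragments.append(current)
    return fragments
```

For Chinese text, passing `joiner=""` avoids inserting spaces between sentences.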
In some embodiments, the attribute information includes at least one of a document name, a chapter title, and the parent titles at each level above the chapter title.
An embodiment of a second aspect of the present disclosure provides a document retrieval apparatus. The apparatus includes: a first acquisition module configured to obtain a query statement; a query module configured to perform a query based on the query statement to obtain, from a plurality of content fragments included in at least one document, a plurality of candidate content fragments related to the query statement; a second acquisition module configured to obtain a first correlation between the query statement and each candidate content fragment by using a correlation model from the field of natural language processing (NLP); and a third acquisition module configured to obtain, based on the first correlations, a target content fragment that matches the query statement from the candidate content fragments.
In some embodiments, the query module includes: a first acquisition unit configured to obtain the content contained in each content fragment and the attribute information of each content fragment; a second acquisition unit configured to obtain, based on the content contained in each content fragment, a content correlation between the query statement and the corresponding content fragment, and to obtain, based on the attribute information of each content fragment, an attribute correlation between the query statement and the corresponding content fragment; and a third acquisition unit configured to obtain, based on the content correlation and the attribute correlation between the query statement and each content fragment, the plurality of candidate content fragments related to the query statement from the plurality of content fragments.
In some embodiments, the content correlation has a corresponding first weight and the attribute correlation has a corresponding second weight. The third acquisition unit is configured to: determine a second correlation between the query statement and the corresponding content fragment based on each content correlation and the corresponding first weight, and each attribute correlation and the corresponding second weight; and obtain, based on the second correlation between the query statement and each content fragment, the plurality of candidate content fragments related to the query statement from the plurality of content fragments.
In some embodiments, the second acquisition module includes: a fourth acquisition unit configured to, for each candidate content fragment, input the query statement and the candidate content fragment into the correlation model to obtain the first correlation between the query statement and the candidate content fragment.
An embodiment of a third aspect of the present disclosure provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the method described in the embodiment of the first aspect of the present disclosure is implemented.
An embodiment of a fourth aspect of the present disclosure provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the method described in the embodiment of the first aspect of the present disclosure is implemented.
An embodiment of a fifth aspect of the present disclosure provides a computer program product including a computer program. When executed by a processor, the computer program implements the method described in the embodiment of the first aspect of the present disclosure.
An embodiment of a sixth aspect of the present disclosure provides a computer program including computer program code. When the computer program code runs on a computer, it causes the computer to execute the method described in the embodiment of the first aspect of the present disclosure.
The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects:
Document retrieval is performed automatically, reducing the labor and time costs it requires. In addition, the target content fragment that matches the query statement is obtained according to the degree of correlation, obtained using AI technology, between each content fragment in the documents and the query statement. The specific content that can answer the user's question is thereby precisely determined from the documents, laying a foundation for accurately answering the user's question.
Additional aspects and advantages of the present disclosure will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the present disclosure.
Brief description of the drawings
In the drawings, unless otherwise specified, the same reference numerals denote the same or similar parts or elements throughout the several figures. The drawings are not necessarily drawn to scale. It should be understood that the drawings depict only some embodiments of the present disclosure and should not be regarded as limiting its scope.
Figure 1 is a schematic flowchart of a document retrieval method according to a first embodiment of the present disclosure;
Figure 2 is a schematic flowchart of a document retrieval method according to a second embodiment of the present disclosure;
Figure 3 is a schematic flowchart of a document retrieval method according to a third embodiment of the present disclosure;
Figure 4 is an example diagram of an interactive interface provided by a document retrieval apparatus according to the third embodiment of the present disclosure;
Figure 5 is an example diagram of candidate content fragments and corresponding attribute information according to the third embodiment of the present disclosure;
Figure 6 is a schematic flowchart of a document retrieval method according to a fourth embodiment of the present disclosure;
Figure 7 is an example diagram of an interactive interface of a document processing platform and a recognition result of a document according to the fourth embodiment of the present disclosure;
Figure 8 is an example diagram of a text recognition result and corresponding content fragments according to the fourth embodiment of the present disclosure;
Figure 9 is an example diagram of a table recognition result and corresponding content fragments according to the fourth embodiment of the present disclosure;
Figure 10 is a schematic structural diagram of a document retrieval apparatus according to a fifth embodiment of the present disclosure;
Figure 11 is a block diagram of an electronic device used to implement the document retrieval method of an embodiment of the present disclosure.
Detailed description
Embodiments of the present disclosure are described in detail below, and examples of the embodiments are illustrated in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present disclosure, and are not to be construed as limiting it.
These and other aspects of the embodiments of the present disclosure will become clear with reference to the following description and the accompanying drawings. The description and drawings specifically disclose some particular implementations of the embodiments to show some of the ways in which the principles of the embodiments may be practiced, but it should be understood that the scope of the embodiments is not limited thereby. On the contrary, the embodiments of the present disclosure include all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
It should be noted that, in the technical solutions of the present disclosure, the acquisition, storage, and use of any user personal information involved comply with relevant laws and regulations and do not violate public order and good morals.
Embodiments of the present disclosure provide a document retrieval method that performs document retrieval automatically in place of manual work, reducing the labor and time costs required. Specifically, after the user's query statement is obtained, a query can be performed based on the query statement to obtain, from a plurality of content fragments included in at least one document, a plurality of candidate content fragments related to the query statement. A correlation model from the field of natural language processing (NLP) is then used to obtain a first correlation between the query statement and each candidate content fragment, and, based on the first correlations, a target content fragment that matches the query statement is obtained from the candidate content fragments. Because the target content fragment is obtained according to the degree of correlation, obtained using AI technology, between each content fragment in the documents and the query statement, the specific content that can answer the user's question is precisely determined from the documents, laying a foundation for accurately answering the user's question.
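The two-stage flow described above (a coarse query for candidates, followed by scoring with a correlation model) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the coarse stage uses a toy token-overlap score rather than any particular search engine, and the correlation model is passed in as a plain callable standing in for a fine-tuned NLP model. The names `token_overlap`, `retrieve`, and `recall_k` are hypothetical.

```python
from typing import Callable, List, Optional


def token_overlap(query: str, text: str) -> float:
    """Fraction of query tokens that also occur in the text (toy recall score)."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & text_tokens) / len(query_tokens)


def retrieve(query: str,
             fragments: List[str],
             correlation_model: Callable[[str, str], float],
             recall_k: int = 3) -> Optional[str]:
    """Return the target content fragment for the query, or None if there are no fragments."""
    if not fragments:
        return None
    # Stage 1: coarse query to obtain candidate content fragments.
    candidates = sorted(fragments,
                        key=lambda f: token_overlap(query, f),
                        reverse=True)[:recall_k]
    # Stage 2: the correlation model yields the first correlation for each
    # candidate; the highest-scoring candidate is taken as the target fragment.
    return max(candidates, key=lambda f: correlation_model(query, f))
```

Restricting the more expensive model to a small candidate set keeps the second stage cheap even when the document collection is large.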
To describe the embodiments of the present disclosure clearly, the technical terms used in the embodiments are explained first.
In the description of the present disclosure, the term "plurality" means two or more.
In the description of the present disclosure, an "RPA robot" is a software robot that can combine AI technology and RPA technology to process business tasks automatically. An RPA robot has two characteristics, being a "connector" and being "non-intrusive": by simulating human operations, it extracts, integrates, and connects data from different systems in a non-intrusive manner without modifying the information systems.
In the description of the present disclosure, a "query statement" is a statement entered by the user for querying, that is, the question the user wants to ask. It may be a statement in text form or in voice form, which is not limited by the present disclosure.
In the description of the present disclosure, a "document" is a document in electronic form from which the specific content that can answer the user's question is retrieved. It may be a PDF (Portable Document Format) document obtained by scanning a paper document, or a document edited on a smart device such as a computer or mobile phone, which is not limited by the present disclosure.
In the description of the present disclosure, a "content fragment" is a fragment made up of part of the content of a document. A content fragment may be one or several sentences, a paragraph in the document, a table in the document, part of the content of a table, and so on, which is not limited by the present disclosure. In some embodiments of the present disclosure, the number of characters in a content fragment can be preset, so that all documents to be retrieved are processed to divide their content into a plurality of content fragments, each containing no more than the preset number of characters.
In the description of the present disclosure, "candidate content fragments" are the content fragments related to the query statement that are obtained from all content fragments included in all documents. A "target content fragment" is a content fragment obtained from the candidate content fragments that matches the query statement, that is, the specific content that can answer the user's question.
In the description of the present disclosure, "attribute information" is information that represents the attributes of a content fragment, such as the name of the document containing the content fragment, the chapter title corresponding to the content fragment, and the parent titles at each level above the chapter title.
In the description of the present disclosure, a "correlation" indicates the degree to which two items are related. The "first correlation" is the correlation between the query statement and a candidate content fragment as determined by the correlation model, and indicates the degree to which the query statement and the candidate content fragment are related.
在本公开的描述中,“相关度模型”,为用于确定相关程度的任意机器模型,比如Bert(Bidirectional Encoder Representations from Transformers,一种基于双向编码器表示模型)等神经网络模型。其中,相关度模型可以通过对NLP领域的预训练模型进行微调得到。In the description of this disclosure, "correlation model" is any machine model used to determine the degree of correlation, such as Bert (Bidirectional Encoder Representations from Transformers, a bidirectional encoder representation model) and other neural network models. Among them, the correlation model can be obtained by fine-tuning the pre-trained model in the NLP field.
在本公开的描述中,“内容相关度”,为基于内容片段所包含的内容确定的查询语句与内容片段之间的相关度,用于表示内容片段所包含的内容与查询语句之间的相关程度的大小。In the description of the present disclosure, "content relevance" is the correlation between the query statement and the content fragment determined based on the content contained in the content fragment, and is used to represent the correlation between the content contained in the content fragment and the query statement. The size of the degree.
在本公开的描述中,“属性相关度”,为基于内容片段对应的属性信息确定的查询语句与内容片段之间的相关度,用于表示内容片段对应的属性信息与查询语句之间的相关程度的大小。In the description of this disclosure, "attribute correlation" is the correlation between the query statement and the content fragment determined based on the attribute information corresponding to the content fragment, and is used to represent the correlation between the attribute information corresponding to the content fragment and the query statement. The size of the degree.
在本公开的描述中,“第二相关度”为基于内容相关度与属性相关度确定的查询语句与内容片段之间的相关度,用于综合表示内容片段所包含的内容以及对应的属性信息,与查询语句之间的相关程度。In the description of this disclosure, the "second correlation degree" is the correlation degree between the query statement and the content fragment determined based on the content correlation degree and attribute correlation degree, and is used to comprehensively represent the content contained in the content fragment and the corresponding attribute information. , the degree of correlation with the query statement.
在本公开的描述中,“分割片段”,指对文档进行分割得到的内容所组成的片段,比如,按照用于句末的标点符号,将文档分割成多个句子后,每个句子即为一个分割片段。本公开实施例中的每个内容片段,可以包括一个或多个分割片段。In the description of this disclosure, "segmented fragments" refer to fragments composed of content obtained by dividing the document. For example, after the document is divided into multiple sentences according to the punctuation marks used at the end of the sentence, each sentence is A split fragment. Each content segment in the embodiment of the present disclosure may include one or more segmented segments.
In the description of the present disclosure, a "document processing platform" is an intelligent automation platform for processing documents intelligently. Intelligent Document Processing (IDP) is one of the core capabilities of an intelligent automation platform. IDP is a new generation of automation technology that builds on AI technologies such as Optical Character Recognition (OCR), Computer Vision (CV), Natural Language Processing (NLP) and Knowledge Graph (KG) to recognize, classify, extract elements from, verify, compare and correct all kinds of documents, helping enterprises make document processing intelligent and automated.
In the description of the present disclosure, a "content field" is a field consisting of a single character or several consecutive characters. The content field can be understood as the attribute key, and the content contained in the content fragment as the attribute value. A content field and its corresponding content fragment together form one piece of structured data. In addition, the content field and the fields corresponding to the attribute information of the content fragment, such as a field named "document name", a field named "chapter title" and a field named "parent titles at all levels", may form a structure.
The document retrieval method, apparatus, electronic device and storage medium according to the embodiments of the present disclosure are described below with reference to the accompanying drawings.
First, the document retrieval method in the embodiments of the present disclosure is described with reference to the accompanying drawings.
Figure 1 is a flowchart of the document retrieval method according to the first embodiment of the present disclosure. As shown in Figure 1, the method may include steps 101 to 104.
Step 101: obtain a query statement.
It should be noted that the document retrieval method of the embodiments of the present disclosure may be executed by a document retrieval apparatus. The apparatus may be implemented in software and/or hardware. It may be an electronic device, or it may be configured in an electronic device, so that document retrieval is performed automatically. This reduces the labor and time cost of document retrieval and allows the specific content that answers a user's question to be determined precisely from documents using AI technology. The electronic device may include, but is not limited to, a terminal device or a server; the embodiments of the present disclosure do not specifically limit the electronic device.
In the embodiments of the present disclosure, the document retrieval apparatus may provide an interactive interface in which the user enters a query statement, and the document retrieval apparatus obtains the query statement accordingly.
Step 102: perform a query based on the query statement to obtain, from multiple content fragments included in at least one document, multiple candidate content fragments related to the query statement.
In the embodiments of the present disclosure, the large number of documents to be retrieved (that is, the documents from which specific content answering the user's question is to be retrieved) may be processed in advance to obtain multiple content fragments, which are stored in a retrieval engine. After the query statement is obtained, the retrieval engine performs a query based on it, obtains multiple candidate content fragments related to the query statement from the stored content fragments, and returns them to the document retrieval apparatus, which thereby obtains the candidate content fragments.
The retrieval engine may be any retrieval engine with a retrieval function, which is not limited in the present disclosure. In addition, the retrieval engine may be configured in the document retrieval apparatus, or it may be configured separately and connected to the document retrieval apparatus through an interface, which is not limited in the present disclosure.
In the embodiments of the present disclosure, the number of candidate content fragments may be preset. The retrieval engine obtains the correlation between the query statement and each content fragment, sorts the content fragments in descending order of correlation, and determines the preset number of top-ranked content fragments as the candidate content fragments.
In the embodiments of the present disclosure, a first correlation threshold may alternatively be preset. The retrieval engine obtains the correlation between the query statement and each content fragment and determines the content fragments whose correlation is greater than the first correlation threshold as the candidate content fragments. The first correlation threshold may be set as needed, which is not limited in the present disclosure.
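The two selection rules, a preset number of top-ranked fragments or a first correlation threshold, can be sketched as below. The function name `select_candidates`, the fragment identifiers and the scores are hypothetical; how the retrieval engine computes its correlation values is not specified here.

```python
from typing import Optional


def select_candidates(scores: dict[str, float],
                      top_k: Optional[int] = None,
                      threshold: Optional[float] = None) -> list[str]:
    """Return fragment ids ranked by correlation, filtered by threshold and/or count."""
    ranked = sorted(scores, key=lambda frag: scores[frag], reverse=True)
    if threshold is not None:
        # Keep only fragments whose correlation exceeds the first correlation threshold.
        ranked = [frag for frag in ranked if scores[frag] > threshold]
    if top_k is not None:
        # Keep only the preset number of top-ranked fragments.
        ranked = ranked[:top_k]
    return ranked


scores = {"frag_a": 0.91, "frag_b": 0.40, "frag_c": 0.75}
by_count = select_candidates(scores, top_k=2)
by_threshold = select_candidates(scores, threshold=0.8)
```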
Step 103: use a correlation model from the field of natural language processing (NLP) to obtain a first correlation degree between the query statement and each candidate content fragment.
In the embodiments of the present disclosure, the correlation model may be trained in advance. Its input is a candidate content fragment and the query statement, and its output is a score (that is, a confidence) for the degree of correlation between them. For each candidate content fragment, the query statement and the candidate content fragment are input into the trained correlation model, which determines the degree of correlation between them based on the query statement and the content of the candidate content fragment and outputs the first correlation degree. The document retrieval apparatus thereby obtains the first correlation degree between the query statement and the candidate content fragment from the output of the correlation model.
Step 104: based on the first correlation degrees, obtain from the candidate content fragments a target content fragment that matches the query statement.
There may be one or more target content fragments; the number may be set as needed, which is not limited in the present disclosure.
In the embodiments of the present disclosure, taking a single target content fragment as an example, the candidate content fragment with the highest first correlation degree with respect to the query statement may be taken as the target content fragment.
An answer to the query statement can then be obtained based on the target content fragment.
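Steps 103 and 104 can be sketched as scoring every candidate with a pluggable model and keeping the highest-scoring one. The `toy_correlation_model` below is only a stand-in (word-set overlap) so that the example runs on its own; it is not a fine-tuned BERT model, and the names `CorrelationModel` and `pick_target` are assumptions.

```python
from typing import Callable

# Any callable mapping (query statement, fragment) to a confidence score.
CorrelationModel = Callable[[str, str], float]


def toy_correlation_model(query: str, fragment: str) -> float:
    """Stand-in scorer: Jaccard overlap of lowercase word sets, in [0, 1]."""
    q, f = set(query.lower().split()), set(fragment.lower().split())
    return len(q & f) / len(q | f) if q | f else 0.0


def pick_target(query: str, candidates: list[str],
                model: CorrelationModel) -> tuple[str, float]:
    """Score each candidate and return the one with the highest first correlation degree."""
    scored = [(cand, model(query, cand)) for cand in candidates]
    return max(scored, key=lambda pair: pair[1])


candidates = [
    "the transformer type is oil-immersed",
    "the meter is sealed at the factory",
]
target, confidence = pick_target("transformer type", candidates, toy_correlation_model)
```

Keeping the model behind a plain callable means a trained scorer could be swapped in without changing the selection logic.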
It should be noted that the document retrieval apparatus may provide an interactive interface through which the answer to the query statement is displayed. In addition, when obtaining the target content fragment, the document retrieval apparatus may also obtain its attribute information and display, through the interactive interface, the target content fragment, the corresponding attribute information, and the paragraph or table containing the target content fragment, so that the user can see more clearly where the answer to the query statement comes from.
In summary, the document retrieval method provided by the embodiments of the present disclosure obtains a query statement; performs a query based on the query statement to obtain, from multiple content fragments included in at least one document, multiple candidate content fragments related to the query statement; uses a correlation model from the NLP field to obtain the first correlation degree between the query statement and each candidate content fragment; and, based on the first correlation degrees, obtains from the candidate content fragments the target content fragment that matches the query statement. Document retrieval is thus performed automatically, which reduces the labor and time cost it requires. Because the target content fragment is selected according to the AI-derived degree of correlation between each content fragment in the documents and the query statement, the specific content that answers the user's question is determined precisely from the documents, which lays the foundation for providing accurate answers to user questions.
With reference to Figure 2, the following further describes how, in the document retrieval method provided by the embodiments of the present disclosure, a query is performed based on the query statement to obtain, from multiple content fragments included in at least one document, multiple candidate content fragments related to the query statement.
Figure 2 is a flowchart of the document retrieval method according to the second embodiment of the present disclosure. As shown in Figure 2, the method includes steps 201 to 206.
Step 201: obtain a query statement.
Step 202: obtain the content contained in each content fragment and the attribute information of each content fragment.
The attribute information of a content fragment may include at least one of: the name of the document containing the content fragment, the chapter title corresponding to the content fragment, and the parent titles at all levels of that chapter title.
In the embodiments of the present disclosure, taking attribute information that includes the document name, the chapter title and the parent titles at all levels as an example, the content of each content fragment and its attribute information may be stored in the form of a structure. The fields of the structure may include a field named "document name", a field named "chapter title", a field named "parent titles at all levels" and a field named "content fragment". The document retrieval apparatus can then obtain, from each structure, the content of the corresponding content fragment and its attribute information.
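Such a structure could be represented, for instance, as a small record type. The sketch below assumes English field names (`document_name`, `section_title`, `parent_titles`, `content`) standing in for the four named fields; the class name `FragmentRecord` and the sample values are hypothetical.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class FragmentRecord:
    """One content fragment together with its attribute information."""
    document_name: str                                        # "document name" field
    section_title: str                                        # "chapter title" field
    parent_titles: list[str] = field(default_factory=list)    # "parent titles at all levels"
    content: str = ""                                         # "content fragment" field


record = FragmentRecord(
    document_name="Transformer Selection Guide",
    section_title="Transformer type",
    parent_titles=["Equipment", "Power equipment"],
    content="The transformer type is oil-immersed.",
)
fields = asdict(record)  # plain dict, e.g. for indexing in a retrieval engine
```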
Step 203: based on the content of each content fragment, obtain the content correlation between the query statement and that content fragment; and based on the attribute information of each content fragment, obtain the attribute correlation between the query statement and that content fragment.
When the attribute information of a content fragment includes several items, such as the document name, the chapter title and the parent titles at all levels, a separate attribute correlation between the query statement and the content fragment may be obtained for each item of attribute information.
In the embodiments of the present disclosure, the query statement may be segmented into words, and the content correlation between the query statement and a content fragment may be determined from the number of times the words appear in the content of that content fragment. The more often the words appear in the content of a content fragment, the higher the content correlation between the query statement and that content fragment; the less often they appear, the lower the content correlation.
Similarly, the query statement may be segmented into words, and the attribute correlation between the query statement and a content fragment may be determined from the number of times the words appear in the attribute information of that content fragment. For example, the more often the words appear in the document name of a content fragment, the higher the document-name attribute correlation between the query statement and that content fragment; the less often they appear, the lower that attribute correlation.
For example, suppose the query statement is "transformer type" and the attribute information includes the document name and the chapter title. The query statement is segmented into "transformer" and "type". The content correlation between the query statement and each content fragment is determined from the number of times "transformer" and "type" appear in the content of that content fragment. The document-name attribute correlation is determined from the number of times "transformer" and "type" appear in the name of the document containing the content fragment, and the chapter-title attribute correlation is determined from the number of times they appear in the chapter title corresponding to the content fragment.
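The "transformer type" example can be worked through with a plain occurrence count. This sketch assumes the query has already been segmented into the two words, and uses the raw count itself as the correlation value, which is only one of many monotone choices; the sample document name, chapter title and content are invented for illustration.

```python
def term_count(terms: list[str], text: str) -> int:
    """Total number of case-insensitive occurrences of the query words in the text."""
    lowered = text.lower()
    return sum(lowered.count(term.lower()) for term in terms)


terms = ["transformer", "type"]  # assumed result of segmenting "transformer type"
fragment = {
    "document_name": "Transformer Selection Guide",
    "section_title": "Transformer type",
    "content": "The transformer type is oil-immersed; this transformer is sealed.",
}

# Content correlation: occurrences in the fragment's own content.
content_correlation = term_count(terms, fragment["content"])

# One attribute correlation per item of attribute information.
attribute_correlation = {
    name: term_count(terms, fragment[name])
    for name in ("document_name", "section_title")
}
```

Here the content scores 3 (two occurrences of "transformer", one of "type"), the document name scores 1 and the chapter title scores 2.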
Step 204: based on the content correlation and the attribute correlation between the query statement and each content fragment, obtain from the multiple content fragments multiple candidate content fragments related to the query statement.
In the embodiments of the present disclosure, a second correlation threshold may be set for the content correlation and a third correlation threshold for the attribute correlation. The content fragments whose content correlation is greater than the second correlation threshold and/or whose attribute correlation is greater than the third correlation threshold are then determined as the candidate content fragments related to the query statement. The second and third correlation thresholds may be set as needed and are not limited here.
In this way, multiple candidate content fragments that are highly related to the query statement can be obtained accurately from all the content fragments of all the documents.
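The "and/or" threshold rule can be sketched as a small predicate. The function name `is_candidate`, the `require_both` switch and the sample values are assumptions for illustration; the thresholds themselves are left to configuration.

```python
def is_candidate(content_corr: float, attribute_corr: float,
                 second_threshold: float, third_threshold: float,
                 require_both: bool = False) -> bool:
    """Apply the second threshold to content and the third threshold to attributes."""
    content_ok = content_corr > second_threshold
    attribute_ok = attribute_corr > third_threshold
    if require_both:
        return content_ok and attribute_ok   # the "and" variant
    return content_ok or attribute_ok        # the "or" variant


# (content correlation, attribute correlation) per fragment; values are made up.
correlations = {"frag_a": (3, 2), "frag_b": (0, 1), "frag_c": (4, 0)}

either = [name for name, (c, a) in correlations.items()
          if is_candidate(c, a, second_threshold=2, third_threshold=0)]
both = [name for name, (c, a) in correlations.items()
        if is_candidate(c, a, second_threshold=2, third_threshold=0, require_both=True)]
```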
In the embodiments of the present disclosure, weights may also be assigned to the content correlation and the attribute correlation. For ease of distinction, the weight of the content correlation is called the first weight and the weight of the attribute correlation is called the second weight. When the attribute information of a content fragment includes several items, a weight may be set for the attribute correlation of each item, and these weights may be the same or different, which is not limited here. The first weight and the second weight may be determined by experiment, by experience or in other ways, which is not limited in the present disclosure.
Accordingly, step 204 may be implemented as follows: based on each content correlation and the corresponding first weight, and each attribute correlation and the corresponding second weight, determine the second correlation degree between the query statement and the corresponding content fragment; then, based on the second correlation degree between the query statement and each content fragment, obtain from the multiple content fragments multiple candidate content fragments related to the query statement.
For each content fragment, the weighted sum of the content correlation and the attribute correlation may be computed from the content correlation and the first weight and from the attribute correlation and the second weight, and this weighted sum is taken as the second correlation degree between the query statement and the content fragment. The content fragments whose second correlation degree is greater than a fourth correlation threshold may then be determined as candidate content fragments. Alternatively, the preset number of content fragments with the highest second correlation degree (that is, the top-ranked preset number of content fragments after sorting in descending order of second correlation degree) may be determined as candidate content fragments.
In this way, by assigning a first weight to the content correlation and a second weight to the attribute correlation, and by selecting candidate content fragments from the multiple content fragments based on each content correlation and its first weight and each attribute correlation and its second weight, the way in which the degree of correlation between a content fragment and the query statement is determined can be adjusted flexibly as needed.
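The weighted sum can be sketched as follows, with one second weight per item of attribute information. The weights, correlation values and the function name `second_correlation` are hypothetical; in practice the weights would come from experiment or experience as stated above.

```python
def second_correlation(content_corr: float,
                       attribute_corrs: dict[str, float],
                       first_weight: float,
                       second_weights: dict[str, float]) -> float:
    """Weighted sum of the content correlation and each attribute correlation."""
    attribute_part = sum(second_weights[name] * corr
                         for name, corr in attribute_corrs.items())
    return first_weight * content_corr + attribute_part


second_weights = {"document_name": 0.5, "section_title": 2.0}
fragments = {
    "frag_a": (3.0, {"document_name": 1.0, "section_title": 2.0}),
    "frag_b": (1.0, {"document_name": 0.0, "section_title": 0.0}),
    "frag_c": (2.0, {"document_name": 2.0, "section_title": 0.0}),
}

scores = {
    name: second_correlation(content, attrs, 1.0, second_weights)
    for name, (content, attrs) in fragments.items()
}
# Preset number of candidates = 2: keep the two highest second correlation degrees.
top_two = sorted(scores, key=lambda name: scores[name], reverse=True)[:2]
```

With these numbers `frag_a` scores 1.0 × 3 + 0.5 × 1 + 2.0 × 2 = 7.5, so raising the chapter-title weight pushes fragments whose titles match the query toward the top.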
In the embodiments of the present disclosure, steps 202 to 204 may be implemented by the document retrieval apparatus or by the retrieval engine. Taking attribute information that includes the document name, the chapter title and the parent titles at all levels as an example, the content of each content fragment and its document name, chapter title and parent titles may be stored in advance in the form of a structure whose fields include a field named "content fragment", a field named "document name", a field named "chapter title" and a field named "parent titles at all levels". After the document retrieval apparatus obtains the query statement, it segments the query statement into words and combines every word with each of "document name", "chapter title", "parent titles at all levels" and "content fragment" to form a retrieval condition. The retrieval condition is input into the retrieval engine, which obtains the second correlation degree between the query statement and each content fragment in the manner described in the above embodiment, obtains from the multiple content fragments the candidate content fragments related to the query statement, and returns them to the document retrieval apparatus.
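One engine-neutral way to picture the retrieval condition is a list of (field, word) clauses, one for every combination of a field and a query word. This is only a sketch of the data shape: the clause format, the per-field weights and the name `build_retrieval_condition` are assumptions, and a real retrieval engine would need the condition expressed in its own query syntax.

```python
def build_retrieval_condition(terms: list[str],
                              field_weights: dict[str, float]) -> list[dict]:
    """Pair every query word with every field, carrying that field's weight."""
    return [
        {"field": field_name, "term": term, "weight": weight}
        for field_name, weight in field_weights.items()
        for term in terms
    ]


# Hypothetical weights: 1.0 acts as the first weight, the others as second weights.
field_weights = {
    "content fragment": 1.0,
    "document name": 0.5,
    "chapter title": 2.0,
    "parent titles at all levels": 0.5,
}
condition = build_retrieval_condition(["transformer", "type"], field_weights)
```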
Step 205: use a correlation model from the field of natural language processing (NLP) to obtain the first correlation degree between the query statement and each candidate content fragment.
Step 206: based on the first correlation degrees, obtain from the candidate content fragments the target content fragment that matches the query statement.
For the specific implementation and principles of steps 205 and 206, refer to the description of the above embodiments; details are not repeated here.
In summary, the document retrieval method provided by the embodiments of the present disclosure obtains a query statement; obtains the content contained in each content fragment and the attribute information of each content fragment; obtains the content correlation between the query statement and each content fragment based on the content of that content fragment, and the attribute correlation between the query statement and each content fragment based on its attribute information; obtains, based on the content correlation and the attribute correlation, multiple candidate content fragments related to the query statement from the multiple content fragments; uses a correlation model from the NLP field to obtain the first correlation degree between the query statement and each candidate content fragment; and, based on the first correlation degrees, obtains from the candidate content fragments the target content fragment that matches the query statement.
Document retrieval is thus performed automatically, which reduces the labor and time cost it requires. Because the target content fragment is selected according to the AI-derived degree of correlation between each content fragment in the documents and the query statement, the specific content that answers the user's question is determined precisely from the documents, which lays the foundation for providing accurate answers to user questions.
With reference to Figure 3, the following further describes how, in the document retrieval method provided by the embodiments of the present disclosure, a correlation model from the NLP field is used to obtain the first correlation degree between the query statement and each candidate content fragment.
Figure 3 is a flowchart of the document retrieval method according to the third embodiment of the present disclosure. As shown in Figure 3, the method includes steps 301 to 305.
Step 301: obtain a query statement.
Step 302: perform a query based on the query statement to obtain, from multiple content fragments included in at least one document, multiple candidate content fragments related to the query statement.
For the specific implementation and principles of steps 301 and 302, refer to the description of the above embodiments; details are not repeated here.
Step 303: for each candidate content fragment, obtain the corresponding attribute information and concatenate the attribute information with the candidate content fragment to obtain a corresponding concatenation result.
The attribute information of a candidate content fragment may include at least one of: the name of the document containing the candidate content fragment, the chapter title corresponding to the candidate content fragment, and the parent titles at all levels of that chapter title.
In the embodiments of the present disclosure, for each candidate content fragment, the corresponding document name, chapter title and parent titles of the chapter title may be obtained and concatenated with the candidate content fragment to obtain the corresponding concatenation result.
Step 304: input the query statement and the concatenation result corresponding to the candidate content fragment into the correlation model to obtain the first correlation degree between the query statement and the candidate content fragment.
In the embodiments of the present disclosure, the query statement and the concatenation result corresponding to the candidate content fragment may be input into the correlation model, so that the correlation model determines the degree of correlation between the candidate content fragment and the query statement based on the query statement and on both the content and the attribute information of the candidate content fragment, and outputs the first correlation degree. The document retrieval apparatus thereby obtains the first correlation degree between the query statement and the candidate content fragment from the output of the correlation model.
Alternatively, for each candidate content fragment, only the query statement and the candidate content fragment may be input into the correlation model to obtain the first correlation degree between the query statement and the candidate content fragment.
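Both input variants, with and without the attribute information, can be sketched as a function that prepares the text pair handed to the correlation model. The concatenation order, the `" | "` separator, the dictionary keys and the name `build_model_input` are assumptions; the text does not fix how the pieces are joined.

```python
def build_model_input(query: str, fragment: dict,
                      with_attributes: bool = True,
                      sep: str = " | ") -> tuple[str, str]:
    """Return the (query statement, fragment text) pair to feed to the correlation model."""
    if not with_attributes:
        # Variant without attributes: the fragment content alone.
        return query, fragment["content"]
    parts = [
        fragment["document_name"],
        *fragment["parent_titles"],
        fragment["section_title"],
        fragment["content"],
    ]
    # Concatenate the non-empty attribute items and the content into one string.
    return query, sep.join(part for part in parts if part)


fragment = {
    "document_name": "Transformer Selection Guide",
    "parent_titles": ["Equipment"],
    "section_title": "Transformer type",
    "content": "The transformer type is oil-immersed.",
}
with_attrs = build_model_input("transformer type", fragment)
content_only = build_model_input("transformer type", fragment, with_attributes=False)
```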
Step 305: based on the first correlation degrees, obtain from the candidate content fragments the target content fragment that matches the query statement.
Further, an answer to the query statement can be obtained based on the target content fragment.
It should be noted that the document retrieval apparatus may provide an interactive interface through which the answer to the query statement is displayed. In addition, when obtaining the target content fragment, the document retrieval apparatus may also obtain its attribute information and display, through the interactive interface, the target content fragment, the corresponding attribute information, and the paragraph or table containing the target content fragment, so that the user can see more clearly where the answer to the query statement comes from.
参考图4,用户可以在文档检索装置提供的交互界面中输入问题“终端出厂型号参数能支持主站擦写的吗”,并点击“开始检索”按钮启动文档检索过程,相应的,文档检索装置可以获取查询语句“终端出厂型 号参数能支持主站擦写的吗”。在文档检索装置通过上述实施例所示的方式,获取图5所示的与查询语句相关的多个候选内容片段后,可以获取查询语句与各候选内容片段之间的第一相关度(即图5中的置信度),并且获取各候选内容片段的属性信息(即图5中的文档编号列中的各文档编号、文档名称列中的各文档名称、章节序号列中的各章节序号、章节标题列中的各章节标题),进而将对应的第一相关度最高的候选内容片段(即序号为1的候选内容片段)确定为目标内容片段,进而通过图4所示的交互界面展示目标内容片段以及对应的属性信息等。Referring to Figure 4, the user can enter the question "Can the factory model parameters of the terminal support erasing and writing by the main station" in the interactive interface provided by the document retrieval device, and click the "Start Retrieval" button to start the document retrieval process. Correspondingly, the document retrieval device You can obtain the query statement "Can the factory model parameters of the terminal support erasing and writing by the master station?" After the document retrieval device obtains multiple candidate content segments related to the query statement shown in Figure 5 in the manner shown in the above embodiment, it can obtain the first correlation degree between the query statement and each candidate content segment (i.e., Figure 5 5), and obtain the attribute information of each candidate content fragment (i.e., each document number in the document number column, each document name in the document name column, each chapter serial number, chapter number in the chapter serial number column in Figure 5 Each chapter title in the title column), and then determine the corresponding first candidate content fragment with the highest correlation (that is, the candidate content fragment with the serial number 1) as the target content fragment, and then display the target content through the interactive interface shown in Figure 4 Fragments and corresponding attribute information, etc.
其中,图5中最左侧的序号列中的各序号,用于唯一标识对应的候选内容字段。文档编号列中的各文档编号,用于对候选内容片段所在的文档进行唯一标识。文档名称列中的各文档名称,为对应的候选内容片段所在的文档的名称。章节序号列中的各章节序号,为对应的候选内容片段所在章节的序号,用于唯一标识候选内容片段所在章节。章节标题列中的各章节标题,为对应的候选内容片段所在章节的标题。候选内容片段列中的各内容片段,为对应的候选内容片段所包含的内容。置信度列中的各置信度,为相关度模型确定的查询语句与对应的候选内容片段之间的第一相关度。Among them, each serial number in the leftmost serial number column in Figure 5 is used to uniquely identify the corresponding candidate content field. Each document number in the document number column is used to uniquely identify the document in which the candidate content fragment is located. Each document name in the document name column is the name of the document in which the corresponding candidate content fragment is located. Each chapter serial number in the chapter serial number column is the serial number of the chapter where the corresponding candidate content fragment is located, and is used to uniquely identify the chapter where the candidate content fragment is located. Each chapter title in the chapter title column is the title of the chapter where the corresponding candidate content fragment is located. Each content segment in the candidate content segment column is the content contained in the corresponding candidate content segment. Each confidence in the confidence column is the first correlation between the query statement determined by the correlation model and the corresponding candidate content fragment.
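The Figure 5 columns and the selection of the target content fragment can be sketched as follows. This is a minimal illustration only: the `Candidate` class, its field names, and the sample rows are hypothetical placeholders, not values taken from Figure 5.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    serial_no: int       # uniquely identifies the candidate content fragment
    doc_no: str          # uniquely identifies the source document
    doc_name: str        # name of the source document
    chapter_no: str      # number of the chapter containing the fragment
    chapter_title: str   # title of that chapter
    content: str         # content of the candidate fragment
    confidence: float    # first correlation from the correlation model


def pick_target(candidates):
    """Return the candidate with the highest first correlation."""
    if not candidates:
        raise ValueError("no candidate content fragments")
    return max(candidates, key=lambda c: c.confidence)


# Hypothetical rows, invented for illustration only.
rows = [
    Candidate(1, "D001", "Terminal spec", "4.2", "Factory parameters",
              "placeholder text A", 0.93),
    Candidate(2, "D001", "Terminal spec", "5.1", "Master station",
              "placeholder text B", 0.71),
]
target = pick_target(rows)
```

Here `target` is the row whose `confidence` is largest; its attribute fields are what the interface would display next to the fragment.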
In summary, the document retrieval method provided by the embodiments of the present disclosure obtains a query statement; performs a query based on the query statement to obtain, from the multiple content fragments included in at least one document, multiple candidate content fragments related to the query statement; for each candidate content fragment, obtains the corresponding attribute information and splices the attribute information with the candidate content fragment to obtain a corresponding splicing result; inputs the query statement and the splicing result corresponding to the candidate content fragment into the correlation model to obtain the first correlation between the query statement and the candidate content fragment; and, based on the first correlations, obtains from the candidate content fragments the target content fragment that matches the query statement. Document retrieval is thereby performed automatically, which reduces the labor and time costs it requires. Because the target content fragment is selected according to the degree of correlation, obtained with AI technology, between each content fragment in the document and the query statement, the specific content that can answer the user's question is determined precisely from the document, which lays the foundation for providing an accurate answer. In addition, by using a correlation model from the field of natural language processing (NLP) to determine the first correlation between each candidate content fragment and the query statement from the query statement, the attribute information of the candidate content fragment, and the content of the candidate content fragment itself, the accuracy of the determined target content fragment is further improved.
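The splice-then-score flow summarized above can be sketched as follows. The disclosure does not specify the correlation model in this section, so `toy_relevance` is a stand-in (simple token overlap) used only to make the sketch runnable; the function names, the dictionary layout, and the space separator are likewise assumptions.

```python
def splice(attributes, content, sep=" "):
    """Put the non-empty attribute values in front of the fragment content."""
    parts = [value for value in attributes.values() if value]
    return sep.join(parts + [content])


def toy_relevance(query, text):
    """Stand-in for the correlation model: share of query tokens found in text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def select_target(query, candidates, model=toy_relevance):
    """Score every spliced candidate and return (first correlation, candidate)."""
    scored = [(model(query, splice(c["attributes"], c["content"])), c)
              for c in candidates]
    return max(scored, key=lambda pair: pair[0])


# Hypothetical candidates, invented for illustration only.
candidates = [
    {"attributes": {"doc_name": "Terminal manual",
                    "chapter_title": "Factory parameters"},
     "content": "The master station may rewrite the model parameter."},
    {"attributes": {"doc_name": "Terminal manual",
                    "chapter_title": "Installation"},
     "content": "Mount the terminal on the rail."},
]
score, best = select_target("factory parameters master station", candidates)
```

In this toy run the first candidate wins partly because its chapter title matches the query, which is the effect the splicing step is meant to capture.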
As the above analysis shows, a large number of documents to be retrieved can be processed in advance to obtain multiple content fragments; after obtaining the query statement, the document retrieval device can then query based on it to obtain, from the multiple content fragments, multiple candidate content fragments related to the query statement. The process of processing the documents to be retrieved to obtain multiple content fragments, in the document retrieval method provided by the embodiments of the present disclosure, is described below with reference to Figure 6.
Figure 6 is a flowchart of a document retrieval method according to the fourth embodiment of the present disclosure. As shown in Figure 6, on the basis of the above embodiments, the method may further include the following steps 601-603.
Step 601: recognize each document based on optical character recognition (OCR) technology in the field of artificial intelligence (AI) to obtain a recognition result for each document.
In the embodiments of the present disclosure, the document retrieval device may itself recognize each document based on OCR technology to obtain the recognition result of each document.
In the embodiments of the present disclosure, the document retrieval device may also be connected to a document processing platform through an interface, upload each document to the document processing platform so that the platform recognizes each document based on OCR technology, and then obtain the recognition result of each document returned by the platform.
In the embodiments of the present disclosure, the document retrieval device may also call an RPA robot to upload each document to the document processing platform, so that the platform recognizes each document based on OCR technology, and then obtain the recognition result of each document returned by the platform. When the number of documents to be retrieved is large, calling the RPA robot to upload the documents one by one reduces the labor cost of uploading.
Referring to the left part of Figure 7, the document processing platform may provide an interactive interface that includes an "Upload Document" button for uploading documents and a "Start Recognition" button for starting the document recognition process. The document retrieval device may call the RPA robot to simulate mouse operations: it clicks the "Upload Document" button to upload the document to be processed to the document processing platform, and then clicks the "Start Recognition" button to start the platform's recognition of the document, yielding the recognition result shown in the right part of Figure 7. In Figure 7, "cl_num" denotes the chapter number, "cl_name" denotes the chapter title, "cl_rank" denotes the line on which the chapter is located, and "cl_content" denotes the content contained in the chapter.
Step 602: perform structuring processing on each recognition result to obtain the multiple content fragments included in each document.
In the embodiments of the present disclosure, a document may include text and/or tables; accordingly, the recognition result of a document may include a text recognition result and/or a table recognition result.
Accordingly, step 602 may be implemented as follows: segment the text recognition result and/or the table recognition result according to a preset segmentation method to obtain multiple segments; then aggregate the multiple segments according to a preset aggregation method to obtain multiple content fragments, where each content fragment is obtained by aggregating at least one segment.
The preset segmentation method is the way in which the recognition result of a document is divided into multiple segments, and may be determined according to the type of content contained in the document (for example, text or table).
The preset aggregation method is the way in which segments are aggregated into content fragments, and may likewise be determined according to the type of content contained in the document (for example, text or table).
For example, suppose the recognition result of a document includes a text recognition result that contains chapter numbers and punctuation marks such as commas and periods. The document retrieval device may first segment the text recognition result by chapter number, and then segment the result of the first pass by punctuation marks (generally end-of-sentence marks such as periods). The text recognition result is thereby divided into multiple sentences, each sentence being one segment, and the segments are arranged from front to back according to their positions in the document.
Further, a specific length may be given, for example 200 characters. Segments are accumulated starting from the first one; once the accumulated length exceeds 200 characters, the segments accumulated before the current one form a content fragment, and the current segment becomes the first segment of the next content fragment. For example, if the accumulated length reaches 203 characters when the 5th sentence is added and the length of the previously accumulated sentences is 197 characters, the first 4 sentences form one content fragment and the 5th sentence becomes the first sentence of the next content fragment; the subsequent sentences are then accumulated in turn to determine the next content fragment.
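The sentence-level segmentation and length-bounded aggregation described above can be sketched as follows. This is an illustrative sketch only: the punctuation set, the function names, and the handling of a single segment longer than the limit (it becomes its own fragment) are assumptions, and the first pass (splitting by chapter number) is assumed to have been done already.

```python
import re

# Assumed set of end-of-sentence marks (Chinese and ASCII).
_END = "。！？.!?"
_SENTENCE = re.compile(rf"[^{_END}]+[{_END}]?")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation mark.

    Simplification: a period inside a number such as 3.5 also splits.
    """
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def aggregate(segments, max_len=200):
    """Accumulate segments in order; close a fragment before it exceeds max_len."""
    fragments, current, length = [], [], 0
    for seg in segments:
        if current and length + len(seg) > max_len:
            # Adding seg would exceed the limit: emit what was accumulated
            # so far and start the next fragment with seg.
            fragments.append("".join(current))
            current, length = [], 0
        current.append(seg)
        length += len(seg)
    if current:
        fragments.append("".join(current))
    return fragments


# Mirrors the 197 / 203 character example in the text.
segments = ["a" * 50, "a" * 50, "a" * 50, "b" * 47, "c" * 6]
fragments = aggregate(segments)
```

With these inputs the first four segments (197 characters) form one fragment and the fifth segment starts the next, as in the worked example above.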
Referring to Figure 8, structuring the text recognition result shown in the left part of the figure yields the multiple content fragments shown in the right part of Figure 8.
Alternatively, suppose the recognition result of a document includes a table recognition result that contains delimiters distinguishing different cells and the row number of each cell. The document retrieval device may first segment the table recognition result by row number, and then segment the result of the first pass by delimiter. The table recognition result is thereby divided into multiple cell contents, each cell content being one segment, and the segments in each row are arranged from front to back according to their positions in the document. Further, the segments in each row may be spliced into one content fragment.
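The table case can be sketched in the same spirit. The input layout (a mapping from row number to raw row text), the `|` delimiter, the space joiner, and the dropping of empty cells are all assumptions made for illustration.

```python
def table_to_fragments(rows, delimiter="|", joiner=" "):
    """Turn each table row into one content fragment.

    rows maps a row number to the raw row text, whose cells are
    separated by `delimiter`.
    """
    fragments = []
    for row_no in sorted(rows):              # first pass: one unit per row
        cells = [cell.strip() for cell in rows[row_no].split(delimiter)]
        cells = [cell for cell in cells if cell]   # second pass: cell segments
        if cells:
            fragments.append(joiner.join(cells))   # splice cells of the row
    return fragments


# Hypothetical table content, invented for illustration only.
table = {2: "CPU | 4 cores", 1: "Item | Value"}
row_fragments = table_to_fragments(table)
```

Sorting by row number keeps the fragments in document order even if the recognition result lists rows out of order.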
Referring to Figure 9, structuring the table recognition result shown in the left part of the figure yields the multiple content fragments shown in the right part of Figure 9.
It should be noted that the above ways of segmenting the text recognition result or table recognition result, and of aggregating the resulting segments, are only illustrative and should not be understood as limiting the technical solution of the present disclosure. In practical applications, those skilled in the art may set, as needed, the preset segmentation method for segmenting the recognition result of a document and the preset aggregation method for aggregating the segments; the present disclosure does not limit this.
Step 603: save each content fragment in correspondence with the corresponding content field.
In the embodiments of the present disclosure, the name of the content field may be set to "content fragment", and each content fragment may be saved in correspondence with the corresponding content field, so that when the content of a content fragment is needed later, it can be obtained through the content field.
In addition, in the embodiments of the present disclosure, the content of each content fragment together with the corresponding document name, chapter title, and parent titles at each level may be saved in the form of a structure. The fields of the structure may correspondingly include a field named "content fragment", a field named "document name", a field named "chapter title", and a field named "parent titles at each level".
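One possible shape for such a structure is sketched below. The Python field names stand in for the fields named "content fragment", "document name", "chapter title", and "parent titles at each level"; the in-memory dictionary used as storage and the helper names are assumptions, since the disclosure does not specify a storage backend here.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class ContentFragment:
    content: str                       # "content fragment" field
    doc_name: str = ""                 # "document name" field
    chapter_title: str = ""            # "chapter title" field
    parent_titles: List[str] = field(default_factory=list)  # "parent titles at each level"


# Hypothetical in-memory store keyed by a fragment id.
store: Dict[int, dict] = {}


def save(fragment_id, fragment):
    """Save the fragment so each value sits under its named field."""
    store[fragment_id] = asdict(fragment)


def load_content(fragment_id):
    """Fetch the fragment content through its content field."""
    return store[fragment_id]["content"]


save(1, ContentFragment("placeholder text", "Terminal manual",
                        "4.2 Parameters", ["4 Configuration"]))
```

Keeping the attribute fields next to the content makes them available later for attribute correlation and for splicing before scoring.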
Steps 601-603 may be performed before step 102, before step 202, or before step 302.
In summary, the document retrieval method provided by the embodiments of the present disclosure recognizes each document based on OCR technology to obtain its recognition result, performs structuring processing on each recognition result to obtain the multiple content fragments included in each document, and saves each content fragment in correspondence with the corresponding content field. The documents to be retrieved are thereby processed into multiple content fragments, which lays the foundation for precisely determining, from the documents, the specific content that can answer the user's question and thus for providing an accurate answer. Moreover, by calling an RPA robot to upload each document to the document processing platform, having the platform recognize each document based on OCR technology in the AI field, obtaining the recognition results returned by the platform, and structuring those results into the content fragments of each document, the content fragments are obtained by combining RPA and AI to achieve IA, which further reduces the labor cost of document retrieval.
To implement the above embodiments, the present disclosure further provides a document retrieval device. Figure 10 is a schematic structural diagram of a document retrieval device according to the fifth embodiment of the present disclosure.
As shown in Figure 10, the document retrieval device 1000 includes a first acquisition module 1001, a query module 1002, a second acquisition module 1003, and a third acquisition module 1004.
The first acquisition module 1001 is configured to obtain a query statement;
the query module 1002 is configured to perform a query based on the query statement to obtain, from the multiple content fragments included in at least one document, multiple candidate content fragments related to the query statement;
the second acquisition module 1003 is configured to obtain the first correlation between the query statement and each candidate content fragment by using a correlation model from the field of natural language processing (NLP);
the third acquisition module 1004 is configured to obtain, based on the first correlations, the target content fragment that matches the query statement from the candidate content fragments.
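The division of work among modules 1001-1004 can be sketched as a small pipeline. The class name, the injected `search` and `relevance` callables, and the toy implementations below are hypothetical; they only show how the four responsibilities compose, not how the disclosed device is built.

```python
class DocumentRetriever:
    """Composes the four modules as plain callables (illustrative only)."""

    def __init__(self, search, relevance):
        self.search = search          # role of query module 1002
        self.relevance = relevance    # role of second acquisition module 1003

    def retrieve(self, query):
        # Role of module 1001: the query statement arrives as an argument.
        candidates = self.search(query)
        scored = [(self.relevance(query, c), c) for c in candidates]
        if not scored:
            return None
        # Role of module 1004: pick the best-scoring candidate.
        return max(scored, key=lambda pair: pair[0])


# Toy stand-ins so the sketch runs; a real system would use an index
# and a trained correlation model instead.
corpus = [
    "reset the terminal",
    "master station rewrites parameters",
    "station wiring diagram",
]


def keyword_search(query):
    return [f for f in corpus if any(word in f for word in query.split())]


def overlap(query, fragment):
    return sum(word in fragment for word in query.split())


retriever = DocumentRetriever(keyword_search, overlap)
result = retriever.retrieve("master station parameters")
```

Passing the search and scoring steps in as callables keeps the sketch independent of any particular retrieval index or model.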
It should be noted that the document retrieval device 1000 of the embodiments of the present disclosure can perform the document retrieval method provided by the above embodiments. The document retrieval device 1000 may be implemented in software and/or hardware; it may be an electronic device, or may be configured in an electronic device, so as to retrieve documents automatically, thereby reducing the labor and time costs of document retrieval and, based on AI technology, precisely determining from the documents the specific content that can answer the user's question. The electronic device may include, but is not limited to, a terminal device, a server, and the like; the embodiments of the present disclosure do not specifically limit the electronic device.
In one embodiment of the present disclosure, the query module 1002 includes:
a first acquisition unit, configured to obtain the content contained in each content fragment and the attribute information of each content fragment;
a second acquisition unit, configured to obtain, based on the content contained in each content fragment, the content correlation between the query statement and the corresponding content fragment, and to obtain, based on the attribute information of each content fragment, the attribute correlation between the query statement and the corresponding content fragment;
a third acquisition unit, configured to obtain, based on the content correlation and the attribute correlation between the query statement and each content fragment, multiple candidate content fragments related to the query statement from the multiple content fragments.
In one embodiment of the present disclosure, the content correlation has a corresponding first weight, and the attribute correlation has a corresponding second weight;
the third acquisition unit is configured to:
determine, based on each content correlation and the corresponding first weight and each attribute correlation and the corresponding second weight, a second correlation between the query statement and the corresponding content fragment;
obtain, based on the second correlation between the query statement and each content fragment, multiple candidate content fragments related to the query statement from the multiple content fragments.
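One way to combine the two correlations is a weighted sum followed by a top-k cut, sketched below. This section states only that the second correlation is determined from the two correlations and their weights, so the weighted-sum form, the default weights, and the top-k selection rule are assumptions.

```python
def second_correlation(content_corr, attr_corr, w_content=0.7, w_attr=0.3):
    """Assumed combination: weighted sum of content and attribute correlation."""
    return w_content * content_corr + w_attr * attr_corr


def top_candidates(correlations, k=2, **weights):
    """Return the ids of the k fragments with the highest second correlation.

    correlations maps a fragment id to a pair
    (content correlation, attribute correlation).
    """
    ranked = sorted(
        correlations,
        key=lambda fid: second_correlation(*correlations[fid], **weights),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical correlation values, invented for illustration only.
correlations = {"a": (0.9, 0.1), "b": (0.2, 0.9), "c": (0.5, 0.5)}
by_content = top_candidates(correlations)
by_attribute = top_candidates(correlations, w_content=0.2, w_attr=0.8)
```

Changing the weights shifts which fragments become candidates: the default weights favor fragment "a", while the attribute-heavy weights favor fragment "b".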
In one embodiment of the present disclosure, the second acquisition module 1003 includes:
a fourth acquisition unit, configured to input, for each candidate content fragment, the query statement and the candidate content fragment into the correlation model to obtain the first correlation between the query statement and the candidate content fragment.
In one embodiment of the present disclosure, the second acquisition module 1003 includes:
a fifth acquisition unit, configured to obtain, for each candidate content fragment, the corresponding attribute information and splice the attribute information with the candidate content fragment to obtain a corresponding splicing result;
a sixth acquisition unit, configured to input the query statement and the splicing result corresponding to the candidate content fragment into the correlation model to obtain the first correlation between the query statement and the candidate content fragment.
In one embodiment of the present disclosure, the document retrieval device 1000 further includes: a recognition module, configured to recognize each document based on optical character recognition (OCR) technology in the field of artificial intelligence (AI) to obtain the recognition result of each document;
a processing module, configured to perform structuring processing on each recognition result to obtain the multiple content fragments included in each document;
a saving module, configured to save each content fragment in correspondence with the corresponding content field.
In one embodiment of the present disclosure, the recognition module includes:
an upload unit, configured to call an RPA robot to upload each document to the document processing platform, so that the platform recognizes each document based on OCR technology;
a seventh acquisition unit, configured to obtain the recognition result of each document returned by the document processing platform.
In one embodiment of the present disclosure, the recognition result includes a text recognition result and/or a table recognition result;
the processing module includes:
a segmentation unit, configured to segment the text recognition result and/or the table recognition result according to a preset segmentation method to obtain multiple segments;
an aggregation unit, configured to aggregate the multiple segments according to a preset aggregation method to obtain multiple content fragments, where each content fragment is obtained by aggregating at least one segment.
In one embodiment of the present disclosure, the attribute information includes at least one of a document name, a chapter title, and the parent titles at each level of the chapter title.
It should be noted that the foregoing explanation of the document retrieval method embodiments also applies to the document retrieval device of this embodiment; details not disclosed in the device embodiments of the present disclosure are not repeated here.
In summary, the document retrieval device of the embodiments of the present disclosure obtains a query statement; performs a query based on the query statement to obtain, from the multiple content fragments included in at least one document, multiple candidate content fragments related to the query statement; obtains the first correlation between the query statement and each candidate content fragment by using a correlation model from the field of natural language processing (NLP); and, based on the first correlations, obtains from the candidate content fragments the target content fragment that matches the query statement. Document retrieval is thereby performed automatically, which reduces the labor and time costs it requires. Because the target content fragment is selected according to the degree of correlation, obtained with AI technology, between each content fragment in the document and the query statement, the specific content that can answer the user's question is determined precisely from the document, which lays the foundation for providing an accurate answer.
To implement the above embodiments, the embodiments of the present disclosure further provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the document retrieval method described in any of the foregoing method embodiments is implemented.
To implement the above embodiments, the embodiments of the present disclosure further provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the document retrieval method described in any of the foregoing method embodiments is implemented. In some embodiments, the computer-readable storage medium is a non-transitory computer-readable storage medium.
To implement the above embodiments, the embodiments of the present disclosure further provide a computer program product; when the instructions in the computer program product are executed by a processor, the document retrieval method described in any of the foregoing method embodiments is implemented.
To implement the above embodiments, the embodiments of the present disclosure further provide a computer program including computer program code; when the computer program code runs on a computer, it causes the computer to perform the document retrieval method described in any of the foregoing method embodiments.
Figure 11 shows a block diagram of an exemplary electronic device suitable for implementing embodiments of the present disclosure. The electronic device 11 shown in Figure 11 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in Figure 11, the electronic device 11 takes the form of a general-purpose computing device. The components of the electronic device 11 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the different system components (including the memory 28 and the processing unit 16).
The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The electronic device 11 typically includes a variety of computer-system-readable media. These media may be any available media that can be accessed by the electronic device 11, including volatile and non-volatile media and removable and non-removable media.
存储器28可以包括易失性存储器形式的计算机系统可读介质,例如随机存取存储器(Random Access Memory;以下简称:RAM)30和/或高速缓存存储器32。电子设备11可以进一步包括其它可移动/不可移动的、易失性/非易失性计算机系统存储介质。仅作为举例,存储系统34可以用于读写不可移动的、非 易失性磁介质(图11未显示,通常称为“硬盘驱动器”)。尽管图11中未示出,可以提供用于对可移动非易失性磁盘(例如“软盘”)读写的磁盘驱动器,以及对可移动非易失性光盘(例如:光盘只读存储器(Compact Disc Read Only Memory;以下简称:CD-ROM)、数字多功能只读光盘(Digital Video Disc Read Only Memory;以下简称:DVD-ROM)或者其它光介质)读写的光盘驱动器。在这些情况下,每个驱动器可以通过一个或者多个数据介质接口与总线18相连。存储器28可以包括至少一个程序产品,该程序产品具有一组(例如至少一个)程序模块,这些程序模块被配置以执行本公开各实施例的功能。The memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (Random Access Memory; hereinafter referred to as: RAM) 30 and/or cache memory 32. Electronic device 11 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 34 may be used to read and write to non-removable, non-volatile magnetic media (not shown in Figure 11, commonly referred to as a "hard drive"). Although not shown in FIG. 11, a disk drive for reading and writing a removable non-volatile disk (e.g., a "floppy disk"), and a removable non-volatile optical disk (e.g., a compact disk read-only memory) may be provided. Disc Read Only Memory (hereinafter referred to as: CD-ROM), Digital Video Disc Read Only Memory (hereinafter referred to as: DVD-ROM) or other optical media) read and write optical disc drives. In these cases, each drive may be connected to bus 18 through one or more data media interfaces. Memory 28 may include at least one program product having a set (eg, at least one) of program modules configured to perform the functions of embodiments of the present disclosure.
具有一组(至少一个)程序模块42的程序/实用工具40,可以存储在例如存储器28中,这样的程序模块42包括但不限于操作系统、一个或者多个应用程序、其它程序模块以及程序数据,这些示例中的每一个或某种组合中可能包括网络环境的实现。程序模块42通常执行本公开所描述的实施例中的功能和/或方法。A program/utility 40 having a set of (at least one) program modules 42, including but not limited to an operating system, one or more application programs, other program modules, and program data, may be stored, for example, in memory 28 , each of these examples or some combination may include the implementation of a network environment. Program modules 42 generally perform functions and/or methods in the embodiments described in this disclosure.
The electronic device 11 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the electronic device 11, and/or with any device (e.g., a network card, a modem, etc.) that enables the electronic device 11 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. In addition, the electronic device 11 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20. As shown in FIG. 11, the network adapter 20 communicates with the other modules of the electronic device 11 through the bus 18. It should be understood that, although not shown in FIG. 11, other hardware and/or software modules may be used in conjunction with the electronic device 11, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processing unit 16 runs programs stored in the memory 28 to execute various functional applications and perform data processing, for example to implement the methods described in the foregoing embodiments.
It should be noted that the foregoing explanations of the embodiments of the document retrieval method also apply to the electronic device, the computer-readable storage medium, the computer program product, and the computer program of the embodiments of the present disclosure, and are not repeated here.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict each other, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In the description of the present disclosure, "a plurality of" means at least two, for example two or three, unless expressly and specifically defined otherwise.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a custom logical function or process. The scope of the preferred embodiments of the present disclosure includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present disclosure belong.
The logic and/or steps represented in a flowchart or otherwise described herein may, for example, be regarded as an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any apparatus that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that the parts of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing module, or each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the present disclosure have been shown and described above, it should be understood that these embodiments are exemplary and are not to be construed as limiting the present disclosure; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present disclosure.

Claims (17)

  1. A document retrieval method, comprising:
    obtaining a query statement;
    performing a query based on the query statement to obtain, from a plurality of content fragments included in at least one document, a plurality of candidate content fragments related to the query statement;
    obtaining a first correlation between the query statement and each of the candidate content fragments by using a correlation model from the field of natural language processing (NLP); and
    obtaining, based on the first correlations, a target content fragment matching the query statement from the candidate content fragments.
  2. The method according to claim 1, wherein performing the query based on the query statement to obtain, from the plurality of content fragments included in the at least one document, the plurality of candidate content fragments related to the query statement comprises:
    obtaining the content contained in each of the content fragments and attribute information of each of the content fragments;
    obtaining, based on the content contained in each of the content fragments, a content correlation between the query statement and the corresponding content fragment, and obtaining, based on the attribute information of each of the content fragments, an attribute correlation between the query statement and the corresponding content fragment; and
    obtaining, based on the content correlation and the attribute correlation between the query statement and each of the content fragments, the plurality of candidate content fragments related to the query statement from the plurality of content fragments.
  3. The method according to claim 2, wherein the content correlation has a corresponding first weight, and the attribute correlation has a corresponding second weight; and
    obtaining, based on the content correlation and the attribute correlation between the query statement and each of the content fragments, the plurality of candidate content fragments related to the query statement from the plurality of content fragments comprises:
    determining, based on each content correlation and the corresponding first weight and on each attribute correlation and the corresponding second weight, a second correlation between the query statement and the corresponding content fragment; and
    obtaining, based on the second correlation between the query statement and each of the content fragments, the plurality of candidate content fragments related to the query statement from the plurality of content fragments.
  4. The method according to any one of claims 1 to 3, wherein obtaining the first correlation between the query statement and each of the candidate content fragments by using the correlation model from the field of natural language processing (NLP) comprises:
    for each of the candidate content fragments, inputting the query statement and the candidate content fragment into the correlation model to obtain the first correlation between the query statement and the candidate content fragment.
  5. The method according to any one of claims 1 to 4, wherein obtaining the first correlation between the query statement and each of the candidate content fragments by using the correlation model from the field of natural language processing (NLP) comprises:
    for each of the candidate content fragments, obtaining corresponding attribute information and concatenating the attribute information with the candidate content fragment to obtain a corresponding concatenation result; and
    inputting the query statement and the concatenation result corresponding to the candidate content fragment into the correlation model to obtain the first correlation between the query statement and the candidate content fragment.
  6. The method according to any one of claims 1 to 5, further comprising, before performing the query based on the query statement to obtain, from the plurality of content fragments included in the at least one document, the plurality of candidate content fragments related to the query statement:
    recognizing each of the documents based on optical character recognition (OCR) technology from the field of artificial intelligence (AI) to obtain a recognition result of each of the documents;
    performing structuring processing on each of the recognition results to obtain the plurality of content fragments included in each of the documents; and
    storing each of the content fragments in association with a corresponding content field.
  7. The method according to claim 6, wherein recognizing each of the documents based on the optical character recognition (OCR) technology from the field of artificial intelligence (AI) to obtain the recognition result of each of the documents comprises:
    invoking an RPA robot to upload each of the documents to a document processing platform, so that the document processing platform recognizes each of the documents based on the OCR technology; and
    obtaining the recognition result of each of the documents returned by the document processing platform.
  8. The method according to claim 6 or 7, wherein the recognition result includes a text recognition result and/or a table recognition result; and
    performing structuring processing on each of the recognition results to obtain the plurality of content fragments included in each of the documents comprises:
    splitting the text recognition result and/or the table recognition result according to a preset splitting manner to obtain a plurality of split segments; and
    aggregating the plurality of split segments according to a preset aggregation manner to obtain the plurality of content fragments, wherein each of the content fragments is obtained by aggregating at least one of the split segments.
  9. The method according to any one of claims 1 to 8, wherein the attribute information includes at least one of a document name, a chapter title, and the parent titles at each level of the chapter title.
  10. A document retrieval apparatus, comprising:
    a first acquisition module configured to obtain a query statement;
    a query module configured to perform a query based on the query statement to obtain, from a plurality of content fragments included in at least one document, a plurality of candidate content fragments related to the query statement;
    a second acquisition module configured to obtain a first correlation between the query statement and each of the candidate content fragments by using a correlation model from the field of natural language processing (NLP); and
    a third acquisition module configured to obtain, based on the first correlations, a target content fragment matching the query statement from the candidate content fragments.
  11. The apparatus according to claim 10, wherein the query module comprises:
    a first acquisition unit configured to obtain the content contained in each of the content fragments and attribute information of each of the content fragments;
    a second acquisition unit configured to obtain, based on the content contained in each of the content fragments, a content correlation between the query statement and the corresponding content fragment, and to obtain, based on the attribute information of each of the content fragments, an attribute correlation between the query statement and the corresponding content fragment; and
    a third acquisition unit configured to obtain, based on the content correlation and the attribute correlation between the query statement and each of the content fragments, the plurality of candidate content fragments related to the query statement from the plurality of content fragments.
  12. The apparatus according to claim 11, wherein the content correlation has a corresponding first weight, and the attribute correlation has a corresponding second weight; and
    the third acquisition unit is configured to:
    determine, based on each content correlation and the corresponding first weight and on each attribute correlation and the corresponding second weight, a second correlation between the query statement and the corresponding content fragment; and
    obtain, based on the second correlation between the query statement and each of the content fragments, the plurality of candidate content fragments related to the query statement from the plurality of content fragments.
  13. The apparatus according to any one of claims 10 to 12, wherein the second acquisition module comprises:
    a fourth acquisition unit configured to, for each of the candidate content fragments, input the query statement and the candidate content fragment into the correlation model to obtain the first correlation between the query statement and the candidate content fragment.
  14. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method according to any one of claims 1 to 9.
  15. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 9.
  16. A computer program product, comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 9.
  17. A computer program, comprising computer program code which, when run on a computer, causes the computer to perform the method according to any one of claims 1 to 9.
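
By way of illustration only, and forming no part of the claims, the two-stage retrieval of claims 1 to 5 might be sketched as follows. The sketch makes several assumptions that the claims do not fix: the second correlation is taken to be a weighted sum, the weights 0.7 and 0.3 are arbitrary, the token-overlap score is a toy stand-in for both the first-stage correlations and the NLP correlation model, and all names (`Fragment`, `second_correlation`, `recall_candidates`, `pick_target`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Fragment:
    content: str     # text of the content fragment
    attributes: str  # attribute information, e.g. document name and chapter titles


def token_overlap(query: str, text: str) -> float:
    """Toy correlation: share of query tokens that also occur in text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & text_tokens) / len(query_tokens)


def second_correlation(query: str, fragment: Fragment,
                       w_content: float = 0.7, w_attr: float = 0.3) -> float:
    """Combine content and attribute correlation (weighted sum is an assumption)."""
    return (w_content * token_overlap(query, fragment.content)
            + w_attr * token_overlap(query, fragment.attributes))


def recall_candidates(query: str, fragments: List[Fragment],
                      top_k: int = 2) -> List[Fragment]:
    """First stage: keep the top_k fragments ranked by second correlation."""
    ranked = sorted(fragments,
                    key=lambda f: second_correlation(query, f),
                    reverse=True)
    return ranked[:top_k]


def pick_target(query: str, candidates: List[Fragment],
                model: Callable[[str, str], float]) -> Fragment:
    """Second stage: score attributes + content with the model, return the best."""
    return max(candidates,
               key=lambda f: model(query, f.attributes + " " + f.content))


if __name__ == "__main__":
    fragments = [
        Fragment("employees receive fifteen days of annual leave",
                 "HR handbook leave policy"),
        Fragment("expense reports are due monthly",
                 "finance handbook expenses"),
        Fragment("the office is closed on public holidays",
                 "HR handbook holidays"),
    ]
    query = "annual leave policy"
    candidates = recall_candidates(query, fragments)
    # token_overlap stands in for a trained correlation model here.
    print(pick_target(query, candidates, token_overlap).content)
```

In practice the `model` argument would be a trained correlation model rather than `token_overlap`; passing it as a callable keeps the cheap recall stage separate from the more expensive scoring stage, which is the division the claims describe.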
PCT/CN2022/100569 2022-06-07 2022-06-22 Document retrieval method and apparatus, and electronic device WO2023236253A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210637019.1 2022-06-07
CN202210637019.1A CN114925174A (en) 2022-06-07 2022-06-07 Document retrieval method and device and electronic equipment

Publications (1)

Publication Number Publication Date
WO2023236253A1 true WO2023236253A1 (en) 2023-12-14

Family

ID=82813388

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100569 WO2023236253A1 (en) 2022-06-07 2022-06-22 Document retrieval method and apparatus, and electronic device

Country Status (2)

Country Link
CN (1) CN114925174A (en)
WO (1) WO2023236253A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116610775A (en) * 2023-07-20 2023-08-18 科大讯飞股份有限公司 Man-machine interaction method, device, equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110633407A (en) * 2018-06-20 2019-12-31 百度在线网络技术(北京)有限公司 Information retrieval method, device, equipment and computer readable medium
CN112100326A (en) * 2020-08-28 2020-12-18 广州探迹科技有限公司 Anti-interference knowledge base question-answering method and system integrating retrieval and machine reading understanding
CN112528681A (en) * 2020-12-18 2021-03-19 北京百度网讯科技有限公司 Cross-language retrieval and model training method, device, equipment and storage medium
US20220121668A1 (en) * 2021-01-28 2022-04-21 Beijing Baidu Netcom Science Technology Co., Ltd. Method for recommending document, electronic device and storage medium
CN113704427A (en) * 2021-08-30 2021-11-26 平安科技(深圳)有限公司 Text provenance determination method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN114925174A (en) 2022-08-19

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 22945392; Country of ref document: EP; Kind code of ref document: A1