CN117573800A - Paragraph retrieval method, device, equipment and storage medium - Google Patents

Paragraph retrieval method, device, equipment and storage medium

Info

Publication number
CN117573800A
Authority
CN
China
Prior art keywords
paragraph
target
data
document
recall
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311376666.2A
Other languages
Chinese (zh)
Inventor
朱剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Aishu Information Technology Group Co ltd
Original Assignee
Hunan Aishu Information Technology Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Aishu Information Technology Group Co ltd filed Critical Hunan Aishu Information Technology Group Co ltd
Priority to CN202311376666.2A
Publication of CN117573800A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a paragraph retrieval method, a paragraph retrieval device, paragraph retrieval equipment and a storage medium. The method comprises the following steps: constructing a search document data set, dividing each document in the search document data set into at least one paragraph, and generating a paragraph library according to the at least one paragraph; encoding each paragraph in the paragraph library to obtain the encoding vector corresponding to each paragraph; generating paragraph data corresponding to each paragraph according to the encoding vector and the paragraph text corresponding to each paragraph, and constructing an index library according to the paragraph data; and obtaining a target problem, and performing literal recall retrieval and vector recall retrieval on the index library according to the target problem to obtain the target paragraph data corresponding to the target problem. Through this technical scheme, semantic information from multiple views of a document can be fused, the accuracy of the semantic representation of paragraph text is enhanced, the retrieval results of literal recall retrieval and vector recall retrieval are integrated, and the recall rate and accuracy of paragraph retrieval are improved.

Description

Paragraph retrieval method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of information retrieval, in particular to a paragraph retrieval method, a paragraph retrieval device, paragraph retrieval equipment and a storage medium.
Background
Paragraph retrieval is a key component in many natural language processing tasks and an important frontier topic in the fields of natural language processing and artificial intelligence; in recent years it has received extensive attention in academia and industry. From a technical perspective, text retrieval is the basis of NLP (Natural Language Processing) tasks and is widely applied in question-answering systems, reading comprehension and the like. From an enterprise perspective, information retrieval directly relates to the utilization value of enterprise data, affects the efficiency with which staff obtain information, and may even affect the production benefit of the enterprise.
The simplest form of paragraph retrieval finds the paragraphs most literally similar to the question through literal recall; however, this type of approach has difficulty handling complex questions effectively. In recent years, industry and academia have been exploring semantic representation methods for complex questions based on deep learning models, as well as semantic similarity calculation methods oriented to chapters and complex sentences. The current mainstream paragraph retrieval scheme uses a language model to encode the question and the paragraphs separately, calculates the similarity between the question vector and the paragraph vectors for coarse ranking, and then concatenates each paragraph recalled by the coarse ranking with the question for fine ranking to obtain the final retrieval result. Compared with direct text similarity matching, this two-step coarse-ranking-plus-fine-ranking retrieval scheme greatly improves similarity matching and is a widely used architecture at present.
Such methods achieve good results on many data sets, but their performance drops significantly in practical applications. Because document text is long and the text data and formats are complex, paragraphs are difficult to split accurately according to semantics; moreover, existing retrieval methods only consider the paragraph text and ignore the structural information of the document and the global information of the text. In particular, for step-type and flow-type answers, the question and the answer differ greatly both literally and semantically and have no direct or obvious association, making the relationship between the question and the answer text difficult to capture, so that the paragraph retrieval results in practical applications are inaccurate.
Disclosure of Invention
The embodiment of the invention provides a paragraph retrieval method, a device, equipment and a storage medium, which solve the problem that document paragraph retrieval results are inaccurate because existing retrieval methods have difficulty splitting paragraphs according to semantics, consider only the paragraph text, and ignore the structural information of the document and the global information of the text.
According to an aspect of the present invention, there is provided a paragraph retrieval method, including:
constructing a search document data set, dividing paragraphs of each document in the search document data set to obtain at least one paragraph, and generating a paragraph library according to the at least one paragraph;
Coding each paragraph in the paragraph library to obtain a corresponding coding vector of each paragraph;
generating paragraph data corresponding to each paragraph according to the coding vector corresponding to each paragraph and the paragraph text corresponding to each paragraph, and constructing an index library according to the paragraph data;
and acquiring the target problem, and carrying out literal recall retrieval and vector recall retrieval on the index library according to the target problem to obtain target paragraph data corresponding to the target problem.
According to another aspect of the present invention, there is provided a paragraph retrieving apparatus including:
the generation module is used for constructing a search document data set, dividing paragraphs of each document in the search document data set to obtain at least one paragraph, and generating a paragraph library according to the at least one paragraph;
the coding module is used for coding each paragraph in the paragraph library to obtain a coding vector corresponding to each paragraph;
the construction module is used for generating paragraph data corresponding to each paragraph according to the coding vector corresponding to each paragraph and the paragraph text corresponding to each paragraph, and constructing an index library according to the paragraph data;
the obtaining module is used for obtaining the target problem, and carrying out literal recall search and vector recall search on the index library according to the target problem to obtain target paragraph data corresponding to the target problem.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the paragraph retrieval method according to any of the embodiments of the invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to execute a paragraph retrieval method according to any of the embodiments of the present invention.
According to the embodiment of the invention, a search document data set is constructed, each document in the search document data set is divided into at least one paragraph, and a paragraph library is generated according to the at least one paragraph; each paragraph in the paragraph library is encoded to obtain the encoding vector corresponding to each paragraph; paragraph data corresponding to each paragraph are generated according to the encoding vector and the paragraph text corresponding to each paragraph, and an index library is constructed according to the paragraph data; and a target problem is obtained, and literal recall retrieval and vector recall retrieval are performed on the index library according to the target problem to obtain the target paragraph data corresponding to the target problem. This solves the problem that document paragraph retrieval results are inaccurate because existing retrieval methods have difficulty splitting paragraphs accurately according to semantics, consider only the paragraph text, and ignore the structural information of the document and the global information of the text; it fuses the multi-view semantic information of the document, enhances the accuracy of the semantic representation of paragraph text, integrates the retrieval results of literal recall retrieval and vector recall retrieval, and improves the recall rate and accuracy of paragraph retrieval.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a paragraph retrieving method according to a first embodiment of the present invention;
FIG. 2 is a flow chart of another paragraph retrieval method according to the first embodiment of the present invention;
FIG. 3 is a schematic diagram of a paragraph retrieving device according to a second embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device in a third embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It will be appreciated that prior to using the technical solutions disclosed in the embodiments of the present disclosure, the user should be informed and authorized of the type, usage range, usage scenario, etc. of the personal information related to the present disclosure in an appropriate manner according to the relevant legal regulations.
Example 1
Fig. 1 is a flowchart of a paragraph retrieving method in a first embodiment of the present invention, where the present embodiment is applicable to a case of retrieving paragraphs of an enterprise document, the method may be performed by a paragraph retrieving device in the embodiment of the present invention, and the device may be implemented in a software and/or hardware manner, as shown in fig. 1, and the method specifically includes the following steps:
S110, constructing a search document data set, dividing paragraphs of each document in the search document data set to obtain at least one paragraph, and generating a paragraph library according to the at least one paragraph.
Wherein the retrieved document dataset comprises at least one document. The paragraph library comprises at least one paragraph.
Specifically, the manner of constructing a search document data set, performing paragraph division on each document in the search document data set to obtain at least one paragraph, and generating a paragraph library according to the at least one paragraph may be as follows: the search document data set may be obtained from an enterprise document library, and mainly comprises product description documents, version upgrade documents, best practice manuals, usage guides, operation descriptions, common fault handling, problem handling work orders and the like, involving documents in multiple formats such as docx, doc, pptx, ppt, txt and pdf; the search document data set is read and parsed, the text content in each document is divided according to a preset text length to obtain the divided paragraphs of each document, and the paragraph library is generated according to the divided paragraphs of each document.
S120, each paragraph in the paragraph library is encoded, and a corresponding encoding vector of each paragraph is obtained.
It should be noted that the corresponding encoding vector is different for each paragraph.
Specifically, the manner of encoding each paragraph in the paragraph library to obtain the encoding vector corresponding to each paragraph may be: each paragraph may be encoded through a document-vector model (Doc2Vec) to obtain the encoding vector corresponding to each paragraph. Alternatively, a model suitable for semantically encoding enterprise document text may be trained on an enterprise document data set, and the paragraph text, the document name and the chapter title name corresponding to each paragraph may be encoded based on this model to obtain the encoding vector corresponding to each paragraph. For example, a SimCSE model is obtained through supervised training on the enterprise document data set, and the paragraph text, document name and chapter title name corresponding to each paragraph are encoded with the trained SimCSE model to obtain the encoding vector corresponding to each paragraph.
S130, generating paragraph data corresponding to each paragraph according to the coding vector corresponding to each paragraph and the paragraph text corresponding to each paragraph, and constructing an index base according to the paragraph data.
The paragraph data can include a coding vector corresponding to each paragraph and a paragraph text corresponding to the paragraph, and can also include a document name and a chapter title name corresponding to each paragraph, so that the paragraph data can fuse semantic information of multiple views such as global information and local information of the document. The index library includes at least one paragraph data that can be used to subsequently retrieve the index library based on the user's target question.
Specifically, the method for generating paragraph data corresponding to each paragraph according to the coding vector corresponding to each paragraph and the paragraph text corresponding to each paragraph, and constructing the index library according to the paragraph data may be: and correlating the coding vector corresponding to each paragraph with the paragraph text corresponding to the paragraph to obtain paragraph data corresponding to each paragraph, and constructing an index library according to the paragraph data corresponding to each paragraph.
And S140, acquiring a target problem, and carrying out literal recall search and vector recall search on the index library according to the target problem to obtain target paragraph data corresponding to the target problem.
The target problem is a problem obtained according to the actual requirement of the user. The target paragraph data is paragraph retrieval results corresponding to the target problems proposed by the user.
Literal recall retrieval recalls directly according to the literal content or keywords of the target problem, and the paragraph text in the recalled paragraph data is directly related to those words. It has high representation and indexing performance, strong interpretability, controllability and stability, and its recall effect is especially pronounced for short texts without context. It is particularly suitable for the products, tools and related professional terms frequently involved in enterprise document paragraph retrieval, where keyword matching can achieve good results; for example, paragraph data can be recalled according to the BM25 algorithm during literal recall retrieval. Vector recall retrieval represents the target problem as a low-dimensional vector and searches the index library for vectors close to the low-dimensional vector corresponding to the target problem; it has strong generalization capability, low sensitivity to changes in paragraph text, and strong robustness.
Specifically, the manner of obtaining the target problem, and performing literal recall retrieval and vector recall retrieval on the index library according to the target problem to obtain the target paragraph data corresponding to the target problem, may be as follows: the target problem is obtained according to the actual requirement of the user, and literal recall retrieval and vector recall retrieval are performed on the index library based on the target problem respectively; during literal recall retrieval, the top preset number of paragraph data most relevant to the problem are determined as the paragraph set corresponding to literal recall retrieval; during vector recall retrieval, the top preset number of paragraph data most relevant to the problem are determined as the paragraph set corresponding to vector recall retrieval; and duplicate paragraphs in the two paragraph sets are de-duplicated to obtain the target paragraph data corresponding to the target problem.
In this embodiment, a search document data set is constructed, each document in it is divided into at least one paragraph and a paragraph library is generated; each paragraph in the paragraph library is encoded to obtain its encoding vector; paragraph data are generated from the encoding vector and the paragraph text of each paragraph, and an index library is constructed from the paragraph data; the target problem is obtained, and literal recall retrieval and vector recall retrieval are performed on the index library according to the target problem to obtain the target paragraph data corresponding to the target problem. In this way the index library can be established quickly, the multi-view semantic information of the documents is fused, the accuracy of paragraph text semantic representation is enhanced, literal recall retrieval and vector recall retrieval complement each other's strengths, the target paragraph data corresponding to the target problem are obtained from the retrieval results of both, and the recall rate and accuracy of paragraph retrieval are improved.
Optionally, performing paragraph division on each document in the retrieved document data set to obtain at least one paragraph, and generating a paragraph library according to the at least one paragraph includes:
Reading the document name of each document and the chapter title name of each document in the search document data set;
performing chapter division on the documents according to the chapter title names of each document to obtain at least one chapter;
acquiring the text length and the preset text length of each sentence in at least one chapter;
and dividing the at least one section according to the text length of each sentence and the preset text length to obtain at least one section in the at least one section, and generating a section library according to the at least one section.
Wherein, the document name may represent an object of the document, such as "xx configuration guide", "xxx best practice", "xxx instruction manual", "xxx description", etc.; the chapter title names are titles in the document, such as "1.2.4 cluster installation successful detection". The preset text length can be the maximum length which can be processed by the model, and can be set according to actual requirements.
Specifically, the manner of reading the document name of each document and the chapter title name of each document in the search document data set may be: the document name of each document in the search document data set is read and parsed, and the retrieval range of documents can be quickly narrowed through the document name; the chapter title names in each document are read and parsed.
Specifically, since the content under different chapter title names usually describes different matters, each document is divided into chapters according to its chapter title names to obtain at least one chapter.
Specifically, the manner of acquiring the text length of each sentence in the at least one chapter and the preset text length may be: the preset text length is set according to actual requirements; each sentence under each chapter is determined according to the punctuation marks in the text content of the chapter, and the text length of each sentence is obtained.
Specifically, the at least one chapter is divided according to the text length of each sentence and the preset text length to obtain at least one paragraph in the at least one chapter, and the paragraph library is generated according to the at least one paragraph. For example, if the preset text length is 500 and chapter A contains sentences with text lengths [200, 80, 250, 90, 30]: the accumulated length after the first sentence is 200 < 500, so it stays in the current paragraph; after the second sentence it is 280 < 500, so the second sentence is also kept; after the third sentence it would be 530 > 500, so the third sentence is taken as the first sentence of the next paragraph. Each chapter is divided in this way, and the paragraph library is generated according to the resulting paragraphs.
By reading the document name and the chapter title names of each document in the search document data set, dividing each document into chapters according to its chapter title names, obtaining the text length of each sentence in the at least one chapter and the preset text length, and dividing the at least one chapter according to the text length of each sentence and the preset text length to obtain at least one paragraph and generate the paragraph library, the problems of high encoding difficulty and inaccurate semantic expression caused by the generally long text of enterprise documents are solved; the text content of the whole document can be divided reasonably into multiple paragraphs, which facilitates subsequent paragraph retrieval according to the target problem.
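As an illustration of the division rule described above, the following minimal sketch (not part of the patent) splits one chapter into paragraphs by accumulating sentence lengths against a preset text length; the function name, the punctuation-based sentence splitting and the default limit of 500 are assumptions.

```python
import re

def split_chapter(chapter_text: str, max_len: int = 500) -> list[str]:
    """Split one chapter into paragraphs whose total text length stays within max_len.

    Sentences are accumulated in order; when adding the next sentence would exceed
    max_len, that sentence starts a new paragraph instead (matching the
    [200, 80, 250, 90, 30] example above).
    """
    # Treat Chinese and Western sentence-final punctuation as sentence boundaries.
    sentences = [s for s in re.split(r"(?<=[。！？!?；;])", chapter_text) if s.strip()]
    paragraphs, current, current_len = [], [], 0
    for sentence in sentences:
        if current and current_len + len(sentence) > max_len:
            paragraphs.append("".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += len(sentence)
    if current:
        paragraphs.append("".join(current))
    return paragraphs
```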
Optionally, encoding each paragraph in the paragraph library to obtain an encoding vector corresponding to each paragraph, including:
acquiring a training document data set, and acquiring a question and paragraph pair corresponding to each document in the training document data set;
determining a positive sample set and a negative sample set according to the question and paragraph pairs corresponding to each document;
constructing a model to be trained, carrying out iterative training on the model to be trained according to a first preset loss function based on a training document data set until an iteration ending condition is met, and obtaining a first model;
Based on the positive sample set and the negative sample set, performing iterative training on the first model according to a second preset loss function until an iteration ending condition is met, and obtaining a target model;
and coding each paragraph in the paragraph library according to the target model, the paragraph text corresponding to each paragraph in the paragraph library, the document name corresponding to each paragraph and the chapter title name corresponding to each paragraph to obtain a coding vector corresponding to each paragraph.
Wherein each enterprise document in the training document dataset may be the same as or different from each enterprise document in the retrieval document dataset.
It should be noted that many chapter title names in enterprise documents can be directly regarded as questions, and a question and paragraph pair is constructed from the chapter title name (optionally prefixed with the object referred to in the document name) and the paragraph text that answers it. For example, the question may be: AnyBackup power-on and network configuration; the paragraph: step 1: xx; step 2: xx.
The positive sample set is the correct question and paragraph pair, and the negative sample set is the wrong question and paragraph pair.
The model to be trained may be a BERT model or another model, which is not limited here. The first model is a trained model obtained from the model to be trained and the training document data set, and may be an unsupervised SimCSE model; the target model is a trained model obtained from the first model, the positive sample set and the negative sample set, and may be a supervised SimCSE model. The first preset loss function and the second preset loss function may be set according to the training process. Because there are large differences between enterprise document data and general-domain data (for example, enterprise documents involve many product names, computer-network terms, Linux system commands and other domain-specific or professional descriptions), a general-domain model to be trained has difficulty understanding the semantics of enterprise vocabulary; therefore, the model to be trained needs to be trained on the enterprise document data, i.e. the training document data set, so that the first model better fits enterprise documents.
It should be noted that the SimCSE model introduces the idea of contrastive learning into text matching, i.e., it pulls similar samples closer and pushes dissimilar samples apart. Since the model to be trained usually uses a dropout mechanism during training, feeding the same sample through the model twice produces two different numerical vectors. Because the same sample is highly similar to itself, the distance between the two numerical vectors output by the model to be trained should be as small as possible; conversely, the embeddings obtained from different input samples after passing through the model to be trained should be pushed as far apart as possible.
Specifically, the manner of acquiring the training document data set and acquiring the question and paragraph pairs corresponding to each document in the training document data set may be: the training document data set is acquired, and the text of each document in the training document data set is placed into the prompt of an LLM (Large Language Model), which constructs the question and paragraph pairs corresponding to each document. It should be noted that, in addition to the wrong question and paragraph pairs obtained from the LLM, the negative sample set may also include wrong question and paragraph pairs formed by taking the correct answer of another question as the answer of the current question.
Specifically, the manner of determining the positive sample set and the negative sample set according to the question and paragraph pairs corresponding to each document may be: the obtained question and paragraph pairs are manually screened; the correct question and paragraph pairs are collected to obtain the positive sample set, and the wrong question and paragraph pairs are collected to obtain the negative sample set.
Specifically, a model to be trained is constructed, iterative training is performed on the model to be trained according to a first preset loss function based on a training document data set until an iteration ending condition is met, and the mode of obtaining the first model can be as follows: the method comprises the steps of constructing a model to be trained, inputting a training document data set into the model to be trained, carrying out back propagation on the model to be trained based on a first preset loss function and the training document data set to obtain the model to be trained for the next iteration, and entering the next iteration until an iteration ending condition is met to obtain a first model.
Specifically, based on the positive sample set and the negative sample set, performing iterative training on the first model according to a second preset loss function until an iteration ending condition is met, a mode of obtaining a target model may be as follows: and back-propagating the first model based on the positive sample set, the negative sample set and the second preset loss function to obtain a first model for the next iteration, and entering the next iteration until the iteration ending condition is met to obtain a target model.
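The contrastive training referred to above can be illustrated with a generic SimCSE-style in-batch loss; the following sketch assumes a PyTorch encoder and is not the patent's exact first or second preset loss function.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_vecs: torch.Tensor,
                              positive_vecs: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """Generic SimCSE-style in-batch contrastive loss.

    Unsupervised stage: positive_vecs come from a second forward pass of the same
    sentences (dropout yields a different vector for the same input).
    Supervised stage: positive_vecs come from the paragraph paired with each
    question; the other rows in the batch act as negatives.
    """
    anchor = F.normalize(anchor_vecs, dim=-1)
    positive = F.normalize(positive_vecs, dim=-1)
    sim = anchor @ positive.T / temperature                  # (batch, batch) cosine similarities
    labels = torch.arange(sim.size(0), device=sim.device)    # the diagonal holds the positives
    return F.cross_entropy(sim, labels)
```

In the supervised stage, wrong question and paragraph pairs from the negative sample set would typically be appended as additional similarity columns (hard negatives).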
Specifically, the method for coding each paragraph in the paragraph library according to the target model, the paragraph text corresponding to each paragraph in the paragraph library, the document name corresponding to each paragraph and the chapter title name corresponding to each paragraph may be as follows: when each document in the search document data set is subjected to paragraph division to generate a paragraph library, obtaining a paragraph text corresponding to each paragraph, a document name corresponding to each paragraph and a chapter title name corresponding to each paragraph in the paragraph library, and carrying out semantic coding on the paragraph text corresponding to each paragraph, the document name corresponding to each paragraph and the chapter title name corresponding to each paragraph in the paragraph library based on the obtained target model to obtain a coding vector corresponding to each paragraph.
The first model is obtained through the training document data set, and the target model is obtained through the positive sample set, the negative sample set and the first model, so that the target model is better suited to enterprise document data, which further improves the accuracy of the vector encoding of each paragraph in the paragraph library by the target model.
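As a sketch of how the trained target model might be applied to the multi-view input, the following assumes a sentence-transformers style encoder standing in for the trained SimCSE model; the checkpoint name and the plain-space concatenation of document name, chapter title name and paragraph text are assumptions.

```python
from sentence_transformers import SentenceTransformer

# Stand-in for the trained target model; the checkpoint name is an assumption.
encoder = SentenceTransformer("a-trained-simcse-checkpoint")

def encode_paragraph(doc_name: str, chapter_title: str, paragraph_text: str):
    """Fuse the global view (document name), semi-local view (chapter title name)
    and local view (paragraph text) into one input before encoding."""
    fused_text = f"{doc_name} {chapter_title} {paragraph_text}"
    return encoder.encode(fused_text, normalize_embeddings=True)
```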
Optionally, generating paragraph data corresponding to each paragraph according to the coding vector corresponding to each paragraph and the paragraph text corresponding to each paragraph includes:
Acquiring a document name corresponding to each paragraph, a chapter title name corresponding to each paragraph and a paragraph identifier divided in the chapter where each paragraph is located;
and generating paragraph data corresponding to each paragraph according to the coding vector corresponding to each paragraph, the paragraph text corresponding to each paragraph, the document name corresponding to each paragraph, the chapter title name corresponding to each paragraph and the paragraph identifier.
The paragraph identifier of each paragraph within its chapter can be the serial number assigned when the chapter is divided; it can be a numeric identifier or a letter identifier, for example a paragraph id.
Specifically, the document name corresponding to each paragraph, the chapter title name corresponding to each paragraph, and the paragraph identifier of each paragraph within its chapter are obtained. For example, the document name corresponding to each paragraph is obtained when dividing the search document data set, such as "AnyBackup5.0.3.0 production instruction manual"; the chapter title name corresponding to each paragraph may be the deepest-level chapter title, such as "1.2.4 cluster installation success detection"; and the paragraph identifier of each paragraph within its chapter, such as a paragraph id, is obtained, where the chapter a paragraph belongs to is identified by its chapter title name.
Specifically, the manner of generating the paragraph data corresponding to each paragraph according to the encoding vector, the paragraph text, the document name, the chapter title name and the paragraph identifier corresponding to each paragraph may be: after the encoding vector corresponding to each paragraph is obtained, the encoding vector, the paragraph text, the document name, the chapter title name and the paragraph identifier corresponding to each paragraph are combined into complete paragraph data, i.e. the paragraph data corresponding to each paragraph are generated, and the index library is constructed according to the obtained paragraph data corresponding to each paragraph.
Paragraph data corresponding to each paragraph are generated according to the encoding vector, the paragraph text, the document name, the chapter title name and the paragraph identifier corresponding to each paragraph, and the index library is then constructed. In this way, semantic information from different views is taken into account, including the global information of the paragraph (the document name), the semi-local information of the paragraph (the chapter title name), the local information of the paragraph (the paragraph text and the paragraph identifier) and the encoding vector corresponding to the paragraph, so that complete paragraph data can be obtained when a problem is later queried against the index library.
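One way such paragraph data and an index library could look is sketched below; the field names and the use of FAISS as the vector index are assumptions, since the patent does not name a specific index engine.

```python
import numpy as np
import faiss  # assumed vector index backend; the patent does not name one

def build_index(paragraph_records):
    """paragraph_records: list of dicts holding the encoding vector, paragraph text,
    document name, chapter title name and paragraph identifier of each paragraph."""
    vectors = np.stack([r["vector"] for r in paragraph_records]).astype("float32")
    faiss.normalize_L2(vectors)                  # so inner product equals cosine similarity
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index, paragraph_records              # row i of the index maps to paragraph_records[i]

# Example of one paragraph-data record (field names and dimension are illustrative):
record = {
    "vector": np.random.rand(768).astype("float32"),
    "text": "Step 1: xx; Step 2: xx",
    "doc_name": "AnyBackup5.0.3.0 production instruction manual",
    "chapter_title": "1.2.4 cluster installation success detection",
    "paragraph_id": 0,
}
```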
Optionally, performing literal recall search and vector recall search on the index library according to the target problem to obtain target paragraph data corresponding to the target problem, including:
obtaining a synonym library, wherein the synonym library comprises: at least one noun abbreviation and noun full name;
expanding each noun abbreviation in the target problem based on the synonym library to obtain a first problem;
and carrying out literal recall retrieval and vector recall retrieval on the index library according to the first problem to obtain target paragraph data corresponding to the target problem.
The synonym library is a synonym library preset according to enterprise documents, and includes at least one noun abbreviation and its corresponding full name; for example, the abbreviation AS with the full name AnyShare, and the abbreviation AD with the full name AnyDATA.
The first question is a question obtained by expanding a term in the target question.
Specifically, the manner of obtaining the synonym library may be: the synonym library may be built in advance from shorthand and full names of nouns in the enterprise document.
Specifically, expanding each noun abbreviation in the target problem based on the synonym library to obtain a first problem, for example, the target problem may be an AD installation deployment, and the first problem is an AnyDATA installation deployment.
Specifically, the manner of performing literal recall search and vector recall search on the index library according to the first problem to obtain the target paragraph data corresponding to the target problem may be: and respectively carrying out literal recall search and vector recall search on the index library according to the first problem to obtain literal recall search paragraph data and vector recall search paragraph data, and carrying out de-duplication processing on the literal recall search paragraph data and the vector recall search paragraph data to obtain target paragraph data corresponding to the target problem.
Each noun abbreviation in the target problem is expanded through the synonym library to obtain the first problem, and literal recall retrieval and vector recall retrieval are performed on the index library according to the first problem to obtain the target paragraph data corresponding to the target problem; handling noun abbreviations in the target problem in this way improves the accuracy of obtaining the paragraph data matching the target problem.
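A minimal sketch of the abbreviation expansion, assuming a hand-built synonym dictionary like the AS/AnyShare and AD/AnyDATA examples above; the function name and the regular-expression word boundaries are illustrative.

```python
import re

# Hand-built synonym library: noun abbreviation -> full name (entries are illustrative).
SYNONYMS = {"AS": "AnyShare", "AD": "AnyDATA"}

def expand_question(target_question: str) -> str:
    """Replace each known noun abbreviation with its full name to form the first problem."""
    first_question = target_question
    for abbr, full in SYNONYMS.items():
        # \b keeps 'AD' from matching inside longer tokens such as 'LOAD'.
        first_question = re.sub(rf"\b{re.escape(abbr)}\b", full, first_question)
    return first_question

# expand_question("AD installation deployment") -> "AnyDATA installation deployment"
```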
Optionally, performing literal recall search and vector recall search on the index library according to the first problem to obtain target paragraph data corresponding to the target problem, including:
performing literal recall retrieval on the index library according to the first problem to obtain a first similarity between the first problem and each paragraph data in the index library;
sorting the paragraph data from high to low according to the first similarity, and determining the top preset number of paragraph data as the first paragraph set corresponding to literal recall retrieval;
performing vector recall retrieval on the index library according to the first problem to obtain a second similarity between the first problem and each paragraph data in the index library;
sorting the paragraph data from high to low according to the second similarity, and determining the top preset number of paragraph data as the second paragraph set corresponding to vector recall retrieval;
summarizing the first paragraph set and the second paragraph set and performing de-duplication to obtain a target paragraph set;
and obtaining the target paragraph data corresponding to the target problem according to the target paragraph set.
The first similarity is the similarity between the first problem and each paragraph data during literal recall retrieval; the second similarity is the similarity between the first problem and each paragraph data during vector recall retrieval. The preset number of top-ranked paragraph data can be set according to actual requirements. The first paragraph set contains the paragraph data retrieved through literal recall, and the second paragraph set contains the paragraph data retrieved through vector recall. The target paragraph set is the paragraph set obtained by summarizing and de-duplicating the first paragraph set and the second paragraph set.
Specifically, the method for performing a literal recall search on the index library according to the first problem to obtain the first similarity between the first problem and the data of each section in the index library may be: and carrying out literal recall retrieval on the index library according to the first problem, and calculating the first similarity between the first problem and the text in each paragraph of data in the index library through a BM25 algorithm.
Specifically, the manner of sorting the paragraph data from high to low according to the first similarity and determining the top preset number of paragraph data as the first paragraph set corresponding to literal recall retrieval may be: each paragraph data in the index library is sorted from high to low according to the first similarity, the top preset number of paragraph data most relevant to the first problem are selected (a higher ranking indicates a higher relevance to the first problem), and the first paragraph set is generated from this top preset number of paragraph data; for example, during literal recall retrieval, the first paragraph set is generated from the top 100 paragraph data most relevant to the first problem.
Specifically, the manner of performing vector recall retrieval on the index library according to the first problem to obtain the second similarity between the first problem and each paragraph data in the index library may be: the first problem is encoded into a vector, and the second similarity between the encoding vector corresponding to the first problem and the encoding vector corresponding to each paragraph in each paragraph data in the index library is calculated through cosine similarity.
Specifically, the manner of sorting the paragraph data from high to low according to the second similarity and determining the top preset number of paragraph data as the second paragraph set corresponding to vector recall retrieval may be: each paragraph data in the index library is sorted from high to low according to the second similarity, the top preset number of paragraph data most relevant to the first problem are selected (a higher ranking indicates a higher relevance to the first problem), and the second paragraph set is generated from this top preset number of paragraph data; for example, during vector recall retrieval, the second paragraph set is generated from the top 100 paragraph data most relevant to the first problem.
Specifically, the method for summarizing the first paragraph set and the second paragraph set and performing deduplication processing to obtain the target paragraph set may be: and after the first paragraph set and the second paragraph set are summarized, repeated paragraph data in the first paragraph set and the second paragraph set are obtained, and repeated paragraph data are subjected to de-duplication processing, so that a target paragraph set is obtained.
Specifically, the manner of obtaining the target paragraph data corresponding to the target problem according to the target paragraph set may be: each paragraph data in the target paragraph set is directly determined as the target paragraph data corresponding to the target problem, and can be directly returned to the user. The method for obtaining the target paragraph data corresponding to the target problem according to the target paragraph set may further be: and integrating the paragraph data according to the paragraph identifier in each paragraph data in the target paragraph set to obtain integrated paragraph data, and determining the integrated paragraph data as target paragraph data corresponding to the target problem.
By taking the top preset number of paragraph data from the literal recall retrieval and the top preset number of paragraph data from the vector recall retrieval and obtaining the target paragraph set after de-duplication, the answer range corresponding to the target problem can be narrowed quickly.
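The dual-recall step might be sketched as follows, assuming the rank_bm25 package for the literal channel and a normalized FAISS index (as in the earlier index sketch) for the vector channel; the whitespace tokenization and top_k of 100 are simplifying assumptions.

```python
import numpy as np
import faiss                      # assumed vector index backend (see earlier sketch)
from rank_bm25 import BM25Okapi   # assumed BM25 implementation; any BM25 scorer works

def dual_recall(first_question, question_vector, records, faiss_index, top_k=100):
    """Literal recall (BM25) plus vector recall (cosine), then de-duplication."""
    top_k = min(top_k, len(records))

    # Literal channel: first similarity = BM25 score between the first problem and each
    # paragraph text. Whitespace tokenization is a placeholder; Chinese text would
    # need a word segmenter such as jieba.
    bm25 = BM25Okapi([r["text"].split() for r in records])
    literal_scores = bm25.get_scores(first_question.split())
    literal_top = np.argsort(literal_scores)[::-1][:top_k]

    # Vector channel: second similarity = cosine similarity via the normalized index.
    query = np.asarray([question_vector], dtype="float32")
    faiss.normalize_L2(query)
    _, vector_top = faiss_index.search(query, top_k)

    # Summarize the two paragraph sets and drop duplicates (keep first occurrence).
    seen, target_set = set(), []
    for idx in list(literal_top) + list(vector_top[0]):
        if idx >= 0 and idx not in seen:
            seen.add(idx)
            target_set.append(records[idx])
    return target_set
```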
Optionally, obtaining the target paragraph data corresponding to the target problem according to the target paragraph set includes:
splicing the first problem with each paragraph data in the target paragraph set to obtain a first splicing result;
performing score calculation according to the first splicing result to obtain a target score corresponding to the first splicing result;
sorting the data of each paragraph in the target paragraph set from high to low according to the target score to obtain a first sorting result;
integrating paragraph data which are positioned under the same document name and the same chapter title name in the first sequencing result, and updating the first sequencing result according to the integrated paragraph data;
and obtaining the target paragraph data corresponding to the target problem according to the updated first sequencing result.
The first splicing result is the result obtained by splicing the first problem with the paragraph text in each paragraph data. The target score is the score obtained after fine ranking according to the first splicing result. The first sorting result is the ranking of each paragraph data from high to low according to the target score.
Specifically, the way to splice the first problem with each paragraph data in the target paragraph set to obtain the first splicing result may be: and splicing the first problem with the paragraph text of each paragraph data in the target paragraph set to obtain a first splicing result.
Specifically, the manner of calculating the score according to the first splicing result to obtain the target score corresponding to the first splicing result may be: a paragraph fine-ranking model based on an interactive model (cross-encoder) calculates, for each first splicing result, the score of the first problem and the paragraph text of the corresponding paragraph data, and the target score corresponding to the first splicing result is obtained.
Specifically, the manner of sorting the data of each paragraph in the target paragraph set from high to low according to the target score to obtain the first sorting result may be: and sorting paragraph data from high to low based on the target score corresponding to each paragraph data, and obtaining a first sorting result.
Specifically, the manner of integrating the paragraph data under the same document name and the same chapter title name in the first sorting result and updating the first sorting result according to the integrated paragraph data may be: the paragraph data are integrated based on the document name, chapter title name and paragraph identifier in each paragraph data of the first sorting result; paragraph texts under the same document name and the same chapter title name are spliced in order of paragraph identifier (paragraph id) to obtain complete paragraph data; and the first sorting result is updated according to the integrated paragraph data, for example by presetting a weight for each paragraph data (the higher the original score, the larger the weight) and then updating the first sorting result according to the weight corresponding to each paragraph data.
Specifically, the manner of obtaining the target paragraph data corresponding to the target problem according to the updated first sorting result may be: and returning the spliced complete paragraph data to the user in sequence from high to low according to the updated first sequencing result, wherein the spliced complete paragraph data is target paragraph data corresponding to the target problem.
It should be noted that, after the target paragraph data are obtained, the spliced complete paragraph data can be returned to the user from high to low according to the updated first sorting result. When the complete paragraph data are returned, the gns of the document in which each paragraph text is located can be carried, and the document in the search document data set can be linked directly through the gns; the chapter title name, the document name and the gns of the document in the paragraph data improve the confidence of the paragraph retrieval corresponding to the target problem, so that the user can trace the source directly.
Each paragraph data in the target paragraph set is spliced with the first problem to obtain the first splicing results; scores are calculated according to the first splicing results to obtain the target scores; and the paragraph data in the target paragraph set are sorted from high to low according to the target scores to obtain the first sorting result. This addresses the lack of semantic interaction between the user's target problem and the paragraphs in the recall stage: the target scores are obtained from the first splicing results, the sorting result is obtained from the target scores, and the paragraphs most relevant to the target problem are presented to the user first. The paragraph data under the same document name and the same chapter title name in the first sorting result are then integrated, the first sorting result is updated according to the integrated paragraph data, and the target paragraph data corresponding to the target problem are obtained according to the updated first sorting result. Reassembling and aggregating the retrieved paragraph data ensures the completeness of the answer corresponding to the target problem and avoids a second search by the user for missing information, which greatly improves the recall rate and accuracy of retrieval and further improves the efficiency with which enterprise staff obtain information.
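A sketch of the fine-ranking and paragraph-reassembly step, assuming a sentence-transformers CrossEncoder as the interactive model; the checkpoint name is illustrative, and keeping the best original rank per group is a simplification of the weight-based re-ordering described above.

```python
from collections import defaultdict
from sentence_transformers import CrossEncoder

# Stand-in interactive (cross-encoder) fine-ranking model; the checkpoint name is an assumption.
reranker = CrossEncoder("a-trained-cross-encoder-checkpoint")

def rerank_and_merge(first_question, target_set):
    """Score each (question, paragraph) pair, sort by target score, then splice
    paragraphs sharing the same document name and chapter title name back
    together in paragraph-id order."""
    scores = reranker.predict([(first_question, r["text"]) for r in target_set])
    ranked = [r for _, r in sorted(zip(scores, target_set), key=lambda x: -x[0])]

    groups = defaultdict(list)   # (doc_name, chapter_title) -> [(rank, record), ...]
    for rank, record in enumerate(ranked):
        groups[(record["doc_name"], record["chapter_title"])].append((rank, record))

    merged = []
    for (doc_name, chapter_title), members in groups.items():
        in_order = sorted(members, key=lambda m: m[1]["paragraph_id"])
        merged.append({
            "doc_name": doc_name,
            "chapter_title": chapter_title,
            "text": "".join(r["text"] for _, r in in_order),
            "best_rank": min(rank for rank, _ in members),  # simplification of the weight update
        })
    return sorted(merged, key=lambda m: m["best_rank"])
```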
In a specific example, fig. 2 is a flowchart of another paragraph retrieval method in the first embodiment of the present invention. As shown in fig. 2, the index library is built offline (dashed box in fig. 2) and paragraph retrieval is performed online by the user (solid box in fig. 2). When building the index library offline, the text of each document in the search document data set is divided into paragraphs, the paragraph library is obtained from the divided paragraphs, each paragraph in the paragraph library is encoded to obtain its encoding vector, paragraph data corresponding to each paragraph are generated from the encoding vector together with the paragraph text, the document name, the chapter title name and the paragraph identifier from the paragraph division, and the index library is constructed from the paragraph data corresponding to each paragraph. When an online user performs paragraph retrieval, the user's target problem is obtained and expanded based on the synonym library; literal recall retrieval and vector recall retrieval are then performed on the index library with the expanded target problem; the retrieval results of the literal recall retrieval and the vector recall retrieval are summarized and de-duplicated to obtain the target paragraph data; each paragraph in the target paragraph data is spliced with the target problem, the score of each spliced paragraph and the target problem is calculated with the interactive model, the paragraphs are sorted from high to low by score, and paragraph texts under the same document name and the same chapter title name are spliced by paragraph identifier (paragraph id) to obtain complete paragraph data, which are returned to the user.
According to the technical scheme of this embodiment, a search document data set is constructed, each document in the search document data set is divided into at least one paragraph, and a paragraph library is generated according to the at least one paragraph; each paragraph in the paragraph library is encoded to obtain the encoding vector corresponding to each paragraph; paragraph data corresponding to each paragraph are generated according to the encoding vector and the paragraph text corresponding to each paragraph, and an index library is constructed according to the paragraph data; and a target problem is obtained, and literal recall retrieval and vector recall retrieval are performed on the index library according to the target problem to obtain the target paragraph data corresponding to the target problem. This solves the problem that document paragraph retrieval results are inaccurate because existing retrieval methods have difficulty splitting paragraphs accurately according to semantics, consider only the paragraph text, and ignore the structural information of the document and the global information of the text; it fuses the multi-view semantic information of the document, enhances the accuracy of the semantic representation of paragraph text, integrates the retrieval results of literal recall retrieval and vector recall retrieval, and improves the recall rate and accuracy of paragraph retrieval.
Example two
Fig. 3 is a schematic structural diagram of a paragraph retrieving device in a second embodiment of the present invention. The embodiment may be applicable to the case of paragraph retrieval of enterprise documents, and the apparatus may be implemented in software and/or hardware, and may be integrated in any device that provides a function of paragraph retrieval, as shown in fig. 3, where the paragraph retrieval apparatus specifically includes: a generation module 210, an encoding module 220, a construction module 230, and an obtaining module 240.
The generating module 210 is configured to construct a search document data set, and segment each document in the search document data set to obtain at least one paragraph, and generate a paragraph library according to the at least one paragraph;
the encoding module 220 is configured to encode each paragraph in the paragraph library to obtain an encoding vector corresponding to each paragraph;
a construction module 230, configured to generate paragraph data corresponding to each paragraph according to the coding vector corresponding to each paragraph and the paragraph text corresponding to each paragraph, and construct an index library according to the paragraph data;
the obtaining module 240 is configured to obtain a target problem, and perform literal recall search and vector recall search on the index library according to the target problem, so as to obtain target paragraph data corresponding to the target problem.
Optionally, the generating module is specifically configured to:
reading the document name of each document and the chapter title name of each document in the search document data set;
performing chapter division on the documents according to the chapter title names of each document to obtain at least one chapter;
acquiring the text length of each sentence in the at least one chapter and a preset text length;
and dividing the at least one chapter according to the text length of each sentence and the preset text length to obtain at least one paragraph in the at least one chapter, and generating a paragraph library according to the at least one paragraph; a minimal splitting sketch is given below.
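As a rough illustration of this chapter-and-length-based division (not the patent's actual implementation), the sketch below splits a document into chapters by matching heading lines, then packs whole sentences into paragraphs whose total length stays within a preset text length. The heading pattern, the default length of 300 characters and the field names are all assumptions.

```python
import re

HEADING_PATTERN = re.compile(r"^(第.+[章节]|\d+(\.\d+)*)\s+\S+")   # assumed chapter-title pattern
SENTENCE_PATTERN = re.compile(r"[^。！？!?]*[。！？!?]?")            # crude sentence splitter

def split_document(document_name, text, preset_text_length=300):
    paragraphs = []
    chapter_title = ""
    buffer, buffer_length, paragraph_id = [], 0, 0

    def flush():
        # Close the current paragraph and record it with its chapter and identifier.
        nonlocal buffer, buffer_length, paragraph_id
        if buffer:
            paragraphs.append({
                "document_name": document_name,
                "chapter_title": chapter_title,
                "paragraph_id": paragraph_id,
                "paragraph_text": "".join(buffer),
            })
            paragraph_id += 1
            buffer, buffer_length = [], 0

    for line in text.splitlines():
        if HEADING_PATTERN.match(line.strip()):
            # A new chapter starts: close the open paragraph and restart the per-chapter counter.
            flush()
            chapter_title = line.strip()
            paragraph_id = 0
            continue
        for sentence in filter(None, SENTENCE_PATTERN.findall(line)):
            if buffer_length + len(sentence) > preset_text_length:
                flush()
            buffer.append(sentence)
            buffer_length += len(sentence)
    flush()
    return paragraphs
```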
Optionally, the encoding module is specifically configured to:
acquiring a training document data set, and acquiring a question and paragraph pair corresponding to each document in the training document data set;
determining a positive sample set and a negative sample set according to the question and paragraph pairs corresponding to each document;
constructing a model to be trained, carrying out iterative training on the model to be trained according to a first preset loss function based on a training document data set until an iteration ending condition is met, and obtaining a first model;
based on the positive sample set and the negative sample set, performing iterative training on the first model according to a second preset loss function until an iteration ending condition is met, and obtaining a target model;
and coding each paragraph in the paragraph library according to the target model, the paragraph text corresponding to each paragraph in the paragraph library, the document name corresponding to each paragraph and the chapter title name corresponding to each paragraph to obtain a coding vector corresponding to each paragraph; an encoding sketch is given below.
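A minimal inference-time sketch of this multi-view encoding, assuming a bi-encoder from the sentence-transformers library stands in for the target model: the two-stage training with the first and second preset loss functions is not shown, and the public checkpoint name is only a placeholder, not the model the patent would actually train.

```python
from sentence_transformers import SentenceTransformer

# Placeholder checkpoint; the patent's target model would come from the two-stage
# fine-tuning described above rather than from a public model.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def encode_paragraphs(paragraphs):
    # Fuse the document name, chapter title and paragraph text into one input string
    # so the encoding vector carries multi-view semantic information (separator is an assumption).
    texts = [
        f'{p["document_name"]} [SEP] {p["chapter_title"]} [SEP] {p["paragraph_text"]}'
        for p in paragraphs
    ]
    vectors = model.encode(texts, normalize_embeddings=True)
    for paragraph, vector in zip(paragraphs, vectors):
        paragraph["encoding_vector"] = vector
    return paragraphs
```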
Optionally, the construction module is specifically configured to:
acquiring a document name corresponding to each paragraph, a chapter title name corresponding to each paragraph and a paragraph identifier divided in the chapter where each paragraph is located;
and generating paragraph data corresponding to each paragraph according to the coding vector corresponding to each paragraph, the paragraph text corresponding to each paragraph, the document name corresponding to each paragraph, the chapter title name corresponding to each paragraph and the paragraph identifier; an index-record sketch is given below.
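One way to picture the index library (purely illustrative; the patent does not prescribe a storage engine) is an in-memory structure holding the full paragraph data plus an inverted index over paragraph tokens for literal recall and a vector matrix for vector recall.

```python
from collections import defaultdict
import numpy as np

def build_index_library(encoded_paragraphs):
    # Each record already carries the encoding vector, paragraph text, document name,
    # chapter title name and paragraph identifier produced in the previous steps.
    records = list(encoded_paragraphs)
    inverted = defaultdict(set)   # token -> record positions, used for literal recall
    for position, record in enumerate(records):
        # Whitespace tokenization is only a stand-in for a real (e.g. Chinese) tokenizer.
        for token in set(record["paragraph_text"].split()):
            inverted[token].add(position)
    vectors = np.stack([record["encoding_vector"] for record in records])  # for vector recall
    return {"records": records, "inverted": inverted, "vectors": vectors}
```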
Optionally, the obtaining module is specifically configured to:
obtaining a synonym library, wherein the synonym library comprises: at least one pair of a noun abbreviation and its corresponding noun full name;
expanding each noun abbreviation in the target problem based on the synonym library to obtain a first problem;
and carrying out literal recall retrieval and vector recall retrieval on the index library according to the first problem to obtain target paragraph data corresponding to the target problem; a query-expansion sketch is given below.
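A trivial sketch of the abbreviation expansion, modeling the synonym library as a mapping from noun abbreviation to noun full name; applying longer abbreviations first so overlapping entries do not clobber each other is an assumption, not a requirement of the patent.

```python
def expand_question(target_problem, synonym_library):
    # synonym_library example: {"KG": "knowledge graph", "NLP": "natural language processing"}
    first_problem = target_problem
    for abbreviation in sorted(synonym_library, key=len, reverse=True):
        first_problem = first_problem.replace(abbreviation, synonym_library[abbreviation])
    return first_problem

# Example: expand_question("How is the KG built?", {"KG": "knowledge graph"})
# -> "How is the knowledge graph built?"
```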
Optionally, the obtaining module is specifically configured to:
performing literal recall search on the index library according to the first problem to obtain a first similarity between the first problem and each paragraph of data in the index library;
sorting paragraph data from high to low according to the first similarity, and determining the top-ranked preset number of paragraph data as a first paragraph set corresponding to the literal recall search;
vector recall retrieval is carried out on the index library according to the first problem, and second similarity between the first problem and each paragraph of data in the index library is obtained;
sorting paragraph data from high to low according to the second similarity, and determining the top-ranked preset number of paragraph data as a second paragraph set corresponding to the vector recall retrieval;
summarizing the first paragraph set and the second paragraph set, and performing de-duplication processing to obtain a target paragraph set;
and obtaining target paragraph data corresponding to the target problem according to the target paragraph set; a dual-recall sketch is given below.
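Continuing the in-memory index sketch above (same assumed field names; a production system would more likely use an inverted-index engine for literal recall and an approximate-nearest-neighbor library for vector recall), the two recalls and the summarize-and-deduplicate step could look like this.

```python
import numpy as np

def literal_recall(index, first_problem, top_n=20):
    # Score each paragraph by the number of query tokens it shares (the first similarity).
    tokens = set(first_problem.split())
    scores = np.zeros(len(index["records"]))
    for token in tokens:
        for position in index["inverted"].get(token, ()):
            scores[position] += 1
    top = np.argsort(-scores)[:top_n]
    return [index["records"][i] for i in top if scores[i] > 0]

def vector_recall(index, problem_vector, top_n=20):
    # Score each paragraph by cosine similarity to the encoded question (the second similarity).
    vectors = index["vectors"]
    sims = vectors @ problem_vector / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(problem_vector) + 1e-9)
    top = np.argsort(-sims)[:top_n]
    return [index["records"][i] for i in top]

def merge_and_deduplicate(first_set, second_set):
    # Summarize both recall results and drop duplicates by (document, chapter, paragraph id).
    seen, target_set = set(), []
    for record in first_set + second_set:
        key = (record["document_name"], record["chapter_title"], record["paragraph_id"])
        if key not in seen:
            seen.add(key)
            target_set.append(record)
    return target_set
```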
Optionally, the obtaining module is specifically configured to:
splicing the first problem with each paragraph data in the target paragraph set to obtain a first splicing result;
performing score calculation according to the first splicing result to obtain a target score corresponding to the first splicing result;
sorting the data of each paragraph in the target paragraph set from high to low according to the target score to obtain a first sorting result;
integrating paragraph data located under the same document name and the same chapter title name in the first sorting result, and updating the first sorting result according to the integrated paragraph data;
and obtaining the target paragraph data corresponding to the target problem according to the updated first sorting result; a reranking and splicing sketch is given below.
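The interactive scoring and splicing steps could be sketched as follows. The CrossEncoder from sentence-transformers merely stands in for the interactive model (the named checkpoint is a placeholder; in the patent the model would be trained on the question-and-paragraph pairs described above), and paragraphs sharing a document name and chapter title are joined in paragraph-identifier order while keeping the rank of each group's best member.

```python
from sentence_transformers import CrossEncoder

# Placeholder checkpoint standing in for the trained interactive model.
interactive_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_and_splice(first_problem, target_set):
    # Splice the question with each paragraph text and score each pair with the interactive model.
    pairs = [(first_problem, record["paragraph_text"]) for record in target_set]
    scores = interactive_model.predict(pairs)
    ranked = [record for _, record in sorted(zip(scores, target_set), key=lambda x: -x[0])]

    # Integrate paragraphs sharing a document name and chapter title, keeping the order of the
    # best-ranked member of each group and joining texts in paragraph-identifier order.
    groups, order = {}, []
    for record in ranked:
        key = (record["document_name"], record["chapter_title"])
        if key not in groups:
            groups[key] = []
            order.append(key)
        groups[key].append(record)

    results = []
    for document_name, chapter_title in order:
        parts = sorted(groups[(document_name, chapter_title)], key=lambda r: r["paragraph_id"])
        results.append({
            "document_name": document_name,
            "chapter_title": chapter_title,
            "paragraph_text": "".join(part["paragraph_text"] for part in parts),
        })
    return results
```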
The above product can execute the method provided by any embodiment of the present invention, and has functional modules and beneficial effects corresponding to the executed method.
Example III
Fig. 4 is a schematic structural diagram of an electronic device in a third embodiment of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in fig. 4, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the ROM 12 or the computer program loaded from the storage unit 18 into the RAM 13. The RAM 13 may also store various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the paragraph retrieval method.
In some embodiments, the paragraph retrieval method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM12 and/or the communication unit 19. When the computer program is loaded into RAM13 and executed by processor 11, one or more steps of the paragraph retrieval method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the paragraph retrieval method in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability in traditional physical hosts and VPS services.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A paragraph retrieval method, comprising:
constructing a search document data set, dividing paragraphs of each document in the search document data set to obtain at least one paragraph, and generating a paragraph library according to the at least one paragraph;
coding each paragraph in the paragraph library to obtain a corresponding coding vector of each paragraph;
generating paragraph data corresponding to each paragraph according to the coding vector corresponding to each paragraph and the paragraph text corresponding to each paragraph, and constructing an index library according to the paragraph data;
And acquiring the target problem, and carrying out literal recall retrieval and vector recall retrieval on the index library according to the target problem to obtain target paragraph data corresponding to the target problem.
2. The method of claim 1, wherein the step of performing a paragraph segmentation on each document in the retrieved document dataset to obtain at least one paragraph, generating a paragraph library from the at least one paragraph, comprises:
reading the document name of each document and the chapter title name of each document in the search document data set;
performing chapter division on the documents according to the chapter title names of each document to obtain at least one chapter;
acquiring the text length of each sentence in the at least one chapter and a preset text length;
and dividing the at least one chapter according to the text length of each sentence and the preset text length to obtain at least one paragraph in the at least one chapter, and generating a paragraph library according to the at least one paragraph.
3. The method of claim 2, wherein encoding each paragraph in the paragraph library to obtain a corresponding encoding vector for each paragraph comprises:
acquiring a training document data set, and acquiring a question and paragraph pair corresponding to each document in the training document data set;
Determining a positive sample set and a negative sample set according to the question and paragraph pairs corresponding to each document;
constructing a model to be trained, carrying out iterative training on the model to be trained according to a first preset loss function based on a training document data set until an iteration ending condition is met, and obtaining a first model;
based on the positive sample set and the negative sample set, performing iterative training on the first model according to a second preset loss function until an iteration ending condition is met, and obtaining a target model;
and coding each paragraph in the paragraph library according to the target model, the paragraph text corresponding to each paragraph in the paragraph library, the document name corresponding to each paragraph and the chapter title name corresponding to each paragraph to obtain a coding vector corresponding to each paragraph.
4. The method of claim 2, wherein generating paragraph data for each paragraph based on the encoding vector for each paragraph and the paragraph text for each paragraph comprises:
acquiring a document name corresponding to each paragraph, a chapter title name corresponding to each paragraph and a paragraph identifier divided in the chapter where each paragraph is located;
and generating paragraph data corresponding to each paragraph according to the coding vector corresponding to each paragraph, the paragraph text corresponding to each paragraph, the document name corresponding to each paragraph, the chapter title name corresponding to each paragraph and the paragraph identifier.
5. The method of claim 1, wherein performing a literal recall search and a vector recall search on the index library according to the target question to obtain target paragraph data corresponding to the target question comprises:
obtaining a synonym library, wherein the synonym library comprises: at least one pair of a noun abbreviation and its corresponding noun full name;
expanding each noun abbreviation in the target problem based on the synonym library to obtain a first problem;
and carrying out literal recall retrieval and vector recall retrieval on the index library according to the first problem to obtain target paragraph data corresponding to the target problem.
6. The method of claim 5, wherein performing a literal recall search and a vector recall search on the index library according to the first question to obtain the target paragraph data corresponding to the target question comprises:
performing literal recall search on the index library according to the first problem to obtain a first similarity between the first problem and each paragraph of data in the index library;
sorting paragraph data from high to low according to the first similarity, and determining the top-ranked preset number of paragraph data as a first paragraph set corresponding to the literal recall search;
vector recall retrieval is carried out on the index library according to the first problem, and second similarity between the first problem and each paragraph of data in the index library is obtained;
sorting paragraph data from high to low according to the second similarity, and determining the top-ranked preset number of paragraph data as a second paragraph set corresponding to the vector recall retrieval;
summarizing the first paragraph set and the second paragraph set, and performing de-duplication processing to obtain a target paragraph set;
and obtaining target paragraph data corresponding to the target problem according to the target paragraph set.
7. The method of claim 6, wherein obtaining target paragraph data corresponding to a target question from a target paragraph set comprises:
splicing the first problem with each paragraph data in the target paragraph set to obtain a first splicing result;
performing score calculation according to the first splicing result to obtain a target score corresponding to the first splicing result;
sorting the data of each paragraph in the target paragraph set from high to low according to the target score to obtain a first sorting result;
integrating paragraph data located under the same document name and the same chapter title name in the first sorting result, and updating the first sorting result according to the integrated paragraph data;
and obtaining the target paragraph data corresponding to the target problem according to the updated first sorting result.
8. A paragraph retrieval device, comprising:
The generation module is used for constructing a search document data set, dividing paragraphs of each document in the search document data set to obtain at least one paragraph, and generating a paragraph library according to the at least one paragraph;
the coding module is used for coding each paragraph in the paragraph library to obtain a coding vector corresponding to each paragraph;
the construction module is used for generating paragraph data corresponding to each paragraph according to the coding vector corresponding to each paragraph and the paragraph text corresponding to each paragraph, and constructing an index library according to the paragraph data;
the obtaining module is used for obtaining the target problem, and carrying out literal recall search and vector recall search on the index library according to the target problem to obtain target paragraph data corresponding to the target problem.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the paragraph retrieval method according to any of claims 1-7.
10. A computer readable storage medium storing computer instructions for causing a processor to perform the paragraph retrieval method according to any of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311376666.2A CN117573800A (en) 2023-10-23 2023-10-23 Paragraph retrieval method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311376666.2A CN117573800A (en) 2023-10-23 2023-10-23 Paragraph retrieval method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117573800A (en) 2024-02-20

Family

ID=89888911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311376666.2A Pending CN117573800A (en) 2023-10-23 2023-10-23 Paragraph retrieval method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117573800A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination