CN116610795B - Text retrieval method and device - Google Patents

Text retrieval method and device

Info

Publication number
CN116610795B
CN116610795B (application CN202310863139.8A)
Authority
CN
China
Prior art keywords
document
scoring model
training
encoder
relevance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310863139.8A
Other languages
Chinese (zh)
Other versions
CN116610795A (en)
Inventor
徐琳 (Xu Lin)
暴宇健 (Bao Yujian)
王芳 (Wang Fang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xumi Yuntu Space Technology Co Ltd
Original Assignee
Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xumi Yuntu Space Technology Co Ltd filed Critical Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority to CN202310863139.8A priority Critical patent/CN116610795B/en
Publication of CN116610795A publication Critical patent/CN116610795A/en
Application granted granted Critical
Publication of CN116610795B publication Critical patent/CN116610795B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to the technical field of computers, and provides a text retrieval method and a text retrieval device. The method comprises the following steps: inputting a question sentence to be retrieved into a question encoder of a preset relevance scoring model, wherein the relevance scoring model has a dual-tower architecture, the two branches of the dual-tower architecture are a document encoder and a question encoder respectively, and some hard negative examples in the training data of the relevance scoring model are generated using a first language model; inputting pseudo question sentences, generated by the first language model from the in-library documents in a document library, together with the corresponding in-library documents into the document encoder, wherein the document encoder has an interactive architecture; obtaining the relevance scores, output by the relevance scoring model, between the question sentence to be retrieved and the in-library documents in the document library; and obtaining the retrieval text for the question sentence to be retrieved according to the relevance scores. The technical scheme improves the accuracy of document characterization in text retrieval without sacrificing query efficiency.

Description

Text retrieval method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a text retrieval method and apparatus.
Background
Currently, dense vector retrieval plays a vital role in information retrieval. A dense vector is a continuous, real-valued vector stored in an array data structure. Compared with the traditional BM25 (Best Match 25) retrieval method, dense vector retrieval better captures the semantic relationship between a question and a document.
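To make the contrast concrete, here is a minimal, purely illustrative sketch of dense-vector scoring; the vectors below are stand-ins for real encoder outputs, not embeddings produced by any model in this disclosure:

```python
# Minimal sketch of dense-vector relevance scoring (illustrative only:
# the vectors are hypothetical stand-ins for encoder outputs).
import numpy as np

question_vec = np.array([0.12, -0.53, 0.88, 0.07])  # hypothetical question embedding
document_vec = np.array([0.10, -0.49, 0.91, 0.02])  # hypothetical document embedding

# Dot-product similarity, the scoring used throughout this disclosure.
score = float(np.dot(question_vec, document_vec))
print(f"relevance score: {score:.4f}")
```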
For relevance scoring between a question and a document there are mainly two scoring frameworks: non-interactive and interactive. Interactive frameworks are too computationally intensive and hurt the efficiency of text retrieval, while non-interactive frameworks perform poorly on multi-topic content in long documents and are therefore less accurate.
How to improve the accuracy of document characterization without losing query efficiency is a technical problem that currently needs to be solved.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a text retrieval method, apparatus, electronic device, and computer-readable storage medium, so as to solve the technical problem in the prior art that, in text retrieval, either document characterization accuracy or retrieval efficiency is low.
In a first aspect of an embodiment of the present disclosure, there is provided a text retrieval method, including: inputting a question sentence to be retrieved into a question encoder of a preset relevance scoring model, wherein the relevance scoring model has a dual-tower architecture, the two branches of the dual-tower architecture are a document encoder and a question encoder respectively, and some hard negative examples in the training data of the relevance scoring model are generated using a first language model; inputting pseudo question sentences, generated by the first language model from the in-library documents in a document library, together with the corresponding in-library documents into the document encoder, wherein the document encoder has an interactive architecture; obtaining the relevance scores, output by the relevance scoring model, between the question sentence to be retrieved and the in-library documents in the document library; and obtaining the retrieval text for the question sentence to be retrieved according to the relevance scores.
In a second aspect of embodiments of the present disclosure, there is provided a text retrieval apparatus, the apparatus comprising: a question encoding module for inputting a question sentence to be retrieved into a question encoder of a preset relevance scoring model, wherein the relevance scoring model has a dual-tower architecture, the two branches of the dual-tower architecture are a document encoder and a question encoder respectively, and some hard negative examples in the training data of the relevance scoring model are generated using a first language model; a document encoding module for inputting pseudo question sentences, generated by the first language model from the in-library documents in a document library, together with the corresponding in-library documents into the document encoder, wherein the document encoder has an interactive architecture; a relevance score acquisition module for acquiring the relevance scores, output by the relevance scoring model, between the question sentence to be retrieved and the in-library documents in the document library; and a retrieval text acquisition module for acquiring the retrieval text of the question sentence to be retrieved according to the relevance scores.
In a third aspect of the disclosed embodiments, an electronic device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the disclosed embodiments, a computer-readable storage medium is provided, which stores a computer program which, when executed by a processor, implements the steps of the above-described method.
Compared with the prior art, the embodiments of the present disclosure have the following beneficial effects: the technical scheme realizes text retrieval using a relevance scoring model that is locally interactive and globally dual-tower, generates some of the hard negative examples in the training data of the relevance scoring model with a first language model, and likewise uses the first language model to generate pseudo question sentences from the in-library documents, thereby improving the accuracy of document characterization without losing query efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required for the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 is a schematic flow chart of a text retrieval method according to an embodiment of the disclosure;
FIG. 2 is a schematic diagram of a relevance scoring model provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a text retrieval device according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
In the related art, relevance scoring between a query and a document mainly uses two frameworks: the Dual-Encoder (dual-tower) framework and the Cross-Encoder (interactive) framework. The Cross-Encoder framework is too computationally intensive to use in the recall phase, while the Dual-Encoder framework performs poorly on multi-topic content in long documents, because the question and the document never interact, and therefore has low accuracy. Some research aims to improve model speed and effectiveness with a late-interaction architecture, but such models still cannot rank directly with pre-stored vectors, so the gains in speed and effectiveness are limited.
To solve the above problems, embodiments of the present disclosure provide a text retrieval scheme that improves the accuracy of document characterization without losing query efficiency in text retrieval.
Specifically, in the embodiments of the present disclosure, generated question sentences can be used to learn multi-view representations of a single document, which are then used when querying documents, so that the accuracy of document characterization can be improved without losing query efficiency.
Text retrieval methods and apparatuses according to embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a text retrieval method according to an embodiment of the present disclosure. The method provided by the embodiments of the present disclosure may be performed by any electronic device with computer processing capabilities, such as a terminal or a server. As shown in fig. 1, the text retrieval method includes:
step S101, inputting a to-be-searched problem sentence into a problem encoder of a preset relevance scoring model, wherein the relevance scoring model is provided with a double-tower frame, two branches of the double-tower frame are a document encoder and a problem encoder respectively, and part of negative refractory cases in training data of the relevance scoring model are generated by using a first language model.
Specifically, after the to-be-searched question sentences are input to the question encoder of the relevance scoring model, the upper and lower Wen Biaozheng vectors of the to-be-searched question sentences can be obtained.
Step S102, inputting pseudo question sentences, generated by the first language model from the in-library documents in the document library, together with the corresponding in-library documents into the document encoder, wherein the document encoder has an interactive architecture.
Specifically, after the pseudo question sentences and the corresponding in-library documents are input into the document encoder of the relevance scoring model, the multi-view encoding vectors of the in-library documents can be obtained.
Step S103, obtaining the relevance scores, output by the relevance scoring model, between the question sentence to be retrieved and the in-library documents in the document library.
Specifically, the relevance scoring model outputs relevance scores between the question sentence to be retrieved and the multi-view encoding vectors of an in-library document, i.e., relevance scores between the same in-library document under different views and the question sentence to be retrieved, and the maximum of the relevance scores over the different views is taken as the relevance score between the question sentence to be retrieved and that in-library document. All in-library documents are traversed to obtain the relevance scores between every in-library document and the question sentence to be retrieved.
Step S104, obtaining the retrieval text for the question sentence to be retrieved according to the relevance scores.
Specifically, the in-library document with the highest relevance score with respect to the question sentence to be retrieved can be selected as the retrieval text most similar to the question sentence to be retrieved.
As shown in fig. 2, the relevance scoring model in the embodiments of the present disclosure has two branches: question encoder 204 and document encoder 205. The question 201 is input to the question encoder 204 to produce a first vector 206, and the question 202 and the document 203 are input to the document encoder 205 to produce a second vector 207. From the first vector 206 and the second vector 207, a relevance score 208 between the question and the document can be derived. The first vector is the context representation vector, and the second vector is the multi-view encoding vector of the document.
The relevance scoring model of the embodiments of the present disclosure fuses the interactive and non-interactive retrieval model architectures: it keeps the higher accuracy of the interactive architecture while retaining the small computational footprint of the non-interactive architecture, so that excessive computation at the inference stage does not render the model unusable for recall. Further, the interactive architecture of the document encoder can exploit the diversity of the questions generated by the first language model to learn multi-view representations of a single document, thereby improving the characterization of the queried documents, so that the relevance scoring model achieves higher retrieval accuracy than a purely non-interactive model.
Specifically, the first vector v_q output by the question encoder shown in FIG. 2 can be represented by the following formula (1):

v_q = E_q(q)    (1)

where q is the question input to the question encoder, and E_q denotes the question encoder.

The second vector v_d output by the document encoder shown in FIG. 2 can be represented by the following formula (2):

v_d = E_d(q + d)    (2)

where q is the question input to the document encoder, E_d denotes the document encoder, and q + d denotes that the question q and the document d are jointly taken as input to the document encoder of the relevance scoring model.

The similarity s(q, d) between the first vector and the second vector can be measured by the dot product in the following formula (3):

s(q, d) = v_q · v_d    (3)
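As a rough illustration of formulas (1) to (3) (a sketch, not the disclosure's own implementation), the following assumes two generic BERT-style encoders from the Hugging Face transformers library, with [CLS] pooling and an arbitrarily chosen checkpoint:

```python
# Illustrative sketch of formulas (1)-(3): v_q = E_q(q), v_d = E_d(q + d),
# s(q, d) = v_q . v_d. The checkpoint name and [CLS] pooling are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "bert-base-chinese"  # hypothetical choice of initial scoring model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
question_encoder = AutoModel.from_pretrained(MODEL)   # E_q
document_encoder = AutoModel.from_pretrained(MODEL)   # E_d (separate weights)

def encode_question(q: str) -> torch.Tensor:
    """Formula (1): encode the question alone (non-interactive branch)."""
    inputs = tokenizer(q, return_tensors="pt", truncation=True)
    with torch.no_grad():  # inference-style scoring
        out = question_encoder(**inputs)
    return out.last_hidden_state[:, 0]  # [CLS] as context representation vector

def encode_document(q: str, d: str) -> torch.Tensor:
    """Formula (2): encode question and document jointly (interactive branch)."""
    inputs = tokenizer(q, d, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = document_encoder(**inputs)
    return out.last_hidden_state[:, 0]

v_q = encode_question("北京近日天气")
v_d = encode_document("北京近日天气", "西伯利亚冷空气近日抵达北京……")
score = (v_q * v_d).sum().item()  # formula (3): dot product
```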
In the related art, the questions input to the question encoder and to the document encoder differ, because only the standard questions attached to documents in the training set can be used when training a relevance scoring model, and manually writing the questions that every in-library document in the entire document library might answer is unrealistic. Thus, in the disclosed embodiments, an existing question generation model, i.e., the first language model, may be used to generate the corresponding retrieval sentences, i.e., questions, from the in-library documents. The first language model may be any of several mature large language models, such as ChatGPT (Chat Generative Pre-trained Transformer), BARD (a chat robot), or LLaMA (an artificial-intelligence language model), but is not limited thereto. Specifically, the first language model can be used to generate several pseudo questions for each in-library document, and top-K sampling is adopted during decoding to ensure the diversity of the pseudo questions.
Specifically, when generating a pseudo question, question generation may be performed using a prompt like the following:
"Please generate a question sentence through which this article might be retrieved. For example, for the following article: 'A Siberian cold air mass is approaching Beijing in the coming days; the Beijing area is expected to ……', a possible question for this article is: 'Beijing weather recently'.
Imitating the above format, construct a question for the following article: (the article content for which the question is to be generated)."
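A minimal sketch of this pseudo-question generation step follows; llm_generate is a hypothetical placeholder for a call to the first language model, not a real API:

```python
# Sketch of pseudo-question generation for an in-library document. The
# llm_generate function is a placeholder for whichever first language model
# is used (ChatGPT, BARD, LLaMA, ...); it is an assumption, not a real API.
PROMPT_TEMPLATE = (
    "Please generate a question sentence through which this article "
    "might be retrieved.\nArticle: {document}\nQuestion:"
)

def llm_generate(prompt: str, top_k: int = 50) -> str:
    # Placeholder: a real implementation would call a large language model
    # with top-K sampling enabled to keep the outputs diverse.
    return "placeholder question"

def generate_pseudo_questions(document: str, n: int = 3) -> list[str]:
    prompt = PROMPT_TEMPLATE.format(document=document)
    return [llm_generate(prompt, top_k=50) for _ in range(n)]
```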
In embodiments of the present disclosure, three or more large language models may be used. For the same document, at least three questions can be generated with each language model, so that a single document is augmented with a total of 3 × 3 = 9 questions, which greatly improves the diversity of the model's training data.
The relevance scoring model of the embodiments of the present disclosure is trained from an initial scoring model, which may be any one of the following: BERT (Bidirectional Encoder Representations from Transformers), RoBERTa (A Robustly Optimized BERT), ALBERT (a lightweight BERT), or SimBERT (a simple BERT).
As shown in fig. 1, before the relevance scoring model provided in the embodiments of the present disclosure can be applied for inference, the framework of an initial scoring model needs to be determined, and the initial scoring model needs to be trained to obtain the relevance scoring model.
When training the relevance scoring model, pseudo training question sentences generated using the first language model from training documents can be input into the question encoder of an initial scoring model, and the pseudo training question sentences together with the training documents can be input into the document encoder of the initial scoring model, wherein the initial scoring model is the initial model of the relevance scoring model, and the first training data of the relevance scoring model comprises the pseudo training question sentences and the training documents.
After the first training data are input into the initial scoring model, the relevance evaluation scores output by the initial scoring model can be obtained, and the initial scoring model is trained according to the relevance evaluation scores and the training labels corresponding to the first training data, yielding a trained intermediate scoring model. A training label is the labeled relevance score corresponding to the first training data.
Further, after the trained intermediate scoring model is obtained, second training data are selected to train the intermediate scoring model, yielding the relevance scoring model.
Specifically, during model training, data augmentation can be used to generate questions, the generated questions are treated as pseudo-labeled data, and the model is trained on the pseudo-labeled data and the training documents. This training process can be regarded as a warm-up phase, after which the model is fine-tuned on a genuinely labeled, high-quality training set.
Before the relevance scoring model is trained, the pseudo training question sentences, i.e., the pseudo-labeled data, need to be generated from the training documents using the first language model.
After the generated questions are fused into the document representation, the positive and negative examples of a sample may be redefined. For a given question, four forms of positive and negative examples may be defined. The negative examples include hard negative examples (hard negatives) and in-batch negative examples. A positive example is the target that should be predicted as relevant; a negative example should not be. A hard negative example, also called a hard negative sample, is a negative sample that is difficult to distinguish or classify from the positive samples: the difference between such a negative sample and the anchor, i.e., the reference data point, is small.
Hard negatives are constructed in two ways. In the first, negative documents are randomly drawn from the documents ranked highest by BM25. Such hard negatives let the model learn finer-grained information: a negative document is often related to the question yet does not answer it accurately. They also prevent the model from learning matching signals only from the question side while ignoring document-side information.
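A sketch of this first construction follows, assuming the third-party rank_bm25 package as the BM25 implementation (the disclosure does not name one):

```python
# Sketch of the first hard-negative construction: randomly sample negatives
# from documents ranked highly by BM25 for a given question. rank_bm25 is an
# assumed implementation choice; k = 7 echoes the 1:7 ratio mentioned later.
import random
from rank_bm25 import BM25Okapi

def bm25_hard_negatives(question, corpus, positive_idx, top_n=50, k=7):
    tokenized = [doc.split() for doc in corpus]            # naive tokenization
    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores(question.split())
    ranked = sorted(range(len(corpus)), key=lambda i: -scores[i])[:top_n]
    candidates = [i for i in ranked if i != positive_idx]  # drop the true positive
    return random.sample(candidates, min(k, len(candidates)))
```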
In the other construction, a large language model is used to generate hard-negative questions.
The prompt for this generation can be as follows: "Please generate questions for me that are related to the article but cannot be answered directly by the article. Example: for the article 'A Siberian cold air mass is approaching Beijing in the coming days; the Beijing area is expected to ……', a question related to this article but not directly answerable by it is: 'Statistics and patterns of precipitation in Beijing over the years'.
Generate a hard query for the following article: (the article to be processed). Hard query:"
The proportions of the hard negatives generated in these two ways differ across training stages. In the embodiments of the present disclosure, the hard negatives generated using the first language model account for 30% of all hard negatives in the first third of the iterations of training the initial scoring model, 50% in the middle third, and 60% in the last third.
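A sketch of this mixing schedule, under the assumption that it is keyed to the current training step:

```python
# Sketch of the hard-negative mixing schedule described above: the share of
# LLM-generated hard negatives rises from 30% to 50% to 60% across the
# first, middle, and last thirds of the training iterations.
def llm_negative_ratio(step: int, total_steps: int) -> float:
    third = total_steps / 3
    if step < third:
        return 0.30
    elif step < 2 * third:
        return 0.50
    return 0.60
```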
Using in-batch negative examples improves training efficiency and lets the model learn topic-level discrimination.
In the embodiments of the present disclosure, a contrastive learning loss function may be used both in training the initial scoring model and in training the intermediate scoring model. Specifically, the contrastive learning loss function may be, but is not limited to, a triplet loss function (a brief sketch is given after this passage).
When the initial scoring model is trained with the contrastive learning loss function, the convergence condition of the initial scoring model may be that the value of the loss function is minimal, that the value of the loss function fluctuates within a range, or that the number of training iterations reaches a certain count.
After the loss of the initial scoring model is determined, the network parameters of the initial scoring model are adjusted according to the loss; this is the parameter tuning process of the initial scoring model. In actual training, iterative parameter tuning is performed several times on the first training data until the initial scoring model converges, yielding the intermediate scoring model.
Similarly, after the loss of the intermediate scoring model is determined, the network parameters of the intermediate scoring model are adjusted according to the loss; this is the second parameter tuning process. In actual training, iterative parameter tuning is performed several times on the second training data until the intermediate scoring model converges, yielding the relevance scoring model.
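The sketch below illustrates the triplet-loss option with PyTorch's built-in TripletMarginLoss; the margin, the batch size of 32, and the vector width of 768 are illustrative assumptions:

```python
# Sketch of contrastive training with a triplet loss, one option the
# disclosure mentions. The tensors stand in for encoder outputs.
import torch

triplet_loss = torch.nn.TripletMarginLoss(margin=1.0)

# requires_grad mimics outputs of trainable encoders.
anchor   = torch.randn(32, 768, requires_grad=True)  # question vectors v_q
positive = torch.randn(32, 768, requires_grad=True)  # views of relevant documents
negative = torch.randn(32, 768, requires_grad=True)  # views of hard negatives

loss = triplet_loss(anchor, positive, negative)
loss.backward()  # in real training, gradients update both encoders
```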
In the embodiments of the present disclosure, the ratio of positive to negative examples among the samples used to train the initial scoring model may be, but is not limited to, 1:7. The number of samples passed to the trainer at a time (batch_size) when training the initial scoring model may be, but is not limited to, 32; these settings give good training results.
In the embodiments of the present disclosure, before the pseudo question sentences generated using the first language model from the in-library documents in the document library and the corresponding in-library documents are input into the document encoder, K pseudo question sentences may be generated for each in-library document using top-K decoding, where K is a natural number. Here TopK refers to the algorithm that selects the K largest or smallest elements in a data set.
In the embodiments of the present disclosure, top-K question generation needs to be performed on each document in advance, and the generated questions are put into the document encoder together with the document to obtain the candidate encoding vector for one view of the document. A document yields as many candidate view vectors as it has generated questions. Encoding the document library, i.e., the corpus, in this way produces a multi-view document representation with deep question-document interaction.
The procedure for generating a question q_j^i and a document view v_j^i is shown in the following formulas (4) and (5):

q_j^i = LM(d_j)    (4)

v_j^i = E_d(q_j^i + d_j)    (5)

where q_j^i is the i-th question of the j-th document d_j, v_j^i is the i-th view of the j-th document, and LM denotes the first language model.
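Putting formulas (4) and (5) together, a sketch of the offline corpus-encoding step, reusing the encode_document and generate_pseudo_questions sketches given earlier:

```python
# Sketch of offline corpus encoding: each in-library document d_j is stored
# as K view vectors v_j^i, one per generated pseudo question (formulas (4)
# and (5)). K = 3 is an illustrative choice.
def build_multiview_index(corpus: list[str], k: int = 3) -> list[list]:
    index = []  # index[j] holds the K view vectors of document j
    for doc in corpus:
        pseudo_questions = generate_pseudo_questions(doc, n=k)         # formula (4)
        views = [encode_document(pq, doc) for pq in pseudo_questions]  # formula (5)
        index.append(views)
    return index
```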
Specifically, for a given document A, an example of multi-view question generation for it is as follows:
"Please generate multiple queries that differ in questioning angle and can be answered by the article. Example: for the article 'A Siberian cold air mass is approaching Beijing in the coming days; the Beijing area is expected to ……', one possible query the article can answer directly is: 'Beijing weather conditions in the coming days'.
Please generate multiple similar queries with different questioning angles for the following article: (article text ……). Query:"
In the embodiments of the present disclosure, the quality of the questions generated by the first language model can be spot-checked, and the number of generated questions, the choice of the first language model, and the wording of the prompts can be tuned accordingly.
Further, when the relevance scoring model is used for document retrieval, the question encoder encodes a question q to obtain its context representation vector v_q. The maximum (max-pooling) of the relevance scores between v_q and the multi-view encoding vectors v_j^1, …, v_j^K of a document d_j over the different views is taken as the relevance score s(q, d_j) between the question q and the document d_j:

s(q, d_j) = max_i (v_q · v_j^i)    (6)

The relevance score thus supports ranking directly by inter-vector similarity.
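A sketch of this online retrieval step per formula (6), reusing the encode_question sketch and the multi-view index from the earlier sketches:

```python
# Sketch of online retrieval per formula (6): score the question's context
# representation vector against every pre-stored view vector of each
# document, max-pool over views, and rank by the resulting score.
def retrieve(question: str, index, corpus: list[str], top_n: int = 5):
    v_q = encode_question(question)                        # formula (1)
    doc_scores = []
    for views in index:
        view_scores = [(v_q * v).sum().item() for v in views]
        doc_scores.append(max(view_scores))                # max-pooling, formula (6)
    ranked = sorted(range(len(corpus)), key=lambda j: -doc_scores[j])
    return [(corpus[j], doc_scores[j]) for j in ranked[:top_n]]
```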
According to the text retrieval method of the embodiments of the present disclosure, text retrieval is realized with a relevance scoring model that is locally interactive and globally dual-tower; some of the hard negative examples in the training data of the relevance scoring model are generated with the first language model, and the first language model is likewise used to generate pseudo question sentences from the in-library documents, which improves the accuracy of document characterization without losing query efficiency.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. The text retrieval apparatus described below and the text retrieval method described above may be referred to correspondingly to each other. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
Fig. 3 is a schematic diagram of a text retrieval device provided in an embodiment of the present disclosure. As shown in fig. 3, the text retrieval apparatus includes:
the problem encoding module 301 is configured to input a to-be-retrieved problem sentence into a problem encoder of a preset relevance scoring model, where the relevance scoring model has a double-tower frame, two branches of the double-tower frame are a document encoder and a problem encoder, and a part of negative refractory cases in training data of the relevance scoring model are generated using a first language model.
Specifically, after the to-be-searched question sentences are input to the question encoder of the relevance scoring model, the upper and lower Wen Biaozheng vectors of the to-be-searched question sentences can be obtained.
A document encoding module 302 for inputting pseudo-problem statements generated using a first language model from an in-library document in a document library and a corresponding in-library document into a document encoder, wherein the document encoder has an interactive framework.
Specifically, after the pseudo-problem sentences and the corresponding in-library documents are input into a document encoder of a relevance scoring model, the encoding vectors of the multi-view in-library documents can be obtained.
And the relevance score acquisition module 303 is used for acquiring the relevance score of the to-be-searched problem statement output by the relevance score model and the in-library document in the document library.
Specifically, the relevance scoring model outputs relevance scores of the code vectors of the to-be-searched problem sentences and the multi-view in-library documents, namely, the relevance scores of the same in-library document and the to-be-searched problem sentences at different view angles, and the maximum value of the relevance scores of the different view angles is taken as the relevance score of the to-be-searched problem sentences and the in-library document. And traversing all the in-library documents to obtain the relevance scores of all the in-library documents and the problem sentences to be searched.
The retrieval text obtaining module 304 is configured to obtain a retrieval text of the question sentence to be retrieved according to the relevance score.
Specifically, the in-library document with the highest relevance score to the problem statement to be searched in the library documents can be selected as the search text with the highest similarity to the problem statement to be searched.
The text retrieval apparatus of the embodiments of the present disclosure may further comprise a training module configured to:
input pseudo training question sentences generated using the first language model from training documents into the question encoder of an initial scoring model, and input the pseudo training question sentences together with the training documents into the document encoder of the initial scoring model, wherein the initial scoring model is the initial model of the relevance scoring model, and the first training data of the relevance scoring model comprises the pseudo training question sentences and the training documents; obtain the relevance evaluation scores output by the initial scoring model; train the initial scoring model according to the relevance evaluation scores and the training labels corresponding to the first training data to obtain a trained intermediate scoring model; and select second training data to train the intermediate scoring model to obtain the relevance scoring model.
In the embodiments of the present disclosure, the hard negative examples generated using the first language model account for 30% of all hard negatives in the first third of the iterations of training the initial scoring model, 50% in the middle third, and 60% in the last third.
In the embodiments of the present disclosure, the training module is further configured to train the initial scoring model using a contrastive learning loss function, and to train the intermediate scoring model using a contrastive learning loss function.
In the embodiments of the present disclosure, the ratio of positive to negative examples among the samples used to train the initial scoring model includes 1:7; and/or the number of samples passed to the trainer at a time (batch_size) when training the initial scoring model includes 32.
In the embodiments of the present disclosure, the initial scoring model includes any one of the following models: Bidirectional Encoder Representations from Transformers (BERT), Robustly Optimized BERT (RoBERTa), lightweight BERT (ALBERT), and simple BERT (SimBERT).
The text retrieval apparatus of the embodiments of the present disclosure may further include a generation module, configured to perform top-K pseudo question sentence generation on each in-library document before the pseudo question sentences generated using the first language model from the in-library documents in the document library and the corresponding in-library documents are input into the document encoder.
Since each functional module of the text retrieval device according to the exemplary embodiment of the present disclosure corresponds to a step of the exemplary embodiment of the text retrieval method described above, for details not disclosed in the embodiment of the device of the present disclosure, please refer to the embodiment of the text retrieval method described above in the present disclosure.
According to the text retrieval apparatus of the embodiments of the present disclosure, text retrieval is realized with a relevance scoring model that is locally interactive and globally dual-tower; some of the hard negative examples in the training data of the relevance scoring model are generated with the first language model, and the first language model is likewise used to generate pseudo question sentences from the in-library documents, which improves the accuracy of document characterization without losing query efficiency.
Fig. 4 is a schematic diagram of an electronic device 4 provided by an embodiment of the present disclosure. As shown in fig. 4, the electronic apparatus 4 of this embodiment includes: a processor 401, a memory 402 and a computer program 403 stored in the memory 402 and executable on the processor 401. The steps of the various method embodiments described above are implemented by processor 401 when executing computer program 403. Alternatively, the processor 401 may execute the computer program 403 to implement the functions of the modules in the above-described device embodiments.
The electronic device 4 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The electronic device 4 may include, but is not limited to, a processor 401 and a memory 402. It will be appreciated by those skilled in the art that fig. 4 is merely an example of the electronic device 4 and is not limiting of the electronic device 4 and may include more or fewer components than shown, or different components.
The processor 401 may be a central processing unit (Central Processing Unit, CPU) or other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
The memory 402 may be an internal storage unit of the electronic device 4, for example, a hard disk or a memory of the electronic device 4. The memory 402 may also be an external storage device of the electronic device 4, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like, which are provided on the electronic device 4. Memory 402 may also include both internal storage units and external storage devices of electronic device 4. The memory 402 is used to store computer programs and other programs and data required by the electronic device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.
The integrated modules, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the present disclosure may implement all or part of the flow of the method of the above-described embodiments by means of a computer program instructing related hardware; the computer program may be stored in a computer readable storage medium, and when executed by a processor, may implement the steps of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer readable medium can be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, the computer readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are merely for illustrating the technical solution of the present disclosure, and are not limiting thereof; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the disclosure, and are intended to be included in the scope of the present disclosure.

Claims (9)

1. A text retrieval method, the method comprising:
inputting a question sentence to be retrieved into a question encoder of a preset relevance scoring model, wherein the relevance scoring model has a dual-tower architecture, the two branches of the dual-tower architecture are a document encoder and the question encoder respectively, the question encoder has a non-interactive architecture, some hard negative examples in the training data of the relevance scoring model are generated using a first language model, the question encoder is used to encode an input question into a context representation vector, and the document encoder is used to encode an input question and document into multi-view encoding vectors of the document;
inputting pseudo question sentences generated using the first language model from the in-library documents in a document library, together with the corresponding in-library documents, into the document encoder, wherein the document encoder has an interactive architecture;
acquiring the relevance scores, output by the relevance scoring model, between the question sentence to be retrieved and the in-library documents in the document library;
acquiring the retrieval text for the question sentence to be retrieved according to the relevance scores;
the training method of the relevance scoring model comprises the following steps:
inputting pseudo training question sentences generated using the first language model from training documents into a question encoder of an initial scoring model, and inputting the pseudo training question sentences and the training documents into a document encoder of the initial scoring model, wherein the initial scoring model is the initial model of the relevance scoring model, and the first training data of the relevance scoring model comprises the pseudo training question sentences and the training documents;
obtaining a relevance evaluation score output by the initial scoring model;
training the initial scoring model according to the relevance evaluation score and a training label corresponding to the first training data to obtain a trained intermediate scoring model;
and selecting second training data to train the intermediate scoring model to obtain the relevance scoring model.
2. The method of claim 1, wherein the hard negative examples generated using the first language model account for 30% of all hard negative examples in a first third of the iterations in which the initial scoring model is trained, 50% of all hard negative examples in a middle third of the iterations, and 60% of all hard negative examples in a last third of the iterations.
3. The method of claim 1, wherein training the initial scoring model comprises: training the initial scoring model using a contrastive learning loss function;
and training the intermediate scoring model comprises: training the intermediate scoring model using a contrastive learning loss function.
4. The method of claim 1, wherein the ratio of positive to negative examples among the samples employed in training the initial scoring model comprises 1:7; and/or,
the number of samples passed to the program in a single pass when training the initial scoring model comprises 32.
5. The method of claim 1, wherein the initial scoring model comprises any one of the following models: Bidirectional Encoder Representations from Transformers (BERT), Robustly Optimized BERT (RoBERTa), lightweight BERT (ALBERT), and simple BERT (SimBERT).
6. The method of claim 1, wherein before the pseudo question sentences generated using the first language model from the in-library documents in the document library and the corresponding in-library documents are input into the document encoder, the method further comprises:
generating, for each in-library document, K pseudo question sentences by top-K selection of the K largest or smallest elements, wherein K is a natural number.
7. A text retrieval apparatus, the apparatus comprising:
a question encoding module, configured to input a question sentence to be retrieved into a question encoder of a preset relevance scoring model, wherein the relevance scoring model has a dual-tower architecture, the two branches of the dual-tower architecture are a document encoder and the question encoder, the question encoder has a non-interactive architecture, some hard negative examples in the training data of the relevance scoring model are generated using a first language model, the question encoder is configured to encode an input question into a context representation vector, and the document encoder is configured to encode an input question and document into multi-view encoding vectors of the document;
a document encoding module, configured to input pseudo question sentences generated using the first language model from the in-library documents in a document library, together with the corresponding in-library documents, into the document encoder, wherein the document encoder has an interactive architecture;
a relevance score acquisition module, configured to acquire the relevance scores, output by the relevance scoring model, between the question sentence to be retrieved and the in-library documents in the document library;
a retrieval text acquisition module, configured to acquire the retrieval text for the question sentence to be retrieved according to the relevance scores;
a training module, configured to input pseudo training question sentences generated using the first language model from training documents into a question encoder of an initial scoring model, and to input the pseudo training question sentences and the training documents into a document encoder of the initial scoring model, wherein the initial scoring model is the initial model of the relevance scoring model, and the first training data of the relevance scoring model comprises the pseudo training question sentences and the training documents; obtain the relevance evaluation scores output by the initial scoring model; train the initial scoring model according to the relevance evaluation scores and the training labels corresponding to the first training data to obtain a trained intermediate scoring model; and select second training data to train the intermediate scoring model to obtain the relevance scoring model.
8. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 6.
CN202310863139.8A 2023-07-14 2023-07-14 Text retrieval method and device Active CN116610795B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310863139.8A CN116610795B (en) 2023-07-14 2023-07-14 Text retrieval method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310863139.8A CN116610795B (en) 2023-07-14 2023-07-14 Text retrieval method and device

Publications (2)

Publication Number Publication Date
CN116610795A CN116610795A (en) 2023-08-18
CN116610795B true CN116610795B (en) 2024-03-15

Family

ID=87678512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310863139.8A Active CN116610795B (en) 2023-07-14 2023-07-14 Text retrieval method and device

Country Status (1)

Country Link
CN (1) CN116610795B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117932039A (en) * 2024-03-21 2024-04-26 山东大学 Interpretable text review method and system based on heuristic question-answer reasoning

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114579704A (en) * 2022-03-03 2022-06-03 贝壳找房网(北京)信息技术有限公司 Semantic matching method and device
CN114625838A (en) * 2022-03-10 2022-06-14 平安科技(深圳)有限公司 Search system optimization method and device, storage medium and computer equipment
CN114780709A (en) * 2022-03-22 2022-07-22 北京三快在线科技有限公司 Text matching method and device and electronic equipment
CN114880452A (en) * 2022-05-25 2022-08-09 重庆大学 Text retrieval method based on multi-view contrast learning
CN115329749A (en) * 2022-10-14 2022-11-11 成都数之联科技股份有限公司 Recall and ordering combined training method and system for semantic retrieval
CN115344672A (en) * 2022-10-18 2022-11-15 北京澜舟科技有限公司 Document retrieval model training method, retrieval method and storage medium
CN115374251A (en) * 2022-09-07 2022-11-22 重庆大学 Dense retrieval method based on syntax comparison learning
CN115878564A (en) * 2021-09-29 2023-03-31 华为技术有限公司 Document retrieval method and device
CN116186562A (en) * 2023-04-27 2023-05-30 中南大学 Encoder-based long text matching method


Also Published As

Publication number Publication date
CN116610795A (en) 2023-08-18


Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant