CN116610795B - Text retrieval method and device - Google Patents

Text retrieval method and device

Info

Publication number
CN116610795B
CN116610795B (application CN202310863139.8A)
Authority
CN
China
Prior art keywords
document
scoring model
training
encoder
relevance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310863139.8A
Other languages
Chinese (zh)
Other versions
CN116610795A (en)
Inventor
徐琳 (Xu Lin)
暴宇健 (Bao Yujian)
王芳 (Wang Fang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xumi Yuntu Space Technology Co Ltd
Original Assignee
Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xumi Yuntu Space Technology Co Ltd filed Critical Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority to CN202310863139.8A priority Critical patent/CN116610795B/en
Publication of CN116610795A publication Critical patent/CN116610795A/en
Application granted granted Critical
Publication of CN116610795B publication Critical patent/CN116610795B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to the technical field of computers, and provides a text retrieval method and a text retrieval device. The method comprises the following steps: inputting a question sentence to be retrieved into a question encoder of a preset relevance scoring model, wherein the relevance scoring model has a dual-tower architecture, the two branches of the dual-tower architecture are a document encoder and a question encoder respectively, and some hard negative examples in the training data of the relevance scoring model are generated using a first language model; inputting pseudo question sentences, generated by the first language model from the in-library documents in a document library, together with the corresponding in-library documents into the document encoder, wherein the document encoder has an interactive architecture; obtaining the relevance scores, output by the relevance scoring model, between the question sentence to be retrieved and the in-library documents in the document library; and obtaining the retrieval text for the question sentence to be retrieved according to the relevance scores. The technical scheme improves the accuracy of document characterization in text retrieval without sacrificing query efficiency.

Description

Text retrieval method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a text retrieval method and apparatus.
Background
Currently, dense vector retrieval plays a vital role in information retrieval. A dense vector is a continuous, real-valued vector stored in an array data structure. Compared with the traditional BM25 (Best Match 25) retrieval method, dense vector retrieval better captures the semantic relationship between a question and a document.
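To make the contrast concrete, here is a minimal, purely illustrative sketch of dense-vector scoring; the vectors below are stand-ins for real encoder outputs, not embeddings produced by any model in this disclosure:

```python
# Minimal sketch of dense-vector relevance scoring (illustrative only:
# the vectors are hypothetical stand-ins for encoder outputs).
import numpy as np

question_vec = np.array([0.12, -0.53, 0.88, 0.07])  # hypothetical question embedding
document_vec = np.array([0.10, -0.49, 0.91, 0.02])  # hypothetical document embedding

# Dot-product similarity, the scoring used throughout this disclosure.
score = float(np.dot(question_vec, document_vec))
print(f"relevance score: {score:.4f}")
```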
For relevance scoring between a question and a document there are mainly two scoring frameworks: non-interactive and interactive. Interactive frameworks are too computationally intensive and hurt the efficiency of text retrieval, while non-interactive frameworks perform poorly on multi-topic content in long documents and are therefore less accurate.
How to improve the accuracy of document characterization without losing query efficiency is a technical problem that currently needs to be solved.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a text retrieval method, apparatus, electronic device, and computer-readable storage medium, so as to solve the technical problem in the prior art that, in text retrieval, either document characterization accuracy or retrieval efficiency is low.
In a first aspect of an embodiment of the present disclosure, there is provided a text retrieval method, including: inputting a question sentence to be retrieved into a question encoder of a preset relevance scoring model, wherein the relevance scoring model has a dual-tower architecture, the two branches of the dual-tower architecture are a document encoder and a question encoder respectively, and some hard negative examples in the training data of the relevance scoring model are generated using a first language model; inputting pseudo question sentences, generated by the first language model from the in-library documents in a document library, together with the corresponding in-library documents into the document encoder, wherein the document encoder has an interactive architecture; obtaining the relevance scores, output by the relevance scoring model, between the question sentence to be retrieved and the in-library documents in the document library; and obtaining the retrieval text for the question sentence to be retrieved according to the relevance scores.
In a second aspect of embodiments of the present disclosure, there is provided a text retrieval apparatus, the apparatus comprising: a question encoding module for inputting a question sentence to be retrieved into a question encoder of a preset relevance scoring model, wherein the relevance scoring model has a dual-tower architecture, the two branches of the dual-tower architecture are a document encoder and a question encoder respectively, and some hard negative examples in the training data of the relevance scoring model are generated using a first language model; a document encoding module for inputting pseudo question sentences, generated by the first language model from the in-library documents in a document library, together with the corresponding in-library documents into the document encoder, wherein the document encoder has an interactive architecture; a relevance score acquisition module for acquiring the relevance scores, output by the relevance scoring model, between the question sentence to be retrieved and the in-library documents in the document library; and a retrieval text acquisition module for acquiring the retrieval text of the question sentence to be retrieved according to the relevance scores.
In a third aspect of the disclosed embodiments, an electronic device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the disclosed embodiments, a computer-readable storage medium is provided, which stores a computer program which, when executed by a processor, implements the steps of the above-described method.
Compared with the prior art, the embodiments of the present disclosure have the following beneficial effects: the technical scheme realizes text retrieval using a relevance scoring model that is locally interactive and globally dual-tower, generates some of the hard negative examples in the training data of the relevance scoring model with a first language model, and likewise uses the first language model to generate pseudo question sentences from the in-library documents, thereby improving the accuracy of document characterization without losing query efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required for the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 is a schematic flow chart of a text retrieval method according to an embodiment of the disclosure;
FIG. 2 is a schematic diagram of a relevance scoring model provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a text retrieval device according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
In the related art, relevance scoring between a query and a document mainly uses two frameworks: the Dual-Encoder (dual-tower) framework and the Cross-Encoder (interactive) framework. The Cross-Encoder framework is too computationally intensive to use in the recall phase, while the Dual-Encoder framework performs poorly on multi-topic content in long documents, because the question and the document never interact, and therefore has low accuracy. Some research aims to improve model speed and effectiveness with a late-interaction architecture, but such models still cannot rank directly with pre-stored vectors, so the gains in speed and effectiveness are limited.
To solve the above problems, embodiments of the present disclosure provide a text retrieval scheme that improves the accuracy of document characterization without losing query efficiency in text retrieval.
Specifically, in the embodiments of the present disclosure, generated question sentences can be used to learn multi-view representations of a single document, which are then used when querying documents, so that the accuracy of document characterization can be improved without losing query efficiency.
Text retrieval methods and apparatuses according to embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a text retrieval method according to an embodiment of the present disclosure. The method provided by the embodiments of the present disclosure may be performed by any electronic device with computer processing capabilities, such as a terminal or a server. As shown in fig. 1, the text retrieval method includes:
step S101, inputting a to-be-searched problem sentence into a problem encoder of a preset relevance scoring model, wherein the relevance scoring model is provided with a double-tower frame, two branches of the double-tower frame are a document encoder and a problem encoder respectively, and part of negative refractory cases in training data of the relevance scoring model are generated by using a first language model.
Specifically, after the to-be-searched question sentences are input to the question encoder of the relevance scoring model, the upper and lower Wen Biaozheng vectors of the to-be-searched question sentences can be obtained.
Step S102, inputting pseudo question sentences, generated by the first language model from the in-library documents in the document library, together with the corresponding in-library documents into the document encoder, wherein the document encoder has an interactive architecture.
Specifically, after the pseudo question sentences and the corresponding in-library documents are input into the document encoder of the relevance scoring model, the multi-view encoding vectors of the in-library documents can be obtained.
Step S103, obtaining the relevance scores, output by the relevance scoring model, between the question sentence to be retrieved and the in-library documents in the document library.
Specifically, the relevance scoring model outputs relevance scores between the question sentence to be retrieved and the multi-view encoding vectors of an in-library document, i.e., relevance scores between the same in-library document under different views and the question sentence to be retrieved, and the maximum of the relevance scores over the different views is taken as the relevance score between the question sentence to be retrieved and that in-library document. All in-library documents are traversed to obtain the relevance scores between every in-library document and the question sentence to be retrieved.
Step S104, obtaining the retrieval text for the question sentence to be retrieved according to the relevance scores.
Specifically, the in-library document with the highest relevance score with respect to the question sentence to be retrieved can be selected as the retrieval text most similar to the question sentence to be retrieved.
As shown in fig. 2, the relevance scoring model in the embodiments of the present disclosure has two branches: question encoder 204 and document encoder 205. The question 201 is input to the question encoder 204 to produce a first vector 206, and the question 202 and the document 203 are input to the document encoder 205 to produce a second vector 207. From the first vector 206 and the second vector 207, a relevance score 208 between the question and the document can be derived. The first vector is the context representation vector, and the second vector is the multi-view encoding vector of the document.
The relevance scoring model of the embodiments of the present disclosure fuses the interactive and non-interactive retrieval model architectures: it keeps the higher accuracy of the interactive architecture while retaining the small computational footprint of the non-interactive architecture, so that excessive computation at the inference stage does not render the model unusable for recall. Further, the interactive architecture of the document encoder can exploit the diversity of the questions generated by the first language model to learn multi-view representations of a single document, thereby improving the characterization of the queried documents, so that the relevance scoring model achieves higher retrieval accuracy than a purely non-interactive model.
Specifically, the first vector v_q output by the question encoder shown in FIG. 2 can be represented by the following formula (1):

v_q = E_q(q)    (1)

where q is the question input to the question encoder, and E_q denotes the question encoder.

The second vector v_d output by the document encoder shown in FIG. 2 can be represented by the following formula (2):

v_d = E_d(q + d)    (2)

where q is the question input to the document encoder, E_d denotes the document encoder, and q + d denotes that the question q and the document d are jointly taken as input to the document encoder of the relevance scoring model.

The similarity s(q, d) between the first vector and the second vector can be measured by the dot product in the following formula (3):

s(q, d) = v_q · v_d    (3)
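As a rough illustration of formulas (1) to (3) (a sketch, not the disclosure's own implementation), the following assumes two generic BERT-style encoders from the Hugging Face transformers library, with [CLS] pooling and an arbitrarily chosen checkpoint:

```python
# Illustrative sketch of formulas (1)-(3): v_q = E_q(q), v_d = E_d(q + d),
# s(q, d) = v_q . v_d. The checkpoint name and [CLS] pooling are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "bert-base-chinese"  # hypothetical choice of initial scoring model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
question_encoder = AutoModel.from_pretrained(MODEL)   # E_q
document_encoder = AutoModel.from_pretrained(MODEL)   # E_d (separate weights)

def encode_question(q: str) -> torch.Tensor:
    """Formula (1): encode the question alone (non-interactive branch)."""
    inputs = tokenizer(q, return_tensors="pt", truncation=True)
    with torch.no_grad():  # inference-style scoring
        out = question_encoder(**inputs)
    return out.last_hidden_state[:, 0]  # [CLS] as context representation vector

def encode_document(q: str, d: str) -> torch.Tensor:
    """Formula (2): encode question and document jointly (interactive branch)."""
    inputs = tokenizer(q, d, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = document_encoder(**inputs)
    return out.last_hidden_state[:, 0]

v_q = encode_question("北京近日天气")
v_d = encode_document("北京近日天气", "西伯利亚冷空气近日抵达北京……")
score = (v_q * v_d).sum().item()  # formula (3): dot product
```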
In the related art, the questions input to the question encoder and to the document encoder differ, because only the standard questions attached to documents in the training set can be used when training a relevance scoring model, and manually writing the questions that every in-library document in the entire document library might answer is unrealistic. Thus, in the disclosed embodiments, an existing question generation model, i.e., the first language model, may be used to generate the corresponding retrieval sentences, i.e., questions, from the in-library documents. The first language model may be any of several mature large language models, such as ChatGPT (Chat Generative Pre-trained Transformer), BARD (a chat robot), or LLaMA (an artificial-intelligence language model), but is not limited thereto. Specifically, the first language model can be used to generate several pseudo questions for each in-library document, and top-K sampling is adopted during decoding to ensure the diversity of the pseudo questions.
Specifically, when generating a pseudo question, question generation may be performed using a prompt like the following:
"Please generate a question sentence through which this article might be retrieved. For example, for the following article: 'A Siberian cold air mass is approaching Beijing in the coming days; the Beijing area is expected to ……', a possible question for this article is: 'Beijing weather recently'.
Imitating the above format, construct a question for the following article: (the article content for which the question is to be generated)."
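A minimal sketch of this pseudo-question generation step follows; llm_generate is a hypothetical placeholder for a call to the first language model, not a real API:

```python
# Sketch of pseudo-question generation for an in-library document. The
# llm_generate function is a placeholder for whichever first language model
# is used (ChatGPT, BARD, LLaMA, ...); it is an assumption, not a real API.
PROMPT_TEMPLATE = (
    "Please generate a question sentence through which this article "
    "might be retrieved.\nArticle: {document}\nQuestion:"
)

def llm_generate(prompt: str, top_k: int = 50) -> str:
    # Placeholder: a real implementation would call a large language model
    # with top-K sampling enabled to keep the outputs diverse.
    return "placeholder question"

def generate_pseudo_questions(document: str, n: int = 3) -> list[str]:
    prompt = PROMPT_TEMPLATE.format(document=document)
    return [llm_generate(prompt, top_k=50) for _ in range(n)]
```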
In embodiments of the present disclosure, three or more large language models may be used. For the same document, at least three questions can be generated with each language model, so that a single document is augmented with a total of 3 × 3 = 9 questions, which greatly improves the diversity of the model's training data.
The relevance scoring model of the embodiments of the present disclosure is trained from an initial scoring model, which may be any one of the following: BERT (Bidirectional Encoder Representations from Transformers), RoBERTa (A Robustly Optimized BERT), ALBERT (a lightweight BERT), or SimBERT (a simple BERT).
As shown in fig. 1, before the relevance scoring model provided in the embodiments of the present disclosure can be applied for inference, the framework of an initial scoring model needs to be determined, and the initial scoring model needs to be trained to obtain the relevance scoring model.
When training the relevance scoring model, pseudo training question sentences generated using the first language model from training documents can be input into the question encoder of an initial scoring model, and the pseudo training question sentences together with the training documents can be input into the document encoder of the initial scoring model, wherein the initial scoring model is the initial model of the relevance scoring model, and the first training data of the relevance scoring model comprises the pseudo training question sentences and the training documents.
After the first training data are input into the initial scoring model, the relevance evaluation scores output by the initial scoring model can be obtained, and the initial scoring model is trained according to the relevance evaluation scores and the training labels corresponding to the first training data, yielding a trained intermediate scoring model. A training label is the labeled relevance score corresponding to the first training data.
Further, after the trained intermediate scoring model is obtained, second training data are selected to train the intermediate scoring model, yielding the relevance scoring model.
Specifically, during model training, data augmentation can be used to generate questions, the generated questions are treated as pseudo-labeled data, and the model is trained on the pseudo-labeled data and the training documents. This training process can be regarded as a warm-up phase, after which the model is fine-tuned on a genuinely labeled, high-quality training set.
Before the relevance scoring model is trained, the pseudo training question sentences, i.e., the pseudo-labeled data, need to be generated from the training documents using the first language model.
After the generated questions are fused into the document representation, the positive and negative examples of a sample may be redefined. For a given question, four forms of positive and negative examples may be defined. The negative examples include hard negative examples (hard negatives) and in-batch negative examples. A positive example is the target that should be predicted as relevant; a negative example should not be. A hard negative example, also called a hard negative sample, is a negative sample that is difficult to distinguish or classify from the positive samples: the difference between such a negative sample and the anchor, i.e., the reference data point, is small.
Hard negatives are constructed in two ways. In the first, negative documents are randomly drawn from the documents ranked highest by BM25. Such hard negatives let the model learn finer-grained information: a negative document is often related to the question yet does not answer it accurately. They also prevent the model from learning matching signals only from the question side while ignoring document-side information.
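A sketch of this first construction follows, assuming the third-party rank_bm25 package as the BM25 implementation (the disclosure does not name one):

```python
# Sketch of the first hard-negative construction: randomly sample negatives
# from documents ranked highly by BM25 for a given question. rank_bm25 is an
# assumed implementation choice; k = 7 echoes the 1:7 ratio mentioned later.
import random
from rank_bm25 import BM25Okapi

def bm25_hard_negatives(question, corpus, positive_idx, top_n=50, k=7):
    tokenized = [doc.split() for doc in corpus]            # naive tokenization
    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores(question.split())
    ranked = sorted(range(len(corpus)), key=lambda i: -scores[i])[:top_n]
    candidates = [i for i in ranked if i != positive_idx]  # drop the true positive
    return random.sample(candidates, min(k, len(candidates)))
```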
In the other construction, a large language model is used to generate hard-negative questions.
The prompt for this generation can be as follows: "Please generate questions for me that are related to the article but cannot be answered directly by the article. Example: for the article 'A Siberian cold air mass is approaching Beijing in the coming days; the Beijing area is expected to ……', a question related to this article but not directly answerable by it is: 'Statistics and patterns of precipitation in Beijing over the years'.
Generate a hard query for the following article: (the article to be processed). Hard query:"
The proportions of the hard negatives generated in these two ways differ across training stages. In the embodiments of the present disclosure, the hard negatives generated using the first language model account for 30% of all hard negatives in the first third of the iterations of training the initial scoring model, 50% in the middle third, and 60% in the last third.
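A sketch of this mixing schedule, under the assumption that it is keyed to the current training step:

```python
# Sketch of the hard-negative mixing schedule described above: the share of
# LLM-generated hard negatives rises from 30% to 50% to 60% across the
# first, middle, and last thirds of the training iterations.
def llm_negative_ratio(step: int, total_steps: int) -> float:
    third = total_steps / 3
    if step < third:
        return 0.30
    elif step < 2 * third:
        return 0.50
    return 0.60
```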
Using in-batch negative examples improves training efficiency and lets the model learn topic-level discrimination.
In the embodiments of the present disclosure, a contrastive learning loss function may be used both in training the initial scoring model and in training the intermediate scoring model. Specifically, the contrastive learning loss function may be, but is not limited to, a triplet loss function (a brief sketch is given after this passage).
When the initial scoring model is trained with the contrastive learning loss function, the convergence condition of the initial scoring model may be that the value of the loss function is minimal, that the value of the loss function fluctuates within a range, or that the number of training iterations reaches a certain count.
After the loss of the initial scoring model is determined, the network parameters of the initial scoring model are adjusted according to the loss; this is the parameter tuning process of the initial scoring model. In actual training, iterative parameter tuning is performed several times on the first training data until the initial scoring model converges, yielding the intermediate scoring model.
Similarly, after the loss of the intermediate scoring model is determined, the network parameters of the intermediate scoring model are adjusted according to the loss; this is the second parameter tuning process. In actual training, iterative parameter tuning is performed several times on the second training data until the intermediate scoring model converges, yielding the relevance scoring model.
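The sketch below illustrates the triplet-loss option with PyTorch's built-in TripletMarginLoss; the margin, the batch size of 32, and the vector width of 768 are illustrative assumptions:

```python
# Sketch of contrastive training with a triplet loss, one option the
# disclosure mentions. The tensors stand in for encoder outputs.
import torch

triplet_loss = torch.nn.TripletMarginLoss(margin=1.0)

# requires_grad mimics outputs of trainable encoders.
anchor   = torch.randn(32, 768, requires_grad=True)  # question vectors v_q
positive = torch.randn(32, 768, requires_grad=True)  # views of relevant documents
negative = torch.randn(32, 768, requires_grad=True)  # views of hard negatives

loss = triplet_loss(anchor, positive, negative)
loss.backward()  # in real training, gradients update both encoders
```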
In the embodiments of the present disclosure, the ratio of positive to negative examples among the samples used to train the initial scoring model may be, but is not limited to, 1:7. The number of samples passed to the trainer at a time (batch_size) when training the initial scoring model may be, but is not limited to, 32; these settings give good training results.
In the embodiments of the present disclosure, before the pseudo question sentences generated using the first language model from the in-library documents in the document library and the corresponding in-library documents are input into the document encoder, K pseudo question sentences may be generated for each in-library document using top-K decoding, where K is a natural number. Here TopK refers to the algorithm that selects the K largest or smallest elements in a data set.
In the embodiments of the present disclosure, top-K question generation needs to be performed on each document in advance, and the generated questions are put into the document encoder together with the document to obtain the candidate encoding vector for one view of the document. A document yields as many candidate view vectors as it has generated questions. Encoding the document library, i.e., the corpus, in this way produces a multi-view document representation with deep question-document interaction.
The procedure for generating a question q_j^i and a document view v_j^i is shown in the following formulas (4) and (5):

q_j^i = LM(d_j)    (4)

v_j^i = E_d(q_j^i + d_j)    (5)

where q_j^i is the i-th question of the j-th document d_j, v_j^i is the i-th view of the j-th document, and LM denotes the first language model.
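Putting formulas (4) and (5) together, a sketch of the offline corpus-encoding step, reusing the encode_document and generate_pseudo_questions sketches given earlier:

```python
# Sketch of offline corpus encoding: each in-library document d_j is stored
# as K view vectors v_j^i, one per generated pseudo question (formulas (4)
# and (5)). K = 3 is an illustrative choice.
def build_multiview_index(corpus: list[str], k: int = 3) -> list[list]:
    index = []  # index[j] holds the K view vectors of document j
    for doc in corpus:
        pseudo_questions = generate_pseudo_questions(doc, n=k)         # formula (4)
        views = [encode_document(pq, doc) for pq in pseudo_questions]  # formula (5)
        index.append(views)
    return index
```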
Specifically, for a given document A, an example of multi-view question generation for it is as follows:
"Please generate multiple queries that differ in questioning angle and can be answered by the article. Example: for the article 'A Siberian cold air mass is approaching Beijing in the coming days; the Beijing area is expected to ……', one possible query the article can answer directly is: 'Beijing weather conditions in the coming days'.
Please generate multiple similar queries with different questioning angles for the following article: (article text ……). Query:"
In the embodiments of the present disclosure, the quality of the questions generated by the first language model can be spot-checked, and the number of generated questions, the choice of the first language model, and the wording of the prompts can be tuned accordingly.
Further, when the relevance scoring model is used for document retrieval, the question encoder encodes a question q to obtain its context representation vector v_q. The maximum (max-pooling) of the relevance scores between v_q and the multi-view encoding vectors v_j^1, …, v_j^K of a document d_j over the different views is taken as the relevance score s(q, d_j) between the question q and the document d_j:

s(q, d_j) = max_i (v_q · v_j^i)    (6)

The relevance score thus supports ranking directly by inter-vector similarity.
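A sketch of this online retrieval step per formula (6), reusing the encode_question sketch and the multi-view index from the earlier sketches:

```python
# Sketch of online retrieval per formula (6): score the question's context
# representation vector against every pre-stored view vector of each
# document, max-pool over views, and rank by the resulting score.
def retrieve(question: str, index, corpus: list[str], top_n: int = 5):
    v_q = encode_question(question)                        # formula (1)
    doc_scores = []
    for views in index:
        view_scores = [(v_q * v).sum().item() for v in views]
        doc_scores.append(max(view_scores))                # max-pooling, formula (6)
    ranked = sorted(range(len(corpus)), key=lambda j: -doc_scores[j])
    return [(corpus[j], doc_scores[j]) for j in ranked[:top_n]]
```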
According to the text retrieval method of the embodiments of the present disclosure, text retrieval is realized with a relevance scoring model that is locally interactive and globally dual-tower; some of the hard negative examples in the training data of the relevance scoring model are generated with the first language model, and the first language model is likewise used to generate pseudo question sentences from the in-library documents, which improves the accuracy of document characterization without losing query efficiency.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. The text retrieval apparatus described below and the text retrieval method described above may be referred to correspondingly to each other. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
Fig. 3 is a schematic diagram of a text retrieval device provided in an embodiment of the present disclosure. As shown in fig. 3, the text retrieval apparatus includes:
the problem encoding module 301 is configured to input a to-be-retrieved problem sentence into a problem encoder of a preset relevance scoring model, where the relevance scoring model has a double-tower frame, two branches of the double-tower frame are a document encoder and a problem encoder, and a part of negative refractory cases in training data of the relevance scoring model are generated using a first language model.
Specifically, after the to-be-searched question sentences are input to the question encoder of the relevance scoring model, the upper and lower Wen Biaozheng vectors of the to-be-searched question sentences can be obtained.
A document encoding module 302 for inputting pseudo-problem statements generated using a first language model from an in-library document in a document library and a corresponding in-library document into a document encoder, wherein the document encoder has an interactive framework.
Specifically, after the pseudo-problem sentences and the corresponding in-library documents are input into a document encoder of a relevance scoring model, the encoding vectors of the multi-view in-library documents can be obtained.
And the relevance score acquisition module 303 is used for acquiring the relevance score of the to-be-searched problem statement output by the relevance score model and the in-library document in the document library.
Specifically, the relevance scoring model outputs relevance scores of the code vectors of the to-be-searched problem sentences and the multi-view in-library documents, namely, the relevance scores of the same in-library document and the to-be-searched problem sentences at different view angles, and the maximum value of the relevance scores of the different view angles is taken as the relevance score of the to-be-searched problem sentences and the in-library document. And traversing all the in-library documents to obtain the relevance scores of all the in-library documents and the problem sentences to be searched.
The retrieval text obtaining module 304 is configured to obtain a retrieval text of the question sentence to be retrieved according to the relevance score.
Specifically, the in-library document with the highest relevance score to the problem statement to be searched in the library documents can be selected as the search text with the highest similarity to the problem statement to be searched.
The text retrieval apparatus of the embodiments of the present disclosure may further comprise a training module configured to:
input pseudo training question sentences generated using the first language model from training documents into the question encoder of an initial scoring model, and input the pseudo training question sentences together with the training documents into the document encoder of the initial scoring model, wherein the initial scoring model is the initial model of the relevance scoring model, and the first training data of the relevance scoring model comprises the pseudo training question sentences and the training documents; obtain the relevance evaluation scores output by the initial scoring model; train the initial scoring model according to the relevance evaluation scores and the training labels corresponding to the first training data to obtain a trained intermediate scoring model; and select second training data to train the intermediate scoring model to obtain the relevance scoring model.
In the embodiments of the present disclosure, the hard negative examples generated using the first language model account for 30% of all hard negatives in the first third of the iterations of training the initial scoring model, 50% in the middle third, and 60% in the last third.
In the embodiments of the present disclosure, the training module is further configured to train the initial scoring model using a contrastive learning loss function, and to train the intermediate scoring model using a contrastive learning loss function.
In the embodiments of the present disclosure, the ratio of positive to negative examples among the samples used to train the initial scoring model includes 1:7; and/or the number of samples passed to the trainer at a time (batch_size) when training the initial scoring model includes 32.
In the embodiments of the present disclosure, the initial scoring model includes any one of the following models: Bidirectional Encoder Representations from Transformers (BERT), Robustly Optimized BERT (RoBERTa), lightweight BERT (ALBERT), and simple BERT (SimBERT).
The text retrieval apparatus of the embodiments of the present disclosure may further include a generation module, configured to perform top-K pseudo question sentence generation on each in-library document before the pseudo question sentences generated using the first language model from the in-library documents in the document library and the corresponding in-library documents are input into the document encoder.
Since each functional module of the text retrieval device according to the exemplary embodiment of the present disclosure corresponds to a step of the exemplary embodiment of the text retrieval method described above, for details not disclosed in the embodiment of the device of the present disclosure, please refer to the embodiment of the text retrieval method described above in the present disclosure.
According to the text retrieval apparatus of the embodiments of the present disclosure, text retrieval is realized with a relevance scoring model that is locally interactive and globally dual-tower; some of the hard negative examples in the training data of the relevance scoring model are generated with the first language model, and the first language model is likewise used to generate pseudo question sentences from the in-library documents, which improves the accuracy of document characterization without losing query efficiency.
Fig. 4 is a schematic diagram of an electronic device 4 provided by an embodiment of the present disclosure. As shown in fig. 4, the electronic apparatus 4 of this embodiment includes: a processor 401, a memory 402 and a computer program 403 stored in the memory 402 and executable on the processor 401. The steps of the various method embodiments described above are implemented by processor 401 when executing computer program 403. Alternatively, the processor 401 may execute the computer program 403 to implement the functions of the modules in the above-described device embodiments.
The electronic device 4 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The electronic device 4 may include, but is not limited to, a processor 401 and a memory 402. It will be appreciated by those skilled in the art that fig. 4 is merely an example of the electronic device 4 and is not limiting of the electronic device 4 and may include more or fewer components than shown, or different components.
The processor 401 may be a central processing unit (Central Processing Unit, CPU) or other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
The memory 402 may be an internal storage unit of the electronic device 4, for example, a hard disk or a memory of the electronic device 4. The memory 402 may also be an external storage device of the electronic device 4, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like, which are provided on the electronic device 4. Memory 402 may also include both internal storage units and external storage devices of electronic device 4. The memory 402 is used to store computer programs and other programs and data required by the electronic device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.
The integrated modules, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the present disclosure may implement all or part of the flow of the method of the above-described embodiments by means of a computer program instructing related hardware; the computer program may be stored in a computer readable storage medium, and when executed by a processor, may implement the steps of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer readable medium can be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, the computer readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are merely for illustrating the technical solution of the present disclosure, and are not limiting thereof; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the disclosure, and are intended to be included in the scope of the present disclosure.

Claims (9)

1. A text retrieval method, the method comprising:
inputting a question sentence to be retrieved into a question encoder of a preset relevance scoring model, wherein the relevance scoring model has a dual-tower architecture, the two branches of the dual-tower architecture are a document encoder and the question encoder respectively, the question encoder has a non-interactive architecture, some hard negative examples in the training data of the relevance scoring model are generated using a first language model, the question encoder is used to encode an input question into a context representation vector, and the document encoder is used to encode an input question and document into multi-view encoding vectors of the document;
inputting pseudo question sentences generated using the first language model from the in-library documents in a document library, together with the corresponding in-library documents, into the document encoder, wherein the document encoder has an interactive architecture;
acquiring the relevance scores, output by the relevance scoring model, between the question sentence to be retrieved and the in-library documents in the document library;
acquiring the retrieval text for the question sentence to be retrieved according to the relevance scores;
the training method of the relevance scoring model comprises the following steps:
inputting pseudo training question sentences generated using the first language model from training documents into a question encoder of an initial scoring model, and inputting the pseudo training question sentences and the training documents into a document encoder of the initial scoring model, wherein the initial scoring model is the initial model of the relevance scoring model, and the first training data of the relevance scoring model comprises the pseudo training question sentences and the training documents;
obtaining a relevance evaluation score output by the initial scoring model;
training the initial scoring model according to the relevance evaluation score and a training label corresponding to the first training data to obtain a trained intermediate scoring model;
and selecting second training data to train the intermediate scoring model to obtain the relevance scoring model.
2. The method of claim 1, wherein the hard negative examples generated using the first language model account for 30% of all hard negative examples in a first third of the iterations in which the initial scoring model is trained, 50% of all hard negative examples in a middle third of the iterations, and 60% of all hard negative examples in a last third of the iterations.
3. The method of claim 1, wherein training the initial scoring model comprises: training the initial scoring model using a contrastive learning loss function;
and training the intermediate scoring model comprises: training the intermediate scoring model using a contrastive learning loss function.
4. The method of claim 1, wherein the ratio of positive to negative examples among the samples employed in training the initial scoring model comprises 1:7; and/or,
the number of samples passed to the program in a single pass when training the initial scoring model comprises 32.
5. The method of claim 1, wherein the initial scoring model comprises any one of the following models: Bidirectional Encoder Representations from Transformers (BERT), Robustly Optimized BERT (RoBERTa), lightweight BERT (ALBERT), and simple BERT (SimBERT).
6. The method of claim 1, wherein before the pseudo question sentences generated using the first language model from the in-library documents in the document library and the corresponding in-library documents are input into the document encoder, the method further comprises:
generating, for each in-library document, K pseudo question sentences by top-K selection of the K largest or smallest elements, wherein K is a natural number.
7. A text retrieval apparatus, the apparatus comprising:
a question encoding module, configured to input a question sentence to be retrieved into a question encoder of a preset relevance scoring model, wherein the relevance scoring model has a dual-tower architecture, the two branches of the dual-tower architecture are a document encoder and the question encoder, the question encoder has a non-interactive architecture, some hard negative examples in the training data of the relevance scoring model are generated using a first language model, the question encoder is configured to encode an input question into a context representation vector, and the document encoder is configured to encode an input question and document into multi-view encoding vectors of the document;
a document encoding module, configured to input pseudo question sentences generated using the first language model from the in-library documents in a document library, together with the corresponding in-library documents, into the document encoder, wherein the document encoder has an interactive architecture;
a relevance score acquisition module, configured to acquire the relevance scores, output by the relevance scoring model, between the question sentence to be retrieved and the in-library documents in the document library;
a retrieval text acquisition module, configured to acquire the retrieval text for the question sentence to be retrieved according to the relevance scores;
a training module, configured to input pseudo training question sentences generated using the first language model from training documents into a question encoder of an initial scoring model, and to input the pseudo training question sentences and the training documents into a document encoder of the initial scoring model, wherein the initial scoring model is the initial model of the relevance scoring model, and the first training data of the relevance scoring model comprises the pseudo training question sentences and the training documents; obtain the relevance evaluation scores output by the initial scoring model; train the initial scoring model according to the relevance evaluation scores and the training labels corresponding to the first training data to obtain a trained intermediate scoring model; and select second training data to train the intermediate scoring model to obtain the relevance scoring model.
8. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 6.
CN202310863139.8A 2023-07-14 2023-07-14 Text retrieval method and device Active CN116610795B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310863139.8A CN116610795B (en) 2023-07-14 2023-07-14 Text retrieval method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310863139.8A CN116610795B (en) 2023-07-14 2023-07-14 Text retrieval method and device

Publications (2)

Publication Number Publication Date
CN116610795A CN116610795A (en) 2023-08-18
CN116610795B true CN116610795B (en) 2024-03-15

Family

ID=87678512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310863139.8A Active CN116610795B (en) 2023-07-14 2023-07-14 Text retrieval method and device

Country Status (1)

Country Link
CN (1) CN116610795B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117932039A (en) * 2024-03-21 2024-04-26 山东大学 Interpretable text review method and system based on heuristic question-answer reasoning

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114579704A (en) * 2022-03-03 2022-06-03 贝壳找房网(北京)信息技术有限公司 Semantic matching method and device
CN114625838A (en) * 2022-03-10 2022-06-14 平安科技(深圳)有限公司 Search system optimization method and device, storage medium and computer equipment
CN114780709A (en) * 2022-03-22 2022-07-22 北京三快在线科技有限公司 Text matching method and device and electronic equipment
CN114880452A (en) * 2022-05-25 2022-08-09 重庆大学 Text retrieval method based on multi-view contrast learning
CN115329749A (en) * 2022-10-14 2022-11-11 成都数之联科技股份有限公司 Recall and ordering combined training method and system for semantic retrieval
CN115344672A (en) * 2022-10-18 2022-11-15 北京澜舟科技有限公司 Document retrieval model training method, retrieval method and storage medium
CN115374251A (en) * 2022-09-07 2022-11-22 重庆大学 Dense retrieval method based on syntax comparison learning
CN115878564A (en) * 2021-09-29 2023-03-31 华为技术有限公司 Document retrieval method and device
CN116186562A (en) * 2023-04-27 2023-05-30 中南大学 Encoder-based long text matching method


Also Published As

Publication number Publication date
CN116610795A (en) 2023-08-18


Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant