CN113934830A - Text retrieval model training, question and answer retrieval method, device, equipment and medium - Google Patents
Info
- Publication number
- CN113934830A (application number CN202111216601.2A)
- Authority
- CN
- China
- Prior art keywords
- training
- vector
- retrieved
- retrieval
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F16/3344—Query execution using natural language analysis
- G06F16/338—Presentation of query results
Abstract
The invention relates to the technical field of artificial intelligence and provides a text retrieval model training method, a question-answer retrieval method, and a corresponding device, equipment and medium. The method comprises the following steps: acquiring a sample set to be retrieved and a document library; inputting a training sample queue into a text retrieval model containing initial parameters; performing BERT-based embedded vector encoding on the training sample queue and each stored document in the document library, performing vector retrieval using an ANN (approximate nearest neighbor) search algorithm and a KNN (K-nearest neighbor) algorithm to obtain a retrieval result, and recording the number of training iterations; calculating a loss value through a loss function; and, when the loss value has not reached a preset convergence condition and the number of training iterations is a multiple of a preset number, iteratively updating the initial parameters, updating the training sample queue, and continuing training until a converged text retrieval model is obtained. By continually refreshing the training queue with strongly relevant negative samples, the invention makes the text retrieval model more accurate. The method is applicable to the field of artificial intelligence and can further promote the construction of smart cities.
Description
Technical Field
The invention relates to the technical field of artificial intelligence model construction, and in particular to a text retrieval model training method and device, a question-answer retrieval method and device, computer equipment and a storage medium.
Background
Currently, many application scenarios involve question-answer matching. For example, a user may submit a question to a question-and-answer platform through an intelligent terminal, and the platform must match a suitable answer to the question and return it to the user. As another example, in a human-computer interaction system a user may hold a spoken question-and-answer exchange with a robot. Existing question-answer matching methods generally match the user's question against a store of existing questions, find the existing question that matches it, and return the answer corresponding to that existing question. However, the efficiency of such question-answer matching methods is low.
Disclosure of Invention
The invention provides a text retrieval model training method, a question-answer retrieval method, corresponding devices, computer equipment and storage media, realizing a retrieval method that trains by continually updating strongly relevant negative samples and improving the accuracy and correctness of the documents retrieved by the text retrieval model. The invention is applicable to the field of artificial intelligence and can further promote the construction of smart cities.
A text retrieval model training method comprises the following steps:
acquiring a sample set to be retrieved and a document library; the sample set to be retrieved comprises a plurality of samples to be retrieved; one sample to be retrieved corresponds to one target document;
generating a queue for each sample to be retrieved to obtain a training sample queue corresponding to each sample to be retrieved one by one; each training sample queue comprises one to-be-retrieved sample and at least one negative sample;
inputting the training sample queue into a text retrieval model containing initial parameters;
performing BERT-based embedded vector encoding on the training sample queue and each stored document in the document library through the text retrieval model, performing vector retrieval for the encoded training sample queue against the encoded stored documents using an ANN (approximate nearest neighbor) search algorithm and a KNN (K-nearest neighbor) algorithm to obtain a retrieval result, and recording the number of training iterations;
calculating a loss value from the encoded training sample queue and the retrieval result through a loss function;
when the number of training iterations is a multiple of a preset number and the loss value has not reached a preset convergence condition, iteratively updating the initial parameters of the text retrieval model, updating the training sample queue according to the retrieval result obtained before the current iterative update, and returning to the step of inputting the training sample queue into the text retrieval model containing the initial parameters, until the loss value reaches the preset convergence condition, at which point the converged text retrieval model is recorded as the trained text retrieval model.
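The training loop described above can be sketched as follows. This is a minimal illustration with hypothetical names; the encoder is a toy random projection, the parameter update via the loss function is omitted for brevity, and only the periodic refresh of negatives from the latest retrieval result is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for BERT: a fixed random token-embedding table ("initial parameters").
VOCAB, DIM = 50, 16
W = rng.normal(size=(VOCAB, DIM))

def encode(ids, W):
    """Embed a document (a list of token ids) as the mean of its token vectors."""
    return W[ids].mean(axis=0)

def retrieve(query_vec, doc_vecs, k):
    """Brute-force KNN by cosine similarity (stands in for the ANN index)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(-(d @ q))[:k]  # indices of the k nearest stored documents

# A tiny "document library"; the target document for our query is index 0.
docs = [rng.integers(0, VOCAB, size=5).tolist() for _ in range(20)]
query = docs[0][:]                   # query shares tokens with its target

refresh_every = 3                    # the "preset number" of iterations
negatives = [1, 2]                   # initial random negative document indices
for step in range(1, 10):
    doc_vecs = np.stack([encode(d, W) for d in docs])
    result = retrieve(encode(query, W), doc_vecs, k=3)
    # (loss computation and parameter update would happen here)
    if step % refresh_every == 0:
        # Refresh the queue: hardest retrieved non-target documents become negatives.
        negatives = [int(i) for i in result if i != 0][:2]

top1 = int(retrieve(encode(query, W), np.stack([encode(d, W) for d in docs]), k=1)[0])
```

The key design point is that negatives are not fixed up front: every `refresh_every` iterations they are replaced by the documents the current model confuses with the target, which is what the patent means by "updating the training sample queue according to the retrieval result".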
A question-answer retrieval method comprises the following steps:
acquiring a question to be retrieved;
inputting the question to be retrieved into a trained text retrieval model, obtained through the above text retrieval model training method, to perform vector retrieval and obtain a preset number of candidate answers corresponding to the question;
and ranking all the candidate answers, and determining the top-ranked candidate answer as the reply answer.
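The question-answer retrieval steps above reduce to "retrieve k candidates, rank them, return the best". A minimal sketch, with a stub retriever standing in for the trained text retrieval model (all names here are hypothetical):

```python
def answer(question, retrieve_fn, k=3):
    """Retrieve k candidate answers as (text, score) pairs, rank them by
    score in descending order, and return the top-ranked one as the reply."""
    candidates = retrieve_fn(question, k)
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return ranked[0][0]

def fake_retrieve(question, k):
    # Stub: a real system would query the vector index of the trained model.
    return [("We open at 9 am.", 0.4), ("Click 'Forgot password'.", 0.9)][:k]

reply = answer("how do I reset my password?", fake_retrieve)
```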
A text retrieval model training apparatus, comprising:
the first acquisition module is used for acquiring a sample set to be retrieved and a document library; the sample set to be retrieved comprises a plurality of samples to be retrieved; one sample to be retrieved corresponds to one target document;
the generating module is used for generating queues of the samples to be retrieved to obtain training sample queues which correspond to the samples to be retrieved one by one; each training sample queue comprises one to-be-retrieved sample and at least one negative sample;
the input module is used for inputting the training sample queue into a text retrieval model containing initial parameters;
the first retrieval module is used for performing BERT-based embedded vector encoding on the training sample queue and each stored document in the document library through the text retrieval model, performing vector retrieval for the encoded training sample queue against the encoded stored documents using an ANN (approximate nearest neighbor) search algorithm and a KNN (K-nearest neighbor) algorithm to obtain a retrieval result, and recording the number of training iterations;
the loss module is used for calculating a loss value according to the training sample queue after the embedded vector coding processing and the retrieval result through a loss function;
and the training module is used for, when the number of training iterations is a multiple of a preset number and the loss value has not reached a preset convergence condition, iteratively updating the initial parameters of the text retrieval model, updating the training sample queue according to the retrieval result obtained before the current iterative update, and returning to the step of inputting the training sample queue into the text retrieval model containing the initial parameters, until the loss value reaches the preset convergence condition, at which point the converged text retrieval model is recorded as the trained text retrieval model.
A question-answer retrieval apparatus comprising:
the second acquisition module is used for acquiring the question to be retrieved;
the second retrieval module is used for inputting the question to be retrieved into a trained text retrieval model, obtained through the above text retrieval model training method, to perform vector retrieval and obtain a preset number of candidate answers corresponding to the question;
and the reply module is used for ranking all the candidate answers and determining the top-ranked candidate answer as the reply answer.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the text retrieval model training method described above when executing the computer program, or implementing the steps of the question-and-answer retrieval method described above when executing the computer program.
A computer-readable storage medium, which stores a computer program, wherein the computer program realizes the steps of the above text retrieval model training method when executed by a processor, or realizes the steps of the above question-and-answer retrieval method when executed by a processor.
The text retrieval model training method and device, the computer equipment and the storage medium provided by the invention acquire a sample set to be retrieved and a document library; input the training sample queue into a text retrieval model containing initial parameters; perform BERT-based embedded vector encoding on the training sample queue and each stored document in the document library through the text retrieval model, and perform vector retrieval for the encoded training sample queue against the encoded stored documents using an ANN (approximate nearest neighbor) search algorithm and a KNN (K-nearest neighbor) algorithm to obtain a retrieval result, recording the number of training iterations; calculate a loss value from the encoded training vector queue and the retrieval result through a loss function; and, when the loss value has not reached a preset convergence condition and the number of training iterations is a multiple of a preset number, iteratively update the initial parameters of the text retrieval model, update the training sample queue according to the retrieval result obtained before the current iterative update, and return to the step of inputting the training sample queue into the text retrieval model containing the initial parameters, until the loss value reaches the preset convergence condition, recording the converged model as the trained text retrieval model. Training by continually refreshing the queue with strongly relevant negative samples lets the model keep learning the subtle differences between the target document and the strongly relevant negative samples, magnifying those fine-grained differences, so that the text retrieval model becomes more accurate and reliable and the accuracy and correctness of the documents it retrieves are improved.
According to the question-answer retrieval method and device, the computer equipment and the storage medium, the question to be retrieved is acquired and, through the text retrieval model obtained with the above training method, the answer that best matches the question is quickly retrieved and accurately determined, improving the accuracy and reliability of the answers output.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic diagram of an application environment of a text search model training method or a question-and-answer search method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for training a text retrieval model according to an embodiment of the invention;
FIG. 3 is a flow chart of a question and answer retrieval method according to an embodiment of the present invention;
FIG. 4 is a block diagram of a text retrieval model training apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic block diagram of a question answering retrieval device according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a computer device in an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The text retrieval model training method provided by the invention can be applied to the application environment shown in fig. 1, wherein a client (computer equipment) communicates with a server through a network. The client (computer device) includes, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, cameras, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.
In an embodiment, as shown in fig. 2, a text retrieval model training method is provided, which mainly includes the following steps S10-S60:
s10, acquiring a sample set to be retrieved and a document library; the sample set to be retrieved comprises a plurality of samples to be retrieved; one sample to be retrieved corresponds to one target document.
Understandably, the sample set to be retrieved is the set of all collected samples to be retrieved that need training; a sample to be retrieved is a historically collected sentence or question that needs retrieval; a target document is the sentence or answer that matches the corresponding sample to be retrieved; and each sample to be retrieved corresponds to one target document.
S20, performing queue generation on each sample to be retrieved to obtain a training sample queue corresponding to each sample to be retrieved one by one; wherein each training sample queue comprises one to-be-retrieved sample and at least one negative sample.
Understandably, the negative samples can be chosen as required: they may be randomly generated character combinations, or a preset number of target documents randomly selected from the sample set to be retrieved. Preferably, a preset number of target documents corresponding to samples other than the sample to be retrieved in the current training sample queue are randomly extracted from the sample set and used as the negative samples of that queue. This construction is used because, during training, the document or text that the text retrieval model finds for the sample to be retrieved should be similar or close to the target document and far from the other, negative, samples, which makes the text retrieval model more accurate. In this way a plurality of training sample queues can be constructed.
The training sample queue comprises a to-be-retrieved sample, a target document corresponding to the to-be-retrieved sample and a preset number of negative samples.
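The queue construction described above can be sketched as follows; the function name and dictionary layout are illustrative, not from the patent. Each sample's negatives are drawn from the targets of the *other* samples, the preferred scheme in the text.

```python
import random

def build_training_queues(samples, targets, num_negatives, seed=0):
    """Build one training sample queue per sample to be retrieved:
    its own target document plus negatives randomly drawn from the
    target documents of the other samples in the set."""
    rng = random.Random(seed)
    queues = []
    for i, (sample, target) in enumerate(zip(samples, targets)):
        other_targets = [t for j, t in enumerate(targets) if j != i]
        negatives = rng.sample(other_targets, num_negatives)
        queues.append({"sample": sample, "target": target, "negatives": negatives})
    return queues

queues = build_training_queues(
    samples=["how do I reset my password?", "what are the opening hours?"],
    targets=["Click 'Forgot password' on the login page.", "We open at 9 am."],
    num_negatives=1,
)
```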
And S30, inputting the training sample queue into a text retrieval model containing initial parameters.
Understandably, the text retrieval model is a neural network model that searches on the basis of the BERT model; it is trained by inputting the training sample queues into the model one by one, and the initial parameters comprise the network structure of the BERT model and the related parameters of each layer.
And S40, performing BERT-based embedded vector encoding on the training sample queue and each stored document in the document library through the text retrieval model, performing vector retrieval for the encoded training sample queue against the encoded stored documents using an ANN search algorithm and a KNN algorithm to obtain a retrieval result, and recording the number of training iterations.
Understandably, the number of training iterations is accumulated by recording one count per vector retrieval during training. The embedded vector encoding process segments the input text, performs context semantic recognition on the segmented characters or words to obtain a text array, and converts each character or word in the text array into an embedded word vector. The ANN search algorithm is an Approximate Nearest Neighbor search method and can be divided into three broad categories: tree-based methods, hash methods and vector quantization methods. The KNN algorithm is the K-Nearest Neighbor algorithm: if most of the K examples most similar to a sample in the feature space (its nearest neighbors) belong to a certain class, the sample also belongs to that class. The ANN search algorithm and the KNN algorithm are used to construct a structured search graph over all the stored documents after embedded vector encoding; the vector retrieval process quickly finds the corresponding documents by traversing this search graph and comparing quantized vectors to match distances to cluster centers; and the retrieval result is the set of stored documents adjacent to the input training sample queue.
The document library is a database storing all collected historical documents; the stored documents are those collected historical documents and may be question documents or answer documents. The retrieval result comprises a preset number of retrieved documents, where a retrieved document is a stored document adjacent to the input training sample queue, and the number of retrieved documents is consistent with the number of negative samples in one training sample queue.
In an embodiment, in step S40, the performing, by the text retrieval model, BERT-based embedded vector coding processing on the training sample queue and each stored document in the document library, and performing vector retrieval on the training sample queue after embedded vector coding processing from the stored document after embedded vector coding processing by using an ANN search algorithm and a KNN algorithm to obtain a retrieval result includes:
s401, performing BERT-based embedded vector coding processing on the training sample queue through the text retrieval model to obtain a training vector queue corresponding to the training sample queue, and performing embedded vector coding on each storage document in the document library to obtain a document vector corresponding to each storage document.
Understandably, embedded vector encoding is the process of converting input text into vectors with a BERT model. The BERT model learns, under self-supervised learning over a large number of prediction tasks, to represent each word with a good feature vector and label; self-supervised learning is supervised learning performed on data without manual labels, and the parameters are iterated during training so that the BERT model embodies the feature vector and label of each word. Thus the sample to be retrieved, the target document and each negative sample in the training sample queue are separately embedded-vector encoded to obtain, respectively, the vector to be retrieved, the target vector and each negative sample vector in the training vector queue. Each stored document can be embedded-vector encoded at the same time, or beforehand, to obtain the document vectors corresponding one to one to the stored documents. In an embodiment, only the first sentence or first paragraph of each stored document is embedded-vector encoded, to improve data processing efficiency and reduce data processing cost, and the vector array obtained from that first sentence or paragraph is used as the document vector corresponding to the stored document.
In an embodiment, the training vector queue includes a vector to be retrieved, a target vector, and the preset number of negative sample vectors.
Understandably, the vector to be retrieved is a vector array obtained after the sample to be retrieved is subjected to embedded vector encoding, the target vector is a vector array obtained after the target document is subjected to embedded vector encoding, the negative sample vector is a vector array obtained after the negative sample is subjected to embedded vector encoding, and the number of the negative sample vectors corresponds to the number of the negative samples one to one.
In an embodiment, in step S401, that is, performing BERT-based embedded vector encoding processing on the training sample queue through the text retrieval model to obtain a training vector queue corresponding to the training sample queue, includes:
and segmenting the sample to be retrieved through the text retrieval model to obtain the unit participles.
Understandably, word segmentation divides the sample to be retrieved into its minimum units, characters or words, yielding a plurality of unit participles corresponding to the sample to be retrieved.
And performing context semantic recognition on all the unit participles through the text retrieval model to obtain a keyword array corresponding to the sample to be retrieved.
Understandably, context semantic recognition may apply the Bi-LSTM (bidirectional long short-term memory network) algorithm, encoding jointly in the forward and reverse directions before embedded word vector conversion, to ensure that each unit participle is converted into the encoding that best fits its semantics. Real entities are then extracted from the encodings of the unit participles, non-entities such as function words and auxiliary words are removed, and the extracted entities are determined as the keyword array corresponding to the sample to be retrieved.
And performing embedded word vector conversion on the keyword array through the text retrieval model to obtain the vector to be retrieved corresponding to the sample to be retrieved.
Understandably, uniform embedded word vector conversion is performed on the keyword array: the word vector corresponding to each entity in the keyword array is looked up one by one in a preset dictionary library, and the converted word vectors are spliced in a uniform format to obtain the vector to be retrieved corresponding to the sample to be retrieved.
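The dictionary lookup and splicing step can be sketched as follows. The dictionary contents, the zero-vector fallback for unknown words, and the function name are assumptions for illustration; the patent only specifies a lookup in a preset dictionary library followed by splicing in a uniform format.

```python
import numpy as np

# Hypothetical "preset dictionary library": keyword -> fixed word vector.
DICT = {
    "reset": np.array([1.0, 0.0]),
    "password": np.array([0.0, 1.0]),
}

def to_query_vector(keywords, dictionary, dim=2):
    """Look up each keyword's word vector one by one and splice (concatenate)
    them in order; unknown words fall back to a zero vector of the same size."""
    vecs = [dictionary.get(w, np.zeros(dim)) for w in keywords]
    return np.concatenate(vecs)

vec = to_query_vector(["reset", "password", "account"], DICT)
```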
And performing word segmentation and word embedding vector conversion on the target document and each negative sample through the text retrieval model to obtain a target vector corresponding to the target document and negative sample vectors corresponding to each negative sample one by one.
Understandably, word segmentation is performed on the target document, embedded word vector conversion is applied to the segmented result, and the target document is thereby converted into a group of vector arrays determined as its target vector; likewise, word segmentation and embedded word vector conversion are performed on each negative sample to obtain the negative sample vector corresponding to each negative sample.
The invention thus realizes: segmenting the sample to be retrieved through the text retrieval model to obtain the unit participles; performing context semantic recognition on all the unit participles to obtain the keyword array corresponding to the sample to be retrieved; performing embedded word vector conversion on the keyword array to obtain the vector to be retrieved; and performing word segmentation and embedded word vector conversion on the target document and each negative sample to obtain the target vector and the negative sample vectors corresponding one to one to the negative samples. In this way the sample to be retrieved, the target document and all the negative samples can be accurately converted, through BERT-based word segmentation, context semantic recognition and embedded word vector conversion, into corresponding vector values that quantify the words they contain, providing a data foundation for subsequent retrieval.
In an embodiment, performing word segmentation and word embedding vector conversion on each negative sample to obtain a negative sample vector corresponding to each negative sample one to one, includes: and acquiring a first segment in each negative sample, performing word segmentation and word embedding vector conversion on the first segment in each negative sample, and using the converted vector array as the negative sample vector corresponding to each negative sample so as to improve the data processing efficiency and reduce the data processing cost.
S402, constructing, using an ANN search algorithm, the index number corresponding to each stored document based on the document vectors.
Understandably, the ANN search algorithm is an Approximate Nearest Neighbor search method and can be divided into three broad categories: tree-based methods, hash methods and vector quantization methods. A tree-based method is an approximate nearest-neighbor search library that uses a tree as its data structure: for a document vector to be inserted, an inner product with each node's splitting vector is computed starting from the root, determining which side of that node's plane (left subtree or right subtree) the vector falls on, descending and bisecting layer by layer, so that the inserted document vector is assigned a path number, i.e. the index number corresponding to the document vector. A hash method calculates a hash value for each document vector and constructs each document vector's index number from its hash value. A vector quantization method encodes the points of a vector space using a finite subset of vectors: the space of all document vectors is partitioned into several subspaces, each subspace is clustered to obtain its cluster center, and the distance between each document vector and the cluster center of its subspace serves as the corresponding index number.
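The vector quantization variant described above can be sketched as follows: each document vector is assigned to its nearest cluster center, and the pair (center id, distance to that center) serves as a coarse index. The cluster centers here are given rather than learned, and all names are illustrative.

```python
import numpy as np

def quantization_index(doc_vecs, centers):
    """Assign each document vector to its nearest cluster center; the pair
    (center id, distance to the center) acts as a coarse index number."""
    index = []
    for v in doc_vecs:
        d = np.linalg.norm(centers - v, axis=1)  # distance to every center
        c = int(np.argmin(d))                    # nearest subspace's center
        index.append((c, float(d[c])))
    return index

centers = np.array([[0.0, 0.0], [10.0, 10.0]])   # cluster centers of two subspaces
docs = np.array([[0.5, 0.1], [9.8, 10.2], [0.2, 0.2]])
index = quantization_index(docs, centers)
```

At query time, only documents whose index shares the query's nearest center need be compared exactly, which is what makes the search approximate but fast.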
In an embodiment, the step S402 of constructing, by using an ANN search algorithm, an index number corresponding to each of the stored documents based on each of the document vectors includes:
and establishing a hash chain table for each word in all the document vectors through a hash algorithm in the ANN search algorithm.
Understandably, the hash algorithm is an algorithm that summarizes all words in each document vector to construct a hash chain table. The hash algorithm may be a direct addressing method, a mid-square method, a digit analysis method, etc., preferably a direct addressing method, i.e. taking a key, or some linear function of a key, as the hash address. The hash chain table, also called a hash table (Hash Table), is a data structure that can be accessed quickly according to a key value by trading space for time: a mapping function for each word is learned from the vector distribution of all words, each word in a chain is given a sequence number (the order of the same word or word vector within the chain), and the hash value of each word can be calculated from the position and sequence of that word in the hash chain table.
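A toy illustration of a hash chain table with per-word sequence numbers (the hash function, bucket count, and all names here are illustrative stand-ins, not the patent's actual construction):

```python
def build_hash_chain_table(words, n_buckets=16):
    """Toy hash chain table (separate chaining): each word hashes to a
    bucket, and words sharing a bucket are chained in insertion order.
    The (bucket, position-in-chain) pair plays the role of the per-word
    hash value described above."""
    table = [[] for _ in range(n_buckets)]
    value = {}
    for w in words:
        b = sum(w.encode()) % n_buckets     # simple stand-in hash function
        if w not in value:                   # first occurrence only
            table[b].append(w)
            value[w] = (b, table[b].index(w))  # (position, sequence number)
    return table, value

table, value = build_hash_chain_table(["text", "retrieval", "model", "text"])
print(value["text"])    # (bucket position, order within the chain)
```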
And constructing the index number corresponding to each stored document according to the hash value in the hash chain table corresponding to each word in each document vector.
Understandably, the hash values corresponding to the words in a document vector are combined or summed to calculate the hash value corresponding to that document vector, which is used to construct the index number corresponding to the stored document. The index number allows the position of each point to be accurately drawn in a high-dimensional space: the hash function is designed so that if the distance between two points is very close, the probability that their hash values are close is very high, and if the distance between two points is long, the probability that their hash values are close is very low.
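The locality property described above can be illustrated with a random-hyperplane hash (a minimal sketch; the dimensions and seed are arbitrary, and nearby vectors agree on most signature bits only with high probability, not with certainty):

```python
import numpy as np

def lsh_signature(vec, hyperplanes):
    """Locality-sensitive hash by random hyperplanes: each bit records which
    side of one hyperplane the vector lies on, so nearby vectors agree on
    most bits with high probability while distant vectors rarely do."""
    return tuple((hyperplanes @ vec >= 0).astype(int))

rng = np.random.default_rng(1)
planes = rng.normal(size=(16, 64))          # 16-bit signatures of 64-dim vectors
a = rng.normal(size=64)
b = a + 0.01 * rng.normal(size=64)          # a near-duplicate of a
c = rng.normal(size=64)                     # an unrelated vector
same_ab = sum(x == y for x, y in zip(lsh_signature(a, planes), lsh_signature(b, planes)))
same_ac = sum(x == y for x, y in zip(lsh_signature(a, planes), lsh_signature(c, planes)))
print(same_ab, same_ac)    # a and b typically agree on far more bits than a and c
```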
The invention realizes that a hash chain table is established for each word in all the document vectors through a hash algorithm in the ANN search algorithm; and constructing the index numbers corresponding to the storage documents according to the hash values in the hash chain table corresponding to the words in the document vectors, so that the hash chain tables among the words in all the document vectors can be automatically established by using a hash algorithm, the index numbers corresponding to the storage documents are constructed according to the hash values in the hash chain tables, and the positions of the storage documents can be quickly embodied according to the index numbers.
And S403, performing vector retrieval on the training vector queue according to all the index numbers by using a KNN algorithm to obtain a retrieval result.
Understandably, the KNN algorithm is the K-Nearest Neighbor algorithm: if most of the K most similar instances of a sample in feature space (i.e. its nearest neighbors) belong to a certain category, the sample also belongs to that category, where an instance is a value comparable in the same dimension as the index number. By calculating the distances between the vector to be retrieved and the index numbers in the training vector queue, the document vectors corresponding to the K index numbers closest to the vector to be retrieved are extracted.
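The distance-and-extract step can be sketched as exact brute-force K-nearest-neighbour search over toy 2-D document vectors (function and variable names are illustrative):

```python
import numpy as np

def knn_search(query, doc_vectors, k):
    """Exact K-nearest-neighbour search: rank all document vectors by
    Euclidean distance to the query and return the indices of the k closest."""
    dists = np.linalg.norm(doc_vectors - query, axis=1)
    return np.argsort(dists)[:k]

# four toy document vectors in 2-D
docs = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.1, 0.1]])
print(knn_search(np.array([0.0, 0.05]), docs, k=2))   # → [0 3]
```

An ANN index replaces the exhaustive distance computation with the index-number structures described above, trading a small amount of accuracy for speed.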
In an embodiment, in the step S403, that is, the performing, by using the KNN algorithm, a vector search on the training vector queue according to all the index numbers to obtain a search result includes:
and generating a graph data structure for all the index numbers by using an HNSW method to obtain a retrieval graph.
Understandably, the HNSW method, also called the Hierarchical Navigable Small World method, performs Delaunay triangulation on all points and then constructs rapid retrieval channels by adding some random long edges, thereby building a navigable small world network; randomness is introduced through the random insertion of nodes, constructing a network structure similar to a small world.
The retrieval graph is a navigable small world network with a plurality of layers, built by randomly inserting nodes into a network structure similar to a small world.
And performing vector retrieval on the vector to be retrieved and each negative sample vector by using a KNN algorithm based on the retrieval graph to obtain the preset number of document vectors adjacent to the vector to be retrieved.
Understandably, the KNN algorithm is used to query the K neighboring document vectors confirmed by combining the input vector to be retrieved and each negative sample vector. The process enters from a fixed entry node: the search starts at the topmost layer of the retrieval graph, a unique nearest neighbor node is found at each layer, that node is then used as the entry node of the next layer, and the search proceeds gradually downwards until the top-K most similar nodes are found at the bottom layer. The top-K most similar nodes of the vector to be retrieved are then compared with the most similar node of each negative sample (one most similar node is searched at the bottom layer for each negative sample); the number of nodes shared by the two sets is counted, and the top-K set of the vector to be retrieved is expanded outward by that number, that is, the shared nodes are removed from the document vectors and the search is expanded outward to different nodes most similar to the vector to be retrieved, thereby obtaining the preset number of document vectors adjacent to the vector to be retrieved, where the preset number can be K.
The most similar node of a negative sample may also be one of these same nodes, because a negative sample may be a sample extracted from the stored document corresponding to a document vector itself.
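The layer-by-layer greedy descent can be sketched on a single layer as follows (a simplified stand-in for the HNSW search described above; the toy graph and vectors are illustrative, and a real index such as hnswlib maintains many layers and candidate lists):

```python
import numpy as np

def greedy_graph_search(graph, vectors, entry, query):
    """Greedy routing on a small-world graph (single-layer sketch of the
    HNSW descent): from the entry node, repeatedly move to the neighbour
    closest to the query until no neighbour improves on the current node."""
    current = entry
    d_cur = np.linalg.norm(vectors[current] - query)
    improved = True
    while improved:
        improved = False
        for nb in graph[current]:
            d = np.linalg.norm(vectors[nb] - query)
            if d < d_cur:
                current, d_cur, improved = nb, d, True
    return current

vectors = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}     # a toy proximity graph
print(greedy_graph_search(graph, vectors, entry=0, query=np.array([2.9, 0.0])))  # → 3
```

In the full method this routine runs once per layer, with each layer's result becoming the entry node of the next.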
And searching all vectors to obtain the stored document records corresponding to the document vectors as the search result.
Understandably, the stored documents corresponding to each of the document vectors obtained by the vector retrieval are taken as retrieval documents, and all of the retrieval documents are determined as the retrieval results.
The invention realizes the generation of a graph data structure for all the index numbers by using the HNSW method to obtain a retrieval graph; performing vector retrieval on the vector to be retrieved and each negative sample vector by using a KNN algorithm based on the retrieval graph to obtain the preset number of document vectors adjacent to the vector to be retrieved; and recording the stored documents corresponding to the document vectors obtained by the vector retrieval as the retrieval result, so that the K adjacent stored documents can be automatically retrieved from all the stored documents through the HNSW method and the KNN algorithm, improving the retrieval speed.
And S50, calculating to obtain a loss value according to the training vector queue and the retrieval result through a loss function.
Understandably, the loss function can be set according to the requirement; for example, the loss function is:

loss(q, d⁺, D⁻) = −log [ exp(f(q, d⁺)) / ( exp(f(q, d⁺)) + Σ_{d⁻ ∈ D⁻} exp(f(q, d⁻)) ) ]

wherein loss(q, d⁺, D⁻) is the loss value; q is a sample to be retrieved in the training sample queue; d⁺ is the target document in the training sample queue; D⁻ is the negative sample set in the training sample queue, comprising the preset number of negative samples; d⁻ is a negative sample in D⁻; exp(f(q, d⁺)) measures the distance between the retrieval result obtained for q and the target document d⁺; and exp(f(q, d⁻)) measures the distance between the retrieval result obtained for q and the negative sample d⁻.
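A minimal numeric sketch of such a loss, assuming the common softmax negative-log-likelihood form over one positive score and a set of negative scores (the similarity values are illustrative):

```python
import numpy as np

def retrieval_loss(sim_pos, sim_negs):
    """Negative log-likelihood of the target document under a softmax over
    the positive and all negatives:
    loss = -log( exp(f(q,d+)) / (exp(f(q,d+)) + sum over d- of exp(f(q,d-))) )"""
    logits = np.concatenate([[sim_pos], sim_negs])
    return -sim_pos + np.log(np.sum(np.exp(logits)))  # -log softmax of the positive

# similarity scores f(q, d): one positive, three hard negatives
print(round(retrieval_loss(2.0, np.array([0.5, 0.1, -0.3])), 4))   # → 0.3873
```

Raising the positive score relative to the negatives drives the loss toward zero, which is exactly what training on strongly relevant negatives encourages.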
And S60, when the training times are a multiple of the preset times and the loss value does not reach the preset convergence condition, iteratively updating the initial parameters of the text retrieval model, updating the training sample queue according to the retrieval result obtained before the current iterative update, and executing the step of inputting the training sample queue into the text retrieval model containing the initial parameters, until the loss value reaches the preset convergence condition, and recording the converged text retrieval model as the trained text retrieval model.
Understandably, the preset times are a preset number of times serving as one batch. When it is detected that the loss value does not reach the preset convergence condition and the training times are a multiple of the preset times (i.e. the condition is triggered once every preset number of times), the initial parameters of the text retrieval model are updated iteratively, and all negative samples in the current training sample queue are replaced with the retrieval result obtained before the current iterative update, thereby updating the training sample queue. In this way, the retrieval results serve as negative samples strongly relevant to the document to be retrieved, instead of randomly obtained negative samples irrelevant to it, which improves the accuracy and performance of the text retrieval model. The step of inputting the training sample queue into the text retrieval model containing the initial parameters is then executed, and the cycle continues; when the loss value reaches the preset convergence condition, the converged retrieval model is recorded as the trained text retrieval model.
The preset convergence condition may be set according to a requirement, for example, the training time reaches 100000 times, or the loss value reaches a preset threshold, or the loss value does not decrease continuously, or the like.
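The update schedule of steps S50–S60 can be sketched as a loop (all classes and callables here are hypothetical stand-ins for the model, loss, and queue-refresh operations described above):

```python
class ToyModel:
    """Hypothetical stand-in for the text retrieval model: the loss halves
    on every parameter update, and retrieval results are strings so that
    negative-sample refreshes are observable."""
    def __init__(self):
        self.loss_val = 4.0
        self.refreshes = 0
    def retrieve(self, queue, batch):
        return f"result@{batch}"
    def loss(self, queue, result):
        return self.loss_val
    def update_parameters(self, loss):
        self.loss_val *= 0.5
    def refresh_negatives(self, queue, result):
        self.refreshes += 1
        return queue[:1] + [result]          # keep the sample, swap the negatives

def train(model, queue, batches, preset_times, converged):
    """Training loop: parameters are updated every step, but the negatives
    in the training sample queue are replaced with the latest retrieval
    result only when the step count is a multiple of preset_times."""
    step = 0
    for batch in batches:
        step += 1
        result = model.retrieve(queue, batch)   # vector retrieval (S40)
        loss = model.loss(queue, result)        # loss value (S50)
        model.update_parameters(loss)           # iterative update (S60/S70)
        if step % preset_times == 0:            # multiple of the preset times
            queue = model.refresh_negatives(queue, result)
        if converged(loss):
            break
    return model

model = train(ToyModel(), queue=["q", "old-negative"], batches=range(6),
              preset_times=2, converged=lambda l: l < 0.3)
print(model.refreshes)   # → 2
```

With six batches, a refresh every two steps, and convergence at loss < 0.3, the toy run performs two negative-sample refreshes before stopping.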
The invention realizes retrieval by acquiring the sample set to be retrieved and the document library; inputting the training sample queue into a text retrieval model containing initial parameters; performing BERT-based embedded vector coding processing on the training sample queue and each stored document in the document library through the text retrieval model; performing vector retrieval on the encoded training sample queue from the encoded stored documents by using an ANN (approximate nearest neighbor) search algorithm and a KNN (K-nearest neighbor) algorithm to obtain a retrieval result, and recording the training times; calculating a loss value from the training vector queue and the retrieval result through a loss function; and, when it is detected that the loss value does not reach the preset convergence condition and the training times are a multiple of the preset times, iteratively updating the initial parameters of the text retrieval model, updating the training sample queue according to the retrieval result obtained before the current iterative update, and executing the step of inputting the training sample queue into the text retrieval model containing the initial parameters, until the loss value reaches the preset convergence condition, whereupon the converged text retrieval model is recorded as the trained text retrieval model. In this way, training with continuously updated, strongly relevant negative samples is realized: the model continuously learns the subtle differences between the target document and the strongly relevant negative samples, enlarging the fine granularity of those differences, so that the text retrieval model becomes more accurate and reliable, and the accuracy and correctness of the documents retrieved by the text retrieval model are improved.
In an embodiment, after the step S50, that is, after the calculating a loss value according to the training vector queue and the search result by the loss function, the method includes:
and S70, when the loss value is not reached to a preset convergence condition and the training frequency is not multiplied by the preset frequency, iteratively updating the initial parameters of the text retrieval model, and executing the step of inputting the training sample queue into the text retrieval model containing the initial parameters until the loss value reaches to the preset convergence condition, and recording the converged text retrieval model as the trained text retrieval model.
Understandably, when it is detected that the loss value does not reach the preset convergence condition and the training times are not a multiple of the preset times, only the initial parameters of the text retrieval model are updated iteratively, and then the step of inputting the training sample queue into the text retrieval model containing the initial parameters is executed; this process continues in a loop until the loss value reaches the preset convergence condition, and the converged retrieval model is recorded as the trained text retrieval model. In this way, in the case where the loss value has not converged and the training times are not a multiple of the preset times, the initial parameters are updated iteratively and learning continues on the current training sample queue.
The question-answer retrieval method provided by the invention can be applied to the application environment shown in fig. 1, wherein a client (computer equipment) communicates with a server through a network. The client (computer device) includes, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, cameras, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.
In an embodiment, as shown in fig. 3, a question and answer retrieval method is provided, which mainly includes the following steps S100 to S300:
S100, the question to be retrieved is obtained.
Understandably, the question to be retrieved is a question that the user needs to ask. The question to be retrieved can be obtained through text input, or the user's spoken question can be recognized through voice recognition and converted into text content through natural language technology.
S200, inputting the to-be-retrieved question into a trained text retrieval model obtained by training according to the text retrieval model training method for vector retrieval, and obtaining a preset number of candidate answers corresponding to the to-be-retrieved question.
Understandably, by inputting the question to be retrieved into the trained text retrieval model and performing the vector retrieval, a preset number of candidate answers corresponding to the question to be retrieved can be retrieved; the candidate answers are the stored documents adjacent to the question to be retrieved among all the stored documents, and all the stored documents can be a set of answers for answering all questions.
S300, all the candidate answers are ranked, and the candidate answer with the first ranked sequence is determined as a reply answer.
Understandably, all the candidate answers are sorted in ascending order of the distance between each candidate answer and its corresponding neighbor, and the candidate answer ranked first after sorting is used as the reply answer. The reply answer can be displayed or broadcast through a display interface of the client or user side, so as to answer the question to be retrieved and achieve the effect of human-computer interaction.
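The ascending sort and first-pick of step S300 can be sketched as (answer texts and distances are illustrative):

```python
def pick_reply(candidates):
    """Sort candidate answers in ascending order of their neighbour distance
    and return the first (closest) one as the reply answer."""
    return sorted(candidates, key=lambda c: c[1])[0][0]

# (answer text, distance to the question vector) pairs — illustrative values
candidates = [("answer B", 0.42), ("answer A", 0.17), ("answer C", 0.95)]
print(pick_reply(candidates))   # → answer A
```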
According to the method and the device, the question to be retrieved is obtained, and the trained text retrieval model obtained by the text retrieval model training method rapidly retrieves the answer matching the question to be retrieved, so that the reply answer is determined accurately, and the accuracy and reliability of the answer output are improved.
In an embodiment, a text search model training device is provided, and the text search model training device corresponds to the text search model training method in the above embodiments one to one. As shown in fig. 4, the text retrieval model training apparatus includes a first obtaining module 11, a generating module 12, an input module 13, a first retrieving module 14, a loss module 15, and a training module 16. The functional modules are explained in detail as follows:
the first obtaining module 11 is configured to obtain a sample set to be retrieved and a document library; the sample set to be retrieved comprises a plurality of samples to be retrieved; one sample to be retrieved corresponds to one target document;
a generating module 12, configured to perform queue generation on each to-be-retrieved sample to obtain a training sample queue corresponding to each to-be-retrieved sample one to one; each training sample queue comprises one to-be-retrieved sample and at least one negative sample;
an input module 13, configured to input the training sample queue into a text retrieval model containing initial parameters;
the first retrieval module 14 is configured to perform BERT-based embedded vector coding processing on the training sample queue and each storage document in the document library through the text retrieval model, perform vector retrieval on the training sample queue after the embedded vector coding processing from the storage documents after the embedded vector coding processing by using an ANN (approximate nearest neighbor) search algorithm and a KNN (K-nearest neighbor) algorithm to obtain a retrieval result, and record the training times;
the loss module 15 is configured to calculate a loss value according to the training sample queue after the embedded vector coding processing and the search result through a loss function;
and the training module 16 is configured to iteratively update the initial parameters of the text retrieval model when the training frequency is in a multiple relationship with a preset frequency and the loss value does not reach a preset convergence condition, update a training sample queue according to a retrieval result obtained before current iterative update, and perform a step of inputting the training sample queue into the text retrieval model containing the initial parameters until the loss value reaches the preset convergence condition, and record the text retrieval model after convergence as the text retrieval model after training.
For the specific definition of the text retrieval model training device, reference may be made to the above definition of the text retrieval model training method, which is not described herein again. The modules in the text retrieval model training device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a question-answer retrieving device is provided, which corresponds to the question-answer retrieving method in the above embodiments one to one. As shown in fig. 5, the question answering retrieval device includes a second acquisition module 101, a second retrieval module 102, and a reply module 103. The functional modules are explained in detail as follows:
a second obtaining module 101, configured to obtain a question to be retrieved;
the second retrieval module 102 is configured to input the to-be-retrieved question into a trained text retrieval model obtained by training with the text retrieval model training method, and perform vector retrieval to obtain a preset number of candidate answers corresponding to the to-be-retrieved question;
the reply module 103 is configured to rank all the candidate answers, and determine the candidate answer with the first ranked sequence as a reply answer.
For specific limitations of the question-answer retrieving device, reference may be made to the above limitations of the question-answer retrieving method, which are not described herein again. The modules in the above question and answer retrieving device may be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a text search model training method, or a question-and-answer search method.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the text retrieval model training method in the above embodiments when executing the computer program, or implements the question-and-answer retrieval method in the above embodiments when executing the computer program.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the text retrieval model training method in the above-described embodiments, or which when executed by a processor implements the question-and-answer retrieval method in the above-described embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.
Claims (10)
1. A text retrieval model training method is characterized by comprising the following steps:
acquiring a sample set to be retrieved and a document library; the sample set to be retrieved comprises a plurality of samples to be retrieved; one sample to be retrieved corresponds to one target document;
generating a queue for each sample to be retrieved to obtain a training sample queue corresponding to each sample to be retrieved one by one; each training sample queue comprises one to-be-retrieved sample and at least one negative sample;
inputting the training sample queue into a text retrieval model containing initial parameters;
carrying out embedded vector coding processing based on BERT on the training sample queue and each storage document in the document library through the text retrieval model, carrying out vector retrieval on the training sample queue after the embedded vector coding processing from the storage document after the embedded vector coding processing by using an ANN (approximate nearest neighbor) search algorithm and a KNN (K-nearest neighbor) algorithm to obtain a retrieval result, and recording the training times;
calculating to obtain a loss value according to the training sample queue after the embedded vector coding processing and the retrieval result through a loss function;
when the training times and the preset times are in a multiple relation, and the loss value does not reach a preset convergence condition, iteratively updating initial parameters of the text retrieval model, updating a training sample queue according to a retrieval result obtained before current iterative updating, and executing the step of inputting the training sample queue into the text retrieval model containing the initial parameters until the loss value reaches the preset convergence condition, and recording the text retrieval model after convergence as the text retrieval model after training.
2. The method for training a text retrieval model according to claim 1, wherein the performing, by the text retrieval model, BERT-based embedded vector encoding processing on the training sample queue and each stored document in the document library, and performing vector retrieval on the training sample queue after embedded vector encoding processing from the stored document after embedded vector encoding processing by using an ANN search algorithm and a KNN algorithm to obtain a retrieval result comprises:
performing BERT-based embedded vector coding processing on the training sample queue through the text retrieval model to obtain a training vector queue corresponding to the training sample queue, and performing embedded vector coding on each storage document in the document library to obtain a document vector corresponding to each storage document;
constructing index numbers corresponding to the stored documents based on the document vectors by using an ANN search algorithm;
and carrying out vector retrieval on the training vector queue according to all the index numbers by using a KNN algorithm to obtain a retrieval result.
3. The training method of the text retrieval model according to claim 2, wherein the training vector queue includes a vector to be retrieved, a target vector and the preset number of negative sample vectors;
the embedding vector coding processing based on BERT is carried out on the training sample queue through the text retrieval model to obtain a training vector queue corresponding to the training sample queue, and the embedding vector coding processing comprises the following steps:
performing word segmentation on the sample to be retrieved through the text retrieval model to obtain word segmentation of each unit;
performing context semantic recognition on all the unit participles through the text retrieval model to obtain a keyword array corresponding to the sample to be retrieved;
performing embedded word vector conversion on the keyword array through the text retrieval model to obtain the vector to be retrieved corresponding to the sample to be retrieved;
and performing word segmentation and word embedding vector conversion on the target document and each negative sample through the text retrieval model to obtain a target vector corresponding to the target document and negative sample vectors corresponding to each negative sample one by one.
4. The method of claim 2, wherein the constructing an index number corresponding to each of the stored documents based on each of the document vectors using an ANN search algorithm comprises:
establishing a hash chain table for each word in all the document vectors through a hash algorithm in an ANN search algorithm;
and constructing the index number corresponding to each stored document according to the hash value in the hash chain table corresponding to each word in each document vector.
5. The method for training a text retrieval model according to claim 2, wherein the performing a vector retrieval on the training vector queue according to all the index numbers by using a KNN algorithm to obtain a retrieval result comprises:
generating a graph data structure for all the index numbers by using an HNSW method to obtain a retrieval graph;
performing vector retrieval on the vector to be retrieved and each negative sample vector by using a KNN algorithm based on the retrieval graph to obtain the preset number of document vectors adjacent to the vector to be retrieved;
and searching all vectors to obtain the stored document records corresponding to the document vectors as the search result.
6. A question-answer retrieval method is characterized by comprising the following steps:
obtaining a question to be retrieved;
inputting the to-be-retrieved question into a trained text retrieval model obtained by training according to the text retrieval model training method of any one of claims 1 to 5 for vector retrieval to obtain a preset number of candidate answers corresponding to the to-be-retrieved question;
and sequencing all the candidate answers, and determining the candidate answer with the first sequence after the sequencing as a reply answer.
7. A text search model training apparatus, comprising:
the first acquisition module is used for acquiring a sample set to be retrieved and a document library; the sample set to be retrieved comprises a plurality of samples to be retrieved; one sample to be retrieved corresponds to one target document;
the generating module is used for generating queues of the samples to be retrieved to obtain training sample queues which correspond to the samples to be retrieved one by one; each training sample queue comprises one to-be-retrieved sample and at least one negative sample;
the input module is used for inputting the training sample queue into a text retrieval model containing initial parameters;
the first retrieval module is used for carrying out embedded vector coding processing based on BERT on the training sample queue and each storage document in the document library through the text retrieval model, carrying out vector retrieval on the training sample queue after the embedded vector coding processing from the storage document after the embedded vector coding processing by using an ANN (approximate nearest neighbor) search algorithm and a KNN (K-nearest neighbor) algorithm to obtain a retrieval result, and recording the training times;
the loss module is used for calculating a loss value according to the training sample queue after the embedded vector coding processing and the retrieval result through a loss function;
and the training module is used for iteratively updating the initial parameters of the text retrieval model when the training times and the preset times are in a multiple relation and the loss value does not reach the preset convergence condition, updating a training sample queue according to a retrieval result obtained before the current iterative updating, and executing the step of inputting the training sample queue into the text retrieval model containing the initial parameters until the loss value reaches the preset convergence condition, and recording the converged text retrieval model as the trained text retrieval model.
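The training module's iterate-and-refresh loop can be sketched with the model's forward-update pass and the negative-queue rebuild abstracted as callables. `train_retrieval_model`, `step_fn`, and `refresh_fn` are illustrative names under those assumptions, not the patent's own interfaces:

```python
def train_retrieval_model(step_fn, refresh_fn, queues, refresh_every, max_steps, tol):
    # step_fn(queues)    -> loss value for one training pass over the queues
    # refresh_fn(queues) -> queues rebuilt from the latest retrieval results
    #                       (i.e. harder negatives mined by the current model)
    loss = float("inf")
    for step in range(1, max_steps + 1):
        loss = step_fn(queues)
        if loss <= tol:
            # preset convergence condition reached; stop iterating
            break
        if step % refresh_every == 0:
            # the training count is a multiple of the preset number:
            # update the training sample queues from retrieval results
            queues = refresh_fn(queues)
    return loss
```

Periodically re-mining negatives from the model's own retrieval results is what makes the negatives progressively "harder" as training converges, which is the apparent point of the multiple-of-N condition in the claim.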
8. A question-and-answer retrieval apparatus, characterized by comprising:
the second acquisition module, which is used for acquiring a question to be retrieved;
the second retrieval module, which is used for inputting the question to be retrieved into a trained text retrieval model, obtained by training with the text retrieval model training method according to any one of claims 1 to 5, and performing vector retrieval to obtain a preset number of candidate answers corresponding to the question to be retrieved;
and the reply module, which is used for ranking all the candidate answers and determining the top-ranked candidate answer as the reply answer.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the text retrieval model training method according to any one of claims 1 to 5, or implements the question-and-answer retrieval method according to claim 6.
10. A computer-readable storage medium in which a computer program is stored, wherein the computer program, when executed by a processor, implements the text retrieval model training method according to any one of claims 1 to 5, or implements the question-and-answer retrieval method according to claim 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111216601.2A CN113934830B (en) | 2021-10-19 | 2021-10-19 | Text retrieval model training, question and answer retrieval method, device, equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113934830A true CN113934830A (en) | 2022-01-14 |
CN113934830B CN113934830B (en) | 2024-08-16 |
Family
ID=79280497
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111216601.2A Active CN113934830B (en) | 2021-10-19 | 2021-10-19 | Text retrieval model training, question and answer retrieval method, device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113934830B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160239479A1 (en) * | 2012-07-30 | 2016-08-18 | Weongozi Inc. | Systems, methods and computer program products for building a database associating n-grams with cognitive motivation orientations |
US20210209513A1 (en) * | 2020-01-02 | 2021-07-08 | Intuit Inc. | Method for serving parameter efficient nlp models through adaptive architectures |
CN113254615A (en) * | 2021-05-31 | 2021-08-13 | 中国移动通信集团陕西有限公司 | Text processing method, device, equipment and medium |
CN113486140A (en) * | 2021-07-27 | 2021-10-08 | 平安国际智慧城市科技股份有限公司 | Knowledge question-answer matching method, device, equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
孙新; 欧阳童; 严西敏; 尚煜茗; 郭文浩: "Weighted K-nearest-neighbor text classification algorithm based on training-set pruning" (基于训练集裁剪的加权K近邻文本分类算法), 情报工程, no. 06, 15 December 2016 (2016-12-15) *
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114596637A (en) * | 2022-03-23 | 2022-06-07 | 北京百度网讯科技有限公司 | Image sample data enhancement training method and device and electronic equipment |
CN114596637B (en) * | 2022-03-23 | 2024-02-06 | 北京百度网讯科技有限公司 | Image sample data enhancement training method and device and electronic equipment |
CN114625858A (en) * | 2022-03-25 | 2022-06-14 | 中国电子产业工程有限公司 | Intelligent government affair question-answer replying method and device based on neural network |
CN114969287A (en) * | 2022-05-19 | 2022-08-30 | 平安科技(深圳)有限公司 | Document searching method, device, equipment and computer readable storage medium |
JP7272571B1 (en) | 2022-08-16 | 2023-05-12 | 17Live株式会社 | Systems, methods, and computer readable media for data retrieval |
JP2024027055A (en) * | 2022-08-16 | 2024-02-29 | 17Live株式会社 | System and method for data search and computer-readable medium |
CN115344672A (en) * | 2022-10-18 | 2022-11-15 | 北京澜舟科技有限公司 | Document retrieval model training method, retrieval method and storage medium |
CN116069903A (en) * | 2023-03-02 | 2023-05-05 | 特斯联科技集团有限公司 | Class search method, system, electronic equipment and storage medium |
CN116151395A (en) * | 2023-04-21 | 2023-05-23 | 北京澜舟科技有限公司 | Retrieval model training method, system and retrieval method based on entity word relation |
CN117312513A (en) * | 2023-09-27 | 2023-12-29 | 数字广东网络建设有限公司 | Document search model training method, document search method and related device |
CN117312500A (en) * | 2023-11-30 | 2023-12-29 | 山东齐鲁壹点传媒有限公司 | Semantic retrieval model building method based on ANN and BERT |
CN117312500B (en) * | 2023-11-30 | 2024-02-27 | 山东齐鲁壹点传媒有限公司 | Semantic retrieval model building method based on ANN and BERT |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113934830B (en) | Text retrieval model training, question and answer retrieval method, device, equipment and medium | |
CN111160017B (en) | Keyword extraction method, phonetics scoring method and phonetics recommendation method | |
CN111753060B (en) | Information retrieval method, apparatus, device and computer readable storage medium | |
CN111353030B (en) | Knowledge question and answer retrieval method and device based on knowledge graph in travel field | |
CN109829104B (en) | Semantic similarity based pseudo-correlation feedback model information retrieval method and system | |
US20230039496A1 (en) | Question-and-answer processing method, electronic device and computer readable medium | |
CN111444344B (en) | Entity classification method, entity classification device, computer equipment and storage medium | |
CN111460798A (en) | Method and device for pushing similar meaning words, electronic equipment and medium | |
CN113298197B (en) | Data clustering method, device, equipment and readable storage medium | |
WO2020114100A1 (en) | Information processing method and apparatus, and computer storage medium | |
CN111985228B (en) | Text keyword extraction method, text keyword extraction device, computer equipment and storage medium | |
CN111324713B (en) | Automatic replying method and device for conversation, storage medium and computer equipment | |
CN112035620B (en) | Question-answer management method, device, equipment and storage medium of medical query system | |
CN112214335B (en) | Web service discovery method based on knowledge graph and similarity network | |
CN112463944B (en) | Search type intelligent question-answering method and device based on multi-model fusion | |
CN112632258A (en) | Text data processing method and device, computer equipment and storage medium | |
CN117708309A (en) | Method, system, equipment and medium for searching question and answer | |
CN117668181A (en) | Information processing method, device, terminal equipment and storage medium | |
CN112528022A (en) | Method for extracting characteristic words corresponding to theme categories and identifying text theme categories | |
CN112685475A (en) | Report query method and device, computer equipment and storage medium | |
CN110727769A (en) | Corpus generation method and device, and man-machine interaction processing method and device | |
CN113505190B (en) | Address information correction method, device, computer equipment and storage medium | |
CN114491079A (en) | Knowledge graph construction and query method, device, equipment and medium | |
CN118250169A (en) | Network asset class recommendation method, device and storage medium | |
CN117708270A (en) | Enterprise data query method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||