CN113722512A - Text retrieval method, device and equipment based on language model and storage medium - Google Patents


Info

Publication number
CN113722512A
Authority
CN
China
Prior art keywords: vector, sentence, data, retrieved, matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111019330.1A
Other languages
Chinese (zh)
Inventor
杨焱麒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Bank Co Ltd
Priority to CN202111019330.1A
Publication of CN113722512A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Abstract

Embodiments of the invention relate to the field of artificial intelligence and disclose a text retrieval method, device, and equipment based on a language model, and a storage medium. The method includes: acquiring a labeled data set comprising a plurality of labeled sentences with the same semantics; extracting sentence feature vectors from the labeled data set and inputting them into a pre-trained Bert model for training to obtain a sentence vector model; inputting data to be retrieved into the sentence vector model to obtain a sentence vector of the data to be retrieved; calculating the similarity between each first vector matrix in a preset vector matrix library and the sentence vector of the data to be retrieved; and determining an index corresponding to the sentence vector according to the similarity, and determining the sentence corresponding to the data to be retrieved from a preset database according to the index, thereby improving the accuracy and efficiency of text retrieval. The invention also relates to blockchain technology; for example, the data set can be written to a blockchain for use in scenarios such as data forensics.

Description

Text retrieval method, device and equipment based on language model and storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a text retrieval method, a text retrieval device, text retrieval equipment and a storage medium based on a language model.
Background
How to quickly and accurately search for and obtain the required information from a large volume of stored data is a core problem in the field of text retrieval. In the 1950s, researchers made early progress on computer-based text retrieval, most notably matching methods based on word indexing, which became the prototype of the modern inverted index. In the 1960s, many evaluation metrics were developed; from the 1970s to the 1980s, many information retrieval theories and models were proposed, of which the vector space model is the most notable. After the 1990s, with the rise of the Internet, text retrieval systems came into widespread use.
Text retrieval systems based on the inverted index are commonly used in scenarios that require fast full-text search, such as search engines and question-answering systems, and have the advantages of high speed and full-text search support. However, because answers are located by matching the words in the question, in business scenarios with many synonyms the synonyms must be added manually to achieve good performance, and in question-answering scenarios with a limited amount of data it is difficult to cover the different ways users phrase their questions.
In similarity-based text retrieval, a common method is to compute keyword weights statistically over a text collection, form document vectors from those weights, and then compute the cosine similarity between the query vector and all candidate document vectors, which yields a similarity ranking of the candidate documents from which the most relevant ones can be found. Retrieval can also be tuned using the relationship between document length and the average length of the task-specific data set, giving some flexibility. Compared with the inverted index, this method exploits the statistical characteristics of words and retrieves more effectively, but the drawbacks remain that it can only handle word matching and cannot correctly perceive semantic relevance.
With the advent of deep learning, combining text retrieval with natural language processing produced strong results, especially after the Bert model appeared in 2018: because Bert is pre-trained on large-scale text, downstream tasks can easily cover many problems that can only be solved with semantics. However, Bert's vector representation has a known problem: word-frequency distributions in text data vary greatly, so in the vector space of Bert representations the vectors of higher-frequency words lie closer to the origin. In this case, cosine distance cannot correctly represent the semantic distance between sentences, and similar texts cannot be accurately identified. How to improve the accuracy of text retrieval more effectively has therefore become a major research focus.
Disclosure of Invention
The embodiment of the invention provides a text retrieval method, a text retrieval device, text retrieval equipment and a storage medium based on a language model, and the accuracy and efficiency of text retrieval are improved.
In a first aspect, an embodiment of the present invention provides a text retrieval method based on a language model, including:
acquiring a labeled data set, wherein the labeled data set comprises a plurality of labeled sentences with the same semantic meaning;
extracting sentence characteristic vectors from the labeled data set, and inputting the sentence characteristic vectors into a pre-trained Bert model for training to obtain a sentence vector model;
acquiring data to be retrieved, and inputting the data to be retrieved into the sentence vector model to obtain a sentence vector of the data to be retrieved;
calculating the similarity between each first vector matrix in a preset vector matrix library and the sentence vector of the data to be retrieved according to the sentence vector of the data to be retrieved;
and determining an index corresponding to the sentence vector of the data to be retrieved according to the similarity, and determining a sentence corresponding to the data to be retrieved from a preset database according to the index.
Further, the extracting sentence feature vectors from the labeled data set includes:
determining the data volume of each sentence included in the labeled data set;
and determining a target data set from the labeled data set according to the data volume, and extracting the sentence feature vectors from the target data set.
Further, the determining a target data set from the labeled data sets according to the data volume includes:
acquiring the data format of each sentence included in the labeled data set, and determining a target data format according to the data volume and the data format;
and recombining each sentence included in the labeled data set according to the target data format to obtain the target data set.
Further, the inputting the sentence feature vector into a pre-trained Bert model for training to obtain a sentence vector model includes:
inputting the sentence characteristic vector into the pre-trained Bert model to obtain a loss function value;
comparing the loss function value with a target loss function value, and adjusting the model parameter of the Bert model according to the comparison result when the comparison result does not meet the preset condition;
inputting the sentence characteristic vector into the Bert model after model parameters are adjusted for retraining;
and when the comparison result of the obtained loss function value and the target loss function value meets a preset condition, determining to obtain the sentence vector model.
Further, the inputting the sentence feature vector into the pre-trained Bert model to obtain a loss function value includes:
inputting the sentence characteristic vector into the Bert model to obtain a second vector matrix corresponding to the sentence characteristic vector;
multiplying each vector in the second vector matrix pairwise to obtain an interaction matrix, wherein the interaction matrix comprises a plurality of element values;
and determining a label matrix according to each element value in the interaction matrix, and calculating to obtain the loss function value according to the interaction matrix and the label matrix.
Further, the calculating the similarity between each first vector matrix in a preset vector matrix library and the sentence vector of the data to be retrieved according to the sentence vector of the data to be retrieved includes:
obtaining the dimensionality of each first vector matrix in the preset vector matrix library, and carrying out segmentation processing on each first vector matrix according to the dimensionality;
clustering vectors in the first vector matrix of each segment after the segmentation processing respectively to obtain a plurality of cluster centers;
and calculating the distance between the sentence vector of the data to be retrieved and each clustered cluster center, and determining the similarity between the sentence vector of the data to be retrieved and each first vector matrix according to the distance.
Further, the determining an index corresponding to a sentence vector of the data to be retrieved according to the similarity includes:
sorting the similarities in descending order;
and obtaining the indexes corresponding to each vector in each first vector matrix in the preset vector matrix library whose similarity ranks in the top K, wherein K is a positive integer.
In a second aspect, an embodiment of the present invention provides a text retrieval apparatus based on a language model, including:
the system comprises an acquisition unit, a storage unit and a processing unit, wherein the acquisition unit is used for acquiring a labeled data set, and the labeled data set comprises a plurality of labeled sentences with the same semantic meaning;
the training unit is used for extracting sentence characteristic vectors from the labeled data set and inputting the sentence characteristic vectors into a pre-trained Bert model for training to obtain a sentence vector model;
the test unit is used for acquiring data to be retrieved and inputting the data to be retrieved into the sentence vector model to obtain a sentence vector of the data to be retrieved;
the retrieval unit is used for calculating the similarity between each first vector matrix in a preset vector matrix library and the sentence vector of the data to be retrieved according to the sentence vector of the data to be retrieved;
and the determining unit is used for determining an index corresponding to the sentence vector of the data to be retrieved according to the similarity, and determining a sentence corresponding to the data to be retrieved from a preset database according to the index.
In a third aspect, an embodiment of the present invention provides a computer device, including a processor and a memory, where the memory is used to store a computer program, and the processor is configured to call the computer program to execute the method of the first aspect.
In a fourth aspect, the present invention provides a computer-readable storage medium, which stores a computer program, where the computer program is executed by a processor to implement the method of the first aspect.
The embodiment of the invention can obtain a labeled data set, wherein the labeled data set comprises a plurality of labeled sentences with the same semantic meaning; extracting sentence characteristic vectors from the labeled data set, and inputting the sentence characteristic vectors into a pre-trained Bert model for training to obtain a sentence vector model; acquiring data to be retrieved, and inputting the data to be retrieved into the sentence vector model to obtain a sentence vector of the data to be retrieved; calculating the similarity between each first vector matrix in a preset vector matrix library and the sentence vector of the data to be retrieved according to the sentence vector of the data to be retrieved; and determining an index corresponding to the sentence vector of the data to be retrieved according to the similarity, and determining a sentence corresponding to the data to be retrieved from a preset database according to the index. In the embodiment of the invention, the sentence vector corresponding to the data to be retrieved, which is obtained by the sentence vector model, is combined with the first vector matrix in the preset vector matrix library by training the sentence vector model to determine the sentence corresponding to the data to be retrieved, which is beneficial to improving the accuracy and efficiency of text retrieval.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic flow chart of a text retrieval method based on a language model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a method for determining a target data set according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a label matrix provided by an embodiment of the invention;
FIG. 4 is a schematic block diagram of a text retrieval apparatus based on a language model according to an embodiment of the present invention;
fig. 5 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The text retrieval method based on the language model provided by the embodiment of the invention can be applied to a text retrieval device based on the language model, and in some embodiments, the text retrieval device based on the language model is arranged in computer equipment. In certain embodiments, the computer device includes, but is not limited to, one or more of a smartphone, tablet, laptop, and the like.
The embodiment of the invention can obtain a labeled data set, wherein the labeled data set comprises a plurality of labeled sentences with the same semantic meaning; extracting sentence characteristic vectors from the labeled data set, and inputting the sentence characteristic vectors into a pre-trained Bert model for training to obtain a sentence vector model; acquiring data to be retrieved, and inputting the data to be retrieved into the sentence vector model to obtain a sentence vector of the data to be retrieved; calculating the similarity between each first vector matrix in a preset vector matrix library and the sentence vector of the data to be retrieved according to the sentence vector of the data to be retrieved; and determining an index corresponding to the sentence vector of the data to be retrieved according to the similarity, and determining a sentence corresponding to the data to be retrieved from a preset database according to the index. In the embodiment of the invention, the sentence vector corresponding to the data to be retrieved, which is obtained by the sentence vector model, is combined with the first vector matrix in the preset vector matrix library by training the sentence vector model to determine the sentence corresponding to the data to be retrieved, which is beneficial to improving the accuracy and efficiency of text retrieval.
The embodiment of the application can acquire and process related data (such as labeled data sets) based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The text retrieval method based on the language model provided by the embodiment of the invention is schematically described below with reference to fig. 1.
Referring to fig. 1, fig. 1 is a schematic flow chart of a text retrieval method based on a language model according to an embodiment of the present invention, and as shown in fig. 1, the method may be executed by a text retrieval device based on a language model, where the text retrieval device based on a language model is disposed in a computer device. Specifically, the method of the embodiment of the present invention includes the following steps.
S101: obtaining a labeled data set, wherein the labeled data set comprises a plurality of labeled sentences with the same semantic meaning.
In the embodiment of the present invention, a text retrieval device based on a language model may obtain a labeled data set, where the labeled data set includes a plurality of labeled sentences with the same semantic meaning.
In one example, assume the labeled data set is D, where the i-th pair <di, d'i> in D consists of labeled sentences with the same semantics.
S102: and extracting sentence characteristic vectors from the labeled data set, and inputting the sentence characteristic vectors into a pre-trained Bert model for training to obtain a sentence vector model.
In the embodiment of the invention, the text retrieval device based on the language model can extract the sentence characteristic vector from the labeled data set, and input the sentence characteristic vector into a pre-trained Bert model for training to obtain the sentence vector model.
In one embodiment, the language model-based text retrieval apparatus may determine, when extracting sentence feature vectors from the labeled data set, a data amount of each sentence included in the data set from the labeled data set; and determining a target data set from the labeled data set according to the data volume, and extracting the sentence feature vector from the target data set.
In one embodiment, when determining a target data set from the labeled data set according to the data amount, the language model-based text retrieval device may obtain a data format of each sentence included in the labeled data set, and determine the target data format according to the data amount and the data format; and recombining each sentence included in the labeled data set according to the target data format to obtain the target data set.
In an embodiment, assuming that the labeled data set acquired by the text retrieval device based on the language model contains n pairs of sentences arranged in n rows and 2 columns, the target data format may be determined to be 2n rows and 1 column, and the sentences included in the labeled data set may be recombined according to the target data format to obtain the target data set.
Specifically, taking fig. 2 as an example, fig. 2 is a schematic diagram of determining a target data set according to an embodiment of the present invention. As shown in fig. 2, 2a shows the data format of the same-semantics sentence pairs in the data set, and 2b shows the target data set obtained by recombining the sentences in the labeled data set according to the determined target data format, where the target data set contains the sequence of the recombined sentences.
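The n-rows-by-2-columns to 2n-rows-by-1-column recombination can be sketched in a few lines; stacking the d column on top of the d' column is one plausible reading of fig. 2, and the sentence strings below are placeholders, not data from the patent.

```python
import numpy as np

# Sketch of the recombination step: n pairs stored as n rows x 2 columns
# are restacked into a single column of 2n sentences, with the d column
# first and the d' column second (one plausible ordering; illustrative).
pairs = np.array([
    ["q1", "q1'"],
    ["q2", "q2'"],
    ["q3", "q3'"],
])  # n = 3 pairs, shape (3, 2)

target = np.concatenate([pairs[:, 0], pairs[:, 1]]).reshape(-1, 1)  # (6, 1)
```

This ordering keeps the two halves d and d' contiguous, which matches how the later formula (1) feeds d and d' to the model separately.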
In one embodiment, when the text retrieval device based on the language model inputs the sentence feature vector into the pre-trained Bert model for training to obtain the sentence vector model, the sentence feature vector may be input into the pre-trained Bert model to obtain a loss function value; comparing the loss function value with a target loss function value, and adjusting the model parameter of the Bert model according to the comparison result when the comparison result does not meet the preset condition; inputting the sentence characteristic vector into the Bert model after model parameters are adjusted for retraining; and when the comparison result of the obtained loss function value and the target loss function value meets a preset condition, determining to obtain the sentence vector model.
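The train-compare-adjust loop described above can be sketched as pure control flow; `compute_loss` and `adjust` are hypothetical stand-ins for evaluating the Bert model and updating its parameters, and the quadratic toy objective in the usage example is purely illustrative.

```python
# Hedged sketch of the training loop: compute the loss, compare it with
# the target loss, adjust the model parameters while the preset condition
# is not met, and stop once it is. Not the patent's actual API.
def train(params, compute_loss, adjust, target_loss, max_steps=100):
    loss = compute_loss(params)
    for _ in range(max_steps):
        if loss <= target_loss:        # comparison result meets the condition
            break
        params = adjust(params, loss)  # adjust model parameters
        loss = compute_loss(params)    # retrain / re-evaluate
    return params, loss

# Toy usage: minimize (p - 3)^2 with a simple gradient-style update.
final_p, final_loss = train(
    0.0,
    lambda p: (p - 3.0) ** 2,
    lambda p, l: p - 0.2 * (p - 3.0),
    target_loss=1e-3,
)
```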
In one embodiment, when the sentence feature vector is input into the pre-trained Bert model to obtain the loss function value, the language model-based text retrieval device may input the sentence feature vector into the Bert model to obtain a second vector matrix corresponding to the sentence feature vector; multiplying each vector in the second vector matrix pairwise to obtain an interaction matrix, wherein the interaction matrix comprises a plurality of element values; and determining a label matrix according to each element value in the interaction matrix, and calculating to obtain the loss function value according to the interaction matrix and the label matrix.
In an embodiment, when the text retrieval device based on the language model multiplies the vectors in the second vector matrix pairwise to obtain the interaction matrix, each sentence in the target data set may be input into the Bert model M in turn to obtain vector matrices e(n/2×dim) and e'(n/2×dim) composed of sentence feature vectors, where d = (d1, d2, ..., dn/2), d' = (d'1, d'2, ..., d'n/2), and dim is the vector dimension output by the Bert model. The vector matrices are calculated as shown in the following formula (1):
e = M(d)
e' = M(d')    (1)
The interaction matrix Corr(n/2×n/2) can then be obtained by multiplying the vectors pairwise, each element being the dot product of one vector from e and one from e', as shown in the following formula (2):
Corr(n/2×n/2) = e · e'^T    (2)
In an embodiment, when determining the label matrix according to the element values of the interaction matrix, the text retrieval device based on the language model may set the diagonal of the label matrix to 1, the label representing the vector product of semantically similar sentences, while the other entries are labels for non-similar sentences; these may be calculated from the interaction-matrix elements that indicate non-similar sentences, or determined by scores obtained by normalizing the non-similar sentences. In some embodiments, the label matrix label(n/2×n/2) takes the form shown in fig. 3, which is a schematic diagram of a label matrix provided by an embodiment of the invention.
In one embodiment, when calculating the loss function value from the interaction matrix and the label matrix, the text retrieval device based on the language model may compute the Loss as the mean squared error (mse) over the element values of the interaction matrix Corr(n/2×n/2) and the label matrix label(n/2×n/2), as shown in the following formula (3):
Loss = (1/(n/2)^2) * Σi Σj (Corrij - labelij)^2    (3)
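The computation in formulas (1) through (3) can be sketched in a few lines of numpy. Here e and e_prime are random stand-ins for the Bert outputs M(d) and M(d'), with a toy dim = 8 instead of Bert's 768, and the identity-diagonal label matrix is the simplest labeling the description allows.

```python
import numpy as np

# Hedged sketch of formulas (1)-(3); e and e_prime replace real model
# outputs, so the numbers are illustrative only.
rng = np.random.default_rng(0)
half_n, dim = 4, 8
e = rng.normal(size=(half_n, dim))        # e  (n/2 x dim), formula (1)
e_prime = rng.normal(size=(half_n, dim))  # e' (n/2 x dim), formula (1)

# Formula (2): interaction matrix of pairwise dot products.
corr = e @ e_prime.T                      # Corr (n/2 x n/2)

# Label matrix: 1 on the diagonal for same-meaning pairs, 0 elsewhere
# (soft labels for non-similar pairs are also possible per the text).
label = np.eye(half_n)

# Formula (3): mean squared error between the two matrices.
loss = np.mean((corr - label) ** 2)
```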
In one embodiment, after the sentence vector model is obtained through training, in an offline case, vectors of all candidate sentences may be generated and stored in a preset vector matrix library.
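The offline step above can be sketched as follows; `encode()` is a hypothetical stand-in for the trained sentence vector model, and the candidate sentences are placeholders.

```python
import numpy as np

# Sketch of the offline step: every candidate sentence is encoded once
# and stored in a vector library, with an index mapping back to the
# sentence database.
def encode(sentence: str, dim: int = 8) -> np.ndarray:
    # Deterministic toy embedding; a real system would call the model.
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.normal(size=dim)

candidates = ["sentence a", "sentence b", "sentence c"]
vector_library = np.stack([encode(s) for s in candidates])  # (3, 8)
index_to_sentence = dict(enumerate(candidates))
```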
S103: and acquiring data to be retrieved, and inputting the data to be retrieved into the sentence vector model to obtain a sentence vector of the data to be retrieved.
In the embodiment of the invention, the text retrieval device based on the language model can acquire the data to be retrieved and input the data to be retrieved into the sentence vector model to obtain the sentence vector of the data to be retrieved.
S104: and calculating the similarity between each first vector matrix in a preset vector matrix library and the sentence vector of the data to be retrieved according to the sentence vector of the data to be retrieved.
In the embodiment of the invention, the text retrieval device based on the language model can calculate the similarity between each first vector matrix in a preset vector matrix library and the sentence vector of the data to be retrieved according to the sentence vector of the data to be retrieved.
In one embodiment, when the language model-based text retrieval device calculates the similarity between each first vector matrix in a preset vector matrix library and the sentence vector of the data to be retrieved according to the sentence vector of the data to be retrieved, the language model-based text retrieval device may obtain the dimension of each first vector matrix in the preset vector matrix library, and perform segmentation processing on each first vector matrix according to the dimension; clustering vectors in the first vector matrix of each segment after the segmentation processing respectively to obtain a plurality of cluster centers; and calculating the distance between the sentence vector of the data to be retrieved and each clustered cluster center, and determining the similarity between the sentence vector of the data to be retrieved and each first vector matrix according to the distance.
In one embodiment, assuming that the dimension of each first vector matrix in the preset vector matrix library is 768, when calculating the similarity between each first vector matrix in the preset vector matrix library and the sentence vector of the data to be retrieved according to the sentence vector of the data to be retrieved, the nearest neighbor search technique may be used. Specifically, each first vector matrix may be segmented according to the dimension of each first vector matrix in a preset vector matrix library, and vectors in each segmented first vector matrix are clustered to obtain a plurality of quantization results (one quantization result for each segment). In this way, the vector dimension involved in the calculation can be reduced.
In one embodiment, when clustering vectors in each segment of the first vector matrix, the first vector matrix may be divided into m segments, and the vectors of each segment of the first vector matrix may be clustered to obtain n cluster centers corresponding to each segment of the first vector matrix, so that n × m cluster centers may be obtained from m segments of the first vector matrix.
This approach reduces the number of vectors involved in the computation, which helps improve the speed of text retrieval.
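The segment-and-cluster step above is essentially product quantization, and can be sketched as follows. The tiny hand-rolled k-means stands in for a production library such as faiss, and all sizes (m = 4 segments, 3 centers each) are illustrative.

```python
import numpy as np

# Hedged sketch: split each dim-dimensional library vector into m
# segments and cluster each segment's vectors into n_clusters centers,
# so only m * n_clusters centers are compared at query time.
def kmeans(x, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest center, then recompute centers.
        assign = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = x[assign == j].mean(axis=0)
    return centers

rng = np.random.default_rng(1)
library = rng.normal(size=(100, 8))  # 100 candidate vectors, dim = 8
m, n_clusters = 4, 3                 # 4 segments of width 2
seg_width = library.shape[1] // m
centers_per_segment = [
    kmeans(library[:, i * seg_width:(i + 1) * seg_width], n_clusters)
    for i in range(m)
]
```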
S105: and determining an index corresponding to the sentence vector of the data to be retrieved according to the similarity, and determining a sentence corresponding to the data to be retrieved from a preset database according to the index.
In the embodiment of the present invention, the text retrieval device based on the language model may determine, according to the similarity, an index corresponding to a sentence vector of the data to be retrieved, and determine, from a preset database, a sentence corresponding to the data to be retrieved according to the index.
In one embodiment, when determining the index corresponding to the sentence vector of the data to be retrieved according to the similarity, the text retrieval device based on the language model may sort the similarities in descending order, and obtain the indexes corresponding to each vector in each first vector matrix in the preset vector matrix library whose similarity ranks in the top K, where K is a positive integer.
In an embodiment, the text retrieval device based on the language model may calculate the distances between the sentence vector of the data to be retrieved and the n cluster centers, sort them in ascending order, take the similarities corresponding to the first K distances as the top-K similarities, obtain the vectors under the cluster centers corresponding to those K distances, and query the sentences corresponding to the data to be retrieved from the preset database according to the indexes of those vectors.
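The final lookup can be sketched as a distance sort over cluster centers; all names and values below are illustrative.

```python
import numpy as np

# Sketch of steps S104-S105: sort distances from the query sentence
# vector to the cluster centers in ascending order, keep the K nearest,
# and use their indexes to fetch candidate sentences from the database.
def top_k_indices(query, centers, k):
    dists = np.linalg.norm(centers - query, axis=1)  # distance to each center
    return np.argsort(dists)[:k]                     # K nearest centers

centers = np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 1.0], [9.0, 0.0]])
query = np.array([0.4, 0.4])
nearest = top_k_indices(query, centers, k=2)  # indexes of the 2 nearest
```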
In the embodiment of the invention, a text retrieval device based on a language model can acquire a labeled data set, wherein the labeled data set comprises a plurality of labeled sentences with the same semantic meaning; extracting sentence characteristic vectors from the labeled data set, and inputting the sentence characteristic vectors into a pre-trained Bert model for training to obtain a sentence vector model; acquiring data to be retrieved, and inputting the data to be retrieved into the sentence vector model to obtain a sentence vector of the data to be retrieved; calculating the similarity between each first vector matrix in a preset vector matrix library and the sentence vector of the data to be retrieved according to the sentence vector of the data to be retrieved; and determining an index corresponding to the sentence vector of the data to be retrieved according to the similarity, and determining a sentence corresponding to the data to be retrieved from a preset database according to the index. The embodiment of the invention trains the sentence vector model by utilizing the method of calculating the loss function value by using the interactive matrix and the label matrix, and calculates the similarity between the sentence vector corresponding to the data to be retrieved obtained by the sentence vector model and each first vector matrix in the preset vector matrix library, so that the similarity is more accurate, the sentence of the data to be retrieved is determined according to the similarity, and the accuracy and the efficiency of text retrieval are improved.
The embodiment of the invention also provides a text retrieval device based on the language model, which includes units for executing any one of the foregoing methods. Specifically, referring to fig. 4, fig. 4 is a schematic block diagram of a text retrieval device based on a language model according to an embodiment of the present invention. The text retrieval device based on the language model of this embodiment comprises: an acquisition unit 401, a training unit 402, a test unit 403, a retrieval unit 404, and a determining unit 405.
an acquisition unit 401, configured to acquire a labeled data set, where the labeled data set includes a plurality of labeled sentences with the same semantic meaning;
a training unit 402, configured to extract sentence feature vectors from the labeled data set, and input the sentence feature vectors into a pre-trained Bert model for training to obtain a sentence vector model;
a test unit 403, configured to acquire data to be retrieved, and input the data to be retrieved into the sentence vector model to obtain a sentence vector of the data to be retrieved;
a retrieval unit 404, configured to calculate, according to the sentence vector of the data to be retrieved, the similarity between each first vector matrix in a preset vector matrix library and the sentence vector of the data to be retrieved;
a determining unit 405, configured to determine, according to the similarity, an index corresponding to the sentence vector of the data to be retrieved, and determine, according to the index, a sentence corresponding to the data to be retrieved from a preset database.
Further, when the training unit 402 extracts a sentence feature vector from the labeled data set, it is specifically configured to:
determining the data volume of each sentence included in the data set according to the labeled data set;
and determining a target data set from the labeled data sets according to the data quantity, and extracting the sentence characteristic vector from the target data set.
Further, when the training unit 402 determines the target data set from the labeled data sets according to the data volume, specifically, the training unit is configured to:
acquiring the data format of each sentence included in the labeled data set, and determining a target data format according to the data volume and the data format;
and recombining each sentence included in the labeled data set according to the target data format to obtain the target data set.
Further, the training unit 402 inputs the sentence feature vector into a pre-trained Bert model for training, and when obtaining a sentence vector model, is specifically configured to:
inputting the sentence characteristic vector into the pre-trained Bert model to obtain a loss function value;
comparing the loss function value with a target loss function value, and adjusting the model parameter of the Bert model according to the comparison result when the comparison result does not meet the preset condition;
inputting the sentence characteristic vector into the Bert model after model parameters are adjusted for retraining;
and when the comparison result of the obtained loss function value and the target loss function value meets a preset condition, determining to obtain the sentence vector model.
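The compare-adjust-retrain loop above can be sketched as follows. This is a deliberately minimal illustration: the Bert forward pass is replaced by a hypothetical one-parameter model with a squared-error loss, and the "parameter adjustment" is a plain gradient step, since the patent does not fix those details:

```python
import numpy as np

# Hypothetical stand-in for the Bert model: a single-weight linear
# "model" with a squared-error loss, so the training loop stays tiny.
def compute_loss(weight, features, targets):
    preds = features * weight
    return float(np.mean((preds - targets) ** 2))

def train(features, targets, target_loss=1e-4, lr=0.1, max_steps=1000):
    weight = 0.0
    loss = compute_loss(weight, features, targets)
    for _ in range(max_steps):
        loss = compute_loss(weight, features, targets)
        # compare the loss function value with the target loss function value
        if loss <= target_loss:
            return weight, loss          # preset condition met: model obtained
        # adjust the model parameter (here: one gradient step) and retrain
        grad = float(np.mean(2 * (features * weight - targets) * features))
        weight -= lr * grad
    return weight, loss
```

Running `train` on targets that are twice the features converges quickly to a weight near 2.0, at which point the loop stops because the loss comparison satisfies the preset condition.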
Further, the training unit 402 inputs the sentence feature vector into the pre-trained Bert model, and when obtaining the loss function value, is specifically configured to:
inputting the sentence characteristic vector into the Bert model to obtain a second vector matrix corresponding to the sentence characteristic vector;
multiplying each vector in the second vector matrix pairwise to obtain an interaction matrix, wherein the interaction matrix comprises a plurality of element values;
and determining a label matrix according to each element value in the interaction matrix, and calculating to obtain the loss function value according to the interaction matrix and the label matrix.
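An illustrative sketch of the interaction-matrix loss above: the patent does not specify the exact formula, so the example below uses cosine products for the pairwise vector multiplication, builds the label matrix from semantic-equivalence labels, and takes a mean squared error between the two matrices. All of these concrete choices are assumptions:

```python
import numpy as np

def interaction_loss(sentence_vectors, labels):
    """Multiply the sentence vectors pairwise to form an interaction
    matrix (cosine products here), derive a label matrix marking which
    sentence pairs share the same semantics, and score one against the
    other with mean squared error (one plausible loss; the patent does
    not specify the exact formula)."""
    v = sentence_vectors / np.linalg.norm(sentence_vectors, axis=1, keepdims=True)
    interaction = v @ v.T                                   # pairwise products
    label_matrix = (labels[:, None] == labels[None, :]).astype(float)
    return float(np.mean((interaction - label_matrix) ** 2))
```

When vectors of same-labeled sentences coincide and vectors of differently labeled sentences are orthogonal, the interaction matrix equals the label matrix and the loss is zero.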
Further, when the retrieving unit 404 calculates the similarity between each first vector matrix in a preset vector matrix library and the sentence vector of the data to be retrieved according to the sentence vector of the data to be retrieved, it is specifically configured to:
obtaining the dimensionality of each first vector matrix in the preset vector matrix library, and carrying out segmentation processing on each first vector matrix according to the dimensionality;
clustering vectors in the first vector matrix of each segment after the segmentation processing respectively to obtain a plurality of cluster centers;
and calculating the distance between the sentence vector of the data to be retrieved and each clustered cluster center, and determining the similarity between the sentence vector of the data to be retrieved and each first vector matrix according to the distance.
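The segment-and-cluster step above resembles a product-quantization-style index. A minimal sketch, assuming equal-length dimension segments and a naive k-means per segment; a production system would typically use a dedicated library such as faiss:

```python
import numpy as np

def segment_cluster_centers(matrix, n_segments, n_clusters, seed=0):
    """Split each vector's dimensions into equal segments and run a tiny
    k-means within each segment, returning one set of cluster centers
    per segment (illustrative sketch, not the patent's exact algorithm)."""
    rng = np.random.default_rng(seed)
    dim = matrix.shape[1]
    seg_len = dim // n_segments
    centers = []
    for s in range(n_segments):
        seg = matrix[:, s * seg_len:(s + 1) * seg_len]
        # naive k-means: random initial centers, then 10 refinement rounds
        c = seg[rng.choice(len(seg), n_clusters, replace=False)]
        for _ in range(10):
            assign = np.argmin(((seg[:, None] - c[None]) ** 2).sum(-1), axis=1)
            for j in range(n_clusters):
                if np.any(assign == j):
                    c[j] = seg[assign == j].mean(axis=0)
        centers.append(c)
    return centers
```

Each segment ends up with its own small codebook of cluster centers; distances from a query's segment to these centers can then approximate the full-vector distance cheaply.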
Further, when the determining unit 405 determines the index corresponding to the sentence vector of the data to be retrieved according to the similarity, it is specifically configured to:
sorting the similarities in descending order;
and acquiring the indexes corresponding to the vectors, in each first vector matrix in the preset vector matrix library, whose similarities rank in the top K, where K is a positive integer.
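The descending sort and top-K index selection above can be sketched as follows (illustrative names, assuming numpy):

```python
import numpy as np

def top_k_indices(similarities, k):
    """Sort similarities in descending order and return the indexes of
    the K largest; these indexes are then used to look up the matching
    sentences in the preset database."""
    order = np.argsort(similarities)[::-1]   # descending order
    return order[:k].tolist()

similarities = np.array([0.2, 0.9, 0.5, 0.7])
top2 = top_k_indices(similarities, 2)
```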
In the embodiment of the invention, a text retrieval device based on a language model can acquire a labeled data set, where the labeled data set includes a plurality of labeled sentences with the same semantic meaning; extract sentence feature vectors from the labeled data set, and input the sentence feature vectors into a pre-trained Bert model for training to obtain a sentence vector model; acquire data to be retrieved, and input the data to be retrieved into the sentence vector model to obtain a sentence vector of the data to be retrieved; calculate, according to the sentence vector of the data to be retrieved, the similarity between each first vector matrix in a preset vector matrix library and the sentence vector; and determine, according to the similarity, an index corresponding to the sentence vector of the data to be retrieved, and determine, according to the index, a sentence corresponding to the data to be retrieved from a preset database. The embodiment of the invention trains the sentence vector model by calculating the loss function value from the interaction matrix and the label matrix, and calculates the similarity between the sentence vector of the data to be retrieved output by the sentence vector model and each first vector matrix in the preset vector matrix library, so that the similarity is more accurate; the sentence corresponding to the data to be retrieved is then determined according to that similarity, which improves the accuracy and efficiency of text retrieval.
Referring to fig. 5, fig. 5 is a schematic block diagram of a computer device provided in an embodiment of the present invention. In some embodiments, the computer device in the embodiment shown in fig. 5 may include: one or more processors 501, one or more input devices 502, one or more output devices 503, and a memory 504. The processor 501, the input device 502, the output device 503, and the memory 504 are connected by a bus 505. The memory 504 is used for storing a computer program, and the processor 501 is used for executing the program stored in the memory 504. The processor 501 is configured to invoke the program to perform:
acquiring a labeled data set, wherein the labeled data set comprises a plurality of labeled sentences with the same semantic meaning;
extracting sentence characteristic vectors from the labeled data set, and inputting the sentence characteristic vectors into a pre-trained Bert model for training to obtain a sentence vector model;
acquiring data to be retrieved, and inputting the data to be retrieved into the sentence vector model to obtain a sentence vector of the data to be retrieved;
calculating the similarity between each first vector matrix in a preset vector matrix library and the sentence vector of the data to be retrieved according to the sentence vector of the data to be retrieved;
and determining an index corresponding to the sentence vector of the data to be retrieved according to the similarity, and determining a sentence corresponding to the data to be retrieved from a preset database according to the index.
Further, when the processor 501 extracts a sentence feature vector from the labeled data set, it is specifically configured to:
determining the data volume of each sentence included in the data set according to the labeled data set;
and determining a target data set from the labeled data sets according to the data quantity, and extracting the sentence characteristic vector from the target data set.
Further, when the processor 501 determines the target data set from the labeled data sets according to the data volume, it is specifically configured to:
acquiring the data format of each sentence included in the labeled data set, and determining a target data format according to the data volume and the data format;
and recombining each sentence included in the labeled data set according to the target data format to obtain the target data set.
Further, the processor 501 inputs the sentence feature vector into a pre-trained Bert model for training, and when obtaining a sentence vector model, is specifically configured to:
inputting the sentence characteristic vector into the pre-trained Bert model to obtain a loss function value;
comparing the loss function value with a target loss function value, and adjusting the model parameter of the Bert model according to the comparison result when the comparison result does not meet the preset condition;
inputting the sentence characteristic vector into the Bert model after model parameters are adjusted for retraining;
and when the comparison result of the obtained loss function value and the target loss function value meets a preset condition, determining to obtain the sentence vector model.
Further, the processor 501 inputs the sentence feature vector into the pre-trained Bert model, and when obtaining the loss function value, is specifically configured to:
inputting the sentence characteristic vector into the Bert model to obtain a second vector matrix corresponding to the sentence characteristic vector;
multiplying each vector in the second vector matrix pairwise to obtain an interaction matrix, wherein the interaction matrix comprises a plurality of element values;
and determining a label matrix according to each element value in the interaction matrix, and calculating to obtain the loss function value according to the interaction matrix and the label matrix.
Further, when the processor 501 calculates the similarity between each first vector matrix in a preset vector matrix library and the sentence vector of the data to be retrieved according to the sentence vector of the data to be retrieved, the processor is specifically configured to:
obtaining the dimensionality of each first vector matrix in the preset vector matrix library, and carrying out segmentation processing on each first vector matrix according to the dimensionality;
clustering vectors in the first vector matrix of each segment after the segmentation processing respectively to obtain a plurality of cluster centers;
and calculating the distance between the sentence vector of the data to be retrieved and each clustered cluster center, and determining the similarity between the sentence vector of the data to be retrieved and each first vector matrix according to the distance.
Further, when determining the index corresponding to the sentence vector of the data to be retrieved according to the similarity, the processor 501 is specifically configured to:
sorting the similarities in descending order;
and acquiring the indexes corresponding to the vectors, in each first vector matrix in the preset vector matrix library, whose similarities rank in the top K, where K is a positive integer.
In the embodiment of the invention, computer equipment can acquire a labeled data set, where the labeled data set includes a plurality of labeled sentences with the same semantic meaning; extract sentence feature vectors from the labeled data set, and input the sentence feature vectors into a pre-trained Bert model for training to obtain a sentence vector model; acquire data to be retrieved, and input the data to be retrieved into the sentence vector model to obtain a sentence vector of the data to be retrieved; calculate, according to the sentence vector of the data to be retrieved, the similarity between each first vector matrix in a preset vector matrix library and the sentence vector; and determine, according to the similarity, an index corresponding to the sentence vector of the data to be retrieved, and determine, according to the index, a sentence corresponding to the data to be retrieved from a preset database. The embodiment of the invention trains the sentence vector model by calculating the loss function value from the interaction matrix and the label matrix, and calculates the similarity between the sentence vector of the data to be retrieved output by the sentence vector model and each first vector matrix in the preset vector matrix library, so that the similarity is more accurate; the sentence corresponding to the data to be retrieved is then determined according to that similarity, which improves the accuracy and efficiency of text retrieval.
It should be understood that, in the embodiment of the present invention, the processor 501 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Input devices 502 may include a touch pad, microphone, etc., and output devices 503 may include a display (LCD, etc.), speakers, etc.
The memory 504 may include a read-only memory and a random access memory, and provides instructions and data to the processor 501. A portion of the memory 504 may also include non-volatile random access memory. For example, the memory 504 may also store device type information.
In specific implementation, the processor 501, the input device 502, and the output device 503 described in this embodiment of the present invention may execute the implementation described in the method embodiment shown in fig. 1 provided in this embodiment of the present invention, and may also execute the implementation of the text retrieval device based on the language model described in fig. 4 in this embodiment of the present invention, which is not described herein again.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the method for text retrieval based on a language model described in the embodiment corresponding to fig. 1 may be implemented, or the apparatus for text retrieval based on a language model according to the embodiment corresponding to fig. 4 may also be implemented, which is not described herein again.
The computer-readable storage medium may be an internal storage unit of the language model based text retrieval device according to any of the foregoing embodiments, for example, a hard disk or a memory of the device. The computer-readable storage medium may also be an external storage device of the language model based text retrieval device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the language model based text retrieval device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the language model based text retrieval device, and may also be used to temporarily store data that has been output or is to be output.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product stored in a computer-readable storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a terminal, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned computer-readable storage media include: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code. The computer-readable storage medium may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, an application program required for at least one function, and the like, and the storage data area may store data created according to the use of the blockchain node, and the like.
It is emphasized that the data may also be stored in a node of a blockchain in order to further ensure the privacy and security of the data. A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated with one another by cryptographic methods, where each data block contains the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The above description is only a part of the embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present invention, and these modifications or substitutions should be covered within the scope of the present invention.

Claims (10)

1. A text retrieval method based on a language model is characterized by comprising the following steps:
acquiring a labeled data set, wherein the labeled data set comprises a plurality of labeled sentences with the same semantic meaning;
extracting sentence characteristic vectors from the labeled data set, and inputting the sentence characteristic vectors into a pre-trained Bert model for training to obtain a sentence vector model;
acquiring data to be retrieved, and inputting the data to be retrieved into the sentence vector model to obtain a sentence vector of the data to be retrieved;
calculating the similarity between each first vector matrix in a preset vector matrix library and the sentence vector of the data to be retrieved according to the sentence vector of the data to be retrieved;
and determining an index corresponding to the sentence vector of the data to be retrieved according to the similarity, and determining a sentence corresponding to the data to be retrieved from a preset database according to the index.
2. The method of claim 1, wherein said extracting sentence feature vectors from said labeled data sets comprises:
determining the data volume of each sentence included in the data set according to the labeled data set;
and determining a target data set from the labeled data sets according to the data quantity, and extracting the sentence characteristic vector from the target data set.
3. The method of claim 2, wherein determining a target dataset from the labeled datasets based on the data volume comprises:
acquiring the data format of each sentence included in the labeled data set, and determining a target data format according to the data volume and the data format;
and recombining each sentence included in the labeled data set according to the target data format to obtain the target data set.
4. The method of claim 2, wherein the inputting the sentence feature vector into a pre-trained Bert model for training to obtain a sentence vector model comprises:
inputting the sentence characteristic vector into the pre-trained Bert model to obtain a loss function value;
comparing the loss function value with a target loss function value, and adjusting the model parameter of the Bert model according to the comparison result when the comparison result does not meet the preset condition;
inputting the sentence characteristic vector into the Bert model after model parameters are adjusted for retraining;
and when the comparison result of the obtained loss function value and the target loss function value meets a preset condition, determining to obtain the sentence vector model.
5. The method of claim 4, wherein said inputting the sentence feature vector into the pre-trained Bert model resulting in a loss function value comprises:
inputting the sentence characteristic vector into the Bert model to obtain a second vector matrix corresponding to the sentence characteristic vector;
multiplying each vector in the second vector matrix pairwise to obtain an interaction matrix, wherein the interaction matrix comprises a plurality of element values;
and determining a label matrix according to each element value in the interaction matrix, and calculating to obtain the loss function value according to the interaction matrix and the label matrix.
6. The method according to claim 1, wherein the calculating the similarity between each first vector matrix in a preset vector matrix library and the sentence vector of the data to be retrieved according to the sentence vector of the data to be retrieved includes:
obtaining the dimensionality of each first vector matrix in the preset vector matrix library, and carrying out segmentation processing on each first vector matrix according to the dimensionality;
clustering vectors in the first vector matrix of each segment after the segmentation processing respectively to obtain a plurality of cluster centers;
and calculating the distance between the sentence vector of the data to be retrieved and each clustered cluster center, and determining the similarity between the sentence vector of the data to be retrieved and each first vector matrix according to the distance.
7. The method of claim 6, wherein the determining an index corresponding to a sentence vector of the data to be retrieved according to the similarity comprises:
sorting the similarities in descending order;
and acquiring the indexes corresponding to the vectors, in each first vector matrix in the preset vector matrix library, whose similarities rank in the top K, wherein K is a positive integer.
8. A text retrieval apparatus based on a language model, comprising:
the system comprises an acquisition unit, a storage unit and a processing unit, wherein the acquisition unit is used for acquiring a labeled data set, and the labeled data set comprises a plurality of labeled sentences with the same semantic meaning;
the training unit is used for extracting sentence characteristic vectors from the labeled data set and inputting the sentence characteristic vectors into a pre-trained Bert model for training to obtain a sentence vector model;
the test unit is used for acquiring data to be retrieved and inputting the data to be retrieved into the sentence vector model to obtain a sentence vector of the data to be retrieved;
the retrieval unit is used for calculating the similarity between each first vector matrix in a preset vector matrix library and the sentence vector of the data to be retrieved according to the sentence vector of the data to be retrieved;
and the determining unit is used for determining an index corresponding to the sentence vector of the data to be retrieved according to the similarity, and determining a sentence corresponding to the data to be retrieved from a preset database according to the index.
9. A computer device comprising a processor and a memory, wherein the memory is configured to store a computer program and the processor is configured to invoke the computer program to perform the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executed by a processor to implement the method of any one of claims 1-7.
CN202111019330.1A 2021-08-31 2021-08-31 Text retrieval method, device and equipment based on language model and storage medium Pending CN113722512A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111019330.1A CN113722512A (en) 2021-08-31 2021-08-31 Text retrieval method, device and equipment based on language model and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111019330.1A CN113722512A (en) 2021-08-31 2021-08-31 Text retrieval method, device and equipment based on language model and storage medium

Publications (1)

Publication Number Publication Date
CN113722512A true CN113722512A (en) 2021-11-30

Family

ID=78680492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111019330.1A Pending CN113722512A (en) 2021-08-31 2021-08-31 Text retrieval method, device and equipment based on language model and storage medium

Country Status (1)

Country Link
CN (1) CN113722512A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117150305A (en) * 2023-11-01 2023-12-01 杭州光云科技股份有限公司 Text data enhancement method and device integrating retrieval and filling and electronic equipment
CN117171331A (en) * 2023-11-01 2023-12-05 清华大学 Professional field information interaction method, device and equipment based on large language model
CN117171331B (en) * 2023-11-01 2024-02-06 清华大学 Professional field information interaction method, device and equipment based on large language model
CN117150305B (en) * 2023-11-01 2024-02-27 杭州光云科技股份有限公司 Text data enhancement method and device integrating retrieval and filling and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination