CN115203378A - Retrieval enhancement method, system and storage medium based on pre-training language model - Google Patents

Retrieval enhancement method, system and storage medium based on pre-training language model Download PDF

Info

Publication number
CN115203378A
Authority
CN
China
Prior art keywords
text
language model
search
index
vector representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211103284.8A
Other languages
Chinese (zh)
Other versions
CN115203378B (en)
Inventor
王宇龙
薄琳
华菁云
周明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lanzhou Technology Co ltd
Original Assignee
Beijing Lanzhou Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lanzhou Technology Co ltd filed Critical Beijing Lanzhou Technology Co ltd
Priority to CN202211103284.8A priority Critical patent/CN115203378B/en
Publication of CN115203378A publication Critical patent/CN115203378A/en
Application granted granted Critical
Publication of CN115203378B publication Critical patent/CN115203378B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology

Abstract

The invention relates to the technical field of natural language processing, and in particular to a retrieval enhancement method and system based on a pre-trained language model and a computer-readable storage medium. The retrieval enhancement method comprises the following steps: acquiring a plurality of texts from a preset corpus and inputting the plurality of texts into a pre-trained language model to obtain a vector representation of each text; correspondingly distributing the vector representation of each text to a plurality of nodes based on the number of nodes to obtain the text vector representations of the plurality of nodes; and inputting the text vector representation of each node into a preset search library to establish an index, obtaining the vector index of each text. The pre-trained language model learns the information contained in the text and can therefore represent it more accurately; each node processes only part of the vectors, so the distributed arrangement raises retrieval speed and expands the retrieval magnitude; and the preset search library enables second-level similarity vector search, accelerating vector indexing and solving the problem of low retrieval efficiency in the prior art.

Description

Pre-training language model-based retrieval enhancement method and system and storage medium
Technical Field
The invention relates to the technical field of natural language processing, in particular to a search enhancement method and system based on a pre-training language model and a computer readable storage medium.
Background
Text retrieval, also known as natural language retrieval, refers to matching and locating natural-language words directly by computer, without any manual indexing of the documents. With the advent of computers, people can manage far more documents far more conveniently, and a computer hard disk can even hold the collections of every library in the world. To find the documents managed by a computer quickly, the first generation of text retrieval technology emerged: documents containing the query keywords are picked out by keyword matching and presented to the user as retrieval results.
As the number of documents increased, the first-generation technology found it difficult to return accurate results, so the second generation of text retrieval technology, based on text content, was developed. In this approach the system computes the similarity between a text and the query sentence according to its understanding of both, ranks the retrieval results by similarity, and presents the most similar result to the user. An existing retrieval scheme uses word2vec to represent the corpus; because each word corresponds to exactly one vector, it cannot resolve ambiguous (polysemous) words. Moreover, the representation is static: although it generalizes well, it cannot be dynamically optimized for a specific task, and retrieval efficiency is low when the corpus is large.
Disclosure of Invention
In order to solve the problem of low retrieval efficiency in the prior art, the invention provides a method, a system and a computer readable storage medium for enhancing retrieval based on a pre-training language model.
The invention provides a retrieval enhancement method based on a pre-training language model, which solves the technical problem and comprises the following steps:
acquiring a plurality of texts in a preset corpus, and inputting the plurality of texts into a pre-training language model to obtain vector representation of each text;
correspondingly distributing the vector representation of each text to a plurality of nodes based on the number of the nodes to obtain the text vector representation of the plurality of nodes;
and inputting the text vector representation of each node into a preset search library to establish an index, and obtaining the vector index of each text.
Preferably, the pre-trained language model is a Bert model.
Preferably, the preset search library is a Faiss library.
Preferably, the obtaining of the plurality of texts in the preset corpus and the inputting of the plurality of texts into the pre-training language model to obtain the vector representation of each text specifically includes the following steps:
acquiring a plurality of texts of a preset corpus, and identifying the plurality of texts to obtain digital representation of each text;
and inputting the digital representation of each text into a pre-training language model for training to obtain the vector representation of each text.
Preferably, the Faiss library adopts one or more of the HNSW search algorithm, the Flat search algorithm, the PCAR search algorithm, the OPQ search algorithm, or the IVF search algorithm.
Preferably, the step of inputting the text vector representation of each node into a preset search library to establish an index to obtain the vector index of each text specifically includes the following steps:
selecting a retrieval algorithm according to a preset rule by the text vector representation of each node;
and establishing an index for the text vector representation of each node through the corresponding retrieval algorithm to obtain the vector index of each text.
Preferably, the preset rule includes one of a precision-based rule, a memory-limitation-based rule, or a data-size-based rule.
Preferably, the selection of the search algorithm according to the preset rule by the vector representation of each node specifically comprises the following steps:
if the rule is precision-based, selecting the HNSW retrieval algorithm and establishing the index;
or, if the rule is memory-limitation-based, selecting one of the Flat retrieval algorithm, the PCAR retrieval algorithm, or the OPQ retrieval algorithm, and establishing the index;
or, if the rule is data-size-based, selecting the IVF retrieval algorithm and establishing the index.
The invention also provides a retrieval enhancement system based on a pre-training language model for solving the technical problems, which comprises the following modules:
an acquisition module: used for acquiring a plurality of texts in a preset corpus and inputting the plurality of texts into a pre-trained language model to obtain a vector representation of each text;
a processing module: used for correspondingly distributing the vector representation of each text to a plurality of nodes based on the number of the nodes to obtain the text vector representations of the plurality of nodes;
an index establishing module: used for inputting the text vector representation of each node into a preset search library to establish an index, obtaining the vector index of each text.
The present invention further provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements any one of the above-mentioned methods for enhancing a search based on a pre-trained language model.
Compared with the prior art, the search enhancement method, the search enhancement system and the computer readable storage medium based on the pre-training language model have the following advantages:
1. In the retrieval enhancement method based on a pre-trained language model, a plurality of texts are first obtained from a preset corpus and input into the pre-trained language model, which produces the vector representation corresponding to each text. Because the pre-trained language model has learned the information contained in the text, it yields a more accurate text representation than traditional methods. The vector representations of the texts are then distributed across a plurality of nodes according to the number of nodes, so that each node processes only part of the text vectors; this distributed arrangement raises retrieval speed and expands the retrieval magnitude. Finally, the text vector representation of each node is input into a preset search library to establish an index. The preset search library enables second-level similarity vector search, which accelerates obtaining the vector index of each text, improves efficiency, and gives the method strong practicability, solving the problem of low retrieval efficiency in the prior art.
2. The pre-training language model is a Bert model, and text content information can be better learned through the model, so that more accurate text vector representation can be obtained, the index can be favorably established for the subsequent vector representation of the text, and the method has stronger practicability.
3. The preset search library is a Faiss library (Facebook AI Similarity Search). The Faiss library is essentially a vector database: the search base is a collection of raw vectors, a query vector X is given as input, and the Faiss library returns the K vectors most similar to X. It thus provides an efficient and reliable method for similarity clustering and search, and has strong practicability.
4. The method comprises the steps of firstly, acquiring a plurality of texts in a preset corpus, and identifying the plurality of texts to obtain the digital representation of each text; the digital representation of each text is input into a pre-training language model for training, so that the vector representation of each text is obtained, and the obtained vector representation provides data support for subsequently obtaining the index of the text.
5. The retrieval algorithm adopted by the Faiss library is one or more of the HNSW retrieval algorithm, the Flat retrieval algorithm, the PCAR retrieval algorithm, the OPQ retrieval algorithm, or the IVF retrieval algorithm, and can be chosen freely according to the effect the user requires. The HNSW retrieval algorithm searches quickly but consumes a large amount of memory and makes dynamic deletion of data difficult. The Flat retrieval algorithm has high precision, but both its speed and the data volume it can handle are limited: the data are not compressed but stored directly in memory, and the higher the required accuracy, the slower the search. The PCAR retrieval algorithm first reduces the dimensionality of the data, so data of the same magnitude occupy less memory. The OPQ retrieval algorithm also reduces the dimensionality, but since OPQ is a linear transformation the data can be compressed better. The IVF retrieval algorithm can be chosen when the user has constraints on the data size. Because the Faiss library contains several retrieval algorithms, the user can select the most suitable one according to actual requirements, which gives the method strong practicability.
6. In the steps of the invention, the text vector of each node selects a proper retrieval algorithm according to a preset rule, after the selection is finished, the vector representation of each node is input into the corresponding retrieval algorithm to establish an index, and the vector index of each text is obtained, so that the proper retrieval algorithm can be selected according to the requirements of users, and the method has strong practicability.
7. According to the invention, the preset rule is set, so that a user can select the retrieval algorithm according to the requirement, wherein the selection can be carried out according to the precision and the size of the memory or the memory data, and the method has strong practicability.
8. In the steps of the invention, the user can select the corresponding retrieval algorithm and establish the index according to specific requirements, based on one of precision, memory limitation, or data size: if the rule is precision-based, the HNSW search algorithm can be selected; if it is memory-limitation-based, one of the Flat, PCAR, or OPQ search algorithms can be selected; and if it is data-size-based, the IVF search algorithm can be selected. The user can thus choose a suitable search algorithm as required, which gives the method strong practicability.
9. The invention also provides a retrieval enhancement system based on the pre-training language model and a computer readable storage medium, which have the same beneficial effects as the retrieval enhancement method based on the pre-training language model, and are not repeated herein.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a flowchart illustrating steps of a method for enhancing a search based on a pre-trained language model according to a first embodiment of the present invention.
Fig. 2 is a flowchart illustrating a step S1 of a search enhancement method based on a pre-trained language model according to a first embodiment of the present invention.
Fig. 3 is a flowchart illustrating a step S3 of a method for enhancing search based on a pre-trained language model according to a first embodiment of the present invention.
Fig. 4 is a flowchart illustrating the step S31 of a method for enhancing search based on a pre-trained language model according to a first embodiment of the present invention.
Fig. 5 is a block diagram of a pre-trained language model based search enhancement system according to a second embodiment of the present invention.
The attached drawings indicate the following:
1. a search enhancement system based on a pre-trained language model;
10. an acquisition module; 20. a processing module; 30. and an index establishing module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and implementation examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, a first embodiment of the present invention provides a method for enhancing search based on a pre-trained language model, including the following steps:
s1: acquiring a plurality of texts in a preset corpus, and inputting the plurality of texts into a pre-training language model to obtain vector representation of each text;
s2: correspondingly distributing the vector representation of each text to a plurality of nodes based on the number of the nodes to obtain the text vector representation of the plurality of nodes;
s3: and inputting the text vector representation of each node into a preset search library to establish an index, and obtaining the vector index of each text.
It can be understood that, in the steps of the present invention, a plurality of texts in a preset corpus are first obtained and input into the pre-trained language model, which produces the vector representation corresponding to each text. Because the pre-trained language model has learned the information contained in the text, a more accurate text representation is obtained than with traditional methods. The vector representations of the texts are then distributed across a plurality of nodes according to the number of nodes, so that each node processes only part of the text vectors; this distributed arrangement raises retrieval speed and expands the retrieval magnitude. Finally, the text vector representation of each node is input into a preset search library to establish an index. The preset search library enables second-level similarity vector search, which accelerates obtaining the vector index of each text and improves efficiency; the present invention therefore has strong practicability and solves the problem of low retrieval efficiency in the prior art.
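The distributed arrangement described above can be sketched in Python as follows; the node class, the shard-splitting strategy, and the use of a flat Faiss index per node are illustrative assumptions made for this sketch, not details prescribed by the embodiment.

```python
# Single-process sketch of the distributed idea: the text vectors are split
# across several "nodes", each node builds its own Faiss index, and a query
# is answered by searching every shard and merging the partial results.
import numpy as np
import faiss


class IndexNode:
    """One retrieval node holding a shard of the text vectors."""

    def __init__(self, vectors: np.ndarray, ids: np.ndarray):
        dim = vectors.shape[1]
        self.index = faiss.IndexFlatIP(dim)   # inner-product search over this shard
        self.index.add(vectors)
        self.ids = ids                        # global text ids for this shard

    def search(self, query: np.ndarray, k: int):
        scores, local = self.index.search(query, k)
        return scores[0], self.ids[local[0]]


def build_nodes(all_vectors: np.ndarray, num_nodes: int):
    ids = np.arange(len(all_vectors))
    shards = np.array_split(all_vectors, num_nodes)
    id_shards = np.array_split(ids, num_nodes)
    return [IndexNode(v.astype("float32"), i) for v, i in zip(shards, id_shards)]


def search_all(nodes, query: np.ndarray, k: int):
    # Query every node, then keep the overall top-k by score.
    scores, gids = [], []
    for node in nodes:
        s, g = node.search(query, k)
        scores.append(s)
        gids.append(g)
    scores, gids = np.concatenate(scores), np.concatenate(gids)
    top = np.argsort(-scores)[:k]
    return list(zip(gids[top], scores[top]))


if __name__ == "__main__":
    vecs = np.random.rand(10000, 768).astype("float32")   # stand-in for BERT vectors
    nodes = build_nodes(vecs, num_nodes=4)
    q = np.random.rand(1, 768).astype("float32")
    print(search_all(nodes, q, k=5))
```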
As an alternative embodiment, the pre-trained language model is a Bert model.
Understandably, the pre-training language model is a Bert model, and text content information can be better learned through the model, so that more accurate text vector representation can be obtained, the index establishment for the subsequent text vector representation is facilitated, and the method has stronger practicability.
As an alternative embodiment, the preset search library is a Faiss library.
It can be understood that the preset search library of the present invention is a Faiss library (Facebook AI Similarity Search), which is essentially a vector database: the search base is a collection of raw vectors, a query vector X is given as input, and the Faiss library returns the K vectors most similar to X, providing an efficient and reliable method for similarity clustering and search with strong practicability.
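As a minimal illustration of this behaviour (input a query vector X, get back the K most similar database vectors), the following sketch uses the public Faiss Python API; the dimension, the random data, and the value of K are placeholders.

```python
# Minimal Faiss usage: exact nearest-neighbour search over raw vectors.
import numpy as np
import faiss

d = 768                                            # vector dimension
xb = np.random.rand(10000, d).astype("float32")    # database vectors
xq = np.random.rand(1, d).astype("float32")        # query vector X

index = faiss.IndexFlatL2(d)   # exact L2 search over the raw vectors
index.add(xb)
D, I = index.search(xq, k=10)  # distances and ids of the 10 nearest vectors
print(I[0])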
Referring to fig. 2, step S1 specifically includes the following steps:
s11: acquiring a plurality of texts of a preset corpus, and identifying the plurality of texts to obtain digital representation of each text;
s12: and inputting the digital representation of each text into a pre-training language model for training to obtain the vector representation of each text.
It can be understood that, in the steps of the present invention, a plurality of texts in a preset corpus are obtained first, and the plurality of texts are identified to obtain a digital representation of each text; the digital representation of each text is input into a pre-training language model for training, so that the vector representation of each text is obtained, and the obtained vector representation provides data support for subsequently obtaining the index of the text.
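A minimal sketch of these two steps, assuming the Bert model is accessed through the Hugging Face transformers library; the checkpoint name ("bert-base-chinese"), the mean-pooling choice, and the helper name encode are illustrative assumptions, not details from the embodiment.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

def encode(texts):
    # S11: turn each text into its numerical (token id) representation.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # S12: feed the token ids through the pre-trained language model.
    with torch.no_grad():
        out = model(**batch).last_hidden_state          # (batch, length, 768)
    mask = batch["attention_mask"].unsqueeze(-1)        # ignore padding tokens
    return ((out * mask).sum(1) / mask.sum(1)).numpy()  # mean-pooled text vectors

vectors = encode(["今天天气很好", "文本检索示例"])
print(vectors.shape)   # (2, 768)
```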
In an alternative embodiment, the Faiss database employs one or more of HNSW search algorithm, flat search algorithm, PCAR search algorithm, OPQ search algorithm or IVF search algorithm.
It can be understood that the retrieval algorithm adopted by the Faiss library is one or more of the HNSW retrieval algorithm, the Flat retrieval algorithm, the PCAR retrieval algorithm, the OPQ retrieval algorithm, or the IVF retrieval algorithm, and can be chosen freely according to the effect the user requires. The HNSW retrieval algorithm searches quickly but consumes a large amount of memory and makes dynamic deletion of data difficult. The Flat retrieval algorithm has high precision, but both its speed and the data volume it can handle are limited: the data are not compressed but stored directly in memory, and the higher the required accuracy, the slower the search. The PCAR retrieval algorithm first reduces the dimensionality of the data, so data of the same magnitude occupy less memory. The OPQ retrieval algorithm also reduces the dimensionality, but since OPQ is a linear transformation the data can be compressed better. The IVF retrieval algorithm can be chosen when the user has constraints on the data size. Because the Faiss library contains several retrieval algorithms, the user can select the most suitable one according to actual requirements, which gives the method strong practicability.
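For reference, the five index families named above can all be built through faiss.index_factory; the factory strings below follow common Faiss conventions, and the concrete parameter values (32 neighbours, 64 output dimensions, 256 or 1024 clusters) are illustrative assumptions rather than values given by the embodiment.

```python
import faiss

d = 768  # dimension of the BERT text vectors

candidates = {
    "HNSW": faiss.index_factory(d, "HNSW32"),                # fast graph search, high memory
    "Flat": faiss.index_factory(d, "Flat"),                  # exact, uncompressed
    "PCAR": faiss.index_factory(d, "PCAR64,Flat"),           # PCA + rotation, 768 -> 64 dims
    "OPQ":  faiss.index_factory(d, "OPQ16_64,IVF256,PQ16"),  # rotation + product quantization
    "IVF":  faiss.index_factory(d, "IVF1024,Flat"),          # inverted lists over 1024 clusters
}

for name, index in candidates.items():
    print(name, "needs training:", not index.is_trained)
```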
Referring to fig. 3, step S3 specifically includes the following steps:
s31: the text vector representation of each node selects a retrieval algorithm according to a preset rule;
s32: and (4) establishing indexes for the text vector representation of each node through a corresponding retrieval algorithm to obtain the vector index of each text.
It can be understood that, in the steps of the invention, the text vector of each node selects a suitable retrieval algorithm according to the preset rule; after the selection is finished, the vector representation of each node is input into the corresponding retrieval algorithm to establish an index, and the vector index of each text is obtained. A suitable retrieval algorithm can thus be selected according to the user's requirements, which gives the invention strong practicability.
As an optional implementation manner, the preset rule includes one of precision lookup, memory limitation based or memory data size based.
It can be understood that in the invention, by setting the preset rule, the user can select the retrieval algorithm according to the requirement, wherein the selection can be performed according to the precision and the size of the memory or the memory data, and the invention has strong practicability.
Referring to fig. 4, step S31 specifically includes the following steps:
s311: if the accuracy is based, selecting an HNSW retrieval algorithm and establishing an index;
s312: or, if the index is based on the memory limitation, selecting one of a Flat retrieval algorithm, a PCAR retrieval algorithm or an OPQ retrieval algorithm, and establishing the index;
s313: or, if based on the data size, selecting the IVF retrieval algorithm and establishing the index.
It can be understood that, in the steps of the present invention, the user may select the corresponding search algorithm and establish the index according to specific requirements, based on one of precision, memory limitation, or data size: if the rule is precision-based, the HNSW search algorithm can be selected; if it is memory-limitation-based, one of the Flat, PCAR, or OPQ search algorithms can be selected; and if it is data-size-based, the IVF search algorithm can be selected. The user can choose a suitable search algorithm as required, which gives the method strong practicability.
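One possible reading of this rule-based selection is a small helper that maps the preset rule to a Faiss factory string and then trains and fills the index; the factory parameters, the training-sample size, and the function name are assumptions made for illustration.

```python
import faiss
import numpy as np

def build_index_by_rule(vectors: np.ndarray, rule: str) -> faiss.Index:
    d = vectors.shape[1]
    if rule == "accuracy":            # S311: precision first -> HNSW
        spec = "HNSW32"
    elif rule == "memory":            # S312: memory-limited -> Flat / PCAR / OPQ
        spec = "PCAR64,Flat"          # one of the three options, chosen here for illustration
    elif rule == "data_size":         # S313: data-size-limited -> IVF
        spec = "IVF1024,Flat"
    else:
        raise ValueError(f"unknown rule: {rule}")
    index = faiss.index_factory(d, spec)
    if not index.is_trained:          # IVF / PCAR / OPQ indexes need a training pass
        index.train(vectors)
    index.add(vectors)
    return index

idx = build_index_by_rule(np.random.rand(5000, 768).astype("float32"), "memory")
print(idx.ntotal)   # 5000
```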
Through the above process, the vector index obtained by the first embodiment of the invention for each text in the preset corpus allows the text to be found quickly, which considerably improves the text retrieval speed and efficiency.
Further, when similarity calculation is performed on the search text provided by the user and the text in the preset corpus, a method for calculating the text similarity may be adopted.
Optionally, the method for calculating the text similarity is one of the Euclidean distance, the cosine distance, or the dot product; the embodiment of the present invention adopts the cosine distance to calculate the text similarity.
The formula for calculating the cosine distance is as follows:

d = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^{2}} \, \sqrt{\sum_{i} B_i^{2}}}

The calculation formula of the Euclidean distance is as follows:

d = \sqrt{\sum_{i} \left(A_i - B_i\right)^{2}}

The formula for calculating the dot product is:

d = A \cdot B = \sum_{i} A_i B_i

where d represents the distance (or similarity score) between the two vectors, A and B represent the two vectors for which the distance is calculated, and A_i, B_i represent the components of the i-th dimension of A and B respectively.
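The three measures can be computed directly with NumPy, following the formulas above; the function names are illustrative only.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Normalized dot product, as in the cosine formula above.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.sqrt(np.sum((a - b) ** 2)))

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b))

a, b = np.array([1.0, 0.0, 1.0]), np.array([0.5, 0.5, 1.0])
print(cosine_distance(a, b), euclidean_distance(a, b), dot_product(a, b))
```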
Referring to fig. 5, a second embodiment of the present invention provides a pre-trained language model based search enhancement system 1: the system comprises the following modules:
the acquisition module 10: used for acquiring a plurality of texts in a preset corpus and inputting the plurality of texts into the pre-trained language model to obtain a vector representation of each text;
the processing module 20: used for correspondingly distributing the vector representation of each text to a plurality of nodes based on the number of the nodes to obtain the text vector representations of the plurality of nodes;
the index establishing module 30: used for inputting the text vector representation of each node into a preset search library to establish an index, obtaining the vector index of each text.
It can be understood that, when the modules of the search enhancement system 1 based on the pre-trained language model are operated, they utilize the search enhancement method based on the pre-trained language model provided in the first embodiment; therefore, integrating or configuring different hardware to produce functions similar to the effects achieved by the acquisition module 10, the processing module 20, and the index establishing module 30 falls within the scope of the present invention.
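The module structure of system 1 can be sketched as follows; the class names merely mirror the acquisition, processing, and index establishing modules, the text encoder is stubbed with random vectors for brevity (a real system would call the pre-trained language model of the first embodiment), and none of the identifiers come from the patent itself.

```python
import numpy as np
import faiss

class AcquisitionModule:
    def run(self, corpus):
        # Stub encoder: replace with the pre-trained language model in practice.
        return np.random.rand(len(corpus), 768).astype("float32")

class ProcessingModule:
    def run(self, vectors, num_nodes):
        return np.array_split(vectors, num_nodes)   # one shard per node

class IndexModule:
    def run(self, shards):
        indexes = []
        for shard in shards:
            index = faiss.IndexFlatIP(shard.shape[1])
            index.add(shard)
            indexes.append(index)
        return indexes

corpus = ["文本一", "文本二", "文本三", "文本四"]
shards = ProcessingModule().run(AcquisitionModule().run(corpus), num_nodes=2)
indexes = IndexModule().run(shards)
print([ix.ntotal for ix in indexes])   # [2, 2]
```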
A third embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the search enhancement method based on a pre-training language model provided in the first embodiment of the present invention.
It will be appreciated that the processes described above with reference to the flow diagrams may be implemented as computer software programs, in accordance with the disclosed embodiments of the invention. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium. The computer program performs the above-mentioned functions defined in the method of the present application when executed by a Central Processing Unit (CPU). It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the embodiments provided herein, it should be understood that "B corresponding to a" means that B is associated with a from which B can be determined. It should also be understood, however, that determining B from a does not mean determining B from a alone, but may also be determined from a and/or other information.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Those skilled in the art should also appreciate that the embodiments described in this specification are exemplary and alternative embodiments, and that the acts and modules illustrated are not required in order to practice the invention.
In various embodiments of the present invention, it should be understood that the sequence numbers of the above-mentioned processes do not imply a necessary order of execution, and the order of execution of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
The flowchart and block diagrams in the figures of the present application illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will be understood that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Compared with the prior art, the search enhancement method, the search enhancement system and the computer readable storage medium based on the pre-training language model have the following advantages:
1. In the retrieval enhancement method based on a pre-trained language model, a plurality of texts are first obtained from a preset corpus and input into the pre-trained language model, which produces the vector representation corresponding to each text. Because the pre-trained language model has learned the information contained in the text, it yields a more accurate text representation than traditional methods. The vector representations of the texts are then distributed across a plurality of nodes according to the number of nodes, so that each node processes only part of the text vectors; this distributed arrangement raises retrieval speed and expands the retrieval magnitude. Finally, the text vector representation of each node is input into a preset search library to establish an index. The preset search library enables second-level similarity vector search, which accelerates obtaining the vector index of each text, improves efficiency, and gives the method strong practicability, solving the problem of low retrieval efficiency in the prior art.
2. The pre-training language model is a Bert model, and text content information can be better learned through the model, so that more accurate text vector representation can be obtained, the index can be favorably established for the subsequent vector representation of the text, and the method has stronger practicability.
3. The preset search library is a Faiss library (Facebook AI Similarity Search). The Faiss library is essentially a vector database: the search base is a collection of raw vectors, a query vector X is given as input, and the Faiss library returns the K vectors most similar to X. It thus provides an efficient and reliable method for similarity clustering and search, and has strong practicability.
4. A plurality of texts are first acquired from the preset corpus and identified to obtain the digital representation of each text; the digital representation of each text is then input into the pre-trained language model for training to obtain the vector representation of each text, and the obtained vector representation provides data support for subsequently obtaining the index of the text. Since the preset search library is a Faiss library, which is essentially a vector database, the vector representations produced by the pre-trained language model are well matched to it: the texts can be indexed directly through the Faiss library, which is convenient and highly practical.
5. The retrieval algorithm adopted by the Faiss library is one or more of the HNSW retrieval algorithm, the Flat retrieval algorithm, the PCAR retrieval algorithm, the OPQ retrieval algorithm, or the IVF retrieval algorithm, and can be chosen freely according to the effect the user requires. The HNSW retrieval algorithm searches quickly but consumes a large amount of memory and makes dynamic deletion of data difficult. The Flat retrieval algorithm has high precision, but both its speed and the data volume it can handle are limited: the data are not compressed but stored directly in memory, and the higher the required accuracy, the slower the search. The PCAR retrieval algorithm first reduces the dimensionality of the data, so data of the same magnitude occupy less memory. The OPQ retrieval algorithm also reduces the dimensionality, but since OPQ is a linear transformation the data can be compressed better. The IVF retrieval algorithm can be chosen when the user has constraints on the data size. Because the Faiss library contains several retrieval algorithms, the user can select the most suitable one according to actual requirements, which gives the method strong practicability.
6. In the steps of the invention, the text vector of each node selects a proper retrieval algorithm according to a preset rule, after the selection is finished, the vector representation of each node is input into the corresponding retrieval algorithm to establish an index, and the vector index of each text is obtained, so that the proper retrieval algorithm can be selected according to the user requirements, and the method has strong practicability.
7. According to the invention, the preset rule is set, so that a user can select the retrieval algorithm according to the requirement, wherein the selection can be carried out according to the precision and the size of the memory or the memory data, and the method has strong practicability.
8. In the steps of the invention, the user can select the corresponding retrieval algorithm and establish the index according to specific requirements, based on one of precision, memory limitation, or data size: if the rule is precision-based, the HNSW search algorithm can be selected; if it is memory-limitation-based, one of the Flat, PCAR, or OPQ search algorithms can be selected; and if it is data-size-based, the IVF search algorithm can be selected. The user can thus choose a suitable search algorithm as required, which gives the method strong practicability.
9. The invention also provides a retrieval enhancement system based on the pre-training language model and a computer readable storage medium, which have the same beneficial effects as the retrieval enhancement method based on the pre-training language model, and are not repeated herein.
The retrieval enhancement method, system and computer-readable storage medium based on a pre-trained language model disclosed in the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific implementation and application scope according to the idea of the present invention. In summary, the content of this description should not be construed as limiting the present invention, and any modification, equivalent replacement or improvement made within the principle of the present invention shall fall within its protection scope.

Claims (10)

1. A retrieval enhancement method based on a pre-training language model, characterized by comprising the following steps:
acquiring a plurality of texts in a preset corpus, and inputting the plurality of texts into a pre-training language model to obtain vector representation of each text;
correspondingly distributing the vector representation of each text to a plurality of nodes based on the number of the nodes to obtain the text vector representation of the plurality of nodes;
and inputting the text vector representation of each node into a preset search library to establish an index, and obtaining the vector index of each text.
2. The method of claim 1, wherein the pre-trained language model based search enhancement method comprises: the pre-training language model is a Bert model.
3. The pre-trained language model-based search enhancement method of claim 1, wherein: the preset search library is a Faiss library.
4. The method of claim 1, wherein the pre-trained language model based search enhancement method comprises: the method for obtaining the vector representation of each text by obtaining the plurality of texts in the preset corpus and inputting the plurality of texts into the pre-training language model specifically comprises the following steps:
acquiring a plurality of texts of a preset corpus, and identifying the plurality of texts to obtain digital representation of each text;
and inputting the digital representation of each text into a pre-training language model for training to obtain the vector representation of each text.
5. A method as claimed in claim 3, wherein the method comprises: the Faiss library adopts a retrieval algorithm which is one or more of an HNSW retrieval algorithm, a Flat retrieval algorithm, a PCAR retrieval algorithm, an OPQ retrieval algorithm or an IVF retrieval algorithm.
6. The method of claim 5, wherein the pre-trained language model based search enhancement method comprises: the method for inputting the text vector representation of each node into a preset search library to establish an index to obtain the vector index of each text specifically comprises the following steps:
selecting a retrieval algorithm according to a preset rule by the text vector representation of each node;
and establishing an index for the text vector representation of each node through a corresponding retrieval algorithm to obtain the vector index of each text.
7. The method of claim 6, wherein the pre-trained language model based search enhancement method comprises: the preset rule comprises one of precision search, memory limitation or memory data size.
8. The method of claim 7, wherein the pre-trained language model based search enhancement method comprises: the vector representation of each node selects a retrieval algorithm according to a preset rule, and specifically comprises the following steps:
if the rule is precision-based, selecting an HNSW retrieval algorithm and establishing an index;
or, if the rule is memory-limitation-based, selecting one of a Flat retrieval algorithm, a PCAR retrieval algorithm or an OPQ retrieval algorithm, and establishing the index;
or, if the rule is data-size-based, selecting the IVF retrieval algorithm and establishing the index.
9. A retrieval enhancement system based on a pre-training language model is characterized in that: the system comprises the following modules:
an acquisition module: used for acquiring a plurality of texts in a preset corpus and inputting the plurality of texts into a pre-training language model to obtain a vector representation of each text;
a processing module: used for correspondingly distributing the vector representation of each text to a plurality of nodes based on the number of the nodes to obtain the text vector representations of the plurality of nodes;
an index establishing module: used for inputting the text vector representation of each node into a preset search library to establish an index, obtaining the vector index of each text.
10. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program, when executed by a processor, implements a pre-trained language model based search enhancement method as claimed in any one of claims 1 to 8.
CN202211103284.8A 2022-09-09 2022-09-09 Retrieval enhancement method, system and storage medium based on pre-training language model Active CN115203378B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211103284.8A CN115203378B (en) 2022-09-09 2022-09-09 Retrieval enhancement method, system and storage medium based on pre-training language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211103284.8A CN115203378B (en) 2022-09-09 2022-09-09 Retrieval enhancement method, system and storage medium based on pre-training language model

Publications (2)

Publication Number Publication Date
CN115203378A true CN115203378A (en) 2022-10-18
CN115203378B CN115203378B (en) 2023-01-24

Family

ID=83572735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211103284.8A Active CN115203378B (en) 2022-09-09 2022-09-09 Retrieval enhancement method, system and storage medium based on pre-training language model

Country Status (1)

Country Link
CN (1) CN115203378B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117035064A (en) * 2023-10-10 2023-11-10 北京澜舟科技有限公司 Combined training method for retrieving enhanced language model and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080059417A1 (en) * 2006-08-28 2008-03-06 Akitomo Yamada Structured document management system and method of managing indexes in the same system
US20090077009A1 (en) * 2007-09-13 2009-03-19 International Business Machines Corporation System and method for storage, management and automatic indexing of structured documents
CN102004778A (en) * 2010-11-19 2011-04-06 清华大学 Text index online updating method in cloud environment
CN105787097A (en) * 2016-03-16 2016-07-20 中山大学 Distributed index establishment method and system based on text clustering
CN112836008A (en) * 2021-02-07 2021-05-25 中国科学院新疆理化技术研究所 Index establishing method based on decentralized storage data
CN113407738A (en) * 2021-07-12 2021-09-17 网易(杭州)网络有限公司 Similar text retrieval method and device, electronic equipment and storage medium
WO2022141876A1 (en) * 2020-12-31 2022-07-07 平安科技(深圳)有限公司 Word embedding-based search method, apparatus and device, and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080059417A1 (en) * 2006-08-28 2008-03-06 Akitomo Yamada Structured document management system and method of managing indexes in the same system
US20090077009A1 (en) * 2007-09-13 2009-03-19 International Business Machines Corporation System and method for storage, management and automatic indexing of structured documents
CN102004778A (en) * 2010-11-19 2011-04-06 清华大学 Text index online updating method in cloud environment
CN105787097A (en) * 2016-03-16 2016-07-20 中山大学 Distributed index establishment method and system based on text clustering
WO2022141876A1 (en) * 2020-12-31 2022-07-07 平安科技(深圳)有限公司 Word embedding-based search method, apparatus and device, and storage medium
CN112836008A (en) * 2021-02-07 2021-05-25 中国科学院新疆理化技术研究所 Index establishing method based on decentralized storage data
CN113407738A (en) * 2021-07-12 2021-09-17 网易(杭州)网络有限公司 Similar text retrieval method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SERGEY MELNIK et al.: "Building a distributed full-text index for the web", ACM Transactions on Information Systems *
HOU XIANGSONG et al.: "Semantic query technology based on structured P2P", Journal of Electronics & Information Technology *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117035064A (en) * 2023-10-10 2023-11-10 北京澜舟科技有限公司 Combined training method for retrieving enhanced language model and storage medium
CN117035064B (en) * 2023-10-10 2024-02-20 北京澜舟科技有限公司 Combined training method for retrieving enhanced language model and storage medium

Also Published As

Publication number Publication date
CN115203378B (en) 2023-01-24

Similar Documents

Publication Publication Date Title
CN107491547B (en) Search method and device based on artificial intelligence
CN107273503B (en) Method and device for generating parallel text in same language
CN107391549B (en) Artificial intelligence based news recall method, device, equipment and storage medium
CN111428010B (en) Man-machine intelligent question-answering method and device
WO2021135455A1 (en) Semantic recall method, apparatus, computer device, and storage medium
CN109858045B (en) Machine translation method and device
JP2022050379A (en) Semantic retrieval method, apparatus, electronic device, storage medium, and computer program product
CN111753551B (en) Information generation method and device based on word vector generation model
CN110795541B (en) Text query method, text query device, electronic equipment and computer readable storage medium
US10095736B2 (en) Using synthetic events to identify complex relation lookups
CN111930894B (en) Long text matching method and device, storage medium and electronic equipment
CN115062134B (en) Knowledge question-answering model training and knowledge question-answering method, device and computer equipment
CN111400584A (en) Association word recommendation method and device, computer equipment and storage medium
CN111444321B (en) Question answering method, device, electronic equipment and storage medium
CN115203378B (en) Retrieval enhancement method, system and storage medium based on pre-training language model
CN112182255A (en) Method and apparatus for storing media files and for retrieving media files
US11361031B2 (en) Dynamic linguistic assessment and measurement
CN114490926A (en) Method and device for determining similar problems, storage medium and terminal
US10229156B2 (en) Using priority scores for iterative precision reduction in structured lookups for questions
CN113343692A (en) Search intention recognition method, model training method, device, medium and equipment
CN112307243A (en) Method and apparatus for retrieving image
CN116957006A (en) Training method, device, equipment, medium and program product of prediction model
CN112328751A (en) Method and device for processing text
CN113761933A (en) Retrieval method, retrieval device, electronic equipment and readable storage medium
CN113011152A (en) Text processing method, device and equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant