CN112988952B - Multi-level-length text vector retrieval method and device and electronic equipment - Google Patents
- Publication number
- CN112988952B (application number CN202110421266.3A)
- Authority
- CN
- China
- Prior art keywords
- text
- search request
- text segment
- level
- segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a multi-level long text vector retrieval method and device, and an electronic device. The method comprises the following steps: segmenting a long text in the open domain into text segments; encoding the text segments and the search request into dense vectors, respectively, using a trained encoder; and querying, based on vector retrieval over the dense vectors of the text segments and the search request, to obtain target text segments similar to the search request; wherein the encoder is trained using a training data set comprising multi-level text segments. By taking into account the multi-level relevance between the text segments in the training data set and the search request, the trained model readily selects the most suitable segment from among several relevant segments, and recall efficiency is significantly improved.
Description
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a multi-level long text vector retrieval method and device, and an electronic device.
Background
Open-domain question answering is an important task in the field of natural language processing. It can be described simply as follows: given a factual question, the system must retrieve, from a large-scale multi-domain document library, the document in which the answer to the question is located, and then extract or generate the answer from it. For the open-domain question-answering task, document retrieval is often the most critical part, and the accuracy of document retrieval sets the upper bound on the overall effectiveness of the system.
At present, common methods for document retrieval in open-domain question answering are based on sparse-matrix or dense-vector retrieval. Sparse-matrix-based retrieval methods generally use TF-IDF or BM25, and typically include the following steps: extracting semantic information from the document, including keyword extraction, named entity recognition, proper noun extraction and the like, to obtain the key information in the document; constructing a plurality of index fields from the document text and the extracted semantic information, a step in which a search engine tool such as Elasticsearch is often used; and, for a new search request, extracting the same semantic information, converting it into a sparse matrix, comparing and scoring it against the documents in the library, and recalling the highest-scoring results. In contrast, dense-vector retrieval methods generally encode documents and search requests into dense vectors using a neural network model, and then perform similarity calculations to recall search results.
Sparse-matrix-based retrieval methods have the following disadvantages. (1) Manual feature engineering is needed, which is a tedious, time-consuming and error-prone process; moreover, the feature-engineering code is written for a particular problem each time, so when a new problem or a new data set is to be handled, the relevant code must be rewritten. (2) It is difficult to resolve word ambiguity in the open domain. For example, for the word "apple", if its contextual information is ignored, it is difficult for the system to determine whether it refers to the fruit or the technology company. (3) Deep semantic understanding is lacking. For example, for the full name of an organization and its common abbreviation (such as the Ministry of Industry and Information Technology), the system cannot automatically discover that the two refer to the same entity, and manual normalization is needed. (4) The room for optimizing the effect is limited. Owing to the technical limitations of manual feature engineering, once the retrieval effect reaches a certain level, it is difficult to improve further. (5) Such methods generalize poorly. Since the indexes in the system are constructed with strong domain-specific attributes, the effect is often poor when search requests from outside the text's domain are encountered.
Dense-vector-based retrieval methods can, to some extent, address many of the shortcomings of sparse-matrix-based methods. In general, a dense-vector retrieval method trains an encoder based on a deep neural network, encodes the documents and the search requests into dense vectors, and obtains relevance scores by performing similarity calculations between the dense vectors of the documents and of the search requests. Currently, the data sets used to train such encoders contain only positive and negative examples: the document segment containing the correct answer to the search request is taken as a positive example, and other document segments (obtained by retrieval, random sampling and the like) are taken as negative examples. The model is optimized with a binary-classification loss function to train the encoder. However, training with positive/negative binary labels has the following problem: during training, a long document containing the answer is cut into a plurality of text segments, some of which contain the answer while others do not contain the answer but are nevertheless semantically related to the search request. Reasonably, a text segment semantically related to the search request should be treated as a positive example, but under the existing method it is trained as a negative example. As a result, the effectiveness of the model is reduced in practical applications. For example, for a request such as "What are the material requirements for the 2021 application of the Beijing Natural Science Foundation project?", the corresponding document may mention "2021 Beijing Natural Science Foundation project" only at its beginning; because the long text is cut into text segments, that first segment is trained as a negative example, thereby reducing the effectiveness of the model.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides the following technical scheme.
One aspect of the invention provides a multi-level long text vector retrieval method, which comprises the following steps:
segmenting a long text in the open domain into text segments;
encoding the text segments and the search request into dense vectors, respectively, using a trained encoder;
querying, based on vector retrieval over the dense vectors of the text segments and the search request, to obtain target text segments similar to the search request;
wherein the encoder is trained using a training data set comprising multi-level text segments.
Preferably, the multi-level text fragment comprises: a text segment containing an answer, a text segment in a document containing an answer that does not contain an answer, a text segment related to a search request, and/or a text segment not related to a search request.
Preferably, the text segment containing the answer and the text segment not containing the answer in the document containing the answer are obtained by segmenting the document containing the answer; the text segment related to the search request is obtained by segmenting the document related to the search request obtained by searching; the text segments irrelevant to the search request are obtained by segmenting the document obtained by random sampling.
Preferably, the encoder is trained with an objective function L (the formula is given in the detailed description), wherein:
q is the search request; p_a is the text segment containing the answer; p_b is a text segment, in the document containing the answer, that does not contain the answer; p_c is a text segment related to the search request; and p_d is a text segment unrelated to the search request;
γ is a preset constant representing the minimum acceptable distance between the relevances of the search request to two levels of text segments.
Preferably, the querying to obtain text segments similar to the search request based on vector retrieval by using the text segments and the dense vector of the search request comprises:
constructing a vector index from the dense vectors of the text segments;
and retrieving the vector index according to the dense vector of the search request to obtain a target text segment similar to the search request.
Preferably, after the target text segments similar to the search request are obtained, the method further includes:
acquiring the highest score of the similarity between the target text segment and the search request;
and screening the target text fragment by using the highest score of the similarity and the minimum acceptable distance of the correlation between the search request and the two hierarchical text fragments to obtain the target hierarchical text fragment.
Preferably, the screening comprises:
and screening the target text segment by using the difference between the highest similarity score and the minimum acceptable distance as a threshold, wherein if the similarity score between the target text segment and the search request reaches the threshold, the target text segment is the target level text segment.
A second aspect of the present invention provides a multi-level long text vector retrieval apparatus, comprising:
the text segmentation module is used for segmenting the long text in the open field into text segments;
the vector coding module is used for coding the text segments and the search requests into dense vectors by utilizing a trained coder, and the coder is obtained by utilizing a training data set comprising multi-level text segments through training;
and the vector retrieval module is used for querying and obtaining a target text segment similar to the search request based on vector retrieval by utilizing the text segment and the dense vector of the search request.
The invention also provides a memory storing a plurality of instructions for implementing the method.
The invention also provides an electronic device comprising a processor and a memory connected with the processor, wherein the memory stores a plurality of instructions which can be loaded and executed by the processor so as to enable the processor to execute the method.
The invention has the following beneficial effects. Embodiments of the invention provide a multi-level long text vector retrieval method and device, and an electronic device, wherein the method comprises: segmenting a long text in the open domain into text segments; encoding the text segments and the search request into dense vectors, respectively, using a trained encoder; and querying, based on vector retrieval over the dense vectors of the text segments and the search request, to obtain target text segments similar to the search request; wherein the encoder is trained using a training data set comprising multi-level text segments. By taking into account the multi-level relevance between the text segments in the training data set and the search request, the trained model readily selects the most suitable segment from among several relevant segments, and recall efficiency is improved.
Drawings
FIG. 1 is a schematic flow chart of the multi-level long text vector retrieval method according to the present invention;
FIG. 2 is a schematic diagram of the multi-level long text vector retrieval apparatus according to the present invention.
Detailed Description
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
The method provided by the invention can be implemented in the following terminal environment, and the terminal can comprise one or more of the following components: a processor, a memory, and a display screen. Wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the methods described in the embodiments described below.
A processor may include one or more processing cores. The processor connects the various parts of the terminal using various interfaces and lines, and performs the various functions of the terminal and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory and invoking the data stored in the memory.
The memory may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory may be used to store instructions, programs, code, code sets, or instruction sets.
The display screen is used for displaying user interfaces of all the application programs.
In addition, those skilled in the art will appreciate that the above-described terminal configurations are not intended to be limiting, and that the terminal may include more or fewer components, or some components may be combined, or a different arrangement of components. For example, the terminal further includes a radio frequency circuit, an input unit, a sensor, an audio circuit, a power supply, and other components, which are not described herein again.
Example one
As shown in fig. 1, an embodiment of the present invention provides a multi-level long text vector retrieval method, comprising:
S101, segmenting a long text in the open domain into text segments;
S102, encoding the text segments and the search request into dense vectors, respectively, using a trained encoder;
S103, querying, based on vector retrieval over the dense vectors of the text segments and the search request, to obtain target text segments similar to the search request;
wherein the encoder is trained using a training data set comprising multi-level text segments.
In practical applications, since a long document often needs to be cut into a plurality of text segments for model training, the relevance between the search request and the text segments is multi-level, not merely a matter of "relevant" and "irrelevant" labels. For example, consider the following four kinds of text segments: a. document segments containing the answer; b. document segments, within documents containing the answer, that do not contain the answer; c. document segments related to the search request; and d. document segments unrelated to the search request. Their relevance to the search request is ordered a > b > c > d. In the prior art, this hierarchical relationship is not considered during model training, and only binary-classification training is performed, so the resulting model has difficulty selecting the most appropriate segment from among several related segments.
In the method provided by the invention, the multilevel correlation between the search request and the text segment is considered, the training data set comprising the multilevel text segment is used for training to obtain the encoder, then the trained encoder is used for encoding the text segment and the search request into dense vectors, and finally the target text segment similar to the search request is obtained through similarity calculation.
The method provided by the invention is based on a deep neural network model and adopts a large-scale pre-training language model to extract deep semantic information. Compared with the method based on manual characteristic engineering in the prior art, the reusability, the search effect, the search performance, the usability and the maintainability of the method are greatly improved.
In addition, because the encoder considers the multi-level relevance of the text segment and the search request in the training process, the obtained model can easily select a proper segment from a plurality of relevant segments.
Step S101 is executed: for a long text in the open domain, to preserve complete semantic information in the segmented text segments, the long text may first be segmented by paragraph. Paragraph texts shorter than the maximum sequence length may be spliced with their context; over-long paragraph texts may be further segmented by sentence to form short sub-paragraphs. Meanwhile, an ID code may be generated for each text segment obtained by segmentation, so that the original text can be restored from the text segment ID codes.
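As an illustrative sketch (not the patent's reference implementation), the paragraph-first, sentence-second segmentation with ID codes might look as follows; the function name, the max_len budget, and the use of the Chinese full stop as the sentence delimiter are assumptions:

```python
def split_long_text(document, max_len=128):
    """Split a document into text segments, each tagged with an ID code
    (its index) so the original text can be restored from segment order."""
    segments = []
    buffer = ""
    for paragraph in document.split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) > max_len:
            # flush any short paragraphs spliced so far
            if buffer:
                segments.append(buffer)
                buffer = ""
            # over-long paragraph: continue splitting by sentence;
            # a single sentence longer than max_len is kept whole here
            sub = ""
            for sentence in paragraph.replace("。", "。\n").split("\n"):
                if sub and len(sub) + len(sentence) > max_len:
                    segments.append(sub)
                    sub = ""
                sub += sentence
            if sub:
                segments.append(sub)
        elif buffer and len(buffer) + len(paragraph) > max_len:
            segments.append(buffer)
            buffer = paragraph
        else:
            # short paragraph: splice with its context
            buffer += paragraph
    if buffer:
        segments.append(buffer)
    return [(i, seg) for i, seg in enumerate(segments)]
```

Concatenating the segments in ID order reproduces the original text, which is the property the ID codes are meant to guarantee.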
In step S102, the encoder may be trained using a training data set comprising multi-level text segments.
The number of levels of the multi-level text segments can be set according to the actual situation.
As an example, four levels of text fragments may include: a text segment containing an answer, a text segment in a document containing an answer that does not contain an answer, a text segment related to a search request, and/or a text segment not related to a search request.
The text segment containing the answer can be obtained by segmenting a document containing the answer; the text segment which does not contain the answer in the document containing the answer can be obtained by segmenting the document containing the answer; the text segment related to the search request can be obtained by segmenting the searched document related to the search request; the text segments irrelevant to the search request can be obtained by segmenting the document obtained by random sampling.
For example, for the search request "What are the material requirements for the 2021 application of the Beijing Natural Science Foundation project?", the four levels of text segments associated with it may be as shown in the following table.
Note: level a is a text fragment containing an answer, level b is a text fragment containing no answer in a document containing an answer, level c is a text fragment related to a search request, and level d is a text fragment unrelated to a search request.
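A minimal sketch of assembling one four-level training example from the sources described above (the answer document, retrieved related documents, and randomly sampled documents); the function name and argument layout are assumptions, not the patent's own code:

```python
import random

def build_training_example(query, answer_doc_segs, answer_seg_ids,
                           retrieved_segs, corpus_segs):
    """Assemble one four-level training tuple (q, p_a, p_b, p_c, p_d).
    answer_doc_segs: segments of the document containing the answer;
    answer_seg_ids: indices of the segments that actually contain the answer;
    retrieved_segs: segments of retrieved, request-related documents;
    corpus_segs: pool of randomly sampled (irrelevant) segments."""
    p_a = answer_doc_segs[random.choice(answer_seg_ids)]            # level a
    no_answer = [s for i, s in enumerate(answer_doc_segs)
                 if i not in answer_seg_ids]
    p_b = random.choice(no_answer)                                  # level b
    p_c = random.choice(retrieved_segs)                             # level c
    p_d = random.choice(corpus_segs)                                # level d
    return (query, p_a, p_b, p_c, p_d)
```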
When training the encoder, a training data set composed of search requests and multi-level text segments is first obtained, for example D = {(q, p_a, p_b, p_c, p_d)}, where p_a, p_b, p_c and p_d denote text segments of the four levels a, b, c and d, respectively. A piece of text t is encoded with the encoder E to obtain its vector representation E(t). For two texts, the similarity is the dot product of their two vectors:

sim(q, p) = E(q) · E(p)

wherein q is the search request; p_a is the text segment containing the answer; p_b is a text segment, in the document containing the answer, that does not contain the answer; p_c is a text segment related to the search request; and p_d is a text segment unrelated to the search request.

γ is a preset constant representing the minimum acceptable distance between the relevances of the search request to two levels of text segments. For a training data set comprising four levels of text segments, γ includes γ_ad, γ_ab, γ_ac, γ_bc, γ_bd and γ_cd: γ_ad represents the minimum acceptable distance between the relevance sim(q, p_a) of the search request to the a-level text segment and the relevance sim(q, p_d) of the search request to the d-level text segment; γ_ab that between sim(q, p_a) and sim(q, p_b); γ_ac that between sim(q, p_a) and sim(q, p_c); γ_bc that between sim(q, p_b) and sim(q, p_c); γ_bd that between sim(q, p_b) and sim(q, p_d); and γ_cd that between sim(q, p_c) and sim(q, p_d). For each such pair of levels (x, y) with x more relevant than y, the optimization objective is that sim(q, p_x) exceed sim(q, p_y) by at least the acceptable distance γ_xy. Summing the corresponding terms over all pairs in sequence yields an objective function of the form

L = Σ_(x,y) max(0, γ_xy − (sim(q, p_x) − sim(q, p_y))).

By optimizing the objective function L and training for a fixed number of iterations, the trained encoder E is obtained.
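The multi-level margin objective described above can be sketched for a single training example in plain Python as follows, assuming dot-product similarity and one hinge term per ordered pair of levels (names are illustrative):

```python
from itertools import combinations

LEVELS = ["a", "b", "c", "d"]  # relevance order: a > b > c > d

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def multilevel_margin_loss(q_vec, seg_vecs, margins):
    """seg_vecs: level -> encoded segment vector; margins: (hi, lo) -> gamma.
    Returns the sum of hinge terms max(0, gamma - (sim_hi - sim_lo))
    over every ordered pair of levels, i.e. zero once every higher level
    outscores every lower level by at least its preset margin."""
    sims = {lvl: dot(q_vec, vec) for lvl, vec in seg_vecs.items()}
    loss = 0.0
    for hi, lo in combinations(LEVELS, 2):
        loss += max(0.0, margins[(hi, lo)] - (sims[hi] - sims[lo]))
    return loss
```

In actual training, this quantity would be computed batch-wise inside a deep-learning framework and minimized by gradient descent; the pure-Python form only makes the structure of the objective explicit.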
According to the training method provided by the invention, the data set can be constructed automatically and carries multi-level labels. Compared with the binary positive/negative-label data sets used by existing methods, the data set constructed by the method retains more relevance information. In addition, the model training method fully considers the relative distance between the relevances of the search request to text segments of different levels, and can bring a significant improvement in effect compared with the existing positive/negative binary-classification training method.
After the encoder is trained, the text segment and the search request obtained by segmenting the long text in the open field are respectively encoded into dense vectors by using the trained encoder.
Step S103 is executed, which may be implemented by the following method:
constructing a vector index from the dense vectors of the text segments;
and retrieving the vector index according to the dense vector of the search request to obtain a target text segment similar to the search request.
One of a plurality of vector index types can be selected, for example flat dot-product (inner-product) indexes or IVF indexes. Vector retrieval may also be performed with various vector retrieval engines, such as Faiss or Milvus.
During vector retrieval, the similarity scores between the dense vector of the search request and the dense vectors of the text segments in the vector index are calculated, and the text segments with the top similarity scores are selected as the target text segments similar to the search request.
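A brute-force sketch of this dot-product scoring and top-k selection; in practice an engine such as Faiss or Milvus would build and search the index, and the helper name here is an assumption:

```python
def top_k_segments(query_vec, index, k=3):
    """index: list of (segment_id, dense_vector) pairs.
    Returns the k segments with the highest dot-product similarity
    to the query, best first."""
    scored = [
        (seg_id, sum(q * x for q, x in zip(query_vec, vec)))
        for seg_id, vec in index
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```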
The obtained target text segments similar to the search request comprise multi-level text segments, and in order to further obtain a certain level text segment in the target text segments, the invention provides the following method:
acquiring the highest score of the similarity between the target text segment and the search request;
and screening the target text fragment by using the highest score of the similarity and the minimum acceptable distance of the correlation between the search request and the two hierarchical text fragments to obtain the target hierarchical text fragment.
In other words, the similarity scores between the target text segments and the search request calculated in the previous step are examined to obtain the highest score. Meanwhile, the minimum acceptable distance of relevance between the search request and two levels of text segments preset during encoder training (γ_ad, γ_ab, γ_ac, γ_bc, γ_bd or γ_cd) is obtained, and the target text segments are then filtered using the highest similarity score and the minimum acceptable distance to obtain the target-level text segments, i.e., the target text segments of a given level.
Wherein, the specific screening process may include:
and screening the target text segment by using the difference between the highest similarity score and the minimum acceptable distance as a threshold, wherein if the similarity score between the target text segment and the search request reaches the threshold, the target text segment is the target level text segment.
As an example, suppose the highest similarity score between the target text segments obtained by vector retrieval and the search request is 0.9. If all text segments containing the answer are to be further obtained from the target text segments, i.e., the a-level text segments are desired, the difference 0.9 − γ_ab may be used as the threshold for filtering the target text segments: if the similarity score between a target text segment and the search request reaches the threshold, that target text segment is a target-level text segment. If all relevant text segments are to be further obtained, i.e., text segments down to the c-level are desired, 0.9 − γ_ad may be used as the threshold: if the similarity score between a target text segment and the search request reaches this threshold, that target text segment is a target-level text segment.
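The margin-based filtering step can be sketched as follows, assuming the threshold is the highest recalled score minus the chosen training margin γ (function and argument names are illustrative):

```python
def filter_by_margin(scored_segments, gamma):
    """scored_segments: list of (segment_id, similarity) from vector search.
    Keep only segments whose score reaches (highest score - gamma), i.e.
    segments within the chosen training margin of the best hit."""
    if not scored_segments:
        return []
    s_max = max(score for _, score in scored_segments)
    threshold = s_max - gamma
    return [(sid, s) for sid, s in scored_segments if s >= threshold]
```

Choosing a smaller γ (e.g. γ_ab) keeps only the top level, while a larger γ (e.g. γ_ad) keeps every level above the irrelevant one; this is what makes the threshold interpretable rather than hand-tuned.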
The method provided by the invention optimizes the distance between the relevances of the search request to text segments of any two levels toward a preset value during model training, which provides a reference standard for text segment filtering in the post-processing of recalled target text segments. Compared with the manually set filtering thresholds of existing methods, the method provided by the invention is more interpretable and more flexible.
Example two
As shown in fig. 2, another aspect of the present invention provides a functional module architecture corresponding fully to the foregoing method flow; that is, an embodiment of the present invention further provides a multi-level long text vector retrieval apparatus, comprising:
a text segmentation module 201, configured to segment a long text in the open domain into text segments;
a vector encoding module 202, configured to encode the text segments and the search request into dense vectors, respectively, using a trained encoder, the encoder being trained using a training data set comprising multi-level text segments;
a vector retrieval module 203, configured to query, based on vector retrieval over the dense vectors of the text segments and the search request, to obtain target text segments similar to the search request.
Wherein, in the vector encoding module, the multi-level text segment comprises: a text segment containing an answer, a text segment in a document containing an answer that does not contain an answer, a text segment related to a search request, and/or a text segment not related to a search request.
Further, the text segment containing the answer is obtained by segmenting the document containing the answer; the text segment which does not contain the answer in the document containing the answer is obtained by segmenting the document containing the answer; the text segment related to the search request is obtained by segmenting the document related to the search request obtained by searching; the text segments irrelevant to the search request are obtained by segmenting the document obtained by random sampling.
In the vector encoding module, the encoder is trained with an objective function L, wherein:
q is the search request; p_a is the text segment containing the answer; p_b is a text segment, in the document containing the answer, that does not contain the answer; p_c is a text segment related to the search request; and p_d is a text segment unrelated to the search request;
γ is a preset constant representing the minimum acceptable distance between the relevances of the search request to two levels of text segments.
Further, the vector retrieval module is specifically configured to:
constructing a vector index from the dense vectors of the text segments;
and retrieving the vector index according to the dense vector of the search request to obtain a target text segment similar to the search request.
The multi-level long text vector retrieval apparatus provided by the embodiment of the invention further comprises a screening module, configured to, after the target text segments similar to the search request are obtained:
acquiring the highest score of the similarity between the target text segment and the search request;
and screening the target text fragment by using the highest score of the similarity and the minimum acceptable distance of the correlation between the search request and the two hierarchical text fragments to obtain the target hierarchical text fragment.
Further, in the screening module, the screening includes:
and screening the target text segment by using the difference between the highest similarity score and the minimum acceptable distance as a threshold, wherein if the similarity score between the target text segment and the search request reaches the threshold, the target text segment is the target level text segment.
The device can be implemented by the multi-level long text vector retrieval method provided in the first embodiment, and specific implementation methods can be referred to the description in the first embodiment and are not described herein again.
The invention also provides a memory storing a plurality of instructions for implementing the method according to the first embodiment.
The invention also provides an electronic device comprising a processor and a memory connected to the processor, wherein the memory stores a plurality of instructions, and the instructions can be loaded and executed by the processor to enable the processor to execute the method according to the first embodiment.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (8)
1. A multi-level long text vector retrieval method is characterized by comprising the following steps:
dividing a long text in an open field into text segments;
respectively encoding the text segments and the search requests into dense vectors by using a trained encoder;
utilizing the text segments and the dense vectors of the search requests, and inquiring to obtain target text segments similar to the search requests based on vector retrieval;
wherein the encoder is trained using a training data set comprising multi-level text segments;
the multi-level text fragment includes: a text segment containing an answer, a text segment in a document containing an answer that does not contain an answer, a text segment related to a search request, and/or a text segment not related to a search request;
wherein q is the search request, p_a is the text segment containing an answer (a-level), p_b is a text segment in the answer-containing document that does not contain the answer (b-level), p_c is a text segment related to the search request (c-level), and p_d is a text segment not related to the search request (d-level);
m is a preset constant representing the minimum acceptable distance between the relevances of the search request to text segments of two levels; for a training data set comprising four levels of text segments, m includes m_ad, m_ab, m_ac, m_bc, m_bd and m_cd, where each m_xy represents the minimum acceptable distance between the relevance s(q, p_x) of the search request to the x-level text segment and the relevance s(q, p_y) of the search request to the y-level text segment.
2. The multi-level long text vector retrieval method according to claim 1, wherein the text segment containing an answer and the text segment not containing the answer in the answer-containing document are obtained by segmenting the document containing the answer; the text segment related to the search request is obtained by segmenting a document related to the search request obtained by retrieval; and the text segment not related to the search request is obtained by segmenting a randomly sampled document.
3. The method of multi-level long text vector retrieval of claim 1, wherein said querying for text segments similar to a search request based on vector retrieval using text segments and dense vectors of the search request comprises:
constructing a vector index from the dense vectors of the text segments;
and retrieving the vector index according to the dense vector of the search request to obtain a target text segment similar to the search request.
4. The multi-level long text vector retrieval method according to claim 1, wherein after obtaining the target text segment similar to the search request, the method further comprises:
acquiring the highest similarity score between the target text segments and the search request;
and screening the target text segments by using the highest similarity score and the minimum acceptable distance between the relevances of the search request to text segments of two levels, to obtain the target-level text segment.
5. The multi-level long text vector retrieval method of claim 4, wherein the filtering comprises:
and screening the target text segments by using the difference between the highest similarity score and the minimum acceptable distance as a threshold, wherein a target text segment whose similarity score with the search request reaches the threshold is taken as the target-level text segment.
6. A multi-level long text vector retrieval apparatus, comprising:
the text segmentation module is used for segmenting the long text in the open field into text segments;
a vector encoding module for respectively encoding the text segments and the search request into dense vectors using a trained encoder, wherein the encoder is trained using a training data set comprising multi-level text segments; the multi-level text segments include: a text segment containing an answer, a text segment in an answer-containing document that does not contain the answer, a text segment related to the search request, and/or a text segment not related to the search request; and the objective function for training the encoder is:
wherein q is the search request, p_a is the text segment containing an answer (a-level), p_b is a text segment in the answer-containing document that does not contain the answer (b-level), p_c is a text segment related to the search request (c-level), and p_d is a text segment not related to the search request (d-level);
m is a preset constant representing the minimum acceptable distance between the relevances of the search request to text segments of two levels; for a training data set comprising four levels of text segments, m includes m_ad, m_ab, m_ac, m_bc, m_bd and m_cd, where each m_xy represents the minimum acceptable distance between the relevance s(q, p_x) of the search request to the x-level text segment and the relevance s(q, p_y) of the search request to the y-level text segment;
and the vector retrieval module is used for querying and obtaining a target text segment similar to the search request based on vector retrieval by utilizing the text segment and the dense vector of the search request.
7. A memory storing a plurality of instructions for implementing the method of any one of claims 1-5.
8. An electronic device comprising a processor and a memory coupled to the processor, the memory storing a plurality of instructions that are loadable and executable by the processor to enable the processor to perform the method according to any of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110421266.3A CN112988952B (en) | 2021-04-20 | 2021-04-20 | Multi-level-length text vector retrieval method and device and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112988952A CN112988952A (en) | 2021-06-18 |
CN112988952B true CN112988952B (en) | 2021-08-24 |
Family
ID=76341126
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110421266.3A Active CN112988952B (en) | 2021-04-20 | 2021-04-20 | Multi-level-length text vector retrieval method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112988952B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111881264A (en) * | 2020-09-28 | 2020-11-03 | 北京智源人工智能研究院 | Method and electronic equipment for searching long text in question-answering task in open field |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107491518B (en) * | 2017-08-15 | 2020-08-04 | 北京百度网讯科技有限公司 | Search recall method and device, server and storage medium |
CN107491547B (en) * | 2017-08-28 | 2020-11-10 | 北京百度网讯科技有限公司 | Search method and device based on artificial intelligence |
- 2021-04-20 CN CN202110421266.3A patent/CN112988952B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN112988952A (en) | 2021-06-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11222167B2 (en) | Generating structured text summaries of digital documents using interactive collaboration | |
CN110442760B (en) | Synonym mining method and device for question-answer retrieval system | |
CN106649818B (en) | Application search intention identification method and device, application search method and server | |
CN107463607B (en) | Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning | |
CN111027327A (en) | Machine reading understanding method, device, storage medium and device | |
CN110688854B (en) | Named entity recognition method, device and computer readable storage medium | |
CN112818093B (en) | Evidence document retrieval method, system and storage medium based on semantic matching | |
CN112256860A (en) | Semantic retrieval method, system, equipment and storage medium for customer service conversation content | |
CN112100326B (en) | Anti-interference question and answer method and system integrating retrieval and machine reading understanding | |
CN111881264B (en) | Method and electronic equipment for searching long text in question-answering task in open field | |
CN112328800A (en) | System and method for automatically generating programming specification question answers | |
CN112036184A (en) | Entity identification method, device, computer device and storage medium based on BilSTM network model and CRF model | |
CN108491407B (en) | Code retrieval-oriented query expansion method | |
CN110738059A (en) | text similarity calculation method and system | |
CN112199958A (en) | Concept word sequence generation method and device, computer equipment and storage medium | |
CN112988952B (en) | Multi-level-length text vector retrieval method and device and electronic equipment | |
CN110705285A (en) | Government affair text subject word bank construction method, device, server and readable storage medium | |
CN114385819B (en) | Environment judicial domain ontology construction method and device and related equipment | |
CN115617954A (en) | Question answering method and device, electronic equipment and storage medium | |
CN112949293A (en) | Similar text generation method, similar text generation device and intelligent equipment | |
CN112036183A (en) | Word segmentation method and device based on BilSTM network model and CRF model, computer device and computer storage medium | |
Qu | English-Chinese name transliteration by latent analogy | |
CN116842168B (en) | Cross-domain problem processing method and device, electronic equipment and storage medium | |
CN115437620B (en) | Natural language programming method, device, equipment and storage medium | |
KR102541806B1 (en) | Method, system, and computer readable record medium for ranking reformulated query |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||