CN112988952B - Multi-level-length text vector retrieval method and device and electronic equipment - Google Patents

Multi-level-length text vector retrieval method and device and electronic equipment

Info

Publication number
CN112988952B
Authority
CN
China
Prior art keywords
text
search request
text segment
level
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110421266.3A
Other languages
Chinese (zh)
Other versions
CN112988952A (en)
Inventor
钱泓锦
刘占亮
窦志成
文继荣
曹岗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhiyuan Artificial Intelligence Research Institute
Original Assignee
Beijing Zhiyuan Artificial Intelligence Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhiyuan Artificial Intelligence Research Institute filed Critical Beijing Zhiyuan Artificial Intelligence Research Institute
Priority to CN202110421266.3A priority Critical patent/CN112988952B/en
Publication of CN112988952A publication Critical patent/CN112988952A/en
Application granted granted Critical
Publication of CN112988952B publication Critical patent/CN112988952B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/31 Indexing; Data structures therefor; Storage structures
    • G06F 16/316 Indexing structures
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-level long-text vector retrieval method, apparatus and electronic equipment. The method comprises the following steps: dividing a long text in the open domain into text segments; respectively encoding the text segments and the search request into dense vectors by using a trained encoder; and using the dense vectors of the text segments and of the search request to query, based on vector retrieval, for target text segments similar to the search request; wherein the encoder is trained using a training data set comprising multi-level text segments. By considering the multi-level relevance between the text segments in the training data set and the search request, the obtained model can readily select the appropriate segments from a plurality of relevant segments, and recall efficiency is significantly improved.

Description

Multi-level-length text vector retrieval method and device and electronic equipment
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a multi-level long-text vector retrieval method and apparatus and electronic equipment.
Background
Open-domain question answering is an important task in the field of natural language processing. It can be described simply as: given a factual question, the system must retrieve, from a large-scale multi-domain document library, the document in which the answer to the question is located, and then extract or generate the answer from it. For the open-domain question-answering task, document retrieval is often the most important part, and its accuracy determines the upper bound of the system's overall effectiveness.
At present, the common methods for document retrieval in open-domain question answering are based on sparse-matrix or dense-vector retrieval. Sparse-matrix-based retrieval methods generally use TF-IDF or BM25 and typically comprise the following steps: extracting semantic information from the document, including keyword extraction, named-entity recognition, proper-noun extraction and the like, to obtain the key information in the document; constructing a plurality of index fields from the document text and the extracted semantic information, a step for which a search-engine tool such as Elasticsearch is often used; and, for a new search request, extracting the same semantic information, converting it into a sparse representation, comparing and scoring it against the documents in the library, and recalling the highest-scoring results. In contrast, dense-vector retrieval methods generally encode documents and search requests into dense vectors with a neural network model and then perform similarity calculation to recall search results.
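As background, the BM25 scoring mentioned above can be sketched as follows. This is a minimal illustrative implementation, not part of the patent; the parameters k1 and b are conventional defaults, and the tokenized inputs are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n                          # average doc length
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = k1 * (1 - b + b * len(d) / avgdl)               # length normalization
            s += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
        scores.append(s)
    return scores
```

Such sparse scoring matches only exact terms, which is precisely why the disambiguation and normalization problems listed below arise.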
The sparse-matrix-based retrieval method has the following disadvantages: (1) manual feature engineering is needed, which is a tedious, time-consuming and error-prone process; moreover, the code written for manual feature engineering is specific to a particular problem, and when a new problem or a new data set is to be handled, the relevant code must be rewritten. (2) It is difficult to resolve word ambiguity in the open domain. For example, for the word "apple", if its contextual information is ignored, the system can hardly determine whether it refers to the fruit or to the technology company. (3) Deep understanding of semantics is lacking. For example, for two different expressions of the same institution (such as an abbreviation and the full name "Ministry of Industry and Information Technology"), the system cannot automatically discover that they refer to the same entity, and manual normalization is needed. (4) The room for effect optimization is limited: owing to the technical limitations of manual feature engineering, once the retrieval effect reaches a certain level it is difficult to optimize further. (5) Generalization is poor: since the indexes in the system are constructed with strong domain attributes, the effect is often poor when a search request outside the text's domain is encountered.
Dense-vector-based retrieval methods can, to some extent, address many of the shortcomings of sparse-matrix-based methods. In general, a dense-vector retrieval method trains an encoder E based on a deep neural network, encodes the documents and the search requests into dense vectors, and obtains relevance scores by performing similarity calculations on those vectors. Currently, the data set used to train the encoder E contains only positive and negative examples: the document segment containing the correct answer to the search request is taken as a positive example, and other document segments (obtained by retrieval, random sampling and the like) are taken as negative examples. The model is optimized with a binary classification loss function to obtain the trained encoder E. However, training with positive/negative binary labels has the following problem: during training, a long document containing the answer is cut into a plurality of text segments, some of which contain the answer while others do not contain the answer but are still semantically related to the search request. In practice, a text segment semantically related to the search request should be treated as a positive example, but under the existing method it is trained as a negative example, so the model's effectiveness is reduced in real applications. For example, for a request such as "What are the material requirements for the 2021 application of the Beijing Natural Science Foundation project?", the corresponding document may mention "Beijing Natural Science Foundation project, 2021" only at the beginning; because the long text is cut into text segments, that first segment is trained as a negative example, thereby reducing the effectiveness of the model.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides the following technical solutions.
A first aspect of the invention provides a multi-level long-text vector retrieval method, comprising the following steps:
dividing a long text in an open field into text segments;
respectively encoding the text segments and the search requests into dense vectors by using a trained encoder;
utilizing the text segments and the dense vectors of the search requests, and inquiring to obtain target text segments similar to the search requests based on vector retrieval;
wherein the encoder is trained using a training data set comprising multi-level text segments.
Preferably, the multi-level text fragment comprises: a text segment containing an answer, a text segment in a document containing an answer that does not contain an answer, a text segment related to a search request, and/or a text segment not related to a search request.
Preferably, the text segment containing the answer and the text segment not containing the answer in the document containing the answer are obtained by segmenting the document containing the answer; the text segment related to the search request is obtained by segmenting the document related to the search request obtained by searching; the text segments irrelevant to the search request are obtained by segmenting the document obtained by random sampling.
Preferably, the objective function L for training the encoder is:

L = Σ max(0, λ_xy - (sim(q, x) - sim(q, y))), the sum being taken over the level pairs (x, y) ∈ {(a, b), (a, c), (a, d), (b, c), (b, d), (c, d)},

wherein q is the search request, a is the text segment containing the answer, b is a text segment in the answer-bearing document that does not contain the answer, c is a text segment related to the search request, and d is a text segment unrelated to the search request; sim(q, ·) is the correlation between the search request and a text segment, computed from the encoded representations of a, b, c and d; and each λ_xy is a preset constant representing the minimum acceptable distance between the correlations of the search request with two hierarchy levels.
Preferably, the querying to obtain text segments similar to the search request based on vector retrieval by using the text segments and the dense vector of the search request comprises:
constructing a vector index from the dense vectors of the text segments;
and retrieving the vector index according to the dense vector of the search request to obtain a target text segment similar to the search request.
Preferably, after the target text segments similar to the search request are obtained, the method further includes:
acquiring the highest score of the similarity between the target text segment and the search request;
and screening the target text fragment by using the highest score of the similarity and the minimum acceptable distance of the correlation between the search request and the two hierarchical text fragments to obtain the target hierarchical text fragment.
Preferably, the screening comprises:
and screening the target text segment by using the difference between the highest similarity score and the minimum acceptable distance as a threshold, wherein if the similarity score between the target text segment and the search request reaches the threshold, the target text segment is the target level text segment.
A second aspect of the present invention provides a multi-level long-text vector retrieval apparatus, comprising:
the text segmentation module is used for segmenting the long text in the open field into text segments;
the vector coding module is used for coding the text segments and the search requests into dense vectors by utilizing a trained coder, and the coder is obtained by utilizing a training data set comprising multi-level text segments through training;
and the vector retrieval module is used for querying and obtaining a target text segment similar to the search request based on vector retrieval by utilizing the text segment and the dense vector of the search request.
The invention also provides a memory storing a plurality of instructions for implementing the method.
The invention also provides an electronic device comprising a processor and a memory connected with the processor, wherein the memory stores a plurality of instructions which can be loaded and executed by the processor so as to enable the processor to execute the method.
The invention has the beneficial effects that: the embodiments of the invention provide a multi-level long-text vector retrieval method and apparatus and electronic equipment, wherein the method comprises: dividing a long text in the open domain into text segments; respectively encoding the text segments and the search request into dense vectors by using a trained encoder; and using the dense vectors of the text segments and of the search request to query, based on vector retrieval, for target text segments similar to the search request; wherein the encoder is trained using a training data set comprising multi-level text segments. By considering the multi-level relevance between the text segments in the training data set and the search request, the obtained model can readily select the appropriate segments from a plurality of relevant segments, and recall efficiency is improved.
Drawings
FIG. 1 is a schematic flow chart of the multi-level long-text vector retrieval method according to the present invention;
FIG. 2 is a schematic diagram of the multi-level long-text vector retrieval apparatus according to the present invention.
Detailed Description
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
The method provided by the invention can be implemented in the following terminal environment, and the terminal can comprise one or more of the following components: a processor, a memory, and a display screen. Wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the methods described in the embodiments described below.
A processor may include one or more processing cores. The processor connects various parts within the overall terminal using various interfaces and lines, performs various functions of the terminal and processes data by executing or executing instructions, programs, code sets, or instruction sets stored in the memory, and calling data stored in the memory.
The memory may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory may be used to store instructions, programs, code sets, or instruction sets.
The display screen is used for displaying user interfaces of all the application programs.
In addition, those skilled in the art will appreciate that the above-described terminal configurations are not intended to be limiting, and that the terminal may include more or fewer components, or some components may be combined, or a different arrangement of components. For example, the terminal further includes a radio frequency circuit, an input unit, a sensor, an audio circuit, a power supply, and other components, which are not described herein again.
Example one
As shown in FIG. 1, an embodiment of the present invention provides a multi-level long-text vector retrieval method, including:
s101, dividing a long text in an open field into text segments;
s102, respectively encoding the text segments and the search requests into dense vectors by using a trained encoder;
s103, by utilizing the text segments and the dense vectors of the search requests, based on vector retrieval, inquiring to obtain target text segments similar to the search requests;
wherein the encoder is trained using a training data set comprising multi-level text segments.
In practical applications, since a long document often must be cut into a plurality of text segments for model training, the relevance between the search request and the text segments is multi-level rather than a simple relevant/irrelevant binary label. For example, consider the following four kinds of text snippets: a. document segments containing the answer; b. document segments in the answer-bearing document that do not contain the answer; c. document segments relevant to the search request; and d. document segments irrelevant to the search request. Their relevance to the search request is ranked a > b > c > d. In the prior art, this hierarchical relationship is not considered during model training, and only binary-classification training is carried out, so that the resulting model has difficulty selecting the most appropriate segment from among a plurality of relevant segments.
In the method provided by the invention, the multilevel correlation between the search request and the text segment is considered, the training data set comprising the multilevel text segment is used for training to obtain the encoder, then the trained encoder is used for encoding the text segment and the search request into dense vectors, and finally the target text segment similar to the search request is obtained through similarity calculation.
The method provided by the invention is based on a deep neural network model and adopts a large-scale pre-trained language model to extract deep semantic information. Compared with the prior-art methods based on manual feature engineering, its reusability, retrieval effect, retrieval performance, usability and maintainability are greatly improved.
In addition, because the encoder considers the multi-level relevance of the text segment and the search request in the training process, the obtained model can easily select a proper segment from a plurality of relevant segments.
Step S101 is executed. For a long text in the open domain, to ensure that the segmented text segments carry complete semantic information, the long text may first be split by paragraphs. Paragraph texts shorter than the maximum sequence length may be spliced with their context; paragraph texts that are too long may be further split by sentences into short sub-paragraphs. Meanwhile, an ID code may be generated for each text segment obtained by segmentation, so that the original text can be restored from the text segments' ID codes.
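A minimal sketch of this segmentation step; the concrete splitting rules and the `docid-index` ID scheme are illustrative assumptions, not the patent's exact procedure:

```python
def segment_long_text(doc_id, text, max_len=200):
    """Split a long document into text segments: first by paragraph, then by
    sentence for overlong paragraphs, splicing short neighbors for context.
    Each segment gets an ID encoding its source document and position so the
    original text can be restored by concatenation in ID order."""
    paragraphs = [p for p in text.split("\n") if p]
    pieces = []
    for p in paragraphs:
        if len(p) <= max_len:
            pieces.append(p)
        else:
            # overlong paragraph: fall back to sentence-level splitting
            buf = ""
            for sent in p.replace("。", "。\x00").split("\x00"):
                if buf and len(buf) + len(sent) > max_len:
                    pieces.append(buf)
                    buf = ""
                buf += sent
            if buf:
                pieces.append(buf)
    # splice adjacent short pieces so each segment keeps fuller context
    merged, buf = [], ""
    for piece in pieces:
        if buf and len(buf) + len(piece) > max_len:
            merged.append(buf)
            buf = ""
        buf += piece
    if buf:
        merged.append(buf)
    return {f"{doc_id}-{i}": seg for i, seg in enumerate(merged)}
```

Concatenating the returned segments in ID order restores the original text (minus the paragraph breaks that were consumed during splicing).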
In step S102, the encoder may be trained using a training data set comprising multi-level text segments.
The number of levels of the multi-level text segments can be set according to the actual situation.
As an example, four levels of text fragments may include: a text segment containing an answer, a text segment in a document containing an answer that does not contain an answer, a text segment related to a search request, and/or a text segment not related to a search request.
The text segment containing the answer can be obtained by segmenting a document containing the answer; the text segment which does not contain the answer in the document containing the answer can be obtained by segmenting the document containing the answer; the text segment related to the search request can be obtained by segmenting the searched document related to the search request; the text segments irrelevant to the search request can be obtained by segmenting the document obtained by random sampling.
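As a hedged sketch, one way to assemble the four-level training tuples described above is shown below. The function and argument names are hypothetical; a real pipeline would obtain the c-level documents from a retrieval run and would use the paragraph/sentence segmentation of step S101 rather than this naive fixed-length splitter (which can cut an answer across a boundary).

```python
import random

def split_segments(doc, seg_len=100):
    """Naive fixed-length splitter standing in for the segmentation of S101."""
    return [doc[i:i + seg_len] for i in range(0, len(doc), seg_len)]

def build_example(query, answer, answer_doc, related_docs, corpus, seg_len=100):
    """Return one (q, a_list, b_list, c_list, d_list) training tuple."""
    segs = split_segments(answer_doc, seg_len)
    a = [s for s in segs if answer in s]        # level a: contains the answer
    b = [s for s in segs if answer not in s]    # level b: same document, no answer
    c = [s for doc in related_docs              # level c: retrieved related docs
         for s in split_segments(doc, seg_len)]
    d = split_segments(random.choice(corpus), seg_len)  # level d: random sample
    return query, a, b, c, d
```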
For example, for the search request "What are the material requirements for the 2021 application of the Beijing Natural Science Foundation project?", the text fragments of the four levels associated with it may be as shown in the following table.

[Table: example text fragments at levels a, b, c and d for this search request]

Note: level a is a text fragment containing the answer, level b is a text fragment in the answer-bearing document that does not contain the answer, level c is a text fragment related to the search request, and level d is a text fragment unrelated to the search request.
When training the encoder, a training data set composed of search requests and multi-level text segments is first obtained, for example

D = {(q_i, a_i, b_i, c_i, d_i)}, i = 1, ..., N,

wherein a_i, b_i, c_i and d_i respectively denote text segments of the four levels a, b, c and d. For one piece of data (q, a, b, c, d), the encoder E encodes each text to obtain its vector representation. For two texts x and y, the similarity is the dot product of their two vectors:

sim(x, y) = E(x) · E(y)

The objective function L for training the encoder is:

L = Σ max(0, λ_xy - (sim(q, x) - sim(q, y))), the sum being taken over the level pairs (x, y) ∈ {(a, b), (a, c), (a, d), (b, c), (b, d), (c, d)},
wherein q is the search request, a is the text segment containing the answer, b is a text segment in the answer-bearing document that does not contain the answer, c is a text segment related to the search request, and d is a text segment unrelated to the search request; sim(q, ·) is the correlation between the search request and a text segment; and each λ_xy is a preset constant representing the minimum acceptable distance between the correlations of the search request with two hierarchy levels. For a training data set comprising four levels of text segments, the constants include λ_ab, λ_ac, λ_ad, λ_bc, λ_bd and λ_cd:

λ_ad represents the minimum acceptable distance between the relevance sim(q, a) of the search request to an a-level text segment and the relevance sim(q, d) of the search request to a d-level text segment;

λ_ab represents the minimum acceptable distance between the relevance sim(q, a) to an a-level text segment and the relevance sim(q, b) to a b-level text segment;

λ_ac represents the minimum acceptable distance between the relevance sim(q, a) to an a-level text segment and the relevance sim(q, c) to a c-level text segment;

λ_bc represents the minimum acceptable distance between the relevance sim(q, b) to a b-level text segment and the relevance sim(q, c) to a c-level text segment;

λ_bd represents the minimum acceptable distance between the relevance sim(q, b) to a b-level text segment and the relevance sim(q, d) to a d-level text segment;

λ_cd represents the minimum acceptable distance between the relevance sim(q, c) to a c-level text segment and the relevance sim(q, d) to a d-level text segment.

For example, given the preset distance λ_ad, the optimization objective requires the distance between sim(q, a) and sim(q, d) to be at least λ_ad. By arranging the corresponding hinge terms for all pairs in sequence, the objective function L is obtained. By optimizing the objective function L and training for a fixed number of iterations, the trained encoder E is obtained.
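The objective described above can be sketched in code as follows. The λ values below are illustrative placeholders, not values from the patent:

```python
# Preset minimum acceptable distances lambda_xy; the values are illustrative.
LAMBDAS = {
    ("a", "b"): 0.1, ("a", "c"): 0.2, ("a", "d"): 0.4,
    ("b", "c"): 0.1, ("b", "d"): 0.3, ("c", "d"): 0.2,
}

def multi_level_loss(sim):
    """Sum of hinge terms over all level pairs (x more relevant than y).

    sim maps a level name ("a".."d") to sim(q, segment) for one training
    example; each pair contributes max(0, lambda_xy - (sim[x] - sim[y])).
    """
    return sum(max(0.0, lam - (sim[x] - sim[y]))
               for (x, y), lam in LAMBDAS.items())
```

When every pair of levels is separated by at least its margin the loss is zero; when all levels score equally the loss equals the sum of the margins.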
According to the training method provided by the invention, the data set can be constructed automatically and carries multi-level labels. Compared with the positive/negative-label data sets used by existing methods, the data set constructed by the method retains more relevance information. In addition, the model training method fully considers the relative distance between the correlations of the search request with text segments of different levels, and can bring a remarkable improvement over the existing positive/negative binary-classification training method.
After the encoder is trained, the text segment and the search request obtained by segmenting the long text in the open field are respectively encoded into dense vectors by using the trained encoder.
Step S103 is executed, which may be implemented by the following method:
constructing a vector index from the dense vectors of the text segments;
and retrieving the vector index according to the dense vector of the search request to obtain a target text segment similar to the search request.
One of a plurality of vector index types can be selected, for example flat inner-product indexes, IVF indexes (IVFx) and the like. Vector retrieval may also be performed using various vector retrieval engines, such as Faiss, Milvus and the like.
In the vector retrieval, similarity scores between the dense vector of the search request and the dense vectors of the text segments in the vector index are calculated, and the text segments with the top similarity scores are selected as the target text segments similar to the search request.
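A minimal sketch of the index-and-search step. The brute-force class below imitates the interface of an exact inner-product index (in the spirit of Faiss's flat index) rather than wrapping a real engine:

```python
import numpy as np

class FlatIPIndex:
    """Brute-force exact inner-product index over dense segment vectors."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)

    def add(self, x):
        """Add a batch of segment vectors, shape (n, dim)."""
        self.vectors = np.vstack([self.vectors, np.asarray(x, dtype=np.float32)])

    def search(self, query_vec, k):
        """Return (scores, ids) of the top-k segments by inner product."""
        scores = self.vectors @ np.asarray(query_vec, dtype=np.float32)
        top = np.argsort(-scores)[:k]
        return scores[top], top
```

In production the same two calls (add at index-construction time, search at query time) would go to a dedicated engine so that the scan scales beyond brute force.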
The obtained target text segments similar to the search request comprise text segments of multiple levels. In order to further obtain the text segments of a particular level among the target text segments, the invention provides the following method:
acquiring the highest score of the similarity between the target text segment and the search request;
and screening the target text fragment by using the highest score of the similarity and the minimum acceptable distance of the correlation between the search request and the two hierarchical text fragments to obtain the target hierarchical text fragment.
In other words, the similarity scores between the target text segments and the search request calculated in the previous step are examined to obtain the highest score. Meanwhile, the minimum acceptable distance between the correlations of the search request with two hierarchy levels, preset during encoder training (one of λ_ab, λ_ac, λ_ad, λ_bc, λ_bd and λ_cd), is obtained, and the target text segments are then screened using the highest similarity score and this minimum acceptable distance, so as to obtain the target-level text segments, that is, the target text segments of a particular level.
The specific screening process may include:
and screening the target text segment by using the difference between the highest similarity score and the minimum acceptable distance as a threshold, wherein if the similarity score between the target text segment and the search request reaches the threshold, the target text segment is the target level text segment.
As an example, suppose the highest similarity score between the target text segments obtained by vector retrieval and the search request is 0.9. If all text segments containing the answer are to be further selected from the target text segments, that is, the a-level text segments are desired, the highest score minus the corresponding minimum acceptable distance (for example 0.9 - λ_ab) may be used as the threshold for screening the target text segments: if the similarity score between a target text segment and the search request reaches the threshold, that target text segment is a target-level text segment. If all relevant text segments are to be further selected, that is, text segments down to the c level are desired, a larger preset distance may be subtracted instead (for example 0.9 - λ_ad may be used as the threshold), and likewise any target text segment whose similarity score reaches the threshold is a target-level text segment.
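The screening rule just described can be sketched as follows; the lam argument is whichever preset minimum acceptable distance matches the desired level (e.g. λ_ab to keep only answer-bearing segments):

```python
def filter_by_level(results, lam):
    """Keep the target segments whose similarity score reaches the threshold,
    i.e. the highest score minus the preset minimum acceptable distance lam."""
    best = max(score for _, score in results)
    threshold = best - lam
    return [(seg_id, score) for seg_id, score in results if score >= threshold]
```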
During model training, the method provided by the invention optimizes the distance between the correlations of the search request with text segments of any two levels toward a preset value, which provides a reference standard for text-segment filtering in the post-processing after recalling the target text segments. Compared with manually setting a filtering threshold as in existing methods, the method provided by the invention is more interpretable and more flexible.
Embodiment Two
As shown in fig. 2, another aspect of the present invention provides a functional module architecture fully corresponding to the foregoing method flow; that is, an embodiment of the present invention further provides a multi-level long text vector retrieval device, including:
the text segmentation module 201 is configured to segment a long text in an open field into text segments;
a vector encoding module 202, configured to encode the text segments and the search request into dense vectors respectively by using a trained encoder, where the encoder is obtained by using a training data set including multi-level text segments through training;
and the vector retrieval module 203 is used for utilizing the text segments and the dense vectors of the search requests, and querying to obtain target text segments similar to the search requests based on vector retrieval.
Wherein, in the vector encoding module, the multi-level text segment comprises: a text segment containing an answer, a text segment in a document containing an answer that does not contain an answer, a text segment related to a search request, and/or a text segment not related to a search request.
Further, the text segment containing the answer is obtained by segmenting the document containing the answer; the text segment which does not contain the answer in the document containing the answer is obtained by segmenting the document containing the answer; the text segment related to the search request is obtained by segmenting the document related to the search request obtained by searching; the text segments irrelevant to the search request are obtained by segmenting the document obtained by random sampling.
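The four-level construction just described might be assembled as in the following sketch. The segmentation strategy, function names and document sources are placeholders of my own, since the patent does not fix them:

```python
import random

def split_into_segments(document, size=128):
    """Naive fixed-length segmentation; the patent does not fix a strategy."""
    return [document[i:i + size] for i in range(0, len(document), size)]

def build_training_example(query, answer, answer_doc, related_doc, corpus):
    """Assemble one (query, s_a, s_b, s_c, s_d) four-level training tuple.
    Assumes the answer string falls entirely inside one segment of answer_doc
    and that answer_doc also yields at least one answer-free segment."""
    segments = split_into_segments(answer_doc)
    s_a = next(s for s in segments if answer in s)      # a-level: contains the answer
    s_b = next(s for s in segments if answer not in s)  # b-level: same doc, no answer
    s_c = random.choice(split_into_segments(related_doc))            # c-level: related doc
    s_d = random.choice(split_into_segments(random.choice(corpus)))  # d-level: random doc
    return query, s_a, s_b, s_c, s_d
```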
In the vector encoding module, the objective function L used to train the encoder is:
L = Σ max(0, ε_xy − (sim(q, s_x) − sim(q, s_y))), summed over the level pairs (x, y) ∈ {(a, d), (a, b), (a, c), (b, c), (b, d), (c, d)},
wherein q is the search request; s_a is the text segment containing the answer; s_b is the text segment in the document containing the answer that does not contain the answer; s_c is the text segment related to the search request; s_d is the text segment not related to the search request; sim(q, s) is the correlation between the search request and a text segment, s representing any of s_a, s_b, s_c and s_d; and ε_xy is a preset constant representing the minimum acceptable distance between the correlations of the search request with the two hierarchical text segments.
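One natural reading of the objective just described, with a preset minimum acceptable distance between the relevances of any two levels, is a pairwise hinge loss. The sketch below is my assumption of that form, not the patent's verbatim formula:

```python
# Level ordering: a (contains answer) > b (same doc, no answer)
# > c (related) > d (unrelated).
PAIRS = [("a", "d"), ("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d")]

def objective(sim, eps):
    """Pairwise hinge loss: each term vanishes once sim(q, s_x) exceeds
    sim(q, s_y) by at least the margin eps[(x, y)]."""
    return sum(max(0.0, eps[(x, y)] - (sim[x] - sim[y])) for x, y in PAIRS)

# With well-separated relevance scores, every margin of 0.15 is satisfied
# and the loss is zero; narrowing any gap below its margin adds a penalty.
sim = {"a": 0.9, "b": 0.7, "c": 0.5, "d": 0.1}
loss = objective(sim, {pair: 0.15 for pair in PAIRS})
```

Because each term pushes the inter-level score gap toward its preset ε, the trained score gaps later serve as the screening thresholds described in Embodiment One.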
Further, the vector retrieval module is specifically configured to:
constructing a vector index from the dense vectors of the text segments;
and retrieving the vector index according to the dense vector of the search request to obtain a target text segment similar to the search request.
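The two retrieval steps above can be sketched with a brute-force inner-product index. A dedicated library such as FAISS would be used at scale, but the patent does not mandate one, so the class below is a minimal stand-in with invented names and toy vectors:

```python
import numpy as np

class DenseIndex:
    """Brute-force stand-in for a vector index over dense segment vectors."""

    def __init__(self, segment_vectors):
        # Step 1: construct the vector index from the text segments' dense vectors.
        self.vectors = np.asarray(segment_vectors, dtype="float32")

    def search(self, query_vector, k):
        # Step 2: score every segment against the search request's dense
        # vector (inner product) and return the top-k target text segments.
        scores = self.vectors @ np.asarray(query_vector, dtype="float32")
        top = np.argsort(-scores)[:k]
        return scores[top], top

index = DenseIndex([[0.1, 0.9, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])
scores, ids = index.search([0.0, 1.0, 0.0], k=2)  # ids of the 2 most similar segments
```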
The multi-level long text vector retrieval device provided by the embodiment of the invention further comprises a screening module, configured to, after the target text segment similar to the search request is obtained:
acquiring the highest score of the similarity between the target text segment and the search request;
and screening the target text fragment by using the highest score of the similarity and the minimum acceptable distance of the correlation between the search request and the two hierarchical text fragments to obtain the target hierarchical text fragment.
Further, in the screening module, the screening includes:
and screening the target text segment by using the difference between the highest similarity score and the minimum acceptable distance as a threshold, wherein if the similarity score between the target text segment and the search request reaches the threshold, the target text segment is the target level text segment.
The device can be implemented by the multi-level long text vector retrieval method provided in the first embodiment; for specific implementation, reference may be made to the description in the first embodiment, which is not repeated herein.
The invention also provides a memory storing a plurality of instructions for implementing the method according to the first embodiment.
The invention also provides an electronic device comprising a processor and a memory connected to the processor, wherein the memory stores a plurality of instructions, and the instructions can be loaded and executed by the processor to enable the processor to execute the method according to the first embodiment.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (8)

1. A multi-level long text vector retrieval method is characterized by comprising the following steps:
dividing a long text in an open field into text segments;
respectively encoding the text segments and the search requests into dense vectors by using a trained encoder;
utilizing the text segments and the dense vectors of the search requests, and inquiring to obtain target text segments similar to the search requests based on vector retrieval;
wherein the encoder is trained using a training data set comprising multi-level text segments;
the multi-level text fragment includes: a text segment containing an answer, a text segment in a document containing an answer that does not contain an answer, a text segment related to a search request, and/or a text segment not related to a search request;
an objective function L trained by the encoder is:
L = L_ad + L_ab + L_ac + L_bc + L_bd + L_cd
L_ad = max(0, ε_ad − (sim(q, s_a) − sim(q, s_d)))
L_ab = max(0, ε_ab − (sim(q, s_a) − sim(q, s_b)))
L_ac = max(0, ε_ac − (sim(q, s_a) − sim(q, s_c)))
L_bc = max(0, ε_bc − (sim(q, s_b) − sim(q, s_c)))
L_bd = max(0, ε_bd − (sim(q, s_b) − sim(q, s_d)))
L_cd = max(0, ε_cd − (sim(q, s_c) − sim(q, s_d)))
wherein q is the search request; s_a is the text segment containing the answer (the a-level); s_b is the text segment in the document containing the answer that does not contain the answer (the b-level); s_c is the text segment related to the search request (the c-level); s_d is the text segment not related to the search request (the d-level); sim(q, s) is the correlation between the search request and a text segment, s representing any of s_a, s_b, s_c and s_d; ε is a preset constant representing the minimum acceptable distance between the correlations of the search request with two hierarchical text segments; for a training data set comprising four levels of text segments, ε includes ε_ad, ε_ab, ε_ac, ε_bc, ε_bd and ε_cd, wherein:
ε_ad represents the minimum acceptable distance between the relevance sim(q, s_a) of the search request to the a-level text segment and the relevance sim(q, s_d) of the search request to the d-level text segment;
ε_ab represents the minimum acceptable distance between the relevance sim(q, s_a) of the search request to the a-level text segment and the relevance sim(q, s_b) of the search request to the b-level text segment;
ε_ac represents the minimum acceptable distance between the relevance sim(q, s_a) of the search request to the a-level text segment and the relevance sim(q, s_c) of the search request to the c-level text segment;
ε_bc represents the minimum acceptable distance between the relevance sim(q, s_b) of the search request to the b-level text segment and the relevance sim(q, s_c) of the search request to the c-level text segment;
ε_bd represents the minimum acceptable distance between the relevance sim(q, s_b) of the search request to the b-level text segment and the relevance sim(q, s_d) of the search request to the d-level text segment;
ε_cd represents the minimum acceptable distance between the relevance sim(q, s_c) of the search request to the c-level text segment and the relevance sim(q, s_d) of the search request to the d-level text segment.
2. The multi-hierarchy long text vector retrieval method of claim 1, wherein the text segment containing the answer and the text segment not containing the answer in the document containing the answer are obtained by segmenting the document containing the answer; the text segment related to the search request is obtained by segmenting the document related to the search request obtained by searching; the text segments irrelevant to the search request are obtained by segmenting the document obtained by random sampling.
3. The method of multi-level long text vector retrieval of claim 1, wherein said querying for text segments similar to a search request based on vector retrieval using text segments and dense vectors of the search request comprises:
constructing a vector index from the dense vectors of the text segments;
and retrieving the vector index according to the dense vector of the search request to obtain a target text segment similar to the search request.
4. The method of multi-level long text vector retrieval according to claim 1, wherein said obtaining a target text segment similar to said search request further comprises:
acquiring the highest score of the similarity between the target text segment and the search request;
and screening the target text fragment by using the highest score of the similarity and the minimum acceptable distance of the correlation between the search request and the two hierarchical text fragments to obtain the target hierarchical text fragment.
5. The multi-level long text vector retrieval method of claim 4, wherein the filtering comprises:
and screening the target text segment by using the difference between the highest similarity score and the minimum acceptable distance as a threshold, wherein if the similarity score between the target text segment and the search request reaches the threshold, the target text segment is the target level text segment.
6. A multi-level long text vector retrieval apparatus, comprising:
the text segmentation module is used for segmenting the long text in the open field into text segments;
a vector encoding module, configured to encode the text segments and the search request respectively into dense vectors by using a trained encoder, wherein the encoder is trained by using a training data set comprising multi-level text segments; the multi-level text segments include: a text segment containing an answer, a text segment in a document containing an answer that does not contain an answer, a text segment related to a search request, and/or a text segment not related to a search request; an objective function L trained by the encoder is:
L = L_ad + L_ab + L_ac + L_bc + L_bd + L_cd
L_ad = max(0, ε_ad − (sim(q, s_a) − sim(q, s_d)))
L_ab = max(0, ε_ab − (sim(q, s_a) − sim(q, s_b)))
L_ac = max(0, ε_ac − (sim(q, s_a) − sim(q, s_c)))
L_bc = max(0, ε_bc − (sim(q, s_b) − sim(q, s_c)))
L_bd = max(0, ε_bd − (sim(q, s_b) − sim(q, s_d)))
L_cd = max(0, ε_cd − (sim(q, s_c) − sim(q, s_d)))
wherein q is the search request; s_a is the text segment containing the answer (the a-level); s_b is the text segment in the document containing the answer that does not contain the answer (the b-level); s_c is the text segment related to the search request (the c-level); s_d is the text segment not related to the search request (the d-level); sim(q, s) is the correlation between the search request and a text segment, s representing any of s_a, s_b, s_c and s_d; ε is a preset constant representing the minimum acceptable distance between the correlations of the search request with two hierarchical text segments; for a training data set comprising four levels of text segments, ε includes ε_ad, ε_ab, ε_ac, ε_bc, ε_bd and ε_cd, wherein:
ε_ad represents the minimum acceptable distance between the relevance sim(q, s_a) of the search request to the a-level text segment and the relevance sim(q, s_d) of the search request to the d-level text segment;
ε_ab represents the minimum acceptable distance between the relevance sim(q, s_a) of the search request to the a-level text segment and the relevance sim(q, s_b) of the search request to the b-level text segment;
ε_ac represents the minimum acceptable distance between the relevance sim(q, s_a) of the search request to the a-level text segment and the relevance sim(q, s_c) of the search request to the c-level text segment;
ε_bc represents the minimum acceptable distance between the relevance sim(q, s_b) of the search request to the b-level text segment and the relevance sim(q, s_c) of the search request to the c-level text segment;
ε_bd represents the minimum acceptable distance between the relevance sim(q, s_b) of the search request to the b-level text segment and the relevance sim(q, s_d) of the search request to the d-level text segment;
ε_cd represents the minimum acceptable distance between the relevance sim(q, s_c) of the search request to the c-level text segment and the relevance sim(q, s_d) of the search request to the d-level text segment;
and the vector retrieval module is used for querying and obtaining a target text segment similar to the search request based on vector retrieval by utilizing the text segment and the dense vector of the search request.
7. A memory storing a plurality of instructions for implementing the method of any one of claims 1-5.
8. An electronic device comprising a processor and a memory coupled to the processor, the memory storing a plurality of instructions that are loadable and executable by the processor to enable the processor to perform the method according to any of claims 1-5.
CN202110421266.3A 2021-04-20 2021-04-20 Multi-level-length text vector retrieval method and device and electronic equipment Active CN112988952B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110421266.3A CN112988952B (en) 2021-04-20 2021-04-20 Multi-level-length text vector retrieval method and device and electronic equipment


Publications (2)

Publication Number Publication Date
CN112988952A CN112988952A (en) 2021-06-18
CN112988952B true CN112988952B (en) 2021-08-24


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881264A (en) * 2020-09-28 2020-11-03 北京智源人工智能研究院 Method and electronic equipment for searching long text in question-answering task in open field

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491518B (en) * 2017-08-15 2020-08-04 北京百度网讯科技有限公司 Search recall method and device, server and storage medium
CN107491547B (en) * 2017-08-28 2020-11-10 北京百度网讯科技有限公司 Search method and device based on artificial intelligence


Also Published As

Publication number Publication date
CN112988952A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
US11222167B2 (en) Generating structured text summaries of digital documents using interactive collaboration
CN110442760B (en) Synonym mining method and device for question-answer retrieval system
CN106649818B (en) Application search intention identification method and device, application search method and server
CN107463607B (en) Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning
CN111027327A (en) Machine reading understanding method, device, storage medium and device
CN110688854B (en) Named entity recognition method, device and computer readable storage medium
CN112818093B (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN112256860A (en) Semantic retrieval method, system, equipment and storage medium for customer service conversation content
CN112100326B (en) Anti-interference question and answer method and system integrating retrieval and machine reading understanding
CN111881264B (en) Method and electronic equipment for searching long text in question-answering task in open field
CN112328800A (en) System and method for automatically generating programming specification question answers
CN112036184A (en) Entity identification method, device, computer device and storage medium based on BilSTM network model and CRF model
CN108491407B (en) Code retrieval-oriented query expansion method
CN110738059A (en) text similarity calculation method and system
CN112199958A (en) Concept word sequence generation method and device, computer equipment and storage medium
CN112988952B (en) Multi-level-length text vector retrieval method and device and electronic equipment
CN110705285A (en) Government affair text subject word bank construction method, device, server and readable storage medium
CN114385819B (en) Environment judicial domain ontology construction method and device and related equipment
CN115617954A (en) Question answering method and device, electronic equipment and storage medium
CN112949293A (en) Similar text generation method, similar text generation device and intelligent equipment
CN112036183A (en) Word segmentation method and device based on BilSTM network model and CRF model, computer device and computer storage medium
Qu English-Chinese name transliteration by latent analogy
CN116842168B (en) Cross-domain problem processing method and device, electronic equipment and storage medium
CN115437620B (en) Natural language programming method, device, equipment and storage medium
KR102541806B1 (en) Method, system, and computer readable record medium for ranking reformulated query

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant