CN112988952B - Multi-level-length text vector retrieval method and device and electronic equipment - Google Patents

Multi-level-length text vector retrieval method and device and electronic equipment

Info

Publication number
CN112988952B
Authority
CN
China
Prior art keywords
text
search request
text segment
level
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110421266.3A
Other languages
Chinese (zh)
Other versions
CN112988952A (en)
Inventor
钱泓锦
刘占亮
窦志成
文继荣
曹岗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhiyuan Artificial Intelligence Research Institute
Original Assignee
Beijing Zhiyuan Artificial Intelligence Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhiyuan Artificial Intelligence Research Institute filed Critical Beijing Zhiyuan Artificial Intelligence Research Institute
Priority to CN202110421266.3A priority Critical patent/CN112988952B/en
Publication of CN112988952A publication Critical patent/CN112988952A/en
Application granted granted Critical
Publication of CN112988952B publication Critical patent/CN112988952B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/31 Indexing; Data structures therefor; Storage structures
    • G06F 16/316 Indexing structures
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-level long-text vector retrieval method, apparatus and electronic equipment. The method comprises the following steps: dividing a long text in the open domain into text segments; respectively encoding the text segments and the search request into dense vectors by using a trained encoder; and using the dense vectors of the text segments and of the search request to query, based on vector retrieval, for target text segments similar to the search request; wherein the encoder is trained using a training data set comprising multi-level text segments. By considering the multi-level relevance between the text segments in the training data set and the search request, the obtained model can readily select the appropriate segments from a plurality of relevant segments, and recall efficiency is significantly improved.

Description

Multi-level-length text vector retrieval method and device and electronic equipment
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a multi-level long-text vector retrieval method and apparatus and electronic equipment.
Background
Open-domain question answering is an important task in the field of natural language processing. It can be described simply as: given a factual question, the system must retrieve, from a large-scale multi-domain document library, the document in which the answer to the question is located, and then extract or generate the answer from it. For the open-domain question-answering task, document retrieval is often the most important part, and its accuracy determines the upper bound of the system's overall effectiveness.
At present, the common methods for document retrieval in open-domain question answering are based on sparse-matrix or dense-vector retrieval. Sparse-matrix-based retrieval methods generally use TF-IDF or BM25 and typically comprise the following steps: extracting semantic information from the document, including keyword extraction, named-entity recognition, proper-noun extraction and the like, to obtain the key information in the document; constructing a plurality of index fields from the document text and the extracted semantic information, a step for which a search-engine tool such as Elasticsearch is often used; and, for a new search request, extracting the same semantic information, converting it into a sparse representation, comparing and scoring it against the documents in the library, and recalling the highest-scoring results. In contrast, dense-vector retrieval methods generally encode documents and search requests into dense vectors with a neural network model and then perform similarity calculation to recall search results.
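As background, the BM25 scoring mentioned above can be sketched as follows. This is a minimal illustrative implementation, not part of the patent; the parameters k1 and b are conventional defaults, and the tokenized inputs are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n                          # average doc length
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = k1 * (1 - b + b * len(d) / avgdl)               # length normalization
            s += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
        scores.append(s)
    return scores
```

Such sparse scoring matches only exact terms, which is precisely why the disambiguation and normalization problems listed below arise.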
The sparse-matrix-based retrieval method has the following disadvantages: (1) manual feature engineering is needed, which is a tedious, time-consuming and error-prone process; moreover, the code written for manual feature engineering is specific to a particular problem, and when a new problem or a new data set is to be handled, the relevant code must be rewritten. (2) It is difficult to resolve word ambiguity in the open domain. For example, for the word "apple", if its contextual information is ignored, the system can hardly determine whether it refers to the fruit or to the technology company. (3) Deep understanding of semantics is lacking. For example, for two different expressions of the same institution (such as an abbreviation and the full name "Ministry of Industry and Information Technology"), the system cannot automatically discover that they refer to the same entity, and manual normalization is needed. (4) The room for effect optimization is limited: owing to the technical limitations of manual feature engineering, once the retrieval effect reaches a certain level it is difficult to optimize further. (5) Generalization is poor: since the indexes in the system are constructed with strong domain attributes, the effect is often poor when a search request outside the text's domain is encountered.
Dense-vector-based retrieval methods can, to some extent, address many of the shortcomings of sparse-matrix-based methods. In general, a dense-vector retrieval method trains an encoder E based on a deep neural network, encodes the documents and the search requests into dense vectors, and obtains relevance scores by performing similarity calculations on those vectors. Currently, the data set used to train the encoder E contains only positive and negative examples: the document segment containing the correct answer to the search request is taken as a positive example, and other document segments (obtained by retrieval, random sampling and the like) are taken as negative examples. The model is optimized with a binary classification loss function to obtain the trained encoder E. However, training with positive/negative binary labels has the following problem: during training, a long document containing the answer is cut into a plurality of text segments, some of which contain the answer while others do not contain the answer but are still semantically related to the search request. In practice, a text segment semantically related to the search request should be treated as a positive example, but under the existing method it is trained as a negative example, so the model's effectiveness is reduced in real applications. For example, for a request such as "What are the material requirements for the 2021 application of the Beijing Natural Science Foundation project?", the corresponding document may mention "Beijing Natural Science Foundation project, 2021" only at the beginning; because the long text is cut into text segments, that first segment is trained as a negative example, thereby reducing the effectiveness of the model.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides the following technical solutions.
A first aspect of the invention provides a multi-level long-text vector retrieval method, comprising the following steps:
dividing a long text in an open field into text segments;
respectively encoding the text segments and the search requests into dense vectors by using a trained encoder;
utilizing the text segments and the dense vectors of the search requests, and inquiring to obtain target text segments similar to the search requests based on vector retrieval;
wherein the encoder is trained using a training data set comprising multi-level text segments.
Preferably, the multi-level text fragment comprises: a text segment containing an answer, a text segment in a document containing an answer that does not contain an answer, a text segment related to a search request, and/or a text segment not related to a search request.
Preferably, the text segment containing the answer and the text segment not containing the answer in the document containing the answer are obtained by segmenting the document containing the answer; the text segment related to the search request is obtained by segmenting the document related to the search request obtained by searching; the text segments irrelevant to the search request are obtained by segmenting the document obtained by random sampling.
Preferably, the objective function L for training the encoder is:

L = Σ max(0, λ_xy - (sim(q, x) - sim(q, y))), the sum being taken over the level pairs (x, y) ∈ {(a, b), (a, c), (a, d), (b, c), (b, d), (c, d)},

wherein q is the search request, a is the text segment containing the answer, b is a text segment in the answer-bearing document that does not contain the answer, c is a text segment related to the search request, and d is a text segment unrelated to the search request; sim(q, ·) is the correlation between the search request and a text segment, computed from the encoded representations of a, b, c and d; and each λ_xy is a preset constant representing the minimum acceptable distance between the correlations of the search request with two hierarchy levels.
Preferably, the querying to obtain text segments similar to the search request based on vector retrieval by using the text segments and the dense vector of the search request comprises:
constructing a vector index from the dense vectors of the text segments;
and retrieving the vector index according to the dense vector of the search request to obtain a target text segment similar to the search request.
Preferably, after the target text segments similar to the search request are obtained, the method further includes:
acquiring the highest score of the similarity between the target text segment and the search request;
and screening the target text fragment by using the highest score of the similarity and the minimum acceptable distance of the correlation between the search request and the two hierarchical text fragments to obtain the target hierarchical text fragment.
Preferably, the screening comprises:
and screening the target text segment by using the difference between the highest similarity score and the minimum acceptable distance as a threshold, wherein if the similarity score between the target text segment and the search request reaches the threshold, the target text segment is the target level text segment.
A second aspect of the present invention provides a multi-level long-text vector retrieval apparatus, comprising:
the text segmentation module is used for segmenting the long text in the open field into text segments;
the vector coding module is used for coding the text segments and the search requests into dense vectors by utilizing a trained coder, and the coder is obtained by utilizing a training data set comprising multi-level text segments through training;
and the vector retrieval module is used for querying and obtaining a target text segment similar to the search request based on vector retrieval by utilizing the text segment and the dense vector of the search request.
The invention also provides a memory storing a plurality of instructions for implementing the method.
The invention also provides an electronic device comprising a processor and a memory connected with the processor, wherein the memory stores a plurality of instructions which can be loaded and executed by the processor so as to enable the processor to execute the method.
The invention has the beneficial effects that: the embodiments of the invention provide a multi-level long-text vector retrieval method and apparatus and electronic equipment, wherein the method comprises: dividing a long text in the open domain into text segments; respectively encoding the text segments and the search request into dense vectors by using a trained encoder; and using the dense vectors of the text segments and of the search request to query, based on vector retrieval, for target text segments similar to the search request; wherein the encoder is trained using a training data set comprising multi-level text segments. By considering the multi-level relevance between the text segments in the training data set and the search request, the obtained model can readily select the appropriate segments from a plurality of relevant segments, and recall efficiency is improved.
Drawings
FIG. 1 is a schematic flow chart of the multi-level long-text vector retrieval method according to the present invention;
FIG. 2 is a schematic diagram of the multi-level long-text vector retrieval apparatus according to the present invention.
Detailed Description
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
The method provided by the invention can be implemented in the following terminal environment, and the terminal can comprise one or more of the following components: a processor, a memory, and a display screen. Wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the methods described in the embodiments described below.
A processor may include one or more processing cores. The processor connects various parts within the overall terminal using various interfaces and lines, performs various functions of the terminal and processes data by executing or executing instructions, programs, code sets, or instruction sets stored in the memory, and calling data stored in the memory.
The memory may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory may be used to store instructions, programs, code sets, or instruction sets.
The display screen is used for displaying user interfaces of all the application programs.
In addition, those skilled in the art will appreciate that the above-described terminal configurations are not intended to be limiting, and that the terminal may include more or fewer components, or some components may be combined, or a different arrangement of components. For example, the terminal further includes a radio frequency circuit, an input unit, a sensor, an audio circuit, a power supply, and other components, which are not described herein again.
Example one
As shown in FIG. 1, an embodiment of the present invention provides a multi-level long-text vector retrieval method, including:
s101, dividing a long text in an open field into text segments;
s102, respectively encoding the text segments and the search requests into dense vectors by using a trained encoder;
s103, by utilizing the text segments and the dense vectors of the search requests, based on vector retrieval, inquiring to obtain target text segments similar to the search requests;
wherein the encoder is trained using a training data set comprising multi-level text segments.
In practical applications, since a long document often must be cut into a plurality of text segments for model training, the relevance between the search request and the text segments is multi-level rather than a simple relevant/irrelevant binary label. For example, consider the following four kinds of text snippets: a. document segments containing the answer; b. document segments in the answer-bearing document that do not contain the answer; c. document segments relevant to the search request; and d. document segments irrelevant to the search request. Their relevance to the search request is ranked a > b > c > d. In the prior art, this hierarchical relationship is not considered during model training, and only binary-classification training is carried out, so that the resulting model has difficulty selecting the most appropriate segment from among a plurality of relevant segments.
In the method provided by the invention, the multilevel correlation between the search request and the text segment is considered, the training data set comprising the multilevel text segment is used for training to obtain the encoder, then the trained encoder is used for encoding the text segment and the search request into dense vectors, and finally the target text segment similar to the search request is obtained through similarity calculation.
The method provided by the invention is based on a deep neural network model and adopts a large-scale pre-trained language model to extract deep semantic information. Compared with the prior-art methods based on manual feature engineering, its reusability, retrieval effect, retrieval performance, usability and maintainability are greatly improved.
In addition, because the encoder considers the multi-level relevance of the text segment and the search request in the training process, the obtained model can easily select a proper segment from a plurality of relevant segments.
Step S101 is executed. For a long text in the open domain, to ensure that the segmented text segments carry complete semantic information, the long text may first be split by paragraphs. Paragraph texts shorter than the maximum sequence length may be spliced with their context; paragraph texts that are too long may be further split by sentences into short sub-paragraphs. Meanwhile, an ID code may be generated for each text segment obtained by segmentation, so that the original text can be restored from the text segments' ID codes.
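A minimal sketch of this segmentation step; the concrete splitting rules and the `docid-index` ID scheme are illustrative assumptions, not the patent's exact procedure:

```python
def segment_long_text(doc_id, text, max_len=200):
    """Split a long document into text segments: first by paragraph, then by
    sentence for overlong paragraphs, splicing short neighbors for context.
    Each segment gets an ID encoding its source document and position so the
    original text can be restored by concatenation in ID order."""
    paragraphs = [p for p in text.split("\n") if p]
    pieces = []
    for p in paragraphs:
        if len(p) <= max_len:
            pieces.append(p)
        else:
            # overlong paragraph: fall back to sentence-level splitting
            buf = ""
            for sent in p.replace("。", "。\x00").split("\x00"):
                if buf and len(buf) + len(sent) > max_len:
                    pieces.append(buf)
                    buf = ""
                buf += sent
            if buf:
                pieces.append(buf)
    # splice adjacent short pieces so each segment keeps fuller context
    merged, buf = [], ""
    for piece in pieces:
        if buf and len(buf) + len(piece) > max_len:
            merged.append(buf)
            buf = ""
        buf += piece
    if buf:
        merged.append(buf)
    return {f"{doc_id}-{i}": seg for i, seg in enumerate(merged)}
```

Concatenating the returned segments in ID order restores the original text (minus the paragraph breaks that were consumed during splicing).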
In step S102, the encoder may be trained using a training data set comprising multi-level text segments.
The number of levels of the multi-level text segments can be set according to the actual situation.
As an example, four levels of text fragments may include: a text segment containing an answer, a text segment in a document containing an answer that does not contain an answer, a text segment related to a search request, and/or a text segment not related to a search request.
The text segment containing the answer can be obtained by segmenting a document containing the answer; the text segment which does not contain the answer in the document containing the answer can be obtained by segmenting the document containing the answer; the text segment related to the search request can be obtained by segmenting the searched document related to the search request; the text segments irrelevant to the search request can be obtained by segmenting the document obtained by random sampling.
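As a hedged sketch, one way to assemble the four-level training tuples described above is shown below. The function and argument names are hypothetical; a real pipeline would obtain the c-level documents from a retrieval run and would use the paragraph/sentence segmentation of step S101 rather than this naive fixed-length splitter (which can cut an answer across a boundary).

```python
import random

def split_segments(doc, seg_len=100):
    """Naive fixed-length splitter standing in for the segmentation of S101."""
    return [doc[i:i + seg_len] for i in range(0, len(doc), seg_len)]

def build_example(query, answer, answer_doc, related_docs, corpus, seg_len=100):
    """Return one (q, a_list, b_list, c_list, d_list) training tuple."""
    segs = split_segments(answer_doc, seg_len)
    a = [s for s in segs if answer in s]        # level a: contains the answer
    b = [s for s in segs if answer not in s]    # level b: same document, no answer
    c = [s for doc in related_docs              # level c: retrieved related docs
         for s in split_segments(doc, seg_len)]
    d = split_segments(random.choice(corpus), seg_len)  # level d: random sample
    return query, a, b, c, d
```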
For example, for the search request "What are the material requirements for the 2021 application of the Beijing Natural Science Foundation project?", the text fragments of the four levels associated with it may be as shown in the following table.

[Table: example text fragments at levels a, b, c and d for this search request]

Note: level a is a text fragment containing the answer, level b is a text fragment in the answer-bearing document that does not contain the answer, level c is a text fragment related to the search request, and level d is a text fragment unrelated to the search request.
When training the encoder, a training data set composed of search requests and multi-level text segments is first obtained, for example

D = {(q_i, a_i, b_i, c_i, d_i)}, i = 1, ..., N,

wherein a_i, b_i, c_i and d_i respectively denote text segments of the four levels a, b, c and d. For one piece of data (q, a, b, c, d), the encoder E encodes each text to obtain its vector representation. For two texts x and y, the similarity is the dot product of their two vectors:

sim(x, y) = E(x) · E(y)

The objective function L for training the encoder is:

L = Σ max(0, λ_xy - (sim(q, x) - sim(q, y))), the sum being taken over the level pairs (x, y) ∈ {(a, b), (a, c), (a, d), (b, c), (b, d), (c, d)},
wherein q is the search request, a is the text segment containing the answer, b is a text segment in the answer-bearing document that does not contain the answer, c is a text segment related to the search request, and d is a text segment unrelated to the search request; sim(q, ·) is the correlation between the search request and a text segment; and each λ_xy is a preset constant representing the minimum acceptable distance between the correlations of the search request with two hierarchy levels. For a training data set comprising four levels of text segments, the constants include λ_ab, λ_ac, λ_ad, λ_bc, λ_bd and λ_cd:

λ_ad represents the minimum acceptable distance between the relevance sim(q, a) of the search request to an a-level text segment and the relevance sim(q, d) of the search request to a d-level text segment;

λ_ab represents the minimum acceptable distance between the relevance sim(q, a) to an a-level text segment and the relevance sim(q, b) to a b-level text segment;

λ_ac represents the minimum acceptable distance between the relevance sim(q, a) to an a-level text segment and the relevance sim(q, c) to a c-level text segment;

λ_bc represents the minimum acceptable distance between the relevance sim(q, b) to a b-level text segment and the relevance sim(q, c) to a c-level text segment;

λ_bd represents the minimum acceptable distance between the relevance sim(q, b) to a b-level text segment and the relevance sim(q, d) to a d-level text segment;

λ_cd represents the minimum acceptable distance between the relevance sim(q, c) to a c-level text segment and the relevance sim(q, d) to a d-level text segment.

For example, given the preset distance λ_ad, the optimization objective requires the distance between sim(q, a) and sim(q, d) to be at least λ_ad. By arranging the corresponding hinge terms for all pairs in sequence, the objective function L is obtained. By optimizing the objective function L and training for a fixed number of iterations, the trained encoder E is obtained.
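The objective described above can be sketched in code as follows. The λ values below are illustrative placeholders, not values from the patent:

```python
# Preset minimum acceptable distances lambda_xy; the values are illustrative.
LAMBDAS = {
    ("a", "b"): 0.1, ("a", "c"): 0.2, ("a", "d"): 0.4,
    ("b", "c"): 0.1, ("b", "d"): 0.3, ("c", "d"): 0.2,
}

def multi_level_loss(sim):
    """Sum of hinge terms over all level pairs (x more relevant than y).

    sim maps a level name ("a".."d") to sim(q, segment) for one training
    example; each pair contributes max(0, lambda_xy - (sim[x] - sim[y])).
    """
    return sum(max(0.0, lam - (sim[x] - sim[y]))
               for (x, y), lam in LAMBDAS.items())
```

When every pair of levels is separated by at least its margin the loss is zero; when all levels score equally the loss equals the sum of the margins.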
According to the training method provided by the invention, the data set can be constructed automatically and carries multi-level labels. Compared with the positive/negative-label data sets used by existing methods, the data set constructed by the method retains more relevance information. In addition, the model training method fully considers the relative distance between the correlations of the search request with text segments of different levels, and can bring a remarkable improvement over the existing positive/negative binary-classification training method.
After the encoder is trained, the text segment and the search request obtained by segmenting the long text in the open field are respectively encoded into dense vectors by using the trained encoder.
Step S103 is executed, which may be implemented by the following method:
constructing a vector index from the dense vectors of the text segments;
and retrieving the vector index according to the dense vector of the search request to obtain a target text segment similar to the search request.
One of a plurality of vector index types can be selected, for example flat inner-product indexes, IVF indexes (IVFx) and the like. Vector retrieval may also be performed using various vector retrieval engines, such as Faiss, Milvus and the like.
In the vector retrieval, similarity scores between the dense vector of the search request and the dense vectors of the text segments in the vector index are calculated, and the text segments with the top similarity scores are selected as the target text segments similar to the search request.
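A minimal sketch of the index-and-search step. The brute-force class below imitates the interface of an exact inner-product index (in the spirit of Faiss's flat index) rather than wrapping a real engine:

```python
import numpy as np

class FlatIPIndex:
    """Brute-force exact inner-product index over dense segment vectors."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)

    def add(self, x):
        """Add a batch of segment vectors, shape (n, dim)."""
        self.vectors = np.vstack([self.vectors, np.asarray(x, dtype=np.float32)])

    def search(self, query_vec, k):
        """Return (scores, ids) of the top-k segments by inner product."""
        scores = self.vectors @ np.asarray(query_vec, dtype=np.float32)
        top = np.argsort(-scores)[:k]
        return scores[top], top
```

In production the same two calls (add at index-construction time, search at query time) would go to a dedicated engine so that the scan scales beyond brute force.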
The obtained target text segments similar to the search request comprise text segments of multiple levels. In order to further obtain the text segments of a particular level among the target text segments, the invention provides the following method:
acquiring the highest score of the similarity between the target text segment and the search request;
and screening the target text fragment by using the highest score of the similarity and the minimum acceptable distance of the correlation between the search request and the two hierarchical text fragments to obtain the target hierarchical text fragment.
In other words, the similarity scores between the target text segments and the search request calculated in the previous step are examined to obtain the highest score. Meanwhile, the minimum acceptable distance between the correlations of the search request with two hierarchy levels, preset during encoder training (one of λ_ab, λ_ac, λ_ad, λ_bc, λ_bd and λ_cd), is obtained, and the target text segments are then screened using the highest similarity score and this minimum acceptable distance, so as to obtain the target-level text segments, that is, the target text segments of a particular level.
The specific screening process may include:
and screening the target text segment by using the difference between the highest similarity score and the minimum acceptable distance as a threshold, wherein if the similarity score between the target text segment and the search request reaches the threshold, the target text segment is the target level text segment.
As an example, suppose the highest similarity score between the target text segments obtained by vector retrieval and the search request is 0.9. If all text segments containing the answer are to be further selected from the target text segments, that is, the a-level text segments are desired, the highest score minus the corresponding minimum acceptable distance (for example 0.9 - λ_ab) may be used as the threshold for screening the target text segments: if the similarity score between a target text segment and the search request reaches the threshold, that target text segment is a target-level text segment. If all relevant text segments are to be further selected, that is, text segments down to the c level are desired, a larger preset distance may be subtracted instead (for example 0.9 - λ_ad may be used as the threshold), and likewise any target text segment whose similarity score reaches the threshold is a target-level text segment.
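The screening rule just described can be sketched as follows; the lam argument is whichever preset minimum acceptable distance matches the desired level (e.g. λ_ab to keep only answer-bearing segments):

```python
def filter_by_level(results, lam):
    """Keep the target segments whose similarity score reaches the threshold,
    i.e. the highest score minus the preset minimum acceptable distance lam."""
    best = max(score for _, score in results)
    threshold = best - lam
    return [(seg_id, score) for seg_id, score in results if score >= threshold]
```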
During model training, the method provided by the invention optimizes the distance between the correlations of the search request with text segments of any two levels toward a preset value, which provides a reference standard for text-segment filtering in the post-processing after recalling the target text segments. Compared with manually setting a filtering threshold as in existing methods, the method provided by the invention is more interpretable and more flexible.
Embodiment Two
As shown in fig. 2, another aspect of the present invention provides a functional module architecture fully corresponding to the foregoing method flow; that is, an embodiment of the present invention further provides a multi-level long text vector retrieval device, including:
the text segmentation module 201 is configured to segment a long text in an open field into text segments;
a vector encoding module 202, configured to encode the text segments and the search request into dense vectors respectively by using a trained encoder, where the encoder is obtained by using a training data set including multi-level text segments through training;
and the vector retrieval module 203 is used for utilizing the text segments and the dense vectors of the search requests, and querying to obtain target text segments similar to the search requests based on vector retrieval.
Wherein, in the vector encoding module, the multi-level text segment comprises: a text segment containing an answer, a text segment in a document containing an answer that does not contain an answer, a text segment related to a search request, and/or a text segment not related to a search request.
Further, the text segment containing the answer is obtained by segmenting the document containing the answer; the text segment which does not contain the answer in the document containing the answer is obtained by segmenting the document containing the answer; the text segment related to the search request is obtained by segmenting the document related to the search request obtained by searching; the text segments irrelevant to the search request are obtained by segmenting the document obtained by random sampling.
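The four-level construction just described might be assembled as in the following sketch. The segmentation strategy, function names and document sources are placeholders of my own, since the patent does not fix them:

```python
import random

def split_into_segments(document, size=128):
    """Naive fixed-length segmentation; the patent does not fix a strategy."""
    return [document[i:i + size] for i in range(0, len(document), size)]

def build_training_example(query, answer, answer_doc, related_doc, corpus):
    """Assemble one (query, s_a, s_b, s_c, s_d) four-level training tuple.
    Assumes the answer string falls entirely inside one segment of answer_doc
    and that answer_doc also yields at least one answer-free segment."""
    segments = split_into_segments(answer_doc)
    s_a = next(s for s in segments if answer in s)      # a-level: contains the answer
    s_b = next(s for s in segments if answer not in s)  # b-level: same doc, no answer
    s_c = random.choice(split_into_segments(related_doc))            # c-level: related doc
    s_d = random.choice(split_into_segments(random.choice(corpus)))  # d-level: random doc
    return query, s_a, s_b, s_c, s_d
```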
In the vector encoding module, the objective function L used to train the encoder is:
L = Σ max(0, ε_xy − (sim(q, s_x) − sim(q, s_y))), summed over the level pairs (x, y) ∈ {(a, d), (a, b), (a, c), (b, c), (b, d), (c, d)},
wherein q is the search request; s_a is the text segment containing the answer; s_b is the text segment in the document containing the answer that does not contain the answer; s_c is the text segment related to the search request; s_d is the text segment not related to the search request; sim(q, s) is the correlation between the search request and a text segment, s representing any of s_a, s_b, s_c and s_d; and ε_xy is a preset constant representing the minimum acceptable distance between the correlations of the search request with the two hierarchical text segments.
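One natural reading of the objective just described, with a preset minimum acceptable distance between the relevances of any two levels, is a pairwise hinge loss. The sketch below is my assumption of that form, not the patent's verbatim formula:

```python
# Level ordering: a (contains answer) > b (same doc, no answer)
# > c (related) > d (unrelated).
PAIRS = [("a", "d"), ("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d")]

def objective(sim, eps):
    """Pairwise hinge loss: each term vanishes once sim(q, s_x) exceeds
    sim(q, s_y) by at least the margin eps[(x, y)]."""
    return sum(max(0.0, eps[(x, y)] - (sim[x] - sim[y])) for x, y in PAIRS)

# With well-separated relevance scores, every margin of 0.15 is satisfied
# and the loss is zero; narrowing any gap below its margin adds a penalty.
sim = {"a": 0.9, "b": 0.7, "c": 0.5, "d": 0.1}
loss = objective(sim, {pair: 0.15 for pair in PAIRS})
```

Because each term pushes the inter-level score gap toward its preset ε, the trained score gaps later serve as the screening thresholds described in Embodiment One.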
Further, the vector retrieval module is specifically configured to:
constructing a vector index from the dense vectors of the text segments;
and retrieving the vector index according to the dense vector of the search request to obtain a target text segment similar to the search request.
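The two retrieval steps above can be sketched with a brute-force inner-product index. A dedicated library such as FAISS would be used at scale, but the patent does not mandate one, so the class below is a minimal stand-in with invented names and toy vectors:

```python
import numpy as np

class DenseIndex:
    """Brute-force stand-in for a vector index over dense segment vectors."""

    def __init__(self, segment_vectors):
        # Step 1: construct the vector index from the text segments' dense vectors.
        self.vectors = np.asarray(segment_vectors, dtype="float32")

    def search(self, query_vector, k):
        # Step 2: score every segment against the search request's dense
        # vector (inner product) and return the top-k target text segments.
        scores = self.vectors @ np.asarray(query_vector, dtype="float32")
        top = np.argsort(-scores)[:k]
        return scores[top], top

index = DenseIndex([[0.1, 0.9, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])
scores, ids = index.search([0.0, 1.0, 0.0], k=2)  # ids of the 2 most similar segments
```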
The multi-level long text vector retrieval device provided by the embodiment of the invention further comprises a screening module, configured to, after the target text segment similar to the search request is obtained:
acquiring the highest score of the similarity between the target text segment and the search request;
and screening the target text fragment by using the highest score of the similarity and the minimum acceptable distance of the correlation between the search request and the two hierarchical text fragments to obtain the target hierarchical text fragment.
Further, in the screening module, the screening includes:
and screening the target text segment by using the difference between the highest similarity score and the minimum acceptable distance as a threshold, wherein if the similarity score between the target text segment and the search request reaches the threshold, the target text segment is the target level text segment.
The device can be implemented by the multi-level long text vector retrieval method provided in the first embodiment; for specific implementation, reference may be made to the description in the first embodiment, which is not repeated herein.
The invention also provides a memory storing a plurality of instructions for implementing the method according to the first embodiment.
The invention also provides an electronic device comprising a processor and a memory connected to the processor, wherein the memory stores a plurality of instructions, and the instructions can be loaded and executed by the processor to enable the processor to execute the method according to the first embodiment.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (8)

1. A multi-level long text vector retrieval method is characterized by comprising the following steps:
dividing a long text in an open field into text segments;
respectively encoding the text segments and the search requests into dense vectors by using a trained encoder;
utilizing the text segments and the dense vectors of the search requests, and inquiring to obtain target text segments similar to the search requests based on vector retrieval;
wherein the encoder is trained using a training data set comprising multi-level text segments;
the multi-level text fragment includes: a text segment containing an answer, a text segment in a document containing an answer that does not contain an answer, a text segment related to a search request, and/or a text segment not related to a search request;
an objective function L trained by the encoder is:
L = L_ad + L_ab + L_ac + L_bc + L_bd + L_cd
L_ad = max(0, ε_ad − (sim(q, s_a) − sim(q, s_d)))
L_ab = max(0, ε_ab − (sim(q, s_a) − sim(q, s_b)))
L_ac = max(0, ε_ac − (sim(q, s_a) − sim(q, s_c)))
L_bc = max(0, ε_bc − (sim(q, s_b) − sim(q, s_c)))
L_bd = max(0, ε_bd − (sim(q, s_b) − sim(q, s_d)))
L_cd = max(0, ε_cd − (sim(q, s_c) − sim(q, s_d)))
wherein q is the search request; s_a is the text segment containing the answer (the a-level); s_b is the text segment in the document containing the answer that does not contain the answer (the b-level); s_c is the text segment related to the search request (the c-level); s_d is the text segment not related to the search request (the d-level); sim(q, s) is the correlation between the search request and a text segment, s representing any of s_a, s_b, s_c and s_d; ε is a preset constant representing the minimum acceptable distance between the correlations of the search request with two hierarchical text segments; for a training data set comprising four levels of text segments, ε includes ε_ad, ε_ab, ε_ac, ε_bc, ε_bd and ε_cd, wherein:
ε_ad represents the minimum acceptable distance between the relevance sim(q, s_a) of the search request to the a-level text segment and the relevance sim(q, s_d) of the search request to the d-level text segment;
ε_ab represents the minimum acceptable distance between the relevance sim(q, s_a) of the search request to the a-level text segment and the relevance sim(q, s_b) of the search request to the b-level text segment;
ε_ac represents the minimum acceptable distance between the relevance sim(q, s_a) of the search request to the a-level text segment and the relevance sim(q, s_c) of the search request to the c-level text segment;
ε_bc represents the minimum acceptable distance between the relevance sim(q, s_b) of the search request to the b-level text segment and the relevance sim(q, s_c) of the search request to the c-level text segment;
ε_bd represents the minimum acceptable distance between the relevance sim(q, s_b) of the search request to the b-level text segment and the relevance sim(q, s_d) of the search request to the d-level text segment;
ε_cd represents the minimum acceptable distance between the relevance sim(q, s_c) of the search request to the c-level text segment and the relevance sim(q, s_d) of the search request to the d-level text segment.
2. The multi-hierarchy long text vector retrieval method of claim 1, wherein the text segment containing the answer and the text segment not containing the answer in the document containing the answer are obtained by segmenting the document containing the answer; the text segment related to the search request is obtained by segmenting the document related to the search request obtained by searching; the text segments irrelevant to the search request are obtained by segmenting the document obtained by random sampling.
3. The method of multi-level long text vector retrieval of claim 1, wherein said querying for text segments similar to a search request based on vector retrieval using text segments and dense vectors of the search request comprises:
constructing a vector index from the dense vectors of the text segments;
and retrieving the vector index according to the dense vector of the search request to obtain a target text segment similar to the search request.
4. The method of multi-level long text vector retrieval according to claim 1, wherein said obtaining a target text segment similar to said search request further comprises:
acquiring the highest score of the similarity between the target text segment and the search request;
and screening the target text fragment by using the highest score of the similarity and the minimum acceptable distance of the correlation between the search request and the two hierarchical text fragments to obtain the target hierarchical text fragment.
5. The multi-level long text vector retrieval method of claim 4, wherein the filtering comprises:
and screening the target text segment by using the difference between the highest similarity score and the minimum acceptable distance as a threshold, wherein if the similarity score between the target text segment and the search request reaches the threshold, the target text segment is the target level text segment.
6. A multi-level long text vector retrieval apparatus, comprising:
the text segmentation module is used for segmenting the long text in the open field into text segments;
a vector encoding module, configured to encode the text segments and the search request respectively into dense vectors by using a trained encoder, wherein the encoder is trained by using a training data set comprising multi-level text segments; the multi-level text segments include: a text segment containing an answer, a text segment in a document containing an answer that does not contain an answer, a text segment related to a search request, and/or a text segment not related to a search request; an objective function L trained by the encoder is:
L = L_ad + L_ab + L_ac + L_bc + L_bd + L_cd
L_ad = max(0, ε_ad − (sim(q, s_a) − sim(q, s_d)))
L_ab = max(0, ε_ab − (sim(q, s_a) − sim(q, s_b)))
L_ac = max(0, ε_ac − (sim(q, s_a) − sim(q, s_c)))
L_bc = max(0, ε_bc − (sim(q, s_b) − sim(q, s_c)))
L_bd = max(0, ε_bd − (sim(q, s_b) − sim(q, s_d)))
L_cd = max(0, ε_cd − (sim(q, s_c) − sim(q, s_d)))
wherein q is the search request; s_a is the text segment containing the answer (the a-level); s_b is the text segment in the document containing the answer that does not contain the answer (the b-level); s_c is the text segment related to the search request (the c-level); s_d is the text segment not related to the search request (the d-level); sim(q, s) is the correlation between the search request and a text segment, s representing any of s_a, s_b, s_c and s_d; ε is a preset constant representing the minimum acceptable distance between the correlations of the search request with two hierarchical text segments; for a training data set comprising four levels of text segments, ε includes ε_ad, ε_ab, ε_ac, ε_bc, ε_bd and ε_cd, wherein:
ε_ad represents the minimum acceptable distance between the relevance sim(q, s_a) of the search request to the a-level text segment and the relevance sim(q, s_d) of the search request to the d-level text segment;
ε_ab represents the minimum acceptable distance between the relevance sim(q, s_a) of the search request to the a-level text segment and the relevance sim(q, s_b) of the search request to the b-level text segment;
ε_ac represents the minimum acceptable distance between the relevance sim(q, s_a) of the search request to the a-level text segment and the relevance sim(q, s_c) of the search request to the c-level text segment;
ε_bc represents the minimum acceptable distance between the relevance sim(q, s_b) of the search request to the b-level text segment and the relevance sim(q, s_c) of the search request to the c-level text segment;
ε_bd represents the minimum acceptable distance between the relevance sim(q, s_b) of the search request to the b-level text segment and the relevance sim(q, s_d) of the search request to the d-level text segment;
ε_cd represents the minimum acceptable distance between the relevance sim(q, s_c) of the search request to the c-level text segment and the relevance sim(q, s_d) of the search request to the d-level text segment;
and the vector retrieval module is used for querying and obtaining a target text segment similar to the search request based on vector retrieval by utilizing the text segment and the dense vector of the search request.
7. A memory storing a plurality of instructions for implementing the method of any one of claims 1-5.
8. An electronic device comprising a processor and a memory coupled to the processor, the memory storing a plurality of instructions that are loadable and executable by the processor to enable the processor to perform the method according to any of claims 1-5.
CN202110421266.3A 2021-04-20 2021-04-20 Multi-level-length text vector retrieval method and device and electronic equipment Active CN112988952B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110421266.3A CN112988952B (en) 2021-04-20 2021-04-20 Multi-level-length text vector retrieval method and device and electronic equipment


Publications (2)

Publication Number Publication Date
CN112988952A CN112988952A (en) 2021-06-18
CN112988952B true CN112988952B (en) 2021-08-24


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881264A (en) * 2020-09-28 2020-11-03 北京智源人工智能研究院 Method and electronic equipment for searching long text in question-answering task in open field

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491518B (en) * 2017-08-15 2020-08-04 北京百度网讯科技有限公司 Search recall method and device, server and storage medium
CN107491547B (en) * 2017-08-28 2020-11-10 北京百度网讯科技有限公司 Search method and device based on artificial intelligence


Also Published As

Publication number Publication date
CN112988952A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
US11222167B2 (en) Generating structured text summaries of digital documents using interactive collaboration
CN110442760B (en) Synonym mining method and device for question-answer retrieval system
CN106649818B (en) Application search intention identification method and device, application search method and server
CN107463607B (en) Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning
CN111027327A (en) Machine reading understanding method, device, storage medium and device
CN110688854B (en) Named entity recognition method, device and computer readable storage medium
CN112818093B (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN112256860A (en) Semantic retrieval method, system, equipment and storage medium for customer service conversation content
CN112100326B (en) Anti-interference question and answer method and system integrating retrieval and machine reading understanding
CN111881264B (en) Method and electronic equipment for searching long text in question-answering task in open field
CN112328800A (en) System and method for automatically generating programming specification question answers
CN112036184A (en) Entity identification method, device, computer device and storage medium based on BilSTM network model and CRF model
CN108491407B (en) Code retrieval-oriented query expansion method
CN110738059A (en) text similarity calculation method and system
CN112199958A (en) Concept word sequence generation method and device, computer equipment and storage medium
CN112988952B (en) Multi-level-length text vector retrieval method and device and electronic equipment
CN110705285A (en) Government affair text subject word bank construction method, device, server and readable storage medium
CN114385819B (en) Environment judicial domain ontology construction method and device and related equipment
CN115617954A (en) Question answering method and device, electronic equipment and storage medium
CN112949293A (en) Similar text generation method, similar text generation device and intelligent equipment
CN112036183A (en) Word segmentation method and device based on BilSTM network model and CRF model, computer device and computer storage medium
Qu English-Chinese name transliteration by latent analogy
CN116842168B (en) Cross-domain problem processing method and device, electronic equipment and storage medium
CN115437620B (en) Natural language programming method, device, equipment and storage medium
KR102541806B1 (en) Method, system, and computer readable record medium for ranking reformulated query

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant