CN114328863A - Long text retrieval method and system based on Gaussian kernel function - Google Patents

Info

Publication number
CN114328863A
CN114328863A (application CN202111512377.1A)
Authority
CN
China
Prior art keywords
long text
user
gaussian kernel
text
retrieval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111512377.1A
Other languages
Chinese (zh)
Inventor
史树敏
朱乐
黄河燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202111512377.1A priority Critical patent/CN114328863A/en
Publication of CN114328863A publication Critical patent/CN114328863A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a long text retrieval method and system based on Gaussian kernel functions, and belongs to the technical field of information retrieval. The method uses the semantic modeling capability of a pre-trained language model to compute the semantic similarity between each paragraph of a long text and the user's query, and treats this similarity as a pseudo label for user click relevance, effectively alleviating the lack of paragraph-level annotation data. The pseudo labels are mapped to relevance scores in different dimensions by different Gaussian kernel functions. A linear layer aggregates the scores of all paragraphs of the long text and outputs the relevance score of the query with respect to the whole long text, so that paragraphs at different levels of semantic similarity all contribute to user click relevance. This strengthens the correlation between semantic similarity and click relevance and improves the accuracy of the long text retrieval model.

Description

Long text retrieval method and system based on Gaussian kernel function
Technical Field
The invention relates to a long text retrieval method and system, in particular to a long text retrieval method and system based on Gaussian kernel functions, and belongs to the technical field of information retrieval.
Background
Long text retrieval is a fundamental task in the field of information retrieval. It is characterized by documents with a long average length, where a single document may cover multiple topics. Traditional retrieval models have difficulty locating, within a long text, the topics that match the user's click intent.
In recent years, pre-trained language models have excelled in the field of information retrieval. Their strong contextual semantic modeling capability allows a retrieval model to better estimate the semantic similarity between the user's query and a candidate document, improving the accuracy with which the model judges whether the two are relevant. However, in long text retrieval the input length is limited, and a pre-trained language model cannot compute the semantic similarity between the user's query and the entire long text.
At present, the prior art mainly segments the long text into paragraphs and concatenates each paragraph with the user's query as the input of the retrieval model. However, in existing public datasets, the model training phase still lacks relevance labels between paragraphs and user queries. Moreover, because semantic similarity and user click relevance are not fully equivalent, a user may click a candidate document with lower similarity.
In summary, how to obtain paragraph-level relevance labels without additional annotation data, and how to bridge semantic similarity and user click relevance, are urgent technical problems in long text retrieval.
Disclosure of Invention
The invention aims to solve the technical problems of obtaining paragraph-level relevance labels without additional annotated data and of bridging semantic similarity and user click relevance in long text retrieval, and creatively provides a long text retrieval method and system based on Gaussian kernel functions.
The innovation of the method lies in: computing the semantic similarity between each paragraph of the long text and the user's query with the semantic modeling capability of a pre-trained language model, and using this similarity as a pseudo label for user click relevance; mapping the pseudo labels to relevance scores in different dimensions by different Gaussian kernel functions; and aggregating the scores of all paragraphs of the long text with a linear layer to output the relevance score of the query with respect to the whole long text.
Advantageous effects
Compared with the prior art, the invention has the following advantages:
1. The invention uses the semantic modeling capability of a pre-trained language model to compute the semantic similarity between each paragraph and the user's query as a pseudo label, effectively alleviating the lack of paragraph-level annotation data.
2. The method maps the scalar pseudo label to a multi-dimensional vector with Gaussian kernel functions, so that paragraphs at different levels of semantic similarity all contribute to whether the user clicks. This strengthens the correlation between semantic similarity and user click relevance and improves the accuracy of the long text retrieval model.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of the system of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
A long text retrieval method based on Gaussian kernel functions, as shown in FIG. 1, includes the following steps:
Step 1: the long text is segmented.
Specifically, a length N is specified as the maximum length of each paragraph after segmentation. Within each window of length N, punctuation marks are preferred as segmentation points, so that the semantic integrity of each segmented paragraph is preserved.
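The segmentation step can be illustrated with a short sketch. This is a minimal, hypothetical implementation rather than the patent's own code: it cuts a long document into paragraphs of at most N characters, preferring the last punctuation mark inside each window as the cut point; the punctuation set and the character-based length limit are assumptions.

```python
# Minimal segmentation sketch (illustrative, not the patent's exact algorithm):
# split a long document into paragraphs of at most max_len characters,
# cutting at punctuation where possible to keep paragraphs semantically complete.
def split_long_text(text: str, max_len: int = 256) -> list[str]:
    paragraphs = []
    start = 0
    while start < len(text):
        window = text[start:start + max_len]
        # Prefer the last sentence-ending punctuation inside the window.
        cut = max(window.rfind(p) for p in "。！？；.!?;")
        if cut <= 0:                      # no punctuation found: fall back to a hard cut
            cut = len(window) - 1
        paragraphs.append(text[start:start + cut + 1])
        start += cut + 1
    return paragraphs
```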
Step 2: the user query and candidate paragraphs are scored using a pre-trained language model.
Specifically, the user's query and a candidate paragraph are concatenated and fed into the pre-trained language model, and the output sentence vector [CLS] is taken as the text feature interaction vector. A multi-layer perceptron (MLP) then estimates the semantic similarity between the query and the candidate paragraph, which is used as the pseudo label.
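A hedged sketch of this scoring step is shown below, assuming the Hugging Face transformers library and the bert-base-chinese checkpoint; the MLP head, its hidden size, and the checkpoint name are illustrative choices, not details given in the patent.

```python
# Sketch of the pseudo-label scoring step (assumptions: Hugging Face transformers,
# bert-base-chinese, an untrained illustrative MLP head).
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
bert = AutoModel.from_pretrained("bert-base-chinese")
mlp = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))

def pseudo_label(query: str, paragraph: str) -> torch.Tensor:
    # Concatenate query and paragraph as a sentence pair: [CLS] q [SEP] p [SEP]
    enc = tokenizer(query, paragraph, truncation=True, max_length=512,
                    return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc)
    cls_vec = out.last_hidden_state[:, 0]   # [CLS] text feature interaction vector
    return mlp(cls_vec).squeeze(-1)          # scalar semantic-similarity pseudo label
```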
Step 3: the pseudo label is mapped using Gaussian kernel functions.
Specifically, the scalar pseudo label is mapped to a multi-dimensional score vector by pre-designed Gaussian kernels with different means and the same variance. The vectors corresponding to the different paragraphs are then concatenated into a score matrix.
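The kernel mapping can be sketched as follows; the number of kernels, the spacing of the means, and the shared variance are illustrative assumptions rather than values specified by the patent.

```python
# Gaussian-kernel mapping sketch: each scalar pseudo label R_i is projected onto
# K kernels with different means and a shared variance, giving a K-dimensional
# score vector per paragraph.
import torch

K = 11
mus = torch.linspace(-1.0, 1.0, K)   # K different means (illustrative spacing)
sigma = 0.1                          # shared standard deviation, i.e. the same variance for every kernel

def kernel_map(pseudo_labels: torch.Tensor) -> torch.Tensor:
    # pseudo_labels: (num_paragraphs,) -> score matrix of shape (num_paragraphs, K)
    r = pseudo_labels.unsqueeze(-1)                  # (P, 1)
    return torch.exp(-(r - mus) ** 2 / (2 * sigma ** 2))
```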
Step 4: the user click relevance is judged using a linear layer.
Specifically, the score matrix is passed through a pooling layer and then into a linear layer; the MLP judges the contribution of each paragraph of the long text, at its similarity level, to the final user click relevance.
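A minimal sketch of this aggregation step, assuming the same kernel dimensionality as in the previous snippet; the patent does not fix the pooling variant or layer sizes, so these are illustrative.

```python
# Aggregation sketch: mean-pool the per-paragraph kernel scores over paragraphs,
# then map the pooled vector to a single click-relevance score with a linear layer.
import torch
import torch.nn as nn

K = 11
aggregator = nn.Linear(K, 1)

def document_score(score_matrix: torch.Tensor) -> torch.Tensor:
    # score_matrix: (num_paragraphs, K) -> scalar relevance of the query vs. the long text
    pooled = score_matrix.mean(dim=0)    # average pooling over paragraphs
    return aggregator(pooled).squeeze(-1)
```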
Further, the invention provides a long text retrieval system based on Gaussian kernel functions, as shown in FIG. 2, comprising a pseudo label calculation module, a Gaussian kernel mapping module and an output module.
The pseudo label calculation module segments the long document, concatenates each obtained paragraph with the user's query, and feeds the result into a pre-trained language model to obtain a text feature interaction vector. The interaction vector is then used as the input of a linear layer, whose output, the relevance between the query and each paragraph of the long text, serves as the pseudo label.
The Gaussian kernel mapping module maps each scalar pseudo label to a score vector through different Gaussian kernel functions.
The output module concatenates the score vectors of the paragraphs belonging to the same long text into a score matrix, applies average pooling, passes the result into a linear layer, and judges and integrates the relevance between the query and the long text under the different Gaussian kernels.
The connection relationship among the modules is as follows:
The output end of the pseudo label calculation module is connected with the input end of the Gaussian kernel mapping module. The output end of the Gaussian kernel mapping module is connected with the input end of the output module.
The working method of the system is as follows:
First, the long text is segmented in the pseudo label calculation module. To ensure the completeness of each segmented paragraph, the segmentation points are ranked by priority, with punctuation marks taking priority over the specified maximum paragraph length. The segmented paragraphs are then each concatenated with the user's query and fed into the pre-trained language model to obtain text feature interaction vectors. Finally, each interaction vector is passed into a linear layer, which outputs the relevance between the query and each paragraph of the long text as the pseudo label.
In the pseudo label calculation module, the pre-trained language model can be a BERT model, which produces the text feature interaction vector V_i as shown in Formula 1:
V_i = BERT(q, p_i)    (1)
where i ranges over 1, 2, 3, …, n, and n is the maximum number of paragraphs the long text can be segmented into; q is the user's query and p_i is the i-th paragraph of the long text.
The linear layer is a fully-connected neural network that maps the text feature interaction vector to a relevance score, as shown in Formula 2:
R_i = W * V_i + b    (2)
where R_i is the relevance score output by the model, W and b are model parameters learned by back propagation during training, and V_i is the text feature interaction vector of the i-th paragraph and the user's query.
The pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) was released by Google AI in October 2018.
In the Gaussian kernel mapping module, the means and variances of the different Gaussian kernels are initialized first; each kernel has a different mean but the same variance. The pseudo labels output by the pseudo label calculation module are then mapped through the different Gaussian kernels, and the results are concatenated into a score vector. The Gaussian kernel mapping is shown in Formula 3:
K_k(R_i) = exp(-(R_i - μ_k)^2 / (2σ_k^2))    (3)
where R_i is the pseudo label of the user's query q and the i-th paragraph, μ_k and σ_k are respectively the mean and variance of the k-th Gaussian kernel, and exp is the exponential function.
In the output module, the score vectors corresponding to the different paragraphs of the long text are first concatenated into a score matrix. After average pooling, the matrix is fed into a linear layer, which outputs the final relevance score between the user's query and the long text. In this way the MLP judges the contribution of each paragraph, at its similarity level, to the final user click relevance.
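The three modules can be chained into a single scoring function; the sketch below reuses the helper functions from the earlier snippets and is meant only to show the data flow, not the patent's exact model.

```python
# End-to-end data flow sketch (reuses split_long_text, pseudo_label, kernel_map
# and document_score from the snippets above; all settings are illustrative).
import torch

def retrieve_score(query: str, long_text: str) -> float:
    paragraphs = split_long_text(long_text, max_len=256)   # pseudo label module: segmentation
    labels = torch.stack([pseudo_label(query, p) for p in paragraphs]).squeeze(-1)
    score_matrix = kernel_map(labels)                       # Gaussian kernel mapping module
    return document_score(score_matrix).item()              # output module
```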

Claims (4)

1. A long text retrieval method based on a Gaussian kernel function is characterized by comprising the following steps:
step 1: segmenting the long text;
specifying a length N as the maximum length of each paragraph after the long text is segmented; within each window of length N, punctuation marks are preferred as segmentation points, so that the semantic integrity of each segmented paragraph is preserved;
step 2: scoring the user query and the candidate passage using a pre-trained language model;
concatenating the user's query with a candidate paragraph, feeding the result into the pre-trained language model, and taking the output sentence vector [CLS] as the text feature interaction vector; then using a multi-layer perceptron (MLP) to estimate the semantic similarity between the query and the candidate paragraph as the pseudo label;
step 3: mapping the pseudo label by using Gaussian kernel functions;
mapping the scalar pseudo label to a multi-dimensional score vector through pre-designed Gaussian kernels with different means and the same variance; then concatenating the vectors corresponding to the different paragraphs into a score matrix;
step 4: judging the click relevance of the user by using a linear layer;
passing the score matrix through a pooling layer and then into a linear layer, and using the MLP to judge the contribution of each paragraph of the long text, at its similarity level, to the final user click relevance.
2. A long text retrieval system based on a Gaussian kernel function is characterized by comprising a pseudo label calculation module, a Gaussian kernel mapping module and an output module;
the pseudo label calculation module segments the long document, concatenates each obtained paragraph with the user's query, and feeds the result into a pre-trained language model to obtain a text feature interaction vector; the interaction vector is then used as the input of a linear layer, whose output, the relevance between the query and each paragraph of the long text, serves as the pseudo label;
the Gaussian kernel mapping module maps each scalar pseudo label to a score vector through different Gaussian kernel functions;
the output module concatenates the score vectors of the paragraphs belonging to the same long text into a score matrix, applies average pooling, passes the result into a linear layer, and judges and integrates the relevance between the query and the long text under the different Gaussian kernels;
the connection relationship among the modules is as follows:
the output end of the pseudo label calculation module is connected with the input end of the Gaussian kernel mapping module; the output end of the Gaussian kernel mapping module is connected with the input end of the output module.
3. The long text retrieval system based on a Gaussian kernel function as defined in claim 2, wherein:
firstly, the long text is segmented in the pseudo label calculation module; the segmentation points are ranked by priority, with punctuation marks taking priority over the specified maximum paragraph length; the segmented paragraphs are then each concatenated with the user's query and fed into the pre-trained language model to obtain text feature interaction vectors; finally, each interaction vector is passed into a linear layer, which outputs the relevance between the query and each paragraph of the long text as the pseudo label;
in the pseudo label calculation module, the pre-trained language model produces the text feature interaction vector V_i as shown in Formula 1:
V_i = BERT(q, p_i)    (1)
where i ranges over 1, 2, 3, …, n, and n is the maximum number of paragraphs the long text can be segmented into; q is the user's query and p_i is the i-th paragraph of the long text;
the linear layer is a fully-connected neural network that maps the text feature interaction vector to a relevance score, as shown in Formula 2:
R_i = W * V_i + b    (2)
where R_i is the relevance score output by the model, W and b are model parameters learned by back propagation during training, and V_i is the text feature interaction vector of the i-th paragraph and the user's query;
in the Gaussian kernel mapping module, the means and variances of the different Gaussian kernels are initialized first, each kernel having a different mean but the same variance; the pseudo labels output by the pseudo label calculation module are then mapped through the different Gaussian kernels, and the results are concatenated into a score vector; the Gaussian kernel mapping is shown in Formula 3:
K_k(R_i) = exp(-(R_i - μ_k)^2 / (2σ_k^2))    (3)
where R_i is the pseudo label of the user's query q and the i-th paragraph, μ_k and σ_k are respectively the mean and variance of the k-th Gaussian kernel, and exp is the exponential function;
in the output module, the score vectors corresponding to the different paragraphs of the long text are first concatenated into a score matrix; after average pooling, the matrix is fed into a linear layer, which outputs the final relevance score between the user's query and the long text; in this way the MLP judges the contribution of each paragraph, at its similarity level, to the final user click relevance.
4. The long text retrieval system based on a Gaussian kernel function of claim 2, wherein the pre-trained language model is a BERT model.
CN202111512377.1A 2021-12-08 2021-12-08 Long text retrieval method and system based on Gaussian kernel function Pending CN114328863A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111512377.1A CN114328863A (en) 2021-12-08 2021-12-08 Long text retrieval method and system based on Gaussian kernel function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111512377.1A CN114328863A (en) 2021-12-08 2021-12-08 Long text retrieval method and system based on Gaussian kernel function

Publications (1)

Publication Number Publication Date
CN114328863A true CN114328863A (en) 2022-04-12

Family

ID=81050052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111512377.1A Pending CN114328863A (en) 2021-12-08 2021-12-08 Long text retrieval method and system based on Gaussian kernel function

Country Status (1)

Country Link
CN (1) CN114328863A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117688138A (en) * 2024-02-02 2024-03-12 中船凌久高科(武汉)有限公司 Long text similarity comparison method based on paragraph division
CN117688138B (en) * 2024-02-02 2024-04-09 中船凌久高科(武汉)有限公司 Long text similarity comparison method based on paragraph division

Similar Documents

Publication Publication Date Title
CN111104794B (en) Text similarity matching method based on subject term
CN110633409B (en) Automobile news event extraction method integrating rules and deep learning
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN109918666B (en) Chinese punctuation mark adding method based on neural network
CN109829104B (en) Semantic similarity based pseudo-correlation feedback model information retrieval method and system
CN109800284B (en) Task-oriented unstructured information intelligent question-answering system construction method
CN110442777B (en) BERT-based pseudo-correlation feedback model information retrieval method and system
CN106970910B (en) Keyword extraction method and device based on graph model
CN113239700A (en) Text semantic matching device, system, method and storage medium for improving BERT
CN110929038B (en) Knowledge graph-based entity linking method, device, equipment and storage medium
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN111522910B (en) Intelligent semantic retrieval method based on cultural relic knowledge graph
CN110765277B (en) Knowledge-graph-based mobile terminal online equipment fault diagnosis method
US20110191374A1 (en) Joint Embedding for Item Association
CN110674252A (en) High-precision semantic search system for judicial domain
JP2005526317A (en) Method and system for automatically searching a concept hierarchy from a document corpus
CN113377897B (en) Multi-language medical term standard standardization system and method based on deep confrontation learning
CN103646112A (en) Dependency parsing field self-adaption method based on web search
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN113220864B (en) Intelligent question-answering data processing system
CN113761890A (en) BERT context sensing-based multi-level semantic information retrieval method
CN114090861A (en) Education field search engine construction method based on knowledge graph
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN111325018A (en) Domain dictionary construction method based on web retrieval and new word discovery
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination