CN114328863A - Long text retrieval method and system based on Gaussian kernel function - Google Patents

Info

Publication number
CN114328863A
CN114328863A (application CN202111512377.1A)
Authority
CN
China
Prior art keywords
long text
user
gaussian kernel
text
retrieval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111512377.1A
Other languages
Chinese (zh)
Inventor
史树敏
朱乐
黄河燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202111512377.1A priority Critical patent/CN114328863A/en
Publication of CN114328863A publication Critical patent/CN114328863A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a long text retrieval method and system based on Gaussian kernel functions, and belongs to the technical field of information retrieval. The method uses the semantic modeling capability of a pre-trained language model to compute the semantic similarity between each paragraph of a long text and the user's query, and treats this similarity as a pseudo label for user click relevance, effectively alleviating the lack of paragraph-level annotation data. The pseudo labels are mapped to relevance scores in different dimensions by different Gaussian kernel functions. A linear layer aggregates the scores of all paragraphs of the long text and outputs the relevance score of the query with respect to the whole long text, so that paragraphs at different levels of semantic similarity all contribute to user click relevance. This strengthens the correlation between semantic similarity and click relevance and improves the accuracy of the long text retrieval model.

Description

Long text retrieval method and system based on Gaussian kernel function
Technical Field
The invention relates to a long text retrieval method and system, in particular to a long text retrieval method and system based on Gaussian kernel functions, and belongs to the technical field of information retrieval.
Background
Long text retrieval is a fundamental task in the field of information retrieval. It is characterized by documents with a long average length, where a single document may cover multiple topics. Traditional retrieval models have difficulty locating, within a long text, the topics that match the user's click intent.
In recent years, pre-trained language models have excelled in the field of information retrieval. Their strong contextual semantic modeling capability allows a retrieval model to better estimate the semantic similarity between the user's query and a candidate document, improving the accuracy with which the model judges whether the two are relevant. However, in long text retrieval the input length is limited, and a pre-trained language model cannot compute the semantic similarity between the user's query and the entire long text.
At present, the prior art mainly segments the long text into paragraphs and concatenates each paragraph with the user's query as the input of the retrieval model. However, in existing public datasets, the model training phase still lacks relevance labels between paragraphs and user queries. Moreover, because semantic similarity and user click relevance are not fully equivalent, a user may click a candidate document with lower similarity.
In summary, how to obtain paragraph-level relevance labels without additional annotation data, and how to bridge semantic similarity and user click relevance, are urgent technical problems in long text retrieval.
Disclosure of Invention
The invention aims to solve the technical problems of obtaining paragraph-level relevance labels without additional annotated data and of bridging semantic similarity and user click relevance in long text retrieval, and creatively provides a long text retrieval method and system based on Gaussian kernel functions.
The innovation of the method lies in: computing the semantic similarity between each paragraph of the long text and the user's query with the semantic modeling capability of a pre-trained language model, and using this similarity as a pseudo label for user click relevance; mapping the pseudo labels to relevance scores in different dimensions by different Gaussian kernel functions; and aggregating the scores of all paragraphs of the long text with a linear layer to output the relevance score of the query with respect to the whole long text.
Advantageous effects
Compared with the prior art, the invention has the following advantages:
1. The invention uses the semantic modeling capability of a pre-trained language model to compute the semantic similarity between each paragraph and the user's query as a pseudo label, effectively alleviating the lack of paragraph-level annotation data.
2. The method maps the scalar pseudo label to a multi-dimensional vector with Gaussian kernel functions, so that paragraphs at different levels of semantic similarity all contribute to whether the user clicks. This strengthens the correlation between semantic similarity and user click relevance and improves the accuracy of the long text retrieval model.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of the system of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
A long text retrieval method based on Gaussian kernel functions, as shown in FIG. 1, includes the following steps:
Step 1: the long text is segmented.
Specifically, a length N is specified as the maximum length of each paragraph after segmentation. Within each window of length N, punctuation marks are preferred as segmentation points, so that the semantic integrity of each segmented paragraph is preserved.
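The segmentation step can be illustrated with a short sketch. This is a minimal, hypothetical implementation rather than the patent's own code: it cuts a long document into paragraphs of at most N characters, preferring the last punctuation mark inside each window as the cut point; the punctuation set and the character-based length limit are assumptions.

```python
# Minimal segmentation sketch (illustrative, not the patent's exact algorithm):
# split a long document into paragraphs of at most max_len characters,
# cutting at punctuation where possible to keep paragraphs semantically complete.
def split_long_text(text: str, max_len: int = 256) -> list[str]:
    paragraphs = []
    start = 0
    while start < len(text):
        window = text[start:start + max_len]
        # Prefer the last sentence-ending punctuation inside the window.
        cut = max(window.rfind(p) for p in "。！？；.!?;")
        if cut <= 0:                      # no punctuation found: fall back to a hard cut
            cut = len(window) - 1
        paragraphs.append(text[start:start + cut + 1])
        start += cut + 1
    return paragraphs
```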
Step 2: the user query and candidate paragraphs are scored using a pre-trained language model.
Specifically, the user's query and a candidate paragraph are concatenated and fed into the pre-trained language model, and the output sentence vector [CLS] is taken as the text feature interaction vector. A multi-layer perceptron (MLP) then estimates the semantic similarity between the query and the candidate paragraph, which is used as the pseudo label.
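A hedged sketch of this scoring step is shown below, assuming the Hugging Face transformers library and the bert-base-chinese checkpoint; the MLP head, its hidden size, and the checkpoint name are illustrative choices, not details given in the patent.

```python
# Sketch of the pseudo-label scoring step (assumptions: Hugging Face transformers,
# bert-base-chinese, an untrained illustrative MLP head).
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
bert = AutoModel.from_pretrained("bert-base-chinese")
mlp = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))

def pseudo_label(query: str, paragraph: str) -> torch.Tensor:
    # Concatenate query and paragraph as a sentence pair: [CLS] q [SEP] p [SEP]
    enc = tokenizer(query, paragraph, truncation=True, max_length=512,
                    return_tensors="pt")
    with torch.no_grad():
        out = bert(**enc)
    cls_vec = out.last_hidden_state[:, 0]   # [CLS] text feature interaction vector
    return mlp(cls_vec).squeeze(-1)          # scalar semantic-similarity pseudo label
```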
Step 3: the pseudo label is mapped using Gaussian kernel functions.
Specifically, the scalar pseudo label is mapped to a multi-dimensional score vector by pre-designed Gaussian kernels with different means and the same variance. The vectors corresponding to the different paragraphs are then concatenated into a score matrix.
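The kernel mapping can be sketched as follows; the number of kernels, the spacing of the means, and the shared variance are illustrative assumptions rather than values specified by the patent.

```python
# Gaussian-kernel mapping sketch: each scalar pseudo label R_i is projected onto
# K kernels with different means and a shared variance, giving a K-dimensional
# score vector per paragraph.
import torch

K = 11
mus = torch.linspace(-1.0, 1.0, K)   # K different means (illustrative spacing)
sigma = 0.1                          # shared standard deviation, i.e. the same variance for every kernel

def kernel_map(pseudo_labels: torch.Tensor) -> torch.Tensor:
    # pseudo_labels: (num_paragraphs,) -> score matrix of shape (num_paragraphs, K)
    r = pseudo_labels.unsqueeze(-1)                  # (P, 1)
    return torch.exp(-(r - mus) ** 2 / (2 * sigma ** 2))
```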
Step 4: the user click relevance is judged using a linear layer.
Specifically, the score matrix is passed through a pooling layer and then into a linear layer; the MLP judges the contribution of each paragraph of the long text, at its similarity level, to the final user click relevance.
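A minimal sketch of this aggregation step, assuming the same kernel dimensionality as in the previous snippet; the patent does not fix the pooling variant or layer sizes, so these are illustrative.

```python
# Aggregation sketch: mean-pool the per-paragraph kernel scores over paragraphs,
# then map the pooled vector to a single click-relevance score with a linear layer.
import torch
import torch.nn as nn

K = 11
aggregator = nn.Linear(K, 1)

def document_score(score_matrix: torch.Tensor) -> torch.Tensor:
    # score_matrix: (num_paragraphs, K) -> scalar relevance of the query vs. the long text
    pooled = score_matrix.mean(dim=0)    # average pooling over paragraphs
    return aggregator(pooled).squeeze(-1)
```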
Further, the invention provides a long text retrieval system based on Gaussian kernel functions, as shown in FIG. 2, comprising a pseudo label calculation module, a Gaussian kernel mapping module and an output module.
The pseudo label calculation module segments the long document, concatenates each obtained paragraph with the user's query, and feeds the result into a pre-trained language model to obtain a text feature interaction vector. The interaction vector is then used as the input of a linear layer, whose output, the relevance between the query and each paragraph of the long text, serves as the pseudo label.
The Gaussian kernel mapping module maps each scalar pseudo label to a score vector through different Gaussian kernel functions.
The output module concatenates the score vectors of the paragraphs belonging to the same long text into a score matrix, applies average pooling, passes the result into a linear layer, and judges and integrates the relevance between the query and the long text under the different Gaussian kernels.
The connection relationship among the modules is as follows:
The output end of the pseudo label calculation module is connected with the input end of the Gaussian kernel mapping module. The output end of the Gaussian kernel mapping module is connected with the input end of the output module.
The working method of the system is as follows:
First, the long text is segmented in the pseudo label calculation module. To ensure the completeness of each segmented paragraph, the segmentation points are ranked by priority, with punctuation marks taking priority over the specified maximum paragraph length. The segmented paragraphs are then each concatenated with the user's query and fed into the pre-trained language model to obtain text feature interaction vectors. Finally, each interaction vector is passed into a linear layer, which outputs the relevance between the query and each paragraph of the long text as the pseudo label.
In the pseudo label calculation module, the pre-trained language model can be a BERT model, which produces the text feature interaction vector V_i as shown in Formula 1:
V_i = BERT(q, p_i)    (1)
where i ranges over 1, 2, 3, …, n, and n is the maximum number of paragraphs the long text can be segmented into; q is the user's query and p_i is the i-th paragraph of the long text.
The linear layer is a fully-connected neural network that maps the text feature interaction vector to a relevance score, as shown in Formula 2:
R_i = W * V_i + b    (2)
where R_i is the relevance score output by the model, W and b are model parameters learned by back propagation during training, and V_i is the text feature interaction vector of the i-th paragraph and the user's query.
The pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) was released by Google AI in October 2018.
In the Gaussian kernel mapping module, the means and variances of the different Gaussian kernels are initialized first; each kernel has a different mean but the same variance. The pseudo labels output by the pseudo label calculation module are then mapped through the different Gaussian kernels, and the results are concatenated into a score vector. The Gaussian kernel mapping is shown in Formula 3:
K_k(R_i) = exp(-(R_i - μ_k)^2 / (2σ_k^2))    (3)
where R_i is the pseudo label of the user's query q and the i-th paragraph, μ_k and σ_k are respectively the mean and variance of the k-th Gaussian kernel, and exp is the exponential function.
In the output module, the score vectors corresponding to the different paragraphs of the long text are first concatenated into a score matrix. After average pooling, the matrix is fed into a linear layer, which outputs the final relevance score between the user's query and the long text. In this way the MLP judges the contribution of each paragraph, at its similarity level, to the final user click relevance.
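The three modules can be chained into a single scoring function; the sketch below reuses the helper functions from the earlier snippets and is meant only to show the data flow, not the patent's exact model.

```python
# End-to-end data flow sketch (reuses split_long_text, pseudo_label, kernel_map
# and document_score from the snippets above; all settings are illustrative).
import torch

def retrieve_score(query: str, long_text: str) -> float:
    paragraphs = split_long_text(long_text, max_len=256)   # pseudo label module: segmentation
    labels = torch.stack([pseudo_label(query, p) for p in paragraphs]).squeeze(-1)
    score_matrix = kernel_map(labels)                       # Gaussian kernel mapping module
    return document_score(score_matrix).item()              # output module
```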

Claims (4)

1. A long text retrieval method based on a Gaussian kernel function is characterized by comprising the following steps:
step 1: segmenting the long text;
specifying a length N as the maximum length of each paragraph after the long text is segmented; within each window of length N, punctuation marks are preferred as segmentation points, so that the semantic integrity of each segmented paragraph is preserved;
step 2: scoring the user query and the candidate passage using a pre-trained language model;
concatenating the user's query with a candidate paragraph, feeding the result into the pre-trained language model, and taking the output sentence vector [CLS] as the text feature interaction vector; then using a multi-layer perceptron (MLP) to estimate the semantic similarity between the query and the candidate paragraph as the pseudo label;
step 3: mapping the pseudo label by using Gaussian kernel functions;
mapping the scalar pseudo label to a multi-dimensional score vector through pre-designed Gaussian kernels with different means and the same variance; then concatenating the vectors corresponding to the different paragraphs into a score matrix;
step 4: judging the click relevance of the user by using a linear layer;
passing the score matrix through a pooling layer and then into a linear layer, and using the MLP to judge the contribution of each paragraph of the long text, at its similarity level, to the final user click relevance.
2. A long text retrieval system based on a Gaussian kernel function is characterized by comprising a pseudo label calculation module, a Gaussian kernel mapping module and an output module;
the pseudo label calculation module segments the long document, concatenates each obtained paragraph with the user's query, and feeds the result into a pre-trained language model to obtain a text feature interaction vector; the interaction vector is then used as the input of a linear layer, whose output, the relevance between the query and each paragraph of the long text, serves as the pseudo label;
the Gaussian kernel mapping module maps each scalar pseudo label to a score vector through different Gaussian kernel functions;
the output module concatenates the score vectors of the paragraphs belonging to the same long text into a score matrix, applies average pooling, passes the result into a linear layer, and judges and integrates the relevance between the query and the long text under the different Gaussian kernels;
the connection relationship among the modules is as follows:
the output end of the pseudo label calculation module is connected with the input end of the Gaussian kernel mapping module; the output end of the Gaussian kernel mapping module is connected with the input end of the output module.
3. The long text retrieval system based on a Gaussian kernel function as defined in claim 2, wherein:
firstly, the long text is segmented in the pseudo label calculation module; the segmentation points are ranked by priority, with punctuation marks taking priority over the specified maximum paragraph length; the segmented paragraphs are then each concatenated with the user's query and fed into the pre-trained language model to obtain text feature interaction vectors; finally, each interaction vector is passed into a linear layer, which outputs the relevance between the query and each paragraph of the long text as the pseudo label;
in the pseudo label calculation module, the pre-trained language model produces the text feature interaction vector V_i as shown in Formula 1:
V_i = BERT(q, p_i)    (1)
where i ranges over 1, 2, 3, …, n, and n is the maximum number of paragraphs the long text can be segmented into; q is the user's query and p_i is the i-th paragraph of the long text;
the linear layer is a fully-connected neural network that maps the text feature interaction vector to a relevance score, as shown in Formula 2:
R_i = W * V_i + b    (2)
where R_i is the relevance score output by the model, W and b are model parameters learned by back propagation during training, and V_i is the text feature interaction vector of the i-th paragraph and the user's query;
in the Gaussian kernel mapping module, the means and variances of the different Gaussian kernels are initialized first, each kernel having a different mean but the same variance; the pseudo labels output by the pseudo label calculation module are then mapped through the different Gaussian kernels, and the results are concatenated into a score vector; the Gaussian kernel mapping is shown in Formula 3:
K_k(R_i) = exp(-(R_i - μ_k)^2 / (2σ_k^2))    (3)
where R_i is the pseudo label of the user's query q and the i-th paragraph, μ_k and σ_k are respectively the mean and variance of the k-th Gaussian kernel, and exp is the exponential function;
in the output module, the score vectors corresponding to the different paragraphs of the long text are first concatenated into a score matrix; after average pooling, the matrix is fed into a linear layer, which outputs the final relevance score between the user's query and the long text; in this way the MLP judges the contribution of each paragraph, at its similarity level, to the final user click relevance.
4. The long text retrieval system based on a Gaussian kernel function of claim 2, wherein the pre-trained language model is a BERT model.
CN202111512377.1A 2021-12-08 2021-12-08 Long text retrieval method and system based on Gaussian kernel function Pending CN114328863A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111512377.1A CN114328863A (en) 2021-12-08 2021-12-08 Long text retrieval method and system based on Gaussian kernel function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111512377.1A CN114328863A (en) 2021-12-08 2021-12-08 Long text retrieval method and system based on Gaussian kernel function

Publications (1)

Publication Number Publication Date
CN114328863A true CN114328863A (en) 2022-04-12

Family

ID=81050052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111512377.1A Pending CN114328863A (en) 2021-12-08 2021-12-08 Long text retrieval method and system based on Gaussian kernel function

Country Status (1)

Country Link
CN (1) CN114328863A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117688138A (en) * 2024-02-02 2024-03-12 中船凌久高科(武汉)有限公司 Long text similarity comparison method based on paragraph division
CN117688138B (en) * 2024-02-02 2024-04-09 中船凌久高科(武汉)有限公司 Long text similarity comparison method based on paragraph division

Similar Documents

Publication Publication Date Title
CN111104794B (en) Text similarity matching method based on subject term
CN110633409B (en) Automobile news event extraction method integrating rules and deep learning
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN109918666B (en) Chinese punctuation mark adding method based on neural network
CN109829104B (en) Semantic similarity based pseudo-correlation feedback model information retrieval method and system
CN109800284B (en) Task-oriented unstructured information intelligent question-answering system construction method
CN110442777B (en) BERT-based pseudo-correlation feedback model information retrieval method and system
CN106970910B (en) Keyword extraction method and device based on graph model
CN113239700A (en) Text semantic matching device, system, method and storage medium for improving BERT
CN110929038B (en) Knowledge graph-based entity linking method, device, equipment and storage medium
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN111522910B (en) Intelligent semantic retrieval method based on cultural relic knowledge graph
CN110765277B (en) Knowledge-graph-based mobile terminal online equipment fault diagnosis method
US20110191374A1 (en) Joint Embedding for Item Association
CN110674252A (en) High-precision semantic search system for judicial domain
JP2005526317A (en) Method and system for automatically searching a concept hierarchy from a document corpus
CN113377897B (en) Multi-language medical term standard standardization system and method based on deep confrontation learning
CN103646112A (en) Dependency parsing field self-adaption method based on web search
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN113220864B (en) Intelligent question-answering data processing system
CN113761890A (en) BERT context sensing-based multi-level semantic information retrieval method
CN114090861A (en) Education field search engine construction method based on knowledge graph
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN111325018A (en) Domain dictionary construction method based on web retrieval and new word discovery
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination