CN115344668A - Multi-field and multi-disciplinary science and technology policy resource retrieval method and device - Google Patents

Multi-field and multi-disciplinary science and technology policy resource retrieval method and device

Info

Publication number
CN115344668A
Authority
CN
China
Prior art keywords: text, score, preset, user query, model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210846693.0A
Other languages
Chinese (zh)
Inventor
杜军平
喻博文
邵蓥侠
薛哲
李昂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority: CN202210846693.0A
Publication of CN115344668A
Legal status: Pending

Classifications

    • G06F16/3346 — Information retrieval of unstructured textual data; querying; query processing; query execution using a probabilistic model
    • G06F16/3334 — Information retrieval of unstructured textual data; querying; query processing; query translation; selection or weighting of terms from queries, including natural language queries
    • G06F16/3344 — Information retrieval of unstructured textual data; querying; query processing; query execution using natural language analysis
    • G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a multi-field and multi-disciplinary science and technology policy resource retrieval method and device. The similarity between a user query and the text of each science and technology policy resource is first computed with a traditional preset correlation calculation model to preliminarily recall and coarsely rank candidate documents, and the preliminarily recalled candidate documents are then corrected and re-ranked with the deep language model BERT to output the final query result. The BERT model is pre-trained on unlabeled text and fine-tuned with labeled text from the specific field, which improves its semantic capture capability in the text matching task. Each candidate text is divided into a plurality of text segments, the similarity between each text segment and the user query is calculated separately, and the results are aggregated into a second relevance score, which solves the input-length limitation of the BERT model. Through two-stage query retrieval, the invention integrates features at the vocabulary, word-sense and structural levels and improves the precision of text matching.

Description

Multi-field and multi-disciplinary science and technology policy resource retrieval method and device
Technical Field
The invention relates to the technical field of text matching, and in particular to a multi-field and multi-disciplinary science and technology policy resource retrieval method and device.
Background
Accurate querying depends on the semantic measurement of text similarity, which can also be called semantic matching.
With the rapid development of deep learning, more and more research has been devoted to applying deep neural network models to natural language matching tasks in order to reduce the cost of feature engineering. In terms of model development, matching models can be divided into single-semantic models, multi-semantic models and matching-matrix models. A single-semantic model encodes the two sentences and then computes their similarity, without considering local features of the phrases within a sentence; a multi-semantic model reads the sentences to be matched at multiple granularities and takes local features such as words and phrases into account; a matching-matrix model considers the pairwise interaction of the sentences to be matched and extracts features with a deep network after interaction, so that deeper relationships between the sentences can be captured. In terms of their nature, matching models fall into two types: representation-based models and interaction-based models. A representation-based model computes the similarity of the two sentences to be matched only at the last layer, whereas an interaction-based model lets the two sentences interact as early as possible and makes full use of the interaction features.
For example, DSSM (Deep Structured Semantic Model), the progenitor of matching models, maps sentences into a vector space, feeds them into a deep neural network to extract features, obtains 128-dimensional feature vectors, and uses the cosine distance (i.e., cosine similarity) at the matching layer. The advantages of DSSM are that the semantic similarity of many query/document pairs can be computed quickly, and that, compared with a plain word-vector approach, its supervised training gives much higher accuracy while its processing of single words or characters does not depend on the correctness of word segmentation. Its disadvantage is that the word-vector representation uses a bag-of-words model and ignores the position information of words, which is a great loss for semantic understanding. The improved CDSSM (Convolutional Deep Structured Semantic Model) extracts context information under a sliding window at the input layer and global context information through convolutional and pooling layers, effectively preserving context, but longer-distance context dependencies still cannot be captured because of the limited sliding-window size. Other existing text matching methods have similar defects, so traditional text matching techniques mainly solve matching at the vocabulary level and have difficulty handling word-sense and structural limitations.
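As a point of reference for the representation-based models discussed above, the following minimal sketch (not taken from the patent) shows the cosine-similarity matching layer used by DSSM-style models; the 128-dimensional vectors stand in for the output of the deep network and are generated randomly here purely for illustration.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine distance used at the matching layer of DSSM-style models.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Hypothetical 128-dimensional feature vectors produced by the deep network.
query_vec = np.random.rand(128)
doc_vec = np.random.rand(128)
print(cosine_similarity(query_vec, doc_vec))
```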
Disclosure of Invention
In view of this, embodiments of the present invention provide a multi-field and multi-disciplinary science and technology policy resource retrieval method and apparatus, so as to eliminate or mitigate one or more defects in the prior art and to address the word-sense and structural limitations of conventional text matching methods.
The technical scheme of the invention is as follows:
in one aspect, the invention provides a resource retrieval method for multi-domain and multi-disciplinary scientific and technological policies, comprising the following steps:
acquiring a plurality of scientific and technological policy resource texts in a user query and setting range;
respectively calculating first relevance scores of the user query and the science and technology policy resource texts by adopting a preset relevance calculation model based on statistical characteristics, and recalling a plurality of science and technology policy resource texts with higher first relevance scores as candidate documents according to a first set quantity or a first set proportion;
calculating a second relevance score for the user query and the candidate document based on a preset BERT model, comprising:
dividing the candidate document into a plurality of text sections according to a first set length, and setting overlap of a second set length between every two text sections;
respectively calculating third relevance scores of the user query and each text segment by adopting the preset BERT model and aggregating to obtain second relevance scores of the user query and the candidate documents;
calculating a comprehensive score according to the first relevance score and the second relevance score of the user query and each candidate document according to a set rule, and taking the candidate document with the highest comprehensive score as a retrieval result;
the method comprises the following steps that a plurality of label-free fields and subject policy text corpora are adopted for pre-training the preset BERT model, and the preset BERT model is further subjected to fine adjustment by adopting a preset label data set of a target field.
In some embodiments, the preset correlation calculation model adopts a BM25 model; the first relevance score score(Q, D) between the user query and each science and technology policy resource text is calculated with the preset correlation calculation model as:
score(Q, D) = Σ_{i=1}^{n} IDF(q_i) · [tf(q_i, D) · (k_1 + 1)] / [tf(q_i, D) + k_1 · (1 − b + b · |D| / avgdl)]
where k_1 and b are adjustable coefficients, q_i denotes the i-th keyword in the user query, |D| denotes the document length of the science and technology policy resource text, and avgdl denotes the average document length of the science and technology policy resource texts; tf(q_i, D) denotes the term frequency; IDF(q_i) denotes the inverse document frequency of q_i, calculated as:
IDF(q_i) = log[ (N − n(q_i) + 0.5) / (n(q_i) + 0.5) ]
where N denotes the total number of documents and n(q_i) denotes the number of documents containing the query term q_i.
In some embodiments, the calculating the third relevance scores of the user query and each text segment by using the preset BERT model and aggregating the third relevance scores to obtain the second relevance scores of the user query and the candidate documents respectively includes:
calculating the third relevance scores of the user query and each text segment and aggregating them by first-segment score aggregation, average score aggregation or maximum score aggregation to obtain the second relevance score;
calculating the second relevance score with first-segment score aggregation, expressed as:
s_stage2(Q, D) = s_1(Q, P_1)
where s_stage2(Q, D) denotes the second relevance score and s_1(Q, P_1) denotes the third relevance score between the user query and the first text segment of the candidate document;
calculating the second relevance score with average score aggregation, expressed as:
s_stage2(Q, D) = (1/n) · Σ_{i=1}^{n} s_i(Q, P_i)
where s_i(Q, P_i) denotes the third relevance score between the user query and the i-th text segment of the candidate document;
calculating the second relevance score with maximum score aggregation, expressed as:
s_stage2(Q, D) = max( s_1(Q, P_1), s_2(Q, P_2), …, s_n(Q, P_n) )
where s_n(Q, P_n) denotes the third relevance score between the user query and the n-th text segment of the candidate document.
In some embodiments, the pre-set BERT model is pre-trained using label-free multiple domain and subject policy text corpora, including: and pre-training the preset BERT model by adopting an MLM task.
In some embodiments, pre-training the preset BERT model with the MLM task comprises: acquiring a category feature bag-of-words; masking first the tokens (lemmas) that appear in the category feature bag-of-words for the policy text corpora of the plurality of fields and disciplines, and applying a random MASK strategy to the remaining tokens in the corpora, wherein a masked token is replaced with [MASK] with a first set probability, replaced with a random token from a preset dictionary with a second set probability, and kept unchanged with a third set probability.
In some embodiments, the preset BERT model is pre-trained with the MLM task, and the loss function used is:
L = − Σ_{x̂ ∈ m(x)} log p( x̂ | x_{\m(x)} )
where m(x) denotes the set of masked words in the sequence x, x_{\m(x)} denotes the remaining words in the sequence, x̂ denotes a masked word, and x = [x_1, x_2, …, x_T], with x_T denoting the T-th word.
In some embodiments, the preset BERT model further employs a preset labeled data set of a target domain for fine tuning, including:
forming a positive sample by using the questions and answers in a preset labeling data set of the target field, and forming a negative sample by using the questions and the non-answers;
taking the preset BERT model as a binary classifier, taking the token vector of the token of the last hidden layer as the feature representation, feeding it into a single-hidden-layer neural network for training, and using the output probability score as the similarity score between the user query and the candidate text;
fine-tuning the preset BERT model with the positive samples and negative samples in the preset labeled data set, using the loss function:
L = − Σ_{i ∈ I_pos} log(s_i) − Σ_{j ∈ I_neg} log(1 − s_j)
where s_i denotes the predicted output score of the neural network, I_pos denotes the set of positive samples, and I_neg denotes the set of negative samples.
In some embodiments, fine-tuning the preset BERT model with the preset labeled data set of the target field further includes: dividing the preset labeled data set into a training set, a validation set and a test set according to a second set proportion for cross-validation training.
In another aspect, the present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method.
In another aspect, the present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above method.
The invention has the beneficial effects that at least:
the invention relates to a multi-field and multi-disciplinary science and technology policy resource retrieval method and a device, which calculate the similarity of text segments of user query and various disciplinary policy resources through a traditional preset correlation calculation model, preliminarily recall and roughly arrange candidate documents, and further revise and rearrange the preliminarily recalled candidate documents by referring to a deep language model BERT so as to finally output a query result. The BERT model is pre-trained on the basis of the text without labels, and fine tuning is performed by adopting the text with labels in a specific field, so that the semantic capture capability in the process of completing the text matching task is improved. The candidate text is divided into a plurality of text segments, the similarity between each text segment and the user query is calculated respectively, and then aggregation is carried out to obtain a second relevance score, so that the problem of BERT model input limitation is solved. The invention integrates the characteristics of vocabulary, word meaning and structure level through two-stage query retrieval, and improves the precision of text matching.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to what has been particularly described hereinabove, and that the above and other objects that can be achieved with the present invention will be more clearly understood from the following detailed description.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
fig. 1 is a flowchart illustrating a method for searching resources according to a multi-domain and multi-disciplinary science and technology policy according to an embodiment of the present invention.
Fig. 2 is a logic diagram illustrating a method for searching resources in a multi-domain and multi-disciplinary science and technology policy according to an embodiment of the present invention.
Fig. 3 is a logic diagram illustrating a method for computing relevance scores between a user query and candidate documents in a multi-domain and multi-disciplinary scientific and technological policy resource retrieval method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the following embodiments and the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.
It should be emphasized that the term "comprises/comprising" when used herein, is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
For text matching tasks, traditional techniques include methods such as BoW, TF-IDF, BM25, Jaccard and SimHash; they mainly solve matching at the vocabulary level but have difficulty handling word-sense and structural limitations.
On the basis of the mined science and technology policy resources, the invention provides a multi-field and multi-disciplinary retrieval method for science and technology policy resources, addressing the problems that policy-domain data lacks labels and that the data set required by a supervised learning-to-rank model is difficult to construct. For the retrieval content entered by the user, science and technology policy documents are preliminarily recalled with a correlation calculation model based on statistical characteristics, and the candidate related documents are further ranked by introducing the deep language model BERT. In this method, policy-domain knowledge is injected into the BERT model through domain pre-training, and matching scores are computed by segmenting long texts and aggregating paragraph scores, which solves the input-length limitation of the BERT model.
Specifically, the present invention provides a method for searching resources in a multi-domain and multi-disciplinary scientific and technological policy, as shown in fig. 1 and 2, comprising steps S101 to S104:
step S101: and acquiring a plurality of scientific and technological policy resource texts in a range queried and set by a user.
Step S102: and respectively calculating first relevance scores of user query and all technical policy resource texts by adopting a preset relevance calculation model based on statistical characteristics, and recalling a plurality of technical policy resource texts with higher first relevance scores as candidate documents according to a first set quantity or a first set proportion.
Step S103: calculating a second relevance score for the user query and the candidate document based on a preset BERT model, with reference to fig. 3, comprising:
step S1031: and dividing the candidate document into a plurality of text sections according to a first set length, and setting overlap of a second set length between every two text sections.
Step S1032: and respectively calculating third relevance scores of the user query and each text segment by adopting a preset BERT model and aggregating to obtain second relevance scores of the user query and the candidate documents.
Step S104: and calculating a comprehensive score according to the first relevance score and the second relevance score of the user query and each candidate document according to a set rule, and taking the candidate document with the highest comprehensive score as a retrieval result.
The preset BERT model is pre-trained by adopting a plurality of unmarked field and subject policy text corpora, and is also subjected to fine tuning by adopting a preset marked data set of a target field.
In step S101, the user query refers to a keyword or a question provided by the user that needs to be retrieved. The set range is the search range for preliminary screening, and it may cover the entire collection. In some cases, based on a specified search target, a specified database may be searched, locking the scope to a particular technical area at the beginning of the search. The user query may contain a plurality of keywords; in some embodiments, the token length of the user query string may be limited to a certain range, and user queries exceeding the limit may be truncated. The science and technology policy resource texts serve as the retrieval range within which similarity is compared and from which the answer corresponding to the user query is finally output.
In step S102, the preset correlation calculation model based on statistical characteristics may include methods such as BoW, TF-IDF, BM25, Jaccard and SimHash; in this embodiment, the BM25 algorithm is preferably used as the preset correlation calculation model. The science and technology policy resource texts within the set range are preliminarily recalled with the preset correlation calculation model: specifically, the similarity between the user query and each science and technology policy resource text is computed with the preset correlation calculation model to obtain a first relevance score for each text, and a preliminary recall is performed on the basis of the first relevance scores. By recalling according to a first set quantity or a first set proportion, unrelated science and technology policy resource texts can be preliminarily filtered out.
In some embodiments, in step S102, the preset correlation calculation model adopts a BM25 model; the first relevance score score(Q, D) between the user query and each science and technology policy resource text is calculated with the preset correlation calculation model as:
score(Q, D) = Σ_{i=1}^{n} IDF(q_i) · [tf(q_i, D) · (k_1 + 1)] / [tf(q_i, D) + k_1 · (1 − b + b · |D| / avgdl)];    (1)
where k_1 and b are adjustable coefficients, q_i denotes the i-th keyword in the user query, |D| denotes the document length of the science and technology policy resource text, and avgdl denotes the average document length over the science and technology policy resource texts; tf(q_i, D) denotes the term frequency, i.e., the frequency with which the query term q_i occurs in document D; IDF(q_i) denotes the inverse document frequency of q_i and measures the information provided by the query term, calculated as:
IDF(q_i) = log[ (N − n(q_i) + 0.5) / (n(q_i) + 0.5) ];    (2)
where N denotes the total number of documents and n(q_i) denotes the number of documents in the document set that contain the query term q_i.
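As an illustration only, the first-stage scoring of Equations (1) and (2) might be sketched as follows; the tokenization, the corpus statistics, and the parameter values k1 = 1.5 and b = 0.75 are assumptions made for the example rather than values prescribed by the patent.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avgdl, k1=1.5, b=0.75):
    """First relevance score score(Q, D), following Equations (1) and (2)."""
    tf = Counter(doc_terms)
    score = 0.0
    for q in query_terms:
        f = tf.get(q, 0)                                       # tf(q_i, D)
        if f == 0:
            continue
        n_q = doc_freq.get(q, 0)                               # n(q_i): docs containing q_i
        idf = math.log((num_docs - n_q + 0.5) / (n_q + 0.5))   # IDF(q_i), Equation (2)
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score
```

In step S102, such a score would be computed between the user query and every science and technology policy resource text in the set range, and the highest-scoring texts recalled as candidate documents (the embodiment later sets this number to 200).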
Because the preset relevance calculation model based on the statistical characteristics basically only focuses on the vocabulary level, neglects the word meaning and the structure, and has certain limitation, the calculated first relevance score is not enough to effectively represent the relevance between the user query and the science and technology policy resource text. Therefore, on the basis, the embodiment further introduces a deep language model BERT pre-trained by large-scale corpora, and further performs similarity judgment on the preliminarily recalled candidate documents to perform secondary recall.
In step S103, the adopted preset BERT model is pre-trained by using the text corpora of the label-free multiple fields and subject policies, and then fine-tuned by using the preset label data set of the target field.
In some embodiments, pre-training the preset BERT model with unlabeled policy text corpora from a plurality of fields and disciplines includes pre-training the preset BERT model with the MLM (Masked Language Model) task.
In the MLM task, tokens in the input sequence are randomly masked (i.e., the original token is replaced with [MASK]), and the vector at the corresponding masked position in the BERT output is used to predict the original token.
Specifically, pre-training the preset BERT model with the MLM task includes: acquiring a category feature bag-of-words; masking first the tokens that appear in the category feature bag-of-words for the policy text corpora of the plurality of fields and disciplines, and applying a random MASK strategy to the remaining tokens in the corpora, wherein a masked token is replaced with [MASK] with a first set probability, replaced with a random token from a preset dictionary with a second set probability, and kept unchanged with a third set probability.
In some embodiments, the first set probability is 80%, the second set probability is 10%, and the third set probability is 10%.
If every token involved in training were masked 100% of the time, then at fine-tuning time, where all words are known and no [MASK] appears, the model could only predict the current word from the information and word-order structure of the other tokens and could not use the information of the word itself, because such words never appeared during training; this amounts to the model never seeing their information and to losing part of the semantic space. Applying [MASK] with 80% probability lets the model learn to predict these words while still exposing semantic information to the model 20% of the time. If the original token were kept in all of the remaining cases, the model might become lazy during pre-training and simply copy the current token. Replacing the current token with a random token at 10% probability prevents the model from memorizing it and forces it to learn the surrounding semantics and long-distance dependencies as far as possible, so as to model the complete language information. Finally, the original token is kept with 10% probability, i.e., the original form of the language is preserved, so that the information is not completely hidden and the model can still see the language as it really is.
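Purely as a hedged sketch of the modified masking strategy described above: only the category-bag-first rule and the 80%/10%/10% split come from the text, while the overall 15% selection rate, the token representation and the vocabulary object are illustrative assumptions.

```python
import random

def mask_tokens(tokens, category_bag, vocab, select_rate=0.15, mask_token="[MASK]"):
    """Domain-aware MLM masking: category feature words first, then random masking."""
    masked, labels = [], []
    for tok in tokens:
        in_bag = tok in category_bag
        # Category feature words are always selected; other tokens with select_rate probability.
        if in_bag or random.random() < select_rate:
            labels.append(tok)                       # original token is the prediction target
            r = random.random()
            if r < 0.8:
                masked.append(mask_token)            # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                   # 10%: keep unchanged
        else:
            masked.append(tok)
            labels.append(None)                      # not predicted
    return masked, labels
```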
In some embodiments, the preset BERT model is pre-trained with the MLM task, and the loss function used is:
L = − Σ_{x̂ ∈ m(x)} log p( x̂ | x_{\m(x)} );    (3)
where m(x) denotes the set of masked words in the sequence x, x_{\m(x)} denotes the remaining words in the sequence, x̂ denotes a masked word, and x = [x_1, x_2, …, x_T], with x_T denoting the T-th word.
In some embodiments, the preset BERT model further uses a preset labeled data set of the target domain for fine tuning, including steps S201 to S203:
step S201: and forming positive samples by the questions and the answers in the preset labeling data set of the target field, and forming negative samples by the questions and the non-answers.
Step S202: and taking a preset BERT model as a two-classifier, taking the word element vector of the word element of the last hidden layer as characteristic representation, and inputting a probability score which is input into a neural network of a single hidden layer for training and output as a similarity score of a user query and a candidate text.
Step S203: and (3) fine-tuning the preset BERT model by adopting positive samples and negative samples in a preset labeling data set, wherein the adopted loss function is as follows:
Figure BDA0003731627420000083
wherein s is i Representing the predicted output score of the neural network, I pos Represents a set of positive samples, I neg Representing a set of negative examples.
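The binary-classification fine-tuning head of steps S201 to S203 could be sketched in PyTorch as below; the checkpoint name, the hidden-layer size and the cross-entropy reading of Equation (4) are assumptions made for the example and may differ from the actual implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertMatcher(nn.Module):
    """BERT as a binary classifier: [CLS] vector -> single hidden layer -> probability score."""
    def __init__(self, bert_name="bert-base-chinese", hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.head = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]                   # [CLS] token of the last hidden layer
        return torch.sigmoid(self.head(cls_vec)).squeeze(-1)    # similarity score s_i

def fine_tune_loss(scores, labels):
    # labels: 1 for (question, answer) positives, 0 for (question, non-answer) negatives.
    # A cross-entropy reading of Equation (4); assumed, not confirmed by the patent.
    return nn.functional.binary_cross_entropy(scores, labels.float())
```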
In some embodiments, fine-tuning the preset BERT model with the preset labeled data set of the target field further includes: dividing the preset labeled data set into a training set, a validation set and a test set according to a second set proportion for cross-validation training.
Because machine learning trains on large amounts of data, the generalization error cannot be used directly as a signal for judging the generalization ability of the model, since going back and forth between the deployment environment and model training would be very costly; nor can the degree of fit of the model to the training data set be used as that signal, because the available data may be neither clean nor representative. Therefore, when training a supervised machine learning model, the original data set usually needs to be split into two parts, a training set and a test set, so that the training set is used to train the model and the error on the test set approximates the generalization error of the model in a real scenario. On this basis, models not only need to be compared across classes but also screened within a given class, and when model evaluation and hyper-parameter tuning are involved, a validation set must additionally be split from the training set. Specifically, K-fold cross-validation may be used for training.
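A minimal sketch of the split-and-cross-validate procedure using scikit-learn; the 8:1:1 ratio stands in for the second set proportion, which the text does not fix, and the 5-fold setting is likewise an assumption.

```python
from sklearn.model_selection import train_test_split, KFold

def split_dataset(samples, labels, ratio=(0.8, 0.1, 0.1), seed=42):
    """Split labeled data into train/validation/test sets by an assumed 8:1:1 proportion."""
    x_train, x_rest, y_train, y_rest = train_test_split(
        samples, labels, train_size=ratio[0], random_state=seed)
    val_share = ratio[1] / (ratio[1] + ratio[2])
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, train_size=val_share, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)

# K-fold cross-validation over the training portion, as mentioned in the text:
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# for train_idx, val_idx in kf.split(x_train): ...
```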
Further, in step S1031, based on the preset BERT model obtained by the pre-training and the fine-tuning, in consideration of the input limit of the preset BERT model, in the process of comparing the similarity by using the preset BERT model, the candidate document is divided into a plurality of text segments according to the first set length, and after the similarity comparison, the candidate document is aggregated to calculate the comprehensive score. The length of the text segment obtained after segmentation is less than 445 characters, and 64 characters are overlapped between adjacent text segments.
Specifically, in step S1032, calculating the third relevance scores of the user query and each text segment with the preset BERT model and aggregating them to obtain the second relevance score of the user query and each candidate document includes: calculating the third relevance scores of the user query and each text segment and aggregating them by first-segment score aggregation, average score aggregation or maximum score aggregation to obtain the second relevance score;
calculating the second relevance score with first-segment score aggregation, expressed as:
s_stage2(Q, D) = s_1(Q, P_1);    (5)
where s_stage2(Q, D) denotes the second relevance score and s_1(Q, P_1) denotes the third relevance score between the user query and the first text segment of the candidate document;
calculating the second relevance score with average score aggregation, expressed as:
s_stage2(Q, D) = (1/n) · Σ_{i=1}^{n} s_i(Q, P_i);    (6)
where s_i(Q, P_i) denotes the third relevance score between the user query and the i-th text segment of the candidate document;
calculating the second relevance score with maximum score aggregation, expressed as:
s_stage2(Q, D) = max( s_1(Q, P_1), s_2(Q, P_2), …, s_n(Q, P_n) );    (7)
where s_n(Q, P_n) denotes the third relevance score between the user query and the n-th text segment of the candidate document.
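The three aggregation choices of Equations (5) to (7) can be written directly; `segment_scores` below stands for the list of third relevance scores s_1, …, s_n produced by the preset BERT model for one candidate document.

```python
def aggregate_first(segment_scores):
    # Equation (5): use the score of the first text segment only.
    return segment_scores[0]

def aggregate_mean(segment_scores):
    # Equation (6): average over all text segments.
    return sum(segment_scores) / len(segment_scores)

def aggregate_max(segment_scores):
    # Equation (7): maximum over all text segments (the choice used in the embodiment).
    return max(segment_scores)
```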
In another aspect, the present invention also provides an electronic device, which includes a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the computer program to implement the steps of the method.
In another aspect, the present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above method.
The invention is illustrated below with reference to a specific embodiment:
This embodiment provides a multi-field and multi-disciplinary science and technology policy resource retrieval method and device. As shown in fig. 2, relevant documents are preliminarily recalled for the user query on the basis of the statistical similarity model BM25, and the candidate related documents are re-ranked by introducing the deep language model BERT. In addition, science and technology policy resources are typically long texts that often exceed the input limit of the BERT model and cannot be used directly, so the policy text is scored in segments and the scores are aggregated to realize long-text matching based on the deep language model BERT. Specifically, this embodiment includes three parts.
1. Domain pretraining
In order to add a semantic-similarity correction on top of traditional similarity matching, this embodiment introduces the deep language model BERT (Bidirectional Encoder Representations from Transformers). The BERT model is pre-trained on large-scale unlabeled text with two tasks, MLM and NSP, and the whole model is built from deep Transformer components, which avoids the structural limitation of traditional pre-trained models constrained to a unidirectional language model and improves the representation capability of the model. However, large-scale pre-training corpora come from general text data, such as the Chinese Wikipedia, and lack domain-related knowledge, and the fine-tuning stage often lacks enough labeled data, so the model has difficulty capturing or learning the patterns of a specific domain. Therefore, when the deep language model BERT is introduced, the model is further pre-trained on domain data using the unlabeled multi-field and multi-disciplinary policy text corpus, which helps the model learn the semantic feature patterns of policy-domain text better. Meanwhile, on top of the domain pre-trained model, a small amount of labeled external cross-domain data is used for fine-tuning, providing semantic capture capability for policy-domain texts in the absence of labeled data.
In this embodiment, only the MLM (Masked Language Model) task is used in the domain pre-training stage: words in the corpus sentences are masked according to a masking strategy and replaced with [MASK], and the model predicts them from the context within the sentence. For a sequence x = [x_1, x_2, …, x_T], the loss function is given by Equation 3.
In order to better learn the language patterns of the multi-field and multi-disciplinary policy domain, the original masking strategy is modified so that the model pays more attention to the domain feature words during the pre-training task. In this process, a category feature bag-of-words is used: tokens appearing in the category feature bag-of-words are masked first, and a random masking strategy is applied to the remaining tokens in the corpus: with 80% probability a token is replaced with [MASK], with 10% probability it is replaced with a random token from the dictionary, and with 10% probability it remains unchanged.
2. Fine-tuning task based on the deep language model BERT
This embodiment uses the Chinese public data set cMedQA as the labeled data for the fine-tuning stage, where the training set contains 100000 questions and 188490 answers, the validation set contains 4000 questions and 7527 answers, and the test set contains 4000 questions and 7552 answers.
The questions and answers in the labeled data form positive samples, and for each question a negative sample is randomly drawn from non-answer text. Special tokens [CLS] and [SEP] are added to form the model input: [[CLS], user query, [SEP], sample text, [SEP]]. The user query string is limited to 64 tokens, and text exceeding the query length limit is truncated. The BERT model is used as a binary classifier: the token vector corresponding to the [CLS] token of the last hidden layer is used as the feature representation and fed into a single-hidden-layer neural network for training, and the output probability score serves as the similarity score between the user query and the candidate text.
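For illustration only, the input construction just described could be sketched with the Hugging Face tokenizer API as follows; the tokenizer checkpoint name, the 512-token maximum length, and the decode-based query truncation are assumptions for the sketch rather than details given in the embodiment.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint

def encode_pair(query, passage, max_query_tokens=64, max_len=512):
    """Build [[CLS], user query, [SEP], sample text, [SEP]] with the query limited to 64 tokens."""
    query_ids = tokenizer.encode(query, add_special_tokens=False)[:max_query_tokens]
    query = tokenizer.decode(query_ids)  # approximate truncation for the sketch
    return tokenizer(query, passage, truncation="only_second",
                     max_length=max_len, padding="max_length", return_tensors="pt")
```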
The single-hidden-layer neural network classifier outputs the similarity prediction between the user query and the candidate text, and the loss function used in the fine-tuning training process is shown in Equation 4.
3. Resource retrieval correlation calculation for multi-domain and multi-disciplinary science and technology policies
By introducing the deep language model BERT pre-trained on large-scale corpora and combining continued domain pre-training on multi-field and multi-disciplinary policy texts with downstream task fine-tuning, semantics-based relevance calculation between the user query and candidate texts is realized. On this basis, this embodiment provides a concrete science and technology policy resource retrieval and ranking method. The method has two stages: in the first stage, science and technology policy resources are preliminarily recalled and coarsely ranked according to the statistical relevance model BM25, and in the second stage, the prediction scores of the first stage are corrected and re-ranked by incorporating the relevance calculation of the deep language model.
In the first stage, for a user query Q containing keywords q_1, q_2, …, q_n, the relevance between the user query Q and a document D is calculated as in Equations 1 and 2 above.
If a query term q_i occurs less frequently (e.g., a term of art), its matched IDF score is higher, and vice versa. In the first stage, a certain number of candidate documents are preliminarily recalled by statistical relevance calculation based on the term frequency TF and the inverse document frequency IDF; the number of candidate documents is set to 200.
In the second stage, the relevance score between the user query and each candidate document is calculated on the basis of the deep language model BERT, the first-stage relevance is corrected, re-ranking of the candidate documents is realized, and a multi-field and multi-disciplinary science and technology policy resource retrieval result is provided to the user. The corrected relevance calculation formula is shown in Equation 8:
score(Q, D) = α · s_stage1(Q, D) + (1 − α) · s_stage2(Q, D);    (8)
where s_stage1(Q, D) denotes the first-stage basic relevance score and s_stage2(Q, D) denotes the second-stage relevance score based on the BERT model; α is a correction coefficient and also a hyper-parameter with a value range between 0 and 1, and in practice its specific value can be determined by grid search over this parameter on the cross-validation set.
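An illustrative sketch of Equation (8) together with the grid search over the correction coefficient α mentioned above; the candidate α values and the `evaluate` callback (some retrieval-quality metric computed on the cross-validation set) are assumptions made for the example.

```python
def combined_score(s_stage1, s_stage2, alpha):
    # Equation (8): correct the first-stage score with the BERT-based second-stage score.
    return alpha * s_stage1 + (1 - alpha) * s_stage2

def grid_search_alpha(validation_queries, evaluate,
                      candidates=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Pick alpha on the cross-validation set; evaluate() is an assumed quality metric."""
    best_alpha, best_metric = None, float("-inf")
    for alpha in candidates:
        metric = evaluate(validation_queries, alpha)
        if metric > best_metric:
            best_alpha, best_metric = alpha, metric
    return best_alpha
```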
When calculating matching relevance with the BERT model, the input limitation must be addressed: the input length of the BERT model is at most 512 characters, while multi-field and multi-disciplinary science and technology policy resources are typically long texts and cannot be fed directly into the model. To address this, the long policy text is segmented: the length of each segmented text segment is less than 445 characters, adjacent text segments overlap by 64 characters, and after the user query is concatenated with each text segment and scored separately, the relevance calculation is completed by score aggregation, as shown in fig. 3:
for the characteristics of common long texts of multi-field and multi-subject science and technology policy resources, the text of the science and technology policy is segmented, and the long text is segmented into a plurality of text sections P in a sliding manner 1 ,P 2 ,…,P n And overlapping 64 characters between adjacent different text segments to ensure that the input after the query is spliced meets the input limit of the Bert model 512 characters, and completing the input with insufficient tail length by padding operation. For each segmented text segment, respectively splicing with the query of the user to form an input format [ CLS [ [ CLS ]][ user query ]],[SEP][ text passage ]],[SEP]]And inputting the data into a network structure of section 4.2.2 to carry out correlation degree scoring prediction to obtain a plurality of correlation degree scores s 1 ,s 2 ,…,s n Then, the final relevance prediction is obtained by aggregating a plurality of relevance scores, and the first segment score can be adopted in the selection of the aggregation modeAggregation, average score aggregation, or maximum score aggregation calculations, refer to equations 5, 6, and 7 above, respectively, and this example ultimately uses the maximum score aggregation method based on experimental data.
In summary, in the multi-field and multi-disciplinary science and technology policy resource retrieval method and device, the similarity between the user query and the text of each science and technology policy resource is first computed with a traditional preset correlation calculation model to preliminarily recall and coarsely rank candidate documents, and the preliminarily recalled candidate documents are then corrected and re-ranked with the deep language model BERT to output the final query result. The BERT model is pre-trained on unlabeled text and fine-tuned with labeled text from the specific field, which improves its semantic capture capability in the text matching task. Each candidate text is divided into a plurality of text segments, the similarity between each text segment and the user query is calculated separately, and the results are aggregated into a second relevance score, which solves the input-length limitation of the BERT model. Through two-stage query retrieval, the invention integrates features at the vocabulary, word-sense and structural levels and improves the precision of text matching.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein may be implemented as hardware, software, or combinations of both. Whether this is done in hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
It should also be noted that the exemplary embodiments noted in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed at the same time.
Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments in the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A multi-field and multi-disciplinary science and technology policy resource retrieval method is characterized by comprising the following steps:
acquiring a plurality of scientific and technological policy resource texts in a user query and setting range;
respectively calculating first relevance scores of the user query and the scientific policy resource texts by adopting a preset relevance calculation model based on statistical characteristics, and recalling a plurality of scientific policy resource texts with higher first relevance scores as candidate documents according to a first set quantity or a first set proportion;
calculating a second relevance score for the user query and the candidate document based on a preset BERT model, comprising:
dividing the candidate document into a plurality of text sections according to a first set length, and setting overlap of a second set length between every two text sections;
respectively calculating third relevance scores of the user query and each text segment by adopting the preset BERT model and aggregating to obtain second relevance scores of the user query and the candidate documents;
calculating a comprehensive score according to the first relevance score and the second relevance score of the user query and each candidate document according to a set rule, and taking the candidate document with the highest comprehensive score as a retrieval result;
the method comprises the following steps of presetting a BERT model, wherein the preset BERT model adopts a plurality of unmarked fields and subject policy text corpora for pre-training, and the preset BERT model also adopts a preset marked data set in a target field for fine adjustment.
2. The multi-field and multi-disciplinary science and technology policy resource retrieval method according to claim 1, characterized in that the preset correlation calculation model adopts a BM25 model;
the first relevance score score(Q, D) between the user query and each science and technology policy resource text is calculated with the preset correlation calculation model as:
score(Q, D) = Σ_{i=1}^{n} IDF(q_i) · [tf(q_i, D) · (k_1 + 1)] / [tf(q_i, D) + k_1 · (1 − b + b · |D| / avgdl)]
where k_1 and b are adjustable coefficients, q_i denotes the i-th keyword in the user query, |D| denotes the document length of the science and technology policy resource text, and avgdl denotes the average document length of the science and technology policy resource texts; tf(q_i, D) denotes the term frequency; IDF(q_i) denotes the inverse document frequency of q_i, calculated as:
IDF(q_i) = log[ (N − n(q_i) + 0.5) / (n(q_i) + 0.5) ]
where N denotes the total number of documents and n(q_i) denotes the number of documents containing the query term q_i.
3. The multi-field and multi-disciplinary science and technology policy resource retrieval method according to claim 1, characterized in that calculating the third relevance scores of the user query and each text segment with the preset BERT model and aggregating them to obtain the second relevance score of the user query and the candidate document comprises:
calculating the third relevance scores of the user query and each text segment and aggregating them by first-segment score aggregation, average score aggregation or maximum score aggregation to obtain the second relevance score;
calculating the second relevance score with first-segment score aggregation, expressed as:
s_stage2(Q, D) = s_1(Q, P_1)
where s_stage2(Q, D) denotes the second relevance score and s_1(Q, P_1) denotes the third relevance score between the user query and the first text segment of the candidate document;
calculating the second relevance score with average score aggregation, expressed as:
s_stage2(Q, D) = (1/n) · Σ_{i=1}^{n} s_i(Q, P_i)
where s_i(Q, P_i) denotes the third relevance score between the user query and the i-th text segment of the candidate document;
calculating the second relevance score with maximum score aggregation, expressed as:
s_stage2(Q, D) = max( s_1(Q, P_1), s_2(Q, P_2), …, s_n(Q, P_n) )
where s_n(Q, P_n) denotes the third relevance score between the user query and the n-th text segment of the candidate document.
4. The method for resource retrieval according to multi-domain and multi-disciplinary scientific and technological policies of claim 1, wherein the pre-training of the pre-set BERT model using label-free multi-domain and disciplinary policy text corpora comprises:
and pre-training the preset BERT model by adopting an MLM task.
5. The method according to claim 1, wherein the pre-training of the pre-set BERT model using MLM tasks comprises:
acquiring a category characteristic word bag;
firstly covering tokens in the category characteristic word bag for a plurality of domain and subject policy text corpora, and adopting a random MASK strategy to cover the rest tokens in the corpora, wherein the covered tokens have a first set probability replaced by [ MASK ], a second set probability replaced by a random token in a preset dictionary, and a third set probability kept unchanged.
6. The multi-field and multi-disciplinary science and technology policy resource retrieval method according to claim 5, characterized in that the preset BERT model is pre-trained with the MLM task, and the loss function used is:
L = − Σ_{x̂ ∈ m(x)} log p( x̂ | x_{\m(x)} )
where m(x) denotes the set of masked words in the sequence x, x_{\m(x)} denotes the remaining words in the sequence, x̂ denotes a masked word, and x = [x_1, x_2, …, x_T], with x_T denoting the T-th word.
7. The method according to claim 1, characterized in that the preset BERT model is further fine-tuned with a preset labeled data set of a target field, comprising:
forming positive samples from the questions and answers in the preset labeled data set of the target field, and forming negative samples from the questions and non-answers;
taking the preset BERT model as a binary classifier, taking the token vector of the token of the last hidden layer as the feature representation, feeding it into a single-hidden-layer neural network for training, and using the output probability score as the similarity score between the user query and the candidate text;
fine-tuning the preset BERT model with the positive samples and negative samples in the preset labeled data set, using the loss function:
L = − Σ_{i ∈ I_pos} log(s_i) − Σ_{j ∈ I_neg} log(1 − s_j)
where s_i denotes the predicted output score of the neural network, I_pos denotes the set of positive samples, and I_neg denotes the set of negative samples.
8. The method according to claim 7, wherein the pre-defined BERT model is further fine-tuned using a pre-defined labeled data set of a target domain, comprising:
and dividing the preset labeling data set into a training set, a verification set and a test set according to a second set proportion for cross training.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 8 are implemented when the processor executes the program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202210846693.0A 2022-07-05 2022-07-05 Multi-field and multi-disciplinary science and technology policy resource retrieval method and device Pending CN115344668A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210846693.0A CN115344668A (en) 2022-07-05 2022-07-05 Multi-field and multi-disciplinary science and technology policy resource retrieval method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210846693.0A CN115344668A (en) 2022-07-05 2022-07-05 Multi-field and multi-disciplinary science and technology policy resource retrieval method and device

Publications (1)

Publication Number Publication Date
CN115344668A true CN115344668A (en) 2022-11-15

Family

ID=83949168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210846693.0A Pending CN115344668A (en) 2022-07-05 2022-07-05 Multi-field and multi-disciplinary science and technology policy resource retrieval method and device

Country Status (1)

Country Link
CN (1) CN115344668A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100365A (en) * 2020-08-31 2020-12-18 电子科技大学 Two-stage text summarization method
CN113836919A (en) * 2021-09-30 2021-12-24 中国建筑第七工程局有限公司 Building industry text error correction method based on transfer learning
CN114357120A (en) * 2022-01-12 2022-04-15 平安科技(深圳)有限公司 Non-supervision type retrieval method, system and medium based on FAQ
CN114565104A (en) * 2022-03-01 2022-05-31 腾讯科技(深圳)有限公司 Language model pre-training method, result recommendation method and related device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117633255A (en) * 2024-01-25 2024-03-01 中国标准化研究院 Scientific and technological resource identification analysis method and system based on active identification
CN117633255B (en) * 2024-01-25 2024-04-05 中国标准化研究院 Scientific and technological resource identification analysis method and system based on active identification

Similar Documents

Publication Publication Date Title
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN110442760B (en) Synonym mining method and device for question-answer retrieval system
WO2021203581A1 (en) Key information extraction method based on fine annotation text, and apparatus and storage medium
US11016966B2 (en) Semantic analysis-based query result retrieval for natural language procedural queries
Jung Semantic vector learning for natural language understanding
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
US11379668B2 (en) Topic models with sentiment priors based on distributed representations
Cohen et al. Exploiting dictionaries in named entity extraction: combining semi-markov extraction processes and data integration methods
CN113268995B (en) Chinese academy keyword extraction method, device and storage medium
Wu et al. Learning of multimodal representations with random walks on the click graph
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
Qian et al. Generating accurate caption units for figure captioning
CN112818093A (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
CN111967267B (en) XLNET-based news text region extraction method and system
CN112328800A (en) System and method for automatically generating programming specification question answers
CN116127090B (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
Logeswaran et al. Sentence ordering using recurrent neural networks
CN112926345A (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
CN111552773A (en) Method and system for searching key sentence of question or not in reading and understanding task
CN112966117A (en) Entity linking method
CN112307182A (en) Question-answering system-based pseudo-correlation feedback extended query method
CN112613293A (en) Abstract generation method and device, electronic equipment and storage medium
CN113934835B (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN116881425A (en) Universal document question-answering implementation method, system, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20221115)