CN104750819A

CN104750819A - Biomedical literature retrieval method and system based on a word grouping sorting algorithm

Info

Publication number: CN104750819A
Application number: CN201510147696.5A
Authority: CN
Inventors: 徐博; 林鸿飞
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2015-03-31
Filing date: 2015-03-31
Publication date: 2015-07-01
Anticipated expiration: 2035-03-31
Also published as: CN104750819B

Abstract

A biomedical literature retrieval method and system based on a word grouping sorting algorithm. The retrieval method includes: S1, a search engine query extraction step; S2, a candidate expansion term extraction step; S3, a feature extraction and labeling step for candidate expansion terms; S4, a training step for the candidate expansion term ranking model; S5, an online search engine query and extraction step; S6, an online candidate expansion term extraction, feature extraction, and scoring step; and S7, a query result return step. The retrieval system includes a search engine query extraction module, a candidate expansion term extraction module, a feature extraction and labeling module for candidate expansion terms, a candidate expansion term ranking model training module, a query reconstruction module, and a query result return module. Working from the perspective of query expansion, the invention uses the word grouping sorting algorithm and dictionary resources native to the biomedical field to select the technical terms that best express the user's information need, completing the retrieval task and improving retrieval performance.

Description

A Biomedical Literature Retrieval Method and System Based on a Word Grouping Sorting Algorithm

Technical Field

The invention relates to the technical fields of data mining and search engines, and in particular to a biomedical literature retrieval method and system based on a word grouping sorting algorithm.

Background

In recent years, with the rapid development of biomedicine, biomedical research has produced many valuable results. These results have not only enabled the treatment of diseases that once seemed intractable but, viewed more broadly, have also advanced and deepened humanity's understanding of itself.

However, as the number of biomedical publications grows rapidly, the amount of related information is growing exponentially. This mass of literature and information makes it difficult for biomedical researchers and practitioners to obtain what they need, and traditional manual information gathering is no longer adequate. Information retrieval techniques and methods are therefore needed to help these users obtain the required information.

Traditional information retrieval techniques rank documents or web pages by relevance to the query a user submits and return the ranked results to the user. Applying traditional retrieval methods directly to biomedical literature retrieval, however, rarely yields good retrieval performance. The reason is that they do not fully account for the inherent characteristics of the biomedical field: for example, the field has a large technical vocabulary, and its terms often have many synonyms and abbreviations. Taking these characteristics into account within traditional retrieval methods would further improve the performance of biomedical information retrieval.

Query expansion is one of the key techniques of traditional information retrieval. Starting from the original query submitted by the user, it supplements and refines the query according to the user's search intent, producing a query that better matches that intent and improving retrieval performance. Existing query expansion methods fall into two categories. The first is based on the document collection: it takes all or part of the document collection as its object of study and extracts query-related content from it to refine the original query. The second is based on external resources, mainly dictionary resources, retrieval system query logs, anchor text, and Wikipedia. Many studies have shown that refining the original query with external resources accomplishes query expansion more effectively and thereby improves retrieval performance.

Because the biomedical field has many domain resources such as dictionaries, making full use of them during retrieval to supplement and refine user queries is very likely to improve retrieval performance.

To build literature retrieval for the biomedical field, one must first understand the field's characteristics and resources. Biomedical literature contains a large technical vocabulary with many synonyms and abbreviations, which poses a major challenge for building a retrieval system. For example, the drug commonly called paracetamol is named acetaminophen in the international standard drug classification, and in medicinal chemistry it is identified as C8H9NO2 or N02BE01. With so many names, a search that uses only one of them is unlikely to retrieve all relevant documents. Fortunately, the biomedical field also has many established knowledge bases and resources, such as Medical Subject Headings (MeSH) and the Gene Ontology (GO). Making full use of these resources during retrieval would greatly improve the performance of biomedical literature retrieval.

Learning to rank is the general name for a family of supervised learning algorithms used to rank documents in information retrieval. Its main characteristic is that it applies machine learning to the ranking problem in information retrieval, and it has achieved good ranking performance. The ranking problem can also be viewed as the problem of selecting the best items, so in recent years learning-to-rank algorithms have been applied to other tasks as well, such as recommending items to users in recommender systems based on the historical information of users and items.

Summary of the Invention

The purpose of the present invention is to provide a biomedical literature retrieval method and system based on a word grouping sorting algorithm that returns more accurate biomedical literature, meets users' information needs more effectively, and effectively supplements and refines user queries.

The technical solution adopted by the present invention to solve the problems of the prior art is a biomedical literature retrieval method based on a word grouping sorting algorithm, comprising an offline training phase and an online query phase, wherein the offline training phase includes the following steps:

S1, search engine query extraction step: from the historical query records of the search engine, extract multiple queries and the top N result documents obtained for each query, and collect the queries and their result documents into a query pool, where N is a natural number;

S2, candidate expansion term extraction step: using biomedical resources, extract the technical terms in the top N result documents of each query in the query pool, and count for each technical term either its number of occurrences in those result documents or the weighted sum of its occurrence counts; sort the technical terms in descending order of occurrence count or weighted sum, and select the M terms with the highest count or weighted sum as candidate expansion terms, where M is a natural number;

S3, feature extraction and labeling step for candidate expansion terms:

Feature extraction and labeling of the candidate expansion terms are carried out together. The relevance label of a candidate expansion term is obtained by comparing the retrieval performance of the original query with the retrieval performance when the candidate term is added to the original query. Evaluation metrics for retrieval performance include precision, mean average precision (MAP), NDCG, and MRR. The label is assigned as follows:

$label = \begin{cases} 1 & eval(query + term) > eval(query) \\ 0 & eval(query + term) \leq eval(query) \end{cases}$

where eval() is the evaluation metric function used to measure retrieval performance, eval(query+term) is the score of eval() when the candidate expansion term (term) is added to the query (query), and eval(query) is the score of eval() for the query alone; a label of 1 indicates that the candidate expansion term is relevant to the query, and a label of 0 indicates that it is not;

Feature extraction for a candidate expansion term draws on the biomedical resources and the top N result documents returned for the queries in the query pool. It extracts the distribution of the candidate term in the result documents, its distribution in the biomedical resources, and its relevance to the original query, in preparation for training the ranking model. After the features of a candidate term have been extracted, all feature values are normalized into the interval [0,1] as follows:

$newFeatureValue = \frac{oldFeatureValue - minValue}{maxValue - minValue}$

where minValue and maxValue are the minimum and maximum values of the given feature;

S4, ranking model training step for candidate expansion terms: using the relevance labels and features of the candidate expansion terms, train with the word grouping sorting algorithm to obtain a weight for each feature. Specifically: select one candidate expansion term labeled relevant in step S3 and several candidate expansion terms labeled irrelevant to form a word group, and select a number of such word groups as training samples; randomly assign an initial weight to each feature of the candidate expansion terms, and rank the candidates within each word group by their feature-weighted scores; from the ranking result of each word group, compute the overall ranking loss, and adjust the weight of each feature dimension according to the gradient of the loss function. The overall ranking loss is computed from NumSample and loss_i (the formula is not reproduced in this text), where NumSample is the number of word groups and loss_i is the loss of the i-th word group; loss_i is computed from the ranking position of the relevant expansion term, and the higher that term is ranked, the smaller the loss. This process is iterated until the overall loss falls below a threshold or the specified number of iterations is reached; the final feature weights constitute the trained ranking model;

The online query phase includes the following steps:

S5, online search engine query and extraction step: for a new query submitted online by a user, retrieve the top N1 results; using biomedical resources, extract the technical terms in the top N1 results together with their features, where N1 is a natural number;

S6, online candidate expansion term extraction, feature extraction, and scoring step: for the new query, apply the candidate expansion term extraction and feature extraction methods of offline steps S2-S3, using biomedical resources, to the technical terms in the top N1 results, obtaining the online candidate expansion terms and their features; the extracted features measure the importance of a candidate expansion term in the expanded query. Using the feature weights trained in step S4, score the online candidate expansion terms, and add the K1 highest-scoring candidates to the new query to form the expanded query, where K1 is a natural number;

For an online candidate expansion term identified and extracted using biomedical resources, the score is computed from the trained feature weights and the term's feature values (the formula is not reproduced in this text), where FeatureNum is the total number of features, a_i is the weight of the i-th feature in the ranking model, and feature_i(term) is the value of the i-th feature of the online candidate expansion term (term);

The online candidate expansion terms are ranked by score, and the top K1 are added to the new query as expansion terms. The weight of an added term in the expanded query can be expressed as $weight = \sum_{i=1}^{count} {weight}_{i} \cdot {feature}_{i} + sign \cdot {weight}_{original},$ where sign is an indicator function: sign = 1 when the candidate expansion term already appears in the new query submitted online and sign = 0 otherwise, and weight_original is the weight of the new query submitted online in the expanded query;

S7, query result return step: retrieve with the expanded query and return the results to the user.
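The scoring and query expansion of step S6 can be illustrated with a minimal Python sketch. It assumes that a candidate's score is the plain weighted sum of its feature values under the trained weights, which mirrors the form of the weight expression in step S6 but is not confirmed by this text. The names `score_term` and `expand_query`, the toy weights, and the choice to keep every original query term at `weight_original` are all hypothetical.

```python
def score_term(weights, features):
    """Weighted sum of feature values (assumed form of the candidate score)."""
    return sum(a * f for a, f in zip(weights, features))


def expand_query(original_terms, candidates, weights, k1, weight_original=1.0):
    """Return {term: weight} for the expanded query.

    candidates maps each candidate term to its normalized feature vector.
    Assumption: original query terms keep weight_original in the expanded query.
    """
    ranked = sorted(
        candidates,
        key=lambda t: score_term(weights, candidates[t]),
        reverse=True,
    )
    expanded = {t: weight_original for t in original_terms}
    for term in ranked[:k1]:
        # sign = 1 only if the candidate already occurs in the submitted query.
        sign = 1 if term in original_terms else 0
        expanded[term] = score_term(weights, candidates[term]) + sign * weight_original
    return expanded


# Toy data for illustration only.
trained_weights = [0.6, 0.4]
candidate_features = {
    "acetaminophen": [0.9, 0.5],
    "analgesic": [0.5, 0.5],
    "tablet": [0.1, 0.2],
}
expanded = expand_query(["paracetamol"], candidate_features, trained_weights, k1=2)
print(expanded)
```

With K1 = 2 in this toy setup, the lowest-scoring candidate is left out of the expanded query, while the original query term is retained alongside the two selected expansion terms.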

In step S2, the weighted sum of a technical term's occurrence counts in the result documents is computed from count_i and d(i) (the formula is not reproduced in this text), where count_i is the number of times the term appears in the i-th document and d(i) is the decay factor of the i-th document.
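As a rough illustration of the candidate selection in step S2, the sketch below assumes that the weighted sum takes the form of count_i multiplied by d(i), summed over the top N documents, and that d(i) = 1/i. Neither the exact formula nor the decay function is given in this text, so both are assumptions, as are the names `weighted_term_counts` and `top_m` and the toy vocabulary standing in for a resource such as MeSH.

```python
from collections import defaultdict


def weighted_term_counts(ranked_docs, vocabulary, decay=lambda i: 1.0 / i):
    """Sum count_i * d(i) over ranked documents for each vocabulary term.

    ranked_docs: token lists, best-ranked document first (rank i starts at 1).
    decay: hypothetical d(i); 1/i is an assumption, not the source's choice.
    """
    scores = defaultdict(float)
    for i, tokens in enumerate(ranked_docs, start=1):
        for token in tokens:
            if token in vocabulary:
                scores[token] += decay(i)
    return dict(scores)


def top_m(scores, m):
    """Return the m terms with the highest weighted count."""
    return sorted(scores, key=scores.get, reverse=True)[:m]


# Toy data for illustration only.
docs = [
    ["aspirin", "fever", "aspirin"],
    ["fever", "ibuprofen"],
    ["ibuprofen", "the"],
]
vocab = {"aspirin", "fever", "ibuprofen"}
scores = weighted_term_counts(docs, vocab)
candidates = top_m(scores, 2)
print(scores, candidates)
```

Because the decay shrinks with rank, a term that appears only in lower-ranked documents contributes less than the same count in the top-ranked document.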

In step S3, the evaluation metric function eval() is average precision, that is:

$eval_{MAP} = \frac{1}{RelDoc_{query}} \cdot \sum_{i=1}^{RelDoc_{query}} \frac{i}{rank(i)}$

where RelDoc_query is the number of documents relevant to the given query, and rank(i) is the position of the i-th relevant document in the ranked result list.
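The average precision measure defined above can be computed as in the following minimal Python sketch. The function name `average_precision` and the toy document identifiers are hypothetical; relevant documents that never appear in the ranked list are assumed to contribute zero, which is the usual convention but is not stated in this text.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average of i / rank(i) over the relevant documents of one query."""
    relevant_ids = set(relevant_ids)
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1                 # this is the i-th relevant document
            total += hits / position  # i / rank(i)
    return total / len(relevant_ids)


# Toy ranking: relevant documents sit at positions 1, 3 and 5.
ap = average_precision(["d1", "d2", "d3", "d4", "d5"], {"d1", "d3", "d5"})
print(ap)
```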

In step S1, when no historical query records are available, records of queries and their results are obtained manually by constructing biomedical queries and retrieving with them; the retrieval method uses the vector space model, the BM25 retrieval model, or a language model with one of several smoothing methods.

In step S4, the loss value loss_i is computed from rank_i (the formula is not reproduced in this text), where rank_i is the position at which the relevant candidate expansion term is ranked within its word group.
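The group-wise training loop of step S4 can be sketched as follows. Because the rank-based loss formula is not reproduced in this text, the sketch swaps in a softmax cross-entropy loss over each word group as a differentiable surrogate: it is small when the relevant term outscores the irrelevant ones, but it is not the loss defined by the invention. The function names, learning rate, and toy feature vectors are hypothetical.

```python
import math
import random


def linear_score(weights, features):
    """Feature-weighted score of one candidate term."""
    return sum(w * x for w, x in zip(weights, features))


def group_loss_and_grad(weights, group):
    """Surrogate loss for one word group: (relevant_features, [irrelevant, ...])."""
    relevant, irrelevant = group
    items = [relevant] + list(irrelevant)
    scores = [linear_score(weights, f) for f in items]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = -math.log(probs[0])
    grad = [
        sum(p * f[k] for p, f in zip(probs, items)) - relevant[k]
        for k in range(len(weights))
    ]
    return loss, grad


def train(groups, num_features, lr=0.5, max_iters=200, tol=1e-3, seed=0):
    """Gradient descent on the summed group losses, from random initial weights."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.1, 0.1) for _ in range(num_features)]
    for _ in range(max_iters):
        total_loss = 0.0
        total_grad = [0.0] * num_features
        for group in groups:
            loss, grad = group_loss_and_grad(weights, group)
            total_loss += loss
            total_grad = [a + b for a, b in zip(total_grad, grad)]
        if total_loss < tol:  # stop early once the overall loss is small enough
            break
        weights = [w - lr * g for w, g in zip(weights, total_grad)]
    return weights


# Toy word groups: one relevant candidate plus two irrelevant ones each.
groups = [
    ([0.9, 0.2], [[0.1, 0.3], [0.2, 0.8]]),
    ([0.8, 0.5], [[0.3, 0.4], [0.1, 0.9]]),
]
weights = train(groups, num_features=2)
print(weights)
```

The stopping rule follows the description in step S4: iteration ends when the overall loss drops below a threshold or the iteration limit is reached.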

Biomedical resources are dictionaries or knowledge bases that contain biomedical technical terms.

The features of a candidate expansion term include: its term frequency (TF) in the result documents; its TF-IDF value; the number of documents in which it co-occurs with the original query; the number of times it co-occurs with the original query within the same text window; the number of times it appears in the biomedical resources; the number of term concepts in the biomedical resources that contain it; and the containment relations among biomedical term concepts.
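Three of the document-based features listed above (term frequency, co-occurring document count, and co-occurrence within a text window), followed by the min-max normalization of step S3, can be sketched in Python as below. The function names, the window size of 4 tokens, and the rule that a constant feature maps to 0 are assumptions made for illustration; the resource-based features are omitted because they depend on the specific dictionary used.

```python
def candidate_features(term, query_terms, docs, window=4):
    """Compute TF, co-occurring document count, and window co-occurrence count."""
    query_terms = set(query_terms)
    tf = co_docs = co_window = 0
    for tokens in docs:
        positions = [i for i, tok in enumerate(tokens) if tok == term]
        tf += len(positions)
        if positions and query_terms & set(tokens):
            co_docs += 1
        for i in positions:
            nearby = tokens[max(0, i - window): i + window + 1]
            if query_terms & set(nearby):
                co_window += 1
    return {"tf": tf, "co_docs": co_docs, "co_window": co_window}


def normalize_features(raw):
    """Min-max normalize each feature across candidates into [0, 1]."""
    names = list(next(iter(raw.values())).keys())
    out = {term: {} for term in raw}
    for name in names:
        column = [raw[t][name] for t in raw]
        lo, hi = min(column), max(column)
        for t in raw:
            # Assumption: a feature with no spread is mapped to 0.
            out[t][name] = 0.0 if hi == lo else (raw[t][name] - lo) / (hi - lo)
    return out


# Toy result documents for the query "paracetamol".
docs = [
    ["paracetamol", "reduces", "fever", "and", "pain"],
    ["fever", "is", "common"],
    ["pain", "relief", "with", "low", "dose", "oral", "paracetamol"],
]
raw = {t: candidate_features(t, ["paracetamol"], docs) for t in ["pain", "fever", "dose"]}
normalized = normalize_features(raw)
print(raw)
print(normalized)
```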

A biomedical literature retrieval system based on a word grouping sorting algorithm, comprising an offline training part and an online retrieval part; the offline training part includes the following modules:

Search engine query extraction module: extracts, from the historical query records of the search engine, multiple queries and the top N result documents obtained for each query, and collects the queries and their result documents into a query pool, where N is a natural number;

Candidate expansion term extraction module: given a user query, uses resources native to the biomedical field to extract technical terms from the top N result documents obtained by the search engine query extraction module, and records each term's occurrence count in those documents or the weighted sum of its occurrence counts; sorts the technical terms in descending order of occurrence count or weighted sum, and selects the M most frequent technical terms as candidate expansion terms, where M is a natural number;

Feature extraction and labeling module for candidate expansion terms: extracts features for the candidate expansion terms produced by the candidate expansion term extraction module, and labels each candidate's degree of relevance according to its effect on retrieval performance;

Candidate expansion term ranking model training module: after the features have been extracted and the relevance labels assigned, trains the term ranking model with the word grouping sorting algorithm to obtain the weight of each feature of the candidate expansion terms;

The online retrieval part includes:

Query reconstruction module: extracts technical terms for a new query and scores the candidate expansion terms. It comprises an online search engine query extraction module and an online candidate expansion term extraction, feature extraction, and scoring module. The online search engine query extraction module retrieves the top N1 results for a new query submitted online by a user and, using biomedical resources, extracts the technical terms in those results together with their features, where N1 is a natural number. The online candidate expansion term extraction, feature extraction, and scoring module uses the candidate scores output by the term ranking model to compute the corresponding term weights and adds the terms to the original query, producing the expanded query;

Query result return module: returns to the user the result documents retrieved with the expanded query.

The beneficial effects of the present invention are as follows. Working from the perspective of query expansion, the invention uses the word grouping sorting algorithm and resources native to the biomedical field, such as dictionaries, to select the technical terms that best express the user's information need, so that the retrieval task is completed more effectively and the user receives results that better match that need. The invention uses resources within the biomedical field to supplement and refine the original query and thereby improve retrieval performance. On the TREC Genomics track literature collection, with the traditional BM25 retrieval model as the baseline, literature retrieval achieves an accuracy of 25.62%; applying the method and system of the invention on top of that baseline achieves 26.30%. Retrieval performance is thus significantly improved, and the literature retrieval system of the invention can effectively retrieve the biomedical literature most relevant to the user's query, increasing user satisfaction.

Brief Description of the Drawings

Fig. 1 is a flow chart of the retrieval method of the present invention;

Fig. 2 is a schematic diagram of the logical structure of the retrieval system of the present invention.

Detailed Description

The present invention is described below with reference to the accompanying drawings and specific embodiments:

Fig. 1 is a flow chart of the biomedical literature retrieval method based on a word grouping sorting algorithm according to the present invention. The method comprises an offline training phase and an online query phase, wherein the offline training phase includes the following steps:

S1, search engine query extraction step: from the historical query records of the search engine, extract multiple queries and the top N result documents obtained for each query, and collect the queries and their result documents into a query pool, where N is a natural number. In this embodiment, N = 10;

Here, the historical query records of the search engine are mainly the query history recorded by a retrieval system for biomedical literature, together with the corresponding results. These queries and results are used to train the ranking model offline.

When no relevant historical query records are available, records of queries and their retrieval results can be obtained manually by constructing biomedical queries and retrieving with them. The retrieval method can use any of several ranking models from traditional information retrieval, including but not limited to the vector space model, the BM25 retrieval model, and language models with different smoothing methods.
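As one illustration of how query-result records could be produced with the BM25 retrieval model mentioned above, the sketch below ranks tokenized documents with a common BM25 variant (the smoothed IDF used by several open-source engines). The text does not say which BM25 variant or parameter values the embodiment uses, so the defaults `k1=1.2` and `b=0.75` are assumptions; note that this `k1` is BM25's saturation parameter and is unrelated to the K1 of step S6. The function name `bm25_rank` is hypothetical.

```python
import math
from collections import Counter


def bm25_rank(query_terms, docs, k1=1.2, b=0.75):
    """Return document indices sorted by descending BM25 score."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for tokens in docs:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


# Toy collection for illustration only.
docs = [
    ["gene", "expression", "in", "liver"],
    ["protein", "folding"],
    ["gene", "gene", "mutation"],
]
ranking = bm25_rank(["gene"], docs)
top_n = ranking[:2]  # the top N results that would be stored in the query pool
print(ranking)
```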

S2, candidate expansion term extraction step: using biomedical resources, extract the technical terms in the top N result documents of each query in the query pool, and count for each technical term either its number of occurrences in those result documents or the weighted sum of its occurrence counts; sort the technical terms in descending order of occurrence count or weighted sum, and select the M terms with the highest count or weighted sum as candidate expansion terms, where M is a natural number;

Here, biomedical resources are resources such as dictionaries or knowledge bases that contain biomedical technical terms, including but not limited to: Medical Subject Headings (MeSH); the Gene Ontology (GO); and the Metathesaurus, Semantic Network, and SPECIALIST Lexicon and Lexical Tools released by the Unified Medical Language System (UMLS).

Taking Medical Subject Headings (MeSH) as the biomedical resource, the technical terms in the top N result documents of a query are extracted, and each extracted term is associated with its number of occurrences in those documents or with the weighted sum of its occurrence counts. For example, the weighted sum for a technical term (term) over the top N documents is computed from count_i and d(i) (the formula is not reproduced in this text), where count_i is the number of times the term appears in the i-th document and d(i) is the decay factor of the i-th document. The weighted sum weights the term frequencies found in different documents so that frequencies in higher-ranked documents carry more weight and technical terms in lower-ranked documents receive lower scores. The extracted technical terms are sorted in descending order of count(term) or in descending order of score(term), and the top M terms are selected as candidate expansion terms; in this embodiment, M = 150.

S3, feature extraction and relevance labeling step for candidate expansion terms:

Feature extraction and labeling of the candidate expansion terms are carried out together. The relevance label of a candidate expansion term is obtained by comparing the retrieval performance of the original query with the retrieval performance when the term is added to the original query. The idea is to add a single candidate expansion term to the original query and retrieve with it; if the performance of the retrieval results improves, the term is labeled as relevant to the original query. Evaluation metrics for retrieval performance include but are not limited to precision, mean average precision (MAP), NDCG, and MRR. The label is assigned as follows:

$label = \begin{cases} 1 & eval(query + term) > eval(query) \\ 0 & eval(query + term) \leq eval(query) \end{cases}$

其中，eval()为用于评价检索性能高低的评价指标函数，eval(query+term)为评价指标函数eval()在评价将候选扩展词汇term加入到给定查询query时的得分，eval(query)为评价指标函数在评价给定查询query时的得分。当用原始查询加上某一候选词汇进行检索的评价得分大于原始查询本身进行检索的评价得分时，将该候选扩展词汇标注为1，标注为1意味着该词汇与原始查询是相关的；而当原始查询加上某一候选词汇进行检索的评价得分不大于原始查询本身进行检索的评价得分时，将该候选扩展词汇标注为0，标注为0意味着该词汇与原始查询时不相关的。Among them, eval() is the evaluation index function used to evaluate the retrieval performance, eval(query+term) is the score of the evaluation index function eval() when adding the candidate extended vocabulary term to the given query query, eval(query+term) ) is the score of the evaluation index function when evaluating a given query query. When the evaluation score of retrieval with the original query plus a candidate vocabulary is greater than the evaluation score of the original query itself, the candidate extended vocabulary is marked as 1, which means that the vocabulary is related to the original query; and When the evaluation score of the original query plus a candidate vocabulary is not greater than the evaluation score of the original query itself, the candidate extended vocabulary is marked as 0, which means that the vocabulary is irrelevant to the original query.

In this embodiment, the evaluation function eval() is the mean average precision, namely:

$eval_{MAP} = \frac{1}{RelDoc_{query}} \cdot \sum_{i=1}^{RelDoc_{query}} \frac{i}{rank(i)}$

where $RelDoc_{query}$ is the number of documents relevant to the given query and rank(i) is the position of the i-th relevant document in the ranked result list. For example, rank(3) = 5 means that the third relevant document appears at position 5 of the ranked list.
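The labeling rule and the average-precision evaluation can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: `search` is a hypothetical callable that returns a ranked list of document ids for a list of query terms, and relevant documents that are never retrieved are assumed to contribute zero to the sum.

```python
from typing import Callable, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """(1 / RelDoc) * sum of i / rank(i) over the relevant documents retrieved."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1                 # this is the i-th relevant document
            total += hits / position  # i / rank(i)
    return total / len(relevant)


def label_term(
    query: List[str],
    term: str,
    search: Callable[[List[str]], List[str]],
    relevant: Set[str],
) -> int:
    """Label 1 if adding the term strictly improves the metric, else 0."""
    baseline = average_precision(search(query), relevant)
    expanded = average_precision(search(query + [term]), relevant)
    return 1 if expanded > baseline else 0


def toy_search(terms: List[str]) -> List[str]:
    """Hypothetical engine: adding 'prions' moves relevant documents to the top."""
    if "prions" in terms:
        return ["d1", "d3", "d2"]
    return ["d2", "d1", "d3"]


relevant_docs = {"d1", "d3"}
label_prions = label_term(["mad", "cow", "disease"], "prions", toy_search, relevant_docs)
label_cause = label_term(["mad", "cow", "disease"], "cause", toy_search, relevant_docs)
```

In the toy data, "prions" raises the average precision and is labeled 1, while "cause" leaves the ranking unchanged and is labeled 0, because the rule requires a strict improvement.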

Feature extraction for the candidate expansion terms draws on the biomedical resource and on the top N result documents returned for the queries in the query pool. It extracts the distribution of each candidate expansion term in the result documents, its distribution in the biomedical resource, and its relevance to the original query, in preparation for training the ranking model. After the features of a candidate expansion term have been extracted, all feature values are normalized to the interval [0, 1] as follows:

$newFeatureValue = \frac{oldFeatureValue - minValue}{maxValue - minValue},$ where minValue and maxValue are the minimum and maximum values of the given feature, respectively.
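A minimal sketch of the min-max normalization for one feature column follows. The handling of a constant feature (maxValue equal to minValue), which the formula leaves undefined, is an assumption of this sketch, and the input list is assumed to be non-empty.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale one feature's values over all candidate terms into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no ranking signal, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_normalize([2.0, 4.0, 6.0])
```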

The features of an expansion term specifically include:

1. The frequency TF of the candidate expansion term in the result documents. This feature is obtained from the number of occurrences of the term in the result documents.

2. The TF-IDF value of the candidate expansion term. TF-IDF is one of the classic models in information retrieval and measures the relative importance of a term. It is computed as follows:

$score_{TF-IDF} = count(term) \cdot \log \frac{TotalDoc}{df(term)}$

where count(term) is the number of times the candidate expansion term appears in the result documents, TotalDoc is the total number of documents in the training data, and df(term) is the number of documents in which the candidate expansion term appears.

3. The number of documents in which the candidate expansion term co-occurs with the original query. This feature measures how closely the candidate expansion term is related to the original query.

4. The number of times the candidate expansion term co-occurs with the original query within the same text window. This feature measures how closely the candidate expansion term is related to the query words within a limited range. The text window is defined, within a document containing both an original query word and the candidate term, by the number of words separating the expansion term from the query word.

5. The number of occurrences of the candidate expansion term in a biomedical resource such as MeSH. This feature measures the distribution of the candidate expansion term in the biomedical resource.

6. The number of term concepts in a biomedical resource such as MeSH that contain the candidate expansion term. Biomedical term concepts frequently contain one another, so this feature also measures the importance of a candidate term in the biomedical resource.

Among these features, features 1 and 2 measure the distribution of the candidate expansion term in the document collection, features 3 and 4 measure its relevance to the original query, and features 5 and 6 measure its distribution in the biomedical resource. The expansion term features covered by the present invention include but are not limited to those listed above. The extracted features serve as the input of the word grouping sorting algorithm and allow the importance of each candidate expansion term to be measured more accurately.
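Features 2 to 4 can be sketched as below. The sketch assumes tokenized documents, a natural logarithm, co-occurrence with any one query word (rather than all of them), and a window expressed as a maximum distance between token positions; the function names and the window size of 5 are hypothetical, as the text does not specify them.

```python
import math
from typing import List, Set


def tf_idf(count_term: int, total_docs: int, doc_freq: int) -> float:
    """count(term) * log(TotalDoc / df(term)), natural logarithm assumed."""
    return count_term * math.log(total_docs / doc_freq)


def cooccurring_docs(docs: List[List[str]], query_terms: Set[str], term: str) -> int:
    """Feature 3: documents containing the term and at least one query word."""
    return sum(
        1 for tokens in docs
        if term in tokens and any(q in tokens for q in query_terms)
    )


def window_cooccurrences(
    tokens: List[str], query_terms: Set[str], term: str, window: int = 5
) -> int:
    """Feature 4: (term, query word) pairs at most `window` positions apart."""
    term_positions = [i for i, t in enumerate(tokens) if t == term]
    query_positions = [i for i, t in enumerate(tokens) if t in query_terms]
    return sum(
        1
        for p in term_positions
        for q in query_positions
        if p != q and abs(p - q) <= window
    )


query = {"mad", "cow"}
docs = [["mad", "cow", "prions"], ["prions", "protein"], ["cow", "disease"]]
sentence = ["mad", "cow", "disease", "is", "caused", "by", "prions"]

n_docs = cooccurring_docs(docs, query, "prions")
n_window = window_cooccurrences(sentence, query, "prions", window=5)
```

Features 5 and 6 would be computed the same way against the entries of the biomedical resource instead of the result documents.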

S4. Training of the candidate expansion term ranking model: the relevance labels and the features of the candidate expansion terms obtained in step S3 are taken as input, and the ranking model of the word grouping sorting algorithm is trained to obtain a weight value for each feature. The specific steps are as follows. One candidate expansion term labeled as relevant in step S3 (label 1) and several candidate expansion terms labeled as irrelevant (label 0) are combined into a word group, and a number of such word groups are selected as training samples. The features are given random initial weights, and the candidate expansion terms within each word group are sorted by their feature-weighted scores. From the sorting result of each word group, the overall ranking loss is computed, and the weight of each feature dimension is adjusted dynamically according to the gradient of the loss function. The overall ranking loss aggregates the per-group losses, where NumSample is the number of word groups in the training samples and $loss_i$ is the loss value of the i-th word group. This loss value is computed from the sorted position of the relevant expansion term: the higher the position, the smaller the loss. The process is iterated until the overall loss falls below a threshold or the specified number of iterations is reached, at which point training is complete and the feature weights finally obtained constitute the trained ranking model. In this embodiment, training terminates after 100 iterations.

In this embodiment, the loss value of a word group is computed from $rank_i$, the position of the relevant candidate expansion term in the sorted word group list: the loss is 0 when the term ranks first and is maximal when it ranks last. The loss value may also be computed by other formulas.
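The training loop of step S4 can be sketched as follows. The embodiment's rank-based loss formula is not reproduced in this text, so the sketch substitutes a softmax (top-one) surrogate loss, which is also small when the relevant term ranks first and large when it ranks last, and which has a usable gradient. The surrogate, the learning rate, the averaging over groups, and all names are assumptions of this sketch rather than the patented method; only the group construction, the random initial weights, and the 100 iterations follow the description.

```python
import math
import random
from typing import List, Sequence, Tuple

# A word group is a list of feature vectors. By convention in this sketch,
# index 0 holds the single relevant term; the rest are irrelevant terms.
Group = List[List[float]]


def score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Feature-weighted score of one candidate term."""
    return sum(a * f for a, f in zip(weights, features))


def group_loss_and_grad(
    weights: Sequence[float], group: Group
) -> Tuple[float, List[float]]:
    """Softmax surrogate loss of one group and its gradient w.r.t. the weights."""
    scores = [score(weights, x) for x in group]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    loss = -math.log(probs[0])  # small when the relevant term outscores the rest
    grad = [
        sum(p * x[k] for p, x in zip(probs, group)) - group[0][k]
        for k in range(len(weights))
    ]
    return loss, grad


def total_loss(weights: Sequence[float], groups: List[Group]) -> float:
    """Overall loss, taken here as the mean of the per-group losses (assumption)."""
    return sum(group_loss_and_grad(weights, g)[0] for g in groups) / len(groups)


def train(
    groups: List[Group],
    num_features: int,
    iterations: int = 100,
    learning_rate: float = 0.5,
    seed: int = 0,
) -> List[float]:
    """Gradient descent on the feature weights, starting from random values."""
    rng = random.Random(seed)
    weights = [rng.random() for _ in range(num_features)]
    for _ in range(iterations):
        mean_grad = [0.0] * num_features
        for g in groups:
            _, grad = group_loss_and_grad(weights, g)
            for k in range(num_features):
                mean_grad[k] += grad[k] / len(groups)
        weights = [w - learning_rate * gk for w, gk in zip(weights, mean_grad)]
    return weights


# Toy training data: two normalized features per term, relevant term first.
groups = [
    [[0.9, 0.2], [0.1, 0.8], [0.3, 0.7]],
    [[0.8, 0.1], [0.2, 0.9], [0.4, 0.6]],
]
weights = train(groups, num_features=2)
```

A stopping test on `total_loss` could be added to implement the threshold criterion mentioned in the description; the sketch uses only the fixed iteration count.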

In the ranking model, the final score of an expansion term is computed as follows:

$score(term) = \sum_{i=1}^{FeatureNum} a_i \cdot feature_i(term)$

where FeatureNum is the total number of features, $a_i$ is the weight of the i-th feature, and $feature_i(term)$ is the value of the i-th feature of the candidate term. The ranking model obtained by training can then be used to select expansion terms for test queries. All of the above steps are completed offline.
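The scoring formula and the subsequent selection of the best candidates can be sketched as below. The weights and feature values are made-up illustrative numbers, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def score_term(weights: Sequence[float], features: Sequence[float]) -> float:
    """score(term) = sum over i of a_i * feature_i(term)."""
    return sum(a * f for a, f in zip(weights, features))


def top_k_terms(
    weights: Sequence[float], candidates: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Score every candidate and keep the k highest-scoring terms."""
    scored = [(term, score_term(weights, feats)) for term, feats in candidates.items()]
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Illustrative trained weights and normalized feature values (not from the patent).
trained_weights = [0.5, 0.25]
candidate_features = {
    "prions": [1.0, 1.0],
    "cause": [0.0, 1.0],
    "spongiform": [1.0, 0.0],
}
best = top_k_terms(trained_weights, candidate_features, k=2)
```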

S5. Online search engine query and extraction: for a new query submitted online by the user, the top N1 query results are retrieved, and the domain terms in the top N1 results and their features are extracted according to the biomedical resource, where N1 is a natural number.

Note that this step takes place online. After the user submits a query to the biomedical literature search engine, the method automatically obtains the N1 highest-ranked results of the initial retrieval and uses them for query expansion and related processing. This processing is transparent to the user.

S6. Online candidate expansion term extraction, feature extraction, and scoring: for the new query, the candidate extraction method and the feature extraction method of offline steps S2 and S3 are applied, using the biomedical resource, to the top N1 retrieval results, yielding the candidate expansion terms of the online query phase and their features. The extracted features measure the importance of each candidate expansion term in the expanded query. The candidate expansion terms of the online query phase are scored with the feature weights obtained by training in step S4, and the K1 highest-scoring candidates are added to the new query submitted online to form the expanded query of the online phase, where K1 is a natural number.

For a candidate expansion term of the online query phase that has been identified and extracted using the biomedical resource, the score is $score(term) = \sum_{i=1}^{FeatureNum} a_i \cdot feature_i(term)$, where FeatureNum is the total number of features, $a_i$ is the weight of the i-th feature in the ranking model, and $feature_i(term)$ is the value of the i-th feature of the candidate expansion term in the online query phase.

The candidate expansion terms are sorted by score, and the top K1 terms are added to the new query. The weight of an added term in the expanded query can be expressed as $weight = \sum_{i=1}^{count} weight_i \cdot feature_i + sign \cdot weight_{original},$ where sign is an indicator function: sign = 1 when the candidate expansion term already appears in the new query submitted online, and sign = 0 otherwise. $weight_{original}$ is the weight of the new query submitted online within the expanded query.

The final expanded query takes the following form:

(weight₁ query_original weight₂ (w₁ term₁ w₂ term₂ … w_K term_K))

where weight₁ is the weight of the new query submitted online within the expanded query, weight₂ is the weight of the added expansion terms as a whole within the expanded query, w₁, w₂, …, w_K are the score weights of the expansion terms term₁, term₂, …, term_K, and K is the number of expansion terms finally selected. In this embodiment, weight₁ is 0.5, weight₂ is 0.5, and K is 50.
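Assembling the expanded query string can be sketched as follows. The output mirrors the bracketed form given in the description; normalizing the term scores so that they sum to 1 and printing them with two decimals are assumptions of this sketch, and `build_expanded_query` is a hypothetical name. How a particular search engine parses such a weighted query is outside the scope of the sketch.

```python
from typing import List, Tuple


def build_expanded_query(
    original_query: str,
    scored_terms: List[Tuple[str, float]],
    weight_original: float = 0.5,   # weight1 in the embodiment
    weight_expansion: float = 0.5,  # weight2 in the embodiment
) -> str:
    """Format (weight1 query weight2 (w1 term1 ... wK termK))."""
    total = sum(s for _, s in scored_terms)
    parts = []
    for term, s in scored_terms:
        # Assumption: term weights are the scores normalized to sum to 1.
        w = s / total if total > 0 else 0.0
        parts.append(f"{w:.2f} {term}")
    expansion = " ".join(parts)
    return f"({weight_original} {original_query} {weight_expansion} ({expansion}))"


expanded = build_expanded_query(
    "mad cow disease", [("prions", 0.75), ("spongiform", 0.25)]
)
```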

S7. Returning query results: retrieval is performed with the expanded query and the results are returned to the user, completing the retrieval process.

Corresponding to the above method, the present invention also provides a biomedical literature retrieval system based on the word grouping sorting algorithm. Figure 2 shows the logical structure of the system.

Search engine query extraction module: extracts, from the historical query records of the search engine, multiple queries and the top N result documents obtained for each query, and collects the queries and their result documents into a query pool, where N is a natural number. The module can also retrieve the biomedical literature associated with a user's query and return the results to the user, while internal operations such as query expansion remain transparent to the user.

Candidate expansion term extraction module: given a user query, uses the resources of the biomedical domain to extract domain terms from the top N result documents obtained by the search engine query extraction module, and records for each term its number of occurrences (frequency) in the result documents or the weighted sum of its occurrences. The terms are sorted in descending order of occurrence count or weighted sum, and the M highest-ranked terms are selected as candidate expansion terms, where M is a natural number.

Candidate expansion term feature extraction and labeling module: extracts features for the candidate expansion terms obtained by the candidate expansion term extraction module and labels the relevance of each candidate according to its effect on retrieval performance. During offline training, the relevance labels and the features of the candidate expansion terms serve as input to the word grouping sorting algorithm. During online querying, the module extracts the feature information associated with the candidate expansion terms.

Candidate expansion term ranking model training module: after the features of the candidate expansion terms have been extracted and their relevance labeled, uses the word grouping sorting algorithm to train the term ranking model, which outputs a weight value for each feature. These weights can be used to measure the importance of expansion terms for previously unseen queries.

The online retrieval part comprises:

Query reconstruction module: extracts domain terms for a new query and scores the candidate expansion terms. It comprises an online search engine query extraction module and an online candidate expansion term extraction, feature extraction, and scoring module. The online search engine query extraction module retrieves the top N1 results for a new query submitted online by the user and extracts the domain terms in those results and their features according to the biomedical resource, where N1 is a natural number. The online candidate expansion term extraction, feature extraction, and scoring module uses the feature weights output by the term ranking model to score the candidate expansion terms, computes their weights, and adds them to the original query to obtain the expanded query.

Query result return module: returns to the user the result documents retrieved with the expanded query. The results the user receives are in fact those retrieved for the submitted query after query expansion, and the expansion process itself is invisible to the user.

Based on the above description of the method and system of the present invention, a specific example follows. In this embodiment it is assumed that the ranking model has already been trained on historical data. When the user submits the new query "mad cow disease", the system first selects candidate expansion terms according to term frequency in the top-ranked documents of the initial retrieval. The ten highest-ranked candidate expansion terms and their relevance labels are shown in the following table:

| Rank | Term | Relevance |
|------|------|-----------|
| 1 | disease | relevant |
| 2 | prions | relevant |
| 3 | cause | irrelevant |
| 4 | infectious | relevant |
| 5 | conversion | irrelevant |
| 6 | cow | relevant |
| 7 | spongiform | relevant |
| 8 | fatal | irrelevant |
| 9 | encephalopathies | relevant |
| 10 | mad | relevant |

As the table shows, three of the ten highest-ranked candidate expansion terms are irrelevant, and adding them directly to the original query would degrade retrieval performance. Next, features related to the candidate expansion terms are extracted from the documents and from the biomedical thesaurus MeSH, and all candidate expansion terms are re-scored and re-sorted using the feature weights of the ranking model.

The ten highest-ranked expansion terms finally selected after sorting are shown in the table below. The table shows that, after re-sorting, the ten highest-ranked expansion terms are all relevant. Adding these terms to the original query, with their normalized ranking scores as weights, further improves retrieval performance.

The above embodiments explain and illustrate the biomedical literature retrieval method and system based on the word grouping sorting algorithm provided by the present invention. The method and system expand the original query submitted by the user using resources such as knowledge bases of the biomedical domain, and apply the word grouping sorting algorithm to measure the importance of expansion terms. Query expansion supplements and refines the submitted query, which improves the accuracy of the query results and better satisfies the user's information needs.

The above is a further detailed description of the present invention in connection with specific preferred technical solutions, and the implementation of the present invention is not to be regarded as limited to this description. Simple deductions or substitutions made by a person of ordinary skill in the art without departing from the concept of the present invention shall all be regarded as falling within the scope of protection of the present invention.

Claims

1. A biomedical literature retrieval method based on a word grouping sorting algorithm, characterized in that it comprises an offline training phase and an online query phase, wherein the offline training phase comprises the following steps:

S1. Search engine query extraction: according to the historical query records of the search engine, extract multiple queries and the top N result documents obtained for each query, and collect the queries and their result documents into a query pool, where N is a natural number;

S2. Candidate expansion term extraction: according to the biomedical resource, extract the domain terms in the top N result documents of each query in the query pool, and obtain by counting the number of occurrences of each domain term in the result documents or the weighted sum of its occurrences; sort the domain terms in descending order of the number of occurrences or of the weighted sum, and select the M terms with the highest number of occurrences or the highest weighted sum as candidate expansion terms, where M is a natural number;

S3. Feature extraction and labeling of candidate expansion terms:

Feature extraction and labeling of the candidate expansion terms are carried out together; the relevance label of a candidate expansion term is obtained by comparing the retrieval performance of the original query with the retrieval performance when the candidate expansion term is added to the original query; the evaluation metrics for retrieval performance include precision, mean average precision, NDCG, and MRR; the relevance labeling rule is as follows:

$label = \begin{cases} 1 & eval(query + term) > eval(query) \\ 0 & eval(query + term) \leq eval(query) \end{cases}$

where eval() is the evaluation metric function used to measure retrieval performance, eval(query+term) is the score of eval() when the candidate expansion term is added to the query, and eval(query) is the score of eval() for the query alone; a label of 1 indicates that the candidate expansion term is relevant to the query, and a label of 0 indicates that it is not relevant to the query;

Feature extraction for the candidate expansion terms draws on the biomedical resource and on the top N result documents returned for the queries in the query pool, and extracts the distribution of each candidate expansion term in the result documents, its distribution in the biomedical resource, and its relevance to the original query, in preparation for training the ranking model; after the features of a candidate expansion term have been extracted, all feature values are normalized to the interval [0, 1] as follows:

$newFeatureValue = \frac{oldFeatureValue - minValue}{maxValue - minValue}$

where minValue and maxValue are the minimum and maximum values of the given feature, respectively;

S4. Training of the candidate expansion term ranking model: with the relevance labels and the features of the candidate expansion terms as input, the word grouping sorting algorithm is trained to obtain a weight value for each feature; the specific steps are: one candidate expansion term labeled as relevant in step S3 and several candidate expansion terms labeled as irrelevant are combined into a word group, and a number of such word groups are selected as training samples; the features are given random initial weights, and the candidate expansion terms within each word group are sorted by their feature-weighted scores; from the sorting result of each word group, the overall ranking loss is computed, and the weight of each feature dimension is adjusted dynamically according to the gradient of the loss function, wherein the overall ranking loss aggregates the per-group losses, NumSample is the number of word groups in the training samples, and $loss_i$ is the loss value of the i-th word group, computed from the sorted position of the relevant expansion term such that a higher position gives a smaller loss; the process is iterated until the overall loss falls below a threshold or the specified number of iterations is reached, at which point training is complete and the feature weights finally obtained constitute the trained ranking model;

The online query phase comprises the following steps:

S5. Online search engine query and extraction: for a new query submitted online by the user, retrieve the top N1 query results, and extract the domain terms in the top N1 results and their features according to the biomedical resource, where N1 is a natural number;

S6. Online candidate expansion term extraction, feature extraction, and scoring: for the new query, apply the candidate extraction method and the feature extraction method of offline steps S2 and S3, using the biomedical resource, to the top N1 retrieval results to obtain the candidate expansion terms of the online query phase and their features, the extracted features being used to measure the importance of each candidate expansion term in the expanded query; score the candidate expansion terms of the online query phase with the feature weights obtained by training in step S4, and add the K1 highest-scoring candidates to the new query submitted online to form the expanded query, where K1 is a natural number;

For a candidate expansion term of the online query phase that has been identified and extracted using the biomedical resource, the score is $score(term) = \sum_{i=1}^{FeatureNum} a_i \cdot feature_i(term)$, where FeatureNum is the total number of features, $a_i$ is the weight of the i-th feature in the ranking model, and $feature_i(term)$ is the value of the i-th feature of the candidate expansion term in the online query phase;

When the candidate expansion terms of the online query phase are sorted by score and the top K1 terms are added as expansion terms to the new query submitted online, the weight of an added term in the expanded query can be expressed as

$weight = \sum_{i=1}^{count} weight_i \cdot feature_i + sign \cdot weight_{original},$

where sign is an indicator function, with sign = 1 when the candidate expansion term of the online query phase appears in the new query submitted online and sign = 0 otherwise, and $weight_{original}$ is the weight of the new query submitted online within the expanded query;

S7. Returning query results: perform retrieval with the expanded query and return the retrieval results to the user.

2. The biomedical literature retrieval method based on a word grouping sorting algorithm according to claim 1, characterized in that, in step S2, the weighted sum of the occurrences of a domain term in the query result documents is $score(term) = \sum_{i=1}^{N} count_i \cdot d(i)$, where $count_i$ is the number of times the term appears in the i-th document and $d(i)$ is the decay factor of the i-th document.

3. The biomedical literature retrieval method based on a word grouping sorting algorithm according to claim 1, characterized in that, in step S3, the evaluation metric function eval() is the mean average precision function, namely:

$eval_{MAP} = \frac{1}{RelDoc_{query}} \cdot \sum_{i=1}^{RelDoc_{query}} \frac{i}{rank(i)}$

Wherein, RelDoc _query is the number of related documents of a given query query, and rank(i) indicates the position of the i-th related document in the ranking list of document results.

4. a kind of biomedical literature retrieval method based on word grouping sorting algorithm according to claim 1, is characterized in that, in step S1, when there is no historical query record, by constructing biomedical query and retrieval method The method is to manually obtain the records of the query and its results; the retrieval method uses a vector space model, a BM25 retrieval model or a language model based on different smoothing methods.

5. The biomedical literature retrieval method based on the word grouping sorting algorithm according to claim 1, characterized in that the loss value in step S4 is: Wherein rank_{i} is the ranking position of the relevant candidate extended vocabulary in the word group list.

6. A biomedical literature retrieval method based on word grouping and sorting algorithm according to claim 1, wherein the biomedical resources refer to dictionaries or knowledge bases containing biomedical professional vocabulary.

7. The biomedical literature retrieval method based on the word grouping sorting algorithm according to claim 1, characterized in that the features of the candidate extended vocabulary comprise: the frequency TF with which the candidate extended vocabulary appears in the result documents; the TF-IDF value of the candidate extended vocabulary; the number of documents in which the candidate extended vocabulary co-occurs with the original query; the number of times the candidate extended vocabulary and the original query co-occur in the same text window; the number of occurrences of the candidate extended vocabulary in the biomedical resources; and the number of biomedical terms and concepts in the biomedical resources that contain the candidate extended vocabulary, together with the inclusion relationships between those terms and concepts.
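The document-based features listed in claim 7 can be illustrated with a short Python sketch. The claim does not specify the IDF variant, the window size, or how documents are tokenized, so the sketch assumes single-token terms, IDF = log(N / df) computed over the result documents only, and a symmetric token window; the features drawn from biomedical resources are omitted. All names are hypothetical.

```python
import math
from typing import Dict, List


def candidate_features(
    term: str,
    query_terms: List[str],
    docs: List[List[str]],
    window: int = 5,
) -> Dict[str, float]:
    """Compute TF, TF-IDF, document co-occurrence and window co-occurrence
    of a candidate term with the original query over tokenized result documents."""
    tf = sum(doc.count(term) for doc in docs)
    df = sum(1 for doc in docs if term in doc)
    idf = math.log(len(docs) / df) if df else 0.0  # assumed IDF variant

    # Number of documents containing both the candidate and any query term.
    cooc_docs = sum(
        1 for doc in docs if term in doc and any(q in doc for q in query_terms)
    )

    # Number of query-term tokens within `window` positions of each occurrence.
    cooc_window = 0
    for doc in docs:
        for pos, token in enumerate(doc):
            if token != term:
                continue
            lo, hi = max(0, pos - window), min(len(doc), pos + window + 1)
            cooc_window += sum(
                1 for j in range(lo, hi) if j != pos and doc[j] in query_terms
            )

    return {
        "tf": float(tf),
        "tf_idf": tf * idf,
        "cooc_docs": float(cooc_docs),
        "cooc_window": float(cooc_window),
    }


# Illustrative tokenized result documents.
result_docs = [
    ["diabetes", "insulin", "therapy"],
    ["insulin", "dose"],
    ["diabetes", "diet"],
]
features = candidate_features("insulin", ["diabetes"], result_docs, window=1)
```

The resulting dictionary can be flattened into the feature vector that the sorting model consumes; the order of the features is a free choice as long as it is the same in training and in online scoring.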

8. A biomedical literature retrieval system based on the word grouping sorting algorithm, characterized in that it comprises an offline training part and an online retrieval part; the offline training part comprises the following modules:

Search engine query extraction module: used to extract multiple sets of queries and the first N query result documents obtained in each query according to the historical query records of the search engine; and collect the queries and query result documents into a query pool, where N is a natural number;

Candidate extended vocabulary extraction module: used, when a user query is given, to extract professional vocabulary from the first N query result documents obtained by the search engine query extraction module, using the inherent resources of the biomedical field, and to record the number of occurrences, or the weighted sum of the number of occurrences, of each professional vocabulary term in the query result documents; the professional vocabulary is sorted in descending order of the number of occurrences or of the weighted sum, and the M highest-ranked terms are selected as candidate extended vocabulary, where M is a natural number;

Candidate extended vocabulary feature extraction and labeling module: used to extract the features of the candidate extended vocabulary obtained by the candidate extended vocabulary extraction module, and to label the degree of relevance of each candidate extended vocabulary according to its impact on retrieval performance;

Candidate extended vocabulary sorting model training module: used, after the features of the candidate extended vocabulary have been extracted and its degree of relevance labeled, to train the vocabulary sorting model with the word grouping sorting algorithm and obtain the weight value of each feature of the candidate extended vocabulary;

The online retrieval part comprises the following modules:

Query reconstruction module: used to extract professional vocabulary and score candidate extended vocabulary for a new query; it comprises an online search engine query extraction module and an online candidate extended vocabulary extraction, feature extraction and scoring module. The online search engine query extraction module retrieves the first N1 query results for the new query submitted online by the user, and extracts the professional vocabulary in the first N1 results, together with its features, according to the biomedical resources, where N1 is a natural number. The online candidate extended vocabulary extraction, feature extraction and scoring module uses the weight values output by the vocabulary sorting model to calculate the corresponding weight of each candidate extended vocabulary, and adds the vocabulary to the original query to obtain the extended query;

Query result return module: used to return the result document retrieved by the extended query to the user.