WO2022120975A1 - Document searching method and apparatus, and electronic device - Google Patents

Document searching method and apparatus, and electronic device

Info

Publication number
WO2022120975A1
WO2022120975A1 PCT/CN2020/139255 CN2020139255W WO2022120975A1 WO 2022120975 A1 WO2022120975 A1 WO 2022120975A1 CN 2020139255 W CN2020139255 W CN 2020139255W WO 2022120975 A1 WO2022120975 A1 WO 2022120975A1
Authority
WO
WIPO (PCT)
Prior art keywords
entry
scholar
search
entries
matrix
Prior art date
Application number
PCT/CN2020/139255
Other languages
French (fr)
Chinese (zh)
Inventor
吴嘉澍
王洋
须成忠
Original Assignee
中国科学院深圳先进技术研究院
Priority date
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2022120975A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The invention belongs to the technical field of document search, and in particular relates to a document search method, apparatus, and electronic device.
  • The purpose of the present invention is to provide a document search method, apparatus, and electronic device that address the low search quality of the prior art when search keywords do not appear explicitly.
  • The present invention provides a document search method, applied to an electronic device, including:
  • Entries are expanded for all documents according to the hierarchical relationship of entries.
  • A machine learning algorithm is trained on the scholar-entry matrix to generate a search model.
  • A matching operation is performed on the search keywords in the search model, and a document search result is output according to the matching degree.
  • The step of obtaining the entries in all the documents of each scholar includes:
  • The entry hierarchical relationship is an academic vocabulary hierarchical relationship.
  • The step of performing entry expansion on all documents according to the entry hierarchical relationship includes:
  • The academic vocabulary is expanded to upper-level entries.
  • The occurrence status includes the number of occurrences and the occurrence position. For each scholar, the step of assigning differently weighted scores to the entries according to the occurrence status of each entry in the documents written by the scholar and the expansion of the entry includes:
  • The step of assigning differently weighted scores to the entries includes:
  • The corresponding expansion score is assigned to the entry.
  • The step of training a machine learning algorithm on the scholar-entry matrix to generate a search model includes:
  • The present invention provides a document search apparatus, comprising:
  • An entry acquisition module, used to acquire the entries in all documents of each scholar.
  • An entry expansion module, used to expand the entries of all documents according to the hierarchical relationship of entries.
  • A matrix building module, used to assign, for each scholar, differently weighted scores to the entries according to the occurrence status of each entry in the documents written by the scholar and the expansion of the entry, so as to construct a scholar-entry matrix.
  • A training module, used to train a machine learning algorithm on the scholar-entry matrix to generate a search model.
  • A search module, used to perform a matching operation on the search keywords in the search model and output the document search results according to the matching degree.
  • The present invention also provides an electronic device, comprising:
  • The memory stores readable instructions that, when executed by the processor, implement the method of the first aspect.
  • The present invention provides a computer-readable storage medium having stored thereon a computer program that, when executed, implements the method of the first aspect.
  • The expansion score takes a global view of all the documents written by the scholar, based on how the entry is expanded in the scholar's other documents. It takes into account factors such as the expansion of the entry in the scholar's other documents, the distance between the expanded entry and the original entry in the vocabulary hierarchy, and the position of the original entry in the document, so that documents are expanded reasonably. Scholar-oriented literature search therefore works better, which effectively addresses the search quality problem when search keywords do not appear explicitly.
  • FIG. 1 is a flowchart of the document search method of Embodiment 1.
  • FIG. 2 is a schematic diagram of a knowledge tree containing the hierarchical relationship of academic vocabulary.
  • FIG. 3 is a schematic diagram illustrating text entry expansion according to an exemplary embodiment.
  • FIG. 4 is a flowchart of a specific application of Embodiment 1 according to an exemplary embodiment.
  • FIG. 5 is a structural block diagram of the document search apparatus of Embodiment 2.
  • The document search method of Embodiment 1 is applicable to an electronic device in which a processor is provided to perform a document search according to a search keyword.
  • Step S110: the entries in all documents of each scholar are acquired.
  • Step S120: entry expansion is performed on all documents according to the hierarchical relationship of entries.
  • Step S130: for each scholar, differently weighted scores are assigned to the entries according to the occurrence status of each entry in the documents written by the scholar and the expansion of the entry, to construct a scholar-entry matrix.
  • Step S140: a machine learning algorithm is trained on the scholar-entry matrix to generate a search model.
  • Step S150: a matching operation is performed on the search keywords in the search model, and a document search result is output according to the matching degree.
  • A knowledge tree (as shown in FIG. 2) containing the hierarchical relationship of academic vocabulary is used to assist document search, to handle search keywords that do not appear explicitly in the documents.
  • The expansion score takes a global view of all the documents written by the scholar, based on how the entry is expanded in the scholar's other documents.
  • The expansion score is adjusted according to the part of the document being expanded, and it is also assigned different values according to the level gap between the expanded entry and the original entry in the knowledge tree, so that documents are expanded reasonably and scholar-oriented literature search works better.
  • All documents of each scholar are acquired first, and each document is then preprocessed to acquire the entries in it.
  • For languages such as English and French, the text needs to be lowercased; this step is not required for languages such as Chinese.
  • Sentence segmentation is then performed, and finally word segmentation is performed on each document using a lexicon.
  • When entry expansion is performed, for each entry in a document, the corresponding academic vocabulary is looked up in the academic vocabulary hierarchical relationship, and the academic vocabulary is then expanded to upper-level entries according to that relationship.
  • Search item 1: Computer Science; ranking of related scholars: 1. Zhang San, 2. Li Si, ...
  • Search item 2: Natural Language Processing; ranking of related scholars: 1. Wang Wu, 2. Zhang San, ...
  • FIG. 2 is a schematic diagram of a knowledge tree containing the hierarchical relationship of academic vocabulary. As shown in FIG. 2, the academic vocabulary in the knowledge tree, from top to bottom, is: engineering, computer science, natural language processing, machine translation, neural machine translation, etc.
  • The entries are assigned differently weighted scores according to the occurrence of each entry in the documents written by each scholar and the expansion of the entry. A document-entry matrix is formed from each document and the scores of the entries in it, the document-entry matrix is converted into a scholar-entry matrix, and the scholar-entry matrix is then used as the model input for training.
  • Expanded entries at different level gaps from the original entry should be assigned different scores, reflecting different matching degrees and making the search more targeted.
  • An equal scoring method cannot reflect the distance between words in the knowledge tree.
  • Expanded entries are scored differently according to the level gap between the expanded entry and the original entry and according to the part of the document in which the original entry appears.
  • The score assigned is determined by the number of levels between the original entry and the expanded entry in the entry hierarchical relationship. For example, if the hierarchy has six levels, five parameters are generated, corresponding to gaps of one to five levels. In addition, the score should differ for each part of the document (title, abstract, main text, etc.). The final algorithm therefore has "(knowledge tree height - 1) * number of parts" scoring parameters.
  • The document-entry matrix can be formed from each document and the scores of the entries in it. So that all the documents written by each scholar are considered as a whole, the document-entry matrix is transformed into a scholar-entry matrix.
  • A machine learning algorithm is trained on the scholar-entry matrix to generate a search model.
  • The search keywords are matched in the search model, and the literature search results are output according to the matching degree.
  • The XGBoost algorithm is trained on the scholar-entry matrix, and the training loss of the model on the search ranking data set is obtained.
  • The Bayesian optimization grid search algorithm is used to optimize the parameters and update the scholar-entry matrix until the training loss converges.
  • As used here, XGBoost is a pairwise learning-to-rank algorithm: given a pair of search results A and B, it converts the ranking problem into a binary classification problem of whether result A ranks higher than result B. The algorithm finally outputs the pairwise-ranking binary classification error rate after training.
  • The present invention uses the Bayesian optimization grid search algorithm to optimize and select the parameters quickly; the optimization goal is to minimize the pairwise-ranking binary classification error rate produced by the XGBoost model.
  • Bayesian optimization grid search is a parameter optimization algorithm that optimizes parameters of the training model such as weights and scores. For example, the algorithm first tests a combination of parameters; each new round of parameter selection is then guided by the experimental results of the previous round, with minimizing the loss of the XGBoost model as the goal, and parameter selection is optimized iteratively until convergence.
  • The Bayesian optimization grid search algorithm can dynamically adjust the parameter selection of the next iteration based on the training results of the previous selection, so that parameter selection is optimized faster.
  • The text is preprocessed to extract the entries in the documents, and the knowledge tree is then used to expand the entries of the documents.
  • The entries are assigned differently weighted scores, and a document-entry matrix is formed from each document and the scores of the entries in it.
  • Embodiment 2:
  • Embodiment 2 of the present invention provides a document search apparatus that can execute all or part of the steps of any of the above document search methods.
  • The apparatus includes:
  • Entry acquisition module 1, used to acquire the entries in all documents of each scholar.
  • Entry expansion module 2, used to expand the entries of all documents according to the hierarchical relationship of entries.
  • Matrix building module 3, used to assign, for each scholar, differently weighted scores to the entries according to the occurrence status of each entry in the documents written by the scholar and the expansion of the entry, so as to construct a scholar-entry matrix.
  • Training module 4, used to train a machine learning algorithm on the scholar-entry matrix to generate a search model.
  • Search module 5, used to perform a matching operation on the search keywords in the search model and output the document search results according to the matching degree.
  • Embodiment 3 of the present invention provides an electronic device that can execute all or part of the steps of any of the above document search methods.
  • The electronic device includes:
  • The memory stores instructions executable by the at least one processor; the instructions are executed by the at least one processor to enable it to perform the method described in any of the above exemplary embodiments, which is not described again here.
  • A storage medium is also provided; the storage medium is a computer-readable storage medium, for example a transitory or non-transitory computer-readable storage medium including instructions.
  • The storage medium includes, for example, a memory storing instructions that can be executed by a processor of the server system to implement the above document search method.
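
The pairwise formulation mentioned in the definitions above (for a pair of results A and B, decide whether A ranks higher than B) can be illustrated with a minimal sketch. It does not call XGBoost; it only shows how ranked results turn into pairs and how a pairwise error rate could be computed. The function names `make_pairs` and `pairwise_error_rate`, the relevance labels, and the predicted scores are illustrative assumptions, not part of the patent.

```python
from itertools import combinations


def make_pairs(relevance):
    """Turn graded relevance labels into (higher, lower) pairs."""
    pairs = []
    for a, b in combinations(sorted(relevance), 2):
        if relevance[a] > relevance[b]:
            pairs.append((a, b))
        elif relevance[b] > relevance[a]:
            pairs.append((b, a))
    return pairs  # tied results produce no pair


def pairwise_error_rate(pairs, predicted_score):
    """Fraction of pairs that the predicted scores order incorrectly."""
    if not pairs:
        return 0.0
    wrong = sum(1 for hi, lo in pairs if predicted_score[hi] <= predicted_score[lo])
    return wrong / len(pairs)


# Hypothetical ground-truth relevance of three scholars for one query.
relevance = {"Zhang San": 2, "Li Si": 1, "Wang Wu": 0}
# Hypothetical model scores; the model wrongly places Wang Wu above Li Si.
predicted = {"Zhang San": 0.9, "Li Si": 0.2, "Wang Wu": 0.4}

pairs = make_pairs(relevance)
error = pairwise_error_rate(pairs, predicted)
```

Here one of the three pairs is ordered incorrectly, so the error rate is one third. A parameter search such as the one described above would try to drive this value down.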

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Algebra (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention is applicable to the technical field of document searching, and provides a document searching method and apparatus, and an electronic device. The method comprises: obtaining entries in all documents of scholars, and performing entry extension on all the documents according to a hierarchical relationship of the entries; for each scholar, according to appearance conditions and entry extension conditions of entries in documents written by the scholar, performing different weights of scoring on the entries to construct a scholar-entry matrix; using a machine learning algorithm to train the scholar-entry matrix so as to generate a search model; and performing operation on a search keyword in the search model to output a document searching result. During entry extension, an extension score is obtained by globally considering all documents of a scholar according to extension conditions in other documents of the scholar, and the extension score may further be given different values according to a hierarchical gap between an extended entry and an original entry in the hierarchical relationship of entries, such that reasonable extension for documents is achieved, and the problem of search quality when a search keyword does not explicitly appear is effectively solved.

Description

Document searching method and apparatus, and electronic device
Technical field
The invention belongs to the technical field of document search, and in particular relates to a document search method, apparatus, and electronic device.
Background
With the surge in data volume in the era of big data, efficiently finding the information relevant to one's own needs in massive amounts of information has become increasingly important. As one application of information retrieval, scholar-oriented literature search lets a user enter keywords of interest and retrieve the scholars related to those keywords from an institution, a school, or a wider population of scholars, with the results sorted in descending order of relevance. To provide this function, the retrieval system usually stores the academic documents published by each scholar, such as papers and journal articles, so that it can generate and rank search results from each scholar's documents.
However, a scholar-oriented search system faces a problem. For example, when a user searches for "computer science", scholars working on "natural language processing", although highly relevant to computer science, are not retrieved or are ranked very low. The reason is that most scholars do not mention higher-level concepts and keywords such as "computer science" in every "natural language processing" document. In other words, the search keyword "computer science" typed by the user does not appear explicitly in the documents written by the scholar, which lowers the quality of the search results.
Technical problem
The purpose of the present invention is to provide a document search method, apparatus, and electronic device that address the low search quality of the prior art when search keywords do not appear explicitly.
Technical solution
In a first aspect, the present invention provides a document search method, applied to an electronic device, including:
obtaining the entries in all documents of each scholar;
performing entry expansion on all documents according to the entry hierarchical relationship;
for each scholar, assigning differently weighted scores to the entries according to the occurrence status of each entry in the documents written by the scholar and the entry expansion status, to construct a scholar-entry matrix;
training a machine learning algorithm on the scholar-entry matrix to generate a search model; and
performing a matching operation on the search keywords in the search model, and outputting document search results according to the matching degree.
Further, the step of obtaining the entries in all documents of each scholar includes:
obtaining all documents of each scholar; and
preprocessing each document to obtain the entries in each document.
Further, the entry hierarchical relationship is an academic vocabulary hierarchical relationship, and the step of performing entry expansion on all documents according to the entry hierarchical relationship includes:
for each entry in a document, looking up the corresponding academic vocabulary in the academic vocabulary hierarchical relationship; and
expanding the academic vocabulary to upper-level entries according to the academic vocabulary hierarchical relationship.
Further, the step of, for each scholar, assigning differently weighted scores to the entries according to the occurrence status of each entry in the documents written by the scholar and the entry expansion status, to construct a scholar-entry matrix, includes:
for each scholar, assigning differently weighted scores to the entries according to the occurrence status of each entry in the documents written by the scholar and the entry expansion status;
forming a document-entry matrix from each document and the scores of the entries in it; and
converting the document-entry matrix into a scholar-entry matrix.
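
As a minimal sketch of the last two sub-steps, the following assumes that the document-entry matrix is stored as a mapping from each document to its per-entry scores, and that a scholar's row is the sum of the rows of that scholar's documents. The text does not state the aggregation rule, so the summation, the function name `to_scholar_entry_matrix`, and the sample data are all assumptions.

```python
from collections import defaultdict


def to_scholar_entry_matrix(doc_entry, doc_author):
    """Aggregate per-document entry scores into per-scholar entry scores.

    Assumption: a scholar's score for an entry is the sum over that
    scholar's documents (the source does not specify the rule).
    """
    scholar_entry = defaultdict(lambda: defaultdict(float))
    for doc_id, entry_scores in doc_entry.items():
        scholar = doc_author[doc_id]
        for entry, score in entry_scores.items():
            scholar_entry[scholar][entry] += score
    return {scholar: dict(row) for scholar, row in scholar_entry.items()}


# Hypothetical sparse document-entry matrix and authorship table.
doc_entry = {
    "doc1": {"machine translation": 3.0, "natural language processing": 1.5},
    "doc2": {"natural language processing": 2.0},
    "doc3": {"computer science": 4.0},
}
doc_author = {"doc1": "Wang Wu", "doc2": "Wang Wu", "doc3": "Zhang San"}

scholar_entry = to_scholar_entry_matrix(doc_entry, doc_author)
```

A sparse mapping is used here because most entries do not occur in most documents; a dense array would work equally well for a small vocabulary.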
Further, the occurrence status includes the number of occurrences and the occurrence position, and the step of, for each scholar, assigning differently weighted scores to the entries according to the occurrence status of each entry in the documents written by the scholar and the entry expansion status includes:
for each scholar, assigning a corresponding occurrence-count score and occurrence-position score according to the number of occurrences and the occurrence position of each entry in the documents written by the scholar.
Further, the step of, for each scholar, assigning differently weighted scores to the entries according to the occurrence status of each entry in the documents written by the scholar and the entry expansion status includes:
for each scholar, assigning a corresponding scholar score to an entry according to the average number of times the entry is produced by expansion across all documents written by the scholar; and
assigning a corresponding expansion score to the entry according to how far apart the expanded entry and the original entry are in the entry hierarchical relationship.
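
The two scores in this step can be sketched as follows. The scholar score is computed as the average number of times an entry is produced by expansion per document, and the expansion parameters are laid out with one tunable value per combination of level gap and document part, which matches the "(knowledge tree height - 1) * number of parts" count given in the definitions above. The function names, the part names, and the initial value 1.0 are hypothetical; the patent leaves the actual values to parameter optimization.

```python
from itertools import product


def scholar_score(expansion_counts_per_doc, entry):
    """Average number of times `entry` was produced by expansion per document."""
    if not expansion_counts_per_doc:
        return 0.0
    total = sum(counts.get(entry, 0) for counts in expansion_counts_per_doc)
    return total / len(expansion_counts_per_doc)


def init_expansion_params(tree_height, parts, initial=1.0):
    """One tunable parameter per (level gap, document part) combination."""
    gaps = range(1, tree_height)  # gaps of 1 .. tree_height - 1 levels
    return {(gap, part): initial for gap, part in product(gaps, parts)}


# Hypothetical expansion counts for one scholar's three documents.
counts = [
    {"natural language processing": 2, "computer science": 2},
    {"natural language processing": 1, "computer science": 1},
    {"computer science": 3},
]
nlp_avg = scholar_score(counts, "natural language processing")
cs_avg = scholar_score(counts, "computer science")

# A six-level tree and three document parts give (6 - 1) * 3 parameters.
params = init_expansion_params(6, ["title", "abstract", "main text"])
```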
Further, the step of training a machine learning algorithm on the scholar-entry matrix to generate a search model includes:
training the XGBoost algorithm on the scholar-entry matrix to obtain the training loss on the search ranking data set; and
when the training loss has not converged, using the Bayesian optimization grid search algorithm to optimize the parameters and update the scholar-entry matrix until the training loss converges.
In a second aspect, the present invention provides a document search apparatus, comprising:
an entry acquisition module, used to acquire the entries in all documents of each scholar;
an entry expansion module, used to expand the entries of all documents according to the entry hierarchical relationship;
a matrix building module, used to assign, for each scholar, differently weighted scores to the entries according to the occurrence status of each entry in the documents written by the scholar and the entry expansion status, so as to construct a scholar-entry matrix;
a training module, used to train a machine learning algorithm on the scholar-entry matrix to generate a search model; and
a search module, used to perform a matching operation on the search keywords in the search model and output document search results according to the matching degree.
In a third aspect, the present invention also provides an electronic device, comprising:
a processor; and
a memory communicatively connected to the processor; wherein
the memory stores readable instructions that, when executed by the processor, implement the method of the first aspect.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program that, when executed, implements the method of the first aspect.
Beneficial effects
In the document search method, apparatus, and electronic device provided by the present invention, when the entry hierarchical relationship is used to expand the entries of a document, the expansion score takes a global view of all the documents written by the scholar, based on how the entry is expanded in the scholar's other documents. It takes into account factors such as the expansion of the entry in the scholar's other documents, the distance between the expanded entry and the original entry in the vocabulary hierarchy, and the position of the original entry in the document, so that documents are expanded reasonably. Scholar-oriented literature search therefore works better, which effectively addresses the search quality problem when search keywords do not appear explicitly.
Description of drawings
FIG. 1 is a flowchart of the document search method of Embodiment 1.
FIG. 2 is a schematic diagram of a knowledge tree containing the hierarchical relationship of academic vocabulary.
FIG. 3 is a schematic diagram illustrating text entry expansion according to an exemplary embodiment.
FIG. 4 is a flowchart of a specific application of Embodiment 1 according to an exemplary embodiment.
FIG. 5 is a structural block diagram of the document search apparatus of Embodiment 2.
Best mode for carrying out the invention
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and do not limit it.
The specific implementation of the present invention is described in detail below with reference to specific embodiments:
Embodiment 1:
FIG. 1 is a flowchart of the document search method of Embodiment 1. The method is applicable to an electronic device in which a processor is provided to perform a document search according to a search keyword. For ease of description, only the parts related to the embodiments of the present invention are shown, as detailed below:
Step S110: the entries in all documents of each scholar are acquired.
Step S120: entry expansion is performed on all documents according to the entry hierarchical relationship.
Step S130: for each scholar, differently weighted scores are assigned to the entries according to the occurrence status of each entry in the documents written by the scholar and the entry expansion status, to construct a scholar-entry matrix.
Step S140: a machine learning algorithm is trained on the scholar-entry matrix to generate a search model.
Step S150: a matching operation is performed on the search keywords in the search model, and document search results are output according to the matching degree.
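
To illustrate only the output of step S150, the sketch below ranks scholars for one keyword in descending order of score. It reads the scores directly from a hypothetical scholar-entry matrix instead of from a trained search model, so it is a simplification of the matching operation, and the function name `rank_scholars` and all values are assumptions.

```python
def rank_scholars(scholar_entry, keyword):
    """Return scholars with a non-zero score for `keyword`, best match first."""
    scored = [(row.get(keyword, 0.0), name) for name, row in scholar_entry.items()]
    scored = [(score, name) for score, name in scored if score > 0]
    # Sort by descending score; break ties by name for a stable result.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored]


# Hypothetical scholar-entry matrix after entry expansion and scoring.
scholar_entry = {
    "Zhang San": {"computer science": 4.0, "natural language processing": 1.0},
    "Li Si": {"computer science": 2.5},
    "Wang Wu": {"natural language processing": 3.5},
}

cs_ranking = rank_scholars(scholar_entry, "computer science")
nlp_ranking = rank_scholars(scholar_entry, "natural language processing")
```

With these made-up scores, a search for "computer science" lists Zhang San before Li Si, and a search for "natural language processing" lists Wang Wu before Zhang San.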
With the rapid growth in the volume of literature data, it is becoming increasingly important to retrieve from massive resources the information most relevant to a search keyword, and to rank it reasonably, so as to meet users' information needs.
Existing information retrieval systems can retrieve many kinds of entities, such as text, audio and video, games, and scholars or experts. When searching, all of these systems face, to a greater or lesser degree, the problem noted above: the search keyword does not appear explicitly in the text.
In the present invention, a knowledge tree containing the hierarchical relationships among academic terms (as shown in FIG. 2) is used to assist the document search, so as to handle search keywords that do not appear explicitly in the documents. When the knowledge tree is used to expand the entries of a document, the expansion score takes a global view of all documents written by the scholar, based on how the entry is expanded in the scholar's other documents. In addition, the expansion score is adjusted according to the part of the document being expanded, and is assigned a different value according to the level distance in the knowledge tree between the expanded entry and the original entry. Documents are thereby expanded reasonably, which in turn allows a better document search over scholars.
Specifically, to acquire the entries in all documents of each scholar, all documents of each scholar are first acquired, and each document is then preprocessed to obtain its entries. For example, for certain languages such as English and French, the text is converted to lowercase; this step is unnecessary for languages such as Chinese. Repeated spaces, punctuation marks and the like are then removed. Sentence segmentation follows, and finally each document is segmented into words using a lexicon.
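By way of illustration only, the preprocessing described above might be sketched in Python as follows. The function name `preprocess`, the `lexicon` argument, the regular expressions, and the greedy longest-match segmentation are assumptions made for this sketch; the disclosure does not specify how sentence or word segmentation is implemented.

```python
import re

def preprocess(text, lexicon, lowercase=True):
    """Return one list of lexicon entries per sentence of `text`."""
    if lowercase:  # needed for English or French, skipped for Chinese
        text = text.lower()
    # Collapse runs of whitespace and runs of a repeated punctuation mark.
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"([.,;:!?])\1+", r"\1", text)
    # Naive sentence segmentation on sentence-ending punctuation (assumption).
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    # Greedy longest-match word segmentation against the lexicon (assumption).
    ordered = sorted(lexicon, key=len, reverse=True)
    result = []
    for sentence in sentences:
        entries = []
        pos = 0
        while pos < len(sentence):
            for entry in ordered:
                if sentence.startswith(entry, pos):
                    entries.append(entry)
                    pos += len(entry)
                    break
            else:
                pos += 1  # no lexicon entry starts here; skip one character
        result.append(entries)
    return result
```

This sketch matches lexicon entries by substring and ignores word boundaries, so a production system would likely substitute a proper tokenizer for the language concerned.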
Specifically, during entry expansion, for each entry in a document, the corresponding academic term is looked up in the academic vocabulary hierarchy, and the academic term is then expanded upward through the levels of that hierarchy.
Training the model requires a dataset that contains search items and their corresponding correct rankings; the documents themselves are preprocessed by word segmentation. In this dataset, the data takes the following form:
Search item 1: computer science; ranking of relevant scholars: 1. Zhang San, 2. Li Si, …
Search item 2: natural language processing; ranking of relevant scholars: 1. Wang Wu, 2. Zhang San, …
FIG. 2 is a schematic diagram of a knowledge tree containing the hierarchical relationships among academic terms. As shown in FIG. 2, the academic terms in the knowledge tree are, from top to bottom: engineering, computer science, natural language processing, machine translation, neural machine translation, and so on. FIG. 3 is a schematic diagram of text entry expansion according to an exemplary embodiment.
The knowledge tree in FIGS. 2 and 3 encodes many facts of the kind "machine translation" is a sub-branch of "natural language processing". Documents are therefore expanded with the aid of the knowledge tree: if a document contains the entry "machine translation", then, with that entry as the original entry, the higher-level entries "natural language processing", "computer science" and "engineering" are all expanded from it. As a result, when a user searches for "computer science", a scholar whose documents never mention the term "computer science" can still be found by the algorithm, and may even rank highly, provided that "computer science" has been expanded many times in the scholar's documents. Note that a document about "machine translation" is not necessarily related to "statistical machine translation", so the algorithm expands upward only. In upward expansion, taking "machine translation" as the original entry, "natural language processing" is expanded at a level distance of one from "machine translation" in the knowledge tree, and "computer science" is expanded at a level distance of two.
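The upward-only expansion can be sketched as a walk along child-to-parent links. The `PARENT` mapping below is a toy hierarchy assumed to resemble FIG. 2, and the function name `expand_upward` is hypothetical; neither is taken from the disclosure.

```python
# Child -> parent links of a toy knowledge tree (assumed to resemble FIG. 2).
PARENT = {
    "neural machine translation": "machine translation",
    "statistical machine translation": "machine translation",
    "machine translation": "natural language processing",
    "natural language processing": "computer science",
    "computer science": "engineering",
}

def expand_upward(entry, parent=PARENT):
    """Return [(ancestor, level_distance), ...] ordered from nearest to farthest."""
    expanded = []
    seen = {entry}
    current = entry
    distance = 0
    while current in parent:
        current = parent[current]
        if current in seen:  # guard against a malformed, cyclic hierarchy
            break
        seen.add(current)
        distance += 1
        expanded.append((current, distance))
    return expanded
```

Because only parent links are followed, "machine translation" never expands to "statistical machine translation", which is consistent with the upward-only rule stated above.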
After entry expansion, the entries are assigned differently weighted scores according to the occurrence status of each entry in the documents written by the scholar and the entry expansion status. A document-entry matrix is formed from the documents and the scores of the entries in them, the document-entry matrix is converted into a scholar-entry matrix, and the scholar-entry matrix is then used as the model input for training.
When scoring an entry, the following factors are fully considered: how the entry is expanded in the scholar's other documents, the level distance in the vocabulary hierarchy between the expanded entry and the original entry, and the position in the document at which the original entry appears.
Expanded entries at different level distances from the original entry should be assigned different scores, reflecting different degrees of matching and making the search more targeted; scoring all expanded entries identically cannot capture how near or far terms are from one another in the knowledge tree.
Unlike ordinary text search, a search over scholars must consider all documents written by each scholar as a whole. When the knowledge tree is used to expand entries, it is therefore necessary to consider how an expanded entry is expanded in the scholar's other documents. For example, for every entry expanded anywhere in a scholar's documents, a "scholar score" of that entry under that scholar is computed: the numerator is the number of times the entry is expanded across all of the scholar's documents, and the denominator is the number of the scholar's documents in which the entry is expanded. The scholar score is thus the average number of times the entry is expanded per document, taken over those of the scholar's documents in which it is expanded at all. In other words, entries that are expanded frequently across the scholar's documents receive higher scores. This scoring takes full account of how the entry is expanded across all of the scholar's documents; compared with the uniform treatment in traditional text expansion methods, it is more reasonable and better suited to an algorithm that searches for scholars.
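A minimal sketch of the scholar score follows, assuming that each document of a scholar is represented as a list of the entries expanded in it, with an entry repeated once per expansion. This representation and the function name `scholar_scores` are assumptions for illustration.

```python
from collections import defaultdict

def scholar_scores(expansions_per_doc):
    """expansions_per_doc: one list per document of the entries expanded in it.
    Returns {entry: scholar score} for a single scholar."""
    total = defaultdict(int)      # numerator: expansions across all documents
    doc_count = defaultdict(int)  # denominator: documents where it is expanded
    for expanded in expansions_per_doc:
        for entry in expanded:
            total[entry] += 1
        for entry in set(expanded):
            doc_count[entry] += 1
    return {entry: total[entry] / doc_count[entry] for entry in total}
```

For instance, an entry expanded twice in one document and once in another, and absent from a third, would receive a scholar score of 3 / 2 = 1.5 under this sketch.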
In addition, an expanded entry is scored differently according to its level distance from the original entry in the entry hierarchy and according to the part of the document in which the original entry is located.
The score assigned is determined by the number of levels separating the original entry from the expanded entry in the entry hierarchy. For example, if the entry hierarchy has six levels, five parameters are produced, corresponding to level distances of one through five. Likewise, each part of a document (title, abstract, body, etc.) should be scored differently. The final algorithm therefore has "(knowledge tree height - 1) * number of parts" scoring parameters.
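The parameter count can be checked by enumerating one key per combination of document part and level distance. The function name and the part names below are illustrative assumptions.

```python
from itertools import product

def expansion_parameter_keys(tree_height, parts):
    """Enumerate one (part, level_distance) key per scoring parameter."""
    distances = range(1, tree_height)  # level distances 1 .. tree_height - 1
    return list(product(parts, distances))

# A six-level hierarchy and three document parts give (6 - 1) * 3 = 15 parameters.
keys = expansion_parameter_keys(6, ["title", "abstract", "body"])
```

Each key would map to one tunable expansion score, for example the score for an expansion at level distance 2 originating from an entry in the abstract.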
Finally, for each entry in each scholar's documents:
Entry score (term_score) = number of occurrences + scholar score * expansion score
The more often the entry occurs in the document, the higher its entry score. The more heavily the entry is expanded in the scholar's documents, that is, the higher its scholar score, the higher its entry score. The higher the entry's expansion score, the higher its entry score. Because the expansion score should be considered together with how the entry is expanded in the scholar's documents, the two are multiplied.
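As a worked illustration of the formula above, the entry score can be written as a one-line function. The argument names are assumptions, and the sketch takes the expansion score as a single number because the disclosure does not state how several expansions of the same entry within one document are aggregated.

```python
def term_score(occurrences, scholar_score, expansion_score):
    """Entry score = number of occurrences + scholar score * expansion score."""
    return occurrences + scholar_score * expansion_score

# An entry that occurs twice, with scholar score 1.5 and expansion score 0.5:
example = term_score(2, 1.5, 0.5)  # 2 + 1.5 * 0.5 = 2.75
```

An entry that never occurs literally but is expanded still receives a non-zero score from the product term, which is what lets a scholar be found for a keyword absent from the scholar's documents.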
The document-entry matrix is formed from the documents and the scores of the entries in them. So that all documents written by each scholar are considered as a whole, the document-entry matrix is converted into a scholar-entry matrix.
For each entry in a scholar's documents, its score in the scholar-entry matrix is the sum of the entry's final scores across all of the scholar's documents, multiplied by the logarithm of the number of the scholar's documents that contain the entry, and divided by the logarithm of the number of documents written by the scholar. Hence the higher an entry's cumulative score across the scholar's documents, the higher its score in the scholar-entry matrix, and the more of the scholar's documents contain the entry, the higher its score in the scholar-entry matrix. Because a larger number of documents makes a high cumulative score more likely, the number of documents written by the scholar is used in the denominator during the matrix conversion.
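The conversion of one scholar's rows of the document-entry matrix into a single row of the scholar-entry matrix might be sketched as follows. One point is an assumption of this sketch rather than part of the disclosure: the logarithms are taken as log(1 + n) instead of log(n), because log(1) = 0 would otherwise zero the score of an entry found in a single document and make the denominator zero for a scholar with a single document. The function name and data layout are likewise hypothetical.

```python
import math
from collections import defaultdict

def scholar_entry_row(doc_scores):
    """doc_scores: one {entry: entry score} dict per document of one scholar.
    Returns {entry: score in the scholar-entry matrix}."""
    n_docs = len(doc_scores)
    total = defaultdict(float)    # sum of the entry's scores over all documents
    containing = defaultdict(int) # number of documents that contain the entry
    for scores in doc_scores:
        for entry, score in scores.items():
            total[entry] += score
            containing[entry] += 1
    # Assumption: log(1 + n) smoothing, see the note above the code.
    denom = math.log(1 + n_docs)
    return {e: total[e] * math.log(1 + containing[e]) / denom for e in total}
```

With this smoothing, an entry present in every one of the scholar's documents keeps its summed score unchanged, and entries present in fewer documents are scaled down.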
After the scholar-entry matrix is constructed, a machine learning algorithm is trained on it to generate a search model. During a document search, the search keyword is matched against the search model, so that document search results can be output accurately according to the matching degree.
Specifically, the XGBoost algorithm is trained on the scholar-entry matrix to obtain the model's training loss on the search ranking dataset. While the training loss has not converged, a Bayesian optimization grid search algorithm is used to optimize the parameters and the scholar-entry matrix is updated, until the training loss converges.
The XGBoost algorithm is used here as a pairwise learning-to-rank algorithm: it converts the ranking problem into the binary classification problem of whether, given a pair of search results A and B, result A ranks higher than result B. Finally, the algorithm outputs the pairwise-ranking binary classification error rate after training.
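The pairwise reduction and the resulting error rate can be illustrated without any particular learning library. In the sketch below, the function names are hypothetical, `predicted_score` stands for whatever relevance score a trained model assigns to each scholar, and counting ties as errors is an assumption.

```python
from itertools import combinations

def ranking_to_pairs(ranked):
    """Turn a correct ranking (best first) into (higher, lower) pairs."""
    return list(combinations(ranked, 2))

def pairwise_error_rate(ranked, predicted_score):
    """Fraction of pairs whose order the predicted scores get wrong."""
    pairs = ranking_to_pairs(ranked)
    if not pairs:
        return 0.0
    # A pair is wrong when the higher-ranked scholar is not scored strictly higher.
    wrong = sum(1 for hi, lo in pairs if predicted_score[hi] <= predicted_score[lo])
    return wrong / len(pairs)
```

For a three-scholar ranking there are three pairs; if the predicted scores invert exactly one of them, the error rate is 1/3.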
To find the optimal configuration of the parameters involved in the proposed algorithm, the present invention uses the Bayesian optimization grid search algorithm to select parameters quickly, with the optimization objective of minimizing the pairwise-ranking binary classification error rate produced by the XGBoost model.
Bayesian optimization grid search is a parameter optimization algorithm, used here to optimize parameters of the training model such as the weights and scores. For example, the algorithm first tries a combination of parameters; each new round of parameter selection is then guided by the results of the previous round, with minimization of the XGBoost model loss as the objective, and the selection is refined iteratively until convergence. Compared with the traditional grid search optimization algorithm, Bayesian optimization grid search dynamically adjusts the parameter selection of the next iteration based on the training results of the previous one, and can therefore optimize the parameter selection faster.
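The outer propose-evaluate-update loop can be sketched in a few lines. To keep the example self-contained, a random local search stands in for the Bayesian proposal step, so this is not Bayesian optimization itself; the function name `tune`, its arguments, and the stopping rule are all assumptions. In the described method, `evaluate` would rebuild the scholar-entry matrix with the candidate parameters, train the XGBoost model, and return its pairwise error rate.

```python
import random

def tune(evaluate, initial, rounds=50, step=0.1, tol=1e-6, patience=10, seed=0):
    """Iteratively propose parameters, keep the best, stop when the loss stalls."""
    rng = random.Random(seed)
    best, best_loss = dict(initial), evaluate(initial)
    stalled = 0
    for _ in range(rounds):
        # Stand-in proposal: perturb the best parameters found so far.
        candidate = {k: v + rng.uniform(-step, step) for k, v in best.items()}
        loss = evaluate(candidate)
        if best_loss - loss > tol:
            best, best_loss = candidate, loss
            stalled = 0
        else:
            stalled += 1
            if stalled >= patience:  # treat a run of non-improvements as convergence
                break
    return best, best_loss
```

A real Bayesian optimizer would replace the perturbation line with a proposal drawn from a surrogate model fitted to all previous (parameters, loss) observations.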
FIG. 4 is a flowchart of a specific application of Embodiment 1 according to an exemplary embodiment. As shown in FIG. 4, the documents are first preprocessed to extract their entries, and the knowledge tree is then used to expand the entries of the documents. After entry expansion, the entries are assigned differently weighted scores according to the occurrence status of each entry in the documents written by the scholar and the entry expansion status. A document-entry matrix is formed from the documents and the scores of the entries in them and is converted into a scholar-entry matrix, which is used as the input for training the XGBoost model. The training loss is calculated; while it has not converged, the Bayesian optimization grid search algorithm is used to optimize the parameters and the scholar-entry matrix is updated, until the training loss converges. Finally, during a document search, the search keyword is matched against the search model, so that document search results can be output accurately according to the matching degree.
Embodiment 2:
As shown in FIG. 5, Embodiment 2 of the present invention provides a document search apparatus that can perform all or some of the steps of any of the document search methods described above. The apparatus includes:
an entry acquisition module 1, configured to acquire the entries in all documents of each scholar;
an entry expansion module 2, configured to perform entry expansion on all documents according to the entry hierarchy;
a matrix construction module 3, configured to, for each scholar, assign differently weighted scores to the entries according to the occurrence status of each entry in the documents written by the scholar and the entry expansion status, and construct a scholar-entry matrix;
a training module 4, configured to train on the scholar-entry matrix with a machine learning algorithm to generate a search model;
a search module 5, configured to match the search keyword against the search model and output document search results according to the matching degree.
Embodiment 3:
Embodiment 3 of the present invention provides an electronic device that can perform all or some of the steps of any of the document search methods described above. The electronic device includes:
a processor; and
a memory communicatively connected to the processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the method described in any of the above exemplary embodiments, which is not described in detail again here.
This embodiment also provides a storage medium, which is a computer-readable storage medium, for example a transitory or non-transitory computer-readable storage medium containing instructions. The storage medium is, for example, a memory containing instructions that can be executed by a processor of a server system to carry out the document search method described above.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

  1. A document search method, applied to an electronic device, characterized in that the method comprises:
    acquiring the entries in all documents of each scholar;
    performing entry expansion on all documents according to an entry hierarchy;
    for each scholar, assigning differently weighted scores to the entries according to the occurrence status of each entry in the documents written by the scholar and the entry expansion status, and constructing a scholar-entry matrix;
    training on the scholar-entry matrix with a machine learning algorithm to generate a search model;
    matching a search keyword against the search model, and outputting document search results according to the matching degree.
  2. The method according to claim 1, characterized in that the step of acquiring the entries in all documents of each scholar comprises:
    acquiring all documents of each scholar;
    preprocessing each document to obtain the entries in each document.
  3. The method according to claim 1, characterized in that the entry hierarchy is an academic vocabulary hierarchy, and the step of performing entry expansion on all documents according to the entry hierarchy comprises:
    for each entry in a document, looking up the corresponding academic term in the academic vocabulary hierarchy;
    expanding the academic term upward through the levels of the academic vocabulary hierarchy.
  4. The method according to claim 1, characterized in that the step of, for each scholar, assigning differently weighted scores to the entries according to the occurrence status of each entry in the documents written by the scholar and the entry expansion status, and constructing a scholar-entry matrix comprises:
    for each scholar, assigning differently weighted scores to the entries according to the occurrence status of each entry in the documents written by the scholar and the entry expansion status;
    forming a document-entry matrix from the documents and the scores of the entries in the documents;
    converting the document-entry matrix into a scholar-entry matrix.
  5. The method according to claim 4, characterized in that the occurrence status includes the number of occurrences and the position of occurrence, and the step of, for each scholar, assigning differently weighted scores to the entries according to the occurrence status of each entry in the documents written by the scholar and the entry expansion status comprises:
    for each scholar, assigning a corresponding occurrence-count score and occurrence-position score according to the number of occurrences and the position of occurrence of each entry in the documents written by the scholar.
  6. The method according to claim 4, characterized in that the step of, for each scholar, assigning differently weighted scores to the entries according to the occurrence status of each entry in the documents written by the scholar and the entry expansion status comprises:
    for each scholar, assigning a corresponding scholar score to an entry according to the average number of times the entry is expanded in the documents written by the scholar; and
    assigning a corresponding expansion score to the entry according to the level distance in the entry hierarchy at which the entry was expanded.
  7. The method according to claim 1, characterized in that the step of training on the scholar-entry matrix with a machine learning algorithm to generate a search model comprises:
    training on the scholar-entry matrix with the XGBoost algorithm to obtain a training loss on a search ranking dataset;
    while the training loss has not converged, optimizing parameters with a Bayesian optimization grid search algorithm and updating the scholar-entry matrix, until the training loss converges.
  8. A document search apparatus, characterized in that the apparatus comprises:
    an entry acquisition module, configured to acquire the entries in all documents of each scholar;
    an entry expansion module, configured to perform entry expansion on all documents according to an entry hierarchy;
    a matrix construction module, configured to, for each scholar, assign differently weighted scores to the entries according to the occurrence status of each entry in the documents written by the scholar and the entry expansion status, and construct a scholar-entry matrix;
    a training module, configured to train on the scholar-entry matrix with a machine learning algorithm to generate a search model;
    a search module, configured to match a search keyword against the search model and output document search results according to the matching degree.
  9. An electronic device, characterized in that the electronic device comprises:
    a processor; and
    a memory communicatively connected to the processor; wherein
    the memory stores readable instructions which, when executed by the processor, implement the method according to any one of claims 1-7.
  10. A computer-readable storage medium storing a computer program which, when executed, implements the method according to any one of claims 1-7.
PCT/CN2020/139255 2020-12-10 2020-12-25 Document searching method and apparatus, and electronic device WO2022120975A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011433146.7 2020-12-10
CN202011433146.7A CN112463950B (en) 2020-12-10 2020-12-10 Document searching method and device and electronic equipment

Publications (1)

Publication Number Publication Date
WO2022120975A1 true WO2022120975A1 (en) 2022-06-16

Family

ID=74800510

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/139255 WO2022120975A1 (en) 2020-12-10 2020-12-25 Document searching method and apparatus, and electronic device

Country Status (2)

Country Link
CN (1) CN112463950B (en)
WO (1) WO2022120975A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060248076A1 (en) * 2005-04-21 2006-11-02 Case Western Reserve University Automatic expert identification, ranking and literature search based on authorship in large document collections
CN104899281A (en) * 2015-06-01 2015-09-09 百度在线网络技术(北京)有限公司 Academic article processing method and search processing method and apparatus for academic articles
CN106909680A (en) * 2017-03-03 2017-06-30 中国科学技术信息研究所 A kind of sci tech experts information aggregation method of knowledge based tissue semantic relation
CN108255796A (en) * 2018-01-10 2018-07-06 华南理工大学 A kind of scientific and technological entry abstracting method for characterizing sci tech experts achievement ability
CN108846056A (en) * 2018-06-01 2018-11-20 云南电网有限责任公司电力科学研究院 A kind of scientific and technological achievement evaluation expert recommended method and device
CN110968782A (en) * 2019-10-15 2020-04-07 东北大学 Student-oriented user portrait construction and application method
CN111143672A (en) * 2019-12-16 2020-05-12 华南理工大学 Expert specialty scholars recommendation method based on knowledge graph
CN111581368A (en) * 2019-02-19 2020-08-25 中国科学院信息工程研究所 Intelligent expert recommendation-oriented user image drawing method based on convolutional neural network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5250709B1 (en) * 2012-03-12 2013-07-31 楽天株式会社 Information processing apparatus, information processing method, information processing apparatus program, and recording medium
KR20190023722A (en) * 2017-08-30 2019-03-08 한국과학기술원 Apparatus and method for sentiment analysis keyword expansion
WO2020208728A1 (en) * 2019-04-09 2020-10-15 株式会社 AI Samurai Document searching device, document searching method, and document searching program
CN109960730B (en) * 2019-04-19 2022-12-30 广东工业大学 Short text classification method, device and equipment based on feature expansion
CN110502644B (en) * 2019-08-28 2023-08-04 同方知网数字出版技术股份有限公司 Active learning method for field level dictionary mining construction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LU WANHUI , JING LINBO: "Research on Scholar Clustering and Academic Influence Evaluation Method Based on Author Topic Model", INFORMATION AND DOCUMENTATION SERVICES, vol. 41, no. 4, 31 July 2020 (2020-07-31), pages 60 - 66, XP055941300, ISSN: 1002-0314, DOI: 10.12154/j.qbzlgz.2020.04.008 *

Also Published As

Publication number Publication date
CN112463950A (en) 2021-03-09
CN112463950B (en) 2023-10-24

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 20964904; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: PCT application non-entry in European phase. Ref document number: 20964904; Country of ref document: EP; Kind code of ref document: A1