WO2024078141A1 - Subject document retrieval prediction method - Google Patents

Subject document retrieval prediction method

Info

Publication number
WO2024078141A1
Authority
WO
WIPO (PCT)
Prior art keywords
search
retrieval
data
subject
document
Prior art date
Application number
PCT/CN2023/113965
Other languages
English (en)
French (fr)
Inventor
郑志军
Original Assignee
华北理工大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华北理工大学
Priority to ZA2023/08509A
Publication of WO2024078141A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present invention relates to the field of data analysis and prediction, and in particular to a subject document retrieval prediction method.
  • He Xiaoyue (何小月) et al. of the School of Information Management, Nanjing University, searched and browsed 274 self-built databases at 20 university libraries in China and the United States, and concluded that Chinese university libraries should learn from the experience of American universities, seek more diverse external cooperation, and attach importance to improving user experience.
  • The present invention proposes a subject document retrieval prediction method to improve user experience in document retrieval.
  • A subject document retrieval prediction method, comprising the following steps:
  • Step 1: Build a subject resource database and obtain digital document resources: scan paper documents with scanning equipment to obtain their digital resources, and build a subject search vocabulary for the digital resources of the documents;
  • Step 2: Each time a search is performed, store the search data in a file, and search and match the search data in the file to determine the search frequency; construct a search-information knowledge graph from the search frequency, and associate other users' search history data to refine the knowledge graph; finally, use the knowledge graph to determine the search strategy, and predict and sort the retrieved documents;
  • Step 3: User behavior analysis: when a user browses the search results, record and associate the document data the user browses and downloads, and analyze the data to establish the degree of association between the search strategy of Step 2 and the predicted documents;
  • Step 4: Predict other users' search results according to the degree of association between the search strategy and the predicted documents.
  • Further, in Step 1 the subject search vocabulary is built by performing word frequency statistics on the digital resources and determining the vocabulary from the results; one vocabulary is built per document.
  • Further, in Step 3 the document data browsed and downloaded by the user are assigned different weights, with the download weight higher than the browsing weight.
  • The present invention forms a search strategy by analyzing the search keywords together with historical search data and other users' search data, determines the degree of association between the search strategy and the predicted documents, and outputs the retrieved documents in predicted order, thereby improving the user's search experience.
  • Step 1: Build a subject resource database and obtain digital document resources: scan paper documents with scanning equipment to obtain their digital resources, and build a subject search vocabulary for the digital resources of the documents;
  • To build the subject search vocabulary, word frequency statistics are performed on the digital resources. Word frequency analysis counts and analyzes how often important words appear in a text and is an important means of text mining. The technique does not need to worry about new words: as long as a new word is actually used, it will be counted.
  • For example, the tool "Candy Cloud" (糖果云) can be used to perform word frequency statistics on a document. The results are screened to remove meaningless words, the remaining words are taken as the subject search vocabulary, and one vocabulary is built per document.
  • Step 2: Each time a search is performed, store the search data in a file, and search and match the search data in the file to determine the search frequency; construct a search-information knowledge graph from the search frequency, and associate other users' search history data to refine the knowledge graph; finally, use the knowledge graph to determine the search strategy, and predict and sort the retrieved documents;
  • From the frequencies of the keywords "Fengjie" and "Daiyu" in file A, the knowledge graph is determined as Fengjie->Daiyu, which also reflects which frequency is higher.
  • Fengjie->Daiyu represents two associated search keywords, with the frequency of Fengjie in file A higher than that of Daiyu.
  • If comparison with other users' search history finds no similar data, the knowledge graph Fengjie->Daiyu determined this time is stored as a record in file A, and the documents retrieved by this search are output.
  • Step 3: User behavior analysis: when a user browses the search results, record and associate the document data the user browses and downloads, and analyze the data to establish the degree of association between the search strategy of Step 2 and the predicted documents;
  • Behavioral data is recorded while the user browses documents. Suppose that 10 relevant documents are found in Step 2, named document 1 to document 10. While browsing these 10 documents, the user downloads document 2, opens documents 3 and 5, and performs no other operations. The weight of document 2 is then set to high, the weight of documents 3 and 5 to medium, and the weight of the other 7 documents to low.
  • The sorted document IDs are stored as data in file A, in the same record as the corresponding knowledge graph established in Step 2, thereby establishing the degree of association between the search strategy of Step 2 and the predicted documents. (The IDs of documents 2, 3, and 5 are stored in order; in this example only documents 2, 3, and 5 were browsed, so the other document IDs need not be stored.)
  • Step 4: Predict other users' search results according to the degree of association between the search strategy and the predicted documents. The document weights from this search are associated with the search strategy and used as the prediction standard for the next search. That is, the next time another user's search strategy is Fengjie->Daiyu->Dream of Red Mansions, the predicted output order is documents 2, 3, 5, ... The IDs of documents 2, 3, and 5 can be read directly, in order, from data file A, and this order is the order of document weights.
  • The invention has practical value in the field of information retrieval.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

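The present invention discloses a subject document retrieval prediction method, belonging to the field of data analysis and prediction. A subject resource database is built, and a subject search vocabulary is built for the digital resources of the documents. Each search stores its search data in a file; the search data in the file is searched and matched to determine the search frequency, a search-information knowledge graph is constructed, and other users' search history data is associated to refine the knowledge graph. Finally, the knowledge graph determines the search strategy, and the retrieved documents are predicted and sorted. When a user browses the search results, the document data the user browses and downloads is recorded and associated, and the data is analyzed to establish the degree of association between the search strategy and the predicted documents. Other users' search results are predicted according to this degree of association, and outputting the retrieved documents in predicted order improves the user's search experience.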

Description

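Subject document retrieval prediction method
Technical Field
The present invention relates to the field of data analysis and prediction, and in particular to a subject document retrieval prediction method.
Background Art
Wei Qinghua (魏清华) et al. of Lanzhou University point out that the China university humanities and social sciences document center (中国高校人文社会科学文献中心) has initially built a batch of database platforms that can fully reveal special-collection documents, but that "digitized resources are often stored in specific, independent document management systems that offer only simple document retrieval and copy-and-scan services"; further work is still needed on multidimensional, fine-grained metadata processing and on developing richer and more varied platform functions. Han Bing (韩冰) of Dalian Polytechnic University surveyed the self-built special-feature databases of 42 "Double First-Class" university libraries and offered development suggestions for the common problems found: unbalanced database construction, inconsistent platform standards and limited functionality, a low degree of openness, a single type of building entity, and insufficient sustainability of construction and service. He Xiaoyue (何小月) et al. of the School of Information Management, Nanjing University, searched and browsed 274 self-built databases at 20 university libraries in China and the United States, and concluded that Chinese university libraries should learn from the experience of American universities, seek more diverse external cooperation, and attach importance to improving user experience.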
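Technical Problem
In view of the state of the background art, the present invention proposes a subject document retrieval prediction method to improve user experience in document retrieval.
Technical Solution
The present invention adopts the following technical solution: a subject document retrieval prediction method, comprising the following steps:
 Step 1: Build a subject resource database and obtain digital document resources: scan paper documents with scanning equipment to obtain their digital resources, and build a subject search vocabulary for the digital resources of the documents;
 Step 2: Each time a search is performed, store the search data in a file, and search and match the search data in the file to determine the search frequency; construct a search-information knowledge graph from the search frequency, and associate other users' search history data to refine the knowledge graph; finally, use the knowledge graph to determine the search strategy, and predict and sort the retrieved documents;
 Step 3: User behavior analysis: when a user browses the search results, record and associate the document data the user browses and downloads, and analyze the data to establish the degree of association between the search strategy of Step 2 and the predicted documents;
 Step 4: Predict other users' search results according to the degree of association between the search strategy and the predicted documents.
Further, in Step 1 the subject search vocabulary is built by performing word frequency statistics on the digital resources and determining the vocabulary from the results; one vocabulary is built per document.
Further, in Step 3 the document data browsed and downloaded by the user are assigned different weights, with the download weight higher than the browsing weight.
Beneficial Effects
The present invention forms a search strategy by analyzing the search keywords together with historical search data and other users' search data, determines the degree of association between the search strategy and the predicted documents, and outputs the retrieved documents in predicted order, thereby improving the user's search experience.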
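Embodiments of the Invention
Step 1: Build a subject resource database and obtain digital document resources: scan paper documents with scanning equipment to obtain their digital resources, and build a subject search vocabulary for the digital resources of the documents;
 To build the subject search vocabulary, word frequency statistics are performed on the digital resources. Word frequency analysis counts and analyzes how often important words appear in a text and is an important means of text mining. The technique does not need to worry about new words: as long as a new word is actually used, it will be counted. For example, the tool "糖果云" ("Candy Cloud") can be used to perform word frequency statistics on a document; the results are screened to remove meaningless words, the remaining words are taken as the subject search vocabulary, and one vocabulary is built per document.
For example, given the text of Dream of Red Mansions (红楼梦) as input, the word frequency statistics, sorted by count, are: 宝玉 (Baoyu) 4004, 笑道 ("said with a smile") 2454, 什么 ("what") 1834, 凤姐 (Fengjie) 1743, 了一 (a particle fragment) 1715, 贾母 (Jia Mu) 1690, 也不 ("nor") 1451, 黛玉 (Daiyu) 1379, 我们 ("we") 1226, 那里 ("there") 1178, 袭人 (Xiren) 1156, 姑娘 ("girl") 1136, 去了 ("went") 1096, 宝钗 (Baochai) 1089, 王夫人 (Lady Wang) 1080, 不知 ("do not know") 1080, ... After removing pronouns, prepositions, colloquialisms, and other words unrelated to the subject, the vocabulary for the document Dream of Red Mansions is {宝玉, 凤姐, 贾母, 黛玉, 袭人, 姑娘, 宝钗, 王夫人, ...}.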
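The screening step in this example can be sketched in Python. This is a minimal illustration, not the patent's implementation: the counts are copied from the example, while the stop list `STOP_WORDS` and the function `build_vocabulary` are hypothetical names, and the text does not say how meaningless words are actually identified. Word segmentation itself (the job of a tool such as 糖果云) is assumed to have happened already.

```python
from collections import Counter

# Word counts copied from the Dream of Red Mansions example in the text.
counts = Counter({
    "宝玉": 4004, "笑道": 2454, "什么": 1834, "凤姐": 1743, "了一": 1715,
    "贾母": 1690, "也不": 1451, "黛玉": 1379, "我们": 1226, "那里": 1178,
    "袭人": 1156, "姑娘": 1136, "去了": 1096, "宝钗": 1089, "王夫人": 1080,
    "不知": 1080,
})

# Assumption: a hand-maintained stop list. The text only says that pronouns,
# prepositions and colloquial words are removed, not how they are detected.
STOP_WORDS = {"笑道", "什么", "了一", "也不", "我们", "那里", "去了", "不知"}


def build_vocabulary(word_counts, stop_words):
    """Return the words that survive screening, most frequent first."""
    ranked = sorted(word_counts.items(), key=lambda item: -item[1])
    return [word for word, _ in ranked if word not in stop_words]


vocabulary = build_vocabulary(counts, STOP_WORDS)
print(vocabulary)
```

With this stop list the result contains the same eight words, in the same order, as the vocabulary listed in the example.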
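Step 2: Each time a search is performed, store the search data in a file, and search and match the search data in the file to determine the search frequency; construct a search-information knowledge graph from the search frequency, and associate other users' search history data to refine the knowledge graph; finally, use the knowledge graph to determine the search strategy, and predict and sort the retrieved documents;
 For example, suppose a user searches for a document about Fengjie (凤姐) and Daiyu (黛玉), where Fengjie is the character in Dream of Red Mansions and not the internet celebrity of the same name. The user enters the keywords "Fengjie" and "Daiyu". The search data "Fengjie" and "Daiyu" are stored in database file A, and file A is searched for these keywords. If "Fengjie" is already present, its frequency is incremented by 1; if not, the keyword "Fengjie" is stored with its frequency set to 1. "Daiyu" is handled in the same way.
 The knowledge graph is then determined from the frequencies of "Fengjie" and "Daiyu" in file A as Fengjie->Daiyu, which also reflects which frequency is higher: Fengjie->Daiyu represents two associated search keywords, with the frequency of Fengjie in file A higher than that of Daiyu. Next, the knowledge graph Fengjie->Daiyu built for this search is compared with other users' search history data. Suppose other users have previously searched: Fengjie->罗*凤; Fengjie->Daiyu->Dream of Red Mansions; and Fengjie. Because a similar record exists, namely Fengjie->Daiyu->Dream of Red Mansions, whose leading part is identical to Fengjie->Daiyu, the knowledge graph for this search is revised to Fengjie->Daiyu->Dream of Red Mansions. This determines the final search strategy, and the documents for this search are predicted, output, and sorted according to the results of the historical search Fengjie->Daiyu->Dream of Red Mansions.
 If the comparison with historical data finds no similar data, the knowledge graph Fengjie->Daiyu determined this time is stored as a record in file A, and the documents retrieved by this search are output.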
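The bookkeeping in this step can be sketched as follows. The sketch rests on several assumptions: file A is modelled as an in-memory dictionary plus a list of stored chains, a "knowledge graph" is reduced to an ordered tuple of keywords, "similar" is taken to mean "a longer stored chain that begins with the current one", and the starting counts are invented so that Fengjie is more frequent than Daiyu. The function names are hypothetical.

```python
def record_search(frequencies, keywords):
    """Increment each keyword's count; unseen keywords start at 1."""
    for keyword in keywords:
        frequencies[keyword] = frequencies.get(keyword, 0) + 1


def build_chain(frequencies, keywords):
    """Order this search's keywords from most to least frequently searched."""
    return tuple(sorted(keywords, key=lambda k: -frequencies[k]))


def refine_chain(chain, history):
    """Return the first longer stored chain that begins with `chain`, else `chain`."""
    for past in history:
        if len(past) > len(chain) and past[:len(chain)] == chain:
            return past
    return chain


# Assumed prior contents of file A: Fengjie searched more often than Daiyu.
frequencies = {"凤姐": 3, "黛玉": 1}
# Other users' stored chains, taken from the example in the text.
history = [("凤姐", "罗*凤"), ("凤姐", "黛玉", "红楼梦"), ("凤姐",)]

keywords = ["凤姐", "黛玉"]
record_search(frequencies, keywords)
chain = build_chain(frequencies, keywords)
strategy = refine_chain(chain, history)
if strategy == chain and chain not in history:
    history.append(chain)  # no similar record: store the new chain

print("->".join(strategy))  # prints 凤姐->黛玉->红楼梦
```

Matching on the leading part only is the simplest reading of the example; a real system would likely need a rule for choosing among several stored chains that share the same prefix, which the text does not specify.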
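Step 3: User behavior analysis: when a user browses the search results, record and associate the document data the user browses and downloads, and analyze the data to establish the degree of association between the search strategy of Step 2 and the predicted documents;
 Behavioral data is recorded while the user browses documents. Suppose that 10 relevant documents are found in Step 2, named document 1 to document 10. While browsing these 10 documents, the user downloads document 2, opens documents 3 and 5, and performs no other operations. The weight of document 2 is then set to high, the weight of documents 3 and 5 to medium, and the weight of the other 7 documents to low. The sorted document IDs are stored as data in file A, in the same record as the corresponding knowledge graph established in Step 2, thereby establishing the degree of association between the search strategy of Step 2 and the predicted documents. (The IDs of documents 2, 3, and 5 are stored in order; in this example only documents 2, 3, and 5 were browsed, so the other document IDs need not be stored.)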
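A minimal sketch of this weighting follows. The numeric weights are assumptions (the text only requires download to outrank browsing, and browsing to outrank no action), and `rank_interacted` and the dictionary `file_a` are hypothetical stand-ins for the record kept in file A.

```python
# Assumed numeric weights; the text only fixes the order download > open > none.
WEIGHTS = {"download": 2, "open": 1}


def rank_interacted(results, actions):
    """Return IDs of the documents the user acted on, highest weight first.

    results: document IDs in the order they were shown.
    actions: maps a document ID to the strongest action taken on it.
    """
    touched = [doc for doc in results if doc in actions]
    # sorted() is stable, so equally weighted documents keep display order.
    return sorted(touched, key=lambda doc: -WEIGHTS[actions[doc]])


results = list(range(1, 11))                     # documents 1 to 10
actions = {2: "download", 3: "open", 5: "open"}  # behaviour from the example
strategy = ("凤姐", "黛玉", "红楼梦")

# Hypothetical in-memory stand-in for one record in file A:
# the search strategy together with the ordered document IDs.
file_a = {strategy: rank_interacted(results, actions)}
print(file_a[strategy])  # prints [2, 3, 5]
```

Documents with no recorded action are simply left out of the stored list, matching the note that their IDs need not be stored.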
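Step 4: Predict other users' search results according to the degree of association between the search strategy and the predicted documents. The document weights from this search are associated with the search strategy and used as the prediction standard for the next search. That is, the next time another user's search strategy is Fengjie->Daiyu->Dream of Red Mansions, the predicted output order is documents 2, 3, 5, ... The IDs of documents 2, 3, and 5 can be read directly, in order, from data file A, and this order is the order of document weights.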
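The lookup described here can be sketched as below. Placing the stored IDs first follows the text; appending the remaining candidates in their original order is an assumption made to give the trailing "..." a concrete meaning. `predict_order` and `file_a` are hypothetical names.

```python
def predict_order(file_a, strategy, candidates):
    """Put the document IDs stored for `strategy` first, then the other candidates."""
    stored = file_a.get(strategy, [])
    rest = [doc for doc in candidates if doc not in stored]
    return list(stored) + rest


# Record assumed to have been written to file A after the earlier search.
file_a = {("凤姐", "黛玉", "红楼梦"): [2, 3, 5]}

order = predict_order(file_a, ("凤姐", "黛玉", "红楼梦"), list(range(1, 11)))
print(order)  # prints [2, 3, 5, 1, 4, 6, 7, 8, 9, 10]
```

When no record exists for a strategy, the function returns the candidates unchanged, so a first-time search falls back to the ordinary result order.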
Industrial Applicability
The present invention has practical value in the field of information retrieval.

Claims (2)

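  1.  A subject document retrieval prediction method, characterized by comprising the following steps:
    Step 1: Build a subject resource database and obtain digital document resources: scan paper documents with scanning equipment to obtain their digital resources, and build a subject search vocabulary for the digital resources of the documents; the subject search vocabulary is built by performing word frequency statistics on the digital resources and determining the vocabulary from the results, with one vocabulary built per document;
    Step 2: Each time a search is performed, store the search data in a file, and search and match the search data in the file to determine the search frequency; construct from the search frequency a search-information knowledge graph that also reflects which frequencies are higher, and associate other users' search history data to refine the knowledge graph; finally, use the knowledge graph to determine the search strategy, and predict and sort the retrieved documents;
    Step 3: User behavior analysis: when a user browses the search results, record and associate the document data the user browses and downloads, and analyze the data to establish the degree of association between the search strategy of Step 2 and the predicted documents;
    Step 4: Predict other users' search results according to the degree of association between the search strategy and the predicted documents.
  2. The subject document retrieval prediction method according to claim 1, characterized in that, in Step 3, the document data browsed and downloaded by the user are assigned different weights, with the download weight higher than the browsing weight.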
PCT/CN2023/113965 2023-05-12 2023-08-21 Subject document retrieval prediction method WO2024078141A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
ZA2023/08509A ZA202308509B (en) 2023-05-12 2023-09-04 Prediction method for subject literature retrieval

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310531081.7 2023-05-12
CN202310531081.7A CN116340468A (zh) 2023-05-12 2023-05-12 Subject document retrieval prediction method

Publications (1)

Publication Number Publication Date
WO2024078141A1 (zh) 2024-04-18

Family

ID=86880668

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/113965 WO2024078141A1 (zh) 2023-05-12 2023-08-21 Subject document retrieval prediction method

Country Status (3)

Country Link
CN (1) CN116340468A (zh)
WO (1) WO2024078141A1 (zh)
ZA (1) ZA202308509B (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116340468A (zh) * 2023-05-12 2023-06-27 华北理工大学 Subject document retrieval prediction method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
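CN105095469A (zh) * 2015-08-07 2015-11-25 薛德军 Feedback-based document retrieval method
CN108804557A (zh) * 2018-05-22 2018-11-13 温州医科大学 Medical journal paper recommendation method and system
CN112148885A (zh) * 2020-09-04 2020-12-29 上海晏鼠计算机技术股份有限公司 Knowledge-graph-based intelligent search method and system
CN114741627A (zh) * 2022-04-12 2022-07-12 中国人民解放军32802部队 Internet-oriented auxiliary information search method
CN116340468A (zh) * 2023-05-12 2023-06-27 华北理工大学 Subject document retrieval prediction method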

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
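CN112434168A (zh) * 2020-11-09 2021-03-02 广西壮族自治区图书馆 Library-based knowledge graph construction method and fragmented knowledge generation method
CN112885478B (zh) * 2021-01-28 2023-07-07 平安科技(深圳)有限公司 Medical document retrieval method, apparatus, electronic device, and storage medium
CN115563313A (zh) * 2022-10-25 2023-01-03 上海交通大学 Knowledge-graph-based semantic retrieval system for documents and books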

Also Published As

Publication number Publication date
ZA202308509B (en) 2024-03-27
CN116340468A (zh) 2023-06-27


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23876346

Country of ref document: EP

Kind code of ref document: A1