CN116340468A - Theme literature retrieval prediction method - Google Patents


Publication number
CN116340468A
Authority
CN
China
Prior art keywords
search
retrieval
document
data
user
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310531081.7A
Other languages
Chinese (zh)
Inventor
郑志军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China University of Science and Technology
Original Assignee
North China University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-05-12
Filing date
2023-05-12
Publication date
2023-06-27
Application filed by North China University of Science and Technology
Priority to CN202310531081.7A
Publication of CN116340468A
Priority to PCT/CN2023/113965 (WO2024078141A1)
Priority to ZA2023/08509A (ZA202308509B)
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a topic document retrieval prediction method, belonging to the field of data analysis and prediction. The method comprises: constructing a topic resource database and constructing a topic retrieval word library for the digital resources of the documents; storing the search data of each search in a file, and matching the search data against the file to determine the search frequency; constructing a retrieval information knowledge graph and associating it with other users' retrieval history data to refine the knowledge graph; determining a retrieval strategy from the knowledge graph, and predicting and sorting the retrieved documents; when a user browses the search results, recording and associating the document data the user browses and downloads, analyzing the data, and establishing the degree of association between the retrieval strategy and the predicted documents; and predicting the search results of other users according to that degree of association. By outputting the retrieved documents in a predicted order, the method improves the user's search experience.

Description

Theme literature retrieval prediction method
Technical Field
The invention relates to the field of data analysis and prediction, in particular to a topic document retrieval and prediction method.
Background
Research by Wei Qinghua et al. of Lanzhou University indicates that the China university humanities and social sciences literature center has preliminarily built a batch of database platforms that fully disclose special-collection documents. However, these digitized resources are often stored in specific, independent document management systems that provide only simple document retrieval and copy-and-scan services, and future construction still needs to be strengthened in areas such as multidimensional, fine-grained metadata processing and the development of rich and varied platform functions. Han Bing of Dalian Polytechnic University studied the construction of self-built special-feature databases in 42 "Double First-Class" university libraries and offered development suggestions for commonly observed problems: unbalanced database construction, inconsistent platform standards, limited functionality, low openness to outside users, a single constructing body, and insufficient sustainability of construction and service. He Xiaoyue et al. of the School of Information Management at Nanjing University searched and browsed 274 self-built databases of 20 Chinese university libraries and suggested drawing on the experience of American universities, seeking more external cooperation, and paying attention to improving the user experience.
Disclosure of Invention
In view of the state of the art described in the background, the invention provides a topic document retrieval prediction method for improving the user experience in document retrieval.
The invention adopts the following technical scheme. The topic document retrieval prediction method comprises the following steps:
Step one: constructing a topic resource database to obtain the digital resources of the documents; paper documents are scanned by a scanning device to obtain their digital resources; a topic retrieval word library is constructed for the digital resources of the documents;
Step two: storing the search data of each search in a file, and matching the search data against the file to determine the search frequency; constructing a retrieval information knowledge graph according to the search frequency, and associating it with other users' retrieval history data to refine the knowledge graph; finally, determining a retrieval strategy from the knowledge graph, and predicting and sorting the retrieved documents;
Step three: user behavior analysis: when a user browses the search results, recording and associating the document data the user browses and downloads, analyzing the data, and establishing the degree of association between the retrieval strategy of step two and the predicted documents;
Step four: predicting the search results of other users according to the degree of association between the retrieval strategy and the predicted documents.
Further, in step one, constructing the topic retrieval word library comprises performing word-frequency statistics on the digital resources and determining the topic retrieval word library according to the statistical results, with one word library constructed per document.
Further, in step three, different weights are set for the document data the user browses and the document data the user downloads, the download weight being higher than the browsing weight.
The advantage of the method is that the search keywords, the user's historical search data, and other users' search data are analyzed to form a retrieval strategy; the degree of association between the retrieval strategy and the predicted documents is determined; and the retrieved documents are output in a predicted order, which improves the user's search experience.
Detailed Description
Step one: constructing a topic resource database to obtain the digital resources of the documents; paper documents are scanned by a scanning device to obtain their digital resources; a topic retrieval word library is constructed for the digital resources of the documents.
To construct the topic retrieval word library, word-frequency statistics are performed on the digital resources. Word-frequency analysis counts how often important words occur in a text and is an important means of text mining. The technique has no difficulty with new words: a new word is counted as long as it is actually used. For example, the word frequencies of a document are counted with the tool "candy cloud", meaningless words are removed from the statistical results, and the remaining words are taken as the topic retrieval word library; one word library is constructed per document.
For example, taking the text of Dream of the Red Chamber as input, the word-frequency statistics output, sorted by count, is: Baoyu 4004, laugh 2454, what 1834, Fengjie 1743, one 1715, Jia Mu 1690, no 1451, Daiyu 1379, we 1226, where 1178, Xiren 1156, girls 1136, 1096, baohu 1089, Wang Furen, no 1080.
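The word-frequency step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool named in the text: the function name `build_topic_lexicon`, the stop-word list, and the regular-expression tokenizer are all hypothetical, and a Chinese-language corpus would need a proper word segmenter instead of the simple tokenizer used here.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the text only says that "meaningless words" are removed.
STOP_WORDS = {"the", "a", "of", "and", "what", "one", "no", "we", "where"}

def build_topic_lexicon(text, top_n=10, stop_words=STOP_WORDS):
    """Return the top_n (word, count) pairs for one document after stop-word removal."""
    # Assumption: whitespace-delimited text; lower-case it and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(top_n)

# One word library is built per document, as the description states.
sample = "Baoyu laughed. Daiyu and Baoyu met Fengjie; Baoyu and Daiyu left."
lexicon = build_topic_lexicon(sample, top_n=3)
print(lexicon)
```

Filtering before counting keeps the stop words out of the ranking entirely, which matches the described order of operations only loosely (the text filters after counting); the resulting library is the same either way.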
Step two: storing the search data of each search in a file, and matching the search data against the file to determine the search frequency; constructing a retrieval information knowledge graph according to the search frequency, and associating it with other users' retrieval history data to refine the knowledge graph; finally, determining a retrieval strategy from the knowledge graph, and predicting and sorting the retrieved documents.
Suppose a user wants documents about Fengjie and Daiyu, where Fengjie means the character in Dream of the Red Chamber and not the internet celebrity known by the same name. The user enters the keywords "Fengjie" and "Daiyu", and these search data are stored in database file A. File A is then searched for each keyword. If "Fengjie" is already present, its frequency is increased by 1; if not, the keyword "Fengjie" is stored with its frequency set to 1. Likewise, if "Daiyu" is already present, its frequency is increased by 1; if not, the keyword "Daiyu" is stored with its frequency set to 1. A knowledge graph is then determined from the frequencies of Fengjie and Daiyu in file A: Fengjie -> Daiyu. The graph also reflects the relative frequencies: Fengjie -> Daiyu indicates that the two keywords were searched together and that Fengjie has a higher frequency than Daiyu in file A. The established knowledge graph Fengjie -> Daiyu is then associated and compared with other users' retrieval history data. Suppose other users have searched: Fengjie -> roo; Fengjie -> Daiyu -> Dream of the Red Chamber. Because the historical record Fengjie -> Daiyu -> Dream of the Red Chamber has the same leading part as Fengjie -> Daiyu, the current knowledge graph is revised to Fengjie -> Daiyu -> Dream of the Red Chamber, which determines the final retrieval strategy. The documents for the current search are then predicted, output, and sorted according to the results of the historical search Fengjie -> Daiyu -> Dream of the Red Chamber.
If comparing the knowledge graph Fengjie -> Daiyu with the historical data finds no similar record, the determined knowledge graph Fengjie -> Daiyu is stored in file A as a record, and the retrieved documents are output.
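A minimal Python sketch of this step follows. It models file A as an in-memory dictionary and the knowledge graph as an ordered list of keywords; both are simplifying assumptions, as are the function names `record_search`, `build_chain`, and `refine_with_history` and the prior contents of file A. "Similar" historical data is interpreted here as a longer record that begins with the current chain, which is one reading of the example in the text.

```python
def record_search(freq, keywords):
    """Increase the stored frequency of each keyword by 1, or store it with frequency 1."""
    for kw in keywords:
        freq[kw] = freq.get(kw, 0) + 1

def build_chain(freq, keywords):
    """Order the query keywords by descending stored frequency (ties keep input order)."""
    return sorted(keywords, key=lambda kw: -freq[kw])

def refine_with_history(chain, history):
    """Return the first historical chain that extends `chain` as a prefix, else `chain`."""
    for past in history:
        if len(past) > len(chain) and past[:len(chain)] == chain:
            return past
    return chain

file_a = {"Fengjie": 2, "Daiyu": 1}      # hypothetical prior contents of file A
query = ["Fengjie", "Daiyu"]
record_search(file_a, query)             # Fengjie becomes 3, Daiyu becomes 2
chain = build_chain(file_a, query)       # Fengjie -> Daiyu

# Other users' retrieval history, taken from the example in the text.
history = [["Fengjie", "roo"],
           ["Fengjie", "Daiyu", "Dream of the Red Chamber"]]
strategy = refine_with_history(chain, history)
if strategy == chain:
    history.append(chain)                # no similar record: store the new chain
print(" -> ".join(strategy))
```

A prefix test is the simplest matching rule that reproduces the example; a production system would likely need a looser similarity measure, which the text does not specify.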
Step three: user behavior analysis: when a user browses the search results, recording and associating the document data the user browses and downloads, analyzing the data, and establishing the degree of association between the retrieval strategy of step two and the predicted documents.
Behavior data are recorded while the user browses the documents. Suppose that 10 related documents, numbered 1 to 10, are retrieved in step two, and that while browsing them the user downloads document 2, opens documents 3 and 5, and performs no other operation. In this case document 2 is given a high weight, documents 3 and 5 are given a lower weight, and the other 7 documents are given a low weight. The sorted document IDs are then stored as data in file A, in the same record as the corresponding knowledge graph established in step two (the IDs of documents 2, 3 and 5 are stored in order; in this example only documents 2, 3 and 5 were browsed, so the other document IDs need not be stored). This establishes the degree of association between the retrieval strategy of step two and the predicted documents.
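The weighting in this example can be sketched as follows. The numeric weights are assumptions (the text only requires that a download outweigh a browse), and the function name `rank_by_behavior` and the record layout are hypothetical.

```python
DOWNLOAD_WEIGHT = 2   # assumed value; the text only requires download > browse
BROWSE_WEIGHT = 1     # assumed value

def rank_by_behavior(events):
    """Rank the documents a user interacted with.

    events: list of (doc_id, action) pairs, action being 'download' or 'browse'.
    Returns document IDs sorted by descending weight, ties broken by ID.
    Documents with no recorded action are not included.
    """
    weights = {}
    for doc_id, action in events:
        w = DOWNLOAD_WEIGHT if action == "download" else BROWSE_WEIGHT
        # Keep the strongest signal seen for each document.
        weights[doc_id] = max(weights.get(doc_id, 0), w)
    return sorted(weights, key=lambda d: (-weights[d], d))

# The scenario from the text: document 2 downloaded, documents 3 and 5 opened.
events = [(3, "browse"), (2, "download"), (5, "browse")]
ranked = rank_by_behavior(events)

# Store the ranked IDs in the same record as the retrieval strategy.
record = {"strategy": ("Fengjie", "Daiyu", "Dream of the Red Chamber"),
          "docs": ranked}
print(record)
```

Only the interacted documents are stored, in line with the remark that the other document IDs need not be kept.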
Step four: predicting the search results of other users according to the degree of association between the retrieval strategy and the predicted documents. The document weights from the current search are associated with the retrieval strategy and serve as the prediction standard for the next search. That is, the next time another user's retrieval strategy is Fengjie -> Daiyu -> Dream of the Red Chamber, the predicted output order is documents 2, 3, 5.
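As a rough illustration of step four, the stored association can be looked up by retrieval strategy and used to reorder a later user's results. The function name `predict_order` and the dictionary layout are hypothetical, and appending the remaining documents after the predicted ones is an assumption; the text states only that documents 2, 3, 5 come out in that order.

```python
def predict_order(strategy, associations, default_results):
    """Place documents associated with `strategy` first, then the rest in original order."""
    preferred = [d for d in associations.get(tuple(strategy), [])
                 if d in default_results]
    rest = [d for d in default_results if d not in preferred]
    return preferred + rest

# Association saved after an earlier user's session (values from the example in the text).
associations = {("Fengjie", "Daiyu", "Dream of the Red Chamber"): [2, 3, 5]}

results = list(range(1, 11))   # the 10 retrieved documents, in default order
predicted = predict_order(["Fengjie", "Daiyu", "Dream of the Red Chamber"],
                          associations, results)
print(predicted)
```

A strategy with no stored association leaves the default order unchanged, so the prediction degrades gracefully for unseen queries.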
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (3)

1. A topic document retrieval prediction method, characterized by comprising the following steps:
step one: constructing a topic resource database to obtain the digital resources of the documents; scanning paper documents by a scanning device to obtain their digital resources; constructing a topic retrieval word library for the digital resources of the documents;
step two: storing the search data of each search in a file, and matching the search data against the file to determine the search frequency; constructing a retrieval information knowledge graph according to the search frequency, and associating it with other users' retrieval history data to refine the knowledge graph; finally, determining a retrieval strategy from the knowledge graph, and predicting and sorting the retrieved documents;
step three: user behavior analysis: when a user browses the search results, recording and associating the document data the user browses and downloads, analyzing the data, and establishing the degree of association between the retrieval strategy of step two and the predicted documents;
step four: predicting the search results of other users according to the degree of association between the retrieval strategy and the predicted documents.
2. The method according to claim 1, wherein in step one, constructing the topic retrieval word library comprises performing word-frequency statistics on the digital resources and determining the topic retrieval word library according to the statistical results, one word library being constructed per document.
3. The method according to claim 1, wherein in step three, different weights are set for the document data the user browses and the document data the user downloads, the download weight being higher than the browsing weight.
CN202310531081.7A 2023-05-12 2023-05-12 Theme literature retrieval prediction method Pending CN116340468A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202310531081.7A CN116340468A (en) 2023-05-12 2023-05-12 Theme literature retrieval prediction method
PCT/CN2023/113965 WO2024078141A1 (en) 2023-05-12 2023-08-21 Subject-based document retrieval prediction method
ZA2023/08509A ZA202308509B (en) 2023-05-12 2023-09-04 Prediction method for subject literature retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310531081.7A CN116340468A (en) 2023-05-12 2023-05-12 Theme literature retrieval prediction method

Publications (1)

Publication Number Publication Date
CN116340468A true CN116340468A (en) 2023-06-27

Family

ID=86880668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310531081.7A Pending CN116340468A (en) 2023-05-12 2023-05-12 Theme literature retrieval prediction method

Country Status (3)

Country Link
CN (1) CN116340468A (en)
WO (1) WO2024078141A1 (en)
ZA (1) ZA202308509B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024078141A1 (en) * 2023-05-12 2024-04-18 华北理工大学 Subject-based document retrieval prediction method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095469A (en) * 2015-08-07 2015-11-25 薛德军 Method for document retrieval based on feedback
CN108804557A (en) * 2018-05-22 2018-11-13 温州医科大学 Medical journals paper recommends method and system
CN112148885A (en) * 2020-09-04 2020-12-29 上海晏鼠计算机技术股份有限公司 Intelligent searching method and system based on knowledge graph
CN112434168A (en) * 2020-11-09 2021-03-02 广西壮族自治区图书馆 Knowledge graph construction method and fragmentized knowledge generation method based on library
CN112885478A (en) * 2021-01-28 2021-06-01 平安科技(深圳)有限公司 Medical document retrieval method, medical document retrieval device, electronic device, and storage medium
CN114741627A (en) * 2022-04-12 2022-07-12 中国人民解放军32802部队 Internet-oriented auxiliary information searching method
CN115563313A (en) * 2022-10-25 2023-01-03 上海交通大学 Knowledge graph-based document book semantic retrieval system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116340468A (en) * 2023-05-12 2023-06-27 华北理工大学 Theme literature retrieval prediction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
马海龙等: "《被遗忘权的法教义学钩沉》", 宁夏人民教育出版社, pages: 244 *

Also Published As

Publication number Publication date
ZA202308509B (en) 2024-03-27
WO2024078141A1 (en) 2024-04-18

Similar Documents

Publication Publication Date Title
US8266147B2 (en) Methods and systems for database organization
US8812493B2 (en) Search results ranking using editing distance and document information
US8010534B2 (en) Identifying related objects using quantum clustering
WO2017097231A1 (en) Topic processing method and device
US20060167930A1 (en) Self-organized concept search and data storage method
US20100205172A1 (en) Method for using dual indices to support query expansion, relevance/non-relevance models, blind/relevance feedback and an intelligent search interface
JP2016532173A (en) Semantic information, keyword expansion and related keyword search method and system
JP2001519952A (en) Data summarization device
US8892574B2 (en) Search apparatus, search method, and non-transitory computer readable medium storing program that input a query representing a subset of a document set stored to a document database and output a keyword that often appears in the subset
CN107844493B (en) File association method and system
CN116340468A (en) Theme literature retrieval prediction method
CN110580255A (en) method and system for storing and retrieving data
US20120239657A1 (en) Category classification processing device and method
US8484221B2 (en) Adaptive routing of documents to searchable indexes
Kang et al. A term cluster query expansion model based on classification information in natural language information retrieval
Azcarraga et al. Evaluating keyword selection methods for WEBSOM text archives
CN106934007B (en) Associated information pushing method and device
KR101918358B1 (en) A Data Center System Providing Customized Information
US8819021B1 (en) Efficient and phased method of processing large collections of electronic data known as “best match first”™ for electronic discovery and other related applications
CN110442593B (en) Cross-application sharing method based on user search information
US8666972B2 (en) System and method for content management and determination of search conditions
Calistru et al. Multidimensional descriptor indexing: exploring the BitMatrix
Zhang et al. An efficient algorithm for clustering search engine results
KR101414999B1 (en) Search result providing system and method using tag based boolean query matching
KR100645711B1 (en) Server, Method and System for Providing Information Search Service by Using Web Page Segmented into Several Information Blocks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20230627)