CN116340468A

CN116340468A - Topic document retrieval prediction method

Info

Publication number: CN116340468A
Application number: CN202310531081.7A
Authority: CN
Inventors: 郑志军
Original assignee: North China University of Science and Technology
Current assignee: North China University of Science and Technology
Priority date: 2023-05-12
Filing date: 2023-05-12
Publication date: 2023-06-27
Also published as: ZA202308509B; WO2024078141A1

Abstract

The invention discloses a topic document retrieval prediction method in the field of data analysis and prediction. The method comprises: constructing a topic resource database and building a topic retrieval word stock for the digital resources of the documents; storing the search data of every search in a file, and searching and matching the search data in the file to determine the retrieval frequency; constructing a retrieval-information knowledge graph and refining it by associating the retrieval history of other users; determining a retrieval strategy from the knowledge graph, then predicting and sorting the retrieved documents; when a user browses the search results, recording and associating the documents the user browses and downloads, analyzing the data, and establishing the degree of association between the retrieval strategy and the predicted documents; and predicting the search results of other users according to that degree of association. Outputting the retrieved documents in predicted order improves the user's search experience.

Description

Topic document retrieval prediction method

Technical Field

The invention relates to the field of data analysis and prediction, in particular to a topic document retrieval and prediction method.

Background

Research by Wei Qinghua et al. of Lanzhou University indicates that the China university humanities and social sciences literature center has preliminarily built a batch of database platforms that fully disclose special-collection documents. However, the digitized resources are usually stored in specific, independent document management systems that provide only simple document retrieval and copy-and-scan services, and further work is still needed on multidimensional, fine-grained metadata processing and on developing richer platform functions. Han Bing of Dalian Polytechnic University surveyed the self-built featured databases of 42 "double first-class" university libraries and offered development suggestions for common problems: unbalanced database construction, differing platform standards, limited functionality, low openness to outside users, a single constructing body, and insufficient sustainability of construction and service. He Xiaoyue et al. of the School of Information Management, Nanjing University, searched and browsed 274 self-built databases of 20 Chinese university libraries and recommended drawing on the experience of American universities, seeking more external cooperation, and paying attention to improving the user experience.

Disclosure of Invention

In view of the state of the art described in the background, the invention provides a topic document retrieval prediction method for improving the user experience in document retrieval.

The invention adopts the following technical scheme: the topic document retrieval prediction method comprises the following steps:

Step one: constructing a topic resource database to obtain digital document resources; paper documents are scanned by a scanning device to obtain the digital resources of the documents; a topic retrieval word stock is constructed for the digital resources of the documents;

Step two: storing the search data of every search in a file, and searching and matching the search data in the file to determine the retrieval frequency; constructing a retrieval-information knowledge graph according to the retrieval frequency, and refining the knowledge graph by associating the retrieval history of other users; finally, determining a retrieval strategy from the knowledge graph, and predicting and sorting the retrieved documents;

Step three: user behavior analysis: when a user browses the search results, recording and associating the documents browsed and downloaded by the user, analyzing the data, and establishing the degree of association between the retrieval strategy of step two and the predicted documents;

Step four: predicting the search results of other users according to the degree of association between the retrieval strategy and the predicted documents.

Further, in constructing the topic retrieval word stock in step one, word frequency statistics are performed on the digital resources, the topic retrieval word stock is determined according to the word frequency statistics, and one word stock is constructed per document.

Further, in step three, different weights are set for the documents the user browses and the documents the user downloads, the download weight being higher than the browsing weight.

The advantage of the method is that the search keywords, the historical search data, and the search data of other users are analyzed to form a retrieval strategy, the degree of association between the retrieval strategy and the predicted documents is determined, and the retrieved documents are output in predicted order, which improves the user's search experience.

Detailed Description

Step one: a topic retrieval word stock is constructed by performing word frequency statistics on the digital resources. Word frequency statistics count how often important words occur in a text and are an important means of text mining. The technique is not troubled by new words: any new word is counted as long as it is actually used. For example, word frequency statistics for a document are computed with the tool "candy cloud", meaningless words are removed from the results, and the remaining words are taken as the topic retrieval word stock. One word stock is constructed per document.

For example, taking Dream of the Red Chamber as the input document, the output is the following word-frequency ranking: Baoyu 4004, laugh 2454, what 1834, Fengjie 1743, one 1715, Jia Mu 1690, no 1451, Daiyu 1379, we 1226, where 1178, Xiren 1156, girl 1136, 1096, baohu 1089, Wang Furen, no 1080.
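The word-stock construction described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the stop-word list, the function name `build_topic_lexicon`, and the regex tokenizer are all assumptions, and a real Chinese-language document would need a word-segmentation step in place of the regex.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the source only says that meaningless words are removed.
STOP_WORDS = {"what", "one", "no", "we", "where", "the", "a", "and"}

def build_topic_lexicon(text, stop_words=STOP_WORDS, top_n=10):
    """Count word frequencies in one document and keep the top_n meaningful words."""
    # Assumption: simple regex tokenization stands in for real word segmentation.
    words = re.findall(r"[A-Za-z]+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(top_n)

# Toy document; one word stock is built per document.
sample = ("Daiyu laughed and Baoyu laughed. What did Fengjie say? "
          "Baoyu asked Daiyu. Baoyu left.")
lexicon = build_topic_lexicon(sample, top_n=3)
print(lexicon)
```

Filtering before counting keeps the stop words out of the ranking entirely, so `top_n` refers to meaningful words only.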

Step two: suppose a user wants documents about Fengjie and Daiyu, meaning Fengjie from Dream of the Red Chamber and not the internet celebrity known as Fengjie. The user enters the keywords "Fengjie" and "Daiyu" when searching. The search data containing "Fengjie" and "Daiyu" is stored in a database file A, and file A is searched for each keyword. If "Fengjie" is already present, its frequency is increased by 1; if not, the keyword "Fengjie" is stored directly and its frequency is set to 1. Likewise, if "Daiyu" is present, its frequency is increased by 1; if not, the keyword "Daiyu" is stored directly and its frequency is set to 1.

A knowledge graph is then determined from the frequencies of "Fengjie" and "Daiyu" in file A: Fengjie -> Daiyu. The graph also reflects the relative frequencies: Fengjie -> Daiyu means that the two keywords were retrieved together and that "Fengjie" has a higher frequency than "Daiyu" in file A.

The established knowledge graph Fengjie -> Daiyu is then associated and compared with the retrieval history of other users, for example the records: chicken sister -> roo; Fengjie -> Daiyu -> Dream of the Red Chamber. Because the similar record Fengjie -> Daiyu -> Dream of the Red Chamber begins with the same sequence Fengjie -> Daiyu, the current knowledge graph is modified to Fengjie -> Daiyu -> Dream of the Red Chamber, which determines the final retrieval strategy. The documents for the current search are then predicted, output, and sorted according to the results of the historical search Fengjie -> Daiyu -> Dream of the Red Chamber.

If the comparison finds no similar data in the history, the knowledge graph Fengjie -> Daiyu is stored in file A as a record, and the retrieved documents are output.
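One way to picture step two is the sketch below. It assumes that file A is an in-memory dictionary, that a knowledge graph is an ordered list of keywords, and that a historical record is "similar" when it starts with the current chain. The function names, the starting frequencies, and the first history record are hypothetical, chosen only for illustration.

```python
def record_search(freq, keywords):
    """Increment each keyword's count in file A; a new keyword starts at 1."""
    for kw in keywords:
        freq[kw] = freq.get(kw, 0) + 1

def build_chain(freq, keywords):
    """Order the query keywords by descending frequency: k1 -> k2 -> ..."""
    return sorted(keywords, key=lambda kw: -freq[kw])

def extend_with_history(chain, history):
    """Return the first longer historical chain that starts with `chain`, else `chain`."""
    for past in history:
        if len(past) > len(chain) and past[:len(chain)] == chain:
            return past
    return chain

# Assumed prior state of file A (illustrative numbers only).
file_a = {"Fengjie": 3, "Daiyu": 1}
record_search(file_a, ["Fengjie", "Daiyu"])
chain = build_chain(file_a, ["Daiyu", "Fengjie"])

# Hypothetical retrieval history of other users.
history = [
    ["Fengjie", "Baochai"],
    ["Fengjie", "Daiyu", "Dream of the Red Chamber"],
]
strategy = extend_with_history(chain, history)
print(" -> ".join(strategy))
```

When no historical record shares the prefix, `extend_with_history` returns the chain unchanged, which corresponds to the case where the new knowledge graph is simply stored in file A.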

Step three: behavior data is recorded while the user browses the documents. Assume that step two returns 10 related documents, numbered 1 to 10. While going through them, the user downloads document 2, opens documents 3 and 5, and does nothing with the rest. Document 2 is therefore given a high weight, documents 3 and 5 a lower weight, and the other 7 documents the lowest weight. The document IDs are then stored in sorted order (here the IDs of documents 2, 3 and 5, in that order; since only these three documents were browsed, the other IDs need not be stored). They are stored as data in file A, in the same record as the corresponding knowledge graph established in step two, which establishes the degree of association between the retrieval strategy of step two and the predicted documents.
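The weighting in this example might be sketched as follows. The numeric weights, the event format, and the name `rank_by_behavior` are assumptions; the source only requires that the download weight be higher than the browsing weight.

```python
# Assumed numeric weights; the source fixes only their ordering (download > browse).
ACTION_WEIGHTS = {"download": 2, "browse": 1}

def rank_by_behavior(events):
    """Return IDs of the documents the user touched, highest weight first.

    events is a list of (doc_id, action) pairs. Untouched documents never
    appear, which matches the note that their IDs need not be stored.
    """
    weights = {}
    for doc_id, action in events:
        # Keep the strongest action seen for each document.
        weights[doc_id] = max(weights.get(doc_id, 0), ACTION_WEIGHTS[action])
    # Sort by descending weight, then by ID for a stable, readable order.
    return sorted(weights, key=lambda d: (-weights[d], d))

events = [(3, "browse"), (2, "download"), (5, "browse")]
ranked = rank_by_behavior(events)

# Store the ranking in the same record as the knowledge graph from step two.
record = {
    "graph": ("Fengjie", "Daiyu", "Dream of the Red Chamber"),
    "docs": ranked,
}
print(record["docs"])
```

Keeping the ranked IDs alongside the graph in one record is what ties a retrieval strategy to its predicted documents.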

Step four: the search results of other users are predicted according to the degree of association between the retrieval strategy and the predicted documents. The document weights from the current search are associated with the retrieval strategy and serve as the prediction standard for the next search. That is, the next time another user's retrieval strategy is Fengjie -> Daiyu -> Dream of the Red Chamber, the predicted output order is documents 2, 3, 5.
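A minimal sketch of this lookup is given below, assuming the associations are kept in a dictionary keyed by the retrieval strategy. The name `predict_order` is hypothetical, and appending the remaining candidates in their original order is an assumption: the source states only that documents 2, 3, 5 come out in that order.

```python
def predict_order(strategy, associations, candidates):
    """Put documents associated with the strategy first, then the remaining candidates."""
    preferred = [d for d in associations.get(tuple(strategy), []) if d in candidates]
    # Assumption: documents without recorded behavior keep their original order.
    rest = [d for d in candidates if d not in preferred]
    return preferred + rest

# Association stored after an earlier user's session.
associations = {("Fengjie", "Daiyu", "Dream of the Red Chamber"): [2, 3, 5]}

# A later user arrives at the same retrieval strategy; documents 1 to 10 match.
candidates = list(range(1, 11))
predicted = predict_order(
    ["Fengjie", "Daiyu", "Dream of the Red Chamber"], associations, candidates
)
print(predicted)
```

A strategy with no stored association falls back to the unmodified candidate order, so the prediction never hides documents from the user.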

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims

1. The topic document retrieval prediction method is characterized by comprising the following steps:

Step three: user behavior analysis: when a user browses the search results, the documents browsed and downloaded by the user are recorded and associated, the data is analyzed, and a degree of association is established between the retrieval strategy of step two and the predicted documents;

2. The method according to claim 1, wherein in constructing the topic retrieval word stock in step one, word frequency statistics are performed on the digital resources, the topic retrieval word stock is determined according to the word frequency statistics, and one word stock is constructed per document.

3. The method according to claim 1, wherein in step three different weights are set for the documents the user browses and the documents the user downloads, the download weight being higher than the browsing weight.