CN113343062A

CN113343062A - Scientific and technological resource matching method based on Pagerank algorithm

Info

Publication number: CN113343062A
Application number: CN202110564589.8A
Authority: CN
Inventors: 徐昱琳; 李璇; 周文举; 易开祥; 费敏锐
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2021-05-24
Filing date: 2021-05-24
Publication date: 2021-09-03

Abstract

The invention discloses a scientific and technological resource matching method based on the Pagerank algorithm. The relevance formula used by Elasticsearch draws mainly on term frequency/inverse document frequency (TF/IDF) and the vector space model, and does not consider the document value of the queried object. In order to quickly and accurately find the most valuable documents that meet expectations, the invention builds an Elasticsearch document search engine, calculates document value with a new document value ranking algorithm based on the Pagerank algorithm, and adds a document value factor to the original Elasticsearch relevance rule to obtain a new relevance ranking rule. Once the document value ranking algorithm is mature, its idea can be applied to other scientific and technological resources: value ranking algorithms for intellectual property, human resources, policy consultation and the like are obtained by modifying the details of the algorithm.

Description

Scientific and technological resource matching method based on Pagerank algorithm

Technical Field

The invention relates to the field of search engines and resource matching, in particular to construction of a scientific and technological resource matching algorithm based on a Pagerank algorithm.

Background

With the development of society, the demand for sharing scientific and technological resources is growing day by day, and search engines have become an important tool for acquiring massive amounts of information. Academic documents are an important component of scientific and technological resources, and users obtain them mainly through academic search engines. In recent years, both search engine companies and database vendors have launched search engines for academic document retrieval to provide users with academic search services, such as Google Scholar, encyclopedia, CNKI academic search, Web of Science, and the like. At present, the search engines of many websites in China use Elasticsearch. Elasticsearch is a Lucene-based search server developed in Java that implements search according to input keywords.

Elasticsearch stores the user's data in an Elasticsearch index. Each sentence is segmented by a tokenizer, and the weights are stored together with the segmentation results. When the user searches, the results are scored and ranked according to these weights and returned to the user. For document matching in scientific and technological resources, the Elasticsearch relevance algorithm, which is based on term frequency/inverse document frequency (TF/IDF), considers only the degree of relevance to the search terms and not the value of the document. As a result, documents with slightly lower similarity but high document value may fail to rank prominently in the matching process.

The Pagerank algorithm is Google's web page ranking algorithm: a weight is attached to each target web page, and the ranking order is determined by the weight. By analogy with web page ranking, domestic scholars have proposed applying the Pagerank algorithm to the ranking of documents, defining document value as the weighted sum of the intrinsic value of a document and the value it obtains by being cited. The intrinsic value of a document is determined mainly by the level of the journal or conference in which it is published and the authority of its authors, and the publication age is used as a measure of the importance of a reference to describe the weight of value transfer. Across the many papers on document value ranking, however, the problem of unreasonable ranking caused by publication time still exists.

Disclosure of Invention

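In order to overcome the defects in the prior art, the invention provides a scientific and technological resource matching method based on the Pagerank algorithm. The method starts from the document resources among scientific and technological resources and introduces a document value algorithm into the Elasticsearch relevance calculation, thereby building a document search engine that takes document value into account. The relevance ranking constructed by the method considers the value ranking of the documents, is aimed mainly at paper documents, and makes the value ranking meet the requirements.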

In order to achieve the purpose of the invention, the invention adopts the following inventive concept:

the principle of the invention is as follows: the paper document value ranking algorithm in the method is an improved algorithm based on Pagerank. Publication time determines where the emphasis falls when evaluating new and old documents. The citation count, download count and impact factor are introduced to describe the intrinsic value of a paper document, and the value obtained by citation is determined from the citation relationships among documents. This yields a new document value ranking algorithm that evaluates the value score of each paper document. The total score finally used for ranking is calculated by combining this value score with the Elasticsearch relevance score, thereby realizing a document search engine that takes document value into account. Once the document value ranking algorithm is mature, its idea can be applied to other scientific and technological resources, and value ranking algorithms for intellectual property, human resources, policy consultation and the like can be obtained by modifying the details of the algorithm.

According to the inventive concept, the invention adopts the following technical scheme:

a scientific and technological resource matching method based on a Pagerank algorithm comprises the following steps:

step 1: building a basic Elasticsearch search engine using the Spring framework in the IDEA software;

step 2: crawling and parsing paper data from the resource pool using the Elasticsearch search engine, and putting the data into the Elasticsearch index library;

step 3: defining a new document value ranking algorithm based on the Pagerank ranking algorithm, and combining it with the Elasticsearch relevance calculation to obtain a new relevance score calculation rule;

step 4: querying according to the input keywords; when a Request is generated in IDEA, ranking the documents by score using the newly defined relevance scoring rule.

Preferably, visualization of the Elasticsearch index library is implemented using the elasticsearch-head-master plug-in. Preferably, the method is based on the Elasticsearch (7.6.1) framework, the project is built with the IDEA software, visualization of the Elasticsearch index library is implemented using the elasticsearch-head-master plug-in, the scoring rule for search results takes the document value factor into account, and the document value ranking algorithm is based on the web page ranking algorithm Pagerank.

Preferably, in step 1, the Elasticsearch version is configured in the configuration file of the Spring project, and data search is implemented by calling the SearchRequest, SearchResponse, RestHighLevelClient and TermQueryBuilder APIs of Elasticsearch.
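As a rough illustration of what such a keyword search sends to Elasticsearch, the sketch below builds the JSON body of a term query in the Elasticsearch query DSL. It is written in Python rather than with the Java client named above, and the index name `papers`, the field name `title` and the helper `build_term_search` are hypothetical names chosen for the example, not taken from the source.

```python
import json


def build_term_search(index: str, field: str, keyword: str, size: int = 10) -> dict:
    """Return the parts of a term-query search request as plain data."""
    return {
        "index": index,  # hypothetical index name supplied by the caller
        "body": {
            # Query DSL: match documents whose `field` contains the exact term.
            "query": {"term": {field: keyword}},
            "size": size,
        },
    }


request = build_term_search("papers", "title", "pagerank")
print(json.dumps(request["body"], indent=2))
```

The body can be sent to the `_search` endpoint of the chosen index by any HTTP or Elasticsearch client.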

Preferably, in step 2, a Spring project is built on the Spring project framework in IDEA. In the pom.xml file, attention is paid to the Elasticsearch version bound to the project; Maven 3.6.3 is used, and the jar packages required by the Elasticsearch engine are imported automatically. After the project is built and permission has been obtained from the resource platform, paper resource data is crawled with a simple crawler and stored in the ES index library.

Preferably, in step 3, the basic Pagerank algorithm holds that a web page is important, that is, its Pagerank value is relatively high, if many other web pages link to it; and that if a web page with a high Pagerank value links to another web page, the Pagerank value of the linked web page increases accordingly; the main calculation formula of the Pagerank algorithm is as follows:

wherein PR(p_i) represents the PR value of page p_i; C(p_j) represents the total number of outbound links of page p_j, a page that links to p_i; d is a damping factor, generally taken as 0.85, representing the probability that the user continues to click links; and 1-d is the probability that the user leaves the chain of links and opens a new page.
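The recurrence these symbols describe can be sketched as a simple iteration. The sketch assumes the commonly cited form PR(p_i) = (1 - d) + d * Σ PR(p_j) / C(p_j), summed over the pages p_j that link to p_i; the function name, the fixed iteration count and the toy graph are assumptions made for illustration.

```python
def pagerank(links: dict[str, list[str]], d: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Iterate PR(p_i) = (1 - d) + d * sum(PR(p_j) / C(p_j)) for a fixed number of rounds."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {p: 1.0 - d for p in pages}
        for source, targets in links.items():
            if not targets:
                continue  # a page without outbound links passes nothing on
            share = d * pr[source] / len(targets)  # C(p_j) = len(targets)
            for target in targets:
                new_pr[target] += share
        pr = new_pr
    return pr


# Toy graph (hypothetical): A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page C receives links from both A and B and ends up with the highest value, which matches the intuition stated above.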

Preferably, in step 3, in view of the characteristics of scientific and technological paper documents and in combination with the Pagerank algorithm, the value of a document is defined as the weighted sum of its intrinsic value and its cited value, wherein the intrinsic value is determined by the citation count, download count and impact factor of the document, and the cited value is transferred to it by the documents that cite it; the weighting between the two values is determined by publication time, and the following document value ranking formula is adopted:

KLV(u_i) = (1 - d(t_i)) * LZV(u_i) + d(t_i) * TRV(u_i);

where KLV(u_i) represents the document value of document u_i, d(t_i) represents the time damping function of the document, LZV(u_i) represents the intrinsic value of the document itself, and TRV(u_i) represents the value the document obtains by being cited. Applying the core idea of the Pagerank algorithm to document value calculation for scientific and technological paper documents gives the following: a document is important, that is, its value is relatively high, if many other documents cite it; and if a document with a high document value cites another document, the document value of the cited document increases accordingly.

Preferably, in step 4, the value score KLV of each document is calculated and the relevance ranking rule of Elasticsearch is modified, so that the score from the original Elasticsearch relevance calculation accounts for 60% and the score from the value of the scientific and technological paper document accounts for 40%; the results are displayed in order of the final total score, and the specific implementation steps are as follows:

step 3.1: determining the intrinsic value of the scientific literature;

step 3.2: determining the cited value of the scientific literature;

step 3.3: determining a time damping coefficient;

step 3.4: establishing a document value sorting algorithm formula;

step 3.5: fusing the document value ranking formula with the original Elasticsearch similarity calculation, thereby obtaining a new document relevance calculation.

A computer system program executes the scientific and technological resource matching method based on the Pagerank algorithm.

Compared with the prior art, the invention has the following obvious and prominent substantive characteristics and remarkable advantages:

1. the invention recognizes that, since the queried objects are scientific and technological paper documents, the value of a document should be one of the factors that determine the order in which search results are displayed; an algorithm for calculating document value is added on top of the original relevance calculation, which ensures both the accuracy of the search and the quality of the documents found, and better helps users find suitable, high-quality paper documents;

2. the document value ranking algorithm used by the invention improves the damping factor of the original Pagerank algorithm by adding a time factor: according to the publication time of a document, its intrinsic value and its cited value are given different weights, which increases the fairness between new and old documents in value evaluation.

Drawings

FIG. 1 is a diagram illustrating the architecture of Elasticsearch in the present invention.

FIG. 2 is a simplified flow chart of the document value ranking algorithm in accordance with the present invention.

FIG. 3 is the index library visualization interface provided by elasticsearch-head-master in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The above-described scheme is further illustrated below with reference to specific embodiments, which are detailed below:

Example one:

in this embodiment, referring to fig. 1-2, a scientific and technological resource matching method based on the Pagerank algorithm includes the following steps:

The scientific and technological resource matching method based on the Pagerank algorithm starts from the document resources among scientific and technological resources and introduces a document value algorithm into the Elasticsearch relevance calculation, thereby building a document search engine that takes document value into account. The relevance ranking constructed by the method considers the value ranking of the documents, is aimed mainly at paper documents, and makes the value ranking meet the requirements.

Example two:

this embodiment is substantially the same as the first embodiment, and is characterized in that:

in this embodiment, referring to FIGS. 1-3, visualization of the Elasticsearch index library is implemented using the elasticsearch-head-master plug-in.

In this embodiment, in step 1, the Elasticsearch version is configured in the configuration file of the Spring project, and data search is implemented by calling the SearchRequest, SearchResponse, RestHighLevelClient and TermQueryBuilder APIs of Elasticsearch.

In this embodiment, in step 2, a Spring project is built on the Spring project framework in IDEA. In the pom.xml file, attention is paid to the Elasticsearch version bound to the project; Maven 3.6.3 is used, and the jar packages required by the Elasticsearch engine are imported automatically. After the project is built and permission has been obtained from the resource platform, paper resource data is crawled with a simple crawler and stored in the ES index library.

In this embodiment, in step 3, the basic Pagerank algorithm holds that a web page is important, that is, its Pagerank value is relatively high, if many other web pages link to it; and that if a web page with a high Pagerank value links to another web page, the Pagerank value of the linked web page increases accordingly; the main calculation formula of the Pagerank algorithm is as follows:
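In its commonly cited form, which is assumed here to be the formula intended, Pagerank is written as below. M(p_i) is a symbol introduced only for this illustration and denotes the set of pages that link to p_i; C(p_j) is the number of outbound links of page p_j, and d is the damping factor.

```latex
PR(p_i) = (1 - d) + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{C(p_j)}
```

A normalized variant replaces the constant term (1 - d) with (1 - d)/N, where N is the total number of pages; the source does not make clear which variant it uses.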

In this embodiment, in step 3, in view of the characteristics of scientific and technological paper documents and in combination with the Pagerank algorithm, the value of a document is defined as the weighted sum of its intrinsic value and its cited value, wherein the intrinsic value is determined by the citation count, read count and impact factor of the document, and the cited value is transferred to it by the documents that cite it; the weighting between the two values is determined by publication time, and the following document value ranking formula is adopted:

KLV(u_i) = (1 - d(t_i)) * LZV(u_i) + d(t_i) * TRV(u_i);

where KLV(u_i) represents the document value of document u_i, d(t_i) represents the time damping function of the document, LZV(u_i) represents the intrinsic value of the document itself, and TRV(u_i) represents the value the document obtains by being cited.

In this embodiment, in step 4, the value score KLV of each document is calculated and the relevance ranking rule of Elasticsearch is modified, so that the score from the original Elasticsearch relevance calculation accounts for 60% and the score from the value of the scientific and technological paper document accounts for 40%; the results are displayed in order of the final total score, and the specific implementation steps are as follows:

step 3.1: determining the intrinsic value of the scientific literature;

step 3.2: determining the cited value of the scientific literature;

step 3.3: determining a time damping coefficient;

step 3.4: establishing a document value sorting algorithm formula;

In the method, since the queried objects are scientific and technological paper documents, the value of a document should be one of the factors that determine the order in which search results are displayed. An algorithm for calculating document value is added on top of the original relevance calculation, which ensures both the accuracy of the search and the quality of the documents found, and better helps users find suitable, high-quality paper documents. The document value ranking algorithm used in the method improves the damping factor of the original Pagerank algorithm by adding a time factor: according to the publication time of a document, its intrinsic value and its cited value are given different weights, which increases the fairness between new and old documents in value evaluation.

Example three:

this embodiment is substantially the same as the above embodiment, and is characterized in that:

in this embodiment, as shown in FIG. 1, the IDEA software is used to build a Spring project, Elasticsearch (7.6.1) is used to build the search engine, IKAnalyzer is used as the tokenizer, and visualization of the Elasticsearch index library is implemented with the elasticsearch-head-master plug-in, as shown in FIG. 3.

As shown in FIG. 2, because the queried objects are scientific and technological paper documents, the method mainly modifies the relevance calculation of Elasticsearch by adding the value attribute of the documents. For the value score of a scientific and technological paper document, a new formula is proposed, and the score is calculated step by step according to steps 3.1 to 3.5, detailed as follows:

Step 3.1: the intrinsic value of a scientific and technological paper document is determined mainly by its citation count, download count and impact factor. The citation count and download count are important indicators of the intrinsic value of a document and demonstrate its practical value to a certain extent. The impact factor is an indicator of the usefulness and visibility of a journal, but it also reflects the quality of a paper to a certain extent. The intrinsic value of the document is therefore described by combining the three, and the formula is as follows:

wherein LZV(u_i) represents the intrinsic value of document u_i, YN(u_i) represents the citation count of document u_i, XZ(u_i) represents the download count of the document, and IF(u_i) represents the impact factor of the document.
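The exact way the three indicators are combined is not reproduced above, so the following is only a hypothetical sketch of one possible combination: each indicator is scaled by the maximum observed in the collection and the three are averaged with equal weights. The scaling, the weights and all parameter names are assumptions, not the formula of the invention.

```python
def intrinsic_value(citations: int, downloads: int, impact_factor: float,
                    max_citations: int, max_downloads: int,
                    max_impact_factor: float,
                    weights: tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Hypothetical LZV: weighted sum of YN, XZ and IF, each scaled to [0, 1]."""
    def scaled(value: float, maximum: float) -> float:
        # Guard against an empty or all-zero collection.
        return value / maximum if maximum > 0 else 0.0

    w_yn, w_xz, w_if = weights
    return (w_yn * scaled(citations, max_citations)
            + w_xz * scaled(downloads, max_downloads)
            + w_if * scaled(impact_factor, max_impact_factor))


lzv = intrinsic_value(50, 400, 2.0,
                      max_citations=100, max_downloads=1000, max_impact_factor=4.0)
print(round(lzv, 4))
```

Scaling by the collection maximum keeps the three indicators comparable even though raw download counts are typically far larger than impact factors.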

Step 3.2: by analogy with the link relationship between web pages, value is also transferred through the citation relationships between documents. A cited document receives recognition from the documents that cite it, and this recognition is expressed as a transfer of value: each document transfers its own value to each of the references it cites. The value obtained by citation is then expressed as:

where TRV(u_i) is the value document u_i obtains by being cited, n represents the total number of documents that cite document u_i, KLV(u_j) represents the value of a citing document u_j, B_j represents the set of references of document u_j, and KLV(u_k) represents the value of a document u_k cited by document u_j.
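One way to read these definitions, by analogy with Pagerank, is that each citing document u_j splits its value equally among the references in B_j. The sketch below implements that reading; the equal split, the function name and the toy data are assumptions for illustration, since the formula itself is not reproduced above.

```python
def cited_value(doc: str, references: dict[str, list[str]],
                klv: dict[str, float]) -> float:
    """Sum, over the documents u_j that cite `doc`, of KLV(u_j) / |B_j|."""
    total = 0.0
    for citing, refs in references.items():
        if doc in refs:
            # The citing document shares its value equally among its references.
            total += klv[citing] / len(refs)
    return total


# Toy citation data (hypothetical): u1 cites u3; u2 cites u3 and u4.
references = {"u1": ["u3"], "u2": ["u3", "u4"], "u3": [], "u4": []}
klv = {"u1": 1.0, "u2": 2.0, "u3": 0.5, "u4": 0.5}
print(cited_value("u3", references, klv))
```

Because KLV itself depends on TRV, a full implementation would recompute both in alternation until the values stabilize, as in the Pagerank iteration.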

Step 3.3: damping function. The invention adjusts the damping factor of the Pagerank algorithm. A recently published document may have a low citation count without being of low value, so the original damping factor is combined with a time factor to obtain a new damping function, and the formula is as follows:

in the formula, d is the damping factor, generally taken as 0.85; t_i represents the publication time of the document; t_0 represents the current time; and Σ_k(t_0 - t_k) represents the sum, over all documents, of the number of days since publication. With this damping function, the weights of the intrinsic value of a document and of the value it obtains by being cited are assigned according to publication time, which reduces the unfair influence of publication time on the citation count and improves the result of the document value ranking.
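The damping formula is not reproduced above. The sketch below shows one hypothetical function built from the named ingredients (d, t_i, t_0 and the sum of days since publication): the base factor d is scaled by the age of the document relative to the mean age and capped at 1, so that recent documents lean on intrinsic value and old documents on cited value. This particular form is an assumption, not the formula of the invention.

```python
from datetime import date


def time_damping(published: date, all_published: list[date], today: date,
                 d: float = 0.85) -> float:
    """Hypothetical d(t_i): d scaled by age relative to the mean age, capped at 1."""
    ages = [(today - t).days for t in all_published]
    total_age = sum(ages)  # corresponds to sum_k (t_0 - t_k)
    if total_age == 0:
        return 0.0  # every document was published today
    mean_age = total_age / len(ages)
    return min(1.0, d * (today - published).days / mean_age)


today = date(2021, 5, 24)
dates = [date(2011, 5, 24), date(2019, 5, 24), date(2021, 4, 24)]
d_old, d_mid, d_new = (time_damping(t, dates, today) for t in dates)
print(d_old, round(d_mid, 3), round(d_new, 3))
```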

Step 3.4: the invention defines the value of a document as the weighted sum of its intrinsic value and its cited value, wherein the intrinsic value is determined by the citation count, download count and impact factor, the cited value is transferred by the documents that cite it, and the weighting between the two values is determined by publication time. The following value ranking formula for scientific and technological paper documents is therefore proposed:

KLV(u_i) = (1 - d(t_i)) * LZV(u_i) + d(t_i) * TRV(u_i)

In the formula, KLV(u_i) represents the value of document u_i under the new algorithm; LZV(u_i) represents the dissemination power of document u_i and describes the intrinsic value of the document itself; TRV(u_i) represents the value document u_i obtains by being cited; and d(t_i) represents the damping function, which determines the evaluation weights for old and new documents.
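The weighted sum above translates directly into code. In the minimal sketch below, the function name, the range check on the damping value and the sample numbers are assumptions added for illustration; the arithmetic follows the formula as stated.

```python
def document_value(lzv: float, trv: float, damping: float) -> float:
    """KLV(u_i) = (1 - d(t_i)) * LZV(u_i) + d(t_i) * TRV(u_i)."""
    if not 0.0 <= damping <= 1.0:
        # Assumption: d(t_i) acts as a weight, so it should lie in [0, 1].
        raise ValueError("damping must lie in [0, 1]")
    return (1.0 - damping) * lzv + damping * trv


# Same intrinsic and cited values, different damping (hypothetical numbers).
new_paper = document_value(lzv=0.6, trv=0.1, damping=0.2)  # mostly intrinsic value
old_paper = document_value(lzv=0.6, trv=0.1, damping=0.8)  # mostly cited value
print(round(new_paper, 2), round(old_paper, 2))
```

With a small damping value the score is dominated by intrinsic value, so a recent paper with few citations is not penalized as heavily as an old paper with the same citation record.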

Step 3.5: Elasticsearch uses the Boolean model to find matching documents and calculates relevance with a formula called the practical scoring function. The formula draws on term frequency/inverse document frequency (TF/IDF) and the vector space model, and adds features such as the coordination factor, field length normalization, and boosting of terms or query clauses. The relevance score each queried paper obtains from this formula accounts for 60% of the total, the document value score obtained in step 3.4 accounts for 40%, and the query results are displayed in order of the total score.
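The 60/40 combination can be sketched as a re-ranking step over search hits. The source does not say how the two scores are brought onto a common scale, so the sketch assumes each is divided by its maximum before weighting; the function name and the sample scores are likewise hypothetical.

```python
def combined_scores(relevance: dict[str, float], value: dict[str, float],
                    w_relevance: float = 0.6,
                    w_value: float = 0.4) -> list[tuple[str, float]]:
    """Rank documents by 60% scaled relevance plus 40% scaled document value."""
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        # Assumption: scale by the maximum so both scores lie in [0, 1].
        top = max(scores.values(), default=0.0)
        return {k: (v / top if top > 0 else 0.0) for k, v in scores.items()}

    rel, val = normalize(relevance), normalize(value)
    total = {doc: w_relevance * rel[doc] + w_value * val.get(doc, 0.0)
             for doc in rel}
    return sorted(total.items(), key=lambda item: item[1], reverse=True)


hits = {"a": 8.0, "b": 7.0}    # hypothetical Elasticsearch relevance scores
values = {"a": 0.2, "b": 0.8}  # hypothetical KLV scores
ranking = combined_scores(hits, values)
print(ranking)
```

Here document b, slightly less relevant but of much higher document value, moves ahead of document a, which is the behavior the method is designed to produce.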

The scientific and technological resource matching method based on the Pagerank algorithm starts from the academic documents among scientific and technological resources. A document serving as a query object has a document value, which can be calculated as the weighted sum of its intrinsic value and the value it obtains by being cited. The relevance formula used by Elasticsearch draws mainly on term frequency/inverse document frequency (TF/IDF) and the vector space model, and does not consider the document value of the queried object. In order to quickly and accurately find the most valuable documents that meet expectations, the invention builds an Elasticsearch document search engine, calculates document value with a new document value ranking algorithm based on the Pagerank algorithm, and adds a document value factor to the original Elasticsearch relevance rule to obtain a new relevance ranking rule. Once the document value ranking algorithm is mature, its idea can be applied to other scientific and technological resources: value ranking algorithms for intellectual property, human resources, policy consultation and the like can be obtained by modifying the details of the algorithm.

The embodiments of the present invention have been described with reference to the accompanying drawings, but the present invention is not limited to these embodiments. Various changes and modifications can be made according to the purpose of the invention. Any change, modification, substitution, combination or simplification made according to the spirit and principle of the technical solution of the present invention is an equivalent substitution and, as long as it meets the purpose of the invention and does not depart from its technical principle and inventive concept, falls within the protection scope of the present invention.

Claims

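1. A scientific and technological resource matching method based on a Pagerank algorithm, characterized by comprising the following steps:

step 1: building a basic Elasticsearch search engine using the Spring framework in the IDEA software;

step 2: crawling and parsing paper data from the resource pool using the Elasticsearch search engine, and putting the data into the Elasticsearch index library;

step 3: defining a new document value ranking algorithm based on the Pagerank ranking algorithm, and combining it with the Elasticsearch relevance calculation to obtain a new relevance score calculation rule;

step 4: querying according to the input keywords; when a Request is generated in IDEA, ranking the documents by score using the newly defined relevance scoring rule.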

2. The scientific and technological resource matching method based on the Pagerank algorithm as claimed in claim 1, wherein: visualization of the Elasticsearch index library is implemented using the elasticsearch-head-master plug-in.

3. The scientific and technological resource matching method based on the Pagerank algorithm as claimed in claim 1, wherein: in step 1, the Elasticsearch version is configured in the configuration file of the Spring project, and data search is implemented by calling the SearchRequest, SearchResponse, RestHighLevelClient and TermQueryBuilder APIs of Elasticsearch.

4. The scientific and technological resource matching method based on the Pagerank algorithm as claimed in claim 1, wherein: in step 2, a Spring project is built on the Spring project framework in IDEA; in the pom file, attention is paid to the Elasticsearch version bound to the project; Maven 3.6.3 is used, and the jar packages required by the Elasticsearch engine are imported automatically; after the project is built and permission has been obtained from the resource platform, paper resource data is crawled with a simple crawler and stored in the ES index library.

5. The scientific and technological resource matching method based on the Pagerank algorithm as claimed in claim 1, wherein: in step 3, the basic Pagerank algorithm holds that a web page is important, that is, its Pagerank value is relatively high, if many other web pages link to it; and that if a web page with a high Pagerank value links to another web page, the Pagerank value of the linked web page increases accordingly; the main calculation formula of the Pagerank algorithm is as follows:

wherein PR(p_i) represents the PR value of page p_i; C(p_j) represents the total number of outbound links of page p_j, a page that links to p_i; d is a damping factor, generally taken as 0.85, representing the probability that the user continues to click links; and 1-d is the probability that the user leaves the chain of links and opens a new page.

6. The scientific and technological resource matching method based on the Pagerank algorithm as claimed in claim 1, wherein: in step 3, in view of the characteristics of scientific and technological paper documents and in combination with the Pagerank algorithm, the value of a document is defined as the weighted sum of its intrinsic value and its cited value, wherein the intrinsic value is determined by the citation count, download count and impact factor of the document, and the cited value is transferred to it by the documents that cite it; the weighting between the two values is determined by publication time, and the following document value ranking formula is adopted:

KLV(u_i) = (1 - d(t_i)) * LZV(u_i) + d(t_i) * TRV(u_i);

7. The scientific and technological resource matching method based on the Pagerank algorithm as claimed in claim 6, wherein: in step 4, the value score KLV of each document is calculated and the relevance ranking rule of Elasticsearch is modified, so that the score from the original Elasticsearch relevance calculation accounts for 60% and the score from the value of the scientific and technological paper document accounts for 40%; the results are displayed in order of the final total score, and the specific implementation steps are as follows:

step 3.1: determining the intrinsic value of the scientific literature;

step 3.2: determining the cited value of the scientific literature;

step 3.3: determining a time damping coefficient;

step 3.4: establishing a document value sorting algorithm formula;