CN106980677A - Industry-oriented topic search method - Google Patents

Industry-oriented topic search method

Info

Publication number
CN106980677A
Authority
CN
China
Prior art keywords
webpage
queue
score
crawled
url
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710201272.1A
Other languages
Chinese (zh)
Other versions
CN106980677B (en)
Inventor
刘道桂
韦云凯
刘强
李源颢
蒲勇全
陈怡瑾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201710201272.1A
Publication of CN106980677A
Application granted
Publication of CN106980677B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an industry-oriented topic search method. The method comprises: initializing and building an initial to-be-crawled queue; determining whether the crawler's crawl time has been reached and whether the to-be-crawled queue is empty; calculating the relevance value between each web page and the topic using the Shark-Search-Advanced algorithm; calculating the link value and the page ranking score of each web page using the PageRank-Advanced algorithm; and determining whether the crawler's re-crawl interval has been reached. The invention improves the accuracy and reliability of search results and yields retrieval results with high precision and high coverage, so that a search engine can respond efficiently and accurately, with high coverage, to users' search needs for a specific industry.

Description

Industry-oriented topic search method
Technical field
The invention belongs to the technical field of information retrieval, and in particular relates to an industry-oriented topic search method.
Background art
The Internet has become the most important means by which people disseminate information and obtain content. General-purpose search engines, represented by Google, Baidu, and Bing, make it far easier for people to obtain information on the Internet quickly and accurately. However, a general-purpose search engine must build a huge search database, and its search scope is the entire web; when a user needs a vertical search within a specific industry, its precision is comparatively low and its resource cost is high. Meanwhile, vertical search engines, represented by Qunar and Sogou Shopping, build dedicated databases for particular fields; they are tightly bound to one industry, lack flexibility of application, and offer unsatisfactory recall.
An analysis of existing vertical search engine technology shows that search algorithms for a given topic generally use a content-based search method (such as Fish-Search or Shark-Search) to calculate the relevance between web pages and the topic and thereby filter out pages unrelated to the topic. They then use a search algorithm based on the link structure of the network (such as a relevance ranking algorithm or the PageRank algorithm) to rank pages by the calculated page credibility score and build an index database. This approach can build a topic database with little redundancy. However, if results are ranked by relevance, they are highly relevant to the topic but lose global perspective, and the reliability of the content cannot be guaranteed; if they are ranked by page credibility score, relevance to the topic cannot be guaranteed, which causes "topic drift".
Summary of the invention
The object of the invention is as follows: to solve the above problems in the prior art, the invention proposes an industry-oriented topic search method that obtains retrieval results with high precision and high coverage.
The technical solution of the invention is an industry-oriented topic search method comprising the following steps:
A. Initialize the seed websites seedUrls, the crawler's crawl time t_1, the topic keyword vector vector_topic, and the crawler's re-crawl interval t_2; build the initial to-be-crawled queue Url_queue from the seed websites seedUrls;
B. Determine whether the crawl time t_1 has been reached. If so, end the operation; if not, further determine whether the to-be-crawled queue Url_queue built in step A is empty. If Url_queue is empty, end the operation; if it is not empty, proceed to the next step;
C. Calculate the relevance value potential_score between each web page and the topic using the Shark-Search-Advanced algorithm;
D. Calculate the link value PR and the page ranking score rank of each web page using the PageRank-Advanced algorithm;
E. Determine whether the crawler's re-crawl interval t_2 has been reached. If so, return to step C; if not, repeat step E.
Further, step C, which calculates the relevance value potential_score between each web page and the topic using the Shark-Search-Advanced algorithm, specifically comprises the following sub-steps:
C1. Initialize the depth depth and the relevance value potential_score of each web page in the to-be-crawled queue Url_queue;
C2. Pop a web page from the head of Url_queue and set it as current_node;
C3. Determine whether the depth depth of current_node from step C2 is greater than 0. If so, proceed to the next step; if not, return to step C2;
C4. Calculate the relevance value potential_score between current_node from step C2 and the topic using the Shark-Search algorithm;
C5. Calculate the relevance value sim_curr between the page content of current_node from step C2 and the topic using the Shark-Search algorithm, and select the first N child pages of the current page;
C6. Build a network from all current web pages, and calculate the PR value of each page using the PageRank algorithm;
C7. Calculate the sim_i value and the depth depth of each child page using the Shark-Search algorithm;
C8. Calculate the joint score score_i of each web page, and from these joint scores calculate the mean score $\overline{score}$ of the pages crawled so far and the page relevance decision coefficient δ;
C9. Determine whether the joint score score_i of each web page is greater than the page relevance decision coefficient δ. If so, append the page to the tail of Url_queue; if not, delete the page from Url_queue;
C10. Dynamically maintain the to-be-crawled queue Url_queue, and return to step C2.
Further, step C5 also comprises determining whether the relevance value sim_curr between current_node and the topic is greater than 0. If so, the first α * width child pages of the current page are selected, where α is the coefficient applied to the number of child pages added to Url_queue; if not, the first width child pages of the current page are selected.
Further, the joint score score_i of each web page in step C8 is calculated as:
$score_i = \beta \cdot sim_i + (1 - \beta) \cdot PR_i$
where β is the weight of sim_i in score_i, i ∈ [1, n], and n is the total number of web pages.
Further, the mean score $\overline{score}$ of the pages crawled so far in step C8 is calculated as:
$\overline{score} = \frac{\sum_{i=1}^{n} score_i}{n}$
Further, the page relevance decision coefficient δ in step C8 is calculated as:
$\delta = \frac{n_{max}}{n_{min}} \times \overline{score}$
where n_max is the number of crawled pages whose joint score is greater than the mean score, and n_min is the number of crawled pages whose joint score is less than the mean score.
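As a worked illustration of steps C8 and C9, the sketch below computes the joint scores, the mean score, and the decision coefficient, taking the mean as the arithmetic mean of the joint scores and δ = (n_max / n_min) × mean, as stated in claims 5 and 6. The function names and the input values are illustrative, and the fallback used when n_min is zero is an assumption, since the text does not cover that case.

```python
def joint_scores(sims, prs, beta):
    """score_i = beta * sim_i + (1 - beta) * PR_i for every page i."""
    return [beta * s + (1 - beta) * p for s, p in zip(sims, prs)]


def decision_coefficient(scores):
    """Return (mean score, delta) with delta = n_max / n_min * mean."""
    mean = sum(scores) / len(scores)
    n_max = sum(1 for s in scores if s > mean)
    n_min = sum(1 for s in scores if s < mean)
    if n_min == 0:
        # Every score equals the mean; falling back to the mean is an assumption.
        return mean, mean
    return mean, n_max / n_min * mean


sims = [0.8, 0.2, 0.6, 0.4]   # illustrative sim_i values
prs = [0.4, 0.2, 0.2, 0.0]    # illustrative PR_i values
scores = joint_scores(sims, prs, beta=0.5)
mean, delta = decision_coefficient(scores)
kept = [i for i, s in enumerate(scores) if s > delta]   # step C9: keep score_i > delta
print(mean, delta, kept)
```

With these inputs the joint scores are roughly 0.6, 0.2, 0.4, and 0.2, so two pages lie above the mean of about 0.35 and two below it; δ then equals the mean, and pages 0 and 2 stay in the queue.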
Further, step D, which calculates the link value PR and the page ranking score rank of each web page using the PageRank-Advanced algorithm, specifically comprises the following sub-steps:
D1. Set an initial PR value for every crawled web page to obtain the initial PR value vector π_0;
D2. Calculate the PR values of all web pages using the PageRank-Advanced algorithm and express them as a vector;
D3. Obtain the page ranking score rank from the relevance value potential_score between each web page and the topic from step C, combined with the PR value of the page from step D2.
Further, the vector representation of the PR values in step D2 is:
$\pi_{k+1} = \pi_k G$
where π_k is the PR value vector of the web pages computed at the k-th iteration.
Further, the page ranking score rank in step D3 is expressed as:
$rank = \gamma \cdot potential\_score + (1 - \gamma) \cdot PR$
The beneficial effects of the invention are as follows. The invention first uses the SSA algorithm to crawl web pages on the Internet that are related to the specified topic, and calculates and outputs the relevance of each page to the topic. It then combines the relevance value calculated by the SSA algorithm with the link-based value calculated by the PRA algorithm to form the final ranking score of each page, and ranks the retrieval results by that score. This improves the accuracy and reliability of search results and yields retrieval results with high precision and high coverage, so that a search engine can respond efficiently and accurately, with high coverage, to users' search needs for a specific industry.
Brief description of the drawings
Fig. 1 is a flow chart of the industry-oriented topic search method of the invention.
Fig. 2 is a flow chart of the SSA algorithm in the invention.
Fig. 3 is a flow chart of the PRA algorithm in the invention.
Detailed description of the embodiments
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
Fig. 1 shows the flow chart of the industry-oriented topic search method of the invention. The method comprises the following steps:
A. Initialize the seed websites seedUrls, the crawler's crawl time t_1, the topic keyword vector vector_topic, and the crawler's re-crawl interval t_2; build the initial to-be-crawled queue Url_queue from the seed websites seedUrls;
B. Determine whether the crawl time t_1 has been reached. If so, end the operation; if not, further determine whether the to-be-crawled queue Url_queue built in step A is empty. If Url_queue is empty, end the operation; if it is not empty, proceed to the next step;
C. Calculate the relevance value potential_score between each web page and the topic using the Shark-Search-Advanced algorithm;
D. Calculate the link value PR and the page ranking score rank of each web page using the PageRank-Advanced algorithm;
E. Determine whether the crawler's re-crawl interval t_2 has been reached. If so, return to step C; if not, repeat step E.
The invention first uses the SSA algorithm to crawl web pages on the Internet that are related to the specified topic, and calculates and outputs the relevance of each page to the topic. It then combines the relevance value calculated by the SSA algorithm with the link-based value calculated by the PRA algorithm to form the final ranking score of each page, and ranks the retrieval results by that score.
In step A, the invention initializes the search environment: the seed websites seedUrls, the crawler's crawl time t_1, the topic keyword vector vector_topic, and the crawler's re-crawl interval t_2. The seed websites here are authoritative websites in the industry. The initial to-be-crawled queue Url_queue is then built from seedUrls.
In step B, the invention determines whether the crawl time t_1 has been reached and whether the to-be-crawled queue Url_queue built in step A is empty. If t_1 has not been reached and Url_queue is not empty, the method proceeds to the next step.
In step C, Fig. 2 shows the flow chart of the SSA algorithm in the invention. The invention uses the Shark-Search-Advanced algorithm to calculate the relevance value potential_score between each web page and the topic. Starting from the input seed websites seedUrls and the topic keyword vector vector_topic, it crawls and downloads industry-related websites on the Internet, and finally outputs the structured page content together with the calculated page relevance value potential_score and PR value. Step C specifically comprises the following sub-steps:
C1. Initialize the depth depth and the relevance value potential_score of each web page in the to-be-crawled queue Url_queue;
C2. Pop a web page from the head of Url_queue and set it as current_node;
C3. Determine whether the depth depth of current_node from step C2 is greater than 0. If so, proceed to the next step; if not, return to step C2;
C4. Calculate the relevance value potential_score between current_node from step C2 and the topic using the Shark-Search algorithm;
C5. Calculate the relevance value sim_curr between the page content of current_node from step C2 and the topic using the Shark-Search algorithm, and select the first N child pages of the current page;
C6. Build a network from all current web pages, and calculate the PR value of each page using the PageRank algorithm;
C7. Calculate the sim_i value and the depth depth of each child page using the Shark-Search algorithm;
C8. Calculate the joint score score_i of each web page, and from these joint scores calculate the mean score $\overline{score}$ of the pages crawled so far and the page relevance decision coefficient δ;
C9. Determine whether the joint score score_i of each web page is greater than the page relevance decision coefficient δ. If so, append the page to the tail of Url_queue; if not, delete the page from Url_queue;
C10. Dynamically maintain the to-be-crawled queue Url_queue, and return to step C2.
In step C1, the invention assigns an initial value of 0 to the depth depth and the relevance value potential_score of each web page in the to-be-crawled queue Url_queue.
In step C5, the invention also determines whether the relevance value sim_curr between current_node and the topic is greater than 0. If so, the first α * width child pages of the current page are selected, where α is a predefined constant, typically set to 1.5, that serves as the coefficient applied to the number of child pages added to Url_queue; if not, the first width child pages of the current page are selected.
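The child-page selection rule of step C5 can be illustrated with a short sketch. The function name select_children is hypothetical, the children are assumed to arrive already ordered as they should be crawled, and rounding α * width down to an integer is an assumption, since the text does not say how a fractional count is handled.

```python
import math


def select_children(children, sim_curr, width, alpha=1.5):
    """Keep more child pages when the current page is relevant to the topic."""
    if sim_curr > 0:
        limit = math.floor(alpha * width)   # relevant page: first alpha * width children
    else:
        limit = width                       # irrelevant page: first width children
    return children[:limit]


children = [f"child{i}.example" for i in range(10)]
relevant_pick = select_children(children, sim_curr=0.3, width=4)
irrelevant_pick = select_children(children, sim_curr=0.0, width=4)
print(len(relevant_pick), len(irrelevant_pick))
```

With width = 4 and the typical α = 1.5, a relevant page contributes six child pages while an irrelevant one contributes four, so the crawl widens around on-topic pages.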
In step C6, the invention builds a structured network from all web pages that have been crawled and added to Url_queue so far, and calculates the PR value of each page recursively using the PageRank algorithm.
In step C8, the joint score score_i of each web page is calculated as:
$score_i = \beta \cdot sim_i + (1 - \beta) \cdot PR_i$
where β is the weight of sim_i in score_i, i ∈ [1, n], and n is the total number of web pages.
The mean score $\overline{score}$ of the pages crawled so far is calculated as:
$\overline{score} = \frac{\sum_{i=1}^{n} score_i}{n}$
The page relevance decision coefficient δ is calculated as:
$\delta = \frac{n_{max}}{n_{min}} \times \overline{score}$
where n_max is the number of crawled pages whose joint score is greater than the mean score, and n_min is the number of crawled pages whose joint score is less than the mean score.
In step C10, the invention appends to the tail of the to-be-crawled queue Url_queue the pages whose joint score score_i is greater than the page relevance decision coefficient δ, thereby dynamically maintaining the queue. After the current page current_node has been processed, the method returns to step C2 and pops a new page from the head of Url_queue to continue the calculation.
The invention crawls topic-related web pages from the Internet, structures them, and then builds a database. Given the user's seed websites and the query keywords or phrases, pages containing the query string are regarded as topic-related; the relevance of each page to the topic is calculated, and the priority queue Url_queue of pages to be crawled is maintained dynamically. URLs highly relevant to the topic are placed at the front of the queue and crawled first; URLs with low relevance are placed at the back and crawled later. When calculating the relevance of a page to the topic, the invention considers not only the page content but also the anchor text in the page and the context around the anchor text, which makes the information more complete. Considering only the relevance of a page to the topic ignores the page's influence across the whole web, and may yield pages whose content is relevant but whose information is unreliable. The invention therefore introduces the PageRank algorithm and uses its global perspective to filter the retained topic-related pages again, thereby ensuring that the remaining pages are globally significant.
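The priority ordering described above, with highly relevant URLs crawled first, could be realized with a binary heap. The sketch below is one possible data structure under that assumption; the class name UrlPriorityQueue and its methods are hypothetical and are not taken from the patent.

```python
import heapq
import itertools


class UrlPriorityQueue:
    """Pops the URL with the highest relevance first; ties keep insertion order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker for equal relevance values

    def push(self, url, relevance):
        # heapq is a min-heap, so the relevance is negated to pop the largest first.
        heapq.heappush(self._heap, (-relevance, next(self._counter), url))

    def pop(self):
        neg_relevance, _, url = heapq.heappop(self._heap)
        return url, -neg_relevance

    def __len__(self):
        return len(self._heap)


queue = UrlPriorityQueue()
queue.push("low.example", 0.2)
queue.push("high.example", 0.9)
queue.push("mid.example", 0.5)
order = [queue.pop()[0] for _ in range(len(queue))]
print(order)
```

A heap gives logarithmic-time insertion and removal, which matters once the to-be-crawled queue grows large; a plain sorted list would serve equally well for small crawls.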
In step D, Fig. 3 shows the flow chart of the PRA algorithm in the invention. The invention uses the PageRank-Advanced algorithm to calculate the link value PR and the page ranking score rank of each web page. Using the link structure between pages and the random surfer model, it establishes a score calculation method that distributes the score of a parent site fairly and reasonably among its child sites, and combines the potential_score calculated by the SSA algorithm with the PR value obtained by this method to form a new scoring mechanism. Step D specifically comprises the following sub-steps:
D1. Set an initial PR value for every crawled web page to obtain the initial PR value vector π_0;
D2. Calculate the PR values of all web pages using the PageRank-Advanced algorithm and express them as a vector;
D3. Obtain the page ranking score rank from the relevance value potential_score between each web page and the topic from step C, combined with the PR value of the page from step D2.
In step D1, based on the network G(V, E) formed by the web pages, the invention sets an initial PR value for every crawled page: authoritative pages are assigned PR_authority and ordinary pages are assigned 1, giving the initial PR value vector π_0. The network G(V, E) is the directed graph formed by the link structure between pages, where V is the vertex set, i.e. the set of web pages, and E is the edge set, i.e. the links between pages.
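The initialization in step D1 amounts to a simple lookup, sketched below. The function name initial_pr_vector, the example page names, and the value 5.0 used for PR_authority are illustrative assumptions; the text does not specify a value for PR_authority.

```python
def initial_pr_vector(pages, authority_pages, pr_authority):
    """Authoritative pages start at pr_authority, ordinary pages start at 1."""
    return [pr_authority if page in authority_pages else 1.0 for page in pages]


pages = ["gov.example", "blog.example", "shop.example"]
pi0 = initial_pr_vector(pages, authority_pages={"gov.example"}, pr_authority=5.0)
print(pi0)
```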
In step D2, the invention expresses the PR values of all web pages as a vector:
$\pi_{k+1} = \pi_k G$
where π_k is the PR value vector of the web pages computed at the k-th iteration, M is the initial value allocation matrix built from the link structure of the pages, S is the matrix after correcting for small black holes, and G is the matrix after correcting for large black holes. A small black hole is a single page that has inbound links but no outbound links; a large black hole is a set of several pages that, as a group, has inbound links but no outbound links.
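The iteration π_{k+1} = π_k G can be sketched with plain power iteration. Because explicit formulas for M, S, and G are not given here, the sketch substitutes the textbook PageRank construction as a stand-in: rows of pages without outbound links are replaced by a uniform row (one common way to handle a single dangling page), and a damping factor with uniform teleportation keeps the surfer from being trapped in a closed group of pages. The damping value 0.85, the function names, and the example graph are assumptions, not the PageRank-Advanced definitions.

```python
def google_matrix(links, n, damping=0.85):
    """Build a row-stochastic matrix G from links = {page index: [target indices]}."""
    matrix = []
    for i in range(n):
        targets = links.get(i, [])
        if targets:
            row = [0.0] * n
            for j in targets:
                row[j] += 1.0 / len(targets)   # split the page's value over its out-links
        else:
            row = [1.0 / n] * n                # dangling page: jump anywhere
        # Teleportation lets the surfer escape closed groups of pages.
        matrix.append([damping * x + (1.0 - damping) / n for x in row])
    return matrix


def power_iterate(pi0, matrix, tol=1e-10, max_iter=1000):
    """Repeat pi_{k+1} = pi_k G until the vector stops changing."""
    total = sum(pi0)
    pi = [x / total for x in pi0]              # normalize so the values sum to 1
    n = len(pi)
    for _ in range(max_iter):
        nxt = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Pages 0..3: 0 links to 1 and 2, 1 links to 2, 2 links to 0, 3 has no out-links.
links = {0: [1, 2], 1: [2], 2: [0], 3: []}
pr = power_iterate([1.0, 1.0, 1.0, 1.0], google_matrix(links, 4))
print([round(x, 4) for x in pr])
```

In this small graph page 2 receives links from both page 0 and page 1 and ends up with the highest value, while page 3, which nothing links to, ends up with the lowest.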
In step D3, the page ranking score rank is calculated as:
$rank = \gamma \cdot potential\_score + (1 - \gamma) \cdot PR$
For a search with a clearly defined topic, setting γ to a larger value filters for results that are highly relevant to the topic. If the user's search keywords do not define a clear topic, γ can be set smaller to filter for highly authoritative pages, which also gives the user some recommendations.
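The effect of γ on the final ordering can be seen in a small example. The function name rank_pages and the two sample pages are illustrative, and the sketch assumes that potential_score and PR have already been brought onto comparable scales.

```python
def rank_pages(pages, gamma):
    """Sort pages by rank = gamma * potential_score + (1 - gamma) * PR, highest first."""
    scored = [(gamma * relevance + (1 - gamma) * pr, url) for url, relevance, pr in pages]
    return [url for _, url in sorted(scored, reverse=True)]


# (url, potential_score, PR): one on-topic page and one authoritative page.
pages = [
    ("topical.example", 0.9, 0.1),
    ("authoritative.example", 0.3, 0.8),
]
focused = rank_pages(pages, gamma=0.8)   # clear topic: relevance dominates
broad = rank_pages(pages, gamma=0.2)     # vague topic: authority dominates
print(focused[0], broad[0])
```

With γ = 0.8 the on-topic page scores 0.74 against 0.40 and is listed first; with γ = 0.2 the authoritative page scores 0.70 against 0.26 and takes the lead.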
The invention uses the PageRank-Advanced algorithm to solve the problem of ranking retrieval results by relevance. Specifically, it builds a directed page link graph based on citation analysis, calculates from it the importance of each page across the whole web, and combines this with the relevance of the page content to the topic to establish a new ranking mechanism. The importance of a page across the whole web is measured by how many times it is cited and whether it is cited by authoritative pages; that is, the importance of a page is divided equally and passed on to the pages it cites, which represents its global significance. The topic relevance of a page is calculated by the SSA algorithm and represents the locality of its content, which avoids the "topic drift" caused by using citation analysis alone. Combining these two values, which reflect the relevance and the reliability of page information respectively, establishes a new ranking mechanism that effectively improves the accuracy and reliability of search results.
The invention ranks retrieval results more reasonably, based on both relevance and page credibility score. Meanwhile, to avoid the locality of the Shark-Search algorithm, it combines the PageRank algorithm to establish an improved selective crawling strategy that downloads only topic-related pages and builds a database with little information redundancy, while also avoiding the topic drift problem of traditional algorithms.
Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principle of the invention, and that the scope of protection of the invention is not limited to these particular statements and embodiments. Those of ordinary skill in the art can, based on the technical teachings disclosed by the invention, make various other specific variations and combinations that do not depart from the essence of the invention, and these variations and combinations remain within the scope of protection of the invention.

Claims (9)

1. An industry-oriented topic search method, characterized by comprising the following steps:
A. Initialize the seed websites seedUrls, the crawler's crawl time t_1, the topic keyword vector vector_topic, and the crawler's re-crawl interval t_2; build the initial to-be-crawled queue Url_queue from the seed websites seedUrls;
B. Determine whether the crawl time t_1 has been reached. If so, end the operation; if not, further determine whether the to-be-crawled queue Url_queue built in step A is empty. If Url_queue is empty, end the operation; if it is not empty, proceed to the next step;
C. Calculate the relevance value potential_score between each web page and the topic using the Shark-Search-Advanced algorithm;
D. Calculate the link value PR and the page ranking score rank of each web page using the PageRank-Advanced algorithm;
E. Determine whether the crawler's re-crawl interval t_2 has been reached. If so, return to step C; if not, repeat step E.
2. The industry-oriented topic search method of claim 1, characterized in that step C, which calculates the relevance value potential_score between each web page and the topic using the Shark-Search-Advanced algorithm, specifically comprises the following sub-steps:
C1. Initialize the depth depth and the relevance value potential_score of each web page in the to-be-crawled queue Url_queue;
C2. Pop a web page from the head of Url_queue and set it as current_node;
C3. Determine whether the depth depth of current_node from step C2 is greater than 0. If so, proceed to the next step; if not, return to step C2;
C4. Calculate the relevance value potential_score between current_node from step C2 and the topic using the Shark-Search algorithm;
C5. Calculate the relevance value sim_curr between the page content of current_node from step C2 and the topic using the Shark-Search algorithm, and select the first N child pages of the current page;
C6. Build a network from all current web pages, and calculate the PR value of each page using the PageRank algorithm;
C7. Calculate the sim_i value and the depth depth of each child page using the Shark-Search algorithm;
C8. Calculate the joint score score_i of each web page, and from these joint scores calculate the mean score $\overline{score}$ of the pages crawled so far and the page relevance decision coefficient δ;
C9. Determine whether the joint score score_i of each web page is greater than the page relevance decision coefficient δ. If so, append the page to the tail of Url_queue; if not, delete the page from Url_queue;
C10. Dynamically maintain the to-be-crawled queue Url_queue, and return to step C2.
3. The industry-oriented topic search method of claim 2, characterized in that step C5 also comprises determining whether the relevance value sim_curr between current_node and the topic is greater than 0; if so, the first α * width child pages of the current page are selected, where α is the coefficient applied to the number of child pages added to Url_queue; if not, the first width child pages of the current page are selected.
4. The industry-oriented topic search method of claim 3, characterized in that the joint score score_i of each web page in step C8 is calculated as:
$score_i = \beta \cdot sim_i + (1 - \beta) \cdot PR_i$
where β is the weight of sim_i in score_i, i ∈ [1, n], and n is the total number of web pages.
5. The industry-oriented topic search method of claim 4, characterized in that the mean score $\overline{score}$ of the pages crawled so far in step C8 is calculated as:
$\overline{score} = \frac{\sum_{i=1}^{n} score_i}{n}$
6. The industry-oriented topic search method of claim 5, characterized in that the page relevance decision coefficient δ in step C8 is calculated as:
$\delta = \frac{n_{max}}{n_{min}} \times \overline{score}$
where n_max is the number of crawled pages whose joint score is greater than the mean score, and n_min is the number of crawled pages whose joint score is less than the mean score.
7. The industry-oriented topic search method of claim 6, characterized in that step D, which calculates the link value PR and the page ranking score rank of each web page using the PageRank-Advanced algorithm, specifically comprises the following sub-steps:
D1. Set an initial PR value for every crawled web page to obtain the initial PR value vector π_0;
D2. Calculate the PR values of all web pages using the PageRank-Advanced algorithm and express them as a vector;
D3. Obtain the page ranking score rank from the relevance value potential_score between each web page and the topic from step C, combined with the PR value of the page from step D2.
8. The industry-oriented topic search method of claim 7, characterized in that the vector representation of the PR values in step D2 is:
$\pi_{k+1} = \pi_k G$
where π_k is the PR value vector of the web pages computed at the k-th iteration.
9. The industry-oriented topic search method of claim 8, characterized in that the page ranking score rank in step D3 is expressed as:
$rank = \gamma \cdot potential\_score + (1 - \gamma) \cdot PR$
CN201710201272.1A 2017-03-30 2017-03-30 Subject searching method facing industry Expired - Fee Related CN106980677B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710201272.1A CN106980677B (en) 2017-03-30 2017-03-30 Subject searching method facing industry

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710201272.1A CN106980677B (en) 2017-03-30 2017-03-30 Subject searching method facing industry

Publications (2)

Publication Number Publication Date
CN106980677A true CN106980677A (en) 2017-07-25
CN106980677B CN106980677B (en) 2020-05-12

Family

ID=59338444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710201272.1A Expired - Fee Related CN106980677B (en) 2017-03-30 2017-03-30 Subject searching method facing industry

Country Status (1)

Country Link
CN (1) CN106980677B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107657005A (en) * 2017-09-22 2018-02-02 山东浪潮云服务信息科技有限公司 The search method and device of a kind of subject web page
CN110347896A (en) * 2019-06-12 2019-10-18 国网浙江省电力有限公司电力科学研究院 A kind of medical data crawling method and system based on PageRank algorithm
CN111223533A (en) * 2019-12-24 2020-06-02 深圳市联影医疗数据服务有限公司 Medical data retrieval method and system
CN112860667A (en) * 2021-02-20 2021-05-28 中国联合网络通信集团有限公司 Method for establishing relevance model, method for judging relevance model, and method and device for discovering site

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049542A (en) * 2012-12-27 2013-04-17 北京信息科技大学 Domain-oriented network information search method
CN103914538A (en) * 2014-04-01 2014-07-09 浙江大学 Theme capturing method based on anchor text context and link analysis
US20160253221A1 (en) * 2015-02-27 2016-09-01 Vmware, Inc. Pagerank algorithm lock analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LEI QIU: "Research on Theme Crawler Based on Shark-Search and PageRank algorithm", 《2016 4TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND INTELLIGENCE SYSTEM》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107657005A (en) * 2017-09-22 2018-02-02 山东浪潮云服务信息科技有限公司 The search method and device of a kind of subject web page
CN107657005B (en) * 2017-09-22 2020-03-20 浪潮云信息技术有限公司 Retrieval method and device for theme webpage
CN110347896A (en) * 2019-06-12 2019-10-18 国网浙江省电力有限公司电力科学研究院 A kind of medical data crawling method and system based on PageRank algorithm
CN110347896B (en) * 2019-06-12 2021-09-21 国网浙江省电力有限公司电力科学研究院 Medical data crawling method and system based on PageRank algorithm
CN111223533A (en) * 2019-12-24 2020-06-02 深圳市联影医疗数据服务有限公司 Medical data retrieval method and system
CN111223533B (en) * 2019-12-24 2024-02-13 深圳市联影医疗数据服务有限公司 Medical data retrieval method and system
CN112860667A (en) * 2021-02-20 2021-05-28 中国联合网络通信集团有限公司 Method for establishing relevance model, method for judging relevance model, and method and device for discovering site
CN112860667B (en) * 2021-02-20 2023-06-20 中国联合网络通信集团有限公司 Correlation model building method, correlation model judging method, site discovery method and site discovery device

Also Published As

Publication number Publication date
CN106980677B (en) 2020-05-12

Similar Documents

Publication Publication Date Title
US7672943B2 (en) Calculating a downloading priority for the uniform resource locator in response to the domain density score, the anchor text score, the URL string score, the category need score, and the link proximity score for targeted web crawling
US9792304B1 (en) Query by image
US8090729B2 (en) Large graph measurement
US8312035B2 (en) Search engine enhancement using mined implicit links
US9330165B2 (en) Context-aware query suggestion by mining log data
US7945565B2 (en) Method and system for generating a hyperlink-click graph
US7644069B2 (en) Search ranking method for file system and related search engine
US9922119B2 (en) Navigational ranking for focused crawling
CN104182412B (en) A kind of web page crawl method and system
US8417657B2 (en) Methods and apparatus for computing graph similarity via sequence similarity
CN106980677A (en) The subject search method of Industry-oriented
EP1653380A1 (en) Web page ranking with hierarchical considerations
JP2005327293A5 (en)
CN1437140A (en) Method and system for queuing uncalled web based on path
CN101770521A (en) Focusing relevancy ordering method for vertical search engine
JP2005327293A (en) Method and system which grade object based on relation between insides of model and relation between models
CN102662954A (en) Method for implementing topical crawler system based on learning URL string information
CN103853831A (en) Personalized searching realization method based on user interest
CN103714149B (en) Self-adaptive incremental deep web data source discovery method
CN103020123B (en) A kind of method searching for bad video website
CN104268142A (en) Meta search result ranking algorithm based on rejection strategy
CN102163234A (en) Equipment and method for error correction of query sequence based on degree of error correction association
US20240119047A1 (en) Answer facts from structured content
US7584183B2 (en) Method for node classification and scoring by combining parallel iterative scoring calculation
CN108959580A (en) A kind of optimization method and system of label data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200512