CN106980677A - Industry-oriented subject search method - Google Patents
Industry-oriented subject search method
- Publication number
- CN106980677A
- Authority
- CN
- China
- Prior art keywords
- webpage
- queue
- score
- crawled
- url
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an industry-oriented subject search method. The method comprises: initializing and building an initial to-be-crawled queue; determining whether the crawl time has been reached and whether the to-be-crawled queue is empty; computing the relevance between each web page and the topic with the Shark-Search-Advanced algorithm; computing each page's link value and page ranking score with the PageRank-Advanced algorithm; and determining whether the interval after which the crawler crawls again has been reached. The invention can effectively improve the accuracy and reliability of search results, yielding retrieval results with high precision and high coverage, so that a search engine can respond to users' industry-specific search needs efficiently, accurately, and with high coverage.
Description
Technical field
The invention belongs to the technical field of information retrieval, and in particular relates to an industry-oriented subject search method.
Background
The Internet has become the most important channel through which people disseminate information and obtain content. General-purpose search engines, represented by Google, Baidu, and Bing, make it very convenient to find information on the Internet quickly and accurately. However, a general-purpose search engine has to build a huge search database, and its search content covers the whole web; when a user needs a vertical search within a specific industry, its precision is relatively low and its resource cost is high. Meanwhile, vertical search engines, represented by Qunar and Sogou Shopping, build their own databases for a particular field; they are tightly constrained by their industry, lack flexibility in application, and their recall is not satisfactory.
Analysis of existing vertical search engine technology shows that search algorithms for a given topic generally first use content-based search (for example Fish-Search or Shark-Search) to compute the relevance between a web page and the topic and filter out pages unrelated to the topic. They then use search algorithms based on the link structure of the network (for example relevance ranking algorithms or PageRank) to compute a credibility score for each page, and rank the pages by that score to build an index. This approach can build a topic database with little redundancy. However, if results are ranked by relevance, they are highly relevant to the topic but lose global perspective, and the reliability of the content cannot be guaranteed; if results are ranked by page credibility score, their relevance to the topic cannot be guaranteed, which causes "topic drift".
Summary of the invention
The object of the invention is as follows: to solve the above problems in the prior art, the invention proposes an industry-oriented subject search method that effectively obtains retrieval results with high precision and high coverage.
The technical solution of the invention is an industry-oriented subject search method comprising the following steps:
A. Initialize the seed sites seedUrls, the crawl time t1, the topic keyword vector vector_topic, and the interval t2 after which the crawler crawls again; build the initial to-be-crawled queue Url_queue from the seed sites seedUrls.
B. Determine whether the crawl time t1 has been reached. If so, end the operation; otherwise, determine whether the to-be-crawled queue Url_queue built in step A is empty. If Url_queue is empty, end the operation; if it is not empty, proceed to the next step.
C. Use the Shark-Search-Advanced algorithm to compute the relevance potential_score between each web page and the topic.
D. Use the PageRank-Advanced algorithm to compute the link value PR and the page ranking score rank of each web page.
E. Determine whether the re-crawl interval t2 has been reached. If so, return to step C; otherwise, repeat step E.
Further, computing the relevance potential_score between each web page and the topic with the Shark-Search-Advanced algorithm in step C comprises the following sub-steps:
C1. Initialize the depth depth and the relevance potential_score of each web page in the to-be-crawled queue Url_queue.
C2. Pop one web page from the head of Url_queue and set it as current_node.
C3. Determine whether the depth depth of the current_node from step C2 is greater than 0. If so, proceed to the next step; otherwise, return to step C2.
C4. Use the Shark-Search algorithm to compute the relevance potential_score between the current_node from step C2 and the topic.
C5. Use the Shark-Search algorithm to compute the relevance sim_curr between the content of the current_node page from step C2 and the topic, and select the top N child pages of the current page.
C6. Build a network from all current web pages, and compute the PR value of each page with the PageRank algorithm.
C7. Use the Shark-Search algorithm to compute the sim_i value and the depth depth of each child page.
C8. Compute the joint score score_i of each web page; then, from the joint scores score_i, compute the mean score of the currently crawled pages and the page relevance decision coefficient δ.
C9. Determine whether the joint score score_i of each web page is greater than the page relevance decision coefficient δ. If so, append the page to the tail of Url_queue; otherwise, delete the page from Url_queue.
C10. Dynamically maintain the to-be-crawled queue Url_queue, and return to step C2.
Further, step C5 also includes determining whether the relevance sim_curr between current_node and the topic is greater than 0. If so, the first α * width child pages of the current page are selected, where α is the coefficient applied to the number of child pages added to Url_queue; otherwise, the first width child pages of the current page are selected.
Further, the formula for the joint score score_i of each web page in step C8 is:
$$\mathrm{score}_i = \beta \cdot \mathrm{sim}_i + (1-\beta) \cdot \mathrm{PR}_i$$
where β is the weight of sim_i in score_i, i ∈ [1, n], and n is the total number of web pages.
Further, the formula for the mean score of the currently crawled pages in step C8 is:
[formula not reproduced in the source text]
Further, the formula for the page relevance decision coefficient δ in step C8 is:
[formula not reproduced in the source text]
where n_max is the number of currently crawled pages whose joint score is greater than the mean score, and n_min is the number of currently crawled pages whose joint score is less than the mean score.
Further, computing the link value PR and the page ranking score rank with the PageRank-Advanced algorithm in step D comprises the following sub-steps:
D1. Set an initial PR value for every crawled web page, giving the initial PR vector π_0.
D2. Use the PageRank-Advanced algorithm to compute the PR values of all web pages, and express them in vector form.
D3. Obtain the page ranking score rank from the relevance potential_score between each page and the topic from step C, combined with the page's PR value from step D2.
Further, the vector form of the page PR values in step D2 is:
$$\pi_{k+1} = \pi_k G$$
where π_k is the PR vector of the web pages at the k-th iteration. [The formulas defining G are not reproduced in the source text.]
Further, the page ranking score rank in step D3 is expressed as:
$$\mathrm{rank} = \gamma \cdot \mathrm{potential\_score} + (1-\gamma) \cdot \mathrm{PR}$$
The beneficial effects of the invention are as follows. The invention first uses the SSA (Shark-Search-Advanced) algorithm to crawl web pages on the Internet that are related to the specified topic, and computes and outputs the relevance between each page and the topic. It then combines the relevance computed by the SSA algorithm with the link-based value computed by the PRA (PageRank-Advanced) algorithm into the final ranking score of each page, and sorts the retrieval results by that score. This effectively improves the accuracy and reliability of search results, yielding retrieval results with high precision and high coverage, so that a search engine can respond to users' industry-specific search needs efficiently, accurately, and with high coverage.
Brief description of the drawings
Fig. 1 is a flowchart of the industry-oriented subject search method of the invention.
Fig. 2 is a flowchart of the SSA algorithm in the invention.
Fig. 3 is a flowchart of the PRA algorithm in the invention.
Detailed description
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
Fig. 1 shows the flowchart of the industry-oriented subject search method of the invention. The method comprises the following steps:
A. Initialize the seed sites seedUrls, the crawl time t1, the topic keyword vector vector_topic, and the interval t2 after which the crawler crawls again; build the initial to-be-crawled queue Url_queue from the seed sites seedUrls.
B. Determine whether the crawl time t1 has been reached. If so, end the operation; otherwise, determine whether the to-be-crawled queue Url_queue built in step A is empty. If Url_queue is empty, end the operation; if it is not empty, proceed to the next step.
C. Use the Shark-Search-Advanced algorithm to compute the relevance potential_score between each web page and the topic.
D. Use the PageRank-Advanced algorithm to compute the link value PR and the page ranking score rank of each web page.
E. Determine whether the re-crawl interval t2 has been reached. If so, return to step C; otherwise, repeat step E.
The invention first uses the SSA algorithm to crawl web pages on the Internet that are related to the specified topic, and computes and outputs the relevance between each page and the topic. It then combines the relevance computed by the SSA algorithm with the link-based value computed by the PRA algorithm into the final ranking score of each page, and sorts the retrieval results by that score.
In step A, the invention initializes the search environment, that is, the seed sites seedUrls, the crawl time t1, the topic keyword vector vector_topic, and the re-crawl interval t2. The seed sites here are authoritative sites within the industry. The initial to-be-crawled queue Url_queue is then built from the seed sites seedUrls.
In step B, the invention checks both whether the crawl time t1 has been reached and whether the to-be-crawled queue Url_queue built in step A is empty. The method proceeds to the next step only when t1 has not been reached and Url_queue is not empty.
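The two stop conditions of steps A and B can be sketched as a simple loop. This is a minimal illustration, not the patented implementation: `run_crawl`, `process_page`, and the `clock` parameter are hypothetical names, and the per-page work of steps C and D is reduced to a callback that returns the URLs to enqueue.

```python
import time
from collections import deque


def run_crawl(seed_urls, t1_seconds, process_page, clock=time.monotonic):
    """Build Url_queue from the seeds (step A), then process pages until the
    crawl-time budget t1 is used up or the queue is empty (step B)."""
    url_queue = deque(seed_urls)              # step A: initial to-be-crawled queue
    deadline = clock() + t1_seconds
    visited = []
    while clock() < deadline and url_queue:   # step B: both stop conditions
        url = url_queue.popleft()             # take the page at the queue head
        visited.append(url)
        # process_page is a hypothetical stand-in for steps C and D; it
        # returns the URLs that should be appended to the queue tail.
        url_queue.extend(process_page(url))
    return visited
```

A real crawler would also deduplicate URLs and handle the re-crawl interval t2 of step E, which this sketch leaves out.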
In step C, Fig. 2 shows the flowchart of the SSA algorithm. The invention uses the Shark-Search-Advanced algorithm to compute the relevance potential_score between each web page and the topic. Starting from the input seed sites seedUrls and the topic keyword vector vector_topic, it crawls and downloads industry-related sites from the Internet, and finally outputs the structured page content together with each page's relevance potential_score and computed PR value. It comprises the following sub-steps:
C1. Initialize the depth depth and the relevance potential_score of each web page in the to-be-crawled queue Url_queue.
C2. Pop one web page from the head of Url_queue and set it as current_node.
C3. Determine whether the depth depth of the current_node from step C2 is greater than 0. If so, proceed to the next step; otherwise, return to step C2.
C4. Use the Shark-Search algorithm to compute the relevance potential_score between the current_node from step C2 and the topic.
C5. Use the Shark-Search algorithm to compute the relevance sim_curr between the content of the current_node page from step C2 and the topic, and select the top N child pages of the current page.
C6. Build a network from all current web pages, and compute the PR value of each page with the PageRank algorithm.
C7. Use the Shark-Search algorithm to compute the sim_i value and the depth depth of each child page.
C8. Compute the joint score score_i of each web page; then, from the joint scores score_i, compute the mean score of the currently crawled pages and the page relevance decision coefficient δ.
C9. Determine whether the joint score score_i of each web page is greater than the page relevance decision coefficient δ. If so, append the page to the tail of Url_queue; otherwise, delete the page from Url_queue.
C10. Dynamically maintain the to-be-crawled queue Url_queue, and return to step C2.
In step C1, the invention assigns an initial value of 0 to both the depth depth and the relevance potential_score of each web page in the to-be-crawled queue Url_queue.
In step C5, the invention also determines whether the relevance sim_curr between current_node and the topic is greater than 0. If so, the first α * width child pages of the current page are selected, where α is a predefined constant, typically set to 1.5, that scales the number of child pages added to Url_queue; otherwise, the first width child pages of the current page are selected.
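The selection rule of step C5 can be written as a short function. This is a sketch under two assumptions that the text does not state: the child URLs are already ordered so that "the first" ones are the preferred ones, and a fractional α * width is truncated to an integer. The name `select_children` is hypothetical.

```python
def select_children(child_urls, sim_curr, width, alpha=1.5):
    """Return the child pages to consider for Url_queue (step C5).

    A page that is relevant to the topic (sim_curr > 0) may contribute
    alpha * width children; any other page contributes at most width.
    """
    # Assumption: a fractional limit is truncated, e.g. 1.5 * 3 -> 4.
    limit = int(alpha * width) if sim_curr > 0 else width
    return child_urls[:limit]
```

With width = 4 and the typical α = 1.5, a relevant page passes on up to 6 children and an irrelevant page up to 4.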
In step C6, the invention builds a structured network from all web pages that have been crawled and added to Url_queue so far, and computes the PR value of each page recursively with the PageRank algorithm.
In step C8, the invention computes the joint score score_i of each web page as:
$$\mathrm{score}_i = \beta \cdot \mathrm{sim}_i + (1-\beta) \cdot \mathrm{PR}_i$$
where β is the weight of sim_i in score_i, i ∈ [1, n], and n is the total number of web pages.
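The joint score is a convex combination of content relevance and link value. The function below is a minimal sketch of that formula; the name `joint_score` and the range check on β are assumptions added for illustration.

```python
def joint_score(sim, pr, beta):
    """score_i = beta * sim_i + (1 - beta) * PR_i (step C8)."""
    if not 0.0 <= beta <= 1.0:
        # Assumption: beta is a weight in [0, 1]; the text only calls it
        # the proportion of sim_i in score_i.
        raise ValueError("beta must lie in [0, 1]")
    return beta * sim + (1.0 - beta) * pr
```

Setting β = 1 ranks purely by content relevance and β = 0 purely by PR. Because sim_i and PR_i may be on different scales, a practical system would probably normalize both before mixing them; the text does not say whether it does.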
The formula for the mean score of the currently crawled pages is:
[formula not reproduced in the source text]
The formula for the page relevance decision coefficient δ is:
[formula not reproduced in the source text]
where n_max is the number of currently crawled pages whose joint score is greater than the mean score, and n_min is the number of currently crawled pages whose joint score is less than the mean score.
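The quantities that feed δ can be computed directly from the joint scores. The sketch below assumes the mean score is the ordinary arithmetic mean and stops at n_max and n_min, because the formula that turns them into δ does not survive in this text. The name `score_statistics` is hypothetical.

```python
def score_statistics(scores):
    """Return (mean, n_max, n_min) for the joint scores of the crawled pages.

    n_max counts the scores above the mean and n_min the scores below it;
    scores exactly equal to the mean are counted in neither group.
    """
    if not scores:
        raise ValueError("at least one crawled page is required")
    mean = sum(scores) / len(scores)   # assumption: arithmetic mean
    n_max = sum(1 for s in scores if s > mean)
    n_min = sum(1 for s in scores if s < mean)
    return mean, n_max, n_min
```

For scores 1.0, 2.0, and 6.0 the mean is 3.0, so n_max = 1 and n_min = 2.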
In step C10, the invention dynamically maintains the to-be-crawled queue Url_queue by appending to its tail the pages whose joint score score_i is greater than the page relevance decision coefficient δ. Once the current page current_node has been processed, the method returns to step C2 and pops a new page from the head of Url_queue to continue the computation.
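Steps C9 and C10 amount to filtering candidate pages against δ and updating the queue. The sketch below takes δ as an input, since its formula is not available here; `update_queue` and the `child_scores` mapping are hypothetical names, and skipping URLs that are already queued is an added assumption.

```python
from collections import deque


def update_queue(url_queue, child_scores, delta):
    """Apply the step C9 test to each candidate page.

    child_scores maps URL -> joint score score_i. Pages scoring above delta
    are appended to the queue tail; the others are removed if present.
    """
    for url, score in child_scores.items():
        if score > delta:
            if url not in url_queue:    # assumption: avoid duplicate entries
                url_queue.append(url)
        elif url in url_queue:
            url_queue.remove(url)
    return url_queue
```

The membership test on a deque is linear in the queue length, which is fine for a sketch; a production crawler would track queued URLs in a set.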
The invention crawls topic-related web pages from the Internet, applies structured processing to them, and then builds a database. Based on the user's seed sites and the query keywords or phrases, a page that contains the query string is regarded as related to the topic; the relevance between the page and the topic is computed, and the priority queue URL_queue of pages to be crawled is maintained dynamically. URLs with high topic relevance are placed at the front of the queue and crawled first; URLs with low relevance are placed at the back of the queue and crawled later. When computing the relevance between a page and the topic, the invention considers not only the page content but also the anchor text and the context around the anchor text, which makes the information more complete. If only the relevance between a page and the topic were considered, the page's influence across the whole web would be ignored, which could produce pages whose content is relevant but whose information is itself unreliable. The invention therefore brings in the PageRank algorithm and uses its global perspective to filter the retained topic-related pages once more, ensuring that the remaining pages are globally important.
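One common way to keep highly relevant URLs at the front of a crawl frontier is a binary heap. The sketch below uses Python's `heapq`; the `Frontier` class is hypothetical and only illustrates the ordering described above, not the queue maintenance of steps C9 and C10.

```python
import heapq
import itertools


class Frontier:
    """Priority queue of URLs, most topic-relevant first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker: insertion order

    def push(self, url, relevance):
        # heapq is a min-heap, so the relevance is negated to pop the
        # highest-relevance URL first.
        heapq.heappush(self._heap, (-relevance, next(self._counter), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)
```

Pushing URLs with relevance 0.2, 0.9, and 0.5 and then popping three times returns them in the order 0.9, 0.5, 0.2.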
In step D, Fig. 3 shows the flowchart of the PRA algorithm. The invention uses the PageRank-Advanced algorithm to compute the link value PR and the page ranking score rank of each web page. It uses the link structure between pages and the random-surfer model to define the score computation, distributing each parent site's score fairly and reasonably among its child sites, and it combines the potential_score computed by the SSA algorithm with the PR value obtained by this method into a new scoring mechanism. It comprises the following sub-steps:
D1. Set an initial PR value for every crawled web page, giving the initial PR vector π_0.
D2. Use the PageRank-Advanced algorithm to compute the PR values of all web pages, and express them in vector form.
D3. Obtain the page ranking score rank from the relevance potential_score between each page and the topic from step C, combined with the page's PR value from step D2.
In step D1, based on the network G(V, E) formed by the web pages, the invention sets an initial PR value for every crawled page: authoritative pages are assigned PR_authority and ordinary pages are assigned 1, giving the initial PR vector π_0. The network G(V, E) is the directed graph formed by the link structure between pages, where V is the vertex set, that is, the set of pages, and E is the edge set, that is, the links between pages.
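The initial vector of step D1 can be built in one pass over the page list. In this sketch, `initial_pr_vector` and its parameters are hypothetical names; how pages are recognized as authoritative is not specified in the text, so the set is simply passed in.

```python
def initial_pr_vector(pages, authoritative, pr_authority):
    """Return pi_0: pr_authority for authoritative pages, 1 for the rest."""
    return [pr_authority if page in authoritative else 1.0 for page in pages]
```

For example, with pages `["gov", "blog", "shop"]`, authoritative set `{"gov"}`, and PR_authority = 5.0, the vector is `[5.0, 1.0, 1.0]`.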
In step D2, the invention expresses the PR values of all web pages in vector form:
$$\pi_{k+1} = \pi_k G$$
where π_k is the PR vector of the web pages at the k-th iteration. [The formulas defining M, S, and G are not reproduced in the source text.] M is the initial value-distribution matrix built from the link structure of the pages, S is the matrix after correcting for small black holes, and G is the matrix after correcting for large black holes. A small black hole is a single page that has in-links but no out-links; a large black hole is a set of several pages that, taken together, has in-links but no out-links.
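The iteration π_{k+1} = π_k G can be sketched with the standard Google-matrix construction. This is an assumption, since the patent's own definitions of M, S, and G are not available here: pages without out-links (small black holes) are given a uniform row, and a damping factor mixes in uniform jumps so that closed groups of pages (large black holes) cannot absorb all the rank. The names `google_matrix` and `pagerank` and the damping value 0.85 are illustrative.

```python
def google_matrix(out_links, n, damping=0.85):
    """Build a row-stochastic matrix G for n pages.

    out_links maps a page index to the list of page indices it links to.
    """
    matrix = []
    for i in range(n):
        targets = out_links.get(i, [])
        if targets:
            row = [0.0] * n
            for j in targets:
                row[j] += 1.0 / len(targets)   # split the score among out-links
        else:
            row = [1.0 / n] * n                # page with no out-links
        # Mix in uniform jumps so no closed group of pages traps the rank.
        matrix.append([damping * x + (1.0 - damping) / n for x in row])
    return matrix


def pagerank(matrix, pi0, iterations=100):
    """Iterate pi_{k+1} = pi_k G, starting from the normalized pi0."""
    n = len(pi0)
    total = sum(pi0)
    pi = [x / total for x in pi0]
    for _ in range(iterations):
        pi = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return pi
```

Because every row of the matrix sums to 1, the PR vector keeps summing to 1 throughout the iteration. A fixed iteration count is used for brevity; a convergence test on the change in π would be the usual refinement.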
In step D3, the invention computes the page ranking score rank as:
$$\mathrm{rank} = \gamma \cdot \mathrm{potential\_score} + (1-\gamma) \cdot \mathrm{PR}$$
For a search with a clearly defined topic, setting γ to a larger value selects results that are highly relevant to the topic. If the user's search keywords do not define a clear topic, γ can be set to a smaller value to select highly authoritative pages, which also gives the user some recommendations.
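The effect of γ on the final ordering can be seen with two pages. The sketch below applies the rank formula and sorts in descending order; `rank_pages` and the example scores are hypothetical.

```python
def rank_pages(pages, gamma):
    """Sort URLs by rank = gamma * potential_score + (1 - gamma) * PR.

    pages maps URL -> (potential_score, pr). Returns URLs, best first.
    """
    scored = {
        url: gamma * potential_score + (1.0 - gamma) * pr
        for url, (potential_score, pr) in pages.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


pages = {"topical": (0.9, 0.1), "authoritative": (0.2, 0.8)}
focused = rank_pages(pages, gamma=0.8)      # topic relevance dominates
exploratory = rank_pages(pages, gamma=0.2)  # link authority dominates
```

With γ = 0.8 the topical page comes first; with γ = 0.2 the authoritative page does.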
The invention uses the PageRank-Advanced algorithm to solve the problem of ranking retrieval results by relevance. Specifically, it builds a directed page-link graph by link analysis, computes from that graph each page's importance across the whole web, and combines this importance with the relevance between page content and the topic to establish a new ranking mechanism. A page's importance across the whole web is measured by how many times it is cited and by whether it is cited by authoritative pages; that is, a page's importance is divided equally and passed on to the pages it cites, which represents its global standing. The topic relevance of a page is computed by the SSA algorithm and represents the locality of its content, which avoids the "topic drift" caused by relying on link analysis alone. Combining these two values, which reflect the relevance and the reliability of page information, into a new ranking mechanism effectively improves the accuracy and reliability of search results.
The invention ranks retrieval results more reasonably, based on both relevance and page credibility score. In addition, to avoid the locality of the Shark-Search algorithm, it combines the PageRank algorithm to establish an improved selective crawling strategy that downloads only topic-related pages. This builds a database with little information redundancy while also avoiding the topic drift problem of traditional algorithms.
Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principle of the invention, and that the scope of protection of the invention is not limited to these specific statements and embodiments. Based on the technical teaching disclosed by the invention, those of ordinary skill in the art can make various other specific variations and combinations that do not depart from the essence of the invention, and these variations and combinations remain within the scope of protection of the invention.
Claims (9)
1. An industry-oriented subject search method, characterised in that it comprises the following steps:
A. Initialize the seed sites seedUrls, the crawl time t1, the topic keyword vector vector_topic, and the interval t2 after which the crawler crawls again; build the initial to-be-crawled queue Url_queue from the seed sites seedUrls.
B. Determine whether the crawl time t1 has been reached. If so, end the operation; otherwise, determine whether the to-be-crawled queue Url_queue built in step A is empty. If Url_queue is empty, end the operation; if it is not empty, proceed to the next step.
C. Use the Shark-Search-Advanced algorithm to compute the relevance potential_score between each web page and the topic.
D. Use the PageRank-Advanced algorithm to compute the link value PR and the page ranking score rank of each web page.
E. Determine whether the re-crawl interval t2 has been reached. If so, return to step C; otherwise, repeat step E.
2. The industry-oriented subject search method of claim 1, characterised in that computing the relevance potential_score between each web page and the topic with the Shark-Search-Advanced algorithm in step C comprises the following sub-steps:
C1. Initialize the depth depth and the relevance potential_score of each web page in the to-be-crawled queue Url_queue.
C2. Pop one web page from the head of Url_queue and set it as current_node.
C3. Determine whether the depth depth of the current_node from step C2 is greater than 0. If so, proceed to the next step; otherwise, return to step C2.
C4. Use the Shark-Search algorithm to compute the relevance potential_score between the current_node from step C2 and the topic.
C5. Use the Shark-Search algorithm to compute the relevance sim_curr between the content of the current_node page from step C2 and the topic, and select the top N child pages of the current page.
C6. Build a network from all current web pages, and compute the PR value of each page with the PageRank algorithm.
C7. Use the Shark-Search algorithm to compute the sim_i value and the depth depth of each child page.
C8. Compute the joint score score_i of each web page; then, from the joint scores score_i, compute the mean score of the currently crawled pages and the page relevance decision coefficient δ.
C9. Determine whether the joint score score_i of each web page is greater than the page relevance decision coefficient δ. If so, append the page to the tail of Url_queue; otherwise, delete the page from Url_queue.
C10. Dynamically maintain the to-be-crawled queue Url_queue, and return to step C2.
3. The industry-oriented subject search method of claim 2, characterised in that step C5 also includes determining whether the relevance sim_curr between current_node and the topic is greater than 0; if so, the first α * width child pages of the current page are selected, where α is the coefficient applied to the number of child pages added to Url_queue; otherwise, the first width child pages of the current page are selected.
4. The industry-oriented subject search method of claim 3, characterised in that the formula for the joint score score_i of each web page in step C8 is:
$$\mathrm{score}_i = \beta \cdot \mathrm{sim}_i + (1-\beta) \cdot \mathrm{PR}_i$$
where β is the weight of sim_i in score_i, i ∈ [1, n], and n is the total number of web pages.
5. The industry-oriented subject search method of claim 4, characterised in that the formula for the mean score of the currently crawled pages in step C8 is:
[formula not reproduced in the source text]
6. The industry-oriented subject search method of claim 5, characterised in that the formula for the page relevance decision coefficient δ in step C8 is:
[formula not reproduced in the source text]
where n_max is the number of currently crawled pages whose joint score is greater than the mean score, and n_min is the number of currently crawled pages whose joint score is less than the mean score.
7. The industry-oriented subject search method of claim 6, characterised in that computing the link value PR and the page ranking score rank with the PageRank-Advanced algorithm in step D comprises the following sub-steps:
D1. Set an initial PR value for every crawled web page, giving the initial PR vector π_0.
D2. Use the PageRank-Advanced algorithm to compute the PR values of all web pages, and express them in vector form.
D3. Obtain the page ranking score rank from the relevance potential_score between each page and the topic from step C, combined with the page's PR value from step D2.
8. The industry-oriented subject search method of claim 7, characterised in that the vector form of the page PR values in step D2 is:
$$\pi_{k+1} = \pi_k G$$
where π_k is the PR vector of the web pages at the k-th iteration. [The formulas defining G are not reproduced in the source text.]
9. The industry-oriented subject search method of claim 8, characterised in that the page ranking score rank in step D3 is expressed as:
$$\mathrm{rank} = \gamma \cdot \mathrm{potential\_score} + (1-\gamma) \cdot \mathrm{PR}$$
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710201272.1A CN106980677B (en) | 2017-03-30 | 2017-03-30 | Subject searching method facing industry |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710201272.1A CN106980677B (en) | 2017-03-30 | 2017-03-30 | Subject searching method facing industry |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106980677A true CN106980677A (en) | 2017-07-25 |
CN106980677B CN106980677B (en) | 2020-05-12 |
Family
ID=59338444
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710201272.1A Expired - Fee Related CN106980677B (en) | 2017-03-30 | 2017-03-30 | Subject searching method facing industry |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106980677B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107657005A (en) * | 2017-09-22 | 2018-02-02 | 山东浪潮云服务信息科技有限公司 | The search method and device of a kind of subject web page |
CN110347896A (en) * | 2019-06-12 | 2019-10-18 | 国网浙江省电力有限公司电力科学研究院 | A kind of medical data crawling method and system based on PageRank algorithm |
CN111223533A (en) * | 2019-12-24 | 2020-06-02 | 深圳市联影医疗数据服务有限公司 | Medical data retrieval method and system |
CN112860667A (en) * | 2021-02-20 | 2021-05-28 | 中国联合网络通信集团有限公司 | Method for establishing relevance model, method for judging relevance model, and method and device for discovering site |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103049542A (en) * | 2012-12-27 | 2013-04-17 | 北京信息科技大学 | Domain-oriented network information search method |
CN103914538A (en) * | 2014-04-01 | 2014-07-09 | 浙江大学 | Theme capturing method based on anchor text context and link analysis |
US20160253221A1 (en) * | 2015-02-27 | 2016-09-01 | Vmware, Inc. | Pagerank algorithm lock analysis |
Non-Patent Citations (1)
Title |
---|
LEI QIU: "Research on Theme Crawler Based on Shark-Search and PageRank algorithm", 《2016 4TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND INTELLIGENCE SYSTEM》 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107657005A (en) * | 2017-09-22 | 2018-02-02 | 山东浪潮云服务信息科技有限公司 | Retrieval method and device for theme webpage |
CN107657005B (en) * | 2017-09-22 | 2020-03-20 | 浪潮云信息技术有限公司 | Retrieval method and device for theme webpage |
CN110347896A (en) * | 2019-06-12 | 2019-10-18 | 国网浙江省电力有限公司电力科学研究院 | Medical data crawling method and system based on PageRank algorithm |
CN110347896B (en) * | 2019-06-12 | 2021-09-21 | 国网浙江省电力有限公司电力科学研究院 | Medical data crawling method and system based on PageRank algorithm |
CN111223533A (en) * | 2019-12-24 | 2020-06-02 | 深圳市联影医疗数据服务有限公司 | Medical data retrieval method and system |
CN111223533B (en) * | 2019-12-24 | 2024-02-13 | 深圳市联影医疗数据服务有限公司 | Medical data retrieval method and system |
CN112860667A (en) * | 2021-02-20 | 2021-05-28 | 中国联合网络通信集团有限公司 | Method for establishing relevance model, method for judging relevance model, and method and device for discovering site |
CN112860667B (en) * | 2021-02-20 | 2023-06-20 | 中国联合网络通信集团有限公司 | Correlation model building method, correlation model judging method, site discovery method and site discovery device |
Also Published As
Publication number | Publication date |
---|---|
CN106980677B (en) | 2020-05-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7672943B2 (en) | | Calculating a downloading priority for the uniform resource locator in response to the domain density score, the anchor text score, the URL string score, the category need score, and the link proximity score for targeted web crawling |
US9792304B1 (en) | | Query by image |
US8090729B2 (en) | | Large graph measurement |
US8312035B2 (en) | | Search engine enhancement using mined implicit links |
US9330165B2 (en) | | Context-aware query suggestion by mining log data |
US7945565B2 (en) | | Method and system for generating a hyperlink-click graph |
US7644069B2 (en) | | Search ranking method for file system and related search engine |
US9922119B2 (en) | | Navigational ranking for focused crawling |
CN104182412B (en) | | Web page crawling method and system |
US8417657B2 (en) | | Methods and apparatus for computing graph similarity via sequence similarity |
CN106980677A (en) | | Subject searching method facing industry |
EP1653380A1 (en) | | Web page ranking with hierarchical considerations |
JP2005327293A5 (en) | | |
CN1437140A (en) | | Method and system for queuing uncalled web based on path |
CN101770521A (en) | | Focusing relevancy ordering method for vertical search engine |
JP2005327293A (en) | | Method and system which grade object based on relation between insides of model and relation between models |
CN102662954A (en) | | Method for implementing topical crawler system based on learning URL string information |
CN103853831A (en) | | Personalized searching realization method based on user interest |
CN103714149B (en) | | Self-adaptive incremental deep web data source discovery method |
CN103020123B (en) | | Method for searching for bad video websites |
CN104268142A (en) | | Meta search result ranking algorithm based on rejection strategy |
CN102163234A (en) | | Equipment and method for error correction of query sequence based on degree of error correction association |
US20240119047A1 (en) | | Answer facts from structured content |
US7584183B2 (en) | | Method for node classification and scoring by combining parallel iterative scoring calculation |
CN108959580A (en) | | Optimization method and system for label data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20200512 |