CN106980677A - Industry-oriented topic search method

Info

Publication number: CN106980677A
Application number: CN201710201272.1A
Authority: CN
Inventors: 刘道桂; 韦云凯; 刘强; 李源颢; 蒲勇全; 陈怡瑾
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2017-03-30
Filing date: 2017-03-30
Publication date: 2017-07-25
Anticipated expiration: 2037-03-30
Also published as: CN106980677B

Abstract

The invention discloses an industry-oriented topic search method. The method comprises: initializing and building an initial to-be-crawled queue; judging whether the crawler's crawl time has been reached and whether the to-be-crawled queue is empty; computing the relevance value between each web page and the topic using the Shark-Search-Advanced algorithm; computing the link value and the page ranking score of each web page using the PageRank-Advanced algorithm; and judging whether the crawler's re-crawl interval has been reached. The invention can effectively improve the accuracy and reliability of search results, thereby obtaining retrieval results with high precision and high coverage and ensuring that the search engine responds to users' industry-specific search needs with high efficiency, high precision, and high coverage.

Description

Industry-oriented topic search method

Technical field

The invention belongs to the technical field of information retrieval, and in particular relates to an industry-oriented topic search method.

Background art

The Internet has become the most important channel through which people disseminate information and obtain content. General-purpose search engines, represented by Google, Baidu, and Bing, make it very convenient for people to obtain information on the Internet quickly and accurately. However, a general-purpose search engine has to build a huge search database, and its search content targets the entire web; when a user needs a vertical search for a specific industry, its precision is relatively low and its resource cost is high. Meanwhile, vertical search engines, represented by Qunar and Sogou Shopping, build their own databases for a particular field; they are heavily constrained by their industry, lack application flexibility, and leave much to be desired in terms of recall.

An analysis of existing vertical search engine technology shows that search algorithms for a given topic usually first apply a content-based search method (such as Fish-Search or Shark-Search) to compute the relevance of each web page to the topic and filter out unrelated pages. They then apply a search algorithm based on the network link structure (such as a relevance ranking algorithm or the PageRank algorithm) and sort pages by the computed credibility score to build the index. This approach can build a topic database with low redundancy. However, if results are sorted by relevance, they are highly relevant to the topic but lose globality, and the reliability of the content cannot be guaranteed; if results are sorted by page credibility score, their relevance to the topic cannot be guaranteed, which causes "topic drift".

Summary of the invention

The object of the invention is as follows: to solve the above problems in the prior art, the invention proposes an industry-oriented topic search method that effectively obtains retrieval results with high precision and high coverage.

The technical solution of the invention is an industry-oriented topic search method comprising the following steps:

A. Initialize the crawl seed sites seedUrls, the crawler's crawl time t₁, the topic keyword vector vector_topic, and the crawler's re-crawl interval t₂; build the initial to-be-crawled queue Url_queue from the seed sites seedUrls;

B. Judge whether the crawl time t₁ has been reached; if so, end the operation; otherwise, further judge whether the to-be-crawled queue Url_queue built in step A is empty; if Url_queue is empty, end the operation; if Url_queue is not empty, proceed to the next step;

C. Compute the relevance value potential_score between each web page and the topic using the Shark-Search-Advanced algorithm;

D. Compute the link value PR and the page ranking score rank of each web page using the PageRank-Advanced algorithm;

E. Judge whether the crawler's re-crawl interval t₂ has been reached; if so, return to step C; otherwise, repeat step E.
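The control flow of steps A to E can be sketched as follows. This is only an illustrative reading of the text, not the patented implementation: the callables `now`, `crawl_and_score`, `rank_pages`, and `wait_until_elapsed`, as well as the `max_rounds` bound, are hypothetical names introduced here, and t₁ is interpreted as a deadline.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List


def run_topic_search(
    seed_urls: Iterable[str],
    t1: float,
    t2: float,
    now: Callable[[], float],
    crawl_and_score: Callable[[Deque[str]], None],
    rank_pages: Callable[[], None],
    wait_until_elapsed: Callable[[float], None],
    max_rounds: int = 1,
) -> List[str]:
    """Run steps A-E and return a trace of the steps that executed."""
    trace: List[str] = []

    # Step A: build the initial to-be-crawled queue from the seed sites.
    url_queue: Deque[str] = deque(seed_urls)

    # Step B: stop if the crawl time t1 has been reached or the queue is empty.
    if now() >= t1:
        trace.append("end: t1 reached")
        return trace
    if not url_queue:
        trace.append("end: Url_queue empty")
        return trace

    # max_rounds is a hypothetical bound so that the sketch terminates;
    # the source text loops from step E back to step C without one.
    for _ in range(max_rounds):
        crawl_and_score(url_queue)  # Step C: Shark-Search-Advanced
        trace.append("C")
        rank_pages()                # Step D: PageRank-Advanced
        trace.append("D")
        wait_until_elapsed(t2)      # Step E: block until interval t2 has passed
        trace.append("E")
    return trace
```

Passing the clock and the waiting function as parameters keeps the sketch testable without real delays; a real crawler would presumably use a scheduler instead.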

Further, computing the relevance value potential_score between each web page and the topic using the Shark-Search-Advanced algorithm in step C specifically comprises the following sub-steps:

C1. Initialize the depth depth and the relevance value potential_score of each web page in the to-be-crawled queue Url_queue;

C2. Pop a web page from the head of the to-be-crawled queue Url_queue and set it as current_node;

C3. Judge whether the depth depth of current_node from step C2 is greater than 0; if so, proceed to the next step; otherwise, return to step C2;

C4. Compute the relevance value potential_score between current_node from step C2 and the topic using the Shark-Search algorithm;

C5. Compute the relevance value sim_curr between the page content of current_node from step C2 and the topic using the Shark-Search algorithm, and select the first N child pages of the current web page;

C6. Build a network from all current web pages, and compute the PR value of each web page using the PageRank algorithm;

C7. Compute the sim_i value and the depth depth of each child page using the Shark-Search algorithm;

C8. Compute the joint score score_i of each web page, and then, from the joint scores score_i, compute the mean score \overline{score} of the currently crawled web pages and the web page relevance decision coefficient δ;

C9. Judge whether the joint score score_i of each web page is greater than the relevance decision coefficient δ; if so, add the web page to the tail of the to-be-crawled queue Url_queue; otherwise, delete the web page from Url_queue;

C10. Dynamically maintain the to-be-crawled queue Url_queue and return to step C2.

Further, step C5 also comprises judging whether the relevance value sim_curr between current_node and the topic is greater than 0; if so, selecting the first α * width child pages of the current web page, where α is the coefficient applied to the number of child pages added to Url_queue; otherwise, selecting the first width child pages of the current web page.

Further, the joint score score_i of each web page in step C8 is computed as:

score_i = β * sim_i + (1 - β) * PR_i

where β is the weight of sim_i in score_i, i ∈ [1, n], and n is the total number of web pages.

Further, the mean score \overline{score} of the currently crawled web pages in step C8 is computed as:

\overline{score} = \frac{\sum_{i=1}^{n} score_i}{n}

Further, the web page relevance decision coefficient δ in step C8 is computed as:

δ = \frac{n_{max}}{n_{min}} \times \overline{score}

where n_max is the number of currently crawled web pages whose joint score is greater than the mean score, and n_min is the number of currently crawled web pages whose joint score is less than the mean score.
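Sub-steps C8 and C9 can be sketched as below, assuming the sim_i and PR_i values are already available as dictionaries keyed by URL. The function names are hypothetical, and the handling of n_min = 0 is an assumption, since the text does not say what happens when no page scores below the mean.

```python
from typing import Dict, List, Tuple


def joint_scores(sim: Dict[str, float], pr: Dict[str, float], beta: float) -> Dict[str, float]:
    """score_i = beta * sim_i + (1 - beta) * PR_i for every page."""
    return {url: beta * sim[url] + (1.0 - beta) * pr[url] for url in sim}


def decision_coefficient(scores: Dict[str, float]) -> Tuple[float, float]:
    """Return (mean score, delta), with delta = n_max / n_min * mean."""
    values = list(scores.values())
    mean = sum(values) / len(values)
    n_max = sum(1 for v in values if v > mean)  # pages scoring above the mean
    n_min = sum(1 for v in values if v < mean)  # pages scoring below the mean
    if n_min == 0:
        # Assumption: fall back to the mean itself when the ratio is undefined.
        return mean, mean
    return mean, n_max / n_min * mean


def pages_to_enqueue(scores: Dict[str, float]) -> List[str]:
    """Step C9: keep only pages whose joint score exceeds delta."""
    _, delta = decision_coefficient(scores)
    return [url for url, score in scores.items() if score > delta]
```

Because δ scales the mean by n_max / n_min, the threshold drops when most pages score below the mean and rises when most score above it.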

Further, computing the link value PR and the page ranking score rank of each web page using the PageRank-Advanced algorithm in step D specifically comprises the following sub-steps:

D1. Set an initial PR value for every crawled web page to obtain the initial PR value vector π₀;

D2. Compute the PR values of all web pages using the PageRank-Advanced algorithm and express them in vector form;

D3. Obtain the page ranking score rank from the relevance value potential_score between the web page and the topic from step C, combined with the PR value of the web page from step D2.

Further, the vector representation of the web page PR values in step D2 is:

π_{k+1} = π_k G

where π_k is the PR value vector of the web pages computed at the k-th iteration, and G is the iteration matrix defined in the detailed description below.

Further, the page ranking score rank in step D3 is expressed as:

rank = γ * potential_score + (1 - γ) * PR.

The beneficial effects of the invention are as follows. The invention first uses the Shark-Search-Advanced (SSA) algorithm to crawl web pages related to the specified topic on the Internet and computes and outputs the relevance of each web page to the topic. It then combines the relevance value computed by the SSA algorithm with the link-based value computed by the PageRank-Advanced (PRA) algorithm into the final ranking score of each web page, and ranks the retrieval results by that score. This effectively improves the accuracy and reliability of search results, thereby obtaining retrieval results with high precision and high coverage and ensuring that the search engine responds to users' industry-specific search needs with high efficiency, high precision, and high coverage.

Brief description of the drawings

Fig. 1 is a flowchart of the industry-oriented topic search method of the invention.

Fig. 2 is a flowchart of the SSA algorithm of the invention.

Fig. 3 is a flowchart of the PRA algorithm of the invention.

Detailed description of the embodiments

To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.

Fig. 1 shows the flowchart of the industry-oriented topic search method of the invention. The method comprises steps A through E as set out above.

The invention first uses the SSA algorithm to crawl web pages related to the specified topic on the Internet and computes and outputs the relevance of each web page to the topic. It then combines the relevance value computed by the SSA algorithm with the link-based value computed by the PRA algorithm into the final ranking score of each web page, and ranks the retrieval results by that score.

In step A, the invention initializes the search environment, that is, it initializes the crawl seed sites seedUrls, the crawler's crawl time t₁, the topic keyword vector vector_topic, and the crawler's re-crawl interval t₂; the seed sites here are authoritative websites in the industry. The initial to-be-crawled queue Url_queue is then built from the seed sites seedUrls.

In step B, the invention judges whether the crawl time t₁ has been reached and whether the to-be-crawled queue Url_queue built in step A is empty; when t₁ has not been reached and Url_queue is not empty, the method proceeds to the next step.

Step C is illustrated in Fig. 2, the flowchart of the SSA algorithm of the invention. The invention uses the Shark-Search-Advanced algorithm to compute the relevance value potential_score between each web page and the topic: starting from the input seed sites seedUrls and the topic keyword vector vector_topic, it crawls and downloads industry-related websites on the Internet, and finally outputs the structured web page content together with the computed relevance value potential_score and PR value of each web page. It comprises sub-steps C1 through C10 as set out above, the last of which is:

C10. Dynamically maintain the to-be-crawled queue Url_queue and return to step C2.

In step C1, the invention assigns an initial value of 0 to the depth depth and the relevance value potential_score of each web page in the to-be-crawled queue Url_queue.

In step C5, the invention also judges whether the relevance value sim_curr between current_node and the topic is greater than 0; if so, it selects the first α * width child pages of the current web page, where α is a predefined constant, typically set to 1.5, that serves as the coefficient for the number of child pages added to Url_queue; otherwise, it selects the first width child pages of the current web page.
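The child-page selection rule of step C5 can be sketched as follows. The function name `select_children` is hypothetical, the child pages are assumed to be already ordered, and truncating α * width to an integer is an assumption, since the text does not say how a fractional count is rounded.

```python
from typing import List, Sequence


def select_children(
    children: Sequence[str],
    sim_curr: float,
    width: int,
    alpha: float = 1.5,
) -> List[str]:
    """Take more child pages from a relevant parent than from an irrelevant one."""
    # A relevant parent (sim_curr > 0) contributes alpha * width children;
    # otherwise only the first `width` children are taken.
    limit = int(alpha * width) if sim_curr > 0 else width
    return list(children[:limit])
```

With width = 4 and the typical α = 1.5, a relevant page contributes up to 6 child pages and an irrelevant page up to 4.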

In step C6, the invention builds a structured network from all web pages that have currently been crawled and added to Url_queue, and computes the PR value of each web page recursively using the PageRank algorithm.

In step C8, the invention computes the joint score score_i of each web page as:

score_i = β * sim_i + (1 - β) * PR_i

The mean score \overline{score} of the currently crawled web pages is computed as:

\overline{score} = \frac{\sum_{i=1}^{n} score_i}{n}

The web page relevance decision coefficient δ is computed as:

δ = \frac{n_{max}}{n_{min}} \times \overline{score}
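As a worked illustration with made-up numbers (they do not come from the patent), take three crawled web pages with joint scores 0.2, 0.4, and 0.9 and apply the mean-score and δ formulas stated in claims 5 and 6:

```latex
\overline{score} = \frac{0.2 + 0.4 + 0.9}{3} = 0.5, \qquad
n_{max} = 1, \quad n_{min} = 2, \qquad
\delta = \frac{n_{max}}{n_{min}} \times \overline{score} = \frac{1}{2} \times 0.5 = 0.25
```

Under step C9, the pages scoring 0.4 and 0.9 exceed δ = 0.25 and are added to Url_queue, while the page scoring 0.2 is dropped.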

In step C10, the invention adds the web pages whose joint score score_i is greater than the relevance decision coefficient δ to the tail of the to-be-crawled queue Url_queue, thereby dynamically maintaining Url_queue; after the current web page current_node has been processed, the method returns to step C2 and pops a new web page from the head of Url_queue to continue the computation.

The invention crawls topic-related web pages from the Internet, processes them into structured form, and then builds the database. Based on the user's seed sites and the query keywords or phrases, a page that contains the query string is regarded as related to the topic; the relevance of the page to the topic is computed, and the priority queue URL_queue of pages to be crawled is maintained dynamically. URLs that are highly relevant to the topic are placed at the front of the queue and crawled first; URLs with low relevance are placed at the back of the queue and crawled later. When computing the relevance of a page to the topic, the invention considers not only the relevance of the page content but also the relevance of the anchor text and of the context around the anchor text, which makes the information more complete. If only the relevance of a page to the topic were considered, the influence of the page across the whole web would be ignored, which may yield pages whose content is relevant but whose information is unreliable. The invention therefore introduces the PageRank algorithm and uses its global nature to filter the retained topic-related pages a second time, thereby ensuring the globality of the remaining pages.
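The priority ordering described above, in which highly relevant URLs are crawled first, can be sketched with Python's standard `heapq` module. The class name `UrlPriorityQueue` is hypothetical, and this is one possible data structure rather than the one the patent prescribes; the numbered sub-steps describe appending to the tail of Url_queue instead.

```python
import heapq
import itertools
from typing import List, Tuple


class UrlPriorityQueue:
    """To-be-crawled queue that pops the most relevant URL first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        # The counter breaks ties so equally relevant URLs keep insertion order.
        self._counter = itertools.count()

    def push(self, url: str, relevance: float) -> None:
        # heapq is a min-heap, so the relevance is negated to pop the largest first.
        heapq.heappush(self._heap, (-relevance, next(self._counter), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

For example, pushing three URLs with relevance 0.2, 0.9, and 0.5 and then popping three times returns them in the order 0.9, 0.5, 0.2.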

Step D is illustrated in Fig. 3, the flowchart of the PRA algorithm of the invention. The invention uses the PageRank-Advanced algorithm to compute the link value PR and the page ranking score rank of each web page. It uses the link structure between web pages and the random surfer model to establish a score computation method that distributes the score of a parent site fairly and reasonably among its child sites, and it combines the potential_score computed by the SSA algorithm with the PR value obtained by this method to yield a new scoring mechanism. It comprises sub-steps D1 through D3, detailed below.

In step D1, the invention sets an initial PR value for every crawled web page according to the network G(V, E) formed by the web pages: authoritative web pages are assigned PR_authority and ordinary web pages are assigned 1, which gives the initial PR value vector π₀. The network G(V, E) is the directed graph formed by the link structure between web pages, where V is the vertex set, that is, the set of web pages, and E is the edge set, that is, the links between web pages.
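The initialization of step D1 can be sketched as below. The function name and the way authoritative pages are identified (a set of URLs supplied by the caller) are assumptions; the patent does not give a concrete value for PR_authority, so it is left as a parameter.

```python
from typing import Iterable, List, Set


def initial_pr_vector(
    pages: Iterable[str],
    authoritative: Set[str],
    pr_authority: float,
) -> List[float]:
    """Build pi_0: PR_authority for authoritative pages, 1 for ordinary pages."""
    return [pr_authority if page in authoritative else 1.0 for page in pages]
```

The order of the returned vector follows the order of `pages`, so the same page order has to be used when the link matrix is built.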

In step D2, the invention expresses the PR values of all web pages in vector form:

π_{k+1} = π_k G

where π_k is the PR value vector of the web pages computed at the k-th iteration; M is the initial value allocation matrix built from the link structure of the web pages; S is the matrix obtained after correcting for small black holes; and G is the matrix obtained after correcting for big black holes. A small black hole is a single web page that has in-links but no out-links; a big black hole is a set of several web pages that, taken together, has in-links but no out-links.
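The iteration π_{k+1} = π_k G can be sketched in plain Python as below. Since the patent's own formulas for S and G are not reproduced in this text, the sketch substitutes the standard PageRank construction: a page with no out-links (a small black hole) jumps uniformly to all pages, and a damping factor mixes in a uniform jump so that rank cannot get trapped in a closed set of pages (a big black hole). The damping value 0.85 and all function names are assumptions.

```python
from typing import List


def google_matrix(links: List[List[int]], damping: float = 0.85) -> List[List[float]]:
    """Build a row-stochastic matrix G from out-link lists (links[i] = targets of page i)."""
    n = len(links)
    matrix = []
    for out in links:
        if out:
            # M: a page splits its value equally among its out-links.
            row = [0.0] * n
            for j in out:
                row[j] += 1.0 / len(out)
        else:
            # S: a page without out-links jumps to every page uniformly.
            row = [1.0 / n] * n
        # G: mix in a uniform jump so no closed group of pages can trap the rank.
        matrix.append([damping * x + (1.0 - damping) / n for x in row])
    return matrix


def power_iterate(pi0: List[float], G: List[List[float]],
                  tol: float = 1e-10, max_iter: int = 1000) -> List[float]:
    """Repeat pi_{k+1} = pi_k G until the vector changes by less than tol."""
    pi = list(pi0)
    n = len(pi)
    for _ in range(max_iter):
        nxt = [sum(pi[i] * G[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi
```

Because every row of G sums to 1, the iteration preserves the total of the initial vector π₀, so an unnormalized π₀ such as the one from step D1 can be used directly.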

In step D3, the invention computes the page ranking score rank as:

rank = γ * potential_score + (1 - γ) * PR.

For a search with a clearly defined topic, setting γ to a larger value selects results that are highly relevant to the topic. If the topic of the user's search keywords is not clearly defined, γ can be set to a smaller value so that highly authoritative web pages are selected, which also gives the user some recommendations along the way.
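The effect of γ on the final ordering can be sketched as follows. The function name, the page labels, and the numeric scores are made up for illustration, and the sketch assumes that potential_score and PR have already been brought to comparable scales.

```python
from typing import Dict, List, Tuple


def rank_pages(pages: Dict[str, Tuple[float, float]], gamma: float) -> List[str]:
    """Order URLs by rank = gamma * potential_score + (1 - gamma) * PR, highest first."""
    rank = {
        url: gamma * potential_score + (1.0 - gamma) * pr
        for url, (potential_score, pr) in pages.items()
    }
    return sorted(rank, key=rank.get, reverse=True)


# Hypothetical pages: (potential_score, PR)
example = {"niche": (0.9, 0.1), "portal": (0.3, 0.8)}
topic_focused = rank_pages(example, gamma=0.8)      # relevance dominates
authority_focused = rank_pages(example, gamma=0.2)  # link value dominates
```

With γ = 0.8 the highly relevant page comes first; with γ = 0.2 the page with the higher PR comes first.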

The invention uses the PageRank-Advanced algorithm to solve the problem of ranking retrieval results by relevance. Specifically, it builds a directed web page link graph by link analysis, computes from it the importance of each web page across the whole web, and combines that importance with the relevance of the page content to the topic to establish a new ranking mechanism. The importance of a web page across the whole web is measured by the number of times it is cited and by whether it is cited by authoritative web pages; that is, the importance of a page is divided equally and passed on to the pages it cites, which represents its globality. The topic relevance of a web page is computed by the SSA algorithm and represents content locality, which avoids the "topic drift" that arises when only link analysis is used. Combining these two values, which reflect the relevance and the reliability of the page information respectively, yields a new ranking mechanism that effectively improves the accuracy and reliability of search results.

The invention ranks retrieval results more reasonably on the basis of both relevance and page credibility score. Meanwhile, to avoid the locality of the Shark-Search algorithm, it establishes an improved selective crawling strategy in combination with the PageRank algorithm: only topic-related web pages are downloaded, a database with low information redundancy is built, and the topic drift problem of traditional algorithms is avoided.

Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principles of the invention, and that the scope of protection of the invention is not limited to these specific statements and embodiments. Based on the technical teachings disclosed by the invention, those of ordinary skill in the art can make various other specific variations and combinations that do not depart from the essence of the invention, and these variations and combinations remain within the scope of protection of the invention.

Claims

1. An industry-oriented topic search method, characterized by comprising the following steps:

A. Initialize the crawl seed sites seedUrls, the crawler's crawl time t₁, the topic keyword vector vector_topic, and the crawler's re-crawl interval t₂; build the initial to-be-crawled queue Url_queue from the seed sites seedUrls;

B. Judge whether the crawl time t₁ has been reached; if so, end the operation; otherwise, further judge whether the to-be-crawled queue Url_queue built in step A is empty; if Url_queue is empty, end the operation; if Url_queue is not empty, proceed to the next step;

C. Compute the relevance value potential_score between each web page and the topic using the Shark-Search-Advanced algorithm;

D. Compute the link value PR and the page ranking score rank of each web page using the PageRank-Advanced algorithm;

E. Judge whether the crawler's re-crawl interval t₂ has been reached; if so, return to step C; otherwise, repeat step E.

2. The industry-oriented topic search method of claim 1, characterized in that computing the relevance value potential_score between each web page and the topic using the Shark-Search-Advanced algorithm in step C specifically comprises the following sub-steps:

C1. Initialize the depth depth and the relevance value potential_score of each web page in the to-be-crawled queue Url_queue;

C2. Pop a web page from the head of the to-be-crawled queue Url_queue and set it as current_node;

C3. Judge whether the depth depth of current_node from step C2 is greater than 0; if so, proceed to the next step; otherwise, return to step C2;

C4. Compute the relevance value potential_score between current_node from step C2 and the topic using the Shark-Search algorithm;

C5. Compute the relevance value sim_curr between the page content of current_node from step C2 and the topic using the Shark-Search algorithm, and select the first N child pages of the current web page;

C6. Build a network from all current web pages, and compute the PR value of each web page using the PageRank algorithm;

C7. Compute the sim_i value and the depth depth of each child page using the Shark-Search algorithm;

C8. Compute the joint score score_i of each web page, and then, from the joint scores score_i, compute the mean score \overline{score} of the currently crawled web pages and the web page relevance decision coefficient δ;

C9. Judge whether the joint score score_i of each web page is greater than the relevance decision coefficient δ; if so, add the web page to the tail of the to-be-crawled queue Url_queue; otherwise, delete the web page from Url_queue;

C10. Dynamically maintain the to-be-crawled queue Url_queue and return to step C2.

3. The industry-oriented topic search method of claim 2, characterized in that step C5 also comprises judging whether the relevance value sim_curr between current_node and the topic is greater than 0; if so, selecting the first α * width child pages of the current web page, where α is the coefficient applied to the number of child pages added to Url_queue; otherwise, selecting the first width child pages of the current web page.

4. The industry-oriented topic search method of claim 3, characterized in that the joint score score_i of each web page in step C8 is computed as:

score_i = β * sim_i + (1 - β) * PR_i

5. The industry-oriented topic search method of claim 4, characterized in that the mean score \overline{score} of the currently crawled web pages in step C8 is computed as:

\overline{score} = \frac{\sum_{i=1}^{n} score_i}{n}

6. The industry-oriented topic search method of claim 5, characterized in that the web page relevance decision coefficient δ in step C8 is computed as:

δ = \frac{n_{max}}{n_{min}} \times \overline{score}

where n_max is the number of currently crawled web pages whose joint score is greater than the mean score, and n_min is the number of currently crawled web pages whose joint score is less than the mean score.

7. The industry-oriented topic search method of claim 6, characterized in that computing the link value PR and the page ranking score rank of each web page using the PageRank-Advanced algorithm in step D specifically comprises the following sub-steps:

D1. Set an initial PR value for every crawled web page to obtain the initial PR value vector π₀;

D2. Compute the PR values of all web pages using the PageRank-Advanced algorithm and express them in vector form;

D3. Obtain the page ranking score rank from the relevance value potential_score between the web page and the topic from step C, combined with the PR value of the web page from step D2.

8. The industry-oriented topic search method of claim 7, characterized in that the vector representation of the web page PR values in step D2 is:

π_{k+1} = π_k G

where π_k is the PR value vector of the web pages computed at the k-th iteration.

9. The industry-oriented topic search method of claim 8, characterized in that the page ranking score rank in step D3 is expressed as:

rank = γ * potential_score + (1 - γ) * PR.