CN108959576A

CN108959576A - A network crawler system and method based on the theme of Party school research work

Info

Publication number: CN108959576A
Application number: CN201810736630.3A
Authority: CN
Inventors: 徐玉红
Original assignee: Hefei Minggao Software Technology Co Ltd
Current assignee: Hefei Minggao Software Technology Co Ltd
Priority date: 2018-07-06
Filing date: 2018-07-06
Publication date: 2018-12-07

Abstract

The invention discloses a network crawler system and method based on the theme of Party school research work, and relates to the technical field of internet search engines. The network crawler system comprises an initial seed module, a crawler module, a database, a topic correlativity analysis module, a sorting module and a theme establishment module. The web crawler working method comprises: 1. the crawler module fetches a web page; 2. the topic correlativity analysis module is called to analyze the topic correlativity of the web page; 3. the crawler module rejects or retains the web page according to the result of the analysis; 4. the crawler module takes the URLs waiting to be processed out of the database; 5. the sorting module ranks the web pages by importance; 6. the crawler module judges whether there is a new URL in the database. By establishing a search engine for the theme of Party school research work and using the topic correlativity analysis module for theme optimization and web page filtering, the invention improves the relevance of web searches on Party school research work and the precision of the retrieved information.

Description

A network crawler system and method based on the theme of Party school research work

Technical field

The invention belongs to the technical field of internet search engines, and in particular relates to a network crawler system and method based on the theme of Party school research work.

Background art

Traditional general-purpose search engines face two major challenges. First, Web information resources are growing geometrically, and a search engine cannot index all pages. Second, users in different fields have different search needs, and a "broad and general" search engine cannot satisfy the "specialized and precise" needs of professional users. In response to these challenges, various "topic search engines" aimed at specific groups of users have emerged.

At the same time, with the continuous development of Party school research work in China, Party school research resources have exceeded the terabyte (TB) level, yet no effective information retrieval approach has been established. For example, a Baidu search of the website of the Party School of the CPC Central Committee (http://www.ccps.gov.cn/) for "Marxist Contemporary Value" returns 0 results. The field of Party school research therefore needs its own topic search engine, and in view of the above problems it is of great significance to provide a network crawler system and method based on the theme of Party school research work.

Summary of the invention

The purpose of the present invention is to provide a network crawler system and method based on the theme of Party school research work. On the basis of the Shark-Search algorithm, improvements are made for the characteristics of Party school research work in order to establish a search engine for this theme. The theme is established with keywords, each of which has its own specified weight, and the topic correlativity analysis module performs theme optimization and web page filtering. This solves the problems of low search relevance and low information precision in existing web searches on Party school research work.

In order to solve the above technical problems, the present invention is achieved by the following technical solutions:

A network crawler system based on the theme of Party school research work according to the invention comprises an HTML document, an initial seed module, a crawler module, a database, a topic correlativity analysis module, a sorting module and a theme establishment module;

The theme establishment module is used to establish the theme toward which the crawler is oriented;

The topic correlativity analysis module is used to calculate the topic correlativity of web pages;

The initial seed module is used to generate good seed sites for a specific theme so that the crawler module can start crawling smoothly;

The HTML document, the initial seed module, the database and the topic correlativity analysis module are each connected to the crawler module in real time; the sorting module is connected to the database in real time; the theme establishment module is connected to the topic correlativity analysis module in real time.

Further, the theme establishment module determines the theme with a keyword set, in which each keyword has its own specified weight; the weights are obtained by the feature extraction method.

Further, the topic correlativity analysis module ensures that the web pages obtained by the crawler stay as close to the theme as possible: it filters the pages fetched by the crawler module and rejects those with low topic correlativity. The topic correlativity calculation method it uses is the vector space model.

Further, the sorting module ranks web pages by importance and places high-value pages first so that they are more easily selected. The ranking method it uses is the PageRank algorithm.

A web crawler method based on the theme of Party school research work comprises a web crawler working method, a feature extraction method, a vector space model, and a descriptor recording method in the database;

The web crawler working method comprises the following steps:

S01: the crawler module fetches a web page;

S02: the topic correlativity analysis module is called to analyze the topic correlativity of the web page;

S03: the crawler module rejects or retains the web page according to the result of the analysis;

S04: the crawler module takes the URLs waiting to be processed out of the database;

S05: the sorting module ranks the web pages by importance;

S06: the crawler module judges whether there is a new URL in the database;

If so, return to step S01 and repeat the cycle;

If not, end.
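
The loop S01 to S06 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `crawl`, the injected callables `fetch`, `relevance` and `extract_links`, the in-memory queue that stands in for the database, and the default `threshold` and `max_pages` values are all hypothetical. The final ordering uses the relevance score as a placeholder for the importance ranking of step S05.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], str],
    relevance: Callable[[str], float],
    extract_links: Callable[[str], List[str]],
    threshold: float = 0.5,   # assumed cut-off, not given in the source
    max_pages: int = 100,     # assumed safety limit, not given in the source
) -> Dict[str, float]:
    """Run the S01-S06 loop and return {url: score} for the retained pages."""
    pending = deque(seeds)    # stands in for the URL queue kept in the database
    seen = set(pending)
    kept: Dict[str, float] = {}
    fetched = 0

    while pending and fetched < max_pages:  # S06: continue while URLs remain
        url = pending.popleft()             # S04: take a pending URL
        page = fetch(url)                   # S01: fetch the web page
        fetched += 1
        score = relevance(page)             # S02: topic correlativity analysis
        if score < threshold:               # S03: reject a low-scoring page
            continue
        kept[url] = score                   # S03: retain the page
        for link in extract_links(page):    # queue links not seen before
            if link not in seen:
                seen.add(link)
                pending.append(link)

    # S05: order retained pages; the score is a placeholder for importance.
    return dict(sorted(kept.items(), key=lambda kv: kv[1], reverse=True))
```

In this sketch a URL is taken from the queue before the page is fetched, so S04 runs ahead of S01 inside each iteration; links found on rejected pages are not followed, which is one possible reading of the rejection in S03.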

Further, the feature extraction method takes a given set of web pages relevant to the theme, automatically extracts by program the features common to these pages, and determines the weights according to frequency.

Further, the vector space model comprises the following steps:

P01: take the number n of keywords as the dimension of the vector space and the weight w_i of each keyword as the size of the corresponding component; the theme is then expressed as a vector:

α=(a₁,a₂,...,a_n), i=1,2,3,...,n, a_i=w_i;

P02: analyze the page, count how often each keyword occurs and compute the frequency ratios. The keyword with the highest frequency serves as the benchmark and its frequency is expressed as x_i=1; the frequencies x_i of the other keywords are derived from the frequency ratios. Each component of the vector corresponding to the page is then x_iw_i, and the page topic is expressed as a vector:

β=(x₁w₁,x₂w₂,...,x_nw_n), i=1,2,...,n,

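The topic correlativity of the page is expressed by the cosine of the angle between the two vectors:

cos<α, β> = (α·β)/(|α||β|) = Σ a_i x_i w_i / ( √(Σ a_i²) · √(Σ (x_i w_i)²) ), with each sum taken over i=1,2,...,n;

P03: specify a threshold r; when cos<α, β> ≥ r, the page is considered relevant to the theme. The value of r needs to be determined from experience and actual requirements.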
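
Steps P01 to P03 can be sketched in a few lines. This is an illustrative reading of the vector space model above, with several assumptions: the names `topic_relevance` and `is_relevant` are hypothetical, keywords are taken to be single lowercase words, tokenization uses a simple regular expression, and the default threshold `r=0.6` is an arbitrary placeholder since the text leaves r to experience.

```python
import math
import re
from typing import Dict


def topic_relevance(page_text: str, weights: Dict[str, float]) -> float:
    """Cosine of the angle between theme vector alpha and page vector beta."""
    words = re.findall(r"\w+", page_text.lower())
    counts = {k: words.count(k) for k in weights}  # keyword occurrence counts
    top = max(counts.values(), default=0)
    if top == 0:
        return 0.0  # no keyword occurs, so treat the page as unrelated
    # P01: a_i = w_i
    alpha = [weights[k] for k in weights]
    # P02: x_i is the count relative to the most frequent keyword (x = 1)
    beta = [counts[k] / top * weights[k] for k in weights]
    dot = sum(a * b for a, b in zip(alpha, beta))
    norm = math.sqrt(sum(a * a for a in alpha)) * math.sqrt(sum(b * b for b in beta))
    return dot / norm if norm else 0.0


def is_relevant(page_text: str, weights: Dict[str, float], r: float = 0.6) -> bool:
    """P03: the page counts as relevant when the cosine reaches the threshold r."""
    return topic_relevance(page_text, weights) >= r
```

A page in which every keyword occurs equally often yields β = α and therefore a cosine of 1, while a page containing none of the keywords scores 0.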

Further, the descriptor recording method in the database comprises the following steps:

T01: establish a dictionary of words common in the URLs of Party school research pages

W_url=(the communist party, party school, party history, ...), which contains the host names of some authoritative Party school research websites and common words; let the number of entries be d;

T02: split the URL on "/" and " ", remove marker tokens such as http and com, and extract the meaningful phrases (word₁,word₂,...,word_n);

T03: the relevance score R_URL calculated from the URL is

Wherein:
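
Steps T01 and T02 can be illustrated with a short sketch. The scoring formula for R_URL is not reproduced here, so the code stops at tokenization and dictionary lookup and does not compute a score. Everything beyond the source is an assumption: the names `url_tokens` and `dictionary_hits`, the entries of the sample dictionary `W_URL`, the extra marker tokens besides http and com, and the choice to split on "." and other URL punctuation in addition to "/".

```python
import re
from typing import List, Set

# Marker tokens to discard. "http" and "com" come from step T02;
# the remaining entries are assumptions added for illustration.
MARKERS: Set[str] = {"http", "https", "www", "com", "cn", "gov", "html", "htm"}

# Hypothetical stand-in for the dictionary W_url of step T01.
W_URL: Set[str] = {"party", "school", "history", "ccps"}


def url_tokens(url: str) -> List[str]:
    """Split a URL into lowercase tokens and drop marker tokens (step T02)."""
    parts = re.split(r"[/.:\s?=&_-]+", url.lower())
    return [p for p in parts if p and p not in MARKERS]


def dictionary_hits(url: str, dictionary: Set[str] = W_URL) -> List[str]:
    """Return the URL tokens that also appear in the dictionary."""
    return [t for t in url_tokens(url) if t in dictionary]
```

For instance, `url_tokens("http://www.ccps.gov.cn/party/history.html")` keeps only `ccps`, `party` and `history`, all of which are in the sample dictionary.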

The invention has the following advantages:

On the basis of the Shark-Search algorithm, the present invention makes improvements for the characteristics of Party school research work and establishes a search engine for this theme. The theme is established with keywords, each with its own specified weight, and the topic correlativity analysis module performs theme optimization and web page filtering. This improves the relevance of web searches on Party school research work and the precision of the retrieved information, and helps advance the informatization of Party school research work in China.

Of course, a product implementing the present invention does not necessarily need to achieve all of the above advantages at the same time.

Brief description of the drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic structural diagram of a network crawler system based on the theme of Party school research work according to the invention;

Fig. 2 is a block diagram of the working method of a web crawler based on the theme of Party school research work according to the invention.

Specific embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the protection scope of the invention.

Referring to Fig. 1, a network crawler system based on the theme of Party school research work according to the invention comprises an HTML document, an initial seed module, a crawler module, a database, a topic correlativity analysis module, a sorting module and a theme establishment module;

The theme establishment module is used to establish the theme toward which the crawler is oriented;

The topic correlativity analysis module is used to calculate the topic correlativity of web pages;

The initial seed module is used to generate good seed sites for a specific theme so that the crawler module can start crawling smoothly;

The HTML document, the initial seed module, the database and the topic correlativity analysis module are each connected to the crawler module in real time; the sorting module is connected to the database in real time; the theme establishment module is connected to the topic correlativity analysis module in real time.

The theme establishment module determines the theme with a keyword set, in which each keyword has its own specified weight; the weights are obtained by the feature extraction method.

The topic correlativity analysis module ensures that the web pages obtained by the crawler stay as close to the theme as possible: it filters the pages fetched by the crawler module and rejects those with low topic correlativity. The topic correlativity calculation method it uses is the vector space model.

The sorting module ranks web pages by importance and places high-value pages first so that they are more easily selected. The ranking method it uses is the PageRank algorithm.
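
The ranking performed by the sorting module can be illustrated with a textbook power-iteration form of PageRank. This is a generic sketch, not the variant used in the invention, which the text does not specify: the function name `pagerank`, the damping factor 0.85, the fixed iteration count, and the choice to spread the rank of pages without outgoing links evenly over all pages are assumptions.

```python
from typing import Dict, List


def pagerank(
    links: Dict[str, List[str]],
    damping: float = 0.85,   # commonly used value, assumed here
    iterations: int = 50,    # assumed fixed number of power iterations
) -> Dict[str, float]:
    """Return a rank for every page appearing in the link graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    if n == 0:
        return {}
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Share this page's rank equally among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank
```

Sorting the pages by the returned rank in descending order places the pages with the most inbound link weight first, which matches the stated goal of putting high-value pages at the front.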

As shown in Fig. 2, a web crawler method based on the theme of Party school research work comprises a web crawler working method, a feature extraction method, a vector space model, and a descriptor recording method in the database;

The web crawler working method comprises the following steps:

S01: the crawler module fetches a web page;

S02: the topic correlativity analysis module is called to analyze the topic correlativity of the web page;

S03: the crawler module rejects or retains the web page according to the result of the analysis;

S04: the crawler module takes the URLs waiting to be processed out of the database;

S05: the sorting module ranks the web pages by importance;

S06: the crawler module judges whether there is a new URL in the database;

If so, return to step S01 and repeat the cycle;

If not, end.

The feature extraction method takes a given set of web pages relevant to the theme, automatically extracts by program the features common to these pages, and determines the weights according to frequency.
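
One simple way to derive keyword weights from frequency is sketched below. It is an assumption-laden illustration, since the text does not say how frequency maps to weight: the name `keyword_weights`, the small English stop-word list, the `top_k` cut-off, and the normalization of weights so that they sum to 1 are all hypothetical choices.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Assumed stop-word list; the source does not define one.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "on", "for"}


def keyword_weights(pages: Iterable[str], top_k: int = 5) -> Dict[str, float]:
    """Pick the top_k most frequent words and weight them by relative frequency."""
    counts: Counter = Counter()
    for text in pages:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    top = counts.most_common(top_k)
    total = sum(c for _, c in top)
    # Normalize so the selected weights sum to 1 (an assumed convention).
    return {word: c / total for word, c in top} if total else {}
```

The resulting dictionary can serve directly as the weighted keyword set that defines the theme.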

The vector space model comprises the following steps:

α=(a₁,a₂,...,a_n), i=1,2,3,...,n, a_i=w_i;

β=(x₁w₁,x₂w₂,...,x_nw_n), i=1,2,...,n,

The descriptor recording method in the database comprises the following steps:

T03: the relevance score R_URL calculated from the URL is

Wherein:

In the description of this specification, reference to the terms "one embodiment", "example", "specific example" and the like means that a particular feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.

The preferred embodiments of the invention disclosed above are intended only to help explain the invention. The preferred embodiments do not describe every detail exhaustively, nor do they limit the invention to the specific embodiments described. Obviously, many modifications and variations can be made in light of this specification. These embodiments were chosen and described in detail in order to better explain the principles and practical application of the invention, so that those skilled in the art can better understand and use it. The invention is limited only by the claims, together with their full scope and equivalents.

Claims

1. A network crawler system based on the theme of Party school research work, characterized in that it comprises an HTML document, an initial seed module, a crawler module, a database, a topic correlativity analysis module, a sorting module and a theme establishment module;

The theme establishment module is used to establish the theme toward which the crawler is oriented;

The initial seed module is used to generate good seed sites for a specific theme so that the crawler module can start crawling smoothly;

The HTML document, the initial seed module, the database and the topic correlativity analysis module are each connected to the crawler module in real time; the sorting module is connected to the database in real time; the theme establishment module is connected to the topic correlativity analysis module in real time.

2. The network crawler system based on the theme of Party school research work according to claim 1, characterized in that the theme establishment module determines the theme with a keyword set, in which each keyword has its own specified weight, and the weights are obtained by the feature extraction method.

3. The network crawler system based on the theme of Party school research work according to claim 1, characterized in that the topic correlativity analysis module is used to ensure that the web pages obtained by the crawler stay as close to the theme as possible, filters the web pages fetched by the crawler module and rejects those with low topic correlativity, and the topic correlativity calculation method used by the topic correlativity analysis module is the vector space model.

4. The network crawler system based on the theme of Party school research work according to claim 1, characterized in that the sorting module is used to rank web pages by importance and place high-value pages first so that they are more easily selected, and the ranking method used by the sorting module is the PageRank algorithm.

5. A web crawler method based on the theme of Party school research work according to any one of claims 1 to 4, characterized in that it comprises a web crawler working method, a feature extraction method, a vector space model, and a descriptor recording method in the database;

The web crawler working method comprises the following steps:

S01: the crawler module fetches a web page;

S05: the sorting module ranks the web pages by importance;

S06: the crawler module judges whether there is a new URL in the database;

If so, return to step S01 and repeat the cycle;

If not, end.

6. The web crawler method based on the theme of Party school research work according to claim 5, characterized in that the feature extraction method takes a given set of web pages relevant to the theme, automatically extracts by program the features common to these pages, and determines the weights according to frequency.

7. The web crawler method based on the theme of Party school research work according to claim 5, characterized in that the vector space model comprises the following steps:

P01: take the number n of keywords as the dimension of the vector space and the weight w_i of each keyword as the size of the corresponding component; the theme is then expressed as a vector:

α=(a₁,a₂,...,a_n), i=1,2,3,...,n, a_i=w_i;

P02: analyze the page, count how often each keyword occurs and compute the frequency ratios. The keyword with the highest frequency serves as the benchmark and its frequency is expressed as x_i=1; the frequencies x_i of the other keywords are derived from the frequency ratios. Each component of the vector corresponding to the page is then x_iw_i, and the page topic is expressed as a vector:

β=(x₁w₁,x₂w₂,...,x_nw_n), i=1,2,...,n,

P03: specify a threshold r; when cos<α, β> ≥ r, the page is considered relevant to the theme. The value of r needs to be determined from experience and actual requirements.

8. The web crawler method based on the theme of Party school research work according to claim 5, characterized in that the descriptor recording method in the database comprises the following steps:

W_url=(the communist party, party school, party history, ...), which contains the host names of some authoritative Party school research websites and common words; let the number of entries be d;

T02: split the URL on "/" and " ", remove marker tokens such as http and com, and extract the meaningful phrases (word₁,word₂,...,word_n);

T03: the relevance score R_URL calculated from the URL is

Wherein: