CN108959576A - A web crawler system and method based on the theme of Party school research work
- Publication number: CN108959576A
- Application number: CN201810736630.3A
- Authority: CN (China)
- Prior art keywords: module, theme, crawler, webpage, research work
- Legal status: Withdrawn (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a web crawler system and method based on the theme of Party school research work, and relates to the technical field of internet search engines. The web crawler system comprises an initial seed module, a crawler module, a database, a topic correlativity analysis module, a sorting module and a theme establishment module. The web crawler working method comprises: 1. the crawler module fetches a webpage; 2. the topic correlativity analysis module is called to analyze the topic correlativity of the webpage; 3. the crawler module rejects or retains the webpage according to the analysis result; 4. the crawler module takes the URLs waiting to be processed out of the database; 5. the sorting module ranks webpages by importance; 6. the crawler module judges whether there are new URLs in the database. By establishing a search engine for the Party school research work theme and using the topic correlativity analysis module for theme optimization and webpage filtering, the invention improves the relevance of Party school research work web searches and the precision of the information retrieved.
Description
Technical field
The invention belongs to the technical field of internet search engines, and in particular relates to a web crawler system and method based on the theme of Party school research work.
Background art
Traditional general-purpose search engines face two major challenges. First, web information resources are growing geometrically, and a search engine cannot index every page. Second, users in different fields have different search needs, and a "broad and general" search engine cannot meet the "specialized and precise" needs of professional users. In response to these challenges, various "topic search engines" aimed at specific groups of users have emerged.
At the same time, with the continuous development of Party school research work in China, Party school research resources have grown beyond the terabyte (TB) level, yet no effective means of information retrieval has been established. For example, a Baidu search restricted to the website of the Party School of the CPC Central Committee (http://www.ccps.gov.cn/) for "Marxist Contemporary Value" returns 0 results. The field of Party school research work needs its own topic search engine. In view of the above problems, providing a web crawler system and method based on the theme of Party school research work is therefore of great significance.
Summary of the invention
The purpose of the present invention is to provide a web crawler system and method based on the theme of Party school research work. Building on the Shark-Search algorithm, with improvements tailored to the characteristics of Party school research work, the invention establishes a search engine for the Party school research work theme. The theme is established from keywords, each assigned its own weight, and the topic correlativity analysis module performs theme optimization and webpage filtering. This solves the problems of low search relevance and low information precision in existing web searches on the Party school research work theme.
In order to solve the above technical problems, the present invention is achieved by the following technical solutions:
A web crawler system based on the theme of Party school research work comprises an HTML document, an initial seed module, a crawler module, a database, a topic correlativity analysis module, a sorting module and a theme establishment module;
The theme establishment module is used to establish the theme that the crawler targets;
The topic correlativity analysis module is used to calculate the topic correlativity of webpages;
The initial seed module is used to generate good seed sites for the specific theme, so that the crawler module can start crawling smoothly;
The HTML document, the initial seed module, the database and the topic correlativity analysis module are each connected to the crawler module in real time; the sorting module is connected to the database in real time; the theme establishment module is connected to the topic correlativity analysis module in real time.
Further, the theme establishment module determines the theme using a keyword set, in which each keyword is assigned its own weight; the weights are obtained by the feature extraction method.
Further, the topic correlativity analysis module is used to keep the webpages obtained by the crawler as close to the theme as possible: it filters the webpages fetched by the crawler module and rejects those with low topic correlativity. The topic correlativity calculation method used by the topic correlativity analysis module is the vector space model.
Further, the sorting module is used to rank webpages by importance, placing high-value webpages first so that they are more easily selected. The ranking method used by the sorting module is the PageRank algorithm.
A web crawler method based on the theme of Party school research work comprises a web crawler working method, a feature extraction method, a vector space model, and a descriptor recording method in the database;
The web crawler working method comprises the following steps:
S01: the crawler module fetches a webpage;
S02: the topic correlativity analysis module is called to analyze the topic correlativity of the webpage;
S03: the crawler module rejects or retains the webpage according to the analysis result;
S04: the crawler module takes the URLs waiting to be processed out of the database;
S05: the sorting module ranks webpages by importance;
S06: the crawler module judges whether there are new URLs in the database;
If so, return to step S01 and repeat;
If not, end.
Further, in the feature extraction method, a collection of webpages relevant to the theme is given, a program automatically extracts the features common to these webpages, and the weights are determined according to frequency.
Further, the vector space model comprises the following steps:
P01: take the number of keywords n as the dimension of the vector space and the weight $w_i$ of each keyword as the magnitude of the corresponding component; the theme is then expressed as a vector:
$\alpha = (a_1, a_2, \ldots, a_n)$, $i = 1, 2, \ldots, n$, $a_i = w_i$;
P02: analyze the page, count how often each keyword occurs, and compute the frequency ratios: the keyword with the highest frequency is taken as the benchmark with frequency $x_i = 1$, and the frequencies $x_i$ of the other keywords are obtained from their ratio to it. Each component of the vector corresponding to the page is then $x_i w_i$, and the page topic is expressed as a vector:
$\beta = (x_1 w_1, x_2 w_2, \ldots, x_n w_n)$, $i = 1, 2, \ldots, n$,
The topic correlativity of the page is expressed by the cosine of the angle between the two vectors:
$\cos\langle\alpha,\beta\rangle = \dfrac{\alpha \cdot \beta}{|\alpha|\,|\beta|} = \dfrac{\sum_{i=1}^{n} x_i w_i^2}{\sqrt{\sum_{i=1}^{n} w_i^2}\,\sqrt{\sum_{i=1}^{n} x_i^2 w_i^2}}$
P03: specify a threshold r; when $\cos\langle\alpha,\beta\rangle \ge r$, the page can be considered relevant to the theme. The value of r needs to be determined by experience and actual requirements.
Further, the descriptor recording method in the database comprises the following steps:
T01: build a dictionary of words commonly found in the URLs of Party school research work pages,
$W_{url}$ = (the communist party, party school, party history, ...), which contains the host names of some authoritative Party school research websites and common words; let the number of entries be d;
T02: split the URL on "/" and " ", remove marker tokens such as http and com, and extract the meaningful tokens ($word_1, word_2, \ldots, word_n$);
T03: the relevance score $R_{URL}$ calculated from the URL is
[formula not reproduced in the source text]
Where:
[definitions not reproduced in the source text]
The invention has the following advantages:
Building on the Shark-Search algorithm, with improvements tailored to the characteristics of Party school research work, the invention establishes a search engine for the Party school research work theme. The theme is established from keywords, each assigned its own weight, and the topic correlativity analysis module performs theme optimization and webpage filtering. This improves the relevance of Party school research work web searches and the precision of the information retrieved, and helps advance the informatization of Party school research work in China.
Of course, a product implementing the invention does not necessarily need to achieve all of the above advantages at the same time.
Brief description of the drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a structural schematic diagram of a web crawler system based on the theme of Party school research work according to the invention;
Fig. 2 is a block diagram of the working method of a web crawler based on the theme of Party school research work according to the invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, a web crawler system based on the theme of Party school research work according to the invention comprises an HTML document, an initial seed module, a crawler module, a database, a topic correlativity analysis module, a sorting module and a theme establishment module;
The theme establishment module is used to establish the theme that the crawler targets;
The topic correlativity analysis module is used to calculate the topic correlativity of webpages;
The initial seed module is used to generate good seed sites for the specific theme, so that the crawler module can start crawling smoothly;
The HTML document, the initial seed module, the database and the topic correlativity analysis module are each connected to the crawler module in real time; the sorting module is connected to the database in real time; the theme establishment module is connected to the topic correlativity analysis module in real time.
The theme establishment module determines the theme using a keyword set, in which each keyword is assigned its own weight; the weights are obtained by the feature extraction method.
The topic correlativity analysis module is used to keep the webpages obtained by the crawler as close to the theme as possible: it filters the webpages fetched by the crawler module and rejects those with low topic correlativity. The topic correlativity calculation method used by the topic correlativity analysis module is the vector space model.
The sorting module is used to rank webpages by importance, placing high-value webpages first so that they are more easily selected. The ranking method used by the sorting module is the PageRank algorithm.
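As an illustration of the ranking step, the following is a minimal sketch of PageRank by power iteration over a small in-memory link graph. The function names, the damping factor of 0.85, the iteration count and the handling of pages without outgoing links are assumptions made for this sketch; the specification only states that the sorting module uses the PageRank algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a rank for every page in `links`.

    `links` maps a page to the list of pages it links to. The damping
    factor and iteration count are assumed values, not taken from the text.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page without outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


def sort_by_importance(links):
    """Return pages ordered from most to least important."""
    rank = pagerank(links)
    return sorted(rank, key=rank.get, reverse=True)
```

In a crawler of this kind, `links` could be built from the outgoing links of the retained pages stored in the database, so that high-value pages come first in the result.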
As shown in Fig. 2, a web crawler method based on the theme of Party school research work comprises a web crawler working method, a feature extraction method, a vector space model, and a descriptor recording method in the database;
The web crawler working method comprises the following steps:
S01: the crawler module fetches a webpage;
S02: the topic correlativity analysis module is called to analyze the topic correlativity of the webpage;
S03: the crawler module rejects or retains the webpage according to the analysis result;
S04: the crawler module takes the URLs waiting to be processed out of the database;
S05: the sorting module ranks webpages by importance;
S06: the crawler module judges whether there are new URLs in the database;
If so, return to step S01 and repeat;
If not, end.
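The loop formed by steps S01 to S06 can be sketched as follows. This is a simplified illustration rather than the patented implementation: `crawl`, `fetch`, `relevance`, `threshold` and `max_pages` are hypothetical names, an in-memory queue stands in for the database, the relevance score stands in for the sorting module, and links found on rejected pages are not followed. The steps are also reordered slightly so that the loop fits in one function.

```python
from collections import deque


def crawl(seed_urls, fetch, relevance, threshold, max_pages=100):
    """Topic-focused crawl loop.

    fetch(url) must return (page_text, outgoing_links);
    relevance(page_text) must return a topic score.
    Both are supplied by the caller and are assumptions of this sketch.
    """
    pending = deque(seed_urls)   # stands in for the pending-URL table in the database
    seen = set(seed_urls)
    kept = {}                    # url -> score of each retained page
    while pending and len(kept) < max_pages:   # S06: continue while URLs remain
        url = pending.popleft()                # S04: take a pending URL
        text, links = fetch(url)               # S01: fetch the webpage
        score = relevance(text)                # S02: topic correlativity analysis
        if score < threshold:                  # S03: reject a low-scoring page
            continue
        kept[url] = score                      # S03: retain the page
        for link in links:
            if link not in seen:               # queue only URLs not seen before
                seen.add(link)
                pending.append(link)
    # S05: order retained pages; the score is used here in place of PageRank
    return sorted(kept, key=kept.get, reverse=True)
```

Passing `fetch` and `relevance` in as parameters keeps the loop independent of any particular HTTP library or scoring model, which also makes it easy to exercise with a small fake web.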
In the feature extraction method, a collection of webpages relevant to the theme is given, a program automatically extracts the features common to these webpages, and the weights are determined according to frequency.
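One way to picture this step is the sketch below, which counts terms across a set of theme-related pages and turns the counts into weights. The tokenization by runs of Latin letters, the `top_k` cutoff, the optional stopword set and the normalization by the highest count are all assumptions of this sketch; the specification does not say how features are tokenized or how frequency is mapped to a weight, and Chinese pages would need a word segmenter instead of the regular expression used here.

```python
import re
from collections import Counter


def extract_keyword_weights(pages, top_k=5, stopwords=frozenset()):
    """Return {keyword: weight} for the most frequent terms in `pages`.

    Weights are counts divided by the highest count, so the most frequent
    keyword gets weight 1.0 (an assumed normalization).
    """
    counts = Counter()
    for text in pages:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    if not counts:
        return {}
    top = counts.most_common(top_k)
    highest = top[0][1]
    return {term: count / highest for term, count in top}
```

The resulting dictionary can serve as the weighted keyword set that defines the theme.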
The vector space model comprises the following steps:
P01: take the number of keywords n as the dimension of the vector space and the weight $w_i$ of each keyword as the magnitude of the corresponding component; the theme is then expressed as a vector:
$\alpha = (a_1, a_2, \ldots, a_n)$, $i = 1, 2, \ldots, n$, $a_i = w_i$;
P02: analyze the page, count how often each keyword occurs, and compute the frequency ratios: the keyword with the highest frequency is taken as the benchmark with frequency $x_i = 1$, and the frequencies $x_i$ of the other keywords are obtained from their ratio to it. Each component of the vector corresponding to the page is then $x_i w_i$, and the page topic is expressed as a vector:
$\beta = (x_1 w_1, x_2 w_2, \ldots, x_n w_n)$, $i = 1, 2, \ldots, n$,
The topic correlativity of the page is expressed by the cosine of the angle between the two vectors:
$\cos\langle\alpha,\beta\rangle = \dfrac{\alpha \cdot \beta}{|\alpha|\,|\beta|} = \dfrac{\sum_{i=1}^{n} x_i w_i^2}{\sqrt{\sum_{i=1}^{n} w_i^2}\,\sqrt{\sum_{i=1}^{n} x_i^2 w_i^2}}$
P03: specify a threshold r; when $\cos\langle\alpha,\beta\rangle \ge r$, the page can be considered relevant to the theme. The value of r needs to be determined by experience and actual requirements.
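Steps P01 to P03 can be sketched as a small function that builds the two vectors and returns their cosine. The function names, the dictionary-based inputs and the default threshold of 0.7 are assumptions of this sketch; the specification leaves the value of r to experience and actual requirements.

```python
import math


def topic_correlativity(weights, page_counts):
    """Cosine between the theme vector and the page vector.

    weights: {keyword: w_i}; page_counts: {keyword: occurrences on the page}.
    """
    keywords = list(weights)
    highest = max((page_counts.get(k, 0) for k in keywords), default=0)
    if highest == 0:
        return 0.0  # none of the theme keywords occurs on the page
    alpha = [weights[k] for k in keywords]                    # P01: a_i = w_i
    x = [page_counts.get(k, 0) / highest for k in keywords]   # P02: top keyword has x_i = 1
    beta = [xi * wi for xi, wi in zip(x, alpha)]              # P02: components x_i * w_i
    dot = sum(a * b for a, b in zip(alpha, beta))
    norm = math.sqrt(sum(a * a for a in alpha)) * math.sqrt(sum(b * b for b in beta))
    return dot / norm if norm else 0.0


def is_relevant(weights, page_counts, r=0.7):
    """P03: compare the cosine with threshold r (0.7 is an assumed default)."""
    return topic_correlativity(weights, page_counts) >= r
```

For example, with weights `{"party": 1.0, "school": 0.5}`, a page on which both keywords occur equally often has a page vector equal to the theme vector, so the cosine is 1; a page that mentions only "party" scores lower because its vector has no "school" component.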
The descriptor recording method in the database comprises the following steps:
T01: build a dictionary of words commonly found in the URLs of Party school research work pages,
$W_{url}$ = (the communist party, party school, party history, ...), which contains the host names of some authoritative Party school research websites and common words; let the number of entries be d;
T02: split the URL on "/" and " ", remove marker tokens such as http and com, and extract the meaningful tokens ($word_1, word_2, \ldots, word_n$);
T03: the relevance score $R_{URL}$ calculated from the URL is
[formula not reproduced in the source text]
Where:
[definitions not reproduced in the source text]
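Steps T01 and T02 can be sketched as below. Because the scoring formula of step T03 is not reproduced in this text, the sketch stops at listing which URL tokens appear in the dictionary and does not compute $R_{URL}$. The dictionary entries, the marker list beyond "http" and "com", and the set of delimiter characters beyond "/" are hypothetical choices made for illustration.

```python
import re

# Hypothetical single-word dictionary in the spirit of W_url (T01).
W_URL = {"ccps", "party", "school", "history"}

# Marker tokens to discard (T02). Only "http" and "com" are named in the
# text; the remaining entries are assumptions.
MARKERS = {"http", "https", "www", "com", "cn", "gov", "html", "htm"}


def url_tokens(url):
    """Split a URL into lowercase tokens and drop marker tokens."""
    # The delimiter set other than "/" is an assumption of this sketch.
    parts = re.split(r"[/.:?=&_\-]+", url.lower())
    return [p for p in parts if p and p not in MARKERS]


def dictionary_hits(url, dictionary=W_URL):
    """Return the URL tokens that appear in the dictionary."""
    return [t for t in url_tokens(url) if t in dictionary]
```

A score could then be derived from the hits and the dictionary size d, but any such formula would be a guess here.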
In this specification, a description referring to the terms "one embodiment", "example", "specific example" and the like means that a particular feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. Such expressions do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended only to help illustrate the invention. They do not describe every detail exhaustively, nor do they limit the invention to the specific embodiments described. Obviously, many modifications and variations can be made in light of this specification. The embodiments were selected and described in detail in order to better explain the principles and practical application of the invention, so that those skilled in the art can better understand and use it. The invention is limited only by the claims and their full scope and equivalents.
Claims (8)
1. A web crawler system based on the theme of Party school research work, characterized by comprising an HTML document, an initial seed module, a crawler module, a database, a topic correlativity analysis module, a sorting module and a theme establishment module;
The theme establishment module is used to establish the theme that the crawler targets;
The topic correlativity analysis module is used to calculate the topic correlativity of webpages;
The initial seed module is used to generate good seed sites for the specific theme, so that the crawler module can start crawling smoothly;
The HTML document, the initial seed module, the database and the topic correlativity analysis module are each connected to the crawler module in real time; the sorting module is connected to the database in real time; the theme establishment module is connected to the topic correlativity analysis module in real time.
2. The web crawler system based on the theme of Party school research work according to claim 1, characterized in that the theme establishment module determines the theme using a keyword set, in which each keyword is assigned its own weight, and the weights are obtained by the feature extraction method.
3. The web crawler system based on the theme of Party school research work according to claim 1, characterized in that the topic correlativity analysis module is used to keep the webpages obtained by the crawler as close to the theme as possible, filtering the webpages fetched by the crawler module and rejecting those with low topic correlativity, and the topic correlativity calculation method used by the topic correlativity analysis module is the vector space model.
4. The web crawler system based on the theme of Party school research work according to claim 1, characterized in that the sorting module is used to rank webpages by importance, placing high-value webpages first so that they are more easily selected, and the ranking method used by the sorting module is the PageRank algorithm.
5. A web crawler method based on the theme of Party school research work according to any one of claims 1 to 4, characterized by comprising a web crawler working method, a feature extraction method, a vector space model, and a descriptor recording method in the database;
The web crawler working method comprises the following steps:
S01: the crawler module fetches a webpage;
S02: the topic correlativity analysis module is called to analyze the topic correlativity of the webpage;
S03: the crawler module rejects or retains the webpage according to the analysis result;
S04: the crawler module takes the URLs waiting to be processed out of the database;
S05: the sorting module ranks webpages by importance;
S06: the crawler module judges whether there are new URLs in the database;
If so, return to step S01 and repeat;
If not, end.
6. The web crawler method based on the theme of Party school research work according to claim 5, characterized in that, in the feature extraction method, a collection of webpages relevant to the theme is given, a program automatically extracts the features common to these webpages, and the weights are determined according to frequency.
7. The web crawler method based on the theme of Party school research work according to claim 5, characterized in that the vector space model comprises the following steps:
P01: take the number of keywords n as the dimension of the vector space and the weight $w_i$ of each keyword as the magnitude of the corresponding component; the theme is then expressed as a vector:
$\alpha = (a_1, a_2, \ldots, a_n)$, $i = 1, 2, \ldots, n$, $a_i = w_i$;
P02: analyze the page, count how often each keyword occurs, and compute the frequency ratios: the keyword with the highest frequency is taken as the benchmark with frequency $x_i = 1$, and the frequencies $x_i$ of the other keywords are obtained from their ratio to it. Each component of the vector corresponding to the page is then $x_i w_i$, and the page topic is expressed as a vector:
$\beta = (x_1 w_1, x_2 w_2, \ldots, x_n w_n)$, $i = 1, 2, \ldots, n$,
The topic correlativity of the page is expressed by the cosine of the angle between the two vectors:
$\cos\langle\alpha,\beta\rangle = \dfrac{\alpha \cdot \beta}{|\alpha|\,|\beta|} = \dfrac{\sum_{i=1}^{n} x_i w_i^2}{\sqrt{\sum_{i=1}^{n} w_i^2}\,\sqrt{\sum_{i=1}^{n} x_i^2 w_i^2}}$
P03: specify a threshold r; when $\cos\langle\alpha,\beta\rangle \ge r$, the page can be considered relevant to the theme. The value of r needs to be determined by experience and actual requirements.
8. The web crawler method based on the theme of Party school research work according to claim 5, characterized in that the descriptor recording method in the database comprises the following steps:
T01: build a dictionary of words commonly found in the URLs of Party school research work pages,
$W_{url}$ = (the communist party, party school, party history, ...), which contains the host names of some authoritative Party school research websites and common words; let the number of entries be d;
T02: split the URL on "/" and " ", remove marker tokens such as http and com, and extract the meaningful tokens ($word_1, word_2, \ldots, word_n$);
T03: the relevance score $R_{URL}$ calculated from the URL is
[formula not reproduced in the source text]
Where:
[definitions not reproduced in the source text]
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810736630.3A CN108959576A (en) | 2018-07-06 | 2018-07-06 | A web crawler system and method based on the theme of Party school research work |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810736630.3A CN108959576A (en) | 2018-07-06 | 2018-07-06 | A web crawler system and method based on the theme of Party school research work |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108959576A (en) | 2018-12-07 |
Family
ID=64482204
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810736630.3A Withdrawn CN108959576A (en) | 2018-07-06 | 2018-07-06 | A web crawler system and method based on the theme of Party school research work |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108959576A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110059235A (en) * | 2018-12-19 | 2019-07-26 | 远光软件股份有限公司 | A kind of crawl of Party building information resources, distribution, method for pushing and system |
CN110309246A (en) * | 2019-05-24 | 2019-10-08 | 中国地质调查局发展研究中心 | A kind of method and device thereof internet geologic data retrieval and obtained |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | WW01 | Invention patent application withdrawn after publication | Application publication date: 20181207 |