CN108959576A - Web crawler system and method based on the Party school research work theme - Google Patents

Web crawler system and method based on the Party school research work theme

Info

Publication number
CN108959576A
Authority
CN
China
Prior art keywords
module
theme
crawler
webpage
research work
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201810736630.3A
Other languages
Chinese (zh)
Inventor
徐玉红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Minggao Software Technology Co Ltd
Original Assignee
Hefei Minggao Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Minggao Software Technology Co Ltd
Priority to CN201810736630.3A
Publication of CN108959576A
Legal status: Withdrawn

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a web crawler system and method based on the Party school research work theme, and relates to the technical field of Internet search engines. The web crawler system comprises an initial seed module, a crawler module, a database, a theme relevance analysis module, a sorting module, and a theme establishment module. The web crawler working method comprises: 1. the crawler module fetches a web page; 2. the relevance analysis module is called to analyze the relevance of the page; 3. the crawler module rejects or retains the page according to the result of the analysis; 4. the crawler module takes the URLs awaiting processing out of the database; 5. the sorting module ranks the pages by importance; 6. the crawler module checks whether the database contains new URLs. By establishing a search engine for the Party school research work theme and using the theme relevance analysis module for theme optimization and web page filtering, the invention improves the relevance of Party school research work web searches and the precision of the retrieved information.

Description

Web crawler system and method based on the Party school research work theme
Technical field
The invention belongs to the technical field of Internet search engines, and in particular relates to a web crawler system and method based on the Party school research work theme.
Background art
Traditional general-purpose search engines face enormous challenges. First, Web information resources grow geometrically, and search engines cannot index all pages. Second, users in different fields have different search needs, and a "broad and general" search engine cannot satisfy professional users' need for "specialized and precise" search. In response to these challenges, "topical search engines" aimed at specific groups of users have emerged.
At the same time, with the continuous development of Party school research work in China, Party school research resources already exceed the TB scale, yet no effective information retrieval approach has been established. For example, using Baidu to search the website of the Party School of the CPC Central Committee (http://www.ccps.gov.cn/) for "Marxist Contemporary Value" returns 0 results. The Party school research field therefore needs its own topical search engine, and in view of the above problems, providing a web crawler system and method based on the Party school research work theme is of great significance.
Summary of the invention
The purpose of the invention is to provide a web crawler system and method based on the Party school research work theme. Building on the Shark-Search algorithm, with improvements tailored to the characteristics of Party school research work, the invention establishes a search engine for the Party school research work theme. The theme is established with keywords, each of which carries a specified weight, and the theme relevance analysis module performs theme optimization and web page filtering. This addresses the low search relevance and low information precision of existing web searches on Party school research work topics.
To solve the above technical problems, the invention is realized by the following technical solutions:
A web crawler system based on the Party school research work theme comprises an HTML document, an initial seed module, a crawler module, a database, a theme relevance analysis module, a sorting module, and a theme establishment module;
The theme establishment module is used to establish the theme targeted by the crawler;
The theme relevance analysis module is used to calculate the theme relevance of web pages;
The initial seed module is used to generate good seed sites for the specific theme, so that the crawler module can start crawling smoothly;
The HTML document, the initial seed module, the database, and the relevance analysis module are each connected to the crawler module in real time; the sorting module is connected to the database in real time; the theme establishment module is connected to the relevance analysis module in real time.
Further, the theme establishment module determines the theme using a keyword set, in which each keyword has a specified weight; the weights are obtained with a feature extraction method.
Further, the theme relevance analysis module ensures that the pages obtained by the crawler stay as close to the theme as possible: it filters the pages fetched by the crawler module and rejects those with low theme relevance. The relevance calculation method it uses is the vector space model.
Further, the sorting module ranks web pages by importance and places the more valuable pages first so that they are more easily selected; the ranking method it uses is the PageRank algorithm.
A web crawler method based on the Party school research work theme comprises a web crawler working method, a feature extraction method, a vector space model, and a method for recording theme words in the database;
The web crawler working method comprises the following steps:
S01: the crawler module fetches a web page;
S02: the relevance analysis module is called to analyze the relevance of the page;
S03: the crawler module rejects or retains the page according to the result of the analysis;
S04: the crawler module takes the URLs awaiting processing out of the database;
S05: the sorting module ranks the pages by importance;
S06: the crawler module checks whether the database contains new URLs;
If so, return to step S01 and repeat;
If not, end.
Further, the feature extraction method takes a given collection of web pages relevant to the theme, has a program automatically extract the features common to these pages, and determines the weights according to frequency.
Further, the vector space model comprises the following steps:
P01: Take the number n of keywords as the dimension of the vector space and the weight $w_i$ of each keyword as the size of the corresponding component; the theme is then expressed as a vector:
$\alpha = (a_1, a_2, \ldots, a_n)$, $a_i = w_i$, $i = 1, 2, \ldots, n$;
P02: Analyze the page, count how often each keyword occurs, and compute the frequency ratios. The most frequent keyword serves as the baseline with frequency $x_i = 1$, and the frequencies $x_i$ of the other keywords follow from the ratios. Each component of the page's vector is then $x_i w_i$, and the page theme is expressed as a vector:
$\beta = (x_1 w_1, x_2 w_2, \ldots, x_n w_n)$, $i = 1, 2, \ldots, n$;
The theme relevance of the page is expressed by the cosine of the angle between the two vectors:
$\cos\langle\alpha,\beta\rangle = \dfrac{\sum_{i=1}^{n} a_i x_i w_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} (x_i w_i)^2}}$;
P03: Specify a threshold r; when $\cos\langle\alpha,\beta\rangle \ge r$, the page is considered relevant to the theme. The value of r is determined by experience and actual requirements.
Further, the method for recording theme words in the database comprises the following steps:
T01: Establish a dictionary of words commonly found in the URLs of Party school research work pages,
$W_{url}$ = (the communist party, party school, party history, ...), which includes the host names of some authoritative Party school research websites and common words; let the number of entries be d;
T02: Split the URL on "/" and ".", remove marker strings such as http and com, and extract the meaningful phrases $(word_1, word_2, \ldots, word_n)$;
T03: Compute the relevance score $R_{URL}$ from the URL (the formula and its symbol definitions are not reproduced in this text).
The invention has the following beneficial effects:
Building on the Shark-Search algorithm, with improvements tailored to the characteristics of Party school research work, the invention establishes a search engine for the Party school research work theme. The theme is established with keywords, each of which carries a specified weight, and the theme relevance analysis module performs theme optimization and web page filtering. This improves the relevance of Party school research work web searches and the precision of the retrieved information, and helps advance the informatization of Party school research work in China.
Of course, a product implementing the invention need not achieve all of the above advantages at the same time.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings needed to describe the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of the web crawler system based on the Party school research work theme of the invention;
Fig. 2 is a block diagram of the working method of the web crawler based on the Party school research work theme of the invention.
Specific embodiment
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the invention.
Referring to Fig. 1, a web crawler system based on the Party school research work theme comprises an HTML document, an initial seed module, a crawler module, a database, a theme relevance analysis module, a sorting module, and a theme establishment module;
The theme establishment module is used to establish the theme targeted by the crawler;
The theme relevance analysis module is used to calculate the theme relevance of web pages;
The initial seed module is used to generate good seed sites for the specific theme, so that the crawler module can start crawling smoothly;
The HTML document, the initial seed module, the database, and the relevance analysis module are each connected to the crawler module in real time; the sorting module is connected to the database in real time; the theme establishment module is connected to the relevance analysis module in real time.
Here, the theme establishment module determines the theme using a keyword set, in which each keyword has a specified weight; the weights are obtained with a feature extraction method.
Here, the theme relevance analysis module ensures that the pages obtained by the crawler stay as close to the theme as possible: it filters the pages fetched by the crawler module and rejects those with low theme relevance. The relevance calculation method it uses is the vector space model.
Here, the sorting module ranks web pages by importance and places the more valuable pages first so that they are more easily selected; the ranking method it uses is the PageRank algorithm.
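The ranking step can be illustrated with a minimal PageRank sketch. The text names the algorithm but gives no parameters, so the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without outgoing links, and the function names `pagerank` and `sort_by_importance` are all assumptions made for illustration, not details of the patented system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults (assumptions).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no outgoing links: spread its rank evenly (assumption).
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


def sort_by_importance(links):
    """Return pages ordered from most to least important."""
    rank = pagerank(links)
    return sorted(rank, key=rank.get, reverse=True)
```

With a link graph in which pages `a` and `b` both link to `c` and `c` links back to `a`, `sort_by_importance` places `c` first, since it receives the most incoming rank.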
As shown in Fig. 2, a web crawler method based on the Party school research work theme comprises a web crawler working method, a feature extraction method, a vector space model, and a method for recording theme words in the database;
The web crawler working method comprises the following steps:
S01: the crawler module fetches a web page;
S02: the relevance analysis module is called to analyze the relevance of the page;
S03: the crawler module rejects or retains the page according to the result of the analysis;
S04: the crawler module takes the URLs awaiting processing out of the database;
S05: the sorting module ranks the pages by importance;
S06: the crawler module checks whether the database contains new URLs;
If so, return to step S01 and repeat;
If not, end.
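Steps S01 to S06 can be sketched as a single loop. This is only an illustration under stated assumptions: an in-memory queue stands in for the database, the callables `fetch`, `relevance`, and `extract_links` are hypothetical hooks supplied by the caller, `threshold` and `max_pages` are invented parameters, and the final ordering by relevance score is a simple stand-in for the PageRank-based sorting module.

```python
from collections import deque


def crawl(seed_urls, fetch, relevance, extract_links, threshold=0.5, max_pages=100):
    """Focused-crawl loop; all parameter names are hypothetical."""
    pending = deque(seed_urls)   # stands in for the database of URLs awaiting processing
    seen = set(seed_urls)
    kept = {}                    # url -> relevance score of retained pages
    fetched = 0
    while pending and fetched < max_pages:   # S06: any URLs left to process?
        url = pending.popleft()              # S04: take a pending URL
        page = fetch(url)                    # S01: fetch the page
        fetched += 1
        if page is None:
            continue
        score = relevance(page)              # S02: relevance analysis
        if score < threshold:                # S03: reject low-relevance pages
            continue
        kept[url] = score                    # S03: retain the page
        for link in extract_links(page):     # only links on retained pages are followed
            if link not in seen:
                seen.add(link)
                pending.append(link)
    # S05: rank retained pages (by relevance here; the text uses PageRank)
    return sorted(kept, key=kept.get, reverse=True)
```

Because links are queued only from retained pages, a rejected page also prunes everything reachable solely through it, which is what keeps the crawl close to the theme.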
Here, the feature extraction method takes a given collection of web pages relevant to the theme, has a program automatically extract the features common to these pages, and determines the weights according to frequency.
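One plausible reading of frequency-based weighting is sketched below. The text does not say how frequency maps to weight, so the choices here are assumptions: terms are counted across the whole collection, the `top_k` most frequent become keywords, and weights are normalized to sum to 1. Whitespace tokenization is used only to keep the sketch self-contained; Chinese pages would need a word segmenter. The name `extract_keyword_weights` is hypothetical.

```python
from collections import Counter


def extract_keyword_weights(pages, top_k=5, stopwords=frozenset()):
    """Derive keyword weights from a collection of theme-relevant page texts.

    The weight of a keyword is its share of the occurrences of the
    top_k most frequent terms (an assumed normalization).
    """
    counts = Counter()
    for text in pages:
        counts.update(w for w in text.lower().split() if w not in stopwords)
    top = counts.most_common(top_k)
    total = sum(c for _, c in top)
    return {word: c / total for word, c in top}
```

For the three sample texts "party school research", "party school", and "party history" with `top_k=2`, the sketch yields a weight of 0.6 for "party" and 0.4 for "school".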
Here, the vector space model comprises the following steps:
P01: Take the number n of keywords as the dimension of the vector space and the weight $w_i$ of each keyword as the size of the corresponding component; the theme is then expressed as a vector:
$\alpha = (a_1, a_2, \ldots, a_n)$, $a_i = w_i$, $i = 1, 2, \ldots, n$;
P02: Analyze the page, count how often each keyword occurs, and compute the frequency ratios. The most frequent keyword serves as the baseline with frequency $x_i = 1$, and the frequencies $x_i$ of the other keywords follow from the ratios. Each component of the page's vector is then $x_i w_i$, and the page theme is expressed as a vector:
$\beta = (x_1 w_1, x_2 w_2, \ldots, x_n w_n)$, $i = 1, 2, \ldots, n$;
The theme relevance of the page is expressed by the cosine of the angle between the two vectors:
$\cos\langle\alpha,\beta\rangle = \dfrac{\sum_{i=1}^{n} a_i x_i w_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} (x_i w_i)^2}}$;
P03: Specify a threshold r; when $\cos\langle\alpha,\beta\rangle \ge r$, the page is considered relevant to the theme. The value of r is determined by experience and actual requirements.
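Steps P01 to P03 can be sketched directly from the definitions above, using the standard cosine of the angle between two vectors. The function names `theme_relevance` and `is_relevant`, the whitespace tokenization, and the default threshold of 0.8 are assumptions for illustration; the text leaves r to experience and actual requirements.

```python
import math
from collections import Counter


def theme_relevance(page_text, weights):
    """Cosine between the theme vector alpha and the page vector beta.

    weights maps each keyword to its weight w_i (so a_i = w_i).
    x_i is the keyword count divided by the count of the most
    frequent keyword, so the most frequent keyword has x_i = 1.
    """
    keywords = list(weights)
    counts = Counter(page_text.lower().split())
    max_count = max((counts[k] for k in keywords), default=0)
    if max_count == 0:
        return 0.0  # no keyword occurs on the page
    alpha = [weights[k] for k in keywords]                         # P01
    beta = [counts[k] / max_count * weights[k] for k in keywords]  # P02
    dot = sum(a * b for a, b in zip(alpha, beta))
    norm = math.sqrt(sum(a * a for a in alpha)) * math.sqrt(sum(b * b for b in beta))
    return dot / norm if norm else 0.0


def is_relevant(page_text, weights, r=0.8):
    """P03: keep the page when the cosine reaches the threshold r (assumed default)."""
    return theme_relevance(page_text, weights) >= r
```

A page in which all keywords occur equally often has $\beta = \alpha$ and therefore a cosine of 1, while a page containing none of the keywords scores 0.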
Here, the method for recording theme words in the database comprises the following steps:
T01: Establish a dictionary of words commonly found in the URLs of Party school research work pages,
$W_{url}$ = (the communist party, party school, party history, ...), which includes the host names of some authoritative Party school research websites and common words; let the number of entries be d;
T02: Split the URL on "/" and ".", remove marker strings such as http and com, and extract the meaningful phrases $(word_1, word_2, \ldots, word_n)$;
T03: Compute the relevance score $R_{URL}$ from the URL (the formula and its symbol definitions are not reproduced in this text).
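Steps T01 and T02 can be sketched as follows. Because the scoring formula of T03 is not available in this text, the sketch stops at counting how many extracted phrases appear in the dictionary; that count is a stand-in chosen for illustration and is not the patent's $R_{URL}$. The dictionary entries, the marker list, the additional split on ":", and the function names are all hypothetical.

```python
import re

# T01: hypothetical single-token dictionary of words common in relevant URLs.
URL_DICTIONARY = {"communist", "party", "school", "history", "ccps"}

# Marker strings removed in T02; the text names http and com, the rest are assumed.
MARKERS = {"http", "https", "www", "com", "cn", "gov", "org", "net", "html", "htm"}


def url_phrases(url):
    """T02: split the URL on '/', '.' and ':' and drop marker strings."""
    tokens = re.split(r"[/.:]+", url.lower())
    return [t for t in tokens if t and t not in MARKERS]


def dictionary_hits(url, dictionary=URL_DICTIONARY):
    """Count extracted phrases found in the dictionary (illustrative stand-in for T03)."""
    return sum(1 for word in url_phrases(url) if word in dictionary)
```

A URL-based signal of this kind is cheap to compute before a page is fetched, which is presumably why the method scores URLs separately from page content.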
In this specification, reference to the terms "one embodiment", "example", "specific example", and the like means that a particular feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. Such expressions do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments disclosed above are intended only to help explain the invention. They do not describe every detail exhaustively, nor do they limit the invention to the specific embodiments described. Obviously, many modifications and variations can be made in light of this specification. The embodiments were chosen and described in detail to better explain the principles and practical application of the invention, so that those skilled in the art can understand and use it well. The invention is limited only by the claims and their full scope and equivalents.

Claims (8)

1. A web crawler system based on the Party school research work theme, characterized by comprising an HTML document, an initial seed module, a crawler module, a database, a theme relevance analysis module, a sorting module, and a theme establishment module;
The theme establishment module is used to establish the theme targeted by the crawler;
The theme relevance analysis module is used to calculate the theme relevance of web pages;
The initial seed module is used to generate good seed sites for the specific theme, so that the crawler module can start crawling smoothly;
The HTML document, the initial seed module, the database, and the relevance analysis module are each connected to the crawler module in real time; the sorting module is connected to the database in real time; the theme establishment module is connected to the relevance analysis module in real time.
2. The web crawler system based on the Party school research work theme according to claim 1, characterized in that the theme establishment module determines the theme using a keyword set, in which each keyword has a specified weight, and the weights are obtained with a feature extraction method.
3. The web crawler system based on the Party school research work theme according to claim 1, characterized in that the theme relevance analysis module ensures that the pages obtained by the crawler stay as close to the theme as possible, filters the pages fetched by the crawler module, and rejects those with low theme relevance, and the relevance calculation method used by the theme relevance analysis module is the vector space model.
4. The web crawler system based on the Party school research work theme according to claim 1, characterized in that the sorting module ranks web pages by importance and places the more valuable pages first so that they are more easily selected, and the ranking method used by the sorting module is the PageRank algorithm.
5. A web crawler method based on the Party school research work theme according to any one of claims 1 to 4, characterized by comprising a web crawler working method, a feature extraction method, a vector space model, and a method for recording theme words in the database;
The web crawler working method comprises the following steps:
S01: the crawler module fetches a web page;
S02: the relevance analysis module is called to analyze the relevance of the page;
S03: the crawler module rejects or retains the page according to the result of the analysis;
S04: the crawler module takes the URLs awaiting processing out of the database;
S05: the sorting module ranks the pages by importance;
S06: the crawler module checks whether the database contains new URLs;
If so, return to step S01 and repeat;
If not, end.
6. The web crawler method based on the Party school research work theme according to claim 5, characterized in that the feature extraction method takes a given collection of web pages relevant to the theme, has a program automatically extract the features common to these pages, and determines the weights according to frequency.
7. The web crawler method based on the Party school research work theme according to claim 5, characterized in that the vector space model comprises the following steps:
P01: Take the number n of keywords as the dimension of the vector space and the weight $w_i$ of each keyword as the size of the corresponding component; the theme is then expressed as a vector:
$\alpha = (a_1, a_2, \ldots, a_n)$, $a_i = w_i$, $i = 1, 2, \ldots, n$;
P02: Analyze the page, count how often each keyword occurs, and compute the frequency ratios. The most frequent keyword serves as the baseline with frequency $x_i = 1$, and the frequencies $x_i$ of the other keywords follow from the ratios. Each component of the page's vector is then $x_i w_i$, and the page theme is expressed as a vector:
$\beta = (x_1 w_1, x_2 w_2, \ldots, x_n w_n)$, $i = 1, 2, \ldots, n$;
The theme relevance of the page is expressed by the cosine of the angle between the two vectors:
$\cos\langle\alpha,\beta\rangle = \dfrac{\sum_{i=1}^{n} a_i x_i w_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} (x_i w_i)^2}}$;
P03: Specify a threshold r; when $\cos\langle\alpha,\beta\rangle \ge r$, the page is considered relevant to the theme. The value of r is determined by experience and actual requirements.
8. The web crawler method based on the Party school research work theme according to claim 5, characterized in that the method for recording theme words in the database comprises the following steps:
T01: Establish a dictionary of words commonly found in the URLs of Party school research work pages,
$W_{url}$ = (the communist party, party school, party history, ...), which includes the host names of some authoritative Party school research websites and common words; let the number of entries be d;
T02: Split the URL on "/" and ".", remove marker strings such as http and com, and extract the meaningful phrases $(word_1, word_2, \ldots, word_n)$;
T03: Compute the relevance score $R_{URL}$ from the URL (the formula and its symbol definitions are not reproduced in this text).
CN201810736630.3A 2018-07-06 2018-07-06 A kind of network crawler system and method based on Party school's research work theme Withdrawn CN108959576A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810736630.3A CN108959576A (en) 2018-07-06 2018-07-06 A kind of network crawler system and method based on Party school's research work theme

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810736630.3A CN108959576A (en) 2018-07-06 2018-07-06 A kind of network crawler system and method based on Party school's research work theme

Publications (1)

Publication Number Publication Date
CN108959576A true CN108959576A (en) 2018-12-07

Family

ID=64482204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810736630.3A Withdrawn CN108959576A (en) 2018-07-06 2018-07-06 A kind of network crawler system and method based on Party school's research work theme

Country Status (1)

Country Link
CN (1) CN108959576A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059235A (en) * 2018-12-19 2019-07-26 远光软件股份有限公司 A kind of crawl of Party building information resources, distribution, method for pushing and system
CN110309246A (en) * 2019-05-24 2019-10-08 中国地质调查局发展研究中心 A kind of method and device thereof internet geologic data retrieval and obtained

Similar Documents

Publication Publication Date Title
Cai et al. Extracting content structure for web pages based on visual representation
CN105488024B (en) The abstracting method and device of Web page subject sentence
US8312035B2 (en) Search engine enhancement using mined implicit links
CN101630327A (en) Design method of theme network crawler system
CN102722558B (en) A kind of method and apparatus recommending for user to put question to
CN101231661A (en) Method and system for digging object grade knowledge
CN101261629A (en) Specific information searching method based on automatic classification technology
CN103631794A (en) Method, device and equipment for sorting search results
Prajapati A survey paper on hyperlink-induced topic search (HITS) algorithms for web mining
CN110012122A (en) A kind of domain name similarity analysis method of word-based embedded technology
Bun et al. Emerging topic tracking system
CN108959576A (en) A kind of network crawler system and method based on Party school's research work theme
Jalal Exploring web link analysis of websites of Indian Institute of Technology
Chopra et al. A survey on improving the efficiency of different web structure mining algorithms
CN103838786A (en) Web data automatic collecting method
CN103902687B (en) The generation method and device of a kind of Search Results
Srinath Page ranking algorithms–a comparison
Nithya Link Analysis Algorithm for Web Structure Mining
Yuan et al. Improvement of pagerank for focused crawler
Divya et al. Onto-search: An ontology based personalized mobile search engine
Ma et al. Searching Tourism Information by Using Vertical Search Engine Based on Nutch and Solr
Zhang et al. Research and implementation of keyword extraction algorithm based on professional background knowledge
Fernández et al. Novelty detection using local context analysis
Balaji et al. TOPCRAWL: Community mining in web search engines with emphasize on topical crawling
Ni et al. Web information recommendation based on user behaviors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20181207)