CN104182462B - A web crawler service system for room library net - Google Patents

Info

Publication number
CN104182462B
Authority
CN
China
Prior art keywords
website
web crawler
crawler
module
service module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410347463.5A
Other languages
Chinese (zh)
Other versions
CN104182462A (en)
Inventor
贾岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Original Assignee
ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority to CN201410347463.5A
Publication of CN104182462A
Application granted
Publication of CN104182462B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention proposes a web crawler service system for room library net that can quickly mine websites and extract property-related data. The system comprises: a website crawler module composed of multiple website crawlers, where each website crawler corresponds one-to-one with a website, parses that website's page elements, extracts website data, performs semantic analysis on it, and maps it into preset data entities for storage; a monitoring service module for monitoring the working state of each website crawler and judging whether the crawler is working normally and whether the data is captured correctly; a management service module for configuring the working parameters of the website crawlers, upgrading the website crawlers, and managing the start and stop of the service system and the life cycle and work of the website crawlers; a deployment service module for allocating and deploying the website crawlers; and a dispatch service module, with an embedded web crawler scheduling mode, for scheduling and managing the working mode, working time, and stopping of the website crawlers.

Description

A web crawler service system for room library net
Technical field
The present invention relates to the technical field of website data mining, and in particular to a web crawler service system for room library net.
Background art
The real estate industry bears directly on people's livelihood. The residential market is entering the era of existing housing stock, and many owners of existing homes are not professional salespeople, so the sale information they provide is not comprehensive enough. At the same time, government housing records are still managed on paper, and data on residents and property are scattered across various units and departments. This makes it inconvenient for the relevant departments to manage residents and property, and it means that valid data cannot be fully used. Ordinary people choosing a home and enterprises choosing office space face a serious shortage of professional, detailed information services.
In this broader social context, promoting property information so that home buyers can query it easily and property transactions can be completed is of great significance. Property information requires a large database able to hold "ten-thousand-ton train" volumes of property-related data. The basis of such a database is data mining. Today, however, information spreads at high speed and junk information is everywhere, so how to mine website data quickly and effectively has long been a hot issue, and no satisfactory solution has yet been found.
Summary of the invention
To address the problems described in the background art, the present invention proposes a web crawler service system for room library net that can quickly mine websites and effectively extract property-related data.
The web crawler service system for room library net proposed by the present invention is characterized in that it comprises:
A website crawler module, composed of multiple website crawlers, where each website crawler corresponds one-to-one with a website and parses the page elements of that website; the website crawler extracts website data, performs semantic analysis on it, and maps it into preset data entities for storage;
A monitoring service module, for monitoring the working state of each website crawler and judging whether the crawler is working normally and whether the data is captured correctly;
A management service module, for configuring the working parameters of the website crawlers, upgrading the website crawlers, and managing the start and stop of the service system and the life cycle and work of the website crawlers;
A deployment service module, for allocating and deploying the website crawlers;
A dispatch service module, with an embedded web crawler scheduling mode, for scheduling and managing the working mode, working time, and stopping of the website crawlers;
The website crawler module is connected to the monitoring service module, the management service module, the deployment service module, and the dispatch service module, respectively; the monitoring service module is connected to the management service module; and the management service module is connected to the deployment service module and the dispatch service module, respectively;
In operation, the dispatch service module schedules and manages the working mode, working time, and stopping of the website crawlers; the deployment service module calls website crawlers from the website crawler module to mine data from their corresponding websites; and the monitoring service module monitors the working state of the website crawlers. When an individual website crawler works abnormally, the monitoring service module notifies the management service module to adjust the parameters and working mode of the abnormal crawler. When the abnormal website crawlers reach or exceed a threshold a, the monitoring service module notifies the management service module to halt the system's capture of website data; the management service module then notifies the dispatch service module and the deployment service module to reschedule and redeploy the website crawlers, after which website data mining resumes under the monitoring of the monitoring service module, and the cycle repeats.
Preferably, the threshold a is the ratio of abnormal website crawlers to the total number of website crawlers that have been deployed.
Preferably, the value range of a is [0.1, 1].
Preferably, a = 0.5.
Preferably, a is the number of abnormal website crawlers.
Preferably, the value range of a is [100, 10000].
Preferably, the value of a is directly proportional to the number of website crawlers that have been deployed.
Preferably, the value of a can be set manually or generated automatically by the system.
Preferably, the website crawler is a focused crawler.
In the present invention, data mining is carried out by website crawlers that correspond one-to-one with websites, so the working speed is high. The mined data is stored after semantic analysis and mapping to preset data entities, which effectively removes irrelevant and duplicate information, raises the value of the stored data, and reduces the storage space it occupies. Management of the website crawlers is convenient and user-friendly: the crawlers can be monitored and adjusted automatically, or regulated manually, which ensures the timeliness, validity, and accuracy of website data mining.
Description of the drawings
Fig. 1 is a structural diagram of the web crawler service system for room library net proposed by the present invention.
Specific embodiment
Referring to Fig. 1, the web crawler service system for room library net proposed by the present invention comprises: a website crawler module, a monitoring service module, a management service module, a deployment service module, and a dispatch service module. The website crawler module is connected to the monitoring service module, the management service module, the deployment service module, and the dispatch service module, respectively; the monitoring service module is connected to the management service module; and the management service module is connected to the deployment service module and the dispatch service module, respectively.
The website crawler module is composed of multiple website crawlers. Each website crawler corresponds one-to-one with a website and parses the page elements of that website; it extracts website data, performs semantic analysis on it, and maps it into preset data entities for storage. In this embodiment, data mining is carried out by website crawlers that correspond one-to-one with websites, so the working speed is high. The mined data is stored after semantic analysis and mapping to preset data entities, which effectively removes irrelevant and duplicate information, raises the value of the stored data, and reduces the storage space it occupies. The website crawler is a focused crawler that mines only property-related information.
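The mapping of parsed page elements into a preset data entity might be sketched as follows. This is only an illustration under stated assumptions: the entity `ListingEntity`, its fields, the `title`/`price` class names, and the function `extract_entities` are all hypothetical and do not come from the patent, and the semantic analysis step is reduced here to a plain field lookup.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass(frozen=True)
class ListingEntity:
    """Hypothetical preset data entity for one property listing."""
    title: str
    price: str


class ListingParser(HTMLParser):
    """Collects the text of elements whose class attribute names an entity field."""

    FIELDS = {"title", "price"}  # assumed page-element class names for one site

    def __init__(self):
        super().__init__()
        self._field = None   # field currently being read, if any
        self._current = {}   # partially filled row
        self.rows = []       # completed rows

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.FIELDS:
            self._field = cls

    def handle_data(self, data):
        if self._field is None:
            return  # text outside the fields of interest is ignored
        self._current[self._field] = data.strip()
        self._field = None
        if self.FIELDS.issubset(self._current):
            self.rows.append(self._current)
            self._current = {}


def extract_entities(html):
    """Map parsed page elements onto entities, dropping exact duplicates."""
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    seen, entities = set(), []
    for row in parser.rows:
        entity = ListingEntity(**row)
        if entity not in seen:  # frozen dataclass instances are hashable
            seen.add(entity)
            entities.append(entity)
    return entities
```

Because each crawler serves exactly one website, a parser like this can hard-code that site's page structure; text outside the named fields never reaches the stored entities.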
The monitoring service module monitors the working state of each website crawler and judges whether the crawler is working normally and whether the data is captured correctly, so that operations and development staff can learn the working state of the website crawlers promptly and make adjustments.
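The patent does not state how the monitoring service module decides that a crawler is abnormal. One possible check, given purely as an assumption, combines a liveness test with a data-correctness test; the report fields and both limits below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CrawlReport:
    """Hypothetical status report sent by one website crawler."""
    site: str
    seconds_since_last_run: float
    records_fetched: int
    records_valid: int


def is_abnormal(report, max_idle_seconds=3600.0, min_valid_ratio=0.9):
    """Return True if the crawler looks stalled or its data looks wrong.

    Both limits are illustrative defaults, not values from the patent.
    """
    if report.seconds_since_last_run > max_idle_seconds:
        return True  # crawler has not run recently: working state is abnormal
    if report.records_fetched == 0:
        return True  # nothing captured, e.g. the page structure changed
    valid_ratio = report.records_valid / report.records_fetched
    return valid_ratio < min_valid_ratio  # too many records failed validation
```

The count of crawlers flagged this way is the quantity that would be compared against the threshold a.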
The management service module configures the working parameters of the website crawlers, upgrades the website crawlers, and manages the start and stop of the service system and the life cycle and work of the website crawlers. The website crawlers are the part of the system that most needs real-time updating. If the page elements, authentication mode, or the like of a crawled website change, the corresponding website crawler must be upgraded accordingly to ensure that the captured content is accurate. Based on the monitoring results of the monitoring service module, operations and development staff can upgrade the website crawlers promptly through the management service module, ensuring that the crawlers remain effective in real time.
The deployment service module allocates and deploys the website crawlers so that each website crawler is responsible for capturing the data of one, and only one, corresponding website, which improves data mining efficiency and avoids duplication. The deployment service module exists to make website crawlers easier to deploy: after developers upgrade a website crawler component, it can be deployed conveniently and quickly.
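The one-to-one allocation rule can be illustrated with a small registry that refuses to assign a second crawler to a site or a second site to a crawler. This is a minimal sketch; the class `DeploymentRegistry` and its method names are hypothetical, not taken from the patent.

```python
class DeploymentRegistry:
    """Keeps a strict one-to-one mapping between crawlers and websites."""

    def __init__(self):
        self._site_to_crawler = {}
        self._crawler_to_site = {}

    def deploy(self, crawler_id, site):
        """Assign a crawler to a site, rejecting any duplicate assignment."""
        if site in self._site_to_crawler:
            raise ValueError(f"{site} already has a crawler")
        if crawler_id in self._crawler_to_site:
            raise ValueError(f"{crawler_id} is already deployed")
        self._site_to_crawler[site] = crawler_id
        self._crawler_to_site[crawler_id] = site

    def upgrade(self, site, new_crawler_id):
        """Swap in an upgraded crawler for a site that is already covered."""
        old_crawler_id = self._site_to_crawler.pop(site)
        del self._crawler_to_site[old_crawler_id]
        self.deploy(new_crawler_id, site)

    def crawler_for(self, site):
        """Return the crawler currently responsible for a site."""
        return self._site_to_crawler[site]
```

Keeping both directions of the mapping makes the duplicate checks constant-time and lets an upgraded crawler replace its predecessor in a single call.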
The dispatch service module, with an embedded web crawler scheduling mode, schedules and manages the working mode, working time, and stopping of the website crawlers. This module can be used to adjust website crawlers quickly and in batches, which improves the efficiency of configuring them, reduces blank time, prevents data from being missed, and improves the completeness of website data mining.
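Batch adjustment of working time and stopping might look like the sketch below. It assumes a simple fixed-interval schedule, which the patent does not specify; the class `CrawlSchedule` and its methods are hypothetical, and times are plain numbers of seconds supplied by the caller.

```python
class CrawlSchedule:
    """Fixed-interval schedule that can be adjusted for many crawlers at once."""

    def __init__(self):
        self._interval = {}    # crawler id -> seconds between runs
        self._next_run = {}    # crawler id -> time of the next run
        self._stopped = set()  # crawlers that are currently halted

    def set_interval(self, crawler_ids, seconds, now=0.0):
        """Set the same run interval for a whole batch of crawlers."""
        for crawler_id in crawler_ids:
            self._interval[crawler_id] = seconds
            self._next_run[crawler_id] = now

    def stop(self, crawler_ids):
        """Halt a batch of crawlers without discarding their settings."""
        self._stopped.update(crawler_ids)

    def resume(self, crawler_ids):
        """Let previously halted crawlers run again."""
        self._stopped.difference_update(crawler_ids)

    def pop_due(self, now):
        """Return the crawlers due at `now` and book each one's next run."""
        due = sorted(
            crawler_id
            for crawler_id, run_at in self._next_run.items()
            if run_at <= now and crawler_id not in self._stopped
        )
        for crawler_id in due:
            self._next_run[crawler_id] = now + self._interval[crawler_id]
        return due
```

A shorter interval reduces the blank time between two captures of the same site, at the cost of more load on that site and on the system.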
When this system works, the dispatch service module schedules and manages the working mode, working time, and stopping of the website crawlers; the deployment service module calls website crawlers from the website crawler module to mine data from their corresponding websites; and the monitoring service module monitors the working state of the website crawlers. When an individual website crawler works abnormally, the monitoring service module notifies the management service module to adjust the parameters and working mode of the abnormal crawler. When the abnormal website crawlers reach or exceed the threshold a, the monitoring service module notifies the management service module to halt the system's capture of website data; the management service module then notifies the dispatch service module and the deployment service module to reschedule and redeploy the website crawlers, after which website data mining resumes under the monitoring of the monitoring service module, and the cycle repeats.
In this system, the threshold a is the ratio of abnormal website crawlers to the total number of deployed website crawlers, with a = 0.5. When the ratio is below 0.5, the abnormal website crawlers are adjusted by the management service module; when the ratio is above 0.5, they are adjusted by the deployment service module and the dispatch service module. In a specific implementation, the value range of a can be set to [0.1, 1].
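The ratio-based decision can be written as a short function. This is a sketch only: the function name and the returned action labels are hypothetical, and the boundary case (a ratio exactly equal to a) is treated as reaching the threshold, following the "reach or exceed" wording of the workflow description.

```python
def choose_action(abnormal, deployed, a=0.5):
    """Pick the recovery path from the share of abnormal crawlers.

    `a` is the ratio threshold; the patent gives 0.5 as the preferred value
    and [0.1, 1] as the allowed range.
    """
    if not 0.1 <= a <= 1:
        raise ValueError("threshold a must lie in [0.1, 1]")
    if deployed <= 0:
        raise ValueError("at least one crawler must be deployed")
    if abnormal == 0:
        return "none"
    if abnormal / deployed >= a:
        # Halt capture, then reschedule and redeploy via dispatch + deployment.
        return "halt_and_redeploy"
    # Management service module adjusts the individual abnormal crawlers.
    return "adjust_individually"
```

For example, with 200 crawlers deployed and a = 0.5, 40 abnormal crawlers are handled individually, while 100 or more trigger a halt and redeployment.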
In a specific implementation, a may instead be the number of abnormal website crawlers. In that case the value of a is directly proportional to the number of deployed website crawlers: the more crawlers are sent out for data mining, the larger a is. The value range of a can be set to [100, 10000]. When the number of abnormal website crawlers is below 100, the management service module can handle them itself without being overloaded. When the number exceeds 10000, the load capacity of the management service module is exceeded, and bringing in the deployment service module and the dispatch service module is faster and reduces the blank period in data mining.
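A count-based threshold that grows with the number of deployed crawlers and stays inside [100, 10000] could be generated as follows. The proportionality factor is an assumption, since the patent states only that a is proportional to the number of deployed crawlers; the function name is hypothetical.

```python
def count_threshold(deployed, factor=0.5, low=100, high=10000):
    """Scale the abnormal-crawler count threshold with deployment size.

    `factor` is an assumed proportionality constant; `low` and `high` are the
    bounds of the value range given in the patent.
    """
    return max(low, min(high, int(factor * deployed)))
```

With these defaults, 1000 deployed crawlers give a = 500, a small deployment is held at the lower bound of 100, and a very large one is capped at 10000.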
The value of a can be set manually or generated automatically by the system. Manual setting can improve its accuracy, while automatic generation gives better real-time performance.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall be covered by the protection scope of the present invention.

Claims (9)

1. A web crawler service system for room library net, characterized in that it comprises:
A web crawler module, composed of multiple web crawlers, where each web crawler corresponds one-to-one with a website and parses the page elements of that website; the web crawler extracts website data, performs semantic analysis on it, and maps it into preset data entities for storage;
A monitoring service module, for monitoring the working state of each web crawler and judging whether the web crawler is working normally and whether the data is captured correctly;
A management service module, for configuring the working parameters of the web crawlers, upgrading the web crawlers, and managing the start and stop of the service system and the life cycle and work of the web crawlers;
A deployment service module, for allocating and deploying the web crawlers;
A dispatch service module, with an embedded web crawler scheduling mode, for scheduling and managing the working mode, working time, and stopping of the web crawlers;
The web crawler module is connected to the monitoring service module, the management service module, the deployment service module, and the dispatch service module, respectively; the monitoring service module is connected to the management service module; and the management service module is connected to the deployment service module and the dispatch service module, respectively;
In operation, the dispatch service module schedules and manages the working mode, working time, and stopping of the web crawlers; the deployment service module calls web crawlers from the web crawler module to mine data from their corresponding websites; and the monitoring service module monitors the working state of the web crawlers. When an individual web crawler works abnormally, the monitoring service module notifies the dispatch service module to adjust the parameters and working mode of the abnormal web crawler. When the abnormal web crawlers reach or exceed a threshold a, the monitoring service module notifies the management service module to halt the system's capture of website data; the management service module then notifies the dispatch service module and the deployment service module to reschedule and redeploy the web crawlers, after which website data mining resumes under the monitoring of the monitoring service module, and the cycle repeats.
2. The web crawler service system for room library net as claimed in claim 1, characterized in that the threshold a is the ratio of abnormal web crawlers to the total number of web crawlers that have been deployed.
3. The web crawler service system for room library net as claimed in claim 2, characterized in that the value range of a is [0.1, 1].
4. The web crawler service system for room library net as claimed in claim 3, characterized in that a = 0.5.
5. The web crawler service system for room library net as claimed in claim 1, characterized in that a is the number of abnormal web crawlers.
6. The web crawler service system for room library net as claimed in claim 5, characterized in that the value range of a is [100, 10000].
7. The web crawler service system for room library net as claimed in claim 6, characterized in that the value of a is directly proportional to the number of web crawlers that have been deployed.
8. The web crawler service system for room library net as claimed in any one of claims 1 to 7, characterized in that the value of a can be set manually or generated automatically by the system.
9. The web crawler service system for room library net as claimed in claim 1, characterized in that the web crawler is a focused crawler.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410347463.5A CN104182462B (en) 2014-07-21 2014-07-21 A kind of web crawlers service system for room library net

Publications (2)

Publication Number Publication Date
CN104182462A CN104182462A (en) 2014-12-03
CN104182462B true CN104182462B (en) 2018-06-26

Family

ID=51963502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410347463.5A Expired - Fee Related CN104182462B (en) 2014-07-21 2014-07-21 A kind of web crawlers service system for room library net

Country Status (1)

Country Link
CN (1) CN104182462B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104537005B (en) * 2014-12-15 2018-04-06 北京国双科技有限公司 Data processing method and device for web page crawl
CN107784036A (en) * 2016-08-31 2018-03-09 北京国双科技有限公司 Network crawler system and the data processing method based on network crawler system
CN109302299B (en) * 2017-07-25 2021-12-28 北京国双科技有限公司 Website broken link detection method and device
CN110020041B (en) * 2017-08-21 2021-10-08 北京国双科技有限公司 Method and device for tracking crawling process
CN108416046B (en) * 2018-03-15 2020-05-26 阿里巴巴(中国)有限公司 Sequence crawler boundary detection method and device and server

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103002014A (en) * 2012-11-09 2013-03-27 哈尔滨中智拓图地理信息技术有限公司 Environmental geographic information service platform based on cloud computing and internet-of-things technology
CN103051649A (en) * 2011-10-17 2013-04-17 江苏怡丰通信设备有限公司 Comprehensive energy consumption monitoring and managing system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120005187A1 (en) * 2010-07-02 2012-01-05 Philippe Chavanne Web Site Content Management Techniques

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
一种优化的网络爬虫的设计与实现;曹忠等;《电脑知识与技术》;20081215;第2082-2083页 *

Also Published As

Publication number Publication date
CN104182462A (en) 2014-12-03

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180626

Termination date: 20210721