CN104182462B - A web crawler service system for room library net - Google Patents
A web crawler service system for room library net
- Publication number
- CN104182462B (application number CN201410347463.5A)
- Authority
- CN
- China
- Prior art keywords
- website
- web crawler
- crawler
- module
- service module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present invention proposes a web crawler service system for room library net that can mine websites quickly and extract property-related data. The system comprises: a website crawler module made up of multiple website crawlers, where each website crawler corresponds one-to-one with a website and parses that website's page elements, and where the website crawler extracts website data, performs semantic analysis on it, and maps it into preset data entities for storage; a monitoring service module for monitoring the working state of each website crawler and judging whether the crawler is working normally and whether the captured data is correct; a management service module for configuring the working parameters of the website crawlers, upgrading the website crawlers, and managing the start and stop of the service system and the life cycle and work of the website crawlers; a deployment service module for allocating and deploying the website crawlers; and a dispatch service module, with built-in web crawler scheduling modes, for scheduling and managing the working mode, time, and stopping of the website crawlers.
Description
Technical field
The present invention relates to the technical field of website data mining, and in particular to a web crawler service system for room library net.
Background art
The real estate industry bears directly on the basic livelihood of the people. The residential market is about to enter an era dominated by existing housing stock, yet many owners of existing homes are not professional salespeople, and the sales information they provide is not comprehensive enough. At the same time, government departments still manage housing records on paper, and data on residence and property is scattered across many units and departments. This makes it inconvenient for the relevant departments to manage residents and property, and it prevents valid data from being fully used. Members of the public choosing a home and enterprises choosing office space both face a serious shortage of professional, detailed information services.
In this social environment, making property information widely available, so that buyers can look it up easily and property transactions can be completed, is of great significance. Doing so requires a large database able to hold a "ten-thousand-ton train" of property-related data. The foundation of such a database is data mining. Today, however, information spreads at high speed and junk information is everywhere, so how to mine website data quickly and effectively has long been a hot topic, and no satisfactory solution has yet been found.
Summary of the invention
To address the problems described in the background art, the present invention proposes a web crawler service system for room library net that can mine websites quickly and extract property-related data effectively.
The web crawler service system for room library net proposed by the present invention is characterized in that it comprises:
a website crawler module, made up of multiple website crawlers, where each website crawler corresponds one-to-one with a website and parses that website's page elements, and where the website crawler extracts website data, performs semantic analysis on it, and maps it into preset data entities for storage;
a monitoring service module, for monitoring the working state of each website crawler and judging whether the crawler is working normally and whether the captured data is correct;
a management service module, for configuring the working parameters of the website crawlers, upgrading the website crawlers, and managing the start and stop of the service system and the life cycle and work of the website crawlers;
a deployment service module, for allocating and deploying the website crawlers;
a dispatch service module, with built-in web crawler scheduling modes, for scheduling and managing the working mode, time, and stopping of the website crawlers;
the website crawler module is connected to the monitoring service module, the management service module, the deployment service module, and the dispatch service module respectively; the monitoring service module is connected to the management service module; and the management service module is connected to the deployment service module and the dispatch service module respectively;
in operation, the dispatch service module schedules and manages the working mode, time, and stopping of the website crawlers, and the deployment service module calls website crawlers from the website crawler module to mine data from their corresponding websites. The monitoring service module monitors the working state of the website crawlers. When individual website crawlers work abnormally, the monitoring service module notifies the management service module to adjust the parameters and working mode of the abnormal crawlers. When the abnormal website crawlers reach or exceed a threshold a, the monitoring service module notifies the management service module to stop the system from capturing website data; the management service module then notifies the dispatch service module and the deployment service module to reschedule and redeploy the website crawlers, after which website data mining resumes under the supervision of the monitoring service module, and the cycle repeats.
Preferably, the threshold a is the ratio of abnormal website crawlers to the total number of website crawlers that have been dispatched.
Preferably, the value range of a is [0.1, 1].
Preferably, a = 0.5.
Preferably, a is the number of abnormal website crawlers.
Preferably, the value range of a is [100, 10000].
Preferably, the value of a is directly proportional to the number of website crawlers that have been dispatched.
Preferably, the value of a can be set manually or generated automatically by the system.
Preferably, the website crawlers are focused crawlers.
In the present invention, data mining is carried out by website crawlers that correspond one-to-one with websites, which gives a high working speed. The mined data is stored only after semantic analysis and mapping to preset data entities; this step effectively removes irrelevant and duplicate information, raises the value of the stored data, and reduces the storage space it occupies. Management of the website crawlers is convenient and user-friendly: they can be monitored and adjusted automatically or regulated manually, which ensures that website data mining is timely, effective, and accurate.
Brief description of the drawings
Fig. 1 is a structure diagram of the web crawler service system for room library net proposed by the present invention.
Detailed description
With reference to Fig. 1, the web crawler service system for room library net proposed by the present invention comprises a website crawler module, a monitoring service module, a management service module, a deployment service module, and a dispatch service module. The website crawler module is connected to the monitoring service module, the management service module, the deployment service module, and the dispatch service module respectively; the monitoring service module is connected to the management service module; and the management service module is connected to the deployment service module and the dispatch service module respectively.
The website crawler module is made up of multiple website crawlers. Each website crawler corresponds one-to-one with a website and parses that website's page elements; the crawler extracts website data, performs semantic analysis on it, and maps it into preset data entities for storage. In this embodiment, data mining is carried out by website crawlers that correspond one-to-one with websites, which gives a high working speed. The mined data is stored only after semantic analysis and mapping to preset data entities; this step effectively removes irrelevant and duplicate information, raises the value of the stored data, and reduces the storage space it occupies. The website crawlers are focused crawlers, which mine only property-related information.
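The mapping step described above could be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify the fields of the preset data entity or how page elements are labeled, so `PropertyListing`, `FIELD_MAP`, `map_to_entity`, and `deduplicate` are all hypothetical names, and the semantic analysis is reduced here to a simple label-to-field lookup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PropertyListing:
    """Hypothetical preset data entity; the patent does not list its fields."""
    source_site: str
    address: str
    area_sqm: float
    price: float


# Hypothetical mapping from labels found on a page to entity field names.
FIELD_MAP = {"Address": "address", "Area": "area_sqm", "Price": "price"}


def map_to_entity(site: str, raw: dict) -> Optional[PropertyListing]:
    """Map fields extracted from one page onto the preset entity.

    Returns None when a required field is missing or malformed, so that
    pages unrelated to property are dropped instead of stored.
    """
    fields = {FIELD_MAP[k]: v for k, v in raw.items() if k in FIELD_MAP}
    if set(fields) != set(FIELD_MAP.values()):
        return None
    try:
        return PropertyListing(
            source_site=site,
            address=str(fields["address"]).strip(),
            area_sqm=float(fields["area_sqm"]),
            price=float(fields["price"]),
        )
    except (TypeError, ValueError):
        return None


def deduplicate(listings):
    """Keep the first occurrence of each mapped listing, preserving order."""
    seen = set()
    unique = []
    for item in listings:
        if item is not None and item not in seen:
            seen.add(item)
            unique.append(item)
    return unique
```

Because the entity is a frozen dataclass, two pages that normalize to the same values compare equal, which is what lets duplicate records be discarded before storage.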
The monitoring service module monitors the working state of each website crawler and judges whether the crawler is working normally and whether the captured data is correct, so that operations and development staff can learn the state of the website crawlers in time and make adjustments.
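One way such a health judgment might look is sketched below. The patent does not say which signals the monitoring service module uses, so the counters in `CrawlerStats` and the two limits `max_error_rate` and `min_valid_rate` are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class CrawlerStats:
    """Hypothetical per-crawler counters collected over one monitoring window."""
    site: str
    pages_fetched: int
    fetch_errors: int
    records_extracted: int
    records_valid: int


def is_abnormal(stats: CrawlerStats,
                max_error_rate: float = 0.2,
                min_valid_rate: float = 0.9) -> bool:
    """Return True when the crawler looks unhealthy or its output looks wrong."""
    attempts = stats.pages_fetched + stats.fetch_errors
    if attempts == 0:
        return True  # nothing attempted in the window: treat as not working
    if stats.fetch_errors / attempts > max_error_rate:
        return True  # too many failed fetches
    if stats.records_extracted == 0:
        return True  # pages fetched but nothing parsed, e.g. the page layout changed
    # Captured data counts as correct only if enough records pass validation.
    return stats.records_valid / stats.records_extracted < min_valid_rate
```

The first two checks correspond to "is the crawler working normally", and the last two to "is the captured data correct".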
The management service module configures the working parameters of the website crawlers, upgrades the website crawlers, and manages the start and stop of the service system and the life cycle and work of the website crawlers. The website crawlers are the part of the system that most needs real-time updating: if the page elements, authentication mode, or the like of a crawled website change, the corresponding website crawler must be upgraded accordingly to keep the captured content accurate. Based on the monitoring results of the monitoring service module, operations and development staff can promptly upgrade website crawlers through the management service module, ensuring that the crawlers remain effective in real time.
The deployment service module allocates and deploys the website crawlers so that each website crawler is responsible for capturing data from one corresponding website, and only that website, which improves data mining efficiency and avoids duplication. The deployment service module exists precisely to make deploying website crawlers more convenient: after developers upgrade a website crawler component, it can be deployed quickly and easily.
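The one-crawler-per-website constraint can be illustrated with a small allocation check. This is a sketch only; `assign_crawlers` and the identifiers passed to it are hypothetical, and the patent does not describe how the deployment service module actually performs allocation.

```python
def assign_crawlers(sites, crawler_ids):
    """Pair each site with exactly one crawler; raise if that is impossible."""
    sites = list(sites)
    crawler_ids = list(crawler_ids)
    if len(set(sites)) != len(sites):
        raise ValueError("duplicate site in deployment list")
    if len(set(crawler_ids)) != len(crawler_ids):
        raise ValueError("duplicate crawler id")
    if len(sites) != len(crawler_ids):
        raise ValueError("need exactly one crawler per site")
    # Positional pairing: the i-th crawler is responsible for the i-th site only.
    return dict(zip(sites, crawler_ids))
```

Rejecting duplicates up front is what prevents two crawlers from mining the same website, or one crawler from being spread across several.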
The dispatch service module, with built-in web crawler scheduling modes, schedules and manages the working mode, time, and stopping of the website crawlers. This module can adjust website crawlers quickly and in batches, which improves the efficiency of configuring crawlers, reduces idle time, prevents data from being missed, and improves the completeness of website data mining.
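A schedule entry covering working mode, start time, and stop time, together with a batch update, might be modeled as below. The `Schedule` fields, the mode labels, and both helper functions are assumptions for illustration; the patent does not define the scheduling modes.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Schedule:
    mode: str    # hypothetical working-mode label, e.g. "full" or "incremental"
    start: time  # time of day at which the crawler starts working
    stop: time   # time of day at which the crawler stops


def is_active(schedule: Schedule, now: time) -> bool:
    """True if `now` falls in [start, stop), including windows that cross midnight."""
    if schedule.start <= schedule.stop:
        return schedule.start <= now < schedule.stop
    return now >= schedule.start or now < schedule.stop


def apply_batch(schedules: dict, sites, new_schedule: Schedule) -> dict:
    """Return a copy of `schedules` with `new_schedule` applied to every listed site."""
    updated = dict(schedules)
    for site in sites:
        updated[site] = new_schedule
    return updated
```

Applying one schedule to many sites in a single call is the kind of batch adjustment the paragraph above refers to.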
When the system is in operation, the dispatch service module schedules and manages the working mode, time, and stopping of the website crawlers, and the deployment service module calls website crawlers from the website crawler module to mine data from their corresponding websites. The monitoring service module monitors the working state of the website crawlers. When individual website crawlers work abnormally, the monitoring service module notifies the management service module to adjust the parameters and working mode of the abnormal crawlers. When the abnormal website crawlers reach or exceed the threshold a, the monitoring service module notifies the management service module to stop the system from capturing website data; the management service module then notifies the dispatch service module and the deployment service module to reschedule and redeploy the website crawlers, after which website data mining resumes under the supervision of the monitoring service module, and the cycle repeats.
In this system, the threshold a is the ratio of abnormal website crawlers to the total number of website crawlers that have been dispatched, and a = 0.5. When the ratio of abnormal crawlers is below 0.5, the abnormal website crawlers are adjusted by the management service module; when the ratio exceeds 0.5, the abnormal website crawlers are adjusted by the deployment service module and the dispatch service module. In a specific implementation, the value range of a can be set to [0.1, 1].
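The ratio-based decision can be written out as a short sketch. The names `Action` and `decide` are hypothetical. Note one assumption about the boundary: the embodiment speaks of ratios below and above 0.5, while the workflow says "reach or exceed", so the sketch follows the workflow wording and treats a ratio equal to a as reaching the threshold.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    ADJUST_INDIVIDUALLY = "adjust"  # management service module tunes the abnormal crawlers
    HALT_AND_REDEPLOY = "halt"      # stop crawling, then reschedule and redeploy


def decide(abnormal: int, dispatched: int, a: float = 0.5) -> Action:
    """Choose the response for a given number of abnormal crawlers.

    `a` is the ratio threshold; 0.5 is the preferred value in the text.
    """
    if dispatched <= 0:
        raise ValueError("no crawlers dispatched")
    if not 0 <= abnormal <= dispatched:
        raise ValueError("abnormal count out of range")
    if abnormal == 0:
        return Action.NONE
    ratio = abnormal / dispatched
    if ratio >= a:
        return Action.HALT_AND_REDEPLOY
    return Action.ADJUST_INDIVIDUALLY
```

For example, with 10 crawlers dispatched and a = 0.5, three abnormal crawlers are handled individually, whereas five or more cause crawling to stop so the crawlers can be rescheduled and redeployed.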
In a specific implementation, a may instead be the number of abnormal website crawlers. The value of a is directly proportional to the number of website crawlers dispatched; that is, the more website crawlers are dispatched for data mining, the larger the value of a. Specifically, the value range of a can be set to [100, 10000]. When the number of abnormal website crawlers is below 100, the management service module can handle them itself without being overloaded. When the number exceeds 10000, the load capacity of the management service module has been exceeded, and using the deployment service module and the dispatch service module is faster and shortens the gaps in data mining.
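A count-based threshold that grows with the number of dispatched crawlers could be generated as in the sketch below. The proportionality `factor` and the choice to clamp the result to the range [100, 10000] are assumptions; the text states only that a is proportional to the dispatched count and that its range can be set to [100, 10000].

```python
def count_threshold(dispatched: int,
                    factor: float = 0.5,
                    low: int = 100,
                    high: int = 10000) -> int:
    """Abnormal-crawler count threshold, proportional to `dispatched`.

    The proportional value is clamped to [low, high]; both the factor and
    the clamping are illustrative assumptions, not taken from the patent.
    """
    if dispatched < 0:
        raise ValueError("dispatched must be non-negative")
    proportional = int(factor * dispatched)
    return max(low, min(high, proportional))
```

With these assumed defaults, 1000 dispatched crawlers give a threshold of 500, while very small or very large deployments fall back to the ends of the range.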
The value of a can be set manually or generated automatically by the system. Manual setting can improve its accuracy, while automatic generation gives better real-time performance.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the protection scope of the present invention.
Claims (9)
1. A web crawler service system for room library net, characterized in that it comprises:
a web crawler module, made up of multiple web crawlers, where each web crawler corresponds one-to-one with a website and parses that website's page elements, and where the web crawler extracts website data, performs semantic analysis on it, and maps it into preset data entities for storage;
a monitoring service module, for monitoring the working state of each web crawler and judging whether the web crawler is working normally and whether the captured data is correct;
a management service module, for configuring the working parameters of the web crawlers, upgrading the web crawlers, and managing the start and stop of the service system and the life cycle and work of the web crawlers;
a deployment service module, for allocating and deploying the web crawlers;
a dispatch service module, with built-in web crawler scheduling modes, for scheduling and managing the working mode, time, and stopping of the web crawlers;
the web crawler module is connected to the monitoring service module, the management service module, the deployment service module, and the dispatch service module respectively; the monitoring service module is connected to the management service module; and the management service module is connected to the deployment service module and the dispatch service module respectively;
in operation, the dispatch service module schedules and manages the working mode, time, and stopping of the web crawlers; the deployment service module calls web crawlers from the web crawler module to mine data from their corresponding websites; the monitoring service module monitors the working state of the web crawlers; when individual web crawlers work abnormally, the monitoring service module notifies the dispatch service module to adjust the parameters and working mode of the abnormal web crawlers; when the abnormal web crawlers reach or exceed a threshold a, the monitoring service module notifies the management service module to stop the system from capturing website data; the management service module then notifies the dispatch service module and the deployment service module to reschedule and redeploy the web crawlers, after which website data mining resumes under the supervision of the monitoring service module, and the cycle repeats.
2. The web crawler service system for room library net according to claim 1, characterized in that the threshold a is the ratio of abnormal web crawlers to the total number of web crawlers that have been dispatched.
3. The web crawler service system for room library net according to claim 2, characterized in that the value range of a is [0.1, 1].
4. The web crawler service system for room library net according to claim 3, characterized in that a = 0.5.
5. The web crawler service system for room library net according to claim 1, characterized in that a is the number of abnormal web crawlers.
6. The web crawler service system for room library net according to claim 5, characterized in that the value range of a is [100, 10000].
7. The web crawler service system for room library net according to claim 6, characterized in that the value of a is directly proportional to the number of web crawlers that have been dispatched.
8. The web crawler service system for room library net according to any one of claims 1 to 7, characterized in that the value of a can be set manually or generated automatically by the system.
9. The web crawler service system for room library net according to claim 1, characterized in that the web crawlers are focused crawlers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410347463.5A CN104182462B (en) | 2014-07-21 | 2014-07-21 | A web crawler service system for room library net |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410347463.5A CN104182462B (en) | 2014-07-21 | 2014-07-21 | A web crawler service system for room library net |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104182462A CN104182462A (en) | 2014-12-03 |
CN104182462B true CN104182462B (en) | 2018-06-26 |
Family
ID=51963502
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410347463.5A Expired - Fee Related CN104182462B (en) | 2014-07-21 | 2014-07-21 | A web crawler service system for room library net |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104182462B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104537005B (en) * | 2014-12-15 | 2018-04-06 | 北京国双科技有限公司 | Data processing method and device for web page crawl |
CN107784036A (en) * | 2016-08-31 | 2018-03-09 | 北京国双科技有限公司 | Network crawler system and the data processing method based on network crawler system |
CN109302299B (en) * | 2017-07-25 | 2021-12-28 | 北京国双科技有限公司 | Website broken link detection method and device |
CN110020041B (en) * | 2017-08-21 | 2021-10-08 | 北京国双科技有限公司 | Method and device for tracking crawling process |
CN108416046B (en) * | 2018-03-15 | 2020-05-26 | 阿里巴巴(中国)有限公司 | Sequence crawler boundary detection method and device and server |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103002014A (en) * | 2012-11-09 | 2013-03-27 | 哈尔滨中智拓图地理信息技术有限公司 | Environmental geographic information service platform based on cloud computing and internet-of-things technology |
CN103051649A (en) * | 2011-10-17 | 2013-04-17 | 江苏怡丰通信设备有限公司 | Comprehensive energy consumption monitoring and managing system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120005187A1 (en) * | 2010-07-02 | 2012-01-05 | Philippe Chavanne | Web Site Content Management Techniques |
-
2014
- 2014-07-21 CN CN201410347463.5A patent/CN104182462B/en not_active Expired - Fee Related
Non-Patent Citations (1)
Title |
---|
Design and Implementation of an Optimized Web Crawler; Cao Zhong et al.; Computer Knowledge and Technology; 2008-12-15; pp. 2082-2083 * |
Also Published As
Publication number | Publication date |
---|---|
CN104182462A (en) | 2014-12-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
EXSB | Decision made by sipo to initiate substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20180626 Termination date: 20210721 |