CN117828156A - Distributed crawler dynamic updating system and method - Google Patents
- Publication number
- CN117828156A (application CN202311808702.8A)
- Authority
- CN
- China
- Prior art keywords
- crawler
- update
- exchanger
- switch
- message
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/542—Event management; Broadcasting; Multicasting; Notifications
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention provides a distributed crawler dynamic updating system and method. The system comprises: an exchange implemented on a message system and supporting a broadcast mode; a management center connected to the exchange and acting as a producer; and one or more crawler clients connected to the exchange, acting as consumers and subscribing to the exchange. The producer publishes crawler update messages to the exchange; the exchange broadcasts each crawler update message; and each consumer receives the broadcast crawler update message and performs a dynamic crawler update accordingly. Because it is based on a message system, the invention uses broadcast push to reduce management and maintenance costs and to ensure that all crawlers are updated in time, which gives it particular advantages in handling crawler failures and batch updates in large-scale or distributed deployments.
Description
Technical Field
The invention relates to the technical field of crawlers, in particular to a distributed crawler dynamic updating system and method.
Background
In large-scale or distributed information gathering environments, existing crawler technology usually requires the code to be inspected and repaired manually whenever a crawler's code fails. As data volume and the number of crawlers keep growing, more crawlers fail and need repair, manual management of crawler code becomes less and less efficient, and the accuracy and completeness of the data become difficult to guarantee. If instead each crawler actively queries the server to learn whether an update is available or whether it has become invalid, crawlers that have already failed cannot obtain that information from the server. Moreover, when all crawler clients send large numbers of messages to the server, the server may face high load, network congestion, memory consumption, bandwidth limits, connection-count limits, database pressure and similar problems. The prior art therefore struggles to ensure that every crawler is updated, for example when a crawler has itself failed or has been disconnected from the server. Accordingly, a technique is proposed that enables dynamic updating of failed crawler code.
Disclosure of Invention
The invention aims to provide a distributed crawler dynamic updating system and a distributed crawler dynamic updating method, which are used for solving the problems of crawler failure and batch updating in a large-scale or distributed information acquisition environment.
The invention provides a distributed crawler dynamic updating system, which comprises:
an exchange implemented on a message system, the exchange supporting a broadcast mode;
a management center connected to the exchange, the management center acting as a producer;
one or more crawler clients connected to the exchange, the crawler clients acting as consumers and subscribing to the exchange;
wherein the producer publishes a crawler update message to the exchange; the exchange broadcasts the crawler update message; and each consumer receives the crawler update message from the broadcast and performs a dynamic crawler update accordingly.
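The broadcast relationship among these three components can be illustrated with a minimal in-memory sketch. It is not a real message system: the class `FanoutExchange`, the consumer names and the message text are all hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Deque, Dict


class FanoutExchange:
    """In-memory stand-in for a broadcast-mode exchange (not a real broker)."""

    def __init__(self) -> None:
        # One private queue per subscribed consumer, keyed by consumer name.
        self._queues: Dict[str, Deque[str]] = {}

    def subscribe(self, consumer_name: str) -> Deque[str]:
        """Register a consumer and return the queue it should read from."""
        return self._queues.setdefault(consumer_name, deque())

    def publish(self, message: str) -> int:
        """Copy the message to every subscriber queue; return the copy count."""
        for queue in self._queues.values():
            queue.append(message)
        return len(self._queues)


exchange = FanoutExchange()

# Two crawler clients act as consumers and subscribe to the exchange.
inbox_a = exchange.subscribe("crawler-client-a")
inbox_b = exchange.subscribe("crawler-client-b")

# The management center acts as the producer and publishes one update.
delivered = exchange.publish("update: parsing rules for site X")
```

The producer publishes once and never addresses a client directly; every subscribed queue receives its own copy of the message.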
Further, the message system includes:
a message broker for decoupling communication between the producer and the consumers;
a queue server for storing and managing pending crawler update tasks.
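As a rough illustration of what storing and managing pending update tasks could look like, the sketch below keeps tasks in a first-in, first-out queue until a consumer is ready. The names `UpdateTask` and `QueueServer` and the task fields are assumptions made for this example, not details given in the text.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class UpdateTask:
    crawler_id: str  # hypothetical identifier of the crawler to update
    action: str      # "create", "delete" or "modify"


@dataclass
class QueueServer:
    """Holds update tasks until a consumer is ready to process them."""

    pending: Deque[UpdateTask] = field(default_factory=deque)

    def enqueue(self, task: UpdateTask) -> None:
        # The producer only hands the task over; it does not wait for a consumer.
        self.pending.append(task)

    def drain(self) -> List[UpdateTask]:
        """Return all pending tasks in arrival order and clear the queue."""
        tasks = list(self.pending)
        self.pending.clear()
        return tasks


server = QueueServer()
server.enqueue(UpdateTask("news", "modify"))
server.enqueue(UpdateTask("forum", "delete"))

# Later, a consumer picks up everything that accumulated in the meantime.
processed = server.drain()
```

Because the producer only enqueues and the consumer only drains, neither side needs to know when the other is running, which is the decoupling the message broker is meant to provide.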
Further, the management center can monitor the running state of the system.
The invention also provides a distributed crawler dynamic updating method, which comprises the following steps:
using a message system and broadcast push to ensure that each crawler updates itself promptly after receiving an update request.
Further, the distributed crawler dynamic updating method comprises the following steps:
when the management center needs to notify all crawler clients subscribed to the exchange to perform a crawler update, the management center connects to the exchange as a producer and publishes a crawler update message to the exchange;
when a crawler update message is published to the exchange, all crawler clients subscribed to the exchange receive the broadcast crawler update message, and each crawler client dynamically updates its crawlers according to that message.
Further, the dynamic crawler update includes the crawler client creating, deleting or modifying a crawler. A modification may be any change to one or more crawlers in the crawler client, including changes to parsing rules, anti-crawling measures, rate control, proxy customization, login verification, crawler output settings and the like.
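One way a crawler client might apply such an update is sketched below. The JSON message layout (`action`, `crawler_id`, `settings`) and the setting names are assumptions for illustration; the text does not specify a message format.

```python
import json
from typing import Any, Dict


def apply_update(crawlers: Dict[str, Dict[str, Any]], raw_message: str) -> None:
    """Apply one create/delete/modify update to a client's crawler registry."""
    message = json.loads(raw_message)
    action = message["action"]
    crawler_id = message["crawler_id"]
    settings = message.get("settings", {})

    if action == "create":
        crawlers[crawler_id] = dict(settings)
    elif action == "delete":
        crawlers.pop(crawler_id, None)
    elif action == "modify":
        if crawler_id not in crawlers:
            raise KeyError(f"unknown crawler: {crawler_id!r}")
        # Only the supplied settings change; all others are kept.
        crawlers[crawler_id].update(settings)
    else:
        raise ValueError(f"unknown action: {action!r}")


registry: Dict[str, Dict[str, Any]] = {}

apply_update(registry, json.dumps({
    "action": "create",
    "crawler_id": "news",
    "settings": {"parse_rule": "//h1/text()", "requests_per_second": 2},
}))
apply_update(registry, json.dumps({
    "action": "modify",
    "crawler_id": "news",
    "settings": {"requests_per_second": 1, "proxy": "http://proxy.example:8080"},
}))
after_modify = dict(registry["news"])

apply_update(registry, json.dumps({"action": "delete", "crawler_id": "news"}))
```

Treating a modification as a partial settings update lets a single message change a parsing rule, a rate limit or a proxy without resending the whole crawler configuration.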
In summary, due to the adoption of the technical scheme, the beneficial effects of the invention are as follows:
Because it is based on a message system, the invention uses broadcast push to reduce management and maintenance costs and to ensure that all crawlers are updated in time. It therefore has particular advantages in handling crawler failures and batch updates in large-scale or distributed deployments.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The following drawings illustrate only some embodiments of the present invention and should not be regarded as limiting its scope; a person skilled in the art may derive other related drawings from them without inventive effort.
FIG. 1 is a schematic diagram of dynamic update of a distributed crawler in an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
The embodiment provides a distributed crawler dynamic updating method, which comprises the following steps:
using a message system and broadcast push to ensure that each crawler updates itself promptly after receiving an update request. The message system comprises:
a message broker (e.g., RabbitMQ) for decoupling communication between the producer and the consumers;
a queue server for storing and managing pending crawler update tasks.
The components of the resulting distributed crawler dynamic update system are shown in FIG. 1 and comprise:
an exchange implemented on the message system and supporting a broadcast mode (e.g., a fanout exchange used for message broadcasting);
a management center connected to the exchange and acting as a producer that publishes crawler update messages to the exchange (for example, a message instructing crawler clients to create, delete or modify a crawler, where a modification may be any change to one or more crawlers in the crawler client, including changes to parsing rules, anti-crawling measures, rate control, proxy customization, login verification, crawler output settings and the like); in addition, the management center can monitor the running state of the system;
one or more crawler clients acting as consumers that subscribe to the exchange and receive crawler update messages from the broadcast in order to perform dynamic crawler updates. A crawler client may be a single crawler, a set of crawlers of different types, or a container holding multiple crawlers.
The specific implementation process is as follows:
S1. Create an exchange on the message system; the exchange supports a broadcast mode.
S2. The management center creates a producer for sending crawler update messages to the exchange. For example, the producer publishes a message carrying the parsing rules a crawler uses for a given website.
S3. Create a plurality of consumers, each of which subscribes to the exchange, that is, binds to the exchange created above and listens on it.
S4. When the management center needs to notify all crawler clients subscribed to the exchange to perform a crawler update, the management center connects to the exchange as a producer and publishes a crawler update message to the exchange.
S5. When a crawler update message is published to the exchange, all crawler clients subscribed to the exchange receive the broadcast crawler update message, and each crawler client dynamically updates its crawlers according to that message.
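Steps S1 to S5 can be sketched against RabbitMQ, which the embodiment names as one possible broker, using the third-party `pika` client. This is a sketch under assumptions rather than the patent's own code: the exchange name `crawler_updates`, the JSON message layout, the `localhost` broker address and the function names are all hypothetical.

```python
import json

EXCHANGE_NAME = "crawler_updates"  # hypothetical exchange name


def encode_update(action: str, crawler_id: str, settings: dict) -> bytes:
    """Serialize an update message (assumed JSON layout)."""
    payload = {"action": action, "crawler_id": crawler_id, "settings": settings}
    return json.dumps(payload).encode("utf-8")


def decode_update(body: bytes) -> dict:
    """Parse an update message produced by encode_update."""
    return json.loads(body.decode("utf-8"))


def publish_update(body: bytes, host: str = "localhost") -> None:
    """S1, S2, S4: declare the fanout exchange and publish one update."""
    import pika  # third-party RabbitMQ client, imported lazily

    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    channel.exchange_declare(exchange=EXCHANGE_NAME, exchange_type="fanout")
    # Fanout exchanges ignore the routing key, so it is left empty.
    channel.basic_publish(exchange=EXCHANGE_NAME, routing_key="", body=body)
    connection.close()


def run_crawler_client(on_update, host: str = "localhost") -> None:
    """S3, S5: bind a private queue to the exchange and handle each update."""
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    channel.exchange_declare(exchange=EXCHANGE_NAME, exchange_type="fanout")
    # An exclusive, server-named queue gives this client its own copy
    # of every message published to the exchange.
    result = channel.queue_declare(queue="", exclusive=True)
    queue_name = result.method.queue
    channel.queue_bind(exchange=EXCHANGE_NAME, queue=queue_name)

    def handle(ch, method, properties, body):
        on_update(decode_update(body))

    channel.basic_consume(
        queue=queue_name, on_message_callback=handle, auto_ack=True
    )
    channel.start_consuming()  # blocks until interrupted


round_trip = decode_update(encode_update("modify", "news", {"parse_rule": "//h1"}))
```

With an exclusive, server-named queue, a client that is disconnected when an update is published will not see that message; declaring a durable, named queue per client is one possible way to retain updates for clients that reconnect later.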
By contrast, the prior art has high management and maintenance costs and typically relies on crawler clients polling a server for updates, which makes it difficult to ensure that every crawler is updated, for example when a crawler fails, becomes invalid or is disconnected from the server. Because the invention is based on a message system, it uses broadcast push to reduce management and maintenance costs and to ensure that all crawlers are updated in time. It therefore has particular advantages in handling crawler failures and batch updates in large-scale or distributed deployments.
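The difference in message volume between polling and broadcast push can be made concrete with a small calculation. All numbers below (1000 clients, a 60-second polling interval, three updates per day) are hypothetical and serve only to show the scaling.

```python
def polling_requests(clients: int, period_s: int, poll_interval_s: int) -> int:
    """Requests the server receives when every client polls on a fixed interval."""
    return clients * (period_s // poll_interval_s)


def push_deliveries(clients: int, updates: int) -> int:
    """Deliveries made by the exchange when each update is broadcast once."""
    return clients * updates


CLIENTS = 1000                 # hypothetical number of crawler clients
ONE_DAY_S = 24 * 60 * 60       # observation period in seconds

# Every client polls once per minute, whether or not anything changed.
polled = polling_requests(CLIENTS, ONE_DAY_S, poll_interval_s=60)

# Three updates are published during the day and broadcast to every client.
pushed = push_deliveries(CLIENTS, updates=3)
```

Polling cost grows with the polling frequency even when nothing changes, whereas push cost grows only with the number of updates actually published.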
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (6)
1. A distributed crawler dynamic update system, comprising:
an exchange implemented on a message system, the exchange supporting a broadcast mode;
a management center connected to the exchange, the management center acting as a producer;
one or more crawler clients connected to the exchange, the crawler clients acting as consumers and subscribing to the exchange;
wherein the producer publishes a crawler update message to the exchange; the exchange broadcasts the crawler update message; and each consumer receives the crawler update message from the broadcast and performs a dynamic crawler update accordingly.
2. The distributed crawler dynamic update system of claim 1 wherein said messaging system comprises:
a message broker for decoupling communication between the producer and the consumers;
a queue server for storing and managing pending crawler update tasks.
3. The distributed crawler dynamic update system of claim 2 wherein said management center is capable of monitoring system operational status.
4. A distributed crawler dynamic update method, comprising:
using a message system and broadcast push to ensure that each crawler updates itself promptly after receiving an update request.
5. The distributed crawler dynamic update method of claim 4, comprising:
when the management center needs to notify all crawler clients subscribed to the exchange to perform a crawler update, the management center connects to the exchange as a producer and publishes a crawler update message to the exchange;
when a crawler update message is published to the exchange, all crawler clients subscribed to the exchange receive the broadcast crawler update message, and each crawler client dynamically updates its crawlers according to that message.
6. The distributed crawler dynamic update method of claim 5, wherein the dynamic crawler update includes a crawler client creating, deleting, or modifying a crawler; wherein the modification of the crawler is any modification to one or more crawlers in the crawler client, including modification of parsing rules, anti-crawling measures, rate control, proxy customization, login verification, and/or crawler output settings.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311808702.8A CN117828156A (en) | 2023-12-26 | 2023-12-26 | Distributed crawler dynamic updating system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311808702.8A CN117828156A (en) | 2023-12-26 | 2023-12-26 | Distributed crawler dynamic updating system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117828156A (en) | 2024-04-05 |
Family
ID=90505235
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311808702.8A (Pending) | Distributed crawler dynamic updating system and method | 2023-12-26 | 2023-12-26 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117828156A (en) |
- 2023-12-26: CN application CN202311808702.8A, published as CN117828156A (en), status: Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||