CN117828156A - Distributed crawler dynamic updating system and method - Google Patents

Distributed crawler dynamic updating system and method

Publication number
CN117828156A
Authority
CN
China
Prior art keywords
crawler
update
exchanger
switch
message
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311808702.8A
Other languages
Chinese (zh)
Inventor
魏光
黄张朋
朱磊
康荣保
于汝云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 30 Research Institute
Original Assignee
CETC 30 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-12-26
Filing date
2023-12-26
Publication date
2024-04-05
Application filed by CETC 30 Research Institute

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/542 Event management; Broadcasting; Multicasting; Notifications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a distributed crawler dynamic updating system and method. The system comprises: an exchange implemented on a message system, the exchange supporting a broadcast mode; a management center connected to the exchange and acting as a producer; and one or more crawler clients connected to the exchange, the crawler clients acting as consumers and subscribing to the exchange. The producer publishes a crawler update message to the exchange; the exchange broadcasts the crawler update message; and each consumer receives the broadcast crawler update message and performs a dynamic crawler update accordingly. Because it is built on a message system and pushes updates by broadcast, the invention reduces management and maintenance cost and ensures that all crawlers can be updated promptly, which gives it particular advantages in handling crawler failures and batch updates in large-scale or distributed deployments.

Description

Distributed crawler dynamic updating system and method
Technical Field
The invention relates to the technical field of crawlers, in particular to a distributed crawler dynamic updating system and method.
Background
In large-scale or distributed information-gathering environments, existing crawler technology usually requires the code of a failed crawler to be inspected and repaired manually. As data volumes and crawler fleets keep growing, more crawlers fail and need repair, manual management of crawler code becomes increasingly inefficient, and the accuracy and completeness of the collected data become difficult to guarantee. If each crawler instead actively queries a server to check for updates, a crawler that has itself failed cannot obtain the update information from the server. Moreover, when all crawler clients send large numbers of messages to the server, the server may face high load, network congestion, memory consumption, bandwidth limits, connection-count limits, database pressure, and similar problems. The prior art therefore struggles to ensure that every crawler is updated (for example, when a crawler itself fails or is disconnected from the server). Accordingly, a technique that can dynamically update failed crawler code is proposed.
Disclosure of Invention
The invention aims to provide a distributed crawler dynamic updating system and a distributed crawler dynamic updating method, which are used for solving the problems of crawler failure and batch updating in a large-scale or distributed information acquisition environment.
The invention provides a distributed crawler dynamic updating system, which comprises:
an exchange implemented on a message system, the exchange supporting a broadcast mode;
a management center connected to the exchange, the management center acting as a producer;
one or more crawler clients connected to the exchange, the crawler clients acting as consumers and subscribing to the exchange;
wherein the producer publishes a crawler update message to the exchange; the exchange broadcasts the crawler update message; and each consumer receives the broadcast crawler update message and performs a dynamic crawler update accordingly.
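The relationship between the three components can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the class names `FanoutExchange`, `ManagementCenter`, and `CrawlerClient`, the message fields, and the use of direct callbacks in place of a real message system are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

UpdateHandler = Callable[[dict], None]


class FanoutExchange:
    """Hypothetical in-memory stand-in for a broadcast-mode exchange."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, UpdateHandler] = {}

    def subscribe(self, client_id: str, handler: UpdateHandler) -> None:
        # A consumer registers itself with the exchange.
        self._subscribers[client_id] = handler

    def publish(self, message: dict) -> int:
        # Broadcast: every subscriber receives its own copy of the message.
        for handler in list(self._subscribers.values()):
            handler(dict(message))
        return len(self._subscribers)


class CrawlerClient:
    """Consumer: subscribes on creation and records each update it receives."""

    def __init__(self, client_id: str, exchange: FanoutExchange) -> None:
        self.client_id = client_id
        self.received: List[dict] = []
        exchange.subscribe(client_id, self.on_update)

    def on_update(self, message: dict) -> None:
        self.received.append(message)


class ManagementCenter:
    """Producer: publishes crawler update messages to the exchange."""

    def __init__(self, exchange: FanoutExchange) -> None:
        self._exchange = exchange

    def issue_update(self, message: dict) -> int:
        return self._exchange.publish(message)


exchange = FanoutExchange()
clients = [CrawlerClient(f"client-{i}", exchange) for i in range(3)]
center = ManagementCenter(exchange)
# Assumed message shape; the source does not define a message format.
delivered = center.issue_update({"action": "modify", "crawler": "news"})
```

The management center never addresses an individual client; it publishes once, and the exchange determines who receives the message.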
Further, the message system includes:
a message broker for enabling decoupled communications between the producer and the consumer;
a queue server for storing and managing pending crawler update tasks.
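One way to picture the queue server's role is as a set of per-client queues that hold update tasks until each client processes them. The sketch below assumes this one-queue-per-client layout; the source does not specify how pending tasks are stored, and the `QueueServer` class and its method names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List


class QueueServer:
    """Hypothetical store: one FIFO queue of pending update tasks per client."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[dict]] = {}

    def bind(self, client_id: str) -> None:
        # Create the client's queue if it does not exist yet.
        self._queues.setdefault(client_id, deque())

    def broadcast(self, task: dict) -> None:
        # Append a copy of the task to every bound client's queue.
        for queue in self._queues.values():
            queue.append(dict(task))

    def pending(self, client_id: str) -> int:
        return len(self._queues[client_id])

    def drain(self, client_id: str) -> List[dict]:
        # Hand over all pending tasks in arrival order and empty the queue.
        queue = self._queues[client_id]
        tasks = list(queue)
        queue.clear()
        return tasks


server = QueueServer()
server.bind("client-a")
server.bind("client-b")
server.broadcast({"action": "modify", "crawler": "news"})
server.broadcast({"action": "delete", "crawler": "old-forum"})
done_a = server.drain("client-a")  # client-a processes its tasks; client-b has not yet
```

In this model a client that is slow or temporarily busy does not lose updates: its tasks stay queued until it drains them.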
Further, the management center can monitor the running state of the system.
The invention also provides a distributed crawler dynamic updating method, which comprises the following steps:
using broadcast push over the message system, ensuring that each crawler updates itself promptly after receiving an update request.
Further, the distributed crawler dynamic updating method comprises the following steps:
when the management center needs to notify all crawler clients subscribed to the exchange to update their crawlers, the management center connects to the exchange as a producer and publishes a crawler update message to the exchange;
when the crawler update message is published to the exchange, all crawler clients subscribed to the exchange receive the broadcast crawler update message, and each crawler client dynamically updates its crawlers according to that message.
Further, the dynamic crawler update includes the crawler client creating, deleting, or modifying a crawler. A modification may be any change to one or more crawlers in the crawler client, including changes to parsing rules, anti-crawling measures, rate control, proxy customization, login verification, crawler output settings, and the like.
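The create, delete, and modify operations can be sketched as a small dispatcher inside a crawler client. The message fields (`action`, `crawler`, `settings`), the `CrawlerConfig` attributes, and the `CrawlerRegistry` class are assumptions chosen for illustration; the source names the kinds of settings that may change but does not define a message format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CrawlerConfig:
    # Hypothetical settings covering a few of the modification types listed above.
    parse_rule: str = ""
    requests_per_second: float = 1.0
    proxy: str = ""
    output: str = "stdout"


@dataclass
class CrawlerRegistry:
    """Hypothetical crawler-client state: crawler name -> current configuration."""

    crawlers: Dict[str, CrawlerConfig] = field(default_factory=dict)

    def apply_update(self, message: dict) -> None:
        action = message["action"]
        name = message["crawler"]
        settings = message.get("settings", {})
        if action == "create":
            self.crawlers[name] = CrawlerConfig(**settings)
        elif action == "delete":
            self.crawlers.pop(name, None)
        elif action == "modify":
            config = self.crawlers[name]
            for key, value in settings.items():
                if not hasattr(config, key):
                    raise KeyError(f"unknown setting: {key}")
                setattr(config, key, value)
        else:
            raise ValueError(f"unknown action: {action}")


registry = CrawlerRegistry()
registry.apply_update(
    {"action": "create", "crawler": "news", "settings": {"parse_rule": "//article/h1"}}
)
registry.apply_update(
    {"action": "modify", "crawler": "news", "settings": {"requests_per_second": 0.5}}
)
registry.apply_update({"action": "create", "crawler": "forum"})
registry.apply_update({"action": "delete", "crawler": "forum"})
```

Rejecting unknown actions and settings, rather than ignoring them, is a design choice of this sketch so that a malformed update is noticed instead of being silently dropped.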
In summary, by adopting the above technical solution, the invention has the following beneficial effects:
the invention is built on a message system and pushes updates by broadcast, which reduces management and maintenance cost and ensures that all crawlers can be updated promptly. It therefore has particular advantages in handling crawler failures and batch updates in large-scale or distributed deployments.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and should not be regarded as limiting its scope; a person skilled in the art may obtain other related drawings from them without inventive effort.
FIG. 1 is a schematic diagram of dynamic update of a distributed crawler in an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
The embodiment provides a distributed crawler dynamic updating method, which comprises the following steps:
using broadcast push over the message system, ensuring that each crawler updates itself promptly after receiving an update request. The message system comprises:
a message broker (e.g., RabbitMQ) for enabling decoupled communication between the producer and the consumers;
a queue server for storing and managing pending crawler update tasks.
The components of the resulting distributed crawler dynamic update system are shown in FIG. 1 and comprise:
an exchange implemented on the message system, the exchange supporting a broadcast mode (e.g., a fanout exchange for message broadcasting);
a management center connected to the exchange, the management center acting as a producer that publishes crawler update messages to the exchange (for example, a message instructing a crawler client to create, delete, or modify a crawler, where a modification may be any change to one or more crawlers in the crawler client, including changes to parsing rules, anti-crawling measures, rate control, proxy customization, login verification, crawler output settings, and the like); in addition, the management center can monitor the running state of the system;
one or more crawler clients connected to the exchange, the crawler clients acting as consumers that subscribe to the exchange and receive broadcast crawler update messages in order to perform dynamic crawler updates. A crawler client may be a single crawler, a set of crawlers of different types, or a container holding multiple crawlers.
The specific implementation process is as follows:
S1, creating an exchange on the message system, the exchange supporting a broadcast mode.
S2, the management center creates a producer for sending crawler update messages to the exchange. For example, the producer publishes a message carrying the parsing rules that a crawler uses for a given website.
S3, creating a plurality of consumers, each of which subscribes to the exchange. For example, each consumer subscribes to the exchange created above and listens on it.
S4, when the management center needs to notify all crawler clients subscribed to the exchange to update their crawlers, the management center connects to the exchange as a producer and publishes a crawler update message to the exchange;
S5, when the crawler update message is published to the exchange, all crawler clients subscribed to the exchange receive the broadcast crawler update message, and each crawler client dynamically updates its crawlers according to that message.
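Steps S1 to S5 can be sketched against RabbitMQ, which the embodiment names as an example broker, using the third-party `pika` client. This is an illustrative sketch rather than the embodiment's own code: the exchange name `crawler_updates`, the JSON message format, the `localhost` broker address, and the function names are all assumptions.

```python
import json
from typing import Callable, Optional

EXCHANGE_NAME = "crawler_updates"  # hypothetical exchange name


def encode_update(action: str, crawler: str, settings: Optional[dict] = None) -> bytes:
    # Assumed wire format: a UTF-8 JSON object.
    payload = {"action": action, "crawler": crawler, "settings": settings or {}}
    return json.dumps(payload).encode("utf-8")


def decode_update(body: bytes) -> dict:
    return json.loads(body.decode("utf-8"))


def publish_update(body: bytes, host: str = "localhost") -> None:
    """S1, S2, S4: declare the fanout exchange and publish one update to it."""
    import pika  # third-party; imported lazily so the helpers above work without it

    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    channel.exchange_declare(exchange=EXCHANGE_NAME, exchange_type="fanout")
    # A fanout exchange ignores the routing key, so it is left empty.
    channel.basic_publish(exchange=EXCHANGE_NAME, routing_key="", body=body)
    connection.close()


def run_crawler_client(apply_update: Callable[[dict], None], host: str = "localhost") -> None:
    """S3, S5: bind a private queue to the exchange and apply each update received."""
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    channel.exchange_declare(exchange=EXCHANGE_NAME, exchange_type="fanout")
    # Server-named exclusive queue: one per client, removed when the client disconnects.
    result = channel.queue_declare(queue="", exclusive=True)
    queue_name = result.method.queue
    channel.queue_bind(exchange=EXCHANGE_NAME, queue=queue_name)

    def on_message(ch, method, properties, body):
        apply_update(decode_update(body))

    channel.basic_consume(queue=queue_name, on_message_callback=on_message, auto_ack=True)
    channel.start_consuming()  # blocks until the connection is closed
```

With an exclusive, server-named queue, a client that is offline when an update is published does not receive it later. A deployment that needs offline clients to catch up would more likely give each client a durable, named queue, which is a choice the source leaves open.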
The prior art thus has high management and maintenance costs and generally relies on crawler clients polling a server for updates, which makes it difficult to ensure that all crawlers are updated, for example when a crawler fails or is disconnected from the server. The invention is built on a message system and pushes updates by broadcast, which reduces management and maintenance cost and ensures that all crawlers can be updated promptly. It therefore has particular advantages in handling crawler failures and batch updates in large-scale or distributed deployments.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A distributed crawler dynamic update system, comprising:
an exchange implemented on a message system, the exchange supporting a broadcast mode;
a management center connected to the exchange, the management center acting as a producer;
one or more crawler clients connected to the exchange, the crawler clients acting as consumers and subscribing to the exchange;
wherein the producer publishes a crawler update message to the exchange; the exchange broadcasts the crawler update message; and each consumer receives the broadcast crawler update message and performs a dynamic crawler update accordingly.
2. The distributed crawler dynamic update system of claim 1, wherein said message system comprises:
a message broker for enabling decoupled communications between the producer and the consumer;
a queue server for storing and managing pending crawler update tasks.
3. The distributed crawler dynamic update system of claim 2 wherein said management center is capable of monitoring system operational status.
4. A distributed crawler dynamic update method, comprising:
using broadcast push over the message system, ensuring that each crawler updates itself promptly after receiving an update request.
5. The distributed crawler dynamic update method of claim 4, comprising:
when the management center needs to notify all crawler clients subscribed to the exchange to update their crawlers, the management center connects to the exchange as a producer and publishes a crawler update message to the exchange;
when the crawler update message is published to the exchange, all crawler clients subscribed to the exchange receive the broadcast crawler update message, and each crawler client dynamically updates its crawlers according to that message.
6. The distributed crawler dynamic update method of claim 5, wherein the dynamic crawler update includes a crawler client creating, deleting, or modifying a crawler; wherein the modification of the crawler is any modification to one or more crawlers in the crawler client, including modification of parsing rules, anti-crawling measures, rate control, proxy customization, login verification, and/or crawler output settings.
CN202311808702.8A 2023-12-26 2023-12-26 Distributed crawler dynamic updating system and method Pending CN117828156A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311808702.8A CN117828156A (en) 2023-12-26 2023-12-26 Distributed crawler dynamic updating system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311808702.8A CN117828156A (en) 2023-12-26 2023-12-26 Distributed crawler dynamic updating system and method

Publications (1)

Publication Number Publication Date
CN117828156A (en) 2024-04-05

Family

ID=90505235

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311808702.8A Pending CN117828156A (en) 2023-12-26 2023-12-26 Distributed crawler dynamic updating system and method

Country Status (1)

Country Link
CN (1) CN117828156A (en)

Similar Documents

Publication Publication Date Title
JP4583452B2 (en) Method and system for monitoring server events in a node configuration by using direct communication between servers
CN100579082C (en) Information exchange system, management server, and method for reducing the network load
CN101567807B (en) Knowledge-based failure recovery support system
CN102984012B (en) Management method and system for service resources
CN107197012B (en) Service publishing and monitoring system and method based on metadata management system
CN102014403A (en) Method and system for transmitting network topology information
US20070282993A1 (en) Distribution of system status information using a web feed
CN101383839A (en) Data distribution system based on data server and implementation method
CN103036719A (en) Cross-regional service disaster method and device based on main cluster servers
US8335177B2 (en) Communication system
CN112671697A (en) Data processing method, device and system of comprehensive monitoring system
CN112100004A (en) Management method and storage medium of Redis cluster node
CN114978922B (en) Dynamic topology data acquisition method
CN116319732A (en) Message queue centralized configuration management system and method based on RabbitMQ
JP2010092395A (en) Server management system, server management method and program for server management
CN117828156A (en) Distributed crawler dynamic updating system and method
CN104052723A (en) Information processing method and server
CN112437146B (en) Equipment state synchronization method, device and system
CN111935296B (en) System for high-availability infinite MQTT message service capacity expansion
CN114090687A (en) Data synchronization method and device
CN103684825A (en) Multi-system communication system and maintenance method for same
CN113923142A (en) Method, system and medium for monitoring state of equipment of Internet of things
CN101179415A (en) Method of processing alarm information variation of telecom management network
JP7034139B2 (en) Equipment management method, equipment management equipment and equipment management system
EP2533153A1 (en) Unit for managing messages indicating event situations of monitored objects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination