CN113282372B - Deployment method, device, equipment and storage medium of data collection cluster - Google Patents

Info

Publication number
CN113282372B
Authority
CN
China
Prior art keywords
data collection
warning
keyword
project file
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110604923.8A
Other languages
Chinese (zh)
Other versions
CN113282372A (en)
Inventor
刘亚庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202110604923.8A priority Critical patent/CN113282372B/en
Publication of CN113282372A publication Critical patent/CN113282372A/en
Application granted granted Critical
Publication of CN113282372B publication Critical patent/CN113282372B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45562Creating, deleting, cloning virtual machine instances

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a deployment method for a data collection cluster, comprising the following steps: encapsulating a ScrapydWeb component and a Selenium component in a first docker container to form a master node of the data collection cluster; receiving a data collection project file uploaded by a user; encapsulating a Scrapyd component in a second docker container to form a slave node in the data collection cluster corresponding to the data collection project file; and sending the data collection project file to the slave node corresponding to the data collection project file to complete the deployment of the data collection cluster. The slave node is used for collecting authorized data according to the data collection project file, and the master node is used for managing the slave nodes. The data collection cluster can therefore be formed from docker containers instead of real servers, which simplifies the deployment process of the data collection cluster and improves deployment efficiency. The invention also relates to the technical field of blockchain.

Description

Deployment method, device, equipment and storage medium of data collection cluster
Technical Field
The present invention relates to the field of distributed deployment technologies, and in particular, to a method and an apparatus for deploying a data collection cluster, a computer device, and a storage medium.
Background
Data collection technology is a common data processing technology that can crawl the data users need from a massive number of internet websites. In practice, the amount of data stored on internet websites is usually huge, so a data collection cluster generally has to be built to ensure both the volume and the efficiency of data crawling. Consequently, the data collection cluster must usually be deployed before website data can be crawled. When deploying a data collection cluster, a user is generally required to log in to the servers in the cluster one by one and deploy the crawlers on each server in turn. If the cluster contains many servers, or the crawlers on the servers need to be updated frequently, deploying the cluster becomes a very tedious task. The deployment process of the conventional deployment method is therefore cumbersome, and its deployment efficiency is low.
Disclosure of Invention
The invention aims to solve the technical problems that the deployment process of the conventional data collection cluster deployment method is complicated and the deployment efficiency is low.
In order to solve the above technical problem, a first aspect of the present invention discloses a method for deploying a data collection cluster, where the method includes:
encapsulating the ScrapydWeb component and the Selenium component in a first docker container to form a master node of a data collection cluster;
receiving a data collection project file uploaded by a user;
encapsulating the Scrapyd component in a second docker container to form a slave node in the data collection cluster corresponding to the data collection project file;
sending the data collection project file to a slave node corresponding to the data collection project file to complete the deployment of the data collection cluster;
the slave nodes are used for collecting authorization data according to the data collection project files, and the master node is used for managing the slave nodes.
The second aspect of the present invention discloses a deployment apparatus for a data collection cluster, the apparatus comprising:
the encapsulation module is used for encapsulating the ScrapydWeb component and the Selenium component in a first docker container to form a main node of the data collection cluster;
the receiving module is used for receiving the data collection project file uploaded by the user;
the encapsulation module is further used for encapsulating the Scrapyd component in a second docker container to form a slave node corresponding to the data collection project file in the data collection cluster;
the sending module is used for sending the data collection project file to a slave node corresponding to the data collection project file so as to complete the deployment of the data collection cluster;
the slave nodes are used for collecting authorization data according to the data collection project files, and the master node is used for managing the slave nodes.
A third aspect of the present invention discloses a computer apparatus, comprising:
a memory storing executable program code;
a processor coupled to the memory;
the processor calls the executable program code stored in the memory to execute part or all of the steps in the method for deploying the data collection cluster disclosed by the first aspect of the invention.
In a fourth aspect, the present invention discloses a computer storage medium, which stores computer instructions, and when the computer instructions are called, the computer instructions are used to execute part or all of the steps in the deployment method of the data collection cluster disclosed in the first aspect of the present invention.
In the embodiments of the present invention, the ScrapydWeb component and the Selenium component are encapsulated in a first docker container to form the master node of the data collection cluster, the Scrapyd component is encapsulated in a second docker container to form a slave node of the data collection cluster, and the data collection project file uploaded by the user is then sent to the slave node to complete the deployment of the data collection cluster. Docker containers can thus replace real servers in forming the data collection cluster, which simplifies the deployment process and improves deployment efficiency.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a deployment method of a data collection cluster according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a deployment apparatus of a data collection cluster according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a computer storage medium according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," and the like in the description and claims of the present invention and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, or article that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, apparatus, or article.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The invention discloses a deployment method, a deployment apparatus, a computer device, and a storage medium for a data collection cluster, wherein a ScrapydWeb component and a Selenium component are encapsulated in a first docker container to form the master node of the data collection cluster, a Scrapyd component is encapsulated in a second docker container to form a slave node of the data collection cluster, and a data collection project file uploaded by a user is then sent to the slave node to complete the deployment of the data collection cluster. Details are described below.
Example one
Referring to fig. 1, fig. 1 is a schematic flowchart of a deployment method of a data collection cluster according to an embodiment of the present invention. As shown in fig. 1, the deployment method of the data collection cluster may include the following operations:
101. Encapsulating the ScrapydWeb component and the Selenium component in a first docker container to form a master node of a data collection cluster.
102. A data collection project file uploaded by a user is received.
103. Encapsulating the Scrapyd component in a second docker container to form a slave node of the data collection cluster corresponding to the data collection project file.
104. Sending the data collection project file to a slave node corresponding to the data collection project file to complete the deployment of the data collection cluster; the slave nodes are used for collecting authorization data according to the data collection project files, and the master node is used for managing the slave nodes.
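Steps 101 and 103 can be illustrated with a minimal sketch that only builds the `docker run` command lines for the two kinds of node, without executing them. The image names, container names, and host ports are hypothetical placeholders (the patent does not name any images); the sketch assumes ScrapydWeb listens on its default port 5000 and Scrapyd on its default port 6800 inside their containers.

```python
from typing import List

# Hypothetical image names; the patent does not specify any images.
MASTER_IMAGE = "example/scrapydweb-selenium:latest"
SLAVE_IMAGE = "example/scrapyd:latest"


def master_run_command(name: str = "collector-master", host_port: int = 5000) -> List[str]:
    """Build the argv that would start the master node (ScrapydWeb + Selenium)."""
    # Assumes ScrapydWeb serves on its default port 5000 inside the container.
    return ["docker", "run", "-d", "--name", name,
            "-p", f"{host_port}:5000", MASTER_IMAGE]


def slave_run_command(project: str, host_port: int) -> List[str]:
    """Build the argv that would start one slave node (Scrapyd) for `project`."""
    # Assumes Scrapyd serves on its default port 6800 inside the container.
    return ["docker", "run", "-d", "--name", f"collector-slave-{project}",
            "-p", f"{host_port}:6800", SLAVE_IMAGE]


if __name__ == "__main__":
    print(" ".join(master_run_command()))
    print(" ".join(slave_run_command("air-quality", 6801)))
```

Passing either list to `subprocess.run` would start the container on a host where Docker and the (hypothetical) images are available.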
In the embodiment of the present invention, the docker technology is a technology capable of implementing lightweight virtualization of an operating system, and is capable of implementing a function equivalent to a virtual machine. The docker container is equivalent to a fast and lightweight virtual machine, and can provide a necessary operating environment for the program in the docker container, so that the program in the docker container can run normally. Compared with the traditional virtual machine technology, the docker technology has the advantages of higher utilization rate of system resources, shorter starting time, consistent running environment and the like.
In a conventional deployment of a data collection cluster, each node in the cluster is a server that provides the necessary operating environment for the crawler component on that node. For example, the master node of the data collection cluster may be a server on which the ScrapydWeb component and the Selenium component are installed, and each slave node may be a server on which the Scrapyd component is installed, so that the data collection cluster consists of a plurality of servers. To deploy such a cluster, users often need to log in to the servers one by one and install the corresponding crawler components on each of them, which is the main reason why the conventional deployment process is cumbersome and inefficient. In the embodiment of the present invention, the ScrapydWeb component and the Selenium component are encapsulated in the first docker container to form the master node of the data collection cluster, and the Scrapyd component is encapsulated in the second docker container to form a slave node of the data collection cluster, so that docker containers replace real servers in forming the cluster. When deploying the data collection cluster, the user therefore does not need to log in to servers and install crawler components one by one, which simplifies the deployment process and improves deployment efficiency. In addition, the data collection project file usually contains the data collection program to be run. After the slave node corresponding to the data collection project file is formed, the data collection project file can be sent to that slave node, which then captures data by running the data collection program in the data collection project file.
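One possible way to send a project file to a slave node is Scrapyd's `addversion.json` HTTP endpoint, which accepts a project name, a version, and a Python egg. The sketch below only constructs the request and does not send it. It assumes the data collection project file has already been packaged as an egg; the node URL, project name, and version are hypothetical, and the patent does not state which transport is used.

```python
import urllib.request
import uuid


def build_addversion_request(node_url: str, project: str, version: str,
                             egg_bytes: bytes) -> urllib.request.Request:
    """Build (but do not send) a multipart POST to Scrapyd's addversion.json."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain form fields expected by addversion.json.
    for field, value in (("project", project), ("version", version)):
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{field}"\r\n\r\n'
             f"{value}\r\n").encode("utf-8")
        )
    # The packaged project (assumed here to be a Python egg).
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="egg"; filename="{project}.egg"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode("utf-8")
        + egg_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        url=node_url.rstrip("/") + "/addversion.json",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_addversion_request("http://slave-1:6800/", "air_quality", "1.0", b"EGG")
    print(req.get_method(), req.full_url)
```

Calling `urllib.request.urlopen(req)` would perform the upload against a reachable Scrapyd node.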
It should be noted that the ScrapydWeb component and the Scrapyd component are both components used with the Scrapy crawler framework. A Scrapy-based crawler deployment generally consists of one ScrapydWeb component and a plurality of Scrapyd components, where each Scrapyd component independently runs a corresponding data collection program to crawl data, and the ScrapydWeb component provides a visual management interface through which a user manages the individual Scrapyd components. The Selenium component is an automated testing tool for web applications that provides the necessary environment for the ScrapydWeb component.
It should be noted that the data collection cluster in the embodiment of the present invention may be a crawler cluster, and data is captured by using a crawler technology, so as to collect data, where if the crawled data is data related to user privacy, the data is crawled after being authorized by a user, that is, the crawled data in the embodiment of the present invention is data authorized by the user (i.e., authorized data).
It can be seen that, by implementing this embodiment, the ScrapydWeb component and the Selenium component are encapsulated in the first docker container to form the master node of the data collection cluster, the Scrapyd component is encapsulated in the second docker container to form the slave node of the data collection cluster, and the data collection project file uploaded by the user is then sent to the slave node to complete the deployment of the data collection cluster, so that docker containers can replace real servers in forming the data collection cluster.
In an optional embodiment, after the sending the data collection project file to the slave node corresponding to the data collection project file to complete the deployment of the data collection cluster, the method further includes:
receiving a data collection log uploaded by the slave node;
detecting whether a target keyword exists in the data collection log, wherein the target keyword is any keyword in a preset keyword set;
and when the target keyword is detected to exist in the data collection log, sending a warning prompt to the user.
If an abnormality occurs while data is being crawled, handling it as soon as possible helps keep the crawl stable. Therefore, after the deployment of the data collection cluster is completed, the data crawling status of the slave node can be monitored, and the user can be notified to handle any abnormality that occurs. Specifically, the slave node can upload the data collection log (i.e., the crawler log of the crawler cluster) generated during data crawling; whether the data crawling process is abnormal is then determined by checking whether any keyword in the keyword set exists in the data collection log, and if it is abnormal, a warning prompt is sent to the user. For example, the keyword set may include five keywords, namely DEBUG, WARNING, INFO, ERROR, and CRITICAL. As long as at least one of these keywords is detected in the data collection log (for example, the log contains the DEBUG keyword, or contains both the DEBUG keyword and the WARNING keyword), a warning prompt may be issued to the user. The warning prompt may be sent to the user by sending a warning email to the user's mailbox.
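The keyword check described above can be sketched as follows. The function names and the sample log lines are illustrative assumptions; matching on whole words is also an assumption, since the text only says that a keyword "exists" in the log.

```python
import re
from typing import Iterable, Set

# The five keywords given as an example in the description.
KEYWORD_SET = ("DEBUG", "WARNING", "INFO", "ERROR", "CRITICAL")


def find_target_keywords(log_text: str,
                         keyword_set: Iterable[str] = KEYWORD_SET) -> Set[str]:
    """Return every keyword of the preset set that appears in the log."""
    return {kw for kw in keyword_set
            if re.search(rf"\b{re.escape(kw)}\b", log_text)}


def should_warn(log_text: str) -> bool:
    """A warning is due as soon as at least one target keyword is present."""
    return bool(find_target_keywords(log_text))


if __name__ == "__main__":
    sample = (
        "2021-05-31 10:00:00 [scrapy.core.engine] DEBUG: Crawled (200)\n"
        "2021-05-31 10:00:01 [scrapy.core.scraper] WARNING: Dropped item\n"
    )
    print(sorted(find_target_keywords(sample)))
```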
Therefore, the optional embodiment is implemented, the data collection log uploaded by the slave node is received, whether the data crawling process of the slave node is abnormal or not is judged by detecting whether the target keyword exists in the data collection log, and if the target keyword exists in the data collection log, a warning prompt is sent to a user, so that the user can process the abnormality occurring in the data crawling process of the slave node in time, and the stability of data crawling of the slave node is guaranteed.
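Since the warning prompt may take the form of a warning email, a minimal sketch of composing such a message with Python's standard `email` package is given here. The sender address, subject format, and function name are hypothetical, and the message is only built, not sent.

```python
from email.message import EmailMessage
from typing import Iterable


def build_warning_mail(to_addr: str, node: str, keywords: Iterable[str]) -> EmailMessage:
    """Compose a warning email about keywords found in a slave node's log."""
    found = ", ".join(sorted(keywords))
    msg = EmailMessage()
    msg["From"] = "collector-alerts@example.com"  # hypothetical sender address
    msg["To"] = to_addr
    msg["Subject"] = f"Data collection warning on {node}"
    msg.set_content(
        f"The data collection log uploaded by {node} contains: {found}.\n"
        "Please check the data crawling process of this slave node."
    )
    return msg


if __name__ == "__main__":
    print(build_warning_mail("user@example.com", "slave-1", ["ERROR", "CRITICAL"]))
```

The message could then be delivered with `smtplib.SMTP(...).send_message(msg)` against a mail server of the operator's choosing.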
In an optional embodiment, after detecting that the target keyword exists in the data collection log and before issuing a warning prompt to the user, the method further includes:
judging whether the data collection log meets a preset warning condition or not;
and triggering and executing the step of sending out a warning prompt to the user when the data collection log meets the warning condition.
In practical applications, there may be a case where although the data collection log has the target keyword, a warning prompt does not need to be issued to the user. For example, in the data crawling process of the slave node, the target keyword may occasionally appear in the crawling log due to accidental factors such as network fluctuation and radio wave interference, but such accidental factors usually disappear relatively quickly and do not substantially affect the data crawling process of the slave node, and at this time, although the target keyword appears in the data collecting log, no warning prompt needs to be given to the user. For another example, the time when the target keyword appears in the data collection log is not in the working time period of the user, and at this time, a warning prompt does not need to be given to the user. Therefore, in order to ensure the accuracy of the warning prompt issued to the user, after the target keyword is detected in the data collection log, whether the data collection log meets a preset warning condition (described later in detail) can be continuously judged, and the warning prompt is issued to the user after the warning condition is met, so that the accuracy of the warning prompt issued to the user can be ensured.
Therefore, by implementing the optional embodiment, after the target keyword is detected to exist in the data collection log, whether the data collection log meets the preset warning condition or not can be continuously judged, and the warning prompt is sent to the user only after the warning condition is met, so that the accuracy of the warning prompt sent to the user can be ensured.
In an optional embodiment, each keyword in the keyword set is preset with a corresponding warning number threshold;
and, the judging whether the data collection log meets a preset warning condition includes:
determining the number of occurrences of each target keyword in the data collection log;
judging whether a times warning keyword exists in the data collection log, wherein the times warning keyword is a target keyword whose number of occurrences is greater than its corresponding warning number threshold;
when the times warning keyword exists in the data collection log, determining that the data collection log meets a preset warning condition;
when the times warning keyword does not exist in the data collection log, determining that the data collection log does not satisfy the warning condition.
In this optional embodiment, different warning number thresholds may be preset for different keywords. For example, because the degree of abnormality represented by the DEBUG keyword is severe, the warning number threshold corresponding to the DEBUG keyword may be preset to 1; because the degree of abnormality represented by the WARNING keyword is mild, the warning number threshold corresponding to the WARNING keyword may be preset to 5. The data collection log is determined to meet the warning condition, and a warning prompt is sent to the user, only when the number of occurrences of a keyword in the data collection log is greater than the warning number threshold corresponding to that keyword, which better guarantees the accuracy of the warning prompts sent to the user. For example, if the WARNING keyword occurs no more than 5 times in the data collection log, its number of occurrences is not greater than the corresponding warning number threshold of 5, so it can be determined that the data collection log does not satisfy the warning condition. For another example, if the WARNING keyword occurs 6 times in the data collection log, its number of occurrences is greater than the corresponding warning number threshold of 5, so it can be determined that the data collection log satisfies the warning condition.
Therefore, by implementing the optional embodiment, after the target keyword is detected to exist in the data collection log, when the number of times of the target keyword appearing in the data collection log is greater than the warning number threshold corresponding to the target keyword, the data collection log is determined to meet the warning condition, and then the warning prompt is sent to the user, so that the accuracy of the warning prompt sent to the user is better ensured.
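The threshold-based warning condition can be sketched as below, using the example thresholds from the description (1 for DEBUG, 5 for WARNING). The function names are illustrative, and counting whole-word matches is an assumption about how occurrences are counted.

```python
import re
from typing import Dict, Set

# Example thresholds taken from the description.
THRESHOLDS: Dict[str, int] = {"DEBUG": 1, "WARNING": 5}


def count_keywords(log_text: str, thresholds: Dict[str, int]) -> Dict[str, int]:
    """Count whole-word occurrences of each keyword in the log."""
    return {kw: len(re.findall(rf"\b{re.escape(kw)}\b", log_text))
            for kw in thresholds}


def times_warning_keywords(log_text: str,
                           thresholds: Dict[str, int] = THRESHOLDS) -> Set[str]:
    """Keywords whose occurrence count is strictly greater than their threshold."""
    counts = count_keywords(log_text, thresholds)
    return {kw for kw, n in counts.items() if n > thresholds[kw]}


def meets_count_condition(log_text: str,
                          thresholds: Dict[str, int] = THRESHOLDS) -> bool:
    """The warning condition holds if at least one such keyword exists."""
    return bool(times_warning_keywords(log_text, thresholds))


if __name__ == "__main__":
    print(meets_count_condition("WARNING: retry\n" * 5))  # 5 is not greater than 5
    print(meets_count_condition("WARNING: retry\n" * 6))  # 6 is greater than 5
```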
In an optional embodiment, each keyword in the keyword set is preset with a corresponding warning time period;
and after judging that the times warning keyword exists in the data collection log and before determining that the data collection log meets a preset warning condition, the method further comprises the following steps:
determining the occurrence time of the times warning keyword in the data collection log;
judging whether the appearance moment corresponding to the times warning keyword is in the warning time period corresponding to the times warning keyword or not;
when the occurrence moment corresponding to the times warning keyword is within the warning time period corresponding to the times warning keyword, triggering and executing the step of determining that the data collection log meets the preset warning condition;
and when the appearance moment corresponding to the times warning keyword is not within the warning time period corresponding to the times warning keyword, triggering and executing the step of determining that the data collection log does not meet the warning condition.
In this optional embodiment, different warning time periods may be preset for different keywords. For example, because the abnormality represented by the DEBUG keyword is severe and needs to be handled as soon as possible, the warning time period corresponding to the DEBUG keyword may be preset to 6 am to 11 pm; because the abnormality represented by the WARNING keyword is mild and does not need to be handled immediately, the warning time period corresponding to the WARNING keyword may be preset to the user's working hours, that is, 9 am to 6 pm. In this way, after the times warning keyword is determined to exist in the data collection log, it is further judged whether the occurrence moment of the times warning keyword in the data collection log is within the warning time period corresponding to that keyword. If so, the data collection log is determined to meet the warning condition and a warning prompt is sent to the user; if not, the data collection log is determined not to meet the warning condition and no warning prompt is sent. For example, if the times warning keyword WARNING occurs in the data collection log at 10 am, which is within the warning time period of the WARNING keyword, a warning prompt may be issued to the user. For another example, if the times warning keyword WARNING occurs in the data collection log at 6 am, which is not within the warning time period of the WARNING keyword, no warning prompt needs to be issued to the user.
Therefore, by implementing this optional embodiment, after the times warning keyword is determined to exist in the data collection log, the data collection log is determined to meet the warning condition, and a warning prompt is sent to the user, only when the occurrence moment of the times warning keyword in the data collection log is within the warning time period corresponding to that keyword, which better ensures the accuracy of the warning prompts sent to the user.
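Combining the count threshold with the warning time period might look like the following sketch. Several details are assumptions not fixed by the text: each log line is assumed to start with a `YYYY-MM-DD HH:MM:SS` timestamp, the period is treated as start-inclusive and end-exclusive, and the condition is treated as met when any occurrence of a times warning keyword falls inside its period.

```python
import re
from datetime import datetime, time
from typing import Dict, List, Tuple

# Example thresholds and periods taken from the description.
THRESHOLDS: Dict[str, int] = {"DEBUG": 1, "WARNING": 5}
PERIODS: Dict[str, Tuple[time, time]] = {
    "DEBUG": (time(6), time(23)),    # 6 am to 11 pm
    "WARNING": (time(9), time(18)),  # 9 am to 6 pm
}

# Assumed line layout: "<timestamp> ... <LEVEL> ...".
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*?\b(DEBUG|INFO|WARNING|ERROR|CRITICAL)\b"
)


def occurrences(log_text: str) -> Dict[str, List[datetime]]:
    """Map each keyword to the timestamps of the lines on which it appears."""
    found: Dict[str, List[datetime]] = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            found.setdefault(m.group(2), []).append(stamp)
    return found


def meets_warning_condition(log_text: str,
                            thresholds: Dict[str, int] = THRESHOLDS,
                            periods: Dict[str, Tuple[time, time]] = PERIODS) -> bool:
    """True if some keyword exceeds its threshold and occurs inside its period."""
    for kw, stamps in occurrences(log_text).items():
        if kw not in thresholds or kw not in periods:
            continue
        if len(stamps) <= thresholds[kw]:
            continue  # not a times warning keyword
        start, end = periods[kw]
        if any(start <= s.time() < end for s in stamps):
            return True
    return False
```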
In an optional embodiment, each keyword in the keyword set is preset with a corresponding warning time period;
and, the judging whether the data collection log meets a preset warning condition includes:
determining the occurrence time of each target keyword in the data collection log;
judging whether a moment warning keyword exists in the data collection log, wherein the moment warning keyword is a target keyword whose occurrence moment is within its corresponding warning time period;
when the moment warning keyword exists in the data collection log, determining that the data collection log meets a preset warning condition;
and when the moment warning keyword does not exist in the data collection log, determining that the data collection log does not meet the warning condition.
In this optional embodiment, whether the data collection log satisfies the warning condition may also be determined solely by judging whether the occurrence moment of each target keyword in the data collection log is within the warning time period corresponding to that target keyword. Specifically, different warning time periods can be preset for different keywords. For example, because the abnormality represented by the DEBUG keyword is severe and needs to be handled as soon as possible, the warning time period corresponding to the DEBUG keyword can be preset to 6 am to 11 pm; because the abnormality represented by the WARNING keyword is mild and does not need to be handled immediately, the warning time period corresponding to the WARNING keyword can be preset to the user's working hours, namely 9 am to 6 pm. In this way, after the target keyword is detected in the data collection log, it is further judged whether the occurrence moment of the target keyword in the data collection log is within the warning time period corresponding to the target keyword. If so, the data collection log is determined to meet the warning condition and a warning prompt is sent to the user; if not, the data collection log is determined not to meet the warning condition and no warning prompt needs to be sent. For example, if the target keyword WARNING appears in the data collection log at 10 am, which is within the warning time period of the WARNING keyword, it can be determined that the data collection log satisfies the warning condition. For another example, if the target keyword WARNING appears in the data collection log at 6 am, which is not within the warning time period of the WARNING keyword, it can be determined that the data collection log does not satisfy the warning condition.
Therefore, by implementing the optional embodiment, after the target keyword is detected in the data collection log, when the appearance moment of the target keyword appearing in the data collection log is within the warning time period corresponding to the target keyword, the data collection log is determined to meet the warning condition, and then the warning prompt is sent to the user, so that the accuracy of the warning prompt sent to the user is better ensured.
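The time-period-only variant can be sketched on already-parsed log events, each represented as a hypothetical `(keyword, timestamp)` pair. Treating the period as start-inclusive and end-exclusive is an assumption; the description only gives the 10 am and 6 am examples for WARNING.

```python
from datetime import datetime, time
from typing import Dict, Iterable, Set, Tuple

# Example warning time periods taken from the description.
PERIODS: Dict[str, Tuple[time, time]] = {
    "DEBUG": (time(6), time(23)),    # 6 am to 11 pm
    "WARNING": (time(9), time(18)),  # 9 am to 6 pm
}


def moment_warning_keywords(events: Iterable[Tuple[str, datetime]],
                            periods: Dict[str, Tuple[time, time]] = PERIODS) -> Set[str]:
    """Return target keywords that occur inside their own warning time period."""
    hits: Set[str] = set()
    for keyword, stamp in events:
        if keyword not in periods:
            continue
        start, end = periods[keyword]
        if start <= stamp.time() < end:
            hits.add(keyword)
    return hits


def meets_time_condition(events: Iterable[Tuple[str, datetime]]) -> bool:
    """The warning condition holds if at least one moment warning keyword exists."""
    return bool(moment_warning_keywords(events))


if __name__ == "__main__":
    print(meets_time_condition([("WARNING", datetime(2021, 5, 31, 10, 0))]))
    print(meets_time_condition([("WARNING", datetime(2021, 5, 31, 6, 0))]))
```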
In an optional embodiment, the data collection project file is a general data collection project file or a focused data collection project file, where the general data collection project file includes at least a uniform resource locator (URL) of a target website and location information recording the location of target data in the target website, the focused data collection project file includes at least a preset data collection program, the target website is a website from which the slave node is to capture authorized data, and the target data is the authorized data to be captured from the target website by the slave node;
when the data collection project file is the general data collection project file, the slave node is used for collecting the target data from the target website according to the location information and the uniform resource locator;
and when the data collection project file is the focused data collection project file, the slave node is used for operating a data collection program contained in the focused data collection project file so as to capture authorized data.
In this optional embodiment, the slave node may crawl data according to the data collection project file in two ways: as a general crawler or as a focused crawler. When the data collection project file uploaded by the user is a general data collection project file, the slave node crawls data in the general crawler mode. The general data collection project file, which may be a JSON file, may include the URL (uniform resource locator) of the website to be crawled and the location information of the data to be crawled within that website. The location information may be an XPath expression or a regular expression corresponding to fields of the website such as title, time, content, and author. When the data collection project file uploaded by the user is a focused data collection project file, the slave node crawls data in the focused crawler mode. The focused data collection project file may include an existing, publicly available data collection program (i.e., a crawler program), such as an existing program for capturing national ambient air quality monitoring data, water quality data, wind field data, and the like. Such existing data collection programs are usually mature, so the slave node can capture the corresponding data simply by running them.
Therefore, by implementing this optional embodiment, the data collection project file uploaded by the user can be either a general data collection project file or a focused data collection project file, so that the deployed data collection cluster can capture data in either the general crawler mode or the focused crawler mode. This enriches the functions of the deployed data collection cluster and makes it better suited to actual application scenarios.
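The two kinds of project file can be illustrated with a short sketch. Everything below is an assumption made for illustration: the JSON field names ("type", "url", "fields", "program"), the example URL, and the functions extract_fields and run_project do not come from the embodiment. The location information is given as regular expressions, which is one of the two forms the text allows, and page fetching is injected as a callable so that the sketch does not touch the network.

```python
import json
import re

# Hypothetical general project file: a URL plus per-field location information.
GENERAL_PROJECT = json.loads("""
{
  "type": "general",
  "url": "https://example.com/news/1",
  "fields": {
    "title": "<h1>(.*?)</h1>",
    "time": "<time>(.*?)</time>"
  }
}
""")


def extract_fields(html, fields):
    """Apply each field's regular expression to the page and keep group 1."""
    result = {}
    for name, pattern in fields.items():
        match = re.search(pattern, html, flags=re.DOTALL)
        result[name] = match.group(1).strip() if match else None
    return result


def run_project(project, fetch, programs):
    """Dispatch on the project type.

    fetch:    callable(url) -> html text (stand-in for the real download step)
    programs: dict mapping a program name to a ready-made collection callable
    """
    if project["type"] == "general":
        return extract_fields(fetch(project["url"]), project["fields"])
    if project["type"] == "focused":
        return programs[project["program"]]()
    raise ValueError("unknown project type: %s" % project["type"])
```

A general project only needs configuration, whereas a focused project hands control to an existing program; keeping both behind one dispatch function mirrors how a single slave node could serve either kind of file.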
Optionally, after the data collection project file is sent to the slave node corresponding to the data collection project file to complete the deployment of the data collection cluster, timing scheduling information that is input by the user and includes at least a timing scheduling time may be received, and the slave node is then scheduled based on the timing scheduling information, so that the slave node starts to collect the authorized data according to the data collection project file at the timing scheduling time.
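One way the timed scheduling could work is sketched below, assuming the master node starts a run through the schedule.json endpoint that Scrapyd exposes on the slave node; the text does not state which mechanism is used. The function names, the node address and the project and spider names are hypothetical, and the request is only built, not sent.

```python
from datetime import datetime
from urllib.parse import urlencode
from urllib.request import Request


def seconds_until(scheduled_at, now):
    """Delay before the slave node should start collecting; never negative."""
    return max(0.0, (scheduled_at - now).total_seconds())


def build_schedule_request(node_url, project, spider):
    """Build (but do not send) a POST to the slave node's schedule.json endpoint."""
    body = urlencode({"project": project, "spider": spider}).encode("utf-8")
    return Request(node_url.rstrip("/") + "/schedule.json", data=body, method="POST")
```

In practice the delay could be passed to a timer (for example threading.Timer) whose callback sends the prepared request with urllib.request.urlopen.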
Optionally, after the data collection project file is sent to the slave node corresponding to the data collection project file to complete deployment of the data collection cluster, the state information uploaded by the slave node and used for recording the working state of the slave node may be received, and then the working state of the slave node is displayed on the interactive interface based on the state information.
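As a rough illustration of turning uploaded state information into something displayable, the sketch below formats a JSON status payload into one line of text. The payload shape (a "status" flag plus "running", "pending" and "finished" counters) is an assumption loosely modeled on the kind of counters a Scrapyd daemon reports, and summarize_status is a hypothetical helper.

```python
import json


def summarize_status(node, payload):
    """Format a slave node's JSON status payload for display on the interface."""
    info = json.loads(payload)
    state = "up" if info.get("status") == "ok" else "down"
    return "%s: %s, running=%d, pending=%d, finished=%d" % (
        node,
        state,
        int(info.get("running", 0)),
        int(info.get("pending", 0)),
        int(info.get("finished", 0)),
    )
```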
Optionally, the deployment information of the data collection cluster produced by the deployment method may also be uploaded to a blockchain.
Specifically, the deployment information of the data collection cluster is obtained by running the deployment method of the data collection cluster and records the deployment status of the data collection cluster, for example, the upload time of the data collection project file, the container identifier of the first docker container, the container identifier of the second docker container, and the completion time of the deployment. Uploading the deployment information to the blockchain helps ensure its security and its fairness and transparency to users. A user can download the deployment information from the blockchain to verify whether it has been tampered with. The blockchain referred to in this example is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
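The tamper-evidence property mentioned above can be illustrated with a toy hash chain. This is a deliberately simplified sketch, not a blockchain platform: there is no network, consensus or signing, and the record fields and function names are hypothetical. It only shows why altering a stored deployment record becomes detectable once each block's hash covers the previous block's hash.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(prev_hash, record):
    """SHA-256 over the previous hash and a canonical JSON form of the record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + payload).hexdigest()


def append_block(chain, record):
    """Append a deployment record, linking it to the previous block."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "prev": prev_hash,
                  "hash": block_hash(prev_hash, record)})


def verify(chain):
    """Recompute every hash; any edited record or broken link fails the check."""
    prev_hash = GENESIS
    for block in chain:
        if block["prev"] != prev_hash:
            return False
        if block["hash"] != block_hash(prev_hash, block["record"]):
            return False
        prev_hash = block["hash"]
    return True
```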
Example two
Referring to fig. 2, fig. 2 is a schematic structural diagram of a deployment apparatus of a data collection cluster according to an embodiment of the present invention. As shown in fig. 2, the deployment apparatus of the data collection cluster may include:
an encapsulation module 201, configured to encapsulate the ScrapydWeb component and the Selenium component in a first docker container to form a master node of a data collection cluster;
a receiving module 202, configured to receive a data collection project file uploaded by a user;
the encapsulation module 201 is further configured to encapsulate the Scrapyd component in a second docker container to form a slave node in the data collection cluster corresponding to the data collection project file;
a sending module 203, configured to send the data collection project file to a slave node corresponding to the data collection project file, so as to complete deployment of the data collection cluster;
the slave node is used for collecting authorized data according to the data collection project file, and the master node is used for managing the slave node.
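The division of work among the modules can be pictured with a small in-memory model. This is only a structural sketch under stated assumptions: the classes Node and DeploymentApparatus and their method names are hypothetical, containers are represented by plain objects rather than real docker containers, and one slave node is created per project file.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    role: str                 # "master" or "slave"
    components: tuple         # software packaged in the node's container
    project_file: str = ""    # project file delivered to a slave node


@dataclass
class DeploymentApparatus:
    master: Optional[Node] = None
    slaves: dict = field(default_factory=dict)

    def encapsulate_master(self):
        # Encapsulation module 201: ScrapydWeb and Selenium in the first container.
        self.master = Node("master", ("ScrapydWeb", "Selenium"))

    def encapsulate_slave(self, name):
        # Encapsulation module 201 again: Scrapyd in a second container.
        self.slaves[name] = Node("slave", ("Scrapyd",))

    def send(self, name, content):
        # Sending module 203: deliver the project file to its slave node.
        self.slaves[name].project_file = content

    def deploy(self, name, content):
        # Receiving module 202 is modeled by the (name, content) arguments.
        if self.master is None:
            self.encapsulate_master()
        self.encapsulate_slave(name)
        self.send(name, content)
```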
In an optional embodiment, the receiving module 202 is further configured to receive a data collection log uploaded by a slave node after the sending module 203 sends the data collection project file to the slave node corresponding to the data collection project file to complete the deployment of the data collection cluster;
and, the apparatus further comprises:
the detection module is used for detecting whether a target keyword exists in the data collection log, wherein the target keyword is any keyword in a preset keyword set; and when the target keyword is detected to exist in the data collection log, sending a warning prompt to the user.
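A minimal sketch of the detection step follows. The keyword set, the plain substring matching and the notify callback are all assumptions made for illustration; the embodiment only requires that a preset keyword be found in the log and that a prompt then be issued.

```python
# Hypothetical preset keyword set.
KEYWORD_SET = {"ERROR", "WARNING", "DEBUG"}


def find_target_keywords(log_text, keywords=KEYWORD_SET):
    """Return the preset keywords that occur anywhere in the log text."""
    return {keyword for keyword in keywords if keyword in log_text}


def check_log(log_text, notify):
    """Call notify(message) once if any target keyword is present."""
    found = find_target_keywords(log_text)
    if found:
        notify("Data collection log contains: " + ", ".join(sorted(found)))
    return found
```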
In an optional embodiment, the apparatus further comprises:
the judging module is used for judging, after the detection module detects that the target keyword exists in the data collection log and before the detection module sends a warning prompt to the user, whether the data collection log meets a preset warning condition; and when the data collection log meets the warning condition, triggering the detection module to execute the step of sending a warning prompt to the user.
In an optional embodiment, each keyword in the keyword set is preset with a corresponding warning number threshold;
and the specific way in which the judging module judges whether the data collection log meets the preset warning condition is as follows:
determining the number of occurrences of each target keyword in the data collection log;
judging whether a times warning keyword exists in the data collection log, wherein the times warning keyword is a target keyword whose number of occurrences is greater than its corresponding warning number threshold;
when the times warning keyword exists in the data collection log, determining that the data collection log meets the preset warning condition;
when the times warning keyword does not exist in the data collection log, determining that the data collection log does not satisfy the warning condition.
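The count-based condition can be sketched as below. The threshold values and function names are hypothetical, and counting whole upper-case tokens is an assumed tokenization; the comparison is strictly "greater than", as in the text.

```python
import re
from collections import Counter

# Hypothetical per-keyword warning number thresholds.
THRESHOLDS = {"ERROR": 0, "WARNING": 2}


def count_keywords(log_text, keywords):
    """Count how often each preset keyword appears as an upper-case token."""
    tokens = re.findall(r"[A-Z]+", log_text)
    return Counter(token for token in tokens if token in keywords)


def times_warning_keywords(log_text, thresholds=THRESHOLDS):
    """Keywords whose occurrence count is greater than their threshold."""
    counts = count_keywords(log_text, thresholds)
    return {keyword for keyword, n in counts.items() if n > thresholds[keyword]}


def meets_count_condition(log_text, thresholds=THRESHOLDS):
    """The log meets the warning condition if any such keyword exists."""
    return bool(times_warning_keywords(log_text, thresholds))
```

With these thresholds a single ERROR is enough to trigger a prompt, while WARNING must appear at least three times.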
In an optional embodiment, each keyword in the keyword set is preset with a corresponding warning time period;
the judging module is further configured to determine the occurrence time of the times warning keyword in the data collection log after judging that the times warning keyword exists in the data collection log and before determining that the data collection log meets the preset warning condition; judge whether the occurrence time corresponding to the times warning keyword is within the warning time period corresponding to the times warning keyword; when the occurrence time corresponding to the times warning keyword is within the warning time period corresponding to the times warning keyword, trigger execution of the step of determining that the data collection log meets the preset warning condition; and when the occurrence time corresponding to the times warning keyword is not within the warning time period corresponding to the times warning keyword, trigger execution of the step of determining that the data collection log does not meet the warning condition.
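Combining the two checks might look like the following sketch. The rule table and function name are hypothetical. The text does not say which occurrence is examined when a keyword appears several times; using the latest occurrence is an assumption made here, as is treating the window boundaries as inclusive.

```python
from datetime import datetime, time

# Hypothetical rules: keyword -> (warning number threshold, window start, window end).
RULES = {
    "WARNING": (2, time(9, 0), time(18, 0)),
}


def meets_combined_condition(occurrences, rules=RULES):
    """occurrences: list of (keyword, datetime) pairs extracted from one log."""
    by_keyword = {}
    for keyword, moment in occurrences:
        by_keyword.setdefault(keyword, []).append(moment)
    for keyword, moments in by_keyword.items():
        if keyword not in rules:
            continue
        threshold, start, end = rules[keyword]
        exceeds_count = len(moments) > threshold
        # Assumption: the latest occurrence decides the time-period check.
        in_window = start <= max(moments).time() <= end
        if exceeds_count and in_window:
            return True
    return False
```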
In an optional embodiment, each keyword in the keyword set is preset with a corresponding warning time period;
and the specific way for judging whether the data collection log meets the preset warning condition by the judging module is as follows:
determining the occurrence time of each target keyword in the data collection log;
judging whether a moment warning keyword exists in the data collection log, wherein the moment warning keyword is a target keyword whose occurrence time falls within its corresponding warning time period;
when the moment warning keyword exists in the data collection log, determining that the data collection log meets the preset warning condition;
and when the moment warning keyword does not exist in the data collection log, determining that the data collection log does not meet the warning condition.
In an optional embodiment, the data collection project file is a general data collection project file or a focused data collection project file, where the general data collection project file includes at least a uniform resource locator (URL) of a target website and location information recording the location of target data in the target website, the focused data collection project file includes at least a preset data collection program, the target website is a website from which the slave node is to capture authorized data, and the target data is the authorized data to be captured from the target website by the slave node;
when the data collection project file is the general data collection project file, the slave node is used for collecting the target data from the target website according to the location information and the uniform resource locator;
and when the data collection project file is the focused data collection project file, the slave node is used for operating a data collection program contained in the focused data collection project file so as to capture authorized data.
For the specific description of the deployment apparatus of the data collection cluster, reference may be made to the specific description of the deployment method of the data collection cluster, and in order to avoid repetition, details are not repeated here.
Example three
Referring to fig. 3, fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention. As shown in fig. 3, the computer apparatus may include:
a memory 301 storing executable program code;
a processor 302 connected to the memory 301;
the processor 302 calls the executable program code stored in the memory 301 to execute the steps in the method for deploying the data collection cluster disclosed in the embodiment of the present invention.
Example four
Referring to fig. 4, an embodiment of the present invention discloses a computer storage medium 401, where the computer storage medium 401 stores computer instructions that, when invoked, are used to execute the steps in the deployment method of a data collection cluster disclosed in an embodiment of the present invention.
The above-described embodiments of the apparatus are merely illustrative, and the modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above detailed description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. Based on such understanding, the above technical solutions may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, where the storage medium includes a Read-Only Memory (ROM), a Random Access Memory (RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-Time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.
Finally, it should be noted that the deployment method, apparatus, computer device and storage medium of a data collection cluster disclosed in the embodiments of the present invention are only preferred embodiments of the present invention, and are used only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A method for deploying a data collection cluster, the method comprising:
encapsulating the ScrapydWeb component and the Selenium component in a first docker container to form a master node of a data collection cluster;
receiving a data collection project file uploaded by a user; the data collection project file is a general data collection project file or a focused data collection project file, wherein the general data collection project file at least comprises a uniform resource locator (URL) of a target website and position information recording the position of target data in the target website, the focused data collection project file at least comprises a preset data collection program, the target website is a website from which a slave node is to capture authorized data, and the target data is the authorized data to be captured by the slave node from the target website;
encapsulating the Scrapyd component in a second docker container to form a slave node in the data collection cluster corresponding to the data collection project file;
sending the data collection project file to a slave node corresponding to the data collection project file to complete the deployment of the data collection cluster;
receiving timing scheduling information that is input by a user and at least comprises a timing scheduling time, and scheduling the slave node based on the timing scheduling information so that the slave node starts to collect authorized data according to the data collection project file at the timing scheduling time; specifically, when the data collection project file is the general data collection project file, the slave node is configured to collect the target data from the target website according to the position information and the uniform resource locator; and when the data collection project file is the focused data collection project file, the slave node is used for running the data collection program contained in the focused data collection project file so as to capture authorized data.
2. The method for deploying the data collection cluster according to claim 1, wherein after the sending the data collection project file to the slave node corresponding to the data collection project file to complete the deployment of the data collection cluster, the method further comprises:
receiving a data collection log uploaded by the slave node;
detecting whether a target keyword exists in the data collection log, wherein the target keyword is any keyword in a preset keyword set;
and when the target keyword is detected to exist in the data collection log, sending a warning prompt to the user.
3. The method for deploying a data collection cluster according to claim 2, wherein after detecting that the target keyword exists in the data collection log, before issuing a warning prompt to the user, the method further comprises:
judging whether the data collection log meets a preset warning condition or not;
and triggering and executing the step of sending out a warning prompt to the user when the data collection log meets the warning condition.
4. The deployment method of the data collection cluster according to claim 3, wherein each keyword in the keyword set is preset with a corresponding warning number threshold;
and, the judging whether the data collection log meets a preset warning condition includes:
determining the occurrence frequency of each target keyword in the data collection log;
judging whether a times warning keyword exists in the data collection log, wherein the times warning keyword is a target keyword whose number of occurrences is greater than its corresponding warning number threshold;
when the times warning keyword exists in the data collection log, determining that the data collection log meets the preset warning condition;
when the times warning keyword does not exist in the data collection log, determining that the data collection log does not satisfy the warning condition.
5. The deployment method of the data collection cluster according to claim 4, wherein each keyword in the keyword set is preset with a corresponding warning time period;
and after judging that the times warning keyword exists in the data collection log and before determining that the data collection log meets a preset warning condition, the method further comprises the following steps:
determining the occurrence time of the times warning keyword in the data collection log;
judging whether the appearance moment corresponding to the times warning keyword is in the warning time period corresponding to the times warning keyword or not;
when the appearance moment corresponding to the times warning keyword is within the warning time period corresponding to the times warning keyword, triggering and executing the step of determining that the data collection log meets the preset warning condition;
and when the appearance moment corresponding to the times warning keyword is not within the warning time period corresponding to the times warning keyword, triggering and executing the step of determining that the data collection log does not meet the warning condition.
6. The deployment method of the data collection cluster according to claim 3, wherein each keyword in the keyword set is preset with a corresponding warning time period;
and, the judging whether the data collection log meets a preset warning condition includes:
determining the occurrence time of each target keyword in the data collection log;
judging whether a moment warning keyword exists in the data collection log, wherein the moment warning keyword is a target keyword whose occurrence time falls within its corresponding warning time period;
when the moment warning keyword exists in the data collection log, determining that the data collection log meets the preset warning condition;
when the moment warning keyword does not exist in the data collection log, determining that the data collection log does not meet the warning condition.
7. An apparatus for deploying a data collection cluster, the apparatus comprising:
the encapsulation module is used for encapsulating the ScrapydWeb component and the Selenium component in a first docker container to form a master node of the data collection cluster;
the receiving module is used for receiving the data collection project file uploaded by the user; the data collection project file is a general data collection project file or a focused data collection project file, wherein the general data collection project file at least comprises a uniform resource locator (URL) of a target website and position information recording the position of target data in the target website, the focused data collection project file at least comprises a preset data collection program, the target website is a website from which a slave node is to capture authorized data, and the target data is the authorized data to be captured by the slave node from the target website;
the encapsulation module is further used for encapsulating the Scrapyd component in a second docker container to form a slave node corresponding to the data collection project file in the data collection cluster;
the sending module is used for sending the data collection project files to the slave nodes corresponding to the data collection project files so as to complete the deployment of the data collection cluster;
receiving timing scheduling information that is input by a user and at least comprises a timing scheduling time, and scheduling the slave node based on the timing scheduling information so that the slave node starts to collect authorized data according to the data collection project file at the timing scheduling time; specifically, when the data collection project file is the general data collection project file, the slave node is configured to collect the target data from the target website according to the position information and the uniform resource locator; and when the data collection project file is the focused data collection project file, the slave node is used for running the data collection program contained in the focused data collection project file so as to capture authorized data.
8. A computer device, characterized in that the computer device comprises:
a memory storing executable program code;
a processor coupled to the memory;
the processor calls the executable program code stored in the memory to perform the method of deploying a data collection cluster according to any of claims 1-6.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a method of deploying a data collection cluster according to any one of claims 1 to 6.
CN202110604923.8A 2021-05-31 2021-05-31 Deployment method, device, equipment and storage medium of data collection cluster Active CN113282372B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110604923.8A CN113282372B (en) 2021-05-31 2021-05-31 Deployment method, device, equipment and storage medium of data collection cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110604923.8A CN113282372B (en) 2021-05-31 2021-05-31 Deployment method, device, equipment and storage medium of data collection cluster

Publications (2)

Publication Number Publication Date
CN113282372A CN113282372A (en) 2021-08-20
CN113282372B true CN113282372B (en) 2022-08-26

Family

ID=77283037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110604923.8A Active CN113282372B (en) 2021-05-31 2021-05-31 Deployment method, device, equipment and storage medium of data collection cluster

Country Status (1)

Country Link
CN (1) CN113282372B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484886A (en) * 2016-10-17 2017-03-08 金蝶软件(中国)有限公司 A kind of method of data acquisition and its relevant device
CN110618821A (en) * 2018-06-19 2019-12-27 普天信息技术有限公司 Container cluster system based on Docker and rapid building method
CN111209463A (en) * 2020-01-02 2020-05-29 北京天元创新科技有限公司 Internet data acquisition method and device
CN112199567A (en) * 2020-09-27 2021-01-08 深圳市伊欧乐科技有限公司 Distributed data acquisition method, system, server and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111866149B (en) * 2020-07-23 2023-09-05 平安证券股份有限公司 Cluster deployment method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113282372A (en) 2021-08-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant