CN115277694B - Data acquisition method, device, system, electronic equipment and storage medium - Google Patents

Data acquisition method, device, system, electronic equipment and storage medium

Info

Publication number
CN115277694B
Authority
CN
China
Prior art keywords
link
data acquisition
accessed
task
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210757236.4A
Other languages
Chinese (zh)
Other versions
CN115277694A (en
Inventor
王海利
王明杨
徐俊俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202210757236.4A priority Critical patent/CN115277694B/en
Publication of CN115277694A publication Critical patent/CN115277694A/en
Application granted granted Critical
Publication of CN115277694B publication Critical patent/CN115277694B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5072Grid computing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/14Session management
    • H04L67/146Markers for unambiguous identification of a particular session, e.g. session cookie or URL-encoding

Abstract

The embodiments of the invention provide a data acquisition method, device, system, electronic equipment and storage medium. A center node acquires a link to be accessed and sends the link to be accessed to an edge node. When the edge node receives the link to be accessed, it extracts the target data indicated by a data acquisition task from the target page indicated by the link to be accessed, and sends the target data to the center node. The center node receives the target data sent by the edge node and obtains a data acquisition result of the data acquisition task. Based on this processing, the center node can send the link to be accessed to the edge node, and the edge node acquires data according to the received link using its own computing resources, so that the data acquisition efficiency can be improved.

Description

Data acquisition method, device, system, electronic equipment and storage medium
Technical Field
The present invention relates to the field of internet technologies, and in particular, to a data acquisition method, apparatus, system, electronic device, and storage medium.
Background
With the rapid development of internet technology, the internet provides users with a large amount of data (e.g., video, music, pictures, etc.), and users can acquire desired data from the internet. When data acquisition is performed, the server can determine the network address of the data which the user needs to acquire, and acquire corresponding data from the determined network address.
However, in the related art, a corresponding method is not provided to improve the data acquisition efficiency.
Disclosure of Invention
The embodiment of the invention aims to provide a data acquisition method, a device, a system, electronic equipment and a storage medium, so as to improve the data acquisition efficiency. The specific technical scheme is as follows:
in a first aspect of the present invention, there is provided a data acquisition method, the method being applied to a data acquisition system, the data acquisition system comprising: a center node and an edge node, the method comprising:
the central node acquires a link to be accessed; sending the link to be accessed to the edge node;
when the edge node receives the link to be accessed, extracting target data indicated by a data acquisition task in a target page indicated by the link to be accessed, and sending the target data to the center node;
and the center node receives the target data sent by the edge node, and obtains a data acquisition result of the data acquisition task.
In a second aspect of the present invention, there is also provided a data acquisition method, the method being applied to a central node in a data acquisition system, the method comprising:
Acquiring a link to be accessed;
sending the link to be accessed to an edge node, so that when the edge node receives the link to be accessed, extracting target data indicated by a data acquisition task in a target page indicated by the link to be accessed, and sending the target data to the center node;
and receiving target data sent by the edge node to obtain a data acquisition result of the data acquisition task.
Optionally, the obtaining the link to be accessed includes:
and extracting an initial link corresponding to the data acquisition task from the configuration information of the data acquisition task to serve as a link to be accessed.
Optionally, the obtaining the link to be accessed includes:
receiving a data acquisition link sent by the edge node; wherein the data acquisition link is: a data acquisition link in the target page, extracted by the edge node;
and carrying out de-duplication processing on the received data acquisition link to obtain a link to be accessed.
Optionally, the edge nodes are multiple;
the sending the link to be accessed to the edge node includes:
and determining an edge node for processing the link to be accessed from the edge nodes based on the processing state information of the data acquisition links of the edge nodes, and sending the link to be accessed to the determined edge node.
Optionally, the determining, from the edge nodes, the edge node for processing the link to be accessed based on the current processing state information of each edge node for data acquisition links includes:
for each edge node, determining the number of data acquisition links that the edge node has currently received and not yet processed as a first number;
and determining, from among the edge nodes, the edge node with the smallest corresponding first number as the edge node for processing the link to be accessed.
Optionally, before extracting the initial link corresponding to the data acquisition task from the configuration information of the data acquisition task as the link to be accessed, the method further includes:
acquiring task execution codes of the data acquisition tasks from a preset file system; wherein the task execution code includes: task start code;
extracting an initial link corresponding to the data acquisition task from the configuration information of the data acquisition task as a link to be accessed, wherein the extracting includes:
and loading the task starting code to extract an initial link corresponding to the data acquisition task from the configuration information of the data acquisition task as a link to be accessed.
Optionally, the task execution code further includes: a new link acquisition code and a page content acquisition code;
before sending the link to be accessed to the edge node, the method further comprises:
the new link acquisition code and the page content acquisition code are sent to the edge node, so that the edge node receives the new link acquisition code and the page content acquisition code, when a link to be accessed sent by the center node is received, the page content acquisition code is loaded to extract target data indicated by the data acquisition task in a target page indicated by the link to be accessed, and the new link acquisition code is loaded to extract the data acquisition link in the target page indicated by the link to be accessed.
Optionally, before the sending the new link acquisition code and the page content acquisition code to the edge node, the method further includes:
acquiring task execution information of the data acquisition task from a preset database; wherein the task execution information includes: the execution period of the data acquisition task;
the sending the new link acquisition code and the page content acquisition code to the edge node includes:
When the moment corresponding to the execution period is reached, sending the new link acquisition code and the page content acquisition code to the edge node;
the loading the task starting code to extract an initial link corresponding to the data acquisition task from the configuration information of the data acquisition task as a link to be accessed includes:
and when the moment corresponding to the execution period is reached, loading the task starting code to extract an initial link corresponding to the data acquisition task from the configuration information of the data acquisition task as a link to be accessed.
Optionally, the task execution information further includes a storage address corresponding to the data acquisition task;
after receiving the target data sent by the edge node and obtaining the data acquisition result of the data acquisition task, the method further comprises:
and storing the data acquisition result of the data acquisition task to a storage address corresponding to the data acquisition task.
In a third aspect of the present invention, there is also provided a data acquisition method, the method being applied to an edge node in a data acquisition system, the method comprising:
When a link to be accessed sent by a central node is received, extracting target data indicated by a data acquisition task in a target page indicated by the link to be accessed;
and sending the target data to the central node so that the central node receives the target data to obtain a data acquisition result of the data acquisition task.
Optionally, the method further comprises:
when a link to be accessed sent by a central node is received, extracting a data acquisition link in a target page indicated by the link to be accessed, and sending the data acquisition link to the central node so that the central node performs de-duplication processing on the received data acquisition link to obtain the link to be accessed.
Optionally, before extracting the target data indicated by the data acquisition task in the target page indicated by the link to be accessed when the link to be accessed sent by the central node is received, the method further includes:
receiving a new link acquisition code and a page content acquisition code sent by the central node;
when receiving a link to be accessed sent by a central node, extracting target data indicated by a data acquisition task in a target page indicated by the link to be accessed, including:
When a link to be accessed sent by a central node is received, loading the page content acquisition code to extract target data indicated by a data acquisition task in a target page indicated by the link to be accessed;
when receiving a link to be accessed sent by a central node, extracting a data acquisition link in a target page indicated by the link to be accessed, wherein the data acquisition link comprises:
and when receiving a link to be accessed sent by the central node, loading the new link acquisition code to extract a data acquisition link in a target page indicated by the link to be accessed.
In a fourth aspect of the present invention, there is also provided a data acquisition system, the data acquisition system comprising: center node and edge node, wherein:
the center node is used for acquiring a link to be accessed; sending the link to be accessed to the edge node;
the edge node is used for extracting target data indicated by a data acquisition task in a target page indicated by the link to be accessed when the link to be accessed is received, and sending the target data to the center node;
the center node is further configured to receive the target data sent by the edge node, and obtain a data acquisition result of the data acquisition task.
In a fifth aspect of the present invention, there is also provided a data acquisition device for use in a central node in a data acquisition system, the device comprising:
the first acquisition module is used for acquiring a link to be accessed;
the first sending module is used for sending the link to be accessed to an edge node, so that when the edge node receives the link to be accessed, the edge node extracts target data indicated by a data acquisition task in a target page indicated by the link to be accessed and sends the target data to the center node;
and the receiving module is used for receiving the target data sent by the edge node and obtaining a data acquisition result of the data acquisition task.
Optionally, the first obtaining module is specifically configured to extract, from the configuration information of the data obtaining task, an initial link corresponding to the data obtaining task as a link to be accessed.
Optionally, the first obtaining module is specifically configured to receive a data acquisition link sent by the edge node; wherein the data acquisition link is: a data acquisition link in the target page, extracted by the edge node;
and carrying out de-duplication processing on the received data acquisition link to obtain a link to be accessed.
Optionally, the edge nodes are multiple;
the first sending module is specifically configured to determine, from among the edge nodes, an edge node for processing the link to be accessed based on processing state information of the data acquisition link of each edge node, and send the link to be accessed to the determined edge node.
Optionally, the first sending module is specifically configured to determine, for each edge node, the number of data acquisition links that the edge node has currently received and not yet processed, as a first number;
and determine, from among the edge nodes, the edge node with the smallest corresponding first number as the edge node for processing the link to be accessed.
Optionally, the apparatus further includes:
the second acquisition module is configured to acquire the task execution code of the data acquisition task from a preset file system before the first acquisition module extracts the initial link corresponding to the data acquisition task from the configuration information of the data acquisition task as the link to be accessed; wherein the task execution code includes: task start code;
the first obtaining module is specifically configured to load the task starting code, so as to extract an initial link corresponding to the data obtaining task from the configuration information of the data obtaining task, and use the initial link as a link to be accessed.
Optionally, the task execution code further includes: a new link acquisition code and a page content acquisition code;
the apparatus further comprises:
and the second sending module is used for sending the new link acquisition code and the page content acquisition code to the edge node before the first sending module sends the link to be accessed to the edge node, so that the edge node receives the new link acquisition code and the page content acquisition code, loads the page content acquisition code when receiving the link to be accessed sent by the center node, extracts target data indicated by the data acquisition task in a target page indicated by the link to be accessed, and loads the new link acquisition code to extract the data acquisition link in the target page indicated by the link to be accessed.
Optionally, the apparatus further includes:
the third acquisition module is configured to acquire task execution information of the data acquisition task from a preset database before the second sending module sends the new link acquisition code and the page content acquisition code to the edge node; wherein the task execution information includes: the execution period of the data acquisition task;
The second sending module is specifically configured to send the new link acquisition code and the page content acquisition code to the edge node when the moment corresponding to the execution period is reached;
the first obtaining module is specifically configured to load the task start code when the time corresponding to the execution period is reached, so as to extract, from the configuration information of the data obtaining task, an initial link corresponding to the data obtaining task as a link to be accessed.
Optionally, the task execution information further includes a storage address corresponding to the data acquisition task;
the apparatus further comprises:
and the storage module is used for storing the data acquisition result of the data acquisition task to a storage address corresponding to the data acquisition task after the receiving module receives the target data sent by the edge node and obtains the data acquisition result of the data acquisition task.
In a sixth aspect of the present invention, there is also provided a data acquisition device, the device being applied to an edge node in a data acquisition system, the device comprising:
the first extraction module is used for extracting target data indicated by a data acquisition task in a target page indicated by a link to be accessed when the link to be accessed sent by the central node is received;
And the sending module is used for sending the target data to the central node so that the central node receives the target data and obtains a data acquisition result of the data acquisition task.
Optionally, the apparatus further includes:
and the second extraction module is used for extracting the data acquisition link in the target page indicated by the link to be accessed when the link to be accessed sent by the central node is received, and sending the data acquisition link to the central node so that the central node performs de-duplication processing on the received data acquisition link to obtain the link to be accessed.
Optionally, the apparatus further includes:
the receiving module is configured to receive the new link acquisition code and the page content acquisition code sent by the central node before the first extraction module extracts, upon receiving the link to be accessed sent by the central node, the target data indicated by the data acquisition task in the target page indicated by the link to be accessed;
the first extraction module is specifically configured to load the page content acquisition code when receiving a link to be accessed sent by a central node, so as to extract target data indicated by a data acquisition task in a target page indicated by the link to be accessed;
The second extracting module is specifically configured to load the new link acquiring code when receiving a link to be accessed sent by the central node, so as to extract a data acquiring link in a target page indicated by the link to be accessed.
In yet another aspect of the present invention, there is also provided an electronic device including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory perform communication with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the steps of the data acquisition method according to any one of the second aspect or the third aspect when executing the program stored in the memory.
In yet another aspect of the present invention, there is also provided a computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the data acquisition method according to any one of the above second or third aspects.
In a further aspect of the invention there is also provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the data acquisition method of any of the above second or third aspects.
According to the data acquisition method provided by the embodiment of the invention, a central node acquires a link to be accessed; and sending the link to be accessed to the edge node. And when the edge node receives the link to be accessed, extracting target data indicated by the data acquisition task in the target page indicated by the link to be accessed, and sending the target data to the center node. And the center node receives the target data sent by the edge node, and obtains a data acquisition result of the data acquisition task.
Based on the processing, the center node can send the link to be accessed to the edge node, the edge node acquires data according to the received link to be accessed, and the edge node acquires the data by using the computing resource of the edge node, so that the data acquisition efficiency can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flowchart of a data acquisition method according to an embodiment of the present invention;
FIG. 2 is a flowchart of another data acquisition method according to an embodiment of the present invention;
FIG. 3 is a flowchart of another data acquisition method according to an embodiment of the present invention;
FIG. 4 is a flowchart of another data acquisition method according to an embodiment of the present invention;
FIG. 5 is a flowchart of another data acquisition method according to an embodiment of the present invention;
FIG. 6 is a block diagram of a data acquisition device according to an embodiment of the present invention;
FIG. 7 is a block diagram of another data acquisition device according to an embodiment of the present invention;
FIG. 8 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a data acquisition method according to an embodiment of the present invention, where the data acquisition method is applied to a data acquisition system, and the data acquisition system includes: a center node and an edge node. The method may comprise the steps of:
S101: the central node obtains a link to be accessed.
S102: the central node sends the link to be accessed to the edge node.
S103: and when the edge node receives the link to be accessed, extracting target data indicated by the data acquisition task in the target page indicated by the link to be accessed.
S104: the edge node sends the target data to the center node.
S105: and the center node receives the target data sent by the edge node, and obtains a data acquisition result of the data acquisition task.
Based on the data acquisition method provided by the embodiment of the invention, the center node can send the link to be accessed to the edge node, the edge node acquires the data according to the received link to be accessed, and the computing resource of the edge node is utilized for acquiring the data, so that the data acquisition efficiency can be improved.
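As an illustration only, steps S101 to S105 can be sketched as follows. This is a minimal single-process sketch, not the patented implementation: the class names `CenterNode` and `EdgeNode`, the in-memory `PAGES` table standing in for real web pages, and the upper-casing "extraction rule" are all assumptions introduced for the example, and a direct method call stands in for network transport between the nodes.

```python
# Hypothetical in-memory "web": link -> page content (assumption for the sketch).
PAGES = {
    "https://example.com/a": "alpha",
    "https://example.com/b": "beta",
}


class EdgeNode:
    """Edge node: extracts target data from the page a link points to."""

    def handle(self, link: str) -> str:
        page = PAGES[link]   # S103: access the target page indicated by the link
        return page.upper()  # placeholder extraction rule (assumption)


class CenterNode:
    """Center node: hands out links and collects the returned target data."""

    def __init__(self, edge: EdgeNode, initial_links):
        self.edge = edge
        self.links = list(initial_links)
        self.results = {}

    def run(self) -> dict:
        for link in self.links:            # S101: obtain a link to be accessed
            data = self.edge.handle(link)  # S102-S104: send link, edge extracts and returns data
            self.results[link] = data      # S105: record the data acquisition result
        return self.results


center = CenterNode(EdgeNode(), PAGES.keys())
results = center.run()
```

In a real deployment the call to `handle` would be replaced by a message sent to a remote edge device, and the result would arrive asynchronously.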
For step S101 and step S102, the data acquisition system in the embodiment of the present invention may include: a central node and a plurality of edge nodes. The central node may be a server, or a cluster of servers. The edge nodes can be mobile phones, computers, routers, television boxes and other electronic devices.
In the data acquisition system, the central node may receive a data acquisition task, where the data acquisition task is configured to acquire target data according to a data acquisition link indicated by a user. The center node dynamically coordinates each edge node, and distributes the data acquisition tasks to different edge nodes for processing so as to complete the corresponding data acquisition tasks, namely, the data acquisition can be performed through a plurality of edge nodes, and the data acquisition efficiency is improved. And the data acquisition is performed through the center node and the edge nodes, so that the cost of the data acquisition is low.
In some embodiments, step S101 may include the steps of: and the central node extracts an initial link corresponding to the data acquisition task from the configuration information of the data acquisition task, and the initial link is used as a link to be accessed.
When a user needs to acquire data, the data acquisition task and the configuration information of the data acquisition task can be issued to the central node through the client, and the central node can receive the data acquisition task and determine an initial link in the configuration information of the data acquisition task as a link to be accessed. The data acquisition task may carry one or more initial links. An initial link carried in the data acquisition task is a network address that is set by the user when issuing the data acquisition task and that indicates the data the user needs to acquire.
For example, when the data acquisition task indicates that data in a web page of a website needs to be acquired, an initial link carried by the data acquisition task may be a URL (Uniform Resource Locator) of a specified web page (e.g., a home page, a search page, etc.) of the website. Or when the data acquisition task indicates that the video data needs to be downloaded, the initial link carried by the data acquisition task may be a storage address of the video data.
The center node can send links to be accessed to each edge node, and the edge nodes can acquire data according to the received links to be accessed. When a link to be accessed sent by the center node is received, the edge node can also extract the data acquisition links in the target page indicated by the link to be accessed, and send the data acquisition links to the center node.
For example, when the link to be accessed is a URL of the target page, the edge node may send an HTTP (HyperText Transfer Protocol) request to the corresponding server according to the URL, so that the server sends an HTML (HyperText Markup Language) file containing the page content of the target page to the edge node. The edge node may then receive the HTML file for the target page.
The page content of the target page indicated by the link to be accessed may include: video, pictures, documents, text, etc., the page content may carry a data acquisition link, e.g., if clicking on one of the pictures in the target page can jump to another page, the picture carries a data acquisition link that is a URL for accessing the other page. The edge node may obtain a data acquisition link carried by the page content of the target page based on the HTML file of the target page.
In one implementation, the edge node may parse the HTML file of the target page with the BeautifulSoup tool (a Python library capable of extracting data from HTML files) to obtain the data acquisition links carried by the page content of the target page.
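The link extraction described above can be sketched as follows. To keep the example self-contained, this sketch uses Python's standard-library `html.parser` as a stand-in for an HTML parsing tool; the class name `LinkCollector`, the function `extract_links`, and the choice of which tags and attributes count as data acquisition links are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag -> attribute that may carry a link (an illustrative choice, not from the source).
LINK_ATTRS = {"a": "href", "img": "src", "video": "src", "source": "src"}


class LinkCollector(HTMLParser):
    """Collects absolute URLs found in link-carrying tags of an HTML file."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative links against the URL of the target page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> list:
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


sample_html = '<a href="/next"><img src="pic.jpg"></a>'
found = extract_links(sample_html, "https://example.com/index.html")
```

Here the picture is wrapped in an anchor, so both the URL of the picture and the URL of the page it jumps to are collected.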
In addition, when the central node sends the link to be accessed to the edge node, the central node may also send a task identifier of the data acquisition task corresponding to the link to be accessed to the edge node. Correspondingly, the edge node can determine the task requirement of the data acquisition task corresponding to the link to be accessed in the corresponding relation between the preset task identifier and the task requirement based on the received task identifier.
Correspondingly, the edge node can also screen the acquired data acquisition links based on the task requirements of the data acquisition tasks, and determine the data acquisition links matched with the task requirements of the data acquisition tasks from the acquired data acquisition links.
For example, the data acquisition links carried by the page content of the target page include: a link for downloading video, a link for accessing a web page, and a link for downloading pictures. If the task requirement of the data acquisition task is to acquire pictures, the edge node may determine that the data acquisition links matching the task requirement of the data acquisition task include: links for downloading pictures. Since pictures may also be displayed in a web page, the edge node may determine that the data acquisition links matching the task requirement of the data acquisition task further include: links for accessing web pages.
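The screening step can be sketched as follows. This is a hedged illustration: the task identifier `"task-42"`, the `TASK_REQUIREMENTS` table, the extension sets, and the rule of classifying a link by its file extension are all assumptions, since the source does not specify how a link is matched against a task requirement.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical preset correspondence between task identifier and task requirement.
TASK_REQUIREMENTS = {"task-42": "picture"}

PICTURE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}
VIDEO_EXTENSIONS = {".mp4", ".mkv", ".flv"}


def classify(link: str) -> str:
    """Classify a link by the file extension of its path (illustrative rule)."""
    extension = posixpath.splitext(urlparse(link).path)[1].lower()
    if extension in PICTURE_EXTENSIONS:
        return "picture"
    if extension in VIDEO_EXTENSIONS:
        return "video"
    return "page"


def filter_links(task_id: str, links: list) -> list:
    """Keep links matching the task requirement, plus page links,
    because a web page may itself display data of the required type."""
    requirement = TASK_REQUIREMENTS[task_id]
    keep = {requirement, "page"}
    return [link for link in links if classify(link) in keep]


candidates = [
    "https://example.com/movie.mp4",
    "https://example.com/list",
    "https://example.com/photo.png",
]
matched = filter_links("task-42", candidates)
```

For a picture task, the video download link is dropped while the page link and the picture link are kept, mirroring the example in the text.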
In some embodiments, the method may further comprise the steps of: when the edge node receives the link to be accessed sent by the center node, extracting a data acquisition link in a target page indicated by the link to be accessed, and sending the data acquisition link to the center node.
Accordingly, step S101 may include the steps of: the central node receives the data acquisition link sent by the edge node. The data acquisition link sent by the edge node is: a data acquisition link in the target page, extracted by the edge node. The center node then performs de-duplication processing on the received data acquisition links to obtain a link to be accessed.
When the edge node receives the link to be accessed, the edge node can access the target page indicated by the link to be accessed and extract the data acquisition links from the target page. The edge node then sends the extracted data acquisition links to the central node. The center node performs de-duplication processing on the data acquisition links sent by the edge node to obtain new links to be accessed.
For example, the central node may determine whether the same link exists in the data acquisition links sent by the edge nodes. Or, the central node may determine whether the same link exists in the received data acquisition links when the preset time is reached.
If the same link does not exist in the received data acquisition links, the central node can take the received data acquisition links as links to be accessed.
If identical links exist among the received data acquisition links, the central node may perform de-duplication processing on the received data acquisition links. For example, the central node may filter the received data acquisition links through a bloom filter, so as to keep one data acquisition link from each group of identical data acquisition links. Further, the central node may use the retained data acquisition links, together with the data acquisition links that have no duplicates, as the links to be accessed.
And the central node can send the determined links to be accessed to each edge node, and so on until all target data indicated by the data acquisition task are acquired, and the data acquisition result of the data acquisition task is obtained based on the target data.
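The de-duplication step can be sketched with a small hand-written bloom filter. This is a minimal sketch under assumptions: the class `BloomFilter`, the function `deduplicate`, the bit-array size, and the use of SHA-256 as the hash family are illustrative choices, not details given in the text. A bloom filter can report false positives, so in rare cases a link that was never seen may be treated as a duplicate and skipped.

```python
import hashlib


class BloomFilter:
    """A simple bloom filter; size and hash count are illustrative assumptions."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def deduplicate(links, seen: BloomFilter):
    """Return the links not seen before, keeping one copy of each, in arrival order."""
    to_access = []
    for link in links:
        if link not in seen:
            seen.add(link)
            to_access.append(link)
    return to_access


seen_links = BloomFilter()
first_batch = deduplicate(["/a", "/b", "/a"], seen_links)
second_batch = deduplicate(["/b", "/c"], seen_links)
```

Keeping one filter across batches means that a link reported again by another edge node at a later time is also filtered out.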
In some embodiments, the edge nodes are plural, and accordingly, based on fig. 1, referring to fig. 2, step S102 may include the following steps:
S1021: the central node determines, from among the edge nodes, an edge node for processing the link to be accessed, based on the current processing state information of each edge node for data acquisition links.
S1022: and the center node sends the link to be accessed to the determined edge node.
The processing state information of an edge node for the data acquisition link can represent the processing situation of the edge node for the data acquisition link.
In one implementation, for each edge node, the current processing state information of that edge node for data acquisition links may include: the hardware resources (e.g., bandwidth, memory, etc.) of the edge node, the latency of processing data acquisition links, the number of data acquisition links that have currently been received and not yet processed, etc. The central node may determine the edge node for processing the link to be accessed based on the current processing state information of each edge node for data acquisition links and on an edge computing task allocation algorithm (e.g., GA (Genetic Algorithm, distributed genetic algorithm)).
In another implementation, the processing state information of the edge node for the data acquisition link may include: the number of data acquisition links that the edge node has currently received and unprocessed (i.e., the first number). Accordingly, referring to fig. 3 on the basis of fig. 2, step S1021 may include the steps of:
S10211: the central node determines, for each edge node, a number of data acquisition links that the edge node has currently received and unprocessed as a first number.
S10212: the central node determines, from among the edge nodes, the edge node with the smallest corresponding first number as the edge node for processing the link to be accessed.
In the case that there is one link to be accessed, the central node may directly determine, from among the edge nodes, the edge node with the smallest corresponding first number as the edge node for processing the link to be accessed.
In the case where the link to be accessed is plural, the link to be accessed may include: the initial link carried by the data acquisition task and the data acquisition links sent by the edge nodes.
The central node may select one link to be accessed according to the sequence of the acquisition time for acquiring the links to be accessed. For each link to be accessed, if the link to be accessed is carried in the data acquisition task, the acquisition time of the link to be accessed is the time when the central node receives the data acquisition task. If the link to be accessed is sent by the edge node, the acquisition time of the link to be accessed is the time when the center node receives the link to be accessed.
If there are multiple links to be accessed with the same acquisition time, the central node may randomly select one link to be accessed from the multiple links to be accessed.
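The selection rule described above (earliest acquisition time first, random choice among ties) can be sketched as follows. The function name `pick_next_link` and the representation of pending links as `(acquisition_time, link)` pairs are assumptions made for illustration.

```python
import random


def pick_next_link(pending, rng=random):
    """Remove and return the link with the earliest acquisition time.

    `pending` is a list of (acquisition_time, link) pairs. If several links
    share the earliest acquisition time, one of them is chosen at random.
    """
    earliest = min(acquired_at for acquired_at, _ in pending)
    candidates = [i for i, (acquired_at, _) in enumerate(pending) if acquired_at == earliest]
    index = rng.choice(candidates)
    return pending.pop(index)[1]


queue = [(2.0, "/later"), (1.0, "/first-a"), (1.0, "/first-b")]
chosen = pick_next_link(queue, random.Random(0))
```

Passing an explicit `random.Random` instance is only a convenience for reproducible runs; the default uses the module-level generator.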
Then, for each edge node, the central node may determine the number of data acquisition links that the edge node has currently received and unprocessed as the first number.
In one implementation, for each edge node, the edge node may determine a first number of data acquisition links that have been currently received and unprocessed, and when sending target data to the central node, the corresponding first number may be sent to the central node. The central node may receive a respective corresponding first number sent by each edge node.
In another implementation, for each edge node, after acquiring target data indicated by a data acquisition task in a page indicated by a data acquisition link, the edge node sends a first notification message to the central node, where the first notification message indicates that the data acquisition link has been processed. When the central node receives the first notification message, it may be determined that the edge node has processed the data acquisition link.
Further, the central node may determine a second number of data acquisition links that have been sent to the edge node, and a third number of data acquisition links that have received the first notification message sent by the edge node, the third number being the number of data acquisition links that the edge node has processed. The central node may then calculate the difference between the second number and the third number to obtain a first number of data acquisition links that the edge node has currently received and unprocessed.
Further, the central node may determine, from among the edge nodes, the edge node with the smallest corresponding first number, and thereby obtain the edge node for processing the link to be accessed.
Based on the above processing, the central node allocates links to be accessed to each edge node based on the first number of data acquisition links that the edge node has currently received and not yet processed. That is, more links to be accessed are allocated to an edge node whose first number is smaller, and fewer links to be accessed are allocated to an edge node whose first number is larger. In other words, the number of links to be accessed allocated to each edge node is dynamically adjusted according to how that edge node is currently processing data acquisition links, which can further improve the data acquisition efficiency and the stability of the data acquisition system.
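The bookkeeping described above, in which the first number is the difference between the second number (links sent) and the third number (links confirmed by a first notification message), can be sketched as follows. The names `EdgeStats`, `CentralScheduler`, `assign` and `on_first_notification` are hypothetical, and the actual sending of the link over the network is omitted.

```python
from dataclasses import dataclass


@dataclass
class EdgeStats:
    sent: int = 0       # second number: links sent to this edge node
    processed: int = 0  # third number: first notification messages received

    @property
    def pending(self) -> int:
        # first number: links received by the edge node and not yet processed
        return self.sent - self.processed


class CentralScheduler:
    def __init__(self, edge_ids):
        self.stats = {edge_id: EdgeStats() for edge_id in edge_ids}

    def assign(self, link: str) -> str:
        """Choose the edge node with the smallest first number for this link."""
        edge_id = min(self.stats, key=lambda e: self.stats[e].pending)
        self.stats[edge_id].sent += 1
        return edge_id  # a real system would now send `link` to this edge node

    def on_first_notification(self, edge_id: str) -> None:
        """Record that the edge node finished processing one link."""
        self.stats[edge_id].processed += 1


scheduler = CentralScheduler(["edge-1", "edge-2"])
a = scheduler.assign("/page-1")          # both idle, the first edge node is chosen
b = scheduler.assign("/page-2")          # edge-1 has one pending link, so edge-2
scheduler.on_first_notification("edge-2")
c = scheduler.assign("/page-3")          # edge-2 is idle again
```

When several edge nodes have the same first number, this sketch picks the one registered first; the text does not specify a tie-breaking rule.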
For step S103, for each edge node, the link to be accessed received by the edge node is a link to be accessed determined by the center node and required to be processed by the edge node.
For each received link to be accessed, when the edge node receives the link to be accessed sent by the central node, the edge node can extract the target data indicated by the data acquisition task from the target page indicated by the link to be accessed. The target data indicated by the data acquisition task includes: page content in the target page indicated by the data acquisition task.
For example, when the link to be accessed is a URL of the target page, the edge node may send an HTTP request to the corresponding server according to the URL, so that the server sends an HTML file containing page content of the target page to the edge node. The edge node may then receive the HTML file for the target page.
The page content of the target page indicated by the link to be accessed may include: video, pictures, documents, text, etc.
The edge node may take all page content of the target page as target data. Or, the edge node may further determine, from the page contents of the target page, target page contents matching with the task requirements of the data acquisition task, based on the task requirements of the data acquisition task, and use the target page contents as target data.
For example, when the task requirement of the data acquisition task is to acquire a picture, the edge node may determine the picture in the target page, so as to obtain the target page content matched with the data acquisition task. Or when the task requirement of the data acquisition task is to acquire the picture containing the rainbow, the edge node can determine the picture containing the rainbow in the target page, and obtain the content of the target page matched with the data acquisition task.
In one implementation, the edge node may search in the HTML file of the target page based on the task requirement of the data acquisition task to obtain target page content that matches the task requirement of the data acquisition task.
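As an illustration of searching the HTML file for content matching a task requirement, the following sketch collects picture addresses using Python's standard `html.parser` module. It assumes that the task requirement is "acquire pictures" and that pictures appear as `img` elements with a `src` attribute; the names `PictureExtractor` and `extract_pictures` are hypothetical.

```python
from html.parser import HTMLParser


class PictureExtractor(HTMLParser):
    """Collect the `src` attribute of every <img> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.pictures = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> elements without a usable address
                self.pictures.append(src)


def extract_pictures(html_text: str):
    parser = PictureExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.pictures


page = '<html><body><img src="/a.png"><p>text</p><img alt="no source"><a href="/next">next</a></body></html>'
pictures = extract_pictures(page)
```

A narrower requirement, such as pictures containing a rainbow, would need an additional content check on each picture, which is outside the scope of this sketch.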
For step S104, in one implementation, when acquiring the data acquisition link in the target page indicated by the link to be accessed and the target data of the target page indicated by the link to be accessed, the edge node may send the data acquisition link and the target data to the central node.
In another implementation manner, the edge node may send the acquired data acquisition link to the central node when acquiring the data acquisition link in the target page indicated by the link to be accessed, and send the acquired target data to the central node when acquiring the target data of the target page indicated by the link to be accessed.
Based on the above processing, when the edge node does not acquire the data acquisition link and the target data in the target page indicated by the link to be accessed at the same time, the edge node transmits the acquired data acquisition link to the center node when acquiring the data acquisition link in the target page indicated by the link to be accessed, without waiting for the edge node to acquire the target data of the target page indicated by the link to be accessed. Or when the edge node acquires the target data of the target page indicated by the link to be accessed, the edge node sends the acquired target data to the central node, and the edge node does not need to wait for acquiring the data acquisition link in the target page indicated by the link to be accessed, so that the data acquisition efficiency can be further improved.
In the embodiment of the invention, the order of acquiring the data acquisition links and the target data in the target page indicated by the link to be accessed by the edge node is not limited, that is, the edge node may acquire the data acquisition links in the target page indicated by the link to be accessed first and then acquire the target data of the target page indicated by the link to be accessed. The edge node may acquire the target data of the target page indicated by the link to be accessed first, and then acquire the data acquisition link in the target page indicated by the link to be accessed.
The edge node may also acquire the data acquisition link and the target data in the target page indicated by the link to be accessed at the same time, for example, the edge node may start a plurality of parallel threads, one for acquiring the data acquisition link in the target page indicated by the link to be accessed, and the other for acquiring the target data of the target page indicated by the link to be accessed.
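The parallel variant can be sketched with two worker threads, each result being sent to the central node as soon as it is ready, without waiting for the other. The function `process_page` and its callable parameters (`extract_links`, `extract_target_data`, `send`) are assumptions introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_page(html_text, extract_links, extract_target_data, send):
    """Run link extraction and target-data extraction in parallel threads.

    `send(kind, payload)` stands in for transmitting a result to the central
    node; it is called for whichever extraction finishes first.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {
            pool.submit(extract_links, html_text): "links",
            pool.submit(extract_target_data, html_text): "target_data",
        }
        for future in as_completed(futures):
            send(futures[future], future.result())


sent = []
process_page(
    "<html>demo</html>",
    extract_links=lambda html: ["/next"],
    extract_target_data=lambda html: {"length": len(html)},
    send=lambda kind, payload: sent.append((kind, payload)),
)
```

The order of the entries in `sent` depends on which thread finishes first, which mirrors the statement that the acquisition order is not limited.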
For step S105, the central node may receive the target data sent by each edge node. The central node may use the received target data as a data acquisition result of the data acquisition task.
In some embodiments, the link to be accessed is an initial link, corresponding to the data acquisition task, that the central node extracts from the configuration information of the data acquisition task. Accordingly, before step S101, the method may further include the following step: the central node acquires the task execution code of the data acquisition task from a preset file system.
Wherein the task execution code includes: task initiation code.
Accordingly, step S101 may include the steps of: the central node loads a task starting code to extract an initial link corresponding to the data acquisition task from configuration information of the data acquisition task, and the initial link is used as a link to be accessed.
In order to acquire data meeting the requirements of the user, the user may also develop task execution code for executing the data acquisition task, and store the task execution code of the data acquisition task to a preset file system. The preset file system may be a Kafka system. The user may develop the task execution code for performing the data acquisition task in a computer language such as JavaScript, CSS, HTML, or Python.
The task execution code includes: task initiation code for initiating a data acquisition task.
Correspondingly, the central node can acquire task execution codes from a preset file system, and when corresponding data are required to be acquired according to the data acquisition task, the central node can load task starting codes so as to extract initial links corresponding to the data acquisition task from configuration information of the data acquisition task and serve as links to be accessed.
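One way a central node might load user-supplied task start code and call it is sketched below. Everything here is an assumption: the text does not define the code format, so the entry function name `entry`, the configuration key `initial_links`, and the helper `load_entry` are hypothetical. Executing code fetched from storage in this way is only appropriate when the code is trusted.

```python
# Hypothetical task start code, as it might be stored in the preset file system.
TASK_START_CODE = '''
def entry(config):
    # Return the initial links carried in the task configuration.
    return list(config["initial_links"])
'''


def load_entry(source: str, entry_name: str = "entry"):
    """Compile the source text and return its entry function."""
    namespace = {}
    exec(compile(source, "<task_code>", "exec"), namespace)
    return namespace[entry_name]


task_config = {"initial_links": ["https://example.com/"]}
start = load_entry(TASK_START_CODE)
links_to_access = start(task_config)
```

The same loader could be reused on an edge node for other parts of the task execution code, provided those parts follow the same entry function convention.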
In some embodiments, the task execution code further comprises: a new link acquisition code and a page content acquisition code.
Accordingly, before step S102, the method may further include the steps of:
the center node sends a new link acquisition code and a page content acquisition code to the edge node so that the edge node receives the new link acquisition code and the page content acquisition code, loads the page content acquisition code to extract target data indicated by a data acquisition task in a target page indicated by the link to be accessed when receiving the link to be accessed sent by the center node, and loads the new link acquisition code to extract the data acquisition link in the target page indicated by the link to be accessed.
The new link acquisition code is used for extracting a data acquisition link in a target page indicated by the link to be accessed; the page content acquisition code is used for extracting target data (namely page content) in the target page indicated by the link to be accessed.
The central node may also send a new link acquisition code and page content acquisition code to each edge node. The edge node may receive the new link acquisition code and the page content acquisition code sent by the central node, and further, when receiving the link to be accessed sent by the central node, the edge node may load the new link acquisition code to extract a data acquisition link in the target page indicated by the link to be accessed, and load the page content acquisition code to extract target data in the target page indicated by the link to be accessed.
In some embodiments, before sending the new link acquisition code and the page content acquisition code to the edge node, the method may further comprise the steps of: the central node acquires task execution information of a data acquisition task from a preset database.
The task execution information comprises: the execution cycle of the data acquisition task.
Correspondingly, the step of sending the new link acquisition code and the page content acquisition code to the edge node comprises the steps of:
and when the moment corresponding to the execution period is reached, the center node sends a new link acquisition code and a page content acquisition code to the edge node.
Correspondingly, loading a task starting code to extract an initial link corresponding to the data acquisition task from configuration information of the data acquisition task, wherein the step of taking the initial link as a link to be accessed comprises the following steps of:
when the moment corresponding to the execution period is reached, the central node loads a task starting code to extract an initial link corresponding to the data acquisition task from the configuration information of the data acquisition task, and the initial link is used as a link to be accessed.
The preset database may be a relational distributed database, such as MySQL (a relational database management system), Oracle (another relational database management system), or the like.
The user may also set task execution information of the data acquisition task when developing a task execution program for executing the data acquisition task. The task execution information may include: the execution cycle of the data acquisition task.
The central node can load a task starting program of the data acquisition task when the moment corresponding to the execution period of the data acquisition task is reached, so as to extract an initial link corresponding to the data acquisition task from configuration information of the data acquisition task, and the initial link is used as a link to be accessed. The central node acquires target data indicated by the data acquisition task according to the execution period of the data acquisition task, and can acquire updated page content when the page content in the target page indicated by the link to be accessed is updated, so that the effectiveness of the acquired data can be improved.
In some embodiments, the task execution information further includes a storage address corresponding to the data acquisition task.
Accordingly, after step S105, the method may further include the steps of: the central node stores the data acquisition result of the data acquisition task to a storage address corresponding to the data acquisition task.
The storage address corresponding to the data acquisition task may be: the address of a target device that receives the data acquisition result of the data acquisition task. The central node may send an HTTP request carrying the data acquisition result of the data acquisition task to the target device, so that the target device receives the HTTP request and obtains the data acquisition result of the data acquisition task.
In one implementation manner, after receiving each piece of target data and obtaining a data acquisition result of the data acquisition task based on the target data, the central node may add the data acquisition result of the data acquisition task to a preset queue corresponding to the target device. The target device may then obtain the data acquisition result of the data acquisition task from the preset queue.
The preset queue may be AMQ (ActiveMQ, an open source message queue) or RMQ (RabbitMQ, a message queue based on AMQP (Advanced Message Queuing Protocol)).
In another implementation, if the data acquisition link does not exist in the target page indicated by the link to be accessed, the edge node may determine that all target data indicated by the data acquisition task has been acquired, and the edge node may send a second notification message to the central node indicating that all target data indicated by the data acquisition task has been acquired.
If a second notification message is received that is sent by all edge nodes that are performing the data acquisition task, the central node may determine that all target data indicated by the data acquisition task has been acquired. The central node may store all data acquisition results of the data acquisition task obtained based on each target data to a storage address corresponding to the data acquisition task.
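The completion check based on second notification messages can be sketched as follows. The class name `CompletionTracker` and its methods are hypothetical, and the sketch assumes that the central node knows in advance which edge nodes are performing the task.

```python
class CompletionTracker:
    """Track second notification messages from the edge nodes running a task."""

    def __init__(self, active_edges):
        self.active = set(active_edges)
        self.done = set()

    def on_second_notification(self, edge_id: str) -> bool:
        """Record a notification; return True once every active edge node has reported."""
        if edge_id in self.active:
            self.done.add(edge_id)
        return self.all_done()

    def all_done(self) -> bool:
        return self.done == self.active


tracker = CompletionTracker(["edge-1", "edge-2"])
after_first = tracker.on_second_notification("edge-1")
after_second = tracker.on_second_notification("edge-2")
```

Only when `all_done()` becomes true would the central node store the data acquisition results to the storage address of the task.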
Illustratively, a user may develop the task start code through a client, and set a data acquisition link carried by the data acquisition task when developing the task start code, and set a cron (a timing execution tool) expression for representing an execution period of the data acquisition task. When developing the task start code, the user needs to develop the task start code in accordance with an entry function and a data return format (for example, a type of separator) that can be analyzed by the center node. Subsequently, the central node may periodically initiate a data acquisition task according to the cron expression.
Then, the user can also develop a new link acquisition code for extracting the data acquisition link in the target page indicated by the link to be accessed and a page content acquisition code for extracting the target data in the target page indicated by the link to be accessed through the client. The client can correlate the task starting code, the new link acquisition code and the page content acquisition code, and set a storage address corresponding to the data acquisition task. When developing the task start code, the new link acquisition code and the page content acquisition code, the user needs to develop according to the entry function and the data return format which can be analyzed by the center node and the edge node.
Further, the client may store the task start code, the new link acquisition code, and the page content acquisition code to a preset file system, and store a cron expression for representing an execution period of the data acquisition task, and a storage address of the target data to a preset database.
Based on the above processing, when developing a code for performing data acquisition, the user only needs to develop a task execution code (i.e., a task start code, a new link acquisition code, and a page content acquisition code) of the data acquisition task, and store the task execution code to a preset file system. When the data needs to be acquired, the central node can acquire task execution codes from a preset file system and send the task execution codes to the edge nodes. And then the center node and the edge node can load corresponding codes to acquire data. The user does not need to pay attention to the interaction process between the center node and the edge node, so that the development efficiency of the user in developing codes for data acquisition can be improved.
Referring to fig. 4, fig. 4 is a flowchart of another data acquisition method according to an embodiment of the present invention. The data acquisition method is applied to a data acquisition system, and the data acquisition system comprises: a central node and a plurality of edge nodes.
When data acquisition is performed, the central node can add a link, that is, the central node loads a task starting program of a data acquisition task to determine an initial link corresponding to the data acquisition task as a link to be accessed. For each link to be accessed, the central node may determine an edge node for processing the link to be accessed based on processing status information of each edge node for the data acquisition link currently, and send the link to be accessed to the determined edge node.
Then, the edge node can download the resource; that is, when the edge node receives the link to be accessed sent by the central node, the edge node obtains the page content of the target page indicated by the link to be accessed. The edge node may determine whether to parse the page content of the target page, that is, determine whether the page content of the target page includes target page content that matches the task requirement of the data acquisition task. When it is determined that the page content of the target page is to be parsed, the edge node can parse the page content and store the result. That is, the edge node obtains the target page content matched with the task requirement of the data acquisition task from the page content of the target page, takes the target page content as target data, and sends the target data to the central node. The central node may receive the target data sent by the edge node, and store the data acquisition result including the target data to the storage address corresponding to the data acquisition task.
The edge node may determine whether to extract a data acquisition link in the target page, that is, whether a data acquisition link exists in the target page. Upon determining to extract the data acquisition link in the target page, the edge node may extract a new link and download the resource based on the new link. That is, when a data acquisition link exists in the target page, the edge node extracts the data acquisition link in the target page that matches the task requirement of the data acquisition task, and sends the extracted data acquisition link to the central node. The central node may determine a link to be accessed based on the received data acquisition link and send the link to be accessed to the edge node. Furthermore, when receiving the link to be accessed sent by the central node, the edge node may extract the target data indicated by the data acquisition task in the target page indicated by the link to be accessed, send the target data to the central node, and so on until all the target data indicated by the data acquisition task are acquired.
Based on the above processing, the central node can allocate links to be accessed to each edge node based on the processing state information of each edge node for data acquisition links. Each edge node acquires data according to the links to be accessed that it receives, and the plurality of edge nodes can each acquire data based on their own links to be accessed, so that the data acquisition efficiency can be improved. In addition, because the central node allocates links to be accessed based on the processing state information of each edge node for data acquisition links, the number of links to be accessed allocated to each edge node can be dynamically adjusted according to how that edge node is processing data acquisition links, and the data acquisition efficiency can be further improved.
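The overall loop of fig. 4 (add link, download, parse, extract new links, repeat) can be simulated in a single process as follows. This is a simplified sketch: the in-memory `SITE` table replaces real pages, `edge_process` replaces an edge node, dispatch is a direct function call rather than a network message, and a plain `set` stands in for the de-duplication step. All of these names and simplifications are assumptions.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (links found on the page, page content).
SITE = {
    "/": (["/a", "/b"], "home"),
    "/a": (["/b"], "page a"),
    "/b": ([], "page b"),
}


def edge_process(url):
    """Stand-in for an edge node: return the new links and target data of a page."""
    links, content = SITE[url]
    return links, content


def run_task(initial_links):
    """Stand-in for the central node: loop until no link to be accessed remains."""
    to_access = deque(initial_links)
    seen = set(initial_links)        # de-duplication of links
    results = {}
    while to_access:
        url = to_access.popleft()
        new_links, target_data = edge_process(url)
        results[url] = target_data   # data acquisition result so far
        for link in new_links:
            if link not in seen:
                seen.add(link)
                to_access.append(link)
    return results


acquired = run_task(["/"])
```

Because "/b" is reported by both "/" and "/a", the de-duplication step ensures that it is accessed only once.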
Referring to fig. 5, fig. 5 is a flowchart of another data acquisition method according to an embodiment of the present invention. The user creates code at the client, i.e., the user develops, at the client, task execution code for performing the data acquisition task. The client may store the developed task execution code to a preset file system in response to an instruction of the user.
The user may instruct the client to run the Runjob, the client may Create a Task, and send the created Task to the scheduling center (i.e., the central node in the foregoing embodiment), which may perform scheduling preparation. That is, the user creates a data acquisition task at the client, the client issues the data acquisition task to the scheduling center, and the scheduling center acquires the task execution code of the data acquisition task from the preset file system and sends a new link acquisition code and a page content acquisition code to the edge (i.e., the edge node in the foregoing embodiment).
The scheduling center determines the download link indicated by the data acquisition task (i.e., the link to be accessed in the foregoing embodiment) and sends the download link to the edge terminal, which receives the link. That is, the scheduling center loads the task start code of the data acquisition task to obtain the data acquisition link corresponding to the data acquisition task as the link to be accessed, and sends the link to be accessed to the edge terminal, and the edge terminal receives the link to be accessed sent by the scheduling center.
The edge terminal downloads; that is, the edge terminal can download data according to the received download link. In other words, when the edge terminal receives the links to be accessed sent by the scheduling center, the edge terminal selects one link to be accessed and acquires the HTML file of the target page indicated by that link. Then, the edge terminal performs link extraction: the edge terminal may extract a new link based on the HTML file of the target page and send the new link to the scheduling center, and the scheduling center receives the link. That is, the edge terminal extracts the data acquisition link in the target page based on the HTML file of the target page, and sends the extracted data acquisition link to the scheduling center. The scheduling center receives the data acquisition links sent by the edge terminals, performs de-duplication processing to obtain new download links, and sends the new download links to the edge terminals. That is, the scheduling center determines links to be accessed based on the data acquisition links carried in the target data sent by each edge terminal, and sends the links to be accessed to the edge terminals. The data acquisition process loops in this way until all the target data indicated by the data acquisition task are acquired, at which point the loop finishes.
The edge end analyzes, that is, the edge end can also analyze based on the HTML file of the target page to obtain analysis Results, and send the analysis Results to the dispatching center, and dispatch (i.e. the dispatching center) receives the Results. That is, the edge terminal obtains the target data in the page content of the target page based on the HTML file of the target page, and sends the target data to the dispatching center. And the dispatching center receives the target data and returns a data acquisition result containing the page content to the user. The dispatching center stores the data acquisition result of the data acquisition task to a storage address corresponding to the data acquisition task.
Based on the above processing, the central node can allocate links to be accessed to each edge node based on the processing state information of each edge node for data acquisition links. Each edge node acquires data according to the links to be accessed that it receives, and the plurality of edge nodes can each acquire data based on their own links to be accessed, so that the data acquisition efficiency can be improved. In addition, because the central node allocates links to be accessed based on the processing state information of each edge node for data acquisition links, the number of links to be accessed allocated to each edge node can be dynamically adjusted according to how that edge node is processing data acquisition links, and the data acquisition efficiency can be further improved.
The embodiment of the invention also provides a data acquisition system, which comprises: center node and edge node, wherein:
the center node is used for acquiring a link to be accessed; sending the link to be accessed to the edge node;
the edge node is used for extracting target data indicated by a data acquisition task in a target page indicated by the link to be accessed when the link to be accessed is received, and sending the target data to the center node;
the center node is further configured to receive the target data sent by the edge node, and obtain a data acquisition result of the data acquisition task.
Based on the data acquisition system provided by the embodiment of the invention, the center node can send the link to be accessed to the edge node, the edge node acquires the data according to the received link to be accessed, and the computing resource of the edge node is utilized for acquiring the data, so that the data acquisition efficiency can be improved.
Corresponding to the method embodiment of fig. 1, referring to fig. 6, fig. 6 is a block diagram of a data acquisition device according to an embodiment of the present invention, where the device is applied to a central node in a data acquisition system, and the device includes:
A first obtaining module 601, configured to obtain a link to be accessed;
a first sending module 602, configured to send the link to be accessed to an edge node, so that when the edge node receives the link to be accessed, the edge node extracts target data indicated by a data acquisition task in a target page indicated by the link to be accessed, and sends the target data to the central node;
and the receiving module 603 is configured to receive the target data sent by the edge node, and obtain a data acquisition result of the data acquisition task.
Optionally, the first obtaining module 601 is specifically configured to extract, from the configuration information of the data acquisition task, an initial link corresponding to the data acquisition task as the link to be accessed.
Optionally, the first obtaining module 601 is specifically configured to receive a data acquisition link sent by the edge node, where the data acquisition link is a link extracted by the edge node from the target page;
and to perform de-duplication processing on the received data acquisition links to obtain the link to be accessed.
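The de-duplication step can be sketched as follows. The function name `deduplicate_links` is hypothetical, and the sketch assumes that duplicates are removed both within a received batch and against links seen earlier; the text above does not say which of these is intended.

```python
from typing import Iterable, List, Set


def deduplicate_links(received: Iterable[str], seen: Set[str]) -> List[str]:
    """Return links not seen before, in arrival order, and record them in `seen`."""
    to_access: List[str] = []
    for link in received:
        if link not in seen:
            seen.add(link)
            to_access.append(link)
    return to_access


# Assumption: the initial link has already been handed out once.
seen_links: Set[str] = {"http://example.com/"}
batch = [
    "http://example.com/a",
    "http://example.com/",   # already seen
    "http://example.com/a",  # duplicate within the batch
    "http://example.com/b",
]
links_to_access = deduplicate_links(batch, seen_links)
```

A set keeps the membership check cheap; a deployment with very many links might prefer a different structure, which is outside what the embodiment describes.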
Optionally, there are a plurality of edge nodes;
the first sending module 602 is specifically configured to determine, from among the edge nodes, an edge node for processing the link to be accessed based on the current processing state information of each edge node for data acquisition links, and send the link to be accessed to the determined edge node.
Optionally, the first sending module 602 is specifically configured to determine, for each edge node, the number of data acquisition links that the edge node has currently received but not yet processed, as a first number;
and to determine, from among the edge nodes, the edge node whose first number is the smallest as the edge node for processing the link to be accessed.
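Selecting the edge node with the smallest first number can be sketched as below. The names `pick_edge_node` and `assign` and the node identifiers are hypothetical; how ties are broken is an assumption of this sketch, since the text does not address it.

```python
from typing import Dict


def pick_edge_node(pending: Dict[str, int]) -> str:
    """Return the node id with the fewest received-but-unprocessed links."""
    if not pending:
        raise ValueError("no edge nodes available")
    # Assumption: on ties, the first minimal node in iteration order is chosen.
    return min(pending, key=pending.__getitem__)


def assign(link: str, pending: Dict[str, int]) -> str:
    """Pick a node for `link` and update that node's pending count."""
    node = pick_edge_node(pending)
    pending[node] += 1  # the link is now received and unprocessed on that node
    return node


# "First number" per edge node (illustrative values).
pending_counts = {"edge-1": 3, "edge-2": 1, "edge-3": 2}
assignments = [assign(f"http://example.com/{i}", pending_counts) for i in range(3)]
```

Updating the count on every assignment is what lets the allocation adapt as nodes fall behind or catch up; a real central node would also decrease the count when an edge node reports a link as processed.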
Optionally, the apparatus further includes:
a second obtaining module, configured to obtain the task execution code of the data acquisition task from a preset file system before the first obtaining module 601 extracts, from the configuration information of the data acquisition task, the initial link corresponding to the data acquisition task as the link to be accessed; wherein the task execution code includes: task start code;
the first obtaining module 601 is specifically configured to load the task start code, so as to extract, from the configuration information of the data acquisition task, the initial link corresponding to the data acquisition task as the link to be accessed.
Optionally, the task execution code further includes: a new link acquisition code and a page content acquisition code;
the apparatus further comprises:
a second sending module, configured to send the new link acquisition code and the page content acquisition code to the edge node before the first sending module 602 sends the link to be accessed to the edge node, so that the edge node receives the new link acquisition code and the page content acquisition code and, when receiving the link to be accessed sent by the central node, loads the page content acquisition code to extract the target data indicated by the data acquisition task from the target page indicated by the link to be accessed, and loads the new link acquisition code to extract the data acquisition link from the target page indicated by the link to be accessed.
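The division of the task execution code into three pieces can be illustrated with a small sketch. It assumes, purely for illustration, that each piece is Python source defining one function and that it is loaded with `exec`; the names `TASK_CODE`, `load`, `start`, `page_content`, and `new_link` are hypothetical, and the embodiment does not specify the code format or the loading mechanism.

```python
from typing import Callable, Dict, List

# Hypothetical task execution code, as it might be stored in the preset file system.
TASK_CODE: Dict[str, str] = {
    "start": "def start(config):\n    return list(config['initial_links'])\n",
    "page_content": "def page_content(page):\n    return page.upper()\n",
    "new_link": (
        "def new_link(page):\n"
        "    return [w for w in page.split() if w.startswith('http')]\n"
    ),
}


def load(source: str, name: str) -> Callable:
    """Load one code piece and return the function it defines."""
    namespace: dict = {}
    exec(source, namespace)  # acceptable only for trusted task code
    return namespace[name]


# Center node: loads the task start code to obtain the initial links.
start = load(TASK_CODE["start"], "start")
initial_links: List[str] = start({"initial_links": ["http://example.com/"]})

# Edge node: loads the two pieces it received before any link arrives.
page_content = load(TASK_CODE["page_content"], "page_content")
new_link = load(TASK_CODE["new_link"], "new_link")

page = "see http://example.com/next for more"  # stub page text
target_data = page_content(page)
found_links = new_link(page)
```

Keeping the start code on the center node and shipping only the other two pieces mirrors the split described above: the edge node never needs the task configuration, only the rules for reading a page.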
Optionally, the apparatus further includes:
a third obtaining module, configured to obtain task execution information of the data acquisition task from a preset database before the second sending module sends the new link acquisition code and the page content acquisition code to the edge node; wherein the task execution information includes: the execution period of the data acquisition task;
the second sending module is specifically configured to send the new link acquisition code and the page content acquisition code to the edge node when the moment corresponding to the execution period is reached;
the first obtaining module 601 is specifically configured to load the task start code when the moment corresponding to the execution period is reached, so as to extract, from the configuration information of the data acquisition task, the initial link corresponding to the data acquisition task as the link to be accessed.
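The role of the execution period can be sketched as a simple due-time check. The record `TaskExecutionInfo`, its fields, and the action names are hypothetical; the sketch also assumes the period is a fixed interval in seconds, which the text does not state.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskExecutionInfo:
    """Hypothetical record read from the preset database."""
    task_id: str
    period_seconds: int  # assumed form of the execution period
    last_run: float      # timestamp of the previous run, in seconds


def is_due(info: TaskExecutionInfo, now: float) -> bool:
    """True when at least one full period has elapsed since the last run."""
    return now - info.last_run >= info.period_seconds


def trigger(info: TaskExecutionInfo, now: float) -> List[str]:
    """Return the actions the center node would take at time `now`."""
    if not is_due(info, now):
        return []
    info.last_run = now
    # Both steps are tied to the same moment in the description above.
    return ["send_codes_to_edge_nodes", "load_task_start_code"]


info = TaskExecutionInfo(task_id="demo", period_seconds=3600, last_run=0.0)
early = trigger(info, now=1800.0)
on_time = trigger(info, now=3600.0)
```

Passing `now` in explicitly keeps the check deterministic; a running service would supply the current clock time instead.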
Optionally, the task execution information further includes a storage address corresponding to the data acquisition task;
the apparatus further comprises:
a storage module, configured to store the data acquisition result of the data acquisition task to the storage address corresponding to the data acquisition task after the receiving module 603 receives the target data sent by the edge node and obtains the data acquisition result of the data acquisition task.
Based on the data acquisition device provided by the embodiment of the invention, the center node can send the link to be accessed to the edge node, the edge node acquires the data according to the received link to be accessed, and the computing resource of the edge node is utilized for acquiring the data, so that the data acquisition efficiency can be improved.
Corresponding to the method embodiment of Fig. 1, Fig. 7 is a block diagram of a data acquisition device according to an embodiment of the present invention. The device is applied to an edge node in a data acquisition system and includes:
A first extracting module 701, configured to extract, when receiving a link to be accessed sent by a central node, target data indicated by a data acquisition task in a target page indicated by the link to be accessed;
and the sending module 702 is configured to send the target data to the central node, so that the central node receives the target data and obtains a data acquisition result of the data acquisition task.
Optionally, the apparatus further includes:
a second extraction module, configured to, when the link to be accessed sent by the central node is received, extract the data acquisition link from the target page indicated by the link to be accessed and send the data acquisition link to the central node, so that the central node performs de-duplication processing on the received data acquisition links to obtain the link to be accessed.
Optionally, the apparatus further includes:
a receiving module, configured to receive the new link acquisition code and the page content acquisition code sent by the central node before the first extraction module 701 extracts, upon receiving the link to be accessed sent by the central node, the target data indicated by the data acquisition task from the target page indicated by the link to be accessed;
The first extraction module 701 is specifically configured to load the page content acquisition code when receiving a link to be accessed sent by a central node, so as to extract target data indicated by a data acquisition task in a target page indicated by the link to be accessed;
the second extracting module is specifically configured to load the new link acquiring code when receiving a link to be accessed sent by the central node, so as to extract a data acquiring link in a target page indicated by the link to be accessed.
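What the two extraction modules do on the edge node can be sketched with the Python standard library. The class `PageParser` and the function `process_link` are hypothetical, and treating the page title as the target data is an assumption made only for illustration; the actual target data depends on the data acquisition task.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects the <title> text (assumed target data) and all <a href> links."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []
        self.title: Optional[str] = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the link that was accessed.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data


def process_link(link: str, html: str) -> Tuple[Optional[str], List[str]]:
    """Return (target data, data acquisition links) for one accessed page."""
    parser = PageParser(link)
    parser.feed(html)
    parser.close()
    return parser.title, parser.links


# Stub page content; a real edge node would download it from `link`.
HTML = (
    "<html><head><title>Demo</title></head><body>"
    '<a href="/a">A</a><a href="http://other.example/b">B</a>'
    "</body></html>"
)
target_data, data_links = process_link("http://example.com/index", HTML)
```

The target data would then be sent to the central node as the result for this link, and the extracted links would be sent back as data acquisition links for de-duplication.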
Based on the data acquisition device provided by the embodiment of the invention, the center node can send the link to be accessed to the edge node, the edge node acquires the data according to the received link to be accessed, and the computing resource of the edge node is utilized for acquiring the data, so that the data acquisition efficiency can be improved.
The embodiment of the present invention further provides an electronic device, as shown in Fig. 8, including a processor 801, a communication interface 802, a memory 803, and a communication bus 804, where the processor 801, the communication interface 802, and the memory 803 communicate with one another through the communication bus 804;
a memory 803 for storing a computer program;
the processor 801 is configured to implement the data acquisition method steps applied to the central node or the data acquisition method steps applied to the edge node according to any one of the above embodiments when executing the program stored in the memory 803.
The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is represented by only one bold line in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include random access memory (RAM) or non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, there is further provided a computer readable storage medium, in which a computer program is stored, the computer program implementing the data acquisition method applied to a center node or the data acquisition method applied to an edge node according to any one of the above embodiments when executed by a processor.
In a further embodiment of the present invention, there is also provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the data acquisition method applied to a central node or the data acquisition method applied to an edge node as described in any of the above embodiments.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, wireless, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus, system, electronic device, computer-readable storage medium, and computer program product embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (18)

1. A data acquisition method, the method being applied to a data acquisition system, the data acquisition system comprising: a center node and an edge node, the method comprising:
the central node acquires a link to be accessed; sending the link to be accessed to the edge node;
when the edge node receives the link to be accessed, extracting target data indicated by a data acquisition task in a target page indicated by the link to be accessed, and sending the target data to the center node;
and the center node receives the target data sent by the edge node, and obtains a data acquisition result of the data acquisition task.
2. A method of data acquisition, the method being applied to a central node in a data acquisition system, the method comprising:
acquiring a link to be accessed;
sending the link to be accessed to an edge node, so that when the edge node receives the link to be accessed, extracting target data indicated by a data acquisition task in a target page indicated by the link to be accessed, and sending the target data to the center node;
And receiving target data sent by the edge node to obtain a data acquisition result of the data acquisition task.
3. The method of claim 2, wherein the obtaining the link to be accessed comprises:
and extracting an initial link corresponding to the data acquisition task from the configuration information of the data acquisition task to serve as a link to be accessed.
4. The method of claim 2, wherein the obtaining the link to be accessed comprises:
receiving a data acquisition link sent by the edge node; wherein the data acquisition link is a link extracted by the edge node from the target page;
and carrying out de-duplication processing on the received data acquisition link to obtain a link to be accessed.
5. The method of claim 2, wherein there are a plurality of edge nodes;
the sending the link to be accessed to the edge node includes:
determining, from among the edge nodes, an edge node for processing the link to be accessed based on the current processing state information of each edge node for data acquisition links, and sending the link to be accessed to the determined edge node.
6. The method of claim 5, wherein the determining, from among the edge nodes, an edge node for processing the link to be accessed based on the current processing state information of each edge node for data acquisition links comprises:
for each edge node, determining the number of data acquisition links that the edge node has currently received but not yet processed as a first number;
and determining, from among the edge nodes, the edge node whose first number is the smallest as the edge node for processing the link to be accessed.
7. The method of claim 3, wherein before the extracting, from the configuration information of the data acquisition task, an initial link corresponding to the data acquisition task as a link to be accessed, the method further comprises:
acquiring task execution code of the data acquisition task from a preset file system; wherein the task execution code includes: task start code;
the extracting, from the configuration information of the data acquisition task, an initial link corresponding to the data acquisition task as a link to be accessed comprises:
loading the task start code to extract, from the configuration information of the data acquisition task, the initial link corresponding to the data acquisition task as the link to be accessed.
8. The method of claim 7, wherein the task execution code further comprises: a new link acquisition code and a page content acquisition code;
before sending the link to be accessed to the edge node, the method further comprises:
the new link acquisition code and the page content acquisition code are sent to the edge node, so that the edge node receives the new link acquisition code and the page content acquisition code, when a link to be accessed sent by the center node is received, the page content acquisition code is loaded to extract target data indicated by the data acquisition task in a target page indicated by the link to be accessed, and the new link acquisition code is loaded to extract the data acquisition link in the target page indicated by the link to be accessed.
9. The method of claim 8, wherein prior to said sending the new link acquisition code and the page content acquisition code to the edge node, the method further comprises:
acquiring task execution information of the data acquisition task from a preset database; wherein the task execution information includes: the execution period of the data acquisition task;
The sending the new link acquisition code and the page content acquisition code to the edge node includes:
when the moment corresponding to the execution period is reached, sending the new link acquisition code and the page content acquisition code to the edge node;
the loading the task starting code to extract an initial link corresponding to the data acquisition task from the configuration information of the data acquisition task as a link to be accessed includes:
and when the moment corresponding to the execution period is reached, loading the task starting code to extract an initial link corresponding to the data acquisition task from the configuration information of the data acquisition task as a link to be accessed.
10. The method of claim 9, wherein the task execution information further includes a storage address corresponding to the data acquisition task;
after receiving the target data sent by the edge node and obtaining the data acquisition result of the data acquisition task, the method further comprises:
and storing the data acquisition result of the data acquisition task to a storage address corresponding to the data acquisition task.
11. A method of data acquisition, the method being applied to an edge node in a data acquisition system, the method comprising:
When a link to be accessed sent by a central node is received, extracting target data indicated by a data acquisition task in a target page indicated by the link to be accessed;
and sending the target data to the central node so that the central node receives the target data to obtain a data acquisition result of the data acquisition task.
12. The method of claim 11, wherein the method further comprises:
when a link to be accessed sent by a central node is received, extracting a data acquisition link in a target page indicated by the link to be accessed, and sending the data acquisition link to the central node so that the central node performs de-duplication processing on the received data acquisition link to obtain the link to be accessed.
13. The method of claim 12, wherein before the extracting, when a link to be accessed sent by the central node is received, target data indicated by a data acquisition task from the target page indicated by the link to be accessed, the method further comprises:
receiving a new link acquisition code and a page content acquisition code sent by the central node;
the extracting, when a link to be accessed sent by the central node is received, target data indicated by a data acquisition task from the target page indicated by the link to be accessed comprises:
when a link to be accessed sent by the central node is received, loading the page content acquisition code to extract the target data indicated by the data acquisition task from the target page indicated by the link to be accessed;
the extracting, when a link to be accessed sent by the central node is received, a data acquisition link from the target page indicated by the link to be accessed comprises:
when a link to be accessed sent by the central node is received, loading the new link acquisition code to extract the data acquisition link from the target page indicated by the link to be accessed.
14. A data acquisition system, the data acquisition system comprising: center node and edge node, wherein:
the center node is used for acquiring a link to be accessed; sending the link to be accessed to the edge node;
the edge node is used for extracting target data indicated by a data acquisition task in a target page indicated by the link to be accessed when the link to be accessed is received, and sending the target data to the center node;
the center node is further configured to receive the target data sent by the edge node, and obtain a data acquisition result of the data acquisition task.
15. A data acquisition device for use in a central node in a data acquisition system, the device comprising:
the first acquisition module is used for acquiring a link to be accessed;
the first sending module is used for sending the link to be accessed to an edge node, so that when the edge node receives the link to be accessed, the edge node extracts target data indicated by a data acquisition task in a target page indicated by the link to be accessed and sends the target data to the center node;
and the receiving module is used for receiving the target data sent by the edge node and obtaining a data acquisition result of the data acquisition task.
16. A data acquisition device for use in an edge node in a data acquisition system, the device comprising:
the first extraction module is used for extracting target data indicated by a data acquisition task in a target page indicated by a link to be accessed when the link to be accessed sent by the central node is received;
and the sending module is used for sending the target data to the central node so that the central node receives the target data and obtains a data acquisition result of the data acquisition task.
17. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the method steps of any one of claims 2-10 or any one of claims 11-13 when executing the program stored in the memory.
18. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method steps of any one of claims 2-10 or any one of claims 11-13.
CN202210757236.4A 2022-06-29 2022-06-29 Data acquisition method, device, system, electronic equipment and storage medium Active CN115277694B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210757236.4A CN115277694B (en) 2022-06-29 2022-06-29 Data acquisition method, device, system, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115277694A (en) 2022-11-01
CN115277694B (en) 2023-12-08

Family

ID=83763640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210757236.4A Active CN115277694B (en) 2022-06-29 2022-06-29 Data acquisition method, device, system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115277694B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101626385A (en) * 2009-08-10 2010-01-13 中兴通讯股份有限公司 Media service method and media service system
JP2011002932A (en) * 2009-06-17 2011-01-06 Brother Industries Ltd Content distribution system, node device and information processing apparatus thereof, and method of distributing content data
CN101984637A (en) * 2010-11-02 2011-03-09 中兴通讯股份有限公司 Content distribution implementation method and system
CN104699757A (en) * 2015-01-15 2015-06-10 南京邮电大学 Distributed network information acquisition method in cloud environment
US9871850B1 (en) * 2014-06-20 2018-01-16 Amazon Technologies, Inc. Enhanced browsing using CDN routing capabilities
CN108353095A (en) * 2017-09-30 2018-07-31 深圳前海达闼云端智能科技有限公司 Domain name analytic method, client, fringe node and domain name analysis system
CN110247977A (en) * 2019-06-17 2019-09-17 中国联合网络通信集团有限公司 A kind of method and system of the data fusion based on edge calculations
CN110336790A (en) * 2019-05-29 2019-10-15 网宿科技股份有限公司 A kind of method and system of website detection
CN110943876A (en) * 2018-09-21 2020-03-31 阿里巴巴集团控股有限公司 URL state detection method, device, equipment and system
CN111459657A (en) * 2020-03-09 2020-07-28 重庆邮电大学 Task allocation method based on edge-assisted data quality perception
CN112231296A (en) * 2020-09-30 2021-01-15 北京金山云网络技术有限公司 Distributed log processing method, device, system, equipment and medium
CN113645288A (en) * 2021-08-02 2021-11-12 北京金山云网络技术有限公司 Data downloading method and device, computer equipment and storage medium
CN113986489A (en) * 2021-10-21 2022-01-28 远景智能国际私人投资有限公司 Task execution method and device of heterogeneous system, computer equipment and storage medium
WO2022057318A1 (en) * 2020-09-21 2022-03-24 北京金山云网络技术有限公司 Stream pulling request processing method, apparatus and system, electronic device, and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7921259B2 (en) * 2007-09-07 2011-04-05 Edgecast Networks, Inc. Content network global replacement policy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant