CN112380418A

CN112380418A - Data processing method and system based on web crawler and cloud platform

Info

Publication number: CN112380418A
Application number: CN202011618649.1A
Authority: CN
Inventors: 詹能勇; 刘振宇
Original assignee: Guangzhou Zhiyunshang Big Data Technology Co ltd
Current assignee: Jinfu software (Guangzhou) Co.,Ltd.
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2021-02-19
Anticipated expiration: 2040-12-31
Also published as: CN112380418B

Abstract

The invention relates to the technical field of internet and data processing, in particular to a data processing method and system based on a web crawler and a cloud platform. According to the method, a webpage crawler instruction input by a user is obtained, the webpage crawler instruction comprises target webpage information and a crawling object set, target crawler data corresponding to the target webpage information and the crawling object set are obtained, and the target crawler data are stored into target distributed storage nodes, wherein the target distributed storage nodes are storage nodes corresponding to the webpage object set in a distributed storage system; compared with the prior art, the method and the device can improve the reliability of storing the crawler data during large-scale data crawling, and can fully crawl data required by a user by crawling the current webpage content data and the historical webpage content data, so that the integrity of data crawling is improved.

Description

Data processing method and system based on web crawler and cloud platform

Technical Field

The invention relates to the technical field of internet and data processing, in particular to a data processing method and system based on a web crawler and a cloud platform.

Background

The web crawler is a program or script which can automatically capture webpage information according to a set rule; by utilizing the web crawler, the webpage data required by the user can be quickly acquired, so that technical support is provided for large-scale data collection.

When data is crawled with a web crawler, the prior art typically stores the crawled data locally on the device. However, in a large-scale data analysis scenario the amount of crawled data is large, so crawler data from different web pages may contaminate one another, which reduces the reliability of data crawling.

Disclosure of Invention

The invention aims to provide a data processing method and system based on a web crawler, and a cloud platform, so as to solve at least some of the above technical problems.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

In a first aspect, the present invention provides a data processing method based on a web crawler, the method including:

acquiring a webpage crawler instruction input by a user, wherein the webpage crawler instruction comprises target webpage information and a crawling object set, and the crawling object set is used for indicating a webpage object set to be crawled in the target webpage information;

acquiring target crawler data corresponding to the target webpage information and the crawling object set;

and storing the target crawler data into target distributed storage nodes, wherein the target distributed storage nodes are storage nodes corresponding to the webpage object set in the distributed storage system.

Optionally, as an implementation manner, the obtaining target crawler data corresponding to the target webpage information and the set of crawled objects includes:

acquiring current webpage content data and historical webpage content data corresponding to the target webpage information;

matching target crawler data corresponding to the web page object set in the current web page content data according to the crawling object set, and matching target crawler data corresponding to the web page object set in the historical web page content data;

the current webpage content data is webpage content data of a webpage address indicated by the target webpage information at the current moment, and the historical webpage content data is webpage content data of the webpage address indicated by the target webpage information at the historical moment.

Optionally, as an embodiment, the storing of the target crawler data into the target distributed storage nodes includes:

initializing unit data storage resource quantity of a data storage process in the process of storing the target crawler data corresponding to the webpage object set to the target distributed storage nodes, wherein the unit data storage resource quantity of the data storage process is used for indicating the data quantity of the target crawler data corresponding to each data storage operation;

acquiring a unit data storage upper limit value, wherein the unit data storage upper limit value is used for indicating the upper limit data quantity of target crawler data corresponding to each data storage operation in the process of storing the target crawler data in the data storage process;

updating the unit data storage resource amount of the data storage process according to the unit data storage upper limit value;

according to the updated data storage process, continuing to perform data storage operations on the target crawler data;

wherein, the updating the unit data storage resource amount of the data storage process according to the unit data storage upper limit value comprises:

when the unit data storage upper limit value is smaller than a preset unit data storage threshold value, multiplying the unit data storage resource amount of the data storage process by a first preset proportionality coefficient, and taking the calculated result as the updated unit data storage resource amount of the data storage process;

and when the unit data storage upper limit value is greater than or equal to the unit data storage threshold value, dividing the unit data storage resource amount of the data storage process by a second preset proportionality coefficient, and taking the calculated result as the updated unit data storage resource amount of the data storage process.

Optionally, as an implementation manner, the matching of the target crawler data corresponding to the web page object set in the historical web page content data includes:

acquiring target virtual object data corresponding to each target webpage object in the webpage object set; wherein each target virtual object data is used to represent the webpage portrait of the corresponding target webpage object;

when the security verification of the target virtual object data corresponding to all the target webpage objects in the webpage object set is passed, performing virtual object conversion on each target virtual object data according to an object conversion strategy to obtain a webpage object portrait corresponding to each target webpage object;

when each webpage object portrait meets a preset portrait matching condition, acquiring historical webpage object data corresponding to each webpage object portrait matched with the received crawling time interval from a webpage information storage server to obtain an initial crawler data packet corresponding to each webpage object portrait; the webpage information storage server stores crawler data corresponding to all target webpage objects;

merging the initial crawler data packets with the same portrait type label according to the portrait type label carried by each initial crawler data packet to obtain virtual object data corresponding to each portrait type label;

screening virtual object data corresponding to each portrait type label to obtain intermediate virtual object data;

generating a key for each intermediate virtual object data according to a preset key generation strategy to obtain a virtual object key signaling corresponding to each intermediate virtual object data;

performing security signaling verification on all the virtual object key signaling;

determining, according to the verification result of each security signaling verification, the intermediate virtual object data corresponding to all the virtual object key signaling that passes the security verification as virtual object data to be selected;

constructing target webpage objects corresponding to all the to-be-selected virtual object data into a to-be-selected webpage object set;

and acquiring the crawler data corresponding to all target webpage objects in the webpage object set to be selected by the webpage information storage server as the matched target crawler data corresponding to the webpage object set.

Optionally, as an implementation manner, the acquiring target virtual object data corresponding to each target web page object in the web page object set includes:

reading current portrait data stored by a webpage object portrait storage node, and counting the number of read nodes;

when the number of the read nodes meets a set node number threshold, selecting a target portrait storage node cluster from all read webpage object portrait storage nodes according to a preset node selection strategy;

calculating an average portrait value level and a standard portrait value level according to the current portrait data read from each webpage object portrait storage node in the target portrait storage node cluster;

searching the webpage object portrait storage nodes for normal webpage object portrait storage nodes according to the average portrait value level and the standard portrait value level, and taking the current portrait data corresponding to each normal webpage object portrait storage node found as a webpage object portrait;

historical webpage object data of each webpage object portrait in a preset data analysis time period are obtained from a webpage information storage server, and virtual object data of each webpage object portrait are obtained;

and classifying the virtual object data carrying the same portrait type label according to the portrait type label carried by the virtual object data to obtain target virtual object data corresponding to each target webpage object.

Optionally, as an implementation manner, after the obtaining target virtual object data corresponding to each target web page object, the method further includes:

screening target verification virtual object data from all the target virtual object data according to the target portrait type label corresponding to each target virtual object data;

calculating the average value level of the reference portrait according to the target verification virtual object data;

comparing the reference portrait average level with a preset reference portrait average level threshold;

and when the reference portrait average level is larger than the preset reference portrait average level threshold value, determining that the reference portrait average level is verified to be passed.

Optionally, as an implementation manner, the acquiring current webpage content data corresponding to the target webpage information includes:

determining all webpage contents related to the webpage links and the target webpage information as webpage contents to be selected in a service database according to the target webpage information, wherein the service database comprises a plurality of webpage data and the webpage contents corresponding to the webpage data;

and acquiring the current webpage content data corresponding to the webpage content to be selected.

Optionally, as an implementation manner, the acquiring historical webpage content data corresponding to the target webpage information includes:

obtaining a historical webpage data storage table entry;

receiving a first table entry parsing strategy corresponding to the historical webpage data storage table entry;

determining the table entry content address range indicated by the first table entry parsing strategy;

traversing a target table entry address range corresponding to the target webpage information;

when the table entry content address range included in the traversed target table entry address range does not exceed the table entry content address range indicated by the first table entry parsing strategy, continuing the traversal;

when the table entry content address range included in the traversed target table entry address range reaches the table entry content address range indicated by the first table entry parsing strategy, taking the traversed target table entry address range as a candidate target table entry address range;

when the table entry content address range included in the traversed target table entry address range exceeds the table entry content address range indicated by the first table entry parsing strategy, sorting the addresses of the traversed target table entry address range according to the order of the table entry addresses in the included table entry content address range to obtain a candidate target table entry address range;

determining the address matching degree between the candidate target table entry address range and a preset address search range;

when the address matching degree reaches a set matching degree threshold, taking the target table entry address range as the table entry address range to be searched;

searching a webpage information storage server for the webpage content data time distribution label corresponding to the target table entry address range;

matching a second table entry parsing strategy corresponding to the historical webpage data storage table entry according to the time distribution characteristics of the webpage content data time distribution label;

searching the webpage information storage server for the target historical information search address range corresponding to the second table entry parsing strategy;

and searching the webpage information storage server for the webpage content data corresponding to the target historical information search address range to obtain the historical webpage content data.

In a second aspect, the present invention provides a web crawler-based data processing system, the system comprising:

a processing module, used for acquiring a webpage crawler instruction input by a user, wherein the webpage crawler instruction comprises target webpage information and a crawling object set, and the crawling object set is used for indicating a webpage object set to be crawled in the target webpage information;

the processing module is further used for acquiring target crawler data corresponding to the target webpage information and the crawling object set;

and the storage module is used for storing the target crawler data into target distributed storage nodes, wherein the target distributed storage nodes are storage nodes corresponding to the webpage object set in the distributed storage system.

In a third aspect, the present invention provides an electronic device comprising a memory and a processor, the memory being used for storing one or more programs; when the one or more programs are executed by the processor, the above web crawler-based data processing method is implemented.

In a fourth aspect, the present invention provides a cloud platform, on which a computer program is stored, and the computer program, when executed by a processor, implements the above-mentioned web crawler-based data processing method.

According to the data processing method, the data processing system and the cloud platform based on the web crawler, a web crawler instruction input by a user is obtained, the web crawler instruction comprises target web page information and a crawling object set, then target crawler data corresponding to the target web page information and the crawling object set are obtained, and the target crawler data are stored into target distributed storage nodes, wherein the target distributed storage nodes are storage nodes corresponding to the web page object set in a distributed storage system; compared with the prior art, the reliability of crawler data storage during large-scale data crawling can be improved.

In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to illustrate the technical solution of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be considered as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.

Fig. 1 is a block diagram of an electronic device according to the present invention.

Fig. 2 is a flowchart of a data processing method based on web crawlers according to the present invention.

Fig. 3 is a block diagram of a web crawler-based data processing system according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings in some embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. The components of the present invention, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on a part of the embodiments of the present invention, belong to the protection scope of the present invention.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present invention, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Some embodiments of the invention are described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.

Referring to Fig. 1, Fig. 1 is a block diagram of an electronic device 100 according to the present invention. In this embodiment, the electronic device 100 includes a memory 101, a processor 102 and a communication interface 103, which are electrically connected to each other directly or indirectly to implement data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines.

The memory 101 may be used to store software programs and modules, such as program instructions/modules corresponding to the data processing system provided by the present invention, and the processor 102 executes the software programs and modules stored in the memory 101 to execute various functional applications and data processing, thereby executing the steps of the data processing method provided by the present invention. The communication interface 103 may be used for communicating signaling or data with other node devices.

The memory 101 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.

The processor 102 may be an integrated circuit chip having signal processing capabilities. The processor 102 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

Referring to fig. 2, fig. 2 is a flowchart of a data processing method based on web crawlers according to the present invention, where the data processing method includes the following steps:

Step S301, acquiring a webpage crawler instruction input by a user.

In this embodiment, the electronic device 100 is taken as the execution subject by way of example. When a user performs a web crawler search, the user may input a web crawler instruction to the electronic device. The web crawler instruction includes target web page information and a crawling object set: the target web page information indicates the web address of the web page to be crawled, and the crawling object set indicates the web page object set to be crawled in the target web page information. It is understood that the web page object set may include a plurality of target web page objects, and each target web page object may be one item of content to be extracted from the target web page information, such as the number of clicks of a sub-link.

Step S302, acquiring target crawler data corresponding to the target webpage information and the crawling object set.

In this embodiment, the electronic device may obtain, in response to the web crawler instruction, target crawler data corresponding to the target web page information and the set of crawled objects.

Step S303, storing the target crawler data into a target distributed storage node.

In this embodiment, the electronic device may store crawled crawler data by using a distributed storage system, where the distributed storage system may be composed of a plurality of storage nodes, and the web page object set corresponds to a target distributed storage node in the distributed storage system; therefore, the electronic device may store the target crawler data into the target distributed storage node.

Thus, according to the implementation scheme provided by the invention, a webpage crawler instruction input by a user is acquired, the webpage crawler instruction comprising target webpage information and a crawling object set; target crawler data corresponding to the target webpage information and the crawling object set are then acquired and stored into target distributed storage nodes, wherein the target distributed storage nodes are the storage nodes corresponding to the webpage object set in a distributed storage system. Compared with the prior art, the reliability of crawler data storage during large-scale data crawling can be improved.
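Steps S301 to S303 can be illustrated with a minimal sketch. The patent does not specify how a web page object set is mapped to its storage node, so the hash-based routing below is an assumption, as are all names (`CrawlerInstruction`, `DistributedStore`, `fetch_target_data`, `run`) and the in-memory stand-in for the distributed storage system.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CrawlerInstruction:
    """Hypothetical container for a web page crawler instruction (S301)."""
    target_url: str                # target web page information
    crawl_objects: frozenset[str]  # crawling object set


@dataclass
class DistributedStore:
    """Toy in-memory stand-in for a distributed storage system."""
    node_count: int
    nodes: dict[int, list[dict]] = field(default_factory=dict)

    def node_for(self, crawl_objects: frozenset[str]) -> int:
        # Assumption: the target node is chosen by hashing the sorted
        # object names, so the same object set always maps to one node.
        key = ",".join(sorted(crawl_objects)).encode("utf-8")
        digest = hashlib.sha256(key).digest()
        return int.from_bytes(digest[:8], "big") % self.node_count

    def save(self, crawl_objects: frozenset[str], data: dict) -> int:
        node = self.node_for(crawl_objects)
        self.nodes.setdefault(node, []).append(data)
        return node


def fetch_target_data(instruction: CrawlerInstruction, page: dict) -> dict:
    """S302: keep only the fields named in the crawling object set."""
    return {k: v for k, v in page.items() if k in instruction.crawl_objects}


def run(instruction: CrawlerInstruction, page: dict,
        store: DistributedStore) -> int:
    data = fetch_target_data(instruction, page)         # S302
    return store.save(instruction.crawl_objects, data)  # S303
```

Here `page` stands for already-downloaded and parsed page content; fetching and parsing are outside the scope of the sketch.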

As an embodiment, when the electronic device executes step S302 to obtain target crawler data, it may first obtain current web content data and historical web content data corresponding to the target web page information; then, the electronic equipment matches target crawler data corresponding to the webpage object set in the current webpage content data according to the crawling object set, and matches target crawler data corresponding to the webpage object set in the historical webpage content data; the current webpage content data is webpage content data of a webpage address indicated by the target webpage information at the current moment, and the historical webpage content data is webpage content data of the webpage address indicated by the target webpage information at the historical moment. Therefore, the data required by the user can be fully crawled by crawling the current webpage content data and the historical webpage content data, and the integrity of data crawling is improved.
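As a rough illustration of matching the crawling object set against both current and historical content, the sketch below assumes that page content has already been parsed into dictionaries and that historical snapshots are keyed by timestamp. The function names and data shapes are hypothetical.

```python
def match_objects(content: dict, crawl_objects: set) -> dict:
    """Return the entries of one content snapshot named in the object set."""
    return {name: content[name] for name in crawl_objects if name in content}


def collect_target_data(current: dict, history: dict, crawl_objects: set) -> dict:
    """Match the object set in current content and in every historical snapshot.

    `history` maps a timestamp string to the content captured at that moment.
    """
    return {
        "current": match_objects(current, crawl_objects),
        "historical": {
            ts: match_objects(snapshot, crawl_objects)
            for ts, snapshot in sorted(history.items())
        },
    }
```

Keeping the two result sets separate lets a caller see how a crawled object, such as a click count, changed over time.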

In addition, as an embodiment, when the electronic device executes step S303 to store the target crawler data into the target distributed storage node, in the process of storing the target crawler data corresponding to the web page object set into the target distributed storage node, a unit data storage resource amount of a data storage process may be initialized, where the unit data storage resource amount of the data storage process is used to indicate a data amount of the target crawler data corresponding to each data storage operation.

Then, the electronic device obtains a unit data storage upper limit value, where the unit data storage upper limit value is used to indicate the upper limit of the data amount of target crawler data corresponding to each data storage operation while the data storage process stores the target crawler data.

Next, the electronic device updates the unit data storage resource amount of the data storage process according to the unit data storage upper limit value.

The electronic device then continues to perform data storage operations on the target crawler data according to the updated data storage process.

When the unit data storage upper limit value is smaller than a preset unit data storage threshold value, the electronic device multiplies the unit data storage resource amount of the data storage process by a first preset proportionality coefficient, and uses the calculated result as the updated unit data storage resource amount of the data storage process. On the other hand, when the unit data storage upper limit value is greater than or equal to the unit data storage threshold, the electronic device divides the unit data storage resource amount of the data storage process by a second preset proportionality coefficient, and uses the calculated result as the updated unit data storage resource amount of the data storage process. That is to say, in this embodiment, the electronic device adjusts the unit data storage resource amount by using the first preset proportionality coefficient and the second preset proportionality coefficient.

It can be understood that the first preset proportionality coefficient and the second preset proportionality coefficient are both coefficients preset by a user; their specific values are those input by the user and are not limited in the present invention.
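The update rule above can be written as a short sketch. The function and parameter names are hypothetical, and the batching helper is only one possible reading of "continuing the data storage operation with the updated amount"; the patent does not prescribe how records are grouped per operation.

```python
def update_unit_amount(unit_amount: float, upper_limit: float, threshold: float,
                       first_coeff: float, second_coeff: float) -> float:
    """Update the unit data storage resource amount as described in the text.

    Below the threshold the amount is multiplied by the first coefficient;
    at or above the threshold it is divided by the second coefficient.
    """
    if first_coeff <= 0 or second_coeff <= 0:
        raise ValueError("proportionality coefficients must be positive")
    if upper_limit < threshold:
        return unit_amount * first_coeff
    return unit_amount / second_coeff


def save_in_batches(records: list, unit_amount: float, upper_limit: float,
                    threshold: float, first_coeff: float,
                    second_coeff: float) -> list:
    """Split records into per-operation batches using the updated amount."""
    updated = update_unit_amount(unit_amount, upper_limit, threshold,
                                 first_coeff, second_coeff)
    size = max(1, int(updated))  # assumption: at least one record per operation
    return [records[i:i + size] for i in range(0, len(records), size)]
```

For example, with a unit amount of 100, a threshold of 80, a first coefficient of 2 and a second coefficient of 4, an upper limit of 50 yields 200, while an upper limit of 80 yields 25.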

In this embodiment, when the electronic device matches target crawler data corresponding to the web page object set in the historical web page content data, target virtual object data corresponding to each target web page object in the web page object set may be obtained first; wherein each target virtual object data is used to represent the web page portrait of the corresponding target web page object.

Then, when the security verification of the target virtual object data corresponding to all the target webpage objects in the webpage object set is passed, the electronic device performs virtual object conversion on each target virtual object data according to an object conversion strategy to obtain a webpage object portrait corresponding to each target webpage object.

Next, when each webpage object portrait meets a preset portrait matching condition, the electronic equipment acquires historical webpage object data corresponding to each webpage object portrait matched with the received crawling time interval from a webpage information storage server to obtain an initial crawler data packet corresponding to each webpage object portrait; and the webpage information storage server stores the crawler data corresponding to all the target webpage objects.

Then, the electronic equipment merges the initial crawler data packets with the same portrait type tag according to the portrait type tag carried by each initial crawler data packet to obtain virtual object data corresponding to each portrait type tag.

Then, the electronic device filters the virtual object data corresponding to each portrait type label to obtain intermediate virtual object data.

Then, the electronic device generates a key for each piece of intermediate virtual object data according to a preconfigured key generation strategy to obtain the virtual object key signaling corresponding to each piece of intermediate virtual object data.

Next, the electronic device performs security signaling verification on all the virtual object key signaling.

According to the verification result of each security signaling verification, the electronic device then determines the intermediate virtual object data corresponding to all the virtual object key signaling that passes the security verification as the virtual object data to be selected.

The electronic device then constructs the target webpage objects corresponding to all the virtual object data to be selected into a webpage object set to be selected.

Finally, the electronic device acquires, from the webpage information storage server, the crawler data corresponding to all target webpage objects in the webpage object set to be selected as the matched target crawler data corresponding to the webpage object set.

Therefore, by the scheme provided by the invention, the safety of the crawler data can be improved, and the data pollution is avoided.
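The merge, key generation and verification steps can be sketched as follows. The patent leaves the key generation strategy and the verification method unspecified; HMAC-SHA256 over the merged records is swapped in here purely as an assumption, and the packet layout (`tag`, `records`) and all function names are hypothetical. The screening step is omitted because its criteria are not given.

```python
import hashlib
import hmac
from collections import defaultdict


def merge_by_tag(packets: list) -> dict:
    """Merge initial crawler data packets that carry the same portrait type tag."""
    merged = defaultdict(list)
    for packet in packets:
        merged[packet["tag"]].extend(packet["records"])
    return dict(merged)


def make_key_signaling(secret: bytes, tag: str, records: list) -> str:
    # Assumption: the "key signaling" is an HMAC-SHA256 over tag and records.
    message = (tag + "\n" + "\n".join(records)).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_key_signaling(secret: bytes, tag: str, records: list,
                         signaling: str) -> bool:
    expected = make_key_signaling(secret, tag, records)
    return hmac.compare_digest(expected, signaling)  # constant-time comparison


def select_candidates(secret: bytes, merged: dict, signalings: dict) -> dict:
    """Keep only the merged data whose key signaling passes verification."""
    return {
        tag: records
        for tag, records in merged.items()
        if tag in signalings
        and verify_key_signaling(secret, tag, records, signalings[tag])
    }
```

Data whose signaling fails verification is simply dropped from the candidate set, which is how this sketch models keeping contaminated data out of the final result.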

As an implementation manner, when acquiring target virtual object data corresponding to each target web page object in the web page object set, the electronic device may first read current portrait data stored in a web page object portrait storage node, and count the number of read nodes.

During reading, when the number of read nodes meets a preset node number threshold, the electronic device selects a target portrait storage node cluster from all the read webpage object portrait storage nodes according to a preset node selection strategy. For example, as an implementation manner, the preset node selection strategy may be random selection, or selection of the top-k nodes ranked by the size of the storage space each node occupies.

Next, the electronic device calculates an average portrait value level and a standard portrait value level according to the current portrait data read from each webpage object portrait storage node in the target portrait storage node cluster.

Then, according to the average portrait value level and the standard portrait value level, the electronic device searches the webpage object portrait storage nodes for normal webpage object portrait storage nodes, and takes the current portrait data corresponding to each normal webpage object portrait storage node found as a webpage object portrait.

Next, the electronic device obtains historical webpage object data of each webpage object portrait in a preset data analysis time period from the webpage information storage server, and obtains virtual object data of each webpage object portrait.

Then, the electronic equipment classifies each piece of virtual object data carrying the same portrait type label according to the portrait type label carried by the virtual object data, and target virtual object data corresponding to each target webpage object is obtained.
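The node selection and "normal node" search can be sketched under several assumptions that the patent does not confirm: the node selection strategy is top-k by occupied storage space, the "standard portrait value level" is read as a population standard deviation, and a node counts as normal when its portrait value lies within `max_sigma` standard deviations of the mean. The field names `used_bytes` and `portrait_value` are hypothetical.

```python
import statistics


def select_cluster(nodes: list, k: int) -> list:
    """Pick the k nodes that occupy the most storage space (top-k strategy)."""
    return sorted(nodes, key=lambda n: n["used_bytes"], reverse=True)[:k]


def normal_nodes(cluster: list, max_sigma: float = 2.0) -> list:
    """Return nodes whose portrait value is close to the cluster average."""
    values = [n["portrait_value"] for n in cluster]
    mean = statistics.fmean(values)   # average portrait value level
    std = statistics.pstdev(values)   # assumed reading of the "standard" level
    if std == 0:
        return list(cluster)          # all values identical: every node is normal
    return [
        n for n in cluster
        if abs(n["portrait_value"] - mean) <= max_sigma * std
    ]
```

With small clusters a single outlier inflates the standard deviation considerably, so `max_sigma` would need tuning in practice.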

In addition, in this embodiment, after acquiring the target virtual object data corresponding to each target webpage object, the electronic device may further screen out target verification virtual object data from all the target virtual object data according to the target portrait type label corresponding to each piece of target virtual object data.

The electronic device then calculates a reference portrait average level from the target verification virtual object data.

Next, the electronic device compares the reference portrait average level to a preset reference portrait average level threshold.

In this embodiment, the reference portrait average level may be used to indicate the degree of reliability of the crawled crawler data.
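The verification steps above can be illustrated with the following sketch. The record layout, the function name, and the rule that the data is considered reliable when the reference portrait average level is at or above the threshold are all assumptions; the text only says that the two values are compared.

```python
from statistics import mean
from typing import Iterable, List, Optional, Tuple


def check_reliability(records: List[dict],
                      verification_labels: Iterable[str],
                      threshold: float) -> Tuple[bool, Optional[float]]:
    """Screen verification records, average them, and compare to a threshold."""
    labels = set(verification_labels)
    selected = [r["value"] for r in records if r["portrait_type"] in labels]
    if not selected:
        return False, None  # nothing to verify against
    reference_average = mean(selected)
    # Assumed direction of the comparison: higher means more reliable.
    return reference_average >= threshold, reference_average


records = [
    {"portrait_type": "price", "value": 1.0},
    {"portrait_type": "title", "value": 0.2},
    {"portrait_type": "price", "value": 0.5},
]
print(check_reliability(records, {"price"}, threshold=0.6))  # (True, 0.75)
```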

In addition, as an implementation manner, when acquiring the current webpage content data corresponding to the target webpage information, the electronic device may determine, in a service database and according to the target webpage information, all webpage contents related to a webpage link and the target webpage information as webpage contents to be selected, where the service database includes a plurality of webpage data and the webpage contents corresponding to the plurality of webpage data. The electronic device then acquires the current webpage content data corresponding to the webpage contents to be selected.

Furthermore, as an implementation manner, when obtaining a historical webpage data saving table entry, the electronic device may first receive a first entry parsing policy corresponding to the historical webpage data saving table entry. The first entry parsing policy indicates a parsing address in the historical webpage data saving table entry, that is, the specific values from a first row to a second row and from a first column to a second column of the historical webpage data saving table entry.
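An entry parsing policy of this kind can be pictured as a rectangular slice of a table. The sketch below is illustrative only: the `EntryParsingPolicy` class, its field names, and the choice of zero-based, inclusive row and column bounds are assumptions not fixed by the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EntryParsingPolicy:
    first_row: int
    second_row: int
    first_col: int
    second_col: int

    def apply(self, table: List[List[str]]) -> List[List[str]]:
        """Return the values from first_row..second_row and first_col..second_col."""
        rows = table[self.first_row:self.second_row + 1]
        return [row[self.first_col:self.second_col + 1] for row in rows]


table = [
    ["t0", "u0", "v0"],
    ["t1", "u1", "v1"],
    ["t2", "u2", "v2"],
    ["t3", "u3", "v3"],
]
first_policy = EntryParsingPolicy(first_row=1, second_row=2, first_col=0, second_col=1)
print(first_policy.apply(table))  # [['t1', 'u1'], ['t2', 'u2']]
```

A second policy indicating a different parsing address would simply be another instance with different bounds.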

The electronic device may then determine the table entry content address range indicated by the first entry parsing policy.

The electronic device then traverses the target table entry address range corresponding to the target webpage information.

During the traversal, the electronic device distinguishes three cases. When the table entry content address range included in the traversed target table entry address range has not yet reached the table entry content address range indicated by the first entry parsing policy, the electronic device continues traversing. When it reaches the table entry content address range indicated by the first entry parsing policy, the electronic device takes the traversed target table entry address range as a candidate target table entry address range. When it exceeds the table entry content address range indicated by the first entry parsing policy, the electronic device sorts the traversed target table entry address range according to the order of each table entry address in the included table entry content address range, thereby obtaining a candidate target table entry address range.
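The three-way branch in this traversal can be sketched as below. The model is deliberately simplified and rests on assumptions: addresses are plain integers, the traversal yields them in chunks, and "reaching" or "exceeding" the range indicated by the policy is approximated by comparing the number of traversed addresses with a hypothetical `policy_size`. The text itself does not define how the two ranges are compared.

```python
from typing import Iterable, List, Optional


def build_candidate(chunks: Iterable[List[int]],
                    policy_size: int) -> Optional[List[int]]:
    """Traverse address chunks until the policy's range is reached or exceeded."""
    traversed: List[int] = []
    for chunk in chunks:
        traversed.extend(chunk)
        if len(traversed) < policy_size:
            continue                 # not reached yet: keep traversing
        if len(traversed) == policy_size:
            return traversed         # reached exactly: use the range as traversed
        return sorted(traversed)     # exceeded: sort the addresses first
    return None                      # traversal ended before reaching the range


print(build_candidate([[5, 3], [9]], policy_size=3))     # [5, 3, 9]
print(build_candidate([[5, 3], [9, 1]], policy_size=3))  # [1, 3, 5, 9]
print(build_candidate([[5]], policy_size=3))             # None
```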

Next, the electronic device determines an address matching degree between the candidate target table entry address range and a preset address search range.

Then, when the address matching degree reaches a set matching degree threshold, the electronic device takes the target table entry address range as the table entry address range to be searched.
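The text does not say how the address matching degree is computed. As one possible reading, the sketch below treats both ranges as inclusive integer intervals and defines the matching degree as the fraction of the candidate range that falls inside the preset search range. The function names and this overlap-ratio definition are assumptions.

```python
from typing import Optional, Tuple

AddressRange = Tuple[int, int]  # (start, end), both inclusive


def matching_degree(candidate: AddressRange, search: AddressRange) -> float:
    """Fraction of the candidate range covered by the search range."""
    overlap = min(candidate[1], search[1]) - max(candidate[0], search[0]) + 1
    overlap = max(overlap, 0)  # disjoint ranges have no overlap
    return overlap / (candidate[1] - candidate[0] + 1)


def range_to_search(candidate: AddressRange, search: AddressRange,
                    threshold: float) -> Optional[AddressRange]:
    """Return the candidate range when its matching degree reaches the threshold."""
    if matching_degree(candidate, search) >= threshold:
        return candidate
    return None


print(matching_degree((100, 199), (150, 300)))       # 0.5
print(range_to_search((100, 199), (150, 300), 0.4))  # (100, 199)
print(range_to_search((100, 199), (500, 600), 0.4))  # None
```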

The electronic device then searches the webpage information storage server for the webpage content data time distribution label corresponding to the target table entry address range.

The electronic device then matches a second entry parsing policy corresponding to the historical webpage data storage table entry according to the time distribution characteristics of the webpage content data time distribution label. The second entry parsing policy and the first entry parsing policy indicate different parsing addresses.

Next, the electronic device searches for a target history information search address range corresponding to the second entry parsing policy from the web page information storage server.

The electronic device then retrieves, from the webpage information storage server, the webpage content data corresponding to the target historical information search address range, thereby obtaining the historical webpage content data.

In addition, as another embodiment of the present invention, in the process of executing step S302 to obtain the target crawler data, when webpage content corresponding to the target webpage information is matched in a service database, the electronic device determines the corresponding webpage content as the webpage content to be selected, where the service database includes a plurality of webpage data and the webpage contents corresponding to the plurality of webpage data. The electronic device then acquires the current webpage content data corresponding to the target webpage information, and determines the webpage content data in the current webpage content data that corresponds to the crawling object set as the target crawler data corresponding to the webpage object set.
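This alternative flow reduces to a lookup followed by a filter. The sketch below assumes, purely for illustration, that the service database is a dictionary keyed by page URL and that each page's content is a dictionary keyed by object name; the function name and sample data are hypothetical.

```python
from typing import Dict, Iterable


def get_target_crawler_data(service_db: Dict[str, Dict[str, str]],
                            target_page: str,
                            crawl_objects: Iterable[str]) -> Dict[str, str]:
    """Match the page in the service database, then keep only the crawl objects."""
    current_content = service_db.get(target_page)
    if current_content is None:
        return {}  # no webpage content matched for the target webpage information
    wanted = set(crawl_objects)
    return {name: value for name, value in current_content.items() if name in wanted}


service_db = {
    "https://example.com/item/1": {"title": "Lamp", "price": "19.90", "ads": "..."},
}
result = get_target_crawler_data(service_db, "https://example.com/item/1",
                                 ["title", "price"])
print(result)  # {'title': 'Lamp', 'price': '19.90'}
```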

In addition, based on the same inventive concept as the above-mentioned web crawler-based data processing method provided by the present invention, the present invention further provides a web crawler-based data processing system 500 as shown in fig. 3, where the data processing system 500 includes a processing module 510 and a saving module 520.

The processing module 510 is configured to obtain a web crawler instruction input by a user, where the web crawler instruction includes target web page information and a crawl object set, and the crawl object set is used to indicate a web page object set to be crawled in the target web page information;

the processing module 510 is further configured to obtain target crawler data corresponding to the target webpage information and the crawl object set;

a saving module 520, configured to save the target crawler data to a target distributed saving node, where the target distributed saving node is a saving node corresponding to the web object set in the distributed storage system.
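The division of work between the two modules can be sketched as follows. Everything in the example is an assumption made for illustration: the class names, the in-memory dictionaries standing in for distributed saving nodes, the injected `fetch` callable that stands in for the actual crawling logic, and in particular the rule that a stable hash of the object names selects the saving node. The text only states that the target node corresponds to the web object set.

```python
import hashlib
from typing import Callable, Dict, FrozenSet, Iterable


class SavingModule:
    """Routes each web object set to one in-memory 'node' (a dict)."""

    def __init__(self, node_names: Iterable[str]):
        self.nodes: Dict[str, Dict[FrozenSet[str], dict]] = {n: {} for n in node_names}

    def node_for(self, object_set: Iterable[str]) -> str:
        # Assumed mapping: a stable hash of the sorted object names picks the node.
        key = ",".join(sorted(object_set)).encode("utf-8")
        index = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        names = sorted(self.nodes)
        return names[index % len(names)]

    def save(self, object_set: Iterable[str], data: dict) -> str:
        objects = frozenset(object_set)
        node = self.node_for(objects)
        self.nodes[node][objects] = data
        return node


class ProcessingModule:
    """Turns a web crawler instruction into target crawler data."""

    def __init__(self, fetch: Callable[[str, FrozenSet[str]], dict]):
        self.fetch = fetch  # hypothetical stand-in for the crawling step

    def handle(self, instruction: dict) -> dict:
        return self.fetch(instruction["target_page"],
                          frozenset(instruction["crawl_objects"]))


def fake_fetch(page: str, objects: FrozenSet[str]) -> dict:
    return {name: f"{name}@{page}" for name in objects}


processing = ProcessingModule(fake_fetch)
saving = SavingModule(["node-0", "node-1", "node-2"])
instruction = {"target_page": "https://example.com/item/1",
               "crawl_objects": ["title", "price"]}
data = processing.handle(instruction)
node = saving.save(instruction["crawl_objects"], data)
```

Because the node is derived from the object set alone, repeated crawls of the same object set land on the same node, which is one simple way to realize the correspondence described above.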

Optionally, as an implementation manner, when acquiring target crawler data corresponding to the target webpage information and the set of crawled objects, the processing module 510 is specifically configured to:

Optionally, as an implementation manner, when the saving module 520 saves the target crawler data into the target distributed saving node, it is specifically configured to:

Optionally, as an implementation manner, when the target crawler data corresponding to the web object set is matched in the historical web content data, the processing module 510 is specifically configured to:

Optionally, as an implementation manner, when acquiring target virtual object data corresponding to each target web page object in the web page object set, the processing module 510 is specifically configured to:

Optionally, as an embodiment, after acquiring the target virtual object data corresponding to each target web page object, the processing module 510 is further configured to:

Optionally, as an implementation manner, when acquiring the current webpage content data corresponding to the target webpage information, the processing module 510 is specifically configured to:

Optionally, as an implementation manner, when acquiring the historical webpage content data corresponding to the target webpage information, the processing module 510 is specifically configured to:

obtaining a historical webpage data storage table entry;

when the webpage content corresponding to the target webpage information is matched in a service database, determining the corresponding webpage content as the webpage content to be selected, wherein the service database comprises a plurality of webpage data and the webpage content corresponding to the webpage data;

acquiring current webpage content data corresponding to the target webpage information;

and determining the webpage content data corresponding to the crawling object set in the current webpage content data as target crawler data corresponding to the webpage object set.

In addition, based on the same inventive concept as the above-mentioned web crawler-based data processing method provided by the present invention, the present invention also provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the above-mentioned web crawler-based data processing method.

In addition, based on the same inventive concept as the above-mentioned web crawler-based data processing method provided by the present invention, the present invention also provides a cloud platform on which a computer program is stored, which, when executed by a processor, implements the above-mentioned web crawler-based data processing method.

In the embodiments provided by the present invention, it should be understood that the disclosed system and method can be implemented in other ways. The system embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to some embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s).

It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.

It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, the functional modules in some embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the method according to some embodiments of the present invention. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disk.

The above description covers only some embodiments of the present invention and is not intended to limit the present invention; it is obvious to those skilled in the art that various modifications and variations can be made to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims

1. A data processing method based on web crawlers is characterized by comprising the following steps:

2. The method of claim 1, wherein the obtaining target crawler data corresponding to the target web page information and the set of crawled objects comprises:

3. The method of claim 2, wherein saving the target crawler data to a target distributed saving node comprises:

4. The method of claim 2, wherein matching the target crawler data corresponding to the set of web page objects in the historical web page content data comprises:

5. The method of claim 4, wherein the obtaining target virtual object data corresponding to each target web page object in the set of web page objects comprises:

6. The method according to claim 4 or 5, wherein after said obtaining target virtual object data corresponding to respective target web page objects, the method further comprises:

7. The method of claim 2, wherein the obtaining current web content data corresponding to the target web page information comprises:

8. The method of claim 2, wherein the obtaining historical web content data corresponding to the target web page information comprises:

obtaining a historical webpage data storage table entry;

9. A web crawler-based data processing system, the system comprising:

10. A cloud platform, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, implements the web crawler-based data processing method according to any one of claims 1 to 7.