CN110737814A - Crawling method and device for website data, electronic equipment and storage medium - Google Patents

Info

Publication number
CN110737814A
Authority
CN
China
Prior art keywords
crawling
task
data
program
terminal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911000083.3A
Other languages
Chinese (zh)
Inventor
何海生
张龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Isoftstone Information Technology Co Ltd
Original Assignee
Isoftstone Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Isoftstone Information Technology Co Ltd
Priority to CN201911000083.3A
Publication of CN110737814A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a crawling method and device for website data, electronic equipment and a storage medium. The method is applied to any node terminal in a terminal cluster and comprises: receiving a data crawling instruction sent by a main control terminal in the terminal cluster, and starting a data crawling program according to the data crawling instruction; and cyclically reading a crawling task in an unprocessed state from a task queue through the data crawling program, and crawling page data of the corresponding website according to the currently read crawling task, until no crawling task in the unprocessed state remains in the task queue. By constructing a crawling framework comprising the terminal cluster and the task queue, at least one node terminal selected by the main control terminal in the terminal cluster can cyclically read crawling tasks in the unprocessed state from the task queue, so that efficient and convenient website data crawling is realized.

Description

Crawling method and device for website data, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to computer technology, and in particular to a crawling method and device for website data, electronic equipment and a storage medium.
Background
A vertical website can be understood as a website that provides in-depth information and related services for a specific field or a specific need, such as a judicial website, an educational website, a financial website, an entertainment website or a shopping website.
Currently, vertical websites typically provide paginated structured data. Conventional methods for crawling website data generally use the scrapy crawler framework or the webmagic crawler framework running on a local host.
The shortcomings of the existing methods at least include the following: although the scrapy and webmagic crawler frameworks support multi-threaded crawling, their crawling efficiency is still limited, so a more efficient and convenient website data crawling method is needed.
Disclosure of Invention
The embodiment of the invention provides a website data crawling method and device, electronic equipment and a storage medium, so as to realize efficient and convenient website data crawling.
In a first aspect, an embodiment of the present invention provides a crawling method for website data, which is applied to any node terminal in a terminal cluster, and includes:
receiving a data crawling instruction sent by a main control terminal in a terminal cluster, and starting a data crawling program according to the data crawling instruction;
and circularly reading the crawling task in the unprocessed state from the task queue through the data crawling program, and crawling the page data of the corresponding website according to the currently read crawling task until the crawling task in the unprocessed state in the task queue is empty.
In a second aspect, an embodiment of the present invention further provides a crawling apparatus for website data, configured in any node terminal in a terminal cluster, including:
the program starting module is used for receiving a data crawling instruction sent by a main control terminal in a terminal cluster and starting a data crawling program according to the data crawling instruction;
and the data crawling module is used for circularly reading the crawling task in the unprocessed state from the task queue through the data crawling program, and crawling the page data of the corresponding website according to the currently read crawling task until the crawling task in the unprocessed state in the task queue is empty.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the method for crawling website data provided in any embodiment of the present application.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the method for crawling website data provided in any embodiment of the present application.
According to the crawling method and device for website data, the electronic equipment and the storage medium, a main control terminal in a terminal cluster selects a node terminal capable of executing a crawling task and sends a data crawling instruction to the selected node terminal. The node terminal receiving the data crawling instruction starts a data crawling program, cyclically reads a crawling task in an unprocessed state from a task queue through the data crawling program, and crawls page data of the corresponding website according to the currently read crawling task, until no crawling task in the unprocessed state remains in the task queue.
By constructing a crawling framework comprising the terminal cluster and the task queue, at least one node terminal selected by the main control terminal in the terminal cluster can cyclically read crawling tasks in the unprocessed state from the task queue, so that efficient and convenient website data crawling is realized.
Drawings
FIG. 1 is a flowchart illustrating a crawling method for website data according to a first embodiment of the present invention;
FIG. 2a is a schematic structural diagram of a crawler framework in the method for crawling website data according to the first embodiment of the present invention;
FIG. 2b is a schematic diagram of a website interface in the method for crawling website data according to the first embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a crawling apparatus for website data according to a second embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer and more obvious, the technical solutions of the present invention will be described in detail below through embodiments with reference to the attached drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention.
Example one
Fig. 1 is a flowchart illustrating a method for crawling website data according to a first embodiment of the present invention. The present embodiment is applicable to crawling website data, for example, crawling paginated structured data in a vertical website.
Referring to fig. 1, the method is applied to any node terminal in a terminal cluster, and specifically includes the following steps:
S110, receiving a data crawling instruction sent by a main control terminal in the terminal cluster, and starting a data crawling program according to the data crawling instruction.
The terminal cluster can be composed of a main control terminal and at least one node terminal. The main control terminal can deploy the data crawling program to each node terminal in advance, or can deploy it to a node terminal when sending the data crawling instruction. When the terminal cluster needs to execute a website data crawling task, the main control terminal can select the node terminals capable of executing the crawling task according to the resource usage of each node terminal (such as CPU occupancy rate, remaining operating memory and/or available memory), in combination with a preset scheduling algorithm, and distribute the data crawling instruction to the selected node terminals. Any node terminal in the terminal cluster starts the data crawling program when it receives the data crawling instruction sent by the main control terminal.
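The selection step described above can be sketched as follows. This is a minimal illustration only: the `NodeStatus` fields, the threshold values and the least-loaded-first ordering are assumptions made for the example, since the text does not specify the preset scheduling algorithm.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    """Resource snapshot reported by one node terminal (hypothetical fields)."""
    name: str
    cpu_usage: float      # fraction of CPU in use, 0.0 to 1.0
    free_memory_mb: int   # remaining operating memory in MB


def select_nodes(nodes, max_cpu=0.8, min_free_mb=512, limit=None):
    """Return the nodes able to take a crawling task, least loaded first.

    The thresholds and the sort order are illustrative assumptions,
    not the scheduling algorithm of the source document.
    """
    eligible = [
        n for n in nodes
        if n.cpu_usage <= max_cpu and n.free_memory_mb >= min_free_mb
    ]
    eligible.sort(key=lambda n: (n.cpu_usage, -n.free_memory_mb))
    return eligible if limit is None else eligible[:limit]


nodes = [
    NodeStatus("node-a", 0.35, 4096),
    NodeStatus("node-b", 0.95, 8192),  # rejected: CPU too busy
    NodeStatus("node-c", 0.10, 256),   # rejected: too little memory
    NodeStatus("node-d", 0.20, 2048),
]
selected = select_nodes(nodes)
print([n.name for n in selected])
```

In a real deployment the main control terminal would then send the data crawling instruction to each selected node.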
Optionally, the terminal cluster is a distributed cluster constructed based on a Spark architecture.
The existing Spark architecture comprises a master control terminal (master), a master node (driver) for distributing tasks and a work node (worker) for executing calculation, wherein the master control terminal controls and starts a relevant program of the master node, and the master node distributes the tasks to the work nodes and collects feedback results of the work nodes. In the distributed cluster constructed based on the Spark architecture provided by the embodiment of the invention, the interaction steps between the main control terminal and the main node can be simplified, and the main control terminal is utilized to directly control the node terminal to execute the website data crawling task, so that the distributed cluster management is realized.
The distributed cluster constructed based on the Spark architecture is compatible with both Scala programs and Java programs. For a data crawling program (such as a crawler program), however, the scripting-style syntax of Scala and its strong string and collection processing capabilities make it better suited than pure Java, so the data crawling program is preferably written in Scala.
The Spark architecture can support large-scale (for example, thousands of nodes) clusters, and nearly linear efficiency improvement can be obtained when nodes are added in the clusters, so that large-scale webpage data crawling requirements can be met.
S120, cyclically reading a crawling task in the unprocessed state from the task queue through the data crawling program, and crawling the page data of the corresponding website according to the currently read crawling task, until no crawling task in the unprocessed state remains in the task queue.
The processing states of the crawling tasks in the task queue can comprise an unprocessed state, a processing state, a processing success state and a processing failure state. When a crawling task has not been read by any node terminal, its processing state is the unprocessed state; when the crawling task has been read by a node terminal and no crawling result has been fed back yet, its processing state is the processing state; when the crawling task has been read by a node terminal and the fed-back crawling result is success, its processing state is the processing success state; when the crawling task has been read by a node terminal and the fed-back crawling result is failure (for example, the crawling timed out), its processing state is the processing failure state.
Any node terminal that receives the data crawling instruction can cyclically read crawling tasks in the unprocessed state from the task queue and execute them, and stops reading once no crawling task in the unprocessed state remains in the task queue. Crawling the page data of the corresponding website according to the currently read crawling task can be realized by accessing the corresponding website according to the currently read crawling task and grabbing page data from the web page content of the website according to a predefined rule. The page data can be structured data comprising a plurality of fields, and the predefined rule can specify one or more items of required page data.
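The read-and-crawl loop of S120 can be sketched as below. This is a simplified, single-process illustration under stated assumptions: the task queue is an in-memory `dict` mapping task keys to states, and `fake_fetch` is a hypothetical stand-in for the real page crawling logic.

```python
from enum import Enum


class Status(Enum):
    UNPROCESSED = "unprocessed"
    PROCESSING = "processing"
    SUCCESS = "success"
    FAILED = "failed"


def next_unprocessed(queue):
    """Return the key of one task still in the unprocessed state, or None."""
    for key, status in queue.items():
        if status is Status.UNPROCESSED:
            return key
    return None


def run_data_crawler(queue, fetch_page):
    """Loop until no unprocessed task remains, recording each outcome."""
    results = {}
    while (key := next_unprocessed(queue)) is not None:
        queue[key] = Status.PROCESSING          # read, no result fed back yet
        try:
            results[key] = fetch_page(key)
            queue[key] = Status.SUCCESS
        except Exception:                       # e.g. the crawl timed out
            queue[key] = Status.FAILED
    return results


def fake_fetch(key):
    """Hypothetical fetcher: pretends that one task times out."""
    if key == "task-2":
        raise TimeoutError("crawl timed out")
    return {"source": key}


task_queue = {f"task-{i}": Status.UNPROCESSED for i in (1, 2, 3)}
pages = run_data_crawler(task_queue, fake_fetch)
print(sorted(pages), task_queue["task-2"].value)
```

The loop terminates because every task that is read leaves the unprocessed state, whether the crawl succeeds or fails.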
Optionally, crawling the page data of the corresponding website according to the currently read crawling task includes: and when the crawling of the page data of the corresponding website according to the currently read crawling task fails, performing crawling retry.
When crawling the page data of the corresponding website according to the currently read crawling task fails, the processing state of the crawling task can be updated to the processing failure state. The node terminal can also cyclically read crawling tasks in the processing failure state from the task queue and crawl the page data of the corresponding website according to the currently read crawling task, so that crawling is retried and the crawling success rate is improved to a certain degree.
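A crawling retry of this kind might look like the following sketch. The `max_rounds` limit and the `flaky_fetch` function are assumptions introduced for illustration; the text does not state how many retries are made.

```python
def crawl_with_retry(queue, fetch_page, max_rounds=3):
    """Re-read unprocessed and failed tasks for up to max_rounds passes.

    max_rounds is an illustrative bound, not a value from the source.
    """
    results = {}
    for _ in range(max_rounds):
        pending = [k for k, s in queue.items() if s in ("unprocessed", "failed")]
        if not pending:
            break
        for key in pending:
            try:
                results[key] = fetch_page(key)
                queue[key] = "success"
            except Exception:
                queue[key] = "failed"   # stays eligible for the next pass
    return results


attempts = {}


def flaky_fetch(key):
    """Hypothetical fetcher: 'link-b' fails on its first attempt only."""
    attempts[key] = attempts.get(key, 0) + 1
    if key == "link-b" and attempts[key] < 2:
        raise TimeoutError("crawl timed out")
    return f"page data for {key}"


queue = {"link-a": "unprocessed", "link-b": "unprocessed"}
results = crawl_with_retry(queue, flaky_fetch)
print(queue, attempts)
```

Bounding the number of passes keeps a permanently failing link from blocking the node terminal forever.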
Optionally, crawling the page data of the corresponding website according to the currently read crawling task includes: and calling a functional plug-in according to the currently read crawling task, and crawling page data of the corresponding website based on the functional plug-in.
When the corresponding website is accessed according to the currently read crawling task, the page of the website may be a dynamic page (which can be understood as a page whose frame is constant but whose content changes), and the page may also require operations such as login or verification. When the page of the website is identified as a dynamic page, or operations such as dynamic proxy or verification are required, the data crawling program can call the corresponding functional plug-in (plug-in program), so that the page data of the website can be crawled successfully with the help of the functional plug-in. The functional plug-in can be one that the main control terminal has pre-deployed to the node terminal, or one that the node terminal actively requests from the main control terminal after accessing the website and determining which plug-in is needed. The node terminal can flexibly call the relevant functional plug-ins through the data crawling program, which helps ensure that the crawling task executes smoothly.
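One possible way to organize such functional plug-ins is a simple registry, sketched below. The plug-in names (`login`, `dynamic_page`), the `requires` field of a task and the `prepare_task` helper are all hypothetical; the text does not define a plug-in interface.

```python
PLUGINS = {}


def plugin(name):
    """Register a function as the functional plug-in with the given name."""
    def register(func):
        PLUGINS[name] = func
        return func
    return register


@plugin("login")
def login_plugin(task):
    # Stand-in for a real login flow; only marks the task as logged in.
    return {**task, "session": "logged-in"}


@plugin("dynamic_page")
def dynamic_page_plugin(task):
    # Stand-in for rendering a dynamic page before extraction.
    return {**task, "rendered": True}


def prepare_task(task):
    """Apply every plug-in the task asks for, in order."""
    for name in task.get("requires", []):
        if name not in PLUGINS:
            # Here a node terminal could request the plug-in from the
            # main control terminal instead of failing.
            raise LookupError(f"plug-in {name!r} is not deployed on this node")
        task = PLUGINS[name](task)
    return task


prepared = prepare_task(
    {"url": "https://example.com/list", "requires": ["login", "dynamic_page"]}
)
print(prepared)
```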
Optionally, the task queue is a distributed key-value queue constructed based on an Elasticsearch architecture, where the key uniquely identifies the crawling task, and the value includes the processing state of the crawling task.
In view of the above, in the embodiment of the present invention, the crawling task may be stored in key-value (KV) form so that operations such as insertion are idempotent, thereby overcoming the above problems.
Elasticsearch is a Lucene-based, distributed, multi-tenant-capable full-text search engine. A distributed key-value queue can be constructed on top of the distributed storage of Elasticsearch; such a queue has a large storage capacity and high read and write speeds, which makes large-scale web page data crawling more feasible.
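The idempotency of key-value insertion can be illustrated with a plain in-memory dictionary. This sketch does not use Elasticsearch itself; `put_task` is a hypothetical helper, and the value -1 for the unprocessed state follows the tables later in this document.

```python
UNPROCESSED = -1


def put_task(queue, key):
    """Insert a crawling task keyed by `key`.

    Returns True if the task is new. Inserting the same key again is a
    no-op, so repeated insertion neither duplicates the task nor resets
    its processing state.
    """
    if key in queue:
        return False
    queue[key] = UNPROCESSED
    return True


queue = {}
put_task(queue, "xx group Ltd")
queue["xx group Ltd"] = 1                   # pretend it was crawled successfully
inserted_again = put_task(queue, "xx group Ltd")
print(inserted_again, queue)
```

Because the key uniquely identifies the task, deduplication falls out of the storage model rather than requiring a separate check.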
Further, the step of generating the task queue includes: receiving a task crawling instruction sent by the main control terminal in the terminal cluster, and starting a task crawling program according to the task crawling instruction; and crawling task links of the corresponding website through the task crawling program, and sending the task links to the task queue as keys of crawling tasks.
To generate the task queue, the main control terminal in the cluster can select the node terminals capable of generating the queue according to the resource usage of each node terminal (such as CPU occupancy rate, remaining operating memory and/or available memory), in combination with a preset scheduling algorithm, and distribute the task crawling instruction to the selected node terminals.
When the node terminal starts the task crawling program, it can access the website corresponding to the task to be crawled, obtain the task links in the web page after the web page is opened, and send each obtained task link to the task queue as the key of a crawling task, while setting the processing state corresponding to the task link to the unprocessed state. Opening the web page can involve opening several layers of web pages, that is, the task crawling program can support deep-level web page crawling.
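The task crawling step can be sketched with the Python standard library as follows. The sample HTML, the base URL and the `enqueue_task_links` helper are assumptions for illustration; the real task crawling program would fetch live pages and may follow several layers of links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href=...> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def enqueue_task_links(html, base_url, queue):
    """Store each task link as a key with the unprocessed state (-1)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    for link in collector.links:
        queue.setdefault(link, -1)   # existing keys keep their state
    return collector.links


# Hypothetical result page with one duplicated "[View]" link.
sample = (
    '<ul><li><a href="/detail/1">[View]</a></li>'
    '<li><a href="/detail/2">[View]</a></li>'
    '<li><a href="/detail/1">[View]</a></li></ul>'
)
queue = {}
found = enqueue_task_links(sample, "https://example.com/search", queue)
print(len(found), queue)
```

Three links are found but only two tasks are stored, because the link itself is the key.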
The node terminal crawls a task link through a task crawling program and stores the task link into a task queue to generate a task queue; and a page corresponding to the task link in the task queue can be accessed through the data crawling program, and required page data can be crawled. The time sequence of the two crawling operations (namely crawling the task link and crawling the webpage data) is not strictly limited, and the crawling operation can be carried out synchronously or sequentially. By crawling the task links and the webpage data respectively, the crawling framework can be clearer, and the crawling efficiency of the webpage data is improved.
Optionally, after crawling the page data of the corresponding website according to the currently read crawling task, the method further includes: sending the page data to a database, wherein the database is a distributed database constructed based on an Elasticsearch architecture.
Further, by means of the distributed full-text search function of Elasticsearch, the stored page data can be retrieved; for example, TOP-K aggregation retrieval can be performed on specified fields, which facilitates query and analysis of the acquired data.
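What a TOP-K aggregation over a specified field computes can be shown with an in-memory equivalent. The records and the field values below are made-up placeholders, and `top_k` is a hypothetical helper; in Elasticsearch the same kind of result is typically obtained with a terms aggregation on the field.

```python
from collections import Counter


def top_k(records, field, k):
    """Return the k most frequent values of `field` with their counts."""
    counts = Counter(r[field] for r in records if field in r)
    return counts.most_common(k)


# Placeholder page data; "C1" mirrors the column naming used in Table 3.
records = [
    {"C1": "a"}, {"C1": "b"}, {"C1": "a"},
    {"C1": "c"}, {"C1": "a"}, {"C1": "b"},
]
print(top_k(records, "C1", 2))
```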
For example, fig. 2a is a schematic structural diagram of a crawler framework in the method for crawling website data provided by the first embodiment of the present invention. Referring to fig. 2a, a terminal cluster 210 may include a main control terminal 211 and at least one node terminal 212, and the terminal cluster may be regarded as a crawler cluster. When the terminal cluster needs to execute a website data crawling task, the main control terminal in the terminal cluster selects at least one node terminal to execute the crawling task according to resource usage, and each selected node terminal starts a data crawling program according to the data crawling instruction sent by the main control terminal. The node terminal cyclically reads crawling tasks in the unprocessed state from a distributed task queue 220 through the data crawling program and crawls the page data of the corresponding website from the internet according to the currently read crawling task, until no crawling task in the unprocessed state remains in the task queue. The node terminal sends the page data to a distributed database 230 for storage.
For example, fig. 2b is a schematic diagram of a website interface in the method for crawling website data according to the first embodiment of the present invention. Referring to fig. 2b, the website interface may be a search interface. After factor parameters are entered in the query factor panel of the page and the query button is clicked, a paginated list of query results can be viewed in the query result panel, and clicking a "[View]" link opens the detailed information of the page data.
Assuming that factor 1 of the query factors is a company name, a txt file containing a large number of company names can be imported into an Elasticsearch database in sequence, generating Table 1 as follows:
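The interplay of node terminals 212, task queue 220 and database 230 can be imitated in one process with threads. This is only a toy model under stated assumptions: threads stand in for node terminals, a lock-protected dictionary stands in for the distributed queue, and the `TaskQueue` class and `node_terminal` function are hypothetical names.

```python
import threading

UNPROCESSED, PROCESSING, SUCCESS = "unprocessed", "processing", "success"


class TaskQueue:
    """Lock-protected stand-in for the distributed task queue 220."""

    def __init__(self, keys):
        self._status = {k: UNPROCESSED for k in keys}
        self._lock = threading.Lock()

    def claim(self):
        """Atomically take one unprocessed task, or return None."""
        with self._lock:
            for key, status in self._status.items():
                if status == UNPROCESSED:
                    self._status[key] = PROCESSING
                    return key
        return None

    def mark(self, key, status):
        with self._lock:
            self._status[key] = status

    def statuses(self):
        with self._lock:
            return dict(self._status)


def node_terminal(name, queue, database, db_lock):
    """Drain the queue; each claimed task is crawled by exactly one node."""
    while (key := queue.claim()) is not None:
        page_data = {"crawled_by": name}    # stand-in for real crawling
        with db_lock:
            database[key] = page_data       # stand-in for database 230
        queue.mark(key, SUCCESS)


queue = TaskQueue([f"link-{i}" for i in range(20)])
database, db_lock = {}, threading.Lock()
workers = [
    threading.Thread(target=node_terminal,
                     args=(f"node-{n}", queue, database, db_lock))
    for n in range(3)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(database), set(queue.statuses().values()))
```

Claiming a task and changing its state in one atomic step is what keeps two nodes from crawling the same link.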
TABLE 1
_key                              _status
xx Bank stocks Co Ltd             -1
xx Enterprise resources Co Ltd    -1
xx science and technology Ltd     -1
xx group Ltd                      -1
...                               -1
xx technical Co Ltd               -1
Referring to Table 1, since a company name is unique, the company name can be stored in KV form in the key field (i.e., the _key column of Table 1), so that duplicates are removed while the company names are being entered.
After each company name is queried, if the query result is not empty, the "[View]" links in each page of the paginated results are crawled, and a task queue list is generated as shown in Table 2:
TABLE 2
_key                                  _status
xx Bank stocks Co Ltd#[View]1         -1
xx Bank stocks Co Ltd#[View]2         -1
...                                   -1
xx Bank stocks Co Ltd#[View]n         -1
...                                   -1
xx technical products Ltd#[View]1     -1
xx technical products Ltd#[View]2     -1
...                                   -1
xx technical products Ltd#[View]n     -1
In addition, the crawling state of the page data corresponding to each link can be recorded in the value field (i.e., the _status column in Table 2). For example, the un-queried state can be represented by the value -1, the crawling-in-progress state by the value 2, the crawling success state by the value 1, and the crawling failure state by the value 0. At least one node terminal in the terminal cluster can access the links whose crawling state is the un-queried state or the crawling failure state, until the page data corresponding to all links has been crawled.
According to the successfully crawled page data, a page data list can be generated as shown in table 3:
TABLE 3
_key                                  C1    ...   Cn
xx Bank stocks Co Ltd#[View]1         xx    xx    xx
xx Bank stocks Co Ltd#[View]2         xx    xx    xx
...                                   xx    xx    xx
xx Bank stocks Co Ltd#[View]n         xx    xx    xx
...                                   xx    xx    xx
xx technical products Ltd#[View]1     xx    xx    xx
xx technical products Ltd#[View]2     xx    xx    xx
...                                   xx    xx    xx
xx technical products Ltd#[View]n     xx    xx    xx
The key values in the key field of Table 3 (i.e., the _key column in Table 3) may be the same as those in Table 2 or different from them; for example, a key value may be a concatenated string of any two or more of the acquired page data fields (i.e., the C1 to Cn columns in Table 3). Furthermore, the key values in the key fields of Table 1 and Table 2 are not limited to the contents shown in the tables, as long as they serve as unique identifiers.
According to the website data crawling method provided by this embodiment, a main control terminal in a terminal cluster selects a node terminal capable of executing a crawling task and sends a data crawling instruction to the selected node terminal. The node terminal receiving the data crawling instruction starts a data crawling program, cyclically reads a crawling task in the unprocessed state from a task queue through the data crawling program, and crawls page data of the corresponding website according to the currently read crawling task, until no crawling task in the unprocessed state remains in the task queue. By constructing a crawling framework comprising the terminal cluster and the task queue, at least one node terminal selected by the main control terminal in the terminal cluster can cyclically read crawling tasks in the unprocessed state from the task queue, so that efficient and convenient website data crawling is realized.
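Building a key from two or more page data fields, as mentioned for Table 3, might be sketched as follows. The `#` separator mirrors the keys shown in Table 2; the `make_key` helper and the field values are hypothetical placeholders.

```python
def make_key(record, fields, sep="#"):
    """Concatenate the chosen fields of a record into one key string."""
    return sep.join(str(record[f]) for f in fields)


# Placeholder records using the C1..Cn column naming of Table 3.
rows = [
    {"C1": "xx group Ltd", "C2": "item-1", "C3": "detail"},
    {"C1": "xx group Ltd", "C2": "item-2", "C3": "detail"},
    {"C1": "xx group Ltd", "C2": "item-1", "C3": "detail"},  # repeat of row 1
]

table = {}
for row in rows:
    table[make_key(row, ["C1", "C2"])] = row   # same key overwrites, no duplicate

print(list(table))
```

The chosen fields must together identify a record uniquely; otherwise distinct records would overwrite each other.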
Example two
Fig. 3 is a schematic structural diagram of a crawling apparatus for website data according to a second embodiment of the present invention. The crawling apparatus is configured in any node terminal in a terminal cluster, and the crawling method for website data according to any embodiment of the present invention can be implemented by using the crawling apparatus.
Referring to fig. 3, a website data crawling apparatus includes:
the program starting module 310 is configured to receive a data crawling instruction sent by a main control terminal in a terminal cluster, and start a data crawling program according to the data crawling instruction;
and the data crawling module 320 is used for circularly reading the crawling task in the unprocessed state from the task queue through a data crawling program, and crawling the page data of the corresponding website according to the currently read crawling task until the crawling task in the unprocessed state in the task queue is empty.
Optionally, the data crawling module 320 is specifically configured to: and when the crawling of the page data of the corresponding website according to the currently read crawling task fails, performing crawling retry.
Optionally, the data crawling module 320 is further specifically configured to: and calling a functional plug-in according to the currently read crawling task, and crawling page data of the corresponding website based on the functional plug-in.
Optionally, the terminal cluster is a distributed cluster constructed based on a Spark architecture.
Optionally, the task queue is a distributed key-value queue constructed based on an Elasticsearch architecture, where the key uniquely identifies the crawling task, and the value includes the processing state of the crawling task.
Optionally, the website data crawling apparatus further includes:
the task queue generating module is used for receiving a task crawling instruction sent by a main control terminal in the terminal cluster and starting a task crawling program according to the task crawling instruction; and crawling the task links of the corresponding websites through the task crawling program, and sending the task links to the task queue as keys of the crawling tasks.
Optionally, the website data crawling apparatus further includes:
and the data sending module is used for sending the page data to a database, wherein the database is a distributed database constructed based on an Elasticsearch architecture.
The apparatus for crawling website data according to the embodiment of the present invention can perform the method for crawling website data according to any embodiment of the present invention, and has the corresponding functional modules and beneficial effects.
Example three
Fig. 4 is a schematic structural diagram of an electronic device provided by the third embodiment of the present invention. Fig. 4 shows a block diagram of an exemplary electronic device 12 suitable for implementing embodiments of the present invention. The electronic device 12 shown in fig. 4 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention. The device 12 is typically an electronic device that undertakes the function of crawling website data.
As shown in FIG. 4, electronic device 12 takes the form of a general-purpose computing device. The components of electronic device 12 may include, but are not limited to, one or more processors or processing units 16, memory 28, and bus 18 that connects the various components (including memory 28 and processing unit 16).
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
Electronic device 12 typically includes a variety of computer-readable media. Such media may be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer-readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. Electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in fig. 4, and commonly referred to as a "hard disk drive"). Although not shown in fig. 4, a magnetic disk drive may be provided for reading from and writing to a removable nonvolatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive may be provided for reading from and writing to a removable nonvolatile optical disk (e.g., a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc Read-Only Memory (DVD-ROM), or other optical media). In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having program modules 42 configured to carry out the functions of embodiments of the present invention.
A program/utility 40 having a set of program modules 42 may be stored, for example, in memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may comprise an implementation of a network environment.
Electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, camera, etc.), may also include display 24, may also communicate with one or more devices that enable a user to interact with electronic device 12, and/or may communicate with any device (e.g., network card, modem, etc.) that enables electronic device 12 to communicate with one or more other computing devices.
The processor 16 executes various functional applications and data processing by executing programs stored in the memory 28, for example, implementing a method for crawling website data provided by the above-described embodiment of the present invention, the method including:
receiving a data crawling instruction sent by a main control terminal in a terminal cluster, and starting a data crawling program according to the data crawling instruction; and circularly reading the crawling task in the unprocessed state from the task queue through a data crawling program, and crawling the page data of the corresponding website according to the currently read crawling task until the crawling task in the unprocessed state in the task queue is empty.
Of course, those skilled in the art will appreciate that the processor may also implement the technical solution of the website data crawling method provided by any embodiment of the present invention.
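As a minimal sketch of the loop described above, the following Python code shows how a node terminal's data crawling program might drain the task queue. It assumes an in-memory dictionary in place of the task queue and a caller-supplied `fetch` function in place of the real page download. The names `run_data_crawler`, `next_unprocessed`, and `on_crawl_instruction` are illustrative and do not come from the patent text.

```python
from typing import Callable, Dict, Optional

UNPROCESSED = "unprocessed"
PROCESSED = "processed"


def next_unprocessed(queue: Dict[str, dict]) -> Optional[str]:
    """Return one task key still in the unprocessed state, or None if none is left."""
    for key, value in queue.items():
        if value["state"] == UNPROCESSED:
            return key
    return None


def run_data_crawler(queue: Dict[str, dict],
                     fetch: Callable[[str], str]) -> Dict[str, str]:
    """Read unprocessed tasks in a loop until none remain; return pages by task key."""
    pages: Dict[str, str] = {}
    while True:
        task = next_unprocessed(queue)
        if task is None:                   # no unprocessed crawling task remains
            break
        pages[task] = fetch(task)          # crawl the page data for the current task
        queue[task]["state"] = PROCESSED   # mark the task so it is not read again
    return pages


def on_crawl_instruction(queue: Dict[str, dict],
                         fetch: Callable[[str], str]) -> Dict[str, str]:
    """Hypothetical handler: receiving the master's instruction starts the crawler."""
    return run_data_crawler(queue, fetch)
```

In a real cluster, several node terminals would share one queue, so reading a task and changing its state would need to be atomic. The sketch ignores that concern because it runs in a single process.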
Example four
The fourth embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the method for crawling website data provided in the fourth embodiment of the present invention, where the method includes:
receiving a data crawling instruction sent by a main control terminal in a terminal cluster, and starting a data crawling program according to the data crawling instruction; and reading, in a loop, crawling tasks in the unprocessed state from the task queue through the data crawling program, and crawling the page data of the corresponding website according to the currently read crawling task, until no crawling task in the unprocessed state remains in the task queue.
Of course, for the computer-readable storage medium provided by the embodiments of the present invention, the computer program stored thereon is not limited to the method operations described above, and may also perform the website data crawling method provided by any embodiment of the present invention.
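Two of the variants recited in the claims, retrying a failed crawl and calling a functional plug-in according to the current task, can be sketched as follows. The plug-in registry, the `kind` and `url` task fields, and the default of three attempts are hypothetical. The patent text in this section does not specify how plug-ins are selected or how many retries are made.

```python
from typing import Callable, Dict

# Hypothetical plug-in registry: maps a task kind to the function that crawls it.
PLUGINS: Dict[str, Callable[[str], str]] = {}


def register_plugin(kind: str):
    """Return a decorator that registers a crawl function under a task kind."""
    def wrap(func: Callable[[str], str]) -> Callable[[str], str]:
        PLUGINS[kind] = func
        return func
    return wrap


def crawl_with_retry(task: dict, max_attempts: int = 3) -> str:
    """Call the plug-in selected by the task; retry when the crawl raises an error."""
    plugin = PLUGINS[task["kind"]]
    last_error = None
    for _attempt in range(max_attempts):
        try:
            return plugin(task["url"])
        except Exception as exc:  # any failure triggers another attempt
            last_error = exc
    raise RuntimeError(f"crawl failed after {max_attempts} attempts") from last_error
```

Keeping the retry logic outside the plug-ins means each plug-in only has to know how to crawl one kind of page, and every plug-in gets the same failure handling.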
More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages, such as the "C" language or similar programming languages.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments illustrated herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

  1. A method for crawling website data, characterized in that it is applied to any node terminal in a terminal cluster and comprises the following steps:
    receiving a data crawling instruction sent by a main control terminal in a terminal cluster, and starting a data crawling program according to the data crawling instruction;
    and reading, in a loop, crawling tasks in the unprocessed state from the task queue through the data crawling program, and crawling the page data of the corresponding website according to the currently read crawling task, until no crawling task in the unprocessed state remains in the task queue.
  2. The method of claim 1, wherein crawling page data of a corresponding website according to a currently read crawling task comprises:
    and when the crawling of the page data of the corresponding website according to the currently read crawling task fails, performing crawling retry.
  3. The method of claim 1, wherein crawling page data of a corresponding website according to a currently read crawling task comprises:
    and calling a functional plug-in according to the currently read crawling task, and crawling page data of the corresponding website based on the functional plug-in.
  4. The method according to , wherein the terminal cluster is a distributed cluster built on the Spark architecture.
  5. The method of any , wherein the task queue is a distributed key-value queue built on the Elasticsearch architecture, wherein the key uniquely identifies a crawling task, and the value includes the processing state of the crawling task.
  6. The method of claim 5, wherein the step of generating the task queue comprises:
    receiving a task crawling instruction sent by a main control terminal in a terminal cluster, and starting a task crawling program according to the task crawling instruction;
    crawling the task links of the corresponding websites through the task crawling program, and sending the task links to the task queue as keys of crawling tasks.
  7. The method of any , wherein after crawling page data of the corresponding website according to the crawling task currently read, the method further comprises:
    and sending the page data to a database, wherein the database is a distributed database built on the Elasticsearch architecture.
  8. A device for crawling website data, characterized in that it is configured in any node terminal in a terminal cluster and comprises:
    the program starting module is used for receiving a data crawling instruction sent by a main control terminal in a terminal cluster and starting a data crawling program according to the data crawling instruction;
    and the data crawling module is used for reading, in a loop, crawling tasks in the unprocessed state from the task queue through the data crawling program, and crawling the page data of the corresponding website according to the currently read crawling task, until no crawling task in the unprocessed state remains in the task queue.
  9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method for crawling website data of any of claims 1-7.
  10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method for crawling website data according to any of claims 1-7.
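
To illustrate the task queue of claims 5 and 6, the following sketch collects task links from a listing page and stores each link as a key whose value records the processing state. A plain dictionary stands in for the distributed key-value queue. The claims build that queue on Elasticsearch, but no Elasticsearch calls are shown here. The `LinkCollector` class, the `enqueue_task_links` function, and the `state` field name are assumptions made for this example.

```python
from html.parser import HTMLParser
from typing import Dict, List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href=...> tag in a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def enqueue_task_links(html: str, base_url: str, queue: Dict[str, dict]) -> int:
    """Add each task link as a key; the value records the processing state."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    added = 0
    for link in collector.links:
        if link not in queue:  # the key uniquely identifies a crawling task
            queue[link] = {"state": "unprocessed"}
            added += 1
    return added
```

Because the link itself is the key, a link that is found again does not create a second task and does not reset the state of a task that has already been processed.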
CN201911000083.3A 2019-10-21 2019-10-21 Crawling method and device for website data, electronic equipment and storage medium Pending CN110737814A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911000083.3A CN110737814A (en) 2019-10-21 2019-10-21 Crawling method and device for website data, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911000083.3A CN110737814A (en) 2019-10-21 2019-10-21 Crawling method and device for website data, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110737814A true CN110737814A (en) 2020-01-31

Family

ID=69270666

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911000083.3A Pending CN110737814A (en) 2019-10-21 2019-10-21 Crawling method and device for website data, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110737814A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460006A (en) * 2020-04-25 2020-07-28 智博云信息科技(广州)有限公司 Data mining method and device for database construction and server
CN113254747A (en) * 2021-06-09 2021-08-13 南京北斗创新应用科技研究院有限公司 Geographic space data acquisition system and method based on distributed web crawler
CN113515681A (en) * 2021-04-30 2021-10-19 广东科学技术职业学院 Real estate data crawler method and device based on script framework

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021619A (en) * 2016-07-14 2016-10-12 微额速达(上海)金融信息服务有限公司 Entire network search system
US20170083378A1 (en) * 2015-09-18 2017-03-23 Salesforce.Com, Inc. Managing processing of long tail task sequences in a stream processing framework
CN107273498A (en) * 2017-06-16 2017-10-20 成都布林特信息技术有限公司 Public sentiment big data processing method
CN108875091A (en) * 2018-08-14 2018-11-23 杭州费尔斯通科技有限公司 A kind of distributed network crawler system of unified management

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170083378A1 (en) * 2015-09-18 2017-03-23 Salesforce.Com, Inc. Managing processing of long tail task sequences in a stream processing framework
CN106021619A (en) * 2016-07-14 2016-10-12 微额速达(上海)金融信息服务有限公司 Entire network search system
CN107273498A (en) * 2017-06-16 2017-10-20 成都布林特信息技术有限公司 Public sentiment big data processing method
CN108875091A (en) * 2018-08-14 2018-11-23 杭州费尔斯通科技有限公司 A kind of distributed network crawler system of unified management

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
娄岩 等: "Spark概论", 《大数据应用基础》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460006A (en) * 2020-04-25 2020-07-28 智博云信息科技(广州)有限公司 Data mining method and device for database construction and server
CN113515681A (en) * 2021-04-30 2021-10-19 广东科学技术职业学院 Real estate data crawler method and device based on script framework
CN113254747A (en) * 2021-06-09 2021-08-13 南京北斗创新应用科技研究院有限公司 Geographic space data acquisition system and method based on distributed web crawler
CN113254747B (en) * 2021-06-09 2021-10-15 南京北斗创新应用科技研究院有限公司 Geographic space data acquisition system and method based on distributed web crawler

Similar Documents

Publication Publication Date Title
US10642904B2 (en) Infrastructure enabling intelligent execution and crawling of a web application
US9489237B1 (en) Dynamic tree determination for data processing
US9996593B1 (en) Parallel processing framework
CN110737814A (en) Crawling method and device for website data, electronic equipment and storage medium
US8276022B2 (en) Efficient failure detection for long running data transfer jobs
MX2007014899A (en) Back-off mechanism for search.
CN107807937B (en) Website SEO processing method, device and system
CN107480205B (en) Method and device for partitioning data
CN105677904B (en) Small documents storage method and device based on distributed file system
US20170078361A1 (en) Method and System for Collecting Digital Media Data and Metadata and Audience Data
CN110688096B (en) Method and device for constructing application program containing plug-in, medium and electronic equipment
CN103827778A (en) Enterprise tools enhancements
CN109325192B (en) Advertisement anti-shielding method and device
CN109213824B (en) Data capture system, method and device
CN112384940B (en) Mechanism for WEB crawling of e-commerce resource pages
CN110888972A (en) Sensitive content identification method and device based on Spark Streaming
US6898599B2 (en) Method and system for automated web reports
JP2017215868A (en) Anonymization processor, anonymization processing method, and program
CN111026945B (en) Multi-platform crawler scheduling method, device and storage medium
CN111078975A (en) Multi-node incremental data acquisition system and acquisition method
CN110888839A (en) Data storage and data search method and device
CN107643892B (en) Interface processing method, device, storage medium and processor
CN106452855B (en) Article label adding method and device
CN113722007A (en) Configuration method, device and system of VPN branch equipment
CN105190598A (en) Resource reference classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200131