CN112559839A - Data acquisition method and device, computer equipment and storage medium - Google Patents

Info

Publication number
CN112559839A
Authority
CN
China
Prior art keywords
crawling
task
data
crawler system
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910850324.7A
Other languages
Chinese (zh)
Other versions
CN112559839B (en)
Inventor
张志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201910850324.7A priority Critical patent/CN112559839B/en
Publication of CN112559839A publication Critical patent/CN112559839A/en
Application granted granted Critical
Publication of CN112559839B publication Critical patent/CN112559839B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a data acquisition method, a data acquisition device, computer equipment and a storage medium. In the data acquisition method, a suitable crawler system is selected according to the crawling ability values of the crawler systems to execute the data crawling task, and the execution status of the data crawling task is monitored in real time. When a stuck event occurs, the task is quickly handed over midway to another crawler system, the data obtained by the crawler systems are then collected to complete the data crawling task, and the execution efficiency of the data crawling task is effectively ensured.

Description

Data acquisition method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data acquisition method and apparatus, a computer device, and a storage medium.
Background
With the development of computer and internet technology, web crawler technology has emerged. A web crawler, also called a web spider, a web robot or a web page chaser, is a program or script that automatically captures web information according to certain rules. Functionally, a crawler is generally divided into three parts: data acquisition, processing and storage. A traditional crawler starts from the URLs of one or more initial web pages, obtains the URLs on those pages, and, while capturing web pages, continuously extracts new URLs from the current page and puts them into a queue until a certain stop condition of the system is met. The workflow of a focused crawler is more complex: links irrelevant to the subject are filtered out according to a web page analysis algorithm, and useful links are retained and put into the queue of URLs to be captured. The crawler then selects the next web page URL from the queue according to a search strategy and repeats the above process until a certain stop condition of the system is reached. In addition, all web pages captured by the crawler are stored by the system, analyzed, filtered and indexed for later query and retrieval; for a focused crawler, the analysis results obtained in this process may also provide feedback and guidance for the subsequent capturing process.
In existing crawler systems, when the progress of a crawling task is blocked, the crawler end has to wait for the final result of the crawling task or perform other compensation operations after a timeout, so the efficiency of data acquisition through the web crawler is low.
Disclosure of Invention
Therefore, it is necessary to provide a data acquisition method, an apparatus, a computer device, and a storage medium capable of efficiently acquiring data by a web crawler in order to solve the problem that the efficiency of acquiring data is low when the web crawler is stuck in a task.
A method of data acquisition, the method comprising:
acquiring a data crawling task, and searching a configuration site corresponding to the data crawling task;
acquiring corresponding crawling ability values of all the preset crawler systems in the configuration site, and selecting a first preset crawler system with the highest ability value according to the crawling ability values;
calling the first preset crawler system to execute the data crawling task on the configuration site;
when a stuck event occurs during the execution of the data crawling task, selecting a second preset crawler system with the second highest crawling capability value according to the crawling capability value to execute the data crawling task;
and collecting the crawling result of the first preset crawler system and the crawling result of the second preset crawler system, and acquiring target data corresponding to the data crawling task.
In one embodiment, before the obtaining the crawling ability value corresponding to each preset crawler system in the configuration site and selecting the first preset crawler system with the highest corresponding ability value of the data crawling task according to the crawling ability value, the method further includes:
calling each preset crawler system to execute a test crawling task corresponding to each configuration site;
when the test crawling task is completed, the crawling time consumption and the crawling success rate of each preset crawler system on the test crawling task are obtained;
and acquiring the crawling ability values corresponding to the preset crawler systems and the configuration sites according to the crawling time consumption and the crawling success rate of the preset crawler systems on the test crawling tasks.
In one embodiment, when a stuck event occurs during the execution of the data crawling task, the selecting a second preset crawler system with a second highest crawling capability value according to the crawling capability value to execute the data crawling task includes:
when a stuck event occurs during the execution of the data crawling task, selecting a second preset crawler system with the second highest crawling ability value according to the crawling ability value;
acquiring a stuck node page of the first preset crawler system in the data crawling task, and updating the data crawling task according to the stuck node page, wherein the crawling range of the updated data crawling task is from the stuck node page to the crawling end point page of the original data crawling task;
and executing the updated data crawling task on the configuration site through the second preset crawler system.
In one embodiment, the method further comprises the following steps:
acquiring the crawling progress, the residual page number and the progress pause time of the first preset crawler system on the data crawling task;
and when the crawling progress is higher than a preset stuck progress threshold value, the number of the remaining pages is lower than a preset stuck page number threshold value, and the progress pause time is higher than a preset stuck threshold time, judging that a stuck event occurs in the executed data crawling task.
In one embodiment, the method further comprises the following steps:
acquiring the progress pause time of the first preset crawler system on the data crawling task;
and when the progress pause time is higher than the preset task configuration time, judging that a stuck event occurs when the data crawling task is executed.
In one embodiment, the collecting the crawling result of the first preset crawler system and the crawling result of the second preset crawler system, and after obtaining target data corresponding to the data crawling task, further includes:
when the first preset crawler system finishes a data crawling task, sending a task termination instruction to the second preset crawler system, and controlling the second preset crawler system to terminate the current data crawling task;
and when the second preset crawler system finishes a data crawling task, sending a task termination instruction to the first preset crawler system, and controlling the first preset crawler system to terminate the current data crawling task.
In one embodiment, the collecting the crawling result of the first preset crawler system and the crawling result of the second preset crawler system, and after obtaining target data corresponding to the data crawling task, includes:
acquiring the crawling time consumption and the crawling success rate of the first preset crawler system and the second preset crawler system on the data crawling task;
and respectively updating the crawling ability values of the configuration sites corresponding to the data crawling tasks of the first preset crawler system and the second preset crawler system according to the crawling time consumption and the crawling success rate.
A data acquisition apparatus, the apparatus comprising:
the task acquisition module is used for acquiring a data crawling task and searching a configuration site corresponding to the data crawling task;
the crawler selecting module is used for acquiring the corresponding crawling ability values of all the preset crawler systems in the configuration site and selecting a first preset crawler system with the highest ability value according to the crawling ability values;
the first task execution module is used for calling the first preset crawler system to execute the data crawling task on the configuration site;
the second task execution module is used for selecting, when a stuck event occurs during the execution of the data crawling task, a second preset crawler system with the second highest crawling capability value according to the crawling capability value to execute the data crawling task;
and the data acquisition module is used for collecting the crawling result of the first preset crawler system and the crawling result of the second preset crawler system and acquiring target data corresponding to the data crawling task.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a data crawling task, and searching a configuration site corresponding to the data crawling task;
acquiring corresponding crawling ability values of all the preset crawler systems in the configuration site, and selecting a first preset crawler system with the highest ability value according to the crawling ability values;
calling the first preset crawler system to execute the data crawling task on the configuration site;
when a stuck event occurs during the execution of the data crawling task, selecting a second preset crawler system with the second highest crawling capability value according to the crawling capability value to execute the data crawling task;
and collecting the crawling result of the first preset crawler system and the crawling result of the second preset crawler system, and acquiring target data corresponding to the data crawling task.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a data crawling task, and searching a configuration site corresponding to the data crawling task;
acquiring corresponding crawling ability values of all the preset crawler systems in the configuration site, and selecting a first preset crawler system with the highest ability value according to the crawling ability values;
calling the first preset crawler system to execute the data crawling task on the configuration site;
when a stuck event occurs during the execution of the data crawling task, selecting a second preset crawler system with the second highest crawling capability value according to the crawling capability value to execute the data crawling task;
and collecting the crawling result of the first preset crawler system and the crawling result of the second preset crawler system, and acquiring target data corresponding to the data crawling task.
According to the data acquisition method, the data acquisition device, the computer equipment and the storage medium, a suitable crawler system is selected according to the crawling ability value of each crawler system to execute the data crawling task, and the execution status of the data crawling task is monitored in real time. When a stuck event occurs, the task is quickly handed over midway to another crawler system, the obtained data are then collected to complete the data crawling task, and the execution efficiency of the data crawling task is effectively ensured.
Drawings
FIG. 1 is a diagram of an application environment of a data acquisition method in one embodiment;
FIG. 2 is a schematic flow chart diagram illustrating a data acquisition method in one embodiment;
FIG. 3 is a schematic flow chart diagram of a data acquisition method in another embodiment;
FIG. 4 is a schematic sub-flow chart of step S700 of FIG. 2 in one embodiment;
FIG. 5 is a block diagram showing the structure of a data acquisition device according to an embodiment;
FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The data acquisition method provided by the application can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 over a network. The terminal 102 submits a data crawling task to the server 104. The server 104 receives the data crawling task and searches for the configuration site corresponding to the data crawling task; acquires the crawling ability value of each preset crawler system for the configuration site and selects the first preset crawler system with the highest ability value according to the crawling ability values; calls the first preset crawler system to execute the data crawling task on the configuration site; when a stuck event occurs during the execution of the data crawling task, selects a second preset crawler system with the second highest crawling ability value according to the crawling ability values to execute the data crawling task; and collects the crawling result of the first preset crawler system and the crawling result of the second preset crawler system to acquire the target data corresponding to the data crawling task. The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer or a portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a data obtaining method is provided, which is described by taking the application of the method to the server in fig. 1 as an example, and includes the following steps:
S100, acquiring a data crawling task, and searching a configuration site corresponding to the data crawling task.
The data crawling task is a task of acquiring network information data which accord with a certain rule from a configuration site corresponding to the task through a web crawler. The configuration site is specifically a site storing network information data, the data crawling task carries a configuration site corresponding to the task, and the server is used for executing the data crawling task in the configuration site through a web crawler and acquiring corresponding data.
S300, obtaining the corresponding crawling ability values of the preset crawler systems in the configuration site, and selecting the first preset crawler system with the highest ability value according to the crawling ability values.
S500, calling the first preset crawler system to execute the data crawling task on the configuration site.
The preset crawler systems are a plurality of crawler systems that are configured in the server and can each execute crawler tasks independently. When a crawler system is used, the multiple web crawlers in that crawler system jointly complete the same data crawling task. The crawling ability value is an estimate of the crawling speed and the crawling success rate with which a crawler system crawls data from a given configuration site, and is determined from the crawling speed and the crawling success rate of each preset crawler system in the historical crawling records of that site. The same preset crawler system has different crawling ability values for different configuration sites, and different preset crawler systems have different crawling ability values for the same configuration site. The ability values of the preset crawler systems are managed by a crawler ability value center in the server. In one embodiment, the crawling speed and the crawling success rate of each preset crawler system on the configuration site can be obtained through a test task corresponding to the configuration site; the preset crawler systems are then ranked by crawling speed and by crawling success rate, and the crawling ability value of each preset crawler system for the configuration site is determined from the combined ranking.
In one embodiment, when a data crawling task is received, the server polls each preset crawler system in the server, determines which preset crawler systems are in the normal state, and marks the unavailable preset crawler systems as unavailable. The server then acquires the configuration site corresponding to the data crawling task, looks up, through the crawler ability value center, the crawler system with the highest ability value among the preset crawler systems in the normal state, sets it as the first preset crawler system, executes the data crawling task through that crawler system, and acquires the corresponding data.
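The selection described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the capability table keyed by (crawler system, configuration site), the set of available systems produced by polling, and the function name select_crawler are all assumptions made for the example.

```python
from typing import Dict, FrozenSet, Optional, Set, Tuple

# Hypothetical capability table: (crawler system name, configuration site) -> ability value.
CapabilityTable = Dict[Tuple[str, str], float]


def select_crawler(
    site: str,
    capability: CapabilityTable,
    available: Set[str],
    exclude: FrozenSet[str] = frozenset(),
) -> Optional[str]:
    """Return the available crawler system with the highest ability value for a site.

    Systems listed in `exclude` (for example, one that is already stuck) are skipped,
    so calling this again with the first system excluded yields the second highest.
    """
    candidates = [
        (value, name)
        for (name, candidate_site), value in capability.items()
        if candidate_site == site and name in available and name not in exclude
    ]
    if not candidates:
        return None
    # max() on (value, name) tuples picks the highest value; ties fall back to the name.
    return max(candidates)[1]
```

Calling the function once gives the first preset crawler system; calling it again with that system in exclude gives the system with the second highest value for the same site.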
S700, when a stuck event occurs during the execution of the data crawling task, selecting a second preset crawler system with the second highest crawling ability value according to the crawling ability value to execute the data crawling task.
A stuck event refers to an abnormal situation in which the first preset crawler system stalls while executing the data crawling task, so that the task cannot be completed within a short time. Whether the data crawling task is a stuck task can be determined from the crawling progress of the task. When it is determined that the data crawling task executed by the first preset crawler system has entered the stuck state, the server can directly select the second preset crawler system with the second highest crawling ability value according to the crawling ability value, execute the data crawling task through the second preset crawler system, and acquire the corresponding data. In addition, in another embodiment, when a stuck event also occurs while the second preset crawler system, or a preset crawler system selected later, executes the data crawling task, the server may select other crawler systems in turn according to the crawling ability value to execute the data crawling task until the data crawling task is completed.
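The sequential fallback across crawler systems can be sketched as below, under stated assumptions: each crawler system is modelled as a hypothetical callable that receives the list of pages still to crawl and returns the pages it obtained together with the index at which it stalled (or None if it finished). The function name run_with_failover and this callable interface are illustrative only.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical crawler interface: pages to crawl -> (crawled data by page id,
# index in the input list at which the crawler stalled, or None if it finished).
Crawler = Callable[[List[str]], Tuple[Dict[str, str], Optional[int]]]


def run_with_failover(pages: List[str], crawlers_by_ability: List[Crawler]) -> Dict[str, str]:
    """Try crawler systems in descending ability order until the task completes."""
    results: Dict[str, str] = {}
    remaining = list(pages)
    for crawler in crawlers_by_ability:
        crawled, stuck_at = crawler(remaining)
        results.update(crawled)
        if stuck_at is None:
            break  # this crawler system finished the remaining pages
        # Hand the pages from the stalled position onward to the next system.
        remaining = remaining[stuck_at:]
    return results
```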
S900, collecting the crawling result of the first preset crawler system and the crawling result of the second preset crawler system, and acquiring target data corresponding to the data crawling task.
After the second preset crawler system starts the data crawling task, the server begins to collect the data obtained by the two preset crawler systems while they execute the data crawling task. Duplicate data can be removed after collection, which yields the target data corresponding to the data crawling task. In addition, when more than two preset crawler systems execute the data crawling task, the server finally collects and processes the data crawled by all of these preset crawler systems.
According to the data acquisition method, a suitable crawler system is selected according to the crawling ability value of each crawler system to execute the data crawling task, and the execution status of the data crawling task is monitored in real time. When a stuck event occurs, the task is quickly handed over midway to another crawler system, the obtained data are then collected to complete the data crawling task, and the execution efficiency of the data crawling task is effectively ensured.
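As a small illustration of the collection step, the sketch below merges the result lists of several crawler systems and drops duplicates. The record layout (a dict with a "url" key) and the rule of keeping the first copy seen are assumptions for the example; the source does not specify how duplicates are identified.

```python
from typing import Dict, List


def merge_results(*result_lists: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge crawl results from several crawler systems, keeping the first copy of each URL."""
    seen = set()
    merged: List[Dict[str, str]] = []
    for results in result_lists:
        for record in results:
            url = record["url"]  # hypothetical key identifying the crawled page
            if url not in seen:
                seen.add(url)
                merged.append(record)
    return merged
```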
As shown in fig. 3, in one embodiment, before S300, the method further includes:
S210, calling each preset crawler system to execute a test crawling task corresponding to each configuration site.
S230, when the test crawling task is completed, acquiring the crawling time consumption and the crawling success rate of each preset crawler system on the test crawling task.
S250, acquiring the crawling ability value of each preset crawler system for each configuration site according to the crawling time consumption and the crawling success rate of the preset crawler systems on the test crawling tasks.
Before any data crawling task is executed, the crawler ability value center can define in advance some websites that data crawling tasks may involve and predetermine these websites as configuration sites. Test crawling tasks for the configuration sites are then randomly generated and executed by each of the preset crawler systems controlled by the server. The server counts the crawling time consumption and the crawling success rate of each preset crawler system on the test crawling tasks, and then obtains the crawling ability value of each preset crawler system for each configuration site according to the crawling time consumption and the crawling success rate. In one embodiment, during testing, the ranking of crawling ability values can be determined from the crawling time consumption ranking and the crawling success rate ranking of the preset crawler systems on the same configuration site, and crawling ability values are then assigned to the preset crawler systems according to that ranking. For example, if crawler system A has the shortest crawling time and also the highest crawling success rate on the first site, the crawling ability value of crawler system A for the first site can be set to 100, and the other crawler systems are then given corresponding crawling ability values with this value as the reference. The test tasks make it possible to assign crawling ability values to the preset crawler systems efficiently, so that the most suitable preset crawler system can be found for a data crawling task and data acquisition efficiency is improved.
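One possible way to turn the two rankings into ability values is sketched below. The text only states that the best system can be given 100 and the others are valued relative to it; summing the two rank positions and subtracting a fixed step per place are assumptions made for this example, as are the function and parameter names.

```python
from typing import Dict, Tuple


def ability_values(
    test_results: Dict[str, Tuple[float, float]],
    step: float = 10.0,  # hypothetical gap between consecutive ranks
) -> Dict[str, float]:
    """Assign ability values for one configuration site from test crawl results.

    test_results maps a crawler system name to (crawl time in seconds, success rate).
    """
    names = list(test_results)
    by_time = sorted(names, key=lambda n: test_results[n][0])      # fastest first
    by_success = sorted(names, key=lambda n: -test_results[n][1])  # most reliable first
    # Combined rank: lower is better; the name breaks ties deterministically.
    combined = {n: by_time.index(n) + by_success.index(n) for n in names}
    ordered = sorted(names, key=lambda n: (combined[n], n))
    return {n: max(0.0, 100.0 - step * i) for i, n in enumerate(ordered)}
```

With this scheme, a system that is both fastest and most successful on a site receives 100, matching the example of crawler system A in the text.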
As shown in fig. 4, in one embodiment, S700 includes:
S720, when a stuck event occurs during the execution of the data crawling task, selecting a second preset crawler system with the second highest crawling ability value according to the crawling ability value.
S740, acquiring a stuck node page of the first preset crawler system in the data crawling task, and updating the data crawling task according to the stuck node page, the crawling range of the updated data crawling task being from the stuck node page to the crawling end point page of the original data crawling task.
S760, executing the updated data crawling task on the configuration site through the second preset crawler system.
Here, a stuck event specifically refers to the situation in which the progress of a data crawling task stalls during execution, and a stuck node page refers to the page of the configuration site at which the progress stalled. When it is determined that the data crawling task currently being executed is a stuck task, the second preset crawler system with the second highest crawling ability value for the current data crawling task can be selected according to the crawling ability value, and the data crawling task is executed through the second preset crawler system. Because the first preset crawler system has already completed part of the data crawling work, the original data crawling task can be updated according to the stuck node, so that the second preset crawler system only needs to crawl the stuck node and the remaining pages after it. Crawling only the data that has not yet been crawled effectively improves the overall data crawling efficiency. In addition, the first preset crawler system may return failure results for some pages while crawling the current data crawling task. Where such a failure is caused by anti-crawling measures rather than by a 404 error, the server may add those pages to the updated data crawling task so that the second preset crawler system crawls them again.
In another embodiment, the crawling range of the updated data crawling task runs from a preset number of pages before the stuck node page to the crawling end point page of the original data crawling task. For example, if the crawling range of the original data crawling task is pages a200-a500 of the configuration site and a stuck event occurs when the first preset crawler system has crawled to page a400, the stuck node is page a400, and the crawling range of the updated data crawling task can be set to a390-a500, which avoids incomplete crawled data when the results are collected. Re-crawling the failed parts of the task while the data crawling task is still being executed, rather than after the whole task has finished, effectively improves data crawling efficiency.
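The task update can be illustrated with a short sketch. It assumes pages are identified by consecutive integers (so a200-a500 becomes 200 to 500); the function name, the back_off parameter for the preset number of pages, and the retry_pages parameter for previously failed pages are hypothetical names chosen for the example.

```python
from typing import Iterable, List


def updated_crawl_range(
    start: int,
    end: int,
    stalled_page: int,
    back_off: int = 0,
    retry_pages: Iterable[int] = (),
) -> List[int]:
    """Build the page list for the updated task handed to the next crawler system.

    The range runs from `back_off` pages before the page where progress stalled
    to the original end page, plus any earlier pages that should be retried.
    """
    if not start <= stalled_page <= end:
        raise ValueError("stalled page must lie inside the original crawl range")
    new_start = max(start, stalled_page - back_off)
    pages = set(range(new_start, end + 1))
    # Earlier pages that failed for reasons other than 404 are crawled again.
    pages.update(p for p in retry_pages if start <= p <= end)
    return sorted(pages)
```

For the example in the text, updated_crawl_range(200, 500, 400, back_off=10) covers pages 390 through 500.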
In one embodiment, the method further comprises the following steps: and acquiring the crawling progress, the residual page number and the progress pause time of the first preset crawler system on the data crawling task. And when the crawling progress is higher than a preset stuck progress threshold value, the number of the remaining pages is lower than a preset stuck page number threshold value, and the progress pause time is higher than a preset stuck threshold time, judging that the data crawling task is a stuck task.
Specifically, whether the data crawling task is stuck can be judged from a combination of the task progress of the data crawling task, the number of remaining pages not yet crawled, and the progress pause time. The preset stuck progress threshold, the preset stuck page number threshold and the preset stuck threshold time can be determined according to the number of pages of the data crawling task. In a specific embodiment, for a data crawling task with 3000 pages, a task whose remaining progress is below 20%, whose number of remaining pages is below 500, and whose progress has paused for more than 10 min can be regarded as a stuck task. Whether the current data crawling task is a stuck task can thus be identified effectively from its task progress, which avoids re-crawling a large amount of data for a task that is close to completion; a crawling task that does not satisfy the stuck-event conditions can be re-crawled as an ordinary new task.
In addition, in another embodiment, the method further comprises: acquiring the progress pause time of the first preset crawler system on the data crawling task; and when the progress pause time is higher than the preset task configuration time, judging that the data crawling task is a stuck task. That is, a data crawling task whose progress has paused for longer than the preset task configuration time is a stuck task. Whether the current data crawling task is a stuck task can be effectively identified from the task progress of the crawling task.
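The three-condition check can be sketched as follows. The TaskStatus fields and the function name are assumptions for illustration, and the threshold values are left to the caller because the text says they depend on the number of pages in the task.

```python
from dataclasses import dataclass


@dataclass
class TaskStatus:
    total_pages: int        # pages in the data crawling task
    crawled_pages: int      # pages already crawled by the first crawler system
    paused_seconds: float   # time since the progress last advanced


def is_stuck(
    status: TaskStatus,
    progress_threshold: float,
    remaining_pages_threshold: int,
    pause_threshold_seconds: float,
) -> bool:
    """Return True only when all three stuck-event conditions hold."""
    if status.total_pages <= 0:
        return False
    progress = status.crawled_pages / status.total_pages
    remaining = status.total_pages - status.crawled_pages
    return (
        progress > progress_threshold
        and remaining < remaining_pages_threshold
        and status.paused_seconds > pause_threshold_seconds
    )
```

Requiring all three conditions means that a task which stalls early, with most pages still outstanding, is not classified as stuck by this check.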
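The simpler, time-only check needs a way to measure how long the progress has been paused. A minimal sketch, assuming the server periodically reports the number of crawled pages together with a timestamp (the class and method names are hypothetical):

```python
class ProgressWatch:
    """Track how long the reported crawl progress has stayed unchanged."""

    def __init__(self, now: float) -> None:
        self._last_progress = None
        self._last_change = now

    def update(self, crawled_pages: int, now: float) -> None:
        # Reset the pause timer only when the progress actually advances or changes.
        if crawled_pages != self._last_progress:
            self._last_progress = crawled_pages
            self._last_change = now

    def pause_time(self, now: float) -> float:
        return now - self._last_change


def is_stuck_by_timeout(watch: ProgressWatch, now: float, task_config_seconds: float) -> bool:
    """Stuck when the pause time exceeds the preset task configuration time."""
    return watch.pause_time(now) > task_config_seconds
```

In practice the timestamps could come from time.monotonic(); they are passed in explicitly here so the behaviour is easy to check.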
In one embodiment, after S900, the method further includes:
when the first preset crawler system finishes the data crawling task, sending a task termination instruction to a second preset crawler system, and controlling the second preset crawler system to terminate the current data crawling task;
and when the second preset crawler system finishes the data crawling task, sending a task termination instruction to the first preset crawler system, and controlling the first preset crawler system to terminate the current data crawling task.
Specifically, whether the current data crawling task is completed is judged according to the collected crawling results. When the first preset crawler system completes the data crawling task, a task termination instruction is sent to the second preset crawler system to control it to terminate the current data crawling task; when the second preset crawler system completes the data crawling task, a task termination instruction is sent to the first preset crawler system to control it to terminate the current data crawling task. Sending task termination instructions effectively saves crawler resources.
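The mutual termination can be sketched with a stop flag per crawler system. Modelling the task termination instruction as a threading.Event that the crawler checks between pages is an assumption of this example, as are the class and function names.

```python
import threading
from typing import Iterable


class CrawlerHandle:
    """Server-side handle for one running crawler system (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.stop_event = threading.Event()  # set when a termination instruction is sent
        self.finished = False

    def terminate(self) -> None:
        self.stop_event.set()


def on_task_finished(finished: CrawlerHandle, running: Iterable[CrawlerHandle]) -> None:
    """When one crawler system completes the task, tell every other one to stop."""
    finished.finished = True
    for handle in running:
        if handle is not finished:
            handle.terminate()
```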
In one embodiment, S900 is followed by:
acquiring the crawling time consumption and the crawling success rate of a first preset crawler system and a second preset crawler system on a data crawling task;
and respectively updating the crawling ability values of the configuration sites corresponding to the data crawling tasks of the first preset crawler system and the second preset crawler system according to the crawling time consumption and the crawling success rate.
After the data crawling task is completed, the crawling ability value can be updated in real time according to the actual crawling time consumption and crawling success rate of each preset crawler system on the data crawling task. The crawling ability value thus reflects the actual conditions of the data crawling task, which further improves data crawling efficiency.
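The application does not give a formula for the crawling ability value, so the following sketch makes two loudly stated assumptions: the score is the success rate divided by the time consumed, and the stored value is blended with the new measurement by a fixed `weight`. The function names and the table layout keyed by `(crawler, site)` are likewise hypothetical.

```python
from typing import Dict, Tuple

AbilityTable = Dict[Tuple[str, str], float]  # (crawler name, site) -> ability value


def ability_score(elapsed_s: float, success_rate: float) -> float:
    """Assumed scoring rule: higher success rate and lower time score higher."""
    if elapsed_s <= 0:
        return 0.0
    return success_rate / elapsed_s


def update_ability(table: AbilityTable, crawler: str, site: str,
                   elapsed_s: float, success_rate: float,
                   weight: float = 0.5) -> float:
    """Blend the newly measured score into the stored ability value."""
    new_score = ability_score(elapsed_s, success_rate)
    old_score = table.get((crawler, site))
    if old_score is None:
        # No earlier value for this crawler/site pair: store the measurement.
        table[(crawler, site)] = new_score
    else:
        table[(crawler, site)] = (1 - weight) * old_score + weight * new_score
    return table[(crawler, site)]
```

Blending rather than overwriting is a design choice of this sketch; it keeps one unusually slow or fast run from dominating the ranking.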
It should be understood that although the steps in the flow charts of fig. 2-4 are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and described, and may be performed in other orders. Moreover, at least some of the steps in fig. 2-4 may include multiple sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times; the sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a data acquisition apparatus including:
the task obtaining module 100 is configured to obtain a data crawling task and search a configuration site corresponding to the data crawling task;
the crawler selecting module 300 is configured to obtain a crawling ability value corresponding to each preset crawler system in a configuration site, and select a first preset crawler system with a highest ability value according to the crawling ability value;
a first task execution module 500, configured to invoke a first preset crawler system to execute a data crawling task at a configuration site;
the second task execution module 700 is configured to, when a stuck event occurs during execution of the data crawling task, select a second preset crawler system with the second highest crawling ability value according to the crawling ability value to execute the data crawling task;
and the data acquisition module 900 is configured to collect the crawling result of the first preset crawler system and the crawling result of the second preset crawler system, and acquire target data corresponding to the data crawling task.
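How the modules above might cooperate can be sketched as follows. This is a minimal sketch under stated assumptions: a crawler is modelled as a callable returning a `{page: data}` dictionary, a stuck event is modelled as a `StuckEvent` exception carrying the partial result, and on merge the pages already fetched by the first crawler take precedence. None of these names or choices come from the application.

```python
from typing import Callable, Dict, List, Tuple

Results = Dict[int, str]                       # page index -> crawled data
Crawler = Callable[[str, List[int]], Results]  # (site, pages) -> results


class StuckEvent(Exception):
    """Hypothetical signal that a crawler got stuck; carries partial results."""

    def __init__(self, partial: Results) -> None:
        super().__init__("crawler stuck")
        self.partial = partial


def rank_crawlers(abilities: Dict[Tuple[str, str], float], site: str) -> List[str]:
    """Crawler names for this site, highest crawling ability value first."""
    scored = [(value, name) for (name, s), value in abilities.items() if s == site]
    return [name for value, name in sorted(scored, reverse=True)]


def acquire(site: str, pages: List[int], crawlers: Dict[str, Crawler],
            abilities: Dict[Tuple[str, str], float]) -> Results:
    """Run the best crawler; on a stuck event, fall back to the second best."""
    ranked = rank_crawlers(abilities, site)
    first = ranked[0]
    try:
        return crawlers[first](site, pages)
    except StuckEvent as event:
        if len(ranked) < 2:
            raise  # no second preset crawler system is available
        second = ranked[1]
        results = dict(crawlers[second](site, pages))
        results.update(event.partial)  # assumed policy: first crawler's pages win
        return results
```

Here `rank_crawlers` plays the role of the crawler selecting module, and the final merge plays the role of the data acquisition module.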
In one embodiment, the system further comprises a crawler testing module, which is used for calling each preset crawler system to execute a test crawling task corresponding to each configuration site; when the test crawling task is completed, the crawling time consumption and the crawling success rate of each preset crawler system on the test crawling task are obtained; and acquiring the crawling ability values of the preset crawler systems corresponding to the configured sites according to the crawling time consumption and the crawling success rate of the preset crawler systems on the test crawling task.
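The test crawling performed by the crawler testing module can be sketched as follows. The sketch assumes that each crawler is a callable returning one success flag per test page, and that the ability value is the success rate divided by the elapsed time; both the interface and the scoring rule are assumptions, since the application does not specify them.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical crawler interface: (site, pages) -> one success flag per page.
TestCrawler = Callable[[str, List[str]], List[bool]]


def measure_abilities(crawlers: Dict[str, TestCrawler],
                      test_tasks: Dict[str, List[str]]
                      ) -> Dict[Tuple[str, str], float]:
    """Run every crawler on every site's test task and score the outcome."""
    table: Dict[Tuple[str, str], float] = {}
    for site, pages in test_tasks.items():
        for name, crawl in crawlers.items():
            start = time.perf_counter()
            flags = crawl(site, pages)
            # Guard against a zero reading on very fast test runs.
            elapsed = max(time.perf_counter() - start, 1e-9)
            success_rate = sum(flags) / len(pages) if pages else 0.0
            # Assumed scoring rule: success rate per second of crawling time.
            table[(name, site)] = success_rate / elapsed
    return table
```

The resulting table, keyed by crawler and configuration site, is the kind of lookup the crawler selecting module could consult when ranking crawlers.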
In one embodiment, the second task execution module 700 is configured to select a second preset crawler system with the second highest crawling ability value according to the crawling ability value when a stuck event occurs during execution of the data crawling task; acquire the stuck node page of the first preset crawler system in the data crawling task, and update the data crawling task according to the stuck node page, wherein the crawling range of the updated data crawling task is from the stuck node page to the crawling end point page of the original data crawling task; and execute the updated data crawling task on the configuration site through the second preset crawler system.
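The task update described above can be sketched as follows, assuming for illustration that pages are identified by integer indexes and that a task is a simple record of site, start page and end page. `CrawlTask` and `resume_from_stuck_page` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CrawlTask:
    """Hypothetical task record; pages are modelled as integer indexes."""
    site: str
    start_page: int
    end_page: int


def resume_from_stuck_page(task: CrawlTask, stuck_page: int) -> CrawlTask:
    """Return an updated task covering the stuck page through the end page."""
    if not task.start_page <= stuck_page <= task.end_page:
        raise ValueError("stuck page lies outside the original crawling range")
    # The original task is left unchanged; only the start of the range moves.
    return replace(task, start_page=stuck_page)
```

The updated task would then be handed to the second preset crawler system, so that pages before the stuck node page are not crawled again.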
In one embodiment, the system further comprises a first pause judging module, which is used for acquiring the crawling progress, the number of remaining pages and the progress pause time of the first preset crawler system on the data crawling task; and when the crawling progress is higher than a preset stuck progress threshold value, the number of the remaining pages is lower than a preset stuck page number threshold value, and the progress pause time is higher than a preset stuck threshold time, judging that the data crawling task is a stuck task.
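The three-condition judgment of this module can be sketched as follows. The comparison directions follow the wording of this paragraph; the names `StuckThresholds` and `is_stuck_task`, and the choice of a fraction for progress and seconds for time, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StuckThresholds:
    """Hypothetical container for the three preset thresholds."""
    progress: float        # preset stuck progress threshold, as a fraction 0..1
    remaining_pages: int   # preset stuck page number threshold
    pause_s: float         # preset stuck threshold time, in seconds


def is_stuck_task(progress: float, remaining_pages: int, pause_s: float,
                  thresholds: StuckThresholds) -> bool:
    """True only when all three conditions of the module hold together."""
    return (progress > thresholds.progress
            and remaining_pages < thresholds.remaining_pages
            and pause_s > thresholds.pause_s)
```

Because the three conditions are combined with a logical AND, a task that merely pauses while still far from its thresholds is not classified as stuck by this check.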
In one embodiment, the system further comprises a second pause judging module, which is used for acquiring the progress pause time of the first preset crawler system on the data crawling task; and when the progress pause time is higher than the preset task configuration time, judging that the data crawling task is a stuck task.
In one embodiment, the system further comprises a task interrupt module, which is used for: when the first preset crawler system finishes the data crawling task, sending a task termination instruction to the second preset crawler system, and controlling the second preset crawler system to terminate the current data crawling task; and when the second preset crawler system finishes the data crawling task, sending a task termination instruction to the first preset crawler system, and controlling the first preset crawler system to terminate the current data crawling task.
In one embodiment, the system further comprises a capability value updating module, which is used for acquiring the crawling time consumption and the crawling success rate of the first preset crawler system and the second preset crawler system on the data crawling task; and respectively updating the crawling ability values of the configuration sites corresponding to the data crawling tasks of the first preset crawler system and the second preset crawler system according to the crawling time consumption and the crawling success rate.
For specific limitations of the data acquisition device, reference may be made to the above limitations of the data acquisition method, which are not repeated here. The modules in the data acquisition device can be implemented wholly or partially by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing data related to the preset crawler systems. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a data acquisition method.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring a data crawling task, and searching a configuration site corresponding to the data crawling task;
acquiring corresponding crawling ability values of all the preset crawler systems in the configuration site, and selecting a first preset crawler system with the highest ability value according to the crawling ability values;
calling a first preset crawler system to execute a data crawling task at a configuration site;
when a stuck event occurs during execution of the data crawling task, selecting a second preset crawler system with the second highest crawling ability value according to the crawling ability value to execute the data crawling task;
and collecting the crawling result of the first preset crawler system and the crawling result of the second preset crawler system, and acquiring target data corresponding to the data crawling task.
In one embodiment, the processor, when executing the computer program, further performs the steps of: calling each preset crawler system to execute a test crawling task corresponding to each configuration site; when the test crawling task is completed, the crawling time consumption and the crawling success rate of each preset crawler system on the test crawling task are obtained; and acquiring the crawling ability values of the preset crawler systems corresponding to the configured sites according to the crawling time consumption and the crawling success rate of the preset crawler systems on the test crawling task.
In one embodiment, the processor, when executing the computer program, further performs the steps of: when a stuck event occurs during execution of the data crawling task, selecting a second preset crawler system with the second highest crawling ability value according to the crawling ability value; acquiring the stuck node page of the first preset crawler system in the data crawling task, and updating the data crawling task according to the stuck node page, wherein the crawling range of the updated data crawling task is from the stuck node page to the crawling end point page of the original data crawling task; and executing the updated data crawling task on the configuration site through the second preset crawler system.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring the crawling progress, the number of remaining pages and the progress pause time of a data crawling task by a first preset crawler system; and when the crawling progress is higher than a preset stuck progress threshold value, the number of the remaining pages is lower than a preset stuck page number threshold value, and the progress pause time is higher than a preset stuck threshold time, judging that the data crawling task is a stuck task.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring the progress pause time of a first preset crawler system on a data crawling task; and when the progress pause time is higher than the preset task configuration time, judging that the data crawling task is a stuck task.
In one embodiment, the processor, when executing the computer program, further performs the steps of: when the first preset crawler system finishes the data crawling task, sending a task termination instruction to a second preset crawler system, and controlling the second preset crawler system to terminate the current data crawling task; and when the second preset crawler system finishes the data crawling task, sending a task termination instruction to the first preset crawler system, and controlling the first preset crawler system to terminate the current data crawling task.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring the crawling time consumption and the crawling success rate of a first preset crawler system and a second preset crawler system on a data crawling task; and respectively updating the crawling ability values of the configuration sites corresponding to the data crawling tasks of the first preset crawler system and the second preset crawler system according to the crawling time consumption and the crawling success rate.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring a data crawling task, and searching a configuration site corresponding to the data crawling task;
acquiring corresponding crawling ability values of all the preset crawler systems in the configuration site, and selecting a first preset crawler system with the highest ability value according to the crawling ability values;
calling a first preset crawler system to execute a data crawling task at a configuration site;
when a stuck event occurs during execution of the data crawling task, selecting a second preset crawler system with the second highest crawling ability value according to the crawling ability value to execute the data crawling task;
and collecting the crawling result of the first preset crawler system and the crawling result of the second preset crawler system, and acquiring target data corresponding to the data crawling task.
In one embodiment, the computer program when executed by the processor further performs the steps of: calling each preset crawler system to execute a test crawling task corresponding to each configuration site; when the test crawling task is completed, the crawling time consumption and the crawling success rate of each preset crawler system on the test crawling task are obtained; and acquiring the crawling ability values of the preset crawler systems corresponding to the configured sites according to the crawling time consumption and the crawling success rate of the preset crawler systems on the test crawling task.
In one embodiment, the computer program when executed by the processor further performs the steps of: when a stuck event occurs during execution of the data crawling task, selecting a second preset crawler system with the second highest crawling ability value according to the crawling ability value; acquiring the stuck node page of the first preset crawler system in the data crawling task, and updating the data crawling task according to the stuck node page, wherein the crawling range of the updated data crawling task is from the stuck node page to the crawling end point page of the original data crawling task; and executing the updated data crawling task on the configuration site through the second preset crawler system.
In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring the crawling progress, the number of remaining pages and the progress pause time of a data crawling task by a first preset crawler system; and when the crawling progress is higher than a preset stuck progress threshold value, the number of the remaining pages is lower than a preset stuck page number threshold value, and the progress pause time is higher than a preset stuck threshold time, judging that the data crawling task is a stuck task.
In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring the progress pause time of a first preset crawler system on a data crawling task; and when the progress pause time is higher than the preset task configuration time, judging that the data crawling task is a stuck task.
In one embodiment, the computer program when executed by the processor further performs the steps of: when the first preset crawler system finishes the data crawling task, sending a task termination instruction to a second preset crawler system, and controlling the second preset crawler system to terminate the current data crawling task; and when the second preset crawler system finishes the data crawling task, sending a task termination instruction to the first preset crawler system, and controlling the first preset crawler system to terminate the current data crawling task.
In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring the crawling time consumption and the crawling success rate of a first preset crawler system and a second preset crawler system on a data crawling task; and respectively updating the crawling ability values of the configuration sites corresponding to the data crawling tasks of the first preset crawler system and the second preset crawler system according to the crawling time consumption and the crawling success rate.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination of them should be considered within the scope of this specification as long as the combined features do not contradict one another.
The above examples express only several embodiments of the present application. Their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method of data acquisition, the method comprising:
acquiring a data crawling task, and searching a configuration site corresponding to the data crawling task;
acquiring corresponding crawling ability values of all the preset crawler systems in the configuration site, and selecting a first preset crawler system with the highest ability value according to the crawling ability values;
calling the first preset crawler system to execute the data crawling task on the configuration site;
when a stuck event occurs during the execution of the data crawling task, selecting a second preset crawler system with the second highest crawling capability value according to the crawling capability value to execute the data crawling task;
and collecting the crawling result of the first preset crawler system and the crawling result of the second preset crawler system, and acquiring target data corresponding to the data crawling task.
2. The method according to claim 1, wherein the obtaining of the crawling ability value corresponding to each preset crawler system in the configuration site, and before selecting a first preset crawler system with a highest corresponding ability value of the data crawling task according to the crawling ability value, further comprises:
calling each preset crawler system to execute a test crawling task corresponding to each configuration site;
when the test crawling task is completed, the crawling time consumption and the crawling success rate of each preset crawler system on the test crawling task are obtained;
and acquiring the crawling ability values corresponding to the preset crawler systems and the configuration sites according to the crawling time consumption and the crawling success rate of the preset crawler systems on the test crawling tasks.
3. The method of claim 1, wherein selecting a second preset crawler system with a next highest crawling capability value according to the crawling capability value to perform the data crawling task when a stuck event occurs during the data crawling task comprises:
when a stuck event occurs during the execution of the data crawling task, selecting a second preset crawler system with the second highest crawling ability value according to the crawling ability value;
acquiring a stuck node page of the first preset crawler system in the data crawling task, and updating the data crawling task according to the stuck node page, wherein the crawling range of the updated data crawling task is from the stuck node page to the crawling end point page of the original data crawling task;
and executing the updated data crawling task on the configuration site through the second preset crawler system.
4. The method of claim 1, further comprising:
acquiring the crawling progress, the residual page number and the progress pause time of the first preset crawler system on the data crawling task;
and when the crawling progress is higher than a preset stuck progress threshold value, the number of the remaining pages is lower than a preset stuck page number threshold value, and the progress pause time is higher than a preset stuck threshold time, judging that a stuck event occurs in the executed data crawling task.
5. The method of claim 1, further comprising:
acquiring the progress pause time of the first preset crawler system on the data crawling task;
and when the progress pause time is higher than the preset task configuration time, judging that a stuck event occurs when the data crawling task is executed.
6. The method according to claim 1, wherein the aggregating the crawling results of the first preset crawler system and the crawling results of the second preset crawler system, and after obtaining target data corresponding to a data crawling task, further comprises:
when the first preset crawler system finishes a data crawling task, sending a task termination instruction to the second preset crawler system, and controlling the second preset crawler system to terminate the current data crawling task;
and when the second preset crawler system finishes a data crawling task, sending a task termination instruction to the first preset crawler system, and controlling the first preset crawler system to terminate the current data crawling task.
7. The method according to claim 1, wherein the aggregating the crawling results of the first preset crawler system and the crawling results of the second preset crawler system, after obtaining target data corresponding to a data crawling task, comprises:
acquiring the crawling time consumption and the crawling success rate of the first preset crawler system and the second preset crawler system on the data crawling task;
and respectively updating the crawling ability values of the configuration sites corresponding to the first preset crawler system and the second preset crawler system according to the crawling time consumption and the crawling success rate.
8. A data acquisition apparatus, characterized in that the apparatus comprises:
the task acquisition module is used for acquiring a data crawling task and searching a configuration site corresponding to the data crawling task;
the crawler selecting module is used for acquiring the corresponding crawling ability values of all the preset crawler systems in the configuration site and selecting a first preset crawler system with the highest ability value according to the crawling ability values;
the first task execution module is used for calling the first preset crawler system to execute the data crawling task on the configuration site;
the second task execution module is used for selecting, according to the crawling ability value, a second preset crawler system with the second highest crawling ability value to execute the data crawling task when a stuck event occurs during execution of the data crawling task;
and the data acquisition module is used for collecting the crawling result of the first preset crawler system and the crawling result of the second preset crawler system and acquiring target data corresponding to the data crawling task.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN201910850324.7A 2019-09-10 2019-09-10 Data acquisition method, device, computer equipment and storage medium Active CN112559839B (en)

Publications (2)

Publication Number Publication Date
CN112559839A true CN112559839A (en) 2021-03-26
CN112559839B CN112559839B (en) 2024-05-03

Family

ID=75029719

Country Status (1)

Country Link
CN (1) CN112559839B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant