CN112559839B - Data acquisition method, device, computer equipment and storage medium - Google Patents

Publication number
CN112559839B
Authority
CN
China
Legal status
Active
Application number
CN201910850324.7A
Other languages
Chinese (zh)
Other versions
CN112559839A (en
Inventor
张志强
Current Assignee
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The application relates to a data acquisition method, a data acquisition device, computer equipment and a storage medium. In the data acquisition method, a suitable crawler system is selected to execute a data crawling task according to the crawling capacity value of each crawler system, and the crawler system's execution of the task is monitored in real time. When a stuck event occurs, the task is quickly handed over mid-task to another crawler system, and the data obtained by both crawler systems is then aggregated to complete the data crawling task, so that the execution efficiency of the data crawling task is effectively ensured.

Description

Data acquisition method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data acquisition method, apparatus, computer device, and storage medium.
Background
With the development of computer internet technology, web crawler technology has emerged. A web crawler, also called a web spider, web robot or web chaser, is a program or script that automatically captures web information according to certain rules. Functionally, crawlers are generally divided into three parts: data collection, processing, and storage. A traditional crawler starts from the URLs of one or several initial web pages, obtains the URLs on those pages, and, while grabbing web pages, continuously extracts new URLs from the current page and puts them into a queue until a certain stop condition of the system is met. The workflow of a focused crawler is more complex: links irrelevant to the subject must be filtered out according to a certain web page analysis algorithm, and useful links are retained and put into a URL queue waiting to be grabbed. The crawler then selects the next web page URL to fetch from the queue according to a certain search strategy and repeats this procedure until a stop condition of the system is reached. In addition, all web pages captured by crawlers are stored by the system, analyzed and filtered to a certain extent, and indexed for subsequent query and retrieval; for focused crawlers, the analysis results from this process may also provide feedback and guidance for later grabbing.
With existing crawler systems, when progress becomes stuck, the crawler side must either wait for the final result of the crawling task or perform other compensating operations after a timeout, so acquiring data through the web crawler is inefficient.
Disclosure of Invention
Based on the above, it is necessary to provide a data acquisition method, apparatus, computer device and storage medium capable of efficiently acquiring data by a web crawler, aiming at the problem that the efficiency of acquiring data is low when the web crawler has a stuck condition in a task.
A method of data acquisition, the method comprising:
Acquiring a data crawling task and searching a configuration site corresponding to the data crawling task;
obtaining corresponding crawling capacity values of each preset crawler system in the configuration site, and selecting a first preset crawler system with the highest crawling capacity value according to the crawling capacity values;
invoking the first preset crawler system to execute the data crawling task at the configuration site;
when a stuck event occurs in the process of executing the data crawling task, selecting a second preset crawler system with the next-highest crawling capacity value according to the crawling capacity values to execute the data crawling task;
And collecting the crawling result of the first preset crawler system and the crawling result of the second preset crawler system, and acquiring target data corresponding to the data crawling task.
In one embodiment, before obtaining the crawling capacity value corresponding to each preset crawler system in the configuration site and selecting the first preset crawler system with the highest crawling capacity value according to the crawling capacity values, the method further includes:
calling each preset crawler system to execute a test crawling task corresponding to each configuration site;
When the test crawling task is completed, the crawling time consumption and the crawling success rate of each preset crawler system for the test crawling task are obtained;
and obtaining the crawling capacity values of the preset crawler systems corresponding to the configuration sites according to the crawling time consumption and the crawling success rate of the preset crawler systems on the test crawling task.
In one embodiment, selecting, when a stuck event occurs in the process of executing the data crawling task, a second preset crawler system with the next-highest crawling capacity value according to the crawling capacity values to execute the data crawling task includes:
when a stuck event occurs in the data crawling task, selecting a second preset crawler system with the next-highest crawling capacity value according to the crawling capacity values;
a stuck node page of the first preset crawler system in the data crawling task is obtained, the data crawling task is updated according to the stuck node page, and the crawling range of the updated data crawling task is from the stuck node page to a crawling terminal page corresponding to the original data crawling task;
and executing the updated data crawling task on the configuration site through the second preset crawler system.
In one embodiment, the method further comprises:
acquiring the crawling progress, the number of remaining pages and the progress pause time of the first preset crawler system on the data crawling task;
and when the crawling progress is higher than a preset stuck progress threshold, the number of remaining pages is lower than a preset stuck page number threshold, and the progress pause time is higher than a preset stuck threshold time, determining that a stuck event has occurred in the execution of the data crawling task.
In one embodiment, the method further comprises:
Acquiring the progress pause time of the first preset crawler system on the data crawling task;
and when the progress pause time is higher than the preset task configuration time, determining that a stuck event has occurred in the execution of the data crawling task.
In one embodiment, after the step of summarizing the crawling result of the first preset crawler system and the crawling result of the second preset crawler system and obtaining the target data corresponding to the data crawling task, the method further includes:
When the first preset crawler system finishes a data crawling task, a task termination instruction is sent to the second preset crawler system, and the second preset crawler system is controlled to terminate the current data crawling task;
When the second preset crawler system finishes the data crawling task, a task termination instruction is sent to the first preset crawler system, and the first preset crawler system is controlled to terminate the current data crawling task.
In one embodiment, after collecting the crawling result of the first preset crawler system and the crawling result of the second preset crawler system and obtaining the target data corresponding to the data crawling task, the method further includes:
acquiring crawling time consumption and crawling success rate of the first preset crawler system and the second preset crawler system on the data crawling task;
And respectively updating the crawling capacity values of the configuration sites corresponding to the data crawling task of the first preset crawler system and the second preset crawler system according to the crawling time consumption and the crawling success rate.
A data acquisition device, the device comprising:
The task acquisition module is used for acquiring a data crawling task and searching a configuration site corresponding to the data crawling task;
The crawler selection module is used for acquiring the crawling capacity value corresponding to each preset crawler system in the configuration site, and selecting a first preset crawler system with the highest crawling capacity value according to the crawling capacity value;
The first task execution module is used for calling the first preset crawler system to execute the data crawling task at the configuration site;
The second task execution module is used for selecting a second preset crawler system with the next-highest crawling capacity value according to the crawling capacity value to execute the data crawling task when a stuck event occurs in the process of executing the data crawling task;
The data acquisition module is used for gathering the crawling result of the first preset crawler system and the crawling result of the second preset crawler system and acquiring target data corresponding to the data crawling task.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
Acquiring a data crawling task and searching a configuration site corresponding to the data crawling task;
obtaining corresponding crawling capacity values of each preset crawler system in the configuration site, and selecting a first preset crawler system with the highest crawling capacity value according to the crawling capacity values;
invoking the first preset crawler system to execute the data crawling task at the configuration site;
when a stuck event occurs in the process of executing the data crawling task, selecting a second preset crawler system with the next-highest crawling capacity value according to the crawling capacity values to execute the data crawling task;
And collecting the crawling result of the first preset crawler system and the crawling result of the second preset crawler system, and acquiring target data corresponding to the data crawling task.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
Acquiring a data crawling task and searching a configuration site corresponding to the data crawling task;
obtaining corresponding crawling capacity values of each preset crawler system in the configuration site, and selecting a first preset crawler system with the highest crawling capacity value according to the crawling capacity values;
invoking the first preset crawler system to execute the data crawling task at the configuration site;
when a stuck event occurs in the process of executing the data crawling task, selecting a second preset crawler system with the next-highest crawling capacity value according to the crawling capacity values to execute the data crawling task;
And collecting the crawling result of the first preset crawler system and the crawling result of the second preset crawler system, and acquiring target data corresponding to the data crawling task.
According to the data acquisition method, the device, the computer equipment and the storage medium, a suitable crawler system is selected to execute the data crawling task according to the crawling capacity value of each crawler system, and the crawler system's execution of the task is monitored in real time. When a stuck event occurs, the task is quickly handed over mid-task to another crawler system, and the crawled data is then aggregated to complete the data crawling task, so that the execution efficiency of the data crawling task is effectively ensured.
Drawings
FIG. 1 is an application environment diagram of a data acquisition method in one embodiment;
FIG. 2 is a flow chart of a method of data acquisition in one embodiment;
FIG. 3 is a flow chart of a data acquisition method according to another embodiment;
FIG. 4 is a schematic flow chart illustrating a sub-process of step S700 of FIG. 2 in one embodiment;
FIG. 5 is a block diagram of a data acquisition device in one embodiment;
FIG. 6 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The data acquisition method provided by the application can be applied to the application environment shown in FIG. 1, in which terminal 102 communicates with server 104 over a network. The terminal 102 submits a data crawling task to the server 104. The server 104 receives the data crawling task and searches for the configuration site corresponding to the data crawling task; obtains the crawling capacity value of each preset crawler system for the configuration site and selects the first preset crawler system with the highest crawling capacity value; invokes the first preset crawler system to execute the data crawling task at the configuration site; when a stuck event occurs during execution of the data crawling task, selects the second preset crawler system with the next-highest crawling capacity value to execute the data crawling task; and aggregates the crawling results of the first and second preset crawler systems to acquire the target data corresponding to the data crawling task. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer or portable wearable device, and the server 104 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
In one embodiment, as shown in fig. 2, a data acquisition method is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
S100, acquiring a data crawling task and searching a configuration site corresponding to the data crawling task.
The data crawling task is a task of acquiring network information data conforming to a certain rule from a configuration site corresponding to the task through a web crawler. The configuration site specifically refers to a site storing network information data, the data crawling task carries a configuration site corresponding to the task, and the server is used for executing the data crawling task in the configuration site through a web crawler and acquiring corresponding data.
S300, obtaining the crawling capacity value corresponding to each preset crawler system in the configuration site, and selecting a first preset crawler system with the highest crawling capacity value according to the crawling capacity value.
S500, a first preset crawler system is called to execute a data crawling task at a configuration site.
The preset crawler systems are multiple crawler systems that are configured in the server and can each execute crawler tasks independently. A crawler system is composed of a plurality of web crawlers, which work together on the same data crawling task when the crawler system is used. The crawling capacity value is a predicted measure of the crawling speed and crawling success rate of a crawler system at a given configuration site, and is determined from the crawling speed and crawling success rate in each preset crawler system's historical crawling records for that site. The same preset crawler system has different crawling capacity values for different configuration sites, and different preset crawler systems have different crawling capacity values for the same configuration site; the capacity values of the preset crawler systems are managed by a crawler capacity value center in the server. In one embodiment, the crawling speed and crawling success rate of each preset crawler system for the configuration site can be obtained through a test task corresponding to the configuration site; the preset crawler systems are then ranked by crawling speed and by crawling success rate, and the crawling capacity value of each preset crawler system for the configuration site is determined from the combined rankings.
In one embodiment, when a data crawling task is received, a server starts to poll each preset crawler system in the server, firstly determines the preset crawler system with normal state in each preset crawler system, sets the unavailable preset crawler system to be in an unavailable state, acquires a configuration site corresponding to the data crawling task, searches the crawler system with the highest capacity value in the preset crawler system with normal state through a crawler capacity value center, sets the crawler system as a first preset crawler system, executes the data crawling task through the first preset crawler system, and acquires corresponding data.
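The selection described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `CrawlerSystem` class, the `available` flag, and the dictionary standing in for the crawler capacity value center are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class CrawlerSystem:
    name: str
    available: bool = True  # False marks a system found unusable during polling


def rank_crawlers(systems, capacity, site):
    """Return the available systems for `site`, highest capacity value first."""
    usable = [s for s in systems if s.available]
    return sorted(usable, key=lambda s: capacity.get((s.name, site), 0.0), reverse=True)


# Hypothetical capacity table keyed by (system name, configuration site).
capacity = {("A", "site1"): 80.0, ("B", "site1"): 100.0, ("C", "site1"): 90.0}
systems = [CrawlerSystem("A"), CrawlerSystem("B", available=False), CrawlerSystem("C")]

ranked = rank_crawlers(systems, capacity, "site1")
first, second = ranked[0], ranked[1]
```

Here system B has the highest value but is skipped because it is unavailable, so C becomes the first preset crawler system and A the fallback.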
S700, when a stuck event occurs during execution of the data crawling task, selecting a second preset crawler system with the next-highest crawling capacity value according to the crawling capacity value to execute the data crawling task.
The stuck event refers to an abnormal condition in which the first preset crawler system stalls while executing the data crawling task, so that the task cannot be completed within a short time. Whether the data crawling task is stuck can be determined from its crawling progress. When the server determines that the data crawling task executed by the first preset crawler system has entered a stuck state, it can directly select the second preset crawler system with the next-highest crawling capacity value, execute the data crawling task through the second preset crawler system, and acquire the corresponding data. In another embodiment, when a stuck event also occurs while the second preset crawler system, or a preset crawler system selected later, is executing the data crawling task, the server may again select other crawler systems in descending order of crawling capacity value until the data crawling task is completed.
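The repeated fallback through crawler systems in order of capacity value might look like the sketch below. It is a sequential simplification under stated assumptions: `run` is a hypothetical callback that returns the crawled results plus the index of the first uncrawled page when the system stalls (or `None` on completion), and `fake_run` merely simulates a stall.

```python
def crawl_with_failover(ranked_systems, pages, run):
    """Try each system in ranked order, handing the uncrawled pages to the next one."""
    collected = {}
    remaining = list(pages)
    for system in ranked_systems:
        results, stuck_index = run(system, remaining)
        collected.update(results)
        if stuck_index is None:  # this system finished the remaining pages
            break
        remaining = remaining[stuck_index:]  # resume from the stuck page
    return collected


def fake_run(system, pages):
    # Simulation only: system "A" stalls after two pages, others finish.
    limit = 2 if system == "A" else len(pages)
    done = {p: f"{system}:{p}" for p in pages[:limit]}
    stuck = limit if limit < len(pages) else None
    return done, stuck


data = crawl_with_failover(["A", "B"], [1, 2, 3, 4], fake_run)
```

In this run, pages 1 and 2 come from system A and pages 3 and 4 from system B.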
S900, collecting crawling results of the first preset crawler system and crawling results of the second preset crawler system, and acquiring target data corresponding to the data crawling task.
After the second preset crawler system starts the data crawling task, the server starts to summarize the data obtained by the two preset crawler systems in the process of executing the data crawling task, and the repeatedly obtained data can be removed after summarizing, so that target data corresponding to the data crawling task is obtained. In addition, when more than two preset crawler systems execute the data crawling task, the server finally performs aggregation processing on all data crawled by the preset crawler systems.
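A minimal sketch of the aggregation step, assuming each crawler system returns its results as a dictionary keyed by a page identifier (the identifiers and payloads below are made up for illustration):

```python
def merge_results(*result_sets):
    """Combine results from several crawler systems, keeping one copy per page."""
    merged = {}
    for results in result_sets:
        for page_id, data in results.items():
            merged.setdefault(page_id, data)  # later duplicates are dropped
    return merged


first_results = {"A398": "x", "A399": "y"}
second_results = {"A399": "y", "A400": "z"}  # A399 was crawled by both systems
target_data = merge_results(first_results, second_results)
```

Keying by page identifier makes duplicate removal a by-product of the merge; the same function accepts results from more than two systems.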
According to the data acquisition method, a suitable crawler system is selected to execute the data crawling task according to the crawling capacity value of each crawler system, and the crawler system's execution of the task is monitored in real time. When a stuck event occurs, the task is quickly handed over mid-task to another crawler system, and the crawled data is then aggregated to complete the data crawling task, so that the execution efficiency of the data crawling task is effectively ensured.
As shown in fig. 3, in one embodiment, before S300, the method further includes:
S210, calling each preset crawler system to execute the test crawling task corresponding to each configuration site.
S230, when the test crawling task is completed, crawling time consumption and crawling success rate of each preset crawler system on the test crawling task are obtained.
S250, obtaining crawling capacity values of the preset crawler systems and the configuration sites according to crawling time consumption and crawling success rate of the preset crawler systems on the test crawling task.
When no data crawling task is being performed, the crawler capacity value center may predefine some sites that data crawling tasks are likely to involve as configuration sites. Test crawling tasks are then randomly generated for these configuration sites and executed by each preset crawler system under the control of the server. The server records the crawling time consumption and the crawling success rate of each preset crawler system on the test crawling tasks, and from these obtains the crawling capacity value of each preset crawler system for each configuration site. In one embodiment, during testing, a crawling capacity value rank can be determined from the crawling time consumption rank and the crawling success rate rank of each preset crawler system on the same configuration site, and a crawling capacity value is then assigned to each preset crawler system according to that rank. For example, if crawler system A has the shortest crawling time consumption and the highest crawling success rate on the first site, the crawling capacity value of crawler system A for the first site can be set to 100, and the other crawler systems are assigned crawling capacity values relative to this reference. The test tasks allow crawling capacity values to be assigned to each preset crawler system efficiently, so that the most suitable preset crawler system can be found for a data crawling task and data acquisition efficiency is improved.
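One possible reading of the rank-based assignment is sketched below. The text does not specify how the two ranks are combined or how values below 100 are scaled, so the rank-sum rule and the linear scaling here are assumptions, as are the sample measurements.

```python
def capacity_values(results):
    """results maps system name -> (elapsed_seconds, success_rate) for one site."""
    names = list(results)
    by_time = sorted(names, key=lambda n: results[n][0])      # fastest first
    by_success = sorted(names, key=lambda n: -results[n][1])  # most successful first
    # Assumed rule: add the two rank positions (0 is best in each ranking).
    score = {n: by_time.index(n) + by_success.index(n) for n in names}
    # Assumed scaling: best possible score maps to 100, values fall off linearly.
    return {n: round(100 * (1 - score[n] / (2 * len(names))), 1) for n in names}


# Hypothetical test-crawl measurements for one configuration site.
values = capacity_values({"A": (10.0, 0.99), "B": (20.0, 0.95), "C": (30.0, 0.90)})
```

System A is both fastest and most successful, so it receives 100 and serves as the reference for the others.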
As shown in fig. 4, in one embodiment, S700 includes:
S720, when a stuck event occurs in the data crawling task, selecting a second preset crawler system with the next-highest crawling capacity value according to the crawling capacity value.
S740, acquiring a stuck node page of the first preset crawler system in the data crawling task, updating the data crawling task according to the stuck node page, and enabling the crawling range of the updated data crawling task to be from the stuck node page to the crawling destination page corresponding to the original data crawling task.
S760, executing the updated data crawling task on the configuration site through the second preset crawler system.
A stuck event is specifically a situation in which progress stalls during execution of a data crawling task, and the stuck node page is the page of the configuration site at which progress stalled. When the data crawling task currently being executed is determined to be stuck, the second preset crawler system with the next-highest crawling capacity value for the current data crawling task can be selected, and the data crawling task is executed through it. Because the first preset crawler system has already completed part of the crawling work, the original data crawling task can be updated according to the stuck node page, so that the second preset crawler system executes only the remaining portion of the task, from the stuck node page onward. Crawling only the data that has not yet been crawled effectively improves overall data crawling efficiency. In addition, the first preset crawler system may return failure results for some pages while executing the current data crawling task; where the failure is due to anti-crawling measures rather than an error such as 404, the server may also add these pages to the updated data crawling task so that the second preset crawler system crawls them again.
In another embodiment, the crawling range of the updated data crawling task runs from a preset number of pages before the stuck node page to the crawl end page of the original data crawling task. For example, if the crawling range of the original data crawling task is pages A200-A500 of the configuration site and a stuck event occurs when the first preset crawler system reaches page A400, the stuck node page is A400 and the crawling range of the updated data crawling task can be set to A390-A500, which avoids incomplete data when the results are aggregated. Pages that failed to crawl are re-crawled while the data crawling task is still executing rather than after the whole task has finished, which effectively improves data crawling efficiency.
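The range update can be illustrated with a short sketch that treats page identifiers such as A200 as integers; the function name and the `overlap` parameter are hypothetical.

```python
def updated_range(original_start, original_end, stuck_page, overlap=0):
    """Return the (start, end) page range for the second crawler system.

    `overlap` is the preset number of pages before the stuck page to re-crawl;
    the start never moves before the original start of the task.
    """
    new_start = max(original_start, stuck_page - overlap)
    return new_start, original_end


exact = updated_range(200, 500, 400)               # resume at the stuck page
padded = updated_range(200, 500, 400, overlap=10)  # re-crawl 10 pages before it
```

With the figures from the example, `padded` corresponds to the range A390-A500, while `exact` corresponds to A400-A500.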
In one embodiment, the method further comprises: and acquiring the crawling progress, the number of the residual pages and the progress pause time of the first preset crawler system on the data crawling task. And when the crawling progress is higher than a preset stuck progress threshold, the number of the residual pages is lower than the preset stuck page number threshold, and the progress pause time is higher than the preset stuck threshold time, judging that the data crawling task is a stuck task.
Specifically, whether a stuck event has occurred in the data crawling task can be judged by combining the task progress, the number of remaining uncrawled pages, and the length of time for which progress has paused. The preset stuck progress threshold, the preset stuck page number threshold and the preset stuck threshold time can be determined according to the number of pages in the data crawling task. In a specific embodiment, for a data crawling task covering 3000 pages, the task can be considered stuck when less than 20% of the task remains to be crawled, fewer than 500 pages remain in total, and the task progress has stalled for more than 10 min. Identifying stuck tasks from task progress in this way means that tasks which have only just begun crawling are not marked as stuck, which prevents large amounts of data from being re-crawled; crawling tasks that do not meet the stuck-event conditions are re-crawled as ordinary new tasks.
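The three-condition check (progress above a threshold, remaining pages below a threshold, pause time above a threshold) can be sketched as follows. The `Progress` class is hypothetical, and the 0.8 progress threshold used in the usage lines is an assumed value chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Progress:
    total_pages: int
    crawled_pages: int
    stalled_seconds: float  # how long progress has been paused


def is_stuck(p, progress_threshold, remaining_threshold, stall_threshold):
    """True only when all three stuck conditions hold at once."""
    progress = p.crawled_pages / p.total_pages
    remaining = p.total_pages - p.crawled_pages
    return (progress > progress_threshold
            and remaining < remaining_threshold
            and p.stalled_seconds > stall_threshold)


# Late in the task, few pages left, stalled for over 10 minutes.
late = is_stuck(Progress(3000, 2600, 700.0), 0.8, 500, 600)
# Same stall early in the task: not treated as stuck.
early = is_stuck(Progress(3000, 300, 700.0), 0.8, 500, 600)
```

Requiring all three conditions keeps a task that stalls near its beginning from triggering a handover that would re-crawl most of its pages.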
In another embodiment, the method further comprises: acquiring the progress pause time of the first preset crawler system on the data crawling task; and when the progress pause time is higher than the preset task configuration time, determining that the data crawling task is a stuck task. That is, a data crawling task whose progress has stalled for longer than the preset task configuration time is called a stuck task. In this way, whether the current data crawling task is stuck can be identified effectively from its task progress.
In one embodiment, after S900, the method further includes:
When the first preset crawler system finishes the data crawling task, a task termination instruction is sent to the second preset crawler system, and the second preset crawler system is controlled to terminate the current data crawling task;
When the second preset crawler system finishes the data crawling task, a task termination instruction is sent to the first preset crawler system, and the first preset crawler system is controlled to terminate the current data crawling task.
Specifically, whether the current data crawling task is completed is judged according to the aggregated crawling results. When the first preset crawler system completes the data crawling task, a task termination instruction is sent to the second preset crawler system, and the second preset crawler system is controlled to terminate the current data crawling task; when the second preset crawler system completes the data crawling task, a task termination instruction is sent to the first preset crawler system, and the first preset crawler system is controlled to terminate the current data crawling task. Sending the task termination instruction effectively saves crawler resources.
In one embodiment, after S900, the method further includes:
acquiring crawling time consumption and crawling success rate of the first preset crawler system and the second preset crawler system on a data crawling task;
and respectively updating the crawling capacity values of the configuration sites corresponding to the data crawling tasks of the first preset crawler system and the second preset crawler system according to the crawling time consumption and the crawling success rate.
After the data crawling task is completed, the crawling capacity value can be updated in real time according to the crawling rate and the crawling success rate of each preset crawler system on the data crawling task under the actual condition. The crawling capability value is ensured to meet the actual data crawling task requirement, and the data crawling efficiency is further improved.
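The text does not give a formula for this update. Purely as an illustration, the sketch below scores a finished task from its time consumption and success rate and blends that score into the stored value with an exponential moving average; the scoring rule, the `reference_seconds` baseline, and the `weight` parameter are all assumptions introduced here.

```python
def observed_score(elapsed_seconds, success_rate, reference_seconds):
    """Assumed scoring: 100 for a fully successful crawl at or under the reference time."""
    speed = min(1.0, reference_seconds / elapsed_seconds)
    return 100 * speed * success_rate


def update_capacity(old_value, new_score, weight=0.3):
    """Blend the latest observation into the stored capacity value (moving average)."""
    return (1 - weight) * old_value + weight * new_score


score = observed_score(elapsed_seconds=20.0, success_rate=0.9, reference_seconds=10.0)
new_value = update_capacity(100.0, 50.0, weight=0.5)
```

A moving average lets recent behavior shift a system's ranking gradually instead of letting a single slow task overwrite its history.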
It should be understood that, although the steps in the flowcharts of fig. 2-4 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to that order of execution and may be executed in other orders. Moreover, at least some of the steps in fig. 2-4 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; nor are these sub-steps or stages necessarily performed sequentially, as they may be performed in turn or alternately with other steps, or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a data acquisition apparatus comprising:
the task acquisition module 100 is configured to acquire a data crawling task, and search a configuration site corresponding to the data crawling task;
the crawler selection module 300 is configured to obtain the crawling capability value corresponding to each preset crawler system in the configuration site, and select the first preset crawler system with the highest crawling capability value according to the crawling capability value;
the first task execution module 500 is configured to invoke a first preset crawler system to execute a data crawling task at a configuration site;
The second task execution module 700 is configured to select a second preset crawler system with a second highest capacity value according to the crawling capacity value to execute the data crawling task when a stuck event occurs in the execution of the data crawling task;
The data acquisition module 900 is configured to gather a crawling result of the first preset crawler system and a crawling result of the second preset crawler system, and acquire target data corresponding to the data crawling task.
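The selection and summarizing roles of modules 300, 700 and 900 can be sketched in Python as follows. The function names, the dictionary-based data shapes, and the rule that the first crawler system's result wins when both systems return the same page are assumptions for illustration only.

```python
from typing import Dict, Tuple


def select_first_and_second(capabilities: Dict[str, float]) -> Tuple[str, str]:
    """Return the crawler systems with the highest and second highest values."""
    if len(capabilities) < 2:
        raise ValueError("at least two preset crawler systems are required")
    ranked = sorted(capabilities, key=lambda name: capabilities[name], reverse=True)
    return ranked[0], ranked[1]


def gather_results(first: Dict[str, str], second: Dict[str, str]) -> Dict[str, str]:
    """Merge page -> content maps from both crawler systems.

    Assumption: when both systems crawled the same page, the result of the
    first preset crawler system is kept.
    """
    merged = dict(second)
    merged.update(first)
    return merged
```

For example, with capability values `{"a": 0.3, "b": 0.9, "c": 0.5}` for one configuration site, `select_first_and_second` returns `("b", "c")`.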
In one embodiment, the apparatus further comprises a crawler testing module configured to: call each preset crawler system to execute the test crawling task corresponding to each configuration site; when the test crawling task is completed, obtain the crawling time consumption and the crawling success rate of each preset crawler system on the test crawling task; and obtain the crawling capacity value of each preset crawler system corresponding to each configuration site according to that crawling time consumption and crawling success rate.
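A minimal Python sketch of how test crawling results might be turned into a per-site table of capability values is given below. The `TrialResult` record, the `build_capability_table` function, and the scoring rule (success rate divided by time consumption) are hypothetical; the application only states that the value is derived from these two measurements.

```python
from typing import Dict, Iterable, NamedTuple


class TrialResult(NamedTuple):
    """Hypothetical record of one test crawling task."""
    site: str
    crawler: str
    elapsed_seconds: float
    success_rate: float


def build_capability_table(
    results: Iterable[TrialResult],
) -> Dict[str, Dict[str, float]]:
    """Map each configuration site to {crawler system: capability value}."""
    table: Dict[str, Dict[str, float]] = {}
    for result in results:
        if result.elapsed_seconds <= 0:
            raise ValueError("elapsed_seconds must be positive")
        # Assumed rule: faster and more reliable crawls score higher.
        score = result.success_rate / result.elapsed_seconds
        table.setdefault(result.site, {})[result.crawler] = score
    return table
```

Keeping one inner dictionary per site reflects that the same crawler system can rank differently on different configuration sites.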
In one embodiment, the second task execution module 700 is configured to: select, when a stuck event occurs in executing the data crawling task, a second preset crawler system with the second highest capacity value according to the crawling capacity value; obtain the stuck node page of the first preset crawler system in the data crawling task and update the data crawling task according to the stuck node page, so that the crawling range of the updated data crawling task runs from the stuck node page to the crawling terminal page corresponding to the original data crawling task; and execute the updated data crawling task on the configuration site through the second preset crawler system.
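The task update can be pictured with a short Python sketch, under the assumption that a data crawling task is represented as an ordered list of page identifiers. The function name `remaining_range` and that representation are illustrative only.

```python
from typing import List


def remaining_range(pages: List[str], stuck_page: str) -> List[str]:
    """Return the pages from the stuck node page to the terminal page, inclusive.

    Raises ValueError if `stuck_page` is not part of the original task.
    """
    start = pages.index(stuck_page)
    return pages[start:]


original_task = ["p1", "p2", "p3", "p4"]
updated_task = remaining_range(original_task, "p3")  # handed to the second system
```

The stuck node page itself is included in the updated range, so the second preset crawler system retries the page on which the first system stalled.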
In one embodiment, the apparatus further includes a first stuck determination module configured to: obtain the crawling progress, the number of remaining pages, and the progress pause time of the first preset crawler system on the data crawling task; and when the crawling progress is higher than a preset stuck progress threshold, the number of remaining pages is lower than a preset stuck page number threshold, and the progress pause time is higher than a preset stuck threshold time, judge that the data crawling task is a stuck task.
In one embodiment, the apparatus further includes a second stuck determination module configured to: obtain the progress pause time of the first preset crawler system on the data crawling task; and when the progress pause time is higher than the preset task configuration time, judge that the data crawling task is a stuck task.
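The two stuck determinations can be sketched in Python as follows. The `CrawlStatus` record, the function names, and every default threshold value are assumptions chosen for the example; the application leaves the preset thresholds to configuration.

```python
from dataclasses import dataclass


@dataclass
class CrawlStatus:
    progress: float        # fraction of the task completed, 0.0 to 1.0
    remaining_pages: int   # pages not yet crawled
    pause_seconds: float   # time since the progress last advanced


def is_stuck_near_end(
    status: CrawlStatus,
    progress_threshold: float = 0.9,
    page_threshold: int = 10,
    pause_threshold: float = 60.0,
) -> bool:
    """First determination: nearly done, few pages left, yet no recent progress."""
    return (
        status.progress > progress_threshold
        and status.remaining_pages < page_threshold
        and status.pause_seconds > pause_threshold
    )


def is_stuck_by_timeout(status: CrawlStatus, task_config_seconds: float = 300.0) -> bool:
    """Second determination: the pause alone exceeds the task configuration time."""
    return status.pause_seconds > task_config_seconds
```

With these defaults, a task at 95% progress with 3 pages left and a 120-second pause is stuck under the first determination but not the second, while a task at 20% progress that has paused for 600 seconds is stuck only under the second.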
In one embodiment, the apparatus further comprises a task interrupt module configured to: when the first preset crawler system finishes the data crawling task, send a task termination instruction to the second preset crawler system and control the second preset crawler system to terminate the current data crawling task; and when the second preset crawler system finishes the data crawling task, send a task termination instruction to the first preset crawler system and control the first preset crawler system to terminate the current data crawling task.
In one embodiment, the apparatus further comprises a capability value updating module configured to: acquire the crawling time consumption and the crawling success rate of the first preset crawler system and the second preset crawler system on the data crawling task; and respectively update the crawling capacity values of the first preset crawler system and the second preset crawler system for the configuration site corresponding to the data crawling task according to the crawling time consumption and the crawling success rate.
For specific limitations of the data acquisition device, reference may be made to the above limitations of the data acquisition method, which are not repeated here. Each module in the above data acquisition device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores data related to the preset crawler systems. The network interface of the computer device communicates with an external terminal through a network connection. The computer program, when executed by the processor, implements a data acquisition method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 6 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, may combine some of the components, or may have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
Acquiring a data crawling task, and searching a configuration site corresponding to the data crawling task;
obtaining corresponding crawling capacity values of each preset crawler system in a configuration site, and selecting a first preset crawler system with the highest crawling capacity value according to the crawling capacity values;
invoking a first preset crawler system to execute a data crawling task at a configuration site;
When a stuck event occurs during execution of the data crawling task, selecting a second preset crawler system with the second highest capacity value according to the crawling capacity value to execute the data crawling task;
Collecting crawling results of a first preset crawler system and crawling results of a second preset crawler system, and acquiring target data corresponding to a data crawling task.
In one embodiment, the processor when executing the computer program further performs the steps of: calling each preset crawler system to execute a test crawling task corresponding to each configuration site; when the test crawling task is completed, the crawling time consumption and the crawling success rate of each preset crawler system on the test crawling task are obtained; and obtaining the crawling capacity value of each preset crawler system corresponding to each configuration site according to the crawling time consumption and the crawling success rate of each preset crawler system on the test crawling task.
In one embodiment, the processor when executing the computer program further performs the steps of: when a stuck event occurs in the data crawling task, selecting a second preset crawler system with the second highest capacity value according to the crawling capacity value; obtaining a stuck node page of the first preset crawler system in the data crawling task, and updating the data crawling task according to the stuck node page, so that the crawling range of the updated data crawling task runs from the stuck node page to the crawling terminal page corresponding to the original data crawling task; and executing the updated data crawling task on the configuration site through the second preset crawler system.
In one embodiment, the processor when executing the computer program further performs the steps of: obtaining crawling progress, the number of residual pages and progress pause time of a first preset crawler system on a data crawling task; and when the crawling progress is higher than a preset stuck progress threshold, the number of the residual pages is lower than the preset stuck page number threshold, and the progress pause time is higher than the preset stuck threshold time, judging that the data crawling task is a stuck task.
In one embodiment, the processor when executing the computer program further performs the steps of: acquiring the progress pause time of the first preset crawler system on the data crawling task; and when the progress pause time is higher than the preset task configuration time, judging that the data crawling task is a stuck task.
In one embodiment, the processor when executing the computer program further performs the steps of: when the first preset crawler system finishes the data crawling task, sending a task termination instruction to the second preset crawler system, and controlling the second preset crawler system to terminate the current data crawling task; when the second preset crawler system finishes the data crawling task, a task termination instruction is sent to the first preset crawler system, and the first preset crawler system is controlled to terminate the current data crawling task.
In one embodiment, the processor when executing the computer program further performs the steps of: acquiring crawling time consumption and crawling success rate of the first preset crawler system and the second preset crawler system on a data crawling task; and respectively updating the crawling capacity values of the configuration sites corresponding to the data crawling tasks of the first preset crawler system and the second preset crawler system according to the crawling time consumption and the crawling success rate.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
Acquiring a data crawling task, and searching a configuration site corresponding to the data crawling task;
obtaining corresponding crawling capacity values of each preset crawler system in a configuration site, and selecting a first preset crawler system with the highest crawling capacity value according to the crawling capacity values;
invoking a first preset crawler system to execute a data crawling task at a configuration site;
When a stuck event occurs during execution of the data crawling task, selecting a second preset crawler system with the second highest capacity value according to the crawling capacity value to execute the data crawling task;
Collecting crawling results of a first preset crawler system and crawling results of a second preset crawler system, and acquiring target data corresponding to a data crawling task.
In one embodiment, the computer program when executed by the processor further performs the steps of: calling each preset crawler system to execute a test crawling task corresponding to each configuration site; when the test crawling task is completed, the crawling time consumption and the crawling success rate of each preset crawler system on the test crawling task are obtained; and obtaining the crawling capacity value of each preset crawler system corresponding to each configuration site according to the crawling time consumption and the crawling success rate of each preset crawler system on the test crawling task.
In one embodiment, the computer program when executed by the processor further performs the steps of: when a stuck event occurs in the data crawling task, selecting a second preset crawler system with the second highest capacity value according to the crawling capacity value; obtaining a stuck node page of the first preset crawler system in the data crawling task, and updating the data crawling task according to the stuck node page, so that the crawling range of the updated data crawling task runs from the stuck node page to the crawling terminal page corresponding to the original data crawling task; and executing the updated data crawling task on the configuration site through the second preset crawler system.
In one embodiment, the computer program when executed by the processor further performs the steps of: obtaining crawling progress, the number of residual pages and progress pause time of a first preset crawler system on a data crawling task; and when the crawling progress is higher than a preset stuck progress threshold, the number of the residual pages is lower than the preset stuck page number threshold, and the progress pause time is higher than the preset stuck threshold time, judging that the data crawling task is a stuck task.
In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring the progress pause time of the first preset crawler system on the data crawling task; and when the progress pause time is higher than the preset task configuration time, judging that the data crawling task is a stuck task.
In one embodiment, the computer program when executed by the processor further performs the steps of: when the first preset crawler system finishes the data crawling task, sending a task termination instruction to the second preset crawler system, and controlling the second preset crawler system to terminate the current data crawling task; when the second preset crawler system finishes the data crawling task, a task termination instruction is sent to the first preset crawler system, and the first preset crawler system is controlled to terminate the current data crawling task.
In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring crawling time consumption and crawling success rate of the first preset crawler system and the second preset crawler system on a data crawling task; and respectively updating the crawling capacity values of the configuration sites corresponding to the data crawling tasks of the first preset crawler system and the second preset crawler system according to the crawling time consumption and the crawling success rate.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored on a non-transitory computer readable storage medium, and that the program, when executed, may include the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application. They are described in specific detail, but are not thereby to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (10)

1. A method of data acquisition, the method comprising:
Acquiring a data crawling task and searching a configuration site corresponding to the data crawling task;
obtaining a crawling capacity value corresponding to each preset crawler system in the configuration site, and selecting a first preset crawler system with the highest crawling capacity value according to the crawling capacity value, wherein the crawling capacity value is a predicted value of the crawling speed and the crawling success rate of the preset crawler system in crawling data of each configuration site, and the crawling capacity value is determined according to the crawling speed and the crawling success rate of each preset crawler system in a historical crawling record of the configuration site;
invoking the first preset crawler system to execute the data crawling task at the configuration site;
when a stuck event occurs in the process of executing the data crawling task, selecting a second preset crawler system with the second highest capacity value according to the crawling capacity value to execute the data crawling task;
collecting crawling results of the first preset crawler system and crawling results of the second preset crawler system, and obtaining target data corresponding to a data crawling task;
wherein the selecting, when a stuck event occurs in the process of executing the data crawling task, of a second preset crawler system with the second highest capacity value according to the crawling capacity value to execute the data crawling task comprises:
when a stuck event occurs in the data crawling task, selecting a second preset crawler system with the second highest capacity value according to the crawling capacity value;
obtaining a stuck node page of the first preset crawler system in the data crawling task, and updating the data crawling task according to the stuck node page, so that the crawling range of the updated data crawling task runs from the stuck node page to the crawling terminal page corresponding to the original data crawling task;
and executing the updated data crawling task on the configuration site through the second preset crawler system.
2. The method of claim 1, wherein the obtaining the corresponding crawling capability value of each preset crawler system in the configuration site, before selecting the first preset crawler system with the highest corresponding capability value of the data crawling task according to the crawling capability value, further comprises:
calling each preset crawler system to execute a test crawling task corresponding to each configuration site;
When the test crawling task is completed, the crawling time consumption and the crawling success rate of each preset crawler system for the test crawling task are obtained;
and obtaining the crawling capacity values of the preset crawler systems corresponding to the configuration sites according to the crawling time consumption and the crawling success rate of the preset crawler systems on the test crawling task.
3. The method as recited in claim 1, further comprising:
acquiring the crawling progress, the number of residual pages and the progress pause time of the first preset crawler system on the data crawling task;
and when the crawling progress is higher than a preset stuck progress threshold, the number of the residual pages is lower than a preset stuck page number threshold, and the progress pause time is higher than a preset stuck threshold time, judging that a stuck event has occurred in executing the data crawling task.
4. The method as recited in claim 1, further comprising:
Acquiring the progress pause time of the first preset crawler system on the data crawling task;
and when the progress pause time is higher than a preset task configuration time, judging that a stuck event has occurred in executing the data crawling task.
5. The method of claim 1, wherein the aggregating the crawling results of the first preset crawler system and the crawling results of the second preset crawler system, after obtaining the target data corresponding to the data crawling task, further comprises:
when the first preset crawler system finishes a data crawling task, a task termination instruction is sent to the second preset crawler system, and the second preset crawler system is controlled to terminate the current data crawling task;
When the second preset crawler system finishes the data crawling task, a task termination instruction is sent to the first preset crawler system, and the first preset crawler system is controlled to terminate the current data crawling task.
6. The method of claim 1, wherein the aggregating the crawling results of the first preset crawler system and the crawling results of the second preset crawler system, after obtaining the target data corresponding to the data crawling task, comprises:
acquiring crawling time consumption and crawling success rate of the first preset crawler system and the second preset crawler system on the data crawling task;
And respectively updating the crawling capacity values of the configuration sites corresponding to the first preset crawler system and the second preset crawler system according to the crawling time consumption and the crawling success rate.
7. A data acquisition device, the device comprising:
The task acquisition module is used for acquiring a data crawling task and searching a configuration site corresponding to the data crawling task;
The crawler selection module is used for acquiring the crawling capacity value corresponding to each preset crawler system in the configuration site, selecting a first preset crawler system with the highest crawling capacity value according to the crawling capacity value, wherein the crawling capacity value is a predicted value of the crawling speed and the crawling success rate of the preset crawler system in crawling data of each configuration site, and the crawling capacity value is determined according to the crawling speed and the crawling success rate of each preset crawler system in the historical crawling record of the configuration site;
The first task execution module is used for calling the first preset crawler system to execute the data crawling task at the configuration site;
The second task execution module is used for selecting a second preset crawler system with the second highest capacity value according to the crawling capacity value to execute the data crawling task when a stuck event occurs in the process of executing the data crawling task;
The data acquisition module is used for gathering the crawling result of the first preset crawler system and the crawling result of the second preset crawler system and acquiring target data corresponding to a data crawling task;
The second task execution module is specifically configured to: when a stuck event occurs in the data crawling task, select a second preset crawler system with the second highest capacity value according to the crawling capacity value; obtain a stuck node page of the first preset crawler system in the data crawling task, and update the data crawling task according to the stuck node page, so that the crawling range of the updated data crawling task runs from the stuck node page to the crawling terminal page corresponding to the original data crawling task; and execute the updated data crawling task on the configuration site through the second preset crawler system.
8. The apparatus of claim 7, further comprising a crawler testing module to: calling each preset crawler system to execute a test crawling task corresponding to each configuration site; when the test crawling task is completed, the crawling time consumption and the crawling success rate of each preset crawler system for the test crawling task are obtained; and obtaining the crawling capacity values of the preset crawler systems corresponding to the configuration sites according to the crawling time consumption and the crawling success rate of the preset crawler systems on the test crawling task.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN201910850324.7A 2019-09-10 2019-09-10 Data acquisition method, device, computer equipment and storage medium Active CN112559839B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910850324.7A CN112559839B (en) 2019-09-10 2019-09-10 Data acquisition method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112559839A CN112559839A (en) 2021-03-26
CN112559839B true CN112559839B (en) 2024-05-03

Family

ID=75029719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910850324.7A Active CN112559839B (en) 2019-09-10 2019-09-10 Data acquisition method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112559839B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874487A (en) * 2017-02-21 2017-06-20 国信优易数据有限公司 A kind of distributed reptile management system and its method
CN107025230A (en) * 2016-01-29 2017-08-08 北京国双科技有限公司 The processing method and processing device of web crawlers
CN108446199A (en) * 2017-02-16 2018-08-24 阿里巴巴集团控股有限公司 A kind of detection method and device using interim card

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9652538B2 (en) * 2013-12-11 2017-05-16 Ebay Inc. Web crawler optimization system

Also Published As

Publication number Publication date
CN112559839A (en) 2021-03-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant