CN116501945A

CN116501945A - Multithreaded browser driven crawler method, system and readable storage medium

Info

Publication number: CN116501945A
Application number: CN202310765246.7A
Authority: CN
Inventors: 吕振; 陈增和; 赖浩哲; 林兴武
Original assignee: Shenzhen Housley Technology Co ltd
Current assignee: Shenzhen Housley Technology Co ltd
Priority date: 2023-06-27
Filing date: 2023-06-27
Publication date: 2023-07-28

Abstract

The invention discloses a multithreaded browser driven crawler method, system and readable storage medium. The method comprises: obtaining a crawler task; analyzing the crawler task, controlling a browser through a sub-thread to log in to a website, and crawling website data; monitoring the state of the crawler task during its execution, marking completed crawler tasks, and ending the corresponding sub-threads; and, after all sub-threads have ended, checking the integrity of the crawled data through the main thread, continuing to crawl missing data until the crawled data is complete, generating a log report according to the task completion status, and pushing the log report to the server. By starting multithreaded scheduling, the invention drives a plurality of browsers separately and, after obtaining the current environment parameters of each browser, issues simulated requests to acquire and store data. This addresses the problems that logging in requires scanning a code in the browser interface, that request parameters are complex and hard to analyze, and that a single browser is inefficient.

Description

Multithreaded browser driven crawler method, system and readable storage medium

Technical Field

The present application relates to the field of data processing and data transmission, and more particularly, to a method, system and readable storage medium for driving crawlers by a multithreaded browser.

Background

Existing multithreaded crawler technology mainly analyzes the interfaces of a target site and then captures data through requests simulated in code. When a site applies special encryption to its network request parameters, this approach suffers from high web page development difficulty, long development cycles, difficulty in debugging and analyzing from logs, frequent account-password logins, poor stability when connections drop, and similar problems.

Therefore, the prior art has defects, and improvement is needed.

Disclosure of Invention

In view of the above problems, an object of the present invention is to provide a method, a system and a readable storage medium for driving crawlers by a multithreaded browser, which can efficiently, stably and rapidly crawl a plurality of target websites in a large-scale website data crawling process, so as to improve data acquisition efficiency and quality.

The first aspect of the present invention provides a method for driving a crawler by a multithreaded browser, comprising:

obtaining a crawler task;

analyzing according to the crawler task, controlling a browser to log in a website through a sub-thread, and crawling website data;

monitoring the state of the crawler task in the execution process of the crawler task, marking the completed crawler task, and ending the corresponding sub-thread;

and after all the sub-threads are finished, detecting the integrity of the crawled data through the main thread, continuously crawling the missed data until the crawled data is complete, generating a log report according to the task completion condition, and pushing the log report to the server.

This scheme further comprises:

starting a main thread and a sub thread through a Python language;

initializing a task pool through the main thread;

and starting a plurality of browsers through the sub-threads.

In this scheme, analyzing according to the crawler task, controlling a browser through a sub-thread to log in to a website, and crawling website data comprises:

distributing the crawler task to the sub-threads through the main thread;

marking the crawler task based on the ID of the assigned crawler task;

the sub-thread accesses a login page through a browser according to the assigned crawler task, and judges whether the login page needs code scanning login or not;

if yes, displaying the two-dimensional code and informing the manual code scanning;

otherwise, directly logging in;

and detecting a login state, and after successful login, accessing a specific webpage by the child thread through a browser and crawling website data through the webpage.

In this scheme, monitoring the state of the crawler task during its execution, marking completed crawler tasks, and ending the corresponding sub-threads comprises:

analyzing according to the task state of the subtask in the subthread, and judging whether the subtask is completed or not;

if not, continuing to monitor the task state of the subtask;

if yes, marking the subtasks as completed;

ending the sub-thread when all sub-tasks in the task pool of the sub-thread are marked as completed;

and after all the sub-threads are finished, sending a completion notification to the main thread.

This scheme further comprises:

analyzing according to the task state of the sub-task in the sub-thread, recording the number of task interruptions when a task is interrupted, and judging whether the number of interruptions is greater than a first preset threshold;

if not, logging in to the website again and continuing to crawl the interrupted sub-task;

if yes, ending the interrupted sub-task and canceling the binding between the interrupted sub-task and the corresponding sub-thread;

and distributing the interrupted sub-task to other sub-threads, and continuing to crawl the website data corresponding to the interrupted sub-task through the other sub-threads.

In this scheme, after all sub-threads have ended, checking the integrity of the crawled data through the main thread, continuing to crawl missing data until the crawled data is complete, and generating a log report according to the task completion status and pushing it to the server comprises:

the main thread detects the data integrity of all tasks and marks the tasks with incomplete data;

analyzing according to the marked task to obtain missing data;

generating a missing-data crawler task according to the missing data, and starting sub-threads to process the missing-data crawler task;

detecting the integrity of the missing data after the task of the missing data crawler is finished, and judging whether the missing data is completely crawled;

if not, repeatedly generating missing-data crawler tasks for the data still missing until the crawler task is fully completed;

if yes, generating a task log based on the task completion time and the task content.

The second aspect of the present invention provides a multithreaded browser driven crawler system, comprising a memory and a processor, wherein the memory comprises a multithreaded browser driven crawler method program, and the multithreaded browser driven crawler method program when executed by the processor realizes the following steps:

obtaining a crawler task;

This scheme further comprises:

starting a main thread and a sub thread through a Python language;

initializing a task pool through the main thread;

and starting a plurality of browsers through the sub-threads.

distributing the crawler task to the sub-threads through the main thread;

marking the crawler task based on the ID of the assigned crawler task;

otherwise, directly logging in;

if not, continuing to monitor the task state of the subtask;

if yes, marking the subtasks as completed;

This scheme further comprises:

analyzing according to the marked task to obtain missing data;

A third aspect of the present invention provides a computer readable storage medium having embodied therein a multithreaded browser driven crawler method program which, when executed by a processor, implements the steps of a multithreaded browser driven crawler method as described in any of the preceding claims.

Drawings

FIG. 1 illustrates a flow chart of a method of the present invention for a multithreaded browser to drive a crawler;

FIG. 2 illustrates a flow chart of a crawler task status monitoring method of the present invention;

FIG. 3 illustrates a flow chart of a method for resuming an interrupted crawler task from a breakpoint in accordance with the present invention;

FIG. 4 illustrates a block diagram of a multi-threaded browser driven crawler system of the present invention;

FIG. 5 illustrates a schematic diagram of a multithreaded browser driven crawler framework in accordance with the present invention.

Description of the embodiments

In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, in the case of no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.

FIG. 1 illustrates a flow chart of a method of the present invention for a multithreaded browser driven crawler.

As shown in fig. 1, the invention discloses a method for driving crawlers by a multithreaded browser, which comprises the following steps:

S102, acquiring a crawler task;

S104, analyzing according to the crawler task, controlling a browser to log in a website through a sub-thread, and crawling website data;

S106, monitoring the state of the crawler task in the execution process of the crawler task, marking the completed crawler task, and ending the corresponding sub-thread;

and S108, after all the sub-threads are finished, detecting the integrity of the crawled data through the main thread, continuously crawling the missed data until the crawled data is complete, generating a log report according to the task completion condition, and pushing the log report to the server.

According to an embodiment of the invention, as shown in FIG. 5, after a crawler task is acquired, a main thread is started using the Python language. The main-thread framework generates and initializes a task pool, and a plurality of sub-threads are started, each controlling its own browser. The crawler task is then divided into a plurality of sub-tasks, which are distributed evenly among the task pools of the sub-threads. Each sub-thread begins working on its assigned sub-tasks: it controls the browser to scan the code, log in, and extract the post-login environment variables in sequence, and then issues simulated requests for the target data and stores that data in the database. While the crawler task is running, its transmission status is checked. When an interruption is detected, an interruption warning is sent and a screenshot of the code-scanning login page is generated to await manual scanning; once the scan succeeds, the crawler task resumes from the breakpoint. A sub-thread ends after all of its sub-tasks have been crawled. After all sub-threads have ended, the main thread verifies data integrity and re-acquires missing data until the data is complete. Finally, a log report is generated according to the task completion status and uploaded to the server. Through multithreaded driving, disconnection early warning, and breakpoint resumption, the invention improves crawler efficiency, adds exception-handling and fault-tolerance mechanisms, and ensures the stability and reliability of the whole crawler system.

According to an embodiment of the present invention, the method further comprises:

starting a main thread and a sub thread through a Python language;

initializing a task pool through the main thread;

and starting a plurality of browsers through the sub-threads.

Before the crawler task starts, a main thread is started using the Python language. The main thread initializes and generates the current task pool and records it in a MySQL database, and a plurality of browsers are then started by the sub-threads. Each thread controls one browser using Selenium (a tool for web application testing).
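The start-up step above can be sketched roughly as follows, assuming Python's standard `threading` and `queue` modules. The names `run_crawler`, `make_browser`, and `StubBrowser` are hypothetical and do not come from the patent. With Selenium installed, `make_browser` could be `selenium.webdriver.Chrome`; a stub is used here so the sketch runs without a browser, and the recording of the task pool in MySQL is omitted.

```python
import queue
import threading


class StubBrowser:
    """Hypothetical stand-in for a real browser driver."""

    def __init__(self):
        self.current_url = None

    def get(self, url):
        # A real driver would load the page; the stub only records the URL.
        self.current_url = url

    def quit(self):
        pass


def run_crawler(urls, make_browser=StubBrowser, num_threads=3):
    """Main thread: build the task pool, start sub-threads, wait for them."""
    task_pool = queue.Queue()
    for url in urls:
        task_pool.put(url)

    visited = []
    visited_lock = threading.Lock()

    def worker():
        browser = make_browser()  # one browser per sub-thread
        try:
            while True:
                try:
                    url = task_pool.get_nowait()
                except queue.Empty:
                    return  # pool drained, so this sub-thread ends
                browser.get(url)
                with visited_lock:
                    visited.append(browser.current_url)
        finally:
            browser.quit()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # the main thread waits until every sub-thread has ended
    return visited
```

A shared queue is only one way to hand out work; the embodiments also describe per-thread task pools, which would replace the single `task_pool` here.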

According to an embodiment of the present invention, the analyzing according to the crawler task, controlling a browser to log in a website through a sub-thread, and crawling website data includes:

distributing the crawler task to the sub-threads through the main thread;

marking the crawler task based on the ID of the assigned crawler task;

otherwise, directly logging in;

After the main thread has distributed the crawler task, each sub-thread acquires its crawler task according to the distribution data, marks the task ID of that crawler task in the database, and is bound to the assigned task. Other threads are thereby refused access to the current task, which prevents another sub-thread from processing the same sub-task again. After acquiring the task, the sub-thread accesses the login page and logs in according to the user information set in the system. The login mode may be account-password login, verification-code login, or code-scanning login. Account-password login and verification-code login can be performed directly from the user information set in the system, whereas code-scanning login requires a manual scan: in that case, the browser page displays the login two-dimensional code and an administrator scans it manually.
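One way to picture the task-binding step is a conditional update that succeeds only for the first sub-thread to claim a task. The sketch below is an illustration under assumptions: the patent records tasks in MySQL, while this example substitutes Python's built-in `sqlite3` so that it is self-contained, and the table name `task`, its columns, and the function names are all hypothetical.

```python
import sqlite3


def init_pool(conn, task_ids):
    """Create a hypothetical task table and load the pending task IDs."""
    conn.execute(
        "CREATE TABLE task (id INTEGER PRIMARY KEY, owner TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO task (id, owner, status) VALUES (?, NULL, 'pending')",
        [(task_id,) for task_id in task_ids],
    )
    conn.commit()


def claim_task(conn, task_id, thread_name):
    """Bind a task to a sub-thread; return False if it is already bound."""
    cur = conn.execute(
        "UPDATE task SET owner = ?, status = 'running' "
        "WHERE id = ? AND owner IS NULL",
        (thread_name, task_id),
    )
    conn.commit()
    # Exactly one row changes only when no other thread owned the task.
    return cur.rowcount == 1
```

Because the ownership check and the update happen in a single statement, a second claim on the same task ID changes no rows and is refused.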

The login state is checked at regular intervals. After a successful login, the sub-thread controls the browser to access the specific web pages, obtains the browser's environment parameters, issues simulated requests to obtain website data, and sorts and stores the website data in the database.
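The periodic login check could be sketched as a simple polling loop. This is a minimal illustration, not the patented implementation: `is_logged_in` is a hypothetical callback (for example, one that checks whether a session cookie exists), and the `interval` and `timeout` values are assumptions, since the patent does not state them.

```python
import time


def wait_for_login(is_logged_in, interval=2.0, timeout=120.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll the login state until it succeeds or the timeout elapses."""
    deadline = clock() + timeout
    while clock() < deadline:
        if is_logged_in():
            return True  # logged in, the sub-thread can start crawling
        sleep(interval)  # wait before the next check
    return False  # still not logged in, e.g. nobody scanned the code
```

A sub-thread would call this after displaying the two-dimensional code and proceed to the target pages only when it returns `True`.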

FIG. 2 illustrates a flow chart of a crawler task status monitoring method of the present invention.

As shown in fig. 2, according to an embodiment of the present invention, during execution of a crawler task, monitoring a status of the crawler task, marking a completed crawler task, and ending a corresponding sub-thread, including:

S202, analyzing according to the task state of the subtask in the subthread, and judging whether the subtask is completed or not;

S204, if not, continuing to monitor the task state of the subtask;

S206, if yes, marking the subtasks as completed;

S208, ending the sub-thread when all sub-tasks in the task pool of the sub-thread are marked as completed;

S210, after all the sub-threads are finished, a completion notification is sent to the main thread.

It should be noted that each sub-thread is assigned one or more sub-tasks. When a sub-thread is assigned a plurality of sub-tasks, a sub-task pool is built for that sub-thread from them. While the sub-tasks execute, the sub-thread is monitored in real time to detect whether each sub-task is complete. When a sub-task has been fully crawled, it is marked as completed and the next remaining sub-task in the task pool is acquired, until all tasks are complete and the current sub-thread ends. After all sub-threads have ended, the system automatically sends a completion notification to the main thread.
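Steps S202 to S210 can be illustrated with per-thread sub-task pools and a notification queue. The sketch assumes a hypothetical `crawl` callable and an `assignments` mapping from sub-thread name to sub-task IDs; neither name comes from the patent, and the status values are placeholders.

```python
import queue
import threading


def run_sub_threads(assignments, crawl):
    """assignments maps a sub-thread name to its list of sub-task IDs."""
    status = {tid: "pending" for ids in assignments.values() for tid in ids}
    status_lock = threading.Lock()
    finished = queue.Queue()  # sub-threads report here when they end

    def worker(name, sub_tasks):
        try:
            for tid in sub_tasks:
                crawl(tid)
                with status_lock:
                    status[tid] = "completed"  # mark the finished sub-task
        finally:
            finished.put(name)  # notify the main thread that this thread ended

    threads = [
        threading.Thread(target=worker, args=(name, ids))
        for name, ids in assignments.items()
    ]
    for t in threads:
        t.start()
    # The main thread blocks until every sub-thread has reported.
    ended = {finished.get() for _ in threads}
    for t in threads:
        t.join()
    return status, ended
```

Once `ended` contains every sub-thread name, the main thread can move on to the integrity check.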

FIG. 3 illustrates a flow chart of a method for resuming an interrupted crawler task from a breakpoint in accordance with the present invention.

As shown in fig. 3, according to an embodiment of the present invention, further includes:

S302, analyzing according to the task state of the sub-task in the sub-thread, recording the number of task interruptions when a task is interrupted, and judging whether the number of interruptions is greater than a first preset threshold;

S304, if not, logging in to the website again and continuing to crawl the interrupted sub-task;

S306, if yes, ending the interrupted sub-task and canceling the binding between the interrupted sub-task and the corresponding sub-thread;

and S308, distributing the interrupted sub-task to other sub-threads, and continuing to crawl the website data corresponding to the interrupted sub-task through the other sub-threads.

It should be noted that, when an interruption of a sub-task is detected in a sub-thread, the system analyzes the recorded number of task interruptions. If the number of interruptions is less than or equal to the first preset threshold, the sub-thread controls the browser to enter the login page and perform code-scanning login again, and then, based on the data already crawled, resumes crawling the website data from the breakpoint in the current sub-thread. If the number of interruptions is greater than the first preset threshold, the current task is ended and marked as unfinished so that other sub-threads are allowed to acquire it. A warning and a screenshot of the code-scanning login page are sent to await manual scanning, the login state is checked at regular intervals, and after a successful login the remaining tasks are carried on until all tasks in the task pool are complete, whereupon the sub-thread ends.
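The interruption logic of S302 to S308 might look like the following sketch. It assumes an interruption surfaces as a `ConnectionError`, that progress is tracked as an index into a list of pages, and that `fetch` and `relogin` are hypothetical callables supplied by the caller; the patent specifies none of these details.

```python
def crawl_with_resume(pages, fetch, relogin, max_interruptions=3):
    """Crawl pages in order, resuming from the breakpoint after interruptions.

    Returns (collected, next_index, released). released is True when the
    interruption count exceeded the threshold, meaning the sub-task is given
    up so another sub-thread can continue from next_index.
    """
    collected = []
    index = 0
    interruptions = 0
    while index < len(pages):
        try:
            collected.append(fetch(pages[index]))
            index += 1  # advance the breakpoint only after a success
        except ConnectionError:
            interruptions += 1
            if interruptions > max_interruptions:
                return collected, index, True  # unbind and hand over
            relogin()  # log in again, then retry the same page
    return collected, index, False
```

Returning `next_index` lets whichever sub-thread takes over continue from the breakpoint instead of starting again.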

According to the embodiment of the invention, after all the sub-threads are finished, the integrity of the crawl data is detected by the main thread, the crawl is continued to be carried out on the missing data until the crawl data is complete, and a log report is generated and pushed to the server according to the task completion condition, which comprises the following steps:

analyzing according to the marked task to obtain missing data;

After all the sub-threads have ended, the main thread automatically receives a task completion notification. The main thread then compares the data stored for the current date with the data stored for historical dates to judge the data integrity of the sub-tasks. For incomplete or missing data, a task pool is generated based on the missing data and sub-threads are started again to run the missing-data crawler tasks. After the missing-data crawler tasks finish, data integrity is checked again; if the data is still incomplete, further missing-data crawler tasks are generated, and the check repeats until the data is complete and the crawler task is finished. Finally, the task logs for the current date are summarized based on task completion time and task content and pushed to a display terminal.
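The check-and-recrawl loop can be sketched as a set difference repeated until nothing is missing. This is an assumption-laden illustration: the patent compares current-date data with historical data without giving the comparison rule, so the sketch simply treats a set of expected record keys as the reference. `crawl_missing` is a hypothetical callable, and `max_rounds` is an added safety bound that the patent does not mention.

```python
def crawl_until_complete(expected_keys, stored, crawl_missing, max_rounds=10):
    """Re-crawl missing records until the stored data is complete.

    stored maps record key to record and is updated in place.
    crawl_missing takes a list of missing keys and returns a dict of the
    records it managed to recover.
    """
    expected = set(expected_keys)
    rounds = 0
    missing = expected - set(stored)
    while missing and rounds < max_rounds:
        # Generate a missing-data crawler task and run it.
        stored.update(crawl_missing(sorted(missing)))
        rounds += 1
        missing = expected - set(stored)  # check integrity again
    return sorted(missing), rounds
```

An empty first return value means the data is complete and the task log can be generated.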

According to an embodiment of the present invention, further comprising:

acquiring execution time of a crawler task;

analyzing according to the execution time of the crawler task, and judging whether the execution time of the crawler task is larger than a first preset time threshold;

if not, continuing recording;

if yes, finishing the crawler task, and generating a task log based on the finishing time and the task content.

It should be noted that, besides the case in which the task completes, the crawler task is also ended and a task log generated if the task times out. The first preset time threshold is obtained by the system from a prediction of the overall crawler task. Recording of the execution time begins when the crawler task starts; when the execution time exceeds the first preset time threshold, the crawler task is ended directly, regardless of whether it has completed, and a task log is generated.
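A minimal sketch of the overall timeout, assuming the crawler task can be expressed as a sequence of callable steps and that the elapsed time is checked before each one. The function name, the `steps` representation, and the status strings are hypothetical.

```python
import time


def run_with_deadline(steps, time_limit, clock=time.monotonic):
    """Run steps in order, stopping once the time limit is exceeded."""
    start = clock()  # execution time is recorded from the task start
    done = []
    for step in steps:
        if clock() - start > time_limit:
            # Over the first preset time threshold: end regardless of progress.
            return done, "timed_out"
        done.append(step())
    return done, "completed"
```

Either status would then be written to the task log together with the end time and the task content.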

According to an embodiment of the present invention, further comprising:

analyzing according to the crawler task, and calculating the estimated completion time of each subtask;

and evenly distributing the crawler task to each sub-thread according to the estimated completion time of each sub-task.

Before the crawler task starts, each sub-task in the crawler task is first simulated by a server to obtain its estimated completion time. All sub-tasks are then distributed evenly among the sub-threads of the crawler task based on those estimates. This keeps each sub-thread working for as much of the website data crawl as possible, which improves the efficiency of the crawler task.
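The patent does not say how the even distribution is computed. One common heuristic, swapped in here purely as an illustration, is to assign the longest sub-tasks first, each to the currently least-loaded sub-thread. The function name and the `estimates` mapping are hypothetical.

```python
import heapq


def distribute(estimates, num_threads):
    """Assign sub-tasks to threads so that total estimated time is balanced.

    estimates maps sub-task ID to estimated completion time.
    Returns one list of sub-task IDs per sub-thread.
    """
    heap = [(0.0, i) for i in range(num_threads)]  # (total load, thread index)
    heapq.heapify(heap)
    pools = [[] for _ in range(num_threads)]
    longest_first = sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)
    for task_id, cost in longest_first:
        load, i = heapq.heappop(heap)  # least-loaded thread so far
        pools[i].append(task_id)
        heapq.heappush(heap, (load + cost, i))
    return pools
```

For estimates of 8, 5, 4 and 3 time units on two threads, this gives loads of 11 and 9, so neither thread sits idle for long.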

According to an embodiment of the present invention, further comprising:

setting an access time interval according to the website attribute;

randomly generating a random time interval through the access time interval;

starting timing after the sub-thread finishes network data crawling, and obtaining the rest time of the sub-thread;

comparing the rest time of the sub-thread with the random time interval, and when the rest time of the sub-thread is greater than the random time interval, starting the sub-thread to continue crawling the network data.

When crawling data from a website, it is first judged whether the target website limits the crawler request frequency. If so, the access time interval is set according to the target website's limit; if not, it is set according to factors such as the complexity of the website's pages, the volume of data, and the stability of the website. Because a crawler may be intercepted on the basis of its access timing, the access time interval is set in the form of a range. A random time interval is then generated within that range, and website data is crawled at that random interval without affecting the normal operation of the website.
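The randomized pause can be sketched with `random.uniform`. The helper names and the idea of expressing the range in seconds are assumptions; the patent states only that a random interval is drawn from the configured range and compared with the sub-thread's rest time.

```python
import random


def pick_interval(min_seconds, max_seconds, rng=random):
    """Draw a random time interval from the configured access range."""
    return rng.uniform(min_seconds, max_seconds)


def may_resume(rest_seconds, interval):
    """A sub-thread resumes only after resting longer than the interval."""
    return rest_seconds > interval
```

Drawing a fresh interval after each crawl keeps the request timing irregular while staying inside whatever range the target website tolerates.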

According to an embodiment of the present invention, further comprising:

setting a corresponding second preset time threshold value for each subtask according to the predicted completion time of each subtask;

recording the execution time of each subtask;

judging whether the execution time of each subtask is greater than a corresponding second preset time threshold;

if yes, ending the sub-task and executing it through another sub-thread;

if not, recording is continued.

It should be noted that, to ensure the efficiency of the crawler task, the execution time is recorded from the start of the crawler task, and a corresponding second preset time threshold, namely the maximum execution time, is set for each sub-task at 1.2 times its predicted completion time. The execution time of each sub-task is compared with its second preset time threshold. When the execution time exceeds the threshold, the current task is executing inefficiently and website data is being acquired slowly. In that case, the current sub-task may be ended, its binding to the sub-thread canceled, and the sub-task processed by another sub-thread.
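As a small worked example of the second threshold: a sub-task predicted to take 100 seconds gets a maximum execution time of 120 seconds. The function names below are hypothetical; the 1.2 factor is the one stated in the description.

```python
def max_execution_time(estimated_seconds, factor=1.2):
    """Second preset time threshold: 1.2 times the predicted completion time."""
    return estimated_seconds * factor


def should_reassign(elapsed_seconds, estimated_seconds, factor=1.2):
    """True when a sub-task has run past its maximum execution time."""
    return elapsed_seconds > max_execution_time(estimated_seconds, factor)
```

When `should_reassign` returns `True`, the sub-task would be unbound from its sub-thread and offered to another one.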

FIG. 4 illustrates a block diagram of a multi-threaded browser driven crawler system of the present invention.

As shown in fig. 4, a second aspect of the present invention provides a multithreaded browser driven crawler system 4, including a memory 41 and a processor 42, where the memory includes a multithreaded browser driven crawler method program, and the multithreaded browser driven crawler method program when executed by the processor implements the following steps:

obtaining a crawler task;

According to an embodiment of the invention, as shown in FIG. 5, after a crawler task is acquired, a main thread is started using the Python language. The main-thread framework generates and initializes a task pool, and a plurality of sub-threads are started, each controlling its own browser. The crawler task is then divided into a plurality of sub-tasks, which are distributed evenly among the task pools of the sub-threads. Each sub-thread begins working on its assigned sub-tasks: it controls the browser to scan the code, log in, and extract the post-login environment variables in sequence, and then issues simulated requests for the target data and stores that data in the database. While the crawler task is running, its transmission status is checked. When an interruption is detected, an interruption warning is sent and a screenshot of the code-scanning login page is generated to await manual scanning; once the scan succeeds, the crawler task resumes from the breakpoint. A sub-thread ends after all of its sub-tasks have been crawled. After all sub-threads have ended, the main thread verifies data integrity and re-acquires missing data until the data is complete. Finally, a log report is generated according to the task completion status and uploaded to the server. Through multithreaded driving, disconnection early warning, and breakpoint resumption, the invention improves crawler efficiency, adds exception-handling and fault-tolerance mechanisms, and ensures the stability and reliability of the whole crawler system.

According to an embodiment of the present invention, further comprising:

starting a main thread and a sub thread through a Python language;

initializing a task pool through the main thread;

and starting a plurality of browsers through the sub-threads.

distributing the crawler task to the sub-threads through the main thread;

marking the crawler task based on the ID of the assigned crawler task;

otherwise, directly logging in;

According to an embodiment of the present invention, in the execution process of a crawler task, monitoring the state of the crawler task, marking the completed crawler task, and ending the corresponding sub-thread, including:

if not, continuing to monitor the task state of the subtask;

if yes, marking the subtasks as completed;

According to an embodiment of the present invention, further comprising:

analyzing according to the marked task to obtain missing data;

According to an embodiment of the present invention, further comprising:

acquiring execution time of a crawler task;

if not, continuing recording;

According to an embodiment of the present invention, further comprising:

setting an access time interval according to the website attribute;

randomly generating a random time interval through the access time interval;

According to an embodiment of the present invention, further comprising:

recording the execution time of each subtask;

if yes, ending the sub-task and executing it through another sub-thread;

if not, recording is continued.

In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are only illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in practice: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communicative connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communicative connection between devices or units may be electrical, mechanical, or in other forms.

The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; can be located in one place or distributed to a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present invention may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.

Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by hardware controlled by program instructions. The program may be stored in a computer readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a removable storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Alternatively, the above-described integrated units of the present invention may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in essence or a part contributing to the prior art in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, ROM, RAM, magnetic or optical disk, or other medium capable of storing program code.

Claims

1. A method for driving a crawler by a multithreaded browser, comprising:

obtaining a crawler task;

2. The multi-threaded browser driven crawler method of claim 1, further comprising:

starting a main thread and a sub thread through a Python language;

initializing a task pool through the main thread;

and starting a plurality of browsers through the sub-threads.

3. The method for driving a crawler by a multithreaded browser according to claim 1, wherein the analyzing according to the crawler task, controlling the browser to log in to a website by a sub-thread, and crawling website data comprises:

distributing the crawler task to the sub-threads through the main thread;

marking the crawler task based on the ID of the assigned crawler task;

otherwise, directly logging in;

4. The method for driving a crawler by a multithreaded browser according to claim 1, wherein monitoring the status of the crawler task, marking completed crawler tasks, and ending corresponding sub-threads during execution of the crawler task comprises:

if not, continuing to monitor the task state of the subtask;

if yes, marking the subtasks as completed;

5. The multi-threaded browser driven crawler method of claim 4, further comprising:

6. The method for driving a crawler by a multithreaded browser according to claim 1, wherein after all the sub-threads are finished, detecting the integrity of the crawled data by the main thread, and continuing crawling the missing data until the crawled data is complete, generating a log report according to the task completion condition, and pushing the log report to the server, comprises:

analyzing according to the marked task to obtain missing data;

7. A multithreaded browser driven crawler system, comprising a memory and a processor, wherein the memory stores a multithreaded browser driven crawler method program which, when executed by the processor, implements the following steps:

obtaining a crawler task;

8. The multi-threaded browser-driven crawler system of claim 7, wherein the analyzing according to the crawler task, controlling the browser to log into the website by the sub-threads, and crawling website data comprises:

distributing the crawler task to the sub-threads through the main thread;

marking the crawler task based on the ID of the assigned crawler task;

otherwise, directly logging in;

9. The multi-threaded browser driven crawler system of claim 7, wherein monitoring the status of a crawler task, marking completed crawler tasks, and ending corresponding sub-threads during execution of the crawler task, comprises:

if not, continuing to monitor the task state of the subtask;

if yes, marking the subtasks as completed;

10. A computer readable storage medium, comprising a multi-threaded browser driven crawler method program, wherein the multi-threaded browser driven crawler method program, when executed by a processor, implements the steps of a multi-threaded browser driven crawler method according to any of claims 1 to 6.