CN116501945A - Multithreaded browser driven crawler method, system and readable storage medium - Google Patents

Info

Publication number
CN116501945A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310765246.7A
Other languages
Chinese (zh)
Inventor
吕振
陈增和
赖浩哲
林兴武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Housley Technology Co ltd
Original Assignee
Shenzhen Housley Technology Co ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/48: Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806: Task transfer initiation or dispatching
    • G06F9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/485: Task life-cycle, e.g. stopping, restarting, resuming execution
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005: Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027: Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5055: Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals, considering software capabilities, i.e. software resources associated or available to the machine
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00: Indexing scheme relating to G06F9/00
    • G06F2209/50: Indexing scheme relating to G06F9/50
    • G06F2209/5018: Thread allocation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a multithreaded browser-driven crawler method, system and readable storage medium. The method comprises: obtaining a crawler task; analyzing the crawler task, controlling a browser through a sub-thread to log in to a website, and crawling website data; monitoring the state of the crawler task during its execution, marking completed crawler tasks, and ending the corresponding sub-thread; and, after all sub-threads have ended, checking the integrity of the crawled data in the main thread, continuing to crawl any missing data until the crawled data is complete, generating a log report according to the task completion status, and pushing the log report to the server. By starting multithreaded scheduling, the invention drives a plurality of browsers separately, obtains the current environment parameters of each browser, and then issues simulated requests to acquire and store the data. This addresses the problems that the browser interface requires QR-code login, that request parameters are complex and difficult to analyze, and that a single browser is inefficient.

Description

Multithreaded browser driven crawler method, system and readable storage medium
Technical Field
The present application relates to the field of data processing and data transmission, and more particularly, to a method, system and readable storage medium for driving crawlers by a multithreaded browser.
Background
Existing multithreaded crawler technology mainly analyzes the interfaces of the target site and then captures data through code-simulated requests. When the site encrypts its network request parameters in special ways, this approach suffers from high development difficulty, long development cycles, difficulty in debugging and in analyzing problems from logs, frequent account-password logins, and poor stability when the connection drops.
Therefore, the prior art has defects, and improvement is needed.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a method, a system and a readable storage medium for driving crawlers by a multithreaded browser, which can efficiently, stably and rapidly crawl a plurality of target websites in a large-scale website data crawling process, so as to improve data acquisition efficiency and quality.
The first aspect of the present invention provides a method for driving a crawler by a multithreaded browser, comprising:
obtaining a crawler task;
analyzing according to the crawler task, controlling a browser to log in a website through a sub-thread, and crawling website data;
monitoring the state of the crawler task in the execution process of the crawler task, marking the completed crawler task, and ending the corresponding sub-thread;
and after all the sub-threads are finished, detecting the integrity of the crawled data through the main thread, continuously crawling the missed data until the crawled data is complete, generating a log report according to the task completion condition, and pushing the log report to the server.
In this scheme, the method further comprises:
starting a main thread and sub-threads using the Python language;
initializing a task pool through the main thread;
and starting a plurality of browsers through the sub-threads.
In this scheme, analyzing the crawler task, controlling a browser through a sub-thread to log in to a website, and crawling website data comprises:
distributing the crawler task to the sub-threads through the main thread;
marking the crawler task based on the ID of the assigned crawler task;
the sub-thread accessing a login page through a browser according to the assigned crawler task, and judging whether the login page requires QR-code login;
if yes, displaying the QR code and notifying an operator to scan it manually;
otherwise, logging in directly;
and detecting the login state and, after successful login, the sub-thread accessing a specific web page through the browser and crawling website data through that page.
In this scheme, monitoring the state of the crawler task during its execution, marking completed crawler tasks, and ending the corresponding sub-thread comprises:
analyzing the task state of a subtask in the sub-thread, and judging whether the subtask is completed;
if not, continuing to monitor the task state of the subtask;
if yes, marking the subtask as completed;
ending the sub-thread when all subtasks in the task pool of the sub-thread are marked as completed;
and after all sub-threads have ended, sending a completion notification to the main thread.
In this scheme, the method further comprises:
analyzing the task state of the subtasks in the sub-thread, recording the number of interruptions when a subtask is interrupted, and judging whether the number of interruptions is greater than a first preset threshold;
if not, logging in to the website again and resuming the crawl of the interrupted subtask;
if yes, ending the interrupted subtask and releasing the binding between the interrupted subtask and its sub-thread;
and assigning the interrupted subtask to another sub-thread, which continues crawling the website data corresponding to the interrupted subtask.
In this scheme, after all sub-threads have ended, checking the integrity of the crawled data through the main thread, continuing to crawl any missing data until the crawled data is complete, and generating a log report according to the task completion status and pushing it to the server comprises:
the main thread checking the data integrity of all tasks and marking the tasks whose data is incomplete;
analyzing the marked tasks to obtain the missing data;
generating a missing-data crawler task according to the missing data, and starting a sub-thread to process the missing-data crawler task;
after the missing-data crawler task ends, checking the integrity of the missing data and judging whether the missing data has been completely crawled;
if not, repeatedly generating missing-data crawler tasks for the data still missing until the crawl is complete;
if yes, generating a task log based on the task completion time and the task content.
The second aspect of the present invention provides a multithreaded browser-driven crawler system, comprising a memory and a processor, wherein the memory stores a program for the multithreaded browser-driven crawler method, and the program, when executed by the processor, implements the following steps:
obtaining a crawler task;
analyzing according to the crawler task, controlling a browser to log in a website through a sub-thread, and crawling website data;
monitoring the state of the crawler task in the execution process of the crawler task, marking the completed crawler task, and ending the corresponding sub-thread;
and after all sub-threads have ended, checking the integrity of the crawled data through the main thread, continuing to crawl any missing data until the crawled data is complete, generating a log report according to the task completion status, and pushing the log report to the server.
In this scheme, the steps further comprise:
starting a main thread and sub-threads using the Python language;
initializing a task pool through the main thread;
and starting a plurality of browsers through the sub-threads.
In this scheme, analyzing the crawler task, controlling a browser through a sub-thread to log in to a website, and crawling website data comprises:
distributing the crawler task to the sub-threads through the main thread;
marking the crawler task based on the ID of the assigned crawler task;
the sub-thread accessing a login page through a browser according to the assigned crawler task, and judging whether the login page requires QR-code login;
if yes, displaying the QR code and notifying an operator to scan it manually;
otherwise, logging in directly;
and detecting the login state and, after successful login, the sub-thread accessing a specific web page through the browser and crawling website data through that page.
In this scheme, monitoring the state of the crawler task during its execution, marking completed crawler tasks, and ending the corresponding sub-thread comprises:
analyzing the task state of a subtask in the sub-thread, and judging whether the subtask is completed;
if not, continuing to monitor the task state of the subtask;
if yes, marking the subtask as completed;
ending the sub-thread when all subtasks in the task pool of the sub-thread are marked as completed;
and after all sub-threads have ended, sending a completion notification to the main thread.
In this scheme, the steps further comprise:
analyzing the task state of the subtasks in the sub-thread, recording the number of interruptions when a subtask is interrupted, and judging whether the number of interruptions is greater than a first preset threshold;
if not, logging in to the website again and resuming the crawl of the interrupted subtask;
if yes, ending the interrupted subtask and releasing the binding between the interrupted subtask and its sub-thread;
and assigning the interrupted subtask to another sub-thread, which continues crawling the website data corresponding to the interrupted subtask.
In this scheme, after all sub-threads have ended, checking the integrity of the crawled data through the main thread, continuing to crawl any missing data until the crawled data is complete, and generating a log report according to the task completion status and pushing it to the server comprises:
the main thread checking the data integrity of all tasks and marking the tasks whose data is incomplete;
analyzing the marked tasks to obtain the missing data;
generating a missing-data crawler task according to the missing data, and starting a sub-thread to process the missing-data crawler task;
after the missing-data crawler task ends, checking the integrity of the missing data and judging whether the missing data has been completely crawled;
if not, repeatedly generating missing-data crawler tasks for the data still missing until the crawl is complete;
if yes, generating a task log based on the task completion time and the task content.
A third aspect of the present invention provides a computer readable storage medium having embodied therein a multithreaded browser driven crawler method program which, when executed by a processor, implements the steps of a multithreaded browser driven crawler method as described in any of the preceding claims.
The invention discloses a multithreaded browser-driven crawler method, system and readable storage medium. The method comprises: obtaining a crawler task; analyzing the crawler task, controlling a browser through a sub-thread to log in to a website, and crawling website data; monitoring the state of the crawler task during its execution, marking completed crawler tasks, and ending the corresponding sub-thread; and, after all sub-threads have ended, checking the integrity of the crawled data in the main thread, continuing to crawl any missing data until the crawled data is complete, generating a log report according to the task completion status, and pushing the log report to the server. By starting multithreaded scheduling, the invention drives a plurality of browsers separately, obtains the current environment parameters of each browser, and then issues simulated requests to acquire and store the data. This addresses the problems that the browser interface requires QR-code login, that request parameters are complex and difficult to analyze, and that a single browser is inefficient.
Drawings
FIG. 1 illustrates a flow chart of a method of the present invention for a multithreaded browser to drive a crawler;
FIG. 2 illustrates a flow chart of a crawler task status monitoring method of the present invention;
FIG. 3 illustrates a flow chart of the interruption handling (breakpoint-resume) method for a crawler task of the present invention;
FIG. 4 illustrates a block diagram of a multi-threaded browser driven crawler system of the present invention;
FIG. 5 illustrates a schematic diagram of a multithreaded browser driven crawler framework in accordance with the present invention.
Description of the embodiments
In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, in the case of no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.
FIG. 1 illustrates a flow chart of a method of the present invention for a multithreaded browser driven crawler.
As shown in fig. 1, the invention discloses a method for driving crawlers by a multithreaded browser, which comprises the following steps:
S102, acquiring a crawler task;
S104, analyzing the crawler task, controlling a browser through a sub-thread to log in to a website, and crawling website data;
S106, monitoring the state of the crawler task during its execution, marking completed crawler tasks, and ending the corresponding sub-thread;
S108, after all sub-threads have ended, checking the integrity of the crawled data through the main thread, continuing to crawl any missing data until the crawled data is complete, generating a log report according to the task completion status, and pushing the log report to the server.
According to the embodiment of the invention, as shown in fig. 5, after a crawler task is acquired, a main thread is started using the Python language, the main-thread framework generates and initializes a task pool, and a plurality of sub-threads are started, each controlling its own browser. The crawler task is then divided into a plurality of subtasks, which are evenly distributed to the task pools of the sub-threads. Each sub-thread starts working on its assigned subtasks: it controls the browser to perform QR-code login, extracts the environment variables available after login, and then issues simulated requests for the target data and stores the target data in the database. During the crawler task, the system checks the task for interruptions; when an interruption is detected, it sends an interruption warning, generates a screenshot of the QR-code login page, waits for manual scanning, and resumes the crawler task from the breakpoint once the scan succeeds. A sub-thread ends after all of its subtasks have been crawled. After all sub-threads have ended, the main thread verifies the data integrity and re-acquires missing data until the data is complete, and finally a log report is generated according to the task completion status and uploaded to the server. Through multithreaded driving, anti-disconnection early warning and breakpoint-resume crawling, the invention improves crawler efficiency and adds exception handling and fault tolerance, which helps ensure the stability and reliability of the whole crawler system.
According to an embodiment of the present invention, the method further comprises:
starting a main thread and sub-threads using the Python language;
initializing a task pool through the main thread;
and starting a plurality of browsers through the sub-threads.
Before the crawler task starts, a main thread is started using the Python language; the current task pool is generated and initialized in the main thread and recorded in a MySQL database; and a plurality of browsers are then started by the sub-threads. Each thread controls one browser using Selenium (a tool for web application testing).
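The start-up sequence described above can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of the invention: the names run_crawl, browser_factory, FakeBrowser and the fetch method are hypothetical, the task pool is an in-memory queue shared by all sub-threads rather than a MySQL-backed pool with per-thread distribution, and a stand-in class replaces the Selenium-controlled browser so that the sketch runs without a real browser.

```python
import queue
import threading


def run_crawl(tasks, browser_factory, num_threads=3):
    """Main thread: build a task pool, start sub-threads, wait for them all."""
    pool = queue.Queue()
    for task in tasks:
        pool.put(task)

    results = {}
    lock = threading.Lock()

    def worker():
        # Each sub-thread owns exactly one browser instance.
        browser = browser_factory()
        try:
            while True:
                try:
                    task = pool.get_nowait()
                except queue.Empty:
                    break  # pool exhausted: this sub-thread ends
                data = browser.fetch(task)  # hypothetical method
                with lock:
                    results[task] = data
        finally:
            browser.quit()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # the main thread continues only after every sub-thread ends
    return results


class FakeBrowser:
    """Hypothetical stand-in for a browser driven through Selenium."""

    def fetch(self, url):
        return f"data from {url}"

    def quit(self):
        pass
```

In a real deployment, browser_factory would presumably return a thin wrapper around a Selenium WebDriver instance; that wrapper is left out here because its details depend on the target website.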
According to an embodiment of the present invention, analyzing the crawler task, controlling a browser through a sub-thread to log in to a website, and crawling website data comprises:
distributing the crawler task to the sub-threads through the main thread;
marking the crawler task based on the ID of the assigned crawler task;
the sub-thread accessing a login page through a browser according to the assigned crawler task, and judging whether the login page requires QR-code login;
if yes, displaying the QR code and notifying an operator to scan it manually;
otherwise, logging in directly;
and detecting the login state and, after successful login, the sub-thread accessing a specific web page through the browser and crawling website data through that page.
After the main thread finishes distributing the crawler task, each sub-thread acquires its crawler task according to the distribution data, marks the task ID of the crawler task in the database, and binds itself to the assigned task. Other threads are then refused access to the current task, which prevents another sub-thread from processing the same subtask again. After a sub-thread acquires a task, it accesses the login page and logs in with the user information configured in the system. The login mode may be account-password login, verification-code login, or QR-code login. Account-password login and verification-code login can be performed directly from the configured user information, whereas QR-code login requires a manual scan: the browser page displays the login QR code and an administrator scans it.
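One possible way to mark a task ID in the database so that only one sub-thread can hold it is a conditional update, sketched below. The table layout, the column names and the helper functions are assumptions made for illustration, and an in-memory SQLite database stands in for the MySQL database mentioned in the description.

```python
import sqlite3


def make_task_table(task_ids):
    """Create a hypothetical task table; owner is NULL while a task is unbound."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, owner TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO tasks (id, owner, status) VALUES (?, NULL, 'pending')",
        [(task_id,) for task_id in task_ids],
    )
    conn.commit()
    return conn


def claim_task(conn, task_id, thread_name):
    """Bind a task to a sub-thread; returns False if another thread holds it."""
    cur = conn.execute(
        "UPDATE tasks SET owner = ?, status = 'running' "
        "WHERE id = ? AND owner IS NULL",
        (thread_name, task_id),
    )
    conn.commit()
    # Exactly one row changes only when the task was still unbound.
    return cur.rowcount == 1


def release_task(conn, task_id):
    """Cancel the binding so that other sub-threads may acquire the task."""
    conn.execute(
        "UPDATE tasks SET owner = NULL, status = 'pending' WHERE id = ?",
        (task_id,),
    )
    conn.commit()
```

Because the check and the update happen in a single statement, two sub-threads racing for the same task ID cannot both succeed. With a real multithreaded crawler, each sub-thread would normally use its own database connection.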
The login state is checked periodically. After a successful login, the sub-thread controls the browser to access the specific web pages, obtains the environment parameters of the browser, issues simulated requests to obtain the website data, and then organizes the data and stores it in the database.
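The periodic login check and the reuse of browser environment parameters might look like the sketch below. It assumes that the relevant environment parameters are cookies, in the list-of-dictionaries shape that Selenium's get_cookies() returns; the function names, the timeout and the polling interval are hypothetical, and how a page signals a successful login is site-specific, so it is passed in as a callable.

```python
import time


def wait_for_login(is_logged_in, timeout=120.0, poll_interval=2.0, sleep=time.sleep):
    """Poll the login state at regular intervals until success or timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_logged_in():
            return True
        sleep(poll_interval)
    return False


def cookie_header(cookies):
    """Build a Cookie header value from a list of {'name', 'value'} dicts.

    The header can then be attached to simulated HTTP requests so that they
    reuse the session established in the browser.
    """
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)
```

A sub-thread would call wait_for_login while the QR code is displayed, and build the header only once it returns True.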
FIG. 2 illustrates a flow chart of a crawler task status monitoring method of the present invention.
As shown in fig. 2, according to an embodiment of the present invention, monitoring the state of the crawler task during its execution, marking completed crawler tasks, and ending the corresponding sub-thread comprises:
S202, analyzing the task state of a subtask in the sub-thread, and judging whether the subtask is completed;
S204, if not, continuing to monitor the task state of the subtask;
S206, if yes, marking the subtask as completed;
S208, ending the sub-thread when all subtasks in the task pool of the sub-thread are marked as completed;
S210, after all sub-threads have ended, sending a completion notification to the main thread.
It should be noted that each sub-thread is assigned one or more subtasks. When a sub-thread is assigned a plurality of subtasks, a subtask pool is built for that sub-thread from those subtasks. While subtasks are executing, the sub-thread is monitored in real time to detect whether each subtask has completed. When a subtask has been fully crawled, it is marked as completed and the next remaining subtask is taken from the task pool; once all subtasks are completed, the sub-thread ends. After all sub-threads have ended, the system automatically sends a completion notification to the main thread.
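Steps S202 to S210 can be illustrated with per-thread subtask pools and a shared completion signal. This is a simplified sketch under stated assumptions: run_sub_threads and the crawl callable are hypothetical names, subtasks are identified by strings, and a threading.Event plays the role of the completion notification sent to the main thread.

```python
import threading


def run_sub_threads(pools, crawl):
    """pools maps a thread name to its list of subtasks; crawl(subtask) does the work."""
    status = {sub: "pending" for subs in pools.values() for sub in subs}
    lock = threading.Lock()
    remaining = len(pools)          # sub-threads that have not ended yet
    all_done = threading.Event()    # the completion notification
    if remaining == 0:
        all_done.set()

    def worker(subtasks):
        nonlocal remaining
        try:
            for sub in subtasks:
                crawl(sub)
                with lock:
                    status[sub] = "completed"   # mark the subtask as completed
        finally:
            # The sub-thread ends; the last one to end notifies the main thread.
            with lock:
                remaining -= 1
                if remaining == 0:
                    all_done.set()

    for name, subs in pools.items():
        threading.Thread(target=worker, args=(subs,), name=name).start()

    all_done.wait()   # the main thread blocks until the notification arrives
    return status
```

The decrement sits in a finally clause so that a sub-thread that fails still counts as ended and the main thread is not left waiting.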
FIG. 3 illustrates a flow chart of the interruption handling (breakpoint-resume) method for a crawler task of the present invention.
As shown in fig. 3, according to an embodiment of the present invention, the method further comprises:
S302, analyzing the task state of the subtasks in the sub-thread, recording the number of interruptions when a subtask is interrupted, and judging whether the number of interruptions is greater than a first preset threshold;
S304, if not, logging in to the website again and resuming the crawl of the interrupted subtask;
S306, if yes, ending the interrupted subtask and releasing the binding between the interrupted subtask and its sub-thread;
S308, assigning the interrupted subtask to another sub-thread, which continues crawling the website data corresponding to the interrupted subtask.
It should be noted that, when a subtask interruption is detected in a sub-thread, the system analyzes the recorded number of interruptions. If the number of interruptions is less than or equal to the first preset threshold, the sub-thread controls the browser to enter the login page and perform QR-code login again, and then, based on the data already crawled, the current sub-thread resumes crawling the website data from the breakpoint. If the number of interruptions is greater than the first preset threshold, the current task is ended and marked as unfinished so that other sub-threads may acquire it; a warning and a screenshot of the QR-code login page are sent, and the system waits for manual scanning. The login state is checked periodically, and after a successful login the remaining tasks continue to be processed; the sub-thread ends once all tasks in its task pool are completed.
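The threshold decision in steps S302 to S308 reduces to a small piece of bookkeeping, sketched below. The function name and the relogin, resume and reassign callables are hypothetical placeholders for the site-specific actions described above.

```python
def handle_interrupt(task, interrupt_counts, threshold, relogin, resume, reassign):
    """Record one interruption of a task and decide how to continue.

    interrupt_counts maps a task to the number of interruptions seen so far.
    """
    interrupt_counts[task] = interrupt_counts.get(task, 0) + 1
    if interrupt_counts[task] <= threshold:
        # Not above the first preset threshold: log in again and resume the
        # same subtask from its breakpoint in the current sub-thread.
        relogin()
        resume(task)
        return "resumed"
    # Above the threshold: end the subtask here and hand it to another sub-thread.
    reassign(task)
    return "reassigned"
```

With a threshold of 2, for example, the first two interruptions of a subtask lead to a re-login and resume, and the third leads to reassignment.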
According to the embodiment of the invention, after all sub-threads have ended, checking the integrity of the crawled data through the main thread, continuing to crawl any missing data until the crawled data is complete, and generating a log report according to the task completion status and pushing it to the server comprises:
the main thread checking the data integrity of all tasks and marking the tasks whose data is incomplete;
analyzing the marked tasks to obtain the missing data;
generating a missing-data crawler task according to the missing data, and starting a sub-thread to process the missing-data crawler task;
after the missing-data crawler task ends, checking the integrity of the missing data and judging whether the missing data has been completely crawled;
if not, repeatedly generating missing-data crawler tasks for the data still missing until the crawl is complete;
if yes, generating a task log based on the task completion time and the task content.
After all sub-threads have ended, the main thread automatically receives the task completion notification. The main thread then compares the data stored for the current date with the data stored for historical dates to judge the data integrity of the subtasks. For incomplete data, it generates a task pool based on the missing data and starts sub-threads again, scheduling them to run the missing-data crawler tasks. After a missing-data crawler task completes, the data integrity check is run again; if the data is still incomplete, further missing-data crawler tasks are generated, and the check repeats until the data is complete and the crawler task is finished. Finally, the task logs for the current date are summarized based on task completion time and task content and pushed to a display terminal.
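The check-and-recrawl loop can be sketched as below. It assumes that completeness is judged by comparing the keys stored today against a set of expected keys (for instance, keys derived from historical data); the function names are hypothetical, and the max_rounds safety limit is an addition of this sketch, since the description itself loops until the data is complete.

```python
def crawl_until_complete(expected_keys, stored, crawl_missing, max_rounds=10):
    """Re-crawl missing records until everything expected has been stored.

    expected_keys: keys that should exist (assumed to come from historical data)
    stored:        dict of records stored so far; updated in place
    crawl_missing: callable taking a sorted list of missing keys and returning
                   a dict of whichever records it managed to fetch
    """
    rounds = 0
    missing = set(expected_keys) - set(stored)
    while missing and rounds < max_rounds:
        # Generate a missing-data crawler task and merge whatever it returns.
        stored.update(crawl_missing(sorted(missing)))
        rounds += 1
        missing = set(expected_keys) - set(stored)
    return {"complete": not missing, "rounds": rounds, "missing": sorted(missing)}
```

The returned summary is the kind of information a task log could be built from, alongside the completion time and task content.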
According to an embodiment of the present invention, further comprising:
acquiring execution time of a crawler task;
analyzing according to the execution time of the crawler task, and judging whether the execution time of the crawler task is larger than a first preset time threshold;
if not, continuing recording;
if yes, finishing the crawler task, and generating a task log based on the finishing time and the task content.
It should be noted that a crawler task is ended and a task log generated not only when the task completes but also when it times out. The first preset time threshold is obtained by the system from a prediction over the crawler task as a whole. The execution time of the crawler task is recorded from the moment the task starts; when the execution time exceeds the first preset time threshold, the crawler task is ended directly, regardless of whether it has completed, and a task log is generated.
According to an embodiment of the present invention, further comprising:
analyzing according to the crawler task, and calculating the estimated completion time of each subtask;
and evenly distributing the crawler task to each sub-thread according to the estimated completion time of each sub-task.
Before the crawler task starts, each subtask in the crawler task is first simulated by a server to obtain its predicted completion time. All subtasks are then distributed evenly across the sub-threads of the crawler task according to these predicted completion times. This keeps every sub-thread in a working state as far as possible while website data is being crawled, which improves the efficiency of the crawler task.
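The description does not say which balancing rule is used. One common heuristic for spreading work evenly by estimated duration is the greedy longest-processing-time-first rule, sketched here as an assumption: subtasks are sorted by predicted completion time in descending order, and each is given to the sub-thread with the smallest accumulated load. The function name and the input shape are hypothetical.

```python
import heapq


def distribute(estimates, num_threads):
    """Assign subtasks to sub-threads so that total predicted times are balanced.

    estimates: dict mapping a subtask name to its predicted completion time
    Returns a list with one list of subtask names per sub-thread.
    """
    # Min-heap of (accumulated load, thread index).
    heap = [(0.0, i) for i in range(num_threads)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_threads)]
    # Longest subtasks first, each to the currently least-loaded sub-thread.
    for task, est in sorted(estimates.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(heap)
        assignment[idx].append(task)
        heapq.heappush(heap, (load + est, idx))
    return assignment
```

For predicted times of 8, 7, 6, 5 and 4 on two sub-threads, this yields loads of 17 and 13, so neither sub-thread sits idle for long while the other is still working.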
According to an embodiment of the present invention, further comprising:
setting an access time interval according to the website attribute;
generating a random time interval within the access time interval;
starting timing after the sub-thread finishes a round of network data crawling, to obtain the rest time of the sub-thread;
comparing the rest time of the sub-thread with the random time interval, and when the rest time of the sub-thread is greater than the random time interval, starting the sub-thread to continue crawling the network data.
When crawling website data, the system first judges whether the target website limits the crawler request frequency. If so, the access time interval is set according to the limits of the target website; if not, it is set according to factors such as the complexity of the website's pages, the size of the data volume, and the stability of the website. Because a crawler that accesses the site at fixed times may be intercepted, the access time interval is set as a range, and a random time interval is generated within that range. The website data is then crawled at these random intervals without affecting the normal operation of the website.
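A minimal sketch of the randomized pacing follows, assuming the access time interval is given as a minimum and a maximum number of seconds. The function names are hypothetical, and the bounds themselves would come from the website-specific considerations described above.

```python
import random


def next_delay(min_interval, max_interval, rng=random):
    """Draw a random time interval from within the configured access time interval."""
    return rng.uniform(min_interval, max_interval)


def may_crawl(rest_time, delay):
    """A sub-thread may resume crawling once its rest time exceeds the random interval."""
    return rest_time > delay
```

A sub-thread would draw a new delay after each round of crawling, so that consecutive requests are not spaced at a fixed, easily detected period.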
According to an embodiment of the present invention, further comprising:
setting a corresponding second preset time threshold value for each subtask according to the predicted completion time of each subtask;
recording the execution time of each subtask;
judging whether the execution time of each subtask is greater than a corresponding second preset time threshold;
if yes, ending the subtask and executing it through another sub-thread;
if not, recording is continued.
It should be noted that, to ensure the efficiency of the crawler task, execution time is recorded from the start of the crawler task, and a corresponding second preset time threshold, that is, a maximum execution time, is set for each subtask at 1.2 times its predicted completion time. The execution time of each subtask is compared with its second preset time threshold; an execution time greater than the threshold indicates that the current subtask is executing inefficiently and that website data acquisition is slow. In that case, the current subtask may be ended, its binding to the sub-thread canceled, and the subtask handed to another sub-thread for processing.
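The per-subtask timeout can be sketched as follows. The factor 1.2 comes from the text; the class `SubtaskTimer`, the `bindings` dictionary (subtask id to sub-thread name), and the function `release_overdue` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass

TIMEOUT_FACTOR = 1.2  # second preset threshold = 1.2 x predicted completion time

@dataclass
class SubtaskTimer:
    task_id: str
    estimated_s: float   # predicted completion time, seconds
    started_at: float    # clock reading when the subtask started

    @property
    def max_execution_s(self):
        return TIMEOUT_FACTOR * self.estimated_s

    def is_overdue(self, now):
        return (now - self.started_at) > self.max_execution_s

def release_overdue(timers, bindings, now):
    """Cancel the binding of every overdue subtask and return the released ids."""
    released = []
    for timer in timers:
        if timer.is_overdue(now) and timer.task_id in bindings:
            del bindings[timer.task_id]   # other sub-threads may now take it
            released.append(timer.task_id)
    return released
```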
FIG. 4 illustrates a block diagram of a multi-threaded browser driven crawler system of the present invention.
As shown in fig. 4, a second aspect of the present invention provides a multithreaded browser driven crawler system 4, including a memory 41 and a processor 42, where the memory includes a multithreaded browser driven crawler method program, and the multithreaded browser driven crawler method program when executed by the processor implements the following steps:
obtaining a crawler task;
analyzing according to the crawler task, controlling a browser to log in a website through a sub-thread, and crawling website data;
monitoring the state of the crawler task in the execution process of the crawler task, marking the completed crawler task, and ending the corresponding sub-thread;
and after all the sub-threads are finished, detecting the integrity of the crawled data through the main thread, continuously crawling the missed data until the crawled data is complete, generating a log report according to the task completion condition, and pushing the log report to the server.
According to the embodiment of the invention, as shown in fig. 5, after a crawler task is acquired, a main thread is started using the Python language. The main thread framework generates and initializes a task pool, and a plurality of sub-threads are started, each controlling one browser. The crawler task is then divided into a plurality of subtasks, which are distributed evenly to the task pools of the sub-threads. Each sub-thread starts working on its assigned subtasks: it controls its browser to perform code-scanning login, extracts the environment variables of the logged-in session, and then issues simulated requests for the target data and stores the results in the database. During the crawler task, the transmission status of the task is checked. When an interruption is detected, an interruption warning is sent and a screenshot of the code-scanning login page is generated to await manual scanning; once the scan succeeds, the crawler task resumes from the breakpoint. A sub-thread ends when all subtasks in its pool have been crawled. After all sub-threads have ended, the main thread verifies data integrity and re-acquires any missing data until the data is complete. Finally, a log report is generated according to the task completion condition and uploaded to the server. Through multithreaded driving, disconnection warnings, and breakpoint resumption, the invention improves crawler efficiency, adds exception handling and fault tolerance, and helps ensure the stability and reliability of the whole crawler system.
According to an embodiment of the present invention, further comprising:
starting a main thread and a sub thread through a Python language;
initializing a task pool through the main thread;
and starting a plurality of browsers through the sub-threads.
Before the crawler task starts, a main thread is started using the Python language. The main thread initializes and generates the current task pool and records it in a MySQL database, and the sub-threads then start a plurality of browsers. Each sub-thread controls one browser using Selenium (a tool for Web application testing).
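The "one browser per sub-thread" arrangement might look like the sketch below. To keep it runnable without a browser, it takes a `browser_factory` callable and ships a `FakeBrowser` stand-in; both names are assumptions. With Selenium installed, the factory could plausibly be `selenium.webdriver.Chrome`, whose driver objects also expose `quit()`.

```python
import threading

class FakeBrowser:
    """Stand-in for a Selenium WebDriver so the sketch runs without a real browser."""
    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True

def start_browser_threads(n_threads, browser_factory, work):
    """Start n_threads sub-threads; each creates its own browser and runs work(index, browser)."""
    results = [None] * n_threads

    def run(index):
        browser = browser_factory()      # one browser per sub-thread
        try:
            results[index] = work(index, browser)
        finally:
            browser.quit()               # always release the browser

    threads = [threading.Thread(target=run, args=(i,), name=f"crawler-{i}")
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:                    # the main thread waits for all sub-threads
        t.join()
    return results
```

Keeping the browser local to `run` means no driver object is shared between threads, which matches the one-thread-one-browser description.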
According to an embodiment of the present invention, the analyzing according to the crawler task, controlling a browser to log in a website through a sub-thread, and crawling website data includes:
distributing the crawler task to the sub-threads through the main thread;
marking the crawler task based on the ID of the assigned crawler task;
the sub-thread accesses a login page through a browser according to the assigned crawler task, and judges whether the login page needs code scanning login or not;
if yes, displaying the two-dimensional code and notifying an operator to scan it manually;
otherwise, directly logging in;
and detecting a login state, and after successful login, accessing a specific webpage by the sub-thread through a browser and crawling website data through the webpage.
After the main thread finishes distributing the crawler task, each sub-thread acquires its crawler task according to the distribution data, marks the task ID of that crawler task in the database, and is bound to the assigned task. Other threads are then refused the task, which prevents another sub-thread from processing the same subtask again. After acquiring the task, the sub-thread accesses the login page and logs in with the user information set by the system. The login mode may be account-and-password login, verification-code login, or code-scanning login. Account-and-password login and verification-code login can be performed directly from the user information set by the system, whereas code-scanning login requires a manual scan: the browser page displays the login two-dimensional code, and an administrator scans it manually.
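Marking a task ID in the database so that other sub-threads are refused can be sketched with a conditional `UPDATE`. The example uses Python's built-in `sqlite3` purely as a stand-in for the MySQL database mentioned in the text; the table name `task_pool`, its columns, and the function names are assumptions.

```python
import sqlite3

def init_task_pool(conn, task_ids):
    """Create the task pool table and insert every task as unowned and pending."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task_pool ("
        "task_id TEXT PRIMARY KEY, owner TEXT, status TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO task_pool VALUES (?, NULL, 'pending')",
        [(t,) for t in task_ids])
    conn.commit()

def claim_task(conn, task_id, thread_name):
    """Bind task_id to thread_name; return False if another thread already owns it."""
    cur = conn.execute(
        "UPDATE task_pool SET owner = ?, status = 'running' "
        "WHERE task_id = ? AND owner IS NULL",
        (thread_name, task_id))
    conn.commit()
    return cur.rowcount == 1

def release_task(conn, task_id):
    """Cancel the binding so that other sub-threads may claim the task."""
    conn.execute(
        "UPDATE task_pool SET owner = NULL, status = 'pending' WHERE task_id = ?",
        (task_id,))
    conn.commit()
```

Because the `WHERE ... owner IS NULL` condition is evaluated inside the update itself, a second claim on the same task changes no rows and is reported as refused.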
The login state is checked at regular intervals. After a successful login, the sub-thread controls the browser to access the specific web pages, obtains the browser's environment parameters, issues simulated requests to obtain the website data, and then sorts the data and stores it in the database.
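Checking the login state at regular intervals can be sketched as a polling loop. The callable `is_logged_in`, the default poll period, and the overall timeout are assumptions; the text does not give concrete values. The clock and sleep functions are injectable so that the loop can be exercised without real waiting.

```python
import time

def wait_for_login(is_logged_in, poll_s=2.0, timeout_s=120.0,
                   sleep=time.sleep, now=time.monotonic):
    """Poll is_logged_in() every poll_s seconds; return False if timeout_s elapses first."""
    deadline = now() + timeout_s
    while True:
        if is_logged_in():
            return True
        if now() >= deadline:
            return False
        sleep(poll_s)
```

In a Selenium-driven sub-thread, `is_logged_in` would typically inspect the current page or cookies; that check is site-specific and is left out here.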
According to an embodiment of the present invention, in the execution process of a crawler task, monitoring the state of the crawler task, marking the completed crawler task, and ending the corresponding sub-thread, including:
analyzing according to the task state of the subtask in the subthread, and judging whether the subtask is completed or not;
if not, continuing to monitor the task state of the subtask;
if yes, marking the subtasks as completed;
ending the sub-thread when all sub-tasks in the task pool of the sub-thread are marked as completed;
and after all the sub-threads are finished, sending a completion notification to the main thread.
It should be noted that each sub-thread is allocated one or more subtasks. When a sub-thread is allocated a plurality of subtasks, a task pool for that sub-thread is built from them. During execution, the sub-thread is monitored in real time to detect whether its current subtask is complete. When a subtask has been fully crawled, it is marked as completed and the next subtask is taken from the task pool; the sub-thread ends once all of its subtasks are completed. After all sub-threads have ended, the system automatically sends a completion notification to the main thread.
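The per-thread task pool and the completion notification can be sketched with `threading` and `queue.Queue`. The `status` dictionary, the `done` queue, and the function names are assumptions; the text does not say how the notification is delivered.

```python
import queue
import threading

def sub_thread(name, task_pool, crawl, status, done):
    """Work through this sub-thread's pool, marking each subtask as completed."""
    try:
        for task_id in task_pool:
            crawl(task_id)
            status[task_id] = "completed"
    finally:
        done.put(name)   # report that this sub-thread has ended

def run_all(pools, crawl):
    """pools maps a sub-thread name to its list of subtask ids."""
    status = {t: "pending" for pool in pools.values() for t in pool}
    done = queue.Queue()
    threads = [threading.Thread(target=sub_thread, args=(name, pool, crawl, status, done))
               for name, pool in pools.items()]
    for t in threads:
        t.start()
    # The main thread blocks until every sub-thread has reported in.
    finished = {done.get() for _ in threads}
    for t in threads:
        t.join()
    return status, finished
```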
According to an embodiment of the present invention, further comprising:
analyzing according to the task state of the sub-task in the sub-thread, recording the task interruption times when the task is interrupted, and judging whether the interruption times are larger than a first preset threshold value or not;
if not, logging in to the website again and resuming the crawl of the interrupted sub-task;
if yes, ending the interrupted sub-task and canceling the binding between the interrupted sub-task and the corresponding sub-thread;
and distributing the interrupted sub-task to another sub-thread, and continuing to crawl the website data corresponding to the interrupted sub-task through that sub-thread.
It should be noted that when an interruption of a subtask in a sub-thread is detected, the system analyzes the recorded number of task interruptions. If the number is less than or equal to the first preset threshold, the sub-thread controls the browser to return to the login page and log in by code scanning again, and the current sub-thread then resumes crawling the website data from the breakpoint, building on the data already crawled. If the number is greater than the first preset threshold, the current task is ended and marked as unfinished so that other sub-threads may acquire it. A warning and a screenshot of the code-scanning login page are sent to await manual scanning, the login state is checked at regular intervals, and after a successful login the remaining tasks are carried on; the sub-thread ends once all tasks in the task pool are completed.
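The interruption handling can be sketched as a retry loop with a counter. The exception `CrawlInterrupted`, its `offset` attribute (the position reached before the interruption), and the callables `crawl_from` and `relogin` are hypothetical names; the text does not specify how a breakpoint is represented.

```python
class CrawlInterrupted(Exception):
    """Raised by the crawl step when transmission breaks off."""
    def __init__(self, offset):
        super().__init__(f"interrupted at offset {offset}")
        self.offset = offset   # where to resume from

def crawl_with_resume(task_id, crawl_from, relogin, max_interrupts):
    """Resume from the breakpoint until the interruption count exceeds max_interrupts.

    Returns ("completed", n) on success, or ("released", n) when the subtask
    should be unbound and handed to another sub-thread.
    """
    interrupts = 0
    offset = 0
    while True:
        try:
            crawl_from(task_id, offset)
            return "completed", interrupts
        except CrawlInterrupted as exc:
            interrupts += 1
            offset = exc.offset
            if interrupts > max_interrupts:   # first preset threshold exceeded
                return "released", interrupts
            relogin()                         # log in again, then resume
```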
According to the embodiment of the invention, after all the sub-threads are finished, the integrity of the crawl data is detected by the main thread, the crawl is continued to be carried out on the missing data until the crawl data is complete, and a log report is generated and pushed to the server according to the task completion condition, which comprises the following steps:
the main thread detects the data integrity of all tasks and marks the tasks with incomplete data;
analyzing according to the marked task to obtain missing data;
generating a missing data crawler task according to the missing data, and starting a sub-thread to process the missing data crawler task;
detecting the integrity of the missing data after the task of the missing data crawler is finished, and judging whether the missing data is completely crawled;
if not, repeatedly generating a missing data crawler task for the data that is still missing until all of it has been crawled;
if yes, generating a task log based on the task completion time and the task content.
After all the sub-threads have ended, the main thread automatically receives a task completion notification. The main thread then compares the data stored on the current date with the data stored on historical dates to judge the data integrity of the subtasks. For incomplete data, a task pool is generated from the missing data, and sub-threads are started again and scheduled to run the missing-data crawler tasks. When those tasks finish, data integrity is checked again; if the data is still incomplete, further missing-data crawler tasks are generated, and the check repeats until the data is complete and the crawler task is finished. Finally, the task logs for the current date are summarized based on task completion time and task content and pushed to a display terminal.
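The integrity loop can be sketched as a set difference followed by repeated back-filling. Treating "historical data" as a list of expected record keys is an assumption, as are the function names and the `max_rounds` safety cap; the text itself loops until the data is complete.

```python
def find_missing(expected_keys, stored):
    """Keys present in the historical baseline but absent from today's stored data."""
    return sorted(set(expected_keys) - set(stored))

def backfill_until_complete(expected_keys, stored, crawl_missing, max_rounds=5):
    """Re-crawl missing records until none remain or max_rounds is reached.

    stored is a dict of record key -> record and is updated in place.
    crawl_missing(keys) returns a dict with whatever records it managed to fetch.
    Returns (still_missing, rounds_used).
    """
    rounds = 0
    missing = find_missing(expected_keys, stored)
    while missing and rounds < max_rounds:
        stored.update(crawl_missing(missing))   # one missing-data crawler task
        rounds += 1
        missing = find_missing(expected_keys, stored)
    return missing, rounds
```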
According to an embodiment of the present invention, further comprising:
acquiring execution time of a crawler task;
analyzing according to the execution time of the crawler task, and judging whether the execution time of the crawler task is larger than a first preset time threshold;
if not, continuing recording;
if yes, finishing the crawler task, and generating a task log based on the finishing time and the task content.
It should be noted that, besides the case where the task is completed, the crawler task is also ended, and a task log generated, if the task times out. The first preset time threshold is obtained from the system's prediction of the crawler task as a whole. The execution time of the crawler task is recorded from the moment the task starts, and when it exceeds the first preset time threshold, the crawler task is ended directly, regardless of whether it has completed, and a task log is generated.
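A small sketch of the overall timeout decision and the resulting task log follows. The log fields and the function name `finish_if_due` are assumptions; the text only says that the log is based on the finishing time and the task content.

```python
def finish_if_due(task_content, started_at, now, all_done, first_threshold_s):
    """Return a task log when the task is complete or has timed out, else None."""
    elapsed = now - started_at
    timed_out = elapsed > first_threshold_s
    if not (all_done or timed_out):
        return None   # keep recording the execution time
    return {
        "task": task_content,
        "ended_at": now,
        "elapsed_s": elapsed,
        "reason": "completed" if all_done else "timeout",
    }
```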
According to an embodiment of the present invention, further comprising:
analyzing according to the crawler task, and calculating the estimated completion time of each subtask;
and evenly distributing the crawler task to each sub-thread according to the estimated completion time of each sub-task.
Before the crawler task starts, the server first simulates each subtask of the crawler task to obtain its estimated completion time. All subtasks are then distributed evenly across the sub-threads of the crawler task based on these estimates. This keeps every sub-thread working for as much of the crawl as possible, which improves the efficiency of the crawler task.
According to an embodiment of the present invention, further comprising:
setting an access time interval according to the website attribute;
generating a random time interval within the access time interval;
starting timing after the sub-thread finishes network data crawling, and obtaining the rest time of the sub-thread;
comparing the rest time of the sub-thread with the random time interval, and when the rest time of the sub-thread is greater than the random time interval, starting the sub-thread to continue crawling the network data.
When crawling data from a website, it is first determined whether the target website limits the crawler request frequency. If so, the access time interval is set according to the target website's limits; if not, it is set according to factors such as the complexity of the website's pages, the volume of data, and the stability of the website. Because a crawler that accesses a website at fixed times is liable to be intercepted, the access time interval is set as a range rather than a single value. A random time interval is then generated within that range, and the website data are crawled at that random interval without disrupting the normal operation of the website.
According to an embodiment of the present invention, further comprising:
setting a corresponding second preset time threshold value for each subtask according to the predicted completion time of each subtask;
recording the execution time of each subtask;
judging whether the execution time of each subtask is greater than a corresponding second preset time threshold;
if yes, ending the subtask and executing it through another sub-thread;
if not, recording is continued.
It should be noted that, to ensure the efficiency of the crawler task, execution time is recorded from the start of the crawler task, and a corresponding second preset time threshold, that is, a maximum execution time, is set for each subtask at 1.2 times its predicted completion time. The execution time of each subtask is compared with its second preset time threshold; an execution time greater than the threshold indicates that the current subtask is executing inefficiently and that website data acquisition is slow. In that case, the current subtask may be ended, its binding to the sub-thread canceled, and the subtask handed to another sub-thread for processing.
A third aspect of the present invention provides a computer readable storage medium having embodied therein a multithreaded browser driven crawler method program which, when executed by a processor, implements the steps of a multithreaded browser driven crawler method as described in any of the preceding claims.
The invention discloses a method, a system and a readable storage medium for driving crawlers by a multithreaded browser, wherein the method comprises the following steps: obtaining a crawler task; analyzing according to the crawler task, controlling a browser to log in a website through a sub-thread, and crawling website data; monitoring the state of the crawler task in the execution process of the crawler task, marking the completed crawler task, and ending the corresponding sub-thread; and after all the sub-threads are finished, detecting the integrity of the crawled data through the main thread, continuously crawling the missed data until the crawled data is complete, generating a log report according to the task completion condition, and pushing the log report to the server. By starting multithreaded scheduling, the invention drives a plurality of browsers separately, obtains the current environment parameters of each browser, and then issues simulated requests to acquire and store the data. This addresses the problems that login requires scanning a code in the browser interface, that the request parameters are complex and difficult to analyze, and that a single browser is inefficient.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are only illustrative. For example, the division of the units is only a division by logical function, and other divisions are possible in practice: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or of another form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; can be located in one place or distributed to a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present invention may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a removable storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Alternatively, the above-described integrated units of the present invention may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in essence or a part contributing to the prior art in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, ROM, RAM, magnetic or optical disk, or other medium capable of storing program code.

Claims (10)

1. A method for driving a crawler by a multithreaded browser, comprising:
obtaining a crawler task;
analyzing according to the crawler task, controlling a browser to log in a website through a sub-thread, and crawling website data;
monitoring the state of the crawler task in the execution process of the crawler task, marking the completed crawler task, and ending the corresponding sub-thread;
and after all the sub-threads are finished, detecting the integrity of the crawled data through the main thread, continuously crawling the missed data until the crawled data is complete, generating a log report according to the task completion condition, and pushing the log report to the server.
2. The multi-threaded browser driven crawler method of claim 1, further comprising:
starting a main thread and a sub thread through a Python language;
initializing a task pool through the main thread;
and starting a plurality of browsers through the sub-threads.
3. The method for driving a crawler by a multithreaded browser according to claim 1, wherein the analyzing according to the crawler task, controlling the browser to log in to a website by a sub-thread, and crawling website data comprises:
distributing the crawler task to the sub-threads through the main thread;
marking the crawler task based on the ID of the assigned crawler task;
the sub-thread accesses a login page through a browser according to the assigned crawler task, and judges whether the login page needs code scanning login or not;
if yes, displaying the two-dimensional code and notifying an operator to scan it manually;
otherwise, directly logging in;
and detecting a login state, and after successful login, accessing a specific webpage by the sub-thread through a browser and crawling website data through the webpage.
4. The method for driving a crawler by a multithreaded browser according to claim 1, wherein monitoring the status of the crawler task, marking completed crawler tasks, and ending corresponding sub-threads during execution of the crawler task comprises:
analyzing according to the task state of the subtask in the subthread, and judging whether the subtask is completed or not;
if not, continuing to monitor the task state of the subtask;
if yes, marking the subtasks as completed;
ending the sub-thread when all sub-tasks in the task pool of the sub-thread are marked as completed;
and after all the sub-threads are finished, sending a completion notification to the main thread.
5. The multi-threaded browser driven crawler method of claim 4, further comprising:
analyzing according to the task state of the sub-task in the sub-thread, recording the task interruption times when the task is interrupted, and judging whether the interruption times are larger than a first preset threshold value or not;
if not, logging in to the website again and resuming the crawl of the interrupted sub-task;
if yes, ending the interrupted sub-task and canceling the binding between the interrupted sub-task and the corresponding sub-thread;
and distributing the interrupted sub-task to another sub-thread, and continuing to crawl the website data corresponding to the interrupted sub-task through that sub-thread.
6. The method for driving a crawler by a multithreaded browser according to claim 1, wherein after all the sub-threads are finished, detecting the integrity of the crawled data by the main thread, and continuing crawling the missing data until the crawled data is complete, generating a log report according to the task completion condition, and pushing the log report to the server, comprises:
the main thread detects the data integrity of all tasks and marks the tasks with incomplete data;
analyzing according to the marked task to obtain missing data;
generating a missing data crawler task according to the missing data, and starting a sub-thread to process the missing data crawler task;
detecting the integrity of the missing data after the task of the missing data crawler is finished, and judging whether the missing data is completely crawled;
if not, repeatedly generating a missing data crawler task for the data that is still missing until all of it has been crawled;
if yes, generating a task log based on the task completion time and the task content.
7. A multithreaded browser driven crawler system, characterized by comprising a memory and a processor, wherein the memory comprises a multithreaded browser driven crawler method program which, when executed by the processor, implements the following steps:
obtaining a crawler task;
analyzing according to the crawler task, controlling a browser to log in a website through a sub-thread, and crawling website data;
monitoring the state of the crawler task in the execution process of the crawler task, marking the completed crawler task, and ending the corresponding sub-thread;
and after all the sub-threads are finished, detecting the integrity of the crawled data through the main thread, continuously crawling the missed data until the crawled data is complete, generating a log report according to the task completion condition, and pushing the log report to the server.
8. The multi-threaded browser-driven crawler system of claim 7, wherein the analyzing according to the crawler task, controlling the browser to log into the website by the sub-threads, and crawling website data comprises:
distributing the crawler task to the sub-threads through the main thread;
marking the crawler task based on the ID of the assigned crawler task;
the sub-thread accesses a login page through a browser according to the assigned crawler task, and judges whether the login page needs code scanning login or not;
if yes, displaying the two-dimensional code and notifying an operator to scan it manually;
otherwise, directly logging in;
and detecting a login state, and after successful login, accessing a specific webpage by the sub-thread through a browser and crawling website data through the webpage.
9. The multi-threaded browser driven crawler system of claim 7, wherein monitoring the status of a crawler task, marking completed crawler tasks, and ending corresponding sub-threads during execution of the crawler task, comprises:
analyzing according to the task state of the subtask in the subthread, and judging whether the subtask is completed or not;
if not, continuing to monitor the task state of the subtask;
if yes, marking the subtasks as completed;
ending the sub-thread when all sub-tasks in the task pool of the sub-thread are marked as completed;
and after all the sub-threads are finished, sending a completion notification to the main thread.
10. A computer readable storage medium, comprising a multi-threaded browser driven crawler method program, wherein the multi-threaded browser driven crawler method program, when executed by a processor, implements the steps of a multi-threaded browser driven crawler method according to any of claims 1 to 6.
CN202310765246.7A 2023-06-27 2023-06-27 Multithreaded browser driven crawler method, system and readable storage medium Pending CN116501945A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310765246.7A CN116501945A (en) 2023-06-27 2023-06-27 Multithreaded browser driven crawler method, system and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310765246.7A CN116501945A (en) 2023-06-27 2023-06-27 Multithreaded browser driven crawler method, system and readable storage medium

Publications (1)

Publication Number Publication Date
CN116501945A true CN116501945A (en) 2023-07-28

Family

ID=87316994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310765246.7A Pending CN116501945A (en) 2023-06-27 2023-06-27 Multithreaded browser driven crawler method, system and readable storage medium

Country Status (1)

Country Link
CN (1) CN116501945A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116594756A (en) * 2023-07-17 2023-08-15 深圳市豪斯莱科技有限公司 Task processing method, device, terminal equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376063A (en) * 2014-11-11 2015-02-25 南京邮电大学 Multithreading web crawler method based on sort management and real-time information updating system
CN111191097A (en) * 2019-12-20 2020-05-22 天阳宏业科技股份有限公司 Method, device and system for automatically acquiring webpage information by web crawler
CN112765438A (en) * 2021-01-25 2021-05-07 北京星汉博纳医药科技有限公司 Automatic crawler management method based on micro-service
CN114610975A (en) * 2022-04-20 2022-06-10 厦门市美亚柏科信息股份有限公司 Webpage crawling method and device, computing equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116594756A (en) * 2023-07-17 2023-08-15 深圳市豪斯莱科技有限公司 Task processing method, device, terminal equipment and storage medium
CN116594756B (en) * 2023-07-17 2023-11-03 深圳市豪斯莱科技有限公司 Task processing method, device, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20230728