CN112541106A - Network data acquisition method and device, computer equipment and storage medium

Info

Publication number
CN112541106A
Authority
CN
China
Prior art keywords
downloading
failure
strategy
page
downloading strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011512362.0A
Other languages
Chinese (zh)
Inventor
曾文清
杨濠兴
朱光岳
廖梓鸿
虞孝伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Chuangle Information Technology Co ltd
Original Assignee
Guangzhou Chuangle Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Chuangle Information Technology Co ltd
Priority to CN202011512362.0A
Publication of CN112541106A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/37 Compiler construction; Parser generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G06F8/427 Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application relates to a network data acquisition method, a network data acquisition device, computer equipment and a storage medium. The method comprises the following steps: monitoring a data capture state under a current downloading strategy in a crawling system, wherein the data capture state comprises page download failure, page parsing failure and/or page parsing success; determining a failure index corresponding to the current downloading strategy according to the data capture state; and, if the failure index is greater than or equal to a first preset threshold, switching the current downloading strategy to a target downloading strategy, wherein the target downloading strategy is one of a plurality of downloading strategies preset in the crawling system. The method avoids ineffective capture attempts and the resulting heavy waste of data crawling resources.

Description

Network data acquisition method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of web crawlers, and in particular, to a method and an apparatus for acquiring network data, a computer device, and a storage medium.
Background
A web crawler is a program or script that crawls network information according to predetermined rules. However, as anti-crawler technology keeps advancing, data crawling often fails when a given crawler system is used to crawl network data.
At present, crawling failures are mainly handled by analyzing the cause of the failure and then applying various techniques against anti-crawler measures to access the failed pages continuously or in a loop. This wastes a large amount of resources and increases the cost of data capture.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a network data acquisition method, apparatus, computer device and storage medium for solving the above technical problems.
A method of network data acquisition, the method comprising: monitoring a data capture state under a current downloading strategy in a crawling system, wherein the data capture state comprises page download failure, page parsing failure and/or page parsing success; determining a failure index corresponding to the current downloading strategy according to the data capture state; and, if the failure index is greater than or equal to a first preset threshold, switching the current downloading strategy to a target downloading strategy, wherein the target downloading strategy is one of a plurality of downloading strategies preset in the crawling system.
In one embodiment, the step of switching the current downloading strategy to the target downloading strategy if the failure index is greater than or equal to a first preset threshold includes: acquiring the cumulative number of downloading strategy switches for the crawling system within a first preset time; and, if the cumulative number of switches is greater than or equal to a second preset threshold, suspending the data capture program of the crawling system for a second preset time and releasing the corresponding proxy IP resources.
In one embodiment, the method further comprises: if the cumulative number of switches is smaller than the second preset threshold, switching the current downloading strategy to the target downloading strategy; and, after switching to the target downloading strategy, clearing the failure index corresponding to the current downloading strategy.
In one embodiment, the step of determining a failure index corresponding to the current downloading strategy according to the data capture state includes: obtaining the number of page download failures and/or the number of page parsing failures within a third preset time according to the data capture states within the third preset time, and determining the failure index based on the number of page download failures and/or the number of page parsing failures.
In one embodiment, obtaining the number of page download failures and/or the number of page parsing failures within the third preset time, and determining the failure index based on them, includes: traversing the data capture states within the third preset time; if a traversed data capture state is page download failure or page parsing failure, incrementing the capture failure count of the current downloading strategy by one; if a traversed data capture state is page parsing success, incrementing the capture success count of the current downloading strategy by one; and, once the data capture states within the third preset time have been traversed, determining the failure index based on the accumulated capture failure count and capture success count.
In one embodiment, the method further comprises: if the failure index corresponding to the current downloading strategy within the third preset time is smaller than the first preset threshold, resetting the capture failure count and the capture success count of the current downloading strategy to zero.
In one embodiment, the method further comprises: determining a candidate downloading strategy from the plurality of downloading strategies preset in the crawling system; judging whether the candidate downloading strategy is within a preset aging period; when the candidate downloading strategy is within the preset aging period, determining the candidate downloading strategy as the target downloading strategy; and, when the candidate downloading strategy is not within the preset aging period, determining another candidate downloading strategy from the remaining downloading strategies and returning to the step of judging whether the candidate downloading strategy is within the preset aging period, until a target downloading strategy is obtained or no downloading strategies remain.
In one embodiment, the failure index is at least one of: the difference between the cumulative number of capture failures and the cumulative number of capture successes; the ratio of the cumulative number of capture failures to the total number of captures; and the cumulative number of capture failures.
A network data acquisition apparatus, comprising:
a listener module, used for monitoring a data capture state under a current downloading strategy in the crawling system, wherein the data capture state comprises page download failure, page parsing failure and/or page parsing success, and for determining a failure index corresponding to the current downloading strategy according to the data capture state; and
a downloading strategy switching module, used for switching the current downloading strategy to a target downloading strategy if the failure index is greater than or equal to a first preset threshold, wherein the target downloading strategy is one of a plurality of downloading strategies preset in the crawling system.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the method as above when the processor executes the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as above.
The network data acquisition method, apparatus, computer device and storage medium monitor the data capture state under the current downloading strategy in the crawling system, wherein the data capture state comprises page download failure, page parsing failure and/or page parsing success; determine a failure index corresponding to the current downloading strategy according to the data capture state; and, if the failure index is greater than or equal to a first preset threshold, switch the current downloading strategy to a target downloading strategy, which is one of a plurality of downloading strategies preset in the crawling system. When crawling of network data fails, the method handles the failure economically and efficiently and avoids inefficient continuous or cyclic access, thereby saving data resources and reducing the cost of data capture.
Drawings
FIG. 1 is a diagram of an application environment of a network data acquisition method in one embodiment;
FIG. 2 is a schematic diagram of a crawling system in one embodiment;
FIG. 3 is a flow diagram illustrating a method for network data acquisition according to one embodiment;
FIG. 4 is a flow chart illustrating a method for network data acquisition in another embodiment;
FIG. 5 is a flow chart illustrating a method for network data acquisition in another embodiment;
FIG. 6 is a flow chart illustrating step S300 of a network data acquisition method in another embodiment;
FIG. 7 is a flow diagram illustrating a method for network data acquisition according to one embodiment;
FIG. 8 is a block diagram showing the structure of a network data acquisition device according to an embodiment;
FIG. 9 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The network data acquisition method provided by the application can be applied to a terminal or a server alone, or partly to the terminal and partly to the server. Taking communication between the terminal and the server as an example, the application environment is shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The user accesses the server 104 through the terminal 102 to obtain a web page, thereby acquiring network data. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer or portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.
The crawler system is preset in the terminal 102 or the server 104. The crawler system in the present application may monitor a data capture state, where the data capture state includes page download failure, page download success, page parsing failure and/or page parsing success. Based on the monitored data capture state, it determines how effective data capture is under the current download policy, for example by determining a failure index of the current download policy, and decides whether to switch the download policy based on that failure index.
In one embodiment, the crawler system comprises a listener module, which monitors the download state of each web page and, after a successful download, the parsing state of the page. The crawler system determines whether to switch the download policy based on the data monitored by the listener module. The data capture states monitored by the listener module can be preset according to the actual scenario; for example, the listener module may be set to monitor page download failures and page parsing failures and to record them accordingly.
Further, the crawler system can also comprise a download manager module and a page resolver module; the download manager module is used for downloading the network page from the internet and inputting the downloaded network page into the page resolver module for resolution processing. The page resolver module is used for resolving the network page, extracting the information of the network page and mining a new URL link.
In other embodiments, as shown in FIG. 2, the overall architecture of the crawler system may further comprise: a seed manager module (Scheduler) 202, a Redis aggregate queue module 204, a download manager module (Downloader) 206, a page parser module (PageProcessor) 208, a storage manager module (Pipeline) 210, and a listener module (Listener) 212.
The seed manager module 202 is used to manage URL links.
The Redis aggregate queue module 204 is used for accessing, deleting, and de-duplicating URL links.
The download manager module 206 is used to download the web page from the internet, and input the downloaded web page to the page parser module 208 for parsing.
The page parser module 208 is used to parse web pages and extract web page information, as well as mining new URL links.
The storage manager module 210 is used to process the parsing results input by the page parser module 208, including calculation, persistence to a file, database, etc.
The listener module 212 is configured to monitor the page download state of the download manager module 206 and the parsing state of the page parser module 208, and to add web pages that failed to download or parse back to the Redis aggregate queue module 204.
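The interaction between the listener and the queue described above can be sketched as follows. This is only an illustrative sketch: `DedupQueue` is a hypothetical in-memory stand-in for the Redis aggregate queue module, and the class names, method names and example URLs are assumptions that do not come from the application.

```python
from collections import deque
from enum import Enum


class CaptureState(Enum):
    DOWNLOAD_FAILURE = "page download failure"
    PARSE_FAILURE = "page parsing failure"
    PARSE_SUCCESS = "page parsing success"


class DedupQueue:
    """Hypothetical in-memory stand-in for the Redis aggregate queue:
    it stores, removes and de-duplicates URL links."""

    def __init__(self):
        self._pending = deque()
        self._queued = set()

    def push(self, url):
        if url in self._queued:      # already waiting, skip the duplicate
            return False
        self._queued.add(url)
        self._pending.append(url)
        return True

    def pop(self):
        url = self._pending.popleft()
        self._queued.discard(url)
        return url

    def __len__(self):
        return len(self._pending)


class Listener:
    """Records capture states and puts failed URLs back on the queue."""

    def __init__(self, queue):
        self.queue = queue
        self.events = []             # (url, CaptureState) in arrival order

    def on_event(self, url, state):
        self.events.append((url, state))
        if state is not CaptureState.PARSE_SUCCESS:
            self.queue.push(url)     # failed pages are re-queued for a retry


queue = DedupQueue()
listener = Listener(queue)
listener.on_event("https://example.com/a", CaptureState.DOWNLOAD_FAILURE)
listener.on_event("https://example.com/b", CaptureState.PARSE_SUCCESS)
listener.on_event("https://example.com/a", CaptureState.PARSE_FAILURE)
print(len(queue))  # the failed URL is queued only once
```

In a deployment along the lines of FIG. 2, the queue would presumably be backed by Redis rather than by process memory, so that several crawler instances could share it.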
In one embodiment, as shown in fig. 3, a network data obtaining method is provided, which is described by taking the method as an example applied to the terminal 102 in fig. 1, and the method includes the following steps:
Step S200, monitoring the data capture state under the current downloading strategy in the crawling system, wherein the data capture state comprises page download failure, page parsing failure and/or page parsing success.
The crawling system is a web crawler system. A web crawler is a program that crawls web pages according to certain rules and is an important component with which a search engine downloads web pages from the World Wide Web. While crawling web pages, it continuously extracts new URLs from the current page and puts them into a queue until a stop condition is met.
The crawling system can preset a plurality of downloading strategies, for example: a strategy implemented with HttpClient, a strategy implemented with HtmlUnit + Jsoup, a strategy that adds a Cookie to the request header, a strategy that reaches the page by clicking through from other pages of the same site, and the like. The current download policy may be any one of these download policies.
When web page content is acquired through a URL, the page is first downloaded and then parsed by a parsing program to extract its content; only then is page parsing successful. The following failure states may occur along the way: the web page fails to download (page download failure), or the web page downloads successfully but its content cannot be parsed (page parsing failure).
In one embodiment, monitoring the data capture state under the current downloading strategy in the crawling system may consist of marking or recording the download state of each web page and the parsing state of each downloaded web page. When a web page fails to download, one page download failure event is recorded; when a web page fails to parse, one page parsing failure event is recorded. The failure events in the crawling system thus comprise download failure events and parsing failure events. When a web page is parsed successfully, one parsing success event is recorded.
In one embodiment, a listener may be provided in the crawling system to monitor and mark the data capture state. For example, the listener module 212 may monitor page download failure events in the download manager module 206, and page parsing success events and page parsing failure events in the page parser module 208, and mark or record the data capture state corresponding to each event. The download manager module 206 downloads web pages from the internet; successfully downloaded pages are passed to the page parser module 208 for parsing, and successfully parsed pages are stored. The crawler system decides whether to switch the download policy based on the data capture states monitored by the listener module 212.
Step S300, determining a failure index corresponding to the current downloading strategy according to the data capture state.
The failure index is a parameter that measures, in numerical form, how ineffective data capture is under the current downloading strategy. The failure index can be the difference between the cumulative number of capture failures and the cumulative number of capture successes, the ratio of the cumulative number of capture failures to the total number of captures, or the cumulative number of capture failures. A capture failure comprises a page download failure and/or a page parsing failure.
In one embodiment, when the listener module 212 monitors a page download failure event, a page parsing failure event, and/or a page parsing success event, a corresponding data capture state is determined based on each event, and a failure indicator is updated based on the data capture state.
Step S400, if the failure index is greater than or equal to a first preset threshold value, switching the current downloading strategy into a target downloading strategy; the target downloading strategy is one of a plurality of downloading strategies preset in the crawling system.
In the application, a target downloading strategy can be determined from the plurality of preset downloading strategies according to a preset rule. In one embodiment, the target downloading strategy is determined according to the calling priority of the preset downloading strategies. For example, the first download policy is preset to a first priority, the second download policy to a second priority, the third download policy to a third priority, ..., and the nth download policy to an nth priority. When the current download policy satisfies the switching condition, the preset download policies are sorted by priority, and the one with the highest priority is taken as the target download policy.
In another embodiment, the target downloading strategy can also be determined according to the historical page download success rate or the historical page parsing success rate of each preset downloading strategy. When the current downloading strategy satisfies the switching condition, the preset downloading strategies are sorted by historical page download success rate or historical page parsing success rate, and the one with the highest rate is taken as the target downloading strategy.
In addition, in other embodiments, the target download policy may be determined from a plurality of download policies in a random selection manner.
In one embodiment, a first preset threshold is set in the listener module 212 of the crawling system, a plurality of downloading policies are preset in the downloading manager module 206, the failure index is compared with the first preset threshold, if the failure index is greater than or equal to the first preset threshold, the current downloading policy is switched to the target downloading policy, and the target downloading policy may be one of the plurality of preset downloading policies excluding the current downloading policy, and may be determined by the above-described embodiment.
For example, the target download policy may be switched in order according to a preset priority. If the current download policy is the third download policy and the switching condition is met, the switching priority order of the target download policies corresponding to the third download policy is determined, and the target download policy with the highest priority in that order is selected.
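The three ways of choosing a target download policy described above (by priority, by historical success rate, or at random) could be sketched as follows. The strategy names, the convention that a lower number means a higher priority, and the success-rate values are hypothetical and chosen only for illustration.

```python
import random


def others(strategies, current):
    """All preset strategies except the current one."""
    return [s for s in strategies if s["name"] != current]


def by_priority(strategies, current):
    # Assumption of this sketch: a lower number means a higher priority.
    candidates = others(strategies, current)
    return min(candidates, key=lambda s: s["priority"])["name"] if candidates else None


def by_success_rate(strategies, current):
    candidates = others(strategies, current)
    return max(candidates, key=lambda s: s["success_rate"])["name"] if candidates else None


def at_random(strategies, current, rng=random):
    candidates = others(strategies, current)
    return rng.choice(candidates)["name"] if candidates else None


# Hypothetical presets; the success rates are made-up illustration values.
strategies = [
    {"name": "httpclient", "priority": 1, "success_rate": 0.62},
    {"name": "htmlunit_jsoup", "priority": 2, "success_rate": 0.91},
    {"name": "cookie_header", "priority": 3, "success_rate": 0.75},
]

print(by_priority(strategies, current="httpclient"))          # htmlunit_jsoup
print(by_success_rate(strategies, current="htmlunit_jsoup"))  # cookie_header
print(at_random(strategies, current="cookie_header"))         # one of the other two
```

Excluding the current policy from the candidates follows the embodiment in which the target is one of the preset policies other than the current one.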
In this scheme, the listener determines the failure index based on the data capture state, and a switch of the downloading strategy is triggered when the failure index is greater than or equal to the first preset threshold. This avoids the waste of resources that occurs when data acquisition continues under a downloading strategy that frequently fails to capture.
Meanwhile, because the switch is triggered only when the failure index is greater than or equal to the first preset threshold, minor network fluctuations do not trigger switching events, which also avoids the waste of data crawling resources caused by frequent switching of the downloading strategy.
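A minimal sketch of this threshold-triggered switch is given here, assuming the cumulative number of capture failures is used as the failure index and the next strategy in a fixed order is taken as the target. The class name, the round-robin choice and the strategy names are assumptions for illustration only.

```python
class StrategySwitcher:
    """Switches the download strategy once the failure index reaches a threshold."""

    def __init__(self, strategies, threshold):
        self.strategies = list(strategies)
        self.threshold = threshold       # the "first preset threshold"
        self.current = self.strategies[0]
        self.failures = 0                # failure index: cumulative capture failures
        self.switch_log = []

    def record(self, ok):
        """Feed one capture result (True = page parsed successfully)."""
        if ok:
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self._switch()

    def _switch(self):
        # Assumption: the next strategy in list order is the target.
        index = self.strategies.index(self.current)
        target = self.strategies[(index + 1) % len(self.strategies)]
        self.switch_log.append((self.current, target))
        self.current = target
        self.failures = 0                # clear the failure index after switching


switcher = StrategySwitcher(["httpclient", "htmlunit_jsoup", "cookie_header"], threshold=3)
for ok in [False, True, False]:
    switcher.record(ok)
print(switcher.current)   # httpclient: two failures are below the threshold
switcher.record(False)
print(switcher.current)   # htmlunit_jsoup: the third failure triggers the switch
```

With a threshold of 3, two isolated failures leave the strategy unchanged, which corresponds to tolerating minor network fluctuations.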
In one embodiment, as shown in fig. 4, step S400 specifically includes:
Step S402, acquiring the cumulative number of downloading strategy switches for the crawling system within a first preset time;
Step S404, if the cumulative number of switches is greater than or equal to a second preset threshold, suspending the data capture program of the crawling system for a second preset time, and releasing the corresponding proxy IP resources.
In one embodiment, a first preset time and a second preset threshold are set in the crawling system. When the current downloading policy meets the switching condition, the cumulative number of downloading policy switches within the first preset time is obtained; when it is greater than or equal to the second preset threshold, the data capture program of the crawling system is suspended for the second preset time, and the corresponding proxy IP resources are released. This prevents the data crawling program from failing continuously and wasting resources: when crawling still fails after the downloading strategy has been switched several times, data crawling is suspended for a period of time, which saves data crawling resources.
In one embodiment, the method further includes switching the current downloading policy to a target downloading policy if the corresponding accumulated switching times within a first preset time is less than a second preset threshold; and after the target downloading strategy is switched, clearing the failure index corresponding to the current downloading strategy.
Specifically, the cumulative number of switches within the first preset time is compared with the second preset threshold. If it is smaller than the second preset threshold, the downloading strategy has not been switched frequently, and the current downloading strategy is switched to the target downloading strategy. After the switch, the failure index corresponding to the current downloading strategy is cleared so that the failure index checked next time is accurate.
Furthermore, a parameter with an aging period (expiry time) is set in the crawling system to record the frequency of downloading strategy switching events: each time the downloading strategy is switched within the aging period, the parameter is incremented by one. The second preset threshold may be the total number of preset download policies.
Specifically, if, after a download policy switch, the parameter is still within its aging period and the cumulative number of switches equals the total number of existing download policies, all download policies are considered invalid, and the data crawling program may be suspended for a period of time; for example, if the second preset time is 30 minutes, the data crawling program is suspended for 30 minutes. The suspension can be implemented by setting a pause-time parameter with an expiry time and checking it through timed scheduling: when the pause-time parameter reads 0 or can no longer be read, the data crawling program is started again. This avoids wasting resources on crawling that keeps failing.
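Steps S402 and S404 could be sketched as follows. This is a sketch under stated assumptions: the switch counter is kept in process memory rather than in a key with an expiry time, the 10-minute window is a made-up value (the 30-minute pause is the example from the text), and `release_proxies` is a hypothetical callback standing in for releasing proxy IP resources. A replaceable clock is used so the behaviour can be shown without waiting.

```python
import time


class SwitchLimiter:
    """Counts strategy switches in a time window and pauses crawling when there are too many."""

    def __init__(self, window_seconds, max_switches, pause_seconds, clock=time.monotonic):
        self.window_seconds = window_seconds   # "first preset time"
        self.max_switches = max_switches       # "second preset threshold"
        self.pause_seconds = pause_seconds     # "second preset time"
        self.clock = clock
        self.switch_times = []
        self.paused_until = None

    def _recent_switches(self):
        cutoff = self.clock() - self.window_seconds
        self.switch_times = [t for t in self.switch_times if t > cutoff]
        return len(self.switch_times)

    def request_switch(self, release_proxies):
        """Return "pause" if the switch budget is used up, otherwise "switch"."""
        if self._recent_switches() >= self.max_switches:
            self.paused_until = self.clock() + self.pause_seconds
            release_proxies()                  # hypothetical hook: free proxy IP resources
            return "pause"
        self.switch_times.append(self.clock())
        return "switch"

    def is_paused(self):
        return self.paused_until is not None and self.clock() < self.paused_until


now = [0.0]                                    # fake clock, in seconds
released = []
limiter = SwitchLimiter(window_seconds=600, max_switches=3, pause_seconds=1800,
                        clock=lambda: now[0])
results = []
for _ in range(4):
    results.append(limiter.request_switch(lambda: released.append("proxy pool")))
    now[0] += 60
print(results)               # ['switch', 'switch', 'switch', 'pause']
print(limiter.is_paused())   # True
now[0] = 2000.0              # more than 30 minutes after the pause began
print(limiter.is_paused())   # False
```

Setting `max_switches` to the number of preset strategies mirrors the remark that the second preset threshold may equal the total number of download policies.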
In one embodiment, as shown in FIG. 5, step S300 includes: obtaining the number of page download failures, the number of page parsing failures and/or the number of page parsing successes within a third preset time according to the data capture states within the third preset time, and determining the failure index based on these counts.
Specifically, within the third preset time the listener module monitors the state of web pages downloaded by the download manager, acquires the page download failure information and counts the number of page download failures. At the same time it monitors the page parsing state information in the page parser module, including page parsing success information and page parsing failure information, and counts the number of page parsing failures and the number of page parsing successes. The failure index is determined based on the monitored number of page download failures, page parsing failures and/or page parsing successes.
Specifically, after collecting a number of data capture states, the listener module 212 uses the failure index as the basis for automatically switching the download policy. Timed scheduling in the crawling system is used for this purpose: for example, with the third preset time set to 3 minutes, the failure index recorded by the listener module within those 3 minutes is checked and compared with the first preset threshold, and the downloading strategy is switched when the failure index is greater than or equal to the first preset threshold.
In one embodiment, as shown in fig. 6, obtaining the number of page download failures and/or the number of page parsing failures within a third preset time according to the data capture state within the third preset time, and determining the failure indicator based on the number of page download failures and/or the number of page parsing failures includes:
Step S302, traversing the data capture states within the third preset time;
Step S304, if a traversed data capture state is page download failure or page parsing failure, incrementing the capture failure count of the current download strategy by one; if a traversed data capture state is page parsing success, incrementing the capture success count of the current download strategy by one;
Step S306, once the data capture states within the third preset time have been traversed, determining the failure index based on the accumulated capture failure count and capture success count.
Specifically, the failure index may be determined by combining the number of times of failure of grabbing and the number of times of success of grabbing according to a preset rule, and the preset rule may be adjusted according to the statistical characteristic of the crawling system.
In one embodiment, if the failure index corresponding to the current downloading policy within the third preset time is smaller than a first preset threshold, the number of capturing failures and the number of capturing successes corresponding to the current downloading policy are set to zero.
Specifically, if the failure index corresponding to the current downloading policy within the third preset time is smaller than the first preset threshold, the current downloading policy is still valid. The capture failure count and the capture success count of the current downloading policy are then reset to zero, so that the next round of collection and counting starts fresh and yields a meaningful failure index.
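Steps S302 to S306, together with the reset to zero described above, could look like the following sketch. It assumes the failure ratio is used as the failure index and represents data capture states as plain strings; the function name, the state labels and the threshold of 0.5 are illustrative assumptions.

```python
FAILURE_STATES = ("download_failure", "parse_failure")


def evaluate_window(states, threshold, counters):
    """Traverse one window of capture states and decide whether to switch.

    Returns (failure_index, should_switch). The counters are reset to zero
    when the index stays below the threshold.
    """
    for state in states:                      # S302: traverse the states
        if state in FAILURE_STATES:           # S304: count failures ...
            counters["failure"] += 1
        elif state == "parse_success":        # ... and successes
            counters["success"] += 1

    total = counters["failure"] + counters["success"]
    index = counters["failure"] / total if total else 0.0   # S306: failure ratio

    if index >= threshold:
        return index, True                    # caller switches the strategy
    counters["failure"] = counters["success"] = 0            # strategy still valid
    return index, False


counters = {"failure": 0, "success": 0}
quiet = ["download_failure", "parse_success", "parse_failure",
         "parse_success", "parse_success"]
print(evaluate_window(quiet, 0.5, counters))   # (0.4, False)
print(counters)                                # {'failure': 0, 'success': 0}

noisy = ["download_failure", "parse_failure", "parse_failure",
         "download_failure", "parse_success"]
print(evaluate_window(noisy, 0.5, counters))   # (0.8, True)
```

In the described system this check would presumably be run by timed scheduling once per third preset time, for example every 3 minutes.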
In one embodiment, a candidate downloading strategy is determined from the plurality of downloading strategies preset in the crawling system;
whether the candidate downloading strategy is within a preset aging period is judged;
when the candidate downloading strategy is within the preset aging period, the candidate downloading strategy is determined as the target downloading strategy; when the candidate downloading strategy is not within the preset aging period, another candidate downloading strategy is determined from the remaining downloading strategies, and the method returns to the step of judging whether the candidate downloading strategy is within the preset aging period, until a target downloading strategy is obtained or no downloading strategies remain. In this scheme, a Redis key with an expiry time is set for each downloading strategy, and the target downloading strategy is confirmed by judging whether the Redis key of the corresponding candidate downloading strategy is within its preset aging period.
Specifically, the downloading strategies may be a plurality of strategies configured in advance, for example: a strategy implemented based on HttpClient, a strategy implemented based on HtmlUnit + Jsoup, a strategy that adds a Cookie to the request header, a strategy that reaches the page by clicking through from other pages within the same site, and the like.
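The candidate selection loop described above can be sketched as follows. This is a hedged illustration under several assumptions: the function name select_target_strategy is hypothetical, candidates are tried in list order (the text does not specify an order), and the Redis key with an aging period is replaced by an in-memory mapping from strategy name to an expiry timestamp so that the example stays self-contained.

```python
import time


def select_target_strategy(strategies, expiry_by_strategy, now=None):
    """Return the first candidate still within its aging period, or None.

    strategies: ordered list of preset downloading strategy names.
    expiry_by_strategy: assumed stand-in for per-strategy keys with an expiry;
        maps a strategy name to the timestamp at which its key expires.
    """
    now = time.time() if now is None else now
    remaining = list(strategies)
    while remaining:
        # Pick the next candidate from the remaining strategies.
        candidate = remaining.pop(0)
        expires_at = expiry_by_strategy.get(candidate)
        # "Within the aging period" is modelled as: a key exists and has not expired.
        if expires_at is not None and now < expires_at:
            return candidate
    # No strategies remain, so no target downloading strategy was found.
    return None
```

For example, with strategies ["httpclient", "htmlunit_jsoup"] where only the second entry has an unexpired key, the function returns "htmlunit_jsoup"; when every key has expired it returns None, which corresponds to the case where the remaining downloading strategies number 0.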
In one embodiment, the failure indicator is at least one of: the difference between the accumulated number of capture failures and the accumulated number of capture successes; the ratio of the accumulated number of capture failures to the total number of captures; the accumulated number of capture failures.
For example, suppose the accumulated number of capture failures is 40 and the accumulated number of capture successes is 10. When the difference between the accumulated number of capture failures and the accumulated number of capture successes is taken as the failure indicator, the failure indicator is 40 - 10 = 30;
when the ratio of the accumulated number of capture failures to the total number of captures is taken as the failure indicator, the failure indicator is 40 / (40 + 10) = 0.8;
when the accumulated number of capture failures is taken as the failure indicator, the failure indicator is 40.
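The three forms of the failure indicator can be computed side by side as in the sketch below. The function name failure_indicators and the dictionary keys are hypothetical, and returning 0.0 for the ratio when no captures have been recorded is an assumption made here to avoid division by zero; the text does not address that case.

```python
def failure_indicators(failures, successes):
    """Compute the three candidate failure indicators from accumulated counts."""
    total = failures + successes
    return {
        # Difference between accumulated failures and accumulated successes.
        "difference": failures - successes,
        # Ratio of accumulated failures to total captures (0.0 if nothing was captured).
        "ratio": failures / total if total else 0.0,
        # Accumulated number of failures on its own.
        "failure_count": failures,
    }
```

With 40 failures and 10 successes this yields a difference of 30, a ratio of 0.8, and a failure count of 40, matching the worked example above.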
In one embodiment, the difference between the accumulated number of capture failures and the accumulated number of capture successes is used as the failure indicator, and the first preset threshold is 100. Switching of the downloading strategy is triggered when the failure indicator is greater than 100, which avoids switching the downloading strategy because of minor network fluctuations.
In an embodiment, as shown in Fig. 7, in the network data acquisition method, a listener monitors the data capture state and sets a Redis key to record the failure index, where the Redis key may be:
IP:monitor:{proxy_type}:{machine_IP}:cumulative, where proxy_type represents the downloading strategy type, namely the type of proxy, and machine_IP represents the IP address of the machine; a successful crawl counts as -1 and a failed crawl counts as +1.
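The per-strategy, per-machine counter can be illustrated as follows. This is only a sketch: the key layout is taken to be IP:monitor:{proxy_type}:{machine_IP}:cumulative with exactly that casing, which is an assumption about the original text, and InMemoryCounterStore is a hypothetical in-memory stand-in for Redis, used so the example runs without a server. In Redis itself, counters of this kind are commonly updated with the INCR and DECR commands.

```python
from collections import defaultdict


def monitor_key(proxy_type, machine_ip):
    """Build the counter key for one downloading strategy type on one machine."""
    return f"IP:monitor:{proxy_type}:{machine_ip}:cumulative"


class InMemoryCounterStore:
    """Hypothetical stand-in for a Redis counter store (not a Redis client)."""

    def __init__(self):
        self._values = defaultdict(int)

    def incr_by(self, key, delta):
        # Add delta to the counter and return the new value.
        self._values[key] += delta
        return self._values[key]


def record_crawl_result(store, proxy_type, machine_ip, success):
    """Apply -1 for a successful crawl and +1 for a failed crawl."""
    delta = -1 if success else 1
    return store.incr_by(monitor_key(proxy_type, machine_ip), delta)
```

Because successes subtract and failures add, the stored value is the difference between the accumulated failures and the accumulated successes, which is one of the failure indicator forms listed earlier.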
A timing program performs a periodic check. When the failure index within the third preset time is greater than or equal to the first preset threshold, the program further judges whether the accumulated number of downloading strategy switches within the first preset time is greater than or equal to the second preset threshold; if so, data crawling is suspended for the second preset time, the proxy resources are released, and the failure index is reset. After the second preset time has elapsed, the data crawling program resumes data crawling.
The network data acquisition method further comprises: when the failure index is smaller than the first preset threshold, resetting the failure index and continuing to monitor the data crawling state under the current downloading strategy.
The network data acquisition method further comprises: when the accumulated number of downloading strategy switches within the first preset time is smaller than the second preset threshold, switching to the target downloading strategy, updating the recorded number of downloader switches, and clearing the failure index.
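The decision made at each periodic check can be summarised in a short sketch. The names CrawlerState and periodic_check are hypothetical, and the example simplifies the embodiment in ways that should be read as assumptions: the first and second preset times are not modelled (the switch count is assumed to already cover the first preset time), and pausing and releasing proxy resources are represented only by a flag.

```python
from dataclasses import dataclass


@dataclass
class CrawlerState:
    failure_index: int = 0
    switch_count: int = 0        # switches within the first preset time (assumed)
    current_strategy: str = ""
    paused: bool = False         # stands in for pausing and releasing proxies


def periodic_check(state, first_threshold, second_threshold, target_strategy):
    """Return 'continue', 'pause', or 'switch' and update the state accordingly."""
    if state.failure_index < first_threshold:
        # Current strategy is still effective: reset and keep monitoring.
        state.failure_index = 0
        return "continue"
    if state.switch_count >= second_threshold:
        # Too many switches already: suspend crawling and reset the index.
        state.paused = True
        state.failure_index = 0
        return "pause"
    # Otherwise switch to the target strategy and record the switch.
    state.current_strategy = target_strategy
    state.switch_count += 1
    state.failure_index = 0
    return "switch"
```

In every branch the failure index is cleared, so each check starts from a fresh count regardless of which action was taken.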
It should be understood that although the steps in the flow charts of Figs. 3-7 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in Figs. 3-7 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in Fig. 8, there is provided a network data acquisition apparatus, including: a listener module, configured to monitor a data capture state under a current downloading strategy in the crawling system, wherein the data capture state comprises page download failure, page parsing failure and/or page parsing success, and to determine a failure index corresponding to the current downloading strategy according to the data capture state;
and a downloading strategy switching module, configured to switch the current downloading strategy to a target downloading strategy if the failure index is greater than or equal to a first preset threshold; the target downloading strategy is one of a plurality of downloading strategies preset in the crawling system.
In one embodiment, the downloading policy switching module is further configured to obtain an accumulated switching number of the downloading policy for the crawling system within a first preset time in the step of switching the current downloading policy to the target downloading policy if the failure indicator is greater than or equal to a first preset threshold; and if the accumulated switching times are larger than or equal to a second preset threshold value, suspending a data capturing program of the crawling system within second preset time, and releasing the corresponding proxy IP resources.
In one embodiment, the download policy switching module is further configured to switch the current download policy to a target download policy if the accumulated number of times of switching is less than a second preset threshold;
and after the target downloading strategy is switched, clearing the failure index corresponding to the current downloading strategy.
In one embodiment, the listener module is further configured to obtain, according to the data capture state within a third preset time, the number of page download failures, the number of page parsing failures, and/or the number of page parsing success times within the third preset time, and determine the failure indicator based on the number of page download failures, the number of page parsing failures, and/or the number of page parsing success times.
In one embodiment, the listener module is further configured to obtain, according to the data capture state in a third preset time, the number of page download failures and/or the number of page analysis failures in the third preset time, and determine the failure indicator based on the number of page download failures and/or the number of page analysis failures, where the method includes:
traversing the data capture state within a third preset time;
if the traversed data capture state is page download failure or page analysis failure, accumulating the capture failure times of the current download strategy once; if the traversed data capture state is that the page analysis is successful, accumulating the capture success times of the current downloading strategy once;
and when the data capture state in the third preset time is traversed, determining the failure index based on the accumulated capture failure times and capture success times.
In one embodiment, the listener module is further configured to, if the failure index corresponding to the current downloading policy within the third preset time is smaller than a first preset threshold, zero the number of capturing failures and the number of capturing successes corresponding to the current downloading policy.
In one embodiment, the download policy switching module is further configured to determine a candidate download policy from a plurality of download policies preset in the crawling system;
judging whether the candidate downloading strategy is within a preset aging period;
when the candidate downloading strategy is within the preset aging period, determining the candidate downloading strategy as the target downloading strategy; when the candidate downloading strategy is not within the preset aging period, determining another candidate downloading strategy from the remaining downloading strategies among the plurality of downloading strategies, and returning to the step of judging whether the candidate downloading strategy is within the preset aging period, until a target downloading strategy is obtained or no downloading strategies remain.
For specific limitations of the network data acquisition apparatus, reference may be made to the above limitations of the network data acquisition method, which are not repeated here. The modules in the network data acquisition apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Fig. 9. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing web page parsing data or URL link data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a network data acquisition method.
Those skilled in the art will appreciate that the architecture shown in Fig. 9 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
monitoring a data capture state under a current downloading strategy in a crawling system, wherein the data capture state comprises page downloading failure, page analysis failure and/or page analysis success;
determining a failure index corresponding to the current downloading strategy according to the data capturing state;
if the failure index is larger than or equal to a first preset threshold value, switching the current downloading strategy into a target downloading strategy; the target downloading strategy is one of a plurality of downloading strategies preset in the crawling system.
For specific limitations of the computer device, reference may be made to the above limitations of the network data acquisition method, which is not described herein again.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
monitoring a data capture state under a current downloading strategy in a crawling system, wherein the data capture state comprises page downloading failure, page analysis failure and/or page analysis success;
determining a failure index corresponding to the current downloading strategy according to the data capturing state;
if the failure index is larger than or equal to a first preset threshold value, switching the current downloading strategy into a target downloading strategy; the target downloading strategy is one of a plurality of downloading strategies preset in the crawling system.
For specific limitations of the computer-readable storage medium, reference may be made to the above limitations of the network data acquisition method, which are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this specification.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for network data acquisition, the method comprising:
monitoring a data capture state under a current downloading strategy in a crawling system, wherein the data capture state comprises page downloading failure, page analysis failure and/or page analysis success;
determining a failure index corresponding to the current downloading strategy according to the data capturing state;
if the failure index is larger than or equal to a first preset threshold value, switching the current downloading strategy into a target downloading strategy; the target downloading strategy is one of a plurality of downloading strategies preset in the crawling system.
2. The method of claim 1, wherein the step of switching the current downloading policy to the target downloading policy if the failure indicator is greater than or equal to a first preset threshold comprises:
acquiring the accumulated switching times of the downloading strategy aiming at the crawling system within a first preset time;
and if the accumulated switching times are larger than or equal to a second preset threshold value, suspending a data capturing program of the crawling system within second preset time, and releasing the corresponding proxy IP resources.
3. The method of claim 2, further comprising:
if the accumulated switching times are smaller than a second preset threshold value, switching the current downloading strategy into a target downloading strategy;
and after the target downloading strategy is switched, clearing the failure index corresponding to the current downloading strategy.
4. The method according to claim 1, wherein the step of determining a failure indicator corresponding to the current downloading policy according to the data capture status comprises:
and according to the data capture state in a third preset time, obtaining the page download failure times, the page analysis failure times and/or the page analysis success times in the third preset time, and determining the failure index based on the page download failure times, the page analysis failure times and/or the page analysis success times.
5. The method according to claim 4, wherein the obtaining of the number of page download failures and/or the number of page parsing failures within a third preset time according to the data capture state within the third preset time, and determining the failure indicator based on the number of page download failures and/or the number of page parsing failures comprises:
traversing the data capture state within a third preset time;
if the traversed data capture state is page download failure or page analysis failure, accumulating the capture failure times of the current download strategy once; if the traversed data capture state is that the page analysis is successful, accumulating the capture success times of the current downloading strategy once;
determining the failure index based on the accumulated capture failure times and capture success times when traversing the data capture state within the third preset time;
the method further comprises the following steps:
and if the corresponding failure index under the current downloading strategy is smaller than a first preset threshold value within the third preset time, returning the corresponding capturing failure times and capturing success times under the current downloading strategy to zero.
6. The method of claim 1, further comprising:
determining a candidate downloading strategy from a plurality of downloading strategies preset in the crawling system;
judging whether the candidate downloading strategy is within a preset aging period;
when the candidate downloading strategy is within the preset aging period, determining the candidate downloading strategy as the target downloading strategy; when the candidate downloading strategy is not within the preset aging period, determining another candidate downloading strategy from the remaining downloading strategies among the plurality of downloading strategies, and returning to the step of judging whether the candidate downloading strategy is within the preset aging period, until a target downloading strategy is obtained or no downloading strategies remain.
7. The method according to any one of claims 1 to 6, wherein the failure indicator is at least one of:
the difference between the accumulated times of grabbing failure and the accumulated times of grabbing success;
the ratio of the accumulated times of grabbing failure to the total times of grabbing;
the accumulated number of capture failures.
8. A network data acquisition apparatus, the apparatus comprising:
the monitor module is used for monitoring a data capture state under a current downloading strategy in the crawling system, wherein the data capture state comprises page downloading failure, page analysis failure and/or page analysis success; determining a failure index corresponding to the current downloading strategy according to the data capturing state;
the downloading strategy switching module is used for switching the current downloading strategy into a target downloading strategy if the failure index is greater than or equal to a first preset threshold value; the target downloading strategy is one of a plurality of downloading strategies preset in the crawling system.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202011512362.0A 2020-12-19 2020-12-19 Network data acquisition method and device, computer equipment and storage medium Pending CN112541106A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011512362.0A CN112541106A (en) 2020-12-19 2020-12-19 Network data acquisition method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112541106A true CN112541106A (en) 2021-03-23

Family

ID=75019307

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011512362.0A Pending CN112541106A (en) 2020-12-19 2020-12-19 Network data acquisition method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112541106A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486229A (en) * 2021-07-05 2021-10-08 北京百度网讯科技有限公司 Method and device for controlling grabbing pressure, electronic equipment and readable storage medium
CN113486229B (en) * 2021-07-05 2023-11-07 北京百度网讯科技有限公司 Control method and device for grabbing pressure, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination