CN117474694A - Cloud financial data acquisition method based on web crawler technology - Google Patents
Cloud financial data acquisition method based on web crawler technology
- Publication number
- CN117474694A (application CN202311413924.XA)
- Authority
- CN
- China
- Prior art keywords
- processing
- crawler
- exception
- callback
- abnormal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/12—Accounting
- G06Q40/125—Finance or payroll
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/02—Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a cloud financial data acquisition method based on web crawler technology, relates to the technical field of data acquisition, and is used for solving the problem that existing cloud financial acquisition cannot access the data source, so that universal acquisition is not applicable. Through the crawler exception handling scheme and the exception callback scheme, different types of exception conditions can be better handled and the robustness and stability of the crawler are improved; in use, relevant laws and regulations and the target website's terms of use are complied with, ensuring legal and compliant data acquisition. Anti-crawler techniques are automatically identified and handled, improving data grabbing efficiency and accuracy; the grabbing strategy can be automatically adjusted according to the structures and rules of different websites, optimizing data acquisition efficiency and accuracy; and the method provides a complete exception retry mechanism and an open exception callback capability.
Description
Technical Field
The invention relates to the technical field of data acquisition, in particular to a cloud financial data acquisition method based on a web crawler technology.
Background
Cloud finance refers to financial work of enterprises in cloud computing environments, and essentially, a virtual accounting information system is built on the Internet by utilizing cloud technology, so that financial accounting, financial management and other contents of the enterprises are completed. Under the cloud financial environment, financial information is shared through the cloud, and enterprise financial staff can process the financial information anytime and anywhere, so that the working efficiency of the financial staff is greatly improved; the enterprise manager can comprehensively and systematically predict, identify, control and deal with the business risk of the enterprise through mining analysis after the financial information and the non-financial information are fused in real time, so that the flexible adaptation of the enterprise to market change is realized. The adoption of the outsourcing mode for the construction and service of accounting informatization further promotes the forward development of enterprise financial work.
Difficulty of cloud financial acquisition: cloud finance builds and services accounting informatization in an outsourcing mode, whereas conventional financial software uses a locally deployed (privatized) mode. For conventional financial software, acquisition can be performed by directly accessing the software's data source (its back-end database), normalizing the data into a unified format standard, and collecting it automatically. Because cloud finance adopts an outsourced, shared architecture, its data source cannot be accessed directly, so the universal acquisition approach is not applicable.
Disclosure of Invention
The invention aims to solve the problem that existing cloud financial acquisition cannot access the data source, so that universal acquisition is not applicable, and provides a cloud financial data acquisition method based on web crawler technology. In terms of feasibility, data acquisition is realized through techniques such as web crawling, reverse-engineered simulated interface requests, simulated clicking, data downloading, and data management.
The aim of the invention can be achieved by the following technical scheme: a cloud financial data acquisition method based on a web crawler technology comprises the following steps:
determining websites and data types to be grabbed, and setting corresponding grabbing tasks and rules;
simulating a user to log in a website, or opening a browser to log in;
acquiring the logged cookie data, and simulating a user interface request to acquire back-end source data;
according to the website structure and anti-crawler technology, automatically adjusting the grabbing strategy and handling abnormal conditions;
processing a plurality of grabbing tasks through a distributed architecture;
monitoring and analyzing the data grabbing result in real time, feeding back abnormal conditions in time, and carrying out corresponding processing and adjustment;
and carrying out structural preservation on the grabbing result according to the requirements of the financial analysis software.
As a preferred embodiment of the present invention, the setting corresponding grabbing tasks and rules includes:
initial setting: setting initial crawling speed, frequency and crawler delay time;
monitoring target websites: monitoring response time and state code of the target website;
dynamic adjustment policy algorithm: according to the monitored response time and the server load condition, the crawling strategy is automatically adjusted, and the crawling strategy is specifically expressed as follows: if the response time is longer or the server load is higher, the delay time of the crawler is increased or the number of concurrent requests is reduced; if the response time is shorter and the server load is lower, the crawling rate is increased or the number of concurrent requests is increased;
adjustment mechanism: the automatic monitoring tool monitors response time and server load conditions in real time, and when the condition that adjustment is required is monitored, a corresponding mechanism sends a signal to a crawler program to enable the crawler program to automatically modify a crawling strategy; the crawler program re-adjusts the request frequency and the concurrency number within a specified time interval;
setting a crawling rate algorithm: the crawling rate is set based on the crawler policy or robots.txt file of the target website.
As a preferred embodiment of the present invention, the handling of abnormal conditions includes:
network exception handling: adding a retry mechanism, and setting the maximum number of retries and the delay time;
HTTP exception handling: taking corresponding measures according to the specific status code or error information, wherein the corresponding measures include retrying and updating the request header;
data processing exception handling: recording error information and processing the erroneous data, wherein the processing includes skipping and re-parsing;
anti-crawler restriction handling: adopting corresponding coping strategies according to the different restriction means, wherein the strategies include using proxies, processing verification codes, and simulating login.
As a preferred embodiment of the present invention, the performing of the corresponding processing and adjustment includes:
handling abnormal conditions: processing the anti-crawler mechanism of the target website by using an IP proxy, random User-Agent switching, and cookie management; adding corresponding processing logic for verification codes and login restrictions;
logging and analysis: and recording a request log of the crawler, analyzing the log, detecting potential problems, recording an abnormal log, and further optimizing a crawling strategy according to an analysis result.
As a preferred embodiment of the present invention, the request log includes response time, status code, and error information; recording the exception log means recording the occurrence time, URL, exception type, and detailed information of each exception, storing them using a log system or database, and keeping a history record; exception types include network exceptions, HTTP exceptions, data processing exceptions, and anti-crawler restrictions.
As a preferred embodiment of the present invention, the processing of the exception condition further includes exception callback processing, which specifically includes:
defining an exception callback function: defining an exception callback function in the crawler program, wherein the exception callback function is used for processing the captured exception; the callback function can receive the abnormal information as a parameter and perform corresponding processing according to the abnormal type;
callback processing strategy: determining the callback processing behavior according to the severity and specific circumstances of the exception; the behaviors include retrying, skipping, and alarming; for severe anomalies, the crawler operation is stopped and a corresponding notification or alarm mechanism is triggered, wherein severe anomalies include network connection problems and anti-crawler restrictions;
triggering abnormal callback: capturing possible anomalies in the crawler program, and calling an anomaly callback function at a corresponding position; capturing an exception through a try-except structure, and calling a callback function in the except block;
exception callback optimization: and continuously optimizing and improving according to the execution condition and the processing result of the abnormal callback, and dynamically adjusting the callback processing strategy according to the actual condition, wherein the callback processing strategy comprises increasing retry times and adjusting processing sequence.
Compared with the prior art, the invention has the beneficial effects that:
1. according to the crawler exception handling scheme and the exception callback scheme, different types of exception conditions can be better handled, the robustness and stability of the crawler are improved, relevant laws and regulations and the use regulations of a target website are complied with in use, and legal and compliant data acquisition work is ensured.
2. According to the invention, through automatic identification and handling of anti-crawler technology, data grabbing efficiency and accuracy are improved; the grabbing strategy can be automatically adjusted according to the structures and rules of different websites, optimizing data acquisition efficiency and accuracy; and a complete exception retry mechanism and an open exception callback capability are provided.
Drawings
The present invention is further described below with reference to the accompanying drawings for the convenience of understanding by those skilled in the art.
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The technical solutions of the present invention will be clearly and completely described in connection with the embodiments, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention is mainly implemented in the Python programming language and involves the selenium, requests, re, beautifulsoup, time, and json modules, among others; Python is a cross-platform, object-oriented, dynamically typed computer programming language originally designed for writing automation scripts.
The core of the invention is the selenium library, in particular its webdriver module.
The selenium library, usable as a module in the Python language, is a complete web application testing system that includes test recording. Its recording component, Selenium IDE, is the official browser-plug-in-based record-and-playback tool for creating and executing basic Selenium test scripts.
Installation: Selenium IDE is a Chrome plug-in that can be searched for and installed in the browser's extension store. After installation, an icon resembling a recorder, the Selenium IDE icon, appears in the browser toolbar.
Selenium IDE recording process:
Recording a script: open the Selenium IDE plug-in and begin recording the test script. Clicking the record button records the operational steps in the browser, where the sample of data to be collected is selected for recording; the pause and resume buttons can also be used to control the recording process.
Editing a script: after recording finishes, the generated script can be edited and modified. Assertion, verification, and wait operations can be added, and the order of the steps adjusted.
Playing back a script: after script editing is complete, the test script is executed by clicking the playback button. Selenium IDE automatically reproduces the process of collecting data samples in the browser.
Exporting a script: the recorded script is exported as a Selenium WebDriver script in the Python programming language.
Selenium IDE is relatively limited in functionality and flexibility and is suited to simple scenarios and the quick creation of test and acquisition scripts. For more complex acquisition scenarios, acquisition scripts are written using Selenium WebDriver together with a programming language.
The Selenium WebDriver workflow is as follows:
Installation and configuration: install the selenium module in Python using pip, then download and configure the required browser driver.
The webdriver module starts the target browser and binds it to a designated port; the started browser instance serves as the remote server of the webdriver module.
The client sends a network request to the listening port of the remote server;
the remote server relies on native browser components to translate WebDriver calls into the browser's local calls.
The re module, i.e., the regular expression module, is an essential module for writing regular expressions in the Python language. A regular expression is an expression composed of a set of special symbols that describes a matching rule.
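As a brief illustration of how the re module supports the crawler, the following sketch extracts monetary amounts from crawled page text; the sample text, pattern, and field names are hypothetical illustrations, not taken from the patent.

```python
import re

# Hypothetical sample of crawled financial text.
html_text = "Revenue: 1,234.56 CNY  Net profit: 789.00 CNY"

# Match amounts with optional thousands separators and two decimal places.
amount_pattern = re.compile(r"\d{1,3}(?:,\d{3})*\.\d{2}")
amounts = amount_pattern.findall(html_text)
print(amounts)  # ['1,234.56', '789.00']
```

In a real acquisition script the pattern would be tailored to the structure of the target financial pages.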
The time module contains a large number of time-related functions built into the Python language and is an indispensable module for controlling the program by time point or time period.
The json module, part of the Python standard library, provides the functions necessary for storing and reading data in JSON files.
The BeautifulSoup module is used to extract data from HTML and XML files. It can parse HTML and XML documents and provides methods for traversing the document tree and searching for tags and data within it.
The requests module is a Python third-party library used for sending HTTP requests to a web server and obtaining the response.
Based on the python language programming, the application discloses a cloud financial data acquisition method based on a web crawler technology, which is realized through a webpage analysis module, a webpage operation module, an element identification module and a data storage module.
The webpage analysis module performs webpage structure analysis and automation operation on a given webpage; the method comprises the following steps:
for each program script, a network request is created and sent to the browser's driver; the browser driver comprises a network server for receiving the network requests; the network server receives the request and then specifically controls the corresponding browser according to the request; the browser executes specific steps; the browser returns the execution result of the step to the network server; the network server returns the result to the program script, and if the network code is wrong, the corresponding error reporting information can be seen at the control console.
The webpage operation module simulates, through programs, a human's basic operations on the webpage. The invention mainly involves text parameter input, drop-down list selection, scroll-bar scrolling, page turning, link clicking, page refreshing, page exiting, and similar operations;
the element identification module parses the webpage structure to extract valuable financial text data;
the data storage module stores the located elements locally in the form of JSON files. To facilitate data storage, the module also involves basic data processing operations such as numbering and naming the data.
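The data storage module's numbering, naming, and JSON preservation steps might be sketched as follows; the file-naming convention and record fields are assumptions for illustration, not specified by the patent.

```python
import json
import os
import tempfile

def save_records(records, out_dir, prefix="financial"):
    """Number, name, and save located elements as local JSON files."""
    paths = []
    for i, rec in enumerate(records, start=1):
        rec_with_id = {"id": i, **rec}  # numbering step
        path = os.path.join(out_dir, f"{prefix}_{i:04d}.json")  # naming step
        with open(path, "w", encoding="utf-8") as f:
            json.dump(rec_with_id, f, ensure_ascii=False, indent=2)
        paths.append(path)
    return paths

# Usage with a hypothetical record:
with tempfile.TemporaryDirectory() as d:
    paths = save_records([{"subject": "cash", "amount": "1,234.56"}], d)
    print(len(paths))  # 1
```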
Referring to fig. 1, the cloud financial data acquisition method based on the web crawler technology specifically includes the following steps:
step one: determining websites and data types to be grabbed, and setting corresponding grabbing tasks and rules; the method comprises the following steps:
initial setting:
the initial crawling rate and frequency are set to ensure that the target web site is not burdened excessively.
A reasonable crawler delay time (e.g., latency between each request) is set.
Monitoring target websites:
the response time and status code of the target web site are monitored to detect potential problems or anomalies.
Dynamic adjustment policy algorithm:
and automatically adjusting the crawling strategy according to the monitored response time and the server load condition.
If the response time is longer or the server load is higher, the delay time of the crawler can be increased or the number of concurrent requests can be reduced;
if the response time is shorter and the server load is lower, the crawling rate can be moderately increased or the number of concurrent requests can be increased.
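The dynamic adjustment strategy above can be sketched as a pure function; the thresholds and scaling factors here are illustrative assumptions, not values specified by the patent.

```python
def adjust_policy(delay, concurrency, response_time, server_load,
                  slow_threshold=2.0, load_threshold=0.8,
                  fast_threshold=0.5, idle_threshold=0.3):
    """Return a new (delay, concurrency) pair based on monitored conditions."""
    if response_time > slow_threshold or server_load > load_threshold:
        delay *= 2                              # back off: increase crawler delay
        concurrency = max(1, concurrency // 2)  # reduce concurrent requests
    elif response_time < fast_threshold and server_load < idle_threshold:
        delay = max(0.1, delay / 2)             # speed up: raise the crawl rate
        concurrency += 1                        # increase concurrent requests
    return delay, concurrency

print(adjust_policy(1.0, 8, response_time=3.0, server_load=0.9))  # (2.0, 4)
```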
Adjustment mechanism:
the automated monitoring tool monitors the response time and server load conditions in real time.
Under the condition that adjustment is required, the corresponding mechanism can send a signal to the crawler program to enable the crawler program to automatically modify the crawling strategy.
The crawler may re-make adjustments to the request frequency and the number of concurrency within a specified time interval.
Setting a crawling rate algorithm:
Different websites have different requirements on crawling rate; follow the target website's crawler policy or the rules in its robots.txt file.
To avoid requesting the same page too quickly in succession, a random delay time can be introduced between URL requests.
For large-scale crawling, requests can be evenly dispersed and the burden reduced through batch processing, distributed crawling, and similar approaches.
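The random-delay and batch-dispersal suggestions above might look like this in Python; the delay bounds and batch size are illustrative assumptions.

```python
import random
import time

def polite_delay(base=1.0, jitter=2.0):
    """Sleep for a random delay between successive URL requests."""
    d = base + random.uniform(0, jitter)
    time.sleep(d)
    return d

def batches(urls, size):
    """Split a URL list into batches so requests can be dispersed
    across workers in a batch-processing or distributed crawl."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

d = polite_delay(base=0.01, jitter=0.02)  # tiny values for demonstration
print(batches(["u1", "u2", "u3", "u4", "u5"], 2))
# [['u1', 'u2'], ['u3', 'u4'], ['u5']]
```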
Step two: simulate a user logging in to the website, or open a browser for the user to log in. Some websites apply verification-code checks and similar measures to login, which need to be handled according to the actual situation. In addition, to avoid password leakage, it is suggested to store the user name and password in a configuration file or database rather than hard-coding them.
Step three: and acquiring the logged cookie data for simulating the user interface request and acquiring the back-end source data.
Step four: according to the website structure and anti-crawler technology, automatically adjust the grabbing strategy and handle abnormal conditions; the handling of abnormal conditions comprises a crawler exception handling scheme and an exception callback scheme.
Crawler exception handling scheme:
Exception type classification:
Network exceptions: such as connection timeouts, DNS resolution errors, etc.
HTTP exceptions: such as status code errors, request rejections, etc.
Data processing exceptions: such as HTML/XML parsing failures, data format errors, etc.
Anti-crawler restrictions: such as verification codes, login restrictions, etc.
The corresponding handling measures are as follows:
Network exception handling: a retry mechanism can be added, with the maximum number of retries and the delay time configurable.
HTTP exception handling: corresponding measures are taken, such as retrying or updating the request header, based on the particular status code or error information.
Data processing exception handling: error information is recorded and the erroneous data is processed, e.g., by skipping or re-parsing.
Anti-crawler restriction handling: corresponding coping strategies are adopted according to the different restriction means, such as using proxies, processing verification codes, or simulating login.
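The retry mechanism for network exceptions described above might be sketched as follows; the fixed delay and exception types are assumptions (a production crawler would likely use exponential backoff and the exception classes of its HTTP library).

```python
import time

def fetch_with_retry(fetch, url, max_retries=3, delay=1.0):
    """Retry a fetch up to max_retries times, waiting `delay` seconds
    between attempts; re-raise the last exception if all attempts fail."""
    last_exc = None
    for _attempt in range(max_retries):
        try:
            return fetch(url)
        except (ConnectionError, TimeoutError) as exc:
            last_exc = exc
            time.sleep(delay)
    raise last_exc  # all retries exhausted

# Usage with a flaky stand-in for a real HTTP fetch:
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connect timeout")
    return "<html>ok</html>"

result = fetch_with_retry(flaky, "https://example.com", delay=0.01)
print(result)  # <html>ok</html>
```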
Exception log record:
the time, URL, anomaly type, detailed information, and the like of each anomaly occurrence are recorded.
A log system or database may be used for storage and a long enough history is maintained for analysis and monitoring of anomalies.
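A minimal sketch of the exception log record, keeping the occurrence time, URL, exception type, and details in memory; as the text notes, a real deployment would persist these entries in a logging system or database. The field names are illustrative.

```python
import datetime

exception_log = []  # in production: a logging system or database table

def record_exception(url, exc_type, detail):
    """Record the occurrence time, URL, exception type, and details."""
    entry = {
        "time": datetime.datetime.now().isoformat(),
        "url": url,
        "type": exc_type,
        "detail": detail,
    }
    exception_log.append(entry)  # history is kept for analysis and monitoring
    return entry

e = record_exception("https://example.com/report", "HTTPException", "status 503")
print(e["type"])  # HTTPException
```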
Handling abnormal conditions:
For the anti-crawler mechanism of the target website, technical means such as IP proxies, random User-Agent switching, and cookie management can be used.
Corresponding processing logic can be added for verification codes, login restrictions, and other situations that may occur.
Logging and analysis:
the request log of the crawler is recorded, including response time, status code, error information, etc.
Analyzing the log, detecting potential problems, and further optimizing a crawling strategy according to an analysis result;
exception callback scheme:
defining an exception callback function:
an exception callback function is defined in the crawler program for handling the captured exception.
The callback function can receive the abnormal information as a parameter and perform corresponding processing according to the abnormal type.
Callback processing strategy:
The callback processing behavior is determined according to the severity and specific circumstances of the exception, e.g., retrying, skipping, or alarming.
For severe anomalies, such as network connection problems or anti-crawler restrictions, the crawler can be stopped and a corresponding notification or alarm mechanism triggered.
Triggering abnormal callback:
the possible exception is captured in the crawler and the exception callback function is called in the appropriate location.
Exceptions may be captured through a try-except structure and callback functions are called in the except block.
Exception callback optimization:
and continuously optimizing and improving according to the execution condition and the processing result of the abnormal callback.
The callback processing strategy can be dynamically adjusted according to actual conditions, such as increasing retry times, adjusting processing sequence and the like.
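The exception callback scheme above — a callback invoked from an except block that stops the crawler and raises an alarm for severe anomalies — can be sketched as follows; the severity classification, exception names, and return values are illustrative assumptions.

```python
# Hypothetical severity classification (the patent names network connection
# problems and anti-crawler restrictions as severe anomalies).
SEVERE = {"NetworkError", "AntiCrawlerLimit"}

def on_exception(exc_type, url, alerts, state):
    """Exception callback: receives exception info as parameters and decides
    the handling behavior; severe anomalies stop the crawler and alarm."""
    if exc_type in SEVERE:
        state["running"] = False                      # stop the crawler
        alerts.append(f"ALERT {exc_type} at {url}")   # trigger notification
        return "stop"
    return "retry"

def crawl_one(fetch, url, alerts, state):
    try:
        return fetch(url)
    except RuntimeError as exc:
        # capture via try-except and call the callback in the except block
        return on_exception(str(exc), url, alerts, state)

# Usage with a fetch that hits an anti-crawler restriction:
alerts, state = [], {"running": True}
def bad(url):
    raise RuntimeError("AntiCrawlerLimit")

print(crawl_one(bad, "https://example.com", alerts, state))  # stop
```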
Through the crawler exception handling scheme and the exception callback scheme, different types of exception conditions can be better handled, and the robustness and stability of the crawler are improved. In use, relevant laws and regulations and the target website's terms of use should be complied with to ensure legal and compliant data acquisition.
Step five: through a distributed architecture, a plurality of grabbing tasks are processed simultaneously, so that the system efficiency and stability are improved;
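Concurrent processing of multiple grabbing tasks, as in step five, can be sketched with a worker pool; a true distributed architecture would dispatch tasks across machines via a shared queue, but the worker pattern is analogous. The task and result names here are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def grab(task):
    """Stand-in for a single grabbing task (a real task would fetch a URL)."""
    return f"result-of-{task}"

def run_tasks(tasks, workers=4):
    """Process multiple grabbing tasks concurrently with a thread pool;
    map() preserves the original task order in its results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(grab, tasks))

results = run_tasks(["t1", "t2", "t3"])
print(results)  # ['result-of-t1', 'result-of-t2', 'result-of-t3']
```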
step six: timely feeding back abnormal conditions by monitoring and analyzing data grabbing results in real time, and carrying out corresponding processing and adjustment; comprising the following steps:
Handling abnormal conditions: for the anti-crawler mechanism of the target website, technical means such as IP proxies, random User-Agent switching, and cookie management can be used. Corresponding processing logic can be added for verification codes, login restrictions, and other situations that may occur.
Logging and analysis: the request log of the crawler is recorded, including response time, status code, error information, etc. The log is analyzed to detect potential problems, and the crawling strategy is further optimized according to the analysis result.
Step seven: and (5) carrying out structural preservation on the result according to the requirements of the financial analysis software.
According to the crawler exception handling scheme and the exception callback scheme, different types of exception conditions can be better handled and the robustness and stability of the crawler improved; relevant laws and regulations and the target website's terms of use are complied with in use, ensuring legal and compliant data acquisition.
Anti-crawler technology is automatically identified and handled, improving data grabbing efficiency and accuracy; the grabbing strategy can be automatically adjusted according to the structures and rules of different websites, optimizing data acquisition efficiency and accuracy; and a complete exception retry mechanism and an open exception callback capability are provided.
The preferred embodiments of the invention disclosed above are intended only to assist in the explanation of the invention. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and the full scope and equivalents thereof.
Claims (6)
1. The cloud financial data acquisition method based on the web crawler technology is characterized by comprising the following steps of:
determining websites and data types to be grabbed, and setting corresponding grabbing tasks and rules;
simulating a user to log in a website, or opening a browser to log in;
acquiring the logged cookie data, and simulating a user interface request to acquire back-end source data;
according to the website structure and anti-crawler technology, automatically adjusting the grabbing strategy and handling abnormal conditions;
processing a plurality of grabbing tasks through a distributed architecture;
monitoring and analyzing the data grabbing result in real time, feeding back abnormal conditions in time, and carrying out corresponding processing and adjustment;
and carrying out structural preservation on the grabbing result according to the requirements of the financial analysis software.
2. The cloud financial data acquisition method based on web crawler technology according to claim 1, wherein setting corresponding crawling tasks and rules comprises:
initial setting: setting the initial crawling speed, frequency, and crawler delay time;
monitoring the target website: monitoring the response time and status codes of the target website;
dynamic adjustment policy algorithm: automatically adjusting the crawling strategy according to the monitored response time and server load, specifically: if the response time is long or the server load is high, increasing the crawler delay time or reducing the number of concurrent requests; if the response time is short and the server load is low, increasing the crawling rate or the number of concurrent requests;
adjustment mechanism: an automatic monitoring tool monitors response time and server load in real time; when a condition requiring adjustment is detected, a corresponding mechanism signals the crawler program, which automatically modifies its crawling strategy, re-adjusting the request frequency and concurrency within a specified time interval;
setting the crawling rate: the crawling rate is set based on the crawler policy or the robots.txt file of the target website.
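The dynamic adjustment policy of claim 2 reduces to a small pure function. This is a minimal sketch under assumed thresholds (2 s / 0.5 s response time, 80 % / 30 % load) and assumed bounds on delay and concurrency; none of these constants come from the patent.

```python
def adjust_policy(delay: float, concurrency: int,
                  response_time: float, server_load: float,
                  slow_rt: float = 2.0, fast_rt: float = 0.5,
                  high_load: float = 0.8, low_load: float = 0.3) -> tuple:
    """Return the new (delay, concurrency) pair.

    Slow responses or high server load -> back off (longer delay, fewer
    concurrent requests); fast responses and low load -> speed up.
    """
    if response_time > slow_rt or server_load > high_load:
        delay = min(delay * 2, 60.0)            # increase crawler delay, capped
        concurrency = max(1, concurrency - 1)   # reduce concurrent requests
    elif response_time < fast_rt and server_load < low_load:
        delay = max(delay / 2, 0.1)             # raise the crawling rate
        concurrency = min(concurrency + 1, 16)  # allow more concurrency
    return delay, concurrency
```

A monitoring loop would call this at the "specified time interval" the claim mentions and feed the result back into the request scheduler.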
3. The cloud financial data acquisition method based on web crawler technology according to claim 1, wherein handling abnormal conditions comprises:
network exception handling: adding a retry mechanism, and setting a maximum retry count and delay time;
HTTP exception handling: taking corresponding measures according to the specific status code or error information, the measures including retrying and updating the request headers;
data processing exception handling: recording the error information and handling the erroneous data, the handling including skipping and re-parsing;
anti-crawler restriction handling: adopting corresponding coping strategies for different restriction means, the strategies including using proxies, handling verification codes, and simulating login.
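The retry mechanism of claim 3 (maximum retry count plus delay) can be sketched as a small wrapper. The exponential backoff factor and the choice of `ConnectionError`/`TimeoutError` as the retryable exceptions are assumptions for illustration, not requirements of the claim.

```python
import time

def with_retries(fn, max_retries: int = 3,
                 delay: float = 1.0, backoff: float = 2.0):
    """Call fn(); on a network exception, retry up to max_retries times,
    sleeping `delay` seconds between attempts (multiplied by `backoff`)."""
    attempt = 0
    while True:
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            attempt += 1
            if attempt > max_retries:
                raise  # exhausted: surface the exception for logging/callbacks
            time.sleep(delay)
            delay *= backoff  # back off before the next attempt
```

In a crawler, `fn` would be a zero-argument closure over one page fetch, so each URL gets its own independent retry budget.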
4. The cloud financial data acquisition method based on web crawler technology according to claim 1, wherein performing corresponding processing and adjustment comprises:
handling abnormal conditions: coping with the anti-crawler mechanisms of the target website by using IP proxies, random User-Agent switching, and cookie management; adding corresponding processing logic for verification codes and login restrictions;
logging and analysis: recording the crawler's request logs, analyzing the logs to detect potential problems, recording exception logs, and further optimizing the crawling strategy according to the analysis results.
5. The cloud financial data acquisition method based on web crawler technology according to claim 4, wherein the request logs comprise response time, status codes, and error information; recording the exception logs comprises recording the occurrence time, URL, exception type, and detailed information of each exception, storing them with a logging system or database, and keeping a history; the exception types include network exceptions, HTTP exceptions, data processing exceptions, and anti-crawler restrictions.
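The exception-log entry of claim 5 (time, URL, type, detail) and the four-way classification can be sketched as follows. The `AntiCrawlerBlock` class and the mapping of Python exception types onto the four categories are hypothetical choices made for this sketch.

```python
import datetime
from urllib.error import HTTPError, URLError

class AntiCrawlerBlock(Exception):
    """Hypothetical: raised when a captcha or IP ban is detected."""

def classify(exc: Exception) -> str:
    """Map an exception onto the four logged categories of claim 5."""
    if isinstance(exc, AntiCrawlerBlock):
        return "anti-crawler restriction"
    if isinstance(exc, HTTPError):          # check before URLError (its parent)
        return "HTTP exception"
    if isinstance(exc, (URLError, ConnectionError, TimeoutError)):
        return "network exception"
    return "data processing exception"      # e.g. parse/serialization errors

def make_log_entry(url: str, exc: Exception) -> dict:
    """One exception-log record: occurrence time, URL, type, and detail."""
    return {
        "time": datetime.datetime.now().isoformat(timespec="seconds"),
        "url": url,
        "type": classify(exc),
        "detail": f"{type(exc).__name__}: {exc}",
    }
```

Entries in this shape can be appended to any logging system or database table to keep the history the claim requires.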
6. The cloud financial data acquisition method based on web crawler technology according to claim 3, wherein handling abnormal conditions further comprises handling exception callbacks, specifically:
defining an exception callback function: defining an exception callback function in the crawler program for handling captured exceptions; the callback function receives the exception information as a parameter and performs corresponding processing according to the exception type;
callback processing strategy: determining the callback processing behavior according to the severity and specific circumstances of the exception;
the behaviors include whether to retry, skip, or raise an alarm; for serious exceptions, including network connection problems and anti-crawler restrictions, stopping the crawler and triggering a corresponding notification or alarm mechanism;
triggering the exception callback: capturing possible exceptions in the crawler program and calling the exception callback function at the corresponding position; exceptions are captured with a try-except structure, and the callback function is called in the except block;
exception callback optimization: continuously optimizing and improving according to the execution and processing results of the exception callbacks, and dynamically adjusting the callback processing strategy according to actual conditions, including increasing the retry count and adjusting the processing order.
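The callback flow of claim 6 can be sketched as a dispatch function called inside an except block. The severity mapping (which exceptions count as "serious") and the `AntiCrawlerBlock` class are assumptions of this sketch, not fixed by the claim.

```python
class AntiCrawlerBlock(Exception):
    """Hypothetical: raised when a captcha or IP ban is detected."""

def exception_callback(exc: Exception, url: str) -> str:
    """Decide the callback action from the exception's type and severity."""
    if isinstance(exc, (ConnectionError, AntiCrawlerBlock)):
        return "alarm"   # serious: stop the crawler and notify
    if isinstance(exc, TimeoutError):
        return "retry"   # transient: worth another attempt
    return "skip"        # e.g. a single page that failed to parse

def crawl_page(fetch, url: str, on_exception=exception_callback) -> str:
    """Capture exceptions with try-except and call the callback in except."""
    try:
        fetch(url)
        return "ok"
    except Exception as exc:
        action = on_exception(exc, url)
        if action == "alarm":
            # serious anomaly: stop the crawler and trigger the alarm mechanism
            raise RuntimeError(f"crawler stopped at {url}: {exc}") from exc
        return action
```

Because the callback is a parameter, the processing strategy can be swapped or tuned at runtime, which is how the "dynamic adjustment" in the last step of the claim could be realized.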
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311413924.XA CN117474694A (en) | 2023-10-30 | 2023-10-30 | Cloud financial data acquisition method based on web crawler technology |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117474694A true CN117474694A (en) | 2024-01-30 |
Family
ID=89634121
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311413924.XA Pending CN117474694A (en) | 2023-10-30 | 2023-10-30 | Cloud financial data acquisition method based on web crawler technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117474694A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||