CN117474694A - Cloud financial data acquisition method based on web crawler technology - Google Patents
Cloud financial data acquisition method based on web crawler technology
- Publication number
- CN117474694A (application CN202311413924.XA)
- Authority
- CN
- China
- Prior art keywords
- processing
- crawler
- exception
- callback
- abnormal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/12—Accounting
- G06Q40/125—Finance or payroll
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/02—Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a cloud financial data acquisition method based on web crawler technology, relates to the technical field of data acquisition, and is used for solving the problem that existing cloud financial acquisition cannot access the data source, so that universal acquisition is not applicable. Through the crawler exception handling scheme and the exception callback scheme, different types of exception conditions can be better handled and the robustness and stability of the crawler are improved; in use, relevant laws and regulations and the target website's terms of use are complied with, ensuring legal and compliant data acquisition. Anti-crawler techniques are automatically identified and handled, improving data grabbing efficiency and accuracy; the grabbing strategy can be automatically adjusted according to the structures and rules of different websites, optimizing data acquisition efficiency and accuracy; and the method provides a complete exception retry mechanism and an open exception callback capability.
Description
Technical Field
The invention relates to the technical field of data acquisition, in particular to a cloud financial data acquisition method based on a web crawler technology.
Background
Cloud finance refers to financial work of enterprises in cloud computing environments, and essentially, a virtual accounting information system is built on the Internet by utilizing cloud technology, so that financial accounting, financial management and other contents of the enterprises are completed. Under the cloud financial environment, financial information is shared through the cloud, and enterprise financial staff can process the financial information anytime and anywhere, so that the working efficiency of the financial staff is greatly improved; the enterprise manager can comprehensively and systematically predict, identify, control and deal with the business risk of the enterprise through mining analysis after the financial information and the non-financial information are fused in real time, so that the flexible adaptation of the enterprise to market change is realized. The adoption of the outsourcing mode for the construction and service of accounting informatization further promotes the forward development of enterprise financial work.
Difficulty of cloud financial acquisition: cloud finance builds and services accounting informatization in an outsourcing mode, whereas conventional financial software uses a locally deployed (privatized) mode. For conventional financial software, acquisition can be performed by directly accessing the software's data source (its back-end database), normalizing the data into a unified format standard, and collecting it automatically. Because cloud finance adopts an outsourced, shared architecture, its data source cannot be accessed directly, so the universal acquisition approach is not applicable.
Disclosure of Invention
The invention aims to solve the problem that existing cloud financial acquisition cannot access the data source, so that universal acquisition is not applicable, and provides a cloud financial data acquisition method based on web crawler technology. In terms of feasibility, data acquisition is realized through techniques such as web crawling, reverse-engineered simulated interface requests, simulated clicking, data downloading, and data management.
The aim of the invention can be achieved by the following technical scheme: a cloud financial data acquisition method based on a web crawler technology comprises the following steps:
determining websites and data types to be grabbed, and setting corresponding grabbing tasks and rules;
simulating a user to log in a website, or opening a browser to log in;
acquiring the logged cookie data, and simulating a user interface request to acquire back-end source data;
according to the website structure and anti-crawler technology, automatically adjusting the grabbing strategy and handling abnormal conditions;
processing a plurality of grabbing tasks through a distributed architecture;
monitoring and analyzing the data grabbing result in real time, feeding back abnormal conditions in time, and carrying out corresponding processing and adjustment;
and carrying out structural preservation on the grabbing result according to the requirements of the financial analysis software.
As a preferred embodiment of the present invention, the setting corresponding grabbing tasks and rules includes:
initial setting: setting initial crawling speed, frequency and crawler delay time;
monitoring target websites: monitoring response time and state code of the target website;
dynamic adjustment policy algorithm: according to the monitored response time and the server load condition, the crawling strategy is automatically adjusted, and the crawling strategy is specifically expressed as follows: if the response time is longer or the server load is higher, the delay time of the crawler is increased or the number of concurrent requests is reduced; if the response time is shorter and the server load is lower, the crawling rate is increased or the number of concurrent requests is increased;
adjustment mechanism: the automatic monitoring tool monitors response time and server load conditions in real time, and when the condition that adjustment is required is monitored, a corresponding mechanism sends a signal to a crawler program to enable the crawler program to automatically modify a crawling strategy; the crawler program re-adjusts the request frequency and the concurrency number within a specified time interval;
setting a crawling rate algorithm: the crawling rate is set based on the crawler policy or robots.txt file of the target website.
As a preferred embodiment of the present invention, the handling of abnormal conditions includes:
network exception handling: adding a retry mechanism, and setting the maximum number of retries and the delay time;
HTTP exception handling: taking corresponding measures according to the specific status code or error information, wherein the corresponding measures include retrying and updating the request header;
data processing exception handling: recording error information and processing the erroneous data, wherein the processing includes skipping and re-parsing;
anti-crawler restriction handling: adopting corresponding coping strategies according to the different restriction means, wherein the strategies include using proxies, processing verification codes, and simulating login.
As a preferred embodiment of the present invention, the performing of the corresponding processing and adjustment includes:
handling abnormal conditions: processing the anti-crawler mechanism of the target website by using an IP proxy, random User-Agent switching, and cookie management; adding corresponding processing logic for verification codes and login restrictions;
logging and analysis: and recording a request log of the crawler, analyzing the log, detecting potential problems, recording an abnormal log, and further optimizing a crawling strategy according to an analysis result.
As a preferred embodiment of the present invention, the request log includes response time, status code, and error information; recording the exception log means recording the occurrence time, URL, exception type, and detailed information of each exception, storing them using a log system or database, and keeping a history record; exception types include network exceptions, HTTP exceptions, data processing exceptions, and anti-crawler restrictions.
As a preferred embodiment of the present invention, the processing of the exception condition further includes exception callback processing, which specifically includes:
defining an exception callback function: defining an exception callback function in the crawler program, wherein the exception callback function is used for processing the captured exception; the callback function can receive the abnormal information as a parameter and perform corresponding processing according to the abnormal type;
callback processing strategy: determining the callback processing behavior according to the severity and specific circumstances of the exception; the behaviors include retrying, skipping, and alarming; for severe anomalies, the crawler operation is stopped and a corresponding notification or alarm mechanism is triggered, wherein severe anomalies include network connection problems and anti-crawler restrictions;
triggering abnormal callback: capturing possible anomalies in the crawler program, and calling an anomaly callback function at a corresponding position; capturing an exception through a try-except structure, and calling a callback function in the except block;
exception callback optimization: and continuously optimizing and improving according to the execution condition and the processing result of the abnormal callback, and dynamically adjusting the callback processing strategy according to the actual condition, wherein the callback processing strategy comprises increasing retry times and adjusting processing sequence.
Compared with the prior art, the invention has the beneficial effects that:
1. according to the crawler exception handling scheme and the exception callback scheme, different types of exception conditions can be better handled, the robustness and stability of the crawler are improved, relevant laws and regulations and the use regulations of a target website are complied with in use, and legal and compliant data acquisition work is ensured.
2. According to the invention, through automatic identification and handling of anti-crawler technology, data grabbing efficiency and accuracy are improved; the grabbing strategy can be automatically adjusted according to the structures and rules of different websites, optimizing data acquisition efficiency and accuracy; and a complete exception retry mechanism and an open exception callback capability are provided.
Drawings
The present invention is further described below with reference to the accompanying drawings for the convenience of understanding by those skilled in the art.
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The technical solutions of the present invention will be clearly and completely described in connection with the embodiments, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention is mainly implemented in the Python programming language and involves the selenium, requests, re, beautifulsoup, time, and json modules, among others; Python is a cross-platform, object-oriented, dynamically typed computer programming language originally designed for writing automation scripts.
The core of the invention is the selenium library, in particular its webdriver module.
The selenium library, usable as a module in the Python language, is a complete web application testing system that includes test recording. Its recording component, Selenium IDE, is the official browser-plug-in-based record-and-playback tool for creating and executing basic Selenium test scripts.
Installation: Selenium IDE is a Chrome plug-in that can be searched for and installed in the browser's extension store. After installation, an icon resembling a recorder, the Selenium IDE icon, appears in the browser toolbar.
Selenium IDE recording process:
Recording a script: open the Selenium IDE plug-in and begin recording the test script. Clicking the record button records the operational steps in the browser, where the sample of data to be collected is selected for recording; the pause and resume buttons can also be used to control the recording process.
Editing a script: after recording finishes, the generated script can be edited and modified. Assertion, verification, and wait operations can be added, and the order of the steps adjusted.
Playing back a script: after script editing is complete, the test script is executed by clicking the playback button. Selenium IDE automatically reproduces the process of collecting data samples in the browser.
Exporting a script: the recorded script is exported as a Selenium WebDriver script in the Python programming language.
Selenium IDE is relatively limited in functionality and flexibility and is suited to simple scenarios and the quick creation of test and acquisition scripts. For more complex acquisition scenarios, acquisition scripts are written using Selenium WebDriver together with a programming language.
The Selenium WebDriver workflow is as follows:
Installation and configuration: install the selenium module in Python using pip, then download and configure the required browser driver.
The webdriver module starts the target browser and binds it to a designated port; the started browser instance serves as the remote server of the webdriver module.
The client sends a network request to the listening port of the remote server;
the remote server relies on native browser components to translate WebDriver calls into the browser's local calls.
The re module, i.e., the regular expression module, is an essential module for writing regular expressions in the Python language. A regular expression is an expression composed of a set of special symbols that describes a matching rule.
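As a brief illustration of how the re module supports the crawler, the following sketch extracts monetary amounts from crawled page text; the sample text, pattern, and field names are hypothetical illustrations, not taken from the patent.

```python
import re

# Hypothetical sample of crawled financial text.
html_text = "Revenue: 1,234.56 CNY  Net profit: 789.00 CNY"

# Match amounts with optional thousands separators and two decimal places.
amount_pattern = re.compile(r"\d{1,3}(?:,\d{3})*\.\d{2}")
amounts = amount_pattern.findall(html_text)
print(amounts)  # ['1,234.56', '789.00']
```

In a real acquisition script the pattern would be tailored to the structure of the target financial pages.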
The time module contains a large number of time-related functions built into the Python language and is an indispensable module for controlling the program by time point or time period.
The json module, part of the Python standard library, provides the functions necessary for storing and reading data in JSON files.
The BeautifulSoup module is used to extract data from HTML and XML files. It can parse HTML and XML documents and provides methods for traversing the document tree and searching for tags and data within it.
The requests module is a Python third-party library used for sending HTTP requests to a web server and obtaining the response.
Based on the python language programming, the application discloses a cloud financial data acquisition method based on a web crawler technology, which is realized through a webpage analysis module, a webpage operation module, an element identification module and a data storage module.
The webpage analysis module performs webpage structure analysis and automation operation on a given webpage; the method comprises the following steps:
for each program script, a network request is created and sent to the browser's driver; the browser driver comprises a network server for receiving the network requests; the network server receives the request and then specifically controls the corresponding browser according to the request; the browser executes specific steps; the browser returns the execution result of the step to the network server; the network server returns the result to the program script, and if the network code is wrong, the corresponding error reporting information can be seen at the control console.
The webpage operation module simulates, through programs, a human's basic operations on the webpage. The invention mainly involves text parameter input, drop-down list selection, scroll-bar scrolling, page turning, link clicking, page refreshing, page exiting, and similar operations;
the element identification module parses the webpage structure to extract valuable financial text data;
the data storage module stores the located elements locally in the form of JSON files. To facilitate data storage, the module also involves basic data processing operations such as numbering and naming the data.
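The data storage module's numbering, naming, and JSON preservation steps might be sketched as follows; the file-naming convention and record fields are assumptions for illustration, not specified by the patent.

```python
import json
import os
import tempfile

def save_records(records, out_dir, prefix="financial"):
    """Number, name, and save located elements as local JSON files."""
    paths = []
    for i, rec in enumerate(records, start=1):
        rec_with_id = {"id": i, **rec}  # numbering step
        path = os.path.join(out_dir, f"{prefix}_{i:04d}.json")  # naming step
        with open(path, "w", encoding="utf-8") as f:
            json.dump(rec_with_id, f, ensure_ascii=False, indent=2)
        paths.append(path)
    return paths

# Usage with a hypothetical record:
with tempfile.TemporaryDirectory() as d:
    paths = save_records([{"subject": "cash", "amount": "1,234.56"}], d)
    print(len(paths))  # 1
```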
Referring to fig. 1, the cloud financial data acquisition method based on the web crawler technology specifically includes the following steps:
step one: determining websites and data types to be grabbed, and setting corresponding grabbing tasks and rules; the method comprises the following steps:
initial setting:
the initial crawling rate and frequency are set to ensure that the target web site is not burdened excessively.
A reasonable crawler delay time (e.g., latency between each request) is set.
Monitoring target websites:
the response time and status code of the target web site are monitored to detect potential problems or anomalies.
Dynamic adjustment policy algorithm:
and automatically adjusting the crawling strategy according to the monitored response time and the server load condition.
If the response time is longer or the server load is higher, the delay time of the crawler can be increased or the number of concurrent requests can be reduced;
if the response time is shorter and the server load is lower, the crawling rate can be moderately increased or the number of concurrent requests can be increased.
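The dynamic adjustment strategy above can be sketched as a pure function; the thresholds and scaling factors here are illustrative assumptions, not values specified by the patent.

```python
def adjust_policy(delay, concurrency, response_time, server_load,
                  slow_threshold=2.0, load_threshold=0.8,
                  fast_threshold=0.5, idle_threshold=0.3):
    """Return a new (delay, concurrency) pair based on monitored conditions."""
    if response_time > slow_threshold or server_load > load_threshold:
        delay *= 2                              # back off: increase crawler delay
        concurrency = max(1, concurrency // 2)  # reduce concurrent requests
    elif response_time < fast_threshold and server_load < idle_threshold:
        delay = max(0.1, delay / 2)             # speed up: raise the crawl rate
        concurrency += 1                        # increase concurrent requests
    return delay, concurrency

print(adjust_policy(1.0, 8, response_time=3.0, server_load=0.9))  # (2.0, 4)
```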
Adjustment mechanism:
the automated monitoring tool monitors the response time and server load conditions in real time.
Under the condition that adjustment is required, the corresponding mechanism can send a signal to the crawler program to enable the crawler program to automatically modify the crawling strategy.
The crawler may re-make adjustments to the request frequency and the number of concurrency within a specified time interval.
Setting a crawling rate algorithm:
Different websites have different requirements on crawling rate; follow the target website's crawler policy or the rules in its robots.txt file.
To avoid requesting the same page too quickly in succession, a random delay time can be introduced between URL requests.
For large-scale crawling, requests can be evenly dispersed and the burden reduced through batch processing, distributed crawling, and similar approaches.
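The random-delay and batch-dispersal suggestions above might look like this in Python; the delay bounds and batch size are illustrative assumptions.

```python
import random
import time

def polite_delay(base=1.0, jitter=2.0):
    """Sleep for a random delay between successive URL requests."""
    d = base + random.uniform(0, jitter)
    time.sleep(d)
    return d

def batches(urls, size):
    """Split a URL list into batches so requests can be dispersed
    across workers in a batch-processing or distributed crawl."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

d = polite_delay(base=0.01, jitter=0.02)  # tiny values for demonstration
print(batches(["u1", "u2", "u3", "u4", "u5"], 2))
# [['u1', 'u2'], ['u3', 'u4'], ['u5']]
```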
Step two: simulate a user logging in to the website, or open a browser for the user to log in. Some websites apply verification-code checks and similar measures to login, which need to be handled according to the actual situation. In addition, to avoid password leakage, it is suggested to store the user name and password in a configuration file or database rather than hard-coding them.
Step three: and acquiring the logged cookie data for simulating the user interface request and acquiring the back-end source data.
Step four: according to the website structure and anti-crawler technology, automatically adjust the grabbing strategy and handle abnormal conditions; the handling of abnormal conditions comprises a crawler exception handling scheme and an exception callback scheme.
Crawler exception handling scheme:
Exception type classification:
Network exceptions: such as connection timeouts, DNS resolution errors, etc.
HTTP exceptions: such as status code errors, request rejections, etc.
Data processing exceptions: such as HTML/XML parsing failures, data format errors, etc.
Anti-crawler restrictions: such as verification codes, login restrictions, etc.
The corresponding handling measures are as follows:
Network exception handling: a retry mechanism can be added, with the maximum number of retries and the delay time configurable.
HTTP exception handling: corresponding measures are taken, such as retrying or updating the request header, based on the particular status code or error information.
Data processing exception handling: error information is recorded and the erroneous data is processed, e.g., by skipping or re-parsing.
Anti-crawler restriction handling: corresponding coping strategies are adopted according to the different restriction means, such as using proxies, processing verification codes, or simulating login.
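The retry mechanism for network exceptions described above might be sketched as follows; the fixed delay and exception types are assumptions (a production crawler would likely use exponential backoff and the exception classes of its HTTP library).

```python
import time

def fetch_with_retry(fetch, url, max_retries=3, delay=1.0):
    """Retry a fetch up to max_retries times, waiting `delay` seconds
    between attempts; re-raise the last exception if all attempts fail."""
    last_exc = None
    for _attempt in range(max_retries):
        try:
            return fetch(url)
        except (ConnectionError, TimeoutError) as exc:
            last_exc = exc
            time.sleep(delay)
    raise last_exc  # all retries exhausted

# Usage with a flaky stand-in for a real HTTP fetch:
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connect timeout")
    return "<html>ok</html>"

result = fetch_with_retry(flaky, "https://example.com", delay=0.01)
print(result)  # <html>ok</html>
```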
Exception log record:
the time, URL, anomaly type, detailed information, and the like of each anomaly occurrence are recorded.
A log system or database may be used for storage and a long enough history is maintained for analysis and monitoring of anomalies.
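A minimal sketch of the exception log record, keeping the occurrence time, URL, exception type, and details in memory; as the text notes, a real deployment would persist these entries in a logging system or database. The field names are illustrative.

```python
import datetime

exception_log = []  # in production: a logging system or database table

def record_exception(url, exc_type, detail):
    """Record the occurrence time, URL, exception type, and details."""
    entry = {
        "time": datetime.datetime.now().isoformat(),
        "url": url,
        "type": exc_type,
        "detail": detail,
    }
    exception_log.append(entry)  # history is kept for analysis and monitoring
    return entry

e = record_exception("https://example.com/report", "HTTPException", "status 503")
print(e["type"])  # HTTPException
```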
Handling abnormal conditions:
For the anti-crawler mechanism of the target website, technical means such as IP proxies, random User-Agent switching, and cookie management can be used.
Corresponding processing logic can be added for verification codes, login restrictions, and other situations that may occur.
Logging and analysis:
the request log of the crawler is recorded, including response time, status code, error information, etc.
Analyzing the log, detecting potential problems, and further optimizing a crawling strategy according to an analysis result;
exception callback scheme:
defining an exception callback function:
an exception callback function is defined in the crawler program for handling the captured exception.
The callback function can receive the abnormal information as a parameter and perform corresponding processing according to the abnormal type.
Callback processing strategy:
The callback processing behavior is determined according to the severity and specific circumstances of the exception, e.g., retrying, skipping, or alarming.
For severe anomalies, such as network connection problems or anti-crawler restrictions, the crawler can be stopped and a corresponding notification or alarm mechanism triggered.
Triggering abnormal callback:
the possible exception is captured in the crawler and the exception callback function is called in the appropriate location.
Exceptions may be captured through a try-except structure and callback functions are called in the except block.
Exception callback optimization:
and continuously optimizing and improving according to the execution condition and the processing result of the abnormal callback.
The callback processing strategy can be dynamically adjusted according to actual conditions, such as increasing retry times, adjusting processing sequence and the like.
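The exception callback scheme above — a callback invoked from an except block that stops the crawler and raises an alarm for severe anomalies — can be sketched as follows; the severity classification, exception names, and return values are illustrative assumptions.

```python
# Hypothetical severity classification (the patent names network connection
# problems and anti-crawler restrictions as severe anomalies).
SEVERE = {"NetworkError", "AntiCrawlerLimit"}

def on_exception(exc_type, url, alerts, state):
    """Exception callback: receives exception info as parameters and decides
    the handling behavior; severe anomalies stop the crawler and alarm."""
    if exc_type in SEVERE:
        state["running"] = False                      # stop the crawler
        alerts.append(f"ALERT {exc_type} at {url}")   # trigger notification
        return "stop"
    return "retry"

def crawl_one(fetch, url, alerts, state):
    try:
        return fetch(url)
    except RuntimeError as exc:
        # capture via try-except and call the callback in the except block
        return on_exception(str(exc), url, alerts, state)

# Usage with a fetch that hits an anti-crawler restriction:
alerts, state = [], {"running": True}
def bad(url):
    raise RuntimeError("AntiCrawlerLimit")

print(crawl_one(bad, "https://example.com", alerts, state))  # stop
```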
Through the crawler exception handling scheme and the exception callback scheme, different types of exception conditions can be better handled, and the robustness and stability of the crawler are improved. In use, relevant laws and regulations and the target website's terms of use should be complied with to ensure legal and compliant data acquisition.
Step five: through a distributed architecture, a plurality of grabbing tasks are processed simultaneously, so that the system efficiency and stability are improved;
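Concurrent processing of multiple grabbing tasks, as in step five, can be sketched with a worker pool; a true distributed architecture would dispatch tasks across machines via a shared queue, but the worker pattern is analogous. The task and result names here are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def grab(task):
    """Stand-in for a single grabbing task (a real task would fetch a URL)."""
    return f"result-of-{task}"

def run_tasks(tasks, workers=4):
    """Process multiple grabbing tasks concurrently with a thread pool;
    map() preserves the original task order in its results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(grab, tasks))

results = run_tasks(["t1", "t2", "t3"])
print(results)  # ['result-of-t1', 'result-of-t2', 'result-of-t3']
```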
step six: timely feeding back abnormal conditions by monitoring and analyzing data grabbing results in real time, and carrying out corresponding processing and adjustment; comprising the following steps:
Handling abnormal conditions: for the anti-crawler mechanism of the target website, technical means such as IP proxies, random User-Agent switching, and cookie management can be used. Corresponding processing logic can be added for verification codes, login restrictions, and other situations that may occur.
Logging and analysis: the request log of the crawler is recorded, including response time, status code, error information, etc. The log is analyzed to detect potential problems, and the crawling strategy is further optimized according to the analysis result.
Step seven: and (5) carrying out structural preservation on the result according to the requirements of the financial analysis software.
According to the crawler exception handling scheme and the exception callback scheme, different types of exception conditions can be better handled and the robustness and stability of the crawler improved; relevant laws and regulations and the target website's terms of use are complied with in use, ensuring legal and compliant data acquisition.
Anti-crawler technology is automatically identified and handled, improving data grabbing efficiency and accuracy; the grabbing strategy can be automatically adjusted according to the structures and rules of different websites, optimizing data acquisition efficiency and accuracy; and a complete exception retry mechanism and an open exception callback capability are provided.
The preferred embodiments of the invention disclosed above are intended only to assist in the explanation of the invention. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and the full scope and equivalents thereof.
Claims (6)
1. The cloud financial data acquisition method based on the web crawler technology is characterized by comprising the following steps of:
determining websites and data types to be grabbed, and setting corresponding grabbing tasks and rules;
simulating a user to log in a website, or opening a browser to log in;
acquiring the logged cookie data, and simulating a user interface request to acquire back-end source data;
according to the website structure and anti-crawler technology, automatically adjusting the grabbing strategy and handling abnormal conditions;
processing a plurality of grabbing tasks through a distributed architecture;
monitoring and analyzing the data grabbing result in real time, feeding back abnormal conditions in time, and carrying out corresponding processing and adjustment;
and carrying out structural preservation on the grabbing result according to the requirements of the financial analysis software.
2. The cloud financial data acquisition method based on web crawler technology according to claim 1, wherein setting corresponding crawling tasks and rules comprises:
initial setting: setting the initial crawling speed, frequency, and crawler delay time;
monitoring the target website: monitoring the response time and status codes of the target website;
dynamic adjustment policy algorithm: automatically adjusting the crawling strategy according to the monitored response time and server load, specifically: if the response time is long or the server load is high, increasing the crawler delay time or reducing the number of concurrent requests; if the response time is short and the server load is low, increasing the crawling rate or the number of concurrent requests;
adjustment mechanism: an automatic monitoring tool monitors response time and server load in real time; when a condition requiring adjustment is detected, a corresponding mechanism signals the crawler program, which automatically modifies its crawling strategy, re-adjusting the request frequency and concurrency within a specified time interval;
setting the crawling rate: the crawling rate is set based on the crawler policy or the robots.txt file of the target website.
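The dynamic adjustment policy of claim 2 reduces to a small pure function. This is a minimal sketch under assumed thresholds (2 s / 0.5 s response time, 80 % / 30 % load) and assumed bounds on delay and concurrency; none of these constants come from the patent.

```python
def adjust_policy(delay: float, concurrency: int,
                  response_time: float, server_load: float,
                  slow_rt: float = 2.0, fast_rt: float = 0.5,
                  high_load: float = 0.8, low_load: float = 0.3) -> tuple:
    """Return the new (delay, concurrency) pair.

    Slow responses or high server load -> back off (longer delay, fewer
    concurrent requests); fast responses and low load -> speed up.
    """
    if response_time > slow_rt or server_load > high_load:
        delay = min(delay * 2, 60.0)            # increase crawler delay, capped
        concurrency = max(1, concurrency - 1)   # reduce concurrent requests
    elif response_time < fast_rt and server_load < low_load:
        delay = max(delay / 2, 0.1)             # raise the crawling rate
        concurrency = min(concurrency + 1, 16)  # allow more concurrency
    return delay, concurrency
```

A monitoring loop would call this at the "specified time interval" the claim mentions and feed the result back into the request scheduler.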
3. The cloud financial data acquisition method based on web crawler technology according to claim 1, wherein handling abnormal conditions comprises:
network exception handling: adding a retry mechanism, and setting a maximum retry count and delay time;
HTTP exception handling: taking corresponding measures according to the specific status code or error information, the measures including retrying and updating the request headers;
data processing exception handling: recording the error information and handling the erroneous data, the handling including skipping and re-parsing;
anti-crawler restriction handling: adopting corresponding coping strategies for different restriction means, the strategies including using proxies, handling verification codes, and simulating login.
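The retry mechanism of claim 3 (maximum retry count plus delay) can be sketched as a small wrapper. The exponential backoff factor and the choice of `ConnectionError`/`TimeoutError` as the retryable exceptions are assumptions for illustration, not requirements of the claim.

```python
import time

def with_retries(fn, max_retries: int = 3,
                 delay: float = 1.0, backoff: float = 2.0):
    """Call fn(); on a network exception, retry up to max_retries times,
    sleeping `delay` seconds between attempts (multiplied by `backoff`)."""
    attempt = 0
    while True:
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            attempt += 1
            if attempt > max_retries:
                raise  # exhausted: surface the exception for logging/callbacks
            time.sleep(delay)
            delay *= backoff  # back off before the next attempt
```

In a crawler, `fn` would be a zero-argument closure over one page fetch, so each URL gets its own independent retry budget.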
4. The cloud financial data acquisition method based on web crawler technology according to claim 1, wherein performing corresponding processing and adjustment comprises:
handling abnormal conditions: coping with the anti-crawler mechanisms of the target website by using IP proxies, random User-Agent switching, and cookie management; adding corresponding processing logic for verification codes and login restrictions;
logging and analysis: recording the crawler's request logs, analyzing the logs to detect potential problems, recording exception logs, and further optimizing the crawling strategy according to the analysis results.
5. The cloud financial data acquisition method based on web crawler technology according to claim 4, wherein the request logs comprise response time, status codes, and error information; recording the exception logs comprises recording the occurrence time, URL, exception type, and detailed information of each exception, storing them with a logging system or database, and keeping a history; the exception types include network exceptions, HTTP exceptions, data processing exceptions, and anti-crawler restrictions.
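The exception-log entry of claim 5 (time, URL, type, detail) and the four-way classification can be sketched as follows. The `AntiCrawlerBlock` class and the mapping of Python exception types onto the four categories are hypothetical choices made for this sketch.

```python
import datetime
from urllib.error import HTTPError, URLError

class AntiCrawlerBlock(Exception):
    """Hypothetical: raised when a captcha or IP ban is detected."""

def classify(exc: Exception) -> str:
    """Map an exception onto the four logged categories of claim 5."""
    if isinstance(exc, AntiCrawlerBlock):
        return "anti-crawler restriction"
    if isinstance(exc, HTTPError):          # check before URLError (its parent)
        return "HTTP exception"
    if isinstance(exc, (URLError, ConnectionError, TimeoutError)):
        return "network exception"
    return "data processing exception"      # e.g. parse/serialization errors

def make_log_entry(url: str, exc: Exception) -> dict:
    """One exception-log record: occurrence time, URL, type, and detail."""
    return {
        "time": datetime.datetime.now().isoformat(timespec="seconds"),
        "url": url,
        "type": classify(exc),
        "detail": f"{type(exc).__name__}: {exc}",
    }
```

Entries in this shape can be appended to any logging system or database table to keep the history the claim requires.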
6. The cloud financial data acquisition method based on web crawler technology according to claim 3, wherein handling abnormal conditions further comprises handling exception callbacks, specifically:
defining an exception callback function: defining an exception callback function in the crawler program for handling captured exceptions; the callback function receives the exception information as a parameter and performs corresponding processing according to the exception type;
callback processing strategy: determining the callback processing behavior according to the severity and specific circumstances of the exception;
the behaviors include whether to retry, skip, or raise an alarm; for serious exceptions, including network connection problems and anti-crawler restrictions, stopping the crawler and triggering a corresponding notification or alarm mechanism;
triggering the exception callback: capturing possible exceptions in the crawler program and calling the exception callback function at the corresponding position; exceptions are captured with a try-except structure, and the callback function is called in the except block;
exception callback optimization: continuously optimizing and improving according to the execution and processing results of the exception callbacks, and dynamically adjusting the callback processing strategy according to actual conditions, including increasing the retry count and adjusting the processing order.
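The callback flow of claim 6 can be sketched as a dispatch function called inside an except block. The severity mapping (which exceptions count as "serious") and the `AntiCrawlerBlock` class are assumptions of this sketch, not fixed by the claim.

```python
class AntiCrawlerBlock(Exception):
    """Hypothetical: raised when a captcha or IP ban is detected."""

def exception_callback(exc: Exception, url: str) -> str:
    """Decide the callback action from the exception's type and severity."""
    if isinstance(exc, (ConnectionError, AntiCrawlerBlock)):
        return "alarm"   # serious: stop the crawler and notify
    if isinstance(exc, TimeoutError):
        return "retry"   # transient: worth another attempt
    return "skip"        # e.g. a single page that failed to parse

def crawl_page(fetch, url: str, on_exception=exception_callback) -> str:
    """Capture exceptions with try-except and call the callback in except."""
    try:
        fetch(url)
        return "ok"
    except Exception as exc:
        action = on_exception(exc, url)
        if action == "alarm":
            # serious anomaly: stop the crawler and trigger the alarm mechanism
            raise RuntimeError(f"crawler stopped at {url}: {exc}") from exc
        return action
```

Because the callback is a parameter, the processing strategy can be swapped or tuned at runtime, which is how the "dynamic adjustment" in the last step of the claim could be realized.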
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311413924.XA CN117474694A (en) | 2023-10-30 | 2023-10-30 | Cloud financial data acquisition method based on web crawler technology |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117474694A true CN117474694A (en) | 2024-01-30 |
Family
ID=89634121
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311413924.XA Pending CN117474694A (en) | 2023-10-30 | 2023-10-30 | Cloud financial data acquisition method based on web crawler technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117474694A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||