CN114969474A - Webpage data acquisition method, webpage data acquisition device and storage medium

Webpage data acquisition method, webpage data acquisition device and storage medium

Info

Publication number
CN114969474A
Authority
CN
China
Prior art keywords
data
source code
target
browser
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210346533.XA
Other languages
Chinese (zh)
Inventor
蒋庆高
汪健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Xishima Data Technology Co ltd
Original Assignee
Anhui Xishima Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Xishima Data Technology Co ltd
Priority to CN202210346533.XA
Publication of CN114969474A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F21/12 Protecting executable software
    • G06F21/121 Restricting unauthorised execution of programs
    • G06F21/128 Restricting unauthorised execution of programs involving web programs, i.e. using technology especially used in internet, generally interacting with a web browser, e.g. hypertext markup language [HTML], applets, java
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44521 Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/448 Execution paradigms, e.g. implementations of programming paradigms
    • G06F9/4482 Procedural

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Multimedia (AREA)
  • Technology Law (AREA)
  • Computer Hardware Design (AREA)
  • Storage Device Security (AREA)

Abstract

The application discloses a webpage data acquisition method. The acquisition method comprises the following steps: declaring a browser and a data source website variable based on the CefSharp framework, then instantiating the browser and loading the data source website variable so that the webpage source code can be accessed directly. Next, an asynchronous delegate event is called to initialize the browser, the webpage source code corresponding to the data source website variable is acquired once the browser is initialized, and data decryption is performed on the webpage source code to obtain a target source code. Calling the asynchronous delegate event and decrypting the encrypted data allows clean data to be obtained faster. Finally, target data is determined according to the target source code and stored. The application also provides a webpage data acquisition device and a non-volatile computer-readable storage medium.

Description

Webpage data acquisition method, webpage data acquisition device and storage medium
Technical Field
The present application relates to the field of data acquisition technologies, and in particular, to a web page data acquisition method, a web page data acquisition apparatus, and a non-volatile computer-readable storage medium.
Background
Big data analysis requires data, which can be acquired by means of a web crawler. However, web crawlers run into various anti-crawling mechanisms, such as mechanisms based on accounts, IP networks, data source upgrades, and data source encryption, which leave the desired source code and data unavailable to the developer.
Disclosure of Invention
The application provides a webpage data acquisition method, a webpage data acquisition device and a nonvolatile computer readable storage medium.
The embodiment of the application provides a webpage data acquisition method, which comprises the following steps:
declaring a browser and a data source website variable based on the CefSharp framework;
instantiating the browser, loading the data source website variable, and calling an asynchronous delegate event to initialize the browser;
acquiring a webpage source code corresponding to the data source website variable once the browser initialization is completed;
carrying out data decryption on the webpage source code to obtain a target source code;
and determining target data according to the target source code and storing the target data.
Thus, the web page source code can be accessed directly by first declaring the browser and the data source website variable based on the CefSharp framework, then instantiating the browser and loading the data source website variable. Next, the asynchronous delegate event is called to initialize the browser, the webpage source code corresponding to the data source website variable is acquired once the browser is initialized, and data decryption is performed on the webpage source code to obtain a target source code. Calling the asynchronous delegate event and decrypting the encrypted data allows clean data to be obtained faster. Finally, target data is determined according to the target source code and stored.
In some embodiments, instantiating the browser, loading the data source website variable, and calling an asynchronous delegate event to initialize the browser includes:
instantiating the browser, loading the data source website variable, and adding the asynchronous delegate event;
setting a sleep time;
and calling the asynchronous delegate event after a delay of the sleep time so as to initialize the browser.
Therefore, the browser is instantiated and the data source website variable is loaded; after the asynchronous delegate event is added, the sleep time is set, and the asynchronous delegate event is finally called after a delay of the sleep time so as to initialize the browser. Adding a sleep time in combination with calling the asynchronous delegate event allows complete target data to be obtained, and further combining this with decrypting the encrypted data yields target data that is both complete and clean.
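As a rough illustration of the sleep-then-call pattern described above, the following C# sketch waits for a configured sleep time and then invokes a delegate. It is a minimal sketch under stated assumptions, not the application's own code: the class DelayedInit, the method InvokeAfterDelayAsync, and the parameter names are hypothetical, and Task.Delay stands in for whatever sleep mechanism an implementation actually uses.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper, not part of CefSharp or of the application's code.
public static class DelayedInit
{
    // Waits for sleepMs milliseconds, then runs the delegate and returns its result.
    // In the setting described in the text, the delegate would check whether the
    // browser is initialized and read the page source; here it is any Func<T>.
    public static async Task<T> InvokeAfterDelayAsync<T>(int sleepMs, Func<T> callback)
    {
        if (callback == null) throw new ArgumentNullException(nameof(callback));
        await Task.Delay(sleepMs);
        return callback();
    }
}
```

A caller might write await DelayedInit.InvokeAfterDelayAsync(3000, ReadSource), where ReadSource is a hypothetical method returning the page source. The 3000 ms value is purely illustrative, since the text does not specify a sleep time.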
In some embodiments, the decrypting the data of the web page source code to obtain the target source code includes:
acquiring header data of the webpage source code;
matching the header data according to preset code elements to determine encrypted data;
and carrying out data decryption on the header data according to the encrypted data to obtain the target source code.
Thus, header data of the webpage source code is obtained, the header data is matched against the preset code elements to determine the encrypted data, and the encrypted data in the header data is then decrypted to obtain the target source code. In this way the desired header data can be obtained more easily.
In some embodiments, the matching the header data according to the preset code elements to determine the encrypted data includes:
dividing the header data into a plurality of header fields;
determining the header field as the encrypted data if the header field includes the hidden code element.
In this way, the header data is divided into a plurality of header fields, and in the case where the header fields include a hidden code element, the header fields can be determined to be encrypted data.
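The header-field matching just described can be sketched in C# as follows. This is a minimal sketch, not the application's implementation: the class HeaderFields and its methods are hypothetical names, and the hidden markers are assumed to be 'display: none' and 'hidden', which are the two examples the detailed description gives. A real page may hide elements in other ways, for example through CSS classes, which this sketch would not detect.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating "divide into header fields, then flag hidden ones".
public static class HeaderFields
{
    // Assumed hidden markers: an inline "display:none" style or the word "hidden"
    // anywhere in the opening tag (attribute, class name, or visibility style).
    static readonly Regex HiddenMarker = new Regex(
        @"display\s*:\s*none|\bhidden\b", RegexOptions.IgnoreCase);

    // Divides header HTML into individual <th>...</th> fields, in document order.
    public static List<string> Split(string headerHtml)
    {
        var fields = new List<string>();
        foreach (Match m in Regex.Matches(headerHtml, @"<th\b[^>]*>.*?</th>",
                     RegexOptions.IgnoreCase | RegexOptions.Singleline))
        {
            fields.Add(m.Value);
        }
        return fields;
    }

    // A field is treated as encrypted (decoy) data when its opening tag
    // carries one of the assumed hidden markers.
    public static bool IsEncrypted(string field)
    {
        Match openTag = Regex.Match(field, @"^<th\b[^>]*>", RegexOptions.IgnoreCase);
        return openTag.Success && HiddenMarker.IsMatch(openTag.Value);
    }
}
```

Only the opening tag is inspected, so visible header text that happens to contain the word "hidden" is not misclassified.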
In some embodiments, the decrypting the data of the web page source code to obtain the target source code includes:
acquiring table content data of the webpage source code;
matching the table content data according to preset code elements to determine encrypted data;
and carrying out data decryption on the table content data according to the encrypted data to obtain the target source code.
Therefore, table content data of the webpage source code is obtained, the table content data is matched against the preset code elements to determine the encrypted data, and the encrypted data in the table content data is then decrypted to obtain the target source code. In this way the desired table content data can be obtained more easily.
In some embodiments, the matching the table content data according to the preset code elements to determine the encrypted data includes:
dividing the table content data into a plurality of table content fields;
determining the table content field as the encrypted data if the table content field includes the hidden code element.
In this manner, the table content data is divided into a plurality of table content fields, and in the case where the table content fields include a hidden code element, the table content fields can be determined to be encrypted data.
In some embodiments, said determining target data from said target source code and storing said target data comprises:
converting the target source code into data table cache data;
and inserting the data table cache data into a target database in a loop for storage.
Therefore, the target source code is converted into data table cache data, and the data table cache data is inserted into the target database row by row in a loop, which completes the storage of the target data.
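One way to picture the conversion and loop insertion is the following C# sketch, which builds a System.Data.DataTable from clean table HTML and then hands each row to a caller-supplied insert action. It is a sketch under assumptions: TableCache, FromHtml, and InsertAll are hypothetical names, the input is assumed to be well-formed HTML with unique header names (DataTable rejects duplicate column names), and the insertRow action stands in for the real database call, which the text does not describe.

```csharp
using System;
using System.Data;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: target source code -> DataTable cache -> row-by-row insert.
public static class TableCache
{
    const RegexOptions Opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // Builds a DataTable: each <th> becomes a column, each <tr> of <td> cells a row.
    public static DataTable FromHtml(string tableHtml)
    {
        var table = new DataTable();
        foreach (Match th in Regex.Matches(tableHtml, @"<th\b[^>]*>(.*?)</th>", Opts))
            table.Columns.Add(Text(th.Groups[1].Value));

        foreach (Match tr in Regex.Matches(tableHtml, @"<tr\b[^>]*>(.*?)</tr>", Opts))
        {
            MatchCollection tds =
                Regex.Matches(tr.Groups[1].Value, @"<td\b[^>]*>(.*?)</td>", Opts);
            // Skip the header row and any row whose cell count does not match the header.
            if (tds.Count == 0 || tds.Count != table.Columns.Count) continue;
            DataRow row = table.NewRow();
            for (int i = 0; i < tds.Count; i++) row[i] = Text(tds[i].Groups[1].Value);
            table.Rows.Add(row);
        }
        return table;
    }

    // Strips nested tags, decodes HTML entities, and trims whitespace.
    static string Text(string html) =>
        WebUtility.HtmlDecode(Regex.Replace(html, "<[^>]+>", "")).Trim();

    // Loops over the cached rows and passes each one to the insert action.
    // Returns the number of rows handed over.
    public static int InsertAll(DataTable cache, Action<DataRow> insertRow)
    {
        int count = 0;
        foreach (DataRow row in cache.Rows)
        {
            insertRow(row);
            count++;
        }
        return count;
    }
}
```

Passing the insert step as a delegate keeps the sketch independent of any particular database; in practice the action would typically execute a parameterized INSERT statement for each row.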
The application also provides a webpage data acquisition device, including:
the declaration module is used for declaring the browser and the data source website variables based on the CefSharp framework;
the initialization module is used for instantiating the browser, loading the data source website variable and calling an asynchronous delegate event to initialize the browser;
the acquisition module is used for acquiring a webpage source code corresponding to the data source website variable under the condition that the browser is initialized;
the decryption module is used for carrying out data decryption on the webpage source code to obtain a target source code;
and the storage module is used for determining target data according to the target source code and storing the target data.
The embodiment of the application also provides an electronic device, which comprises a memory and a processor, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the method for acquiring the webpage data is realized.
The embodiments of the present application also provide a non-volatile computer-readable storage medium storing a computer program which, when executed by one or more processors, implements the web page data acquisition method described above.
The application relates to a webpage data acquisition method, a webpage data acquisition device and a non-volatile computer-readable storage medium. The web page source code can be accessed directly by first declaring the browser and the data source website variable based on the CefSharp framework, then instantiating the browser and loading the data source website variable. Next, the asynchronous delegate event is called to initialize the browser, the webpage source code corresponding to the data source website variable is acquired once the browser is initialized, and data decryption is performed on the webpage source code to obtain a target source code. Calling the asynchronous delegate event and decrypting the encrypted data allows clean data to be obtained faster. Finally, target data is determined according to the target source code and stored.
Additional aspects and advantages of embodiments of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of embodiments of the present application.
Drawings
The above and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic view of a scenario of a web page data collection method according to some embodiments of the present application;
FIG. 2 is a schematic diagram of a scenario of a web page data collection method according to some embodiments of the present application;
FIG. 3 is a schematic diagram of a scenario of a web page data collection method according to some embodiments of the present application;
FIG. 4 is a schematic diagram of a scenario of a web page data collection method according to some embodiments of the present application;
FIG. 5 is a schematic diagram of a scenario of a web page data collection method according to some embodiments of the present application;
FIG. 6 is a schematic flow chart diagram of a web page data collection method according to some embodiments of the present application;
FIG. 7 is a schematic diagram of a web page data acquisition device according to some embodiments of the present application;
FIG. 8 is a schematic diagram of a scenario of a web page data collection method according to some embodiments of the present application;
FIG. 9 is a schematic diagram illustrating a scenario of a web page data collection method according to some embodiments of the present application;
FIG. 10 is a schematic diagram of a scenario of a web page data collection method according to some embodiments of the present application;
FIG. 11 is a schematic diagram illustrating a scenario of a web page data collection method according to some embodiments of the present application;
FIG. 12 is a schematic flow chart diagram illustrating a method for web page data collection in accordance with certain embodiments of the present application;
FIG. 13 is a schematic flow chart diagram of a web page data collection method according to some embodiments of the present application;
FIG. 14 is a schematic flow chart diagram illustrating a method for web page data collection in accordance with certain embodiments of the present application;
FIG. 15 is a schematic flow chart diagram illustrating a method for web page data collection in accordance with certain embodiments of the present application;
FIG. 16 is a schematic flow chart diagram of a web page data collection method according to some embodiments of the present application;
FIG. 17 is a schematic flow chart diagram illustrating a method for web page data collection in accordance with certain embodiments of the present application;
FIG. 18 is a schematic diagram illustrating a scenario of a web page data collection method according to some embodiments of the present application;
FIG. 19 is a schematic diagram of a connection state of a non-volatile computer readable storage medium and a processor of some embodiments of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of explaining the embodiments of the present application, and are not to be construed as limiting the embodiments of the present application.
Generally, the data and source code of an encrypted web page are difficult to obtain. This is a hard problem for developers, and the data source must be analyzed thoroughly before cracking becomes possible. The web page data returned when an encrypted web page is collected can show various problems, such as obfuscated source code and missing data, and the web page source code cannot be viewed by right-clicking on the encrypted web page. Take as an example a web page that captures the weather of a city in a browser. The web page is an encrypted web page. As shown in fig. 1, it displays a weather table for a certain city together with the table text, which includes a table header and table content. The web page code returned for a normal acquisition request is shown in fig. 2, and the source code is disordered. As shown in fig. 3, the source code contains 16 th elements, equivalent to 16 columns; some column names are repeated, and which names repeat and in what order varies randomly. The td elements likewise contain extra columns, part of the data is not needed, and it is not possible to tell which data is which. This data is encrypted data and should be decrypted. Viewed in a simulated browser, as shown in fig. 4, only the header data is available and there is no table content data. When the web page is right-clicked, as shown in fig. 5, the source code cannot be viewed.
It should be specifically noted that the examples given later in this application are the same example as the present one, shown at different stages.
Referring to fig. 6, an embodiment of the present application provides a method for acquiring webpage data, including:
01: declaring a browser and a data source website variable based on the CefSharp framework;
02: instantiating the browser, loading the data source website variable, and calling an asynchronous delegate event to initialize the browser;
03: acquiring a webpage source code corresponding to the data source website variable once the browser initialization is completed;
04: carrying out data decryption on the webpage source code to obtain a target source code;
05: determining target data according to the target source code and storing the target data.
Referring to fig. 7, the present embodiment further provides a web page data collecting apparatus 100, where the web page data collecting apparatus 100 includes a declaration module 110, an initialization module 120, an obtaining module 130, a decryption module 140, and a storage module 150.
The web page data acquisition method according to the embodiment of the present application can be applied to the web page data acquisition apparatus 100 according to the embodiment of the present application. Specifically, the declaration module 110 is configured to execute step 01, that is, the declaration module 110 is configured to declare the browser and the data source website variable based on the CefSharp framework. The initialization module 120 is configured to execute step 02, that is, the initialization module 120 is configured to instantiate the browser, load the data source website variable, and call an asynchronous delegate event to initialize the browser. The obtaining module 130 is configured to execute step 03, that is, to obtain a web page source code corresponding to the data source website variable once the browser initialization is completed. The decryption module 140 is configured to execute step 04, that is, to decrypt the data of the web page source code to obtain the target source code. The storage module 150 is configured to execute step 05, that is, to determine target data according to the target source code and store the target data.
The application also provides an electronic device comprising a memory and a processor, wherein the memory stores a computer program. The processor is used for declaring the browser and the data source website variable based on the CefSharp framework, instantiating the browser and loading the data source website variable, calling an asynchronous delegate event to initialize the browser, acquiring the webpage source code corresponding to the data source website variable once the browser is initialized, decrypting the webpage source code to obtain a target source code, and determining target data according to the target source code and storing the target data.
Specifically, CEF stands for Chromium Embedded Framework, an open-source web browser control based on the Google Chromium project; it is mainly used to embed browser-related functionality in third-party applications. CefSharp is the C# version of CEF, a browser wrapper written for .NET that makes it convenient for developers to embed Chrome browser components in WinForms and WPF. CefSharp uses multiple threads for different levels of processing. Code can be written in C#, VB, or any other CLR language. The embodiments of the present application take C# as an example.
Because the web page source code is obtained after CefSharp has finished loading the page, there is no need to analyze how the page is encrypted, and the data obtained is the final data that the developer sees.
CefSharp runs using multiple processes. The main process, which handles window creation, drawing and network access, is called the browser process. Typically, this process is the same as the host application, and most of the application logic runs in the browser process. Some application logic also runs in the render process. The default process model spawns a new render process for each unique origin.
Step 01: declaring a browser and a data source website variable based on the CefSharp framework.
As shown in fig. 8, building the CefSharp framework includes downloading and installing CefSharp. Once CefSharp.dll, CefSharp.Core.dll and CefSharp.WinForms.dll have been referenced, the construction of the CefSharp framework is complete.
Taking as an example a web page that captures the weather of a city in webBrowser, building the CefSharp framework involves importing the CefSharp and CefSharp.WinForms namespaces. The specific code is as follows:
using CefSharp;
using CefSharp.WinForms;
After CefSharp is introduced, the browser and the data source website variable are declared. A browser presents World Wide Web (Web) resources selected by the developer: the resource is requested from a server and displayed in the browser window. The resource format is usually HTML, but may also be PDF, an image, or another format. Here, the browser refers to the embedded browser component, through which the data source website can be retrieved, browsed, and read.
The data source website refers to the URL of the target web page from which data is to be acquired in the embodiments of the present application. Declaring the browser and the data source website variable tells the compiler which browser and data source website variables will be used. In the example of acquiring target data from a city weather web page in webBrowser, a ChromiumWebBrowser is declared as the browser and a data source website variable named cityurl is declared. The specific code is as follows:
ChromiumWebBrowser webBrowser;
string cityurl="";
Step 02: instantiating the browser, loading the data source website variable, and calling an asynchronous delegate event to initialize the browser.
An instance is the realization of an object based on a class. Creating it is called instantiation, and the new keyword can be used to call the corresponding constructor of the class. A function is a code module that performs a specific task, and it needs variables and statements inside it. The data source website variable is therefore loaded so that the constructor receives it. In the example of acquiring target data from a city weather web page in webBrowser, a ChromiumWebBrowser is instantiated and the URL of a city's weather page is assigned to cityurl as the data source website. The specific code is as follows:
Figure BDA0003576700590000041
CefSharp can use multiple threads for different levels of processing, so an asynchronous delegate event can be called to let several threads work at the same time, which improves execution efficiency. The browser must be initialized so that memory is allocated to the variables and the selected content can be stored; initialization here means assigning initial values to data objects or variables. Thus, when the asynchronous delegate is invoked, it can also be used to check whether the browser has been initialized. In the example of acquiring target data from a city weather web page in webBrowser, the following code may be used to asynchronously determine whether webBrowser is initialized.
this.BeginInvoke(new Action(() =>
{
    // Return early if the browser has not finished initializing.
    if (!webBrowser.IsBrowserInitialized) return;
    // Read the page source; .Result blocks until the asynchronous call completes.
    cityHtmlStr = webBrowser.GetSourceAsync().Result;
    // Steps 03 to 05 continue from here.
}));
If webBrowser is not initialized, the preceding step of instantiating the browser, loading the data source website variable, and calling the asynchronous delegate event is executed again.
Step 03: acquiring a webpage source code corresponding to the data source website variable once the browser initialization is completed.
Each website address corresponds to a web page, so the corresponding web page source code can be found from the website variable. The content to be acquired can also be limited by means of the corresponding string markers, for example to the source code of the table text and the table in the web page. In the example of acquiring target data from a city weather web page in webBrowser, the web page source code of the data source website, including the table text and the source code of the table, can be obtained with the following code. That is, one code tag delimits the table text and another delimits the table; in the code below, the extraction runs from <div class="container"> to </tbody></table>.
string newHtmlStr=Analytical.CutStr(cityHtmlStr,"<div class=\"container\">","</tbody></table>")+"</table>";
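The text does not show the body of Analytical.CutStr. The following C# sketch gives one plausible reading of such a helper, purely as an assumption: it returns the text starting at the first occurrence of a start marker and ending just before the next end marker. The class name AnalyticalSketch is hypothetical, and whether the real helper keeps or drops either marker is not stated; this version keeps the start marker and drops the end marker, which is consistent with the calling code appending "</table>" afterwards.

```csharp
using System;

// Hypothetical stand-in for the undisclosed Analytical.CutStr helper.
public static class AnalyticalSketch
{
    // Returns the substring that begins at the first startMarker and ends just
    // before the next endMarker. Returns an empty string if either is missing.
    public static string CutStr(string source, string startMarker, string endMarker)
    {
        if (string.IsNullOrEmpty(source)) return string.Empty;

        int start = source.IndexOf(startMarker, StringComparison.Ordinal);
        if (start < 0) return string.Empty;

        // Search for the end marker only after the start marker.
        int end = source.IndexOf(endMarker, start + startMarker.Length,
                                 StringComparison.Ordinal);
        if (end < 0) return string.Empty;

        return source.Substring(start, end - start);
    }
}
```

Returning an empty string when a marker is missing is a design choice of this sketch; it lets the caller detect a page that did not load fully without handling an exception.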
Step 04: carrying out data decryption on the webpage source code to obtain a target source code.
Encrypted data may appear in the web page source code acquired in step 03. Encrypted data refers to data that is wrong and should not appear when compared with the normal data of the web page, for example different data appearing at the same position in the web page, or multiple copies of the same data appearing at the same position. Target source code refers to the source code of the data that the developer wants to obtain, that is, the code that yields the correct data. The correct data is the data that matches what is displayed at the same position on the web page.
Referring to fig. 9, the web page source code obtained in the example of step 03 is shown. It can be seen from the acquired source code that the fields of the <th> tag, which represents the table header, are duplicated: entries such as PM10, PM2.5, and O3, boxed in fig. 9, occur two or more times. On the web page shown in fig. 1, however, PM10, PM2.5, and O3 each appear only once in the header. It follows that the acquired source code representing the header contains encrypted data. Likewise, the fields of the <td> tag, which represents the table content, are duplicated. For example, the boxed values 48 and 0.7 in fig. 9 both correspond to the header PM10, but on the web page shown in fig. 1 the content under the header PM10 is only 48, so 0.7 is encrypted data. Moreover, the field names, the repeated content, and the positions of the fields in the returned result can differ from one access request to another.
Therefore, the encrypted data needs to be decrypted to obtain the target source code. Decryption works mainly by analyzing elements marked as hidden or not displayed, not by analyzing header names. For example, the data returned by the first request may duplicate PM10 while a later request duplicates PM2.5, and the number of duplicates is not fixed, so analyzing the header names cannot reveal which duplicate is encrypted data. To analyze hidden or undisplayed elements, the elements of the acquired web page source code are first parsed. Then the elements that the HTML marks as hidden or not displayed are removed, for example elements marked with 'display: none', 'hidden', and the like.
It can be understood that, since the web page source code parsed from the encrypted web page contains encrypted data, removing the elements that the HTML marks as hidden or not displayed yields clean web page source code, which amounts to decrypting the encrypted data. The desired data can then be extracted from this clean web page source code. Compared with extracting the desired data directly from parsed web page source code that still contains encrypted data, decrypting the encrypted data first makes the desired data easier to obtain.
It should be noted that, when the elements marked as hidden or not displayed are removed, if the groups of data being processed correspond to one another, the same removal rule should be applied to every group; otherwise the correspondence may become incorrect. For example, if the header and the table content are analyzed with different rules, the number of titles in the header may no longer match the number of columns in the table content. If the header has 10 titles and each title corresponds to one column, there should be 10 columns, but with inconsistent rules the table content may end up with 7 columns, 13 columns, and so on. The titles and their columns may also become misaligned. For example, the column under the title "month" should contain a year and month, but with inconsistent rules it may instead contain a value such as "good".
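The requirement that the header and the table content be filtered by the same rule can be illustrated with the following C# sketch. It is a minimal sketch under assumptions rather than the application's code: HiddenCellFilter, VisibleCells, and Aligned are hypothetical names, and the only hidden markers recognized are 'display: none' and 'hidden', taken from the examples in the text. A single shared pattern is applied to both th and td cells, and the resulting counts are compared to catch the column mismatch described above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: one removal rule shared by header and content cells.
public static class HiddenCellFilter
{
    const RegexOptions Opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // The single rule used for every group of data (assumed markers).
    static readonly Regex HiddenMarker = new Regex(
        @"display\s*:\s*none|\bhidden\b", RegexOptions.IgnoreCase);

    // Returns the cells of the given tag ("th" or "td") whose opening tag
    // carries no hidden marker, in document order.
    public static List<string> VisibleCells(string rowHtml, string tag)
    {
        var cells = new List<string>();
        string pattern = "<" + tag + @"\b([^>]*)>(.*?)</" + tag + ">";
        foreach (Match m in Regex.Matches(rowHtml, pattern, Opts))
        {
            if (!HiddenMarker.IsMatch(m.Groups[1].Value))
                cells.Add(m.Value);
        }
        return cells;
    }

    // True when the header row and a content row keep the same number of
    // visible cells, i.e. every title still has exactly one column.
    public static bool Aligned(string headerRowHtml, string contentRowHtml)
    {
        return VisibleCells(headerRowHtml, "th").Count
            == VisibleCells(contentRowHtml, "td").Count;
    }
}
```

Checking Aligned for every content row before building the final table gives an early signal that the removal rule missed some decoy cells, instead of silently producing shifted columns.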
Referring to fig. 10, after the encrypted data is decrypted, the target source code may be assembled with the following code; part of the resulting target source code is shown in fig. 10.
newHtmlStr="<table><tr>"+strth.ToString()+"</tr><tr>"+strtd.ToString()+"</table>";
For clarity, compare the web page table of fig. 1 with the target source code of fig. 10. The part of the target source code shown in fig. 10 consists of 5 segments separated by lines. Segment 1 is the header data, and segments 2 to 5 are table content data. Each segment represents one row of the table: segment 1 represents the header row, segment 2 represents the first row of the table content, segment 3 represents the second row, and so on for each row of the table content. Each line within a segment represents the content of one cell of the web page table; for example, the first line of segment 1 is the title of the first column, and the first line of segment 2 is the first-row content under that title. That is, the first lines of all segments correspond to one another. It should be noted that fig. 10 shows only part of the target source code; the part that is not shown follows the correspondence described in this paragraph.
Step 05: determining target data according to the target source code and storing the target data.
The target data refers to the data that a developer wants to obtain; it is obtained by processing the target source code.
Storing the target data refers to storing the target data in a database.
It can be understood that the target data of different cities obtained by the web page data acquisition method of the present application can be stored in the database, and the result shown in fig. 11 can be obtained.
With the web page data acquisition method, the target data can be acquired and stored once every preset time period, which developers can set, for example to every day, every month, or every year. Previously, web page data had to be processed manually: for a monthly database, for example, someone had to check whether each city's irregularly published monthly data had been released and then add the released data to the database by hand to update it. The web page data acquisition method automates data acquisition and updating, which reduces data entry cost and improves the timeliness of warehousing.
Thus, the web page source code can be accessed directly by first declaring the browser and data source website variables based on the CefSharp framework, then instantiating the browser and loading the data source website variable. The asynchronous delegation event is then called to initialize the browser; once the browser is initialized, the web page source code corresponding to the data source website variable is acquired and decrypted to obtain the target source code. Calling the asynchronous delegation event and decrypting the encrypted data yields clean data faster. Finally, the target data is determined according to the target source code and stored.
Referring to fig. 12, in some embodiments, step 02 includes:
020: instantiating the browser, loading the data source website variable, and adding an asynchronous delegation event;
021: setting a sleep time;
022: calling the asynchronous delegation event after a delay of the sleep time to initialize the browser.
The initialization module 120 is used to execute steps 020, 021 and 022; that is, the initialization module 120 is used to instantiate the browser, load the data source website variable, add the asynchronous delegation event, set the sleep time, and call the asynchronous delegation event after a delay of the sleep time to initialize the browser.
The processor is used for instantiating the browser, loading the data source website variable, adding the asynchronous delegation event, setting the sleep time, and calling the asynchronous delegation event after a delay of the sleep time to initialize the browser.
Specifically, the sleep time is a delay that allows loading to finish, and developers may set it as circumstances require. For example, when the data source website variable is loaded, different types of data on the web page may be observed to load at different speeds; setting a sleep time ensures that all types of data finish loading before the next step begins. Different types of data are data that serve different purposes; for example, header data is one type and table content is another.
It can be understood that if the next step begins before the data source website has finished loading, the simulated browser displays the result shown in fig. 3, where, as outlined in the figure, only the header data is present and the table content data is missing. The loaded data is incomplete. Therefore, adding the sleep time in combination with calling the asynchronous delegation event yields complete target data, and further combining this with decrypting the encrypted data finally yields complete and clean target data.
In the example of obtaining the target data of a city weather web page in the webBrowser browser, it is observed that the header data of the web page loads first and the table content data loads only 1 to 2 seconds later, so a sleep time of 2 seconds can be set to let both the header data and the table content data finish loading. That is, because the web page loads its data with a delay, the method sleeps for 2 seconds and acquires the data only after loading is complete; otherwise the acquired table content is empty. Accordingly, a ChromiumWebBrowser browser is instantiated with the city weather web address held in cityurl as the data source website; the specific code may be as follows:
Figure BDA0003576700590000061
In this way, the browser is instantiated, the data source website variable is loaded, the asynchronous delegation event is added, the sleep time is set, and finally the asynchronous delegation event is called after a delay of the sleep time to initialize the browser. Adding the sleep time in combination with calling the asynchronous delegation event yields complete target data, and further combining this with decrypting the encrypted data finally yields complete and clean target data.
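The role of the sleep time can be illustrated without a real browser. The following Python sketch is only a simulation under stated assumptions: `SimulatedPage` and `read_after_sleep` are hypothetical names, a `threading.Timer` stands in for the page's delayed loading of table content, and the delays are shortened from the 2 seconds mentioned above so that the example runs quickly. It does not reproduce the CefSharp code shown in the patent's figure.

```python
import threading
import time


class SimulatedPage:
    """Stand-in for a page whose table body arrives later than its header."""

    def __init__(self, body_delay):
        self.parts = ["<th>Month</th>"]  # header is available immediately
        # The table content is appended after body_delay seconds.
        timer = threading.Timer(
            body_delay, self.parts.append, args=["<td>2022-03</td>"]
        )
        timer.start()

    def source(self):
        """Return whatever has been loaded so far."""
        return "".join(self.parts)


def read_after_sleep(page, sleep_seconds):
    """Wait for the sleep time, then read the page source."""
    time.sleep(sleep_seconds)
    return page.source()


page = SimulatedPage(body_delay=0.2)
early = page.source()                             # read at once: header only
late = read_after_sleep(page, sleep_seconds=0.6)  # read after the sleep time
```

Reading immediately returns only the header, which mirrors the incomplete result described for fig. 3; reading after a sleep time longer than the load delay returns the header together with the table content.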
Referring to fig. 13, in some embodiments, step 04 includes:
040: acquiring header data of a webpage source code;
041: matching the header data according to the preset code elements to determine encrypted data;
042: and carrying out data decryption on the header data according to the encrypted data to obtain the target source code.
The decryption module 140 is configured to perform steps 040, 041, and 042; that is, the decryption module 140 is configured to obtain header data of the web page source code, match the header data according to the preset code elements to determine encrypted data, and perform data decryption on the header data according to the encrypted data to obtain the target source code.
The processor is used for acquiring the header data of the webpage source code, matching the header data according to the preset code elements to determine the encrypted data, and decrypting the header data according to the encrypted data to obtain the target source code.
Specifically, after the table body and the table data are obtained, the header data of the web page source code may be further obtained, for example through the <th> tag that represents a header cell.
The preset code elements are code elements that match the encrypted data, for example the strings that HTML uses to mark hidden/undisplayed elements, such as 'display: none' and 'hidden'.
It can be understood that, because the header source code parsed from the encrypted web page contains encrypted data, removing the elements that HTML marks as hidden/undisplayed decrypts the encrypted data and yields clean header source code. The desired data can then be extracted from the clean header source code. Compared with obtaining the desired header data directly from parsed web page source code that still contains encrypted data, decrypting the encrypted data first makes the desired header data easier to obtain.
Thus, the header data of the web page source code is obtained, the header data is matched according to the preset code elements to determine the encrypted data, and the encrypted data in the header data is then decrypted to obtain the target source code. The desired header data can thereby be obtained more easily.
Referring to FIG. 14, in some embodiments, step 041 includes:
0410: dividing the header data into a plurality of header fields;
0411: in the case where the header field includes a hidden code element, the header field is determined to be encrypted data.
The decryption module 140 is configured to perform steps 0410 and 0411, i.e., the decryption module 140 divides the header data into a plurality of header fields and determines the header fields as encrypted data if the header fields include hidden code elements.
The processor is configured to divide the header data into a plurality of header fields, and to determine the header fields as encrypted data if the header fields include a hidden code element.
Specifically, referring to fig. 10, after the header data is obtained, it may be further divided into data of the next-level types, for example into individual titles such as month, range, and quality level. When the obtained web page source code is parsed directly, the source code for each next-level type of the header data occupies its own line, and each line is a header field. For example, the source code for the month title is on one line and the source code for the range title is on another.
In the case where the header field includes a hidden code element, the header field may be determined to be encrypted data.
In an example of acquiring the target data of a city weather web page in the webBrowser browser, the header data may be acquired, and the encrypted header data decrypted, through the following code.
Figure BDA0003576700590000071
Figure BDA0003576700590000081
In the example of acquiring the target data of a city weather web page in the webBrowser browser, a for loop can be used to remove the elements that HTML marks as hidden/undisplayed. In the code, the condition (tmp.Contains("display: none") || tmp.Contains("hidden-md") || tmp.Contains("hidden-lg") || tmp.Contains("hidden-sm") || (tmp.Contains("hidden") && !tmp.Contains("hidden-xs"))) selects the data to remove: data containing 'display: none', 'hidden-md', 'hidden-lg', or 'hidden-sm', or containing 'hidden' without 'hidden-xs', is removed, while data whose only hidden marker is 'hidden-xs' is retained.
In this way, the header data is divided into a plurality of header fields, and in the case where the header fields include a hidden code element, the header fields can be determined to be encrypted data.
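The field-by-field check described above can be sketched as follows. This is a Python illustration, not the C# code from the patent's figures, and it rests on one reading of the condition quoted in the text: a field is removed when it contains 'display: none', 'hidden-md', 'hidden-lg', or 'hidden-sm', or when it contains 'hidden' but not 'hidden-xs'. The function names and the sample header fields are hypothetical.

```python
def is_hidden_field(field):
    """Return True when the field markup matches the assumed removal condition."""
    return (
        "display: none" in field
        or "hidden-md" in field
        or "hidden-lg" in field
        or "hidden-sm" in field
        or ("hidden" in field and "hidden-xs" not in field)
    )


def visible_fields(fields):
    """Loop over the fields and keep only those not treated as encrypted data."""
    kept = []
    for field in fields:
        if not is_hidden_field(field):
            kept.append(field)
    return kept


# Hypothetical header fields, one per line of parsed source code.
sample_header_fields = [
    "<th>Month</th>",
    '<th class="hidden-md">PM10</th>',
    '<th class="hidden-xs">PM10</th>',
    '<th style="display: none">PM2.5</th>',
    '<th class="hidden">Range</th>',
    "<th>Quality level</th>",
]
clean_header_fields = visible_fields(sample_header_fields)
```

Under this reading, the field marked only with 'hidden-xs' survives, while the fields marked 'hidden-md', 'display: none', or plain 'hidden' are dropped as encrypted data.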
Referring to fig. 15, in some embodiments, step 04 includes:
043: acquiring table content data of a webpage source code;
044: matching the table content data according to the preset code elements to determine encrypted data;
045: and carrying out data decryption on the table content data to obtain target data.
The decryption module 140 is configured to perform steps 043, 044 and 045, that is, the decryption module 140 is configured to obtain table content data of the web page source code, match the table content data according to the preset code elements to determine encrypted data, and decrypt the table content data to obtain target data.
The processor is used for acquiring the table content data of the webpage source code, matching the table content data according to the preset code elements to determine encrypted data, and decrypting the table content data to obtain target data.
Specifically, after the table body and the table data are acquired, the table content data of the web page source code may be further acquired, for example through the <td> tag that represents the content of a table cell.
The preset code elements are code elements that match the encrypted data, for example the strings that HTML uses to mark hidden/undisplayed elements, such as 'display: none' and 'hidden'.
It can be understood that, because the table content source code parsed from the encrypted web page contains encrypted data, removing the elements that HTML marks as hidden/undisplayed decrypts the encrypted data and yields clean table content source code. The desired data can then be extracted from the clean table content source code. Compared with obtaining the desired table content data directly from parsed table content source code that still contains encrypted data, decrypting the encrypted data first makes the desired table content data easier to obtain.
Thus, the table content data of the web page source code is obtained, the table content data is matched according to the preset code elements to determine the encrypted data, and the encrypted data in the table content data is then decrypted to obtain the target source code. The desired table content data can thereby be obtained more easily.
Referring to fig. 16, in some embodiments, step 044 includes:
0440: dividing table content data into a plurality of table content fields;
0441: in the case where the table contents field includes a hidden code element, the table contents field is determined to be encrypted data.
The decryption module 140 is configured to perform steps 0440 and 0441, i.e., the decryption module 140 is configured to divide the table content data into a plurality of table content fields, and to determine the table content fields as encrypted data in case the table content fields include hidden code elements.
The processor is configured to divide the table content data into a plurality of table content fields and to determine the table content fields as encrypted data if the table content fields include a hidden code element.
Specifically, referring to fig. 10, after the table content data is obtained, it may be further divided into data of the next-level types, for example into individual cells. When the obtained web page source code is parsed directly, the source code for each next-level type of the table content data occupies its own line, and each line is a table content field. For example, the source code representing each cell is on its own line.
In the case where the table contents field includes a hidden code element, the table contents field may be determined to be encrypted data.
In an example of obtaining the target data of a city weather web page in the webBrowser browser, the table content data may be acquired, and the encrypted table content data decrypted, through the following code.
Figure BDA0003576700590000082
Figure BDA0003576700590000091
In this example, a for loop can be used to remove the elements that HTML marks as hidden/undisplayed. In the code, the condition (tmp.Contains("display: none") || tmp.Contains("hidden-md") || tmp.Contains("hidden-lg") || tmp.Contains("hidden-sm") || (tmp.Contains("hidden") && !tmp.Contains("hidden-xs"))) selects the data to remove: data containing 'display: none', 'hidden-md', 'hidden-lg', or 'hidden-sm', or containing 'hidden' without 'hidden-xs', is removed, while data whose only hidden marker is 'hidden-xs' is retained.
In this manner, the table content data is divided into a plurality of table content fields, and in the case where the table content fields include a hidden code element, the table content fields can be determined to be encrypted data.
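For table content, the same idea can also be applied while the cells are being parsed. The Python sketch below uses the standard library `html.parser` module and is an attribute-level variant, not the substring test described in the text: it inspects each `<td>` tag's `class` and `style` attributes. The class name `VisibleCellParser`, the helper `parse_visible_rows`, and the sample table are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Classes treated as hidden in this sketch; "hidden-xs" is deliberately absent.
ALWAYS_HIDDEN = {"hidden-md", "hidden-lg", "hidden-sm"}


class VisibleCellParser(HTMLParser):
    """Collect the text of visible <td> cells, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._keep = False  # True while inside a visible <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            attr = dict(attrs)
            classes = set((attr.get("class") or "").split())
            style = (attr.get("style") or "").replace(" ", "")
            hidden = (
                bool(classes & ALWAYS_HIDDEN)
                or ("hidden" in classes and "hidden-xs" not in classes)
                or "display:none" in style
            )
            self._keep = not hidden
            if self._keep:
                if not self.rows:  # tolerate a <td> that appears before any <tr>
                    self.rows.append([])
                self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._keep = False

    def handle_data(self, data):
        if self._keep:
            self.rows[-1][-1] += data


def parse_visible_rows(html):
    parser = VisibleCellParser()
    parser.feed(html)
    return parser.rows


sample_table = (
    "<table>"
    '<tr><td>2022-02</td><td style="display: none">999</td>'
    '<td class="hidden-xs">61</td><td>Good</td></tr>'
    '<tr><td>2022-03</td><td class="hidden-md">888</td>'
    '<td class="hidden-xs">56</td><td class="hidden">0</td><td>Good</td></tr>'
    "</table>"
)
visible_rows = parse_visible_rows(sample_table)
```

Each row keeps the same three visible cells even though the decoy cells sit at different positions in each row, which preserves the correspondence with a header cleaned under the same rule.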
Referring to fig. 17, in some embodiments, step 05 includes:
050: converting the target source code into data table cache data;
051: and inserting the data table cache data into the target database in a loop for storage.
The storage module 150 is configured to perform steps 050 and 051; that is, the storage module 150 is configured to convert the target source code into data table cache data and to insert the data table cache data into the target database in a loop for storage.
The processor is used for converting the target source code into data table cache data and inserting the data table cache data into the target database in a loop for storage.
Specifically, the data table cache data is the result of structuring the target data; the cached data is arranged in the form of a data table. The target database is the database that a developer uses to store the data table cache data.
To structure the target data into data table cache data and store it in a database, the HTML <table> data is first converted into data table cache data. The data table cache data is then inserted into the underlying database, which may be Oracle, MySQL, SQL Server, or the like.
In the example of obtaining the target data of a city weather web page in the webBrowser browser, the target data is structured into data table cache data and stored in the database with the following code.
The code for converting the HTML <table> data into data table cache data is as follows, and the result is shown in fig. 18:
DataTable dt=HtmlTableParser.ParseDataSet(newHtmlStr).Tables[0];
The data table cache data obtained by the above conversion is then inserted into the underlying database in a loop.
It can be understood that inserting the data table cache data of different cities, obtained by the web page data acquisition method of the present application, into the underlying database yields the result shown in fig. 11.
Thus, the target source code is converted into data table cache data, and the data table cache data is inserted into the target database in a loop for storage, which completes the storage of the target data.
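The loop insertion step can be sketched with an in-memory database. In this Python illustration, the standard library `sqlite3` module stands in for the Oracle, MySQL, or SQL Server databases named above, and the function `store_rows`, the table name, the column names, and the sample rows are all hypothetical. The rows are assumed to have already been extracted from the clean target source code.

```python
import sqlite3


def store_rows(conn, table, columns, rows):
    """Create the table if needed, then insert the cached rows one by one."""
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    for row in rows:  # one INSERT per loop iteration
        conn.execute(f'INSERT INTO "{table}" VALUES ({placeholders})', row)
    conn.commit()


# Hypothetical cached table: column titles plus one tuple per table row.
columns = ["month", "pm10", "quality_level"]
cached_rows = [
    ("2022-02", "61", "Good"),
    ("2022-03", "56", "Good"),
]

conn = sqlite3.connect(":memory:")
store_rows(conn, "city_weather", columns, cached_rows)
stored = conn.execute(
    'SELECT month, pm10, quality_level FROM "city_weather" ORDER BY month'
).fetchall()
```

Values are passed as query parameters rather than concatenated into the SQL text, which is a design choice of this sketch and not something the patent specifies.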
Referring to fig. 19, the present application also provides a non-volatile computer-readable storage medium 300 containing a computer program 301. The computer program 301, when executed by the one or more processors 200, causes the one or more processors 200 to perform the web page data collection method of any of the embodiments described above.
In the description herein, references to the description of "certain embodiments," "in one example," "exemplary," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art can combine the various embodiments or examples, and the features of the various embodiments or examples, described in this specification, provided they do not contradict one another.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application and that variations, modifications, substitutions and alterations in the above embodiments may be made by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A webpage data acquisition method is characterized by comprising the following steps:
declaring variables of a browser and a data source website based on a CefSharp framework;
instantiating the browser, loading the data source website variable, and calling an asynchronous delegation event to initialize the browser;
acquiring a webpage source code corresponding to the data source website variable under the condition that the browser initialization is completed;
carrying out data decryption on the webpage source code to obtain a target source code;
and determining target data according to the target source code and storing the target data.
2. The method of claim 1, wherein the instantiating the browser, loading the data source website variable, and calling an asynchronous delegation event to initialize the browser comprises:
instantiating the browser, loading the data source website variable, and adding the asynchronous delegation event;
setting a sleep time;
and calling the asynchronous delegation event after a delay of the sleep time to initialize the browser.
3. The method for acquiring webpage data according to claim 1, wherein the decrypting the webpage source code to obtain the target source code comprises:
acquiring header data of the webpage source code;
matching the header data according to preset code elements to determine encrypted data;
and carrying out data decryption on the header data according to the encrypted data to obtain the target source code.
4. The method for acquiring webpage data according to claim 3, wherein the matching the header data according to the preset code elements to determine the encrypted data comprises:
dividing the header data into a plurality of header fields;
determining the header field as the encrypted data if the header field includes the hidden code element.
5. The method for acquiring webpage data according to claim 1, wherein the decrypting the webpage source code to obtain the target source code comprises:
acquiring table content data of the webpage source code;
matching the table content data according to preset code elements to determine encrypted data;
and carrying out data decryption on the table content data to obtain the target data.
6. The method for acquiring webpage data according to claim 5, wherein the matching the table content data according to the preset code elements to determine the encrypted data comprises:
dividing the table content data into a plurality of table content fields;
determining the table content field as the encrypted data if the table content field includes the hidden code element.
7. The method for collecting data on web pages according to claim 1, wherein said determining target data according to the target source code and storing the target data comprises:
converting the target source code into data table cache data;
and inserting the data table cache data into a target database in a loop for storage.
8. A web page data acquisition device, comprising:
the declaration module is used for declaring the browser and the data source website variables based on the CefSharp framework;
the initialization module is used for instantiating the browser, loading the data source website variable and calling an asynchronous delegation event to initialize the browser;
the acquisition module is used for acquiring a webpage source code corresponding to the data source website variable under the condition that the browser is initialized;
the decryption module is used for carrying out data decryption on the webpage source code to obtain a target source code;
and the storage module is used for determining target data according to the target source code and storing the target data.
9. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the computer program, when executed by the processor, implements the web page data collecting method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by one or more processors, implements the web page data collection method of any one of claims 1-9.
CN202210346533.XA 2022-03-31 2022-03-31 Webpage data acquisition method, webpage data acquisition device and storage medium Pending CN114969474A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210346533.XA CN114969474A (en) 2022-03-31 2022-03-31 Webpage data acquisition method, webpage data acquisition device and storage medium


Publications (1)

Publication Number Publication Date
CN114969474A true CN114969474A (en) 2022-08-30

Family

ID=82978003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210346533.XA Pending CN114969474A (en) 2022-03-31 2022-03-31 Webpage data acquisition method, webpage data acquisition device and storage medium

Country Status (1)

Country Link
CN (1) CN114969474A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination