CN114428635A - Data acquisition method and device, electronic equipment and storage medium - Google Patents

Data acquisition method and device, electronic equipment and storage medium

Publication number
CN114428635A
Authority
CN
China
Prior art keywords
request
configuration file
website
request parameter
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210352822.0A
Other languages
Chinese (zh)
Inventor
梁仪
姚斌
陈家银
张伟
陈曦
麻志毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced Institute of Information Technology AIIT of Peking University
Hangzhou Weiming Information Technology Co Ltd
Original Assignee
Advanced Institute of Information Technology AIIT of Peking University
Hangzhou Weiming Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced Institute of Information Technology AIIT of Peking University and Hangzhou Weiming Information Technology Co Ltd
Priority to CN202210352822.0A
Publication of CN114428635A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/71 Version control; Configuration management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions

Abstract

The invention discloses a data acquisition method and apparatus, an electronic device, and a storage medium. The method comprises: obtaining a configuration file for a website; taking the first request parameter in the configuration file as the current request parameter, initiating a network request to the website using the current request parameter, and receiving and storing the webpage data returned by the website; determining whether a next request parameter exists in the configuration file; and, if so, taking the next request parameter as the current request parameter and returning to the step of initiating a network request to the website using the current request parameter. In this scheme, website collection is driven by configuration, which reduces the repeated writing of collection programs and thereby saves labor cost. Collection personnel write a configuration file according to the network protocol rules of the target website, and collection is then carried out entirely through the network protocol interface by reading that configuration file, which greatly improves data acquisition efficiency.

Description

Data acquisition method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a data acquisition method, a data acquisition device, electronic equipment and a storage medium.
Background
At present, more and more enterprises use bidding to mine market intelligence and expand their markets. According to statistics, tens of thousands of sites across the web publish bidding-related data. Monitoring these sites effectively, so that the latest bidding information can be pushed to enterprises in real time, is key to improving an enterprise's ability to mine market intelligence.
In the related art, each site is analyzed individually and a dedicated collection program is written for it. Because site analysis is time-consuming and tedious, monitoring the whole web requires a large investment of labor and time, and data acquisition efficiency is very low.
Disclosure of Invention
The present invention provides a data acquisition method and apparatus, an electronic device, and a storage medium to overcome the above deficiencies of the prior art. This object is achieved by the following technical solutions.
A first aspect of the present invention provides a data acquisition method, including:
obtaining a configuration file for a website;
taking the first request parameter in the configuration file as the current request parameter, generating a network request from the current request parameter according to the request structure set in the configuration file, and replacing the actual original address in the network request with a proxy address through preset proxy pool middleware;
sending the network request with the replaced address to the website, and receiving and storing the webpage data returned by the website;
determining whether a next request parameter exists in the configuration file;
and, if so, taking the next request parameter as the current request parameter and returning to the step of generating a network request from the current request parameter according to the request structure set in the configuration file.
A second aspect of the present invention provides a data acquisition apparatus, the apparatus comprising:
an obtaining module, configured to obtain a configuration file for a website;
an acquisition module, configured to take the first request parameter in the configuration file as the current request parameter and generate a network request from the current request parameter according to the request structure set in the configuration file; replace the actual original address in the network request with a proxy address through preset proxy pool middleware; and send the network request with the replaced address to the website, and receive and store the webpage data returned by the website;
a determining module, configured to determine whether a next request parameter exists in the configuration file;
and a skip module, configured to, when the next request parameter is determined to exist, take the next request parameter as the current request parameter and return to the process of generating a network request from the current request parameter according to the request structure set in the configuration file.
A third aspect of the present invention proposes an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method according to the first aspect when executing the program.
A fourth aspect of the present invention proposes a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method according to the first aspect as described above.
The data acquisition method and apparatus of the first and second aspects provide at least the following beneficial effects or advantages:
In this scheme, website collection is driven by configuration, which reduces the repeated writing of collection programs and thereby saves labor cost. Collection personnel write a configuration file according to the network protocol rules of the target website, and collection is then carried out entirely through the network protocol interface by reading that configuration file, which greatly improves data acquisition efficiency. With simple environment configuration, the configuration files can be imported on any operating system to perform data acquisition.
In addition, before a network request is sent to the target website, the proxy pool middleware replaces the actual original address in the request, so that each network request can come from a different address and the IP detection mechanism of the target website is bypassed.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart illustrating an embodiment of a data collection method according to an exemplary embodiment of the present invention;
FIG. 2 is a schematic diagram of a network request structure according to the present invention;
FIG. 3 is a diagram illustrating a webpage data parsing configuration according to the present invention;
FIG. 4 is a schematic diagram of a data list obtained by parsing a list page according to the present invention;
FIG. 5 is a schematic diagram of a data acquisition device according to an exemplary embodiment of the present invention;
FIG. 6 is a diagram illustrating a hardware configuration of an electronic device according to an exemplary embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a storage medium according to an exemplary embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms "first", "second", "third", etc. may be used herein to describe various pieces of information, the information should not be limited by these terms. The terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present invention, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "at the time of ...", "when ...", or "in response to a determination".
Collection frameworks currently on the market include software such as Octoparse ("octopus") and LocoySpider ("locomotive"). These tools, however, are browser-based: their collection performance and efficiency cannot meet the demands of collecting from a large number of websites, and the operating systems they run on are limited.
To solve this technical problem, the present application provides a data acquisition method. A configuration file for a website is obtained, and the first request parameter in the configuration file is taken as the current request parameter. A network request is generated from the current request parameter according to the request structure set in the configuration file, and the actual original address in the network request is replaced with a proxy address through preset proxy pool middleware. After the network request with the replaced address is sent to the website, the webpage data returned by the website is received and stored. It is then determined whether a next request parameter exists in the configuration file; if so, the next request parameter is taken as the current request parameter, and the process returns to the step of initiating a network request to the website using the current request parameter.
Based on the above description, the following technical effects can be achieved:
In this scheme, website collection is driven by configuration, which reduces the repeated writing of collection programs and thereby saves labor cost. Collection personnel write a configuration file according to the network protocol rules of the target website, and collection is then carried out entirely through the network protocol interface by reading that configuration file, which greatly improves data acquisition efficiency. With simple environment configuration, the configuration files can be imported on any operating system to perform data acquisition.
In addition, before a network request is sent to the target website, the proxy pool middleware replaces the actual original address in the request, so that each network request can come from a different address and the IP detection mechanism of the target website is bypassed.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Embodiment 1:
FIG. 1 is a flowchart illustrating an embodiment of a data acquisition method according to an exemplary embodiment of the present invention. The data acquisition method includes the following steps:
Step 101: Obtain a configuration file for a website.
The configuration file is a JSON configuration written by collection personnel according to the rules of the website to be collected, such as its data request method, the html tag layout of its webpages, and the way its webpage data is rendered. The configuration file is then imported into a specified task queue, from which configuration files are taken out in order for data acquisition.
Accordingly, the configuration file is obtained from the specified task queue. Because configuration files are placed in the queue and processed in order, no configuration file is missed.
It should be noted that the specified task queue holds one configuration file for each website to be collected.
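By way of illustration only, step 101 might be sketched as follows in plain Python. The JSON key names (`site`, `requests`, `method`, `url`, `value`, `headers`, `body`), the example URLs, and the helper functions are all assumptions made for this sketch; the source states only that the configuration is JSON and that it is placed in a task queue.

```python
import json
from queue import Queue

# Hypothetical configuration for one website; every key name here is an assumption.
CONFIG_JSON = """
{
  "site": "example-bidding-site",
  "requests": [
    {"method": "get", "url": {"value": "https://example.com/list?page=1"},
     "headers": {}, "body": null},
    {"method": "get", "url": {"value": "https://example.com/detail"},
     "headers": {}, "body": null}
  ]
}
"""


def enqueue_config(task_queue: Queue, raw_json: str) -> None:
    """Parse a JSON configuration and import it into the task queue."""
    task_queue.put(json.loads(raw_json))


def next_config(task_queue: Queue):
    """Take the next configuration in FIFO order, or None if the queue is empty."""
    return None if task_queue.empty() else task_queue.get()


task_queue = Queue()
enqueue_config(task_queue, CONFIG_JSON)
config = next_config(task_queue)
# Step 102 would start from the first request parameter of this configuration.
first_request_parameter = config["requests"][0]
```

A first-in, first-out queue is one simple way to obtain the "certain order" mentioned above; other orderings would work equally well for the sketch.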
Step 102: Take the first request parameter in the configuration file as the current request parameter.
A website usually serves many webpages that are linked by jump relationships, so the configuration file must set request parameters for each webpage. Because the request parameters in the configuration file are arranged in a defined order, the first request uses the first request parameter in the configuration file.
Step 103: Generate a network request from the current request parameter according to the request structure set in the configuration file, replace the actual original address in the network request with a proxy address through preset proxy pool middleware, send the network request with the replaced address to the website, and receive and store the webpage data returned by the website.
Before step 103 is executed, it should be noted that, to better support data collection, the method is built on the crawler framework Scrapy and uses the middleware mechanism provided by that framework to modify data before a network request is initiated or after a response is returned. In the invention, a proxy pool middleware, a renderer middleware, a text encoding conversion middleware, and a request retry middleware are preset.
The proxy pool middleware replaces the actual original IP address with a proxy IP address before each network request, so that successive requests come from different IP addresses and the IP detection mechanism of the target website is bypassed. The renderer middleware handles websites whose content is loaded asynchronously: it runs the network request in a simulated browser, renders the returned webpage data into a page, and returns the rendered data, so that whatever is visible on the page can be collected, that is, what you see is what you can crawl. The text encoding conversion middleware normalizes the encoding of the returned webpage data. Webpages do not all use the same text encoding (common ones are UTF-8, Unicode, and GBK), so data returned by some websites often appears garbled; this middleware identifies the encoding of the returned data and converts it into a unified encoding to prevent garbled text. The request retry middleware handles request timeouts and other abnormal errors. Timeouts arise mainly because proxy IPs are used and their network conditions are unknown; in addition, some requests return incorrect data, which causes program exceptions. In both cases the request is retried until correct data is returned.
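As a minimal sketch of the proxy pool idea, the following plain-Python class assigns a different proxy address to each outgoing request in round-robin order. It does not use the crawler framework itself; the class name, the dictionary-based request, the `meta["proxy"]` key, and the proxy addresses are assumptions for illustration, and the source does not say how proxies are selected.

```python
from itertools import cycle


class ProxyPoolMiddleware:
    """Hypothetical middleware that rotates through a fixed pool of proxy addresses."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("the proxy pool must contain at least one address")
        self._proxies = cycle(proxies)  # endless round-robin iterator over the pool

    def process_request(self, request: dict) -> dict:
        """Attach the next proxy address to the request before it is sent."""
        request.setdefault("meta", {})["proxy"] = next(self._proxies)
        return request


# Example pool; these addresses are placeholders, not real proxies.
pool = ProxyPoolMiddleware(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])
first = pool.process_request({"url": "https://example.com/list"})
second = pool.process_request({"url": "https://example.com/list?page=2"})
third = pool.process_request({"url": "https://example.com/list?page=3"})
```

Round-robin rotation is only one possible policy; a random choice or a health-checked pool would satisfy the same description.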
With the proxy pool middleware, renderer middleware, text encoding conversion middleware, and request retry middleware preset as described above, their use is explained below through a specific embodiment.
As shown in FIG. 2, a network request mainly consists of request headers (headers), a request address (url), request body parameters (body), and a request method (get/post). The headers are the network protocol headers of the request and are kept consistent with those expected by the target address. The url accepts three parameters: value is the url of the target webpage or the address of an ajax interface; re is an optional regular expression that matches a specified value from value; and process holds processing functions, for example for splicing or replacing parts of a url, which are applied to the value matched by re. Bidding websites generally use either the get or the post request method. When the method is get, the body is empty; when it is post, the body specified by the target website must be supplied.
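The request structure described above could be turned into a concrete request roughly as follows. This is a sketch under assumptions: the `NetworkRequest` class, the function names, and the representation of process as a single Python callable are invented for illustration, and the example URL pattern is hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class NetworkRequest:
    method: str
    url: str
    headers: dict = field(default_factory=dict)
    body: Optional[str] = None


def build_url(value: str, pattern: Optional[str] = None,
              process: Optional[Callable[[str], str]] = None) -> str:
    """Match a part of `value` with `pattern` (if given), then post-process it."""
    matched = value
    if pattern:
        m = re.search(pattern, value)
        if m is None:
            raise ValueError(f"pattern {pattern!r} did not match {value!r}")
        # Use the first capture group when the pattern defines one.
        matched = m.group(1) if m.groups() else m.group(0)
    return process(matched) if process else matched


def build_request(param: dict) -> NetworkRequest:
    """Generate a network request from one request parameter of the configuration."""
    method = param["method"].lower()
    if method not in ("get", "post"):
        raise ValueError(f"unsupported request method: {method}")
    url_cfg = param["url"]
    url = build_url(url_cfg["value"], url_cfg.get("re"), url_cfg.get("process"))
    body = param.get("body") if method == "post" else None  # get requests carry no body
    return NetworkRequest(method, url, dict(param.get("headers", {})), body)


# Hypothetical parameter: extract the notice id and splice it into an ajax address.
get_request = build_request({
    "method": "get",
    "headers": {"User-Agent": "example-agent"},
    "url": {
        "value": "/notice/12345.html",
        "re": r"/notice/(\d+)\.html",
        "process": lambda notice_id: f"https://example.com/api/notice?id={notice_id}",
    },
    "body": "ignored for get",
})
post_request = build_request({
    "method": "post",
    "url": {"value": "https://example.com/api/list"},
    "body": "page=1",
})
```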
Further, after the network request with the replaced address is sent to the website, the preset request retry middleware may detect whether the request has timed out or the returned data is abnormal. If so, and the retry count is smaller than a preset threshold, the retry count is incremented by 1 and the process of generating a network request from the current request parameter according to the request structure set in the configuration file is re-executed.
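The retry rule can be sketched as a small loop. The function name, the use of `TimeoutError` to signal a timeout, and the treatment of empty data as "abnormal" are assumptions for this sketch; the source does not define what counts as abnormal data or what happens once the threshold is reached (here an exception is raised).

```python
from typing import Callable, Optional


def fetch_with_retry(send: Callable[[dict], Optional[str]], request: dict,
                     max_retries: int = 3) -> str:
    """Send a request, retrying on timeout or abnormal data up to `max_retries` times."""
    retries = 0
    while True:
        try:
            data = send(request)
        except TimeoutError:
            data = None  # a timeout is handled like abnormal data
        if data:  # assumption: any non-empty response counts as correct data
            return data
        if retries >= max_retries:
            raise RuntimeError(f"giving up after {retries} retries")
        retries += 1  # retry count is below the threshold, so add 1 and try again


# A fake sender that times out once, returns empty data once, then succeeds.
attempts = {"count": 0}


def flaky_send(request: dict) -> Optional[str]:
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise TimeoutError("proxy did not answer")
    if attempts["count"] == 2:
        return ""
    return "<html>ok</html>"


result = fetch_with_retry(flaky_send, {"url": "https://example.com/list"})
```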
In a possible implementation, in the process of receiving and storing the webpage data returned by the website, the encoding of the returned webpage data may first be converted into a preset encoding through the preset text encoding conversion middleware. The converted webpage data is then parsed using the parsing rule in the configuration file, and the parsed webpage data is stored.
The preset encoding is the unified encoding required by the actual parsing needs.
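One simple way to approximate the encoding conversion, shown here only as a sketch, is to try a list of candidate encodings in order and re-encode the first successful decode into the preset encoding. The function name, the candidate list, and the choice of UTF-8 as the preset encoding are assumptions; the source does not say how the encoding is identified, and a real system might rely on response headers or a detection library instead.

```python
def to_unified_encoding(raw: bytes, candidates=("utf-8", "gbk"),
                        target: str = "utf-8") -> bytes:
    """Decode `raw` with the first candidate encoding that fits, then re-encode it."""
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # not this encoding, try the next candidate
        return text.encode(target)
    raise ValueError("none of the candidate encodings could decode the data")


title = "招标公告"
gbk_page = title.encode("gbk")      # bytes as a GBK-encoded site might return them
utf8_page = title.encode("utf-8")   # bytes as a UTF-8 site might return them

converted_from_gbk = to_unified_encoding(gbk_page)
converted_from_utf8 = to_unified_encoding(utf8_page)
```

Strict UTF-8 decoding is tried first because arbitrary GBK byte sequences rarely form valid UTF-8, whereas the reverse check would be far less reliable.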
It should be noted that webpages on a website (such as a bidding website) generally fall into two categories, list pages and detail pages. The two categories have almost the same structure, except that a list page usually contains one additional field.
On this basis, when the converted webpage data is parsed using the parsing rule in the configuration file, it may first be determined whether the data contains a preset field. If it does, the webpage data is parsed with the list page parsing rule in the configuration file; if it does not, the webpage data is parsed with the detail page parsing rule in the configuration file.
The preset field is determined by analyzing the actual website and serves to distinguish list page data from detail page data. For example, if a list page contains an li field that a detail page lacks, that field can be set as the preset field.
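The choice between the two parsing rules can be illustrated with a short sketch. The configuration keys (`preset_field`, `list_rule`, `detail_rule`), the rule contents, and the plain substring test are assumptions made for this example.

```python
def select_parse_rule(page_html: str, config: dict) -> dict:
    """Return the list page rule if the preset field occurs in the page,
    otherwise the detail page rule."""
    # A substring test is the simplest possible check; note that "<li" would
    # also match tags such as "<link", so a real rule may need to be stricter.
    if config["preset_field"] in page_html:
        return config["list_rule"]
    return config["detail_rule"]


# Hypothetical configuration fragment and sample pages.
parse_config = {
    "preset_field": "<li",
    "list_rule": {"kind": "list", "xpath": "//ul/li"},
    "detail_rule": {"kind": "detail", "xpath": "//div[@class='content']"},
}
list_page = "<ul><li><a href='/n/1.html'>Notice 1</a></li></ul>"
detail_page = "<div class='content'>Full notice text</div>"

list_rule = select_parse_rule(list_page, parse_config)
detail_rule = select_parse_rule(detail_page, parse_config)
```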
In a specific implementation, as shown in the list page parsing structure of FIG. 3, the data represented by the li fields in the webpage data is first parsed using one of an XPath expression, a CSS selector expression, a regular expression, or JSON format; a user-defined processing function in process then performs the specified processing and returns a data list.
As shown in FIG. 4, each entry in the data list may correspond to multiple item elements, and each item element carries two fields: value and field_name. Here, value is the value parsed and filled into the item by the item_loader mechanism, and field_name is the name of the field that the item element represents. For example, if item1 represents the title, its field_name is set to title; if item2 represents the time, its field_name is set to date.
The item_loader mechanism is a preprocessing step applied to each item element, which selects a processing method according to the element's field_name: when field_name is date, data in a time format is located in value; when field_name is title, whitespace characters in value are removed. Processing methods for specific fields can therefore be customized in the item_loader.
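The per-field preprocessing described for item_loader can be imitated in plain Python as follows. This is not the crawler framework's own loader API; the processor table, the function names, and the `YYYY-MM-DD` date pattern are assumptions chosen for the sketch.

```python
import re


def clean_title(value: str) -> str:
    """Remove all whitespace characters from a title value."""
    return re.sub(r"\s+", "", value)


def extract_date(value: str):
    """Find the first date of the (assumed) form YYYY-MM-DD in the value."""
    match = re.search(r"\d{4}-\d{1,2}-\d{1,2}", value)
    return match.group(0) if match else None


# Processing method chosen by field_name; unknown fields are passed through unchanged.
FIELD_PROCESSORS = {"title": clean_title, "date": extract_date}


def load_item(elements: list) -> dict:
    """Build one item from item elements that each carry `value` and `field_name`."""
    item = {}
    for element in elements:
        processor = FIELD_PROCESSORS.get(element["field_name"], lambda v: v)
        item[element["field_name"]] = processor(element["value"])
    return item


item = load_item([
    {"field_name": "title", "value": "  招标 公告\n"},
    {"field_name": "date", "value": "发布时间:2022-04-06 10:00"},
    {"field_name": "source", "value": "example site"},
])
```

Adding support for another field then only requires registering one more function in the processor table, which mirrors the customization point mentioned above.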
It should be noted that, apart from the parsing of the li field, the rest of the list page parsing process described above also applies to parsing detail pages.
In another possible implementation of step 103, the preset renderer middleware may run the network request in a simulated browser, render the returned webpage data, and store the rendered webpage data.
Step 104: Determine whether a next request parameter exists in the configuration file. If not, return to step 101; if so, execute step 105.
Analysis of real websites shows that most require one jump, from the list page to a detail page, to complete one round of data collection. A small number require no jump at all, for example when the website returns the full data of every article on the current page through an ajax interface, and some require several jumps before the article detail page is reached. Accordingly, when the configuration file is written for a website, request parameters for multiple layers of requests are configured in jump order if the site requires page jumps, and request parameters for only a single layer are configured if it does not.
In a specific embodiment, the next request parameter can be called back through a callback function (callback), and if the callback function outputs a callback success result, the next request parameter is determined to exist in the configuration file; and if the callback function outputs a callback failure result, determining that the next request parameter does not exist in the configuration file.
Step 105: the next request parameter is taken as the current request parameter, and the process returns to execute step 103.
It should be noted that the request parameters for each layer request are in accordance with the requirements of the network request structure shown in fig. 2.
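Steps 102 to 105 together form a loop over the layered request parameters, which might be sketched as follows. The callback shape (returning a success flag and the parameter), the `requests` key, and the injected `fetch` function are assumptions made for illustration; proxy replacement, retries, and parsing are left out to keep the loop itself visible.

```python
from typing import Callable, Iterable, Optional, Tuple


def make_next_param_callback(params: Iterable[dict]) -> Callable[[], Tuple[bool, Optional[dict]]]:
    """Return a callback that reports whether a next request parameter exists."""
    iterator = iter(params)

    def callback() -> Tuple[bool, Optional[dict]]:
        try:
            return True, next(iterator)   # callback success: a parameter exists
        except StopIteration:
            return False, None            # callback failure: no further parameter

    return callback


def collect(config: dict, fetch: Callable[[dict], str]) -> list:
    """Run the request parameters of one configuration file in order."""
    stored = []
    callback = make_next_param_callback(config["requests"])
    exists, current = callback()           # step 102: first request parameter
    while exists:
        stored.append(fetch(current))      # step 103: request and store the data
        exists, current = callback()       # steps 104 and 105: move to the next one
    return stored


# Hypothetical two-layer configuration: list page first, then detail page.
two_layer_config = {"requests": [{"url": "list"}, {"url": "detail"}]}
pages = collect(two_layer_config, lambda param: f"page:{param['url']}")
```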
This completes the collection flow shown in FIG. 1. In this scheme, website collection is driven by configuration, which reduces the repeated writing of collection programs and thereby saves labor cost. Collection personnel write a configuration file according to the network protocol rules of the target website, and collection is then carried out entirely through the network protocol interface by reading that configuration file, which greatly improves data acquisition efficiency. With simple environment configuration, the configuration files can be imported on any operating system to perform data acquisition.
In addition, before a network request is sent to the target website, the proxy pool middleware replaces the actual original address in the request, so that each network request can come from a different address and the IP detection mechanism of the target website is bypassed.
Corresponding to the embodiment of the data acquisition method, the invention also provides an embodiment of the data acquisition device.
FIG. 5 is a schematic structural diagram of a data acquisition apparatus according to an exemplary embodiment of the present invention. The apparatus is configured to perform the data acquisition method provided in any of the above embodiments. As shown in FIG. 5, the data acquisition apparatus includes:
an obtaining module 510, configured to obtain a configuration file for a website;
an acquisition module 520, configured to use the first request parameter in the configuration file as a current request parameter, and generate a network request according to the request structure set in the configuration file by using the current request parameter; replacing the actual original address in the network request by using the proxy address through a preset proxy pool middleware; sending a network request after replacing an address to the website, receiving webpage data returned by the website and storing the webpage data;
a determining module 530, configured to determine whether a next request parameter exists in the configuration file;
and a skip module 540, configured to, when the next request parameter is determined to exist, take the next request parameter as the current request parameter and return to the process of generating a network request from the current request parameter according to the request structure set in the configuration file.
In a possible implementation manner, the obtaining module 510 is specifically configured to obtain a configuration file from a specified task queue.
In one possible implementation, the apparatus further comprises (not shown in fig. 5):
a re-request module, configured to detect, through preset request retry middleware, whether the request has timed out or the returned data is abnormal after the network request with the replaced address is sent to the website; and, if so and the retry count is smaller than a preset threshold, to increment the retry count by 1 and re-execute the process of generating a network request from the current request parameter according to the request structure set in the configuration file.
In a possible implementation manner, the determining module 530 is specifically configured to call back a next request parameter through a callback function; if the callback function outputs a callback success result, determining that the next request parameter exists in the configuration file; and if the callback function outputs a callback failure result, determining that the next request parameter does not exist in the configuration file.
In a possible implementation manner, the acquisition module 520 is specifically configured to convert, by a preset text code conversion middleware, a code format of the webpage data returned by the website into a preset code format in a process of receiving and storing the webpage data returned by the website; and analyzing the webpage data after the code format conversion by using the analysis rule in the configuration file, and storing the analyzed webpage data.
In a possible implementation manner, the acquisition module 520 is specifically configured to determine whether the converted webpage data includes a preset field in a process of analyzing the converted webpage data by using an analysis rule in the configuration file; if the preset fields are contained, analyzing the webpage data by using the list page analysis rule in the configuration file; and if the webpage data does not contain the preset fields, analyzing the webpage data by using the detail page analysis rule in the configuration file.
In a possible implementation manner, the acquisition module 520 is specifically configured to run the network request in a simulated browser and render the returned webpage data through preset renderer middleware, and to store the rendered webpage data.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the invention also provides electronic equipment corresponding to the data acquisition method provided by the embodiment, so as to execute the data acquisition method.
Fig. 6 is a hardware block diagram of an electronic device according to an exemplary embodiment of the present invention, the electronic device including: a communication interface 601, a processor 602, a memory 603, and a bus 604; the communication interface 601, the processor 602 and the memory 603 communicate with each other via a bus 604. The processor 602 may execute the data collection method described above by reading and executing machine executable instructions corresponding to the control logic of the data collection method in the memory 603, and the details of the method are described in the above embodiments, which will not be described herein again.
The memory 603 referred to in this disclosure may be any electronic, magnetic, optical, or other physical storage device that can contain stored information, such as executable instructions, data, and so forth. Specifically, the memory 603 may be a RAM (Random Access Memory), a flash memory, a storage drive (e.g., a hard disk drive), any type of storage disk (e.g., an optical disk, a DVD, etc.), a similar storage medium, or a combination thereof. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 601 (which may be wired or wireless), and may use the Internet, a wide area network, a local area network, a metropolitan area network, and the like.
Bus 604 can be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The memory 603 is used for storing a program, and the processor 602 executes the program after receiving the execution instruction.
The processor 602 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 602 or by instructions in the form of software. The processor 602 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
The electronic device provided by the embodiment of the application shares the same inventive concept as the data acquisition method provided by the embodiment of the application, and has the same beneficial effects as the method that it adopts, runs, or implements.
Referring to Fig. 7, the computer-readable storage medium is shown as an optical disc 30 on which a computer program (i.e., a program product) is stored; when executed by a processor, the computer program performs the data acquisition method according to any of the foregoing embodiments.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memories (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical and magnetic storage media, which are not described in detail herein.
The computer-readable storage medium provided by the above embodiment of the present application has the same beneficial effects as the data acquisition method provided by the embodiment of the present application, that is, the method adopted, run, or implemented by the application program stored on the medium.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. The invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of data acquisition, the method comprising:
acquiring a configuration file for a website;
taking the first request parameter in the configuration file as a current request parameter, generating a network request by using the current request parameter and according to a request structure set in the configuration file, and replacing an actual original address in the network request by using a proxy address through a preset proxy pool middleware;
sending a network request after replacing an address to the website, receiving webpage data returned by the website and storing the webpage data;
judging whether the configuration file has a next request parameter or not;
and if so, taking the next request parameter as the current request parameter, and returning to execute the process of generating the network request by using the current request parameter and according to the request structure set in the configuration file.
2. The method of claim 1, wherein obtaining the configuration file for the website comprises:
obtaining the configuration file from an assigned task queue.
3. The method of claim 1, wherein after sending the network request after the address replacement to the website, the method further comprises:
detecting, through a preset request retry middleware, whether a request timeout or abnormal returned data has occurred;
and if so, when the retry count is determined to be smaller than a preset threshold value, adding 1 to the retry count and re-executing the process of generating the network request by using the current request parameter and according to the request structure set in the configuration file.
4. The method of claim 1, wherein the determining whether the next request parameter exists in the configuration file comprises:
calling back the next request parameter through a callback function;
if the callback function outputs a callback success result, determining that the next request parameter exists in the configuration file;
and if the callback function outputs a callback failure result, determining that the next request parameter does not exist in the configuration file.
5. The method of claim 1, wherein receiving and storing the web page data returned by the website comprises:
converting the encoding format of the webpage data returned by the website into a preset encoding format through a preset text code conversion middleware;
and parsing the webpage data after the encoding format conversion by using the parsing rule in the configuration file, and storing the parsed webpage data.
6. The method of claim 5, wherein parsing the converted webpage data using the parsing rule in the configuration file comprises:
judging whether the converted webpage data contains a preset field or not;
if the webpage data contains the preset field, parsing the webpage data by using the list page parsing rule in the configuration file;
and if the webpage data does not contain the preset field, parsing the webpage data by using the detail page parsing rule in the configuration file.
7. The method of claim 1, wherein the sending the network request after replacing the address to the website and receiving and storing the webpage data returned by the website comprises:
running the network request and rendering the returned webpage data by using a simulated browser through a preset renderer middleware, and storing the rendered webpage data.
8. A data acquisition device, the device comprising:
the acquisition module is used for acquiring a configuration file for a website;
the collecting module is used for taking the first request parameter in the configuration file as a current request parameter and generating a network request by using the current request parameter according to a request structure set in the configuration file; replacing the actual original address in the network request by using the proxy address through a preset proxy pool middleware; sending the network request after replacing the address to the website, receiving webpage data returned by the website and storing the webpage data;
the judging module is used for judging whether the next request parameter exists in the configuration file or not;
and the skip module is used for taking the next request parameter as the current request parameter when it is judged that the next request parameter exists, and returning to execute the process of generating the network request by using the current request parameter and according to the request structure set in the configuration file.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1-7 are implemented when the processor executes the program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
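
By way of non-limiting illustration only, the encoding conversion and the list page / detail page parsing recited in claims 5 and 6 might be sketched in Python as follows. The function names `transcode` and `parse_page`, the keys of the `rules` dictionary, and the use of regular expressions as parsing rules are all assumptions made for this sketch; the disclosure does not specify the form of the parsing rules or of the preset field.

```python
import re


def transcode(raw: bytes, source_encoding: str, target_encoding: str = "utf-8") -> bytes:
    """Text code conversion step: re-encode page bytes into the preset encoding."""
    return raw.decode(source_encoding, errors="replace").encode(target_encoding)


def parse_page(converted: bytes, rules: dict, encoding: str = "utf-8") -> dict:
    """Choose the list page rule or the detail page rule based on a preset field."""
    text = converted.decode(encoding)
    if rules["preset_field"] in text:
        # Preset field present: treat as a list page and extract every match.
        return {"kind": "list", "items": re.findall(rules["list_rule"], text)}
    # Preset field absent: treat as a detail page and extract the first match.
    match = re.search(rules["detail_rule"], text)
    return {"kind": "detail", "items": [match.group(1)] if match else []}


# Hypothetical rules, as they might be read from a configuration file.
rules = {
    "preset_field": 'class="list"',
    "list_rule": r'href="([^"]+)"',
    "detail_rule": r"<h1>(.*?)</h1>",
}

list_page = '<ul class="list"><a href="/a/1">一</a><a href="/a/2">二</a></ul>'.encode("gbk")
detail_page = "<h1>标题</h1>".encode("gbk")

list_result = parse_page(transcode(list_page, "gbk"), rules)
detail_result = parse_page(transcode(detail_page, "gbk"), rules)
```

Converting to a single preset encoding before parsing means the parsing rules only ever have to handle one encoding, whatever each website serves.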

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210352822.0A CN114428635A (en) 2022-04-06 2022-04-06 Data acquisition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114428635A true CN114428635A (en) 2022-05-03

Family

ID=81314253

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210352822.0A Pending CN114428635A (en) 2022-04-06 2022-04-06 Data acquisition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114428635A (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214098A (en) * 2011-06-15 2011-10-12 中山大学 Dynamic webpage data acquisition method based on WebKit browser engine
CN103092817A (en) * 2013-01-18 2013-05-08 五八同城信息技术有限公司 Data collection method and data collection device based on script engine
US20160191522A1 (en) * 2013-08-02 2016-06-30 Uc Mobile Co., Ltd. Method and apparatus for accessing website
US9524352B1 (en) * 2014-07-15 2016-12-20 Google Inc. Sharing data across partner websites
CN108304498A (en) * 2018-01-12 2018-07-20 深圳壹账通智能科技有限公司 Webpage data acquiring method, device, computer equipment and storage medium
CN108334585A (en) * 2018-01-29 2018-07-27 湖北省楚天云有限公司 A kind of spiders method, apparatus and electronic equipment
CN108345642A (en) * 2018-01-12 2018-07-31 深圳壹账通智能科技有限公司 Method, storage medium and the server of website data are crawled using Agent IP
CN109274782A (en) * 2018-08-24 2019-01-25 北京创鑫旅程网络技术有限公司 A kind of method and device acquiring website data
CN109829096A (en) * 2019-03-15 2019-05-31 北京金山数字娱乐科技有限公司 A kind of collecting method, device, electronic equipment and storage medium
CN110245278A (en) * 2018-09-05 2019-09-17 爱信诺征信有限公司 Acquisition method, device, electronic equipment and the storage medium of web data
US20200073906A1 (en) * 2017-06-15 2020-03-05 Beijing Gridsum Technology Co., Ltd. Method, Device, Storage Medium and Processor for Data Acquisition and Query
CN111767450A (en) * 2020-07-27 2020-10-13 深圳快学教育科技有限公司 Browser data acquisition system and method
CN111953766A (en) * 2020-08-07 2020-11-17 福建省天奕网络科技有限公司 Method and system for collecting network data
WO2021022689A1 (en) * 2019-08-05 2021-02-11 苏州闻道网络科技股份有限公司 Information collection method and apparatus
US10943063B1 (en) * 2017-09-25 2021-03-09 Anonyome Labs, Inc. Apparatus and method to automate website user interface navigation
CN112597373A (en) * 2020-12-29 2021-04-02 科技谷(厦门)信息技术有限公司 Data acquisition method based on distributed crawler engine
KR20210036517A (en) * 2019-09-26 2021-04-05 경희대학교 산학협력단 SYSTEM OF SENSORY DATA ACQUISITION AND SYNCHRONIZATION FOR CLOUD CENTRIC IoT AND METHOD PERFORMING THEREOF

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
付顺顺: "基于Scrapy的赌博网站数据采集与分析", 《网络安全技术与应用》 *
刘文杰等: "一种基于网页DOM树的信息采集系统", 《武汉理工大学学报》 *
郅芬香 等: "基于Scrapy框架的数据采集系统设计与实现", 《信息记录材料》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220503
