CN114428635A - Data acquisition method and device, electronic equipment and storage medium
- Publication number
- CN114428635A (application number CN202210352822.0A)
- Authority
- CN
- China
- Prior art keywords
- request
- configuration file
- website
- request parameter
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
- G06F8/71—Version control; Configuration management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
Abstract
The invention discloses a data acquisition method and device, electronic equipment, and a storage medium. The method comprises: acquiring a configuration file for a website; taking the first request parameter in the configuration file as the current request parameter, initiating a network request to the website using the current request parameter, and receiving and storing the webpage data returned by the website; determining whether a next request parameter exists in the configuration file; and, if so, taking the next request parameter as the current request parameter and returning to the step of initiating a network request to the website using the current request parameter. Because website collection is driven by configuration, programs no longer have to be written repeatedly for each site, which saves labor cost. Collection personnel write a configuration file according to the network protocol rules of the target website, and collection then runs entirely over the network protocol interface using that file, which greatly improves data acquisition efficiency.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a data acquisition method, a data acquisition device, electronic equipment and a storage medium.
Background
At present, more and more enterprises mine business intelligence and expand their markets through bidding. According to statistics, tens of thousands of sites across the web publish bidding-related data. Monitoring these sites effectively, so that the latest bidding information can be pushed to enterprises in real time, is key to improving an enterprise's ability to mine business intelligence.
In the related art, each site is analyzed individually and a dedicated acquisition program is written for it. Analyzing a site is time-consuming and tedious, so monitoring the whole web requires a large investment of labor and time, and data acquisition efficiency is very low.
Disclosure of Invention
The present invention provides a data acquisition method, an apparatus, an electronic device and a storage medium for overcoming the above-mentioned deficiencies in the prior art, and the object is achieved by the following technical solutions.
A first aspect of the present invention provides a data acquisition method, including:
acquiring a configuration file for a website;
using the first request parameter in the configuration file as the current request parameter, generating a network request by using the current request parameter and according to the request structure set in the configuration file, and replacing the actual original address in the network request by using a proxy address through a preset proxy pool middleware;
sending a network request after replacing an address to the website, receiving webpage data returned by the website and storing the webpage data;
judging whether the configuration file has a next request parameter or not;
and if so, taking the next request parameter as the current request parameter, and returning to execute the process of generating the network request by using the current request parameter and according to the request structure set in the configuration file.
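The steps of the first aspect can be sketched as a simple loop. This is only an illustration under stated assumptions: the configuration keys `structure` and `requests`, the `send` callable, and the iterator of proxy addresses are hypothetical names that do not come from the patent text.

```python
from typing import Callable, Dict, Iterator, List


def collect(config: Dict, send: Callable[[Dict], str], proxies: Iterator[str]) -> List[str]:
    """Walk the request parameters of one configuration file in order."""
    stored: List[str] = []
    params = config["requests"]      # hypothetical key: ordered request parameters
    structure = config["structure"]  # hypothetical key: the shared request structure
    index = 0                        # the first request parameter is the current one
    while index < len(params):       # is there still a request parameter to process?
        # Fill the request structure with the current request parameter.
        request = {**structure, **params[index]}
        # A proxy address stands in for the real origin address.
        request["proxy"] = next(proxies)
        # Send the request, receive the returned page data, and store it.
        stored.append(send(request))
        index += 1                   # the next request parameter becomes the current one
    return stored
```

Passing `send` in as a parameter keeps the sketch free of any particular HTTP library, so the loop itself can be exercised without network access.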
A second aspect of the present invention provides a data acquisition apparatus, the apparatus comprising:
the obtaining module is used for obtaining a configuration file for a website;
the acquisition module is used for taking the first request parameter in the configuration file as the current request parameter and generating a network request by using the current request parameter according to the request structure set in the configuration file; replacing the actual original address in the network request with a proxy address through a preset proxy pool middleware; and sending the network request with the replaced address to the website, and receiving and storing the webpage data returned by the website;
the judging module is used for judging whether a next request parameter exists in the configuration file;
and the skip module is used for taking the next request parameter as the current request parameter when the next request parameter is judged to exist, and returning to execute the process of generating the network request by using the current request parameter according to the request structure set in the configuration file.
A third aspect of the present invention proposes an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method according to the first aspect when executing the program.
A fourth aspect of the present invention proposes a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method according to the first aspect as described above.
Based on the data acquisition method and device in the first aspect and the second aspect, the invention has at least the following beneficial effects or advantages:
In this scheme, website collection is driven by configuration, which removes the step of repeatedly writing programs and thereby saves labor cost. Collection personnel write a configuration file according to the network protocol rules of the target website; collection is then carried out entirely over the network protocol interface using that configuration file, which greatly improves data acquisition efficiency. With simple environment configuration, the configuration file can be imported on any operating system to perform data acquisition.
In addition, before a network request is sent to the target website, the proxy pool middleware replaces the actual original address in the request, so that each network request can come from a different address and the IP detection mechanism of the target website is bypassed.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart illustrating an embodiment of a data collection method according to an exemplary embodiment of the present invention;
FIG. 2 is a schematic diagram of a network request structure according to the present invention;
FIG. 3 is a diagram illustrating a webpage data parsing configuration according to the present invention;
FIG. 4 is a schematic diagram of a data list obtained by parsing a list page according to the present invention;
FIG. 5 is a schematic diagram of a data acquisition device according to an exemplary embodiment of the present invention;
FIG. 6 is a diagram illustrating a hardware configuration of an electronic device according to an exemplary embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a storage medium according to an exemplary embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if," as used herein, may be interpreted as "at the time of," "when," or "in response to a determination," depending on the context.
At present, existing collection tools on the market include software such as Octoparse (Bazhuayu) and LocoySpider (Huochetou). These tools are browser-based, so their collection performance and efficiency cannot meet the requirements of collecting from a large number of websites, and the operating systems on which they can run are limited.
To solve this technical problem, the application provides a data acquisition method. A configuration file for a website is obtained, and the first request parameter in the configuration file is taken as the current request parameter. A network request is generated using the current request parameter according to the request structure set in the configuration file, and the actual original address in the network request is replaced with a proxy address through a preset proxy pool middleware. After the network request with the replaced address is sent to the website, the webpage data returned by the website is received and stored. It is then determined whether a next request parameter exists in the configuration file; if so, the next request parameter is taken as the current request parameter, and the method returns to the step of generating a network request using the current request parameter.
The technical effects that can be achieved based on the above description are:
In this scheme, website collection is driven by configuration, which removes the step of repeatedly writing programs and thereby saves labor cost. Collection personnel write a configuration file according to the network protocol rules of the target website; collection is then carried out entirely over the network protocol interface using that configuration file, which greatly improves data acquisition efficiency. With simple environment configuration, the configuration file can be imported on any operating system to perform data acquisition.
In addition, before a network request is sent to the target website, the proxy pool middleware replaces the actual original address in the request, so that each network request can come from a different address and the IP detection mechanism of the target website is bypassed.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
The first embodiment is as follows:
FIG. 1 is a flowchart of a data acquisition method according to an exemplary embodiment of the present invention. The data acquisition method includes the following steps:
Step 101: Obtain a configuration file for a website.
The configuration file is a JSON configuration written by collection personnel according to rules of the website to be collected, such as its data request mode, the HTML tag layout of its pages, and the way its page data is rendered. The configuration file is then imported into a specified task queue, from which configuration files are taken out in order for data acquisition.
Accordingly, the configuration file is obtained from the specified task queue. Because configuration files are placed in a queue and processed in order, no configuration file is missed.
It should be noted that the specified task queue holds one configuration file for each website to be collected.
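The queue handling described above might look like the following minimal sketch. The JSON keys (`site`, `requests`) and the two example sites are hypothetical, and an in-process `queue.Queue` stands in for whatever task queue an actual deployment uses.

```python
import json
import queue

# Two hypothetical configuration files, one per website to be collected.
RAW_CONFIGS = [
    '{"site": "site-a.example", "requests": [{"url": "https://site-a.example/list"}]}',
    '{"site": "site-b.example", "requests": [{"url": "https://site-b.example/list"}]}',
]


def fill_task_queue(raw_configs):
    """Parse each JSON configuration and enqueue it in arrival order."""
    tasks = queue.Queue()  # FIFO: files leave in the order they entered
    for raw in raw_configs:
        tasks.put(json.loads(raw))
    return tasks


def drain(tasks):
    """Take configuration files out one by one until the queue is empty."""
    sites = []
    while not tasks.empty():
        sites.append(tasks.get()["site"])
    return sites
```

Because every file passes through the queue exactly once, each configured website is visited and none is skipped.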
Step 102: Take the first request parameter in the configuration file as the current request parameter.
A web server usually stores multiple web pages, so the configuration file must set the request parameters for each page. The pages are linked by jump relationships, and the request parameters in the configuration file are arranged in the corresponding order, so the first request uses the first request parameter in the configuration file.
Step 103: Generate a network request using the current request parameter according to the request structure set in the configuration file, replace the actual original address in the network request with a proxy address through a preset proxy pool middleware, send the network request with the replaced address to the website, and receive and store the webpage data returned by the website.
Before step 103 is executed, it should be noted that, to better realize data collection, the method is built on the Scrapy crawler framework and uses the middleware mechanism provided by that framework to modify data before a network request is initiated or after a response is returned. In the invention, a proxy pool middleware, a renderer middleware, a text encoding conversion middleware, and a request retry middleware are preset.
The proxy pool middleware replaces the actual original IP address with a proxy IP address before each network request, so that each network request comes from a different IP address and the IP detection mechanism of the target website is bypassed.
The renderer middleware handles websites that load content asynchronously. It runs the network request in a simulated browser, renders the returned webpage data into a page, and returns the rendered data, so that whatever is visible on the page can be obtained: what you see is what you can crawl.
The text encoding conversion middleware flexibly converts the encoding format of returned webpage data. Pages do not all use the same text encoding (common formats include UTF-8, Unicode, and GBK), so data returned by some websites often appears garbled. This middleware identifies the encoding format of the returned data and converts it into a uniform encoding format to prevent garbled text.
The request retry middleware handles request timeouts and other abnormal errors. Timeouts occur mainly because proxy IPs are used and the network condition of a proxy IP is unknown. Apart from timeouts, some requests return incorrect data, which causes program exceptions. In both cases the request must be retried until correct data is returned.
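As an illustration of the proxy pool middleware, the sketch below follows the shape of a Scrapy downloader middleware, whose `process_request(request, spider)` hook runs before each request is downloaded. The class name, the proxy addresses, and the random selection policy are assumptions, and the request object is duck-typed so the example runs without Scrapy installed.

```python
import random
from types import SimpleNamespace


class ProxyPoolMiddleware:
    """Assigns a proxy address from a pool to every outgoing request."""

    def __init__(self, proxies, rng=None):
        if not proxies:
            raise ValueError("proxy pool is empty")
        self.proxies = list(proxies)
        self.rng = rng or random.Random()

    def process_request(self, request, spider=None):
        # Scrapy's built-in HttpProxyMiddleware reads the proxy to use
        # from request.meta["proxy"].
        request.meta["proxy"] = self.rng.choice(self.proxies)
        return None  # None lets the framework continue processing the request


if __name__ == "__main__":
    middleware = ProxyPoolMiddleware(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])
    demo_request = SimpleNamespace(meta={})  # stand-in for a framework request object
    middleware.process_request(demo_request)
    print(demo_request.meta["proxy"])
```

Random choice is only one possible policy; a round-robin pool or a pool that evicts slow proxies would fit the same hook.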
Based on the above preset settings of the proxy pool middleware, the renderer middleware, the text transcoding middleware, and the request retry middleware, the use of these middleware is explained below by a specific embodiment.
As shown in FIG. 2, the structure of a network request mainly includes the request header (headers), the request address (url), the request body parameter (body), and the request mode (get/post). The headers field is the network protocol header of the network request and is consistent with the protocol header expected by the target address. The url field has three sub-parameters: value is the URL of the target web page or the address of an ajax interface; re may hold a regular expression that matches a specified value from value; and process may hold processing functions that splice or replace parts of the URL, operating on the value matched by re. Bidding websites generally use either the get or the post request mode. When the request mode is get, body is empty; when it is post, the request must carry the body specified by the target website.
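One way to read this request structure is sketched below. The field names `headers`, `url`, `value`, `re`, `body`, and the get/post modes come from the description of FIG. 2; the `method` key, the function name, and passing the `process` function as a Python callable are assumptions made for illustration.

```python
import re
from typing import Callable, Dict, Optional


def build_request(spec: Dict, process: Optional[Callable[[str], str]] = None) -> Dict:
    """Turn one configured request into a concrete request description."""
    url_spec = spec["url"]
    url = url_spec["value"]                  # target page URL or ajax interface address
    pattern = url_spec.get("re")
    if pattern:
        match = re.search(pattern, url)
        if match:
            url = match.group(0)             # the value matched from `value`
    if process is not None:
        url = process(url)                   # splice or replace parts of the URL
    method = spec.get("method", "get").lower()
    # A get request carries no body; a post request carries the configured body.
    body = spec.get("body") if method == "post" else None
    return {"headers": spec.get("headers", {}), "url": url, "method": method, "body": body}
```

For example, with `value` set to a list page URL and `re` set to a pattern that captures only the site root, a `process` function can append a detail page path to that root.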
Further, after the network request with the replaced address is sent to the website, a preset request retry middleware may detect whether the request has timed out or the returned data is abnormal. If so, and the retry count is determined to be smaller than a preset threshold, the retry count is incremented by 1 and the process of generating the network request using the current request parameter according to the request structure set in the configuration file is re-executed.
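The retry rule can be sketched as follows. The `build`, `send`, and `is_valid` callables and the default threshold of 3 are hypothetical, and raising an error once the threshold is reached is an assumption, since the text does not say what happens at that point.

```python
from typing import Callable, Dict


def send_with_retry(build: Callable[[], Dict],
                    send: Callable[[Dict], str],
                    is_valid: Callable[[str], bool],
                    max_retries: int = 3) -> str:
    """Resend a request until valid data arrives or the retry threshold is hit."""
    retries = 0
    while True:
        request = build()              # regenerate the request from the current parameter
        try:
            data = send(request)
            if is_valid(data):
                return data
        except TimeoutError:
            pass                       # a timeout is handled like abnormal data
        if retries >= max_retries:     # retry count is no longer below the threshold
            raise RuntimeError("request failed after %d retries" % retries)
        retries += 1                   # count the retry, then generate the request again
```

Regenerating the request on every attempt matters when a proxy pool is in use, because each attempt can then leave through a different proxy address.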
In a possible implementation, when receiving and storing the webpage data returned by the website, a preset text encoding conversion middleware may first convert the encoding format of the returned webpage data into a preset encoding format. The converted webpage data is then parsed using the parsing rules in the configuration file, and the parsed webpage data is stored.
The preset encoding format is the uniform encoding format required by the actual parsing needs.
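A minimal sketch of the conversion step is given below. It does not detect the encoding in any rigorous way: it simply tries a list of candidate encodings in order and keeps the first that decodes without error, which is a swapped-in heuristic rather than the patent's stated method. The function name, the candidate list, and the UTF-8 target are assumptions.

```python
from typing import Iterable, Tuple


def to_unified_encoding(raw: bytes,
                        candidates: Iterable[str] = ("utf-8", "gbk"),
                        target: str = "utf-8") -> Tuple[bytes, str]:
    """Re-encode page bytes into the target encoding and report the source encoding."""
    for name in candidates:
        try:
            text = raw.decode(name)    # strict decoding rejects a wrong guess
        except UnicodeDecodeError:
            continue
        return text.encode(target), name
    # Last resort: keep the page, replacing undecodable bytes.
    return raw.decode(target, errors="replace").encode(target), "unknown"
```

Order matters in the candidate list: UTF-8 is tried first because arbitrary GBK bytes rarely form valid UTF-8, whereas many UTF-8 byte sequences would decode under GBK into the wrong characters.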
It should be noted that the web pages of a website (such as a bidding website) generally fall into two categories, list pages and detail pages. The two categories have almost the same structure, except that a list page usually has one additional field.
Based on this, when the converted webpage data is parsed using the parsing rules in the configuration file, it can be determined whether the data contains a preset field. If it does, the webpage data is parsed using the list page parsing rule in the configuration file; if it does not, the webpage data is parsed using the detail page parsing rule in the configuration file.
The preset field is a field, identified by analyzing the actual website, that distinguishes list page data from detail page data. For example, if a list page has one more li field than a detail page, that field can be set as the preset field.
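The rule selection could be sketched like this, assuming the preset field is expressed as a regular expression over the page source. The configuration keys `list_marker`, `list_rule`, and `detail_rule` are hypothetical.

```python
import re


def pick_parse_rule(page: str, config: dict) -> str:
    """Choose the list page rule when the preset field is present, else the detail rule."""
    marker = config["list_marker"]   # hypothetical key: pattern for the preset field
    if re.search(marker, page):
        return config["list_rule"]   # hypothetical key: list page parsing rule
    return config["detail_rule"]     # hypothetical key: detail page parsing rule


EXAMPLE_CONFIG = {
    # Match an <li> tag without also matching tags such as <link>.
    "list_marker": r"<li[\s>]",
    "list_rule": "list",
    "detail_rule": "detail",
}
```

A plain substring test for `<li` would also fire on `<link>` tags, which is why the example marker requires whitespace or `>` after the tag name.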
In a specific implementation, as shown in FIG. 3, list page parsing first uses one of an XPath expression, a CSS selector expression, a regular expression, or the JSON format to extract the data representing the li fields from the webpage data, then applies the user-defined processing functions in process to perform the specified processing and return a data list.
As shown in FIG. 4, each entry in the data list may correspond to multiple item elements, and each item element carries two fields, value and field_name. Here, value is the value parsed and filled into the item using the item_loader mechanism, and field_name is the name of the field that the item element represents. For example, if item1 represents the title, its field_name is set to title; if item2 represents the time, its field_name is set to date.
The item_loader mechanism is a preprocessing step applied to each item element, and the processing performed depends on the field_name of the item element. When field_name is date, the data in time format is located within value; when field_name is title, the blank characters in value are removed. The processing for a specific field can therefore be customized in the item_loader.
It should be noted that, apart from the parsing of the li fields, the rest of the list page parsing process described above also applies to parsing detail pages.
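The per-field preprocessing can be illustrated with a small dispatch table keyed by field_name. This is a plain-Python stand-in for the item_loader mechanism, not Scrapy's own ItemLoader API; the date pattern (year-month-day with hyphens) and the function names are assumptions.

```python
import re

# Assumed date layout; real sites may need additional patterns.
DATE_PATTERN = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")


def find_date(value):
    """Return the first date-like substring in value, or None if there is none."""
    match = DATE_PATTERN.search(value)
    return match.group(0) if match else None


def clean_title(value):
    """Remove every whitespace character from value."""
    return "".join(value.split())


# Processing chosen by field_name; unknown fields pass through unchanged.
PROCESSORS = {"date": find_date, "title": clean_title}


def load_item(elements):
    """Build one item from item elements that carry value and field_name."""
    item = {}
    for element in elements:
        name = element["field_name"]
        processor = PROCESSORS.get(name, lambda raw: raw)
        item[name] = processor(element["value"])
    return item
```

Adding support for a new field then only requires registering one more function in `PROCESSORS`, which mirrors the customization point the description attributes to the item_loader.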
In another possible implementation of step 103, a preset renderer middleware may run the network request in a simulated browser and render the returned webpage data, and the rendered webpage data is then stored.
Step 104: Determine whether a next request parameter exists in the configuration file. If not, return to step 101; if so, execute step 105.
Analysis of actual websites shows that most websites require a jump from the list page to the detail page to complete one data collection, while a small number do not, for example websites that return all data of every article on the current page through an ajax interface. Some websites require several jumps before the article detail page is reached. Therefore, when a configuration file is written for a website to be collected, request parameters for multiple layers of requests are configured in jump order for websites that need several page jumps, and request parameters for only one layer of requests are configured for websites that need no page jump.
In a specific embodiment, the next request parameter can be fetched through a callback function (callback). If the callback function outputs a success result, it is determined that a next request parameter exists in the configuration file; if the callback function outputs a failure result, it is determined that no next request parameter exists in the configuration file.
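The callback check might be sketched as below, assuming the layered request parameters are held in an ordered list. The function names and the `(success, parameter)` return shape are hypothetical choices for illustration.

```python
def make_next_param_callback(params):
    """Create a callback that looks up the request parameter after a given index."""
    def callback(current_index):
        next_index = current_index + 1
        if next_index < len(params):
            return True, params[next_index]   # success: a next request parameter exists
        return False, None                    # failure: the last layer has been reached
    return callback
```

A website that needs no page jump is then simply a list with one entry, for which the callback reports failure after the first request.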
Step 105: the next request parameter is taken as the current request parameter, and the process returns to execute step 103.
It should be noted that the request parameters for each layer request are in accordance with the requirements of the network request structure shown in fig. 2.
This completes the collection flow shown in FIG. 1. In this scheme, website collection is driven by configuration, which removes the step of repeatedly writing programs and thereby saves labor cost. Collection personnel write a configuration file according to the network protocol rules of the target website; collection is then carried out entirely over the network protocol interface using that configuration file, which greatly improves data acquisition efficiency. With simple environment configuration, the configuration file can be imported on any operating system to perform data acquisition.
In addition, before a network request is sent to the target website, the proxy pool middleware replaces the actual original address in the request, so that each network request can come from a different address and the IP detection mechanism of the target website is bypassed.
Corresponding to the embodiment of the data acquisition method, the invention also provides an embodiment of the data acquisition device.
Fig. 5 is a schematic structural diagram of a data acquisition apparatus according to an exemplary embodiment of the present invention, the apparatus being configured to perform the data acquisition method provided in any of the above embodiments, as shown in fig. 5, the data acquisition apparatus includes:
an obtaining module 510, configured to obtain a configuration file for a website;
an acquisition module 520, configured to use the first request parameter in the configuration file as a current request parameter, and generate a network request according to the request structure set in the configuration file by using the current request parameter; replacing the actual original address in the network request by using the proxy address through a preset proxy pool middleware; sending a network request after replacing an address to the website, receiving webpage data returned by the website and storing the webpage data;
a determining module 530, configured to determine whether a next request parameter exists in the configuration file;
and the skip module 540 is configured to, when the next request parameter is determined to exist, take the next request parameter as the current request parameter, and return to execute the process of generating a network request using the current request parameter according to the request structure set in the configuration file.
In a possible implementation manner, the obtaining module 510 is specifically configured to obtain a configuration file from a specified task queue.
In one possible implementation, the apparatus further comprises (not shown in fig. 5):
the re-request module is configured to detect, through a preset request retry middleware, whether a request timeout or abnormal returned data has occurred after the network request with the replaced address is sent to the website; and if so, when the retry count is determined to be smaller than a preset threshold, increment the retry count by 1 and re-execute the process of generating the network request using the current request parameter according to the request structure set in the configuration file.
In a possible implementation manner, the determining module 530 is specifically configured to call back a next request parameter through a callback function; if the callback function outputs a callback success result, determining that the next request parameter exists in the configuration file; and if the callback function outputs a callback failure result, determining that the next request parameter does not exist in the configuration file.
In a possible implementation manner, the acquisition module 520 is specifically configured to convert, by a preset text code conversion middleware, a code format of the webpage data returned by the website into a preset code format in a process of receiving and storing the webpage data returned by the website; and analyzing the webpage data after the code format conversion by using the analysis rule in the configuration file, and storing the analyzed webpage data.
In a possible implementation manner, the acquisition module 520 is specifically configured to determine whether the converted webpage data includes a preset field in a process of analyzing the converted webpage data by using an analysis rule in the configuration file; if the preset fields are contained, analyzing the webpage data by using the list page analysis rule in the configuration file; and if the webpage data does not contain the preset fields, analyzing the webpage data by using the detail page analysis rule in the configuration file.
In a possible implementation manner, the acquisition module 520 is specifically configured to run the network request and render the returned webpage data in a simulated browser through a preset renderer middleware, and store the rendered webpage data.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the invention also provides electronic equipment corresponding to the data acquisition method provided by the embodiment, so as to execute the data acquisition method.
Fig. 6 is a hardware block diagram of an electronic device according to an exemplary embodiment of the present invention, the electronic device including: a communication interface 601, a processor 602, a memory 603, and a bus 604; the communication interface 601, the processor 602 and the memory 603 communicate with each other via a bus 604. The processor 602 may execute the data collection method described above by reading and executing machine executable instructions corresponding to the control logic of the data collection method in the memory 603, and the details of the method are described in the above embodiments, which will not be described herein again.
The memory 603 referred to in this disclosure may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions and data. Specifically, the memory 603 may be a RAM (Random Access Memory), a flash memory, a storage drive (e.g., a hard disk drive), any type of storage disk (e.g., an optical disk, a DVD, etc.), a similar storage medium, or a combination thereof. The communication connection between a network element of the system and at least one other network element is realized through at least one communication interface 601 (which may be wired or wireless), and may use the internet, a wide area network, a local area network, a metropolitan area network, and the like.
The processor 602 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 602 or by instructions in the form of software. The processor 602 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor.
The electronic device provided by this embodiment of the application is based on the same inventive concept as the data acquisition method provided by the embodiments of the application, and has the same beneficial effects as the method it adopts, runs, or implements.
An embodiment of the invention further provides a computer-readable storage medium corresponding to the data acquisition method provided by the foregoing embodiments. Referring to fig. 7, the computer-readable storage medium is shown as an optical disc 30 on which a computer program (i.e., a program product) is stored; when run by a processor, the computer program performs the data acquisition method of any of the foregoing embodiments.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memories (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical and magnetic storage media, which are not described in detail herein.
The computer-readable storage medium provided by the above embodiment of the present application is based on the same inventive concept as the data acquisition method provided by the embodiments of the present application, and has the same beneficial effects as the method adopted, run, or implemented by the application program stored on it.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. The invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. A data acquisition method, the method comprising:
acquiring a configuration file for a website;
taking the first request parameter in the configuration file as the current request parameter, generating a network request using the current request parameter according to the request structure set in the configuration file, and replacing the real source address in the network request with a proxy address through a preset proxy pool middleware;
sending the address-replaced network request to the website, and receiving and storing the webpage data returned by the website;
determining whether a next request parameter exists in the configuration file;
and if so, taking the next request parameter as the current request parameter, and returning to the process of generating a network request using the current request parameter according to the request structure set in the configuration file.
2. The method of claim 1, wherein acquiring the configuration file for the website comprises:
obtaining the configuration file from an assigned task queue.
3. The method of claim 1, wherein after sending the address-replaced network request to the website, the method further comprises:
detecting, through a preset request retry middleware, whether a request timeout or abnormal returned data has occurred;
and if so, when the retry count is determined to be less than a preset threshold, incrementing the retry count by 1 and re-executing the process of generating a network request using the current request parameter according to the request structure set in the configuration file.
4. The method of claim 1, wherein determining whether a next request parameter exists in the configuration file comprises:
calling back the next request parameter through a callback function;
if the callback function outputs a callback success result, determining that the next request parameter exists in the configuration file;
and if the callback function outputs a callback failure result, determining that the next request parameter does not exist in the configuration file.
5. The method of claim 1, wherein receiving and storing the webpage data returned by the website comprises:
converting the encoding format of the webpage data returned by the website into a preset encoding format through a preset text transcoding middleware;
and parsing the converted webpage data using the parsing rule in the configuration file, and storing the parsed webpage data.
6. The method of claim 5, wherein parsing the converted webpage data using the parsing rule in the configuration file comprises:
determining whether the converted webpage data contains a preset field;
if the preset field is contained, parsing the webpage data using the list page parsing rule in the configuration file;
and if the preset field is not contained, parsing the webpage data using the detail page parsing rule in the configuration file.
7. The method of claim 1, wherein sending the address-replaced network request to the website and receiving and storing the webpage data returned by the website comprises:
running the network request and rendering the returned webpage data in a simulated browser through a preset renderer middleware, and storing the rendered webpage data.
8. A data acquisition device, the device comprising:
an obtaining module, configured to obtain a configuration file for a website;
an acquisition module, configured to take the first request parameter in the configuration file as the current request parameter and generate a network request using the current request parameter according to the request structure set in the configuration file; replace the real source address in the network request with a proxy address through a preset proxy pool middleware; and send the address-replaced network request to the website, and receive and store the webpage data returned by the website;
a judging module, configured to determine whether a next request parameter exists in the configuration file;
and a jump module, configured to, when it is determined that the next request parameter exists, take the next request parameter as the current request parameter and return to the process of generating a network request using the current request parameter according to the request structure set in the configuration file.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1-7 are implemented when the processor executes the program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
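For illustration only, the loop recited in claims 1 and 3 can be sketched in Python as follows. This is a hedged sketch, not the claimed implementation: `SiteConfig`, `Request`, `ProxyPoolMiddleware`, `crawl`, the `fetch` and `store` callables, the round-robin proxy choice, and the treatment of an empty response as abnormal data are all assumptions introduced here.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SiteConfig:
    """Hypothetical shape of the per-website configuration file."""
    url_template: str          # stands in for the "request structure"
    request_params: List[str]  # request parameters, visited in order
    max_retries: int = 3       # the preset retry threshold


@dataclass
class Request:
    url: str
    proxy: Optional[str] = None


class ProxyPoolMiddleware:
    """Attaches a proxy address so the website does not see the real source address."""

    def __init__(self, proxies: List[str]) -> None:
        self._cycle = itertools.cycle(proxies)  # assumption: round-robin selection

    def process(self, request: Request) -> Request:
        request.proxy = next(self._cycle)
        return request


def crawl(
    config: SiteConfig,
    proxy_pool: ProxyPoolMiddleware,
    fetch: Callable[[Request], Optional[str]],
    store: Callable[[str, str], None],
) -> List[str]:
    """Fetch one page per request parameter; return the parameters given up on."""
    failed: List[str] = []
    # The for loop plays the role of "does a next request parameter exist?".
    for param in config.request_params:
        retries = 0
        while True:
            # Regenerate the request from the current parameter on every attempt.
            request = proxy_pool.process(Request(config.url_template.format(param=param)))
            try:
                page = fetch(request)
            except TimeoutError:
                page = None  # request timeout
            if page:  # assumption: non-empty data counts as a normal response
                store(param, page)
                break
            if retries >= config.max_retries:
                failed.append(param)  # assumption: skip the parameter once retries run out
                break
            retries += 1
    return failed
```

With `max_retries = 3`, each parameter is attempted at most four times: the initial request plus three retries. The claims do not say what happens once the threshold is reached, so recording the parameter and moving on is a design choice of this sketch.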
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210352822.0A CN114428635A (en) | 2022-04-06 | 2022-04-06 | Data acquisition method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114428635A (en) | 2022-05-03 |
Family
ID=81314253
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210352822.0A Pending CN114428635A (en) | 2022-04-06 | 2022-04-06 | Data acquisition method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114428635A (en) |
Citations (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102214098A (en) * | 2011-06-15 | 2011-10-12 | 中山大学 | Dynamic webpage data acquisition method based on WebKit browser engine |
CN103092817A (en) * | 2013-01-18 | 2013-05-08 | 五八同城信息技术有限公司 | Data collection method and data collection device based on script engine |
US20160191522A1 (en) * | 2013-08-02 | 2016-06-30 | Uc Mobile Co., Ltd. | Method and apparatus for accessing website |
US9524352B1 (en) * | 2014-07-15 | 2016-12-20 | Google Inc. | Sharing data across partner websites |
CN108304498A (en) * | 2018-01-12 | 2018-07-20 | 深圳壹账通智能科技有限公司 | Webpage data acquiring method, device, computer equipment and storage medium |
CN108334585A (en) * | 2018-01-29 | 2018-07-27 | 湖北省楚天云有限公司 | A kind of spiders method, apparatus and electronic equipment |
CN108345642A (en) * | 2018-01-12 | 2018-07-31 | 深圳壹账通智能科技有限公司 | Method, storage medium and the server of website data are crawled using Agent IP |
CN109274782A (en) * | 2018-08-24 | 2019-01-25 | 北京创鑫旅程网络技术有限公司 | A kind of method and device acquiring website data |
CN109829096A (en) * | 2019-03-15 | 2019-05-31 | 北京金山数字娱乐科技有限公司 | A kind of collecting method, device, electronic equipment and storage medium |
CN110245278A (en) * | 2018-09-05 | 2019-09-17 | 爱信诺征信有限公司 | Acquisition method, device, electronic equipment and the storage medium of web data |
US20200073906A1 (en) * | 2017-06-15 | 2020-03-05 | Beijing Gridsum Technology Co., Ltd. | Method, Device, Storage Medium and Processor for Data Acquisition and Query |
CN111767450A (en) * | 2020-07-27 | 2020-10-13 | 深圳快学教育科技有限公司 | Browser data acquisition system and method |
CN111953766A (en) * | 2020-08-07 | 2020-11-17 | 福建省天奕网络科技有限公司 | Method and system for collecting network data |
WO2021022689A1 (en) * | 2019-08-05 | 2021-02-11 | 苏州闻道网络科技股份有限公司 | Information collection method and apparatus |
US10943063B1 (en) * | 2017-09-25 | 2021-03-09 | Anonyome Labs, Inc. | Apparatus and method to automate website user interface navigation |
CN112597373A (en) * | 2020-12-29 | 2021-04-02 | 科技谷(厦门)信息技术有限公司 | Data acquisition method based on distributed crawler engine |
KR20210036517A (en) * | 2019-09-26 | 2021-04-05 | 경희대학교 산학협력단 | SYSTEM OF SENSORY DATA ACQUISITION AND SYNCHRONIZATION FOR CLOUD CENTRIC IoT AND METHOD PERFORMING THEREOF |
Non-Patent Citations (3)
Title |
---|
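Fu Shunshun (付顺顺): "Data collection and analysis of gambling websites based on Scrapy", Network Security Technology & Application (《网络安全技术与应用》) * |
Liu Wenjie et al. (刘文杰等): "An information collection system based on the webpage DOM tree", Journal of Wuhan University of Technology (《武汉理工大学学报》) * |
Zhi Fenxiang et al. (郅芬香 等): "Design and implementation of a data collection system based on the Scrapy framework", Information Recording Materials (《信息记录材料》) * |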
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20220503 |