CN113392301A - Method, device, medium and electronic equipment for crawling data - Google Patents

Publication number
CN113392301A
Authority
CN
China
Prior art keywords
data
dynamic
request
dynamic request
result data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110636165.8A
Other languages
Chinese (zh)
Inventor
杨光
周天星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Precision Communication Media Technology Co ltd
Original Assignee
Beijing Precision Communication Media Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Precision Communication Media Technology Co ltd filed Critical Beijing Precision Communication Media Technology Co ltd
Priority to CN202110636165.8A priority Critical patent/CN113392301A/en
Publication of CN113392301A publication Critical patent/CN113392301A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566URL specific, e.g. using aliases, detecting broken or misspelled links

Abstract

The invention relates to a method, a device, a medium and electronic equipment for crawling data. The method of crawling data comprises: acquiring configuration data related to a crawler task; acquiring first dynamic request information from the configuration data; configuring a first dynamic request template according to the first dynamic request information, wherein the first dynamic request template comprises a first dynamic parameter; acquiring a first data set corresponding to the first dynamic parameter from a data source of the first dynamic parameter; generating a first dynamic request set based on the first data set and the first dynamic request template; and downloading based on the first dynamic request set to acquire first response data, wherein the first response data is to be parsed to generate first result data, and the first result data is a part of the data of the target to be crawled and has a specific format. The scheme of the invention makes parsing of the crawled data more convenient and efficient.

Description

Method, device, medium and electronic equipment for crawling data
Technical Field
The invention relates to the technical field of computer networks, in particular to a method, a device, a medium and electronic equipment for crawling data.
Background
Existing crawler tools include, for example, Octopus, which crawls data from mainstream websites in advance and lets a user select, within the tool, certain types of data (such as titles, prices, and names) from the website of interest. This tool is designed mainly for novice users; its advantage is that it is simple and easy to use, and its drawback is that it can meet only simple requirements. Another crawler tool, Locomotive Collector, is designed for users with some crawler experience. The user can perform some simple configuration in it, such as submitting URLs and expressions, but it likewise cannot meet more complex requirements.
Disclosure of Invention
When a crawler tool is used to crawl data from a target website, a new crawler task must be created. Starting from an initial request URL, the crawler crawls the data of the current page, then the data of the next-level pages linked from hyperlinks on the current page, and so on, level by level. However, within the same target website, pages with different content usually have different data formats. For example, as shown in Fig. 1, a list page of the target website displays a plurality of list entries, and clicking each list entry links to its own comment page (Fig. 1 schematically shows the comment page reached by clicking the comment link of list entry 1). An existing crawler tool downloads the list data and the comment data mixed together in a single table. Because the formats (or structures) of the two kinds of data often differ, parsing the mixed data is difficult and tedious. The present invention addresses this problem by introducing dynamic requests.
According to an aspect of the present invention, there is provided a method of crawling data, comprising: acquiring configuration data related to a crawler task; acquiring first dynamic request information from the configuration data; configuring a first dynamic request template according to the first dynamic request information, wherein the first dynamic request template comprises a first dynamic parameter; acquiring a first data set corresponding to the first dynamic parameter from a data source of the first dynamic parameter; generating a first dynamic request set based on the first data set and the first dynamic request template; and downloading based on the first dynamic request set to obtain first response data.
According to yet another aspect of the invention, there is provided a non-transitory computer readable medium having stored thereon computer executable code which when executed by a processor implements the method.
According to yet another aspect of the present invention, there is provided an electronic device comprising a processor, a memory, and computer executable code stored thereon, which when executed by the processor implements the method.
The conventional crawling approach sets configuration parameters such as a fixed URL (i.e., a fixed request) of the target to be crawled, and then crawls the data at every level under the fixed URL in sequence until no new data can be crawled. The scheme of the invention provides a new crawling mode in which a dynamic request template is configured. When the data set corresponding to the dynamic parameter is acquired from the data source of the dynamic parameter, the corresponding dynamic request set is generated and the corresponding data is crawled. The data of each type of page in the target to be crawled (for example, only comment pages or only detail pages) can therefore be crawled separately and dynamically. Because the data acquired with a dynamic request has a relatively uniform format, subsequent data parsing becomes more convenient and efficient.
Drawings
In the drawings, which are not necessarily drawn to scale, like reference numerals may describe similar components in the various views. The drawings illustrate various embodiments generally by way of example and not by way of limitation, and together with the description and claims serve to explain the disclosed embodiments. The same reference numbers will be used throughout the drawings to refer to the same or like parts, where appropriate. Such embodiments are illustrative, and are not intended to be exhaustive or exclusive embodiments of the present system or method.
Fig. 1 schematically shows a list page and a comment page to which a comment clicking on list entry 1 jumps.
Fig. 2 is a flow chart illustrating a method for crawling data according to an embodiment of the present invention.
Fig. 3 is a flow chart illustrating a method for crawling data according to another embodiment of the present invention.
Fig. 4 schematically illustrates an apparatus for crawling data according to an embodiment of the present invention.
Detailed Description
Various aspects and features of the disclosure are described herein with reference to the drawings. These and other characteristics of the invention will become apparent from the following description of a preferred form of embodiment, given as a non-limiting example, with reference to the accompanying drawings.
The present description may use the phrases "in one embodiment," "in some embodiments," "in another embodiment," "in yet another embodiment," or "in other embodiments," which may each refer to one or more of the same or different embodiments in accordance with the present disclosure. The same or similar reference numbers refer to the same or similar elements or processes. In particular embodiments, a singular reference to an element does not exclude a plurality of such elements.
Definition of terms
Herein, a "fixed request URL" refers to a complete, accessible URL. It is used for the fixed request of the present invention, i.e., a request by which the crawler device/engine/platform/system crawls the URL and the data at each level beneath it.
A "dynamic request URL" refers to a URL containing a dynamic parameter represented by a predetermined symbol. When an embodiment of the present invention is applied, data (or a data set) must be acquired dynamically to replace the dynamic parameter, so that the dynamic request URL becomes a complete, accessible URL. The dynamic request URL is used for the dynamic request of the present invention.
The "result data" is, for example, data focused by the user in the response data, or data of a main portion of the page corresponding to the response data. For example, the result data is comment data, which generally has a data format set uniformly for comments; for another example, the result data is list data, which generally has a data format set for a list in a unified manner.
Examples
As shown in FIG. 2, a method 100 of crawling data according to an embodiment of the present invention is described below, the method comprising the steps of:
Step 101: Obtaining configuration data related to a crawler task of a target to be crawled. The crawler task is a crawling task created by a user for a target to be crawled (for example, a target website accessed from a PC through a browser, or an APP installed on a smart device such as a smartphone). When the crawler task is started, the crawler device allocates corresponding threads and system resources to it. The user may, for example, enter the configuration data through a front-end request configuration interface of the crawler device to which embodiments of the present invention are applied. The configuration data includes the request URL and the various parameters and information required to perform the crawler task, and the configuration data entered by the user may be stored in a file or database. The configuration data related to the crawler task of the target to be crawled may be obtained from a file, a database, or any other data source.
Step 102: Acquiring first dynamic request information from the configuration data. In particular, the first dynamic request information may include a request method and dynamic request URL information associated with the first dynamic request.
Step 103: Configuring a first dynamic request template according to the first dynamic request information, where the first dynamic request template includes a first dynamic parameter. In particular, the dynamic request template may be configured based on the request method and the dynamic request URL. The request template may be an HTTP request template; configuring an HTTP request requires both the request method and the request URL. The dynamic request URL is a URL that contains a dynamic parameter. For example, the request method is GET or POST, and the dynamic request URL is "http://test.com?param={}", where "{}" is the dynamic parameter (in this example, it can be regarded as the first dynamic parameter). Symbols other than "{}" may of course be used to mark the dynamic parameter, and the present invention is not limited in this respect.
Step 104: Acquiring a first data set corresponding to the first dynamic parameter from a data source of the first dynamic parameter. The data source may include, but is not limited to, a database, a file, or an interface. For example, when the data source of the first dynamic parameter is a database, a database query instruction such as an SQL statement may be used to obtain the data set for the dynamic parameter, for example the first data set [123, 456, 789]. The SQL statement may, for example, be entered by the user on a front-end configuration page of the crawler device or form part of the configuration data.
Step 105: Generating a first dynamic request set based on the first data set and the first dynamic request template. Specifically, each item of data in the first data set is substituted in turn for the first dynamic parameter of the dynamic HTTP request template configured in step 103, yielding a first dynamic request set that contains one dynamic request per item.
Step 106: Downloading based on the first dynamic request set generated in step 105 to obtain first response data, where the first response data is to be parsed to generate first result data, and the first result data is a part of the data of the target to be crawled and has a specific format.
The crawling method of the embodiment configures a dynamic request template, and when a data set corresponding to a dynamic parameter is acquired, a corresponding dynamic request set can be generated and corresponding data crawling is performed, so that individual dynamic crawling can be performed only on data of a single type of data page (for example, only on a comment page or only on a detail page) in a target to be crawled, formats of the data acquired by using a single dynamic request are relatively uniform, and convenience and efficiency improvement can be brought to subsequent data analysis work.
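Steps 102 to 105 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the class and function names, the `dynamic_request` configuration key, and the `test.com` URL (taken from the example in step 103) are assumptions made for illustration, and the download of step 106 is left out.

```python
from dataclasses import dataclass
from typing import Iterable, List

PLACEHOLDER = "{}"  # symbol that marks the dynamic parameter (assumed, as in step 103)


@dataclass(frozen=True)
class Request:
    method: str
    url: str


@dataclass(frozen=True)
class DynamicRequestTemplate:
    method: str
    url_template: str  # URL that still contains the placeholder

    def fill(self, value: object) -> Request:
        # Substitute one concrete value for the dynamic parameter.
        url = self.url_template.replace(PLACEHOLDER, str(value), 1)
        return Request(self.method, url)


def build_template(config: dict) -> DynamicRequestTemplate:
    # Steps 102-103: read the dynamic request information and configure the template.
    info = config["dynamic_request"]  # hypothetical key name
    if PLACEHOLDER not in info["url"]:
        raise ValueError("dynamic request URL contains no dynamic parameter")
    return DynamicRequestTemplate(info["method"].upper(), info["url"])


def generate_request_set(template: DynamicRequestTemplate,
                         data_set: Iterable[object]) -> List[Request]:
    # Step 105: one request per item of the data set.
    return [template.fill(value) for value in data_set]


config = {"dynamic_request": {"method": "get", "url": "http://test.com?param={}"}}
template = build_template(config)
request_set = generate_request_set(template, [123, 456, 789])  # data set from step 104
for request in request_set:
    print(request.method, request.url)
```

Keeping the template separate from the data set means the same template can be refilled whenever the data source yields new values, without touching the task configuration.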
As shown in FIG. 3, a method 200 for crawling data in accordance with another embodiment of the present invention is described below, the method comprising the steps of:
Step 201: Obtaining configuration data related to a crawler task. The user may, for example, enter the configuration data through a front-end request configuration interface of the crawler device to which embodiments of the present invention are applied. The configuration data includes information such as a request URL (e.g., a fixed request URL and/or a dynamic request URL) and the various parameters required to perform the crawler task. The configuration data entered by the user may be stored in a file or a database, in which case the crawler device may obtain the configuration data related to the crawler task from the file or the database.
Step 211: Acquiring fixed request information from the configuration data, and encapsulating the fixed request information to generate a fixed request. The fixed request information is request information associated with a request URL to be crawled, and the fixed request is a request containing a fixed request URL. In one embodiment, the fixed request information may include a fixed request URL, a request method, a request header, and request parameters. The fixed request URL is a complete, accessible URL in the target (web page or APP) to be crawled. For example, the fixed request URL may be an initial request URL, i.e., the URL of the initial target page to be crawled. In one embodiment, a fixed HTTP request can be encapsulated from the fixed request URL, request method, request header, and request parameters in the fixed request information.
Step 212: Delivering the fixed request generated in step 211 to the download engine for downloading, and obtaining second response data. The second response data obtained by the download engine can be parsed to generate second result data, which is stored in the database.
Step 202: Determining from the configuration data whether a dynamic request is included. If no dynamic request is included, the process ends.
Step 203: If a dynamic request is included, configuring the dynamic request template. In some embodiments, the dynamic HTTP request template may be configured based on a request method associated with the dynamic request and a dynamic request URL. The dynamic request URL is a URL that contains a dynamic parameter. For example, the request method is GET or POST, and the dynamic request URL is "http://test.com?param={}", where "{}" is the dynamic parameter.
Step 204: Acquiring the data set corresponding to the dynamic parameter from the data source at a predetermined time period, so that the first dynamic request set is generated dynamically. In some embodiments, the predetermined time period may be 30 minutes, 1 hour, 2 hours, 1 day, 2 days, 1 week, or 2 weeks; it may be set empirically by one skilled in the art or based on how often the data set corresponding to the dynamic parameter is updated in the data source. In some embodiments, the data set corresponding to the dynamic parameter, for example [123, 456, 789], may be obtained from the database by executing the SQL statement configured as the source of the dynamic parameter. The database stores the second result data obtained by parsing the second response data, and the second result data includes the data for the dynamic parameter.
Step 205: Reading each item of data in the data set obtained in step 204 in turn, and filling it into the dynamic parameter of the dynamic HTTP request template configured in step 203 to generate a request set. In this example, the generated request set is: http://test.com?param=123, http://test.com?param=456, and http://test.com?param=789.
Step 206: Delivering the dynamic requests generated in step 205 to the download engine for downloading to obtain first response data, and storing in a database the first result data obtained by parsing the first response data, where the data format of the first result data differs from the data format of the second result data. The download engine is a component common to crawler devices/systems/tools and is not described in detail here. Note that the above steps need not be executed in the order described in this embodiment. For example, the fixed request and the dynamic requests may be processed in parallel: as soon as downloading for the fixed request has begun to yield second response data, and the data or data set for the dynamic parameter can be obtained from the storage location of the second result data parsed from that response data, the dynamic requests can be generated and then downloaded, parsed, and stored. In this case, the fixed request and the dynamic requests are processed at the same time.
This embodiment makes it possible to crawl at least two differently structured kinds of data from the same target website or APP. Compared with the prior art, the differently structured data crawled in this embodiment can be stored separately rather than mixed together, which reduces the tediousness of the parsing work and improves parsing efficiency.
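The flow of method 200 can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the download engine is replaced by a dictionary of canned JSON responses, the `fixed_url` and `dynamic_url` configuration keys and the `/list` path are hypothetical, and in-memory lists stand in for the database. What it shows is that list data and comment data end up in separate stores.

```python
import json
from typing import Callable, Dict, List

# Canned responses standing in for a real download engine (hypothetical data).
FAKE_SITE: Dict[str, str] = {
    "http://test.com/list": json.dumps(
        [{"id": 123, "title": "a"}, {"id": 456, "title": "b"}]
    ),
    "http://test.com?param=123": json.dumps({"item": 123, "comments": ["good"]}),
    "http://test.com?param=456": json.dumps({"item": 456, "comments": ["ok", "fine"]}),
}


def download(url: str) -> str:
    # Stub download engine: look the URL up instead of touching the network.
    return FAKE_SITE[url]


def crawl(config: dict, fetch: Callable[[str], str] = download) -> Dict[str, List[dict]]:
    # Separate stores keep the two data structures apart.
    store: Dict[str, List[dict]] = {"list": [], "comment": []}

    # Steps 211-212: fixed request, parsed into second result data.
    second_response = fetch(config["fixed_url"])
    store["list"].extend(json.loads(second_response))

    # Step 202: stop if the configuration contains no dynamic request.
    template = config.get("dynamic_url")
    if template is None:
        return store

    # Steps 204-205: the data set comes from the stored second result data.
    data_set = [row["id"] for row in store["list"]]
    request_set = [template.replace("{}", str(value)) for value in data_set]

    # Step 206: download each dynamic request and store the first result data.
    for url in request_set:
        store["comment"].append(json.loads(fetch(url)))
    return store


result = crawl({
    "fixed_url": "http://test.com/list",
    "dynamic_url": "http://test.com?param={}",
})
print(len(result["list"]), len(result["comment"]))
```

Passing the download function as a parameter keeps the sketch testable; a real crawler would supply its own download engine there.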
In some extended embodiments, one fixed request may be associated with two dynamic requests. For example, suppose that the list information of the target website or APP, the comment information pointed to by some of the hyperlinks in the list information, and the detail information pointed to by other hyperlinks all need to be crawled. Because the list information contains the data needed for the dynamic requests for both the comment information and the detail information, the data of the list information task serves as the input of both the detail information task and the comment information task. In this case, three tasks need to be created: a list information task (crawling with the fixed request), a detail information task (crawling with the dynamic request for the detail information), and a comment information task (crawling with the dynamic request for the comment information).
In other embodiments, a fixed request may be associated with a dynamic request, and that dynamic request may in turn be associated with another dynamic request. This case is not described in detail here to avoid redundancy.
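One way to picture these task relationships is as a dependency graph in which each dynamic task names the task whose result data supplies its dynamic parameter. The sketch below is an assumption about how such tasks might be ordered, not something the patent specifies. The task names, the `source` field, and the `reply` task (added to show a dynamic request fed by another dynamic request) are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Optional

# Each task records which task's result data feeds its dynamic parameter.
tasks: Dict[str, Dict[str, Optional[str]]] = {
    "list": {"kind": "fixed", "source": None},
    "detail": {"kind": "dynamic", "source": "list"},
    "comment": {"kind": "dynamic", "source": "list"},
    "reply": {"kind": "dynamic", "source": "comment"},  # dynamic fed by dynamic
}


def execution_order(task_specs: Dict[str, Dict[str, Optional[str]]]) -> List[str]:
    # Map every task to the set of tasks that must produce data before it runs.
    graph = {
        name: ({spec["source"]} if spec["source"] else set())
        for name, spec in task_specs.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(tasks)
print(order)
```

The ordering only states which task must have produced data first; as the description notes, downstream tasks may still start as soon as part of the upstream result data is available.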
In the following embodiment, the data of the first data set is included in third result data crawled and parsed based on a third dynamic request, and the format of the third result data differs from the format of the first result data and the format of the second result data. In addition to the steps shown in the embodiment of fig. 1, this embodiment further includes the following steps (not shown): acquiring third dynamic request information from the configuration data; configuring a third dynamic request template according to the third dynamic request information, wherein the third dynamic request template comprises a third dynamic parameter; acquiring a third data set corresponding to the third dynamic parameter from a data source of the third dynamic parameter; generating a third dynamic request set based on the third data set and the third dynamic request template; downloading based on the third dynamic request set to obtain third response data; and parsing the third result data from the third response data and storing the third result data to the data source of the first dynamic parameter.
In particular, the third dynamic request information may include a third request method and a third dynamic request URL associated with the third dynamic request, and the third dynamic request template is configured based on the third request method and the third dynamic request URL. The data source of the third dynamic parameter and the data source of the first dynamic parameter may be the same data source or different data sources.
As shown in fig. 4, a device for crawling data (also referred to herein as a crawler device) 400 is described. The crawler apparatus 400 includes: a configuration data acquisition module 401 configured to acquire configuration data related to a crawler task; a first dynamic request information obtaining module 402 configured to obtain first dynamic request information from the configuration data; a template configuration module 404 configured to configure a first dynamic request template according to the first dynamic request information, the first dynamic request template including a first dynamic parameter; a data set obtaining module 406, configured to obtain a first data set corresponding to the first dynamic parameter from a data source of the first dynamic parameter; a request generation module 408 configured to generate a set of dynamic requests based on the first set of data and the first dynamic request template; a response data obtaining module 410 configured to obtain first response data by downloading based on the first dynamic request set.
In some embodiments, the data of the first data set is included in second result data crawled and parsed based on a second fixed request, the second result data being stored in the data source, wherein the second fixed request is a request including a fixed request URL and the first result data and the second result data are in different formats from each other.
In some embodiments, crawler device 400 further comprises (not shown): a second fixed request generation module configured to acquire second fixed request information from the configuration data and encapsulate the second fixed request information to generate a second fixed request; a second response data obtaining module configured to download based on the second fixed request and obtain second response data; and a second result data storage module configured to parse the second result data from the second response data and store the second result data to the data source.
In some embodiments, the data of the first data set is included in third result data crawled and parsed based on a third dynamic request, the third result data being stored in the data source, wherein the third dynamic request is a request including a third dynamic request URL and wherein the first result data and the third result data are in different formats from each other.
In some embodiments, crawler device 400 further comprises (not shown): a third dynamic request information obtaining module configured to obtain third dynamic request information from the configuration data; a third dynamic request template configuration module configured to configure a third dynamic request template according to the third dynamic request information, where the third dynamic request information includes a third request method associated with a third dynamic request and a third dynamic request URL, and the third dynamic request URL includes a third dynamic parameter; a third data set obtaining module configured to obtain a third data set corresponding to the third dynamic parameter from the data source of the third dynamic parameter; a third dynamic request generation module configured to generate a third dynamic request set based on the third data set and the third dynamic request template; a third response data obtaining module configured to download based on the third dynamic request set and obtain third response data; and a third result data storage module configured to parse the third result data from the third response data and store the third result data to the data source.
In some embodiments, the data set acquisition module 406 may be further configured to acquire the first data set using a database query instruction.
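As an illustration of acquiring the first data set with a database query instruction, the following sketch uses Python's built-in `sqlite3` module with an in-memory database. The table name `list_result`, its columns, and the SQL statement are assumptions for the example; the patent does not prescribe a schema or a database engine.

```python
import sqlite3
from typing import List

# In-memory database standing in for the data source of the dynamic parameter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE list_result (item_id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO list_result VALUES (?, ?)",
    [(123, "a"), (456, "b"), (789, "c")],
)


def fetch_data_set(connection: sqlite3.Connection, sql: str) -> List[int]:
    # Run the user-configured query and keep the first column of each row.
    return [row[0] for row in connection.execute(sql)]


# Hypothetical SQL statement that a user might enter as part of the configuration data.
DATA_SOURCE_SQL = "SELECT item_id FROM list_result ORDER BY item_id"
first_data_set = fetch_data_set(conn, DATA_SOURCE_SQL)
print(first_data_set)  # [123, 456, 789]
```

Because the query is plain configuration, the same module can serve any task whose dynamic parameter values live in a table.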
In still other embodiments, the first dynamic request information may include a first request method associated with the first dynamic request and a first dynamic request URL, and the template configuration module 404 may be further configured to configure the first dynamic request template based on the first request method and the first dynamic request URL.
For the description related to the third dynamic request in the present embodiment, reference may be made to the description related to the first dynamic request. In the crawler apparatus to which the embodiment of the present invention is applied, after downloading is performed based on the third dynamic request and third result data including the first data set corresponding to the dynamic parameter is obtained, configuration may be performed to generate the first dynamic request set, so that downloading based on the first dynamic request may be performed synchronously. That is to say, the acquisition work of the first data set does not need to be executed until all the third result data are crawled, and therefore the embodiment of the invention can improve the crawling efficiency.
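The efficiency point made above, that the first dynamic requests can be generated before all upstream result data has been crawled, can be sketched with Python generators. This is an illustrative assumption about one way to pipeline the two stages, not the patent's mechanism; the page contents, the `events` log, and the function names are hypothetical.

```python
from typing import Iterable, Iterator, List

events: List[str] = []  # records the order in which things happen


def upstream_results(pages: Iterable[List[int]]) -> Iterator[int]:
    # Yield dynamic-parameter values as soon as each upstream page is parsed.
    for number, ids in enumerate(pages, start=1):
        events.append(f"parsed upstream page {number}")
        yield from ids


def dynamic_requests(template: str, values: Iterable[int]) -> Iterator[str]:
    # Generate each dynamic request lazily, one value at a time.
    for value in values:
        url = template.replace("{}", str(value))
        events.append(f"generated {url}")
        yield url


pages = [[123], [456, 789]]  # stand-in for upstream result data arriving in batches
urls = list(dynamic_requests("http://test.com?param={}", upstream_results(pages)))
for event in events:
    print(event)
```

In the printed log, the request for 123 is generated before the second upstream page is parsed, which is the interleaving the paragraph above describes.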
It is noted that the ordinal numbers before the terms dynamic parameters, dynamic requests, etc., such as "first", "third", etc., are used herein for identification purposes only and do not change the nature thereof. Thus, the description of a first dynamic parameter, a first dynamic request, etc. may be similarly extended to a third dynamic parameter, a third dynamic request, etc. and vice versa, where appropriate. The same is true for the sequence number of the result data.
Where embodiments of the apparatus of the invention are not described in detail, reference is made to corresponding method embodiments.
In an embodiment of the invention, there is also provided a non-transitory computer-readable medium having stored thereon computer-executable code that, when executed by a processor, is capable of implementing any of the method embodiments described above. The computer readable medium may include magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer readable medium or computer readable storage device. For example, as disclosed, the computer-readable medium may be a storage device or memory module having stored thereon computer instructions. In some embodiments, the computer readable medium may be a disk or flash drive having computer instructions stored thereon.
An embodiment of the present invention further provides an electronic device, which includes a processor, a memory, and a computer executable code stored thereon. Any of the above-described method embodiments and variations thereof are implemented when the processor executes the computer-executable code. The electronic device is, for example, a server, a cloud server, a desktop computer, or the like, and may be applied to the method of crawling data according to the embodiment of the present invention.
Various operations or functions are described herein that may be implemented as or defined as software code or instructions. Such content may be directly executable ("object" or "executable" form), source code, or difference code ("delta" or "patch" code). The software code or instructions may be stored in a computer-readable storage medium and, when executed, may cause a machine to perform the functions or operations described. Such a medium includes any mechanism for storing information in a form accessible by a machine (e.g., a computing device, an electronic system, etc.), such as recordable or non-recordable media (e.g., Read Only Memory (ROM), Random Access Memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, etc.). The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software element executed by a processor, or in a combination of the two.
The term "comprising," which is synonymous with "including," "containing," or "characterized by," is non-exclusive or open-ended and does not exclude additional, unrecited elements or method steps. "Comprising" is a term of art used in claim language that means that the named element is essential, but that other elements can be added and still form a structure within the scope of the claims.
As used herein, the term "and/or," when used in the context of a list of entities, refers to the entities appearing alone or in combination. Thus, for example, the phrases "A, B, C, and/or D" include A, B, C and D, respectively, but also include any and all combinations and subcombinations of A, B, C and D.
The above embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and the scope of the present invention is defined by the claims. Various modifications and equivalents may be made by those skilled in the art within the spirit and scope of the present invention, and such modifications and equivalents should also be considered as falling within the scope of the present invention.

Claims (9)

1. A method of crawling data, comprising:
acquiring configuration data related to a crawler task of a target to be crawled;
acquiring first dynamic request information from the configuration data;
configuring a first dynamic request template according to the first dynamic request information, wherein the first dynamic request information comprises a first request method related to a first dynamic request and a first dynamic request URL (uniform resource locator), and the first dynamic request URL comprises a first dynamic parameter;
acquiring a first data set corresponding to the first dynamic parameter from a data source of the first dynamic parameter;
generating a first dynamic request set based on the first data set and the first dynamic request template;
and downloading based on the first dynamic request set to obtain first response data, wherein the first response data is to be parsed to generate first result data, and the first result data is partial data of the target to be crawled and has a specific format.
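For illustration only, and without limiting the claim, the template-configuration and request-generation steps of claim 1 might be sketched as follows. The configuration keys (`dynamic_request`, `method`, `url`, `param`), the class and function names, and the example URL are all hypothetical; the claim does not prescribe any particular data layout. Downloading and parsing are omitted.

```python
from dataclasses import dataclass
from typing import Iterable, List
from urllib.parse import quote


@dataclass(frozen=True)
class DynamicRequestTemplate:
    """Request method plus a URL that contains one named dynamic parameter."""
    method: str        # request method, e.g. "GET"
    url_template: str  # URL with a placeholder, e.g. ".../items/{item_id}"
    param_name: str    # name of the dynamic parameter in the URL


@dataclass(frozen=True)
class Request:
    method: str
    url: str


def configure_template(config: dict) -> DynamicRequestTemplate:
    """Configure the template from the dynamic request information."""
    info = config["dynamic_request"]  # hypothetical configuration key
    return DynamicRequestTemplate(info["method"], info["url"], info["param"])


def generate_request_set(
    template: DynamicRequestTemplate, data_set: Iterable[str]
) -> List[Request]:
    """Create one request per value in the data set."""
    return [
        Request(
            template.method,
            # Percent-encode each value before substituting it into the URL.
            template.url_template.format(
                **{template.param_name: quote(value, safe="")}
            ),
        )
        for value in data_set
    ]


config = {
    "dynamic_request": {
        "method": "GET",
        "url": "https://example.com/items/{item_id}",
        "param": "item_id",
    }
}
template = configure_template(config)
request_set = generate_request_set(template, ["101", "102", "a b"])
```

Each element of `request_set` would then be handed to a downloader, and the resulting response data parsed into result data of the configured format.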
2. The method of claim 1, wherein the first dynamic request set is dynamically generated by obtaining a first data set corresponding to the first dynamic parameter from a data source of the first dynamic parameter at a predetermined time period.
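As a non-limiting sketch of claim 2, the data source may be re-read once per period so that the request set reflects newly stored data. The function name, the `{value}` placeholder, the example URL, and the injected `sleep` callable are assumptions made for this example only.

```python
import time
from typing import Callable, List


def regenerate_request_sets(
    fetch_data_set: Callable[[], List[str]],
    url_template: str,
    period_seconds: float,
    cycles: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[str]]:
    """Re-read the data source once per period and rebuild the request URLs."""
    generated = []
    for cycle in range(cycles):
        data_set = fetch_data_set()
        generated.append([url_template.format(value=v) for v in data_set])
        if cycle < cycles - 1:
            sleep(period_seconds)  # wait for the next predetermined period
    return generated


# A stand-in data source that gains one entry between successive reads.
source = ["a"]


def fetch() -> List[str]:
    snapshot = list(source)
    source.append(chr(ord(source[-1]) + 1))  # simulate newly stored data
    return snapshot


batches = regenerate_request_sets(
    fetch, "https://example.com/p/{value}", 0.0, 2, sleep=lambda _: None
)
```

In this sketch the second batch contains one more URL than the first, because the data source grew between the two reads.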
3. The method of claim 1, wherein data of the first data set is included in second result data crawled and parsed based on a second fixed request, the second result data being stored in the data source, wherein the second fixed request is a request including a fixed request URL, and formats of the first result data and the second result data are different from each other.
4. The method of claim 3, further comprising:
acquiring second fixed request information from the configuration data, and generating a second fixed request by encapsulating the second fixed request information;
downloading based on the second fixed request to obtain second response data;
and parsing the second result data from the second response data and storing the second result data in the data source.
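For illustration only, the relationship between the fixed request of claims 3 and 4 and the dynamic requests of claim 1 might be sketched as follows. The JSON response layout, the in-memory list used as the data source, the stub downloader, and all names and URLs are assumptions for this example; a real downloader would issue a network request.

```python
import json
from typing import Callable, Dict, List


def crawl_fixed_request(
    fixed_info: Dict[str, str],
    download: Callable[[Dict[str, str]], str],
    data_source: List[str],
) -> None:
    """Encapsulate a fixed request, download it, parse it, and store the result."""
    # Encapsulate the fixed request information into a request object.
    request = {"method": fixed_info["method"], "url": fixed_info["url"]}
    response_body = download(request)
    # Parse the second result data (here: item identifiers) from the response.
    second_result = [str(item["id"]) for item in json.loads(response_body)["items"]]
    # Store the second result data in the data source.
    data_source.extend(second_result)


def fake_download(request: Dict[str, str]) -> str:
    """Stub downloader returning a canned list page (no network access)."""
    return json.dumps({"items": [{"id": 7}, {"id": 9}]})


data_source: List[str] = []
crawl_fixed_request(
    {"method": "GET", "url": "https://example.com/list"}, fake_download, data_source
)

# The stored values later serve as the first data set for the dynamic requests.
dynamic_urls = [f"https://example.com/detail/{value}" for value in data_source]
```

The list page and the detail pages yield result data of different formats, which is why the two kinds of request are configured separately.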
5. The method of claim 1, wherein data of the first data set is included in third result data crawled and parsed based on a third dynamic request, the third result data being stored in the data source, wherein the third dynamic request is a request including a third dynamic request URL and wherein formats of the first result data and the third result data are different from each other.
6. The method of claim 1, wherein the data source is a database, and the obtaining a first data set corresponding to the first dynamic parameter from the data source of the first dynamic parameter comprises:
obtaining the first data set using a database query instruction.
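As a non-limiting sketch of claim 6, the first data set may be read from a database with a query instruction. SQLite, the table name `second_result`, and the column name `item_id` are assumptions chosen so that the example is self-contained; the claim does not name a database product or schema.

```python
import sqlite3
from typing import List

# In-memory database standing in for the data source of the dynamic parameter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE second_result (item_id TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO second_result (item_id) VALUES (?)",
    [("101",), ("102",)],
)
conn.commit()


def fetch_first_data_set(connection: sqlite3.Connection) -> List[str]:
    """Obtain the first data set with a database query instruction."""
    rows = connection.execute(
        "SELECT item_id FROM second_result ORDER BY item_id"
    ).fetchall()
    return [row[0] for row in rows]


first_data_set = fetch_first_data_set(conn)
```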
7. An apparatus for crawling data, comprising:
a configuration data acquisition module configured to acquire configuration data related to a crawler task of a target to be crawled;
a first dynamic request information obtaining module configured to obtain first dynamic request information from the configuration data;
a template configuration module configured to configure a first dynamic request template according to the first dynamic request information, the first dynamic request template including a first dynamic parameter;
a data set acquisition module configured to acquire a first data set corresponding to the first dynamic parameter;
a request generation module configured to generate a first dynamic request set based on the first data set and the first dynamic request template;
a response data obtaining module configured to download based on the first dynamic request set to obtain first response data, wherein the first response data is to be parsed to generate first result data, and the first result data is partial data of the target to be crawled and has a specific format.
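For illustration only, the modules of claim 7 might be arranged as methods of a single class, each method standing in for one module. The class name, the method names, the `{param}` placeholder, the configuration keys, and the injected `download` callable are hypothetical and are not taken from the embodiments.

```python
from typing import Callable, Dict, List, Tuple


class CrawlerApparatus:
    """Each method stands in for one module recited in the apparatus claim."""

    def __init__(
        self,
        config: Dict[str, Dict[str, str]],
        data_source: List[str],
        download: Callable[[str, str], str],
    ) -> None:
        self.config = config            # configuration data of the crawler task
        self.data_source = data_source  # holds values of the dynamic parameter
        self.download = download        # injected downloader (stubbed below)

    def acquire_dynamic_request_info(self) -> Dict[str, str]:
        return self.config["dynamic_request"]

    def configure_template(self, info: Dict[str, str]) -> Tuple[str, str]:
        return info["method"], info["url"]

    def acquire_data_set(self) -> List[str]:
        return list(self.data_source)

    def generate_requests(
        self, template: Tuple[str, str], data_set: List[str]
    ) -> List[Tuple[str, str]]:
        method, url = template
        return [(method, url.format(param=value)) for value in data_set]

    def obtain_responses(self, requests: List[Tuple[str, str]]) -> List[str]:
        return [self.download(method, url) for method, url in requests]

    def run(self) -> List[str]:
        template = self.configure_template(self.acquire_dynamic_request_info())
        requests = self.generate_requests(template, self.acquire_data_set())
        return self.obtain_responses(requests)


apparatus = CrawlerApparatus(
    config={"dynamic_request": {"method": "GET", "url": "https://example.com/{param}"}},
    data_source=["x", "y"],
    download=lambda method, url: f"{method} {url}",  # stub: echoes the request
)
responses = apparatus.run()
```

Passing the downloader in as a callable keeps the sketch free of network access; a parsing step would then convert each response into result data of the configured format.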
8. A non-transitory computer-readable medium having stored thereon computer-executable code, wherein the computer-executable code, when executed by a processor, implements the method of any of claims 1-6.
9. An electronic device comprising a processor, a memory, and computer executable code stored thereon, wherein the processor, when executing the computer executable code, implements the method of any of claims 1-6.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110636165.8A CN113392301A (en) 2021-06-08 2021-06-08 Method, device, medium and electronic equipment for crawling data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110636165.8A CN113392301A (en) 2021-06-08 2021-06-08 Method, device, medium and electronic equipment for crawling data

Publications (1)

Publication Number Publication Date
CN113392301A true CN113392301A (en) 2021-09-14

Family

ID=77618732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110636165.8A Pending CN113392301A (en) 2021-06-08 2021-06-08 Method, device, medium and electronic equipment for crawling data

Country Status (1)

Country Link
CN (1) CN113392301A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452463A (en) * 2007-12-05 2009-06-10 浙江大学 Method and apparatus for directionally grabbing page resource
CN109101600A (en) * 2018-08-01 2018-12-28 沈文策 The crawling method and device of dynamic data in a kind of webpage
CN110020060A (en) * 2018-07-18 2019-07-16 平安科技(深圳)有限公司 Web data crawling method, device and storage medium
CN110188259A (en) * 2019-05-27 2019-08-30 厦门商集网络科技有限责任公司 A kind of data grab method and device of configurableization
CN110765334A (en) * 2019-09-10 2020-02-07 北京字节跳动网络技术有限公司 Data capture method, system, medium and electronic device
CN112287201A (en) * 2020-12-31 2021-01-29 北京精准沟通传媒科技股份有限公司 Method, device, medium and electronic equipment for removing duplicate of crawler request

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"详解网络爬虫与web安全", 《计算机与网络》, vol. 38, no. 12, pages 38 - 39 *
AVEKSHA KAPOOR et al.: "Application of Bloom Filter for Duplicate URL Detection in a Web Crawler", 《2016 IEEE 2ND INTERNATIONAL CONFERENCE ON COLLABORATION AND INTERNET COMPUTING》, pages 1 - 2 *

Similar Documents

Publication Publication Date Title
CN110968325B (en) Applet conversion method and device
US10839038B2 (en) Generating configuration information for obtaining web resources
CN106909361B (en) Web development method and device based on template engine
CN109766099B (en) Front-end source code compiling method and device, storage medium and computer equipment
CN106126693B (en) Method and device for sending related data of webpage
KR20140116874A (en) Managing script file dependencies and load times
CN102902785A (en) Webpage information acquisition system and method
CN111026634A (en) Interface automation test system, method, device and storage medium
CN112769706B (en) Componentized routing method and system
CN113138757B (en) Front-end code automatic generation method, device, server, system and medium
CN112416964A (en) Data processing method, device and system, computer equipment and computer readable storage medium
CN111124544A (en) Interface display method and device, electronic equipment and storage medium
CN114238811A (en) Page loading method, page request response method, device, equipment and medium
CN115686606A (en) Method, device, system and medium for displaying item dependency tree
CN104954363A (en) Method and device for generating interface document
CN111310005A (en) Network request processing method and device, server and storage medium
CN110989992A (en) Resource processing method and device
CN111046316B (en) Application on-shelf state monitoring method, intelligent terminal and storage medium
CN113392301A (en) Method, device, medium and electronic equipment for crawling data
US8321535B2 (en) Web services integration systems and methods
CN112433752B (en) Page analysis method, device, medium and electronic equipment
CN110851746B (en) Crawler seed generation method and device
CN113190735A (en) Method, device, medium and electronic equipment for crawling data
CN113190736A (en) Data processing method, crawler device, medium, and electronic device
CN108845803B (en) Method, device and equipment for updating list view and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20240419