CN113934913A - Data capture method and device, storage medium and electronic equipment

Publication number
CN113934913A
Authority
CN
China
Prior art keywords
target
data
acquiring
address
page information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111354326.0A
Other languages
Chinese (zh)
Inventor
赵智博
潘仕江
陈祖德
唐杰成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yancheng Tianyanchawei Technology Co ltd
Original Assignee
Yancheng Jindi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yancheng Jindi Technology Co Ltd
Priority to CN202111354326.0A
Publication of CN113934913A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44505 Configuring for program initiating, e.g. using registry, configuration files
    • G06F9/4451 User profiles; Roaming

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The disclosure relates to a data capture method, a data capture device, a storage medium and electronic equipment that let a user configure a data capture task as needed and capture the required data without custom development for each website, which takes little time and is highly flexible. The method comprises: acquiring a target data capture task configured by a user, and obtaining from it a target capture address and a target parsing template corresponding to the target capture address; accessing the target webpage corresponding to the target capture address, and acquiring page information of the target webpage; and parsing the page information of the target webpage according to the target parsing template to obtain target data, and storing the target data.

Description

Data capture method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a data capture method and apparatus, a storage medium, and an electronic device.
Background
With the rapid development of networks, the Internet has become the largest information carrier in the world, and countless new data pour into it every day. A major challenge today is how to extract and use effective information from such massive data. The first step in data processing is acquiring the data from the Internet, that is, capturing it.
At present there are many ways to capture data, including open-source code and commercial tools that provide the service directly. However, these approaches are basically implemented through custom development for each target website according to that website's characteristics, which limits them: once the capture scope is enlarged or the target website changes, the existing code has to be modified and redeveloped. This takes a great deal of time, lacks flexibility, and is limited by the skills of the implementers.
Disclosure of Invention
The disclosure aims to provide a data capture method, a data capture device, a storage medium and electronic equipment that solve the long time consumption and poor flexibility of existing data capture methods.
In order to achieve the above object, a first aspect of the present disclosure provides a data capture method, including:
acquiring a target data grabbing task configured by a user, and acquiring a target grabbing address and a target analysis template corresponding to the target grabbing address from the target data grabbing task;
accessing a target webpage corresponding to the target crawling address, and acquiring page information of the target webpage;
and analyzing the page information of the target webpage according to the target analysis template to obtain target data, and storing the target data.
Optionally, the target data capture task is configured as follows:
showing items to be configured to the user, wherein the items to be configured comprise at least a first item to be configured for the target capture address and a second item to be configured for the target parsing template;
in response to the user's operation on the items to be configured, acquiring configuration information set by the user; and generating the target data capture task according to the configuration information.
Optionally, the items to be configured further include a third item to be configured for a data storage manner, and the storing the target data includes:
analyzing the target data capturing task to obtain the data storage mode;
and storing the target data according to the data storage mode.
Optionally, the items to be configured further include a fourth item to be configured for an input word, the target parsing template includes a parsing item corresponding to the input word, and parsing the page information of the target webpage according to the target parsing template includes:
parsing the page information of the target webpage to obtain a next-level capture address corresponding to the input word;
the method further comprises:
taking the next-level capture address as a new target capture address.
Optionally, the items to be configured further include a fifth item to be configured for a data interaction manner, and the acquiring page information of the target webpage includes:
and acquiring the page information from the server of the target webpage based on the data interaction mode.
Optionally, the acquiring a target data capture task configured by a user includes:
under the condition that a plurality of target data grabbing tasks exist, the target data grabbing task with the highest priority is obtained based on the priority relation among the target data grabbing tasks.
Optionally, the items to be configured further include a sixth item to be configured for a preset period, and the acquiring a target data capture task configured by a user includes:
and acquiring a target data capturing task configured by a user based on the preset period.
Optionally, the items to be configured further include a seventh item to be configured for a data acquisition mode, where the data acquisition mode is a full acquisition mode or an incremental acquisition mode, and the accessing the target webpage corresponding to the target crawling address and acquiring the page information of the target webpage includes:
under the condition of accessing a target webpage corresponding to the target crawling address for the first time, acquiring page information of the target webpage according to the full-scale acquisition mode;
and under the condition that a target webpage corresponding to the target crawling address is visited for the nth time, acquiring page information of the target webpage according to the increment acquisition mode, wherein n is an integer larger than 1.
Optionally, the items to be configured further include an eighth item to be configured for a concurrent number, and the acquiring page information of the target webpage includes:
and calling a plurality of threads to acquire the page information of the target webpage, wherein the number of the threads is not more than the concurrence number.
Optionally, the method further comprises:
and displaying the alarm information to the user under the condition that the alarm information is detected in the process of acquiring the page information of the target webpage and/or under the condition that the alarm information is detected in the process of analyzing the page information of the target webpage according to the target analysis template.
A second aspect of the present disclosure also provides a data capture apparatus, the apparatus comprising:
the acquisition module is used for acquiring a target data grabbing task configured by a user and acquiring a target grabbing address and a target analysis template corresponding to the target grabbing address from the target data grabbing task;
the access module is used for accessing a target webpage corresponding to the target crawling address and acquiring page information of the target webpage;
and the analysis module is used for analyzing the page information of the target webpage according to the target analysis template to obtain target data and storing the target data.
The third aspect of the present disclosure also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of any one of the above first aspects.
A fourth aspect of the present disclosure also provides an electronic device, including:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method of any of the first aspects above.
Through the technical scheme, the following technical effects can be at least achieved:
First, a target data capture task configured by a user is acquired, and a target capture address and the corresponding target parsing template are obtained from it. The target webpage corresponding to the target capture address is then accessed and its page information acquired. Finally, the page information is parsed according to the target parsing template to obtain target data, which is stored. In this way, a user can configure data capture tasks as needed and capture the required data without custom development for each website. The approach takes little time and is highly flexible, and a user without programming skills can capture data through simple configuration.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
Fig. 1 is a schematic flow chart of a data capture method provided in an embodiment of the present disclosure;
Fig. 2 is a schematic flow chart of another data capture method provided in an embodiment of the present disclosure;
Fig. 3 is a block diagram of a data capture apparatus provided in an embodiment of the present disclosure;
Fig. 4 is a block diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or in parallel. Moreover, method embodiments may include additional steps and/or omit some of the illustrated steps. The scope of the present disclosure is not limited in this respect. The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" means "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are used only to distinguish different devices, modules or units, and do not limit the order or interdependence of the functions performed by those devices, modules or units. It should also be noted that the modifiers "a", "an", and "the" in this disclosure are illustrative rather than limiting, and those skilled in the art will understand them to mean "one or more" unless the context clearly dictates otherwise.
At present, data capture can be implemented in many ways. For example, custom development can be carried out according to the characteristics of a website; but if the capture scope is enlarged or the target website changes, the existing code must be modified and redeveloped, which takes a great deal of time, lacks flexibility, and is limited by the skills of the implementers. Alternatively, commercial tools that provide the service directly can be used, but such tools can generally be configured only with website information and login information and capture data from a single website at a time; after one capture finishes, the next website has to be configured again, which is inconvenient and fails to meet users' needs.
In view of the above, the present disclosure provides a data capturing method, an apparatus, a storage medium and an electronic device to solve the above problems.
The following provides a detailed description of embodiments of the present disclosure.
The embodiment of the disclosure provides a data capturing method, which comprises the following steps:
S101, acquiring a target data capture task configured by a user, and obtaining a target capture address and a target parsing template corresponding to the target capture address from the target data capture task.
S102, accessing a target webpage corresponding to the target crawling address, and acquiring page information of the target webpage.
S103, analyzing the page information of the target webpage according to the target analysis template to obtain target data, and storing the target data.
According to the method, a target data capture task configured by a user is acquired, and a target capture address and the corresponding target parsing template are obtained from it. The target webpage corresponding to the target capture address is then accessed and its page information acquired. Finally, the page information is parsed according to the target parsing template to obtain target data, which is stored. In this way, a user can configure data capture tasks as needed and capture the required data without custom development for each website. The approach takes little time and is highly flexible, and a user without programming skills can capture data through simple configuration.
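As an illustrative, non-limiting sketch, steps S101 to S103 might look as follows in Python. The task fields `address` and `template`, the use of regular expressions as the template format, and the in-memory `store` list are all assumptions made for this example and are not specified by the disclosure. The page is obtained through an injectable `fetch` callable (here the hypothetical `fake_fetch`) so that the sketch runs without network access.

```python
import re
from typing import Callable, Dict, List

# Hypothetical task layout: the disclosure only requires that a task carry
# a target capture address and a matching parsing template.
Task = Dict[str, object]


def run_task(task: Task, fetch: Callable[[str], str], store: List[dict]) -> dict:
    """Run S101-S103 for one task and return the structured record."""
    # S101: read the target capture address and parsing template from the task.
    address = task["address"]
    template = task["template"]  # assumed form: field name -> regex with one group

    # S102: access the target webpage and obtain its page information.
    page = fetch(address)

    # S103: parse the page with the template, then store the result.
    record = {}
    for field, pattern in template.items():
        match = re.search(pattern, page, re.S)
        record[field] = match.group(1).strip() if match else None
    store.append(record)
    return record


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request (assumption, keeps the example offline)."""
    return "<h1>Sample title</h1><div class='body'>Sample text</div>"


task = {
    "address": "https://example.com/news/1",
    "template": {
        "title": r"<h1>(.*?)</h1>",
        "text": r"<div class='body'>(.*?)</div>",
    },
}
store: List[dict] = []
result = run_task(task, fake_fetch, store)
```

Because the template is plain data rather than code, capturing a different website in this sketch only requires a different `task` dictionary, which is the flexibility the method aims at.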
In order to make the data capture method provided by the present disclosure more understandable to those skilled in the art, the above steps are exemplified in detail below.
In a possible manner, the target data capture task may be configured as follows: items to be configured are displayed to the user, comprising at least a first item to be configured for the target capture address and a second item to be configured for the target parsing template; configuration information set by the user is acquired in response to the user's operation on the items to be configured; and finally the target data capture task is generated according to the configuration information.
Illustratively, the target capture address is the address of the web page to be captured, and the target webpage is accessed at this address to capture its page information. The target parsing template is a parsing template configured for the data structure of the web page to be captured. Taking a certain web page as an example, its page information comprises header content and body text content, so the parsing template needs a corresponding header part and body text part. When the page information is parsed according to the template, the header content is stored in the header part and the body text content in the body text part, which converts the page information into structured data that is easy to store. The parsing template is configured according to the data structure of the web page to be captured; the structure described here is only an example, and the disclosure does not specifically limit it.
That is, the user can customize the configuration as needed, and the target data capture task is generated automatically once the configuration information set by the user has been acquired. The configuration information can also be given a simple check. For example, after a data capture task is generated, the target capture address is accessed according to the configuration information and the page information shown on the first page is captured, to verify whether configuration information such as the target capture address is correct and whether the page information can be parsed correctly. In addition, the user can set multiple pieces of configuration information to generate multiple data capture tasks; the tasks are independent and do not affect one another, and whether each starts executing is determined by the user's needs. The user therefore captures data simply by setting the corresponding items to be configured, without needing to know how the background operates or to master programming, which makes the method convenient and quick to use.
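The items to be configured could be modelled as a single configuration record, sketched below under stated assumptions. The class name `TaskConfig`, its field names, its default values, and the helper `generate_task` are hypothetical. The check shown is only a static one; the verification described above, which actually accesses the first page, is not reproduced here.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class TaskConfig:
    """Hypothetical container for the items to be configured."""
    address: str                          # first item: target capture address
    template: Dict[str, str]              # second item: target parsing template
    storage: str = "mysql"                # third item: data storage mode
    input_words: List[str] = field(default_factory=list)  # fourth item
    method: str = "GET"                   # fifth item: data interaction mode
    period_seconds: Optional[int] = None  # sixth item: preset period
    mode: str = "full"                    # seventh item: full or incremental
    concurrency: int = 1                  # eighth item: concurrency number
    priority: int = 0                     # priority parameter


def generate_task(config: TaskConfig) -> dict:
    """Check the configuration statically and turn it into a task dictionary."""
    if not config.address.startswith(("http://", "https://")):
        raise ValueError("target capture address must be an HTTP(S) URL")
    if not config.template:
        raise ValueError("parsing template must define at least one field")
    if config.mode not in ("full", "incremental"):
        raise ValueError("mode must be 'full' or 'incremental'")
    if config.concurrency < 1:
        raise ValueError("concurrency number must be at least 1")
    return asdict(config)


task = generate_task(TaskConfig(
    address="https://example.com/list",
    template={"title": "h1", "text": "div.body"},
    priority=1,
))
```

Keeping every item in one record means several independent tasks can be generated simply by creating several `TaskConfig` instances.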
In a possible manner, acquiring the target data capture task configured by the user includes: when multiple target data capture tasks exist, acquiring the one with the highest priority based on the priority relation among them.
For example, the items to be configured include an item for a priority parameter. When a user sets multiple pieces of configuration information, a priority parameter can be set in each. The priority may be represented by a number, where either a smaller number or a larger number represents a higher priority; the disclosure does not specifically limit this. After multiple data capture tasks are generated, they are executed in order of priority, with the highest-priority task executed first.
Specifically, the data capture tasks may be sorted by the priority parameter into an ordered task sequence from high to low priority, for example using a Redis sorted set as the queue, and the sequence is executed in order. After a task has been executed, it may be marked as completed and deleted from the task sequence, or placed at the tail of the sequence to wait for its next execution; the disclosure does not limit this. Multiple data capture tasks can also be executed simultaneously. To relieve server pressure, servers can be added and the tasks distributed across them, with higher-priority tasks distributed first, so that multiple tasks run at the same time and the overall capture time is shortened.
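The ordered task sequence can be illustrated with a small in-process priority queue. This sketch swaps in Python's standard `heapq` module for the Redis sorted set mentioned above, purely so that it runs without a Redis server. It also assumes that a smaller number means a higher priority, which is only one of the two conventions the disclosure allows. The class `TaskQueue` and the task names are hypothetical.

```python
import heapq
import itertools
from typing import List, Tuple


class TaskQueue:
    """In-memory stand-in for an ordered task sequence (not Redis)."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        # The counter breaks ties so equal priorities keep insertion order.
        self._counter = itertools.count()

    def add(self, name: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), name))

    def pop_highest(self) -> str:
        """Remove and return the task with the highest priority."""
        _priority, _order, name = heapq.heappop(self._heap)
        return name

    def __len__(self) -> int:
        return len(self._heap)


queue = TaskQueue()
queue.add("news-site", priority=2)
queue.add("bidding-site", priority=1)
queue.add("recruitment-site", priority=2)

# Execute tasks in priority order; a finished task could instead be
# re-added at the tail by calling add() again with a larger number.
order = [queue.pop_highest() for _ in range(len(queue))]
```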
In a possible manner, the items to be configured further include a sixth item to be configured for a preset period, and the target data capture task configured by the user is acquired based on the preset period.
For example, setting a data capture task to execute once per hour means that the task is executed every hour. If the user has configured only one data capture task, it can start executing directly when it falls due. If the user has configured multiple data capture tasks, the due task can be set as the next task to execute, so that it runs immediately after the current task finishes; in the ordered task sequence described above, for instance, it can be placed at the head of the sequence. Where tasks carry a priority parameter, the priority of the due task may instead be compared with the priorities of the other pending tasks and the task placed at the corresponding position; the disclosure does not specifically limit this. Thus, once the user has set the configuration information, the data capture task is generated automatically and can be executed repeatedly as the user requires, which reduces repeated operations by the user and is convenient.
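One possible way to decide which tasks are due under their preset periods is sketched below. The field names `period_seconds`, `last_run`, and `priority`, and the rule that due tasks are ordered by priority with a smaller number ranking higher, are assumptions for illustration. The current time is passed in explicitly so that the example is deterministic.

```python
from typing import Dict, List


def due_tasks(tasks: List[Dict], now: float) -> List[str]:
    """Return names of tasks whose preset period has elapsed, highest priority first."""
    due = [
        t for t in tasks
        if t["last_run"] is None or now - t["last_run"] >= t["period_seconds"]
    ]
    due.sort(key=lambda t: t["priority"])  # assumed: smaller number = higher priority
    return [t["name"] for t in due]


def mark_executed(task: Dict, now: float) -> None:
    """Record the execution time so the next period is counted from here."""
    task["last_run"] = now


tasks = [
    {"name": "hourly-news", "period_seconds": 3600, "priority": 2, "last_run": 0.0},
    {"name": "daily-bids", "period_seconds": 86400, "priority": 1, "last_run": 0.0},
    {"name": "new-task", "period_seconds": 3600, "priority": 3, "last_run": None},
]

first = due_tasks(tasks, now=3600.0)   # one hour after the initial runs
mark_executed(tasks[0], now=3600.0)    # "hourly-news" has just been executed
second = due_tasks(tasks, now=5000.0)  # "new-task" has still never run
```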
In a possible manner, the items to be configured further include a fifth item to be configured for the data interaction mode. When the target webpage is accessed after the target data capture task to be executed has been determined, the page information may be acquired from the server of the target webpage based on the data interaction mode in the configuration information.
For example, if the target web page supports requesting data with GET, the user needs to set the data interaction mode to GET when setting the configuration information. Other data interaction modes such as POST and HEAD are also available, and the setting should follow the mode the target web page supports. Different data interaction modes are thus all supported, meeting users' needs to capture data from different target webpages without separate custom development for each mode.
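A configured data interaction mode can be turned into a request object with Python's standard `urllib`, as in the sketch below. The helper name `build_request`, the set `SUPPORTED_METHODS`, and the choice to send parameters as a query string for GET and HEAD but as a form-encoded body for POST are assumptions for this example. The request is only built, not sent, so no network access is needed.

```python
from typing import Dict, Optional
from urllib.parse import urlencode
from urllib.request import Request

SUPPORTED_METHODS = {"GET", "POST", "HEAD"}


def build_request(address: str, method: str,
                  params: Optional[Dict[str, str]] = None) -> Request:
    """Build an HTTP request for the configured data interaction mode."""
    method = method.upper()
    if method not in SUPPORTED_METHODS:
        raise ValueError(f"unsupported data interaction mode: {method}")
    params = params or {}
    if method == "POST":
        # POST carries the parameters in a form-encoded body.
        body = urlencode(params).encode("utf-8")
        return Request(address, data=body, method="POST")
    # GET and HEAD carry the parameters in the query string.
    url = f"{address}?{urlencode(params)}" if params else address
    return Request(url, method=method)


get_req = build_request("https://example.com/search", "get", {"q": "news"})
post_req = build_request("https://example.com/search", "POST", {"q": "news"})
```

Passing the resulting object to `urllib.request.urlopen` would perform the actual access; that step is left out here because it depends on a live server.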
In a possible manner, the items to be configured further include a seventh item to be configured for the data acquisition mode, where the data acquisition mode is a full acquisition mode or an incremental acquisition mode. When the target webpage corresponding to the target capture address is accessed for the first time, its page information is acquired according to the full acquisition mode; when it is accessed for the nth time, where n is an integer greater than 1, its page information is acquired according to the incremental acquisition mode.
The full acquisition mode acquires the page information of all pages of the target webpage, whereas the incremental acquisition mode acquires only the page information of its first page on each visit, so data can be captured according to the data acquisition mode selected by the user. For example, the page information of all pages can be acquired the first time the data capture task is executed, and only that of the first page on later executions, which avoids both wasted network resources and the storage space wasted by acquiring duplicate data.
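The choice between the two acquisition modes can be expressed as a small function. The sketch assumes that a task configured as `"full"` always fetches every page, while a task configured as `"incremental"` fetches every page on its first visit and only the first page afterwards. This is one reading of the description, and the function name `pages_to_fetch` and its parameters are hypothetical.

```python
from typing import List


def pages_to_fetch(mode: str, previous_visits: int, total_pages: int) -> List[int]:
    """Return the 1-based page numbers to acquire on this visit."""
    if mode not in ("full", "incremental"):
        raise ValueError("mode must be 'full' or 'incremental'")
    if previous_visits == 0 or mode == "full":
        # First access (or full mode): acquire every page.
        return list(range(1, total_pages + 1))
    # nth access with n > 1 in incremental mode: only the first page.
    return [1]


first_visit = pages_to_fetch("incremental", previous_visits=0, total_pages=3)
later_visit = pages_to_fetch("incremental", previous_visits=4, total_pages=3)
```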
In a possible manner, to avoid destructive crawling of the target webpage, the items to be configured further include an eighth item to be configured for the concurrency number, and the page information of the target webpage is acquired by calling multiple threads, where the number of threads does not exceed the concurrency number.
For example, to avoid destructive crawling of the target webpage, the configured concurrency number should be less than the maximum concurrency the target webpage can bear, and the load on the server executing the capture task should also be taken into account. Setting the concurrency number limits how many threads each data capture task calls when acquiring page information, so the page information can be acquired quickly with multiple threads while the webpage is protected from destructive crawling.
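The thread limit can be sketched with the standard `concurrent.futures.ThreadPoolExecutor`, whose `max_workers` argument caps the number of worker threads. The function `fetch_all`, the offline `fake_fetch` stand-in, and the `stats` dictionary used to observe the peak number of simultaneous fetches are assumptions for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def fetch_all(urls: List[str], fetch: Callable[[str], str],
              concurrency: int) -> Dict[str, str]:
    """Acquire page information with at most `concurrency` threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        pages = list(pool.map(fetch, urls))  # results keep the order of urls
    return dict(zip(urls, pages))


# Bookkeeping used only to observe how many fetches overlap.
stats = {"active": 0, "peak": 0}
lock = threading.Lock()


def fake_fetch(url: str) -> str:
    """Stand-in for a real request (assumption, keeps the example offline)."""
    with lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.01)  # simulate network latency
    with lock:
        stats["active"] -= 1
    return f"page for {url}"


urls = [f"https://example.com/list?page={n}" for n in range(1, 9)]
pages = fetch_all(urls, fake_fetch, concurrency=3)
```

In this sketch the configured concurrency number is an upper bound: the pool may use fewer threads, but `stats["peak"]` never exceeds it.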
In a possible manner, the items to be configured further include a fourth item to be configured for an input word, and the target parsing template includes a parsing item corresponding to the input word. When a next-level capture address corresponding to the input word is parsed from the page information of the target webpage, the next-level capture address can be used as a new target capture address.
Illustratively, the target capture address in the configuration information is the home page address of a website, and the input word may be a category navigation name on the target webpage. For example, the category navigation on the home page of a news website includes sports news, entertainment news, civil news, international news and so on, and clicking a category opens the corresponding page, so the captured page information typically contains the jump address of each category. If the user only needs content related to sports news and entertainment news, the input words in the configuration information can be set to sports news and entertainment news. When the target page is captured, the addresses of sports news and entertainment news are parsed out according to the target parsing template and then accessed to capture their page information, so that page information of other categories the user does not need is not acquired. In addition, if the page information of multiple pages needs to be acquired from the same target webpage, the page information of the next page can be acquired automatically until all of it has been acquired, which avoids incomplete data.
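Resolving input words to next-level capture addresses can be sketched with the standard `html.parser` module. The example assumes that each category appears as an `<a>` element whose link text equals the input word exactly. The class `LinkCollector`, the function `next_level_addresses`, and the sample home page are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect a mapping from link text to href for every <a> element."""

    def __init__(self) -> None:
        super().__init__()
        self.links: Dict[str, str] = {}
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links["".join(self._text).strip()] = self._href
            self._href = None


def next_level_addresses(page: str, base: str, input_words: List[str]) -> List[str]:
    """Return absolute addresses of the categories named by the input words."""
    collector = LinkCollector()
    collector.feed(page)
    return [urljoin(base, collector.links[word])
            for word in input_words if word in collector.links]


home = (
    '<nav><a href="/sports">Sports news</a>'
    '<a href="/ent">Entertainment news</a>'
    '<a href="/world">International news</a></nav>'
)
addresses = next_level_addresses(
    home, "https://news.example.com/", ["Sports news", "Entertainment news"]
)
```

Each returned address would then be treated as a new target capture address; categories that were not configured, such as international news here, are never visited.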
After the page information has been acquired and parsed, the parsed data needs to be stored. To support multiple data storage modes and meet users' needs, in a possible manner the items to be configured further include a third item to be configured for the data storage mode. When the target data is stored, the data storage mode is first parsed from the target data capture task, and the target data is then stored according to it. For example, the user may select a data storage mode such as MySQL, Kafka, or Redis; the disclosure does not specifically limit this.
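Selecting a storage backend from the task can be sketched as a lookup table of writer functions. The three writers below are in-memory stand-ins, not real MySQL, Kafka, or Redis clients. They, the `storage` task field, and the use of the record's `title` as a key are assumptions made so that the example stays self-contained.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]

# In-memory stand-ins for the real backends (assumption, no servers needed).
mysql_rows: List[Record] = []
kafka_messages: List[Record] = []
redis_values: Dict[str, Record] = {}


def save_mysql(record: Record) -> None:
    mysql_rows.append(record)


def save_kafka(record: Record) -> None:
    kafka_messages.append(record)


def save_redis(record: Record) -> None:
    redis_values[record["title"]] = record


STORAGE_BACKENDS: Dict[str, Callable[[Record], None]] = {
    "mysql": save_mysql,
    "kafka": save_kafka,
    "redis": save_redis,
}


def store_target_data(task: Dict[str, str], record: Record) -> None:
    """Parse the data storage mode from the task, then store the record with it."""
    mode = task["storage"].lower()
    if mode not in STORAGE_BACKENDS:
        raise ValueError(f"unsupported data storage mode: {mode}")
    STORAGE_BACKENDS[mode](record)


store_target_data({"storage": "Kafka"}, {"title": "t1", "text": "body"})
```

Supporting a further storage mode in this sketch means registering one more writer in `STORAGE_BACKENDS`, without touching the capture or parsing steps.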
Various abnormalities may occur during task execution and suspend the task, and monitoring for them manually would take a great deal of effort and time. Therefore, when alarm information is detected while the page information of the target webpage is being acquired, and/or while the page information is being parsed according to the target parsing template, the alarm information is displayed to the user so that the user can handle the abnormal condition promptly.
For example, alarm information may be triggered when an abnormality occurs while acquiring the page information of the target webpage, while parsing the page information, or while storing the data. When alarm information is detected, it is presented to the user. The execution progress of the data capture task can also be monitored; for example, the stage the currently executing task has reached and the percentage completed can be displayed to the user.
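Stage-level alarm collection can be sketched by wrapping each stage in a helper that records any exception. The helper `run_stage`, the stage names, the `alarms` list, and the sample `fetch` and `parse` functions are hypothetical; how the alarm information is actually presented to the user is outside this sketch.

```python
from typing import Any, Callable, Dict, List

alarms: List[Dict[str, str]] = []


def run_stage(stage: str, action: Callable[..., Any], *args: Any) -> Any:
    """Run one stage; on failure, record alarm information instead of crashing."""
    try:
        return action(*args)
    except Exception as exc:
        alarms.append({"stage": stage, "error": f"{type(exc).__name__}: {exc}"})
        return None


def fetch(url: str) -> str:
    return "<h1>ok</h1>"


def parse(page: str) -> dict:
    # Simulated abnormality during parsing.
    raise ValueError("template field 'text' not found")


page = run_stage("acquire", fetch, "https://example.com")
data = run_stage("parse", parse, page)
```

The recorded stage name tells the user where the abnormality occurred, which is the information needed to correct the configuration.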
It should be noted that, through its configuration information, the data capture method provided by the embodiments of the present disclosure can adapt to webpages in multiple domains such as public opinion, finance, bidding, opening announcements, and recruitment; can capture data by moving from a search page or list page to a detail page; and supports multiple capture strategies such as a cookie pool and an IP proxy pool. In addition, for target webpages with anti-crawling measures, user operations can be simulated by accessing the target webpage through a browser, which improves the success rate of data capture.
In order to make the method provided by the embodiment of the present disclosure more easily understood by those skilled in the art, the following describes the steps of the data capture method provided by the embodiment of the present disclosure in detail. As shown in fig. 2, the method includes:
s201, obtaining a target data grabbing task with the highest priority in the data grabbing task list, and obtaining a target grabbing address and a target analysis template corresponding to the target grabbing address from the target data grabbing task.
Wherein the data capture task is generated according to the configuration information.
S202, acquiring page information from a server of the target webpage based on a data interaction mode.
Further, step S203 is executed when the target capture address is accessed for the first time; otherwise, step S204 is executed.
S203, acquiring the page information of the target webpage according to the full acquisition mode.
S204, acquiring the page information of the target webpage according to the incremental acquisition mode.
S205, analyzing the page information according to the target analysis template to obtain target data.
S206, storing the target data according to the data storage mode.
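The flow of steps S201 to S206 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the task is a plain dictionary, the target analysis template is represented by a regular expression, the `fetch` callable stands in for the server interaction of S202, and storage is an in-memory list. Selection of the highest-priority task is omitted here.

```python
import re
from typing import Callable, Dict, List, Set


def run_capture_task(task: Dict[str, str],
                     fetch: Callable[[str, str], str],
                     visited: Set[str],
                     store: List[str]) -> str:
    """Execute one task and return the acquisition mode that was used."""
    address = task["address"]        # S201: target capture address
    pattern = task["template"]       # S201: analysis template (a regex in this sketch)
    # S203 on the first access to this address, S204 on every later access.
    mode = "incremental" if address in visited else "full"
    page = fetch(address, mode)      # S202: obtain page information
    visited.add(address)
    target_data = re.findall(pattern, page)   # S205: analyze the page information
    store.extend(target_data)                 # S206: store the target data
    return mode
```

Passing `fetch` in as a parameter keeps the sketch free of network access; a real deployment would supply a function that issues the request in the configured data interaction mode.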
With this method, a data capture task is generated from the configuration information set by the user and executed according to the configuration parameters in that information, which accommodates data capture for different webpages and different user requirements; finally, the data is stored in the data storage mode selected by the user. The required data can therefore be captured without custom development for each website. The method takes little time and is highly flexible; the user needs no coding ability, since data capture can be achieved through simple configuration, so the operation is simple and the user experience is good.
FIG. 3 is a block diagram illustrating a data crawling apparatus according to an exemplary embodiment. As shown in fig. 3, the apparatus 300 includes:
The obtaining module 301 is configured to obtain a target data capture task configured by a user, and obtain, from the target data capture task, a target capture address and a target analysis template corresponding to the target capture address.
The access module 302 is configured to access a target webpage corresponding to the target capture address, and acquire page information of the target webpage.
The analysis module 303 is configured to analyze the page information of the target webpage according to the target analysis template to obtain target data, and store the target data.
With this device, the target data capture task configured by the user is first obtained, and the target capture address and the target analysis template corresponding to the target capture address are obtained from the target data capture task; the target webpage corresponding to the target capture address is then accessed and its page information is acquired; finally, the page information of the target webpage is analyzed according to the target analysis template to obtain the target data, and the target data is stored. With this device, the user can configure data capture tasks as required and capture the required data without custom development for each website. The device takes little time and is highly flexible; the user needs no coding ability, since data capture can be achieved through simple configuration.
Optionally, the target data capture task is configured as follows:
presenting items to be configured to the user, wherein the items to be configured at least comprise a first item to be configured for the target capture address and a second item to be configured for the target analysis template;
responding to the operation of the user for the item to be configured, and acquiring configuration information set by the user;
and generating the target data capturing task according to the configuration information.
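As a non-limiting illustration, generating a task from user-set configuration information might look like the sketch below. The field names (`address`, `template`, `storage`, `priority`), their defaults, and the validation rule are hypothetical; only the two required items correspond to the first and second items to be configured described above.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class CaptureTask:
    """A generated data capture task (field names are assumptions)."""
    address: str            # first item: target capture address
    template: str           # second item: target analysis template
    storage: str = "file"   # optional item with an assumed default
    priority: int = 0       # optional item with an assumed default


REQUIRED_ITEMS = ("address", "template")


def generate_task(config: Dict[str, Any]) -> CaptureTask:
    """Validate the configuration information and build the task from it."""
    missing = [key for key in REQUIRED_ITEMS if not config.get(key)]
    if missing:
        raise ValueError(f"missing required items: {missing}")
    return CaptureTask(**config)
```

Rejecting an incomplete configuration at generation time means that later stages can rely on the address and the template being present.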
Optionally, the items to be configured further include a third item to be configured for the data storage mode, and the analysis module 303 is configured to:
analyzing the target data capture task to obtain the data storage mode;
and storing the target data according to the data storage mode.
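Storing according to a configured mode can be sketched as a lookup from the mode name to a writer function. The mode names `json` and `csv`, the task key `storage_mode`, and the use of a text stream as the destination are assumptions for this sketch; the disclosure does not enumerate the supported storage modes.

```python
import csv
import io
import json
from typing import Callable, Dict, List, TextIO

Record = Dict[str, str]


def store_as_json(records: List[Record], sink: TextIO) -> None:
    json.dump(records, sink, ensure_ascii=False)


def store_as_csv(records: List[Record], sink: TextIO) -> None:
    if not records:
        return  # nothing to write, and no header can be derived
    writer = csv.DictWriter(sink, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)


# Hypothetical mapping from the configured storage mode to a writer.
STORAGE_MODES: Dict[str, Callable[[List[Record], TextIO], None]] = {
    "json": store_as_json,
    "csv": store_as_csv,
}


def store_target_data(task: Dict[str, str], records: List[Record], sink: TextIO) -> None:
    """Read the storage mode from the task, then store the records accordingly."""
    mode = task["storage_mode"]
    if mode not in STORAGE_MODES:
        raise ValueError(f"unsupported data storage mode: {mode}")
    STORAGE_MODES[mode](records, sink)


buffer = io.StringIO()
store_target_data({"storage_mode": "csv"},
                  [{"title": "A", "url": "https://example.com/a"}],
                  buffer)
```

Supporting a further storage mode then only requires registering one more writer in `STORAGE_MODES`.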
Optionally, the items to be configured further include a fourth item to be configured for word projection, the target analysis template includes a parsing item corresponding to the word projection, and the analysis module 303 is configured to:
analyzing the page information of the target webpage to obtain a next-level capture address corresponding to the input word;
the analysis module 303 is further configured to:
taking the next-level capture address as a new target capture address.
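One reading of this step is that a search-result page is analyzed for links related to the input word, and each such link becomes a new target capture address. The sketch below follows that reading; matching the word against link text, the class name `LinkCollector`, and the function `next_level_addresses` are assumptions, since the disclosure does not state how the correspondence is determined.

```python
from html.parser import HTMLParser
from typing import List, Optional
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs of links whose text contains the input word."""

    def __init__(self, base_url: str, word: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.word = word
        self.matches: List[str] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if self.word in "".join(self._text):
                # Resolve relative links against the page that contained them.
                self.matches.append(urljoin(self.base_url, self._href))
            self._href = None


def next_level_addresses(page: str, base_url: str, word: str) -> List[str]:
    """Return next-level capture addresses for the input word."""
    collector = LinkCollector(base_url, word)
    collector.feed(page)
    collector.close()
    return collector.matches
```

Each returned address can then be placed in a new task as its target capture address, which produces the search page to detail page progression.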
Optionally, the items to be configured further include a fifth item to be configured for a data interaction manner, and the access module 302 is configured to:
and acquiring the page information from the server of the target webpage based on the data interaction mode.
Optionally, the obtaining module 301 is configured to:
in a case where a plurality of target data capture tasks exist, obtaining the target data capture task with the highest priority based on the priority relation among the target data capture tasks.
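Selecting the highest-priority task can be sketched with a heap. The convention that a larger number means a higher priority, the first-in-first-out tie-break, and the class name `TaskList` are assumptions; the disclosure only requires that the task with the highest priority be obtained.

```python
import heapq
from itertools import count
from typing import List, Tuple


class TaskList:
    """Task list that always yields the highest-priority task first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._order = count()  # insertion counter used to break ties

    def add(self, name: str, priority: int) -> None:
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._order), name))

    def pop_highest(self) -> str:
        return heapq.heappop(self._heap)[2]
```

The insertion counter ensures that tasks of equal priority are returned in the order in which they were added.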
Optionally, the items to be configured further include a sixth item to be configured for a preset period, and the obtaining module 301 is configured to:
and acquiring a target data capturing task configured by a user based on the preset period.
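Acquiring tasks on a preset period can be sketched as a check of which tasks are due at a given moment. The function name `due_tasks`, the per-task period mapping, and the rule that a task never run before is immediately due are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def due_tasks(periods: Dict[str, timedelta],
              last_run: Dict[str, datetime],
              now: datetime) -> List[str]:
    """Return the names of tasks whose preset period has elapsed."""
    due = []
    for name, period in periods.items():
        previous = last_run.get(name)
        # A task that has never run is treated as due (assumption).
        if previous is None or now - previous >= period:
            due.append(name)
    return due
```

A scheduler loop could call this function at a fixed interval and hand the returned tasks to the obtaining module.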
Optionally, the items to be configured further include a seventh item to be configured for a data acquisition mode, where the data acquisition mode is a full acquisition mode or an incremental acquisition mode, and the access module 302 is configured to:
in a case where the target webpage corresponding to the target capture address is accessed for the first time, acquiring page information of the target webpage according to the full acquisition mode;
and in a case where the target webpage corresponding to the target capture address is accessed for the nth time, acquiring page information of the target webpage according to the incremental acquisition mode, wherein n is an integer greater than 1.
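The difference between the two modes can be illustrated as follows, assuming that incremental acquisition means keeping only items not seen on an earlier access. That interpretation, the function name `acquire`, and the use of a set of previously seen items are assumptions; the disclosure does not define how the increment is computed.

```python
from typing import Iterable, List, Set


def acquire(records: Iterable[str], seen: Set[str], first_visit: bool) -> List[str]:
    """Return all records on the first visit, only unseen ones afterwards."""
    records = list(records)
    if first_visit:
        new = records                                   # full acquisition
    else:
        new = [r for r in records if r not in seen]     # incremental acquisition
    seen.update(records)
    return new
```

On later accesses this avoids re-processing and re-storing items that were already captured.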
Optionally, the items to be configured further include an eighth item to be configured for a concurrency number, and the access module 302 is configured to:
and calling a plurality of threads to acquire the page information of the target webpage, wherein the number of threads does not exceed the concurrency number.
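Bounding the number of threads by the configured concurrency number can be sketched with a thread pool from the Python standard library. The function name `fetch_all` and the `fetch` callable are assumptions; `fetch` stands in for whatever request logic the access module uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fetch_all(addresses: List[str],
              fetch: Callable[[str], str],
              concurrency: int) -> List[str]:
    """Fetch every address using at most `concurrency` worker threads."""
    if concurrency < 1:
        raise ValueError("concurrency number must be at least 1")
    # Never start more threads than there are addresses to fetch.
    workers = min(concurrency, len(addresses)) or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input addresses.
        return list(pool.map(fetch, addresses))
```

Because `max_workers` caps the pool size, the number of threads acquiring page information never exceeds the configured concurrency number.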
Optionally, the apparatus 300 further comprises:
an alarm module, configured to present alarm information to the user when the alarm information is detected while acquiring the page information of the target webpage and/or while analyzing the page information of the target webpage according to the target analysis template.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The embodiments of the present disclosure also provide a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the data capture method provided in the foregoing embodiments.
An embodiment of the present disclosure further provides an electronic device, including:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the data capture method provided by the above embodiments.
Fig. 4 is a block diagram illustrating an electronic device 400 according to an example embodiment. As shown in fig. 4, the electronic device 400 may include: a processor 401 and a memory 402. The electronic device 400 may also include one or more of a multimedia component 403, an input/output (I/O) interface 404, and a communications component 405.
The processor 401 is configured to control the overall operation of the electronic device 400, so as to complete all or part of the steps in the data capture method. The memory 402 is used to store various types of data to support operation of the electronic device 400, such as instructions for any application or method operating on the electronic device 400 and application-related data, such as text, pictures, audio, video, and so forth. The memory 402 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk. The multimedia component 403 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 402 or transmitted through the communication component 405. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 404 provides an interface between the processor 401 and other interface modules, such as a keyboard, a mouse, or buttons. These buttons may be virtual buttons or physical buttons. The communication component 405 is used for wired or wireless communication between the electronic device 400 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, 4G, NB-IoT, eMTC, 5G, and the like, or a combination of one or more of them, which is not limited herein. Accordingly, the communication component 405 may include a Wi-Fi module, a Bluetooth module, an NFC module, and the like.
In an exemplary embodiment, the electronic Device 400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described data capture method.
In another exemplary embodiment, a computer readable storage medium comprising program instructions which, when executed by a processor, implement the steps of the data crawling method described above is also provided. For example, the computer readable storage medium may be the memory 402 comprising program instructions executable by the processor 401 of the electronic device 400 to perform the data capture method described above.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings. However, the present disclosure is not limited to the specific details of the above embodiments. Various simple modifications may be made to the technical solution of the present disclosure within its technical idea, and these simple modifications all fall within the protection scope of the present disclosure.
It should be noted that, in the foregoing embodiments, various features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various combinations that are possible in the present disclosure are not described again.
In addition, the various embodiments of the present disclosure may be combined in any manner, and such combinations should likewise be regarded as content disclosed herein, as long as they do not depart from the spirit of the present disclosure.

Claims (13)

1. A method for data capture, the method comprising:
acquiring a target data capture task configured by a user, and acquiring, from the target data capture task, a target capture address and a target analysis template corresponding to the target capture address;
accessing a target webpage corresponding to the target capture address, and acquiring page information of the target webpage;
and analyzing the page information of the target webpage according to the target analysis template to obtain target data, and storing the target data.
2. The method of claim 1, wherein the target data capture task is configured by:
presenting items to be configured to the user, wherein the items to be configured at least comprise a first item to be configured for the target capture address and a second item to be configured for the target analysis template;
responding to the operation of the user on the items to be configured, and acquiring configuration information set by the user;
and generating the target data capture task according to the configuration information.
3. The method of claim 2, wherein the items to be configured further include a third item to be configured for a data storage manner, and wherein the storing the target data includes:
analyzing the target data capturing task to obtain the data storage mode;
and storing the target data according to the data storage mode.
4. The method of claim 2, wherein the items to be configured further include a fourth item to be configured for word projection, the target analysis template includes a parsing item corresponding to the word projection, and the analyzing the page information of the target webpage according to the target analysis template includes:
analyzing the page information of the target webpage to obtain a next-level capture address corresponding to the input word;
the method further comprises the following steps:
taking the next-level capture address as a new target capture address.
5. The method according to claim 2, wherein the items to be configured further include a fifth item to be configured for a data interaction manner, and the acquiring page information of the target web page includes:
and acquiring the page information from the server of the target webpage based on the data interaction mode.
6. The method according to any one of claims 1-5, wherein the obtaining of the user-configured target data crawling task comprises:
under the condition that a plurality of target data grabbing tasks exist, the target data grabbing task with the highest priority is obtained based on the priority relation among the target data grabbing tasks.
7. The method according to claim 2, wherein the items to be configured further include a sixth item to be configured for a preset period, and the acquiring a target data capture task configured by a user includes:
and acquiring a target data capturing task configured by a user based on the preset period.
8. The method according to claim 2, wherein the items to be configured further include a seventh item to be configured for a data acquisition mode, the data acquisition mode is a full acquisition mode or an incremental acquisition mode, and the accessing the target webpage corresponding to the target crawling address and acquiring page information of the target webpage includes:
in a case where the target webpage corresponding to the target capture address is accessed for the first time, acquiring page information of the target webpage according to the full acquisition mode;
and in a case where the target webpage corresponding to the target capture address is accessed for the nth time, acquiring page information of the target webpage according to the incremental acquisition mode, wherein n is an integer greater than 1.
9. The method according to claim 2, wherein the items to be configured further include an eighth item to be configured for a concurrency number, and the obtaining page information of the target web page includes:
and calling a plurality of threads to acquire the page information of the target webpage, wherein the number of threads does not exceed the concurrency number.
10. The method of claim 1, further comprising:
and displaying the alarm information to the user under the condition that the alarm information is detected in the process of acquiring the page information of the target webpage and/or under the condition that the alarm information is detected in the process of analyzing the page information of the target webpage according to the target analysis template.
11. A data capture device, the device comprising:
an obtaining module, configured to obtain a target data capture task configured by a user, and obtain, from the target data capture task, a target capture address and a target analysis template corresponding to the target capture address;
an access module, configured to access a target webpage corresponding to the target capture address and acquire page information of the target webpage;
and an analysis module, configured to analyze the page information of the target webpage according to the target analysis template to obtain target data, and store the target data.
12. A non-transitory computer readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 10.
13. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 10.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111354326.0A CN113934913A (en) 2021-11-12 2021-11-12 Data capture method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113934913A true CN113934913A (en) 2022-01-14

Family

ID=79286734

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111354326.0A Pending CN113934913A (en) 2021-11-12 2021-11-12 Data capture method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113934913A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559304A (en) * 2013-11-18 2014-02-05 北京暴风科技股份有限公司 Implementation method and device for Internet data customization
CN104317948A (en) * 2014-11-05 2015-01-28 北京中科辅龙信息技术有限公司 Page data capturing method and system
CN109194543A (en) * 2018-08-24 2019-01-11 北京天元创新科技有限公司 Collecting method and device
CN109543103A (en) * 2018-11-14 2019-03-29 深圳市中易科技有限责任公司 A method of based on distributed data collection
US10394796B1 (en) * 2015-05-28 2019-08-27 BloomReach Inc. Control selection and analysis of search engine optimization activities for web sites
CN111949680A (en) * 2019-05-17 2020-11-17 杭州海康威视数字技术股份有限公司 Data processing method and device, computer equipment and storage medium
CN112541104A (en) * 2019-09-20 2021-03-23 浙江大搜车软件技术有限公司 Data capturing method and device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114625502A (en) * 2022-03-03 2022-06-14 盐城金堤科技有限公司 Word-throwing task processing method and device, storage medium and electronic equipment
CN114692050A (en) * 2022-03-30 2022-07-01 北京金堤科技有限公司 Page parsing method and device, computer readable medium and electronic device
CN115730150A (en) * 2022-12-09 2023-03-03 广州富莱星科技有限公司 Data capturing method, system and equipment and storable medium
CN116719986A (en) * 2023-08-10 2023-09-08 深圳传趣网络技术有限公司 Python-based data grabbing method, device, equipment and storage medium
CN116719986B (en) * 2023-08-10 2023-12-26 深圳传趣网络技术有限公司 Python-based data grabbing method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113934913A (en) Data capture method and device, storage medium and electronic equipment
CA2915619C (en) Method and apparatus for customized software development kit (sdk) generation
CN109144856A (en) A kind of UI automated testing method calculates equipment and storage medium
CN110532159B (en) Data monitoring method, device, equipment and computer readable storage medium
US10175954B2 (en) Method of processing big data, including arranging icons in a workflow GUI by a user, checking process availability and syntax, converting the workflow into execution code, monitoring the workflow, and displaying associated information
US20140289761A1 (en) Systems and Methods of Processing Data Involving Presentation of Information on Android Devices
CN111930472B (en) Code debugging method and device, electronic equipment and storage medium
CN107807841B (en) Server simulation method, device, equipment and readable storage medium
CN110851681A (en) Crawler processing method and device, server and computer readable storage medium
US9648078B2 (en) Identifying a browser for rendering an electronic document
US9571557B2 (en) Script caching method and information processing device utilizing the same
WO2022228156A1 (en) Policy orchestration processing method, apparatus, device and system and storage medium
CN105005596B (en) page display method and device
CN112667795B (en) Dialogue tree construction method and device, dialogue tree operation method, device and system
CN108062401B (en) Application recommendation method and device and storage medium
CN112307386A (en) Information monitoring method, system, electronic device and computer readable storage medium
CN111026945B (en) Multi-platform crawler scheduling method, device and storage medium
CN106383869B (en) Method and device for acquiring user behavior information
CN111124627B (en) Method and device for determining call initiator of application program, terminal and storage medium
CN116661936A (en) Page data processing method and device, computer equipment and storage medium
CN113626158A (en) Event agent-based embedded point execution method and device
CN113590985A (en) Page jump configuration method and device, electronic equipment and computer readable medium
CN107508705A (en) The resource tree constructing method and computing device of a kind of HTTP elements
CN113326237A (en) Log data processing method and device, terminal device and storage medium
CN111131354B (en) Method and apparatus for generating information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230731

Address after: 224008 Rooms 404-405 and 504, Building B-17-1, Big data Industrial Park, Kecheng Street, Yannan High tech Zone, Yancheng, Jiangsu Province

Applicant after: Yancheng Tianyanchawei Technology Co.,Ltd.

Address before: 224008 room 501-503, building b-17-1, Xuehai road big data Industrial Park, Kecheng street, Yannan high tech Zone, Yancheng City, Jiangsu Province (CNK)

Applicant before: Yancheng Jindi Technology Co.,Ltd.
