CN115905661A - Automatic crawling method and device for webpage data, computer equipment and medium - Google Patents

Publication number
CN115905661A
Authority
CN
China
Prior art keywords
crawling
webpage
user
data
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211668127.1A
Other languages
Chinese (zh)
Inventor
谯海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Chongqing BOE Smart Technology Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Chongqing BOE Smart Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd, Chongqing BOE Smart Technology Co Ltd filed Critical BOE Technology Group Co Ltd
Priority to CN202211668127.1A priority Critical patent/CN115905661A/en
Publication of CN115905661A publication Critical patent/CN115905661A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method and a device for automatically crawling webpage data, computer equipment, and a medium. In one embodiment, the automatic crawling method comprises: starting automatic crawling application software in response to a first operation of a user, wherein the automatic crawling application software is generated based on a cross-platform desktop application development framework; recording a crawling rule on a target webpage and generating a data crawling template in response to a second operation of the user; and running the data crawling template in response to a third operation of the user so as to automatically crawl the webpage whose data is to be crawled. The embodiment provides visual crawling-rule recording based on the automatic crawling application software and obtains a data crawling template with which webpages can be crawled automatically, so that automatic crawling is realized flexibly and simply; in particular, because the automatic crawling application software is generated with a cross-platform desktop application development framework, it can support multiple operating systems and has broad application prospects.

Description

Automatic crawling method and device for webpage data, computer equipment and medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for automatically crawling web page data, a computer device, and a medium.
Background
With the development of internet technology, the amount of information is increasing dramatically, and how to obtain useful information has become a focus of attention for those skilled in the art. Crawler technology emerged in response to this need. However, a crawler usually has to be designed and written by professional developers, and ordinary users are often unable to apply crawler technology because of this professional threshold. Meanwhile, some websites deploy measures against crawlers to protect their data security, which makes their data increasingly difficult to crawl.
Disclosure of Invention
In order to solve at least one of the above problems, a first aspect of the present invention provides an automatic crawling method for web page data, including:
starting automatic crawling application software in response to a first operation of a user, wherein the automatic crawling application software is generated based on a cross-platform desktop application development framework;
responding to a second operation of the user to perform crawling rule recording on the target webpage and generate a data crawling template;
and responding to a third operation of the user to run the data crawling template so as to automatically crawl the webpage whose data is to be crawled.
Further, the recording the crawling rule at the target webpage and generating the data crawling template in response to the second operation of the user further comprises:
responding to a fourth operation of the user to receive the address of the target webpage and opening the target webpage;
responding to a fifth operation of a user, selecting a first webpage node needing to be crawled in the target webpage, and recording the first webpage node to generate a data crawling template.
Further, before the first webpage node needing to be crawled in the target webpage is selected in response to a fifth operation of the user and the first webpage node is recorded to generate a data crawling template, the method further comprises the following steps:
responding to a sixth operation of the user, inputting search content in a search box in the page of the target webpage, searching, and recording a second webpage node corresponding to the sixth operation;
responding to a seventh operation of the user, selecting a search term in the search list, and recording a third webpage node corresponding to the seventh operation;
the selecting a first webpage node needing to be crawled in the target webpage in response to a fifth operation of the user and recording the first webpage node to generate the data crawling template further comprises: and generating a data crawling template according to the first webpage node, the second webpage node and the third webpage node.
Further, the automatic crawling method further includes: recording a plurality of interval times between the fourth operation, the sixth operation, the seventh operation and the fifth operation;
the selecting a first webpage node needing to be crawled in the target webpage in response to a fifth operation of the user and recording the first webpage node to generate the data crawling template further comprises: and generating a data crawling template according to the first webpage node, the second webpage node, the third webpage node and the plurality of interval times.
Further, the recording the crawling rule at the target webpage and generating the data crawling template in response to the second operation of the user further comprises:
replaying the corresponding operation after each of the fourth operation, the sixth operation, the seventh operation and the fifth operation, and either repeating the corresponding operation in response to the user choosing to re-record or confirming the recording in response to the user's confirmation.
Further, the first webpage node, the second webpage node and the third webpage node are each selected from a preset node list, and the node list comprises a plurality of types of webpage nodes.
Further, the recording the crawling rule on the target webpage and generating the data crawling template in response to the second operation of the user further comprises:
the target webpage is a video webpage comprising a plurality of videos, and links of the videos are obtained in response to the eighth operation of the user and a data crawling template is generated.
Further, the recording the crawling rule at the target webpage and generating the data crawling template in response to the second operation of the user further comprises:
the target webpage comprises a control of a preset type, and a script corresponding to the control is called in response to the ninth operation of the user to generate a data crawling template.
Further, the executing the data crawling template in response to a third operation of the user to automatically crawl a webpage to be crawled with data further comprises:
responding to a first crawling mode selected by a user, operating the data crawling template to automatically crawl webpages of data to be crawled and output crawled data;
or
And operating the data crawling template in response to a second crawling mode selected by the user to automatically crawl a webpage of data to be crawled and store the crawled data.
A second aspect of the present invention provides an automatic crawling apparatus using the automatic crawling method according to the first aspect, including a starting unit, a recording unit, and a crawling unit, wherein:
The starting unit is configured to start automatic crawling application software in response to a first operation of a user, wherein the automatic crawling application software is generated based on a cross-platform desktop application development framework;
the recording unit is configured to respond to a second operation of the user to perform crawling rule recording on a target webpage and generate a data crawling template;
the crawling unit is configured to respond to a third operation of the user to run the data crawling template so as to automatically crawl a webpage of data to be crawled.
A third aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the automatic crawling method according to the first aspect.
A fourth aspect of the present invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the automatic crawling method according to the first aspect when executing the program.
The invention has the following beneficial effects:
To address the existing problems, the invention provides an automatic crawling method, an automatic crawling device, computer equipment and a medium for webpage data. It provides visual crawling-rule recording based on automatic crawling application software and obtains a data crawling template with which webpages can be crawled automatically, so that automatic crawling is realized flexibly and simply; in particular, because the automatic crawling application software is generated with a cross-platform desktop application development framework, it can support multiple operating systems and has broad application prospects.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 illustrates a flow diagram of an automatic crawling method according to an embodiment of the present invention;
FIG. 2 is a block diagram of an automatic crawling apparatus according to an embodiment of the present invention;
FIG. 3 shows a schematic structural diagram of a computer device according to another embodiment of the present invention.
Detailed Description
In order to more clearly illustrate the invention, the invention is further described below with reference to preferred embodiments and the accompanying drawings. Similar parts in the figures are denoted by the same reference numerals. It is to be understood by persons skilled in the art that the following detailed description is illustrative and not restrictive, and is not to be taken as limiting the scope of the invention.
To address the above problems, as shown in FIG. 1, an embodiment of the present invention provides an automatic crawling method for webpage data, including:
starting automatic crawling application software in response to a first operation of a user, wherein the automatic crawling application software is generated based on a cross-platform desktop application development framework;
responding to a second operation of the user to perform crawling rule recording on the target webpage and generate a data crawling template;
and responding to a third operation of the user to run the data crawling template so as to automatically crawl the webpage whose data is to be crawled.
In the embodiment, visual crawling rule recording is provided based on automatic crawling application software, and a data crawling template is obtained so as to facilitate automatic crawling of a webpage, so that automatic crawling is flexibly and simply realized; in particular, the automatic crawling application software generated by the cross-platform desktop application development framework can support multiple systems, and has a wide application prospect.
To further illustrate the automatic crawling method of the present embodiment, as shown in fig. 1, the following steps are described:
In the first step, the automatic crawling application software is started in response to a first operation of a user, wherein the automatic crawling application software is generated based on a cross-platform desktop application development framework.
In this embodiment, the cross-platform desktop application development framework is the Electron framework, that is, a framework for building cross-platform desktop applications with JavaScript, HTML, and CSS. Electron is compatible with macOS, Windows, and Linux, and applications for all three platforms can be built from one code base; developers therefore only need to develop one set of programs to deploy the application on different systems. This improves the generality of the automatic crawling application software generated in this embodiment, improves development efficiency, reduces the developers' workload, and effectively reduces development cost.
This embodiment develops cross-platform, visual automatic crawling application software based on the Electron framework and provides a visual user interface. Specifically, the user only needs to click to start the application software in order to configure data crawling with it. The user does not need professional knowledge of crawler technology, so the technical threshold is low; the user interface is friendly and the operation is flexible, simple, and clear, which effectively improves the user experience.
In the second step, a crawling rule is recorded on the target webpage and a data crawling template is generated in response to a second operation of the user.
In this embodiment, based on the visual automatic crawling application software, the user records the crawling rule through simple operations, using data crawling on a target webpage as an example, and a data crawling template that meets the user's data crawling requirements is generated. The process is simple, convenient, and easy to operate. It specifically comprises the following:
First, the address of the target webpage is received and the target webpage is opened in response to a fourth operation of the user.
In this embodiment, the user records the data crawling template using a target webpage as an example. Specifically, when starting the automatic crawling application software, the user selects the "add custom crawler source" mode so that the target webpage can be entered afterwards, and then enters the URL of the target webpage in the visual user interface of the automatic crawling application software; the target webpage is opened according to that URL. The automatic crawling application software of this embodiment provides a text box on the user interface so that the user can conveniently enter the address of the target webpage; the user interface is friendly and easy to operate.
It should be noted that the present application does not specifically limit the manner of inputting the target webpage. A person skilled in the art should select the input manner for the crawler source according to actual application requirements, for example through a settings option; the basic criterion is that the crawler source can be determined, and details are not repeated here.
Next, a first webpage node that needs to be crawled in the target webpage is selected in response to a fifth operation of the user, and the first webpage node is recorded to generate a data crawling template.
In this embodiment, the target webpage is taken to be text news as an example. For text news, the automatic crawling application software of this embodiment defines various nodes, such as text nodes for text, picture nodes for pictures, and video nodes for videos. Specifically, taking a text node as the first webpage node, a sentence or a paragraph may serve as one text node. When the user moves the mouse over part of the page, the node under the mouse is outlined with a rectangular frame, with different colors used for different nodes; for example, with a paragraph as one node, the text paragraph under the mouse is outlined with a red rectangular frame. In response to a user operation such as a right click, a preset node list containing multiple node types is displayed, and the content in the red rectangular frame is determined to be a text node according to the user's selection. This embodiment uses predefined webpage nodes to delimit the crawling scope and the crawling content. On the one hand, this makes it easy for the user to record the operations and generate the data crawling template; on the other hand, providing a node list that includes different node types makes the nodes easy for the automatic crawling application software to identify, which effectively improves crawling efficiency. In addition, because this embodiment records the crawling rule in terms of webpage nodes, it also avoids crawling errors caused by different resolutions, for example errors that would arise if position information were used as the basis for crawling and became wrong after the resolution was changed.
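To illustrate why a node-based record is independent of screen resolution, the following Python sketch models one recorded node. It is only a sketch under stated assumptions: the names `NodeType`, `RecordedNode`, and `record_node`, and the use of a CSS-style selector string as the node locator, are hypothetical and are not specified by the patent, whose application is built on Electron rather than Python.

```python
from dataclasses import dataclass, asdict
from enum import Enum


class NodeType(str, Enum):
    # Node types offered by the preset node list (text, picture, video).
    TEXT = "text"
    PICTURE = "picture"
    VIDEO = "video"


@dataclass(frozen=True)
class RecordedNode:
    # Hypothetical record of one node chosen by the user.
    node_type: NodeType
    locator: str  # assumed: a structural locator, not pixel coordinates


def record_node(node_type_name: str, locator: str) -> RecordedNode:
    """Validate the user's choice against the preset node list and record it."""
    node_type = NodeType(node_type_name)  # raises ValueError if not in the list
    return RecordedNode(node_type=node_type, locator=locator)


node = record_node("text", "article > p:nth-of-type(2)")
print(asdict(node))
```

Because the record stores a structural locator and a node type rather than x/y coordinates, nothing in it changes when the window is resized or the display resolution is modified.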
Considering that the user may make mistakes while recording the crawling rule, in an optional embodiment, the automatic crawling method further comprises replaying each operation after the user performs it, so that the user can confirm it.
Specifically, still taking data crawling of a news webpage as an example, after the user selects the text node, the automatic crawling application software replays the selection process just performed so that the user can confirm it. If the user confirms that the operation is correct, recording of the crawling rule continues; if the user does not confirm it, the user chooses to delete the operation and records it again. Through operation replay and operation confirmation, this embodiment further improves the efficiency with which the user records crawling rules, reduces the recording difficulty, and effectively improves the user experience.
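The replay-and-confirm loop described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `record_with_confirmation`, the three callbacks, and the `max_attempts` limit are all assumptions introduced here.

```python
from typing import Callable, List


def record_with_confirmation(
    capture: Callable[[], str],
    replay: Callable[[str], None],
    confirm: Callable[[str], bool],
    max_attempts: int = 3,  # assumed limit; the patent does not state one
) -> str:
    """Capture an operation, replay it, and keep it only once the user confirms."""
    for _ in range(max_attempts):
        operation = capture()
        replay(operation)       # show the user what was just recorded
        if confirm(operation):  # user accepts: keep this recording
            return operation
        # user rejects: the recording is discarded and captured again
    raise RuntimeError("operation was not confirmed")


# Simulated session: the first selection is a slip, the second is accepted.
attempts = iter(["select sidebar ad", "select article paragraph"])
replayed: List[str] = []
result = record_with_confirmation(
    capture=lambda: next(attempts),
    replay=replayed.append,
    confirm=lambda op: op == "select article paragraph",
)
print(result)
```

In the simulated session both attempts are replayed to the user, but only the confirmed one is returned and would be written into the template.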
Considering that the webpage to be crawled may need to be searched, in an optional embodiment, before the user selects the first webpage node, the automatic crawling method further comprises: in response to a sixth operation of the user, inputting search content in a search box in the page of the target webpage and searching, and recording a second webpage node corresponding to the sixth operation; and in response to a seventh operation of the user, selecting a search item in the search list, and recording a third webpage node corresponding to the seventh operation.
In this embodiment, still taking data crawling of a news webpage as an example, the target webpage entered by the user is the URL of a portal website, and search content is entered in the portal's search box to perform a search. For example, a green rectangular frame is formed when the user moves the mouse over the search box, a preset node list is displayed by a right click, and the rectangular frame formed by the search operation is defined as the second webpage node. Specifically, "World Cup" is entered in the search box to obtain news related to the World Cup. After this step is finished, the automatic crawling application software replays the operation and presents it to the user, and the user confirms and saves it according to the replayed content, thereby continuing the recording of the crawling rule.
Next, the user selects a third webpage node from the search list displayed after the search. For example, among the listed news items, each of which includes a title, a news summary, a release time, comment details, and the like, a blue rectangular frame is formed when the user moves the mouse over one news item, a preset node list is displayed by a right click, and the rectangular frame formed by the search-item selection operation is defined as the third webpage node. Specifically, news about the Group B matches is selected from the news list. After this step is finished, the automatic crawling application software replays the operation and presents it to the user, and the user confirms and saves it according to the replayed content, thereby continuing the recording of the crawling rule.
It is worth noting that this embodiment only explains the crawling process for text news; the present application does not limit the specific crawling operations. A person skilled in the art should select nodes level by level according to the content to be crawled, so that the automatic crawling application software selects and divides content according to the hierarchically recorded crawling rule. For example, it is first determined whether a search box exists in the opened target page; a chapter is then selected from the search list produced by the search operation; after the chapter detail page is entered, the corresponding content page is selected; and finally the content page is entered and the content node is selected. That is, the operations are performed step by step according to the content to be crawled, and the crawling rule is recorded using the preset node list. Details are not repeated here.
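One way to picture a data crawling template built from the recorded nodes is as an ordered list of steps that can be saved and reloaded. The sketch below is an assumption-laden illustration: the `Step` and `CrawlTemplate` classes, the action names, the selectors, the example URL, and the JSON storage format are all hypothetical, since the patent does not describe the template's internal format.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Step:
    action: str                    # "open", "search", "select_result", "select_content"
    locator: Optional[str] = None  # assumed node locator (e.g. a CSS selector)
    value: Optional[str] = None    # URL or typed search text, when applicable


@dataclass
class CrawlTemplate:
    name: str
    steps: List[Step] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)

    @staticmethod
    def from_json(text: str) -> "CrawlTemplate":
        raw = json.loads(text)
        return CrawlTemplate(raw["name"], [Step(**s) for s in raw["steps"]])


template = CrawlTemplate(
    name="news-search",
    steps=[
        Step("open", value="https://news.example.com"),        # fourth operation
        Step("search", locator="input#q", value="World Cup"),  # second webpage node
        Step("select_result", locator="ul.results > li"),      # third webpage node
        Step("select_content", locator="article p"),           # first webpage node
    ],
)
saved = template.to_json()
restored = CrawlTemplate.from_json(saved)
print([s.action for s in restored.steps])
```

Keeping the steps in recorded order preserves the level-by-level structure (open, search, pick a result, pick the content), so the same template can be replayed later against the page to be crawled.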
Considering that some webpages deploy anti-crawling settings, in an optional embodiment, the automatic crawling method further comprises: recording the interval times between the operations, and generating a data crawling template according to the webpage nodes corresponding to the operations and the interval times.
In this embodiment, the intervals between the user's mouse movements and mouse clicks while recording the crawling rule, and the intervals between the search operation, the search-item selection operation, and the content selection operation, are recorded, and the data crawling template is generated by combining these interval times with the operations corresponding to each webpage node. In other words, by recording the interval times together with the operations, this embodiment simulates the user's operating behavior so as to get past the anti-crawling settings deployed by some webpages: the recorded timing exhibits the characteristics of normal user behavior, which differ from those of automatic machine crawling, so that the crawling operation is not prevented by the anti-crawling settings deployed on the webpage. Because the data crawling template is generated using the intervals of the user's actual operations, this embodiment can automatically crawl various webpages, including webpages on which anti-crawling measures are set as in the related art, which further expands the application range of the automatic crawling application software and improves the user experience.
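The interval recording described above can be sketched as a recorder that stores, for each operation, the pause that preceded it, plus a replay routine that waits for the same pause. The names `IntervalRecorder` and `replay` and the injectable `clock` and `sleep` parameters are assumptions made so the sketch can be run deterministically; they are not taken from the patent.

```python
import time
from typing import Callable, List, Tuple


class IntervalRecorder:
    """Records each operation together with the pause that preceded it."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._last: float = clock()
        self.events: List[Tuple[str, float]] = []

    def record(self, operation: str) -> None:
        now = self._clock()
        self.events.append((operation, now - self._last))
        self._last = now


def replay(events, perform, sleep=time.sleep) -> None:
    """Re-run the operations, waiting the recorded interval before each one."""
    for operation, delay in events:
        sleep(delay)
        perform(operation)


# Deterministic demo with a fake clock instead of real waiting.
ticks = iter([0.0, 1.5, 4.0, 4.8])
recorder = IntervalRecorder(clock=lambda: next(ticks))
for op in ["search", "select_result", "select_content"]:
    recorder.record(op)

waits, done = [], []
replay(recorder.events, perform=done.append, sleep=waits.append)
print(waits)
```

In real use the default `time.monotonic` and `time.sleep` would be left in place, so the replayed operations are spaced the way the user originally spaced them.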
Considering that the webpage to be crawled may contain a designated control that cannot be operated directly, and in order to further improve crawling performance, in an optional embodiment, the automatic crawling method further comprises: calling the script corresponding to the control in response to an operation of the user, and generating a data crawling template.
In this embodiment, for example, when a control involving page scrolling cannot be crawled through the configured webpage nodes, the control is crawled by embedding custom script code in the automatic crawling application software. Specifically, the data of such a control scrolls on the page in real time; if a webpage node were used for crawling, only part of the data could be obtained. By embedding custom script code to parse the control, for example to automatically parse the page structure and data, the complete content is obtained and all content data is crawled, which effectively improves crawling performance and the breadth of application.
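The idea of calling a control-specific script when node-based crawling would only see part of the data can be sketched as a small registry. Everything here is hypothetical: the `SCRIPTS` registry, the `script_for` decorator, and the dictionary used to stand in for a scrolling control are illustrative assumptions, not the patent's mechanism.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping a control type to a custom extraction script.
SCRIPTS: Dict[str, Callable[[dict], List[str]]] = {}


def script_for(control_type: str):
    def register(func):
        SCRIPTS[control_type] = func
        return func
    return register


@script_for("scrolling_list")
def read_scrolling_list(control: dict) -> List[str]:
    # Read the control's full backing data rather than only the visible rows.
    return list(control["all_items"])


def crawl_control(control: dict) -> List[str]:
    """Use a custom script if one is registered, else fall back to visible data."""
    script = SCRIPTS.get(control["type"])
    if script is not None:
        return script(control)
    return list(control["visible_items"])


ticker = {
    "type": "scrolling_list",
    "visible_items": ["item 3", "item 4"],
    "all_items": ["item 1", "item 2", "item 3", "item 4", "item 5"],
}
print(crawl_control(ticker))
```

A control type without a registered script falls back to what is currently visible, which mirrors the partial result the text says a plain webpage node would obtain.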
Considering that the webpage to be crawled may be a video webpage, in an optional embodiment, the automatic crawling method further comprises: acquiring the links of all videos in the video webpage in response to an operation of the user, and generating a data crawling template.
In this embodiment, when the webpage to be crawled is a video webpage, that is, a webpage containing a plurality of videos, the webpage is identified as a video webpage according to the webpage type selected by the user, and no search operation, search-item selection operation, or content selection operation is required. The address links of the videos in the video webpage are acquired directly as the crawling result, and the operation of acquiring the address links is recorded to generate a data crawling template, which further broadens the applicability of the automatic crawling application software.
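As a concrete illustration of collecting video address links from a page, the sketch below uses Python's standard `html.parser` module. It assumes the videos appear as `<video src=...>` tags or as `<source>` tags nested inside `<video>`; real video pages may expose links differently, and the class name `VideoLinkCollector` and the example URLs are hypothetical.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class VideoLinkCollector(HTMLParser):
    """Collects the src of every <video> tag and of <source> tags nested in one."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []
        self._in_video = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "video":
            self._in_video = True
        if tag == "video" or (tag == "source" and self._in_video):
            src = attrs.get("src")
            if src:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag == "video":
            self._in_video = False


def collect_video_links(html: str, base_url: str) -> List[str]:
    parser = VideoLinkCollector(base_url)
    parser.feed(html)
    return parser.links


page = """
<video src="/media/a.mp4"></video>
<video><source src="https://cdn.example.com/b.mp4"></video>
<audio><source src="/media/c.mp3"></audio>
"""
print(collect_video_links(page, "https://video.example.com/list"))
```

The `<source>` tag inside `<audio>` is skipped because the collector only accepts sources while it is inside a `<video>` element.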
In this embodiment, a data crawling template is generated by recording the operation of the user and is stored, so that data crawling in various application scenarios can be implemented.
In the third step, the data crawling template is run in response to a third operation of the user so as to automatically crawl the webpage whose data is to be crawled.
In this embodiment, the data crawling template that the user generated by operating the automatic crawling application software is called in response to the user's operation on the software, and the operations recorded in the template are then performed automatically on the webpages to be crawled. This effectively lowers the professional threshold for using crawler technology, solves the problems existing in the related art, and has practical application value.
Considering the storage space available to the automatic crawling application software, in an optional embodiment, the automatic crawling method comprises: running the data crawling template in response to a first crawling mode selected by the user so as to automatically crawl the webpage whose data is to be crawled and output the crawled data.
In this embodiment, for example, an interface mode with interface parameters is selected. When an interface request containing the interface parameters is received, the data crawling template is called according to the interface parameters, the webpage data is crawled automatically, and the crawled data is output directly. The crawling result can thus be obtained in real time, the storage space required on the crawling hardware is reduced, and the investment cost is lowered.
Also considering the storage space available to the automatic crawling application software, in another optional embodiment, the automatic crawling method comprises: running the data crawling template in response to a second crawling mode selected by the user so as to automatically crawl the webpage whose data is to be crawled and store the crawled data.
In this embodiment, the persistence mode is selected: the data crawling template is called to automatically crawl the webpage data, and the crawled data is saved directly in the storage space of the running hardware. This mode is suitable when the amount of data to crawl is large, the crawling time is long, and the storage space is large; the crawled data can be viewed directly on the running hardware or exported through an external interface.
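The difference between the two crawling modes can be sketched as one function with a mode switch. This is an illustration only: the mode names `"interface"` and `"persist"`, the function `run_template`, the stand-in `fake_crawl`, and the JSON Lines storage file are assumptions; the patent does not specify the interface or the storage format.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List, Optional


def run_template(
    crawl: Callable[[], List[Dict[str, str]]],
    mode: str,
    store_path: Optional[str] = None,
) -> Optional[List[Dict[str, str]]]:
    """Run a crawl and either return the data ("interface") or save it ("persist")."""
    records = crawl()
    if mode == "interface":
        return records  # handed straight back to the caller, nothing stored
    if mode == "persist":
        if store_path is None:
            raise ValueError("persist mode needs a store_path")
        with open(store_path, "a", encoding="utf-8") as fh:
            for record in records:  # one JSON object per line
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        return None
    raise ValueError(f"unknown mode: {mode}")


def fake_crawl() -> List[Dict[str, str]]:
    # Stand-in for running a recorded data crawling template.
    return [{"title": "Group B match report"}]


print(run_template(fake_crawl, "interface"))

path = os.path.join(tempfile.mkdtemp(), "crawl.jsonl")
run_template(fake_crawl, "persist", store_path=path)
with open(path, encoding="utf-8") as fh:
    stored = [json.loads(line) for line in fh]
print(stored)
```

The interface branch keeps nothing on disk, which matches the stated aim of reducing storage on the crawling hardware, while the persist branch appends results so that long-running crawls can accumulate data for later viewing or export.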
This completes the automatic crawling of the webpage data.
Corresponding to the automatic crawling method provided by the above embodiment, an embodiment of the present application further provides an automatic crawling apparatus applying the above automatic crawling method, as shown in fig. 2, where the automatic crawling apparatus includes a starting unit, a recording unit, and a crawling unit, where the starting unit is configured to start automatic crawling application software in response to a first operation of a user, and the automatic crawling application software is generated based on a cross-platform desktop application development framework; the recording unit is configured to respond to a second operation of the user to perform crawling rule recording on a target webpage and generate a data crawling template; the crawling unit is configured to respond to a third operation of the user to run the data crawling template so as to automatically crawl a webpage of data to be crawled.
The automatic crawling apparatus provided by this embodiment starts the automatic crawling application software through the starting unit; through the recording unit, it uses the visual user interface provided by the automatic crawling application software to record crawling rules and obtain a data crawling template; and through the crawling unit, it calls the data crawling template to crawl webpages automatically, so that automatic crawling is realized flexibly and simply. In particular, because the automatic crawling application software is generated with a cross-platform desktop application development framework, it can support multiple operating systems and has broad application prospects. Since the automatic crawling apparatus provided in this embodiment corresponds to the automatic crawling methods provided in the foregoing embodiments, the foregoing embodiments also apply to the apparatus and are not described in detail again here.
Another embodiment of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements: starting automatic crawling application software in response to a first operation of a user, wherein the automatic crawling application software is generated based on a cross-platform desktop application development framework; responding to a second operation of the user to perform crawling rule recording on the target webpage and generate a data crawling template; and responding to a third operation of the user to operate the data crawling template so as to automatically crawl the webpage of the data to be crawled.
In practice, the computer-readable storage medium may take any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present embodiment, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
Another embodiment of the present invention provides a computer device, a schematic structural diagram of which is shown in FIG. 3. The computer device 12 shown in FIG. 3 is only an example and should not impose any limitation on the functionality or scope of use of embodiments of the present invention.
As shown in FIG. 3, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 3, and commonly referred to as a "hard drive"). Although not shown in FIG. 3, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including but not limited to an operating system, one or more application programs, other program modules, and program data, each of which or some combination of which may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of embodiments of the invention as described.
Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any devices (e.g., network card, modem, etc.) that enable computer device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, computer device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) through network adapter 20. As shown in FIG. 3, the network adapter 20 communicates with the other modules of the computer device 12 via the bus 18. It should be understood that although not shown in FIG. 3, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, to name a few.
The processing unit 16 executes various functional applications and data processing by running a program stored in the system memory 28, for example, to implement the automatic crawling method provided by an embodiment of the present invention.
It should be understood that the above embodiments of the present invention are merely examples given to illustrate the invention clearly and are not intended to limit its implementations. Those skilled in the art may make other variations or modifications on the basis of the above description; it is not possible to list all implementations exhaustively here, and all obvious variations or modifications derived from the technical solutions of the present invention remain within its scope of protection.

Claims (12)

1. An automatic crawling method for webpage data is characterized by comprising the following steps:
starting automatic crawling application software in response to a first operation of a user, wherein the automatic crawling application software is generated based on a cross-platform desktop application development framework;
responding to a second operation of the user to record a crawling rule on a target webpage and generate a data crawling template;
and responding to a third operation of the user to run the data crawling template so as to automatically crawl the webpage of the data to be crawled.
2. The automatic crawling method of claim 1, wherein said recording crawling rules at the target webpage and generating a data crawling template in response to the second operation of the user further comprises:
responding to a fourth operation of the user to receive the address of the target webpage and opening the target webpage;
and responding to a fifth operation of the user to select a first webpage node needing to be crawled in the target webpage, and recording the first webpage node to generate a data crawling template.
3. The automatic crawling method according to claim 2,
before selecting a first webpage node to be crawled in the target webpage in response to a fifth operation of the user and recording the first webpage node to generate a data crawling template, the method further comprises the following steps:
responding to a sixth operation of the user, inputting search content in a search box in the page of the target webpage, searching, and recording a second webpage node corresponding to the sixth operation;
responding to a seventh operation of the user to select a search item in the search list, and recording a third webpage node corresponding to the seventh operation;
the selecting a first webpage node needing to be crawled in the target webpage in response to a fifth operation of the user and recording the first webpage node to generate the data crawling template further comprises: and generating a data crawling template according to the first webpage node, the second webpage node and the third webpage node.
4. The automatic crawling method according to claim 3, further comprising: recording a plurality of interval times among the fourth operation, the sixth operation, the seventh operation and the fifth operation;
the selecting a first webpage node needing to be crawled in the target webpage in response to a fifth operation of the user and recording the first webpage node to generate the data crawling template further comprises: and generating a data crawling template according to the first webpage node, the second webpage node, the third webpage node and the plurality of interval times.
5. The automatic crawling method of claim 4, wherein said recording crawling rules at the target webpage and generating a data crawling template in response to the second operation of the user further comprises:
replaying the corresponding operation after each of the fourth operation, the sixth operation, the seventh operation and the fifth operation, and either repeating the corresponding operation in response to a re-recording selection by the user or confirming the recording in response to a confirmation by the user.
6. The automatic crawling method according to claim 3, wherein the first, second and third webpage nodes are each one of a preset node list, and the node list comprises a plurality of types of webpage nodes.
7. The automatic crawling method of claim 1, wherein said recording crawling rules at the target webpage and generating a data crawling template in response to the second operation of the user further comprises:
the target webpage is a video webpage comprising a plurality of videos, and links of the videos are obtained in response to an eighth operation of the user and a data crawling template is generated.
8. The automatic crawling method according to claim 1, wherein said recording crawling rules on the target web page and generating data crawling templates in response to the second operation of the user further comprises:
the target webpage comprises a control of a preset type, and a script corresponding to the control is called in response to a ninth operation of the user and a data crawling template is generated.
9. The automatic crawling method of claim 1, wherein the running the data crawling template in response to a third operation of a user to automatically crawl a webpage to be crawled further comprises:
running the data crawling template in response to a first crawling mode selected by the user, so as to automatically crawl the webpage of data to be crawled and output the crawled data;
or
running the data crawling template in response to a second crawling mode selected by the user, so as to automatically crawl the webpage of data to be crawled and store the crawled data.
10. An automatic crawling apparatus using the automatic crawling method according to any one of claims 1 to 9, comprising a starting unit, a recording unit and a crawling unit, wherein
the starting unit is configured to start automatic crawling application software in response to a first operation of a user, wherein the automatic crawling application software is generated based on a cross-platform desktop application development framework;
the recording unit is configured to respond to a second operation of the user to perform crawling rule recording on a target webpage and generate a data crawling template;
the crawling unit is configured to respond to a third operation of the user to run the data crawling template so as to automatically crawl a webpage of data to be crawled.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out an automatic crawling method according to any one of claims 1 to 9.
12. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the auto-crawling method according to any one of claims 1 to 9 when executing said program.
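
As a non-limiting illustration of the template described in claims 2 to 4 and the two crawling modes of claim 9, the following Python sketch records the four user operations as steps with the interval time before each one, serializes them, and replays them. The claims do not specify a template format; the RecordedStep fields, the JSON serialization, the action names ("open", "search", "select_item", "extract"), the mode names ("output", "store"), and the perform callable are all assumptions made for this sketch only.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class RecordedStep:
    action: str           # assumed names: "open", "search", "select_item", "extract"
    target: str           # URL for "open", otherwise a webpage node locator
    value: str = ""       # text typed into a search box, if any
    delay_s: float = 0.0  # interval time recorded before this step


def build_template(events: List[Tuple[float, str, str, str]]) -> List[RecordedStep]:
    """Turn (timestamp, action, target, value) events into steps with intervals."""
    steps: List[RecordedStep] = []
    prev: Optional[float] = None
    for ts, action, target, value in events:
        delay = 0.0 if prev is None else max(0.0, ts - prev)
        steps.append(RecordedStep(action, target, value, delay))
        prev = ts
    return steps


def save_template(steps: List[RecordedStep]) -> str:
    """Serialize the template; JSON is an assumed format."""
    return json.dumps([asdict(s) for s in steps])


def load_template(text: str) -> List[RecordedStep]:
    return [RecordedStep(**d) for d in json.loads(text)]


def run_template(
    steps: List[RecordedStep],
    perform: Callable[[RecordedStep], Optional[str]],
    mode: str = "output",
    store: Optional[List[str]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Replay steps with their recorded intervals and collect extracted data.

    In "output" mode the data is only returned; in "store" mode it is also
    appended to the caller-supplied store list.
    """
    if mode not in ("output", "store"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "store" and store is None:
        raise ValueError("store mode requires a store list")
    results: List[str] = []
    for step in steps:
        sleep(step.delay_s)
        data = perform(step)  # perform is an assumed stand-in for real page interaction
        if step.action == "extract" and data is not None:
            results.append(data)
    if mode == "store":
        store.extend(results)
    return results
```

Replaying the recorded interval times, rather than running the steps back to back, is shown here only because claim 4 includes those intervals in the template; how the intervals are actually used at crawl time is not stated in the claims.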
CN202211668127.1A 2022-12-23 2022-12-23 Automatic crawling method and device for webpage data, computer equipment and medium Pending CN115905661A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211668127.1A CN115905661A (en) 2022-12-23 2022-12-23 Automatic crawling method and device for webpage data, computer equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211668127.1A CN115905661A (en) 2022-12-23 2022-12-23 Automatic crawling method and device for webpage data, computer equipment and medium

Publications (1)

Publication Number Publication Date
CN115905661A true CN115905661A (en) 2023-04-04

Family

ID=86479563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211668127.1A Pending CN115905661A (en) 2022-12-23 2022-12-23 Automatic crawling method and device for webpage data, computer equipment and medium

Country Status (1)

Country Link
CN (1) CN115905661A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination