CN111090797B - Data acquisition method, device, computer equipment and storage medium - Google Patents

Publication number
CN111090797B
Authority
CN
China
Prior art keywords
webpage
target
path information
elements
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911198993.7A
Other languages
Chinese (zh)
Other versions
CN111090797A (en
Inventor
张冠龙
孙慧生
高勇
蒋旭曦
朱宏雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suning Cloud Computing Co Ltd
Original Assignee
Suning Cloud Computing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suning Cloud Computing Co Ltd
Priority to CN201911198993.7A
Publication of CN111090797A
Application granted
Publication of CN111090797B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to a data acquisition method and device for web page elements, computer equipment, and a storage medium. The method comprises the following steps: acquiring first web page element path information of a first target web page; when at least two web page elements of the same type in the first target web page are triggered, acquiring path information of the triggered first web page elements; acquiring first similar path information with a similar path structure according to the path information of the first web page elements; and determining a plurality of first target elements in the first target web page according to the first similar path information and the first web page element path information, and acquiring web page data of the plurality of first target elements. The method can acquire the web page data of target elements in batches across web pages with different web page structures.

Description

Data acquisition method, device, computer equipment and storage medium
Technical Field
The present invention relates to the field of web page element processing technologies, and in particular, to a method and apparatus for acquiring data of a web page element, a computer device, and a storage medium.
Background
With the popularity of browsers, more and more web applications have emerged. Web applications contain a large amount of valuable web page data, for example, commodity list data on e-commerce websites, blog article list data, and microblog trending data. Different web pages have different web page structures, and how to obtain such web page data in batches is a problem to be solved in web page data crawling.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, an apparatus, a computer device, and a storage medium for acquiring data of a web page element, which are capable of acquiring web page data of a target element in batches for web page structures of different web pages.
A method for obtaining data of a web page element, the method comprising: acquiring first webpage element path information of a first target webpage; when at least two web page elements of the same type in the first target web page are triggered, acquiring path information of the triggered first web page elements; acquiring first similar path information with similar path structures according to the path information of the first webpage element; and determining a plurality of first target elements in the first target webpage according to the first similar path information and the first webpage element path information, and acquiring webpage data of the plurality of first target elements.
In one embodiment, obtaining the first web page element path information of the first target web page includes: traversing the DOM tree structure of the first target webpage, and generating first webpage element path information according to the traversing result.
In one embodiment, the method for acquiring data of a web page element further includes: acquiring a boundary value of the first webpage element according to the coordinate value of the first webpage element in the first target webpage; and generating a mask layer for the first webpage element according to the boundary value. In this case, acquiring the path information of the triggered first webpage element includes: acquiring the path information of the triggered first webpage element according to the mask layer.
In one embodiment, the method for acquiring data of a web page element further includes: acquiring page turning information in a first target webpage; acquiring a second target webpage according to the page turning information; acquiring second webpage element path information of a second target webpage; when at least two webpage elements of the same type in the second target webpage are triggered, acquiring path information of the triggered second webpage elements; acquiring second similar path information with similar path structures according to the triggered path information of the second webpage element; and determining a plurality of second target elements in the second target webpage according to the second similar path information and the second webpage element path information, and acquiring webpage data of the plurality of second target elements.
In one embodiment, determining a plurality of first target elements in the first target web page according to the first similar path information and the first web page element path information includes: acquiring, from the first target webpage, the peer elements and the parent element of the triggered first webpage elements of the same type according to the first similar path information and the first webpage element path information, and taking the triggered first webpage elements of the same type, the peer elements, and the parent element respectively as the first target elements.
In one embodiment, obtaining web page data of a plurality of first target elements includes: acquiring configuration information of the first target webpage, wherein the configuration information indicates the preset parameters whose data is to be extracted from the webpage elements of the first target webpage; and acquiring the webpage data of the plurality of first target elements according to the configuration information.
In one embodiment, the preset parameters include text parameters and/or link parameters, and acquiring the webpage data of the plurality of first target elements according to the configuration information includes: acquiring text data and/or link data in the plurality of first target elements according to the configuration information, wherein the webpage data comprises the text data and the link data.
A data acquisition device for web page elements, the device comprising: the first acquisition module is used for acquiring first webpage element path information of a first target webpage; the second acquisition module is used for acquiring path information of the triggered first webpage elements when at least two webpage elements of the same type in the first target webpage are triggered; the third acquisition module is used for acquiring first similar path information with similar path structures according to the path information of the first webpage element; and the fourth acquisition module is used for determining a plurality of first target elements in the first target webpage according to the first similar path information and the first webpage element path information and acquiring webpage data of the plurality of first target elements.
A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method of any of the embodiments described above.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method of any of the embodiments described above.
With the above data acquisition method and device for web page elements, computer equipment, and storage medium, the first web page element path information of the first target web page is acquired. When at least two web page elements of the same type in the first target web page are triggered, the path information of the triggered first web page elements is acquired, and first similar path information with a similar path structure is acquired according to that path information. A plurality of first target elements in the first target web page are then determined according to the first similar path information and the first web page element path information, and finally the web page data of the plurality of first target elements is acquired. In this way, for web pages with different web page structures, the web page data of elements of the same type in a web page can be acquired in batches from the web page element path information of the web page and the path information of elements of the same type in the web page.
Drawings
FIG. 1 is an application environment diagram of a method for data acquisition of web page elements in one embodiment;
FIG. 2 is a flowchart of a method for obtaining data of a web page element according to an embodiment;
FIG. 3 is a flowchart of a method for obtaining data of a web page element according to another embodiment;
FIG. 4 is a schematic diagram of an interface of an RPA designer in one embodiment;
FIG. 5 is an interface diagram of the web page interface corresponding to FIG. 4;
FIG. 6 is a schematic diagram of an interface of an RPA designer in another embodiment;
FIG. 7 is an interface diagram of the web page interface corresponding to FIG. 6;
FIG. 8 is an interface schematic of an RPA designer in yet another embodiment;
FIG. 9 is a diagram of an interface of a target web page in one embodiment;
FIG. 10 is a schematic diagram of an interface of a target web page according to another embodiment;
FIG. 11 is a block diagram illustrating an exemplary configuration of a data acquisition device for web page elements;
FIG. 12 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The data acquisition method of the webpage element is applied to the application environment shown in fig. 1. The server 110 is configured to implement the method for acquiring data of a web page element in the present application. The server 110 may be a computer device supporting the operation of an RPA (Robotic Process Automation) designer. The server 110 is communicatively connected to the terminal device 120, which is a user terminal device used by consumers of web page data. The terminal device 120 may present web pages with different web page structures. When the terminal device 120 displays the first target web page 121, the server 110 obtains the first web page element path information of the first target web page 121. When a user triggers at least two web page elements of the same type in the first target web page 121, for example, two titles of the same title type, the server 110 obtains the path information of the triggered first web page elements, obtains first similar path information with a similar path structure according to that path information, determines a plurality of first target elements in the first target web page according to the first similar path information and the first web page element path information, and finally obtains the web page data of the plurality of first target elements. The server 110 is also communicatively connected to the terminal device 130, which is a terminal device used by a developer who processes the web page data. The developer uses the terminal device 130 to perform corresponding operations on the server 110. The web page data of the plurality of first target elements obtained by the server 110 may be displayed in the display interface 131 of the terminal device 130 for the developer to preview.
The server 110 may be implemented as a server cluster formed by a plurality of servers, and the terminal device 120 may be a notebook computer, a desktop computer, or a mobile device.
In one embodiment, as shown in fig. 2, a method for acquiring data of a web page element is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
S101, acquiring first webpage element path information of a first target webpage.
In this embodiment, when the terminal device opens the first target web page, the server obtains the first web page element path information of the first target web page from the terminal device. For example, the server may parse the DOM (Document Object Model) tree structure of the first target web page and obtain the first web page element path information according to that DOM tree structure. The first webpage element path information identifies the path information of all webpage elements in the first target webpage.
In one embodiment, step S101 includes: traversing the DOM tree structure of the first target webpage, and generating first webpage element path information according to the traversing result.
In this embodiment, the server generates the first web page element path information of the first target web page by traversing the DOM tree structure of the first target web page. Specifically, when the user clicks the target element A and the target element B in the first target web page with the mouse on the terminal, the server traverses the DOM tree of the entire first target web page layer by layer, so as to generate first web page element path information that uniquely identifies the first target web page. For example, the first web page element path information is html->body->div->table.
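The traversal described above can be sketched as follows. This is a minimal illustration only, assuming the page is available as an HTML string and using Python's standard `html.parser`; the names `PathCollector` and `collect_paths` and the choice to index every path step (for example `tr[2]`) are assumptions made for the example, not details given in the application.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (illustrative subset, an assumption).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Records one path string per element while walking the document."""

    def __init__(self):
        super().__init__()
        self.stack = []       # path steps of the currently open elements
        self.counters = [{}]  # per open element: how many children of each tag so far
        self.paths = []       # collected element paths, in document order

    def handle_starttag(self, tag, attrs):
        counts = self.counters[-1]
        counts[tag] = counts.get(tag, 0) + 1
        step = f"{tag}[{counts[tag]}]"
        self.paths.append("->".join(self.stack + [step]))
        if tag not in VOID_TAGS:
            self.stack.append(step)
            self.counters.append({})

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Only pop when the closing tag matches the innermost open element.
        if self.stack and self.stack[-1].split("[")[0] == tag:
            self.stack.pop()
            self.counters.pop()


def collect_paths(html):
    """Return the path of every element in the given HTML string."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.paths
```

In this sketch each step carries a 1-based index among siblings with the same tag, which keeps every path unique; a real page would more likely be traversed through the browser's DOM rather than re-parsed from text.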
S103, when at least two web page elements of the same type in the first target web page are triggered, acquiring path information of the triggered first web page elements.
In this embodiment, when at least two web page elements of the same type in the first target web page are triggered, the server obtains the path information of the triggered first web page elements. The at least two web page elements of the same type may be triggered sequentially or simultaneously; the server only needs to detect that at least two web page elements of the same type are in a triggered state. Web page elements of the same type are web page elements identified as the same type in the first target web page. There are a plurality of triggered first web page elements. A web page element may be triggered manually in the first target web page on the terminal device. Alternatively, after the server reads the first target web page, a developer may trigger web page elements in the first target web page on the server through a terminal device communicatively connected to the server. In addition, since the server can directly read the path information of all web page elements in the first target web page from the terminal device displaying the first target web page, the server can directly read the path information of a first web page element when that element is triggered.
S105, obtaining first similar path information with similar path structures according to the path information of the first webpage element.
In this embodiment, there are a plurality of triggered first webpage elements, and the server obtains the first similar path information according to their path information. The first similar path information comprises the path information with a similar path structure among the path information of the plurality of first webpage elements. For example, the path information of the first web page element A is html->body->div->table->tr[1], and the path information of the first web page element B is html->body->div->table->tr[2]. In this case, the first similar path information includes html->body->div->table->tr.
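One way to derive the similar path from two triggered paths is to compare them step by step and drop the index wherever only the index differs. The sketch below assumes paths are strings joined by `->` with optional `[n]` indexes, as in the example above; the function names `split_step` and `similar_path` are hypothetical.

```python
import re


def split_step(step):
    """Split 'tr[2]' into ('tr', 2); a step without an index gives (tag, None)."""
    match = re.fullmatch(r"([^\[\]]+)(?:\[(\d+)\])?", step)
    if match is None:
        raise ValueError(f"malformed path step: {step!r}")
    tag, index = match.groups()
    return tag, (int(index) if index is not None else None)


def similar_path(path_a, path_b, sep="->"):
    """Return the shared path structure of two paths, or None if they differ in shape."""
    steps_a, steps_b = path_a.split(sep), path_b.split(sep)
    if len(steps_a) != len(steps_b):
        return None
    result = []
    for a, b in zip(steps_a, steps_b):
        tag_a, idx_a = split_step(a)
        tag_b, idx_b = split_step(b)
        if tag_a != tag_b:
            return None  # different tags at the same depth: not the same structure
        # Keep the step as written when the indexes agree, otherwise drop the index.
        result.append(a if idx_a == idx_b else tag_a)
    return sep.join(result)
```

Treating two paths as similar only when they have the same depth and the same tag at every step is an assumption of this sketch; the application does not define the similarity test more precisely.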
S107, determining a plurality of first target elements in the first target webpage according to the first similar path information and the first webpage element path information, and acquiring webpage data of the plurality of first target elements.
In this embodiment, the server determines a plurality of first target elements in the first target web page according to the first similar path information and the first web page element path information. For example, all path information matching the first similar path information may be obtained from the first webpage element path information, and the webpage elements corresponding to all the matched path information are then taken as the first target elements. For example, the first similar path information includes html->body->div->table->tr. In this case, all webpage elements whose path information in the first webpage element path information has a path structure prefixed by html->body->div->table->tr are taken as the first target elements. Finally, the webpage data of all the first target elements is acquired, so that the webpage data of all elements with similar paths in the first target webpage is acquired in batches.
In a specific implementation, the first target webpage is a webpage list. The target element A and the target element B under the webpage list are clicked with a mouse; they are the triggered first webpage elements of the same type. The server traverses the DOM tree of the whole webpage list upward layer by layer and generates the first element path information that uniquely identifies the webpage list: html->body->div->table. The path information of the target element A is html->body->div->table->tr[1], and the path information of the target element B is html->body->div->table->tr[2]. The first similar path information obtained is therefore html->body->div->table->tr. According to the first element path information and the first similar path information of the web page list, the elements of all similar paths under the web page list, that is, the plurality of first target elements, can be retrieved. Finally, the webpage data of the plurality of first target elements is obtained, so that the webpage data of the webpage elements with similar paths in the webpage list is obtained in batches.
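The matching step can be illustrated by turning the similar path into a pattern and filtering the page's element paths with it. The following is a sketch under the assumption that only elements at the same depth as the triggered ones are wanted; `path_pattern` and `find_targets` are hypothetical names.

```python
import re


def path_pattern(similar, sep="->"):
    """Compile a similar path into a regex in which un-indexed steps accept any index."""
    parts = []
    for step in similar.split(sep):
        if step.endswith("]"):
            parts.append(re.escape(step))                    # index fixed by the similar path
        else:
            parts.append(re.escape(step) + r"(?:\[\d+\])?")  # any index, or none
    return re.compile(re.escape(sep).join(parts))


def find_targets(similar, all_paths):
    """Return every element path whose structure matches the similar path."""
    pattern = path_pattern(similar)
    return [p for p in all_paths if pattern.fullmatch(p)]
```

For example, with the similar path `html->body->div->table->tr`, every `tr[n]` row path under the table is returned, while the table itself and the cells inside the rows are not.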
In the above data acquisition method of webpage elements, the first webpage element path information of the first target webpage is acquired. When at least two web page elements of the same type in the first target web page are triggered, the path information of the triggered first web page elements is acquired, and first similar path information with a similar path structure is acquired according to that path information. A plurality of first target elements in the first target webpage are then determined according to the first similar path information and the first webpage element path information, and finally the webpage data of the plurality of first target elements is acquired. In this way, for web pages with different web page structures, the web page data of elements of the same type in a web page can be acquired in batches from the web page element path information of the web page and the path information of elements of the same type in the web page. In addition, because the method works by analyzing the web page element path information of the web page, it is suitable for acquiring web page data in batches from any web page structure.
In one embodiment, before step S103, the method further includes the steps of: acquiring a boundary value of the first webpage element according to the coordinate value of the first webpage element in the first target webpage; and generating a mask layer for the first webpage element according to the boundary value. At this time, step S103 includes: and acquiring the path information of the triggered first webpage element according to the mask layer.
In this embodiment, the location of each web page element in the first target web page is identified by means of a coordinate system. When a first webpage element in the first target webpage is triggered, the coordinate value of the first webpage element in the first target webpage is acquired, and the boundary value of the first webpage element is acquired according to the coordinate value. A mask layer for the first web page element is then generated according to the boundary value, so that no link jump occurs when content with a jump link attribute in the first web page element is triggered. The server acquires the path information of the triggered first webpage element according to the mask layer. Specifically, the server obtains the path information of the first webpage element when it identifies the mask layer of that element. If the first webpage element contains content with a jump link attribute, the path information can still be acquired when the element is triggered, which avoids the situation in which the element jumps when triggered and its path information cannot be acquired.
In a specific implementation, when the mouse slides across a webpage element of the first target webpage, namely the triggered first webpage element, the boundary value of the rectangular frame of the first webpage element is obtained according to the (x, y) coordinate values of the element in the current first target webpage (the coordinates of the first target webpage are two-dimensional coordinates on an x axis and a y axis). A mask layer for the first webpage element is generated from the boundary value. By drawing a frame over the first webpage element, the mask layer ensures that target content with an href attribute (the attribute that specifies the URL of a hyperlink target) does not jump on click when the mouse slides over the first webpage element to capture the target content.
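The boundary and mask-layer idea can be sketched as a rectangle computed from an element's coordinates plus a hit test that resolves a pointer position to an element path without touching the underlying link. This is an illustration only: the element width and height, the `Boundary` class, and the rule of preferring the smallest enclosing mask are assumptions of the example rather than details stated in the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Boundary:
    """Rectangular frame of an element in page coordinates."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x, y):
        return self.left <= x <= self.right and self.top <= y <= self.bottom

    def area(self):
        return (self.right - self.left) * (self.bottom - self.top)


def boundary_from_coords(x, y, width, height):
    """Build the boundary from the element's top-left (x, y) and its size (assumed known)."""
    return Boundary(left=x, top=y, right=x + width, bottom=y + height)


def element_under_pointer(masks, x, y):
    """masks maps element path -> Boundary; return the path of the smallest mask hit, or None."""
    hits = [(boundary, path) for path, boundary in masks.items() if boundary.contains(x, y)]
    if not hits:
        return None
    # The innermost element is assumed to be the one with the smallest frame.
    return min(hits, key=lambda hit: hit[0].area())[1]
```

Because the pointer event is resolved against the mask rectangles, the path of a link element can be returned without the link itself ever receiving the click.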
In one embodiment, as shown in fig. 3, after step S107, the method further includes the steps of:
S109, acquiring page turning information in the first target webpage.
S111, acquiring a second target webpage according to the page turning information.
S113, obtaining second webpage element path information of a second target webpage.
S115, when at least two webpage elements of the same type in the second target webpage are triggered, acquiring path information of the triggered second webpage elements.
S117, obtaining second similar path information with similar path structures according to the triggered path information of the second webpage element.
S119, determining a plurality of second target elements in the second target webpage according to the second similar path information and the second webpage element path information, and acquiring webpage data of the plurality of second target elements.
In this embodiment, the first target web page includes page turning information. The page turning information is used to instruct the web page to jump from the current web page to another web page. The page turning information may be jump link instruction information. The server acquires a second target webpage according to the page turning information in the first target webpage, and further performs operations similar to the steps S101 to S107 for the second target webpage so as to acquire webpage data of target elements in the second target webpage. Specifically, when at least two web page elements of the same type in the second target web page are triggered, the server acquires path information of the triggered second web page element, and acquires second similar path information with similar path structures according to the path information of the triggered second web page element. And finally, determining a plurality of second target elements in the second target webpage according to the second similar path information and the second webpage element path information, and acquiring webpage data of the plurality of second target elements.
For example, web page data of a target element in a first target web page is obtained via one or more entry addresses, such as the article list address of the first web page: https://www.cnblogs.com/#p1. Specifically, the crawling of the web page data of the article list of the first web page is completed according to steps S101 to S107. The next-level web page, namely the second target web page, is then entered according to the page turning information of the entry web page, namely the first web page, for example the link https://www.cnblogs.com/#p2. Steps S109 to S119 are performed to capture the web page data of the target elements in the second target web page. This process repeats until all the page turning information has been processed.
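The page-turning loop can be sketched as follows. To keep the example runnable without network access, pages are modelled as dictionaries with a hypothetical `"next"` key holding the page turning link, and the caller supplies `load_page` and `extract` functions; all of these names, the visited-address set, and the `max_pages` guard are assumptions added for illustration.

```python
def crawl_pages(entry_url, load_page, extract, max_pages=100):
    """Extract data from the entry page, then follow page turning links until none remain."""
    results = []
    seen = set()   # addresses already processed, to avoid looping forever on a cycle
    url = entry_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = load_page(url)           # hypothetical loader: address -> page dict
        results.extend(extract(page))   # per-page extraction of the target elements
        url = page.get("next")          # page turning information, None on the last page
    return results
```

A usage example with in-memory pages: with `pages = {"p1": {"items": ["a", "b"], "next": "p2"}, "p2": {"items": ["c"], "next": None}}`, the call `crawl_pages("p1", pages.__getitem__, lambda page: page["items"])` collects the items of both pages in order.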
In one embodiment, step S107 includes: acquiring, from the first target webpage, the peer elements and the parent element of the triggered first webpage elements of the same type according to the first similar path information and the first webpage element path information, and taking the triggered first webpage elements of the same type, the peer elements, and the parent element respectively as the first target elements.
In this embodiment, the first similar path information is path information with similar path structure determined according to the triggered path information of the first web page element of the same type. The first webpage element path information is a set of path information of webpage elements of the first target webpage. And determining the peer elements and the parent elements of the triggered first webpage elements of the same type according to the first similar path information and the first webpage element path information. Specifically, all path information matched with similar path information is obtained from the path information of the first webpage element, and the peer element and the parent element of the first webpage element of the same type triggered are obtained from the first target webpage according to the all path information.
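As a rough illustration, the peer elements and the parent element of a triggered element can be read off the path strings alone. The sketch assumes that a peer is an element with the same parent path and the same tag, which is one possible reading of the embodiment; the helper names are hypothetical.

```python
import re


def parent_path(path, sep="->"):
    """Return the path of the parent element, or None for the root."""
    head, found, _ = path.rpartition(sep)
    return head if found else None


def tag_of(path, sep="->"):
    """Return the tag of the last step with any trailing index removed."""
    return re.sub(r"\[\d+\]$", "", path.rsplit(sep, 1)[-1])


def peers_and_parent(path, all_paths, sep="->"):
    """Return (peer paths, parent path) for the element at the given path."""
    parent = parent_path(path, sep)
    peers = [
        p for p in all_paths
        if p != path
        and parent_path(p, sep) == parent
        and tag_of(p, sep) == tag_of(path, sep)
    ]
    return peers, parent
```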
In one embodiment, step S107 includes: acquiring configuration information of a first target webpage, wherein the configuration information is used for indicating data for extracting preset parameters in webpage elements of the first target webpage; and acquiring the webpage data of the plurality of first target elements according to the configuration information.
In this embodiment, a web page element of the first target web page includes data corresponding to a plurality of parameters. In a specific implementation, an acquired target element is typically an HTML element carrying web page data for attributes such as the title attribute, the href attribute, and the class attribute, so the data that is finally acquired can be configured in advance. In this embodiment, the server obtains the web page data of the preset parameters in the plurality of first target elements according to the configuration information of the first target web page.
In one embodiment, the preset parameters include text parameters and/or link parameters. Acquiring the webpage data of the plurality of first target elements according to the configuration information includes: acquiring text data and/or link data in the plurality of first target elements according to the configuration information, wherein the webpage data comprises the text data and the link data.
In this embodiment, the preset parameters include text parameters and/or link parameters. The configuration information is used for indicating the webpage data for extracting the text parameters and/or the link parameters in the webpage elements of the first target webpage. And the server extracts text data and/or link data from the plurality of first target elements according to the indication parameters in the configuration information.
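Configuration-driven extraction can be sketched as follows. The example assumes each target element has already been reduced to a dictionary with a `"text"` string and an `"attrs"` mapping, and that the configuration is a dictionary with boolean `"text"` and `"link"` flags; these shapes and the name `extract_fields` are assumptions for illustration only.

```python
def extract_fields(elements, config):
    """Return one row per element containing only the fields enabled in the configuration."""
    rows = []
    for element in elements:
        row = {}
        if config.get("text"):
            row["text"] = element.get("text", "").strip()
        # The link is only captured when the element actually has an href attribute.
        if config.get("link") and "href" in element.get("attrs", {}):
            row["link"] = element["attrs"]["href"]
        rows.append(row)
    return rows
```

With both flags enabled, an element that has an href attribute yields a row with text and link, while an element without one yields a row with text only.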
For the data acquisition method of the above web page element, a specific implementation scenario is given below, so as to further detail the data acquisition method of the above web page element:
The server implementing the data acquisition method for web page elements is a computer device that supports running an RPA (Robotic Process Automation) designer. The RPA designer can therefore use the method to extract different types of web page data from different web pages. For example, from the commodity list of a conventional e-commerce website, it can extract related information such as commodity names, prices, descriptions, reviews, or sales volumes in a single pass. Specifically, as shown in fig. 4, the display interface of the RPA designer first prompts the developer to select data table 1 in the target web page. As shown in fig. 5, after opening the target web page, the developer selects the first title by triggering it with the mouse. After the RPA designer reads the path information and web page data of the first title, its display interface prompts the developer to select data table 2 in the target web page, as shown in fig. 6. As shown in fig. 7, the developer returns to the target web page and selects the second title by triggering it with the mouse, and the RPA designer reads the path information and web page data of the second title. The first title and the second title are the triggered first web page elements of the same type in the target web page. Once the RPA designer determines that the first title and the second title are web page elements of the same type, it can execute the data acquisition method to acquire the web page data of all title-type elements with similar path information in the target web page. In addition, the RPA designer may provide configuration options that let the developer choose which parameters to extract from the web page elements. As shown in fig. 8, the developer may check the parameters to be extracted, such as text parameters and link parameters. The RPA designer then extracts web page data from the target elements according to the parameters checked by the developer. For example, the developer may check a configuration that captures the text of each target element and, if an element has an href attribute, also check the link option so that the link is captured.
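The parameter selection described above can be sketched as follows. This is a minimal Python illustration rather than the implementation of the application: the `ExtractConfig` class, the `extract` function, and the dictionary representation of a web page element are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ExtractConfig:
    # Hypothetical flags mirroring the check boxes described for fig. 8.
    text: bool = True
    link: bool = False


def extract(element: Dict[str, Optional[str]], config: ExtractConfig) -> Dict[str, str]:
    """Return only the parameters that the developer checked."""
    row: Dict[str, str] = {}
    if config.text:
        row["text"] = (element.get("text") or "").strip()
    # The link is captured only if it is checked and the element has an href.
    href = element.get("href")
    if config.link and href:
        row["link"] = href
    return row


# Assumed sample data: two target elements, one without an href attribute.
elements: List[Dict[str, Optional[str]]] = [
    {"text": " Phone A ", "href": "/item/1"},
    {"text": "Phone B", "href": None},
]
rows = [extract(e, ExtractConfig(text=True, link=True)) for e in elements]
```

With both options checked, the first row carries a text and a link, while the second row carries only a text because its element has no href attribute.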
Further, to capture more types of web page data, the developer may click the continue-selection option and then select further web page elements in the target web page, as shown in figs. 9 and 10. For example, after the commodity titles are captured in the first pass, the commodity prices can be captured next.
In summary, the RPA designer provides a visual way to crawl web page data and to screen the crawled results, so that users (such as developers) can use the RPA designer more conveniently and efficiently. In addition, compared with a traditional crawler, which captures web page data using different regular-expression matching for different web pages, the data acquisition method used by the RPA designer has a wider range of application and makes the flow of web page data within an RPA process efficient and simple.
The application further provides a data acquisition device for a web page element, as shown in fig. 11, which includes a first acquisition module 10, a second acquisition module 20, a third acquisition module 30, and a fourth acquisition module 40.
The first obtaining module 10 is configured to obtain first web page element path information of a first target web page.
The second obtaining module 20 is configured to obtain path information of the triggered first webpage element when at least two webpage elements of the same type in the first target webpage are triggered.
The third obtaining module 30 is configured to obtain first similar path information with similar path structures according to the path information of the first web page element.
The fourth obtaining module 40 is configured to determine a plurality of first target elements in the first target web page according to the first similar path information and the first web page element path information, and obtain web page data of the plurality of first target elements.
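One way to picture the cooperation of the third and fourth acquisition modules is sketched below in Python. The way similarity is computed here is an assumption made for illustration: two triggered paths with the same tag sequence are generalized by turning every differing sibling index into a wildcard, and the resulting pattern is matched against all element paths of the page. The path syntax and the function names `parse`, `generalize`, and `matches` are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# One path step: (tag, 1-based sibling index, or None meaning "any index").
Step = Tuple[str, Optional[int]]


def parse(path: str) -> List[Step]:
    """Split a path such as /html[1]/body[1]/ul[1]/li[2] into steps."""
    steps: List[Step] = []
    for part in path.strip("/").split("/"):
        m = re.fullmatch(r"(\w+)(?:\[(\d+)\])?", part)
        if m is None:
            raise ValueError(f"unsupported step: {part!r}")
        steps.append((m.group(1), int(m.group(2)) if m.group(2) else 1))
    return steps


def generalize(a: str, b: str) -> Optional[List[Step]]:
    """Build a pattern from two triggered paths, or None if they differ in shape."""
    sa, sb = parse(a), parse(b)
    if len(sa) != len(sb):
        return None
    pattern: List[Step] = []
    for (tag_a, idx_a), (tag_b, idx_b) in zip(sa, sb):
        if tag_a != tag_b:
            return None
        # Keep the index where both paths agree, otherwise allow any index.
        pattern.append((tag_a, idx_a if idx_a == idx_b else None))
    return pattern


def matches(pattern: List[Step], path: str) -> bool:
    steps = parse(path)
    return len(steps) == len(pattern) and all(
        tag == p_tag and (p_idx is None or idx == p_idx)
        for (p_tag, p_idx), (tag, idx) in zip(pattern, steps)
    )


# Assumed element path information of a small page.
all_paths = [
    "/html[1]/body[1]/ul[1]/li[1]/a[1]",
    "/html[1]/body[1]/ul[1]/li[2]/a[1]",
    "/html[1]/body[1]/ul[1]/li[3]/a[1]",
    "/html[1]/body[1]/div[1]/a[1]",
    "/html[1]/body[1]/ul[1]/li[1]/span[1]",
]
pattern = generalize(all_paths[0], all_paths[1])
targets = [p for p in all_paths if pattern is not None and matches(pattern, p)]
```

In this sketch, triggering the first two links is enough to select the third link as well, while the link under `div` and the `span` element are left out because their path structure differs.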
In one embodiment, the first acquisition module 10 may include (not shown in fig. 11):
the first generation unit is used for traversing the DOM tree structure of the first target webpage and generating first webpage element path information according to the traversing result.
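The traversal performed by the first generation unit can be illustrated with a short Python sketch. It assumes a well-formed page that the standard-library XML parser can read (real pages would usually need an HTML parser), and the path format with 1-based sibling indices, as well as the function name `element_paths`, are choices made only for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Dict

# Assumed sample page, kept well formed so that ElementTree can parse it.
PAGE = """<html><body>
<ul>
  <li><a href="/item/1">Phone A</a><span>199</span></li>
  <li><a href="/item/2">Phone B</a><span>299</span></li>
</ul>
</body></html>"""


def element_paths(root: ET.Element) -> Dict[str, ET.Element]:
    """Walk the tree depth first and map a path string to every element."""
    paths: Dict[str, ET.Element] = {}

    def walk(node: ET.Element, path: str) -> None:
        paths[path] = node
        seen: Counter = Counter()
        for child in node:
            # Number each child among its siblings that share the same tag.
            seen[child.tag] += 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, f"/{root.tag}[1]")
    return paths


paths = element_paths(ET.fromstring(PAGE))
```

Each key of `paths` plays the role of one entry of the web page element path information, for example `/html[1]/body[1]/ul[1]/li[2]/a[1]` for the second link.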
In one embodiment, a data acquisition device of a web page element further includes (not shown in fig. 11):
a second generation unit, configured to acquire a boundary value of the first web page element according to the coordinate values of the first web page element in the first target web page, and to generate a mask layer for the first web page element according to the boundary value.
The second acquisition module 20 further comprises:
a path acquisition unit, configured to acquire the path information of the triggered first web page element according to the mask layer.
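The relationship between coordinate values, boundary values, and the mask layer can be sketched in Python as follows. The details are assumptions for illustration only: a mask is modeled as a rectangle that remembers the path of the element it covers, and a mouse position is resolved to the smallest mask containing it, so that an inner element wins over its container. The names `Mask`, `make_mask`, and `path_at` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Mask:
    path: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom

    @property
    def area(self) -> float:
        return (self.right - self.left) * (self.bottom - self.top)


def make_mask(path: str, x: float, y: float, width: float, height: float) -> Mask:
    """Derive the boundary values from the element's coordinates and size."""
    return Mask(path, x, y, x + width, y + height)


def path_at(masks: List[Mask], x: float, y: float) -> Optional[str]:
    """Return the path of the innermost mask under the mouse position, if any."""
    hits = [m for m in masks if m.contains(x, y)]
    return min(hits, key=lambda m: m.area).path if hits else None


# Assumed layout: a list item that contains a link.
masks = [
    make_mask("/html[1]/body[1]/ul[1]/li[1]", 0, 0, 300, 40),
    make_mask("/html[1]/body[1]/ul[1]/li[1]/a[1]", 10, 10, 120, 20),
]
clicked = path_at(masks, 50, 20)
```

Here a trigger at position (50, 20) falls inside both rectangles and resolves to the link, whereas a trigger at (250, 20) would resolve to the list item.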
In one embodiment, a data acquisition device of a web page element further includes (not shown in fig. 11):
the fourth acquisition module 40, further configured to acquire page turning information in the first target web page;
a fifth acquisition module, configured to acquire a second target web page according to the page turning information;
a sixth acquisition module, configured to acquire second web page element path information of the second target web page;
a seventh acquisition module, configured to acquire path information of the triggered second web page elements when at least two web page elements of the same type in the second target web page are triggered;
an eighth acquisition module, configured to acquire second similar path information with similar path structures according to the path information of the triggered second web page elements; and
a ninth acquisition module, configured to determine a plurality of second target elements in the second target web page according to the second similar path information and the second web page element path information, and to acquire web page data of the plurality of second target elements.
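The page turning flow can be pictured with the following Python sketch. It is a simplified illustration under stated assumptions: pages are plain dictionaries, the `extract` callable stands for the whole per-page procedure (path acquisition, similar path determination, and data extraction), and the `collect_all` function, its parameters, and the `max_pages` guard are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional, Set

Page = Dict[str, Any]


def collect_all(
    start_url: str,
    fetch: Callable[[str], Page],
    extract: Callable[[Page], List[str]],
    next_url: Callable[[Page], Optional[str]],
    max_pages: int = 100,
) -> List[str]:
    """Follow the page turning information until no further page exists."""
    results: List[str] = []
    seen: Set[str] = set()
    url: Optional[str] = start_url
    # Stop on a missing next page, a repeated page, or the page limit.
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = fetch(url)
        results.extend(extract(page))
        url = next_url(page)
    return results


# Assumed in-memory site with two list pages.
SITE: Dict[str, Page] = {
    "/list?page=1": {"titles": ["Phone A", "Phone B"], "next": "/list?page=2"},
    "/list?page=2": {"titles": ["Phone C"], "next": None},
}
titles = collect_all(
    "/list?page=1",
    fetch=SITE.__getitem__,
    extract=lambda page: page["titles"],
    next_url=lambda page: page["next"],
)
```

The set of visited addresses is a defensive choice for the sketch, so that a page whose page turning information points back to an earlier page does not cause an endless loop.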
In one embodiment, the fourth acquisition module 40 further comprises (not shown in fig. 11):
an element acquisition unit, configured to acquire, from the first target web page, the peer elements and the parent elements of the triggered first web page elements of the same type according to the first similar path information and the first web page element path information, and to take the triggered first web page elements of the same type, the peer elements, and the parent elements as the first target elements.
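How peer elements and a parent element might be located for a triggered element is sketched below in Python with the standard-library ElementTree module. Treating siblings with the same tag as peer elements is an assumption of this example, and the function name `peers_and_parent` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple


def peers_and_parent(
    root: ET.Element, triggered: ET.Element
) -> Tuple[List[ET.Element], Optional[ET.Element]]:
    """Return the same-tag siblings of an element together with its parent."""
    # ElementTree keeps no parent pointers, so build a child-to-parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of.get(triggered)
    if parent is None:
        return [], None
    peers = [c for c in parent if c.tag == triggered.tag and c is not triggered]
    return peers, parent


# Assumed fragment: three list items and one unrelated paragraph.
root = ET.fromstring("<ul><li>Phone A</li><li>Phone B</li><li>Phone C</li><p>ad</p></ul>")
triggered = root[0]
peers, parent = peers_and_parent(root, triggered)
targets = [triggered, *peers, parent]
```

In this fragment the triggered element is the first list item, its peers are the other two list items, and the paragraph is ignored because its tag differs.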
In one embodiment, the fourth acquisition module 40 further comprises (not shown in fig. 11):
the data acquisition unit is used for acquiring configuration information of the first target webpage, wherein the configuration information is used for indicating data for extracting preset parameters in webpage elements of the first target webpage; and acquiring the webpage data of the plurality of first target elements according to the configuration information.
In one embodiment, the preset parameters include text parameters and/or link parameters. The data acquisition unit further includes (not shown in fig. 11):
and the data acquisition subunit is used for acquiring text data and/or link data in the plurality of first target elements according to the configuration information, and the webpage data comprises the text data and the link data.
For the specific limitations of the data acquisition device for web page elements, reference may be made to the limitations of the data acquisition method for web page elements above, which are not repeated here. Each module in the data acquisition device may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server supporting the operation of an RPA designer, the internal structure of which may be as shown in fig. 12. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for being connected with an external terminal so as to read information such as a webpage, webpage elements, webpage data and the like on the terminal. The computer program, when executed by a processor, implements a method for data retrieval of web page elements.
It will be appreciated by those skilled in the art that the structure shown in fig. 12 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring first webpage element path information of a first target webpage; when at least two web page elements of the same type in the first target web page are triggered, acquiring path information of the triggered first web page elements; acquiring first similar path information with similar path structures according to the path information of the first webpage element; and determining a plurality of first target elements in the first target webpage according to the first similar path information and the first webpage element path information, and acquiring webpage data of the plurality of first target elements.
In one embodiment, when the processor executes the computer program to implement the step of acquiring the path information of the first web page element of the first target web page, the following steps are specifically implemented: traversing the DOM tree structure of the first target webpage, and generating first webpage element path information according to the traversing result.
In one embodiment, the processor, when executing the computer program, performs the steps of: acquiring a boundary value of the first webpage element according to the coordinate value of the first webpage element in the first target webpage; generating a mask layer for the first webpage element according to the boundary value; when the processor executes the computer program to realize the step of acquiring the path information of the triggered first webpage element, the following steps are specifically realized: and acquiring the path information of the triggered first webpage element according to the mask layer.
In one embodiment, the processor, when executing the computer program, performs the steps of: acquiring page turning information in a first target webpage; acquiring a second target webpage according to the page turning information; acquiring second webpage element path information of a second target webpage; when at least two webpage elements of the same type in the second target webpage are triggered, acquiring path information of the triggered second webpage elements; acquiring second similar path information with similar path structures according to the triggered path information of the second webpage element; and determining a plurality of second target elements in the second target webpage according to the second similar path information and the second webpage element path information, and acquiring webpage data of the plurality of second target elements.
In one embodiment, when the processor executes the computer program to implement the above steps of determining a plurality of first target elements in the first target web page according to the first similar path information and the first web page element path information, the following steps are specifically implemented: and acquiring the peer element and the parent element of the triggered first webpage element of the same type from the first target webpage according to the first similar path information and the first webpage element path information, and taking the triggered first webpage element, the peer element and the parent element of the same type as the first target element respectively.
In one embodiment, when the processor executes the step of obtaining the web page data of the plurality of first target elements by using the computer program, the following steps are specifically implemented: acquiring configuration information of a first target webpage, wherein the configuration information is used for indicating data for extracting preset parameters in webpage elements of the first target webpage; and acquiring the webpage data of the plurality of first target elements according to the configuration information.
In one embodiment, the preset parameters include text parameters and/or link parameters, and when the processor executes the computer program to implement the step of acquiring the web page data of the plurality of first target elements according to the configuration information, the following steps are specifically implemented: and acquiring text data and/or link data in the plurality of first target elements according to the configuration information, wherein the webpage data comprises the text data and the link data.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring first webpage element path information of a first target webpage; when at least two web page elements of the same type in the first target web page are triggered, acquiring path information of the triggered first web page elements; acquiring first similar path information with similar path structures according to the path information of the first webpage element; and determining a plurality of first target elements in the first target webpage according to the first similar path information and the first webpage element path information, and acquiring webpage data of the plurality of first target elements.
In one embodiment, when the computer program is executed by the processor to implement the step of obtaining the path information of the first web page element of the first target web page, the following steps are specifically implemented: traversing the DOM tree structure of the first target webpage, and generating first webpage element path information according to the traversing result.
In one embodiment, the computer program when executed by a processor performs the steps of: acquiring a boundary value of the first webpage element according to the coordinate value of the first webpage element in the first target webpage; generating a mask layer for the first webpage element according to the boundary value; when the computer program is executed by the processor to realize the step of acquiring the path information of the triggered first webpage element, the following steps are specifically realized: and acquiring the path information of the triggered first webpage element according to the mask layer.
In one embodiment, the computer program when executed by a processor performs the steps of: acquiring page turning information in a first target webpage; acquiring a second target webpage according to the page turning information; acquiring second webpage element path information of a second target webpage; when at least two webpage elements of the same type in the second target webpage are triggered, acquiring path information of the triggered second webpage elements; acquiring second similar path information with similar path structures according to the triggered path information of the second webpage element; and determining a plurality of second target elements in the second target webpage according to the second similar path information and the second webpage element path information, and acquiring webpage data of the plurality of second target elements.
In one embodiment, when the computer program is executed by the processor to implement the above-mentioned step of determining a plurality of first target elements in the first target web page according to the first similar path information and the first web page element path information, the following steps are specifically implemented: and acquiring the peer element and the parent element of the triggered first webpage element of the same type from the first target webpage according to the first similar path information and the first webpage element path information, and taking the triggered first webpage element, the peer element and the parent element of the same type as the first target element respectively.
In one embodiment, when the computer program is executed by the processor to implement the step of acquiring the web page data of the first target elements, the following steps are specifically implemented: acquiring configuration information of a first target webpage, wherein the configuration information is used for indicating data for extracting preset parameters in webpage elements of the first target webpage; and acquiring the webpage data of the plurality of first target elements according to the configuration information.
In one embodiment, the preset parameters include text parameters and/or link parameters, and when the computer program is executed by the processor to implement the above step of obtaining the web page data of the plurality of first target elements according to the configuration information, the following steps are specifically implemented: and acquiring text data and/or link data in the plurality of first target elements according to the configuration information, wherein the webpage data comprises the text data and the link data.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium; when executed, the program may carry out the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination that involves no contradiction should be considered within the scope of this description.
The above examples represent only a few embodiments of the present application. Although they are described in considerable detail, they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art may make various modifications and improvements without departing from the spirit of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (9)

1. A method for obtaining data of a web page element, the method comprising:
acquiring first webpage element path information of a first target webpage;
when at least two webpage elements of the same type in the first target webpage are triggered, acquiring path information of the triggered first webpage elements;
acquiring first similar path information with similar path structures according to the path information of the first webpage element;
determining a plurality of first target elements in the first target webpage according to the first similar path information and the first webpage element path information, and acquiring webpage data of the plurality of first target elements;
acquiring page turning information in the first target webpage;
acquiring a second target webpage according to the page turning information;
acquiring second webpage element path information of the second target webpage;
when at least two webpage elements of the same type in the second target webpage are triggered, acquiring path information of the triggered second webpage elements;
acquiring second similar path information with similar path structures according to the triggered path information of the second webpage element;
and determining a plurality of second target elements in the second target webpage according to the second similar path information and the second webpage element path information, and acquiring webpage data of the plurality of second target elements.
2. The method of claim 1, wherein the obtaining the first web page element path information of the first target web page comprises:
traversing the DOM tree structure of the first target webpage, and generating the path information of the first webpage element according to the traversing result.
3. The method according to claim 1, wherein the method further comprises:
acquiring a boundary value of the first webpage element according to the coordinate value of the first webpage element in the first target webpage;
generating a mask layer for the first webpage element according to the boundary value;
the obtaining the path information of the triggered first webpage element includes:
and acquiring the path information of the triggered first webpage element according to the mask layer.
4. The method of claim 1, wherein the determining a plurality of first target elements in the first target web page from the first similar path information and the first web page element path information comprises:
and acquiring the triggered peer elements and parent elements of the same type of the first webpage elements from the first target webpage according to the first similar path information and the first webpage element path information, and taking the triggered first webpage elements, the peer elements and the parent elements of the same type as the first target elements respectively.
5. The method of claim 4, wherein the obtaining the web page data of the plurality of first target elements comprises:
acquiring configuration information of the first target webpage, wherein the configuration information is used for indicating data for extracting preset parameters in webpage elements of the first target webpage;
and acquiring the webpage data of the plurality of first target elements according to the configuration information.
6. The method according to claim 5, wherein the preset parameters include text parameters and/or link parameters, and the obtaining the web page data of the plurality of first target elements according to the configuration information includes:
and acquiring text data and/or link data in the plurality of first target elements according to the configuration information, wherein the webpage data comprises the text data and the link data.
7. A data acquisition device for web page elements, the device comprising:
the first acquisition module is used for acquiring first webpage element path information of a first target webpage;
the second acquisition module is used for acquiring path information of the triggered first webpage elements when at least two webpage elements of the same type in the first target webpage are triggered;
the third acquisition module is used for acquiring first similar path information with similar path structures according to the path information of the first webpage element;
a fourth obtaining module, configured to determine a plurality of first target elements in the first target web page according to the first similar path information and the first web page element path information, and obtain web page data of the plurality of first target elements;
the fourth acquisition module is also used for acquiring page turning information in the first target webpage;
the fifth acquisition module is used for acquiring a second target webpage according to the page turning information;
a sixth obtaining module, configured to obtain second web page element path information of a second target web page;
a seventh obtaining module, configured to obtain path information of the triggered second webpage element when at least two webpage elements of the same type in the second target webpage are triggered;
the eighth acquisition module is used for acquiring second similar path information with similar path structures according to the triggered path information of the second webpage element;
and the ninth acquisition module is used for determining a plurality of second target elements in the second target webpage according to the second similar path information and the second webpage element path information and acquiring webpage data of the plurality of second target elements.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 6 when the computer program is executed by the processor.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN201911198993.7A 2019-11-29 2019-11-29 Data acquisition method, device, computer equipment and storage medium Active CN111090797B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911198993.7A CN111090797B (en) 2019-11-29 2019-11-29 Data acquisition method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911198993.7A CN111090797B (en) 2019-11-29 2019-11-29 Data acquisition method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111090797A CN111090797A (en) 2020-05-01
CN111090797B true CN111090797B (en) 2023-07-25

Family

ID=70393709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911198993.7A Active CN111090797B (en) 2019-11-29 2019-11-29 Data acquisition method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111090797B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111638879B (en) * 2020-05-15 2023-10-31 民生科技有限责任公司 System, method, apparatus and readable storage medium for overcoming pixel positioning limitation
CN112882625B (en) * 2021-02-10 2022-05-17 南京苏宁软件技术有限公司 Element pickup method, element pickup device, computer equipment and storage medium
CN114528005B (en) * 2021-11-29 2023-06-23 深圳市千源互联网科技服务有限公司 Grabbing label updating method, grabbing label updating device, grabbing label updating equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101464905A (en) * 2009-01-08 2009-06-24 中国科学院计算技术研究所 Web page information extraction system and method
CN102117289A (en) * 2009-12-30 2011-07-06 北京大学 Method and device for extracting comment content from webpage

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831121B (en) * 2011-06-15 2015-07-08 阿里巴巴集团控股有限公司 Method and system for extracting webpage information

Also Published As

Publication number Publication date
CN111090797A (en) 2020-05-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant