CN113722640A - Method, device and medium for collecting webpage configurable items based on RPA - Google Patents

Info

Publication number
CN113722640A
Authority
CN
China
Prior art keywords
configurable
items
acquisition
rpa
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110987813.4A
Other languages
Chinese (zh)
Inventor
梁威
谢宏亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha Biovision Software Technology Co ltd
Original Assignee
Changsha Biovision Software Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-08-26
Filing date
2021-08-26
Publication date
2021-11-30
Application filed by Changsha Biovision Software Technology Co ltd
Priority to CN202110987813.4A
Publication of CN113722640A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method, a device and a medium for collecting webpage configurable items based on RPA. The method comprises: locking a collection area in a webpage interface so that the collection area contains a plurality of configurable items of the current webpage, the configurable items being similar to one another; locating all configurable items within the collection area; and locating and binding all collection items in all the configurable items. Because the collection area is locked before the configurable items of the webpage are located, positioning cannot extend beyond the collection area, and unneeded configurable items are not found.

Description

Method, device and medium for collecting webpage configurable items based on RPA
Technical Field
The invention relates to the technical field of RPA webpage configuration, in particular to a method, a device and a medium for acquiring webpage configurable items based on RPA.
Background
In a B/S system (Browser/Server, a system based on a wide area network), the content of a web page is divided into regions. For example, when a product is searched for on a shopping web page, information on the matching products appears; each product in the result region is presented in a similar card, and the content and order of the elements arranged inside each card have similar meanings.
At present, web page card collection based on RPA (Robotic Process Automation, i.e. software automation) searches the whole web page for regions with similarity, which easily picks up similar regions that are not needed. Even if the configuration is tuned well at configuration time, most web pages are paginated, so at run time, when pages are turned during collection and the data on each page changes slightly, a page may differ from what was seen at configuration time. The RPA then finds unneeded similar regions and, as a result, unneeded cards.
Disclosure of Invention
The present invention aims to solve at least one of the problems of the prior art. To this end, the invention provides a method, a device and a medium for collecting webpage configurable items based on RPA. The method avoids collecting unneeded webpage configurable items, so that collection items are located more clearly and collected more accurately and completely.
In a first aspect, the invention provides a method for collecting webpage configurable items based on RPA, which comprises the following steps:
locking a collection area in a webpage interface so that the collection area contains a plurality of configurable items of a current webpage, and each configurable item has similarity;
locating all of the configurable items within the collection area;
locating and binding all acquisition items in all the configurable items.
According to the embodiment of the invention, at least the following technical effects are achieved:
In the prior art, similarity regions are searched for across the whole webpage, so unneeded similar card or table regions are easily picked up. In this method, a collection area is locked before the configurable items of the webpage are located, so that the collection area contains a plurality of configurable items of the current webpage, each of which is similar to the others. Locking the collection area first prevents positioning from extending beyond it and prevents unneeded configurable items from being found.
According to some embodiments of the invention, the method further comprises the step of: scrolling the webpage interface and, if new configurable items are loaded on the current webpage, locating all the new configurable items within the collection area in the same manner, and locating and binding all collection items in all the new configurable items.
According to some embodiments of the invention, the distance of each scrolling of the web interface is the same as the height of the frame of the web interface.
According to some embodiments of the invention, the acquisition area is locked in the web interface by xpath and/or dom.
According to some embodiments of the invention, the configurable item is a card or a table.
In a second aspect of the present invention, an apparatus for acquiring a web page configurable item based on RPA is provided, including:
the acquisition area positioning module is used for locking an acquisition area in a webpage interface so that the acquisition area contains a plurality of configurable items of a current webpage and each configurable item has similarity;
a configurable item location module for locating all the configurable items within the acquisition area;
and the acquisition item positioning and binding module is used for positioning and binding all acquisition items in all the configurable items.
According to the embodiment of the invention, at least the following technical effects are achieved:
In the prior art, similarity regions are searched for across the whole webpage, so unneeded similar card or table regions are easily picked up. The device locks a collection area before locating the configurable items of the webpage, so that the collection area contains a plurality of configurable items of the current webpage, each of which is similar to the others. Locking the collection area first prevents positioning from extending beyond it and prevents unneeded configurable items from being found.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
Fig. 1 is a schematic flowchart of a method for collecting a web page configurable item based on RPA according to a first embodiment of the present invention;
Fig. 2 is a schematic configuration flowchart of a method for collecting a web page configurable item based on RPA according to a second embodiment of the present invention;
Fig. 3 is a schematic runtime flowchart of a method for collecting a web page configurable item based on RPA according to a second embodiment of the present invention;
Fig. 4 is a schematic diagram of a collection area and a card area of a web page according to a second embodiment of the present invention;
Fig. 5 is a schematic diagram of a detail page corresponding to a collection item according to a second embodiment of the present invention;
Fig. 6 is a schematic diagram of a card area of a web page according to a second embodiment of the present invention;
Fig. 7 is a schematic diagram of a table area of a web page according to a third embodiment of the present invention;
Fig. 8 is a schematic diagram of attribute features on a DOM element of a web card according to a third embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
In the prior art, similarity regions are searched for across the whole webpage, so regions of unneeded similar cards are easily picked up. Even if the configuration is tuned at configuration time, at run time the data on each page changes slightly during paged collection; when a page differs from what was seen at configuration time, unneeded similar regions are found, and with them unneeded cards.
The invention first locks a large area, which prevents similar cards outside that area from being found, and combines this with analysis of the internal structure of the card, so that positioning is clearer and collection is more accurate and complete. The invention can also serve as a collection scheme for web page tables.
Referring to fig. 1, a first embodiment of the present invention provides a method for collecting a web page configurable item based on RPA, where the configurable item is a card, including the following steps:
Step S101: a collection area is locked in the webpage interface, so that the collection area contains a plurality of cards of the current webpage, the cards being similar to one another.
Step S102: all cards within the collection area are located.
Step S103: all collection items in all the cards are located and bound.
In the prior art, similarity regions are searched for across the whole webpage, so unneeded similar card or table regions are easily picked up. In this method, a collection area is locked before the cards of the webpage are located, so that the collection area contains a plurality of cards of the current webpage, each of which is similar to the others. Locking the collection area first prevents positioning from extending beyond it and prevents unneeded cards from being found.
It should be noted that the scheme of the invention can also be applied to collecting web page tables; the principle is the same as for collecting web page cards and is not described in detail here.
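Steps S101 to S103 can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: the page fragment, the element ids and class names, and the function `collect` are all hypothetical, and the standard-library `xml.etree.ElementTree` (which supports a limited XPath subset and needs well-formed markup) stands in for a real RPA locator.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment. The "ads" block also contains a
# card-like element that should NOT be collected.
PAGE = """
<html><body>
  <div id="ads">
    <div class="card"><span class="title">Sponsored</span><span class="price">1.00</span></div>
  </div>
  <div id="results">
    <div class="card"><span class="title">Phone A</span><span class="price">2999.00</span></div>
    <div class="card"><span class="title">Phone B</span><span class="price">1899.00</span></div>
  </div>
</body></html>
"""


def collect(page_xml, area_path, card_path, field_paths):
    """Lock an area, locate cards inside it, and bind fields per card."""
    root = ET.fromstring(page_xml)
    # S101: lock the collection area; later searches are relative to it.
    area = root.find(area_path)
    if area is None:
        return []
    # S102: locate every card inside the locked area only.
    cards = area.findall(card_path)
    # S103: locate each collection item relative to its card and bind it
    # to a field name.
    records = []
    for card in cards:
        record = {}
        for name, path in field_paths.items():
            node = card.find(path)
            record[name] = node.text if node is not None else None
        records.append(record)
    return records


records = collect(
    PAGE,
    area_path=".//div[@id='results']",
    card_path="./div[@class='card']",
    field_paths={"title": "./span[@class='title']", "price": "./span[@class='price']"},
)
print(records)

# For contrast: a whole-page search also matches the sponsored card.
whole_page_cards = ET.fromstring(PAGE).findall(".//div[@class='card']")
print(len(whole_page_cards))
```

Searching from the locked area returns only the two result cards, whereas the whole-page search in the last lines also picks up the card-like element in the hypothetical "ads" block, which is the kind of unneeded match the method is meant to avoid.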
As an optional implementation, the method further comprises the following steps:
Step S104: the webpage interface is scrolled; if new cards are loaded on the current webpage, all new cards within the collection area are located in the same manner, and all collection items in all the new cards are located and bound.
After the collection items in all the cards of the current page have been collected in steps S101 to S103, the webpage interface is scrolled so that the cards then present on the page can be located and their collection items bound. Note that the collection area in step S104 is the same area as in step S101. In this embodiment, unneeded cards are therefore also not found during paged collection.
Referring to fig. 2 and 3, for easy understanding, the second embodiment of the present invention is described as a process of acquiring a web card:
Firstly, the collection area of the current page is configured and locked, which prevents similar cards outside the area from being identified. The collection area is located by XPath (a language for finding information in XML documents) or by the ID, CLASS and style attributes of a DOM (Document Object Model) element, and the locator can be fine-tuned by manual modification.
Secondly, the cards are located: all similar cards in the collection area are located, and a filter condition can be set. For example, the width or height of a card can be required to be greater or smaller than a certain value, with an offset value added, or the attributes of certain DOM elements inside the card can be used as the filter condition.
Then the collection items are configured. The current card serves as the parent locator of its collection items; for example, when the traversal reaches the fifth card, that card is the parent locator of the collection items. Each item is treated as a collection record, which is bound to a variable and stored in the corresponding data field.
Finally, scroll collection is configured. When it is set to 'true', the scroll bar is scrolled at run time after the first batch of cards has been collected, and the page is checked for newly loaded data; if there is any, the card records in the new data are traversed and collected in turn. By default the scroll bar of the window is scrolled; the scroll bar of a specific element can be designated instead.
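The four configuration stages above could be captured in a small configuration object, sketched below. All class and field names (`CardFilter`, `CollectionConfig`, `scroll_element_xpath`, and so on) are hypothetical and do not come from the patent, and treating the offset as a tolerance that widens the width and height bounds is an assumption about how the offset value is meant to be used.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CardFilter:
    """Hypothetical size filter: bounds on width/height, widened by an offset."""
    min_width: float = 0.0
    max_width: float = float("inf")
    min_height: float = 0.0
    max_height: float = float("inf")
    offset: float = 0.0  # assumed to act as a tolerance on every bound

    def accepts(self, width: float, height: float) -> bool:
        width_ok = self.min_width - self.offset <= width <= self.max_width + self.offset
        height_ok = self.min_height - self.offset <= height <= self.max_height + self.offset
        return width_ok and height_ok


@dataclass
class CollectionConfig:
    area_xpath: str                       # stage 1: locked collection area
    card_xpath: str                       # stage 2: cards, relative to the area
    field_xpaths: Dict[str, str]          # stage 3: items, relative to a card
    card_filter: CardFilter = field(default_factory=CardFilter)
    scroll_collect: bool = False          # stage 4: scroll collection on or off
    scroll_element_xpath: Optional[str] = None  # None means the window scroll bar


config = CollectionConfig(
    area_xpath="//div[@id='results']",
    card_xpath="./div[@class='card']",
    field_xpaths={"price": ".//span[@class='price']"},
    card_filter=CardFilter(min_width=200, max_width=240, min_height=300, offset=5),
    scroll_collect=True,
)

# Hypothetical rendered sizes (width, height) of three candidate cards.
sizes = [(220, 320), (196, 310), (120, 60)]
kept = [size for size in sizes if config.card_filter.accepts(*size)]
print(kept)
```

With an offset of 5, the second candidate (width 196) still passes the 200-pixel minimum, while the much smaller third candidate is filtered out.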
Referring to figs. 4 to 6, fig. 4 shows the configuration of three regions. In fig. 4, the box labeled L1 marks the collection area, within which all collection tasks take place (note that the box labeled L1 is not shown in full in fig. 4). The box labeled L2 marks a card: when the card is configured, the locator analyzes elements of the same kind to find the positions of all similar cards within the area. The boxes labeled L3 mark the collection items configured inside the card (note that a box labeled L2 contains several boxes labeled L3, such as the "2999.00" shown in fig. 4). For each card, collection item locators are generated relative to that card, so that the data can be collected and mapped to table fields. Configuration proceeds from the outer layer to the inner layer in 3 steps:
Step 1: the outer layer is located first (that is, the box labeled L1), which determines the collection area and narrows the positioning range. If similar cards were searched for across the whole page, positioning could cross the intended boundary and unneeded cards could be found; this configuration step solves that problem.
Step 2: in fig. 4, the box labeled L2 is the card locator. It is searched for within the area located in step 1 (the area inside the box labeled L1); after this area is selected, the similar cards within it are analyzed, as shown in fig. 6.
Step 3: in fig. 4, the boxes labeled L3 are collection items, which are located relative to the current card.
The three configuration steps are carried out with locators based on XPath or the DOM, and in each step the positioning can be fine-tuned by changing the XPath or DOM expression, which makes positioning more stable and reliable when the page changes.
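The effect of fine-tuning a locator expression can be illustrated with two XPath variants for the same collection area. This is a sketch under assumptions: the two page fragments and the `located_id` helper are hypothetical, and the standard-library `xml.etree.ElementTree` XPath subset is used in place of an RPA locator.

```python
import xml.etree.ElementTree as ET

# Hypothetical page at configuration time.
BEFORE = "<body><div id='nav'/><div id='results'><p>cards</p></div></body>"
# The same page at run time, after a banner was inserted before the results.
AFTER = "<body><div id='nav'/><div id='banner'/><div id='results'><p>cards</p></div></body>"

POSITIONAL = "./div[2]"                 # the second div child of body
BY_ATTRIBUTE = "./div[@id='results']"   # the div whose id is 'results'


def located_id(page_xml, path):
    """Return the id of the element a locator resolves to, or None."""
    node = ET.fromstring(page_xml).find(path)
    return None if node is None else node.get("id")


for path in (POSITIONAL, BY_ATTRIBUTE):
    print(path, located_id(BEFORE, path), located_id(AFTER, path))
```

The positional expression resolves to the results container before the change but to the banner afterwards, while the attribute-based expression keeps resolving to the same element. Which form is more stable on a real page depends on how that page is built.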
During collection, as shown in fig. 4, the collection items in a card may need to be spliced with collection items from its detail page, which is opened by clicking in the card (for example, clicking the picture in the current card opens the detail page). Fig. 5 shows the detail page corresponding to the card in fig. 4; the boxes labeled L3 there are the configured collection items of the detail page. In each traversal step one card is clicked and its detail page (such as the interface of fig. 5) is entered; after returning to the card page (such as the interface of fig. 4), the collection items of figs. 4 and 5 are saved together as one record to ensure data integrity.
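The splicing of card fields with detail-page fields into one record per card can be sketched as below. The function `splice_records`, the `detail_id` key and the in-memory `DETAILS` lookup are hypothetical; the lookup merely stands in for clicking a card, collecting from the detail page and navigating back.

```python
def splice_records(cards, fetch_detail):
    """Merge each card's fields with its detail-page fields into one row."""
    rows = []
    for card in cards:
        # Stand-in for: click the card, collect on the detail page, go back.
        detail = fetch_detail(card["detail_id"])
        row = {key: value for key, value in card.items() if key != "detail_id"}
        # Assumption: on a name clash the detail-page value overrides the card value.
        row.update(detail)
        rows.append(row)
    return rows


# Hypothetical detail pages, keyed by an id carried on each card.
DETAILS = {
    "a1": {"brand": "Acme", "stock": "12"},
    "b2": {"brand": "Bolt", "stock": "0"},
}
cards = [
    {"title": "Phone A", "price": "2999.00", "detail_id": "a1"},
    {"title": "Phone B", "price": "1899.00", "detail_id": "b2"},
]
rows = splice_records(cards, DETAILS.__getitem__)
print(rows)
```

Each resulting row holds the card fields and the detail-page fields together, so one card corresponds to exactly one stored record.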
After all cards on the current page have been collected, the page scroll bar is scrolled by a distance equal to the window height of the current page. After each scroll, the page is checked for newly loaded data; if there is any, the newly loaded cards are collected one by one, and the page is scrolled by one window height again once they have been collected. This repeats until a scroll loads no new data, which ensures that the data of all cards on the page is collected.
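The scroll-and-collect loop can be sketched against a simulated page. `FakePage` is a hypothetical stand-in for a browser page with lazy loading, and the `max_scrolls` guard is an added assumption to keep the loop bounded on pages that never stop loading; neither is part of the patent text.

```python
class FakePage:
    """Hypothetical page: one further batch of cards loads after each scroll."""

    def __init__(self, batches, window_height=800):
        self._pending = [list(batch) for batch in batches]
        self.window_height = window_height
        self.scroll_top = 0
        # The first batch is visible before any scrolling.
        self.cards = self._pending.pop(0) if self._pending else []

    def scroll_by(self, distance):
        self.scroll_top += distance
        if self._pending:
            self.cards.extend(self._pending.pop(0))


def collect_with_scroll(page, max_scrolls=100):
    """Collect visible cards, scroll one window height, repeat until nothing new loads."""
    collected = []
    seen = 0
    for _ in range(max_scrolls + 1):
        new_cards = page.cards[seen:]
        if not new_cards:
            break  # the last scroll loaded no new data
        collected.extend(new_cards)
        seen = len(page.cards)
        page.scroll_by(page.window_height)  # scroll distance = window height
    return collected


page = FakePage([["A", "B"], ["C"], ["D", "E"]])
result = collect_with_scroll(page)
print(result, page.scroll_top)
```

Tracking how many cards have already been seen means that only newly loaded cards are collected after each scroll, so no card is recorded twice.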
For ease of understanding, the third embodiment of the present invention describes the collection process for a web page table:
As in fig. 7 (the label boxes have the same meaning as in fig. 4, except that fig. 7 uses R1, R2 and R3 for the three boxes), the table range is selected, each row of the table is treated as a card, and each field is treated as a collection item.
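Treating table rows as cards and cells as collection items can be sketched as follows. The table fragment, the field names and the function `collect_rows` are hypothetical, and the mapping of R1, R2 and R3 to table range, row and cell in the comments is an interpretation of the statement that the boxes have the same meaning as in fig. 4.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment containing one table.
TABLE = """
<div id="report">
  <table>
    <tr><th>Name</th><th>Qty</th></tr>
    <tr><td>Bolt</td><td>40</td></tr>
    <tr><td>Nut</td><td>75</td></tr>
  </table>
</div>
"""


def collect_rows(xml_text, table_path, field_names):
    """Collect one record per data row of the selected table."""
    root = ET.fromstring(xml_text)
    table = root.find(table_path)            # selected table range (assumed R1)
    if table is None:
        return []
    records = []
    for row in table.findall("./tr"):        # each row plays the card role (assumed R2)
        cells = row.findall("./td")          # each cell is a collection item (assumed R3)
        if len(cells) != len(field_names):
            continue                         # skips the header row, which has th cells
        records.append({name: cell.text for name, cell in zip(field_names, cells)})
    return records


table_records = collect_rows(TABLE, "./table", ["name", "qty"])
print(table_records)
```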
When the similarity of cards is analyzed, the element attributes (including style attributes and node attributes) and the configurable element height and width features in the DOM node of the current card (see fig. 8) are the objects of the first analysis step. Once collection items have been selected, the element attributes (including style attributes and node attributes) and the configurable node height and width of those collection items are also used, as internal features of the card, in the card screening condition, so that similar cards are found more accurately.
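A similarity check based on node attributes and size could look like the sketch below. The patent does not specify the comparison rule, so everything here is an assumption made for illustration: the feature dictionary layout, the exact-match rule on tag and class set, and the pixel tolerance `size_tolerance`.

```python
def is_similar(reference, candidate, size_tolerance=10):
    """Return True if the candidate matches the reference card's features.

    Assumed rule: same tag, same set of classes, and width and height
    within size_tolerance pixels of the reference.
    """
    same_tag = reference["tag"] == candidate["tag"]
    same_classes = set(reference["classes"]) == set(candidate["classes"])
    close_width = abs(reference["width"] - candidate["width"]) <= size_tolerance
    close_height = abs(reference["height"] - candidate["height"]) <= size_tolerance
    return same_tag and same_classes and close_width and close_height


# Hypothetical features of the configured card and three candidates.
reference = {"tag": "div", "classes": ["card", "goods"], "width": 220, "height": 340}
candidates = [
    {"tag": "div", "classes": ["goods", "card"], "width": 224, "height": 338},  # similar
    {"tag": "div", "classes": ["card", "ad"], "width": 220, "height": 340},     # other classes
    {"tag": "div", "classes": ["card", "goods"], "width": 480, "height": 120},  # other size
]
matches = [is_similar(reference, candidate) for candidate in candidates]
print(matches)
```

The same comparison could be repeated for the collection items inside each card to tighten the screening condition, in the spirit of the second analysis step described above.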
In a fourth embodiment of the present invention, an RPA-based web page configurable item acquisition device is provided, which may be any type of intelligent terminal, such as a mobile phone, a tablet computer, a personal computer, and so on. Specifically, the apparatus includes: one or more control processors and memory, here exemplified by a control processor. The control processor and the memory may be connected by a bus or other means, here exemplified by a connection via a bus.
The memory, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the acquisition device of the RPA-based web page configurable item in the embodiment of the present invention. The control processor implements the collection method of the RPA-based web page configurable item of the above method embodiments by running non-transitory software programs, instructions, and modules stored in the memory.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes a memory remotely located from the control processor, and the remote memories may be connected to the RPA-based web page configurable item acquisition device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory and, when executed by the one or more control processors, perform the collection method for RPA-based web page configurable items in the above embodiments.
The embodiment of the invention also provides a computer-readable storage medium, which stores computer-executable instructions, and the computer-executable instructions are used by one or more control processors to execute the collection method of the webpage configurable item based on the RPA in the above embodiment.
Through the above description of the embodiments, those skilled in the art can clearly understand that the embodiments can be implemented by software plus a general hardware platform. Those skilled in the art will appreciate that all or part of the processes in the methods for implementing the embodiments described above can be implemented by hardware related to instructions of a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes in the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (8)

1. A method for collecting webpage configurable items based on RPA is characterized by comprising the following steps:
locking a collection area in a webpage interface so that the collection area contains a plurality of configurable items of a current webpage, and each configurable item has similarity;
locating all of the configurable items within the collection area;
locating and binding all acquisition items in all the configurable items.
2. The method for collecting web page configurable items based on RPA as claimed in claim 1, further comprising the steps of:
and scrolling the webpage interface, if the current webpage has new configurable item loading, positioning all the new configurable items in the acquisition area according to the same mode, and positioning and binding all the acquisition items in all the new configurable items.
3. The method for collecting web page configurable items based on RPA according to claim 2, wherein the distance of scrolling said web page interface each time is the same as the height of the frame of said web page interface.
4. The collection method of the RPA-based web page configurable item according to claim 1, wherein the collection area is locked in the web interface by xpath and/or dom.
5. The collection method of RPA-based web page configurable items according to any of claims 1-4, wherein said configurable items are cards or tables.
6. An apparatus for collecting web page configurable items based on RPA, comprising:
the acquisition area positioning module is used for locking an acquisition area in a webpage interface so that the acquisition area contains a plurality of configurable items of a current webpage and each configurable item has similarity;
a configurable item location module for locating all the configurable items within the acquisition area;
and the acquisition item positioning and binding module is used for positioning and binding all acquisition items in all the configurable items.
7. An acquisition device of webpage configurable items based on RPA is characterized in that: comprises at least one control processor and a memory for communicative connection with the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform the method of acquiring an RPA-based web page configurable item of any one of claims 1 to 5.
8. A computer-readable storage medium characterized by: the computer-readable storage medium stores computer-executable instructions for causing a computer to perform the method for acquiring the RPA-based web page configurable item of any one of claims 1 to 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110987813.4A CN113722640A (en) 2021-08-26 2021-08-26 Method, device and medium for collecting webpage configurable items based on RPA

Publications (1)

Publication Number Publication Date
CN113722640A (en) 2021-11-30

Family

ID=78678158

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110987813.4A Pending CN113722640A (en) 2021-08-26 2021-08-26 Method, device and medium for collecting webpage configurable items based on RPA

Country Status (1)

Country Link
CN (1) CN113722640A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446139A (en) * 2016-09-20 2017-02-22 微梦创科网络科技(中国)有限公司 Webpage content extracting method and device
CN107239546A (en) * 2017-06-05 2017-10-10 成都知道创宇信息技术有限公司 A kind of method of webpage local content tracking with reminding
CN108804458A (en) * 2017-05-02 2018-11-13 阿里巴巴集团控股有限公司 A kind of reptile web retrieval method and apparatus
CN108846116A (en) * 2018-06-26 2018-11-20 北京京东金融科技控股有限公司 Page Impression collecting method, system, electronic equipment and storage medium
CN109214864A (en) * 2018-08-27 2019-01-15 河南丰泰光电科技有限公司 A kind of advertisement recognition method and device, electronic equipment
CN109829092A (en) * 2018-12-26 2019-05-31 厦门邑通软件科技有限公司 The method that a kind of pair of webpage is oriented monitoring
CN110020312A (en) * 2017-12-11 2019-07-16 北京京东尚科信息技术有限公司 The method and apparatus for extracting Web page text
CN112559355A (en) * 2020-12-18 2021-03-26 中国平安财产保险股份有限公司 Test case generation method and device, electronic equipment and storage medium
CN112579852A (en) * 2019-09-30 2021-03-30 厦门邑通软件科技有限公司 Interactive webpage data accurate acquisition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination