CN114647466A - Page content extraction method, device, equipment and computer readable storage medium - Google Patents

Info

Publication number
CN114647466A
Authority
CN
China
Prior art keywords
node
operation information
node operation
page
target element
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011493052.9A
Other languages
Chinese (zh)
Inventor
王安迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guoxinjunhe Beijing Technology Co ltd
Original Assignee
Guoxinjunhe Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guoxinjunhe Beijing Technology Co ltd
Priority to CN202011493052.9A
Publication of CN114647466A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/451 Execution arrangements for user interfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/38 Creation or generation of source code for implementing user interfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44505 Configuring for program initiating, e.g. using registry, configuration files
    • G06F9/4451 User profiles; Roaming

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a page content extraction method, apparatus, device, and computer-readable storage medium. The method comprises: starting a target application program, wherein the target application program is used for displaying a page; reading a plurality of pieces of preset node operation information, wherein each piece of node operation information indicates that a node operation of a preset operation type is to be executed at a target element node position in the page; and sequentially executing the node operations indicated by the pieces of node operation information according to their arrangement order, wherein, when the operation type indicated by a piece of node operation information is an extraction type, the element content at the target element node position indicated by that piece of node operation information is extracted. The method neither requires cracking the API of the application program nor extracts content that is not displayed in the page.

Description

Page content extraction method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of internet technologies, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for extracting page content.
Background
With the continuous progress of science and technology, intelligent terminals have become widespread and are now indispensable tools in users' daily lives. Various applications (APPs) can be installed on an intelligent terminal, and an application can be used to present page content. The page content displayed by an application has considerable technical value. For example, the page content reflects the personal preferences of the user, so extracting and analyzing it makes it possible to determine the user's personal preference data.
At present, extracting page content requires acquiring an HTML (HyperText Markup Language) text corresponding to the page content, parsing the HTML text into a DOM (Document Object Model) tree structure, locating a required element node in the DOM tree structure, and extracting the page content from the element node.
However, a third-party APP uses a proprietary web page display component, so the operating system does not support obtaining the HTML text from the third-party APP, and the page content cannot be extracted. Although the page content can be obtained by cracking the API of the third-party APP, the content obtained in this way involves user privacy data; both cracking the API of a third-party APP and privately obtaining user privacy data are illegal and carry substantial legal risk.
Disclosure of Invention
The embodiment of the invention mainly aims to provide a page content extraction method, a page content extraction device, page content extraction equipment and a computer readable storage medium, so as to solve the problem that the existing operating system does not support acquisition of HTML (hypertext markup language) texts from a third-party APP and cannot realize extraction of page content.
In view of the above technical problems, the embodiments of the present invention are solved by the following technical solutions:
The embodiment of the invention provides a page content extraction method, which comprises the following steps: starting a target application program, wherein the target application program is used for displaying a page; reading a plurality of pieces of preset node operation information, wherein each piece of node operation information indicates that a node operation of a preset operation type is to be executed at a target element node position in the page; and sequentially executing the node operations indicated by the pieces of node operation information according to their arrangement order, wherein, when the operation type indicated by a piece of node operation information is an extraction type, the element content at the target element node position indicated by that piece of node operation information is extracted.
Before the reading of the preset operation information of the plurality of nodes, the method further comprises: capturing a page of the target application program through a preset layout analysis tool, and identifying element nodes in the page; determining a plurality of target element nodes from the element nodes identified by the layout analysis tool; and constructing node operation information for each target element node according to the position of each target element node, and sequencing the node operation information according to the display sequence of the target element nodes in the page.
After the constructing node operation information for each target element node and before the reading of the preset multiple node operation information, the method further includes: according to the arrangement sequence of the node operation information, storing the node operation information corresponding to the target element nodes into a configuration file; the reading of the preset operation information of the plurality of nodes comprises: and reading a plurality of node operation information which are sequentially arranged in the configuration file through a preset automation tool.
Wherein, the sequentially executing the node operations indicated by the plurality of node operation information according to the arrangement sequence of the plurality of node operation information includes: and executing the node operation respectively indicated by the plurality of node operation information sequentially arranged in the configuration file through the automation tool.
An embodiment of the present invention further provides a page content extraction apparatus, including: a starting module, configured to start a target application program, wherein the target application program is used for displaying a page; a reading module, configured to read a plurality of pieces of preset node operation information, wherein each piece of node operation information indicates that a node operation of a preset operation type is to be executed at a target element node position in the page; and an execution module, configured to sequentially execute the node operations indicated by the pieces of node operation information according to their arrangement order, wherein, when the operation type indicated by a piece of node operation information is an extraction type, the element content at the target element node position indicated by that piece of node operation information is extracted.
Wherein the apparatus further comprises a configuration module; the configuration module is configured to: capturing a page of the target application program through a preset layout analysis tool before reading the preset operation information of the plurality of nodes, and identifying element nodes in the page; determining a plurality of target element nodes from the element nodes identified by the layout analysis tool; and constructing node operation information for each target element node according to the position of each target element node, and sequencing the node operation information according to the display sequence of the target element nodes in the page.
The configuration module is further configured to, after the node operation information is constructed for each target element node, store the node operation information corresponding to each of the plurality of target element nodes in a configuration file according to an arrangement order of the plurality of node operation information before the preset plurality of node operation information is read; the reading module is further configured to: and reading a plurality of node operation information which are sequentially arranged in the configuration file.
Wherein the execution module is further to: and executing node operations respectively indicated by the plurality of pieces of node operation information sequentially arranged in the configuration file through a preset automation tool.
The embodiment of the invention also provides a page content extraction device, which comprises at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory communicate with each other through the bus; the processor is configured to call program instructions in the memory, and the program instructions, when executed by the processor, implement the steps of any of the above page content extraction methods.
The embodiment of the present invention further provides a computer-readable storage medium on which a page content extraction program is stored; when executed by a processor, the page content extraction program implements the steps of any of the page content extraction methods described above.
The embodiment of the invention has the following beneficial effects:
In the embodiment of the invention, node operation information is preset, and each piece of node operation information indicates that a node operation of a preset operation type is to be executed at one target element node position in the page. The required page content can therefore be extracted by executing operations at the different target element node positions of the page in the order of the node operation information. The extraction process is simple and easy to operate, the API of the application program does not need to be cracked, and content that is not displayed in the page is not extracted.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention and do not constitute a limitation of the invention. In the drawings:
FIG. 1 is a flow diagram of a method of page content extraction according to an embodiment of the invention;
FIG. 2 is a detailed flowchart of a page content extraction method according to an embodiment of the invention;
FIG. 3 is a structural diagram of a page content extracting apparatus according to an embodiment of the present invention;
FIG. 4 is a block diagram of a device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and specific embodiments.
According to an embodiment of the invention, a page content extraction method is provided. Fig. 1 is a flowchart illustrating a page content extracting method according to an embodiment of the present invention.
Step S110, starting a target application program; wherein the target application is used for displaying pages.
The target application refers to an application program which needs to extract page content.
Step S120, reading a plurality of pieces of preset node operation information; each piece of node operation information indicates that a node operation of a preset operation type is to be executed at a target element node position in the page.
A target element node is an element node that must be operated on in order to extract the page content.
The target element node position refers to the position of the target element node in the page.
Operation types include, but are not limited to: an input type, a click type, a page scroll type, and an extraction type.
Since the present embodiment does not locate the element node by querying the DOM tree corresponding to the page, the target element node is represented by its position in the page.
Since the page content to be extracted may not be reachable with a single element node operation, and may instead require operating on element nodes several times, the present embodiment presets a plurality of pieces of node operation information.
Step S130, sequentially executing the node operations indicated by the plurality of node operation information according to the arrangement order of the plurality of node operation information; when the operation type indicated by the node operation information is an extraction type, extracting element content at a target element node position indicated by the node operation information.
The element content of the target element node is the page content needing to be extracted.
Each piece of node operation information corresponds to one target element node. The target element nodes are sorted according to the order in which they are displayed (operated) in the page, and the pieces of node operation information are arranged in the same order; that is, the position of a piece of node operation information in the sequence is the same as that of its corresponding target element node.
When the node operation indicated by a piece of node operation information is executed, the node operation is performed at the target element node position according to the operation type of that node operation information. For example, if the operation type is the input type, information is input at the target element node position.
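The sequential execution of step S130 can be sketched as a simple dispatch loop. This is only an illustrative sketch: `NodeOperation`, `FakePage`, and `run_operations` are hypothetical names, the page is an in-memory stand-in rather than a real application, and the action names are borrowed from the configuration example given later in this description.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class NodeOperation:
    action: str                  # operation type: "input", "click", "scroll" or "capture"
    element: str                 # target element node position in the page
    value: Optional[str] = None  # hypothetical parameter, e.g. the keyword to input


class FakePage:
    """Hypothetical stand-in for a displayed page (no real device is driven)."""

    def __init__(self, contents: Dict[str, str]) -> None:
        self.contents = dict(contents)        # element position -> element content
        self.log: List[Tuple[str, ...]] = []  # operations performed so far

    def input(self, position: str, text: str) -> None:
        self.log.append(("input", position, text))

    def click(self, position: str) -> None:
        self.log.append(("click", position))

    def scroll(self, position: str) -> None:
        self.log.append(("scroll", position))

    def read(self, position: str) -> str:
        return self.contents[position]


def run_operations(page: FakePage, operations: List[NodeOperation]) -> List[str]:
    """Execute operations in their arrangement order; collect extracted content."""
    extracted: List[str] = []
    for op in operations:
        if op.action == "input":
            page.input(op.element, op.value or "")
        elif op.action == "click":
            page.click(op.element)
        elif op.action == "scroll":
            page.scroll(op.element)
        elif op.action == "capture":  # extraction type
            extracted.append(page.read(op.element))
        else:
            raise ValueError(f"unknown operation type: {op.action!r}")
    return extracted
```

Only the extraction-type branch returns content; the other operation types merely bring the page into the state in which the target element is displayed.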
In this embodiment, node operation information is preset, and each piece of node operation information indicates that a node operation of a preset operation type is to be executed at one target element node position in the page. The required page content can therefore be extracted by executing operations at the different target element node positions of the page in the order of the node operation information. The extraction process is simple and easy to operate, the API of the application program does not need to be cracked, content that is not displayed in the page is not extracted, and no legal problems are involved.
A more specific implementation is provided below to further describe the page content extraction method according to the embodiment of the present invention.
Fig. 2 is a detailed flowchart of a page content extracting method according to an embodiment of the present invention.
Step S210, capturing a page of the target application program through a preset layout analysis tool, and identifying element nodes in the page.
The layout analysis tool may be uiautomatorviewer, the App layout analysis tool included in the Android software development kit.
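As an illustration of step S210, the sketch below lists the element nodes found in a layout dump. It assumes the dump is an XML hierarchy of `node` elements carrying `class`, `text`, and `bounds` attributes, as Android UI Automator tooling typically produces; the sample dump itself and the function name `list_element_nodes` are made up for this example.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical sample dump; a real one would come from the layout analysis tool.
SAMPLE_DUMP = """<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" text="" bounds="[0,0][1080,1920]">
    <node class="android.widget.EditText" text="" bounds="[40,100][840,200]"/>
    <node class="android.widget.Button" text="Search" bounds="[860,100][1040,200]"/>
  </node>
</hierarchy>"""

BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def list_element_nodes(xml_text: str) -> List[Dict[str, object]]:
    """Return every element node in document order with its class, text and bounds."""
    root = ET.fromstring(xml_text)
    nodes: List[Dict[str, object]] = []
    for node in root.iter("node"):
        match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
        if match is None:
            continue  # skip nodes without a usable on-screen position
        left, top, right, bottom = map(int, match.groups())
        nodes.append({
            "class": node.get("class"),
            "text": node.get("text", ""),
            "bounds": (left, top, right, bottom),
        })
    return nodes
```

The resulting list is the candidate set from which target element nodes are chosen in step S220.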
Step S220, determining a plurality of target element nodes in the element nodes identified by the layout analysis tool.
The operation flow of the automated operation can be arranged according to business requirements; the operation flow comprises the element nodes that need to be operated in sequence. The element nodes included in the operation flow are determined to be the target element nodes.
For example, the operation flow is: click the input box of page A and input a keyword, then click the search button of page A, then scroll the result list of page B, then obtain the product and price information from the result list of page B. The element nodes included in this operation flow are then: the input box, the search button, the result list, the result list items, and the price text.
Step S230, constructing node operation information for each target element node according to the position of each target element node, and sorting the node operation information according to the display sequence of the target element nodes in the page.
To make it easier to locate the target element node in the page, the target element node position may be represented in the node operation information by the root element node position, the position of an upper-level element node of the target element node (e.g., its parent element node), and the position of the target element node itself. In this way, the positioning range is narrowed step by step.
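The step-by-step narrowing can be sketched as a walk down a tree of element nodes. The sketch uses the "=>" path notation and the "(1)" index that appear in this embodiment's configuration example; `ViewNode`, `parse_path`, and `resolve` are hypothetical names, and treating "(n)" as a 1-based index among same-named siblings is an assumption, not something the description states.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ViewNode:
    name: str
    text: str = ""
    children: List["ViewNode"] = field(default_factory=list)


SEGMENT_RE = re.compile(r"^(\w+)(?:\((\d+)\))?$")


def parse_path(path: str) -> List[Tuple[str, int]]:
    """Split 'Root=>Parent=>Target(2)' into (name, index) pairs."""
    segments: List[Tuple[str, int]] = []
    for raw in path.split("=>"):
        match = SEGMENT_RE.match(raw.strip())
        if match is None:
            raise ValueError(f"bad path segment: {raw!r}")
        # Assumption: "(n)" is 1-based; a segment without an index means the first match.
        segments.append((match.group(1), int(match.group(2) or 1)))
    return segments


def resolve(root: ViewNode, path: str) -> Optional[ViewNode]:
    """Narrow the search from the root node down to the target element node."""
    segments = parse_path(path)
    if root.name != segments[0][0]:
        return None
    current = root
    for name, index in segments[1:]:
        matches = [child for child in current.children if child.name == name]
        if index < 1 or index > len(matches):
            return None  # no such element at this level
        current = matches[index - 1]
    return current
```

Each path segment restricts the search to the children of the node found so far, which is why no lookup over the whole page is needed.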
Step S240, storing the node operation information corresponding to the target element nodes into a configuration file according to the arrangement order of the node operation information.
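Steps S230 and S240 can be sketched together as follows. This is a minimal sketch under stated assumptions: `TargetNode`, its `order` field, `build_node_operations`, and `save_config` are hypothetical, and JSON is assumed as the configuration file format because the example configuration in this embodiment resembles JSON.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TargetNode:
    path: List[str]  # element names from the root element node down to the target
    action: str      # operation type for this node
    page: str        # page on which the node is displayed
    order: int       # hypothetical field: position in the operation flow


def build_node_operations(targets: List[TargetNode]) -> List[Dict[str, str]]:
    """Construct one piece of node operation information per target, in flow order."""
    ordered = sorted(targets, key=lambda t: t.order)
    return [
        {"action": t.action, "element": "=>".join(t.path), "page": t.page}
        for t in ordered
    ]


def save_config(operations: List[Dict[str, str]], file_path: str) -> None:
    """Store the ordered node operation information in a JSON configuration file."""
    with open(file_path, "w", encoding="utf-8") as fh:
        json.dump(operations, fh, ensure_ascii=False, indent=2)
```

Because a JSON array preserves element order, the arrangement order of the node operation information survives the round trip through the file.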
For example: the contents of the configuration file include:
[
{"action":"input","element":"RootView=>MainView=>InputView","page":"A"},
{"action":"click","element":"RootView=>MainView=>SearchView","page":"A"},
{"action":"scroll","element":"RootView=>MainView=>ScrollView","page":"B"},
{"action":"capture","element":"RootView=>MainView=>ScrollView=>ItemView(1)=>ItemTitle","page":"B"},
{"action":"capture","element":"RootView=>MainView=>ScrollView=>ItemView(1)=>ItemPrice","page":"B"}
]
In the configuration file, the content within each { } is one piece of node operation information. "action" indicates the operation type: "input" indicates the input type, "click" the click type, "scroll" the page scroll type, and "capture" the extraction type. "element" represents the target element node, wherein:
"RootView=>MainView=>InputView" denotes the input box element node.
"RootView=>MainView=>SearchView" denotes the search button element node.
"RootView=>MainView=>ScrollView" denotes the page scroll element node.
"RootView=>MainView=>ScrollView=>ItemView(1)=>ItemTitle" denotes the commodity name (element content) of the search result element node.
"RootView=>MainView=>ScrollView=>ItemView(1)=>ItemPrice" denotes the price (element content) of the search result element node.
"page" indicates the page on which the target element node is located.
And step S250, starting the target application program through a preset automation tool.
The automation tool may be uiautomator2, an automation framework for Android.
Step S260, reading, by the automation tool, the multiple pieces of node operation information sequentially arranged in the configuration file.
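Reading the configuration in step S260 can be sketched as parsing and validating the stored entries before anything is executed. The function name `parse_config` and the validation rules are assumptions made for illustration; the description only states that the node operation information is read in its stored order.

```python
import json
from typing import Dict, List

# Operation types and keys as they appear in this embodiment's example configuration.
ALLOWED_ACTIONS = {"input", "click", "scroll", "capture"}
REQUIRED_KEYS = ("action", "element", "page")


def parse_config(text: str) -> List[Dict[str, str]]:
    """Parse configuration text and return the entries in their stored order."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        raise ValueError("configuration must be a JSON array")
    for position, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {position} is not an object")
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            raise ValueError(f"entry {position} is missing keys: {missing}")
        if entry["action"] not in ALLOWED_ACTIONS:
            raise ValueError(f"entry {position} has unknown action {entry['action']!r}")
    return entries
```

Rejecting a malformed entry up front avoids leaving the application in a half-operated state partway through the flow.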
Step S270, executing, by the automation tool, node operations respectively indicated by the plurality of pieces of node operation information sequentially arranged in the configuration file.
Taking the above configuration file as an example:
When the step whose "action" is "input" is executed, the automation tool inputs a preset keyword into InputView. The "input" step may carry a parameter, which is the keyword.
When the step whose "action" is "click" is executed, the automation tool generates a click event at SearchView (the search button position).
When the step whose "action" is "scroll" is executed, the automation tool operates ScrollView to scroll the search result page.
When the steps whose "action" is "capture" are executed, the automation tool extracts the element content at the ItemTitle position and at the ItemPrice position of ItemView(1), and stores the extracted element content in a local storage device. Once all steps whose "action" is "capture" have been completed, the automated operation is finished, and all of the extracted element content is retrieved from the local storage device to form a structured document. For example, in the structured document the element content takes the following format:
[
{"itemTitle":"big bag of potato chips","itemPrice":"5.5"},
{"itemTitle":"Small Tacrouch chip","itemPrice":"3.5"}
]
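One way the captured element content could be assembled into such a structured document is sketched below. The grouping rule is an assumption: captures are grouped by the ItemView index in their element path, and the key is the last path segment with its first letter lowercased. The function name `to_structured_document` is hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Matches the trailing "ItemView(n)=>FieldName" part of an element path.
ITEM_RE = re.compile(r"ItemView\((\d+)\)=>(\w+)$")


def to_structured_document(captures: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Group (element path, element content) pairs into one record per list item."""
    items: Dict[int, Dict[str, str]] = {}
    for path, content in captures:
        match = ITEM_RE.search(path)
        if match is None:
            continue  # not a per-item capture; ignored in this sketch
        index, field_name = int(match.group(1)), match.group(2)
        key = field_name[0].lower() + field_name[1:]  # "ItemTitle" -> "itemTitle"
        items.setdefault(index, {})[key] = content
    return [items[index] for index in sorted(items)]
```

The records come out ordered by item index, so the structured document follows the order of the result list on the page.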
In this embodiment, the element nodes in the page are identified by the layout analysis tool, the target element nodes are determined from the identified element nodes, and node operation information is constructed to form a configuration file that can be used in a standardized manner. Running the configuration file through the automation tool operates the page automatically and extracts the page content. The whole extraction process is simple, easy to operate, and efficient; during extraction, the API of the application program does not need to be cracked, and user privacy information that is not displayed in the page is not extracted.
The embodiment of the invention also provides a device for extracting the page content. Fig. 3 is a block diagram of a page content extracting apparatus according to an embodiment of the present invention.
The page content extraction device includes: a starting module 310, a reading module 320 and an executing module 330.
A starting module 310, configured to start a target application; wherein the target application is used for displaying pages.
A reading module 320, configured to read preset multiple node operation information; and each piece of node operation information is used for indicating a node operation of a preset operation type to be executed at a target element node position in the page.
An executing module 330, configured to sequentially execute, according to an arrangement order of the plurality of pieces of node operation information, a plurality of node operations indicated by the node operation information; when the operation type indicated by the node operation information is an extraction type, extracting element content at a target element node position indicated by the node operation information.
Wherein the device further comprises a configuration module (not shown in the figures).
The configuration module is configured to: capturing a page of the target application program through a preset layout analysis tool before reading the preset operation information of the plurality of nodes, and identifying element nodes in the page; determining a plurality of target element nodes from the element nodes identified by the layout analysis tool; and constructing node operation information for each target element node according to the position of each target element node, and sequencing the node operation information according to the display sequence of the target element nodes in the page.
The configuration module is further configured to, after the node operation information is constructed for each target element node, store the node operation information corresponding to each of the plurality of target element nodes in a configuration file according to an arrangement order of the plurality of node operation information before the preset plurality of node operation information is read. The reading module 320 is further configured to: and reading a plurality of node operation information which are sequentially arranged in the configuration file.
Wherein the execution module 330 is further configured to: and executing the node operation respectively indicated by the plurality of node operation information sequentially arranged in the configuration file through a preset automation tool.
The functions of the apparatus according to the embodiments of the present invention have been described in the foregoing method embodiments, so that reference may be made to the related descriptions in the foregoing embodiments for details that are not described in the foregoing embodiments of the present invention, and further details are not described herein.
The page content extracting device comprises a processor and a memory, wherein the starting module 310, the reading module 320, the executing module 330 and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided. By adjusting kernel parameters, a plurality of pieces of node operation information are set, each of which indicates that a node operation of a preset operation type is to be executed at one target element node position in the page; the required page content can then be extracted by executing operations at the different target element node positions of the page in the order of the node operation information.
An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing the page content extraction method when executed by a processor.
The embodiment of the invention provides a processor, which is used for running a program, wherein the page content extraction method is executed when the program runs.
The embodiment of the invention provides a page content extraction device. Fig. 4 is a block diagram of a device according to an embodiment of the present invention. The device 400 includes at least one processor 410, and at least one memory 420 and a bus 430 connected to the processor 410; the processor 410 and the memory 420 communicate with each other through the bus 430; the processor 410 is used to call program instructions in the memory 420 to perform the page content extraction method described above. The device herein may be a server, a PC, a tablet (PAD), a mobile phone, etc.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: starting a target application program, wherein the target application program is used for displaying a page; reading a plurality of pieces of preset node operation information, wherein each piece of node operation information indicates that a node operation of a preset operation type is to be executed at a target element node position in the page; and sequentially executing the node operations indicated by the pieces of node operation information according to their arrangement order, wherein, when the operation type indicated by a piece of node operation information is an extraction type, the element content at the target element node position indicated by that piece of node operation information is extracted.
Before the reading of the preset operation information of the plurality of nodes, the method further includes: capturing a page of the target application program through a preset layout analysis tool, and identifying element nodes in the page; determining a plurality of target element nodes from the element nodes identified by the layout analysis tool; and constructing node operation information for each target element node according to the position of each target element node, and sequencing the node operation information according to the display sequence of the target element nodes in the page.
After the constructing node operation information for each target element node and before the reading of the preset multiple node operation information, the method further includes: according to the arrangement sequence of the node operation information, storing the node operation information corresponding to the target element nodes into a configuration file; the reading of the preset operation information of the plurality of nodes comprises: and reading a plurality of node operation information which are sequentially arranged in the configuration file through a preset automation tool.
Wherein, the sequentially executing the node operations indicated by the plurality of node operation information according to the arrangement order of the plurality of node operation information includes: and executing the node operation respectively indicated by the plurality of node operation information sequentially arranged in the configuration file through the automation tool.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include forms of computer-readable media such as volatile memory, for example Random Access Memory (RAM), and/or non-volatile memory, for example Read Only Memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A page content extraction method, characterized by comprising:
starting a target application program, wherein the target application program is used for displaying a page;
reading a plurality of pieces of preset node operation information, wherein each piece of node operation information indicates that a node operation of a preset operation type is to be executed at the position of a target element node in the page;
sequentially executing the node operations indicated by the plurality of pieces of node operation information according to the arrangement order of the plurality of pieces of node operation information, wherein, when the operation type indicated by a piece of node operation information is an extraction type, the element content at the target element node position indicated by that node operation information is extracted.
2. The method according to claim 1, wherein, before the reading of the plurality of pieces of preset node operation information, the method further comprises:
capturing a page of the target application program through a preset layout analysis tool, and identifying element nodes in the page;
determining a plurality of target element nodes from the element nodes identified by the layout analysis tool;
and constructing node operation information for each target element node according to the position of that target element node, and sorting the node operation information according to the display order of the target element nodes in the page.
3. The method according to claim 2, wherein
after the constructing of node operation information for each target element node and before the reading of the plurality of pieces of preset node operation information, the method further comprises:
storing, according to the arrangement order of the node operation information, the node operation information corresponding to each of the plurality of target element nodes into a configuration file;
and the reading of the plurality of pieces of preset node operation information comprises:
reading, through a preset automation tool, the plurality of pieces of node operation information sequentially arranged in the configuration file.
4. The method according to claim 3, wherein the sequentially executing of the node operations indicated by the plurality of pieces of node operation information according to the arrangement order of the plurality of pieces of node operation information comprises:
executing, through the automation tool, the node operations respectively indicated by the plurality of pieces of node operation information sequentially arranged in the configuration file.
5. A page content extraction apparatus, comprising:
a starting module, configured to start a target application program, wherein the target application program is used for displaying a page;
a reading module, configured to read a plurality of pieces of preset node operation information, wherein each piece of node operation information indicates that a node operation of a preset operation type is to be executed at the position of a target element node in the page;
an execution module, configured to sequentially execute the node operations indicated by the plurality of pieces of node operation information according to the arrangement order of the plurality of pieces of node operation information, wherein, when the operation type indicated by a piece of node operation information is an extraction type, the element content at the target element node position indicated by that node operation information is extracted.
6. The apparatus according to claim 5, further comprising a configuration module, wherein the configuration module is configured to:
capture, before the plurality of pieces of preset node operation information are read, a page of the target application program through a preset layout analysis tool, and identify element nodes in the page;
determine a plurality of target element nodes from the element nodes identified by the layout analysis tool;
and construct node operation information for each target element node according to the position of that target element node, and sort the node operation information according to the display order of the target element nodes in the page.
7. The apparatus according to claim 6, wherein
the configuration module is further configured to, after the node operation information is constructed for each target element node and before the plurality of pieces of preset node operation information are read, store the node operation information corresponding to each of the plurality of target element nodes in a configuration file according to the arrangement order of the plurality of pieces of node operation information;
and the reading module is further configured to read the plurality of pieces of node operation information sequentially arranged in the configuration file.
8. The apparatus according to claim 7, wherein the execution module is further configured to:
execute, through a preset automation tool, the node operations respectively indicated by the plurality of pieces of node operation information sequentially arranged in the configuration file.
9. Page content extraction equipment, characterized by comprising at least one processor, at least one memory connected to the processor, and a bus, wherein the processor and the memory communicate with each other through the bus, and the processor is configured to call program instructions in the memory, the program instructions, when executed by the processor, implementing the steps of the page content extraction method according to any one of claims 1 to 4.
10. A computer-readable storage medium on which a page content extraction program is stored, wherein the program, when executed by a processor, implements the steps of the page content extraction method according to any one of claims 1 to 4.
CN202011493052.9A 2020-12-17 2020-12-17 Page content extraction method, device, equipment and computer readable storage medium Pending CN114647466A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011493052.9A CN114647466A (en) 2020-12-17 2020-12-17 Page content extraction method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011493052.9A CN114647466A (en) 2020-12-17 2020-12-17 Page content extraction method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114647466A true CN114647466A (en) 2022-06-21

Family

ID=81990539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011493052.9A Pending CN114647466A (en) 2020-12-17 2020-12-17 Page content extraction method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114647466A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593207A (en) * 2009-07-06 2009-12-02 孟智平 The method and system of a kind of structure and generation webpage
CN101996196A (en) * 2009-08-28 2011-03-30 中国移动通信集团公司 Dynamic webpage acquisition method and device
CN104063115A (en) * 2013-03-18 2014-09-24 联想(北京)有限公司 Information processing method and electronic equipment
CN104504027A (en) * 2014-12-12 2015-04-08 北京国双科技有限公司 Method and device for automatically selecting webpage content
CN104537040A (en) * 2014-12-23 2015-04-22 小米科技有限责任公司 Method and device for capturing webpage content and electronic device
CN105373468A (en) * 2014-06-20 2016-03-02 阿里巴巴集团控股有限公司 A detection method and system for WEB automation testability
CN106776319A (en) * 2016-12-15 2017-05-31 广州酷狗计算机科技有限公司 Automatic test approach and device
CN107423322A (en) * 2017-03-31 2017-12-01 广州视源电子科技股份有限公司 The display methods and device of the label nesting level of Webpage
CN108595329A (en) * 2018-04-23 2018-09-28 腾讯科技(深圳)有限公司 A kind of application testing method, device and computer storage media
CN110162682A (en) * 2019-04-12 2019-08-23 深圳壹账通智能科技有限公司 A kind of crawling method of network data, device, storage medium and terminal device
CN111523074A (en) * 2020-04-26 2020-08-11 成都思维世纪科技有限责任公司 Acquisition system for dynamic page sensitive data of front-end rendering website
CN111797340A (en) * 2020-06-10 2020-10-20 浙江大学 Service packaging system for user-defined extraction flow

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
避雨亭: "使用uiautomatorviewer获取元素", pages 1, Retrieved from the Internet <URL:https://www.cnblogs.com/biyuting/p/6955318.html> *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20220621