CN114647466A

CN114647466A - Page content extraction method, device, equipment and computer readable storage medium

Info

Publication number: CN114647466A
Application number: CN202011493052.9A
Authority: CN
Inventors: 王安迪
Original assignee: Guoxinjunhe Beijing Technology Co ltd
Current assignee: Guoxinjunhe Beijing Technology Co ltd
Priority date: 2020-12-17
Filing date: 2020-12-17
Publication date: 2022-06-21

Abstract

The invention discloses a page content extraction method, a page content extraction device, page content extraction equipment and a computer-readable storage medium. The method comprises the following steps: starting a target application program, wherein the target application program is used for displaying a page; reading a plurality of preset pieces of node operation information, wherein each piece of node operation information indicates that a node operation of a preset operation type is to be executed at a target element node position in the page; and sequentially executing the node operations indicated by the pieces of node operation information according to their arrangement order, wherein, when the operation type indicated by a piece of node operation information is an extraction type, the element content at the target element node position indicated by that node operation information is extracted. The method and the device do not need to crack the API of the application program and do not extract content that is not shown in the page.

Description

Page content extraction method, device, equipment and computer readable storage medium

Technical Field

The present invention relates to the field of internet technologies, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for extracting page content.

Background

With the continuous progress of science and technology, intelligent terminals have become widespread and are now indispensable tools in users' lives. An intelligent terminal can install various applications (APPs), and an application can be used to present page content. The page content displayed by an application has considerable technical value. For example, the page content reflects the personal preferences of the user, so extracting and analyzing the page content makes it possible to determine the user's personal preference data.

At present, extracting page content requires acquiring an HTML (HyperText Markup Language) text corresponding to the page content, parsing the HTML text into a DOM (Document Object Model) tree structure, locating a required element node in the DOM tree structure, and extracting the page content from the element node.

However, a third-party APP uses a private web page display component, so the operating system does not support obtaining the HTML text from the third-party APP, and the page content cannot be extracted in this way. Although the page content could be obtained by cracking the API of the third-party APP, the content obtained in that way involves user privacy data. Both cracking the API of a third-party APP and privately obtaining user privacy data are illegal and carry significant legal risk.

Disclosure of Invention

The embodiment of the invention mainly aims to provide a page content extraction method, a page content extraction device, page content extraction equipment and a computer readable storage medium, so as to solve the problem that the existing operating system does not support acquisition of HTML (hypertext markup language) texts from a third-party APP and cannot realize extraction of page content.

In view of the above technical problems, the embodiments of the present invention are solved by the following technical solutions:

The embodiment of the invention provides a page content extraction method, which comprises the following steps: starting a target application program, wherein the target application program is used for displaying a page; reading a plurality of preset pieces of node operation information, wherein each piece of node operation information indicates that a node operation of a preset operation type is to be executed at a target element node position in the page; and sequentially executing the node operations indicated by the pieces of node operation information according to their arrangement order, wherein, when the operation type indicated by a piece of node operation information is an extraction type, the element content at the target element node position indicated by that node operation information is extracted.

Before the reading of the preset operation information of the plurality of nodes, the method further comprises: capturing a page of the target application program through a preset layout analysis tool, and identifying element nodes in the page; determining a plurality of target element nodes from the element nodes identified by the layout analysis tool; and constructing node operation information for each target element node according to the position of each target element node, and sequencing the node operation information according to the display sequence of the target element nodes in the page.

After the constructing node operation information for each target element node and before the reading of the preset multiple node operation information, the method further includes: according to the arrangement sequence of the node operation information, storing the node operation information corresponding to the target element nodes into a configuration file; the reading of the preset operation information of the plurality of nodes comprises: and reading a plurality of node operation information which are sequentially arranged in the configuration file through a preset automation tool.

Wherein, the sequentially executing the node operations indicated by the plurality of node operation information according to the arrangement sequence of the plurality of node operation information includes: and executing the node operation respectively indicated by the plurality of node operation information sequentially arranged in the configuration file through the automation tool.

An embodiment of the present invention further provides a device for extracting page content, including: a starting module, used for starting the target application program, wherein the target application program is used for displaying a page; a reading module, used for reading a plurality of preset pieces of node operation information, wherein each piece of node operation information indicates that a node operation of a preset operation type is to be executed at a target element node position in the page; and an execution module, used for sequentially executing the node operations indicated by the pieces of node operation information according to their arrangement order, and, when the operation type indicated by a piece of node operation information is an extraction type, extracting the element content at the target element node position indicated by that node operation information.

Wherein the apparatus further comprises a configuration module; the configuration module is configured to: capturing a page of the target application program through a preset layout analysis tool before reading the preset operation information of the plurality of nodes, and identifying element nodes in the page; determining a plurality of target element nodes from the element nodes identified by the layout analysis tool; and constructing node operation information for each target element node according to the position of each target element node, and sequencing the node operation information according to the display sequence of the target element nodes in the page.

The configuration module is further configured to, after the node operation information is constructed for each target element node, store the node operation information corresponding to each of the plurality of target element nodes in a configuration file according to an arrangement order of the plurality of node operation information before the preset plurality of node operation information is read; the reading module is further configured to: and reading a plurality of node operation information which are sequentially arranged in the configuration file.

Wherein the execution module is further to: and executing node operations respectively indicated by the plurality of pieces of node operation information sequentially arranged in the configuration file through a preset automation tool.

The embodiment of the invention also provides page content acquisition equipment, which comprises at least one processor, at least one memory and a bus, wherein the memory and the bus are connected with the processor; the processor and the memory complete mutual communication through the bus; the processor is configured to call program instructions in the memory, and the program instructions, when executed by the processor, implement the steps of any of the above-mentioned page content obtaining methods.

The embodiment of the present invention further provides a computer-readable storage medium, where a page content obtaining program is stored on the computer-readable storage medium, and when being executed by a processor, the page content obtaining program implements the steps of any one of the page content obtaining methods described above.

The embodiment of the invention has the following beneficial effects:

In the embodiment of the invention, the node operation information is preset, and each piece of node operation information indicates that a node operation of a preset operation type is to be executed at one target element node position in the page. The required page content can therefore be extracted by executing operations at different target element node positions in the page, in the execution order of the node operation information. The extraction process is simple and easy to operate, the API of the application program does not need to be cracked, and content that is not shown in the page is not extracted.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention and do not constitute a limitation of the invention. In the drawings:

FIG. 1 is a flow diagram of a method of page content extraction according to an embodiment of the invention;

FIG. 2 is a detailed flowchart of a page content extraction method according to an embodiment of the invention;

FIG. 3 is a structural diagram of a page content extraction device according to an embodiment of the invention;

FIG. 4 is a block diagram of a device according to an embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and specific embodiments.

According to an embodiment of the invention, a page content extraction method is provided. Fig. 1 is a flowchart illustrating a page content extracting method according to an embodiment of the present invention.

Step S110, starting a target application program; wherein the target application is used for displaying pages.

The target application refers to an application program which needs to extract page content.

Step S120, reading a plurality of preset pieces of node operation information, wherein each piece of node operation information indicates that a node operation of a preset operation type is to be executed at a target element node position in the page.

A target element node is an element node that must be operated on in order to extract the page content.

The target element node location refers to the location of the target element node in the page.

Operation types include, but are not limited to: an input type, a click type, a page scroll type, and an extraction type.

Since the present embodiment does not locate the element node by querying the DOM tree corresponding to the page, in the present embodiment, the target element node is represented using the position of the target element node in the page.

Since the page content to be extracted may not be reachable through a single element node operation, and may require operating on element nodes several times, the present embodiment presets a plurality of pieces of node operation information.

Step S130, sequentially executing the node operations indicated by the plurality of node operation information according to the arrangement order of the plurality of node operation information; when the operation type indicated by the node operation information is an extraction type, extracting element content at a target element node position indicated by the node operation information.

The element content of the target element node is the page content needing to be extracted.

Each node operation information corresponds to one target element node, a plurality of target element nodes are sorted according to the display (operation) sequence in the page, and the arrangement sequence of the plurality of node operation information is sorted according to the arrangement sequence of the plurality of target element nodes, namely: the sorting position of the node operation information is the same as that of the target element node corresponding to the node operation information.

When the node operation indicated by each piece of node operation information is executed, the node operation is performed at the target element node position according to the operation type in that node operation information. For example, if the operation type is the input type, information is input at the target element node position.

In this embodiment, the node operation information is preset, and each piece of node operation information indicates that a node operation of a preset operation type is to be executed at one target element node position in the page. The required page content can therefore be extracted by executing operations at different target element node positions of the page in sequence, in the execution order of the node operation information. The extraction process is simple and easy to operate, the API of the application program does not need to be cracked, content that is not shown in the page is not extracted, and no legal problem arises.
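Steps S120 and S130 can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the `NodeOperation` record, the `FakePage` stand-in for a displayed page, and the `value` parameter are all hypothetical names introduced here, and a real automation tool would take the place of `FakePage`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NodeOperation:
    """One piece of node operation information (hypothetical field names)."""
    action: str                   # "input", "click", "scroll" or "capture"
    element: str                  # target element node position in the page
    value: Optional[str] = None   # assumed parameter, e.g. the keyword to input


@dataclass
class FakePage:
    """In-memory stand-in for a displayed page; an assumption of this sketch."""
    contents: Dict[str, str]
    log: List[str] = field(default_factory=list)

    def input(self, element: str, text: str) -> None:
        self.log.append(f"input {text!r} at {element}")

    def click(self, element: str) -> None:
        self.log.append(f"click at {element}")

    def scroll(self, element: str) -> None:
        self.log.append(f"scroll at {element}")

    def read(self, element: str) -> str:
        return self.contents[element]


def run_operations(page: FakePage, operations: List[NodeOperation]) -> List[str]:
    """Execute the operations in their arrangement order and collect content."""
    extracted = []
    for op in operations:
        if op.action == "input":
            page.input(op.element, op.value or "")
        elif op.action == "click":
            page.click(op.element)
        elif op.action == "scroll":
            page.scroll(op.element)
        elif op.action == "capture":
            # Extraction type: read the element content at the node position.
            extracted.append(page.read(op.element))
        else:
            raise ValueError(f"unknown operation type: {op.action}")
    return extracted


page = FakePage(contents={"RootView=>MainView=>ItemTitle": "potato chips"})
ops = [
    NodeOperation("input", "RootView=>MainView=>InputView", "chips"),
    NodeOperation("click", "RootView=>MainView=>SearchView"),
    NodeOperation("capture", "RootView=>MainView=>ItemTitle"),
]
result = run_operations(page, ops)
print(result)
```

Only operations of the extraction type contribute to the result; the other types merely drive the page into the state in which the content is displayed.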

A more specific implementation is provided below to further describe the page content extraction method according to the embodiment of the present invention.

Fig. 2 is a detailed flowchart of a page content extracting method according to an embodiment of the present invention.

Step S210, capturing a page of the target application program through a preset layout analysis tool, and identifying element nodes in the page.

The layout analysis tool may be uiautomatorviewer, the App layout analysis tool that ships with the Android development kit.
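As an illustration of identifying element nodes from a captured layout, the sketch below parses a UI hierarchy dump of the kind such layout tools work with: XML `node` elements carrying a `bounds` attribute of the form `[left,top][right,bottom]`. The sample dump and the function name `list_element_nodes` are made up for this example, and real dumps carry many more attributes.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written sample; not captured from any real application.
SAMPLE_DUMP = """<hierarchy rotation="0">
  <node index="0" class="android.widget.FrameLayout" text="" bounds="[0,0][1080,1920]">
    <node index="0" class="android.widget.EditText" text="" bounds="[40,100][900,220]"/>
    <node index="1" class="android.widget.Button" text="Search" bounds="[900,100][1040,220]"/>
  </node>
</hierarchy>
"""

BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def list_element_nodes(xml_text):
    """Return class, text and on-screen bounds for every node, in document order."""
    root = ET.fromstring(xml_text)
    nodes = []
    for node in root.iter("node"):
        match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
        if match is None:
            continue  # skip nodes without usable position information
        left, top, right, bottom = map(int, match.groups())
        nodes.append({
            "class": node.get("class", ""),
            "text": node.get("text", ""),
            "bounds": (left, top, right, bottom),
        })
    return nodes


element_nodes = list_element_nodes(SAMPLE_DUMP)
for entry in element_nodes:
    print(entry["class"], entry["bounds"])
```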

Step S220, determining a plurality of target element nodes in the element nodes identified by the layout analysis tool.

The operation flow of the automated operation can be arranged according to business requirements, and the operation flow comprises the element nodes that need to be operated on in sequence. Each element node included in the operation flow is determined to be a target element node.

For example, the operation flow is as follows: click the input box of page A and input a keyword => click the search button of page A => scroll the result list of page B => obtain the commodity and price information in the result list of page B. The element nodes included in this operation flow are then: an input box, a search button, a result list, result list items, and price text.

Step S230, constructing node operation information for each target element node according to the position of each target element node, and sorting the node operation information according to the display sequence of the target element nodes in the page.

To facilitate locating the position of the target element node in the page, in the node operation information, the target element node position may be represented by a root element node position, a primary element node (e.g., parent element node) position of the target element node, and a target element node position. Thus, the positioning range can be gradually narrowed in order.
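The step-by-step narrowing from root node to target node can be sketched as follows. The `=>` separated path syntax is taken from the configuration example in this description; the `ViewNode` tree, the function names, and the reading of `ItemView(1)` as a 1-based index among same-named siblings are assumptions made for illustration only.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

SEGMENT_RE = re.compile(r"(\w+)(?:\((\d+)\))?")


def parse_path(path: str) -> List[Tuple[str, int]]:
    """Split 'A=>B(2)=>C' into [('A', 1), ('B', 2), ('C', 1)]."""
    segments = []
    for raw in path.split("=>"):
        match = SEGMENT_RE.fullmatch(raw.strip())
        if match is None:
            raise ValueError(f"bad path segment: {raw!r}")
        name, index = match.group(1), match.group(2)
        segments.append((name, int(index) if index else 1))
    return segments


@dataclass
class ViewNode:
    """Hypothetical element node: a name, optional text, and child nodes."""
    name: str
    text: str = ""
    children: List["ViewNode"] = field(default_factory=list)


def locate(root: ViewNode, path: str) -> Optional[ViewNode]:
    """Walk from the root node down to the target node, one level per segment."""
    segments = parse_path(path)
    if root.name != segments[0][0]:
        return None
    current = root
    for name, index in segments[1:]:
        matches = [child for child in current.children if child.name == name]
        if index < 1 or index > len(matches):
            return None
        current = matches[index - 1]  # assumed 1-based, as in ItemView(1)
    return current


tree = ViewNode("RootView", children=[
    ViewNode("MainView", children=[
        ViewNode("ScrollView", children=[
            ViewNode("ItemView", children=[
                ViewNode("ItemTitle", "big bag of potato chips"),
                ViewNode("ItemPrice", "5.5"),
            ]),
            ViewNode("ItemView", children=[
                ViewNode("ItemTitle", "small bag of potato chips"),
                ViewNode("ItemPrice", "3.5"),
            ]),
        ]),
    ]),
])

target = locate(tree, "RootView=>MainView=>ScrollView=>ItemView(2)=>ItemPrice")
print(target.text if target else None)
```

Each path segment restricts the search to the children of the node found so far, which is the gradual narrowing of the positioning range described above.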

Step S240, storing the node operation information corresponding to the target element nodes into a configuration file according to the arrangement order of the node operation information.

For example: the contents of the configuration file include:

[
  {"action": "input", "element": "RootView=>MainView=>InputView", "page": "A"},
  {"action": "click", "element": "RootView=>MainView=>SearchView", "page": "A"},
  {"action": "scroll", "element": "RootView=>MainView=>ScrollView", "page": "B"},
  {"action": "capture", "element": "RootView=>MainView=>ScrollView=>ItemView(1)=>ItemTitle", "page": "B"},
  {"action": "capture", "element": "RootView=>MainView=>ScrollView=>ItemView(1)=>ItemPrice", "page": "B"}
]

In the configuration file, the content within each { } is one piece of node operation information. "action" indicates the operation type: "input" indicates the input type, "click" indicates the click type, "scroll" indicates the page scroll type, and "capture" indicates the extraction type. "element" represents the target element node, where:

"RootView=>MainView=>InputView" denotes the input box element node.

"RootView=>MainView=>SearchView" denotes the search button element node.

"RootView=>MainView=>ScrollView" denotes the page scroll element node.

"RootView=>MainView=>ScrollView=>ItemView(1)=>ItemTitle" denotes the commodity name (element content) of the search result element node.

"RootView=>MainView=>ScrollView=>ItemView(1)=>ItemPrice" denotes the price (element content) of the search result element node.

"page" indicates the page on which the target element node is located.

Step S250, starting the target application program through a preset automation tool.

The automation tool may be uiautomator2, an automation framework for Android.

Step S260, reading, by the automation tool, the multiple pieces of node operation information sequentially arranged in the configuration file.
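Storing the node operation information (step S240) and reading it back in order (step S260) might look like the following sketch, assuming the configuration file is JSON as in the example above. The file name `operations.json`, the helper names, and the validation rules are assumptions of this example rather than part of the embodiment.

```python
import json
import os
import tempfile

ALLOWED_ACTIONS = {"input", "click", "scroll", "capture"}
REQUIRED_KEYS = {"action", "element", "page"}


def save_operations(operations, path):
    """Write the node operation information in its arrangement order."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(operations, handle, ensure_ascii=False, indent=2)


def load_operations(path):
    """Read the node operation information back; a JSON array keeps its order."""
    with open(path, encoding="utf-8") as handle:
        operations = json.load(handle)
    for position, op in enumerate(operations):
        missing = REQUIRED_KEYS - op.keys()
        if missing:
            raise ValueError(f"entry {position} lacks {sorted(missing)}")
        if op["action"] not in ALLOWED_ACTIONS:
            raise ValueError(f"entry {position} has unknown action {op['action']!r}")
    return operations


operations = [
    {"action": "input", "element": "RootView=>MainView=>InputView", "page": "A"},
    {"action": "click", "element": "RootView=>MainView=>SearchView", "page": "A"},
    {"action": "scroll", "element": "RootView=>MainView=>ScrollView", "page": "B"},
    {"action": "capture",
     "element": "RootView=>MainView=>ScrollView=>ItemView(1)=>ItemTitle",
     "page": "B"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "operations.json")  # hypothetical name
    save_operations(operations, config_path)
    loaded = load_operations(config_path)

print([op["action"] for op in loaded])
```

Because the file holds an ordered array, the arrangement order of the node operation information survives the round trip without any extra sequence field.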

Step S270, executing, by the automation tool, node operations respectively indicated by the plurality of pieces of node operation information sequentially arranged in the configuration file.

Taking the above configuration file as an example:

When the step whose "action" is "input" is executed, the automation tool inputs a preset keyword into InputView. The "input" step may carry a parameter, and the parameter is the keyword.

When the step whose "action" is "click" is executed, the automation tool generates a click event at SearchView (the search button position).

When the step whose "action" is "scroll" is executed, the automation tool operates ScrollView to scroll the search result page.

When the steps whose "action" is "capture" are executed, the automation tool extracts the element content at the ItemTitle position and the element content at the ItemPrice position of ItemView(1), and stores the extracted element contents in the local storage device. When all steps whose "action" is "capture" have been completed, the automated operation is finished, and all extracted element contents are retrieved from the local storage device to form a structured document. For example, in the structured document, the element content takes the following format:

[
  {"itemTitle": "big bag of potato chips", "itemPrice": "5.5"},
  {"itemTitle": "small bag of potato chips", "itemPrice": "3.5"}
]
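One possible way to assemble captured element contents into such a structured document is sketched below. It assumes that each capture is recorded as a pair of element path and extracted text, and that the item index and field name can be read from the tail of the path; the sample captures, including a second item `ItemView(2)`, are illustrative data rather than output of a real run.

```python
import json
import re

# Matches the tail of a path such as '...=>ItemView(2)=>ItemPrice'.
ITEM_RE = re.compile(r"ItemView\((\d+)\)=>(\w+)$")


def build_document(captures):
    """Group captured (element path, content) pairs into one record per item."""
    items = {}
    for element, content in captures:
        match = ITEM_RE.search(element)
        if match is None:
            continue  # not a field of a result list item
        index, field_name = int(match.group(1)), match.group(2)
        key = field_name[0].lower() + field_name[1:]  # ItemTitle -> itemTitle
        items.setdefault(index, {})[key] = content
    return [items[index] for index in sorted(items)]


captures = [
    ("RootView=>MainView=>ScrollView=>ItemView(1)=>ItemTitle", "big bag of potato chips"),
    ("RootView=>MainView=>ScrollView=>ItemView(1)=>ItemPrice", "5.5"),
    ("RootView=>MainView=>ScrollView=>ItemView(2)=>ItemTitle", "small bag of potato chips"),
    ("RootView=>MainView=>ScrollView=>ItemView(2)=>ItemPrice", "3.5"),
]

document = build_document(captures)
print(json.dumps(document, ensure_ascii=False, indent=2))
```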

In this embodiment, the element nodes in the page can be identified with the layout analysis tool, the target element nodes are determined from the identified element nodes, and the node operation information is constructed, forming a configuration file that can be used in a standardized manner. By running the configuration file through the automation tool, the page can be operated automatically and the page content extracted. The whole extraction process is simple and easy to operate, and the extraction efficiency is high. In the extraction process, the API of the application program does not need to be cracked, and user privacy information that is not shown in the page is not extracted.

The embodiment of the invention also provides a device for extracting the page content. Fig. 3 is a block diagram of a page content extracting apparatus according to an embodiment of the present invention.

The page content extraction device includes: an initiating module 310, a reading module 320 and an executing module 330.

A starting module 310, configured to start a target application; wherein the target application is used for displaying pages.

A reading module 320, configured to read preset multiple node operation information; and each piece of node operation information is used for indicating a node operation of a preset operation type to be executed at a target element node position in the page.

An executing module 330, configured to sequentially execute, according to an arrangement order of the plurality of pieces of node operation information, a plurality of node operations indicated by the node operation information; when the operation type indicated by the node operation information is an extraction type, extracting element content at a target element node position indicated by the node operation information.

Wherein the device further comprises a configuration module (not shown in the figures).

The configuration module is configured to: capturing a page of the target application program through a preset layout analysis tool before reading the preset operation information of the plurality of nodes, and identifying element nodes in the page; determining a plurality of target element nodes from the element nodes identified by the layout analysis tool; and constructing node operation information for each target element node according to the position of each target element node, and sequencing the node operation information according to the display sequence of the target element nodes in the page.

The configuration module is further configured to, after the node operation information is constructed for each target element node, store the node operation information corresponding to each of the plurality of target element nodes in a configuration file according to an arrangement order of the plurality of node operation information before the preset plurality of node operation information is read. The reading module 320 is further configured to: and reading a plurality of node operation information which are sequentially arranged in the configuration file.

Wherein the execution module 330 is further configured to: and executing the node operation respectively indicated by the plurality of node operation information sequentially arranged in the configuration file through a preset automation tool.

The functions of the apparatus according to the embodiments of the present invention have been described in the foregoing method embodiments, so that reference may be made to the related descriptions in the foregoing embodiments for details that are not described in the foregoing embodiments of the present invention, and further details are not described herein.
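The division of the device into a starting module, a reading module and an execution module can be pictured with the following sketch. The class names, the injected `launcher` callable, the per-action handlers and the application identifier `com.example.shop` are hypothetical; in practice an automation tool would supply the launching and operating behaviour.

```python
import json


class StartModule:
    """Starts the target application through an injected launcher callable."""
    def __init__(self, launcher):
        self._launcher = launcher

    def start(self, app_id):
        self._launcher(app_id)


class ReadModule:
    """Reads the preset node operation information from configuration text."""
    def read(self, config_text):
        return json.loads(config_text)


class ExecuteModule:
    """Executes node operations in order, using one handler per operation type."""
    def __init__(self, handlers):
        self._handlers = handlers

    def execute(self, operations):
        extracted = []
        for op in operations:
            result = self._handlers[op["action"]](op["element"])
            if op["action"] == "capture":
                extracted.append(result)  # keep content of extraction operations
        return extracted


class PageContentExtractor:
    """Wires the three modules together in the order start, read, execute."""
    def __init__(self, start_module, read_module, execute_module):
        self.start_module = start_module
        self.read_module = read_module
        self.execute_module = execute_module

    def run(self, app_id, config_text):
        self.start_module.start(app_id)
        operations = self.read_module.read(config_text)
        return self.execute_module.execute(operations)


launched = []
handlers = {
    "click": lambda element: None,
    "capture": lambda element: f"content at {element}",
}
extractor = PageContentExtractor(
    StartModule(launched.append), ReadModule(), ExecuteModule(handlers)
)
config_text = json.dumps([
    {"action": "click", "element": "RootView=>MainView=>SearchView", "page": "A"},
    {"action": "capture", "element": "RootView=>MainView=>ItemTitle", "page": "B"},
])
output = extractor.run("com.example.shop", config_text)
print(launched, output)
```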

The page content extracting device comprises a processor and a memory, wherein the starting module 310, the reading module 320, the executing module 330 and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.

The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided. By adjusting kernel parameters, a plurality of pieces of node operation information are set, each of which indicates that a node operation of a preset operation type is to be executed at one target element node position in the page, so that the required page content can be extracted by executing operations at different target element node positions of the page in sequence, in the execution order of the pieces of node operation information.

An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing the page content extraction method when executed by a processor.

The embodiment of the invention provides a processor, which is used for running a program, wherein the page content extraction method is executed when the program runs.

The embodiment of the invention provides a page content acquisition device. Fig. 4 is a block diagram of an apparatus according to an embodiment of the present invention. The device 400 includes at least one processor 410, and at least one memory 420, bus 430, coupled to the processor 410; the processor 410 and the memory 420 complete communication with each other through the bus 430; the processor 410 is used to call program instructions in the memory 420 to perform the page content extraction method described above. The device herein may be a server, a PC, a PAD, a mobile phone, etc.

The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: starting a target application program, wherein the target application program is used for displaying a page; reading a plurality of preset pieces of node operation information, wherein each piece of node operation information indicates that a node operation of a preset operation type is to be executed at a target element node position in the page; and sequentially executing the node operations indicated by the pieces of node operation information according to their arrangement order, wherein, when the operation type indicated by a piece of node operation information is an extraction type, the element content at the target element node position indicated by that node operation information is extracted.

Before the reading of the preset operation information of the plurality of nodes, the method further includes: capturing a page of the target application program through a preset layout analysis tool, and identifying element nodes in the page; determining a plurality of target element nodes from the element nodes identified by the layout analysis tool; and constructing node operation information for each target element node according to the position of each target element node, and sequencing the node operation information according to the display sequence of the target element nodes in the page.

Wherein, the sequentially executing the node operations indicated by the plurality of node operation information according to the arrangement order of the plurality of node operation information includes: and executing the node operation respectively indicated by the plurality of node operation information sequentially arranged in the configuration file through the automation tool.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.

The memory may include forms of computer-readable media such as volatile memory, Random Access Memory (RAM) and/or nonvolatile memory, for example Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A method for extracting page content is characterized by comprising the following steps:

starting a target application program; wherein the target application program is used for displaying a page;

reading a plurality of preset pieces of node operation information; wherein each piece of node operation information indicates that a node operation of a preset operation type is to be executed at a target element node position in the page;

sequentially executing the node operations indicated by the node operation information according to the arrangement sequence of the node operation information; when the operation type indicated by the node operation information is an extraction type, extracting element content at a target element node position indicated by the node operation information.

2. The method according to claim 1, further comprising, before reading the plurality of pieces of preset node operation information:

capturing a page of the target application program through a preset layout analysis tool, and identifying element nodes in the page;

determining a plurality of target element nodes from the element nodes identified by the layout analysis tool;

and constructing node operation information for each target element node according to the position of each target element node, and sequencing the node operation information according to the display sequence of the target element nodes in the page.
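As a rough sketch of the configuration steps in claim 2, the example below parses a layout dump, picks out target element nodes, and orders the resulting node operation information by display position. The XML shape, the `resource-id` and `bounds` attributes, and the top-to-bottom, left-to-right reading of "display sequence" are assumptions made for this example; the claims do not name a specific layout analysis tool or dump format.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# Hypothetical layout dump; element order in the file differs from display order.
LAYOUT_XML = """
<hierarchy>
  <node resource-id="price" text="9.99" bounds="[0,300][200,350]"/>
  <node resource-id="banner" text="" bounds="[0,0][200,100]"/>
  <node resource-id="title" text="Hello" bounds="[0,120][200,160]"/>
</hierarchy>
"""


def parse_bounds(bounds: str) -> Tuple[int, int, int, int]:
    """Turn a string such as "[0,120][200,160]" into (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = map(int, re.findall(r"-?\d+", bounds))
    return x1, y1, x2, y2


def build_operations(layout_xml: str, targets: Dict[str, str]) -> List[Dict[str, str]]:
    """Build node operation information for target nodes, sorted by display order.

    `targets` maps a node identifier to the operation type to perform on it.
    """
    root = ET.fromstring(layout_xml)
    located = []
    for node in root.iter("node"):
        node_id = node.get("resource-id")
        if node_id in targets:
            x1, y1, _, _ = parse_bounds(node.get("bounds"))
            located.append((y1, x1, node_id))
    located.sort()  # top-to-bottom, then left-to-right
    return [{"position": node_id, "type": targets[node_id]} for _, _, node_id in located]
```

Calling `build_operations(LAYOUT_XML, {"title": "extract", "price": "extract"})` yields the `title` entry before the `price` entry, because `title` sits higher on the page.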

3. The method of claim 2,

after constructing the node operation information for each target element node and before reading the plurality of pieces of preset node operation information, the method further comprises:

according to the arrangement sequence of the node operation information, storing the node operation information corresponding to the target element nodes into a configuration file;

the reading of the plurality of pieces of preset node operation information comprises:

reading, through a preset automation tool, the plurality of pieces of node operation information arranged in sequence in the configuration file.
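One simple way to preserve the arrangement sequence in a configuration file is to store the node operation information as an ordered list. The sketch below uses a JSON array for this; the JSON format, the field names `position` and `type`, and the helper names are assumptions for illustration, since the claims do not fix a file format.

```python
import json
import os
import tempfile
from typing import Dict, List


def save_operations(path: str, operations: List[Dict[str, str]]) -> None:
    """Write the node operation information to the configuration file in list order."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(operations, f, ensure_ascii=False, indent=2)


def load_operations(path: str) -> List[Dict[str, str]]:
    """Read the node operation information back; a JSON array keeps its order."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def round_trip(operations: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Save to a temporary configuration file and load it again."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "node_ops.json")  # hypothetical file name
        save_operations(path, operations)
        return load_operations(path)
```

Because the file is a plain ordered list, the automation tool that later reads it can execute the entries from first to last without any extra sorting step.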

4. The method according to claim 3, wherein the sequentially executing the node operations indicated by the node operation information according to the arrangement sequence of the node operation information comprises:

and executing the node operation respectively indicated by the plurality of node operation information sequentially arranged in the configuration file through the automation tool.

5. A page content extraction apparatus, comprising:

the starting module is used for starting the target application program; wherein the target application program is used for displaying a page;

the reading module is used for reading a plurality of preset node operation information; each piece of node operation information is used for indicating a node position of a target element in a page to execute node operation of a preset operation type;

the execution module is used for sequentially executing the node operations indicated by the node operation information according to the arrangement sequence of the node operation information; when the operation type indicated by the node operation information is an extraction type, extracting element content at the position of the target element node indicated by the node operation information.

6. The apparatus of claim 5, further comprising a configuration module; the configuration module is configured to:

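capturing a page of the target application program through a preset layout analysis tool before the plurality of pieces of preset node operation information are read, and identifying element nodes in the page;

determining a plurality of target element nodes from the element nodes identified by the layout analysis tool;

and constructing node operation information for each target element node according to the position of each target element node, and sequencing the node operation information according to the display sequence of the target element nodes in the page.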

7. The apparatus of claim 6,

the configuration module is further configured to, after the node operation information is constructed for each target element node and before the plurality of pieces of preset node operation information are read, store the node operation information corresponding to each of the plurality of target element nodes in a configuration file according to the arrangement sequence of the node operation information;

the reading module is further configured to read the plurality of pieces of node operation information arranged in sequence in the configuration file.

8. The apparatus of claim 7, wherein the execution module is further configured to:

and executing the node operation respectively indicated by the plurality of node operation information sequentially arranged in the configuration file through a preset automation tool.

9. A page content extraction device, characterized by comprising at least one processor, at least one memory connected to the processor, and a bus; the processor and the memory communicate with each other through the bus; the processor is configured to call program instructions in the memory, and the program instructions, when executed by the processor, implement the steps of the page content extraction method according to any one of claims 1 to 4.

10. A computer-readable storage medium, on which a page content extraction program is stored, wherein the program, when executed by a processor, implements the steps of the page content extraction method according to any one of claims 1 to 4.