CN112579951A - Page element selection method and device, storage medium and equipment - Google Patents

Publication number
CN112579951A
Authority
CN
China
Prior art keywords
target
style
page
determining
elements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910943329.4A
Other languages
Chinese (zh)
Inventor
满悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201910943329.4A
Publication of CN112579951A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions

Abstract

The disclosure relates to a page element selection method and apparatus, a storage medium and a device. The method comprises: acquiring the style and the tag name corresponding to a target element; acquiring the styles of target page elements, namely the elements in the page that have the same tag name and the same level as the target element, where elements at the same level in the page are elements at the same level in the DOM tree corresponding to the page; determining a target style set corresponding to the target element according to the styles of those target page elements that have the same style as the target element; and determining a style path according to the target style set, determining the elements pointed to by the style path as similar elements of the target element, and selecting the similar elements. Page elements can thus be selected accurately, and page content can in turn be extracted accurately.

Description

Page element selection method and device, storage medium and equipment
Technical Field
The present disclosure relates to the field of page processing, and in particular, to a page element selection method, apparatus, storage medium, and device.
Background
With the development of web page technology, extracting part of the content of a page in order to analyze the page is becoming increasingly common. In the prior art, a user who wants to extract part of the content of a page usually has to write an extraction path by hand. To simplify this operation and reduce the complexity of page content extraction, the user can instead select one element in the page, and the content of the similar elements in the page is extracted according to the style of that element. With this approach, however, if the selected element carries a special style, for example a style such as highlighting that is set only for one particular cell, then extracting page content according to that element selects only the element itself. The other elements in the page that are of the same kind as the cell are not selected, so the content of those cells is not extracted and the extracted page content is incomplete.
Disclosure of Invention
The invention aims to provide a page element selection method, a page element selection device, a storage medium and equipment which can accurately extract page content.
In order to achieve the above object, according to a first aspect of the present disclosure, there is provided a page element selection method, including:
acquiring a style and a tag name corresponding to a target element;
acquiring the styles of target page elements, which are the elements in a page that have the same tag name and the same level as the target element, wherein elements at the same level in the page are elements at the same level in the DOM tree corresponding to the page;
determining a target style set corresponding to the target element according to the styles of those elements, among the target page elements, that have the same style as the target element;
determining a style path according to the target style set corresponding to the target element, determining the elements pointed to by the style path as similar elements of the target element, and selecting the similar elements.
Optionally, before the step of determining a style path according to the target style set corresponding to the target element, the method further includes:
if the target element has a parent element and the parent element is not a preset element, adding the target element to a target element set, determining the parent element as the new target element, and returning to the step of acquiring the style and the tag name corresponding to the target element, until the parent element is the preset element;
determining a style path according to a target style set corresponding to the target element, including:
and determining the style path according to the target style set corresponding to each element in the target element set.
Optionally, the determining the style path according to the target style set corresponding to each element in the target element set includes:
generating the style path according to the parent-child relationships of the elements in the target element set, in order from parent element to child element, wherein each node of the style path comprises the tag name of an element and the target style set corresponding to that element.
Optionally, before the step of determining the target style set corresponding to the target element according to the style corresponding to the element, which has the same style as the target element, in the target page element, the method further includes:
determining, according to the DOM tree, whether a target sibling element of the target element exists among the target page elements, wherein a target sibling element is a sibling element, among the target page elements, that has the same style as the target element;
if a target sibling element exists among the target page elements, determining the target sibling element as the element having the same style as the target element;
the determining a target style set corresponding to the target element according to the styles of those elements, among the target page elements, that have the same style as the target element includes:
for each style of the target element, respectively performing the following steps:
determining the number of occurrences of the style in the set formed by the styles of the target sibling elements;
and when the number of occurrences meets a preset condition, adding the style to the target style set corresponding to the target element.
Optionally, the method further comprises:
if no target sibling element exists among the target page elements, adding all styles of the target element to the target style set corresponding to the target element, thereby determining the target style set.
Optionally, the determining the number of occurrences of the style in the set formed by the styles of the target sibling elements includes:
for each target sibling element, performing the steps of:
if the styles of the target sibling element include a style identical to the style, adding one to the number of occurrences, wherein the initial value of the number of occurrences is zero.
Optionally, the preset condition is that the number of occurrences is greater than a first preset threshold or a ratio of the number of occurrences to the number of target sibling elements is greater than a second preset threshold.
According to a second aspect of the present disclosure, there is provided a page element selecting apparatus, including:
a first obtaining module, configured to obtain the style and the tag name corresponding to a target element;
a second obtaining module, configured to obtain the styles of target page elements that have the same tag name and the same level as the target element in a page, wherein elements at the same level in the page are elements at the same level in the DOM tree corresponding to the page;
a first determining module, configured to determine a target style set corresponding to the target element according to the styles of those elements, among the target page elements, that have the same style as the target element;
and a selection module, configured to determine a style path according to the target style set corresponding to the target element, determine the elements pointed to by the style path as similar elements of the target element, and select the similar elements.
Optionally, the apparatus further comprises:
a second determining module, configured to, before the selection module determines the style path according to the target style set corresponding to the target element, add the target element to a target element set when the target element has a parent element and the parent element is not a preset element, determine the parent element as the new target element, and trigger the first obtaining module to obtain the style and the tag name corresponding to the target element, until the parent element is the preset element;
the selection module comprises:
a first determining submodule, configured to determine the style path according to the target style set corresponding to each element in the target element set.
Optionally, the first determining submodule is configured to:
generate the style path according to the parent-child relationships of the elements in the target element set, in order from parent element to child element, wherein each node of the style path comprises the tag name of an element and the target style set corresponding to that element.
Optionally, the apparatus further comprises:
a third determining module, configured to determine, according to the DOM tree, whether a target sibling element of the target element exists among the target page elements, before the first determining module determines the target style set corresponding to the target element according to the styles of those elements, among the target page elements, that have the same style as the target element, wherein a target sibling element is a sibling element, among the target page elements, that has the same style as the target element;
a fourth determining module, configured to determine, when a target sibling element exists among the target page elements, the target sibling element as the element having the same style as the target element;
the first determining module includes:
an adding submodule, configured to perform the following steps for each style of the target element: determining the number of occurrences of the style in the set formed by the styles of the target sibling elements; and when the number of occurrences meets a preset condition, adding the style to the target style set corresponding to the target element.
Optionally, the apparatus further comprises:
an adding module, configured to add all styles of the target element to the target style set corresponding to the target element when no target sibling element exists among the target page elements, thereby determining the target style set.
Optionally, the adding submodule is configured to, for each target sibling element, add one to the number of occurrences if the styles of the target sibling element include a style identical to the style, wherein the initial value of the number of occurrences is zero.
Optionally, the preset condition is that the number of occurrences is greater than a first preset threshold or a ratio of the number of occurrences to the number of target sibling elements is greater than a second preset threshold.
According to a third aspect of the present disclosure, there is provided a storage medium having stored thereon a program which, when executed by a processor, performs the steps of the method of any one of the above-mentioned first aspects.
According to a fourth aspect of the present disclosure, there is provided an apparatus comprising:
at least one processor, and at least one memory and a bus connected to the processor;
the processor and the memory complete mutual communication through the bus;
the processor is configured to call program instructions in the memory to perform the steps of any of the above methods of the first aspect.
In the above technical solution, the style and the tag name corresponding to the target element are acquired, together with the styles of the target page elements that have the same tag name and the same level as the target element in the page. The styles of the target element can therefore be filtered against the target page elements to determine a target style set, from which a style path is determined and the similar elements of the target element are selected in the page. Because the target style set is determined according to the styles of the target page elements that have the same style as the target element, special styles of the target element, such as the highlight style mentioned in the Background, can be filtered out effectively. When elements in the page are then selected, the similar elements of the target element can be selected according to the style path, which avoids incomplete selection caused by a style path that contains a special style. This improves the accuracy and efficiency of page element selection and reduces the manual work it requires.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a flow chart of a page element selection method provided in accordance with one embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an exemplary page;
FIG. 3 is a schematic diagram of a DOM tree corresponding to the page shown in FIG. 2;
FIG. 4 is a block diagram of a page element selection apparatus provided in accordance with one embodiment of the present disclosure;
FIG. 5 is a block diagram of an apparatus provided in accordance with one embodiment of the present disclosure.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
FIG. 1 is a flowchart of a page element selection method according to an embodiment of the present disclosure. As shown in FIG. 1, the method includes:
In S11, the style and the tag name corresponding to the target element are acquired.
The target element may be an element selected by the user in the current page, such as a piece of text or a link. For example, when the user selects a page element with the mouse, a "mouseout" event may be triggered, and the style corresponding to the target element is then obtained; the style may be a CSS style and is represented by a class. In the page shown in FIG. 2, the element currently selected by the user is the element corresponding to "travel attention" (denoted element T3), and the style corresponding to this element is obtained; there may be one or more styles. Illustratively, the styles obtained for the target element T3 are {class1, class2, class3}, where, for example, class1 is a style for left alignment, class2 is a style for underlining, and class3 is a style for the font and font size. The manner of obtaining the style corresponding to an element is known in the prior art and is not described here.
The tag name of the target element can be obtained through the HTML DOM tagName property. For example, as shown in FIG. 2, the tag name <a> of the target element T3 is acquired.
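As a rough illustration of S11, the sketch below models a page element with a small Python class and reads its styles and tag name. The `Element` class and the `style_and_tag` helper are hypothetical names introduced only for this example; they are not part of the disclosure, and in a browser the same information would come from the element's tag name and class attribute.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass(eq=False)
class Element:
    """Hypothetical stand-in for a DOM node (not a real browser API)."""
    tag: str
    classes: List[str] = field(default_factory=list)
    parent: Optional["Element"] = None


def style_and_tag(element: Element) -> Tuple[Set[str], str]:
    """S11: return the element's styles (class names) and its tag name."""
    return set(element.classes), element.tag


# Element T3 from the example: an <a> tag carrying three classes.
t3 = Element("a", ["class1", "class2", "class3"])
styles, tag = style_and_tag(t3)
```

Representing the styles as a set keeps the later membership checks simple; the order of the class names plays no role in this step.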
In S12, the styles of the target page elements, that is, the elements in the page that have the same tag name and the same level as the target element, are obtained, where elements at the same level in the page are elements at the same level in the DOM tree corresponding to the page. Unless otherwise stated, a target page element hereinafter refers to an element in the page that has the same tag name and the same level as the target element.
The DOM (Document Object Model) tree corresponding to the page may be generated from the source code of the page; the generation method is known in the prior art and is not described here.
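One possible reading of S12 is sketched below: the level of an element is taken to be its depth in the tree, and the target page elements are all other elements with the same tag name at that depth. The `Node` class, the helper names, the tree layout and the extra `nav` link are assumptions made for illustration; whether the target itself counts as a target page element is not stated in the text, and this sketch excludes it.

```python
from typing import Iterator, List, Optional, Sequence


class Node:
    """Hypothetical DOM node used only for this sketch."""

    def __init__(self, tag: str, classes: Sequence[str] = (), name: str = ""):
        self.tag = tag
        self.classes = list(classes)
        self.name = name  # label such as "T3", for readability only
        self.parent: Optional["Node"] = None
        self.children: List["Node"] = []

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def depth(node: Node) -> int:
    """Number of ancestors between the node and the root."""
    level = 0
    while node.parent is not None:
        node = node.parent
        level += 1
    return level


def walk(root: Node) -> Iterator[Node]:
    """Yield every node of the tree in document order."""
    yield root
    for child in root.children:
        yield from walk(child)


def target_page_elements(root: Node, target: Node) -> List[Node]:
    """S12: other elements with the target's tag name at the target's level."""
    level = depth(target)
    return [n for n in walk(root)
            if n is not target and n.tag == target.tag and depth(n) == level]


# A small tree loosely modelled on the FIG. 3 example (layout assumed).
html = Node("html")
body = html.add(Node("body"))
t8 = body.add(Node("div", ["class9"], "T8"))
t1 = t8.add(Node("a", ["class1", "class3", "class4"], "T1"))
t2 = t8.add(Node("a", ["class1", "class3", "class5"], "T2"))
t3 = t8.add(Node("a", ["class1", "class2", "class3"], "T3"))
t4 = t8.add(Node("a", ["class1", "class3"], "T4"))
nav = body.add(Node("a", ["class7"], "nav"))  # same tag, different level

names = [n.name for n in target_page_elements(html, t3)]
```

Here `names` contains T1, T2 and T4; the `nav` link is skipped because it sits one level higher in the tree.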
In S13, a target style set corresponding to the target element is determined according to the styles of those elements, among the target page elements, that have the same style as the target element.
Because the target style set is determined from elements that have the same tag name, the same level and the same style as the target element, each style in the target style set is ensured to be common among the styles of the corresponding target page elements.
In S14, a style path is determined according to the target style set corresponding to the target element, the elements pointed to by the style path are determined to be similar elements of the target element, and the similar elements are selected.
The style path may include the tag name of the target element and the styles in the corresponding target style set, so that the similar elements can be selected from the page according to the tag name and the styles. When elements in the page are selected according to the style path corresponding to the target element, the target element and its similar elements are therefore selected together, which avoids the influence that special styles of the target element have on page element selection in the prior art.
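The effect of filtering can be illustrated with a minimal sketch in which a single-node style path is reduced to a tag name plus a set of required styles. The `PAGE` table and the `select` helper are hypothetical; the class lists follow the T1 to T4 example used in this description, and matching by "carries every required style" is an assumption about how a style path is applied.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical page elements: label -> (tag name, class names).
PAGE: Dict[str, Tuple[str, Set[str]]] = {
    "T1": ("a", {"class1", "class3", "class4"}),
    "T2": ("a", {"class1", "class3", "class5"}),
    "T3": ("a", {"class1", "class2", "class3"}),
    "T4": ("a", {"class1", "class3"}),
    "T8": ("div", {"class9"}),
}


def select(tag: str, required: Set[str]) -> List[str]:
    """Return labels of elements with the tag that carry every required style."""
    return [label for label, (t, classes) in PAGE.items()
            if t == tag and required <= classes]


# Using every style of T3, including the special underline style class2.
naive = select("a", {"class1", "class2", "class3"})
# Using only the target style set, with class2 filtered out.
filtered = select("a", {"class1", "class3"})
```

With all three styles only T3 itself matches, whereas the filtered set matches T1 through T4, which is the behaviour the method aims for.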
Optionally, before the step of determining the style path according to the target style set corresponding to the target element, the method may further include:
if the target element has a parent element and the parent element is not a preset element, adding the target element to the target element set, determining the parent element as the new target element, and returning to step S11 to obtain the style and the tag name corresponding to the target element, until the parent element is the preset element.
The preset element may be the root element of the DOM tree corresponding to the page, such as <html>. Alternatively, since the part displayed on the page is the content of the <body> tag, the preset element may be <body>, which effectively reduces the number of loop iterations and improves the efficiency of page element selection.
In the above example, the preset element may be <body>. As shown in FIG. 3, the parent element of the target element T3 is a <div> element (element T8). Since this parent element is not <body>, steps S11 to S13 are executed for it, and so on, until the determined parent element is <body>, at which point the loop ends.
Determining a style path according to the target style set corresponding to the target element may include:
determining the style path according to the target style set corresponding to each element in the target element set. By way of example, the style path may be determined from the parent-child relationships between the elements, in a manner described in more detail below.
In this technical solution, when the target element has a parent element that is not the preset element, the parent elements are iterated layer by layer to determine the target style set corresponding to each of them before the style path is determined. This further improves the accuracy of the style path, and hence the accuracy and efficiency of page element selection, and supports extracting the content of similar elements in the page based on the style path.
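The layer-by-layer iteration can be sketched as a simple loop over parent links. The `Node` class and `collect_target_elements` are hypothetical names, and the sketch assumes that the element whose parent is the preset element is itself still collected, which matches the example in which the target element set contains both <a> and <div>.

```python
from typing import List, Optional


class Node:
    """Hypothetical DOM node holding only a tag name and a parent link."""

    def __init__(self, tag: str, parent: Optional["Node"] = None):
        self.tag = tag
        self.parent = parent


def collect_target_elements(target: Node, preset_tag: str = "body") -> List[Node]:
    """Walk from the target towards the root, stopping at the preset element."""
    collected: List[Node] = []
    current: Optional[Node] = target
    while current is not None and current.tag != preset_tag:
        collected.append(current)  # S11 to S13 would run for this element
        current = current.parent   # the parent becomes the new target
    return collected


html = Node("html")
body = Node("body", html)
div = Node("div", body)
a = Node("a", div)

tags = [n.tag for n in collect_target_elements(a)]
```

Stopping at <body> rather than <html> saves one iteration here; on deeper pages the saving is the same single level, but the preset element could equally be chosen closer to the target.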
Optionally, before the step of determining, according to a style corresponding to an element, which has the same style as the target element, in the target page element, a target style set corresponding to the target element, the method further includes:
determining, according to the DOM tree, whether a target sibling element of the target element exists among the target page elements, where a target sibling element is a sibling element, among the target page elements, that has the same style as the target element.
The DOM tree in FIG. 3 is the DOM tree corresponding to the page shown in FIG. 2; this correspondence is merely an example. The DOM tree may be generated from the page source code, which is not described here. Illustratively, as shown in FIG. 3, the tag name of the target element T3 is <a>. According to the DOM tree, it can be determined directly whether the elements with tag name <a> at the same level as the target element (i.e., the target page elements) include sibling elements of the target element T3; if so, those sibling elements are the target sibling elements of T3. Illustratively, the determined target sibling elements are T1, T2 and T4.
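A minimal sketch of this check follows. The text does not define precisely what "the same style" means for a sibling, so the sketch assumes it means sharing at least one class with the target; the tuple layout, the `find_target_siblings` name and the two extra entries X1 and X2 are invented for illustration.

```python
from typing import List, Set, Tuple

# (label, tag name, class names) for children of the same parent element.
Sibling = Tuple[str, str, Set[str]]


def find_target_siblings(target: Sibling, siblings: List[Sibling]) -> List[Sibling]:
    """Keep siblings with the target's tag that share at least one style."""
    _, tag, styles = target
    return [s for s in siblings if s[1] == tag and styles & s[2]]


t3: Sibling = ("T3", "a", {"class1", "class2", "class3"})
children_of_t8: List[Sibling] = [
    ("T1", "a", {"class1", "class3", "class4"}),
    ("T2", "a", {"class1", "class3", "class5"}),
    ("T4", "a", {"class1", "class3"}),
    ("X1", "span", {"class1"}),  # invented: different tag name
    ("X2", "a", {"class7"}),     # invented: no style in common
]

labels = [s[0] for s in find_target_siblings(t3, children_of_t8)]
```

Under this assumption the target siblings are T1, T2 and T4; a stricter definition of "same style" would only change the predicate inside the list comprehension.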
If target sibling elements exist among the target page elements, the target sibling elements are determined as the elements having the same style as the target element. In that case, an exemplary implementation of S13, determining the target style set corresponding to the target element according to the styles of those elements, among the target page elements, that have the same style as the target element, is as follows. For each style of the target element, the following first step and second step are performed:
The first step is to determine the number of occurrences of the style in the set formed by the styles of the target sibling elements.
Optionally, an exemplary implementation manner of determining the number of occurrences of the style in the set formed by the styles of the target sibling elements is as follows, including:
for each target sibling element, performing the steps of:
if the styles of the target sibling element include a style identical to the style, adding one to the number of occurrences, wherein the number of occurrences is initially zero.
In the above example, the tag name of the target element T3 is <a>, its styles are {class1, class2, class3}, the determined target sibling elements are T1, T2 and T4, and the tag names of elements T1, T2 and T4 are all <a>.
The following describes in detail how the number of occurrences of style class1 of target element T3 among the target sibling elements is determined.
For target sibling element T1, the styles corresponding to T1 are obtained, for example {class1, class3, class4}. Since the styles of T1 include a style identical to class1, the number of occurrences is increased by one; as the number of occurrences is initially zero, it is now 1.
For target sibling element T2, the styles corresponding to T2 are obtained, for example {class1, class3, class5}. Since the styles of T2 include a style identical to class1, the number of occurrences is increased by one and is now 2.
For target sibling element T4, the styles corresponding to T4 are obtained, for example {class1, class3}. Since the styles of T4 include a style identical to class1, the number of occurrences is increased by one and is now 3.
After the above operation has been performed for each target sibling element, the determined number of occurrences of style class1 among the target sibling elements is 3.
For style class2 of target element T3, the number of occurrences among the target sibling elements is determined in the same manner and is 0.
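The counting walk-through above can be reproduced with a few lines of Python. The function name `count_occurrences` is hypothetical; the class lists are the ones given for T1, T2 and T4 in the example.

```python
from typing import Dict, List, Set


def count_occurrences(style: str, sibling_styles: List[Set[str]]) -> int:
    """Count how many target siblings carry the given style."""
    occurrences = 0  # the number of occurrences is initially zero
    for styles in sibling_styles:
        if style in styles:
            occurrences += 1
    return occurrences


target_styles = ["class1", "class2", "class3"]  # styles of T3
sibling_styles = [
    {"class1", "class3", "class4"},  # T1
    {"class1", "class3", "class5"},  # T2
    {"class1", "class3"},            # T4
]

counts: Dict[str, int] = {
    style: count_occurrences(style, sibling_styles) for style in target_styles
}
```

The result gives 3 for class1 and 0 for class2, as in the text, and also 3 for class3, which follows from the same class lists.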
If the number of occurrences of the style of the target element in the set formed by the styles of the target sibling elements is 0, the probability that the style is set specifically for the target element is high, and if the number of occurrences of the style of the target element in the set formed by the styles of the target sibling elements is not 0, the probability that the style is set specifically for the target element is low. Therefore, by the technical scheme, the occurrence times of the styles of the target elements in the set formed by the styles of the target sibling elements can be counted quickly and simply, and data support and basis are provided for screening the styles of the target elements subsequently.
After the number of occurrences is determined, the second step is executed: when the number of occurrences meets a preset condition, the style is added to the target style set corresponding to the target element.
Optionally, the preset condition is that the number of occurrences is greater than a first preset threshold or a ratio of the number of occurrences to the number of target sibling elements is greater than a second preset threshold.
As an example, the preset condition is that the number of occurrences is greater than a first preset threshold, where the first preset threshold may be set according to the actual usage scenario; for example, it may be set to 1. A style of the target element whose number of occurrences in the set formed by the styles of the target sibling elements exceeds the threshold can be regarded as a common style and is therefore added to the target style set corresponding to the target element.
As another example, the preset condition is that the ratio of the number of occurrences to the number of target sibling elements is greater than a second preset threshold, where the second preset threshold may be set according to the actual usage scenario; for example, it may be set to 0.3. That is, when the ratio of the number of occurrences of a style of the target element in the set formed by the styles of the target sibling elements to the total number of target sibling elements is greater than 0.3, the style can be regarded as a common style and is added to the target style set corresponding to the target element.
Optionally, when the number of occurrences does not satisfy the preset condition, the style was most likely set specifically for the target element and cannot represent features common to the target element and its target sibling elements; such a style, for example style class2 above, can be ignored.
Therefore, by the technical scheme, whether the style is added to the target style set or not can be determined based on the occurrence frequency of the style in the set formed by the styles of the target brother elements, so that the common style corresponding to the target elements can be determined, and the comprehensiveness of page element selection is improved.
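The preset condition can be expressed as a small predicate, sketched below. The function and parameter names are hypothetical, and treating an unset threshold as "not used" is a design choice of this sketch rather than something stated in the text.

```python
from typing import Optional


def meets_preset_condition(
    occurrences: int,
    sibling_count: int,
    first_threshold: Optional[int] = None,
    second_threshold: Optional[float] = None,
) -> bool:
    """True if the count exceeds the first threshold or the ratio exceeds the second."""
    if first_threshold is not None and occurrences > first_threshold:
        return True
    if second_threshold is not None and sibling_count > 0:
        return occurrences / sibling_count > second_threshold
    return False


# class1 occurs in 3 of 3 target siblings, class2 in none of them.
class1_by_count = meets_preset_condition(3, 3, first_threshold=1)
class2_by_count = meets_preset_condition(0, 3, first_threshold=1)
class1_by_ratio = meets_preset_condition(3, 3, second_threshold=0.3)
class2_by_ratio = meets_preset_condition(0, 3, second_threshold=0.3)
```

Both variants keep class1 and drop class2 for the example data. The ratio form scales with the number of siblings, while the count form is a fixed cut-off.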
Optionally, the method further comprises: if no target sibling element exists among the target page elements, adding all styles of the target element to the target style set corresponding to the target element, thereby determining the target style set. Step S14 may then be executed: a style path is determined according to the target style set corresponding to the target element, the elements pointed to by the style path are determined to be similar elements of the target element, and the similar elements are selected.
With this technical solution, when no target sibling element of the target element exists among the target page elements, all styles of the target element are added directly to the corresponding target style set, which ensures the accuracy of page data extraction based on the target style set.
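Putting the two cases together, a target style set could be built as in the sketch below. The function name `build_target_style_set` is hypothetical, and the ratio form of the preset condition with a threshold of 0.3 is used here only as one of the options mentioned in the text.

```python
from typing import List, Set


def build_target_style_set(
    target_styles: List[str],
    sibling_styles: List[Set[str]],
    ratio_threshold: float = 0.3,
) -> List[str]:
    """Keep styles common among target siblings; keep all if there are none."""
    if not sibling_styles:
        # No target sibling element: every style of the target is kept.
        return list(target_styles)
    total = len(sibling_styles)
    kept: List[str] = []
    for style in target_styles:
        occurrences = sum(1 for styles in sibling_styles if style in styles)
        if occurrences / total > ratio_threshold:
            kept.append(style)
    return kept


t3_styles = ["class1", "class2", "class3"]
with_siblings = build_target_style_set(
    t3_styles,
    [{"class1", "class3", "class4"}, {"class1", "class3", "class5"}, {"class1", "class3"}],
)
without_siblings = build_target_style_set(t3_styles, [])
```

With the three example siblings the underline style class2 is dropped; without siblings there is nothing to compare against, so the full style list is returned unchanged.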
Optionally, an exemplary implementation of determining the style path according to the target style set corresponding to each element in the target element set is as follows:
generating the style path according to the parent-child relationships of the elements in the target element set, in order from parent element to child element, wherein each node of the style path comprises the tag name of an element and the target style set corresponding to that element.
For example, in the DOM tree shown in Fig. 3, the elements in the target element set are <a> and <div>. The target style set corresponding to element <a> is {class1, class3}, and the target style set corresponding to element <div> is {class9}, where class9 is determined from elements T5, T8, and T6 in the same way as the target style set for element <a> described above; the details are not repeated here.
From the DOM tree, it can be determined that <div> is the parent element of <a>, and an exemplary representation of the generated style path is:
div.class9->a.class1.class3;
Therefore, when similar elements in the page are selected based on this style path, elements T1, T2, T3, and T4 are selected and their content is extracted, without being affected by the underline style on element T3. Content matching the target element selected by the user can thus be extracted from the page, which improves the user experience.
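A minimal sketch of generating and applying such a style path is given below. The `Element` class, the helper names, and the toy page are hypothetical; the page is not the DOM tree of Fig. 3 and only reuses its labels. The sketch also assumes that `->` denotes a direct parent-to-child step and that a node matches an element whose tag name is equal and whose styles include every style in the node's target style set.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Set, Tuple

PathNode = Tuple[str, Set[str]]  # (tag name, target style set)


@dataclass
class Element:
    tag: str
    styles: Set[str]
    label: str = ""
    children: List["Element"] = field(default_factory=list)


def format_style_path(nodes: List[PathNode]) -> str:
    """Render nodes (ordered parent -> child) as tag.style1.style2->..."""
    parts = []
    for tag, styles in nodes:
        parts.append(tag + "".join("." + s for s in sorted(styles)))
    return "->".join(parts)


def walk(root: Element) -> Iterator[Element]:
    """Yield root and all of its descendants in document order."""
    yield root
    for child in root.children:
        yield from walk(child)


def matches(element: Element, node: PathNode) -> bool:
    tag, styles = node
    # Extra styles on the element (e.g. underline) do not prevent a match.
    return element.tag == tag and styles <= element.styles


def select_similar(root: Element, nodes: List[PathNode]) -> List[Element]:
    """Follow the path one parent -> child step at a time."""
    current = [e for e in walk(root) if matches(e, nodes[0])]
    for node in nodes[1:]:
        current = [c for e in current for c in e.children if matches(c, node)]
    return current


def a(label: str, *styles: str) -> Element:
    return Element("a", set(styles), label)


# Hypothetical page, loosely modelled on the example in the text.
page = Element("body", set(), children=[
    Element("div", {"class9"}, children=[
        a("T1", "class1", "class3"),
        a("T2", "class1", "class2", "class3"),
    ]),
    Element("div", {"class9"}, children=[
        a("T3", "class1", "class3", "underline"),
        a("T4", "class1", "class3"),
    ]),
    Element("div", {"other"}, children=[a("X", "class1", "class3")]),
])

path = [("div", {"class9"}), ("a", {"class1", "class3"})]
path_text = format_style_path(path)
selected = [e.label for e in select_similar(page, path)]
```

Because matching uses a subset test, the additional underline style on T3 and the class2 style on T2 do not exclude those elements, while the link under the div without class9 is not selected.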
The present disclosure also provides a page element selecting apparatus, as shown in fig. 4, the apparatus 10 includes:
a first obtaining module 100, configured to obtain a style and a tag name corresponding to a target element;
a second obtaining module 200, configured to obtain the styles of target page elements in a page, where the target page elements have the same tag name as the target element and are at the same level, and elements at the same level in the page are elements at the same level in the DOM tree corresponding to the page;
a first determining module 300, configured to determine, according to a style corresponding to an element, which has a same style as the target element, in the target page element, a target style set corresponding to the target element;
a selecting module 400, configured to determine a style path according to the target style set corresponding to the target element, determine an element pointed by the style path as a similar element of the target element, and select the similar element.
Optionally, the apparatus further comprises:
a second determining module, configured to, before the selecting module determines the style path according to the target style set corresponding to the target element, add the target element to a target element set when the target element has a parent element and the parent element is not a preset element, determine the parent element as a new target element, and trigger the first obtaining module to obtain the style and the tag name corresponding to the target element, until the parent element is the preset element;
the selecting module comprises:
and the first determining submodule is used for determining the style path according to the target style set corresponding to each element in the target element set.
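The upward traversal performed by the second determining module can be sketched as below. The `Node` class and the function name are hypothetical, and treating `body` as the preset element is an assumption made only for this example. The sketch also assumes that the last target element, whose parent is the preset element, still belongs to the target element set, which is consistent with the <a>, <div> example given earlier.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    tag: str
    parent: Optional["Node"] = None


def collect_target_elements(target: Node,
                            preset_tags: Tuple[str, ...] = ("body",)) -> List[Node]:
    """Walk upward from the selected element, collecting each element
    until the parent is missing or is a preset element."""
    collected: List[Node] = []
    current = target
    while True:
        collected.append(current)
        parent = current.parent
        if parent is None or parent.tag in preset_tags:
            break
        current = parent  # the parent becomes the new target element
    collected.reverse()  # parent -> child order, ready for path generation
    return collected


# Hypothetical chain: html > body > div > a, with <a> selected by the user.
html = Node("html")
body = Node("body", html)
div = Node("div", body)
link = Node("a", div)
chain = [n.tag for n in collect_target_elements(link)]
```

Returning the elements in parent-to-child order means the result can be passed directly to whatever routine generates the style path.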
Optionally, the first determining submodule is configured to:
generate the style path according to the parent-child relationships of the elements in the target element set, in order from parent element to child element, wherein each node of the style path comprises the tag name of an element and the target style set corresponding to that element.
Optionally, the apparatus further comprises:
a third determining module, configured to determine, according to the DOM tree, whether a target sibling element of the target element exists among the target page elements, before the first determining module determines the target style set corresponding to the target element according to the styles of the elements, among the target page elements, that have a same style as the target element, where a target sibling element is a sibling element, among the target page elements, that has a same style as the target element;
a fourth determining module, configured to determine, when the target sibling element exists in the target page element, the target sibling element as the element having the same style as the target element;
the first determining module includes:
an adding submodule, configured to perform the following steps for each style of the target element: determining the number of occurrences of the style in the set formed by the styles of the target sibling elements; and when the number of occurrences satisfies a preset condition, adding the style to the target style set corresponding to the target element.
Optionally, the apparatus further comprises:
an adding module, configured to add all styles of the target element to the target style set corresponding to the target element when no target sibling element exists among the target page elements, so as to determine the target style set corresponding to the target element.
Optionally, the adding submodule is configured to, for each target sibling element, add one to the number of occurrences if a style identical to the style exists among the styles of that target sibling element, where the initial value of the number of occurrences is zero.
Optionally, the preset condition is that the number of occurrences is greater than a first preset threshold or a ratio of the number of occurrences to the number of target sibling elements is greater than a second preset threshold.
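The counting rule and the two alternative preset conditions can be sketched as follows. The function names and the convention of passing `None` for a threshold that is not used are assumptions made for illustration; the disclosure only states that the count starts at zero, increases by one per matching sibling, and is compared either directly with a first threshold or as a ratio with a second threshold.

```python
from typing import List, Optional, Set


def count_occurrences(style: str, sibling_styles: List[Set[str]]) -> int:
    """Start at zero and add one for each target sibling element
    whose styles contain the given style."""
    occurrences = 0
    for styles in sibling_styles:
        if style in styles:
            occurrences += 1
    return occurrences


def meets_preset_condition(occurrences: int,
                           sibling_count: int,
                           first_threshold: Optional[int] = None,
                           second_threshold: Optional[float] = None) -> bool:
    """True if the count exceeds the first threshold, or the ratio of
    the count to the number of siblings exceeds the second threshold."""
    if first_threshold is not None and occurrences > first_threshold:
        return True
    if (second_threshold is not None and sibling_count > 0
            and occurrences / sibling_count > second_threshold):
        return True
    return False


# Hypothetical sibling data for illustration only.
siblings = [{"class1"}, {"class1", "class3"}, {"class3"}, set()]
class1_count = count_occurrences("class1", siblings)
by_count = meets_preset_condition(class1_count, len(siblings), first_threshold=1)
by_ratio = meets_preset_condition(1, len(siblings), second_threshold=0.3)
```

Here class1 occurs twice, which exceeds a first threshold of 1, whereas a style occurring once among four siblings has a ratio of 0.25 and fails a second threshold of 0.3.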
The page element selection apparatus comprises a processor and a memory. The first obtaining module, the second obtaining module, the first determining module, the selecting module, and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and page content is extracted according to the above method by adjusting kernel parameters.
An embodiment of the present invention provides a storage medium, on which a program is stored, and the program implements the page element selection method when executed by a processor.
The embodiment of the invention provides a processor, which is used for running a program, wherein the page element selection method is executed when the program runs.
An embodiment of the present invention provides an apparatus, as shown in fig. 5, an apparatus 70 includes at least one processor 701, and at least one memory 702 and a bus 703, which are connected to the processor 701; the processor 701 and the memory 702 complete mutual communication through a bus 703; the processor 701 is configured to call program instructions in the memory 702 to execute the page element selection method described above. The device herein may be a server, a PC, etc.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
acquiring a style and a tag name corresponding to a target element;
acquiring the styles of target page elements in a page that have the same tag name as the target element and are at the same level, wherein elements at the same level in the page are elements at the same level in the DOM tree corresponding to the page;
determining a target style set corresponding to the target element according to a style corresponding to an element, which has the same style as the target element, in the target page element;
determining a style path according to a target style set corresponding to the target element, determining an element pointed by the style path as a similar element of the target element, and selecting the similar element.
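The second step above, finding elements with the same tag name at the same DOM level, can be sketched with Python's standard `xml.etree.ElementTree` module. The page markup, the function names, and the decision to exclude the target element itself from the result are assumptions made for this example; a real page would need an HTML parser rather than an XML one.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Set

PAGE = """<body>
  <div class="class9"><a class="class1 class3">T1</a><a class="class1 class2 class3">T2</a></div>
  <div class="class9"><a class="class1 class3 underline">T3</a></div>
  <a class="class1">shallow</a>
</body>"""


def styles_of(element: ET.Element) -> Set[str]:
    """Split the class attribute into a set of style names."""
    return set(element.get("class", "").split())


def same_level_elements(root: ET.Element, target: ET.Element) -> List[ET.Element]:
    """Return elements with the target's tag name at the target's depth."""
    depth_of: Dict[ET.Element, int] = {root: 0}
    # iter() visits parents before children, so each parent depth is known.
    for parent in root.iter():
        for child in parent:
            depth_of[child] = depth_of[parent] + 1
    target_depth = depth_of[target]
    return [el for el in root.iter(target.tag)
            if el is not target and depth_of[el] == target_depth]


root = ET.fromstring(PAGE)
target = root.find("./div/a")  # the first link, labelled T1
peers = same_level_elements(root, target)
peer_text = [el.text for el in peers]
```

The link placed directly under body shares the tag name but sits one level higher in the tree, so it is not treated as a target page element.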
Optionally, before the step of determining a style path according to the target style set corresponding to the target element, the method further includes:
if the target element has a parent element and the parent element is not a preset element, adding the target element to a target element set, determining the parent element as a new target element, and returning to the step of acquiring the style and the tag name corresponding to the target element, until the parent element is the preset element;
determining a style path according to a target style set corresponding to the target element, including:
and determining the style path according to the target style set corresponding to each element in the target element set.
Optionally, the determining the style path according to the target style set corresponding to each element in the target element set includes:
generating the style path according to the parent-child relationships of the elements in the target element set, in order from parent element to child element, wherein each node of the style path comprises the tag name of an element and the target style set corresponding to that element.
Optionally, before the step of determining the target style set corresponding to the target element according to the style corresponding to the element, which has the same style as the target element, in the target page element, the method further includes:
determining, according to the DOM tree, whether a target sibling element of the target element exists among the target page elements, wherein a target sibling element is a sibling element, among the target page elements, that has a same style as the target element;
if a target sibling element exists among the target page elements, determining the target sibling element as the element having the same style as the target element;
the determining a target style set corresponding to the target element according to a style corresponding to an element, which has the same style as the target element, in the target page element includes:
for each style of the target element, respectively performing the following steps:
determining the number of occurrences of the style in the set formed by the styles of the target sibling elements;
and when the number of occurrences satisfies a preset condition, adding the style to the target style set corresponding to the target element.
Optionally, the method further comprises:
if no target sibling element exists among the target page elements, adding all styles of the target element to the target style set corresponding to the target element, so as to determine the target style set corresponding to the target element.
Optionally, the determining the number of occurrences of the style in the set formed by the styles of the target sibling elements includes:
for each target sibling element, performing the steps of:
if a style identical to the style exists among the styles of the target sibling element, adding one to the number of occurrences, wherein the initial value of the number of occurrences is zero.
Optionally, the preset condition is that the number of occurrences is greater than a first preset threshold or a ratio of the number of occurrences to the number of target sibling elements is greater than a second preset threshold.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include, among computer-readable media, volatile memory such as random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A page element selection method is characterized by comprising the following steps:
acquiring a style and a tag name corresponding to a target element;
acquiring the styles of target page elements in a page that have the same tag name as the target element and are at the same level, wherein elements at the same level in the page are elements at the same level in the DOM tree corresponding to the page;
determining a target style set corresponding to the target element according to a style corresponding to an element, which has the same style as the target element, in the target page element;
determining a style path according to a target style set corresponding to the target element, determining an element pointed by the style path as a similar element of the target element, and selecting the similar element.
2. The method according to claim 1, wherein prior to the step of determining a style path from the target style set to which the target element corresponds, the method further comprises:
if the target element has a parent element and the parent element is not a preset element, adding the target element to a target element set, determining the parent element as a new target element, and returning to the step of acquiring the style and the tag name corresponding to the target element, until the parent element is the preset element;
determining a style path according to a target style set corresponding to the target element, including:
and determining the style path according to the target style set corresponding to each element in the target element set.
3. The method of claim 2, wherein determining the style path according to a target style set corresponding to each element in the target element set comprises:
generating the style path according to the parent-child relationships of the elements in the target element set, in order from parent element to child element, wherein each node of the style path comprises the tag name of an element and the target style set corresponding to that element.
4. The method according to claim 1, wherein before the step of determining the target style set corresponding to the target element according to the style corresponding to the element in the target page element having the same style as the target element, the method further comprises:
determining, according to the DOM tree, whether a target sibling element of the target element exists among the target page elements, wherein a target sibling element is a sibling element, among the target page elements, that has a same style as the target element;
if a target sibling element exists among the target page elements, determining the target sibling element as the element having the same style as the target element;
the determining a target style set corresponding to the target element according to a style corresponding to an element, which has the same style as the target element, in the target page element includes:
for each style of the target element, respectively performing the following steps:
determining the number of occurrences of the style in the set formed by the styles of the target sibling elements;
and when the number of occurrences satisfies a preset condition, adding the style to the target style set corresponding to the target element.
5. The method of claim 4, further comprising:
if no target sibling element exists among the target page elements, adding all styles of the target element to the target style set corresponding to the target element, so as to determine the target style set corresponding to the target element.
6. The method of claim 4, wherein determining the number of occurrences of the style in the set formed by the styles of the target sibling elements comprises:
for each target sibling element, performing the steps of:
if a style identical to the style exists among the styles of the target sibling element, adding one to the number of occurrences, wherein the initial value of the number of occurrences is zero.
7. The method according to claim 4, wherein the predetermined condition is that the number of occurrences is greater than a first predetermined threshold or that the ratio of the number of occurrences to the number of target sibling elements is greater than a second predetermined threshold.
8. A page element selection apparatus, the apparatus comprising:
the first acquisition module is used for acquiring the style and the tag name corresponding to the target element;
the second obtaining module is used for obtaining the style of a target page element which has the same tag name and the same hierarchy as the target element in a page, wherein the element with the same hierarchy in the page is an element at the same hierarchy in a DOM tree corresponding to the page;
the first determining module is used for determining a target style set corresponding to the target element according to a style corresponding to an element, which has the same style as the target element, in the target page element;
and the selecting module is used for determining a style path according to the target style set corresponding to the target element, determining the element pointed to by the style path as a similar element of the target element, and selecting the similar element.
9. A storage medium having a program stored thereon, the program being adapted to carry out the steps of the method of any of claims 1-7 when executed by a processor.
10. An apparatus, characterized in that the apparatus comprises:
at least one processor, and at least one memory and a bus connected to the processor;
the processor and the memory complete mutual communication through the bus;
the processor is configured to call program instructions in the memory to perform the steps of the method of any of claims 1-7.
CN201910943329.4A 2019-09-30 2019-09-30 Page element selection method and device, storage medium and equipment Pending CN112579951A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910943329.4A CN112579951A (en) 2019-09-30 2019-09-30 Page element selection method and device, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910943329.4A CN112579951A (en) 2019-09-30 2019-09-30 Page element selection method and device, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN112579951A true CN112579951A (en) 2021-03-30

Family

ID=75116516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910943329.4A Pending CN112579951A (en) 2019-09-30 2019-09-30 Page element selection method and device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN112579951A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002049281A2 (en) * 2000-12-12 2002-06-20 Citrix System, Inc. Methods and apparatus for creating a user interface using property paths
US20050091510A1 (en) * 2003-10-23 2005-04-28 Microsoft Corporation Element persistent identification
US20120110437A1 (en) * 2010-10-28 2012-05-03 Microsoft Corporation Style and layout caching of web content
GB201315993D0 (en) * 2013-09-06 2013-10-23 Middleton Technology Ltd Element identification in a structural model
CN104866509A (en) * 2014-02-26 2015-08-26 阿里巴巴集团控股有限公司 Page element positioning method and device
CN107562600A (en) * 2017-08-23 2018-01-09 广州阿里巴巴文学信息技术有限公司 Page detection method, apparatus, computing device and storage medium
CN107633019A (en) * 2017-08-24 2018-01-26 阿里巴巴集团控股有限公司 A kind of page events acquisition method and device
CN107861868A (en) * 2017-10-31 2018-03-30 郑州云海信息技术有限公司 A kind of method and system for extracting automation test object
CN109582548A (en) * 2017-09-28 2019-04-05 北京国双科技有限公司 A kind of page elements circle choosing method and device buried a little based on nothing
CN109766502A (en) * 2018-12-13 2019-05-17 平安普惠企业管理有限公司 Page improved method, device, computer equipment and storage medium
CN110187880A (en) * 2019-05-30 2019-08-30 北京腾云天下科技有限公司 A kind of similar elemental recognition method, apparatus and calculate equipment
CN110276039A (en) * 2019-06-27 2019-09-24 北京金山安全软件有限公司 Page element path generation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination