WO2023136875A1 - List extraction and visualization in web pages - Google Patents

List extraction and visualization in web pages

Info

Publication number
WO2023136875A1
Authority
WO
WIPO (PCT)
Prior art keywords
item
tree
list
node
web page
Prior art date
Application number
PCT/US2022/048129
Other languages
French (fr)
Inventor
Huiming Luo
Xi Chen
Xin Chen
Yuelin Zhang
Yining Chen
Daxin Jiang
Original Assignee
Microsoft Technology Licensing, Llc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing, Llc.
Publication of WO2023136875A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Definitions

  • Search engine providers may provide search services to help users find web pages of interest. For example, in response to a search query from a user, a search service may return to the user a search result page which includes information about web pages relevant to the search query, e.g., web page links, snippets, etc.
  • Embodiments of the present disclosure propose methods, apparatuses and computer program products for list extraction and visualization in web pages.
  • At least one anchor element group in a target web page may be detected, the at least one anchor element group comprising a first anchor element group.
  • Boundary detection may be performed on multiple anchor elements in the first anchor element group, to obtain boundaries of multiple items respectively associated with the multiple anchor elements, the multiple items corresponding to a first original list in the target web page.
  • Multiple groups of representative metadata respectively corresponding to the multiple items may be obtained from the target web page with the boundaries of the multiple items.
  • The multiple groups of representative metadata may be visualized into a structured list.
  • FIG.1 illustrates an exemplary list web page.
  • FIG.2 illustrates an exemplary list web page.
  • FIG.3 illustrates an existing exemplary search result page.
  • FIG.4 illustrates an exemplary process of list extraction and visualization in web pages according to an embodiment.
  • FIG.5 illustrates an exemplary process of anchor element group detection according to an embodiment.
  • FIG.6 illustrates an exemplary anchor element group according to an embodiment.
  • FIG.7 illustrates an exemplary process of boundary detection according to an embodiment.
  • FIG.8A to FIG.8F illustrate an example of iterative boundary expansion according to an embodiment.
  • FIG.9A to FIG.9F illustrate an example of iterative boundary expansion according to an embodiment.
  • FIG.10 illustrates an exemplary boundary detection result according to an embodiment.
  • FIG.11 illustrates an exemplary process of dominant list determination according to an embodiment.
  • FIG.12 illustrates an exemplary process of representative metadata obtaining according to an embodiment.
  • FIG.13 illustrates an exemplary search result page according to an embodiment.
  • FIG.14 illustrates a flowchart of an exemplary method for list extraction and visualization in web pages according to an embodiment.
  • FIG.15 illustrates an exemplary apparatus for list extraction and visualization in web pages according to an embodiment.
  • FIG.16 illustrates an exemplary apparatus for list extraction and visualization in web pages according to an embodiment.
  • Existing search services usually extract specific text from an original web page to form a snippet in the form of text, i.e., a text snippet, and display the text snippet in a search result page, so that when a user views this text snippet, the user can get a general idea of what the original web page is about.
  • A list web page may refer to a web page whose main content is a list that includes multiple items.
  • The existing search services still only extract specific text from a list web page to form a text snippet, and the text in the text snippet may be extracted only from specific items in the list, e.g., extracted from the first item in the list.
  • Embodiments of the present disclosure propose to perform list extraction and visualization in web pages, such that list content may be extracted from a target web page and organized into a structured form.
  • A target web page may be a list web page.
  • The embodiments of the present disclosure may extract list content from a target web page, and visualize at least a portion of the extracted list content to form a snippet in the form of a list, i.e., a list snippet.
  • The list snippet may contain richer content about an original list in the target web page, so that a user may learn more comprehensive information about the original list from the list snippet.
  • The embodiments of the present disclosure may present information about the original list to the user in a more friendly and intuitive manner.
  • The embodiments of the present disclosure may present, in the list snippet, key or representative information of items in the original list, and thus may comprehensively and concisely provide information that the user may desire.
  • When the list snippet is presented in a search result page, the user may conveniently and comprehensively learn about the content of the corresponding target web page without clicking on a web page link.
  • Original lists in target web pages processed by the embodiments of the present disclosure are not limited to those lists with html list tags, but may cover any visually perceptible lists.
  • An "original list" involved in the embodiments of the present disclosure is a visually perceptible list.
  • A visually perceptible list may refer to, e.g., a list that contains multiple items with visually similar structures.
  • An "item" may refer to a component that constitutes a list, which may also be referred to as an object, entity, data record, etc.
  • A visually perceptible list may or may not have an html list tag, and thus the visually perceptible list may have any html tags without limitation.
  • The embodiments of the present disclosure are proposed at least for target web pages that include visually perceptible lists, and may process these target web pages at least from a visually perceptible perspective, rather than simply processing these target web pages by using html list tags. Accordingly, the embodiments of the present disclosure may be applied to any target web page that contains a visual list.
  • The embodiments of the present disclosure may identify an original list that may be included in a target web page, through at least detecting an anchor element group in the target web page.
  • Anchor elements in the anchor element group do not necessarily have an html list tag. Since anchor elements may have representative information of items in the original list, detecting the anchor element group helps to discover the original list in the target web page.
  • The embodiments of the present disclosure may perform boundary detection on multiple anchor elements in the anchor element group, so as to determine boundaries of multiple items in the original list corresponding to the anchor element group in the target web page.
  • Determining the boundary of an item may refer to determining which specific elements are included in the item; these elements together form the item.
  • The boundary detection may include iterative boundary expansion. For each anchor element, elements that may be within the same item as the anchor element may be found through iterative boundary expansion, and thus the anchor element and the found elements define the boundary of the item.
  • The boundary detection may also include a similarity check. The similarity check may be performed for determining whether multiple items determined by expansion from different anchor elements are indeed items in the same original list, e.g., whether these items indeed form an original list. At least through the boundary detection according to the embodiments of the present disclosure, an original list in a target web page and each item in the original list may be accurately identified.
  • In the case that a target web page includes two or more original lists, the embodiments of the present disclosure may determine a dominant list from these original lists.
  • A dominant list may refer to a list that, e.g., occupies a main position in a web page, presents main content, etc.
  • The embodiments of the present disclosure may include only information about the dominant list in a finally generated structured list, thereby avoiding interference caused by information about lists that are not the dominant list.
  • The embodiments of the present disclosure may obtain, from a target web page, multiple groups of representative metadata for different items in an original list.
  • Multiple groups of representative metadata for multiple items in an original list may be obtained from the target web page by using at least boundaries of these items.
  • The multiple groups of representative metadata may be important and representative metadata selected from initial metadata in the target web page through ranking.
  • The embodiments of the present disclosure may visualize the obtained multiple groups of representative metadata to form a structured list.
  • The structured list may be taken as, e.g., a list snippet of the target web page.
  • The embodiments of the present disclosure may be applied in various application scenarios.
  • In a search service, the embodiments of the present disclosure may generate a structured list for a target web page, such that, e.g., a list snippet is built for the target web page.
  • The search service may present, in a search result page, the structured list generated according to the embodiments of the present disclosure as a list snippet.
  • The embodiments of the present disclosure are not limited to search services, but may also be applied to any application scenario that needs to perform list extraction and visualization on target web pages.
  • Target web pages processed by the embodiments of the present disclosure may be various list web pages from various websites, online services, etc.
  • FIG.1 illustrates an exemplary list web page.
  • List web page 12 is an exemplary article on the network, which may be located in, e.g., an academic website, an online question answering community, etc.
  • This article introduces ten major festivals in China, such as "Spring Festival", "Mid-Autumn Festival", "Dragon Boat Festival", etc.
  • Those parts involving the introduced festivals in the article form a visually perceptible list 122.
  • The part involving "Spring Festival", the part involving "Mid-Autumn Festival", the part involving "Dragon Boat Festival", etc. form multiple items in the list 122 respectively.
  • List web page 14 is a web page from, e.g., a book sales website, a reading communication website, etc. Assuming that multiple options have been selected in an "Options" column on the left side of the web page 14, introductory information of four recommended books that match the selected options is presented on the right side of the web page 14. Taking the first book as an example, introductory information of this book may include, e.g., a cover photo 144, a text introduction 146, etc. The introductory information of the four books forms a visually perceptible list 142. For example, introductory information of each book forms an item in the list 142.
  • List web page 16 is a web page for an exemplary topic "Restaurant X" from a review forum, which includes a discussion thread for "Restaurant X" by multiple users.
  • The web page 16 includes multiple display regions for users Tom, David, Jane, etc. respectively.
  • The display region for Tom includes, e.g., Tom's avatar, Tom's name, post time of Tom's comment, specific content of Tom's comment, etc.
  • The display region for Tom, the display region for David, the display region for Jane, etc. form a visually perceptible list 162, and these display regions form multiple items in the list 162 respectively.
  • FIG.2 illustrates an exemplary list web page 20.
  • The web page 20 may come from, e.g., an online shopping website, etc.
  • An online shopping website typically generates or provides a large number of web pages that include lists, e.g., best-selling web pages, most popular product web pages, product category web pages, web pages containing products searched by users, etc.
  • The web page 20 may be a web page for presenting, e.g., cellphones that meet certain conditions. Assuming that multiple options have been selected in an "Options" column on the left side of the web page 20, introductory information of multiple cellphones matching the selected options is presented on the right side of the web page 20.
  • Introductory information of a cellphone "M cellphone A4" is presented in a region 22, which includes, e.g., a picture 222 of the cellphone, a brief introduction of the cellphone "M cellphone A4, 6.5 inches, 256G, black", a 5-star rating for the cellphone, the number of reviews about the cellphone "25900 reviews", a price of the cellphone "5500 RMB", etc.
  • Introductory information of a cellphone "M cellphone A3", including at least a picture 242 of the cellphone, is presented in a region 24.
  • Introductory information of a cellphone "M cellphone A2", including at least a picture 262 of the cellphone, is presented in a region 26, etc.
  • The introductory information of these cellphones forms a visually perceptible list 202, wherein introductory information of each cellphone forms an item in the list 202.
  • The web page 20 also presents recommendations about related products in a region 28, e.g., introductory information of a first related product which includes at least a picture 282 of the product, introductory information of a second related product which includes at least a picture 284 of the product, introductory information of a third related product which includes at least a picture 286 of the product, etc.
  • The introductory information of these related products forms a visually perceptible list 204, wherein introductory information of each related product forms an item in the list 204.
  • The embodiments of the present disclosure are not limited to the exemplary list web pages shown in FIG.1 and FIG.2, but may cover various other types of list web pages from various other websites, online services, etc., e.g., list web pages from forum websites regarding topics in various fields, list web pages from product review websites, list web pages from news websites, list web pages from hotel or flight booking websites, etc.
  • FIG.3 illustrates an existing exemplary search result page 300.
  • The search result page 300 may be presented to a user in a search service provided by a certain general search engine provider. It is assumed that the user has input a query "M cellphone" in a search box 310 to indicate that the user wants to obtain web search results regarding the M cellphone.
  • A search result region 320 in the search result page 300 includes multiple web search results. For example, a search result for the web page 20 in FIG.2 is shown in a region 330. As shown in the region 330, the search result for the web page 20 includes a text snippet "M cellphone A4, 6.5 inches, 256G, black, 5 stars, 25900 reviews, 5500 RMB".
  • The text snippet is generated only with the introductory information about the "M cellphone A4" in the web page 20. Based on the text snippet in the region 330, the user can only learn limited information about the web page 20, e.g., can only learn information about the cellphone "M cellphone A4", but cannot learn any information about other cellphones in the list 202 in the web page 20. Moreover, such a text snippet also lacks intuitiveness and legibility.
  • FIG.4 illustrates an exemplary process 400 of list extraction and visualization in web pages according to an embodiment.
  • The process 400 may be performed for achieving list extraction and visualization for an original list in a target web page 402, so as to generate a structured list 404.
  • The target web page 402 may be a list web page containing a list, and herein a list in the target web page 402 may be referred to as an original list. If the target web page 402 includes two or more lists, the process 400 may generate the structured list 404 for a dominant list in the target web page 402.
  • At 410, at least one anchor element group in the target web page 402 may be detected.
  • Each anchor element group may include one or more anchor elements, and each anchor element group may correspond to a possible original list. For example, if the target web page 402 includes two or more original lists, two or more anchor element groups respectively corresponding to those original lists may be detected at 410.
  • Multiple anchor elements in the target web page 402 may be identified first, and then the multiple anchor elements may be clustered into at least one anchor element group.
  • At 420, boundary detection may be performed on multiple anchor elements in an anchor element group to obtain boundaries of multiple items respectively associated with the multiple anchor elements. These items may form an original list, in the target web page 402, corresponding to the anchor element group.
  • The boundary detection may include, e.g., iterative boundary expansion, similarity check, etc., in order to accurately identify at least one original list in the target web page 402 and respective items in each original list.
  • At 430, if the target web page 402 includes two or more original lists, a dominant list may be determined from these original lists.
  • The dominant list may be determined at least with visual features of these original lists.
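  • As a minimal sketch of dominant list determination, the following assumes that the only visual features available are the rendered width and height of each original list, with the item count as a tie-breaker. The `ListBox` type, its fields, the choice of features and the pixel sizes in the usage note are hypothetical illustrations and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class ListBox:
    """Hypothetical visual summary of one original list."""
    name: str
    width: float       # rendered width, in pixels (assumed feature)
    height: float      # rendered height, in pixels (assumed feature)
    item_count: int    # number of items found by boundary detection


def pick_dominant(lists):
    """Return the list with the largest rendered area; ties go to more items."""
    return max(lists, key=lambda box: (box.width * box.height, box.item_count))
```

  • For instance, if the list 202 in FIG.2 is given a hypothetical size of 800 by 1200 pixels and the list 204 a hypothetical size of 800 by 200 pixels, `pick_dominant` returns the entry for the list 202.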
  • At 440, multiple groups of representative metadata respectively corresponding to multiple items in the original list may be obtained from the target web page 402 with at least boundaries of the multiple items.
  • The representative metadata obtaining at 440 may be performed for a dominant list in the target web page 402.
  • Representative metadata may refer to data contained in an original list and to be presented in the structured list 404, e.g., image, text, etc.
  • For each item, a group of initial metadata may be obtained from the target web page 402 first, and then a group of representative metadata, corresponding to the item, to be presented in the structured list 404 may be selected from the group of initial metadata.
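  • The selection of representative metadata from initial metadata can be sketched as a top-k filter. The sketch assumes that each initial metadata entry already carries a numeric `score`; how such a score would be computed is not described here, so the `score` field, the `kind` labels and the `limit` parameter are hypothetical.

```python
def select_representative(initial_metadata, limit=3):
    """Keep the `limit` highest-scoring entries, preserving page order.

    `initial_metadata` is a list of dicts with a hypothetical "score" key.
    """
    by_score = sorted(
        range(len(initial_metadata)),
        key=lambda index: initial_metadata[index]["score"],
        reverse=True,
    )
    kept = sorted(by_score[:limit])  # restore the original ordering
    return [initial_metadata[index] for index in kept]
```

  • Preserving page order is a design choice of this sketch, so that the structured list shows the kept fields in the same sequence as the original item.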
  • At 450, the multiple groups of representative metadata obtained at 440 may be visualized into the structured list 404.
  • The structured list 404 may be formed with the multiple groups of representative metadata according to a predetermined format or layout.
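  • A minimal sketch of forming a structured list from groups of representative metadata is given below. The "predetermined format" is assumed here to be a numbered plain-text list with fields joined by a separator; the field names `title` and `price` and the function name are hypothetical.

```python
def render_structured_list(groups, fields=("title", "price")):
    """Render one numbered line per item, using only the fields it has."""
    lines = []
    for number, group in enumerate(groups, start=1):
        values = [group[field] for field in fields if field in group]
        lines.append(f"{number}. " + " | ".join(values))
    return "\n".join(lines)
```

  • Applied to the first two cellphones of FIG.2, with only the values stated for them, the output has one line per item, e.g., "1. M cellphone A4 | 5500 RMB" followed by "2. M cellphone A3".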
  • The structured list 404 is a simplified version of the original list in the target web page 402, but it still contains enough information to enable a user to intuitively and comprehensively understand main content of the original list.
  • The structured list 404 may be considered as, e.g., a list snippet of the original list.
  • Although the process 400 includes the step of dominant list determination at 430, this step may be omitted in the case that the target web page 402 includes only one original list.
  • Although the step 430 is shown in FIG.4 as being performed before the step 440, the step 430 may also be performed after the step 440.
  • In this case, multiple groups of representative metadata of each original list may be obtained first through the step 440, and then, after a dominant list is determined through the step 430, only multiple groups of representative metadata of the dominant list may be provided to the step 450.
  • FIG.5 illustrates an exemplary process 500 of anchor element group detection according to an embodiment.
  • The process 500 is an exemplary implementation of the step 410 in FIG.4.
  • At 510, multiple anchor elements may be identified from a target web page 502.
  • These anchor elements may be identified from an html source file of the target web page 502.
  • Each item in the original list may include multiple html elements, and an anchor element may be the html element among these html elements which is the most representative and the most helpful for identifying the entire item.
  • The anchor elements identified at 510 may also be referred to as identified anchor elements.
  • Anchor element constraints may be pre-defined, and multiple html elements in the target web page 502 that meet the anchor element constraints may be identified as multiple identified anchor elements.
  • The anchor element constraints may include at least one of: an html element having an image tag; an html element having a title tag; an html element representing a date; etc.
  • Each item in the original list may have a corresponding image, and thus an html element in the html source file that has an image tag, e.g., an <img> tag, etc., may be taken as an anchor element to help identify a corresponding item.
  • Each item in the original list may have a corresponding title, and thus an html element in the html source file that has a title tag, e.g., an <h1> tag, an <h2> tag, etc., may be taken as an anchor element to help identify a corresponding item.
  • Each item in the original list may have a date, e.g., a post date, etc.
  • An html element in the html source file that has a character string representing a date may be taken as an anchor element to help identify a corresponding item.
  • A character string representing a date in the html source file may be identified by various techniques such as regular expression matching. It should be understood that the embodiments of the present disclosure are not limited to the exemplary anchor element constraints above, but may cover any other types of anchor element constraints.
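  • The three exemplary anchor element constraints can be sketched with Python's standard `html.parser` module. This is an illustration only: the `AnchorFinder` class, the `find_anchors` helper, the set of heading tags and the `YYYY-MM-DD` date pattern are assumptions, and a real date matcher would need to cover many more formats.

```python
import re
from html.parser import HTMLParser

# Assumed date shape (YYYY-MM-DD), matching the dates used in the examples.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
TITLE_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class AnchorFinder(HTMLParser):
    """Collects (kind, detail) pairs for elements meeting a constraint."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":                      # element having an image tag
            self.anchors.append(("image", dict(attrs).get("src", "")))
        elif tag in TITLE_TAGS:               # element having a title tag
            self.anchors.append(("title", tag))

    def handle_data(self, data):
        for match in DATE_RE.finditer(data):  # text representing a date
            self.anchors.append(("date", match.group()))


def find_anchors(html):
    """Return the anchor candidates of an html string in document order."""
    finder = AnchorFinder()
    finder.feed(html)
    finder.close()
    return finder.anchors
```

  • The sketch records candidates in document order and does not yet decide which of them belong to the same list; that grouping is the subject of the clustering described next.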
  • A property set of an identified anchor element may include one or more intrinsic attributes of the identified anchor element, e.g., html tag attribute of the identified anchor element, Cascading Style Sheets (CSS) class, XML Path Language (XPath) information, etc.
  • The html tag attribute may indicate the type of html tag of the identified anchor element.
  • The CSS class may indicate which CSS classes the identified anchor element has.
  • The XPath information may indicate location information, node information, etc. of the identified anchor element, which may be obtained, e.g., from a Document Object Model (DOM) tree corresponding to the html source file.
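  • As an illustrative sketch, a property set with the three attributes named above can be read from a parsed tree. The sketch assumes a well-formed XHTML fragment so that the standard `xml.etree.ElementTree` parser can stand in for a DOM tree, and it uses an index-free tag path as simplified XPath information; the function name and dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET


def property_sets(root, wanted_tag):
    """Return one property set per element whose tag equals `wanted_tag`."""
    results = []

    def walk(node, path):
        here = f"{path}/{node.tag}"
        if node.tag == wanted_tag:
            results.append({
                "tag": node.tag,
                # Sorted so that class order in the markup does not matter.
                "css_classes": tuple(sorted(node.get("class", "").split())),
                # Index-free path: elements at the same depth under
                # same-named ancestors share it.
                "xpath": here,
            })
        for child in node:
            walk(child, here)

    walk(root, "")
    return results
```

  • Dropping positional indices from the path is a deliberate simplification in this sketch: it makes the anchors of sibling items produce identical property sets, which is convenient for grouping.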
  • The multiple identified anchor elements may be clustered into at least one anchor element group 504 based on the multiple property sets of these identified anchor elements.
  • Each identified anchor element may be characterized by a corresponding property set, and the multiple property sets of the multiple identified anchor elements may be provided to a pre-trained clustering model as inputs.
  • The clustering model is trained for clustering the multiple anchor elements into at least one anchor element group based on the property sets. For example, those identified anchor elements with similar attributes will be clustered into the same anchor element group.
  • Each anchor element group includes multiple anchor elements having similar attributes, and may correspond to a possible original list, wherein the anchor elements may be respectively associated with different items in the potential original list.
  • The process 500 may adopt any combination of various anchor element constraints, property sets containing any combination of various attributes, etc.
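  • The grouping step can be illustrated with a much simpler stand-in than the pre-trained clustering model described above: the sketch below groups anchor elements whose property sets match exactly. Exact matching is a swapped-in simplification, and the function name, dictionary keys and `min_size` parameter are hypothetical.

```python
from collections import defaultdict


def group_anchors(property_sets, min_size=2):
    """Group anchor indices by identical (tag, classes, path) keys.

    Groups smaller than `min_size` are dropped, since a single anchor
    cannot indicate a list of multiple items.
    """
    groups = defaultdict(list)
    for index, props in enumerate(property_sets):
        key = (props["tag"], props["css_classes"], props["xpath"])
        groups[key].append(index)
    return [members for members in groups.values() if len(members) >= min_size]
```

  • A learned clustering model would additionally tolerate small attribute differences between anchors of the same list, which exact matching cannot do.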
  • FIG.6 illustrates an exemplary anchor element group according to an embodiment.
  • The first anchor element group may include multiple anchor elements, and these anchor elements correspond to the image 222, the image 242 and the image 262 in the original list 202 in FIG.2, wherein the image 222, the image 242 and the image 262 may be clustered into the first anchor element group due to having similar attributes.
  • The second anchor element group may include multiple anchor elements, and these anchor elements correspond to the image 282, the image 284 and the image 286 in the original list 204 in FIG.2, wherein the image 282, the image 284 and the image 286 may be clustered into the second anchor element group due to having similar attributes.
  • Although FIG.6 shows anchor element groups detected according to an anchor element constraint of "an html element having an image tag", the embodiments of the present disclosure may also detect anchor element groups based on other types of anchor element constraints.
  • An anchor element group formed by the title "Spring Festival", the title "Mid-Autumn Festival", the title "Dragon Boat Festival", etc. may be detected according to an anchor element constraint of "an html element having a title tag".
  • An anchor element group formed by the date "2021-10-05" in Tom's display region, the date "2021-10-05" in David's display region, the date "2021-10-06" in Jane's display region, etc. may be detected according to an anchor element constraint of "an html element representing a date".
  • FIG.7 illustrates an exemplary process 700 of boundary detection according to an embodiment.
  • The process 700 is an exemplary implementation of the step 420 in FIG.4.
  • The process 700 may be used for performing boundary detection on an exemplary anchor element group 702, so as to obtain boundaries of multiple items respectively associated with multiple anchor elements in the anchor element group 702, thereby identifying, in a target web page, an original list 704 corresponding to the anchor element group 702 and respective items in the original list 704.
  • The process 700 may be performed based at least on a DOM tree corresponding to the target web page.
  • At 710, iterative boundary expansion may be performed on each anchor element in the anchor element group 702, so as to find elements that may be within the same item as the anchor element.
  • Iterative boundary expansion may be synchronously performed by taking the multiple anchor elements in the anchor element group 702 as starting points respectively.
  • Each anchor element may act as a starting point, and through the iterative boundary expansion, it is possible to sequentially determine and expand to multiple other elements in the DOM tree, starting from this anchor element.
  • This anchor element together with the determined other elements form a tree, and this tree represents an item and thus may also be referred to as an item tree.
  • Multiple nodes in the item tree may respectively correspond to multiple elements, e.g., the anchor element and the elements determined through the iterative boundary expansion.
  • Each step of iteration may expand to a next node, and the next node may be included into the item tree. Multiple steps of iteration form a corresponding expansion path.
  • Through the iterative boundary expansion at 710, multiple item trees respectively originating from the multiple anchor elements in the anchor element group 702 may be obtained.
  • The multiple item trees respectively define boundaries of multiple items.
  • The iterative boundary expansion may include various types of expansion, e.g., sibling node expansion, parent node expansion, etc.
  • The sibling node expansion may be performed for expanding from the current node to a sibling node of the current node in a DOM tree corresponding to a target web page. In one case, if the current node has multiple sibling nodes belonging to the same parent node, it is possible to sequentially expand to the multiple sibling nodes from near to far, starting from the current node.
  • The sibling node expansion may adopt a predetermined expansion direction, e.g., expanding to the right, expanding to the left, expanding to the right and the left alternately, expanding to the left after a number of expansions to the right or after meeting a predetermined condition, expanding to the right after a number of expansions to the left or after meeting a predetermined condition, etc.
  • The parent node expansion may be performed for expanding to a parent node of the current node after all sibling nodes of the current node have been included in the same item tree, and including the parent node into the item tree. After expanding to the parent node, sibling node expansion may be further performed on the parent node, e.g., expanding to sibling nodes of the parent node.
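  • The order in which sibling node expansion and parent node expansion visit nodes can be sketched for a single anchor element as follows. The `Node` class is a hypothetical stand-in for a DOM node, and the direction policy (nearest sibling first, right before left on ties) is only one of the possible predetermined expansion directions. The generator runs up to the root because it does not include the stopping conditions applied during boundary detection.

```python
class Node:
    """Hypothetical minimal DOM node with parent and child links."""

    def __init__(self, name, children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def expansion_path(anchor):
    """Yield nodes in the order an item tree would expand to them."""
    current = anchor
    while current.parent is not None:
        siblings = current.parent.children
        position = siblings.index(current)
        others = [i for i in range(len(siblings)) if i != position]
        # Sibling expansion: near to far; on equal distance, right first.
        others.sort(key=lambda i: (abs(i - position), i < position))
        for i in others:
            yield siblings[i]
        # Parent expansion: only after every sibling has been included.
        current = current.parent
        yield current
```

  • Starting from an image nested as body > list > item1 > (title, img, price), with a second item and a navigation block as neighbours, the path visits price, title, item1, item2, list, nav and body in that order.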
  • The iterative boundary expansion may be performed synchronously among different item trees corresponding to different anchor elements. For example, in each step of iteration, sibling node expansion or parent node expansion is performed once in these item trees synchronously. In one case, for example, in a certain step of sibling node expansion, if a certain item tree S currently has no sibling node that can be expanded to, and other item trees have sibling nodes that can be expanded to, then expansion of the item tree S at the current step may be suspended once while sibling node expansion at the current step is performed on other item trees.
  • The boundary of each item may be expanded as large as possible, e.g., enabling each item to include as many elements as possible through the iterative boundary expansion.
  • There should be no content overlap between different items, e.g., the same element or content should not be included in different items.
  • Structures of different items should be similar, e.g., different items should have at least a predetermined proportion of similar elements or nodes, etc.
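  • The requirement that items share at least a predetermined proportion of similar nodes can be sketched as a multiset comparison. The sketch assumes that two nodes count as similar when they have the same tag name, and that every item is compared with the first one; both choices, the function names and the default proportion of 0.6 are hypothetical.

```python
from collections import Counter


def structural_similarity(tags_a, tags_b):
    """Share of nodes two items have in common, compared by tag name."""
    a, b = Counter(tags_a), Counter(tags_b)
    shared = sum((a & b).values())  # multiset intersection
    return shared / max(sum(a.values()), sum(b.values()), 1)


def items_are_similar(items, min_proportion=0.6):
    """True if every item is similar enough to the first (non-empty input)."""
    first = items[0]
    return all(
        structural_similarity(first, other) >= min_proportion
        for other in items[1:]
    )
```

  • Dividing by the larger item penalises an item that has grown far beyond its peers, which is one way to flag an expansion that has crossed an item boundary.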
  • Content overlap between two items may be caused by the fact that two item trees corresponding to the two items have node overlap, e.g., a certain node is shared by the two item trees. Accordingly, content overlap between different items may be avoided through detecting node overlap during the iterative boundary expansion at 710.
  • the node overlap determination at 720 may be performed synchronously with the iterative boundary expansion at 710, e.g., determining whether there is node overlap after each step of iteration.
  • the process 700 may return to 710 and continue performing the iterative boundary expansion.
  • each item tree is caused to go back or reset to a state at the previous step of iteration before the current step of iteration.
  • the embodiments of the present disclosure may avoid node overlap occurring among the obtained multiple item trees, thereby avoiding content overlap occurring among different items.
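  • The synchronous expansion with node-overlap rollback described above may be illustrated by the following minimal Python sketch. The `Node` class, the `expand_item_trees` function, the right-then-left sibling order, the inclusion of a sibling's descendants together with the sibling, and the toy page are all assumptions made for illustration and are not part of the disclosure.

```python
class Node:
    """Toy DOM node (hypothetical helper for this sketch)."""
    def __init__(self, name, children=()):
        self.name, self.parent, self.children = name, None, list(children)
        for child in self.children:
            child.parent = self

def subtree(node):
    """Return the node together with all of its descendants."""
    nodes = {node}
    for child in node.children:
        nodes |= subtree(child)
    return nodes

def next_sibling(tree):
    """Nearest sibling of the current node not yet in the item tree (right first, then left)."""
    cur = tree["current"]
    if cur.parent is None:
        return None
    sibs = cur.parent.children
    i = sibs.index(cur)
    for s in sibs[i + 1:] + sibs[:i][::-1]:
        if s not in tree["nodes"]:
            return s
    return None

def has_overlap(trees):
    """True if any node is shared by two item trees."""
    seen = set()
    for t in trees:
        if seen & t["nodes"]:
            return True
        seen |= t["nodes"]
    return False

def expand_item_trees(anchors, max_steps=100):
    trees = [{"current": a, "nodes": subtree(a)} for a in anchors]
    for _ in range(max_steps):
        # Snapshot of the state before the current step, used for rollback.
        previous = [{"current": t["current"], "nodes": set(t["nodes"])} for t in trees]
        candidates = [next_sibling(t) for t in trees]
        if any(c is not None for c in candidates):
            # Sibling step: trees without a free sibling are suspended once.
            for t, c in zip(trees, candidates):
                if c is not None:
                    t["nodes"] |= subtree(c)
                    t["current"] = c
        elif all(t["current"].parent is not None for t in trees):
            # Parent step: every tree moves up to the parent of its current node.
            for t in trees:
                t["current"] = t["current"].parent
                t["nodes"].add(t["current"])
        else:
            break
        if has_overlap(trees):
            return previous  # go back to the state at the previous step
    return trees

def make_item(k):
    img = Node(f"img{k}")
    item = Node(f"item{k}", [Node(f"a{k}", [img]), Node(f"title{k}"), Node(f"price{k}")])
    return item, img

items, anchors = zip(*(make_item(k) for k in (1, 2, 3)))
page = Node("list", items)
result = expand_item_trees(anchors)
print([t["current"].name for t in result])
```

  • In this toy page, each image anchor grows into the subtree rooted at its own item node; the next step would pull a neighboring item into each tree, so the overlap check rolls that step back.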
  • the process 700 may perform similarity check to the multiple item trees.
  • the similarity check may be performed in response to determining that the number of nodes in at least one item tree in the multiple item trees exceeds a node number threshold.
  • the at least one item tree may be a predetermined number or a predetermined proportion of item trees among the multiple item trees, and thus the performing of the similarity check may require that: the number of nodes in each item tree of a predetermined number or a predetermined proportion of item trees among the multiple item trees exceeds a node number threshold.
  • the performing of the similarity check may require that: the iterative boundary expansion at 710 has performed a predetermined number of steps of iteration, i.e., each item has contained a predetermined number of elements or each item tree has contained a predetermined number of nodes.
  • the similarity check may be performed synchronously with the iterative boundary expansion at 710, e.g., the similarity check would be performed after each step of iteration.
  • the similarity check would be performed whenever a predetermined number of steps of iteration are performed, e.g., whenever a predetermined number of elements are newly added into each item or whenever a predetermined number of nodes are newly added into each item tree.
  • the embodiments of the present disclosure are not limited to the above exemplary opportunities of performing the similarity check.
  • a tree similarity between any two item trees in the multiple item trees may be calculated.
  • the embodiments of the present disclosure are not limited to any specific technique for calculating a tree similarity.
  • the embodiments of the present disclosure propose a tree similarity calculation method obtained through improving an existing simple tree matching algorithm, and the proposed tree similarity calculation method calculates a tree similarity at least with a CSS similarity-based weight and/or a minimum depth layer level.
  • the embodiments of the present disclosure calculate a tree similarity at least with a matching weight calculated based on a CSS similarity between root nodes of two item trees.
  • a style presented by CSS is important information for dictating page layout. Therefore, through calculating a matching weight based on a CSS similarity between root nodes of two item trees and utilizing the matching weight for calculating a tree similarity between the two item trees, accuracy of tree similarity calculation may be effectively improved.
  • the matching weight may be calculated based on, e.g., respective CSS classes of the two root nodes.
  • the embodiments of the present disclosure may calculate a tree similarity with nodes within a minimum depth layer level in two item trees.
  • a minimum depth layer level may be defined such that: the number of visible nodes within the minimum depth layer level in an item tree reaches a predetermined proportion, e.g., 80% or any other proportion, of the number of all visible nodes in the item tree.
  • the minimum depth layer level may also be defined such that: the number of visible nodes within layer levels, that are less than the minimum depth layer level, in an item tree does not reach a predetermined proportion of the number of all visible nodes in the item tree.
  • a visible node may refer to a visually visible node in a web page, e.g., a node presenting an image, a node presenting text, etc. Therefore, compared to other nodes, a visible node is more important for determining a structure similarity between item trees.
  • An item tree may have multiple layer levels, e.g., assuming that a root node of the item tree is located at a layer level with a depth of 0, child nodes of the root node are located at a layer level with a depth of 1, and so on. Layer levels with larger depth contribute less in determining a structure similarity between two trees.
  • a tree similarity may be calculated with a minimum depth layer level and those layer levels that are less than the minimum depth layer level. For example, assuming that a minimum depth layer level is 3, a tree similarity may be calculated with layer levels with depths of 0, 1, 2, and 3.
  • a minimum depth layer level is determined by at least considering the number of visible nodes, e.g., the number of visible nodes within the minimum depth layer level should not be lower than a predetermined proportion of the number of all visible nodes in the item tree. A predetermined proportion that is appropriately set will ensure that an accurate tree similarity can still be calculated even if those layer levels greater than the minimum depth layer level are not considered in the calculation of tree similarity.
  • the predetermined proportion may have any value preset according to actual application requirements. It should be understood that although the above discussion relates to calculating a tree similarity with nodes within a minimum depth layer level in two item trees, the embodiments of the present disclosure are not limited to this, and may alternatively calculate a tree similarity with nodes within all layer levels in two item trees.
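  • As an illustration of the definition above, the following Python sketch finds the minimum depth layer level from per-depth counts of visible nodes. The function name `min_depth_layer_level`, the list-of-counts input and the sample counts are assumptions made for this sketch; the 80% default follows the example proportion mentioned above.

```python
def min_depth_layer_level(visible_per_depth, proportion=0.8):
    """Smallest depth d such that depths 0..d hold at least `proportion`
    of all visible nodes. visible_per_depth[d] is the number of visible
    nodes located exactly at depth d."""
    total = sum(visible_per_depth)
    if total == 0:
        return 0  # assumption: with no visible nodes, only the root layer is used
    running = 0
    for depth, count in enumerate(visible_per_depth):
        running += count
        if running >= proportion * total:
            return depth
    return len(visible_per_depth) - 1

# Hypothetical item tree with 10 visible nodes spread over depths 0..4.
counts = [0, 1, 5, 3, 1]
print(min_depth_layer_level(counts))
```

  • With these counts, depths 0 to 2 hold 6 of the 10 visible nodes, which is below 80%, while depths 0 to 3 hold 9, so the minimum depth layer level is 3.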
  • Root(T) represents a root node of the tree T
  • Root(T') represents a root node of the tree T'.
  • virtual root nodes may be set for T and T' respectively, and these two virtual root nodes may have the same attribute configuration.
  • L_0, L_1, ..., L_n respectively represent subtree sets at layer level depths 0, 1, ..., n.
  • L_i1, L_i2, ..., L_ik respectively represent k subtrees at the layer level depth i, i.e., subtrees in the subtree set L_i.
  • a matching weight between T and T' may be calculated by Jaccard-coefficient.
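  • the matching weight between T and T' may be calculated as: MatchWeight(Root(T), Root(T')) = |css ∩ css'| / |css ∪ css'|, wherein css and css' denote the sets of CSS classes of Root(T) and Root(T'), respectively.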
  • a minimum depth layer level MinDepth may be determined.
  • a similarity check function SimilarityCheck(·) for calculating a similarity metric between T and T' at the current layer level “layer” is defined, and this function may include the subsequent processing in step 1.3 to step 1.18.
  • at step 1.3 and step 1.4, if it is determined that Root(T) and Root(T') have different html tags, the calculation result of SimilarityCheck(·) is 0.
  • at step 1.5 and step 1.6, if it is determined that the current layer level “layer” is greater than the minimum depth layer level MinDepth, the calculation result of SimilarityCheck(·) is 0, so that tree similarity calculation is avoided in the case that the current layer level is greater than the minimum depth layer level.
  • at step 1.8, the subtrees in the subtree set L_{layer+1} corresponding to the depth “layer+1” in T are represented by m.
  • the subtrees in the subtree set L'_{layer+1} corresponding to the depth “layer+1” in T' are represented by n.
  • a similarity function M[i, j] is defined, which represents the maximum similarity between the first i subtrees in T and the first j subtrees in T'.
  • M[i, 0] and M[0, j] are initialized to 0, respectively.
  • at step 1.12, it is defined that subtrees in the subtree set m of T will be traversed.
  • at step 1.13, it is defined that subtrees in the subtree set n of T' will be traversed.
  • similarity may be calculated with the similarity function M[i, j]. For example, techniques such as dynamic programming may be adopted for performing the calculations at step 1.14 and step 1.15.
  • M[i, j] will obtain the best similarity from three candidates, i.e., M[i, j-1], M[i-1, j] and M[i-1, j-1] + W[i, j], wherein W[i, j] will recursively calculate a similarity between the i-th subtree Ti in T and the j-th subtree T'j in T' at the layer level “layer+1”.
  • the entire tree structure may be considered, rather than just a root node.
  • the calculation result of SimilarityCheck(·) may be returned, which is represented as MatchWeight(Root(T), Root(T')) * (M[m, n] + 1), wherein M[m, n] is the best similarity between subtrees of T and subtrees of T', and "1" represents the root node. It should be understood that the calculation result returned at step 1.18 may indicate, e.g., the number of similar nodes.
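  • The recursion described in the above steps may be sketched in Python as follows. This is only an illustrative reading of the description: the `Tree` class and the function names are hypothetical, the matching weight is taken as a Jaccard coefficient over the CSS classes of the two roots, and the choice to treat two roots without any CSS class as fully matching is an assumption of this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Tree:
    tag: str
    css: frozenset = frozenset()          # CSS classes of the root of this (sub)tree
    children: list = field(default_factory=list)

def match_weight(a, b):
    """Jaccard coefficient of the CSS class sets of two root nodes."""
    if not a.css and not b.css:
        return 1.0  # assumption: two class-less roots match fully
    return len(a.css & b.css) / len(a.css | b.css)

def similarity_check(t, t2, layer, min_depth):
    if t.tag != t2.tag:      # different html tags
        return 0.0
    if layer > min_depth:    # deeper than the minimum depth layer level
        return 0.0
    m, n = t.children, t2.children
    # M[i][j]: best similarity between the first i subtrees of t
    # and the first j subtrees of t2; row 0 and column 0 stay 0.
    M = [[0.0] * (len(n) + 1) for _ in range(len(m) + 1)]
    for i in range(1, len(m) + 1):
        for j in range(1, len(n) + 1):
            w = similarity_check(m[i - 1], n[j - 1], layer + 1, min_depth)
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    return match_weight(t, t2) * (M[len(m)][len(n)] + 1)

a = Tree("div", frozenset({"card"}),
         [Tree("img"), Tree("span", frozenset({"title"}))])
b = Tree("div", frozenset({"card"}),
         [Tree("img"), Tree("span", frozenset({"title"})), Tree("span", frozenset({"price"}))])
print(similarity_check(a, b, 0, 3))
```

  • For the two toy trees, the root, the image node and the title node are matched, so the returned value is 3.0, which is in line with the remark that the result may indicate the number of similar nodes.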
  • a final tree similarity may be calculated with all nodes having depth that is not greater than the minimum depth layer level, wherein TreeSimilarity(·) is a tree similarity function, |T| is the number of nodes in T having depth that is not greater than the minimum depth layer level, and |T'| is the number of nodes in T' having depth that is not greater than the minimum depth layer level.
  • the multiple item trees may be divided into at least one tree set at least with a similarity threshold at 745.
  • Each of the at least one tree set may include at least one item tree, and the item trees in the same tree set have tree similarities among each other that are not lower than the similarity threshold.
  • item trees having high similarities among each other may be divided into the same tree set.
  • it may be determined whether the number of item trees in a tree set containing the highest number of item trees in the at least one tree set is lower than a tree number threshold.
  • the tree set containing the highest number of item trees may be taken as a target tree set for determining whether the iteration should be stopped.
  • the tree number threshold may have a preset value, and this value may be used for, e.g., ensuring that most of the multiple item trees are included in the target tree set. If it is determined at 750 that the number of item trees in the target tree set is not lower than the tree number threshold, the process 700 may return to 710 and continue performing the iterative boundary expansion. If it is determined at 750 that the number of item trees in the target tree set is lower than the tree number threshold, then at 760, the performing of the iterative boundary expansion may be stopped, and nodes that are determined through a predetermined number of previous steps of iteration may be excluded from the multiple item trees respectively.
  • the predetermined number of previous steps of iterations may be determined through: excluding nodes determined through a predetermined number of previous steps of iteration from the multiple item trees respectively, such that the number of item trees in a target tree set obtained, e.g., through the processing of step 740 and step 745 with respect to updated multiple item trees is not lower than the tree number threshold.
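  • The division into tree sets and the check against the tree number threshold may be sketched as follows. The disclosure does not prescribe a particular grouping technique, so the greedy grouping below, the function names and the sample similarity values are assumptions made for illustration only.

```python
def group_item_trees(count, similarity, threshold):
    """Greedily place each item tree into the first tree set whose members
    all have a tree similarity >= threshold with it; otherwise open a new set."""
    tree_sets = []
    for i in range(count):
        for tree_set in tree_sets:
            if all(similarity(i, j) >= threshold for j in tree_set):
                tree_set.append(i)
                break
        else:
            tree_sets.append([i])
    return tree_sets

def should_stop(tree_sets, tree_number_threshold):
    """The largest tree set is the target tree set; expansion stops when it
    contains fewer item trees than the tree number threshold."""
    target = max(tree_sets, key=len)
    return len(target) < tree_number_threshold, target

# Hypothetical pairwise tree similarities for three item trees.
SIM = {(0, 1): 0.4, (0, 2): 0.5, (1, 2): 0.9}

def similarity(i, j):
    return SIM[(min(i, j), max(i, j))]

tree_sets = group_item_trees(3, similarity, threshold=0.8)
stop, target = should_stop(tree_sets, tree_number_threshold=3)
print(tree_sets, stop, target)
```

  • Here only item trees 1 and 2 are similar enough to share a tree set, so the target tree set holds two item trees and, with a tree number threshold of 3, the iterative boundary expansion would be stopped.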
  • the embodiments of the present disclosure may implement similarity check to multiple item trees.
  • the similarity check helps to ensure that the obtained item trees have structure similarity. For example, in the case that multiple item trees in a target tree set are provided as an iterative expansion result, these item trees in the iterative expansion result will have high similarity among each other, thereby ensuring that different items have similar structures.
  • further iterative boundary expansion may be performed at 770 to attempt to find multiple better item trees.
  • step 730 if it is determined that the current step of iteration is sibling node expansion, further iterative boundary expansion may be performed to the multiple item trees in a direction which is reverse to the direction of the current step of iteration. For example, for each item tree, if the current step of iteration is to expand right to a sibling node, an attempt may be made to expand left to a different sibling node.
  • the multiple item trees may first be reset to a state at a predetermined previous step of iteration.
  • the predetermined previous step of iteration may be determined, e.g., with the predetermined number of previous steps of iteration involved at 760. For example, if the predetermined number of previous steps of iteration is previous 2 steps of iteration, the predetermined previous step of iteration may be the previous 3rd step of iteration. Then, if it is determined that a next step of iteration after the predetermined previous step of iteration is sibling node expansion, an attempt may be made to perform further iterative boundary expansion to the multiple item trees in a direction which is reverse to the direction of the next step of iteration.
  • the process 700 may also include performing steps 720 to 730 and/or steps 740 to 760 with respect to the further iterative boundary expansion at 770, so as to ensure that the obtained multiple item trees do not have node overlap but have similar structure, thereby ensuring that different items do not have content overlap but have similar structure.
  • multiple item trees originating from multiple anchor elements in the anchor element group 702 may be obtained, and these item trees respectively define boundaries of multiple corresponding items, thereby the original list 704 formed by these items may be finally identified.
  • both the processing related to determining whether node overlap occurs and the processing related to performing similarity check in the process 700 are optional, and either or both of these processings may be included in the process 700, or either or both of these processings may be omitted from the process 700.
  • the process 700 may also provide only multiple item trees in the target tree set as the iterative expansion result, and form the original list 704 with these item trees.
  • FIG.8A to FIG.8F illustrate an example of iterative boundary expansion according to an embodiment.
  • an exemplary process of iterative boundary expansion is shown in an exemplary DOM tree corresponding to a target web page.
  • the DOM tree may include multiple nodes, e.g., node 801 to node 826, and other nodes that are not shown. Symbols such as “Div”, “A”, “Span”, “P”, “Img”, etc., displayed in blocks representing nodes indicate html tags of corresponding nodes. Moreover, the embodiments of the present disclosure propose to set "Text" tags for text strings appearing in an html source file, although such html tags do not exist, and these text strings may also be taken as nodes in the DOM tree, e.g., node 819, node 820, etc. These text strings may be visible elements that may be presented, and thus setting Text tags and corresponding nodes for these text strings helps to determine item boundaries more accurately.
  • node tags are shown in FIG.8A to FIG.8F, but in practical applications, any other types of node tag may exist, and the embodiments of the present disclosure are not limited in any way by what specific tags the nodes in the DOM tree have.
  • nodes included into an item tree through iterative boundary expansion are highlighted by shading, and expansion paths of the iterative boundary expansion are indicated by arrows.
  • node 818, node 821 and node 824 have been identified as anchor elements with an Img (image) tag. Then, iterative boundary expansion may be performed synchronously by taking these nodes as starting points respectively, so as to obtain item trees respectively originating from these nodes.
  • an item tree originating from node 818 is referred to as a first item tree
  • an item tree originating from node 821 is referred to as a second item tree
  • an item tree originating from node 824 is referred to as a third item tree.
  • FIG.8B illustrates the 1st step of iteration. Since none of node 818, node 821 and node 824 has sibling nodes, parent node expansion will be performed in the 1st step of iteration. For example, in the first item tree, the expansion is from node 818 to a parent node 806 of node 818; in the second item tree, the expansion is from node 821 to a parent node 810 of node 821; and in the third item tree, the expansion is from node 824 to a parent node 814 of node 824.
  • FIG.8C illustrates the 2nd step to the 4th step of iteration in which sibling node expansion will be performed.
  • in the first item tree, node 806 determined through the 1st step of iteration is the current node, which has sibling nodes 807, 808 and 809, and thus the 2nd step to the 4th step of iteration will expand to the right sequentially to node 807, node 808 and node 809 as indicated by the arrow.
  • in the second item tree, the 2nd step to the 4th step of iteration will expand to the right sequentially to node 811, node 812 and node 813 as indicated by the arrow; and in the third item tree, the 2nd step to the 4th step of iteration will expand to the right sequentially to node 815, node 816 and node 817 as indicated by the arrow.
  • node 808 has child node 819 and node 809 has child node 820
  • node 819 and node 820 may also be included into the first item tree.
  • child node 822 of node 812 and child node 823 of node 813 may be included into the second item tree
  • child node 825 of node 816 and child node 826 of node 817 may be included into the third item tree.
  • FIG.8D illustrates the 5th step of iteration in which parent node expansion will be performed.
  • in the first item tree, the 5th step of iteration will expand to parent node 803 of nodes 806 to 809 as indicated by the arrow.
  • in the second item tree, the 5th step of iteration will expand to node 804 as indicated by the arrow; and in the third item tree, the 5th step of iteration will expand to node 805 as indicated by the arrow.
  • FIG.8E illustrates the 6th step of iteration in which parent node expansion will be performed.
  • in the first item tree, node 803 determined through the 5th step of iteration has no sibling node, and thus the 6th step of iteration will expand to parent node 802 of node 803 as indicated by the arrow.
  • in the second item tree, the 6th step of iteration will expand to node 802 as indicated by the arrow; and in the third item tree, the 6th step of iteration will expand to node 802 as indicated by the arrow.
  • node 802 will be included in the first item tree, the second item tree and the third item tree at the same time, thereby causing node overlap to occur. Therefore, the performing of the iterative boundary expansion will be stopped, and node 802 determined through the 6th step of iteration will be excluded from the first item tree, the second item tree and the third item tree, respectively.
  • the finally obtained first item tree 830, the finally obtained second item tree 840 and the finally obtained third item tree 850 are shown by dashed blocks.
  • the first item tree 830, the second item tree 840 and the third item tree 850 respectively correspond to a first item, a second item and a third item in an original list in the target web page.
  • each item tree has its own root node, e.g., the first item tree 830, the second item tree 840 and the third item tree 850 have their own root nodes 803, 804 and 805, respectively, and thus a boundary of each item tree may actually be indicated by a html tag of a root node, e.g., a boundary of the first item tree may be indicated by the "Div" tag of root node 803.
  • FIG.9A to FIG.9F illustrate an example of iterative boundary expansion according to an embodiment.
  • FIG.9A to FIG.9F illustrate examples of performing iterative boundary expansion in different approaches in an exemplary DOM tree corresponding to a target web page.
  • the DOM tree may include multiple nodes, e.g., node 901 to node 929, and other nodes that are not shown.
  • node 919, node 923 and node 927 have been identified as anchor elements with an Img (image) tag.
  • iterative boundary expansion may be performed synchronously by taking these nodes as starting points respectively, so as to obtain item trees respectively originating from these nodes.
  • an item tree originating from node 919 is referred to as a first item tree
  • an item tree originating from node 923 is referred to as a second item tree
  • an item tree originating from node 927 is referred to as a third item tree.
  • FIG.9B illustrates the 1st step of iteration. Since none of node 919, node 923 and node 927 has sibling nodes, parent node expansion will be performed in the 1st step of iteration. For example, in the first item tree, the expansion is from node 919 to parent node 904 of node 919; in the second item tree, the expansion is from node 923 to parent node 909 of node 923; and in the third item tree, the expansion is from node 927 to parent node 914 of node 927.
  • FIG.9C shows item trees finally obtained by performing subsequent iterative boundary expansion in an expansion approach, on the basis of the 1st step of iteration in FIG.9B.
  • the 2nd step to the 5th step of iteration will sequentially perform sibling node expansion to the left as indicated by the arrows.
  • in the first item tree, the 2nd step of iteration will expand to the left to node 903 as indicated by the arrow, and since there is no other sibling node, the 3rd step to the 5th step of iteration will be suspended in the first item tree.
  • in the second item tree, the 2nd step to the 5th step of iteration will sequentially expand to the left to node 908, node 907, node 906 and node 905 as indicated by the arrow.
  • in the third item tree, the 2nd step to the 5th step of iteration will sequentially expand to the left to node 913, node 912, node 911 and node 910 as indicated by the arrow. Since the node 902 is a common parent node of the first item tree, the second item tree and the third item tree, in order to avoid node overlap, the finally obtained first item tree, second item tree and third item tree do not include node 902.
  • the finally obtained first item tree 932, the finally obtained second item tree 934 and the finally obtained third item tree 936 are shown by dashed blocks. It should be understood that node 915, node 916, node 917, node 928 and node 929 are not included in any item tree. Moreover, it is assumed that similarity check described above in connection with FIG.7 is further performed to these finally obtained item trees, and it is found that although there is a high similarity between the second item tree 934 and the third item tree 936, a tree similarity between the first item tree 932 and the second item tree 934 and a tree similarity between the first item tree 932 and the third item tree 936 are both lower than a similarity threshold. Therefore, the first item tree 932 may be regarded as an unqualified item tree and thus is discarded, and accordingly, the expansion approach in FIG.9C actually outputs only two item trees 934 and 936 finally.
  • FIG.9D shows item trees finally obtained by performing subsequent iterative boundary expansion in another expansion approach, on the basis of the 1st step of iteration in FIG.9B.
  • the 2nd step to the 5th step of iteration will sequentially perform sibling node expansion to the right as indicated by the arrows.
  • in the first item tree, the 2nd step to the 5th step of iteration will sequentially expand to the right to node 905, node 906, node 907 and node 908 as indicated by the arrow.
  • in the second item tree, the 2nd step to the 5th step of iteration will sequentially expand to the right to node 910, node 911, node 912 and node 913 as indicated by the arrow.
  • in the third item tree, the 2nd step to the 4th step of iteration will sequentially expand to the right to node 915, node 916 and node 917 as indicated by the arrow, and since there is no further sibling node, the 5th step of iteration will be suspended in the third item tree.
  • the finally obtained first item tree 942, the finally obtained second item tree 944 and the finally obtained third item tree 946 are shown by dashed blocks. It should be understood that node 903 and node 918 are not included in any item tree. Moreover, it is assumed that similarity check described above in connection with FIG.7 is further performed to these finally obtained item trees, and it is found that none of tree similarities among these item trees is lower than a similarity threshold.
  • accordingly, the expansion approach in FIG.9D actually outputs all three item trees 942, 944 and 946 finally.
  • further iterative boundary expansion may be performed according to, e.g., step 770 in FIG.7. It is assumed that the item trees shown in FIG.9D are respectively reset to the state at the 4th step of iteration, as shown in FIG.9E.
  • the current expansion path of the first item tree sequentially includes node 919, node 904, node 905, node 906 and node 907 as indicated by the arrow
  • the current expansion path of the second item tree sequentially includes node 923, node 909, node 910, node 911 and node 912 as indicated by the arrow
  • the current expansion path of the third item tree sequentially includes node 927, node 914, node 915, node 916 and node 917 as indicated by the arrow.
  • the 5th step of iteration in FIG.9F may attempt to perform sibling node expansion to the left, i.e., in a direction which is reverse to the direction of the 5th step of iteration in FIG.9D. Accordingly, in the 5th step of iteration in FIG.9F, the expansion path of the first item tree will further include node 903 as indicated by the arrow, the expansion path of the second item tree will further include node 908 as indicated by the arrow, and the expansion path of the third item tree will further include node 913 as indicated by the arrow.
  • in FIG.9F, the finally obtained first item tree 952, the finally obtained second item tree 954 and the finally obtained third item tree 956 are shown by dashed blocks. Moreover, it is assumed that similarity check described above in connection with FIG.7 is further performed to these finally obtained item trees, and it is found that none of tree similarities among these item trees is lower than a similarity threshold. Accordingly, the expansion approach in FIG.9F actually outputs all three item trees 952, 954 and 956 finally.
  • the performing of the iterative boundary expansion may follow predetermined criteria, e.g., the more item trees obtained the better, the more nodes in each item tree the better, the higher the tree similarities among different item trees the better, etc.
  • the expansion approach in FIG.9F will be better than the expansion approach in FIG.9D, because: the expansion approach in FIG.9F may include more nodes (e.g., node 903 and node 918) into the item trees; and tree similarities among the item trees obtained through the expansion approach in FIG.9F are higher, e.g., a tree similarity between the third item tree 956 and the first item tree 952 and a tree similarity between the third item tree 956 and the second item tree 954 will be higher than a tree similarity between the third item tree 946 and the first item tree 942 and a tree similarity between the third item tree 946 and the second item tree 944.
  • the embodiments of the present disclosure may perform further iterative boundary expansions in various approaches. For example, instead of respectively resetting the item trees shown in FIG.9D to the state at the 4th step of iteration shown in FIG.9E, the item trees shown in FIG.9D may be reset to a state at any other predetermined previous step of iteration, e.g., reset to the state at the 3rd step of iteration. Then, expansion may be performed to the reset item trees in a direction which is reverse to the direction of the next step of iteration after the predetermined previous step of iteration. Moreover, it should be understood that the embodiments of the present disclosure may also perform further iterative boundary expansions in multiple different approaches, and select, from these different approaches, an approach that can obtain the best item trees.
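  • The reset-and-reverse attempt described above may be sketched in Python on a flat list of sibling nodes. The sketch assumes, for illustration only, that nodes 903 to 917 are consecutive siblings in that order, which is consistent with the expansion paths described but is not stated explicitly; the function `sibling_step` and the history list are hypothetical helpers, and only the top-level sibling nodes of each item tree are tracked.

```python
def sibling_step(items, siblings, direction):
    """One synchronous sibling-expansion step. Each item tries to take the
    free sibling next to its current extent; items with no free sibling in
    that direction are suspended for this step."""
    index = {node: i for i, node in enumerate(siblings)}
    claimed = {node for item in items for node in item}
    new_items = [list(item) for item in items]
    for item in new_items:
        positions = [index[node] for node in item]
        edge = max(positions) + 1 if direction == "right" else min(positions) - 1
        if 0 <= edge < len(siblings) and siblings[edge] not in claimed:
            item.append(siblings[edge])
            claimed.add(siblings[edge])
    return new_items

siblings = list(range(903, 918))       # assumed sibling order 903, 904, ..., 917
history = [[[904], [909], [914]]]      # state after the 1st step (parent expansion)
for _ in range(4):                     # 2nd to 5th step, all to the right
    history.append(sibling_step(history[-1], siblings, "right"))

rightward = history[-1]                # result of expanding right only
reset = history[-2]                    # state at the 4th step of iteration
reversed_last = sibling_step(reset, siblings, "left")  # 5th step redone to the left

print([len(item) for item in rightward])
print([len(item) for item in reversed_last])
```

  • Under these assumptions, expanding only to the right leaves the third item with one top-level node fewer than the others, whereas redoing the 5th step to the left gives all three items five top-level nodes each, which illustrates why the reversed attempt may yield more similar item trees.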
  • FIG.10 illustrates an exemplary boundary detection result according to an embodiment. It is assumed that boundary detection has been performed to the target web page 20 in FIG.2. As shown in FIG.10, dashed block 1010 denotes an item corresponding to “M cellphone A4” identified through boundary detection, dashed block 1020 denotes an item corresponding to “M cellphone A3” identified through boundary detection, and dashed block 1030 denotes an item corresponding to "M cellphone A2" identified through boundary detection. The items denoted by dashed blocks 1010, 1020 and 1030 together form the original list 202 in the target web page 20 shown in FIG.2.
  • FIG.11 illustrates an exemplary process 1100 of dominant list determination according to an embodiment. The process 1100 is an exemplary implementation of the step 430 in FIG.4. Assuming that it has been determined that a target web page includes more than one original list, e.g., a first original list 1102, a second original list 1104, etc., the process 1100 may be performed for determining a dominant list from these original lists.
  • visual features of the first original list 1102 may be determined at least with boundaries of items in the first original list 1102.
  • visual features of the second original list 1104 may be determined at least with boundaries of items in the second original list 1104.
  • visual features of an original list may refer to various visual features that help to determine whether the original list occupies a main position in a target web page, whether it is used for presenting main content of the target web page, etc.
  • the visual features may include a minimum boundary distance between adjacent items within the original list, which may indicate a visual distance between the two items. For example, a minimum boundary distance between two adjacent items may be calculated with boundaries of these two items.
  • the visual features may include list position which may indicate whether the original list occupies a main position in the target web page and thus acts as a main content portion in the target web page.
  • the list position may include a position of the original list in a horizontal direction in the target web page.
  • the list position may include a position of the original list in a vertical direction in the target web page, e.g., whether the list is located above the fold of the screen, etc.
  • the visual features may include item content richness which indicates visual content richness of items in the original list.
  • item content richness of an item may include, e.g., size of the item, number of nodes contained in the item, etc., determined based on a boundary of the item.
  • a dominant list may be determined from the first original list 1102 and the second original list 1104 based on the visual features of the first original list 1102 and the visual features of the second original list 1104.
  • the dominant list may be determined with multiple heuristic rules defined for the visual features. For example, for the minimum boundary distance, a heuristic rule may be defined as to whether visual distances among items in a list are small, which is based on the consideration that distances among items in a dominant list are usually not very far. For example, for the list position, a heuristic rule may be defined as to whether an original list occupies a main position in a target web page, which is based on the consideration that a dominant list usually occupies a main position in a target web page.
  • a heuristic rule may be defined as to whether an original list has a high item content richness, which is based on the consideration that a dominant list usually has a high item content richness. According to the above heuristic rules, an original list that can better satisfy these heuristic rules may be selected from the first original list 1102 and the second original list 1104 as the dominant list.
  • the original list 202 may be identified as the dominant list from the original list 202 and the original list 204.
  • FIG.12 illustrates an exemplary process 1200 of representative metadata obtaining according to an embodiment.
  • the process 1200 is an exemplary implementation of the step 440 in FIG.4. It is assumed that the process 1200 is performed for obtaining a group of representative metadata for a specific item 1202 in an original list.
  • a group of leaf nodes in an item tree that is identified by a boundary of the item 1202 may be identified. For example, leaf nodes in an item tree corresponding to the item 1202 may be identified.
  • initial metadata may include, e.g., the picture in the item, the character string "M cellphone A4, 6.5 inches, 256G, black", the icon of 5 solid stars, the character string "25900 reviews", the character string "5500 RMB", etc.
  • a group of tags corresponding to the extracted group of initial metadata may be determined. For example, a corresponding tag is determined for each initial metadata, to indicate specific meaning of the initial metadata.
  • the group of tags may be determined at 1230 in various approaches.
  • a token sequence may be formed with the group of initial metadata.
  • Each token in the token sequence corresponds to an initial metadata in the group of initial metadata.
  • a feature set for each token in the token sequence may be calculated.
  • the feature set may include various types of features that facilitate determining tags, e.g., DOM tree feature, XPath feature, content feature, language feature, rendering feature, etc.
  • the DOM tree feature may include, e.g., layer level depth, tag, class ID, etc., of a node corresponding to the token.
  • the XPath feature may include, e.g., name, CSS class, etc. of a node corresponding to the token.
  • the content feature may include, e.g., text vector of the token, whether the first letter is capitalized, etc.
  • the language feature may include, e.g., language used by the token, Word2vec semantic feature vector of the token, etc.
  • the rendering feature may include, e.g., various features involved in rendering a node corresponding to the token, such as position, length, width, etc. It should be understood that the embodiments of the present disclosure are not limited to the exemplary features included in the feature set given above, but may cover any other features or any combination of these features.
  • a tag for each token may be generated based on multiple feature sets of multiple tokens in the token sequence, through a previously-trained tagger model.
  • the tagger model may be a combined model formed by a discriminative model and a generative model, wherein the discriminative model may be, e.g., a binary-classification or multi-classification model, and the generative model may be, e.g., a sequence-to-sequence (Seq2seq) model.
  • the embodiments of the present disclosure are not limited to generating tags through the tagger model described above, but may also generate tags in any other approaches.
  • an "image" tag may be generated for the picture in the item
  • a "title" tag may be generated for the character string "M cellphone A4, 6.5 inches, 256G, black"
  • a "rating" tag may be generated for the icon of 5 solid stars
  • a "review" tag may be generated for the character string "25900 reviews"
  • a "price" tag may be generated for the character string "5500 RMB".
  • the group of initial metadata may be ranked with the generated group of tags.
  • a keyword ranking model may be trained previously, which may be used for ranking a group of tags that act as keywords.
  • the keyword ranking model may be trained for ranking multiple tags according to, e.g., importance degree, representativeness, etc.
  • these tags may be ranked from high to low as, e.g., image tag, title tag, price tag, rating tag, review tag, etc. Accordingly, the initial metadata corresponding to these tags are also ranked in the same order.
  • one or more highest-ranked initial metadata may be selected as a group of representative metadata corresponding to the item 1202.
  • multiple groups of representative metadata respectively corresponding to multiple items in the original list may be obtained.
  • the multiple groups of representative metadata may be subsequently used for generating a structured list.
  • the multiple groups of representative metadata may be visualized into a structured list.
  • Each group of representative metadata may form a new item in the structured list.
  • format or layout of the structured list may be pre-defined for specifying, e.g., an arranging approach (e.g., horizontal arrangement, vertical arrangement, etc.) of multiple items in the structured list, an arranging approach of multiple elements within each item, sizes of items and elements, etc.
  • the format or layout of the structured list may be similar to that of the original list, except that the structured list may include fewer items or elements than the original list.
  • FIG.13 illustrates an exemplary search result page 1300 according to an embodiment. It is assumed that a user has input a query "M cellphone" in a search box 1310 to indicate that the user wants to obtain web search results regarding the M cellphone.
  • a search result region 1320 in the search result page 1300 includes multiple web search results. Unlike the search results shown in the region 330 in FIG.3 for the web page 20 in FIG.2, the search result region 1320 in FIG.13 includes an exemplary structured list 1330 generated for the web page 20 in FIG.2.
  • the structured list 1330 is a simplified version of the original list 202 in the target web page 20, which may act as a list snippet of the original list 202.
  • the structured list 1330 still contains enough information to enable the user to intuitively and comprehensively understand main content of the original list 202.
  • the structured list 1330 includes item 1332, item 1334 and item 1336, and these items respectively correspond to item 1010, item 1020 and item 1030 in the original list in the target web page 20 (as shown in FIG.10), and include main representative content in corresponding items in the original list.
  • item 1332 includes the picture, brief introduction and price for the cellphone "M Cellphone A4" presented in the region 22 in FIG.2.
  • the user may intuitively and conveniently learn main content in the target web page 20 by viewing the structured list 1330 in the search result region 1320, without the need of, e.g., clicking a link to the target web page 20 in order to learn content in the web page.
  • the search result page 1300 in FIG.13 and the structured list 1330 therein are merely exemplary, and the embodiments of the present disclosure are not limited to this example in any approach.
  • FIG.14 illustrates a flowchart of an exemplary method 1400 for list extraction and visualization in web pages according to an embodiment.
  • At 1410, at least one anchor element group in a target web page may be detected, the at least one anchor element group comprising a first anchor element group.
  • boundary detection may be performed to multiple anchor elements in the first anchor element group, to obtain boundaries of multiple items respectively associated with the multiple anchor elements, the multiple items corresponding to a first original list in the target web page.
  • multiple groups of representative metadata respectively corresponding to the multiple items may be obtained from the target web page with the boundaries of the multiple items.
  • the multiple groups of representative metadata may be visualized into a structured list.
  • the detecting at least one anchor element group may comprise: identifying multiple html elements, that meet anchor element constraints, in the target web page as multiple identified anchor elements; extracting, from the target web page, a property set of each identified anchor element in the multiple identified anchor elements; and clustering the multiple identified anchor elements into the at least one anchor element group based on multiple property sets of the multiple identified anchor elements.
  • the anchor element constraints may comprise at least one of: a html element having an image tag; a html element having a title tag; and a html element representing a date.
  • a property set of each identified anchor element may comprise at least one of html tag attribute, CSS class and XPath information of the identified anchor element.
  • the boundary detection may comprise: based on a DOM tree corresponding to the target web page, performing iterative boundary expansion synchronously by taking the multiple anchor elements as starting points respectively, to obtain multiple item trees respectively originating from the multiple anchor elements, wherein each item tree represents an item and comprises multiple nodes, and each node corresponds to an element determined through the iterative boundary expansion.
  • the iterative boundary expansion may comprise: for each item tree and in each step of iteration, expanding to a next node and including the next node into the item tree.
  • the iterative boundary expansion may comprise at least one of: performing sibling node expansion, to expand from the current node to a sibling node of the current node; and performing parent node expansion, to expand to a parent node of the current node after all sibling nodes of the current node have been included in the item tree.
  • the boundary detection may comprise: determining whether the current step of iteration results in node overlap between the item tree and at least one other item tree in the multiple item trees; and in response to determining that the node overlap occurs, stopping the performing of the iterative boundary expansion, and excluding nodes, that are determined through the current step of iteration, from the multiple item trees respectively.
  • the method may further comprise: if the current step of iteration is sibling node expansion, performing further iterative boundary expansion to the multiple item trees in a direction which is reverse to the direction of the current step of iteration.
  • the boundary detection may comprise: performing similarity check to the multiple item trees.
  • the similarity check may be performed in response to determining that the number of nodes in at least one item tree in the multiple item trees exceeds a node number threshold.
  • the similarity check may comprise: calculating a tree similarity between any two item trees in the multiple item trees; dividing the multiple item trees into at least one tree set at least with a similarity threshold, item trees in each tree set in the at least one tree set having tree similarities, that are not lower than the similarity threshold, among each other; determining whether the number of item trees in a tree set containing the highest number of item trees in the at least one tree set is lower than a tree number threshold; and in response to determining that the number of item trees is lower than the tree number threshold, stopping the performing of the iterative boundary expansion, and excluding nodes, that are determined through a predetermined number of previous steps of iteration, from the multiple item trees respectively.
  • the calculating a tree similarity may comprise at least one of: calculating the tree similarity at least with a matching weight calculated based on a CSS similarity between root nodes of the two item trees; and calculating the tree similarity with nodes within a minimum depth layer level in the two item trees, the minimum depth layer level being defined such that: the number of visible nodes within the minimum depth layer level in an item tree reaches a predetermined proportion of the number of all visible nodes in the item tree.
  • the method may further comprise: resetting the multiple item trees to a state at a predetermined previous step of iteration; and if a next step of iteration after the predetermined previous step of iteration is sibling node expansion, performing further iterative boundary expansion to the multiple item trees in a direction which is reverse to the direction of the next step of iteration.
  • the obtaining multiple groups of representative metadata may comprise, for each item in the multiple items: identifying a group of leaf nodes in an item tree that is identified by a boundary of the item; extracting a group of initial metadata corresponding to the group of leaf nodes; determining a group of tags corresponding to the group of initial metadata; ranking the group of initial metadata with the group of tags; and selecting one or more highest-ranked initial metadata as a group of representative metadata corresponding to the item.
  • the determining a group of tags may comprise: forming a token sequence with the group of initial metadata, each token in the token sequence corresponding to an initial metadata in the group of initial metadata; calculating a feature set for each token in the token sequence; and generating a tag for each token based on multiple feature sets of multiple tokens in the token sequence, through a previously-trained tagger model.
  • the at least one anchor element group may comprise a second anchor element group.
  • the method may further comprise: performing boundary detection to multiple anchor elements in the second anchor element group, to obtain boundaries of multiple items respectively associated with the multiple anchor elements, the multiple items corresponding to a second original list in the target web page.
  • the method may further comprise, before obtaining multiple groups of representative metadata or before visualizing the multiple groups of representative metadata into a structured list: determining visual features of the first original list and visual features of the second original list respectively with boundaries of items in the first original list and boundaries of items in the second original list; and determining, based on the visual features of the first original list and the visual features of the second original list, that the first original list is a dominant list in the first original list and the second original list.
  • the visual features may comprise at least one of: minimum boundary distance between adjacent items; list position; and item content richness.
  • the structured list may be presented in a search result page provided by a search service.
  • method 1400 may further comprise any step/process for list extraction and visualization in web pages according to the embodiments of the present disclosure described above.
  • FIG.15 illustrates an exemplary apparatus 1500 for list extraction and visualization in web pages according to an embodiment.
  • the apparatus 1500 may comprise: an anchor element group detecting module 1510, for detecting at least one anchor element group in a target web page, the at least one anchor element group comprising a first anchor element group; a boundary detecting module 1520, for performing boundary detection to multiple anchor elements in the first anchor element group, to obtain boundaries of multiple items respectively associated with the multiple anchor elements, the multiple items corresponding to a first original list in the target web page; a representative metadata obtaining module 1530, for obtaining, from the target web page, multiple groups of representative metadata respectively corresponding to the multiple items, with the boundaries of the multiple items; and a representative metadata visualizing module 1540, for visualizing the multiple groups of representative metadata into a structured list.
  • apparatus 1500 may further comprise any other modules that are configured for performing any operation of the methods for list extraction and visualization in web pages according to the embodiments of the present disclosure described above.
  • FIG.16 illustrates an exemplary apparatus 1600 for list extraction and visualization in web pages according to an embodiment.
  • the apparatus 1600 may comprise at least one processor 1610.
  • the apparatus 1600 may further comprise a memory 1620 connected with at least one processor 1610.
  • the memory 1620 may store computer-executable instructions that, when executed, cause the at least one processor 1610 to: detect at least one anchor element group in a target web page, the at least one anchor element group comprising a first anchor element group; perform boundary detection to multiple anchor elements in the first anchor element group, to obtain boundaries of multiple items respectively associated with the multiple anchor elements, the multiple items corresponding to a first original list in the target web page; obtain, from the target web page, multiple groups of representative metadata respectively corresponding to the multiple items, with the boundaries of the multiple items; and visualize the multiple groups of representative metadata into a structured list.
  • the at least one processor 1610 may also be configured for performing any other operation of the methods for list extraction and visualization in web pages according to the embodiments of the present disclosure described above.
  • the embodiments of the present disclosure propose a computer program product for list extraction and visualization in web pages.
  • the computer program product comprises a computer program that is executed by at least one processor for: detecting at least one anchor element group in a target web page, the at least one anchor element group comprising a first anchor element group; performing boundary detection to multiple anchor elements in the first anchor element group, to obtain boundaries of multiple items respectively associated with the multiple anchor elements, the multiple items corresponding to a first original list in the target web page; obtaining, from the target web page, multiple groups of representative metadata respectively corresponding to the multiple items, with the boundaries of the multiple items; and visualizing the multiple groups of representative metadata into a structured list.
  • the computer program may further be executed by the at least one processor for performing any other operation of the methods for list extraction and visualization in web pages according to the embodiments of the present disclosure described above.
  • the embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium.
  • the non-transitory computer readable medium may comprise instructions that, when executed, cause one or more processors to perform any step/process of the methods for list extraction and visualization in web pages according to the embodiments of the present disclosure described above.
  • modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
  • processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system.
  • a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a micro-processor, micro-controller, digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure.
  • the functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by a microprocessor, micro-controller, DSP, or other suitable platform.
  • a computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk.
  • although a memory is shown as being separate from the processor in various aspects presented in this disclosure, a memory may also be internal to the processor (e.g., a cache or a register).

Abstract

The present disclosure provides methods, apparatuses and computer program products for list extraction and visualization in web pages. At least one anchor element group in a target web page may be detected, the at least one anchor element group comprising a first anchor element group. Boundary detection may be performed to multiple anchor elements in the first anchor element group, to obtain boundaries of multiple items respectively associated with the multiple anchor elements, the multiple items corresponding to a first original list in the target web page. Multiple groups of representative metadata respectively corresponding to the multiple items may be obtained from the target web page with the boundaries of the multiple items. The multiple groups of representative metadata may be visualized into a structured list.

Description

LIST EXTRACTION AND VISUALIZATION IN WEB PAGES
BACKGROUND
There are a large number of web pages on the network, and these web pages contain various types of information. In some scenarios, web users may need to find web pages of interest on the network in order to obtain desired information. Search engine providers may provide search services for assisting users to find web pages of interest. For example, in response to a search query from a user, a search service may return to the user a search result page which includes information about web pages relevant to the search query, e.g., web page links, snippets, etc.
SUMMARY
This Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Embodiments of the present disclosure propose methods, apparatuses and computer program products for list extraction and visualization in web pages. At least one anchor element group in a target web page may be detected, the at least one anchor element group comprising a first anchor element group. Boundary detection may be performed to multiple anchor elements in the first anchor element group, to obtain boundaries of multiple items respectively associated with the multiple anchor elements, the multiple items corresponding to a first original list in the target web page. Multiple groups of representative metadata respectively corresponding to the multiple items may be obtained from the target web page with the boundaries of the multiple items. The multiple groups of representative metadata may be visualized into a structured list.
It should be noted that the above one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the drawings set forth in detail certain illustrative features of the one or more aspects. These features are only indicative of the various ways in which the principles of various aspects may be employed, and this disclosure is intended to include all such aspects and their equivalents.
BRIEF DESCRIPTION OF THE DRAWINGS
The disclosed aspects will hereinafter be described in conjunction with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.
FIG.1 illustrates an exemplary list web page.
FIG.2 illustrates an exemplary list web page.
FIG.3 illustrates an existing exemplary search result page.
FIG.4 illustrates an exemplary process of list extraction and visualization in web pages according to an embodiment.
FIG.5 illustrates an exemplary process of anchor element group detection according to an embodiment.
FIG.6 illustrates an exemplary anchor element group according to an embodiment.
FIG.7 illustrates an exemplary process of boundary detection according to an embodiment.
FIG.8A to FIG.8F illustrate an example of iterative boundary expansion according to an embodiment.
FIG.9A to FIG.9F illustrate an example of iterative boundary expansion according to an embodiment.
FIG.10 illustrates an exemplary boundary detection result according to an embodiment.
FIG.11 illustrates an exemplary process of dominant list determination according to an embodiment.
FIG.12 illustrates an exemplary process of representative metadata obtaining according to an embodiment.
FIG.13 illustrates an exemplary search result page according to an embodiment.
FIG.14 illustrates a flowchart of an exemplary method for list extraction and visualization in web pages according to an embodiment.
FIG.15 illustrates an exemplary apparatus for list extraction and visualization in web pages according to an embodiment.
FIG.16 illustrates an exemplary apparatus for list extraction and visualization in web pages according to an embodiment.
DETAILED DESCRIPTION
The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.
Existing search services usually extract specific text from an original web page to form a snippet in the form of text, i.e., a text snippet, and display the text snippet in a search result page, so that when a user views this text snippet, the user can get a general idea of what the original web page is about. There are a large number of list web pages on the network, wherein a list web page may refer to a web page whose main content is a list that includes multiple items. For such list web pages, the existing search services still only extract specific text from a list web page to form a text snippet, and the text in the text snippet may be extracted only from specific items in the list, e.g., extracted from the first item in the list. Therefore, when a user views the text snippet in a search result page, the user can only learn limited local information about the list web page.
Embodiments of the present disclosure propose to perform list extraction and visualization in web pages, such that list content may be extracted from a target web page and organized into a structured form. Herein, a target web page may be a list web page. The embodiments of the present disclosure may extract list content from a target web page, and visualize at least a portion of the extracted list content to form a snippet in the form of a list, i.e., a list snippet. Compared to a text snippet, the list snippet may contain richer content about an original list in the target web page, so that a user may learn more comprehensive information about the original list from the list snippet. Since the list snippet itself is a structured list, the embodiments of the present disclosure may present information about the original list to the user in a more friendly and intuitive way.
The embodiments of the present disclosure may present, in the list snippet, key or representative information of items in the original list, and thus may comprehensively and concisely provide information that the user may desire. When the list snippet is presented in a search result page, the user may conveniently and comprehensively learn about content of the corresponding target web page without clicking on a web page link.
Original lists in target web pages processed by the embodiments of the present disclosure are not limited to those lists with html list tags, but may cover any visually perceptible lists. An "original list" involved in the embodiments of the present disclosure is a visually perceptible list. A visually perceptible list may refer to, e.g., that the list contains multiple items with visually similar structures. Herein, an "item" may refer to a component that constitutes a list, which may also be referred to as an object, entity, data record, etc. A visually perceptible list may or may not have an html list tag, and thus the visually perceptible list may have any html tags without limitations. The embodiments of the present disclosure are proposed at least for target web pages that include visually perceptible lists, and may process these target web pages at least from a visually perceptible perspective, rather than simply process these target web pages by using html list tags. Accordingly, the embodiments of the present disclosure may be applied to any target web page that contains a visual list.
In an aspect, the embodiments of the present disclosure may identify an original list that may be included in a target web page, through at least detecting an anchor element group in the target web page. Anchor elements in the anchor element group do not necessarily have an html list tag. Since anchor elements may have representative information of items in the original list, detection of an anchor element group facilitates discovering the original list in the target web page.
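As a rough illustration of how identified anchor elements might be grouped by their property sets (html tag, CSS class and XPath information), the following minimal Python sketch groups elements whose properties match exactly. The dictionary layout, the function names, the index-stripping of XPaths and the minimum group size are assumptions made for this example only; exact-match grouping is used here as a simple stand-in for whatever clustering an embodiment actually applies.

```python
import re
from collections import defaultdict


def property_key(element):
    """Build a hashable property set: tag, CSS class and index-free XPath."""
    # Assumption: dropping positional indices such as [2] makes sibling items comparable.
    generic_xpath = re.sub(r"\[\d+\]", "", element["xpath"])
    return (element["tag"], element["css_class"], generic_xpath)


def cluster_anchor_elements(elements, min_group_size=2):
    """Group elements with identical property sets; keep groups that repeat."""
    groups = defaultdict(list)
    for element in elements:
        groups[property_key(element)].append(element)
    return [group for group in groups.values() if len(group) >= min_group_size]


# Hypothetical elements that already satisfy an anchor element constraint (image tag).
elements = [
    {"tag": "img", "css_class": "product-pic", "xpath": "/html/body/div[2]/div[1]/img"},
    {"tag": "img", "css_class": "product-pic", "xpath": "/html/body/div[2]/div[2]/img"},
    {"tag": "img", "css_class": "product-pic", "xpath": "/html/body/div[2]/div[3]/img"},
    {"tag": "img", "css_class": "logo", "xpath": "/html/body/div[1]/img"},
]
anchor_groups = cluster_anchor_elements(elements)
```

In this example the three product pictures fall into one anchor element group, while the single logo image is discarded because it does not repeat.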
In an aspect, the embodiments of the present disclosure may perform boundary detection to multiple anchor elements in the anchor element group, so as to determine boundaries of multiple items in the original list corresponding to the anchor element group in the target web page. Herein, determining boundary of an item may refer to determining which specific elements are included in the item, and accordingly these elements form the item together. The boundary detection may include iterative boundary expansion. For each anchor element, elements that may be within the same item as the anchor element may be found through iterative boundary expansion, and thus the anchor element and the found elements define the boundary of the item. The boundary detection may also include similarity check. The similarity check may be performed for determining whether multiple items determined by expansion from different anchor elements are indeed items in the same original list, e.g., whether these items indeed form an original list. At least through the boundary detection according to the embodiments of the present disclosure, an original list in a target web page and each item in the original list may be accurately identified.
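The iterative boundary expansion can be pictured with a small Python sketch, given here only as one possible reading of the description. It assumes a toy `Node` class in place of a real DOM, expands every item tree in lockstep (a missing sibling first, the parent once all siblings are included), and stops when a step would make two item trees share a node, discarding the nodes added in that step. The similarity check and the reverse-direction expansion described elsewhere in this disclosure are deliberately left out.

```python
class Node:
    """Minimal stand-in for a DOM node (hypothetical, not a real DOM API)."""

    def __init__(self, name, children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def subtree(node):
    """Return the node together with all of its descendants."""
    nodes = [node]
    for child in node.children:
        nodes.extend(subtree(child))
    return nodes


def next_expansion(current, included):
    """Choose the next node: a missing sibling first, otherwise the parent."""
    parent = current.parent
    if parent is None:
        return None, current
    for sibling in parent.children:
        if sibling not in included:
            return sibling, current  # sibling expansion, current node unchanged
    return parent, parent  # parent expansion, the parent becomes current


def expand_boundaries(anchors):
    """Grow one item tree per anchor in lockstep until two trees would overlap."""
    items = [set(subtree(anchor)) for anchor in anchors]
    current = list(anchors)
    while True:
        candidates, moved = [], []
        for item, node in zip(items, current):
            target, new_current = next_expansion(node, item)
            if target is None:  # reached the root of the tree
                return items
            candidates.append(item | set(subtree(target)))
            moved.append(new_current)
        overlap = any(
            candidates[i] & candidates[j]
            for i in range(len(candidates))
            for j in range(i + 1, len(candidates))
        )
        if overlap:
            # Discard the nodes added in this step and stop expanding.
            return items
        items, current = candidates, moved


def make_item(i):
    """Build a hypothetical list item with a picture, a title and a price."""
    return Node(f"li{i}", [Node(f"img{i}"), Node(f"title{i}"), Node(f"price{i}")])


list_items = [make_item(i) for i in range(3)]
root = Node("ul", list_items)
anchors = [item.children[0] for item in list_items]  # the pictures act as anchors
boundaries = expand_boundaries(anchors)
names = [sorted(node.name for node in item) for item in boundaries]
```

Starting from the three pictures, each item tree first absorbs its title and price, then its enclosing `li` node. The next step would pull a neighbouring `li` into each tree, so the expansion stops there and each boundary ends at its own `li`.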
In an aspect, if a target web page includes two or more original lists, the embodiments of the present disclosure may determine a dominant list from these original lists. Herein, a dominant list may refer to a list that, e.g., occupies a main position in a web page, presents main content, etc. Preferably, through determining a dominant list and performing subsequent processing only to the dominant list, the embodiments of the present disclosure may include only information about the dominant list in a finally generated structured list, thereby avoiding interference caused by information about lists that are not the dominant list.
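One simple way to picture dominant list determination is to count how many heuristic rules each candidate list satisfies and keep the list with the highest count. The Python sketch below does this for three visual features named in this disclosure (minimum boundary distance between adjacent items, list position and item content richness). The `ListFeatures` fields, the thresholds and the equal weighting of rules are assumptions for illustration, not values taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class ListFeatures:
    """Visual features of one original list (field names are hypothetical)."""
    name: str
    min_boundary_distance: float  # distance between adjacent items, in pixels
    in_main_position: bool        # whether the list occupies a main position
    avg_nodes_per_item: float     # a simple proxy for item content richness


def heuristic_score(features, max_distance=40.0, min_nodes=4.0):
    """Count how many heuristic rules the list satisfies (thresholds are assumed)."""
    rules = [
        features.min_boundary_distance <= max_distance,  # items lie close together
        features.in_main_position,                       # list is in a main position
        features.avg_nodes_per_item >= min_nodes,        # items are content-rich
    ]
    return sum(rules)


def pick_dominant(candidates):
    """Select the list that satisfies the most heuristic rules."""
    return max(candidates, key=heuristic_score)


product_list = ListFeatures("product list", 12.0, True, 6.0)
sidebar_list = ListFeatures("sidebar list", 8.0, False, 2.0)
dominant = pick_dominant([product_list, sidebar_list])
```

Here the product list satisfies all three rules while the sidebar list satisfies only the distance rule, so the product list is selected. A real system could weight the rules differently or learn them from data.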
In an aspect, the embodiments of the present disclosure may obtain, from a target web page, multiple groups of representative metadata for different items in an original list. For example, multiple groups of representative metadata for multiple items in an original list may be obtained from the target web page by using at least boundaries of these items. In some implementations, the multiple groups of representative metadata may be important and representative metadata selected from initial metadata in the target web page through ranking.
In an aspect, the embodiments of the present disclosure may visualize the obtained multiple groups of representative metadata to form a structured list. The structured list may be taken as, e.g., a list snippet of the target web page.
The embodiments of the present disclosure may be applied in various application scenarios. For example, in a search service, the embodiments of the present disclosure may generate a structured list for a target web page, such that, e.g., a list snippet is built for the target web page. Accordingly, the search service may present, in a search result page, the structured list generated according to the embodiments of the present disclosure as a list snippet. It should be understood that the embodiments of the present disclosure are not limited to be applied to search services, but may also be applied to any application scenarios that need to perform list extraction and visualization to target web pages.
Target web pages processed by the embodiments of the present disclosure may be various list web pages from various websites, online services, etc. FIG.1 illustrates exemplary list web pages. List web page 12 is an exemplary article on the network, which may be located in, e.g., an academic website, an online question answering community, etc. This article introduces ten major festivals in China, such as "Spring Festival", "Mid-Autumn Festival", "Dragon Boat Festival", etc. Those parts involving the introduced festivals in the article form a visually perceptible list 122. For example, the part involving "Spring Festival", the part involving "Mid-Autumn Festival", the part involving "Dragon Boat Festival", etc. form multiple items in the list 122 respectively.
List web page 14 is a web page from, e.g., a book sales website, a reading communication website, etc. Assuming that multiple options have been selected in an "Options" column on the left side of the web page 14, introductory information of four recommended books that match the selected options is presented on the right side of the web page 14. Taking the first book as an example, introductory information of this book may include, e.g., a cover photo 144, a text introduction 146, etc. The introductory information of the four books forms a visually perceptible list 142. For example, introductory information of each book forms an item in the list 142.
List web page 16 is a web page for an exemplary topic "Restaurant X" from a review forum, which includes a discussion thread for "Restaurant X" by multiple users. For example, the web page 16 includes multiple display regions for users Tom, David, Jane, etc. respectively. Taking the user Tom as an example, display region for Tom includes, e.g., Tom's avatar, Tom's name, post time of Tom's comment, specific content of Tom's comment, etc. The display region for Tom, the display region for David, the display region for Jane, etc. form a visually perceptible list 162, and these display regions form multiple items in the list 162 respectively.
FIG.2 illustrates an exemplary list web page 20. The web page 20 may come from, e.g., an online shopping website, etc. An online shopping website typically generates or provides a large number of web pages that include lists, e.g., best-selling web pages, most popular product web pages, product category web pages, web pages containing products searched by users, etc. The web page 20 may be a web page for presenting, e.g., cellphones that meet certain conditions. Assuming that multiple options have been selected in an “Options” column on the left side of the web page 20, introductory information of multiple cellphones matching the selected options is presented on the right side of the web page 20. For example, introductory information of a cellphone "M cellphone A4" is presented in a region 22, which includes, e.g., a picture 222 of the cellphone, a brief introduction of the cellphone "M cellphone A4, 6.5 inches, 256G, black", a 5-star rating for the cellphone, the number of reviews about the cellphone "25900 reviews", a price of the cellphone "5500 RMB", etc. Similarly, introductory information of a cellphone "M cellphone A3" including at least a picture 242 of the cellphone is presented in a region 24, introductory information of a cellphone "M cellphone A2" including at least a picture 262 of the cellphone is presented in a region 26, etc. The introductory information of these cellphones forms a visually perceptible list 202, wherein introductory information of each cellphone forms an item in the list 202. Moreover, the web page 20 also presents recommendations about related products in a region 28, e.g., introductory information of a first related product which includes at least a picture 282 of the product, introductory information of a second related product which includes at least a picture 284 of the product, introductory information of a third related product which includes at least a picture 286 of the product, etc. 
The introductory information of these related products forms a visually perceptible list 204, wherein introductory information of each related product forms an item in the list 204.
It should be understood that the embodiments of the present disclosure are not limited to the exemplary list web pages shown in FIG.1 and FIG.2, but may cover various other types of list web pages from various other websites, online services, etc., e.g., list web pages from forum websites regarding topics in various fields, list web pages from product review websites, list web pages from news websites, list web pages from hotel or flight booking websites, etc.
FIG.3 illustrates an existing exemplary search result page 300. The search result page 300 may be presented to a user in a search service provided by a certain general search engine provider. It is assumed that the user has input a query "M cellphone" in a search box 310 to indicate that the user wants to obtain web search results regarding the M cellphone. A search result region 320 in the search result page 300 includes multiple web search results. For example, a search result for the web page 20 in FIG.2 is shown in a region 330. As shown in the region 330, the search result for the web page 20 includes a text snippet "M cellphone A4, 6.5 inches, 256G, black, 5 stars, 25900 reviews, 5500 RMB". The text snippet is generated only with the introductory information about the "M cellphone A4" in the web page 20. Based on the text snippet in the region 330, the user can only learn limited information about the web page 20, e.g., can only learn information about the cellphone "M cellphone A4", but cannot learn any information about other cellphones in the list 202 in the web page 20. Moreover, such text snippet also lacks intuitiveness and legibility.
FIG.4 illustrates an exemplary process 400 of list extraction and visualization in web pages according to an embodiment. The process 400 may be performed for achieving list extraction and visualization for an original list in a target web page 402, so as to generate a structured list 404. The target web page 402 may be a list web page containing a list, and herein a list in the target web page 402 may be referred to as an original list. If the target web page 402 includes two or more lists, the process 400 may generate the structured list 404 for a dominant list in the target web page 402.
At 410, at least one anchor element group in the target web page 402 may be detected. Each anchor element group may include one or more anchor elements, and each anchor element group may correspond to a possible original list. For example, if the target web page 402 includes two or more original lists, two or more anchor element groups respectively corresponding to those original lists may be detected at 410. In an implementation, multiple anchor elements in the target web page 402 may be identified first, and then the multiple anchor elements may be clustered into at least one anchor element group.
At 420, for each anchor element group, boundary detection may be performed to multiple anchor elements in the anchor element group to obtain boundaries of multiple items respectively associated with the multiple anchor elements. These items may form an original list, in the target web page 402, corresponding to the anchor element group. The boundary detection may include, e.g., iterative boundary expansion, similarity check, etc., in order to accurately identify at least one original list in the target web page 402 and respective items in each original list.
At 430, optionally, if it is determined, through previous steps, that the target web page 402 includes two or more original lists, a dominant list may be determined from these original lists. In an implementation, the dominant list may be determined at least with visual features of these original lists.
At 440, for an original list in the target web page 402, multiple groups of representative metadata respectively corresponding to multiple items in the original list may be obtained from the target web page 402 with at least boundaries of the multiple items. Optionally, the representative metadata obtaining at 440 may be performed for a dominant list in the target web page 402. Herein, representative metadata may refer to data contained in an original list and to be presented in the structured list 404, e.g., image, text, etc. In an implementation, for each item, a group of initial metadata may be obtained from the target web page 402 first, and then a group of representative metadata, corresponding to the item, to be presented in the structured list 404 is selected from the group of initial metadata.
At 450, the multiple groups of representative metadata obtained at 440 may be visualized into the structured list 404. In an implementation, the structured list 404 may be formed with the multiple groups of representative metadata according to a predetermined format or layout. The structured list 404 is a simplified version of the original list in the target web page 402, but it still contains enough information to enable a user to intuitively and comprehensively understand main content of the original list. The structured list 404 may be considered as, e.g., a list snippet of the original list.
It should be understood that all the steps and the sequence thereof in the process 400 are exemplary, and the embodiments of the present disclosure would also cover any changes to the process 400. For example, although the process 400 includes the step of dominant list determination at 430, this step may be omitted in the case that the target web page 402 includes only one original list. For example, although the step 430 is shown in FIG.4 as being performed before the step 440, the step 430 may also be performed after the step 440. In this case, multiple groups of representative metadata of each original list may be obtained first through the step 440, and then, after a dominant list is determined through the step 430, only multiple groups of representative metadata of the dominant list may be provided to the step 450.
FIG.5 illustrates an exemplary process 500 of anchor element group detection according to an embodiment. The process 500 is an exemplary implementation of the step 410 in FIG.4.
At 510, multiple anchor elements may be identified from a target web page 502. For example, these anchor elements may be identified from a html source file of the target web page 502. Each item in the original list may include multiple html elements, and an anchor element may be a html element among these html elements which is the most representative and the most helpful for identifying the entire item. The anchor elements identified at 510 may also be referred to as identified anchor elements. In an implementation, anchor element constraints may be pre-defined, and multiple html elements in the target web page 502 that meet the anchor element constraints may be identified as multiple identified anchor elements. For example, the anchor element constraints may include at least one of: a html element having an image tag; a html element having a title tag; a html element representing a date; etc. In one case, each item in the original list may have a corresponding image, and thus a html element in the html source file that has an image tag, e.g., <img> tag, etc., may be taken as an anchor element to help identify a corresponding item. In one case, each item in the original list may have a corresponding title, and thus a html element in the html source file that has a title tag, e.g., <h1> tag, <h2> tag, etc., may be taken as an anchor element to help identify a corresponding item. In one case, each item in the original list may have a date, e.g., a post date, etc., and thus a html element in the html source file that has a character string representing a date may be taken as an anchor element to help identify a corresponding item. In this case, a character string representing a date in the html source file may be identified by various techniques such as regular expression matching.
It should be understood that the embodiments of the present disclosure are not limited to the exemplary anchor element constraints above, but may cover any other types of anchor element constraints.
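As a non-limiting illustration, the anchor element identification at 510 may be sketched as follows. The sketch uses the Python standard library `html.parser`; the class name `AnchorFinder`, the function name `find_anchor_elements`, the `YYYY-MM-DD` date pattern and the returned `(kind, detail)` pairs are assumptions made for illustration only and are not part of the disclosure. Dates are detected in text content, so a practical implementation would additionally record which html element encloses the matched text.

```python
import re
from html.parser import HTMLParser

# ASSUMPTION: dates look like 2021-10-05; other formats need other patterns.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
TITLE_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class AnchorFinder(HTMLParser):
    """Collects (kind, detail) pairs for content meeting an anchor constraint."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":  # constraint: element having an image tag
            self.anchors.append(("image", dict(attrs).get("src")))
        elif tag in TITLE_TAGS:  # constraint: element having a title tag
            self.anchors.append(("title", tag))

    def handle_data(self, data):
        # constraint: character string representing a date
        for match in DATE_RE.finditer(data):
            self.anchors.append(("date", match.group()))


def find_anchor_elements(html_source):
    """Return candidate anchors in document order."""
    finder = AnchorFinder()
    finder.feed(html_source)
    finder.close()
    return finder.anchors
```

For example, `find_anchor_elements('<li><img src="a.png"><h2>Spring Festival</h2></li>')` yields one image anchor followed by one title anchor.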
At 520, a property set of each identified anchor element in the multiple identified anchor elements identified at 510 may be extracted from the target web page 502. A property set of an identified anchor element may include one or more intrinsic attributes of the identified anchor element, e.g., html tag attribute of the identified anchor element, Cascading Style Sheets (CSS) class, XML Path Language (XPath) information, etc. The html tag attribute may indicate the type of html tag of the identified anchor element. The CSS class may indicate which CSS classes the identified anchor element has. The XPath information may indicate location information, node information, etc. of the identified anchor element, which may be obtained, e.g., from a Document Object Model (DOM) tree corresponding to the html source file. It should be understood that the embodiments of the present disclosure are not limited to the above exemplary attributes of an identified anchor element, but may cover any other types of attributes. Through the step 520, multiple property sets respectively corresponding to multiple identified anchor elements may be obtained.
At 530, the multiple identified anchor elements may be clustered into at least one anchor element group 504 based on the multiple property sets of these identified anchor elements. Each identified anchor element may be characterized by a corresponding property set, and the multiple property sets of the multiple identified anchor elements may be provided to a pre-trained clustering model as inputs. The clustering model is trained for clustering the multiple anchor elements into at least one anchor element group based on the property sets. For example, those identified anchor elements with similar attributes will be clustered into the same anchor element group. Each anchor element group includes multiple anchor elements having similar attributes, and may correspond to a possible original list, wherein the anchor elements may be respectively associated with different items in the potential original list.
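As a non-limiting illustration, the grouping of identified anchor elements by property set may be sketched as follows. The disclosure uses a pre-trained clustering model; the sketch deliberately substitutes a much simpler stand-in that groups anchor elements whose html tag, CSS classes and index-free XPath pattern match exactly. The dictionary keys `tag`, `css_classes` and `xpath` and the function names are hypothetical.

```python
import re
from collections import defaultdict


def property_key(anchor):
    """Build a hashable key from an anchor element's property set."""
    # Drop positional indices such as [2] so that anchors in sibling
    # items share a single XPath pattern.
    xpath_pattern = re.sub(r"\[\d+\]", "", anchor["xpath"])
    return (anchor["tag"], frozenset(anchor["css_classes"]), xpath_pattern)


def group_anchors(anchors):
    """Group anchors with identical keys (a stand-in for the clustering model)."""
    groups = defaultdict(list)
    for anchor in anchors:
        groups[property_key(anchor)].append(anchor)
    return list(groups.values())
```

Exact matching is stricter than clustering by similarity, so this stand-in may split anchor elements that a trained model would place in the same anchor element group.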
It should be understood that all the steps in the process 500 are exemplary, and the embodiments of the present disclosure would also cover any changes to the process 500. For example, the process 500 may adopt any combination of various anchor element constraints, property sets containing any combination of various attributes, etc.
FIG.6 illustrates an exemplary anchor element group according to an embodiment. In FIG.6, it is assumed that a first anchor element group and a second anchor element group are detected for the web page 20 in FIG.2. The first anchor element group may include multiple anchor elements, and these anchor elements correspond to the image 222, the image 242 and the image 262 in the original list 202 in FIG.2, wherein the image 222, the image 242 and the image 262 may be clustered into the first anchor element group due to having similar attributes. The second anchor element group may include multiple anchor elements, and these anchor elements correspond to the image 282, the image 284 and the image 286 in the original list 204 in FIG.2, wherein the image 282, the image 284 and the image 286 may be clustered into the second anchor element group due to having similar attributes.
It should be understood that although FIG.6 shows anchor element groups detected according to an anchor element constraint of "a html element having an image tag", the embodiments of the present disclosure may also detect anchor element groups based on other types of anchor element constraints. For example, for the web page 12 in FIG.1, an anchor element group formed by the title "Spring Festival", the title "Mid-Autumn Festival", the title "Dragon Boat Festival", etc., may be detected according to an anchor element constraint of "a html element having a title tag". For example, for the web page 16 in FIG.1, an anchor element group formed by the date "2021-10-05" in Tom's display region, the date "2021-10-05" in David's display region, the date "2021-10-06" in Jane's display region, etc., may be detected according to an anchor element constraint of "a html element representing a date".
FIG.7 illustrates an exemplary process 700 of boundary detection according to an embodiment. The process 700 is an exemplary implementation of the step 420 in FIG.4. The process 700 may be used for performing boundary detection to an exemplary anchor element group 702, so as to obtain boundaries of multiple items respectively associated with multiple anchor elements in the anchor element group 702, thereby identifying, in a target web page, an original list 704 corresponding to the anchor element group 702 and respective items in the original list 704. The process 700 may be performed based at least on a DOM tree corresponding to the target web page. At 710, iterative boundary expansion may be performed to each anchor element in the anchor element group 702, so as to find elements that may be within the same item as the anchor element. In an implementation, based on a DOM tree corresponding to the target web page, iterative boundary expansion may be synchronously performed by taking the multiple anchor elements in the anchor element group 702 as starting points respectively. Each anchor element may act as a starting point, and through the iterative boundary expansion, it is possible to sequentially determine and expand to multiple other elements in the DOM tree, starting from this anchor element. This anchor element together with the determined other elements form a tree, and this tree represents an item and thus may also be referred to as an item tree. Multiple nodes in the item tree may respectively correspond to multiple elements, e.g., the anchor element and the elements determined through the iterative boundary expansion. Each step of iteration may expand to a next node, and the next node may be included into the item tree. Multiple steps of iteration form a corresponding expansion path. 
Through the iterative boundary expansion at 710, multiple item trees respectively originating from the multiple anchor elements in the anchor element group 702 may be obtained. The multiple item trees respectively define boundaries of multiple items.
The iterative boundary expansion may include various types of expansion, e.g., sibling node expansion, parent node expansion, etc. The sibling node expansion may be performed for expanding from the current node to a sibling node of the current node in a DOM tree corresponding to a target web page. In one case, if the current node has multiple sibling nodes belonging to the same parent node, it is possible to sequentially expand to the multiple sibling nodes from near to far, starting from the current node. In one case, the sibling node expansion may adopt a predetermined expansion direction, e.g., expanding to the right, expanding to the left, expanding to the right and the left alternately, expanding to the left after a number of expansions to the right or after meeting a predetermined condition, expanding to the right after a number of expansions to the left or after meeting a predetermined condition, etc. The parent node expansion may be performed for expanding to a parent node of the current node after all sibling nodes of the current node have been included in the same item tree, and including the parent node into the item tree. After expanding to the parent node, sibling node expansion may be further performed to the parent node, e.g., expanding to sibling nodes of the parent node. In this way, iterative expansions to upper-level nodes may be achieved. Moreover, if a certain node is included into the item tree through the iterative boundary expansion and this node has its own lower-level nodes, e.g., child nodes, grandchild nodes, etc., then all the lower-level nodes of this node may be further included into the item tree.
The iterative boundary expansion may be performed synchronously among different item trees corresponding to different anchor elements. For example, in each step of iteration, sibling node expansion or parent node expansion is performed once in these item trees synchronously. In one case, for example, in a certain step of sibling node expansion, if a certain item tree S currently has no sibling node that can be expanded to, and other item trees have sibling nodes that can be expanded to, then expansion of the item tree S at the current step may be suspended once while sibling node expansion at the current step is performed to other item trees.
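As a non-limiting illustration, a single step of iteration for one item tree may be sketched as follows, combining sibling node expansion from near to far with parent node expansion once all sibling nodes are included. The `Node` class, the function names and the tie-breaking rule (the left sibling is taken first when two siblings are equally near) are assumptions made for illustration; node overlap between item trees is not checked here.

```python
class Node:
    """Minimal stand-in for a DOM node (hypothetical, for illustration)."""

    def __init__(self, name, children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def subtree(node):
    """Return the node together with all of its lower-level nodes."""
    found = [node]
    for child in node.children:
        found.extend(subtree(child))
    return found


def expand_once(top, members):
    """Perform one step of iteration for one item tree.

    `top` is the current node and `members` is the set of nodes already
    in the item tree. Returns (new_top, newly_reached_nodes).
    """
    parent = top.parent
    if parent is None:
        return top, []  # nothing left to expand to
    siblings = parent.children
    index = siblings.index(top)
    # Sibling nodes not yet included, ordered from near to far.
    pending = sorted(
        (s for s in siblings if s not in members),
        key=lambda s: abs(siblings.index(s) - index),
    )
    if pending:
        # Sibling node expansion: include the sibling and its lower-level nodes.
        return top, subtree(pending[0])
    # Parent node expansion: all sibling nodes are already included.
    return parent, [parent]
```

The caller is expected to add the returned nodes to `members` only after the overlap determination for the current step has passed.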
According to the embodiments of the present disclosure, the boundary of each item may be expanded as far as possible, e.g., enabling each item to include as many elements as possible through the iterative boundary expansion. However, there should be no content overlap between different items, e.g., the same element or content should not be included in different items. Furthermore, structures of different items should be similar, e.g., different items should have at least a predetermined proportion of similar elements or nodes, etc.
Content overlap between two items may be caused by the fact that two item trees corresponding to the two items have node overlap, e.g., a certain node is shared by the two item trees. Accordingly, content overlap between different items may be avoided through detecting node overlap during the iterative boundary expansion at 710.
At 720, it may be determined whether the current step of iteration in the iterative boundary expansion results in that node overlap occurs between at least two item trees. For example, the current step of iteration results in including the same one or more nodes into at least two item trees simultaneously, or not. The node overlap determination at 720 may be performed synchronously with the iterative boundary expansion at 710, e.g., determining whether there is node overlap after each step of iteration.
If it is determined at 720 that the current step of iteration does not result in node overlap, the process 700 may return to 710 and continue performing the iterative boundary expansion.
If it is determined at 720 that the current step of iteration results in node overlap, then at 730, the iterative boundary expansion is stopped, and nodes that are determined through the current step of iteration are excluded from each item tree. For example, each item tree is caused to go back or reset to a state at the previous step of iteration before the current step of iteration.
Through performing the step 720 and the step 730, the embodiments of the present disclosure may avoid node overlap occurring among the obtained multiple item trees, thereby avoiding content overlap occurring among different items.
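As a non-limiting illustration, the node overlap determination at 720 and the roll-back at 730 may be sketched as follows. Item trees are modeled as sets of node identifiers, and `proposals` holds the nodes that the current step of iteration would add to each item tree; both representations and the function name `apply_step` are assumptions made for illustration.

```python
def apply_step(item_trees, proposals):
    """Apply one synchronous step of iteration, or roll it back on overlap.

    item_trees: list of sets of node ids (state before the current step).
    proposals:  list of sets of node ids reached in the current step.
    Returns (item_trees_after, stopped).
    """
    candidate = [tree | new for tree, new in zip(item_trees, proposals)]
    seen = set()
    for tree in candidate:
        if seen & tree:
            # Node overlap: stop and keep the state of the previous step.
            return item_trees, True
        seen |= tree
    return candidate, False
```

Because the candidate state is built separately from `item_trees`, excluding the nodes of the current step amounts to returning the untouched previous state.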
In order to determine whether structures of different items are similar, the process 700 may perform similarity check to the multiple item trees. In one case, the similarity check may be performed in response to determining that the number of nodes in at least one item tree in the multiple item trees exceeds a node number threshold. In an aspect, for example, the at least one item tree may be a predetermined number or a predetermined proportion of item trees among the multiple item trees, and thus the performing of the similarity check may require that: the number of nodes in each item tree of a predetermined number or a predetermined proportion of item trees among the multiple item trees exceeds a node number threshold. In another aspect, for example, the performing of the similarity check may require that: the iterative boundary expansion at 710 has performed a predetermined number of steps of iteration, i.e., each item has contained a predetermined number of elements or each item tree has contained a predetermined number of nodes. In one case, the similarity check may be performed synchronously with the iterative boundary expansion at 710, e.g., the similarity check would be performed after each step of iteration. In one case, the similarity check would be performed whenever a predetermined number of steps of iteration are performed, e.g., whenever a predetermined number of elements are newly added into each item or whenever a predetermined number of nodes are newly added into each item tree. The embodiments of the present disclosure are not limited to the above exemplary opportunities of performing the similarity check.
At 740, a tree similarity between any two item trees in the multiple item trees may be calculated. The embodiments of the present disclosure are not limited to any specific technique for calculating a tree similarity. Preferably, the embodiments of the present disclosure propose a tree similarity calculation method obtained through improving an existing simple tree matching algorithm, and the proposed tree similarity calculation method calculates a tree similarity at least with a CSS similarity-based weight and/or a minimum depth layer level.
In an implementation, the embodiments of the present disclosure calculate a tree similarity at least with a matching weight calculated based on a CSS similarity between root nodes of two item trees. In a web page, a style presented by CSS is important information for dictating page layout. Therefore, through calculating a matching weight based on a CSS similarity between root nodes of two item trees and utilizing the matching weight for calculating a tree similarity between the two item trees, accuracy of tree similarity calculation may be effectively improved. The matching weight may be calculated based on, e.g., respective CSS classes of the two root nodes. In an implementation, the embodiments of the present disclosure may calculate a tree similarity with nodes within a minimum depth layer level in two item trees. Herein, a minimum depth layer level may be defined such that: the number of visible nodes within the minimum depth layer level in an item tree reaches a predetermined proportion, e.g., 80% or any other proportion, of the number of all visible nodes in the item tree. In another aspect, the minimum depth layer level may also be defined such that: the number of visible nodes within layer levels, that are less than the minimum depth layer level, in an item tree does not reach a predetermined proportion of the number of all visible nodes in the item tree. Herein, a visible node may refer to a visually visible node in a web page, e.g., a node presenting an image, a node presenting text, etc., therefore, compared to other nodes, a visible node is more important for determining a structure similarity between item trees. An item tree may have multiple layer levels, e.g., assuming that a root node of the item tree is located at a layer level with a depth of 0, child nodes of the root node are located at a layer level with a depth of 1, and so on. Layer levels with larger depth contribute less in determining a structure similarity between two trees. 
Therefore, the embodiments of the present disclosure propose to calculate a tree similarity with only a part of layer levels rather than all layer levels of item trees, and thus may effectively improve calculation efficiency and save calculation resources. A tree similarity may be calculated with a minimum depth layer level and those layer levels that are less than the minimum depth layer level. For example, assuming that a minimum depth layer level is 3, a tree similarity may be calculated with layer levels with depths of 0, 1, 2, and 3. Since a minimum depth layer level is determined by at least considering the number of visible nodes, e.g., the number of visible nodes within the minimum depth layer level should not be lower than a predetermined proportion of the number of all visible nodes in the item tree, the predetermined proportion that is appropriately set will ensure that an accurate tree similarity can still be calculated even if those layer levels greater than the minimum depth layer level are not considered in the calculation of tree similarity. The predetermined proportion may have any value preset according to actual application requirements. It should be understood that although the above discussion relates to calculating a tree similarity with nodes within a minimum depth layer level in two item trees, the embodiments of the present disclosure are not limited to this, and may alternatively calculate a tree similarity with nodes within all layer levels in two item trees.
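As a non-limiting illustration, determining a minimum depth layer level from per-depth counts of visible nodes may be sketched as follows. The input format (a non-empty list whose element at index d is the number of visible nodes at depth d) and the function name are assumptions made for illustration; the default proportion of 80% follows the example given above.

```python
def min_depth_layer_level(visible_per_depth, proportion=0.8):
    """Return the smallest depth whose cumulative visible-node count
    reaches `proportion` of all visible nodes in the item tree.

    visible_per_depth[d] is the number of visible nodes at depth d
    (depth 0 is the root node); the list is assumed to be non-empty.
    """
    needed = proportion * sum(visible_per_depth)
    running = 0
    for depth, count in enumerate(visible_per_depth):
        running += count
        if running >= needed:
            return depth
    # Reached only through floating-point rounding: use the deepest layer.
    return len(visible_per_depth) - 1
```

With counts `[0, 1, 3, 4, 2]`, the cumulative counts are 0, 1, 4, 8 and 10, so the 80% proportion is first reached at depth 3, and the tree similarity would be calculated with depths 0 to 3.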
It is assumed that T and T' are two item trees. Root(T) represents a root node of the tree T, and Root(T') represents a root node of the tree T'. It should be understood that if T and T' do not have actual root nodes, virtual root nodes may be set for T and T' respectively, and these two virtual root nodes may have the same attribute configuration. For each of T and T', $L_0, L_1, \ldots, L_n$ respectively represent subtree sets at layer level depths 0, 1, ..., n. $L_{i1}, L_{i2}, \ldots, L_{ik}$ respectively represent k subtrees in the layer level depth i, i.e., subtrees in a subtree set $L_i$. It is assumed that $css_1$ represents the set of CSS classes that Root(T) has, wherein Root(T) may have 0, 1, or any other number of CSS classes. It is assumed that $css_2$ represents the set of CSS classes that Root(T') has, wherein Root(T') may have 0, 1, or any other number of CSS classes. In an implementation, a matching weight between T and T' may be calculated by Jaccard-coefficient. For example, the matching weight between T and T' may be calculated as:

$$
\mathrm{MatchWeight}(\mathrm{Root}(T), \mathrm{Root}(T')) =
\begin{cases}
\dfrac{|css_1 \cap css_2|}{|css_1| + |css_2| - |css_1 \cap css_2|}, & css_1 \neq css_2 \\[2ex]
1, & css_1 = css_2
\end{cases}
\qquad \text{Equation (1)}
$$

wherein MatchWeight(·) is a function for calculating the matching weight, $|css_1|$ represents the number of CSS classes contained in $css_1$, $|css_2|$ represents the number of CSS classes contained in $css_2$, and $|css_1 \cap css_2|$ represents the number of CSS classes contained in both $css_1$ and $css_2$. The tree similarity between T and T' may be calculated based on, e.g., the procedure in Table 1 below.
Table 1: Procedure for calculating the tree similarity between T and T' (steps 1.1 to 1.19)
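As a non-limiting illustration, the matching weight of Equation (1) may be sketched as follows. The function name `match_weight` and the use of Python sets for the CSS class sets are assumptions made for illustration.

```python
def match_weight(css1, css2):
    """Jaccard-style matching weight between two sets of CSS classes."""
    css1, css2 = set(css1), set(css2)
    if css1 == css2:
        # Covers the case in which both root nodes have no CSS class,
        # where the Jaccard ratio would otherwise be 0/0.
        return 1.0
    shared = len(css1 & css2)
    return shared / (len(css1) + len(css2) - shared)
```

For root nodes with classes {"item", "card"} and {"item"}, one class is shared and the union holds two classes, giving a matching weight of 0.5.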
At step 1.1, a minimum depth layer level MinDepth may be determined. At step 1.2, a similarity check function SimilarityCheck(») for calculating a similarity metric between T and T' at the current layer level “layer” is defined, and this function may include the subsequent processing in step 1.3 to step 1.18. At step 1.3 and step 1.4, if it is determined that Root(T) and Root(T') have different html tags, a calculation result by SimilarityCheck(») is 0. At step 1.5 and step 1.6, if it is determined that the current layer level “layer” is greater than the minimum depth layer level MinDepth, a calculation result by SimilarityCheck(») is 0, thereby tree similarity calculation may be avoided in the case that the current layer level is greater than the minimum depth layer level. At step 1.8, subtrees in a subtree setLiayer+i corresponding to a depth “layer+1” in T is represented by m. At step 1.9, subtrees in a subtree set L'iayer+i corresponding to a depth “layer+1” in T' is represented by n. In the procedure of Table 1, a similarity function M[i, j] is defined, which represents the maximum similarity between the first i subtrees in T and the first j subtrees in T'. At step 1.10 and step 1.11, M[i, 0] and M[0, j] are initialized to 0, respectively. At step 1.12, it is defined that subtrees in the subtree set m of T will be traversed. At step 1.13, it is defined that subtrees in the subtree set n of T' will be traversed. At step 1.14 and step 1.15, similarity may be calculated with the similarity function M[i, j]. For example, techniques such as dynamic programming may be adopted for performing the calculations at step 1.14 and step 1.15. M[i, j] will obtain the best similarity from three candidates including M[i, j-1], M[i-1, j] and M[i-1, j - 1]+W[i, j], W[i, j] will recursively calculate a similarity between the i-th subtree Ti in T and the j- th subtree T'j in T' at the layer level “layer+1”. 
Thus, the entire tree structure may be considered, rather than just a root node. At step 1.18, the calculation result of SimilarityCheck(·) may be returned, which is represented as MatchWeight(Root(T), Root(T')) * (M[m, n] + 1), wherein M[m, n] is the best similarity between subtrees of T and subtrees of T', and "1" represents the root node. It should be understood that the calculation result returned at step 1.18 may indicate, e.g., the number of similar nodes. At step 1.19, a final tree similarity may be calculated with all nodes having depth that is not greater than the minimum depth layer level, wherein TreeSimilarity(·) is a tree similarity function, |T| is the number of nodes in T having depth that is not greater than the minimum depth layer level, and |T'| is the number of nodes in T' having depth that is not greater than the minimum depth layer level. It should be understood that all the steps in Table 1 are exemplary, and the embodiments of the present disclosure would also cover any changes to these steps.
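The steps described for Table 1 may be sketched in Python as follows. This is a minimal illustration rather than the procedure of Table 1 itself: the names Node, depth, count_nodes, match_weight, similarity_check and tree_similarity are hypothetical, layer levels are assumed to start at 1, and two formulas that are not recoverable from the text are filled with labeled assumptions (the overlap ratio used by match_weight when the class sets differ, and the normalization by the larger node count in tree_similarity).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str                                      # html tag, e.g. "Div"
    css: frozenset = frozenset()                  # CSS classes of the node
    children: list = field(default_factory=list)  # ordered child subtrees


def depth(t):
    """Number of layers in the tree rooted at t (a single node has depth 1)."""
    return 1 + max((depth(c) for c in t.children), default=0)


def count_nodes(t, layer, min_depth):
    """Count nodes whose layer level is not greater than min_depth."""
    if layer > min_depth:
        return 0
    return 1 + sum(count_nodes(c, layer + 1, min_depth) for c in t.children)


def match_weight(a, b):
    # Only the case "1 when css1 = css2" is certain from the text.
    # ASSUMPTION: otherwise use |css1 ∩ css2| / max(|css1|, |css2|).
    if a.css == b.css:
        return 1.0
    return len(a.css & b.css) / max(len(a.css), len(b.css))


def similarity_check(t, u, layer, min_depth):
    if t.tag != u.tag:                            # steps 1.3 and 1.4
        return 0.0
    if layer > min_depth:                         # steps 1.5 and 1.6
        return 0.0
    m, n = len(t.children), len(u.children)       # steps 1.8 and 1.9
    M = [[0.0] * (n + 1) for _ in range(m + 1)]   # steps 1.10 and 1.11
    for i in range(1, m + 1):                     # step 1.12
        for j in range(1, n + 1):                 # step 1.13
            # W[i, j]: recursive similarity of the i-th and j-th subtrees
            w = similarity_check(t.children[i - 1], u.children[j - 1],
                                 layer + 1, min_depth)
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    return match_weight(t, u) * (M[m][n] + 1)     # step 1.18


def tree_similarity(t, u):
    min_depth = min(depth(t), depth(u))           # step 1.1
    matched = similarity_check(t, u, 1, min_depth)
    # ASSUMPTION: normalize by the larger of |T| and |T'| (step 1.19).
    return matched / max(count_nodes(t, 1, min_depth),
                         count_nodes(u, 1, min_depth))
```

With these assumptions, two structurally identical trees score 1.0, and trees whose roots have different html tags score 0.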
After a tree similarity between any two item trees in multiple item trees is calculated through step 740, the multiple item trees may be divided into at least one tree set at least with a similarity threshold at 745. Each of the at least one tree set may include at least one item tree, and at least one item tree in the same tree set has a tree similarity, that is not lower than the similarity threshold, among each other. Through step 745, item trees having high similarities among each other may be divided into the same tree set. At 750, it may be determined whether the number of item trees in a tree set containing the highest number of item trees in the at least one tree set is lower than a tree number threshold. The tree set containing the highest number of item trees may be taken as a target tree set for determining whether the iteration should be stopped. The tree number threshold may have a preset value, and this value may be used for, e.g., ensuring that most of the multiple item trees are included in the target tree set. If it is determined at 750 that the number of item trees in the target tree set is not lower than the tree number threshold, the process 700 may return to 710 and continue performing the iterative boundary expansion. If it is determined at 750 that the number of item trees in the target tree set is lower than the tree number threshold, then at 760, the performing of the iterative boundary expansion may be stopped, and nodes that are determined through a predetermined number of previous steps of iteration may be excluded from the multiple item trees respectively. 
In an implementation, the predetermined number of previous steps of iterations may be determined through: excluding nodes determined through a predetermined number of previous steps of iteration from the multiple item trees respectively, such that the number of item trees in a target tree set obtained, e.g., through the processing of step 740 and step 745 with respect to updated multiple item trees is not lower than the tree number threshold.
Through performing step 740, step 745, step 750 and step 760, the embodiments of the present disclosure may implement a similarity check on multiple item trees. The similarity check helps ensure that the obtained item trees have structural similarity. For example, in the case that multiple item trees in a target tree set are provided as an iterative expansion result, these item trees in the iterative expansion result will have high similarity to one another, thereby ensuring that different items have similar structures.
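The division into tree sets at 745 and the stop decision at 750 may be illustrated with the following sketch. It assumes that pairwise tree similarities are already available as a matrix, and it uses a simple greedy grouping (a tree joins the first set in which it is similar to every member), which is only one possible way to form the sets. The function names and the threshold values in the usage below are hypothetical.

```python
def divide_into_tree_sets(sim, similarity_threshold):
    """Group tree indices so that all trees in a set are mutually similar.

    sim[i][j] is the tree similarity between item tree i and item tree j.
    ASSUMPTION: greedy assignment to the first compatible set.
    """
    tree_sets = []
    for i in range(len(sim)):
        for tree_set in tree_sets:
            if all(sim[i][j] >= similarity_threshold for j in tree_set):
                tree_set.append(i)
                break
        else:
            tree_sets.append([i])   # no compatible set: start a new one
    return tree_sets


def should_stop(sim, similarity_threshold, tree_number_threshold):
    """Return (stop, target_set) for the check described at 750."""
    tree_sets = divide_into_tree_sets(sim, similarity_threshold)
    target = max(tree_sets, key=len)   # set with the most item trees
    return len(target) < tree_number_threshold, target
```

For example, with three item trees of which only the second and third are similar to each other, the target tree set contains two trees, so the expansion stops when the tree number threshold is 3 and continues when it is 2.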
According to the process 700, optionally, further iterative boundary expansion may be performed at 770 to attempt to find multiple better item trees. In an implementation, after step 730 is performed, if it is determined that the current step of iteration is sibling node expansion, further iterative boundary expansion may be performed to the multiple item trees in a direction which is reverse to the direction of the current step of iteration. For example, for each item tree, if the current step of iteration is to expand right to a sibling node, an attempt may be made to expand left to a different sibling node. In another implementation, after step 760 is performed, the multiple item trees may first be reset to a state at a predetermined previous step of iteration. The predetermined previous step of iteration may be determined, e.g., with the predetermined number of previous steps of iteration involved at 760. For example, if the predetermined number of previous steps of iteration is previous 2 steps of iteration, the predetermined previous step of iteration may be the previous 3rd step of iteration. Then, if it is determined that a next step of iteration after the predetermined previous step of iteration is sibling node expansion, an attempt may be made to perform further iterative boundary expansion to the multiple item trees in a direction which is reverse to the direction of the next step of iteration. For example, for each item tree, if the item tree is reset to a state at the previous 3rd step of iteration and the previous 2nd step of iteration is to expand to the right to a sibling node, an attempt may be made to further expand to the left to a different sibling node.
It should be understood that, although not shown, the process 700 may also include performing steps 720 to 730 and/or steps 740 to 760 with respect to the further iterative boundary expansion at 770, so as to ensure that the obtained multiple item trees do not have node overlap but have similar structure, thereby ensuring that different items do not have content overlap but have similar structure.
Through the process 700, multiple item trees originating from multiple anchor elements in the anchor element group 702 may be obtained, and these item trees respectively define boundaries of multiple corresponding items, thereby the original list 704 formed by these items may be finally identified.
It should be understood that all the steps in the process 700 are exemplary, and the embodiments of the present disclosure would also cover any changes to the process 700. For example, both the processing related to determining whether node overlap occurs and the processing related to performing similarity check in the process 700 are optional, and either or both of these processings may be included in the process 700, or either or both of these processings may be omitted from the process 700. Moreover, for example, the process 700 may also provide only multiple item trees in the target tree set as the iterative expansion result, and form the original list 704 with these item trees.
FIG.8A to FIG.8F illustrate an example of iterative boundary expansion according to an embodiment. In FIG.8A to FIG.8F, an exemplary process of iterative boundary expansion is shown in an exemplary DOM tree corresponding to a target web page.
The DOM tree may include multiple nodes, e.g., node 801 to node 826, and other nodes that are not shown. Symbols such as "Div", "A", "Span", "P", "Img", etc., displayed in blocks representing nodes indicate html tags of corresponding nodes. Moreover, the embodiments of the present disclosure propose to set "Text" tags for text strings appearing in an html source file, although such html tags do not exist, and these text strings may also be taken as nodes in the DOM tree, e.g., node 819, node 820, etc. These text strings may be visible elements that may be presented, and thus setting Text tags and corresponding nodes for these text strings helps determine item boundaries more accurately. It should be understood that, for the purpose of explanation, only a few exemplary node tags are shown in FIG.8A to FIG.8F, but in practical applications, any other types of node tag may exist, and the embodiments of the present disclosure are not limited in any way by what specific tags the nodes in the DOM tree have. Moreover, in FIG.8A to FIG.8F, nodes included into an item tree through iterative boundary expansion are highlighted by shading, and expansion paths of the iterative boundary expansion are indicated by arrows.
In FIG.8A, it is assumed that node 818, node 821 and node 824 have been identified as anchor elements with an Img (image) tag. Then, iterative boundary expansion may be performed synchronously by taking these nodes as starting points respectively, so as to obtain item trees respectively originating from these nodes. Hereinafter, an item tree originating from node 818 is referred to as a first item tree, an item tree originating from node 821 is referred to as a second item tree, and an item tree originating from node 824 is referred to as a third item tree.
FIG.8B illustrates the 1st step of iteration. Since none of node 818, node 821 and node 824 has sibling nodes, parent node expansion will be performed in the 1st step of iteration. For example, in the first item tree, the expansion is from node 818 to a parent node 806 of node 818; in the second item tree, the expansion is from node 821 to a parent node 810 of node 821; and in the third item tree, the expansion is from node 824 to a parent node 814 of node 824.
FIG.8C illustrates the 2nd step to the 4th step of iteration in which sibling node expansion will be performed. Taking the first item tree as an example, node 806 determined through the 1st step of iteration is the current node, which has sibling nodes 807, 808 and 809, and thus the 2nd step to the 4th step of iteration will expand to the right sequentially to node 807, node 808 and node 809 as indicated by the arrow. Similarly, in the second item tree, the 2nd step to the 4th step of iteration will expand to the right sequentially to node 811, node 812 and node 813 as indicated by the arrow; and in the third item tree, the 2nd step to the 4th step of iteration will expand to the right sequentially to node 815, node 816 and node 817 as indicated by the arrow. Moreover, since node 808 has child node 819 and node 809 has child node 820, node 819 and node 820 may also be included into the first item tree. Similarly, child node 822 of node 812 and child node 823 of node 813 may be included into the second item tree, and child node 825 of node 816 and child node 826 of node 817 may be included into the third item tree.
FIG.8D illustrates the 5th step of iteration in which parent node expansion will be performed. Taking the first item tree as an example, since all sibling nodes 807, 808 and 809 of node 806 have been included in the first item tree through the 2nd step to the 4th step of iteration, the 5th step of iteration will expand to parent node 803 of nodes 806 to 809 as indicated by the arrow. Similarly, in the second item tree, the 5th step of iteration will expand to node 804 as indicated by the arrow; and in the third item tree, the 5th step of iteration will expand to node 805 as indicated by the arrow.
FIG.8E illustrates the 6th step of iteration in which parent node expansion will be performed. Taking the first item tree as an example, node 803 determined through the 5th step of iteration has no sibling node, and thus the 6th step of iteration will expand to parent node 802 of node 803 as indicated by the arrow. Similarly, in the second item tree, the 6th step of iteration will expand to node 802 as indicated by the arrow; and in the third item tree, the 6th step of iteration will expand to node 802 as indicated by the arrow.
Through the 6th step of iteration, node 802 will be included in the first item tree, the second item tree and the third item tree at the same time, thereby causing node overlap to occur. Therefore, the performing of the iterative boundary expansion will be stopped, and node 802 determined through the 6th step of iteration will be excluded from the first item tree, the second item tree and the third item tree, respectively. In FIG.8F, the finally obtained first item tree 830, the finally obtained second item tree 840 and the finally obtained third item tree 850 are shown by dashed blocks. The first item tree 830, the second item tree 840 and the third item tree 850 respectively correspond to a first item, a second item and a third item in an original list in the target web page. Thus, through the iterative boundary expansion in FIG.8A to FIG.8F, the original list in the target web page and items in the original list may be identified. It should be understood that similarity check described above in connection with FIG.7 may also be performed to the final item trees shown in FIG.8F. Moreover, it should be understood that, as shown in FIG.8F, each item tree has its own root node, e.g., the first item tree 830, the second item tree 840 and the third item tree 850 have their own root nodes 803, 804 and 805, respectively, and thus a boundary of each item tree may actually be indicated by a html tag of a root node, e.g., a boundary of the first item tree may be indicated by the "Div" tag of root node 803.
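The walk-through of FIG.8A to FIG.8F may be reproduced with the following sketch. It is an illustration under stated assumptions rather than the disclosed implementation: the function names and the dictionary FIG8_CHILDREN are hypothetical, sibling expansion is only attempted to the right, and a sibling that already belongs to another item tree is treated as unavailable, which is an assumption chosen so that the 6th step expands to the parent node as in the description. The parent-child relations in FIG8_CHILDREN are taken from the text above.

```python
def subtree(children, node):
    """Return node together with all of its descendants."""
    nodes = [node]
    for child in children.get(node, []):
        nodes.extend(subtree(children, child))
    return nodes


def expand_boundaries(children, anchors):
    """Expand all item trees synchronously until node overlap would occur."""
    parent = {c: p for p, cs in children.items() for c in cs}
    trees = [{"root": a, "cur": a, "nodes": set(subtree(children, a))}
             for a in anchors]
    while True:
        claimed = set().union(*(t["nodes"] for t in trees))
        steps = []
        for t in trees:
            p = parent.get(t["cur"])
            if p is None:                  # top of the DOM tree reached
                return trees
            sibs = children[p]
            k = sibs.index(t["cur"]) + 1   # index of the right sibling
            if k < len(sibs) and sibs[k] not in claimed:
                # sibling node expansion: take the sibling and its subtree
                steps.append((sibs[k], t["root"],
                              set(subtree(children, sibs[k]))))
            else:
                # parent node expansion: the parent becomes the new root
                steps.append((p, p, {p}))
        candidate = [t["nodes"] | added
                     for t, (_, _, added) in zip(trees, steps)]
        if sum(len(c) for c in candidate) != len(set().union(*candidate)):
            return trees                   # overlap: discard this step, stop
        for t, (cur, root, _), nodes in zip(trees, steps, candidate):
            t.update(cur=cur, root=root, nodes=nodes)


# Parent -> ordered children, as described for FIG.8A to FIG.8F.
FIG8_CHILDREN = {
    801: [802], 802: [803, 804, 805],
    803: [806, 807, 808, 809], 806: [818], 808: [819], 809: [820],
    804: [810, 811, 812, 813], 810: [821], 812: [822], 813: [823],
    805: [814, 815, 816, 817], 814: [824], 816: [825], 817: [826],
}

item_trees = expand_boundaries(FIG8_CHILDREN, [818, 821, 824])
print([t["root"] for t in item_trees])
```

In this sketch the attempted 6th step would add node 802 to all three item trees, so the step is discarded and the item trees keep root nodes 803, 804 and 805.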
FIG.9A to FIG.9F illustrate an example of iterative boundary expansion according to an embodiment. FIG.9A to FIG.9F illustrate examples of performing iterative boundary expansion in different approaches in an exemplary DOM tree corresponding to a target web page. The DOM tree may include multiple nodes, e.g., node 901 to node 929, and other nodes that are not shown. In FIG.9A, it is assumed that node 919, node 923 and node 927 have been identified as anchor elements with an Img (image) tag. Then, iterative boundary expansion may be performed synchronously by taking these nodes as starting points respectively, so as to obtain item trees respectively originating from these nodes. Hereinafter, an item tree originating from node 919 is referred to as a first item tree, an item tree originating from node 923 is referred to as a second item tree, and an item tree originating from node 927 is referred to as a third item tree.
FIG.9B illustrates the 1st step of iteration. Since none of node 919, node 923 and node 927 has sibling nodes, parent node expansion will be performed in the 1st step of iteration. For example, in the first item tree, the expansion is from node 919 to parent node 904 of node 919; in the second item tree, the expansion is from node 923 to parent node 909 of node 923; and in the third item tree, the expansion is from node 927 to parent node 914 of node 927.
FIG.9C shows item trees finally obtained by performing subsequent iterative boundary expansion in an expansion approach, on the basis of the 1st step of iteration in FIG.9B. As shown in FIG.9C, the 2nd step to the 5th step of iteration will sequentially perform sibling node expansion to the left as indicated by the arrows. In the first item tree, the 2nd step of iteration will expand to the left to node 903 as indicated by the arrow, and since there is no other sibling node, the 3rd step to the 5th step of iteration will be suspended in the first item tree. In the second item tree, the 2nd step to the 5th step of iteration will sequentially expand to the left to node 908, node 907, node 906 and node 905 as indicated by the arrow. In the third item tree, the 2nd step to the 5th step of iteration will sequentially expand to the left to node 913, node 912, node 911 and node 910 as indicated by the arrow. Since the node 902 is a common parent node of the first item tree, the second item tree and the third item tree, in order to avoid node overlap, the finally obtained first item tree, second item tree and third item tree do not include node 902. In FIG.9C, the finally obtained first item tree 932, the finally obtained second item tree 934 and the finally obtained third item tree 936 are shown by dashed blocks. It should be understood that node 915, node 916, node 917, node 928 and node 929 are not included in any item tree. Moreover, it is assumed that similarity check described above in connection with FIG.7 is further performed to these finally obtained item trees, and it is found that although there is a high similarity between the second item tree 934 and the third item tree 936, a tree similarity between the first item tree 932 and the second item tree 934 and a tree similarity between the first item tree 932 and the third item tree 936 are both lower than a similarity threshold. 
Therefore, the first item tree 932 may be regarded as an unqualified item tree and thus is discarded, and accordingly, the expansion approach in FIG.9C actually outputs only two item trees 934 and 936 finally.
FIG.9D shows item trees finally obtained by performing subsequent iterative boundary expansion in another expansion approach, on the basis of the 1st step of iteration in FIG.9B. As shown in FIG.9D, the 2nd step to the 5th step of iteration will sequentially perform sibling node expansion to the right as indicated by the arrows. In the first item tree, the 2nd step to the 5th step of iteration will sequentially expand to the right to node 905, node 906, node 907 and node 908 as indicated by the arrow. In the second item tree, the 2nd step to the 5th step of iteration will sequentially expand to the right to node 910, node 911, node 912 and node 913 as indicated by the arrow. In the third item tree, the 2nd step to the 4th step of iteration will sequentially expand to the right to node 915, node 916 and node 917 as indicated by the arrow, and since there is no further sibling node, the 5th step of iteration will be suspended in the third item tree. In FIG.9D, the finally obtained first item tree 942, the finally obtained second item tree 944 and the finally obtained third item tree 946 are shown by dashed blocks. It should be understood that node 903 and node 918 are not included in any item tree. Moreover, it is assumed that the similarity check described above in connection with FIG.7 is further performed on these finally obtained item trees, and it is found that none of the tree similarities among these item trees is lower than a similarity threshold. Accordingly, the expansion approach in FIG.9D actually outputs all three item trees 942, 944 and 946 finally.
According to the embodiments of the present disclosure, in order to find better item trees, further iterative boundary expansion may be performed according to, e.g., step 770 in FIG.7. It is assumed that the item trees shown in FIG.9D are respectively reset to the state at the 4th step of iteration, as shown in FIG.9E.
In FIG.9E, the current expansion path of the first item tree sequentially includes node 919, node 904, node 905, node 906 and node 907 as indicated by the arrow, the current expansion path of the second item tree sequentially includes node 923, node 909, node 910, node 911 and node 912 as indicated by the arrow, and the current expansion path of the third item tree sequentially includes node 927, node 914, node 915, node 916 and node 917 as indicated by the arrow. Unlike the 5th step of iteration in FIG.9D that performs sibling node expansion to the right, the 5th step of iteration in FIG.9F may attempt to perform sibling node expansion to the left, i.e., in a direction which is reverse to the direction of the 5th step of iteration in FIG.9D. Accordingly, in the 5th step of iteration in FIG.9F, the expansion path of the first item tree will further include node 903 as indicated by the arrow, the expansion path of the second item tree will further include node 908 as indicated by the arrow, and the expansion path of the third item tree will further include node 913 as indicated by the arrow. In FIG.9F, the finally obtained first item tree 952, the finally obtained second item tree 954 and the finally obtained third item tree 956 are shown by dashed blocks. Moreover, it is assumed that similarity check described above in connection with FIG.7 is further performed to these finally obtained item trees, and it is found that none of tree similarities among these item trees is lower than a similarity threshold. Accordingly, the expansion approach in FIG.9F actually outputs all three item trees 952, 954 and 956 finally.
In the embodiments of the present disclosure, the performing of the iterative boundary expansion may follow predetermined criteria, e.g., the more item trees obtained the better, the more nodes in each item tree the better, the higher the tree similarities among different item trees the better, etc. By comparing FIG.9C, FIG.9D and FIG.9F, it can be seen that the expansion approaches in FIG.9D and FIG.9F are better than the expansion approach in FIG.9C, because the expansion approaches in FIG.9D and FIG.9F may output a larger number of item trees. Moreover, the expansion approach in FIG.9F is better than the expansion approach in FIG.9D, because: the expansion approach in FIG.9F may include more nodes (e.g., node 903 and node 918) into the item trees; and tree similarities among the item trees obtained through the expansion approach in FIG.9F are higher, e.g., a tree similarity between the third item tree 956 and the first item tree 952 and a tree similarity between the third item tree 956 and the second item tree 954 will be higher than a tree similarity between the third item tree 946 and the first item tree 942 and a tree similarity between the third item tree 946 and the second item tree 944.
It should be understood that the embodiments of the present disclosure may perform further iterative boundary expansions in various approaches. For example, instead of respectively resetting the item trees shown in FIG.9D to the state at the 4th step of iteration shown in FIG.9E, the item trees shown in FIG.9D may be reset to a state at any other predetermined previous step of iteration, e.g., reset to the state at the 3rd step of iteration. Then, expansion may be performed to the reset item trees in a direction which is reverse to the direction of the next step of iteration after the predetermined previous step of iteration. Moreover, it should be understood that the embodiments of the present disclosure may also perform further iterative boundary expansions in multiple different approaches, and select, from these different approaches, an approach that can obtain the best item trees.
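Selecting among several expansion approaches according to the criteria above may be sketched as follows. The text lists the criteria without stating how they are combined, so the lexicographic priority used here (number of item trees first, then number of nodes, then mean tree similarity) is an assumption, as are the names ExpansionResult, preference_key and select_best. The numeric values in the usage example are hypothetical placeholders that only preserve the ordering described for FIG.9C, FIG.9D and FIG.9F; they are not taken from the figures.

```python
from dataclasses import dataclass


@dataclass
class ExpansionResult:
    name: str               # label of the expansion approach
    num_trees: int          # number of item trees output
    num_nodes: int          # total nodes across the output item trees
    mean_similarity: float  # mean pairwise tree similarity


def preference_key(result):
    # ASSUMPTION: criteria are compared lexicographically in this order.
    return (result.num_trees, result.num_nodes, result.mean_similarity)


def select_best(results):
    """Return the expansion result that best satisfies the criteria."""
    return max(results, key=preference_key)


# Hypothetical values, chosen only to mirror the qualitative comparison.
candidates = [
    ExpansionResult("FIG.9C", num_trees=2, num_nodes=14, mean_similarity=0.9),
    ExpansionResult("FIG.9D", num_trees=3, num_nodes=20, mean_similarity=0.8),
    ExpansionResult("FIG.9F", num_trees=3, num_nodes=22, mean_similarity=0.9),
]
print(select_best(candidates).name)
```

A weighted score could be used instead of a lexicographic key when the criteria should trade off against one another.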
FIG.10 illustrates an exemplary boundary detection result according to an embodiment. It is assumed that boundary detection has been performed to the target web page 20 in FIG.2. As shown in FIG.10, dashed block 1010 denotes an item corresponding to "M cellphone A4" identified through boundary detection, dashed block 1020 denotes an item corresponding to "M cellphone A3" identified through boundary detection, and dashed block 1030 denotes an item corresponding to "M cellphone A2" identified through boundary detection. The items denoted by dashed blocks 1010, 1020 and 1030 together form the original list 202 in the target web page 20 shown in FIG.2.
FIG.11 illustrates an exemplary process 1100 of dominant list determination according to an embodiment. The process 1100 is an exemplary implementation of the step 430 in FIG.4. Assuming that it has been determined that a target web page includes more than one original list, e.g., a first original list 1102, a second original list 1104, etc., the process 1100 may be performed for determining a dominant list from these original lists.
At 1110, visual features of the first original list 1102 may be determined at least with boundaries of items in the first original list 1102. At 1120, visual features of the second original list 1104 may be determined at least with boundaries of items in the second original list 1104.
In the embodiments of the present disclosure, visual features of an original list may refer to various visual features that help determine whether the original list occupies a main position in a target web page, whether it is used for presenting main content of the target web page, etc. In an implementation, the visual features may include a minimum boundary distance between adjacent items within the original list, which may indicate a visual distance between the two items. For example, a minimum boundary distance between two adjacent items may be calculated with boundaries of these two items. In an implementation, the visual features may include list position which may indicate whether the original list occupies a main position in the target web page and thus acts as a main content portion in the target web page. For example, the list position may include a position of the original list in a horizontal direction in the target web page. For example, the list position may include a position of the original list in a vertical direction in the target web page, e.g., whether the list is located above the fold of the screen, etc. In an implementation, the visual features may include item content richness which indicates visual content richness of items in the original list. For example, item content richness of an item may include, e.g., size of the item, number of nodes contained in the item, etc., determined based on a boundary of the item.
At 1130, a dominant list may be determined from the first original list 1102 and the second original list 1104 based on the visual features of the first original list 1102 and the visual features of the second original list 1104. In an implementation, the dominant list may be determined with multiple heuristic rules defined for the visual features.
For example, for the minimum boundary distance, a heuristic rule may be defined as to whether visual distances among items in a list are small, which is based on the consideration that distances among items in a dominant list are usually not very far. For example, for the list position, a heuristic rule may be defined as to whether an original list occupies a main position in a target web page, which is based on the consideration that a dominant list usually occupies a main position in a target web page. For example, for the item content richness, a heuristic rule may be defined as to whether an original list has a high item content richness, which is based on the consideration that a dominant list usually has a high item content richness. According to the above heuristic rules, an original list that can better satisfy these heuristic rules may be selected from the first original list 1102 and the second original list 1104 as the dominant list.
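The visual features and heuristic rules above may be illustrated with the following sketch, which represents each item boundary as an axis-aligned rectangle in page coordinates. Everything specific in it is an assumption made for illustration: the names Box, boundary_distance, visual_features, rules_satisfied and dominant_list, the use of mean item area as the richness measure, the rule thresholds, and the choice to count satisfied rules. Each list is assumed to contain at least two items.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: float   # left edge
    y: float   # top edge
    w: float   # width
    h: float   # height


def boundary_distance(a, b):
    """Shortest distance between two item boundaries (0 if they touch)."""
    dx = max(b.x - (a.x + a.w), a.x - (b.x + b.w), 0.0)
    dy = max(b.y - (a.y + a.h), a.y - (b.y + b.h), 0.0)
    return (dx * dx + dy * dy) ** 0.5


def visual_features(items, page_width, fold_height):
    left = min(b.x for b in items)
    right = max(b.x + b.w for b in items)
    top = min(b.y for b in items)
    return {
        # minimum boundary distance between adjacent items
        "min_gap": min(boundary_distance(a, b)
                       for a, b in zip(items, items[1:])),
        # horizontal list position: offset of the list center from page center
        "center_offset": abs((left + right) / 2 - page_width / 2) / page_width,
        # vertical list position: does the list start above the fold?
        "above_fold": top < fold_height,
        # item content richness, approximated by mean item area
        "richness": sum(b.w * b.h for b in items) / len(items),
    }


def rules_satisfied(f, max_gap=40.0, max_offset=0.25, min_area=20000.0):
    # ASSUMPTION: hypothetical thresholds; each rule counts equally.
    return sum([f["min_gap"] <= max_gap,
                f["center_offset"] <= max_offset,
                f["above_fold"],
                f["richness"] >= min_area])


def dominant_list(lists, page_width, fold_height):
    """Return the name of the list that satisfies the most heuristic rules."""
    return max(lists, key=lambda name: rules_satisfied(
        visual_features(lists[name], page_width, fold_height)))
```

For instance, a centrally placed list of large, closely spaced items would be preferred over a small list far to the side and below the fold.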
Taking the target web page 20 in FIG.2 as an example, through performing the process 1100, the original list 202 may be identified as the dominant list from the original list 202 and the original list 204.
It should be understood that all the steps in the process 1100 are exemplary, and the embodiments of the present disclosure would also cover any changes to the process 1100. For example, when there are more than two original lists, a dominant list may be determined at least with visual features of these original lists in an approach similar to the process 1100. Moreover, various visual features and various heuristic rules given above are exemplary, and the embodiments of the present disclosure may adopt any one or more of these visual features and heuristic rules, or adopt any other types of visual feature and heuristic rule.
FIG.12 illustrates an exemplary process 1200 of representative metadata obtaining according to an embodiment. The process 1200 is an exemplary implementation of the step 440 in FIG.4. It is assumed that the process 1200 is performed for obtaining a group of representative metadata for a specific item 1202 in an original list.
At 1210, a group of leaf nodes in an item tree that is identified by a boundary of the item 1202 may be identified. For example, leaf nodes in an item tree corresponding to the item 1202 may be identified.
At 1220, a group of initial metadata corresponding to the identified group of leaf nodes may be extracted. For example, initial metadata of each leaf node may be extracted. Taking the item denoted by dashed block 1010 in FIG.10 as an example, initial metadata may include, e.g., the picture in the item, the character string "M cellphone A4, 6.5 inches, 256G, black", the icon of 5 solid stars, the character string "25900 reviews", the character string "5500 RMB", etc.
At 1230, a group of tags corresponding to the extracted group of initial metadata may be determined. For example, a corresponding tag is determined for each initial metadata, to indicate specific meaning of the initial metadata. The group of tags may be determined at 1230 in various approaches.
In an implementation, a token sequence may be formed with the group of initial metadata. Each token in the token sequence corresponds to one piece of initial metadata in the group of initial metadata. Then, a feature set for each token in the token sequence may be calculated. The feature set may include various types of feature that help determine tags, e.g., DOM tree feature, XPath feature, content feature, language feature, rendering feature, etc. The DOM tree feature may include, e.g., layer level depth, tag, class ID, etc., of a node corresponding to the token. The XPath feature may include, e.g., name, CSS class, etc. of a node corresponding to the token. The content feature may include, e.g., text vector of the token, whether the first letter is capitalized, etc. The language feature may include, e.g., language used by the token, Word2vec semantic feature vector of the token, etc. The rendering feature may include, e.g., various features involved in rendering a node corresponding to the token, such as position, length, width, etc. It should be understood that the embodiments of the present disclosure are not limited to the exemplary features included in the feature set given above, but may cover any other features or any combination of these features. A tag for each token may be generated based on multiple feature sets of multiple tokens in the token sequence, through a previously-trained tagger model. Exemplarily, the tagger model may be a combined model formed by a discriminative model and a generative model, wherein the discriminative model may be, e.g., a binary-classification or multi-classification model, and the generative model may be, e.g., a sequence-to-sequence (Seq2seq) model. It should be understood that the embodiments of the present disclosure are not limited to generating tags through the tagger model described above, but may also generate tags in any other approaches.
Still taking the item denoted by dashed block 1010 in FIG.10 as an example, through step 1230, an "image" tag may be generated for the picture in the item, a "title" tag may be generated for the character string "M cellphone A4, 6.5 inches, 256G, black", a "rating" tag may be generated for the icon of 5 solid stars, a "review" tag may be generated for the character string "25900 reviews" and a "price" tag may be generated for the character string "5500 RMB".
At 1240, the group of initial metadata may be ranked with the generated group of tags. In an implementation, a keyword ranking model may be trained previously, which may be used for ranking a group of tags that act as keywords. For example, the keyword ranking model may be trained for ranking multiple tags according to, e.g., importance degree, representativeness, etc. As an example, for the image tag, title tag, rating tag, review tag, price tag, etc., through ranking at 1240, these tags may be ranked from high to low as, e.g., image tag, title tag, price tag, rating tag, review tag, etc. Accordingly, the initial metadata corresponding to these tags are also ranked in the same order.
At 1250, one or more highest-ranked initial metadata may be selected as a group of representative metadata corresponding to the item 1202.
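Steps 1240 and 1250 may be sketched as follows. A fixed priority list stands in for the previously trained keyword ranking model, which is an assumption made purely for illustration; its order follows the example ranking given above. The names TAG_PRIORITY and select_representative, the top_k parameter, and the placeholder strings used for the picture and the star icon are hypothetical.

```python
# Stand-in for the keyword ranking model (ASSUMPTION): a fixed tag order
# taken from the example ranking in the text, highest rank first.
TAG_PRIORITY = ["image", "title", "price", "rating", "review"]


def select_representative(tagged_metadata, top_k=3):
    """Rank (tag, metadata) pairs by tag and keep the top_k highest-ranked."""
    rank = {tag: i for i, tag in enumerate(TAG_PRIORITY)}
    # Tags unknown to the priority list are ranked last.
    ranked = sorted(tagged_metadata,
                    key=lambda pair: rank.get(pair[0], len(rank)))
    return ranked[:top_k]


# Initial metadata of the item denoted by dashed block 1010, with tags.
item_1010 = [
    ("image", "<picture>"),                     # placeholder for the picture
    ("title", "M cellphone A4, 6.5 inches, 256G, black"),
    ("rating", "<5 solid stars>"),              # placeholder for the icon
    ("review", "25900 reviews"),
    ("price", "5500 RMB"),
]
representative = select_representative(item_1010)
print([tag for tag, _ in representative])
```

With top_k set to 3, the picture, the title string and the price string are kept as the representative metadata of the item.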
Through performing the process 1200 to each item in the original list, multiple groups of representative metadata respectively corresponding to multiple items in the original list may be obtained. The multiple groups of representative metadata may be subsequently used for generating a structured list.
It should be understood that all the steps in the process 1200 are exemplary, and the embodiments of the present disclosure would also cover any changes to the process 1200.
According to the embodiments of the present disclosure, after multiple groups of representative metadata respectively corresponding to multiple items in the original list are obtained, the multiple groups of representative metadata may be visualized into a structured list. Each group of representative metadata may form a new item in the structured list. It should be understood that the embodiments of the present disclosure are not limited to any specific approach for visualizing multiple groups of representative metadata into a structured list. In an implementation, the format or layout of the structured list may be pre-defined for specifying, e.g., an arranging approach (e.g., horizontal arrangement, vertical arrangement, etc.) of multiple items in the structured list, an arranging approach of multiple elements within each item, sizes of items and elements, etc. In an implementation, the format or layout of the structured list may be similar to that of the original list, except that the structured list may include fewer items or elements than the original list.
FIG.13 illustrates an exemplary search result page 1300 according to an embodiment. It is assumed that a user has input a query "M cellphone" in a search box 1310 to indicate that the user wants to obtain web search results regarding the M cellphone. A search result region 1320 in the search result page 1300 includes multiple web search results. Unlike the search results shown in the region 330 in FIG.3 for the web page 20 in FIG.2, the search result region 1320 in FIG.13 includes an exemplary structured list 1330 generated for the web page 20 in FIG.2. The structured list 1330 is a simplified version of the original list 202 in the target web page 20, which may act as a list snippet of the original list 202. The structured list 1330 still contains enough information to enable the user to intuitively and comprehensively understand main content of the original list 202. For example, the structured list 1330 includes item 1332, item 1334 and item 1336, and these items respectively correspond to item 1010, item 1020 and item 1030 in the original list in the target web page 20 (as shown in FIG.10), and include main representative content in corresponding items in the original list. Taking item 1332 as an example, it includes the picture, brief introduction and price for the cellphone "M Cellphone A4" presented in the region 22 in FIG.2. Therefore, the user may intuitively and conveniently learn main content in the target web page 20 by viewing the structured list 1330 in the search result region 1320, without the need of, e.g., clicking a link to the target web page 20 in order to learn content in the web page. It should be understood that the search result page 1300 in FIG.13 and the structured list 1330 therein are merely exemplary, and the embodiments of the present disclosure are not limited to this example in any approach.
FIG.14 illustrates a flowchart of an exemplary method 1400 for list extraction and visualization in web pages according to an embodiment.
At 1410, at least one anchor element group in a target web page may be detected, the at least one anchor element group comprising a first anchor element group.
At 1420, boundary detection may be performed to multiple anchor elements in the first anchor element group, to obtain boundaries of multiple items respectively associated with the multiple anchor elements, the multiple items corresponding to a first original list in the target web page.
At 1430, multiple groups of representative metadata respectively corresponding to the multiple items may be obtained from the target web page with the boundaries of the multiple items.
At 1440, the multiple groups of representative metadata may be visualized into a structured list.
In an implementation, the detecting at least one anchor element group may comprise: identifying multiple HTML elements, that meet anchor element constraints, in the target web page as multiple identified anchor elements; extracting, from the target web page, a property set of each identified anchor element in the multiple identified anchor elements; and clustering the multiple identified anchor elements into the at least one anchor element group based on multiple property sets of the multiple identified anchor elements.
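The clustering of identified anchor elements by property set may be illustrated with the following sketch. It assumes, only for the purpose of illustration, that two anchor elements belong to the same group when their tag, CSS class and index-free XPath are identical; the present disclosure does not prescribe this particular clustering rule, and the names `Anchor`, `generalized_xpath` and `cluster_anchors` are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Anchor(NamedTuple):
    """Property set of one identified anchor element."""
    tag: str
    css_class: str
    xpath: str


def generalized_xpath(xpath: str) -> str:
    """Drop positional indices so that .../li[1]/img and .../li[2]/img match."""
    return re.sub(r"\[\d+\]", "", xpath)


def cluster_anchors(anchors: List[Anchor]) -> List[List[Anchor]]:
    """Group anchors whose property sets coincide (exact-match clustering)."""
    groups: Dict[Tuple[str, str, str], List[Anchor]] = defaultdict(list)
    for anchor in anchors:
        key = (anchor.tag, anchor.css_class, generalized_xpath(anchor.xpath))
        groups[key].append(anchor)
    # A single anchor cannot indicate a list, so singleton groups are dropped.
    return [group for group in groups.values() if len(group) >= 2]


anchors = [
    Anchor("img", "product-img", "/html/body/ul/li[1]/img"),
    Anchor("img", "product-img", "/html/body/ul/li[2]/img"),
    Anchor("img", "product-img", "/html/body/ul/li[3]/img"),
    Anchor("img", "logo", "/html/body/header/img"),
]
anchor_groups = cluster_anchors(anchors)
```

Here the three product images form one anchor element group, while the logo image is left out because no other anchor shares its property set.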
The anchor element constraints may comprise at least one of: an HTML element having an image tag; an HTML element having a title tag; and an HTML element representing a date. A property set of each identified anchor element may comprise at least one of HTML tag attribute, CSS class and XPath information of the identified anchor element.
In an implementation, the boundary detection may comprise: based on a DOM tree corresponding to the target web page, performing iterative boundary expansion synchronously by taking the multiple anchor elements as starting points respectively, to obtain multiple item trees respectively originating from the multiple anchor elements, wherein each item tree represents an item and comprises multiple nodes, and each node corresponds to an element determined through the iterative boundary expansion.
The iterative boundary expansion may comprise: for each item tree and in each step of iteration, expanding to a next node and including the next node into the item tree.
The iterative boundary expansion may comprise at least one of: performing sibling node expansion, to expand from the current node to a sibling node of the current node; and performing parent node expansion, to expand to a parent node of the current node after all sibling nodes of the current node have been included in the item tree.
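The order of sibling node expansion and parent node expansion for a single item tree may be sketched as follows. The `Node` class, the choice of visiting siblings in document order, and the rule that the newly included node becomes the current node are assumptions made for this illustration only.

```python
from typing import List, Optional, Set, Tuple


class Node:
    """Minimal stand-in for a DOM node."""

    def __init__(self, name: str, children: Optional[List["Node"]] = None):
        self.name = name
        self.parent: Optional["Node"] = None
        self.children: List["Node"] = children or []
        for child in self.children:
            child.parent = self


def next_expansion(current: Node, included: Set[Node]) -> Optional[Tuple[str, Node]]:
    """Pick the next node: an unvisited sibling first, otherwise the parent."""
    parent = current.parent
    if parent is None:
        return None  # the DOM root has been reached
    for sibling in parent.children:
        if sibling is not current and sibling not in included:
            return ("sibling", sibling)
    return ("parent", parent)


def expand(anchor: Node, steps: int) -> List[Tuple[str, str]]:
    """Run a fixed number of expansion steps and record what was added."""
    included: Set[Node] = {anchor}
    current = anchor
    trace: List[Tuple[str, str]] = []
    for _ in range(steps):
        step = next_expansion(current, included)
        if step is None:
            break
        kind, node = step
        included.add(node)
        current = node
        trace.append((kind, node.name))
    return trace


img = Node("img")
item = Node("li", [img, Node("h3"), Node("span")])
page = Node("ul", [item])
trace = expand(img, steps=3)
```

Starting from the image anchor, the trace is two sibling expansions (to the `h3` and `span` nodes) followed by one parent expansion (to the enclosing `li` node).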
The boundary detection may comprise: determining whether the current step of iteration causes node overlap between the item tree and at least one other item tree in the multiple item trees; and in response to determining that the node overlap occurs, stopping the iterative boundary expansion, and excluding the nodes determined through the current step of iteration from the respective item trees.
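A minimal sketch of synchronous expansion with the overlap-based stopping condition is given below. It models the DOM as a hypothetical dictionary `CHILDREN` mapping each node name to its child names, treats an item tree as the set of DOM nodes it covers, and does not implement the reverse-direction expansion; all of these are simplifying assumptions for illustration.

```python
from typing import Dict, List, Optional, Set

# Hypothetical DOM: a <ul> with two list items, each holding an image anchor.
CHILDREN: Dict[str, List[str]] = {
    "ul": ["li1", "li2"],
    "li1": ["img1", "h1", "p1"],
    "li2": ["img2", "h2", "p2"],
}


def subtree(node: str, children: Dict[str, List[str]]) -> Set[str]:
    """Return the node together with all of its descendants."""
    nodes = {node}
    for child in children.get(node, []):
        nodes |= subtree(child, children)
    return nodes


def next_node(current: str, covered: Set[str], children: Dict[str, List[str]],
              parent: Dict[str, str]) -> Optional[str]:
    """Sibling expansion first; parent expansion once all siblings are covered."""
    p = parent.get(current)
    if p is None:
        return None
    for sibling in children[p]:
        if sibling not in covered:
            return sibling
    return p


def expand_synchronously(anchors: List[str],
                         children: Dict[str, List[str]]) -> List[Set[str]]:
    parent = {c: p for p, cs in children.items() for c in cs}
    covered = [subtree(a, children) for a in anchors]
    current = list(anchors)
    while True:
        proposals = [next_node(c, cov, children, parent)
                     for c, cov in zip(current, covered)]
        if any(p is None for p in proposals):
            break
        trial = [cov | subtree(p, children) for cov, p in zip(covered, proposals)]
        overlap = any(trial[i] & trial[j]
                      for i in range(len(trial)) for j in range(i + 1, len(trial)))
        if overlap:
            break  # nodes from this step are excluded: `covered` is left unchanged
        covered, current = trial, proposals
    return covered


item_trees = expand_synchronously(["img1", "img2"], CHILDREN)
```

In this example the expansion stops when each item tree would expand to the other list item, so each item tree ends up covering exactly one `li` node and its children.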
The method may further comprise: if the current step of iteration is sibling node expansion, performing further iterative boundary expansion to the multiple item trees in a direction which is reverse to the direction of the current step of iteration.
In an implementation, the boundary detection may comprise: performing similarity check to the multiple item trees.
The similarity check may be performed in response to determining that the number of nodes in at least one item tree in the multiple item trees exceeds a node number threshold.
The similarity check may comprise: calculating a tree similarity between any two item trees in the multiple item trees; dividing the multiple item trees into at least one tree set at least with a similarity threshold, item trees in each tree set in the at least one tree set having tree similarities, that are not lower than the similarity threshold, among each other; determining whether the number of item trees in a tree set containing the highest number of item trees in the at least one tree set is lower than a tree number threshold; and in response to determining that the number of item trees is lower than the tree number threshold, stopping the performing of the iterative boundary expansion, and excluding nodes, that are determined through a predetermined number of previous steps of iteration, from the multiple item trees respectively.
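The grouping and stopping decision of the similarity check may be sketched as follows. As an assumption made only for this example, each item tree is summarized as a set of tag paths and the tree similarity is a Jaccard coefficient over those sets; the present disclosure does not limit the tree similarity to this measure, and the function names are hypothetical.

```python
from typing import FrozenSet, List, Sequence


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Placeholder tree similarity: Jaccard coefficient over tag-path sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def divide_into_tree_sets(trees: Sequence[FrozenSet[str]],
                          similarity_threshold: float) -> List[List[int]]:
    """Greedily build sets whose members are all mutually similar enough."""
    tree_sets: List[List[int]] = []
    for i, tree in enumerate(trees):
        for tree_set in tree_sets:
            if all(jaccard(tree, trees[j]) >= similarity_threshold for j in tree_set):
                tree_set.append(i)
                break
        else:
            tree_sets.append([i])
    return tree_sets


def should_stop(trees: Sequence[FrozenSet[str]], similarity_threshold: float,
                tree_number_threshold: int) -> bool:
    """Stop expanding when the largest tree set has too few item trees."""
    tree_sets = divide_into_tree_sets(trees, similarity_threshold)
    largest = max((len(s) for s in tree_sets), default=0)
    return largest < tree_number_threshold


similar = [
    frozenset({"li/img", "li/h3", "li/span"}),
    frozenset({"li/img", "li/h3", "li/span"}),
    frozenset({"li/img", "li/h3"}),
    frozenset({"div/nav", "div/a"}),
]
diverged = [frozenset({"a", "b"}), frozenset({"c", "d"}), frozenset({"e", "f"})]
```

With a similarity threshold of 0.6 and a tree number threshold of 3, the `similar` trees yield one tree set of three members, so expansion would continue, whereas the `diverged` trees yield only singleton sets, so expansion would stop.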
The calculating a tree similarity may comprise at least one of: calculating the tree similarity at least with a matching weight calculated based on a CSS similarity between root nodes of the two item trees; and calculating the tree similarity with nodes within a minimum depth layer level in the two item trees, the minimum depth layer level being defined such that: the number of visible nodes within the minimum depth layer level in an item tree reaches a predetermined proportion of the number of all visible nodes in the item tree.
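The definition of the minimum depth layer level may be illustrated by the following sketch, which returns the smallest depth at which the count of visible nodes reaches the predetermined proportion. Representing each node as a `(depth, visible)` pair and counting the root of the item tree as depth 0 are assumptions made for this example.

```python
import math
from typing import List, Tuple


def minimum_depth_layer_level(nodes: List[Tuple[int, bool]], proportion: float) -> int:
    """Smallest depth d such that visible nodes with depth <= d reach the proportion.

    `nodes` holds one (depth, visible) pair per node of an item tree.
    """
    visible_depths = sorted(depth for depth, visible in nodes if visible)
    if not visible_depths:
        return 0
    # Number of visible nodes that must lie within the returned depth.
    needed = max(1, math.ceil(proportion * len(visible_depths)))
    return visible_depths[needed - 1]


# Ten visible nodes at depths 0..3, plus one invisible node that is ignored.
tree_nodes = [
    (0, True),
    (1, True), (1, True), (1, False),
    (2, True), (2, True), (2, True),
    (3, True), (3, True), (3, True), (3, True),
]
level = minimum_depth_layer_level(tree_nodes, proportion=0.5)
```

For the example tree, depth 1 covers only three of the ten visible nodes while depth 2 covers six, so a proportion of 0.5 gives a minimum depth layer level of 2, and only nodes down to that depth would take part in the similarity calculation.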
The method may further comprise: resetting the multiple item trees to a state at a predetermined previous step of iteration; and if a next step of iteration after the predetermined previous step of iteration is sibling node expansion, performing further iterative boundary expansion to the multiple item trees in a direction which is reverse to the direction of the next step of iteration.
In an implementation, the obtaining multiple groups of representative metadata may comprise, for each item in the multiple items: identifying a group of leaf nodes in an item tree that is identified by a boundary of the item; extracting a group of initial metadata corresponding to the group of leaf nodes; determining a group of tags corresponding to the group of initial metadata; ranking the group of initial metadata with the group of tags; and selecting one or more highest-ranked initial metadata as a group of representative metadata corresponding to the item.
The determining a group of tags may comprise: forming a token sequence with the group of initial metadata, each token in the token sequence corresponding to an initial metadata in the group of initial metadata; calculating a feature set for each token in the token sequence; and generating a tag for each token based on multiple feature sets of multiple tokens in the token sequence, through a previously-trained tagger model.
In an implementation, the at least one anchor element group may comprise a second anchor element group. The method may further comprise: performing boundary detection to multiple anchor elements in the second anchor element group, to obtain boundaries of multiple items respectively associated with the multiple anchor elements, the multiple items corresponding to a second original list in the target web page. The method may further comprise, before obtaining multiple groups of representative metadata or before visualizing the multiple groups of representative metadata into a structured list: determining visual features of the first original list and visual features of the second original list respectively with boundaries of items in the first original list and boundaries of items in the second original list; and determining, based on the visual features of the first original list and the visual features of the second original list, that the first original list is a dominant list in the first original list and the second original list.
The visual features may comprise at least one of: minimum boundary distance between adjacent items; list position; and item content richness.
In an implementation, the structured list may be presented in a search result page provided by a search service.
It should be understood that the method 1400 may further comprise any step/process for list extraction and visualization in web pages according to the embodiments of the present disclosure described above.
FIG.15 illustrates an exemplary apparatus 1500 for list extraction and visualization in web pages according to an embodiment.
The apparatus 1500 may comprise: an anchor element group detecting module 1510, for detecting at least one anchor element group in a target web page, the at least one anchor element group comprising a first anchor element group; a boundary detecting module 1520, for performing boundary detection to multiple anchor elements in the first anchor element group, to obtain boundaries of multiple items respectively associated with the multiple anchor elements, the multiple items corresponding to a first original list in the target web page; a representative metadata obtaining module 1530, for obtaining, from the target web page, multiple groups of representative metadata respectively corresponding to the multiple items, with the boundaries of the multiple items; and a representative metadata visualizing module 1540, for visualizing the multiple groups of representative metadata into a structured list.
Moreover, the apparatus 1500 may further comprise any other modules that are configured for performing any operation of the methods for list extraction and visualization in web pages according to the embodiments of the present disclosure described above.
FIG.16 illustrates an exemplary apparatus 1600 for list extraction and visualization in web pages according to an embodiment.
The apparatus 1600 may comprise at least one processor 1610. The apparatus 1600 may further comprise a memory 1620 connected with at least one processor 1610. The memory 1620 may store computer-executable instructions that, when executed, cause the at least one processor 1610 to: detect at least one anchor element group in a target web page, the at least one anchor element group comprising a first anchor element group; perform boundary detection to multiple anchor elements in the first anchor element group, to obtain boundaries of multiple items respectively associated with the multiple anchor elements, the multiple items corresponding to a first original list in the target web page; obtain, from the target web page, multiple groups of representative metadata respectively corresponding to the multiple items, with the boundaries of the multiple items; and visualize the multiple groups of representative metadata into a structured list. Moreover, the at least one processor 1610 may also be configured for performing any other operation of the methods for list extraction and visualization in web pages according to the embodiments of the present disclosure described above.
The embodiments of the present disclosure propose a computer program product for list extraction and visualization in web pages. The computer program product comprises a computer program that is executed by at least one processor for: detecting at least one anchor element group in a target web page, the at least one anchor element group comprising a first anchor element group; performing boundary detection to multiple anchor elements in the first anchor element group, to obtain boundaries of multiple items respectively associated with the multiple anchor elements, the multiple items corresponding to a first original list in the target web page; obtaining, from the target web page, multiple groups of representative metadata respectively corresponding to the multiple items, with the boundaries of the multiple items; and visualizing the multiple groups of representative metadata into a structured list. Moreover, the computer program may further be executed by the at least one processor for performing any other operation of the methods for list extraction and visualization in web pages according to the embodiments of the present disclosure described above.
The embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any step/process of the methods for list extraction and visualization in web pages according to the embodiments of the present disclosure described above.
It should be appreciated that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.
In addition, the articles "a" and "an" as used in this description and appended claims, unless otherwise specified or clear from the context that they are for the singular form, should generally be interpreted as meaning "one" or "one or more."
It should also be appreciated that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
Processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a micro-processor, micro-controller, digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by a microprocessor, micro-controller, DSP, or other suitable platform.
Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, etc. The software may reside on a computer-readable medium. A computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk. Although a memory is shown as being separate from the processor in various aspects presented in this disclosure, a memory may also be internal to the processor (e.g., a cache or a register).
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are intended to be covered by the claims.

Claims

1. A method for list extraction and visualization in web pages, comprising: detecting at least one anchor element group in a target web page, the at least one anchor element group comprising a first anchor element group; performing boundary detection to multiple anchor elements in the first anchor element group, to obtain boundaries of multiple items respectively associated with the multiple anchor elements, the multiple items corresponding to a first original list in the target web page; obtaining, from the target web page, multiple groups of representative metadata respectively corresponding to the multiple items, with the boundaries of the multiple items; and visualizing the multiple groups of representative metadata into a structured list.
2. The method of claim 1, wherein the detecting at least one anchor element group comprises: identifying multiple html elements, that meet anchor element constraints, in the target web page as multiple identified anchor elements; extracting, from the target web page, a property set of each identified anchor element in the multiple identified anchor elements; and clustering the multiple identified anchor elements into the at least one anchor element group based on multiple property sets of the multiple identified anchor elements.
3. The method of claim 1, wherein the boundary detection comprises: based on a Document Object Model (DOM) tree corresponding to the target web page, performing iterative boundary expansion synchronously by taking the multiple anchor elements as starting points respectively, to obtain multiple item trees respectively originating from the multiple anchor elements, wherein each item tree represents an item and comprises multiple nodes, and each node corresponds to an element determined through the iterative boundary expansion.
4. The method of claim 3, wherein the iterative boundary expansion comprises: for each item tree and in each step of iteration, expanding to a next node and including the next node into the item tree.
5. The method of claim 4, wherein the iterative boundary expansion comprises at least one of: performing sibling node expansion, to expand from the current node to a sibling node of the current node; and performing parent node expansion, to expand to a parent node of the current node after all sibling nodes of the current node have been included in the item tree.
6. The method of claim 4, wherein the boundary detection comprises: determining whether the current step of iteration results in that node overlap occurs between the item tree and at least one another item tree in the multiple item trees; and in response to determining that the node overlap occurs, stopping the performing of the iterative boundary expansion, and excluding nodes, that are determined through the current step of iteration, from the multiple item trees respectively.
7. The method of claim 6, further comprising: if the current step of iteration is sibling node expansion, performing further iterative boundary expansion to the multiple item trees in a direction which is reverse to the direction of the current step of iteration.
8. The method of claim 3, wherein the boundary detection comprises: performing similarity check to the multiple item trees.
9. The method of claim 8, wherein the similarity check comprises: calculating a tree similarity between any two item trees in the multiple item trees; dividing the multiple item trees into at least one tree set at least with a similarity threshold, item trees in each tree set in the at least one tree set having tree similarities, that are not lower than the similarity threshold, among each other; determining whether the number of item trees in a tree set containing the highest number of item trees in the at least one tree set is lower than a tree number threshold; and in response to determining that the number of item trees is lower than the tree number threshold, stopping the performing of the iterative boundary expansion, and excluding nodes, that are determined through a predetermined number of previous steps of iteration, from the multiple item trees respectively.
10. The method of claim 9, wherein the calculating a tree similarity comprises at least one of: calculating the tree similarity at least with a matching weight calculated based on a CSS similarity between root nodes of the two item trees; and calculating the tree similarity with nodes within a minimum depth layer level in the two item trees, the minimum depth layer level being defined such that: the number of visible nodes within the minimum depth layer level in an item tree reaches a predetermined proportion of the number of all visible nodes in the item tree.
11. The method of claim 9, further comprising: resetting the multiple item trees to a state at a predetermined previous step of iteration; and if a next step of iteration after the predetermined previous step of iteration is sibling node expansion, performing further iterative boundary expansion to the multiple item trees in a direction which is reverse to the direction of the next step of iteration.
12. The method of claim 1, wherein the obtaining multiple groups of representative metadata comprises, for each item in the multiple items: identifying a group of leaf nodes in an item tree that is identified by a boundary of the item; extracting a group of initial metadata corresponding to the group of leaf nodes; determining a group of tags corresponding to the group of initial metadata; ranking the group of initial metadata with the group of tags; and selecting one or more highest-ranked initial metadata as a group of representative metadata corresponding to the item.
13. The method of claim 1, wherein the at least one anchor element group comprises a second anchor element group, the method further comprises: performing boundary detection to multiple anchor elements in the second anchor element group, to obtain boundaries of multiple items respectively associated with the multiple anchor elements, the multiple items corresponding to a second original list in the target web page, and the method further comprises, before obtaining multiple groups of representative metadata or before visualizing the multiple groups of representative metadata into a structured list: determining visual features of the first original list and visual features of the second original list respectively with boundaries of items in the first original list and boundaries of items in the second original list; and determining, based on the visual features of the first original list and the visual features of the second original list, that the first original list is a dominant list in the first original list and the second original list.
14. An apparatus for list extraction and visualization in web pages, comprising: at least one processor; and a memory storing computer-executable instructions that, when executed, cause the at least one processor to: detect at least one anchor element group in a target web page, the at least one anchor element group comprising a first anchor element group, perform boundary detection to multiple anchor elements in the first anchor element group, to obtain boundaries of multiple items respectively associated with the multiple anchor elements, the multiple items corresponding to a first original list in the target web page, obtain, from the target web page, multiple groups of representative metadata respectively corresponding to the multiple items, with the boundaries of the multiple items, and visualize the multiple groups of representative metadata into a structured list.
15. A computer program product for list extraction and visualization in web pages, comprising a computer program that is executed by at least one processor for: detecting at least one anchor element group in a target web page, the at least one anchor element group comprising a first anchor element group; performing boundary detection to multiple anchor elements in the first anchor element group, to obtain boundaries of multiple items respectively associated with the multiple anchor elements, the multiple items corresponding to a first original list in the target web page; obtaining, from the target web page, multiple groups of representative metadata respectively corresponding to the multiple items, with the boundaries of the multiple items; and visualizing the multiple groups of representative metadata into a structured list.
PCT/US2022/048129 2022-01-14 2022-10-28 List extraction and visualization in web pages WO2023136875A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210040984.0A CN116484126A (en) 2022-01-14 2022-01-14 List extraction and visualization in web pages
CN202210040984.0 2022-01-14

Publications (1)

Publication Number Publication Date
WO2023136875A1 (en) 2023-07-20

Family

ID=84361964

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/048129 WO2023136875A1 (en) 2022-01-14 2022-10-28 List extraction and visualization in web pages

Country Status (2)

Country Link
CN (1) CN116484126A (en)
WO (1) WO2023136875A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116975167B (en) * 2023-09-20 2024-02-27 联通在线信息科技有限公司 Metadata grading method and system based on weighted Jaccard coefficient

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
THAMVISET WACHIRAWUT ET AL: "Information extraction for deep web using repetitive subject pattern", WORLD WIDE WEB, vol. 17, no. 5, 14 August 2013 (2013-08-14), NL, pages 1109 - 1139, XP093022672, ISSN: 1386-145X, Retrieved from the Internet <URL:http://link.springer.com/article/10.1007/s11280-013-0248-y/fulltext.html> [retrieved on 20230208], DOI: 10.1007/s11280-013-0248-y *

Also Published As

Publication number Publication date
CN116484126A (en) 2023-07-25

Similar Documents

Publication Publication Date Title
JP5501373B2 (en) System and method for collecting and ranking data from multiple websites
Liu et al. Vide: A vision-based approach for deep web data extraction
US7680858B2 (en) Techniques for clustering structurally similar web pages
US7904455B2 (en) Cascading cluster collages: visualization of image search results on small displays
US8046681B2 (en) Techniques for inducing high quality structural templates for electronic documents
US9183281B2 (en) Context-based document unit recommendation for sensemaking tasks
CN103955529B (en) A kind of internet information search polymerize rendering method
US20090248707A1 (en) Site-specific information-type detection methods and systems
Di Giacomo et al. Graph visualization techniques for web clustering engines
US8001106B2 (en) Systems and methods for tokenizing and interpreting uniform resource locators
JP6116247B2 (en) System and method for searching for documents with block division, identification, indexing of visual elements
US20150067476A1 (en) Title and body extraction from web page
US20110282877A1 (en) Method and system for automatically extracting data from web sites
US20150287047A1 (en) Extracting Information from Chain-Store Websites
WO2023136875A1 (en) List extraction and visualization in web pages
US8983980B2 (en) Domain constraint based data record extraction
Jannach et al. Automated ontology instantiation from tabular web sources—the AllRight system
Alcic et al. Measuring performance of web image context extraction
CA2614774A1 (en) Method and system for automatically extracting data from web sites
Bing et al. Robust detection of semi-structured web records using a dom structure-knowledge-driven model
Zeng et al. A web page segmentation approach using visual semantics
Zeleny et al. Cluster-based Page Segmentation-a fast and precise method for web page pre-processing
Kudělka et al. Web pages reordering and clustering based on Web patterns
Ramya et al. Automatic extraction of facets for user queries [AEFUQ]
Devera et al. Team 3: Object Detection and Topic Modeling (Objects&Topics) CS 5604 F2022

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22812992

Country of ref document: EP

Kind code of ref document: A1