CN110020247B - Webpage key module extraction method and device - Google Patents

Webpage key module extraction method and device

Info

Publication number
CN110020247B
Authority
CN
China
Prior art keywords
webpage
module
parent
link
ancestor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711402540.2A
Other languages
Chinese (zh)
Other versions
CN110020247A (en)
Inventor
初光磊
丁彬
段盼盼
李学环
齐骥
钱岭
吴昊天
邱雨
王瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Application filed by China Mobile Communications Group Co Ltd and China Mobile Suzhou Software Technology Co Ltd
Priority to CN201711402540.2A
Publication of CN110020247A
Application granted
Publication of CN110020247B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/954: Navigation, e.g. using categorised browsing
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Abstract

The invention discloses a method and a device for extracting a webpage key module.

Description

Webpage key module extraction method and device
Technical Field
The invention relates to the technical field of internet, in particular to a method and a device for extracting a webpage key module.
Background
In today's era of information explosion, networks play a very important role, and the content of web pages has become rich and complex. A web page may contain navigation, titles, body text, timestamps, and even advertisements, and these different types of data are mixed together, which makes it difficult for users to extract the key, useful information.
In the prior art, extracting effective content from a web page usually requires a fine-grained analysis of the page. At present, web page information extraction is often based on visual features, using the Vision-based Page Segmentation (VIPS) algorithm, which extracts the key information of a page from its Document Object Model (DOM) tree structure and its visual features. Specifically, the content of each part of the page is judged visually, and the DOM tree structure is used to decide whether a node can be divided, that is, whether the tag corresponding to each part of the page has sub-pages. The sub-pages are divided repeatedly until no further division is possible, and the parts that are directly visible are extracted as key information. Consequently, if the key information on a web page cannot be identified visually, it cannot be extracted.
Disclosure of Invention
The invention aims to provide a method and a device for extracting a webpage key module, which are used for solving the problem of how to extract the webpage key module when visual features are lacked.
The purpose of the invention is realized by the following technical scheme:
In one aspect, the invention provides a webpage key module extraction method, which comprises the following steps:
obtaining the valid links contained in a target webpage and the tags containing the valid links, wherein a valid link is a detail page link pointing to the inside of the webpage;
determining a common parent webpage module containing the tags of the valid links;
and taking the parent webpage module that contains the largest number of valid links in the common parent webpage module as the webpage key module.
Optionally, the obtaining of the valid links contained in the target webpage and the tags containing the valid links includes:
obtaining all tags in the target webpage that directly contain links;
deleting the invalid links from the links directly contained in all the tags to obtain the valid links contained in the target webpage;
and deleting the tags corresponding to the invalid links from all the tags to obtain the tags containing the valid links.
Optionally, the invalid link includes at least one of the following links:
a link in which the domain name is inconsistent with the domain name of the web page;
a link containing a predetermined keyword;
the keywords are keywords in a non-detail page link common keyword table and keywords in a useless link common keyword table.
Optionally, the determining of a common parent webpage module containing the tags of the valid links includes:
obtaining the parent webpage module of each of the tags containing the valid links;
combining the parent webpage modules of the tags pairwise and, for each combination, determining whether an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, wherein the ancestor webpage modules are the parent webpage modules of the parent webpage modules included in the combination;
if an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, taking the ancestor webpage module that is the ancestor as the common parent webpage module;
if no ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination and the ancestor webpage modules are the same, performing hierarchical clustering on the parent webpage modules included in the combination and each parent webpage module under that ancestor webpage module, and taking the minimum common parent webpage module of the clustered parent webpage modules as the common parent webpage module.
Optionally, before the hierarchical clustering is performed on the parent webpage modules included in the combination and each parent webpage module under their ancestor webpage module, the method further includes:
obtaining the parent webpage module node chain corresponding to each parent webpage module node in the combination;
determining the minimum common parent webpage module node of the parent webpage module node chains corresponding to the parent webpage module nodes;
determining the relative path from each webpage module node to the minimum common parent webpage module node, and determining the node names on the relative path;
and determining, according to the similarity of the node names on the relative paths, that the parent webpage modules in the combination are similar and aggregatable.
Another aspect of the present invention provides a device for extracting a key module of a web page, including:
an acquisition unit, configured to obtain the valid links contained in a target webpage and the tags containing the valid links, wherein a valid link is a detail page link pointing to the inside of the webpage;
and a processing unit, configured to determine a common parent webpage module containing the tags of the valid links, and to take the parent webpage module that contains the largest number of valid links in the common parent webpage module as the webpage key module.
Optionally, the acquisition unit is configured to obtain the valid links contained in the target webpage and the tags containing the valid links as follows:
obtaining all tags in the target webpage that directly contain links;
deleting the invalid links from the links directly contained in all the tags to obtain the valid links contained in the target webpage;
and deleting the tags corresponding to the invalid links from all the tags to obtain the tags containing the valid links.
Optionally, the invalid link includes at least one of the following links:
a link in which the domain name is inconsistent with the domain name of the web page;
a link containing a predetermined keyword;
the keywords are keywords in a non-detail page link common keyword table and keywords in a useless link common keyword table.
Optionally, the processing unit is specifically configured to determine the common parent webpage module containing the tags of the valid links in the following manner:
obtaining the parent webpage module of each of the tags containing the valid links;
combining the parent webpage modules of the tags pairwise and, for each combination, determining whether an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, wherein the ancestor webpage modules are the parent webpage modules of the parent webpage modules included in the combination;
if an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, taking the ancestor webpage module that is the ancestor as the common parent webpage module;
if no ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination and the ancestor webpage modules are the same, performing hierarchical clustering on the parent webpage modules included in the combination and each parent webpage module under that ancestor webpage module, and taking the minimum common parent webpage module of the clustered parent webpage modules as the common parent webpage module.
Optionally, the acquisition unit is further configured to obtain the parent webpage module node chain corresponding to each parent webpage module node in the combination;
the processing unit is further configured to:
determine the minimum common parent webpage module node of the parent webpage module node chains obtained by the acquisition unit;
determine the relative path from each webpage module node to the minimum common parent webpage module node, and determine the node names on the relative path;
and determine, according to the similarity of the node names on the relative paths, that the parent webpage modules in the combination are similar and aggregatable.
The invention also provides a webpage key module extraction device, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor; when the processor executes the program, the webpage key module extraction method is realized.
The invention also provides a computer storage medium, which stores program instructions, and the program instructions are used for realizing the webpage key module extraction method when being executed by a processor.
According to the webpage key module extraction method and device of the invention, the valid links contained in a target webpage and the tags containing the valid links are obtained, a common parent webpage module containing the tags of the valid links is determined, and the parent webpage module that contains the largest number of valid links in the common parent webpage module is used as the webpage key module, so that the webpage key module can be extracted even when visual features are lacking.
Drawings
FIG. 1 is a flowchart of a method for extracting a webpage key module according to an embodiment of the present invention;
FIG. 2 is a flowchart of obtaining the valid links contained in a target webpage and the tags containing the valid links according to an embodiment of the present invention;
FIG. 3 is a flowchart of determining a common parent webpage module containing the tags of the valid links according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for determining how to perform hierarchical clustering on the parent webpage modules included in a combination according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of DOM tree nodes according to an embodiment of the present invention;
FIG. 6 is a block diagram of the structure of a device for extracting a webpage key module according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the webpage key module extraction method and device provided by the embodiments of the invention, the valid links contained in a target webpage and the tags containing the valid links are obtained, a common parent webpage module containing the tags of the valid links is determined, and the parent webpage module that contains the most valid links in the common parent webpage module is used as the webpage key module, which solves the problem of how to extract the webpage key module when visual features are lacking.
FIG. 1 is a flowchart of a method for extracting a webpage key module according to an embodiment of the present invention. Referring to FIG. 1, the method may include the following steps:
S101: Obtain the valid links contained in the target webpage and the tags containing the valid links.
Generally, a web page contains many kinds of links, and to view information in a targeted manner, the key information among these links needs to be selected.
In the embodiment of the invention, the valid links contained in the target webpage can be obtained from the target webpage. A valid link is a detail page link pointing to the inside of the webpage, and a detail page link can be understood as the link that contains the most information content in the webpage.
In the embodiment of the invention, the source code of the target webpage can be downloaded, all tags containing links can be searched for in the source code, and the tags of the valid links can be selected from them.
S102: Determine a common parent webpage module containing the tags of the valid links.
In the embodiment of the invention, the parent webpage modules containing the tags of the valid links can be searched for, and the common parent webpage module of all these parent webpage modules can then be determined.
S103: Take the parent webpage module that contains the largest number of valid links in the common parent webpage module as the webpage key module.
In the embodiment of the invention, after the common parent webpage module is determined, the parent webpage modules that contain valid links within the common parent webpage module can be searched for, and the one containing the largest number of valid links can be selected as the key module of the webpage.
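The selection in S103 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pick_key_module`, the dictionary `links_by_module`, and the module labels such as `div#list` are hypothetical and do not appear in the patent. The sketch assumes the valid links have already been grouped by the parent webpage module that contains them.

```python
def pick_key_module(links_by_module):
    """Return the module label with the most valid links, or None if there are no modules."""
    if not links_by_module:
        return None
    # max() keeps the first module encountered when several modules tie.
    return max(links_by_module, key=lambda module: len(links_by_module[module]))


# Hypothetical grouping of valid links by parent webpage module.
links_by_module = {
    "div#sidebar": ["/news/7.html"],
    "div#list": ["/news/1.html", "/news/2.html", "/news/3.html"],
    "table#footer": [],
}

key_module = pick_key_module(links_by_module)
```

How ties between modules should be resolved is not stated in the text; the sketch simply keeps the first module with the maximum count.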
In a possible implementation, the method shown in FIG. 2 may be used to obtain the valid links contained in the target webpage and the tags containing the valid links. Referring to FIG. 2, the method includes:
S1011: Obtain all tags in the target webpage that directly contain links.
S1012: Delete the invalid links from the links directly contained in all the tags to obtain the valid links contained in the target webpage.
S1013: Delete the tags corresponding to the invalid links from all the tags to obtain the tags containing the valid links.
Specifically, an invalid link in the embodiment of the present invention may be a link pointing to an external website, that is, a link whose domain name is inconsistent with the domain name of the webpage, or a link containing a preset keyword.
In one possible implementation, the domain name of the target webpage may be obtained, all tags directly containing links are obtained from the target webpage, and the most basic tag, for example the a tag, is selected from them. To judge whether a link is valid, the domain name of the link corresponding to each a tag is compared with the domain name of the target webpage; if the two are inconsistent, the link corresponding to that a tag is regarded as an invalid link and is deleted.
In another possible implementation, a table of keywords common to non-detail-page links and a table of keywords common to useless links may be preset; for example, the keywords may be set to "advertisement", "recommendation", and the like. When such a keyword appears in a link, the link is considered invalid and is deleted.
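Steps S1011 to S1013 can be sketched with the Python standard library. This is a minimal sketch, not the patented implementation: the names `LinkCollector`, `valid_links` and `INVALID_KEYWORDS`, the keyword values, and the `example.com` URLs are all hypothetical, and matching the keywords against the link URL (rather than, say, the anchor text) is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical keyword table; the text only names advertisements and
# recommendations as examples of such keywords.
INVALID_KEYWORDS = ("advert", "recommend")


class LinkCollector(HTMLParser):
    """Collects the href of every a tag found in the page source."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def valid_links(page_url, html):
    """Keep links on the page's own domain that contain no invalid keyword."""
    collector = LinkCollector()
    collector.feed(html)
    page_domain = urlparse(page_url).netloc
    kept = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc != page_domain:
            continue  # domain name inconsistent with the page: invalid link
        if any(keyword in absolute.lower() for keyword in INVALID_KEYWORDS):
            continue  # link contains a preset keyword: invalid link
        kept.append(absolute)
    return kept


PAGE = """
<div><a href="/news/1.html">A</a><a href="/news/2.html">B</a></div>
<div><a href="http://other.example.org/x">ext</a><a href="/advert/3.html">ad</a></div>
"""

links = valid_links("http://www.example.com/index.html", PAGE)
```

The sketch returns only the URLs; a fuller version would also keep a reference to each a tag so that the tags of invalid links can be dropped as in S1013.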
In an embodiment of the present invention, after the tags of the valid links are obtained, a common parent webpage module containing the tags of the valid links may be determined. In a possible implementation, the method shown in FIG. 3 may be used. Referring to FIG. 3, the method includes:
S1021: Obtain the parent webpage module of each of the tags containing the valid links.
Generally, a set of information with obvious visual identification features in a webpage can be regarded as a webpage module, and a webpage module is generally a block-level tag such as div or table.
In the DOM tree, all nodes are associated with one another. If the tag of a node is a block-level tag, namely a div tag, a table tag or a body tag, the node is regarded as a block node, and a block node can also be called a webpage module. In the following description of the embodiments of the present invention, the terms block node and webpage module are sometimes used interchangeably, and those skilled in the art should understand that they have the same meaning.
Taking the a tag of a valid link as an example, the following describes in detail how the parent webpage module of each of the tags containing valid links is obtained.
All the a tags of the valid links form a queue, and the parent block node of each a tag in the queue is searched for. If the tag of the parent block node is a body tag, that parent block node is ignored; otherwise, the parent block node found for the a tag is added to a queue. The set formed by all the parent block nodes found for the a tags, excluding body tags, is regarded as the parent webpage modules.
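The search for parent block nodes in S1021 can be sketched as follows. The `Node` class is a hypothetical stand-in for a DOM node (a real implementation would use an HTML parser's tree), the function names are illustrative, and removing duplicates from the queue is an assumption based on the text calling the result a set.

```python
BLOCK_TAGS = {"div", "table", "body"}


class Node:
    """Hypothetical minimal DOM node that only knows its tag and its parent."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent

    def add(self, tag):
        return Node(tag, parent=self)


def parent_block(node):
    """Nearest ancestor whose tag is a block tag, or None if there is none."""
    current = node.parent
    while current is not None and current.tag not in BLOCK_TAGS:
        current = current.parent
    return current


def parent_modules(a_tags):
    """Queue of parent block nodes of the a tags, ignoring body and duplicates."""
    queue = []
    for a in a_tags:
        block = parent_block(a)
        if block is None or block.tag == "body":
            continue  # a parent block node that is the body tag is ignored
        if block not in queue:
            queue.append(block)
    return queue


body = Node("body")
div = body.add("div")
table = body.add("table")
a1 = div.add("p").add("a")             # nearest block ancestor: div
a2 = div.add("a")                      # nearest block ancestor: div
a3 = table.add("tr").add("td").add("a")  # nearest block ancestor: table
a4 = body.add("a")                     # nearest block ancestor: body, ignored

block_pre = parent_modules([a1, a2, a3, a4])
```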
S1022: Combine the parent webpage modules of the tags pairwise and, for each combination, determine whether an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination.
In the embodiment of the invention, different tags may correspond to different parent webpage modules. Any two different parent webpage modules can be selected to form a two-element set, and the parent webpage module of each of the two is determined. All two-element sets are traversed, and for each one it is determined whether an ancestor-descendant relationship exists between the parent webpage modules of its two parent webpage modules.
In a possible implementation, assume that the queue formed by the parent block nodes found for the a tags in step S1021 is block_pre. Because different a tags may correspond to different parent block nodes, two different parent block nodes, for example m and n, may be selected from block_pre; their parent block nodes M_p and N_p are found, and it is then determined whether an ancestor-descendant relationship exists between M_p and N_p.
S1023: If an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, take the ancestor webpage module that is the ancestor as the common parent webpage module.
In the embodiment of the present invention, if an ancestor-descendant relationship exists between M_p and N_p, and M_p is the ancestor relative to N_p, then M_p is taken as the common parent webpage module.
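The ancestor-descendant test of S1022 and S1023 can be sketched as below. The `Node` class and the function names `is_ancestor` and `common_parent_if_nested` are hypothetical; the sketch only assumes that each node can reach its parent.

```python
class Node:
    """Hypothetical minimal DOM node that only knows its tag and its parent."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent


def is_ancestor(candidate, node):
    """True if candidate lies on the parent chain strictly above node."""
    current = node.parent
    while current is not None:
        if current is candidate:
            return True
        current = current.parent
    return False


def common_parent_if_nested(mp, np_):
    """Return whichever of the two modules is the ancestor of the other, else None."""
    if is_ancestor(mp, np_):
        return mp
    if is_ancestor(np_, mp):
        return np_
    return None


body = Node("body")
outer = Node("div", body)
inner = Node("div", outer)
other = Node("table", body)

nested = common_parent_if_nested(outer, inner)
unrelated = common_parent_if_nested(inner, other)
```

When the function returns None, the two modules are either the same node or sit in different branches, which is the situation handled by the clustering step.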
S1024: If no ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination and the ancestor webpage modules are the same, perform hierarchical clustering on the parent webpage modules included in the combination and each parent webpage module under that ancestor webpage module, and take the minimum common parent webpage module of the clustered parent webpage modules as the common parent webpage module.
In the embodiment of the invention, the ancestor webpage modules M_p and N_p of the tags in the combination (that is, the parent webpage modules of the parent webpage modules included in the combination) may have no ancestor-descendant relationship and be the same module. In that case, in a possible implementation, before hierarchical clustering is performed on the parent webpage modules included in the combination and each parent webpage module under their ancestor webpage module, the method shown in FIG. 4 may be used to determine how to cluster the parent webpage modules included in the combination. Referring to FIG. 4, the method includes:
S10241: Obtain the parent webpage module node chain corresponding to each parent webpage module node in the combination.
In the embodiment of the invention, the parent nodes M_p and N_p of m and n can be obtained. When M_p and N_p are the same node, the parent node chains m_plinks and n_plinks of m and n are obtained. A parent node chain includes all block nodes as well as non-block nodes. For example, FIG. 5 is a schematic diagram of DOM tree nodes. In FIG. 5, M_p (table) indicates that M_p is a block node and MP (div) indicates that MP is a block node; the parent node chain of m is then m_plinks = (m, M_p, MP, A), and the parent node chain of n is n_plinks = (n, N_p, MP, A).
S10242: Determine the minimum common parent webpage module node of the parent webpage module node chains corresponding to the parent webpage module nodes.
In the embodiment of the invention, M_p, MP and A are parent nodes. By comparing the parent nodes on the DOM tree in sequence, M_p is obtained and determined to be the minimum common parent webpage module node.
S10243: Determine the relative path from each webpage module node to the minimum common parent webpage module node, and determine the node names on the relative path.
In the embodiment of the invention, the relative paths of m and n to the minimum common parent webpage module node M_p are taken. For the nodes in FIG. 5, m_rlinks is (m, M_p) and n_rlinks is (n, M_p), and the node names on m_rlinks and n_rlinks are determined.
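Steps S10241 to S10243 can be sketched on a tree shaped like the FIG. 5 example. The `Node` class and the function names are hypothetical. The tags of M_p (table) and MP (div) follow the text, while the tags assigned to m, n, q and A are assumptions made only so the example runs.

```python
class Node:
    """Hypothetical minimal DOM node with a label, a tag and a parent pointer."""

    def __init__(self, name, tag, parent=None):
        self.name = name
        self.tag = tag
        self.parent = parent


def parent_chain(node):
    """The node followed by all of its ancestors, e.g. (m, M_p, MP, A)."""
    chain = [node]
    while chain[-1].parent is not None:
        chain.append(chain[-1].parent)
    return chain


def min_common_parent(chain_a, chain_b):
    """First ancestor on chain_a that also appears on chain_b, or None."""
    on_b = {id(n) for n in chain_b}
    for candidate in chain_a[1:]:  # skip the starting node itself
        if id(candidate) in on_b:
            return candidate
    return None


def relative_path(node, common):
    """Nodes from node up to common, inclusive; common must be an ancestor of node."""
    path = [node]
    while path[-1] is not common:
        path.append(path[-1].parent)
    return path


A = Node("A", "body")        # tag assumed
MP = Node("MP", "div", A)
Mp = Node("Mp", "table", MP)
m = Node("m", "div", Mp)     # tag assumed
n = Node("n", "div", Mp)     # tag assumed
q = Node("q", "div", Mp)     # tag assumed

m_plinks = parent_chain(m)
n_plinks = parent_chain(n)
common = min_common_parent(m_plinks, n_plinks)
m_rlinks = relative_path(m, common)
n_rlinks = relative_path(n, common)
```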
S10244: Determine, according to the similarity of the node names on the relative paths, whether the parent webpage modules in the combination are similar and aggregatable.
In the embodiment of the invention, the similarity of the paths m_rlinks and n_rlinks can be represented by the similarity of the node names on the relative paths.
Specifically, the similarity s is calculated from the node names (m, M_p) and (n, M_p) on the relative paths m_rlinks and n_rlinks according to the following formula:
$$ s = \frac{\mathit{sum} - \mathit{ldist}}{\mathit{sum}} $$
where sum is the total length of the name string m_tags on the relative path of node m and the name string n_tags on the relative path of node n, and ldist is the class edit distance, that is, the minimum total cost of the operations (insertion, deletion, replacement) that convert m_tags into n_tags, where each deletion or insertion counts as 1 and each replacement counts as 2. The calculation of this formula is well known in the art and is not described further herein.
In the embodiment of the invention, after the similarity s between m_tags and n_tags is calculated, it must be judged whether m_rlinks and n_rlinks are similar. Let the similarity threshold be s_threshold: if s is greater than or equal to s_threshold, m_tags and n_tags are considered similar, that is, the tags m and n are similar; otherwise, m_tags and n_tags are considered dissimilar.
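The similarity test can be sketched as below, assuming the ratio (sum - ldist) / sum, which matches the stated definitions of sum and ldist. The function names, the example name strings, and the threshold value 0.8 are hypothetical, and how the node names on a path are joined into one string is also an assumption (plain concatenation here).

```python
def class_edit_distance(a, b):
    """Edit distance where insertion and deletion cost 1 and replacement costs 2."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            replace = previous[j - 1] + (0 if char_a == char_b else 2)
            current.append(min(previous[j] + 1, current[j - 1] + 1, replace))
        previous = current
    return previous[-1]


def similarity(m_tags, n_tags):
    """(sum - ldist) / sum, where sum is the combined length of both strings."""
    total = len(m_tags) + len(n_tags)
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    return (total - class_edit_distance(m_tags, n_tags)) / total


S_THRESHOLD = 0.8  # hypothetical value; the text does not give one

m_tags = "divtable"  # names on the relative path of m, concatenated (assumed)
n_tags = "litable"   # names on the relative path of n, concatenated (assumed)

s = similarity(m_tags, n_tags)
similar = s >= S_THRESHOLD
```

With these example strings the combined length is 15 and the class edit distance is 3, so s is 12/15 = 0.8 and the two paths are judged similar at the assumed threshold.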
If m_tags and n_tags are not similar, the webpage modules corresponding to m and n are deleted; if m_tags and n_tags are similar, it must further be determined whether the nodes q that have a common parent with m and n are aggregatable with m and n.
Since the parent nodes of m and n are M_p and N_p, the following process is described in detail taking node m as an example. The child nodes of M_p shown in FIG. 5 are not limited to the three nodes m, n and q in the figure; assuming that nodes X1, X2, X3 … Xn may also be included, {q, X1, X2, X3 … Xn} is clustered. Assume that the aggregation threshold is c_threshold and that the number of all block nodes having a common parent node with node m is count_a; the aggregation degree is then calculated.
Specifically, the aggregation degree may be calculated as (count_mc + 1)/(count_m + 1), where count_mc + 1 represents the number of clusterable tags and count_m + 1 represents the number of all block nodes having a common parent node with node m. For example, in FIG. 5, count_mc + 1 is 3 and count_m + 1 is 3, so the aggregation degree is 3/3 = 1. If the aggregation degree is less than the threshold c_threshold, {m, q, X1, X2, X3 … Xn} is not aggregatable; if the aggregation degree is greater than the threshold c_threshold, m is aggregated with the remaining nodes q, X1, X2, X3 … Xn that share its parent node.
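The aggregation degree check can be sketched as below. The function names and the threshold value 0.5 are hypothetical. The text does not say what happens when the aggregation degree equals the threshold; treating equality as not aggregatable is an assumption of this sketch.

```python
def aggregation_degree(count_mc, count_m):
    """(count_mc + 1) / (count_m + 1), as described in the text."""
    return (count_mc + 1) / (count_m + 1)


def is_aggregatable(count_mc, count_m, c_threshold):
    """Aggregatable only when the degree is strictly above the threshold (assumed)."""
    return aggregation_degree(count_mc, count_m) > c_threshold


C_THRESHOLD = 0.5  # hypothetical value; the text does not give one

# FIG. 5 example: count_mc + 1 = 3 and count_m + 1 = 3.
degree = aggregation_degree(2, 2)
aggregatable = is_aggregatable(2, 2, C_THRESHOLD)
```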
In the embodiment of the invention, when m and the remaining nodes q, X1, X2, X3 … Xn that share its parent node are aggregatable, m, q, X1, X2, X3 … Xn are clustered and merged, and the common parent webpage module of the clustered and merged parent webpage modules is used as the common parent webpage module.
Specifically, if m and q, X1, X2, X3 … Xn are aggregatable, they are clustered and merged to obtain the common parent node M_p, and M_p is taken as a new parent webpage module.
In the embodiment of the invention, by judging whether the relative paths are similar and aggregating the nodes, the introduction of a redundant module in the extraction of the webpage module can be avoided, and the extraction accuracy of the key module of the target page is improved.
Based on the same concept as the above method embodiment, an embodiment of the present invention further provides a device for extracting a webpage key module. FIG. 6 is a block diagram of the structure of the device. Referring to FIG. 6, the device includes an acquisition unit 101 and a processing unit 102, wherein:
The acquisition unit 101 is configured to obtain the valid links contained in the target webpage and the tags containing the valid links, where a valid link is a detail page link pointing to the inside of the webpage.
The processing unit 102 is configured to determine a common parent webpage module containing the tags of the valid links, and to take the parent webpage module that contains the largest number of valid links in the common parent webpage module as the webpage key module.
Specifically, the acquisition unit 101 is configured to obtain the valid links contained in the target webpage and the tags containing the valid links as follows:
obtaining all tags in the target webpage that directly contain links; deleting the invalid links from the links directly contained in all the tags to obtain the valid links contained in the target webpage; and deleting the tags corresponding to the invalid links from all the tags to obtain the tags containing the valid links.
Optionally, the invalid link includes at least one of the following links:
a link in which the domain name is inconsistent with the domain name of the web page; a link containing a predetermined keyword; the keywords are keywords in a non-detail page link common keyword table and keywords in a useless link common keyword table.
Further, the processing unit 102 is specifically configured to determine the common parent webpage module containing the tags of the valid links in the following manner:
obtaining the parent webpage module of each of the tags containing the valid links; combining the parent webpage modules of the tags pairwise and, for each combination, determining whether an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, wherein the ancestor webpage modules are the parent webpage modules of the parent webpage modules included in the combination; if an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, taking the ancestor webpage module that is the ancestor as the common parent webpage module; if no ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination and the ancestor webpage modules are the same, performing hierarchical clustering on the parent webpage modules included in the combination and each parent webpage module under that ancestor webpage module, and taking the minimum common parent webpage module of the clustered parent webpage modules as the common parent webpage module.
Further, the acquisition unit 101 is further configured to:
obtain the parent webpage module node chain corresponding to each parent webpage module node in the combination.
The processing unit 102 is further configured to:
determine the minimum common parent webpage module node of the parent webpage module node chains obtained by the acquisition unit; determine the relative path from each webpage module node to the minimum common parent webpage module node, and determine the node names on the relative path; and determine, according to the similarity of the node names on the relative paths, that the parent webpage modules in the combination are similar and aggregatable.
The embodiment of the invention also provides webpage key module extraction equipment, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor; when the processor executes the program, the method for extracting the webpage key module is realized.
The embodiment of the invention also provides a computer storage medium, wherein the computer storage medium stores program instructions, and the program instructions are used for realizing the webpage key module extraction method when being executed by a processor.
It should be noted that, in the embodiment of the present invention, the functional implementation of each unit in the device for extracting a webpage key module mentioned above may further refer to the description of the related method embodiment, and is not described herein again.
The present application is described above with reference to block diagrams and/or flowchart illustrations of methods, apparatus (systems) and/or computer program products according to embodiments of the application. It will be understood that one block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, and/or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable data processing apparatus, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks.
Accordingly, the subject application may also be embodied in hardware and/or in software (including firmware, resident software, micro-code, etc.). Furthermore, the present application may take the form of a computer program product on a computer-usable or computer-readable storage medium having computer-usable or computer-readable program code embodied in the medium for use by or in connection with an instruction execution system. In the context of this application, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A webpage key module extraction method is characterized by comprising the following steps:
obtaining a valid link contained in a target webpage and a tag containing the valid link, wherein the valid link is a detail page link pointing to the inside of the webpage;
determining a common parent webpage module containing the tags of the valid links; wherein the determining a common parent webpage module containing the tags of the valid links comprises: acquiring the parent webpage module of each tag among the tags containing the valid links; combining the parent webpage modules of the tags pairwise and, for each combination, determining whether an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, wherein an ancestor webpage module is the parent webpage module of a parent webpage module included in the combination; if an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, taking the ancestor webpage module that is the ancestor in that relationship as a common parent webpage module; if no ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination and the ancestor webpage modules are the same, performing hierarchical clustering on the parent webpage modules included in the combination and the parent webpage modules included under that ancestor webpage module, and taking the minimum common parent webpage module of the clustered parent webpage modules as a common parent webpage module;
and taking, among the common parent webpage modules, the parent webpage module containing the largest number of valid links as the webpage key module.
2. The method of claim 1, wherein the obtaining of the valid link included in the target web page and the tag including the valid link comprises:
acquiring all tags directly containing links in a target webpage;
deleting invalid links from the links directly contained in all the tags to obtain valid links contained in the target webpage;
and deleting the tags corresponding to the invalid links from all the tags to obtain the tags containing the valid links.
3. The method of claim 2, wherein the invalid link comprises at least one of:
a link in which the domain name is inconsistent with the domain name of the web page;
a link containing a predetermined keyword;
the keywords are keywords in a non-detail page link common keyword table and keywords in a useless link common keyword table.
4. The method of claim 1, wherein prior to hierarchically clustering the parent webpage modules included in the combination with the parent webpage modules included under their ancestor webpage module, the method further comprises:
acquiring the parent webpage module node chain corresponding to each parent webpage module node in the combination;
determining the minimum common parent webpage module node of the parent webpage module node chains corresponding to the parent webpage module nodes;
determining the relative path from each webpage module node to the minimum common parent webpage module node, and determining the node names on the relative path;
and determining, according to the similarity of the node names on the relative paths, that the parent webpage modules in the combination are similar and can be aggregated.
5. A webpage key module extraction device is characterized by comprising:
an acquisition unit, configured to acquire a valid link contained in a target webpage and a tag containing the valid link, wherein the valid link is a detail page link pointing to the inside of the webpage;
a processing unit, configured to determine a common parent webpage module containing the tags of the valid links, and to take, among the common parent webpage modules, the parent webpage module containing the largest number of valid links as the webpage key module; wherein the processing unit is specifically configured to determine the common parent webpage module containing the tags of the valid links in the following manner: acquiring the parent webpage module of each tag among the tags containing the valid links; combining the parent webpage modules of the tags pairwise and, for each combination, determining whether an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, wherein an ancestor webpage module is the parent webpage module of a parent webpage module included in the combination; if an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, taking the ancestor webpage module that is the ancestor in that relationship as a common parent webpage module; if no ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination and the ancestor webpage modules are the same, performing hierarchical clustering on the parent webpage modules included in the combination and the parent webpage modules included under that ancestor webpage module, and taking the minimum common parent webpage module of the clustered parent webpage modules as a common parent webpage module.
6. The apparatus of claim 5, wherein the obtaining unit is configured to obtain the valid link included in the target webpage and the tag including the valid link as follows:
acquiring all tags directly containing links in a target webpage;
deleting invalid links from the links directly contained in all the tags to obtain valid links contained in the target webpage;
and deleting the tags corresponding to the invalid links from all the tags to obtain the tags containing the valid links.
7. The apparatus of claim 6, wherein the invalid link comprises at least one of:
a link in which the domain name is inconsistent with the domain name of the web page;
a link containing a predetermined keyword;
the keywords are keywords in a non-detail page link common keyword table and keywords in a useless link common keyword table.
8. The apparatus of claim 6, wherein the obtaining unit is further configured to:
acquiring a parent webpage module node chain corresponding to each parent webpage module node in the combination;
the processing unit is further configured to:
determine the minimum common parent webpage module node of the parent webpage module node chains acquired by the obtaining unit for the parent webpage module nodes;
determine the relative path from each webpage module node to the minimum common parent webpage module node, and determine the node names on the relative path;
and determine, according to the similarity of the node names on the relative paths, that the parent webpage modules in the combination are similar and can be aggregated.
9. A webpage key module extraction device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the webpage key module extraction method according to any one of claims 1 to 4.
10. A computer storage medium having stored thereon program instructions for implementing a web page key module extraction method according to any one of claims 1 to 4 when executed by a processor.
CN201711402540.2A 2017-12-22 2017-12-22 Webpage key module extraction method and device Active CN110020247B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711402540.2A CN110020247B (en) 2017-12-22 2017-12-22 Webpage key module extraction method and device

Publications (2)

Publication Number Publication Date
CN110020247A CN110020247A (en) 2019-07-16
CN110020247B CN110020247B (en) 2021-05-14

Family

ID=67187130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711402540.2A Active CN110020247B (en) 2017-12-22 2017-12-22 Webpage key module extraction method and device

Country Status (1)

Country Link
CN (1) CN110020247B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102254009A (en) * 2011-07-15 2011-11-23 福建星网锐捷通讯股份有限公司 Method for extracting data of webpage table
CN102298638A (en) * 2011-08-31 2011-12-28 北京中搜网络技术股份有限公司 Method and system for extracting news webpage contents by clustering webpage labels
CN102541874A (en) * 2010-12-16 2012-07-04 中国移动通信集团公司 Webpage text content extracting method and device
CN102663023A (en) * 2012-03-22 2012-09-12 浙江盘石信息技术有限公司 Implementation method for extracting web content
CN103838792A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Method for determining webpage theme
CN104462394A (en) * 2012-06-25 2015-03-25 北京奇虎科技有限公司 System and method for recognizing content posts of webpage
KR20160045974A (en) * 2014-10-17 2016-04-28 인포뱅크 주식회사 Apparatus and method for relaying group message
CN105786951A (en) * 2015-12-31 2016-07-20 北京金山安全软件有限公司 Method and device for extracting content blocks in webpage and server
CN106960057A (en) * 2017-04-05 2017-07-18 上海威固信息技术有限公司 A kind of method that Web page text is extracted based on information density

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
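Design and Implementation of a General-Purpose Web Page Content Extraction Module; Luo Chaoran (罗超然); China Master's Theses Full-text Database, Information Science and Technology; 20150815 (No. 8); full text *
Research and Application of Block-Based Topic Information Extraction; Zhang Chao (张超); China Master's Theses Full-text Database, Information Science and Technology; 20100715 (No. 7); Sections 2.2, 3.1-3.3, Figures 3.3-3.4 *
Adaptive Web Information Extraction Method Based on Feature Pre-classification of a Single DOM Tree; Xie Xinting (谢馨庭); China Master's Theses Full-text Database, Information Science and Technology; 20170615 (No. 6); full text *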

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant