CN110020247B

CN110020247B - Webpage key module extraction method and device

Info

Publication number: CN110020247B
Application number: CN201711402540.2A
Authority: CN
Inventors: 初光磊; 丁彬; 段盼盼; 李学环; 齐骥; 钱岭; 吴昊天; 邱雨; 王瑶
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Priority date: 2017-12-22
Filing date: 2017-12-22
Publication date: 2021-05-14
Anticipated expiration: 2037-12-22
Also published as: CN110020247A

Abstract

The invention discloses a method and a device for extracting a webpage key module.

Description

Webpage key module extraction method and device

Technical Field

The invention relates to the technical field of internet, in particular to a method and a device for extracting a webpage key module.

Background

In today's era of information explosion, networks play a very important role, and the information carried by web pages has become rich and complex. A single web page may contain navigation, titles, body text, timestamps and even advertisements; with these types of data mixed together, it is difficult for users to extract the key, useful information.

In the prior art, extracting useful content from a web page usually requires fine-grained analysis of the page. At present, web page information extraction is commonly based on visual features, using the Vision-based Page Segmentation (VIPS) algorithm, which extracts key information from the Document Object Model (DOM) tree structure together with visual features. Specifically, the content of each part of the page is judged visually, and the DOM tree structure is used to decide whether a node can be divided, that is, whether the tag corresponding to each part of the page has sub-pages. The sub-pages are divided repeatedly until they cannot be divided further, and the information that is directly visible is extracted as key information. Consequently, if the key information on a web page cannot be obtained visually, it cannot be extracted.

Disclosure of Invention

The invention aims to provide a method and a device for extracting a webpage key module, which are used for solving the problem of how to extract the webpage key module when visual features are lacked.

The purpose of the invention is realized by the following technical scheme:

In one aspect, the invention provides a webpage key module extraction method, which comprises the following steps:

obtaining a valid link contained in a target webpage and a tag containing the valid link, wherein the valid link is a detail page link pointing to the inside of the webpage;

determining a common parent webpage module containing the tags of the valid links;

and taking, among the common parent webpage modules, the parent webpage module containing the largest number of valid links as the webpage key module.

Optionally, the obtaining of the valid link included in the target webpage and the tag including the valid link includes:

acquiring all tags directly containing links in a target webpage;

deleting invalid links from the links directly contained in all the tags to obtain valid links contained in the target webpage;

and deleting the tags corresponding to the invalid links from all the tags to obtain the tags containing the valid links.

Optionally, the invalid link includes at least one of the following links:

a link in which the domain name is inconsistent with the domain name of the web page;

a link containing a predetermined keyword;

wherein the keywords are keywords in a common keyword table of non-detail-page links and keywords in a common keyword table of useless links.

Optionally, the determining a common parent web page module including the tag of the valid link includes:

acquiring a parent webpage module of each tag in the tags containing the effective links;

combining the parent webpage modules of the tags pairwise and, for each combination, determining whether an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, wherein an ancestor webpage module is the parent webpage module of a parent webpage module included in the combination;

if an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, taking the ancestor webpage module that is the ancestor as a common parent webpage module;

if no ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination and the ancestor webpage modules are the same, performing hierarchical clustering on the parent webpage modules included in the combination and the parent webpage modules included under their ancestor webpage module, and taking the minimum common parent webpage module of the clustered parent webpage modules as a common parent webpage module.

Optionally, before performing hierarchical clustering on the parent webpage module included in the combination and each parent webpage module included under the ancestor webpage module thereof, the method further includes:

acquiring a parent webpage module node chain corresponding to each parent webpage module node in the combination;

determining the minimum common parent webpage module node of the parent webpage module node chains corresponding to the parent webpage module nodes;

determining the relative path from each webpage module node to the minimum common parent webpage module node, and determining the node names on the relative path;

and determining, according to the similarity of the node names on the relative paths, that the parent webpage modules in the combination are similar and aggregatable.

Another aspect of the present invention provides a device for extracting a key module of a web page, including:

an acquisition unit, configured to acquire a valid link contained in a target webpage and a tag containing the valid link, wherein the valid link is a detail page link pointing to the inside of the webpage;

and a processing unit, configured to determine a common parent webpage module containing the tag of the valid link, and to take the parent webpage module containing the largest number of valid links among the common parent webpage modules as the webpage key module.

Optionally, the obtaining unit is configured to obtain an effective link included in the target web page and a tag including the effective link as follows:

acquiring all tags directly containing links in a target webpage;

Optionally, the invalid link includes at least one of the following links:

a link containing a predetermined keyword;

Optionally, the processing unit is specifically configured to process a public parent web page module including the tag of the valid link in the following manner:

Optionally, the obtaining unit is further configured to: acquiring a parent webpage module node chain corresponding to each parent webpage module node in the combination;

the processing unit is further configured to:

determine the minimum common parent webpage module node of the parent webpage module node chains, acquired by the acquisition unit, corresponding to the parent webpage module nodes;

The invention also provides a webpage key module extraction device, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor; when the processor executes the program, the webpage key module extraction method is realized.

The invention also provides a computer storage medium, which stores program instructions, and the program instructions are used for realizing the webpage key module extraction method when being executed by a processor.

According to the webpage key module extraction method and device of the invention, the valid links contained in the target webpage and the tags containing the valid links are obtained, the common parent webpage modules containing the tags of the valid links are determined, and the parent webpage module containing the largest number of valid links among the common parent webpage modules is taken as the webpage key module, so that the webpage key module can be extracted even when visual features are lacking.

Drawings

FIG. 1 is a flowchart of a webpage key module extraction method according to an embodiment of the present invention;

FIG. 2 is a flowchart of an implementation of obtaining the valid links contained in a target webpage and the tags containing the valid links;

FIG. 3 is a flowchart of an implementation of determining a common parent webpage module containing the tags of the valid links according to an embodiment of the present invention;

FIG. 4 is a flowchart of an implementation of determining how to perform hierarchical clustering on the parent webpage modules included in the combination;

FIG. 5 is a schematic diagram of DOM tree nodes according to an embodiment of the present invention;

FIG. 6 is a block diagram of the structure of a webpage key module extraction device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

According to the webpage key module extraction method and device provided by the embodiment of the invention, the valid links contained in the target webpage and the tags containing the valid links are acquired, the common parent webpage modules containing the tags of the valid links are determined, and the parent webpage module containing the most valid links among the common parent webpage modules is taken as the webpage key module, which solves the problem of how to extract the webpage key module when visual features are lacking.

FIG. 1 is a flowchart of a webpage key module extraction method according to an embodiment of the present invention. Referring to FIG. 1, the method may include the following steps:

S101: Acquire the valid links contained in the target webpage and the tags containing the valid links.

Generally, a web page contains various links, and to view information in a targeted manner, the key information among these links needs to be selected.

In the embodiment of the invention, the valid links contained in the target webpage can be acquired from the target webpage. A valid link is a detail page link pointing to the inside of the webpage, and detail page links can be understood as the links carrying the most information content in the webpage.

In the embodiment of the invention, the source code of the target webpage can be downloaded, all tags containing links can be searched for in the source code, and the tags of the valid links can be selected from the tags containing links.

S102: Determine the common parent webpage module containing the tags of the valid links.

In the embodiment of the invention, the common parent webpage module of all the parent webpage modules can be determined by searching for the parent webpage modules containing the tags of the valid links.

S103: Take the parent webpage module containing the largest number of valid links among the common parent webpage modules as the webpage key module.

In the embodiment of the invention, after the common parent webpage modules are determined, the parent webpage modules containing valid links are searched for among them, and the one with the largest number of valid links is selected as the key module of the webpage.
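As a minimal sketch of step S103, the selection can be expressed as a count over a mapping from each valid link to the common parent webpage module that contains it. The function name `pick_key_module` and the dictionary representation are illustrative assumptions, not part of the described method.

```python
from collections import Counter

def pick_key_module(link_to_module):
    """Return the module id that contains the most valid links.

    link_to_module is a hypothetical mapping: valid link URL -> id of the
    common parent webpage module containing that link.
    """
    if not link_to_module:
        return None  # no valid links, so no key module can be chosen
    counts = Counter(link_to_module.values())
    module, _count = counts.most_common(1)[0]
    return module

example = {
    "/news/1.html": "news_list",
    "/news/2.html": "news_list",
    "/topic/9.html": "sidebar",
}
key_module = pick_key_module(example)
```

How ties between modules are broken is not stated in the text; this sketch simply takes whichever module `Counter.most_common` reports first.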

In a possible implementation manner, the method shown in fig. 2 may be used to obtain the valid links included in the target web page and the tags including the valid links, and referring to fig. 2, the method includes:

S1011: Acquire all tags directly containing links in the target webpage.

S1012: Delete the invalid links from the links directly contained in all the tags to obtain the valid links contained in the target webpage.

S1013: Delete the tags corresponding to the invalid links from all the tags to obtain the tags containing the valid links.

Specifically, an invalid link in the embodiment of the present invention may be a link pointing to an external website, that is, a link whose domain name is inconsistent with the domain name of the webpage, or it may be a link containing a preset keyword.

In one possible implementation, the domain name of the target webpage is obtained, all tags directly containing links are obtained from the target webpage, and the most basic tag, for example the a tag, is selected from them. To judge whether a link is valid, the domain name of the link corresponding to each a tag is compared with the domain name of the target webpage; if they are inconsistent, the link corresponding to that a tag is regarded as an invalid link and is deleted.

In another possible implementation, a keyword table for non-detail-page links and a keyword table for useless links may be preset. For example, the keywords may be set to advertisement, recommendation and the like; when such a keyword appears in a link, the link may be regarded as invalid and deleted.
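The two filters described above (domain comparison and keyword tables) can be sketched as follows. This is an illustration under stated assumptions: the keyword values, the `(tag_id, href)` tuple representation, and matching keywords against the URL string are all hypothetical choices, and comparing the full host name is a simplification of "domain name".

```python
from urllib.parse import urljoin, urlparse

# Hypothetical keyword tables; the source does not list their contents.
NON_DETAIL_KEYWORDS = {"login", "register"}
USELESS_KEYWORDS = {"advert", "recommend"}

def is_valid_link(href, page_url, keywords):
    """Keep a link only if it stays on the page's host and has no keyword."""
    absolute = urljoin(page_url, href)  # resolve relative hrefs
    if urlparse(absolute).netloc.lower() != urlparse(page_url).netloc.lower():
        return False  # points to an external site
    lowered = absolute.lower()
    return not any(keyword in lowered for keyword in keywords)

def filter_valid_links(tags, page_url):
    """tags is a list of (tag_id, href) pairs for tags that contain links."""
    keywords = NON_DETAIL_KEYWORDS | USELESS_KEYWORDS
    return [(tag_id, href) for tag_id, href in tags
            if is_valid_link(href, page_url, keywords)]

page = "https://example.com/index.html"
links = [
    ("a1", "/news/123.html"),
    ("a2", "https://other.example.org/x"),
    ("a3", "/advert/1.html"),
]
kept = filter_valid_links(links, page)
```

Dropping a pair from the result removes both the invalid link and its tag, which corresponds to steps S1012 and S1013 together.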

In an embodiment of the present invention, after obtaining the tag of the effective link, a public parent webpage module including the tag of the effective link may be determined, in a possible implementation manner, the method shown in fig. 3 may be used to determine the public parent webpage module including the tag of the effective link, and referring to fig. 3, the method includes:

S1021: Acquire the parent webpage module of each of the tags containing valid links.

Generally, a set of information with obvious visual identification features in a webpage can be regarded as a webpage module, and a webpage module is generally a block-level tag such as div or table.

In the DOM tree, all nodes are associated with one another. If the block-level tag of a node is a div tag, a table tag or a body tag, the node is regarded as a block node, and a block node may also be called a webpage module. In the following description of the embodiments of the present invention, the terms block node and webpage module are sometimes used interchangeably, and those skilled in the art should understand that their meanings are the same.

In the embodiment of the present invention, taking the a tag of a valid link as an example, the acquisition of the parent webpage module of each tag containing a valid link is described in detail below.

All a tags of the valid links are placed in a queue, and the parent block node corresponding to each a tag in the queue is searched for. If the tag of the parent block node is a body tag, that parent block node is ignored; if it is not a body tag, the parent block node found for the a tag is added to a queue. The set formed by all parent block nodes found for the a tags that are not body tags is regarded as the parent webpage modules.
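The lookup of parent block nodes can be sketched with a small stand-in tree. The `Node` class and function names below are hypothetical helpers for illustration; a real implementation would walk the parsed DOM of the downloaded page instead.

```python
BLOCK_TAGS = {"div", "table", "body"}  # block-level tags named in the text

class Node:
    """Minimal stand-in for a DOM node (hypothetical helper)."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent

def parent_block(node):
    """Climb from a node to its nearest enclosing block node, if any."""
    current = node.parent
    while current is not None and current.tag not in BLOCK_TAGS:
        current = current.parent
    return current

def collect_parent_modules(a_tags):
    """Build the queue of parent block nodes, skipping body and duplicates."""
    block_pre = []
    for a_tag in a_tags:
        block = parent_block(a_tag)
        if block is None or block.tag == "body":
            continue  # a body parent is ignored, as described above
        if block not in block_pre:  # identity comparison for plain objects
            block_pre.append(block)
    return block_pre

body = Node("body")
div = Node("div", body)
li = Node("li", Node("ul", div))
a1 = Node("a", li)      # nested inside the div
a2 = Node("a", body)    # directly under body
modules = collect_parent_modules([a1, a2])
```

Skipping duplicates is a choice made here so that each parent webpage module appears once; the text does not say whether the queue keeps repeats.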

S1022: Combine the parent webpage modules of the tags pairwise and, for each combination, determine whether an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination.

In the embodiment of the invention, different tags may correspond to different parent webpage modules. Any two different parent webpage modules can be selected to form a pair, and the ancestor webpage module of each of the two parent webpage modules (that is, its own parent webpage module) is determined. All pairs are traversed, and for each pair it is determined whether an ancestor-descendant relationship exists between the two ancestor webpage modules.

In a possible implementation, assume that the queue formed in step S1021 by the parent block nodes corresponding to the a tags is block_pre. Because different a tags may correspond to different parent block nodes, two different parent block nodes can be selected from block_pre, for example m and n. The parent block nodes corresponding to m and n are then found, denoted M_p and N_p, and it is determined whether an ancestor-descendant relationship exists between M_p and N_p.

S1023: If an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, take the ancestor webpage module that is the ancestor as a common parent webpage module.

In the embodiment of the present invention, if an ancestor-descendant relationship exists between M_p and N_p, and M_p is the ancestor relative to N_p, then M_p is taken as the common parent webpage module.
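The pairwise ancestor-descendant test of steps S1022 and S1023 can be sketched as below. The `Node` class and the function names are hypothetical; the function takes the ancestor webpage modules (the M_p and N_p of the text) directly, and returns those that are an ancestor in at least one pair.

```python
from itertools import combinations

class Node:
    """Minimal stand-in for a DOM block node (hypothetical helper)."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent

def is_ancestor(a, b):
    """True if a lies strictly above b on b's chain of parents."""
    current = b.parent
    while current is not None:
        if current is a:
            return True
        current = current.parent
    return False

def nested_common_parents(ancestor_modules):
    """For every pair, keep the module that is the ancestor of the other."""
    found = []
    for m_p, n_p in combinations(ancestor_modules, 2):
        if is_ancestor(m_p, n_p):
            winner = m_p
        elif is_ancestor(n_p, m_p):
            winner = n_p
        else:
            continue  # no ancestor-descendant relationship in this pair
        if winner not in found:
            found.append(winner)
    return found

body = Node("body")
outer = Node("div", body)
inner = Node("table", outer)
other = Node("div", body)
result = nested_common_parents([outer, inner, other])
```

Pairs with no ancestor-descendant relationship are left to the clustering branch (S1024) and are simply skipped here.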

S1024: If no ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination and the ancestor webpage modules are the same, perform hierarchical clustering on the parent webpage modules included in the combination and the parent webpage modules included under their ancestor webpage module, and take the minimum common parent webpage module of the clustered parent webpage modules as a common parent webpage module.

In the embodiment of the invention, the ancestor webpage modules of the tags in the combination (the parent webpage modules of the parent webpage modules included in the combination) are M_p and N_p. If there is no ancestor-descendant relationship between M_p and N_p and the two are the same, then in a possible implementation, before hierarchical clustering is performed on the parent webpage modules included in the combination and the parent webpage modules included under their ancestor webpage module, the method may further include determining how to perform hierarchical clustering on the parent webpage modules included in the combination, as shown in FIG. 4. Referring to FIG. 4, the method includes:

S10241: Acquire the parent webpage module node chain corresponding to each parent webpage module node in the combination.

In the embodiment of the invention, the parent nodes M_p and N_p of m and n can be obtained. When M_p and N_p are the same node, the parent node chains m_plinks and n_plinks of m and n are obtained. A parent node chain includes all block nodes as well as non-block nodes. For example, FIG. 5 is a schematic diagram of DOM tree nodes. Referring to FIG. 5, M_p (table) indicates that M_p is a block node and MP (div) indicates that MP is a block node; the parent node chain m_plinks of m is then (m, M_p, MP, A), and the parent node chain n_plinks of n is (n, N_p, MP, A).

S10242: Determine the minimum common parent webpage module node of the parent webpage module node chains corresponding to the parent webpage module nodes.

In the embodiment of the invention, M_p, MP and A are parent nodes. By comparing the parent nodes on the DOM tree in sequence, M_p is obtained and determined to be the minimum common parent webpage module node.

S10243: Determine the relative path from each webpage module node to the minimum common parent webpage module node, and determine the node names on the relative path.

In the embodiment of the invention, the relative paths of m and n to the minimum common parent webpage module node M_p, corresponding to the nodes in FIG. 5, are m_rlinks = (m, M_p) and n_rlinks = (n, M_p), and the node names on m_rlinks and n_rlinks are determined.
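Steps S10241 to S10243 can be sketched on a tree shaped like the FIG. 5 example. The `Node` class, the function names, and the tag names given to m, n and A are assumptions made for illustration; only M_p (table) and MP (div) are stated in the text.

```python
class Node:
    """Minimal stand-in for a DOM node (hypothetical helper)."""
    def __init__(self, name, tag, parent=None):
        self.name = name
        self.tag = tag
        self.parent = parent

def parent_chain(node):
    """The node followed by all of its parents up to the root."""
    chain = []
    current = node
    while current is not None:
        chain.append(current)
        current = current.parent
    return chain

def min_common_parent(m, n):
    """First parent of m that is also a parent of n (lowest shared one)."""
    n_parents = parent_chain(n)[1:]
    for candidate in parent_chain(m)[1:]:
        if any(candidate is p for p in n_parents):
            return candidate
    return None

def relative_path_names(node, common_parent):
    """Tag names from the node up to and including the common parent."""
    names = []
    current = node
    while current is not None:
        names.append(current.tag)
        if current is common_parent:
            return names
        current = current.parent
    raise ValueError("common_parent is not on the node's parent chain")

a = Node("A", "body")            # tag of A is an assumption
mp = Node("MP", "div", a)
m_p = Node("M_p", "table", mp)
m = Node("m", "div", m_p)        # tags of m and n are assumptions
n = Node("n", "div", m_p)
m_plinks = [x.name for x in parent_chain(m)]
m_rlinks = relative_path_names(m, min_common_parent(m, n))
```

With this layout `m_plinks` lists m, M_p, MP, A in order, matching the chain given for FIG. 5.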

S10244: Determine, according to the similarity of the node names on the relative paths, that the parent webpage modules in the combination are similar and aggregatable.

In the embodiment of the invention, the similarity between the paths m_rlinks and n_rlinks can be represented by the similarity of the node names on the relative paths.

Specifically, the similarity is calculated from the node names on the relative paths m_rlinks and n_rlinks, namely (m, M_p) and (n, M_p), and can be calculated according to the following formula:

where sum is the sum of the length of the name string m_tags on the relative path of node m and the length of the name string n_tags on the relative path of node n, and ldist is an edit-like distance: the minimum number of operations (insertion, deletion, replacement) needed to convert m_tags into n_tags, in which each deletion or insertion adds 1 to the count and each replacement adds 2. The calculation of this formula is well known in the art and will not be described further herein.
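The definitions of sum and ldist match the widely used Levenshtein ratio, so the sketch below assumes the similarity takes the form s = (sum - ldist) / sum. That form is an assumption, since the formula itself does not appear in the text; the function names are likewise illustrative.

```python
def ldist(m_tags, n_tags):
    """Edit-like distance: insert/delete cost 1, replacement cost 2."""
    previous = list(range(len(n_tags) + 1))  # distance from empty prefix
    for i, x in enumerate(m_tags, 1):
        current = [i]
        for j, y in enumerate(n_tags, 1):
            replace = previous[j - 1] + (0 if x == y else 2)
            delete = previous[j] + 1
            insert = current[j - 1] + 1
            current.append(min(replace, delete, insert))
        previous = current
    return previous[-1]

def similarity(m_tags, n_tags):
    """Assumed form: (sum - ldist) / sum, with sum = len(m) + len(n)."""
    total = len(m_tags) + len(n_tags)
    if total == 0:
        return 1.0  # two empty name strings are treated as identical
    return (total - ldist(m_tags, n_tags)) / total

same = similarity("divtable", "divtable")
close = similarity("divul", "divol")
```

Under this assumed form the value lies between 0 and 1, which is what makes comparison against a fixed threshold meaningful.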

In the embodiment of the invention, after the similarity s between m_tags and n_tags is calculated, it is necessary to judge whether m_rlinks and n_rlinks are similar. Given a similarity threshold s_threshold, if s is greater than or equal to s_threshold, m_tags and n_tags are considered similar, that is, the tags m and n are similar; otherwise, m_tags and n_tags are considered dissimilar.

If m_tags and n_tags are not similar, the webpage modules corresponding to m and n are deleted; if m_tags and n_tags are similar, it is necessary to determine whether node q, which shares a common parent node with m and n, can be aggregated with m and n.

Since the parent nodes of m and n are M_p and N_p, which are the same node, the following process is described in detail taking node m as an example. The child nodes of M_p shown in FIG. 5 are not limited to the three nodes m, n and q in the figure; assuming that nodes X1, X2, X3 … Xn may also be included, {q, X1, X2, X3 … Xn} is clustered. Let the aggregation threshold be c_threshold and let the number of all block nodes having a common parent node with node m be count_a; the aggregation degree is then calculated.

Specifically, the aggregation degree may be calculated as (count_mc + 1)/(count_m + 1), where count_mc + 1 represents the number of clusterable tags and count_m + 1 represents the number of all block nodes having a common parent node with node m. For example, in FIG. 5, count_mc + 1 is 3 and count_m + 1 is 3, so the aggregation degree is 3/3 = 1. If the aggregation degree is less than the threshold c_threshold, {m, q, X1, X2, X3 … Xn} cannot be aggregated; if the aggregation degree is greater than the threshold c_threshold, m is aggregated with the remaining nodes q, X1, X2, X3 … Xn that share a common parent node.
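The aggregation-degree check can be sketched directly from the expression (count_mc + 1)/(count_m + 1). The function names and the threshold value used in the example are hypothetical; the text does not say what happens when the degree equals the threshold, so this sketch treats equality as not aggregatable.

```python
def aggregation_degree(count_mc, count_m):
    """(count_mc + 1) / (count_m + 1), as given in the description."""
    return (count_mc + 1) / (count_m + 1)

def is_aggregatable(count_mc, count_m, c_threshold):
    """Aggregate only when the degree is strictly above the threshold.

    Equality is treated as not aggregatable; that choice is an assumption.
    """
    return aggregation_degree(count_mc, count_m) > c_threshold

# FIG. 5 example: count_mc + 1 = 3 and count_m + 1 = 3.
fig5_degree = aggregation_degree(2, 2)
fig5_ok = is_aggregatable(2, 2, c_threshold=0.5)  # 0.5 is a made-up value
```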

In the embodiment of the invention, if m can be aggregated with the remaining nodes q, X1, X2, X3 … Xn that share a common parent node, then m, q, X1, X2, X3 … Xn are clustered and merged, and the common parent webpage module of the clustered and merged parent webpage modules is taken as the common parent webpage module.

Specifically, if m and q, X1, X2, X3 … Xn can be aggregated, they are clustered and merged to obtain the common parent node M_p, and M_p is taken as a new parent webpage module.

In the embodiment of the invention, by judging whether the relative paths are similar and aggregating the nodes, the introduction of a redundant module in the extraction of the webpage module can be avoided, and the extraction accuracy of the key module of the target page is improved.

Based on the same concept as the concept applied to the method embodiment related to the extraction of the webpage key module, the embodiment of the present invention further provides a device for extracting the webpage key module, and fig. 6 is a block diagram illustrating a structure of the device for extracting the webpage key module according to the embodiment of the present invention, and referring to fig. 6, the device includes: an acquisition unit 101, a processing unit 102, wherein:

The acquisition unit 101 is configured to acquire a valid link contained in the target webpage and a tag containing the valid link, where the valid link is a detail page link pointing to the inside of the webpage.

The processing unit 102 is configured to determine a common parent webpage module containing the tag of the valid link, and to take the parent webpage module containing the largest number of valid links among the common parent webpage modules as the webpage key module.

Specifically, the obtaining unit 101 is configured to obtain the valid link included in the target webpage and the tag including the valid link as follows:

acquiring all tags directly containing links in a target webpage; deleting invalid links from the links directly contained in all the tags to obtain the valid links contained in the target webpage; and deleting the label corresponding to the invalid link from all the labels to obtain the label containing the valid link.

Optionally, the invalid link includes at least one of the following links:

a link in which the domain name is inconsistent with the domain name of the web page; a link containing a predetermined keyword; the keywords are keywords in a non-detail page link common keyword table and keywords in a useless link common keyword table.

Further, the processing unit 102 is specifically configured to process the common parent web page module containing the tag of the active link in the following manner:

acquiring the parent webpage module of each of the tags containing valid links; combining the parent webpage modules of the tags pairwise and, for each combination, determining whether an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, where an ancestor webpage module is the parent webpage module of a parent webpage module included in the combination; if an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, taking the ancestor webpage module that is the ancestor as a common parent webpage module; if no ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination and the ancestor webpage modules are the same, performing hierarchical clustering on the parent webpage modules included in the combination and the parent webpage modules included under their ancestor webpage module, and taking the minimum common parent webpage module of the clustered parent webpage modules as a common parent webpage module.

Further, the obtaining unit 101 is further configured to:

and acquiring a parent webpage module node chain corresponding to each parent webpage module node in the combination.

The processing unit 102 is further configured to:

determining the minimum common parent webpage module node of the parent webpage module node chains, acquired by the acquisition unit, corresponding to the parent webpage module nodes; determining the relative path from each webpage module node to the minimum common parent webpage module node, and determining the node names on the relative path; and determining, according to the similarity of the node names on the relative paths, that the parent webpage modules in the combination are similar and aggregatable.

The embodiment of the invention also provides webpage key module extraction equipment, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor; when the processor executes the program, the method for extracting the webpage key module is realized.

The embodiment of the invention also provides a computer storage medium, wherein the computer storage medium stores program instructions, and the program instructions are used for realizing the webpage key module extraction method when being executed by a processor.

It should be noted that, in the embodiment of the present invention, the functional implementation of each unit in the device for extracting a webpage key module mentioned above may further refer to the description of the related method embodiment, and is not described herein again.

The present application is described above with reference to block diagrams and/or flowchart illustrations of methods, apparatus (systems) and/or computer program products according to embodiments of the application. It will be understood that one block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, and/or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer and/or other programmable data processing apparatus, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks.

Accordingly, the subject application may also be embodied in hardware and/or in software (including firmware, resident software, micro-code, etc.). Furthermore, the present application may take the form of a computer program product on a computer-usable or computer-readable storage medium having computer-usable or computer-readable program code embodied in the medium for use by or in connection with an instruction execution system. In the context of this application, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A webpage key module extraction method is characterized by comprising the following steps:

determining a common parent web page module containing the tags of the valid links; wherein the determining a common parent web page module containing the tag of the valid link comprises: acquiring a parent webpage module of each tag in the tags containing the effective links; combining parent webpage modules of each label pairwise, and respectively determining whether ancestor-descendant relations exist between ancestor webpage modules of each label in the combination aiming at each combination, wherein the ancestor webpage modules are the parent webpage modules of the parent webpage modules included in the combination; if an ancestor-descendant relationship exists between ancestor webpage modules of each label in the combination, taking the ancestor webpage module as an ancestor as a public parent webpage module; if the ancestor webpage modules of each label in the combination do not have ancestor-descendant relations and the ancestor webpage modules are the same, performing hierarchical clustering on the parent webpage module included in the combination and each parent webpage module included under the ancestor webpage module, and taking the minimum public parent webpage module of each clustered parent webpage module as a public parent webpage module;

2. The method of claim 1, wherein the obtaining of the valid link included in the target web page and the tag including the valid link comprises:

acquiring all tags directly containing links in a target webpage;

3. The method of claim 2, wherein the invalid link comprises at least one of:

a link containing a predetermined keyword;

4. The method of claim 1, wherein prior to hierarchically clustering a parent web page module included in the group with each parent web page module included under its ancestor web page module, the method further comprises:

5. A webpage key module extraction device is characterized by comprising:

the processing unit is configured to determine a common parent webpage module containing the tag of the valid link, and to take the parent webpage module containing the largest number of valid links among the common parent webpage modules as the webpage key module; wherein the processing unit is specifically configured to determine the common parent webpage module containing the tag of the valid link in the following manner: acquiring the parent webpage module of each of the tags containing the valid links; combining the parent webpage modules of the tags pairwise and, for each combination, determining whether an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, wherein an ancestor webpage module is the parent webpage module of a parent webpage module included in the combination; if an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, taking the ancestor webpage module that is the ancestor as a common parent webpage module; if no ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination and the ancestor webpage modules are the same, performing hierarchical clustering on the parent webpage modules included in the combination and the parent webpage modules included under their ancestor webpage module, and taking the minimum common parent webpage module of the clustered parent webpage modules as a common parent webpage module.

6. The apparatus of claim 5, wherein the obtaining unit is configured to obtain the valid link included in the target webpage and the tag including the valid link as follows:

acquiring all tags directly containing links in a target webpage;

7. The apparatus of claim 6, wherein the invalid link comprises at least one of:

a link containing a predetermined keyword;

8. The apparatus of claim 6, wherein the obtaining unit is further configured to:

the processing unit is further configured to:

9. A webpage key module extraction device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the webpage key module extraction method according to any one of claims 1 to 4 when executing the program.

10. A computer storage medium having stored thereon program instructions for implementing a web page key module extraction method according to any one of claims 1 to 4 when executed by a processor.