CN110020247A - Web page key module extraction method and device - Google Patents

Web page key module extraction method and device

Info

Publication number
CN110020247A
Authority
CN
China
Prior art keywords
web page
page module
father
link
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711402540.2A
Other languages
Chinese (zh)
Other versions
CN110020247B (en
Inventor
初光磊
丁彬
段盼盼
李学环
齐骥
钱岭
吴昊天
邱雨
王瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Suzhou Software Technology Co Ltd
Priority to CN201711402540.2A
Publication of CN110020247A
Application granted
Publication of CN110020247B
Status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/954: Navigation, e.g. using categorised browsing
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and device for extracting the key module of a web page. The valid links contained in a target web page, and the tags containing those valid links, are obtained; the common parent web page modules of the tags containing the valid links are determined; and the parent web page module that contains the largest number of valid links among the common parent web page modules is taken as the key module of the web page. The key module can thus be extracted even when visual features are lacking.

Description

Web page key module extraction method and device
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and device for extracting the key module of a web page.
Background
In today's era of information explosion, the network plays a very important role, and the information contained in web pages has become rich and complex. A web page may contain navigation, titles, body text, timestamps and even advertisements. These various types of data are mixed together, which makes it harder for users to extract key and useful information.
In the prior art, extracting the useful content of a web page generally requires a fine-grained analysis of the page, from which the key information is then extracted. At present, web page information extraction is most often based on visual features. It uses the Vision-based Page Segmentation (VIPS) algorithm, which extracts the key information of a web page from the Document Object Model (DOM) tree structure together with visual features. Specifically, the content of each part of the page is judged intuitively by its visual appearance, and the DOM tree structure is then used to judge whether a node can be divided, that is, whether the tag corresponding to each part of the page has sub-pages. The sub-pages are divided in turn until they cannot be divided further, and the part of the information that can be seen directly is extracted as the key information. If the key information of the page cannot be obtained intuitively from its visual appearance, it cannot be extracted at all.
Summary of the invention
The object of the present invention is to provide a method and device for extracting the key module of a web page, so as to solve the problem of how to extract the key module when visual features are lacking.
The object of the present invention is achieved through the following technical solutions:
One aspect of the present invention provides a method for extracting the key module of a web page, comprising:
obtaining the valid links contained in a target web page, and the tags containing the valid links, a valid link being a detail page link pointing inside the website;
determining the common parent web page modules of the tags containing the valid links;
taking, among the common parent web page modules, the parent web page module that contains the largest number of valid links as the key module of the web page.
Optionally, obtaining the valid links contained in the target web page, and the tags containing the valid links, comprises:
obtaining all tags in the target web page that directly contain a link;
deleting the invalid links from the links directly contained in all of the tags, to obtain the valid links contained in the target web page;
deleting the tags corresponding to the invalid links from all of the tags, to obtain the tags containing the valid links.
Optionally, the invalid links include at least one of the following:
a link whose domain name is inconsistent with the domain name of the web page;
a link containing a preset keyword;
wherein the keyword is a keyword in the common keyword table of non-detail-page links or a keyword in the common keyword table of useless links.
Optionally, determining the common parent web page modules of the tags containing the valid links comprises:
obtaining the parent web page module of each of the tags containing the valid links;
combining the parent web page modules of the tags in pairs and, for each combination, determining whether an ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination, an ancestor web page module being the parent web page module of a parent web page module included in the combination;
if an ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination, taking the ancestor web page module that is the ancestor as a common parent web page module;
if no ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination and the ancestor web page modules are identical, performing hierarchical clustering on the parent web page modules included in the combination and on the parent web page modules under their ancestor web page module, and taking the lowest common parent web page module of the clustered parent web page modules as a common parent web page module.
Optionally, before the hierarchical clustering is performed on the parent web page modules included in the combination and on the parent web page modules under their ancestor web page module, the method further comprises:
obtaining the parent web page module node chain corresponding to each parent web page module node included in the combination;
determining the lowest common parent web page module node of the parent web page module node chains corresponding to the parent web page module nodes;
determining the relative path from each web page module node to the lowest common parent web page module node, and determining the node names on the relative path;
determining, according to the similarity of the node names on the relative paths, that the parent web page modules in the combination are similar and can be clustered.
Another aspect of the present invention provides a device for extracting the key module of a web page, comprising:
an acquiring unit, configured to obtain the valid links contained in a target web page, and the tags containing the valid links, a valid link being a detail page link pointing inside the website;
a processing unit, configured to determine the common parent web page modules of the tags containing the valid links, and to take, among the common parent web page modules, the parent web page module that contains the largest number of valid links as the key module of the web page.
Optionally, the acquiring unit is configured to obtain the valid links contained in the target web page, and the tags containing the valid links, as follows:
obtaining all tags in the target web page that directly contain a link;
deleting the invalid links from the links directly contained in all of the tags, to obtain the valid links contained in the target web page;
deleting the tags corresponding to the invalid links from all of the tags, to obtain the tags containing the valid links.
Optionally, the invalid links include at least one of the following:
a link whose domain name is inconsistent with the domain name of the web page;
a link containing a preset keyword;
wherein the keyword is a keyword in the common keyword table of non-detail-page links or a keyword in the common keyword table of useless links.
Optionally, the processing unit is specifically configured to determine the common parent web page modules of the tags containing the valid links as follows:
obtaining the parent web page module of each of the tags containing the valid links;
combining the parent web page modules of the tags in pairs and, for each combination, determining whether an ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination, an ancestor web page module being the parent web page module of a parent web page module included in the combination;
if an ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination, taking the ancestor web page module that is the ancestor as a common parent web page module;
if no ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination and the ancestor web page modules are identical, performing hierarchical clustering on the parent web page modules included in the combination and on the parent web page modules under their ancestor web page module, and taking the lowest common parent web page module of the clustered parent web page modules as a common parent web page module.
Optionally, the acquiring unit is further configured to obtain the parent web page module node chain corresponding to each parent web page module node included in the combination;
the processing unit is further configured to:
determine the lowest common parent web page module node of the parent web page module node chains, obtained by the acquiring unit, corresponding to the parent web page module nodes;
determine the relative path from each web page module node to the lowest common parent web page module node, and determine the node names on the relative path;
determine, according to the similarity of the node names on the relative paths, that the parent web page modules in the combination are similar and can be clustered.
The present invention also provides a web page key module extraction apparatus, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the program, the processor implements the web page key module extraction method described above.
The present invention also provides a computer storage medium storing program instructions which, when executed by a processor, implement the web page key module extraction method described above.
In the web page key module extraction method and device provided by the present invention, the valid links contained in a target web page, and the tags containing the valid links, are obtained; the common parent web page modules of the tags containing the valid links are determined; and the parent web page module that contains the largest number of valid links among the common parent web page modules is taken as the key module of the web page. The key module can thus be extracted even when visual features are lacking.
Brief description of the drawings
Fig. 1 is a flowchart of a web page key module extraction method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of an implementation of obtaining the valid links contained in a target web page, and the tags containing the valid links;
Fig. 3 is a flowchart of an implementation, provided by an embodiment of the present invention, of determining the common parent web page modules of the tags containing the valid links;
Fig. 4 is a flowchart of an implementation, provided by an embodiment of the present invention, of the method for determining how to perform hierarchical clustering on the parent web page modules included in a combination;
Fig. 5 is a schematic diagram of DOM tree nodes provided by an embodiment of the present invention;
Fig. 6 is a structural block diagram of a web page key module extraction device provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The web page key module extraction method and device provided by the embodiments of the present invention obtain the valid links contained in a target web page, and the tags containing the valid links, determine the common parent web page modules of the tags containing the valid links, and take the parent web page module that contains the largest number of valid links among the common parent web page modules as the key module of the web page. This solves the problem of how to extract the key module of a web page when visual features are lacking.
Fig. 1 shows a flowchart of a web page key module extraction method provided by an embodiment of the present invention. Referring to Fig. 1, the method may include the following steps:
S101: Obtain the valid links contained in the target web page, and the tags containing the valid links.
In general, a web page contains a variety of links when it is browsed. To view information in a targeted way, the key information has to be picked out from among these links.
In the embodiment of the present invention, the valid links contained in the target web page can be obtained. A valid link is a detail page link pointing inside the website; detail page links can be understood as the links in the web page that carry the most information content.
In the embodiment of the present invention, the source code of the target web page can be downloaded, all tags containing links can be found in the source code, and the tags containing valid links can be selected from among them.
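As an illustration of finding every link-bearing tag in downloaded source code, the following is a minimal sketch using Python's standard `html.parser`. The class name `LinkCollector` and the function `collect_links` are hypothetical names chosen here, and treating "a tag that directly contains a link" as "a start tag with an `href` attribute" is an assumption, not something the text specifies.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (tag, href) pairs for every start tag carrying an href attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        href = dict(attrs).get("href")
        if href:
            self.links.append((tag, href))


def collect_links(html_source):
    """Return all (tag, href) pairs found in the page source, in document order."""
    parser = LinkCollector()
    parser.feed(html_source)
    return parser.links
```

Filtering this list down to valid links is a separate step, described with Fig. 2.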
S102: Determine the common parent web page modules of the tags containing the valid links.
In the embodiment of the present invention, the parent web page modules of the tags containing the valid links can be looked up, and the common parent web page modules of all of those parent web page modules can then be determined.
S103: Take, among the common parent web page modules, the parent web page module that contains the largest number of valid links as the key module of the web page.
In the embodiment of the present invention, after the common parent web page modules are determined, the parent web page modules among them that contain valid links can be looked up, and the one containing the largest number of valid links is selected as the key module of the web page.
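Step S103 can be sketched as a simple count-and-select over candidate modules. The `Node` class below is a hypothetical stand-in for a DOM node, and `pick_key_module` is an illustrative name; the text does not say how ties are broken, so this sketch simply keeps the first candidate with the highest count.

```python
class Node:
    """Minimal DOM-like node (hypothetical helper, not from the source)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent


def contains(module, node):
    """Return True if `module` is an ancestor of `node`."""
    current = node.parent
    while current is not None:
        if current is module:
            return True
        current = current.parent
    return False


def pick_key_module(common_parents, valid_link_nodes):
    """Return the candidate module that contains the most valid links."""
    if not common_parents:
        return None
    return max(
        common_parents,
        key=lambda module: sum(contains(module, link) for link in valid_link_nodes),
    )
```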
In one possible implementation, the valid links contained in the target web page, and the tags containing the valid links, can be obtained by the method shown in Fig. 2. Referring to Fig. 2, the method includes:
S1011: Obtain all tags in the target web page that directly contain a link.
S1012: Delete the invalid links from the links directly contained in all of the tags, to obtain the valid links contained in the target web page.
S1013: Delete the tags corresponding to the invalid links from all of the tags, to obtain the tags containing the valid links.
Specifically, an invalid link in the embodiment of the present invention can be understood as a link pointing to an external website. An invalid link may be a link whose domain name is inconsistent with the domain name of the web page, or a link containing a preset keyword.
In one possible implementation, the domain name of the target web page is obtained, together with all tags in the target web page that directly contain a link, and the most basic tag, for example the a tag, is selected from among them. To judge whether a link is valid, the domain name of every link corresponding to an a tag is compared with the domain name of the target web page. If the domain name of the link corresponding to an a tag is inconsistent with the domain name of the target web page, that link is considered invalid and is deleted.
In another possible implementation, a keyword table for non-detail-page links and a keyword table commonly used for useless links can be preset. For example, the keywords may be set to "advertisement", "recommendation" and so on. When such a keyword appears in the web page, the link can be considered invalid and is deleted.
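The two filtering rules above (domain mismatch and preset keywords) can be combined in a short sketch. The keyword tuple is a hypothetical example built from the two keywords the text mentions, and checking the keywords against both the URL and the anchor text is an assumption, since the text does not say where the keyword must appear.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical keyword table; the source only names "advertisement" and "recommendation".
INVALID_KEYWORDS = ("advertisement", "recommend")


def is_valid_link(href, anchor_text, page_url, keywords=INVALID_KEYWORDS):
    """Return True if the link stays on the page's domain and matches no keyword."""
    absolute = urljoin(page_url, href)  # resolve relative links against the page URL
    if urlparse(absolute).hostname != urlparse(page_url).hostname:
        return False  # rule 1: domain name differs from the page's domain name
    haystack = (absolute + " " + anchor_text).lower()
    return not any(keyword in haystack for keyword in keywords)  # rule 2: keywords
```

Comparing full host names is the strictest reading of "inconsistent domain name"; a deployment that treats subdomains as the same site would need a looser comparison.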
In the embodiment of the present invention, after the tags of the valid links are obtained, the common parent web page modules of the tags containing the valid links can be determined. In one possible implementation, they are determined by the method shown in Fig. 3. Referring to Fig. 3, the method includes:
S1021: Obtain the parent web page module of each of the tags containing the valid links.
In general, a set of information in a web page with obvious visually identifiable features can be regarded as a web page module, and a web page module is usually a block-level tag such as div or table.
In the DOM tree, all nodes are related to one another. If the block-level tag of a node is a div tag, a table tag or a body tag, the node is considered a block node, and a block node may also be called a web page module. In the following description of the embodiment of the present invention, "block node" and "web page module" are sometimes used interchangeably, but a person skilled in the art will understand that they have the same meaning.
Taking the a tags of the valid links as an example, the following describes in detail how the parent web page module of each of the tags containing the valid links is obtained.
All a tags of the valid links are placed in a queue, and the parent block node corresponding to each a tag in the queue is found. If the tag of the parent block node is the body tag, that parent block node is ignored. If the tag of the parent block node is not the body tag, the parent block node found for the a tag is placed in a queue. The set of parent block nodes found for all a tags whose tag is not body is regarded as the parent web page modules.
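The lookup of parent block nodes can be sketched as a walk up the tree from each a tag. The `Node` class and function names are hypothetical; the block tags div, table and body come from the text, while removing duplicate modules from the resulting queue is an assumption added here for convenience.

```python
class Node:
    """Minimal DOM-like node (hypothetical helper, not from the source)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent


BLOCK_TAGS = {"div", "table", "body"}  # block nodes as listed in the description


def parent_block(node):
    """Return the nearest ancestor that is a block node, or None if there is none."""
    current = node.parent
    while current is not None and current.tag not in BLOCK_TAGS:
        current = current.parent
    return current


def parent_modules(link_nodes):
    """Build the queue of parent block nodes, ignoring body (duplicates dropped)."""
    modules = []
    for link in link_nodes:
        block = parent_block(link)
        if block is None or block.tag == "body":
            continue  # a body parent is ignored, as described above
        if not any(block is seen for seen in modules):
            modules.append(block)
    return modules
```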
S1022: Combine the parent web page modules of the tags in pairs and, for each combination, determine whether an ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination.
In the embodiment of the present invention, different tags may correspond to different parent web page modules. Two different parent web page modules can be selected at random from the parent web page modules to form a two-element set, and the parent web page module of each of these two parent web page modules is determined. By traversing all two-element sets, it is determined for each of them whether an ancestor-descendant relationship exists between the parent web page modules of its two parent web page modules.
In one possible implementation, suppose that the queue formed by the parent block nodes found for the a tags in step S1021 is blocks_pre. Since different a tags may correspond to different parent block nodes, two different parent block nodes, say m and n, can be selected from blocks_pre. The parent block nodes of m and n are found to be Mp and Np, and it is then determined whether an ancestor-descendant relationship exists between Mp and Np.
S1023: If an ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination, take the ancestor web page module that is the ancestor as a common parent web page module.
In the embodiment of the present invention, if an ancestor-descendant relationship exists between Mp and Np and Mp is the ancestor relative to Np, Mp is taken as a common parent web page module.
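The pairwise ancestor-descendant test of S1022 and S1023 can be sketched as follows. `Node`, `is_ancestor` and `common_parent_if_related` are hypothetical names; the sketch assumes each node keeps a reference to its parent and uses `itertools.combinations` to enumerate the two-element sets.

```python
from itertools import combinations


class Node:
    """Minimal DOM-like node (hypothetical helper, not from the source)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def is_ancestor(candidate, node):
    """Return True if `candidate` lies on the parent chain of `node`."""
    current = node.parent
    while current is not None:
        if current is candidate:
            return True
        current = current.parent
    return False


def common_parent_if_related(mp, np_):
    """Return whichever of the two modules is the ancestor of the other, else None."""
    if is_ancestor(mp, np_):
        return mp
    if is_ancestor(np_, mp):
        return np_
    return None


def related_pairs(modules):
    """Yield (first, second, ancestor) for every pair with an ancestor-descendant relation."""
    for first, second in combinations(modules, 2):
        ancestor = common_parent_if_related(first, second)
        if ancestor is not None:
            yield first, second, ancestor
```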
S1024: If no ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination and the ancestor web page modules are identical, perform hierarchical clustering on the parent web page modules included in the combination and on the parent web page modules under their ancestor web page module, and take the lowest common parent web page module of the clustered parent web page modules as a common parent web page module.
In the embodiment of the present invention, suppose no ancestor-descendant relationship exists between the ancestor web page modules Mp and Np of the tags in the combination (that is, the parent web page modules of the parent web page modules included in the combination), and the ancestor web page modules are identical. In one possible implementation, before hierarchical clustering is performed on the parent web page modules included in the combination and on the parent web page modules under their ancestor web page module, the method shown in Fig. 4 for determining how to cluster the parent web page modules included in the combination may also be carried out. Referring to Fig. 4, the method includes:
S10241: Obtain the parent web page module node chain corresponding to each parent web page module node included in the combination.
In the embodiment of the present invention, the parent nodes Mp and Np of m and n are obtained. When Mp and Np are the same node, the parent node chains m_plinks and n_plinks of m and n are obtained. A parent node chain includes all block nodes and non-block nodes. For example, Fig. 5 shows a schematic diagram of DOM tree nodes. Referring to Fig. 5, Mp (table) indicates that Mp is a block node and MP (div) indicates that MP is a block node. The parent node chain m_plinks of m is then (m, Mp, MP, A), and the parent node chain n_plinks of n is (n, Np, MP, A).
S10242: Determine the lowest common parent web page module node of the parent web page module node chains corresponding to the parent web page module nodes.
In the embodiment of the present invention, Mp, MP and A are parent nodes. By comparing the parent nodes on the DOM tree in order, Mp can be determined to be the lowest common parent web page module node.
S10243: Determine the relative path from each web page module node to the lowest common parent web page module node, and determine the node names on the relative path.
In the embodiment of the present invention, the relative paths of m and n with respect to the lowest common parent web page module node Mp are m_rlinks and n_rlinks. For the nodes in Fig. 5, m_rlinks is (m, Mp) and n_rlinks is (n, Mp). The node names of m_rlinks and n_rlinks are then determined.
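Steps S10241 to S10243 can be traced on the Fig. 5 example with a small sketch. The `Node` class and the three function names are hypothetical; the tree built in the usage lines mirrors the chain (m, Mp, MP, A) given in the text, with Mp and Np being the same node.

```python
class Node:
    """Minimal DOM-like node (hypothetical helper, not from the source)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def parent_chain(node):
    """Return the chain from the node itself up to the root, e.g. (m, Mp, MP, A)."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def lowest_common_parent(chain_a, chain_b):
    """Return the first node above chain_a's start that also lies on chain_b."""
    on_b = {id(node) for node in chain_b}
    for node in chain_a[1:]:  # skip the starting node itself
        if id(node) in on_b:
            return node
    return None


def relative_path(chain, common):
    """Return the part of the chain from its start up to and including `common`."""
    path = []
    for node in chain:
        path.append(node)
        if node is common:
            break
    return path


# Usage on the Fig. 5 layout: A > MP (div) > Mp (table) > m, n
a = Node("A")
big_mp = Node("MP", a)
mp = Node("Mp", big_mp)
m, n = Node("m", mp), Node("n", mp)
common = lowest_common_parent(parent_chain(m), parent_chain(n))
m_rlinks = [node.name for node in relative_path(parent_chain(m), common)]
```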
S10244: Determine, according to the similarity of the node names on the relative paths, that the parent web page modules in the combination are similar and can be clustered.
In the embodiment of the present invention, the similarity of the node names on the relative paths can be used to represent the similarity of the paths m_rlinks and n_rlinks.
Specifically, the similarity is calculated from the node names m, Mp and n, Mp of the relative paths m_rlinks and n_rlinks, and can be calculated according to the following formula:
Here, sum is the total length of the name string m_tags on the relative path of node m and the name string n_tags on the relative path of node n, and ldist is the class edit distance, that is, the minimum number of operations (insertion, deletion, replacement) needed to convert m_tags into n_tags. A deletion or an insertion adds 1 to the operation count, and a replacement adds 2. The calculation of this formula is covered in the prior art and is not described again here.
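The formula itself is not reproduced in this text. One formula consistent with the definitions of sum and ldist, and with a similarity s that grows as the paths become more alike, is the commonly used Levenshtein ratio, s = (sum - ldist) / sum. The sketch below assumes that form; it is an assumption, not a confirmed reading of the patent, and the function names are hypothetical.

```python
def class_edit_distance(m_tags, n_tags):
    """Edit distance where insert/delete cost 1 and replace costs 2."""
    previous = list(range(len(n_tags) + 1))
    for i, m_char in enumerate(m_tags, 1):
        current = [i]
        for j, n_char in enumerate(n_tags, 1):
            replace_cost = 0 if m_char == n_char else 2
            current.append(min(
                previous[j] + 1,                 # delete from m_tags
                current[j - 1] + 1,              # insert into m_tags
                previous[j - 1] + replace_cost,  # keep or replace
            ))
        previous = current
    return previous[-1]


def path_similarity(m_tags, n_tags):
    """Assumed form s = (sum - ldist) / sum; returns 1.0 for two empty strings."""
    total = len(m_tags) + len(n_tags)
    if total == 0:
        return 1.0
    return (total - class_edit_distance(m_tags, n_tags)) / total
```

Under this form, identical name strings give s = 1 and strings with no characters in common give s = 0, which fits the threshold comparison used in the next step.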
In the embodiment of the present invention, after the similarity of m_tags and n_tags is calculated, it must be judged whether m_rlinks and n_rlinks are similar. Suppose the similarity threshold is s_threshold. If s >= s_threshold, m_tags and n_tags are considered similar, that is, the tags m and n are similar; otherwise, m_tags and n_tags are considered dissimilar.
If m_tags and n_tags are dissimilar, the web page modules corresponding to m and n are deleted. If m_tags and n_tags are similar, it must be judged whether a node q that shares a common parent node with m and n can be clustered with m and n.
Since the parent nodes Mp and Np of m and n are the same node, the following process is described in detail taking node m as an example. The child nodes of Mp shown in Fig. 5 are not limited to the three nodes m, n and q in the figure; suppose they may also include X1, X2, X3 ... Xn, in which case {q, X1, X2, X3 ... Xn} are clustered. Suppose the aggregation threshold is c_threshold and the number of block nodes sharing a common parent node with node m is count_a; the degree of aggregation is then calculated.
Specifically, the degree of aggregation can be calculated as (count_mc+1)/(count_m+1), where count_mc+1 is the number of tags that can be clustered and count_m+1 is the number of all block nodes sharing a common parent node with node m. For example, in Fig. 5 count_mc+1 is 3 and count_m+1 is 3, so the degree of aggregation is 3/3 = 1. If the degree of aggregation is less than the threshold c_threshold, {m, q, X1, X2, X3 ... Xn} cannot be aggregated; if it is greater than c_threshold, m is aggregated with the remaining nodes q, X1, X2, X3 ... Xn that share the common parent node.
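The aggregation test reduces to one ratio and one comparison, sketched below. The function names are hypothetical. The text does not say what happens when the degree of aggregation equals c_threshold exactly; this sketch treats that case as not aggregable, which is an assumption.

```python
def aggregation_degree(count_mc, count_m):
    """Degree of aggregation as given in the text: (count_mc + 1) / (count_m + 1)."""
    return (count_mc + 1) / (count_m + 1)


def can_aggregate(count_mc, count_m, c_threshold):
    """True when the degree of aggregation exceeds the threshold (equality: assumed False)."""
    return aggregation_degree(count_mc, count_m) > c_threshold


# Fig. 5 example: count_mc + 1 = 3 and count_m + 1 = 3, so the degree is 3/3 = 1.
fig5_degree = aggregation_degree(2, 2)
```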
In the embodiment of the present invention, if m can be clustered with the remaining nodes q, X1, X2, X3 ... Xn that share the common parent node, m is clustered and merged with q, X1, X2, X3 ... Xn, and the common parent web page module of the parent web page modules after clustering and merging is taken as a common parent web page module.
Specifically, if m can be clustered with q, X1, X2, X3 ... Xn, then m and q, X1, X2, X3 ... Xn are clustered and merged into the common parent node Mp, and Mp is taken as a new parent web page module.
In the embodiment of the present invention, judging whether the relative paths are similar and aggregating the nodes avoids introducing redundant modules during web page module extraction and improves the accuracy of extracting the key module of the target page.
Based on the same concept as the method embodiments above for web page key module extraction, an embodiment of the present invention further provides a device for web page key module extraction. Fig. 6 shows a structural block diagram of a web page key module extraction device provided by an embodiment of the present invention. Referring to Fig. 6, the device includes an acquiring unit 101 and a processing unit 102, wherein:
The acquiring unit 101 is configured to obtain the valid links contained in a target web page, and the tags containing the valid links, a valid link being a detail page link pointing inside the website.
The processing unit 102 is configured to determine the common parent web page modules of the tags containing the valid links, and to take, among the common parent web page modules, the parent web page module that contains the largest number of valid links as the key module of the web page.
Specifically, the acquiring unit 101 is configured to obtain the valid links contained in the target web page, and the tags containing the valid links, as follows:
obtaining all tags in the target web page that directly contain a link; deleting the invalid links from the links directly contained in all of the tags, to obtain the valid links contained in the target web page; and deleting the tags corresponding to the invalid links from all of the tags, to obtain the tags containing the valid links.
Optionally, the invalid links include at least one of the following:
a link whose domain name is inconsistent with the domain name of the web page; a link containing a preset keyword; wherein the keyword is a keyword in the common keyword table of non-detail-page links or a keyword in the common keyword table of useless links.
Further, the processing unit 102 is specifically configured to determine the common parent web page modules of the tags containing the valid links as follows:
obtaining the parent web page module of each of the tags containing the valid links; combining the parent web page modules of the tags in pairs and, for each combination, determining whether an ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination, an ancestor web page module being the parent web page module of a parent web page module included in the combination; if an ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination, taking the ancestor web page module that is the ancestor as a common parent web page module; if no ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination and the ancestor web page modules are identical, performing hierarchical clustering on the parent web page modules included in the combination and on the parent web page modules under their ancestor web page module, and taking the lowest common parent web page module of the clustered parent web page modules as a common parent web page module.
Further, the acquiring unit 101 is also configured to:
Obtain the parent web page module node chain corresponding to each parent web page module node included in the combination.
The processing unit 102 is also configured to:
Determine the lowest common parent web page module node of the parent web page module node chains, obtained by the acquiring unit, that correspond to the parent web page module nodes; determine the relative path from each web page module node to the lowest common parent web page module node, and determine the node names on the relative path; according to the similarity of the node names on the relative paths, determine that the parent web page modules in the combination are similar and can be clustered.
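A minimal Python sketch of this similarity check follows. The text does not say how node-name similarity is measured, so the use of `difflib.SequenceMatcher` over the two lists of node names and the `threshold` value of 0.8 are assumptions, as are the names `Node`, `node_chain`, `lowest_common_parent`, `relative_names`, and `similar`.

```python
from difflib import SequenceMatcher


class Node:
    """A web page module node with a name (e.g. tag name) and a parent pointer."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def node_chain(node):
    """Return the node chain from the root down to the given node."""
    nodes = []
    while node is not None:
        nodes.append(node)
        node = node.parent
    return nodes[::-1]


def lowest_common_parent(nodes):
    """Return the deepest node shared by all node chains, or None."""
    shared = None
    for level in zip(*(node_chain(n) for n in nodes)):
        if all(n is level[0] for n in level):
            shared = level[0]
        else:
            break
    return shared


def relative_names(node, ancestor):
    """Node names on the path from just below the ancestor down to the node."""
    names = []
    while node is not None and node is not ancestor:
        names.append(node.name)
        node = node.parent
    return names[::-1]


def similar(a, b, threshold=0.8):
    """Assumed criterion: relative paths are similar if their names mostly match."""
    lcp = lowest_common_parent([a, b])
    path_a, path_b = relative_names(a, lcp), relative_names(b, lcp)
    return SequenceMatcher(None, path_a, path_b).ratio() >= threshold
```

Under these assumptions, two `li` nodes that each sit under their own `ul` inside the same container have identical relative paths and are treated as clusterable, while an `li` and a `tr` under the same container are not.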
An embodiment of the present invention further provides web page key module extraction equipment, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor; when executing the program, the processor implements the web page key module extraction method described above.
An embodiment of the present invention further provides a computer storage medium storing program instructions which, when executed by a processor, implement the web page key module extraction method described above.
It should be noted that, for the functional implementation of each unit of the web page key module extraction apparatus in the embodiments of the present invention, reference may be made to the description of the related method embodiments; details are not repeated here.
The present application is described above with reference to block diagrams and/or flowcharts of methods, apparatuses (systems), and/or computer program products according to embodiments of the present application. It should be understood that a block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, and/or other programmable data processing apparatus to produce a machine, such that the instructions, executed via the computer processor and/or other programmable data processing apparatus, create means for implementing the functions/acts specified in the block diagram and/or flowchart blocks.
Accordingly, the present application may also be embodied in hardware and/or in software (including firmware, resident software, microcode, etc.). Furthermore, the present application may take the form of a computer program product on a computer-usable or computer-readable storage medium, having computer-usable or computer-readable program code embodied in the medium for use by or in connection with an instruction execution system. In the context of this application, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or device.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims (12)

1. A method for extracting a key module of a web page, characterized by comprising:
obtaining valid links contained in a target web page, and tags containing the valid links, wherein a valid link is a detail-page link internal to the web page;
determining common parent web page modules of the tags containing the valid links;
taking, among the common parent web page modules, the parent web page module containing the largest number of the valid links as the key module of the web page.
2. The method of claim 1, characterized in that obtaining the valid links contained in the target web page, and the tags containing the valid links, comprises:
obtaining all tags in the target web page that directly contain a link;
deleting invalid links from the links directly contained in those tags, to obtain the valid links contained in the target web page;
deleting the tags corresponding to the invalid links from those tags, to obtain the tags containing the valid links.
3. The method of claim 2, characterized in that the invalid links include at least one of the following:
a link whose domain name is inconsistent with the domain name of the web page;
a link containing a preset keyword;
wherein the keywords are the keywords in a table of keywords commonly used in non-detail-page links and the keywords in a table of keywords commonly used in useless links.
4. The method of any one of claims 1 to 3, characterized in that determining the common parent web page modules of the tags containing the valid links comprises:
obtaining the parent web page module of each tag among the tags containing the valid links;
combining the parent web page modules of the tags in pairs and, for each combination, determining whether an ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination, wherein an ancestor web page module is the parent web page module of a parent web page module included in the combination;
if an ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination, taking the ancestor web page module that is the ancestor as the common parent web page module;
if no ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination and the ancestor web page modules are identical, performing hierarchical clustering on the parent web page modules included in the combination and on the parent web page modules contained under their ancestor web page module, and taking the lowest common parent web page module of the clustered parent web page modules as the common parent web page module.
5. The method of claim 4, characterized in that, before the hierarchical clustering is performed on the parent web page modules included in the combination and on the parent web page modules contained under their ancestor web page module, the method further comprises:
obtaining the parent web page module node chain corresponding to each parent web page module node included in the combination;
determining the lowest common parent web page module node of the parent web page module node chains corresponding to the parent web page module nodes;
determining the relative path from each web page module node to the lowest common parent web page module node, and determining the node names on the relative path;
determining, according to the similarity of the node names on the relative paths, that the parent web page modules in the combination are similar and can be clustered.
6. An apparatus for extracting a key module of a web page, characterized by comprising:
an acquiring unit, configured to obtain valid links contained in a target web page, and tags containing the valid links, wherein a valid link is a detail-page link internal to the web page;
a processing unit, configured to determine common parent web page modules of the tags containing the valid links, and to take, among the common parent web page modules, the parent web page module containing the largest number of the valid links as the key module of the web page.
7. The apparatus of claim 6, characterized in that the acquiring unit is configured to obtain the valid links contained in the target web page, and the tags containing the valid links, as follows:
obtaining all tags in the target web page that directly contain a link;
deleting invalid links from the links directly contained in those tags, to obtain the valid links contained in the target web page;
deleting the tags corresponding to the invalid links from those tags, to obtain the tags containing the valid links.
8. The apparatus of claim 7, characterized in that the invalid links include at least one of the following:
a link whose domain name is inconsistent with the domain name of the web page;
a link containing a preset keyword;
wherein the keywords are the keywords in a table of keywords commonly used in non-detail-page links and the keywords in a table of keywords commonly used in useless links.
9. The apparatus of any one of claims 6 to 8, characterized in that the processing unit is specifically configured to determine the common parent web page modules of the tags containing the valid links as follows:
obtaining the parent web page module of each tag among the tags containing the valid links;
combining the parent web page modules of the tags in pairs and, for each combination, determining whether an ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination, wherein an ancestor web page module is the parent web page module of a parent web page module included in the combination;
if an ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination, taking the ancestor web page module that is the ancestor as the common parent web page module;
if no ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination and the ancestor web page modules are identical, performing hierarchical clustering on the parent web page modules included in the combination and on the parent web page modules contained under their ancestor web page module, and taking the lowest common parent web page module of the clustered parent web page modules as the common parent web page module.
10. The apparatus of claim 9, characterized in that the acquiring unit is further configured to:
obtain the parent web page module node chain corresponding to each parent web page module node included in the combination;
and the processing unit is further configured to:
determine the lowest common parent web page module node of the parent web page module node chains, obtained by the acquiring unit, that correspond to the parent web page module nodes;
determine the relative path from each web page module node to the lowest common parent web page module node, and determine the node names on the relative path;
determine, according to the similarity of the node names on the relative paths, that the parent web page modules in the combination are similar and can be clustered.
11. Web page key module extraction equipment, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the web page key module extraction method of any one of claims 1 to 5.
12. A computer storage medium, characterized in that program instructions are stored in the computer storage medium, and the program instructions, when executed by a processor, implement the web page key module extraction method of any one of claims 1 to 5.
CN201711402540.2A 2017-12-22 2017-12-22 Webpage key module extraction method and device Active CN110020247B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711402540.2A CN110020247B (en) 2017-12-22 2017-12-22 Webpage key module extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711402540.2A CN110020247B (en) 2017-12-22 2017-12-22 Webpage key module extraction method and device

Publications (2)

Publication Number Publication Date
CN110020247A true CN110020247A (en) 2019-07-16
CN110020247B CN110020247B (en) 2021-05-14

Family

ID=67187130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711402540.2A Active CN110020247B (en) 2017-12-22 2017-12-22 Webpage key module extraction method and device

Country Status (1)

Country Link
CN (1) CN110020247B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541874A (en) * 2010-12-16 2012-07-04 中国移动通信集团公司 Webpage text content extracting method and device
CN102254009A (en) * 2011-07-15 2011-11-23 福建星网锐捷通讯股份有限公司 Method for extracting data of webpage table
CN102298638A (en) * 2011-08-31 2011-12-28 北京中搜网络技术股份有限公司 Method and system for extracting news webpage contents by clustering webpage labels
CN102663023A (en) * 2012-03-22 2012-09-12 浙江盘石信息技术有限公司 Implementation method for extracting web content
CN104462394A (en) * 2012-06-25 2015-03-25 北京奇虎科技有限公司 System and method for recognizing content posts of webpage
CN103838792A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Method for determining webpage theme
KR20160045974A (en) * 2014-10-17 2016-04-28 인포뱅크 주식회사 Apparatus and method for relaying group message
CN105786951A (en) * 2015-12-31 2016-07-20 北京金山安全软件有限公司 Method and device for extracting content blocks in webpage and server
CN106960057A (en) * 2017-04-05 2017-07-18 上海威固信息技术有限公司 A kind of method that Web page text is extracted based on information density

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
张超: "基于分块的主题信息抽取研究与应用", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
罗超然: "一种通用的网页内容抽取模块的设计与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
谢馨庭: "基于单DOM树特征预分类的自适应Web信息抽取方法", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Also Published As

Publication number Publication date
CN110020247B (en) 2021-05-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant