CN110020247A - A kind of webpage key modules extracting method and device - Google Patents
- Publication number
- CN110020247A (application CN201711402540.2A)
- Authority
- CN
- China
- Prior art keywords
- web page
- page module
- father
- link
- label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/954—Navigation, e.g. using categorised browsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and device for extracting the key module of a webpage. The valid links contained in a target webpage, and the tags containing those valid links, are obtained; the common parent webpage modules of the tags containing valid links are determined; and the parent webpage module that contains the largest number of valid links among the common parent webpage modules is taken as the webpage key module. The webpage key module can thus be extracted even when visual features are lacking.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and device for extracting the key module of a webpage.
Background
In today's era of information explosion, the network plays a very important role, and the information contained on webpages has become rich and complex. A webpage may include navigation, titles, body text, timestamps and even advertisements. Because these various types of data are mixed together, it is difficult for users to extract the key, useful information.
In the prior art, extracting the useful content of a webpage generally requires a fine-grained analysis of the page, after which the key information on the page is extracted. At present, webpage information extraction is frequently based on visual features. Such extraction uses the Vision-based Page Segmentation (VIPS) algorithm, which extracts the key information of a webpage from the Document Object Model (DOM) tree structure together with visual features. Specifically, the content of each part of the webpage is first judged intuitively by its visual appearance; whether a node can be divided is then judged according to the DOM tree structure, that is, whether the tag corresponding to each part of the content has sub-pages. Sub-pages are divided until no further division is possible, and the part of the information that is directly visible is then extracted as the key information. If the key information on the webpage cannot be obtained intuitively from visual cues, it cannot be extracted.
Summary of the invention
The object of the present invention is to provide a method and device for extracting the key module of a webpage, so as to solve the problem of how to extract the webpage key module when visual features are lacking.
The object of the present invention is achieved through the following technical solutions:
One aspect of the present invention provides a webpage key module extraction method, comprising:
obtaining the valid links contained in a target webpage, and the tags containing the valid links, a valid link being a detail page link pointing inside the website;
determining the common parent webpage modules of the tags containing the valid links;
taking, among the common parent webpage modules, the parent webpage module that contains the largest number of valid links as the webpage key module.
Optionally, obtaining the valid links contained in the target webpage, and the tags containing the valid links, comprises:
obtaining all tags in the target webpage that directly contain links;
deleting invalid links from the links directly contained in those tags, to obtain the valid links contained in the target webpage;
deleting the tags corresponding to the invalid links from those tags, to obtain the tags containing the valid links.
Optionally, the invalid links include at least one of the following:
links whose domain name is inconsistent with the domain name of the webpage;
links that contain a preset keyword;
wherein the keyword is a keyword from a table of keywords commonly used in non-detail-page links or from a table of keywords commonly used in useless links.
Optionally, determining the common parent webpage modules of the tags containing the valid links comprises:
obtaining the parent webpage module of each of the tags containing the valid links;
combining the parent webpage modules of the tags in pairs, and, for each combination, determining whether an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, an ancestor webpage module being the parent webpage module of a parent webpage module included in the combination;
if an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, taking the ancestor webpage module that is the ancestor as a common parent webpage module;
if no ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination and the ancestor webpage modules are the same, performing hierarchical clustering on the parent webpage modules included in the combination and the parent webpage modules under their ancestor webpage module, and taking the lowest common parent webpage module of the clustered parent webpage modules as a common parent webpage module.
Optionally, before performing hierarchical clustering on the parent webpage modules included in the combination and the parent webpage modules under their ancestor webpage module, the method further comprises:
obtaining the parent webpage module node chain corresponding to each parent webpage module node included in the combination;
determining the lowest common parent webpage module node of the parent webpage module node chains corresponding to the parent webpage module nodes;
determining the relative path from each webpage module node to the lowest common parent webpage module node, and determining the node names on the relative path;
determining, according to the similarity of the node names on the relative paths, that the parent webpage modules in the combination are similar and can be clustered.
Another aspect of the present invention provides a webpage key module extraction device, comprising:
an acquiring unit, configured to obtain the valid links contained in a target webpage, and the tags containing the valid links, a valid link being a detail page link pointing inside the website;
a processing unit, configured to determine the common parent webpage modules of the tags containing the valid links, and to take, among the common parent webpage modules, the parent webpage module that contains the largest number of valid links as the webpage key module.
Optionally, the acquiring unit is configured to obtain the valid links contained in the target webpage, and the tags containing the valid links, in the following manner:
obtaining all tags in the target webpage that directly contain links;
deleting invalid links from the links directly contained in those tags, to obtain the valid links contained in the target webpage;
deleting the tags corresponding to the invalid links from those tags, to obtain the tags containing the valid links.
Optionally, the invalid links include at least one of the following:
links whose domain name is inconsistent with the domain name of the webpage;
links that contain a preset keyword;
wherein the keyword is a keyword from a table of keywords commonly used in non-detail-page links or from a table of keywords commonly used in useless links.
Optionally, the processing unit is specifically configured to determine the common parent webpage modules of the tags containing the valid links in the following manner:
obtaining the parent webpage module of each of the tags containing the valid links;
combining the parent webpage modules of the tags in pairs, and, for each combination, determining whether an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, an ancestor webpage module being the parent webpage module of a parent webpage module included in the combination;
if an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, taking the ancestor webpage module that is the ancestor as a common parent webpage module;
if no ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination and the ancestor webpage modules are the same, performing hierarchical clustering on the parent webpage modules included in the combination and the parent webpage modules under their ancestor webpage module, and taking the lowest common parent webpage module of the clustered parent webpage modules as a common parent webpage module.
Optionally, the acquiring unit is further configured to obtain the parent webpage module node chain corresponding to each parent webpage module node included in the combination;
the processing unit is further configured to:
determine the lowest common parent webpage module node of the parent webpage module node chains, obtained by the acquiring unit, corresponding to the parent webpage module nodes;
determine the relative path from each webpage module node to the lowest common parent webpage module node, and determine the node names on the relative path;
determine, according to the similarity of the node names on the relative paths, that the parent webpage modules in the combination are similar and can be clustered.
The present invention also provides a webpage key module extraction apparatus, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor; when executing the program, the processor implements the webpage key module extraction method described above.
The present invention also provides a computer storage medium storing program instructions which, when executed by a processor, implement the webpage key module extraction method described above.
In the webpage key module extraction method and device provided by the present invention, the valid links contained in a target webpage, and the tags containing the valid links, are obtained; the common parent webpage modules of the tags containing valid links are determined; and the parent webpage module that contains the largest number of valid links among the common parent webpage modules is taken as the webpage key module, so that the webpage key module can be extracted even when visual features are lacking.
Brief description of the drawings
Fig. 1 is a flow chart of a webpage key module extraction method provided in an embodiment of the present invention;
Fig. 2 is a flow chart of an implementation of obtaining the valid links contained in the target webpage and the tags containing the valid links;
Fig. 3 is a flow chart of an implementation, provided in an embodiment of the present invention, of determining the common parent webpage modules of the tags containing the valid links;
Fig. 4 is a flow chart of an implementation, provided in an embodiment of the present invention, of determining how to perform hierarchical clustering on the parent webpage modules included in a combination;
Fig. 5 is a schematic diagram of DOM tree nodes provided in an embodiment of the present invention;
Fig. 6 is a structural block diagram of a webpage key module extraction device provided in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, shall fall within the protection scope of the present invention.
In the webpage key module extraction method and device provided in the embodiments of the present invention, the valid links contained in a target webpage, and the tags containing the valid links, are obtained; the common parent webpage modules of the tags containing valid links are determined; and the parent webpage module that contains the largest number of valid links among the common parent webpage modules is taken as the webpage key module. This solves the problem of how to extract the webpage key module when visual features are lacking.
Fig. 1 shows a flow chart of a webpage key module extraction method provided in an embodiment of the present invention. Referring to Fig. 1, the method may include the following steps:
S101: Obtain the valid links contained in the target webpage, and the tags containing the valid links.
In general, a webpage being browsed contains various links. To view information in a targeted manner, the key information must be selected from among these links.
In the embodiment of the present invention, the valid links contained in the webpage can be obtained from the target webpage. A valid link is a detail page link pointing inside the website; a detail page link can be understood as the link in the webpage whose content carries the most information.
In the embodiment of the present invention, the source code of the target webpage can be downloaded, all tags containing links can be looked up in the source code, and the tags of valid links can be selected from the tags containing links.
S102: Determine the common parent webpage modules of the tags containing valid links.
In the embodiment of the present invention, the parent webpage modules of the tags containing valid links can be looked up, and the common parent webpage modules of all the parent webpage modules can then be determined.
S103: Take the parent webpage module that contains the largest number of valid links among the common parent webpage modules as the webpage key module.
In the embodiment of the present invention, after the common parent webpage modules are determined, the parent webpage modules containing valid links can be looked up within the common parent webpage modules, and the parent webpage module containing the largest number of valid links is selected as the key module of the webpage.
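Step S103 can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the implementation described by the patent: the `Node` class and the functions `is_inside`, `count_valid_links` and `pick_key_module` are hypothetical names, and breaking ties by taking the first candidate is an assumption the text does not address.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a DOM node: a tag name and a parent pointer."""
    tag: str
    parent: Optional["Node"] = field(default=None, repr=False)


def is_inside(node: Node, module: Node) -> bool:
    """Return True if `module` is an ancestor of `node` in the tree."""
    cur = node.parent
    while cur is not None:
        if cur is module:
            return True
        cur = cur.parent
    return False


def count_valid_links(module: Node, valid_anchors: List[Node]) -> int:
    """Count the valid-link tags located under `module`."""
    return sum(1 for a in valid_anchors if is_inside(a, module))


def pick_key_module(common_parents: List[Node], valid_anchors: List[Node]) -> Node:
    """Pick the candidate module holding the most valid links.

    max() returns the first maximal element; this tie-breaking is an assumption.
    """
    return max(common_parents, key=lambda m: count_valid_links(m, valid_anchors))


# Toy page: a sidebar with one valid link and a main list with three.
body = Node("body")
sidebar = Node("div", body)
main = Node("div", body)
anchors = [Node("a", sidebar)] + [Node("a", main) for _ in range(3)]
key = pick_key_module([sidebar, main], anchors)
```

In this toy page, `main` is selected because it holds three of the four valid links.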
In one possible implementation, the valid links contained in the target webpage, and the tags containing the valid links, can be obtained by the method shown in Fig. 2. Referring to Fig. 2, the method includes:
S1011: Obtain all tags in the target webpage that directly contain links.
S1012: Delete invalid links from the links directly contained in those tags, to obtain the valid links contained in the target webpage.
S1013: Delete the tags corresponding to the invalid links from those tags, to obtain the tags containing valid links.
Specifically, an invalid link in the embodiment of the present invention can be understood as a link pointing to an external website. An invalid link may be a link whose domain name is inconsistent with the domain name of the webpage, or a link that contains a preset keyword.
In one possible implementation, the domain name of the target webpage is obtained, all tags in the target webpage that directly contain links are obtained, and the most basic tag, for example the a tag, is selected from them. To judge whether a link is valid, the domain name of each link corresponding to an a tag is compared with the domain name of the target webpage. If the two are inconsistent, the link corresponding to that a tag is regarded as an invalid link and is deleted.
In another possible implementation, a table of keywords for non-detail-page links and a table of keywords commonly used in useless links can be preset. For example, the keywords may be set to: advertisement, recommendation, and so on. When a link in the webpage contains such a keyword, it is regarded as an invalid link and is deleted.
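Steps S1011 to S1013 can be sketched with the Python standard library as follows. This is only an illustration under stated assumptions: the names `AnchorCollector`, `INVALID_KEYWORDS` and `is_valid_link` are hypothetical, domain consistency is approximated by comparing host names, and the keyword check is applied to both the URL and the anchor text, a detail the patent does not specify.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical keyword table; the text only names "advertisement" and
# "recommendation" as example keywords.
INVALID_KEYWORDS = ("advert", "recommend")


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every a tag that carries an href (S1011)."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def is_valid_link(href, text, page_url):
    """Apply the two invalid-link rules: foreign domain, preset keyword."""
    absolute = urljoin(page_url, href)
    # Assumption: "domain name inconsistent" is checked by host-name equality.
    if urlparse(absolute).hostname != urlparse(page_url).hostname:
        return False
    haystack = (absolute + " " + text).lower()
    return not any(keyword in haystack for keyword in INVALID_KEYWORDS)


page_url = "https://news.example.com/index.html"
html = (
    '<div><a href="/a/1.html">Story one</a>'
    '<a href="https://other.example.org/x">Partner</a>'
    '<a href="/advert/9.html">Sponsored</a>'
    '<a href="/a/2.html">Recommended for you</a></div>'
)
collector = AnchorCollector()
collector.feed(html)
# S1012 and S1013: dropping an invalid link also drops its tag.
valid = [(h, t) for h, t in collector.anchors if is_valid_link(h, t, page_url)]
```

Only the first anchor survives here: the second points to another host, and the other two match a keyword in the URL or in the anchor text.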
In the embodiment of the present invention, after the tags of the valid links are obtained, the common parent webpage modules of the tags containing valid links can be determined. In one possible implementation, the method shown in Fig. 3 can be used to determine the common parent webpage modules of the tags containing the valid links. Referring to Fig. 3, the method includes:
S1021: Obtain the parent webpage module of each of the tags containing valid links.
In general, a set of information in a webpage with obvious visually identifiable features can be regarded as a webpage module, and a webpage module is usually a block-level tag such as div or table.
In the DOM tree, all nodes are related to one another. If the block-level tag of a node is a div tag, a table tag or a body tag, the node is regarded as a block node; a block node may also be called a webpage module. In the following description, "block node" and "webpage module" are sometimes used interchangeably, but those skilled in the art will understand that they have the same meaning.
Taking the a tags of the valid links as an example, obtaining the parent webpage module of each of the tags containing valid links is described in detail below.
All a tags of the valid links are formed into a queue, and the parent block node corresponding to each a tag in the queue is found. If the tag of the parent block node is the body tag, that parent block node is ignored; if the tag of the parent block node is not the body tag, the parent block node found for the a tag is added to a queue. The set of parent block nodes, found for all a tags, whose tag is not the body tag is regarded as the parent webpage modules.
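The lookup in S1021 can be sketched as follows. This is a minimal illustration under stated assumptions: `Node`, `BLOCK_TAGS`, `parent_block` and `parent_modules` are hypothetical names, and the parent block node is taken to be the nearest ancestor whose tag is div, table or body.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Tags treated as block nodes in the description above.
BLOCK_TAGS = {"div", "table", "body"}


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a DOM node: a tag name and a parent pointer."""
    tag: str
    parent: Optional["Node"] = field(default=None, repr=False)


def parent_block(node: Node) -> Optional[Node]:
    """Walk upward to the nearest ancestor that is a block node."""
    cur = node.parent
    while cur is not None and cur.tag not in BLOCK_TAGS:
        cur = cur.parent
    return cur


def parent_modules(a_tags: List[Node]) -> List[Node]:
    """Collect the distinct parent block nodes of the a tags, ignoring body."""
    modules: List[Node] = []
    for a in a_tags:
        block = parent_block(a)
        if block is None or block.tag == "body":
            continue  # a body parent block is ignored
        if not any(block is seen for seen in modules):
            modules.append(block)
    return modules


# Toy tree: body > div > ul > li > a, body > a, body > table > tr > td > a
body = Node("body")
div = Node("div", body)
a1 = Node("a", Node("li", Node("ul", div)))
a2 = Node("a", body)
table = Node("table", body)
a3 = Node("a", Node("td", Node("tr", table)))
blocks_pre = parent_modules([a1, a2, a3])
```

Here `blocks_pre` holds the div and the table; the a tag placed directly under body contributes nothing because its parent block node is body.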
S1022: Combine the parent webpage modules of the tags in pairs, and for each combination determine whether an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination.
In the embodiment of the present invention, different tags may correspond to different parent webpage modules. Two different parent webpage modules can be selected at random from the parent webpage modules to form a two-element set, and the parent webpage module of each of the two parent webpage modules is determined. By traversing all two-element sets, it is determined for each set whether an ancestor-descendant relationship exists between the parent webpage modules of its two parent webpage modules.
In one possible implementation, suppose the queue formed by the parent block nodes found for the a tags in step S1021 is blocks_pre. Since different a tags may correspond to different parent block nodes, two different parent block nodes, for example m and n, can be selected from blocks_pre. The parent block nodes of m and n are found to be Mp and Np, and it is then determined whether an ancestor-descendant relationship exists between Mp and Np.
S1023: If an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, take the ancestor webpage module that is the ancestor as a common parent webpage module.
In the embodiment of the present invention, if an ancestor-descendant relationship exists between Mp and Np, and Mp is the ancestor relative to Np, then Mp is taken as a common parent webpage module.
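The pairwise check in S1022 and S1023 can be sketched as follows. This is an illustration under stated assumptions: `Node`, `is_ancestor` and `relate` are hypothetical names, every node in the toy tree is taken to be a block node so that `.parent` yields Mp and Np directly, and the "unrelated" outcome is included only for completeness, since the text does not say how such pairs are handled.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Optional, Tuple


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a block node: a name and a parent pointer."""
    name: str
    parent: Optional["Node"] = field(default=None, repr=False)


def is_ancestor(ancestor: Node, node: Node) -> bool:
    """Return True if `ancestor` lies on the parent chain of `node`."""
    cur = node.parent
    while cur is not None:
        if cur is ancestor:
            return True
        cur = cur.parent
    return False


def relate(mp: Node, np_: Node) -> Tuple[str, Optional[Node]]:
    """Classify a pair of ancestor modules Mp and Np."""
    if mp is np_:
        return "same", mp          # candidate for hierarchical clustering (S1024)
    if is_ancestor(mp, np_):
        return "ancestor", mp      # Mp becomes a common parent module (S1023)
    if is_ancestor(np_, mp):
        return "ancestor", np_
    return "unrelated", None       # not covered by the text


# Toy tree: body > outer > inner; m sits under outer, n and q under inner.
body = Node("body")
outer = Node("outer", body)
inner = Node("inner", outer)
m, n, q = Node("m", outer), Node("n", inner), Node("q", inner)

# Pair up the parent modules two by two, as in S1022.
results = {
    (x.name, y.name): relate(x.parent, y.parent)
    for x, y in combinations([m, n, q], 2)
}
```

For the pair (m, n) the ancestor module `outer` is returned as a common parent module, while the pair (n, q) shares the same ancestor module `inner` and would go on to the clustering step.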
S1024: If no ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, and the ancestor webpage modules are the same, perform hierarchical clustering on the parent webpage modules included in the combination and the parent webpage modules under their ancestor webpage module, and take the lowest common parent webpage module of the clustered parent webpage modules as a common parent webpage module.
In the embodiment of the present invention, suppose no ancestor-descendant relationship exists between the ancestor webpage modules Mp and Np of the tags in the combination (that is, the parent webpage modules of the parent webpage modules included in the combination), and the ancestor webpage modules are the same. In one possible implementation, before hierarchical clustering is performed on the parent webpage modules included in the combination and the parent webpage modules under their ancestor webpage module, the method shown in Fig. 4 may be used to determine how to perform hierarchical clustering on the parent webpage modules included in the combination. Referring to Fig. 4, the method includes:
S10241: Obtain the parent webpage module node chain corresponding to each parent webpage module node included in the combination.
In the embodiment of the present invention, the parent nodes Mp and Np of m and n are obtained. When Mp and Np are the same node, the parent node chains m_plinks and n_plinks of m and n are obtained. A parent node chain includes all block nodes and non-block nodes. For example, Fig. 5 is a schematic diagram of DOM tree nodes. Referring to Fig. 5, Mp (table) indicates that Mp is a block node and MP (div) indicates that MP is a block node; the parent node chain m_plinks of m is then (m, Mp, MP, A), and the parent node chain n_plinks of n is (n, Np, MP, A).
S10242: Determine the lowest common parent webpage module node of the parent webpage module node chains corresponding to the parent webpage module nodes.
In the embodiment of the present invention, Mp, MP and A are parent nodes. By comparing the parent nodes on the DOM tree in order, Mp can be determined as the lowest common parent webpage module node.
S10243: Determine the relative path from each webpage module node to the lowest common parent webpage module node, and determine the node names on the relative path.
In the embodiment of the present invention, the relative paths of m and n with respect to the lowest common parent webpage module node Mp are m_rlinks and n_rlinks. For the nodes in Fig. 5, m_rlinks is (m, Mp) and n_rlinks is (n, Mp); the node names on m_rlinks and n_rlinks are then determined.
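Steps S10241 to S10243 can be sketched on a tree shaped like the Fig. 5 example. This is an illustration under stated assumptions: `Node`, `parent_chain`, `lowest_common_parent` and `relative_path` are hypothetical names, and the tags given to m, n and A are invented for the example, since the text only states that Mp is a table and MP is a div.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a DOM node: label, tag name, parent pointer."""
    name: str
    tag: str
    parent: Optional["Node"] = field(default=None, repr=False)


def parent_chain(node: Node) -> List[Node]:
    """Return the node followed by all of its ancestors up to the root."""
    chain = [node]
    while chain[-1].parent is not None:
        chain.append(chain[-1].parent)
    return chain


def lowest_common_parent(chain_m: List[Node], chain_n: List[Node]) -> Optional[Node]:
    """Return the first ancestor of m, walking upward, that is also an ancestor of n."""
    for candidate in chain_m[1:]:
        if any(candidate is other for other in chain_n[1:]):
            return candidate
    return None


def relative_path(chain: List[Node], lcp: Node) -> List[Node]:
    """Cut a parent chain off at the lowest common parent (inclusive)."""
    path = []
    for node in chain:
        path.append(node)
        if node is lcp:
            break
    return path


# Tree modelled on Fig. 5: A > MP (div) > Mp (table) > m, n
A = Node("A", "body")          # tag is an assumption
MP = Node("MP", "div", A)
Mp = Node("Mp", "table", MP)
m = Node("m", "div", Mp)       # tag is an assumption
n = Node("n", "div", Mp)       # tag is an assumption

m_plinks = parent_chain(m)
n_plinks = parent_chain(n)
lcp = lowest_common_parent(m_plinks, n_plinks)
m_rlinks = relative_path(m_plinks, lcp)
n_rlinks = relative_path(n_plinks, lcp)
m_tags = [node.tag for node in m_rlinks]
```

The chains come out as (m, Mp, MP, A) and (n, Mp, MP, A), the lowest common parent is Mp, and the relative paths are (m, Mp) and (n, Mp), matching the example in the text.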
S10244: Determine, according to the similarity of the node names on the relative paths, that the parent webpage modules in the combination are similar and can be clustered.
In the embodiment of the present invention, the similarity of the node names on the relative paths can represent the similarity of the paths m_rlinks and n_rlinks.
Specifically, the similarity s is calculated from the node names m, Mp and n, Mp on the relative paths m_rlinks and n_rlinks, by a formula in two quantities, sum and ldist. Here sum is the total length of the name string m_tags on the relative path of node m and the name string n_tags on the relative path of node n, and ldist is a quasi edit distance: the minimum number of operations (insertion, deletion, replacement) needed to convert m_tags into n_tags, where a deletion or insertion adds 1 to the operation count and a replacement adds 2. The calculation of this formula is known in the prior art and is not described again here.
In the embodiment of the present invention, after the similarity of m_tags and n_tags is calculated, it must be judged whether m_rlinks and n_rlinks are similar. Suppose the similarity threshold is s_threshold. If s >= s_threshold, m_tags and n_tags are considered similar, that is, the tags m and n are similar; otherwise, m_tags and n_tags are considered dissimilar.
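The similarity test can be sketched as follows. The formula itself is not reproduced in this text, so the ratio s = (sum - ldist) / sum used below is an assumption: it is one common way to combine exactly these two quantities, and it yields 1 for identical strings. The function names and the threshold value 0.8 are likewise hypothetical.

```python
def class_edit_distance(a: str, b: str) -> int:
    """Edit distance where insertion and deletion cost 1 and replacement costs 2."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            replace = prev[j - 1] + (0 if ca == cb else 2)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, replace))
        prev = cur
    return prev[-1]


def similarity(m_tags: str, n_tags: str) -> float:
    """Assumed ratio (sum - ldist) / sum; not confirmed by the source text."""
    total = len(m_tags) + len(n_tags)
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    return (total - class_edit_distance(m_tags, n_tags)) / total


def is_similar(m_tags: str, n_tags: str, s_threshold: float) -> bool:
    """Apply the rule s >= s_threshold from the description."""
    return similarity(m_tags, n_tags) >= s_threshold


# Name strings for the Fig. 5 paths (m, Mp) and (n, Mp); the way node names
# are concatenated into one string is an assumption.
s = similarity("divtable", "divtable")
similar = is_similar("divtable", "divtable", s_threshold=0.8)
```

With identical name strings the distance is 0, so s is 1 and the two paths pass any threshold not greater than 1.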
If m_tags and n_tags are dissimilar, the webpage modules corresponding to m and n are deleted. If m_tags and n_tags are similar, it must then be judged whether a node q that has a common parent node with m and n can be clustered with m and n.
Since the parent nodes Mp and Np of m and n are the same node, the following process is described in detail taking node m as an example. The child nodes of Mp shown in Fig. 5 are not limited to the three nodes m, n and q in the figure; suppose they may also include X1, X2, X3 ... Xn, in which case {q, X1, X2, X3 ... Xn} are clustered. Suppose the aggregation threshold is c_threshold and the number of block nodes that share a common parent node with node m is count_a; the aggregation degree is then calculated.
Specifically, the aggregation degree can be calculated as (count_mc+1)/(count_m+1), where count_mc+1 denotes the number of tags that can be clustered and count_m+1 denotes the number of all block nodes that have a common parent node with node m. For example, in Fig. 5 count_mc+1 is 3 and count_m+1 is 3, so the aggregation degree is 3/3 = 1. If the aggregation degree is less than the threshold c_threshold, {m, q, X1, X2, X3 ... Xn} cannot be aggregated; if the aggregation degree is greater than the threshold c_threshold, m is aggregated with the remaining nodes q, X1, X2, X3 ... Xn that share the common parent node.
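The aggregation test can be written out as a short worked example. The function names and the threshold value 0.6 are hypothetical. The text only covers the cases "less than" and "greater than" the threshold, so treating an aggregation degree equal to c_threshold as not aggregatable is an assumption.

```python
def aggregation_degree(count_mc: int, count_m: int) -> float:
    """(count_mc + 1) / (count_m + 1), as given in the description."""
    return (count_mc + 1) / (count_m + 1)


def can_aggregate(count_mc: int, count_m: int, c_threshold: float) -> bool:
    """Aggregate only when the degree exceeds the threshold.

    The equality case is not specified in the text; it is treated as
    "not aggregatable" here by assumption.
    """
    return aggregation_degree(count_mc, count_m) > c_threshold


# Fig. 5 example: count_mc + 1 = 3 and count_m + 1 = 3.
degree = aggregation_degree(count_mc=2, count_m=2)
ok = can_aggregate(count_mc=2, count_m=2, c_threshold=0.6)
```

For the Fig. 5 values the degree is 3/3 = 1, which exceeds the illustrative threshold, so m would be aggregated with the other nodes under the common parent.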
In the embodiment of the present invention, if m can be clustered with the remaining nodes q, X1, X2, X3 ... Xn that share the common parent node, m and q, X1, X2, X3 ... Xn are clustered and merged, and the common parent webpage module of the parent webpage modules after clustering and merging is taken as a common parent webpage module.
Specifically, if m can be clustered with q, X1, X2, X3 ... Xn, then m and q, X1, X2, X3 ... Xn are clustered and merged into the common parent node Mp, and Mp is taken as a new parent webpage module.
In the embodiment of the present invention, judging whether the relative paths are similar and aggregating the nodes avoids introducing redundant modules during webpage module extraction and improves the accuracy of extracting the key module of the target page.
Based on the same concept as the method embodiments for webpage key module extraction described above, an embodiment of the present invention further provides a webpage key module extraction device. Fig. 6 shows a structural block diagram of a webpage key module extraction device provided in an embodiment of the present invention. Referring to Fig. 6, the device includes an acquiring unit 101 and a processing unit 102, wherein:
the acquiring unit 101 is configured to obtain the valid links contained in a target webpage, and the tags containing the valid links, a valid link being a detail page link pointing inside the website;
the processing unit 102 is configured to determine the common parent webpage modules of the tags containing valid links, and to take, among the common parent webpage modules, the parent webpage module that contains the largest number of valid links as the webpage key module.
Specifically, the acquiring unit 101 is configured to obtain the valid links contained in the target webpage, and the tags containing the valid links, in the following manner:
obtaining all tags in the target webpage that directly contain links; deleting invalid links from the links directly contained in those tags, to obtain the valid links contained in the target webpage; and deleting the tags corresponding to the invalid links from those tags, to obtain the tags containing the valid links.
Optionally, the invalid links include at least one of the following:
links whose domain name is inconsistent with the domain name of the webpage; and links that contain a preset keyword; wherein the keyword is a keyword from a table of keywords commonly used in non-detail-page links or from a table of keywords commonly used in useless links.
Further, the processing unit 102 is specifically configured to determine the common parent webpage modules of the tags containing valid links in the following manner:
obtaining the parent webpage module of each of the tags containing valid links; combining the parent webpage modules of the tags in pairs, and, for each combination, determining whether an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, an ancestor webpage module being the parent webpage module of a parent webpage module included in the combination; if an ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination, taking the ancestor webpage module that is the ancestor as a common parent webpage module; if no ancestor-descendant relationship exists between the ancestor webpage modules of the tags in the combination and the ancestor webpage modules are the same, performing hierarchical clustering on the parent webpage modules included in the combination and the parent webpage modules under their ancestor webpage module, and taking the lowest common parent webpage module of the clustered parent webpage modules as a common parent webpage module.
Further, the acquiring unit 101 is also configured to:
obtain the parent webpage module node chain corresponding to each parent webpage module node included in the combination.
The processing unit 102 is also configured to:
determine the lowest common parent webpage module node of the parent webpage module node chains, obtained by the acquiring unit, corresponding to the parent webpage module nodes; determine the relative path from each webpage module node to the lowest common parent webpage module node, and determine the node names on the relative path; and determine, according to the similarity of the node names on the relative paths, that the parent webpage modules in the combination are similar and can be clustered.
An embodiment of the present invention also provides a webpage key module extraction apparatus, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor; when executing the program, the processor implements the webpage key module extraction method described above.
An embodiment of the present invention also provides a computer storage medium storing program instructions which, when executed by a processor, implement the webpage key module extraction method described above.
It should be noted that, for the functional implementation of each unit in the webpage key module extraction device described in the embodiments of the present invention, reference may be made to the description of the related method embodiments, which is not repeated here.
The application is described above with reference to block diagrams and/or flowcharts of methods, apparatuses (systems), and/or computer program products according to embodiments of the application. It should be understood that a block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, and/or another programmable data processing apparatus to produce a machine, such that the instructions, executed via the computer processor and/or other programmable data processing apparatus, create means for implementing the functions/acts specified in the block diagram and/or flowchart blocks.
Accordingly, the application may also be implemented in hardware and/or software (including firmware, resident software, microcode, etc.). Further, the application may take the form of a computer program product on a computer-usable or computer-readable storage medium, having computer-usable or computer-readable program code embodied in the medium for use by or in connection with an instruction execution system. In the context of this application, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or device.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.
Claims (12)
1. A key web page module extraction method, characterized by comprising:
obtaining the valid links included in a target web page and the tags containing the valid links, wherein a valid link is a link that points to a detail page internal to the web page;
determining the common parent web page modules of the tags containing the valid links; and
taking, among the common parent web page modules, the parent web page module containing the largest number of valid links as the key web page module.
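The final selection step of the method can be sketched as follows, under stated assumptions: the page is a small well-formed snippet parsed with Python's standard `xml.etree.ElementTree`, the candidate common parent modules and the set of valid links are given as inputs, and the function names are hypothetical. Real HTML is often not well-formed XML, so a practical implementation would likely need a tolerant HTML parser.

```python
import xml.etree.ElementTree as ET


def count_valid_links(module, valid_hrefs):
    """Count <a> elements inside the module whose href is a valid link."""
    return sum(1 for a in module.iter("a") if a.get("href") in valid_hrefs)


def pick_key_module(candidates, valid_hrefs):
    """Return the candidate containing the most valid links (first one on a tie)."""
    return max(candidates, key=lambda m: count_valid_links(m, valid_hrefs))


PAGE = """
<body>
  <div id="side"><a href="/item/9">Hot</a><a href="/about">About</a></div>
  <div id="list">
    <ul>
      <li><a href="/item/1">One</a></li>
      <li><a href="/item/2">Two</a></li>
    </ul>
  </div>
</body>
"""

root = ET.fromstring(PAGE)
side, listing = root.findall("div")          # assumed candidate common parent modules
valid = {"/item/1", "/item/2", "/item/9"}    # assumed output of the link-filtering step

key_module = pick_key_module([side, listing], valid)
```

Here the `list` block holds two valid links and the `side` block only one, so the `list` block is selected as the key module.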
2. The method according to claim 1, characterized in that obtaining the valid links included in the target web page and the tags containing the valid links comprises:
obtaining all tags in the target web page that directly contain links;
deleting invalid links from the links directly contained in all the tags, to obtain the valid links included in the target web page; and
deleting the tags corresponding to the invalid links from all the tags, to obtain the tags containing the valid links.
3. The method according to claim 2, characterized in that the invalid links include at least one of the following links:
a link whose domain name is inconsistent with the domain name of the web page;
a link containing a preset keyword;
wherein the keyword is a keyword in a table of keywords commonly used in non-detail-page links and in a table of keywords commonly used in useless links.
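The filtering of invalid links set out in claims 2 and 3 might be sketched in Python as below. The keyword tuple, the exact-match comparison of host names, and the representation of tags as `(tag_id, href)` pairs are assumptions made for illustration only; the claims do not list concrete keywords or say how subdomains are treated.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical keyword table; the actual tables are not given in the text.
INVALID_KEYWORDS = ("login", "register", "about", "contact")


def is_valid_link(href, page_url, keywords=INVALID_KEYWORDS):
    """A link is kept if it stays on the page's host and has no preset keyword."""
    absolute = urlparse(urljoin(page_url, href))
    if absolute.netloc != urlparse(page_url).netloc:
        return False  # domain name inconsistent with the page's domain name
    path = (absolute.path + "?" + absolute.query).lower()
    return not any(keyword in path for keyword in keywords)


def filter_tags(tag_links, page_url):
    """Drop invalid links together with the tags that contain them."""
    return [(tag, href) for tag, href in tag_links if is_valid_link(href, page_url)]


page_url = "https://example.com/list"
tag_links = [
    ("a1", "/item/1"),
    ("a2", "https://other.example.org/item/2"),
    ("a3", "/login"),
    ("a4", "javascript:void(0)"),
]

kept = filter_tags(tag_links, page_url)
```

Only the first tag survives in this example: the second points to a different host, the third contains a preset keyword, and the fourth resolves to a URL with no host at all.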
4. The method according to any one of claims 1 to 3, characterized in that determining the common parent web page modules of the tags containing the valid links comprises:
obtaining the parent web page module of each tag among the tags containing the valid links;
combining the parent web page modules of the tags in pairs and, for each combination, determining whether an ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination, wherein an ancestor web page module is the parent web page module of a parent web page module included in the combination;
if an ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination, taking the ancestor web page module that is the ancestor as the common parent web page module; and
if no ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination and the ancestor web page modules are identical, performing hierarchical clustering on the parent web page modules included in the combination and on each parent web page module contained under their ancestor web page module, and taking the lowest common parent web page module of the clustered parent web page modules as the common parent web page module.
5. The method according to claim 4, characterized in that, before performing hierarchical clustering on the parent web page modules included in the combination and on each parent web page module contained under their ancestor web page module, the method further comprises:
obtaining the parent web page module node chain corresponding to each parent web page module node included in the combination;
determining the lowest common parent web page module node of the parent web page module node chains corresponding to the parent web page module nodes;
determining the relative path from each web page module node to the lowest common parent web page module node, and determining the node names on the relative path; and
according to the similarity of the node names on the relative paths, determining that the parent web page modules in the combination are similar and can be clustered.
6. A key web page module extraction apparatus, characterized by comprising:
an acquiring unit, configured to obtain the valid links included in a target web page and the tags containing the valid links, wherein a valid link is a link that points to a detail page internal to the web page; and
a processing unit, configured to determine the common parent web page modules of the tags containing the valid links, and to take, among the common parent web page modules, the parent web page module containing the largest number of valid links as the key web page module.
7. The apparatus according to claim 6, characterized in that the acquiring unit is configured to obtain the valid links included in the target web page and the tags containing the valid links as follows:
obtaining all tags in the target web page that directly contain links;
deleting invalid links from the links directly contained in all the tags, to obtain the valid links included in the target web page; and
deleting the tags corresponding to the invalid links from all the tags, to obtain the tags containing the valid links.
8. The apparatus according to claim 7, characterized in that the invalid links include at least one of the following links:
a link whose domain name is inconsistent with the domain name of the web page;
a link containing a preset keyword;
wherein the keyword is a keyword in a table of keywords commonly used in non-detail-page links and in a table of keywords commonly used in useless links.
9. The apparatus according to any one of claims 6 to 8, characterized in that the processing unit is specifically configured to determine the common parent web page modules of the tags containing the valid links as follows:
obtaining the parent web page module of each tag among the tags containing the valid links;
combining the parent web page modules of the tags in pairs and, for each combination, determining whether an ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination, wherein an ancestor web page module is the parent web page module of a parent web page module included in the combination;
if an ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination, taking the ancestor web page module that is the ancestor as the common parent web page module; and
if no ancestor-descendant relationship exists between the ancestor web page modules of the tags in the combination and the ancestor web page modules are identical, performing hierarchical clustering on the parent web page modules included in the combination and on each parent web page module contained under their ancestor web page module, and taking the lowest common parent web page module of the clustered parent web page modules as the common parent web page module.
10. The apparatus according to claim 9, characterized in that the acquiring unit is also configured to:
obtain the parent web page module node chain corresponding to each parent web page module node included in the combination;
and the processing unit is also configured to:
determine the lowest common parent web page module node of the parent web page module node chains, obtained by the acquiring unit, that correspond to the parent web page module nodes;
determine the relative path from each web page module node to the lowest common parent web page module node, and determine the node names on the relative path; and
according to the similarity of the node names on the relative paths, determine that the parent web page modules in the combination are similar and can be clustered.
11. A key web page module extraction device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the key web page module extraction method according to any one of claims 1 to 5.
12. A computer storage medium, characterized in that program instructions are stored in the computer storage medium, and the program instructions, when executed by a processor, implement the key web page module extraction method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711402540.2A CN110020247B (en) | 2017-12-22 | 2017-12-22 | Webpage key module extraction method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711402540.2A CN110020247B (en) | 2017-12-22 | 2017-12-22 | Webpage key module extraction method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110020247A (en) | 2019-07-16 |
CN110020247B (en) | 2021-05-14 |
Family
ID=67187130
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711402540.2A Active CN110020247B (en) | 2017-12-22 | 2017-12-22 | Webpage key module extraction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110020247B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102254009A (en) * | 2011-07-15 | 2011-11-23 | 福建星网锐捷通讯股份有限公司 | Method for extracting data of webpage table |
CN102298638A (en) * | 2011-08-31 | 2011-12-28 | 北京中搜网络技术股份有限公司 | Method and system for extracting news webpage contents by clustering webpage labels |
CN102541874A (en) * | 2010-12-16 | 2012-07-04 | 中国移动通信集团公司 | Webpage text content extracting method and device |
CN102663023A (en) * | 2012-03-22 | 2012-09-12 | 浙江盘石信息技术有限公司 | Implementation method for extracting web content |
CN103838792A (en) * | 2012-11-27 | 2014-06-04 | 大连灵动科技发展有限公司 | Method for determining webpage theme |
CN104462394A (en) * | 2012-06-25 | 2015-03-25 | 北京奇虎科技有限公司 | System and method for recognizing content posts of webpage |
KR20160045974A (en) * | 2014-10-17 | 2016-04-28 | 인포뱅크 주식회사 | Apparatus and method for relaying group message |
CN105786951A (en) * | 2015-12-31 | 2016-07-20 | 北京金山安全软件有限公司 | Method and device for extracting content blocks in webpage and server |
CN106960057A (en) * | 2017-04-05 | 2017-07-18 | 上海威固信息技术有限公司 | A kind of method that Web page text is extracted based on information density |
Non-Patent Citations (3)
Title |
---|
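张超: "Research and Application of Block-Based Topic Information Extraction", China Master's Theses Full-text Database, Information Science and Technology * |
罗超然: "Design and Implementation of a General-Purpose Web Page Content Extraction Module", China Master's Theses Full-text Database, Information Science and Technology * |
谢馨庭: "An Adaptive Web Information Extraction Method Based on Single-DOM-Tree Feature Pre-classification", China Master's Theses Full-text Database, Information Science and Technology * |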
Also Published As
Publication number | Publication date |
---|---|
CN110020247B (en) | 2021-05-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108932294B (en) | Resume data processing method, device, equipment and storage medium based on index | |
US11544578B2 (en) | Method, device and equipment for fusing different instances describing same entity | |
US11321421B2 (en) | Method, apparatus and device for generating entity relationship data, and storage medium | |
Sun et al. | Dom based content extraction via text density | |
US8554800B2 (en) | System, methods and applications for structured document indexing | |
CN102473190B (en) | Keyword assignment to a web page | |
CN107391675B (en) | Method and apparatus for generating structured information | |
US20150067476A1 (en) | Title and body extraction from web page | |
CN104102639B (en) | Popularization triggering method based on text classification and device | |
CN104750789A (en) | Label recommendation method and device | |
CN103294781A (en) | Method and equipment used for processing page data | |
US20100211533A1 (en) | Extracting structured data from web forums | |
CN111639284A (en) | Webpage labeling method and device, electronic equipment and medium | |
CN111079043A (en) | Key content positioning method | |
US9767086B2 (en) | System and method for enablement of data masking for web documents | |
CN105574200A (en) | User interest extraction method based on historical record | |
CN104462540A (en) | Webpage information extraction method | |
JP2008165284A (en) | System and method for monitoring rumor and program | |
CN112818200A (en) | Data crawling and event analyzing method and system based on static website | |
CN105512334A (en) | Data mining method based on search words | |
JP2007122398A (en) | Method for determining identity of fragment, and computer program | |
WO2015047921A1 (en) | Determining images of article for extraction | |
CN111061975A (en) | Method and device for processing irrelevant content in page | |
CN110020247A (en) | A kind of webpage key modules extracting method and device | |
CN113656466B (en) | Policy data query method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||