CN105512296A - Webpage difference based webpage analysis method and system - Google Patents

Webpage difference based webpage analysis method and system

Info

Publication number
CN105512296A
Authority
CN
China
Prior art keywords
webpage
web page
node
visual
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510917292.XA
Other languages
Chinese (zh)
Other versions
CN105512296B (en
Inventor
冯建兴
张云刚
翁时锋
梁丰
王遵义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo Zhongqing Cyyun New Media Technology Co Ltd
Original Assignee
Ningbo Zhongqing Cyyun New Media Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo Zhongqing Cyyun New Media Technology Co Ltd
Priority to CN201510917292.XA
Publication of CN105512296A
Application granted
Publication of CN105512296B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a webpage difference based webpage analysis method and system. The method comprises steps as follows: webpage information needing analysis is collected; the webpage information is clustered according to a node structure of the collected webpage information; visual webpage elements with node content change in each node are extracted according to a clustering result; the extracted visual webpage elements are classified so as to be identified. According to the webpage difference based webpage analysis method and system, webpage content analysis can be realized automatically, manual operation is not required, further, analysis for the core webpage information concerned by a user is realized through extraction of the visual changing webpage elements, and the method and the system have the advantages of high pertinence, good analysis effect, manpower cost saving, high efficiency, high analysis capacity and high universality.

Description

Webpage analysis method and system based on webpage differences
Technical field
The present invention relates to network information analysis technology, and in particular to a webpage analysis method and system based on webpage differences.
Background art
A public opinion monitoring system needs to crawl a large number of webpages on the network continuously and parse their content correctly, extracting information such as publication time, author, and title. For example, in a volunteer network public opinion service, volunteer-related public opinion information must be collected from many websites. However, page formats differ widely between websites, and even within one website the pages of different sections may use different formats. These formats are also revised at irregular intervals. Such differences and revisions make automated webpage analysis very difficult, so a great deal of manual intervention is often needed to configure rules by hand for new webpages and for the parsing errors that keep appearing.
At present, webpage content is mainly analyzed by the following methods:
1. Analyzing the content organization pattern of a webpage manually to find the node in the DOM tree that holds the webpage element to be extracted. For example, a fixed matching rule (generally described with XPath) is determined by manual observation and experiment. DOM is a standard of the W3C (World Wide Web Consortium) that defines how HTML and XML documents are accessed. Based on this standard, the code structure of a webpage can be represented as a tree, called the DOM tree, as shown in Fig. 1. An XPath is the path from the root node to a particular node of the DOM tree; for a given webpage, an XPath can uniquely identify one webpage node.
2. Finding the boundaries of each information block in a webpage with heuristic rules. For example, a node describing the time generally contains words such as "publication time" or "time", and the node corresponding to an article title often contains the word "title".
3. Analyzing the features of webpage elements, for example distinguishing body text from advertisements by the distribution of function words inside each information block. Advertisement text seldom contains function words, whereas such words occur very frequently in body text.
4. Rendering the webpage and then analyzing visual feature points (T-Points) to extract webpage content. For example, once a webpage is displayed in a browser, the screen position (T-Point) of its title is largely fixed, so the webpage elements to be extracted can be determined from such position information.
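The fixed matching rule of method 1 can be illustrated with a minimal Python sketch. It relies on the limited XPath subset of the standard-library `xml.etree.ElementTree` module; the sample markup, the function name `extract_by_path`, and the path `./body/div/h1` are made up for illustration and do not come from the source.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page. Real-world HTML is often not
# well-formed XML and would need a dedicated HTML parser instead.
PAGE = "<html><body><div><h1>Sample title</h1><p>Body text</p></div></body></html>"


def extract_by_path(markup: str, path: str):
    """Return the text of the first node matching a hand-written path rule."""
    root = ET.fromstring(markup)
    node = root.find(path)  # ElementTree supports only a subset of XPath
    return None if node is None else node.text


title = extract_by_path(PAGE, "./body/div/h1")
missing = extract_by_path(PAGE, "./body/span")
```

A rule of this kind stops matching as soon as the page layout changes, which is the maintenance burden the shortcomings listed next refer to.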
However, each of these methods has shortcomings:
1. Manual analysis of the content organization pattern is inefficient, because every new webpage and every changed webpage has to be analyzed by hand.
2. Heuristic rules may work for most mainstream webpages but cannot cope with the endless variety of non-mainstream webpages. The Internet has a pronounced long-tail effect: small non-mainstream websites are numerous, and heuristic rules lack sufficient automated parsing capability for their pages.
3. Analyzing the features of webpage elements works well for nodes with rich text content but poorly for nodes with little text (for example, the author field of a web document is generally very short, and microblog posts are inherently short).
4. Analyzing visual feature points requires the webpage to be rendered first, and the results are unsatisfactory for content whose display area varies widely (such as long text or pictures).
Summary of the invention
The purpose of the embodiments of the present invention is to provide a webpage analysis method and system based on webpage differences, so as to solve the problems of low efficiency, insufficient analysis capability, and poor generality in existing webpage content analysis.
An embodiment of the present invention proposes a webpage analysis method based on webpage differences, comprising:
Collecting the webpage information to be analyzed;
Clustering the webpage information according to the node structure of the collected webpage information;
Extracting, according to the clustering result, the visual webpage elements whose node content changes in each node;
Classifying the extracted visual webpage elements so as to identify them.
In the webpage analysis method based on webpage differences according to a preferred embodiment of the present invention, before the step of clustering the webpage information according to the node structure of the collected webpage information, the method further comprises: removing the visual content from the source code of the webpage information.
In the webpage analysis method based on webpage differences according to a preferred embodiment of the present invention, the step of clustering the webpage information according to the node structure of the collected webpage information comprises:
Calculating the distance between any two webpages according to the node structure of the collected webpage information;
Clustering the webpage information according to the calculated distance between any two webpages.
In the webpage analysis method based on webpage differences according to a preferred embodiment of the present invention, the step of extracting, according to the clustering result, the visual webpage elements whose node content changes in each node comprises:
Removing the nodes whose frequency of occurrence in the clustering result is below a set threshold;
Removing the nodes in which identical content occurs with a frequency above a set threshold;
Extracting the visual webpage elements of the remaining nodes.
In the webpage analysis method based on webpage differences according to a preferred embodiment of the present invention, the step of classifying the extracted visual webpage elements so as to identify them comprises:
Building classifiers for webpage element categories;
Classifying the extracted visual webpage elements with the built classifiers so as to identify them.
The present invention also proposes a webpage analysis system based on webpage differences, comprising:
An information collection module, configured to collect the webpage information to be analyzed;
A clustering module, configured to cluster the webpage information according to the node structure of the collected webpage information;
An element extraction module, configured to extract, according to the clustering result, the visual webpage elements whose node content changes in each node;
A classification module, configured to classify the extracted visual webpage elements so as to identify them.
In the webpage analysis system based on webpage differences according to a preferred embodiment of the present invention, the webpage analysis system further comprises:
A content filtering module, configured to remove the visual content from the source code of the webpage information before the clustering module clusters the webpage information.
In the webpage analysis system based on webpage differences according to a preferred embodiment of the present invention, the clustering module further comprises:
A distance calculation unit, configured to calculate the distance between any two webpages according to the node structure of the collected webpage information;
A webpage clustering unit, configured to cluster the webpage information according to the calculated distance between any two webpages.
In the webpage analysis system based on webpage differences according to a preferred embodiment of the present invention, the element extraction module comprises:
A first filtering unit, configured to remove the nodes whose frequency of occurrence in the clustering result is below a set threshold;
A second filtering unit, configured to remove the nodes in which identical content occurs with a frequency above a set threshold;
An extraction unit, configured to extract the visual webpage elements of the remaining nodes.
In the webpage analysis system based on webpage differences according to a preferred embodiment of the present invention, the classification module comprises:
A classifier building unit, configured to build classifiers for webpage element categories;
An element identification unit, configured to classify the extracted visual webpage elements with the built classifiers so as to identify them.
Compared with the prior art, the present invention has the following beneficial effects:
The present invention analyzes webpage content automatically without manual operation, which saves labor cost and is highly efficient.
The present invention clusters webpages by their node structure and analyzes the webpage information according to the clustering result. It has strong analysis capability, does not need to render webpages, applies to the webpages of all kinds of websites, and is therefore highly general; it effectively overcomes the long-tail effect found in Internet information analysis.
By extracting the visual webpage elements that change, the present invention analyzes the core webpage information that users care about. This not only makes the analysis well targeted and effective, but also eliminates a large amount of analysis computation on information that is unimportant to users, which reduces the computational load of the system and greatly improves efficiency.
Of course, a product implementing the present application does not necessarily need to achieve all of the above advantages at the same time.
Brief description of the drawings
Fig. 1 is a schematic diagram of a DOM tree;
Fig. 2 is a flowchart of a webpage analysis method based on webpage differences according to an embodiment of the present invention;
Fig. 3 is a flowchart of clustering the webpage information according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a visual webpage of one section of a forum;
Fig. 5 is a schematic diagram of another visual webpage of the same section of the forum;
Fig. 6 is a schematic diagram of the processing applied to the DOM trees of a webpage cluster according to an embodiment of the present invention;
Fig. 7 is a flowchart of the computation that extracts the visual webpage elements whose node content changes according to an embodiment of the present invention;
Fig. 8 is a flowchart of classifying the extracted visual webpage elements according to an embodiment of the present invention;
Fig. 9 is a flowchart of another webpage analysis method based on webpage differences according to an embodiment of the present invention;
Fig. 10 is a schematic diagram of visual content according to an embodiment of the present invention;
Fig. 11 is a structural diagram of a webpage analysis system based on webpage differences according to an embodiment of the present invention;
Fig. 12 is a structural diagram of a clustering module according to an embodiment of the present invention;
Fig. 13 is a structural diagram of an element extraction module according to an embodiment of the present invention;
Fig. 14 is a structural diagram of a classification module according to an embodiment of the present invention;
Fig. 15 is a structural diagram of another webpage analysis system based on webpage differences according to an embodiment of the present invention.
Detailed description of the embodiments
The foregoing and other technical content, features, and effects of the present invention are presented clearly in the following detailed description of preferred embodiments with reference to the drawings. Through the description of the embodiments, the technical means adopted by the present invention to achieve its intended purpose, and their effects, can be understood more deeply and concretely. The accompanying drawings, however, are provided for reference and illustration only and are not intended to limit the present invention.
Referring to Fig. 2, which is a flowchart of a webpage analysis method based on webpage differences according to an embodiment of the present invention, the method comprises the following steps:
S21, collecting the webpage information to be analyzed.
S22, clustering the webpage information according to the node structure of the collected webpage information.
S23, extracting, according to the clustering result, the visual webpage elements whose node content changes in each node.
S24, classifying the extracted visual webpage elements so as to identify them.
In step S21, the webpage information to be analyzed may be a particular section of a particular website, for which a large amount of historical webpage information is collected. The webpage information may include the webpage source code and the visual page content displayed by a browser; the DOM tree structure of each webpage can be built directly from its source code.
In step S22, the nodes are the nodes of the DOM tree. The collected webpages may vary in structure: even webpages collected from the same section may differ, and among forum webpages, for example, the pages of different boards may have different structures. The purpose of clustering in this step is therefore to group webpages with similar node structures into one class. This addresses the problem that webpage structures may vary even within one section of one website, and so facilitates the analysis of webpage content.
The clustering algorithm can be chosen as required; the present invention preferably clusters by computing the distance between every pair of webpages. Specifically, referring to Fig. 3, clustering the webpage information may further comprise the following steps:
S221, calculating the distance between any two webpages according to the node structure of the collected webpage information.
Since the node structure of a webpage can be regarded as a DOM tree, the distance between two webpages is computed as the edit distance between their two DOM trees. Methods for computing the edit distance between two trees have already been published (for example RTED: M. Pawlik and N. Augsten. RTED: a robust algorithm for the tree edit distance. Proc. VLDB Endow., 5(4):334-345, 2011.) and are not described again here.
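To make the idea of a distance between two DOM trees concrete, the following Python sketch computes a simplified tree distance. It does not implement RTED. It swaps in a much simpler top-down edit distance (in the style of Selkow's algorithm), in which a node can be relabeled and whole subtrees can be inserted or deleted, all at unit cost per node. The `(tag, children)` tuple representation and the function names are assumptions made for illustration.

```python
def tree_size(tree):
    """Number of nodes in a (tag, children) tree."""
    tag, children = tree
    return 1 + sum(tree_size(child) for child in children)


def top_down_distance(a, b):
    """Simplified top-down tree edit distance (not RTED).

    Relabeling a node costs 1; inserting or deleting a subtree costs
    its number of nodes. Children are aligned in order with a
    sequence-edit dynamic program.
    """
    (tag_a, kids_a), (tag_b, kids_b) = a, b
    relabel = 0 if tag_a == tag_b else 1
    m, n = len(kids_a), len(kids_b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + tree_size(kids_a[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + tree_size(kids_b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + tree_size(kids_a[i - 1]),   # delete subtree
                d[i][j - 1] + tree_size(kids_b[j - 1]),   # insert subtree
                d[i - 1][j - 1] + top_down_distance(kids_a[i - 1], kids_b[j - 1]),
            )
    return relabel + d[m][n]


# Hypothetical DOM skeletons of three pages.
page_1 = ("html", [("body", [("div", []), ("p", [])])])
page_2 = ("html", [("body", [("div", [])])])
page_3 = ("html", [("body", [("div", []), ("span", [])])])
```

A general tree edit distance such as the one RTED computes also allows single nodes to be inserted or deleted in the middle of a tree, so it can return smaller values than this restricted variant.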
S222, clustering the webpage information according to the calculated distance between any two webpages.
The clustering algorithm can be chosen as required, for example the K-means algorithm. K-means is a typical distance-based clustering algorithm that uses distance as the measure of similarity: the closer two objects are, the more similar they are considered to be.
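K-means as usually defined averages vector coordinates, whereas step S221 yields only pairwise distances. One common way to bridge this gap, which is an assumption of this sketch and not something the source states, is k-medoids: each cluster center is an actual page rather than an average. The Python sketch below uses a deliberately naive initialization (the first `k` pages) and a made-up distance matrix.

```python
def k_medoids(dist, k, max_iter=100):
    """Cluster items given a symmetric pairwise distance matrix.

    Returns one cluster label per item. Initial medoids are simply
    the first k items, so results depend on the input order.
    """
    n = len(dist)
    medoids = list(range(k))
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: attach every item to its closest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        # Update step: the new medoid minimizes total distance within the cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels


# Hypothetical tree edit distances between four pages:
# pages 0 and 1 are structurally close, as are pages 2 and 3.
DIST = [
    [0, 1, 10, 10],
    [1, 0, 9, 9],
    [10, 9, 0, 1],
    [10, 9, 1, 0],
]
labels = k_medoids(DIST, k=2)
```

In practice the number of clusters is usually unknown in advance, so a threshold-based or hierarchical method may be a better fit; the choice is left open by the source.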
In step S23, the visual webpage elements whose node content changes are the pieces of information on the visual page whose content can change. Refer to Fig. 4 and Fig. 5 together, which are schematic diagrams of two visual webpages of the same section of the same forum. The regions marked with ovals in the figures are the visual webpage elements whose node content changes, namely author, posting time, title, content, number of views, and number of replies. This information differs between the two pages, and information whose visual content changes is usually the core information users care about. Conversely, information whose content is the same on both pages is usually information users care little about. The purpose of this step is therefore to find the core information on a webpage that users care about.
For each webpage cluster, the DOM tree nodes whose content does not change are removed. The visual elements in each of the remaining DOM tree nodes then serve as candidates for the information to be extracted.
To simplify the calculation, the DOM trees of each webpage cluster can be preprocessed. First, a tree alignment algorithm (for example Direct Optimization) is applied to the DOM trees of the webpages in each cluster to compute an alignment of these trees. Then, from this alignment, a minimal common tree is computed, that is, the union of all DOM trees in the cluster. As shown in Fig. 6, suppose a webpage cluster contains two webpages whose DOM trees are tree 1 and tree 2; computing the union of tree 1 and tree 2 yields tree 3, the minimal common tree corresponding to this cluster. The subsequent extraction of visual webpage elements whose node content changes can be based on this minimal common tree, which simplifies the computation.
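The union of the DOM trees in a cluster can be sketched in Python as follows. The tree alignment step is not reproduced here. Instead the sketch assumes that corresponding nodes have already been given the same key, such as `div[1]` for the first `div` child, which is a strong simplification. The nested-dictionary representation and the name `merge_trees` are hypothetical.

```python
def merge_trees(a, b):
    """Union of two trees stored as nested dicts: key -> subtree dict.

    Nodes with the same key under the same parent are treated as the
    same node. This stands in for a real tree alignment, which is not
    implemented here.
    """
    merged = {}
    for key in list(a) + [k for k in b if k not in a]:
        merged[key] = merge_trees(a.get(key, {}), b.get(key, {}))
    return merged


# Hypothetical DOM skeletons of two pages from one cluster.
tree_1 = {"html[1]": {"body[1]": {"div[1]": {}, "p[1]": {}}}}
tree_2 = {"html[1]": {"body[1]": {"div[1]": {}, "span[1]": {}}}}

tree_3 = merge_trees(tree_1, tree_2)
```

Every node of either input tree appears once in `tree_3`, so later per-node statistics can be collected against a single shared structure.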
Specifically, referring to Fig. 7, the computation that extracts the visual webpage elements whose node content changes may comprise the following steps in this embodiment:
S231, removing the nodes whose frequency of occurrence in the clustering result is below a set threshold.
The information to be collected is information that users care about, so it should appear in most webpages; this step therefore removes nodes that occur infrequently. The threshold in this step can be set as required and is preferably 0.8N, where N is the total number of nodes in the cluster.
S232, removing the nodes in which identical content occurs with a frequency above a set threshold.
Information whose content changes is the core information users care about. This step removes nodes whose content changes little, that is, it excludes webpage information users care little about. Therefore, if the same content occurs too frequently at a node, that node is filtered out. The threshold in this step can be set as required and is preferably 0.2N, where N is the total number of nodes in the cluster.
S233, extracting the visual webpage elements of the remaining nodes, that is, extracting all content information in each remaining node.
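Steps S231 to S233 can be sketched in Python as two frequency filters over one cluster. The sketch interprets N as the number of pages in the cluster, which is an assumption made so that the thresholds can be compared with per-node page counts. Representing each page as a dictionary from node path to text, and the name `changing_nodes`, are likewise illustrative.

```python
from collections import Counter


def changing_nodes(pages, min_presence=0.8, max_same=0.2):
    """Return node paths that are common across pages but vary in content.

    pages: list of dicts mapping a node path to its text on that page.
    A path is dropped if it occurs on fewer than min_presence * N pages
    (S231), or if one identical text occurs on more than max_same * N
    pages (S232). N is taken to be the number of pages (assumption).
    """
    n = len(pages)
    all_paths = set().union(*pages)
    kept = []
    for path in sorted(all_paths):
        values = [page[path] for page in pages if path in page]
        if len(values) < min_presence * n:
            continue  # S231: node is too rare
        most_common_count = Counter(values).most_common(1)[0][1]
        if most_common_count > max_same * n:
            continue  # S232: content is too repetitive
        kept.append(path)  # S233: candidate for extraction
    return kept


# Hypothetical cluster of five pages.
PAGES = []
for i in range(5):
    page = {"/html/body/h1": f"Title {i}", "/html/body/nav": "Home | Forum"}
    if i < 2:
        page["/html/body/ad"] = f"Ad {i}"
    PAGES.append(page)

candidates = changing_nodes(PAGES)
```

With five pages, the navigation bar is dropped because its text is identical everywhere, the advertisement node is dropped because it appears on only two pages, and the heading survives.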
In step S24, the extracted information needs to be classified in order to analyze the webpage information. Referring to Fig. 8, classifying the extracted visual webpage elements may further comprise the following steps:
S241, building classifiers for webpage element categories.
The classifier is used to classify the information in each candidate DOM tree node, for example to identify whether it is a time, an author, a title, or content. The classifier needs to be built only once for all webpages of all websites. The classification rules can also be set according to actual needs, for example:
A) For time, regular expressions can be used for matching, and a pattern classifier scores candidates according to the distribution of characters. After comparison, an SVM (Support Vector Machine, a supervised learning model commonly used for pattern recognition, classification, and regression analysis) is adopted for classification scoring.
B) For author, the system splits the extracted author name into individual characters, counts the characters, and scores the candidate with rules.
C) For title, the system first performs word segmentation, then examines the distribution of words and vectorizes it. Because LDA (Latent Dirichlet Allocation, a document topic generation model and an unsupervised machine learning technique that can identify latent topic information in large document collections or corpora) performs well on text, the system preferably uses LDA for vectorization and then scores with a pattern classifier. After comparison, the system adopts the random forest algorithm for scoring.
D) For content, the system first performs word segmentation, then extracts the function words, and uses a random forest for scoring.
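The rule-based parts of items A and B can be sketched in Python as follows. The SVM, LDA, and random forest components are not reproduced. The regular expression covers only a few numeric date formats, and the author scoring rule is invented for illustration; neither is taken from the source.

```python
import re

# Hypothetical pattern for timestamps such as "2015-12-10 08:30" or "2015/12/10".
TIME_RE = re.compile(
    r"\b\d{4}[-/]\d{1,2}[-/]\d{1,2}(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?\b"
)


def looks_like_time(text: str) -> bool:
    """Item A, first stage: does the text contain a date-like pattern?"""
    return TIME_RE.search(text) is not None


def author_score(text: str, max_chars: int = 20) -> float:
    """Item B, toy rule: short single-line strings with few digits score higher.

    The limit of 20 characters is an arbitrary illustrative choice.
    """
    name = text.strip()
    if not name or "\n" in name or len(name) > max_chars:
        return 0.0
    digits = sum(ch.isdigit() for ch in name)
    return 1.0 - digits / len(name)
```

A production system would extend the pattern to localized date formats and would feed such rule outputs into the trained classifiers described above.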
S242, classifying the extracted visual webpage elements with the built classifiers so as to identify them.
The visual webpage elements extracted from each DOM tree node are classified, which identifies the category of these webpage elements. One way to compute this is to further build a naive Bayes classifier for each of time, author, title, and content and, assuming that different webpages are mutually independent, decide whether a webpage node is a time, author, title, or content node.
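Under the independence assumption, the evidence from all pages in a cluster can be combined for one DOM node by multiplying per-page likelihoods, which is the core of naive Bayes. The Python sketch below works in log space. The per-page likelihood values and the uniform priors are hypothetical inputs; how they are obtained from the element classifiers is not specified here.

```python
import math

LABELS = ("time", "author", "title", "content")


def classify_node(likelihoods, priors, eps=1e-9):
    """Pick the most probable label for one DOM node.

    likelihoods: one dict per page, mapping label -> P(element on that
                 page | node has this label). Values are assumed inputs.
    priors:      dict mapping label -> prior probability.
    Pages are treated as independent, so log-likelihoods are summed.
    """
    scores = {}
    for label, prior in priors.items():
        log_score = math.log(max(prior, eps))
        for page in likelihoods:
            log_score += math.log(max(page.get(label, 0.0), eps))
        scores[label] = log_score
    return max(scores, key=scores.get)


# Hypothetical scores for the same node on three pages of one cluster.
PRIORS = {label: 0.25 for label in LABELS}
NODE_EVIDENCE = [
    {"time": 0.10, "author": 0.20, "title": 0.60, "content": 0.10},
    {"time": 0.05, "author": 0.15, "title": 0.70, "content": 0.10},
    {"time": 0.10, "author": 0.30, "title": 0.50, "content": 0.10},
]
node_label = classify_node(NODE_EVIDENCE, PRIORS)
```

Pooling several pages in this way lets one ambiguous page be outvoted by the others in the same cluster.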
In particular, once the node structures have been clustered, a newly collected webpage that needs to be analyzed can be compared with the webpage clusters already obtained in step S22 for that website and section, in order to find the most suitable class. Information is then extracted from the DOM tree nodes used for information collection in that class. If no suitable cluster is found, the webpage uses a new format. In that case it is necessary to return to step S22 and re-analyze which DOM tree nodes hold the required information in the new webpage.
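Handling a newly collected page can be sketched in Python as a nearest-cluster lookup with a cut-off. The use of one representative per cluster, the `max_distance` threshold, and all names are assumptions; the source only says that a suitable class is searched for and that analysis restarts at S22 when none is found.

```python
def assign_to_cluster(page, representatives, distance, max_distance):
    """Return the id of the closest cluster, or None for a new format.

    representatives: dict mapping cluster id -> representative page.
    distance:        callable giving the distance between two pages,
                     for example a tree edit distance.
    max_distance:    hypothetical cut-off beyond which no cluster fits.
    """
    best_id, best_distance = None, float("inf")
    for cluster_id, representative in representatives.items():
        d = distance(page, representative)
        if d < best_distance:
            best_id, best_distance = cluster_id, d
    return best_id if best_distance <= max_distance else None


# Toy usage with numbers standing in for pages.
REPRESENTATIVES = {"list_page": 10, "post_page": 50}


def toy_distance(a, b):
    return abs(a - b)


known = assign_to_cluster(12, REPRESENTATIVES, toy_distance, max_distance=5)
unknown = assign_to_cluster(30, REPRESENTATIVES, toy_distance, max_distance=5)
```

A return value of `None` corresponds to the case where the page uses a new format and clustering has to be repeated.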
The present invention analyzes webpage content automatically without manual operation, which saves labor cost and is highly efficient.
The present invention clusters webpages by their node structure and analyzes the webpage information according to the clustering result. It has strong analysis capability, does not need to render webpages, applies to the webpages of all kinds of websites, and is therefore highly general; it effectively overcomes the long-tail effect found in Internet information analysis.
By extracting the visual webpage elements that change, the present invention analyzes the core webpage information that users care about. This not only makes the analysis well targeted and effective, but also eliminates a large amount of analysis computation on information that is unimportant to users, which reduces the computational load of the system and greatly improves efficiency.
Referring to Fig. 9, which is a flowchart of another webpage analysis method based on webpage differences according to an embodiment of the present invention, the method comprises the following steps:
S91, collecting the webpage information to be analyzed.
S92, removing the visual content from the source code of the webpage information.
The visual content mentioned here differs from the visual webpage elements described above: visual content refers to the values in the webpage source code, whereas visual webpage elements refer to the content shown on the visual page after the browser renders it. As shown in Fig. 10, the title "Notes on keeping dogs in Quzhou" on the webpage is visible; this title has a corresponding node in the DOM tree, and the value of that node is "Notes on keeping dogs in Quzhou". Removing the visual content of a webpage means removing the values of such nodes in the DOM tree, which makes the DOM tree easier to analyze.
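Removing node values so that only the structure remains can be sketched in Python with the standard-library `html.parser` module. Dropping attributes as well as text is a choice made for this sketch, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class SkeletonBuilder(HTMLParser):
    """Collect tags only; text nodes (the visible values) are ignored."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped too, which is a simplification.
        self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")


def skeleton(html: str) -> str:
    """Return the tag structure of a page with all text removed."""
    builder = SkeletonBuilder()
    builder.feed(html)
    builder.close()
    return "".join(builder.parts)


page_a = "<html><body><h1>First title</h1><p>Some text</p></body></html>"
page_b = "<html><body><h1>Another title</h1><p>Other text</p></body></html>"
```

Two pages that differ only in their text yield the same skeleton, so a structural distance between them is zero regardless of content.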
S93, calculating the distance between any two webpages according to the node structure of the collected webpage information.
S94, clustering the webpage information according to the calculated distance between any two webpages.
S95, removing the nodes whose frequency of occurrence in the clustering result is below a set threshold.
S96, removing the nodes in which identical content occurs with a frequency above a set threshold.
S97, extracting the visual webpage elements of the remaining nodes.
S98, building classifiers for webpage element categories.
S99, classifying the extracted visual webpage elements with the built classifiers so as to identify them.
The present invention also proposes a webpage analysis system based on webpage differences. Referring to Fig. 11, which is a structural diagram of such a system according to an embodiment of the present invention, the system comprises: an information collection module 111, a clustering module 112, an element extraction module 113, and a classification module 114. The clustering module 112 is connected to the information collection module 111, the element extraction module 113 is connected to the clustering module 112, and the classification module 114 is connected to the element extraction module 113.
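The chain of four connected modules can be sketched in Python as a small pipeline object. Modeling each module as a plain callable, the class name `WebpageAnalysisSystem`, and the stub functions in the usage example are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class WebpageAnalysisSystem:
    """Wires modules 111 to 114 in the order described in the text."""

    collect: Callable[[], List[Any]]                 # information collection module
    cluster: Callable[[List[Any]], List[List[Any]]]  # clustering module
    extract: Callable[[List[Any]], List[Any]]        # element extraction module
    classify: Callable[[Any], str]                   # classification module

    def run(self) -> List[Tuple[Any, str]]:
        pages = self.collect()
        results = []
        for group in self.cluster(pages):
            for element in self.extract(group):
                results.append((element, self.classify(element)))
        return results


# Stub modules for demonstration only.
system = WebpageAnalysisSystem(
    collect=lambda: ["a1", "a2", "b1"],
    cluster=lambda pages: [
        [p for p in pages if p[0] == key] for key in sorted({p[0] for p in pages})
    ],
    extract=lambda group: group,
    classify=lambda element: "title" if element.endswith("1") else "other",
)
results = system.run()
```

Keeping the modules as interchangeable callables mirrors the statement that the clustering algorithm and classification rules can be chosen as required.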
Information acquisition module 111 is for gathering the info web of Water demand.The info web of described Water demand can be certain space of a whole page of certain website, and gathers its a large amount of history web pages information.Described info web can comprise the visual page content that webpage source code and browser display go out, and wherein can build each webpage dom tree structure intuitively according to webpage source code.
Cluster module 112, for the node structure according to the described info web gathered, carries out cluster to described info web.Described node and the node of dom tree.Because the structure of the webpage gathered may be varied, even if what gather is the info web of the same space of a whole page, its structure of web page may also be not quite similar, such as the webpage of forum equally, the structure of the page of different plate just may be different, therefore in this step, the object of cluster is exactly that webpage similar for node structure is classified as a class, even solve its structure of web page of the same space of a whole page also diversified problem of possibility of same website, so that the analysis of web page contents.
Further, as shown in figure 12, cluster module 112 may further include again: metrics calculation unit 1121 and website construction unit 1122.
Metrics calculation unit 1121, for the node structure according to the described info web gathered, calculates the distance between any two webpages.Node structure due to webpage can be regarded as a dom tree, and namely the distance between webpage and webpage also calculates the editing distance between two dom trees.For the editing distance between two dom trees, there is the method for the editing distance proposing calculating two tree at present in the world, do not repeated them here.
Website construction unit 1122, for according to the distance between any two webpages calculated, carries out cluster to described info web.The algorithm that cluster adopts can be selected as required, such as KMeans algorithm.
Element extraction module 113, for according to described cluster result, extracts in each node, the visual web page element of node content variation.The visual web page element of described node content variation refers to the information that on visual page, content can change.The information that these visual content can change, often namely user compares the core information of care.Otherwise, the information that on two pages, content is constant, the often information that not too can be concerned about of user.As can be seen here, the object of element extraction module 113 is exactly find out the core information that on webpage, user can be concerned about.
To simplify the calculation, the element extraction module 113 may first process the DOM trees of each webpage cluster. For the DOM trees corresponding to the webpages of each cluster, a tree alignment algorithm (for example DirectOptimization) is used to compute an alignment of these trees. Then, from this alignment, a minimal common tree is computed, that is, the union of all DOM trees in the cluster. The subsequent extraction of visual webpage elements with varying node content can be based on this minimal common tree, which simplifies the computation.
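The idea of merging the DOM trees of one cluster into a single common tree can be illustrated with a much simpler stand-in for tree alignment: identify each node by its root-to-node tag path with sibling indices, and take the union of these paths. This is not the DirectOptimization algorithm; the path scheme and names below are assumptions for the sketch.

```python
def node_paths(node, prefix="", index=0):
    """Return the set of indexed tag paths for every node in a DOM tree.

    A node is (tag, children); index counts same-tag siblings from 0.
    """
    tag, children = node
    path = f"{prefix}/{tag}[{index}]"
    paths = {path}
    seen = {}                        # per-tag sibling counter
    for child in children:
        i = seen.get(child[0], 0)
        seen[child[0]] = i + 1
        paths |= node_paths(child, path, i)
    return paths

def union_tree(trees):
    """Union of all node paths over the DOM trees of one cluster."""
    merged = set()
    for tree in trees:
        merged |= node_paths(tree)
    return merged

page_a = ("html", (("body", (("h1", ()), ("p", ()))),))
page_b = ("html", (("body", (("h1", ()), ("p", ()), ("p", ()))),))
for path in sorted(union_tree([page_a, page_b])):
    print(path)
```

Every later filtering step can then refer to a node by its path in this merged structure rather than by its position in an individual page.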
Further, as shown in Figure 13, the element extraction module 113 may further comprise a first filter unit 1131, a second filter unit 1132, and an extraction unit 1133.
The first filter unit 1131 removes nodes whose frequency of occurrence in the clustering result is below a set threshold. Because the information to be collected is information that users care about, it should appear in most webpages, so this step removes nodes that occur infrequently. The threshold can be set as needed and is preferably 0.8N, where N is the total number of nodes in the cluster.
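The frequency filter can be sketched as follows. The sketch reads N as the number of pages in the cluster and counts in how many pages each node occurs, which is an assumption made for illustration; the page dictionaries and node names are hypothetical.

```python
from collections import Counter

def frequent_nodes(pages, ratio=0.8):
    """Keep node paths that occur in at least ratio * N pages of a cluster.

    pages: list of dicts mapping a node path to its text content (assumed form).
    """
    n_pages = len(pages)
    # Iterating a dict yields its keys, so each page counts a path at most once.
    counts = Counter(path for page in pages for path in page)
    return {path for path, count in counts.items() if count >= ratio * n_pages}

pages = [
    {"title": "A", "time": "09:00", "nav": "Home"},
    {"title": "B", "time": "10:00", "nav": "Home"},
    {"title": "C", "time": "11:00", "nav": "Home", "ad": "Buy now"},
    {"title": "D", "time": "12:00", "nav": "Home"},
    {"title": "E", "nav": "Home"},
]
print(sorted(frequent_nodes(pages)))
```

Here the `ad` node appears in only one of five pages and is dropped, while `time` appears in four of five and meets the 0.8N threshold.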
The second filter unit 1132 removes nodes for which the same content occurs more often than a set threshold. Information whose content changes is the core information that users care about, so removing nodes whose content changes little eliminates webpage information that users do not particularly care about. Therefore, if the same content appears on a node too frequently, that node is filtered out. The threshold can be set as needed and is preferably 0.2N, where N is the total number of nodes in the cluster.
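A minimal sketch of this repeated-content filter is shown below. As an assumption for illustration, N is taken as the number of pages in the cluster and a node is dropped when its most common content occurs more than 0.2N times; the data and names are hypothetical.

```python
from collections import Counter

def varying_nodes(pages, paths, ratio=0.2):
    """Keep node paths whose most frequent content occurs at most ratio * N times.

    pages: list of dicts mapping a node path to its text content (assumed form).
    paths: candidate node paths, for example those that passed a frequency filter.
    """
    n_pages = len(pages)
    kept = set()
    for path in paths:
        values = Counter(page[path] for page in pages if path in page)
        if not values:
            continue                 # the node occurs in no page at all
        most_common_count = values.most_common(1)[0][1]
        if most_common_count <= ratio * n_pages:
            kept.add(path)
    return kept

pages = [
    {"title": "A", "nav": "Home"},
    {"title": "B", "nav": "Home"},
    {"title": "C", "nav": "Home"},
    {"title": "D", "nav": "Home"},
    {"title": "E", "nav": "Home"},
]
print(varying_nodes(pages, {"title", "nav"}))
```

The navigation text is identical on every page and is removed, while the title differs on each page and is kept as a candidate core element.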
The extraction unit 1133 extracts the visual webpage elements of the remaining nodes, that is, all content information in each remaining node.
The classification module 114 classifies the extracted visual webpage elements in order to identify them. Further, referring to Figure 14, the classification module 114 may further comprise a classifier construction unit 1141 and an element identification unit 1142.
The classifier construction unit 1141 builds a classifier for webpage element classification. The classifier classifies the information in each candidate DOM tree node, for example identifying whether it is a time, an author, a title, or body content. The classifier needs to be built only once for all webpages of all websites. The classification rules can also be set according to actual needs.
The element identification unit 1142 classifies the extracted visual webpage elements using the built classifier in order to identify them. The visual webpage elements extracted from each DOM tree node are classified, which identifies the category of each element. One possible computation is to build a separate naive Bayes classifier for each of time, author, title, and content and then, under the assumption that different webpages are independent of one another, decide whether a webpage node is a time, author, title, or content.
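To make the naive Bayes step concrete, here is a small sketch with Laplace smoothing. It simplifies the description by using one multiclass classifier rather than one classifier per category, and the three hand-picked features, the training samples, and all names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def features(text):
    """Hypothetical surface features of a node's text."""
    return [
        "digits" if any(ch.isdigit() for ch in text) else "no_digits",
        "long" if len(text) > 40 else "short",
        "colon" if ":" in text else "no_colon",
    ]

class NaiveBayes:
    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def fit(self, samples):
        for text, label in samples:
            self.label_counts[label] += 1
            for feat in features(text):
                self.feature_counts[label][feat] += 1
                self.vocabulary.add(feat)

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)              # log prior
            denom = sum(self.feature_counts[label].values()) + len(self.vocabulary)
            for feat in features(text):                  # add-one smoothing
                score += math.log((self.feature_counts[label][feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

training = [
    ("2015-12-11 10:30", "time"),
    ("2016-04-20 08:05", "time"),
    ("Reporter Li", "author"),
    ("Editor Wang", "author"),
    ("The city council met on Monday to discuss the new regulations for pet owners.", "content"),
    ("Residents are encouraged to register their pets with the local authority soon.", "content"),
]
classifier = NaiveBayes()
classifier.fit(training)
print(classifier.predict("2017-01-02 09:15"), classifier.predict("Reporter Zhang"))
```

Because the features describe the text itself rather than its position in a particular template, a classifier of this kind could be trained once and reused across sites, in line with the description.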
Further, after the clustering by node structure is complete, when newly collected webpage information needs to be analyzed, it can be compared with the webpage clusters of the same website and section in the clustering result output by the clustering module 112 to find the most suitable class. Information is then extracted from the DOM tree nodes used for information collection in that class. If no suitable cluster is found, the webpage uses a new format. In that case, the clustering module 112 must re-analyze which DOM tree nodes hold the required information in the new webpage.
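Matching a new page against existing clusters might look like the sketch below. The average-distance rule, the `max_distance` cutoff, and the toy tag-set distance are assumptions chosen to keep the example short; any page distance, such as a tree edit distance, could be passed in instead.

```python
def assign_to_cluster(page, clusters, distance, max_distance):
    """Return the label of the closest cluster, or None if none is close enough.

    clusters: dict mapping a cluster label to a non-empty list of member pages.
    distance: function giving the distance between two pages.
    """
    best_label, best_dist = None, None
    for label, members in clusters.items():
        average = sum(distance(page, m) for m in members) / len(members)
        if best_dist is None or average < best_dist:
            best_label, best_dist = label, average
    if best_dist is None or best_dist > max_distance:
        return None                  # new format: the page must be re-analyzed
    return best_label

def tag_set_distance(a, b):
    """Toy distance: number of tags present in exactly one of the two pages."""
    return len(set(a) ^ set(b))

clusters = {
    "article": [["html", "body", "h1", "p"], ["html", "body", "h1", "p", "span"]],
    "list": [["html", "body", "ul", "li"]],
}
print(assign_to_cluster(["html", "body", "h1", "p"], clusters, tag_set_distance, 2))
print(assign_to_cluster(["html", "table", "tr", "td"], clusters, tag_set_distance, 2))
```

A `None` result corresponds to the case in which the page uses a new format and has to go through clustering again.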
The present invention analyzes webpage content automatically, without manual operation, which saves labor cost and is highly efficient.
The present invention clusters webpages by their node structure and analyzes the webpage information according to the clustering result. Its analysis capability is strong, it does not require rendering the webpage, and it is applicable to the webpages of all kinds of websites. It is therefore highly general and can effectively overcome the long-tail effect in Internet information analysis.
By extracting the visual webpage elements that vary, the present invention analyzes the core webpage information that users care about. The analysis is well targeted and effective, and it avoids a large amount of computation on information that is unimportant to users, which reduces the computational load on the system and greatly improves efficiency.
Refer to Figure 15, which is a structural diagram of another webpage analysis system based on webpage difference according to an embodiment of the present invention. Compared with the embodiment of Figure 11, the webpage analysis system of this embodiment comprises, in addition to the information acquisition module 111, the clustering module 112, the element extraction module 113, and the classification module 114, a content filtering module 115.
The content filtering module 115 removes the visual content from the source code of the webpage information before the clustering module 112 clusters it. Visual content here differs from the visual webpage elements described above: visual content refers to values in the webpage source code, whereas a visual webpage element is the content information displayed on the visible page after the browser renders it. As shown in Figure 10, the title "attention of dog is supported in Quzhou" is visible on the webpage; this title has a corresponding node in the DOM tree, and the value of that node is "attention of dog is supported in Quzhou". Removing the visual content of a webpage means removing the values of such nodes from the DOM tree, which facilitates the analysis of the DOM tree.
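Removing node values while keeping the tag structure can be sketched with Python's standard `html.parser` module. As a simplification not stated in the description, the sketch also drops tag attributes; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class SkeletonBuilder(HTMLParser):
    """Collects only start and end tags; text nodes are ignored."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(f"<{tag}>")    # attributes dropped (simplification)

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    # handle_data is deliberately not overridden, so text content is discarded.

def strip_visual_content(html):
    """Return the tag skeleton of an HTML document without its text values."""
    builder = SkeletonBuilder()
    builder.feed(html)
    builder.close()
    return "".join(builder.parts)

source = "<html><body><h1>Some title</h1><p>Some text</p></body></html>"
print(strip_visual_content(source))
```

Two pages built from the same template but carrying different titles and body text reduce to the same skeleton, so the subsequent distance calculation reflects structure only.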
From the above description of the embodiments, those skilled in the art can clearly understand that the embodiments of the present invention can be implemented in hardware, or in software together with a necessary general-purpose hardware platform. On this understanding, the technical solution of the embodiments of the present invention can be embodied as a software product. The software product can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a portable hard disk) and comprises instructions that cause a computer device (such as a personal computer, a server, or a network device) to perform the methods described in the implementation scenarios of the embodiments of the present invention.
The above is only a preferred embodiment of the present invention and does not limit the present invention in any form. Although the present invention has been disclosed above by way of a preferred embodiment, this is not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make minor changes or modifications that amount to equivalent embodiments. Any simple modification, equivalent change, or refinement made to the above embodiment according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the scope of the technical solution of the present invention.

Claims (10)

1. A webpage analysis method based on webpage difference, characterized by comprising:
collecting the webpage information to be analyzed;
clustering the webpage information according to the node structure of the collected webpage information;
extracting, according to the clustering result, the visual webpage elements of nodes whose content varies;
classifying the extracted visual webpage elements to identify the visual webpage elements.
2. The webpage analysis method based on webpage difference of claim 1, characterized in that, before the step of clustering the webpage information according to the node structure of the collected webpage information, the method further comprises: removing the visual content from the source code of the webpage information.
3. The webpage analysis method based on webpage difference of claim 1, characterized in that the step of clustering the webpage information according to the node structure of the collected webpage information comprises:
calculating the distance between any two webpages according to the node structure of the collected webpage information;
clustering the webpage information according to the calculated distances between any two webpages.
4. The webpage analysis method based on webpage difference of claim 1, characterized in that the step of extracting, according to the clustering result, the visual webpage elements of nodes whose content varies comprises:
removing nodes whose frequency of occurrence in the clustering result is below a set threshold;
removing nodes for which the frequency of occurrence of the same content is above a set threshold;
extracting the visual webpage elements of the remaining nodes.
5. The webpage analysis method based on webpage difference of claim 1, characterized in that the step of classifying the extracted visual webpage elements to identify the visual webpage elements comprises:
building a classifier for webpage element classification;
classifying the extracted visual webpage elements according to the built classifier to identify the visual webpage elements.
6. A webpage analysis system based on webpage difference, characterized by comprising:
an information acquisition module for collecting the webpage information to be analyzed;
a clustering module for clustering the webpage information according to the node structure of the collected webpage information;
an element extraction module for extracting, according to the clustering result, the visual webpage elements of nodes whose content varies;
a classification module for classifying the extracted visual webpage elements to identify the visual webpage elements.
7. The webpage analysis system based on webpage difference of claim 6, characterized in that the webpage analysis system further comprises:
a content filtering module for removing the visual content from the source code of the webpage information before the clustering module clusters the webpage information.
8. The webpage analysis system based on webpage difference of claim 6, characterized in that the clustering module further comprises:
a distance calculation unit for calculating the distance between any two webpages according to the node structure of the collected webpage information;
a webpage clustering unit for clustering the webpage information according to the calculated distances between any two webpages.
9. The webpage analysis system based on webpage difference of claim 6, characterized in that the element extraction module comprises:
a first filter unit for removing nodes whose frequency of occurrence in the clustering result is below a set threshold;
a second filter unit for removing nodes for which the frequency of occurrence of the same content is above a set threshold;
an extraction unit for extracting the visual webpage elements of the remaining nodes.
10. The webpage analysis system based on webpage difference of claim 6, characterized in that the classification module comprises:
a classifier construction unit for building a classifier for webpage element classification;
an element identification unit for classifying the extracted visual webpage elements according to the built classifier to identify the visual webpage elements.
CN201510917292.XA 2015-12-11 2015-12-11 Webpage analysis method and system based on webpage difference Active CN105512296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510917292.XA CN105512296B (en) 2015-12-11 2015-12-11 Webpage analysis method and system based on webpage difference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510917292.XA CN105512296B (en) 2015-12-11 2015-12-11 Webpage analysis method and system based on webpage difference

Publications (2)

Publication Number Publication Date
CN105512296A true CN105512296A (en) 2016-04-20
CN105512296B CN105512296B (en) 2019-01-22

Family

ID=55720277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510917292.XA Active CN105512296B (en) 2015-12-11 2015-12-11 Webpage analysis method and system based on webpage difference

Country Status (1)

Country Link
CN (1) CN105512296B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399167A (en) * 2017-02-04 2018-08-14 百度在线网络技术(北京)有限公司 Webpage information extracting method and device
CN108628977A (en) * 2018-04-25 2018-10-09 咪咕文化科技有限公司 A kind of web page contents processing method, device and computer readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009048818A2 (en) * 2007-10-11 2009-04-16 Google Inc. Methods and systems for classifying search results to determine page elements
CN102750372A (en) * 2012-06-15 2012-10-24 翁时锋 Analytical method for automatically acquiring webpage structured information
CN103294781A (en) * 2013-05-14 2013-09-11 百度在线网络技术(北京)有限公司 Method and equipment used for processing page data
CN103473338A (en) * 2013-09-22 2013-12-25 北京奇虎科技有限公司 Webpage content extraction method and webpage content extraction system
CN103544178A (en) * 2012-07-13 2014-01-29 百度在线网络技术(北京)有限公司 Method and equipment for providing reconstruction page corresponding to target page
CN104834717A (en) * 2015-05-11 2015-08-12 浪潮集团有限公司 Web information automatic extraction method based on webpage clustering

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009048818A2 (en) * 2007-10-11 2009-04-16 Google Inc. Methods and systems for classifying search results to determine page elements
CN102750372A (en) * 2012-06-15 2012-10-24 翁时锋 Analytical method for automatically acquiring webpage structured information
CN103544178A (en) * 2012-07-13 2014-01-29 百度在线网络技术(北京)有限公司 Method and equipment for providing reconstruction page corresponding to target page
CN103294781A (en) * 2013-05-14 2013-09-11 百度在线网络技术(北京)有限公司 Method and equipment used for processing page data
CN103473338A (en) * 2013-09-22 2013-12-25 北京奇虎科技有限公司 Webpage content extraction method and webpage content extraction system
CN104834717A (en) * 2015-05-11 2015-08-12 浪潮集团有限公司 Web information automatic extraction method based on webpage clustering

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399167A (en) * 2017-02-04 2018-08-14 百度在线网络技术(北京)有限公司 Webpage information extracting method and device
CN108399167B (en) * 2017-02-04 2022-04-29 百度在线网络技术(北京)有限公司 Webpage information extraction method and device
CN108628977A (en) * 2018-04-25 2018-10-09 咪咕文化科技有限公司 A kind of web page contents processing method, device and computer readable storage medium
CN108628977B (en) * 2018-04-25 2021-03-16 咪咕文化科技有限公司 Webpage content processing method and device and computer readable storage medium

Also Published As

Publication number Publication date
CN105512296B (en) 2019-01-22

Similar Documents

Publication Publication Date Title
CN110334346B (en) Information extraction method and device of PDF (Portable document Format) file
CN103914478B (en) Webpage training method and system, webpage Forecasting Methodology and system
Kang et al. Modeling user interest in social media using news media and wikipedia
CN103226578B (en) Towards the website identification of medical domain and the method for webpage disaggregated classification
Pivk et al. Transforming arbitrary tables into logical form with TARTAR
CN106294425B (en) The automatic image-text method of abstracting and system of commodity network of relation article
CN105279277A (en) Knowledge data processing method and device
CN104915447A (en) Method and device for tracing hot topics and confirming keywords
CN102831193A (en) Topic detecting device and topic detecting method based on distributed multistage cluster
CN102509233A (en) User online action information-based recommendation method
CN109033200A (en) Method, apparatus, equipment and the computer-readable medium of event extraction
CN102298638A (en) Method and system for extracting news webpage contents by clustering webpage labels
CN112650923A (en) Public opinion processing method and device for news events, storage medium and computer equipment
CN104217038A (en) Knowledge network building method for financial news
CN108021715B (en) Heterogeneous label fusion system based on semantic structure feature analysis
JP2019507425A (en) Service processing method, data processing method and apparatus
CN109871433B (en) Method, device, equipment and medium for calculating relevance between document and topic
Alcic et al. Page segmentation by web content clustering
CN106844482B (en) Search engine-based retrieval information matching method and device
Bykau et al. Fine-grained controversy detection in Wikipedia
CN103761221A (en) System and method for identifying sensitive text messages
CN108446333B (en) Big data text mining processing system and method thereof
CN113157871B (en) News public opinion text processing method, server and medium applying artificial intelligence
CN103870495A (en) Method and device for extracting information from website
Wei et al. Online education recommendation model based on user behavior data analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant