CN103198075A - Method and device for extracting web page information blocks - Google Patents

Method and device for extracting web page information blocks

Info

Publication number
CN103198075A
Authority
CN
China
Prior art keywords
message block
eigenwert
classification
feature
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012100046538A
Other languages
Chinese (zh)
Inventor
徐羽
彭默
蔡兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN2012100046538A priority Critical patent/CN103198075A/en
Publication of CN103198075A publication Critical patent/CN103198075A/en
Pending legal-status Critical Current

Abstract

The invention discloses a method and a device for extracting web page information blocks, and belongs to the field of computers. The method comprises the following steps: acquiring feature values of a plurality of features of a web page, wherein the web page comprises a plurality of information blocks; determining the types of the information blocks according to the acquired feature values, wherein the information blocks are in one-to-one correspondence with the types, and the types comprise at least one of the following: page top navigation, subnavigation, text title, text information, text, novel title, novel text information, novel text, novel navigation, blog navigation, blog title, blog information, blog text, link information block and transactional block; and selecting at least one information block from the web page for display. The device comprises an acquisition module, a determination module and a selection module. The invention reduces labor investment and maintenance effort.

Description

Method and device for extracting web page information blocks
Technical field
The present invention relates to the field of computers, and in particular to a method and device for extracting web page information blocks.
Background
News web pages, novel web pages and blog web pages contain key information that is valuable to the user, such as the news text, novel text and blog text, as well as information that is useless to the user, such as advertisements and other junk content. If the web page returned to the user contains only the key information, the page is easier for the user to browse and consumes fewer network resources.
At present, the information blocks containing key information can be extracted from a web page and packaged into a new web page. This scheme is described below using a news web page as an example. A news web page generally comprises information blocks such as page top navigation, subnavigation, text title, text information, text, interactive block and link information block; the blocks valuable to the user are the subnavigation, text title, text information and text. For a given news web page, the information it contains is first divided into a plurality of information blocks, and the DOM (Document Object Model) tree structure of the page is determined. The name of each divided information block is then determined according to the information block templates that a technician has produced in advance for that DOM tree structure. Finally, the information blocks named subnavigation, text title, text information and text are extracted, and the four extracted blocks are packaged into a new news web page.
It should be noted that the technician classifies a large number of web pages in advance, groups the web pages that share the same DOM tree structure into one class, and then analyzes the web pages of each class to produce the one or more information block templates that the DOM tree structure comprises.
In the course of making the present invention, the inventors found that the prior art has at least the following problems:
Web pages of different websites have different DOM tree structures, so the variety of DOM tree structures is very large. Classifying the web pages by DOM tree structure, and producing the information block templates of each DOM tree structure from the web pages that belong to it, therefore requires a great deal of labor. In addition, a website may redesign its pages; once a page is redesigned, its DOM tree structure changes as well, and the information block templates for the changed DOM tree must be produced again, so the maintenance workload is very large.
Summary of the invention
To reduce labor and maintenance effort, the present invention provides a method and device for extracting web page information blocks. The technical solution is as follows:
A method for extracting web page information blocks, the method comprising:
acquiring feature values of a plurality of features of a web page, the web page comprising a plurality of information blocks;
determining the category of each information block according to the acquired feature values, wherein the information blocks correspond one-to-one to a plurality of categories, and the categories comprise at least one of page top navigation, subnavigation, text title, text information, text, novel title, novel text information, novel text, novel navigation, blog navigation, blog title, blog information, blog text, link information block and interactive block; and
selecting at least one information block from the web page for display.
Acquiring the feature values of the plurality of features of the web page comprises:
setting the feature value of each feature that the web page has to a first feature value, and setting the feature value of each feature that the web page does not have to a second feature value.
Determining the category of each information block according to the acquired feature values comprises: calculating the probability that each information block belongs to each category, and taking the category with the largest probability as the category of that information block.
Calculating the probability that each information block belongs to each category comprises:
for any category C, calculating the class probability of category C from the total number of samples Ctotal that category C comprises and the total number of samples Total that all categories comprise:
$$P(C) = \frac{C_{total}}{Total}$$
for the feature value of a feature Tk of the information block, where k is any value from 1 to a preset number F, obtaining from the samples of category C the samples in which the feature value of Tk is the first feature value, and counting them to obtain a first sample count Ck;
obtaining from the samples of all preset categories the samples in which the feature value of Tk is the first feature value, and counting them to obtain a second sample count Ek;
judging the feature value of feature Tk;
if it is the first feature value, calculating the feature probability P(Tk) of feature Tk from the first sample count Ck and the second sample count Ek by formula (1), and calculating the probability that the information block belongs to category C from the class probability P(C) and the feature probabilities of the features of the information block as P = P(C) * P(T1) * P(T2) * P(T3) * ... * P(TF);
$$P(T_k) = \frac{C_k + k_1}{E_k + k_2} \qquad (1)$$
where k1 and k2 are coefficients, and the value of k2 is greater than or equal to the value of k1;
if it is the second feature value, calculating the feature probability P(Tk) of feature Tk from the first sample count Ck and the second sample count Ek by formula (2), and calculating the probability that the information block belongs to category C from the class probability P(C) and the feature probabilities of the features of the information block as P = P(C) * P(T1) * P(T2) * P(T3) * ... * P(TF);
$$P(T_k) = 1 - \frac{C_k + k_1}{E_k + k_2} \qquad (2)$$
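As a worked illustration of the calculation above, the counts below are invented for the example and do not come from the patent. The expression used for a feature the block has is an assumption: it is taken to be the complement of formula (2), since the original image of formula (1) is not reproduced in the text. Suppose category C holds Ctotal = 20 of Total = 100 samples, Ck = 11, Ek = 30, k1 = 1 and k2 = 2:

```latex
\begin{align*}
P(C)   &= \frac{C_{total}}{Total} = \frac{20}{100} = 0.2 \\
P(T_k) &= \frac{C_k + k_1}{E_k + k_2} = \frac{11 + 1}{30 + 2} = 0.375
        && \text{(block has } T_k \text{; assumed form of (1))} \\
P(T_k) &= 1 - \frac{C_k + k_1}{E_k + k_2} = 1 - 0.375 = 0.625
        && \text{(block lacks } T_k \text{; formula (2))}
\end{align*}
```

With k2 greater than or equal to k1, the fraction stays within 0 and 1 whenever Ck does not exceed Ek, and a positive k1 keeps a single unseen feature from forcing the whole product P to zero.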
A device for extracting web page information blocks, the device comprising:
an acquisition module, configured to acquire feature values of a plurality of features of a web page, the web page comprising a plurality of information blocks;
a determination module, configured to determine the category of each information block according to the acquired feature values, wherein the information blocks correspond one-to-one to a plurality of categories, and the categories comprise page top navigation, subnavigation, text title, text information, text, novel title, novel text information, novel text, novel navigation, blog navigation, blog title, blog information, blog text, link information block and interactive block; and
a selection module, configured to select at least one information block from the web page for display.
The acquisition module is specifically configured to set the feature value of each feature that the web page has to a first feature value, and to set the feature value of each feature that the web page does not have to a second feature value.
The determination module comprises:
a calculation unit, configured to calculate the probability that the information block belongs to each category; and
a definition unit, configured to take the category with the largest probability as the category of the information block.
The calculation unit comprises:
a first calculation subunit, configured to calculate, for any category C, the class probability of the category from the total number of samples Ctotal that category C comprises and the total number of samples Total that all categories comprise:
$$P(C) = \frac{C_{total}}{Total}$$
a first counting subunit, configured to, for the feature value of a feature Tk of the information block, where k is any value from 1 to a preset number F, obtain from the samples of category C the samples in which the feature value of Tk is the first feature value, and count them to obtain a first sample count Ck;
a second counting subunit, configured to obtain from the samples of all preset categories the samples in which the feature value of Tk is the first feature value, and count them to obtain a second sample count Ek;
a judging subunit, configured to judge the feature value of feature Tk;
a second calculation subunit, configured to, if the feature value is the first feature value, calculate the feature probability P(Tk) of feature Tk from the first sample count Ck and the second sample count Ek by formula (1), and calculate the probability that the information block belongs to category C from the class probability P(C) and the feature probabilities of the features of the information block as P = P(C) * P(T1) * P(T2) * P(T3) * ... * P(TF);
where k1 and k2 are coefficients, and the value of k2 is greater than or equal to the value of k1;
a second calculation subunit, configured to, if the feature value is the second feature value, calculate the feature probability P(Tk) of feature Tk from the first sample count Ck and the second sample count Ek by formula (2), and calculate the probability that the information block belongs to category C from the class probability P(C) and the feature probabilities of the features of the information block as P = P(C) * P(T1) * P(T2) * P(T3) * ... * P(TF);
$$P(T_k) = 1 - \frac{C_k + k_1}{E_k + k_2} \qquad (2)$$
In the present invention, the feature values of a plurality of features of a web page are acquired, the category of each information block is determined from the acquired feature values, and at least one information block is selected for display from the information blocks that the web page comprises. The categories correspond one-to-one to the information blocks and comprise at least one of page top navigation, subnavigation, text title, text information, text, novel title, novel text information, novel text, novel navigation, blog navigation, blog title, blog information, blog text, link information block and interactive block. The features and the kinds of information block that web pages comprise are constant, and redesigning a web page does not change them either. The technician therefore only needs to set the plurality of features and one corresponding category for each kind of information block, which reduces labor and maintenance effort.
Description of drawings
Fig. 1 is a flowchart of a method for extracting web page information blocks provided by Embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of a news web page provided by Embodiment 2 of the present invention;
Fig. 3 is a flowchart of a method for extracting web page information blocks provided by Embodiment 2 of the present invention;
Fig. 4 is a schematic diagram of a novel web page provided by Embodiment 3 of the present invention;
Fig. 5 is a flowchart of a method for extracting web page information blocks provided by Embodiment 3 of the present invention;
Fig. 6 is a schematic diagram of a blog web page provided by Embodiment 4 of the present invention;
Fig. 7 is a flowchart of a method for extracting web page information blocks provided by Embodiment 4 of the present invention;
Fig. 8 is a schematic structural diagram of a device for extracting web page information blocks provided by Embodiment 5 of the present invention.
Embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Embodiment 1
As shown in Fig. 1, this embodiment of the present invention provides a method for extracting web page information blocks, comprising:
Step 101: acquire the feature values of a plurality of features of a web page, the web page comprising a plurality of information blocks;
The web page may be a news web page, a novel web page, a blog web page or the like.
Step 102: determine the category of each information block according to the acquired feature values, wherein the information blocks correspond one-to-one to a plurality of categories, and the categories comprise at least one of page top navigation, subnavigation, text title, text information, text, novel title, novel text information, novel text, novel navigation, blog navigation, blog title, blog information, blog text, link information block and interactive block;
Step 103: select at least one information block for display from the information blocks that the web page comprises.
The information blocks that belong to one or more specified categories can be selected from the information blocks that the web page comprises.
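The selection in step 103 can be sketched as a simple filter. This is a minimal illustration, assuming the blocks have already been classified and are available as (block, category) pairs; the function name `select_blocks` and the data layout are hypothetical and not taken from the patent.

```python
def select_blocks(classified_blocks, wanted_categories):
    """Keep the blocks whose category is one of the specified categories.

    classified_blocks: list of (block, category) pairs (hypothetical layout).
    wanted_categories: set of category names chosen for display.
    """
    return [block for block, category in classified_blocks
            if category in wanted_categories]


# Example: keep only the blocks that matter to a reader of a news page.
page = [
    ("<div>Home | World</div>", "page top navigation"),
    ("<h1>Headline</h1>", "text title"),
    ("<div>Story...</div>", "text"),
    ("<form>Comment</form>", "interactive block"),
]
kept = select_blocks(page, {"subnavigation", "text title", "text information", "text"})
```

The original page order is preserved, so the selected blocks can be packaged into a new page in reading order.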
In this embodiment of the present invention, the feature values of a plurality of features of a web page are acquired, the category of each information block is determined from the acquired feature values, and at least one information block is selected for display from the information blocks that the web page comprises. The categories correspond one-to-one to the information blocks and comprise at least one of page top navigation, subnavigation, text title, text information, text, novel title, novel text information, novel text, novel navigation, blog navigation, blog title, blog information, blog text, link information block and interactive block. The features and the kinds of information block that the web pages of different websites comprise are constant, and redesigning a web page does not change them either. The technician therefore only needs to set the plurality of features and one corresponding category for each kind of information block, which reduces labor and maintenance effort.
Embodiment 2
This embodiment of the present invention provides a method for extracting web page information blocks. Referring to Fig. 2, a news web page generally comprises information blocks such as page top navigation, subnavigation, text title, text information, text, link information block and interactive block. The blocks valuable to the user are the subnavigation, text title, text information and text, which contain the key information the user cares about. In this embodiment, the information blocks containing key information are extracted from a news web page by this method. Referring to Fig. 3, the method comprises:
Step 201: divide the information contained in the news web page to obtain the one or more information blocks that the news web page comprises; suppose the information blocks obtained are information block 1, information block 2, information block 3, information block 4, information block 5, information block 6 and information block 7;
In the source text of a news web page, the code that implements each information block is enclosed by a block tag. Specifically, the source text of the news web page is parsed, and the information enclosed by one and the same parsed tag is divided into one information block.
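The division in step 201 can be sketched with Python's standard `html.parser` module. This is a minimal illustration under stated assumptions: the patent does not say which tags count as block tags, so the `BLOCK_TAGS` set and the choice to treat each outermost block tag as one information block are hypothetical, as are the names `BlockSplitter` and `split_into_blocks`.

```python
from html.parser import HTMLParser

# Assumption: which tags delimit an information block is not specified in the text.
BLOCK_TAGS = {"div", "table", "ul"}


class BlockSplitter(HTMLParser):
    """Collect the markup enclosed by each outermost block tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished blocks, as markup strings
        self._depth = 0    # nesting depth of block tags currently open
        self._parts = []   # pieces of the block being collected

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._depth += 1
        if self._depth > 0:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._depth == 0:
            return  # end tag outside any block
        self._parts.append(f"</{tag}>")
        if tag in BLOCK_TAGS:
            self._depth -= 1
            if self._depth == 0:
                self.blocks.append("".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        if self._depth > 0:
            self._parts.append(data)


def split_into_blocks(html):
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks


html = ('<html><body><div id="nav"><a href="/">Home</a></div>'
        '<div><h1>Title</h1><p>Body</p></div></body></html>')
blocks = split_into_blocks(html)
```

Nested block tags stay inside their outermost parent here; a real page would likely need a finer rule for choosing the nesting level at which blocks are cut.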
Referring to Fig. 2, in this step the information contained in the news web page shown in Fig. 2 is divided into seven information blocks, namely information blocks 1 to 7, but it is not yet known which category each of them belongs to. This embodiment uses information block 1 as an example: the category of information block 1 is determined by steps 202-205 below, and information blocks 2 to 7 each go through the same steps 202-205 to obtain their respective categories.
Step 202: for one of the divided information blocks, assumed here to be information block 1, determine from a feature set the features that information block 1 has and the features that it does not have, the feature set comprising a plurality of features;
The features in the feature set are obtained in advance by the technician through analysis of a large number of web pages, including news web pages, novel web pages and/or blog web pages. For example, in this embodiment the feature set comprises 111 preset features, as follows:
Feature 1: the information block is the block that contains the most paragraph tags in the web page;
Feature 2: the information block is the block with the highest text density in the web page;
Feature 3: the information block is the block that contains the most text;
Feature 4: the information block is the block most similar to the title of the web page;
Feature 5: the information block contains a complete sentence;
Feature 6: the information block contains a heading tag;
Feature 7: the information block contains an interactive tag;
Feature 8: the information block contains a link tag;
Feature 9: the information block contains a complete time string;
Feature 10: the information block contains a subnavigation symbol.
Features 11 to 111 are all keywords, so there are 100 keywords. Features 11 to 111 are respectively: China, news, comment, video, fashion, more, recommend, hiring, homepage, service, blog, forum, latest, recreation, star, finance, information, website, related, woman, we, automobile, picture, entertainment, advertisement, weight loss, com, company, after, global, beauty, special topic, science and technology, about, expert, man, life, registration, all rights reserved, mobile phone, actress, sexy, how, exposure, exposition, source, focus, development, brand, world, sports, netizen, forward, world, network, text, recruitment, community, health, www, trick, real estate, data, issue, hour, follow, domestic, top ten, woman, channel, summer, military, free, engineer, breast enlargement, contact, diabetes, accident, economy, tourism, popular, high definition, man, license, user name, treatment, clever tip, male, statement, information, subway, product, click, internet, reporter, ranking, password, post, map and search.
It should be noted that, in the HTML language, a paragraph tag can be a <p> tag or a <BR> tag; a heading tag can be <H1>, <H2>, <H3>, <H4>, <H5> and/or <H6>; an interactive tag can be a tag formed by <form> and <input>; and a link tag can be an <li> tag.
For example, the number of paragraph tags, the number of characters and the number of text characters are counted from the code corresponding to information block 1; the ratio of the number of text characters in information block 1 to its number of characters is calculated and taken as the text density of information block 1; and the similarity between information block 1 and the title of the news web page is calculated. By the same method, the numbers of paragraph tags, characters and text characters in each of information blocks 2 to 7 are counted, and the text density of each of information blocks 2 to 7 and its similarity to the title of the news web page are calculated.
It is judged whether the number of paragraph tags in information block 1 is the largest; if so, information block 1 has feature 1, otherwise it does not. It is judged whether the text density of information block 1 is the largest; if so, information block 1 has feature 2, otherwise it does not. It is judged whether the number of text characters in information block 1 is the largest; if so, information block 1 has feature 3, otherwise it does not. It is judged whether the similarity between information block 1 and the title of the news web page is the largest; if so, information block 1 has feature 4, otherwise it does not.
It is determined whether the code corresponding to information block 1 contains a complete sentence, a heading tag, an interactive tag, a link tag, a complete time string and a subnavigation symbol. If information block 1 contains a complete sentence, it has feature 5, otherwise it does not. If it contains a heading tag, it has feature 6, otherwise it does not. If it contains an interactive tag, it has feature 7, otherwise it does not. If it contains a link tag, it has feature 8, otherwise it does not. If it contains a complete time string, it has feature 9, otherwise it does not. If it contains a subnavigation symbol, it has feature 10, otherwise it does not.
Finally, the keywords corresponding to features 11 to 111 are used to determine which keywords information block 1 contains and which it does not; information block 1 has the features corresponding to the keywords it contains and does not have the features corresponding to the keywords it lacks. In this way it is determined which features in the feature set information block 1 has and which it does not have.
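The presence tests for tag features and keyword features can be sketched as follows. This is a minimal illustration, not the patent's implementation: the regular expressions are assumptions about how tags might be detected, `KEYWORDS` is a three-word illustrative subset of the keyword list, and the names `TAG_PATTERNS` and `presence_features` are hypothetical. Features 5, 9 and 10 are left out because the text does not say how a complete sentence, a complete time string or a subnavigation symbol is recognised.

```python
import re

# Illustrative subset of the keyword features (assumption: English stand-ins).
KEYWORDS = ["news", "comment", "video"]

# Assumed detection patterns for features 6, 7 and 8.
TAG_PATTERNS = {
    "heading_tag": r"<h[1-6]\b",
    "interactive_tag": r"<(?:form|input)\b",
    "link_tag": r"<li\b",
}


def presence_features(code):
    """Map each feature name to True (block has it) or False (it does not)."""
    features = {
        name: re.search(pattern, code, re.IGNORECASE) is not None
        for name, pattern in TAG_PATTERNS.items()
    }
    # Strip tags crudely so keywords are matched against visible text only.
    visible = re.sub(r"<[^>]*>", " ", code).lower()
    for word in KEYWORDS:
        features["kw:" + word] = word in visible
    return features


result = presence_features("<div><h2>Latest news</h2><ul><li>Video</li></ul></div>")
```

Substring matching is used for the keywords here; text in a language without word spacing would normally be segmented first, which is what the keyword features in the text appear to rely on.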
A news web page has a DOM tree structure, each information block is a node in the DOM tree structure, and that node in turn comprises one or more leaf nodes. Accordingly, in this embodiment, the similarity between an information block and the title of the news web page can be calculated as follows: obtain the leaf nodes under the node corresponding to the information block, calculate the similarity between the information in each leaf node and the title of the news web page using an existing similarity algorithm, select the largest similarity, and take it as the similarity between the information block and the title of the news web page.
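The leaf-node comparison can be sketched as follows. The text only refers to an existing similarity algorithm without naming one, so `difflib.SequenceMatcher` from the Python standard library is used here purely as a stand-in; the function name `title_similarity` and the assumption that leaf nodes are already available as plain strings are also hypothetical.

```python
from difflib import SequenceMatcher


def title_similarity(leaf_texts, page_title):
    """Return the largest similarity between any leaf-node text and the title.

    leaf_texts: text content of the leaf nodes under the block's DOM node.
    SequenceMatcher is a stand-in for the unspecified similarity algorithm.
    """
    return max(
        (SequenceMatcher(None, text, page_title).ratio() for text in leaf_texts),
        default=0.0,  # a block without leaf text is treated as dissimilar
    )


score = title_similarity(
    ["Home", "Markets rally on earnings"],
    "Markets rally on earnings",
)
```

Taking the maximum over leaf nodes means a block that merely contains the headline among other content still scores as highly similar to the title.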
For information blocks 2 to 7 of the news web page, the same method is used to determine, for each block, which features in the feature set it has and which features it does not have.
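Because features 1 to 3 depend on a comparison across all blocks of the page, they can only be decided once every block has been measured. The sketch below illustrates this under stated assumptions: tags are stripped with a crude regular expression, every non-whitespace character outside tags counts as a text character, and the total character count is the length of the block's code. The names `block_stats` and `maximum_features` are hypothetical.

```python
import re


def block_stats(code):
    """Return (paragraph tag count, text density, text character count)."""
    paragraph_tags = len(re.findall(r"<(?:p|br)\b", code, flags=re.IGNORECASE))
    visible = re.sub(r"<[^>]*>", "", code)  # drop tags (crude, regex-based)
    text_chars = sum(1 for ch in visible if not ch.isspace())
    density = text_chars / len(code) if code else 0.0
    return paragraph_tags, density, text_chars


def maximum_features(blocks):
    """For each block, report whether it holds features 1, 2 and 3."""
    stats = [block_stats(code) for code in blocks]
    if not stats:
        return []
    best = [max(s[i] for s in stats) for i in range(3)]
    # A block has the feature when it reaches the page-wide maximum.
    return [tuple(s[i] == best[i] for i in range(3)) for s in stats]


blocks = [
    '<div><a href="/">Home</a></div>',
    "<div><p>First paragraph.</p><p>Second one here.</p></div>",
]
flags = maximum_features(blocks)
```

In this sketch, blocks that tie for the maximum all receive the feature; the text does not say how ties are handled.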
Step 203: set the feature value of each feature that information block 1 has to the first feature value, and set the feature value of each feature that information block 1 does not have to the second feature value, thereby obtaining the feature values of the plurality of features of information block 1;
The first feature value can be 1, 2 or the like, and the second feature value can be 0 or the like; this embodiment places no restriction on the specific values of the first feature value and the second feature value.
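Step 203 amounts to encoding the has/has-not decisions as a vector of feature values. A minimal sketch, assuming 1 and 0 as the first and second feature values (the text allows other choices) and a hypothetical function name `feature_values`:

```python
FIRST_VALUE = 1   # value for a feature the block has (assumed choice)
SECOND_VALUE = 0  # value for a feature the block does not have (assumed choice)


def feature_values(has_feature):
    """Turn a sequence of booleans, one per feature, into feature values."""
    return [FIRST_VALUE if present else SECOND_VALUE for present in has_feature]


# A block that has features 1 and 3 out of a four-feature set.
values = feature_values([True, False, True, False])
```

Keeping the features in a fixed order matters, because samples and blocks are later compared position by position.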
Step 204: calculate the probability that information block 1 belongs to each category from the feature values of the plurality of features of information block 1, the information blocks corresponding one-to-one to the categories;
The categories corresponding to the information blocks of a news web page can comprise at least one of page top navigation, subnavigation, text title, text information, text, link information block and interactive block.
A category comprises one or more samples, and each sample comprises the feature values of the plurality of features of an information block whose name corresponds to that category.
The information blocks that a news web page generally comprises are page top navigation, subnavigation, text title, text information, text, link information block and interactive block, so each information block in a news web page corresponds to one category. Each category comprises one or more samples, each sample comprises the feature values of the plurality of features of an information block corresponding to that category, and the features of each sample are the same as the features in the feature set.
The technician collects a large number of news web pages in advance and manually divides each of them into its information blocks: page top navigation, subnavigation, text title, text information, text, link information block and interactive block. The page top navigation blocks of the news web pages are then analyzed to obtain the category corresponding to page top navigation, that is, the one or more samples that this category comprises. The categories corresponding to subnavigation, text title, text information, text, link information block and interactive block, and the one or more samples that each of them comprises, are obtained in the same way by analyzing the corresponding blocks of each news web page.
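The sample collection described above can be sketched as grouping manually labelled feature-value vectors by category. This is a minimal illustration; the input layout of (category, feature values) pairs and the function name `build_categories` are assumptions, not part of the patent.

```python
from collections import defaultdict


def build_categories(labelled_blocks):
    """Group manually labelled feature-value vectors by category name.

    labelled_blocks: iterable of (category, feature_values) pairs, one per
    information block that a technician has labelled by hand.
    """
    samples_by_category = defaultdict(list)
    for category, values in labelled_blocks:
        samples_by_category[category].append(list(values))
    return dict(samples_by_category)


labelled = [
    ("text", [1, 0]),
    ("page top navigation", [0, 1]),
    ("text", [1, 1]),
]
samples = build_categories(labelled)
```

Each category then simply is its list of samples, which is all the probability calculation needs.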
Specifically, the probability that information block 1 belongs to each category can be calculated by steps (1)-(7) below:
(1): for a preset category C, assumed here to be page top navigation, and for the feature value of a feature Tk of information block 1, where k is any value from 1 to a preset number F, obtain from the samples of page top navigation the samples in which the feature value of Tk is the first feature value, and count them to obtain the first sample count Ck;
Further, before step (1) is performed, the class probability of page top navigation, P(C) = C/E, is calculated in advance from the total number of samples C that page top navigation comprises and the total number of samples E that all categories comprise.
(2): obtain from the samples of all categories the samples in which the feature value of Tk is the first feature value, and count them to obtain the second sample count Ek;
(3): judge the feature value of feature Tk of information block 1; if it is the first feature value, perform step (4); if it is the second feature value, perform step (5);
(4): calculate the feature probability P(Tk) of feature Tk from the first sample count Ck and the second sample count Ek by formula (1);
$$P(T_k) = \frac{C_k + k_1}{E_k + k_2} \qquad (1)$$
where k1 and k2 are coefficients, and k2 is greater than or equal to k1;
In the present embodiment, k1 may be, for example, 1 or 2, and k2 may be, for example, 2 or 3.
(5): Calculate the feature probability P(Tk) of feature Tk from the first sample count Ck and the second sample count Ek using the following formula (2);
$$P(T_k) = 1 - \frac{C_k + k_1}{E_k + k_2} \qquad (2)$$
The feature probabilities of the F features of information block 1 are calculated from their feature values by steps (1) to (5) above, giving P(T1), P(T2), P(T3), ..., P(TF).
(6): From the category probability P(C) of page-top navigation and the feature probabilities of the F features of information block 1, calculate the probability that information block 1 belongs to page-top navigation: P1 = P(C) * P(T1) * P(T2) * P(T3) * ... * P(TF).
(7): By steps (1) to (6) above, likewise calculate the probabilities that information block 1 belongs to secondary navigation, text title, text information, text, link information block and interactive block, respectively.
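Steps (1) to (6) can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: it assumes that formula (1) is P(Tk) = (Ck + k1)/(Ek + k2), the complement of formula (2), that the first and second feature values are 1 and 0, and that samples are stored as plain lists. The names `feature_probability`, `category_score` and the `samples` layout are hypothetical.

```python
from typing import Dict, List

# Hypothetical layout: category name -> list of samples,
# each sample being a list of feature values (1 = first value, 0 = second value).
Samples = Dict[str, List[List[int]]]


def feature_probability(ck: int, ek: int, has_feature: bool,
                        k1: int = 1, k2: int = 2) -> float:
    """Steps (3)-(5): formula (1) if the block has the feature, else formula (2)."""
    ratio = (ck + k1) / (ek + k2)  # assumed form of formula (1)
    return ratio if has_feature else 1.0 - ratio


def category_score(block: List[int], category: str, samples: Samples,
                   k1: int = 1, k2: int = 2) -> float:
    """Step (6): P(C) * P(T1) * ... * P(TF) for one block and one category."""
    c = len(samples[category])                # samples in this category
    e = sum(len(v) for v in samples.values())  # samples in all categories
    score = c / e                              # category probability P(C)
    for k, value in enumerate(block):
        # Step (1): samples of this category whose feature k has the first value.
        ck = sum(1 for s in samples[category] if s[k] == 1)
        # Step (2): samples of all categories whose feature k has the first value.
        ek = sum(1 for v in samples.values() for s in v if s[k] == 1)
        score *= feature_probability(ck, ek, value == 1, k1, k2)
    return score


if __name__ == "__main__":
    demo = {"page_top_navigation": [[1, 0], [1, 1]], "text": [[0, 1]]}
    print(category_score([1, 0], "page_top_navigation", demo))
```

Step (7) then amounts to calling `category_score` once per category for the same block.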
Step 205: Select the category with the highest probability and take it as the category to which information block 1 belongs;
Steps 202 to 205 above are performed for information blocks 2 to 7 in the same way as for information block 1, yielding the category of each of them and thus the category of every information block in the news web page.
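Step 205 reduces to picking the highest-scoring category for each block. A minimal sketch, assuming the per-category probabilities have already been computed and are held in a dictionary; the function names and the tie-breaking behaviour (the first category encountered wins) are assumptions, since the text does not say how ties are resolved.

```python
from typing import Dict


def classify(scores: Dict[str, float]) -> str:
    """Return the category with the highest probability for one block."""
    if not scores:
        raise ValueError("no categories to choose from")
    return max(scores, key=scores.get)


def classify_all(block_scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Apply the same selection to every block of the page."""
    return {block: classify(scores) for block, scores in block_scores.items()}


if __name__ == "__main__":
    page = {
        "block 1": {"page_top_navigation": 0.25, "text": 0.04},
        "block 5": {"page_top_navigation": 0.01, "text": 0.30},
    }
    print(classify_all(page))
```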
Step 206: From the information blocks in the news web page, select the information blocks that belong to the specified categories for display.
Further, the selected information blocks are packaged into a new news web page.
Suppose information blocks 1 to 7 are found to belong to page-top navigation, secondary navigation, text title, text information, text, link information block and interactive block, respectively, and the specified categories are secondary navigation, text title, text information and text. Then the information blocks belonging to these categories, namely information blocks 2, 3, 4 and 5, are selected from information blocks 1 to 7 of the news web page and packaged into a new news web page.
In this embodiment of the present invention, the information in a news web page is divided into one or more information blocks; the feature values of a plurality of features of each information block are obtained according to a preset feature set; the category of each information block is determined from those feature values and the preset categories; and the information blocks belonging to the specified categories are selected from the information blocks of the news web page. The feature set comprises a plurality of features. Each category corresponds to one kind of information block and comprises one or more samples, and each sample comprises the feature values of the plurality of features of an information block of that kind. The features and the kinds of information blocks are the same across the news web pages of different news websites, and redesigning a news web page does not change them either. A technician therefore only needs to set the features and set one category for each kind of information block, which reduces manual effort and maintenance.
Embodiment 3
This embodiment of the present invention provides a method for extracting web page information blocks. Referring to Fig. 4, a novel web page generally comprises information blocks such as page-top navigation, secondary navigation, novel title, novel text information, novel text, novel navigation, link information block and interactive block. The information blocks valuable to the user, that is, those containing key information, are secondary navigation, novel title, novel text and novel navigation. In the present embodiment, the information blocks containing key information are extracted from the novel web page by this method. Referring to Fig. 5, the method comprises:
Step 301: Divide the information contained in the novel web page to obtain the one or more information blocks it comprises; suppose the information blocks obtained are information blocks 1 to 8;
In the source text of the novel web page, the code implementing each information block is enclosed by a block tag. Specifically, the source text of the novel web page is parsed, and the information enclosed by one and the same parsed tag is taken as one information block.
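The division by block tags can be illustrated with Python's standard `html.parser` module. This is a sketch under stated assumptions: the text does not list which tags count as block tags, so the `BLOCK_TAGS` set below is hypothetical, and each outermost block-tag element is treated as one information block whose text is collected.

```python
from html.parser import HTMLParser
from typing import List

# Assumed set of block tags; the source does not enumerate them.
BLOCK_TAGS = {"div", "table", "ul", "ol", "p", "h1", "h2", "h3"}


class BlockSplitter(HTMLParser):
    """Collects the text of each outermost block-tag element as one block."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: List[str] = []   # text of each finished block
        self._depth = 0               # nesting depth inside the current block
        self._parts: List[str] = []   # text pieces of the current block

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:      # the outermost block tag has closed
                self.blocks.append(" ".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            self._parts.append(data.strip())


def split_blocks(html: str) -> List[str]:
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks


if __name__ == "__main__":
    print(split_blocks("<div><p>Home</p><p>News</p></div><div>Chapter 1</div>"))
```

A production splitter would also need to keep the original markup of each block and cope with malformed HTML; the sketch only shows the grouping idea.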
Referring to Fig. 4, in this step the information in the novel web page shown in Fig. 4 is divided into eight information blocks, information blocks 1 to 8, whose categories are not yet known. The present embodiment takes information block 1 as an example: its category is determined by steps 302 to 305 below, and the categories of information blocks 2 to 8 are determined by performing steps 302 to 305 in the same way.
Step 302: For one of the divided information blocks, assumed here to be information block 1, determine from the feature set which features information block 1 has and which it does not have; the feature set comprises a plurality of features;
The features in the feature set are obtained in advance by a technician through analysis of a large number of web pages, including news web pages, novel web pages and/or blog web pages. For example, in the present embodiment the feature set comprises 111 preset features; for their specific content, see the corresponding part of Embodiment 2, which is not repeated here.
Count the number of paragraph tags, the number of characters and the number of text characters in the code of information block 1; calculate the ratio of the number of text characters to the number of characters and take it as the text density of information block 1; and calculate the similarity between information block 1 and the title of the novel web page. By the same method, count the paragraph tags, characters and text characters of information blocks 2 to 8, and calculate the text density of each of them and its similarity to the title of the novel web page;
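The statistics described above can be sketched as follows. Several details are assumptions rather than part of the source: text characters are taken to be the characters left after stripping tags, the character count is the length of the block's code, and the similarity measure, which the text does not name, is stood in for by `difflib.SequenceMatcher`. The function name `block_stats` is hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import Dict


def block_stats(code: str, page_title: str) -> Dict[str, float]:
    """Paragraph-tag count, character counts, text density and title similarity."""
    text = re.sub(r"<[^>]*>", "", code)  # drop tags, keep visible text
    paragraph_tags = len(re.findall(r"<p\b", code, flags=re.IGNORECASE))
    total_chars = len(code)
    text_chars = len(text)
    density = text_chars / total_chars if total_chars else 0.0
    # Assumed similarity measure; the source only says "similarity".
    similarity = SequenceMatcher(None, text.strip(), page_title).ratio()
    return {
        "paragraph_tags": paragraph_tags,
        "total_chars": total_chars,
        "text_chars": text_chars,
        "density": density,
        "similarity": similarity,
    }


if __name__ == "__main__":
    print(block_stats("<p>Chapter 1: The Storm</p>", "The Storm"))
```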
Judge whether the number of paragraph tags in information block 1 is the largest among the information blocks of the web page; if so, information block 1 has feature 1, otherwise it does not. Judge whether the text density of information block 1 is the largest; if so, information block 1 has feature 2, otherwise it does not. Judge whether the number of text characters in information block 1 is the largest; if so, information block 1 has feature 3, otherwise it does not. And judge whether the similarity between information block 1 and the title of the novel web page is the largest; if so, information block 1 has feature 4, otherwise it does not;
Determine whether the code of information block 1 contains a complete sentence, a heading tag, an interactive tag, a link tag, a complete time string and a secondary navigation symbol. If information block 1 contains a complete sentence, it has feature 5, otherwise it does not; if it contains a heading tag, it has feature 6, otherwise it does not; if it contains an interactive tag, it has feature 7, otherwise it does not; if it contains a link tag, it has feature 8, otherwise it does not; if it contains a complete time string, it has feature 9, otherwise it does not; and if it contains a secondary navigation symbol, it has feature 10, otherwise it does not;
And, according to the keywords corresponding to features 11 to 111, determine which keywords information block 1 contains and which it does not; information block 1 has the features corresponding to the keywords it contains and does not have the features corresponding to the keywords it lacks. It is thus determined which features in the feature set information block 1 has and which it does not have.
For information blocks 2 to 8 of the novel web page, it is determined by the same method which features in the feature set each of them has and which it does not have.
Step 303: Set the feature value of each feature that information block 1 has to the first feature value, and the feature value of each feature that it does not have to the second feature value, thereby obtaining the feature values of the plurality of features of information block 1;
The first feature value may be, for example, 1 or 2, and the second feature value may be, for example, 0; the present embodiment does not limit the specific values of the first and second feature values.
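Steps 302 and 303 together turn a block into a vector of feature values. The sketch below covers only features 1 to 4 (the "largest among all blocks" comparisons) and the keyword features; features 5 to 10 are omitted because the text does not say how complete sentences, time strings and similar items are detected. Plain substring matching stands in for keyword detection, and all names are hypothetical.

```python
from typing import List, Sequence


def is_largest(values: Sequence[float], index: int) -> bool:
    """True if the block at `index` holds the largest value on the page."""
    return values[index] == max(values)


def feature_vector(index: int,
                   paragraph_counts: Sequence[int],
                   densities: Sequence[float],
                   text_counts: Sequence[int],
                   similarities: Sequence[float],
                   block_text: str,
                   keywords: Sequence[str],
                   first_value: int = 1,
                   second_value: int = 0) -> List[int]:
    """Feature values for one block: features 1-4, then one per keyword."""
    flags = [
        is_largest(paragraph_counts, index),  # feature 1
        is_largest(densities, index),         # feature 2
        is_largest(text_counts, index),       # feature 3
        is_largest(similarities, index),      # feature 4
    ]
    # Keyword features (features 11 onward in the text); substring test assumed.
    flags += [kw in block_text for kw in keywords]
    # Step 303: map "has feature" to the first value, otherwise the second value.
    return [first_value if flag else second_value for flag in flags]


if __name__ == "__main__":
    print(feature_vector(1, [0, 5], [0.2, 0.9], [10, 300], [0.8, 0.1],
                         "chapter one next", ["next", "login"]))
```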
Step 304: Calculate the probability that information block 1 belongs to each category from the feature values of its plurality of features; information blocks correspond one-to-one with categories;
The categories corresponding to the information blocks of a novel web page comprise at least one of page-top navigation, secondary navigation, novel title, novel text information, novel text, novel navigation, link information block and interactive block.
Each category comprises one or more samples, and each sample comprises the feature values of the plurality of features of an information block corresponding to that category.
The information blocks that a novel web page generally comprises are page-top navigation, secondary navigation, novel title, novel text information, novel text, novel navigation, link information block and interactive block, so each information block in a novel web page corresponds to one category. Each category comprises one or more samples, each sample comprises the feature values of the plurality of features of an information block corresponding to that category, and the features of each sample are the same as the features in the feature set.
A technician collects a large number of novel web pages in advance and manually divides each one into the information blocks it contains, namely page-top navigation, secondary navigation, novel title, novel text information, novel text, novel navigation, link information block and interactive block. The page-top navigation of each novel web page is then analyzed to obtain the category corresponding to page-top navigation, that is, the one or more samples that this category comprises. In the same way, the secondary navigation, novel title, novel text information, novel text, novel navigation, link information block and interactive block of each novel web page are analyzed to obtain the corresponding categories, that is, the one or more samples that each of these categories comprises.
Specifically, the probability that information block 1 belongs to each preset category can be calculated by the following steps (1) to (7):
(1): For a preset category, assumed here to be page-top navigation, and for the feature value of a feature Tk of information block 1, where k is any integer from 1 to a preset number F, obtain from the samples of page-top navigation those samples whose value for feature Tk is the first feature value, and count them to obtain the first sample count Ck;
Further, before step (1) is performed, the category probability of page-top navigation, P(C) = C/E, is calculated in advance from the total number of samples C in page-top navigation and the total number of samples E in all categories.
(2): From the samples of all categories, obtain those samples whose value for feature Tk is the first feature value, and count them to obtain the second sample count Ek;
(3): Check the feature value of feature Tk of information block 1; if it is the first feature value, perform step (4); if it is the second feature value, perform step (5);
(4): Calculate the feature probability P(Tk) of feature Tk from the first sample count Ck and the second sample count Ek using the following formula (1);
$$P(T_k) = \frac{C_k + k_1}{E_k + k_2} \qquad (1)$$
where k1 and k2 are coefficients, and k2 is greater than or equal to k1;
(5): Calculate the feature probability P(Tk) of feature Tk from the first sample count Ck and the second sample count Ek using the following formula (2);
$$P(T_k) = 1 - \frac{C_k + k_1}{E_k + k_2} \qquad (2)$$
The feature probabilities of the F features of information block 1 are calculated from their feature values by steps (1) to (5) above, giving P(T1), P(T2), P(T3), ..., P(TF).
(6): From the category probability P(C) of page-top navigation and the feature probabilities of the F features of information block 1, calculate the probability that information block 1 belongs to page-top navigation: P1 = P(C) * P(T1) * P(T2) * P(T3) * ... * P(TF).
(7): By steps (1) to (6) above, likewise calculate the probabilities that information block 1 belongs to secondary navigation, novel title, novel text information, novel text, novel navigation, link information block and interactive block, respectively.
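With F = 111 features, the product in step (6) multiplies more than a hundred values between 0 and 1 and becomes very small. One common implementation choice, which is not described in the source, is to sum logarithms instead; because the logarithm is monotonic, the category with the largest log-score is also the one with the largest product. A minimal sketch, with the hypothetical name `log_score`:

```python
import math
from typing import Sequence


def log_score(category_prob: float, feature_probs: Sequence[float]) -> float:
    """log(P(C) * P(T1) * ... * P(TF)) computed as a sum of logarithms."""
    probs = [category_prob, *feature_probs]
    # A zero factor makes the whole product zero; represent that as -infinity.
    if any(p <= 0.0 for p in probs):
        return float("-inf")
    return sum(math.log(p) for p in probs)


if __name__ == "__main__":
    print(log_score(0.5, [0.75, 0.5]))
```

A zero factor can arise from formula (2) when Ck equals Ek and k1 equals k2, which is why the guard is included.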
Step 305: Select the category with the highest probability and take it as the category to which information block 1 belongs;
Steps 302 to 305 above are performed for information blocks 2 to 8 in the same way as for information block 1, yielding the category of each of them and thus the category of every information block in the novel web page.
Step 306: From the information blocks in the novel web page, select the information blocks that belong to the specified categories for display.
Further, the selected information blocks are packaged into a new novel web page.
Suppose information blocks 1 to 8 are found to belong to page-top navigation, secondary navigation, novel title, novel text information, novel text, novel navigation, link information block and interactive block, respectively, and the specified categories are secondary navigation, novel title, novel text and novel navigation. Then the information blocks belonging to these categories, namely information blocks 2, 3, 5 and 6, are selected from information blocks 1 to 8 of the novel web page and packaged into a new novel web page.
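The selection and packaging in step 306 can be sketched as a filter followed by a simple page template. The tuple layout, the function names and the HTML skeleton are illustrative assumptions; the source does not describe how the new page is assembled.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical layout: (block name, category, block HTML).
ClassifiedBlock = Tuple[str, str, str]


def select_blocks(classified: Iterable[ClassifiedBlock],
                  wanted: Set[str]) -> List[Tuple[str, str]]:
    """Keep the blocks whose category is specified, preserving page order."""
    return [(name, html) for name, category, html in classified
            if category in wanted]


def package(selected: List[Tuple[str, str]], title: str = "Extracted page") -> str:
    """Wrap the selected blocks in a minimal new page."""
    body = "\n".join(html for _, html in selected)
    return (f"<html><head><title>{title}</title></head>"
            f"<body>\n{body}\n</body></html>")


if __name__ == "__main__":
    categories = ["page_top_navigation", "secondary_navigation", "novel_title",
                  "novel_text_information", "novel_text", "novel_navigation",
                  "link_information_block", "interactive_block"]
    page = [(f"block {i + 1}", c, f"<div>{c}</div>")
            for i, c in enumerate(categories)]
    wanted = {"secondary_navigation", "novel_title",
              "novel_text", "novel_navigation"}
    print(package(select_blocks(page, wanted)))
```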
In this embodiment of the present invention, the information in a novel web page is divided into one or more information blocks; the feature values of a plurality of features of each information block are obtained according to a preset feature set; the category of each information block is determined from those feature values and the preset categories; and the information blocks belonging to the specified categories are selected from the information blocks of the novel web page. The feature set comprises a plurality of features. Each category corresponds to one kind of information block and comprises one or more samples, and each sample comprises the feature values of the plurality of features of an information block of that kind. The features and the kinds of information blocks are the same across the novel web pages of different novel websites, and redesigning a novel web page does not change them either. A technician therefore only needs to set the features and set one category for each kind of information block, which reduces manual effort and maintenance.
Embodiment 4
This embodiment of the present invention provides a method for extracting web page information blocks. Referring to Fig. 6, a blog web page generally comprises information blocks such as page-top navigation, blog navigation, blog title, blog information, blog text, link information block and interactive block. The information blocks valuable to the user, that is, those containing key information, are blog navigation, blog title, blog information and blog text. In the present embodiment, the information blocks containing key information are extracted from the blog web page by this method. Referring to Fig. 7, the method comprises:
Step 401: Divide the information contained in the blog web page to obtain the one or more information blocks it comprises; suppose the information blocks obtained are information blocks 1 to 7;
In the source text of the blog web page, the code implementing each information block is enclosed by a block tag. Specifically, the source text of the blog web page is parsed, and the information enclosed by one and the same parsed tag is taken as one information block.
Referring to Fig. 6, in this step the information in the blog web page shown in Fig. 6 is divided into seven information blocks, information blocks 1 to 7, whose categories are not yet known. The present embodiment takes information block 1 as an example: its category is determined by steps 402 to 405 below, and the categories of information blocks 2 to 7 are determined by performing steps 402 to 405 in the same way.
Step 402: For one of the divided information blocks, assumed here to be information block 1, determine from the feature set which features information block 1 has and which it does not have; the feature set comprises a plurality of features;
The features in the feature set are obtained in advance by a technician through analysis of a large number of web pages, including news web pages, novel web pages and/or blog web pages. For example, in the present embodiment the feature set comprises 111 preset features; for their specific content, see the corresponding part of Embodiment 2, which is not repeated here.
Count the number of paragraph tags, the number of characters and the number of text characters in the code of information block 1; calculate the ratio of the number of text characters to the number of characters and take it as the text density of information block 1; and calculate the similarity between information block 1 and the title of the blog web page. By the same method, count the paragraph tags, characters and text characters of information blocks 2 to 7, and calculate the text density of each of them and its similarity to the title of the blog web page;
Judge whether the number of paragraph tags in information block 1 is the largest among the information blocks of the web page; if so, information block 1 has feature 1, otherwise it does not. Judge whether the text density of information block 1 is the largest; if so, information block 1 has feature 2, otherwise it does not. Judge whether the number of text characters in information block 1 is the largest; if so, information block 1 has feature 3, otherwise it does not. And judge whether the similarity between information block 1 and the title of the blog web page is the largest; if so, information block 1 has feature 4, otherwise it does not;
Determine whether the code of information block 1 contains a complete sentence, a heading tag, an interactive tag, a link tag, a complete time string and a secondary navigation symbol. If information block 1 contains a complete sentence, it has feature 5, otherwise it does not; if it contains a heading tag, it has feature 6, otherwise it does not; if it contains an interactive tag, it has feature 7, otherwise it does not; if it contains a link tag, it has feature 8, otherwise it does not; if it contains a complete time string, it has feature 9, otherwise it does not; and if it contains a secondary navigation symbol, it has feature 10, otherwise it does not;
And, according to the keywords corresponding to features 11 to 111, determine which keywords information block 1 contains and which it does not; information block 1 has the features corresponding to the keywords it contains and does not have the features corresponding to the keywords it lacks. It is thus determined which features in the feature set information block 1 has and which it does not have.
For information blocks 2 to 7 of the blog web page, it is determined by the same method which features in the feature set each of them has and which it does not have.
Step 403: Set the feature value of each feature that information block 1 has to the first feature value, and the feature value of each feature that it does not have to the second feature value, thereby obtaining the feature values of the plurality of features of information block 1;
The first feature value may be, for example, 1 or 2, and the second feature value may be, for example, 0; the present embodiment does not limit the specific values of the first and second feature values.
Step 404: Calculate the probability that information block 1 belongs to each category from the feature values of its plurality of features; information blocks correspond one-to-one with categories;
The categories corresponding to the information blocks of a blog web page comprise at least one of page-top navigation, blog navigation, blog title, blog information, blog text, link information block and interactive block.
Each category comprises one or more samples, and each sample comprises the feature values of the plurality of features of an information block corresponding to that category.
The information blocks that a blog web page generally comprises are page-top navigation, blog navigation, blog title, blog information, blog text, link information block and interactive block, so each information block in a blog web page corresponds to one category. Each category comprises one or more samples, each sample comprises the feature values of the plurality of features of an information block corresponding to that category, and the features of each sample are the same as the features in the feature set.
Wherein, the technician collects a large amount of blog web page in advance, and artificially each blog web page is divided, obtain the message block that each blog web page comprises, be respectively the navigation of page or leaf top, the blog navigation, the blog title, blog information, the blog text, link information piece and mutual piece, then the page or leaf top navigation of each blog web page is analyzed and obtained page or leaf top navigation corresponding class, namely obtain one or more samples that the navigation of page or leaf top comprises, navigation is analyzed and is obtained blog navigation corresponding class to the blog of each blog web page, namely obtain one or more samples that the blog navigation comprises, the blog title of each blog web page analyzed obtain blog title corresponding class, namely obtain one or more samples that the blog title comprises, blog information analysis to each blog web page obtains the blog information corresponding class, namely obtain one or more samples that blog information comprises, the blog text of each blog web page analyzed obtain blog text corresponding class, namely obtain one or more samples that the blog text comprises, the link information piece of each blog web page analyzed obtain link information piece corresponding class, namely obtain one or more samples that the link information piece comprises and the mutual piece analysis of each blog web page is obtained mutual piece corresponding class, namely obtain one or more samples that mutual piece comprises.
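The sample collection described above amounts to grouping manually labelled blocks by category. The following is a minimal sketch under that reading; the function name `build_categories` and the toy feature values are assumptions made for illustration, not data from the embodiment.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Sample = List[int]  # feature values of one manually labelled information block


def build_categories(
    labelled_blocks: Iterable[Tuple[str, Sample]]
) -> Dict[str, List[Sample]]:
    """Group labelled blocks so that each category holds its own samples."""
    categories: Dict[str, List[Sample]] = defaultdict(list)
    for category, features in labelled_blocks:
        categories[category].append(list(features))
    return dict(categories)


# Made-up labelled blocks from two hypothetical blog pages.
labelled = [
    ("page-top navigation", [1, 0, 1]),
    ("blog text", [0, 1, 0]),
    ("page-top navigation", [1, 0, 0]),
]
categories = build_categories(labelled)
```

Each key of `categories` is one category, and its list holds the samples that later steps count over.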
Specifically, the probability that information block 1 belongs to each category can be calculated by the following steps (1)-(7):
(1): For one preset category, assumed here to be page-top navigation, and for feature Tk of information block 1, where k is any value from 1 to a preset value F, obtain from the samples of page-top navigation those samples whose feature value for Tk is the first feature value, and count the obtained samples to get a first sample count Ck;
Further, before step (1) is performed, the category probability of page-top navigation, P(C) = C/E, is calculated from the total number of samples C in page-top navigation and the total number of samples E in all categories.
(2): From the samples of all categories, obtain those samples whose feature value for Tk is the first feature value, and count the obtained samples to get a second sample count Ek;
(3): Examine the feature value of feature Tk of information block 1: if it is the first feature value, perform step (4); if it is the second feature value, perform step (5);
(4): From the first sample count Ck and the second sample count Ek, calculate the feature probability P(Tk) of feature Tk by the following formula (1);
$$P(T_k) = \frac{C_k + k_1}{E_k + k_2} \qquad (1)$$
where k1 and k2 are coefficients, and the value of k2 is greater than or equal to the value of k1;
(5): From the first sample count Ck and the second sample count Ek, calculate the feature probability P(Tk) of feature Tk by the following formula (2);
$$P(T_k) = 1 - \frac{C_k + k_1}{E_k + k_2} \qquad (2)$$
Applying steps (1)-(5) to the feature value of each of the F features of information block 1 yields the feature probabilities of those F features, namely P(T1), P(T2), P(T3), ..., P(TF).
(6): From the category probability P(C) of page-top navigation and the feature probabilities of the F features of information block 1, calculate the probability that information block 1 belongs to page-top navigation: P1 = P(C) * P(T1) * P(T2) * P(T3) * ... * P(TF).
(7): By the same steps (1)-(6), calculate the probabilities that information block 1 belongs to blog navigation, to blog title, to blog information, to blog text, to link information block, and to interactive block, respectively.
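Steps (1)-(7) can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes the first feature value is 1, it takes formula (1) to be the ratio (Ck + k1)/(Ek + k2), that is, the complement of formula (2), and the default coefficients k1 = 1 and k2 = 2 as well as the function names and toy samples are hypothetical.

```python
from typing import Dict, List

Sample = List[int]   # one feature value per feature T1..TF
FIRST_VALUE = 1      # assumed encoding of "the block has the feature"


def category_probability(block: Sample, category: str,
                         categories: Dict[str, List[Sample]],
                         k1: float = 1.0, k2: float = 2.0) -> float:
    """Probability that `block` belongs to `category`, per steps (1)-(6)."""
    samples = categories[category]
    total = sum(len(group) for group in categories.values())
    probability = len(samples) / total          # P(C) = C / E
    for k, value in enumerate(block):
        # Step (1): samples of this category whose Tk is the first value.
        ck = sum(1 for s in samples if s[k] == FIRST_VALUE)
        # Step (2): samples of all categories whose Tk is the first value.
        ek = sum(1 for group in categories.values()
                 for s in group if s[k] == FIRST_VALUE)
        ratio = (ck + k1) / (ek + k2)
        # Steps (3)-(5): formula (1) if the block has Tk, else formula (2).
        probability *= ratio if value == FIRST_VALUE else 1.0 - ratio
    return probability


def all_probabilities(block: Sample, categories: Dict[str, List[Sample]],
                      k1: float = 1.0, k2: float = 2.0) -> Dict[str, float]:
    """Step (7): repeat the calculation for every category."""
    return {name: category_probability(block, name, categories, k1, k2)
            for name in categories}


# Toy samples with F = 2 features; the values are made up for illustration.
toy_categories = {
    "page-top navigation": [[1, 0], [1, 1]],
    "blog text": [[0, 1], [0, 1]],
}
scores = all_probabilities([1, 0], toy_categories)
```

With these toy samples, the block [1, 0] scores 0.5 * 0.75 * 0.6 = 0.225 for page-top navigation and 0.5 * 0.25 * 0.4 = 0.05 for blog text. Because Ck never exceeds Ek, requiring k2 to be at least k1 (with both non-negative) keeps the ratio at or below 1, so both formulas stay within the range of a probability.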
Step 405: select the category with the highest probability and take it as the category to which information block 1 belongs;
For information blocks 2 through 7, steps 402 to 405 are performed in the same way as for information block 1 to obtain the category to which each of them belongs, so that the category of every information block in the blog web page is obtained.
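Step 405 is a maximum selection over the per-category probabilities. A minimal sketch follows, assuming the probabilities for each block are already available as a dictionary; the numbers and the function name `classify` are made up for illustration.

```python
from typing import Dict


def classify(probabilities: Dict[str, float]) -> str:
    """Return the category whose probability is highest."""
    return max(probabilities, key=probabilities.get)


# Made-up probabilities for two information blocks.
per_block = {
    "block 1": {"page-top navigation": 0.225, "blog text": 0.05},
    "block 5": {"page-top navigation": 0.01, "blog text": 0.30},
}
block_categories = {block: classify(p) for block, p in per_block.items()}
```

The embodiment does not say how ties are resolved; in this sketch `max` returns the first category encountered among equal maxima.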
Step 406: from the information blocks of the blog web page, select the information blocks that belong to the specified categories, for display.
Further, the selected information blocks are packaged into a new blog web page.
Suppose that information blocks 1 through 7 are found to belong to page-top navigation, blog navigation, blog title, blog information, blog text, link information block, and interactive block, respectively, and that the specified categories are blog navigation, blog title, blog information, and blog text. Correspondingly, the information blocks belonging to blog navigation, blog title, blog information, and blog text are selected from information blocks 1 through 7 of the blog web page, namely information block 2, information block 3, information block 4, and information block 5, and the selected blocks are packaged into a new blog web page.
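The selection and packaging in step 406 can be sketched with the example above. The embodiment does not describe how the new page is assembled, so `package_page` below simply concatenates hypothetical HTML fragments; both function names and the fragment contents are assumptions.

```python
from typing import Dict, List, Set


def select_blocks(block_categories: Dict[str, str],
                  specified: Set[str]) -> List[str]:
    """Keep, in page order, the blocks whose category is specified."""
    return [block for block, category in block_categories.items()
            if category in specified]


def package_page(selected: List[str], contents: Dict[str, str]) -> str:
    """Assemble the selected blocks into a minimal new page (assumed layout)."""
    body = "\n".join(contents[block] for block in selected)
    return "<html><body>\n" + body + "\n</body></html>"


block_categories = {
    "block 1": "page-top navigation",
    "block 2": "blog navigation",
    "block 3": "blog title",
    "block 4": "blog information",
    "block 5": "blog text",
    "block 6": "link information block",
    "block 7": "interactive block",
}
specified = {"blog navigation", "blog title", "blog information", "blog text"}
selected = select_blocks(block_categories, specified)

# Hypothetical fragment for each block, for illustration only.
contents = {name: "<div>" + name + "</div>" for name in block_categories}
new_page = package_page(selected, contents)
```

As in the example, blocks 2 through 5 are selected, and blocks 1, 6, and 7 are left out of the new page.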
In this embodiment of the present invention, the information contained in a blog web page is divided into one or more information blocks; the feature values of the multiple features of each information block are obtained according to a preset feature set; the category to which each information block belongs is determined from those feature values and the preset categories; and the information blocks belonging to the specified categories are selected from the information blocks of the blog web page. The feature set comprises multiple features; each category corresponds to one kind of information block and comprises one or more samples, and each sample comprises the feature values of the multiple features of an information block corresponding to that category. The kinds of features and information blocks found in blog web pages are the same across different blog sites, and redesigning a blog web page does not change the kinds of features and information blocks it contains. Technicians therefore only need to set the multiple features and set one corresponding category for each kind of information block, which reduces the manual effort and maintenance required.
Embodiment 5
As shown in Figure 8, this embodiment of the present invention provides a device for extracting web page information blocks, comprising:
An acquisition module 501, configured to obtain the feature values of multiple features of a web page, the web page comprising multiple information blocks;
A determination module 502, configured to determine the category of each information block according to the obtained feature values, where the information blocks are in one-to-one correspondence with multiple categories, and the categories comprise page-top navigation, secondary navigation, body title, body information, body text, link information block, and interactive block;
A selection module 503, configured to select at least one information block from the web page for display.
The acquisition module 501 is specifically configured to set the feature value of each feature that the web page has to a first feature value, and to set the feature value of each feature that the web page does not have to a second feature value.
The determination module 502 comprises:
A computing unit, configured to calculate the probability that an information block belongs to each category;
A definition unit, configured to take the category with the highest corresponding probability as the category of the information block.
The computing unit comprises:
A first computation subunit, configured to calculate, for a category C, the category probability of category C from the total number of samples Ctotal in category C and the total number of samples Total in all categories:
$$P(C) = \frac{\mathrm{Ctotal}}{\mathrm{Total}}$$
A first statistics subunit, configured to, for feature Tk of the information block, where k is any value from 1 to a preset value F, obtain from the samples of category C those samples whose feature value for Tk is the first feature value, and count the obtained samples to get a first sample count Ck;
A second statistics subunit, configured to obtain, from the samples of all preset categories, those samples whose feature value for Tk is the first feature value, and count the obtained samples to get a second sample count Ek;
A judgment subunit, configured to examine the feature value of feature Tk;
A second computation subunit, configured to, if the feature value is the first feature value, calculate the feature probability P(Tk) of feature Tk from the first sample count Ck and the second sample count Ek by the following formula (1), and to calculate, from the category probability P(C) and the feature probabilities of the features of the information block, the probability that the information block belongs to category C: P = P(C) * P(T1) * P(T2) * P(T3) * ... * P(TF);
$$P(T_k) = \frac{C_k + k_1}{E_k + k_2} \qquad (1)$$
where k1 and k2 are coefficients, and the value of k2 is greater than or equal to the value of k1;
A second computation subunit, configured to, if the feature value is the second feature value, calculate the feature probability P(Tk) of feature Tk from the first sample count Ck and the second sample count Ek by the following formula (2), and to calculate, from the category probability P(C) and the feature probabilities of the features of the information block, the probability that the information block belongs to category C: P = P(C) * P(T1) * P(T2) * P(T3) * ... * P(TF);
$$P(T_k) = 1 - \frac{C_k + k_1}{E_k + k_2} \qquad (2)$$
In this embodiment of the present invention, the information contained in a web page is divided into one or more information blocks; the feature values of the multiple features of each information block are obtained; the category to which each information block belongs is determined from those feature values; and at least one information block is selected from the information blocks of the web page. The feature set comprises multiple features, and each category corresponds to one kind of information block. The kinds of features and information blocks found in web pages are the same across different websites, and redesigning a web page does not change the kinds of features and information blocks it contains. Technicians therefore only need to set the multiple features and set one corresponding category for each kind of information block, which reduces the manual effort and maintenance required.
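The division of the device into three modules can be sketched as three small classes. This is only an illustrative arrangement: the class and method names are hypothetical, the feature values 1 and 0 are assumed, and the scoring function passed to the determination module is a placeholder that counts matches against a made-up prototype rather than the probability calculation of the computing unit.

```python
from typing import Callable, Dict, List, Set


class AcquisitionModule:
    """Obtains feature values (1 if the feature is present, else 0)."""

    def __init__(self, feature_set: List[str]) -> None:
        self.feature_set = feature_set

    def acquire(self, present: Set[str]) -> List[int]:
        return [1 if name in present else 0 for name in self.feature_set]


class DeterminationModule:
    """Picks the category with the highest score for a block."""

    def __init__(self, score: Callable[[List[int], str], float],
                 categories: List[str]) -> None:
        self.score = score
        self.categories = categories

    def determine(self, values: List[int]) -> str:
        return max(self.categories, key=lambda c: self.score(values, c))


class SelectionModule:
    """Selects the blocks whose category is among the specified ones."""

    def __init__(self, specified: Set[str]) -> None:
        self.specified = specified

    def select(self, block_categories: Dict[str, str]) -> List[str]:
        return [b for b, c in block_categories.items() if c in self.specified]


# Placeholder scoring: number of positions matching a made-up prototype.
PROTOTYPES = {"blog navigation": [1, 0], "blog text": [0, 1]}


def toy_score(values: List[int], category: str) -> float:
    return float(sum(v == p for v, p in zip(values, PROTOTYPES[category])))


acquisition = AcquisitionModule(["has_links", "has_long_text"])
determination = DeterminationModule(toy_score, list(PROTOTYPES))
selection = SelectionModule({"blog text"})

values = acquisition.acquire({"has_long_text"})
category = determination.determine(values)
chosen = selection.select({"block 1": "blog navigation", "block 2": category})
```

Passing the scoring function in as a parameter keeps the determination module independent of how the probability is computed, so the placeholder can be swapped for the calculation of formulas (1) and (2).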
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware, or by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium can be a read-only memory (ROM), a magnetic disk, an optical disc, or the like.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (8)

1. A method for extracting web page information blocks, characterized in that the method comprises:
obtaining the feature values of multiple features of a web page, the web page comprising multiple information blocks;
determining the category of each information block according to the obtained feature values, where the information blocks are in one-to-one correspondence with multiple categories, and the categories comprise at least one of page-top navigation, secondary navigation, body title, body information, body text, novel title, novel body information, novel body text, novel navigation, blog navigation, blog title, blog information, blog text, link information block, and interactive block;
selecting at least one information block from the web page for display.
2. The method of claim 1, characterized in that obtaining the feature values of multiple features of a web page comprises:
setting the feature value of each feature that the web page has to a first feature value, and setting the feature value of each feature that the web page does not have to a second feature value.
3. The method of claim 2, characterized in that determining the category of each information block according to the obtained feature values comprises: calculating the probability that each information block belongs to each category, and taking the category with the highest corresponding probability as the category of the information block.
4. The method of claim 3, characterized in that calculating the probability that each information block belongs to each category comprises:
for any category C, calculating the category probability of category C from the total number of samples Ctotal in category C and the total number of samples Total in all categories:
$$P(C) = \frac{\mathrm{Ctotal}}{\mathrm{Total}}$$
for feature Tk of the information block, where k is any value from 1 to a preset value F, obtaining from the samples of category C those samples whose feature value for Tk is the first feature value, and counting the obtained samples to get a first sample count Ck;
obtaining, from the samples of all preset categories, those samples whose feature value for Tk is the first feature value, and counting the obtained samples to get a second sample count Ek;
examining the feature value of feature Tk;
if the feature value is the first feature value, calculating the feature probability P(Tk) of feature Tk from the first sample count Ck and the second sample count Ek by the following formula (1), and calculating, from the category probability P(C) and the feature probabilities of the features of the information block, the probability that the information block belongs to category C: P = P(C) * P(T1) * P(T2) * P(T3) * ... * P(TF);
$$P(T_k) = \frac{C_k + k_1}{E_k + k_2} \qquad (1)$$
where k1 and k2 are coefficients, and the value of k2 is greater than or equal to the value of k1;
if the feature value is the second feature value, calculating the feature probability P(Tk) of feature Tk from the first sample count Ck and the second sample count Ek by the following formula (2), and calculating, from the category probability P(C) and the feature probabilities of the features of the information block, the probability that the information block belongs to category C: P = P(C) * P(T1) * P(T2) * P(T3) * ... * P(TF);
$$P(T_k) = 1 - \frac{C_k + k_1}{E_k + k_2} \qquad (2)$$
5. A device for extracting web page information blocks, characterized in that the device comprises:
an acquisition module, configured to obtain the feature values of multiple features of a web page, the web page comprising multiple information blocks;
a determination module, configured to determine the category of each information block according to the obtained feature values, where the information blocks are in one-to-one correspondence with multiple categories, and the categories comprise page-top navigation, secondary navigation, body title, body information, body text, novel title, novel body information, novel body text, novel navigation, blog navigation, blog title, blog information, blog text, link information block, and interactive block;
a selection module, configured to select at least one information block from the web page for display.
6. The device of claim 5, characterized in that
the acquisition module is specifically configured to set the feature value of each feature that the web page has to a first feature value, and to set the feature value of each feature that the web page does not have to a second feature value.
7. The device of claim 6, characterized in that the determination module comprises:
a computing unit, configured to calculate the probability that the information block belongs to each category;
a definition unit, configured to take the category with the highest corresponding probability as the category of the information block.
8. The device of claim 7, characterized in that the computing unit comprises:
a first computation subunit, configured to calculate, for any category C, the category probability of category C from the total number of samples Ctotal in category C and the total number of samples Total in all categories:
$$P(C) = \frac{\mathrm{Ctotal}}{\mathrm{Total}}$$
a first statistics subunit, configured to, for feature Tk of the information block, where k is any value from 1 to a preset value F, obtain from the samples of category C those samples whose feature value for Tk is the first feature value, and count the obtained samples to get a first sample count Ck;
a second statistics subunit, configured to obtain, from the samples of all preset categories, those samples whose feature value for Tk is the first feature value, and count the obtained samples to get a second sample count Ek;
a judgment subunit, configured to examine the feature value of feature Tk;
a second computation subunit, configured to, if the feature value is the first feature value, calculate the feature probability P(Tk) of feature Tk from the first sample count Ck and the second sample count Ek by the following formula (1), and to calculate, from the category probability P(C) and the feature probabilities of the features of the information block, the probability that the information block belongs to category C: P = P(C) * P(T1) * P(T2) * P(T3) * ... * P(TF);
$$P(T_k) = \frac{C_k + k_1}{E_k + k_2} \qquad (1)$$
where k1 and k2 are coefficients, and the value of k2 is greater than or equal to the value of k1;
a second computation subunit, configured to, if the feature value is the second feature value, calculate the feature probability P(Tk) of feature Tk from the first sample count Ck and the second sample count Ek by the following formula (2), and to calculate, from the category probability P(C) and the feature probabilities of the features of the information block, the probability that the information block belongs to category C: P = P(C) * P(T1) * P(T2) * P(T3) * ... * P(TF);
$$P(T_k) = 1 - \frac{C_k + k_1}{E_k + k_2} \qquad (2)$$
CN2012100046538A 2012-01-09 2012-01-09 Method and device for extracting web page information blocks Pending CN103198075A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012100046538A CN103198075A (en) 2012-01-09 2012-01-09 Method and device for extracting web page information blocks

Publications (1)

Publication Number Publication Date
CN103198075A true CN103198075A (en) 2013-07-10

Family

ID=48720644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012100046538A Pending CN103198075A (en) 2012-01-09 2012-01-09 Method and device for extracting web page information blocks

Country Status (1)

Country Link
CN (1) CN103198075A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080270334A1 (en) * 2007-04-30 2008-10-30 Microsoft Corporation Classifying functions of web blocks based on linguistic features
CN101788987A (en) * 2009-01-23 2010-07-28 北京大学 Automatic judging method of network resource types

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李连霞等: "基于多特征的网页内容提取研究", 《第三届和谐人机环境联合学术会议(HHME2007)论文集》 *
杨义先等编著: "《网络安全理论与技术》", 31 October 2003, 人民邮电出版社 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105183801A (en) * 2015-08-25 2015-12-23 北京信息科技大学 Web page body text extraction method and apparatus
CN105183801B (en) * 2015-08-25 2018-07-06 北京信息科技大学 web page text extracting method and device
CN110688552A (en) * 2019-06-27 2020-01-14 平安科技(深圳)有限公司 Webpage text content acquisition method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20130710