Summary of the invention
The object of the present invention is to provide a method for extracting the key elements of a news web page, so as to solve the problems of existing news element extraction methods: excessive reliance on templates, poor versatility, and the complexity of deriving template code.
The present invention proposes a method for extracting the key elements of a news web page, comprising the following steps:
(1) extracting the web page title and the web page meta information from the web page source code, and obtaining a keyword dictionary about the web page content;
(2) traversing the literal nodes in the web page source code, and, in the order of headline-issuing time-informed source-body or headline-informed source-issuing time-body, using the keyword dictionary to detect and extract the headline, issuing time, informed source and body.
Further, before step (1), the method also comprises: (10) preprocessing the web page source code to remove script code.
Further, step (1) also comprises: (11) performing word segmentation on the extracted web page title and web page meta information and removing stop words.
Further, step (2) also comprises: (21) filtering the literal nodes, the literal nodes that are filtered out being excluded from the detection range.
Further, in step (21), filtering the literal nodes according to the parent node tag of each literal node comprises:
(211) filtering out literal nodes that have no parent node;
(212) filtering out literal nodes whose parent node tag is not one of <div>, <paragraph>, <tablecolumn>, <heading>, <span>;
(213) filtering out literal nodes whose parent node tag is <div> with its style set to "hidden";
(214) after the headline and the issuing time have been detected, filtering out literal nodes whose parent node tag is <heading>;
(215) filtering out literal nodes whose parent node tag is <span> or <div> and whose text length is less than the average length of a body paragraph.
Further, in step (21), filtering the literal nodes according to text content comprises:
(216) filtering out literal nodes that contain copyright notice information;
(217) filtering out literal nodes that contain the words "sharing" and/or "comment" and/or "microblogging".
Further, in step (2), detecting and extracting the headline, issuing time and informed source comprises:
(22) when the text of a literal node is part of the web page title and its length is not less than one third of the length of the web page title, or when the similarity between the text of any literal node and the web page title is not less than a preset threshold, extracting the text content of that literal node as the headline, after which headline detection is no longer performed;
(23) matching the content of each literal node against time formats, and extracting the content of the literal node that matches successfully as the issuing time, after which issuing time detection is no longer performed;
(24) when the content of a literal node contains the word "source" or "author", extracting the content of that literal node as the informed source, after which informed source detection is no longer performed.
Further, in step (2), detecting and extracting the body comprises:
(25) establishing a high-hit set that stores the literal nodes with a high number of hits on the keyword dictionary;
(26) purifying the high-hit set by clustering, and taking the longest set of consecutive nodes as the purified set;
(27) finding the minimum common parent node of the purified set;
(28) traversing the document tree rooted at the minimum common parent node, and obtaining the body.
Further, after step (25), the method also comprises:
(251) establishing a doubtful set that stores the literal nodes whose number of hits on the keyword dictionary is insufficient, or whose text length is greater than a preset value;
(252) comparing the information content of the high-hit set with that of the doubtful set;
(253) if the information content of the high-hit set is less than that of the doubtful set, lowering the hit-count threshold for admission to the high-hit set, traversing the literal nodes in the web page source code again, and re-establishing the high-hit set;
(254) if the information content of the high-hit set is not less than that of the doubtful set, proceeding to step (26).
Further, step (28) comprises:
(281) if a literal node is the same node as the headline, issuing time or informed source node, taking this literal node as the start of the body;
(282) if the parent node tag of a literal node is of link type and none of its ancestor nodes is of list type, extracting the content of this literal node and adding it to the body;
(283) if the parent node tag of a literal node is one of <div>, <paragraph>, <tablecolumn>, <heading>, <span>, extracting the content of this literal node and adding it to the body.
Compared with the prior art, the beneficial effects of the present invention are as follows: starting from a statistical analysis of Chinese news web pages and combining the advantages of machine learning methods and regular expression methods, the present invention proposes a complete automated flow that accurately extracts four key elements: headline, issuing time, informed source and body. The present invention does not depend on any specific template and has strong versatility.
Embodiment
The present invention is described below with reference to the accompanying drawings.
Referring to Fig. 3, which is a flow chart of a news web page key element extraction method according to an embodiment of the present invention, the method comprises the following steps:
S31, extract the web page title and the web page meta information from the web page source code, and obtain a keyword dictionary about the web page content.
The web page title is a high-level summary of a web page; when a web page is browsed, the text shown in the title bar at the top of the browser is the "web page title". In the web page source code (HTML code), the web page title is located between the <head> ... </head> tags, in the form <title>network marketing teaching website</title>, where "network marketing teaching website" is the "web page title".
The web page meta information is contained in <meta> tags and provides document-related information in the form of key-value pairs, mainly as an index reference for search engines. Within the meta information, description is a description of the web page content and keywords lists the keywords of the web page content; these two items give a good understanding of the news content.
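As an illustration only, the extraction of the web page title and of the description and keywords meta information could be sketched as follows. The sketch uses the Python standard library HTML parser; the names HeadInfoParser and extract_head_info are hypothetical and are not part of the described method.

```python
from html.parser import HTMLParser


class HeadInfoParser(HTMLParser):
    """Collects the <title> text and the description/keywords meta values."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so it is accumulated.
        if self._in_title:
            self.title += data


def extract_head_info(html):
    """Return (title, meta) where meta maps 'description'/'keywords' to text."""
    parser = HeadInfoParser()
    parser.feed(html)
    return parser.title.strip(), parser.meta
```

The returned title and meta values would then be segmented into words to form the keyword dictionary.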
After the web page title and web page meta information have been extracted, the present invention preferably uses the bag-of-words model to extract keywords about the web page content and form the keyword dictionary. The bag-of-words model is a concept in text mining: it ignores word order and modification relations and treats a text fragment simply as a set of words. Taking the news page of Fig. 1 as an example, the bag of words formed from its web page title and web page meta information is shown in Fig. 4. On the basis of the bag-of-words model, the present invention can further represent a text fragment as a vector and then compute the content similarity between text fragments through vector operations such as distance and inner product. If only the presence or absence of words is considered and vector distance is used as the similarity measure, the similarity calculation reduces to counting the entries common to the two bags of words. Besides discrete vector distance, text similarity can of course also be computed with cosine distance, Euclidean distance, city-block distance and so on.
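The two similarity measures mentioned above can be illustrated with a minimal sketch. It assumes that the text has already been segmented into word lists; the function names are hypothetical.

```python
import math
from collections import Counter


def common_entries(bag_a, bag_b):
    """Similarity when only word presence matters: number of shared entries."""
    return len(set(bag_a) & set(bag_b))


def cosine_similarity(bag_a, bag_b):
    """Cosine similarity between the term-frequency vectors of two word bags."""
    a, b = Counter(bag_a), Counter(bag_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0
```

Counting common entries is the cheaper of the two and matches the hit counting used later against the keyword dictionary.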
S32, traverse the literal nodes in the web page source code, and, in the order of headline-issuing time-informed source-body or headline-informed source-issuing time-body, use the keyword dictionary to detect and extract the headline, issuing time, informed source and body.
A literal node in the present invention refers to a text node in the Document Object Model. The Document Object Model (DOM) is an application programming interface that can be used to dynamically access documents of types such as HTML and XML. The present invention mainly uses the HTML DOM, which represents a document as a tree structure and defines methods for accessing and manipulating the elements in the document.
To further explain the technical solution of the present invention, a detailed embodiment is described below. Referring to Fig. 5, which is a more detailed flow chart of another news web page key element extraction method according to an embodiment of the present invention, the method comprises the following steps:
S501, preprocess the web page source code to remove script (JS) code, so that the dynamically loaded content it contains does not interfere with the determination of the body position.
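One simple way to carry out this preprocessing, given here only as a sketch, is to delete every script element with a regular expression before the page is parsed. The names SCRIPT_RE and strip_scripts are hypothetical.

```python
import re

# Matches an opening <script ...> tag, its content, and the closing tag.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html):
    """Remove <script>...</script> blocks from raw HTML source."""
    return SCRIPT_RE.sub("", html)
```

Removing the scripts before parsing also prevents markup embedded in script strings from being mistaken for page content.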
S502, extract the web page title and the web page meta information from the web page source code, and obtain a keyword dictionary about the web page content.
After the web page title and web page meta information have been extracted, the present invention preferably uses the bag-of-words model, as shown in Fig. 4, to extract keywords about the web page content and form the keyword dictionary.
However, not every word in the bag of words is needed. Some words occur very frequently in news web pages but contribute little to expressing the news content, such as "at present", "so" and "it is reported"; the present invention calls them stop words. Therefore, after the keyword dictionary has been preliminarily formed, the stop words that may interfere with the judgment of text content similarity are removed, which makes the computation more concise.
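The formation of the keyword dictionary with stop words removed can be sketched as follows. The sketch assumes that word segmentation has been done beforehand; the stop word list shown is only a sample built from the examples in the text, and the function name is hypothetical.

```python
# Sample stop words taken from the examples above; a real list would be longer.
STOP_WORDS = {"at present", "so", "it is reported"}


def build_keyword_dictionary(title_words, meta_words, stop_words=STOP_WORDS):
    """Merge segmented title and meta words into a set, dropping stop words."""
    merged = list(title_words) + list(meta_words)
    return {w for w in merged if w and w not in stop_words}
```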
S503, establish the high-hit set and the doubtful set. During the traversal, the high-hit set stores the literal nodes with a high number of hits on the keyword dictionary (meaning high content similarity), and the doubtful set stores the literal nodes whose hits are insufficient but whose text is sufficiently long; the purpose is to discover suspected body text and then determine the body range. After that, the web page structure is parsed and traversal of its literal nodes begins.
S504, filter the literal nodes during the traversal, the literal nodes that are filtered out being excluded from the detection range. The present invention preferably uses two kinds of rules to filter the literal nodes:
1. Filtering according to the parent node tag of the literal node, comprising:
. filtering out literal nodes that have no parent node;
. filtering out literal nodes whose parent node tag is not one of <div>, <paragraph>, <tablecolumn>, <heading>, <span>;
. filtering out literal nodes whose parent node tag is <div> with its style set to "hidden";
. after the headline and the issuing time have been detected, filtering out literal nodes whose parent node tag is <heading>;
. filtering out literal nodes whose parent node tag is <span> or <div> and whose text length is less than the average length of a body paragraph; this average length is an empirical value obtained from statistics on a large number of <span> or <div> tag samples, and literal nodes of this kind are regarded as navigation information and are not detected.
2. Filtering according to text content, comprising:
. filtering out literal nodes that contain copyright notice information;
. filtering out literal nodes that contain the words "sharing" and/or "comment" and/or "microblogging".
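The two kinds of filtering rules above can be sketched as a single predicate. This is a minimal illustration: the TextNode structure, the function name and the English stand-ins for the quoted words are assumptions, and the tag names are used exactly as written in the text without mapping them to concrete HTML tags.

```python
from dataclasses import dataclass
from typing import Optional

# Tag names as written in the text above.
ALLOWED_PARENTS = {"div", "paragraph", "tablecolumn", "heading", "span"}
# English stand-ins for the quoted words (assumption).
NOISE_WORDS = ("sharing", "comment", "microblogging")


@dataclass
class TextNode:
    text: str
    parent_tag: Optional[str]
    parent_hidden: bool = False


def is_filtered(node, avg_paragraph_len, headline_and_time_found=False):
    """Return True if the node should be excluded from detection."""
    # Rule set 1: parent node tag.
    if node.parent_tag is None:
        return True
    if node.parent_tag not in ALLOWED_PARENTS:
        return True
    if node.parent_tag == "div" and node.parent_hidden:
        return True
    if headline_and_time_found and node.parent_tag == "heading":
        return True
    if node.parent_tag in ("span", "div") and len(node.text) < avg_paragraph_len:
        return True
    # Rule set 2: text content.
    lowered = node.text.lower()
    if "copyright" in lowered:
        return True
    return any(word in lowered for word in NOISE_WORDS)
```

The value passed as avg_paragraph_len corresponds to the empirical average paragraph length described above and would be chosen from sample statistics.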
S505, headline detection.
When the text of a literal node is part of the web page title and its length is not less than one third of the length of the web page title, or when the similarity between the text of any literal node and the web page title is not less than a preset threshold, the text content of that literal node is extracted as the headline, and headline detection is no longer performed thereafter. No body text can appear before the headline, so once the headline has been detected, the literal nodes that precede the headline's node can be removed from the high-hit set and the doubtful set to optimize subsequent computation.
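The headline test can be sketched as below. The similarity measure is not fixed by the text; the sketch substitutes the ratio from the Python standard library difflib.SequenceMatcher, and both the function name and the threshold of 0.8 are assumptions.

```python
from difflib import SequenceMatcher


def is_headline(node_text, page_title, sim_threshold=0.8):
    """Apply the two headline conditions to one candidate node."""
    text = node_text.strip()
    if not text or not page_title:
        return False
    # Condition 1: the text is part of the web page title and is at least
    # one third as long as the title.
    if text in page_title and len(text) * 3 >= len(page_title):
        return True
    # Condition 2: the text is sufficiently similar to the web page title.
    return SequenceMatcher(None, text, page_title).ratio() >= sim_threshold
```

In a traversal, the first node for which this test succeeds would be taken as the headline and the test would not be applied again.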
S506, issuing time detection.
Once the headline has been found, the content of each literal node is matched against time formats; the content of the literal node that matches successfully is extracted as the issuing time, and issuing time detection is no longer performed thereafter. A time format here generally consists of two parts, digits and digit connectors. The year may be written with 4 digits, such as "2012", or with 2 digits, such as "12"; the month and the day have at most 2 digits, and a single-digit value may be zero-padded, as in "02 month 03 day", or left unpadded, as in "February 3". The digit connectors are mainly the hyphen, period, space, date words and forward slash. No body text can appear before the issuing time, so once the issuing time has been detected, the literal nodes that precede the issuing time's node can be removed from the high-hit set and the doubtful set to optimize subsequent computation.
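A time format of the kind described can be sketched as a regular expression. The sketch assumes that the date words are the Chinese characters 年, 月 and 日; the pattern and the function name are illustrative and cover only the date part, not the time of day.

```python
import re

# Year: 4 or 2 digits; month and day: 1 or 2 digits, optionally zero-padded.
# Connectors: hyphen, period, space, forward slash, or the characters 年 / 月 / 日
# (the choice of these characters is an assumption).
DATE_RE = re.compile(
    r"(?<!\d)(?P<year>\d{4}|\d{2})\s*[-./年 ]\s*"
    r"(?P<month>0?[1-9]|1[0-2])\s*[-./月 ]\s*"
    r"(?P<day>0?[1-9]|[12]\d|3[01])(?!\d)\s*日?"
)


def find_issuing_time(text):
    """Return (year, month, day) strings for the first date-like match, or None."""
    m = DATE_RE.search(text)
    return (m.group("year"), m.group("month"), m.group("day")) if m else None
```

A permissive pattern of this kind can also match unrelated digit groups, which is one reason the detection is only started after the headline has been found.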
S507, informed source detection.
The issuing time and the informed source of a news web page are very likely to be in the same piece of text, so source-format matching is also performed on that text after the issuing time has been found. When the content of a literal node contains the word "source" or "author", the content of that literal node is extracted as the informed source, and informed source detection is no longer performed thereafter.
S508, detect the number of hits of each literal node on the keyword dictionary.
During the traversal, the literal nodes that meet the detection requirements are examined. If the content of a literal node has 2 or more hits on the keyword dictionary, the node is added to the high-hit set; if it has exactly 1 hit, the node is added to the doubtful set; if it has no hits but the text length of the literal node is greater than 20 (20 characters is roughly half a line as normally displayed on a typical Chinese news web page), the node is considered likely to belong to the body and is also added to the doubtful set.
The hit-count thresholds for admission to the high-hit set and the doubtful set, as well as the text length for admission to the doubtful set, can all be set according to actual needs.
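The assignment of a literal node to the high-hit set or the doubtful set can be sketched as follows. The sketch assumes that hits are counted as distinct dictionary entries found among the node's segmented words; the function names and default values mirror the figures given above but are otherwise hypothetical.

```python
def count_hits(words, keyword_dictionary):
    """Number of distinct keyword-dictionary entries found among a node's words."""
    return len(set(words) & set(keyword_dictionary))


def classify_node(text, words, keyword_dictionary, hit_threshold=2, min_length=20):
    """Return 'high', 'doubtful', or None for one literal node."""
    hits = count_hits(words, keyword_dictionary)
    if hits >= hit_threshold:
        return "high"
    # Nodes with too few hits, or with no hits but long text, are only suspected.
    if hits >= 1 or len(text) > min_length:
        return "doubtful"
    return None
```

Lowering hit_threshold to 1 moves the single-hit nodes from the doubtful set into the high-hit set.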
S509, detect the information content of the high-hit set and the doubtful set.
Detection of the information content of the high-hit set relies mainly on content, while detection of the information content of the doubtful set relies mainly on structural factors. The number of literal nodes in the high-hit set is denoted N1; the LDR (Length-Distance Ratio) value of each node in the doubtful set is calculated, and the number of literal nodes whose LDR value is greater than a certain threshold is denoted N2; the number of literal nodes in the doubtful set that have keyword hits is denoted N3. The comparison between the information content of the high-hit set and that of the doubtful set is obtained from the relative magnitudes of these three values.
If the information content of the high-hit set is greater than that of the doubtful set, proceed to step S510; otherwise lower the hit-count threshold for admission to the high-hit set (for example from 2 hits to 1 hit), traverse again and re-establish the high-hit set.
If the doubtful set still has the greater information content after the repeated traversal, the information in the web page title (<title>) and the web page meta information (<meta>) is very likely insufficient, and the N2 literal nodes selected from the doubtful set can be used directly to form a new high-hit set.
If N2 is 0, the body text is very likely very short or scattered, making the LDR values very small; in this case body extraction can be carried out directly, as follows:
. if the number of literal nodes in the doubtful set is small, the longest text can be taken directly as the body;
. the first node in the doubtful set is presumed to be the title; starting from the second node, an interval threshold is set, the consecutive text nodes that satisfy the interval threshold are found, and their combined content is taken as the body.
The LDR (Length-Distance Ratio) value mentioned here is a structural feature of news web pages used to measure how tightly a piece of text is connected to its context, which helps to distinguish body text from non-body text. Text in a web page has a certain length, and adjacent text nodes are separated by a certain distance; the ratio of length to distance measures how tightly texts are connected. As shown in Fig. 6, L is the text length and D is the distance between text nodes; the average of the values for the preceding and following neighbors can be regarded as a measure of the contextual tightness of the text.
The calculation expression of LDR value is as follows:
The LDR value is always less than 1; the closer it is to 1, the tighter the contextual connection and the more likely the text is genuine body text.
S510, build the purified set.
The starting position in the web page code of each literal node in the high-hit set is extracted, and the high-hit set is clustered. Clustering refers to the process of dividing the high-hit set into different classes according to the similarity of certain features of the literal nodes, such that elements within a class are highly similar and different classes differ greatly. Considering that these literal nodes may belong to three parts, namely before the body, within the body or after the body, the initial number of classes is preferably set to 3. The clustering result is analyzed, and the longest set of consecutive nodes is taken as the purification of the high-hit set, called the purified set.
The clustering method of the present invention is preferably K-means clustering. K-means first requires the number k of classes to be determined and k initial class centers to be chosen; each object is assigned to a class according to its distance to the k centers, after which the k class centers are updated; this is iterated until the k centers are essentially stable, yielding a clustering into k classes.
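The purification step can be illustrated with a one-dimensional K-means over the starting positions of the nodes in the high-hit set. This is a sketch under stated assumptions: the choice of initial centers is arbitrary, and "the longest set of consecutive nodes" is interpreted here as the cluster containing the most nodes (clusters of one-dimensional positions are contiguous in document order). The function names are hypothetical.

```python
def kmeans_1d(values, k=3, max_iter=100):
    """Plain 1-D K-means; returns a cluster label for each value."""
    ordered = sorted(values)
    # Spread the initial centers across the sorted values (one simple choice).
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assign every value to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]
        new_centres = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # An empty cluster keeps its previous center.
            new_centres.append(sum(members) / len(members) if members else centres[c])
        if new_centres == centres:
            break
        centres = new_centres
    return labels


def purified_set(positions, k=3):
    """Keep the positions of the largest cluster, in document order."""
    if len(positions) <= k:
        return sorted(positions)
    labels = kmeans_1d(positions, k)
    best = max(set(labels), key=labels.count)
    return sorted(p for p, lab in zip(positions, labels) if lab == best)
```

With positions such as 100, 120, 5000 to 5400 and 9000, the nodes around 5000 form the largest cluster and are kept, while the isolated nodes before and after them are discarded.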
S511, find the minimum common parent node of the purified set.
The ancestors of each literal node in the purified set (that is, the nodes above each literal node in the DOM tree) are enumerated, and a cumulative count is kept for each repeated ancestor. The node that has the maximum count and the rearmost position is taken as the body start node; this node is the minimum common parent node of the elements of the purified set. If the headline node has been obtained and the position of the extracted body start node precedes the headline node, the purified set is considered to be mixed with content outside the body; in that case, the rearmost node among those with the second-highest count is taken as the revised body start node, and its position is recorded for later use.
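The ancestor counting described above can be sketched as follows. Each literal node is represented by the list of its ancestors from the root downward; the node identifiers are hypothetical, and depth in the tree is used as a stand-in for "rearmost position", which is an assumption.

```python
from collections import Counter


def minimum_common_parent(ancestor_paths):
    """ancestor_paths: one root-to-parent list of node ids per literal node.

    Counts how often each ancestor occurs and returns the deepest ancestor
    among those with the highest count.
    """
    counts = Counter()
    depth = {}
    for path in ancestor_paths:
        for level, node_id in enumerate(path):
            counts[node_id] += 1
            depth[node_id] = level
    top = max(counts.values())
    candidates = [n for n, c in counts.items() if c == top]
    return max(candidates, key=lambda n: depth[n])
```

The correction for a start node that precedes the headline would repeat the selection among the ancestors with the second-highest count.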
S512, traverse the document tree rooted at the minimum common parent node, and obtain the body.
When obtaining the body, the literal nodes in the document tree rooted at the minimum common parent node are processed as follows:
1) if the node is the same as the headline, time or source node already found, it is not extracted but is taken as the start of the true body, that is, the body text found so far is cleared;
2) if the parent node tag of the node is of link type, detection continues upward; if no ancestor is of list type, the possibility of navigation is ruled out, and the node content is extracted and added to the body already extracted;
3) if the parent node tag of the node is one of <div>, <paragraph>, <tablecolumn>, <heading>, <span>, the node content is extracted and added to the body already extracted.
Starting from a statistical analysis of Chinese news web pages and combining the advantages of machine learning methods and regular expression methods, the present invention proposes a complete automated flow that accurately extracts four key elements: headline, issuing time, informed source and body. The present invention does not depend on any specific template and has strong versatility.
The present invention analyzes and extracts the web page in the order of headline-issuing time-informed source-body or headline-informed source-issuing time-body. Because the body module generally contains all four of the key elements sought, the headline, issuing time and informed source are generally obtained in the course of extracting the body. For some special web pages, however, if the headline and issuing time have not been obtained after the body has been extracted, the following additional flows are carried out:
One, additional extraction algorithm flow for the headline and issuing time.
S61, if the headline has already been obtained during body extraction, no further detection is needed; otherwise this flow is carried out. Because issuing time detection can only be performed once the headline exists, there are two possible operations:
. if the headline has not been obtained, but the web page title exists and the keyword dictionary contains a relatively large number of elements, the headline was not found because the similarity threshold was set too high; the threshold can then be lowered and the search repeated by traversing again, the search range being the nodes before the body start node.
. if the headline has not been obtained and the keyword dictionary contains few elements, the web page title is considered possibly unrelated to the body content; word segmentation can then be performed on the body content already obtained and stop words removed to obtain a new keyword dictionary, the literal nodes before the body start node are traversed, and the literal node with the most hits on the keyword dictionary is taken as the headline.
S62, if the headline and issuing time have been found through step S61, no further detection is needed; otherwise the following cases are possible:
. if the headline has not been obtained, the range of possible parent node tags for the headline can be widened; if the text of some node is contained in the web page title or is highly similar to the web page title, it is taken as the headline; otherwise the first sentence of the body can be designated as the headline. The body time is handled similarly: the range of parent tags is widened, and the first time-format match within the body is designated as the body time; failing that, the last time-format match before the body is designated as the body time.
. if the headline has been obtained, the time is processed according to the method above.
Two, additional extraction algorithm flow for the informed source.
S71, if the headline and the time have both been obtained, further detection is performed regardless of whether an informed source has already been extracted, to prevent interference from the words "source" and "author" contained in the body; the informed source found earlier is saved as a backup.
S72, the informed source is generally located after the headline node and possibly after the time node, and its text is generally shorter than a paragraph of the body. The search can start from the headline node, the stop condition being that the current node is after the time node and its text length is greater than a certain threshold. If text in source format is found during this process, it is chosen preferentially as the informed source.
S73, if no informed source is found in step S72, the informed source is set to the source saved in S71; if the saved source is empty, the informed source is set to the first source-format match after the title, with no restriction on the parent node tag type.
S74, if the informed source has still not been obtained by this step, a list of doubtful sources can be output and the active learning mode entered.
. interactive learning: the user can designate the real informed source in the doubtful source list, and the program stores the designated result in a background database. At regular intervals, all user-designated informed sources can be read from the database and checked for validity; those that really are sources are formally added to the media word list and applied in the extraction algorithm.
. statistical analysis of doubtful sources: when no user takes part, the doubtful source list can be stored in the background database and repeated words counted. At regular intervals the counts of the doubtful source words in the database are tallied, and each word is given a probability value, according to its count, representing its likelihood of being a media word. In practice, as the system runs, news media will appear as sources many times, whereas the non-source words in the doubtful source list will be widely scattered. The higher the count of a word, the more likely it is to denote a news source. The flow framework of active learning is shown in Fig. 7.
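The statistical analysis of doubtful sources can be sketched as below. The text does not specify how a count is converted into a probability value; relative frequency is used here purely as an assumption, and the function name is hypothetical.

```python
from collections import Counter


def source_probabilities(doubtful_sources):
    """Map each doubtful source word to its share of all recorded occurrences."""
    counts = Counter(doubtful_sources)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}
```

Words whose value stays high over repeated tallies would be the candidates for the media word list.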
The inventors have also tested the accuracy of the method of the present invention:
Using Baidu RSS as the source of news web pages, the inventors collected a total of 1721 non-duplicate news items in 11 categories from 429 websites as the test set. The test was carried out on a Toshiba M332 notebook computer running the 32-bit Win7 Ultimate operating system, with an Intel(R) Core(TM)2 Duo CPU T6400 processor at 2.00 GHz and 2.00 GB of memory. The test was carried out in the order of headline-issuing time-informed source-body. The test results are shown in Table 1:
| | Body | Headline | Issuing time | Informed source |
| Accuracy (%) | 96.11 | 98.43 | 98.2 | 97.39 |
Table 1
It can be seen that the present invention not only does not rely on manually written code templates, but also analyzes web pages with high accuracy. In addition, in the test on the 1721 web pages, the average time for the algorithm to parse a single web page was 65 ms, equivalent to processing about 15 web pages per second, which is a relatively high operating efficiency.
The above discloses only several specific embodiments of the present invention, but the present invention is not limited thereto; any variation that a person skilled in the art can conceive of, provided it does not go beyond the scope of the appended claims, shall fall within the protection scope of the present invention.