CN102750390B - Automatic news webpage element extracting method - Google Patents


Info

Publication number
CN102750390B
Authority
CN
China
Prior art keywords
node
literal
web page
literal node
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210232831.2A
Other languages
Chinese (zh)
Other versions
CN102750390A (en)
Inventor
张长水
宋成儒
翁时锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo Zhongqing Cyyun New Media Technology Co Ltd
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201210232831.2A priority Critical patent/CN102750390B/en
Publication of CN102750390A publication Critical patent/CN102750390A/en
Application granted granted Critical
Publication of CN102750390B publication Critical patent/CN102750390B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an automatic method for extracting the elements of a news webpage, comprising the following steps: (1) extracting the webpage title and the webpage meta-information from the webpage source code and obtaining a keyword dictionary related to the webpage content; and (2) traversing the literal nodes in the webpage source code and, using the keyword dictionary, detecting and extracting the news headline, issuing time, informed source and news body text in the order headline, issuing time, informed source, body text, or in the order headline, informed source, issuing time, body text. The method does not depend on any specific template and is highly general.

Description

News web page key element extraction method
Technical field
The present invention relates to Internet information analysis technology, and in particular to a method for automatically extracting the key elements of a news web page.
Background technology
In recent years, as the Internet has become widespread, people increasingly obtain useful information from online media. Online information is highly timely, and many major news events first spread on the network. Analysing online information, news in particular, therefore helps us follow the pulse of social development, discover local anomalies in time, and maintain social harmony and stability.
The volume of news on the Internet is vast. Manual analysis cannot keep up with the rate at which news is updated and is prone to mistakes, so the analysis is usually done by computer. Given a news web page, the first step in understanding and analysing its information is to automatically extract the four key news elements shown in Fig. 1: the headline, the issuing time, the informed source and the body text.
Existing news element extraction methods mostly focus only on the headline and the body text. The main approaches are as follows:
1. Regular expressions
A regular expression is a character string generated by specific syntax rules, used to describe or match text that conforms to a certain pattern. If the news web pages are all generated from the same template, the code pattern of the body text region can be expressed as a regular expression; Fig. 2 is a schematic diagram of extracting news elements with regular expressions. The content of each new input page can then be extracted with this one expression. The method is simple and targeted: the expression is written once and can be run indefinitely.
The drawback of the regular expression method is that manually generalising the code pattern of a web page is very complicated. A regular expression written for one template applies only to that template and is useless for pages in other formats. Even for the original template, adding a nested element or slightly modifying the body text may cause content extraction to fail.
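To make this brittleness concrete, the following is a minimal sketch of template-bound extraction. The template (an `<h1 class="title">` followed by a `<div id="content">`) and the function name `extract_with_template` are hypothetical and chosen only for illustration; they do not come from the patent.

```python
import re

# Hypothetical template: headline in <h1 class="title">, body in <div id="content">.
TEMPLATE_RE = re.compile(
    r'<h1 class="title">(?P<headline>.*?)</h1>'
    r'.*?<div id="content">(?P<body>.*?)</div>',
    re.DOTALL,
)

def extract_with_template(html):
    """Return headline and body for pages that follow the template, else None."""
    match = TEMPLATE_RE.search(html)
    return match.groupdict() if match else None
```

A page built from the assumed template is extracted correctly, a page from a different template yields nothing, and a nested `<div>` inside the body silently truncates the result, which is the failure mode described above.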
2. Wrappers
Regular expressions must be written by hand and correspond one-to-one with web page templates. Researchers therefore sought ways to automatically induce a unified model for pages from multiple templates. N. Kushmerick first realised this idea in 1997 with an algorithm named WIEN, and called the resulting model a wrapper. A wrapper here denotes a procedure that, for a new information source, uses existing template data and empirical knowledge about web pages to perform induction and automatic inference in the manner of artificial intelligence. The inferred result can then be applied to automatic information extraction from the new source.
Although wrapper extraction partly overcomes the inefficiency and narrow applicability of regular expressions, it does not escape the essence of the earlier method: the cost of induction is high, and it still fundamentally depends on templates.
In summary, existing news element extraction methods rely too heavily on templates, generalise poorly, and require complicated induction of template code.
Summary of the invention
The object of the present invention is to provide a method for automatically extracting the key elements of a news web page, in order to solve the problems that existing news element extraction methods rely too heavily on templates, generalise poorly, and require complicated induction of template code.
The present invention proposes a news web page key element extraction method comprising the following steps:
(1) extracting the web page title and the webpage meta-information from the webpage source code, and obtaining a keyword dictionary about the web page content;
(2) traversing the literal nodes in the webpage source code and, using the keyword dictionary, detecting and extracting the headline, issuing time, informed source and body text in the order headline, issuing time, informed source, body text, or in the order headline, informed source, issuing time, body text.
Further, before step (1) the method also comprises: (10) preprocessing the webpage source code to remove script code.
Further, step (1) also comprises: (11) segmenting the extracted web page title and webpage meta-information into words and removing stop words.
Further, step (2) also comprises: (21) filtering the literal nodes, the filtered-out literal nodes being excluded from the detection range.
Further, in step (21), filtering the literal nodes according to the tag of their parent node comprises:
(211) filtering out literal nodes that have no parent node;
(212) filtering out literal nodes whose parent node tag is not one of <div>, <paragraph>, <tablecolumn>, <heading>, <span>;
(213) filtering out literal nodes whose parent node tag is <div> with its style set to "hidden";
(214) after the headline and the issuing time have been detected, filtering out literal nodes whose parent node tag is <heading>;
(215) filtering out literal nodes whose parent node tag is <span> or <div> and whose text length is less than the average length of a body text paragraph.
Further, in step (21), filtering the literal nodes according to their text content comprises:
(216) filtering out literal nodes that contain copyright statement information;
(217) filtering out literal nodes that contain the words "share" and/or "comment" and/or "microblog".
Further, in step (2), detecting and extracting the headline, issuing time and informed source comprises:
(22) when the text of a literal node is contained in the web page title and its length is not less than one third of the length of the web page title text, or when the similarity between the text of any literal node and the web page title is not less than a predetermined threshold, extracting the text content of that literal node as the headline, after which headline detection is no longer performed;
(23) matching the content of the literal nodes against time formats and extracting the content of the successfully matched literal node as the issuing time, after which issuing time detection is no longer performed;
(24) when the content of a literal node contains the word "source" or "author", extracting the content of that literal node as the informed source, after which informed source detection is no longer performed.
Further, in step (2), detecting and extracting the body text comprises:
(25) establishing a high-hit set that stores the literal nodes with a high number of hits against the keyword dictionary;
(26) purifying the high-hit set by clustering, and taking the longest set of consecutive nodes as the purified set;
(27) finding the lowest common parent node of the purified set;
(28) traversing the document tree rooted at the lowest common parent node to obtain the body text.
Further, after step (25) the method also comprises:
(251) establishing a doubtful set that stores literal nodes whose number of hits against the keyword dictionary is insufficient, or whose text length is greater than a preset value;
(252) comparing the information content of the high-hit set with that of the doubtful set;
(253) if the information content of the high-hit set is less than that of the doubtful set, lowering the hit-count threshold for admission to the high-hit set, traversing the literal nodes in the webpage source code again, and rebuilding the high-hit set;
(254) if the information content of the high-hit set is greater than that of the doubtful set, proceeding to step (26).
Further, step (28) comprises:
(281) if a literal node is the same node as the headline, issuing time or informed source node, taking that literal node as the start of the body text;
(282) if the parent node tag of a literal node is a link type and none of its ancestor nodes is a list type, extracting the content of the literal node and adding it to the body text;
(283) if the parent node tag of a literal node is one of <div>, <paragraph>, <tablecolumn>, <heading>, <span>, extracting the content of the literal node and adding it to the body text.
Compared with the prior art, the beneficial effects of the invention are as follows: starting from a statistical analysis of Chinese news web pages and combining the strengths of machine learning methods and regular expression methods, the invention proposes a complete automatic procedure for accurately extracting the four elements of headline, issuing time, informed source and body text. The invention does not depend on any specific template and is highly general.
Brief description of the drawings
Fig. 1 is a schematic diagram of the four elements of a news web page;
Fig. 2 is a schematic diagram of extracting news elements with regular expressions;
Fig. 3 is a flow chart of a news web page key element extraction method according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the bag-of-words model formed from the web page title and webpage meta-information of Fig. 1;
Fig. 5 is a more detailed flow chart of another news web page key element extraction method according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of a structural feature of news web pages;
Fig. 7 is a flow framework diagram of the active learning method of the present invention for learning unknown sources.
Embodiment
The present invention is described in detail below with reference to the accompanying drawings.
Referring to Fig. 3, a flow chart of a news web page key element extraction method according to an embodiment of the present invention, the method comprises the following steps:
S31: extract the web page title and the webpage meta-information from the webpage source code, and obtain a keyword dictionary about the web page content.
The web page title is a high-level summary of a web page; when a page is viewed, the text shown in the title bar at the top of the browser is the web page title. In the webpage source code (HTML code), the web page title lies between the <head> ... </head> tags in the form <title>Network marketing teaching website</title>, where "Network marketing teaching website" is the web page title.
Webpage meta-information is contained in <meta> tags. It provides information about the document as key-value pairs and mainly serves as an indexing reference for search engines. Within the meta-information, description is a descriptive summary of the page content and keywords lists the keywords of the page content; together these two fields give a good indication of what the news is about.
After the web page title and webpage meta-information have been extracted, the present invention preferably uses a bag-of-words model to extract keywords about the page content and form the keyword dictionary. The bag-of-words model is a concept from text mining: it ignores word order and modification relationships and treats a text fragment simply as a set of words. Taking the news page of Fig. 1 as an example, the bag-of-words model formed from its web page title and webpage meta-information is shown in Fig. 4. On the basis of the bag-of-words model, a text fragment can further be represented as a vector, and the content similarity between text fragments can then be computed with vector operations such as distance or inner product. If only the presence or absence of each word is considered and vector distance is used as the similarity measure, the similarity computation reduces to counting the entries that two bags of words have in common. Besides this discrete vector distance, text similarity can also be computed with cosine distance, Euclidean distance, city-block distance and so on.
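The dictionary construction and the "count of common entries" similarity can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the class `HeadInfoParser`, the functions `tokenize`, `build_keyword_dictionary` and `hit_count`, and the tiny English `STOP_WORDS` list are all hypothetical, and the regex tokenizer is only a stand-in for the Chinese word segmentation the method would actually need.

```python
import re
from html.parser import HTMLParser

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "of", "in", "and", "to"}

class HeadInfoParser(HTMLParser):
    """Collects the <title> text and the description/keywords <meta> values."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def tokenize(text):
    # Stand-in tokenizer: lowercase runs of word characters, minus stop words.
    return [w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS]

def build_keyword_dictionary(html):
    """Bag of words (as a set) built from the page title and meta-information."""
    parser = HeadInfoParser()
    parser.feed(html)
    words = tokenize(parser.title)
    for value in parser.meta.values():
        words.extend(tokenize(value))
    return set(words)

def hit_count(text, dictionary):
    # Similarity reduced to the number of entries shared by the two bags.
    return len(set(tokenize(text)) & dictionary)
```

Because only word presence is kept, `hit_count` is a set intersection size, which is what the later hit-count thresholds operate on.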
S32: traverse the literal nodes in the webpage source code and, using the keyword dictionary, detect and extract the headline, issuing time, informed source and body text in the order headline, issuing time, informed source, body text, or in the order headline, informed source, issuing time, body text.
A literal node in the present invention is a text node in the Document Object Model. The Document Object Model (DOM) is an application programming interface that can be used to dynamically access documents such as HTML and XML. The present invention mainly uses the HTML DOM, which represents a document as a tree structure and defines methods for accessing and manipulating the elements of the document.
To further explain the technical solution, the present invention is illustrated below with a detailed embodiment. Referring to Fig. 5, a more detailed flow chart of another news web page key element extraction method according to an embodiment of the present invention, the method comprises the following steps:
S501: preprocess the webpage source code to remove script (JS) code, so that the dynamically loaded content it contains does not interfere with locating the body text.
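As a rough illustration of this preprocessing step, the sketch below strips `<script>` elements with a regular expression. The name `strip_scripts` is hypothetical, and a regex is only an approximation chosen for brevity; the patent does not say how the script code is removed.

```python
import re

# Matches an opening <script ...> tag, its content, and the closing tag.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)

def strip_scripts(html):
    """Return the page source with every <script>...</script> element removed."""
    return SCRIPT_RE.sub("", html)
```

Removing the scripts before parsing also keeps markup that appears inside JavaScript strings from being mistaken for page content.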
S502: extract the web page title and the webpage meta-information from the webpage source code, and obtain a keyword dictionary about the web page content.
After the web page title and webpage meta-information have been extracted, the present invention preferably uses a bag-of-words model to extract keywords about the page content and form the keyword dictionary. The bag-of-words model is shown in Fig. 4.
However, not every word in the bag is needed. Some words appear very frequently in news web pages but contribute little to expressing the news content, for example "at present", "so" and "it is reported"; the present invention calls these stop words. Once the initial keyword dictionary has been formed, the stop words that could interfere with judging text content similarity can therefore be removed, which makes the computation more concise.
S503: establish the high-hit set and the doubtful set. During traversal, the high-hit set stores literal nodes with a high number of hits against the keyword dictionary (meaning high content similarity), and the doubtful set stores literal nodes with too few hits but sufficiently long text. The purpose is to uncover suspected body text and thereby determine the extent of the body text. After this, the structure of the web page is analysed and traversal of its literal nodes begins.
S504: filter the literal nodes during traversal, excluding the filtered-out literal nodes from the detection range. The present invention preferably uses two kinds of rules to filter literal nodes:
1. Filtering literal nodes according to the tag of their parent node:
  • filter out literal nodes that have no parent node;
  • filter out literal nodes whose parent node tag is not one of <div>, <paragraph>, <tablecolumn>, <heading>, <span>;
  • filter out literal nodes whose parent node tag is <div> with its style set to "hidden";
  • after the headline and the issuing time have been detected, filter out literal nodes whose parent node tag is <heading>;
  • filter out literal nodes whose parent node tag is <span> or <div> and whose text length is less than the average body paragraph length. The average body paragraph length is an empirical value obtained from statistics over a large number of <span> or <div> samples; such literal nodes are regarded as navigation information and are not examined.
2. Filtering literal nodes according to their text content:
  • filter out literal nodes that contain copyright statement information;
  • filter out literal nodes that contain the words "share" and/or "comment" and/or "microblog".
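The two groups of filter rules above can be sketched as a single predicate. Everything named here is an assumption made for illustration: the `LiteralNode` record, the mapping of the translated tag names to HTML tags (`<paragraph>` to `p`, `<tablecolumn>` to `td`, `<heading>` to `h1` through `h6`), the value of `AVG_PARAGRAPH_LEN`, and the English noise words.

```python
from dataclasses import dataclass
from typing import Optional

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Assumed HTML equivalents of <div>, <paragraph>, <tablecolumn>, <heading>, <span>.
ALLOWED_PARENTS = {"div", "p", "td", "span"} | HEADING_TAGS
NOISE_WORDS = ("share", "comment", "microblog")
AVG_PARAGRAPH_LEN = 20  # hypothetical empirical value

@dataclass
class LiteralNode:
    text: str
    parent_tag: Optional[str] = None  # None means the node has no parent
    parent_hidden: bool = False       # parent style is set to hidden

def is_filtered(node, headline_and_time_found=False):
    """Return True if the literal node should be excluded from detection."""
    # Rules based on the parent node tag.
    if node.parent_tag is None:
        return True
    if node.parent_tag not in ALLOWED_PARENTS:
        return True
    if node.parent_tag == "div" and node.parent_hidden:
        return True
    if headline_and_time_found and node.parent_tag in HEADING_TAGS:
        return True
    if node.parent_tag in ("span", "div") and len(node.text) < AVG_PARAGRAPH_LEN:
        return True
    # Rules based on the text content.
    lowered = node.text.lower()
    if "copyright" in lowered or "©" in node.text:
        return True
    return any(word in lowered for word in NOISE_WORDS)
```

Plain substring tests are used for brevity; a production filter would need word-boundary handling so that, for example, "shareholders" is not treated as a sharing widget.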
S505: headline detection.
When the text of a literal node is contained in the web page title and its length is not less than one third of the length of the web page title text, or when the similarity between the text of any literal node and the web page title is not less than a predetermined threshold, the text content of that literal node is extracted as the headline, and headline detection is no longer performed afterwards. Body text cannot appear before the headline, so once the headline has been detected, the literal nodes located before the headline node can be removed from the high-hit set and the doubtful set to speed up subsequent computation.
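A minimal sketch of this headline test follows. The patent does not specify the similarity measure or the threshold, so the character-set Jaccard overlap in `similarity` and the default `threshold=0.8` are assumptions, as are the function names.

```python
def similarity(a, b):
    # Assumed measure: Jaccard overlap of the two character sets.
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def is_headline(node_text, page_title, threshold=0.8):
    """Apply the containment rule (at least 1/3 of the title) or the similarity rule."""
    text = node_text.strip()
    if not text:
        return False
    contained = text in page_title and 3 * len(text) >= len(page_title)
    return contained or similarity(text, page_title) >= threshold
```

The one-third rule lets a headline match a page title that also carries a site name, while rejecting short fragments of the title such as a single word.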
S506: issuing time detection.
Once the headline has been found, the content of each literal node is matched against time formats; the content of the successfully matched literal node is extracted as the issuing time, and issuing time detection is no longer performed afterwards. A time format generally consists of two parts, digits and digit connectors. The year may have 4 digits, such as "2012", or 2 digits, such as "12". The month and day have at most 2 digits each; a single-digit value may be zero-padded, as in "month 02, day 03", or left unpadded, as in "month 2, day 3". The digit connectors are mainly the hyphen, the period, the space, date words (year, month, day) and the forward slash. Body text cannot appear before the issuing time, so once the issuing time has been detected, the literal nodes located before the issuing time node can be removed from the high-hit set and the doubtful set to speed up subsequent computation.
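The time format described above can be approximated with one regular expression. This is a sketch under assumptions: the pattern `TIME_RE`, the function `find_issuing_time`, the use of the Chinese characters 年, 月 and 日 as the "date words", and the rule that a 2-digit year means 20xx are all illustrative choices not fixed by the patent.

```python
import re

# Year (4 or 2 digits), month and day (1 or 2 digits), joined by a hyphen,
# period, slash, space or a date word.
TIME_RE = re.compile(
    r"(?<!\d)(\d{4}|\d{2})\s*[-./年 ]\s*(\d{1,2})\s*[-./月 ]\s*(\d{1,2})\s*日?(?!\d)"
)

def find_issuing_time(text):
    """Return (year, month, day) for the first date-like match, or None."""
    match = TIME_RE.search(text)
    if not match:
        return None
    year, month, day = (int(group) for group in match.groups())
    if not (1 <= month <= 12 and 1 <= day <= 31):
        return None
    if year < 100:
        year += 2000  # assumption: 2-digit years belong to the 2000s
    return (year, month, day)
```

The range check on month and day is a cheap guard against matching unrelated digit groups such as version numbers.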
S507: informed source detection.
The issuing time and the informed source of a news web page are very likely to be in the same piece of text, so after the issuing time has been found, that text is also matched against source formats. When the content of a literal node contains the word "source" or "author", the content of that literal node is extracted as the informed source, and informed source detection is no longer performed afterwards.
S508: detect the number of hits of each literal node against the keyword dictionary.
During traversal, each literal node that passes the filters is examined. If the content of a literal node has 2 or more hits against the keyword dictionary, the node is added to the high-hit set. If it has exactly 1 hit, the node is added to the doubtful set. If it has no hits but its text length is greater than 20 (20 being roughly the number of characters in half a line of a typical Chinese news page), it is considered likely to belong to the body text and is also added to the doubtful set.
The hit-count thresholds for the high-hit set and the doubtful set, and the text length required for the doubtful set, can all be set according to actual needs.
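The assignment rule of S508 is small enough to state as a function. The name `classify_node` and its parameter names are hypothetical; the default values 2 and 20 are the ones given in the text, exposed as parameters because the text says they are configurable.

```python
def classify_node(hits, text_len, high_threshold=2, min_len=20):
    """Return 'high', 'doubtful' or None for one literal node."""
    if hits >= high_threshold:
        return "high"      # enough keyword hits: high-hit set
    if hits >= 1:
        return "doubtful"  # some hits, but too few
    if text_len > min_len:
        return "doubtful"  # no hits, but long enough to be body text
    return None            # ignored
```

Lowering `high_threshold` to 1 reproduces the retry described in S509, where nodes with a single hit are promoted to the high-hit set.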
S509: compare the information content of the high-hit set and the doubtful set.
Measuring the information content of the high-hit set relies mainly on content, whereas measuring that of the doubtful set relies mainly on structural factors. The number of literal nodes in the high-hit set is denoted N1. The LDR (Length-Distance Ratio) value of each node in the doubtful set is computed, and the number of literal nodes whose LDR value exceeds a certain threshold is denoted N2. The number of literal nodes in the doubtful set that have keyword hits is denoted N3. The information content of the high-hit set and the doubtful set is compared according to the relative magnitudes of these three numbers.
If the information content of the high-hit set is greater than that of the doubtful set, proceed to step S510; otherwise lower the hit-count threshold for admission to the high-hit set (for example from 2 hits to 1 hit), traverse again and rebuild the high-hit set.
If the doubtful set still has the greater information content after this second traversal, the web page title (<title>) and webpage meta-information (<meta>) very likely carry insufficient information; in that case N2 literal nodes can be chosen directly from the doubtful set to form a new high-hit set.
If N2 is 0, the body text is very likely short or scattered, which makes the LDR values very small. In that case the body text can be extracted directly as follows:
  • if the doubtful set contains few literal nodes, the longest text can be taken directly as the body text;
  • the first node in the doubtful set is presumed to be the title; starting from the second node, an interval threshold is set, the consecutive text nodes that satisfy the interval threshold are found, and their combined content is taken as the body text.
Wherein, the LDR(Length-Distance Ratio that mentioned here) value is a kind of architectural feature of news web page, is used for measuring text context and connects compactedness, contributes to distinguish text and non-text.Text in webpage has certain text size, between adjacent text node, there is certain distance, length and ratio of distances constant can be weighed and between text, be connected compactedness, as shown in Figure 6, L is text size, D is the distance between text node, and front and back are averaged the tolerance that can regard text context compactedness as.
The calculation expression of LDR value is as follows:
$$\mathrm{LDR}(i) = \frac{1}{2}\left(\frac{L(i-1)}{D(i-1,\,i)} + \frac{L(i)}{D(i,\,i+1)}\right)$$
The LDR value is always less than 1; the closer it is to 1, the tighter the contextual connection and the more likely the text is genuine body text.
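The LDR computation can be sketched as below. The patent does not state how D is measured, so this sketch assumes that D(i, i+1) is the difference between the start offsets of two adjacent text nodes in the source code; the function name `ldr_values` and its argument layout are likewise hypothetical. The first and last nodes have no neighbour on one side and are left without a value here.

```python
def ldr_values(starts, lengths):
    """Map each interior node index i to LDR(i).

    starts[k]  : start offset of text node k in the page source (assumed)
    lengths[k] : text length L(k) of node k
    """
    n = len(starts)
    # ratios[k] = L(k) / D(k, k+1)
    ratios = [lengths[k] / (starts[k + 1] - starts[k]) for k in range(n - 1)]
    return {i: 0.5 * (ratios[i - 1] + ratios[i]) for i in range(1, n - 1)}
```

With offsets measured this way each distance includes the node's own text, so every ratio, and therefore every LDR value, stays below 1, in line with the statement above.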
S510: build the purified set.
The starting position of each literal node of the high-hit set in the web page code is extracted, and the high-hit set is clustered. Clustering is the process of dividing the high-hit set into different classes according to the similarity of certain features of the literal nodes, such that elements within a class are highly similar and different classes differ greatly. Since these literal nodes may lie before the body text, within the body text or after the body text, the initial number of classes is preferably set to 3. The clustering result is analysed and the longest set of consecutive nodes is taken as the purification of the high-hit set, called the purified set.
The preferred clustering method of the present invention is K-means. K-means first requires the number of classes k to be fixed and k initial class centres to be chosen. Each object is assigned to a class according to its distance from the k centres, after which the k class centres are updated. This is iterated until the k centres are essentially stable, yielding a clustering into k classes.
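A minimal one-dimensional K-means over node start positions might look as follows. Several choices are assumptions: the evenly spaced initial centres, the function names `kmeans_1d` and `purified_set`, and in particular the reading of "longest set of consecutive nodes" as the cluster with the most members (clusters of 1-D positions are contiguous in page order). The sketch expects a non-empty list of positions and k of at least 2.

```python
def kmeans_1d(values, k=3, iterations=100):
    """Cluster numbers into k groups; returns a list of k lists."""
    lo, hi = min(values), max(values)
    # Assumed initialisation: centres spread evenly over the value range.
    centers = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Empty clusters keep their previous centre.
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break  # centres are stable
        centers = new_centers
    return clusters

def purified_set(positions, k=3):
    """Return the largest cluster of start positions, in page order."""
    return sorted(max(kmeans_1d(positions, k), key=len))
```

For example, with one stray hit near the top of the page, five hits in the article and two in the footer, the five article positions form the largest cluster and become the purified set.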
S511: find the lowest common parent node of the purified set.
The ancestors of each literal node in the purified set (that is, the nodes above the literal node in the DOM tree) are enumerated, and a cumulative count is kept for each ancestor that recurs. The node that has the maximum count and lies furthest back in position is taken as the body start node; this node is the lowest common parent node of the elements of the purified set. If the headline node has been obtained and the extracted body start node lies before the headline node, the purified set is considered to be mixed with content from outside the body text. In that case the node that has the second-highest count and lies furthest back in position is taken as the corrected body start node, and its position is recorded for later use.
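The ancestor-counting idea can be sketched as follows. The sketch assumes each literal node is given as its list of ancestor identifiers from the root down to its parent, that identifiers are unique within the page, and that depth in the tree can stand in for "furthest back in position"; the function name `lowest_common_parent` and the identifiers in the usage example are hypothetical.

```python
from collections import Counter

def lowest_common_parent(ancestor_paths):
    """ancestor_paths: one list per literal node, root first, parent last."""
    counts = Counter()
    depth = {}
    for path in ancestor_paths:
        for level, node in enumerate(path):
            counts[node] += 1
            depth[node] = level
    best = max(counts.values())
    # Among the most frequently shared ancestors, pick the deepest one.
    candidates = [node for node, count in counts.items() if count == best]
    return max(candidates, key=lambda node: depth[node])
```

For three paragraphs that all sit under the same article container, the container is the deepest ancestor shared by every path and is returned as the root for the body traversal. The correction step for a start node that precedes the headline is not covered by this sketch.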
S512: traverse the document tree rooted at the lowest common parent node and obtain the body text.
To obtain the body text, each literal node in the document tree rooted at the lowest common parent node is handled as follows:
1) if the node is the same as the headline, time or source node already found, it is not extracted but is taken as the start of the true body text, that is, the body text extracted so far is cleared;
2) if the parent node tag of the node is a link type, detection continues upwards; if no list type is found, the possibility that the node is navigation can be ruled out, and the node content is extracted and added to the body text extracted so far;
3) if the parent node tag of the node is one of <div>, <paragraph>, <tablecolumn>, <heading>, <span>, the node content is extracted and added to the body text extracted so far.
Starting from a statistical analysis of Chinese news web pages and combining the strengths of machine learning methods and regular expression methods, the present invention proposes a complete automatic procedure for accurately extracting the four elements of headline, issuing time, informed source and body text. The invention does not depend on any specific template and is highly general.
The present invention analyses and extracts the web page in the order headline, issuing time, informed source, body text, or in the order headline, informed source, issuing time, body text. Because the region around the body text generally contains all four elements to be found, the headline, issuing time and informed source are generally obtained in the course of extracting the body text. For some special web pages, however, if the headline and issuing time have not been obtained once the body text has been extracted, the following additional procedures are carried out:
I. Additional extraction procedure for the headline and issuing time.
S61: if the headline was obtained while extracting the body text, no further detection is needed; otherwise this procedure is carried out. Because issuing time detection can only be performed once the headline exists, there are two possible operations:
  • if the headline has not been obtained but the web page title exists and the keyword dictionary contains many entries, the headline was missed because the similarity threshold was set too high; the threshold can be lowered and the search repeated, the search range being the nodes before the body start node.
  • if the headline has not been obtained and the keyword dictionary contains few entries, the web page title is considered possibly unrelated to the body content; the body content already obtained can then be segmented into words and stripped of stop words to obtain a new keyword dictionary, the literal nodes before the body start node are traversed, and the literal node with the most hits against the new keyword dictionary is taken as the headline.
S62: if the headline and issuing time have been found by step S61, no further detection is needed; otherwise there are the following possibilities:
  • if the headline has not been obtained, the range of permitted parent node tags for the headline can be widened; if the text of some node is contained in the web page title or is highly similar to the web page title, that node is taken as the headline, otherwise the first sentence of the body text can be designated as the headline. The issuing time is handled similarly: the range of parent tags is widened, and the first time format match within the body text is designated as the issuing time; failing that, the last time format match before the body text is designated as the issuing time.
  • if the headline has been obtained, the time is handled by the method above.
II. Additional extraction procedure for the informed source.
S71: if the headline and time have been obtained, further detection is carried out regardless of whether an informed source has already been extracted, to prevent the words "source" or "author" occurring in the body text from causing interference. The informed source found earlier is saved as a backup.
S72: the informed source is always located after the headline node and may be located after the time node, but its text is generally shorter than a body paragraph. The search can start from the headline node and stop when the current node is after the time node and its text length exceeds a certain threshold. Any text matching a source format found in this process is preferentially chosen as the informed source.
S73: if no informed source is found in step S72, the informed source is set to the source saved in S71. If the saved source is empty, the informed source is set to the first source format match after the title, with no restriction on the parent node tag type.
S74: if the source has still not been obtained at this step, a list of doubtful sources can be output and active learning mode entered.
  • Interactive learning: the user can designate the true informed source in the doubtful source list, and the program stores this designation in a back-end database. At set intervals, all user-designated informed sources are read from the database and verified; those that really are sources are formally added to the media word list and applied in the extraction algorithm.
  • Statistical analysis of doubtful sources: when no user takes part, the doubtful source list can be stored in the back-end database and repeated words counted. At set intervals the counts of doubtful source words in the database are tallied, and each word is given a probability value, based on its count, representing how likely it is to be a media word. In practice, as the system runs, news media names will appear as sources many times, whereas the non-source words in the doubtful source list will be widely scattered. The higher the count of a word, the more likely it is to denote a news source. The flow framework of active learning is shown in Fig. 7.
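The statistical analysis of doubtful sources can be sketched with a simple frequency count. The patent only says that a probability value is assigned according to the count; using the relative frequency, as `source_probabilities` does here, is one possible assumption, and the function name and sample words are hypothetical.

```python
from collections import Counter

def source_probabilities(doubtful_sources):
    """Map each doubtful source word to its relative frequency."""
    counts = Counter(doubtful_sources)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}
```

A media name that recurs across many pages accumulates a high share, while stray non-source phrases stay near the bottom, which matches the expectation stated above.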
The inventors also tested the accuracy of the method of the present invention:
Using Baidu RSS as the source of news web pages, the inventors collected 1721 non-duplicate news items in 11 categories from 429 websites as the test set. The test was run on a Toshiba M332 notebook computer with a 32-bit Windows 7 Ultimate operating system, an Intel(R) Core(TM)2 Duo CPU T6400 processor at 2.00 GHz, and 2.00 GB of memory. The test was carried out in the order headline, issuing time, informed source, body text. The results are shown in Table 1:
                Body     Headline    Issuing time    Message source
Accuracy (%)    96.11    98.43       98.2            97.39
Table 1
As can be seen, the present invention not only needs no manually written code templates but also analyzes web pages with high accuracy. In addition, over the 1,721 test pages, the average time for the algorithm to parse a single page was 65 ms, which amounts to about 15 pages per second and indicates high operating efficiency.
The above discloses only several specific embodiments of the present invention. The invention is not limited to them, and any variation that a person skilled in the art can conceive of, provided it does not go beyond the scope of the appended claims, shall fall within the protection scope of the present invention.

Claims (8)

1. A method for automatically extracting news web page elements, characterized in that it comprises the following steps:
(1) extracting the web page title and web page meta-information from the web page source code, and obtaining a keyword dictionary relating to the web page content;
(2) traversing the literal nodes in the web page source code and, in the order headline, issuing time, message source, body text or in the order headline, message source, issuing time, body text, using said keyword dictionary to detect and extract the headline, issuing time, message source and body text;
wherein detecting and extracting the body text further comprises:
(251) establishing a high-hit set, which stores the literal nodes with a high number of keyword-dictionary hits;
(252) establishing a doubtful set, which stores the literal nodes that have too few keyword-dictionary hits or whose text length is greater than a preset value;
(253) comparing the information content of the high-hit set and the doubtful set;
(254) if the information content of the high-hit set is less than that of the doubtful set, lowering the hit-count threshold for admission to the high-hit set, traversing the literal nodes in the web page source code again, and rebuilding the high-hit set;
(255) purifying the high-hit set by clustering, taking the longest set of consecutive nodes as the purified set;
(256) finding the lowest common parent node of the purified set;
(257) traversing the document tree rooted at the lowest common parent node, and obtaining the body text.
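As an illustration only, and not part of the claim, steps (255) and (256) can be sketched as below. The claim does not specify the clustering procedure, so a simple gap-based grouping of node positions is swapped in here; the `Node` class, the `max_gap` parameter, and all function names are hypothetical.

```python
class Node:
    """Minimal stand-in for a document-tree node (not a real parser type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent


def longest_contiguous_run(indices, max_gap=1):
    """Step (255), simplified: given the traversal positions of the nodes
    in the high-hit set, keep the longest run whose neighbours are at most
    max_gap positions apart."""
    best, current = [], []
    for i in sorted(indices):
        if current and i - current[-1] > max_gap:
            current = []  # gap too large: start a new run
        current.append(i)
        if len(current) > len(best):
            best = list(current)
    return best


def ancestors(node):
    """Return the chain from the node itself up to the root."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def lowest_common_ancestor(nodes):
    """Step (256): the deepest node whose subtree contains every given node."""
    common = ancestors(nodes[0])
    for node in nodes[1:]:
        seen = {id(a) for a in ancestors(node)}
        common = [a for a in common if id(a) in seen]
    return common[0] if common else None
```

Rooting the final traversal of step (257) at this node keeps the search inside the block that holds the article and excludes navigation bars and footers elsewhere in the tree.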
2. The method for automatically extracting news web page elements as claimed in claim 1, characterized in that, before step (1), it further comprises: (10) preprocessing the web page source code to remove script code.
3. The method for automatically extracting news web page elements as claimed in claim 1, characterized in that step (1) further comprises: (11) performing word segmentation on the extracted web page title and web page meta-information and removing stop words.
4. The method for automatically extracting news web page elements as claimed in claim 1, characterized in that step (2) further comprises: (21) filtering the literal nodes, the filtered-out literal nodes being excluded from the detection range.
5. The method for automatically extracting news web page elements as claimed in claim 4, characterized in that, in step (21), the literal nodes are filtered according to the tag of their parent node, comprising:
(211) filtering out literal nodes that have no parent node;
(212) filtering out literal nodes whose parent node tag is not one of <div>, <paragraph>, <tablecolumn>, <heading>, <span>;
(213) filtering out literal nodes whose parent node tag is <div> with its style set to "hidden";
(214) after the headline and issuing time have been detected, filtering out literal nodes whose parent node tag is <heading>;
(215) filtering out literal nodes whose parent node tag is <span> or <div> and whose text length is less than the average length of body paragraphs.
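As an illustration only, and not part of the claim, rules (211) to (215) can be written as a single predicate. It is assumed here that the tag names <paragraph>, <tablecolumn> and <heading> correspond to the HTML tags `p`, `td` and `h1` to `h6`; that mapping, the function name `keep_text_node`, and its parameters are not stated in the text.

```python
import re

# Assumed HTML equivalents of <div>, <paragraph>, <tablecolumn>, <span>.
ALLOWED_PARENTS = {"div", "p", "td", "span"}
# Assumed HTML equivalent of <heading>: h1 through h6.
HEADING = re.compile(r"h[1-6]$")


def keep_text_node(text, parent_tag, parent_hidden=False,
                   avg_paragraph_len=0, headline_and_time_found=False):
    """Return True if a literal (text) node survives the parent-tag filter."""
    if parent_tag is None:                               # rule (211)
        return False
    tag = parent_tag.lower()
    is_heading = HEADING.match(tag) is not None
    if tag not in ALLOWED_PARENTS and not is_heading:    # rule (212)
        return False
    if tag == "div" and parent_hidden:                   # rule (213)
        return False
    if is_heading and headline_and_time_found:           # rule (214)
        return False
    if tag in {"span", "div"} and len(text.strip()) < avg_paragraph_len:
        return False                                     # rule (215)
    return True
```

Rule (214) depends on extraction state, which is why the predicate takes a flag rather than looking at the node alone.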
6. The method for automatically extracting news web page elements as claimed in claim 4, characterized in that, in step (21), the literal nodes are filtered according to their text content, comprising:
(216) filtering out literal nodes that contain copyright notice information;
(217) filtering out literal nodes whose text contains "share" and/or "comment" and/or "microblog".
7. The method for automatically extracting news web page elements as claimed in claim 1, characterized in that, in step (2), detecting and extracting the headline, issuing time and message source comprises:
(22) when the text of a literal node is contained in the web page title and its length is not less than one third of the text length of the web page title, or when the similarity between the text of any literal node and the web page title is not less than a preset threshold, extracting the text content of that literal node as the headline, after which headline detection is no longer performed;
(23) matching the content of the literal nodes against time formats, and extracting the content of the successfully matched literal node as the issuing time, after which issuing-time detection is no longer performed;
(24) when the content of a literal node contains the words "source" or "author", extracting the content of that literal node as the message source, after which message-source detection is no longer performed.
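As an illustration only, and not part of the claim, steps (22) to (24) can be sketched as one pass over the text of the literal nodes. Several details are assumptions: `difflib.SequenceMatcher` stands in for the unspecified similarity measure, the threshold of 0.8 and the time pattern are arbitrary, the marker words include both English and Chinese forms, and the matched substring (rather than the whole node) is kept as the issuing time.

```python
import re
from difflib import SequenceMatcher

# Assumed time formats, e.g. 2012-07-05 10:30 or 2012年7月5日.
TIME_PATTERN = re.compile(
    r"\d{4}[-/年]\d{1,2}[-/月]\d{1,2}日?(\s+\d{1,2}:\d{2}(:\d{2})?)?"
)
# Assumed marker words for step (24).
SOURCE_MARKERS = ("来源", "作者", "source", "author")


def is_headline(text, page_title, threshold=0.8):
    """Step (22): substring of the title covering at least a third of it,
    or overall similarity at or above the threshold."""
    if not text or not page_title:
        return False
    if text in page_title and len(text) * 3 >= len(page_title):
        return True
    return SequenceMatcher(None, text, page_title).ratio() >= threshold


def extract_elements(text_nodes, page_title):
    """Scan node texts in document order; each element is detected once."""
    found = {}
    for raw in text_nodes:
        text = raw.strip()
        if "headline" not in found:
            # Nothing else is looked for until the headline is located.
            if is_headline(text, page_title):
                found["headline"] = text
            continue
        if "time" not in found:
            match = TIME_PATTERN.search(text)
            if match:
                found["time"] = match.group(0)
        if "source" not in found and any(
            marker in text.lower() for marker in SOURCE_MARKERS
        ):
            found["source"] = text
    return found
```

Because the time and source checks are independent once the headline is found, the sketch accepts either of the two orders named in claim 1.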
8. The method for automatically extracting news web page elements as claimed in claim 1, characterized in that step (257) comprises:
(2571) if a literal node is the same as the node of the headline, issuing time or message source, taking this literal node as the start of the body text;
(2572) if the parent node tag of a literal node is of link type and none of its ancestor nodes is of list type, extracting the content of this literal node and adding it to the body text;
(2573) if the parent node tag of a literal node is one of <div>, <paragraph>, <tablecolumn>, <heading>, <span>, extracting the content of this literal node and adding it to the body text.
CN201210232831.2A 2012-07-05 2012-07-05 Automatic news webpage element extracting method Active CN102750390B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210232831.2A CN102750390B (en) 2012-07-05 2012-07-05 Automatic news webpage element extracting method

Publications (2)

Publication Number Publication Date
CN102750390A CN102750390A (en) 2012-10-24
CN102750390B true CN102750390B (en) 2014-07-23

Family

ID=47030575

Country Status (1)

Country Link
CN (1) CN102750390B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103488675A (en) * 2013-07-11 2014-01-01 哈尔滨工程大学 Automatic precise extraction device for multi-webpage news comment contents
CN106033428B (en) * 2015-03-11 2019-08-30 北大方正集团有限公司 The selection method of uniform resource locator and the selection device of uniform resource locator
CN106021392A (en) * 2016-05-12 2016-10-12 中国互联网络信息中心 News key information extraction method and system
CN107766384A (en) * 2016-08-22 2018-03-06 北京国双科技有限公司 A kind of method and apparatus for determining page issuing time
CN108090104B (en) * 2016-11-23 2023-05-02 百度在线网络技术(北京)有限公司 Method and device for acquiring webpage information
CN108241680B (en) * 2016-12-26 2020-10-13 北京国双科技有限公司 Method and device for acquiring reading amount of webpage
CN106874346B (en) * 2016-12-26 2020-10-30 微梦创科网络科技(中国)有限公司 Method and device for extracting page text in webpage
CN108320255B (en) * 2017-01-16 2022-06-21 软通动力信息技术(集团)股份有限公司 Information processing method and device
CN108153851B (en) * 2017-12-21 2021-06-18 北京工业大学 General forum subject post page information extraction method based on rules and semantics
CN108009137B (en) * 2017-12-22 2021-01-29 鼎富智能科技有限公司 Standard document processing method, device and system based on configuration file
CN108399257B (en) * 2018-03-08 2021-05-18 江苏省广播电视总台 Personalized news clue recommendation method based on intelligent manuscript analysis
CN108874870A (en) * 2018-04-24 2018-11-23 北京中科闻歌科技股份有限公司 A kind of data pick-up method, equipment and computer can storage mediums
CN110968807A (en) * 2018-09-27 2020-04-07 北京国双科技有限公司 Webpage text extraction method and device
CN109710833B (en) * 2018-12-29 2021-07-16 上海蜜度信息技术有限公司 Method and apparatus for determining content node
CN109857956B (en) * 2019-01-25 2019-12-31 四川大学 News webpage key information automatic extraction method based on label and block characteristics
CN110427541B (en) * 2019-08-05 2022-09-16 安徽大学 Webpage content extraction method, system, electronic equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5694594A (en) * 1994-11-14 1997-12-02 Chang; Daniel System for linking hypermedia data objects in accordance with associations of source and destination data objects and similarity threshold without using keywords or link-difining terms
CN101470728A (en) * 2007-12-25 2009-07-01 北京大学 Method and device for automatically abstracting text of Chinese news web page

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Dou Shen et al. Web-page classification through summarization. SIGIR '04: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2004-07-29, pp. 242-249. *

Also Published As

Publication number Publication date
CN102750390A (en) 2012-10-24

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: NINGBO ZHONGQING HUAYUN NEW MEDIA TECHNOLOGY CO.,

Free format text: FORMER OWNER: WENG SHIFENG

Effective date: 20141210

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 315192 NINGBO, ZHEJIANG PROVINCE TO: 315100 NINGBO, ZHEJIANG PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20141210

Address after: 315100, 8 floor, Di Yi Building, 666 Taikang Road, Ningbo, Zhejiang, Yinzhou District

Patentee after: NINGBO ZHONGQING CYYUN NEW MEDIA TECHNOLOGY CO., LTD.

Address before: 315192 room 298, science and technology center, 514 bachelor Road, Yinzhou District, Zhejiang, Ningbo

Patentee before: Weng Shifeng