CN108664522A - Web page processing method and device

Info

Publication number: CN108664522A
Application number: CN201710212470.8A
Authority: CN
Inventors: 韦仕伟; 石玉明
Original assignee: Excellent Letter Interconnected (Beijing) Information Technology Co Ltd
Current assignee: Excellent Letter Interconnected (Beijing) Information Technology Co Ltd
Priority date: 2017-04-01
Filing date: 2017-04-01
Publication date: 2018-10-16

Abstract

Embodiments of the present invention provide a web page processing method and device. The embodiments split the article extraction task into two steps. The first step identifies whether a target web page is a content page. If it is, the second step extracts the article from the target web page using a preset extraction rule, which is used to extract articles from content pages. If the target web page is a non-content page, the extraction flow is not executed. The preset extraction rule therefore only has to account for content pages and not for non-content pages, which reduces the difficulty of training the rule in advance. It also reduces the complexity of later optimizing the preset extraction rule and improves the efficiency of that optimization.

Description

Web page processing method and device

Technical field

Embodiments of the present invention relate to the Internet field, and in particular to a web page processing method and device.

Background art

With the rapid development of the Internet, the amount of information on the network has grown explosively, and web pages, as openly accessible resources, have become one of the most important sources of information on the Internet.

Current web pages include content pages and non-content pages. A content page contains an article and has a single theme. A non-content page has no single theme and does not contain an article.

For technical and business reasons, the content of a typical web page is very complex. Besides the article, a content page is mixed with a large amount of irrelevant information, such as navigation bars, advertisement links, copyright notices, and links to other recommended articles, so the article that the page is meant to display is buried in unrelated content.

However, the articles in content pages are often needed to build a corpus, which is then used for tasks such as text mining, web information retrieval, and natural language processing. Articles therefore need to be extracted from content pages.

In the prior art, articles are extracted directly from web pages using an extraction rule. The page structure of a content page differs considerably from that of a non-content page, whereas different content pages share essentially the same structure, as do different non-content pages. The prior-art extraction rule therefore has to account for both content pages and non-content pages at once, which increases the difficulty of training the extraction rule in advance.

Summary of the invention

To overcome the problems in the related art, embodiments of the present invention provide a web page processing method and device.

According to a first aspect of the embodiments of the present invention, a web page processing method is provided, the method comprising:

identifying whether a target web page is a content page;

if the target web page is a content page, extracting the article from the target web page using a preset extraction rule, the preset extraction rule being used to extract articles from content pages.
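
The two steps above can be sketched as a small gate placed in front of the extractor. This is only a minimal Python sketch: `process_page`, `is_content_page`, and `extract_article` are hypothetical names, and the classifier and the preset extraction rule are passed in as callables because the text does not fix how either is implemented.

```python
from typing import Callable, Optional


def process_page(
    html: str,
    is_content_page: Callable[[str], bool],
    extract_article: Callable[[str], str],
) -> Optional[str]:
    """Run extraction only when the page is identified as a content page."""
    # Step 1: identify whether the target web page is a content page.
    if not is_content_page(html):
        # Non-content page: the extraction flow is skipped entirely.
        return None
    # Step 2: apply the preset extraction rule, which only has to
    # handle content pages.
    return extract_article(html)
```

Because the gate returns early for non-content pages, the extraction rule is never exercised on them, which is the property the method relies on.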

Identifying whether the target web page is a content page includes:

obtaining the web page data of the target web page;

obtaining a structure feature of the target web page according to the web page data;

obtaining a semantic feature of the target web page according to the web page data;

obtaining a text feature of the target web page according to the web page data;

obtaining a reward-penalty scoring feature of the target web page according to the web page data;

inputting the structure feature, the semantic feature, the text feature, and the reward-penalty scoring feature into a preset web page classifier, and determining from the output of the web page classifier whether the target web page is a content page or a non-content page.

Further, after the web page data of the target web page is obtained, the method also includes:

removing from the web page data the data in at least the following tags: script tags, style tags, comments tags, and tags that have a hidden attribute.
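
The removal step can be illustrated with Python's standard `html.parser`. This is a sketch under assumptions rather than the patented procedure: it treats "comments tags" as HTML comments, treats an inline `display:none` style as equivalent to a hidden attribute, and returns only the remaining visible text. `VisibleTextExtractor` and `visible_text` are hypothetical names.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not change the skip depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    """Collect text while skipping script, style, comments and hidden elements."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0  # > 0 while inside an element that is being dropped
        self.parts = []

    @staticmethod
    def _is_dropped(tag, attrs) -> bool:
        attr_map = dict(attrs)
        style = (attr_map.get("style") or "").replace(" ", "").lower()
        return (tag in ("script", "style")
                or "hidden" in attr_map
                or "display:none" in style)  # assumption: also treated as hidden

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or self._is_dropped(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    # HTML comments reach handle_comment, which is a no-op by default,
    # so comment contents are dropped without extra code.


def visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

The depth counter assumes well-formed markup inside dropped elements; pages with unclosed tags would need a more tolerant parser.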

Obtaining the structure feature of the target web page according to the web page data includes:

obtaining the number of heading tags, the number of paragraph tags, the number of DIV tags, and the number of a tags in the web page data;

calculating the link-text ratio of the web page data;

obtaining the number of pictures in the web page data whose pixel count exceeds a preset threshold;

judging whether a paging keyword exists in the web page data, the paging keywords including at least: home page, previous page, next page, last page, and full text;

taking the number of heading tags, the number of paragraph tags, the number of DIV tags, the number of a tags, the link-text ratio, the number of pictures, and the information indicating whether a paging keyword exists in the web page data as the structure feature of the target web page.

Obtaining the semantic feature of the target web page according to the web page data includes:

obtaining from the web page data the number of tags whose class attribute value is content or article, and taking that number as the semantic feature of the target web page.

Obtaining the text feature of the target web page according to the web page data includes:

judging whether the data in the heading tag of the web page data is present in the head tag of the web page data;

counting the number of punctuation marks in the data of the web page data other than the a tags;

calculating the ratio between the number of punctuation marks and the total number of characters in the data of the web page data other than the a tags;

counting the number of preset keywords in the web page data, the preset keywords including at least: comment, source, details, view more, and view full text;

obtaining the maximum of the numbers of text characters in the individual paragraph tags of the web page data;

calculating the average number of text characters per paragraph tag in the web page data;

taking, as the text feature of the target web page, the information indicating whether the data in the heading tag is present in the head tag of the web page data, the number of punctuation marks, the ratio, the number of preset keywords, the maximum number, and the average number.

Obtaining the reward-penalty scoring feature of the target web page according to the web page data includes:

for each tag in the web page data, calculating a cumulative reward-penalty score for the tag according to the type of the tag, the class attribute value of the tag, the number of text characters in the tag, and the link-text ratio of the tag;

calculating, from the cumulative reward-penalty score of each tag, the cumulative reward-penalty points of the parent node and of the ancestor nodes of each tag in the web page data;

taking a preset number of the largest cumulative reward-penalty points as the reward-penalty scoring feature.

According to a second aspect of the embodiments of the present invention, a web page processing device is provided, the device comprising:

an identification module, configured to identify whether a target web page is a content page;

an extraction module, configured to, if the target web page is a content page, extract the article from the target web page using a preset extraction rule, the preset extraction rule being used to extract articles from content pages.

The identification module includes:

a first acquisition unit, configured to obtain the web page data of the target web page;

a second acquisition unit, configured to obtain the structure feature of the target web page according to the web page data;

a third acquisition unit, configured to obtain the semantic feature of the target web page according to the web page data;

a fourth acquisition unit, configured to obtain the text feature of the target web page according to the web page data;

a fifth acquisition unit, configured to obtain the reward-penalty scoring feature of the target web page according to the web page data;

a determination unit, configured to input the structure feature, the semantic feature, the text feature, and the reward-penalty scoring feature into a preset web page classifier, and to determine from the output of the web page classifier whether the target web page is a content page or a non-content page.

Further, the identification module also includes:

a removal unit, configured to remove from the web page data the data in at least the following tags: script tags, style tags, comments tags, and tags that have a hidden attribute.

The first acquisition module includes:

a first obtaining subunit, configured to obtain the number of heading tags, the number of paragraph tags, the number of DIV tags, and the number of a tags in the web page data;

a first calculation subunit, configured to calculate the link-text ratio of the web page data;

a second obtaining subunit, configured to obtain the number of pictures in the web page data whose pixel count exceeds a preset threshold;

a first judgment subunit, configured to judge whether a paging keyword exists in the web page data, the paging keywords including at least: home page, previous page, next page, last page, and full text;

a first determination subunit, configured to take the number of heading tags, the number of paragraph tags, the number of DIV tags, the number of a tags, the link-text ratio, the number of pictures, and the information indicating whether a paging keyword exists in the web page data as the structure feature of the target web page.

The second acquisition unit is specifically configured to obtain from the web page data the number of tags whose class attribute value is content or article, and to take that number as the semantic feature of the target web page.

The third acquisition unit includes:

a second judgment subunit, configured to judge whether the data in the heading tag of the web page data is present in the head tag of the web page data;

a first counting subunit, configured to count the number of punctuation marks in the data of the web page data other than the a tags;

a second calculation subunit, configured to calculate the ratio between the number of punctuation marks and the total number of characters in the data of the web page data other than the a tags;

a second counting subunit, configured to count the number of preset keywords in the web page data, the preset keywords including at least: comment, source, details, view more, and view full text;

a third obtaining subunit, configured to obtain the maximum of the numbers of text characters in the individual paragraph tags of the web page data;

a third calculation subunit, configured to calculate the average number of text characters per paragraph tag in the web page data;

a second determination subunit, configured to take, as the text feature of the target web page, the information indicating whether the data in the heading tag is present in the head tag of the web page data, the number of punctuation marks, the ratio, the number of preset keywords, the maximum number, and the average number.

The fourth acquisition unit includes:

a fourth calculation subunit, configured to calculate, for each tag in the web page data, a cumulative reward-penalty score for the tag according to the type of the tag, the class attribute value of the tag, the number of text characters in the tag, and the link-text ratio of the tag;

a fifth calculation subunit, configured to calculate, from the cumulative reward-penalty score of each tag, the cumulative reward-penalty points of the parent node and of the ancestor nodes of each tag in the web page data;

a fourth obtaining subunit, configured to take a preset number of the largest cumulative reward-penalty points as the reward-penalty scoring feature.

The technical solutions provided by the embodiments of the present invention can have the following beneficial effects:

A content page contains an article, and a non-content page does not. In the prior art, articles are extracted directly from web pages using an extraction rule. The page structure of a content page differs considerably from that of a non-content page, whereas different content pages share essentially the same structure, as do different non-content pages. The prior-art extraction rule therefore has to account for both content pages and non-content pages at once, which increases the difficulty of training the extraction rule in advance.

For example, a technician must collect in advance sample feature data from multiple non-content pages and from multiple content pages, and then train the extraction rule on both sets. Because the page structures of content pages and non-content pages differ considerably, the two sets of sample feature data are entirely different kinds of feature data, so a rule trained on both tends to be more complex.

Second, later optimization of the extraction rule is more complex and less efficient. For example, if the prior-art extraction rule extracts articles from web pages with low accuracy, the rule must be optimized to improve the accuracy. Because the rule has to account for both non-content pages and content pages, the optimization must account for both as well, which makes it more complex and less efficient.

In the embodiments of the present invention, by contrast, the article extraction task is split into two steps. The first step identifies whether the target web page is a content page. If it is, the second step extracts the article from the target web page using the preset extraction rule, which is used to extract articles from content pages. If the target web page is a non-content page, the extraction flow need not be executed. The preset extraction rule therefore only has to account for content pages and not for non-content pages, which reduces the difficulty of training it in advance and makes the rule more focused and effective. For example, a technician only needs to collect sample feature data from multiple content pages, not from non-content pages, and trains the preset extraction rule on that data alone. Training on a single kind of feature data yields a simpler preset extraction rule.

Second, later optimization of the preset extraction rule is less complex and more efficient. For example, if the preset extraction rule of the embodiments extracts articles from content pages with low accuracy, the rule must be optimized to improve the accuracy. Because the preset extraction rule only has to account for content pages, the optimization only has to account for content pages as well. Compared with the prior art, where optimization must account for both non-content pages and content pages, the embodiments of the present invention reduce the complexity of optimizing the preset extraction rule and improve the efficiency of that optimization.

It should be understood that the above general description and the following detailed description are exemplary and explanatory only and do not limit the embodiments of the present invention.

Description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present invention and, together with the specification, serve to explain the principles of the embodiments of the present invention.

Fig. 1 is a flowchart of a web page processing method according to an exemplary embodiment;

Fig. 2 is a block diagram of a web page processing device according to an exemplary embodiment.

Detailed description

Exemplary embodiments are described in detail here, and examples of them are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the embodiments of the present invention. Rather, they are merely examples of devices and methods consistent with some aspects of the embodiments of the present invention as detailed in the appended claims.

Fig. 1 is a flowchart of a web page processing method according to an exemplary embodiment. As shown in Fig. 1, the method includes the following steps.

In step S101, it is identified whether the target web page is a content page.

This step can be implemented by the following flow:

11) Obtain the web page data of the target web page;

12) Obtain the structure feature of the target web page according to the web page data of the target web page;

Specifically, first count the number of heading tags, the number of paragraph tags, the number of DIV tags, and the number of a tags in the web page data. Then calculate the link-text ratio of the web page data. Next, obtain the number of pictures in the web page data whose pixel count exceeds a preset threshold. Then judge whether a paging keyword exists in the web page data, where the paging keywords include at least: home page, previous page, next page, last page, and full text. Finally, take the number of heading tags, the number of paragraph tags, the number of DIV tags, the number of a tags, the link-text ratio of the web page data, the number of pictures, and the information indicating whether a paging keyword exists in the web page data as the structure feature of the target web page.
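
The tag-counting part of the structure feature can be sketched with the standard library alone. The sketch assumes that "heading tag" means the HTML elements h1 to h6 and "paragraph tag" means p; `TagCounter` and `count_structure_tags` are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: "heading tag" is read as h1 to h6.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class TagCounter(HTMLParser):
    """Count heading, paragraph, div and a start tags."""

    def __init__(self) -> None:
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <DIV> is counted as div.
        if tag in HEADING_TAGS:
            self.counts["heading"] += 1
        elif tag in ("p", "div", "a"):
            self.counts[tag] += 1


def count_structure_tags(html: str) -> dict:
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return {name: parser.counts[name] for name in ("heading", "p", "div", "a")}
```

The four counts would then be combined with the link-text ratio, the large-picture count, and the paging-keyword flag to form the full structure feature.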

To calculate the link-text ratio of the web page data, first count the total number of characters in the web page data, then count the total number of characters in all a tags in the web page data. Next, calculate the difference between these two totals, and finally calculate the ratio of the total number of characters in all a tags to that difference. The result is the link-text ratio of the web page data.

In the embodiments of the present invention, a content page normally contains an article, and the article contains many characters, for example many Chinese characters and/or letters, so the web page data of a content page contains many characters. In addition, a content page contains relatively few clickable buttons that carry links, so its web page data contains few links, that is, few a tags, and the total number of characters in all a tags is correspondingly low. The link-text ratio of the web page data can therefore serve as one factor in deciding whether the target web page is a content page.
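
Written as a formula, the computation above is link-text ratio = L / (T - L), where T is the total number of characters and L is the number of characters inside a tags. The following Python sketch assumes that "characters" means non-whitespace text characters and that scripts and styles have already been removed; `LinkTextCounter` and `link_text_ratio` are hypothetical names.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Count text characters overall and text characters inside a tags."""

    def __init__(self) -> None:
        super().__init__()
        self.a_depth = 0
        self.total_chars = 0
        self.link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.a_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.a_depth:
            self.a_depth -= 1

    def handle_data(self, data):
        n = len("".join(data.split()))  # assumption: whitespace is not counted
        self.total_chars += n
        if self.a_depth:
            self.link_chars += n


def link_text_ratio(html: str) -> float:
    counter = LinkTextCounter()
    counter.feed(html)
    counter.close()
    non_link = counter.total_chars - counter.link_chars  # the "difference"
    if non_link == 0:
        # Edge case not covered by the text: a page with only link text.
        return float("inf") if counter.link_chars else 0.0
    return counter.link_chars / non_link
```

A low ratio points toward a content page, and a high ratio points toward a link-heavy non-content page such as a navigation or list page.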

To obtain the number of pictures in the web page data whose pixel count exceeds the preset threshold, the number of pixels in each picture is counted as follows. A picture is made up of pixels and is generally rectangular, so every row of the picture contains the same number of pixels and every column contains the same number of pixels. The number of pixels in any one row and the number of pixels in any one column can therefore be counted and multiplied to give the number of pixels in the picture. The pixel count is then compared with the preset threshold, and if it exceeds the threshold, the picture count recorded in a cache is incremented. The same operation is performed for every other picture in the web page data. Finally, the picture count recorded in the cache is taken as the number of pictures in the web page data whose pixel count exceeds the preset threshold.

In the embodiments of the present invention, besides text and clickable buttons that carry links, a web page often also displays pictures. Normally the pictures shown in a content page are large and those shown in a non-content page are small; that is, pictures in a content page contain more pixels and pictures in a non-content page contain fewer. The number of pictures in the web page data whose pixel count exceeds the preset threshold can therefore serve as one factor in deciding whether the target web page is a content page.
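
The row-times-column pixel count can be approximated without downloading any picture if the page declares picture sizes. The sketch below rests on a strong assumption that is not in the text: it reads the `width` and `height` attributes of img tags instead of inspecting the picture files, and it ignores pictures that do not declare both. `LargeImageCounter` and `count_large_images` are hypothetical names.

```python
from html.parser import HTMLParser


class LargeImageCounter(HTMLParser):
    """Count img tags whose declared width * height exceeds a threshold."""

    def __init__(self, threshold: int) -> None:
        super().__init__()
        self.threshold = threshold
        self.count = 0  # plays the role of the picture count kept in the cache

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        try:
            width = int(attr_map.get("width", ""))    # pixels per row
            height = int(attr_map.get("height", ""))  # pixels per column
        except (TypeError, ValueError):
            return  # size not declared as plain integers: skipped in this sketch
        if width * height > self.threshold:
            self.count += 1


def count_large_images(html: str, threshold: int) -> int:
    counter = LargeImageCounter(threshold)
    counter.feed(html)
    counter.close()
    return counter.count
```

A production version would more likely decode each picture and read its real dimensions, since declared sizes are often missing or scaled.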

It is then judged whether a paging keyword exists in the web page data, where the paging keywords include at least: home page, previous page, next page, last page, and full text.

In the embodiments of the present invention, the article in a content page often contains many characters, and a single web page cannot display all of them, so content pages often paginate the article. When a content page is displayed, it typically shows only part of the article together with clickable buttons for obtaining the rest of the article. Each button displays at least one paging keyword, such as home page, previous page, next page, last page, or full text, to prompt the user that clicking the button retrieves the other part of the article. In the web page data the button takes the form of an a tag, and the data in the a tag includes at least one of these paging keywords.

A non-content page generally does not contain an article, so there is usually no need to paginate, and the web page data of a non-content page generally does not contain paging keywords.

Therefore, the a tags in the web page data can be checked for paging keywords, indication information can be generated to indicate whether a paging keyword exists in the web page data, and that indication information can serve as one factor in deciding whether the target web page is a content page.
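
The check on a tags can be sketched as follows. The keyword strings are English stand-ins chosen for illustration; a real deployment would use keywords in the language of the pages being processed. `AnchorTextCollector`, `PAGING_KEYWORDS`, and `has_paging_keyword` are hypothetical names.

```python
from html.parser import HTMLParser

# Illustrative English renderings of the paging keywords named in the text.
PAGING_KEYWORDS = ("home page", "previous page", "next page", "last page", "full text")


class AnchorTextCollector(HTMLParser):
    """Collect the text found inside each a tag."""

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "a" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.texts[-1] += data


def has_paging_keyword(html: str, keywords=PAGING_KEYWORDS) -> bool:
    collector = AnchorTextCollector()
    collector.feed(html)
    collector.close()
    # The boolean result is the "indication information" used as a feature.
    return any(k in text.lower() for text in collector.texts for k in keywords)
```

Restricting the search to a tags follows the text's observation that paging buttons appear as a tags, so the same words in ordinary body text are not counted.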

13) Obtain the semantic feature of the target web page according to the web page data of the target web page;

In this step, the number of tags whose class attribute value is content or article can be obtained from the web page data and taken as the semantic feature of the target web page.

To obtain this number, the class attribute value of each tag in the web page data is read, and it is judged whether the value is content or article. When it is, the tag count recorded in a cache is incremented. The same operation is performed for every other tag in the web page data. Finally, the tag count recorded in the cache is taken as the number of tags in the web page data whose class attribute value is content or article.
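
A minimal sketch of this counting loop is shown below. It assumes that a tag matches when any of its space-separated class names equals content or article, which is slightly more permissive than an exact match on the whole attribute value. `SemanticClassCounter` and `count_semantic_tags` are hypothetical names.

```python
from html.parser import HTMLParser

SEMANTIC_CLASSES = {"content", "article"}


class SemanticClassCounter(HTMLParser):
    """Count tags that carry a class name of content or article."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0  # the tag count kept in the cache

    def handle_starttag(self, tag, attrs):
        class_value = dict(attrs).get("class") or ""
        # Assumption: match on any individual class name, case-insensitively.
        if SEMANTIC_CLASSES & set(class_value.lower().split()):
            self.count += 1


def count_semantic_tags(html: str) -> int:
    counter = SemanticClassCounter()
    counter.feed(html)
    counter.close()
    return counter.count
```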

In the embodiments of the present invention, the web page data of a web page contains multiple tags, and each tag has its own class attribute value.

A content page contains an article, and an article is made up of multiple paragraphs. The web page data of a content page therefore contains multiple paragraph tags, each holding the characters of one paragraph, and the class attribute value of such a paragraph tag is content or article.

A non-content page contains no article and hence no paragraphs, so its web page data contains no paragraph tags and therefore no tags whose class attribute value is content or article.

Therefore, the number of tags in the web page data whose class attribute value is content or article can serve as one factor in deciding whether the target web page is a content page.

14) Obtain the text feature of the target web page according to the web page data of the target web page;

Specifically, judge whether the data in the heading tag of the web page data is present in the head tag of the web page data. Count the number of punctuation marks in the data of the web page data other than the a tags. Calculate the ratio between the number of punctuation marks and the total number of characters in the data of the web page data other than the a tags. Count the number of preset keywords in the web page data, where the preset keywords include at least: comment, source, details, view more, and view full text. Obtain the maximum of the numbers of text characters in the individual paragraph tags of the web page data. Calculate the average number of text characters per paragraph tag in the web page data. Finally, take as the text feature of the target web page the information indicating whether the data in the heading tag is present in the head tag of the web page data, the number of punctuation marks, the ratio, the number of preset keywords, the maximum number, and the average number.
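
Two of the text features listed above, the maximum and the average number of text characters per paragraph tag, can be sketched as follows. The sketch assumes that "paragraph tag" means the p element and that leading and trailing whitespace is not counted; `ParagraphLengths` and `paragraph_stats` are hypothetical names.

```python
from html.parser import HTMLParser


class ParagraphLengths(HTMLParser):
    """Record the number of text characters inside each p tag."""

    def __init__(self) -> None:
        super().__init__()
        self.in_p = False
        self.lengths = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new <p> also ends any paragraph whose </p> was omitted.
            self.in_p = True
            self.lengths.append(0)

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.lengths[-1] += len(data.strip())


def paragraph_stats(html: str):
    """Return (maximum, average) text length over all p tags."""
    parser = ParagraphLengths()
    parser.feed(html)
    parser.close()
    if not parser.lengths:
        return 0, 0.0
    return max(parser.lengths), sum(parser.lengths) / len(parser.lengths)
```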

To judge whether the data in the heading tag of the web page data is present in the head tag of the web page data, first obtain the data in the heading tag, then obtain the data in the head tag, and search the data in the head tag for the data in the heading tag. When the data in the head tag contains the data in the heading tag, generate indication information indicating that the data in the heading tag of the web page data is present in the head tag of the web page data.
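
The lookup described above can be sketched as a substring test. The sketch makes two assumptions that go beyond the text: "heading tag" is read as h1 to h6, and the relevant head data is the text of the title tag. `HeadingTitleCollector` and `heading_in_title` are hypothetical names.

```python
from html.parser import HTMLParser

# Assumption: "heading tag" is read as h1 to h6.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingTitleCollector(HTMLParser):
    """Collect the title text and the text of every heading tag."""

    def __init__(self) -> None:
        super().__init__()
        self._current = None  # "title", "heading", or None
        self.title = ""
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = "title"
        elif tag in HEADING_TAGS:
            self._current = "heading"
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "title" or tag in HEADING_TAGS:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "heading":
            self.headings[-1] += data


def heading_in_title(html: str) -> bool:
    collector = HeadingTitleCollector()
    collector.feed(html)
    collector.close()
    # True when any non-empty heading text occurs inside the page title.
    return any(h.strip() and h.strip() in collector.title
               for h in collector.headings)
```

Text inside elements nested in a heading, such as a span, is not collected by this sketch once the nested element closes.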

In the embodiments of the present invention, when a web page is a content page, it contains an article, and an article generally has a title. The web page data of a content page therefore contains a heading tag, and the title of the article is stored in that heading tag.

Every web page also has a page title, which is stored in the head tag of the page, for example in the title tag inside the head tag. In the embodiments of the present invention, the page title of a content page usually includes the title of the article in that page.

A non-content page contains no article and hence no article title, so its web page data contains no heading tag. A non-content page does have a page title, but that page title does not contain an article title.

Therefore, the indication information indicating whether the data in the heading tag of the web page data is present in the head tag of the web page data can serve as one factor in deciding whether the target web page is a content page.

The web page data contains many tags, each holding its own data. To calculate the ratio between the number of punctuation marks and the total number of characters in the data other than the a tags, the total number of characters is counted in the data of the web page data other than the data in the a tags, the total number of punctuation marks and the total number of characters are summed, and the ratio of the total number of punctuation marks to that sum is calculated.
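
As a formula, the ratio described above is P / (P + C), where P is the number of punctuation marks and C is the number of characters, both counted outside a tags. The Python sketch below assumes that C counts only non-punctuation, non-whitespace characters so that nothing is counted twice, and that punctuation is any character in a Unicode "P" category. It also assumes scripts and styles were removed beforehand. `NonLinkText` and `punctuation_stats` are hypothetical names.

```python
import unicodedata
from html.parser import HTMLParser


class NonLinkText(HTMLParser):
    """Collect text that lies outside a tags."""

    def __init__(self) -> None:
        super().__init__()
        self._a_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._a_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._a_depth:
            self._a_depth -= 1

    def handle_data(self, data):
        if not self._a_depth:
            self.parts.append(data)


def _is_punctuation(ch: str) -> bool:
    # Unicode general categories starting with "P" cover ASCII and CJK punctuation.
    return unicodedata.category(ch).startswith("P")


def punctuation_stats(html: str):
    """Return (punctuation count, punctuation ratio) for text outside a tags."""
    collector = NonLinkText()
    collector.feed(html)
    collector.close()
    text = "".join(collector.parts)
    punct = sum(1 for ch in text if _is_punctuation(ch))
    other = sum(1 for ch in text if not ch.isspace() and not _is_punctuation(ch))
    total = punct + other  # the "sum" in the description above
    return punct, (punct / total if total else 0.0)
```

For the input `<p>Hi, all.</p><a href="#">x, y</a>` the function sees only "Hi, all.", which has 2 punctuation marks and 5 other characters, so the ratio is 2/7.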

In general, because a content web page contains an article, a user who has finished reading the article sometimes wants to comment on it. A content web page therefore usually provides a comment window in which the user can enter comment information. However, because the size of a web page is limited, displaying the comment window directly in the content web page would often obscure the displayed article, so that not all characters of the article could be seen. Consequently, while the article is displayed, the comment window is not shown; only a clickable button for opening the comment window is displayed, and a keyword related to "comment" is shown on the button to prompt the user that clicking it opens the comment window. After reading the article, a user who wants to comment clicks the button, the comment window is displayed in the web page, and the user then enters comment information in it. In the web data, the button may take the form of an a tag whose data includes the keyword related to "comment".

A non-content web page contains no article, so users have no need to comment and no comment window is provided; the page therefore contains no keyword related to "comment".

Secondly, the article in a content web page is sometimes reprinted from another web page. To protect the copyright of the article's owner, the source of the article must be indicated on the page. A content web page therefore usually provides a clickable button for displaying the original web page of the article, and a keyword related to "source" is shown on the button to prompt the user that clicking it displays the original web page. In the web data, the button may take the form of an a tag whose data includes the keyword related to "source".

A non-content web page contains no article, so there is no need to protect the copyright of an article owner, and the page therefore contains no keyword related to "source".

In addition, the article in a content web page often contains many characters, and a single web page cannot display all of them, so a content web page often displays the article in pages. When a content web page is displayed, usually only part of the article's characters is shown, together with a clickable button for obtaining the remaining characters. Predetermined keywords such as "details", "view more" and "view full text" are shown on the button to prompt the user that clicking it retrieves all characters of the article. In the web data, the button may take the form of an a tag whose data includes predetermined keywords such as "details", "view more" and "view full text".

A non-content web page generally contains no article, so there is usually no need to display an article in pages. The web data of a non-content web page therefore generally does not include paging keywords such as "details", "view more" and "view full text".

In summary, the number of predetermined keywords in the web data can serve as one decision factor for judging whether the target web page is a content web page.
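The keyword count can be sketched as below. The sketch assumes that only the text of a tags is inspected (the text describes the buttons as a tags) and uses English stand-ins for the predetermined keywords; the keyword list and the names `AnchorTextParser` and `count_keywords` are illustrative assumptions.

```python
from html.parser import HTMLParser

# Assumption: English stand-ins for the predetermined keywords named in the text.
KEYWORDS = ("comment", "source", "details", "view more", "view full text")


class AnchorTextParser(HTMLParser):
    """Collects the text content of each <a> tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.anchor_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1
            if self.depth == 1:
                self.anchor_texts.append("")

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.anchor_texts[-1] += data


def count_keywords(html):
    """Count occurrences of the predetermined keywords inside <a> tag text."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return sum(
        text.lower().count(keyword)
        for text in parser.anchor_texts
        for keyword in KEYWORDS
    )
```

Keyword text that appears outside a tags (for example inside a paragraph of the article) is deliberately ignored in this sketch.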

When obtaining the maximum among the numbers of text characters in the paragraph tags of the web data, and calculating the average number of text characters per paragraph tag in the web data, the number of text characters in each paragraph tag can be counted separately and the maximum of the counted numbers selected. The counted numbers for all paragraph tags are also summed, and the ratio of that sum to the number of paragraph tags in the web data is taken as the average number of text characters per paragraph tag.

The article in a content web page includes multiple paragraphs, and each paragraph usually contains many text characters. Even if individual paragraphs contain few text characters, the average number of text characters across all paragraphs of the article is still large.

Therefore, the maximum number of text characters in any paragraph tag of the web data and the average number of text characters per paragraph tag of the web data can serve as one decision factor for judging whether the target web page is a content web page.
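The two paragraph statistics can be sketched as follows, assuming that "paragraph tag" means the `p` tag and that whitespace is not counted as a text character; the names `ParagraphParser` and `paragraph_stats` are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphParser(HTMLParser):
    """Collects the text content of each <p> tag."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data


def paragraph_stats(html):
    """Return (maximum, average) number of text characters per paragraph tag."""
    parser = ParagraphParser()
    parser.feed(html)
    parser.close()
    # Assumption: whitespace is not counted as a text character.
    lengths = [len("".join(text.split())) for text in parser.paragraphs]
    if not lengths:
        return 0, 0.0
    return max(lengths), sum(lengths) / len(lengths)
```

A page with no paragraph tags returns `(0, 0.0)`, which matches the expectation that non-content web pages score low on this factor.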

15) Obtain the reward-penalty score feature of the target web page according to the web data;

For a content web page, the article usually consists of multiple paragraphs, and each paragraph is represented in the web data by one paragraph tag.

Multiple paragraph tags may share the same parent node, which can be a DIV tag, and multiple parent nodes may in turn share the same ancestor node, and so on up to the root node.

That is, either all paragraph tags in the article have one common parent node, or some of the paragraph tags have one parent node while the others have other parent nodes, and these parent nodes in turn share the same ancestor node.

For any tag in the web data, the cumulative reward-penalty score of the tag can be calculated according to the type of the tag, the class attribute value of the tag, the number of text characters in the tag and the link-text ratio of the tag. The same operation is performed for every other tag in the web data. Heading tags, paragraph tags, DIV tags and a tags are tags of different types.

The embodiments of the present invention do not limit the specific method of calculating the cumulative reward-penalty score of a tag from the type of the tag, its class attribute value, the number of text characters in it and its link-text ratio; any calculation method in the prior art may be used.

The cumulative reward-penalty score of the parent node and of the ancestor node of each tag in the web data is calculated according to the cumulative reward-penalty score of each paragraph tag. When multiple paragraph tags share the same parent node, their cumulative reward-penalty scores are added to obtain the cumulative reward-penalty score of the parent node. When multiple parent nodes share the same ancestor node, the cumulative reward-penalty scores of those parent nodes are added to obtain the cumulative reward-penalty score of the ancestor node.

Then, from among the cumulative reward-penalty scores of each tag, each parent node and each ancestor node, the preset number of largest cumulative reward-penalty scores are taken as the reward-penalty score feature of the target web page.

The preset number can be 3, 4, 5 or the like, which is not limited in the embodiments of the present invention.

In the embodiments of the present invention, in a content web page, the cumulative reward-penalty score of a parent node is obtained by adding the cumulative reward-penalty scores of multiple paragraph tags, and the cumulative reward-penalty score of an ancestor node is obtained from the cumulative reward-penalty scores of multiple parent nodes. Therefore, each of the preset number of largest cumulative reward-penalty scores in a content web page differs greatly from the other cumulative reward-penalty scores.

In a non-content web page, by contrast, there is no article and hence no paragraphs and no paragraph tags, so the DOM tree of a non-content web page has no parent node composed of multiple paragraph tags. In a non-content web page, the differences between the cumulative reward-penalty scores of the tags are therefore small.

Therefore, the preset number of largest cumulative reward-penalty scores can be used as the reward-penalty score feature of the target web page, and this feature can serve as one decision factor for judging whether the target web page is a content web page.
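The propagation of scores up the tag tree and the selection of the largest scores can be sketched as follows. Because the per-tag scoring method is left to the prior art, the sketch takes each leaf tag's score as a given input. It also assumes that a node with children carries only the sum of its children's cumulative scores. The `Node` type and the function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tag in the DOM tree; `score` is the tag's own score (assumed given)."""
    tag: str
    score: float = 0.0
    children: list = field(default_factory=list)


def collect_cumulative_scores(node, out):
    """Append the cumulative score of `node` and all descendants to `out`.

    Assumption: a node with children gets the sum of its children's
    cumulative scores; a leaf keeps its own score.
    """
    if node.children:
        total = sum(collect_cumulative_scores(child, out) for child in node.children)
    else:
        total = node.score
    out.append(total)
    return total


def reward_penalty_feature(root, preset_number=3):
    """Return the `preset_number` largest cumulative scores in the tree."""
    scores = []
    collect_cumulative_scores(root, scores)
    return sorted(scores, reverse=True)[:preset_number]
```

For a tree whose article container holds paragraphs scored 3, 4 and 5 next to a sidebar with a single link scored 1, the three largest cumulative scores are 13 (root), 12 (article container) and 5 (longest paragraph), which shows the large gap the text attributes to content web pages.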

16) Input the structure feature, the semantic feature, the text feature and the reward-penalty score feature into a preset web page classifier, and determine, according to the output of the web page classifier, whether the target web page is a content web page or a non-content web page.

The web page classifier is used to judge, according to the feature data of a web page, whether the web page is a content web page or a non-content web page.

A technician can prepare multiple sample content web pages in advance and train the web page classifier according to the structure feature, semantic feature, text feature and reward-penalty score feature of the sample content web pages.

The web page classifier may be trained with the XGBoost algorithm; other algorithms may of course also be used, which is not limited in the embodiments of the present invention.

In the embodiments of the present invention, for a web page, after the structure feature, semantic feature, text feature and reward-penalty score feature of the web page are input into the trained web page classifier, the classifier processes these features and outputs the result that the web page is a content web page or a non-content web page.

In step S102, if the target web page is a content web page, the article in the target web page is extracted using a preset extraction rule, the preset extraction rule being used to extract articles from content web pages.

Further, the web data of the target web page contains data that is not used when judging whether the target web page is a content web page, such as the data in script tags, the data in style tags, the data in comments tags and the data in tags that have a hidden attribute. Such data does not help in judging whether the target web page is a content web page. The data in tags that have a hidden attribute is not displayed on the web page.

In addition, some data is displayed on content web pages and non-content web pages alike, for example the legal authorization mark of a web page. Such marks are displayed on every web page, but they are not used when judging whether the target web page is a content web page.

In the embodiments of the present invention, the feature data of the target web page must be obtained from its web data in order to judge whether the target web page is a content web page. Obtaining the feature data requires traversing the entire web data of the target web page to find the feature data used in the judgment. If the web data contains a large amount of data that is not used in the judgment, obtaining the feature data takes longer, so that more time is ultimately needed to determine whether the target web page is a content web page, which lowers the efficiency of the judgment.

Therefore, to improve the efficiency of judging whether the target web page is a content web page, after the web data of the target web page is obtained, the data in at least the following tags can be removed from the web data: the data in script tags, the data in style tags, the data in comments tags, the data in tags that have a hidden display attribute, and the data that is displayed on every web page.

Thus, in another embodiment of the present invention, the web data in steps 12) to 15) is the web data that remains after the data in script tags, the data in style tags, the data in comments tags, the data in tags that have a hidden display attribute and the data that is displayed on every web page have been removed from the web data obtained in step 11).
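The removal step can be sketched as below. For brevity the sketch keeps only the remaining text rather than the remaining tags, and it assumes that "hidden" means either the HTML `hidden` attribute or an inline `display:none` style; HTML comments are dropped because the parser's default comment handler ignores them. The names `VisibleTextParser` and `visible_text` are hypothetical, and the sketch expects well-formed markup with matching closing tags.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


def _is_hidden(attrs):
    """Assumption: hidden means the `hidden` attribute or inline display:none."""
    attributes = dict(attrs)
    if "hidden" in attributes:
        return True
    style = (attributes.get("style") or "").replace(" ", "").lower()
    return "display:none" in style


class VisibleTextParser(HTMLParser):
    """Collects text outside script tags, style tags and hidden tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth > 0 or tag in SKIP_TAGS or _is_hidden(attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def visible_text(html):
    """Return the text that remains after the unused data has been dropped."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(chunk.strip() for chunk in parser.chunks if chunk.strip())
```

A fuller implementation would keep the remaining tags as well, since steps 12) to 15) rely on tag counts and tag structure; removing data that is displayed on every web page is not covered here because the text does not say how such data is recognized.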

Fig. 2 is a block diagram of a web page processing device according to an exemplary embodiment. Referring to Fig. 2, the device includes:

Identification module 11, configured to identify whether a target web page is a content web page;

Extraction module 12, configured to extract the article in the target web page using a preset extraction rule if the target web page is a content web page, the preset extraction rule being used to extract articles from content web pages.

Wherein, the identification module 11 includes：

First acquisition unit, the web data for obtaining the target webpage；

Further, the identification module 11 further includes：

Wherein, first acquisition module includes：

First computation subunit, the chain text ratio for calculating the web data；

Wherein, the third acquiring unit includes：

Wherein, the 4th acquiring unit includes：

Regarding the device in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.

Other embodiments of the present invention will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed here. This application is intended to cover any variations, uses or adaptations of the embodiments of the present invention that follow their general principles and include common knowledge or customary technical means in the art not disclosed by the embodiments of the present invention. The specification and examples are to be regarded as illustrative only, with the true scope and spirit of the embodiments of the present invention being indicated by the appended claims.

It should be understood that the embodiments of the present invention are not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from their scope. The scope of the embodiments of the present invention is limited only by the appended claims.

Claims

1. A web page processing method, characterized in that the method comprises:

Identifying whether a target web page is a content web page;

If the target web page is a content web page, extracting the article in the target web page using a preset extraction rule, the preset extraction rule being used to extract articles from content web pages.

2. The method according to claim 1, characterized in that identifying whether the target web page is a content web page comprises:

Obtaining the web data of the target web page;

Obtaining the text feature of the target web page according to the web data;

Inputting the structure feature, the semantic feature, the text feature and the reward-penalty score feature into a preset web page classifier, and determining, according to the output of the web page classifier, whether the target web page is a content web page or a non-content web page.

3. The method according to claim 2, characterized in that, after obtaining the web data of the target web page, the method further comprises:

Removing from the web data the data in at least the following tags: the data in script tags, the data in style tags, the data in comments tags and the data in tags that have a hidden attribute.

4. The method according to claim 2 or 3, characterized in that obtaining the structure feature of the target web page according to the web data comprises:

Obtaining the number of heading tags, the number of paragraph tags, the number of DIV tags and the number of a tags in the web data;

Calculating the link-text ratio of the web data;

Judging whether a paging keyword is present in the web data, the paging keyword including at least: first page, previous page, next page, last page and full text;

Using the number of heading tags, the number of paragraph tags, the number of DIV tags, the number of a tags, the link-text ratio, the number of pictures and the information indicating whether the paging keyword is present in the web data as the structure feature of the target web page.

5. The method according to claim 2 or 3, characterized in that obtaining the semantic feature of the target web page according to the web data comprises:

Obtaining from the web data the number of tags whose class attribute value is content or article, and using that number as the semantic feature of the target web page.

6. The method according to claim 2 or 3, characterized in that obtaining the text feature of the target web page according to the web data comprises:

Judging whether the data in the heading tag of the web data is present in the head tag of the web data;

Calculating the ratio between the number of punctuation marks and the total number of characters in the data of the web data other than the a tags;

Counting the number of predetermined keywords in the web data, the predetermined keywords including at least: comment, source, details, view more and view full text;

Using the information indicating whether the data in the heading tag of the web data is present in the head tag of the web data, the number of punctuation marks, the ratio, the number of predetermined keywords, the maximum number and the average number as the text feature of the target web page.

7. The method according to claim 2 or 3, characterized in that obtaining the reward-penalty score feature of the target web page according to the web data comprises:

For each tag in the web data, calculating the cumulative reward-penalty score of the tag according to the type of the tag, the class attribute value of the tag, the number of text characters in the tag and the link-text ratio of the tag;

Calculating, according to the cumulative reward-penalty score of each tag, the cumulative reward-penalty score of the parent node and the cumulative reward-penalty score of the ancestor node of each tag in the web data;

8. A web page processing device, characterized in that the device comprises:

An extraction module, configured to extract the article in the target web page using a preset extraction rule if the target web page is a content web page, the preset extraction rule being used to extract articles from content web pages.

9. The device according to claim 8, characterized in that the identification module comprises:

A first acquisition unit, configured to obtain the web data of the target web page;

A determination unit, configured to input the structure feature, the semantic feature, the text feature and the reward-penalty score feature into a preset web page classifier, and to determine, according to the output of the web page classifier, whether the target web page is a content web page or a non-content web page.

10. The device according to claim 9, characterized in that the identification module further comprises:

A removal unit, configured to remove from the web data the data in at least the following tags: the data in script tags, the data in style tags, the data in comments tags and the data in tags that have a hidden attribute.

11. The device according to claim 9 or 10, characterized in that the first acquisition module comprises:

A first obtaining subunit, configured to obtain the number of heading tags, the number of paragraph tags, the number of DIV tags and the number of a tags in the web data;

A first calculation subunit, configured to calculate the link-text ratio of the web data;

A second obtaining subunit, configured to obtain the number of pictures in the web data whose pixel count exceeds a preset threshold;

A first judgment subunit, configured to judge whether a paging keyword is present in the web data, the paging keyword including at least: first page, previous page, next page, last page and full text;

A first determination subunit, configured to use the number of heading tags, the number of paragraph tags, the number of DIV tags, the number of a tags, the link-text ratio, the number of pictures and the information indicating whether the paging keyword is present in the web data as the structure feature of the target web page.

12. The device according to claim 9 or 10, characterized in that the second acquisition unit is specifically configured to: obtain from the web data the number of tags whose class attribute value is content or article, and use that number as the semantic feature of the target web page.

13. The device according to claim 9 or 10, characterized in that the third acquisition unit comprises:

A second judgment subunit, configured to judge whether the data in the heading tag of the web data is present in the head tag of the web data;

A first counting subunit, configured to count the number of punctuation marks in the data of the web data other than the a tags;

A second calculation subunit, configured to calculate the ratio between the number of punctuation marks and the total number of characters in the data of the web data other than the a tags;

A second counting subunit, configured to count the number of predetermined keywords in the web data, the predetermined keywords including at least: comment, source, details, view more and view full text;

A third obtaining subunit, configured to obtain the maximum among the numbers of text characters in the paragraph tags of the web data;

A third calculation subunit, configured to calculate the average number of text characters per paragraph tag in the web data;

A second determination subunit, configured to use the information indicating whether the data in the heading tag of the web data is present in the head tag of the web data, the number of punctuation marks, the ratio, the number of predetermined keywords, the maximum number and the average number as the text feature of the target web page.

14. The device according to claim 9 or 10, characterized in that the fourth acquisition unit comprises:

A fourth calculation subunit, configured to calculate, for each tag in the web data, the cumulative reward-penalty score of the tag according to the type of the tag, the class attribute value of the tag, the number of text characters in the tag and the link-text ratio of the tag;

A fifth calculation subunit, configured to calculate, according to the cumulative reward-penalty score of each tag, the cumulative reward-penalty score of the parent node and the cumulative reward-penalty score of the ancestor node of each tag in the web data;

A fourth obtaining subunit, configured to take the preset number of largest cumulative reward-penalty scores as the reward-penalty score feature.