CN108664522A - Web page processing method and device - Google Patents
- Publication number
- CN108664522A (application CN201710212470.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- text
- web page
- label
- web data
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
An embodiment of the present invention provides a web page processing method and device. The embodiment splits the task of extracting an article into two steps. The first step identifies whether a target web page is a content web page (positive web page text). If it is, the second step extracts the article from the target web page using a preset extraction rule, which is used to extract articles from content web pages. If the target web page is a non-content web page, the extraction flow is not executed. The preset extraction rule therefore only has to handle content web pages, not non-content ones, which reduces the difficulty of training the rule in advance, lowers the complexity of later optimizing it, and improves the efficiency of that optimization.
Description
Technical field
Embodiments of the present invention relate to the Internet field, and in particular to a web page processing method and device.
Background technology
With the rapid development of the Internet, information on the network has grown explosively, and web pages, as openly accessible resources, have become one of the most important sources of information on the Internet.
Web pages today are either content web pages (positive web page text) or non-content web pages (non-positive web page text). A content web page contains an article and has a single theme. A non-content web page has no single theme and contains no article.
For technical and business reasons, the content of web pages today is very complex. Besides the article, a content web page is mixed with a large amount of irrelevant information, such as navigation bars, advertising links, copyright notices, and links to other recommended articles, so the article that the page is meant to display is buried in unrelated content.
However, the articles in content web pages are often needed to build a corpus, which is then used for tasks such as text mining, web information retrieval, and natural language processing. Articles therefore need to be extracted from content web pages.
In the prior art, an extraction rule is applied directly to a web page to extract the article. The structure of a content web page differs greatly from that of a non-content web page, whereas different content web pages have broadly the same structure, as do different non-content web pages. A prior-art extraction rule therefore has to handle both content and non-content web pages at once, which makes it harder for engineers to train the rule in advance.
Summary of the invention
To overcome the problems in the related art, embodiments of the present invention provide a web page processing method and device.
According to a first aspect of the embodiments of the present invention, a web page processing method is provided. The method includes:
identifying whether a target web page is a content web page;
if the target web page is a content web page, extracting the article in the target web page using a preset extraction rule, where the preset extraction rule is used to extract articles from content web pages.
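The two steps of the method can be sketched in Python as follows. This is only an illustration of the control flow: the names process_page, toy_classifier, and toy_extractor are hypothetical, and the toy classifier and extractor merely stand in for the web page classifier and the preset extraction rule.

```python
from typing import Callable, Optional


def process_page(
    html: str,
    is_content_page: Callable[[str], bool],
    extract_article: Callable[[str], str],
) -> Optional[str]:
    """Step 1: classify the page. Step 2: extract only from content pages."""
    if not is_content_page(html):
        return None  # non-content page: the extraction flow is skipped
    return extract_article(html)


# Hypothetical stand-ins, used only to make the sketch runnable.
def toy_classifier(html: str) -> bool:
    return "<p" in html


def toy_extractor(html: str) -> str:
    return "article text"
```

Because the extractor is only ever called on pages the classifier accepts, it never has to cope with non-content pages.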
Identifying whether the target web page is a content web page includes:
obtaining the web page data of the target web page;
obtaining structural features of the target web page from the web page data;
obtaining semantic features of the target web page from the web page data;
obtaining text features of the target web page from the web page data;
obtaining reward-penalty score features of the target web page from the web page data;
inputting the structural features, the semantic features, the text features, and the reward-penalty score features into a preset web page classifier, and determining from the output of the classifier whether the target web page is a content web page or a non-content web page.
Further, after the web page data of the target web page is obtained, the method also includes:
removing from the web page data the data in at least the following tags: script tags, style tags, comments tags, and tags that have a hidden attribute.
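A minimal sketch of this cleaning step, using Python's standard html.parser. It returns the text that survives after the listed data is dropped instead of rewriting the markup. It assumes that "comments tags" means HTML comments, that the hidden attribute is the HTML hidden attribute, and that the markup is well formed. The names VisibleText and visible_text are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = frozenset({"area", "base", "br", "col", "embed", "hr", "img",
                       "input", "link", "meta", "source", "track", "wbr"})


class VisibleText(HTMLParser):
    """Collects text outside script, style, and hidden elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or tag in ("script", "style") or "hidden" in dict(attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    # HTML comments reach handle_comment, whose default does nothing,
    # so comment data is dropped without extra code.


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Elements hidden through CSS (for example style="display:none") are not detected by this sketch.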
Obtaining the structural features of the target web page from the web page data includes:
obtaining the number of heading tags, the number of paragraph tags, the number of div tags, and the number of a tags in the web page data;
calculating the link-text ratio of the web page data;
obtaining the number of pictures in the web page data whose pixel count exceeds a preset threshold;
determining whether a paging keyword is present in the web page data, the paging keywords including at least: home page, previous page, next page, last page, and full text;
taking the number of heading tags, the number of paragraph tags, the number of div tags, the number of a tags, the link-text ratio, the number of pictures, and information indicating whether a paging keyword is present in the web page data as the structural features of the target web page.
Obtaining the semantic features of the target web page from the web page data includes:
obtaining, from the web page data, the number of tags whose class attribute value is content or article, and taking that number as the semantic feature of the target web page.
Obtaining the text features of the target web page from the web page data includes:
determining whether the data in a heading tag in the web page data is present in the head tag of the web page data;
counting the punctuation marks in the data of the web page data other than the a tags;
calculating the ratio between the number of punctuation marks and the total number of characters in the data of the web page data other than the a tags;
counting the preset keywords in the web page data, the preset keywords including at least: comment, source, details, view more, and view full text;
obtaining the largest of the text character counts of the paragraph tags in the web page data;
calculating the average of the text character counts of the paragraph tags in the web page data;
taking, as the text features of the target web page, the information indicating whether the data in a heading tag in the web page data is present in the head tag of the web page data, the number of punctuation marks, the ratio, the number of preset keywords, the maximum count, and the average count.
Obtaining the reward-penalty score features of the target web page from the web page data includes:
for each tag in the web page data, calculating an accumulated reward-penalty score for the tag according to the type of the tag, the class attribute value of the tag, the number of text characters in the tag, and the link-text ratio of the tag;
calculating, from the accumulated reward-penalty score of each tag, the accumulated reward-penalty score of the parent node and of the ancestor nodes of each tag in the web page data;
taking a preset number of the largest accumulated reward-penalty scores as the reward-penalty score features.
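The propagation and selection steps can be sketched as below. The text does not give the per-tag scoring formula or say how a score is weighted when it is passed upward, so this sketch takes the per-tag scores as input and assumes that each score is added in full to the parent and to every ancestor. The function name top_accumulated_scores and the dictionary-based tree are hypothetical.

```python
import heapq


def top_accumulated_scores(parent, own_score, k):
    """parent maps each node to its parent (None for the root).
    own_score maps a node to its own reward-penalty score.
    Returns the k largest accumulated scores."""
    # Assumption: a node's total starts at its own score.
    total = {node: own_score.get(node, 0.0) for node in parent}
    for node, score in own_score.items():
        ancestor = parent.get(node)
        while ancestor is not None:
            total[ancestor] += score  # assumption: no decay per level
            ancestor = parent.get(ancestor)
    return heapq.nlargest(k, total.values())
```

The sketch assumes that parent describes a tree, so walking upward always reaches the root.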
According to a second aspect of the embodiments of the present invention, a web page processing device is provided. The device includes:
an identification module, configured to identify whether a target web page is a content web page;
an extraction module, configured to extract, if the target web page is a content web page, the article in the target web page using a preset extraction rule, where the preset extraction rule is used to extract articles from content web pages.
The identification module includes:
a first acquisition unit, configured to obtain the web page data of the target web page;
a second acquisition unit, configured to obtain the structural features of the target web page from the web page data;
a third acquisition unit, configured to obtain the semantic features of the target web page from the web page data;
a fourth acquisition unit, configured to obtain the text features of the target web page from the web page data;
a fifth acquisition unit, configured to obtain the reward-penalty score features of the target web page from the web page data;
a determination unit, configured to input the structural features, the semantic features, the text features, and the reward-penalty score features into a preset web page classifier, and to determine from the output of the classifier whether the target web page is a content web page or a non-content web page.
Further, the identification module also includes:
a removal unit, configured to remove from the web page data the data in at least the following tags: script tags, style tags, comments tags, and tags that have a hidden attribute.
The second acquisition unit includes:
a first obtaining subunit, configured to obtain the number of heading tags, the number of paragraph tags, the number of div tags, and the number of a tags in the web page data;
a first calculation subunit, configured to calculate the link-text ratio of the web page data;
a second obtaining subunit, configured to obtain the number of pictures in the web page data whose pixel count exceeds a preset threshold;
a first judgment subunit, configured to determine whether a paging keyword is present in the web page data, the paging keywords including at least: home page, previous page, next page, last page, and full text;
a first determination subunit, configured to take the number of heading tags, the number of paragraph tags, the number of div tags, the number of a tags, the link-text ratio, the number of pictures, and information indicating whether a paging keyword is present in the web page data as the structural features of the target web page.
The third acquisition unit is specifically configured to obtain, from the web page data, the number of tags whose class attribute value is content or article, and to take that number as the semantic feature of the target web page.
The fourth acquisition unit includes:
a second judgment subunit, configured to determine whether the data in a heading tag in the web page data is present in the head tag of the web page data;
a first counting subunit, configured to count the punctuation marks in the data of the web page data other than the a tags;
a second calculation subunit, configured to calculate the ratio between the number of punctuation marks and the total number of characters in the data of the web page data other than the a tags;
a second counting subunit, configured to count the preset keywords in the web page data, the preset keywords including at least: comment, source, details, view more, and view full text;
a third obtaining subunit, configured to obtain the largest of the text character counts of the paragraph tags in the web page data;
a third calculation subunit, configured to calculate the average of the text character counts of the paragraph tags in the web page data;
a second determination subunit, configured to take, as the text features of the target web page, the information indicating whether the data in a heading tag in the web page data is present in the head tag of the web page data, the number of punctuation marks, the ratio, the number of preset keywords, the maximum count, and the average count.
The fifth acquisition unit includes:
a fourth calculation subunit, configured to calculate, for each tag in the web page data, an accumulated reward-penalty score for the tag according to the type of the tag, the class attribute value of the tag, the number of text characters in the tag, and the link-text ratio of the tag;
a fifth calculation subunit, configured to calculate, from the accumulated reward-penalty score of each tag, the accumulated reward-penalty score of the parent node and of the ancestor nodes of each tag in the web page data;
a fourth obtaining subunit, configured to take a preset number of the largest accumulated reward-penalty scores as the reward-penalty score features.
The technical solutions provided by the embodiments of the present invention can have the following beneficial effects:
A content web page contains an article, and a non-content web page does not. In the prior art, an extraction rule is applied directly to a web page to extract the article. The structure of a content web page differs greatly from that of a non-content web page, whereas different content web pages have broadly the same structure, as do different non-content web pages. A prior-art extraction rule therefore has to handle both kinds of page at once, which makes it harder for engineers to train the rule in advance.
For example, engineers must collect sample feature data for many non-content web pages and many content web pages in advance and then train the extraction rule on both sets. Because the two kinds of page differ greatly in structure, their sample feature data are two entirely different kinds of feature data, and a rule trained on both is likely to be more complex.
In addition, later optimization of the extraction rule is more complex and less efficient. For example, if a prior-art extraction rule extracts articles from web pages with low accuracy, the rule has to be optimized to improve accuracy. Because the rule must handle both non-content and content web pages, the optimization must also take both into account, which makes it more complex and less efficient.
In the embodiments of the present invention, by contrast, the task of extracting an article is split into two steps. The first step identifies whether the target web page is a content web page. If it is, the second step extracts the article from the target web page using the preset extraction rule, which is used to extract articles from content web pages. If the target web page is a non-content web page, the extraction flow need not be executed. The preset extraction rule therefore only has to handle content web pages, not non-content ones, which reduces the difficulty of training it in advance and lets the rule stay focused and effective. For example, engineers only need to collect sample feature data for content web pages, not for non-content web pages, and train the preset extraction rule on that data alone. Training on a single kind of feature data yields a simpler rule.
In addition, later optimization of the preset extraction rule is less complex and more efficient. For example, if the preset extraction rule extracts articles from content web pages with low accuracy, the rule has to be optimized to improve accuracy. Because the preset extraction rule only has to handle content web pages, the optimization only has to take content web pages into account. Compared with the prior art, where optimization must take both kinds of page into account, the embodiments of the present invention reduce the complexity of optimizing the preset extraction rule and improve the efficiency of doing so.
It should be understood that the general description above and the detailed description below are only exemplary and explanatory and do not limit the embodiments of the present invention.
Description of the drawings
The accompanying drawings are incorporated in and form part of this specification. They show embodiments consistent with the present invention and, together with the specification, serve to explain the principles of the embodiments of the present invention.
Fig. 1 is a flowchart of a web page processing method according to an exemplary embodiment;
Fig. 2 is a block diagram of a web page processing device according to an exemplary embodiment.
Detailed description
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the embodiments of the present invention. Rather, they are only examples of devices and methods consistent with some aspects of the embodiments of the present invention as detailed in the appended claims.
Fig. 1 is a flowchart of a web page processing method according to an exemplary embodiment. As shown in Fig. 1, the method includes the following steps.
In step S101, whether the target web page is a content web page is identified.
This step can be implemented by the following flow:
11) Obtain the web page data of the target web page.
12) Obtain the structural features of the target web page from its web page data.
First, count the heading tags, paragraph tags, div tags, and a tags in the web page data. Then calculate the link-text ratio of the web page data. Next, obtain the number of pictures in the web page data whose pixel count exceeds a preset threshold. Then determine whether a paging keyword is present in the web page data, the paging keywords including at least: home page, previous page, next page, last page, full text, and so on. Finally, take the number of heading tags, the number of paragraph tags, the number of div tags, the number of a tags, the link-text ratio of the web page data, the number of pictures, and information indicating whether a paging keyword is present in the web page data as the structural features of the target web page.
To calculate the link-text ratio of the web page data, first count the total number of characters in the web page data, then count the total number of characters in all the a tags in the web page data. Next, calculate the difference between these two totals. Finally, divide the total number of characters in all the a tags by that difference to obtain the link-text ratio of the web page data.
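The calculation just described, the characters inside a tags divided by the remaining characters, can be sketched as follows. Ignoring whitespace when counting characters, and the handling of a page with no text outside a tags, are assumptions of this sketch. The names LinkTextCounter and link_text_ratio are hypothetical.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Counts all text characters and those that sit inside a tags."""

    def __init__(self):
        super().__init__()
        self.total_chars = 0
        self.anchor_chars = 0
        self.anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len("".join(data.split()))  # assumption: whitespace is not counted
        self.total_chars += n
        if self.anchor_depth:
            self.anchor_chars += n


def link_text_ratio(html):
    counter = LinkTextCounter()
    counter.feed(html)
    counter.close()
    non_anchor = counter.total_chars - counter.anchor_chars
    if non_anchor == 0:
        # Assumption: a page made only of link text gets an infinite ratio.
        return float("inf") if counter.anchor_chars else 0.0
    return counter.anchor_chars / non_anchor
```

For a page with eight characters of body text and two characters of link text, the ratio is 2 / (10 - 2) = 0.25.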
In the embodiments of the present invention, a content web page usually contains an article, and an article contains many characters, such as many Chinese characters and/or letters, so the web page data of a content web page contains many characters. In addition, a content web page contains few clickable buttons that carry links, so its web page data contains few links, that is, few a tags, and the total number of characters in all its a tags is correspondingly low. The link-text ratio of the web page data can therefore serve as one factor in deciding whether the target web page is a content web page.
To obtain the number of pictures in the web page data whose pixel count exceeds the preset threshold, the pixels in each picture in the web page data are counted. A picture is made up of pixels and is generally rectangular, that is, every row of the picture has the same number of pixels and every column has the same number of pixels. The pixels in any one row and in any one column of the picture can therefore be counted, and the product of the two counts is the number of pixels in the picture. That number is then compared with the preset threshold, and if it exceeds the threshold, the picture count recorded in a cache is incremented. The same operations are performed for every other picture in the web page data. Finally, the picture count recorded in the cache is taken as the number of pictures in the web page data whose pixel count exceeds the preset threshold.
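The counting loop can be sketched as below, assuming the pixel dimensions of each picture have already been read. How they are read (from the image file or from width and height attributes) is left open here. The function name count_large_pictures is hypothetical.

```python
def count_large_pictures(sizes, pixel_threshold):
    """sizes: iterable of (pixels_per_row, pixels_per_column), one pair per picture.
    Returns how many pictures have more pixels than pixel_threshold."""
    count = 0
    for pixels_per_row, pixels_per_column in sizes:
        # A rectangular picture holds rows x columns pixels.
        if pixels_per_row * pixels_per_column > pixel_threshold:
            count += 1
    return count
```

A picture whose pixel count equals the threshold is not counted, matching the "exceeds" wording of the text.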
In the embodiments of the present invention, a web page often displays pictures in addition to text and elements such as clickable buttons that carry links. The pictures shown in a content web page are usually large, whereas those shown in a non-content web page are small. In other words, pictures in a content web page contain more pixels, and pictures in a non-content web page contain fewer. The number of pictures in the web page data whose pixel count exceeds the preset threshold can therefore serve as one factor in deciding whether the target web page is a content web page.
Whether a paging keyword is present in the web page data is then determined, the paging keywords including at least: home page, previous page, next page, last page, full text, and so on.
In the embodiments of the present invention, the article in a content web page often contains many characters, and a single web page cannot display all of them, so a content web page often displays the article across several pages. When such a page is displayed, it usually shows only part of the article, together with clickable buttons for obtaining the other parts. Each button shows at least one paging keyword, such as home page, previous page, next page, last page, or full text, to prompt the user that clicking it retrieves another part of the article. In the web page data, such a button takes the form of an a tag whose data contains at least one of these paging keywords.
A non-content web page, by contrast, generally contains no article, so it usually has no need to display an article across several pages. The web page data of a non-content web page therefore generally contains no paging keyword.
Therefore, the a tags in the web page data can be checked for paging keywords, and indication information can be generated to indicate whether a paging keyword is present in the web page data. This indication information serves as one factor in deciding whether the target web page is a content web page.
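The check on a tags can be sketched with Python's standard html.parser. The keyword strings below are placeholder English renderings of the paging keywords and would need to match the language of the pages being processed. The names PAGING_KEYWORDS, AnchorTextCollector, and has_paging_keyword are hypothetical.

```python
from html.parser import HTMLParser

# Placeholder renderings of the paging keywords (assumption).
PAGING_KEYWORDS = ("home page", "previous page", "next page", "last page", "full text")


class AnchorTextCollector(HTMLParser):
    """Collects the text found inside a tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts.append(data.strip().lower())


def has_paging_keyword(html, keywords=PAGING_KEYWORDS):
    collector = AnchorTextCollector()
    collector.feed(html)
    collector.close()
    return any(k in text for text in collector.texts for k in keywords)
```

Only text inside a tags is examined, so a keyword that merely appears in the body text does not set the indicator.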
13) Obtain the semantic features of the target web page from its web page data.
In this step, the number of tags whose class attribute value is content or article can be obtained from the web page data and taken as the semantic feature of the target web page.
To obtain the number of tags whose class attribute value is content or article, the class attribute value of each tag in the web page data is obtained and checked against content and article. When the class attribute value of a tag is content or article, the tag count recorded in a cache is incremented. The same operations are performed for every other tag in the web page data. Finally, the tag count recorded in the cache is taken as the number of tags in the web page data whose class attribute value is content or article.
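This count can be sketched as follows. Because a class attribute may hold several space-separated names, the sketch counts a tag when any one of its names is content or article; that reading is an assumption, as are the names ClassCounter and count_content_tags.

```python
from html.parser import HTMLParser

TARGET_CLASSES = {"content", "article"}


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute names content or article."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        class_value = dict(attrs).get("class") or ""
        # Assumption: one matching name among several is enough.
        if TARGET_CLASSES & set(class_value.split()):
            self.count += 1


def count_content_tags(html):
    counter = ClassCounter()
    counter.feed(html)
    counter.close()
    return counter.count
```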
In the embodiments of the present invention, the web page data of a web page contains multiple tags, and each tag has its own class attribute value.
A content web page contains an article, and an article is made up of multiple paragraphs. The web page data of a content web page therefore contains multiple paragraph tags, each holding the characters of one paragraph, and the class attribute value of such a paragraph tag is content or article.
A non-content web page contains no article and hence no paragraphs, so its web page data contains no paragraph tags and no tags whose class attribute value is content or article.
Therefore, the number of tags in the web page data whose class attribute value is content or article serves as one factor in deciding whether the target web page is a content web page.
14) Obtain the text features of the target web page from its web page data.
Specifically: determine whether the data in a heading tag in the web page data is present in the head tag of the web page data; count the punctuation marks in the data of the web page data other than the a tags; calculate the ratio between the number of punctuation marks and the total number of characters in the data of the web page data other than the a tags; count the preset keywords in the web page data, the preset keywords including at least: comment, source, details, view more, and view full text; obtain the largest of the text character counts of the paragraph tags in the web page data; calculate the average of the text character counts of the paragraph tags in the web page data; and take, as the text features of the target web page, the information indicating whether the data in a heading tag in the web page data is present in the head tag of the web page data, the number of punctuation marks, the ratio, the number of preset keywords, the maximum count, and the average count.
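Two of the text features listed above, the maximum and the average number of text characters per paragraph tag, can be sketched as follows. The sketch assumes that paragraph tags are p elements and that whitespace is not counted. The names ParagraphLengths and paragraph_stats are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphLengths(HTMLParser):
    """Records the number of text characters inside each p tag."""

    def __init__(self):
        super().__init__()
        self.lengths = []
        self.in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.lengths.append(0)
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.lengths[-1] += len("".join(data.split()))


def paragraph_stats(html):
    """Returns (maximum, average) paragraph length, or (0, 0.0) without paragraphs."""
    parser = ParagraphLengths()
    parser.feed(html)
    parser.close()
    if not parser.lengths:
        return 0, 0.0
    return max(parser.lengths), sum(parser.lengths) / len(parser.lengths)
```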
To determine whether the data in a heading tag in the web page data is present in the head tag of the web page data, the data in the heading tag is obtained, then the data in the head tag is obtained and searched for the data of the heading tag. When the data in the head tag contains the data in the heading tag, indication information is generated to indicate that the data in the heading tag in the web page data is present in the head tag of the web page data.
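This lookup can be sketched as below. The sketch assumes that heading tags are the h1 to h6 elements and that a plain substring match between the heading text and the text inside the head tag is sufficient. The names HeadingHeadCollector and heading_in_head are hypothetical.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}  # assumption


class HeadingHeadCollector(HTMLParser):
    """Collects the text of heading tags and the text inside the head tag."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.current_heading = None
        self.head_text = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag in HEADING_TAGS:
            self.current_heading = []

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag in HEADING_TAGS and self.current_heading is not None:
            self.headings.append("".join(self.current_heading).strip())
            self.current_heading = None

    def handle_data(self, data):
        if self.in_head:
            self.head_text.append(data)
        if self.current_heading is not None:
            self.current_heading.append(data)


def heading_in_head(html):
    collector = HeadingHeadCollector()
    collector.feed(html)
    collector.close()
    head = "".join(collector.head_text)
    return any(h and h in head for h in collector.headings)
```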
In the embodiments of the present invention, when a web page is a content web page, it contains an article, and an article generally has a title. The web page data of a content web page therefore contains a heading tag, and the title of the article is stored in that heading tag.
Every web page has a page title, which is stored in the head tag of the page, for example in the title tag inside the head tag. In the embodiments of the present invention, the page title of a content web page usually includes the title of the article in that page.
A non-content web page, by contrast, contains no article and hence no article title, so its web page data contains no heading tag. Although a non-content web page also has a page title, that page title contains no article title.
Therefore, the indication information that indicates whether the data in a heading tag in the web page data is present in the head tag of the web page data can serve as one factor in deciding whether the target web page is a content web page.
The web data includes many tags, and each tag contains its own data. Therefore, when calculating the ratio between the number of punctuation marks and the total number of characters in the data other than the a tags of the web data, the total number of characters can be counted in the data other than the data contained in the a tags, the total number of punctuation marks and the total number of characters are summed, and the ratio between the total number of punctuation marks and the resulting sum is then calculated.
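A minimal sketch of this calculation is given below. Several points are assumptions rather than statements of the patent: the set of punctuation marks (ASCII punctuation plus a few common Chinese marks), the reading of "characters" as non-punctuation, non-whitespace characters, and all names used in the code.

```python
import string
from html.parser import HTMLParser

# Assumed punctuation set: ASCII punctuation plus some common Chinese marks.
PUNCTUATION = set(string.punctuation) | set("，。、；：？！“”‘’（）《》")


class NonLinkText(HTMLParser):
    """Collects text that is not inside any <a> element."""

    def __init__(self):
        super().__init__()
        self.a_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.a_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.a_depth:
            self.a_depth -= 1

    def handle_data(self, data):
        if self.a_depth == 0:
            self.chunks.append(data)


def punctuation_features(web_data: str):
    """Return (punctuation count, punctuation / (punctuation + characters))."""
    parser = NonLinkText()
    parser.feed(web_data)
    parser.close()
    text = "".join("".join(parser.chunks).split())  # drop all whitespace
    punct = sum(ch in PUNCTUATION for ch in text)
    chars = len(text) - punct                        # non-punctuation characters
    total = punct + chars
    return punct, (punct / total if total else 0.0)
```

Both returned values correspond to entries of the text feature: the number of punctuation marks and the ratio.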
Normally, because a text webpage contains an article, a user sometimes wants to comment on the article after reading it, so a text webpage generally provides a comment window in which the user can enter comment information. Because the size of the webpage is limited, however, displaying the comment window directly in the text webpage would block the displayed article, so that not all characters of the article could be seen. Therefore, while the article is displayed, the comment window is not shown; only a clickable button for obtaining the comment window is displayed, and a keyword such as "comment" is shown on the button to prompt the user that clicking it brings up the comment window. After reading the article, a user who wants to comment clicks the button, the comment window is displayed in the webpage, and the user then enters comment information in it. In the web data, the button can take the form of an a tag, and the data in the a tag includes the "comment" keyword.
A non-text webpage contains no article and needs no user comments, so it does not display a comment window and therefore does not include the "comment" keyword.
Secondly, the article in a text webpage is sometimes reprinted from another webpage. To protect the copyright of the article's owner, the source of the article needs to be indicated on the webpage. A text webpage therefore generally provides a clickable button for displaying the article's original webpage, and a keyword such as "source" is shown on the button to prompt the user that clicking it leads to the original webpage. In the web data, the button can take the form of an a tag, and the data in the a tag includes the "source" keyword.
A non-text webpage contains no article and has no article owner's copyright to protect, so it does not display the "source" keyword.
In addition, the article in a text webpage often contains many characters, and a single webpage cannot always display all of them, so text webpages often display the article in pages. When a text webpage is displayed, usually only part of the article's characters is shown, together with a clickable button for obtaining the remaining characters. Predetermined keywords such as "details", "view more" and "view full text" are shown on the button to prompt the user that clicking it retrieves all the characters of the article. In the web data, the button can take the form of an a tag, and the data in the a tag includes these predetermined keywords.
A non-text webpage generally contains no article, so there is usually no need to display an article in pages, and the web data of a non-text webpage generally does not include paging keywords such as "details", "view more" and "view full text".
To sum up, the number of predetermined keywords in the web data can serve as one decision factor for judging whether the target webpage is a text webpage.
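Counting the predetermined keywords can be sketched as below. The English keyword strings are assumed renderings of the keywords named in the text, and counting them by case-insensitive substring search over all text of the web data is a simplification chosen for this sketch; the function and class names are hypothetical.

```python
from html.parser import HTMLParser

# Assumed English renderings of the predetermined keywords.
PRESET_KEYWORDS = ("comment", "source", "details", "view more", "view full text")


class TextCollector(HTMLParser):
    """Collects every text node of the web data."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def count_preset_keywords(web_data: str, keywords=PRESET_KEYWORDS) -> int:
    """Total number of keyword occurrences in the text of the web data."""
    collector = TextCollector()
    collector.feed(web_data)
    collector.close()
    text = " ".join(collector.parts).lower()
    return sum(text.count(keyword) for keyword in keywords)
```

A stricter variant could restrict the search to the data inside a tags, since the description places the keywords on link-style buttons.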
When obtaining the maximum of the numbers of text characters in the paragraph tags of the web data and calculating the average number of text characters per paragraph tag, the number of text characters in each paragraph tag of the web data can be counted separately and the maximum of the counted numbers selected. The counted numbers for all paragraph tags are then summed, and the ratio between that sum and the number of paragraph tags in the web data is taken as the average number of text characters per paragraph tag.
The article in a text webpage includes multiple paragraphs, and each paragraph usually contains many text characters. Even if individual paragraphs contain few text characters, the average number of text characters over all paragraphs of the article is still large.
Therefore, the maximum number of text characters in any paragraph tag of the web data, together with the average number of text characters per paragraph tag, can serve as one decision factor for judging whether the target webpage is a text webpage.
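The two paragraph statistics can be sketched as follows. The sketch assumes that "paragraph tag" means the HTML p element and that whitespace is not counted as a text character; the names are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphLengths(HTMLParser):
    """Records the number of non-whitespace characters in each <p> element."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.lengths = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.lengths.append(0)

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.lengths[-1] += len("".join(data.split()))


def paragraph_stats(web_data: str):
    """Return (maximum, average) text-character count over all paragraph tags."""
    parser = ParagraphLengths()
    parser.feed(web_data)
    parser.close()
    if not parser.lengths:
        return 0, 0.0
    return max(parser.lengths), sum(parser.lengths) / len(parser.lengths)
```

A page without any paragraph tag yields (0, 0.0), which matches the expectation that non-text webpages score low on both values.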
15) Obtain the reward-penalty score feature of the target webpage according to the web data;
In a text webpage, the article usually consists of multiple paragraphs, and each paragraph is represented in the web data by one paragraph tag.
Multiple paragraph tags have the same parent node, which may be a DIV tag, and multiple parent nodes may in turn have the same ancestor node, and so on up to the root node.
All paragraph tags of the article may share a single parent node; alternatively, different groups of paragraph tags in the article have different parent nodes, and these parent nodes in turn have the same ancestor node.
For any tag in the web data, the cumulative reward-penalty score of the tag can be calculated according to the type of the tag, the class attribute value of the tag, the number of text characters in the tag, and the link-to-text ratio of the tag; the same operation is performed for every other tag in the web data. Heading tags, paragraph tags, DIV tags, a tags and the like are tags of different types.
The embodiment of the present invention does not limit the specific method of calculating the cumulative reward-penalty score of a tag from its type, class attribute value, number of text characters and link-to-text ratio; any calculation method in the prior art may be used.
The cumulative reward-penalty scores of the parent node and of the ancestor node of each tag in the web data are then calculated from the cumulative reward-penalty scores of the paragraph tags. When multiple paragraph tags have the same parent node, their cumulative reward-penalty scores are added to obtain the cumulative reward-penalty score of the parent node. When multiple parent nodes have the same ancestor node, the cumulative reward-penalty scores of those parent nodes are added to obtain the cumulative reward-penalty score of the ancestor node.
Then, from the cumulative reward-penalty scores of the tags, of the parent nodes and of the ancestor nodes, the preset number of largest cumulative reward-penalty scores are taken as the reward-penalty score feature of the target webpage.
The preset number can be 3, 4, 5 or the like; the embodiment of the present invention does not limit this.
In embodiments of the present invention, in a text webpage, the cumulative reward-penalty score of a parent node is obtained by adding the cumulative reward-penalty scores of multiple paragraph tags, and the cumulative reward-penalty score of an ancestor node is obtained from the cumulative reward-penalty scores of multiple parent nodes. Therefore, in a text webpage, each of the preset number of largest cumulative reward-penalty scores differs greatly from the other cumulative reward-penalty scores.
In a non-text webpage, there is no article and hence no paragraphs and no paragraph tags, so the DOM tree of a non-text webpage contains no parent node made up of multiple paragraph tags. In a non-text webpage, the differences between the cumulative reward-penalty scores of the tags are therefore small.
Therefore, the preset number of largest cumulative reward-penalty scores can be used as the reward-penalty score feature of the target webpage, and that feature can serve as one decision factor for judging whether the target webpage is a text webpage.
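The aggregation of scores up the DOM tree and the selection of the largest values can be sketched as below. The per-tag scoring method is left open by the description, so the sketch takes already computed paragraph scores as input. The representation of the tree as a child-to-parent mapping and all names are assumptions made for this illustration.

```python
import heapq
from collections import defaultdict


def top_cumulative_scores(parent_of, paragraph_scores, k=3):
    """Propagate each paragraph score to its parent and all ancestors,
    then return the k largest cumulative scores.

    parent_of: maps a node id to its parent's id (the root has no entry).
    paragraph_scores: maps a paragraph node id to its own score.
    """
    totals = defaultdict(float)
    for node, score in paragraph_scores.items():
        totals[node] += score
        ancestor = parent_of.get(node)
        while ancestor is not None:      # walk up to the root node
            totals[ancestor] += score
            ancestor = parent_of.get(ancestor)
    return heapq.nlargest(k, totals.values())


# Hypothetical tree: p1 and p2 under div1, p3 under div2, both divs under body.
example_tree = {"p1": "div1", "p2": "div1", "p3": "div2",
                "div1": "body", "div2": "body"}
example_scores = {"p1": 5, "p2": 7, "p3": 2}
feature = top_cumulative_scores(example_tree, example_scores, k=3)
```

In this example div1 accumulates 12 and body accumulates 14, so the container nodes stand out from the individual paragraph tags, which is the gap the description relies on.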
16) Input the structure feature, the semantic feature, the text feature and the reward-penalty score feature into a preset webpage classifier, and determine whether the target webpage is a text webpage or a non-text webpage according to the output of the webpage classifier.
The webpage classifier is used to judge, according to the feature data of a webpage, whether the webpage is a text webpage or a non-text webpage.
Technical staff can prepare multiple sample text webpages in advance and train the webpage classifier according to the structure features, semantic features, text features and reward-penalty score features of the sample text webpages.
When training the webpage classifier, the XGBoost algorithm may be used; other algorithms may of course also be used, and the embodiment of the present invention does not limit this.
In embodiments of the present invention, after the structure feature, the semantic feature, the text feature and the reward-penalty score feature of a webpage are input into the trained webpage classifier, the webpage classifier processes these features and outputs a result indicating whether the webpage is a text webpage or a non-text webpage.
In step S102, if the target webpage is a text webpage, the article in the target webpage is extracted using a preset extraction rule, the preset extraction rule being used for extracting articles from text webpages.
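The two-step flow (identify first, extract only from text webpages) can be sketched as follows. The three callables are placeholders supplied by the caller: the description fixes neither the feature extractor, nor the trained classifier, nor the preset extraction rule, so their names and signatures here are hypothetical.

```python
from typing import Callable, Optional, Sequence


def process_page(
    web_data: str,
    featurize: Callable[[str], Sequence[float]],
    is_text_page: Callable[[Sequence[float]], bool],
    extract_article: Callable[[str], str],
) -> Optional[str]:
    """Step 1: classify the page. Step 2: extract only if it is a text webpage."""
    features = featurize(web_data)
    if not is_text_page(features):
        return None                  # non-text webpage: skip the extraction flow
    return extract_article(web_data)
```

Because the extraction rule is only ever called on pages that passed the classifier, it can be trained and tuned on text webpages alone.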
A text webpage contains an article, whereas a non-text webpage does not. In the prior art, an extraction rule is applied directly to a webpage to extract the article. The webpage structure of text webpages differs greatly from that of non-text webpages, although different text webpages have essentially the same structure as one another and different non-text webpages have essentially the same structure as one another. An extraction rule in the prior art therefore has to take both text webpages and non-text webpages into account, which increases the difficulty of training the extraction rule in advance.
For example, technical staff have to collect in advance the sample feature data of multiple non-text webpages and of multiple text webpages, and then train the extraction rule on both. Because the webpage structures of text webpages and non-text webpages differ greatly, the two sets of sample feature data are completely different kinds of feature data, and an extraction rule trained on two different kinds of feature data tends to be more complex.
Secondly, subsequently optimizing the extraction rule is more complex and less efficient. For example, if the accuracy of extracting articles from webpages with a prior-art extraction rule is low, the rule has to be optimized to improve the accuracy. Because the prior-art extraction rule must take both non-text webpages and text webpages into account, the optimization must do so as well, which makes optimizing the extraction rule more complex and less efficient.
In embodiments of the present invention, the task of extracting an article is divided into two steps. The first step identifies whether the target webpage is a text webpage. If it is, the second step extracts the article from the target webpage using the preset extraction rule, which is used for extracting articles from text webpages. If the target webpage is a non-text webpage, the article extraction flow need not be executed. The preset extraction rule in embodiments of the present invention therefore only has to take text webpages into account, not non-text webpages, which reduces the difficulty of training the preset extraction rule in advance and makes the rule more focused and effective. For example, technical staff only need to collect the sample feature data of multiple text webpages in advance, without collecting the sample feature data of non-text webpages, and then train the preset extraction rule on the sample feature data of text webpages alone. Training on only one kind of feature data makes the resulting preset extraction rule simpler.
Secondly, subsequently optimizing the preset extraction rule is less complex and more efficient. For example, if the accuracy of extracting articles from text webpages with the preset extraction rule is low, the rule has to be optimized to improve the accuracy. Because the preset extraction rule only has to take text webpages into account, the optimization only has to consider text webpages as well. Compared with the prior art, in which optimizing the extraction rule requires taking both non-text webpages and text webpages into account, embodiments of the present invention reduce the complexity of optimizing the preset extraction rule and improve the optimization efficiency.
Further, the web data of the target webpage contains data that is not used when judging whether the target webpage is a text webpage, such as the data in script tags, the data in style tags, the data in conments tags and the data in tags that have a hidden attribute. Such data does not help in judging whether the target webpage is a text webpage. The data in tags that have a hidden attribute is not displayed on the webpage.
In addition, some data is displayed on both text webpages and non-text webpages, for example the statutory authority marks of a webpage. These marks are displayed on every webpage, but they are not used when judging whether the target webpage is a text webpage.
Embodiments of the present invention obtain the feature data of the target webpage from its web data in order to judge whether the target webpage is a text webpage. Obtaining the feature data requires traversing the entire web data of the target webpage to find the feature data used in the judgment. If the web data contains a large amount of data that is not used in the judgment, it takes longer to obtain the feature data that is used, so more time is needed to determine whether the target webpage is a text webpage, which reduces the efficiency of the judgment.
Therefore, to improve the efficiency of judging whether the target webpage is a text webpage, the data in at least the following tags can be removed from the web data of the target webpage after this step: the data in script tags, the data in style tags, the data in conments tags, the data in tags that have a hidden display attribute, and the data that is displayed on every webpage.
Thus, in another embodiment of the present invention, the web data in steps 12) to 15) is the web data that remains after the data in script tags, the data in style tags, the data in conments tags, the data in tags that have a hidden display attribute and the data that is displayed on every webpage have been removed from the web data obtained in step 11).
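The removal step can be sketched as below. The sketch covers only part of what the description lists: it drops script and style elements and elements that carry a hidden attribute or an inline "display:none" style, which is one assumed reading of "hidden display attribute". It does not handle the data shown on every webpage, and it assumes reasonably well-formed markup in which every non-void start tag has a matching end tag. All names are hypothetical.

```python
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
DROPPED_TAGS = {"script", "style"}


def _is_hidden(attrs):
    attr_map = dict(attrs)
    style = (attr_map.get("style") or "").replace(" ", "").lower()
    return "hidden" in attr_map or "display:none" in style


class Pruner(HTMLParser):
    """Re-emits the markup while skipping dropped and hidden elements.
    HTML comments and declarations are not re-emitted either, because the
    corresponding handlers are not overridden."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0   # > 0 while inside a removed element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:              # void elements have no end tag
            if not self.skip_depth:
                self.out.append(self.get_starttag_text())
        elif self.skip_depth or tag in DROPPED_TAGS or _is_hidden(attrs):
            self.skip_depth += 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def prune_web_data(web_data: str) -> str:
    pruner = Pruner()
    pruner.feed(web_data)
    pruner.close()
    return "".join(pruner.out)
```

Running the feature extraction on the pruned output rather than on the raw web data means the later traversals touch less data.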
Fig. 2 is a block diagram of a webpage processing device according to an exemplary embodiment. Referring to Fig. 2, the device includes:
Identification module 11, for identifying whether a target webpage is a text webpage;
Extraction module 12, for extracting the article in the target webpage using a preset extraction rule if the target webpage is a text webpage, the preset extraction rule being used for extracting articles from text webpages.
Wherein, the identification module 11 includes:
First acquisition unit, for obtaining the web data of the target webpage;
Second acquisition unit, for obtaining the structure feature of the target webpage according to the web data;
Third acquisition unit, for obtaining the semantic feature of the target webpage according to the web data;
Fourth acquisition unit, for obtaining the text feature of the target webpage according to the web data;
Fifth acquisition unit, for obtaining the reward-penalty score feature of the target webpage according to the web data;
Determination unit, for inputting the structure feature, the semantic feature, the text feature and the reward-penalty score feature into a preset webpage classifier, and determining whether the target webpage is a text webpage or a non-text webpage according to the output of the webpage classifier.
Further, the identification module 11 further includes:
Removal unit, for removing from the web data the data in at least the following tags: the data in script tags, the data in style tags, the data in conments tags and the data in tags that have a hidden attribute.
Wherein, first acquisition module includes:
First obtaining subunit, for obtaining the number of heading tags, the number of paragraph tags, the number of DIV tags and the number of a tags in the web data;
First calculating subunit, for calculating the link-to-text ratio of the web data;
Second obtaining subunit, for obtaining the number of pictures in the web data whose pixel count exceeds a preset threshold;
First judging subunit, for judging whether a paging keyword is present in the web data, the paging keywords including at least: home page, previous page, next page, last page and full text;
First determining subunit, for using the number of heading tags, the number of paragraph tags, the number of DIV tags, the number of a tags, the link-to-text ratio, the number of pictures, and the information indicating whether a paging keyword is present in the web data as the structure feature of the target webpage.
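Most of the structure feature listed above can be gathered in a single pass over the web data, as in the following sketch. It makes several assumptions: heading tags are h1 to h6, the link-to-text ratio is the number of characters inside a tags divided by the number of all text characters, and the English paging keywords are assumed renderings of the ones named in the text. The picture count by pixel threshold is omitted, and all names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Assumed English renderings of the paging keywords.
PAGING_KEYWORDS = ("home page", "previous page", "next page", "last page", "full text")


class StructureCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.a_depth = 0
        self.link_chars = 0
        self.all_chars = 0
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags["heading" if tag in HEADINGS else tag] += 1
        if tag == "a":
            self.a_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.a_depth:
            self.a_depth -= 1

    def handle_data(self, data):
        n = len("".join(data.split()))   # non-whitespace characters
        self.all_chars += n
        if self.a_depth:
            self.link_chars += n
        self.text.append(data.lower())


def structure_features(web_data: str) -> dict:
    c = StructureCounter()
    c.feed(web_data)
    c.close()
    text = " ".join(c.text)
    return {
        "headings": c.tags["heading"],
        "paragraphs": c.tags["p"],
        "divs": c.tags["div"],
        "links": c.tags["a"],
        "link_text_ratio": c.link_chars / c.all_chars if c.all_chars else 0.0,
        "has_paging_keyword": any(k in text for k in PAGING_KEYWORDS),
    }
```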
Wherein, the second acquisition unit is specifically used for: obtaining from the web data the number of tags whose class attribute value is content or article, and using that number as the semantic feature of the target webpage.
Wherein, the third acquiring unit includes:
Second judging subunit, for judging whether the data in the heading tag of the web data is present in the head tag of the web data;
First counting subunit, for counting the number of punctuation marks in the data other than the a tags of the web data;
Second calculating subunit, for calculating the ratio between the number of punctuation marks and the total number of characters in the data other than the a tags of the web data;
Second counting subunit, for counting the number of predetermined keywords in the web data, the predetermined keywords including at least: comment, source, details, view more and view full text;
Third obtaining subunit, for obtaining the maximum of the numbers of text characters in the paragraph tags of the web data;
Third calculating subunit, for calculating the average number of text characters per paragraph tag in the web data;
Second determining subunit, for using the information indicating whether the data in the heading tag of the web data is present in the head tag of the web data, the number of punctuation marks, the ratio, the number of predetermined keywords, the maximum number and the average number as the text feature of the target webpage.
Wherein, the 4th acquiring unit includes:
Fourth calculating subunit, for calculating, for each tag in the web data, the cumulative reward-penalty score of the tag according to the type of the tag, the class attribute value of the tag, the number of text characters in the tag and the link-to-text ratio of the tag;
Fifth calculating subunit, for calculating the cumulative reward-penalty score of the parent node and the cumulative reward-penalty score of the ancestor node of each tag in the web data according to the cumulative reward-penalty score of each tag;
Fourth obtaining subunit, for obtaining the preset number of largest cumulative reward-penalty scores as the reward-penalty score feature.
Regarding the device in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiment of the method and will not be elaborated here.
Those skilled in the art, after considering the specification and practicing the invention disclosed here, will readily conceive of other embodiments of the present invention. This application is intended to cover any variations, uses or adaptations of the embodiments of the present invention that follow the general principles of the embodiments and include common knowledge or customary technical means in the art not disclosed by the embodiments. The specification and examples are to be regarded as illustrative only, with the true scope and spirit of the embodiments of the present invention being indicated by the appended claims.
It should be understood that the embodiments of the present invention are not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from their scope. The scope of the embodiments of the present invention is limited only by the appended claims.
Claims (14)
1. A webpage processing method, characterized in that the method comprises:
identifying whether a target webpage is a text webpage;
if the target webpage is a text webpage, extracting the article in the target webpage using a preset extraction rule, the preset extraction rule being used for extracting articles from text webpages.
2. The method according to claim 1, characterized in that identifying whether the target webpage is a text webpage comprises:
obtaining the web data of the target webpage;
obtaining the structure feature of the target webpage according to the web data;
obtaining the semantic feature of the target webpage according to the web data;
obtaining the text feature of the target webpage according to the web data;
obtaining the reward-penalty score feature of the target webpage according to the web data;
inputting the structure feature, the semantic feature, the text feature and the reward-penalty score feature into a preset webpage classifier, and determining whether the target webpage is a text webpage or a non-text webpage according to the output of the webpage classifier.
3. The method according to claim 2, characterized in that, after obtaining the web data of the target webpage, the method further comprises:
removing from the web data the data in at least the following tags: the data in script tags, the data in style tags, the data in conments tags and the data in tags that have a hidden attribute.
4. The method according to claim 2 or 3, characterized in that obtaining the structure feature of the target web page according to the web page data comprises:
obtaining the number of heading tags, the number of paragraph tags, the number of DIV tags, and the number of a tags in the web page data;
calculating the link-to-text ratio of the web page data;
obtaining the number of pictures in the web page data whose pixel count exceeds a preset threshold;
determining whether a paging keyword is present in the web page data, the paging keywords including at least: first page, previous page, next page, last page, and full text; and
using the number of heading tags, the number of paragraph tags, the number of DIV tags, the number of a tags, the link-to-text ratio, the number of pictures, and information indicating whether a paging keyword is present in the web page data as the structure feature of the target web page.
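The structure feature of claim 4 can be approximated with a single pass over the markup. The sketch below rests on several assumptions that the claim leaves open: heading tags are taken to be `h1` to `h6`, the link-to-text ratio is taken as characters inside `a` tags divided by all text characters, a picture's pixel count is read from its `width` and `height` attributes, the threshold `MIN_PIXELS` is arbitrary, and the English paging keywords stand in for the original-language terms.

```python
from html.parser import HTMLParser

PAGING_KEYWORDS = ("first page", "previous page", "next page", "last page", "full text")
MIN_PIXELS = 10_000  # hypothetical threshold; the claim only says "preset"


class StructureCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.counts = {"heading": 0, "p": 0, "div": 0, "a": 0, "big_images": 0}
        self.link_depth = 0   # > 0 while inside an <a> element
        self.link_chars = 0
        self.total_chars = 0
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.counts["heading"] += 1
        elif tag in ("p", "div", "a"):
            self.counts[tag] += 1
            if tag == "a":
                self.link_depth += 1
        elif tag == "img":
            attr = dict(attrs)
            try:
                pixels = int(attr.get("width") or 0) * int(attr.get("height") or 0)
            except ValueError:  # e.g. width="50%"
                pixels = 0
            if pixels > MIN_PIXELS:
                self.counts["big_images"] += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.link_depth:
            self.link_chars += n
        self.text_parts.append(data)


def structure_features(html: str) -> list:
    c = StructureCounter()
    c.feed(html)
    c.close()
    text = "".join(c.text_parts).lower()
    ratio = c.link_chars / c.total_chars if c.total_chars else 0.0
    has_paging = int(any(k in text for k in PAGING_KEYWORDS))
    return [c.counts["heading"], c.counts["p"], c.counts["div"], c.counts["a"],
            ratio, c.counts["big_images"], has_paging]


sample = ('<div><h1>Title</h1><p>Body text here.</p>'
          '<a href="/2">next page</a><img width="200" height="100"></div>')
features = structure_features(sample)
```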
5. The method according to claim 2 or 3, characterized in that obtaining the semantic feature of the target web page according to the web page data comprises:
obtaining, from the web page data, the number of tags whose class attribute value is content or article, and using that number as the semantic feature of the target web page.
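A minimal sketch of the semantic feature of claim 5, using the standard `html.parser`. How a multi-valued class attribute should be matched is not stated in the claim; the sketch assumes a case-insensitive match against each whitespace-separated class name, so `contents` does not count. `SemanticCounter` and `semantic_feature` are hypothetical names.

```python
from html.parser import HTMLParser

SEMANTIC_CLASSES = {"content", "article"}


class SemanticCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        class_value = dict(attrs).get("class") or ""
        # Assumption: match whole class names, ignoring case.
        if SEMANTIC_CLASSES & set(class_value.lower().split()):
            self.count += 1


def semantic_feature(html: str) -> int:
    counter = SemanticCounter()
    counter.feed(html)
    counter.close()
    return counter.count


sample = ('<div class="content"><p class="Article lead">x</p>'
          '<span class="contents">y</span></div>')
semantic_count = semantic_feature(sample)
```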
6. The method according to claim 2 or 3, characterized in that obtaining the text feature of the target web page according to the web page data comprises:
determining whether the data in the heading tags of the web page data is present in the head tag of the web page data;
counting the number of punctuation marks in the data of the web page data other than the a tags;
calculating the ratio between the number of punctuation marks and the total number of characters in the data of the web page data other than the a tags;
counting the number of preset keywords in the web page data, the preset keywords including at least: comment, source, details, view more, and view full text;
obtaining the maximum among the numbers of text characters in the individual paragraph tags of the web page data;
calculating the average number of text characters per paragraph tag in the web page data; and
using information indicating whether the data in the heading tags of the web page data is present in the head tag of the web page data, the number of punctuation marks, the ratio, the number of preset keywords, the maximum number, and the average number as the text feature of the target web page.
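The six values of claim 6 can be collected in one parse, as in the sketch below. It assumes that heading tags means `h1` to `h6`, that punctuation is any character in a Unicode `P*` category, that whitespace is not counted as a character, and that the English keywords stand in for the original-language terms. The input is assumed to have already had scripts and styles removed. `TextCollector` and `text_features` are hypothetical names.

```python
import unicodedata
from html.parser import HTMLParser

KEYWORDS = ("comment", "source", "details", "view more", "view full text")
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []          # names of currently open tags
        self.head_text = []
        self.headings = []       # text of each heading tag
        self.non_link_text = []  # all text outside <a> tags
        self.all_text = []
        self.paragraphs = []     # text of each <p> tag

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "p":
            self.paragraphs.append("")
        elif tag in HEADINGS:
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag; this also
            # discards void tags such as <meta> that were never closed.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        self.all_text.append(data)
        if "head" in self.stack:
            self.head_text.append(data)
        if "a" not in self.stack:
            self.non_link_text.append(data)
        if "p" in self.stack:
            self.paragraphs[-1] += data
        if HEADINGS & set(self.stack):
            self.headings[-1] += data


def text_features(html: str) -> list:
    c = TextCollector()
    c.feed(html)
    c.close()
    head = "".join(c.head_text)
    title_in_head = int(any(h.strip() and h.strip() in head for h in c.headings))
    chars = [ch for ch in "".join(c.non_link_text) if not ch.isspace()]
    punct = sum(1 for ch in chars if unicodedata.category(ch).startswith("P"))
    ratio = punct / len(chars) if chars else 0.0
    lowered = "".join(c.all_text).lower()
    keyword_count = sum(lowered.count(k) for k in KEYWORDS)
    lengths = [len(p.strip()) for p in c.paragraphs]
    longest = max(lengths, default=0)
    average = sum(lengths) / len(lengths) if lengths else 0.0
    return [title_in_head, punct, ratio, keyword_count, longest, average]


sample = ('<html><head><title>Big News - Site</title></head><body>'
          '<h1>Big News</h1><p>Hello, world.</p><p>Short!</p>'
          '<a href="#c">view more</a></body></html>')
features = text_features(sample)
```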
7. The method according to claim 2 or 3, characterized in that obtaining the reward-penalty score feature of the target web page according to the web page data comprises:
for each tag in the web page data, calculating a cumulative reward-penalty score of the tag according to the type of the tag, the class attribute value of the tag, the number of text characters in the tag, and the link-to-text ratio of the tag;
calculating, according to the cumulative reward-penalty score of each tag, the cumulative reward-penalty points of the parent node and of the ancestor nodes of each tag in the web page data; and
taking a preset number of the largest cumulative reward-penalty points as the reward-penalty score feature.
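Claim 7 names the inputs of the per-tag score but not the formula, so the sketch below fills the gaps with clearly hypothetical choices: the bonuses in `TAG_BONUS`, the class-name lists `GOOD` and `BAD`, the length bonus, the link penalty, and the rule that a tag's score is added in full to its parent and halved for each further ancestor are all assumptions made for illustration.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional

TAG_BONUS = {"div": 5, "p": 3}                       # hypothetical weights
GOOD, BAD = ("content", "article"), ("sidebar", "comment", "footer")
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


@dataclass(eq=False)
class Node:
    tag: str
    classes: str = ""
    parent: Optional["Node"] = None
    text_chars: int = 0   # text characters inside the tag, descendants included
    link_chars: int = 0   # of those, characters that sit inside an <a>
    score: float = 0.0    # the tag's own reward-penalty score
    total: float = 0.0    # own score plus contributions from descendants


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root
        self.nodes: List[Node] = []

    def handle_starttag(self, tag, attrs):
        node = Node(tag, (dict(attrs).get("class") or "").lower(), self.current)
        self.nodes.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        chain, node = [], self.current
        while node is not self.root:
            chain.append(node)
            node = node.parent
        in_link = any(n.tag == "a" for n in chain)
        for n in chain:
            n.text_chars += len(data.strip())
            if in_link:
                n.link_chars += len(data.strip())


def tag_score(node: Node) -> float:
    ratio = node.link_chars / node.text_chars if node.text_chars else 0.0
    score = TAG_BONUS.get(node.tag, 0) + min(node.text_chars // 10, 3)
    score += 25 * any(g in node.classes for g in GOOD)   # reward
    score -= 25 * any(b in node.classes for b in BAD)    # penalty
    return score - 10 * ratio                            # penalise link-heavy tags


def reward_penalty_feature(html: str, k: int = 3) -> List[float]:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    for n in builder.nodes:
        n.score = n.total = tag_score(n)
    for n in builder.nodes:
        weight, anc = 1.0, n.parent
        while anc is not builder.root:   # parent gets the full score, then halve
            anc.total += n.score * weight
            weight *= 0.5
            anc = anc.parent
    top = sorted((n.total for n in builder.nodes), reverse=True)[:k]
    return top + [0.0] * (k - len(top))  # pad so the feature has a fixed length


sample = (f'<div class="content"><p>{"a" * 20}</p><p>{"b" * 10}</p></div>'
          f'<div class="sidebar"><a href="#">{"c" * 10}</a></div>')
feature = reward_penalty_feature(sample, k=3)
```

In the sample, the `content` container collects its own reward plus the scores of its two paragraphs, while the link-only sidebar ends up with a negative total and does not reach the top three.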
8. A web page processing device, characterized in that the device comprises:
an identification module, configured to identify whether a target web page is a body-text web page; and
an extraction module, configured to, if the target web page is a body-text web page, extract the article in the target web page using a preset extraction rule, the preset extraction rule being used to extract articles from body-text web pages.
9. The device according to claim 8, characterized in that the identification module comprises:
a first acquisition unit, configured to obtain web page data of the target web page;
a second acquisition unit, configured to obtain a structure feature of the target web page according to the web page data;
a third acquisition unit, configured to obtain a semantic feature of the target web page according to the web page data;
a fourth acquisition unit, configured to obtain a text feature of the target web page according to the web page data;
a fifth acquisition unit, configured to obtain a reward-penalty score feature of the target web page according to the web page data; and
a determination unit, configured to input the structure feature, the semantic feature, the text feature, and the reward-penalty score feature into a preset web page classifier, and to determine, according to the output of the web page classifier, whether the target web page is a body-text web page or a non-body-text web page.
10. The device according to claim 9, characterized in that the identification module further comprises:
a removal unit, configured to remove from the web page data the data in at least the following tags: script tags, style tags, comments tags, and tags having a hidden attribute.
11. The device according to claim 9 or 10, characterized in that the first acquisition module comprises:
a first obtaining subunit, configured to obtain the number of heading tags, the number of paragraph tags, the number of DIV tags, and the number of a tags in the web page data;
a first calculation subunit, configured to calculate the link-to-text ratio of the web page data;
a second obtaining subunit, configured to obtain the number of pictures in the web page data whose pixel count exceeds a preset threshold;
a first judgment subunit, configured to determine whether a paging keyword is present in the web page data, the paging keywords including at least: first page, previous page, next page, last page, and full text; and
a first determination subunit, configured to use the number of heading tags, the number of paragraph tags, the number of DIV tags, the number of a tags, the link-to-text ratio, the number of pictures, and information indicating whether a paging keyword is present in the web page data as the structure feature of the target web page.
12. The device according to claim 9 or 10, characterized in that the second acquisition unit is specifically configured to: obtain, from the web page data, the number of tags whose class attribute value is content or article, and use that number as the semantic feature of the target web page.
13. The device according to claim 9 or 10, characterized in that the third acquisition unit comprises:
a second judgment subunit, configured to determine whether the data in the heading tags of the web page data is present in the head tag of the web page data;
a first counting subunit, configured to count the number of punctuation marks in the data of the web page data other than the a tags;
a second calculation subunit, configured to calculate the ratio between the number of punctuation marks and the total number of characters in the data of the web page data other than the a tags;
a second counting subunit, configured to count the number of preset keywords in the web page data, the preset keywords including at least: comment, source, details, view more, and view full text;
a third obtaining subunit, configured to obtain the maximum among the numbers of text characters in the individual paragraph tags of the web page data;
a third calculation subunit, configured to calculate the average number of text characters per paragraph tag in the web page data; and
a second determination subunit, configured to use information indicating whether the data in the heading tags of the web page data is present in the head tag of the web page data, the number of punctuation marks, the ratio, the number of preset keywords, the maximum number, and the average number as the text feature of the target web page.
14. The device according to claim 9 or 10, characterized in that the fourth acquisition unit comprises:
a fourth calculation subunit, configured to calculate, for each tag in the web page data, a cumulative reward-penalty score of the tag according to the type of the tag, the class attribute value of the tag, the number of text characters in the tag, and the link-to-text ratio of the tag;
a fifth calculation subunit, configured to calculate, according to the cumulative reward-penalty score of each tag, the cumulative reward-penalty points of the parent node and of the ancestor nodes of each tag in the web page data; and
a fourth obtaining subunit, configured to take a preset number of the largest cumulative reward-penalty points as the reward-penalty score feature.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710212470.8A CN108664522A (en) | 2017-04-01 | 2017-04-01 | Web page processing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108664522A true CN108664522A (en) | 2018-10-16 |
Family
ID=63784668
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710212470.8A Pending CN108664522A (en) | 2017-04-01 | 2017-04-01 | Web page processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108664522A (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101727461A (en) * | 2008-10-13 | 2010-06-09 | 中国科学院计算技术研究所 | Method for extracting content of web page |
US20110072343A1 (en) * | 2009-08-21 | 2011-03-24 | Yieldbuild, Inc. | Optimizing online advertising link and text characteristics |
CN102841920A (en) * | 2012-06-30 | 2012-12-26 | 北京百度网讯科技有限公司 | Method and device for extracting webpage frame information |
CN103838801A (en) * | 2012-11-27 | 2014-06-04 | 大连灵动科技发展有限公司 | Webpage theme information extraction method |
CN104331498A (en) * | 2014-11-19 | 2015-02-04 | 亚信科技(南京)有限公司 | Method for automatically classifying webpage content visited by Internet users |
CN105205090A (en) * | 2015-05-29 | 2015-12-30 | 湖南大学 | Web page text classification algorithm research based on web page link analysis and support vector machine |
CN105677764A (en) * | 2015-12-30 | 2016-06-15 | 百度在线网络技术(北京)有限公司 | Information extraction method and device |
CN105718522A (en) * | 2016-01-15 | 2016-06-29 | 北京傲游天下科技有限公司 | Browser body content presentation method |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111859873A (en) * | 2020-07-30 | 2020-10-30 | 京华信息科技股份有限公司 | Document footnote conversion method |
CN113378088A (en) * | 2021-06-24 | 2021-09-10 | 中国电子信息产业集团有限公司第六研究所 | Webpage text extraction method, device, equipment and storage medium |
CN113378088B (en) * | 2021-06-24 | 2024-01-19 | 中国电子信息产业集团有限公司第六研究所 | Webpage text extraction method, device, equipment and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181016 |