CN105740355B

CN105740355B - Web page body text extraction method and device based on aggregated text density

Info

Publication number: CN105740355B
Application number: CN201610050995.1A
Authority: CN
Inventors: 刘忠; 陈发君; 黄金才; 朱承; 修保新; 程光权; 陈超; 冯旸赫
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2016-01-26
Filing date: 2016-01-26
Publication date: 2019-03-26
Anticipated expiration: 2036-01-26
Also published as: CN105740355A

Abstract

The present invention provides a web page body text extraction method and device based on aggregated text density. The method splits the text content of a web page at the HTML tags that divide the page, so that each class of text in the page is effectively separated. No site-specific extraction rules need to be customized, so the method is general; no complex text mining is used, so the method is simple and efficient and extracts body text from all kinds of web pages accurately.

Description

Web page body text extraction method and device based on aggregated text density

Technical field

The present invention relates to the technical field of web crawlers, and in particular to a web page body text extraction method and device based on aggregated text density.

Background technique

With the rapid development of the information society, the internet has become an important source of information. Users usually view web content directly in a browser. In addition, many internet-based information processing tasks (such as information retrieval, data mining, and machine translation) take the information content of web pages as their basic data and mainly process the body text of those pages. Besides useful information (such as body content), most web pages also contain a great deal of noise, such as site navigation, related links and advertisements, copyright notices, and script code. How to extract the body text of a web page accurately and efficiently, neither omitting body text nor mixing in noise, has become an important topic in network information extraction and its applications, and has high application value and practical significance.

A variety of extracting methods exist in the prior art for this problem:

1) Body text extraction based on the DOM tree structure

First, non-standard structure or information in the page's HTML file is repaired (for example, a start tag such as <h1> with no matching end tag) to produce a standard HTML file. The HTML file is then parsed into a DOM (Document Object Model) tree. Finally, the DOM tree is traversed to identify and reject non-body information, and the body text is extracted according to rules such as page layout and text density. The page structure of many websites is now increasingly complex and increasingly non-standard, so the DOM tree may fail to build, and with it the body text extraction and the extraction template. Building and then traversing the DOM tree has high time and space complexity, so the approach is inefficient and slow. Noise identification relies on manually maintained and updated information (such as advertisement server lists) and cannot be automated.

2) Rule-based body text extraction

Extraction rules, such as regular expressions or XPath expressions, are specified manually for a particular website. The advantage is high accuracy. The disadvantages are that the rules are not general and cannot be extended, can parse only pages from a fixed website or in a fixed format, and are time-consuming and laborious to write. Once the page layout changes, it is difficult to notice the change and update the rules in time.

3) Text block extraction based on page segmentation

The text blocks in a page are separated using separator lines and visual information in the HTML tags (such as text color, font size, and text information). Because HTML styles differ between websites, there is no unified segmentation method and generality is hard to guarantee; many auxiliary manual rules must be added.

4) Body text extraction based on data mining and machine learning

This method includes the following steps: linearize and reconstruct the page code so that the logical order of text paragraphs is not destroyed by irregular tag nesting; filter HTML noise tags; parse and store text paragraphs in units of <table> tags; cluster the paragraphs with a text clustering algorithm and finally generate the body text. The problem is that a simple task is made complicated, so extracting the body text becomes cumbersome, which hinders large-scale use.

Summary of the invention

The object of the invention is to provide a web page body text extraction method and device based on aggregated text density that addresses the technical problems of the prior art described in the background above.

The present invention provides a web page body text extraction method based on aggregated text density, comprising the following steps. Step S100: obtain the HTML source text of the web page, delete the worthless first tags, and reject the special characters in the text to obtain a sample text. Step S200: replace every second tag in the sample text with blank lines, yielding multiple texts separated by blank lines, and convert these texts into a queue T in which every two adjacent texts are separated by newline characters. Step S300: split queue T into multiple subqueues according to a text threshold and an index threshold, merge all the text in each subqueue into one text block, and form a queue B from the text blocks. Step S400: choose the text with the greatest length in queue B as the body text of the web page. The index threshold is the preset number of blank lines between any two subqueues, and the text threshold is the preset number of text characters contained in a subqueue.

Further, in step S200 the second tags are replaced using regular expressions, with the replacement rule R[("i", n)], where "i" is a second tag and n is the number of blank lines that replace that tag.

Further, step S300 comprises the following steps:

Step S310: loop over queue T and denote the current element Tc. If the number of valid Chinese characters in Tc is less than the text threshold, add the text of Tc to queue B and continue traversing queue T. If the number of valid Chinese characters in Tc is greater than the text threshold, denote Tc as the current valid text Tcv and create a temporary text block Temp whose value is the text of Tcv.

Step S320: traverse queue T starting from the element after the current valid text Tcv, ignoring space or blank-line elements, until the next valid text Ncv is found. If the difference between the position indexes of Ncv and Tcv in queue T is less than the index threshold, append the text of Ncv to the temporary text block Temp and assign Ncv to Tcv.

Step S330: continue traversing queue T to the next valid element Ncv_i+2 after the next valid text Ncv. If the difference between the position indexes of Ncv_i+2 and the current valid text Tcv in queue T is greater than the index threshold, put a copy of the temporary text block Temp into queue B, assign Ncv_i+2 to the current element Tc, and continue looping over queue T.

Further, the first tags are worthless HTML tags.

Another aspect of the invention provides a web page body text extraction device based on aggregated text density for use with the above method, comprising: a web page HTML file acquisition module, for obtaining the HTML source text of the web page, deleting the worthless first tags, and rejecting the special characters in the text to obtain a sample text; a blank-line division module, for replacing every second tag in the sample text with blank lines, yielding multiple texts separated by blank lines, and converting these texts into a queue T in which every two adjacent texts are separated by newline characters; a queue conversion module, for splitting queue T into multiple subqueues according to a text threshold and an index threshold, merging all the text in each subqueue into one text block, and forming a queue B from the text blocks; and a body text selection module, for choosing the text with the greatest length in queue B as the body text of the web page.

Further, the second tags are replaced using regular expressions, with the replacement rule R[("i", n)], where "i" is a second tag and n is the number of blank lines that replace that tag.

Further, the queue conversion module comprises: a first loop module, for looping over queue T and denoting the current element Tc, adding the text of Tc to queue B and continuing to traverse queue T if the number of valid Chinese characters in Tc is less than the text threshold, and, if the number of valid Chinese characters in Tc is greater than the text threshold, denoting Tc as the current valid text Tcv and creating a temporary text block Temp whose value is the text of Tcv; a second loop module, for traversing queue T starting from the element after the current valid text Tcv, ignoring space or blank-line elements, until the next valid text Ncv is found, and, if the difference between the position indexes of Ncv and Tcv in queue T is less than the index threshold, appending the text of Ncv to the temporary text block Temp and assigning Ncv to Tcv; and a queue B formation module, for continuing to traverse queue T to the next valid element Ncv_i+2 after the next valid text Ncv, and, if the difference between the position indexes of Ncv_i+2 and the current valid text Tcv in queue T is greater than the index threshold, putting a copy of the temporary text block Temp into queue B, assigning Ncv_i+2 to the current element Tc, and continuing to loop over queue T.

Technical effects of the invention:

The web page body text extraction method based on aggregated text density provided by the invention needs no site-specific extraction rules and is therefore general; it uses no complex text mining, so it is simple and efficient and extracts body text from all kinds of web pages accurately. The method cleans and converts the acquired HTML according to its tags and then obtains the body text by aggregation. It requires no site-specific rules, which avoids rules with poor generality, and it neither builds nor traverses a DOM tree, which avoids the resulting inefficiency. Practical tests show that the method extracts web page body text accurately and efficiently and is applicable to all kinds of websites.

The web page body text extraction device based on aggregated text density provided by the invention uses no complex text mining, is simple and efficient, and extracts body text from all kinds of web pages accurately.

The above and other aspects of the invention will become apparent from the following description of various embodiments of the web page body text extraction method and device based on aggregated text density.

Brief description of the drawings

Fig. 1 is a flow diagram of the web page body text extraction method based on aggregated text density provided by the invention;

Fig. 2 is a structural diagram of the web page body text extraction device based on aggregated text density provided by the invention.

Specific embodiment

The accompanying drawings, which form a part of this application, provide a further understanding of the invention. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it.

Referring to Fig. 1, the web page body text extraction method based on aggregated text density provided by the invention comprises the following steps:

Step S100: obtain the HTML source text of the web page, delete the worthless first tags, and reject the special characters in the text to obtain a sample text;

Step S200: replace every second tag in the sample text with blank lines, yielding multiple texts separated by blank lines, and convert these texts into a queue T in which every two adjacent texts are separated by newline characters;

Step S300: split queue T into multiple subqueues according to a text threshold and an index threshold, merge all the text in each subqueue into one text block, and form a queue B made up of the text blocks; the index threshold is the preset number of blank lines between any two subqueues, and the text threshold is the preset number of text characters contained in a subqueue;

Step S400: choose the text with the greatest length in queue B as the body text of the web page.

The invention starts from the replacement and deletion of tags and, according to the number of text characters and the number of blank lines, divides the text in the source file into different subqueues, thereby separating the body text from text that serves other purposes. The method needs no extraction rules set manually for specific pages; replacing tags according to their type is enough to extract the body text, which improves efficiency.

The worthless first tags can be any of the common worthless HTML tags. The worthless HTML tags referred to here include, but are not limited to, comments (<!--...-->, <!...>), scripts (<script...>...</script>), the head (<head..>...</head>), style links (<link.../>), and input controls (<input../>).

Special characters are then rejected. Some characters are written as special character entities in the page source; for example, a space appears in the source as "&nbsp;". Such special characters carry no specific meaning here and are deleted. Specifically, this step checks the text of each element in queue T and rejects the common special characters it contains, including but not limited to the space ("&nbsp;"), the greater-than sign ("&gt;"), the less-than sign ("&lt;"), and the quotation mark ("&quot;").
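The cleaning in step S100 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `clean_html`, the pattern list `WORTHLESS_PATTERNS`, and the exact regular expressions are assumptions chosen to cover only the tag classes and entities named above.

```python
import re

# Hypothetical patterns for the "worthless" first tags listed in the text.
WORTHLESS_PATTERNS = [
    r"<!--.*?-->",                       # comments
    r"<![^>]*>",                         # declarations such as <!DOCTYPE ...>
    r"<script\b[^>]*>.*?</script\s*>",   # scripts, including their content
    r"<head\b[^>]*>.*?</head\s*>",       # the head section, including its content
    r"<link\b[^>]*>",                    # style links
    r"<input\b[^>]*>",                   # input controls
]

# Entities rejected in this sketch (the text says the list is not exhaustive).
SPECIAL_CHARS = ["&nbsp;", "&gt;", "&lt;", "&quot;"]


def clean_html(html: str) -> str:
    """Delete worthless tags and special character entities (step S100)."""
    for pattern in WORTHLESS_PATTERNS:
        html = re.sub(pattern, "", html, flags=re.IGNORECASE | re.DOTALL)
    for entity in SPECIAL_CHARS:
        html = html.replace(entity, "")
    return html
```

Comments are removed before declarations so that the broader `<!...>` pattern never sees a comment that contains a `>` character.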

The second tags here are the normal HTML tags that remain after the worthless first tags have been deleted. After every second tag in the HTML text is replaced with a certain number of blank lines, the text containing the body content in the sample text is separated from the content delimited by other tags.

Preferably, step S200 comprises replacing the tags among the second tags with blank lines according to the following rule. The replacement rule is a set of correspondences R[("i", n)], where "i" is a second tag and n is the number of blank lines that replace that tag. For example, R: [("div", 5), ("tr", 5), ("h1", 9), ("br", 5), ("span", 4), ("table", 2)], with the replacement carried out using regular expressions.

Specifically, every element of R is a key-value pair. The key is a tag name, such as div, tr, or h1, that is, one of the common second tags. The value is the number of blank lines substituted during tag conversion. For example, the element ("div", 5) in R means that when the detected second tag is div, its start tag or end tag is replaced with 5 newline characters ("\n"). Any second tag not listed in R is replaced with a single newline character. The replacement principle in this step is based on visual effect: a second tag that produces a larger visual gap is replaced with more blank lines. The texts in the page that are separated by blank lines are then formed into a list T, in which every two adjacent texts are separated by blank lines.
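The replacement rule can be sketched in a few lines. The rule table `R` uses the example values given above; the function name `tags_to_newlines`, the tag-matching regular expression, and the choice to keep blank lines as empty list elements (so that index distance in T reflects the number of newline characters between texts) are assumptions made for illustration.

```python
import re

# Rule table R from the text: tag name -> number of newline characters.
R = {"div": 5, "tr": 5, "h1": 9, "br": 5, "span": 4, "table": 2}
DEFAULT_NEWLINES = 1  # tags not listed in R become a single newline

# Matches a start tag or an end tag and captures the tag name.
TAG_PATTERN = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def tags_to_newlines(sample_text: str) -> list[str]:
    """Replace every remaining tag with newlines and build list T (step S200)."""

    def replace(match: re.Match) -> str:
        name = match.group(1).lower()
        return "\n" * R.get(name, DEFAULT_NEWLINES)

    text = TAG_PATTERN.sub(replace, sample_text)
    # Empty strings in the result stand for blank lines between texts.
    return text.split("\n")
```

For instance, in `tags_to_newlines("<div>标题</div><span>正文</span>")` the two texts end up 9 positions apart, because the `</div>` and `<span>` tags contribute 5 and 4 newline characters respectively.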

Step S300 is the text aggregation step. The tags have divided the web page text obtained in the preceding steps into small text blocks separated by blank lines; this step gathers small text blocks that are physically adjacent into one text block.

Specifically, step S300 comprises the following steps:

Step S310: loop over queue T and denote the current element Tc. If the number of valid Chinese characters in Tc is less than the specified text threshold (for example, 4), add the text of Tc to queue B and continue traversing queue T. If the number of valid Chinese characters in Tc is greater than the specified threshold, Tc is a valid text: denote Tc as the current valid text Tcv and create a temporary text block Temp whose value is the text of Tcv.

Step S320: traverse queue T starting from the element after Tcv, ignoring space or blank-line elements, until the next valid text Ncv is found. If the difference between the position indexes of Ncv and Tcv in queue T is less than the specified index threshold (for example, 7), append the text of Ncv to the temporary text block Temp and assign Ncv to Tcv.

Step S330: continue traversing queue T to the next valid element Ncv_i+2 after the next valid text Ncv. If the difference between the position indexes of Ncv_i+2 and the current valid text Tcv in queue T is greater than the specified index threshold, put a copy of the text block Temp into queue B, assign Ncv_i+2 to the current element Tc, and continue looping over queue T.
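Steps S310 to S330 can be read as a single pass that merges nearby valid texts. The sketch below is one possible interpretation under stated assumptions, not the patent's reference implementation: the names `count_chinese` and `aggregate` are hypothetical, "valid Chinese characters" is approximated by the CJK Unified Ideographs range, a count equal to the text threshold is treated as valid, an index difference equal to the index threshold closes the block, short elements met while scanning ahead are skipped, and whitespace-only elements are not copied into B.

```python
import re

# Assumption: "valid Chinese characters" are CJK Unified Ideographs.
CHINESE_CHAR = re.compile(r"[\u4e00-\u9fff]")


def count_chinese(text: str) -> int:
    return len(CHINESE_CHAR.findall(text))


def aggregate(T: list[str], text_threshold: int = 4,
              index_threshold: int = 7) -> list[str]:
    """Merge nearby valid texts of list T into text blocks (queue B)."""
    B = []
    i = 0
    while i < len(T):
        tc = T[i]
        if count_chinese(tc) < text_threshold:
            # S310: a short element goes to B on its own (blank ones skipped).
            if tc.strip():
                B.append(tc)
            i += 1
            continue
        temp = tc   # S310: temporary text block Temp
        tcv = i     # index of the current valid text Tcv
        j = i + 1
        while j < len(T):
            if count_chinese(T[j]) < text_threshold:
                j += 1          # S320: skip elements that are not valid text
                continue
            if j - tcv < index_threshold:
                temp += T[j]    # S320: close enough, append and advance Tcv
                tcv = j
                j += 1
            else:
                break           # S330: too far away, close the block
        B.append(temp)
        i = j                   # the far element becomes the new Tc
    return B
```

With the example thresholds 4 and 7, two paragraphs three positions apart are merged into one block, while a copyright line nine positions further on starts a new block.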

Step S400 is the body text selection step. After step S300, related texts have been gathered together (for example body text, advertisements, and links). The element with the longest text in queue B is taken; its text is the body text, and body text extraction is complete.
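Selecting the longest block is a one-line operation. In this sketch the function name `select_body` is hypothetical, and returning an empty string for an empty queue is an assumption the text does not address.

```python
def select_body(B: list[str]) -> str:
    """Return the longest text block in queue B (step S400)."""
    # max() returns the first of several equally long blocks.
    return max(B, key=len, default="")
```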

This step relies on the following properties of typical web pages: 1) the body text is contiguous and is not split up by noise such as advertisements; 2) the text blocks of the body are relatively long and not far apart; 3) the body text should be the longest text. The body text in the page is thus collected effectively, which avoids complicated procedures and algorithms and also avoids the tedium of specifying different extraction rules for different pages, improving the efficiency of extracting text from web pages.

Referring to Fig. 2, another aspect of the invention provides a device for carrying out the above method, comprising:

a web page HTML file acquisition module 100, for obtaining the HTML source text of the web page, deleting the worthless first tags, and rejecting the special characters in the text to obtain a sample text;

a blank-line division module 200, for replacing every second tag in the sample text with blank lines, yielding multiple texts separated by blank lines, and converting these texts into a queue T in which every two adjacent texts are separated by newline characters;

a queue conversion module 300, for splitting queue T into multiple subqueues according to a text threshold and an index threshold, merging all the text in each subqueue into one text block, and forming a queue B from the text blocks;

a body text selection module 400, for choosing the text with the greatest length in queue B as the body text of the web page.

The device requires neither extraction rules designed for specific pages nor manual intervention, and can effectively improve extraction efficiency.

Preferably, the second tags are replaced using regular expressions, with the replacement rule R[("i", n)], where "i" is a second tag and n is the number of blank lines that replace that tag. Extraction under this rule effectively separates invalid text from the body text and avoids the difficulty of dividing the two once they are mixed.

Preferably, the queue conversion module comprises:

a first loop module, for looping over queue T and denoting the current element Tc, adding the text of Tc to queue B and continuing to traverse queue T if the number of valid Chinese characters in Tc is less than the text threshold, and, if the number of valid Chinese characters in Tc is greater than the text threshold, denoting Tc as the current valid text Tcv and creating a temporary text block Temp whose value is the text of Tcv;

a second loop module, for traversing queue T starting from the element after the current valid text Tcv, ignoring space or blank-line elements, until the next valid text Ncv is found, and, if the difference between the position indexes of Ncv and Tcv in queue T is less than the index threshold, appending the text of Ncv to the temporary text block Temp and assigning Ncv to Tcv;

a queue B formation module, for continuing to traverse queue T to the next valid element Ncv_i+2 after the next valid text Ncv, and, if the difference between the position indexes of Ncv_i+2 and the current valid text Tcv in queue T is greater than the index threshold, putting a copy of the temporary text block Temp into queue B, assigning Ncv_i+2 to the current element Tc, and continuing to loop over queue T.

With these modules, omission of text that belongs to the body is effectively avoided; they are particularly suited to cases where the body text itself contains tags.

Those skilled in the art will appreciate that the scope of the invention is not limited to the examples discussed above, and that several changes and modifications may be made to them without departing from the scope of the invention as defined by the appended claims. Although the invention has been illustrated and described in detail in the drawings and the description, such illustration and description are illustrative or schematic only and are not restrictive. The invention is not limited to the disclosed embodiments.

From a study of the drawings, the description, and the claims, those skilled in the art can understand and implement variations of the disclosed embodiments when practicing the invention. In the claims, the term "comprising" does not exclude other steps or elements, and the indefinite article "a" or "an" does not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not mean that a combination of these measures cannot be used to advantage. Any reference signs in the claims do not limit the scope of the invention.

Claims

1. A web page body text extraction method based on aggregated text density, comprising the following steps:

Step S100: obtaining the HTML source text of a web page, deleting the worthless first tags, and rejecting the special characters in the text to obtain a sample text;

Step S200: replacing every second tag in the sample text with blank lines, yielding multiple texts separated by blank lines, and converting these texts into a queue T in which every two adjacent texts are separated by newline characters;

Step S300: splitting the queue T into multiple subqueues according to a text threshold and an index threshold, merging all the text in each subqueue into one text block, and forming a queue B from the text blocks;

Step S400: choosing the text with the greatest length in the queue B as the body text of the web page;

wherein the index threshold is the preset number of blank lines between any two subqueues, and the text threshold is the preset number of text characters contained in a subqueue;

and wherein the step S300 comprises the following steps:

Step S310: looping over the queue T and denoting the current element Tc; if the number of valid Chinese characters in the current element Tc is less than the text threshold, adding the text of the current element Tc to the queue B and continuing to traverse the queue T; if the number of valid Chinese characters in the current element Tc is greater than the text threshold, denoting the current element Tc as the current valid text Tcv and creating a temporary text block Temp whose value is the text of the current valid text Tcv;

Step S320: traversing the queue T starting from the element after the current valid text Tcv, ignoring space or blank-line elements, until the next valid text Ncv is found; if the difference between the position indexes of the next valid text Ncv and the valid text Tcv in the queue T is less than the index threshold, appending the text of the next valid text Ncv to the temporary text block Temp and assigning the next valid text Ncv to the valid text Tcv;

Step S330: continuing to traverse the queue T to the next valid element Ncv_i+2 after the next valid text Ncv; if the difference between the position indexes of the Ncv_i+2 and the current valid text Tcv in the queue T is greater than the index threshold, putting a copy of the temporary text block Temp into the queue B, assigning the Ncv_i+2 to the current element Tc, and continuing to loop over the queue T.

2. The web page body text extraction method based on aggregated text density according to claim 1, characterized in that in the step S200 the second tags are replaced using regular expressions, with the replacement rule R[("i", n)], where "i" is the second tag and n is the number of blank lines that replace that tag.

3. The web page body text extraction method based on aggregated text density according to claim 1, characterized in that the first tags are worthless HTML tags.

4. A web page body text extraction device based on aggregated text density for use with the method according to any one of claims 1 to 3, characterized by comprising:

a web page HTML file acquisition module, for obtaining the HTML source text of a web page, deleting the worthless first tags, and rejecting the special characters in the text to obtain a sample text;

a blank-line division module, for replacing every second tag in the sample text with blank lines, yielding multiple texts separated by blank lines, and converting these texts into a queue T in which every two adjacent texts are separated by newline characters;

a queue conversion module, for splitting the queue T into multiple subqueues according to a text threshold and an index threshold, merging all the text in each subqueue into one text block, and forming a queue B from the text blocks;

a body text selection module, for choosing the text with the greatest length in the queue B as the body text of the web page.

5. The web page body text extraction device based on aggregated text density according to claim 4, characterized in that the second tags are replaced using regular expressions, with the replacement rule R[("i", n)], where "i" is the second tag and n is the number of blank lines that replace that tag.

6. The web page body text extraction device based on aggregated text density according to claim 4, characterized in that the queue conversion module comprises:

a first loop module, for looping over the queue T and denoting the current element Tc; if the number of valid Chinese characters in the current element Tc is less than the text threshold, adding the text of the current element Tc to the queue B and continuing to traverse the queue T; if the number of valid Chinese characters in the current element Tc is greater than the text threshold, denoting the current element Tc as the current valid text Tcv and creating a temporary text block Temp whose value is the text of the current valid text Tcv;

a second loop module, for traversing the queue T starting from the element after the current valid text Tcv, ignoring space or blank-line elements, until the next valid text Ncv is found; if the difference between the position indexes of the next valid text Ncv and the valid text Tcv in the queue T is less than the index threshold, appending the text of the next valid text Ncv to the temporary text block Temp and assigning the next valid text Ncv to the valid text Tcv;

a queue B formation module, for continuing to traverse the queue T to the next valid element Ncv_i+2 after the next valid text Ncv; if the difference between the position indexes of the Ncv_i+2 and the current valid text Tcv in the queue T is greater than the index threshold, putting a copy of the temporary text block Temp into the queue B, assigning the Ncv_i+2 to the current element Tc, and continuing to loop over the queue T.