CN106649560B - Web page body text extraction method and device
- Publication number
- CN106649560B (CN201610986453.5A)
- Authority
- CN
- China
- Prior art keywords
- text block
- text
- punctuation mark
- content
- title
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9562—Bookmark management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
The invention discloses a Web page body text extraction method and device in the technical field of information processing. The method comprises the following steps: obtaining the title content from the HTML source code of a web page; obtaining the paths of all text blocks in the HTML source code and building a text block path list; comparing the title content with the content of each text block to find the text block containing the title; following the order of the paths in the list and starting from the path after that of the title's text block, calculating the punctuation mark weight of the text block corresponding to each path; and judging each text block by its punctuation mark weight and marking the text blocks that are body text according to the result. The invention scales well, is simple to implement, and extracts body text with high accuracy.
Description
Technical field
The present invention relates to the technical field of information processing, and in particular to a Web page body text extraction method and device, especially one that combines punctuation mark weights with structural features.
Background
With the rapid development of the internet, cyberspace stores more and more information resources. Web pages are one of the main forms in which this information is presented, and their number keeps growing. According to the 35th Statistical Report on Internet Development in China, the number of web pages in China has reached 189.9 billion, an annual increase of 26.6%. In recent years, how to analyze and process this massive web data and mine valuable information from it has become a hot research topic. Big data technologies, represented by Hadoop, provide effective means for storing and analyzing massive web data. However, besides the subject content, a web page usually contains a navigation bar to help users browse, as well as content unrelated to the subject, such as advertisements, copyright information, and related links. We call this content "noise". Noise has a large impact on web data analysis and seriously reduces the accuracy of the results. Removing the noise and extracting the body content of the web page is therefore of great significance for web data analysis.
However, current Web page body text extraction methods still have problems and need further improvement. Template-based methods require a dedicated extraction template for each web page format; although their extraction precision is very high, they scale poorly, are expensive to maintain, and cannot be used on a large scale. Methods based on visual features are complex to implement and inefficient, and have difficulty handling massive web data. Statistics-based methods have lower extraction precision.
Summary of the invention
Therefore, the technical problem to be solved by the embodiments of the present invention is that Web page body text extraction methods in the prior art scale poorly, are complex to implement, and have low extraction precision.
To this end, a Web page body text extraction method according to an embodiment of the present invention comprises the following steps:
Obtaining the title content from the web page HTML source code;
Obtaining the paths of all text blocks in the web page HTML source code, and building a text block path list;
Comparing the title content with the content of each text block to obtain the text block containing the title content;
Following the order of the paths in the list and starting from the path after the path of the text block containing the title content, calculating the punctuation mark weight of the text block corresponding to each path;
Judging according to the punctuation mark weights, and marking the text blocks that are body text according to the judgment result.
Preferably, the step of obtaining the title content from the web page HTML source code comprises:
Obtaining the contents of the title tag and the h1 tag in the web page HTML source code respectively;
Splitting the content of the title tag into substrings using separators, and storing the resulting substrings in order in an array;
Judging respectively whether the content of the first element of the array, the content of the last element, and the content of the h1 tag are conventional non-title text;
When at least one of the content of the first element of the array, the content of the last element, and the content of the h1 tag is not conventional non-title text, taking as the title content the first of these that is not conventional non-title text.
Preferably, the step of comparing the title content with the content of each text block to obtain the text block containing the title content comprises:
Comparing the edit distances between the title content and the content of each text block;
Taking the text block with the smallest edit distance as the text block containing the title content.
Preferably, the step of judging according to the punctuation mark weights and marking the text blocks that are body text according to the judgment result comprises:
Judging respectively whether each punctuation mark weight is greater than or equal to a first threshold;
When a punctuation mark weight is greater than or equal to the first threshold, marking the text block corresponding to this punctuation mark weight as body text.
Preferably, the step of judging according to the punctuation mark weights and marking the text blocks that are body text according to the judgment result further comprises:
When the punctuation mark weight is less than the first threshold, judging whether the punctuation mark weight is greater than or equal to a second threshold, the second threshold being less than the first threshold;
When the punctuation mark weight is greater than or equal to the second threshold, judging respectively: whether the text block corresponding to this punctuation mark weight contains an end punctuation mark; whether the punctuation mark weight of the previous text block is greater than or equal to the second threshold; whether the punctuation mark weight of the next text block is greater than or equal to the second threshold; whether the previous text block ends with an end punctuation mark; and whether the next text block ends with an end punctuation mark;
When the text block corresponding to this punctuation mark weight contains an end punctuation mark, or the punctuation mark weights of both its previous and next text blocks are greater than or equal to the second threshold, or either of its previous and next text blocks has a punctuation mark weight greater than or equal to the second threshold and ends with an end punctuation mark, marking the text block corresponding to this punctuation mark weight as body text.
Preferably, before the step of obtaining the title content from the web page HTML source code, the method further comprises the following step:
Removing content unrelated to the body text structure from the web page HTML source code.
Preferably, after the step of judging according to the punctuation mark weights and marking the text blocks that are body text according to the judgment result, the method further comprises the following step:
According to the paths of the text blocks, trimming the boundaries of the text blocks that have been marked as body text to obtain the accurate body content.
A Web page body text extraction device according to an embodiment of the present invention comprises:
A first acquisition unit, for obtaining the title content from the web page HTML source code;
A second acquisition unit, for obtaining the paths of all text blocks in the web page HTML source code and building a text block path list;
A first title text block obtaining unit, for comparing the title content with the content of each text block to obtain the text block containing the title content;
A punctuation mark weight calculation unit, for calculating, following the order of the paths in the list and starting from the path after the path of the text block containing the title content, the punctuation mark weight of the text block corresponding to each path;
A first body text block obtaining unit, for judging according to the punctuation mark weights and marking the text blocks that are body text according to the judgment result.
Preferably, the first acquisition unit comprises:
A third acquisition unit, for obtaining the contents of the title tag and the h1 tag in the web page HTML source code respectively;
A splitting unit, for splitting the content of the title tag into substrings using separators and storing the resulting substrings in order in an array;
A first judging unit, for judging respectively whether the content of the first element of the array, the content of the last element, and the content of the h1 tag are conventional non-title text;
A title content obtaining unit, for taking as the title content, when at least one of the content of the first element of the array, the content of the last element, and the content of the h1 tag is not conventional non-title text, the first of these that is not conventional non-title text.
Preferably, the first title text block obtaining unit comprises:
A comparing unit, for comparing the edit distances between the title content and the content of each text block;
A second title text block obtaining unit, for taking the text block with the smallest edit distance as the text block containing the title content.
Preferably, the first body text block obtaining unit comprises:
A second judging unit, for judging respectively whether each punctuation mark weight is greater than or equal to a first threshold;
A second body text block obtaining unit, for marking the text block corresponding to a punctuation mark weight as body text when this punctuation mark weight is greater than or equal to the first threshold.
Preferably, the first body text block obtaining unit further comprises:
A third judging unit, for judging, when the punctuation mark weight is less than the first threshold, whether the punctuation mark weight is greater than or equal to a second threshold, the second threshold being less than the first threshold;
A fourth judging unit, for judging respectively, when the punctuation mark weight is greater than or equal to the second threshold: whether the text block corresponding to this punctuation mark weight contains an end punctuation mark; whether the punctuation mark weight of the previous text block is greater than or equal to the second threshold; whether the punctuation mark weight of the next text block is greater than or equal to the second threshold; whether the previous text block ends with an end punctuation mark; and whether the next text block ends with an end punctuation mark;
A third body text block obtaining unit, for marking the text block corresponding to this punctuation mark weight as body text when this text block contains an end punctuation mark, or the punctuation mark weights of both its previous and next text blocks are greater than or equal to the second threshold, or either of its previous and next text blocks has a punctuation mark weight greater than or equal to the second threshold and ends with an end punctuation mark.
Preferably, the device further comprises, before the first acquisition unit:
A removing unit, for removing content unrelated to the body text structure from the web page HTML source code.
Preferably, the device further comprises, after the first body text block obtaining unit:
An accurate body text obtaining unit, for trimming, according to the paths of the text blocks, the boundaries of the text blocks that have been marked as body text to obtain the accurate body content.
The technical solutions of the embodiments of the present invention have the following advantages:
1. The Web page body text extraction method and device provided by the embodiments of the present invention compare the title content with the text block contents to find the text block containing the title, then, following the order of the path list, calculate the punctuation mark weights of the text block after the title's text block and of all later text blocks, and use these weights to judge whether each text block is body text, thereby finding the continuous text blocks that make up the body text. The method and device work directly on web page HTML source code and need no page-specific extraction template, so they scale well. Judging body text by a calculated punctuation mark weight is simple to implement and improves extraction precision.
2. The Web page body text extraction method and device provided by the embodiments of the present invention obtain the contents of both the <title> and <h1> tags, which takes full account of the title tags that may be present in the HTML source code and improves the accuracy of title acquisition. By judging whether the obtained title content matches a specific non-title regular expression, insubstantial non-title strings such as "forum", "blog", and "news weekly" are excluded, which further improves the accuracy of the obtained title content.
3. The Web page body text extraction method and device provided by the embodiments of the present invention set two thresholds of different sizes, compare the punctuation mark weight with each of them, and also take into account actual body text structure features such as the neighboring text blocks and end punctuation marks, so that the text blocks that are body text are marked accurately. The whole process is clear and concise, applies to all web pages, is highly accurate, and has a wide range of applications.
Brief description of the drawings
In order to explain the technical solutions of the specific embodiments of the present invention more clearly, the drawings needed in their description are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a specific example of the Web page body text extraction method in Embodiment 1 of the present invention;
Fig. 2 is a flowchart of a specific example of marking body text blocks in Embodiment 1 of the present invention;
Fig. 3 is a functional block diagram of a specific example of the Web page body text extraction device in Embodiment 2 of the present invention.
Detailed description of the embodiments
The technical solutions of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "first", "second", "third", etc. are used for descriptive purposes only and are not to be understood as indicating or implying relative importance.
In addition, the technical features involved in the different embodiments of the present invention described below can be combined with each other as long as they do not conflict.
Embodiment 1
This embodiment provides a Web page body text extraction method that combines punctuation mark weights with structural features to extract the body text from the information in web page HTML source code. As shown in Fig. 1, the method comprises the following steps:
S1. Obtain the title content title from the web page HTML source code;
S2. Obtain the feature information, including the path, of all text blocks in the web page HTML source code, and build the text block feature information list listBlock, which includes the path list;
S3. In the text block feature information list listBlock, compare the title content title with the content of each text block to obtain the text block blockTitle containing the title content;
S4. Following the order of the paths in the list and starting from the path after the path of the text block blockTitle containing the title content, calculate the punctuation mark weight PuncWeight of the text block corresponding to each path;
S5. Judge according to the punctuation mark weight PuncWeight, and mark the text blocks that are body text according to the judgment result.
Through steps S1-S5, the above Web page body text extraction method compares the title content with the text block contents to find the text block containing the title, then, following the order of the path list, calculates the punctuation mark weights of the text block after the title's text block and of all later text blocks, and uses these weights to judge whether each text block is body text, thereby finding the continuous text blocks that make up the body text. The method works directly on web page HTML source code and needs no page-specific extraction template, so it scales well. Judging body text by a calculated punctuation mark weight is simple to implement and improves extraction precision. The method has a low maintenance cost, can be used on a large scale, and is highly accurate, which provides support for large-scale acquisition of open source intelligence from the internet.
Preferably, obtaining the title content title from the web page HTML source code in step S1 comprises:
S11. Obtain the content oriTitle of the <title> tag and the content h1 of the <h1> tag in the web page HTML source code respectively. For example, they can be obtained with the regular expressions (?is)<title.*?>(.+?)</title> and (?is)<h1.*?>(.*?)</h1>.
S12. Split oriTitle using the separators SplitStr, and store the resulting substrings in order in the array titleArray. The separators can be, for example, -, _, «, etc.
S13. Judge respectively whether the content of the first element titleArray[1] of the array titleArray, the content of the last element titleArray[last], and h1 are conventional non-title text, i.e. whether they match a specific non-title regular expression ExcludeTitlePattern, for example [^\s]{1,5}(forum|blog|news(weekly|center|net(site)?)?|important news|home page|net).
S14. When at least one of titleArray[1], titleArray[last], and h1 is not conventional non-title text, take as the title content title the first of them, judged in that order, that is not conventional non-title text. One possible selection procedure is: first judge whether titleArray[1] is conventional non-title text; if it is not, take title to be titleArray[1]; if it is, judge whether titleArray[last] is conventional non-title text; if it is not, take title to be titleArray[last]; if it is, judge whether h1 is conventional non-title text; if it is not, take title to be h1; if it is, take title to be empty.
S15. When titleArray[1], titleArray[last], and h1 are all conventional non-title text, take the title content title to be empty.
Through steps S11-S15, the above Web page body text extraction method obtains the contents of both the <title> and <h1> tags, which takes full account of the title tags that may be present in the HTML source code and improves the accuracy of title acquisition. By judging whether the obtained title content matches a specific non-title regular expression, insubstantial non-title strings such as "forum", "blog", and "news weekly" are excluded, which further improves the accuracy of the obtained title content.
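Steps S11-S15 can be sketched in Python roughly as follows. This is a minimal illustration, not the patented implementation: the two extraction regular expressions are the ones quoted in S11, while the separator set, the English EXCLUDE_TITLE_RE pattern, and the function names are assumptions made for the example.

```python
import re

# Regular expressions quoted in step S11.
TITLE_RE = re.compile(r"(?is)<title.*?>(.+?)</title>")
H1_RE = re.compile(r"(?is)<h1.*?>(.*?)</h1>")

# Assumption: separators "-", "_" and "«", with optional surrounding spaces.
SPLIT_RE = re.compile(r"\s*[-_«]\s*")

# Assumption: an illustrative English stand-in for ExcludeTitlePattern.
EXCLUDE_TITLE_RE = re.compile(r"[^\s]{1,12}\s?(forum|blog|news|home\s?page)", re.I)


def is_non_title(text):
    """Return True if text looks like conventional non-title text (S13)."""
    return bool(EXCLUDE_TITLE_RE.fullmatch(text.strip()))


def extract_title(html):
    """Pick the first candidate that is not conventional non-title text (S14-S15)."""
    m = TITLE_RE.search(html)
    ori_title = m.group(1).strip() if m else ""
    m = H1_RE.search(html)
    # Drop any markup nested inside <h1>.
    h1 = re.sub(r"<[^>]+>", "", m.group(1)).strip() if m else ""

    parts = [p for p in SPLIT_RE.split(ori_title) if p]
    candidates = []
    if parts:
        candidates += [parts[0], parts[-1]]  # first and last array element
    candidates.append(h1)

    for candidate in candidates:
        if candidate and not is_non_title(candidate):
            return candidate
    return ""  # every candidate was conventional non-title text
```

For a page titled "Example News - Punctuation weights for web pages", the first element is rejected as a site name and the last element is returned as the title.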
Preferably, step S2 specifically comprises the following steps:
S21. Implement a class HtmlParserHandler that inherits the DefaultHandler class used with TagSoup;
S22. In the startElement() method of the class HtmlParserHandler, push the tag name and number of the element currently being processed onto the stack stack. Numbering starts from 1 and is incremented by 1 for each tag processed;
S23. In the characters() method of the class HtmlParserHandler, record the text content of the text block and, at the same time, concatenate the contents of the stack stack to obtain the path feature information of the text block. The path feature information has the format of the path from the root node to the tag containing the text block, for example /html_1/body_2/div_54/div_55/div_58/h1_59. The path feature information of the text block is recorded in listBlock;
S24. In the endElement() method of the class HtmlParserHandler, call the pop method of the stack stack, so that, following the first-in, last-out rule, the element whose processing has just ended is popped.
Those skilled in the art will understand that the way of obtaining the text block feature information is not limited to the class HtmlParserHandler described above; any other method capable of obtaining the text block feature information can also be used.
Preferably, step S3 comprises:
S31. Compare the edit distances between the title content title and the content of each text block, for example using the Levenshtein distance.
S32. Take the text block with the smallest edit distance as the text block blockTitle containing the title content title, and record blockTitle as the serial number of that text block.
Through steps S31-S32, the above Web page body text extraction method compares the title content with the text block contents by edit distance, so that the text block containing the title is found quickly. The processing is simple and effective, and also efficient.
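A minimal Python sketch of S31-S32 follows, using the standard dynamic-programming form of the Levenshtein distance. The function names are assumptions for illustration, and find_title_block expects a non-empty list of block texts.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def find_title_block(title, block_texts):
    """Return the index of the block whose text is closest to the title (S32)."""
    return min(range(len(block_texts)),
               key=lambda i: levenshtein(title, block_texts[i]))
```

On ties, min returns the earliest block, which is a choice of this sketch rather than something the text specifies.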
Preferably, in step S4, the punctuation mark weight PuncWeight of a text block can be calculated with the following formula:
PuncWeight = K1*S1 + K2*S2 + K3*isEndPunc + K4*isLink
Here, S1 is the total number of occurrences of the symbols , . ; in the text block, and S2 is the total number of occurrences of the symbols : ! in the text block, counting the Chinese and English forms of each symbol. isEndPunc indicates whether the text block ends with any end punctuation mark (the Chinese or English full stop, exclamation mark, question mark, or semicolon): if it does, isEndPunc = 1, otherwise isEndPunc = -1. isLink indicates whether the text block as a whole is a hyperlink, i.e. whether the text block lies inside an <a> tag: if it does, isLink = -1, otherwise isLink = 0. K1 to K4 are the weights of the four terms on the right-hand side of the formula. Each can be set in the range 0.3-1.2, for example K1 = 1.0, K2 = 0.4, K3 = 1.0, K4 = 1.0. After the PuncWeight of each text block has been calculated, it can be saved in the corresponding text block information in listBlock.
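The formula can be written out in Python as below. The symbol sets are assumptions reconstructed from the description (Chinese full-width and English half-width forms of each mark), and the in_link flag is assumed to be supplied by whatever parser built the text blocks.

```python
# Assumption: each set holds the full-width and half-width form of every mark.
S1_SYMBOLS = set(",，.。;；")
S2_SYMBOLS = set(":：!！")
END_SYMBOLS = set(".。!！?？;；")


def punc_weight(text, in_link=False, k1=1.0, k2=0.4, k3=1.0, k4=1.0):
    """PuncWeight = K1*S1 + K2*S2 + K3*isEndPunc + K4*isLink."""
    s1 = sum(ch in S1_SYMBOLS for ch in text)
    s2 = sum(ch in S2_SYMBOLS for ch in text)
    stripped = text.rstrip()
    is_end_punc = 1 if stripped and stripped[-1] in END_SYMBOLS else -1
    is_link = -1 if in_link else 0
    return k1 * s1 + k2 * s2 + k3 * is_end_punc + k4 * is_link
```

With the example weights, a sentence such as "First, second; third." scores 3 + 0 + 1 + 0 = 4, while a navigation link such as "Home" scores 0 + 0 - 1 - 1 = -2, which shows how the weight separates prose from link text.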
Preferably, step S5 comprises:
S51. Judge respectively whether each punctuation mark weight PuncWeight is greater than or equal to a first threshold. The first threshold can be, for example, 3.
S52. When the punctuation mark weight PuncWeight is greater than or equal to the first threshold, mark the text block corresponding to this PuncWeight as body text.
S53. When the punctuation mark weight PuncWeight is less than the first threshold, judge whether it is greater than or equal to a second threshold, the second threshold being less than the first threshold. The second threshold can be, for example, 1.
S54. When the punctuation mark weight PuncWeight is greater than or equal to the second threshold, judge respectively: whether the text block corresponding to this PuncWeight contains an end punctuation mark (the Chinese or English full stop, exclamation mark, question mark, or semicolon); whether the punctuation mark weight of the previous text block is greater than or equal to the second threshold; whether the punctuation mark weight of the next text block is greater than or equal to the second threshold; whether the previous text block ends with an end punctuation mark; and whether the next text block ends with an end punctuation mark.
S55. When the text block corresponding to this PuncWeight contains an end punctuation mark, or the punctuation mark weights of both its previous and next text blocks are greater than or equal to the second threshold, or either of its previous and next text blocks has a punctuation mark weight greater than or equal to the second threshold and ends with an end punctuation mark, mark the text block corresponding to this PuncWeight as body text.
In steps S54-S55 the judgments can be made in various orders. One specific order is shown in Fig. 2: first judge whether the text block corresponding to this PuncWeight contains an end punctuation mark; if it does, mark the text block as body text. If it does not, judge whether the punctuation mark weights of both the previous and the next text block are greater than or equal to the second threshold; if they are, mark the text block as body text. If they are not, judge whether either of the previous and next text blocks has a punctuation mark weight greater than or equal to the second threshold and ends with an end punctuation mark; if so, mark the text block as body text, and if not, mark the text block as not body text.
S56. When the text block corresponding to this PuncWeight contains no end punctuation mark, at least one of its previous and next text blocks has a punctuation mark weight less than the second threshold, and neither of its previous and next text blocks both has a punctuation mark weight greater than or equal to the second threshold and ends with an end punctuation mark, mark the text block corresponding to this PuncWeight as not body text.
S57. When the punctuation mark weight PuncWeight is less than the second threshold, mark the text block corresponding to this PuncWeight as not body text.
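The two-threshold decision of S51-S57, in the order shown in Fig. 2, might be sketched in Python as follows. The function and parameter names are assumptions; in particular, a block at the edge of the list has only one neighbor, and this sketch assumes the "both neighbors" test then fails.

```python
# Assumption: Chinese and English full stop, exclamation mark, question mark, semicolon.
END_SYMBOLS = set(".。!！?？;；")


def contains_end(text):
    return any(ch in END_SYMBOLS for ch in text)


def ends_with_end(text):
    stripped = text.rstrip()
    return bool(stripped) and stripped[-1] in END_SYMBOLS


def is_body(i, texts, weights, first_threshold=3, second_threshold=1):
    """Decide whether block i is body text, given all block texts and weights."""
    w = weights[i]
    if w >= first_threshold:            # S52
        return True
    if w < second_threshold:            # S57
        return False
    if contains_end(texts[i]):          # S55, first condition
        return True
    neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(texts)]
    strong = [j for j in neighbours if weights[j] >= second_threshold]
    if len(neighbours) == 2 and len(strong) == 2:   # S55, second condition
        return True
    # S55, third condition; otherwise the block is not body text (S56).
    return any(ends_with_end(texts[j]) for j in strong)
```

The middle band between the two thresholds is where context matters: a short subheading with a weight of 1 is kept only when its surroundings look like body text.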
Further, step S5 can also comprise:
S58. Judge whether, counting from the most recently found body text block, more than N consecutive text blocks after it are not body text, where N is set in the range 2-25, for example N = 20. If so, go to step S59; if not, continue the ongoing judging process.
S59. The judging process ends, and all the continuous body text blocks in the web page HTML source code have been obtained.
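The stopping rule of S58-S59 can be sketched as a scan over per-block body flags. This is an illustrative assumption of how the loop might look: flags, start, and max_gap (the N of S58) are hypothetical names, and the gap is only counted once a first body text block has been found.

```python
def scan_body_blocks(flags, start, max_gap=20):
    """Return indices of body blocks from start, stopping after a long gap.

    flags[i] is True when block i was judged to be body text.
    """
    body = []
    gap = 0
    for i in range(start, len(flags)):
        if flags[i]:
            body.append(i)
            gap = 0
        elif body:
            gap += 1
            if gap > max_gap:   # more than N non-body blocks in a row (S58)
                break           # the judging process ends (S59)
    return body
```

A small N cuts the scan off quickly after the article ends, at the risk of missing body text that follows a long embedded list or table.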
Through steps S51-S59, the above Web page body text extraction method sets two thresholds of different sizes, compares the punctuation mark weight with each of them, and also takes into account actual body text structure features such as the neighboring text blocks and end punctuation marks, so that the text blocks that are body text are marked accurately. The whole process is clear and concise, applies to all web pages, is highly accurate, and has a wide range of applications.
Preferably, the method further comprises the following step before step S1:
S0. Remove content unrelated to the body text structure from the web page HTML source code. Removing this content eliminates a large amount of noise from the HTML source code, which simplifies the subsequent processing steps and improves their accuracy.
Specifically, step S0 comprises:
S01. Remove the <head> tag and the content it contains;
S02. Replace &nbsp; and other similar special characters beginning with & with the empty string;
S03. Remove the <!DOCTYPE> tag;
S04. Remove comment tags <!-- -->;
S05. Remove the <script> tag and the script it contains;
S06. Remove the <style> tag and the CSS style information it contains;
S07. Remove the <select> tag and the content it contains;
S08. Remove the tags <font><strike><u><b><i><em><strong><sub><code><tt><sup><var><abbr><ACRONYM><center><ignore_js_op>;
S09. Replace <span> with a space.
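The cleaning steps S01-S09 can be approximated with regular expressions, as in the Python sketch below. The patterns and the function name clean_html are assumptions for illustration; regular expressions are a pragmatic rather than a robust way to edit HTML, and the step order is slightly rearranged so that entities inside scripts and comments are removed together with them.

```python
import re

# Tags listed in S08; only the tags are dropped, their inner text is kept.
INLINE_TAGS = ("font|strike|u|b|i|em|strong|sub|code|tt|sup|var|abbr|"
               "acronym|center|ignore_js_op")


def clean_html(html):
    flags = re.I | re.S
    html = re.sub(r"<head\b.*?</head>", "", html, flags=flags)       # S01
    html = re.sub(r"<!DOCTYPE[^>]*>", "", html, flags=flags)         # S03
    html = re.sub(r"<!--.*?-->", "", html, flags=flags)              # S04
    for tag in ("script", "style", "select"):                        # S05-S07
        html = re.sub(rf"<{tag}\b.*?</{tag}>", "", html, flags=flags)
    html = re.sub(r"&#?\w+;", "", html)                              # S02
    html = re.sub(rf"</?(?:{INLINE_TAGS})\b[^>]*>", "", html, flags=flags)  # S08
    html = re.sub(r"</?span\b[^>]*>", " ", html, flags=flags)        # S09
    return html
```

The word boundary after each tag name keeps, for example, <body> and <header> from being mistaken for <b> and <head>.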
Preferably, the method further comprises the following step after step S5:
S6. According to the paths of the text blocks, trim the boundaries of the text blocks that have been marked as body text to obtain the accurate body content. In this trimming step, the HTML structure information of these text blocks is analyzed to determine the exact range of the body content, so that the accurate body content of the web page is extracted, which further improves the accuracy of body text extraction.
Specifically, the above step S6 includes:
S61, group all body text blocks by parent node path, so that text blocks with the same parent node path form one group;
S62, for each group, calculate the sum sumPuncWeight of the punctuation mark weights PuncWeight of all text blocks in the group;
S63, select the group with the largest sumPuncWeight and find the path information pathM that occurs most frequently among all text blocks of that group, ignoring the index of the last node in the path during the lookup;
S64, compare pathM with the path information of all text blocks that have been marked as body text, and select the start text block startBlock and the end text block endBlock that have the same path as pathM;
S65, collect all path information occurring in startBlock and endBlock as setPath;
S66, starting from the text block following the text block blockTitle where the title is located, judge one by one whether the path information of each text block in listBlock is in setPath, and mark the text block if it is;
S67, record the start text block contenStartBlock and the end text block contenEndBlock marked in the previous step;
S68, following the order in listBlock, splice the content of contenStartBlock, contenEndBlock, and all text blocks between them; the spliced result is the accurate body content.
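As an illustration of steps S61-S63 only, the sketch below groups marked body text blocks by parent path and picks the most frequent path in the group with the largest weight sum. It assumes, hypothetically, that each path is an XPath-like string such as /html/body/div[1]/p[2] and that the punctuation mark weights have already been computed; the helper names are not taken from the source.

```python
import re
from collections import Counter, defaultdict


def parent_path(path: str) -> str:
    """'/html/body/div[2]/p[3]' -> '/html/body/div[2]'."""
    return path.rsplit("/", 1)[0]


def strip_last_index(path: str) -> str:
    """Ignore the index of the last node: '/html/body/div[2]/p[3]' -> '/html/body/div[2]/p'."""
    return re.sub(r"\[\d+\]$", "", path)


def find_path_m(blocks):
    """blocks: list of (path, punc_weight) pairs for blocks marked as body text."""
    groups = defaultdict(list)
    for path, weight in blocks:                 # S61: group by parent path
        groups[parent_path(path)].append((path, weight))
    # S62 and S63: choose the group with the largest sum of weights.
    best = max(groups.values(), key=lambda g: sum(w for _, w in g))
    # S63: most frequent path in that group, ignoring the last node's index.
    counts = Counter(strip_last_index(p) for p, _ in best)
    return counts.most_common(1)[0][0]
```

Ignoring the last index lets sibling paragraphs such as p[1], p[2], p[3] count as the same path, which is what makes the dominant article container stand out.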
Through steps S61-S68, the above web page body text extraction method uses the path feature information of the text blocks, combined with their punctuation mark weights, to further trim the marked body text blocks and find the exact start and end text blocks of the body text, so that the accurate body content between them is spliced together, further improving the precision of web page body text extraction.
Embodiment 2
Corresponding to Embodiment 1, this embodiment provides a web page body text extraction device which, as shown in Figure 3, comprises:
a first acquisition unit 1, configured to obtain the title content in the web page HTML source code;
a second acquisition unit 2, configured to obtain the paths of all text blocks in the web page HTML source code and to build a text block path list;
a first title text block obtaining unit 3, configured to compare the title content with the text block content of each text block to obtain the text block where the title content is located;
a punctuation mark weight calculation unit 4, configured to calculate, in the order of the paths in the list and starting from the path following the path of the text block where the title content is located, the punctuation mark weight of the text block corresponding to each path;
a first body text block obtaining unit 5, configured to make a judgment according to the punctuation mark weights and to mark the text blocks that are body text according to the judgment result.
The above web page body text extraction device finds the text block where the title content is located by comparing the title content with the text block contents. Then, following the order in the path list, it calculates the punctuation mark weights of the text block after the title's text block and of the subsequent text blocks, uses the punctuation mark weights to judge whether each text block is body text, and thereby finds the continuous text blocks that are body text. The device is therefore applicable to web page HTML source code in general, requires no specific extraction template, and scales well; using the punctuation mark weight calculation to judge body text is simple to implement and improves extraction precision. The device has low maintenance cost, can be used on a large scale with high accuracy, and supports large-scale collection of open-source Internet information.
Preferably, the first acquisition unit 1 includes:
a third acquisition unit, configured to obtain the content of the title tag and of the h1 tag in the web page HTML source code respectively;
a splitting unit, configured to split the content of the title tag into substrings using separators and to store the split results in order in an array;
a first judging unit, configured to judge respectively whether the content of the first element of the array, the content of the last element, and the content of the h1 tag are conventional non-title text;
a title content obtaining unit, configured to, when any one of the content of the first element of the array, the content of the last element, and the content of the h1 tag is not conventional non-title text, take as the title content the first of these that is not conventional non-title text.
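The title selection performed by these units can be sketched as below. This is a hedged illustration: the separator pattern and the set of conventional non-title phrases are hypothetical placeholders, since this section does not list them, and the function names are not from the source.

```python
import re
from typing import Optional

# Hypothetical separators: "|" or "_" anywhere, or "-" surrounded by spaces.
SEPARATORS = r"\s*[|_]\s*|\s+-\s+"
# Hypothetical examples of conventional non-title text (site boilerplate).
NON_TITLE = {"home", "homepage", "news", "index"}


def is_conventional_non_title(text: str) -> bool:
    """Treat empty strings and known boilerplate phrases as non-title text."""
    cleaned = text.strip().lower()
    return not cleaned or cleaned in NON_TITLE


def pick_title(title_tag: str, h1_tag: str) -> Optional[str]:
    """Return the first candidate that is not conventional non-title text."""
    parts = [p.strip() for p in re.split(SEPARATORS, title_tag) if p.strip()]
    candidates = []
    if parts:
        candidates += [parts[0], parts[-1]]   # first and last array elements
    candidates.append(h1_tag)                 # then the h1 content
    for candidate in candidates:
        if not is_conventional_non_title(candidate):
            return candidate.strip()
    return None  # no usable title content found
```

Checking the first element, then the last element, then the h1 content mirrors the order in which the three candidates are named above.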
Preferably, the first title text block obtaining unit 3 includes:
a comparing unit, configured to compare the edit distances between the title content and the text block content of each text block;
a second title text block obtaining unit, configured to take the text block with the smallest edit distance as the text block where the title content is located.
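A minimal sketch of this comparison follows. The text only says "edit distance"; the sketch assumes the common Levenshtein distance (insertions, deletions, and substitutions each cost 1), and the function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # substitute or match
        prev = cur
    return prev[-1]


def find_title_block(title: str, block_texts: list) -> int:
    """Index of the text block whose content is closest to the title."""
    return min(range(len(block_texts)),
               key=lambda k: edit_distance(title, block_texts[k]))
```

Using the minimum distance rather than exact equality tolerates small differences between the title tag and the on-page headline, such as trailing punctuation.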
Preferably, the first body text block obtaining unit 5 includes:
a second judging unit, configured to judge respectively whether each punctuation mark weight is greater than or equal to a first threshold;
a second body text block obtaining unit, configured to mark the text block corresponding to a punctuation mark weight as body text when that punctuation mark weight is greater than or equal to the first threshold.
Preferably, the first body text block obtaining unit 5 further includes:
a third judging unit, configured to judge, when the punctuation mark weight is less than the first threshold, whether the punctuation mark weight is greater than or equal to a second threshold, the second threshold being less than the first threshold;
a fourth judging unit, configured to judge respectively, when the punctuation mark weight is greater than or equal to the second threshold: whether the text block corresponding to this punctuation mark weight contains a terminal punctuation mark; whether the punctuation mark weight of the preceding text block is greater than or equal to the second threshold; whether the punctuation mark weight of the following text block is greater than or equal to the second threshold; whether the preceding text block ends with a terminal punctuation mark; and whether the following text block ends with a terminal punctuation mark;
a third body text block obtaining unit, configured to mark the text block corresponding to this punctuation mark weight as body text when the text block contains a terminal punctuation mark, or when the punctuation mark weights of both the preceding and the following text blocks are greater than or equal to the second threshold, or when the punctuation mark weight of either the preceding or the following text block is greater than or equal to the second threshold and that text block ends with a terminal punctuation mark.
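The two-threshold judgment can be sketched as follows. This is an illustration under stated assumptions: the set of terminal punctuation marks, the threshold values used in the usage example, and the name is_body are all hypothetical, and the punctuation mark weights are taken as already computed.

```python
# Assumed set of terminal (sentence-ending) punctuation marks.
TERMINAL = "。！？.!?"


def is_body(i, blocks, weights, t1, t2):
    """Judge whether text block i is body text; requires t2 < t1.

    blocks[i] is the text of block i, weights[i] its punctuation mark weight.
    """
    w = weights[i]
    if w >= t1:                      # at or above the first threshold: body text
        return True
    if w < t2:                       # below the second threshold: not body text
        return False
    # Here t2 <= w < t1, so the structural features decide.
    if any(ch in TERMINAL for ch in blocks[i]):
        return True                  # the block contains a terminal mark
    neighbours = [k for k in (i - 1, i + 1) if 0 <= k < len(blocks)]
    strong = [k for k in neighbours if weights[k] >= t2]
    if len(strong) == 2:
        return True                  # both neighbours reach the second threshold
    # One neighbour reaches the threshold and ends with a terminal mark.
    return any(blocks[k].rstrip().endswith(tuple(TERMINAL)) for k in strong)
```

For example, with t1 = 4 and t2 = 2, a short block with weight 2 and no full stop is still kept when it sits between two blocks that each reach the second threshold.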
Preferably, the device further includes, before the first acquisition unit 1:
a removing unit, configured to remove content in the web page HTML source code that is unrelated to the body text structure.
Preferably, the device further includes, after the first body text block obtaining unit 5:
an accurate body text obtaining unit, configured to trim, according to the paths of the text blocks, the boundaries of the text blocks that have been marked as body text to obtain the accurate body content.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, the above embodiments are merely examples given for clarity of description and are not intended to limit the implementations. Those of ordinary skill in the art may make other changes or modifications in different forms on the basis of the above description. It is neither necessary nor possible to exhaustively list all implementations here. Obvious changes or modifications derived therefrom still fall within the protection scope of the present invention.
Claims (7)
1. A web page body text extraction method, characterized by comprising the following steps:
obtaining the title content in the web page HTML source code;
obtaining the paths of all text blocks in the web page HTML source code, and building a text block path list;
comparing the title content with the text block content of each text block to obtain the text block where the title content is located;
calculating, in the order of the paths in the list and starting from the path following the path of the text block where the title content is located, the punctuation mark weight of the text block corresponding to each path;
making a judgment according to the punctuation mark weights, and marking the text blocks that are body text according to the judgment result;
wherein the step of making a judgment according to the punctuation mark weights and marking the text blocks that are body text according to the judgment result includes:
judging respectively whether each punctuation mark weight is greater than or equal to a first threshold;
when the punctuation mark weight is greater than or equal to the first threshold, marking the text block corresponding to this punctuation mark weight as body text;
when the punctuation mark weight is less than the first threshold, judging whether the punctuation mark weight is greater than or equal to a second threshold, the second threshold being less than the first threshold;
when the punctuation mark weight is greater than or equal to the second threshold, judging respectively: whether the text block corresponding to this punctuation mark weight contains a terminal punctuation mark; whether the punctuation mark weight of the preceding text block is greater than or equal to the second threshold; whether the punctuation mark weight of the following text block is greater than or equal to the second threshold; whether the preceding text block ends with a terminal punctuation mark; and whether the following text block ends with a terminal punctuation mark;
when the text block corresponding to this punctuation mark weight contains a terminal punctuation mark, or the punctuation mark weights of both the preceding and the following text blocks are greater than or equal to the second threshold, or the punctuation mark weight of either the preceding or the following text block is greater than or equal to the second threshold and that text block ends with a terminal punctuation mark, marking the text block corresponding to this punctuation mark weight as body text.
2. The method according to claim 1, characterized in that the step of obtaining the title content in the web page HTML source code includes:
obtaining the content of the title tag and of the h1 tag in the web page HTML source code respectively;
splitting the content of the title tag into substrings using separators, and storing the split results in order in an array;
judging respectively whether the content of the first element of the array, the content of the last element, and the content of the h1 tag are conventional non-title text;
when any one of the content of the first element of the array, the content of the last element, and the content of the h1 tag is not conventional non-title text, taking as the title content the first of these that is not conventional non-title text.
3. The method according to claim 1, characterized in that the step of comparing the title content with the text block content of each text block to obtain the text block where the title content is located includes:
comparing the edit distances between the title content and the text block content of each text block;
taking the text block with the smallest edit distance as the text block where the title content is located.
4. The method according to claim 1, characterized in that the following step is further included before the step of obtaining the title content in the web page HTML source code:
removing content in the web page HTML source code that is unrelated to the body text structure.
5. The method according to any one of claims 1-4, characterized in that the following step is further included after the step of making a judgment according to the punctuation mark weights and marking the text blocks that are body text according to the judgment result:
trimming, according to the paths of the text blocks, the boundaries of the text blocks that have been marked as body text to obtain the accurate body content.
6. A web page body text extraction device, characterized by comprising:
a first acquisition unit, configured to obtain the title content in the web page HTML source code;
a second acquisition unit, configured to obtain the paths of all text blocks in the web page HTML source code and to build a text block path list;
a first title text block obtaining unit, configured to compare the title content with the text block content of each text block to obtain the text block where the title content is located;
a punctuation mark weight calculation unit, configured to calculate, in the order of the paths in the list and starting from the path following the path of the text block where the title content is located, the punctuation mark weight of the text block corresponding to each path;
a first body text block obtaining unit, configured to make a judgment according to the punctuation mark weights and to mark the text blocks that are body text according to the judgment result;
wherein the first body text block obtaining unit includes:
a second judging unit, configured to judge respectively whether each punctuation mark weight is greater than or equal to a first threshold;
a second body text block obtaining unit, configured to mark the text block corresponding to a punctuation mark weight as body text when that punctuation mark weight is greater than or equal to the first threshold;
a third judging unit, configured to judge, when the punctuation mark weight is less than the first threshold, whether the punctuation mark weight is greater than or equal to a second threshold, the second threshold being less than the first threshold;
a fourth judging unit, configured to judge respectively, when the punctuation mark weight is greater than or equal to the second threshold: whether the text block corresponding to this punctuation mark weight contains a terminal punctuation mark; whether the punctuation mark weight of the preceding text block is greater than or equal to the second threshold; whether the punctuation mark weight of the following text block is greater than or equal to the second threshold; whether the preceding text block ends with a terminal punctuation mark; and whether the following text block ends with a terminal punctuation mark;
a third body text block obtaining unit, configured to mark the text block corresponding to this punctuation mark weight as body text when the text block contains a terminal punctuation mark, or when the punctuation mark weights of both the preceding and the following text blocks are greater than or equal to the second threshold, or when the punctuation mark weight of either the preceding or the following text block is greater than or equal to the second threshold and that text block ends with a terminal punctuation mark.
7. The device according to claim 6, characterized in that the first acquisition unit includes:
a third acquisition unit, configured to obtain the content of the title tag and of the h1 tag in the web page HTML source code respectively;
a splitting unit, configured to split the content of the title tag into substrings using separators and to store the split results in order in an array;
a first judging unit, configured to judge respectively whether the content of the first element of the array, the content of the last element, and the content of the h1 tag are conventional non-title text;
a title content obtaining unit, configured to, when any one of the content of the first element of the array, the content of the last element, and the content of the h1 tag is not conventional non-title text, take as the title content the first of these that is not conventional non-title text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610986453.5A CN106649560B (en) | 2016-11-03 | 2016-11-03 | A kind of Web page text extracting method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106649560A (en) | 2017-05-10 |
CN106649560B (en) | 2019-09-24 |
Family
ID=58805474
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610986453.5A Expired - Fee Related CN106649560B (en) | 2016-11-03 | 2016-11-03 | A kind of Web page text extracting method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106649560B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113723980A (en) * | 2020-05-26 | 2021-11-30 | 北京达佳互联信息技术有限公司 | Method and device for detecting advertisement landing page, electronic equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101042692A (en) * | 2006-03-24 | 2007-09-26 | 富士通株式会社 | translation obtaining method and apparatus based on semantic forecast |
CN101458718A (en) * | 2009-01-05 | 2009-06-17 | 北京大学 | Search engine dynamic summarization extracting method |
CN102591612A (en) * | 2011-12-27 | 2012-07-18 | 厦门市美亚柏科信息股份有限公司 | General webpage text extraction method based on punctuation continuity and system thereof |
US8682883B2 (en) * | 2011-04-14 | 2014-03-25 | Predictix Llc | Systems and methods for identifying sets of similar products |
Also Published As
Publication number | Publication date |
---|---|
CN106649560A (en) | 2017-05-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6842167B2 (en) | Summary generator, summary generation method and computer program | |
CN107392143B (en) | Resume accurate analysis method based on SVM text classification | |
CN101408898B (en) | Method and device for extracting web page text | |
CN101251855B (en) | Equipment, system and method for cleaning internet web page | |
CN107358208B (en) | A kind of PDF document structured message extracting method and device | |
US7606816B2 (en) | Record boundary identification and extraction through pattern mining | |
CN104881458B (en) | A kind of mask method and device of Web page subject | |
CN111274814B (en) | Novel semi-supervised text entity information extraction method | |
US9449114B2 (en) | Removing non-substantive content from a web page by removing its text-sparse nodes and removing high-frequency sentences of its text-dense nodes using sentence hash value frequency across a web page collection | |
CN106960058A (en) | A kind of structure of web page alteration detection method and system | |
KR20120051419A (en) | Apparatus and method for extracting cascading style sheet | |
Uzun et al. | An effective and efficient Web content extractor for optimizing the crawling process | |
CN108536683A (en) | A kind of paper fragmentation information abstracting method based on machine learning | |
KR102110281B1 (en) | Automated composition evaluator | |
CN106227770A (en) | A kind of intelligentized news web page information extraction method | |
CN106649560B (en) | A kind of Web page text extracting method and device | |
CN106372038A (en) | Keyword extraction method and device | |
CN105183730B (en) | The treating method and apparatus of webpage information | |
CN104217025B (en) | For the entry extraction system and method for more record webpages | |
CN104615728B (en) | A kind of webpage context extraction method and device | |
CN111611788B (en) | Data processing method and device, electronic equipment and storage medium | |
CN109213974A (en) | A kind of electronic document conversion method and device | |
CN104636324B (en) | Topic source tracing method and system | |
CN109670162A (en) | The determination method, apparatus and terminal device of title | |
CN112818693A (en) | Automatic extraction method and system for electronic component model words |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20190924; Termination date: 20201103 |