CN106649560B - Web page body text extraction method and device - Google Patents

Web page body text extraction method and device

Info

Publication number
CN106649560B
Authority
CN
China
Prior art keywords
text block
text
punctuation mark
content
title
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201610986453.5A
Other languages
Chinese (zh)
Other versions
CN106649560A (en)
Inventor
贲兴龙
苏雪阳
韩国辉
袁林
陈晓琳
王睿
刘志明
袁翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 28 Research Institute
Priority to CN201610986453.5A
Publication of CN106649560A
Application granted
Publication of CN106649560B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9562 Bookmark management
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a Web page body text extraction method and device, belonging to the technical field of information processing. The method comprises the following steps: obtaining the title content in the HTML source code of a Web page; obtaining the paths of all text blocks in the HTML source code and building a text block path list; comparing the title content with the content of each text block to find the text block that contains the title content; in the order of the paths in the list, starting from the path after the one corresponding to the title text block, calculating the punctuation mark weight of the text block at each path; and judging each text block by its punctuation mark weight and marking the text blocks that are body text according to the result. The invention has the advantages of good scalability, simple implementation, and high extraction accuracy.

Description

Web page body text extraction method and device
Technical field
The present invention relates to the technical field of information processing, and in particular to a Web page body text extraction method and device, especially a method and device that combine punctuation mark weights with structural features.
Background
With the rapid development of the Internet, cyberspace stores more and more information resources. Web pages are one of its main forms of presentation, and their number keeps growing. According to the 35th Statistical Report on Internet Development in China, the number of Web pages in China has reached 189.9 billion, an annual increase of 26.6%. In recent years, how to analyze and process this massive amount of Web data and mine valuable information from it has become a hot research topic. Big data technologies represented by Hadoop provide an effective means of storing and analyzing massive Web data. However, besides the main content, a page usually also contains a navigation bar to help users browse, as well as content unrelated to the topic, such as advertisements, copyright notices, and related links. We call this content "noise". Noise strongly affects the analysis and processing of Web data and seriously degrades the accuracy of the results. Removing the noise and extracting the body content of a Web page is therefore important for Web data analysis.
However, current Web page body text extraction methods still have problems and need further improvement. Template-based methods need a dedicated extraction template for each page format; although their extraction precision is very high, they scale poorly, are expensive to maintain, and cannot be used on a large scale. Methods based on visual features are complex to implement and inefficient, so they struggle to process massive Web data. Statistics-based methods have lower extraction precision.
Summary of the invention
Therefore, the technical problem to be solved by the embodiments of the present invention is that Web page body text extraction methods in the prior art scale poorly, are complex to implement, and have low extraction precision.
To this end, a Web page body text extraction method according to an embodiment of the present invention comprises the following steps:
Obtaining the title content in the HTML source code of a Web page;
Obtaining the paths of all text blocks in the HTML source code, and building a text block path list;
Comparing the title content with the content of each text block, and obtaining the text block that contains the title content;
In the order of the paths in the list, starting from the path after the one corresponding to the title text block, calculating the punctuation mark weight of the text block at each path;
Judging each text block by its punctuation mark weight, and marking the text blocks that are body text according to the result.
Preferably, the step of obtaining the title content in the HTML source code of the Web page comprises:
Obtaining the content of the title tag and the content of the h1 tag in the HTML source code separately;
Splitting the content of the title tag into strings using separators, and storing the pieces in order in an array;
Judging separately whether the content of the first element of the array, the content of the last element, and the content of the h1 tag are conventional non-title text;
When at least one of the content of the first element of the array, the content of the last element, and the content of the h1 tag is not conventional non-title text, taking as the title content the first of these that is not conventional non-title text.
Preferably, the step of comparing the title content with the content of each text block and obtaining the text block that contains the title content comprises:
Comparing the edit distances between the title content and the content of each text block;
Taking the text block with the smallest edit distance as the text block that contains the title content.
Preferably, the step of judging each text block by its punctuation mark weight and marking the text blocks that are body text according to the result comprises:
Judging separately whether each punctuation mark weight is greater than or equal to a first threshold;
When the punctuation mark weight is greater than or equal to the first threshold, marking the text block corresponding to this punctuation mark weight as body text.
Preferably, the step of judging each text block by its punctuation mark weight and marking the text blocks that are body text according to the result further comprises:
When the punctuation mark weight is less than the first threshold, judging whether the punctuation mark weight is greater than or equal to a second threshold, the second threshold being less than the first threshold;
When the punctuation mark weight is greater than or equal to the second threshold, judging separately: whether the text block corresponding to this punctuation mark weight contains an end punctuation mark; whether the punctuation mark weight of the preceding text block is greater than or equal to the second threshold; whether the punctuation mark weight of the following text block is greater than or equal to the second threshold; whether the preceding text block ends with an end punctuation mark; and whether the following text block ends with an end punctuation mark;
When the text block corresponding to this punctuation mark weight contains an end punctuation mark, or the punctuation mark weights of both the preceding and the following text block are greater than or equal to the second threshold, or either the preceding or the following text block has a punctuation mark weight greater than or equal to the second threshold and ends with an end punctuation mark, marking the text block corresponding to this punctuation mark weight as body text.
Preferably, before the step of obtaining the title content in the HTML source code of the Web page, the method further comprises the following step:
Removing content in the HTML source code that is unrelated to the structure of the body text.
Preferably, after the step of judging each text block by its punctuation mark weight and marking the text blocks that are body text according to the result, the method further comprises the following step:
Trimming the boundaries of the text blocks marked as body text according to the paths of the text blocks, to obtain the accurate body content.
A Web page body text extraction device according to an embodiment of the present invention comprises:
A first acquisition unit, configured to obtain the title content in the HTML source code of a Web page;
A second acquisition unit, configured to obtain the paths of all text blocks in the HTML source code and to build a text block path list;
A first title text block obtaining unit, configured to compare the title content with the content of each text block and obtain the text block that contains the title content;
A punctuation mark weight calculation unit, configured to calculate, in the order of the paths in the list and starting from the path after the one corresponding to the title text block, the punctuation mark weight of the text block at each path;
A first body text block obtaining unit, configured to judge each text block by its punctuation mark weight and to mark the text blocks that are body text according to the result.
Preferably, the first acquisition unit comprises:
A third acquisition unit, configured to obtain the content of the title tag and the content of the h1 tag in the HTML source code separately;
A splitting unit, configured to split the content of the title tag into strings using separators and to store the pieces in order in an array;
A first judging unit, configured to judge separately whether the content of the first element of the array, the content of the last element, and the content of the h1 tag are conventional non-title text;
A title content obtaining unit, configured to take as the title content, when at least one of the content of the first element of the array, the content of the last element, and the content of the h1 tag is not conventional non-title text, the first of these that is not conventional non-title text.
Preferably, the first title text block obtaining unit comprises:
A comparing unit, configured to compare the edit distances between the title content and the content of each text block;
A second title text block obtaining unit, configured to take the text block with the smallest edit distance as the text block that contains the title content.
Preferably, the first body text block obtaining unit comprises:
A second judging unit, configured to judge separately whether each punctuation mark weight is greater than or equal to a first threshold;
A second body text block obtaining unit, configured to mark the text block corresponding to a punctuation mark weight as body text when that punctuation mark weight is greater than or equal to the first threshold.
Preferably, the first body text block obtaining unit further comprises:
A third judging unit, configured to judge, when the punctuation mark weight is less than the first threshold, whether the punctuation mark weight is greater than or equal to a second threshold, the second threshold being less than the first threshold;
A fourth judging unit, configured to judge separately, when the punctuation mark weight is greater than or equal to the second threshold: whether the text block corresponding to this punctuation mark weight contains an end punctuation mark; whether the punctuation mark weight of the preceding text block is greater than or equal to the second threshold; whether the punctuation mark weight of the following text block is greater than or equal to the second threshold; whether the preceding text block ends with an end punctuation mark; and whether the following text block ends with an end punctuation mark;
A third body text block obtaining unit, configured to mark the text block corresponding to this punctuation mark weight as body text when that text block contains an end punctuation mark, or the punctuation mark weights of both the preceding and the following text block are greater than or equal to the second threshold, or either the preceding or the following text block has a punctuation mark weight greater than or equal to the second threshold and ends with an end punctuation mark.
Preferably, the device further comprises, before the first acquisition unit:
A removing unit, configured to remove content in the HTML source code that is unrelated to the structure of the body text.
Preferably, the device further comprises, after the first body text block obtaining unit:
An accurate body text obtaining unit, configured to trim the boundaries of the text blocks marked as body text according to the paths of the text blocks, to obtain the accurate body content.
The technical solutions of the embodiments of the present invention have the following advantages:
1. In the Web page body text extraction method and device provided by the embodiments of the present invention, the title content is compared with the text block contents to find the text block that contains the title; then, following the order of the path list, the punctuation mark weights of the text blocks after the title text block are calculated and used to judge whether each text block is body text, so that the consecutive text blocks that form the body text are found. The method and device work directly on the HTML source code of a Web page and need no page-specific extraction template, so they scale well; and because the body text decision relies on the punctuation mark weight, the implementation is simple and the extraction precision is improved.
2. In the Web page body text extraction method and device provided by the embodiments of the present invention, the content of both the <title> and <h1> tags is obtained, so the title-bearing tags that may exist in the HTML source code are fully taken into account, which improves the accuracy of title acquisition. By checking whether the obtained title content matches a specific non-title regular expression, insubstantial non-title strings such as "forum", "blog", and "news weekly" are excluded, which further improves the accuracy of the obtained title content.
3. In the Web page body text extraction method and device provided by the embodiments of the present invention, two thresholds of different sizes are set and the punctuation mark weight is compared with each of them; the decision also draws on structural features of body text such as the neighboring text blocks and end punctuation marks, so that body text blocks are marked accurately. The whole process is clear and concise, is applicable to all Web pages, has high accuracy, and has a wide range of application.
Brief description of the drawings
In order to explain the technical solutions in the specific embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a specific example of the Web page body text extraction method in Embodiment 1 of the present invention;
Fig. 2 is a flowchart of a specific example of marking body text blocks in Embodiment 1 of the present invention;
Fig. 3 is a functional block diagram of a specific example of the Web page body text extraction device in Embodiment 2 of the present invention.
Detailed description of the embodiments
The technical solutions of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "first", "second", "third", and so on are used for descriptive purposes only and are not to be understood as indicating or implying relative importance.
In addition, the technical features involved in the different embodiments of the present invention described below can be combined with each other as long as they do not conflict.
Embodiment 1
This embodiment provides a Web page body text extraction method that combines punctuation mark weights with structural features to extract the body text from the HTML source code of a Web page. As shown in Fig. 1, the method comprises the following steps:
S1. Obtain the title content title from the HTML source code of the Web page;
S2. Obtain the feature information, including the path, of all text blocks in the HTML source code, and build a text block feature information list listBlock that contains the path list;
S3. In the text block feature information list listBlock, compare the title content title with the content of each text block, and obtain the text block blockTitle that contains the title content;
S4. In the order of the paths in the list, starting from the path after the one corresponding to blockTitle, calculate the punctuation mark weight PuncWeight of the text block at each path;
S5. Judge each text block by its punctuation mark weight PuncWeight, and mark the text blocks that are body text according to the result.
Through steps S1-S5, the above method compares the title content with the text block contents to find the text block that contains the title, then follows the order of the path list to calculate the punctuation mark weights of the text blocks after the title text block and uses those weights to judge whether each text block is body text, so that the consecutive text blocks that form the body text are found. The method works directly on the HTML source code of a Web page and needs no page-specific extraction template, so it scales well; and because the body text decision relies on the punctuation mark weight, the implementation is simple and the extraction precision is improved. The method has a low maintenance cost, can be used on a large scale with relatively high accuracy, and supports large-scale acquisition of open-source intelligence from the Internet.
Preferably, the step of obtaining the title content title from the HTML source code in step S1 comprises:
S11. Obtain the content oriTitle of the <title> tag and the content h1 of the <h1> tag in the HTML source code separately. For example, the following regular expressions can be used: (?is)<title.*?>(.+?)</title> and (?is)<h1.*?>(.*?)</h1>.
S12. Split oriTitle into strings using the separators SplitStr, and store the pieces in order in the array titleArray. For example, the separators can be -, _, &laquo, and so on.
S13. Judge separately whether the content of the first element titleArray[1] of the array titleArray, the content of the last element titleArray[last], and h1 are conventional non-title text, that is, whether they match a specific non-title regular expression excludeTitlePattern, for example [^\s]{1,5}(forum|blog|news(weekly|center|net(site)?)?|headlines|home page|net), where the keywords are Chinese in the original pattern and are shown here in translation.
S14. When at least one of the content of the first element titleArray[1] of the array titleArray, the content of the last element titleArray[last], and h1 is not conventional non-title text, take as the title content title the first of these, judged in that order, that is not conventional non-title text. One way to make the selection is as follows: first judge whether the content of titleArray[1] is conventional non-title text; if it is not, set title to titleArray[1]; if it is, judge whether the content of titleArray[last] is conventional non-title text; if it is not, set title to titleArray[last]; if it is, judge whether h1 is conventional non-title text; if it is not, set title to h1; if it is, set title to empty.
S15. When the content of the first element titleArray[1] of the array titleArray, the content of the last element titleArray[last], and h1 are all conventional non-title text, set the title content title to empty.
Through steps S11-S15, the above method obtains the content of both the <title> and <h1> tags, so the title-bearing tags that may exist in the HTML source code are fully taken into account, which improves the accuracy of title acquisition. By checking whether the obtained title content matches a specific non-title regular expression, insubstantial non-title strings such as "forum", "blog", and "news weekly" are excluded, which further improves the accuracy of the obtained title content.
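As a non-limiting illustration, steps S11-S15 can be sketched in Python as follows. The two tag regular expressions are the ones quoted in step S11; the separator set SEPARATORS, the English exclusion pattern EXCLUDE_TITLE_RE (which, unlike the pattern in step S13, also accepts a bare keyword with no prefix), and the function names are assumptions made for this sketch and are not taken from the patent.

```python
import re

# Regular expressions quoted in step S11 (case-insensitive, dot matches newline).
TITLE_RE = re.compile(r"(?is)<title.*?>(.+?)</title>")
H1_RE = re.compile(r"(?is)<h1.*?>(.*?)</h1>")

# Assumed stand-ins for SplitStr and excludeTitlePattern; the real values
# used by the patent are not fully recoverable from the text.
SEPARATORS = re.compile(r"\s*(?:-|_|\||«|&laquo;)\s*")
EXCLUDE_TITLE_RE = re.compile(
    r"^\S{0,5}(?:forum|blog|news|home|homepage)$", re.IGNORECASE
)


def is_non_title(text):
    """Return True for empty strings and for generic site labels."""
    return not text or EXCLUDE_TITLE_RE.match(text) is not None


def extract_title(html):
    """Steps S11-S15: return the first candidate that is not a generic label."""
    title_match = TITLE_RE.search(html)
    h1_match = H1_RE.search(html)
    ori_title = title_match.group(1).strip() if title_match else ""
    # Drop any inline markup nested inside the <h1> element.
    h1 = re.sub(r"<[^>]+>", "", h1_match.group(1)).strip() if h1_match else ""

    parts = [p for p in SEPARATORS.split(ori_title) if p]
    candidates = []
    if parts:
        candidates.append(parts[0])   # titleArray[1] in the 1-based notation above
        candidates.append(parts[-1])  # titleArray[last]
    candidates.append(h1)

    for candidate in candidates:
        if not is_non_title(candidate):
            return candidate
    return ""  # S15: every candidate looked like conventional non-title text
```

Splitting on a bare hyphen also splits hyphenated words inside a title, so a practical separator set would probably need to be stricter than the one assumed here.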
Preferably, step S2 is carried out as follows:
S21. Implement a class HtmlParserHandler that extends the SAX DefaultHandler class used with the TagSoup parser;
S22. In the startElement() method of the class HtmlParserHandler, push the tag name and number of the element currently being processed onto the stack stack; numbering starts from 1 and increases by 1 for every tag processed;
S23. In the characters() method of the class HtmlParserHandler, record the text content of the text block and, at the same time, join the contents of the stack stack to obtain the path feature information of the text block. The path feature information has the form of the path from the root node to the tag that contains the text block, for example /html_1/body_2/div_54/div_55/div_58/h1_59. Record the path feature information of the text block in listBlock.
S24. In the endElement() method of the class HtmlParserHandler, call the pop method of the stack stack, which follows the first-in, last-out discipline, to pop the element whose processing has just finished.
Those skilled in the art will understand that the way of obtaining the text block feature information is not limited to the class HtmlParserHandler described above; any other method that can obtain the text block feature information may be used instead.
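As one such alternative, and purely as an illustrative sketch, the stack-based path construction of steps S22-S24 can be reproduced with the HTMLParser class from the Python standard library instead of TagSoup. The class name PathCollector, the helper collect_blocks, and the handling of void elements are assumptions of this sketch.

```python
from html.parser import HTMLParser

# Void elements never receive an end tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collect (path, text) pairs such as ("/html_1/body_2/p_3", "Hello")."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open elements, each stored as "tag_number"
        self.counter = 0   # running tag number; the first tag gets 1
        self.blocks = []   # plays the role of listBlock

    def handle_starttag(self, tag, attrs):
        # S22: number every tag and push it onto the stack.
        self.counter += 1
        if tag not in VOID_TAGS:
            self.stack.append(f"{tag}_{self.counter}")

    def handle_endtag(self, tag):
        # S24: pop back to the most recent matching open tag, which also
        # tolerates end tags that are missing in sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].rsplit("_", 1)[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # S23: record non-empty text together with the joined stack contents.
        text = data.strip()
        if text:
            self.blocks.append(("/" + "/".join(self.stack), text))


def collect_blocks(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

For example, collect_blocks("<html><body><p>Hi.</p></body></html>") yields a single block whose path is /html_1/body_2/p_3.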
Preferably, step S3 comprises:
S31. Compare the edit distances between the title content title and the content of each text block; for example, the Levenshtein distance can be used.
S32. Take the text block with the smallest edit distance as the text block blockTitle that contains the title content title, and record blockTitle as the serial number of that text block.
Through steps S31-S32, the above method compares the title content with the text block contents by edit distance, so the text block that contains the title content can be found quickly; the processing is simple and effective, and it is also efficient.
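A minimal sketch of steps S31-S32 follows, using the standard dynamic-programming form of the Levenshtein distance. The function names and the convention of returning -1 for an empty block list are assumptions of this sketch.

```python
def levenshtein(a, b):
    """Edit distance between a and b, computed one table row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def find_title_block(title, block_texts):
    """S31-S32: index of the block whose text is closest to the title."""
    if not block_texts:
        return -1
    distances = [levenshtein(title, text) for text in block_texts]
    return distances.index(min(distances))
```

When several blocks share the smallest distance, this sketch keeps the earliest one; the patent does not say how ties are resolved.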
Preferably, in step S4, the punctuation mark weight PuncWeight of a text block can be calculated by the following formula:
PuncWeight = K1*S1 + K2*S2 + K3*isEndPunc + K4*isLink
Where S1 is the total number of comma, period, and semicolon symbols in the text block; S2 is the total number of colon and exclamation mark symbols in the text block; isEndPunc indicates whether the text block ends with any end punctuation mark (period, exclamation mark, question mark, or semicolon): if so, isEndPunc = 1, otherwise isEndPunc = -1; isLink indicates whether the text block as a whole is a hyperlink, that is, whether the text block lies inside an <a> tag: if so, isLink = -1, otherwise isLink = 0; K1 to K4 are the weights of the four terms on the right-hand side of the formula and can each be set in the range 0.3-1.2, for example K1 = 1.0, K2 = 0.4, K3 = 1.0, K4 = 1.0. After the PuncWeight of each text block has been calculated, it can be saved in the corresponding text block information in listBlock.
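The formula can be sketched in Python as follows, with the example weights K1 = 1.0, K2 = 0.4, K3 = 1.0, K4 = 1.0 as defaults. The exact symbol sets are assumptions: the sketch counts both the half-width and the full-width (Chinese) form of each mark, and the names CLAUSE_MARKS, EMPHASIS_MARKS, and END_MARKS are hypothetical.

```python
CLAUSE_MARKS = set(",.;，。；")    # counted in S1 (assumed symbol set)
EMPHASIS_MARKS = set(":!：！")    # counted in S2 (assumed symbol set)
END_MARKS = set(".!?;。！？；")   # period, exclamation mark, question mark, semicolon


def punc_weight(text, is_link=False, k1=1.0, k2=0.4, k3=1.0, k4=1.0):
    """PuncWeight = K1*S1 + K2*S2 + K3*isEndPunc + K4*isLink."""
    s1 = sum(ch in CLAUSE_MARKS for ch in text)
    s2 = sum(ch in EMPHASIS_MARKS for ch in text)
    stripped = text.rstrip()
    is_end_punc = 1 if stripped and stripped[-1] in END_MARKS else -1
    link_term = -1 if is_link else 0
    return k1 * s1 + k2 * s2 + k3 * is_end_punc + k4 * link_term
```

With these defaults, a sentence such as "Rain is expected, then sun." scores 3.0 (two clause marks plus the end-punctuation bonus), while a navigation link such as "Home" scores -2.0.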
Preferably, step S5 comprises:
S51. Judge separately whether each punctuation mark weight PuncWeight is greater than or equal to the first threshold; for example, the first threshold can be 3.
S52. When the punctuation mark weight PuncWeight is greater than or equal to the first threshold, mark the text block corresponding to this PuncWeight as body text.
S53. When the punctuation mark weight PuncWeight is less than the first threshold, judge whether PuncWeight is greater than or equal to the second threshold, the second threshold being less than the first threshold; for example, the second threshold can be 1.
S54. When the punctuation mark weight PuncWeight is greater than or equal to the second threshold, judge separately: whether the text block corresponding to this PuncWeight contains an end punctuation mark (the period, exclamation mark, question mark, and semicolon, in their Chinese and English forms); whether the punctuation mark weight of the preceding text block is greater than or equal to the second threshold; whether the punctuation mark weight of the following text block is greater than or equal to the second threshold; whether the preceding text block ends with an end punctuation mark; and whether the following text block ends with an end punctuation mark.
S55. When the text block corresponding to this PuncWeight contains an end punctuation mark, or the punctuation mark weights of both the preceding and the following text block are greater than or equal to the second threshold, or either the preceding or the following text block has a punctuation mark weight greater than or equal to the second threshold and ends with an end punctuation mark, mark the text block corresponding to this PuncWeight as body text.
In steps S54-S55 the judgments can be made in various orders. One specific order is shown in Fig. 2: first judge whether the text block corresponding to this PuncWeight contains an end punctuation mark; if so, mark the text block as body text; if not, judge whether the punctuation mark weights of both the preceding and the following text block are greater than or equal to the second threshold; if so, mark the text block as body text; if not, judge whether either the preceding or the following text block has a punctuation mark weight greater than or equal to the second threshold and ends with an end punctuation mark; if so, mark the text block as body text; if not, mark the text block as not body text.
S56. When the text block corresponding to this PuncWeight does not contain an end punctuation mark, and the punctuation mark weight of either the preceding or the following text block is less than the second threshold, and neither the preceding nor the following text block both has a punctuation mark weight greater than or equal to the second threshold and ends with an end punctuation mark, mark the text block corresponding to this PuncWeight as not body text.
S57. When the punctuation mark weight PuncWeight is less than the second threshold, mark the text block corresponding to this PuncWeight as not body text.
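The decision of steps S51-S57 can be sketched as a single function that follows the order of Fig. 2, using the example thresholds 3 and 1. How to treat the first and last block, which have only one neighbor, is not specified above; this sketch assumes that the "both neighbors" condition then simply fails. The function and parameter names are hypothetical.

```python
END_MARKS = set(".!?;。！？；")  # period, exclamation mark, question mark, semicolon


def ends_with_end_mark(text):
    stripped = text.rstrip()
    return bool(stripped) and stripped[-1] in END_MARKS


def is_body(index, texts, weights, first=3.0, second=1.0):
    """Decide whether block `index` is body text (S51-S57, order of Fig. 2)."""
    weight = weights[index]
    if weight >= first:        # S51-S52: clearly body text
        return True
    if weight < second:        # S57: clearly not body text
        return False
    # Middle band, second <= weight < first (S53-S56).
    if any(ch in END_MARKS for ch in texts[index]):
        return True
    neighbours = [i for i in (index - 1, index + 1) if 0 <= i < len(texts)]
    if len(neighbours) == 2 and all(weights[i] >= second for i in neighbours):
        return True
    return any(weights[i] >= second and ends_with_end_mark(texts[i])
               for i in neighbours)
```

The two thresholds split the blocks into three bands, so the neighbor checks are only evaluated for the ambiguous middle band.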
Further, step S5 can also comprise:
S58. Starting from the most recently found body text block, judge whether more than N subsequent text blocks are not body text, where N is set in the range 2-25, for example N = 20. If so, go to step S59; if not, continue the ongoing process.
S59. The judging process ends, and all consecutive body text blocks in the HTML source code are obtained.
Through steps S51-S59, the above method sets two thresholds of different sizes and compares the punctuation mark weight with each of them; the decision also draws on structural features of body text such as the neighboring text blocks and end punctuation marks, so that body text blocks are marked accurately. The whole process is clear and concise, is applicable to all Web pages, has high accuracy, and has a wide range of application.
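The stopping rule of steps S58-S59 can be sketched as a scan over per-block decisions. The sketch assumes that the decisions are already available as a list of booleans and that the gap counter only starts once a first body text block has been found; the names collect_body_indices and max_gap (the N of step S58) are hypothetical.

```python
def collect_body_indices(labels, start, max_gap=20):
    """Scan labels[start:] and stop once more than `max_gap` consecutive
    non-body blocks follow the most recently found body block (S58-S59)."""
    body = []
    gap = 0
    for i in range(start, len(labels)):
        if labels[i]:
            body.append(i)
            gap = 0          # a body block resets the run of non-body blocks
        elif body:
            gap += 1
            if gap > max_gap:
                break        # S59: the judging process ends
    return body
```

A small N stops the scan sooner after the body ends, while a larger N tolerates longer interruptions such as embedded captions or link lists.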
Preferably, the following step is further included before step S1 above:
S0, remove content unrelated to the body text structure from the webpage HTML source code. Removing this content eliminates a large amount of noise in the webpage HTML source code, which simplifies the subsequent processing steps and improves their accuracy.
Specifically, step S0 above includes:
S01, remove the <head> tag and the content it contains;
S02, replace &nbsp and other similar special characters beginning with & with the empty string;
S03, remove the <!DOCTYPE> tag;
S04, remove the comment tag <!---->;
S05, remove the <script> tag and the script information it contains;
S06, remove the <style> tag and the CSS style information it contains;
S07, remove the <select> tag and the content it contains;
S08, remove the tags <font>, <strike>, <u>, <b>, <i>, <em>, <strong>, <sub>, <code>, <tt>, <sup>, <var>, <abbr>, <ACRONYM>, <center>, and <ignore_js_op>;
S09, replace <span> with a space.
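Steps S01-S09 can be approximated with regular expressions, as in the sketch below. It is a rough illustration rather than the method's actual implementation: regular expressions are not a full HTML parser, requiring a trailing semicolon on character references in S02 is an assumption made to avoid deleting ordinary text such as "AT&T", and treating S08 as removing only the tag markers while keeping the enclosed text is also an assumption. The names `INLINE_TAGS` and `clean_html` are hypothetical.

```python
import re

# Tags named in S08; only the tag markers are removed, the enclosed text is kept.
INLINE_TAGS = ("font", "strike", "u", "b", "i", "em", "strong", "sub", "code",
               "tt", "sup", "var", "abbr", "acronym", "center", "ignore_js_op")

FLAGS = re.IGNORECASE | re.DOTALL


def remove_element(html: str, tag: str) -> str:
    """Remove an element together with everything it contains."""
    return re.sub(rf"<{tag}\b[^>]*>.*?</{tag}\s*>", "", html, flags=FLAGS)


def clean_html(html: str) -> str:
    html = remove_element(html, "head")                          # S01
    html = re.sub(r"&#?[0-9A-Za-z]+;", "", html)                 # S02
    html = re.sub(r"<!DOCTYPE[^>]*>", "", html, flags=FLAGS)     # S03
    html = re.sub(r"<!--.*?-->", "", html, flags=FLAGS)          # S04
    for tag in ("script", "style", "select"):                    # S05-S07
        html = remove_element(html, tag)
    inline = "|".join(INLINE_TAGS)
    html = re.sub(rf"</?(?:{inline})\b[^>]*>", "", html, flags=FLAGS)  # S08
    html = re.sub(r"</?span\b[^>]*>", " ", html, flags=FLAGS)    # S09
    return html


page = ('<!DOCTYPE html><html><head><title>T</title></head><body><!-- c -->'
        '<script>var a=1;</script><style>p{}</style>'
        '<p><b>Hello</b>&nbsp;<span>world</span>.</p>'
        '<select><option>x</option></select></body></html>')
cleaned = clean_html(page)
print(cleaned)
```

The word-boundary `\b` after each tag name keeps short names such as `b` or `i` from matching `<body>` or `<img>`.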
Preferably, the following step is further included after step S5 above:
S6, according to the paths of the text blocks, trim the boundaries of the text blocks that have been marked as body text to obtain the accurate body content. The trimming analyzes the HTML structure information of these text blocks to further determine the exact range of the body content, so that the accurate body content of the webpage is extracted and the accuracy of body text extraction is further improved.
Specifically, step S6 above includes:
S61, group all body text blocks by parent node path, so that text blocks with the same parent node path form one group;
S62, for each group, calculate the sum of the punctuation mark weights PuncWeight of all text blocks in the group, sumPuncWeight;
S63, select the group with the largest sumPuncWeight and find the path information pathM that occurs most often among all text blocks of that group; the index of the last node in the path is ignored during this lookup;
S64, compare pathM with the path information of all text blocks that have been marked as body text, and select the start text block startBlock and the end text block endBlock that have the same path as pathM;
S65, collect all path information appearing in startBlock and endBlock, denoted setPath;
S66, starting from the text block that follows the title text block blockTitle, judge one by one, in listBlock, whether the path information of each text block is in setPath, and mark the text block if so;
S67, record the start text block contenStartBlock and the end text block contenEndBlock marked in the previous step;
S68, following the order in listBlock, concatenate the contents of contenStartBlock, contenEndBlock, and all text blocks between them; the concatenated result is the accurate body content.
Through steps S61-S68, the Web page body text extraction method above uses the path feature information of the text blocks, combined with their punctuation mark weights, to further trim the marked body text blocks. It finds the exact start text block and end text block of the body text and concatenates the content between them into the accurate body content, further improving the precision of Web page body text extraction.
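The grouping and selection in steps S61-S63 can be sketched as follows. The path format (slash-separated node names with an optional bracketed index, such as `html/body/div[1]/p[2]`) is an assumption, since this section does not define how paths are written, and the names `parent_path`, `strip_last_index`, and `dominant_path` are hypothetical. Steps S64-S68 are not covered by this sketch.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def parent_path(path: str) -> str:
    """Path of the parent node: everything before the last '/'."""
    return path.rsplit("/", 1)[0]


def strip_last_index(path: str) -> str:
    """Drop the index of the last node, e.g. '.../p[2]' becomes '.../p'."""
    return re.sub(r"\[\d+\]$", "", path)


def dominant_path(blocks: List[Tuple[str, float]]) -> str:
    """blocks holds (path, PuncWeight) pairs of the marked body text blocks."""
    groups: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for path, weight in blocks:                       # S61: group by parent path
        groups[parent_path(path)].append((path, weight))
    # S62-S63: pick the group with the largest sumPuncWeight.
    best = max(groups.values(), key=lambda g: sum(w for _, w in g))
    # S63: most frequent path in that group, ignoring the last node's index.
    counts = Counter(strip_last_index(p) for p, _ in best)
    return counts.most_common(1)[0][0]


blocks = [
    ("html/body/div[1]/p[1]", 0.6),
    ("html/body/div[1]/p[2]", 0.7),
    ("html/body/div[1]/h2[1]", 0.1),
    ("html/body/div[2]/li[1]", 0.2),
]
path_m = dominant_path(blocks)
print(path_m)
```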
Embodiment 2
Corresponding to embodiment 1, this embodiment provides a Web page body text extraction device which, as shown in Fig. 3, comprises:
First acquisition unit 1, for obtaining the title content in the webpage HTML source code;
Second acquisition unit 2, for obtaining the paths of all text blocks in the webpage HTML source code and establishing a text block path list;
First title text block obtaining unit 3, for comparing the title content with the text block content of each text block to obtain the text block where the title content is located;
Punctuation mark weight calculation unit 4, for calculating, in the order of the paths in the list and starting from the path following the path of the text block where the title content is located, the punctuation mark weight of the text block corresponding to each path;
First body text block obtaining unit 5, for making judgments according to the punctuation mark weights and marking the text blocks that are body text according to the judgment results.
By comparing the title content with the text block contents, the Web page body text extraction device above finds the text block where the title content is located. Then, following the order of the path list, it calculates the punctuation mark weights of the text block after the title text block and of each subsequent text block, and uses these weights to judge whether each text block is body text, thereby finding the continuous text blocks that are body text. The device is thus applicable to webpage HTML source code without a specific extraction template and scales well. Using punctuation mark weights to judge body text is simple to implement and improves extraction precision. The device has a low maintenance cost, can be used at large scale with high accuracy, and supports large-scale collection of open-source information from the Internet.
Preferably, the first acquisition unit 1 includes:
Third acquiring unit, for obtaining the content of the title tag and the content of the h1 tag in the webpage HTML source code respectively;
Cutting unit, for splitting the content of the title tag into substrings using separators and storing the results in order in an array;
First judging unit, for judging respectively whether the content of the first element of the array, the content of the last element of the array, and the content of the h1 tag are conventional non-title text;
Title content obtaining unit, for, when any one of the content of the first element of the array, the content of the last element of the array, and the content of the h1 tag is not conventional non-title text, taking as the title content the first of these contents that is not conventional non-title text.
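The title selection performed by these units can be sketched as below. The separator characters and the example set of conventional non-title text are assumptions, since this section does not list them, and the names `SEPARATORS` and `pick_title` are hypothetical. The candidates are checked in the order given above: first array element, last array element, then the h1 content.

```python
import re
from typing import AbstractSet, Optional

# Assumed separators between the site name and the article title.
SEPARATORS = r"[|_\-]"


def pick_title(title_tag: str, h1: str,
               non_title: AbstractSet[str]) -> Optional[str]:
    """Return the first candidate that is not conventional non-title text."""
    parts = [p.strip() for p in re.split(SEPARATORS, title_tag) if p.strip()]
    candidates = []
    if parts:
        candidates += [parts[0], parts[-1]]   # first and last array elements
    candidates.append(h1.strip())             # then the h1 content
    for candidate in candidates:
        if candidate and candidate.lower() not in non_title:
            return candidate
    return None                               # every candidate was non-title text


known_non_titles = {"example news", "home"}   # hypothetical entries
title = pick_title("Example News | Breaking story about X",
                   "Breaking story about X", known_non_titles)
print(title)
```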
Preferably, the first title text block obtaining unit 3 includes:
Comparing unit, for comparing the edit distances between the title content and the text block content of each text block;
Second title text block obtaining unit, for taking the text block with the smallest edit distance as the text block where the title content is located.
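The edit distance comparison can be sketched with the standard Levenshtein dynamic program. The function names `edit_distance` and `find_title_block` are hypothetical, and returning the earliest block when several share the smallest distance is an assumption; the text does not say how ties are resolved.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def find_title_block(title: str, block_texts: List[str]) -> int:
    """Index of the text block whose content is closest to the title."""
    return min(range(len(block_texts)),
               key=lambda i: edit_distance(title, block_texts[i]))


blocks = ["Navigation", "Web page texts", "Some body text here."]
index = find_title_block("Web page text", blocks)
print(index)
```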
Preferably, the first body text block obtaining unit 5 includes:
Second judging unit, for judging respectively whether each punctuation mark weight is greater than or equal to the first threshold;
Second body text block obtaining unit, for marking the text block corresponding to a punctuation mark weight as body text when that punctuation mark weight is greater than or equal to the first threshold.
Preferably, the first body text block obtaining unit 5 further includes:
Third judging unit, for judging, when the punctuation mark weight is less than the first threshold, whether the punctuation mark weight is greater than or equal to the second threshold, the second threshold being less than the first threshold;
Fourth judging unit, for judging respectively, when the punctuation mark weight is greater than or equal to the second threshold: whether the text block corresponding to this punctuation mark weight contains an ending punctuation mark; whether the punctuation mark weight of the previous text block is greater than or equal to the second threshold; whether the punctuation mark weight of the next text block is greater than or equal to the second threshold; whether the previous text block ends with an ending punctuation mark; and whether the next text block ends with an ending punctuation mark;
Third body text block obtaining unit, for marking the text block corresponding to this punctuation mark weight as body text when that text block contains an ending punctuation mark, or when the punctuation mark weights of both the previous text block and the next text block are greater than or equal to the second threshold, or when either of the previous text block and the next text block has a punctuation mark weight greater than or equal to the second threshold and ends with an ending punctuation mark.
Preferably, the device further includes, before the first acquisition unit 1:
Removal unit, for removing content unrelated to the body text structure from the webpage HTML source code.
Preferably, the device further includes, after the first body text block obtaining unit 5:
Accurate body text obtaining unit, for trimming, according to the paths of the text blocks, the boundaries of the text blocks that have been marked as body text to obtain the accurate body content.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, the above embodiments are merely examples given for clarity of description and do not limit the embodiments. Those of ordinary skill in the art can make other variations or changes in different forms on the basis of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Obvious variations or changes derived from the above remain within the protection scope of the present invention.

Claims (7)

1. A Web page body text extraction method, characterized by comprising the following steps:
Obtaining the title content in the webpage HTML source code;
Obtaining the paths of all text blocks in the webpage HTML source code, and establishing a text block path list;
Comparing the title content with the text block content of each text block to obtain the text block where the title content is located;
In the order of the paths in the list, starting from the path following the path of the text block where the title content is located, calculating the punctuation mark weight of the text block corresponding to each path;
Making judgments according to the punctuation mark weights, and marking the text blocks that are body text according to the judgment results;
The step of making judgments according to the punctuation mark weights and marking the text blocks that are body text according to the judgment results includes:
Judging respectively whether each punctuation mark weight is greater than or equal to a first threshold;
When the punctuation mark weight is greater than or equal to the first threshold, marking the text block corresponding to this punctuation mark weight as body text;
When the punctuation mark weight is less than the first threshold, judging whether the punctuation mark weight is greater than or equal to a second threshold, the second threshold being less than the first threshold;
When the punctuation mark weight is greater than or equal to the second threshold, judging respectively: whether the text block corresponding to this punctuation mark weight contains an ending punctuation mark; whether the punctuation mark weight of the previous text block is greater than or equal to the second threshold; whether the punctuation mark weight of the next text block is greater than or equal to the second threshold; whether the previous text block ends with an ending punctuation mark; and whether the next text block ends with an ending punctuation mark;
When the text block corresponding to this punctuation mark weight contains an ending punctuation mark, or the punctuation mark weights of both the previous text block and the next text block are greater than or equal to the second threshold, or either of the previous text block and the next text block has a punctuation mark weight greater than or equal to the second threshold and ends with an ending punctuation mark, marking the text block corresponding to this punctuation mark weight as body text.
2. The method according to claim 1, characterized in that the step of obtaining the title content in the webpage HTML source code includes:
Obtaining the content of the title tag and the content of the h1 tag in the webpage HTML source code respectively;
Splitting the content of the title tag into substrings using separators, and storing the results in order in an array;
Judging respectively whether the content of the first element of the array, the content of the last element of the array, and the content of the h1 tag are conventional non-title text;
When any one of the content of the first element of the array, the content of the last element of the array, and the content of the h1 tag is not conventional non-title text, taking as the title content the first of these contents that is not conventional non-title text.
3. The method according to claim 1, characterized in that the step of comparing the title content with the text block content of each text block to obtain the text block where the title content is located includes:
Comparing the edit distances between the title content and the text block content of each text block;
Taking the text block with the smallest edit distance as the text block where the title content is located.
4. The method according to claim 1, characterized in that the following step is further included before the step of obtaining the title content in the webpage HTML source code:
Removing content unrelated to the body text structure from the webpage HTML source code.
5. The method according to any one of claims 1-4, characterized in that the following step is further included after the step of making judgments according to the punctuation mark weights and marking the text blocks that are body text according to the judgment results:
According to the paths of the text blocks, trimming the boundaries of the text blocks that have been marked as body text to obtain the accurate body content.
6. A Web page body text extraction device, characterized by comprising:
A first acquisition unit, for obtaining the title content in the webpage HTML source code;
A second acquisition unit, for obtaining the paths of all text blocks in the webpage HTML source code and establishing a text block path list;
A first title text block obtaining unit, for comparing the title content with the text block content of each text block to obtain the text block where the title content is located;
A punctuation mark weight calculation unit, for calculating, in the order of the paths in the list and starting from the path following the path of the text block where the title content is located, the punctuation mark weight of the text block corresponding to each path;
A first body text block obtaining unit, for making judgments according to the punctuation mark weights and marking the text blocks that are body text according to the judgment results;
The first body text block obtaining unit includes:
A second judging unit, for judging respectively whether each punctuation mark weight is greater than or equal to a first threshold;
A second body text block obtaining unit, for marking the text block corresponding to a punctuation mark weight as body text when that punctuation mark weight is greater than or equal to the first threshold;
A third judging unit, for judging, when the punctuation mark weight is less than the first threshold, whether the punctuation mark weight is greater than or equal to a second threshold, the second threshold being less than the first threshold;
A fourth judging unit, for judging respectively, when the punctuation mark weight is greater than or equal to the second threshold: whether the text block corresponding to this punctuation mark weight contains an ending punctuation mark; whether the punctuation mark weight of the previous text block is greater than or equal to the second threshold; whether the punctuation mark weight of the next text block is greater than or equal to the second threshold; whether the previous text block ends with an ending punctuation mark; and whether the next text block ends with an ending punctuation mark;
A third body text block obtaining unit, for marking the text block corresponding to this punctuation mark weight as body text when that text block contains an ending punctuation mark, or when the punctuation mark weights of both the previous text block and the next text block are greater than or equal to the second threshold, or when either of the previous text block and the next text block has a punctuation mark weight greater than or equal to the second threshold and ends with an ending punctuation mark.
7. The device according to claim 6, characterized in that the first acquisition unit includes:
A third acquiring unit, for obtaining the content of the title tag and the content of the h1 tag in the webpage HTML source code respectively;
A cutting unit, for splitting the content of the title tag into substrings using separators and storing the results in order in an array;
A first judging unit, for judging respectively whether the content of the first element of the array, the content of the last element of the array, and the content of the h1 tag are conventional non-title text;
A title content obtaining unit, for, when any one of the content of the first element of the array, the content of the last element of the array, and the content of the h1 tag is not conventional non-title text, taking as the title content the first of these contents that is not conventional non-title text.
CN201610986453.5A 2016-11-03 2016-11-03 A kind of Web page text extracting method and device Expired - Fee Related CN106649560B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610986453.5A CN106649560B (en) 2016-11-03 2016-11-03 A kind of Web page text extracting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610986453.5A CN106649560B (en) 2016-11-03 2016-11-03 A kind of Web page text extracting method and device

Publications (2)

Publication Number Publication Date
CN106649560A CN106649560A (en) 2017-05-10
CN106649560B true CN106649560B (en) 2019-09-24

Family

ID=58805474

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610986453.5A Expired - Fee Related CN106649560B (en) 2016-11-03 2016-11-03 A kind of Web page text extracting method and device

Country Status (1)

Country Link
CN (1) CN106649560B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113723980A (en) * 2020-05-26 2021-11-30 北京达佳互联信息技术有限公司 Method and device for detecting advertisement landing page, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101042692A (en) * 2006-03-24 2007-09-26 富士通株式会社 translation obtaining method and apparatus based on semantic forecast
CN101458718A (en) * 2009-01-05 2009-06-17 北京大学 Search engine dynamic summarization extracting method
CN102591612A (en) * 2011-12-27 2012-07-18 厦门市美亚柏科信息股份有限公司 General webpage text extraction method based on punctuation continuity and system thereof
US8682883B2 (en) * 2011-04-14 2014-03-25 Predictix Llc Systems and methods for identifying sets of similar products

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101042692A (en) * 2006-03-24 2007-09-26 富士通株式会社 translation obtaining method and apparatus based on semantic forecast
CN101458718A (en) * 2009-01-05 2009-06-17 北京大学 Search engine dynamic summarization extracting method
US8682883B2 (en) * 2011-04-14 2014-03-25 Predictix Llc Systems and methods for identifying sets of similar products
CN102591612A (en) * 2011-12-27 2012-07-18 厦门市美亚柏科信息股份有限公司 General webpage text extraction method based on punctuation continuity and system thereof

Also Published As

Publication number Publication date
CN106649560A (en) 2017-05-10

Similar Documents

Publication Publication Date Title
JP6842167B2 (en) Summary generator, summary generation method and computer program
CN107392143B (en) Resume accurate analysis method based on SVM text classification
CN101408898B (en) Method and device for extracting web page text
CN101251855B (en) Equipment, system and method for cleaning internet web page
CN107358208B (en) A kind of PDF document structured message extracting method and device
US7606816B2 (en) Record boundary identification and extraction through pattern mining
CN104881458B (en) A kind of mask method and device of Web page subject
CN111274814B (en) Novel semi-supervised text entity information extraction method
US9449114B2 (en) Removing non-substantive content from a web page by removing its text-sparse nodes and removing high-frequency sentences of its text-dense nodes using sentence hash value frequency across a web page collection
CN106960058A (en) A kind of structure of web page alteration detection method and system
KR20120051419A (en) Apparatus and method for extracting cascading style sheet
Uzun et al. An effective and efficient Web content extractor for optimizing the crawling process
CN108536683A (en) A kind of paper fragmentation information abstracting method based on machine learning
KR102110281B1 (en) Automated composition evaluator
CN106227770A (en) A kind of intelligentized news web page information extraction method
CN106649560B (en) A kind of Web page text extracting method and device
CN106372038A (en) Keyword extraction method and device
CN105183730B (en) The treating method and apparatus of webpage information
CN104217025B (en) For the entry extraction system and method for more record webpages
CN104615728B (en) A kind of webpage context extraction method and device
CN111611788B (en) Data processing method and device, electronic equipment and storage medium
CN109213974A (en) A kind of electronic document conversion method and device
CN104636324B (en) Topic source tracing method and system
CN109670162A (en) The determination method, apparatus and terminal device of title
CN112818693A (en) Automatic extraction method and system for electronic component model words

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190924

Termination date: 20201103