CN105740355B - Webpage context extraction method and device based on aggregation text density - Google Patents


Info

Publication number
CN105740355B
Authority
CN
China
Prior art keywords
text
queue
null
ncv
threshold value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610050995.1A
Other languages
Chinese (zh)
Other versions
CN105740355A (en)
Inventor
刘忠
陈发君
黄金才
朱承
修保新
程光权
陈超
冯旸赫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201610050995.1A priority Critical patent/CN105740355B/en
Publication of CN105740355A publication Critical patent/CN105740355A/en
Application granted granted Critical
Publication of CN105740355B publication Critical patent/CN105740355B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention provides a webpage main-text extraction method and device based on aggregated text density. The method splits the text content of a webpage along its HTML tags, so that the different classes of text in the page are effectively separated. Because no site-specific extraction rules need to be customized, the method is general; because no complex text-mining techniques are used, it is simple and efficient, and it extracts the main text of all kinds of webpages accurately.

Description

Webpage context extraction method and device based on aggregation text density
Technical field
The present invention relates to the technical field of web crawlers, and in particular to a webpage main-text extraction method and device based on aggregated text density.
Background art
With the rapid development of the information society, the Internet has become an important source from which people obtain information. Users usually view web content directly in a browser. In addition, much Internet-based information processing work (such as information retrieval, data mining, and machine translation) takes the content of webpages as its basic data and operates mainly on the main text of the page. Besides useful information (such as the main body content), most webpages also contain a large amount of noise, such as site navigation, related links, advertisements, copyright notices, and scripts. How to extract the main text of a webpage accurately and efficiently, neither omitting main text nor admitting noise, has become an important topic in web information extraction and its applications, and has high application value and practical significance.
Several extraction methods for this problem exist in the prior art:
1) Main-text extraction based on the DOM tree structure
The non-standard structures or information in the webpage's HTML file are first repaired (for example, a start tag such as <h1> that has no matching end tag) to produce a standard HTML file. The HTML file is then parsed into a DOM (Document Object Model) tree. Finally, the DOM tree is traversed to identify and discard non-main-text information, and the main text is extracted according to rules such as page layout and text density. However, the page structure of many websites is becoming more complex and less standard, which can make it impossible to build the DOM tree, so that main-text extraction and extraction-template construction fail. Building and then traversing the DOM tree also has high time and space complexity, so the method is inefficient and slow. Noise identification requires manually maintained and updated information (such as lists of advertisement servers) and cannot be automated.
2) Rule-based main-text extraction
Extraction rules, such as regular expressions or XPath expressions, are specified manually for a particular website. The advantage is high accuracy. The disadvantages are that the approach is not general and does not scale: it can only parse pages of a fixed website or fixed format, formulating the rules is time-consuming and laborious, and once the page layout changes it is difficult to notice the change and update the rules in time.
3) Text-block extraction based on page segmentation
Separator lines in the HTML tags and certain visual cues (such as text color, font size, and text information) are used to segment the text blocks in the page. Because HTML styles differ between websites, there is no uniform segmentation method and generality is hard to guarantee; many auxiliary manual rules have to be added.
4) Main-text extraction based on data mining and machine learning
The method includes the following steps: linearly reconstructing the page code so that the logical order of the text paragraphs is not destroyed by irregular tag nesting; filtering HTML noise tags; parsing and storing text paragraphs in units of <table> tags; and clustering the paragraphs with a text clustering algorithm to generate the main text. The problem is that a simple problem is made complicated, so main-text extraction becomes cumbersome, which hinders large-scale use.
Summary of the invention
In view of the technical problems of the prior art described in the background above, the object of the present invention is to provide a webpage main-text extraction method and device based on aggregated text density.
The present invention provides a webpage main-text extraction method based on aggregated text density, comprising the following steps: Step S100: obtaining the HTML source text of a webpage, deleting the valueless first tags, and removing the special characters in the text to obtain a sample text; Step S200: replacing every second tag in the sample text with blank lines (nulls) to generate a plurality of blank-line-separated texts, and converting them into a queue T in which every two adjacent texts are separated by newline characters; Step S300: splitting the queue T into a plurality of subqueues according to a text threshold and an index threshold, merging all texts in each subqueue into one text block, and forming a queue B from the text blocks; Step S400: selecting the text with the greatest text length from the queue B as the main text of the webpage. The index threshold is a preset number of blank lines between any two subqueues, and the text threshold is a preset number of text characters contained in a subqueue.
Further, in step S200 the second tags are replaced using regular expressions, with the substitution rule R [("i", n)], where "i" is a second tag and n is the number of blank lines with which that tag is replaced.
Further, step S300 comprises the following steps:
Step S310: looping over the queue T and denoting the current element as Tc; if the number of effective Chinese characters in the current element Tc is less than the text threshold, adding the text of Tc to the queue B and continuing to traverse the queue T; if the number of effective Chinese characters in Tc is greater than the text threshold, marking Tc as the current effective text Tcv and creating a temporary text block Temp whose value is the text of Tcv;
Step S320: traversing the queue T starting from the element after the current effective text Tcv, ignoring space or blank-line elements until the next effective text Ncv is found; if the difference between the position indices of Ncv and Tcv in the queue T is less than the index threshold, appending the text of Ncv to the temporary text block Temp and assigning Ncv to the effective text Tcv;
Step S330: continuing to traverse the queue T from the element after the next effective text Ncv to the following effective element Ncv_{i+2}; if the difference between the position indices of Ncv_{i+2} and the current effective text Tcv in the queue T is greater than the index threshold, putting a copy of the temporary text block Temp into the queue B, assigning Ncv_{i+2} to the current element Tc, and continuing to loop over the queue T.
Further, the first tags are valueless HTML tags.
Another aspect of the present invention provides a webpage main-text extraction device based on aggregated text density for use with the above method, comprising: a webpage HTML file acquisition module for obtaining the HTML source text of a webpage, deleting the valueless first tags, and removing the special characters in the text to obtain a sample text; a blank-line segmentation module for replacing every second tag in the sample text with blank lines (nulls) to generate a plurality of blank-line-separated texts and converting them into a queue T in which every two adjacent texts are separated by newline characters; a queue conversion module for splitting the queue T into a plurality of subqueues according to a text threshold and an index threshold, merging all texts in each subqueue into one text block, and forming a queue B from the text blocks; and a main-text selection module for selecting the text with the greatest text length from the queue B as the main text of the webpage.
Further, the second tags are replaced using regular expressions, with the substitution rule R [("i", n)], where "i" is a second tag and n is the number of blank lines with which that tag is replaced.
Further, the queue conversion module comprises: a first loop module for looping over the queue T and denoting the current element as Tc, wherein if the number of effective Chinese characters in Tc is less than the text threshold, the text of Tc is added to the queue B and traversal of the queue T continues, and if the number of effective Chinese characters in Tc is greater than the text threshold, Tc is marked as the current effective text Tcv and a temporary text block Temp is created whose value is the text of Tcv; a second loop module for traversing the queue T starting from the element after the current effective text Tcv, ignoring space or blank-line elements until the next effective text Ncv is found, wherein if the difference between the position indices of Ncv and Tcv in the queue T is less than the index threshold, the text of Ncv is appended to the temporary text block Temp and Ncv is assigned to the effective text Tcv; and a queue B forming module for continuing to traverse the queue T from the element after the next effective text Ncv to the following effective element Ncv_{i+2}, wherein if the difference between the position indices of Ncv_{i+2} and the current effective text Tcv in the queue T is greater than the index threshold, a copy of the temporary text block Temp is put into the queue B, Ncv_{i+2} is assigned to the current element Tc, and looping over the queue T continues.
Technical effects of the invention:
The webpage main-text extraction method based on aggregated text density provided by the present invention does not require site-specific extraction rules to be customized and is therefore general; it does not use complex text-mining techniques and is therefore simple and efficient, and it extracts the main text of all kinds of webpages accurately. The method cleans and converts the obtained webpage HTML according to its tags and then obtains the main text by aggregation. It needs no site-specific rules, which avoids rules with poor generality, and it neither builds nor traverses a DOM tree, which avoids low efficiency. Practical testing shows that the method extracts webpage main text accurately and efficiently and is applicable to all kinds of websites.
The webpage main-text extraction device based on aggregated text density provided by the present invention does not use complex text-mining techniques; it is simple and efficient, and it extracts the main text of all kinds of webpages accurately.
For details, refer to the following description of various embodiments of the webpage main-text extraction method and device based on aggregated text density according to the present invention, which will make the above and other aspects of the invention apparent.
Brief description of the drawings
Fig. 1 is a flow chart of the webpage main-text extraction method based on aggregated text density provided by the present invention;
Fig. 2 is a structural diagram of the webpage main-text extraction device based on aggregated text density provided by the present invention.
Detailed description of the embodiments
The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it.
Referring to Fig. 1, the webpage main-text extraction method based on aggregated text density provided by the present invention comprises the following steps:
Step S100: obtaining the HTML source text of a webpage, deleting the valueless first tags, and removing the special characters in the text to obtain a sample text;
Step S200: replacing every second tag in the sample text with blank lines (nulls) to generate a plurality of blank-line-separated texts, and converting them into a queue T in which every two adjacent texts are separated by newline characters;
Step S300: splitting the queue T into a plurality of subqueues according to a text threshold and an index threshold, merging all texts in each subqueue into one text block, and forming a queue B from the text blocks, wherein the index threshold is a preset number of blank lines between any two subqueues and the text threshold is a preset number of text characters contained in a subqueue;
Step S400: selecting the text with the greatest text length from the queue B as the main text of the webpage.
The present invention starts from the replacement and deletion of tags and, according to the number of text characters and the number of blank lines, divides the text in the source file into different subqueues, thereby separating the main text from text that serves other purposes. The method does not require specific extraction rules to be set manually for particular webpages; replacing the tags as described is sufficient to extract the main text, which improves efficiency.
The valueless first tags may be any of the common valueless HTML tags. The valueless HTML tags referred to here include, but are not limited to, comments (<!--...-->, <!...>), scripts (<script...>...</script>), the head (<head...>...</head>), style sheets (<link.../>), and editing controls (<input.../>).
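The deletion of first tags in step S100 can be sketched with regular expressions. This is a minimal illustration, not the patented implementation: the patent names the tag categories but gives no expressions, so the pattern list FIRST_TAG_PATTERNS and the function name remove_first_tags are assumptions made here.

```python
import re

# Hypothetical patterns for the first-tag categories named in the description.
# The patent does not specify the expressions; these are illustrative only.
FIRST_TAG_PATTERNS = [
    r"<!--.*?-->",                      # comments
    r"<![^>]*>",                        # declarations such as <!DOCTYPE ...>
    r"<script\b[^>]*>.*?</script\s*>",  # scripts, including their content
    r"<head\b[^>]*>.*?</head\s*>",      # the whole head element
    r"<link\b[^>]*>",                   # style sheet links
    r"<input\b[^>]*>",                  # editing controls
]
_FIRST_TAG_RE = re.compile("|".join(FIRST_TAG_PATTERNS), re.IGNORECASE | re.DOTALL)


def remove_first_tags(html: str) -> str:
    """Delete every match of the first-tag patterns from the HTML source."""
    return _FIRST_TAG_RE.sub("", html)
```

Comments are listed before the generic `<!...>` pattern so that a comment containing a `>` character is removed as a whole rather than cut short.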
Removing special characters: certain characters are represented by special entities in the webpage source; for example, a space appears in the source code as "&nbsp;". Such special characters carry no specific meaning here and are deleted. Specifically, in this step the text of each element in the queue T is examined and the common special characters in it are removed. These special characters include, but are not limited to, the space ("&nbsp;"), the greater-than sign ("&gt;"), the less-than sign ("&lt;"), and the quotation mark ("&quot;").
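A minimal sketch of the special-character removal follows. The tuple SPECIAL_ENTITIES holds only the four entities named above, and the function name strip_special_characters is an assumption; a real implementation would likely cover more entities.

```python
import re

# The four entities named in the description; the list is not exhaustive.
SPECIAL_ENTITIES = ("&nbsp;", "&gt;", "&lt;", "&quot;")
_ENTITY_RE = re.compile("|".join(re.escape(e) for e in SPECIAL_ENTITIES))


def strip_special_characters(text: str) -> str:
    """Delete the listed entities outright, as the description says (no decoding)."""
    return _ENTITY_RE.sub("", text)
```

Note that the entities are deleted rather than decoded, which follows the description; entities not in the list are left untouched.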
The second tags referred to here are the ordinary HTML tags that remain after the valueless first tags have been deleted. Once every second tag in the HTML text has been replaced with a certain number of blank lines, the text containing the main body content in the sample text is separated from the content delimited by other tags.
Preferably, step S200 comprises replacing the text tags among the second tags with blank lines according to the following rule. The substitution rule is a correspondence R [("i", n)], where "i" is a second tag and n is the number of blank lines with which that tag is replaced, for example R: [("div", 5), ("tr", 5), ("h1", 9), ("br", 5), ("span", 4), ("table", 2)]. The replacement is performed using regular expressions.
Specifically, every element of R is a key-value pair. The key of an element is a tag name, such as div, tr, or h1, which are common second tags. The value is the number of blank lines substituted during tag conversion; for example, the element ("div", 5) means that when the detected second tag is div, its start tag or end tag is replaced with 5 newline characters ("\n"). Any second tag not listed in R is replaced with a single newline character. The replacement principle of this step is based on visual effect: second tags that create larger visual spacing are replaced with more blank lines. The texts of the webpage, now separated by blank lines, then form a list T in which every two adjacent texts are separated by blank lines.
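The substitution rule can be sketched as follows. The table R uses the example values given above; the tag-matching expression _TAG_RE and the function name tags_to_blank_lines are assumptions, and splitting the result on the newline character is one plausible way to obtain the list T, in which empty strings stand for blank lines.

```python
import re
from typing import List

# Example rule table from the description: tag name -> number of newlines.
R = {"div": 5, "tr": 5, "h1": 9, "br": 5, "span": 4, "table": 2}

# Hypothetical expression matching any start or end tag and capturing its name.
_TAG_RE = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def tags_to_blank_lines(sample_text: str) -> List[str]:
    """Replace each tag with newlines per R (default 1) and split into list T."""
    def repl(match: "re.Match[str]") -> str:
        return "\n" * R.get(match.group(1).lower(), 1)

    return _TAG_RE.sub(repl, sample_text).split("\n")


T = tags_to_blank_lines("<p>甲</p><span>乙</span>")
# The two texts end up five positions apart: one newline from </p>
# plus four from <span>.
```

With this representation, the difference between the indices of two texts in T equals the number of newline characters inserted between them, which is the quantity the index threshold is later compared against.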
Step S300 is the text aggregation step. After the above steps, the tags have split the webpage text into small text blocks separated from one another by blank lines; this step gathers small text blocks whose physical positions are adjacent into one text block.
Specifically, step S300 comprises the following steps:
Step S310: The queue T is traversed in a loop, and the current element is denoted Tc. If the number of effective Chinese characters in Tc is less than the specified text length threshold (the text threshold, for example 4), the text of Tc is added to the queue B and traversal of the queue T continues. If the number of effective Chinese characters in Tc is greater than the threshold, Tc is an effective text: Tc is marked as the current effective text Tcv, and a temporary text block Temp is created whose value is the text of Tcv.
Step S320: The queue T is then traversed starting from the element after Tcv, ignoring space or blank-line elements until the next effective text Ncv is found. If the difference between the position indices of Ncv and Tcv in the queue T is less than the specified index threshold (for example 7), the text of Ncv is appended to the temporary text block Temp and Ncv is assigned to the effective text Tcv.
Step S330: Traversal of the queue T then continues from the element after Ncv to the following effective element Ncv_{i+2}. If the difference between the position indices of Ncv_{i+2} and the current effective text Tcv in the queue T is greater than the specified index threshold, a copy of the text block Temp is put into the queue B, Ncv_{i+2} is assigned to the current element Tc, and looping over the queue T continues.
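Steps S310 to S330 can be sketched as a single loop. Several points are assumptions made for this illustration because the description leaves them open: "effective Chinese characters" are counted as characters in the range U+4E00 to U+9FFF; "next effective text" is read as the next non-blank element; a count equal to the text threshold is treated as effective and a gap equal to the index threshold as close enough to merge; blank elements are not copied into B. The names effective_chars and aggregate are hypothetical.

```python
import re
from typing import List

_CJK_RE = re.compile(r"[\u4e00-\u9fff]")


def effective_chars(text: str) -> int:
    """Count CJK ideographs (an assumed reading of 'effective Chinese characters')."""
    return len(_CJK_RE.findall(text))


def aggregate(T: List[str], text_threshold: int = 4, index_threshold: int = 7) -> List[str]:
    """Merge nearby texts of queue T into blocks and return them as queue B."""
    B: List[str] = []
    i = 0
    while i < len(T):
        tc = T[i]
        if effective_chars(tc) < text_threshold:
            # S310: a short element goes to B on its own (blank ones are skipped here).
            if tc.strip():
                B.append(tc)
            i += 1
            continue
        # S310: Tc is effective; start the temporary block Temp.
        temp, tcv_index = tc, i
        j = i + 1
        while j < len(T):
            if not T[j].strip():
                j += 1                      # S320: ignore space or blank-line elements
                continue
            if j - tcv_index > index_threshold:
                break                       # S330: gap too large, T[j] becomes the next Tc
            temp += T[j]                    # S320: append Ncv and move Tcv forward
            tcv_index = j
            j += 1
        B.append(temp)                      # S330: put the finished block into B
        i = j
    return B


T = ["首页", "", "这是正文第一段内容", "", "", "这是正文第二段"] + [""] * 9 + ["版权所有某某公司"]
B = aggregate(T)
```

In this toy queue the two paragraphs are three positions apart and are merged, while the copyright line is ten positions after the second paragraph and starts a block of its own.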
Step S400 is the main-text selection step. After step S300, related texts have been gathered together (for example main text, advertisements, and links). The element with the greatest text length in the queue B is taken; its text is the main text, and main-text extraction is then complete.
This step relies on the following properties of typical webpages: 1) the main text is contiguous and is not split apart by noise such as advertisements; 2) the text blocks of the main text are relatively long and are not far apart; 3) the main text should be the longest content. The texts in the webpage are thus gathered effectively, which avoids both complicated procedures and algorithms and the tedium of specifying different extraction rules for different webpages, and improves the efficiency of webpage main-text extraction.
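The selection in step S400 amounts to taking the longest element of the queue B. A minimal sketch, with the function name select_main_text and the empty-queue behaviour chosen here as assumptions:

```python
from typing import List


def select_main_text(B: List[str]) -> str:
    """Return the longest text block in queue B, or an empty string if B is empty."""
    return max(B, key=len, default="")
```

When two blocks have the same length, `max` returns the one that appears first in B; the description does not say how such ties should be resolved.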
Referring to Fig. 2, another aspect of the present invention provides a device for the above method, comprising:
a webpage HTML file acquisition module 100 for obtaining the HTML source text of a webpage, deleting the valueless first tags, and removing the special characters in the text to obtain a sample text;
a blank-line segmentation module 200 for replacing every second tag in the sample text with blank lines (nulls) to generate a plurality of blank-line-separated texts and converting them into a queue T in which every two adjacent texts are separated by newline characters;
a queue conversion module 300 for splitting the queue T into a plurality of subqueues according to a text threshold and an index threshold, merging all texts in each subqueue into one text block, and forming a queue B from the text blocks;
a main-text selection module 400 for selecting the text with the greatest text length from the queue B as the main text of the webpage.
The device does not require extraction rules to be designed for specific webpages and needs no manual intervention, so it effectively improves extraction efficiency.
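One way to picture the four-module structure is as a pipeline of four interchangeable callables. The class below is only a structural sketch under that assumption: the class name, field names, and the toy stand-in modules in the usage lines are all hypothetical and do not implement the real processing of modules 100 to 400.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MainTextExtractionDevice:
    acquire: Callable[[str], str]                    # module 100: HTML -> sample text
    split_blank_lines: Callable[[str], List[str]]    # module 200: sample text -> queue T
    convert_queue: Callable[[List[str]], List[str]]  # module 300: queue T -> queue B
    select_text: Callable[[List[str]], str]          # module 400: queue B -> main text

    def extract(self, html: str) -> str:
        """Run the four modules in order and return the selected main text."""
        sample = self.acquire(html)
        queue_t = self.split_blank_lines(sample)
        queue_b = self.convert_queue(queue_t)
        return self.select_text(queue_b)


# Toy stand-ins that only demonstrate the data flow between modules.
device = MainTextExtractionDevice(
    acquire=lambda html: html.replace("<!--ad-->", ""),
    split_blank_lines=lambda sample: sample.split("|"),
    convert_queue=lambda queue_t: [text for text in queue_t if text],
    select_text=lambda queue_b: max(queue_b, key=len, default=""),
)
main_text = device.extract("导航|<!--ad-->|这是一段较长的正文")
```

Keeping each module behind a plain callable makes it possible to swap in a different tag rule table or threshold setting without touching the other stages.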
Preferably, the second tags are replaced using regular expressions, with the substitution rule R [("i", n)], where "i" is a second tag and n is the number of blank lines with which that tag is replaced. Extraction by this rule effectively separates invalid text from the main text and prevents the two from becoming mixed and hard to separate.
Preferably, the queue conversion module comprises:
a first loop module for looping over the queue T and denoting the current element as Tc, wherein if the number of effective Chinese characters in Tc is less than the text threshold, the text of Tc is added to the queue B and traversal of the queue T continues, and if the number of effective Chinese characters in Tc is greater than the text threshold, Tc is marked as the current effective text Tcv and a temporary text block Temp is created whose value is the text of Tcv;
a second loop module for traversing the queue T starting from the element after the current effective text Tcv, ignoring space or blank-line elements until the next effective text Ncv is found, wherein if the difference between the position indices of Ncv and Tcv in the queue T is less than the index threshold, the text of Ncv is appended to the temporary text block Temp and Ncv is assigned to the effective text Tcv;
a queue B forming module for continuing to traverse the queue T from the element after the next effective text Ncv to the following effective element Ncv_{i+2}, wherein if the difference between the position indices of Ncv_{i+2} and the current effective text Tcv in the queue T is greater than the index threshold, a copy of the temporary text block Temp is put into the queue B, Ncv_{i+2} is assigned to the current element Tc, and looping over the queue T continues.
Using these modules effectively avoids omitting text that belongs to the main text, and is particularly suitable for cases in which tags also appear within the main text.
Those skilled in the art will appreciate that the scope of the present invention is not limited to the examples discussed above, and that changes and modifications may be made to them without departing from the scope of the invention as defined by the appended claims. Although the invention has been illustrated and described in detail in the drawings and the description, such illustration and description are explanatory or schematic only, and not restrictive. The invention is not limited to the disclosed embodiments.
From a study of the drawings, the description, and the claims, those skilled in the art can, in practicing the invention, understand and implement variations of the disclosed embodiments. In the claims, the term "comprising" does not exclude other steps or elements, and the indefinite article "a" or "an" does not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not mean that a combination of these measures cannot be used to advantage. Any reference sign in the claims does not limit the scope of the invention.

Claims (6)

1. A webpage main-text extraction method based on aggregated text density, comprising the following steps:
Step S100: obtaining the HTML source text of a webpage, deleting the valueless first tags, and removing the special characters in the text to obtain a sample text;
Step S200: replacing every second tag in the sample text with blank lines (nulls) to generate a plurality of blank-line-separated texts, and converting them into a queue T, every two adjacent texts being separated by newline characters;
Step S300: splitting the queue T into a plurality of subqueues according to a text threshold and an index threshold, merging all texts in each subqueue into one text block, and forming a queue B from the plurality of text blocks;
Step S400: selecting the text with the greatest text length from the queue B as the main text of the webpage;
wherein the index threshold is a preset number of blank lines between any two subqueues, and the text threshold is a preset number of text characters contained in a subqueue;
wherein step S300 comprises the following steps:
Step S310: looping over the queue T and denoting the current element as Tc; if the number of effective Chinese characters in the current element Tc is less than the text threshold, adding the text of Tc to the queue B and continuing to traverse the queue T; if the number of effective Chinese characters in Tc is greater than the text threshold, marking Tc as the current effective text Tcv and creating a temporary text block Temp whose value is the text of Tcv;
Step S320: traversing the queue T starting from the element after the current effective text Tcv, ignoring space or blank-line elements until the next effective text Ncv is found; if the difference between the position indices of Ncv and Tcv in the queue T is less than the index threshold, appending the text of Ncv to the temporary text block Temp and assigning Ncv to the effective text Tcv;
Step S330: continuing to traverse the queue T from the element after the next effective text Ncv to the following effective element Ncv_{i+2}; if the difference between the position indices of Ncv_{i+2} and the current effective text Tcv in the queue T is greater than the index threshold, putting a copy of the temporary text block Temp into the queue B, assigning Ncv_{i+2} to the current element Tc, and continuing to loop over the queue T.
2. The webpage main-text extraction method based on aggregated text density according to claim 1, characterized in that in step S200 the second tags are replaced using regular expressions, with the substitution rule R [("i", n)], where "i" is the second tag and n is the number of blank lines with which that tag is replaced.
3. The webpage main-text extraction method based on aggregated text density according to claim 1, characterized in that the first tags are valueless HTML tags.
4. A webpage main-text extraction device based on aggregated text density for use with the method according to any one of claims 1 to 3, characterized by comprising:
a webpage HTML file acquisition module for obtaining the HTML source text of a webpage, deleting the valueless first tags, and removing the special characters in the text to obtain a sample text;
a blank-line segmentation module for replacing every second tag in the sample text with blank lines (nulls) to generate a plurality of blank-line-separated texts and converting them into a queue T, every two adjacent texts being separated by newline characters;
a queue conversion module for splitting the queue T into a plurality of subqueues according to a text threshold and an index threshold, merging all texts in each subqueue into one text block, and forming a queue B from the plurality of text blocks;
a main-text selection module for selecting the text with the greatest text length from the queue B as the main text of the webpage.
5. The webpage main-text extraction device based on aggregated text density according to claim 4, characterized in that the second tags are replaced using regular expressions, with the substitution rule R [("i", n)], where "i" is the second tag and n is the number of blank lines with which that tag is replaced.
6. The webpage main-text extraction device based on aggregated text density according to claim 4, characterized in that the queue conversion module comprises:
a first loop module for looping over the queue T and denoting the current element as Tc, wherein if the number of effective Chinese characters in the current element Tc is less than the text threshold, the text of Tc is added to the queue B and traversal of the queue T continues, and if the number of effective Chinese characters in Tc is greater than the text threshold, Tc is marked as the current effective text Tcv and a temporary text block Temp is created whose value is the text of Tcv;
a second loop module for traversing the queue T starting from the element after the current effective text Tcv, ignoring space or blank-line elements until the next effective text Ncv is found, wherein if the difference between the position indices of Ncv and Tcv in the queue T is less than the index threshold, the text of Ncv is appended to the temporary text block Temp and Ncv is assigned to the effective text Tcv;
a queue B forming module for continuing to traverse the queue T from the element after the next effective text Ncv to the following effective element Ncv_{i+2}, wherein if the difference between the position indices of Ncv_{i+2} and the current effective text Tcv in the queue T is greater than the index threshold, a copy of the temporary text block Temp is put into the queue B, Ncv_{i+2} is assigned to the current element Tc, and looping over the queue T continues.
CN201610050995.1A 2016-01-26 2016-01-26 Webpage context extraction method and device based on aggregation text density Active CN105740355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610050995.1A CN105740355B (en) 2016-01-26 2016-01-26 Webpage context extraction method and device based on aggregation text density

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610050995.1A CN105740355B (en) 2016-01-26 2016-01-26 Webpage context extraction method and device based on aggregation text density

Publications (2)

Publication Number Publication Date
CN105740355A (en) 2016-07-06
CN105740355B (en) 2019-03-26

Family

ID=56246654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610050995.1A Active CN105740355B (en) 2016-01-26 2016-01-26 Webpage context extraction method and device based on aggregation text density

Country Status (1)

Country Link
CN (1) CN105740355B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951401B (en) * 2017-03-14 2020-03-20 深圳市茁壮网络股份有限公司 Document text recognition method and device
CN107273491B (en) * 2017-06-15 2020-07-24 华中师范大学 Webpage segmentation method and device and electronic equipment
CN107766477A (en) * 2017-09-30 2018-03-06 武汉汉思信息技术有限责任公司 Page structure data extraction method, terminal device and storage medium
CN111563387B (en) * 2019-02-12 2023-05-02 阿里巴巴集团控股有限公司 Sentence similarity determining method and device, sentence translating method and device
CN113537091B (en) * 2021-07-20 2024-05-03 东莞盟大集团有限公司 Webpage text recognition method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102855324A (en) * 2012-09-11 2013-01-02 北京云泓道元信息技术有限公司 Automatic extracting method and device for network information
CN103425765A (en) * 2013-08-06 2013-12-04 优视科技有限公司 Method and device for extracting webpage text and method and system for webpage preview
CN104598577A (en) * 2015-01-14 2015-05-06 晶赞广告(上海)有限公司 Extraction method for webpage text
CN105183801A (en) * 2015-08-25 2015-12-23 北京信息科技大学 Web page body text extraction method and apparatus

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102779172B (en) * 2012-06-25 2016-06-01 北京奇虎科技有限公司 System and method for recognizing non-body text in a web page

Also Published As

Publication number Publication date
CN105740355A (en) 2016-07-06

Similar Documents

Publication Publication Date Title
CN105740355B (en) Webpage context extraction method and device based on aggregation text density
CN101025738B (en) Template-free dynamic website generating method
CN109543126B (en) Webpage text information extraction method based on block character ratio
CN105022803B (en) Method and system for extracting web page text content
Peters et al. Content extraction using diverse feature sets
CN106446072B (en) Method and apparatus for processing web page content
CN112667940B (en) Webpage text extraction method based on deep learning
CN102270206A (en) Method and device for capturing valid web page contents
CN101702160B (en) Method for acquiring internet subject information and device thereof
CN103577171B (en) Method and mobile terminal for displaying web page content
CN102253979A (en) Vision-based web page extracting method
CN109492177B (en) web page blocking method based on web page semantic structure
CN109857956A (en) Automatic extraction method for key information of news web pages based on label and block features
WO2014153457A1 (en) Merging web page style addresses
CN106503211A (en) Method for automatically generating the mobile edition of an information-publishing website
CN107590288B (en) Method and device for extracting webpage image-text blocks
CN103049536A (en) Webpage main text content extracting method and webpage text content extracting system
CN103530429A (en) Webpage content extracting method
CN106547895B (en) Webpage information extraction method and device
CN108733813A (en) Information extraction method, system, and medium for BBS forum web page content
US20130124684A1 (en) Visual separator detection in web pages using code analysis
CN109657114B (en) Method for extracting webpage semi-structured data
CN108959204B (en) Internet financial project information extraction method and system
CN107145591B (en) Title-based webpage effective metadata content extraction method
CN103853770B (en) Method and system for extracting post content from forum web pages

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant