CN105740355B - Webpage context extraction method and device based on aggregation text density - Google Patents
Webpage context extraction method and device based on aggregation text density
- Publication number
- CN105740355B (application number CN201610050995.1A)
- Authority
- CN
- China
- Prior art keywords
- text
- queue
- null
- ncv
- threshold value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention provides a web page body text extraction method and device based on aggregated text density. The method splits the text content of a web page at its HTML tags, so that the different kinds of text in the page are effectively separated. Because no site-specific extraction rules need to be written, the method is general; because no complex text mining is used, it is simple and efficient, and it extracts body text from all kinds of web pages accurately.
Description
Technical field
The present invention relates to the technical field of web crawlers, and in particular to a web page body text extraction method and device based on aggregated text density.
Background art
With the rapid development of the information society, the Internet has become an important source of information. Users usually view web content directly in a browser. In addition, much Internet-based information processing (such as information retrieval, data mining, and machine translation) takes the content of web pages as its basic data and works mainly on the body text of those pages. Besides useful information (such as the body content), however, most web pages also contain a great deal of noise, such as site navigation, related links, advertisements, copyright notices, and script code. How to extract the body text of a web page accurately and efficiently, neither omitting body text nor mixing in noise, has become an important topic in web information extraction and its applications, with high application value and practical significance.
Several extraction methods exist in the prior art for this problem:
1) Body text extraction based on the DOM tree structure
Malformed structure or information in the web page's HTML file is first repaired (for example, a start tag such as <h1> without a matching end tag, or a stray end tag such as </a>) to produce a standard HTML file. The HTML file is then parsed into a DOM (Document Object Model) tree. Finally, the DOM tree is traversed to identify and discard non-body information, and the body text is extracted according to rules such as page layout and text density. The page structure of many websites is now increasingly complex and increasingly non-standard, so the DOM tree sometimes cannot be built, and body text extraction and extraction-template construction then fail. Building and traversing the DOM tree also has high time and space complexity, which makes the approach slow and inefficient. Noise identification requires manually maintained, regularly updated information (such as a list of advertisement servers) and cannot be automated.
2) Rule-based body text extraction
Extraction rules, such as regular expressions or XPath expressions, are written by hand for a specific website. The advantage is high accuracy. The disadvantage is a lack of generality and extensibility: only pages from a fixed website or in a fixed format can be parsed, writing the rules is time-consuming and laborious, and once the page layout changes it is hard to notice the change and update the rules in time.
3) Text block extraction based on web page segmentation
Separator lines in the HTML tags and some visual information (such as text color, font size, and text information) are used to split off the text blocks in the page. Because HTML styles differ from site to site, there is no uniform way to segment pages, generality is hard to guarantee, and many auxiliary hand-written rules have to be added.
4) Body text extraction based on data mining and machine learning
The method includes the following steps: the page code is linearized and reconstructed so that the logical order of the text paragraphs is not destroyed by tag nesting rules; noisy HTML tags are filtered out; text paragraphs are parsed and stored in units of <table> tags; and a text clustering algorithm clusters the paragraphs and finally generates the body text. The problem is that a simple task is made complicated, so that body text extraction becomes cumbersome, which hinders large-scale use.
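The brittleness of rule-based extraction described under 2) can be illustrated with a minimal Python sketch. The rule, the `id="article"` container, and the two sample pages are hypothetical and are not taken from the patent; they only show that a rule written for one layout returns nothing for another.

```python
import re

# Hypothetical site-specific rule: the body is assumed to sit in <div id="article">.
SITE_RULE = re.compile(r'<div id="article">(.*?)</div>', re.S)


def extract_by_rule(html):
    """Return the text matched by the hand-written rule, or None if it does not apply."""
    match = SITE_RULE.search(html)
    return match.group(1).strip() if match else None


# Two invented pages with different layouts.
page_a = '<html><body><div id="article">Body text of site A.</div></body></html>'
page_b = '<html><body><div class="content">Body text of site B.</div></body></html>'

result_a = extract_by_rule(page_a)  # the rule was written for this layout
result_b = extract_by_rule(page_b)  # same rule, different layout: no match
```

Every new site or layout change would need another such rule, which is the maintenance cost the background section points to.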
Summary of the invention
The object of the present invention is to provide a web page body text extraction method and device based on aggregated text density that addresses the technical problems of the prior art described in the background above.
The present invention provides a web page body text extraction method based on aggregated text density, comprising the following steps:
Step S100: obtain the HTML source text of the web page, delete the worthless first tags, and remove the special characters from the text to obtain the sample text;
Step S200: replace every second tag in the sample text with blank lines to produce multiple blank-line-separated texts, and convert them into queue T, in which every two adjacent texts are separated by newline characters;
Step S300: split queue T into multiple subqueues according to the text threshold and the index threshold, merge all the texts in each subqueue into one text block, and form queue B from the resulting text blocks;
Step S400: select the text with the greatest text length from queue B as the body text of the web page.
The index threshold is the preset number of blank lines between any two subqueues, and the text threshold is the preset number of text characters contained in a subqueue.
Further, in step S200 the second tags are replaced using regular expressions according to the replacement rule R: [("i", n)], where "i" is a second tag and n is the number of blank lines that replace that tag.
Further, step S300 comprises the following steps:
Step S310: loop over queue T and denote the current element Tc. If the number of valid Chinese characters in Tc is less than the text threshold, add the text of Tc to queue B and continue traversing queue T. If the number of valid Chinese characters in Tc is greater than the text threshold, mark Tc as the current valid text Tcv and create a temporary text block Temp whose value is the text of Tcv;
Step S320: continue traversing queue T from the element after Tcv, skipping whitespace and blank-line elements, until the next valid text Ncv is found. If the difference between the position indexes of Ncv and Tcv in queue T is less than the index threshold, append the text of Ncv to Temp and assign Ncv to Tcv;
Step S330: continue traversing queue T to the next valid element after Ncv, denoted Ncv_{i+2}. If the difference between the position indexes of Ncv_{i+2} and the current valid text Tcv in queue T is greater than the index threshold, put a copy of Temp into queue B, assign Ncv_{i+2} to the current element Tc, and continue looping over queue T.
Further, the first tags are worthless HTML tags.
Another aspect of the present invention provides a web page body text extraction device based on aggregated text density for carrying out the above method, comprising: a web page HTML file acquisition module, configured to obtain the HTML source text of the web page, delete the worthless first tags, and remove the special characters from the text to obtain the sample text; a blank-line segmentation module, configured to replace every second tag in the sample text with blank lines to produce multiple blank-line-separated texts and to convert them into queue T, in which every two adjacent texts are separated by newline characters; a queue conversion module, configured to split queue T into multiple subqueues according to the text threshold and the index threshold, merge all the texts in each subqueue into one text block, and form queue B from the resulting text blocks; and a text selection module, configured to select the text with the greatest text length from queue B as the body text of the web page.
Further, the second tags are replaced using regular expressions according to the replacement rule R: [("i", n)], where "i" is a second tag and n is the number of blank lines that replace that tag.
Further, the queue conversion module comprises: a first loop module, configured to loop over queue T and denote the current element Tc, where, if the number of valid Chinese characters in Tc is less than the text threshold, the text of Tc is added to queue B and the traversal of queue T continues, and, if the number of valid Chinese characters in Tc is greater than the text threshold, Tc is marked as the current valid text Tcv and a temporary text block Temp is created whose value is the text of Tcv; a second loop module, configured to continue traversing queue T from the element after Tcv, skipping whitespace and blank-line elements, until the next valid text Ncv is found, where, if the difference between the position indexes of Ncv and Tcv in queue T is less than the index threshold, the text of Ncv is appended to Temp and Ncv is assigned to Tcv; and a queue B forming module, configured to continue traversing queue T to the next valid element after Ncv, denoted Ncv_{i+2}, where, if the difference between the position indexes of Ncv_{i+2} and the current valid text Tcv in queue T is greater than the index threshold, a copy of Temp is put into queue B, Ncv_{i+2} is assigned to the current element Tc, and the loop over queue T continues.
Technical effect of the invention:
The web page body text extraction method based on aggregated text density provided by the invention is general, because no site-specific extraction rules need to be written. It uses no complex text mining, so it is simple and efficient, and it extracts body text from all kinds of web pages accurately. The method cleans and converts the obtained HTML according to its tags and then obtains the body text by aggregation. It therefore needs no site-specific rules, which generalize poorly, and it neither builds nor traverses a DOM tree, which avoids the associated inefficiency. Practical testing has shown that the method extracts body text accurately and efficiently and is applicable to all kinds of websites.
The web page body text extraction device based on aggregated text density provided by the invention likewise uses no complex text mining; it is simple and efficient and extracts body text from all kinds of web pages accurately.
The above and other aspects of the invention will become clearer from the following description of various embodiments of the web page body text extraction method and device based on aggregated text density.
Brief description of the drawings
Fig. 1 is a flow diagram of the web page body text extraction method based on aggregated text density provided by the invention;
Fig. 2 is a structural diagram of the web page body text extraction device based on aggregated text density provided by the invention.
Specific embodiment
The accompanying drawings, which form part of this application, are provided for further understanding of the invention. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it.
Referring to Fig. 1, the web page body text extraction method based on aggregated text density provided by the invention comprises the following steps:
Step S100: obtain the HTML source text of the web page, delete the worthless first tags, and remove the special characters from the text to obtain the sample text;
Step S200: replace every second tag in the sample text with blank lines to produce multiple blank-line-separated texts, and convert them into queue T, in which every two adjacent texts are separated by newline characters;
Step S300: split queue T into multiple subqueues according to the text threshold and the index threshold, merge all the texts in each subqueue into one text block, and form queue B from the resulting text blocks. The index threshold is the preset number of blank lines between any two subqueues, and the text threshold is the preset number of text characters contained in a subqueue;
Step S400: select the text with the greatest text length from queue B as the body text of the web page.
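The two thresholds used by step S300 can be made concrete with a small Python sketch. The constant names, the helper names, and the decision to count only characters in the CJK Unified Ideographs range U+4E00 to U+9FFF as "valid Chinese characters" are assumptions made for illustration; the values 4 and 7 are the example values the description gives for the two thresholds.

```python
import re

# Example values from the description; both are meant to be tunable presets.
TEXT_THRESHOLD = 4    # minimum number of valid Chinese characters
INDEX_THRESHOLD = 7   # maximum index gap (blank lines) inside one text block

# Assumption: a "valid Chinese character" is a CJK Unified Ideograph.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def count_valid_chinese(text):
    """Count the characters of text that fall in the assumed CJK range."""
    return len(_CJK.findall(text))


def is_valid_text(text, threshold=TEXT_THRESHOLD):
    """A text is treated as valid when its count exceeds the text threshold."""
    return count_valid_chinese(text) > threshold
```

Under these assumptions a short navigation label such as "首页" is not a valid text, while a full sentence is.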
The invention starts by replacing and deleting tags, and then splits the texts in the source file into different subqueues according to the number of text characters and the number of blank lines, so that the body text is separated from text that serves other purposes. No specific extraction rules have to be set by hand for a particular web page; replacing tags according to their type is enough to extract the body text, which improves efficiency.
The worthless first tags may be any of the common worthless HTML tags. These include, but are not limited to, comments (<!--...-->, <!...>), scripts (<script...>...</script>), the head (<head...>...</head>), style links (<link.../>), and input controls (<input.../>).
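A minimal sketch of the first-tag deletion in step S100, assuming regular expressions are used for this step as well (the patent only states that the tags are deleted). The pattern list and the function name `strip_first_tags` are illustrative.

```python
import re

# Patterns for the worthless first tags listed above, applied in order.
# Comments are removed first because they may themselves contain ">".
FIRST_TAG_PATTERNS = [
    r"<!--.*?-->",                      # comments
    r"<![^>]*>",                        # other <!...> declarations
    r"<script\b[^>]*>.*?</script\s*>",  # scripts together with their content
    r"<head\b[^>]*>.*?</head\s*>",      # the head together with its content
    r"<link\b[^>]*>",                   # style links
    r"<input\b[^>]*>",                  # input controls
]


def strip_first_tags(html):
    """Delete every worthless first tag from the HTML source text."""
    for pattern in FIRST_TAG_PATTERNS:
        html = re.sub(pattern, "", html, flags=re.S | re.I)
    return html
```

The `\b` after each tag name keeps `<head` from also matching `<header>`.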
Removing special characters: some text is represented by special character sequences in the page source; a space, for example, appears in the source as "&nbsp;". Such special characters carry no specific meaning here and are deleted. Specifically, in this step the text of each element in queue T is checked and the common special characters in it are removed. These include, but are not limited to, the space ("&nbsp;"), the greater-than sign ("&gt;"), the less-than sign ("&lt;"), and the quotation mark ("&quot;", which the original text labels an equals sign).
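The special-character removal can be sketched as follows. The entity list is an assumption: the entity names are garbled in the available text, so the four entities below are an illustrative reading, and the function name is hypothetical.

```python
import re

# Assumed set of character entities to delete; extend as needed.
SPECIAL_ENTITIES = re.compile(r"&(?:nbsp|gt|lt|quot);", re.I)


def strip_special_characters(text):
    """Delete the assumed entities and any literal non-breaking spaces."""
    return SPECIAL_ENTITIES.sub("", text).replace("\u00a0", "")
```

Deleting rather than decoding the entities follows the description, which treats these characters as carrying no meaning for body text extraction.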
The second tags are the remaining ordinary HTML tags, that is, those not deleted in the step that removes the worthless first tags. After every second tag in the HTML text has been replaced with a certain number of blank lines, the text containing the body content is separated in the sample text from the content delimited by other tags.
Preferably, step S200 comprises the following: the second tags in the text are replaced with blank lines according to the replacement rule R: [("i", n)], where "i" is a second tag and n is the number of blank lines that replace it. For example, with R: [("div", 5), ("tr", 5), ("h1", 9), ("br", 5), ("span", 4), ("table", 2)], the replacement is carried out using regular expressions.
Specifically, every element of R is a key-value pair. The key is a tag name, such as div, tr, h1, or another common second tag. The value is the number of blank lines substituted when the tag is converted. For example, the element ("div", 5) in R means that when the detected second tag is div, its start tag or end tag is replaced with 5 newline characters ("\n"). Any second tag that does not appear in R is replaced with a single newline character. The replacement follows visual effect: the larger the visual gap a second tag produces, the more blank lines replace it. The texts in the page text that are now separated by blank lines then form list T, in which every two adjacent texts are separated by blank lines.
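The replacement rule and the construction of queue T can be sketched in Python as below. The rule R is the example given in the description; the function names, the tag-matching regular expression, and the choice to build T by splitting on "\n" (so that empty elements record the size of each gap) are assumptions made for illustration.

```python
import re

# Example rule R from the description: (tag name, number of newlines).
R = [("div", 5), ("tr", 5), ("h1", 9), ("br", 5), ("span", 4), ("table", 2)]
DEFAULT_NEWLINES = 1  # tags not listed in R become a single newline

# Matches any start or end tag and captures its name.
_TAG = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def tags_to_newlines(sample_text):
    """Replace each remaining (second) tag with its number of newlines."""
    counts = dict(R)

    def replace(match):
        return "\n" * counts.get(match.group(1).lower(), DEFAULT_NEWLINES)

    return _TAG.sub(replace, sample_text)


def build_queue_t(sample_text):
    """Split on newlines; empty elements preserve the distance between texts."""
    return tags_to_newlines(sample_text).split("\n")


queue_t = build_queue_t("<div>标题</div><span>甲</span>乙")
```

With this representation the index difference between two texts in T equals the number of newlines inserted between them, which is the quantity the index threshold is compared against.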
Step S300 is the text aggregation step. After the steps above, the tags have divided the page text into small text blocks separated by blank lines. This step gathers small text blocks that are physically close to one another into a single text block.
Specifically, step S300 comprises the following steps:
Step S310: loop over queue T and denote the current element Tc. If the number of valid Chinese characters in Tc is less than the specified text threshold (for example 4), add the text of Tc to queue B and continue traversing queue T. If the number of valid Chinese characters in Tc is greater than the specified threshold, Tc is a valid text: mark Tc as the current valid text Tcv and create a temporary text block Temp whose value is the text of Tcv.
Step S320: continue traversing queue T from the element after Tcv, skipping whitespace and blank-line elements, until the next valid text Ncv is found. If the difference between the position indexes of Ncv and Tcv in queue T is less than the specified index threshold (for example 7), append the text of Ncv to Temp and assign Ncv to Tcv.
Step S330: continue traversing queue T to the next valid element after Ncv, denoted Ncv_{i+2}. If the difference between the position indexes of Ncv_{i+2} and the current valid text Tcv in queue T is greater than the specified index threshold, put a copy of the text block Temp into queue B, assign Ncv_{i+2} to the current element Tc, and continue looping over queue T.
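Steps S310 to S330 can be read as a single pass over queue T, sketched below in Python. This is one interpretation, not the patent's definitive implementation. The assumptions are: "valid Chinese characters" are CJK Unified Ideographs; a count equal to the text threshold is treated as short; an index gap equal to the index threshold starts a new block; short non-blank texts found between valid texts are added to B as their own blocks; and the last Temp is flushed when T is exhausted, which the description does not state.

```python
import re

_CJK = re.compile(r"[\u4e00-\u9fff]")  # assumed range for valid Chinese characters


def count_valid_chinese(text):
    return len(_CJK.findall(text))


def aggregate(queue_t, text_threshold=4, index_threshold=7):
    """Merge nearby valid texts of queue T into text blocks and return queue B."""
    queue_b = []
    temp = None        # temporary text block Temp
    last_index = None  # position in T of the current valid text Tcv
    for i, element in enumerate(queue_t):
        if not element.strip():
            continue                  # skip whitespace and blank-line elements
        if count_valid_chinese(element) <= text_threshold:
            queue_b.append(element)   # S310: short text goes to B on its own
            continue
        if temp is not None and i - last_index < index_threshold:
            temp += element           # S320: close enough, extend Temp
        else:
            if temp is not None:
                queue_b.append(temp)  # S330: gap too large, flush Temp to B
            temp = element            # start a new block from this valid text
        last_index = i                # this element becomes Tcv
    if temp is not None:
        queue_b.append(temp)          # assumed final flush
    return queue_b
```

For example, two long sentences three positions apart end up in one block, while a long copyright line ten positions further on starts a block of its own.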
Step S400 is the body text selection step. After step S300, related text has been gathered together (for example body text, advertisements, and links). The element with the greatest text length in queue B is taken; its text is the body text, and the extraction is complete.
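Step S400 reduces to picking the longest element of queue B. A minimal sketch, assuming text length is measured in characters and that an empty queue yields an empty string (the description does not cover that case); the function name is illustrative:

```python
def select_body_text(queue_b):
    """Return the longest text block in queue B, or "" if B is empty."""
    return max(queue_b, key=len, default="")
```

If two blocks have the same length, `max` returns the first one encountered.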
These steps rely on properties that typically hold in web pages: 1) the body text is contiguous and is not broken up by noise such as advertisements; 2) the text blocks of the body are relatively long and lie close to one another; 3) the body text is the longest text in the page. On this basis the body text of a page is gathered effectively, which avoids both complex algorithms and the tedium of writing different extraction rules for different pages, and improves the efficiency of body text extraction.
Referring to Fig. 2, another aspect of the present invention provides a device for carrying out the above method, comprising:
a web page HTML file acquisition module 100, configured to obtain the HTML source text of the web page, delete the worthless first tags, and remove the special characters from the text to obtain the sample text;
a blank-line segmentation module 200, configured to replace every second tag in the sample text with blank lines to produce multiple blank-line-separated texts, and to convert them into queue T, in which every two adjacent texts are separated by newline characters;
a queue conversion module 300, configured to split queue T into multiple subqueues according to the text threshold and the index threshold, merge all the texts in each subqueue into one text block, and form queue B from the resulting text blocks;
a text selection module 400, configured to select the text with the greatest text length from queue B as the body text of the web page.
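The four-module structure can be sketched as a small pipeline object. The class name, field names, and the trivial stand-in functions are hypothetical; they only show how modules 100 to 400 hand their results to one another, not how each module is implemented.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BodyTextExtractor:
    acquire: Callable[[str], str]                    # module 100: HTML -> sample text
    split_blank_lines: Callable[[str], List[str]]    # module 200: sample text -> queue T
    convert_queue: Callable[[List[str]], List[str]]  # module 300: queue T -> queue B
    select_text: Callable[[List[str]], str]          # module 400: queue B -> body text

    def extract(self, html: str) -> str:
        sample = self.acquire(html)
        queue_t = self.split_blank_lines(sample)
        queue_b = self.convert_queue(queue_t)
        return self.select_text(queue_b)


# Trivial stand-ins, used only to exercise the wiring.
extractor = BodyTextExtractor(
    acquire=lambda html: html.strip(),
    split_blank_lines=lambda sample: sample.split("\n"),
    convert_queue=lambda queue_t: [text for text in queue_t if text],
    select_text=lambda queue_b: max(queue_b, key=len, default=""),
)
body = extractor.extract("导航\n\n这是一段比较长的正文内容\n版权")
```

Keeping each module behind a plain callable makes it possible to swap in a different cleaning or aggregation strategy without touching the others.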
The device needs no extraction rules designed for specific web pages and no manual intervention, and can effectively improve extraction efficiency.
Preferably, the second tags are replaced using regular expressions according to the replacement rule R: [("i", n)], where "i" is a second tag and n is the number of blank lines that replace it. Extraction under this rule effectively separates the non-body text from the body text and avoids the difficulty of dividing them once they are mixed.
Preferably, the queue conversion module comprises:
a first loop module, configured to loop over queue T and denote the current element Tc, where, if the number of valid Chinese characters in Tc is less than the text threshold, the text of Tc is added to queue B and the traversal of queue T continues, and, if the number of valid Chinese characters in Tc is greater than the text threshold, Tc is marked as the current valid text Tcv and a temporary text block Temp is created whose value is the text of Tcv;
a second loop module, configured to continue traversing queue T from the element after Tcv, skipping whitespace and blank-line elements, until the next valid text Ncv is found, where, if the difference between the position indexes of Ncv and Tcv in queue T is less than the index threshold, the text of Ncv is appended to Temp and Ncv is assigned to Tcv;
a queue B forming module, configured to continue traversing queue T to the next valid element after Ncv, denoted Ncv_{i+2}, where, if the difference between the position indexes of Ncv_{i+2} and the current valid text Tcv in queue T is greater than the index threshold, a copy of Temp is put into queue B, Ncv_{i+2} is assigned to the current element Tc, and the loop over queue T continues.
These modules effectively avoid omitting text contained in the sub-texts of the body, and are particularly suitable for cases where tags also appear inside the body text.
It will be clear to those skilled in the art that the scope of the invention is not limited to the examples discussed above, and that various changes and modifications may be made to them without departing from the scope of the invention as defined by the appended claims. Although the invention has been illustrated and described in detail in the drawings and the description, such illustration and description are merely explanatory or schematic, and not restrictive. The invention is not limited to the disclosed embodiments.
From a study of the drawings, the description, and the claims, those skilled in the art can understand and implement variations of the disclosed embodiments when practicing the invention. In the claims, the term "comprising" does not exclude other steps or elements, and the indefinite article "a" or "an" does not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not mean that a combination of these measures cannot be used to advantage. Any reference signs in the claims shall not be construed as limiting the scope of the invention.
Claims (6)
1. A web page body text extraction method based on aggregated text density, comprising the following steps:
Step S100: obtaining the HTML source text of a web page, deleting the worthless first tags, and removing the special characters from the text to obtain a sample text;
Step S200: replacing every second tag in the sample text with blank lines to produce multiple blank-line-separated texts, and converting them into a queue T, in which every two adjacent texts are separated by newline characters;
Step S300: splitting the queue T into multiple subqueues according to a text threshold and an index threshold, merging all the texts in each subqueue into one text block, and forming a queue B from the resulting text blocks;
Step S400: selecting the text with the greatest text length from the queue B as the body text of the web page;
wherein the index threshold is the preset number of blank lines between any two subqueues, and the text threshold is the preset number of text characters contained in a subqueue;
and wherein step S300 comprises the following steps:
Step S310: looping over the queue T and denoting the current element Tc; if the number of valid Chinese characters in the current element Tc is less than the text threshold, adding the text of the current element Tc to the queue B and continuing to traverse the queue T; if the number of valid Chinese characters in the current element Tc is greater than the text threshold, marking the current element Tc as the current valid text Tcv and creating a temporary text block Temp whose value is the text of the current valid text Tcv;
Step S320: continuing to traverse the queue T from the element after the current valid text Tcv, skipping whitespace and blank-line elements, until the next valid text Ncv is found; if the difference between the position indexes of the next valid text Ncv and the valid text Tcv in the queue T is less than the index threshold, appending the text of the next valid text Ncv to the temporary text block Temp and assigning the next valid text Ncv to the valid text Tcv;
Step S330: continuing to traverse the queue T to the next valid element after the next valid text Ncv, denoted Ncv_{i+2}; if the difference between the position indexes of Ncv_{i+2} and the current valid text Tcv in the queue T is greater than the index threshold, putting a copy of the temporary text block Temp into the queue B, assigning Ncv_{i+2} to the current element Tc, and continuing to loop over the queue T.
2. The web page body text extraction method based on aggregated text density according to claim 1, characterized in that in step S200 the second tags are replaced using regular expressions according to the replacement rule R: [("i", n)], where "i" is the second tag and n is the number of blank lines that replace that tag.
3. The web page body text extraction method based on aggregated text density according to claim 1, characterized in that the first tags are worthless HTML tags.
4. A web page body text extraction device based on aggregated text density for carrying out the method according to any one of claims 1 to 3, characterized by comprising:
a web page HTML file acquisition module, configured to obtain the HTML source text of a web page, delete the worthless first tags, and remove the special characters from the text to obtain a sample text;
a blank-line segmentation module, configured to replace every second tag in the sample text with blank lines to produce multiple blank-line-separated texts, and to convert them into a queue T, in which every two adjacent texts are separated by newline characters;
a queue conversion module, configured to split the queue T into multiple subqueues according to a text threshold and an index threshold, merge all the texts in each subqueue into one text block, and form a queue B from the resulting text blocks;
a text selection module, configured to select the text with the greatest text length from the queue B as the body text of the web page.
5. The web page body text extraction device based on aggregated text density according to claim 4, characterized in that the second tags are replaced using regular expressions according to the replacement rule R: [("i", n)], where "i" is the second tag and n is the number of blank lines that replace that tag.
6. The web page body text extraction device based on aggregated text density according to claim 4, characterized in that the queue conversion module comprises:
a first loop module, configured to loop over the queue T and denote the current element Tc, wherein, if the number of valid Chinese characters in the current element Tc is less than the text threshold, the text of the current element Tc is added to the queue B and the traversal of the queue T continues, and, if the number of valid Chinese characters in the current element Tc is greater than the text threshold, the current element Tc is marked as the current valid text Tcv and a temporary text block Temp is created whose value is the text of the current valid text Tcv;
a second loop module, configured to continue traversing the queue T from the element after the current valid text Tcv, skipping whitespace and blank-line elements, until the next valid text Ncv is found, wherein, if the difference between the position indexes of the next valid text Ncv and the valid text Tcv in the queue T is less than the index threshold, the text of the next valid text Ncv is appended to the temporary text block Temp and the next valid text Ncv is assigned to the valid text Tcv;
a queue B forming module, configured to continue traversing the queue T to the next valid element after the next valid text Ncv, denoted Ncv_{i+2}, wherein, if the difference between the position indexes of Ncv_{i+2} and the current valid text Tcv in the queue T is greater than the index threshold, a copy of the temporary text block Temp is put into the queue B, Ncv_{i+2} is assigned to the current element Tc, and the loop over the queue T continues.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610050995.1A CN105740355B (en) | 2016-01-26 | 2016-01-26 | Webpage context extraction method and device based on aggregation text density |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610050995.1A CN105740355B (en) | 2016-01-26 | 2016-01-26 | Webpage context extraction method and device based on aggregation text density |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105740355A CN105740355A (en) | 2016-07-06 |
CN105740355B true CN105740355B (en) | 2019-03-26 |
Family
ID=56246654
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610050995.1A Active CN105740355B (en) | 2016-01-26 | 2016-01-26 | Webpage context extraction method and device based on aggregation text density |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105740355B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106951401B (en) * | 2017-03-14 | 2020-03-20 | 深圳市茁壮网络股份有限公司 | Document text recognition method and device |
CN107273491B (en) * | 2017-06-15 | 2020-07-24 | 华中师范大学 | Webpage segmentation method and device and electronic equipment |
CN107766477A (en) * | 2017-09-30 | 2018-03-06 | 武汉汉思信息技术有限责任公司 | Page structure data extraction method, terminal device and storage medium |
CN111563387B (en) * | 2019-02-12 | 2023-05-02 | 阿里巴巴集团控股有限公司 | Sentence similarity determining method and device, sentence translating method and device |
CN113537091B (en) * | 2021-07-20 | 2024-05-03 | 东莞盟大集团有限公司 | Webpage text recognition method and device, electronic equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102855324A (en) * | 2012-09-11 | 2013-01-02 | 北京云泓道元信息技术有限公司 | Automatic extracting method and device for network information |
CN103425765A (en) * | 2013-08-06 | 2013-12-04 | 优视科技有限公司 | Method and device for extracting webpage text and method and system for webpage preview |
CN104598577A (en) * | 2015-01-14 | 2015-05-06 | 晶赞广告(上海)有限公司 | Extraction method for webpage text |
CN105183801A (en) * | 2015-08-25 | 2015-12-23 | 北京信息科技大学 | Web page body text extraction method and apparatus |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102779172B (en) * | 2012-06-25 | 2016-06-01 | 北京奇虎科技有限公司 | The recognition system of non-body text and method in a kind of webpage |
2016-01-26: CN application CN201610050995.1A, patent CN105740355B (en), status: Active
Also Published As
Publication number | Publication date |
---|---|
CN105740355A (en) | 2016-07-06 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN105740355B (en) | Webpage context extraction method and device based on aggregation text density | |
CN101025738B (en) | Template-free dynamic website generating method | |
CN109543126B (en) | Webpage text information extraction method based on block character ratio | |
CN105022803B (en) | A kind of method and system for extracting Web page text content | |
Peters et al. | Content extraction using diverse feature sets | |
CN106446072B (en) | The treating method and apparatus of web page contents | |
CN112667940B (en) | Webpage text extraction method based on deep learning | |
CN102270206A (en) | Method and device for capturing valid web page contents | |
CN101702160B (en) | Method for acquiring internet subject information and device thereof | |
CN103577171B (en) | A kind of method and mobile terminal of display web page contents | |
CN102253979A (en) | Vision-based web page extracting method | |
CN109492177B (en) | web page blocking method based on web page semantic structure | |
CN109857956A (en) | The automatic abstracting method of news web page key message based on label and blocking characteristic | |
WO2014153457A1 (en) | Merging web page style addresses | |
CN106503211A (en) | Information issues the method that the mobile edition of class website is automatically generated | |
CN107590288B (en) | Method and device for extracting webpage image-text blocks | |
CN103049536A (en) | Webpage main text content extracting method and webpage text content extracting system | |
CN103530429A (en) | Webpage content extracting method | |
CN106547895B (en) | Webpage information extraction method and device | |
CN108733813A (en) | Information extracting method, system towards BBS forum Web pages contents and medium | |
US20130124684A1 (en) | Visual separator detection in web pages using code analysis | |
CN109657114B (en) | Method for extracting webpage semi-structured data | |
CN108959204B (en) | Internet financial project information extraction method and system | |
CN107145591B (en) | Title-based webpage effective metadata content extraction method | |
CN103853770B (en) | The method and system of model content in a kind of extraction forum Web pages |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||