CN103699591A - Page body extraction method based on sample page - Google Patents

Info

Publication number
CN103699591A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310665878.2A
Other languages
Chinese (zh)
Inventor
兰秋军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN201310665878.2A
Publication of CN103699591A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/957: Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577: Optimising the visualization of content, e.g. distillation of HTML documents

Abstract

The invention discloses a method for extracting the page body by filtering out noise such as advertisements, irrelevant links, and pictures. The method extracts page bodies quickly and accurately by exploiting the similarity in structure and content among pages of the same column within a website. It comprises the following steps: a user specifies a sample page; the system performs preliminary filtering using page tags; the text similarity between the page to be processed and the sample page is computed based on edit distance; and the boundaries of the page body are quickly identified so that clean body content can be extracted from page text that contains noise. The method is simple to operate, practical, fast, and accurate. The technical scheme can be widely applied in systems for acquiring, searching, reprinting, mining, and analyzing web page text.

Description

A web page body extraction method based on a sample page
Technical field
The present invention relates to methods for extracting the body of a web page, and in particular to a technique for extracting body text information from web pages.
Background
This is an age of information explosion, and the ability to collect and understand information is key to many enterprises gaining a competitive advantage. The rapid spread of the Internet has made the network the main place where information is produced, propagated, disseminated, and received. Even traditional newspapers, magazines, and books are being digitized and now appear as online text. Collecting online information quickly, and then retrieving, classifying, reprinting, storing, and mining it, has therefore become an urgent need.
HTML is a markup language for creating web pages; it is now an international standard maintained by the World Wide Web Consortium (W3C). Every web page is in essence descriptive text composed of HTML tags, which can describe text, graphics, animation, sound, tables, links, and so on. An HTML document has two main parts, the head (Head) and the body (Body). The head describes information needed by the browser, and the body contains the content to be displayed. Although the technologies that websites use to generate pages vary widely, on the client browser side every page is rendered from HTML, so the same HTML text can be obtained in a standard way.
Existing web text extraction generally uses rule-based templates: for a specific website, the page structure is analyzed to determine the page tags that bound the body text. This approach has a serious defect. Different websites are built with different technologies, and site pages are redesigned frequently, so the information-gathering organization must spend considerable manpower and resources maintaining the template rules. Even with professional tool support, defining template rules remains complicated; operators need a deep grasp of web development technology to understand the page structure correctly and extract accurately.
By statistically analyzing the columns of more than a hundred well-known websites in China and abroad, the applicant found that nearly all text-oriented websites share the following features:
1. Within the same topic section, two pages published not too far apart in time (with no redesign in between) usually share the same page layout and have similar advertisements, hot-topic navigation, and so on. Moreover, the page text other than the body content is highly similar. By contrast, template-based body extraction methods use a wrapper to extract the useful information from a page. A wrapper is a program that, based on the layout features of a page, parses out the position of the body for one specific class of pages. The advantages of that approach are simple implementation and high extraction accuracy. Its drawback is poor generality: every class of pages needs its own wrapper, which is unsuitable for large-scale extraction, and if a class of pages changes, its wrapper fails and must be modified.
2. The body text of a web page is contiguous within the page text; it is rarely interrupted by advertisements or navigation text inserted in the middle.
3. The body text of nearly all websites is enclosed in <p> or <div> HTML tags. The drawback of methods that rely on this feature is that too many paragraphs in a page are enclosed in <p> or <div> tags, so the computation is complex and time-consuming; see application number 2010105533273, entitled "A web page body extraction method and device based on DIV position".
Summary of the invention
The object of the present invention is to provide a web page body extraction method that is simpler for users to operate and that filters out the large amount of noise in a page, such as advertisements and hot-topic links, so that the body text needed for collection, reprinting, and mining analysis can be obtained easily from web pages.
The present invention consists of the following steps: (1) a user or program specifies a sample page; (2) page tags are used for preliminary filtering, yielding text that contains the body; (3) an edit distance model is used to compute the similarity between the text of the page to be extracted and the text of the sample page, and the body boundaries are identified; (4) the body text is extracted.
Any one sample page may be specified in order to extract the body from pages to be extracted that are similar to the sample page.
The page to be extracted and the sample page both undergo preliminary filtering to improve the accuracy and efficiency of the subsequent body extraction. The preliminary filtering uses the innerHTML attribute conforming to the W3C standard to obtain the page content including HTML tags, and then removes the HTML tags with regular expressions; the two filtered texts are called the similar texts.
This keeps the method general: the text obtained does not depend on the website's development language or on the browser. For convenience, the two filtered texts are called PageA and PageB below.
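The preliminary filtering step can be illustrated with a minimal Python sketch. It assumes the page HTML is already available as a string; the function name strip_tags is hypothetical, and dropping script, style, and comment blocks before removing tags is an added assumption, not something the patent specifies.

```python
import re
from html import unescape

# Assumption: script/style blocks and comments carry no body text, so drop them first.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.I | re.S)
_COMMENT = re.compile(r"<!--.*?-->", re.S)
_TAG = re.compile(r"<[^>]+>")
_SPACE = re.compile(r"\s+")


def strip_tags(html_text):
    """Return the visible text of an HTML string with all tags removed."""
    text = _SCRIPT_STYLE.sub(" ", html_text)
    text = _COMMENT.sub(" ", text)
    text = _TAG.sub(" ", text)          # remove every remaining tag
    text = unescape(text)               # decode entities such as &amp;
    return _SPACE.sub(" ", text).strip()
```

Applying the same function to the sample page and to the page to be extracted would yield the two texts referred to as PageA and PageB.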
Text similarity analysis is applied to the similar texts to identify the start and end boundaries of the body. The similarity computation is based on a similarity model, and two sequences of similarity values are computed as follows:
(1) align the heads of the two similar text strings,
(2) starting from the head, take a substring of the specified length from each similar text, denoted s1 and s2,
(3) compute the similarity of s1 and s2,
(4) move one character toward the tail and take the next substrings of the specified length, again denoted s1 and s2,
(5) go to step (2) and repeat the above process until the end is reached; the similarity sequence obtained this way is denoted FS;
Similarly,
(1) align the tails of the two similar texts,
(2) starting from the tail, take a substring of the specified length from each similar text, denoted s1 and s2,
(3) compute the similarity of s1 and s2,
(4) move one character toward the head and take the next substrings of the specified length, again denoted s1 and s2,
(5) go to step (2) and repeat the above process until the head is reached; the similarity sequence obtained this way is denoted BS.
With the two similarity sequences obtained, the body boundaries are identified from them. Analysis of the two sequences shows that the similarity values are high at the start and at the end, while in the middle they fall rapidly, producing a turning point;
this turning point corresponds to a boundary of the body: the turning point of sequence FS corresponds to the head of the body, and the turning point of sequence BS corresponds to the tail of the body;
by working backward from the way the sequence values were obtained, the head and tail boundary positions of the body in the similar texts can be derived, so the body can be extracted and the noise text filtered out.
The advantage of the present invention is its generality: it overcomes the drawback that different classes of pages need different wrappers to extract the useful information. The invention is simple to operate, easy to implement, fast, and accurate, with high body extraction accuracy. The technical scheme can be widely applied in systems for web text collection, search, reprinting, and mining analysis.
Brief description of the drawings
Figure 1 is the system architecture diagram.
Figure 2 is an example of page HTML code and its DOM tree topology.
Figure 3 is a schematic of how the two page texts are partitioned.
Figure 4 is a concrete example of two page texts.
Figure 5 is the flow chart.
Detailed description
A feature of the invention is that the user need not understand the page's layout structure or implementation technology. Merely specifying a sample page is equivalent to telling the computer how to extract the body, so no complicated operations are needed to define and maintain page template rules. Operation is very simple; the user needs only the following two steps to start page body extraction:
(1) Add a new task to the task list
For each task, the user need only supply the following four pieces of information:
(a) the URL of a sample page;
(b) the URL change pattern of the pages to be collected;
(c) the collection period;
(d) the body storage information
(2) Start the system to perform automatic extraction (module 12 in Figure 1)
The system automatically reads the task list and, for each task, periodically fetches the pages to be collected according to the given URL change pattern, extracts the body, and saves or publishes it to the specified location.
Figure 1 shows the structure of the invention, which mainly comprises the following functional modules: task definition module 11, page capture engine module 12, body extraction engine module 16, and storage/publishing engine module 18. The task definition module 11 implements the user interface and adds the specified task information to the task list 13. The task list is a formatted text file holding the information for each task; each task record includes the sample page URL, the URL change pattern of the pages to be collected, the collection period, the storage/publishing mode, and so on. The page capture engine 12 reads the data in the task list 13 and, according to each task's collection period, accesses the Internet 14 on schedule to collect the corresponding pages, producing HTML text 15. The body extraction engine 16 takes the HTML text 15 and the sample page text and, using the body extraction technique of the invention, extracts the body from the page and filters out noise such as advertisements and navigation, producing plain text 17. Finally, the storage/publishing engine 18 publishes the result 19 according to the storage/publishing mode specified in the task information.
In the functional structure shown in Figure 1, the page capture engine 12 is implemented mainly with the Document Object Model (DOM). DOM is a cross-platform, language-independent standard for representing and manipulating HTML, XHTML, and XML objects (see http://www.w3.org/dom/). The structure of any page can be represented as a DOM tag tree; as shown in Figure 2, the left side is the HTML code of a page and the right side is its corresponding DOM tag tree. The DOM application programming interface (API) makes it easy to access and manipulate each DOM node object. For example, in the Microsoft .NET framework, the innerText attribute of the HtmlElement object represents all the text contained within the <html> </html> tags with the HTML tags removed (see http://msdn.microsoft.com). On this basis, the text of the page that contains the body can be obtained, with most multimedia noise such as pictures and video excluded. Part of the noise can also be filtered preliminarily with regular expressions without losing body content.
Both the page to be extracted and the sample page undergo preliminary filtering to improve the accuracy and efficiency of the subsequent body extraction. The preliminary filtering uses the innerHTML attribute conforming to the W3C standard to obtain the page content including HTML tags, and then removes the HTML tags with regular expressions. This keeps the method general: the text obtained does not depend on the website's development language or on the browser. For convenience, the two filtered texts are called PageA and PageB below, that is, the similar pages; see Figure 4.
The body extraction engine is the core of the invention. Its technical starting point is the three general features, noted in the background section of this specification, that the body text of websites and pages shares. The basic principle is as follows. As shown in Figure 3, let PageA be the text of the sample page after preliminary filtering and PageB the text of the page to be extracted after preliminary filtering. Because the body text is contiguous, each text can be divided into three parts, written here as A1, A2, A3 for PageA and B1, B2, B3 for PageB. A2 is the body of the sample page and B2 is the body of the page to be extracted. A1 and A3 are the noise text before and after the sample page body, and B1 and B3 are the noise text before and after the body of the page to be extracted. As a concrete example, Figure 4 shows the texts of two real pages after preliminary filtering. In that figure, for reasons of space, "... .." replaces part of the text, and for clarity the page body is underlined. For PageA, for example, "German Chancellor ... Reuters reported" is the body, that is, A2 in Figure 3; "British Broadcasting ... 2010 22:31 UK" is A1; and "Print Sponsor ... you are able to do so." is A3. The sample page and the page to be extracted are very similar in structure and content, so in general A1 ≈ B1 and A3 ≈ B3, whereas A2 and B2 differ greatly and their similarity is extremely low. Accordingly, the head and tail boundaries of A2 and B2 can be found from how the similarity of the two texts changes, and B2 is finally extracted from PageB.
The task definition module 11 in Figure 1 can be implemented in any common programming language, such as Visual Basic, C++, or C# under Microsoft .NET. The task list can exist as a file; its specific format can be freely defined by the developer.
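Since the task list format is left to the developer, the following Python sketch shows only one possible layout. It assumes a tab-separated text file with one task per line; the class name ExtractionTask, the field names, the choice of minutes as the period unit, and the example.com addresses are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionTask:
    """One record of the task list (hypothetical field names)."""
    sample_url: str       # URL of the sample page
    url_pattern: str      # URL change pattern of the pages to collect
    period_minutes: int   # collection period, assumed here to be in minutes
    storage_target: str   # where the extracted body is saved or published


def parse_task_line(line):
    """Parse one tab-separated task list line into an ExtractionTask."""
    sample_url, url_pattern, period, storage = line.rstrip("\n").split("\t")
    return ExtractionTask(sample_url, url_pattern, int(period), storage)
```

A line such as "http://example.com/news/1.html<TAB>http://example.com/news/{id}.html<TAB>60<TAB>out/news" would then describe a task that is collected hourly.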
The page capture engine module 12 in Figure 1 can conveniently implement page collection and preliminary text filtering based on DOM and regular expressions. It is easy to implement; the programming tool may be based on Microsoft .NET, on Java, or on any other development tool that supports DOM objects and regular expressions, chosen by the developer as needed.
For the body extraction engine module in Figure 1, a key to its implementation is computing the similarity between texts. This can be done with the Levenshtein edit distance, which is defined as follows:
EditDist("", "") = 0
EditDist(s, "") = EditDist("", s) = L
EditDist(s1+ch1, s2+ch2) = min{ EditDist(s1, s2) + (if ch1 = ch2 then 0 else 1),
                                EditDist(s1+ch1, s2) + 1,
                                EditDist(s1, s2+ch2) + 1 }
Here EditDist is the function that computes the edit distance between two strings, "" denotes the empty string, and L denotes the length of string s. ch1 and ch2 each denote a single character, and min is the minimum function.
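The recursive definition can be transcribed almost literally into Python. The sketch below is illustrative only: the function name edit_dist and the use of memoization over prefix lengths are choices made here, and because it recurses it is suited to short strings rather than whole pages.

```python
from functools import lru_cache


def edit_dist(a, b):
    """Levenshtein edit distance, following the recursive definition."""

    @lru_cache(maxsize=None)
    def go(i, j):
        # go(i, j) is the distance between the prefixes a[:i] and b[:j].
        if i == 0:
            return j            # EditDist("", s) = L
        if j == 0:
            return i            # EditDist(s, "") = L
        substitute = go(i - 1, j - 1) + (0 if a[i - 1] == b[j - 1] else 1)
        drop_from_b = go(i, j - 1) + 1
        drop_from_a = go(i - 1, j) + 1
        return min(substitute, drop_from_b, drop_from_a)

    return go(len(a), len(b))
```

Memoizing on the pair of prefix lengths keeps the number of distinct subproblems at (len(a)+1) × (len(b)+1) instead of growing exponentially.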
The similarity between strings s1 and s2 can then be computed with the following formula:
Sim(s1, s2) = 1 - EditDist(s1, s2) / max(L1, L2)
Here L1 and L2 are the lengths of strings s1 and s2 respectively, and max is the maximum function.
The text similarity can be computed with a model such as Levenshtein edit distance, the vector space model, or cosine similarity; these three are the similarity models in common use at present. Two sequences of similarity values need to be computed, as follows:
(1) align the heads of the strings PageA and PageB,
(2) starting from the head, take a substring of length d from each of PageA and PageB, denoted s1 and s2, where d is a specified parameter (a string length, or number of text lines) chosen according to the page length,
(3) compute the similarity of s1 and s2,
(4) move one character toward the tail and take the next substrings of length d, again denoted s1 and s2,
(5) go to step (2) and repeat the above process until the end of PageA or PageB is reached; the similarity sequence obtained this way is denoted FS.
Similarly,
(1) align the tails of PageA and PageB,
(2) starting from the tail, take a substring of length d from each of PageA and PageB, denoted s1 and s2,
(3) compute the similarity of s1 and s2,
(4) move one character toward the head and take the next substrings of length d, again denoted s1 and s2,
(5) go to step (2) and repeat the above process until the head of PageA or PageB is reached; the similarity sequence obtained this way is denoted BS.
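The two procedures above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the window similarity assumes the common normalization 1 - distance / max(length), and the backward sequence is obtained by reversing both texts, which leaves each window's edit distance unchanged.

```python
def edit_distance(s1, s2):
    """Iterative Levenshtein distance using two rows of the DP table."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j - 1] + (c1 != c2),  # substitute or match
                           prev[j] + 1,               # delete from s1
                           cur[j - 1] + 1))           # insert into s1
        prev = cur
    return prev[-1]


def similarity(s1, s2):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s1, s2) / longest


def forward_sequence(page_a, page_b, d):
    """FS: align heads, slide a window of length d one character at a time."""
    n = min(len(page_a), len(page_b))   # stop at the end of the shorter text
    return [similarity(page_a[i:i + d], page_b[i:i + d])
            for i in range(max(n - d + 1, 0))]


def backward_sequence(page_a, page_b, d):
    """BS: align tails; reversing both texts turns this into the FS case."""
    return forward_sequence(page_a[::-1], page_b[::-1], d)
```

For two toy texts that share a four-character header and a two-character footer, such as "NNNNaaaaTT" and "NNNNbodyTT" with d = 2, both sequences start at 1.0 and fall once a window reaches the differing middle part.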
The similarity sequences are the forward similarity sequence FS and the backward similarity sequence BS, computed by the flow shown in Figure 5. FS is obtained by aligning the heads of the sample page text and the text of the page to be extracted and computing the similarity of a series of synchronized substrings from front to back; BS, conversely, is obtained by aligning the tails and computing from back to front. The position where the similarity value turns in FS is the head boundary of the body, and the position where it turns in BS is the tail boundary of the body.
Analysis shows that both sequences are roughly "Z"-shaped: the similarity values are high at the start and at the end, while in the middle they fall rapidly, producing a turning point.
This turning point corresponds to a boundary of the body: the turning point of FS corresponds to the head of the body, and the turning point of BS corresponds to the tail of the body. By working backward from the way the sequence values were obtained, the head and tail boundary positions of the body in PageA and PageB can be derived, so the body can be extracted and the noise text filtered out.

Claims (5)

1. A web page body extraction method based on a sample page, characterized in that it consists of the following steps:
(1) a user or program specifies a sample page;
(2) page tags are used for preliminary filtering, yielding text that contains the body;
(3) an edit distance model is used to compute the similarity between the text of the page to be extracted and the text of the sample page, and the body boundaries are identified;
(4) the body text is extracted.
2. The web page body extraction method based on a sample page according to claim 1, characterized in that any one sample page is specified in order to extract the body from pages to be extracted that are similar to the sample page.
3. The web page body extraction method based on a sample page according to claim 1, characterized in that the page to be extracted and the sample page undergo preliminary filtering to improve the accuracy and efficiency of the subsequent body extraction; the preliminary filtering uses the innerHTML attribute conforming to the W3C standard to obtain the page content including HTML tags and then removes the HTML tags with regular expressions, and the two filtered texts are the similar texts.
4. The web page body extraction method based on a sample page according to claim 1 or 3, characterized in that text similarity analysis is applied to the similar texts to identify the start and end boundaries of the body; the similarity computation is based on a similarity model, and two sequences of similarity values are computed as follows:
(1) align the heads of the two similar text strings,
(2) starting from the head, take a substring of the specified length from each similar text, denoted s1 and s2,
(3) compute the similarity of s1 and s2,
(4) move one character toward the tail and take the next substrings of the specified length, again denoted s1 and s2,
(5) go to step (2) and repeat the above process until the end is reached; the similarity sequence obtained this way is denoted FS;
Similarly,
(1) align the tails of the two similar texts,
(2) starting from the tail, take a substring of the specified length from each similar text, denoted s1 and s2,
(3) compute the similarity of s1 and s2,
(4) move one character toward the head and take the next substrings of the specified length, again denoted s1 and s2,
(5) go to step (2) and repeat the above process until the head is reached; the similarity sequence obtained this way is denoted BS.
5. The web page body extraction method based on a sample page according to claim 4, characterized in that the two similarity sequences are obtained and the body boundaries are identified from them; analysis of the two sequences shows that the similarity values are high at the start and at the end, while in the middle they fall rapidly, producing a turning point that corresponds to a boundary of the body, where the turning point of sequence FS corresponds to the head of the body and the turning point of sequence BS corresponds to the tail of the body; by working backward from the way the sequence values were obtained, the head and tail boundary positions of the body in the similar texts can be derived, so the body can be extracted and the noise text filtered out.
CN201310665878.2A 2013-12-11 2013-12-11 Page body extraction method based on sample page Pending CN103699591A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310665878.2A CN103699591A (en) 2013-12-11 2013-12-11 Page body extraction method based on sample page

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310665878.2A CN103699591A (en) 2013-12-11 2013-12-11 Page body extraction method based on sample page

Publications (1)

Publication Number Publication Date
CN103699591A true CN103699591A (en) 2014-04-02

Family

ID=50361119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310665878.2A Pending CN103699591A (en) 2013-12-11 2013-12-11 Page body extraction method based on sample page

Country Status (1)

Country Link
CN (1) CN103699591A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739402A (en) * 2008-11-07 2010-06-16 华为技术有限公司 Method and device for interest analysis
CN102314520A (en) * 2011-10-24 2012-01-11 莫雅静 Webpage text extraction method and device based on statistical backtracking positioning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QIUJUN LAN: "Extraction of News Content for Text Mining Based on Edit Distance", 《JOURNAL OF COMPUTATIONAL INFORMATION SYSTEMS》 *
朱育发等: "《jQuery开发完全技术宝典》", 28 February 2012, 中国铁道出版社 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105022803B (en) * 2015-07-01 2018-05-15 广州市万隆证券咨询顾问有限公司 A kind of method and system for extracting Web page text content
CN105022803A (en) * 2015-07-01 2015-11-04 广州市万隆证券咨询顾问有限公司 Method and system for extracting text content of webpage
CN105138517A (en) * 2015-10-23 2015-12-09 青岛恒波仪器有限公司 Parallel web page identification method and parallel web page identification device
CN105426388A (en) * 2015-10-23 2016-03-23 青岛恒波仪器有限公司 Apparatus for extracting and comparing webpage text
CN106844410B (en) * 2015-12-04 2022-02-08 奥多比公司 Determining quality of a summary of multimedia content
CN106844410A (en) * 2015-12-04 2017-06-13 奥多比公司 Determine the quality of the summary of content of multimedia
CN105550165A (en) * 2015-12-23 2016-05-04 深圳市八零年代网络科技有限公司 Plug-in and method capable of importing webpage article into webpage text editor
CN106227858B (en) * 2016-07-28 2019-06-25 北京橘子文化传媒有限公司 A kind of accurate extracting method of mobile Internet webpage or media platform article content
CN106227858A (en) * 2016-07-28 2016-12-14 北京橘子文化传媒有限公司 A kind of mobile Internet webpage or the accurate extracting method of media platform article content
CN107016124B (en) * 2017-04-27 2018-11-30 维沃移动通信有限公司 A kind of image processing method, mobile terminal and Cloud Server
CN107016124A (en) * 2017-04-27 2017-08-04 维沃移动通信有限公司 A kind of image processing method, mobile terminal and Cloud Server
CN107357766A (en) * 2017-07-19 2017-11-17 掌阅科技股份有限公司 Page editing method, electronic equipment and computer-readable storage medium based on e-book
CN107562799A (en) * 2017-08-04 2018-01-09 海南智媒云图科技股份有限公司 A kind of content reprints the method and device shared
WO2019090738A1 (en) * 2017-11-10 2019-05-16 深圳市华阅文化传媒有限公司 Method and device for purifying web fiction page
WO2020098098A1 (en) * 2018-11-13 2020-05-22 平安科技(深圳)有限公司 Semantic analysis-based text accuracy calculation method, device and computer device
WO2020098099A1 (en) * 2018-11-13 2020-05-22 平安科技(深圳)有限公司 Text accuracy calculation method and apparatus based on semantic parsing, and computer device


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140402