TW201019142A - Dynamic webpage content capturing method - Google Patents

Dynamic webpage content capturing method

Info

Publication number
TW201019142A
Authority
TW
Taiwan
Prior art keywords
webpage
web
parser
dynamic
content
Prior art date
Application number
TW97142784A
Other languages
Chinese (zh)
Other versions
TWI399653B (en)
Inventor
guo-ren Zhao
yi-chang Cai
qing-chang Li
Original Assignee
guo-ren Zhao
yi-chang Cai
qing-chang Li
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by guo-ren Zhao, yi-chang Cai, qing-chang Li
Priority to TW97142784A
Publication of TW201019142A
Application granted
Publication of TWI399653B

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The present invention provides a dynamic webpage content capturing method in which a plurality of webpage parsers are written with an interpreter. Each webpage parser is defined to capture specific data and contains a varying number of webpage parser calling conditions. When the parsers are applied to capture the content of a source webpage, each parser dynamically calls other corresponding parsers into the parsing process according to the data contained in the webpage. A called parser can in turn call further parsers according to the content it processes, with no limit on the number of calls. The required information in the source webpage can therefore be extracted completely for subsequent use. Because webpage content is updated and revised quickly and frequently, users or system administrators can edit the parsers directly with suitable syntax (modifying, inserting or deleting them) whether or not the parsers are executing, so that the parsers adapt to new content and the changes take effect immediately, without asking the original developer to perform complicated modification and compilation.

Description

IX. Description of the Invention:

[Technical Field of the Invention]

The present invention relates to a dynamic webpage content capturing method, and in particular to a method that uses webpage parsers built with an interpreter to judge and analyse webpage content dynamically, so that information can be extracted in fine detail.

[Prior Art]

Figure 6 shows Taiwan patent publication No. I292104, "Method of using Web Page template to analyze Web Page document for extracting data". The main technique of that patent is to establish a webpage template first. A webpage parser then parses the content of the webpage document it reads according to the settings of the template, extracts the parsed data and records them in a database, so that the information contained in the webpage document is extracted automatically.

However, the foregoing method depends on a "webpage template" being established in advance; the webpage parser can only process webpage data that match the definition of that template. The Internet currently holds a countless number of sites, each of which may have its own webpage format. Information is also updated ever more frequently: new data are uploaded to servers at any time, and the overall layout may even be substantially revised. When information is updated rapidly and webpage content becomes more diverse, capturing webpage content with a webpage template becomes impractical. Because the content of a webpage template follows a fixed format, it cannot correspond to source webpages that change so quickly, and the webpage parser is then unable to identify and extract the content of the source webpage. Even if the data in the source webpage merely change their order of appearance, the template no longer corresponds to the source webpage.

Modifying an existing webpage template so that it matches the format of the source webpage to be parsed may work in the short term, but rebuilding a template takes considerable time and effort while the benefit is quite limited. Where webpages are updated so quickly and the data are so voluminous, modifying templates has little practical value.

[Summary of the Invention]

Capturing webpage content with webpage templates is too rigid to cope with webpages that are updated ever more frequently, so the captured content fails to meet expectations. The main object of the present invention is therefore to provide a dynamic webpage content capturing method that uses webpage parsers written with an interpreter program. While a webpage is being parsed, any webpage parser can automatically read the content of the webpage being processed and call other webpage parsers to join the parsing job, so that each item of information in the webpage is extracted and stored by category. A user or administrator can follow updates to the webpage and adapt to the new content simply by modifying, adding or deleting parsers through the interpreter, without handing the parsers back to the original developer for conventional, tedious compilation. The webpage parsers support real-time editing: while they are executing, the user may still dynamically add, delete or modify any parser, so that the running program responds immediately to changes in the format of the webpage content. Whether the operation of a webpage parser must be temporarily suspended can be controlled by conditions written in the program.

To achieve the foregoing object, the dynamic webpage content capturing method of the present invention comprises:

establishing a plurality of webpage parsers with an interpreter program, each webpage parser defining its own data extraction conditions and a varying number of webpage parser calling conditions;

designating a source webpage to be analysed as a pending webpage;

having the webpage parser examine the whole pending webpage for data that meet a data extraction condition or a webpage parser calling condition, wherein data that meet the data extraction condition are extracted and temporarily stored, and data that meet any webpage parser calling condition cause the corresponding webpage parser to be called into the parsing job, the called webpage parser performing its own data extraction and further determining whether other webpage parsers must be called into the parsing job;

determining whether all pending webpages have been parsed, namely whether every webpage parser that has entered the parsing job and every pending webpage has finished, and continuing the parsing job if any webpage is still waiting to be processed or any webpage parser is still waiting to execute; and

outputting the parsing result, wherein each webpage parser taking part in the parsing job extracts the required information from the source webpage according to the function it is responsible for, and the extracted information is classified and output.

[Embodiments]

The present invention proposes a dynamic webpage content capturing method that uses webpage parsers written with an interpreter to read webpage content comprehensively; the detailed items of information contained in a webpage are extracted one by one, classified and stored for use by other applications. The interpreter may be any feasible language, for example but not limited to PHP, JavaScript, Perl, Basic and ASP.
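The steps listed in the summary can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Parser` class, the regular-expression conditions, the parser names and the sample page text are all assumptions introduced for illustration. Each parser holds one extraction condition and any number of calling conditions, and a queue keeps the job running until no parser is waiting.

```python
import re
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Parser:
    """Hypothetical parser: one extraction condition plus any number of calling conditions."""
    name: str
    extract: str                                # regex; group 1 is the captured value
    calls: dict = field(default_factory=dict)   # regex -> name of the parser to call


def run(parsers, start, page):
    """Parse `page`, starting with parser `start`, until no parser is waiting."""
    results = {}                  # classified output: parser name -> captured values
    waiting = deque([start])      # parsers waiting to join the parsing job
    seen = {start}                # simplification: each parser joins once per page
    while waiting:
        parser = parsers[waiting.popleft()]
        # Extraction condition: scan the whole page, whatever the order of the data.
        found = re.findall(parser.extract, page)
        if found:
            results[parser.name] = found
        # Calling conditions: any match brings another parser into the job.
        for condition, callee in parser.calls.items():
            if callee not in seen and re.search(condition, page):
                seen.add(callee)
                waiting.append(callee)
    return results


PARSERS = {
    "title": Parser("title", r"Title: (.+)", {r"Memory:": "memory", r"Disk:": "disk"}),
    "memory": Parser("memory", r"Memory: (\S+)", {r"DDR": "memory_type"}),
    "memory_type": Parser("memory_type", r"(DDR\w*)"),
    "disk": Parser("disk", r"Disk: (\S+)"),
}

PAGE = "Title: Notebook X\nMemory: 2G DDRII\n"

print(run(PARSERS, "title", PAGE))
```

In this sketch the `disk` parser never runs because the page contains no matching data, while `memory` is called by `title` and in turn calls `memory_type`.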

Because the parsers of the present invention are built with an interpreter program, written in a language such as PHP, JavaScript, Perl, Basic or ASP, any of the related parsers can still be dynamically added, deleted or modified even while a webpage parser is parsing a webpage and extracting data. The running program can therefore respond immediately to changes in the format of the webpage content. By design, whether the operation of a webpage parser must be temporarily suspended can be controlled by conditions written in the program.
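One way to picture editing a parser while the job is running is the following Python sketch. It only illustrates the idea under stated assumptions: the `registry` dictionary, the `compile_parser` helper, the convention that parser source defines a function named `parse`, and the sample pages are all hypothetical and do not come from the patent.

```python
def compile_parser(source):
    """Turn parser source text into a function with no separate compile step.

    `source` must define a function named `parse(text)`; this convention is an
    assumption of the sketch.
    """
    namespace = {}
    exec(source, namespace)        # the source is interpreted at run time
    return namespace["parse"]


# Registry that an administrator may edit while parsing is in progress.
registry = {
    "price": compile_parser(
        "def parse(text):\n"
        "    return text.split('Price: ')[1].split()[0] if 'Price: ' in text else None\n"
    ),
}


def parse_pages(pages, on_page_done=None):
    """Parse pages one by one; the registry is read again for every page."""
    output = []
    for page in pages:
        row = {name: fn(page) for name, fn in list(registry.items())}
        output.append(row)
        if on_page_done:
            on_page_done(page)     # hook where an edit can arrive mid-run
    return output


def site_changed(_page):
    # The site switched from "Price:" to "Cost:", so the parser is replaced on the fly.
    registry["price"] = compile_parser(
        "def parse(text):\n"
        "    return text.split('Cost: ')[1].split()[0] if 'Cost: ' in text else None\n"
    )


rows = parse_pages(["Price: 100 NT", "Cost: 120 NT"], on_page_done=site_changed)
print(rows)
```

The second page is handled by the replacement parser without restarting the loop. Executing source text with `exec` is acceptable only for trusted input; a real system would need to restrict who may edit the registry.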

Figure 1 is a schematic diagram of a webpage to be parsed. The webpage serves only as an example to aid understanding of the invention; parsing is not limited to webpages of this format. The webpage can be read directly by anyone using a browser that supports its format, and it contains URL information (11) and body information (12). Besides plain text data (120), the body information (12) may include data of other types such as image files, sound files, video files and hyperlinks. The types of data the present invention can capture are not limited to text data (120); data of any type can be captured with the technique of the invention. The detailed description below uses text data (120) as the example.

Referring further to Figure 2, the main steps of the present invention include the following.

Establishing webpage parsers (201). In this step a plurality of webpage parsers are written with an interpreter program. Each webpage parser defines data extraction conditions and a varying number of webpage parser calling conditions; when a calling condition is met, other webpage parsers can be called to join the parsing job. The webpage parsers may be classified into any number of groups according to the user's own requirements or according to the function each parser is responsible for. The parsers do not necessarily stand in a master-slave relationship to one another; they may also be parallel. For example, the parsers may be grouped by function as shown in Figure 3: a URL parser responsible for taking apart the URL, a webpage component parser responsible for taking apart the webpage components, a first-level component parser, a second-level component parser and so on up to an Nth-level component parser. These parsers can be designed to be shared, so that they remain applicable to different source URLs.
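The function-based grouping of Figure 3 might be sketched as follows in Python. The group names, the two parser functions and the example URLs are assumptions made for illustration; only the idea that the same groups serve different source URLs comes from the text. `urlsplit` is from Python's standard `urllib.parse` module.

```python
from urllib.parse import urlsplit


def url_parser(url):
    """URL parser: break a source URL into host and sub-path segments."""
    parts = urlsplit(url)
    return {"host": parts.netloc, "path": [s for s in parts.path.split("/") if s]}


def item_name_parser(text):
    """First-level component parser: item names are the text before each colon."""
    return [line.split(":", 1)[0].strip() for line in text.splitlines() if ":" in line]


# Groups organised by function; the same groups serve every source URL.
GROUPS = {
    "url": [url_parser],
    "first_level": [item_name_parser],
}

pages = {
    "http://shop-a.example/notebooks/1": "Processor: 2.0GHz\nMemory: 2G",
    "http://shop-b.example/pc/7": "Memory: 4G\nHard disk: 250G",
}

for url, text in pages.items():
    host = GROUPS["url"][0](url)["host"]
    items = GROUPS["first_level"][0](text)
    print(host, items)
```

Both hypothetical shops are processed by the same two parsers, which is the sense in which the groups are shared.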
Alternatively, as shown in Figure 4, groups of webpage parsers dedicated to a kind of source URL may be established according to the characteristics of different source URLs (for example different websites or different sub-paths), and each group may in turn contain its own dedicated sub-groups. In actual use, however, the parsers within the same group or at the same level need not be executed one after another. Parsers in different groups may, depending on the data contained in the webpage, call one or more other parsers, producing an irregular, jumping order of execution; the figure indicates the calling order among different webpage parsers.

Reading in the webpages designated by the user for analysis (202). The user may specify one or several URLs to select a single source webpage or a plurality of source webpages to be analysed. If a single source website is specified, the user may also set how many levels of sub-path are to be retrieved from that website. For example, when the user specifies http:\\aaa\ and only one level of sub-path is to be parsed, the webpage content processed extends to that of http:\\aaa\bbb\.

Executing the webpage parsers (203). Once the webpages to be analysed have been selected, they are processed by the parsers established in advance. During execution, each webpage parser extracts data from the webpage according to the function assigned to it and at the same time dynamically calls other parsers to join the parsing job, so that the called parsers can perform their functions. If a called parser finds work that belongs to yet another parser, it may call that parser in turn. In other words, every webpage parser may call one or more other parsers into the job, and neither the number of calls nor the number of levels is limited. When a parser determines whether there are data to extract or whether other parsers must be called, it examines the whole of the webpage content for data that meet the data extraction conditions and the calling conditions. Any data that meet a condition are processed, regardless of the order in which the data are arranged or appear in the webpage.

Determining whether all pending webpages have been parsed (204). Besides the pending webpages designated by the user, a webpage may contain links to other webpages. Because a linked webpage may also hold content worth extracting, the parser may automatically bring that webpage into the parsing job. Only a single webpage is processed at any one time, so several webpages may be temporarily waiting. A single parser may also call several parsers into the job, so other parsers may likewise be waiting for the previous parser to finish before they execute in turn. If the determination shows that webpages are still pending or that parsers are still waiting to execute, the preceding step of executing the webpage parsers (203) continues. Conversely, if the parsers have processed the data down to the finest detail and no longer call other parsers, and all webpages have been processed, the job ends.

Outputting the parsing result (205). While the parsers are executing, they not only call other suitable parsers dynamically but also extract the required information according to their designed functions. The extracted information is classified and output to a database or to a file (in a format such as XML or Excel); the form of storage is not specifically limited.

Figure 5 shows one embodiment of the present invention in execution, explained with the webpage of Figure 1 as the example. Suppose the required data are the specifications of the product, that is, the items listed in the text data (120). With the approach of the present invention, the details of every item can be extracted completely.

First, after the user designates the webpage source (this embodiment uses a PChome shopping webpage as the example), the content of the webpage is taken apart into different parts by a webpage component parser (HTML Parser). Besides the text data (120), the webpage also contains, for example, a product title frame at the top, a product list frame on the left and a product image file in the centre. The webpage component parser may extract only the text data (120) for subsequent processing.
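As an illustration of a webpage component parser that keeps only the text data, the following sketch uses Python's standard `html.parser` module. The class name `TextOnlyParser` and the sample HTML are assumptions; the patent does not specify how its HTML Parser is built. Links are collected as well, because a link may bring a further webpage into the parsing job.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect visible text and link targets; skip <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Images and frames produce no text, so only the text data are kept.
        if not self._skip and data.strip():
            self.text.append(data.strip())


HTML = """
<html><body>
<h1>Notebook X</h1>
<img src="x.jpg">
<ul><li>Memory: 2G DDRII 667MHz</li><li>Hard disk: 250G SATA</li></ul>
<a href="/item/43">Next product</a>
</body></html>
"""

p = TextOnlyParser()
p.feed(HTML)
p.close()
print(p.text)
print(p.links)
```

The image is ignored, the specification lines are kept as text, and the link target is recorded so that a later step could queue the linked page.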

The webpage component parser then determines whether the text data (120) contain first-level data, second-level data and so on up to Nth-level data; if they do, it calls the corresponding parser to process them. For example, items such as processor, memory, hard disk, optical drive, screen/weight, network and others are regarded as first-level data, which the first-level component parser can recognise and extract. The brand of the processor may be regarded as second-level information, and a further level of information may be defined as the model, and so on. The data contained in every item can be extracted by webpage parsers designed in advance: for example, the memory has a capacity of 2G and the format DDRII 667MHz, and the hard disk has a capacity of 250G and the format SATA.

The flow of Figure 5 is only one example. When the webpage parsers are actually executed, their order of operation is not restricted to proceeding sequentially from the first level to the second level and on to the Nth level.
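The level-by-level extraction described for Figure 5 might look like the following Python sketch. The parser functions, the `CALLS` table and the regular expressions are assumptions for illustration; only the sample values (2G, DDRII 667MHz, 250G, SATA) come from the description.

```python
import re

SPEC_TEXT = "Memory: 2G DDRII 667MHz\nHard disk: 250G SATA\n"


def capacity_parser(value):
    """Lower-level parser: the capacity is a number followed by 'G'."""
    m = re.search(r"\b(\d+G)\b", value)
    return m.group(1) if m else None


def format_parser(value):
    """Lower-level parser: whatever follows the capacity is treated as the format."""
    m = re.search(r"\b\d+G\b\s*(.+)", value)
    return m.group(1).strip() if m else None


# Calling conditions of the first-level parser: item name -> parsers to call.
CALLS = {
    "Memory": {"capacity": capacity_parser, "format": format_parser},
    "Hard disk": {"capacity": capacity_parser, "format": format_parser},
}


def first_level_parser(text):
    """Recognise first-level items and hand each value to the parsers it calls for."""
    result = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        if sep and name in CALLS:
            result[name] = {key: fn(value) for key, fn in CALLS[name].items()}
    return result


print(first_level_parser(SPEC_TEXT))
```

Because the lower-level parsers are ordinary entries in a table, any parser could call any other one, which matches the statement that the order need not run strictly from the first level downwards.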
While the parser responsible for the fifth level is executing, for instance, it may determine during parsing that the data include material for which a third-level parser is responsible, and it then dynamically calls the corresponding third-level parser.

Taking the webpage shown in Figure 1 as an example, if a link (130) to another product is found during analysis, for example "next product", the link (130) opens a new webpage that introduces another product. In this case the present invention can automatically enter the new webpage to analyse and capture its content.

The present invention uses an interpreter program to build webpage parsers with different functions. A user who uses or manages the parsers can therefore modify them directly with the supported syntax, so that they immediately adapt to different webpages and extract data from them. The parsers need not be handed back to the original program developer for specialist rewriting, and no specific compiler is needed to compile them, so the invention offers a high degree of flexibility in use. During execution, each webpage parser dynamically calls the corresponding parsers according to the actual content it encounters, so that all the information in the webpage can be extracted completely.

Various modifications and variations of the present invention will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. Although the invention has been described with specific embodiments, it should not be unduly limited to those specific embodiments or to the order in which the steps are carried out. Modifications of the described modes of carrying out the invention that are obvious to those of ordinary skill in the art are also covered by the following claims.

[Brief Description of the Drawings]

Figure 1 is a schematic diagram of a webpage whose content is to be captured with the method of the present invention.

Figure 2 is a flowchart of the method of the present invention.

Figure 3 is a schematic diagram of one embodiment of establishing webpage parser groups according to the present invention.

Figure 4 is a schematic diagram of another embodiment of establishing webpage parser groups according to the present invention.

Figure 5 is a flowchart of one embodiment of the present invention.

Figure 6 is the method flowchart of Taiwan patent publication No. I292104, "Method of using Web Page template to analyze Web Page document for extracting data".

[Description of Main Reference Numerals]

(11) URL information
(12) body information
(120) text data
(130) link

Claims (1)

X. Claims:

1. A dynamic webpage content capturing method, comprising:

establishing a plurality of webpage parsers with an interpreter program, each webpage parser defining its own data extraction conditions and a varying number of webpage parser calling conditions;

designating a source webpage to be analysed as a pending webpage;

having the webpage parser examine the whole pending webpage for data that meet a data extraction condition or a webpage parser calling condition, wherein data that meet the data extraction condition are extracted and temporarily stored, and data that meet any webpage parser calling condition cause the corresponding webpage parser to be called into the parsing job, the called webpage parser performing its own data extraction and further determining whether other webpage parsers must be called into the parsing job; and

outputting the parsing result, wherein each webpage parser taking part in the parsing job extracts the required information from the source webpage according to the function it is responsible for, and the extracted information is classified and output.

2. The dynamic webpage content capturing method of claim 1, wherein the plurality of webpage parsers are divided into different groups according to their functions, so as to be shared by different source URLs.

3. The dynamic webpage content capturing method of claim 1, wherein the plurality of webpage parsers are divided into different groups according to different source URLs.

4. The dynamic webpage content capturing method of claim 1, wherein the webpage parser calling conditions defined in the webpage parser include: when the webpage being processed contains a link to another webpage, bringing that other webpage into the parsing job.

5. The dynamic webpage content capturing method of any one of claims 1 to 4, wherein the data captured by the webpage parser include text data, image files, sound files, video files or hyperlinks.

6. The dynamic webpage content capturing method of claim 5, wherein in the step of outputting the parsing result the extracted information is output to a database.

7. The dynamic webpage content capturing method of claim 5, wherein in the step of outputting the parsing result the extracted information is output as a file.

8. The dynamic webpage content capturing method of claim 6, wherein the interpreter program is PHP.

9. The dynamic webpage content capturing method of claim 6, wherein the interpreter program is JavaScript.

10. The dynamic webpage content capturing method of claim 6, wherein the interpreter program is Perl.

11. The dynamic webpage content capturing method of claim 6, wherein the interpreter program is Basic.

12. The dynamic webpage content capturing method of claim 6, wherein the interpreter program is ASP.

13. The dynamic webpage content capturing method of claim 7, wherein the interpreter program is PHP.

14. The dynamic webpage content capturing method of claim 7, wherein the interpreter program is JavaScript.

15. The dynamic webpage content capturing method of claim 7, wherein the interpreter program is Perl.

16. The dynamic webpage content capturing method of claim 7, wherein the interpreter program is Basic.

17. The dynamic webpage content capturing method of claim 7, wherein the interpreter program is ASP.

XI. Drawings: see the following pages.
TW97142784A 2008-11-06 2008-11-06 Dynamic webpage content capturing method TW201019142A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW97142784A TW201019142A (en) 2008-11-06 2008-11-06 Dynamic webpage content capturing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW97142784A TW201019142A (en) 2008-11-06 2008-11-06 Dynamic webpage content capturing method

Publications (2)

Publication Number Publication Date
TW201019142A true TW201019142A (en) 2010-05-16
TWI399653B TWI399653B (en) 2013-06-21

Family

ID=44831622

Family Applications (1)

Application Number Title Priority Date Filing Date
TW97142784A TW201019142A (en) 2008-11-06 2008-11-06 Dynamic webpage content capturing method

Country Status (1)

Country Link
TW (1) TW201019142A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI512505B (en) * 2010-05-20 2015-12-11 Alibaba Group Holding Ltd The method, device and e - commerce system of crawling web pages
US10043199B2 (en) 2013-01-30 2018-08-07 Alibaba Group Holding Limited Method, device and system for publishing merchandise information

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6163779A (en) * 1997-09-29 2000-12-19 International Business Machines Corporation Method of saving a web page to a local hard drive to enable client-side browsing
TW377421B (en) * 1998-02-12 1999-12-21 Inst Information Industry System and method of retrieving information intelligently from web
TWI238333B (en) * 2001-07-31 2005-08-21 Websurf Tech Co Ltd Website information capturing system and method
KR100531150B1 (en) * 2005-03-10 2005-11-29 엔에이치엔(주) Method and system for captureing image of web site, managing information of web site, and providing image of web site
TW200636504A (en) * 2005-04-13 2006-10-16 Gimefi Corp Method of using Web Page template to analyze Web Page document for extracting data

Also Published As

Publication number Publication date
TWI399653B (en) 2013-06-21
