TW200813763A - System and method for multithreading analyzing web page - Google Patents


Publication number
TW200813763A
TW200813763A TW95134261A TW95134261A TW200813763A TW 200813763 A TW200813763 A TW 200813763A TW 95134261 A TW95134261 A TW 95134261A TW 95134261 A TW95134261 A TW 95134261A TW 200813763 A TW200813763 A TW 200813763A
Authority
TW
Taiwan
Prior art keywords
webpage
content
analysis
analysis rule
analyzed
Prior art date
Application number
TW95134261A
Other languages
Chinese (zh)
Other versions
TWI315835B (en)
Inventor
Chung-I Lee
Chien-Fa Yeh
Chiu-Hua Lu
Xu-Chun Chen
Original Assignee
Hon Hai Prec Ind Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2006-09-15
Filing date
2006-09-15
Publication date
2008-03-16
Application filed by Hon Hai Prec Ind Co Ltd
Priority to TW95134261A
Publication of TW200813763A
Application granted
Publication of TWI315835B


Abstract

A system for multithreaded analysis of web pages is disclosed. The system includes an application server, a web page analysis rule base, a downloaded web page database, and an analyzed web page database. The application server includes a downloading module, a converting module, a determining module, an analyzing module, a saving module, and a feedback module. A related method is also disclosed.

Description

IX. Description of the Invention

[Technical Field]
The present invention relates to a system and method for multithreaded analysis of web page data.

[Prior Art]
In recent years, with the rapid growth of the Internet, the information available online has become vast in quantity and rich in content, and the Internet has become a major source of useful information for people's daily work, study, and life.

In general, information on the Internet exists in the form of web pages. Such unstructured information is inconvenient to use and contains a large amount of redundancy. When a user connects to a website over the Internet, the opened page displays useful information together with information that is useless or annoying to the user, such as advertisements and junk data. This unusable information can slow down a search engine or impair its accuracy, and it makes useful information harder for the user to obtain. Detecting unwanted content in web pages has therefore become an increasingly serious problem.

[Summary of the Invention]
In view of the above, it is necessary to provide a system for multithreaded analysis of web page data that can quickly and effectively extract the information in the web pages to be browsed and filter out unwanted web page content.
It is also necessary to provide a method for multithreaded analysis of web page data that can quickly and effectively extract the information in the web pages to be browsed and filter out unwanted web page content.
A system for multithreaded analysis of web page data includes an application server, a web page analysis rule base, a downloaded web page database, and an analyzed web page database.

The application server includes:
a downloading module, configured to download a web page to be analyzed and store the web page in the downloaded web page database;
a converting module, configured to convert the web page content into Extensible Markup Language (XML) format;
a determining module, configured to determine, according to the content nodes of the XML-format content, whether the web page analysis rule base contains an analysis rule corresponding to the web page content; to determine, according to whether the analysis rule contains a constraint on the web page content, whether to evaluate the analyzed web page content; and to determine, according to the constraint in the analysis rule, whether the analyzed web page content meets the evaluation requirements;
an analyzing module, configured to analyze the web page content according to the analysis rule when the web page analysis rule base contains an analysis rule corresponding to the web page content, and to evaluate the analyzed web page content when the analysis rule contains a constraint on the web page content;
a saving module, configured to store web page content that meets the evaluation requirements in the analyzed web page database; and
a feedback module, configured to feed the analysis rule back to the web page analysis rule base when the analyzed web page content does not meet the evaluation requirements.
A method for multithreaded analysis of web page data includes the following steps:
downloading a web page to be analyzed and storing the web page in the downloaded web page database;
converting the web page content into XML format;
determining, according to the content nodes of the XML-format content, whether the web page analysis rule base contains an analysis rule corresponding to the web page content;
if the web page analysis rule base contains a corresponding analysis rule, analyzing the web page content according to the analysis rule;
determining, according to whether the analysis rule contains a constraint on the web page content, whether to evaluate the analyzed web page content;
if the analysis rule contains a constraint on the web page content, evaluating the analyzed web page content;
determining, according to whether the analyzed web page content satisfies the constraint in the analysis rule, whether the analyzed web page content meets the evaluation requirements; and
if the analyzed web page content meets the evaluation requirements, storing the evaluated web page content in the analyzed web page database.
Compared with the prior art, the system and method described here use the analysis rules in the rule base to quickly and effectively extract the information in the web pages to be browsed and to filter out unwanted web page content, so that users can obtain the web page information they need more conveniently and quickly.

[Embodiment]
Fig. 1 is a hardware architecture diagram of a preferred embodiment of a system for multithreaded analysis of web page data according to the present invention. The system includes an application server 1, a web page analysis rule base 2, a downloaded web page database 3, an analyzed web page database 4, a firewall 5, and the Internet 6. Through the application server 1, the system downloads the web pages to be analyzed from the Internet 6 and stores the downloaded pages in the downloaded web page database 3. The application server 1 analyzes each downloaded page according to the corresponding analysis rules in the web page analysis rule base 2 and stores the analyzed page in the analyzed web page database 4. The firewall 5 controls information security with respect to the external network.
The web page analysis rule base 2, the downloaded web page database 3, and the analyzed web page database 4 may be located on the application server 1. Each of them may be a storage device such as a hard disk or flash memory.
Fig. 2 is a functional module diagram of the application server in Fig. 1. The application server 1 includes a downloading module 10, a converting module 12, a determining module 14, an analyzing module 16, a saving module 18, and a feedback module 20.
The downloading module 10 is configured to download the web page to be analyzed over the Internet 6 and store the web page in the downloaded web page database 3.
The converting module 12 is configured to convert the downloaded web page content into Extensible Markup Language (XML) format. The content downloaded by the downloading module 10 is in HyperText Markup Language (HTML) format, whereas the analysis engine used by the system is based on an XML query language and can parse only XML documents. The downloaded content must therefore be converted into XML.
The determining module 14 is configured to determine, according to the content nodes of the XML-format page, whether the web page analysis rule base 2 contains an analysis rule corresponding to the web page content; to determine, according to whether the analysis rule contains a constraint on the web page content, whether to evaluate the analyzed web page content; and to determine, according to the constraint in the analysis rule, whether the analyzed web page content meets the evaluation requirements. A web page may have zero, one, or several corresponding analysis rules. Different web pages correspond to different analysis rules according to their content nodes, and the individual sections of a single web page likewise correspond to different analysis rules according to their content nodes.
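The HTML-to-XML conversion performed by the converting module 12 can be illustrated with a minimal sketch. The source does not say how the conversion is implemented; the version below is an assumption that uses only the Python standard library, and the class name `HtmlToXml`, the function `html_to_xml`, and the wrapper element `document` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HtmlToXml(HTMLParser):
    """Builds an element tree from HTML so that XML query tools can read it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # hypothetical wrapper element
        self.stack = [self.root]            # currently open elements

    def handle_starttag(self, tag, attrs):
        node = ET.SubElement(self.stack[-1], tag,
                             {k: (v or "") for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: add the element, keep the parent open.
        ET.SubElement(self.stack[-1], tag, {k: (v or "") for k, v in attrs})

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text that follows a child element goes in its tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html: str) -> str:
    """Return a well-formed XML string for ordinary HTML input."""
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")
```

This sketch repairs unclosed void elements and unbalanced end tags, which are the most common reasons HTML fails to parse as XML. It does not sanitize attribute names or control characters that XML forbids, so a production converter would need further checks.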
An analysis rule may include constraints on the web page content, for example a constraint on the range of the content, range=(400,500), or a keyword setting, keywords "電子" ("electronics").
If a corresponding analysis rule is found in the web page analysis rule base 2, the determining module 14 determines that the rule base contains a corresponding analysis rule; if none is found, the determining module 14 determines that the rule base contains no corresponding analysis rule. For example, suppose the content node of a downloaded web page is <content>. If the rule base 2 contains the analysis rule statement //content for that content node, the determining module 14 determines that a corresponding analysis rule exists; if the rule base 2 contains no analysis rule statement for the content node content, the determining module 14 determines that no corresponding analysis rule exists.
If the analysis rule contains a constraint on the web page content, the determining module 14 determines that the analyzed web page content must be evaluated; if the analysis rule contains no such constraint, the determining module 14 determines that the analyzed web page content is not to be evaluated.
If the analyzed web page content satisfies the constraint in the analysis rule, the determining module 14 determines that the analyzed content meets the evaluation requirements; if it does not satisfy the constraint, the determining module 14 determines that the analyzed content does not meet the evaluation requirements.
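The rule lookup performed by the determining module 14 can be sketched as follows. The data layout is an assumption: the source gives only the rule statement //content and the constraints range=(400,500) and keywords "電子", so the `AnalysisRule` class, its field names, and the function `find_rules` are hypothetical. The standard-library `xml.etree.ElementTree` module is used as a stand-in for the unnamed XML query engine.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnalysisRule:
    path: str                           # rule statement, e.g. "//content"
    text_range: Optional[tuple] = None  # e.g. (400, 500)
    keywords: tuple = ()                # e.g. ("電子",)

    @property
    def has_constraints(self) -> bool:
        """True when the analyzed content must be evaluated afterwards."""
        return self.text_range is not None or bool(self.keywords)


def find_rules(xml_text, rule_base):
    """Return the rules whose node occurs in the page (zero, one, or several)."""
    # Wrap the page so that a rule can also match the page's root element.
    wrapper = ET.Element("wrapper")
    wrapper.append(ET.fromstring(xml_text))
    # ElementTree requires a leading "." on "//name" paths.
    return [rule for rule in rule_base
            if wrapper.find("." + rule.path) is not None]
```

An empty result corresponds to the case where the rule base contains no corresponding analysis rule, and `has_constraints` corresponds to the decision on whether the analyzed content needs to be evaluated.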
The analyzing module 16 is configured to analyze the web page content according to the analysis rule when the determining module 14 determines that the web page analysis rule base 2 contains a corresponding analysis rule, and to evaluate the analyzed web page content when the determining module 14 determines that evaluation is required.
Analyzing the web page content means extracting, according to the node named in the analysis rule, the content that the node contains in the web page, while filtering out the content contained in the other nodes. For example, if the web page content contains the content nodes <content>, <body>, and so on, and the node named in the analysis rule is body, the analyzed web page content includes only the content contained in <body>; the content contained in <content> and the other content nodes is filtered out.
Evaluating the analyzed web page content means checking whether the analyzed content satisfies the constraints in the analysis rule. For example, if the analysis rule includes a constraint on the range of the content text, the evaluation checks whether the analyzed content falls within that range; if the analysis rule specifies that the content must include the keyword "電子" ("electronics"), the evaluation checks whether the analyzed content contains that keyword.
The saving module 18 is configured to store the analyzed web page content, and the web page content that meets the evaluation requirements, in the analyzed web page database 4.
The feedback module 20 is configured to feed the analysis rule back to the web page analysis rule base 2 when the determining module 14 determines that the analyzed web page content does not meet the evaluation requirements.
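The extraction, evaluation, saving, and feedback behavior described above can be sketched together. Several points are assumptions rather than statements from the source: the rule is represented as a plain dictionary with hypothetical keys `node`, `range`, and `keywords`; the range constraint is interpreted as a bound on the character count of the extracted text, which the source does not define; and the analyzed web page database and the feedback channel are modeled as Python lists.

```python
import xml.etree.ElementTree as ET


def analyze(xml_text, node_name):
    """Keep only the text inside `node_name` elements; drop everything else."""
    root = ET.fromstring(xml_text)
    parts = ["".join(el.itertext()) for el in root.iter(node_name)]
    return "\n".join(part.strip() for part in parts)


def evaluate(text, text_range=None, keywords=()):
    """Return True when the extracted text satisfies every constraint."""
    if text_range is not None:
        low, high = text_range
        # Assumption: the range bounds the number of characters.
        if not (low <= len(text) <= high):
            return False
    return all(keyword in text for keyword in keywords)


def handle(xml_text, rule, analyzed_db, feedback_queue):
    """Analyze one page, then either store the result or feed the rule back."""
    text = analyze(xml_text, rule["node"])
    constrained = rule.get("range") is not None or rule.get("keywords")
    if not constrained or evaluate(text, rule.get("range"),
                                   rule.get("keywords", ())):
        analyzed_db.append(text)      # role of the saving module 18
        return True
    feedback_queue.append(rule)       # role of the feedback module 20
    return False
```

For a page `<page><content>ad banner</content><body>電子 news</body></page>` and a rule naming the node body, `analyze` keeps only the text of `<body>`, which mirrors the <content>/<body> example in the description.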
For example, if the analysis rule includes the text range constraint range=(400,500) and the analyzed web page content does not fall within that range, the constraint range=(400,500) is fed back to the web page analysis rule base 2 so that the responsible staff can revise the analysis rule.
Fig. 3 is a flowchart of a preferred embodiment of a method for multithreaded analysis of web page data according to the present invention.
In step S10, the downloading module 10 downloads the web page to be analyzed over the Internet 6 and stores the web page in the downloaded web page database 3.
In step S12, the converting module 12 converts the web page content into XML format.
In step S14, the determining module 14 determines, according to the content nodes of the XML-format content, whether the web page analysis rule base 2 contains a corresponding analysis rule. For example, if the web page to be analyzed contains the content node <content>, the rule base is searched for an analysis rule statement that includes the node content.
In step S16, if the web page analysis rule base contains a corresponding analysis rule, the analyzing module 16 analyzes the web page content according to the analysis rule.
In step S18, the determining module 14 determines, according to whether the analysis rule contains a constraint on the web page content, whether to evaluate the analyzed web page content. If the analysis rule contains no constraint on the web page content, the analyzed content is not evaluated; if it contains such a constraint, the analyzed content is evaluated.
In step S20, if the analysis rule contains a constraint on the web page content, the analyzing module 16 evaluates the analyzed web page content.
For example, if the analysis rule specifies that the web page content must include the keyword "電子" ("electronics"), the analyzing module 16 evaluates whether the analyzed web page content contains that keyword.
In step S22, the determining module 14 determines, according to the constraint in the analysis rule, whether the analyzed web page content meets the evaluation requirements.
In step S24, if the analyzed web page content meets the evaluation requirements, the saving module 18 stores the conforming content in the analyzed web page database 4. For example, if the analyzed content contains the keyword "電子", it meets the evaluation requirements, and the saving module 18 stores it in the analyzed web page database 4.
In step S14, if the web page analysis rule base 2 contains no corresponding analysis rule, the flow goes to step S26, in which the web page is stored in the analyzed web page database 4, and then ends.
In step S18, if the analyzed web page content does not need to be evaluated, the flow goes to step S28, in which the analyzed content is stored in the analyzed web page database 4, and then ends.
In step S22, if the analyzed web page content does not meet the evaluation requirements, the flow goes to step S30, in which the analysis rule is fed back to the web page analysis rule base 2, and then ends.

[Brief Description of the Drawings]
Fig. 1 is a hardware architecture diagram of a preferred embodiment of the system for multithreaded analysis of web page data according to the present invention.
Fig. 2 is a functional module diagram of the application server in Fig. 1.
Fig. 3 is a flowchart of a preferred embodiment of the method for multithreaded analysis of web page data according to the present invention.
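The control flow of Fig. 3, steps S10 through S30, can be sketched as a single function applied to many pages in parallel. The description does not state how work is divided among threads, so running one page per worker with `ThreadPoolExecutor` is an assumption, as are the function names, the dictionary-based rule with a hypothetical `constraints` key, and the `stores` dictionary standing in for the databases. The individual steps are passed in as callables so that the sketch stays independent of any particular downloader or parser.

```python
from concurrent.futures import ThreadPoolExecutor


def process_page(url, download, to_xml, find_rule, analyze, evaluate, stores):
    """Run steps S10 to S30 for one page and return the step that ended the flow."""
    html = download(url)                         # S10: download the page
    stores["downloaded"][url] = html             #      and keep the raw copy
    xml = to_xml(html)                           # S12: convert HTML to XML
    rule = find_rule(xml)                        # S14: look up an analysis rule
    if rule is None:
        # S26: no rule; assumption: the page is stored unmodified.
        stores["analyzed"][url] = html
        return "S26"
    content = analyze(xml, rule)                 # S16: extract the rule's node
    if not rule.get("constraints"):              # S18: is evaluation needed?
        stores["analyzed"][url] = content        # S28: store without evaluating
        return "S28"
    if evaluate(content, rule["constraints"]):   # S20 and S22: evaluate
        stores["analyzed"][url] = content        # S24: store conforming content
        return "S24"
    stores["feedback"].append(rule)              # S30: feed the rule back
    return "S30"


def process_all(urls, workers=4, **steps):
    """Process pages concurrently, one page per worker thread."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda url: process_page(url, **steps), urls))
```

The shared `stores` structures are written with single dictionary assignments and list appends, which are atomic in CPython; a database-backed version would rely on the database's own concurrency control instead.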
[Description of Main Component Symbols]
Application server 1
Downloading module 10
Converting module 12
Determining module 14
Analyzing module 16
Saving module 18
Feedback module 20


Claims (1)

X. Scope of the Patent Application

1. A system for multithreaded analysis of web page data, comprising an application server, a web page analysis rule base, a downloaded web page database, and an analyzed web page database, characterized in that the application server comprises:
a downloading module, configured to download a web page to be analyzed and store the web page in the downloaded web page database;
a converting module, configured to convert the web page content into Extensible Markup Language format;
a determining module, configured to determine, according to the content nodes of the Extensible Markup Language format content, whether the web page analysis rule base contains an analysis rule corresponding to the web page content, to determine, according to whether the analysis rule contains a constraint on the web page content, whether to evaluate the analyzed web page content, and to determine, according to the constraint in the analysis rule, whether the analyzed web page content meets the evaluation requirements;
an analyzing module, configured to analyze the web page content according to the analysis rule when the web page analysis rule base contains an analysis rule corresponding to the web page content, and to evaluate the analyzed web page content when the analysis rule contains a constraint on the web page content;
a saving module, configured to store web page content that meets the evaluation requirements in the analyzed web page database; and
a feedback module, configured to feed the analysis rule back to the web page analysis rule base when the analyzed web page content does not meet the evaluation requirements.

2. The system for multithreaded analysis of web page data as claimed in claim 1, wherein the saving module is further configured to store the web page directly in the analyzed web page database when the web page analysis rule base contains no analysis rule corresponding to the web page content, and to store the analyzed web page content directly in the analyzed web page database when the analyzed web page content does not need to be evaluated.

3. The system for multithreaded analysis of web page data as claimed in claim [number illegible in the source], wherein the analysis rule is determined according to the content nodes in the Extensible Markup Language content of the web page.

4. The system for multithreaded analysis of web page data as claimed in claim 3, wherein analyzing the web page content means extracting, according to the content node in the analysis rule, the content that the node contains in the web page content, while filtering out the content contained in content nodes of the web page that are not in the analysis rule.

5. The system for multithreaded analysis of web page data as claimed in claim 1, wherein evaluating the analyzed web page content means evaluating whether the analyzed web page content satisfies the constraints in the analysis rule.

6. A method for multithreaded analysis of web page data, the method comprising the following steps:
downloading a web page to be analyzed and storing the web page in a downloaded web page database;
converting the web page content into Extensible Markup Language format;
determining, according to the content nodes of the Extensible Markup Language format content, whether a web page analysis rule base contains an analysis rule corresponding to the web page content;
if the web page analysis rule base contains a corresponding analysis rule, analyzing the web page content according to the analysis rule;
determining, according to whether the analysis rule contains a constraint on the web page content, whether to evaluate the analyzed web page content;
if the analysis rule contains a constraint on the web page content, evaluating the analyzed web page content;
determining, according to whether the analyzed web page content satisfies the constraint on the web page content in the analysis rule, whether the analyzed web page content meets the evaluation requirements; and
if the analyzed web page content meets the evaluation requirements, storing the evaluated web page content in an analyzed web page database.

7. The method for multithreaded analysis of web page data as claimed in claim 6, further comprising the step of:
if the web page analysis rule base contains no analysis rule corresponding to the web page content, storing the web page content directly in the analyzed web page database.

8. The method for multithreaded analysis of web page data as claimed in claim 6, further comprising the step of:
if the analysis rule contains no constraint on the web page content, storing the analyzed web page content in the analyzed web page database.

9. The method for multithreaded analysis of web page data as claimed in claim 6, further comprising the step of:
if the analyzed web page content does not meet the evaluation requirements, feeding the analysis rule back to the web page analysis rule base.
TW95134261A 2006-09-15 2006-09-15 System and method for multithreading analyzing web page TWI315835B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW95134261A TWI315835B (en) 2006-09-15 2006-09-15 System and method for multithreading analyzing web page

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW95134261A TWI315835B (en) 2006-09-15 2006-09-15 System and method for multithreading analyzing web page

Publications (2)

Publication Number Publication Date
TW200813763A TW200813763A (en) 2008-03-16
TWI315835B TWI315835B (en) 2009-10-11

Family

ID=44768400

Family Applications (1)

Application Number Title Priority Date Filing Date
TW95134261A TWI315835B (en) 2006-09-15 2006-09-15 System and method for multithreading analyzing web page

Country Status (1)

Country Link
TW (1) TWI315835B (en)

Also Published As

Publication number Publication date
TWI315835B (en) 2009-10-11

Similar Documents

Publication Publication Date Title
US7668812B1 (en) Filtering search results using annotations
US9092789B2 (en) Method and system for semantic analysis of unstructured data
US20070050708A1 (en) Systems and methods for content extraction
US20100125828A1 (en) Data transformation based on a technical design document
JP2007528520A (en) Method and system for managing websites registered with search engines
US20070005649A1 (en) Contextual title extraction
AU2016228246B2 (en) System and method for concept-based search summaries
Levering et al. The portrait of a common HTML web page
Gottron Evaluating content extraction on HTML documents
CN114443928B (en) Web text data crawler method and system
CN111381809B (en) Method and device for searching focus page
US20110131214A1 (en) Information retrieval method, computer readable medium and information retrieval apparatus
Dumitru et al. Garbage in, garbage out: An analysis of HTML text extractors and their impact on NLP performance
JP2006331292A (en) Weblog community search support method, search support device, and recording medium recording program for search support method
US20150248500A1 (en) Documentation parser
KR102088619B1 (en) System and method for providing variable user interface according to searching results
US20110270862A1 (en) Information processing apparatus and information processing method
US20120324326A1 (en) Method and apparatus for outputting a multimedia file of a web page
CN101140578A (en) Method and system for multithread analyzing web page data
TW200813763A (en) System and method for multithreading analyzing web page
TWI320144B (en) System and method for downloading static web page
Chiang et al. The WT10G Dataset and the Evolution of the Web
KR102111989B1 (en) System and method for providing time series to a natural language question
JP2004303097A (en) Partial document extraction program and partial document extraction method of structured document
CN110618809B (en) Front-end webpage input constraint extraction method and device

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees