TWI456416B - Ajax web content crawling method and system - Google Patents

Ajax web content crawling method and system

Info

Publication number
TWI456416B
Authority
TW
Taiwan
Prior art keywords
script
language
file
function
code
Prior art date
Application number
TW098119740A
Other languages
Chinese (zh)
Other versions
TW201044197A (en)
Original Assignee
Alibaba Group Holding Ltd
Priority date
2009-06-12
Filing date
2009-06-12
Publication date
2014-10-11
Application filed by Alibaba Group Holding Ltd
Priority to TW098119740A
Publication of TW201044197A
Application granted
Publication of TWI456416B

Landscapes

  • Information Transfer Between Computers (AREA)

Claims (16)

1. A method for crawling web page content, characterized by comprising: acquiring web page code information; extracting scripting language information according to the web page code information, the scripting language information being contained in one or more scripting language files; determining the type of one script file among the one or more scripting language files according to the file name of that script file; if the script file is a framework file: obtaining an asynchronous scripting language feature value, and using the asynchronous scripting language feature value to determine at least one function, the at least one function containing a call related to an asynchronous web application of a web page, the web page being associated with the web page code; if the script file is a non-framework file, obtaining the at least one function according to a corresponding asynchronous scripting language feature value and code related to the function defined in the non-framework file; and triggering the at least one function containing at least one asynchronous scripting language call to obtain the generated web page content.
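The file-name-based classification step in claim 1 can be illustrated with a minimal Python sketch. The claim does not name any framework or feature value; the framework name fragments and call signatures below (`Ajax.Request`, `$.ajax`, `XMLHttpRequest`, and so on) are assumptions chosen only for illustration, and `classify_script` is a hypothetical helper. Returning the framework-free signatures for a non-framework file is also a simplification of the claim.

```python
import posixpath

# Assumed mapping from framework file-name fragments to the Ajax call
# signatures ("feature values") that framework exposes. Illustrative only.
FRAMEWORK_FEATURES = {
    "prototype": ["Ajax.Request", "Ajax.Updater"],
    "jquery": ["$.ajax", "$.get", "$.post"],
}

# Assumed feature values for Ajax calls made without any framework.
NATIVE_FEATURES = ["XMLHttpRequest", "ActiveXObject"]


def classify_script(path):
    """Return (kind, feature_values) for a script file, judged by its file name."""
    name = posixpath.basename(path).lower()
    for fragment, features in FRAMEWORK_FEATURES.items():
        if fragment in name:
            return "framework", features
    # Simplification: a non-framework file is scanned for framework-free calls.
    return "non-framework", NATIVE_FEATURES
```

For example, `classify_script("/js/jquery-1.3.2.min.js")` yields the `"framework"` kind with the assumed jQuery signatures, while `classify_script("/js/app.js")` falls through to `"non-framework"`.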
2. The method of claim 1, wherein the extracting comprises: searching for a scripting language tag in the web page code; if the scripting language information extracted from the web page code information and located after the scripting language tag contains scripting language code, extracting the scripting language code and saving the extracted scripting language code in a scripting language file; and if the scripting language information extracted from the web page code information and located after the scripting language tag contains a scripting language file, extracting the storage path and the file name of the scripting language file.
3. The method of claim 1, wherein the asynchronous scripting language feature value comprises at least one of: a feature value corresponding to invoking the asynchronous scripting language using a scripting language framework type, and a feature value corresponding to invoking the asynchronous scripting language without using a scripting language framework type.
4. The method of claim 1, wherein the at least one function is defined in a file, referenced in the web page, that is not a JavaScript framework file.
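The extraction step of claim 2 might be sketched in Python with the standard-library `html.parser` module. This is an illustration under assumptions, not the patented implementation: the class name `ScriptExtractor` and the function `extract_scripts` are hypothetical, and inline code is kept in memory rather than saved to a scripting language file as the claim describes.

```python
from html.parser import HTMLParser


class ScriptExtractor(HTMLParser):
    """Collect external script paths and inline script code from HTML."""

    def __init__(self):
        super().__init__()
        self.external = []      # values of <script src="..."> attributes
        self.inline = []        # code found between <script> and </script>
        self._in_script = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            # The tag references a script file: record its path.
            self.external.append(src)
        else:
            # The tag carries inline code: start buffering its text.
            self._in_script = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self.inline.append("".join(self._buffer))
            self._in_script = False


def extract_scripts(html):
    """Return (external_paths, inline_code_blocks) for an HTML string."""
    parser = ScriptExtractor()
    parser.feed(html)
    parser.close()
    return parser.external, parser.inline
```

Each external path can then be split into a storage path and a file name, for instance with `posixpath.split`, before the file-name check of claim 1 is applied.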
5. An Asynchronous JavaScript and XML (Ajax) web page content crawling system, characterized by comprising: a web page code acquisition unit for acquiring web page code information; a script extraction unit for extracting scripting language information from the web page code information; a script parsing unit for: determining, according to the file name of a script file, the type of the script file indicated by the scripting language information; if the script file is a framework file: obtaining an asynchronous scripting language feature value, and using the asynchronous scripting language feature value to determine a function, the function containing a call related to an asynchronous web application of a web page, the web page being associated with the web page code; and if the script file is a non-framework file, obtaining the function according to a corresponding asynchronous scripting language feature value and code related to the function defined in the non-framework file; and a web page content obtaining unit for triggering the function containing at least one asynchronous scripting language call to obtain web page content.
6. The system of claim 5, wherein the scripting language information comprises at least one of a scripting language file and scripting language code.
7. The system of claim 6, wherein the script extraction unit comprises: a query subunit for searching for a scripting language tag in the web page code; a first extraction subunit for, when the scripting language information extracted from the web page code information and located after the scripting language tag contains scripting language code, extracting the scripting language code and saving the extracted scripting language code in a scripting language file; and a second extraction subunit for, when the scripting language information extracted from the web page code information and located after the scripting language tag contains a scripting language file, extracting the storage path and file name of the scripting language file.
8. The system of claim 7, wherein the script parsing unit comprises: a first determining subunit for determining, according to the asynchronous scripting language feature value, the function defined in the scripting language file that contains at least one asynchronous scripting language call, the asynchronous scripting language feature value being a code segment that identifies the presence of an asynchronous scripting language call in the function; and a second determining subunit for determining, among the functions determined by the first determining subunit, the functions in the web page code that contain an asynchronous scripting language call.
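The role of the first determining subunit in claim 8, which uses a feature value as a code segment that marks an asynchronous call inside a function, can be sketched as a plain text scan. This is a rough Python illustration under stated assumptions: `functions_with_ajax_calls` is a hypothetical name, only named `function` declarations are recognized, and the brace matching ignores braces inside strings and comments, so a real crawler would more plausibly use a JavaScript parser.

```python
import re

# Matches named declarations such as "function loadList(a, b) {".
FUNC_RE = re.compile(r"function\s+([A-Za-z_$][\w$]*)\s*\([^)]*\)\s*\{")


def functions_with_ajax_calls(js_source, feature_values):
    """Return names of functions whose body contains any feature value."""
    found = []
    for match in FUNC_RE.finditer(js_source):
        # Walk forward from the opening brace to its matching close.
        depth, i = 1, match.end()
        while i < len(js_source) and depth > 0:
            if js_source[i] == "{":
                depth += 1
            elif js_source[i] == "}":
                depth -= 1
            i += 1
        body = js_source[match.end():i]
        # A feature value is treated as a literal code fragment.
        if any(feature in body for feature in feature_values):
            found.append(match.group(1))
    return found
```

The second determining subunit could then be approximated by keeping only those names that also appear in the page's own code, for example `[name for name in found if name in page_html]`, where `page_html` is an assumed variable holding the web page code.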
9. The system of claim 8, wherein the asynchronous scripting language feature value comprises at least one of: a feature value corresponding to invoking the asynchronous scripting language using a scripting language framework type, and a feature value corresponding to invoking the asynchronous scripting language without using an asynchronous scripting language framework type.
10. The system of claim 8, wherein the first determining subunit is further configured to determine at least one function that contains at least one asynchronous scripting language call and is defined in a non-scripting-language-framework file referenced in the page.
11. The system of claim 10, wherein the web page content obtaining unit is further configured to trigger, by simulating user operations, the determined at least one function containing at least one asynchronous scripting language call, and to obtain the web page content generated by that at least one function.
12. A non-transitory storage medium containing computer-executable instructions that, when executed by a computer, configure the computer to perform actions comprising: extracting script information from a web page, the script information indicating one or more script files; determining the type of one script file among the one or more script files according to the file name of that script file; if the script file is a framework file: obtaining one or more asynchronous scripting language feature values, and using the one or more asynchronous scripting language feature values to determine a function, the function containing a call related to an asynchronous web application of the web page; if the script file is a non-framework file, obtaining the function according to corresponding asynchronous scripting language feature values and code related to the function defined in the non-framework file; and invoking the function to generate content related to the web page.
13. The storage medium of claim 12, wherein the asynchronous web application is generated using asynchronous scripting language technology.
14. The storage medium of claim 13, wherein the one or more script files comprise a framework file and a non-framework file, the non-framework file defining a function that contains an asynchronous scripting language call, and the actions further comprise: obtaining, according to the file name of the framework file, one or more asynchronous scripting language feature values corresponding to the framework file; and determining the function according to the one or more asynchronous scripting language feature values and the non-framework file.
15. The storage medium of claim 12, wherein the actions further comprise obtaining the web page in a markup language format.
16. The storage medium of claim 12, wherein the script information comprises scripting language information.
TW098119740A 2009-06-12 2009-06-12 Ajax web content crawling method and system TWI456416B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW098119740A TWI456416B (en) 2009-06-12 2009-06-12 Ajax web content crawling method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW098119740A TWI456416B (en) 2009-06-12 2009-06-12 Ajax web content crawling method and system

Publications (2)

Publication Number Publication Date
TW201044197A TW201044197A (en) 2010-12-16
TWI456416B true TWI456416B (en) 2014-10-11

Family

ID=45001255

Family Applications (1)

Application Number Title Priority Date Filing Date
TW098119740A TWI456416B (en) 2009-06-12 2009-06-12 Ajax web content crawling method and system

Country Status (1)

Country Link
TW (1) TWI456416B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI610183B (en) * 2016-06-14 2018-01-01 健行學校財團法人健行科技大學 An operational system for centralized management base on ajax website

Citations (5)

Publication number Priority date Publication date Assignee Title
CN1783079A (en) * 2004-11-30 2006-06-07 阿尔卡特公司 Method of displaying data of a client computer
US20080040653A1 (en) * 2006-08-14 2008-02-14 Christopher Levine System and methods for managing presentation and behavioral use of web display content
TW200901033A (en) * 2007-06-13 2009-01-01 Microsoft Corp Systems and methods for providing desktop or application remoting to a web browser
US20090006454A1 (en) * 2007-06-29 2009-01-01 Microsoft Corporation WYSIWYG, browser-based XML editor
US7506248B2 (en) * 2005-10-14 2009-03-17 Ebay Inc. Asynchronously loading dynamically generated content across multiple internet domains

Also Published As

Publication number Publication date
TW201044197A (en) 2010-12-16

Similar Documents

Publication Publication Date Title
US20150143230A1 (en) Method and device for displaying webpage contents in browser
RU2013152635A (en) APPLICATION TILE FORMAT FORM
WO2014206169A1 (en) Method, device, and storage medium for drawing webpage text element based on html5
CN103294700B (en) Method and apparatus are locally stored in a kind of data of browser-cross
US20160283461A1 (en) Method and terminal for extracting webpage content, and non-transitory storage medium
US20170085676A1 (en) Webpage loading method, apparatus and system
US20150244661A1 (en) Method and apparatus for displaying rich text message on network platform, and computer storage medium
JP2012196529A5 (en)
TWI592807B (en) Method and device for web style address merge
JP2014514629A5 (en)
CN105095280A (en) Caching method and apparatus for browser
CN106980614B (en) A kind of Web page speech control implementation method based on JavaScript extension
CN106599270B (en) Network data capturing method and crawler
RU2011149589A (en) METHOD AND DEVICE FOR CONFIGURING SUBMISSION OF SERVICES DIRECTIONS
CN103425794A (en) Webpage previewing method and webpage previewing device based on bis-WebView
CN103838823A (en) Website content accessible detection method based on web page templates
CN108255975A (en) Template construction method, content of pages grasping means and device, medium and equipment
RU2017102575A (en) DEVICE AND METHOD FOR OPTIMIZING A WEB PAGE
CN103365919B (en) Web analysis container and method
KR101147256B1 (en) Producing apparatus and method for a standized electronic book
US10789325B2 (en) Systems and methods for prefetching dynamic URLs
CN105447198A (en) Convenient page script importing method and device
JP2015517710A5 (en)
US20150301994A1 (en) Non-transitory computer readable medium, information processing apparatus, and information processing method
TWI456416B (en) Ajax web content crawling method and system