Claims (16)
A method for crawling web page content, characterized in that the method comprises: obtaining web page code information; extracting scripting language information according to the web page code information, the scripting language information being contained in one or more scripting language files; determining a type of one script file of the one or more scripting language files according to a file name of the one script file; if the one script file is a framework file: obtaining an asynchronous scripting language feature value, and using the asynchronous scripting language feature value to determine at least one function, the at least one function containing a call related to an asynchronous web application of a web page, the web page being associated with the web page code; if the one script file is a non-framework file, obtaining the at least one function according to a corresponding asynchronous scripting language feature value and code related to the function defined in the non-framework file; and triggering the at least one function containing at least one asynchronous scripting language call to obtain generated web page content.
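The file-name classification and feature-value matching recited in the method claim above can be sketched as follows. This is only one possible reading of the claim, not the claimed implementation: the framework names, the feature strings, and the helper names `classify_script`, `select_features`, and `functions_with_features` are all assumptions made for illustration, and the final triggering step is omitted because it would need a JavaScript engine.

```python
import re

# Hypothetical lookup table: framework key (matched against the file name)
# mapped to feature values that mark an asynchronous call made through it.
FRAMEWORK_FEATURES = {
    "jquery": ["$.ajax(", "$.get(", "$.post("],
    "prototype": ["Ajax.Request(", "Ajax.Updater("],
}
# Hypothetical feature values for asynchronous calls made without a framework.
NATIVE_FEATURES = ["XMLHttpRequest", "ActiveXObject"]

# Matches the header of a named function declaration, up to its opening brace.
FUNCTION_RE = re.compile(r"function\s+(\w+)\s*\([^)]*\)\s*\{")


def classify_script(file_name):
    """Return the framework key if the file name looks like a framework file, else None."""
    lowered = file_name.lower()
    for key in FRAMEWORK_FEATURES:
        if key in lowered:
            return key
    return None


def select_features(file_names):
    """Collect native feature values plus those of every framework file referenced."""
    features = list(NATIVE_FEATURES)
    for name in file_names:
        key = classify_script(name)
        if key:
            features.extend(FRAMEWORK_FEATURES[key])
    return features


def functions_with_features(code, features):
    """Return names of declared functions whose body contains any feature value.

    Brace counting is naive: braces inside strings or comments are not handled.
    """
    found = []
    for match in FUNCTION_RE.finditer(code):
        depth, i = 1, match.end()
        while i < len(code) and depth:
            depth += {"{": 1, "}": -1}.get(code[i], 0)
            i += 1
        body = code[match.end():i]
        if any(feature in body for feature in features):
            found.append(match.group(1))
    return found
```

In this sketch a framework file contributes only feature values, while the functions themselves are looked up in the non-framework code, which is where page-specific handlers would normally be defined.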
The method of claim 1, wherein the extracting comprises: searching for a scripting language tag in the web page code; if the scripting language information that is extracted from the web page code information and located after the scripting language tag contains scripting language code, extracting the scripting language code and saving the extracted scripting language code in a scripting language file; and if the scripting language information that is extracted from the web page code information and located after the scripting language tag contains a scripting language file, extracting a storage path of the scripting language file and the file name.
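The extraction step recited above can be illustrated with a minimal sketch built on Python's standard `html.parser`. It assumes the scripting language tag is the HTML `<script>` element and that an external file is referenced through its `src` attribute; the class and function names are hypothetical, and inline code is kept in memory here rather than saved to a file.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
import posixpath


class ScriptExtractor(HTMLParser):
    """Collects inline script code and (storage path, file name) pairs of external scripts."""

    def __init__(self):
        super().__init__()
        self.inline_code = []   # code found between <script> and </script>
        self.external = []      # (storage path, file name) taken from src attributes
        self._in_inline = False

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            # External script file: record its path and file name only.
            path = urlparse(src).path
            self.external.append((posixpath.dirname(path), posixpath.basename(path)))
        else:
            # Inline script: start accumulating the code that follows the tag.
            self._in_inline = True
            self.inline_code.append("")

    def handle_data(self, data):
        if self._in_inline:
            self.inline_code[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline = False


def extract_scripts(html):
    """Return (list of inline code strings, list of (path, file name) pairs)."""
    parser = ScriptExtractor()
    parser.feed(html)
    parser.close()
    return [code for code in parser.inline_code if code.strip()], parser.external
```

The file names returned for external scripts are what a later step could use to decide whether a script is a framework file.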
The method of claim 1, wherein the asynchronous scripting language feature value comprises at least one of: a feature value corresponding to invoking the asynchronous scripting language using a scripting language framework type, and a feature value corresponding to invoking the asynchronous scripting language without using a scripting language framework type.
The method of claim 1, wherein the at least one function is defined in a non-JavaScript-framework file referenced in the web page.
An asynchronous JavaScript and XML web page content crawling system, characterized in that the system comprises: a web page code obtaining unit configured to obtain web page code information; a script extracting unit configured to extract scripting language information from the web page code information; a script parsing unit configured to: determine, according to a file name of a script file, a type of the script file indicated by the scripting language information; if the script file is a framework file: obtain an asynchronous scripting language feature value, and use the asynchronous scripting language feature value to determine a function, the function containing a call related to an asynchronous web application of a web page, the web page being associated with the web page code; and if the script file is a non-framework file, obtain the function according to a corresponding asynchronous scripting language feature value and code related to the function defined in the non-framework file; and a web page content obtaining unit configured to trigger the function containing at least one asynchronous scripting language call to obtain web page content.
The system of claim 5, wherein the scripting language information comprises at least one of a scripting language file and scripting language code.
The system of claim 6, wherein the script extracting unit comprises: a query subunit configured to search for a scripting language tag in the web page code; a first extracting subunit configured to, when the scripting language information that is extracted from the web page code information and located after the scripting language tag contains scripting language code, extract the scripting language code and save the extracted scripting language code in a scripting language file; and a second extracting subunit configured to, when the scripting language information that is extracted from the web page code information and located after the scripting language tag contains a scripting language file, extract a storage path and a file name of the scripting language file.
The system of claim 7, wherein the script parsing unit comprises: a first determining subunit configured to determine, according to the asynchronous scripting language feature value, the function that is defined in the scripting language file and contains at least one asynchronous scripting language call, the asynchronous scripting language feature value being a code segment that identifies the presence of an asynchronous scripting language call in the function; and a second determining subunit configured to determine, among the function determined by the first determining subunit, the function in the web page code that contains an asynchronous scripting language call.
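The second determining step above can be read as narrowing the candidate functions to those the web page code actually invokes. The sketch below takes that reading as an assumption: it treats an HTML event-handler attribute such as `onclick="name(...)"` as an invocation, and the pattern and the function name `functions_invoked_by_page` are hypothetical.

```python
import re

# Matches an event-handler attribute whose value starts with a function call,
# for example onclick="loadMore()" and captures the function name.
HANDLER_RE = re.compile(r"""\son\w+\s*=\s*["']\s*(\w+)\s*\(""")


def functions_invoked_by_page(html, candidate_functions):
    """Keep only the candidate functions that an event handler in the page calls."""
    invoked = set(HANDLER_RE.findall(html))
    return [name for name in candidate_functions if name in invoked]
```

Handlers attached from script code rather than from attributes would not be found by this pattern, so a fuller treatment would also need to inspect the scripts themselves.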
The system of claim 8, wherein the asynchronous scripting language feature value comprises at least one of: a feature value corresponding to invoking the asynchronous scripting language using a scripting language framework type, and a feature value corresponding to invoking the asynchronous scripting language without using an asynchronous scripting language framework type.
The system of claim 8, wherein the first determining subunit is further configured to determine at least one function that is defined in a non-scripting-language-framework file referenced in the page and contains at least one asynchronous scripting language call.
The system of claim 10, wherein the web page content obtaining unit is further configured to trigger, by simulating a user operation, the determined at least one function containing at least one asynchronous scripting language call, and to obtain the web page content generated by the at least one function containing at least one asynchronous scripting language call.
A non-transitory storage medium containing computer-executable instructions that, when executed by a computer, configure the computer to perform actions comprising: extracting script information from a web page, the script information indicating one or more script files; determining a type of one script file of the one or more script files according to a file name of the one script file; if the one script file is a framework file: obtaining one or more asynchronous scripting language feature values, and using the one or more asynchronous scripting language feature values to determine a function, the function containing a call related to an asynchronous web application of the web page; if the one script file is a non-framework file, obtaining the function according to a corresponding asynchronous scripting language feature value and code related to the function defined in the non-framework file; and invoking the function to generate content related to the web page.
The storage medium of claim 12, wherein the asynchronous web application is generated using asynchronous scripting language technology.
The storage medium of claim 13, wherein the one or more script files comprise a framework file and a non-framework file, the non-framework file defining a function that contains an asynchronous scripting language call, and the actions further comprise: obtaining, according to a file name of the framework file, one or more asynchronous scripting language feature values corresponding to the framework file; and determining the function according to the one or more asynchronous scripting language feature values and the non-framework file.
The storage medium of claim 12, wherein the actions further comprise obtaining the web page in a markup language format.
The storage medium of claim 12, wherein the script information comprises scripting language information.