TWI683225B

TWI683225B - Script generation method and device

Info

Publication number: TWI683225B
Application number: TW106119133A
Authority: TW
Inventors: 孫宇
Original assignee: 香港商阿里巴巴集團服務有限公司
Priority date: 2016-07-13
Filing date: 2017-06-08
Publication date: 2020-01-21
Also published as: CN106886547A; WO2018010573A1; TW201804340A

Abstract

The present application discloses a script generation method and device that address the low efficiency of manually writing crawl scripts when a web crawler is used to crawl web page content in the prior art. The method includes: determining the web page content selected by a user in a displayed web page; determining, according to the determined web page content, the web page code corresponding to that content; and generating a crawl script according to the web page code.

Description

Script generation method and device

This application relates to the field of computer technology, and in particular to a script generation method and device.

In the prior art, web crawlers can crawl the text content of web pages and are therefore widely used in fields such as search and data mining. A web crawler can crawl all of the content in a web page or only part of it.

At present, to use a web crawler to crawl target content in a target web page, a worker must first write a script for crawling that content; only then can the web crawler crawl the target content according to the script.

For example, suppose one wants to use a web crawler to crawl the price information of the product in the web page shown in FIG. 1, namely "Price: $149.99". The worker must visit the web page through a browser and then search the page's web page code for the code corresponding to "Price: $149.99", that is, the smallest Document Object Model (DOM) tree corresponding to "Price: $149.99".

The smallest DOM tree corresponding to "Price: $149.99" is as follows:

After finding the smallest DOM tree corresponding to "Price: $149.99", the worker obtains the HyperText Markup Language (HTML) attribute values corresponding to "Price: $149.99", such as id="kfs_family_16" and class="kfs-price". Based on these attribute values, the worker writes a crawl script that contains them. The finished script and the web page code of the product are then sent together to a parsing engine, so that the parsing engine can use the id and class in the crawl script to locate the smallest DOM tree corresponding to "Price: $149.99" and extract the price information "Price: $149.99" from it.

Although a web crawler can crawl web page content by the above method, the crawl script must be written manually, which is inefficient.

Embodiments of the present application provide a script generation method and device to address the low efficiency of manually writing crawl scripts when a web crawler is used to crawl web page content in the prior art.

The embodiments of the present application adopt the following technical solution: a script generation method, including: determining the web page content selected by a user in a displayed web page; determining, according to the determined web page content, the web page code corresponding to the web page content; and generating a crawl script according to the web page code.

A script generation device, including: a content determination module, which determines the web page content selected by a user in a displayed web page; a code determination module, which determines, according to the determined web page content, the web page code corresponding to the web page content; and a script generation module, which generates a crawl script according to the web page code.

At least one of the above technical solutions adopted by the embodiments of the present application can achieve the following beneficial effect. In the prior art, a crawl script must be written manually when a web crawler is used to crawl web page content. By contrast, the script generation method provided by the embodiments of the present application determines the web page content selected by the user in the web page, determines the web page code corresponding to that content, and generates a crawl script according to that code, thereby solving the prior-art problem of low efficiency in manually writing crawl scripts.

21‧‧‧Display the web page

22‧‧‧Determine the web page content selected by the user in the displayed web page

23‧‧‧Determine, according to the determined web page content, the web page code corresponding to the web page content

24‧‧‧Generate a crawl script according to the web page code

31‧‧‧Content determination module

32‧‧‧Code determination module

33‧‧‧Script generation module

The drawings described here are provided for a further understanding of the present application and form a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the application and do not unduly limit it. In the drawings: FIG. 1 shows the content of a target web page in the prior art; FIG. 2a is a flowchart of a script generation method provided by an embodiment of the present application; FIG. 2b shows a page for determining HTML attribute values provided by an embodiment of the present application; FIG. 2c shows a page, provided by an embodiment of the present application, that asks the user which web page content to crawl; FIG. 2d shows the page displayed after the user box-selects web page content, provided by an embodiment of the present application; FIG. 2e shows the page displayed after the user box-selects web page content twice, provided by an embodiment of the present application; and FIG. 3 is a schematic structural diagram of a script generation device provided by an embodiment of the present application.

To make the purpose, technical solutions, and advantages of the present application clearer, the technical solutions are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application, without creative effort, fall within the scope of protection of the present application.

The technical solutions provided by the embodiments of the present application are described in detail below with reference to the drawings.

To solve the prior-art problem of low efficiency in manually writing crawl scripts when a web crawler is used to crawl web page content, an embodiment of the present application provides a script generation method.

The method may be executed by, but is not limited to, a user terminal such as a mobile phone, a tablet computer, or a personal computer (PC), by an application (APP) running on such a user terminal, or by a device such as a server.

For ease of description, the implementation of the method is introduced below with a PC as the executing entity. This choice is only an illustrative example and should not be understood as a limitation of the method.

The flow of the method is shown in FIG. 2a and includes the following steps:

Step 21: Display the web page.

In the embodiments of the present application, when a user wants to use a web crawler to crawl web page content from a web page, the user can visit the page's URL through a browser installed on the PC, or through another application with browser functionality, so that the PC displays the web page for subsequent operations. A browser is used as the example below.

Specifically, the user can enter a URL in the browser's address bar and visit it, and the PC then displays the web page corresponding to that URL.

Step 22: Determine the web page content selected by the user in the displayed web page.

After the PC displays the web page, the user can select, according to actual needs, the web page content to be crawled. The browser can then determine the content the user selected in the web page and carry out the subsequent operations that finally generate the crawl script.

The user is able to select web page content in the web page because a first script is present in the page's web page code. The first script provides the function of selecting web page content in the web page, and it includes a Cascading Style Sheets (CSS) script. The first script is generally located at the top or bottom of the page's web page code. If it were embedded in the middle of the web page code, the browser might, during subsequent operations, mistake the first script for part of the page's own web page code, which would affect the crawl script that is finally generated.

In practice, the first script may be present in the page's web page code for either of two reasons: the web page code returned by the server after the user visits the URL through the browser may already contain the first script, or the browser may embed a preset first script into the web page code returned by the server before determining the web page content selected by the user in the web page.

If the web page code returned by the server already contains the first script, there are again two possibilities. The server, after receiving the browser's request for the page's web page code, may embed the preset first script into the web page code and then send that code to the browser. Alternatively, the browser's developers may have agreed in advance with the web page's developers, who then embed the first script when writing the page's web page code. In either case the web page code returned by the server contains the first script. The first script does not affect how the browser renders the web page.
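As a minimal sketch of the placement rule described above, the following Python fragment embeds a script at the top or bottom of the web page code and removes it again before the page code is analyzed. It treats the web page code as a plain string; the function names `embed_script` and `strip_embedded`, the sample page, and the CSS rule are all hypothetical and are not taken from the application.

```python
def embed_script(page_code: str, script: str, position: str = "bottom") -> str:
    """Return page_code with script added at the very top or very bottom."""
    if position == "top":
        return script + "\n" + page_code
    if position == "bottom":
        return page_code + "\n" + script
    raise ValueError("position must be 'top' or 'bottom'")


def strip_embedded(page_code: str, script: str) -> str:
    """Remove an embedded script found at the top or bottom of page_code."""
    if page_code.startswith(script + "\n"):
        return page_code[len(script) + 1:]
    if page_code.endswith("\n" + script):
        return page_code[:-(len(script) + 1)]
    return page_code


# Hypothetical first script: a CSS rule that outlines a selected element.
FIRST_SCRIPT = "<style>.crawl-selected { outline: 2px solid red; }</style>"
PAGE = "<html><body><p>Price: $149.99</p></body></html>"

injected = embed_script(PAGE, FIRST_SCRIPT, "bottom")
print(strip_embedded(injected, FIRST_SCRIPT) == PAGE)
```

Because the script sits at a known edge of the code, it can be separated from the page's own code with a simple prefix or suffix check, which is one plausible reading of why the middle of the page is avoided.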

Step 23: Determine, according to the determined web page content, the web page code corresponding to the web page content.

After step 22 is completed, the browser can determine the web page code corresponding to the web page content determined in step 22.

This is possible because, before the browser determines the web page code corresponding to the content determined in step 22, the browser embeds a preset second script into the page's web page code, so that the browser can use the second script to determine the web page code corresponding to the target content. The second script includes a JavaScript (JS) script.

Alternatively, the web page code returned by the server after the user visits the page's URL in the browser may already contain the second script, so that the browser can use it to determine the web page code corresponding to the web page content. The second script may be present in the returned code because the server, after receiving the browser's request for the page's web page code, embeds the preset second script into the web page code and then sends that code to the browser. It may also be present because the browser's developers agreed in advance with the web page's developers, who then embed the second script when writing the page's web page code. In either case the web page code returned by the server contains the second script. The second script does not affect how the browser renders the web page.

The browser can then determine the web page code corresponding to the content determined in step 22 as follows: using the second script, the browser identifies, in the page's web page code, the smallest DOM tree corresponding to the web page content, and then determines, within that smallest DOM tree, the HTML attribute values corresponding to the web page content.

When the browser renders a web page from its web page code, it can establish a mapping between web page content and web page code, or a mapping between the coordinates at which content appears in the page and the web page code. Using this mapping, the browser can determine the web page code corresponding to the content determined in step 22, that is, the smallest DOM tree corresponding to that content. If the content determined in step 22 corresponds to a single smallest DOM tree, the browser determines, within that tree, the HTML attribute values corresponding to the content. The HTML attribute values may be the class alone, or the id and the class. Specifically, before determining the HTML attribute values, the browser can display a query page on the screen of the PC on which it runs, asking the user whether to determine the id and class in the smallest DOM tree or only the class. For example, the page shown in FIG. 2b includes a control for determining the id and class and a control for determining the class. If the user clicks the former, the browser determines the id and class; if the user clicks the latter, the browser determines the class.

If the content determined in step 22 corresponds to at least two smallest DOM trees, the browser determines the corresponding HTML attribute values in each of those trees. Before doing so, the browser can display the query page shown in FIG. 2b on the screen of the PC on which it runs, asking the user whether to determine the id and class in each smallest DOM tree or only the class in each. The user clicks the appropriate control according to actual needs, and the browser determines the corresponding HTML attribute values according to the user's choice.

After the HTML attribute values are determined, step 24 can be performed to generate the crawl script.
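Step 23 can be illustrated with a short Python sketch. It is not the second script itself, which the application describes as JavaScript running in the browser; it only mimics the logic offline with `xml.etree.ElementTree`, under the assumption that the page code is well-formed markup. The sample page, the `kfs-box` class, and the function names are hypothetical; only the id and class values of the price element come from the text. Placing the id and class on the same element is also an assumption, since the DOM listing is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical page code; only id="kfs_family_16" and class="kfs-price"
# are taken from the description.
PAGE = """<html><body>
  <h1>Sample product</h1>
  <div class="kfs-box">
    <span id="kfs_family_16" class="kfs-price">Price: $149.99</span>
  </div>
</body></html>"""


def smallest_subtrees(root, selected_text):
    """Return the deepest elements whose text content contains selected_text."""
    matches = []

    def visit(el):
        # Visit every child first: if a child subtree already contains the
        # text, the current element is not the smallest enclosing tree.
        child_hit = False
        for child in el:
            if visit(child):
                child_hit = True
        if child_hit:
            return True
        if selected_text in "".join(el.itertext()):
            matches.append(el)
            return True
        return False

    visit(root)
    return matches


def attribute_values(el, include_id):
    """Collect class, and optionally id, mirroring the choice in FIG. 2b."""
    values = {"class": el.get("class")}
    if include_id:
        values["id"] = el.get("id")
    return values


root = ET.fromstring(PAGE)
trees = smallest_subtrees(root, "Price: $149.99")
selected = [attribute_values(el, include_id=True) for el in trees]
print(selected)
```

When the selected text occurs in several places, `smallest_subtrees` returns one element per occurrence, which corresponds to the case of at least two smallest DOM trees.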

Step 24: Generate a crawl script according to the web page code.

After step 23 is completed, the browser can add the HTML attribute values, taken from the web page code corresponding to the content to be crawled, to a preset script generation template to generate the crawl script. The crawl script is used to crawl web page content that matches the HTML attribute values.

If the HTML attribute values determined by the browser are the id and class, the browser adds the id and class determined in each DOM tree to the preset script generation template as a combination of the form {id=XXX, class=XXX} to generate the crawl script. If the HTML attribute value determined by the browser is the class alone, the browser adds the class determined in each DOM tree to the preset script generation template in the form {class=XXX} to generate the crawl script.
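The template step can be sketched as follows. The application does not disclose the format of the script generation template, so the template text, the `match:` keyword, and the function names below are hypothetical; only the {id=XXX, class=XXX} and {class=XXX} combination forms come from the description.

```python
from string import Template

# Hypothetical template; the real template format is not disclosed.
SCRIPT_TEMPLATE = Template("# crawl script (hypothetical format)\nmatch:\n$combinations\n")


def format_combination(attrs: dict) -> str:
    """Format one DOM tree's attribute values as {id=..., class=...} or {class=...}."""
    parts = []
    if "id" in attrs:
        parts.append(f"id={attrs['id']}")
    parts.append(f"class={attrs['class']}")
    return "{" + ", ".join(parts) + "}"


def generate_crawl_script(attr_list) -> str:
    """Fill the template with one combination per smallest DOM tree."""
    combos = "\n".join("  " + format_combination(a) for a in attr_list)
    return SCRIPT_TEMPLATE.substitute(combinations=combos)


script = generate_crawl_script([
    {"id": "kfs_family_16", "class": "kfs-price"},  # id and class chosen
    {"class": "kfs-price"},                         # class only chosen
])
print(script)
```

Keeping each id together with its class in one combination matters later, because the parsing engine uses the class "that exists in the same combination" as the id.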

After generating the crawl script, the browser can save it locally. The browser can also save locally the smallest DOM tree corresponding to the content determined in step 22, as well as the complete web page code, so that in subsequent operations it can use the crawl script, the smallest DOM tree, and the web page code to crawl web page content.

After generating the crawl script, the browser can pop up a page on the PC on which it runs to inform the user that the crawl script has been generated and to ask whether to crawl the web page content.

For example, the page may be as shown in FIG. 2c. The page includes a first crawl control and a second crawl control. If the user clicks the first crawl control, the browser sends the crawl script and the smallest DOM tree corresponding to the content determined in step 22 to the parsing engine. If the crawl script contains an id and a class, the parsing engine uses the id to find the smallest DOM tree in which that id is located, and then uses the class that belongs to the same combination as that id to extract, from that smallest DOM tree, the web page content the user wants to crawl. For example, suppose the smallest DOM tree determined for the "Price: $149.99" content that the user wants to crawl from the web page shown in FIG. 1 is:

The crawl script for crawling "Price: $149.99" contains id="kfs_family_16" and the class corresponding to "Price: $149.99", class="kfs-price".

When the user clicks the first crawl control, the browser sends the above smallest DOM tree together with the crawl script to the parsing engine. The parsing engine uses id="kfs_family_16" to find the smallest DOM tree in which that id is located, and then uses class="kfs-price" to extract the price information "Price: $149.99" from that DOM tree.

If the crawl script contains only a class and no id, the browser can use the class to extract the matching web page content from all of the smallest DOM trees sent to the parsing engine.

When the user clicks the second crawl control, the browser sends the crawl script and the page's web page code to the parsing engine. If the crawl script contains an id and a class, the parsing engine uses the id to find the smallest DOM tree in which that id is located, and then uses the class that belongs to the same combination as that id to extract, from that smallest DOM tree, the web page content the user wants to crawl.

If the crawl script contains only a class and no id, the browser can use the class to extract the matching web page content from the page's web page code.
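The two matching modes of the parsing engine can be sketched in Python as follows. The parsing engine's actual interface is not described, so the `crawl` function, the dictionary form of a combination, and the second sample tree (id `kfs_family_17`, price $89.50) are hypothetical illustrations; the markup is assumed to be well-formed so that `xml.etree.ElementTree` can parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical smallest DOM trees; only the first id and the class value
# come from the description.
TREE_A = '<span id="kfs_family_16" class="kfs-price">Price: $149.99</span>'
TREE_B = '<span id="kfs_family_17" class="kfs-price">Price: $89.50</span>'


def has_class(el, css_class):
    """True if css_class is one of the element's space-separated classes."""
    return css_class in el.get("class", "").split()


def crawl(markup_list, combo):
    """Extract text matching one {id, class} or {class} combination."""
    results = []
    for markup in markup_list:
        root = ET.fromstring(markup)
        if "id" in combo:
            # id present: restrict the search to the tree carrying that id.
            scopes = [el for el in root.iter() if el.get("id") == combo["id"]]
        else:
            # class only: search everything that was sent to the engine.
            scopes = [root]
        for scope in scopes:
            for el in scope.iter():
                if has_class(el, combo["class"]):
                    results.append("".join(el.itertext()).strip())
    return results


print(crawl([TREE_A, TREE_B], {"id": "kfs_family_16", "class": "kfs-price"}))
print(crawl([TREE_A, TREE_B], {"class": "kfs-price"}))
```

Passing the full web page code as the only item of `markup_list` corresponds to the second crawl control, while passing the saved smallest DOM trees corresponds to the first.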

It should be noted that the steps of the method provided by the embodiments of the present application may be executed by the same entity or by different entities. For example, after the browser completes step 23, it may send the determined web page code and HTML attribute values to a server, so that the server generates the crawl script according to the web page code. In addition, the browser installed on a PC is used above only as an illustrative example. The executing entity may also be another application with browser functionality installed on the PC, or an APP with browser functionality on a mobile terminal; the present application places no limitation on this.

In the embodiments of the present application, when the browser performs step 22, in one implementation the browser can begin determining the box-selected web page content as soon as the user starts box selection in the web page. Alternatively, after the user finishes box selection, the selected target content is enclosed in a rectangular box, and controls for continuing box selection, submitting, and canceling are displayed in the web page. The target content enclosed in the rectangular box may be highlighted, or it may be displayed as it originally appeared when the web page was first displayed; this can be set according to user needs and is not limited by the embodiments of the present application. For example, FIG. 2d shows the page displayed after the user box-selects the price information of a product on a shopping website that the user wants to crawl. The price information on the page is enclosed in a rectangular box, and the controls for continuing box selection, submitting, and canceling are displayed to the right of the price information. The enclosed price information is not highlighted but is displayed in its original state.

After the controls for continuing box selection, submitting, and canceling appear in the web page, the user can click the continue control to box-select additional content. If the user does not want to select anything else, the user can click the submit control, and the browser determines the box-selected web page content as the target content. If the user wants to discard the previously selected content and select something else, the user can click the cancel control and box-select again.
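The behavior of the three controls can be modeled as a small state holder, sketched below in Python. The class and method names are hypothetical, and the sketch assumes that canceling discards every pending selection; the description does not say whether only the most recent selection is discarded.

```python
class SelectionSession:
    """Tracks box selections until the user submits or cancels."""

    def __init__(self):
        self.pending = []   # content box-selected so far
        self.target = None  # set once the user clicks submit

    def box_select(self, content: str) -> None:
        """Record a box selection (also used after clicking 'continue')."""
        self.pending.append(content)

    def cancel(self) -> None:
        """Discard pending selections so the user can select again (assumed)."""
        self.pending.clear()

    def submit(self) -> list:
        """Fix the pending selections as the target content."""
        self.target = list(self.pending)
        return self.target


session = SelectionSession()
session.box_select("Shipping: free")   # selected by mistake
session.cancel()                       # user clicks cancel
session.box_select("Price: $149.99")   # user selects again
print(session.submit())
```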

In addition, in the script generation method provided by the embodiments of the present application, the user can first make a rough selection when choosing the web page content to crawl, and the browser can determine the smallest DOM tree corresponding to that first selection. The user then makes a second selection within the first, and the browser determines, within the smallest DOM tree already identified, the HTML attribute values in the web page code corresponding to the second selection. For example, as shown in FIG. 2e, if the user wants to crawl the price information "￥175" in the web page, the user can first roughly select the surrounding web page content that includes "￥175", and the browser determines the smallest DOM tree corresponding to that content. In the second selection the user selects only "￥175", and the browser determines the HTML attribute values corresponding to "￥175" within that smallest DOM tree. FIG. 2e contains two rectangular boxes: everything in the larger box is the user's first selection, and the web page content in the smaller box is the user's second selection.
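The two-pass selection can be sketched in Python as a coarse search followed by a narrower search inside its result. The sample page, its class names, and the function `deepest_containing` are hypothetical; the first selection is modeled as a list of text fragments that the user's larger box covers, and the markup is assumed to be well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical page; only the price text comes from the description.
PAGE = """<html><body>
  <div class="nav">Home</div>
  <div class="item">
    <span class="title">Sample product</span>
    <span class="price">￥175</span>
  </div>
</body></html>"""


def deepest_containing(root, fragments):
    """Return the deepest element whose text contains every fragment, or None."""
    text = "".join(root.itertext())
    if not all(fragment in text for fragment in fragments):
        return None
    for child in root:
        hit = deepest_containing(child, fragments)
        if hit is not None:
            return hit
    return root


root = ET.fromstring(PAGE)
# First pass: the rough selection covers the title and the price.
coarse = deepest_containing(root, ["Sample product", "￥175"])
# Second pass: search only inside the tree found by the first pass.
fine = deepest_containing(coarse, ["￥175"])
print(coarse.get("class"), fine.get("class"))
```

Restricting the second search to the first result keeps an identical price string elsewhere on the page from being matched by accident.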

In the embodiments of the present application, the script generation method described above can also be implemented by a script generation device.

FIG. 3 is a schematic structural diagram of a script generation device provided by an embodiment of the present application. The device mainly includes the following modules:

The content determination module 31 determines the web page content selected by the user in the displayed web page.

The code determination module 32 determines, according to the determined web page content, the web page code corresponding to the web page content.

The script generation module 33 generates a crawl script according to the web page code.

In one implementation, the device further includes a first embedding module, which embeds a preset first script into the page's web page code before the content determination module 31 determines the web page content selected by the user in the displayed web page. The first script provides the function of selecting web page content in the web page and includes a Cascading Style Sheets (CSS) script.

In one implementation, the device further includes a second embedding module, which embeds a preset second script into the page's web page code before the code determination module 32 determines, according to the determined web page content, the web page code corresponding to the web page content. The second script includes a JS script.

In that case, the code determination module 32 determines, according to the determined web page content and by means of the second script, the web page code corresponding to the web page content.

In one implementation, the code determination module 32 determines, in the page's web page code, the smallest Document Object Model (DOM) tree corresponding to the web page content, and determines, in that smallest DOM tree, the HyperText Markup Language (HTML) attribute values corresponding to the web page content.

In one embodiment, the script generation module 33 adds the determined HTML attribute value to a preset script generation template to generate a crawling script, where the crawling script is used to crawl webpage content that matches the HTML attribute value.
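Filling a preset template with the determined attribute value can be illustrated with Python's `string.Template`. The application does not disclose the template format, so this sketch assumes the generated "crawling script" is simply a CSS attribute selector; `SCRIPT_TEMPLATE` and `generate_crawl_script` are hypothetical names.

```python
from string import Template

# Assumed template: the crawling script is a CSS attribute selector.
SCRIPT_TEMPLATE = Template('$tag[$attr_name="$attr_value"]')


def generate_crawl_script(tag: str, attr_name: str, attr_value: str) -> str:
    """Substitute the determined HTML attribute value into the template."""
    # Escape backslashes and double quotes so the selector stays well-formed.
    escaped = attr_value.replace("\\", "\\\\").replace('"', '\\"')
    return SCRIPT_TEMPLATE.substitute(
        tag=tag, attr_name=attr_name, attr_value=escaped
    )


crawl_script = generate_crawl_script("span", "class", "price")
```

Because only the attribute value varies, the same template can be reused for any element the user selects.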

In one embodiment, the device further includes a content parsing module, which sends the crawling script and the webpage code to a parsing engine and crawls the corresponding webpage content through the parsing engine.
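A toy stand-in for the parsing engine is sketched below. It assumes the crawling script is a single CSS attribute selector of the form `tag[attr="value"]` and returns the text of every matching element; the names `ToyParsingEngine` and `crawl` are hypothetical, and a real parsing engine would support far richer scripts.

```python
import re
from html.parser import HTMLParser

# Only scripts of the exact form tag[attr="value"] are understood here.
SELECTOR_RE = re.compile(r'^(\w+)\[([\w-]+)="([^"]*)"\]$')


class ToyParsingEngine(HTMLParser):
    """Collects the text of elements matching a tag[attr="value"] script."""

    def __init__(self, script: str):
        super().__init__()
        m = SELECTOR_RE.match(script)
        if m is None:
            raise ValueError(f"unsupported crawling script: {script!r}")
        self.tag, self.attr, self.value = m.groups()
        self.depth = 0     # > 0 while inside a matching element
        self.results = []  # one text string per matching element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Count nested same-named tags so the right end tag closes the match.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag and dict(attrs).get(self.attr) == self.value:
            # The attribute value must match exactly.
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def crawl(script: str, page_code: str) -> list:
    """Run the crawling script against the webpage code."""
    engine = ToyParsingEngine(script)
    engine.feed(page_code)
    engine.close()
    return [text.strip() for text in engine.results]


page = (
    '<div><span class="price">Price: $149.99</span>'
    '<span class="name">Widget</span></div>'
)
crawled = crawl('span[class="price"]', page)
```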

In the prior art, crawling webpage content with a web crawler requires a crawling script to be written manually. By contrast, the script generation method provided by the embodiments of the present application determines the webpage content selected by the user in the webpage, determines the webpage code corresponding to that webpage content, and generates a crawling script according to that webpage code. This solves the prior-art problem that manually writing crawling scripts is inefficient when web crawlers are used to crawl webpage content.

Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, optical memory, and the like) that contain computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, may be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing equipment produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device. The instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on the computer or other programmable equipment to produce computer-implemented processing. The instructions executed on the computer or other programmable equipment thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

Memory may include non-permanent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, commodity, or device that includes the element.

Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, optical memory, and the like) that contain computer-usable program code.

The above are merely embodiments of the present application and are not intended to limit the present application. Those skilled in the art may make various modifications and changes to the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims

1. A script generation method, characterized in that the method comprises: determining webpage content selected by a user in a displayed webpage; determining, according to the determined webpage content, webpage code corresponding to the webpage content; and generating a crawling script according to the webpage code; wherein determining the webpage code corresponding to the webpage content specifically comprises: determining, in the webpage code of the webpage, the smallest Document Object Model (DOM) tree corresponding to the webpage content; and determining, in the smallest DOM tree, the HyperText Markup Language (HTML) attribute value corresponding to the webpage content.

2. The method according to claim 1, wherein, before determining the webpage content selected by the user in the displayed webpage, the method further comprises: embedding a preset first script into the webpage code of the webpage; wherein the first script is used to provide a function of selecting webpage content in the webpage, and the first script includes a Cascading Style Sheets (CSS) script.

3. The method according to claim 1, wherein, before determining the webpage code corresponding to the webpage content according to the determined webpage content, the method further comprises: embedding a preset second script into the webpage code of the webpage, the second script including a JS script; and wherein determining the webpage code corresponding to the webpage content according to the determined webpage content specifically comprises: determining, according to the determined webpage content and through the second script, the webpage code corresponding to the webpage content.

4. The method according to claim 1, wherein generating the crawling script according to the webpage code specifically comprises: adding the determined HTML attribute value to a preset script generation template to generate the crawling script, the crawling script being used to crawl webpage content that matches the HTML attribute value.

5. The method according to claim 4, wherein the method further comprises: sending the crawling script and the webpage code to a parsing engine, and crawling the corresponding webpage content through the parsing engine.

6. A script generation device, characterized in that the device comprises: a content determination module, which determines webpage content selected by a user in a displayed webpage; a code determination module, which determines, according to the determined webpage content, webpage code corresponding to the webpage content; and a script generation module, which generates a crawling script according to the webpage code; wherein the code determination module determines, in the webpage code of the webpage, the smallest Document Object Model (DOM) tree corresponding to the webpage content, and determines, in the smallest DOM tree, the HyperText Markup Language (HTML) attribute value corresponding to the webpage content.

7. The device according to claim 6, wherein the device further comprises: a first embedding module, which embeds a preset first script into the webpage code of the webpage before the content determination module determines the webpage content selected by the user in the displayed webpage; wherein the first script is used to provide a function of selecting webpage content in the webpage, and the first script includes a Cascading Style Sheets (CSS) script.

8. The device according to claim 6, wherein the device further comprises: a second embedding module, which embeds a preset second script into the webpage code of the webpage before the code determination module determines, according to the determined webpage content, the webpage code corresponding to the webpage content, the second script including a JS script; and wherein the code determination module determines, according to the determined webpage content and through the second script, the webpage code corresponding to the webpage content.

9. The device according to claim 6, wherein the script generation module adds the determined HTML attribute value to a preset script generation template to generate the crawling script, the crawling script being used to crawl webpage content that matches the HTML attribute value.

10. The device according to claim 9, wherein the device further comprises: a content parsing module, which sends the crawling script and the webpage code to a parsing engine and crawls the corresponding webpage content through the parsing engine.