WO2016086784A1 - Webpage data collection method, apparatus, and system - Google Patents

Webpage data collection method, apparatus, and system

Info

Publication number
WO2016086784A1
Authority
WO
WIPO (PCT)
Prior art keywords
collection
target
webpage
loading
information
Prior art date
Application number
PCT/CN2015/095584
Inventor
刘庆
黄华
殷贤君
张美德
Original Assignee
阿里巴巴集团控股有限公司
刘庆
黄华
殷贤君
张美德
Application filed by 阿里巴巴集团控股有限公司, 刘庆, 黄华, 殷贤君, 张美德
Publication of WO2016086784A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • The present application relates to the field of the Internet, and in particular to a webpage data collection method, apparatus, and system.
  • Webpage data is loaded mainly in two ways: synchronously and asynchronously. With synchronous loading, the HTML page is returned directly in response to the request. With asynchronous loading, after the page is returned, JS (JavaScript, an interpreted scripting language) is loaded to change the original structure of the page and load the data. Once the returned HTML page has been obtained, it can be parsed and the useful data extracted. For example, the title of a news item can be extracted from the Sina News channel.
  • The purpose of the present application is to provide a webpage data collection method, apparatus, and system that improve data collection efficiency.
  • A webpage data collection method may include: receiving a request for batch data collection, where the request carries target URL information; determining the collection policy, corresponding to the target URL information, with which the target data can be successfully collected, where the collection policy is obtained by performing on the target URL information a target data collection test that includes at least a synchronous loading test, and the collection policy includes a synchronous loading mode or an asynchronous loading mode; and collecting the target data in the webpage pointed to by the target URL information using the synchronous or asynchronous loading mode set in the corresponding collection policy.
  • A webpage data collection apparatus may include: a request receiving unit, configured to receive a request for batch data collection, where the request carries target URL information.
  • A policy determining unit, which may be configured to determine the collection policy, corresponding to the target URL information, with which the target data can be successfully collected, where the collection policy is obtained by performing on the target URL information a target data collection test that includes at least a synchronous loading test, and the collection policy includes a synchronous loading mode or an asynchronous loading mode.
  • A collecting unit, which may be configured to collect the target data in the webpage pointed to by the target URL information using the synchronous or asynchronous loading mode set in the collection policy corresponding to the target URL information.
  • A webpage data collection system may include a client, which may be used to issue a request for batch data collection, where the request carries target URL information.
  • A collection policy configuration server, which may be configured to: receive the request for batch data collection sent by the client; determine the collection policy, corresponding to the target URL information carried by the request, with which the target data can be successfully collected, where the collection policy is obtained by performing on the target URL information a target data collection test that includes at least a synchronous loading test, and the collection policy includes a synchronous loading mode or an asynchronous loading mode; generate, according to the target URL information, a collection task that collects the target data in the webpage pointed to by the target URL information using the synchronous or asynchronous loading mode set in the collection policy; and distribute the collection task to the collection servers in a collection server cluster.
  • A collection server cluster, which may be configured to receive the collection task distributed by the collection policy configuration server, execute the collection task, and feed back the collected target data.
  • The embodiments of the present application determine, according to the target URL information carried in the request, the corresponding collection policy with which the target data can be successfully collected, and this collection policy is obtained by performing on the target URL information a target data collection test that includes at least a synchronous loading test. Therefore, if the target data in the webpage corresponding to the target URL information can be collected by synchronous loading, the loading mode in the determined collection policy can be the synchronous loading mode, and the data is collected with the synchronous loading mode set in the collection policy. Data that can be loaded synchronously is thus not loaded asynchronously, which avoids additional consumption of resources and time. The embodiments of the present application can therefore effectively improve data collection efficiency while ensuring that the target data is collected successfully.
  • FIG. 1 is a schematic flowchart of a webpage data collection method according to an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a webpage data collection device according to an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a webpage data collection system disclosed in an embodiment of the present application.
  • If an effective analysis test that includes at least a synchronous loading test is performed on the way page data is loaded, URL information whose target data can be loaded synchronously can be distinguished from URL information whose target data must be loaded asynchronously, and a corresponding collection policy that can successfully collect the target data can be set.
  • When data is collected in batches, it can be collected using the synchronous or asynchronous loading mode set in the collection policy, so that data that can be loaded synchronously is not loaded asynchronously. This avoids additional consumption of resources and time and effectively improves data collection efficiency.
  • FIG. 1 is a schematic flowchart of a webpage data collection method according to an embodiment of the present application. As shown in FIG. 1, the method may include:
  • S110: Receive a request for batch data collection, where the request carries target URL information.
  • The received request for batch data collection may carry batch collection configuration information entered by the user on a front-end page.
  • For example, the HTML tag of the target data can be configured as id: breadCrumbText.
  • The tag of the target data can also be configured as an XPath description; the present application does not limit this. The batch collection configuration information may also include other parameters chosen according to the user's own requirements, which the present application likewise does not limit.
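As an illustration of how target data identified by an `id` such as `breadCrumbText` might be read from a returned HTML page, the following is a minimal sketch using only Python's standard `html.parser` module. The class and function names (`IdTextExtractor`, `extract_by_id`) are hypothetical and do not come from the application, which does not specify a parsing technique.

```python
from html.parser import HTMLParser
from typing import Optional


class IdTextExtractor(HTMLParser):
    """Collects the text inside the first element whose id equals target_id."""

    # Void elements have no closing tag, so they must not change nesting depth.
    VOID = {"br", "img", "input", "hr", "meta", "link"}

    def __init__(self, target_id: str) -> None:
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # nesting depth inside the matched element
        self.done = False   # True once the matched element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in self.VOID:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in self.VOID:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def extract_by_id(html: str, target_id: str) -> Optional[str]:
    """Return the text of the element with the given id, or None if absent or empty."""
    parser = IdTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts).strip()
    return text or None
```

A caller would pass the page source and the configured id, for example `extract_by_id(page_html, "breadCrumbText")`. A result of `None` is what the later loading tests would treat as "target data could not be read".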
  • A mapping relationship between the file that stores a parameter and the storage address of that file also needs to be set, so that the parameters in the file can be read according to the mapping relationship during the data collection test.
  • For example, the keyword file submitted by the user can be downloaded, to a specified address, onto the machine that performs the data collection test, and the mapping relationship between the keyword file and its storage address is set and saved, for example "taskKeywordsFile": "/home/admin/1/test.txt", so that the keywords in the keyword file can be read according to the mapping relationship when the data collection test is performed.
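The mapping between the keyword file and its storage address could be consumed roughly as follows. This is a sketch under assumptions: the configuration is taken to be a JSON object containing the `taskKeywordsFile` key shown above, and the file is assumed to hold one keyword per line in UTF-8. Neither detail is stated in the application, and the function names are hypothetical.

```python
import json
from pathlib import Path


def keyword_path_from_config(task_config_json: str) -> Path:
    """Look up the storage address of the keyword file in the task configuration."""
    config = json.loads(task_config_json)
    return Path(config["taskKeywordsFile"])


def parse_keywords(text: str) -> list:
    """Split file content into keywords, assuming one keyword per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_keywords(task_config_json: str) -> list:
    """Read the keywords from the file named by the mapping relationship."""
    path = keyword_path_from_config(task_config_json)
    return parse_keywords(path.read_text(encoding="utf-8"))
```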
  • S120: Determine the collection policy, corresponding to the target URL information, with which the target data can be successfully collected, where the collection policy is obtained by performing on the target URL information a target data collection test that includes at least a synchronous loading test.
  • The collection policy includes a synchronous loading mode or an asynchronous loading mode.
  • The collection policy may be obtained in advance, before the request for batch data collection is received, by performing the target data collection test including at least the synchronous loading test on various different URL information. Alternatively, the test may be performed on the target URL information in real time when the request for batch data collection for that target URL information is received.
  • In the pre-test case, test configuration information entered by the user on the front-end page may be received in advance, mainly including the different types of URLs to be tested, the HTML tag that identifies the target data, and so on.
  • A target data collection test that gives priority to the synchronous loading mode may then be performed to obtain the collection policies corresponding to the different types of URLs.
  • The collection policies obtained by the pre-test can be saved in a database as historical collection policies, so that when a request for batch data collection is received, the corresponding historical collection policy is extracted from the database for data collection.
  • When a request is received, it may further be determined whether a historical collection policy corresponding to the target URL information carried by the request exists. If not, a target data collection test that gives priority to synchronous loading may be performed on the target URL information to obtain the corresponding collection policy with which the target data can be successfully collected.
  • The collection policy includes a synchronous loading mode or an asynchronous loading mode, and it is saved as the historical collection policy corresponding to the target URL information.
  • If a historical collection policy exists, it may be directly determined to be the collection policy, corresponding to the target URL information, with which the target data can be successfully collected.
  • The way a third-party site or platform loads its page data may change.
  • For example, a URL whose target data could originally be collected by synchronous loading may become a URL that can only be loaded asynchronously. Therefore, after the historical collection policy corresponding to the target URL information is extracted, a small-scale test can also be performed to verify whether the existing historical collection policy can still be used.
  • A small-scale test may include: determining, according to a preset small-scale test rule, the HTML tag used to identify the small-scale test data and the URL information to be tested within the target URL information; and attempting, according to the historical collection policy corresponding to the target URL information and the HTML tag used to identify the small-scale test data, to collect the small-scale test data in the webpages pointed to by the URL information to be tested.
  • If the collection succeeds, the historical collection policy may be determined to be the collection policy, corresponding to the target URL information, with which the target data can be successfully collected, and formal batch collection is performed.
  • The method further includes: if the collection is unsuccessful, performing the target data collection test including at least the synchronous loading test on the target URL information, obtaining a corresponding collection policy with which the target data can be successfully collected, and updating the historical collection policy corresponding to the target URL information according to the obtained collection policy.
  • The specific implementation of the preset small-scale test rule is not limited in the embodiments of the present application.
  • For example, the first 10 keywords can be extracted from the large number of keywords submitted by the user (if the user submitted fewer than 10 keywords, all of them are extracted), and each keyword is substituted in turn into the parameter position of the user-configured URL information, which determines the 10 pieces of URL information to be tested.
  • The historical collection policy extracted from the database is then used for the test.
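The keyword substitution used by this example rule can be sketched as below. The `{keyword}` placeholder syntax and the function name `build_test_urls` are assumptions made for illustration; the application only says that each keyword replaces the parameter position in the configured URL.

```python
from urllib.parse import quote


def build_test_urls(url_template: str, keywords: list, limit: int = 10) -> list:
    """Substitute up to `limit` keywords into the URL template, one URL per keyword.

    If fewer than `limit` keywords were submitted, all of them are used.
    """
    sample = keywords[:limit]
    # quote() percent-encodes characters that are not safe inside a URL.
    return [url_template.replace("{keyword}", quote(kw)) for kw in sample]
```

For a template such as `http://example.com/search?q={keyword}` (a made-up address), 25 submitted keywords would yield 10 test URLs and 3 submitted keywords would yield 3.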
  • The historical collection policy may include parameters such as the loading mode (synchronous or asynchronous), the connection timeout, and the page acquisition timeout.
  • The format of the extracted historical collection policy can be: "[{"url":"http://s.1688.com/selloffer/offer_search.htm?
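The example policy string in the source is cut off after the `url` field, so its remaining fields are unknown. Purely as an illustration of a record holding the three parameters named above, a policy could be modelled as follows; every field name other than `url` is an assumption, as are the millisecond units.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CollectionPolicy:
    url: str                  # URL (or URL type) the policy applies to
    load_mode: str            # "sync" or "async" (hypothetical encoding)
    connect_timeout_ms: int   # hypothetical field name and unit
    page_timeout_ms: int      # hypothetical field name and unit


def dump_policies(policies: list) -> str:
    """Serialize policies as a JSON array of objects."""
    return json.dumps([asdict(p) for p in policies])


def load_policies(text: str) -> list:
    """Rebuild CollectionPolicy records from the JSON array."""
    return [CollectionPolicy(**item) for item in json.loads(text)]
```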
  • If the small-scale test fails, the target data collection test including at least the synchronous loading test may be re-executed for the target URL information configured by the user, and the historical collection policy corresponding to the target URL information is updated with the newly obtained collection policy.
  • Target data is then collected in batches according to the updated collection policy.
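The decision flow around historical policies (reuse after a successful small-scale test, otherwise retest and update) might be sketched as follows. The callables `try_collect` and `run_full_test`, the dictionary standing in for the database, and the choice to require every sample URL to succeed are assumptions for illustration; the application does not fix these details.

```python
from typing import Callable, Dict, List


def resolve_policy(
    url_type: str,
    sample_urls: List[str],
    history: Dict[str, dict],
    try_collect: Callable[[str, dict], bool],
    run_full_test: Callable[[str], dict],
) -> dict:
    """Return a usable collection policy for url_type, updating history if needed."""
    policy = history.get(url_type)
    # Small-scale test: the historical policy is kept only if it still
    # collects the test data from every sample URL.
    if policy is not None and all(try_collect(u, policy) for u in sample_urls):
        return policy
    # No history, or the history no longer works: run the full
    # synchronous-first collection test and store the result.
    policy = run_full_test(url_type)
    history[url_type] = policy
    return policy
```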
  • The specific implementation of the target data collection test, including at least the synchronous loading test, performed on the target URL information is not limited.
  • For example, the target data collection test including at least the synchronous loading test may include: loading the webpage pointed to by the target URL information in synchronous loading mode, and attempting to read the target data from the synchronously loaded webpage; for URL information whose target data can be read from the synchronously loaded webpage, setting the loading mode in the collection policy corresponding to that type of URL information to the synchronous loading mode; and for URL information whose target data cannot be read from the synchronously loaded webpage, setting the loading mode in the collection policy corresponding to that type of URL information to the asynchronous loading mode.
  • Alternatively, the webpage pointed to by the target URL information may first be loaded in asynchronous loading mode, with an attempt to read the target data from the asynchronously loaded webpage, and then loaded in synchronous loading mode, with an attempt to read the target data from the synchronously loaded webpage. If the target data can be read from the synchronously loaded webpage, the loading mode in the collection policy corresponding to that type of URL information can be set to the synchronous loading mode. If the target data cannot be read from the synchronously loaded webpage but can be read from the asynchronously loaded webpage, the loading mode in the collection policy corresponding to that type of URL information can be set to the asynchronous loading mode.
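The synchronous-first test can be sketched as a small function. The loaders and the reader are passed in as callables because the application does not name any concrete HTTP client or rendering engine; `load_sync`, `load_async`, and `read_target` are hypothetical names, and returning `None` when neither mode works is an added assumption.

```python
from typing import Callable, Optional


def choose_load_mode(
    url: str,
    load_sync: Callable[[str], str],
    load_async: Callable[[str], str],
    read_target: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return "sync" if the target data is readable from the synchronously
    loaded page, "async" if only the asynchronously loaded page contains it,
    and None if neither mode yields the target data."""
    if read_target(load_sync(url)) is not None:
        return "sync"   # cheaper mode works, so prefer it
    if read_target(load_async(url)) is not None:
        return "async"
    return None
```

Checking the synchronous result first means the more expensive asynchronous load is only attempted when it is actually needed, which mirrors the efficiency argument made in the text.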
  • The step of loading the webpage pointed to by the target URL information in synchronous loading mode may be performed multiple times, and the method may further include: recording, at each execution, the time taken to establish a connection with the URL and the time taken to obtain the webpage after the connection; and, when the loading mode in the collection policy corresponding to that type of URL information is set to the synchronous loading mode, setting the connection timeout and the page acquisition timeout corresponding to the synchronous loading mode in the collection policy according to the connection times and page acquisition times recorded over the multiple executions.
  • Similarly, for URL information whose target data cannot be read from the synchronously loaded webpage, the webpage may be loaded multiple times in asynchronous loading mode, and the time taken to establish a connection with the URL and the time taken to obtain the webpage after the connection are recorded at each execution. Then, when the loading mode in the collection policy corresponding to that type of URL information is set to the asynchronous loading mode, the connection timeout and the page acquisition timeout in the collection policy can be set according to the connection times and page acquisition times recorded during the asynchronous loads.
  • The specific method of setting the connection timeout and the page acquisition timeout in the collection policy according to the connection times and page acquisition times recorded over the multiple executions is not limited.
  • For example, the average of the connection times recorded over the multiple executions may be taken as the connection timeout to be set, and the average of the page acquisition times recorded over the multiple executions may be taken as the page acquisition timeout.
  • Other ways of obtaining the connection timeout and the page acquisition timeout may also be used, which is not limited in this application.
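The averaging example translates directly into code. In this sketch the recorded times are assumed to be in milliseconds and the result keys are hypothetical names.

```python
from statistics import mean


def derive_timeouts(connect_times_ms: list, page_times_ms: list) -> dict:
    """Use the mean of the recorded times as the timeouts, as in the example above."""
    return {
        "connect_timeout_ms": mean(connect_times_ms),
        "page_timeout_ms": mean(page_times_ms),
    }
```

With recorded connection times of 100, 200, and 300 ms, the connection timeout would be set to 200 ms. A plain mean leaves no headroom above typical load times, so an implementation might prefer a multiple of the mean or a high percentile; the application explicitly leaves this choice open.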
  • With the connection timeout and the page acquisition timeout set in the collection policy, during batch data collection the connection request may be reissued when a connection times out according to the connection timeout set in the collection policy, and the page read request may be reissued when reading the page times out according to the page acquisition timeout set in the collection policy.
  • In addition, upper limits on the number of connection retries and on the number of page read retries may be set, so that when the number of retries exceeds the upper limit, collection of the page data corresponding to that URL information is abandoned.
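The retry behaviour with upper limits can be sketched as below. `connect` and `read_page` are hypothetical callables assumed to raise Python's built-in `TimeoutError` when the configured timeout elapses; `CollectionAbandoned` is likewise a made-up exception name.

```python
class CollectionAbandoned(Exception):
    """Raised when the retry limit is exceeded and the URL is given up."""


def _retry(action, max_retries: int, url: str):
    # One initial attempt plus up to max_retries retries.
    for _ in range(max_retries + 1):
        try:
            return action()
        except TimeoutError:
            continue
    raise CollectionAbandoned(url)


def fetch_with_retries(url, connect, read_page,
                       max_connect_retries: int = 3, max_read_retries: int = 3):
    """Connect and read the page, reissuing each request on timeout."""
    conn = _retry(lambda: connect(url), max_connect_retries, url)
    return _retry(lambda: read_page(conn), max_read_retries, url)
```

Connection retries and page read retries are counted separately here, matching the two separate upper limits mentioned in the text.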
  • S130: Collect the target data in the webpage pointed to by the target URL information using the synchronous or asynchronous loading mode set in the collection policy corresponding to the target URL information.
  • The request may carry one or more pieces of target URL information.
  • The target data collection test including at least the synchronous loading test may be performed separately on different types of URL information to determine which URL information allows the target data to be loaded synchronously and which requires it to be loaded asynchronously, and to set the corresponding collection policy with which the target data can be successfully collected.
  • For multiple different types of target URL information, the corresponding collection policy may be used for each to collect the target data in the webpages.
  • The target data collection test performed on the different types of URL information may be implemented with reference to the foregoing implementation of the test on the target URL information, and details are not repeated here.
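Step S130 for a batch of URLs with different policies might look like the following sketch. The `policy_for` lookup, the `loaders` mapping keyed by `"sync"` and `"async"`, and the `load_mode` key are all assumptions introduced for illustration.

```python
from typing import Callable, Dict, Iterable, Optional


def collect_batch(
    urls: Iterable[str],
    policy_for: Callable[[str], dict],
    loaders: Dict[str, Callable[[str], str]],
    read_target: Callable[[str], Optional[str]],
) -> Dict[str, Optional[str]]:
    """Load each URL with the mode named in its policy and read the target data."""
    results = {}
    for url in urls:
        mode = policy_for(url)["load_mode"]   # "sync" or "async"
        html = loaders[mode](url)             # only the chosen loader runs
        results[url] = read_target(html)
    return results
```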
  • Because the collection policy corresponding to the target URL information is obtained by performing on the target URL information the target data collection test including at least the synchronous loading test, data can be collected in batches using the synchronous or asynchronous loading mode set in that collection policy. Data that can be loaded synchronously is therefore not loaded asynchronously, which avoids additional consumption of resources and time and effectively improves data collection efficiency.
  • The application also records and analyzes the page connection time and the page read time and sets the corresponding connection timeout and page acquisition timeout in the collection policy, so that during formal data collection the synchronous or asynchronous loading mode can be invoked appropriately according to the collection policy. This maximizes collection efficiency while ensuring accurate data collection and avoids additional consumption of hardware resources and time.
  • The present application further provides a webpage data collection apparatus.
  • FIG. 2 is a schematic structural diagram of a webpage data collection apparatus according to an embodiment of the present application.
  • The apparatus may include:
  • The request receiving unit 210, which is configured to receive a request for batch data collection, where the request carries the target URL information.
  • The policy determining unit 220, which may be configured to determine the collection policy, corresponding to the target URL information, with which the target data can be successfully collected, where the collection policy is obtained by performing on the target URL information a target data collection test that includes at least a synchronous loading test, and the collection policy includes a synchronous loading mode or an asynchronous loading mode.
  • The collecting unit 230, which may be configured to collect the target data in the webpage pointed to by the target URL information using the synchronous or asynchronous loading mode set in the collection policy corresponding to the target URL information.
  • The policy determining unit 220 may be configured to extract the historical collection policy corresponding to the target URL information, where the historical collection policy is obtained by performing on the target URL information a target data collection test that includes at least a synchronous loading test and includes a synchronous loading mode or an asynchronous loading mode, and to determine the historical collection policy to be the collection policy, corresponding to the target URL information, with which the target data can be successfully collected.
  • A small-scale test can also be performed.
  • In that case, the policy determining unit 220 includes: an extracting subunit 221, which can be used to extract the historical collection policy corresponding to the target URL information, where the historical collection policy is obtained in advance by performing on the target URL information a target data collection test that includes at least a synchronous loading test, and the historical collection policy includes a synchronous loading mode or an asynchronous loading mode.
  • The small-scale test determining subunit 222, which can be configured to determine, according to a preset small-scale test rule, the HTML tag used to identify the small-scale test data and the URL information to be tested within the target URL information.
  • The policy test subunit 223, which may be configured to attempt, according to the historical collection policy corresponding to the target URL information and the HTML tag used to identify the small-scale test data, to collect the small-scale test data in the webpages pointed to by the URL information to be tested.
  • The policy determining subunit 224, which may be configured to determine, if the collection succeeds, the historical collection policy to be the collection policy, corresponding to the target URL information, with which the target data can be successfully collected.
  • The test subunit 225, which can be configured to perform, if the collection is unsuccessful, the target data collection test including at least the synchronous loading test on the target URL information, and obtain a corresponding collection policy with which the target data can be successfully collected.
  • The update subunit 226, which is configured to update the historical collection policy corresponding to the target URL information according to the collection policy obtained by the test subunit.
  • The specific way in which the test subunit 225 obtains the corresponding collection policy with which the target data can be successfully collected is not limited.
  • For example, the test subunit 225 may include: a synchronous loading subunit 2251, which may be used to load the webpage pointed to by the target URL information in synchronous loading mode.
  • The target data reading subunit 2252, which can be used to attempt to read the target data from the synchronously loaded webpage.
  • The synchronous policy setting subunit 2253, which can be configured to set, for URL information whose target data can be read from the synchronously loaded webpage, the loading mode in the collection policy corresponding to that type of URL information to the synchronous loading mode.
  • The asynchronous policy setting subunit 2254, which can be used to set, for URL information whose target data cannot be read from the synchronously loaded webpage, the loading mode in the collection policy corresponding to that type of URL information to the asynchronous loading mode.
  • The synchronous loading subunit 2251 may be configured to perform multiple times the step of loading the webpage pointed to by the target URL information in synchronous loading mode.
  • The test subunit may further include: a synchronous recording subunit 2255, which may be configured to record, each time the synchronous loading subunit performs the loading, the time taken to establish a connection with the URL and the time taken to obtain the webpage after the connection.
  • The synchronous timeout setting subunit 2256, which may be configured to set, when the loading mode in the collection policy corresponding to that type of URL information is set to the synchronous loading mode, the connection timeout and the page acquisition timeout corresponding to the synchronous loading mode in the collection policy, according to the connection times and page acquisition times recorded while the synchronous loading subunit performed the multiple loads.
  • The asynchronous recording subunit 2257, which can be configured to load multiple times in asynchronous loading mode the webpage pointed to by URL information whose target data cannot be read from the synchronously loaded webpage, and to record, at each execution, the time taken to establish a connection with the URL and the time taken to obtain the webpage after the connection.
  • The asynchronous timeout setting subunit 2258, which can be used to set, when the loading mode in the collection policy corresponding to that type of URL information is set to the asynchronous loading mode, the connection timeout and the page acquisition timeout in the collection policy according to the connection times and page acquisition times recorded during the multiple asynchronous loads.
  • The synchronous loading subunit 2251, target data reading subunit 2252, synchronous policy setting subunit 2253, asynchronous policy setting subunit 2254, synchronous recording subunit 2255, synchronous timeout setting subunit 2256, asynchronous recording subunit 2257, and asynchronous timeout setting subunit 2258 are each drawn with dashed lines in FIG. 2 to indicate that they are not necessary units of the webpage data collection apparatus provided herein.
  • The present application also provides a webpage data collection system for implementing the method.
  • FIG. 3 is a schematic structural diagram of a webpage data collection system according to an embodiment of the present application.
  • The system may include:
  • The client 310, which can be configured to issue a request for batch data collection, where the request carries the target URL information.
  • The collection policy configuration server 320, which may be configured to: receive the request for batch data collection, where the request carries the target URL information; determine the collection policy, corresponding to the target URL information, with which the target data can be successfully collected, where the collection policy is obtained by performing on the target URL information the target data collection test including at least the synchronous loading test, and the collection policy includes a synchronous loading mode or an asynchronous loading mode; generate, according to the target URL information, a collection task that collects the target data in the webpage pointed to by the target URL information using the synchronous or asynchronous loading mode set in the collection policy; and distribute the collection task to the collection servers in the collection server cluster 330.
  • The collection server cluster 330, which can be configured to receive the collection task distributed by the collection policy configuration server, execute the collection task, and feed back the collected target data.
  • The collection policy configuration server 320 can generate batch collection tasks and distribute them to idle collection servers in the collection server cluster 330 according to a preset distribution policy.
  • In this way the collection tasks can be executed concurrently, which further improves the collection efficiency of the webpage data.
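As a single-machine analogy for handing tasks to whichever collection server is idle, the sketch below uses a thread pool from Python's standard `concurrent.futures` module. It is not the distributed cluster the application describes; `run_tasks` and `execute` are hypothetical names, and the actual distribution policy is not specified in the source.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tasks(tasks: list, execute, workers: int = 4) -> list:
    """Run collection tasks concurrently; each idle worker picks up the next task.

    Results are returned in the same order as the input tasks.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(execute, tasks))
```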
  • The user may set the batch collection configuration information on the client 310 and send, through the client 310, a request carrying the batch collection configuration information.
  • The batch collection configuration information may include parameters such as the target URL information.
  • The collection policy configuration server 320 needs to download the keyword file submitted by the user, to a specified address, onto the collection server in the collection server cluster 330 that performs the data collection test, and to set and save the mapping relationship between the keyword file and its storage address, for example "taskKeywordsFile": "/home/admin/1/test.txt".
  • The mapping relationship is encapsulated into the test task and sent to the collection server along with it. Therefore, when the collection server performs the data collection test, the keywords in the keyword file can be read according to the mapping relationship and expanded into the corresponding target URL information for searching for the related page data.
  • the collection policy configuration server 320 may include: a policy generation server 321, a test server 322, and a database server 323.
  • the policy generation server 321 can be configured to generate pre-test tasks for different types of URLs in advance, submit the pre-test tasks to the test server 322, and obtain from the database server 323 the loading mode, connection time, page-fetch time, and so on recorded during the test.
  • the collection policy corresponding to different types of URLs is generated according to the obtained loading mode, connection time, and page time.
  • the collection policies corresponding to the different types of URLs are sent to the database server 323 to be stored as historical collection policies. The policy generation server 321 also receives the request sent by the client 310 and obtains the historical collection policy corresponding to the target URL information from the database server.
  • a small-scale test task is generated that performs a small-scale test on the target URL information by adopting a historical collection strategy.
  • the small scale test task is submitted to the test server 322. If the test collection is successful, an acquisition task for collecting target data in the webpage pointed by the target web address information according to the historical collection policy may be generated. If the acquisition is unsuccessful, a retry task of performing target data collection on the target URL information is generated. The retry test task is submitted to the test server 322. The loading mode, the connection time, the acquisition page time, and the like recorded during the retest are obtained from the database server 323. An updated collection policy corresponding to the target URL information is generated according to the acquired loading mode, connection time, and page time.
  • the generated collection task is distributed to the collection server in the collection server cluster 330 for execution.
  • the test server 322 can be used to obtain a pre-test task, a small-scale test task, and/or a retry task from the policy generation server 321.
  • the obtained pre-test tasks, small-scale test tasks, and/or retry tasks are distributed to the collection servers in the collection server cluster 330 for execution. Collect the loading mode, connection time, page time, etc. during the execution of the test task.
  • the collected loading mode, connection time, acquisition page time, and the like are saved to the database for use by the policy generation server 321.
  • the synchronous loading mode may be performed by using httpclient+htmlparser for loading and page parsing, and the asynchronous loading mode may be webkit for loading and page parsing.
  • the database server 323 can be configured to save the loading mode, the connection time, the page time, and the like collected by the test server 322, and save the collection policy generated by the policy generation server 321 .
  • the collection policy configuration server 320 and the collection server cluster 330 may be arranged in different network systems.
  • the database server 323 can be built on a MySQL database cluster. Additionally, in view of the magnitude of the data, the database server 323 can be deployed in a distributed manner to provide good read performance.
  • the policy generation server 321, the test server 322, and the database server 323 described in the embodiments of the present application are drawn in dashed lines in FIG. 2 to indicate that they are not required servers of the collection policy configuration server.
  • the present invention can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disc, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods described in the embodiments of the present invention or in portions of the embodiments.
  • the invention is applicable to a wide variety of general purpose or special purpose computing system environments or configurations.
  • the invention may be described in the general context of computer-executable instructions executed by a computer, such as a program module.
  • program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types.
  • the invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are connected through a communication network.
  • program modules can be located in both local and remote computer storage media including storage devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

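A webpage data collection method, apparatus, and system. The method may include: receiving a request for batch data collection, where the request carries target URL information (S110); determining a collection policy, corresponding to the target URL information, under which the target data can be collected successfully, where the collection policy corresponding to the target URL information is obtained by running on that target URL information a target data collection test that includes at least a synchronous loading test, and the collection policy includes a synchronous loading mode or an asynchronous loading mode (S120); and, according to the synchronous or asynchronous loading mode set in the collection policy corresponding to the target URL information, using that loading mode to collect the target data from the webpage to which the target URL information points (S130).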

Description

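Webpage data collection method, apparatus, and system
Technical Field
The present application relates to the field of the Internet, and in particular to a webpage data collection method, apparatus, and system.
Background
When building out SEO (Search Engine Optimization) for a website, an accurate picture of the site's current overall optimization status is needed, which creates a need to collect data from third-party sites or platforms. The collected information is analyzed in order to set the next stage of the site's optimization strategy.
At present, data from a third-party site or platform is collected mainly by loading its webpage data over the Internet. Webpage data is loaded in one of two ways: synchronously or asynchronously. With synchronous loading, the request directly returns the HTML page. With asynchronous loading, after the page is returned, JS (JavaScript, an interpreted scripting language) is loaded and changes the page's original structure so that the data is loaded. Once the returned HTML page is obtained, it can be parsed and the useful data extracted; for example, the headline of a news item on the Sina news channel can be extracted.
Because setting a site optimization strategy requires a large amount of data, webpage data from third-party sites or platforms has to be collected in batches. However, different webpages may load their data in different ways, so to guarantee accurate collection results, asynchronous loading has to be used across the board. Executing JS takes additional time, so data that could have been loaded synchronously costs a large amount of extra hardware resources and time, and data collection efficiency is low.
Summary of the Invention
In view of this, the purpose of the present application is to provide a webpage data collection method, apparatus, and system that improve data collection efficiency.
In a first aspect of the embodiments of the present application, a webpage data collection method is provided. For example, the method may include: receiving a request for batch data collection, where the request carries target URL information; determining a collection policy, corresponding to the target URL information, under which the target data can be collected successfully, where the collection policy corresponding to the target URL information is obtained by running on that target URL information a target data collection test that includes at least a synchronous loading test, and the collection policy includes a synchronous loading mode or an asynchronous loading mode; and, according to the synchronous or asynchronous loading mode set in the collection policy corresponding to the target URL information, using that loading mode to collect the target data from the webpage to which the target URL information points.
In a second aspect of the embodiments of the present application, a webpage data collection apparatus is provided. For example, the apparatus may include: a request receiving unit, which may be configured to receive a request for batch data collection, where the request carries target URL information; a policy determination unit, which may be configured to determine a collection policy, corresponding to the target URL information, under which the target data can be collected successfully, where the collection policy corresponding to the target URL information is obtained by running on that target URL information a target data collection test that includes at least a synchronous loading test, and the collection policy includes a synchronous loading mode or an asynchronous loading mode; and a collection unit, which may be configured to collect, according to the synchronous or asynchronous loading mode set in the collection policy corresponding to the target URL information, the target data from the webpage to which the target URL information points using that loading mode.
In a third aspect of the embodiments of the present application, a webpage data collection system is provided. For example, the system may include: a client, which may be configured to send a request for batch data collection, where the request carries target URL information; a collection policy configuration server, which may be configured to receive the request for batch data collection sent by the client, determine the collection policy, corresponding to the target URL information carried in the request, under which the target data can be collected successfully, where the collection policy corresponding to the target URL information is obtained by running on that target URL information a target data collection test that includes at least a synchronous loading test, and the collection policy includes a synchronous loading mode or an asynchronous loading mode, generate a collection task for collecting the target data from the webpage to which the target URL information points using the synchronous or asynchronous loading mode set in that collection policy, and distribute the collection task to a collection server in a collection server cluster; and the collection server cluster, which may be configured to receive the collection task distributed by the collection policy configuration server, execute the collection task, and return the collected target data.
The present application therefore has the following beneficial effects:
After receiving a request for batch data collection, the embodiments of the present application determine, from the target URL information carried in the request, the corresponding collection policy under which the target data can be collected successfully, and that policy is obtained by running on the target URL information a target data collection test that includes at least a synchronous loading test. If the target data can be collected from the webpage corresponding to the target URL information by synchronous loading, the loading mode in the resulting collection policy can be the synchronous loading mode, and data is then collected with the synchronous loading mode set in the policy. Data that can be loaded synchronously is thus not loaded asynchronously, which avoids the extra consumption of resources and time. The embodiments of the present application can therefore effectively improve data collection efficiency while still ensuring that the target data is collected successfully.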
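Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some of the embodiments recorded in the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a webpage data collection method disclosed in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a webpage data collection apparatus disclosed in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a webpage data collection system disclosed in an embodiment of the present application.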
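Detailed Description
To help those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the present application without creative effort shall fall within the scope of protection of the present invention.
Generally, because executing JS takes additional time, collection from the same page structure runs faster if JS is not executed. On this basis, if the way page data is loaded can be analyzed and tested (with a test that includes at least a synchronous loading test) before webpage data is collected in batches, URL information whose target data can be loaded synchronously can be distinguished from URL information whose target data must be loaded asynchronously, and a corresponding collection policy under which the target data can be collected successfully can be set. During batch collection, data can then be collected with the synchronous or asynchronous loading mode set in the collection policy corresponding to the target URL information, so that data that could be loaded synchronously is not loaded asynchronously. This avoids the extra consumption of resources and time and effectively improves data collection efficiency.
For example, FIG. 1 is a schematic flowchart of a webpage data collection method provided by an embodiment of the present application. As shown in FIG. 1, the method may include:
S110: Receive a request for batch data collection, where the request carries target URL information.
For example, the received request for batch data collection may carry batch collection configuration information entered by the user on a front-end page. Suppose the search result data of the 1688 site's search page is to be collected in batches for different keywords. The batch collection configuration information may then include the target URL information "http://s.1688.com/selloffer/offer_search.htm?keywords=${keyword}&button_click=top&n=y", where ${keyword} can be replaced with different keywords. The HTML tag of the target data can be configured as id:breadCrumbText|class[0]:sm-navigatebar-count|text, which means extracting the plain text under the first sm-navigatebar-count class beneath the HTML tag breadCrumbText. The batch collection configuration information may also be written as an XPath description; the present application does not limit this. It can be understood that other parameters may optionally be configured in the batch collection configuration information according to the user's own needs; the present application does not limit this either.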
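The configuration in this example can be illustrated with a minimal Python sketch. The URL template and the tag string are taken from the text; the function names `expand_urls` and `parse_selector`, the dictionary layout of a parsed step, and the use of UTF-8 percent-encoding for keywords are assumptions made for illustration and are not specified by the application.

```python
from urllib.parse import quote

# Both strings are quoted from the example above.
URL_TEMPLATE = ("http://s.1688.com/selloffer/offer_search.htm"
                "?keywords=${keyword}&button_click=top&n=y")
SELECTOR = "id:breadCrumbText|class[0]:sm-navigatebar-count|text"


def expand_urls(template, keywords):
    """Substitute each keyword for the ${keyword} placeholder.

    Assumption: keywords are percent-encoded as UTF-8; the encoding the
    real site expects is not stated in the source.
    """
    return [template.replace("${keyword}", quote(k)) for k in keywords]


def parse_selector(selector):
    """Split 'id:x|class[0]:y|text' into a list of steps (hypothetical layout)."""
    steps = []
    for part in selector.split("|"):
        if ":" not in part:
            # A bare step such as 'text' has no value and no index.
            steps.append({"kind": part, "index": None, "value": None})
            continue
        head, value = part.split(":", 1)
        index = None
        if head.endswith("]") and "[" in head:
            # 'class[0]' -> kind 'class', index 0
            head, idx = head[:-1].split("[", 1)
            index = int(idx)
        steps.append({"kind": head, "index": index, "value": value})
    return steps
```

Calling `expand_urls(URL_TEMPLATE, ["led"])` yields one concrete search URL per keyword, and `parse_selector(SELECTOR)` yields three steps: the id lookup, the first matching class, and the text extraction.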
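In addition, if, depending on actual needs, relevant parameters have to be read from other files besides the batch collection configuration information submitted by the user, a mapping between the file that stores those parameters and that file's storage address also has to be set, so that the parameters in the file can be read according to the mapping when the data collection test is run. For example, in the scenario of batch-collecting the search result data of the 1688 site's search page for different keywords, the keyword file submitted by the user can be downloaded, according to a specified address, to the machine that runs the data collection test, and the mapping between the keyword file and its storage address is set and saved, for example "taskKeywordsFile":"/home/admin/1/test.txt", so that the keywords in the keyword file can be read according to the mapping when the data collection test is run.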
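Reading keywords through such a mapping might look like the sketch below. Only the key name "taskKeywordsFile" comes from the text; the function name `load_keywords`, the assumption that the mapping arrives as a JSON object, and the one-keyword-per-line UTF-8 file layout are illustrative guesses.

```python
import json
from pathlib import Path


def load_keywords(task_config_json):
    """Return the keywords stored in the file named by 'taskKeywordsFile'.

    Assumptions (not stated in the source): the mapping is a JSON object,
    and the keyword file is UTF-8 text with one keyword per line.
    """
    config = json.loads(task_config_json)
    path = Path(config["taskKeywordsFile"])
    lines = path.read_text(encoding="utf-8").splitlines()
    # Drop blank lines and surrounding whitespace.
    return [line.strip() for line in lines if line.strip()]
```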
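S120: Determine the collection policy, corresponding to the target URL information, under which the target data can be collected successfully, where the collection policy corresponding to the target URL information is obtained by running on that target URL information a target data collection test that includes at least a synchronous loading test, and the collection policy includes a synchronous loading mode or an asynchronous loading mode.
It should be noted that the collection policy, corresponding to the target URL information, under which the target data can be collected successfully may be obtained in advance, before the request for batch data collection is received, by running a target data collection test that includes at least a synchronous loading test on various different URL information. It may also be obtained in real time, when a request for batch data collection for the target URL information is received, by running such a test on that target URL information. Alternatively, it may be obtained by running such a test again after the collection policy obtained from the earlier test is found to be invalid.
For example, in an implementation in which the target data collection test that includes at least a synchronous loading test is run on various URL information in advance, test configuration information entered by the user on a front-end page may be received beforehand; it mainly includes the different types of URLs to be tested, the HTML tags that identify the target data, and so on. Once the URL information to be tested and the corresponding HTML tags identifying the target data are determined, a target data collection test that prioritizes the synchronous loading mode can be run to obtain the collection policy for each type of URL.
In some possible implementations, the collection policies obtained by testing in advance can be stored in a database as historical collection policies, so that when a request for batch data collection is received, the corresponding historical collection policy can be retrieved from the database and used for data collection.
Of course, before the historical collection policy corresponding to the target URL information is retrieved, it may first be determined whether a historical collection policy exists for the target URL information carried in the request. If none exists, a target data collection test that prioritizes the synchronous loading mode can be run on the target URL information to obtain the corresponding collection policy under which the target data can be collected successfully, where the collection policy includes a synchronous loading mode or an asynchronous loading mode, and that collection policy is saved as the historical collection policy corresponding to the target URL information.
In some possible implementations, after the historical collection policy corresponding to the target URL information carried in the request is retrieved, the historical collection policy may be taken directly as the collection policy, corresponding to the target URL information, under which the target data can be collected successfully.
In other possible implementations, it is taken into account that the way a third-party site or platform loads its page data may change: a URL from which the target data could previously be collected by synchronous loading may become a URL that can only be loaded asynchronously. Therefore, after the historical collection policy corresponding to the target URL information is retrieved, a small-scale test may also be run to verify whether the existing historical collection policy can still be used.
For example, the small-scale test may include: determining, according to a preset small-scale test rule, the HTML tag that identifies the small-scale test data and the URL information to be tested among the target URL information; attempting, according to the collection policy corresponding to the target URL information and the HTML tag identifying the small-scale test data, to collect the small-scale test data from the webpages to which the URL information to be tested points; and, if collection succeeds, determining that the historical collection policy is the collection policy, corresponding to the target URL information, under which the target data can be collected successfully, and proceeding with the formal batch collection. It further includes: if collection does not succeed, running on the target URL information a target data collection test that includes at least a synchronous loading test to obtain the corresponding collection policy under which the target data can be collected successfully, and updating the historical collection policy corresponding to the target URL information according to the obtained collection policy.
It should be noted that the embodiments of the present application do not limit how the preset small-scale test rule is implemented. For example, a small amount of URL information to be tested may be selected from the target URL information according to a fixed preset number or a given reduction ratio. Take the scenario above of batch-collecting the search result data of the 1688 site's search page for different keywords. For the small-scale test, the first 10 of the many keywords submitted by the user can be taken (if the user submitted fewer than 10 keywords, as many as were submitted), and each is substituted into the search keyword parameter of the URL information configured by the user, which determines the 10 URLs to be tested. The test is then run on those 10 URLs, using information such as the HTML tag identifying the target data and the historical collection policy retrieved from the database. For example, the historical collection policy may include parameters such as the loading mode (synchronous or asynchronous), the connection timeout, and the page-fetch timeout. In this scenario, the retrieved historical collection policy may have the format: "[{"url":"http://s.1688.com/selloffer/offer_search.htm?keywords=${keyword}&button_click=top&n=y","keywordsPath":"/usr/group/seo/test.txt","conto":"5000","readto":"6000","crawlType":"sync"}]". If the small-scale test shows that collection does not succeed, the target data collection test that includes at least a synchronous loading test can be run again on the target URL information configured by the user, and the historical collection policy corresponding to the target URL information is updated according to the newly obtained collection policy. The formal batch collection of target data then proceeds on the basis of the updated collection policy.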
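The small-scale check described above can be sketched in Python as follows. The policy string is the one quoted in the text. Everything else is an assumption for illustration: the helper names, reading "conto" and "readto" as connection and page-fetch timeouts in milliseconds, and treating the check as passed only when every sampled URL yields data (the application does not define the pass criterion).

```python
import json

# Historical policy in the format quoted in the text above.
HISTORY_POLICY = (
    '[{"url":"http://s.1688.com/selloffer/offer_search.htm'
    '?keywords=${keyword}&button_click=top&n=y",'
    '"keywordsPath":"/usr/group/seo/test.txt",'
    '"conto":"5000","readto":"6000","crawlType":"sync"}]'
)


def parse_policy(raw):
    """Decode the first policy entry; the unit (ms) is an assumption."""
    entry = json.loads(raw)[0]
    return {
        "url": entry["url"],
        "connect_timeout_ms": int(entry["conto"]),
        "read_timeout_ms": int(entry["readto"]),
        "mode": entry["crawlType"],
    }


def sample_urls(url_template, keywords, limit=10):
    """Take the first `limit` keywords (or fewer) and build the test URLs."""
    return [url_template.replace("${keyword}", k) for k in keywords[:limit]]


def policy_still_valid(policy, urls, collect):
    """`collect(url, policy)` is a caller-supplied function returning the
    target data or None. Assumed criterion: all sampled URLs must succeed."""
    return bool(urls) and all(collect(u, policy) is not None for u in urls)
```

If `policy_still_valid` returns False, the flow in the text falls back to a fresh collection test and overwrites the stored policy.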
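It should be noted that the embodiments of the present application do not limit how the target data collection test that includes at least a synchronous loading test is carried out on the target URL information.
For example, in some possible implementations, running on the target URL information a target data collection test that includes at least a synchronous loading test may include: loading the webpage to which the target URL information points in the synchronous loading mode; attempting to read the target data from the synchronously loaded webpage; for URL information whose target data can be read from the synchronously loaded webpage, setting the loading mode in the collection policy corresponding to that type of URL information to the synchronous loading mode; and for URL information whose target data cannot be read from the synchronously loaded webpage, setting the loading mode in the collection policy corresponding to that type of URL information to the asynchronous loading mode.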
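The decision rule in this implementation reduces to a few lines. In the sketch below, `load_sync` and `extract` are hypothetical caller-supplied functions (a synchronous page fetch and a target-data reader, each returning None on failure); the value "sync" matches the "crawlType" shown elsewhere in the text, while "async" is an assumed counterpart.

```python
def choose_loading_mode(url, load_sync, extract):
    """Return 'sync' if the target data is readable from the synchronously
    loaded page, otherwise 'async'.

    load_sync(url) -> page or None   (hypothetical synchronous fetch)
    extract(page)  -> data or None   (hypothetical target-data reader)
    """
    page = load_sync(url)
    if page is not None and extract(page) is not None:
        return "sync"
    # Target data missing from the raw HTML: fall back to asynchronous loading.
    return "async"
```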
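As another example, in other possible implementations, the webpage to which the target URL information points may first be loaded in the asynchronous loading mode and an attempt made to read the target data from the asynchronously loaded webpage; the webpage is then loaded in the synchronous loading mode and an attempt made to read the target data from the synchronously loaded webpage. If the target data can be read from the synchronously loaded webpage, the loading mode in the collection policy corresponding to that type of URL information can be set to the synchronous loading mode. If the target data cannot be read from the synchronously loaded webpage but can be read from the asynchronously loaded webpage, the loading mode in the collection policy corresponding to that type of URL information can be set to the asynchronous loading mode.
In some possible implementations, it is taken into account that whether a webpage loads successfully also depends on network stability, so the connection may need to be retried when it times out and the page read may need to be retried when it times out. Therefore, during the target data collection test that includes at least a synchronous loading test, the step of loading the webpage to which the target URL information points in the synchronous loading mode may be executed multiple times, and the test may further include: on each execution, recording the time taken to establish a connection with the URL and the time taken to fetch the webpage after connecting; and, when the loading mode in the collection policy corresponding to that type of URL information is set to the synchronous loading mode, setting the connection timeout and the page-fetch timeout for the synchronous loading mode in that collection policy according to the connection times and page-fetch times recorded over the multiple executions. Likewise, for URL information whose target data cannot be read from the synchronously loaded webpage, the webpage it points to may be loaded in the asynchronous loading mode multiple times, with the connection time and the page-fetch time recorded on each execution, so that when the loading mode in the collection policy corresponding to that type of URL information is set to the asynchronous loading mode, the connection timeout and the page-fetch timeout in that collection policy can be set according to the connection times and page-fetch times recorded over the multiple asynchronous loads.
How the connection timeout and the page-fetch timeout in the collection policy are set from the connection times and page-fetch times recorded over multiple executions is not limited. For example, the average of the connection times recorded over the multiple executions may be taken as the connection timeout to be set, and the average of the page-fetch times recorded over the multiple executions may be taken as the page-fetch timeout to be set. Other ways of calculating the connection timeout and the page-fetch timeout are of course possible; the present application does not limit this.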
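The averaging option mentioned above is straightforward to sketch. The key names "conto" and "readto" are borrowed from the policy format quoted earlier; the function name, the `(connect_ms, fetch_ms)` tuple layout, and rounding to whole milliseconds are assumptions for illustration.

```python
from statistics import mean


def timeouts_from_samples(samples):
    """Average recorded timings into policy timeouts.

    `samples` is a list of (connect_ms, fetch_ms) pairs, one per test run
    (hypothetical layout). Rounding to whole milliseconds is an assumption.
    """
    connect = mean(s[0] for s in samples)
    fetch = mean(s[1] for s in samples)
    return {"conto": round(connect), "readto": round(fetch)}
```

For three runs recorded as (100, 400), (200, 600), and (300, 800), this gives a connection timeout of 200 and a page-fetch timeout of 600. A plain average leaves no headroom above typical timings, so a deployment might prefer a multiple of the mean or a high percentile; the text explicitly leaves that choice open.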
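In the implementations above, because the connection timeout and the page-fetch timeout are set in the collection policy, during subsequent batch collection a connection request can be reissued when a connection times out according to the connection timeout set in the policy, and a page read request can be reissued when a page read times out according to the page-fetch timeout set in the policy. In addition, upper limits on the number of connection retries and on the number of page read retries can be set in the collection policy, so that when the number of retries exceeds the limit, collection of the page data for that URL information is abandoned.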
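A capped retry loop of the kind described here might look like the following sketch. It collapses the two separate limits in the text (connection retries and page read retries) into a single counter for brevity; `fetch` is a hypothetical caller-supplied function that raises `TimeoutError` when either timeout is exceeded.

```python
def fetch_with_retries(fetch, url, max_retries):
    """Call fetch(url), retrying on timeout up to `max_retries` times.

    Returns the page, or None once the retry cap is exhausted, which
    corresponds to abandoning collection for this URL.
    Simplification: one shared cap instead of separate connection and
    page-read caps.
    """
    attempts = 0
    while True:
        try:
            return fetch(url)
        except TimeoutError:
            if attempts >= max_retries:
                return None
            attempts += 1
```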
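S130: According to the synchronous or asynchronous loading mode set in the collection policy corresponding to the target URL information, use that loading mode to collect the target data from the webpage to which the target URL information points.
It should be noted that the request may carry one or more pieces of target URL information. The embodiments of the present invention may run the target data collection test that includes at least a synchronous loading test separately on each type of URL information, distinguish URL information whose target data can be loaded synchronously from URL information whose target data must be loaded asynchronously, and set the corresponding collection policy under which the target data can be collected successfully. For multiple pieces of target URL information of different types, the target data in the webpages can be collected with the collection policy corresponding to each. The target data collection test that includes at least a synchronous loading test on the various types of URL information can be carried out as described above for the target URL information and is not repeated here.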
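Step S130 amounts to dispatching each URL to the loader named in its policy. In this sketch the "crawlType" key is taken from the policy format quoted earlier; the `loaders` dictionary, its "sync"/"async" keys, and the `extract` callback are hypothetical stand-ins for the real loading and parsing components.

```python
def collect_batch(urls_with_policy, loaders, extract):
    """Collect target data for each (url, policy) pair.

    loaders: {'sync': fn, 'async': fn}, each fn(url) -> page or None
             (hypothetical; the real loaders are not specified here)
    extract: fn(page) -> target data
    """
    results = {}
    for url, policy in urls_with_policy:
        load = loaders[policy["crawlType"]]   # pick the mode set in the policy
        page = load(url)
        results[url] = extract(page) if page is not None else None
    return results
```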
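It can be seen that, with the method provided by the embodiments of the present application, because the collection policy corresponding to the target URL information is obtained by running on that target URL information a target data collection test that includes at least a synchronous loading test, data can be collected during batch collection with the synchronous or asynchronous loading mode set in the corresponding collection policy. Data that can be loaded synchronously is not loaded asynchronously, which avoids the extra consumption of resources and time and effectively improves data collection efficiency. In addition, the present application records and analyzes page connection times and page read times and sets the corresponding connection timeout and page-fetch timeout in the collection policy, so that during formal batch collection the synchronous or asynchronous loading mode can be invoked appropriately according to the policy. This maximizes collection efficiency while ensuring that data is collected accurately, and avoids extra consumption of hardware resources and time.
Corresponding to the webpage data collection method above, the present application also provides a webpage data collection apparatus.
For example, FIG. 2 is a schematic structural diagram of a webpage data collection apparatus provided by an embodiment of the present application. As shown in FIG. 2, the apparatus may include:
A request receiving unit 210, which may be configured to receive a request for batch data collection, where the request carries target URL information. A policy determination unit 220, which may be configured to determine a collection policy, corresponding to the target URL information, under which the target data can be collected successfully, where the collection policy corresponding to the target URL information is obtained by running on that target URL information a target data collection test that includes at least a synchronous loading test, and the collection policy includes a synchronous loading mode or an asynchronous loading mode. A collection unit 230, which may be configured to collect, according to the synchronous or asynchronous loading mode set in the collection policy corresponding to the target URL information, the target data from the webpage to which the target URL information points using that loading mode.
In some possible implementations, after the historical collection policy corresponding to the target URL information carried in the request is retrieved, the historical collection policy may be taken directly as the collection policy, corresponding to the target URL information, under which the target data can be collected successfully. Accordingly, the policy determination unit 220 may be configured to retrieve the historical collection policy corresponding to the target URL information, where the historical collection policy is obtained in advance by running on that target URL information a target data collection test that includes at least a synchronous loading test and includes a synchronous loading mode or an asynchronous loading mode, and to determine that the historical collection policy is the collection policy, corresponding to the target URL information, under which the target data can be collected successfully.
In other possible implementations, a small-scale test may also be run after the historical collection policy corresponding to the target URL information is retrieved. For example, the policy determination unit 220 includes: a retrieval subunit 221, which may be configured to retrieve the historical collection policy corresponding to the target URL information, where the historical collection policy is obtained in advance by running on that target URL information a target data collection test that includes at least a synchronous loading test and includes a synchronous loading mode or an asynchronous loading mode. A small-scale test determination subunit 222, which may be configured to determine, according to a preset small-scale test rule, the HTML tag that identifies the small-scale test data and the URL information to be tested among the target URL information. A policy test subunit 223, which may be configured to attempt, according to the historical collection policy corresponding to the target URL information and the HTML tag identifying the small-scale test data, to collect the small-scale test data from the webpages to which the URL information to be tested points. A policy determination subunit 224, which may be configured to determine, if collection succeeds, that the historical collection policy is the collection policy, corresponding to the target URL information, under which the target data can be collected successfully. A test subunit 225, which may be configured to run again, if collection does not succeed, a target data collection test that includes at least a synchronous loading test on the target URL information to obtain the corresponding collection policy under which the target data can be collected successfully. An update subunit 226, which may be configured to update the historical collection policy corresponding to the target URL information according to the collection policy obtained by the test subunit.
It should be noted that the embodiments of the present application do not limit how the test subunit 225 obtains, through the target data collection test, the corresponding collection policy under which the target data can be collected successfully. For example, in some possible implementations, the test subunit 225 may include: a synchronous loading subunit 2251, which may be configured to load the webpage to which the target URL information points in the synchronous loading mode. A target data reading subunit 2252, which may be configured to attempt to read the target data from the synchronously loaded webpage. A synchronous policy setting subunit 2253, which may be configured to set, for URL information whose target data can be read from the synchronously loaded webpage, the loading mode in the collection policy corresponding to that type of URL information to the synchronous loading mode. An asynchronous policy setting subunit 2254, which may be configured to set, for URL information whose target data cannot be read from the synchronously loaded webpage, the loading mode in the collection policy corresponding to that type of URL information to the asynchronous loading mode.
In some possible implementations, it is taken into account that whether a webpage loads successfully also depends on network stability, so the connection may need to be retried when it times out and the page read may need to be retried when it times out. Therefore, the synchronous loading subunit 2251 may be configured to execute multiple times the step of loading the webpage to which the target URL information points in the synchronous loading mode. The test subunit may further include: a synchronous recording subunit 2255, which may be configured to record, each time the synchronous loading subunit performs a load, the time taken to establish a connection with the URL and the time taken to fetch the webpage after connecting. A synchronous timeout setting subunit 2256, which may be configured to set, when the loading mode in the collection policy corresponding to that type of URL information is set to the synchronous loading mode, the connection timeout and the page-fetch timeout for the synchronous loading mode in that collection policy according to the connection times and page-fetch times recorded over the multiple loads performed by the synchronous loading subunit. An asynchronous recording subunit 2257, which may be configured to load in the asynchronous loading mode, multiple times, the webpage pointed to by URL information whose target data cannot be read from the synchronously loaded webpage, and to record on each execution the time taken to establish a connection with the URL and the time taken to fetch the webpage after connecting. An asynchronous timeout setting subunit 2258, which may be configured to set, when the loading mode in the collection policy corresponding to that type of URL information is set to the asynchronous loading mode, the connection timeout and the page-fetch timeout in that collection policy according to the connection times and page-fetch times recorded over the multiple asynchronous loads.
It should be noted that the retrieval subunit 221, small-scale test determination subunit 222, policy test subunit 223, policy determination subunit 224, test subunit 225, update subunit 226, synchronous loading subunit 2251, target data reading subunit 2252, synchronous policy setting subunit 2253, asynchronous policy setting subunit 2254, synchronous recording subunit 2255, synchronous timeout setting subunit 2256, asynchronous recording subunit 2257, and asynchronous timeout setting subunit 2258 described in the embodiments of the present application are all drawn in dashed lines in FIG. 2 to indicate that they are not required units of the webpage data collection apparatus provided by the present application.
Corresponding to the webpage data collection method above, the present application also provides a webpage data collection system for implementing the method.
For example, FIG. 3 is a schematic structural diagram of a webpage data collection system provided by an embodiment of the present application. As shown in FIG. 3, the system may include:
A client 310, which may be configured to send a request for batch data collection, where the request carries target URL information.
A collection policy configuration server 320, which may be configured to: receive the request for batch data collection, where the request carries target URL information; determine the collection policy, corresponding to the target URL information, under which the target data can be collected successfully, where the collection policy corresponding to the target URL information is obtained by running on that target URL information a target data collection test that includes at least a synchronous loading test, and the collection policy includes a synchronous loading mode or an asynchronous loading mode; generate a collection task for collecting the target data from the webpage to which the target URL information points using the synchronous or asynchronous loading mode set in that collection policy; and distribute the collection task to a collection server in the collection server cluster 330.
A collection server cluster 330, which may be configured to receive the collection task distributed by the collection policy configuration server, execute the collection task, and return the collected target data.
It can be seen that, with the webpage data collection system provided by the embodiments of the present application, the collection policy configuration server 320 can generate collection tasks in batches and distribute them, according to a preset distribution policy, to idle collection servers in the collection server cluster 330, so that collection tasks can be executed concurrently, which further improves the efficiency of webpage data collection.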
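The effect of handing tasks to idle workers can be imitated locally with a thread pool. This is only an analogy under stated assumptions: a `ThreadPoolExecutor` stands in for the collection server cluster, `execute` is a hypothetical function that runs one collection task, and the "preset distribution policy" of the text is replaced by the pool's own scheduling.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tasks(tasks, execute, workers=4):
    """Run collection tasks concurrently and return results in task order.

    Assumption: a local thread pool stands in for the collection server
    cluster; each free thread plays the role of an idle collection server.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map hands the next task to whichever worker is free and
        # yields results in the same order as `tasks`.
        return list(pool.map(execute, tasks))
```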
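In some possible implementations, the user can set the batch collection configuration information on the client 310 and send, through the client 310, a request carrying that configuration information. The batch collection configuration information may include parameters such as the target URL information. In the scenario mentioned above of batch-collecting the search result data of the 1688 site for different search keywords, the collection policy configuration server 320, besides obtaining the batch collection configuration information, also needs to download the keyword file submitted by the user, according to a specified address, to the collection server in the collection server cluster 330 that runs the data collection test, and to set and save the mapping between the keyword file and its storage address, for example "taskKeywordsFile":"/home/admin/1/test.txt". The mapping is encapsulated in the test task and sent to the collection server together with it. When the collection server runs the data collection test, it can therefore read the keywords in the keyword file according to the mapping and expand them into the corresponding target URL information used to search for the related page data.
In other possible implementations, the collection policy configuration server 320 may include: a policy generation server 321, a test server 322, and a database server 323.
The policy generation server 321 may be configured to generate pre-test tasks for different types of URLs in advance, submit the pre-test tasks to the test server 322, and obtain from the database server 323 the loading mode, connection time, page-fetch time, and so on recorded during the test. It generates the collection policies for the different types of URLs according to the obtained loading mode, connection time, and page-fetch time, and sends them to the database server 323 to be stored as historical collection policies. It also receives the request sent by the client 310 and obtains the historical collection policy corresponding to the target URL information from the database server. It generates a small-scale test task that runs a small-scale test on the target URL information using the historical collection policy, and submits the small-scale test task to the test server 322. If the test collection succeeds, it can generate a collection task for collecting, according to the historical collection policy, the target data from the webpage to which the target URL information points. If the collection does not succeed, it generates a retry task for collecting target data for the target URL information and submits the retry test task to the test server 322. It obtains from the database server 323 the loading mode, connection time, page-fetch time, and so on recorded during the retest, and generates an updated collection policy corresponding to the target URL information according to them. It sends the updated collection policy to the database server 323 to update the historical collection policy stored in the database, and generates a collection task for collecting, according to the updated collection policy, the target data from the webpage to which the target URL information points. The generated collection task is distributed to a collection server in the collection server cluster 330 for execution.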
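The request-handling part of this workflow can be condensed into one function. In the sketch, a plain dictionary stands in for the database server, and `small_test` and `full_test` are hypothetical callbacks standing in for the tasks submitted to the test server; the returned dictionary is an assumed shape for a collection task.

```python
def plan_collection(url, history, small_test, full_test):
    """Decide which policy a collection task should carry.

    history:    dict url -> policy (stand-in for the database server)
    small_test: fn(url, policy) -> bool, small-scale check of a stored policy
    full_test:  fn(url) -> policy, fresh collection test (the retry task)
    All three are hypothetical stand-ins for the servers in the text.
    """
    policy = history.get(url)
    if policy is not None and small_test(url, policy):
        # Stored policy still works: collect with it as is.
        return {"url": url, "policy": policy}
    # No stored policy, or it failed the check: retest and update storage.
    policy = full_test(url)
    history[url] = policy
    return {"url": url, "policy": policy}
```

The fresh test runs only when needed, so URLs whose stored policy still passes the small-scale check go straight to collection.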
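The test server 322 may be configured to obtain pre-test tasks, small-scale test tasks, and/or retry tasks from the policy generation server 321, and to distribute the obtained pre-test tasks, small-scale test tasks, and/or retry tasks to collection servers in the collection server cluster 330 for execution. It collects the loading mode, connection time, page-fetch time, and so on during execution of the test tasks, and saves them to the database for use by the policy generation server 321. The test server 322 may support both the synchronous and the asynchronous loading mode: the synchronous loading mode may use httpclient+htmlparser for loading and page parsing, and the asynchronous loading mode may use webkit for loading and page parsing.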
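The text names httpclient+htmlparser for the synchronous path. As a rough Python analogue (a swapped-in technique using the standard library's `html.parser`, not the libraries named above), the page-parsing half could be sketched as below. The class and function names are hypothetical, and the sketch handles only the "text of the first element with a given class inside the element with a given id" case from the earlier example. No JavaScript is executed, which is what makes this the synchronous path.

```python
from html.parser import HTMLParser

# Void elements have no end tag, so they must not affect depth counting.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TextByIdAndClass(HTMLParser):
    """Collect the text of the first `class_name` element under `element_id`."""

    def __init__(self, element_id, class_name):
        super().__init__()
        self.element_id = element_id
        self.class_name = class_name
        self.id_depth = 0     # > 0 while inside the element with the id
        self.text_depth = 0   # > 0 while inside the matched class element
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID or self.done:
            return
        attrs = dict(attrs)
        if self.text_depth:
            self.text_depth += 1
        elif self.id_depth:
            self.id_depth += 1
            if self.class_name in (attrs.get("class") or "").split():
                self.text_depth = 1
        elif attrs.get("id") == self.element_id:
            self.id_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID or self.done:
            return
        if self.text_depth:
            self.text_depth -= 1
            if self.text_depth == 0:
                self.done = True   # first match fully read; ignore the rest
        elif self.id_depth:
            self.id_depth -= 1

    def handle_data(self, data):
        if self.text_depth and not self.done:
            self.chunks.append(data)


def extract_text(html, element_id, class_name):
    """Return the matched text, or None if the element is absent."""
    parser = TextByIdAndClass(element_id, class_name)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip() if parser.done else None
```

A return value of None on the raw HTML is the signal that, in the flow described in the text, would push a URL toward the asynchronous (webkit) path.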
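The database server 323 may be configured to store the loading mode, connection time, page-fetch time, and so on collected by the test server 322, and to store the collection policies generated by the policy generation server 321.
In the implementations above, the collection policy configuration server 320 and the collection server cluster 330 may be deployed in different network systems. The database server 323 may be built on a MySQL database cluster. In addition, given the volume of data, the database server 323 may be deployed in a distributed manner to provide good read performance.
It should be noted that the policy generation server 321, the test server 322, and the database server described in the embodiments of the present application are drawn in dashed lines in FIG. 2 to indicate that they are not required servers of the collection policy configuration server.
For convenience of description, the apparatus above is described as divided into various units by function. Of course, when the present invention is implemented, the functions of the units may be implemented in one or more pieces of software and/or hardware.
From the description of the implementations above, those skilled in the art can clearly understand that the present invention can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods described in the embodiments of the present invention or in certain parts of the embodiments.
The embodiments in this specification are described progressively. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on how it differs from the others. In particular, the system embodiment is described relatively briefly because it is largely similar to the method embodiment; for the relevant parts, refer to the description of the method embodiment.
The present invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and so on.
The present invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present invention may also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include" and "comprise" and any variants of them are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention falls within the scope of protection of the present invention.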

Claims (11)

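  1. A webpage data collection method, characterized by comprising:
    receiving a request for batch data collection, wherein the request carries target URL information;
    determining a collection policy, corresponding to the target URL information, under which target data can be collected successfully, wherein the collection policy corresponding to the target URL information is obtained by running on the target URL information a target data collection test that includes at least a synchronous loading test, and the collection policy includes a synchronous loading mode or an asynchronous loading mode;
    according to the synchronous loading mode or asynchronous loading mode set in the collection policy corresponding to the target URL information, using the corresponding loading mode to collect the target data from the webpage to which the target URL information points.
  2. The method according to claim 1, characterized in that determining the collection policy, corresponding to the target URL information, under which target data can be collected successfully comprises:
    retrieving a historical collection policy corresponding to the target URL information, wherein the historical collection policy is obtained in advance by running on the target URL information a target data collection test that includes at least a synchronous loading test, and the historical collection policy includes a synchronous loading mode or an asynchronous loading mode;
    determining that the historical collection policy is the collection policy, corresponding to the target URL information, under which target data can be collected successfully.
  3. The method according to claim 1, characterized in that determining the collection policy, corresponding to the target URL information, under which target data can be collected successfully comprises:
    retrieving a historical collection policy corresponding to the target URL information, wherein the historical collection policy is obtained in advance by running on the target URL information a target data collection test that includes at least a synchronous loading test, and the historical collection policy includes a synchronous loading mode or an asynchronous loading mode;
    determining, according to a preset small-scale test rule, an HTML tag that identifies small-scale test data and the URL information to be tested among the target URL information;
    attempting, according to the historical collection policy corresponding to the target URL information and the HTML tag that identifies the small-scale test data, to collect the small-scale test data from the webpage to which the URL information to be tested points;
    if collection succeeds, determining that the historical collection policy is the collection policy, corresponding to the target URL information, under which target data can be collected successfully;
    if collection does not succeed, running on the target URL information a target data collection test that includes at least a synchronous loading test to obtain the corresponding collection policy under which target data can be collected successfully, and updating the historical collection policy corresponding to the target URL information according to the obtained collection policy.
  4. The method according to any one of claims 1 to 3, characterized in that running on the target URL information a target data collection test that includes at least a synchronous loading test comprises:
    loading the webpage to which the target URL information points in the synchronous loading mode; attempting to read the target data from the synchronously loaded webpage; for URL information whose target data can be read from the synchronously loaded webpage, setting the loading mode in the collection policy corresponding to that type of URL information to the synchronous loading mode; and for URL information whose target data cannot be read from the synchronously loaded webpage, setting the loading mode in the collection policy corresponding to that type of URL information to the asynchronous loading mode.
  5. The method according to claim 4, characterized in that the step of loading the webpage to which the target URL information points in the synchronous loading mode is executed multiple times, and the method further comprises:
    on each execution, recording the time taken to establish a connection with the URL and the time taken to fetch the webpage after connecting, and, when the loading mode in the collection policy corresponding to that type of URL information is set to the synchronous loading mode, setting the connection timeout and the page-fetch timeout for the synchronous loading mode in the corresponding collection policy according to the connection times and page-fetch times recorded over the multiple executions;
    for URL information whose target data cannot be read from the synchronously loaded webpage, loading the webpage it points to in the asynchronous loading mode multiple times, recording on each execution the time taken to establish a connection with the URL and the time taken to fetch the webpage after connecting, and, when the loading mode in the collection policy corresponding to that type of URL information is set to the asynchronous loading mode, setting the connection timeout and the page-fetch timeout for the asynchronous loading mode in the corresponding collection policy according to the connection times and page-fetch times recorded over the multiple asynchronous loads.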
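  6. A webpage data collection apparatus, characterized by comprising:
    a request receiving unit, configured to receive a request for batch data collection, wherein the request carries target URL information;
    a policy determination unit, configured to determine a collection policy, corresponding to the target URL information, under which target data can be collected successfully, wherein the collection policy corresponding to the target URL information is obtained by running on the target URL information a target data collection test that includes at least a synchronous loading test, and the collection policy includes a synchronous loading mode or an asynchronous loading mode;
    a collection unit, configured to collect, according to the synchronous loading mode or asynchronous loading mode set in the collection policy corresponding to the target URL information, the target data from the webpage to which the target URL information points using the corresponding loading mode.
  7. The apparatus according to claim 6, characterized in that the policy determination unit is configured to retrieve a historical collection policy corresponding to the target URL information, wherein the historical collection policy is obtained in advance by running on the target URL information a target data collection test that includes at least a synchronous loading test, and the historical collection policy includes a synchronous loading mode or an asynchronous loading mode, and to determine that the historical collection policy is the collection policy, corresponding to the target URL information, under which target data can be collected successfully.
  8. The apparatus according to claim 6, characterized in that the policy determination unit comprises:
    a retrieval subunit, configured to retrieve a historical collection policy corresponding to the target URL information, wherein the historical collection policy is obtained in advance by running on the target URL information a target data collection test that includes at least a synchronous loading test, and the historical collection policy includes a synchronous loading mode or an asynchronous loading mode;
    a small-scale test determination subunit, configured to determine, according to a preset small-scale test rule, an HTML tag that identifies small-scale test data and the URL information to be tested among the target URL information;
    a policy test subunit, configured to attempt, according to the historical collection policy corresponding to the target URL information and the HTML tag that identifies the small-scale test data, to collect the small-scale test data from the webpage to which the URL information to be tested points;
    a policy determination subunit, configured to determine, if collection succeeds, that the historical collection policy is the collection policy, corresponding to the target URL information, under which target data can be collected successfully;
    a test subunit, configured to run, if collection does not succeed, a target data collection test that includes at least a synchronous loading test on the target URL information to obtain the corresponding collection policy under which target data can be collected successfully;
    an update subunit, configured to update the historical collection policy corresponding to the target URL information according to the collection policy obtained by the test subunit.
  9. The apparatus according to claim 8, characterized in that the test subunit comprises:
    a synchronous loading subunit, configured to load the webpage to which the target URL information points in the synchronous loading mode; a target data reading subunit, configured to attempt to read the target data from the synchronously loaded webpage; a synchronous policy setting subunit, configured to set, for URL information whose target data can be read from the synchronously loaded webpage, the loading mode in the collection policy corresponding to that type of URL information to the synchronous loading mode; and an asynchronous policy setting subunit, configured to set, for URL information whose target data cannot be read from the synchronously loaded webpage, the loading mode in the collection policy corresponding to that type of URL information to the asynchronous loading mode.
  10. The apparatus according to claim 9, characterized in that the synchronous loading subunit is configured to execute multiple times the step of loading the webpage to which the target URL information points in the synchronous loading mode;
    and the test subunit further comprises:
    a synchronous recording subunit, configured to record, each time the synchronous loading subunit performs a load, the time taken to establish a connection with the URL and the time taken to fetch the webpage after connecting;
    a synchronous timeout setting subunit, configured to set, when the loading mode in the collection policy corresponding to that type of URL information is set to the synchronous loading mode, the connection timeout and the page-fetch timeout for the synchronous loading mode in the corresponding collection policy according to the connection times and page-fetch times recorded over the multiple loads performed by the synchronous loading subunit;
    an asynchronous recording subunit, configured to load in the asynchronous loading mode, multiple times, the webpage pointed to by URL information whose target data cannot be read from the synchronously loaded webpage, and to record on each execution the time taken to establish a connection with the URL and the time taken to fetch the webpage after connecting;
    an asynchronous timeout setting subunit, configured to set, when the loading mode in the collection policy corresponding to that type of URL information is set to the asynchronous loading mode, the connection timeout and the page-fetch timeout in the corresponding collection policy according to the connection times and page-fetch times recorded over the multiple asynchronous loads.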
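  11. A webpage data collection system, characterized by comprising:
    a client, configured to send a request for batch data collection, wherein the request carries target URL information;
    a collection policy configuration server, configured to receive the request for batch data collection sent by the client, determine a collection policy, corresponding to the target URL information carried in the request, under which target data can be collected successfully, wherein the collection policy corresponding to the target URL information is obtained by running on the target URL information a target data collection test that includes at least a synchronous loading test, and the collection policy includes a synchronous loading mode or an asynchronous loading mode, generate a collection task for collecting the target data from the webpage to which the target URL information points using the synchronous loading mode or asynchronous loading mode set in the collection policy corresponding to the target URL information, and distribute the collection task to a collection server in a collection server cluster;
    the collection server cluster, configured to receive the collection task distributed by the collection policy configuration server, execute the collection task, and return the collected target data.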
PCT/CN2015/095584 2014-12-02 2015-11-26 Webpage data collection method, apparatus, and system WO2016086784A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410721389.9A CN105721519B (zh) 2014-12-02 2014-12-02 Webpage data collection method, apparatus, and system
CN201410721389.9 2014-12-02

Publications (1)

Publication Number Publication Date
WO2016086784A1 true WO2016086784A1 (zh) 2016-06-09

Family

ID=56090993

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/095584 WO2016086784A1 (zh) 2014-12-02 2015-11-26 Webpage data collection method, apparatus, and system

Country Status (2)

Country Link
CN (1) CN105721519B (zh)
WO (1) WO2016086784A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115630217A (zh) * 2022-12-21 2023-01-20 广州市千钧网络科技有限公司 Method, apparatus, device, and storage medium for loading information

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
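CN106502802A (zh) * 2016-10-12 2017-03-15 山东浪潮云服务信息科技有限公司 Distributed cloud concurrent collection method based on Avro RPC transport
CN110134841A (zh) * 2018-02-09 2019-08-16 鼎复数据科技(北京)有限公司 Method for customized real-time acquisition of website data
CN109658689B (zh) * 2018-12-04 2021-01-05 沈阳世纪高通科技有限公司 Traffic information processing method and apparatus
CN113114505B (zh) * 2021-04-13 2022-07-12 广州海鹚网络科技有限公司 httpClient-based access request processing method and system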

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
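CN101136026A (zh) * 2007-05-15 2008-03-05 北京聚生科技有限公司 Webpage content collection method based on xmlhttp component technology
CN103049542A (zh) * 2012-12-27 2013-04-17 北京信息科技大学 Domain-oriented network information search method
CN103092817A (zh) * 2013-01-18 2013-05-08 五八同城信息技术有限公司 Script-engine-based data collection method and apparatus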
US20140280014A1 (en) * 2013-03-14 2014-09-18 Glenbrook Networks Apparatus and method for automatic assignment of industry classification codes

Also Published As

Publication number Publication date
CN105721519B (zh) 2019-02-05
CN105721519A (zh) 2016-06-29

Similar Documents

Publication Publication Date Title
WO2016086784A1 (zh) 一种网页数据采集方法、装置及系统
US10068028B1 (en) Deep link verification for native applications
US9547721B2 (en) Native application search results
US8219687B2 (en) Implementing browser based hypertext transfer protocol session storage
US11500709B1 (en) Mobile application crash monitoring user interface
CN110262807B (zh) 集群创建进度日志采集系统、方法和装置
CN102663062A (zh) 一种处理搜索结果中无效链接的方法及装置
US9436531B1 (en) Monitoring application loading
US11630876B2 (en) Indexing actions for resources
WO2013159611A1 (zh) 网页数据提交的方法和装置
CN105721578A (zh) 一种用户行为数据采集方法和系统
US9645980B1 (en) Verification of native applications for indexing
CN109862074B (zh) 一种数据采集方法、装置、可读介质及电子设备
US8745245B1 (en) System and method for offline detection
US10095791B2 (en) Information search method and apparatus
WO2012070925A1 (en) Method and system for asynchronous processing in a web application server
CN108990423B (zh) 减少重定向
US10536547B2 (en) Reducing redirects
CN111177100B (zh) 一种训练数据处理方法、装置及存储介质
CN110955856B (zh) 一种网页加载方法、装置、服务器及存储介质
CN109756393B (zh) 信息处理方法、系统、介质和计算设备
CN103425696A (zh) 网络搜索行为识别方法及其系统
CN101340463A (zh) 一种确定网络资源类型的方法和装置
US20150288584A1 (en) System and method for determining end user timing
Persson Development of a prototype framework for monitoring application events

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15866177

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15866177

Country of ref document: EP

Kind code of ref document: A1