CN105721519B - A kind of webpage data acquiring method, apparatus and system - Google Patents

A webpage data acquisition method, apparatus and system

Publication number
CN105721519B
Authority
CN
China
Prior art keywords
website information
acquisition
target
loading method
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410721389.9A
Other languages
Chinese (zh)
Other versions
CN105721519A (en)
Inventor
刘庆
黄华
殷贤君
张美德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201410721389.9A
Priority to PCT/CN2015/095584 (WO2016086784A1)
Publication of CN105721519A
Application granted
Publication of CN105721519B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

This application discloses a webpage data acquisition method. For example, the method may include: receiving a request for batch data acquisition, where the request carries target website information; determining an acquisition strategy that corresponds to the target website information and can successfully acquire target data, where the acquisition strategy is obtained by performing, on the target website information, a target data acquisition test that includes at least a synchronous loading test, and the acquisition strategy includes a synchronous loading mode or an asynchronous loading mode; and, according to the synchronous or asynchronous loading mode set in the acquisition strategy corresponding to the target website information, using the corresponding loading mode to acquire the target data from the webpage to which the target website information points. This application also discloses a webpage data acquisition apparatus and system.

Description

A webpage data acquisition method, apparatus and system
Technical field
This application relates to the Internet field, and in particular to a webpage data acquisition method, apparatus and system.
Background
During SEO (Search Engine Optimization) work on a website, accurately understanding the site's current overall optimization status creates a need to collect data from third-party websites or platforms. The collected information is then analyzed to formulate the next stage of the website optimization strategy.
At present, data from a third-party website or platform is mainly collected by loading its web pages over the Internet. Web page data is loaded in one of two ways: synchronously or asynchronously. With synchronous loading, the request directly returns the HTML page. With asynchronous loading, after the page is returned, JS (JavaScript, an interpreted scripting language) is loaded and modifies the original page structure so that the data is loaded. Once the HTML page is obtained, it can be parsed and the useful data extracted; for example, the headlines of news items in the Sina news channel can be extracted.
Because formulating a website optimization strategy requires a large amount of data, web page data from third-party websites or platforms must be collected in batches. However, because different web pages may load their data in different ways, the only way to guarantee accurate acquisition results has been to use asynchronous loading uniformly. Since executing JS takes additional time, data that could have been loaded synchronously then consumes a large amount of extra hardware resources and time, which makes data acquisition inefficient.
Summary of the invention
In view of this, the purpose of this application is to provide a webpage data acquisition method, apparatus and system that improve data acquisition efficiency.
A first aspect of the embodiments of this application provides a webpage data acquisition method. For example, the method may include: receiving a request for batch data acquisition, where the request carries target website information; determining an acquisition strategy that corresponds to the target website information and can successfully acquire target data, where the acquisition strategy is obtained by performing, on the target website information, a target data acquisition test that includes at least a synchronous loading test, and the acquisition strategy includes a synchronous loading mode or an asynchronous loading mode; and, according to the synchronous or asynchronous loading mode set in the acquisition strategy corresponding to the target website information, using the corresponding loading mode to acquire the target data from the webpage to which the target website information points.
A second aspect of the embodiments of this application provides a webpage data acquisition apparatus. For example, the apparatus may include: a request receiving unit, which may be configured to receive a request for batch data acquisition, where the request carries target website information; a strategy determining unit, which may be configured to determine an acquisition strategy that corresponds to the target website information and can successfully acquire target data, where the acquisition strategy is obtained by performing, on the target website information, a target data acquisition test that includes at least a synchronous loading test, and the acquisition strategy includes a synchronous loading mode or an asynchronous loading mode; and an acquisition unit, which may be configured to use, according to the synchronous or asynchronous loading mode set in the acquisition strategy corresponding to the target website information, the corresponding loading mode to acquire the target data from the webpage to which the target website information points.
A third aspect of the embodiments of this application provides a webpage data acquisition system. For example, the system may include: a client, which may be configured to issue a request for batch data acquisition, where the request carries target website information; an acquisition strategy configuration server, which may be configured to receive the request sent by the client, determine an acquisition strategy that corresponds to the target website information carried in the request and can successfully acquire target data, where the acquisition strategy is obtained by performing, on the target website information, a target data acquisition test that includes at least a synchronous loading test, and the acquisition strategy includes a synchronous loading mode or an asynchronous loading mode, generate an acquisition task for acquiring, in the loading mode set in that acquisition strategy, the target data from the webpage to which the target website information points, and distribute the acquisition task to an acquisition server in an acquisition server cluster; and the acquisition server cluster, which may be configured to receive the acquisition task distributed by the acquisition strategy configuration server, execute the task, and feed back the acquired target data.
It can be seen that this application has the following beneficial effects:
After a request for batch data acquisition is received, the embodiments of this application determine, from the target website information carried in the request, the corresponding acquisition strategy that can successfully acquire the target data, and that strategy is obtained by performing on the target website information a target data acquisition test that includes at least a synchronous loading test. Therefore, if the target data can be acquired from the webpage corresponding to the target website information by synchronous loading, the loading mode in the acquisition strategy obtained by the test can be the synchronous loading mode. Acquiring data in the synchronous loading mode set in the strategy means that data that can be loaded synchronously is not loaded asynchronously, which avoids the extra consumption of resources and time. The embodiments of this application can therefore effectively improve data acquisition efficiency while still ensuring that the target data is acquired successfully.
Brief description of the drawings
To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some of the embodiments recorded in this application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a webpage data acquisition method disclosed in an embodiment of this application;
Fig. 2 is a schematic structural diagram of a webpage data acquisition apparatus disclosed in an embodiment of this application;
Fig. 3 is a schematic structural diagram of a webpage data acquisition system disclosed in an embodiment of this application.
Detailed description of the embodiments
To help those skilled in the art better understand the technical solutions in this application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of this application.
Generally, because executing JS takes additional time, loading the same page structure without executing JS is somewhat more efficient. Based on this principle, if the way page data is loaded can be analyzed effectively before batch acquisition, using a test that includes at least a synchronous loading test, then website information whose target data can be loaded synchronously can be distinguished from website information whose target data must be loaded asynchronously, and a corresponding acquisition strategy that can successfully acquire the target data can be set. During batch acquisition, data can then be acquired in the synchronous or asynchronous loading mode set in the acquisition strategy corresponding to the target website information, so that data that can be loaded synchronously is not loaded asynchronously. This avoids the extra consumption of resources and time and effectively improves data acquisition efficiency.
For example, Fig. 1 is a schematic flowchart of a webpage data acquisition method provided by an embodiment of this application. As shown in Fig. 1, the method may include:
S110: Receive a request for batch data acquisition, where the request carries target website information.
For example, the received request for batch data acquisition may carry batch acquisition configuration information that the user enters on a front-end page. Suppose the goal is to collect, in batches, the search result data that the 1688 website search page returns for different keywords. The batch acquisition configuration information may then include the target website information "http://s.1688.com/selloffer/offer_search.htm?keywords=${keyword}&button_click=top&n=y", where ${keyword} can be replaced with different keywords. The HTML tag of the target data can be configured as "id:breadCrumbText|class[0]:sm-navigatebar-count|text", which means extracting the plain text under the first element of class sm-navigatebar-count below the HTML tag whose id is breadCrumbText. The batch acquisition configuration information can also be expressed as an XPath description; this application does not limit this. The user can also selectively configure other parameters as needed; this application does not limit this either.
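The tag configuration string above can be read as a short pipeline of steps. The sketch below is only one possible reading of that format, assuming steps are separated by `|`, that `id:NAME` and `class[N]:NAME` select elements, and that a bare `text` step asks for plain text. The function name `parse_selector` and the tuple layout are hypothetical and do not come from the patent.

```python
import re

# One step is either "text" or "id:NAME" / "class[N]:NAME" (assumed grammar).
STEP = re.compile(r"^(id|class)(?:\[(\d+)\])?:(.+)$")


def parse_selector(spec):
    """Split a tag configuration string into (kind, name, index) steps."""
    steps = []
    for part in spec.split("|"):
        part = part.strip()
        if part == "text":
            # Terminal step: take the plain text of the selected element.
            steps.append(("text", None, None))
            continue
        match = STEP.match(part)
        if match is None:
            raise ValueError(f"unrecognised selector step: {part!r}")
        kind, index, name = match.groups()
        steps.append((kind, name, int(index) if index is not None else None))
    return steps
```

Parsing the configuration once into explicit steps keeps the extraction logic independent of how the user typed the string; an XPath-based configuration, which the text also allows, would not need this step.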
In addition, depending on actual needs, if relevant parameters must be read from other files besides the batch acquisition configuration information submitted by the user, the mapping between the file that stores those parameters and the file's storage address must also be configured, so that the parameters can be read from the file according to the mapping when the data acquisition test is run. For example, in the scenario of batch-collecting the search result data of the 1688 website search page for different keywords, the keyword file submitted by the user can be downloaded from a specified address to the machine that runs the data acquisition test, and the mapping between the keyword file and its storage address is set and saved, for example "taskKeywordsFile": "/home/admin/1/test.txt", so that the keywords in the keyword file can be read according to the mapping when the data acquisition test is run.
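As a rough illustration of how keywords read from such a file could be substituted into a URL template, consider the sketch below. It assumes one keyword per line in the file and percent-encoding of each keyword; neither detail is stated in the patent. The helper names and the `example.com` template in the usage note are hypothetical stand-ins.

```python
from urllib.parse import quote


def read_keywords(lines):
    """Return non-empty, stripped keywords (assumes one keyword per line)."""
    return [line.strip() for line in lines if line.strip()]


def expand_urls(template, keywords):
    """Substitute each keyword into the ${keyword} placeholder of the template."""
    return [template.replace("${keyword}", quote(kw)) for kw in keywords]
```

For instance, `expand_urls("http://example.com/search?keywords=${keyword}", read_keywords(open(path)))` would produce one URL per keyword, where `path` is the storage address recorded in the mapping.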
S120: Determine an acquisition strategy that corresponds to the target website information and can successfully acquire target data, where the acquisition strategy is obtained by performing, on the target website information, a target data acquisition test that includes at least a synchronous loading test, and the acquisition strategy includes a synchronous loading mode or an asynchronous loading mode.
It should be noted that this acquisition strategy can be obtained in three ways. It can be obtained in advance, before the request for batch data acquisition is received, by performing on various website information a target data acquisition test that includes at least a synchronous loading test. It can be obtained in real time, when a request for batch data acquisition for the target website information is received, by performing such a test on the target website information. Or it can be obtained by running such a test again after a previously obtained acquisition strategy is found to be invalid.
For example, in an embodiment where the test is performed in advance on various website information, test configuration information that the user enters on a front-end page can be received in advance. It mainly includes the different types of URLs to be tested, the HTML tags used to identify the target data, and so on. After the website information to be tested and the corresponding HTML tags used to identify the target data are determined, a target data acquisition test that gives priority to the synchronous loading mode can be performed to obtain the acquisition strategies corresponding to the different types of URLs.
In some possible embodiments, the acquisition strategy obtained by testing in advance can be stored in a database as a historical acquisition strategy, so that when a request for batch data acquisition is received, the corresponding historical acquisition strategy can be extracted from the database and used for data acquisition.
Of course, before the historical acquisition strategy corresponding to the target website information is extracted, it can first be determined whether a historical acquisition strategy exists for the target website information carried in the request. If none exists, a target data acquisition test that gives priority to the synchronous loading mode can be performed on the target website information to obtain the corresponding acquisition strategy that can successfully acquire the target data, where the acquisition strategy includes a synchronous loading mode or an asynchronous loading mode, and the acquisition strategy is saved as the historical acquisition strategy corresponding to the target website information.
In some possible embodiments, after the historical acquisition strategy corresponding to the target website information carried in the request is extracted, that historical acquisition strategy can be used directly as the acquisition strategy that corresponds to the target website information and can successfully acquire the target data.
In other possible embodiments, it is taken into account that the way a third-party website or platform loads its page data may change: a URL whose target data could previously be acquired by synchronous loading may become one that can only be loaded asynchronously. Therefore, after the historical acquisition strategy corresponding to the target website information is extracted, a small-scale test can also be run to verify whether the existing historical acquisition strategy can still be used.
For example, the small-scale test may include: determining, according to a preset small-scale test rule, the HTML tag used to identify the small-scale test data and the website information to be tested within the target website information; attempting, according to the historical acquisition strategy corresponding to the target website information and that HTML tag, to acquire the small-scale test data from the webpages to which the website information to be tested points; and, if acquisition succeeds, determining that the historical acquisition strategy is the acquisition strategy that corresponds to the target website information and can successfully acquire the target data, and proceeding with the formal batch acquisition. If acquisition fails, a target data acquisition test that includes at least a synchronous loading test can be performed on the target website information to obtain the corresponding acquisition strategy that can successfully acquire the target data, and the historical acquisition strategy corresponding to the target website information is updated with the newly obtained strategy.
It should be noted that the embodiments of this application do not limit how the preset small-scale test rule is implemented. For example, a small amount of website information to be tested can be selected from the target website information according to a fixed preset quantity or a fixed reduction ratio. Take the scenario above of batch-collecting the search result data of the 1688 website search page for different keywords. For the small-scale test, the first 10 keywords can be extracted from the large number of keywords submitted by the user (if the user submitted fewer than 10, all of them are extracted) and substituted one by one into the search keyword parameter of the website information configured by the user, which yields the 10 URLs to be tested. The test is then run on these 10 URLs, using information such as the HTML tag that identifies the target data and the historical acquisition strategy extracted from the database. A historical acquisition strategy may include parameters such as the loading mode (synchronous or asynchronous), the connection timeout and the page acquisition timeout. In this scenario, the extracted historical acquisition strategy may have the format: "[{"url":"http://s.1688.com/selloffer/offer_search.htm?keywords=${keyword}&button_click=top&n=y","keywordsPath":"/usr/group/seo/test.txt","conto":"5000","readto":"6000","crawlType":"sync"}]". If the small-scale test shows that acquisition fails, the target data acquisition test that includes at least a synchronous loading test can be run again on the target website information configured by the user, and the historical acquisition strategy corresponding to the target website information is updated with the newly obtained acquisition strategy. The formal batch acquisition of target data is then carried out with the updated acquisition strategy.
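The small-scale check can be sketched as below. The sketch assumes that the historical strategy is kept only if every sampled URL yields data, since the text does not say whether all or only some samples must succeed. All names are hypothetical, and `try_collect` is assumed to return `None` when no data can be acquired.

```python
def pick_sample(keywords, limit=10):
    """Take the first `limit` keywords, or all of them if there are fewer."""
    return list(keywords[:limit])


def confirm_or_retest(history_strategy, sample_urls, try_collect, retest):
    """Keep the historical strategy if every sample URL yields data.

    Otherwise rerun the full acquisition test and return the strategy
    it produces.
    """
    for url in sample_urls:
        if try_collect(url, history_strategy) is None:
            # The page's loading behaviour may have changed: retest.
            return retest()
    return history_strategy
```

The caller would store the returned strategy back as the historical strategy before starting the formal batch run.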
It should be noted that the embodiments of this application do not limit how the target data acquisition test that includes at least a synchronous loading test is performed on the target website information.
For example, in some possible embodiments, the test may include: loading the webpage to which the target website information points in the synchronous loading mode and attempting to read the target data from the synchronously loaded webpage. For website information whose target data can be read from the synchronously loaded webpage, the loading mode in the acquisition strategy corresponding to that type of website information is set to the synchronous loading mode. For website information whose target data cannot be read from the synchronously loaded webpage, the loading mode in the acquisition strategy corresponding to that type of website information is set to the asynchronous loading mode.
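A minimal sketch of this sync-first decision follows. The loader and extractor are passed in as functions so that the example stays independent of any particular HTTP or HTML library. The value "sync" matches the `crawlType` shown in the strategy example; "async" is an assumed counterpart, and `choose_load_mode` is a hypothetical name.

```python
def choose_load_mode(url, load_sync, extract_target):
    """Return "sync" if the target data is readable from the synchronously
    loaded page, otherwise "async"."""
    html = load_sync(url)
    if html is not None and extract_target(html) is not None:
        return "sync"
    # The target data is absent from the raw HTML, so the page is assumed
    # to need JS execution (asynchronous loading).
    return "async"
```

Note that this variant never performs an asynchronous load during the test; it falls back to asynchronous loading purely because the synchronous attempt failed.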
As another example, in other possible embodiments, the webpage to which the target website information points can first be loaded in the asynchronous loading mode, with an attempt to read the target data from the asynchronously loaded webpage, and then loaded in the synchronous loading mode, with an attempt to read the target data from the synchronously loaded webpage. If the target data can be read from the synchronously loaded webpage, the loading mode in the acquisition strategy corresponding to that type of website information can be set to the synchronous loading mode. If the target data cannot be read from the synchronously loaded webpage but can be read from the asynchronously loaded webpage, the loading mode in the acquisition strategy corresponding to that type of website information can be set to the asynchronous loading mode.
In some possible embodiments, it is taken into account that whether a webpage loads successfully is also affected by network stability, so it may be necessary to retry the connection when a connection times out and to retry reading the page when reading times out. Therefore, during the target data acquisition test that includes at least a synchronous loading test, the step of loading the webpage to which the target website information points in the synchronous loading mode can be performed multiple times, and each execution can also record the time taken to establish a connection with the URL and the time taken to obtain the webpage after the connection is established. When the loading mode in the acquisition strategy for that type of website information is set to the synchronous loading mode, the connection timeout and the page acquisition timeout for the synchronous loading mode in that strategy are set according to the connection times and page acquisition times recorded over the multiple executions. Likewise, for website information whose target data cannot be read from the synchronously loaded webpage, the webpage it points to can be loaded asynchronously multiple times, with the same two times recorded for each execution. When the loading mode in the acquisition strategy for that type of website information is set to the asynchronous loading mode, the connection timeout and the page acquisition timeout in that strategy can then be set according to the times recorded during those asynchronous loads.
This application does not limit how the connection timeout and the page acquisition timeout are set from the recorded times. For example, the average of the connection times recorded over the multiple executions can be taken as the connection timeout to be set, and the average of the page acquisition times recorded over the multiple executions can be taken as the page acquisition timeout to be set. Other ways of calculating the connection timeout and the page acquisition timeout are also possible; this application does not limit this.
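The averaging example can be written out as a short sketch. It assumes the recorded times are in milliseconds and reuses the `conto` and `readto` keys from the strategy example; the function name is hypothetical, and rounding to whole milliseconds is an added assumption.

```python
from statistics import mean


def timeouts_from_samples(connect_ms, read_ms):
    """Average the recorded connection and page acquisition times."""
    return {
        "conto": round(mean(connect_ms)),   # connection timeout
        "readto": round(mean(read_ms)),     # page acquisition timeout
    }
```

Averaging is only the example the text gives; since other calculations are explicitly allowed, a deployment could substitute a more generous rule without changing the rest of the flow.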
In the above embodiments, because the connection timeout and the page acquisition timeout are set in the acquisition strategy, during subsequent batch acquisition a connection request can be reissued when a connection exceeds the connection timeout set in the strategy, and a page read request can be reissued when reading the page exceeds the page acquisition timeout set in the strategy. In addition, a maximum number of connection retries and a maximum number of page read retries can be set in the acquisition strategy, so that acquisition of the page data for that website information is abandoned when the number of retries exceeds the upper limit.
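The retry-then-abandon behaviour can be sketched as follows. The sketch assumes a single retry budget and a `fetch` callable that raises Python's built-in `TimeoutError` on either kind of timeout, whereas the text allows separate limits for connection retries and page read retries. All names are hypothetical.

```python
def fetch_with_retries(url, fetch, max_retries):
    """Try once, then retry up to max_retries times on timeout.

    Returns the fetched page, or None if every attempt timed out and the
    acquisition for this URL is abandoned.
    """
    for _attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except TimeoutError:
            continue  # reissue the request
    return None
```

Returning `None` rather than raising lets a batch run skip the abandoned URL and carry on with the remaining ones.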
S130: According to the synchronous or asynchronous loading mode set in the acquisition strategy corresponding to the target website information, use the corresponding loading mode to acquire the target data from the webpage to which the target website information points.
It should be noted that the request may carry one or more pieces of target website information. The embodiments of this application can perform the target data acquisition test that includes at least a synchronous loading test separately on each type of website information, distinguish website information whose target data can be loaded synchronously from website information whose target data must be loaded asynchronously, and set a corresponding acquisition strategy that can successfully acquire the target data. For target website information of multiple different types, the acquisition strategy corresponding to each type can be used to acquire the target data from its webpages. For how the test is performed on each type of website information, refer to the embodiments above that describe the test on target website information; details are not repeated here.
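Dispatching each URL to the loading mode named in its strategy can be sketched as below. The sketch assumes each strategy is a dictionary with a `crawlType` key, as in the strategy example, and that the loaders are supplied by the caller; `collect_batch`, `strategy_for` and `loaders` are hypothetical names.

```python
def collect_batch(urls, strategy_for, loaders, extract_target):
    """Load each URL with the mode its strategy names and extract the data."""
    results = {}
    for url in urls:
        mode = strategy_for(url)["crawlType"]   # e.g. "sync" or "async"
        html = loaders[mode](url)               # pick the matching loader
        results[url] = extract_target(html)
    return results
```

Only URLs whose strategy names the asynchronous mode pay the cost of JS execution, which is the efficiency gain the method aims for.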
As it can be seen that using method provided by the embodiments of the present application, since the corresponding acquisition strategies of target website information are to pass through Include at least the target data collecting test acquisition of synchronous load test to the target website information, therefore, in batch Acquire data when, can according to acquisition strategies corresponding with target website information, take the synchronization loading method being provided with or Asynchronous loading method acquires data, load the synchronous data that can be loaded out can to avoid using asynchronous loading method, thus The additional consumption for avoiding resource and time, can effectively improve data acquisition efficiency.In addition, the application also connects the page It connects and is recorded, analyzed with page read access time, corresponding connection time-out time is set in acquisition strategies, obtains page time-out Time, to can rationally call either synchronously or asynchronously two kinds of load sides according to acquisition strategies when formally carrying out batch data acquisition Formula is guaranteeing to improve collecting efficiency to greatest extent while accurately collecting data, is avoiding additional hardware resources and the time disappears Consumption.
Corresponding with above-mentioned webpage data acquiring method, present invention also provides a kind of collecting webpage data devices.
For example, with reference to Fig. 2, a kind of collecting webpage data apparatus structure schematic diagram passed through for the embodiment of the present application.Such as Fig. 2 It is shown, the apparatus may include:
Request reception unit 210 can be used for receiving the request of batch capture data, wherein the request carries mesh Mark website information.Policy determining unit 220, being determined for that the target website information is corresponding can successful acquisition number of targets According to acquisition strategies, wherein the corresponding acquisition strategies of the target website information are carried out especially by the target website information It is obtained including at least the target data collecting test of synchronous load test, the acquisition strategies include synchronous loading method or asynchronous Loading method.Acquisition unit 230 can be used for being added according to the synchronization being arranged in the corresponding acquisition strategies of the target website information Load mode or asynchronous loading method take corresponding loading method to acquire the target in the webpage that the target website information is directed toward Data.
In some possible embodiments, after the history acquisition strategy corresponding to the target website information carried in the request has been extracted, that history acquisition strategy may be taken directly as the acquisition strategy that corresponds to the target website information and can successfully collect the target data. Accordingly, the strategy determining unit 220 may be used to extract the history acquisition strategy corresponding to the target website information, where the history acquisition strategy was obtained in advance by performing, on the target website information, a target data collection test that includes at least a synchronous load test, and includes a synchronous loading method or an asynchronous loading method; and to determine that the history acquisition strategy is the acquisition strategy that corresponds to the target website information and can successfully collect the target data.
In other possible embodiments, a small-scale test may additionally be carried out after the history acquisition strategy corresponding to the target website information has been extracted. For example, the strategy determining unit 220 may include the following subunits. An extraction subunit 221 may be used to extract the history acquisition strategy corresponding to the target website information, where the history acquisition strategy was obtained in advance by performing, on the target website information, a target data collection test that includes at least a synchronous load test, and includes a synchronous loading method or an asynchronous loading method. A small-scale test determining subunit 222 may be used to determine, from a preset small-scale test instruction, the HTML tag used to identify small-scale test data and the website information to be tested within the target website information. A strategy testing subunit 223 may be used to attempt, according to the history acquisition strategy corresponding to the target website information and the HTML tag used to identify small-scale test data, to collect the small-scale test data in the webpage to which the website information to be tested points. A strategy determining subunit 224 may be used, if the collection succeeds, to determine that the history acquisition strategy is the acquisition strategy that corresponds to the target website information and can successfully collect the target data. A test subunit 225 may be used, if the collection fails, to perform again on the target website information a target data collection test that includes at least a synchronous load test, and to obtain the corresponding acquisition strategy that can successfully collect the target data. An update subunit 226 may be used to update the history acquisition strategy corresponding to the target website information according to the acquisition strategy obtained by the test subunit.
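The flow through subunits 221 to 226 can be sketched as a single function. This is only an illustration under assumptions: `determine_strategy`, `try_collect` and `full_test` are hypothetical names, the history store is modeled as a plain dictionary, and a missing history entry is treated like a failed small-scale test (the text above does not cover that case).

```python
from typing import Callable, Dict, Iterable, TypeVar

S = TypeVar("S")  # whatever object represents an acquisition strategy


def determine_strategy(
    site_key: str,
    sample_urls: Iterable[str],
    history: Dict[str, S],
    try_collect: Callable[[str, S], bool],
    full_test: Callable[[str], S],
) -> S:
    """Return a strategy that is expected to collect the target data.

    try_collect(url, strategy) -> True if the small-scale test data was read.
    full_test(site_key)        -> strategy from a fresh collection test.
    """
    strategy = history.get(site_key)              # extraction (221)
    if strategy is not None and all(
        try_collect(url, strategy) for url in sample_urls  # small-scale test (223)
    ):
        return strategy                           # history strategy still works (224)
    strategy = full_test(site_key)                # retest (225)
    history[site_key] = strategy                  # update stored history (226)
    return strategy
```

Keeping the full test behind the small-scale check means the expensive synchronous/asynchronous comparison only runs when a site has actually changed.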
It should be noted that the embodiments of the present application do not limit how the test subunit 225 obtains, through the target data collection test, the corresponding acquisition strategy that can successfully collect the target data. For example, in some possible embodiments, the test subunit 225 may include the following subunits. A synchronous loading subunit 2251 may be used to load, with the synchronous loading method, the webpage to which the target website information points. A target data reading subunit 2252 may be used to attempt to read the target data from the webpage obtained by synchronous loading. A synchronous strategy setting subunit 2253 may be used, for website information whose target data can be read from the synchronously loaded webpage, to set the loading method in the acquisition strategy corresponding to website information of that type to the synchronous loading method. An asynchronous strategy setting subunit 2254 may be used, for website information whose target data cannot be read from the synchronously loaded webpage, to set the loading method in the acquisition strategy corresponding to website information of that type to the asynchronous loading method.
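The decision made by subunits 2252 to 2254 can be illustrated with Python's standard `html.parser`. This is a sketch under assumptions: the target data is taken to be the text of the element with a given `id` (the patent only says the data is identified by an HTML tag), and `TargetFinder`, `read_target` and `choose_loading_method` are hypothetical names. The HTML passed in is assumed to be the result of a plain synchronous fetch, without script execution.

```python
from html.parser import HTMLParser
from typing import Optional

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TargetFinder(HTMLParser):
    """Collects the text inside the first element whose id matches."""

    def __init__(self, target_id: str) -> None:
        super().__init__()
        self.target_id = target_id
        self.depth = 0        # > 0 while inside the target element
        self.found = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.found and dict(attrs).get("id") == self.target_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def read_target(html: str, target_id: str) -> Optional[str]:
    """Return the target text, or None if it is absent or empty."""
    finder = TargetFinder(target_id)
    finder.feed(html)
    finder.close()
    return "".join(finder.chunks).strip() or None


def choose_loading_method(sync_html: str, target_id: str) -> str:
    """'sync' if the synchronously loaded page already holds the data."""
    return "sync" if read_target(sync_html, target_id) else "async"
```

A page that fills the target element from a script returns an empty element on a synchronous fetch, so it is classified as needing the asynchronous loading method.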
In some possible embodiments, whether a webpage loads successfully also depends on network stability, so it may be necessary to retry the connection when the connection times out and to retry reading the page when the page read times out. Accordingly, the synchronous loading subunit 2251 may be used to perform multiple times the step of loading, with the synchronous loading method, the webpage to which the target website information points. The test subunit may further include the following subunits. A synchronous recording subunit 2255 may be used to record, each time the synchronous loading subunit performs a load, the time taken to establish the connection with the website and the time taken to obtain the webpage after the connection is established. A synchronous timeout setting subunit 2256 may be used, when the loading method in the acquisition strategy corresponding to website information of that type is set to the synchronous loading method, to set the connection timeout and the page acquisition timeout of the synchronous loading method in the corresponding acquisition strategy according to the connection times and page acquisition times recorded while the synchronous loading subunit performed the loads. An asynchronous recording subunit 2257 may be used, for website information whose target data cannot be read from the synchronously loaded webpage, to load the webpage to which it points multiple times with the asynchronous loading method, and to record on each execution the time taken to establish the connection with the website and the time taken to obtain the webpage after the connection is established. An asynchronous timeout setting subunit 2258 may be used, when the loading method in the acquisition strategy corresponding to website information of that type is set to the asynchronous loading method, to set the connection timeout and the page acquisition timeout in the corresponding acquisition strategy according to the connection times and page acquisition times recorded during the repeated asynchronous loads.
It should be noted that the extraction subunit 221, small-scale test determining subunit 222, strategy testing subunit 223, strategy determining subunit 224, test subunit 225, update subunit 226, synchronous loading subunit 2251, target data reading subunit 2252, synchronous strategy setting subunit 2253, asynchronous strategy setting subunit 2254, synchronous recording subunit 2255, synchronous timeout setting subunit 2256, asynchronous recording subunit 2257 and asynchronous timeout setting subunit 2258 described in the embodiments of the present application are drawn with dashed lines in Fig. 2, to indicate that these units are not essential units of the webpage data acquiring apparatus provided by the present application.
Corresponding to the above webpage data acquiring method, the present application also provides a webpage data acquiring system for implementing the method.
For example, Fig. 3 is a schematic structural diagram of a webpage data acquiring system provided by an embodiment of the present application. As shown in Fig. 3, the system may include:
A client 310, which may be used to issue a request for batch data acquisition, where the request carries target website information.
An acquisition strategy configuration server 320, which may be used to receive the request for batch data acquisition, where the request carries target website information; to determine the acquisition strategy that corresponds to the target website information and can successfully collect the target data, where this acquisition strategy is obtained by performing, on the target website information, a target data collection test that includes at least a synchronous load test, and the acquisition strategy includes a synchronous loading method or an asynchronous loading method; to generate acquisition tasks for collecting the target data in the webpage to which the target website information points, using whichever of the synchronous loading method or asynchronous loading method is set in the acquisition strategy corresponding to the target website information; and to distribute the acquisition tasks to the acquisition servers in an acquisition server cluster 330.
An acquisition server cluster 330, which may be used to receive the acquisition tasks distributed by the acquisition strategy configuration server, execute the acquisition tasks, and return the collected target data.
As can be seen, with the webpage data acquiring system provided by the embodiments of the present application, the acquisition strategy configuration server 320 can generate acquisition tasks in batches and distribute them, according to a preset distribution strategy, to idle acquisition servers in the acquisition server cluster 330, so that the acquisition tasks are executed concurrently, which further improves the efficiency of webpage data collection.
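The idea of handing tasks to whichever worker is idle can be illustrated locally with a thread pool standing in for the acquisition server cluster. This is only an analogy under assumptions: the patent does not specify the distribution strategy, and `run_tasks`, `execute` and `workers` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_tasks(tasks: Iterable[T], execute: Callable[[T], R],
              workers: int = 4) -> List[R]:
    """Run acquisition tasks concurrently; each free worker takes the next task.

    Results are returned in the same order as the input tasks.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(execute, tasks))
```

In a real deployment the pool would be replaced by the task distribution between the configuration server and the acquisition servers, but the ordering and concurrency properties being relied on are the same.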
In some possible embodiments, the user may set batch acquisition configuration information in the client 310 and issue, through the client 310, a request carrying that batch acquisition configuration information. The batch acquisition configuration information may include parameters such as the target website information. In the application scenario mentioned above, in which search result data for different search keywords on the 1688 website is collected in batches, the acquisition strategy configuration server 320, in addition to obtaining the batch acquisition configuration information, also needs to download the keyword file submitted by the user, according to a specified address, to the acquisition server in the acquisition server cluster 330 that executes the data acquisition test. At the same time, it sets and saves the mapping between the keyword file and its storage address, for example, "taskKeywordsFile":"/home/admin/1/test.txt". This mapping is encapsulated in the test task and sent to the acquisition server together with the test task. When the acquisition server carries out the data acquisition test, it can then read the keywords in the keyword file according to the mapping and expand them into the corresponding target website information used to search for the relevant page data.
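The expansion step on the acquisition server can be sketched as follows. The `taskKeywordsFile` key and path come from the example above; everything else is an assumption for illustration, including the function name `expand_urls`, the one-keyword-per-line file layout, and the URL template, which is a placeholder and not the real 1688 search URL.

```python
import json
from typing import Callable, Iterable, List
from urllib.parse import quote_plus


def expand_urls(
    task_json: str,
    url_template: str,
    read_lines: Callable[[str], Iterable[str]],
) -> List[str]:
    """Turn the keyword file named in a test task into target URLs.

    task_json    -- task payload containing the "taskKeywordsFile" mapping
    url_template -- hypothetical template with a {keyword} placeholder
    read_lines   -- reads the file at a path and yields its lines
    """
    task = json.loads(task_json)
    path = task["taskKeywordsFile"]
    keywords = [line.strip() for line in read_lines(path) if line.strip()]
    return [url_template.format(keyword=quote_plus(k)) for k in keywords]


def read_file_lines(path: str) -> List[str]:
    """Default reader: one keyword per line, UTF-8 (an assumed layout)."""
    with open(path, encoding="utf-8") as handle:
        return handle.readlines()
```

Passing the reader as a parameter keeps the expansion logic independent of where the keyword file was downloaded to.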
In other possible embodiments, the acquisition strategy configuration server 320 may include a strategy generating server 321, a test server 322 and a database server 323.
The strategy generating server 321 may be used to generate, in advance, preliminary test tasks for different types of website; to submit the preliminary test tasks to the test server 322; and to obtain from the database server 323 the loading method, connection time, page acquisition time and so on recorded during the test. From the obtained loading method, connection time and page acquisition time, it generates the acquisition strategies corresponding to the different types of website, and sends these acquisition strategies to the database server 323 to be stored as history acquisition strategies. It also receives the request issued by the client 310 and obtains from the database server the history acquisition strategy corresponding to the target website information. It generates a small-scale test task that tests the target website information on a small scale using the history acquisition strategy, and submits the small-scale test task to the test server 322. If the test collection succeeds, it may generate acquisition tasks for collecting, according to the history acquisition strategy, the target data in the webpage to which the target website information points. If the collection fails, it generates a retry task for performing the target data collection test on the target website information again, and submits the retry task to the test server 322. It obtains from the database server 323 the loading method, connection time, page acquisition time and so on recorded during the retest, and from these generates an updated acquisition strategy corresponding to the target website information. It sends the updated acquisition strategy to the database server 323 so that the history acquisition strategy saved in the database is updated, and generates acquisition tasks for collecting, according to the updated acquisition strategy, the target data in the webpage to which the target website information points. The generated acquisition tasks are distributed to the acquisition servers in the acquisition server cluster 330 for execution.
The test server 322 may be used to obtain preliminary test tasks, small-scale test tasks and/or retry tasks from the strategy generating server 321, and to distribute the obtained preliminary test tasks, small-scale test tasks and/or retry tasks to the acquisition servers in the acquisition server cluster 330 for execution. It collects the loading method, connection time, page acquisition time and so on during execution of the test tasks, and saves them in the database for use by the strategy generating server 321. The test server 322 may support two loading methods, synchronous and asynchronous: the synchronous loading method may load and parse pages using httpclient + htmlparser, and the asynchronous loading method may load and parse pages using webkit.
The database server 323 may be used to save the loading method, connection time, page acquisition time and so on collected by the test server 322, and to save the acquisition strategies generated by the strategy generating server 321.
In the above embodiments, the acquisition strategy configuration server 320 and the acquisition server cluster 330 may be deployed in different network systems. The database server 323 may be built on a MySQL database cluster. In addition, considering the volume of data, the database server 323 may be deployed in a distributed manner to provide good read performance.
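How history acquisition strategies might be stored and updated can be sketched with Python's built-in `sqlite3`, used here only as a local stand-in for the MySQL cluster mentioned above. The table name, column names and function names are all assumptions; the patent does not describe a schema.

```python
import sqlite3
from typing import Optional, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the strategy store and create the (assumed) table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS acquisition_strategy (
               site_key          TEXT PRIMARY KEY,
               loading_method    TEXT NOT NULL
                                 CHECK (loading_method IN ('sync', 'async')),
               connect_timeout_s REAL NOT NULL,
               page_timeout_s    REAL NOT NULL
           )"""
    )
    return conn


def save_strategy(conn: sqlite3.Connection, site_key: str, loading_method: str,
                  connect_timeout_s: float, page_timeout_s: float) -> None:
    """Insert a strategy, or overwrite the stored history strategy."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO acquisition_strategy VALUES (?, ?, ?, ?)",
            (site_key, loading_method, connect_timeout_s, page_timeout_s),
        )


def load_strategy(conn: sqlite3.Connection,
                  site_key: str) -> Optional[Tuple[str, float, float]]:
    """Return (loading_method, connect_timeout_s, page_timeout_s) or None."""
    return conn.execute(
        "SELECT loading_method, connect_timeout_s, page_timeout_s "
        "FROM acquisition_strategy WHERE site_key = ?",
        (site_key,),
    ).fetchone()
```

Keying the table on the site means an updated strategy from a retest replaces the old history strategy in place, which matches the update step described for the strategy generating server.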
It should be noted that the strategy generating server 321, the test server 322 and the database server 323 described in the embodiments of the present application are drawn with dashed lines in Fig. 2, to indicate that they are not essential servers of the acquisition strategy configuration server.
For convenience of description, the above apparatus is described in terms of separate units divided by function. Of course, when the present invention is implemented, the functions of the units may be realized in one or more pieces of software and/or hardware.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments of the present invention or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner. The same or similar parts of the embodiments may be referred to from one another, and each embodiment focuses on its differences from the other embodiments. The system embodiment in particular is described relatively briefly because it is substantially similar to the method embodiment; for the relevant parts, refer to the description of the method embodiment.
The present invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
The present invention can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform specific tasks or implement specific abstract data types. The present invention can also be practiced in distributed computing environments, in which tasks are executed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any other variants are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to that process, method, article or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims (11)

1. A webpage data acquiring method, characterized by comprising:
receiving a request for batch data acquisition, wherein the request carries target website information;
performing, on the target website information, a target data collection test that includes at least a synchronous load test, distinguishing target website information whose target data can be loaded synchronously from target website information whose target data must be loaded asynchronously, and determining, according to the result of the distinction, the acquisition strategy that corresponds to the target website information and can successfully collect the target data, wherein the acquisition strategy includes a synchronous loading method or an asynchronous loading method;
collecting the target data in the webpage to which the target website information points, using whichever of the synchronous loading method or asynchronous loading method is set in the acquisition strategy corresponding to the target website information.
2. The method according to claim 1, wherein determining the acquisition strategy that corresponds to the target website information and can successfully collect the target data comprises:
extracting the history acquisition strategy corresponding to the target website information, wherein the history acquisition strategy was obtained in advance by performing, on the target website information, a target data collection test that includes at least a synchronous load test, and the history acquisition strategy includes a synchronous loading method or an asynchronous loading method;
determining that the history acquisition strategy is the acquisition strategy that corresponds to the target website information and can successfully collect the target data.
3. The method according to claim 1, wherein determining the acquisition strategy that corresponds to the target website information and can successfully collect the target data comprises:
extracting the history acquisition strategy corresponding to the target website information, wherein the history acquisition strategy was obtained in advance by performing, on the target website information, a target data collection test that includes at least a synchronous load test, and the history acquisition strategy includes a synchronous loading method or an asynchronous loading method;
determining, from a preset small-scale test instruction, the HTML tag used to identify small-scale test data and the website information to be tested within the target website information;
attempting, according to the history acquisition strategy corresponding to the target website information and the HTML tag used to identify small-scale test data, to collect the small-scale test data in the webpage to which the website information to be tested points;
if the collection succeeds, determining that the history acquisition strategy is the acquisition strategy that corresponds to the target website information and can successfully collect the target data;
if the collection fails, performing, on the target website information, a target data collection test that includes at least a synchronous load test, obtaining the corresponding acquisition strategy that can successfully collect the target data, and updating the history acquisition strategy corresponding to the target website information according to the obtained acquisition strategy.
4. The method according to any one of claims 1 to 3, wherein performing, on the target website information, the target data collection test that includes at least a synchronous load test comprises:
loading, with the synchronous loading method, the webpage to which the target website information points; attempting to read the target data from the webpage obtained by synchronous loading; for website information whose target data can be read from the synchronously loaded webpage, setting the loading method in the acquisition strategy corresponding to website information of that type to the synchronous loading method; and for website information whose target data cannot be read from the synchronously loaded webpage, setting the loading method in the acquisition strategy corresponding to website information of that type to the asynchronous loading method.
5. The method according to claim 4, wherein the step of loading, with the synchronous loading method, the webpage to which the target website information points is performed multiple times, and the method further comprises:
on each execution, recording the time taken to establish the connection with the website and the time taken to obtain the webpage after the connection is established; and, when the loading method in the acquisition strategy corresponding to website information of that type is set to the synchronous loading method, setting the connection timeout and the page acquisition timeout of the synchronous loading method in the corresponding acquisition strategy according to the connection times and page acquisition times recorded over the multiple executions;
for website information whose target data cannot be read from the synchronously loaded webpage, loading the webpage to which it points multiple times with the asynchronous loading method, and recording on each execution the time taken to establish the connection with the website and the time taken to obtain the webpage after the connection is established; and, when the loading method in the acquisition strategy corresponding to website information of that type is set to the asynchronous loading method, setting the connection timeout and the page acquisition timeout of the asynchronous loading method in the corresponding acquisition strategy according to the connection times and page acquisition times recorded during the repeated asynchronous loads.
6. A webpage data acquiring apparatus, characterized by comprising:
a request receiving unit, configured to receive a request for batch data acquisition, wherein the request carries target website information;
a strategy determining unit, configured to perform, on the target website information, a target data collection test that includes at least a synchronous load test, distinguish target website information whose target data can be loaded synchronously from target website information whose target data must be loaded asynchronously, and determine, according to the result of the distinction, the acquisition strategy that corresponds to the target website information and can successfully collect the target data, wherein the acquisition strategy includes a synchronous loading method or an asynchronous loading method;
an acquisition unit, configured to collect the target data in the webpage to which the target website information points, using whichever of the synchronous loading method or asynchronous loading method is set in the acquisition strategy corresponding to the target website information.
7. The apparatus according to claim 6, wherein the strategy determining unit is configured to extract the history acquisition strategy corresponding to the target website information, wherein the history acquisition strategy was obtained in advance by performing, on the target website information, a target data collection test that includes at least a synchronous load test, and the history acquisition strategy includes a synchronous loading method or an asynchronous loading method; and to determine that the history acquisition strategy is the acquisition strategy that corresponds to the target website information and can successfully collect the target data.
8. The apparatus according to claim 6, wherein the strategy determining unit comprises:
an extraction subunit, configured to extract the history acquisition strategy corresponding to the target website information, wherein the history acquisition strategy was obtained in advance by performing, on the target website information, a target data collection test that includes at least a synchronous load test, and the history acquisition strategy includes a synchronous loading method or an asynchronous loading method;
a small-scale test determining subunit, configured to determine, from a preset small-scale test instruction, the HTML tag used to identify small-scale test data and the website information to be tested within the target website information;
a strategy testing subunit, configured to attempt, according to the history acquisition strategy corresponding to the target website information and the HTML tag used to identify small-scale test data, to collect the small-scale test data in the webpage to which the website information to be tested points;
a strategy determining subunit, configured to determine, if the collection succeeds, that the history acquisition strategy is the acquisition strategy that corresponds to the target website information and can successfully collect the target data;
a test subunit, configured to perform, if the collection fails, a target data collection test that includes at least a synchronous load test on the target website information, and to obtain the corresponding acquisition strategy that can successfully collect the target data;
an update subunit, configured to update the history acquisition strategy corresponding to the target website information according to the acquisition strategy obtained by the test subunit.
9. The apparatus according to claim 8, wherein the test subunit comprises:
a synchronous loading subunit, configured to load, with the synchronous loading method, the webpage to which the target website information points; a target data reading subunit, configured to attempt to read the target data from the webpage obtained by synchronous loading; a synchronous strategy setting subunit, configured to set, for website information whose target data can be read from the synchronously loaded webpage, the loading method in the acquisition strategy corresponding to website information of that type to the synchronous loading method; and an asynchronous strategy setting subunit, configured to set, for website information whose target data cannot be read from the synchronously loaded webpage, the loading method in the acquisition strategy corresponding to website information of that type to the asynchronous loading method.
10. The apparatus according to claim 9, wherein the synchronous loading subunit is configured to perform multiple times the step of loading, with the synchronous loading method, the webpage to which the target website information points;
and the test subunit further comprises:
a synchronous recording subunit, configured to record, each time the synchronous loading subunit performs a load, the time taken to establish the connection with the website and the time taken to obtain the webpage after the connection is established;
a synchronous timeout setting subunit, configured to set, when the loading method in the acquisition strategy corresponding to website information of that type is set to the synchronous loading method, the connection timeout and the page acquisition timeout of the synchronous loading method in the corresponding acquisition strategy according to the connection times and page acquisition times recorded while the synchronous loading subunit performed the loads;
an asynchronous recording subunit, configured to load, for website information whose target data cannot be read from the synchronously loaded webpage, the webpage to which it points multiple times with the asynchronous loading method, and to record on each execution the time taken to establish the connection with the website and the time taken to obtain the webpage after the connection is established;
an asynchronous timeout setting subunit, configured to set, when the loading method in the acquisition strategy corresponding to website information of that type is set to the asynchronous loading method, the connection timeout and the page acquisition timeout in the corresponding acquisition strategy according to the connection times and page acquisition times recorded during the repeated asynchronous loads.
11. A webpage data acquisition system, characterized by comprising:
a client, configured to issue a request for batch data acquisition, wherein the request carries target website information;
an acquisition strategy configuration server, configured to: receive the batch data acquisition request sent by the client; perform on the target website information a target data acquisition test that includes at least a synchronous loading test; distinguish target website information for which the target data can be loaded synchronously from target website information for which the target data must be loaded asynchronously; determine, according to the result of the distinction, the acquisition strategy that corresponds to the target website information carried in the request and with which the target data can be successfully acquired, wherein the acquisition strategy includes the synchronous loading method or the asynchronous loading method; generate an acquisition task for acquiring the target data in the webpage pointed to by the target website information using the synchronous or asynchronous loading method set in the acquisition strategy corresponding to that target website information; and distribute the acquisition task to an acquisition server in an acquisition server cluster;
an acquisition server cluster, configured to receive the acquisition tasks distributed by the acquisition strategy configuration server, execute the acquisition tasks, and feed back the acquired target data.
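The role of the acquisition strategy configuration server in claim 11 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the sync_load and extract callables are injected stand-ins for a real page loader and target-data parser, a URL is treated as requiring asynchronous loading whenever the synchronous test yields no target data, and round-robin distribution is an assumed policy (the claim says only that tasks are distributed to the cluster). All names are hypothetical.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Callable, Dict, Iterable, List, Optional

SYNC, ASYNC = "sync", "async"


@dataclass(frozen=True)
class AcquisitionTask:
    """One unit of work for an acquisition server (hypothetical type)."""
    url: str
    loading_method: str  # SYNC or ASYNC, taken from the acquisition strategy


def choose_loading_method(
    url: str,
    sync_load: Callable[[str], str],
    extract: Callable[[str], Optional[str]],
) -> str:
    """Synchronous loading test: if the target data can be read from the
    synchronously loaded page, use SYNC; otherwise assume ASYNC is needed."""
    page = sync_load(url)
    return SYNC if extract(page) is not None else ASYNC


def build_tasks(
    urls: Iterable[str],
    sync_load: Callable[[str], str],
    extract: Callable[[str], Optional[str]],
) -> List[AcquisitionTask]:
    """Generate one acquisition task per target URL."""
    return [
        AcquisitionTask(u, choose_loading_method(u, sync_load, extract))
        for u in urls
    ]


def distribute(
    tasks: List[AcquisitionTask], servers: List[str]
) -> Dict[str, List[AcquisitionTask]]:
    """Assign tasks to servers in round-robin order (assumed policy)."""
    if not servers:
        raise ValueError("the acquisition server cluster is empty")
    assignment: Dict[str, List[AcquisitionTask]] = {s: [] for s in servers}
    for task, server in zip(tasks, cycle(servers)):
        assignment[server].append(task)
    return assignment
```

Injecting the loader and parser keeps the sketch free of network access; in a real deployment they would wrap an HTTP client and a page parser, and the asynchronous path would use a loader capable of executing the page's scripts.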
CN201410721389.9A 2014-12-02 2014-12-02 A kind of webpage data acquiring method, apparatus and system Active CN105721519B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201410721389.9A CN105721519B (en) 2014-12-02 2014-12-02 A kind of webpage data acquiring method, apparatus and system
PCT/CN2015/095584 WO2016086784A1 (en) 2014-12-02 2015-11-26 Method, apparatus and system for collecting webpage data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410721389.9A CN105721519B (en) 2014-12-02 2014-12-02 A kind of webpage data acquiring method, apparatus and system

Publications (2)

Publication Number Publication Date
CN105721519A CN105721519A (en) 2016-06-29
CN105721519B true CN105721519B (en) 2019-02-05

Family

ID=56090993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410721389.9A Active CN105721519B (en) 2014-12-02 2014-12-02 A kind of webpage data acquiring method, apparatus and system

Country Status (2)

Country Link
CN (1) CN105721519B (en)
WO (1) WO2016086784A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106502802A (en) * 2016-10-12 2017-03-15 山东浪潮云服务信息科技有限公司 A kind of concurrent acquisition method in distributed high in the clouds transmitted based on Avro RPC
CN110134841A (en) * 2018-02-09 2019-08-16 鼎复数据科技(北京)有限公司 The customized real-time method for obtaining website data
CN109658689B (en) * 2018-12-04 2021-01-05 沈阳世纪高通科技有限公司 Traffic information processing method and device
CN113114505B (en) * 2021-04-13 2022-07-12 广州海鹚网络科技有限公司 httpClient-based access request processing method and system
CN115630217A (en) * 2022-12-21 2023-01-20 广州市千钧网络科技有限公司 Method, device and equipment for loading information and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101136026A (en) * 2007-05-15 2008-03-05 北京聚生科技有限公司 Web page content capturing method based on XMLHTTP component technology
CN103049542A (en) * 2012-12-27 2013-04-17 北京信息科技大学 Domain-oriented network information search method
CN103092817A (en) * 2013-01-18 2013-05-08 五八同城信息技术有限公司 Data collection method and data collection device based on script engine

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8965877B2 (en) * 2013-03-14 2015-02-24 Glenbrook Networks Apparatus and method for automatic assignment of industry classification codes

Also Published As

Publication number Publication date
CN105721519A (en) 2016-06-29
WO2016086784A1 (en) 2016-06-09

Similar Documents

Publication Publication Date Title
CN107273409B (en) Network data acquisition, storage and processing method and system
CN105721519B (en) A kind of webpage data acquiring method, apparatus and system
CN106886494A (en) A kind of automatic interface testing method and its system
CN107885873B (en) Method and apparatus for outputting information
CN104462547B (en) A kind of method and system of configurable collecting webpage data
CN101833570A (en) Method and device for optimizing page push of mobile terminal
CN103455600B (en) A kind of video URL grasping means, device and server apparatus
US10452730B2 (en) Methods for analyzing web sites using web services and devices thereof
CN105069087A (en) Web log data mining based website optimization method
CN105721578B (en) A kind of user behavior data acquisition method and system
CN111737443B (en) Answer text processing method and device and key text determining method
CN110069693A (en) Method and apparatus for determining target pages
CN104980464B (en) A kind of network request processing method, network server and network system
CN111651656A (en) Method and system for dynamic webpage crawler based on agent mode
CN107368407B (en) Information processing method and device
CN113742551A (en) Dynamic data capture method based on script and puppeteer
CN111273964B (en) Data loading method and device
JP6727097B2 (en) Information processing apparatus, information processing method, and program
CN114765599B (en) Subdomain name acquisition method and device
US20190253333A1 (en) Methods and devices for network web resource performance
CN117009430A (en) Data management method, device, storage medium and electronic equipment
CN104281693A (en) Semantic search method and semantic search system
Shen et al. A Catalogue Service for Internet GIServices Supporting Active Service Evaluation and Real‐Time Quality Monitoring
CN110740046B (en) Method and device for analyzing service contract
CN101640605A (en) Method and device for correlating client data with server-end data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant