CN106294369A - Web data acquisition methods and device - Google Patents

Web data acquisition methods and device

Publication number
CN106294369A
Authority
CN
China
Prior art keywords
account
webpage
targeted website
web data
access
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510250516.6A
Other languages
Chinese (zh)
Inventor
毛啸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201510250516.6A
Publication of CN106294369A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a web page data acquisition method and device. The method includes: obtaining multiple accounts, where each account has login rights for a target website; and selecting an account from the multiple accounts, using the selected account to access a web page of the target website, and crawling the web page data of the page accessed by the selected account, where the accounts chosen in any two consecutive selections are different. The invention solves the technical problem in the prior art that the web page data of a website is difficult to obtain quickly.

Description

Web data acquisition methods and device
Technical field
The present invention relates to the field of data processing, and in particular to a web page data acquisition method and device.
Background
At present, in the field of web page data processing, crawler programs are generally used to crawl web page data. A crawler is a program that automatically extracts web page data; developers can use it to download, analyze, and process web page data, providing a basis for data application services and creating commercial value. However, many websites now not only require visitors to authenticate but also count how many times the currently logged-in user accesses different pages within a given period. In other words, these websites adopt anti-crawler strategies to restrict crawlers, which hinders the crawling of web page data. When a user runs a crawler to obtain the web page data of a particular website system, the site's anti-crawler strategy usually identifies the crawler as a malicious visitor and takes various measures to stop it from continuing to access the site. The inventor also found that even when a crawler logs in to the website with a previously registered account, it is still subject to the access-count limit, which makes it difficult for the user to obtain the site's web page data quickly.
No effective solution to this problem has yet been proposed.
Summary of the invention
Embodiments of the present invention provide a web page data acquisition method and device, to at least solve the technical problem that the web page data of a website is difficult to obtain quickly.
According to one aspect of the embodiments of the present invention, a web page data acquisition method is provided, including: obtaining multiple accounts, where each of the multiple accounts has login rights for a target website; and selecting an account from the multiple accounts, using the selected account to access a web page of the target website, and crawling the web page data of the page accessed by the selected account, where the accounts chosen in any two consecutive selections are different.
Further, the number of accounts is n, where n is greater than or equal to 2. Selecting an account from the multiple accounts, using the selected account to access a web page of the target website, and crawling the web page data of the page accessed by the selected account includes: selecting the i-th account from the multiple accounts and using the i-th account to access the j-th web page of the target website, where i = 1, ..., n and j = 1, 2, 3, ...; when i is greater than or equal to 2, the i-th account is different from the (i-1)-th account; when j is greater than or equal to 2, the j-th web page is one or more web pages of the target website, each different from the 1st through (j-1)-th web pages; crawling the web page data of the j-th web page; judging whether i is equal to n; if i is equal to n, setting i to 1, incrementing j by 1, and returning to the step of selecting the i-th account from the multiple accounts and using it to access the j-th web page of the target website; if i is less than n, incrementing i by 1, incrementing j by 1, and returning to the same step; judging whether the amount of web page data crawled from the target website has reached a preset value; and, if it has, stopping the crawl of the target website's web page data.
Further, after the i-th account is used to access the j-th web page of the target website, the method also includes: marking the j-th web page, where a marked web page is not accessed again.
Further, obtaining multiple accounts includes: obtaining a configuration file in which the multiple accounts and their corresponding passwords are configured; and loading the configuration file to obtain the multiple accounts and their corresponding passwords. After the multiple accounts are obtained, the method also includes: logging in to the target website with the multiple accounts and their corresponding passwords, and caching identification information, where the identification information is the information the target website uses to identify the multiple accounts.
Further, using the selected account to access a web page of the target website includes: judging whether an exception occurs when the selected account logs in to the target website; if an exception occurs, removing the account whose login failed from the multiple accounts and selecting an account again from the remaining accounts; and if no exception occurs, using the selected account to access the web page of the target website.
According to another aspect of the embodiments of the present invention, a web page data acquisition device is also provided, including: an acquisition unit, configured to obtain multiple accounts, where each of the multiple accounts has login rights for a target website; and a crawling unit, configured to select an account from the multiple accounts, use the selected account to access a web page of the target website, and crawl the web page data of the page accessed by the selected account, where the accounts chosen in any two consecutive selections are different.
Further, the number of accounts is n, where n is greater than or equal to 2, and the crawling unit includes: a first access module, configured to select the i-th account from the multiple accounts and use the i-th account to access the j-th web page of the target website, where i = 1, ..., n and j = 1, 2, 3, ...; when i is greater than or equal to 2, the i-th account is different from the (i-1)-th account; when j is greater than or equal to 2, the j-th web page is one or more web pages of the target website, each different from the 1st through (j-1)-th web pages; a crawling module, configured to crawl the web page data of the j-th web page; and a first judging module, configured to judge whether i is equal to n. The first access module is also configured to, if i is equal to n, set i to 1, increment j by 1, select the i-th account from the multiple accounts, and use it to access the j-th web page of the target website; and, if i is less than n, increment i by 1, increment j by 1, select the i-th account from the multiple accounts, and use it to access the j-th web page of the target website. The crawling unit also includes: a second judging module, configured to judge whether the amount of web page data crawled from the target website has reached a preset value; and a stopping module, configured to stop the crawl of the target website's web page data if the amount crawled has reached the preset value.
Further, the device also includes: a marking unit, configured to mark the j-th web page after the i-th account has been used to access the j-th web page of the target website, where a marked web page is not accessed again.
Further, the acquisition unit includes: an acquisition module, configured to obtain a configuration file in which the multiple accounts and their corresponding passwords are configured; and a loading module, configured to load the configuration file and obtain the multiple accounts and their corresponding passwords. The device also includes: a login unit, configured to, after the multiple accounts are obtained, log in to the target website with the multiple accounts and their corresponding passwords and cache identification information, where the identification information is the information the target website uses to identify the multiple accounts.
Further, the crawling unit includes: a third judging module, configured to judge whether an exception occurs when the selected account logs in to the target website; a removal module, configured to, if an exception occurs, remove the account whose login failed from the multiple accounts and select an account again from the remaining accounts; and a second access module, configured to, if no exception occurs, use the selected account to access the web page of the target website.
According to the embodiments of the present invention, multiple accounts are obtained, and each time an account is selected from them to access the target website and crawl its web page data. This prevents the website's per-account access restrictions from hindering the acquisition of web page data, solves the technical problem that the web page data of a website is difficult to obtain quickly, and achieves the effect of obtaining the website's web page data quickly.
Brief description of the drawings
The drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a web page data acquisition method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a preferred web page data acquisition method according to an embodiment of the present invention; and
Fig. 3 is a schematic diagram of a web page data acquisition device according to an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and so on in the description, the claims, and the drawings are used to distinguish similar objects, not to describe a particular order or sequence. Data used in this way may be interchanged where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have", and any variants of them, are intended to cover non-exclusive inclusion: a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
According to an embodiment of the present invention, an embodiment of a web page data acquisition method is provided. The method can be used to crawl web page data from a target website and is especially effective for target websites that apply an anti-crawler strategy.
It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps may be executed in an order different from the one shown or described here.
Fig. 1 is a flowchart of a web page data acquisition method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102: obtain multiple accounts, where each of the multiple accounts has login rights for the target website.
Step S104: select an account from the multiple accounts, use the selected account to access a web page of the target website, and crawl the web page data of the page accessed by the selected account, where the accounts chosen in any two consecutive selections are different.
The target website is the website whose web page data is to be obtained. Multiple accounts are registered on the website in advance to obtain login rights for it. When the web page data of the target website is to be obtained, the multiple accounts are obtained first; accounts are then selected from them in turn to access web pages of the target website, and the web page data of the accessed pages is crawled.
In the embodiments of the present invention, a crawler program can be used to crawl web page data. The multiple accounts can be set in a configuration file, and the crawler loads the data from this file to obtain the accounts and their corresponding passwords. Accounts can be selected according to a preset rule. The rule may be to pick at random, each time, an account that differs from the one used last; or it may be to pick accounts in a fixed order. Each selected account can access one or more web pages of the target website; when each account accesses several pages, the number of pages accessed per account may be the same or different. Preferably, to keep accounts from being restricted by the website system, each selected account accesses one web page of the target website; alternatively, when each selected account accesses several pages, the number of pages accessed is smaller than the number of accesses the website permits within a preset time. For example, if the target website's restriction strategy is that a single account may initiate no more than 30 access requests within 5 minutes, then after an account is selected it is limited to accessing fewer than 30 pages of the target website while it remains selected. In addition, to avoid several accounts accessing the same web page at the same time, the embodiments select only one account at a time for web page access. After each access, the position of the currently accessed web page is recorded, or the accessed page is marked, so that later accounts do not access pages that have already been visited.
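The two selection rules and the per-turn page limit described above can be illustrated with a minimal Python sketch. The function names `pick_random_account`, `pick_next_in_order`, and `pages_per_turn` are hypothetical and do not come from the patent; the sketch assumes accounts are plain strings and that the site limit is at least 2.

```python
import random
from typing import Optional, Sequence


def pick_random_account(accounts: Sequence[str], last: Optional[str]) -> str:
    """Randomly pick an account that differs from the one used last."""
    candidates = [a for a in accounts if a != last]
    if not candidates:
        raise ValueError("need at least two distinct accounts")
    return random.choice(candidates)


def pick_next_in_order(accounts: Sequence[str], last_index: int) -> int:
    """Return the index of the next account in a fixed, cyclic order."""
    return (last_index + 1) % len(accounts)


def pages_per_turn(site_limit: int, requested: int) -> int:
    """Cap the pages one account visits per turn below the site's limit.

    Assumes site_limit >= 2, e.g. 30 requests per 5 minutes in the text.
    """
    return max(1, min(requested, site_limit - 1))
```

With the 30-requests-per-5-minutes example, `pages_per_turn(30, 50)` caps a turn at 29 pages, which stays below the limit.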
According to the embodiments of the present invention, multiple accounts are obtained, and each time an account is selected from them to access the target website and crawl its web page data. This prevents the website's per-account access restrictions from hindering the acquisition of web page data, solves the technical problem that the web page data of a website is difficult to obtain quickly, and achieves the effect of obtaining the website's web page data quickly.
A scheme that runs the crawl task periodically, once per unit of time, is constrained by time, so its web page data acquisition is inefficient. The embodiments of the present invention instead rotate through multiple accounts to obtain web page data continuously, free of this time constraint, and therefore obtain web page data more efficiently than the periodic scheme.
Fig. 2 is a flowchart of a preferred web page data acquisition method according to an embodiment of the present invention. The method of this embodiment can serve as a preferred implementation of the embodiment above. The number of accounts is n, where n is a natural number greater than or equal to 2.
As shown in Fig. 2, the method includes:
Step S202: obtain multiple accounts, where each of the multiple accounts has login rights for the target website.
Step S204: select the i-th account from the multiple accounts and use the i-th account to access the j-th web page of the target website, where i = 1, ..., n and j = 1, 2, 3, .... When i is greater than or equal to 2, the i-th account is different from the (i-1)-th account. When j is greater than or equal to 2, the j-th web page is one or more web pages of the target website, each different from the 1st through (j-1)-th web pages. Here i ranges over each natural number from 1 to n, and j is a natural number.
Step S206: crawl the web page data of the j-th web page.
Step S208: judge whether i is equal to n.
Step S210: if i is equal to n, set i to 1, increment j by 1, and return to the step of selecting the i-th account from the multiple accounts and using it to access the j-th web page of the target website.
Step S212: if i is less than n, increment i by 1, increment j by 1, and return to the step of selecting the i-th account from the multiple accounts and using it to access the j-th web page of the target website.
Step S214: judge whether the amount of web page data crawled from the target website has reached a preset value.
Step S216: if the amount of web page data crawled from the target website has reached the preset value, stop crawling the target website's web page data.
Optionally, the preset value in step S214 can be set as required: for example, all of the target website's web page data, a fixed number, or a preset proportion of all of the target website's web page data (such as 95%).
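As a small illustration of the three ways of setting the preset value, the sketch below computes a target page count. It assumes the crawl amount is measured in pages and that the site's total page count is known; the names `target_count` and `reached_preset` are hypothetical. The proportion is given as an integer percentage to avoid floating-point rounding.

```python
from typing import Optional


def target_count(total_pages: int,
                 fixed: Optional[int] = None,
                 percent: Optional[int] = None) -> int:
    """Translate the configured preset into a number of pages to crawl."""
    if fixed is not None:
        # A fixed number, never more than the site actually has.
        return min(fixed, total_pages)
    if percent is not None:
        # A proportion of the whole site, rounded up to a whole page.
        return (total_pages * percent + 99) // 100
    # Default: all of the site's web page data.
    return total_pages


def reached_preset(crawled: int, target: int) -> bool:
    """Stop condition checked in step S214."""
    return crawled >= target
```

For a site with 1000 pages and a 95% preset, `target_count(1000, percent=95)` gives 950.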
In this embodiment, after each crawl of web page data, it can be judged whether the amount of web page data crawled from the target website has reached the preset value; if it has, the crawl task can be ended.
For example, with 10 accounts, the crawl starts from the 1st account, accesses web pages of the target website, and crawls their web page data. After the polling has passed through the 10th account, if the web page data has not all been crawled, the process starts again from the 1st account and continues accessing web pages of the target website and crawling web page data.
In this embodiment, multiple accounts are rotated cyclically to obtain web page data, which improves account utilization and the efficiency of crawling web page data.
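The rotation in steps S204 to S216 can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: `crawl_round_robin` and its `fetch` callback are hypothetical names, each account visits one page per turn, the page list is assumed to contain no duplicates, and the preset value is a page count.

```python
from typing import Callable, List, Sequence, Tuple


def crawl_round_robin(
    accounts: Sequence[str],
    pages: Sequence[str],
    fetch: Callable[[str, str], str],
    preset: int,
) -> List[Tuple[str, str, str]]:
    """Rotate through the accounts, one page per turn, until preset pages are crawled."""
    if len(accounts) < 2:
        raise ValueError("n must be at least 2")
    results: List[Tuple[str, str, str]] = []
    i = 0  # zero-based index of the current account
    for page in pages:  # the j-th page; pages are assumed distinct
        if len(results) >= preset:  # S214/S216: stop once the preset is reached
            break
        account = accounts[i]
        results.append((account, page, fetch(account, page)))  # S204 and S206
        # S208 to S212: wrap to the first account after the last one, else advance.
        i = 0 if i == len(accounts) - 1 else i + 1
    return results
```

Because the index always advances or wraps, two consecutive selections never use the same account as long as n is at least 2.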
Preferably, after the i-th account is used to access the j-th web page of the target website, the method also includes: marking the j-th web page, where a marked web page is not accessed again.
In this embodiment, each time a web page of the target website is accessed, the accessed page is marked. After the web page data of that page has been crawled, if the website's web page data has not all been crawled, an account is selected again and an unmarked web page is accessed. Marking accessed pages thus avoids accessing the same page repeatedly and obtaining the same web page data.
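One simple way to realize the marking is a set of visited URLs. The sketch below is an assumption about how marking might be stored, since the patent does not specify a data structure; the class name `VisitedPages` is hypothetical.

```python
from typing import Iterable, List, Set


class VisitedPages:
    """Tracks which pages have been marked as accessed."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def mark(self, url: str) -> None:
        """Mark a page so that later accounts skip it."""
        self._seen.add(url)

    def is_marked(self, url: str) -> bool:
        return url in self._seen

    def unvisited(self, urls: Iterable[str]) -> List[str]:
        """Return the candidate pages that have not been marked, in order."""
        return [u for u in urls if u not in self._seen]
```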
Preferably, obtaining multiple accounts includes: obtaining a configuration file in which the multiple accounts and their corresponding passwords are configured; and loading the configuration file to obtain the multiple accounts and their corresponding passwords. After the multiple accounts are obtained, the method also includes: logging in to the target website with the multiple accounts and their corresponding passwords, and caching identification information, where the identification information is the information the target website uses to identify the multiple accounts.
In the embodiments of the present invention, the multiple accounts and their corresponding passwords are set in a configuration file; the file is obtained and its data loaded to obtain the accounts and passwords. The configuration file may be a default configuration file or an externally supplied one, and the number of accounts in it can be configured as required.
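The patent does not state the configuration file's format. As a sketch only, the example below assumes a JSON file with an `accounts` list and falls back to a built-in default when no external file is supplied; the names `load_accounts` and `DEFAULT_CONFIG_TEXT`, the JSON layout, and the sample credentials are all hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Optional

# Hypothetical default configuration; the real format is not given in the patent.
DEFAULT_CONFIG_TEXT = (
    '{"accounts": ['
    '{"name": "user1", "password": "pw1"}, '
    '{"name": "user2", "password": "pw2"}]}'
)


def load_accounts(external_path: Optional[str] = None) -> Dict[str, str]:
    """Load account/password pairs, preferring an external file if it exists."""
    if external_path is not None and Path(external_path).is_file():
        text = Path(external_path).read_text(encoding="utf-8")
    else:
        # No external configuration was provided: use the default one.
        text = DEFAULT_CONFIG_TEXT
    entries = json.loads(text)["accounts"]
    return {entry["name"]: entry["password"] for entry in entries}
```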
After the multiple accounts are obtained, they can be used to perform simulated logins to the target website, and the identification information of each login is cached so that later accesses to the target website need not log in again. The identification information may be, for example, a cookie.
Taking a crawler program as an example: the first time the crawler obtains a login account, it loads the data in the configuration file; if the user has not provided an external configuration file, the crawler loads the default one. To remain correct when external systems access the crawler concurrently, this whole process can use the ReentrantLock mechanism. After an account is obtained successfully, the position of that account is recorded, so that the next request for an account returns the account at the position following the recorded one.
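The lock-protected position record can be sketched in Python, where `threading.RLock` plays a role comparable to Java's `ReentrantLock`. This is an illustrative analogue rather than the patent's code; the class name `AccountCursor` is hypothetical.

```python
import threading
from typing import List, Sequence


class AccountCursor:
    """Hands out accounts in order, remembering the last position under a lock."""

    def __init__(self, accounts: Sequence[str]) -> None:
        if len(accounts) < 2:
            raise ValueError("need at least two accounts")
        self._accounts: List[str] = list(accounts)
        self._position = -1  # position of the account handed out last
        self._lock = threading.RLock()  # reentrant lock guarding the position

    def next_account(self) -> str:
        """Return the account at the position after the recorded one."""
        with self._lock:
            self._position = (self._position + 1) % len(self._accounts)
            return self._accounts[self._position]
```

Holding the lock while both updating and reading the position keeps concurrent callers from receiving the same slot twice or skipping one.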
After the multiple accounts are obtained, the retrieved account names and passwords are used to perform simulated logins, and the cookies are cached so that no login is needed the next time the account is used. This process likewise uses the ReentrantLock mechanism.
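A cookie cache of this kind might look like the sketch below. The actual login request depends on the target website, so it is passed in as a hypothetical `login` callback that returns a cookie string; `SessionCache` and `cookie_for` are likewise illustrative names, and `threading.RLock` again stands in for `ReentrantLock`.

```python
import threading
from typing import Callable, Dict


class SessionCache:
    """Caches the cookie obtained from a simulated login, one per account."""

    def __init__(self, login: Callable[[str, str], str]) -> None:
        self._login = login  # hypothetical function performing the simulated login
        self._cookies: Dict[str, str] = {}
        self._lock = threading.RLock()

    def cookie_for(self, account: str, password: str) -> str:
        """Return the cached cookie, logging in only on first use."""
        with self._lock:
            if account not in self._cookies:
                self._cookies[account] = self._login(account, password)
            return self._cookies[account]
```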
In the embodiments of the present invention, the multiple accounts are obtained from a configuration file, so the number of accounts can be configured according to the amount of web page data to be crawled. In addition, by logging in through simulation and storing the identification information, later accesses can proceed directly without logging in again.
Preferably, using the selected account to access a web page of the target website includes: judging whether an exception occurs when the selected account logs in to the target website; if an exception occurs, removing the account whose login failed from the multiple accounts and selecting an account again from the remaining accounts; and if no exception occurs, using the selected account to access the web page of the target website.
During web page data acquisition, an account's login may fail, or the website may restrict the account's access. Therefore, when an account is used to access the target website, it can first be judged whether a login exception has occurred for that account. If it has, the account is removed from the multiple accounts so that it is not used next time, and another account is selected to access the web pages of the target website. If no exception has occurred, the selected account can continue to access the target website.
Further, if data acquisition fails during this process because the logged-in account is abnormal, the currently logged-in account can be removed temporarily from the system's account group so that it is not used next time; after resting for a period of time, the account can be added back to the account group to perform crawl tasks.
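Temporary removal and later re-admission of an abnormal account can be sketched as a small pool with a rest period. The class name `AccountPool`, the `rest_seconds` parameter, and the explicit `now` timestamps are assumptions made for illustration; the patent does not specify how long an account rests.

```python
from typing import Dict, List, Sequence


class AccountPool:
    """Account group that sets abnormal accounts aside for a rest period."""

    def __init__(self, accounts: Sequence[str], rest_seconds: float) -> None:
        self._active: List[str] = list(accounts)
        self._resting: Dict[str, float] = {}  # account -> time it may return
        self._rest = rest_seconds

    def suspend(self, account: str, now: float) -> None:
        """Remove an account whose login or access was abnormal."""
        if account in self._active:
            self._active.remove(account)
            self._resting[account] = now + self._rest

    def active(self, now: float) -> List[str]:
        """Return usable accounts, re-adding any whose rest period is over."""
        for account, ready_at in list(self._resting.items()):
            if now >= ready_at:
                del self._resting[account]
                self._active.append(account)
        return list(self._active)
```

Passing the current time in explicitly, for example from `time.monotonic()`, keeps the sketch easy to test.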
The embodiments of the present invention also provide a web page data acquisition device, which can be used to perform the web page data acquisition method of the embodiments above.
Fig. 3 is a schematic diagram of a web page data acquisition device according to an embodiment of the present invention. As shown in Fig. 3, the device includes an acquisition unit 10 and a crawling unit 20.
The acquisition unit 10 is configured to obtain multiple accounts, where each of the multiple accounts has login rights for the target website.
The crawling unit 20 is configured to select an account from the multiple accounts, use the selected account to access a web page of the target website, and crawl the web page data of the page accessed by the selected account, where the accounts chosen in any two consecutive selections are different.
The target website is the website whose web page data is to be obtained. Multiple accounts are registered on the website in advance to obtain login rights for it. When the web page data of the target website is to be obtained, the multiple accounts are obtained first; accounts are then selected from them in turn to access web pages of the target website, and the web page data of the accessed pages is crawled.
In the embodiments of the present invention, a crawler program can be used to crawl web page data. The multiple accounts can be set in a configuration file, and the crawler loads the data from this file to obtain the accounts and their corresponding passwords. Accounts can be selected according to a preset rule. The rule may be to pick at random, each time, an account that differs from the one used last; or it may be to pick accounts in a fixed order. Each selected account can access one or more web pages of the target website; when each account accesses several pages, the number of pages accessed per account may be the same or different. Preferably, to keep accounts from being restricted by the website system, each selected account accesses one web page of the target website; alternatively, when each selected account accesses several pages, the number of pages accessed is smaller than the number of accesses the website permits within a preset time. For example, if the target website's restriction strategy is that a single account may initiate no more than 30 access requests within 5 minutes, then after an account is selected it is limited to accessing fewer than 30 pages of the target website while it remains selected. In addition, to avoid several accounts accessing the same web page at the same time, the embodiments select only one account at a time for web page access. After each access, the position of the currently accessed web page is recorded, or the accessed page is marked, so that later accounts do not access pages that have already been visited.
According to the embodiments of the present invention, multiple accounts are obtained, and each time an account is selected from them to access the target website and crawl its web page data. This prevents the website's per-account access restrictions from hindering the acquisition of web page data, solves the technical problem that the web page data of a website is difficult to obtain quickly, and achieves the effect of obtaining the website's web page data quickly.
A scheme that runs the crawl task periodically, once per unit of time, is constrained by time, so its web page data acquisition is inefficient. The embodiments of the present invention instead rotate through multiple accounts to obtain web page data continuously, free of this time constraint, and therefore obtain web page data more efficiently than the periodic scheme.
Preferably, the account quantity of multiple accounts is that n, n are more than or equal to 2, crawls unit and includes: the first access modules, For selecting the i-th account from multiple accounts, utilize i-th account access targeted website jth webpage, wherein, I=1 ... n, j=1,2,3 ..., when i is more than or equal to 2, the i-th account is the account different from the i-th-1 account, When j is more than or equal to 2, jth webpage is one or more webpage of targeted website, and jth webpage is and the 1st net Page is to the most different webpage of jth-1 webpage;Crawl module, for crawling the web data of jth webpage;First judges Module, is used for judging that whether i is equal to n;First access modules is additionally operable to if it is judged that i is equal to n, then by the value of i Putting 1, the value of j adds 1, selects the i-th account from multiple accounts, utilizes the i-th account to access the jth net of targeted website Page;First access modules is additionally operable to if it is judged that i is less than n, then the value of i adds 1, and the value of j adds 1, from multiple accounts Select the i-th account in number, utilize the i-th account to access the jth webpage of targeted website;Second judge module, is used for sentencing Whether the amount of crawling of the web data of disconnected targeted website reaches preset value;Stopping modular, for if it is judged that target network The amount of crawling of the web data stood reaches preset value, then the web data stopping targeted website crawling.
In this embodiment, after each crawl of web data, it may be judged whether the crawled amount of web data of the target website has reached the preset value; if it has, the crawling task can be ended.
For example, with 10 accounts, crawling starts from the 1st account: a webpage of the target website is accessed and its web data is crawled. After polling has reached the 10th account, if the web data has still not been fully crawled, the process starts again from the 1st account, accessing webpages of the target website and continuing to crawl web data.
In this embodiment, web data is acquired by rotating and cycling through multiple accounts, which improves the utilization of the accounts and the efficiency of crawling web data.
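The rotation over indices i and j described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `crawl_round_robin`, the `fetch` callback, and the idea of counting the crawled amount as the number of pages fetched are all assumptions made for the example.

```python
from typing import Callable, List


def crawl_round_robin(
    accounts: List[str],
    pages: List[str],
    fetch: Callable[[str, str], str],
    preset_amount: int,
) -> List[str]:
    """Rotate through accounts, one page per step, until preset_amount items are crawled."""
    n = len(accounts)
    if n < 2:
        raise ValueError("the scheme requires n >= 2 accounts")
    results: List[str] = []
    i = 1  # 1-based account index
    j = 1  # 1-based page index
    while j <= len(pages) and len(results) < preset_amount:
        # Access the j-th page with the i-th account and crawl its data.
        results.append(fetch(accounts[i - 1], pages[j - 1]))
        # If i equals n, wrap back to 1; otherwise advance. j always advances.
        i = 1 if i == n else i + 1
        j += 1
    return results
```

Because n is at least 2 and the index always moves on after each page, two adjacent selections never use the same account, which is the property the rotation relies on.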
Preferably, the device further includes: a marking unit, configured to mark the j-th webpage after the i-th account is used to access the j-th webpage of the target website, where a marked webpage is not accessed again subsequently.
In this embodiment, each time a webpage of the target website is accessed, the accessed webpage is marked. After the web data of that webpage has been crawled, if the web data of the website has not been fully crawled, an account is selected again and an unmarked webpage is accessed. Marking the accessed webpages in this way avoids repeatedly accessing the same webpage and obtaining the same web data.
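One simple way to realize the marking described above is a set of already-accessed page addresses. The sketch below assumes pages are identified by URL strings; the class name `PageMarker` and its methods are hypothetical and do not come from the patent.

```python
from typing import Iterable, Optional, Set


class PageMarker:
    """Remembers which pages have been accessed so they are not visited again."""

    def __init__(self) -> None:
        self._marked: Set[str] = set()

    def mark(self, url: str) -> None:
        """Mark a page after it has been accessed."""
        self._marked.add(url)

    def is_marked(self, url: str) -> bool:
        return url in self._marked

    def next_unmarked(self, candidates: Iterable[str]) -> Optional[str]:
        """Return the first candidate that has not been marked, or None if all are marked."""
        for url in candidates:
            if url not in self._marked:
                return url
        return None
```

A crawl loop would call `next_unmarked` to choose the next page, access it with the newly selected account, and then call `mark` so the page is skipped afterwards.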
Preferably, the acquiring unit includes: an acquisition module, configured to obtain a configuration file, where the configuration file is configured with the multiple accounts and their corresponding passwords; and a loading module, configured to load the configuration file to obtain the multiple accounts and their corresponding passwords. The device further includes: a login unit, configured to, after the multiple accounts are obtained, log in to the target website using the multiple accounts and their corresponding passwords, and to cache identification information, where the identification information is the information by which the target website identifies the multiple accounts.
In the embodiment of the present invention, the multiple accounts and their corresponding passwords are set in a configuration file; the configuration file is obtained and its data is loaded to obtain the multiple accounts and their corresponding passwords. The configuration file may be a default configuration file or an externally supplied configuration file, and the number of accounts in the configuration file may be configured as required.
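The patent does not specify a file format, so the following sketch assumes a JSON file of the form `{"accounts": [{"username": "...", "password": "..."}]}`. The function name `load_accounts` and the default path `accounts.default.json` are likewise hypothetical. It shows only the fallback from an external configuration file to the default one.

```python
import json
from pathlib import Path
from typing import List, Optional, Tuple

# Hypothetical default location; the source does not name one.
DEFAULT_CONFIG = Path("accounts.default.json")


def load_accounts(external: Optional[Path] = None,
                  default: Path = DEFAULT_CONFIG) -> List[Tuple[str, str]]:
    """Return (account, password) pairs from the external file if given, else the default."""
    path = external if external is not None else default
    with path.open(encoding="utf-8") as fh:
        data = json.load(fh)
    # Assumed layout: {"accounts": [{"username": "...", "password": "..."}, ...]}
    return [(entry["username"], entry["password"]) for entry in data["accounts"]]
```

Adding or removing entries in the file changes the number of accounts available for rotation without touching the crawler itself.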
After the multiple accounts are obtained, they can be used to simulate logging in to the target website, and the identification information of each login is cached so that subsequent accesses to the target website need no login. The identification information may be, for example, a cookie.
Taking a crawler program as an example: when it obtains a login account for the first time, the crawler program loads the data in the configuration file; if the user does not provide an external configuration file, the crawler program loads the default configuration file. To ensure that external systems can access the crawler program concurrently, this whole process may use the ReentrantLock mechanism. After an account is successfully obtained, its position is recorded, which ensures that the next request for an account rotates to the next account.
After the multiple accounts are obtained, the obtained accounts and passwords are used to simulate login, and the cookie is cached so that no login is needed the next time the account is used. This process likewise uses the ReentrantLock mechanism.
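The locking, position recording, and cookie caching described above can be sketched in Python, using `threading.RLock` as a stand-in for Java's `ReentrantLock`. The class name `AccountRotator` and the `login` callback, which is assumed to return a cookie string, are hypothetical; a real crawler would perform the simulated login against the target website.

```python
import threading
from typing import Callable, Dict, List, Tuple


class AccountRotator:
    """Hands out accounts in turn and caches one login cookie per account."""

    def __init__(self, accounts: List[Tuple[str, str]],
                 login: Callable[[str, str], str]) -> None:
        if not accounts:
            raise ValueError("at least one account is required")
        self._accounts = accounts
        self._login = login              # hypothetical: returns a cookie string
        self._lock = threading.RLock()   # reentrant lock guarding all shared state
        self._position = 0               # index of the next account to hand out
        self._cookies: Dict[str, str] = {}

    def next_account(self) -> str:
        """Return the next account name and advance the recorded position."""
        with self._lock:
            name, _ = self._accounts[self._position]
            self._position = (self._position + 1) % len(self._accounts)
            return name

    def cookie_for(self, name: str) -> str:
        """Return the cached cookie, logging in only on first use of the account."""
        with self._lock:
            if name not in self._cookies:
                password = dict(self._accounts)[name]
                self._cookies[name] = self._login(name, password)
            return self._cookies[name]
```

Holding the lock around both the read and the update of the position is what keeps concurrent callers from receiving the same account twice in a row or skipping one.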
In the embodiment of the present invention, the multiple accounts are obtained from a configuration file, so the number of accounts can be configured according to the amount of web data to be crawled. In addition, by simulating login and storing the identification information, the next access can proceed directly without logging in again.
Preferably, the crawling unit includes: a third judging module, configured to judge whether an anomaly occurs when the selected account logs in to the target website; a removing module, configured to, if an anomaly is judged to occur when the selected account logs in to the target website, remove the account with the login anomaly from the multiple accounts and select an account again from the remaining accounts; and a second access module, configured to, if no anomaly is judged to occur when the selected account logs in to the target website, use the selected account to access the webpage of the target website.
During web data acquisition, an account may fail to log in, or the website may restrict the account's access. Therefore, when an account is used to access the target website, it can first be judged whether a login anomaly has occurred for that account. If an anomaly has occurred, the account is removed from the multiple accounts so that it is not used next time, and another account is selected to access the webpage of the target website; if no anomaly has occurred, the selected account can continue to be used to access the target website.
Further, if data acquisition fails during this process because of a login account anomaly, the currently logged-in account can be temporarily removed from the account group in the system so that it is not used next time; after it has rested for a period of time, the account can be added back to the account group to perform crawling tasks.
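The removal and later re-admission of an abnormal account can be sketched as an account pool with a rest period. This is an illustration under assumptions: the names `AccountPool`, `report_login_failure`, and `access_with_healthy_account`, the `login_ok` callback, and the fixed cooldown length are all hypothetical, since the patent only says the account rests "for a period of time".

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List, Optional


class AccountPool:
    """Round-robin pool that sidelines abnormal accounts for a cooldown period."""

    def __init__(self, accounts: List[str], cooldown_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._active: Deque[str] = deque(accounts)
        self._resting: Dict[str, float] = {}  # account -> time it may return
        self._cooldown = cooldown_seconds
        self._clock = clock

    def _restore_rested(self) -> None:
        """Move accounts whose rest period has elapsed back into the active group."""
        now = self._clock()
        for name, ready_at in list(self._resting.items()):
            if now >= ready_at:
                del self._resting[name]
                self._active.append(name)

    def select(self) -> Optional[str]:
        """Return the next active account, or None if every account is resting."""
        self._restore_rested()
        if not self._active:
            return None
        name = self._active.popleft()
        self._active.append(name)  # rotate so the next call picks a different account
        return name

    def report_login_failure(self, name: str) -> None:
        """Temporarily remove an account whose login was abnormal."""
        if name in self._active:
            self._active.remove(name)
            self._resting[name] = self._clock() + self._cooldown


def access_with_healthy_account(pool: AccountPool,
                                login_ok: Callable[[str], bool]) -> Optional[str]:
    """Select accounts until one logs in without an anomaly; return it, or None."""
    while True:
        name = pool.select()
        if name is None:
            return None
        if login_ok(name):
            return name
        pool.report_login_failure(name)
```

Passing the clock in as a parameter is a design choice for the sketch: it lets the rest period be exercised in tests without actually waiting.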
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a portable hard drive, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make a number of improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A web data acquisition method, characterized by comprising:
Obtaining multiple accounts, wherein the multiple accounts are accounts that have login rights to a target website; and
Selecting an account from the multiple accounts, using the selected account to access a webpage of the target website, and crawling the web data of the webpage accessed by the selected account, wherein the accounts chosen in two adjacent selections are different.
2. The method according to claim 1, characterized in that the number of the multiple accounts is n, n is greater than or equal to 2, and selecting an account from the multiple accounts, using the selected account to access a webpage of the target website, and crawling the web data of the webpage accessed by the selected account comprises:
Selecting the i-th account from the multiple accounts and using the i-th account to access the j-th webpage of the target website, wherein i = 1, ..., n and j = 1, 2, 3, ...; when i is greater than or equal to 2, the i-th account is an account different from the (i-1)-th account; when j is greater than or equal to 2, the j-th webpage is one or more webpages of the target website and differs from each of the 1st through (j-1)-th webpages;
Crawling the web data of the j-th webpage;
Judging whether i is equal to n;
If i is judged to be equal to n, setting the value of i to 1, adding 1 to the value of j, and returning to the step of selecting the i-th account from the multiple accounts and using the i-th account to access the j-th webpage of the target website;
If i is judged to be less than n, adding 1 to the value of i, adding 1 to the value of j, and returning to the step of selecting the i-th account from the multiple accounts and using the i-th account to access the j-th webpage of the target website;
Judging whether the crawled amount of web data of the target website has reached a preset value;
If the crawled amount of web data of the target website is judged to have reached the preset value, stopping the crawling of web data of the target website.
3. The method according to claim 2, characterized in that
After the i-th account is used to access the j-th webpage of the target website, the method further comprises: marking the j-th webpage, wherein a marked webpage is not accessed again subsequently.
4. The method according to claim 1, characterized in that
Obtaining multiple accounts comprises: obtaining a configuration file, wherein the configuration file is configured with the multiple accounts and their corresponding passwords; and loading the configuration file to obtain the multiple accounts and their corresponding passwords,
Wherein, after the multiple accounts are obtained, the method further comprises: logging in to the target website using the multiple accounts and their corresponding passwords, and caching identification information, wherein the identification information is the information by which the target website identifies the multiple accounts.
5. The method according to any one of claims 1 to 4, characterized in that using the selected account to access a webpage of the target website comprises:
Judging whether an anomaly occurs when the selected account logs in to the target website;
If an anomaly is judged to occur when the selected account logs in to the target website, removing the account with the login anomaly from the multiple accounts, and selecting an account again from the remaining accounts;
If no anomaly is judged to occur when the selected account logs in to the target website, using the selected account to access the webpage of the target website.
6. A web data acquisition device, characterized by comprising:
An acquiring unit, configured to obtain multiple accounts, wherein the multiple accounts are accounts that have login rights to a target website; and
A crawling unit, configured to select an account from the multiple accounts, use the selected account to access a webpage of the target website, and crawl the web data of the webpage accessed by the selected account, wherein the accounts chosen in two adjacent selections are different.
7. The device according to claim 6, characterized in that the number of the multiple accounts is n, n is greater than or equal to 2, and the crawling unit comprises:
A first access module, configured to select the i-th account from the multiple accounts and use the i-th account to access the j-th webpage of the target website, wherein i = 1, ..., n and j = 1, 2, 3, ...; when i is greater than or equal to 2, the i-th account is an account different from the (i-1)-th account; when j is greater than or equal to 2, the j-th webpage is one or more webpages of the target website and differs from each of the 1st through (j-1)-th webpages;
A crawling module, configured to crawl the web data of the j-th webpage;
A first judging module, configured to judge whether i is equal to n;
The first access module is further configured to, if i is judged to be equal to n, set the value of i to 1, add 1 to the value of j, select the i-th account from the multiple accounts, and use the i-th account to access the j-th webpage of the target website;
The first access module is further configured to, if i is judged to be less than n, add 1 to the value of i, add 1 to the value of j, select the i-th account from the multiple accounts, and use the i-th account to access the j-th webpage of the target website;
A second judging module, configured to judge whether the crawled amount of web data of the target website has reached a preset value;
A stopping module, configured to stop crawling the web data of the target website if the crawled amount of web data of the target website is judged to have reached the preset value.
8. The device according to claim 7, characterized in that the device further comprises:
A marking unit, configured to mark the j-th webpage after the i-th account is used to access the j-th webpage of the target website, wherein a marked webpage is not accessed again subsequently.
9. The device according to claim 6, characterized in that
The acquiring unit comprises: an acquisition module, configured to obtain a configuration file, wherein the configuration file is configured with the multiple accounts and their corresponding passwords; and a loading module, configured to load the configuration file to obtain the multiple accounts and their corresponding passwords,
Wherein the device further comprises: a login unit, configured to, after the multiple accounts are obtained, log in to the target website using the multiple accounts and their corresponding passwords, and to cache identification information, wherein the identification information is the information by which the target website identifies the multiple accounts.
10. The device according to any one of claims 6 to 9, characterized in that the crawling unit comprises:
A third judging module, configured to judge whether an anomaly occurs when the selected account logs in to the target website;
A removing module, configured to, if an anomaly is judged to occur when the selected account logs in to the target website, remove the account with the login anomaly from the multiple accounts, and select an account again from the remaining accounts;
A second access module, configured to, if no anomaly is judged to occur when the selected account logs in to the target website, use the selected account to access the webpage of the target website.
CN201510250516.6A 2015-05-15 2015-05-15 Web data acquisition methods and device Pending CN106294369A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510250516.6A CN106294369A (en) 2015-05-15 2015-05-15 Web data acquisition methods and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510250516.6A CN106294369A (en) 2015-05-15 2015-05-15 Web data acquisition methods and device

Publications (1)

Publication Number Publication Date
CN106294369A true CN106294369A (en) 2017-01-04

Family

ID=57632274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510250516.6A Pending CN106294369A (en) 2015-05-15 2015-05-15 Web data acquisition methods and device

Country Status (1)

Country Link
CN (1) CN106294369A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704497A (en) * 2017-08-25 2018-02-16 上海壹账通金融科技有限公司 Web data crawling method, device, web data crawl platform and storage medium
CN109375960A (en) * 2018-09-29 2019-02-22 郑州云海信息技术有限公司 A kind of copyright information loading method and device
WO2019237547A1 (en) * 2018-06-11 2019-12-19 平安科技(深圳)有限公司 Data crawling method and apparatus, and computer device and storage medium
CN110619072A (en) * 2019-08-29 2019-12-27 凡普数字技术有限公司 Bank account information acquisition method and device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030135487A1 (en) * 2002-01-11 2003-07-17 Beyer Kevin Scott Automated access to web content based on log analysis
CN101872365A (en) * 2010-07-02 2010-10-27 苏州阔地网络科技有限公司 Method for realizing one-key login to other website on webpage
CN103902386A (en) * 2014-04-11 2014-07-02 复旦大学 Multi-thread network crawler processing method based on connection proxy optimal management
CN104615627A (en) * 2014-09-23 2015-05-13 中国科学院计算技术研究所 Event public sentiment information extracting method and system based on micro-blog platform

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
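史春永 (Shi Chunyong): "面向新浪微博的数据采集和社区发现算法研究" [Research on data collection and community discovery algorithms for Sina Weibo], 《中国优秀硕士学位论文全文数据库 信息科技辑》 [China Master's Theses Full-text Database, Information Science and Technology] *
孙青云等 (Sun Qingyun et al.): "一种基于模拟登录的微博数据采集方案" [A microblog data collection scheme based on simulated login], 《计算机技术与发展》 [Computer Technology and Development] *
蒋建军 (Jiang Jianjun): "《计算机网络技术实训教程》" [Practical Training Course in Computer Network Technology], 30 April 2001, 上海交通大学出版社 [Shanghai Jiao Tong University Press] *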

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704497A (en) * 2017-08-25 2018-02-16 上海壹账通金融科技有限公司 Web data crawling method, device, web data crawl platform and storage medium
WO2019037417A1 (en) * 2017-08-25 2019-02-28 深圳壹账通智能科技有限公司 Webpage data crawling method and apparatus, webpage data crawling platform, and storage medium
WO2019237547A1 (en) * 2018-06-11 2019-12-19 平安科技(深圳)有限公司 Data crawling method and apparatus, and computer device and storage medium
CN109375960A (en) * 2018-09-29 2019-02-22 郑州云海信息技术有限公司 A kind of copyright information loading method and device
CN110619072A (en) * 2019-08-29 2019-12-27 凡普数字技术有限公司 Bank account information acquisition method and device and storage medium

Similar Documents

Publication Publication Date Title
CN107797908A (en) A kind of behavioral data acquisition method of website user
CN106294369A (en) Web data acquisition methods and device
CN104539459B (en) Network control method on router and router
CN106844522A (en) A kind of network data crawling method and device
CN106897284A (en) The recommendation method and device of e-book
CN107958456A (en) Dispensing detection method, device and electronic equipment
CN106131047A (en) Account login method and relevant device, account login system
CN107800591A (en) A kind of analysis method of unified daily record data
CN103593444B (en) Internet Keyword identifying processing method and apparatus
CN104537005B (en) Data processing method and device for web page crawl
CN104202291A (en) Anti-phishing method based on multi-factor comprehensive assessment method
CN106936778A (en) The abnormal detection method of website traffic and device
CN108601023A (en) Home-network linkups authentication method, device, electronic equipment and storage medium
CN106874165A (en) Page detection method and device
CN107483381A (en) The monitoring method and device of interlock account
CN107204956A (en) website identification method and device
CN107948052A (en) Information crawler method, apparatus, electronic equipment and system
CN107104924A (en) The verification method and device of website backdoor file
CN107888606A (en) A kind of domain name credit assessment and system
CN107689941A (en) A kind of apparatus and method for preventing same user's repeat logon
CN106789837A (en) Network anomalous behaviors detection method and detection means
CN107124426A (en) The method for authenticating and device of a kind of user's right
CN109544238A (en) User behavior method for tracing, device, server and storage medium
CN108038218A (en) A kind of distributed reptile method, electronic equipment and server
CN108510304A (en) Construction method, electronic device and the storage medium of target customers

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

CB02 Change of applicant information
RJ01 Rejection of invention patent application after publication

Application publication date: 20170104

RJ01 Rejection of invention patent application after publication