CN106294369A - Web data acquisition method and device
- Publication number: CN106294369A (application number CN201510250516.6A)
- Authority: CN (China)
- Prior art keywords
- account
- webpage
- targeted website
- web data
- access
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/951—Indexing; Web crawling techniques
Abstract
The invention discloses a web data acquisition method and device. The method includes: obtaining multiple accounts, each of which has login rights for a target website; and selecting an account from the multiple accounts, using the selected account to access webpages of the target website, and crawling the web data of the webpages accessed by the selected account, wherein the accounts chosen in any two consecutive selections differ. The invention solves the technical problem in the prior art that the web data of a website is difficult to obtain quickly.
Description
Technical field
The present invention relates to the field of data processing, and in particular to a web data acquisition method and device.
Background
At present, in the field of web data processing, crawler programs are commonly used to crawl web data. A crawler is a program that automatically extracts web data; developers can use it to download, analyze, and process web data, which provides a basis for data application services and creates commercial value. However, many websites now not only require visitors to authenticate, but also count how many times the currently logged-in user accesses different pages within a given period. In other words, these websites adopt anti-crawler strategies to restrict crawlers, which obstructs the crawling of web data. When a user runs a crawler against a particular website system, the site's anti-crawler strategy usually identifies the crawler as a malicious visitor and takes various measures to stop it from continuing to access the site. The inventors also found that even when a crawler logs in to the website with a previously registered account, it is still subject to the limit on access counts, which makes it difficult for the user to obtain the website's web data quickly.
No effective solution to this problem has yet been proposed.
Summary of the invention
Embodiments of the present invention provide a web data acquisition method and device, to at least solve the technical problem that the web data of a website is difficult to obtain quickly.
According to one aspect of the embodiments of the present invention, a web data acquisition method is provided, including: obtaining multiple accounts, each of which has login rights for a target website; and selecting an account from the multiple accounts, using the selected account to access webpages of the target website, and crawling the web data of the webpages accessed by the selected account, wherein the accounts chosen in any two consecutive selections differ.
Further, the number of accounts is n, where n is greater than or equal to 2, and selecting an account from the multiple accounts, using the selected account to access webpages of the target website, and crawling the web data of the accessed webpages includes: selecting the i-th account from the multiple accounts and using the i-th account to access the j-th webpage of the target website, where i = 1, ..., n and j = 1, 2, 3, ...; when i is greater than or equal to 2, the i-th account differs from the (i-1)-th account; when j is greater than or equal to 2, the j-th webpage is one or more webpages of the target website and differs from each of the 1st through (j-1)-th webpages; crawling the web data of the j-th webpage; judging whether i equals n; if i equals n, resetting i to 1, incrementing j by 1, and returning to the step of selecting the i-th account from the multiple accounts and using it to access the j-th webpage of the target website; if i is less than n, incrementing i by 1, incrementing j by 1, and returning to the same step; judging whether the amount of web data crawled from the target website has reached a preset value; and if it has, stopping the crawling of the target website's web data.
Further, after the i-th account is used to access the j-th webpage of the target website, the method also includes: marking the j-th webpage, wherein a marked webpage is not accessed again.
Further, obtaining the multiple accounts includes: obtaining a configuration file in which the multiple accounts and their corresponding passwords are configured; and loading the configuration file to obtain the multiple accounts and their corresponding passwords. After the multiple accounts are obtained, the method also includes: logging in to the target website with the multiple accounts and their corresponding passwords, and caching identification information, wherein the identification information is the information by which the target website identifies the multiple accounts.
Further, using the selected account to access webpages of the target website includes: judging whether an exception occurs when the selected account logs in to the target website; if an exception occurs, removing the account with the login exception from the multiple accounts and selecting an account again from the remaining accounts; and if no exception occurs, using the selected account to access webpages of the target website.
According to another aspect of the embodiments of the present invention, a web data acquisition device is also provided, including: an acquiring unit, configured to obtain multiple accounts, each of which has login rights for a target website; and a crawling unit, configured to select an account from the multiple accounts, use the selected account to access webpages of the target website, and crawl the web data of the webpages accessed by the selected account, wherein the accounts chosen in any two consecutive selections differ.
Further, the account quantity of the plurality of account is n, and described n is more than or equal to 2, described in crawl unit and include:
First access modules, for selecting the i-th account from the plurality of account, utilizes described i-th account to access target network
The jth webpage stood, wherein, described i=1 ... n, described j=1,2,3 ..., when described i is more than or equal to 2
Time, described i-th account is the account different from the i-th-1 account, when described j is more than or equal to 2, and described jth net
Page is one or more webpage of described targeted website, and described jth webpage is equal to jth-1 webpage with the 1st webpage
Different webpages;Crawl module, for crawling the web data of described jth webpage;First judge module, is used for sentencing
Whether disconnected described i is equal to described n;First access modules is additionally operable to if it is judged that described i is equal to described n, then by institute
The value stating i puts 1, and the value of described j adds 1, selects the i-th account from the plurality of account, utilizes described i-th account
Access the jth webpage of targeted website;First access modules is additionally operable to if it is judged that described i is less than described n, then institute
The value stating i adds 1, and the value of described j adds 1, selects the i-th account from the plurality of account, utilizes described i-th account
Access the jth webpage of targeted website;Second judge module, for judging crawling of the web data of described targeted website
Whether amount reaches preset value;Stopping modular, for if it is judged that the amount of crawling of web data of described targeted website reaches
To described preset value, then the web data stopping described targeted website crawls.
Further, the device also includes: a marking unit, configured to mark the j-th webpage after the i-th account is used to access the j-th webpage of the target website, wherein a marked webpage is not accessed again.
Further, the acquiring unit includes: an obtaining module, configured to obtain a configuration file in which the multiple accounts and their corresponding passwords are configured; and a loading module, configured to load the configuration file and obtain the multiple accounts and their corresponding passwords. The device also includes: a login unit, configured to log in to the target website with the multiple accounts and their corresponding passwords after the multiple accounts are obtained, and to cache identification information, wherein the identification information is the information by which the target website identifies the multiple accounts.
Further, the crawling unit includes: a third judging module, configured to judge whether an exception occurs when the selected account logs in to the target website; a removing module, configured to, if an exception occurs, remove the account with the login exception from the multiple accounts and select an account again from the remaining accounts; and a second access module, configured to, if no exception occurs, use the selected account to access webpages of the target website.
According to the embodiments of the present invention, multiple accounts are obtained and, each time, an account is selected from them to access the target website and crawl its web data. This prevents the website's per-account access restrictions from obstructing the acquisition of web data, solves the technical problem that the web data of a website is difficult to obtain quickly, and achieves the effect of obtaining the website's web data quickly.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The exemplary embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a web data acquisition method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a preferred web data acquisition method according to an embodiment of the present invention; and
Fig. 3 is a schematic diagram of a web data acquisition device according to an embodiment of the present invention.
Detailed description of the invention
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the invention described here can be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
According to the embodiments of the present invention, an embodiment of a web data acquisition method is provided. The method can be used to crawl web data from a target website, and its effect is especially pronounced for target websites that have an anti-crawler strategy.
It should be noted that the steps shown in the flowcharts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.
Fig. 1 is a flowchart of a web data acquisition method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102: obtain multiple accounts, each of which has login rights for the target website.
Step S104: select an account from the multiple accounts, use the selected account to access webpages of the target website, and crawl the web data of the webpages accessed by the selected account, wherein the accounts chosen in any two consecutive selections differ.
The target website is the website whose web data is to be obtained. Multiple accounts are registered on the website in advance to obtain login rights for it. To obtain the web data of the target website, the multiple accounts are obtained first; accounts are then selected from them one after another to access webpages of the target website, and the web data of the accessed webpages is crawled.
In embodiments of the invention, a crawler program can be used to crawl web data. The multiple accounts can be stored in a configuration file, from which the crawler loads the accounts and their corresponding passwords. Accounts can be selected according to a preset rule. The rule may be to select, at random, an account that differs from the previously selected one, or to select accounts in a fixed order. Each selected account can access one or more webpages of the target website; when each account accesses several webpages, the number of webpages accessed may or may not be the same for every account. Preferably, to keep accounts from being restricted by the website system, each selected account accesses one webpage of the target website; alternatively, when each selected account accesses several webpages, the number of webpages accessed is less than the number of accesses the website allows within a preset time. For example, if the target website's restriction policy is that a single account may issue no more than 30 access requests within 5 minutes, then after an account is selected, the number of target-website webpages it accesses while it is the currently selected account is limited to fewer than 30. In addition, to prevent multiple accounts from accessing the same webpage at the same time, only one account is selected at a time to access webpages in embodiments of the invention. After each access, the position of the currently accessed webpage is recorded, or the accessed webpage is marked, so that later accounts do not access webpages that have already been accessed.
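The random selection rule and the per-turn access cap described above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function names `pick_account` and `pages_per_turn` and their parameters are hypothetical, and the cap simply keeps each turn one page below the site's limit.

```python
import random


def pick_account(accounts, last=None, rng=random):
    """Pick a random account that differs from the previously selected one."""
    candidates = [a for a in accounts if a != last]
    if not candidates:
        raise ValueError("need at least two distinct accounts")
    return rng.choice(candidates)


def pages_per_turn(requested, site_limit):
    """Cap the pages fetched in one turn strictly below the site's limit."""
    if site_limit < 2:
        raise ValueError("site_limit must leave room for at least one page")
    return min(requested, site_limit - 1)
```

With the 30-requests-per-5-minutes policy from the example, `pages_per_turn(50, 30)` yields 29, so the selected account stays under the limit before the next account takes over.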
According to the embodiments of the present invention, multiple accounts are obtained and, each time, an account is selected from them to access the target website and crawl its web data. This prevents the website's per-account access restrictions from obstructing the acquisition of web data, solves the technical problem that the web data of a website is difficult to obtain quickly, and achieves the effect of obtaining the website's web data quickly.
With approaches that execute tasks periodically at fixed time intervals, the time constraint makes web data acquisition inefficient. The embodiments of the present invention instead rotate multiple accounts to obtain web data continuously, unaffected by timing, so web data acquisition is more efficient than with periodic task execution.
Fig. 2 is a flowchart of a preferred web data acquisition method according to an embodiment of the present invention. The method of this embodiment may serve as a preferred implementation of the above embodiment. The number of accounts is n, where n is a natural number greater than or equal to 2.
As shown in Fig. 2, the method includes:
Step S202: obtain multiple accounts, each of which has login rights for the target website.
Step S204: select the i-th account from the multiple accounts and use the i-th account to access the j-th webpage of the target website. Here i = 1, ..., n and j = 1, 2, 3, ...; when i is greater than or equal to 2, the i-th account differs from the (i-1)-th account; when j is greater than or equal to 2, the j-th webpage is one or more webpages of the target website and differs from each of the 1st through (j-1)-th webpages. The index i traverses each natural number from 1 to n, and j is a natural number.
Step S206: crawl the web data of the j-th webpage.
Step S208: judge whether i equals n.
Step S210: if i equals n, reset i to 1, increment j by 1, and return to the step of selecting the i-th account from the multiple accounts and using it to access the j-th webpage of the target website.
Step S212: if i is less than n, increment i by 1, increment j by 1, and return to the step of selecting the i-th account from the multiple accounts and using it to access the j-th webpage of the target website.
Step S214: judge whether the amount of web data crawled from the target website has reached a preset value.
Step S216: if the amount of web data crawled from the target website has reached the preset value, stop crawling the web data of the target website.
Optionally, the preset value in step S214 can be set as needed: for example, all of the target website's web data, a set numerical value, or a preset proportion of all of the target website's web data (such as 95%).
In this embodiment, after each crawl of web data, it may be judged whether the amount of web data crawled from the target website has reached the preset value; if it has, the crawl task can end.
For example, with 10 accounts, crawling starts from the 1st account, which accesses webpages of the target website and crawls their web data. After polling reaches the 10th account, if the web data has not yet been fully crawled, the process starts again from the 1st account, accessing webpages of the target website and continuing to crawl web data.
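The loop of steps S204 to S216 can be sketched as below, under simplifying assumptions: each j-th webpage is a single page taken from a known list, the preset value is a page count, and `fetch` is a hypothetical callable standing in for the logged-in page access and crawl. None of these names come from the original text.

```python
def crawl_round_robin(accounts, pages, fetch, preset):
    """Rotate accounts per steps S204-S216; stop once `preset` pages are crawled."""
    n = len(accounts)
    if n < 2:
        raise ValueError("the method assumes n >= 2 accounts")
    results = []
    i = 1  # 1-based account index, as in the text
    j = 1  # 1-based webpage index
    while j <= len(pages) and len(results) < preset:  # S214 / S216
        account, page = accounts[i - 1], pages[j - 1]
        results.append((account, page, fetch(account, page)))  # S204 + S206
        i = 1 if i == n else i + 1  # S208, then S210 or S212
        j += 1
    return results
```

Because i advances on every iteration and wraps from n back to 1, consecutive pages are always fetched with different accounts when n is at least 2.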
In this embodiment, multiple accounts are rotated cyclically to obtain web data, which improves both account utilization and the efficiency of web data crawling.
Preferably, after the i-th account is used to access the j-th webpage of the target website, the method also includes: marking the j-th webpage, wherein a marked webpage is not accessed again.
In this embodiment, each time a webpage of the target website is accessed, the accessed webpage is marked. After the web data of that webpage has been crawled, if the website's web data has not yet been fully crawled, an account is selected again and used to access a webpage that has not been marked. Marking accessed webpages in this way avoids accessing the same webpage repeatedly and obtaining identical web data.
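One simple way to realize the marking is a shared set of visited pages, as in the sketch below. The generator name `unvisited` and the use of a Python `set` as the marker store are assumptions made for illustration; the text does not specify how marks are stored.

```python
def unvisited(pages, visited):
    """Yield pages that are not yet marked, marking each one as it is handed out."""
    for page in pages:
        if page in visited:
            continue  # already accessed by some account; skip it
        visited.add(page)
        yield page
```

Because the same `visited` set is passed in on every call, a page handed to one account is never handed to a later account.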
Preferably, obtaining the multiple accounts includes: obtaining a configuration file in which the multiple accounts and their corresponding passwords are configured; and loading the configuration file to obtain the multiple accounts and their corresponding passwords. After the multiple accounts are obtained, the method also includes: logging in to the target website with the multiple accounts and their corresponding passwords, and caching identification information, wherein the identification information is the information by which the target website identifies the multiple accounts.
In embodiments of the invention, the multiple accounts and their corresponding passwords are stored in a configuration file; the file is obtained and its data loaded to obtain the accounts and passwords. The configuration file may be a default configuration file or an externally supplied one, and the number of accounts in it can be configured as needed.
After the multiple accounts are obtained, they can be used to simulate logging in to the target website, and the identification information from each login can be cached so that subsequent accesses to the target website require no login. The identification information may be, for example, a cookie.
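The configuration loading and login caching described above might look like the following sketch. The JSON layout with `accounts`, `user`, and `password` keys is an assumed format (the text does not specify one), and `login` is a hypothetical callable that performs the simulated login and returns the identification information, such as a cookie.

```python
import json


def load_accounts(config_text):
    """Parse an assumed JSON config: {"accounts": [{"user": ..., "password": ...}]}."""
    entries = json.loads(config_text)["accounts"]
    return [(entry["user"], entry["password"]) for entry in entries]


def login_all(accounts, login):
    """Log in each account once and cache whatever `login` returns, keyed by user."""
    return {user: login(user, password) for user, password in accounts}
```

Later requests can then look up the cached value for the selected account instead of logging in again.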
Taking a crawler program as an example, the crawler loads the data in the configuration file the first time it obtains a login account. If the user does not provide an external configuration file, the crawler loads the default configuration file. To support concurrent access to the crawler by external systems, this whole process may use the ReentrantLock mechanism. After an account is obtained successfully, the position of that account is recorded, so that the next request for an account returns the account at the position following the recorded one.
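The lock-protected position record can be sketched as follows. The text names Java's ReentrantLock; this sketch substitutes Python's `threading.RLock`, which is also a reentrant lock, and the class name `AccountCursor` is hypothetical.

```python
import threading


class AccountCursor:
    """Hand out accounts in order, remembering the last position under a lock."""

    def __init__(self, accounts):
        self._accounts = list(accounts)
        self._pos = -1  # position of the most recently handed-out account
        self._lock = threading.RLock()  # stand-in for Java's ReentrantLock

    def next_account(self):
        # The lock makes "advance the position and read the account" atomic,
        # so concurrent callers never receive the same position twice in a row.
        with self._lock:
            self._pos = (self._pos + 1) % len(self._accounts)
            return self._accounts[self._pos]
```

Each call returns the account after the recorded position and wraps around at the end of the list, so callers on different threads still see a strict rotation.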
After the multiple accounts are obtained, the obtained account names and passwords are used to simulate login, and the cookie is cached so that no login is needed the next time the account is used. This process likewise uses the ReentrantLock mechanism.
In embodiments of the invention, obtaining the multiple accounts from a configuration file allows the number of accounts to be configured according to the amount of web data to be crawled. In addition, simulating login and storing the identification information allows subsequent accesses to proceed directly without logging in again.
Preferably, using the selected account to access webpages of the target website includes: judging whether an exception occurs when the selected account logs in to the target website; if an exception occurs, removing the account with the login exception from the multiple accounts and selecting an account again from the remaining accounts; and if no exception occurs, using the selected account to access webpages of the target website.
During web data acquisition, an account may encounter a login exception, or the website may restrict the account's access. Therefore, when an account is used to access the target website, it can first be judged whether the account has a login exception. If it does, the account is removed from the multiple accounts so that it is not used next time, and another account is selected to access webpages of the target website. If it does not, the selected account can continue to access the target website.
Further, if data acquisition fails during this process because of a login account exception, the currently logged-in account may be temporarily removed from the account group in the system so that it is not used next time. After resting for a period of time, the account may be added back to the account group to perform crawl tasks.
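A minimal sketch of temporary removal with a rest period follows. The class `AccountPool`, the `rest_seconds` parameter, and the injectable `clock` are all hypothetical; the text does not say how long an account rests or how its return is triggered.

```python
import time


class AccountPool:
    """Track usable accounts; suspended accounts return after a rest period."""

    def __init__(self, accounts, rest_seconds, clock=time.monotonic):
        self._active = list(accounts)
        self._resting = {}  # account -> time at which it may return
        self._rest = rest_seconds
        self._clock = clock

    def suspend(self, account):
        """Remove an account whose login failed and schedule its return."""
        if account in self._active:
            self._active.remove(account)
            self._resting[account] = self._clock() + self._rest

    def active(self):
        """Return usable accounts, restoring any whose rest period has elapsed."""
        now = self._clock()
        for account, ready_at in list(self._resting.items()):
            if now >= ready_at:
                del self._resting[account]
                self._active.append(account)
        return list(self._active)
```

Passing the clock in as a parameter keeps the rest period easy to exercise in tests without actually waiting.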
An embodiment of the present invention further provides a web data acquisition device, which may be used to perform the web data acquisition method of the above embodiments.
Fig. 3 is a schematic diagram of a web data acquisition device according to an embodiment of the present invention. As shown in Fig. 3, the device includes an acquiring unit 10 and a crawling unit 20.
The acquiring unit 10 is configured to obtain multiple accounts, each of which has login rights for the target website.
The crawling unit 20 is configured to select an account from the multiple accounts, use the selected account to access webpages of the target website, and crawl the web data of the webpages accessed by the selected account, wherein the accounts chosen in any two consecutive selections differ.
The target website is the website whose web data is to be obtained. Multiple accounts are registered on the website in advance to obtain login rights for it. To obtain the web data of the target website, the multiple accounts are obtained first; accounts are then selected from them one after another to access webpages of the target website, and the web data of the accessed webpages is crawled.
In embodiments of the invention, a crawler program can be used to crawl web data. The multiple accounts can be stored in a configuration file, from which the crawler loads the accounts and their corresponding passwords. Accounts can be selected according to a preset rule. The rule may be to select, at random, an account that differs from the previously selected one, or to select accounts in a fixed order. Each selected account can access one or more webpages of the target website; when each account accesses several webpages, the number of webpages accessed may or may not be the same for every account. Preferably, to keep accounts from being restricted by the website system, each selected account accesses one webpage of the target website; alternatively, when each selected account accesses several webpages, the number of webpages accessed is less than the number of accesses the website allows within a preset time. For example, if the target website's restriction policy is that a single account may issue no more than 30 access requests within 5 minutes, then after an account is selected, the number of target-website webpages it accesses while it is the currently selected account is limited to fewer than 30. In addition, to prevent multiple accounts from accessing the same webpage at the same time, only one account is selected at a time to access webpages in embodiments of the invention. After each access, the position of the currently accessed webpage is recorded, or the accessed webpage is marked, so that later accounts do not access webpages that have already been accessed.
According to the embodiments of the present invention, multiple accounts are obtained and, each time, an account is selected from them to access the target website and crawl its web data. This prevents the website's per-account access restrictions from obstructing the acquisition of web data, solves the technical problem that the web data of a website is difficult to obtain quickly, and achieves the effect of obtaining the website's web data quickly.
With approaches that execute tasks periodically at fixed time intervals, the time constraint makes web data acquisition inefficient. The embodiments of the present invention instead rotate multiple accounts to obtain web data continuously, unaffected by timing, so web data acquisition is more efficient than with periodic task execution.
Preferably, the number of accounts is n, where n is greater than or equal to 2, and the crawling unit includes: a first access module, configured to select the i-th account from the multiple accounts and use the i-th account to access the j-th webpage of the target website, where i = 1, ..., n and j = 1, 2, 3, ...; when i is greater than or equal to 2, the i-th account differs from the (i-1)-th account; when j is greater than or equal to 2, the j-th webpage is one or more webpages of the target website and differs from each of the 1st through (j-1)-th webpages; a crawling module, configured to crawl the web data of the j-th webpage; and a first judging module, configured to judge whether i equals n. The first access module is further configured to, if i equals n, reset i to 1, increment j by 1, select the i-th account from the multiple accounts, and use it to access the j-th webpage of the target website; and, if i is less than n, increment i by 1, increment j by 1, select the i-th account from the multiple accounts, and use it to access the j-th webpage of the target website. The crawling unit also includes: a second judging module, configured to judge whether the amount of web data crawled from the target website has reached a preset value; and a stopping module, configured to stop the crawling of the target website's web data if that amount has reached the preset value.
In this embodiment, after each crawl of web data, it may be judged whether the crawled amount of web data of the target website has reached the preset value; if it has, the crawling task can be ended.
For example, with 10 accounts, crawling starts from the 1st account, which accesses a webpage of the target website and crawls its web data. After the polling has traversed to the 10th account, if the web data has not yet been fully crawled, the process starts again from the 1st account, accesses webpages of the target website, and continues crawling web data.
In this embodiment, the multiple accounts are rotated and cycled through to obtain web data, which improves the utilization of the accounts and the efficiency of crawling web data.
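The rotation described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the class `RotatingCrawler`, the `PageFetcher` interface, and the choice to measure the crawled amount as a page count are all assumptions introduced here.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the account rotation loop; all names here are hypothetical. */
class RotatingCrawler {
    /** Stand-in for "access page j with the given account and crawl its data". */
    interface PageFetcher {
        String fetch(String account, int pageIndex);
    }

    /**
     * Crawls pages 1, 2, 3, ... of the target website, switching to the next
     * account for every page and wrapping from account n back to account 1,
     * until the number of crawled pages reaches presetAmount (an assumed
     * way of measuring the "crawled amount").
     */
    static List<String> crawl(List<String> accounts, int presetAmount, PageFetcher fetcher) {
        if (accounts.size() < 2) {
            throw new IllegalArgumentException("at least two accounts are required (n >= 2)");
        }
        List<String> crawled = new ArrayList<>();
        int n = accounts.size();
        int i = 1; // 1-based account index, as in the text
        int j = 1; // 1-based page index
        while (crawled.size() < presetAmount) {
            crawled.add(fetcher.fetch(accounts.get(i - 1), j));
            i = (i == n) ? 1 : i + 1; // reset i to 1 after the n-th account
            j = j + 1;                // always move on to a new page
        }
        return crawled;
    }
}
```

With three accounts `a`, `b`, `c` and a preset amount of 5, the loop uses `a`, `b`, `c`, `a`, `b` for pages 1 through 5, so no two adjacent selections use the same account.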
Preferably, the device further includes: a marking unit, configured to mark the j-th webpage after the i-th account has been used to access the j-th webpage of the target website, where a marked webpage is not accessed again subsequently.
In this embodiment, each time a webpage of the target website has been accessed, the accessed webpage is marked. After the web data of that webpage has been crawled, if the web data of the website has not yet been fully crawled, an account is selected again to access a webpage that has not been marked. Marking the accessed webpages in this way avoids repeatedly accessing the same webpage and obtaining the same web data.
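One way to picture the marking step is a set of marked URLs next to a queue of pending ones. The sketch below is illustrative only; the class `VisitedPages` and its methods are hypothetical names, and the text does not say how marks are stored.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Sketch of page marking; class and method names are hypothetical. */
class VisitedPages {
    private final Set<String> marked = new HashSet<>();
    private final Deque<String> pending = new ArrayDeque<>();

    /** Queues a URL unless it is already marked or already queued. */
    void offer(String url) {
        if (!marked.contains(url) && !pending.contains(url)) {
            pending.addLast(url);
        }
    }

    /** Returns the next unmarked URL and marks it, or null if none remain. */
    String nextUnmarked() {
        String url = pending.pollFirst();
        if (url != null) {
            marked.add(url);
        }
        return url;
    }

    /** Reports whether the URL has already been accessed. */
    boolean isMarked(String url) {
        return marked.contains(url);
    }
}
```

Because a URL is marked at the moment it is handed out, offering the same URL again later has no effect, so the crawler never accesses it a second time.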
Preferably, the acquiring unit includes: an acquisition module, configured to obtain a configuration file in which the multiple accounts and their corresponding passwords are configured; and a loading module, configured to load the configuration file to obtain the multiple accounts and their corresponding passwords. The device further includes: a login unit, configured to, after the multiple accounts are obtained, log in to the target website using the multiple accounts and their corresponding passwords, and cache identification information, where the identification information is the information by which the target website identifies the multiple accounts.
In the embodiment of the present invention, the multiple accounts and their corresponding passwords are configured in a configuration file; the configuration file is obtained and the data in it is loaded to obtain the multiple accounts and their corresponding passwords. The configuration file may be a default configuration file or an externally supplied configuration file, and the number of accounts in the configuration file can be configured as required.
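The loading step might look like the sketch below. The text does not specify a file format, so the `account=password` line format, the class `AccountConfig`, and the placeholder default entries are assumptions made for illustration; `java.util.Properties` is used only because it parses that format directly.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Sketch of configuration loading; format and names are hypothetical. */
class AccountConfig {
    /** Built-in fallback used when no external file is supplied (placeholder values). */
    static final String DEFAULT_CONFIG = "demo_user_1=demo_pass_1\ndemo_user_2=demo_pass_2\n";

    /**
     * Loads account/password pairs from the external configuration if one is
     * given, otherwise from the default configuration.
     */
    static Map<String, String> load(Reader externalConfig) {
        Properties props = new Properties();
        try (Reader in = (externalConfig != null) ? externalConfig : new StringReader(DEFAULT_CONFIG)) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Copy into a sorted map so the rotation order is deterministic.
        Map<String, String> accounts = new TreeMap<>();
        for (String name : props.stringPropertyNames()) {
            accounts.put(name, props.getProperty(name));
        }
        return accounts;
    }
}
```

Passing `null` selects the default configuration, which mirrors the fallback the text describes; adding or removing lines in the file changes the number of accounts without any code change.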
After the multiple accounts are obtained, they can be used to simulate logging in to the target website, and the identification information of each login can be cached so that subsequent accesses to the target website do not require logging in again. The identification information may be, for example, a cookie or similar information.
Taking a crawler program as an example, the first time the crawler obtains a login account it loads the data in the configuration file; if the user does not provide an external configuration file, the crawler loads the default configuration file. To ensure that external systems can access the crawler concurrently, this whole process may use the ReentrantLock mechanism. After an account is successfully obtained, its position is recorded, which ensures that the next request for an account rotates to the next account.
After the multiple accounts are obtained, the obtained account names and passwords are used to simulate login, and the cookie is cached so that no login is needed the next time the account is used. This process likewise uses the ReentrantLock mechanism.
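The locking, position recording, and cookie caching described above could be combined roughly as follows. Only `ReentrantLock` is named by the text; the class `AccountPool`, its methods, and the `login` callback that stands in for the simulated login are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

/** Sketch of a thread-safe account pool; names are hypothetical. */
class AccountPool {
    private final ReentrantLock lock = new ReentrantLock();
    private final List<String> accounts;
    private final Map<String, String> cookieCache = new HashMap<>();
    private int position = 0; // recorded position of the account to hand out next

    AccountPool(List<String> accounts) {
        if (accounts.isEmpty()) {
            throw new IllegalArgumentException("no accounts configured");
        }
        this.accounts = new ArrayList<>(accounts);
    }

    /** Hands out accounts in rotation; safe to call from several threads. */
    String nextAccount() {
        lock.lock();
        try {
            String account = accounts.get(position);
            position = (position + 1) % accounts.size();
            return account;
        } finally {
            lock.unlock();
        }
    }

    /** Returns the cached cookie, running the simulated login only on first use. */
    String cookieFor(String account, Function<String, String> login) {
        lock.lock();
        try {
            return cookieCache.computeIfAbsent(account, login);
        } finally {
            lock.unlock();
        }
    }
}
```

Holding the lock while the login callback runs keeps the sketch simple but serializes logins; a real crawler might prefer per-account locking.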
In the embodiment of the present invention, the multiple accounts are obtained from a configuration file, so the number of accounts can be configured according to the amount of web data to be crawled. In addition, because login is simulated and the identification information is stored, the next access can proceed directly without logging in.
Preferably, crawl unit and include: the 3rd judge module, for judging whether the account selected logs in targeted website
Occur abnormal;Remove module, for if it is judged that the account selected logs in targeted website appearance extremely, then logging in
Abnormal account removes from multiple accounts, and again selects account multiple accounts after removing;Second accesses mould
Block, for if it is determined that the account of selection logs in targeted website and exception do not occurs, then utilizes the account of selection to access target
The webpage of website.
During web data acquisition, an account may encounter a login exception, or the website may restrict the account's access. Therefore, when an account is used to access the target website, it can first be judged whether the account has a login exception. If an exception occurs, the account is removed from the multiple accounts so that it is not used next time, and an account is selected again to access a webpage of the target website; if no exception occurs, the selected account can continue to be used to access the target website.
Further, if data acquisition fails during this process because of a login account exception, the currently logged-in account can be temporarily removed from the account group in the system so that it is not used next time. After the account has rested for a period of time, it can be added back to the account group to perform crawling tasks.
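The temporary removal and later re-admission of an account can be sketched as below. The class `CoolingAccountGroup`, the fixed rest period, and the choice to pass the current time in as a parameter (which keeps the sketch deterministic) are assumptions; the text does not say how long an account rests or how the rest is timed.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Sketch of removing abnormal accounts and re-adding them later; names are hypothetical. */
class CoolingAccountGroup {
    private final Deque<String> active = new ArrayDeque<>();
    private final Map<String, Long> restingUntil = new HashMap<>();
    private final long restMillis;

    CoolingAccountGroup(List<String> accounts, long restMillis) {
        this.active.addAll(accounts);
        this.restMillis = restMillis;
    }

    /** Removes an account after a login exception and starts its rest period. */
    void reportAbnormal(String account, long nowMillis) {
        if (active.remove(account)) {
            restingUntil.put(account, nowMillis + restMillis);
        }
    }

    /** Returns the next usable account in rotation, or null if all are resting. */
    String select(long nowMillis) {
        // Re-admit every account whose rest period has elapsed.
        Iterator<Map.Entry<String, Long>> it = restingUntil.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getValue() <= nowMillis) {
                active.addLast(entry.getKey());
                it.remove();
            }
        }
        String account = active.pollFirst();
        if (account != null) {
            active.addLast(account); // move to the back so the next call rotates on
        }
        return account;
    }
}
```

A caller would invoke `reportAbnormal` when a login attempt fails and then call `select` again to pick a different account, which corresponds to the reselection step in the text.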
The sequence numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For a part not described in detail in one embodiment, refer to the related description of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a portable hard drive, a magnetic disk, or an optical disc.
The above is only the preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A web data acquisition method, characterized by comprising:
obtaining multiple accounts, wherein the multiple accounts are accounts having login rights for a target website; and
selecting an account from the multiple accounts, using the selected account to access a webpage of the target website, and crawling the web data of the webpage accessed by the selected account, wherein the accounts of two adjacent selections are different.
2. The method according to claim 1, characterized in that the number of the multiple accounts is n, n being greater than or equal to 2, and selecting an account from the multiple accounts, using the selected account to access a webpage of the target website, and crawling the web data of the webpage accessed by the selected account comprises:
selecting the i-th account from the multiple accounts and using the i-th account to access the j-th webpage of the target website, wherein i = 1, ..., n and j = 1, 2, 3, ...; when i is greater than or equal to 2, the i-th account is an account different from the (i-1)-th account; when j is greater than or equal to 2, the j-th webpage is one or more webpages of the target website, and the j-th webpage is different from each of the 1st webpage through the (j-1)-th webpage;
crawling the web data of the j-th webpage;
judging whether i is equal to n;
if it is judged that i is equal to n, setting the value of i to 1, adding 1 to the value of j, and returning to the step of selecting the i-th account from the multiple accounts and using the i-th account to access the j-th webpage of the target website;
if it is judged that i is less than n, adding 1 to the value of i, adding 1 to the value of j, and returning to the step of selecting the i-th account from the multiple accounts and using the i-th account to access the j-th webpage of the target website;
judging whether the crawled amount of web data of the target website reaches a preset value; and
if it is judged that the crawled amount of web data of the target website reaches the preset value, stopping crawling the web data of the target website.
3. The method according to claim 2, characterized in that
after using the i-th account to access the j-th webpage of the target website, the method further comprises: marking the j-th webpage, wherein a marked webpage is not accessed again subsequently.
4. The method according to claim 1, characterized in that
obtaining multiple accounts comprises: obtaining a configuration file, wherein the multiple accounts and their corresponding passwords are configured in the configuration file; and loading the configuration file to obtain the multiple accounts and their corresponding passwords,
wherein, after obtaining the multiple accounts, the method further comprises: logging in to the target website using the multiple accounts and their corresponding passwords, and caching identification information, wherein the identification information is the information by which the target website identifies the multiple accounts.
5. The method according to any one of claims 1 to 4, characterized in that using the selected account to access a webpage of the target website comprises:
judging whether an exception occurs when the selected account logs in to the target website;
if it is judged that an exception occurs when the selected account logs in to the target website, removing the account with the login exception from the multiple accounts, and selecting an account again from the multiple accounts remaining after the removal; and
if it is judged that no exception occurs when the selected account logs in to the target website, using the selected account to access a webpage of the target website.
6. A web data acquisition device, characterized by comprising:
an acquiring unit, configured to obtain multiple accounts, wherein the multiple accounts are accounts having login rights for a target website; and
a crawling unit, configured to select an account from the multiple accounts, use the selected account to access a webpage of the target website, and crawl the web data of the webpage accessed by the selected account, wherein the accounts of two adjacent selections are different.
7. The device according to claim 6, characterized in that the number of the multiple accounts is n, n being greater than or equal to 2, and the crawling unit comprises:
a first access module, configured to select the i-th account from the multiple accounts and use the i-th account to access the j-th webpage of the target website, wherein i = 1, ..., n and j = 1, 2, 3, ...; when i is greater than or equal to 2, the i-th account is an account different from the (i-1)-th account; when j is greater than or equal to 2, the j-th webpage is one or more webpages of the target website, and the j-th webpage is different from each of the 1st webpage through the (j-1)-th webpage;
a crawling module, configured to crawl the web data of the j-th webpage;
a first judging module, configured to judge whether i is equal to n;
the first access module being further configured to, if it is judged that i is equal to n, set the value of i to 1, add 1 to the value of j, select the i-th account from the multiple accounts, and use the i-th account to access the j-th webpage of the target website;
the first access module being further configured to, if it is judged that i is less than n, add 1 to the value of i, add 1 to the value of j, select the i-th account from the multiple accounts, and use the i-th account to access the j-th webpage of the target website;
a second judging module, configured to judge whether the crawled amount of web data of the target website reaches a preset value; and
a stopping module, configured to stop crawling the web data of the target website if it is judged that the crawled amount of web data of the target website reaches the preset value.
8. The device according to claim 7, characterized in that the device further comprises:
a marking unit, configured to mark the j-th webpage after the i-th account has been used to access the j-th webpage of the target website, wherein a marked webpage is not accessed again subsequently.
9. The device according to claim 6, characterized in that
the acquiring unit comprises: an acquisition module, configured to obtain a configuration file, wherein the multiple accounts and their corresponding passwords are configured in the configuration file; and a loading module, configured to load the configuration file to obtain the multiple accounts and their corresponding passwords,
wherein the device further comprises: a login unit, configured to, after the multiple accounts are obtained, log in to the target website using the multiple accounts and their corresponding passwords, and cache identification information, wherein the identification information is the information by which the target website identifies the multiple accounts.
10. The device according to any one of claims 6 to 9, characterized in that the crawling unit comprises:
a third judging module, configured to judge whether an exception occurs when the selected account logs in to the target website;
a removing module, configured to, if it is judged that an exception occurs when the selected account logs in to the target website, remove the account with the login exception from the multiple accounts, and select an account again from the multiple accounts remaining after the removal; and
a second access module, configured to, if it is judged that no exception occurs when the selected account logs in to the target website, use the selected account to access a webpage of the target website.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510250516.6A CN106294369A (en) | 2015-05-15 | 2015-05-15 | Web data acquisition methods and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510250516.6A CN106294369A (en) | 2015-05-15 | 2015-05-15 | Web data acquisition methods and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106294369A (en) | 2017-01-04 |
Family
ID=57632274
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510250516.6A Pending CN106294369A (en) | 2015-05-15 | 2015-05-15 | Web data acquisition methods and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106294369A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030135487A1 (en) * | 2002-01-11 | 2003-07-17 | Beyer Kevin Scott | Automated access to web content based on log analysis |
CN101872365A (en) * | 2010-07-02 | 2010-10-27 | 苏州阔地网络科技有限公司 | Method for realizing one-key login to other website on webpage |
CN103902386A (en) * | 2014-04-11 | 2014-07-02 | 复旦大学 | Multi-thread network crawler processing method based on connection proxy optimal management |
CN104615627A (en) * | 2014-09-23 | 2015-05-13 | 中国科学院计算技术研究所 | Event public sentiment information extracting method and system based on micro-blog platform |
Non-Patent Citations (3)
Title |
---|
史春永: "面向新浪微博的数据采集和社区发现算法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
孙青云等: "一种基于模拟登录的微博数据采集方案", 《计算机技术与发展》 * |
蒋建军: "《计算机网络技术实训教程》", 30 April 2001, 上海交通大学出版社 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107704497A (en) * | 2017-08-25 | 2018-02-16 | 上海壹账通金融科技有限公司 | Web data crawling method, device, web data crawl platform and storage medium |
WO2019037417A1 (en) * | 2017-08-25 | 2019-02-28 | 深圳壹账通智能科技有限公司 | Webpage data crawling method and apparatus, webpage data crawling platform, and storage medium |
WO2019237547A1 (en) * | 2018-06-11 | 2019-12-19 | 平安科技(深圳)有限公司 | Data crawling method and apparatus, and computer device and storage medium |
CN109375960A (en) * | 2018-09-29 | 2019-02-22 | 郑州云海信息技术有限公司 | A kind of copyright information loading method and device |
CN110619072A (en) * | 2019-08-29 | 2019-12-27 | 凡普数字技术有限公司 | Bank account information acquisition method and device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107797908A (en) | A kind of behavioral data acquisition method of website user | |
CN106294369A (en) | Web data acquisition methods and device | |
CN104539459B (en) | Network control method on router and router | |
CN106844522A (en) | A kind of network data crawling method and device | |
CN106897284A (en) | The recommendation method and device of e-book | |
CN107958456A (en) | Dispensing detection method, device and electronic equipment | |
CN106131047A (en) | Account login method and relevant device, account login system | |
CN107800591A (en) | A kind of analysis method of unified daily record data | |
CN103593444B (en) | Internet Keyword identifying processing method and apparatus | |
CN104537005B (en) | Data processing method and device for web page crawl | |
CN104202291A (en) | Anti-phishing method based on multi-factor comprehensive assessment method | |
CN106936778A (en) | The abnormal detection method of website traffic and device | |
CN108601023A (en) | Home-network linkups authentication method, device, electronic equipment and storage medium | |
CN106874165A (en) | Page detection method and device | |
CN107483381A (en) | The monitoring method and device of interlock account | |
CN107204956A (en) | website identification method and device | |
CN107948052A (en) | Information crawler method, apparatus, electronic equipment and system | |
CN107104924A (en) | The verification method and device of website backdoor file | |
CN107888606A (en) | A kind of domain name credit assessment and system | |
CN107689941A (en) | A kind of apparatus and method for preventing same user's repeat logon | |
CN106789837A (en) | Network anomalous behaviors detection method and detection means | |
CN107124426A (en) | The method for authenticating and device of a kind of user's right | |
CN109544238A (en) | User behavior method for tracing, device, server and storage medium | |
CN108038218A (en) | A kind of distributed reptile method, electronic equipment and server | |
CN108510304A (en) | Construction method, electronic device and the storage medium of target customers |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing Applicant after: Beijing Guoshuang Technology Co.,Ltd. Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170104 |