WO2019019344A1 - Webpage data crawling method, apparatus, user terminal, and readable storage medium - Google Patents

Webpage data crawling method, apparatus, user terminal, and readable storage medium

Info

Publication number
WO2019019344A1
WO2019019344A1 PCT/CN2017/103932 CN2017103932W WO2019019344A1 WO 2019019344 A1 WO2019019344 A1 WO 2019019344A1 CN 2017103932 W CN2017103932 W CN 2017103932W WO 2019019344 A1 WO2019019344 A1 WO 2019019344A1
Authority
WO
WIPO (PCT)
Prior art keywords
crawled
website
data
crawling
server
Prior art date
Application number
PCT/CN2017/103932
Other languages
English (en)
French (fr)
Inventor
周晶
Original Assignee
上海壹账通金融科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海壹账通金融科技有限公司
Publication of WO2019019344A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/08 Network architectures or network communication protocols for network security for authentication of entities
    • H04L63/083 Network architectures or network communication protocols for network security for authentication of entities using passwords
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/04 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0428 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/10 Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H04L63/102 Entity profiles

Definitions

  • the present invention relates to the field of computer technologies, and in particular, to a webpage data crawling method, apparatus, user terminal, and readable storage medium.
  • for example, a server can submit an account and password to the website to be crawled in order to log in, and then crawl the data stored in that website; however, because websites now enforce very strict security mechanisms, crawling the information of too many accounts from the same IP address triggers the website's risk control mechanism, so the user's account is blocked and the user can no longer use it.
  • a webpage data crawling method, apparatus, storage medium, and terminal are provided, which solve one or more problems involved in the background art.
  • a webpage data crawling method includes:
  • a webpage data crawling device comprising:
  • the login module is configured to receive an account and a password corresponding to the to-be-crawled website through the to-be-crawled website login interface embedded in the client, and to log in to the to-be-crawled website by using that account and password;
  • a detecting module configured to detect whether the website to be crawled has been logged in to successfully;
  • a verification module configured to: when successfully logging in to the website to be crawled, determine whether the account of the client matches the account of the website to be crawled;
  • a crawling module configured to: when the account of the client matches the account of the website to be crawled, crawl the data to be crawled in the website to be crawled;
  • a sending module configured to send the crawled data to be crawled to the server.
  • a user terminal comprising a memory, a processor, and computer readable instructions stored on the memory and executable on the processor, the processor implementing the computer readable instructions to:
  • a computer readable storage medium having stored thereon computer readable instructions that, when executed by a processor, implement the following steps:
  • FIG. 1 is an application environment diagram of a webpage data crawling method in an embodiment
  • FIG. 2 is a flowchart of a webpage data crawling method in an embodiment
  • FIG. 3 is a flow chart of step S208 of the embodiment shown in Figure 2;
  • FIG. 4 is an interface diagram of a qq mailbox login interface in an embodiment
  • FIG. 5 is an interface diagram of an interface of a bill data crawling process in an embodiment
  • FIG. 6 is an interface diagram of successful billing data crawling in an embodiment
  • FIG. 7 is another flow chart of step S208 in the embodiment shown in Figure 2;
  • FIG. 8 is a flow chart of step S210 in the embodiment shown in Figure 2;
  • FIG. 9 is a schematic structural diagram of a webpage data crawling apparatus in an embodiment
  • FIG. 10 is a schematic structural diagram of a user terminal in an embodiment.
  • FIG. 1 is an application environment diagram of a webpage data crawling method in an embodiment.
  • a server and a plurality of user terminals are used, and the server can communicate with each of the user terminals, wherein a client APP is installed on each user terminal, and a login interface of the website to be crawled is embedded in the client APP.
  • the user terminal can be a terminal such as a mobile phone, a tablet or a computer.
  • the client APP installed in the user terminal can be an APP of any APP provider in which the login interface of the website to be crawled is embedded; for example, an email login interface can be embedded in a client APP such as WeChat.
  • a webpage data crawling method is provided. This embodiment is exemplified by applying the method to the server in FIG.
  • the webpage data crawling program runs on the server, and the webpage data crawling method is implemented by the webpage data crawling program.
  • the method specifically includes the following steps:
  • S202 Receive the account and password corresponding to the website to be crawled through the login interface of the website to be crawled embedded in the client, and log in to the website to be crawled by using that account and password.
  • the client refers to an application such as an APP installed on the user terminal, in which the login interface of the website to be crawled is embedded; that login interface may be an email login interface or an e-commerce login interface, for example, a qq email login interface, 126 email login interface, 163 email login interface, Taobao login interface, Alipay login interface, Jingdong login interface, Vipshop login interface, etc.
  • the login interface of the website to be crawled is opened and the account and password of the website to be crawled are input, so that the website to be crawled can be logged in to through the login interface embedded in the client.
  • for example, the user can first log in to the "Ping An Account Book" APP, then open the qq mailbox login interface embedded in "Ping An Account Book", and log in to the qq mailbox by entering the qq email account and password in that interface.
  • because the website to be crawled must be logged in to successfully before its data can be crawled, it is necessary to detect whether the login succeeded before crawling the data to be crawled; if the login did not succeed, the data to be crawled in the website to be crawled cannot be crawled.
  • the user may log in, through his or her own client, to another user's account on the website to be crawled; if that other user's account were also crawled, the data finally crawled would not be the user's own, resulting in data errors.
  • therefore, before crawling, in order to ensure that the account of the website to be crawled is the user's own, it is determined whether the account of the client matches the account of the website to be crawled.
  • the user account of "Ping An Account Book" can be given a unique identifier of the user, such as the user's ID number, and the account of the qq mailbox can also be given a unique identifier of the user, such as the ID number.
  • when the unique identifier of the account book APP account matches the unique identifier of the qq mailbox account, the next step, crawling the data to be crawled in the website to be crawled, is performed.
  • the data to be crawled in the website to be crawled is the user's own data, and it is crawled directly by the crawler program in the client.
  • in this way, each user's data to be crawled is crawled at that user's own terminal instead of all users' data being crawled on the server, which effectively avoids the risk control mechanism of the website to be crawled locking the user's account on that website.
  • S210 Send the crawled data to be crawled to the server.
  • the data to be crawled may be sent to the server, so that the server may provide a corresponding service to the user according to the data.
  • the server can remind the user, according to the billing data, when a repayment is due, or can provide the user with a repayment red envelope; for example, when the user needs to repay 1000 yuan, the server can provide a service such as a 5 yuan deduction red envelope.
  • in the above webpage data crawling method, the data to be crawled in the website to be crawled is crawled through the client; after logging in to the website to be crawled through its login interface embedded in the client, the method verifies whether the account of the website to be crawled corresponds to the account of the client, to ensure that the data to be crawled is the data of the client's user, and the crawled data is sent to the server for processing and analysis.
  • this avoids the situation in which crawling the data on the server side triggers the risk control mechanism of the website and causes the user's account to be locked.
  • FIG. 3 is a flowchart of step S208 of the embodiment shown in FIG. 2.
  • step S208, that is, the step of crawling the data to be crawled in the website to be crawled, may include:
  • S302 Send a crawl script acquisition request to the server.
  • the crawl script refers to a script that can be used for the user terminal to crawl the data to be crawled in the website to be crawled.
  • the crawl script is stored on the server, so the crawl script only needs to be modified on the server side, and the new crawl script is downloaded directly from the server before the next crawl of the website to be crawled.
  • because the crawl script is a script, it takes up little space and is transmitted quickly.
  • the user terminal sends a crawl script acquisition request to the server; after receiving the request, the server looks up the crawl script, packages it, and sends it to the corresponding client, which reduces the amount of data transferred.
  • S304 Receive a crawl script corresponding to the crawl script acquisition request returned by the server.
  • after the crawl script is sent to the user terminal, the user terminal can use the crawl script to crawl the data to be crawled in the website to be crawled.
  • S306 Crawling the data to be crawled in the website to be crawled by using a crawl script.
  • FIG. 4 is an interface diagram of the qq mailbox login interface in an embodiment, FIG. 5 is an interface diagram of the billing data crawling process in an embodiment, and FIG. 6 is an interface diagram in which the billing data has been successfully crawled in an embodiment.
  • the qq mailbox login interface is embedded in the client of the user terminal, and the user logs in to the qq mailbox by inputting the account and password in that interface, as shown in FIG. 4; when the qq mailbox has been logged in to successfully, the user terminal checks whether the client account matches the qq mailbox account.
  • when they match, the crawl script is downloaded from the server and then used to crawl the billing information in the qq mailbox.
  • FIG. 5 displays the progress of the user terminal in crawling the data to be crawled: the qq mailbox was successfully verified, the corresponding bill was found, and 64% of the bill has been crawled.
  • when the user terminal has finished crawling the data to be crawled, that is, the bill, the user can be prompted that crawling is complete, as in FIG. 6.
  • a request to acquire the script is sent to the server, and after receiving the request, the server packages the latest script and transmits it to the user terminal.
  • the script is stored on the server, so the crawl script only needs to be modified on the server.
  • if the crawl script were shipped with the client installation package, a new installation package would have to be released every time the crawl script was modified, forcing the client to update frequently; in addition, the crawl script is packaged before it is sent, which reduces the amount of data transmitted.
  • the method may further include: obtaining the time at which the crawl script was last received from the server; when the difference between that time and the current time is within the preset range, crawling the data to be crawled in the website to be crawled by using the crawl script last received from the server.
  • when the difference is not within the preset range, the step of sending a crawl script acquisition request to the server, that is, step S302, is performed.
  • a preset range is set; as long as the difference between the time when the user terminal last acquired the crawl script from the server and the current time is within the preset range, the user terminal does not need to download the crawl script from the server again.
  • the preset range may be 30 minutes, 1 hour, 2 hours, 1 day, 1 week, etc., and is not limited herein. For example, suppose the crawl script was last downloaded from the server at 9:30 AM, the preset range is 2 hours, and the next crawl takes place at 10:30 AM.
  • the difference from 9:30 AM is 1 hour, which is less than the preset range of 2 hours, so the crawl at 10:30 AM uses the crawl script last downloaded from the server and there is no need to download it again; but if the next crawl takes place at 2:30 PM, the difference from 9:30 AM is 5 hours, which is greater than the preset range of 2 hours, so the crawl script must be downloaded from the server again.
  • the time at which the crawl script was last acquired is obtained first; if the difference between that time and the current time is within the preset range, the crawl script stored on the user terminal is invoked directly and no download from the server is needed. This avoids, for example, the script being downloaded every time when the user logs in to the qq mailbox to synchronize bills many times in one day, which would waste data traffic.
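The reuse-or-download decision described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the class name `CrawlScriptCache`, the `fetch` callable (standing in for steps S302 and S304), and the choice to treat a difference exactly equal to the preset range as still "within range" are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional, Tuple


class CrawlScriptCache:
    """Hypothetical helper: remembers the last crawl script and when it arrived."""

    def __init__(self, fetch: Callable[[], str], preset_range: timedelta):
        self._fetch = fetch                  # assumed stand-in for steps S302/S304
        self._preset_range = preset_range    # e.g. 2 hours in the worked example
        self._script: Optional[str] = None
        self._received_at: Optional[datetime] = None

    def get(self, now: datetime) -> Tuple[str, bool]:
        """Return (script, downloaded); downloaded is True if the server was contacted."""
        if (
            self._script is not None
            and self._received_at is not None
            and now - self._received_at <= self._preset_range
        ):
            # Difference is within the preset range: reuse the stored script.
            return self._script, False
        # No script yet, or the stored one is too old: request it again.
        self._script = self._fetch()
        self._received_at = now
        return self._script, True
```

With a 2-hour preset range, a script received at 9:30 AM is reused at 10:30 AM and downloaded again at 2:30 PM, matching the example in the text.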
  • since the URLs of different web pages are different, whether the website to be crawled has been logged in to successfully can be determined by detecting whether the URL address of the web page has changed.
  • for example, the URL address of the login interface of the website to be crawled may be A. After a successful login, the URL address may become B; if the login fails, the page stays on the login interface of the website to be crawled, that is, the URL address is still A. Judging whether the login succeeded by whether the URL address has changed is therefore easy to implement.
  • whether the website to be crawled has been logged in to successfully can be detected by detecting whether the URL address of the current interface of the client has changed; only when the login succeeds does the URL address of the current interface of the client change.
  • when the login fails, the URL address of the current interface of the client does not change, and a corresponding login failure prompt message is provided.
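The URL-change check can be illustrated with a short sketch. The function names and the example URLs are hypothetical, and comparing only scheme, host, and path (so that a query string or trailing slash added to the login page does not count as a change) is an assumption that goes beyond the text, which only says the URL address changes on success.

```python
from urllib.parse import urlsplit


def _page_identity(url: str) -> tuple:
    # Assumption: two URLs name the same page when scheme, host, and path agree.
    parts = urlsplit(url)
    return (parts.scheme.lower(), parts.netloc.lower(), parts.path.rstrip("/"))


def login_succeeded(login_url: str, current_url: str) -> bool:
    """Treat a move away from the login page (A -> B) as a successful login."""
    return _page_identity(current_url) != _page_identity(login_url)
```

A caller would record the URL of the embedded login interface before submitting the credentials and compare it with the URL shown afterwards; if the function returns False, the login failure prompt would be displayed.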
  • the website to be crawled is a mailbox website; referring to FIG. 7, which is another flowchart of step S208 in the embodiment shown in FIG. 2, step S208, that is, the step of crawling the data to be crawled in the website to be crawled, may include:
  • S702 Select, from the mailbox website, a message whose title corresponds to the data to be crawled.
  • the mails corresponding to the data to be crawled may first be selected from the mailbox according to the nature of the data to be crawled. For example, when billing data needs to be crawled, the mail titles are crawled first and the mails related to bills are selected.
  • S704 Crawl the data of the preset field from the selected mail as the crawled data to be crawled.
  • for example, billing information may include items such as name, date, consumption amount, and payee, but the server only needs the name and the consumption amount. The user terminal then crawls only the data of the name and consumption amount fields from the selected mails as the crawled data, without crawling any additional data.
  • the mail is first located according to its title; for example, the titles of all mails in the inbox may be traversed, or the titles of the mails received in the inbox during a certain period may be traversed, to determine the mails associated with the credit card bill.
  • when the user uses the client APP for the first time, the mails in the entire inbox need to be traversed; if it is not the first time, the time at which the server last obtained the bill can be obtained, and only the mails that arrived in the inbox after that time need to be traversed.
  • after the mail is located, the content of the preset fields is obtained; for example, only information such as the date, summary, payment, and expenditure may be obtained, that is, useless information is filtered out, or all information, such as the balance and the expenditure object information, may be obtained.
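Steps S702 and S704 can be sketched as a title filter followed by a field filter. Everything named here is hypothetical: the `Mail` record, the keyword-based title match, and the assumption that each mail body has already been parsed into a field dictionary are illustrative choices, not details given in the text.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List, Optional, Sequence


@dataclass
class Mail:
    title: str
    received_at: datetime
    fields: Dict[str, str]  # assumed: body already parsed into named fields


def select_mails(
    inbox: Iterable[Mail],
    keywords: Sequence[str],
    since: Optional[datetime] = None,
) -> List[Mail]:
    """S702 sketch: keep mails whose title mentions a keyword, optionally only newer ones."""
    return [
        mail
        for mail in inbox
        if (since is None or mail.received_at > since)
        and any(word.lower() in mail.title.lower() for word in keywords)
    ]


def extract_fields(mail: Mail, preset_fields: Sequence[str]) -> Dict[str, str]:
    """S704 sketch: return only the preset fields, dropping everything else."""
    return {name: mail.fields[name] for name in preset_fields if name in mail.fields}
```

On first use `since` would be left as `None` so the whole inbox is traversed; on later runs it would be set to the time the server last obtained the bill.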
  • FIG. 8 is a flowchart of step S210 in the embodiment shown in FIG. 2.
  • step S210, that is, the step of sending the crawled data to be crawled to the server, may include:
  • S802 Encrypt the crawled data to be crawled.
  • encryption is required during transmission; a symmetric encryption method or an asymmetric encryption method may be used, which is not limited herein.
  • after the user terminal crawls the data to be crawled, the data is encrypted and then sent to the server; after receiving the data, the server performs the corresponding decryption operation to obtain the crawled data.
  • the crawled data may be packaged and the packaged data may be sent to the server, thereby reducing the use of user traffic.
  • S806 Send the packaged data to be crawled to the server.
  • before the packaged data is sent to the server, the user terminal can detect the current network environment.
  • when the current network is a wifi network, the packaged data to be crawled is sent to the server.
  • when the current network is not a wifi network, the packaged data to be crawled is held back until the network of the user terminal changes to a wifi network, and is then sent to the server. This reduces the use of the user's data traffic.
  • when the crawled data is sent, it is first encrypted and the encrypted data is then packaged, which both ensures the security of the data during transmission and reduces the amount of data transmitted.
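The ordering of steps S802 to S806 and the wifi deferral can be sketched as a small queueing helper. The `Uploader` class and its method names are hypothetical. The text does not name a cipher or a packaging format, so both are passed in as callables; a caller might, for instance, supply a real cipher from a cryptography library as `encrypt` and something like `zlib.compress` as `package`.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Uploader:
    """Hypothetical helper: encrypt, then package, then send once wifi is available."""

    encrypt: Callable[[bytes], bytes]   # S802: cipher chosen by the caller (assumption)
    package: Callable[[bytes], bytes]   # packaging step applied after encryption
    send: Callable[[bytes], None]       # S806: transport to the server
    pending: List[bytes] = field(default_factory=list)

    def submit(self, data: bytes, on_wifi: bool) -> None:
        # Follow the order given in the text: encrypt first, then package.
        self.pending.append(self.package(self.encrypt(data)))
        if on_wifi:
            self.flush()

    def flush(self) -> None:
        """Send everything held back, oldest first; call when wifi becomes available."""
        while self.pending:
            self.send(self.pending.pop(0))
```

Payloads submitted while off wifi stay in `pending` and go out in order on the next `flush()`, which is one simple way to realise the "hold until wifi" behaviour described above.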
  • although the steps in the flowcharts of FIGS. 2, 3, 7, and 8 above are displayed in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Except as explicitly stated herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in FIG. 2, FIG. 3, FIG. 7, and FIG. 8 may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
  • FIG. 9 is a schematic structural diagram of a webpage data crawling apparatus according to an embodiment, where the webpage data crawling apparatus includes:
  • the login module 100 is configured to receive an account and a password corresponding to the website to be crawled through the login interface of the website to be crawled embedded in the client, and to log in to the website to be crawled by using that account and password.
  • the detecting module 200 is configured to detect whether the website to be crawled has been logged in to successfully.
  • the verification module 300 is configured to determine, when the website to be crawled has been logged in to successfully, whether the account of the client matches the account of the website to be crawled.
  • the crawling module 400 is configured to crawl the data to be crawled in the website to be crawled when the account of the client matches the account of the website to be crawled.
  • the sending module 500 is configured to send the crawled data to be crawled to the server.
  • the sending module may be further configured to send a crawl script acquisition request to the server.
  • the crawl module can include:
  • the receiving unit is configured to receive a crawl script returned by the server and corresponding to the crawl script acquisition request.
  • Crawl unit for crawling the data to be crawled in the website to be crawled by the crawl script.
  • the webpage data crawling device may further include:
  • the time acquisition module is used to obtain the time at which the crawl script was last received from the server.
  • the comparison module is configured to: when the difference between the time at which the crawl script was last received from the server and the current time is within a preset range, crawl the data to be crawled in the website to be crawled by using the crawl script last received from the server; and when that difference is not within the preset range, perform the step of sending a crawl script acquisition request to the server.
  • the detecting module may further be configured to detect whether the URL address of the current page displayed by the client changes; when the URL address of the current page displayed by the client changes, the website to be crawled has been logged in to successfully; when the URL address of the current page displayed by the client has not changed, the website to be crawled has not been logged in to successfully.
  • the website to be crawled is a mailbox website.
  • the crawl module can also be used to select a message corresponding to the data to be crawled from the mailbox website; and crawl the data of the preset field from the selected mail as the crawled data to be crawled.
  • the sending module can include:
  • An encryption unit configured to encrypt the crawled data to be crawled.
  • a packaging unit for packaging the encrypted data to be crawled.
  • a sending unit configured to send the packaged data to be crawled to the server.
  • the modules and units involved in the webpage data crawling device may be program segments divided according to function; for the specific limitations of the webpage data crawling device, refer to the limitations of the webpage data crawling method above, which are not repeated here.
  • the various modules in the webpage data crawler described above may be implemented in whole or in part by software, hardware, and combinations thereof. Each of the above modules may be embedded in or independent of the processor in the computer device, or may be stored in a memory in the computer device in a software form, so that the processor invokes the operations corresponding to the above modules.
  • the processor can be a central processing unit (CPU), a microprocessor, a microcontroller, or the like.
  • the web page data crawling device described above can be implemented in the form of a computer readable instruction that can be run on a server as shown in FIG.
  • FIG. 10 is a schematic structural diagram of a user terminal in an embodiment, where the user terminal includes a memory, a processor, and an operating system connected through a system bus, where the processor is used to provide computing and control capabilities and supports the operation of the entire computer device.
  • the memory is used to store data, program code, and the like.
  • at least one computer executable program is stored on the memory, and the computer executable program can be executed by the processor to implement the webpage data crawling method provided by the various embodiments described above.
  • the internal memory in the user terminal provides a cached operating environment for the operating system, databases, and computer executables in the non-volatile storage medium.
  • the processor implements the following steps: receiving the account and password corresponding to the website to be crawled through the login interface of the website to be crawled embedded in the client, and logging in to the website to be crawled by using that account and password; detecting whether the website to be crawled has been logged in to successfully; when the website to be crawled has been logged in to successfully, determining whether the account of the client matches the account of the website to be crawled; when the account of the client matches the account of the website to be crawled, crawling the data to be crawled in the website to be crawled; and sending the crawled data to be crawled to the server.
  • the processor may further implement the following steps: sending a crawl script acquisition request to the server; receiving a crawl script returned by the server corresponding to the crawl script acquisition request; and crawling the data to be crawled in the website to be crawled by using the crawl script.
  • the processor may further implement the following steps: obtaining the time at which the crawl script was last received from the server; when the difference between that time and the current time is within the preset range, crawling the data to be crawled in the website to be crawled by using the crawl script last received from the server; and when the difference between that time and the current time is not within the preset range, performing the step of sending a crawl script acquisition request to the server.
  • the processor may further implement the following steps: detecting whether the URL address of the current page displayed by the client changes; when the URL address of the current page displayed by the client changes, the website to be crawled has been logged in to successfully; when the URL address of the current page displayed by the client has not changed, the website to be crawled has not been logged in to successfully.
  • the website to be crawled is a mailbox website; when the processor executes the program, the following steps may be implemented: selecting a mail corresponding to the data to be crawled from the mailbox website; and crawling the data of the preset field from the selected mail as the crawled data to be crawled.
  • when the processor executes the program, the following steps may be implemented: encrypting the crawled data to be crawled; packaging the encrypted data to be crawled; and sending the packaged data to be crawled to the server.
  • a computer readable storage medium having stored thereon computer readable instructions, such as the non-volatile storage medium shown in FIG. 10, wherein the memory may include a non-volatile storage medium such as a magnetic disk, an optical disk, or a read-only memory (ROM).
  • the memory includes a non-volatile storage medium and an internal memory.
  • a non-volatile storage medium of a computer device stores an operating system, a database, and a computer executable program.
  • the database stores data related to a webpage data crawling method provided by the various embodiments described above.
  • the program is executed by the processor to implement the following steps: receiving an account and a password corresponding to the website to be crawled through the login interface of the website to be crawled embedded in the client, and logging in to the website to be crawled by using that account and password; detecting whether the website to be crawled has been logged in to successfully; when the website to be crawled has been logged in to successfully, determining whether the account of the client matches the account of the website to be crawled; when the account of the client matches the account of the website to be crawled, crawling the data to be crawled in the website to be crawled; and sending the crawled data to be crawled to the server.
  • the following steps may be implemented: sending a crawl script acquisition request to the server; receiving a crawl script returned by the server corresponding to the crawl script acquisition request; and crawling the data to be crawled in the website to be crawled by using the crawl script.
  • the following steps may be implemented: obtaining the time at which the crawl script was last received from the server; when the difference between that time and the current time is within the preset range, crawling the data to be crawled in the website to be crawled by using the crawl script last received from the server; and when the difference between that time and the current time is not within the preset range, performing the step of sending a crawl script acquisition request to the server.
  • the following steps may be implemented: detecting whether the URL address of the current page displayed by the client changes; when the URL address of the current page displayed by the client changes, the website to be crawled has been logged in to successfully; when the URL address of the current page displayed by the client has not changed, the website to be crawled has not been logged in to successfully.
  • the website to be crawled is a mailbox website; when the program is executed by the processor, the following steps may be implemented: selecting a mail corresponding to the data to be crawled from the mailbox website; and crawling the data of the preset field from the selected mail as the crawled data to be crawled.
  • the following steps may be implemented: encrypting the crawled data to be crawled; packaging the encrypted data to be crawled; and sending the packaged data to be crawled to the server.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

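A web page data crawling method, apparatus, user terminal, and readable storage medium. The method includes: receiving, through a login interface of a website to be crawled that is embedded in a client, an input account and password corresponding to the website to be crawled, and logging in to the website to be crawled with that account and password; detecting whether the login to the website to be crawled succeeded; when the login succeeded, determining whether the account of the client matches the account of the website to be crawled; when they match, crawling the data to be crawled from the website to be crawled; and sending the crawled data to a server.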

Description

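Web page data crawling method, apparatus, user terminal, and readable storage medium
This application claims priority to Chinese patent application No. 2017106192634, filed with the Chinese Patent Office on July 26, 2017 and entitled "网页数据爬取方法、装置、用户终端及可读存储介质" (Web page data crawling method, apparatus, user terminal, and readable storage medium), the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of computer technology, and in particular to a web page data crawling method, apparatus, user terminal, and readable storage medium.
Background
At present, a large amount of valuable information on the Internet needs to be crawled to a server for analysis, for example to analyze user behavior. One approach is for the server to submit an account and password to the website to be crawled, log in, and then crawl the data stored on that website. However, websites now enforce strict security mechanisms, and crawling the information of too many accounts from the same IP address triggers a website's risk-control mechanism, so that users' accounts are blocked and users can no longer use them.
Summary
According to various embodiments of the present application, a web page data crawling method, apparatus, storage medium, and terminal are provided that solve one or more of the problems described in the Background.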
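A web page data crawling method, comprising:
receiving, through a login interface of a website to be crawled that is embedded in a client, an input account and password corresponding to the website to be crawled, and logging in to the website to be crawled with the account and password;
detecting whether the login to the website to be crawled succeeded;
when the login to the website to be crawled succeeded, determining whether the account of the client matches the account of the website to be crawled;
when the account of the client matches the account of the website to be crawled, crawling the data to be crawled from the website to be crawled; and
sending the crawled data to a server.
A web page data crawling apparatus, comprising:
a login module, configured to receive, through a login interface of a website to be crawled that is embedded in a client, an input account and password corresponding to the website to be crawled, and to log in to the website to be crawled with the account and password;
a detection module, configured to detect whether the login to the website to be crawled succeeded;
a verification module, configured to determine, when the login to the website to be crawled succeeded, whether the account of the client matches the account of the website to be crawled;
a crawling module, configured to crawl the data to be crawled from the website to be crawled when the account of the client matches the account of the website to be crawled;
a sending module, configured to send the crawled data to a server.
A user terminal, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
receiving, through a login interface of a website to be crawled that is embedded in a client, an input account and password corresponding to the website to be crawled, and logging in to the website to be crawled with the account and password;
detecting whether the login to the website to be crawled succeeded;
when the login to the website to be crawled succeeded, determining whether the account of the client matches the account of the website to be crawled;
when the account of the client matches the account of the website to be crawled, crawling the data to be crawled from the website to be crawled; and
sending the crawled data to a server.
A computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the following steps:
receiving, through a login interface of a website to be crawled that is embedded in a client, an input account and password corresponding to the website to be crawled, and logging in to the website to be crawled with the account and password;
detecting whether the login to the website to be crawled succeeded;
when the login to the website to be crawled succeeded, determining whether the account of the client matches the account of the website to be crawled;
when the account of the client matches the account of the website to be crawled, crawling the data to be crawled from the website to be crawled; and
sending the crawled data to a server.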
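The details of one or more embodiments of the present invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the present invention will become apparent from the description, the drawings, and the claims.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a diagram of the application environment of the web page data crawling method in one embodiment;
Fig. 2 is a flowchart of the web page data crawling method in one embodiment;
Fig. 3 is a flowchart of step S208 of the embodiment shown in Fig. 2;
Fig. 4 is a view of the QQ Mail login interface in one embodiment;
Fig. 5 is a view of the interface shown while bill data is being crawled in one embodiment;
Fig. 6 is a view of the interface shown when bill data has been crawled successfully in one embodiment;
Fig. 7 is another flowchart of step S208 of the embodiment shown in Fig. 2;
Fig. 8 is a flowchart of step S210 of the embodiment shown in Fig. 2;
Fig. 9 is a schematic structural diagram of the web page data crawling apparatus in one embodiment;
Fig. 10 is a schematic structural diagram of the user terminal in one embodiment.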
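Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here merely explain the present invention and do not limit it.
Before the embodiments of the present invention are described in detail, it should be noted that the embodiments reside primarily in combinations of steps and system components related to the web page data crawling method, apparatus, user terminal, and readable storage medium. Accordingly, the system components and method steps are represented where appropriate by conventional symbols in the drawings, which show only the details relevant to understanding the embodiments, so as not to obscure the disclosure with details that are readily apparent to a person of ordinary skill in the art having the benefit of the present invention.
In this document, relational terms such as left and right, top and bottom, front and back, and first and second are used only to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. The terms "include", "comprise", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device.
Referring to Fig. 1, Fig. 1 is a diagram of the application environment of the web page data crawling method in one embodiment. The environment includes a server and several user terminals, and the server can communicate with each of the user terminals. A client APP is installed on each user terminal, and the website to be crawled is embedded in the client APP. The user terminal may be a mobile phone, a tablet, a computer, or a similar terminal. The client APP may be an APP from any provider in which the website to be crawled is embedded; for example, a mailbox login interface may be embedded in a client APP such as WeChat.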
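Referring to Fig. 2, one embodiment provides a web page data crawling method. This embodiment is described with the method applied to the user terminal in Fig. 1 above. A web page data crawling program runs on the user terminal, and the method is carried out by that program. The method includes the following steps: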
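S202: Receive, through a login interface of the website to be crawled that is embedded in the client, an input account and password corresponding to the website to be crawled, and log in to the website to be crawled with that account and password.
Specifically, the client is an application such as an APP installed on the user terminal, in which the login interface of the website to be crawled is embedded. That login interface may be a mailbox login interface or an e-commerce login interface, for example the QQ Mail, 126 Mail, 163 Mail, Taobao, Alipay, JD.com, or Vipshop login interface.
After logging in to the client with the client account, the user opens the login interface of the website to be crawled and enters the account and password of that website, and thereby logs in to the website through the login interface embedded in the client. For example, the QQ Mail login interface is embedded in the "平安一账通 APP". The user may first log in to the "平安一账通 APP" with the user's account for that APP, then open the QQ Mail login interface embedded in the APP and log in to QQ Mail by entering the QQ Mail account and password there.
S204: Detect whether the login to the website to be crawled succeeded.
Specifically, the data to be crawled can be crawled only after a successful login to the website to be crawled. Before crawling, it is therefore necessary to detect whether the login succeeded; if it did not, the data to be crawled cannot be crawled.
S206: When the login to the website to be crawled succeeded, determine whether the account of the client matches the account of the website to be crawled.
Specifically, in some cases a user may use his or her own client to log in to another user's account on the website to be crawled. If that other user's account were crawled, the data finally obtained would not belong to the client's user and would therefore be wrong. To make sure that the account on the website to be crawled belongs to the user, the method determines before crawling whether the account of the client matches the account of the website to be crawled. For example, a unique identifier of the user, such as the user's ID card number, may be set in the user account of the "平安一账通 APP", and a unique identifier such as the ID card number may likewise be set in the QQ Mail account. Only when the unique identifier of the "平安一账通 APP" account matches that of the QQ Mail account does the method proceed to crawl the data to be crawled from the website to be crawled.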
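The matching check in S206 can be illustrated with a minimal sketch. It assumes that both accounts expose a comparable unique identifier (the text gives an ID card number as an example); the `Account` type and the `accounts_match` function are hypothetical names introduced here for illustration and are not part of the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    """Hypothetical account record; only the unique identifier matters here."""
    login_name: str
    unique_id: str  # e.g. an ID card number, as in the example in the text


def accounts_match(client_account: Account, site_account: Account) -> bool:
    """Return True only when both accounts carry the same non-empty identifier."""
    client_id = client_account.unique_id.strip()
    site_id = site_account.unique_id.strip()
    # An empty identifier never matches: an account without one cannot be verified.
    return bool(client_id) and client_id == site_id
```

Treating a missing identifier as a mismatch is a design choice of this sketch, so that crawling proceeds only when ownership can be positively confirmed.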
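S208: When the account of the client matches the account of the website to be crawled, crawl the data to be crawled from the website to be crawled.
Specifically, when the account of the client matches the account of the website to be crawled, the data to be crawled is confirmed to be the user's own data, and the crawling program in the client crawls it directly. Each user's data is thus crawled on that user's own terminal rather than all users' data being crawled on the server, which effectively avoids the situation in which the risk-control mechanism of the website to be crawled locks the user's account on that website.
S210: Send the crawled data to the server.
Specifically, once the user terminal has crawled the data, it can send the data to the server so that the server can provide corresponding services to the user based on it. For example, when the user terminal has crawled a credit card bill from QQ Mail, the server can use the bill data to remind the user when a repayment is due, or can offer the user a repayment red envelope, such as a 5 yuan discount red envelope when the user needs to repay 1000 yuan.
In the web page data crawling method above, the data to be crawled is crawled by the client. After the user logs in to the website to be crawled through the login interface embedded in the client, the method verifies that the account of the website to be crawled corresponds to the account of the client, which ensures that the crawled data belongs to the client's user. The crawled data is then sent to the server for processing and analysis. This avoids crawling on the server side, which could trigger the risk-control mechanism and cause the user's account to be locked.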
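The order of steps S202 to S210 can be sketched as follows. This is only an outline under stated assumptions: each step is passed in as a callable, and the function name `run_crawl_flow`, its parameters, and the returned status strings are hypothetical and introduced for illustration.

```python
from typing import Callable


def run_crawl_flow(
    login: Callable[[], None],
    login_succeeded: Callable[[], bool],
    accounts_match: Callable[[], bool],
    crawl: Callable[[], bytes],
    send_to_server: Callable[[bytes], None],
) -> str:
    """Run the client-side flow and report where it stopped."""
    login()                    # S202: log in via the embedded login interface
    if not login_succeeded():  # S204: no crawling without a successful login
        return "login_failed"
    if not accounts_match():   # S206: only crawl the client user's own account
        return "account_mismatch"
    data = crawl()             # S208: crawl on the user terminal, not the server
    send_to_server(data)       # S210: hand the result to the server
    return "sent"
```

Because every step runs on the user terminal, the website to be crawled sees one account per terminal rather than many accounts from a single server address.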
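In one embodiment, referring to Fig. 3, which is a flowchart of step S208 of the embodiment shown in Fig. 2, step S208, that is, the step of crawling the data to be crawled from the website to be crawled, may include:
S302: Send a crawl script acquisition request to the server.
Specifically, the crawl script is a script that runs on the user terminal and crawls the data to be crawled from the website to be crawled. The crawl script is stored on the server, so it needs to be modified only on the server side, and the new script is simply downloaded from the server before the next crawl. Because it is a script, it takes up little space and is fast to transmit. When the account of the client matches the account of the website to be crawled, the user terminal sends a crawl script acquisition request to the server. On receiving the request, the server looks up the crawl script, packages it, and sends it to the corresponding client, which reduces the amount of data transmitted.
S304: Receive the crawl script returned by the server in response to the crawl script acquisition request.
Specifically, once the server has found the crawl script corresponding to the crawl script acquisition request, it sends the script to the user terminal, which can then use it to crawl the data to be crawled from the website to be crawled.
S306: Crawl the data to be crawled from the website to be crawled by means of the crawl script.
Specifically, the user terminal crawls the corresponding data with the crawl script downloaded from the server. Refer to Figs. 4 to 6: Fig. 4 is a view of the QQ Mail login interface in one embodiment, Fig. 5 is a view of the interface shown while bill data is being crawled, and Fig. 6 is a view of the interface shown when bill data has been crawled successfully. The QQ Mail interface is embedded in the client on the user terminal, and the user logs in to QQ Mail by entering the account and password in that interface, as in Fig. 4. After the QQ Mail login succeeds and the user terminal finds that the client account matches the QQ Mail account, the terminal downloads the crawl script from the server and uses it to crawl the bill information in QQ Mail. As in Fig. 5, the terminal can display the progress of the crawl; Fig. 5 shows that QQ Mail verification succeeded, the corresponding bill was found, and 64% of the bill has been crawled. When the user terminal has crawled the data, that is, the bill, it can notify the user that the crawl is complete, as in Fig. 6.
In the embodiment above, when the client account matches the account of the website to be crawled, the terminal sends a request for the script to the server, and the server, on receiving it, packages the latest script and transmits it to the user terminal. This has two benefits. First, because the script is stored on the server, it can be modified on the server alone; if the crawl script were shipped with the client installation package, every change to the script would require a new installation package and would increase how often the client has to be updated. Second, packaging the crawl script before sending it reduces the amount of data transmitted.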
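In one embodiment, before the step of sending a crawl script acquisition request to the server, the method may further include: obtaining the time at which a crawl script returned by the server was last received; and, when the difference between that time and the current time is within a preset range, crawling the data to be crawled from the website to be crawled by means of the crawl script last received from the server. When the difference between that time and the current time is not within the preset range, the step of sending a crawl script acquisition request to the server, that is, step S302, is performed.
Specifically, a preset range is set to prevent the user terminal from downloading the crawl script from the server many times within a short period. As long as the difference between the time at which the terminal last obtained the crawl script from the server and the current time is within the preset range, the terminal does not need to download the script again. The preset range may be 30 minutes, 1 hour, 2 hours, 1 day, 1 week, and so on, and is not limited here. For example, suppose the crawl script was downloaded from the server during the previous crawl at 9:30 a.m. and the preset range is 2 hours. If the next crawl takes place at 10:30 a.m., the difference from 9:30 a.m. is 1 hour, which is less than the preset range of 2 hours, so the crawl at 10:30 a.m. uses the previously downloaded script and no new download is needed. If the next crawl instead takes place at 2:30 p.m., the difference from 9:30 a.m. is 5 hours, which exceeds the preset range of 2 hours, so the crawl script must be downloaded from the server again.
In the embodiment above, after the client account has been matched with the account of the website to be crawled, the terminal may first obtain the time at which the crawl script was last obtained. If the difference between that time and the current time is within the preset range, the crawl script stored on the user terminal is used directly and no download from the server is needed. This avoids, for example, downloading the script every time a user who logs in to QQ Mail frequently in one day synchronizes bills, which would waste data traffic.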
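The reuse rule for the crawl script can be sketched as a small time-based cache. The sketch assumes that the download is provided as a callable and that a difference exactly equal to the preset range still counts as within range; the class name `CrawlScriptCache` and its members are hypothetical and used only for illustration.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional


class CrawlScriptCache:
    """Keep the last downloaded crawl script and reuse it while it is recent."""

    def __init__(self, fetch_from_server: Callable[[], str], preset_range: timedelta):
        self._fetch = fetch_from_server      # stands in for steps S302 and S304
        self._preset_range = preset_range
        self._script: Optional[str] = None
        self._received_at: Optional[datetime] = None

    def get(self, now: datetime) -> str:
        """Return the cached script, downloading a new one only when needed."""
        within_range = (
            self._script is not None
            and self._received_at is not None
            and now - self._received_at <= self._preset_range
        )
        if not within_range:
            self._script = self._fetch()
            self._received_at = now
        return self._script
```

With a preset range of 2 hours, a script downloaded at 9:30 a.m. is reused at 10:30 a.m. and downloaded again at 2:30 p.m., matching the example in the text.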
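In one embodiment, the step of detecting whether the login to the website to be crawled succeeded, that is, step S204 of the embodiment shown in Fig. 2, may include: detecting whether the URL of the current page displayed by the client has changed; when the URL of the current page displayed by the client has changed, the login to the website to be crawled succeeded; when the URL of the current page displayed by the client has not changed, the login to the website to be crawled did not succeed.
Specifically, because different web pages have different URLs (Uniform Resource Locators), a successful login can be detected by checking whether the URL of the page has changed. For example, the URL of the login interface of the website to be crawled may be A, and after a successful login the URL may become B. If the login fails, the client stays on the login interface of the website to be crawled, so the URL is still A. Checking whether the URL has changed is therefore a simple way to determine whether the login succeeded.
In the embodiment above, whether the login to the website to be crawled succeeded is detected by checking whether the URL of the client's current page has changed. The URL of the client's current page changes only when the login succeeds. When the login fails, the URL of the client's current page stays the same and a corresponding login failure message is shown.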
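A minimal sketch of the URL comparison follows. Ignoring the fragment part of the URL (the part after `#`) is an assumption of this sketch rather than something stated in the text, made because a fragment change alone does not load a different page; the function name `login_succeeded` and the example URLs are hypothetical.

```python
from urllib.parse import urldefrag


def login_succeeded(login_page_url: str, current_url: str) -> bool:
    """Treat the login as successful when the page URL differs from the login URL."""
    # Compare the URLs without their fragments (assumption of this sketch).
    before = urldefrag(login_page_url).url
    after = urldefrag(current_url).url
    return after != before
```

For instance, moving from `https://mail.example.com/login` to `https://mail.example.com/inbox` counts as a successful login, while remaining on the login URL does not. The check relies on the premise stated in the text that a failed login leaves the URL unchanged.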
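In one embodiment, the website to be crawled is a mailbox website. Referring to Fig. 7, which is another flowchart of step S208 of the embodiment shown in Fig. 2, step S208, that is, the step of crawling the data to be crawled from the website to be crawled, may include:
S702: Select, from the mailbox website, the mails whose subject corresponds to the data to be crawled.
Specifically, a mailbox may hold a large amount of data, while the server is interested only in the mails that correspond to the data to be crawled. The mails corresponding to the data to be crawled can therefore first be selected from the mailbox according to the nature of that data. For example, when bill data is to be crawled, the mails whose subject relates to bills are crawled first.
S704: Crawl the data of preset fields from the selected mails as the crawled data.
Specifically, a bill mail may contain a large amount of bill information. For example, a bill may include the name, date, amount spent, payee, and other information, while the server needs only the name and the amount spent. In that case the user terminal crawls only the data of the name and amount fields from the selected mails and does not crawl any additional data.
In the embodiment above, mails are first located by their subjects. For example, the subjects of the mails in the inbox, or of the mails received in the inbox during a certain period, can be traversed to identify the mails associated with credit card bills. When the user uses the client APP for the first time, all mails in the inbox need to be traversed. When the user has used the client APP before, the time at which the server last obtained a bill can be retrieved, and only the mails received in the inbox after that time need to be traversed. Once the mails whose subject relates to bills have been located, the content of the preset fields is obtained. For example, only the date, summary, income, and expenditure may be obtained, filtering out useless information; alternatively, all information may be obtained, such as the balance and the counterparties of income and expenditure.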
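Steps S702 and S704 can be sketched as a filter followed by a projection. The sketch assumes that the mails have already been read into simple records with parsed fields; how a real mailbox page is parsed is outside its scope. The names `Mail`, `select_mails`, and `extract_preset_fields`, and the keyword and field names in the usage note, are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List, Optional, Sequence


@dataclass
class Mail:
    """Hypothetical in-memory view of one mail in the inbox."""
    subject: str
    received_at: datetime
    fields: Dict[str, str]  # bill content already split into named fields


def select_mails(
    inbox: Iterable[Mail],
    subject_keywords: Sequence[str],
    since: Optional[datetime] = None,
) -> List[Mail]:
    """S702: keep mails whose subject contains a keyword, optionally only recent ones."""
    return [
        mail
        for mail in inbox
        if (since is None or mail.received_at > since)
        and any(keyword in mail.subject for keyword in subject_keywords)
    ]


def extract_preset_fields(mail: Mail, preset_fields: Sequence[str]) -> Dict[str, str]:
    """S704: keep only the preset fields and drop everything else."""
    return {name: mail.fields[name] for name in preset_fields if name in mail.fields}
```

On first use, `since` would be left as `None` so that the whole inbox is traversed; on later runs it would be set to the time of the last bill retrieval, so that only newer mails are examined.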
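In one embodiment, referring to Fig. 8, which is a flowchart of step S210 of the embodiment shown in Fig. 2, step S210, that is, the step of sending the crawled data to the server, may include:
S802: Encrypt the crawled data.
Specifically, because the crawled data involves the user's private information, it needs to be encrypted for transmission. Either a symmetric or an asymmetric encryption method may be used; this is not limited here. After crawling the data, the user terminal encrypts it and then sends it to the server, and the server, on receiving the data, performs the corresponding decryption to obtain the crawled data.
S804: Package the encrypted data.
Specifically, to reduce the amount of data transmitted, the crawled data can be packaged and the packaged data sent to the server, which reduces the user's data traffic.
S806: Send the packaged data to the server.
Specifically, once the data has been packaged, it is sent to the server. At this point the user terminal may check its current network environment. When the network is a Wi-Fi network, the packaged data is sent to the server. When the network is a mobile network, the packaged data is held back until the user terminal's network changes to Wi-Fi, and is then sent to the server. This reduces the user's data traffic.
In the embodiment above, when the crawled data is sent, it is first encrypted and the encrypted data is then packaged. This both protects the data during transmission and reduces the amount of data transmitted.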
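Steps S802 to S806, including the deferral of uploads until Wi-Fi is available, can be sketched as follows. The text leaves the encryption method open, so the sketch takes the cipher as a caller-supplied function; using `zlib` compression for the packaging step is likewise an assumption. The class name `Uploader` and its parameters are hypothetical.

```python
import json
import zlib
from typing import Callable, Dict, List


class Uploader:
    """Encrypt, package, and send crawled data, holding it back while off Wi-Fi."""

    def __init__(
        self,
        encrypt: Callable[[bytes], bytes],  # cipher chosen by the caller
        send: Callable[[bytes], None],      # transport to the server
        on_wifi: Callable[[], bool],        # current network check
    ):
        self._encrypt = encrypt
        self._send = send
        self._on_wifi = on_wifi
        self._pending: List[bytes] = []

    def submit(self, records: Dict[str, str]) -> None:
        plaintext = json.dumps(records, ensure_ascii=False).encode("utf-8")
        ciphertext = self._encrypt(plaintext)  # S802: encrypt first, as in the text
        package = zlib.compress(ciphertext)    # S804: package the encrypted data
        self._pending.append(package)
        self.flush()                           # S806: send now if Wi-Fi is available

    def flush(self) -> int:
        """Send everything pending if on Wi-Fi; return the number of packages sent."""
        if not self._on_wifi():
            return 0
        sent = 0
        while self._pending:
            self._send(self._pending[0])
            self._pending.pop(0)  # drop a package only after it was sent
            sent += 1
        return sent
```

The application would call `flush()` again when the network changes to Wi-Fi. The sketch keeps the order given in the text; since well-encrypted data generally compresses poorly, an implementation that relies on compression to save traffic might compress before encrypting instead.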
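Although the steps in the flowcharts of Figs. 2, 3, 7, and 8 above are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless expressly stated herein, there is no strict order in which these steps must be performed, and they may be performed in other orders. Moreover, at least some of the steps in Figs. 2, 3, 7, and 8 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be performed at different moments, and they are not necessarily performed one after another but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.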
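Referring to Fig. 9, which is a schematic structural diagram of the web page data crawling apparatus in one embodiment, the web page data crawling apparatus includes:
a login module 100, configured to receive, through a login interface of the website to be crawled that is embedded in the client, an input account and password corresponding to the website to be crawled, and to log in to the website to be crawled with that account and password.
a detection module 200, configured to detect whether the login to the website to be crawled succeeded.
a verification module 300, configured to determine, when the login to the website to be crawled succeeded, whether the account of the client matches the account of the website to be crawled.
a crawling module 400, configured to crawl the data to be crawled from the website to be crawled when the account of the client matches the account of the website to be crawled.
a sending module 500, configured to send the crawled data to the server.
In one embodiment, the sending module may further be configured to send a crawl script acquisition request to the server.
The crawling module may include:
a receiving unit, configured to receive the crawl script returned by the server in response to the crawl script acquisition request.
a crawling unit, configured to crawl the data to be crawled from the website to be crawled by means of the crawl script.
In one embodiment, the web page data crawling apparatus may further include:
a time acquisition module, configured to obtain the time at which a crawl script returned by the server was last received.
a comparison module, configured to crawl the data to be crawled from the website to be crawled by means of the crawl script last received from the server when the difference between that time and the current time is within a preset range, and to perform the step of sending a crawl script acquisition request to the server when the difference is not within the preset range.
In one embodiment, the detection module may further be configured to detect whether the URL of the current page displayed by the client has changed; when the URL of the current page displayed by the client has changed, the login to the website to be crawled succeeded; when the URL of the current page displayed by the client has not changed, the login to the website to be crawled did not succeed.
In one embodiment, the website to be crawled is a mailbox website. The crawling module may further be configured to select, from the mailbox website, the mails whose subject corresponds to the data to be crawled, and to crawl the data of preset fields from the selected mails as the crawled data.
In one embodiment, the sending module may include:
an encryption unit, configured to encrypt the crawled data.
a packaging unit, configured to package the encrypted data.
a sending unit, configured to send the packaged data to the server.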
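The modules and units of the web page data crawling apparatus may be program segments divided by function. For the limitations on the web page data crawling apparatus, refer to the limitations on the web page data crawling method above, which are not repeated here. Each module of the apparatus may be implemented wholly or partly in software, hardware, or a combination thereof. Each module may be embedded in, or independent of, the processor of a computer device in hardware form, or may be stored in the memory of the computer device in software form so that the processor can invoke it to perform the operations corresponding to the module. The processor may be a central processing unit (CPU), a microprocessor, a single-chip microcomputer, or the like. The apparatus may be implemented in the form of computer-readable instructions, and the computer-readable instructions may run on the user terminal shown in Fig. 1.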
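An embodiment of the present invention provides a computer device that includes a series of computer-readable instructions stored in a memory. When executed by a processor, the computer-readable instructions implement the web page data crawling method of the embodiments of the present invention, in some embodiments through the specific operations implemented by the respective parts of the computer-readable instructions. Referring to Fig. 10, which is a schematic structural diagram of the user terminal in one embodiment, the user terminal includes a memory, a processor, and an operating system connected through a system bus. The processor provides computing and control capabilities and supports the operation of the whole computer device. The memory stores data, program code, and the like.
The memory stores at least one computer-executable program that can be executed by the processor to implement the web page data crawling method provided in the embodiments of the present application. The internal memory of the user terminal provides a cached runtime environment for the operating system, the database, and the computer-executable program in the non-volatile storage medium.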
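When executing the program, the processor implements the following steps: receiving, through a login interface of the website to be crawled that is embedded in the client, an input account and password corresponding to the website to be crawled, and logging in to the website to be crawled with that account and password; detecting whether the login to the website to be crawled succeeded; when the login succeeded, determining whether the account of the client matches the account of the website to be crawled; when the account of the client matches the account of the website to be crawled, crawling the data to be crawled from the website to be crawled; and sending the crawled data to the server.
In one embodiment, when executing the program, the processor may further implement the following steps: sending a crawl script acquisition request to the server; receiving the crawl script returned by the server in response to the crawl script acquisition request; and crawling the data to be crawled from the website to be crawled by means of the crawl script.
In one embodiment, when executing the program, the processor may further implement the following steps: obtaining the time at which a crawl script returned by the server was last received; when the difference between that time and the current time is within a preset range, crawling the data to be crawled from the website to be crawled by means of the crawl script last received from the server; when the difference is not within the preset range, performing the step of sending a crawl script acquisition request to the server.
In one embodiment, when executing the program, the processor may further implement the following steps, the step of detecting whether the login to the website to be crawled succeeded including: detecting whether the URL of the current page displayed by the client has changed; when the URL of the current page displayed by the client has changed, the login to the website to be crawled succeeded; when the URL of the current page displayed by the client has not changed, the login to the website to be crawled did not succeed.
In one embodiment, the website to be crawled is a mailbox website. When executing the program, the processor may further implement the following steps: selecting, from the mailbox website, the mails whose subject corresponds to the data to be crawled; and crawling the data of preset fields from the selected mails as the crawled data.
In one embodiment, when executing the program, the processor may further implement the following steps: encrypting the crawled data; packaging the encrypted data; and sending the packaged data to the server.
For the limitations on the user terminal, refer to the specific limitations on the web page data crawling method above, which are not repeated here.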
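Still referring to Fig. 10, a computer-readable storage medium storing computer-readable instructions is also provided, such as the non-volatile storage medium shown in Fig. 10. The memory may include non-volatile storage media such as a magnetic disk, an optical disc, or a read-only memory (ROM). In one embodiment, the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system, a database, and a computer-executable program. The database stores data related to implementing the web page data crawling method provided in the embodiments above. When executed by the processor, the program implements the following steps: receiving, through a login interface of the website to be crawled that is embedded in the client, an input account and password corresponding to the website to be crawled, and logging in to the website to be crawled with that account and password; detecting whether the login to the website to be crawled succeeded; when the login succeeded, determining whether the account of the client matches the account of the website to be crawled; when the account of the client matches the account of the website to be crawled, crawling the data to be crawled from the website to be crawled; and sending the crawled data to the server.
In one embodiment, when executed by the processor, the program may further implement the following steps: sending a crawl script acquisition request to the server; receiving the crawl script returned by the server in response to the crawl script acquisition request; and crawling the data to be crawled from the website to be crawled by means of the crawl script.
In one embodiment, when executed by the processor, the program may further implement the following steps: obtaining the time at which a crawl script returned by the server was last received; when the difference between that time and the current time is within a preset range, crawling the data to be crawled from the website to be crawled by means of the crawl script last received from the server; when the difference is not within the preset range, performing the step of sending a crawl script acquisition request to the server.
In one embodiment, when executed by the processor, the program may further implement the following steps, the step of detecting whether the login to the website to be crawled succeeded including: detecting whether the URL of the current page displayed by the client has changed; when the URL of the current page displayed by the client has changed, the login to the website to be crawled succeeded; when the URL of the current page displayed by the client has not changed, the login to the website to be crawled did not succeed.
In one embodiment, the website to be crawled is a mailbox website. When executed by the processor, the program may further implement the following steps: selecting, from the mailbox website, the mails whose subject corresponds to the data to be crawled; and crawling the data of preset fields from the selected mails as the crawled data.
In one embodiment, when executed by the processor, the program may further implement the following steps: encrypting the crawled data; packaging the encrypted data; and sending the packaged data to the server.
For the limitations on the computer-readable storage medium, refer to the specific limitations on the web page data crawling method above, which are not repeated here.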
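A person of ordinary skill in the art will understand that all or part of the processes of the methods in the embodiments above can be carried out by computer-readable instructions instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), or the like.
The technical features of the embodiments above may be combined in any way. For brevity, not every possible combination of these technical features is described; however, as long as a combination of these technical features involves no contradiction, it should be regarded as within the scope of this specification.
The embodiments above represent only several implementations of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make a number of variations and improvements without departing from the concept of the present invention, all of which fall within the scope of protection of the present invention. The scope of protection of this patent is therefore defined by the appended claims.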

Claims (24)

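  1. A web page data crawling method, characterized by comprising:
    receiving, through a login interface of a website to be crawled that is embedded in a client, an input account and password corresponding to the website to be crawled, and logging in to the website to be crawled with the account and password;
    detecting whether the login to the website to be crawled succeeded;
    when the login to the website to be crawled succeeded, determining whether the account of the client matches the account of the website to be crawled;
    when the account of the client matches the account of the website to be crawled, crawling the data to be crawled from the website to be crawled; and
    sending the crawled data to a server.
  2. The method according to claim 1, characterized in that the step of crawling the data to be crawled from the website to be crawled comprises:
    sending a crawl script acquisition request to the server;
    receiving the crawl script returned by the server in response to the crawl script acquisition request;
    crawling the data to be crawled from the website to be crawled by means of the crawl script.
  3. The method according to claim 2, characterized in that, before the step of sending a crawl script acquisition request to the server, the method further comprises:
    obtaining the time at which a crawl script returned by the server was last received;
    when the difference between that time and the current time is within a preset range, crawling the data to be crawled from the website to be crawled by means of the crawl script last received from the server; when the difference is not within the preset range, performing the step of sending a crawl script acquisition request to the server.
  4. The method according to claim 1, characterized in that the step of detecting whether the login to the website to be crawled succeeded comprises:
    detecting whether the URL of the current page displayed by the client has changed;
    when the URL of the current page displayed by the client has changed, the login to the website to be crawled succeeded;
    when the URL of the current page displayed by the client has not changed, the login to the website to be crawled did not succeed.
  5. The method according to claim 1, characterized in that the website to be crawled is a mailbox website;
    the step of crawling the data to be crawled from the website to be crawled comprises:
    selecting, from the mailbox website, the mails whose subject corresponds to the data to be crawled;
    crawling the data of preset fields from the selected mails as the crawled data.
  6. The method according to claim 1, characterized in that the step of sending the crawled data to the server comprises:
    encrypting the crawled data;
    packaging the encrypted data;
    sending the packaged data to the server.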
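  7. A web page data crawling apparatus, characterized by comprising:
    a login module, configured to receive, through a login interface of a website to be crawled that is embedded in a client, an input account and password corresponding to the website to be crawled, and to log in to the website to be crawled with the account and password;
    a detection module, configured to detect whether the login to the website to be crawled succeeded;
    a verification module, configured to determine, when the login to the website to be crawled succeeded, whether the account of the client matches the account of the website to be crawled;
    a crawling module, configured to crawl the data to be crawled from the website to be crawled when the account of the client matches the account of the website to be crawled; and
    a sending module, configured to send the crawled data to a server.
  8. The apparatus according to claim 7, characterized in that the sending module is further configured to send a crawl script acquisition request to the server;
    the crawling module comprises:
    a receiving unit, configured to receive the crawl script returned by the server in response to the crawl script acquisition request;
    a crawling unit, configured to crawl the data to be crawled from the website to be crawled by means of the crawl script.
  9. The apparatus according to claim 8, characterized in that the apparatus further comprises:
    a time acquisition module, configured to obtain the time at which a crawl script returned by the server was last received;
    a comparison module, configured to crawl the data to be crawled from the website to be crawled by means of the crawl script last received from the server when the difference between that time and the current time is within a preset range, and to send a crawl script acquisition request to the server when the difference is not within the preset range.
  10. The apparatus according to claim 7, characterized in that the detection module is further configured to detect whether the URL of the current page displayed by the client has changed; when the URL of the current page displayed by the client has changed, the login to the website to be crawled succeeded; when the URL of the current page displayed by the client has not changed, the login to the website to be crawled did not succeed.
  11. The apparatus according to claim 7, characterized in that the website to be crawled is a mailbox website;
    the crawling module is further configured to select, from the mailbox website, the mails whose subject corresponds to the data to be crawled, and to crawl the data of preset fields from the selected mails as the crawled data.
  12. The apparatus according to claim 7, characterized in that the sending module comprises:
    an encryption unit, configured to encrypt the crawled data;
    a packaging unit, configured to package the encrypted data;
    a sending unit, configured to send the packaged data to the server.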
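  13. A user terminal, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor, when executing the computer-readable instructions, implements the following steps:
    receiving, through a login interface of a website to be crawled that is embedded in a client, an input account and password corresponding to the website to be crawled, and logging in to the website to be crawled with the account and password;
    detecting whether the login to the website to be crawled succeeded;
    when the login to the website to be crawled succeeded, determining whether the account of the client matches the account of the website to be crawled;
    when the account of the client matches the account of the website to be crawled, crawling the data to be crawled from the website to be crawled; and
    sending the crawled data to a server.
  14. The user terminal according to claim 13, characterized in that the step, performed by the processor, of crawling the data to be crawled from the website to be crawled comprises:
    sending a crawl script acquisition request to the server;
    receiving the crawl script returned by the server in response to the crawl script acquisition request;
    crawling the data to be crawled from the website to be crawled by means of the crawl script.
  15. The user terminal according to claim 14, characterized in that, before the step, performed by the processor, of sending a crawl script acquisition request to the server, the steps further comprise:
    obtaining the time at which a crawl script returned by the server was last received;
    when the difference between that time and the current time is within a preset range, crawling the data to be crawled from the website to be crawled by means of the crawl script last received from the server; when the difference is not within the preset range, performing the step of sending a crawl script acquisition request to the server.
  16. The user terminal according to claim 13, characterized in that the step, performed by the processor, of detecting whether the login to the website to be crawled succeeded comprises:
    detecting whether the URL of the current page displayed by the client has changed;
    when the URL of the current page displayed by the client has changed, the login to the website to be crawled succeeded;
    when the URL of the current page displayed by the client has not changed, the login to the website to be crawled did not succeed.
  17. The user terminal according to claim 13, characterized in that the website to be crawled is a mailbox website;
    the step, performed by the processor, of crawling the data to be crawled from the website to be crawled comprises:
    selecting, from the mailbox website, the mails whose subject corresponds to the data to be crawled;
    crawling the data of preset fields from the selected mails as the crawled data.
  18. The user terminal according to claim 13, characterized in that the step, performed by the processor, of sending the crawled data to the server comprises:
    encrypting the crawled data;
    packaging the encrypted data;
    sending the packaged data to the server.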
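  19. A computer-readable storage medium storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by a processor, implement the following steps:
    receiving, through a login interface of a website to be crawled that is embedded in a client, an input account and password corresponding to the website to be crawled, and logging in to the website to be crawled with the account and password;
    detecting whether the login to the website to be crawled succeeded;
    when the login to the website to be crawled succeeded, determining whether the account of the client matches the account of the website to be crawled;
    when the account of the client matches the account of the website to be crawled, crawling the data to be crawled from the website to be crawled; and
    sending the crawled data to a server.
  20. The computer-readable storage medium according to claim 19, characterized in that the step, implemented when the computer-readable instructions are executed by the processor, of crawling the data to be crawled from the website to be crawled comprises:
    sending a crawl script acquisition request to the server;
    receiving the crawl script returned by the server in response to the crawl script acquisition request;
    crawling the data to be crawled from the website to be crawled by means of the crawl script.
  21. The computer-readable storage medium according to claim 20, characterized in that, before the step, implemented when the computer-readable instructions are executed by the processor, of sending a crawl script acquisition request to the server, the steps further comprise:
    obtaining the time at which a crawl script returned by the server was last received;
    when the difference between that time and the current time is within a preset range, crawling the data to be crawled from the website to be crawled by means of the crawl script last received from the server; when the difference is not within the preset range, performing the step of sending a crawl script acquisition request to the server.
  22. The computer-readable storage medium according to claim 19, characterized in that the step, implemented when the computer-readable instructions are executed by the processor, of detecting whether the login to the website to be crawled succeeded comprises:
    detecting whether the URL of the current page displayed by the client has changed;
    when the URL of the current page displayed by the client has changed, the login to the website to be crawled succeeded;
    when the URL of the current page displayed by the client has not changed, the login to the website to be crawled did not succeed.
  23. The computer-readable storage medium according to claim 19, characterized in that the website to be crawled is a mailbox website;
    the step, implemented when the computer-readable instructions are executed by the processor, of crawling the data to be crawled from the website to be crawled comprises:
    selecting, from the mailbox website, the mails whose subject corresponds to the data to be crawled;
    crawling the data of preset fields from the selected mails as the crawled data.
  24. The computer-readable storage medium according to claim 19, characterized in that the step, implemented when the computer-readable instructions are executed by the processor, of sending the crawled data to the server comprises:
    encrypting the crawled data;
    packaging the encrypted data;
    sending the packaged data to the server.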
PCT/CN2017/103932 2017-07-26 2017-09-28 网页数据爬取方法、装置、用户终端及可读存储介质 WO2019019344A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710619263.4A CN107689951A (zh) 2017-07-26 2017-07-26 网页数据爬取方法、装置、用户终端及可读存储介质
CN201710619263.4 2017-07-26

Publications (1)

Publication Number Publication Date
WO2019019344A1 true WO2019019344A1 (zh) 2019-01-31

Family

ID=61153095

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/103932 WO2019019344A1 (zh) 2017-07-26 2017-09-28 网页数据爬取方法、装置、用户终端及可读存储介质

Country Status (2)

Country Link
CN (1) CN107689951A (zh)
WO (1) WO2019019344A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020253366A1 (zh) * 2019-06-17 2020-12-24 深圳壹账通智能科技有限公司 网页邮箱数据的爬取方法、装置、终端和存储介质
CN113254744A (zh) * 2021-04-24 2021-08-13 中电长城网际系统应用广东有限公司 一种使用网络爬虫技术获取安全设备数据信息的方法

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110968755A (zh) * 2018-09-29 2020-04-07 北京国双科技有限公司 一种爬取数据的方法及装置
CN109670100B (zh) * 2018-12-21 2020-06-26 第四范式(北京)技术有限公司 一种页面数据抓取方法及装置
CN109948020A (zh) * 2019-01-14 2019-06-28 北京三快在线科技有限公司 数据获取方法、装置、系统及可读存储介质
CN110162682A (zh) * 2019-04-12 2019-08-23 深圳壹账通智能科技有限公司 一种网络数据的爬取方法、装置、存储介质和终端设备
CN110400080A (zh) * 2019-07-26 2019-11-01 浙江大搜车软件技术有限公司 考核数据监控方法、装置、计算机设备和存储介质
CN110677423A (zh) * 2019-09-30 2020-01-10 深圳前海环融联易信息科技服务有限公司 基于客户代理端的数据采集方法、装置、及计算机设备
CN110691091A (zh) * 2019-09-30 2020-01-14 深圳前海环融联易信息科技服务有限公司 基于身份认证的数据采集方法、装置、及计算机设备
CN112989159A (zh) * 2019-12-16 2021-06-18 浙江大搜车软件技术有限公司 数据获取方法、装置、计算机设备和存储介质
CN114780822A (zh) * 2022-06-20 2022-07-22 云账户技术(天津)有限公司 爬取应用程序数据的方法、装置、电子设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120284806A1 (en) * 2002-09-13 2012-11-08 Oracle America, Inc. Embedded content requests in a rights locker system for digital content access control
CN103365893A (zh) * 2012-03-31 2013-10-23 百度在线网络技术(北京)有限公司 一种用于实现搜索用户的个体信息的方法和设备
CN105814901A (zh) * 2013-10-10 2016-07-27 尼尔森(美国)有限公司 测量到流媒体的曝光的方法和设备
CN106021257A (zh) * 2015-12-31 2016-10-12 广州华多网络科技有限公司 一种支持在线编程的爬虫抓取数据方法、装置及系统
CN106341313A (zh) * 2016-09-29 2017-01-18 北京小米移动软件有限公司 获取账单信息的方法及装置
CN106886547A (zh) * 2016-07-13 2017-06-23 阿里巴巴集团控股有限公司 一种脚本生成方法与装置

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102761843B (zh) * 2012-08-10 2015-06-17 上海洲信信息技术有限公司 基于全文检索和wappush的移动终端用户获取邮件的系统和获取邮件的方法
CN103902889A (zh) * 2012-12-26 2014-07-02 腾讯科技(深圳)有限公司 一种恶意消息云检测方法和服务器

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120284806A1 (en) * 2002-09-13 2012-11-08 Oracle America, Inc. Embedded content requests in a rights locker system for digital content access control
CN103365893A (zh) * 2012-03-31 2013-10-23 百度在线网络技术(北京)有限公司 一种用于实现搜索用户的个体信息的方法和设备
CN105814901A (zh) * 2013-10-10 2016-07-27 尼尔森(美国)有限公司 测量到流媒体的曝光的方法和设备
CN106021257A (zh) * 2015-12-31 2016-10-12 广州华多网络科技有限公司 一种支持在线编程的爬虫抓取数据方法、装置及系统
CN106886547A (zh) * 2016-07-13 2017-06-23 阿里巴巴集团控股有限公司 一种脚本生成方法与装置
CN106341313A (zh) * 2016-09-29 2017-01-18 北京小米移动软件有限公司 获取账单信息的方法及装置

Also Published As

Publication number Publication date
CN107689951A (zh) 2018-02-13

Similar Documents

Publication Publication Date Title
WO2019019344A1 (zh) 网页数据爬取方法、装置、用户终端及可读存储介质
US11477180B2 (en) Differential client-side encryption of information originating from a client
EP3596642B1 (en) Privacy-preserving identity verification
EP3207464B1 (en) Method, device, terminal, and server for verifying security of service operation
KR20210089682A (ko) 블록 체인을 사용한 영지식 증명 결제
US9378345B2 (en) Authentication using device ID
US12003505B2 (en) Custom authorization of network connected devices using signed credentials
CN111314306A (zh) 接口访问方法及装置、电子设备、存储介质
TWI679550B (zh) 帳號登入方法及裝置
EP3242455A1 (en) Method and device for identifying user identity
CN112333198A (zh) 安全跨域登录方法、系统及服务器
WO2015062362A1 (zh) 用户登录的方法、设备及系统
CN111656730A (zh) 解耦和更新移动设备上的锁定证书
CN104580112B (zh) 一种业务认证方法、系统及服务器
US20150082440A1 (en) Detection of man in the browser style malware using namespace inspection
WO2016202204A1 (zh) 一种下载应用的方法和装置
US20210377309A1 (en) System and method for establishing secure session with online disambiguation data
WO2015142968A1 (en) Providing multi-level password and phishing protection
CN113792346A (zh) 一种可信数据处理方法、装置及设备
US20230403562A1 (en) Systems and methods for verified communication between mobile applications
CN111506503B (zh) 基于JMeter的接口签名验证方法及装置、计算设备、存储介质
US20230376953A1 (en) Systems and methods for verified communication between mobile applications
CN104348807B (zh) 基于可定制的浏览器的安全性信息交互方法
CN115712916A (zh) 一种基于区块链的数据访问处理方法以及数据交互系统
CN117422416A (zh) 基于区块链的业务办理方法、装置、设备、介质及产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17919012

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 23/06/2020)

122 Ep: pct application non-entry in european phase

Ref document number: 17919012

Country of ref document: EP

Kind code of ref document: A1