WO2021022689A1 - Information collection method and device - Google Patents

Information collection method and device

Info

Publication number
WO2021022689A1
Authority
WO
WIPO (PCT)
Prior art keywords
collection
parameters
preset
web page
information
Prior art date
Application number
PCT/CN2019/115278
Other languages
English (en)
French (fr)
Inventor
袁学文
Original Assignee
苏州闻道网络科技股份有限公司
Application filed by 苏州闻道网络科技股份有限公司
Publication of WO2021022689A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441: Countermeasures against malicious traffic
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/30: Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information

Definitions

  • the present disclosure relates to the technical field of network information collection, in particular to an information collection method and device.
  • the present disclosure provides an information collection method and device.
  • an information collection method including:
  • the updated collection parameters are sent to multiple collection terminals respectively, and the webpage content parsed by the collection terminals is received to obtain the target webpage content.
  • the collection parameter includes at least one of the following:
  • the user agent parameters of the web page request, the access source parameters of the web page request, the permission parameters of the web page request, and the IP address parameters.
  • the updating the collection parameters according to preset rules, and the preset rules are generated according to anti-crawler rules of multiple websites, including:
  • the user agent parameters and/or the IP address parameters of the web page request are regularly updated.
  • the updating the collection parameters according to preset rules, and the preset rules are generated according to anti-crawler rules of multiple websites, including:
  • the collection parameters are replaced with candidate parameters as the updated collection parameters, and the preset rules are generated according to anti-crawler rules of multiple websites.
  • the sending the updated collection parameters to multiple collection terminals respectively, and receiving the webpage content parsed by the collection terminal to obtain the target webpage content includes:
  • an information collection device, including:
  • the acquisition module is used to acquire the collection parameters of the target website;
  • the update module is configured to update the collection parameters according to preset rules, which are generated according to anti-crawler rules of multiple websites;
  • the scheduling module sends the updated collection parameters to multiple collection terminals respectively, and receives the web page content parsed by the collection terminals to obtain the target web page content.
  • the collection parameter includes at least one of the following:
  • the user agent parameters of the web page request, the access source parameters of the web page request, the permission parameters of the web page request, and the IP address parameters.
  • the update module includes:
  • the adding sub-module is used to add the access source parameter of the web page request and/or the permission parameter of the web page request related to the target website to the collection parameter.
  • the update module includes:
  • the update submodule is used to periodically update the user agent parameters and/or IP address parameters of the webpage request according to the preset time interval or the preset number of visits.
  • the update module includes:
  • the acquiring sub-module is used to acquire candidate parameters corresponding to the acquisition parameters from the preset acquisition parameter database;
  • the replacement sub-module is configured to replace the collection parameters with candidate parameters according to preset rules as the updated collection parameters, and the preset rules are generated according to the anti-crawler rules of multiple websites.
  • the scheduling module includes:
  • the sending sub-module is used to send the updated collection parameters to multiple collection terminals respectively;
  • a receiving sub-module for receiving webpage content parsed by the collection terminal, and extracting webpage data and sub-website information from the webpage content
  • the extraction sub-module is used to send the sub-website information to a collection terminal within a preset collection threshold, receive the webpage content parsed by the collection terminal, and extract webpage data and new sub-website information from the webpage content;
  • the storage sub-module is used to store the webpage data and obtain the target webpage content.
  • an information collection system including:
  • the user terminal obtains the collection parameters of the target website
  • the information collection device according to any embodiment of the present disclosure.
  • the collection terminal is used to receive and analyze the updated collection parameters sent by the information collection device, and send the parsed webpage content to the information collection device.
  • an information collection device including:
  • a memory for storing processor executable instructions
  • the processor is configured to execute the method described in any embodiment of the present disclosure.
  • a non-transitory computer-readable storage medium; when the instructions in the storage medium are executed by a processor, the processor can execute the method described in any embodiment of the present disclosure.
  • the technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects: the present disclosure formulates corresponding rules in advance by acquiring the anti-crawler strategies of different websites, and updates the collection parameters of the target website according to the preset rules.
  • the beneficial effect is that the operation is simple, the user does not need to perform a large number of parameter configurations, can continuously visit the target website, is not affected by anti-crawler technology, and has high crawling efficiency.
  • Fig. 1 is an application scenario diagram of an information collection method according to an exemplary embodiment.
  • Fig. 2 is a flow chart showing an information collection method according to an exemplary embodiment.
  • Fig. 3 is a block diagram showing an information collection device according to an exemplary embodiment.
  • Fig. 4 is a block diagram showing an information collection device according to an exemplary embodiment.
  • Fig. 5 is a block diagram showing an information collection device according to an exemplary embodiment.
  • the Internet has produced a large number of data resources, which contain many valuable things.
  • crawler technology with collection functions is produced.
  • Some crawler technologies are relatively simple to operate but can only handle simple collection tasks; when the collection volume is large, information collection efficiency is low, or the task is even difficult to complete.
  • some websites have adopted anti-crawler technical measures. For example, when a user terminal is detected to have made a large number of visits, the user terminal will be blocked to restrict its continued access. The emergence of anti-crawler technology has further increased the difficulty of information collection: operators need to configure a large number of parameters when collecting information, and writing a crawler program may require changing many files, which is not easy in practice.
  • the present disclosure provides an information collection method, device, and system.
  • Fig. 1 is an application scenario diagram of a collection method according to an exemplary embodiment.
  • the functions of the first user terminal 101, the second user terminal 102, and the third user terminal 103 are to receive information provided by the user.
  • there can be multiple user terminals, and when the collection task of a user terminal is completed, that terminal can continue to receive new collection parameters and send them to the server 201.
  • the user terminal may include terminal devices with input functions such as notebook computers, mobile phones, and tablets.
  • the server 201 has functions of data processing, task scheduling, and data storage.
  • the server 201 is configured to receive collection parameters sent by a user terminal, configure a corresponding collection terminal for the collection parameters, update the collection parameters according to preset rules, and send the updated collection parameters to the corresponding collection terminal. For example, if the second collection terminal 302 receives the updated collection parameters from the first user terminal 101, it will analyze the updated collection parameters to obtain the corresponding webpage content, and send the webpage content to the server 201. The server 201 sends the received webpage content to the corresponding first user terminal 101.
  • Fig. 2 is a flowchart showing an information collection method according to an exemplary embodiment. As shown in Fig. 2, the method includes the following steps.
  • in step S11, collection parameters about the target website are acquired.
  • the collection parameters include website information about the target website and IP address information of the collection terminal performing the collection task.
  • the URL information can be obtained by parsing the web pages of the website, and further the corresponding relationship between the website theme and the URL information can be established, the corresponding URL information can be found according to the theme name of the user's target website, and the URL information can be input to the user terminal.
  • the web address information may be provided by multiple user terminals, and the web address information provided by the user terminals are processed sequentially according to the sequence of receiving the web address information sent by the multiple user terminals.
  • in step S12, the collection parameters are updated according to preset rules, which are generated according to anti-crawler rules of multiple websites.
  • the anti-crawler policy of the Baidu website blocks a visitor's access when the number of visits reaches 30,000; the anti-crawler policy of the NetEase website blocks a visitor's access when the number of visits reaches 15,000; and Sohu's anti-crawler policy blocks a visitor's access when the number of visits reaches 10,000.
  • based on these policies, the following rule can be formulated according to the minimum number of visits: for example, each time the number of visits reaches 10,000, the visitor's IP address is changed once.
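  • The rule described above, taking the strictest visit limit across the listed websites as the trigger for changing the IP address, can be sketched as follows. This is a minimal illustration only: the dictionary `SITE_VISIT_LIMITS` and the function names are hypothetical, and the limits are simply the example figures quoted in the text.

```python
# Hypothetical per-site visit limits, copied from the example figures in the text.
SITE_VISIT_LIMITS = {"baidu": 30_000, "netease": 15_000, "sohu": 10_000}


def rotation_threshold(limits):
    """Return the smallest visit limit, so a single rule fits every listed site."""
    return min(limits.values())


def should_change_ip(visits_since_change, limits=SITE_VISIT_LIMITS):
    """Return True once the visit count reaches the strictest site's limit."""
    return visits_since_change >= rotation_threshold(limits)
```

  • With the example figures, `rotation_threshold(SITE_VISIT_LIMITS)` evaluates to 10,000, matching the rule stated in the text.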
  • the preset rule may also include checking whether the web page request header contains a marker of a crawler program and, if such a marker is found, deleting it.
  • for example, if the word python appears in the request header, the target website may conclude that it is being requested directly through python, so the word python is deleted according to the preset rules.
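  • The header check described above might look like the sketch below. The marker list and the function name `strip_crawler_markers` are assumptions; the text only names the word python as an example. Removing the whole whitespace-delimited token that contains the marker, rather than the bare word, is also a choice made here for illustration.

```python
import re

# Assumed marker list; the text only gives "python" as an example.
CRAWLER_MARKERS = ("python",)


def strip_crawler_markers(headers, markers=CRAWLER_MARKERS):
    """Return a copy of headers with tokens containing a marker removed."""
    cleaned = {}
    for name, value in headers.items():
        for marker in markers:
            # Drop whole tokens such as "python-requests/2.31.0".
            pattern = r"\S*" + re.escape(marker) + r"\S*"
            value = re.sub(pattern, "", value, flags=re.IGNORECASE)
        # Collapse the whitespace left behind by removed tokens.
        cleaned[name] = " ".join(value.split())
    return cleaned
```

  • An empty header value left after stripping would normally be replaced by a candidate user agent rather than sent as-is.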
  • the setting of the preset rules is not limited to the above examples. Those skilled in the art may make other changes under the inspiration of the technical essence of this application, but as long as the functions and effects achieved are the same as or similar to those of this application, they should be covered by the scope of protection of this application.
  • in step S13, the updated collection parameters are respectively sent to multiple collection terminals, and the webpage content parsed by the collection terminals is received to obtain the target webpage content.
  • a browser may be installed in the collection terminal, and the browser is used to parse the URL information.
  • the browser types include at least one of Mozilla Firefox, Internet Explorer, Microsoft Edge, Google Chrome, Opera, Safari, 360 browser, and QQ browser. Alternatively, a browser may not be used: the collection terminal accepts the collection parameters, connects to the server of the target website, obtains the webpage content from the server, and transmits the target webpage content.
  • the present disclosure obtains the anti-crawler strategies of different websites, formulates corresponding rules in advance, and updates the collection parameters of the target website according to the preset rules.
  • the beneficial effects achieved by the present disclosure are that operation is simple, the user does not need to perform a large number of parameter configurations, the target website can be visited continuously without being affected by anti-crawler technology, and crawling efficiency is high.
  • the collection parameter includes at least one of the following:
  • the user agent parameters of the web page request header, the access source parameters of the web page request header, the permission parameters of the web page request header, and the IP address parameters.
  • the user agent parameter of the web page request header is the HTTP User-Agent (UA)
  • the access source parameter of the web page request header is the HTTP Referer
  • the permission parameter of the web page request header is the HTTP Authorization
  • the user agent parameters, access source parameters, and permission parameters of the web page request header can be initially obtained through the URL information provided by the user terminal.
  • the IP address parameters can be initially obtained through the collection terminal. When the collection parameters are updated, the updated candidate collection parameters can be obtained from a pre-established collection parameter database, which contains parameter data for each collection parameter type.
  • collection parameters are not limited to those listed above, and can also include the cookie parameters of the web page request header, and other custom parameters, etc.
  • the collection parameter types can be adapted according to the emergence of anti-crawler technology, which is not limited in the present disclosure.
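  • The collection parameters listed above could be held in a small record type and rendered into HTTP request headers, as in the following sketch. The class name `CollectionParameters`, its field names, and the `extra_headers` field (standing in for cookies and other custom parameters) are hypothetical; only the header names User-Agent, Referer, and Authorization come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CollectionParameters:
    """Hypothetical container for one collection task's parameters."""
    url: str
    user_agent: Optional[str] = None      # HTTP User-Agent
    referer: Optional[str] = None         # HTTP Referer (access source)
    authorization: Optional[str] = None   # HTTP Authorization (permission)
    ip_address: Optional[str] = None      # address used by the collection terminal
    extra_headers: Dict[str, str] = field(default_factory=dict)  # e.g. Cookie

    def to_headers(self) -> Dict[str, str]:
        """Build the request header dictionary, skipping unset parameters."""
        headers = dict(self.extra_headers)
        if self.user_agent:
            headers["User-Agent"] = self.user_agent
        if self.referer:
            headers["Referer"] = self.referer
        if self.authorization:
            headers["Authorization"] = self.authorization
        return headers
```

  • The IP address is kept outside the headers on purpose in this sketch, since it is a property of the connection (for example a proxy) rather than a request header.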
  • step S12 the collection parameters are updated according to preset rules, and the preset rules are generated according to anti-crawler rules of multiple websites.
  • S121 Add the access source parameter of the web page request and/or the permission parameter of the web page request related to the target website to the collection parameter.
  • the access source parameter of the webpage request indicates the specific link through which the target website was reached. For example, to access QQ Music, a user will first search for QQ Music through Baidu, enter the QQ homepage, and then enter the QQ Music page, so the HTTP request header sent when entering the QQ Music page will contain access source information such as Baidu and the QQ homepage. If the URL information is obtained directly through a crawler program, such access source parameters are generally not included, so such crawlers may be rejected by the site administrator.
  • the access source parameters of the webpage request can be obtained by simulating a login to the target website. Adding the access source parameters to the collection parameters can effectively avoid the anti-crawler rules of websites that set access barriers based on the access source, enabling the collection terminal to better collect the content of the target website.
  • the permission parameter of the webpage request arises when the collection terminal accesses certain websites: the website assigns an ID to the collection terminal, the HTTP request header of the collection terminal then contains this ID, and the back end of the target website verifies the access authority of the ID.
  • the permission parameter of the webpage request can likewise be obtained by simulating a login to the target website.
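  • Step S121 can be illustrated with a short sketch that adds site-specific Referer and Authorization values to an existing header set. The table `SITE_HEADERS`, the host name, and the placeholder token are all hypothetical; in practice the values would be whatever was captured during the simulated login described above.

```python
# Hypothetical per-site values, e.g. captured during a simulated login.
SITE_HEADERS = {
    "music.example.com": {
        "Referer": "https://www.example.com/",
        "Authorization": "Bearer <token>",
    },
}


def add_access_headers(headers, host, site_headers=SITE_HEADERS):
    """Return a copy of headers with the site's Referer/Authorization added."""
    updated = dict(headers)
    for name, value in site_headers.get(host, {}).items():
        # Keep a value the caller already supplied; only fill in missing ones.
        updated.setdefault(name, value)
    return updated
```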
  • step S12 the collection parameters are updated according to preset rules, and the preset rules are generated according to anti-crawler rules of multiple websites.
  • the preset time interval may be determined by a timer, and the timer may include a program written in Python or Java that triggers a certain task at a fixed time. It should be noted that the preset time interval can be dynamically adjusted for different scenario requirements.
  • the user agent parameter of the web page request header can indicate the information of the collection terminal.
  • when a visitor uses the same UA to continuously visit the same website to obtain data, and the UA always shows that the access terminal is an Android system, website administrators may conclude that the access operation is performed by a machine and restrict the visitor.
  • to address this, the present disclosure replaces the user agent parameter at a preset time interval and eliminates the preset content in the URL information, which can effectively avoid the anti-crawler rules of websites that set access barriers by detecting user agent parameters, so that the collection terminal can better collect the content of the target website.
  • the IP address parameter represents the address information of the collection terminal.
  • the website administrator may think that the access operation is completed by a machine and restrict the visitor, for example by setting an upper limit on the number of visits. If the number of visits to the target website by the collection terminal using the same IP address reaches the upper limit, the collection terminal may be blocked.
  • the present disclosure replaces the IP address of the collection terminal at a preset time interval, which can effectively avoid the anti-crawler rules of websites that set access barriers by detecting the IP address parameters, so that the collection terminal can better collect the content of the target website.
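  • Periodic replacement of the user agent or IP address, triggered either by a preset time interval or by a preset number of visits, could be sketched as below. The class `RotatingParameter` and its arguments are hypothetical; the `clock` argument exists only so that the interval can be controlled in a test, and cycling through candidates in order is an assumption (the text does not fix a selection order for this step).

```python
import itertools
import time


class RotatingParameter:
    """Hand out one value and switch to the next when a limit is reached."""

    def __init__(self, candidates, max_age_seconds=None, max_visits=None,
                 clock=time.monotonic):
        self._cycle = itertools.cycle(candidates)
        self._max_age = max_age_seconds
        self._max_visits = max_visits
        self._clock = clock
        self._rotate()

    def _rotate(self):
        # Move to the next candidate and reset both counters.
        self.current = next(self._cycle)
        self._visits = 0
        self._since = self._clock()

    def use(self):
        """Return the value for the next request, rotating first if due."""
        expired = (self._max_age is not None
                   and self._clock() - self._since >= self._max_age)
        exhausted = (self._max_visits is not None
                     and self._visits >= self._max_visits)
        if expired or exhausted:
            self._rotate()
        self._visits += 1
        return self.current
```

  • For example, `RotatingParameter(ip_pool, max_visits=10_000)` would switch addresses after every 10,000 requests, while `max_age_seconds` expresses the timer-based variant.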
  • step S12 the collection parameters are updated according to preset rules, and the preset rules are generated according to anti-crawler rules of multiple websites.
  • Step S123 Obtain candidate parameters corresponding to the collection parameters from a preset collection parameter database
  • Step S124 Replace the collection parameters with candidate parameters according to preset rules as the updated collection parameters, and the preset rules are generated according to the anti-crawler rules of multiple websites.
  • the collection parameter database may be established in advance, and corresponding candidate parameters, such as IP address candidate parameters and user agent candidate parameters of the web page request header, may be established according to the types of the collection parameters.
  • for the IP address candidate parameters, the IP addresses of multiple remote hosts can be obtained through an ADSL dial-up server and stored as an IP address candidate parameter library, or IP proxies can be purchased. For the user agent candidate parameters, a user agent candidate parameter table can be established through pre-collection, and when used, the corresponding user agent parameters are randomly read and substituted.
  • this provides a standardized basis for the collection parameters provided by the user terminal, so that collection is not restricted by anti-crawler technology and the information collection operation can be completed successfully.
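  • Steps S123 and S124 can be sketched as a lookup in a candidate table followed by a random substitution. The table `CANDIDATE_DB`, its keys, and its values are placeholders invented for illustration (the IP addresses are from a documentation-only range); excluding the current value from the draw is also an assumption.

```python
import random

# Hypothetical candidate parameter database keyed by parameter type.
CANDIDATE_DB = {
    "user_agent": [
        "Mozilla/5.0 (X11; Linux x86_64)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    ],
    "ip_address": ["203.0.113.10", "203.0.113.11", "203.0.113.12"],
}


def replace_with_candidates(params, keys, db=CANDIDATE_DB, rng=random):
    """Return a copy of params with each listed key swapped for a candidate."""
    updated = dict(params)
    for key in keys:
        # Skip the value already in use so the parameter actually changes.
        choices = [c for c in db.get(key, []) if c != params.get(key)]
        if choices:
            updated[key] = rng.choice(choices)
    return updated
```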
  • the updated collection parameters are sent to multiple collection terminals respectively, and the webpage content parsed by the collection terminal is received to obtain the target webpage content.
  • Step S131 Send the updated collection parameters to multiple collection terminals respectively;
  • Step S132 receiving the webpage content parsed by the collection terminal, and extracting webpage data and sub-URL information from the webpage content
  • Step S133 within a preset collection threshold, send the sub-website information to a collection terminal, receive the webpage content parsed by the collection terminal, and extract webpage data and new sub-website information from the webpage content;
  • Step S134 Store the webpage data to obtain the target webpage content.
  • the updated collection parameters include the URL information of the target website and the IP address information of the collection terminal. Because there are many user terminals and the information collection needs of each user terminal are relatively large (for example, capturing JD's comments and follow relationships across the entire network), the number of requests can range from billions to tens or even hundreds of billions, which requires a good scheduling design.
  • a distributed scheduling collection method can be used with multiple collection terminals, in which different scheduling threads share one URL queue.
  • a Redis database can be used for request queue sharing to ensure efficient scheduling.
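  • The idea of several workers draining one shared URL queue can be sketched with the standard library alone. Note the substitution: an in-process `queue.Queue` stands in here for the Redis-backed queue the text suggests, so this sketch only covers threads on one machine; the function name `run_workers` and the `fetch` callback are hypothetical.

```python
import queue
import threading


def run_workers(urls, fetch, worker_count=3):
    """Drain one shared URL queue with several worker threads."""
    shared = queue.Queue()
    for url in urls:
        shared.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = shared.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            content = fetch(url)
            with lock:
                results[url] = content

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

  • Sharing the queue across separate collection terminals would need an external store that every terminal can reach, which is the role the text assigns to Redis.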
  • the collection terminal after receiving the updated collection parameters, sends the collection parameters to the server of the target website to obtain the webpage content.
  • the webpage content parsed by the collection terminal is received. The webpage content contains webpage data and sub-website information; the sub-website information refers to a new link in the parsed webpage content that points to a corresponding webpage, for example, the next page of a user reviews page or the next song in a music playlist.
  • the webpage data and sub-URL information can be obtained by parsing the HTML tags of the webpage.
  • in some cases the webpage data exists in JavaScript code; the JavaScript code can then be obtained, and the webpage data and sub-URL information extracted from it using regular expressions.
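  • Both extraction routes mentioned above, parsing HTML tags and applying regular expressions to JavaScript text, can be illustrated with the standard library. The class and function names are hypothetical, and the regular expression is an assumed pattern that only picks up absolute URLs written as quoted string literals.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return sub-URLs found in anchor tags of the given HTML."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Assumed pattern: absolute URLs appearing as quoted strings in script text.
JS_URL_PATTERN = re.compile(r"""["'](https?://[^"']+)["']""")


def extract_js_urls(script_text):
    """Return URLs embedded as string literals in JavaScript code."""
    return JS_URL_PATTERN.findall(script_text)
```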
  • the method of extracting webpage data and new sub-website information from webpages is not limited to the above examples. Those skilled in the art may also make other changes under the inspiration of the technical essence of this application, but as long as the functions and effects achieved are the same as or similar to those of this application, they should be covered by the scope of protection of this application.
  • the preset collection threshold in the embodiment of the present disclosure is used as a stop condition. The number of collections, the amount of collected data, or the number of web pages collected can be set as the collection threshold.
  • within the preset collection threshold, continuously sending the sub-website information to the collection terminal, receiving the webpage content, and extracting webpage data and new sub-website information from it is a dynamic cyclic process. In this process, links unrelated to the subject can be filtered out as needed, valuable sub-website information is kept in the URL queue, and the webpage data is stored to obtain the target webpage content.
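  • The cyclic process of steps S132 to S134 can be sketched as a queue-driven loop with the number of collected pages as the stop condition. The function `crawl`, the `fetch_and_parse` callback (standing in for a collection terminal that returns webpage data and sub-URLs), and the `keep_link` filter are hypothetical names; using the page count as the threshold is one of the options the text lists.

```python
from collections import deque


def crawl(seed_urls, fetch_and_parse, max_pages, keep_link=lambda url: True):
    """Collect pages breadth-first until the page-count threshold is reached."""
    pending = deque(seed_urls)   # the URL queue
    seen = set(seed_urls)        # avoid queuing the same URL twice
    stored = []                  # collected (url, data) pairs

    while pending and len(stored) < max_pages:
        url = pending.popleft()
        data, sub_urls = fetch_and_parse(url)
        stored.append((url, data))
        for sub_url in sub_urls:
            # Drop links already queued or judged unrelated to the subject.
            if sub_url not in seen and keep_link(sub_url):
                seen.add(sub_url)
                pending.append(sub_url)
    return stored
```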
  • the webpage content is parsed by the collection terminal, while receiving the parsed webpage content, extracting webpage data and sub-website information from it, and sending the sub-website information to the collection terminal are handled separately. This realizes separate management of collection tasks and scheduling tasks; the independent design facilitates implementation and improves collection efficiency.
  • Fig. 3 is a block diagram showing an information collection device according to an exemplary embodiment. As shown in Fig. 3, the device includes an acquisition module 11, an update module 12, and a scheduling module 13.
  • the acquisition module 11 is used to acquire collection parameters about the target website;
  • the update module 12 is configured to update the collection parameters according to preset rules, and the preset rules are generated according to anti-crawler rules of multiple websites;
  • the scheduling module 13 sends the updated collection parameters to multiple collection terminals respectively, and receives the web page content parsed by the collection terminals to obtain the target web page content.
  • the collection parameter includes at least one of the following:
  • the user agent parameters of the web page request, the access source parameters of the web page request, the permission parameters of the web page request, and the IP address parameters.
  • the update module 12 includes:
  • the adding sub-module 121 is configured to add the access source parameter of the web page request and/or the permission parameter of the web page request related to the target website to the collection parameter.
  • the update module 12 includes:
  • the update submodule 122 is configured to periodically update the user agent parameters and/or IP address parameters of the webpage request according to a preset time interval or a preset number of visits.
  • the update module 12 includes:
  • the obtaining sub-module 123 is configured to obtain candidate parameters corresponding to the collection parameters from a preset collection parameter database
  • the replacement sub-module 124 is configured to replace the collection parameters with candidate parameters according to preset rules as the updated collection parameters, and the preset rules are generated according to anti-crawler rules of multiple websites.
  • the scheduling module 13 includes:
  • the sending submodule 131 is used to send the updated collection parameters to multiple collection terminals respectively;
  • the receiving sub-module 132 is configured to receive the webpage content parsed by the collection terminal, and extract webpage data and sub-URL information from the webpage content;
  • the extraction sub-module 133 is configured to send the sub-URL information to the collection terminal within a preset collection threshold, receive the webpage content parsed by the collection terminal, and extract webpage data and new sub-URL information from the webpage content;
  • the storage sub-module 134 is used to store the webpage data to obtain the target webpage content.
  • an information collection system including:
  • the user terminal is used to obtain the collection parameters of the target website
  • the information collection device according to any embodiment of the present disclosure.
  • the collection terminal is used to receive and analyze the updated collection parameters sent by the information collection device, and send the parsed webpage content to the information collection device.
  • the user terminal is used to receive the collection parameters of the target website provided by the user. There can be multiple user terminals, and when the collection task of a user terminal is completed, that terminal can continue to receive new collection parameters and send them to the information collection device.
  • the user terminal may include terminal devices with input functions such as laptops, mobile phones, and tablets. The specific manners of performing operations of each module of the information collection device have been described in detail in the embodiments of the method, and detailed descriptions will not be given here.
  • the collection terminal may be installed with a browser, the browser is used to parse the URL information, and the browser types include Mozilla Firefox, Internet Explorer, Microsoft Edge, and Google Chrome.
  • the collection terminal accepts the collection parameters, connects to the server of the target website, obtains the webpage content from the server, and transmits the target webpage content.
  • the disclosed information collection system obtains the anti-crawler strategies of different websites, formulates corresponding rules in advance, and updates the collection parameters of the target website according to the preset rules.
  • the beneficial effects achieved by the disclosed system are that operation is simple, the user does not need to perform a large number of parameter configurations, the target website can be visited continuously without being affected by anti-crawler technology, and crawling efficiency is high. The collection device and the collection terminals are designed independently and each completes its own tasks; there is no communication between the collection terminals, which facilitates implementation and improves collection efficiency.
  • Fig. 4 is a block diagram showing a device 800 for a collection device according to an exemplary embodiment.
  • the device 800 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, etc.
  • the device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
  • the processing component 802 generally controls the overall operations of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the foregoing method.
  • the processing component 802 may include one or more modules to facilitate the interaction between the processing component 802 and other components.
  • the processing component 802 may include a multimedia module to facilitate the interaction between the multimedia component 808 and the processing component 802.
  • the memory 804 is configured to store various types of data to support operations in the device 800. Examples of these data include instructions for any application or method operating on the device 800, contact data, phone book data, messages, pictures, videos, etc.
  • the memory 804 can be implemented by any type of volatile or nonvolatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
  • the power supply component 806 provides power to various components of the device 800.
  • the power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 800.
  • the multimedia component 808 includes a screen that provides an output interface between the device 800 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touch, sliding, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure related to the touch or slide operation.
  • the multimedia component 808 includes a front camera and/or a rear camera. When the device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
  • the audio component 810 is configured to output and/or input audio signals.
  • the audio component 810 includes a microphone (MIC).
  • when the device 800 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode, the microphone is configured to receive external audio signals.
  • the received audio signal may be further stored in the memory 804 or transmitted via the communication component 816.
  • the audio component 810 further includes a speaker for outputting audio signals.
  • the I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module.
  • the peripheral interface module may be a keyboard, a click wheel, a button, and the like. These buttons may include but are not limited to: home button, volume button, start button, and lock button.
  • the sensor component 814 includes one or more sensors for providing the device 800 with various aspects of status assessment.
  • the sensor component 814 can detect the open/close state of the device 800 and the relative positioning of components, for example the display and the keypad of the device 800.
  • the sensor component 814 can also detect a position change of the device 800 or of a component of the device 800, the presence or absence of contact between the user and the device 800, the orientation or acceleration/deceleration of the device 800, and a temperature change of the device 800.
  • the sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects when there is no physical contact.
  • the sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
  • the communication component 816 is configured to facilitate wired or wireless communication between the device 800 and other devices.
  • the device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
  • the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • the apparatus 800 may be implemented by one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field-programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components, to perform the above methods.
  • a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 804 including instructions, which can be executed by the processor 820 of the device 800 to complete the foregoing method.
  • the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
  • Fig. 5 is a block diagram showing a device 1900 for information collection according to an exemplary embodiment.
  • the device 1900 may be provided as a server.
  • the apparatus 1900 includes a processing component 1922, which further includes one or more processors, and a memory resource represented by the memory 1932, for storing instructions executable by the processing component 1922, such as application programs.
  • the application program stored in the memory 1932 may include one or more modules each corresponding to a set of instructions.
  • the processing component 1922 is configured to execute instructions to perform the above-described methods.
  • the device 1900 may also include a power supply component 1926 configured to perform power management of the device 1900, a wired or wireless network interface 1950 configured to connect the device 1900 to a network, and an input/output (I/O) interface 1958.
  • the device 1900 can operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
  • a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 1932 including instructions, which may be executed by the processing component 1922 of the device 1900 to complete the foregoing method.
  • the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

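An information collection method and apparatus. The method includes: obtaining collection parameters for a target website (S11); updating the collection parameters according to preset rules, the preset rules being generated from the anti-crawler rules of multiple websites (S12); and sending the updated collection parameters to multiple collection terminals respectively, and receiving the web page content parsed by the collection terminals to obtain the target web page content (S13). By obtaining the anti-crawler strategies of different websites, formulating corresponding rules in advance, and updating the collection parameters for the target website according to the preset rules, operation is simple, the user side does not need to configure a large number of parameters, the target website can be visited continuously without being affected by anti-crawler technology, and collection efficiency is high.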

Description

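Information collection method and apparatus
Technical Field
The present disclosure relates to the technical field of network information collection, and in particular to an information collection method and apparatus.
Background
With the rapid development of network technology, Internet information is also growing rapidly, forming massive data resources. Crawler technology emerged to collect valuable data from these resources. In related crawler technology, information collection is slow, and collection efficiency is low once the amount of data to be collected reaches a certain scale. Client operation is also complicated: when writing a collection program, a large number of associated parameters must be entered to cope with the various anti-crawler strategies, which makes the technology inconvenient to use.
Summary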
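To overcome the problems in the related art and to improve the speed and convenience of information collection, the present disclosure provides an information collection method and apparatus.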
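According to a first aspect of the embodiments of the present disclosure, an information collection method is provided, including:
obtaining collection parameters for a target website;
updating the collection parameters according to preset rules, the preset rules being generated from the anti-crawler rules of multiple websites;
sending the updated collection parameters to multiple collection terminals respectively, and receiving the web page content parsed by the collection terminals to obtain the target web page content.
In a possible implementation, the collection parameters include at least one of the following:
a user agent parameter of the web page request, an access source parameter of the web page request, an authorization parameter of the web page request, and an IP address parameter.
In a possible implementation, updating the collection parameters according to the preset rules, the preset rules being generated from the anti-crawler rules of multiple websites, includes:
adding an access source parameter and/or an authorization parameter of a web page request related to the target website to the collection parameters.
In a possible implementation, updating the collection parameters according to the preset rules, the preset rules being generated from the anti-crawler rules of multiple websites, includes:
periodically updating the user agent parameter of the web page request and/or the IP address parameter according to a preset time interval or a preset number of visits.
In a possible implementation, updating the collection parameters according to the preset rules, the preset rules being generated from the anti-crawler rules of multiple websites, includes:
obtaining candidate parameters corresponding to the collection parameters from a preset collection parameter database;
replacing the collection parameters with the candidate parameters according to the preset rules to serve as the updated collection parameters, the preset rules being generated from the anti-crawler rules of multiple websites.
In a possible implementation, sending the updated collection parameters to multiple collection terminals respectively, and receiving the web page content parsed by the collection terminals to obtain the target web page content, includes:
sending the updated collection parameters to multiple collection terminals respectively;
receiving the web page content parsed by the collection terminals, and extracting web page data and sub-URL information from the web page content;
within a preset collection threshold, sending the sub-URL information to the collection terminals, receiving the web page content parsed by the collection terminals, and extracting web page data and new sub-URL information from the web page content;
storing the web page data to obtain the target web page content.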
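According to a second aspect of the embodiments of the present disclosure, an information collection apparatus is provided, including:
an obtaining module, configured to obtain collection parameters for a target website;
an updating module, configured to update the collection parameters according to preset rules, the preset rules being generated from the anti-crawler rules of multiple websites;
a scheduling module, configured to send the updated collection parameters to multiple collection terminals respectively, and to receive the web page content parsed by the collection terminals to obtain the target web page content.
In a possible implementation, the collection parameters include at least one of the following:
a user agent parameter of the web page request, an access source parameter of the web page request, an authorization parameter of the web page request, and an IP address parameter.
In a possible implementation, the updating module includes:
an adding submodule, configured to add an access source parameter and/or an authorization parameter of a web page request related to the target website to the collection parameters.
In a possible implementation, the updating module includes:
an updating submodule, configured to periodically update the user agent parameter of the web page request and/or the IP address parameter according to a preset time interval or a preset number of visits.
In a possible implementation, the updating module includes:
an obtaining submodule, configured to obtain candidate parameters corresponding to the collection parameters from a preset collection parameter database;
a replacing submodule, configured to replace the collection parameters with the candidate parameters according to the preset rules to serve as the updated collection parameters, the preset rules being generated from the anti-crawler rules of multiple websites.
In a possible implementation, the scheduling module includes:
a sending submodule, configured to send the updated collection parameters to multiple collection terminals respectively;
a receiving submodule, configured to receive the web page content parsed by the collection terminals, and to extract web page data and sub-URL information from the web page content;
an extracting submodule, configured to, within a preset collection threshold, send the sub-URL information to the collection terminals, receive the web page content parsed by the collection terminals, and extract web page data and new sub-URL information from the web page content;
a storing submodule, configured to store the web page data to obtain the target web page content.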
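According to a third aspect of the embodiments of the present disclosure, an information collection system is provided, including:
a user terminal, which obtains collection parameters for a target website;
the information collection apparatus according to any embodiment of the present disclosure;
a collection terminal, configured to receive and parse the updated collection parameters sent by the information collection apparatus, and to send the parsed web page content to the information collection apparatus.
According to a fourth aspect of the embodiments of the present disclosure, an information collection apparatus is provided, including:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the method according to any embodiment of the present disclosure.
According to a fifth aspect of the embodiments of the present disclosure, a non-transitory computer-readable storage medium is provided; when instructions in the storage medium are executed by a processor, the processor is enabled to execute the method according to any embodiment of the present disclosure.
The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects: by obtaining the anti-crawler strategies of different websites, formulating corresponding rules in advance, and updating the collection parameters for the target website according to the preset rules, the present disclosure achieves simple operation; the user side does not need to configure a large number of parameters and can continuously visit the target website without being affected by anti-crawler technology, and crawling efficiency is high.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present disclosure.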
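Brief Description of the Drawings
The drawings here are incorporated into and constitute a part of this specification, show embodiments consistent with the present disclosure, and together with the specification serve to explain the principles of the present disclosure.
Fig. 1 is an application scenario diagram of an information collection method according to an exemplary embodiment.
Fig. 2 is a flowchart of an information collection method according to an exemplary embodiment.
Fig. 3 is a block diagram of an information collection apparatus according to an exemplary embodiment.
Fig. 4 is a block diagram of an information collection apparatus according to an exemplary embodiment.
Fig. 5 is a block diagram of an information collection apparatus according to an exemplary embodiment.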
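Detailed Description
Exemplary embodiments will be described in detail here, examples of which are shown in the drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
To help those skilled in the art understand the technical solutions provided by the embodiments of the present disclosure, the technical environment in which the solutions are implemented is described first.
In the information age, the Internet has produced massive data resources containing much of value. Crawler technology with collection capabilities emerged to obtain this valuable data. Some crawler technologies are simple to operate but can only handle simple collection tasks; when the collection volume is large, collection efficiency is low or the task cannot be completed at all. As Internet technology has developed, some websites have adopted anti-crawler measures, for example blocking the account of a user side whose number of visits is detected to be large, in order to restrict further access. Anti-crawler technology makes information collection even harder: operators must configure a large number of parameters, and writing a crawler program may require changing many files, so practical application is not easy.
Based on practical technical needs similar to those described above, the present disclosure provides an information collection method, apparatus, and system.
Fig. 1 is an application scenario diagram of a collection method according to an exemplary embodiment. As shown in Fig. 1, the first user terminal 101, the second user terminal 102, and the third user terminal 103 receive the collection parameters for the target website provided by users. There may be multiple user terminals, and after the collection task of a user terminal is completed, that terminal can continue to receive new collection parameters and send them to the server 201. The user terminals may include terminal devices with input functions, such as notebook computers, mobile phones, and tablets. The server 201 provides data processing, task scheduling, and data storage. The server 201 receives the collection parameters sent by a user terminal, assigns corresponding collection terminals to the collection parameters, updates the collection parameters according to the preset rules, and sends the updated collection parameters to the corresponding collection terminals. For example, when the second collection terminal 302 receives updated collection parameters originating from the first user terminal 101, it parses the updated collection parameters to obtain the corresponding web page content and sends the web page content to the server 201, and the server 201 sends the received web page content to the corresponding first user terminal 101.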
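Fig. 2 is a flowchart of an information collection method according to an exemplary embodiment. As shown in Fig. 2, the method includes the following steps.
In step S11, collection parameters for a target website are obtained.
In the embodiments of the present disclosure, the collection parameters include URL information for the target website and IP address information of the collection terminal that executes the collection task. The URL information can be obtained by parsing the web pages of a website. Further, a correspondence between website topics and URL information can be established, so that the URL information corresponding to the topic name of the user's target website can be looked up and entered into the user terminal. The URL information may be provided by multiple user terminals, and the URL information provided by the user terminals is processed in the order in which it is received.
In step S12, the collection parameters are updated according to preset rules, the preset rules being generated from the anti-crawler rules of multiple websites.
In the embodiments of the present disclosure, when formulating the rules in advance, the anti-crawler rules of multiple websites need to be obtained, and an information collection approach that meets each website's requirements is formulated according to its anti-crawler detection technology, so that the collection behavior of the collection terminal takes the form of natural website access. For example, the anti-crawler strategy of the Baidu website is to block a visitor once the visitor's number of visits reaches 30,000; that of the NetEase website is to block a visitor at 15,000 visits; and that of the Sohu website is to block a visitor at 10,000 visits. Based on these anti-crawler strategies, the following rule can be formulated: take the lowest of these visit counts, here 10,000, and change the visitor's IP address once every time that count is reached. As another example, the preset rules may include checking whether the web page request header contains a marker of a crawler program and deleting any such marker that is found. For example, if the word "python" appears in the web page request header, the request may be regarded as having been made to the target website directly through python, and the word "python" is deleted according to the preset rules. It should be noted that the way of setting the preset rules is not limited to the above examples. Those skilled in the art may make other changes in light of the technical essence of the present application, and as long as the functions and effects achieved are the same as or similar to those of the present application, they shall fall within the protection scope of the present application.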
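The two example rules above (rotate the IP address at the lowest per-site visit limit, and delete crawler markers from the request header) can be sketched as follows. This is only an illustration under stated assumptions: the visit limits are the figures quoted in the text, while the names SITE_VISIT_LIMITS, CRAWLER_MARKERS, ip_rotation_threshold and strip_crawler_markers are hypothetical and do not come from the disclosure.

```python
import re

# Visit limits quoted in the text (visits before a visitor is blocked).
SITE_VISIT_LIMITS = {"baidu": 30_000, "netease": 15_000, "sohu": 10_000}

# Hypothetical list of crawler markers to delete from request headers.
CRAWLER_MARKERS = ("python",)


def ip_rotation_threshold(limits):
    """Pick the lowest per-site limit so one rule is safe for every site."""
    return min(limits.values())


def strip_crawler_markers(headers, markers=CRAWLER_MARKERS):
    """Return a copy of the headers with tokens containing a marker removed."""
    cleaned = {}
    for name, value in headers.items():
        for marker in markers:
            # Drop whole whitespace-delimited tokens such as "python-requests/2.31".
            pattern = rf"\S*{re.escape(marker)}\S*"
            value = re.sub(pattern, "", value, flags=re.IGNORECASE)
        # Collapse the whitespace left behind by removed tokens.
        cleaned[name] = " ".join(value.split())
    return cleaned
```

Choosing the minimum limit is the conservative reading of the rule in the text; a per-site threshold would also be consistent with it.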
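In step S13, the updated collection parameters are sent to multiple collection terminals respectively, and the web page content parsed by the collection terminals is received to obtain the target web page content.
In the embodiments of the present disclosure, a browser may be installed on the collection terminal, and the browser is used to parse the URL information. The browser type includes at least one of Mozilla Firefox, Internet Explorer, Microsoft Edge, Google Chrome, Opera, Safari, 360 Browser, and QQ Browser. Alternatively, no browser is used: the collection terminal receives the collection parameters, connects to the server of the target website, obtains the web page content from the server, and transmits the target web page content.
By obtaining the anti-crawler strategies of different websites, formulating corresponding rules in advance, and updating the collection parameters for the target website according to the preset rules, the present disclosure achieves simple operation; the user side does not need to configure a large number of parameters and can continuously visit the target website without being affected by anti-crawler technology, and crawling efficiency is high.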
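In a possible implementation, the collection parameters include at least one of the following:
a request-header user agent parameter, a request-header access source parameter, a request-header authorization parameter, and an IP address parameter.
In the embodiments of the present disclosure, the request-header user agent parameter is the HTTP User-Agent (UA), the request-header access source parameter is the HTTP Referer, and the request-header authorization parameter is the HTTP Authorization. The user agent parameter, access source parameter, and authorization parameter can initially be obtained from the URL information provided by the user client, and the IP address parameter can initially be obtained from the collection terminal. When the collection parameters are updated, the candidate collection parameters can be obtained from a pre-established collection parameter database, which contains parameter data for each of the collection parameter types. The collection parameters are of course not limited to those listed above and may also include the request-header Cookie parameter and other custom parameters. The types of collection parameters can be adapted as new anti-crawler technologies appear, which the present disclosure does not limit.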
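As a rough illustration of how the collection parameters listed above might be grouped, the sketch below models them as one record and derives the HTTP request headers from it. The class name CollectionParams, its field names and the to_headers method are assumptions made for this example; the disclosure does not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CollectionParams:
    """Hypothetical container for the collection parameters named in the text."""
    url: str                              # URL information of the target website
    user_agent: Optional[str] = None      # HTTP User-Agent
    referer: Optional[str] = None         # HTTP Referer (access source)
    authorization: Optional[str] = None   # HTTP Authorization
    ip_address: Optional[str] = None      # IP address used by the collection terminal
    cookie: Optional[str] = None          # optional extra parameter (HTTP Cookie)

    def to_headers(self) -> Dict[str, str]:
        """Build request headers, skipping parameters that are not set."""
        candidates = {
            "User-Agent": self.user_agent,
            "Referer": self.referer,
            "Authorization": self.authorization,
            "Cookie": self.cookie,
        }
        return {name: value for name, value in candidates.items() if value is not None}
```

The IP address is kept outside the headers because it is a property of the connection (for example a proxy choice) rather than a request header.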
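In a possible implementation, step S12, updating the collection parameters according to the preset rules, the preset rules being generated from the anti-crawler rules of multiple websites, includes:
S121, adding an access source parameter and/or an authorization parameter of a web page request related to the target website to the collection parameters.
In the embodiments of the present disclosure, the access source parameter of the web page request indicates the specific link through which the target website was reached. For example, to visit QQ Music, a user first searches for QQ Music through Baidu, enters the home page, and then enters the QQ Music page. The HTTP request header for entering the QQ Music page therefore contains access source information such as Baidu and the QQ home page. URL information obtained directly by a crawler program generally does not contain such access source parameters, so such a crawler program may be rejected by the website administrator. The access source parameter can be obtained by simulating a login to the target website. Adding the access source parameter to the collection parameters effectively avoids the anti-crawler rules of websites that set up access barriers based on the access source, so that the collection terminal can better collect the content of the target website.
In the embodiments of the present disclosure, the authorization parameter of the web page request reflects that when the collection terminal visits certain websites, the website administration assigns an ID to the collection terminal. The HTTP request header of the collection terminal then contains this ID, and the back end of the target website verifies the access permission of the ID. The authorization parameter can be obtained by simulating a login to the target website. Adding the authorization parameter to the collection parameters effectively avoids the anti-crawler rules of websites that set up access barriers based on the authorization parameter, so that the collection terminal can better collect the content of the target website.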
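Step S121 can be pictured as a small header-augmentation function. The sketch below is an assumption-laden illustration: add_access_headers is a hypothetical name, and how the Referer and Authorization values are obtained (the text mentions a simulated login) is outside its scope.

```python
def add_access_headers(headers, referer=None, authorization=None):
    """Return a copy of the headers with Referer/Authorization added if absent.

    Values already supplied by the user terminal are kept unchanged.
    """
    updated = dict(headers)
    if referer is not None:
        updated.setdefault("Referer", referer)
    if authorization is not None:
        updated.setdefault("Authorization", authorization)
    return updated
```

Using setdefault means the function only fills gaps; overwriting existing values would be an equally plausible reading of "adding" and is a one-line change.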
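In a possible implementation, step S12, updating the collection parameters according to the preset rules, the preset rules being generated from the anti-crawler rules of multiple websites, includes:
S122, periodically updating the request-header user agent parameter and/or the IP address parameter according to a preset time interval or a preset number of visits.
In the embodiments of the present disclosure, the preset time interval can be determined by a timer. The timer may include a program written in the python or java language that triggers a task at scheduled times. It should be noted that the preset time interval can be adjusted dynamically for different scenarios.
In the embodiments of the present disclosure, the request-header user agent parameter can indicate information about the collection terminal. For example, when a visitor keeps visiting the same website with the same UA to obtain data, and the UA always shows that the visiting terminal is an Android phone, the website administrator may conclude that the visits are performed by a machine and restrict the visitor. To prevent the visitor from being blocked from web pages for this reason, the present disclosure replaces the preset content in the URL information at every preset time interval, which effectively avoids the anti-crawler rules of websites that set up access barriers by checking the user agent parameter, so that the collection terminal can better collect the content of the target website.
In the embodiments of the present disclosure, the IP address parameter indicates the address information of the collection terminal. When a visitor keeps visiting the same website from the same IP address to obtain data, the website administrator may conclude that the visits are performed by a machine and restrict the visitor, for example by setting an upper limit on the number of visits. If the number of visits made by the collection terminal to the target website from the same IP address reaches the upper limit, the collection terminal may be banned. The present disclosure replaces the IP address of the collection terminal at every preset time interval, which effectively avoids the anti-crawler rules of websites that set up access barriers by checking the IP address parameter, so that the collection terminal can better collect the content of the target website.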
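A minimal sketch of step S122, assuming the rotation is checked lazily each time a value is requested rather than by a background timer. The class RotatingParameter and its arguments max_uses, max_age_seconds and clock are hypothetical names introduced for this example.

```python
import itertools
import time


class RotatingParameter:
    """Cycle through candidates after a preset use count or a preset time interval."""

    def __init__(self, candidates, max_uses=None, max_age_seconds=None,
                 clock=time.monotonic):
        if not candidates:
            raise ValueError("at least one candidate is required")
        self._cycle = itertools.cycle(candidates)
        self._max_uses = max_uses
        self._max_age = max_age_seconds
        self._clock = clock  # injectable so the interval can be tested without sleeping
        self._advance()

    def _advance(self):
        # Move to the next candidate and reset both counters.
        self._current = next(self._cycle)
        self._uses = 0
        self._since = self._clock()

    def get(self):
        """Return the current value, rotating first if a limit has been reached."""
        by_count = self._max_uses is not None and self._uses >= self._max_uses
        by_time = (self._max_age is not None
                   and self._clock() - self._since >= self._max_age)
        if by_count or by_time:
            self._advance()
        self._uses += 1
        return self._current
```

For instance, `RotatingParameter(ip_list, max_uses=10_000)` would change the IP address once every 10,000 requests, matching the example rule given earlier in the description; the same class could hold User-Agent strings.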
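In a possible implementation, step S12, updating the collection parameters according to the preset rules, the preset rules being generated from the anti-crawler rules of multiple websites, includes:
Step S123, obtaining candidate parameters corresponding to the collection parameters from a preset collection parameter database;
Step S124, replacing the collection parameters with the candidate parameters according to the preset rules to serve as the updated collection parameters, the preset rules being generated from the anti-crawler rules of multiple websites.
In the embodiments of the present disclosure, the collection parameter database can be established in advance, with candidate parameters set up for each type of collection parameter, such as candidate IP addresses and candidate request-header user agents. In a possible implementation, the IP addresses of multiple remote hosts can be obtained through an ADSL dial-up server and stored as a library of candidate IP addresses; this can also be achieved by purchasing IP proxies. The candidate user agents can be collected in advance into a candidate user agent table, from which an entry is read at random to replace the corresponding user agent parameter when needed. By obtaining the candidate parameters corresponding to the collection parameters from the preset collection parameter database and replacing the collection parameters with the candidate parameters according to the preset rules, a uniform standard can be efficiently applied to the collection parameters provided by different user terminals, so that they are not restricted by anti-crawler technology and the information collection job completes smoothly.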
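Steps S123 and S124 can be sketched with an in-memory table standing in for the collection parameter database. Everything named here is an assumption for illustration: CANDIDATE_DB, its placeholder user agent strings, the documentation-range IP addresses, and the function replace_with_candidates.

```python
import random

# Hypothetical in-memory stand-in for the preset collection parameter database.
CANDIDATE_DB = {
    "user_agent": [
        "ExampleBrowser/1.0 (placeholder A)",
        "ExampleBrowser/2.0 (placeholder B)",
    ],
    # Addresses from the reserved documentation ranges, used as placeholders.
    "ip_address": ["203.0.113.10", "203.0.113.11", "198.51.100.7"],
}


def replace_with_candidates(params, fields, db=CANDIDATE_DB, rng=random):
    """Return a copy of params with each listed field swapped for a random candidate.

    A candidate equal to the current value is skipped so the parameter
    actually changes whenever another option exists.
    """
    updated = dict(params)
    for field in fields:
        options = [c for c in db.get(field, []) if c != params.get(field)]
        if options:
            updated[field] = rng.choice(options)
    return updated
```

Random selection follows the text's description of the user agent table; a round-robin policy would serve equally well where even usage of the candidates matters.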
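In a possible implementation, step S13, sending the updated collection parameters to multiple collection terminals respectively, and receiving the web page content parsed by the collection terminals to obtain the target web page content, includes:
Step S131, sending the updated collection parameters to multiple collection terminals respectively;
Step S132, receiving the web page content parsed by the collection terminals, and extracting web page data and sub-URL information from the web page content;
Step S133, within a preset collection threshold, sending the sub-URL information to the collection terminals, receiving the web page content parsed by the collection terminals, and extracting web page data and new sub-URL information from the web page content;
Step S134, storing the web page data to obtain the target web page content.
In the embodiments of the present disclosure, the updated collection parameters include the URL information of the target website and the IP address information of the collection terminal. Because there are many user terminals and each has large information collection needs, for example crawling JD.com reviews and follow relationships across the whole network, the request volume, ranging from one billion to ten billion or even a hundred billion requests, requires a well-designed scheduler. This embodiment can use a distributed scheduling collection method in which multiple collection terminals are scheduled on different threads and share one URL queue. In a possible implementation, a redis database can be used to share the request queue, which keeps scheduling efficient.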
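The shared URL queue described above can be illustrated in a single process, with Python's thread-safe queue.Queue standing in for the redis-backed queue and threads standing in for collection terminals. This is a simplification, not the distributed design itself; run_workers and the fetch callback are hypothetical names.

```python
import queue
import threading


def run_workers(urls, fetch, worker_count=3):
    """Let several workers drain one shared URL queue and collect their results."""
    url_queue = queue.Queue()
    for url in urls:
        url_queue.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = url_queue.get_nowait()  # each URL is handed to exactly one worker
            except queue.Empty:
                return
            content = fetch(url)
            with lock:
                results[url] = content

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

The workers never talk to each other; they coordinate only through the queue, which mirrors the point made in the description that the collection terminals do not communicate among themselves.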
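In the embodiments of the present disclosure, after receiving the updated collection parameters, the collection terminal sends the collection parameters to the server of the target website and obtains the web page content. The web page content parsed by the collection terminal is then received; it contains web page data and sub-URL information. Sub-URL information refers to new links that appear in the parsed web page content and point to further web pages, for example the "next page" link on a user review page or the "next song" link in a music playlist. In one example, where the web page content is a static web page, the web page data and sub-URL information can be obtained by parsing the HTML tags of the page. In another example, where the web page data resides in JavaScript code, the JavaScript code containing the web page data can be obtained and the web page data and sub-URL information extracted with regular expressions. It should be noted that the way of extracting web page data and new sub-URL information from web pages is not limited to the above examples. Those skilled in the art may make other changes in light of the technical essence of the present application, and as long as the functions and effects achieved are the same as or similar to those of the present application, they shall fall within the protection scope of the present application.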
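Both extraction routes mentioned above can be sketched with the Python standard library. The names LinkAndTextParser, extract_static and extract_from_script are hypothetical, as is the assumption that the embedded data is a JSON object assigned to a variable called pageData and terminated by a semicolon.

```python
import json
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collect visible text and absolute link targets from static HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def extract_static(html, base_url):
    """Static page: parse the HTML tags for page data and sub-URLs."""
    parser = LinkAndTextParser(base_url)
    parser.feed(html)
    return parser.text_parts, parser.links


def extract_from_script(html, variable="pageData"):
    """Script-embedded data: pull a JSON object assigned to `variable` via regex."""
    pattern = rf"{re.escape(variable)}\s*=\s*(\{{.*?\}});"
    match = re.search(pattern, html, flags=re.DOTALL)
    return json.loads(match.group(1)) if match else None
```

The regular expression is deliberately simple and stops at the first `};`, so it suits small, well-formed payloads; pages with more complex scripts would need a more careful extraction step.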
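In the embodiments of the present disclosure, the preset collection threshold serves as the stop condition. The threshold can be set as a number of collections, an amount of collected data, a number of collected web pages, and so on. Within the preset collection threshold, the sub-URLs are continually sent to the collection terminals, web page content is received, and web page data and new sub-URL information are extracted from it, in a dynamic loop. During this process, links unrelated to the topic can be filtered out as needed, valuable sub-URL information is kept in the URL queue, and the web page data is stored to obtain the target web page content.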
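The threshold-bounded loop described above can be sketched as a breadth-first traversal. The sketch assumes the threshold is a page count (the text also allows a data volume or a collection count); crawl, fetch_and_extract and is_relevant are hypothetical names.

```python
from collections import deque


def crawl(seed_urls, fetch_and_extract, max_pages, is_relevant=lambda url: True):
    """Collect pages until the URL queue is empty or the page threshold is reached.

    fetch_and_extract(url) must return (page_data, sub_urls).
    """
    url_queue = deque(seed_urls)
    seen = set(seed_urls)
    stored = []
    pages = 0
    while url_queue and pages < max_pages:
        url = url_queue.popleft()
        data, sub_urls = fetch_and_extract(url)
        pages += 1
        stored.append((url, data))  # store the page data
        for sub_url in sub_urls:
            # Keep only unseen, on-topic links in the URL queue.
            if sub_url not in seen and is_relevant(sub_url):
                seen.add(sub_url)
                url_queue.append(sub_url)
    return stored
```

The seen set prevents the loop from revisiting pages that link to each other, which the text does not mention explicitly but which any such loop needs in order to terminate on cyclic link structures below the threshold.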
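The collection terminals parse the web page content; the parsed content is received and web page data and sub-URL information are extracted from it; the sub-URL information is sent back to the collection terminals; and the newly parsed content is received and further web page data and new sub-URL information are extracted. In this way, collection tasks and scheduling tasks are managed separately and designed independently, which simplifies implementation and improves collection efficiency.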
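Fig. 3 is a block diagram of an information collection apparatus according to an exemplary embodiment. Referring to Fig. 3, the apparatus includes an obtaining module 11, an updating module 12, and a scheduling module 13.
The obtaining module 11 is configured to obtain collection parameters for a target website;
the updating module 12 is configured to update the collection parameters according to preset rules, the preset rules being generated from the anti-crawler rules of multiple websites;
the scheduling module 13 is configured to send the updated collection parameters to multiple collection terminals respectively, and to receive the web page content parsed by the collection terminals to obtain the target web page content.
In a possible implementation, the collection parameters include at least one of the following:
a user agent parameter of the web page request, an access source parameter of the web page request, an authorization parameter of the web page request, and an IP address parameter.
In a possible implementation, the updating module 12 includes:
an adding submodule 121, configured to add an access source parameter and/or an authorization parameter of a web page request related to the target website to the collection parameters.
In a possible implementation, the updating module 12 includes:
an updating submodule 122, configured to periodically update the user agent parameter of the web page request and/or the IP address parameter according to a preset time interval or a preset number of visits.
In a possible implementation, the updating module 12 includes:
an obtaining submodule 123, configured to obtain candidate parameters corresponding to the collection parameters from a preset collection parameter database;
a replacing submodule 124, configured to replace the collection parameters with the candidate parameters according to the preset rules to serve as the updated collection parameters, the preset rules being generated from the anti-crawler rules of multiple websites.
In a possible implementation, the scheduling module 13 includes:
a sending submodule 131, configured to send the updated collection parameters to multiple collection terminals respectively;
a receiving submodule 132, configured to receive the web page content parsed by the collection terminals, and to extract web page data and sub-URL information from the web page content;
an extracting submodule 133, configured to, within a preset collection threshold, send the sub-URL information to the collection terminals, receive the web page content parsed by the collection terminals, and extract web page data and new sub-URL information from the web page content;
a storing submodule 134, configured to store the web page data to obtain the target web page content.
With regard to the apparatus in the above embodiments, the specific way in which each module performs its operations has been described in detail in the embodiments of the method and is not elaborated here.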
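In a possible implementation, an information collection system is provided, including:
a user terminal, configured to obtain collection parameters for a target website;
the information collection apparatus according to any embodiment of the present disclosure;
a collection terminal, configured to receive and parse the updated collection parameters sent by the information collection apparatus, and to send the parsed web page content to the information collection apparatus.
In the embodiments of the present disclosure, the user terminal receives the collection parameters for the target website provided by the user. There may be multiple user terminals, and after the collection task of a user terminal is completed, that terminal can continue to receive new collection parameters and send them to the information collection apparatus. The user terminals may include terminal devices with input functions, such as notebook computers, mobile phones, and tablets. The specific way in which each module of the information collection apparatus performs its operations has been described in detail in the embodiments of the method and is not elaborated here. A browser may be installed on the collection terminal, and the browser is used to parse the URL information. The browser type includes at least one of Mozilla Firefox, Internet Explorer, Microsoft Edge, Google Chrome, Opera, Safari, 360 Browser, and QQ Browser. Alternatively, no browser is used: the collection terminal receives the collection parameters, connects to the server of the target website, obtains the web page content from the server, and transmits the target web page content.
By obtaining the anti-crawler strategies of different websites, formulating corresponding rules in advance, and updating the collection parameters for the target website according to the preset rules, the information collection system of the present disclosure achieves simple operation; the user side does not need to configure a large number of parameters and can continuously visit the target website without being affected by anti-crawler technology, and crawling efficiency is high. The collection apparatus and the collection terminals are designed independently and each completes its own tasks, and there is no communication between the collection terminals, which simplifies implementation and improves collection efficiency.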
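Fig. 4 is a block diagram of an apparatus 800 used as a collection apparatus according to an exemplary embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
Referring to Fig. 4, the apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operations of the apparatus 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the above method. In addition, the processing component 802 may include one or more modules to facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation of the apparatus 800. Examples of such data include instructions for any application or method operating on the apparatus 800, contact data, phone book data, messages, pictures, videos, and the like. The memory 804 can be implemented by any type of volatile or nonvolatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
The power supply component 806 provides power to the various components of the apparatus 800. The power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the apparatus 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the apparatus 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), and when the apparatus 800 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include but are not limited to a home button, a volume button, a start button, and a lock button.
The sensor component 814 includes one or more sensors for providing the apparatus 800 with status assessments of various aspects. For example, the sensor component 814 can detect the open/closed state of the apparatus 800 and the relative positioning of components, such as the display and the keypad of the apparatus 800. The sensor component 814 can also detect a position change of the apparatus 800 or of a component of the apparatus 800, the presence or absence of contact between the user and the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a temperature change of the apparatus 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the apparatus 800 and other devices. The apparatus 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field-programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components, to perform the above method.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 804 including instructions, which can be executed by the processor 820 of the apparatus 800 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 5 is a block diagram of an apparatus 1900 for information collection according to an exemplary embodiment. For example, the apparatus 1900 may be provided as a server. Referring to Fig. 5, the apparatus 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions executable by the processing component 1922, such as application programs. The application programs stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute the instructions to perform the above method.
The apparatus 1900 may also include a power supply component 1926 configured to perform power management of the apparatus 1900, a wired or wireless network interface 1950 configured to connect the apparatus 1900 to a network, and an input/output (I/O) interface 1958. The apparatus 1900 can operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 1932 including instructions, which can be executed by the processing component 1922 of the apparatus 1900 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.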
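Other embodiments of the present disclosure will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed herein. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the present disclosure are indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.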

Claims (15)

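  1. An information collection method, characterized by comprising:
    obtaining collection parameters for a target website;
    updating the collection parameters according to preset rules, the preset rules being generated from the anti-crawler rules of multiple websites;
    sending the updated collection parameters to multiple collection terminals respectively, and receiving the web page content parsed by the collection terminals to obtain the target web page content.
  2. The method according to claim 1, characterized in that the collection parameters include at least one of the following:
    a user agent parameter of the web page request, an access source parameter of the web page request, an authorization parameter of the web page request, and an IP address parameter.
  3. The method according to claim 2, characterized in that updating the collection parameters according to the preset rules, the preset rules being generated from the anti-crawler rules of multiple websites, comprises:
    adding an access source parameter and/or an authorization parameter of a web page request related to the target website to the collection parameters.
  4. The method according to claim 2, characterized in that updating the collection parameters according to the preset rules, the preset rules being generated from the anti-crawler rules of multiple websites, comprises:
    periodically updating the user agent parameter of the web page request and/or the IP address parameter according to a preset time interval or a preset number of visits.
  5. The method according to claim 1, characterized in that updating the collection parameters according to the preset rules, the preset rules being generated from the anti-crawler rules of multiple websites, comprises:
    obtaining candidate parameters corresponding to the collection parameters from a preset collection parameter database;
    replacing the collection parameters with the candidate parameters according to the preset rules to serve as the updated collection parameters, the preset rules being generated from the anti-crawler rules of multiple websites.
  6. The method according to claim 1, characterized in that sending the updated collection parameters to multiple collection terminals respectively, and receiving the web page content parsed by the collection terminals to obtain the target web page content, comprises:
    sending the updated collection parameters to multiple collection terminals respectively;
    receiving the web page content parsed by the collection terminals, and extracting web page data and sub-URL information from the web page content;
    within a preset collection threshold, sending the sub-URL information to the collection terminals, receiving the web page content parsed by the collection terminals, and extracting web page data and new sub-URL information from the web page content;
    storing the web page data to obtain the target web page content.
  7. An information collection apparatus, characterized by comprising:
    an obtaining module, configured to obtain collection parameters for a target website;
    an updating module, configured to update the collection parameters according to preset rules, the preset rules being generated from the anti-crawler rules of multiple websites;
    a scheduling module, configured to send the updated collection parameters to multiple collection terminals respectively, and to receive the web page content parsed by the collection terminals to obtain the target web page content.
  8. The apparatus according to claim 7, characterized in that the collection parameters include at least one of the following:
    a user agent parameter of the web page request, an access source parameter of the web page request, an authorization parameter of the web page request, and an IP address parameter.
  9. The apparatus according to claim 8, characterized in that the updating module comprises:
    an adding submodule, configured to add an access source parameter and/or an authorization parameter of a web page request related to the target website to the collection parameters.
  10. The apparatus according to claim 8, characterized in that the updating module comprises:
    an updating submodule, configured to periodically update the user agent parameter of the web page request and/or the IP address parameter according to a preset time interval or a preset number of visits.
  11. The apparatus according to claim 8, characterized in that the updating module comprises:
    an obtaining submodule, configured to obtain candidate parameters corresponding to the collection parameters from a preset collection parameter database;
    a replacing submodule, configured to replace the collection parameters with the candidate parameters according to the preset rules to serve as the updated collection parameters, the preset rules being generated from the anti-crawler rules of multiple websites.
  12. The apparatus according to claim 8, characterized in that the scheduling module comprises:
    a sending submodule, configured to send the updated collection parameters to multiple collection terminals respectively;
    a receiving submodule, configured to receive the web page content parsed by the collection terminals, and to extract web page data and sub-URL information from the web page content;
    an extracting submodule, configured to, within a preset collection threshold, send the sub-URL information to the collection terminals, receive the web page content parsed by the collection terminals, and extract web page data and new sub-URL information from the web page content;
    a storing submodule, configured to store the web page data to obtain the target web page content.
  13. An information collection system, characterized by comprising:
    a user terminal, configured to obtain collection parameters for a target website;
    the information collection apparatus according to any one of claims 7 to 12;
    a collection terminal, configured to receive and parse the updated collection parameters sent by the information collection apparatus, and to send the parsed web page content to the information collection apparatus.
  14. An information collection apparatus, characterized by comprising:
    a processor;
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to execute the method according to any one of claims 1 to 6.
  15. A non-transitory computer-readable storage medium, wherein when instructions in the storage medium are executed by a processor, the processor is enabled to execute the method according to any one of claims 1 to 6.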
PCT/CN2019/115278 2019-08-05 2019-11-04 Information collection method and apparatus WO2021022689A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910717510.3 2019-08-05
CN201910717510.3A CN110489626A (zh) 2019-08-05 2019-08-05 Information collection method and apparatus

Publications (1)

Publication Number Publication Date
WO2021022689A1 true WO2021022689A1 (zh) 2021-02-11

Family

ID=68547819

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/115278 WO2021022689A1 (zh) 2019-08-05 2019-11-04 Information collection method and apparatus

Country Status (2)

Country Link
CN (1) CN110489626A (zh)
WO (1) WO2021022689A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094382A (zh) * 2021-04-02 2021-07-09 南开大学 一种面向多来源数据管理的半自动化数据采集更新方法
CN113434787A (zh) * 2021-05-14 2021-09-24 国网河北省电力有限公司衡水供电分公司 网络数据获取方法、装置及终端设备
CN113810310A (zh) * 2021-09-10 2021-12-17 北京云杉世纪网络科技有限公司 一种流量采集方法、装置、设备及存储介质
CN114021668A (zh) * 2021-11-29 2022-02-08 北京天融信网络安全技术有限公司 网站反爬机制自动化检测方法、装置、设备及存储介质
CN114428635A (zh) * 2022-04-06 2022-05-03 杭州未名信科科技有限公司 一种数据采集方法、装置、电子设备及存储介质
CN115865427A (zh) * 2022-11-14 2023-03-28 重庆伏特猫科技有限公司 一种基于数据路由网关的数据采集与监控方法
CN115967607A (zh) * 2022-12-25 2023-04-14 西安电子科技大学 基于模板的分布式互联网大数据采集系统及方法
CN116108252A (zh) * 2023-04-14 2023-05-12 深圳市和讯华谷信息技术有限公司 限制数据抓取方法、系统、计算机设备及存储介质

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111585956B (zh) * 2020-03-31 2022-09-09 完美世界(北京)软件科技发展有限公司 一种网址防刷验证方法与装置
CN111741109B (zh) * 2020-06-19 2024-06-18 深圳前海微众银行股份有限公司 基于代理的访问方法、装置、设备及存储介质
CN111859076B (zh) * 2020-07-31 2024-04-02 平安健康保险股份有限公司 数据爬取方法、装置、计算机设备及计算机可读存储介质
CN112073412A (zh) * 2020-09-08 2020-12-11 北京天融信网络安全技术有限公司 一种反爬虫方法、装置、处理器及计算机可读介质
CN113360736B (zh) * 2021-06-21 2023-08-01 北京百度网讯科技有限公司 互联网数据的抓取方法和装置
CN116070052A (zh) * 2023-01-28 2023-05-05 爱集微咨询(厦门)有限公司 界面数据传输方法、装置、终端及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080183889A1 (en) * 2007-01-31 2008-07-31 Dmitry Andreev Method and system for preventing web crawling detection
CN105956175A (zh) * 2016-05-24 2016-09-21 考拉征信服务有限公司 网页内容爬取的方法和装置
CN107105071A (zh) * 2017-05-05 2017-08-29 北京京东金融科技控股有限公司 Ip调用方法及装置、存储介质、电子设备
CN108038218A (zh) * 2017-12-22 2018-05-15 联想(北京)有限公司 一种分布式爬虫方法、电子设备及服务器
CN108345642A (zh) * 2018-01-12 2018-07-31 深圳壹账通智能科技有限公司 采用代理ip爬取网站数据的方法、存储介质和服务器

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3043817A1 (fr) * 2015-11-16 2017-05-19 Scoop It Procede de recherche d’informations au sein d’un ensemble d’informations
CN109508422A (zh) * 2018-12-05 2019-03-22 南京邮电大学 多线程智能调度的高匿爬虫系统
CN109614539A (zh) * 2019-01-16 2019-04-12 重庆金融资产交易所有限责任公司 数据抓取方法、装置及计算机可读存储介质
CN110020062B (zh) * 2019-04-12 2021-09-24 北京邮电大学 一种可定制的网络爬虫方法及系统

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094382A (zh) * 2021-04-02 2021-07-09 南开大学 一种面向多来源数据管理的半自动化数据采集更新方法
CN113434787A (zh) * 2021-05-14 2021-09-24 国网河北省电力有限公司衡水供电分公司 网络数据获取方法、装置及终端设备
CN113434787B (zh) * 2021-05-14 2023-11-07 国网河北省电力有限公司衡水供电分公司 网络数据获取方法、装置及终端设备
CN113810310A (zh) * 2021-09-10 2021-12-17 北京云杉世纪网络科技有限公司 一种流量采集方法、装置、设备及存储介质
CN114021668A (zh) * 2021-11-29 2022-02-08 北京天融信网络安全技术有限公司 网站反爬机制自动化检测方法、装置、设备及存储介质
CN114428635A (zh) * 2022-04-06 2022-05-03 杭州未名信科科技有限公司 一种数据采集方法、装置、电子设备及存储介质
CN115865427A (zh) * 2022-11-14 2023-03-28 重庆伏特猫科技有限公司 一种基于数据路由网关的数据采集与监控方法
CN115865427B (zh) * 2022-11-14 2023-07-21 重庆伏特猫科技有限公司 一种基于数据路由网关的数据采集与监控方法
CN115967607A (zh) * 2022-12-25 2023-04-14 西安电子科技大学 基于模板的分布式互联网大数据采集系统及方法
CN116108252A (zh) * 2023-04-14 2023-05-12 深圳市和讯华谷信息技术有限公司 限制数据抓取方法、系统、计算机设备及存储介质

Also Published As

Publication number Publication date
CN110489626A (zh) 2019-11-22

Similar Documents

Publication Publication Date Title
WO2021022689A1 (zh) Information collection method and apparatus
RU2618910C2 (ru) Способ и устройство для отображения информации
CN111368290B (zh) 一种数据异常检测方法、装置及终端设备
US10187419B2 (en) Method and system for processing notification messages of a website
US11677845B2 (en) Matching and attribution of user device events
WO2018085732A1 (en) Techniques for detecting malicious behavior using an accomplice model
US11360834B2 (en) Application interaction method and apparatus
KR101678932B1 (ko) 웹페이지 액세스 방법, 장치, 서버, 단말기, 프로그램 및 저장매체
CN105245518B (zh) 网址劫持的检测方法及装置
CN104050266B (zh) 用户行为记录方法、装置和网页浏览器
CN105930536B (zh) 索引建立方法、页面跳转方法及装置
US11086956B2 (en) Method and device for processing hyperlink object
US11004163B2 (en) Terminal-implemented method, server-implemented method and terminal for acquiring certification document
US9235693B2 (en) System and methods thereof for tracking and preventing execution of restricted applications
KR101777035B1 (ko) 주소 필터링 방법, 장치, 프로그램 및 기록매체
CN107491453B (zh) 一种识别作弊网页的方法及装置
WO2017166297A1 (zh) WiFi热点Portal认证方法和装置
KR20160120198A (ko) 정보 필터링 방법, 장치, 프로그램 및 저장매체
US20150193393A1 (en) Dynamic Display of Web Content
US11210453B2 (en) Host pair detection
CN113872921B (zh) 网页检测方法、装置、设备及计算机可读存储介质
RU2672716C2 (ru) Способ и устройство для ввода информации
CN109766501B (zh) 爬虫协议管理方法及装置、爬虫系统
KR20150083589A (ko) 북마크 공유관리 서버, 이를 이용한 북마크 공유관리 시스템 및 방법
CN115065677B (zh) 媒体资源获取方法、装置、电子设备、存储介质和产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19940904

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19940904

Country of ref document: EP

Kind code of ref document: A1