CN108874810B - Information acquisition method and device - Google Patents


Info

Publication number
CN108874810B
Authority
CN
China
Prior art keywords
template
target webpage
information acquisition
url
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710325105.8A
Other languages
Chinese (zh)
Other versions
CN108874810A (en)
Inventor
李�杰
安伟佳
许斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201710325105.8A
Publication of CN108874810A
Application granted
Publication of CN108874810B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention provides an information acquisition method and device. A method of information collection, comprising: receiving an information acquisition task distributed from a processing center; starting one or more browser processes according to the information acquisition task, and loading a simulation behavior template in the process of starting the one or more browser processes; receiving a Uniform Resource Locator (URL) of a target webpage of information to be acquired from a processing center; rendering the target webpage according to the received URL, and acquiring the page rendering state of the target webpage; determining whether the loaded simulation behavior template needs to be configured on the target webpage or not according to the type of the URL; triggering a function defined in the simulation behavior template on the target webpage in response to determining that the simulation behavior template needs to be configured; and analyzing the target webpage and transmitting the analysis result back to the cloud storage of the processing center.

Description

Information acquisition method and device
Technical Field
The invention relates to the field of computers, in particular to a method and a device for acquiring information.
Background
Network information collection refers to programs that use a network robot (commonly called a web crawler) to automatically collect information on the Internet according to predetermined standards and protocols. Depending on the scenario, different acquisition algorithms can be used to traverse the topology of an entire website, such as a depth-first algorithm, a breadth-first algorithm, or a combination of the two.
At present, as resources such as server hardware and network bandwidth improve, the front-end technology of each site has become richer, and the bandwidth consumption and traffic of webpages have increased. Most sites load display information through delayed asynchronous loading, lazy loading, and similar techniques, which enrich webpage content and improve the user experience without affecting the response speed of the webpage.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
At present, the webpage structure of mainstream websites is complex, and much important information, such as prices and comments, is requested asynchronously and loaded and rendered with a delay. Conventional information acquisition cannot obtain this content, so what is visible on the page cannot be collected. In addition, conventional information acquisition is poorly customizable and adds no human operation behavior, so it is easily identified as non-human operation by the various machine learning algorithms of the target webpage. As a result, access to the target webpage is prohibited or additional logins are required, and information acquisition fails.
Disclosure of Invention
In view of this, the embodiment of the present invention provides an information acquisition method and apparatus.
On the basis of conventional headless browser information acquisition (web crawling), the embodiment of the invention can flexibly add simulated human operations, such as clicking, logging in, page turning, refreshing, pull-down scrolling, full-screen operation, sliding the mouse over a certain element, pulling down the scroll bar, and moving and pausing the mouse. This satisfies the various types of embedded tracking points set in a target webpage and loads additional information that is displayed only after clicking, so that "what you see is what you get" is truly achieved across repeated access requests and the risk of access being prohibited is reduced.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided an information collecting method, including: receiving an information acquisition task distributed from a processing center; starting one or more browser processes according to the information acquisition task, and loading a simulation behavior template in the process of starting the one or more browser processes; receiving a Uniform Resource Locator (URL) of a target webpage of information to be acquired from the processing center; rendering the target webpage according to the received URL, and obtaining a page rendering state of the target webpage; determining whether the loaded simulation behavior template needs to be configured on the target webpage or not according to the type of the received URL; in response to determining that the simulated behavior template needs to be configured, triggering a function defined in the simulated behavior template on the target webpage; and analyzing the target webpage and transmitting an analysis result back to the processing center.
Optionally, the simulated behavior template includes one or more of: a page pull-down and scroll effect template; a click and login effect template; and a selection effect template.
Optionally, the simulated behavioral template is a template predefined by the information acquisition device.
Optionally, the simulated behavioral templates are user-defined templates.
Optionally, the simulated behavior template is loaded by injecting it, in plug-in form, into the one or more browser processes in a pluggable manner.
Optionally, parsing the target webpage and returning a parsing result to the processing center includes: carrying out template adaptation on the target webpage so as to match the target webpage with a template defined by an information acquisition device; selecting a rule used for analyzing the target webpage according to different URL types of the target webpage, and analyzing the target webpage by using the selected rule; and generating a parsing result based on the rule and transmitting the parsing result back to the processing center.
In order to achieve the above object, according to another aspect of an embodiment of the present invention, there is provided an information acquisition apparatus, including: a URL downloading module for downloading a URL of a target web page of information to be acquired and obtaining a rendering state of the target web page, the URL downloading module including: the browser pool management module is used for receiving an information acquisition task distributed to the URL downloading module from a processing center, starting one or more browser processes according to the information acquisition task, and loading a simulation behavior template in the process of starting the one or more browser processes; a URL input module for receiving the URL of the target webpage from the processing center; the page rendering state acquisition module is used for rendering the target webpage according to the received URL and acquiring the page rendering state of the target webpage; a simulation behavior template configuration module, configured to determine whether the simulation behavior template needs to be configured on the target webpage according to the type of the received URL, and trigger a function defined in the simulation behavior template on the target webpage in response to determining that the simulation behavior template needs to be configured; and the analysis template module is used for analyzing the target webpage and transmitting an analysis result back to the processing center.
Optionally, the simulated behavior template includes one or more of: a page pull-down and scroll effect template; a click and login effect template; and a selection effect template.
Optionally, the simulated behavioral template is a template predefined by the information acquisition device.
Optionally, the simulated behavioral templates are user-defined templates.
Optionally, the browser pool management module loads the simulated behavior template by injecting it, in plug-in form, into the one or more browser processes in a pluggable manner.
Optionally, the parsing template module includes: an adaptive template type module for carrying out template adaptation on the target webpage so as to match the target webpage with a template defined by the information acquisition device; a parsing rule loading module for selecting a rule used for parsing the target webpage according to the URL type of the target webpage and parsing the target webpage by using the selected rule; and a parsing result returning module for generating a parsing result based on the rule and returning the parsing result to the processing center.
To achieve the above object, according to another aspect of an embodiment of the present invention, there is provided an electronic device for information collection, including: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the information acquisition method of the present invention.
To achieve the above object, according to another aspect of the embodiments of the present invention, there is provided a computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the information acquisition method of the present invention.
One embodiment of the above invention has the following advantages or benefits: on the basis of conventional headless browser information acquisition (web crawling), the embodiment of the invention can flexibly add simulated human operations, thereby satisfying the various types of embedded tracking points set in a target webpage and loading additional information that is displayed only after clicking, so that "what you see is what you get" is truly achieved across repeated access requests and the risk of access being prohibited is reduced.
Further effects of the above non-conventional optional implementations are described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of an information acquisition system according to an embodiment of the invention;
FIG. 2 is a schematic diagram of an information acquisition device according to an embodiment of the present invention;
FIG. 3 is a flow chart of an information collection method according to an embodiment of the invention;
FIG. 4 is a schematic structural diagram of a computer system suitable for implementing the electronic device or the server according to the embodiment of the present application.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In traditional information acquisition, the technology used generally initiates an HTTP request to the target webpage of the information to be acquired, optionally using an Apache component to add optimizations such as connection pool management, and finally downloads the source code of the target webpage and performs a series of operations on its content, such as sorting, storing, analyzing, and collecting related information.
One prior art approach is to use the HttpClient component of Apache, with HttpClient instances managed through an HTTP connection pool, to initiate an HTTP request to the target webpage of the information to be collected and communicate with it; the information of the target webpage is then returned to the pool-managed HttpClient as a file stream.
Another prior art method is to use a headless browser to remotely access a target webpage; the target webpage is returned to the headless browser after it has loaded.
At present, the webpage structure of mainstream websites is complex, and much important information, such as prices and comments, is requested asynchronously and loaded and rendered with a delay. Content downloaded with the traditional HttpClient component does not include this information, so what is visible on the page cannot be collected. In addition, information acquisition with a popular headless browser is poorly customizable and adds no human operation behavior, yet much of the information on a target page is displayed only after a human clicks; for example, detailed comments or regions must be selected manually before their content is displayed. Many target webpages also have embedded tracking points set by various machine learning algorithms. Page tracking points are used for traffic analysis and record page views (PV), unique visitors (UV), IP addresses, page dwell time, page operation time, page access counts, button click counts, file download counts, and the like; these data are significant for analyzing how users operate and browse a website. As acquisition time increases, if the target webpage sees no human operation behavior for a long time, the tracking-point requirements set by its machine learning algorithms cannot be met, and conventional headless browser information acquisition is easily identified as non-human operation. As a result, access to the target webpage is prohibited or additional logins are required, and information acquisition fails.
Therefore, the invention provides an information acquisition method and device.
Reference is first made to fig. 1. FIG. 1 is an information collection system 100 according to an embodiment of the present invention. The information collection system 100 and its constituent modules of the present invention can be implemented by a computer in combination with related programs.
As shown in fig. 1, an information collection system 100 according to an embodiment of the present invention includes a processing center 110 and one or more information collection nodes 111. Three information collection nodes are shown in fig. 1, but the invention is not so limited and may include any number of information collection nodes.
According to an embodiment of the invention, the processing center 110 distributes information collection tasks to one or more information collection nodes 111. The one or more information collection nodes 111 remotely access and collect information from a target web page of information to be collected according to the received information collection task, parse the collected information, and transmit the parsed results back to a cloud storage (not shown) of the processing center 110.
Fig. 2 is a schematic diagram of an information acquisition apparatus 200 according to an embodiment of the present invention.
Each information collection node 111 shown in fig. 1 may include one or more information collection devices 200 of the present invention, according to an embodiment of the present invention. The information collecting apparatus 200 and its constituent modules of the present invention can be realized by a computer in combination with a related program.
As shown in fig. 2, the information collecting apparatus 200 according to the embodiment of the present invention includes a URL (uniform resource locator) download module 210, a simulated behavior template configuration module 220, and an analysis template module 230.
According to an embodiment of the present invention, the URL downloading module 210 is configured to download a URL of a target webpage of information to be collected and obtain a rendering state of the target webpage. According to an embodiment of the present invention, the URL download module 210 includes a browser pool managing module 211, a URL input module 212, and a page rendering state acquiring module 213.
According to an embodiment of the present invention, the browser pool managing module 211 is configured to receive an information collection task distributed from the processing center 110 to the information collection apparatus 200, start one or more browser processes according to the information collection task, and load a simulated behavior template during the starting of the one or more browser processes.
In one embodiment, the one or more browser processes are web browser processes.
In one embodiment, the one or more browser processes are PhantomJs-based browser processes.
In one embodiment, the simulated behavioral templates are templates predefined by the information acquisition device 200 of an embodiment of the invention.
In one embodiment, the simulated behavioral templates are user-defined templates.
According to an embodiment of the present invention, the URL input module 212 is configured to receive the URL of the target web page from the processing center 110.
According to the embodiment of the present invention, the page rendering state obtaining module 213 is configured to render the target webpage according to the received URL, and obtain a page rendering state of the target webpage.
According to an embodiment of the present invention, the simulated behavior template configuration module 220 is configured to determine whether a simulated behavior template needs to be configured on the target webpage according to the type of the received URL, and to trigger a function defined in the simulated behavior template on the target webpage in response to determining that the simulated behavior template needs to be configured.
In one embodiment, the simulated behavior template configuration module 220 includes one or more of a page pull-down and scroll effect template 221, a click and login effect template 222, and a selection effect template 223, although the invention is not so limited and the simulated behavior template configuration module 220 may include any other template.
After the simulation behavior template configuration module 220 configures the simulation behavior templates, the plug-in simulation is completed.
According to an embodiment of the invention, the parsing template module 230 is used to parse the target webpage and transmit the parsing result back to the processing center 110, such as a cloud storage (not shown) of the processing center 110.
According to an embodiment of the present invention, the parsing template module 230 includes: an adaptive template type module 231, a parsing rule loading module 232, and a parsing result returning module 233.
According to an embodiment of the present invention, the adaptive template type module 231 is used for performing template adaptation on the target webpage to match the target webpage with a template defined by the information collecting apparatus 200.
According to the embodiment of the present invention, the parsing rule loading module 232 is configured to select a rule used for parsing the target webpage according to different URL types of the target webpage, and parse the target webpage by using the selected rule.
According to an embodiment of the present invention, the parsing result returning module 233 is used for generating parsing results based on the rules and returning the parsing results to the cloud storage of the processing center 110.
Reference is now made to fig. 3. Fig. 3 is a flow chart of an information collection method 300 according to an embodiment of the invention.
The information collecting method 300 according to the embodiment of the present invention may be performed by the information collecting apparatus 200 shown in fig. 2 or the electronic device 400 shown in fig. 4. For convenience, the information collecting apparatus 200 according to the embodiment of the present invention will be described in detail as an example.
As shown in fig. 3, in step S310, the information collection apparatus 200 receives the information collection task distributed from the processing center 110.
In one embodiment, the information collection task includes the attributes, number, initialization information, etc. of the target web page of the information to be collected, for the information collection device 200 to initialize and prepare for further operation.
In step S320, the information collecting apparatus 200 starts one or more browser processes according to the received information collecting task, and loads a simulation behavior template in the process of starting the one or more browser processes.
In one embodiment, the one or more browser processes are web browser processes.
In one embodiment, the one or more browser processes are PhantomJs-based browser processes.
In one embodiment, if the number of target web pages included in the information gathering task is 1, 1 browser process is started. In one embodiment, if the number of target web pages included in the information gathering task is 3, 3 browser processes are launched. The present invention is not limited in this regard and other numbers of browser processes may be initiated based on the number of target web pages included in the information gathering task.
In one embodiment, if the computer device of the information collection node 111 where the information collection apparatus 200 is located has a high configuration, a larger number of target webpage information collection (e.g., URL access) tasks may be retrieved; if it has a low configuration, a smaller number of such tasks may be retrieved. The present invention is not limited thereto, and information collection tasks may be retrieved according to other conditions.
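The capacity-based task retrieval described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: the names `Node`, `capacity`, and `distribute_urls` are hypothetical, and the least-relative-load policy is an assumed way of giving higher-configuration nodes more URLs.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an information collection node 111."""
    name: str
    capacity: int  # assumed relative measure of the node's hardware configuration
    urls: list = field(default_factory=list)


def distribute_urls(urls, nodes):
    """Hand out URLs so that higher-capacity nodes receive more of them.

    Assumed policy: each URL goes to the node whose current load,
    relative to its capacity, is lowest.
    """
    for url in urls:
        target = min(nodes, key=lambda n: len(n.urls) / n.capacity)
        target.urls.append(url)
    return nodes


nodes = distribute_urls(
    [f"https://example.com/item/{i}" for i in range(6)],
    [Node("high-spec", capacity=2), Node("low-spec", capacity=1)],
)
for node in nodes:
    # One browser process per assigned target webpage, as in the embodiment above.
    print(node.name, "starts", len(node.urls), "browser processes")
```

With a capacity ratio of 2:1, the six example URLs end up split 4:2 between the two hypothetical nodes.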
In one embodiment, the simulated behavioral templates are JQuery components.
In one embodiment, with the browser support for JavaScript, simulated behavior templates are loaded into the browser process by injecting JQuery components that define various simulated behavior templates into the browser process during the startup of the browser process. In one embodiment, the browser process is a PhantomJs-based browser process, but the invention is not limited thereto, and other browser processes supporting JavaScript may be employed.
In one embodiment, JQuery components defining various simulated behavior templates are injected into a browser process, for example a PhantomJs process, as plug-ins in a pluggable manner.
Sample-1 shows one example of loading a simulated behavior template.
[Sample-1: code listing provided as an image (BDA0001290924170000101) in the original publication; not reproduced here.]
In sample-1, "executeScript" is an example of a piece of JavaScript injected into a browser process during the startup of the browser process according to an embodiment of the present invention. In one embodiment, the script "executeScript" is injected as a plug-in to the browser process in a plug-in pluggable manner.
In sample-1, "phantomjsmosueevent" is an example of a mouse event template predefined by the information collection apparatus 200 according to the embodiment of the present invention. In one embodiment, the predefined mouse events include a single click, a double click, sliding to a certain point, scrolling down to the bottom, window maximization and minimization, logging in, page turning, refreshing, sliding the mouse over a certain element, pulling down the scroll bar, moving and pausing the mouse, and the like. The invention is not limited thereto: one or more of these mouse event templates may be selected, or other simulated behavior events may be defined, according to the requirements of different target webpage types or the embedded tracking points of the target webpage.
In one embodiment, if the user wishes to use a customized simulated behavior template, the simulated behavior template predefined by the information gathering device 200 may be replaced with a simulated behavior template customized by the user.
For example, in one embodiment, in the same way that the predefined simulated behavior templates of the information collection apparatus 200 are loaded as described above, a user-defined JavaScript script may be injected into the browser process during browser startup in step S320. In one embodiment, the user-defined JavaScript script is injected into the browser process as a plug-in in a pluggable manner.
The syntax and field positioning used by the user-defined script both follow standard JQuery, which supports the JQuery element selector $ well and makes it easy to locate each element tag of the webpage; for example, to locate the element with id="frame", $("#frame") is used directly. Values can then be assigned to elements or listening events added, such as submission of a button using a JavaScript script $("ul.
In one embodiment, a corresponding interface, such as the "executeScript" interface in sample-1, is set in the browser process, and the user-defined script can be added to that interface. In one embodiment, the user-defined script is saved as text, and the information collecting apparatus 200 according to the embodiment of the present invention uses it directly by loading the text.
In one embodiment, "phantomjsmosueevent" in sample-1 may be replaced with a user-defined mouse event, so that the simulated behavior template predefined by the information collecting apparatus 200 is replaced with the user-defined simulated behavior template.
Conventional headless browser information acquisition is poorly customizable, adds no human operation behavior, and cannot meet the embedded tracking-point requirements set by the various machine learning algorithms of a target webpage, so it is easily identified as non-human operation. According to the embodiment of the invention, a corresponding interface, such as the "executeScript" interface in sample-1, is set in the browser process, and the user can load a customized simulated behavior template by adding a customized script to that interface, which provides good customization and simulates human operation behavior. In addition, in one embodiment, the simulated behavior template is injected into the browser process as a pluggable plug-in, so that the user can plug the template in or unplug it as needed, which provides better flexibility.
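The pluggable loading of predefined and user-defined templates can be illustrated with a small sketch. It is not the code of sample-1, which is not reproduced in this text: `BrowserProcess`, `plug_in`, `unplug`, `PREDEFINED`, and `start_browser` are hypothetical names, the browser process is a plain in-memory stand-in, and the JavaScript snippets are illustrative template bodies only.

```python
class BrowserProcess:
    """Hypothetical stand-in for a browser process (e.g. a PhantomJs-based one)."""

    def __init__(self):
        self.plugins = {}  # plug-in name -> JavaScript source text

    def plug_in(self, name, script):
        """Inject a simulated behavior template as a named plug-in."""
        self.plugins[name] = script

    def unplug(self, name):
        """Remove a plug-in; unknown names are ignored."""
        self.plugins.pop(name, None)


# Assumed predefined template: scroll the page to the bottom.
PREDEFINED = {"scroll": "window.scrollTo(0, document.body.scrollHeight);"}


def start_browser(user_templates=None):
    """Start a browser process, loading predefined then user-defined templates.

    A user-defined template with the same name replaces the predefined one.
    """
    browser = BrowserProcess()
    for name, script in {**PREDEFINED, **(user_templates or {})}.items():
        browser.plug_in(name, script)
    return browser


browser = start_browser(
    {"scroll": "window.scrollBy(0, 500);", "click": '$("#frame").click();'}
)
browser.unplug("click")  # a template can be removed again when it is not needed
print(sorted(browser.plugins))
```

Keeping each template under its own name is one way to let a user replace a single predefined behavior without touching the others.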
In step S330, the information collecting apparatus 200 receives the URL of the target web page from the processing center 110.
In step S340, the information collecting apparatus 200 remotely accesses the target webpage according to the received URL, renders the target webpage, and obtains a page rendering state of the target webpage.
At present, the webpage structure of mainstream websites is complex, and much important information, such as prices and comments, is requested asynchronously and loaded and rendered with a delay. Conventional information acquisition therefore cannot collect what is visible on the page; that is, it cannot obtain the page rendering state of the target webpage synchronously. According to the embodiment of the invention, by rendering the target webpage according to the URL and acquiring its page rendering state, the page rendering state is obtained synchronously, and important information such as prices and comments is loaded together with the page rendering, so that "what you see is what you get" is achieved.
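One possible way to obtain the page rendering state before parsing is to poll it until rendering has finished. The text does not state how the state is obtained, so polling is an assumption here; `wait_until_rendered` and `FakePage` are hypothetical names, and the state strings are modelled on the values of the DOM `document.readyState` property.

```python
import time


def wait_until_rendered(get_state, timeout=5.0, interval=0.01):
    """Poll get_state() until it reports "complete" or the timeout expires.

    Returns True if the page finished rendering in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_state() == "complete":
            return True
        time.sleep(interval)
    return False


class FakePage:
    """Hypothetical page whose asynchronous content arrives after a few polls."""

    def __init__(self, polls_needed):
        self.polls_left = polls_needed

    def state(self):
        self.polls_left -= 1
        return "complete" if self.polls_left <= 0 else "loading"


page = FakePage(polls_needed=3)
rendered = wait_until_rendered(page.state)
print("rendered:", rendered)
```

The timeout keeps a page that never finishes loading from blocking the browser process indefinitely.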
And after the page rendering state of the target webpage is obtained, the URL downloading process is completed. The method 300 according to an embodiment of the present invention proceeds to step S350.
In step S350, it is determined whether a simulated behavior template needs to be configured on the target web page according to the type of the received URL.
If it is determined in step S350 that the simulated behavior template needs to be configured on the target web page, the function defined in the simulated behavior template is triggered on the target web page in step S360.
In one embodiment, the JQuery component injected in step S320 is executed to trigger a function defined in the simulated behavior template on the target webpage, such as implementing functions of clicking, logging in, turning pages, refreshing, pull-down scrolling, full-screen operation, mouse sliding over an element, scroll bar pulling down, mouse moving and staying on the target webpage.
Many target webpages set embedded tracking points through various machine learning algorithms. If there is no human operation behavior on a target webpage for a long time, the tracking-point requirements set by those algorithms cannot be met, so the access is easily identified as non-human operation and prohibited. According to the embodiment of the invention, various simulated behavior templates that simulate human operation behaviors can be configured on the target webpage according to different target webpage types and tracking-point requirements, so as to avoid being identified as non-human operation.
By triggering the function defined in the simulated behavior template on the target webpage, the embodiment of the invention can simulate human operation behaviors, such as clicking, logging in, page turning, refreshing, pull-down scrolling, full-screen operation, sliding the mouse over a certain element, pulling down the scroll bar, and moving and pausing the mouse, so that the requirements of the embedded tracking points set in the target webpage are met and identification as non-human operation is avoided.
After the simulation behavior template is configured, the plug-in simulation process is complete and method 300 according to an embodiment of the present invention proceeds to step S370.
On the other hand, if it is determined at step S350 that the simulated behavior template does not need to be configured on the target web page, the method 300 according to an embodiment of the present invention proceeds to step S370.
In step S370, the information collecting apparatus 200 parses the target webpage and returns the parsing result to the processing center 110, for example, to the cloud storage of the processing center 110.
According to an embodiment of the present invention, step S370 proceeds as follows: template adaptation is performed on the target webpage to match the target webpage with a template defined by the information collecting apparatus 200; a rule for parsing the target webpage is selected according to the URL type of the target webpage, and the target webpage is parsed using the selected rule; and a parsing result is generated based on the rule and returned to the processing center 110, for example to the cloud storage of the processing center 110.
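The three parts of step S370 (template adaptation, rule selection, result generation) can be sketched as follows. The URL types, field names, and regular expressions are hypothetical, and regular expressions stand in for whatever rule format the apparatus actually uses; returning the result to the processing center is left out.

```python
import re
from urllib.parse import urlparse

# Hypothetical parsing rules per URL type: field name -> regex with one capture group.
RULES = {
    "detail": {
        "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
        "price": re.compile(r'class="price"[^>]*>\s*([\d.]+)'),
    },
    "list": {
        "first_item": re.compile(r'<li class="item"[^>]*>(.*?)</li>', re.S),
    },
}


def url_type(url):
    """Template adaptation: match the page to one of the defined templates."""
    return "detail" if urlparse(url).path.startswith("/item") else "list"


def parse_page(url, html):
    kind = url_type(url)      # template adaptation
    rules = RULES[kind]       # rule selection according to the URL type
    fields = {}
    for name, pattern in rules.items():
        match = pattern.search(html)
        fields[name] = match.group(1).strip() if match else None
    # Parsing result, ready to be returned to the processing center.
    return {"url": url, "type": kind, "fields": fields}
```

Keeping the rules in a table keyed by URL type means a new page type only needs a new entry rather than new parsing code, which is one plausible reading of selecting a rule according to the URL type.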
According to the embodiment of the invention, access to the URL is completed automatically by the browser. A user-friendly plug-in (such as a JavaScript script) carrying the template that simulates human operation behavior is provided, together with a corresponding interface for simulated behavior templates customized by the user, so that various human operations on the target webpage can be simulated flexibly. As a result, the page displays more of the important information, the requirements of the various buried points that record human operations in the target webpage are met, the number of effective accesses increases, and the risk of access being prohibited is reduced.
The invention also provides an electronic device and a readable storage medium according to the embodiment of the invention.
The electronic device for information acquisition of the present invention includes: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the information acquisition method of the present invention.
The computer-readable medium of the present invention has stored thereon a computer program which, when executed by a processor, implements the information collecting method of the present invention.
Referring now to FIG. 4, shown is a block diagram of a computer system 400 suitable for use in implementing a terminal device of an embodiment of the present application. The terminal device shown in FIG. 4 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in fig. 4, the computer system 400 includes a Central Processing Unit (CPU)401 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the system 400 are also stored. The CPU 401, ROM 402, and RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display device such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card, a modem, or the like. The communication section 409 performs communication processing via a network such as the Internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 410 as necessary, so that a computer program read from it is installed into the storage section 408 as necessary.
In particular, the process described above with reference to fig. 3 may be implemented as a computer software program, according to an embodiment of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program containing program code for performing the method illustrated in fig. 3. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 409, and/or installed from the removable medium 411. The above-described functions defined in the system of the present application are executed when the computer program is executed by a Central Processing Unit (CPU) 401.
It should be noted that the computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present application may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes a URL download module, a simulated behavior template configuration module, and a parsing template module. The names of these modules do not constitute a limitation to the modules themselves in some cases, for example, the URL download module may also be described as a "module that downloads the URL of a target web page of information to be collected and obtains the rendering state of the target web page".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to perform the information acquisition method of the present invention, including: receiving an information acquisition task distributed from a processing center; starting one or more browser processes according to the information acquisition task, and loading a simulation behavior template in the process of starting the one or more browser processes; receiving a Uniform Resource Locator (URL) of a target webpage of information to be acquired from the processing center; rendering the target webpage according to the received URL, and obtaining a page rendering state of the target webpage; determining whether the loaded simulation behavior template needs to be configured on the target webpage or not according to the type of the received URL; in response to determining that the simulated behavior template needs to be configured, triggering a function defined in the simulated behavior template on the target webpage; and analyzing the target webpage and transmitting an analysis result back to the processing center.
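The sequence of steps listed above can be sketched as a single handler with each stage passed in as a callable. Everything named here (`Collector`, its fields, the log entries) is hypothetical, and the callables stand in for a real browser process, a real template engine, and the real link back to the processing center.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Collector:
    render: Callable[[str], str]                   # URL -> rendered HTML (stands in for a browser process)
    pick_template: Callable[[str], Optional[str]]  # URL -> template name, or None if not needed
    trigger: Callable[[str, str], str]             # (template, HTML) -> HTML after the simulated behavior
    parse: Callable[[str, str], dict]              # (URL, HTML) -> parsing result
    send_back: Callable[[dict], None]              # stands in for the return to the processing center
    log: List[str] = field(default_factory=list)

    def handle(self, url: str) -> None:
        html = self.render(url)                    # render the page and obtain its rendering state
        self.log.append("rendered")
        template = self.pick_template(url)         # decide by URL type whether a template is needed
        if template is not None:
            html = self.trigger(template, html)    # trigger the functions defined in the template
            self.log.append("triggered:" + template)
        self.send_back(self.parse(url, html))      # parse and return the result
        self.log.append("returned")
```

Passing the stages in as callables keeps the sketch runnable without a browser; in a real deployment each would be backed by the corresponding module of the apparatus.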
According to the embodiment of the invention, simulated human operations are added flexibly, so that the requirements of the various types of buried points set in the target webpage are met and more information that is displayed only after clicking can be loaded. What you see is what you get (WYSIWYG) is thus truly achieved even when access requests are initiated many times, and the risk of access being prohibited is reduced.
The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the information acquisition method provided in the embodiment of the present invention.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (14)

1. A method of information collection, comprising:
receiving an information acquisition task distributed from a processing center;
starting one or more browser processes according to the information acquisition task, and loading a simulation behavior template in the process of starting the one or more browser processes;
receiving a Uniform Resource Locator (URL) of a target webpage of information to be acquired from the processing center;
rendering the target webpage according to the received URL, and obtaining a page rendering state of the target webpage;
determining whether the loaded simulation behavior template needs to be configured on the target webpage or not according to the type of the URL;
in response to determining that the simulated behavior template needs to be configured, triggering a function defined in the simulated behavior template on the target webpage; and
parsing the target webpage and transmitting a parsing result back to the processing center.
2. The method of claim 1, wherein the simulated behavioral templates comprise one or more of: a page pull-down and scrolling effect template; a click and login effect template; and a selection effect template.
3. The method of claim 1, wherein the simulated behavioral templates are templates predefined by an information collection device.
4. The method of claim 1, wherein the simulated behavioral templates are user-customized templates.
5. The method of claim 1, wherein the simulated behavior template is loaded by injecting the simulated behavior template, in the form of a plug-in, into the one or more browser processes in a pluggable manner.
6. The method of claim 1, wherein parsing the target web page and returning a result of the parsing to the processing center comprises:
carrying out template adaptation on the target webpage so as to match the target webpage with a template defined by an information acquisition device;
selecting a rule for parsing the target webpage according to the URL type of the target webpage, and parsing the target webpage using the selected rule; and
generating a parsing result based on the rule and transmitting the parsing result back to the processing center.
7. An information acquisition apparatus, comprising:
a URL downloading module for downloading a URL of a target web page of information to be acquired and obtaining a rendering state of the target web page, the URL downloading module including: a browser pool management module for receiving an information acquisition task distributed to the URL downloading module from a processing center, starting one or more browser processes according to the information acquisition task, and loading a simulation behavior template in the process of starting the one or more browser processes; a URL input module for receiving the URL of the target webpage from the processing center; and a page rendering state acquisition module for rendering the target webpage according to the received URL and acquiring the page rendering state of the target webpage;
a simulation behavior template configuration module, configured to determine whether the simulation behavior template needs to be configured on the target webpage according to the type of the received URL, and trigger a function defined in the simulation behavior template on the target webpage in response to determining that the simulation behavior template needs to be configured; and
a parsing template module, configured to parse the target webpage and transmit a parsing result back to the processing center.
8. The information acquisition device of claim 7, wherein the simulated behavioral templates comprise one or more of: a page pull-down and scrolling effect template; a click and login effect template; and a selection effect template.
9. The information acquisition device of claim 7, wherein the simulated behavioral template is a template predefined by the information acquisition device.
10. The information acquisition device of claim 7, wherein the simulated behavioral templates are user-defined templates.
11. The information acquisition device of claim 7, wherein the browser pool management module loads the simulated behavior template by injecting the simulated behavior template, in the form of a plug-in, into the one or more browser processes in a pluggable manner.
12. The information acquisition device of claim 7, wherein the parsing template module comprises:
an adaptive template type module for performing template adaptation on the target webpage so as to match the target webpage with a template defined by the information acquisition device;
a parsing rule loading module for selecting a rule for parsing the target webpage according to the URL type of the target webpage, and parsing the target webpage using the selected rule; and
a parsing result returning module for generating a parsing result based on the rule and returning the parsing result to the processing center.
13. An electronic device for information collection, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
14. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-6.
CN201710325105.8A 2017-05-10 2017-05-10 Information acquisition method and device Active CN108874810B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710325105.8A CN108874810B (en) 2017-05-10 2017-05-10 Information acquisition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710325105.8A CN108874810B (en) 2017-05-10 2017-05-10 Information acquisition method and device

Publications (2)

Publication Number Publication Date
CN108874810A CN108874810A (en) 2018-11-23
CN108874810B true CN108874810B (en) 2021-01-26

Family

ID=64287894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710325105.8A Active CN108874810B (en) 2017-05-10 2017-05-10 Information acquisition method and device

Country Status (1)

Country Link
CN (1) CN108874810B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800123A (en) * 2018-12-14 2019-05-24 深圳壹账通智能科技有限公司 Automate electric quantity test method, apparatus, computer equipment and storage medium
CN110046319B (en) * 2019-04-01 2021-04-09 北大方正集团有限公司 Social media information acquisition method, device, system, equipment and storage medium
CN110413922B (en) * 2019-06-28 2024-06-14 平安科技(深圳)有限公司 Page information display method, device, computer equipment and storage medium
CN110362296A (en) * 2019-07-12 2019-10-22 无锡锐泰节能系统科学有限公司 Device data monitoring system based on javascript
CN110691125A (en) * 2019-09-24 2020-01-14 上海富数科技有限公司 System and method for realizing browser loading control based on heuristic algorithm
CN110995691A (en) * 2019-11-28 2020-04-10 佛山科学技术学院 Method and system for acquiring webpage data
CN112994968B (en) * 2019-12-17 2023-05-02 北京沃东天骏信息技术有限公司 Network information acquisition method, server, terminal and system
CN111314298B (en) * 2020-01-16 2020-12-29 北京金堤科技有限公司 Verification identification method and device, electronic equipment and storage medium
CN112035211A (en) * 2020-11-04 2020-12-04 北京值得买科技股份有限公司 Method for improving article opening speed by preloading article data
CN114595410A (en) * 2022-03-24 2022-06-07 中国农业银行股份有限公司 Webpage parsing method and system and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016075552A1 (en) * 2014-11-14 2016-05-19 Yandex Europe Ag Method of testing webpage layout
CN106021552A (en) * 2016-05-30 2016-10-12 深圳市华傲数据技术有限公司 Internet creeper concurrency data collection method and system based on crowd behavior simulation

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9009678B2 (en) * 2011-06-28 2015-04-14 International Business Machines Corporation Software debugging with execution match determinations
CN102880607A (en) * 2011-07-15 2013-01-16 舆情(香港)有限公司 Dynamic network content grabbing method and dynamic network content crawler system
CN102375951B (en) * 2011-10-18 2014-07-23 北龙中网(北京)科技有限责任公司 Webpage security detection method and system
CN103186670B (en) * 2013-03-27 2016-04-13 北京中金云网科技有限公司 A kind of method and system of complete collection info web
CN103218431B (en) * 2013-04-10 2016-02-17 金军 A kind ofly can identify the system that info web gathers automatically
CN105512193A (en) * 2015-11-26 2016-04-20 上海携程商务有限公司 Data acquisition system and method based on browser expansion
CN106599270B (en) * 2016-12-23 2020-08-21 浙江省公众信息产业有限公司 Network data capturing method and crawler

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Using Selenium+PhantomJS in a Python crawler to scrape Ajax and dynamic HTML content; 华天清; https://www.cnblogs.com/gooseeker/p/5511193.html; 2016-05-20; full text *

Also Published As

Publication number Publication date
CN108874810A (en) 2018-11-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant