CN114491356B - Data acquisition method and device, computer storage medium and electronic equipment - Google Patents

Info

Publication number
CN114491356B
Authority
CN
China
Prior art keywords
browser
network request
data
webpage
request packet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111612228.2A
Other languages
Chinese (zh)
Other versions
CN114491356A (en)
Inventor
陈祖德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jindi Technology Co Ltd
Original Assignee
Beijing Jindi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jindi Technology Co Ltd filed Critical Beijing Jindi Technology Co Ltd
Priority to CN202111612228.2A priority Critical patent/CN114491356B/en
Publication of CN114491356A publication Critical patent/CN114491356A/en
Application granted granted Critical
Publication of CN114491356B publication Critical patent/CN114491356B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)
  • Computer And Data Communications (AREA)

Abstract

An embodiment of the invention provides a data acquisition method and device, a computer storage medium, and electronic equipment. The data acquisition method comprises: starting a browser; accessing, through the browser, a webpage corresponding to the data to be acquired; in response to detecting a triggering operation on a link in the webpage that meets a preset rule, intercepting the request packet generated by the triggering operation; and generating a network request based on the request packet and sending the network request to a server to acquire the data. Because the encryption parameters are obtained by having an interceptor intercept the network request, the method, device, computer storage medium, and electronic equipment acquire data more efficiently than conventional browser-automated data acquisition.

Description

Data acquisition method and device, computer storage medium and electronic equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data acquisition method and apparatus, a computer storage medium, and an electronic device.
Background
With the advent of the big data era and the continuing development of computer technology, enterprises have a very high demand for data. The volume of information on the network keeps growing, and with the rapid development of intelligent and automated technology, many techniques have emerged that can quickly acquire large amounts of webpage data. As illegitimate acquisition of webpage data has appeared, many websites have adopted protective measures, so that simple network requests can no longer acquire data effectively; how to acquire the required data effectively has become a serious problem. One common protective measure is to have a script (e.g., JavaScript) generate encryption parameters and to check each user access request for them: if the request contains the correct script-generated encryption parameters, the webpage data is returned to the user; otherwise the request is treated as illegitimate and the user cannot obtain the data.
Disclosure of Invention
The embodiments of the invention provide a data acquisition method and device, a computer storage medium, and electronic equipment, to overcome or mitigate the technical problems in the prior art.
The technical solution adopted by the invention is as follows.
According to one aspect of the present invention, there is provided a data acquisition method comprising:
starting a browser;
accessing, through the browser, a webpage corresponding to the data to be acquired;
in response to detecting a triggering operation on a link in the webpage that meets a preset rule, intercepting the request packet generated by the triggering operation; and
generating a network request based on the request packet, and sending the network request to a server to acquire the data to be acquired.
Optionally, in some embodiments, after starting the browser, the method further includes: monitoring triggering operations on links in the webpage based on a set listening rule.
Optionally, in some embodiments, the method further comprises setting a listening rule;
setting the listening rule specifically includes:
based on link features in the webpage, setting links having the features to be intercepted as links conforming to the preset rule; or
based on the type of data requested, setting links corresponding to access requests for data of a specific type as links conforming to the preset rule;
and, when a triggering operation on a link conforming to the preset rule is detected, triggering the interceptor to intercept.
Optionally, in some embodiments, intercepting the request packet generated by the triggering operation further includes: storing the intercepted request packet in a memory;
generating a network request based on the request packet and sending the network request to a server to acquire the data to be acquired specifically includes: retrieving request packets from the memory with multiple threads, generating a network request from each request packet, and sending the network requests to a server to acquire the data to be acquired.
Optionally, in some embodiments, the method further comprises configuring a runtime environment, which comprises installing the browser, a first script, and a browser control tool.
Optionally, in some embodiments, starting the browser specifically includes: starting the browser through the browser control tool.
Optionally, in some embodiments, the method further comprises:
controlling, through the browser control tool, the first script to simulate a user operation that triggers a link in the webpage, so as to generate a request packet.
According to another aspect of the present invention, there is provided a data acquisition device comprising:
a starting unit for starting a browser;
an access unit for accessing, through the browser, a webpage corresponding to the data to be acquired;
an interception unit for intercepting, in response to detecting a triggering operation on a link in the webpage that meets a preset rule, the request packet generated by the triggering operation; and
an acquisition unit for generating a network request based on the request packet and sending the network request to a server to acquire the data to be acquired.
According to still another aspect of the present invention, there is provided a computer storage medium storing a computer executable program that, when executed, implements the data acquisition method according to any of the embodiments of the present invention.
According to yet another aspect of the present invention, there is provided an electronic device comprising a memory storing a computer executable program and a processor for running the computer executable program to implement the data acquisition method according to any of the embodiments of the present invention.
According to the embodiments of the invention, the encryption parameters are obtained by intercepting the network request packet. This avoids complex reverse engineering of the script, remains compatible with whatever calculation the website's script uses to generate the encryption parameters, and avoids data acquisition failures caused by changes to the script. Compared with conventional browser-automated data acquisition, the method is more efficient: data can be acquired merely by intercepting and extracting the script-generated encryption parameters and assembling them into a network request, without the many browser operations that conventional browser automation requires.
Drawings
FIG. 1 is a flow chart of a data acquisition method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a data acquisition device according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
To make the technical problems to be solved, the technical solutions, and the advantages clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments.
According to the embodiments of the invention, the encryption parameters are obtained by intercepting the request packet sent by the browser, and a network request is generated based on the encryption parameters and sent to the server, so that the data to be acquired is obtained.
FIG. 1 is a flow chart of a data acquisition method according to an embodiment of the invention. As shown in FIG. 1, the data acquisition method 10 includes:
Step 11: starting the browser.
In this embodiment, a browser control tool is used to start the browser.
In this embodiment, the browser may be the Chromium browser. Chromium is a web browser developed mainly by Google; its rendering engine descends from WebKit (itself derived from KHTML), and its source code is released as open source under the BSD license and several other free licenses. It is fast, stable, and secure. In some operating environments, Chromium requires no installation: the downloaded package can be decompressed and run directly. It will be appreciated that in other operating environments, such as Windows systems, an installer version may also be downloaded and run after installation.
In this embodiment, the browser control tool is pyppeteer, a control tool written in Python. Puppeteer is a Node.js-based tool developed by Google that can control operations of the Chrome and Chromium browsers through JavaScript. Pyppeteer is a browser control tool that implements Puppeteer in Python. It will be appreciated that in other embodiments the browser may be controlled using Selenium, and the implementation language need not be Python; any suitable language, such as Java or Node.js, may be used.
It will be appreciated that in some embodiments, if the runtime environment has not been set up, the method further comprises a step of configuring the runtime environment, which includes installing a browser, a scripting language (e.g., JavaScript), and a browser control tool.
Step 12: accessing, through the browser, a webpage corresponding to the data to be acquired.
Specifically, the webpage corresponding to the data to be acquired is a common page type: an intermediate page that sits between the navigation page and the detail pages and gathers the labels of all related content. For example, if the website to be collected is an internet news site, the corresponding webpage is the news home page, which carries a number of news labels arranged by column; triggering a news label further displays a set of news headlines or a news detail page.
More specifically, controlling the browser to access the webpage corresponding to the data to be acquired means sending a request to a server through the browser, asking to access that webpage. The request encapsulates the web address (URL) of the webpage and browser parameters. The server returns the webpage in response to the request, and the webpage is decoded into a format the browser supports.
Step 13: in response to detecting a triggering operation on a link in the webpage that meets a preset rule, intercepting the request packet generated by the triggering operation.
In some embodiments, after starting the browser, the data acquisition method further comprises: monitoring triggering operations on links in the webpage based on a set listening rule.
In some embodiments, the method further comprises setting a listening rule.
Setting the listening rule specifically includes:
based on link features in the webpage, setting links having the features to be intercepted as links conforming to the preset rule; or
based on the type of data requested, setting links corresponding to access requests for data of a specific type as links conforming to the preset rule;
and, when a triggering operation on a link conforming to the preset rule is detected, triggering the interceptor to intercept.
In this embodiment, the preset rule is the interception rule of the interceptor. The interceptor acts as a network filter: it intercepts network requests that meet the interception rule and releases all other network requests. The interceptor may be any existing network interceptor, a custom-written network interceptor, or the browser's own interceptor; no limitation is imposed here, provided it can intercept network requests with specific features.
In this embodiment, the Python-based pyppeteer is used to set the interception rule of the interceptor.
In this embodiment, the interception rule is set based on features of the requested link: requests for the preset link are intercepted, and requests for any other link are released. For example, network requests whose link contains http://www.zzz.com/details are intercepted, and other requests are released.
In some embodiments, the method further comprises a step of setting the interception rules of the interceptor. The interception rules may be preset in the browser control tool, or an operation entry may be provided in the browser toolbar so that the user can customize them. For example, a tab is provided in the toolbar; triggering it opens an interactive interface in which the user can set the interception rules.
In some embodiments, the interception rules of the interceptor may be set based on the type of data requested. The requested data types include, but are not limited to, script, document, gif, and xhr. Access requests for a specific data type are intercepted, and access requests for other types are released. For example, access requests of type gif are intercepted and access requests for other data types are released.
In some embodiments, there are two or more interception rules, stored in the interceptor as a list.
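The rule list described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `InterceptionRule`, `should_intercept`, and `RULES` are hypothetical, a URL rule is treated as a substring match, and `xhr` is used as the example resource type.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InterceptionRule:
    """One rule: match on a URL fragment, a resource type, or both (hypothetical shape)."""
    url_contains: Optional[str] = None
    resource_type: Optional[str] = None

    def matches(self, url: str, resource_type: str) -> bool:
        if self.url_contains is None and self.resource_type is None:
            return False  # an empty rule matches nothing
        if self.url_contains is not None and self.url_contains not in url:
            return False
        if self.resource_type is not None and self.resource_type != resource_type:
            return False
        return True


def should_intercept(rules: List[InterceptionRule], url: str, resource_type: str) -> bool:
    """Intercept when any rule in the list matches; otherwise the request is released."""
    return any(rule.matches(url, resource_type) for rule in rules)


# Two rules stored as a list: one link-feature rule and one data-type rule.
RULES = [
    InterceptionRule(url_contains="http://www.zzz.com/details"),
    InterceptionRule(resource_type="xhr"),
]
```

Keeping the rules as plain data makes it straightforward to let a user add or remove rules without touching the matching logic.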
In some embodiments, the method further comprises controlling, through the browser control tool, a first script to simulate a user operation that triggers a link in the webpage, so as to generate a request packet. The webpage gathers a number of detail page links, and triggering a detail page link generates a network request packet for accessing that detail page. In this embodiment, pyppeteer controls JavaScript to simulate a user operation (for example, a mouse click on a detail page link in the webpage) that triggers a detail page link conforming to the interception rule of the interceptor, generating a request packet for accessing the detail page.
The encryption parameter is to be understood as follows: a script (e.g., JavaScript) on the browser side encrypts some of the parameters in the network request, or adds specific encryption parameters, and then encapsulates them to generate a network request packet. On detecting a network request packet that includes the encryption parameters, the server returns the data corresponding to that packet; otherwise it rejects the network request. In the invention, a user operation is simulated to trigger a link conforming to the interception rule of the interceptor, so that a request packet carrying the encryption parameters is generated on the browser side.
The network request packet carrying the encryption parameters is intercepted by the interceptor before it is sent to the server, and a network request is generated from the intercepted packet and sent to the server, so that the data to be acquired can be obtained. Because the request packet includes the encryption parameters, the server will not reject the later network request as illegitimate access, provided the encryption parameters from the request packet are encapsulated in that network request.
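The click-and-intercept step might look like the following with pyppeteer's request interception. This is a sketch under assumptions, not the embodiment's actual code: `capture_packets`, `packet_from_request`, the CSS selector argument, and the one-second wait are hypothetical; aborting the intercepted request is one reading of "intercepted before being sent to the server". The pyppeteer import is deferred so the helper can be exercised without a browser.

```python
import asyncio
from types import SimpleNamespace
from typing import Any, Dict, List

TARGET_FRAGMENT = "http://www.zzz.com/details"  # link feature from the text's example


def packet_from_request(request: Any) -> Dict[str, Any]:
    """Copy the fields needed to rebuild the request later."""
    return {
        "url": request.url,
        "method": request.method,
        "headers": dict(request.headers),
        "post_data": request.postData,
    }


async def capture_packets(list_page_url: str, link_selector: str) -> List[Dict[str, Any]]:
    """Open the list page, click one link, and capture matching request packets."""
    from pyppeteer import launch  # third-party package, imported lazily

    captured: List[Dict[str, Any]] = []

    async def on_request(request: Any) -> None:
        if TARGET_FRAGMENT in request.url:
            captured.append(packet_from_request(request))
            await request.abort()       # intercepted: not forwarded to the server
        else:
            await request.continue_()   # released unchanged

    browser = await launch(headless=True)
    try:
        page = await browser.newPage()
        await page.setRequestInterception(True)
        page.on("request", lambda req: asyncio.ensure_future(on_request(req)))
        await page.goto(list_page_url)
        await page.click(link_selector)  # simulated user click on a detail link
        await asyncio.sleep(1)           # crude wait for the request to fire (assumption)
    finally:
        await browser.close()
    return captured


# Offline check of the helper with a stand-in request object (made-up values).
_demo_request = SimpleNamespace(
    url="http://www.zzz.com/details?id=1&sign=abc123",
    method="GET",
    headers={"x-token": "t0k3n"},
    postData=None,
)
demo_packet = packet_from_request(_demo_request)
```

In a real run one would call `asyncio.run(capture_packets(url, selector))`; the script-generated parameters then arrive inside each captured packet's URL, headers, or body without any need to reproduce the script's calculation.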
In some embodiments, the method further comprises: sending the intercepted request packet to a memory for storage.
In this embodiment, the memory is Redis (Remote Dictionary Server), a key-value storage system written by Salvatore Sanfilippo and a cross-platform non-relational database. Redis is open source, written in ANSI C, distributed under the BSD license, network-capable, memory-based, distributed, and optionally persistent; it stores key-value pairs and provides APIs for multiple languages. Redis is often called a data structure server because a value may be a string, a hash, a list, a set, a sorted set, and so on.
It will be appreciated that the memory may be any suitable storage device or platform, for example the relational database management system MySQL or the distributed file storage based database MongoDB.
In some embodiments, the intercepted request packets may be stored in the memory as lists organized by interception rule, with each request packet associated with its corresponding interception rule, so that the corresponding request packets can later be retrieved from the memory quickly and efficiently. For example, for an interception rule based on a URL, the keyword set in the rule may be extracted and stored. If the URL pattern of the interception rule contains the keyword notify, each matching request packet is stored in association with notify; when the memory is accessed later, the corresponding request packets can be retrieved quickly using notify as the key.
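The keyword-keyed storage just described can be illustrated with an in-memory stand-in. This sketch deliberately does not talk to Redis: `PacketStore` is a hypothetical class that mimics one FIFO list per rule keyword, and the packet contents are made up. With the redis-py client, `rpush` and `lpop` on a key named after the keyword would play the roles of `push` and `pop` here.

```python
import json
from collections import defaultdict, deque
from typing import Any, Deque, Dict, Optional


class PacketStore:
    """In-memory stand-in for a list-per-key store: one FIFO list per rule keyword."""

    def __init__(self) -> None:
        self._lists: Dict[str, Deque[str]] = defaultdict(deque)

    def push(self, keyword: str, packet: Dict[str, Any]) -> None:
        # Serialize to JSON, as one would before writing to an external store.
        self._lists[keyword].append(json.dumps(packet))

    def pop(self, keyword: str) -> Optional[Dict[str, Any]]:
        # Return the oldest packet stored under the keyword, or None if there is none.
        items = self._lists.get(keyword)
        if not items:
            return None
        return json.loads(items.popleft())


store = PacketStore()
store.push("notify", {"url": "http://www.zzz.com/notify?id=1", "method": "GET"})
first = store.pop("notify")     # the packet stored under the rule keyword
missing = store.pop("details")  # no packet was stored under this keyword
```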
According to the embodiments of the invention, the encryption parameters are obtained by intercepting the request packet. This avoids complex reverse engineering of the JavaScript, remains compatible with whatever calculation the website's JavaScript uses to generate the encryption parameters, and avoids data acquisition failures caused by changes to the JavaScript.
Step 14: generating a network request based on the request packet, and sending the network request to a server to acquire the data to be acquired.
Specifically, because the intercepted request packet includes the encryption parameters, generating the network request from the contents of the request packet encapsulates the encryption parameters in the network request. In this embodiment, the request containing the encryption parameters is assembled using a Python HTTP request library.
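As a rough illustration of rebuilding a network request from a stored packet, the sketch below uses the standard library's `urllib.request.Request` rather than whichever library the embodiment used, so that it stays dependency-free. The function name `build_request`, the packet layout, and the `sign` and `X-Token` values are all hypothetical.

```python
from typing import Any, Dict
from urllib.request import Request


def build_request(packet: Dict[str, Any]) -> Request:
    """Rebuild an HTTP request that carries the intercepted parameters unchanged."""
    body = packet.get("post_data")
    data = body.encode("utf-8") if isinstance(body, str) else None
    return Request(
        url=packet["url"],
        data=data,
        headers=dict(packet.get("headers", {})),
        method=packet.get("method", "GET"),
    )


packet = {
    "url": "http://www.zzz.com/details?id=42&sign=abc123",  # 'sign' is a made-up parameter
    "method": "GET",
    "headers": {"User-Agent": "Mozilla/5.0", "X-Token": "t0k3n"},  # made-up header value
    "post_data": None,
}
request = build_request(packet)
# urllib.request.urlopen(request) would send it; omitted here to stay offline.
```

The point of the sketch is that nothing is recomputed: the URL, headers, and body are copied from the packet, so any script-generated value travels along as is.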
In some embodiments, multiple data acquisition workers access the memory concurrently using multithreading; each generates network requests from the contents of the request packets it retrieves and sends them to a server to acquire the data to be acquired, which can effectively improve data acquisition efficiency.
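The multithreaded retrieval can be sketched with a shared queue drained by several worker threads. The names `collect` and `fake_send` are hypothetical, the queue stands in for the memory, and `fake_send` replaces real HTTP sending so the example runs offline.

```python
import queue
import threading
from typing import Any, Callable, Dict, List


def collect(packets: List[Dict[str, Any]],
            send: Callable[[Dict[str, Any]], Any],
            workers: int = 4) -> List[Any]:
    """Drain a shared queue of packets with several threads and gather the responses."""
    pending: "queue.Queue[Dict[str, Any]]" = queue.Queue()
    for packet in packets:
        pending.put(packet)

    results: List[Any] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                packet = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to collect
            response = send(packet)
            with lock:  # protect the shared result list
                results.append(response)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def fake_send(packet: Dict[str, Any]) -> str:
    """Placeholder for real HTTP sending (assumption for the demo)."""
    return "fetched " + packet["url"]


demo_packets = [{"url": f"http://www.zzz.com/details?id={i}"} for i in range(5)]
responses = collect(demo_packets, fake_send)
```

Responses arrive in whatever order the threads finish, so callers that need a stable order should sort or key the results themselves.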
On detecting the encryption parameters, the server treats the access request as normal and legitimate and does not reject it, so the data to be acquired in the detail page can be obtained.
The embodiments of the invention obtain the encryption parameters by using an interceptor to intercept the network request, which is more efficient than conventional browser-automated data acquisition: data can be acquired merely by intercepting and extracting the script-generated encryption parameters and assembling them into a network request, without the many browser operations that conventional browser automation requires.
FIG. 2 is a schematic structural diagram of a data acquisition device 2 according to an embodiment of the present invention. As shown in FIG. 2, the device includes:
a starting unit 22 for starting the browser.
In this embodiment, a browser control tool is used to start the browser.
In this embodiment, the browser may be the Chromium browser. Chromium is a web browser developed mainly by Google; its rendering engine descends from WebKit (itself derived from KHTML), and its source code is released as open source under the BSD license and several other free licenses. It is fast, stable, and secure. In some operating environments, Chromium requires no installation: the downloaded package can be decompressed and run directly. It will be appreciated that in other operating environments, such as Windows systems, an installer version may also be downloaded and run after installation.
In this embodiment, the browser control tool is pyppeteer, a control tool written in Python. Puppeteer is a Node.js-based tool developed by Google that can control operations of the Chrome and Chromium browsers through JavaScript. Pyppeteer is a browser control tool that implements Puppeteer in Python. It will be appreciated that in other embodiments the browser may be controlled using Selenium, and the implementation language need not be Python; any suitable language, such as Java or Node.js, may be used.
It will be appreciated that in some embodiments, if the runtime environment has not been set up, the data acquisition device further includes a configuration unit 21 for configuring the runtime environment, which includes installing a browser, a scripting language (e.g., JavaScript), and a browser control tool.
An access unit 24 for accessing, through the browser, a webpage corresponding to the data to be acquired.
Specifically, the webpage corresponding to the data to be acquired is a common page type: an intermediate page that sits between the navigation page and the detail pages and gathers the labels of all related content. For example, if the website to be collected is an internet news site, the corresponding webpage is the news home page, which carries a number of news labels arranged by column; triggering a news label further displays a set of news headlines or a news detail page.
More specifically, controlling the browser to access the webpage corresponding to the data to be acquired means sending a request to a server through the browser, asking to access that webpage. The request encapsulates the web address (URL) of the webpage. The server returns the webpage in response to the request, and the webpage is decoded into a format the browser supports.
An interception unit 26 for intercepting, in response to detecting a triggering operation on a link in the webpage that meets a preset rule, the request packet generated by the triggering operation.
In some embodiments, the data acquisition device further includes a listening unit 25 for monitoring triggering operations on links in the webpage based on a set listening rule.
In some embodiments, the device further comprises a setting unit 23 for setting the listening rule. Setting the listening rule specifically includes:
based on link features in the webpage, setting links having the features to be intercepted as links conforming to the preset rule; or
based on the type of data requested, setting links corresponding to access requests for data of a specific type as links conforming to the preset rule;
and, when a triggering operation on a link conforming to the preset rule is detected, triggering the interceptor to intercept.
In this embodiment, the preset rule is the interception rule of the interceptor. The interceptor acts as a network filter: it intercepts network requests that meet the interception rule and releases all other network requests. The interceptor may be any existing network interceptor, a custom-written network interceptor, or the browser's own interceptor; no limitation is imposed here, provided it can intercept network requests with specific features.
In this embodiment, the Python-based pyppeteer is used to set the interception rule of the interceptor.
In this embodiment, the interception rule is set based on features of the requested link: requests for the preset link are intercepted, and requests for any other link are released. For example, network requests whose link contains http://www.zzz.com/details are intercepted, and other requests are released.
In some embodiments, the interception rules of the interceptor may be preset in the browser control tool, or an operation entry may be provided in the browser toolbar so that the user can customize them. For example, a tab is provided in the toolbar; triggering it opens an interactive interface in which the user can set the interception rules.
In some embodiments, the interception rules of the interceptor may be set based on the type of data requested. The requested data types include, but are not limited to, script, document, gif, and xhr. Access requests for a specific data type are intercepted, and access requests for other types are released. For example, access requests of type gif are intercepted and access requests for other data types are released.
In some embodiments, there are two or more interception rules, stored in the interceptor as a list.
In some embodiments, the device further comprises a generating unit for controlling, through the browser control tool, a first script to simulate a user operation that triggers a link in the webpage, so as to generate a request packet. The webpage gathers a number of detail page links, and triggering a detail page link generates a network request packet for accessing that detail page. In this embodiment, pyppeteer controls JavaScript to simulate a user operation (for example, a mouse click on a detail page link in the webpage) that triggers a detail page link conforming to the interception rule of the interceptor, generating a request packet for accessing the detail page.
The encryption parameter is to be understood as follows: a script (e.g., JavaScript) on the browser side encrypts some of the parameters in the network request, or adds specific encryption parameters, and then encapsulates them to generate a network request packet. On detecting a network request packet that includes the encryption parameters, the server returns the data corresponding to that packet; otherwise it rejects the network request. In the invention, a user operation is simulated to trigger a link conforming to the interception rule of the interceptor, so that a request packet carrying the encryption parameters is generated on the browser side.
The network request packet carrying the encryption parameters is intercepted by the interceptor before it is sent to the server, and a network request is generated from the intercepted packet and sent to the server, so that the data to be acquired can be obtained. Because the request packet includes the encryption parameters, it suffices to encapsulate those encryption parameters in the later network request for the server not to reject it as illegitimate access.
In some embodiments, the apparatus further comprises a storage unit to send the intercepted request packet to a memory store.
In this embodiment, the memory is Remote Dictionary Server (Redis), which is a key-value type storage system written by Salvatore Sanfilippo, and is a cross-platform non-relational database. Redis is an open source, written in ANSI C language, compliant with BSD protocols, supporting networks, memory-based, distributed, optionally persistent Key-Value pair (Key-Value) store database, and provides multiple language APIs. Redis is commonly referred to as a data structure server because a value (value) may be a String (String), a Hash (Hash), a list (list), a collection (sets), a sorted collection (sets), and the like.
It will be appreciated that the memory may be any suitable storage device or platform, for example, a relational database management system mysql, a distributed file storage based database mong odb.
In some embodiments, a plurality of intercepted request packets stored in the memory may be stored in a list based on interception rules, and each request packet is associated with a corresponding interception rule, so that subsequent quick and efficient acquisition of the corresponding request packet stored in the memory is facilitated. For example, in the interception rule set based on URL, keywords set in the rule may be intercepted for storage. For example, when the request packet including the pattern of the notify is stored in the URL of the interception rule, the request packet is associated with the notify, and when the memory is accessed later, the corresponding request packet can be obtained quickly using the notify as a keyword.
According to the embodiment of the invention, the encryption parameters are obtained by intercepting the request packet, which avoids complicated reverse engineering of the JavaScript. The approach stays compatible with whatever calculation the website's JavaScript uses to generate the encryption parameters, so data acquisition does not fail when the JavaScript is modified.
The acquisition unit 27 is configured to generate a network request based on the request packet and to send the network request to a server to acquire the data to be acquired.
Specifically, since the intercepted request packet includes the encryption parameters, generating a network request from the content of the request packet encapsulates the encryption parameters in that network request. In this embodiment, the request including the encryption parameters is encapsulated using requests in Python.
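The encapsulation step can be sketched as follows. The embodiment names requests in Python; to stay runnable without third-party packages, this sketch swaps in the standard-library `urllib.request` instead. The packet dictionary layout (`url`, `method`, `headers`, `body`), the `sign` parameter, and the example.com URL are assumptions made for illustration, and the request is only constructed, not sent.

```python
from urllib.request import Request


def build_request(packet):
    """Rebuild an HTTP request from an intercepted packet without sending it.

    The encryption parameters travel along unchanged because the URL,
    headers, and body are copied verbatim from the intercepted packet.
    """
    body = packet.get("body")
    data = body.encode("utf-8") if isinstance(body, str) else body
    return Request(
        url=packet["url"],
        data=data,
        headers=dict(packet.get("headers", {})),
        method=packet.get("method", "GET"),
    )


# Hypothetical intercepted packet; "sign" stands in for an encryption parameter.
packet = {
    "url": "https://example.com/api/detail?id=42&sign=abc123",
    "method": "GET",
    "headers": {"User-Agent": "Mozilla/5.0", "X-Sign": "abc123"},
    "body": None,
}
req = build_request(packet)
```

Sending the request would be a separate step (for example with `urllib.request.urlopen(req)`), which is deliberately left out so the sketch has no network side effects.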
In some embodiments, multiple data acquisition processes access the memory in a multithreaded manner, each generating network requests from the content of the request packets and sending them to the server to acquire the data to be acquired, which can effectively improve data acquisition efficiency.
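A minimal sketch of the multithreaded acquisition, assuming the stored packets have already been read into a Python list. The `collect_all` and `fake_send` names are hypothetical, and `fake_send` is a placeholder that simulates the network call rather than contacting a server.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_send(packet):
    """Placeholder for the real network call; echoes the URL instead of contacting a server."""
    return {"url": packet["url"], "status": "simulated"}


def collect_all(packets, send=fake_send, max_workers=4):
    """Issue one request per stored packet on a thread pool.

    pool.map returns results in the same order as the input packets,
    even though the calls themselves may run concurrently.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, packets))


# Hypothetical packets as they might be read back from the memory.
packets = [
    {"url": f"https://example.com/api/detail?id={i}&sign=s{i}"} for i in range(5)
]
results = collect_all(packets)
```

Threads suit this step because the work is dominated by waiting on network I/O; the worker count here is an arbitrary illustrative value.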
On detecting the encryption parameters, the server treats the access request as normal and legal and does not reject it, so the data to be acquired in the detail page can be obtained.
The embodiment of the invention obtains the encryption parameters by using the interceptor to intercept the network request, which is more efficient than conventional browser-automation data acquisition. Data can be acquired simply by intercepting and extracting the encryption parameters generated by the script and assembling them into a network request, which avoids the many browser operations that conventional browser automation requires.
Fig. 3 is a schematic structural diagram of an electronic device 3 according to an embodiment of the present invention; as shown in fig. 3, the electronic device 3 includes a memory 31, on which a computer executable program is stored, and a processor 32 for running the computer executable program to implement the data acquisition method according to any one of the embodiments of the present invention.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network via a communication portion, and/or installed from a removable medium. The above-described functions defined in the method of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU). The computer readable medium according to the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Python, Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present invention may be implemented in software or in hardware. The described units may also be provided in a processor. In some cases the names of these units do not limit the units themselves; for example, the trigger unit may also be described as a unit for "intercepting a network request packet".
The above description is only illustrative of the preferred embodiments of the present invention and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the invention referred to in the present invention is not limited to the specific combinations of the technical features described above, but also covers other technical solutions formed by any combination of the technical features described above or their equivalents without departing from the inventive concept described above, for example, technical solutions formed by interchanging the above features with technical features of similar function disclosed in (but not limited to) the present invention.

Claims (7)

1. A method of data acquisition, comprising:
starting a browser;
controlling a first script to simulate user operation by a browser control tool to trigger links in a webpage so as to generate a network request packet added with encryption parameters, wherein the encryption parameters are that the script at the browser end encrypts part of parameters in the network request or adds specific encryption parameters;
accessing a webpage corresponding to the data to be acquired through the browser; and
in response to monitoring a triggering operation on a link meeting preset rules in the webpage, intercepting a network request packet generated during the triggering operation;
generating a network request based on the intercepted network request packet, and sending the network request to a server to acquire data to be acquired;
the method further comprises the steps of monitoring triggering operation of links in the webpage based on the set monitoring rule after the browser is started, and setting the monitoring rule;
the listening rule includes:
setting the link characteristics to be intercepted as links conforming to preset rules based on the link characteristics in the webpage; or
setting a link corresponding to an access request of the data of the specific type as a link conforming to a preset rule based on the type of the data requested to be accessed;
and when the triggering operation of the link conforming to the preset rule is monitored, triggering the interceptor to intercept.
2. The method of claim 1, wherein after intercepting the network request packet generated at the time of the trigger operation, further comprising: storing the intercepted network request packet to a memory;
the generating of a network request based on the intercepted network request packet and the sending of the network request to a server to acquire data to be acquired specifically comprise: acquiring network request packets from the memory through multiple threads, generating network requests based on the respective network request packets, and sending the network requests to the server to acquire the data to be acquired.
3. The method of claim 1, further comprising configuring a runtime environment, the configuring the runtime environment comprising installing a browser, a first script, and a browser control tool.
4. A method according to claim 3, wherein starting the browser specifically comprises: starting the browser through the browser control tool.
5. A data acquisition device, comprising:
the starting unit is used for starting the browser;
the generation unit is used for controlling the first script to simulate the user operation to trigger the link in the webpage through the browser control tool so as to generate a network request packet added with encryption parameters, wherein the encryption parameters refer to that the script at the browser end encrypts part of parameters in the network request or adds specific encryption parameters;
the access unit is used for accessing the webpage corresponding to the data to be acquired through the browser;
the interception unit is used for intercepting a network request packet generated during triggering operation in response to monitoring the triggering operation on the links meeting the preset rules in the webpage; and
the acquisition unit is used for generating a network request based on the network request packet obtained by interception, and sending the network request to the server to acquire data to be acquired;
the device also comprises a monitoring unit and a monitoring rule setting unit;
the monitoring unit is used for monitoring the triggering operation of the link in the webpage based on the set monitoring rule;
the monitoring rule setting unit is used for setting a monitoring rule, and the monitoring rule comprises:
setting the link characteristics to be intercepted as links conforming to preset rules based on the link characteristics in the webpage; or
setting a link corresponding to an access request of the data of the specific type as a link conforming to a preset rule based on the type of the data requested to be accessed;
and when the triggering operation of the link conforming to the preset rule is monitored, triggering the interceptor to intercept.
6. A computer storage medium having stored thereon a computer executable program that is run to implement the method of any of claims 1-4.
7. An electronic device comprising a memory for storing a computer executable program thereon and a processor for running the computer executable program to implement the method of any of claims 1-4.
CN202111612228.2A 2021-12-27 2021-12-27 Data acquisition method and device, computer storage medium and electronic equipment Active CN114491356B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111612228.2A CN114491356B (en) 2021-12-27 2021-12-27 Data acquisition method and device, computer storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111612228.2A CN114491356B (en) 2021-12-27 2021-12-27 Data acquisition method and device, computer storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN114491356A CN114491356A (en) 2022-05-13
CN114491356B true CN114491356B (en) 2023-07-04

Family

ID=81495209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111612228.2A Active CN114491356B (en) 2021-12-27 2021-12-27 Data acquisition method and device, computer storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114491356B (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2987110B1 (en) * 2013-04-19 2018-06-13 EntIT Software LLC Unused parameters of application under test
CN104408204A (en) * 2014-12-18 2015-03-11 北京国双科技有限公司 Method and device for obtaining webpage page link address
CN109587269A (en) * 2018-12-27 2019-04-05 迅雷计算机(深圳)有限公司 A kind of hold-up interception method, unit, system and the storage medium of downloading behavior
CN111159614B (en) * 2019-12-30 2021-02-02 北京金堤科技有限公司 Webpage resource acquisition method and device
CN111314298B (en) * 2020-01-16 2020-12-29 北京金堤科技有限公司 Verification identification method and device, electronic equipment and storage medium
CN111552854A (en) * 2020-04-24 2020-08-18 北京明略软件系统有限公司 Webpage data capturing method and device, storage medium and equipment
CN112417324A (en) * 2020-05-12 2021-02-26 上海哔哩哔哩科技有限公司 Chrome-based URL (Uniform resource locator) interception method and device and computer equipment

Also Published As

Publication number Publication date
CN114491356A (en) 2022-05-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant