CN110020060B - Webpage data crawling method and device and storage medium - Google Patents

Webpage data crawling method and device and storage medium

Info

Publication number
CN110020060B
CN110020060B
Authority
CN
China
Prior art keywords
url
webpage data
url list
storage path
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810791126.3A
Other languages
Chinese (zh)
Other versions
CN110020060A (en)
Inventor
吴壮伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201810791126.3A priority Critical patent/CN110020060B/en
Priority to PCT/CN2018/108218 priority patent/WO2020015192A1/en
Publication of CN110020060A publication Critical patent/CN110020060A/en
Application granted granted Critical
Publication of CN110020060B publication Critical patent/CN110020060B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a webpage data crawling method comprising: reading a first URL list according to a received webpage data crawling request; generating a plurality of application containers from a pre-constructed docker mirror image; dividing the first URL list into a plurality of second URL lists; crawling, in parallel, the webpage data corresponding to each URL in the second URL lists; and extracting the webpage data and sending it to the user terminal corresponding to the webpage data crawling request. The invention also provides an electronic device and a computer-readable storage medium. With the method and the device, the efficiency of crawling webpage data can be improved.

Description

Webpage data crawling method and device and storage medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a method for crawling web page data, an electronic device, and a computer-readable storage medium.
Background
In the prior art, the traditional way to run multiple tasks on one server is to start multiple virtual machines and run a different task on each of them. Most traditional virtualization technologies adopt VMware-based virtual machines, and each VMware virtual machine must run a complete operating system, which occupies a large amount of system resources.
However, server resources such as CPU, memory, network bandwidth, and disk are limited. Taking webpage data crawling as an example, the distributed deployment of current crawlers is constrained by the number of machines, CPUs, threads, and processes; when the virtual machines started on a server consume excessive resources, system resources cannot be utilized to the fullest, and the crawling efficiency of the webpage data suffers.
Disclosure of Invention
In view of the foregoing, the present invention provides a webpage data crawling method, an electronic device, and a computer-readable storage medium, with the main objective of improving the efficiency of crawling webpage data.
In order to achieve the above object, the present invention provides a method for crawling web page data, comprising:
S1, receiving a webpage data crawling request, acquiring a first Uniform Resource Locator (URL) list according to the webpage data crawling request, and storing the first URL list into a first preset storage path where a preset configuration file is located, wherein the first URL list comprises the URLs to be crawled;
S2, reading a pre-constructed docker mirror image from a second preset storage path, and generating a plurality of application containers according to the docker mirror image, wherein the application containers comprise: a first application container, a second application container, and a third application container;
S3, reading the first URL list and the configuration file from the first preset storage path, dividing the first URL list into a plurality of second URL lists based on the first application container, and storing the plurality of second URL lists into a third preset storage path;
S4, respectively crawling the webpage data corresponding to each URL in the second URL lists based on the second application containers, and storing the webpage data into a fourth preset storage path; and
S5, extracting the webpage data from the fourth preset storage path based on the third application container, and sending the webpage data to a user terminal corresponding to the webpage data crawling request.
Preferably, before step S1, the method further comprises:
receiving configuration parameters sent by a client, and acquiring from the configuration parameters the pre-configured number of parallel processes and the designated storage paths of the webpage data and of each program; and
generating a configuration file according to the acquired number of parallel processes and the file path of each program, and storing the configuration file into the first preset storage path.
Preferably, the first URL list further includes index value information corresponding to each URL, where the "index value information corresponding to each URL" is obtained through the following steps:
acquiring and analyzing the specific information of each URL in the first URL list, and determining the characteristic information of each URL; and
matching a corresponding index value to each URL in the first URL list according to the mapping relation between characteristic information and index values.
Preferably, the method further comprises the steps of:
when a URL that cannot be matched to an index value according to its characteristic information exists, generating prompt information based on that URL, and receiving a matching instruction for matching an index value to the URL.
Preferably, before step S5, the method further comprises the steps of:
comparing the quantity of the webpage data with the quantity of the URLs in the first URL list to determine a third URL list;
when the third URL list is not empty, performing webpage data crawling operation on each URL in the third URL list until the third URL list is empty, and storing webpage data corresponding to the third URL list into a fourth preset storage path; and
when the third URL list is empty, proceeding to step S5.
Preferably, before step S5, the method further comprises the steps of:
respectively acquiring a second URL list and a URL regular mining program from a third preset storage path;
acquiring webpage data corresponding to each second URL list from a fourth preset storage path, mining the webpage data corresponding to each second URL list, determining a fourth URL list corresponding to each second URL list, and storing the fourth URL list into a fifth preset storage path; and
performing a webpage data crawling operation on the fourth URL list, performing a sub-URL mining operation on the webpage data corresponding to the fourth URL list to extract new sub-URLs, and crawling the webpage data for those sub-URLs, and so on in a loop.
In addition, the present invention also provides an electronic device comprising a memory and a processor, wherein the memory stores a webpage data crawling program executable on the processor, and when the webpage data crawling program is executed by the processor, any step of the above webpage data crawling method is implemented.
In addition, to achieve the above object, the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a web page data crawling program, and when the web page data crawling program is executed by a processor, any step in the web page data crawling method may be implemented.
According to the webpage data crawling method, the electronic device and the computer-readable storage medium of the invention, docker application containers are created from a docker mirror image so that data can be processed in parallel. A docker application container avoids the resource waste caused by starting a full operating system and provides virtual-machine-like isolation at process-level cost. Under this framework, a user only needs to set the configuration file and build a mirror image file for the related programs; creating a plurality of docker application containers that crawl webpage data in parallel then allows the crawling work to be completed efficiently. In addition, data verification of the crawled webpage data ensures its integrity, and deep mining of sub-URLs in the crawled webpage data ensures its comprehensiveness.
Drawings
FIG. 1 is a flowchart illustrating a preferred embodiment of a web page data crawling method according to the present invention;
FIG. 2 is a diagram of an electronic device according to a preferred embodiment of the present invention;
FIG. 3 is a block diagram of the program modules of the webpage data crawling program of FIG. 2 according to the present invention.
The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a webpage data crawling method. Referring to fig. 1, a flowchart of a preferred embodiment of the web page data crawling method of the present invention is shown. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, the webpage data crawling method based on docker includes steps S1 to S5:
the method comprises the steps of S1, receiving a webpage data crawling request, obtaining a first URL list according to the webpage data crawling request, storing the first URL list into a first preset storage path where a preset configuration file is located, wherein the first URL list comprises URLs to be crawled;
the following description will describe an embodiment of the method of the present invention with an electronic device as an execution subject, where the electronic device serves as a server to establish a communication connection with a user terminal, receives a service data processing request sent by the user terminal, and processes service data according to the request. The electronic device may have a Central Processing Unit (CPU).
It can be understood that, before a webpage data crawling request sent by a user terminal is received and webpage data are crawled, a docker mirror image is configured on the electronic device. Specifically, the docker mirror image is created according to Dockerfile rules and includes a list partitioning program, a parallel processing program, a data verification program, a data merging program, and the like; the created docker mirror image is stored in a second preset storage path. After the docker mirror image is created, a plurality of application containers are created based on it. Each program can run independently inside an application container, and the application containers run independently of one another.
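As an illustration only, the following Python sketch shows how such a mirror image and its application containers might be created with the docker SDK for Python; the image tag, Dockerfile path, program names, and volume layout are assumptions and are not taken from the patent.

```python
# Sketch only: build a hypothetical crawler mirror image and start application
# containers from it. Assumes the docker SDK for Python and a local Docker daemon.
import docker

client = docker.from_env()

# Build the image from a Dockerfile that packages the list partitioning,
# crawling, verification and merging programs (path and tag are assumptions).
image, _build_logs = client.images.build(path="./crawler_image", tag="crawler:latest")

shared_volume = {"/data": {"bind": "/data", "mode": "rw"}}  # shared preset storage paths

containers = []
# First application container: splits the first URL list.
containers.append(client.containers.run(
    "crawler:latest", command="python split_urls.py", detach=True, volumes=shared_volume))
# Second application containers: crawl the second URL lists in parallel.
for _ in range(4):  # the number of crawler containers is configurable
    containers.append(client.containers.run(
        "crawler:latest", command="python crawl.py", detach=True, volumes=shared_volume))
# Third application container: merges the crawled data and sends it to the user terminal.
containers.append(client.containers.run(
    "crawler:latest", command="python merge_and_send.py", detach=True, volumes=shared_volume))
```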
Further, before step S1, the method further comprises the step of:
receiving configuration parameters sent by a client, and acquiring from the configuration parameters the pre-configured number of parallel processes and the designated storage paths of the webpage data and of each program;
generating a configuration file according to the acquired number of parallel processes and the file path of each program, and storing the configuration file into the first preset storage path.
The number of parallel processes is adjusted according to the number of cores of the server's CPU and the share of CPU already occupied by data processing.
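A minimal sketch of how such a configuration file could be generated is shown below; the JSON layout, key names, and paths are illustrative assumptions rather than details from the patent.

```python
# Sketch: write a configuration file holding the number of parallel processes and
# the storage paths used by the programs (all key names and paths are assumptions).
import json
import os

config = {
    "parallel_processes": os.cpu_count() or 4,   # tuned to the server's CPU cores
    "webpage_data_path": "/data/pages",          # where crawled pages are stored
    "program_paths": {
        "list_split": "/opt/crawler/split_urls.py",
        "crawler": "/opt/crawler/crawl.py",
        "verify": "/opt/crawler/verify.py",
    },
}

first_preset_path = "/data/config"               # also holds the first URL list
os.makedirs(first_preset_path, exist_ok=True)
with open(os.path.join(first_preset_path, "crawl_config.json"), "w") as f:
    json.dump(config, f, indent=2)
```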
The first URL list contained in the webpage data crawling request is the original list of URLs to be crawled. When the webpage data crawling request is received, an index value corresponding to each URL is determined according to the information of each URL in the first URL list, the index value information corresponding to each URL is added to the first URL list, and the updated first URL list is stored into the first preset storage path, which may be a Redis database.
As an embodiment, the "index value corresponding to each URL in the first URL list" is obtained through the following steps:
acquiring and analyzing specific information of each URL in a first URL list, and determining characteristic information of each URL; and
and matching the corresponding index value for each URL in the first URL list according to the mapping relation between the characteristic information and the index value.
The characteristic information can be used for representing the type of the webpage, and the index value is used for calling the crawler program. The mapping relation between the characteristic information and the index value is obtained through the following steps:
acquiring a set of designated URLs, determining the characteristic information of each URL in the set, and labeling each URL with an index value; dividing the designated URLs in the set into subsets corresponding to the different index values; counting, within each subset, the proportion of each kind of characteristic information, and selecting the characteristic information with the largest proportion as the target characteristic information of the designated URLs in that subset; and determining the mapping relation between characteristic information and index values according to the target characteristic information of each subset and the index value corresponding to that subset.
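The sketch below illustrates one way such a mapping could be derived from a labelled set of URLs; the feature-extraction rule used here (simply the URL's domain) is an assumed simplification for illustration.

```python
# Sketch: derive a characteristic-information -> index-value mapping from a
# labelled URL set; the feature rule (URL domain) is an assumed simplification.
from collections import Counter, defaultdict
from urllib.parse import urlparse

def feature_of(url: str) -> str:
    return urlparse(url).netloc                    # characteristic information

def build_mapping(labelled_urls):
    """labelled_urls: iterable of (url, index_value) pairs."""
    subsets = defaultdict(list)                    # index value -> URLs labelled with it
    for url, index_value in labelled_urls:
        subsets[index_value].append(url)

    mapping = {}
    for index_value, urls in subsets.items():
        counts = Counter(feature_of(u) for u in urls)
        target_feature, _ = counts.most_common(1)[0]   # feature with largest proportion
        mapping[target_feature] = index_value
    return mapping

def match_index(url, mapping):
    # Returns None when no index value matches, so that prompt information can be raised.
    return mapping.get(feature_of(url))
```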
Further, the method comprises the following steps:
when a URL that cannot be matched to an index value according to its characteristic information exists, generating prompt information based on that URL, and receiving a matching instruction for matching an index value to the URL.
The matching instruction includes the index value information for the URL that could not be matched according to its characteristic information. Such URLs are exported and fed back to a designated terminal, where the corresponding index values are determined manually.
S2, reading a pre-constructed docker mirror image from a second preset storage path, and generating a plurality of application containers according to the docker mirror image, wherein the application containers comprise: a first application container, a second application container, a third application container;
in this embodiment, the application container includes: a first application container, a plurality of second application containers, and a third application container.
S3, reading a first URL list and a configuration file from the first preset storage path, dividing the first URL list into a plurality of second URL lists based on the first application container, and storing the second URL lists into a third preset storage path;
Specifically, the first application container is run, the first URL list is acquired from the first preset storage path, the list dividing program is called from the second preset storage path, and the first URL list is divided equally into N URL sub-lists, that is, N second URL lists, where N is an integer greater than 1. Each second URL list includes a plurality of URLs and the index value information corresponding to each URL, and each second URL list is stored in the preset storage path for second URL lists, that is, the third preset storage path.
In this step, the number N of second URL lists equals the number of second application containers; for example, when there are 5 second application containers, N is 5 and the first URL list is divided into 5 second URL lists. Through this step, the list to be crawled and the corresponding program index values are divided and handed over.
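A minimal sketch of dividing the first URL list into N second URL lists and writing them to the third preset storage path might look as follows; the round-robin split, file names, and paths are assumptions.

```python
# Sketch: split the first URL list (entries of (url, index_value)) into N
# near-equal second URL lists; file names and paths are assumptions.
import json
import os

def split_url_list(first_url_list, n, third_preset_path="/data/url_sublists"):
    os.makedirs(third_preset_path, exist_ok=True)
    sublists = [first_url_list[i::n] for i in range(n)]   # round-robin, near-equal sizes
    for i, sublist in enumerate(sublists):
        with open(os.path.join(third_preset_path, f"second_url_list_{i}.json"), "w") as f:
            json.dump(sublist, f)
    return sublists
```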
Furthermore, the first application container reads the configuration file from the first preset storage path and allocates the CPU resources of the server to the plurality of second application containers so that they can execute the webpage data crawling operation in parallel. Data processing parameters are acquired from the configuration file, including the number N of parallel processes and the storage path of the crawled webpage data.
S4, crawling the webpage data corresponding to each URL in the second URL lists respectively based on the second application containers, and storing the webpage data into a fourth preset storage path;
The N second application containers run synchronously, and each second URL list corresponds to one second application container. Each second application container acquires its second URL list from the third preset storage path, calls in turn the crawler program corresponding to the index value of each URL in that list, performs the webpage data crawling operation, and stores the crawled webpage data into the designated storage path. The different second application containers share the same fourth preset storage path; that is, the webpage data crawled by different second application containers are stored in the same folder. Through this step, the crawling and aggregation of the webpage data are accomplished.
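As an illustrative sketch (not the patent's actual crawler code), the work inside one second application container might look like this: read the assigned second URL list, pick a crawler routine by index value, and write every page into the shared fourth preset storage path. The crawler registry, file naming, and paths are assumptions.

```python
# Sketch: one second application container crawling its assigned second URL list.
# The crawler registry, file naming and storage paths are illustrative assumptions.
import hashlib
import json
import os
import urllib.request

def crawl_generic(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

# Index value -> crawler routine; a real deployment would register one per page type.
CRAWLERS = {0: crawl_generic, 1: crawl_generic}

def run_second_container(sublist_path, fourth_preset_path="/data/pages"):
    os.makedirs(fourth_preset_path, exist_ok=True)
    with open(sublist_path) as f:
        second_url_list = json.load(f)                 # list of [url, index_value]
    for url, index_value in second_url_list:
        crawler = CRAWLERS.get(index_value, crawl_generic)
        page = crawler(url)
        name = hashlib.md5(url.encode()).hexdigest() + ".html"
        with open(os.path.join(fourth_preset_path, name), "wb") as out:
            out.write(page)                            # all containers share this folder
```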
S5, extracting the webpage data from the fourth preset storage path based on the third application container, and sending the webpage data to a user terminal corresponding to the webpage data crawling request.
The third application container is run, the webpage data crawled by the N second application containers are read from the fourth preset storage path and sent to the user terminal corresponding to the webpage data crawling request, and prompt information is generated.
In other embodiments, the crawled web page data needs to be verified in order to ensure the integrity of the crawled web page data. Specifically, before step S5, the method further comprises the steps of:
comparing the quantity of the webpage data with the quantity of the URLs in the first URL list to determine a third URL list;
when the third URL list is not empty, performing webpage data crawling operation on each URL in the third URL list until the third URL list is empty, and storing webpage data corresponding to the third URL list into a fourth preset storage path;
when the third URL list is empty, the step S5 is continued.
Specifically, a fourth application container is generated based on the docker mirror image, and the fourth application container verifies the crawled webpage data. The third URL list comprises the URLs still to be crawled and the index value corresponding to each of them. It can be understood that one URL corresponds to one piece of webpage data. Suppose the number of URLs in the first URL list is counted as P and the number of pieces of webpage data in the fourth preset storage path as Q. When P = Q, the third URL list is empty: there is no unprocessed URL and no webpage data left to crawl. When P > Q, the third URL list contains P - Q URLs to be processed, that is, webpage data still to be crawled; for this third URL list, the fourth application container is run, the webpage data crawling operation is executed, and the result is merged with the webpage data crawled in step S4. It should be noted that if URLs still remain in the third URL list after the crawling of the third URL list has been executed a preset number of times (for example, 3 times), warning information is generated. These steps prevent webpage data from being missed and ensure the integrity of the webpage data.
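The sketch below illustrates this verification step; the retry limit of 3 follows the description, while the per-URL file check is an assumed simplification of the count comparison, and the helper names are hypothetical.

```python
# Sketch: verify that every URL in the first URL list has crawled page data and
# re-crawl the missing ones (the third URL list) up to a retry limit.
import hashlib
import os

def page_file(url, fourth_preset_path):
    return os.path.join(fourth_preset_path, hashlib.md5(url.encode()).hexdigest() + ".html")

def missing_urls(url_list, fourth_preset_path):
    return [(u, idx) for u, idx in url_list
            if not os.path.exists(page_file(u, fourth_preset_path))]

def verify_and_recrawl(first_url_list, fourth_preset_path, crawl_one, max_retries=3):
    """crawl_one(url, index_value) crawls a single URL and stores its page data."""
    third_url_list = missing_urls(first_url_list, fourth_preset_path)
    retries = 0
    while third_url_list and retries < max_retries:
        for url, index_value in third_url_list:
            crawl_one(url, index_value)
        third_url_list = missing_urls(third_url_list, fourth_preset_path)
        retries += 1
    if third_url_list:                       # warning information after the retry limit
        print("warning: URLs still missing:", [u for u, _ in third_url_list])
    return third_url_list
```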
Preferably, since the crawled webpage data may itself contain sub-URLs, the crawled webpage data can be mined in depth to determine the webpage data corresponding to those sub-URLs, thereby ensuring the comprehensiveness of the crawled webpage data. Specifically, before step S5, the method further comprises the following steps:
respectively acquiring a second URL list and a URL regular mining program from a third preset storage path;
acquiring webpage data corresponding to each second URL list from a fourth preset storage path, mining the webpage data corresponding to each second URL list, determining a fourth URL list corresponding to each second URL list, and storing the fourth URL list into a fifth preset storage path;
performing a webpage data crawling operation on the fourth URL list, continuing the sub-URL mining operation on the webpage data corresponding to the fourth URL list to extract new sub-URLs, and crawling the webpage data for those sub-URLs, and so on in a loop.
Each second URL list corresponds to a fourth URL list, and the second application container that performs the webpage data crawling operation on a fourth URL list is the same container that handled the corresponding second URL list.
It should be understood that, to prevent new sub-URLs from being mined indefinitely, a depth threshold (the maximum number of rounds of sub-URL mining) is preset; when the number of mining rounds for a URL list exceeds this threshold, the mining of new sub-URLs is stopped.
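The following sketch shows a depth-limited sub-URL mining loop of this kind; the regular expression, the fetch callback, and the default depth threshold are assumptions for illustration.

```python
# Sketch: depth-limited mining of sub-URLs from crawled page data.
# The regex, the fetch callback and the default threshold are assumptions.
import re

URL_PATTERN = re.compile(rb'href="(https?://[^"]+)"')

def mine_sub_urls(page_bytes):
    return {m.decode("utf-8", "ignore") for m in URL_PATTERN.findall(page_bytes)}

def deep_crawl(second_url_list, fetch, depth_threshold=3):
    """fetch(url) returns page bytes; returns {url: page_bytes} across all rounds."""
    crawled = {}
    frontier = [url for url, _ in second_url_list]
    for _round in range(depth_threshold):        # stop once the depth threshold is reached
        next_frontier = []
        for url in frontier:
            if url in crawled:
                continue
            page = fetch(url)
            crawled[url] = page                  # the fourth URL list grows from these pages
            next_frontier.extend(u for u in mine_sub_urls(page) if u not in crawled)
        if not next_frontier:
            break
        frontier = next_frontier
    return crawled
```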
In the webpage data crawling method provided by the above embodiment, when a webpage data crawling request is received, the first URL list to be processed is obtained according to the request and stored into the first preset storage path where the preset configuration file is located. A pre-constructed docker mirror image is read from the second preset storage path, and a plurality of application containers are generated from it. The configuration file and the first URL list are read from the first preset storage path, the first URL list is divided into a plurality of second URL lists according to the application containers and the configuration file, and the second URL lists are processed in parallel by multiple containers, with the server allocating system resources to the application containers. The webpage data corresponding to each second URL list are crawled and sent to the user terminal corresponding to the request. In this scheme, docker application containers created from a docker mirror image process the data in parallel; a docker application container avoids the resource waste caused by starting a full operating system and provides virtual-machine-like isolation at process-level cost. Under this framework, a user only needs to set the configuration file and build a mirror image file for the related programs, and crawling webpage data in parallel with a plurality of docker application containers allows the crawling work to be completed efficiently. Data verification of the crawled webpage data ensures its integrity, and deep mining of sub-URLs in the crawled webpage data ensures its comprehensiveness.
The invention also provides an electronic device. Fig. 2 is a schematic diagram of an electronic device 1 according to a preferred embodiment of the invention.
In the present embodiment, the electronic device 1 may be a terminal device having a data processing function, such as a smart phone, a tablet computer, a portable computer, and a desktop computer.
The electronic device 1 includes a memory 11, a processor 12, and a network interface 13.
The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic apparatus 1 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic apparatus 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic apparatus 1.
The memory 11 may be used not only to store application software installed in the electronic apparatus 1 and various types of data, such as the web page data crawling program 10, but also to temporarily store data that has been output or is to be output.
The processor 12 may be, in some embodiments, a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data processing chip for executing program code stored in the memory 11 or processing data, such as the webpage data crawling program 10.
The network interface 13 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), and is typically used to establish a communication link between the electronic apparatus 1 and other electronic devices.
Fig. 2 only shows the electronic device 1 with components 11-13, and it will be understood by a person skilled in the art that the structure shown in fig. 2 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
Optionally, the electronic device 1 may further comprise a user interface, the user interface may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface may further comprise a standard wired interface, a wireless interface.
Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch screen, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying information processed in the electronic apparatus 1 and for displaying a visualized user interface.
In the embodiment of the electronic device 1 shown in fig. 2, the memory 11 as a computer storage medium stores a web page data crawling program 10, and when the processor 12 executes the web page data crawling program 10 stored in the memory 11, the following steps are implemented:
A1, receiving a webpage data crawling request, acquiring a first Uniform Resource Locator (URL) list according to the webpage data crawling request, and storing the first URL list into a first preset storage path where a preset configuration file is located, wherein the first URL list comprises the URLs to be crawled;
the following description will describe an embodiment of the method of the present invention with an electronic device as an execution subject, where the electronic device serves as a server to establish a communication connection with a user terminal, receives a service data processing request sent by the user terminal, and processes service data according to the request. The electronic device may have a Central Processing Unit (CPU).
It can be understood that, before a webpage data crawling request sent by a user terminal is received and webpage data are crawled, a docker mirror image is configured on the electronic device. Specifically, the docker mirror image is created according to Dockerfile rules and includes a list partitioning program, a parallel processing program, a data verification program, a data merging program, and the like; the created docker mirror image is stored in a second preset storage path. After the docker mirror image is created, a plurality of application containers are created based on it. Each program can run independently inside an application container, and the application containers run independently of one another.
Further, before the step A1, the method further comprises the steps of:
receiving configuration parameters sent by a client, and acquiring from the configuration parameters the pre-configured number of parallel processes and the designated storage paths of the webpage data and of each program;
generating a configuration file according to the acquired number of parallel processes and the file path of each program, and storing the configuration file into the first preset storage path.
The number of parallel processes is adjusted according to the number of cores of the server's CPU and the share of CPU already occupied by data processing.
The first URL list contained in the webpage data crawling request is the original list of URLs to be crawled. When the webpage data crawling request is received, an index value corresponding to each URL is determined according to the information of each URL in the first URL list, the index value information corresponding to each URL is added to the first URL list, and the updated first URL list is stored into the first preset storage path, which may be a Redis database.
As an embodiment, the "index value corresponding to each URL in the first URL list" is obtained by:
acquiring and analyzing specific information of each URL in a first URL list, and determining characteristic information of each URL; and
and matching the corresponding index value for each URL in the first URL list according to the mapping relation between the characteristic information and the index value.
The characteristic information can be used for representing the type of the webpage, and the index value is used for calling the crawler program. The mapping relation between the characteristic information and the index value is obtained through the following steps:
acquiring a set of designated URLs, determining the characteristic information of each URL in the set, and labeling each URL with an index value; dividing the designated URLs in the set into subsets corresponding to the different index values; counting, within each subset, the proportion of each kind of characteristic information, and selecting the characteristic information with the largest proportion as the target characteristic information of the designated URLs in that subset; and determining the mapping relation between characteristic information and index values according to the target characteristic information of each subset and the index value corresponding to that subset.
Further, the method comprises the following steps:
when a URL that cannot be matched to an index value according to its characteristic information exists, generating prompt information based on that URL, and receiving a matching instruction for matching an index value to the URL.
The matching instruction includes the index value information for the URL that could not be matched according to its characteristic information. Such URLs are exported and fed back to a designated terminal, where the corresponding index values are determined manually.
A2, reading a pre-constructed docker mirror image from a second preset storage path, and generating a plurality of application containers according to the docker mirror image, wherein the application containers comprise: a first application container, a second application container, a third application container;
in this embodiment, the application container includes: a first application container, a plurality of second application containers, and a third application container.
A3, reading a first URL list and a configuration file from the first preset storage path, dividing the first URL list into a plurality of second URL lists based on the first application container, and storing the plurality of second URL lists into a third preset storage path;
Specifically, the first application container is run, the first URL list is acquired from the first preset storage path, the list dividing program is called from the second preset storage path, and the first URL list is divided equally into N URL sub-lists, that is, N second URL lists, where N is an integer greater than 1. Each second URL list includes a plurality of URLs and the index value information corresponding to each URL, and each second URL list is stored in the preset storage path for second URL lists, that is, the third preset storage path.
In this step, the number N of second URL lists equals the number of second application containers; for example, when there are 5 second application containers, N is 5 and the first URL list is divided into 5 second URL lists. Through this step, the list to be crawled and the corresponding program index values are divided and handed over.
Furthermore, the first application container reads the configuration file from the first preset storage path and allocates the CPU resources of the server to the plurality of second application containers so that they can execute the webpage data crawling operation in parallel. Data processing parameters are acquired from the configuration file, including the number N of parallel processes and the storage path of the crawled webpage data.
A4, crawling the webpage data corresponding to each URL in the second URL lists respectively based on the second application containers, and storing the webpage data into a fourth preset storage path;
The N second application containers run synchronously, and each second URL list corresponds to one second application container. Each second application container acquires its second URL list from the third preset storage path, calls in turn the crawler program corresponding to the index value of each URL in that list, performs the webpage data crawling operation, and stores the crawled webpage data into the designated storage path. The different second application containers share the same fourth preset storage path; that is, the webpage data crawled by different second application containers are stored in the same folder. Through this step, the crawling and aggregation of the webpage data are accomplished.
A5, extracting the webpage data from the fourth preset storage path based on the third application container, and sending the webpage data to a user terminal corresponding to the webpage data crawling request.
The third application container is run, the webpage data crawled by the N second application containers are read from the fourth preset storage path and sent to the user terminal corresponding to the webpage data crawling request, and prompt information is generated.
In other embodiments, to ensure the integrity of the crawled web page data, the crawled web page data needs to be verified. Specifically, when the web page data crawling program 10 is executed by the processor, before step A5, the following steps are further implemented:
comparing the quantity of the webpage data with the quantity of the URLs in the first URL list to determine a third URL list;
when the third URL list is not empty, performing webpage data crawling operation on each URL in the third URL list until the third URL list is empty, and storing webpage data corresponding to the third URL list into a fourth preset storage path;
when the third URL list is empty, the step A5 is continued.
Specifically, a fourth application container is generated based on the docker mirror image, and the fourth application container verifies the crawled webpage data. The third URL list comprises the URLs still to be crawled and the index value corresponding to each of them. It can be understood that one URL corresponds to one piece of webpage data. Suppose the number of URLs in the first URL list is counted as P and the number of pieces of webpage data in the fourth preset storage path as Q. When P = Q, the third URL list is empty: there is no unprocessed URL and no webpage data left to crawl. When P > Q, the third URL list contains P - Q URLs to be processed, that is, webpage data still to be crawled; for this third URL list, the fourth application container is run, the webpage data crawling operation is executed, and the result is merged with the webpage data crawled in step A4. It should be noted that if URLs still remain in the third URL list after the crawling of the third URL list has been executed a preset number of times (for example, 3 times), warning information is generated. These steps prevent webpage data from being missed and ensure the integrity of the webpage data.
Preferably, since the crawled webpage data may itself contain sub-URLs, the crawled webpage data can be mined in depth to determine the webpage data corresponding to those sub-URLs, thereby ensuring the comprehensiveness of the crawled webpage data. Specifically, when the webpage data crawling program 10 is executed by the processor, the following steps are also implemented before step A5:
respectively acquiring a second URL list and a URL regular mining program from a third preset storage path;
acquiring webpage data corresponding to each second URL list from a fourth preset storage path, mining the webpage data corresponding to each second URL list, determining a fourth URL list corresponding to each second URL list, and storing the fourth URL list into a fifth preset storage path;
performing a webpage data crawling operation on the fourth URL list, continuing the sub-URL mining operation on the webpage data corresponding to the fourth URL list to extract new sub-URLs, and crawling the webpage data for those sub-URLs, and so on in a loop.
Each second URL list corresponds to a fourth URL list, and the second application container that performs the webpage data crawling operation on a fourth URL list is the same container that handled the corresponding second URL list.
It should be understood that, to prevent new sub-URLs from being mined indefinitely, a depth threshold (the maximum number of rounds of sub-URL mining) is preset; when the number of mining rounds for a URL list exceeds this threshold, the mining of new sub-URLs is stopped.
The electronic device 1 provided in the above embodiment establishes a docker application container based on the docker image to perform data processing in parallel, the docker application container can save resource waste caused by starting an operating system, and provides an isolation capability similar to that of a virtual machine with process-level consumption, based on this framework, a user only needs to set a configuration file and generate a mirror image file for a related program, and crawls web page data in parallel by establishing a plurality of docker application containers, so that the web page data crawling operation can be efficiently completed; the integrity of the crawled webpage data is ensured by carrying out data verification on the crawled webpage data; the comprehensiveness of the webpage data is ensured by carrying out sub URL deep mining on the crawled webpage data.
Alternatively, in other embodiments, the web page data crawling program 10 may be divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to implement the present invention, where the module refers to a series of computer program instruction segments capable of performing a specific function. For example, referring to fig. 3, which is a schematic block diagram of the web page data crawling program 10 in fig. 2, in this embodiment, the web page data crawling program 10 may be divided into a receiving module 110, a container generating module 120, a list dividing module 130, a data crawling module 140 and a data sending module 150, the functions or operation steps implemented by the modules 110 to 150 are similar to those described above, and are not detailed here, for example, where:
the receiving module 110 is configured to receive a webpage data crawling request, obtain a first URL list according to the webpage data crawling request, store the first URL list into a first preset storage path where a preset configuration file is located, where the first URL list includes a URL to be crawled;
the container generating module 120 is configured to read a pre-constructed docker mirror image from a second preset storage path, and generate a plurality of application containers according to the docker mirror image, where the application containers include: a first application container, a second application container, a third application container; and
a list dividing module 130, configured to read a first URL list and a configuration file from the first preset storage path, divide the first URL list into a plurality of second URL lists based on the first application container, and store the plurality of second URL lists into a third preset storage path;
the data crawling module 140 is configured to crawl, based on the second application containers, web page data corresponding to each URL in the second URL lists, and store the web page data in a fourth preset storage path; and
and the data sending module 150 is configured to extract the webpage data from a fourth preset storage path based on the third application container, and send the webpage data to a user terminal corresponding to the webpage data crawling request.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a web page data crawling program 10, and when executed by a processor, the web page data crawling program 10 implements the following operations:
A1, receiving a webpage data crawling request, acquiring a first URL list according to the webpage data crawling request, wherein the first URL list comprises URLs to be crawled, and storing the first URL list into a first preset storage path where a preset configuration file is located;
A2, reading a pre-constructed docker mirror image from a second preset storage path, and generating a plurality of application containers according to the docker mirror image, wherein the application containers comprise: a first application container, a second application container, a third application container;
A3, reading a first URL list and a configuration file from the first preset storage path, dividing the first URL list into a plurality of second URL lists based on the first application container, and storing the plurality of second URL lists into a third preset storage path;
A4, crawling the webpage data corresponding to each URL in the second URL lists respectively based on the second application containers, and storing the webpage data into a fourth preset storage path; and
A5, extracting the webpage data from a fourth preset storage path based on the third application container, and sending the webpage data to a user terminal corresponding to the webpage data crawling request.
The specific implementation of the computer-readable storage medium of the present invention is substantially the same as the specific implementation of the above-mentioned web page data crawling method, and will not be described herein again.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another identical element in the process, apparatus, article, or method that comprises that element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A webpage data crawling method is applied to an electronic device and is characterized by comprising the following steps:
S1, receiving a webpage data crawling request, acquiring a first Uniform Resource Locator (URL) list according to the webpage data crawling request, wherein the first URL list comprises URLs to be crawled, and storing the first URL list into a first preset storage path where a preset configuration file is located;
S2, reading a pre-constructed docker mirror image from a second preset storage path, and generating a plurality of application containers according to the docker mirror image, wherein the application containers comprise: a first application container, a second application container, a third application container;
S3, reading a first URL list and a configuration file from the first preset storage path, dividing the first URL list into a plurality of second URL lists based on the first application container, and storing the second URL lists into a third preset storage path;
S4, crawling webpage data corresponding to each URL in the second URL lists respectively based on the second application containers, and storing the webpage data into a fourth preset storage path; and
S5, extracting the webpage data from a fourth preset storage path based on the third application container, and sending the webpage data to a user terminal corresponding to the webpage data crawling request.
2. The web page data crawling method according to claim 1, wherein before step S1, the method further comprises:
receiving configuration parameters sent by a client, and acquiring from the configuration parameters the pre-configured number of parallel processes and the designated storage paths of the webpage data and of each program; and
and generating a configuration file according to the number of the acquired parallel processes and the file path of each program, and storing the configuration file into a first preset storage path.
3. The web page data crawling method according to claim 1, wherein the first URL list further includes index value information corresponding to each URL, and the "index value information corresponding to each URL" is obtained through the following steps:
acquiring and analyzing specific information of each URL in a first URL list, and determining characteristic information of each URL; and
and matching the corresponding index value for each URL in the first URL list according to the mapping relation between the characteristic information and the index value.
4. The web page data crawling method according to claim 3, further comprising the steps of:
and when the URL which cannot be matched with the index value according to the characteristic information exists, generating prompt information based on the URL, and receiving a matching instruction for matching the index value with the URL.
5. The method for crawling web page data according to any one of claims 1 to 4, wherein before step S5, the method further comprises the following steps:
comparing the quantity of the webpage data with the quantity of the URLs in the first URL list to determine a third URL list;
when the third URL list is not empty, performing webpage data crawling operation on each URL in the third URL list until the third URL list is empty, and storing webpage data corresponding to the third URL list into a fourth preset storage path; and
when the third URL list is empty, the step S5 is continued.
6. The web page data crawling method according to claim 5, wherein before step S5, the method further comprises the following steps:
respectively acquiring a second URL list and a URL regular mining program from a third preset storage path;
acquiring webpage data corresponding to each second URL list from a fourth preset storage path, mining the webpage data corresponding to each second URL list, determining a fourth URL list corresponding to each second URL list, and storing the fourth URL list into a fifth preset storage path; and
and performing webpage data crawling operation on the fourth URL list, performing sub-URL mining operation on webpage data corresponding to the fourth URL list, extracting new sub-URLs, and performing webpage data crawling operation, thereby circulating.
7. An electronic device, comprising a memory and a processor, wherein the memory stores a webpage data crawling program that can run on the processor, and when the webpage data crawling program is executed by the processor, the following steps are implemented:
A1, receiving a webpage data crawling request, acquiring a first URL list according to the webpage data crawling request, wherein the first URL list comprises URLs to be crawled, and storing the first URL list into a first preset storage path where a preset configuration file is located;
A2, reading a pre-constructed docker mirror image from a second preset storage path, and generating a plurality of application containers according to the docker mirror image, wherein the application containers comprise: a first application container, a second application container, a third application container;
A3, reading a first URL list and a configuration file from the first preset storage path, dividing the first URL list into a plurality of second URL lists based on the first application container, and storing the plurality of second URL lists into a third preset storage path;
A4, crawling the webpage data corresponding to each URL in the second URL lists respectively based on the second application containers, and storing the webpage data into a fourth preset storage path; and
A5, extracting the webpage data from a fourth preset storage path based on the third application container, and sending the webpage data to a user terminal corresponding to the webpage data crawling request.
8. The electronic device of claim 7, wherein the web page data crawling program, when executed by the processor, further implements the following steps before step A5:
comparing the quantity of the webpage data with the quantity of the URLs in the first URL list to determine a third URL list;
when the third URL list is not empty, performing webpage data crawling operation on each URL in the third URL list until the third URL list is empty, and storing webpage data corresponding to the third URL list into a fourth preset storage path; and
when the third URL list is empty, the step A5 is continued.
9. The electronic device according to any one of claims 7 to 8, wherein the webpage data crawling program, when executed by the processor, further implements, before step A5, the following steps:
respectively acquiring a second URL list and a URL regular mining program from a third preset storage path;
acquiring webpage data corresponding to each second URL list from a fourth preset storage path, mining the webpage data corresponding to each second URL list, determining a fourth URL list corresponding to each second URL list, and storing the fourth URL list into a fifth preset storage path;
and performing webpage data crawling operation on the fourth URL list, continuing performing sub-URL mining operation on the webpage data corresponding to the fourth URL list, extracting new sub-URLs, and performing webpage data crawling operation, so as to circulate.
10. A computer-readable storage medium, comprising a web page data crawling program, wherein the web page data crawling program is capable of implementing the steps of the web page data crawling method according to any one of claims 1 to 6 when being executed by a processor.
CN201810791126.3A 2018-07-18 2018-07-18 Webpage data crawling method and device and storage medium Active CN110020060B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810791126.3A CN110020060B (en) 2018-07-18 2018-07-18 Webpage data crawling method and device and storage medium
PCT/CN2018/108218 WO2020015192A1 (en) 2018-07-18 2018-09-28 Webpage data crawling method and apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810791126.3A CN110020060B (en) 2018-07-18 2018-07-18 Webpage data crawling method and device and storage medium

Publications (2)

Publication Number Publication Date
CN110020060A CN110020060A (en) 2019-07-16
CN110020060B true CN110020060B (en) 2023-03-14

Family

ID=67188354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810791126.3A Active CN110020060B (en) 2018-07-18 2018-07-18 Webpage data crawling method and device and storage medium

Country Status (2)

Country Link
CN (1) CN110020060B (en)
WO (1) WO2020015192A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110888655A (en) * 2019-11-14 2020-03-17 中国民航信息网络股份有限公司 Application publishing method and device
CN113392301A (en) * 2021-06-08 2021-09-14 北京精准沟通传媒科技股份有限公司 Method, device, medium and electronic equipment for crawling data
CN116361362B (en) * 2023-05-30 2023-08-11 江西顶易科技发展有限公司 User information mining method and system based on webpage content identification

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7676553B1 (en) * 2003-12-31 2010-03-09 Microsoft Corporation Incremental web crawler using chunks
KR20120042529A (en) * 2010-10-25 2012-05-03 삼성전자주식회사 Method and apparatus for crawling web page
CN103970788A (en) * 2013-02-01 2014-08-06 北京英富森信息技术有限公司 Webpage-crawling-based crawler technology
CN106101176B (en) * 2016-05-27 2019-04-12 成都索贝数码科技股份有限公司 One kind is integrated to melt media cloud production delivery system and method
CN106484886A (en) * 2016-10-17 2017-03-08 金蝶软件(中国)有限公司 A kind of method of data acquisition and its relevant device
CN108197633A (en) * 2017-11-24 2018-06-22 百年金海科技有限公司 Deep learning image classification based on TensorFlow is with applying dispositions method
CN108062413B (en) * 2017-12-30 2019-05-28 平安科技(深圳)有限公司 Web data processing method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
WO2020015192A1 (en) 2020-01-23
CN110020060A (en) 2019-07-16

Similar Documents

Publication Publication Date Title
CN107992356B (en) Block chain transaction block processing method, electronic device and readable storage medium
CN110020060B (en) Webpage data crawling method and device and storage medium
WO2019227715A1 (en) Data processing method and apparatus, and computer-readable storage medium
CN110737659A (en) Graph data storage and query method, device and computer readable storage medium
CN108874924B (en) Method and device for creating search service and computer-readable storage medium
CN107656729B (en) List view updating apparatus, method and computer-readable storage medium
CN107463563B (en) Information service processing method and device of browser
CN107798030B (en) Splitting method and device of data table
WO2020015170A1 (en) Interface invoking method and apparatus, and computer-readable storage medium
CN112416458A (en) Preloading method and device based on ReactNative, computer equipment and storage medium
CN110738049A (en) Similar text processing method and device and computer readable storage medium
US20120166412A1 (en) Super-clustering for efficient information extraction
CN112650909A (en) Product display method and device, electronic equipment and storage medium
CN110727425A (en) Electronic device, form data verification method and computer-readable storage medium
CN110764913A (en) Data calculation method based on rule calling, client and readable storage medium
CN112698962A (en) Data processing method and device, electronic equipment and storage medium
CN113126980A (en) Page generation method and device and electronic equipment
CN107729341B (en) Electronic device, information inquiry control method, and computer-readable storage medium
CN111177600A (en) Built-in webpage loading method and device based on mobile application
CN111158777A (en) Component calling method and device and computer readable storage medium
US10089369B2 (en) Searching method, searching apparatus and device
CN113467823B (en) Configuration information acquisition method, device, system and storage medium
CN111221917B (en) Intelligent partition storage method, intelligent partition storage device and computer readable storage medium
CN110688223B (en) Data processing method and related product
CN112783578A (en) Method, device and equipment for starting task flow and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant