CN110020060A - Web data crawling method, device and storage medium - Google Patents
Web data crawling method, device and storage medium
- Publication number: CN110020060A (application CN201810791126.3A)
- Authority: CN (China)
- Prior art keywords: url, web data, url list, list, store path
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/951 — Indexing; Web crawling techniques
- G06F16/955 — Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The present invention provides a web data crawling method. On receiving a web data crawl request, the method reads a first URL list, generates multiple application containers from a pre-built docker image, divides the first URL list into multiple second URL lists, and crawls the web data corresponding to every URL in the second URL lists in parallel. The method then extracts the web data and sends it to the user terminal corresponding to the web data crawl request. The present invention also provides an electronic device and a computer storage medium. With the present invention, web data crawling efficiency can be improved.
Description
Technical field
The present invention relates to the field of data processing, and in particular to a web data crawling method, an electronic device and a computer-readable storage medium.
Background art
In the prior art, the traditional way to run multiple tasks on one server is to start multiple virtual machines and run different tasks on different virtual machines. Traditional virtualization mostly relies on VMware-style virtual machines, and running such a virtual machine requires booting an entire operating system, which occupies a large amount of system resources.
However, a server's CPU, memory, network and disk resources are all limited. Taking web data crawling as an example, the distribution of crawlers is currently limited by the number of machines, CPUs, threads and processes. When the virtual machines started on a server consume too many resources, system resources cannot be fully utilized, which in turn reduces web data crawling efficiency.
Summary of the invention
In view of the foregoing, the present invention provides a web data crawling method, a server and a computer-readable storage medium, whose main purpose is to improve web data crawling efficiency.
To achieve the above object, the present invention provides a web data crawling method comprising:
S1. Receive a web data crawl request and read, according to the request, a first URL (Uniform Resource Locator) list containing the URLs to be crawled; store the first URL list into a first preset store path where a preset configuration file resides;
S2. Read a pre-built docker image from a second preset store path and generate multiple application containers from the docker image, the application containers comprising a first application container, second application containers and a third application container;
S3. Read the first URL list and the configuration file from the first preset store path, divide the first URL list into multiple second URL lists using the first application container, and store the multiple second URL lists into a third preset store path;
S4. Crawl, using the multiple second application containers, the web data corresponding to every URL in the multiple second URL lists, and save the web data into a fourth preset store path; and
S5. Extract the web data from the fourth preset store path using the third application container, and send the web data to the user terminal corresponding to the web data crawl request.
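Under illustrative assumptions, the five steps above can be condensed into one orchestration function; `crawl_url` is a hypothetical stand-in for the container-based crawling programs, and the preset store paths are replaced by in-memory structures:

```python
from typing import Callable, Dict, List


def run_pipeline(first_url_list: List[str],
                 n_sublists: int,
                 crawl_url: Callable[[str], str]) -> Dict[str, str]:
    """Sketch of steps S1-S5: split the first URL list, crawl each
    sublist (in parallel in the real design), and merge the results."""
    # S3: divide the first URL list into n roughly equal second URL lists
    sublists = [first_url_list[i::n_sublists] for i in range(n_sublists)]
    results: Dict[str, str] = {}
    # S4: each "second application container" crawls its own sublist
    for sublist in sublists:
        for url in sublist:
            results[url] = crawl_url(url)
    # S5: the merged web data is returned to the requesting client
    return results
```

In the patent's design the sublists are handled by separate docker containers rather than one loop; the sequential version only illustrates the data flow.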
Preferably, before step S1, the method further comprises:
receiving configuration parameters sent by a client, and obtaining from them the pre-configured number of concurrent processes and the specified store paths of the web data and of each program; and
generating a configuration file from the obtained number of concurrent processes and the file path of each program, and storing the configuration file into the first preset store path.
Preferably, the first URL list further contains an index value for every URL, the index value of every URL being obtained as follows:
obtaining and analyzing the specific information of every URL in the first URL list to determine the feature information of every URL; and
matching every URL in the first URL list with its index value according to a mapping between feature information and index values.
Preferably, the method further comprises:
when a URL exists whose index value cannot be matched from its feature information, generating a prompt based on that URL, and receiving a matching instruction that assigns an index value to the URL.
Preferably, before step S5, the method further comprises:
comparing the number of crawled web pages with the number of URLs in the first URL list to determine a third URL list;
when the third URL list is not empty, performing the web data crawl operation for every URL in the third URL list until the third URL list is empty, and saving the web data corresponding to the third URL list into the fourth preset store path; and
when the third URL list is empty, proceeding to step S5.
Preferably, before step S5, the method further comprises:
obtaining the second URL lists and a URL regular-expression mining program from the third preset store path;
obtaining the web data corresponding to each second URL list from the fourth preset store path, mining that web data to determine a fourth URL list corresponding to each second URL list, and saving the fourth URL lists into a fifth preset store path; and
performing the web data crawl operation on the fourth URL lists, performing the sub-URL mining operation on the web data corresponding to the fourth URL lists to extract new sub-URLs, and crawling again, repeating this cycle.
In addition, the present invention provides an electronic device comprising a memory and a processor, the memory storing a web data crawling program runnable on the processor which, when executed by the processor, implements any of the steps of the web data crawling method described above.
In addition, to achieve the above object, the present invention provides a computer-readable storage medium containing a web data crawling program which, when executed by a processor, implements any of the steps of the web data crawling method described above.
The web data crawling method, electronic device and computer-readable storage medium proposed by the present invention build docker application containers from a docker image to process data in parallel. A docker application container avoids the resource waste of booting an entire operating system and provides virtual-machine-like isolation at process-level cost. Under this framework, a user only needs to set the configuration file and package the relevant programs into an image file; by creating multiple docker application containers that crawl web data in parallel, the crawling work can be completed efficiently. Verifying the crawled web data guarantees its completeness, and deep sub-URL mining of the crawled web data guarantees its comprehensiveness.
Brief description of the drawings
Fig. 1 is the flow chart of the preferred embodiment of the web data crawling method of the present invention;
Fig. 2 is the schematic diagram of the preferred embodiment of the electronic device of the present invention;
Fig. 3 is the schematic diagram of the program modules of the web data crawling program in Fig. 2.
The realization of the object, the functions and the advantages of the present invention will be further described with reference to the embodiments and the accompanying drawings.
Detailed description
It should be appreciated that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
The present invention provides a kind of web data crawling method.Referring to Fig.1 shown in, be web data crawling method of the present invention compared with
The flow chart of good embodiment.This method can be executed by a device, which can be by software and or hardware realization.
In the present embodiment, the web data crawling method based on docker includes step S1-S5:
S1. Receive a web data crawl request and read, according to the request, a first URL list containing the URLs to be crawled; store the first URL list into the first preset store path where the preset configuration file resides;
In the following, embodiments of the method of the present invention are described with an electronic device as the executing subject. The electronic device acts as a server, establishes a communication connection with a user terminal, receives the service data processing requests the user terminal sends, and processes the service data according to the requests. The electronic device may have a multi-core CPU (Central Processing Unit).
It is understood that before receiving the web data crawl request sent by the user terminal, a docker image has been configured on the electronic device. Specifically, the docker image is created from dockerfile rules and contains a list partitioning program, a concurrent processing program, a data verification program, a data merging program and the like; the created docker image is saved into the second preset store path. After the docker image has been created, multiple application containers are created from it. Each program can run independently in an application container, and the application containers run independently of one another.
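As an illustration only, the container layout described above could be launched with `docker run` commands such as those built below; the image name and the program names inside the image are hypothetical, not taken from the patent:

```python
from typing import List


def container_commands(image: str, n_workers: int) -> List[List[str]]:
    """Build the `docker run` command lines implied by the description:
    one splitter container, n crawler containers, one merger container.
    The image name and the in-image program names are invented examples."""
    cmds = [["docker", "run", "--rm", image, "split_urls.py"]]
    for i in range(n_workers):
        # one "second application container" per second URL list
        cmds.append(["docker", "run", "--rm", image,
                     "crawl.py", "--sublist", str(i)])
    cmds.append(["docker", "run", "--rm", image, "merge_results.py"])
    return cmds
```

In practice the commands would be executed (e.g. via `subprocess.run`) with volume mounts for the shared store paths; the function only shows the shape of the launch.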
In addition, before step S1, the method further comprises:
receiving configuration parameters sent by a client, and obtaining from them the pre-configured number of concurrent processes and the specified store paths of the web data and of each program; and
generating a configuration file from the obtained number of concurrent processes and the file path of each program, and storing the configuration file into the first preset store path.
The number of processes is adjusted according to the size of the server's multi-core CPU and the CPU load the data processing occupies.
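As an illustrative sketch of this configuration step, the following builds the configuration file content and caps the concurrent-process count at the machine's core count; all keys and paths are invented for the example:

```python
import json
import os
from typing import Dict


def build_config(requested_processes: int,
                 program_paths: Dict[str, str]) -> str:
    """Sketch of the configuration file: the concurrent-process count is
    capped at the number of CPU cores, as the description suggests."""
    cores = os.cpu_count() or 1
    config = {
        "concurrent_processes": max(1, min(requested_processes, cores)),
        "program_paths": program_paths,
        "data_store_path": "/data/pages",  # hypothetical fourth store path
    }
    return json.dumps(config, indent=2)
```

The JSON string would then be written into the first preset store path alongside the first URL list.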
The first URL list contained in the web data crawl request is the original URL list to be crawled. On receiving the crawl request, the index value of every URL is determined from the information of every URL in the first URL list, the index values are written into the first URL list, and the updated first URL list is stored into the first preset store path, which may be a Redis database.
As one implementation, the index value of every URL in the first URL list is obtained as follows:
obtaining and analyzing the specific information of every URL in the first URL list to determine the feature information of every URL; and
matching every URL in the first URL list with its index value according to the mapping between feature information and index values.
The feature information can characterize the type of a web page, and the index value is used to select the crawling program. The mapping between feature information and index values is obtained as follows: obtain a set of specified URLs, determine the feature information of every URL in the set, and label every URL with an index value; divide the specified URLs into subsets, one per index value; in each subset, count the proportion of each distinct feature and select the most frequent feature as the target feature of the URLs in that subset; and from each subset's target feature and index value, derive the mapping between feature information and index values.
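A minimal sketch of this majority-vote construction, with a toy feature extractor (the URL's last dot-separated component) standing in for the real feature analysis:

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_mapping(labelled_urls: List[Tuple[str, int]]) -> Dict[str, int]:
    """For each index value, pick the most common feature among the URLs
    labelled with it; that feature then maps to the index value.
    `feature_of` is a hypothetical stand-in for the real extractor."""
    def feature_of(url: str) -> str:
        return url.rsplit(".", 1)[-1]  # toy feature: the URL's suffix

    # group feature counts by index value (one Counter per subset)
    by_index: Dict[int, Counter] = {}
    for url, index in labelled_urls:
        by_index.setdefault(index, Counter())[feature_of(url)] += 1
    # the most frequent feature in each subset becomes the target feature
    mapping: Dict[str, int] = {}
    for index, counts in by_index.items():
        top_feature, _ = counts.most_common(1)[0]
        mapping[top_feature] = index
    return mapping
```

Matching a new URL then amounts to extracting its feature and looking it up in the returned dict; URLs whose feature is absent fall through to the manual-matching path described next.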
Further, the method further comprises:
when a URL exists whose index value cannot be matched from its feature information, generating a prompt based on that URL, and receiving a matching instruction that assigns an index value to the URL.
The matching instruction contains the index value of the URL whose index value could not be matched from its feature information. The URLs that cannot be matched are exported and fed back to a designated terminal, where their index values are determined manually.
S2. Read the pre-built docker image from the second preset store path and generate multiple application containers from it, the application containers comprising a first application container, second application containers and a third application container.
In the present embodiment, the application containers comprise one first application container, multiple second application containers and one third application container.
S3. Read the first URL list and the configuration file from the first preset store path, divide the first URL list into multiple second URL lists using the first application container, and store the second URL lists into the third preset store path.
Specifically, the first application container is run: it obtains the first URL list from the first preset store path, calls the list partitioning program from the second preset store path, and divides the first URL list evenly into N URL sublists, that is, N second URL lists, N being an integer greater than 1. Each second URL list contains multiple URLs together with the index value of every URL, and each second URL list is stored into the preset store path for second URL lists, that is, the third preset store path.
In this step, the number N of second URL lists equals the number of second application containers; for example, when there are 5 second application containers, N is 5 and the first URL list is divided into 5 second URL lists. This step realizes the division and dispatch of the list to be crawled together with the corresponding program index values.
Further, the first application container calls the configuration file from the second preset store path and distributes the server's CPU resources to the second application containers, so that the second application containers perform the web data crawl operations in parallel. Data processing parameters are obtained from the configuration file, including the number N of concurrent processes and the store path of the crawled web data.
S4. Crawl, using the multiple second application containers, the web data corresponding to every URL in the multiple second URL lists, and save the web data into the fourth preset store path.
The N second application containers run synchronously, one second application container per second URL list. Each second application container obtains its second URL list from the third preset store path and, according to the index value of every URL in that list, calls the crawling program corresponding to the index value, performs the web data crawl operation, and saves the crawled web data into the specified store path. The different second application containers share the same fourth preset store path, that is, the web data they crawl is saved into the same folder. This step realizes the crawling and aggregation of the web data.
S5. Extract the web data from the fourth preset store path using the third application container, and send the web data to the user terminal corresponding to the web data crawl request.
The third application container is run: it reads from the fourth preset store path the web data crawled by the N second application containers, sends the web data to the user terminal corresponding to the crawl request, and generates a prompt.
In other embodiments, to guarantee the completeness of the crawled web data, the crawled web data needs to be verified. Specifically, before step S5, the method further comprises:
comparing the number of crawled web pages with the number of URLs in the first URL list to determine a third URL list;
when the third URL list is not empty, performing the web data crawl operation for every URL in the third URL list until the third URL list is empty, and saving the web data corresponding to the third URL list into the fourth preset store path; and
when the third URL list is empty, proceeding to step S5.
Specifically, a fourth application container is generated from the docker image, and the fourth application container verifies the crawled web data. The third URL list contains the URLs still to be crawled together with their index values. Each URL corresponds to one web page; suppose the number of URLs in the first URL list is P and the number of web pages in the fourth preset store path is Q. When P = Q, the third URL list is empty, that is, there are no unprocessed URLs and no web data remains to be crawled. When P > Q, the third URL list contains P - Q URLs to be processed, that is, web data remains to be crawled; for that third URL list, the fourth application container is run, the web data crawl operation is performed, and the web data crawled in this step is merged with the web data crawled in step S4. It should be noted that when, after a preset number of attempts (for example, 3) on the URLs in the third URL list, some URLs still remain in the third URL list, a warning is generated. This step prevents web data from being omitted and guarantees the completeness of the web data.
Preferably, the crawled web data may contain sub-URLs. To guarantee the comprehensiveness of the crawled web data, the first URL list can also be mined in depth to determine the web data corresponding to the sub-URLs contained in the crawled pages. Specifically, before step S5, the method further comprises:
obtaining the second URL lists and the URL regular-expression mining program from the third preset store path;
obtaining the web data corresponding to each second URL list from the fourth preset store path, mining that web data to determine a fourth URL list corresponding to each second URL list, and saving the fourth URL lists into a fifth preset store path; and
performing the web data crawl operation on the fourth URL lists, continuing the sub-URL mining operation on the web data corresponding to the fourth URL lists to extract new sub-URLs, and crawling again, repeating this cycle.
For each second URL list and its corresponding fourth URL list, the second application container that performs the web data crawl operation on the fourth URL list is the same container that crawled the corresponding second URL list.
It is to be appreciated that the case where depth excavates new sub- URL always in order to prevent, presets a depth threshold
(indicating that depth excavates the number of sub- URL) stops digging when the number that depth excavates url list is more than preset depth threshold
The operation of the new sub- URL of pick.
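A sketch of the depth-limited sub-URL mining, with hypothetical `fetch` and `extract_links` callbacks standing in for the crawling program and the URL regular-expression mining program:

```python
from typing import Callable, Dict, List, Set


def mine_sub_urls(seed_urls: List[str],
                  fetch: Callable[[str], str],
                  extract_links: Callable[[str], List[str]],
                  depth_threshold: int) -> Dict[str, str]:
    """Breadth-first sub-URL mining: crawl each level, extract the
    sub-URLs found in the pages, and stop once `depth_threshold`
    levels beyond the seeds have been mined."""
    pages: Dict[str, str] = {}
    seen: Set[str] = set(seed_urls)
    frontier = list(seed_urls)
    for _ in range(depth_threshold + 1):  # seeds + threshold mined levels
        if not frontier:
            break
        next_frontier: List[str] = []
        for url in frontier:
            page = fetch(url)
            pages[url] = page
            for sub in extract_links(page):
                if sub not in seen:  # never crawl the same URL twice
                    seen.add(sub)
                    next_frontier.append(sub)
        frontier = next_frontier
    return pages
```

The `seen` set plays the role of deduplication across cycles, and the loop bound enforces the depth threshold from the description.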
With the web data crawling method proposed by the above embodiment, on receiving a web data crawl request, the first URL list to be processed is obtained according to the request and stored into the first preset store path where the preset configuration file resides; the pre-built docker image is read from the second preset store path and multiple application containers are generated from it; the configuration file and the first URL list are read from the first preset store path; the first URL list is divided into multiple second URL lists according to the application containers and the configuration file; the second URL lists are processed by multiple containers in parallel, the server distributing system resources to the parallel application containers; and the web data corresponding to each second URL list is crawled and sent to the user terminal corresponding to the request. The solution of the present invention builds docker application containers from a docker image to process data in parallel. A docker application container avoids the resource waste of booting an entire operating system and provides virtual-machine-like isolation at process-level cost. Under this framework, a user only needs to set the configuration file and package the relevant programs into an image file; by creating multiple docker application containers that crawl web data in parallel, the crawling work can be completed efficiently. Verifying the crawled web data guarantees its completeness, and deep sub-URL mining of the crawled web data guarantees its comprehensiveness.
The present invention also provides a kind of electronic devices.Referring to shown in Fig. 2, for showing for 1 preferred embodiment of electronic device of the present invention
It is intended to.
In the present embodiment, electronic device 1 can be smart phone, tablet computer, portable computer, desktop PC
Etc. terminal device having data processing function.
The electronic device 1 comprises a memory 11, a processor 12 and a network interface 13.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, optical disc and the like. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as its hard disk. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card equipped on the electronic device 1. Further, the memory 11 may include both the internal storage unit and an external storage device of the electronic device 1.
The memory 11 can be used not only to store the application software installed on the electronic device 1 and various kinds of data, such as the web data crawling program 10, but also to temporarily store data that has been or will be output.
In some embodiments, the processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data processing chip, used to run the program code stored in the memory 11 or to process data, for example to run the web data crawling program 10.
The network interface 13 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the electronic device 1 and other electronic devices.
Fig. 2 shows only the electronic device 1 with components 11-13. Those skilled in the art will appreciate that the structure shown in Fig. 2 does not limit the electronic device 1, which may include fewer or more components than illustrated, combinations of certain components, or a different arrangement of components.
Optionally, the electronic device 1 may also include a user interface. The user interface may include a display and an input unit such as a keyboard, and optionally a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-control liquid crystal display, an Organic Light-Emitting Diode (OLED) touch device and the like. The display, which may also be called a display screen or display unit, is used to display the information processed in the electronic device 1 and to present a visual user interface.
In the embodiment of the electronic device 1 shown in Fig. 2, the memory 11, as a kind of computer storage medium, stores the web data crawling program 10. When the processor 12 executes the web data crawling program 10 stored in the memory 11, the following steps are implemented:
A1. Receive a web data crawl request and read, according to the request, a first URL (Uniform Resource Locator) list containing the URLs to be crawled; store the first URL list into the first preset store path where the preset configuration file resides;
In the following, embodiments of the method of the present invention are described with the electronic device as the executing subject. The electronic device acts as a server, establishes a communication connection with a user terminal, receives the service data processing requests the user terminal sends, and processes the service data according to the requests. The electronic device may have a multi-core CPU (Central Processing Unit).
It is understood that before receiving the web data crawl request sent by the user terminal, a docker image has been configured on the electronic device. Specifically, the docker image is created from dockerfile rules and contains a list partitioning program, a concurrent processing program, a data verification program, a data merging program and the like; the created docker image is saved into the second preset store path. After the docker image has been created, multiple application containers are created from it. Each program can run independently in an application container, and the application containers run independently of one another.
In addition, before step A1, the method further comprises:
receiving configuration parameters sent by a client, and obtaining from them the pre-configured number of concurrent processes and the specified store paths of the web data and of each program; and
generating a configuration file from the obtained number of concurrent processes and the file path of each program, and storing the configuration file into the first preset store path.
The number of processes is adjusted according to the size of the server's multi-core CPU and the CPU load the data processing occupies.
The first URL list contained in the web data crawl request is the original URL list to be crawled. On receiving the crawl request, the index value of every URL is determined from the information of every URL in the first URL list, the index values are written into the first URL list, and the updated first URL list is stored into the first preset store path, which may be a Redis database.
As one implementation, the index value of every URL in the first URL list is obtained as follows:
obtaining and analyzing the specific information of every URL in the first URL list to determine the feature information of every URL; and
matching every URL in the first URL list with its index value according to the mapping between feature information and index values.
The characteristic information characterizes the type of the web page, and the index value is used to invoke the corresponding crawler program. The mapping relations between characteristic information and index values are obtained through the following steps:
Obtaining a set of specified URLs, determining the characteristic information of each URL in the set, and marking an index value for each URL; dividing the specified URLs in the set into subsets corresponding to the different index values; counting, within each subset, the proportion of each piece of characteristic information, and selecting the characteristic information with the largest proportion as the target characteristic information of the URLs in that subset; and determining the mapping relations between characteristic information and index values according to the target characteristic information of each subset and the index value corresponding to that subset.
Further, the method also comprises the following steps:
When there is a URL for which no index value can be matched according to its characteristic information, generating prompt information based on that URL, and receiving a matching instruction that assigns an index value to the URL.
The matching instruction contains the index value corresponding to the URL that could not be matched automatically. The unmatched URLs are exported and fed back to a designated terminal, where the corresponding index values are determined manually.
A2. Reading the pre-built Docker image from the second preset store path and generating multiple application containers from the Docker image, the application containers comprising a first application container, second application containers, and a third application container.
In this embodiment, the application containers comprise one first application container, multiple second application containers, and one third application container.
A3. Reading the first URL list and the configuration file from the first preset store path, dividing the first URL list into multiple second URL lists by means of the first application container, and storing the multiple second URL lists into the third preset store path.
Specifically, the first application container is run, the first URL list is obtained from the first preset store path, the list division program is called from the second preset store path, and the first URL list is divided evenly into N URL sublists, that is, N second URL lists, where N is an integer greater than 1. Each second URL list contains multiple URLs and the index value corresponding to each URL, and each second URL list is stored into the preset store path for second URL lists, that is, the third preset store path.
In this step, the number of second URL lists equals the number of second application containers. For example, when there are 5 second application containers, N is 5, meaning the first URL list is divided into 5 second URL lists. This step divides the list to be crawled and dispatches the corresponding program index values.
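The even division into N sublists might look like the following round-robin sketch, in which each entry carries its URL together with its index value; the round-robin scheme is an assumption, since the patent only says the list is divided evenly.

```python
def split_url_list(first_url_list, n):
    # Evenly divide the first URL list, a sequence of (url, index_value)
    # pairs, into n second URL lists round-robin, as the first application
    # container does.
    second_url_lists = [[] for _ in range(n)]
    for i, entry in enumerate(first_url_list):
        second_url_lists[i % n].append(entry)
    return second_url_lists
```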
Further, the first application container calls the configuration file from the second preset store path and allocates the server's CPU resources to the second application containers, so that the multiple second application containers perform web data crawl operations in parallel. The data processing parameters obtained from the configuration file include the number N of concurrent processes and the store path for the crawled web data.
A4. Crawling, by the multiple second application containers, the web data corresponding to each URL in the multiple second URL lists, and saving the web data into the fourth preset store path.
The N second application containers run concurrently, each corresponding to one second URL list. Each second application container obtains its second URL list from the third preset store path, calls in turn the crawler program corresponding to the index value of each URL in that list, performs the web data crawl operation, and saves the crawled web data into the specified store path. The different second application containers share the same fourth preset store path, that is, the web data crawled by the different second application containers is saved into the same file. This step performs the crawling and aggregation of the web data.
A5. Extracting the web data from the fourth preset store path by means of the third application container, and sending the web data to the user terminal corresponding to the web data crawl request.
The third application container is run, the web data crawled by the N second application containers is read from the fourth preset store path, the web data is sent to the user terminal corresponding to the web data crawl request, and prompt information is generated.
In other embodiments, in order to guarantee the integrity of the crawled web data, the crawled web data needs to be verified. Specifically, when the web data crawl program 10 is executed by the processor, the following steps are also realized before step A5:
Comparing the quantity of web data with the quantity of URLs in the first URL list to determine a third URL list;
When the third URL list is not empty, performing the web data crawl operation for each URL in the third URL list until the third URL list is empty, and saving the web data corresponding to the third URL list into the fourth preset store path; and
When the third URL list is empty, continuing to step A5.
Specifically, a fourth application container is generated from the Docker image, and the fourth application container verifies the crawled web data. The third URL list contains the URLs still to be crawled and their corresponding index values. Each URL corresponds to one piece of web data; suppose the counted quantity of URLs in the first URL list is P and the quantity of web data in the fourth preset store path is Q. When P = Q, the third URL list is empty, that is, there are no unprocessed URLs and no web data remains to be crawled. When P > Q, the third URL list contains P - Q URLs, and these P - Q URLs are the URLs still to be processed, that is, web data remains to be crawled. For this third URL list, the fourth application container is run, the web data crawl operation is performed, and the web data crawled in this step is merged with the web data crawled in step A4. It should be noted that when, after a preset number of attempts (for example, 3) on the URLs in the third URL list, some URLs still remain in the third URL list, warning information is generated. This step prevents omission of web data and guarantees the integrity of the web data.
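A sketch of this verify-and-retry logic, assuming a `crawl_fn` that may fail transiently (returning `None` on failure); the third URL list is recomputed each round as the difference between the P requested URLs and the Q stored pages, and retries stop after a preset number of rounds, after which a non-empty leftover would trigger the warning described above.

```python
def verify_and_retry(first_url_list, crawled_pages, crawl_fn, max_rounds=3):
    # Compare P (URLs requested) with Q (pages stored); the difference is
    # the third URL list. Re-crawl it each round until it is empty or the
    # preset number of rounds is exhausted.
    pages = dict(crawled_pages)
    for _ in range(max_rounds):
        third_url_list = [u for u in first_url_list if u not in pages]
        if not third_url_list:
            break
        for url in third_url_list:
            result = crawl_fn(url)
            if result is not None:
                pages[url] = result
    # A non-empty leftover corresponds to the warning-information case.
    leftover = [u for u in first_url_list if u not in pages]
    return pages, leftover
```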
Preferably, there is a case in which the crawled web data contains sub-URLs. In order to guarantee the comprehensiveness of the crawled web data, depth mining may also be performed on the first URL list to determine the web data corresponding to the sub-URLs contained in the web data. Specifically, when the web data crawl program 10 is executed by the processor, the following steps are also realized before step A5:
Obtaining the second URL lists and the URL regular-expression mining program from the third preset store path;
Obtaining the web data corresponding to each second URL list from the fourth preset store path, mining the web data corresponding to each second URL list to determine the fourth URL list corresponding to each second URL list, and saving the fourth URL lists into the fifth preset store path; and
Performing the web data crawl operation on the fourth URL lists, continuing to perform the sub-URL mining operation on the web data corresponding to the fourth URL lists, extracting new sub-URLs and performing the web data crawl operation on them, and repeating this cycle.
For the fourth URL list corresponding to each second URL list, the second application container that performs the web data crawl operation on the fourth URL list is the same second application container that handled the corresponding second URL list.
It should be appreciated that, to prevent depth mining from extracting new sub-URLs indefinitely, a depth threshold (indicating the number of rounds of sub-URL depth mining) is preset; when the number of rounds of depth mining exceeds the preset depth threshold, the mining of new sub-URLs is stopped.
The electronic device 1 proposed in the above embodiment establishes Docker application containers based on a Docker image to perform data processing in parallel. Docker application containers avoid the resource waste of booting an operating system and provide virtual-machine-like isolation at process-level cost. Based on this framework, the user only needs to set the configuration file and package the relevant programs into an image file; by establishing multiple Docker application containers to crawl web data concurrently, the web data crawl work can be completed efficiently. By verifying the crawled web data, the integrity of the crawled web data is guaranteed; by performing sub-URL depth mining on the crawled web data, the comprehensiveness of the web data is guaranteed.
Optionally, in other embodiments, the web data crawl program 10 can also be divided into one or more modules, the one or more modules being stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the present invention. A module in the present invention refers to a series of computer program instruction segments that complete a specific function. For example, referring to Fig. 3, a module diagram of the web data crawl program 10 of Fig. 2, in this embodiment the web data crawl program 10 can be divided into a receiving module 110, a container generation module 120, a list division module 130, a data crawl module 140, and a data transmission module 150. The functions or operation steps realized by the modules 110-150 are similar to those described above and will not be detailed here. Illustratively:
The receiving module 110 is used to receive a web data crawl request and obtain the first URL list of the request, the first URL list containing the URLs to be crawled, and to store the first URL list into the first preset store path where the preset configuration file resides;
The container generation module 120 is used to read the pre-built Docker image from the second preset store path and generate multiple application containers from the Docker image, the application containers comprising the first application container, the second application containers, and the third application container;
The list division module 130 is used to read the first URL list and the configuration file from the first preset store path, divide the first URL list into multiple second URL lists by means of the first application container, and store the multiple second URL lists into the third preset store path;
The data crawl module 140 is used to crawl, by the multiple second application containers, the web data corresponding to each URL in the multiple second URL lists, and save the web data into the fourth preset store path; and
The data transmission module 150 is used to extract the web data from the fourth preset store path by means of the third application container and send the web data to the user terminal corresponding to the web data crawl request.
In addition, an embodiment of the present invention also proposes a computer-readable storage medium containing the web data crawl program 10. When the web data crawl program 10 is executed by a processor, the following operations are realized:
A1. Receiving a web data crawl request and obtaining the first URL list of the request, the first URL list containing the URLs to be crawled, and storing the first URL list into the first preset store path where the preset configuration file resides;
A2. Reading the pre-built Docker image from the second preset store path and generating multiple application containers from the Docker image, the application containers comprising a first application container, second application containers, and a third application container;
A3. Reading the first URL list and the configuration file from the first preset store path, dividing the first URL list into multiple second URL lists by means of the first application container, and storing the multiple second URL lists into the third preset store path;
A4. Crawling, by the multiple second application containers, the web data corresponding to each URL in the multiple second URL lists, and saving the web data into the fourth preset store path; and
A5. Extracting the web data from the fourth preset store path by means of the third application container, and sending the web data to the user terminal corresponding to the web data crawl request.
The specific embodiments of the computer-readable storage medium of the present invention are substantially the same as the specific embodiments of the above web data crawling method and will not be detailed here.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
It should be noted that, in this document, the terms "include" and "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, device, article, or method. In the absence of further restrictions, an element limited by the phrase "including a ..." does not exclude the existence of other identical elements in the process, device, article, or method that includes the element.
Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be realized by software plus the necessary general hardware platform, and of course also by hardware, but in many cases the former is the preferable embodiment. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disc), including instructions that cause a terminal device (which may be a mobile phone, computer, server, network device, or the like) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit the scope of the present invention. Any equivalent structural or flow transformation made using the contents of the specification and drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the scope of the present invention.
Claims (10)
1. A web data crawling method applied to an electronic device, wherein the method comprises:
S1. Receiving a web data crawl request and obtaining the first URL (Uniform Resource Locator) list of the request, the first URL list containing the URLs to be crawled, and storing the first URL list into the first preset store path where a preset configuration file resides;
S2. Reading the pre-built Docker image from the second preset store path and generating multiple application containers from the Docker image, the application containers comprising a first application container, second application containers, and a third application container;
S3. Reading the first URL list and the configuration file from the first preset store path, dividing the first URL list into multiple second URL lists by means of the first application container, and storing the multiple second URL lists into the third preset store path;
S4. Crawling, by the multiple second application containers, the web data corresponding to each URL in the multiple second URL lists, and saving the web data into the fourth preset store path; and
S5. Extracting the web data from the fourth preset store path by means of the third application container, and sending the web data to the user terminal corresponding to the web data crawl request.
2. The web data crawling method according to claim 1, wherein before step S1, the method further comprises:
Receiving configuration parameters sent by a client, and obtaining from the configuration parameters the preconfigured number of concurrent processes, the specified store path of the web page data, and the file path of each program; and
Generating a configuration file according to the obtained number of concurrent processes and the file path of each program, and storing the configuration file into the first preset store path.
3. The web data crawling method according to claim 1, wherein the first URL list further includes the index value corresponding to each URL, the index value corresponding to each URL being obtained through the following steps:
Obtaining and analyzing the specific information of each URL in the first URL list to determine the characteristic information of each URL; and
Matching a corresponding index value for each URL in the first URL list according to the mapping relations between characteristic information and index values.
4. The web data crawling method according to claim 3, wherein the method further comprises the following steps:
When there is a URL for which no index value can be matched according to its characteristic information, generating prompt information based on that URL, and receiving a matching instruction that assigns an index value to the URL.
5. The web data crawling method according to any one of claims 1 to 4, wherein before step S5, the method further comprises the following steps:
Comparing the quantity of web data with the quantity of URLs in the first URL list to determine a third URL list;
When the third URL list is not empty, performing the web data crawl operation for each URL in the third URL list until the third URL list is empty, and saving the web data corresponding to the third URL list into the fourth preset store path; and
When the third URL list is empty, continuing to step S5.
6. The web data crawling method according to claim 5, wherein before step S5, the method further comprises the following steps:
Obtaining the second URL lists and the URL regular-expression mining program from the third preset store path;
Obtaining the web data corresponding to each second URL list from the fourth preset store path, mining the web data corresponding to each second URL list to determine the fourth URL list corresponding to each second URL list, and saving the fourth URL lists into the fifth preset store path; and
Performing the web data crawl operation on the fourth URL lists, performing the sub-URL mining operation on the web data corresponding to the fourth URL lists, extracting new sub-URLs and performing the web data crawl operation on them, and repeating this cycle.
7. An electronic device, wherein the device comprises a memory and a processor, the memory storing a web data crawl program runnable on the processor, and the following steps being realized when the web data crawl program is executed by the processor:
A1. Receiving a web data crawl request and obtaining the first URL list of the request, the first URL list containing the URLs to be crawled, and storing the first URL list into the first preset store path where a preset configuration file resides;
A2. Reading the pre-built Docker image from the second preset store path and generating multiple application containers from the Docker image, the application containers comprising a first application container, second application containers, and a third application container;
A3. Reading the first URL list and the configuration file from the first preset store path, dividing the first URL list into multiple second URL lists by means of the first application container, and storing the multiple second URL lists into the third preset store path;
A4. Crawling, by the multiple second application containers, the web data corresponding to each URL in the multiple second URL lists, and saving the web data into the fourth preset store path; and
A5. Extracting the web data from the fourth preset store path by means of the third application container, and sending the web data to the user terminal corresponding to the web data crawl request.
8. The electronic device according to claim 7, wherein the following steps are also realized before step A5 when the web data crawl program is executed by the processor:
Comparing the quantity of web data with the quantity of URLs in the first URL list to determine a third URL list;
When the third URL list is not empty, performing the web data crawl operation for each URL in the third URL list until the third URL list is empty, and saving the web data corresponding to the third URL list into the fourth preset store path; and
When the third URL list is empty, continuing to step A5.
9. The electronic device according to any one of claims 7 to 8, wherein the following steps are also realized before step A5 when the web data crawl program is executed by the processor:
Obtaining the second URL lists and the URL regular-expression mining program from the third preset store path;
Obtaining the web data corresponding to each second URL list from the fourth preset store path, mining the web data corresponding to each second URL list to determine the fourth URL list corresponding to each second URL list, and saving the fourth URL lists into the fifth preset store path; and
Performing the web data crawl operation on the fourth URL lists, continuing to perform the sub-URL mining operation on the web data corresponding to the fourth URL lists, extracting new sub-URLs and performing the web data crawl operation on them, and repeating this cycle.
10. A computer-readable storage medium, wherein the computer-readable storage medium contains a web data crawl program, and the steps of the web data crawling method according to any one of claims 1 to 6 are realized when the web data crawl program is executed by a processor.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810791126.3A CN110020060B (en) | 2018-07-18 | 2018-07-18 | Webpage data crawling method and device and storage medium |
PCT/CN2018/108218 WO2020015192A1 (en) | 2018-07-18 | 2018-09-28 | Webpage data crawling method and apparatus, and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110020060A true CN110020060A (en) | 2019-07-16 |
CN110020060B CN110020060B (en) | 2023-03-14 |
Family
ID=67188354
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810791126.3A Active CN110020060B (en) | 2018-07-18 | 2018-07-18 | Webpage data crawling method and device and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110020060B (en) |
WO (1) | WO2020015192A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110888655A (en) * | 2019-11-14 | 2020-03-17 | 中国民航信息网络股份有限公司 | Application publishing method and device |
CN113392301A (en) * | 2021-06-08 | 2021-09-14 | 北京精准沟通传媒科技股份有限公司 | Method, device, medium and electronic equipment for crawling data |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116361362B (en) * | 2023-05-30 | 2023-08-11 | 江西顶易科技发展有限公司 | User information mining method and system based on webpage content identification |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7676553B1 (en) * | 2003-12-31 | 2010-03-09 | Microsoft Corporation | Incremental web crawler using chunks |
US20120102019A1 (en) * | 2010-10-25 | 2012-04-26 | Korea Advanced Institute Of Science And Technology | Method and apparatus for crawling webpages |
CN103970788A (en) * | 2013-02-01 | 2014-08-06 | 北京英富森信息技术有限公司 | Webpage-crawling-based crawler technology |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106101176B (en) * | 2016-05-27 | 2019-04-12 | 成都索贝数码科技股份有限公司 | One kind is integrated to melt media cloud production delivery system and method |
CN106484886A (en) * | 2016-10-17 | 2017-03-08 | 金蝶软件(中国)有限公司 | A kind of method of data acquisition and its relevant device |
CN108197633A (en) * | 2017-11-24 | 2018-06-22 | 百年金海科技有限公司 | Deep learning image classification based on TensorFlow is with applying dispositions method |
CN108062413B (en) * | 2017-12-30 | 2019-05-28 | 平安科技(深圳)有限公司 | Web data processing method, device, computer equipment and storage medium |
2018
- 2018-07-18 CN CN201810791126.3A patent/CN110020060B/en active Active
- 2018-09-28 WO PCT/CN2018/108218 patent/WO2020015192A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2020015192A1 (en) | 2020-01-23 |
CN110020060B (en) | 2023-03-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110020060A (en) | Web data crawling method, device and storage medium | |
CN110737659A (en) | Graph data storage and query method, device and computer readable storage medium | |
CN108958881A (en) | Data processing method, device and computer readable storage medium | |
CN107908472A (en) | Data synchronization unit, method and computer-readable recording medium | |
CN110442816A (en) | Web form configuration method, device and computer readable storage medium | |
CN109617996B (en) | File uploading and downloading method, server and computer readable storage medium | |
CN107967135A (en) | Computing engines implementation method, electronic device and storage medium | |
CN107656729B (en) | List view updating apparatus, method and computer-readable storage medium | |
CN108427698A (en) | Updating device, method and the computer readable storage medium of prediction model | |
CN110363303B (en) | Memory training method and device for intelligent distribution model and computer readable storage medium | |
CN107301091A (en) | Resource allocation methods and device | |
CN112416458A (en) | Preloading method and device based on ReactNative, computer equipment and storage medium | |
CN110764913B (en) | Data calculation method based on rule calling, client and readable storage medium | |
CN109634916A (en) | File storage and method for down loading, device and storage medium | |
CN108845839A (en) | Application page loading method, device and computer readable storage medium | |
CN103838851B (en) | The rendering intent and device of three-dimensional scene models file | |
CN109190062A (en) | Crawling method, device and the storage medium of target corpus data | |
CN108073698B (en) | Real-time animation display methods, device, electric terminal and readable storage medium storing program for executing | |
CN110274607A (en) | Intelligent paths planning method, device and computer readable storage medium | |
CN110245281B (en) | Internet asset information collection method and terminal equipment | |
US20140181064A1 (en) | Geographical area correlated websites | |
CN108427586B (en) | Using push terminal, method and the computer readable storage medium of theme | |
CN108845864A (en) | A kind of JVM rubbish recovering method and device based on spring frame | |
CN112529711A (en) | Transaction processing method and device based on block chain virtual machine multiplexing | |
CN107729523A (en) | Data service method, electronic installation and storage medium |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||