CN110020060A - Web data crawling method, device and storage medium - Google Patents

Web data crawling method, device and storage medium

Info

Publication number
CN110020060A
CN110020060A (application CN201810791126.3A)
Authority
CN
China
Prior art keywords
url
web data
url list
list
store path
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810791126.3A
Other languages
Chinese (zh)
Other versions
CN110020060B (en)
Inventor
吴壮伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201810791126.3A priority Critical patent/CN110020060B/en
Priority to PCT/CN2018/108218 priority patent/WO2020015192A1/en
Publication of CN110020060A publication Critical patent/CN110020060A/en
Application granted granted Critical
Publication of CN110020060B publication Critical patent/CN110020060B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/955 - Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a web data crawling method. In the method, a first URL list is read according to a received web data crawling request, multiple application containers are generated from a pre-built docker image, the first URL list is divided into multiple second URL lists, and the web data corresponding to each URL in the multiple second URL lists is crawled in parallel. The method then extracts the crawled web data and sends it to the user terminal corresponding to the web data crawling request. The present invention also provides an electronic device and a computer storage medium. The present invention improves the efficiency of web data crawling.

Description

Web data crawling method, device and storage medium
Technical field
The present invention relates to the field of data processing, and in particular to a web data crawling method, an electronic device, and a computer-readable storage medium.
Background technique
In the prior art, the traditional way to run multiple tasks on one server is to start multiple virtual machines and run different tasks on different virtual machines. Traditional virtualization is mostly based on VMware-style virtual machines, and running such a virtual machine requires booting a complete operating system, which consumes a large amount of system resources.
However, the CPU, memory, network, and disk resources of a server are all limited. Taking web data crawling as an example, the distribution of crawlers is currently constrained by the number of machines, CPUs, threads, and processes. When the virtual machines started on a server consume too many resources, the system resources cannot be fully utilized, which in turn reduces the efficiency of web data crawling.
Summary of the invention
In view of the above, the present invention provides a web data crawling method, a server, and a computer-readable storage medium, whose main purpose is to improve the efficiency of web data crawling.
To achieve the above object, the present invention provides a web data crawling method, the method comprising:
S1, receiving a web data crawling request, reading a first URL (Uniform Resource Locator) list according to the web data crawling request, the first URL list containing the URLs to be crawled, and storing the first URL list into a first preset storage path where a preset configuration file is located;
S2, reading a pre-built docker image from a second preset storage path, and generating multiple application containers from the docker image, wherein the application containers include a first application container, second application containers, and a third application container;
S3, reading the first URL list and the configuration file from the first preset storage path, dividing the first URL list into multiple second URL lists based on the first application container, and storing the multiple second URL lists into a third preset storage path;
S4, crawling the web data corresponding to each URL in the multiple second URL lists based on the multiple second application containers, and saving the web data into a fourth preset storage path; and
S5, extracting the web data from the fourth preset storage path based on the third application container, and sending the web data to the user terminal corresponding to the web data crawling request.
Preferably, before step S1, the method further comprises:
receiving configuration parameters sent by a client, and obtaining from the configuration parameters the preconfigured number of concurrent processes and the storage paths specified for the web data and for each program; and
generating a configuration file according to the obtained number of concurrent processes and the file path of each program, and storing the configuration file into the first preset storage path.
Preferably, the first URL list further contains index value information corresponding to each URL, wherein the index value information corresponding to each URL is obtained by the following steps:
obtaining and analyzing the specific information of each URL in the first URL list, and determining the characteristic information of each URL; and
matching a corresponding index value for each URL in the first URL list according to the mapping relationship between characteristic information and index values.
Preferably, the method further comprises the following step:
when there is a URL for which no index value can be matched according to its characteristic information, generating prompt information based on that URL, and receiving a matching instruction that assigns an index value to that URL.
Preferably, before step S5, the method further comprises the following steps:
comparing the quantity of crawled web data with the quantity of URLs in the first URL list to determine a third URL list;
when the third URL list is not empty, performing the web data crawling operation for each URL in the third URL list until the third URL list is empty, and saving the web data corresponding to the third URL list into the fourth preset storage path; and
when the third URL list is empty, continuing with step S5.
Preferably, before step S5, the method further comprises the following steps:
obtaining the second URL lists and a URL regular-expression mining program from the third preset storage path;
obtaining the web data corresponding to each second URL list from the fourth preset storage path, mining the web data corresponding to each second URL list to determine a fourth URL list corresponding to each second URL list, and saving the fourth URL lists into a fifth preset storage path; and
performing the web data crawling operation on the fourth URL lists, performing a sub-URL mining operation on the web data corresponding to the fourth URL lists to extract new sub-URLs, and performing the web data crawling operation on them, repeating this cycle.
In addition, the present invention further provides an electronic device, the device comprising a memory and a processor, wherein a web data crawling program that can run on the processor is stored on the memory, and when the web data crawling program is executed by the processor, any of the steps of the web data crawling method described above can be implemented.
In addition, to achieve the above object, the present invention further provides a computer-readable storage medium containing a web data crawling program, and when the web data crawling program is executed by a processor, any of the steps of the web data crawling method described above can be implemented.
The web data crawling method, electronic device, and computer-readable storage medium proposed by the present invention establish docker application containers from a docker image to process data in parallel. A docker application container avoids the resource waste caused by booting an entire operating system and provides isolation similar to that of a virtual machine at the cost of a single process. Based on this framework, the user only needs to set a configuration file and package the relevant programs into an image file; by establishing multiple docker application containers that crawl web data in parallel, the web data crawling work can be completed efficiently. Data verification of the crawled web data guarantees its integrity, and sub-URL depth mining of the crawled web data guarantees its comprehensiveness.
Detailed description of the invention
Fig. 1 is a flowchart of a preferred embodiment of the web data crawling method of the present invention;
Fig. 2 is a schematic diagram of a preferred embodiment of the electronic device of the present invention;
Fig. 3 is a schematic diagram of the program modules of the web data crawling program in Fig. 2.
The realization of the objectives, functional characteristics, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Specific embodiment
It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
The present invention provides a web data crawling method. Referring to Fig. 1, it is a flowchart of a preferred embodiment of the web data crawling method of the present invention. The method may be executed by a device, and the device may be implemented by software and/or hardware.
In the present embodiment, the docker-based web data crawling method includes steps S1-S5:
S1, receiving a web data crawling request, reading a first URL list according to the web data crawling request, the first URL list containing the URLs to be crawled, and storing the first URL list into a first preset storage path where a preset configuration file is located;
The following description takes an electronic device as the executing subject of the method of this embodiment. The electronic device acts as a server, establishes a communication connection with a user terminal, receives the business data processing request sent by the user terminal, and processes the business data according to the request. The electronic device may have a multi-core CPU (Central Processing Unit).
It should be understood that, before the web data crawling request sent by the user terminal is received and the web data is crawled, a docker image has been configured on the electronic device. Specifically, the docker image is created based on dockerfile rules and contains a list partition program, a concurrent processing program, a data verification program, a data merging program, and so on. The created docker image is saved into the second preset storage path. After the docker image has been created, multiple application containers are created from it. Each program can run independently in an application container, and the application containers run independently of each other.
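By way of non-limiting illustration, the image-building and container-creation step described above could be sketched with the Python docker SDK as follows; the image tag, build path, container names, and container count are assumptions made for the sketch and are not specified by the present embodiment.

```python
import docker  # pip install docker

client = docker.from_env()

# Build the crawler image from a local Dockerfile (hypothetical path and tag).
image, _ = client.images.build(path="./crawler", tag="crawler:latest")

# Spawn one partition (first) container, several crawler (second) containers,
# and one merge/send (third) container from the same image.
containers = [client.containers.run("crawler:latest", command="partition",
                                    name="partition-1", detach=True)]
for i in range(4):  # the number of second containers is read from the configuration file
    containers.append(client.containers.run("crawler:latest", command="crawl",
                                            name=f"crawl-{i}", detach=True))
containers.append(client.containers.run("crawler:latest", command="merge",
                                        name="merge-1", detach=True))
```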
In addition, before step S1, the method further comprises the following steps:
receiving configuration parameters sent by a client, and obtaining from the configuration parameters the preconfigured number of concurrent processes and the storage paths specified for the web data and for each program;
generating a configuration file according to the obtained number of concurrent processes and the file path of each program, and storing the configuration file into the first preset storage path.
The number of processes needs to be adjusted according to the size of the server's multi-core CPU and the CPU share occupied by the data processing.
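Purely as an illustrative assumption, the configuration file could be a small JSON document recording the number of concurrent processes and the preset storage paths; the key names and paths below are hypothetical and are not prescribed by the present embodiment.

```python
import json

# Hypothetical configuration: concurrent process count and the preset storage paths.
config = {
    "concurrent_processes": 4,
    "paths": {
        "first": "/data/config",    # first preset storage path (configuration file, first URL list)
        "second": "/data/images",   # second preset storage path (docker image, programs)
        "third": "/data/sublists",  # third preset storage path (second URL lists)
        "fourth": "/data/pages",    # fourth preset storage path (crawled web data)
    },
}

# In practice the file would live in the first preset storage path.
with open("crawler.json", "w") as f:
    json.dump(config, f, indent=2)
```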
The first URL list contained in the web data crawling request is the original URL list to be crawled. When the web data crawling request is received, an index value corresponding to each URL is determined according to the information of each URL in the first URL list to be crawled, the index value information corresponding to each URL is updated into the first URL list, and the updated first URL list is stored into the first preset storage path, wherein the first preset storage path may be a Redis database.
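Since the first preset storage path may be a Redis database, the following hedged sketch shows one possible way to store the first URL list together with the per-URL index values using the redis-py client; the key name and record layout are assumptions made for illustration.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Hypothetical first URL list with per-URL index values.
first_url_list = [
    {"url": "https://example.com/news/1", "index": 2},
    {"url": "https://example.com/blog/7", "index": 5},
]

# Store each entry in a Redis list under an assumed key.
for entry in first_url_list:
    r.rpush("first_url_list", json.dumps(entry))
```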
In one embodiment, the index value corresponding to each URL in the first URL list is obtained by the following steps:
obtaining and analyzing the specific information of each URL in the first URL list, and determining the characteristic information of each URL; and
matching a corresponding index value for each URL in the first URL list according to the mapping relationship between characteristic information and index values.
The characteristic information can be used to characterize the type of the web page, and the index value is used to select the crawling program to invoke. The mapping relationship between characteristic information and index values is obtained by the following steps:
obtaining a set of specified URLs, determining the characteristic information of each URL in the set, and labeling each URL with an index value; dividing the specified URLs in the set into sub-sets corresponding to the different index values according to the index values; counting the proportion of each piece of characteristic information within each sub-set, and selecting the characteristic information with the largest proportion as the target characteristic information of the specified URLs in that sub-set; and determining the mapping relationship between characteristic information and index values according to the target characteristic information of the specified URLs in each sub-set and the index value corresponding to that sub-set.
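A minimal Python sketch of this majority-vote construction of the mapping relationship is given below, assuming the labeled set is supplied as (characteristic information, index value) pairs; the helper name and sample values are illustrative.

```python
from collections import Counter, defaultdict

def build_feature_index_mapping(labeled_urls):
    """labeled_urls: iterable of (characteristic_info, index_value) pairs."""
    by_index = defaultdict(list)
    for feature, index_value in labeled_urls:
        by_index[index_value].append(feature)  # divide the set by index value

    mapping = {}
    for index_value, features in by_index.items():
        # The characteristic with the largest proportion becomes the target characteristic.
        target_feature, _ = Counter(features).most_common(1)[0]
        mapping[target_feature] = index_value
    return mapping

# Example: pages whose characteristic is "news" mostly carry index 2, "forum" pages index 5.
mapping = build_feature_index_mapping([("news", 2), ("news", 2), ("forum", 5)])
```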
Further, the method further comprises the following step:
when there is a URL for which no index value can be matched according to its characteristic information, generating prompt information based on that URL, and receiving a matching instruction that assigns an index value to that URL.
The matching instruction contains the index value information corresponding to the URL for which no index value could be matched according to its characteristic information. The URLs that cannot be matched to an index value are exported and fed back to a designated terminal, where the corresponding index values are determined manually.
S2, reading the pre-built docker image from the second preset storage path, and generating multiple application containers from the docker image, wherein the application containers include a first application container, second application containers, and a third application container;
In the present embodiment, the application containers include one first application container, multiple second application containers, and one third application container.
S3, reading the first URL list and the configuration file from the first preset storage path, dividing the first URL list into multiple second URL lists based on the first application container, and storing the multiple second URL lists into a third preset storage path;
Specifically, the first application container is run, the first URL list is obtained from the first preset storage path, the list partition program is called from the second preset storage path, and the first URL list is divided evenly into N URL sub-lists, that is, N second URL lists, where N is an integer greater than 1. Each second URL list contains a plurality of URLs and the index value information corresponding to each URL, and each second URL list is stored into the preset storage path for second URL lists, that is, the third preset storage path.
In this step, the number N of second URL lists is the same as the number of second application containers. For example, when the number of second application containers is 5, N is 5, meaning that the first URL list is divided into 5 second URL lists. This step realizes the division and dispatch of the list to be crawled and the corresponding program index values.
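The even division of the first URL list could be sketched as follows; the round-robin split is one possible realization of the list partition program, and the record format follows the earlier assumptions.

```python
def split_url_list(first_url_list, n):
    """Divide the first URL list evenly into n second URL lists."""
    sublists = [[] for _ in range(n)]
    for i, entry in enumerate(first_url_list):
        sublists[i % n].append(entry)  # round-robin keeps the split balanced
    return sublists

# Example: 10 URLs split across 5 second URL lists, 2 URLs each.
urls = [{"url": f"https://example.com/page/{i}", "index": 1} for i in range(10)]
second_url_lists = split_url_list(urls, n=5)
```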
Further, the first application container calls the configuration file from the second preset storage path and allocates the CPU resources of the server to the multiple second application containers described below, so that the multiple second application containers perform the web data crawling operation in parallel. Data processing parameters are obtained from the configuration file, wherein the data processing parameters include the number N of concurrent processes and the storage path for the crawled web data.
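The allocation of CPU resources to the second application containers could, for example, be expressed through per-container CPU limits in the Python docker SDK, as in this hedged sketch; the container count and the one-CPU-per-container limit are illustrative values only.

```python
import docker

client = docker.from_env()

# Give each of the 4 second application containers one full CPU (assumed split).
for i in range(4):
    client.containers.run("crawler:latest", command="crawl",
                          name=f"crawl-{i}", detach=True,
                          nano_cpus=1_000_000_000)  # 1.0 CPU per container
```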
S4, crawling the web data corresponding to each URL in the multiple second URL lists based on the multiple second application containers, and saving the web data into a fourth preset storage path;
The N second application containers are run synchronously, with each second application container corresponding to one second URL list. The multiple second application containers each obtain their second URL list from the third preset storage path, call the crawling program corresponding to the index value of each URL in their second URL list in turn, perform the web data crawling operation, and save the crawled web data into the specified storage path. The different second application containers correspond to the same fourth preset storage path, that is, the web data crawled by the different second application containers is saved into the same folder. This step realizes the crawling and aggregation of the web data.
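The per-container crawling loop of step S4 might look like the following sketch, which dispatches on the index value and fetches pages with the requests library; the crawler registry, file naming, and shared folder are assumptions made for illustration.

```python
import json
import pathlib
import requests

FOURTH_PATH = pathlib.Path("/data/pages")  # shared folder for all second containers

def crawl_generic(url):
    return requests.get(url, timeout=10).text

# Hypothetical registry mapping index values to crawling programs.
CRAWLERS = {1: crawl_generic, 2: crawl_generic}

def run_second_container(container_id, second_url_list):
    """Crawl one second URL list and drop the results into the shared folder."""
    FOURTH_PATH.mkdir(parents=True, exist_ok=True)
    for i, entry in enumerate(second_url_list):
        crawler = CRAWLERS.get(entry["index"], crawl_generic)
        html = crawler(entry["url"])
        out = FOURTH_PATH / f"{container_id}_{i}.json"  # avoid name clashes across containers
        out.write_text(json.dumps({"url": entry["url"], "html": html}))
```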
S5, extracting the web data from the fourth preset storage path based on the third application container, and sending the web data to the user terminal corresponding to the web data crawling request.
The third application container is run, the web data crawled by the N second application containers is read from the fourth preset storage path, the web data is sent to the user terminal corresponding to the web data crawling request, and prompt information is generated.
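For completeness, the extract-and-send step performed by the third application container might resemble the following sketch; the callback endpoint of the user terminal is an assumption, since the present embodiment does not specify the transport.

```python
import pathlib
import requests

FOURTH_PATH = pathlib.Path("/data/pages")

def send_results(callback_url):
    """Read every crawled page from the shared folder and post it back to the requester."""
    payload = [p.read_text() for p in FOURTH_PATH.glob("*.json")]
    requests.post(callback_url, json=payload, timeout=30)

send_results("https://user-terminal.example.com/crawl-results")  # hypothetical endpoint
```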
In other embodiments, in order to guarantee the integrity of the crawled web data, the crawled web data needs to be verified. Specifically, before step S5, the method further comprises the following steps:
comparing the quantity of crawled web data with the quantity of URLs in the first URL list to determine a third URL list;
when the third URL list is not empty, performing the web data crawling operation for each URL in the third URL list until the third URL list is empty, and saving the web data corresponding to the third URL list into the fourth preset storage path;
when the third URL list is empty, continuing with step S5.
Specifically, a fourth application container is generated from the docker image, and the fourth application container verifies the crawled web data. The third URL list contains the URLs still to be crawled and their corresponding index values. It should be understood that each URL corresponds to one item of web data. Suppose the number of URLs counted in the first URL list is P and the number of web data items in the fourth preset storage path is Q. When P = Q, the third URL list is empty, that is, there are no unprocessed URLs and no web data remains to be crawled. When P > Q, the third URL list contains P - Q URLs, and these P - Q URLs are the URLs to be processed, that is, web data remains to be crawled. For the third URL list containing the URLs to be processed, the fourth application container is run to perform the web data crawling operation, and the web data crawled in this step is merged with the web data crawled in step S4. It should be noted that when, after a preset number of attempts (for example, 3) on the URLs in the third URL list, some URLs still remain in the third URL list, warning information is generated. This step prevents omissions in the web data and guarantees its integrity.
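A hedged sketch of this verification-and-recrawl loop is shown below, reusing the assumed record format of the earlier sketches; the retry limit of 3 follows the example given in the text.

```python
def verify_and_recrawl(first_url_list, crawled_urls, crawl_fn, max_attempts=3):
    """Re-crawl any URL whose web data is missing from the fourth preset storage path."""
    third_url_list = [e for e in first_url_list if e["url"] not in crawled_urls]
    attempts = 0
    while third_url_list and attempts < max_attempts:
        still_missing = []
        for entry in third_url_list:
            try:
                crawl_fn(entry)               # same crawling operation as step S4
                crawled_urls.add(entry["url"])
            except Exception:
                still_missing.append(entry)   # keep for the next attempt
        third_url_list = still_missing
        attempts += 1
    if third_url_list:
        print("warning: URLs still uncrawled after retries:", third_url_list)
```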
Preferably, there is a case in which the crawled web data contains sub-URLs. In order to guarantee the comprehensiveness of the crawled web data, depth mining may also be performed on the first URL list to determine the web data corresponding to the sub-URLs contained in the web data. Specifically, before step S5, the method further comprises the following steps:
obtaining the second URL lists and a URL regular-expression mining program from the third preset storage path;
obtaining the web data corresponding to each second URL list from the fourth preset storage path, mining the web data corresponding to each second URL list to determine a fourth URL list corresponding to each second URL list, and saving the fourth URL lists into a fifth preset storage path;
performing the web data crawling operation on the fourth URL lists, continuing to perform the sub-URL mining operation on the web data corresponding to the fourth URL lists, extracting new sub-URLs, and performing the web data crawling operation on them, repeating this cycle.
Each second URL list corresponds to one fourth URL list, and the second application container that performs the web data crawling operation on a fourth URL list is the same second application container that handles the corresponding second URL list.
It should be understood that, in order to prevent the depth mining from extracting new sub-URLs indefinitely, a depth threshold (indicating the number of rounds of sub-URL depth mining) is preset. When the number of rounds of depth mining of the URL list exceeds the preset depth threshold, the operation of mining new sub-URLs is stopped.
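The depth-limited sub-URL mining loop could be sketched as follows; the regular expression is a simplified stand-in for the URL regular-expression mining program and is not the pattern used by the present embodiment.

```python
import re
import requests

URL_PATTERN = re.compile(r'href="(https?://[^"]+)"')  # simplified sub-URL extractor

def depth_mine(seed_urls, depth_threshold=3):
    """Crawl the seed URLs, mine sub-URLs from each page, and repeat up to the threshold."""
    seen, frontier = set(seed_urls), list(seed_urls)
    for _ in range(depth_threshold):
        next_frontier = []
        for url in frontier:
            html = requests.get(url, timeout=10).text       # crawling operation
            for sub_url in URL_PATTERN.findall(html):        # sub-URL mining operation
                if sub_url not in seen:
                    seen.add(sub_url)
                    next_frontier.append(sub_url)
        if not next_frontier:
            break
        frontier = next_frontier
    return seen
```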
The web data crawling method proposed by the above embodiment obtains the first URL list to be processed according to a received web data crawling request and stores it into the first preset storage path where the preset configuration file is located; reads the pre-built docker image from the second preset storage path and generates multiple application containers from it; reads the configuration file and the first URL list from the first preset storage path; divides the first URL list into multiple second URL lists according to the application containers and the configuration file; and processes the multiple second URL lists in a multi-container parallel manner, so that the server can distribute system resources to the multiple application containers processing in parallel, crawl the web data corresponding to each second URL list, and send it to the user terminal corresponding to the web data request. The solution of the present invention establishes docker application containers from a docker image to process data in parallel. A docker application container avoids the resource waste caused by booting an entire operating system and provides isolation similar to that of a virtual machine at the cost of a single process. Based on this framework, the user only needs to set a configuration file and package the relevant programs into an image file; by establishing multiple docker application containers that crawl web data in parallel, the web data crawling work can be completed efficiently. Data verification of the crawled web data guarantees its integrity, and sub-URL depth mining of the crawled web data guarantees its comprehensiveness.
The present invention also provides a kind of electronic devices.Referring to shown in Fig. 2, for showing for 1 preferred embodiment of electronic device of the present invention It is intended to.
In the present embodiment, electronic device 1 can be smart phone, tablet computer, portable computer, desktop PC Etc. terminal device having data processing function.
The electronic device 1 includes memory 11, processor 12 and network interface 13.
Wherein, memory 11 include at least a type of readable storage medium storing program for executing, the readable storage medium storing program for executing include flash memory, Hard disk, multimedia card, card-type memory (for example, SD or DX memory etc.), magnetic storage, disk, CD etc..Memory 11 It can be the internal storage unit of the electronic device 1, such as the hard disk of the electronic device 1 in some embodiments.Memory 11 are also possible to be equipped on the External memory equipment of the electronic device 1, such as the electronic device 1 in further embodiments Plug-in type hard disk, intelligent memory card (Smart Media Card, SMC), secure digital (Secure Digital, SD) card dodge Deposit card (Flash Card) etc..Further, memory 11 can also both include the internal storage unit of the electronic device 1 or wrap Include External memory equipment.
Memory 11 can be not only used for the application software and Various types of data that storage is installed on the electronic device 1, such as net Page data crawls program 10 etc., can be also used for temporarily storing the data that has exported or will export.
Processor 12 can be in some embodiments a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor or other data processing chips, the program for being stored in run memory 11 Code or processing data, such as web data crawl program 10 etc..
Network interface 13 optionally may include standard wireline interface and wireless interface (such as WI-FI interface), be commonly used in Communication connection is established between the electronic device 1 and other electronic equipments.
Fig. 2 illustrates only the electronic device 1 with component 11-13, it will be appreciated by persons skilled in the art that Fig. 2 shows Structure out does not constitute the restriction to electronic device 1, may include than illustrating less perhaps more components or combining certain A little components or different component layouts.
Optionally, the electronic device 1 can also include user interface, user interface may include display (Display), Input unit such as keyboard (Keyboard), optional user interface can also include standard wireline interface and wireless interface.
Optionally, in some embodiments, display can be light-emitting diode display, liquid crystal display, touch control type LCD and show Device and Organic Light Emitting Diode (Organic Light-Emitting Diode, OLED) touch device etc..Wherein, display It is properly termed as display screen or display unit, for showing the information handled in the electronic apparatus 1 and for showing visually User interface.
In the embodiment of the electronic device 1 shown in Fig. 2, the memory 11, as a kind of computer storage medium, stores the web data crawling program 10, and when the processor 12 executes the web data crawling program 10 stored in the memory 11, the following steps are implemented:
A1, receiving a web data crawling request, reading a first URL (Uniform Resource Locator) list according to the web data crawling request, the first URL list containing the URLs to be crawled, and storing the first URL list into a first preset storage path where a preset configuration file is located;
The following description takes the electronic device as the executing subject of the method of this embodiment. The electronic device acts as a server, establishes a communication connection with a user terminal, receives the business data processing request sent by the user terminal, and processes the business data according to the request. The electronic device may have a multi-core CPU (Central Processing Unit).
It should be understood that, before the web data crawling request sent by the user terminal is received and the web data is crawled, a docker image has been configured on the electronic device. Specifically, the docker image is created based on dockerfile rules and contains a list partition program, a concurrent processing program, a data verification program, a data merging program, and so on. The created docker image is saved into the second preset storage path. After the docker image has been created, multiple application containers are created from it. Each program can run independently in an application container, and the application containers run independently of each other.
In addition, before step A1, the method further comprises the following steps:
receiving configuration parameters sent by a client, and obtaining from the configuration parameters the preconfigured number of concurrent processes and the storage paths specified for the web data and for each program;
generating a configuration file according to the obtained number of concurrent processes and the file path of each program, and storing the configuration file into the first preset storage path.
The number of processes needs to be adjusted according to the size of the server's multi-core CPU and the CPU share occupied by the data processing.
The first URL list contained in the web data crawling request is the original URL list to be crawled. When the web data crawling request is received, an index value corresponding to each URL is determined according to the information of each URL in the first URL list to be crawled, the index value information corresponding to each URL is updated into the first URL list, and the updated first URL list is stored into the first preset storage path, wherein the first preset storage path may be a Redis database.
In one embodiment, the index value corresponding to each URL in the first URL list is obtained by the following steps:
obtaining and analyzing the specific information of each URL in the first URL list, and determining the characteristic information of each URL; and
matching a corresponding index value for each URL in the first URL list according to the mapping relationship between characteristic information and index values.
The characteristic information can be used to characterize the type of the web page, and the index value is used to select the crawling program to invoke. The mapping relationship between characteristic information and index values is obtained by the following steps:
obtaining a set of specified URLs, determining the characteristic information of each URL in the set, and labeling each URL with an index value; dividing the specified URLs in the set into sub-sets corresponding to the different index values according to the index values; counting the proportion of each piece of characteristic information within each sub-set, and selecting the characteristic information with the largest proportion as the target characteristic information of the specified URLs in that sub-set; and determining the mapping relationship between characteristic information and index values according to the target characteristic information of the specified URLs in each sub-set and the index value corresponding to that sub-set.
Further, the method further comprises the following step:
when there is a URL for which no index value can be matched according to its characteristic information, generating prompt information based on that URL, and receiving a matching instruction that assigns an index value to that URL.
The matching instruction contains the index value information corresponding to the URL for which no index value could be matched according to its characteristic information. The URLs that cannot be matched to an index value are exported and fed back to a designated terminal, where the corresponding index values are determined manually.
A2, reading the pre-built docker image from the second preset storage path, and generating multiple application containers from the docker image, wherein the application containers include a first application container, second application containers, and a third application container;
In the present embodiment, the application containers include one first application container, multiple second application containers, and one third application container.
A3, reading the first URL list and the configuration file from the first preset storage path, dividing the first URL list into multiple second URL lists based on the first application container, and storing the multiple second URL lists into a third preset storage path;
Specifically, the first application container is run, the first URL list is obtained from the first preset storage path, the list partition program is called from the second preset storage path, and the first URL list is divided evenly into N URL sub-lists, that is, N second URL lists, where N is an integer greater than 1. Each second URL list contains a plurality of URLs and the index value information corresponding to each URL, and each second URL list is stored into the preset storage path for second URL lists, that is, the third preset storage path.
In this step, the number of second URL lists is the same as the number of second application containers. For example, when the number of second application containers is 5, N is 5, meaning that the first URL list is divided into 5 second URL lists. This step realizes the division and dispatch of the list to be crawled and the corresponding program index values.
Further, the first application container calls the configuration file from the second preset storage path and allocates the CPU resources of the server to the multiple second application containers described below, so that the multiple second application containers perform the web data crawling operation in parallel. Data processing parameters are obtained from the configuration file, wherein the data processing parameters include the number N of concurrent processes and the storage path for the crawled web data.
A4, crawling the web data corresponding to each URL in the multiple second URL lists based on the multiple second application containers, and saving the web data into a fourth preset storage path;
The N second application containers are run synchronously, with each second application container corresponding to one second URL list. The multiple second application containers each obtain their second URL list from the third preset storage path, call the crawling program corresponding to the index value of each URL in their second URL list in turn, perform the web data crawling operation, and save the crawled web data into the specified storage path. The different second application containers correspond to the same fourth preset storage path, that is, the web data crawled by the different second application containers is saved into the same folder. This step realizes the crawling and aggregation of the web data.
A5, extracting the web data from the fourth preset storage path based on the third application container, and sending the web data to the user terminal corresponding to the web data crawling request.
The third application container is run, the web data crawled by the N second application containers is read from the fourth preset storage path, the web data is sent to the user terminal corresponding to the web data crawling request, and prompt information is generated.
In other embodiments, in order to guarantee the integrity of the crawled web data, the crawled web data needs to be verified. Specifically, when the web data crawling program 10 is executed by the processor, the following steps are also implemented before step A5:
comparing the quantity of crawled web data with the quantity of URLs in the first URL list to determine a third URL list;
when the third URL list is not empty, performing the web data crawling operation for each URL in the third URL list until the third URL list is empty, and saving the web data corresponding to the third URL list into the fourth preset storage path;
when the third URL list is empty, continuing with step A5.
Specifically, a fourth application container is generated from the docker image, and the fourth application container verifies the crawled web data. The third URL list contains the URLs still to be crawled and their corresponding index values. It should be understood that each URL corresponds to one item of web data. Suppose the number of URLs counted in the first URL list is P and the number of web data items in the fourth preset storage path is Q. When P = Q, the third URL list is empty, that is, there are no unprocessed URLs and no web data remains to be crawled. When P > Q, the third URL list contains P - Q URLs, and these P - Q URLs are the URLs to be processed, that is, web data remains to be crawled. For the third URL list containing the URLs to be processed, the fourth application container is run to perform the web data crawling operation, and the web data crawled in this step is merged with the web data crawled in step A4. It should be noted that when, after a preset number of attempts (for example, 3) on the URLs in the third URL list, some URLs still remain in the third URL list, warning information is generated. This step prevents omissions in the web data and guarantees its integrity.
Preferably, there is a case in which the crawled web data contains sub-URLs. In order to guarantee the comprehensiveness of the crawled web data, depth mining may also be performed on the first URL list to determine the web data corresponding to the sub-URLs contained in the web data. Specifically, when the web data crawling program 10 is executed by the processor, the following steps are also implemented before step A5:
obtaining the second URL lists and a URL regular-expression mining program from the third preset storage path;
obtaining the web data corresponding to each second URL list from the fourth preset storage path, mining the web data corresponding to each second URL list to determine a fourth URL list corresponding to each second URL list, and saving the fourth URL lists into a fifth preset storage path;
performing the web data crawling operation on the fourth URL lists, continuing to perform the sub-URL mining operation on the web data corresponding to the fourth URL lists, extracting new sub-URLs, and performing the web data crawling operation on them, repeating this cycle.
Each second URL list corresponds to one fourth URL list, and the second application container that performs the web data crawling operation on a fourth URL list is the same second application container that handles the corresponding second URL list.
It should be understood that, in order to prevent the depth mining from extracting new sub-URLs indefinitely, a depth threshold (indicating the number of rounds of sub-URL depth mining) is preset. When the number of rounds of depth mining of the URL list exceeds the preset depth threshold, the operation of mining new sub-URLs is stopped.
The electronic device 1 proposed by the above embodiment establishes docker application containers from a docker image to process data in parallel. A docker application container avoids the resource waste caused by booting an entire operating system and provides isolation similar to that of a virtual machine at the cost of a single process. Based on this framework, the user only needs to set a configuration file and package the relevant programs into an image file; by establishing multiple docker application containers that crawl web data in parallel, the web data crawling work can be completed efficiently. Data verification of the crawled web data guarantees its integrity, and sub-URL depth mining of the crawled web data guarantees its comprehensiveness.
Optionally, in other embodiments, the web data crawling program 10 may also be divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the present invention. The module referred to in the present invention is a series of computer program instruction segments capable of completing a specific function. For example, referring to Fig. 3, which is a schematic diagram of the modules of the web data crawling program 10 in Fig. 2, in this embodiment the web data crawling program 10 may be divided into a receiving module 110, a container generation module 120, a list division module 130, a data crawling module 140, and a data sending module 150. The functions or operation steps implemented by the modules 110-150 are similar to those described above and are not described in detail here. Illustratively:
the receiving module 110 is used to receive a web data crawling request, read a first URL list according to the web data crawling request, the first URL list containing the URLs to be crawled, and store the first URL list into a first preset storage path where a preset configuration file is located;
the container generation module 120 is used to read a pre-built docker image from a second preset storage path and generate multiple application containers from the docker image, wherein the application containers include a first application container, second application containers, and a third application container;
the list division module 130 is used to read the first URL list and the configuration file from the first preset storage path, divide the first URL list into multiple second URL lists based on the first application container, and store the multiple second URL lists into a third preset storage path;
the data crawling module 140 is used to crawl the web data corresponding to each URL in the multiple second URL lists based on the multiple second application containers, and save the web data into a fourth preset storage path; and
the data sending module 150 is used to extract the web data from the fourth preset storage path based on the third application container, and send the web data to the user terminal corresponding to the web data crawling request.
In addition, an embodiment of the present invention further provides a computer-readable storage medium containing a web data crawling program 10, and when the web data crawling program 10 is executed by a processor, the following operations are implemented:
A1, receiving a web data crawling request, reading a first URL list according to the web data crawling request, the first URL list containing the URLs to be crawled, and storing the first URL list into a first preset storage path where a preset configuration file is located;
A2, reading a pre-built docker image from a second preset storage path, and generating multiple application containers from the docker image, wherein the application containers include a first application container, second application containers, and a third application container;
A3, reading the first URL list and the configuration file from the first preset storage path, dividing the first URL list into multiple second URL lists based on the first application container, and storing the multiple second URL lists into a third preset storage path;
A4, crawling the web data corresponding to each URL in the multiple second URL lists based on the multiple second application containers, and saving the web data into a fourth preset storage path; and
A5, extracting the web data from the fourth preset storage path based on the third application container, and sending the web data to the user terminal corresponding to the web data crawling request.
The specific embodiments of the computer-readable storage medium of the present invention are substantially the same as the specific embodiments of the web data crawling method described above, and are not described in detail here.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
It should be noted that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article, or method. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, device, article, or method that includes that element.
Through the description of the above embodiments, those skilled in the art can clearly understand that the method of the above embodiments can be implemented by means of software plus the necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product, which is stored in a storage medium as described above (such as ROM/RAM, magnetic disk, or optical disk) and includes several instructions for causing a terminal device (which may be a mobile phone, computer, server, network device, or the like) to execute the method described in each embodiment of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit the scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of protection of the present invention.

Claims (10)

1. A web data crawling method, applied to an electronic device, characterized in that the method comprises:
S1, receiving a web data crawling request, reading a first URL (Uniform Resource Locator) list according to the web data crawling request, the first URL list containing the URLs to be crawled, and storing the first URL list into a first preset storage path where a preset configuration file is located;
S2, reading a pre-built docker image from a second preset storage path, and generating multiple application containers from the docker image, wherein the application containers include a first application container, second application containers, and a third application container;
S3, reading the first URL list and the configuration file from the first preset storage path, dividing the first URL list into multiple second URL lists based on the first application container, and storing the multiple second URL lists into a third preset storage path;
S4, crawling the web data corresponding to each URL in the multiple second URL lists based on the multiple second application containers, and saving the web data into a fourth preset storage path; and
S5, extracting the web data from the fourth preset storage path based on the third application container, and sending the web data to the user terminal corresponding to the web data crawling request.
2. The web data crawling method according to claim 1, characterized in that, before step S1, the method further comprises:
receiving configuration parameters sent by a client, and obtaining from the configuration parameters the preconfigured number of concurrent processes and the storage paths specified for the web data and for each program; and
generating a configuration file according to the obtained number of concurrent processes and the file path of each program, and storing the configuration file into the first preset storage path.
3. The web data crawling method according to claim 1, characterized in that the first URL list further contains index value information corresponding to each URL, wherein the index value information corresponding to each URL is obtained by the following steps:
obtaining and analyzing the specific information of each URL in the first URL list, and determining the characteristic information of each URL; and
matching a corresponding index value for each URL in the first URL list according to the mapping relationship between characteristic information and index values.
4. The web data crawling method according to claim 3, characterized in that the method further comprises the following step:
when there is a URL for which no index value can be matched according to its characteristic information, generating prompt information based on that URL, and receiving a matching instruction that assigns an index value to that URL.
5. The web data crawling method according to any one of claims 1 to 4, characterized in that, before step S5, the method further comprises the following steps:
comparing the quantity of crawled web data with the quantity of URLs in the first URL list to determine a third URL list;
when the third URL list is not empty, performing the web data crawling operation for each URL in the third URL list until the third URL list is empty, and saving the web data corresponding to the third URL list into the fourth preset storage path; and
when the third URL list is empty, continuing with step S5.
6. The web data crawling method according to claim 5, characterized in that, before step S5, the method further comprises the following steps:
obtaining the second URL lists and a URL regular-expression mining program from the third preset storage path;
obtaining the web data corresponding to each second URL list from the fourth preset storage path, mining the web data corresponding to each second URL list to determine a fourth URL list corresponding to each second URL list, and saving the fourth URL lists into a fifth preset storage path; and
performing the web data crawling operation on the fourth URL lists, performing a sub-URL mining operation on the web data corresponding to the fourth URL lists to extract new sub-URLs, and performing the web data crawling operation on them, repeating this cycle.
7. An electronic device, characterized in that the device comprises a memory and a processor, wherein a web data crawling program that can run on the processor is stored on the memory, and when the web data crawling program is executed by the processor, the following steps can be implemented:
A1, receiving a web data crawling request, reading a first URL list according to the web data crawling request, the first URL list containing the URLs to be crawled, and storing the first URL list into a first preset storage path where a preset configuration file is located;
A2, reading a pre-built docker image from a second preset storage path, and generating multiple application containers from the docker image, wherein the application containers include a first application container, second application containers, and a third application container;
A3, reading the first URL list and the configuration file from the first preset storage path, dividing the first URL list into multiple second URL lists based on the first application container, and storing the multiple second URL lists into a third preset storage path;
A4, crawling the web data corresponding to each URL in the multiple second URL lists based on the multiple second application containers, and saving the web data into a fourth preset storage path; and
A5, extracting the web data from the fourth preset storage path based on the third application container, and sending the web data to the user terminal corresponding to the web data crawling request.
8. The electronic device according to claim 7, characterized in that, when the web data crawling program is executed by the processor, the following steps are also implemented before step A5:
comparing the quantity of crawled web data with the quantity of URLs in the first URL list to determine a third URL list;
when the third URL list is not empty, performing the web data crawling operation for each URL in the third URL list until the third URL list is empty, and saving the web data corresponding to the third URL list into the fourth preset storage path; and
when the third URL list is empty, continuing with step A5.
9. The electronic device according to any one of claims 7 and 8, characterized in that, when the web data crawling program is executed by the processor, the following steps are also implemented before step A5:
obtaining the second URL lists and a URL regular-expression mining program from the third preset storage path;
obtaining the web data corresponding to each second URL list from the fourth preset storage path, mining the web data corresponding to each second URL list to determine a fourth URL list corresponding to each second URL list, and saving the fourth URL lists into a fifth preset storage path;
performing the web data crawling operation on the fourth URL lists, continuing to perform the sub-URL mining operation on the web data corresponding to the fourth URL lists, extracting new sub-URLs, and performing the web data crawling operation on them, repeating this cycle.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium contains a web data crawling program, and when the web data crawling program is executed by a processor, the steps of the web data crawling method according to any one of claims 1 to 6 can be implemented.
CN201810791126.3A 2018-07-18 2018-07-18 Webpage data crawling method and device and storage medium Active CN110020060B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810791126.3A CN110020060B (en) 2018-07-18 2018-07-18 Webpage data crawling method and device and storage medium
PCT/CN2018/108218 WO2020015192A1 (en) 2018-07-18 2018-09-28 Webpage data crawling method and apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810791126.3A CN110020060B (en) 2018-07-18 2018-07-18 Webpage data crawling method and device and storage medium

Publications (2)

Publication Number Publication Date
CN110020060A true CN110020060A (en) 2019-07-16
CN110020060B CN110020060B (en) 2023-03-14

Family

ID=67188354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810791126.3A Active CN110020060B (en) 2018-07-18 2018-07-18 Webpage data crawling method and device and storage medium

Country Status (2)

Country Link
CN (1) CN110020060B (en)
WO (1) WO2020015192A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110888655A (en) * 2019-11-14 2020-03-17 中国民航信息网络股份有限公司 Application publishing method and device
CN113392301A (en) * 2021-06-08 2021-09-14 北京精准沟通传媒科技股份有限公司 Method, device, medium and electronic equipment for crawling data

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116361362B (en) * 2023-05-30 2023-08-11 江西顶易科技发展有限公司 User information mining method and system based on webpage content identification

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7676553B1 (en) * 2003-12-31 2010-03-09 Microsoft Corporation Incremental web crawler using chunks
US20120102019A1 (en) * 2010-10-25 2012-04-26 Korea Advanced Institute Of Science And Technology Method and apparatus for crawling webpages
CN103970788A (en) * 2013-02-01 2014-08-06 北京英富森信息技术有限公司 Webpage-crawling-based crawler technology

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106101176B (en) * 2016-05-27 2019-04-12 成都索贝数码科技股份有限公司 One kind is integrated to melt media cloud production delivery system and method
CN106484886A (en) * 2016-10-17 2017-03-08 金蝶软件(中国)有限公司 A kind of method of data acquisition and its relevant device
CN108197633A (en) * 2017-11-24 2018-06-22 百年金海科技有限公司 Deep learning image classification based on TensorFlow is with applying dispositions method
CN108062413B (en) * 2017-12-30 2019-05-28 平安科技(深圳)有限公司 Web data processing method, device, computer equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7676553B1 (en) * 2003-12-31 2010-03-09 Microsoft Corporation Incremental web crawler using chunks
US20120102019A1 (en) * 2010-10-25 2012-04-26 Korea Advanced Institute Of Science And Technology Method and apparatus for crawling webpages
CN103970788A (en) * 2013-02-01 2014-08-06 北京英富森信息技术有限公司 Webpage-crawling-based crawler technology

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110888655A (en) * 2019-11-14 2020-03-17 中国民航信息网络股份有限公司 Application publishing method and device
CN113392301A (en) * 2021-06-08 2021-09-14 北京精准沟通传媒科技股份有限公司 Method, device, medium and electronic equipment for crawling data

Also Published As

Publication number Publication date
WO2020015192A1 (en) 2020-01-23
CN110020060B (en) 2023-03-14

Similar Documents

Publication Publication Date Title
CN110020060A (en) Web data crawling method, device and storage medium
CN110737659A (en) Graph data storage and query method, device and computer readable storage medium
CN108958881A (en) Data processing method, device and computer readable storage medium
CN107908472A (en) Data synchronization unit, method and computer-readable recording medium
CN110442816A (en) Web form configuration method, device and computer readable storage medium
CN109617996B (en) File uploading and downloading method, server and computer readable storage medium
CN107967135A (en) Computing engines implementation method, electronic device and storage medium
CN107656729B (en) List view updating apparatus, method and computer-readable storage medium
CN108427698A (en) Updating device, method and the computer readable storage medium of prediction model
CN110363303B (en) Memory training method and device for intelligent distribution model and computer readable storage medium
CN107301091A (en) Resource allocation methods and device
CN112416458A (en) Preloading method and device based on ReactNative, computer equipment and storage medium
CN110764913B (en) Data calculation method based on rule calling, client and readable storage medium
CN109634916A (en) File storage and method for down loading, device and storage medium
CN108845839A (en) Application page loading method, device and computer readable storage medium
CN103838851B (en) The rendering intent and device of three-dimensional scene models file
CN109190062A (en) Crawling method, device and the storage medium of target corpus data
CN108073698B (en) Real-time animation display methods, device, electric terminal and readable storage medium storing program for executing
CN110274607A (en) Intelligent paths planning method, device and computer readable storage medium
CN110245281B (en) Internet asset information collection method and terminal equipment
US20140181064A1 (en) Geographical area correlated websites
CN108427586B (en) Using push terminal, method and the computer readable storage medium of theme
CN108845864A (en) A kind of JVM rubbish recovering method and device based on spring frame
CN112529711A (en) Transaction processing method and device based on block chain virtual machine multiplexing
CN107729523A (en) Data service method, electronic installation and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant