CN109885744A - Web data crawling method, device, system, computer equipment and storage medium - Google Patents

Publication number
CN109885744A
Authority
CN
China
Prior art keywords
server
source code
web
network address
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910012240.6A
Other languages
Chinese (zh)
Other versions
CN109885744B (en
Inventor
吴壮伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)

Landscapes

  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a web data crawling method, device, system, computer equipment and storage medium. The method comprises: receiving a network address distributed by a second server; crawling the web page content information corresponding to the network address by means of a deployed code skeleton; parsing the web page content information by means of the code skeleton to obtain web analysis content; sending the web analysis content to the storage region in the second server that corresponds to the first server for storage; parsing the source code in the web analysis content by means of the code skeleton to obtain corresponding source code parsing information; and sending the source code parsing information to the storage region in the second server that corresponds to the first server for storage. The method saves the crawled web page content so that the data can be traced back to its source, and also allows the web page content to be parsed a second time.

Description

Web data crawling method, device, system, computer equipment and storage medium
Technical field
The present invention relates to the technical field of data crawling, and in particular to a web data crawling method, device, system, computer equipment and storage medium.
Background art
At present, crawler systems perform targeted crawling of specified content. Websites are frequently revised, and the position from which data is grabbed is sometimes wrong; in either case the crawl has to be repeated, which makes post-processing of the web page content relatively costly.
Summary of the invention
The embodiments of the present invention provide a web data crawling method, device, system, computer equipment and storage medium, intended to solve the problem that crawler systems in the prior art perform targeted crawling of specified content and, when a website is revised or the data-grabbing position is wrong, must crawl again and cannot trace the data back to its source.
In a first aspect, an embodiment of the present invention provides a web data crawling method, applied to a first server, comprising:
receiving a network address distributed by a second server, the network address being a subset of the target website network address set that the second server received from a user terminal;
crawling the web page content information corresponding to the network address by means of a deployed code skeleton;
parsing the web page content information by means of the code skeleton to obtain web analysis content;
sending the web analysis content to the storage region in the second server that corresponds to the first server for storage;
parsing the source code in the web analysis content by means of the code skeleton to obtain corresponding source code parsing information; and
sending the source code parsing information to the storage region in the second server that corresponds to the first server for storage.
In a second aspect, an embodiment of the present invention provides a further web data crawling method, applied to a second server, comprising:
receiving a target website network address set to be crawled, sent by a user terminal;
distributing each network address in the target website network address set to a corresponding first server;
receiving the web analysis content sent by the first server, the web analysis content being obtained by the first server crawling and parsing the web page content information corresponding to the network address;
receiving the source code parsing information sent by the first server, the source code parsing information being obtained by the first server parsing the source code of the web analysis content; and
receiving a search condition, and obtaining the corresponding search result from the web analysis content and the source code parsing information according to the search condition.
In a third aspect, an embodiment of the present invention provides a web data crawling device, which includes units for executing the web data crawling method of the first aspect, or units for executing the web data crawling method of the second aspect.
In a fourth aspect, an embodiment of the present invention provides a web data crawling system including a first server and a second server, the first server being used to execute the web data crawling method of the first aspect and the second server being used to execute the web data crawling method of the second aspect.
In a fifth aspect, an embodiment of the present invention provides computer equipment comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the web data crawling method of the first aspect or the web data crawling method of the second aspect.
In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the web data crawling method of the first aspect or the web data crawling method of the second aspect.
The embodiments of the present invention provide a web data crawling method, device, system, computer equipment and storage medium. The method includes: receiving a network address distributed by a second server; crawling the web page content information corresponding to the network address by means of a deployed code skeleton; parsing the web page content information by means of the code skeleton to obtain web analysis content; sending the web analysis content to the storage region in the second server that corresponds to the first server for storage; parsing the source code in the web analysis content by means of the code skeleton to obtain corresponding source code parsing information; and sending the source code parsing information to the storage region in the second server that corresponds to the first server for storage. The method saves the crawled web page content so that the data can be traced back to its source, and also allows the web page content to be parsed a second time.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario of the web data crawling system provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of the web data crawling method provided by an embodiment of the present invention;
Fig. 3 is another flow diagram of the web data crawling method provided by an embodiment of the present invention;
Fig. 4 is a schematic block diagram of the web data crawling device provided by an embodiment of the present invention;
Fig. 5 is another schematic block diagram of the web data crawling device provided by an embodiment of the present invention;
Fig. 6 is a schematic block diagram of the computer equipment provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
It should be understood that, when used in this specification and the appended claims, the terms "includes" and "comprising" indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or sets thereof.
It should also be understood that the terms used in this specification are intended only to describe specific embodiments and not to limit the present invention. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes those combinations.
Referring to Fig. 1 and Fig. 2, Fig. 1 is a schematic diagram of an application scenario of the web data crawling method provided by an embodiment of the present invention, and Fig. 2 is a flow diagram of that method. The web data crawling method is applied to a first server and is executed by application software installed on the first server.
As shown in Fig. 2, the method comprises steps S111 to S116.
S111, receiving the network address distributed by the second server; the network address is a subset of the target website network address set that the second server received from the user terminal.
In this embodiment, the technical solution is described from the perspective of the first server. The first server may be a single load end or multiple load ends. A load end receives the network addresses distributed by the second server, crawls the web page content at those addresses, parses it twice, and sends the results of both parses to the second server for storage, so that the web page content can be traced back to its source.
After the second server has received the target website network address set uploaded by the user terminal, it may send a single network address to a first server, or it may send several network addresses to the same first server. The first server then starts the web page crawling task.
In one embodiment, the method further includes, before step S111:
initially deploying an application container engine;
packaging, in the application container engine, the code skeleton used for crawling web page content and parsing web page content;
setting the storage region in the second server that corresponds to the application container engine.
In this embodiment, the application container engine is a Docker container (developers can package an application and its dependencies into a portable container; a Docker container can be regarded as a miniature system). The code skeleton for crawling and parsing web page content can be packaged in the Docker container, and the code skeleton performs the crawling and the two parsing passes. To distinguish the storage region in the second server that corresponds to each Docker container, the second server is provided with as many storage regions as there are Docker containers, and each storage region is named after the identifier of its Docker container. In this way the data parsed in each Docker container can be stored in the corresponding storage region of the second server.
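The naming convention described above, one storage region per Docker container and each region named after the container identifier, might be sketched as follows. The `region_` prefix, the sanitising rule and both function names are assumptions made for illustration; the source does not specify them.

```python
import re


def storage_region_name(container_id: str, prefix: str = "region_") -> str:
    """Derive a storage-region name from a container identifier (hypothetical scheme)."""
    # Replace characters that are awkward in table or directory names.
    safe = re.sub(r"[^0-9A-Za-z_]", "_", container_id)
    return f"{prefix}{safe}"


def build_region_map(container_ids):
    """Create exactly one storage region per container, keyed by container identifier."""
    regions = {cid: storage_region_name(cid) for cid in container_ids}
    # Distinct containers must not end up sharing a region after sanitising.
    if len(set(regions.values())) != len(regions):
        raise ValueError("container identifiers collide after sanitising")
    return regions
```

Because the mapping is derived purely from the identifier, the second server can locate a container's region without keeping a separate lookup table.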
S112, crawling the web page content information corresponding to the network address by means of the deployed code skeleton.
In this embodiment, when the application container engine in the first server receives a network address distributed by the second server, the application container engine needs to start the code skeleton packaged in it in order to crawl the web page content information corresponding to that network address.
In one embodiment, the method further includes, before step S112:
establishing, according to the network address, a connection with the target access end corresponding to the network address.
In this embodiment, when the application container engine in the first server receives a network address distributed by the second server, the first server needs to establish a connection with the target access end corresponding to that network address. Once the connection succeeds, web page content can be crawled from the target access end. More specifically, the network address (which can be regarded as the target website URL address) is written into the Docker container deployed in the first server, which means that the first server now knows the URL address of the target access end to which a connection is requested. After the Docker container deployed in the first server has established a connection with the target access end, web page content can be crawled from it.
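As a rough illustration of connecting to the target access end and crawling a page, the sketch below uses a plain fetch from Python's standard library. This is a simplification: the source describes a simulated-browser approach inside the Docker container, which is not reproduced here. The function name `crawl_page` and the keys of the returned dictionary are hypothetical.

```python
import time
import urllib.request


def crawl_page(url: str, timeout: float = 10.0) -> dict:
    """Open a connection to the target access end and return the raw page source."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the response declares no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        source = response.read().decode(charset, errors="replace")
    # Record the URL and crawl time alongside the source for later storage.
    return {"url": url, "crawl_time": time.time(), "source": source}
```

A production crawler would typically add retries, rate limiting and a rendering step for script-heavy pages; none of that is specified in the source.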
S113, parsing the web page content information by means of the code skeleton to obtain web analysis content.
In this embodiment, the web analysis content obtained by parsing includes at least the source code of the web page, the target website URL address, the web page crawl time and the file directory of the web page.
That is, after the Docker container in the first server, acting as a simulated browser, loads the target website URL address that identifies the target access end, establishes a connection with the target access end and finishes acquiring the source code, the source code is saved as a file in the storage region of the second server that corresponds to that Docker container. At the same time, the index information of the source code is stored in the MySQL database of the second server (MySQL is an open-source relational database management system). The web page information stored in the MySQL database of the second server also includes the target website URL address, the web page crawl time and the file directory of the web page. This process completes the first parsing of the web page content information, so that each item of crawled web page content information is parsed before being sent to the second server for storage.
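The first parsing pass described above, saving the source code as a file and recording its index information in a database, could look roughly like the sketch below. SQLite stands in for the MySQL database so the example runs without a server; the table name `page_index`, the hashed file name and the function name are all assumptions.

```python
import hashlib
import sqlite3
import time
from pathlib import Path


def save_first_parse(conn, region_dir, url, source, crawl_time=None):
    """Save the page source as a file and index it by URL, crawl time and file directory."""
    crawl_time = time.time() if crawl_time is None else crawl_time
    region = Path(region_dir)
    region.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one file per URL, named by the URL's hash.
    file_path = region / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    file_path.write_text(source, encoding="utf-8")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_index "
        "(url TEXT, crawl_time REAL, file_directory TEXT)"
    )
    conn.execute(
        "INSERT INTO page_index VALUES (?, ?, ?)",
        (url, crawl_time, str(file_path)),
    )
    conn.commit()
    return file_path
```

Keeping the bulky source on disk and only the index in the database keeps the table small while still letting a query by URL find the saved file.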
S114, sending the web analysis content to the storage region in the second server that corresponds to the first server for storage.
In this embodiment, the web analysis content is stored in the storage region of the second server that corresponds to the Docker container in the first server, so that the crawled web page content can later be queried in the second server and thereby traced back to its source.
S115, parsing the source code in the web analysis content by means of the code skeleton to obtain corresponding source code parsing information.
In one embodiment, step S115 includes:
obtaining the regular expression rule cluster constructed in advance in the code skeleton for identifying source code;
obtaining, by means of the regular expression rule cluster, the segmented source codes in the source code that correspond one-to-one to the regular expression rules in the rule cluster;
parsing each segmented source code with the segment parsing code associated with its regular expression rule, to obtain segmented source code parsing information corresponding to each segmented source code;
combining the segmented source code parsing information to obtain the source code parsing information.
In this embodiment, an association is first established in the code skeleton between the source code parsing codes {code1, code2, ..., codem} and the regular expression rules for the source code to be parsed; that is, a regular expression rule cluster {rule1, rule2, ..., rulem} is constructed. If the content of the source code satisfies a regular expression rule, 1 is returned; otherwise 0 is returned. Applying the rule cluster {rule1, rule2, ..., rulem} to the source code identifies the segmented source codes and, for each of them, the corresponding segment parsing code codei. Each segmented source code is parsed with its codei to obtain the corresponding segmented source code parsing information, and these pieces are then combined (here, concatenated in series with a separator between them) to obtain the source code parsing information. The regular expression rule cluster and the segment parsing codes together perform the second parsing of the source code in the web analysis content, which makes it possible to dig deeper into the attributes of the web page content (such as CSS files and JS files, where CSS stands for cascading style sheets and a JS file is a front-end script file of the web page).
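The second parsing pass can be illustrated with a small sketch: a rule cluster of regular expressions, one parsing function associated with each rule, and a final step that joins the per-segment results with a separator. The two rules shown (linked CSS and JS files), the `|` separator and all names are assumptions chosen for illustration; the source does not list its actual rules.

```python
import re

# Hypothetical rule cluster {rule1, ..., rulem}: one regular expression per segment type.
RULES = {
    "css": re.compile(r'<link[^>]+href="([^"]+\.css)"', re.IGNORECASE),
    "js": re.compile(r'<script[^>]+src="([^"]+\.js)"', re.IGNORECASE),
}

# Hypothetical parsing codes {code1, ..., codem}, associated with the rules by key.
PARSERS = {
    "css": lambda match: "css=" + match.group(1),
    "js": lambda match: "js=" + match.group(1),
}


def rule_matches(name: str, source: str) -> int:
    """Return 1 if the source satisfies the named rule, otherwise 0."""
    return 1 if RULES[name].search(source) else 0


def parse_source(source: str, separator: str = "|") -> str:
    """Parse every matched segment with its associated parser and join the results."""
    pieces = []
    for name, rule in RULES.items():
        for match in rule.finditer(source):
            pieces.append(PARSERS[name](match))
    return separator.join(pieces)
```

Regular expressions are brittle on arbitrary HTML, so this only mirrors the rule-to-parser association described in the text; a real deployment might prefer an HTML parser for the segmentation step.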
S116, sending the source code parsing information to the storage region in the second server that corresponds to the first server for storage.
In this embodiment, the source code parsing information is stored in the storage region of the second server that corresponds to the Docker container in the first server, so that the source code parsing information of the crawled web page content can later be queried in the second server and thereby traced back to its source.
An embodiment of the present invention also provides another web data crawling method. Referring to Fig. 1 and Fig. 3, Fig. 3 is another flow diagram of the web data crawling method provided by an embodiment of the present invention. This web data crawling method is applied to a second server and is executed by application software installed on the second server.
As shown in Fig. 3, the method comprises steps S121 to S125.
S121, receiving the target website network address set to be crawled, sent by the user terminal.
In this embodiment, the technical solution is described from the perspective of the second server. The second server can be regarded as the primary server: it distributes the network addresses to be crawled to the first servers, and it is divided into multiple storage spaces that store the web analysis content and source code parsing information sent by the first servers.
In one embodiment, step S121 includes:
receiving, through a remote dictionary service database, the target website network address set to be crawled, sent by the user terminal.
Here the remote dictionary service database is a Redis database (Redis stands for Remote Dictionary Server; it is a key-value storage system that supports rich data types). The second server receives, through the Redis database, the target website network address set to be crawled that the user terminal sends, and distributes subsets of that set to the first servers.
S122, distributing each network address in the target website network address set to a corresponding first server.
In this embodiment, when distributing the network addresses in the target website network address set, the second server may distribute a single network address to a given first server, or several network addresses to the same first server.
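The two distribution options above, one network address per first server or several per first server, can be captured with a single batch-size parameter. The round-robin order and the names `distribute` and `batch_size` are assumptions; the source does not say how the second server chooses which first server receives which address.

```python
from collections import defaultdict


def distribute(urls, servers, batch_size=1):
    """Assign network addresses to first servers in round-robin batches."""
    if not servers:
        raise ValueError("at least one first server is required")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    assignment = defaultdict(list)
    for start in range(0, len(urls), batch_size):
        # Consecutive batches go to consecutive servers, wrapping around.
        server = servers[(start // batch_size) % len(servers)]
        assignment[server].extend(urls[start:start + batch_size])
    return dict(assignment)
```

With `batch_size=1` each address is handed out individually; a larger value sends several addresses to the same first server in one go.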
S123, receiving the web analysis content sent by the first server; the web analysis content is obtained by the first server crawling and parsing the web page content information corresponding to the network address.
In this embodiment, once the first server has parsed the crawled web page content information and obtained web analysis content that includes the source code of the web page, the target website URL address, the web page crawl time and the file directory of the web page, the web analysis content sent by the first server is stored in the data table of the second server's MySQL database that corresponds to that first server. In other words, the data table corresponding to the first server is regarded as the storage region corresponding to the first server. Storing the web analysis content in the second server makes later traceability retrieval easier.
S124, receiving the source code parsing information sent by the first server; the source code parsing information is obtained by the first server parsing the source code of the web analysis content.
In this embodiment, once the first server has parsed the source code of the web analysis content and obtained the source code parsing information, the source code parsing information sent by the first server is stored in the data table of the second server's MySQL database that corresponds to that first server; again, that data table is regarded as the storage region corresponding to the first server. Storing the source code parsing information in the second server makes later traceability retrieval easier. For a given network address, the result of the first parse (the web analysis content) and the result of the second parse (the source code parsing information) are stored in the same data table of the MySQL database, so that all information obtained by a first server crawling and parsing the same network address is kept in the same region.
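A minimal sketch of this storage layout follows: one data table per first server, holding both the first-parse and second-parse results for each network address in the same row. SQLite stands in for MySQL so the example is self-contained; the `server_` table prefix, the column names and the function names are assumptions.

```python
import re
import sqlite3


def table_for(server_id: str) -> str:
    """Map a first-server identifier to its data table (hypothetical naming scheme)."""
    # Table names cannot be bound as SQL parameters, so validate before interpolating.
    if not re.fullmatch(r"\w+", server_id):
        raise ValueError("server_id must contain only letters, digits or underscores")
    return f"server_{server_id}"


def store_web_analysis(conn, server_id, url, crawl_time, source):
    """Store the first-parse result in the table belonging to the given first server."""
    table = table_for(server_id)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(url TEXT PRIMARY KEY, crawl_time REAL, source TEXT, parse_info TEXT)"
    )
    conn.execute(
        f"INSERT OR REPLACE INTO {table} (url, crawl_time, source) VALUES (?, ?, ?)",
        (url, crawl_time, source),
    )
    conn.commit()


def store_source_parse(conn, server_id, url, parse_info):
    """Attach the second-parse result to the row already stored for the same URL."""
    table = table_for(server_id)
    conn.execute(f"UPDATE {table} SET parse_info = ? WHERE url = ?", (parse_info, url))
    conn.commit()
```

Keying both results on the network address means a single lookup returns everything the first server produced for that page.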
S125, receiving a search condition, and obtaining the corresponding search result from the web analysis content and the source code parsing information according to the search condition.
In this embodiment, regardless of whether one or several of the target website URL addresses in the target website network address set (which can be regarded as multiple target website URL addresses to be crawled) have been revised, the storage regions of the second server already hold historical data for those URL addresses. Therefore, when a Docker container is started and its crawl of a revised target website URL address fails, a retrieval instruction for the second server is triggered with that URL address, and the storage regions are searched using the URL address as the search condition. Once the source code has been found in the storage region corresponding to the URL address, it can be parsed a second time and the original web page file quickly recovered, which provides a channel for traceability.
The method saves the crawled web page content so that the data can be traced back to its source, and also allows the web page content to be parsed a second time.
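The fallback described above, searching the stored history by URL when a fresh crawl fails, might be sketched as follows. SQLite again stands in for the second server's MySQL database; the `pages` table, the `CrawlError` exception and both function names are hypothetical.

```python
import sqlite3


class CrawlError(Exception):
    """Raised by a crawl function when a page can no longer be crawled as expected."""


def find_history(conn, url):
    """Use the URL as the search condition and return stored versions, newest first."""
    return conn.execute(
        "SELECT crawl_time, source FROM pages WHERE url = ? ORDER BY crawl_time DESC",
        (url,),
    ).fetchall()


def crawl_or_recover(conn, url, crawl):
    """Try a live crawl; if it fails, fall back to the most recent stored source."""
    try:
        return crawl(url)
    except CrawlError:
        history = find_history(conn, url)
        if not history:
            raise  # nothing was ever stored for this URL, so it cannot be recovered
        return history[0][1]
```

The recovered source can then be passed through the second parsing pass again, which is what allows the original page to be reconstructed after a site revision.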
An embodiment of the present invention also provides a web data crawling device, which is used to execute any embodiment of the foregoing web data crawling method. An embodiment of the present invention also provides a web data crawling system, which includes a first server and a second server. Specifically, referring to Fig. 4, Fig. 4 is a schematic block diagram of the web data crawling device provided by an embodiment of the present invention. The web data crawling device 100 can be configured in the first server.
As shown in Fig. 4, the web data crawling device 100 includes a network address receiving unit 111, a web page content crawling unit 112, a first parsing unit 113, a first sending unit 114, a second parsing unit 115 and a second sending unit 116.
The network address receiving unit 111 is used for receiving the network address distributed by the second server; the network address is a subset of the target website network address set that the second server received from the user terminal.
In this embodiment, the technical solution is described from the perspective of the first server. The first server may be a single load end or multiple load ends. A load end receives the network addresses distributed by the second server, crawls the web page content at those addresses, parses it twice, and sends the results of both parses to the second server for storage, so that the web page content can be traced back to its source.
After the second server has received the target website network address set uploaded by the user terminal, it may send a single network address to a first server, or it may send several network addresses to the same first server. The first server then starts the web page crawling task.
In one embodiment, the web data crawling device 100 further includes:
a container deployment unit, used for initially deploying an application container engine;
a code skeleton deployment unit, used for packaging, in the application container engine, the code skeleton used for crawling web page content and parsing web page content;
a storage region setting unit, used for setting the storage region in the second server that corresponds to the application container engine.
In this embodiment, the application container engine is a Docker container (developers can package an application and its dependencies into a portable container; a Docker container can be regarded as a miniature system). The code skeleton for crawling and parsing web page content can be packaged in the Docker container, and the code skeleton performs the crawling and the two parsing passes. To distinguish the storage region in the second server that corresponds to each Docker container, the second server is provided with as many storage regions as there are Docker containers, and each storage region is named after the identifier of its Docker container. In this way the data parsed in each Docker container can be stored in the corresponding storage region of the second server.
The web page content crawling unit 112 is used for crawling the web page content information corresponding to the network address by means of the deployed code skeleton.
In this embodiment, when the application container engine in the first server receives a network address distributed by the second server, the application container engine needs to start the code skeleton packaged in it in order to crawl the web page content information corresponding to that network address.
In one embodiment, the web data crawling device 100 further includes:
a connection establishment unit, used for establishing, according to the network address, a connection with the target access end corresponding to the network address.
In this embodiment, when the application container engine in the first server receives a network address distributed by the second server, the first server needs to establish a connection with the target access end corresponding to that network address. Once the connection succeeds, web page content can be crawled from the target access end. More specifically, the network address (which can be regarded as the target website URL address) is written into the Docker container deployed in the first server, which means that the first server now knows the URL address of the target access end to which a connection is requested. After the Docker container deployed in the first server has established a connection with the target access end, web page content can be crawled from it.
The first parsing unit 113 is used for parsing the web page content information by means of the code skeleton to obtain web analysis content.
In this embodiment, the web analysis content obtained by parsing includes at least the source code of the web page, the target website URL address, the web page crawl time and the file directory of the web page.
That is, after the Docker container in the first server, acting as a simulated browser, loads the target website URL address that identifies the target access end, establishes a connection with the target access end and finishes acquiring the source code, the source code is saved as a file in the storage region of the second server that corresponds to that Docker container. At the same time, the index information of the source code is stored in the MySQL database of the second server (MySQL is an open-source relational database management system). The web page information stored in the MySQL database of the second server also includes the target website URL address, the web page crawl time and the file directory of the web page. This process completes the first parsing of the web page content information, so that each item of crawled web page content information is parsed before being sent to the second server for storage.
The first sending unit 114 is used for sending the web analysis content to the storage region in the second server that corresponds to the first server for storage.
In this embodiment, the web analysis content is stored in the storage region of the second server that corresponds to the Docker container in the first server, so that the crawled web page content can later be queried in the second server and thereby traced back to its source.
Second resolution unit 115, for solving the source code in the web analysis content by the code skeleton Analysis obtains corresponding source code parsing information.
In one embodiment, the second resolution unit 115 includes:
a rule cluster acquiring unit, configured to obtain the regular expression rule cluster constructed in advance in the code skeleton for identifying source code;
a segmented source code acquiring unit, configured to obtain, by means of the regular expression rule cluster, the segmented source code in the source code that corresponds one-to-one to each regular expression rule in the cluster;
a segmented source code resolution unit, configured to parse each segmented source code with the parsing code that corresponds to that segment's regular expression rule, obtaining segmented source code parsing information for each segmented source code;
an information combination unit, configured to combine the pieces of segmented source code parsing information to obtain the source code parsing information.
In this embodiment, an association is first established in the code skeleton between the source code parsing codes {code1, code2, ..., codem} and the regular expression rules for the source code to be parsed, which yields the regular expression rule cluster {rule1, rule2, ..., rulem}. If the content of the source code satisfies a regular expression rule, 1 is returned; otherwise 0 is returned. Applying the rule cluster {rule1, rule2, ..., rulem} to the source code identifies the segmented source code and, for each segment, the corresponding parsing code codei. Each segment is then parsed with its codei to obtain the segmented source code parsing information, and the pieces are combined into the source code parsing information (here, combining means concatenating the pieces in series, separated from one another by a separator). The regular expression rule cluster and the parsing codes thus implement a secondary parsing of the source code in the web analysis content, which allows deeper mining of the attributes of the web page content (such as CSS files and JS files, where CSS stands for Cascading Style Sheets and a JS file is a front-end script file of the webpage).
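The secondary parsing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the three rules (CSS links, JS scripts, page title), the parser functions paired with them, the `name=value` output layout, and the `|` separator are all hypothetical choices.

```python
import re

# Hypothetical rule cluster: each entry pairs a regular expression rule (rule_i)
# with the parsing code (code_i) applied to the segments that rule matches.
RULE_CLUSTER = [
    ("css", re.compile(r'<link[^>]+href="([^"]+\.css)"', re.I),
     lambda segments: ",".join(segments)),
    ("js", re.compile(r'<script[^>]+src="([^"]+\.js)"', re.I),
     lambda segments: ",".join(segments)),
    ("title", re.compile(r"<title>(.*?)</title>", re.I | re.S),
     lambda segments: segments[0].strip()),
]


def rule_matches(pattern: re.Pattern, source: str) -> int:
    """Return 1 if the source code satisfies the rule, otherwise 0."""
    return 1 if pattern.search(source) else 0


def second_parse(source: str, separator: str = "|") -> str:
    """Apply every rule, parse the matched segments, and join the results."""
    parts = []
    for name, pattern, parser in RULE_CLUSTER:
        if rule_matches(pattern, source) == 0:
            continue  # rule not satisfied: no segment for this rule
        segments = pattern.findall(source)        # segmented source code
        parts.append(f"{name}={parser(segments)}")  # segment parsing information
    return separator.join(parts)                  # concatenated in series
```

For a page whose head contains a title, one stylesheet link, and one script tag, `second_parse` returns one `name=value` piece per satisfied rule, joined by the separator. Regular expressions are used here because the text names them; a production parser for HTML would likely need something more robust.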
The second transmission unit 116 is configured to send the source code parsing information to the storage region in the second server that corresponds to the first server for storage.
In this embodiment, the source code parsing information is stored in the storage region of the second server that corresponds to the Docker container in the first server, so that the source code parsing information of the crawled web page content can later be queried in the second server, thereby enabling data tracing.
An embodiment of the present invention also provides a web data crawling device for executing any embodiment of the foregoing web data crawling method. Specifically, refer to Fig. 5, which is another schematic block diagram of the web data crawling device provided by an embodiment of the present invention. The web data crawling device 100 may be configured in the second server.
As shown in Fig. 5, the web data crawling device 100 includes an address set receiving unit 121, a network address distribution unit 122, a first storage unit 123, a second storage unit 124, and a retrieval unit 125.
The address set receiving unit 121 is configured to receive the target website network address set to be crawled sent by the user terminal.
In this embodiment, the technical solution is described from the perspective of the second server. The second server can be regarded as the primary server: it distributes the network addresses to be crawled to the first servers, and it is divided into multiple storage spaces that store the web analysis content and source code parsing information sent by the first servers.
In one embodiment, the address set receiving unit 121 is specifically configured to:
receive, through a remote dictionary server database, the target website network address set to be crawled sent by the user terminal.
Here, the remote dictionary server database is a Redis database (Redis stands for Remote Dictionary Server; it is a key-value storage system that supports a rich set of data types). The Redis database receives the target website network address set to be crawled sent by the user terminal, and subsets of that set are distributed to the first servers.
The network address distribution unit 122 is configured to distribute each network address in the target website network address set to the corresponding first server.
In this embodiment, when distributing the network addresses in the target website network address set, the second server may distribute a single network address to a given first server, or it may distribute multiple network addresses to the same first server.
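One possible distribution policy is sketched below. The patent only states that a first server may receive one network address or several; the round-robin order, the `batch_size` parameter, and the function name `distribute` are assumptions introduced for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def distribute(urls: Iterable[str], servers: List[str],
               batch_size: int = 1) -> Dict[str, List[str]]:
    """Assign URLs to first servers in round-robin batches (assumed policy)."""
    if not servers:
        raise ValueError("at least one first server is required")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    url_list = list(urls)
    assignment: Dict[str, List[str]] = defaultdict(list)
    # Walk the URL list in chunks of batch_size and rotate through the servers.
    for batch_index, start in enumerate(range(0, len(url_list), batch_size)):
        server = servers[batch_index % len(servers)]
        assignment[server].extend(url_list[start:start + batch_size])
    return dict(assignment)
```

With `batch_size=1` each dispatch hands a single address to a server; a larger value hands several addresses to the same server at once, which corresponds to the two cases the embodiment mentions.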
The first storage unit 123 is configured to receive the web analysis content sent by the first server; the web analysis content is obtained by the first server crawling and parsing the web page content information corresponding to the network address.
In this embodiment, once the first server has finished parsing the crawled web page content information and obtained the web analysis content, which includes the source code of the webpage, the target website URL address, the web page crawl time, and the file directory of the webpage, the web analysis content sent by the first server is stored in the data table of the second server's MySQL database that corresponds to the first server. In other words, the data table corresponding to the first server in the MySQL database of the second server is regarded as the storage region corresponding to the first server. Storing the web analysis content in the second server facilitates subsequent tracing and retrieval.
The second storage unit 124 is configured to receive the source code parsing information sent by the first server; the source code parsing information is obtained by the first server parsing the source code of the web analysis content.
In this embodiment, once the first server has finished parsing the source code of the web analysis content and obtained the source code parsing information, the source code parsing information sent by the first server is stored in the data table of the second server's MySQL database that corresponds to the first server. In other words, the data table corresponding to the first server in the MySQL database of the second server is regarded as the storage region corresponding to the first server. Storing the source code parsing information in the second server facilitates subsequent tracing and retrieval. The first parsing content (the web analysis content) and the second parsing content (the source code parsing information) of the web page content information crawled from the same network address are stored in the same data table of the MySQL database, so that the information obtained after a first server crawls and parses a given network address is stored in the same region.
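The per-server data table can be illustrated with a small sketch. Note the substitution: the embodiment uses a MySQL database, whereas this example uses Python's built-in `sqlite3` module as a stand-in so that it runs without a database server. The table layout, column names, and function names are hypothetical.

```python
import re
import sqlite3


def region_table(server_id: str) -> str:
    """Map a first-server identifier to its data table (its storage region)."""
    # Table names cannot be bound as SQL parameters, so validate strictly.
    if not re.fullmatch(r"[A-Za-z0-9_]+", server_id):
        raise ValueError(f"unsafe server id: {server_id!r}")
    return f"region_{server_id}"


def store_web_parse(conn: sqlite3.Connection, server_id: str, url: str,
                    crawl_time: str, file_dir: str, source_code: str) -> None:
    """Store the first parsing result (web analysis content) for one URL."""
    table = region_table(server_id)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "url TEXT PRIMARY KEY, crawl_time TEXT, file_dir TEXT, "
        "source_code TEXT, source_parse_info TEXT)"
    )
    # A new crawl of the same URL replaces the previous row in this sketch.
    conn.execute(
        f"INSERT OR REPLACE INTO {table} "
        "(url, crawl_time, file_dir, source_code, source_parse_info) "
        "VALUES (?, ?, ?, ?, NULL)",
        (url, crawl_time, file_dir, source_code),
    )
    conn.commit()


def store_source_parse(conn: sqlite3.Connection, server_id: str, url: str,
                       source_parse_info: str) -> None:
    """Store the second parsing result in the same row of the same table."""
    table = region_table(server_id)
    conn.execute(
        f"UPDATE {table} SET source_parse_info = ? WHERE url = ?",
        (source_parse_info, url),
    )
    conn.commit()
```

Keeping both parsing results in one row per URL mirrors the statement that the first and second parsing content for the same network address are stored in the same data table.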
The retrieval unit 125 is configured to receive a search condition and, according to the search condition, obtain the corresponding search result from the web analysis content and the source code parsing information.
In this embodiment, if the website behind one or more target website URL addresses in the target website network address set (which can be understood as multiple target website URL addresses to be crawled) is revised, the historical data for those target website URL addresses has already been saved in the storage region of the second server. Therefore, when a Docker container is started to crawl a revised target website URL address and the crawl fails, a retrieval instruction to the second server is triggered by that target website URL address, and the multiple storage regions are searched with the target website URL address as the search condition. After the source code is obtained from the storage region corresponding to the target website URL address, secondary parsing can be performed on it to quickly recover the original web page files, which provides a channel for tracing.
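The fallback from a failed crawl to stored history might look like the following sketch. The in-memory `regions` mapping stands in for the storage regions of the second server, and the names `retrieve_source` and `crawl_with_fallback` as well as the `"live"` and `"history"` labels are assumptions for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical stand-in for the storage regions: region name -> {url: source code}
Regions = Dict[str, Dict[str, str]]


def retrieve_source(regions: Regions, url: str) -> Optional[str]:
    """Search every storage region using the URL as the search condition."""
    for stored in regions.values():
        if url in stored:
            return stored[url]
    return None


def crawl_with_fallback(url: str, crawl: Callable[[str], str],
                        regions: Regions) -> Tuple[str, str]:
    """Try a live crawl; if it fails, fall back to the saved source code."""
    try:
        return crawl(url), "live"
    except Exception:
        stored = retrieve_source(regions, url)
        if stored is None:
            raise  # no historical copy exists, so surface the crawl failure
        return stored, "history"
```

The source code returned from history can then be passed through the secondary parsing step in the same way as freshly crawled source code.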
The device saves the crawled web page content to facilitate data tracing, and can also perform secondary parsing on the web page content.
The above web data crawling device can be implemented in the form of a computer program, which can run on a computer device as shown in Fig. 6.
Refer to Fig. 6, which is a schematic block diagram of a computer device provided by an embodiment of the present invention. The computer device 500 is a server, which may be a standalone server or a server cluster composed of multiple servers.
Referring to Fig. 6, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When executed, the computer program 5032 causes the processor 502 to perform the web data crawling method.
The processor 502 provides computing and control capabilities and supports the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503. When executed by the processor 502, the computer program 5032 causes the processor 502 to perform the web data crawling method.
The network interface 505 is used for network communication, such as transmitting data information. Those skilled in the art will understand that the structure shown in Fig. 6 is only a block diagram of the part of the structure relevant to the solution of the present invention and does not limit the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown, combine certain components, or have a different component arrangement.
Where the computer device serves as the first server, the processor 502 is configured to run the computer program 5032 stored in the memory to implement the following functions: receiving a network address distributed by the second server, where the network address is a subset of the target website network address set that the second server receives from the user terminal; crawling the web page content information corresponding to the network address by means of the deployed code skeleton; parsing the web page content information by means of the code skeleton to obtain web analysis content; sending the web analysis content to the storage region in the second server corresponding to the first server for storage; parsing the source code in the web analysis content by means of the code skeleton to obtain corresponding source code parsing information; and sending the source code parsing information to the storage region in the second server corresponding to the first server for storage.
In one embodiment, before executing the step of receiving the network address distributed by the second server, the processor 502 also performs the following operations: initially deploying an application container engine; packaging, in the application container engine, the code skeleton for crawling and parsing web page content; and setting the storage region in the second server corresponding to the application container engine.
In one embodiment, before executing the step of crawling the web page content information corresponding to the network address by means of the deployed code skeleton, the processor 502 also performs the following operation: establishing, according to the network address, a connection with the target access end corresponding to the network address.
In one embodiment, when executing the step of parsing the source code in the web analysis content by means of the code skeleton to obtain corresponding source code parsing information, the processor 502 performs the following operations: obtaining the regular expression rule cluster constructed in advance in the code skeleton for identifying source code; obtaining, by means of the regular expression rule cluster, the segmented source code in the source code that corresponds one-to-one to each regular expression rule in the cluster; parsing each segmented source code with the parsing code that corresponds to that segment's regular expression rule, obtaining segmented source code parsing information for each segmented source code; and combining the pieces of segmented source code parsing information to obtain the source code parsing information.
Where the computer device serves as the second server, the processor 502 is configured to run the computer program 5032 stored in the memory to implement the following functions: receiving the target website network address set to be crawled sent by the user terminal; distributing each network address in the target website network address set to the corresponding first server; receiving the web analysis content sent by the first server, where the web analysis content is obtained by the first server crawling and parsing the web page content information corresponding to the network address; receiving the source code parsing information sent by the first server, where the source code parsing information is obtained by the first server parsing the source code of the web analysis content; and receiving a search condition and, according to the search condition, obtaining the corresponding search result from the web analysis content and the source code parsing information.
In one embodiment, when executing the step of receiving the target website network address set to be crawled sent by the user terminal, the processor 502 performs the following operation: receiving, through a remote dictionary server (Redis) database, the target website network address set to be crawled sent by the user terminal.
Those skilled in the art will understand that the embodiment of the computer device shown in Fig. 6 does not limit the specific composition of the computer device. In other embodiments, the computer device may include more or fewer components than shown, combine certain components, or have a different component arrangement. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments, the structure and function of the memory and the processor are consistent with the embodiment shown in Fig. 6 and are not described again here.
It should be understood that, in the embodiments of the present invention, the processor 502 may be a central processing unit (Central Processing Unit, CPU), or it may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
Another embodiment of the present invention provides a computer readable storage medium. The computer readable storage medium may be a non-volatile computer readable storage medium. It stores a computer program which, when executed by a processor, implements the following steps: receiving a network address distributed by the second server, where the network address is a subset of the target website network address set that the second server receives from the user terminal; crawling the web page content information corresponding to the network address by means of the deployed code skeleton; parsing the web page content information by means of the code skeleton to obtain web analysis content; sending the web analysis content to the storage region in the second server corresponding to the first server for storage; parsing the source code in the web analysis content by means of the code skeleton to obtain corresponding source code parsing information; and sending the source code parsing information to the storage region in the second server corresponding to the first server for storage.
In one embodiment, before the receiving of the network address distributed by the second server, the steps further include: initially deploying an application container engine; packaging, in the application container engine, the code skeleton for crawling and parsing web page content; and setting the storage region in the second server corresponding to the application container engine.
In one embodiment, before the crawling of the web page content information corresponding to the network address by means of the deployed code skeleton, the steps further include: establishing, according to the network address, a connection with the target access end corresponding to the network address.
In one embodiment, the parsing of the source code in the web analysis content by means of the code skeleton to obtain corresponding source code parsing information includes: obtaining the regular expression rule cluster constructed in advance in the code skeleton for identifying source code; obtaining, by means of the regular expression rule cluster, the segmented source code in the source code that corresponds one-to-one to each regular expression rule in the cluster; parsing each segmented source code with the parsing code that corresponds to that segment's regular expression rule, obtaining segmented source code parsing information for each segmented source code; and combining the pieces of segmented source code parsing information to obtain the source code parsing information.
Another embodiment of the present invention provides a computer readable storage medium. The computer readable storage medium may be a non-volatile computer readable storage medium. It stores a computer program which, when executed by a processor, implements the following steps: receiving the target website network address set to be crawled sent by the user terminal; distributing each network address in the target website network address set to the corresponding first server; receiving the web analysis content sent by the first server, where the web analysis content is obtained by the first server crawling and parsing the web page content information corresponding to the network address; receiving the source code parsing information sent by the first server, where the source code parsing information is obtained by the first server parsing the source code of the web analysis content; and receiving a search condition and, according to the search condition, obtaining the corresponding search result from the web analysis content and the source code parsing information.
In one embodiment, the receiving of the target website network address set to be crawled sent by the user terminal includes: receiving, through a remote dictionary server (Redis) database, the target website network address set to be crawled sent by the user terminal.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the equipment, devices, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function; in actual implementation there may be other divisions, and units with the same function may be merged into one unit. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or of other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present invention.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a magnetic disk, or an optical disc.
The above is merely a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed by the present invention, and these modifications or replacements shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A web data crawling method, applied to a first server, characterized by comprising:
receiving a network address distributed by a second server, wherein the network address is a subset of the target website network address set that the second server receives from a user terminal;
crawling the web page content information corresponding to the network address by means of a deployed code skeleton;
parsing the web page content information by means of the code skeleton to obtain web analysis content;
sending the web analysis content to the storage region in the second server corresponding to the first server for storage;
parsing the source code in the web analysis content by means of the code skeleton to obtain corresponding source code parsing information; and
sending the source code parsing information to the storage region in the second server corresponding to the first server for storage.
2. The web data crawling method according to claim 1, characterized in that, before the receiving of the network address distributed by the second server, the method further comprises:
initially deploying an application container engine;
packaging, in the application container engine, the code skeleton for crawling web page content and parsing web page content;
setting the storage region in the second server corresponding to the application container engine.
3. The web data crawling method according to claim 1, characterized in that, before the crawling of the web page content information corresponding to the network address by means of the deployed code skeleton, the method further comprises:
establishing, according to the network address, a connection with the target access end corresponding to the network address.
4. The web data crawling method according to claim 1, characterized in that the parsing of the source code in the web analysis content by means of the code skeleton to obtain corresponding source code parsing information comprises:
obtaining the regular expression rule cluster constructed in advance in the code skeleton for identifying source code;
obtaining, by means of the regular expression rule cluster, the segmented source code in the source code that corresponds one-to-one to each regular expression rule in the regular expression rule cluster;
parsing each segmented source code with the parsing code that corresponds to the regular expression rule of that segmented source code, to obtain segmented source code parsing information corresponding to each segmented source code;
combining the pieces of segmented source code parsing information to obtain the source code parsing information.
5. A web data crawling method, applied to a second server, characterized by comprising:
receiving the target website network address set to be crawled sent by a user terminal;
distributing each network address in the target website network address set to the corresponding first server;
receiving the web analysis content sent by the first server, wherein the web analysis content is obtained by the first server crawling and parsing the web page content information corresponding to the network address;
receiving the source code parsing information sent by the first server, wherein the source code parsing information is obtained by the first server parsing the source code of the web analysis content; and
receiving a search condition and, according to the search condition, obtaining the corresponding search result from the web analysis content and the source code parsing information.
6. The web data crawling method according to claim 5, characterized in that the receiving of the target website network address set to be crawled sent by the user terminal comprises:
receiving, through a remote dictionary server (Redis) database, the target website network address set to be crawled sent by the user terminal.
7. A web data crawling device, characterized by comprising units for executing the web data crawling method according to any one of claims 1 to 4, or units for executing the web data crawling method according to any one of claims 5 to 6.
8. A web data crawling system, characterized by comprising a first server and a second server, wherein the first server is configured to execute the web data crawling method according to any one of claims 1 to 4, and the second server is configured to execute the web data crawling method according to any one of claims 5 to 6.
9. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the web data crawling method according to any one of claims 1 to 4, or implements the web data crawling method according to any one of claims 5 to 6.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to execute the web data crawling method according to any one of claims 1 to 4, or to execute the web data crawling method according to any one of claims 5 to 6.
CN201910012240.6A 2019-01-07 2019-01-07 Webpage data crawling method, device, system, computer equipment and storage medium Active CN109885744B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910012240.6A CN109885744B (en) 2019-01-07 2019-01-07 Webpage data crawling method, device, system, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910012240.6A CN109885744B (en) 2019-01-07 2019-01-07 Webpage data crawling method, device, system, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109885744A true CN109885744A (en) 2019-06-14
CN109885744B CN109885744B (en) 2024-05-10

Family

ID=66925622

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910012240.6A Active CN109885744B (en) 2019-01-07 2019-01-07 Webpage data crawling method, device, system, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109885744B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902889A (en) * 2012-12-26 2014-07-02 腾讯科技(深圳)有限公司 Malicious message cloud detection method and server
WO2015003664A1 (en) * 2013-07-12 2015-01-15 贝壳网际(北京)安全技术有限公司 Method, device, server, and client device for download processing
CN104951512A (en) * 2015-05-27 2015-09-30 中国科学院信息工程研究所 Public sentiment data collection method and system based on Internet
CN106096056A (en) * 2016-06-30 2016-11-09 西南石油大学 A kind of based on distributed public sentiment data real-time collecting method and system
CN106126693A (en) * 2016-06-29 2016-11-16 微梦创科网络科技(中国)有限公司 The sending method of the related data of a kind of webpage and device
CN106776567A (en) * 2016-12-22 2017-05-31 金蝶软件(中国)有限公司 A kind of internet big data analyzes extracting method and system
CN109033195A (en) * 2018-06-28 2018-12-18 上海盛付通电子支付服务有限公司 The acquisition methods of webpage information obtain equipment and computer-readable medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902889A (en) * 2012-12-26 2014-07-02 腾讯科技(深圳)有限公司 Malicious message cloud detection method and server
WO2015003664A1 (en) * 2013-07-12 2015-01-15 贝壳网际(北京)安全技术有限公司 Method, device, server, and client device for download processing
CN104951512A (en) * 2015-05-27 2015-09-30 中国科学院信息工程研究所 Public sentiment data collection method and system based on Internet
CN106126693A (en) * 2016-06-29 2016-11-16 微梦创科网络科技(中国)有限公司 Method and device for sending related data of a webpage
CN106096056A (en) * 2016-06-30 2016-11-09 西南石油大学 Distributed real-time public sentiment data collection method and system
CN106776567A (en) * 2016-12-22 2017-05-31 金蝶软件(中国)有限公司 Internet big data analysis and extraction method and system
CN109033195A (en) * 2018-06-28 2018-12-18 上海盛付通电子支付服务有限公司 Webpage information acquisition method, acquisition device and computer-readable medium

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110297962A (en) * 2019-06-28 2019-10-01 北京金山安全软件有限公司 Website resource crawling method, device, system and computer equipment
CN110297962B (en) * 2019-06-28 2021-08-24 北京金山安全软件有限公司 Website resource crawling method, device, system and computer equipment
CN112257032A (en) * 2019-10-21 2021-01-22 国家计算机网络与信息安全管理中心 Method and system for determining APP responsibility subject
CN112257032B (en) * 2019-10-21 2023-07-14 国家计算机网络与信息安全管理中心 Method and system for determining APP responsibility main body
CN111090798A (en) * 2019-12-06 2020-05-01 广州探途网络技术有限公司 Webpage data crawling method and system
CN111090798B (en) * 2019-12-06 2023-11-21 广州探途网络技术有限公司 Webpage data crawling method and system
CN111949849A (en) * 2020-08-13 2020-11-17 中国科学院水生生物研究所 Fish information acquisition method and device, electronic equipment and readable storage medium
CN111949849B (en) * 2020-08-13 2023-11-21 中国科学院水生生物研究所 Fish information acquisition method and device, electronic equipment and readable storage medium
CN112422707A (en) * 2020-10-22 2021-02-26 北京安博通科技股份有限公司 Domain name data mining method and device and Redis server
CN114969172A (en) * 2022-03-24 2022-08-30 北京感易智能科技有限公司 Information data processing method, information data processing device and electronic equipment

Also Published As

Publication number Publication date
CN109885744B (en) 2024-05-10

Similar Documents

Publication Publication Date Title
CN109885744A (en) Web data crawling method, device, system, computer equipment and storage medium
US11860821B2 (en) Generating target application packages for groups of computing devices
KR102317535B1 (en) Methods and systems for implementing data tracking with software development kits
US7702959B2 (en) Error management system and method of using the same
CN105765528B (en) Method, system and medium with the application execution path trace that configurable origin defines
CN104067276B (en) Client-side minimum is downloaded and the page navigation feature of simulation
CN111104635B (en) Method and device for generating form webpage
CN109446072A (en) The generation method and device of test script
CN102722381B (en) The technology of optimization and upgrading task
CN102012954B (en) Subsystem integration method and subsystem integration system for integration design of system-on-chip
CN107800562B (en) A kind of method for configuring route and device of view file
CN110221968A (en) Method for testing software and Related product
US20130191376A1 (en) Identifying related entities
CN107480117B (en) Recovery method and device for automatic page table single data
KR20130019366A (en) Efficiently collecting transction-separated metrics in a distributed enviornment
CN110427775A (en) Data query authority control method and device
CN108519903A (en) Static resource adaptation method, device, computer equipment and storage medium
US10867006B2 (en) Tag plan generation
CN111782317A (en) Page testing method and device, storage medium and electronic device
CN106980501A (en) A kind of software package management method, device and system
CN104052626A (en) Method, device and system for configuring network element data
US10761862B2 (en) Method and device for adding indicative icon in interactive application
EP2815314B1 (en) Assessment of transaction-level interoperability over a tactical data link
CN105446981B (en) Map of website generation method, access method and device
CN109597948A (en) Access method, system and the storage medium of URL link

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant