CN108491420A

CN108491420A - Configuration method for web page crawling, application server, and computer-readable storage medium

Info

Publication number: CN108491420A
Application number: CN201810119441.1A
Authority: CN
Inventors: 蔡俊
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2018-02-06
Filing date: 2018-02-06
Publication date: 2018-09-04
Also published as: WO2019153603A1

Abstract

The invention discloses a configuration method for web page crawling. The method includes: receiving a crawl URL input by a user; setting a crawl information type; setting a crawl task processing node; moving from a link on the web page of the crawl URL to the crawl task processing node; and crawling the corresponding information on the crawl task processing node according to the crawl information type. The invention also provides an application server and a computer-readable storage medium. The configuration method, application server, and computer-readable storage medium provided by the invention allow the crawl depth to be controlled flexibly and allow data to be classified and organized during the crawl itself, improving the efficiency with which the data is crawled and used.

Description

Configuration method for web page crawling, application server, and computer-readable storage medium

Technical field

The present invention relates to the field of communication technology, and in particular to a configuration method for web page crawling, an application server, and a computer-readable storage medium.

Background art

Web page crawling refers to the process or thread in a web search subsystem that crawls a set of pages according to their uniform resource locators (URLs). For a search engine, web page crawling is performed by a web spider, which finds pages by following link addresses: starting from some page of a website (usually the home page), it reads the page content, finds the other link addresses in the page, uses those addresses to find the next pages, and repeats this loop until every page of the website has been fetched. If the entire Internet is treated as a single website, a web spider can use the same principle to fetch every page on the Internet. In current web page crawling, however, and in picture crawling in particular, although the target pictures can be crawled effectively, this cyclic searching increases the load on the server, reduces crawling efficiency, and degrades the user experience.

Summary of the invention

In view of this, the present invention proposes a configuration method for web page crawling, an application server, and a computer-readable storage medium that allow the crawl depth to be controlled flexibly and allow data to be classified and organized during the crawl, improving the efficiency with which the data is crawled and used.

First, to achieve the above object, the present invention proposes an application server. The application server includes a memory and a processor; the memory stores a configuration program for web page crawling that can run on the processor, and the configuration program, when executed by the processor, implements the following steps:

receiving a crawl URL input by a user;

setting a crawl information type, wherein the crawl information type includes one or more of text, HTML, multimedia, and pictures;

setting a crawl task processing node;

moving from a link on the web page of the crawl URL to the crawl task processing node; and

crawling the corresponding information on the crawl task processing node according to the crawl information type.

Optionally, the step of receiving the crawl URL input by the user includes:

establishing association information between keyword information and the crawl URL;

receiving keyword information input by the user; and

obtaining the URL corresponding to the keyword information from the keyword information and the association information.

Optionally, the step of setting the crawl task processing node includes:

setting the number of access layers for the website represented by the crawl URL, and crawling web pages according to the crawl task processing node.

Optionally, when the configuration program for web page crawling is executed by the processor, the following steps are also implemented:

setting crawl purpose information;

setting a corresponding storage space according to the crawl purpose information; and

after the step of crawling the corresponding information on the crawl task processing node according to the crawl information type, storing the corresponding information in the storage space.

In addition, to achieve the above object, the present invention also provides a configuration method for web page crawling. The method is applied to an application server and includes:

receiving a crawl URL input by a user;

setting a crawl task processing node;

Optionally, the step of receiving the crawl URL input by the user further includes:

receiving keyword information input by the user; and

Optionally, the step of setting the crawl task processing node includes:

Optionally, the method further includes the steps of:

setting crawl purpose information; and

setting a corresponding storage space according to the crawl purpose information.

Optionally, after the step of crawling the corresponding information on the crawl task processing node according to the crawl information type, the method further includes the step of:

storing the corresponding information in the storage space.

Further, to achieve the above object, the present invention also provides a computer-readable storage medium. The computer-readable storage medium stores a configuration program for web page crawling, and the configuration program can be executed by at least one processor so that the at least one processor performs the steps of the configuration method for web page crawling described above.

Compared with the prior art, the application server, configuration method for web page crawling, and computer-readable storage medium proposed by the present invention first receive a crawl URL input by a user; second, set a crawl information type; then set a crawl task processing node; then move from a link on the web page of the crawl URL to the crawl task processing node; and finally crawl the corresponding information on the crawl task processing node according to the crawl information type. This avoids the drawback of prior-art crawling, in which cyclic searching increases the load on the server. The crawl depth can be controlled flexibly, and data can be classified and organized during the crawl, improving the efficiency with which the data is crawled and used.

Description of the drawings

Fig. 1 is a schematic diagram of an optional hardware architecture of the application server of the present invention;

Fig. 2 is a program module diagram of a first embodiment of the configuration program for web page crawling of the present invention;

Fig. 3 is a program module diagram of a second embodiment of the configuration program for web page crawling of the present invention;

Fig. 4 is a flowchart of a first embodiment of the configuration method for web page crawling of the present invention;

Fig. 5 is a flowchart of a second embodiment of the configuration method for web page crawling of the present invention;

Fig. 6 is a flowchart of a third embodiment of the configuration method for web page crawling of the present invention.

Reference numeral：

The realization of the objects, the functional features, and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, fall within the protection scope of the present invention.

It should be noted that descriptions involving "first", "second", and the like in the present invention are for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with one another, but only insofar as a person of ordinary skill in the art can implement the combination; where a combination of technical solutions is contradictory or cannot be implemented, the combination should be regarded as not existing and as falling outside the protection scope claimed by the present invention.

Referring to Fig. 1, a schematic diagram of an optional hardware architecture of the application server 1 is shown.

The application server 1 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server. The application server 1 may be an independent server or a server cluster composed of multiple servers.

In this embodiment, the application server 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicatively connected to one another through a system bus.

The application server 1 connects to a network through the network interface 13 and obtains information. The network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, or a voice call network.

It should be noted that Fig. 1 shows only the application server 1 with components 11-13. It should be understood that not all of the components shown are required, and that more or fewer components may be implemented instead.

The memory 11 includes at least one type of readable storage medium, the readable storage medium including flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 11 may be an internal storage unit of the application server 1, such as the hard disk or internal memory of the application server 1. In other embodiments, the memory 11 may be an external storage device of the application server 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card with which the application server 1 is equipped. The memory 11 may of course include both the internal storage unit of the application server 1 and its external storage device. In this embodiment, the memory 11 is generally used to store the operating system and the various application software installed on the application server 1, such as the program code of the configuration program 200 for web page crawling. The memory 11 may also be used to temporarily store various data that has been output or is to be output.

In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is generally used to control the overall operation of the application server 1, for example to perform control and processing related to data interaction or communication. In this embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example to run the configuration program 200 for web page crawling.

The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the application server 1 and other electronic devices.

In this embodiment, the configuration program 200 for web page crawling is installed and runs on the application server 1. When the configuration program 200 runs, the application server 1 first receives a crawl URL input by a user; second, sets a crawl information type; then sets a crawl task processing node; then moves from a link on the web page of the crawl URL to the crawl task processing node; and finally crawls the corresponding information on the crawl task processing node according to the crawl information type. This avoids the drawback of prior-art crawling, in which cyclic searching increases the load on the server. The crawl depth can be controlled flexibly, and data can be classified and organized during the crawl, improving the efficiency with which the data is crawled and used.

Thus far, the hardware structure and functions of the application environment and related devices of the embodiments of the present invention have been described in detail. The embodiments of the present invention are proposed below on the basis of this application environment and these devices.

First, the present invention proposes a configuration program 200 for web page crawling.

Referring to Fig. 2, a program module diagram of the first embodiment of the configuration program 200 for web page crawling of the present invention is shown.

In this embodiment, the configuration program 200 for web page crawling includes a series of computer program instructions stored in the memory 11. When the computer program instructions are executed by the processor 12, the configuration operations for web page crawling of the embodiments of the present invention can be implemented. In some embodiments, the configuration program 200 may be divided into one or more modules according to the specific operations implemented by each part of the computer program instructions. For example, in Fig. 2, the configuration program 200 may be divided into a receiving module 201, a first setting module 202, a second setting module 203, a link processing module 204, and an information crawling module 205, where:

The receiving module 201 is used to receive a crawl URL input by a user. Specifically, the receiving module 201 receives the crawl URL input by the user in a preset URL field; the user enters the crawl URL in the preset URL field of a terminal device. In this embodiment, the terminal device may be a mobile device such as a mobile phone, a smartphone, a laptop computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable media player), a navigation device, or a vehicle-mounted device, or a fixed terminal such as a digital TV, a desktop computer, a notebook, or a server.

In this embodiment, the receiving module 201 receives the crawl URL input by the user as follows:

The receiving module 201 first establishes association information between keyword information and the crawl URL, then receives keyword information input by the user, and finally obtains the URL corresponding to the keyword information from the keyword information and the association information.

In this embodiment, the crawl URL may be entered by the user in the preset URL field of the terminal device. In other embodiments, the user does not need to remember the URL and only needs to input keyword information associated with the website, for example "Sina" entered by keyboard or by voice; the URL corresponding to "Sina" is then filled in automatically according to the preset association information.
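The keyword lookup described above can be sketched as a simple table lookup. This is only an illustration under stated assumptions: the table name `KEYWORD_TO_URL`, its entries, and the function `resolve_crawl_url` are hypothetical and do not appear in the original text, which does not specify how the association information is stored.

```python
# Hypothetical association table: keyword information -> crawl URL.
# The entries are illustrative; the source does not list any.
KEYWORD_TO_URL = {
    "sina": "https://www.sina.com.cn",
    "新浪": "https://www.sina.com.cn",
}


def resolve_crawl_url(user_input: str) -> str:
    """Return a crawl URL for either a typed URL or a known keyword."""
    text = user_input.strip()
    # Input that already looks like a URL is used unchanged.
    if text.startswith(("http://", "https://")):
        return text
    # Otherwise treat the input as a keyword and consult the association table.
    try:
        return KEYWORD_TO_URL[text.lower()]
    except KeyError:
        raise ValueError(f"no URL associated with keyword {text!r}") from None
```

Calling `resolve_crawl_url("Sina")` returns the URL stored for that keyword, while an unknown keyword raises `ValueError` so the caller can prompt the user again.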

The first setting module 202 is used to set a crawl information type. In this embodiment, the crawl information type includes one or more of text, Hypertext Markup Language (HTML), multimedia, and pictures.

In this embodiment, information is obtained in different ways depending on the crawl information type:

For information of the text type, the documents are usually generated by software supplied by specialized vendors, and these vendors all provide corresponding text-extraction interfaces. The crawler only needs to call the interfaces of these plug-ins to easily extract the text of a document and other information related to the file.

Documents such as HTML are different. HTML has its own syntax, in which different markup identifiers express formatting such as font, color, and position, for example style="color:#fff; font-weight:bold". When extracting text, these identifiers must all be filtered out before the content is obtained.
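As a rough illustration of filtering out markup identifiers before reading the content, the following sketch uses the `HTMLParser` class from the Python standard library. The class name `TextExtractor` and the function `extract_text` are hypothetical, and skipping the contents of `script` and `style` elements is an added assumption rather than something the text specifies.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text while discarding tags and their attributes."""

    def __init__(self) -> None:
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the text content of an HTML fragment with all markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Attributes such as `style="color:#fff; font-weight:bold"` never reach `handle_data`, so they drop out of the result without any explicit filtering rule.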

For the multimedia and picture types, the content of a file is generally judged from the anchor text of the link (that is, the link text) and from the related file annotations, and the corresponding content is then obtained.
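One way to pair media links with their anchor text is sketched below, again with the standard-library `HTMLParser`. The extension list `MEDIA_EXTENSIONS`, the class `MediaLinkCollector`, and the function `collect_media_links` are assumptions made for illustration; the text does not say how media links are recognized, and file annotations are not handled here.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed set of extensions treated as multimedia or picture files.
MEDIA_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".mp3", ".mp4"}


class MediaLinkCollector(HTMLParser):
    """Record (url, anchor_text) pairs for links that point at media files."""

    def __init__(self) -> None:
        super().__init__()
        self.media = []
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            ext = posixpath.splitext(urlparse(self._href).path)[1].lower()
            if ext in MEDIA_EXTENSIONS:
                self.media.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_media_links(html: str):
    """Return a list of (url, anchor_text) for media links in the HTML."""
    collector = MediaLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.media
```

The anchor text returned alongside each URL can then serve as the description from which the content of the picture or multimedia file is judged.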

The second setting module 203 is used to set a crawl task processing node. In this embodiment, setting the crawl task processing node specifically means setting the number of access layers for the website represented by the crawl URL.

In this embodiment, for reasons of crawling efficiency it is not feasible to fetch every web page, so a processing node for the crawl task can be set for the website being crawled; that is, the number of access layers (also called the crawl depth) is set. For example, A is the starting page and belongs to layer 0; B, C, D, E, and F are linked from A and belong to layer 1; G and H are linked from layer 1 and belong to layer 2; and I is linked from layer 2 and belongs to layer 3. If the number of access layers set for the web spider is 2, page I will not be accessed.

In this way, by setting the crawl task processing node, that is, by setting the number of access layers for the website represented by the crawl URL, the crawl depth can be controlled flexibly, improving the efficiency of the entire data crawl.
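The layer example can be reproduced with a breadth-first traversal that stops expanding pages once the configured number of access layers is reached. This is a minimal sketch over an in-memory link graph, not the patented implementation. The text does not state which layer-1 page links to G and H or which layer-2 page links to I, so the edges chosen in `LINKS` are assumptions, as are the names `LINKS` and `crawl`.

```python
from collections import deque

# Link graph for the example: A is the start page (layer 0).
# The specific parents of G, H, and I are assumed for illustration.
LINKS = {
    "A": ["B", "C", "D", "E", "F"],
    "B": ["G"],
    "C": ["H"],
    "G": ["I"],
}


def crawl(start, max_depth, links):
    """Breadth-first traversal that does not follow links past max_depth.

    Returns a dict mapping each visited page to its layer number.
    """
    visited = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        depth = visited[page]
        if depth == max_depth:
            continue  # pages on the last layer are visited but not expanded
        for target in links.get(page, []):
            if target not in visited:
                visited[target] = depth + 1
                queue.append(target)
    return visited
```

With `max_depth` set to 2, `crawl("A", 2, LINKS)` visits A through H and never reaches I, matching the example in the text.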

The link processing module 204 is used to move from a link on the web page of the crawl URL to the crawl task processing node.

The information crawling module 205 is used to crawl the corresponding information on the crawl task processing node according to the crawl information type.

Through the above program modules 201-205, the configuration program 200 for web page crawling proposed by the present invention first receives a crawl URL input by a user; second, sets a crawl information type; then sets a crawl task processing node; then moves from a link on the web page of the crawl URL to the crawl task processing node; and finally crawls the corresponding information on the crawl task processing node according to the crawl information type. This avoids the drawback of prior-art crawling, in which cyclic searching increases the load on the server. The crawl depth can be controlled flexibly, and data can be classified and organized during the crawl, improving the efficiency with which the data is crawled and used.
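The settings gathered by modules 201-203 could be held in a single configuration object that the later modules read. The sketch below is one possible shape, assuming Python 3.9 or later; the names `InfoType` and `CrawlConfig`, the default depth of 1, and the validation rules are all hypothetical and not taken from the original text.

```python
from dataclasses import dataclass
from enum import Enum


class InfoType(Enum):
    """The four crawl information types named in the text."""
    TEXT = "text"
    HTML = "html"
    MULTIMEDIA = "multimedia"
    PICTURE = "picture"


@dataclass
class CrawlConfig:
    crawl_url: str               # received by the receiving module
    info_types: set[InfoType]    # chosen via the first setting module
    max_depth: int = 1           # number of access layers; default is assumed

    def __post_init__(self):
        # At least one information type must be selected.
        if not self.info_types:
            raise ValueError("at least one information type must be selected")
        if self.max_depth < 0:
            raise ValueError("max_depth must be non-negative")
```

For example, `CrawlConfig("https://example.com", {InfoType.PICTURE}, max_depth=2)` describes a picture-only crawl limited to two access layers.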

Further, on the basis of the above first embodiment of the configuration program 200 for web page crawling, a second embodiment of the present invention is proposed (as shown in Fig. 3). In this embodiment, the configuration program 200 for web page crawling further includes a storage module 206, where:

The second setting module 203 is also used to set crawl purpose information and to set a corresponding storage space according to the crawl purpose information. For example, the purpose of the crawl may be user behavior analysis or data modification; recording it allows crawls to be classified and organized by purpose, improving efficiency. Specifically, an identification number may be assigned to each distinct piece of crawl purpose information, so that different crawl purposes can be distinguished by identification number.

The storage module 206 is used to store the corresponding information in the storage space after the step of crawling the corresponding information on the crawl task processing node according to the crawl information type.

In this embodiment, the crawl purpose information is set while the crawl process is carried out, and a storage space named after the purpose is created; after the crawl process is complete, the obtained information can be stored in that storage space. For example, if the purpose of the current crawl is user behavior analysis, then after the crawl process ends the data can be stored in the storage space for user behavior analysis, so that subsequent user behavior applications can call it directly. This facilitates the classification and management of data and improves the efficiency of the entire data crawl.
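A storage space named after the crawl purpose could be as simple as one directory per purpose. The following sketch assumes a file-system layout and JSON records; the identification numbers in `PURPOSES` and the functions `storage_dir_for` and `store_result` are hypothetical, since the text does not specify the storage medium or the numbering scheme.

```python
import json
from pathlib import Path

# Hypothetical identification numbers for crawl purposes.
PURPOSES = {
    1: "user_behavior_analysis",
    2: "data_modification",
}


def storage_dir_for(purpose_id: int, root: Path) -> Path:
    """Create (if needed) and return the directory named after the purpose."""
    directory = root / PURPOSES[purpose_id]
    directory.mkdir(parents=True, exist_ok=True)
    return directory


def store_result(purpose_id: int, root: Path, name: str, record: dict) -> Path:
    """Write one crawled record as JSON into the purpose's storage space."""
    target = storage_dir_for(purpose_id, root) / f"{name}.json"
    target.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")
    return target
```

A later application that analyzes user behavior would then read only the `user_behavior_analysis` directory, without having to sort the crawled data first.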

Through the above program modules 201-206, the configuration program 200 for web page crawling proposed by the present invention sets crawl purpose information, sets a corresponding storage space according to the crawl purpose information, and stores the corresponding information in the storage space, thereby achieving the classification and management of data.

In addition, the present invention also proposes a configuration method for web page crawling.

Referring to Fig. 4, a flowchart of the first embodiment of the configuration method for web page crawling of the present invention is shown. In this embodiment, the execution order of the steps in the flowchart shown in Fig. 4 may be changed, and some steps may be omitted, according to different requirements.

Step S401: receive a crawl URL input by a user. Specifically, the application server 1 receives the crawl URL input by the user in a preset URL field; the user enters the crawl URL in the preset URL field of a terminal device. The specific steps of receiving the crawl URL input by the user are described in detail in the third embodiment of the configuration method for web page crawling of the present invention (Fig. 6). In this embodiment, the terminal device may be a mobile device such as a mobile phone, a smartphone, a laptop computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable media player), a navigation device, or a vehicle-mounted device, or a fixed terminal such as a digital TV, a desktop computer, a notebook, or a server.

Step S402: set a crawl information type. In this embodiment, the crawl information type includes one or more of text, Hypertext Markup Language (HTML), multimedia, and pictures.

Step S403: set a crawl task processing node. In this embodiment, setting the crawl task processing node specifically means setting the number of access layers for the website represented by the crawl URL.

Step S404: move from a link on the web page of the crawl URL to the crawl task processing node.

Step S405: crawl the corresponding information on the crawl task processing node according to the crawl information type.

Through the above steps S401-S405, the configuration method for web page crawling proposed by the present invention first receives a crawl URL input by a user; second, sets a crawl information type; then sets a crawl task processing node; then moves from a link on the web page of the crawl URL to the crawl task processing node; and finally crawls the corresponding information on the crawl task processing node according to the crawl information type. This avoids the drawback of prior-art crawling, in which cyclic searching increases the load on the server. The crawl depth can be controlled flexibly, and data can be classified and organized during the crawl, improving the efficiency with which the data is crawled and used.

Referring to Fig. 5, a flowchart of the second embodiment of the configuration method for web page crawling of the present invention is shown. In this embodiment, the execution order of the steps in the flowchart shown in Fig. 5 may be changed, and some steps may be omitted, according to different requirements.

Step S501: receive a crawl URL input by a user. Specifically, the application server 1 receives the crawl URL input by the user in a preset URL field; the user enters the crawl URL in the preset URL field of a terminal device. The specific steps of receiving the crawl URL input by the user are described in detail in the third embodiment of the configuration method for web page crawling of the present invention (Fig. 6). In this embodiment, the terminal device may be a mobile device such as a mobile phone, a smartphone, a laptop computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable media player), a navigation device, or a vehicle-mounted device, or a fixed terminal such as a digital TV, a desktop computer, a notebook, or a server.

Step S502: set a crawl information type. In this embodiment, the crawl information type includes one or more of text, Hypertext Markup Language (HTML), multimedia, and pictures.

Step S503: set a crawl task processing node. In this embodiment, setting the crawl task processing node specifically means setting the number of access layers for the website represented by the crawl URL.

Step S504: move from a link on the web page of the crawl URL to the crawl task processing node.

Step S505: crawl the corresponding information on the crawl task processing node according to the crawl information type.

Step S506: set crawl purpose information. For example, the purpose of the crawl may be user behavior analysis or data modification; recording it allows crawls to be classified and organized by purpose, improving efficiency. Specifically, an identification number may be assigned to each distinct piece of crawl purpose information, so that different crawl purposes can be distinguished by identification number.

Step S507: set a corresponding storage space according to the crawl purpose information.

Step S508: store the corresponding information in the storage space.

Through the above steps S501-S508, the configuration method for web page crawling proposed by the present invention can, after the crawl process ends, store the data in the storage space for user behavior analysis so that subsequent user behavior applications can call it directly. This facilitates the classification and management of data and improves the efficiency of the entire data crawl.

Referring to Fig. 6, a flowchart of the third embodiment of the configuration method for web page crawling of the present invention is shown. In this embodiment, the execution order of the steps in the flowchart shown in Fig. 6 may be changed, and some steps may be omitted, according to different requirements.

In this embodiment, the step of receiving the crawl URL input by the user specifically includes:

Step S601: receive the crawl URL input by the user in a preset URL field.

Step S602: establish association information between keyword information and the crawl URL.

Step S603: receive keyword information input by the user.

Step S604: obtain the URL corresponding to the keyword information from the keyword information and the association information.

In this embodiment, the crawl URL may be entered by the user in the preset URL field of the terminal device. In other embodiments, the user does not need to remember the URL and only needs to input keyword information associated with the website, for example "Sina" entered by keyboard or by voice; the URL corresponding to "Sina" is then filled in automatically according to the preset association information. In this embodiment, the terminal device may be a mobile device such as a mobile phone, a smartphone, a laptop computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable media player), a navigation device, or a vehicle-mounted device, or a fixed terminal such as a digital TV, a desktop computer, a notebook, or a server.

Through the above steps S601-S604, the configuration method for web page crawling proposed by the present invention can quickly obtain the crawl URL from the keyword information input by the user, improving the efficiency of the entire data crawl.

The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.

From the description of the above embodiments, a person skilled in the art can clearly understand that the methods of the above embodiments may be implemented by software plus the necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present invention.

Claims

1. A configuration method for web page crawling, applied to an application server, characterized in that the method comprises the steps of:

receiving a crawl URL input by a user;

setting a crawl information type, wherein the crawl information type includes one or more of text, HTML, multimedia, and pictures;

setting a crawl task processing node;

2. The configuration method for web page crawling according to claim 1, characterized in that the step of receiving the crawl URL input by the user further comprises:

receiving keyword information input by the user; and

3. The configuration method for web page crawling according to claim 1, characterized in that the step of setting the crawl task processing node comprises:

setting the number of access layers for the website represented by the crawl URL, and crawling web pages according to the crawl task processing node.

4. The configuration method for web page crawling according to claim 1, characterized in that the method further comprises the steps of:

setting crawl purpose information; and

5. The configuration method for web page crawling according to claim 4, characterized in that, after the step of crawling the corresponding information on the crawl task processing node according to the crawl information type, the method further comprises the step of:

storing the corresponding information in the storage space.

6. An application server, characterized in that the application server comprises a memory and a processor, the memory storing a configuration program for web page crawling that can run on the processor, the configuration program for web page crawling implementing the following steps when executed by the processor:

receiving a crawl URL input by a user;

setting a crawl task processing node;

7. The application server according to claim 6, characterized in that the step of receiving the crawl URL input by the user comprises:

receiving keyword information input by the user; and

8. The application server according to claim 6, characterized in that the step of setting the crawl task processing node comprises:

9. The application server according to claim 6, characterized in that, when the configuration program for web page crawling is executed by the processor, the following steps are also implemented:

setting crawl purpose information; and

after the step of crawling the corresponding information on the crawl task processing node according to the crawl information type, storing the corresponding information in the storage space.

10. A computer-readable storage medium, the computer-readable storage medium storing a configuration program for web page crawling, the configuration program for web page crawling being executable by at least one processor so that the at least one processor performs the steps of the configuration method for web page crawling according to any one of claims 1-5.