CN108491420A - Configuration method for web page crawling, application server, and computer-readable storage medium - Google Patents

Info

Publication number
CN108491420A
Authority
CN
China
Prior art keywords
crawls
information
crawl
web page
network address
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810119441.1A
Other languages
Chinese (zh)
Inventor
蔡俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201810119441.1A
Priority to PCT/CN2018/089706 (WO2019153603A1)
Publication of CN108491420A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a configuration method for web page crawling. The method includes: receiving a crawl URL input by a user; setting a crawl information type; setting a crawl task processing node; proceeding from links on the web page of the crawl URL to the crawl task processing node; and crawling the corresponding information on the crawl task processing node according to the crawl information type. The invention also provides an application server and a computer-readable storage medium. The configuration method, application server, and computer-readable storage medium provided by the invention allow the crawl depth to be controlled flexibly and allow the data to be classified and organized during the crawl itself, improving the efficiency with which the data is crawled and used.

Description

Configuration method for web page crawling, application server, and computer-readable storage medium
Technical field
The present invention relates to the field of communication technology, and in particular to a configuration method for web page crawling, an application server, and a computer-readable storage medium.
Background
Web page crawling refers to a process or thread in a web search subsystem that crawls a set of pages according to their uniform resource locators (Uniform Resource Locator, URL). For a search engine, web page crawling is carried out by a web spider, which finds web pages by following their link addresses: starting from one page of a website (usually the home page), it reads the page content, finds the other link addresses in the page, and follows those link addresses to the next pages, repeating this cycle until every page of the website has been fetched. If the entire Internet is treated as a single website, a web spider can use this principle to fetch every page on the Internet. In current web page crawling, however, and especially when crawling pictures, the target pictures can be crawled effectively, but the cyclic searching increases the server load, reduces crawling efficiency, and degrades the user experience.
Summary of the invention
In view of this, the present invention proposes a configuration method for web page crawling, an application server, and a computer-readable storage medium that allow the crawl depth to be controlled flexibly and allow the data to be classified and organized during the crawl itself, improving the efficiency with which the data is crawled and used.
First, to achieve the above object, the present invention proposes an application server. The application server includes a memory and a processor; the memory stores a web page crawl configuration program that can run on the processor, and the configuration program, when executed by the processor, implements the following steps:
receiving a crawl URL input by a user;
setting a crawl information type, where the crawl information type includes at least one of text, HTML, multimedia, and pictures;
setting a crawl task processing node;
proceeding from links on the web page of the crawl URL to the crawl task processing node; and
crawling the corresponding information on the crawl task processing node according to the crawl information type.
Optionally, the step of receiving the crawl URL input by the user includes:
establishing association information between keyword text and the crawl URL;
receiving keyword text input by the user; and
obtaining the URL corresponding to the keyword text from the keyword text and the association information.
Optionally, the step of setting the crawl task processing node includes:
setting the number of access layers of the website represented by the crawl URL, and crawling web pages according to the crawl task processing node.
Optionally, when executed by the processor, the web page crawl configuration program further implements the following steps:
setting crawl purpose information;
setting a corresponding storage space according to the crawl purpose information; and
after the step of crawling the corresponding information on the crawl task processing node according to the crawl information type, storing the corresponding information in the storage space.
In addition, to achieve the above object, the present invention also provides a configuration method for web page crawling, applied to an application server. The method includes:
receiving a crawl URL input by a user;
setting a crawl information type, where the crawl information type includes at least one of text, HTML, multimedia, and pictures;
setting a crawl task processing node;
proceeding from links on the web page of the crawl URL to the crawl task processing node; and
crawling the corresponding information on the crawl task processing node according to the crawl information type.
Optionally, the step of receiving the crawl URL input by the user further includes:
establishing association information between keyword text and the crawl URL;
receiving keyword text input by the user; and
obtaining the URL corresponding to the keyword text from the keyword text and the association information.
Optionally, the step of setting the crawl task processing node includes:
setting the number of access layers of the website represented by the crawl URL, and crawling web pages according to the crawl task processing node.
Optionally, the method further includes the steps of:
setting crawl purpose information; and
setting a corresponding storage space according to the crawl purpose information.
Optionally, after the step of crawling the corresponding information on the crawl task processing node according to the crawl information type, the method further includes the step of:
storing the corresponding information in the storage space.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium. The computer-readable storage medium stores a web page crawl configuration program that can be executed by at least one processor, so that the at least one processor performs the steps of the configuration method for web page crawling described above.
Compared with the prior art, the application server, the configuration method for web page crawling, and the computer-readable storage medium proposed by the present invention first receive a crawl URL input by a user; next, set a crawl information type; then set a crawl task processing node; then proceed from links on the web page of the crawl URL to the crawl task processing node; and finally crawl the corresponding information on the crawl task processing node according to the crawl information type. This avoids the drawback of prior-art crawling, in which cyclic searching increases the server load. The crawl depth can be controlled flexibly, and the data can be classified and organized during the crawl itself, improving the efficiency with which the data is crawled and used.
Description of the drawings
Fig. 1 is a schematic diagram of an optional hardware architecture of the application server of the present invention;
Fig. 2 is a program module diagram of the first embodiment of the web page crawl configuration program of the present invention;
Fig. 3 is a program module diagram of the second embodiment of the web page crawl configuration program of the present invention;
Fig. 4 is a flowchart of the first embodiment of the configuration method for web page crawling of the present invention;
Fig. 5 is a flowchart of the second embodiment of the configuration method for web page crawling of the present invention;
Fig. 6 is a flowchart of the third embodiment of the configuration method for web page crawling of the present invention.
Reference numerals:
The realization of the object, the functional features, and the advantages of the present invention are further described in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
It should be noted that descriptions involving "first", "second", and the like in the present invention are for descriptive purposes only and are not to be understood as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with one another, but only insofar as a person of ordinary skill in the art can implement the combination; where a combination of technical solutions is contradictory or cannot be implemented, the combination should be deemed not to exist and falls outside the scope of protection claimed by the present invention.
Referring to Fig. 1, a schematic diagram of an optional hardware architecture of the application server 1 is shown.
The application server 1 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server. The application server 1 may be a standalone server or a server cluster made up of multiple servers.
In this embodiment, the application server 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicatively connected to one another through a system bus.
The application server 1 connects to a network through the network interface 13 to obtain information. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or a voice call network.
It should be pointed out that Fig. 1 shows only the application server 1 with components 11-13. It should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
The memory 11 includes at least one type of readable storage medium. The readable storage medium includes flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 11 may be an internal storage unit of the application server 1, such as the hard disk or main memory of the application server 1. In other embodiments, the memory 11 may be an external storage device of the application server 1, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) fitted to the application server 1. The memory 11 may also include both the internal storage unit of the application server 1 and its external storage device. In this embodiment, the memory 11 is generally used to store the operating system and the various application software installed on the application server 1, such as the program code of the web page crawl configuration program 200. The memory 11 may also be used to temporarily store various data that has been output or is to be output.
In some embodiments, the processor 12 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is generally used to control the overall operation of the application server 1, for example to perform control and processing related to data exchange or communication. In this embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example to run the web page crawl configuration program 200.
The network interface 13 may include a wireless network interface or a wired network interface. The network interface 13 is generally used to establish a communication connection between the application server 1 and other electronic devices.
In this embodiment, the web page crawl configuration program 200 is installed and runs on the application server 1. When the configuration program 200 runs, the application server 1 first receives a crawl URL input by a user; next, sets a crawl information type; then sets a crawl task processing node; then proceeds from links on the web page of the crawl URL to the crawl task processing node; and finally crawls the corresponding information on the crawl task processing node according to the crawl information type. This avoids the drawback of prior-art crawling, in which cyclic searching increases the server load. The crawl depth can be controlled flexibly, and the data can be classified and organized during the crawl itself, improving the efficiency with which the data is crawled and used.
Thus far, the hardware structure and functions of the application environment and the related devices of the embodiments of the present invention have been described in detail. The embodiments of the present invention are proposed below on the basis of this application environment and these devices.
First, the present invention proposes a web page crawl configuration program 200.
Referring to Fig. 2, a program module diagram of the first embodiment of the web page crawl configuration program 200 of the present invention is shown.
In this embodiment, the web page crawl configuration program 200 includes a series of computer program instructions stored in the memory 11. When these computer program instructions are executed by the processor 12, the web page crawl configuration operations of the embodiments of the present invention can be carried out. In some embodiments, the configuration program 200 may be divided into one or more modules according to the specific operations implemented by each part of the computer program instructions. For example, in Fig. 2 the configuration program 200 may be divided into a receiving module 201, a first setting module 202, a second setting module 203, a link processing module 204, and an information crawling module 205, where:
The receiving module 201 is used to receive a crawl URL input by a user. Specifically, the receiving module 201 receives the crawl URL input by the user in a preset URL field. The user enters the crawl URL in the preset URL field of a terminal device. In this embodiment, the terminal device may be a mobile device such as a mobile phone, a smartphone, a laptop, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable media player), a navigation device, or an in-vehicle device, or a fixed terminal such as a digital TV, a desktop computer, a notebook, or a server.
In this embodiment, the receiving module 201 receives the crawl URL input by the user in the following manner:
The receiving module 201 first establishes association information between keyword text and the crawl URL, then receives keyword text input by the user, and finally obtains the URL corresponding to the keyword text from the keyword text and the association information.
In this embodiment, the crawl URL may be entered by the user in the preset URL field of the terminal device. In other embodiments, the user does not need to remember the URL and only needs to enter keyword text associated with the website, for example the word "Sina", by keyboard or voice input; the URL corresponding to "Sina" is then filled in automatically according to the preset association information.
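The keyword lookup described above can be sketched as a simple table lookup. This is only an illustration under stated assumptions: the patent does not say how the association information is stored, so the `KEYWORD_TO_URL` dictionary, the function name `resolve_crawl_url`, and the placeholder URL are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical association table (keyword text -> crawl URL).
# The patent does not specify its contents; the URL is a placeholder.
KEYWORD_TO_URL: Dict[str, str] = {
    "sina": "https://example.com/sina",
}


def resolve_crawl_url(user_input: str,
                      table: Dict[str, str] = KEYWORD_TO_URL) -> Optional[str]:
    """Return the crawl URL for the user's input, or None if nothing matches."""
    text = user_input.strip()
    # Input that already looks like a URL is used as-is.
    if text.lower().startswith(("http://", "https://")):
        return text
    # Otherwise treat the input as keyword text and consult the association table.
    return table.get(text.lower())
```

Lower-casing the keyword before the lookup is a design choice of this sketch, not something the source requires.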
The first setting module 202 is used to set the crawl information type. In this embodiment, the crawl information type includes at least one of text, Hypertext Markup Language (HTML), multimedia, and pictures.
In this embodiment, the way information is obtained depends on the crawl information type:
For text-type information, the content is usually presented as documents generated by software from specialized vendors, and those vendors provide corresponding text extraction interfaces. The crawler only needs to call the interfaces of these plug-ins to easily extract the text in a document and other file-related information.
Documents such as HTML are different. HTML has its own syntax and uses different markup identifiers to express formatting such as font, color, and position, for example style="color:#fff;font-weight:bold". When extracting text, these identifiers must all be filtered out before the content is obtained.
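As a minimal sketch of the filtering step just described, the following uses Python's standard `html.parser` module to drop tags and attributes and keep only the visible text. The class and function names are illustrative, and skipping the contents of `script` and `style` elements is an assumption of this sketch rather than something the patent states.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores tags, attributes, scripts, and styles."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

For example, `extract_text('<p style="color:#fff;font-weight:bold">Hello <b>world</b></p>')` yields the plain text with the style identifiers removed.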
For multimedia and picture types, the content of these files is generally judged from the anchor text of the link (that is, the link text) and from related file annotations, and the corresponding content is then obtained.
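One way to read this is that a crawler describes a media file by the text around it instead of by the file's own bytes. The sketch below, again using the standard `html.parser` module, pairs each media link with its anchor text and each `img` with its `alt` attribute. Treating `alt` as the "related file annotation" and the list in `MEDIA_EXTENSIONS` are assumptions made for illustration.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# Hypothetical list of extensions treated as multimedia or picture files.
MEDIA_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".mp3", ".mp4")


class MediaLinkCollector(HTMLParser):
    """Pairs media URLs with descriptive text (anchor text or alt text)."""

    def __init__(self):
        super().__init__()
        self._current_href = None
        self._text = []
        self.media: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        href = attr.get("href") or ""
        if tag == "a" and href.lower().endswith(MEDIA_EXTENSIONS):
            # Start collecting the anchor text of a link to a media file.
            self._current_href = href
            self._text = []
        elif tag == "img" and attr.get("src"):
            # Use the alt attribute as the annotation for an inline picture.
            self.media.append((attr["src"], attr.get("alt") or ""))

    def handle_data(self, data):
        if self._current_href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            self.media.append((self._current_href, "".join(self._text).strip()))
            self._current_href = None


def collect_media(html: str) -> List[Tuple[str, str]]:
    collector = MediaLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.media
```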
The second setting module 203 is used to set the crawl task processing node. In this embodiment, setting the crawl task processing node specifically means setting the number of access layers of the website represented by the crawl URL.
In this embodiment, for reasons of crawling efficiency it is not feasible to fetch every web page, so a processing node for the crawl task can be set for the website being crawled; that is, the number of access layers (also called the crawl depth) is set. For example, A is the starting page and belongs to layer 0; B, C, D, E, and F are linked from A and belong to layer 1; G and H are linked from layer 1 and belong to layer 2; and I is linked from layer 2 and belongs to layer 3. If the number of access layers set for the web spider is 2, page I will not be visited.
In this way, setting the crawl task processing node, that is, setting the number of access layers of the website represented by the crawl URL, allows the crawl depth to be controlled flexibly and improves the efficiency of the crawl as a whole.
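The layer example above can be reproduced with a depth-limited breadth-first traversal over an in-memory link graph. This is a sketch only: the patent does not say which layer-1 page links to G or H, or which layer-2 page links to I, so the edges B→G, C→H, and G→I in `SITE` are assumptions, as are the function and variable names.

```python
from collections import deque
from typing import Dict, List


def crawl_to_depth(links: Dict[str, List[str]], start: str,
                   max_depth: int) -> Dict[str, int]:
    """Return each visited page mapped to its layer, visiting at most max_depth layers below start."""
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth_of[page] >= max_depth:
            continue  # do not follow links out of the deepest allowed layer
        for target in links.get(page, []):
            if target not in depth_of:  # visit each page once
                depth_of[target] = depth_of[page] + 1
                queue.append(target)
    return depth_of


# Link graph for the A..I example; the specific edges below layer 0 are assumed.
SITE = {
    "A": ["B", "C", "D", "E", "F"],
    "B": ["G"],
    "C": ["H"],
    "G": ["I"],
}

visited = crawl_to_depth(SITE, "A", max_depth=2)
print(sorted(visited))   # pages A through H; I (layer 3) is not visited
```

Tracking visited pages also prevents the cyclic re-fetching that the background section identifies as a source of server load.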
The link processing module 204 is used to proceed from links on the web page of the crawl URL to the crawl task processing node.
The information crawling module 205 is used to crawl the corresponding information on the crawl task processing node according to the crawl information type.
Through the above program modules 201-205, the web page crawl configuration program 200 proposed by the present invention first receives a crawl URL input by a user; next, sets a crawl information type; then sets a crawl task processing node; then proceeds from links on the web page of the crawl URL to the crawl task processing node; and finally crawls the corresponding information on the crawl task processing node according to the crawl information type. This avoids the drawback of prior-art crawling, in which cyclic searching increases the server load. The crawl depth can be controlled flexibly, and the data can be classified and organized during the crawl itself, improving the efficiency with which the data is crawled and used.
Further, on the basis of the above first embodiment of the web page crawl configuration program 200, a second embodiment of the present invention is proposed (as shown in Fig. 3). In this embodiment, the web page crawl configuration program 200 further includes a storage module 206, where:
The second setting module 203 is also used to set crawl purpose information and to set a corresponding storage space according to the crawl purpose information. For example, the purpose of the crawl may be user behavior analysis or data modification, so that crawled data can be classified and organized by purpose, which improves efficiency. Specifically, an identification number can be assigned to each distinct crawl purpose, so that different crawl purposes are distinguished by their identification numbers.
After the step of crawling the corresponding information on the crawl task processing node according to the crawl information type, the storage module 206 is used to store the corresponding information in the storage space.
In this embodiment, the crawl purpose information is set while the crawl flow is carried out, and a storage space named after the purpose is established; once the crawl flow is complete, the obtained information can be stored in that storage space. For example, if the purpose of the current crawl is user behavior analysis, then after the crawl flow ends the data can be stored in the storage space for user behavior analysis so that subsequent user behavior applications can call it directly. This makes it easier to classify and manage the data and improves the efficiency of the crawl as a whole.
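A minimal sketch of purpose-keyed storage follows, assuming the "storage space" is a directory named after the purpose and its identification number. The patent does not specify the storage medium, the numbering scheme, or the file format, so `PURPOSE_IDS`, the directory layout, and the JSON file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

# Hypothetical identification numbers for crawl purposes.
PURPOSE_IDS = {
    "user_behavior_analysis": "P001",
    "data_modification": "P002",
}


def storage_dir(root: Path, purpose: str) -> Path:
    """Create (if needed) and return the storage space for a crawl purpose."""
    directory = root / f"{PURPOSE_IDS[purpose]}_{purpose}"
    directory.mkdir(parents=True, exist_ok=True)
    return directory


def store_results(root: Path, purpose: str,
                  records: List[Dict[str, str]]) -> Path:
    """Write crawled records into the storage space that matches the purpose."""
    out = storage_dir(root, purpose) / "crawl_results.json"
    out.write_text(json.dumps(records, ensure_ascii=False, indent=2),
                   encoding="utf-8")
    return out


# Usage: store one record under the user behavior analysis purpose.
with tempfile.TemporaryDirectory() as tmp:
    saved = store_results(Path(tmp), "user_behavior_analysis",
                          [{"url": "https://example.com/", "text": "hello"}])
    print(saved.parent.name)  # P001_user_behavior_analysis
```

A downstream application for a given purpose can then read only its own directory, which is the classification benefit the embodiment describes.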
Through the above program modules 201-206, the web page crawl configuration program 200 proposed by the present invention sets crawl purpose information, sets a corresponding storage space according to the crawl purpose information, and stores the corresponding information in the storage space, thereby classifying and managing the data.
In addition, the present invention also proposes a configuration method for web page crawling.
Referring to Fig. 4, a flowchart of the first embodiment of the configuration method for web page crawling of the present invention is shown. In this embodiment, the order in which the steps of the flowchart shown in Fig. 4 are executed may be changed, and some steps may be omitted, according to different requirements.
Step S401: receive a crawl URL input by a user. Specifically, the application server 1 receives the crawl URL input by the user in a preset URL field. The user enters the crawl URL in the preset URL field of a terminal device. The specific steps of receiving the crawl URL input by the user are described in detail in the third embodiment of the configuration method for web page crawling of the present invention (Fig. 6). In this embodiment, the terminal device may be a mobile device such as a mobile phone, a smartphone, a laptop, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable media player), a navigation device, or an in-vehicle device, or a fixed terminal such as a digital TV, a desktop computer, a notebook, or a server.
Step S402: set the crawl information type. In this embodiment, the crawl information type includes at least one of text, Hypertext Markup Language (HTML), multimedia, and pictures.
In this embodiment, the way information is obtained depends on the crawl information type:
For text-type information, the content is usually presented as documents generated by software from specialized vendors, and those vendors provide corresponding text extraction interfaces. The crawler only needs to call the interfaces of these plug-ins to easily extract the text in a document and other file-related information.
Documents such as HTML are different. HTML has its own syntax and uses different markup identifiers to express formatting such as font, color, and position, for example style="color:#fff;font-weight:bold". When extracting text, these identifiers must all be filtered out before the content is obtained.
For multimedia and picture types, the content of these files is generally judged from the anchor text of the link (that is, the link text) and from related file annotations, and the corresponding content is then obtained.
Step S403: set the crawl task processing node. In this embodiment, setting the crawl task processing node specifically means setting the number of access layers of the website represented by the crawl URL.
In this embodiment, for reasons of crawling efficiency it is not feasible to fetch every web page, so a processing node for the crawl task can be set for the website being crawled; that is, the number of access layers (also called the crawl depth) is set. For example, A is the starting page and belongs to layer 0; B, C, D, E, and F are linked from A and belong to layer 1; G and H are linked from layer 1 and belong to layer 2; and I is linked from layer 2 and belongs to layer 3. If the number of access layers set for the web spider is 2, page I will not be visited.
In this way, setting the crawl task processing node, that is, setting the number of access layers of the website represented by the crawl URL, allows the crawl depth to be controlled flexibly and improves the efficiency of the crawl as a whole.
Step S404: proceed from links on the web page of the crawl URL to the crawl task processing node.
Step S405: crawl the corresponding information on the crawl task processing node according to the crawl information type.
Through the above steps S401-S405, the configuration method for web page crawling proposed by the present invention first receives a crawl URL input by a user; next, sets a crawl information type; then sets a crawl task processing node; then proceeds from links on the web page of the crawl URL to the crawl task processing node; and finally crawls the corresponding information on the crawl task processing node according to the crawl information type. This avoids the drawback of prior-art crawling, in which cyclic searching increases the server load. The crawl depth can be controlled flexibly, and the data can be classified and organized during the crawl itself, improving the efficiency with which the data is crawled and used.
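Steps S401-S405 can be tied together in a small end-to-end sketch. It is one possible reading, not the patented implementation: `CrawlConfig`, `run_crawl`, the injected `fetch` callable, and the page dictionary shape (`links` plus `items` grouped by type) are all hypothetical, and collecting items on every visited layer (instead of only on the deepest one) is an assumption, since the source does not settle that point.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

ALLOWED_TYPES = {"text", "html", "multimedia", "picture"}


@dataclass
class CrawlConfig:
    start_url: str          # S401: crawl URL received from the user
    info_types: Set[str]    # S402: crawl information types
    max_depth: int = 1      # S403: number of access layers

    def __post_init__(self):
        if not self.info_types or self.info_types - ALLOWED_TYPES:
            raise ValueError("info_types must be a non-empty subset of ALLOWED_TYPES")
        if self.max_depth < 0:
            raise ValueError("max_depth must be non-negative")


def run_crawl(config: CrawlConfig,
              fetch: Callable[[str], dict]) -> Dict[str, List[str]]:
    """Crawl layer by layer and return the collected items grouped by type."""
    collected: Dict[str, List[str]] = {t: [] for t in config.info_types}
    seen = {config.start_url}
    frontier = [config.start_url]
    for depth in range(config.max_depth + 1):
        next_frontier = []
        for url in frontier:
            page = fetch(url)
            # S405: keep only the configured information types, already classified.
            for info_type in config.info_types:
                collected[info_type].extend(page.get("items", {}).get(info_type, []))
            # S404: follow links only while above the configured layer limit.
            if depth < config.max_depth:
                for link in page.get("links", []):
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
        frontier = next_frontier
    return collected
```

Passing `fetch` in as a parameter keeps the sketch free of network code; a real crawler would supply a function that downloads and parses the page.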
Referring to Fig. 5, a flowchart of the second embodiment of the configuration method for web page crawling of the present invention is shown. In this embodiment, the order in which the steps of the flowchart shown in Fig. 5 are executed may be changed, and some steps may be omitted, according to different requirements.
Step S501: receive a crawl URL input by a user. Specifically, the application server 1 receives the crawl URL input by the user in a preset URL field. The user enters the crawl URL in the preset URL field of a terminal device. The specific steps of receiving the crawl URL input by the user are described in detail in the third embodiment of the configuration method for web page crawling of the present invention (Fig. 6). In this embodiment, the terminal device may be a mobile device such as a mobile phone, a smartphone, a laptop, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable media player), a navigation device, or an in-vehicle device, or a fixed terminal such as a digital TV, a desktop computer, a notebook, or a server.
Step S502, setting crawl information type.In the present embodiment, the information type that crawls includes word, hypertext At least one or multiple of identifiable language (Hyper Text Markup Language, HTML), multimedia and photo.
In this embodiment, the way information is obtained depends on the crawl information type, and different types are handled in different ways:
For text-type information, the content is usually presented as a document generated by software from a specialized vendor, and the vendor generally provides a corresponding text extraction interface. The crawler only needs to call the interfaces of these plug-ins to easily extract the text in the document and other file-related information.
Documents such as HTML are different. HTML has its own syntax, in which different identifiers express formats such as font, color and position, for example: style="color:#fff;font-weight:bold". When extracting text information, these identifiers must all be filtered out before the content information is obtained.
For multimedia and picture types, the content of the file is generally judged from the anchor text of the link (that is, the link text) and the related file annotations, and the corresponding content is then obtained.
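The HTML and multimedia cases above can be illustrated with a short sketch based on Python's standard `html.parser` module. It is not the implementation of the invention: the class name `TextAndAnchorExtractor`, the helper `extract` and the sample markup are assumptions made for illustration. The sketch drops tags and attributes such as `style`, keeps the visible text, and records the anchor text of each link so that a linked media file can be described by its link text.

```python
from html.parser import HTMLParser


class TextAndAnchorExtractor(HTMLParser):
    """Collects visible text and (href, anchor text) pairs from an HTML string."""

    SKIPPED = {"script", "style"}  # element contents that are never visible text

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.anchors = []        # list of (href, anchor text)
        self._skip_depth = 0
        self._href = None
        self._anchor_parts = []

    def handle_starttag(self, tag, attrs):
        # Formatting attributes such as style="..." arrive in attrs and are not copied.
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor_parts = []

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a" and self._href is not None:
            self.anchors.append((self._href, " ".join(self._anchor_parts)))
            self._href = None

    def handle_data(self, data):
        chunk = data.strip()
        if not chunk or self._skip_depth:
            return
        self.text_parts.append(chunk)
        if self._href is not None:
            self._anchor_parts.append(chunk)


def extract(html):
    """Return (visible text, list of (href, anchor text)) for an HTML string."""
    parser = TextAndAnchorExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.anchors


SAMPLE = (
    '<p style="color:#fff;font-weight:bold">Quarterly report</p>'
    "<style>p { margin: 0 }</style>"
    '<a href="/media/q1.mp4">Q1 results video</a>'
)
text, anchors = extract(SAMPLE)
print(text)
print(anchors)
```

In this sketch the anchor text "Q1 results video" is the only description available for `/media/q1.mp4`, which mirrors how the description judges multimedia content from link text rather than from the file itself.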
Step S503, set the crawl task processing node. In this embodiment, setting the crawl task processing node specifically means setting the number of access layers for the website represented by the crawl URL.
In this embodiment, for reasons of crawl efficiency it is impossible to fetch every web page, so a processing node for the crawl task can be set for the website being crawled, that is, the number of layers to access (also called the crawl depth) is set. For example, A is the starting page and belongs to layer 0; B, C, D, E and F are linked from A and belong to layer 1; G and H are linked from layer-1 pages and belong to layer 2; I is linked from a layer-2 page and belongs to layer 3. If the web spider's number of access layers is set to 2, page I will not be accessed.
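The layer example can be reproduced with a small breadth-first traversal over an in-memory link graph. This is a sketch under stated assumptions: the function `crawl_layers` and the `LINKS` table are hypothetical, and the description does not say which layer-1 pages link to G and H or which layer-2 page links to I, so those edges are chosen arbitrarily here.

```python
from collections import deque

# Hypothetical link graph mirroring the example: page -> pages it links to.
# Which layer-1 page links to G or H, and which layer-2 page links to I,
# is an assumption; the description gives only the layers.
LINKS = {
    "A": ["B", "C", "D", "E", "F"],
    "B": ["G"],
    "C": ["H"],
    "G": ["I"],
}


def crawl_layers(start, max_layer, links):
    """Breadth-first traversal returning {page: layer} for layers 0..max_layer."""
    layer_of = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if layer_of[page] == max_layer:
            continue  # the processing node is reached: do not follow deeper links
        for target in links.get(page, []):
            if target not in layer_of:  # each page is visited at most once
                layer_of[target] = layer_of[page] + 1
                queue.append(target)
    return layer_of


visited = crawl_layers("A", 2, LINKS)
print(sorted(visited))
```

With the number of access layers set to 2, pages A through H are visited and I is never queued; raising the limit to 3 would include it.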
Step S504, follow the links on the web page of the crawl URL to the crawl task processing node.
Step S505, on the crawl task processing node, crawl the corresponding information according to the crawl information type.
Step S506, set crawl purpose information. For example, the purpose of the crawl may be user behavior analysis or data modification; recording it allows crawls to be classified and organized by purpose, which improves efficiency. Specifically, an identification number can be assigned to each different piece of crawl purpose information, so that different crawl purposes can be distinguished by their identification numbers.
Step S507, set the corresponding storage space according to the crawl purpose information.
Step S508, store the corresponding information in the storage space.
In this embodiment, the crawl purpose information is set while the crawl flow is carried out, and a storage space named after the purpose is established; after the crawl flow completes, the obtained information can be stored in that storage space. For example, if the purpose of this crawl is user behavior analysis, then after the crawl flow ends the data can be stored in the user behavior analysis storage space, so that a subsequent user behavior application can call it directly. This facilitates the classification and management of the data and improves the efficiency of the whole data crawl.
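Steps S506-S508 can be sketched as a mapping from purpose identification numbers to named storage spaces. The sketch below keeps everything in memory and is purely illustrative: the class `PurposeStore`, its method names, the identification numbers and the sample record are assumptions, and a real deployment would presumably use a database or file system instead.

```python
class PurposeStore:
    """In-memory stand-in for storage spaces named after crawl purposes."""

    def __init__(self):
        self._space_name = {}  # identification number -> storage space name
        self._spaces = {}      # storage space name -> list of stored records

    def set_purpose(self, purpose_id, purpose_name):
        """S506/S507: register a purpose and create its storage space."""
        self._space_name[purpose_id] = purpose_name
        self._spaces.setdefault(purpose_name, [])

    def store(self, purpose_id, record):
        """S508: store crawled information in the space for this purpose."""
        self._spaces[self._space_name[purpose_id]].append(record)

    def records(self, purpose_name):
        """Return a copy of the records held in the named storage space."""
        return list(self._spaces.get(purpose_name, []))


store = PurposeStore()
store.set_purpose(1, "user_behavior_analysis")
store.set_purpose(2, "data_modification")
store.store(1, {"url": "https://example.com/", "text": "sample page text"})
print(store.records("user_behavior_analysis"))
print(store.records("data_modification"))
```

Because each record is filed under its purpose at storage time, a later user behavior application can read its own space directly without filtering the whole crawl result.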
Through steps S501-S508 above, the web page crawl configuration method proposed by the present invention can, after the crawl flow ends, store the data in the storage space that corresponds to the crawl purpose (for example, user behavior analysis), so that a subsequent user behavior application can call it directly. This facilitates the classification and management of the data and improves the efficiency of the whole data crawl.
FIG. 6 is a flow diagram of the third embodiment of the web page crawl configuration method of the present invention. In this embodiment, the execution order of the steps in the flowchart shown in FIG. 6 may be changed, and certain steps may be omitted, according to different requirements.
In this embodiment, the step of receiving the crawl URL input by the user specifically includes:
Step S601, receive the crawl URL input by the user in the preset URL field.
Step S602, establish association information between keyword information and the crawl URL.
Step S603, receive keyword information input by the user.
Step S604, obtain the URL corresponding to the keyword information from the keyword information and the association information.
In this embodiment, the crawl URL may be entered by the user in the preset URL field of the terminal device. In other embodiments, the user does not need to remember the URL and only needs to enter the keyword information associated with it; for example, the user enters the word "Sina" by keyboard or by voice, and the URL corresponding to "Sina" is then filled in automatically according to the preset association information. In this embodiment, the terminal device may be a mobile device such as a mobile phone, smartphone, laptop, digital broadcast receiver, PDA (personal digital assistant), PAD (tablet computer), PMP (portable media player), navigation device or vehicle-mounted device, or a fixed terminal such as a digital TV, desktop computer, notebook or server.
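Steps S601-S604 amount to a lookup from keyword information to a stored URL, with direct URL entry still accepted. The following sketch assumes a plain dictionary as the association information; the function name `resolve_crawl_url` and the URL stored for "Sina" are illustrative assumptions, not values given in the specification.

```python
# S602: preset association information, keyword -> crawl URL.
# The URL below is an illustrative assumption.
ASSOCIATIONS = {"Sina": "https://www.sina.com.cn"}


def resolve_crawl_url(user_input, associations):
    """Return the crawl URL for either a typed URL or an associated keyword.

    Returns None when the input is neither a URL nor a known keyword.
    """
    entry = user_input.strip()
    if entry.startswith(("http://", "https://")):
        return entry                 # S601: the user typed the URL directly
    return associations.get(entry)   # S603/S604: look the keyword up


print(resolve_crawl_url("Sina", ASSOCIATIONS))
print(resolve_crawl_url("https://example.com/news", ASSOCIATIONS))
print(resolve_crawl_url("unknown keyword", ASSOCIATIONS))
```

Returning `None` for an unknown keyword leaves it to the caller to prompt the user again rather than crawling a guessed address.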
Through steps S601-S604 above, the web page crawl configuration method proposed by the present invention can quickly obtain the crawl URL from the keyword information input by the user, which improves the efficiency of the whole data crawl.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium. The computer-readable storage medium stores a web page crawl configuration program, and the web page crawl configuration program can be executed by at least one processor so that the at least one processor performs the steps of the web page crawl configuration method described above.
The embodiments of the present invention are numbered for description only; the numbering does not indicate the relative merits of the embodiments.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk or optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, computer, server, air conditioner, network device or the like) to perform the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

1. A web page crawl configuration method, applied to an application server, characterized in that the method comprises the steps of:
receiving a crawl URL input by a user;
setting a crawl information type, wherein the crawl information type includes at least one of text, HTML, multimedia, and pictures;
setting a crawl task processing node;
following the links on the web page of the crawl URL to the crawl task processing node; and
crawling corresponding information on the crawl task processing node according to the crawl information type.
2. The web page crawl configuration method of claim 1, characterized in that the step of receiving the crawl URL input by the user further comprises:
establishing association information between keyword information and the crawl URL;
receiving keyword information input by the user; and
obtaining the URL corresponding to the keyword information from the keyword information and the association information.
3. The web page crawl configuration method of claim 1, characterized in that the step of setting the crawl task processing node comprises:
setting the number of access layers for the website represented by the crawl URL, and crawling web pages according to the crawl task processing node.
4. The web page crawl configuration method of claim 1, characterized in that the method further comprises the steps of:
setting crawl purpose information; and
setting a corresponding storage space according to the crawl purpose information.
5. The web page crawl configuration method of claim 4, characterized in that, after the step of crawling corresponding information on the crawl task processing node according to the crawl information type, the method further comprises the step of:
storing the corresponding information in the storage space.
6. An application server, characterized in that the application server comprises a memory and a processor, the memory stores a web page crawl configuration program that can run on the processor, and the web page crawl configuration program, when executed by the processor, implements the following steps:
receiving a crawl URL input by a user;
setting a crawl information type, wherein the crawl information type includes at least one of text, HTML, multimedia, and pictures;
setting a crawl task processing node;
following the links on the web page of the crawl URL to the crawl task processing node; and
crawling corresponding information on the crawl task processing node according to the crawl information type.
7. The application server of claim 6, characterized in that the step of receiving the crawl URL input by the user comprises:
establishing association information between keyword information and the crawl URL;
receiving keyword information input by the user; and
obtaining the URL corresponding to the keyword information from the keyword information and the association information.
8. The application server of claim 6, characterized in that the step of setting the crawl task processing node comprises:
setting the number of access layers for the website represented by the crawl URL, and crawling web pages according to the crawl task processing node.
9. The application server of claim 6, characterized in that the web page crawl configuration program, when executed by the processor, further implements the following steps:
setting crawl purpose information;
setting a corresponding storage space according to the crawl purpose information; and
after the step of crawling corresponding information on the crawl task processing node according to the crawl information type, storing the corresponding information in the storage space.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores a web page crawl configuration program, and the web page crawl configuration program can be executed by at least one processor so that the at least one processor performs the steps of the web page crawl configuration method of any one of claims 1-5.
CN201810119441.1A 2018-02-06 2018-02-06 Configuration method, application server and the computer readable storage medium of web page crawl Pending CN108491420A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810119441.1A CN108491420A (en) 2018-02-06 2018-02-06 Configuration method, application server and the computer readable storage medium of web page crawl
PCT/CN2018/089706 WO2019153603A1 (en) 2018-02-06 2018-06-03 Web page crawling configuration method, application server and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810119441.1A CN108491420A (en) 2018-02-06 2018-02-06 Configuration method, application server and the computer readable storage medium of web page crawl

Publications (1)

Publication Number Publication Date
CN108491420A true CN108491420A (en) 2018-09-04

Family

ID=63344583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810119441.1A Pending CN108491420A (en) 2018-02-06 2018-02-06 Configuration method, application server and the computer readable storage medium of web page crawl

Country Status (2)

Country Link
CN (1) CN108491420A (en)
WO (1) WO2019153603A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110297962A (en) * 2019-06-28 2019-10-01 北京金山安全软件有限公司 Website resource crawling method, device, system and computer equipment
CN111192155A (en) * 2019-12-25 2020-05-22 杭州龙席网络科技股份有限公司 Social media inquiry plate identification and recommendation method based on SAAS
CN111209459A (en) * 2019-12-27 2020-05-29 中移(杭州)信息技术有限公司 Information processing method, information processing device, electronic equipment and storage medium
CN111241370A (en) * 2020-01-08 2020-06-05 北京松果电子有限公司 Method, device and storage medium for distributed crawling of content
CN111241366A (en) * 2019-12-25 2020-06-05 杭州龙席网络科技股份有限公司 Client social media monitoring method based on SAAS
CN112948654A (en) * 2019-11-26 2021-06-11 上海哔哩哔哩科技有限公司 Webpage crawling method and device and computer equipment

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110941788A (en) * 2019-12-17 2020-03-31 山西云时代技术有限公司 Cloud environment distributed Web page extraction and analysis system and method for edge computing

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929932A (en) * 2012-09-25 2013-02-13 人民搜索网络股份公司 Displaying device and displaying method for real-time news
CN103475688A (en) * 2013-05-24 2013-12-25 北京网秦天下科技有限公司 Distributed method and distributed system for downloading website data
CN104376406A (en) * 2014-11-05 2015-02-25 上海计算机软件技术开发中心 Enterprise innovation resource management and analysis system and method based on big data
CN104391978A (en) * 2014-12-05 2015-03-04 北京国双科技有限公司 Method and device for storing and processing web pages of browsers
CN105045872A (en) * 2015-07-16 2015-11-11 北京京东尚科信息技术有限公司 Information screening method and information screening device
CN105893583A (en) * 2016-04-01 2016-08-24 北京鼎泰智源科技有限公司 Data acquisition method and system based on artificial intelligence

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970788A (en) * 2013-02-01 2014-08-06 北京英富森信息技术有限公司 Webpage-crawling-based crawler technology
CN107622125B (en) * 2017-09-29 2020-02-21 联想(北京)有限公司 Information crawling method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
袁津生 等: "《21世纪高等学校精品教材 搜索引擎与信息检索教程》", 30 April 2008, 北京:中国水利水电出版社 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110297962A (en) * 2019-06-28 2019-10-01 北京金山安全软件有限公司 Website resource crawling method, device, system and computer equipment
CN110297962B (en) * 2019-06-28 2021-08-24 北京金山安全软件有限公司 Website resource crawling method, device, system and computer equipment
CN112948654A (en) * 2019-11-26 2021-06-11 上海哔哩哔哩科技有限公司 Webpage crawling method and device and computer equipment
CN111192155A (en) * 2019-12-25 2020-05-22 杭州龙席网络科技股份有限公司 Social media inquiry plate identification and recommendation method based on SAAS
CN111241366A (en) * 2019-12-25 2020-06-05 杭州龙席网络科技股份有限公司 Client social media monitoring method based on SAAS
CN111209459A (en) * 2019-12-27 2020-05-29 中移(杭州)信息技术有限公司 Information processing method, information processing device, electronic equipment and storage medium
CN111241370A (en) * 2020-01-08 2020-06-05 北京松果电子有限公司 Method, device and storage medium for distributed crawling of content
CN111241370B (en) * 2020-01-08 2023-10-13 北京小米松果电子有限公司 Method, device and storage medium for crawling content in distributed manner

Also Published As

Publication number Publication date
WO2019153603A1 (en) 2019-08-15

Similar Documents

Publication Publication Date Title
CN108491420A (en) Configuration method, application server and the computer readable storage medium of web page crawl
CN110348239B (en) Desensitization rule configuration method, data desensitization method, system and computer equipment
CN104112002B (en) A kind of methods, devices and systems of list adaptation
CN103605502B (en) Form page display method and server
CN105306495B (en) user identification method and device
CN106294648A (en) A kind of processing method and processing device for page access path
CN109829287A (en) Api interface permission access method, equipment, storage medium and device
CN108171069A (en) Desensitization method, application server and computer readable storage medium
CN103368986A (en) Information recommendation method and information recommendation device
CN107809383A (en) A kind of map paths method and device based on MVC
CN103064738A (en) Method and system for embedding local application program window into browser in Linux
US11080322B2 (en) Search methods, servers, and systems
CN108021621A (en) Database data acquisition method, application server and computer-readable recording medium
CN105573733A (en) Communication method for browser and web front end and web front end and system
CN102880698B (en) A kind of crawl website defining method and device
CN104899203B (en) Webpage generation method and device and terminal equipment
CN110162540A (en) Querying method, electronic device and the storage medium of block chain account book data
CN109582883B (en) Column page determination method and device
CN111797297B (en) Page data processing method and device, computer equipment and storage medium
CN111859069B (en) Network malicious crawler identification method, system, terminal and storage medium
CN108875085A (en) Mix image processing method, device, computer equipment and the storage medium of application
CN108427701A (en) The method and application server of help information are identified based on operation pages
CN112416858A (en) Document storage method and device, electronic equipment and computer readable storage medium
CN107832374A (en) Construction method, electronic installation and the storage medium in standard knowledge storehouse
CN108256986A (en) Wages computational methods, application server and computer readable storage medium based on cloud computing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180904