CN108491420A - Configuration method, application server, and computer-readable storage medium for web page crawling
- Publication number: CN108491420A
- Application number: CN201810119441.1A
- Authority
- CN
- China
- Prior art keywords
- crawls
- information
- crawl
- web page
- network address
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
The invention discloses a configuration method for web page crawling. The method includes: receiving a crawl URL input by a user; setting a crawl information type; setting a crawl task processing node; following the links on the web page of the crawl URL to the crawl task processing node; and crawling the corresponding information at the crawl task processing node according to the crawl information type. The invention also provides an application server and a computer-readable storage medium. The configuration method, application server, and computer-readable storage medium provided by the invention allow the crawl depth to be controlled flexibly and allow data to be classified and organized during crawling, improving the efficiency with which the data is crawled and used.
Description
Technical field
The present invention relates to the field of communication technology, and in particular to a configuration method for web page crawling, an application server, and a computer-readable storage medium.
Background
Web page crawling refers to the process or thread in a web search subsystem that completes the crawling of a section of pages according to a uniform resource locator (URL). For a search engine, web page crawling, that is, the web spider, finds web pages through their link addresses: starting from some page of a website (usually the home page), it reads the content of the page, finds the other link addresses in it, and then uses those link addresses to find the next pages. This cycle continues until all pages of the website have been fetched. If the entire Internet is treated as one website, a web spider can use this principle to fetch every web page on the Internet. In current web page crawling, however, and especially when crawling pictures, the target pictures can be crawled effectively, but the cyclic search increases the server load, which reduces crawling efficiency and degrades the user experience.
Summary of the invention
In view of this, the present invention proposes a configuration method for web page crawling, an application server, and a computer-readable storage medium that allow the crawl depth to be controlled flexibly and allow data to be classified and organized during crawling, improving the efficiency with which the data is crawled and used.
First, to achieve the above object, the present invention proposes an application server. The application server includes a memory and a processor; the memory stores a web page crawling configuration program that can run on the processor, and the configuration program implements the following steps when executed by the processor:
Receiving a crawl URL input by a user;
Setting a crawl information type, wherein the crawl information type includes at least one of text, HTML, multimedia, and pictures;
Setting a crawl task processing node;
Following the links on the web page of the crawl URL to the crawl task processing node; and
Crawling the corresponding information at the crawl task processing node according to the crawl information type.
Optionally, the step of receiving the crawl URL input by the user includes:
Establishing association information between keyword text information and the crawl URL;
Receiving keyword text information input by the user; and
Obtaining the URL corresponding to the keyword text information from the keyword text information and the association information.
Optionally, the step of setting the crawl task processing node includes:
Setting the number of access layers for the website represented by the crawl URL, and crawling web pages according to the crawl task processing node.
Optionally, when executed by the processor, the web page crawling configuration program further implements the following steps:
Setting crawl purpose information;
Setting a corresponding storage space according to the crawl purpose information; and
After the step of crawling the corresponding information at the crawl task processing node according to the crawl information type, storing the corresponding information in the storage space.
In addition, to achieve the above object, the present invention also provides a configuration method for web page crawling. The method is applied to an application server and includes:
Receiving a crawl URL input by a user;
Setting a crawl information type, wherein the crawl information type includes at least one of text, HTML, multimedia, and pictures;
Setting a crawl task processing node;
Following the links on the web page of the crawl URL to the crawl task processing node; and
Crawling the corresponding information at the crawl task processing node according to the crawl information type.
Optionally, the step of receiving the crawl URL input by the user further includes:
Establishing association information between keyword text information and the crawl URL;
Receiving keyword text information input by the user; and
Obtaining the URL corresponding to the keyword text information from the keyword text information and the association information.
Optionally, the step of setting the crawl task processing node includes:
Setting the number of access layers for the website represented by the crawl URL, and crawling web pages according to the crawl task processing node.
Optionally, the method further includes the steps of:
Setting crawl purpose information; and
Setting a corresponding storage space according to the crawl purpose information.
Optionally, after the step of crawling the corresponding information at the crawl task processing node according to the crawl information type, the method further includes the step of:
Storing the corresponding information in the storage space.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium. The computer-readable storage medium stores a web page crawling configuration program that can be executed by at least one processor, so that the at least one processor performs the steps of the configuration method for web page crawling described above.
Compared with the prior art, the application server, the configuration method for web page crawling, and the computer-readable storage medium proposed by the present invention first receive a crawl URL input by a user; second, set a crawl information type; then set a crawl task processing node; then follow the links on the web page of the crawl URL to the crawl task processing node; and finally crawl the corresponding information at the crawl task processing node according to the crawl information type. This avoids the drawback of prior-art crawling, in which cyclic searching increases the server load. The crawl depth can be controlled flexibly, and data can be classified and organized during crawling, improving the efficiency with which the data is crawled and used.
Brief description of the drawings
Fig. 1 is a schematic diagram of an optional hardware architecture of the application server of the present invention;
Fig. 2 is a program module diagram of the first embodiment of the web page crawling configuration program of the present invention;
Fig. 3 is a program module diagram of the second embodiment of the web page crawling configuration program of the present invention;
Fig. 4 is a flowchart of the first embodiment of the configuration method for web page crawling of the present invention;
Fig. 5 is a flowchart of the second embodiment of the configuration method for web page crawling of the present invention;
Fig. 6 is a flowchart of the third embodiment of the configuration method for web page crawling of the present invention.
Reference numerals:
The realization of the object, the functional features, and the advantages of the present invention are further described below with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description
To make the purpose, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
It should be noted that descriptions involving "first", "second", and the like in the present invention are for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature defined with "first" or "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with one another, but only on the basis that the combination can be implemented by those of ordinary skill in the art. When a combination of technical solutions is contradictory or cannot be implemented, the combination should be considered not to exist and falls outside the protection scope claimed by the present invention.
Referring to Fig. 1, a schematic diagram of an optional hardware architecture of the application server 1 is shown.
The application server 1 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server. The application server 1 may be an independent server or a server cluster composed of multiple servers.
In this embodiment, the application server 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicatively connected to one another through a system bus.
The application server 1 connects to a network through the network interface 13 to obtain information. The network may be a wireless or wired network such as an intranet, the Internet, Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, or a voice call network.
It should be pointed out that Fig. 1 shows only the application server 1 with components 11-13. It should be understood that not all of the illustrated components are required, and that more or fewer components may be implemented instead.
The memory 11 includes at least one type of readable storage medium. The readable storage medium includes flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 11 may be an internal storage unit of the application server 1, such as the hard disk or internal memory of the application server 1. In other embodiments, the memory 11 may be an external storage device of the application server 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the application server 1. The memory 11 may also include both the internal storage unit of the application server 1 and its external storage device. In this embodiment, the memory 11 is generally used to store the operating system and the various application software installed on the application server 1, such as the program code of the web page crawling configuration program 200. The memory 11 may also be used to temporarily store various data that has been output or is to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is generally used to control the overall operation of the application server 1, for example to perform control and processing related to data interaction or communication. In this embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example to run the web page crawling configuration program 200.
The network interface 13 may include a wireless network interface or a wired network interface. The network interface 13 is generally used to establish communication connections between the application server 1 and other electronic devices.
In this embodiment, the web page crawling configuration program 200 is installed and runs on the application server 1. When the configuration program 200 runs, the application server 1 first receives a crawl URL input by a user; second, sets a crawl information type; then sets a crawl task processing node; then follows the links on the web page of the crawl URL to the crawl task processing node; and finally crawls the corresponding information at the crawl task processing node according to the crawl information type. This avoids the drawback of prior-art crawling, in which cyclic searching increases the server load. The crawl depth can be controlled flexibly, and data can be classified and organized during crawling, improving the efficiency with which the data is crawled and used.
So far, the hardware structure and functions of the application environment and related devices of the embodiments of the present invention have been described in detail. The embodiments of the present invention are proposed below on the basis of this application environment and these devices.
First, the present invention proposes a web page crawling configuration program 200.
Referring to Fig. 2, a program module diagram of the first embodiment of the web page crawling configuration program 200 of the present invention is shown.
In this embodiment, the web page crawling configuration program 200 includes a series of computer program instructions stored in the memory 11. When the computer program instructions are executed by the processor 12, the web page crawling configuration operations of the embodiments of the present invention can be implemented. In some embodiments, the web page crawling configuration program 200 may be divided into one or more modules according to the specific operations implemented by each part of the computer program instructions. For example, in Fig. 2, the web page crawling configuration program 200 may be divided into a receiving module 201, a first setup module 202, a second setup module 203, a link processing module 204, and an information crawling module 205. Wherein:
The receiving module 201 is configured to receive the crawl URL input by the user. Specifically, the receiving module 201 receives the crawl URL input by the user in a preset URL field. The user enters the crawl URL in the preset URL field of a terminal device. In this embodiment, the terminal device may be a mobile device such as a mobile phone, a smartphone, a laptop, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable media player), a navigation device, or an in-vehicle device, or a fixed terminal such as a digital TV, a desktop computer, a notebook, or a server.
In this embodiment, the receiving module 201 receives the crawl URL input by the user as follows:
The receiving module 201 first establishes association information between keyword text information and crawl URLs, then receives keyword text information input by the user, and finally obtains the URL corresponding to the keyword text information from the keyword text information and the association information.
In this embodiment, the crawl URL may be typed by the user into the preset URL field of the terminal device. In other embodiments, the user does not need to remember the URL and only has to input keyword text associated with the website, for example by entering the word "Sina" by keyboard or by voice. The URL corresponding to "Sina" is then filled in automatically according to the preset association information.
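The keyword lookup described above can be sketched as a small table lookup. This is only an illustration under stated assumptions: the function name `resolve_crawl_url`, the table `KEYWORD_TO_URL`, and the URL stored for "sina" are hypothetical and do not come from the patent, which does not specify how the association information is stored.

```python
from __future__ import annotations

# Hypothetical association information: keyword text -> crawl URL.
# The entry below is illustrative only.
KEYWORD_TO_URL = {
    "sina": "https://www.sina.com.cn",
}


def resolve_crawl_url(user_input: str, table: dict[str, str] = KEYWORD_TO_URL) -> str:
    """Return the crawl URL for the user's input.

    Input that already looks like a URL is returned unchanged;
    otherwise it is treated as keyword text and looked up in the table.
    """
    text = user_input.strip()
    if text.lower().startswith(("http://", "https://")):
        return text
    try:
        return table[text.lower()]
    except KeyError:
        raise ValueError(f"no URL associated with keyword {text!r}") from None
```

A caller would pass whatever the user typed or dictated into the preset URL field; keywords are matched case-insensitively here, which is a design choice of this sketch rather than something the patent states.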
The first setup module 202 is configured to set the crawl information type. In this embodiment, the crawl information type includes at least one of text, HyperText Markup Language (HTML), multimedia, and pictures.
In this embodiment, the way information is acquired according to the crawl information type can in turn differ from one crawl information type to another:
For text-type information, the content is usually presented as documents generated by software from specialist vendors, and those vendors all provide corresponding text extraction interfaces. The crawler only needs to call the interfaces of these plug-ins to easily extract the text in a document and other information related to the file.
Documents such as HTML are different. HTML has its own syntax and uses different command identifiers to express formatting such as font, color, and position, for example style="color:#fff;font-weight:bold". When text is extracted, these identifiers must all be filtered out before the content is obtained.
For multimedia and picture types, the content of these files is generally judged from the anchor text of the link (that is, the link text) and the related file annotations, and the corresponding content is then obtained.
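The HTML and multimedia cases above can be sketched with Python's standard `html.parser` module. The class name `PageExtractor` and the function `extract` are hypothetical, and using an image's `alt` attribute as its "related annotation" is an assumption of this sketch; the patent does not say which annotations are consulted.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Drops markup and keeps text, link anchor text, and image descriptions."""

    def __init__(self):
        super().__init__()
        self.text_parts = []   # visible text with tags and attributes filtered out
        self.links = []        # (href, anchor text) pairs, e.g. links to media files
        self.images = []       # (src, alt text) pairs
        self._skip = 0         # nesting depth inside <script> or <style>
        self._href = None      # href of the anchor currently open, if any
        self._anchor = []      # text collected inside the open anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._href = attrs.get("href") or ""
            self._anchor = []
        elif tag == "img":
            self.images.append((attrs.get("src") or "", attrs.get("alt") or ""))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._anchor)))
            self._href = None

    def handle_data(self, data):
        text = data.strip()
        if self._skip or not text:
            return
        self.text_parts.append(text)
        if self._href is not None:
            self._anchor.append(text)


def extract(html):
    """Return (plain text, links with anchor text, images with alt text)."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links, parser.images
```

Inline formatting such as a `style` attribute never reaches the output because only character data is collected, which corresponds to filtering out the identifiers before obtaining the content.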
The second setup module 203 is configured to set the crawl task processing node. In this embodiment, setting the crawl task processing node specifically means setting the number of access layers for the website represented by the crawl URL.
In this embodiment, for reasons of crawling efficiency it is not feasible to fetch all web pages, so a processing node for the crawl task can be set for the crawled website, that is, the number of access layers (also called the crawl depth) is set. For example, A is the starting page and belongs to layer 0; B, C, D, E, and F are linked from A and belong to layer 1; G and H are linked from layer 1 pages and belong to layer 2; and I is linked from a layer 2 page and belongs to layer 3. If the web spider's number of access layers is set to 2, page I will not be visited.
In this way, setting the crawl task processing node, that is, setting the number of access layers for the website represented by the crawl URL, allows the crawl depth to be controlled flexibly and improves overall crawling efficiency.
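The layer example above can be sketched as a breadth-first traversal that stops expanding links once the configured number of access layers is reached. The link graph is held in memory so the sketch runs without network access; the function name `crawl` is hypothetical, and which layer 1 pages link to G and H, and which layer 2 page links to I, are assumptions, since the text only gives the layers.

```python
from collections import deque

# In-memory stand-in for the site's link structure (assumed edges, see lead-in).
LINKS = {
    "A": ["B", "C", "D", "E", "F"],
    "B": ["G"],
    "C": ["H"],
    "G": ["I"],
}


def crawl(start, max_depth, get_links):
    """Visit pages breadth-first and return (page, layer) pairs.

    Pages deeper than max_depth are never queued, so they are never visited.
    """
    seen = {start}
    visited = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        visited.append((page, depth))
        if depth == max_depth:
            continue  # do not follow links beyond the configured layer
        for target in get_links(page):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


pages = crawl("A", 2, lambda page: LINKS.get(page, []))
```

With the number of access layers set to 2, `pages` contains A through H and omits I, matching the example in the text. In a real crawler `get_links` would fetch the page and parse its links.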
The link processing module 204 is configured to follow the links on the web page of the crawl URL to the crawl task processing node.
The information crawling module 205 is configured to crawl the corresponding information at the crawl task processing node according to the crawl information type.
Through the above program modules 201-205, the web page crawling configuration program 200 proposed by the present invention first receives a crawl URL input by a user; second, sets a crawl information type; then sets a crawl task processing node; then follows the links on the web page of the crawl URL to the crawl task processing node; and finally crawls the corresponding information at the crawl task processing node according to the crawl information type. This avoids the drawback of prior-art crawling, in which cyclic searching increases the server load. The crawl depth can be controlled flexibly, and data can be classified and organized during crawling, improving the efficiency with which the data is crawled and used.
Further, based on the above first embodiment of the web page crawling configuration program 200, a second embodiment of the present invention is proposed (as shown in Fig. 3). In this embodiment, the web page crawling configuration program 200 further includes a storage module 206. In this embodiment:
The second setup module 203 is further configured to set crawl purpose information and to set a corresponding storage space according to the crawl purpose information. For example, the purpose of the crawl may be user behavior analysis or data modification, so that crawls can be classified and organized by purpose, improving efficiency. Specifically, an identification number can be assigned to each distinct crawl purpose so that different crawl purposes are distinguished by their identification numbers.
After the step of crawling the corresponding information at the crawl task processing node according to the crawl information type, the storage module 206 is configured to store the corresponding information in the storage space.
In this embodiment, the crawl purpose information is set while the crawl flow is carried out, and a storage space named after the purpose is established. After the crawl flow is complete, the acquired information can be stored in that storage space. For example, if the purpose of this crawl is user behavior analysis, the data can be stored in the storage space for user behavior analysis once the crawl flow ends, so that subsequent user behavior applications can call it directly. This makes the data easier to classify and manage and improves overall crawling efficiency.
Through the above program modules 201-206, the web page crawling configuration program 200 proposed by the present invention sets crawl purpose information, sets a corresponding storage space according to the crawl purpose information, and stores the corresponding information in the storage space, thereby achieving classification and management of the data.
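One way to picture the purpose-keyed storage space is a directory per crawl purpose, named from the purpose and its identification number. This is a sketch under assumptions: the identification numbers in `PURPOSE_IDS`, the directory naming scheme, the file name `crawl.jsonl`, and the JSON Lines format are all hypothetical; the patent only states that a storage space is set up per purpose and that purposes can carry identification numbers.

```python
import json
from pathlib import Path

# Hypothetical identification numbers for the crawl purposes named in the text.
PURPOSE_IDS = {
    "user_behavior_analysis": 1,
    "data_modification": 2,
}


def storage_space_for(purpose, root):
    """Create (if needed) and return the storage directory for a crawl purpose."""
    purpose_id = PURPOSE_IDS[purpose]  # KeyError for a purpose that was never configured
    space = Path(root) / f"{purpose_id:02d}_{purpose}"
    space.mkdir(parents=True, exist_ok=True)
    return space


def store(records, purpose, root):
    """Append crawled records, one JSON object per line, to the purpose's storage space."""
    path = storage_space_for(purpose, root) / "crawl.jsonl"
    with path.open("a", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

A later user behavior application could then read only the directory for its own purpose, which is the classification benefit the embodiment describes.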
In addition, the present invention also proposes a configuration method for web page crawling.
Referring to Fig. 4, a flowchart of the first embodiment of the configuration method for web page crawling of the present invention is shown. In this embodiment, the execution order of the steps in the flowchart shown in Fig. 4 may be changed according to different requirements, and certain steps may be omitted.
Step S401: receive a crawl URL input by a user. Specifically, the application server 1 receives the crawl URL input by the user in a preset URL field. The user enters the crawl URL in the preset URL field of a terminal device. The specific steps of receiving the crawl URL input by the user are described in detail in the third embodiment of the configuration method for web page crawling of the present invention (Fig. 6). In this embodiment, the terminal device may be a mobile device such as a mobile phone, a smartphone, a laptop, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable media player), a navigation device, or an in-vehicle device, or a fixed terminal such as a digital TV, a desktop computer, a notebook, or a server.
Step S402: set a crawl information type. In this embodiment, the crawl information type includes at least one of text, HyperText Markup Language (HTML), multimedia, and pictures.
In this embodiment, the way information is acquired according to the crawl information type can in turn differ from one crawl information type to another:
For text-type information, the content is usually presented as documents generated by software from specialist vendors, and those vendors all provide corresponding text extraction interfaces. The crawler only needs to call the interfaces of these plug-ins to easily extract the text in a document and other information related to the file.
Documents such as HTML are different. HTML has its own syntax and uses different command identifiers to express formatting such as font, color, and position, for example style="color:#fff;font-weight:bold". When text is extracted, these identifiers must all be filtered out before the content is obtained.
For multimedia and picture types, the content of these files is generally judged from the anchor text of the link (that is, the link text) and the related file annotations, and the corresponding content is then obtained.
Step S403: set a crawl task processing node. In this embodiment, setting the crawl task processing node specifically means setting the number of access layers for the website represented by the crawl URL.
In this embodiment, for reasons of crawling efficiency it is not feasible to fetch all web pages, so a processing node for the crawl task can be set for the crawled website, that is, the number of access layers (also called the crawl depth) is set. For example, A is the starting page and belongs to layer 0; B, C, D, E, and F are linked from A and belong to layer 1; G and H are linked from layer 1 pages and belong to layer 2; and I is linked from a layer 2 page and belongs to layer 3. If the web spider's number of access layers is set to 2, page I will not be visited.
In this way, setting the crawl task processing node, that is, setting the number of access layers for the website represented by the crawl URL, allows the crawl depth to be controlled flexibly and improves overall crawling efficiency.
Step S404: follow the links on the web page of the crawl URL to the crawl task processing node.
Step S405: crawl the corresponding information at the crawl task processing node according to the crawl information type.
Through the above steps S401-S405, the configuration method for web page crawling proposed by the present invention first receives a crawl URL input by a user; second, sets a crawl information type; then sets a crawl task processing node; then follows the links on the web page of the crawl URL to the crawl task processing node; and finally crawls the corresponding information at the crawl task processing node according to the crawl information type. This avoids the drawback of prior-art crawling, in which cyclic searching increases the server load. The crawl depth can be controlled flexibly, and data can be classified and organized during crawling, improving the efficiency with which the data is crawled and used.
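The three settings gathered in steps S401 to S403 can be pictured as one validated configuration object that the crawl steps S404 and S405 would then consume. The names `InfoType`, `CrawlConfig`, and `configure` are hypothetical, and the validation rules (at least one information type, a non-negative number of access layers) are assumptions drawn from the wording "at least one of" and from the layer example rather than from any stated requirement.

```python
from dataclasses import dataclass
from enum import Enum


class InfoType(Enum):
    """The four crawl information types named in step S402."""
    TEXT = "text"
    HTML = "html"
    MULTIMEDIA = "multimedia"
    PICTURE = "picture"


@dataclass(frozen=True)
class CrawlConfig:
    url: str                # S401: crawl URL input by the user
    info_types: frozenset   # S402: one or more InfoType members
    max_depth: int          # S403: number of access layers (crawl depth)

    def __post_init__(self):
        if not self.info_types:
            raise ValueError("at least one crawl information type is required")
        if self.max_depth < 0:
            raise ValueError("the number of access layers cannot be negative")


def configure(url, info_types, max_depth):
    """Build a CrawlConfig from plain values; unknown type names raise ValueError."""
    return CrawlConfig(url, frozenset(InfoType(name) for name in info_types), max_depth)


config = configure("https://example.com", ["text", "picture"], 2)
```

Keeping the configuration immutable means the crawl depth and information types cannot change partway through a crawl, which is a design choice of this sketch.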
Referring to Fig. 5, a flowchart of the second embodiment of the configuration method for web page crawling of the present invention is shown. In this embodiment, the execution order of the steps in the flowchart shown in Fig. 5 may be changed according to different requirements, and certain steps may be omitted.
Step S501: receive a crawl URL input by a user. Specifically, the application server 1 receives the crawl URL input by the user in a preset URL field. The user enters the crawl URL in the preset URL field of a terminal device. The specific steps of receiving the crawl URL input by the user are described in detail in the third embodiment of the configuration method for web page crawling of the present invention (Fig. 6). In this embodiment, the terminal device may be a mobile device such as a mobile phone, a smartphone, a laptop, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable media player), a navigation device, or an in-vehicle device, or a fixed terminal such as a digital TV, a desktop computer, a notebook, or a server.
Step S502, setting crawl information type.In the present embodiment, the information type that crawls includes word, hypertext
At least one or multiple of identifiable language (Hyper Text Markup Language, HTML), multimedia and photo.
It in the present embodiment, again can be according to different for the mode for crawling information type progress acquisition of information according to described in
It crawls information type and takes different modes:
For the acquisition of literal type corresponding information, the document for being usually the Software Create provided by specialized vendor is in
Existing, manufacturer can all provide corresponding Text Feature Extraction interface.It crawls program only to need to call the interface of these plug-in units, so that it may with light
Extraction document in text message and the other relevant information of file.
Documents such as HTML are different. HTML has its own syntax and uses different markup identifiers to express formatting such as font, color, and position, for example: style="color:#fff;font-weight:bold". When extracting text, all of these identifiers must be filtered out before the content is obtained.
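As a non-limiting illustration, the filtering described above might be sketched in Python with the standard-library `html.parser` module. The class name `TextExtractor`, the function `extract_text`, and the sample markup are assumptions introduced here for illustration; the patent does not specify an implementation.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text; tags, attributes, and script/style bodies are dropped."""

    SKIPPED = {"script", "style"}  # elements whose content is not page text

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Attributes such as style="..." are simply ignored here.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the text content of an HTML string with all markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample page fragment.
sample = (
    '<p style="color:#fff;font-weight:bold">Hello <b>crawler</b></p>'
    "<style>p{margin:0}</style>"
)
print(extract_text(sample))  # Hello crawler
```

A dedicated parser is used rather than a regular expression because nested tags and embedded style or script blocks are handled more predictably this way.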
For multimedia and picture types, the content of these files is generally judged from the anchor text of the link (that is, the link text) and related file annotations, and the corresponding content is then obtained.
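The anchor-text approach for multimedia and picture links might look like the following minimal Python sketch. The extension list `MEDIA_EXTENSIONS`, the class `MediaLinkCollector`, the function `describe_media_links`, and the sample links are all assumptions made for illustration; file annotations are not handled here.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed set of file extensions treated as multimedia or pictures.
MEDIA_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".mp3", ".mp4")


class MediaLinkCollector(HTMLParser):
    """Pairs each media link with its anchor text as a description of the file."""

    def __init__(self):
        super().__init__()
        self._href = None
        self._text = []
        self.media = []  # list of (url, anchor text) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Compare only the path so query strings do not hide the extension.
            path = urlparse(self._href).path.lower()
            if path.endswith(MEDIA_EXTENSIONS):
                anchor = " ".join("".join(self._text).split())
                self.media.append((self._href, anchor))
            self._href = None


def describe_media_links(html):
    """Return (url, anchor text) for every link that points at a media file."""
    collector = MediaLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.media


# Hypothetical page fragment.
page = (
    '<a href="/img/cat.jpg">A sleeping cat</a> '
    '<a href="/about.html">About</a> '
    '<a href="/v/demo.mp4?x=1">Product  demo video</a>'
)
print(describe_media_links(page))
```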
Step S503: set the crawl task processing node. In this embodiment, setting the crawl task processing node specifically means setting the number of access layers for the website represented by the crawl URL.
In this embodiment, for the sake of crawling efficiency it is not feasible to fetch every web page, so a processing node for the crawl task can be set for the crawled website; that is, the number of access layers (also called the crawl depth) is set. For example, A is the starting page and belongs to layer 0; B, C, D, E, and F are linked from A and belong to layer 1; G and H are linked from layer 1 and belong to layer 2; and I is linked from layer 2 and belongs to layer 3. If the number of access layers set for the web spider is 2, page I will not be visited.
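The layer example above can be sketched as a depth-limited breadth-first traversal. This is an illustrative Python sketch only: the in-memory graph `LINKS` stands in for real pages, and which layer-1 page links to G and H (and which layer-2 page links to I) is an assumption, since the text does not say.

```python
from collections import deque

# Hypothetical link graph for the example: A is layer 0, B-F layer 1,
# G and H layer 2, I layer 3. The specific parents of G, H, and I are assumed.
LINKS = {
    "A": ["B", "C", "D", "E", "F"],
    "B": ["G"],
    "C": ["H"],
    "G": ["I"],
}


def crawl_layers(start, max_depth, links):
    """Breadth-first traversal returning {page: layer} for pages within max_depth."""
    layer_of = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        depth = layer_of[page]
        if depth == max_depth:
            continue  # do not follow links past the configured number of layers
        for target in links.get(page, []):
            if target not in layer_of:  # skip pages that were already reached
                layer_of[target] = depth + 1
                queue.append(target)
    return layer_of


visited = crawl_layers("A", 2, LINKS)
print(sorted(visited))  # ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
```

With the number of access layers set to 2, page I is never queued; raising the limit to 3 would include it.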
Step S504: follow the links on the web page at the crawl URL to reach the crawl task processing node.
Step S505: at the crawl task processing node, crawl the corresponding information according to the crawl information type.
Step S506: set crawl purpose information. For example, the purpose of the crawl may be user behavior analysis or data modification, so that crawls can be classified and organized by purpose and efficiency improved. Specifically, an identification number can be assigned to each distinct piece of crawl purpose information, so that different crawl purposes can be distinguished by their identification numbers.
Step S507: set a corresponding storage space according to the crawl purpose information.
Step S508: store the corresponding information in the storage space.
In this embodiment, the crawl purpose information is set while the crawl flow is carried out, and a storage space named after the purpose is created; after the crawl flow completes, the acquired information can be stored in that storage space. For example, if the purpose of this crawl is user behavior analysis, then after the crawl flow ends the data can be stored in the user behavior analysis storage space so that subsequent user behavior applications can call it directly. This facilitates data classification and data management and improves the efficiency of the whole crawl.
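One possible way to picture steps S506 to S508 is a mapping from purpose identification numbers to directories named after the purpose. The sketch below is an assumption-laden illustration in Python: the table `PURPOSES`, the functions `storage_for` and `store`, the file name `crawl_result.json`, and the use of a file-system directory as the "storage space" are all hypothetical choices not stated in the patent.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical identification numbers for crawl purposes.
PURPOSES = {1: "user_behavior_analysis", 2: "data_modification"}


def storage_for(purpose_id, root):
    """Create (if needed) and return the storage directory named after the purpose."""
    space = Path(root) / PURPOSES[purpose_id]
    space.mkdir(parents=True, exist_ok=True)
    return space


def store(records, purpose_id, root):
    """Write crawled records as JSON into the storage space for the given purpose."""
    target = storage_for(purpose_id, root) / "crawl_result.json"
    target.write_text(json.dumps(records, ensure_ascii=False), encoding="utf-8")
    return target


# Usage: store one hypothetical crawled record under purpose 1.
root = tempfile.mkdtemp()
saved = store([{"url": "https://example.com/", "text": "Hello"}], 1, root)
print(saved.parent.name)  # user_behavior_analysis
```

Because the storage location is derived from the purpose, a downstream application only needs the purpose identification number to find the data it should read.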
Through steps S501-S508 above, the web page crawling configuration method proposed by the invention can, after the crawl flow ends, store the data in the user behavior analysis storage space so that subsequent user behavior applications can call it directly. This facilitates data classification and data management and improves the efficiency of the whole crawl.
Fig. 6 is a flowchart of the third embodiment of the web page crawling configuration method of the invention. In this embodiment, the execution order of the steps in the flowchart shown in Fig. 6 may be changed according to different requirements, and certain steps may be omitted.
In this embodiment, the step of receiving the crawl URL input by the user specifically includes:
Step S601: receive the crawl URL entered by the user in the preset URL field.
Step S602: establish association information between keyword information and the crawl URL.
Step S603: receive keyword information input by the user.
Step S604: obtain the URL corresponding to the keyword information from the keyword information and the association information.
In this embodiment, the crawl URL may be entered by the user in the preset URL field of the terminal device. In other embodiments, the user need not remember the URL and only needs to enter keyword information associated with the website, for example by entering the word "Sina" by keyboard or by voice; the URL corresponding to "Sina" is then filled in automatically according to the preset association information. In this embodiment, the terminal device may be a mobile device such as a mobile phone, smartphone, laptop, digital broadcast receiver, PDA (personal digital assistant), PAD (tablet computer), PMP (portable media player), navigation device, or in-vehicle device, or a fixed terminal such as a digital TV, desktop computer, notebook, or server.
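The keyword lookup of steps S602 to S604 might be pictured as a simple table lookup. In the Python sketch below, the table `ASSOCIATIONS`, its single illustrative entry, and the function `resolve_crawl_url` are assumptions; the patent does not describe how the association information is stored or matched.

```python
# Hypothetical association table built in step S602; the entry is illustrative only.
ASSOCIATIONS = {"sina": "https://www.sina.com.cn/"}


def resolve_crawl_url(user_input, associations=ASSOCIATIONS):
    """Return the crawl URL for either a typed URL or an associated keyword.

    Returns None when the input is neither a URL nor a known keyword.
    """
    text = user_input.strip()
    if text.lower().startswith(("http://", "https://")):
        return text  # the user typed the URL directly (step S601)
    # Otherwise treat the input as keyword information (steps S603 and S604).
    return associations.get(text.lower())


print(resolve_crawl_url("Sina"))
print(resolve_crawl_url("https://example.com/news"))
```

Matching is case-insensitive here purely as a design convenience; voice input would first need to be converted to text by a separate component that is outside this sketch.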
Through steps S601-S604 above, the web page crawling configuration method proposed by the invention can quickly obtain the crawl URL from the keyword information input by the user, improving the efficiency of the whole crawl.
Further, to achieve the above object, the invention also provides a computer-readable storage medium. The computer-readable storage medium stores a web page crawling configuration program, and the web page crawling configuration program can be executed by at least one processor so that the at least one processor performs the steps of the web page crawling configuration method described above.
The embodiments of the invention above are for description only and do not indicate the relative merits of the embodiments.
From the description of the embodiments above, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, or the like) to execute the methods described in the embodiments of the invention.
The above are only preferred embodiments of the invention and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the invention.
Claims (10)
1. A configuration method for web page crawling, applied to an application server, characterized in that the method comprises the steps of:
receiving a crawl URL input by a user;
setting a crawl information type, wherein the crawl information type includes at least one of text, HTML, multimedia, and pictures;
setting a crawl task processing node;
following a link on the web page at the crawl URL to the crawl task processing node; and
crawling corresponding information at the crawl task processing node according to the crawl information type.
2. The configuration method for web page crawling as claimed in claim 1, characterized in that the step of receiving the crawl URL input by the user further comprises:
establishing association information between keyword information and the crawl URL;
receiving keyword information input by the user; and
obtaining the URL corresponding to the keyword information from the keyword information and the association information.
3. The configuration method for web page crawling as claimed in claim 1, characterized in that the step of setting the crawl task processing node comprises:
setting the number of access layers of the website represented by the crawl URL, and crawling web pages according to the crawl task processing node.
4. The configuration method for web page crawling as claimed in claim 1, characterized in that the method further comprises the steps of:
setting crawl purpose information; and
setting a corresponding storage space according to the crawl purpose information.
5. The configuration method for web page crawling as claimed in claim 4, characterized in that, after the step of crawling corresponding information at the crawl task processing node according to the crawl information type, the method further comprises the step of:
storing the corresponding information in the storage space.
6. An application server, characterized in that the application server comprises a memory and a processor, the memory storing a web page crawling configuration program that can run on the processor, and the web page crawling configuration program, when executed by the processor, implementing the following steps:
receiving a crawl URL input by a user;
setting a crawl information type, wherein the crawl information type includes at least one of text, HTML, multimedia, and pictures;
setting a crawl task processing node;
following a link on the web page at the crawl URL to the crawl task processing node; and
crawling corresponding information at the crawl task processing node according to the crawl information type.
7. The application server as claimed in claim 6, characterized in that the step of receiving the crawl URL input by the user comprises:
establishing association information between keyword information and the crawl URL;
receiving keyword information input by the user; and
obtaining the URL corresponding to the keyword information from the keyword information and the association information.
8. The application server as claimed in claim 6, characterized in that the step of setting the crawl task processing node comprises:
setting the number of access layers of the website represented by the crawl URL, and crawling web pages according to the crawl task processing node.
9. The application server as claimed in claim 6, characterized in that the web page crawling configuration program, when executed by the processor, further implements the following steps:
setting crawl purpose information;
setting a corresponding storage space according to the crawl purpose information; and
after the step of crawling corresponding information at the crawl task processing node according to the crawl information type, storing the corresponding information in the storage space.
10. A computer-readable storage medium, the computer-readable storage medium storing a web page crawling configuration program that can be executed by at least one processor so that the at least one processor performs the steps of the configuration method for web page crawling as claimed in any one of claims 1-5.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810119441.1A CN108491420A (en) | 2018-02-06 | 2018-02-06 | Configuration method, application server and the computer readable storage medium of web page crawl |
PCT/CN2018/089706 WO2019153603A1 (en) | 2018-02-06 | 2018-06-03 | Web page crawling configuration method, application server and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810119441.1A CN108491420A (en) | 2018-02-06 | 2018-02-06 | Configuration method, application server and the computer readable storage medium of web page crawl |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108491420A true CN108491420A (en) | 2018-09-04 |
Family
ID=63344583
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810119441.1A Pending CN108491420A (en) | 2018-02-06 | 2018-02-06 | Configuration method, application server and the computer readable storage medium of web page crawl |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108491420A (en) |
WO (1) | WO2019153603A1 (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110297962A (en) * | 2019-06-28 | 2019-10-01 | 北京金山安全软件有限公司 | Website resource crawling method, device, system and computer equipment |
CN111192155A (en) * | 2019-12-25 | 2020-05-22 | 杭州龙席网络科技股份有限公司 | Social media inquiry plate identification and recommendation method based on SAAS |
CN111209459A (en) * | 2019-12-27 | 2020-05-29 | 中移(杭州)信息技术有限公司 | Information processing method, information processing device, electronic equipment and storage medium |
CN111241370A (en) * | 2020-01-08 | 2020-06-05 | 北京松果电子有限公司 | Method, device and storage medium for distributed crawling of content |
CN111241366A (en) * | 2019-12-25 | 2020-06-05 | 杭州龙席网络科技股份有限公司 | Client social media monitoring method based on SAAS |
CN112948654A (en) * | 2019-11-26 | 2021-06-11 | 上海哔哩哔哩科技有限公司 | Webpage crawling method and device and computer equipment |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110941788A (en) * | 2019-12-17 | 2020-03-31 | 山西云时代技术有限公司 | Cloud environment distributed Web page extraction and analysis system and method for edge computing |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102929932A (en) * | 2012-09-25 | 2013-02-13 | 人民搜索网络股份公司 | Displaying device and displaying method for real-time news |
CN103475688A (en) * | 2013-05-24 | 2013-12-25 | 北京网秦天下科技有限公司 | Distributed method and distributed system for downloading website data |
CN104376406A (en) * | 2014-11-05 | 2015-02-25 | 上海计算机软件技术开发中心 | Enterprise innovation resource management and analysis system and method based on big data |
CN104391978A (en) * | 2014-12-05 | 2015-03-04 | 北京国双科技有限公司 | Method and device for storing and processing web pages of browsers |
CN105045872A (en) * | 2015-07-16 | 2015-11-11 | 北京京东尚科信息技术有限公司 | Information screening method and information screening device |
CN105893583A (en) * | 2016-04-01 | 2016-08-24 | 北京鼎泰智源科技有限公司 | Data acquisition method and system based on artificial intelligence |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103970788A (en) * | 2013-02-01 | 2014-08-06 | 北京英富森信息技术有限公司 | Webpage-crawling-based crawler technology |
CN107622125B (en) * | 2017-09-29 | 2020-02-21 | 联想(北京)有限公司 | Information crawling method and device and electronic equipment |
-
2018
- 2018-02-06 CN CN201810119441.1A patent/CN108491420A/en active Pending
- 2018-06-03 WO PCT/CN2018/089706 patent/WO2019153603A1/en active Application Filing
Non-Patent Citations (1)
Title |
---|
袁津生 等: "《21世纪高等学校精品教材 搜索引擎与信息检索教程》", 30 April 2008, 北京:中国水利水电出版社 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110297962A (en) * | 2019-06-28 | 2019-10-01 | 北京金山安全软件有限公司 | Website resource crawling method, device, system and computer equipment |
CN110297962B (en) * | 2019-06-28 | 2021-08-24 | 北京金山安全软件有限公司 | Website resource crawling method, device, system and computer equipment |
CN112948654A (en) * | 2019-11-26 | 2021-06-11 | 上海哔哩哔哩科技有限公司 | Webpage crawling method and device and computer equipment |
CN111192155A (en) * | 2019-12-25 | 2020-05-22 | 杭州龙席网络科技股份有限公司 | Social media inquiry plate identification and recommendation method based on SAAS |
CN111241366A (en) * | 2019-12-25 | 2020-06-05 | 杭州龙席网络科技股份有限公司 | Client social media monitoring method based on SAAS |
CN111209459A (en) * | 2019-12-27 | 2020-05-29 | 中移(杭州)信息技术有限公司 | Information processing method, information processing device, electronic equipment and storage medium |
CN111241370A (en) * | 2020-01-08 | 2020-06-05 | 北京松果电子有限公司 | Method, device and storage medium for distributed crawling of content |
CN111241370B (en) * | 2020-01-08 | 2023-10-13 | 北京小米松果电子有限公司 | Method, device and storage medium for crawling content in distributed manner |
Also Published As
Publication number | Publication date |
---|---|
WO2019153603A1 (en) | 2019-08-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108491420A (en) | Configuration method, application server and the computer readable storage medium of web page crawl | |
CN110348239B (en) | Desensitization rule configuration method, data desensitization method, system and computer equipment | |
CN104112002B (en) | A kind of methods, devices and systems of list adaptation | |
CN103605502B (en) | Form page display method and server | |
CN105306495B (en) | user identification method and device | |
CN106294648A (en) | A kind of processing method and processing device for page access path | |
CN109829287A (en) | Api interface permission access method, equipment, storage medium and device | |
CN108171069A (en) | Desensitization method, application server and computer readable storage medium | |
CN103368986A (en) | Information recommendation method and information recommendation device | |
CN107809383A (en) | A kind of map paths method and device based on MVC | |
CN103064738A (en) | Method and system for embedding local application program window into browser in Linux | |
US11080322B2 (en) | Search methods, servers, and systems | |
CN108021621A (en) | Database data acquisition method, application server and computer-readable recording medium | |
CN105573733A (en) | Communication method for browser and web front end and web front end and system | |
CN102880698B (en) | A kind of crawl website defining method and device | |
CN104899203B (en) | Webpage generation method and device and terminal equipment | |
CN110162540A (en) | Querying method, electronic device and the storage medium of block chain account book data | |
CN109582883B (en) | Column page determination method and device | |
CN111797297B (en) | Page data processing method and device, computer equipment and storage medium | |
CN111859069B (en) | Network malicious crawler identification method, system, terminal and storage medium | |
CN108875085A (en) | Mix image processing method, device, computer equipment and the storage medium of application | |
CN108427701A (en) | The method and application server of help information are identified based on operation pages | |
CN112416858A (en) | Document storage method and device, electronic equipment and computer readable storage medium | |
CN107832374A (en) | Construction method, electronic installation and the storage medium in standard knowledge storehouse | |
CN108256986A (en) | Wages computational methods, application server and computer readable storage medium based on cloud computing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180904 |