CN110188258A - Method and device for obtaining external data using a crawler
- Publication number
- CN110188258A (application CN201910320214.XA)
- Authority
- CN
- China
- Prior art keywords
- crawler
- data
- crawlers
- page
- task
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
Embodiments of the present invention provide a method and device for obtaining external data using a crawler. In one aspect, the method comprises: obtaining a data acquisition instruction according to a trigger condition; calling a crawler program according to the data acquisition instruction; receiving the crawled page fetched by the crawler program; and parsing the crawled page to obtain result data, and storing the result data in a MySQL database. The invention solves the technical problem in the prior art that crawler programs cannot be called automatically to obtain data, improves the efficiency of fetching data with crawlers, and reduces manual operation.
Description
[Technical field]
The present invention relates to the field of computers, and in particular to a method and device for obtaining external data using a crawler.
[Background]
A crawler is a program or script that automatically fetches web information according to certain rules. Crawling is currently the most common and most important means by which companies obtain external data, and it can usefully supplement business data.
Although the crawler field now offers many technologies, each of them serves too narrow a function, and both crawler automation and the persistence of crawled data remain underdeveloped. After a crawler obtains data, the user still has to screen and process it further, which is inefficient; when crawlers are used to build large databases or to run periodic tasks, a great deal of manual labor is required.
No effective solution to the above problems in the related art has yet been found.
[Summary of the invention]
In view of this, embodiments of the present invention provide a method and device for obtaining external data using a crawler.
In one aspect, an embodiment of the present invention provides a method for obtaining external data using a crawler, the method comprising: obtaining a data acquisition instruction according to a trigger condition; calling a crawler program according to the data acquisition instruction; receiving the crawled page fetched by the crawler program; and parsing the crawled page to obtain result data, and storing the result data in a MySQL database.
Optionally, calling a crawler program according to the data acquisition instruction includes: converting the data acquisition instruction into a crawler task; determining the difficulty coefficient of the crawler task; and determining, according to the difficulty coefficient, the number of crawler programs and the crawler request method of the crawler programs.
Optionally, determining the difficulty coefficient of the crawler task includes determining it according to at least one of the following: the number of data sources, the size of the data, the size of the region over which the data is distributed, and the complexity of the link addresses.
Optionally, determining the number of crawler programs and their crawler request method according to the difficulty coefficient includes: when the difficulty coefficient is below a preset threshold, selecting one crawler program and a crawler request method of a first type; when the difficulty coefficient is greater than or equal to the preset threshold, selecting multiple crawler programs and multiple corresponding crawler request methods of a second type. The first type of crawler request method is one of the following: fetching the uniform resource locator (URL) directly, or requesting through a proxy. The second type of crawler request method is one of the following: requesting through a simulated browser, or requesting through a real browser kernel.
Optionally, calling a crawler program according to the data acquisition instruction includes: converting the data acquisition instruction into a crawler task; calling multiple crawler nodes in a distributed network, where a crawler program is deployed on each crawler node and the crawler nodes reside on servers of the distributed network; obtaining the processing capacity of each crawler node in the distributed network; and assigning crawler subtasks to each crawler node according to its processing capacity, where the crawler task comprises multiple crawler subtasks.
Optionally, when the crawled page is parsed layer by layer, parsing the crawled page to obtain result data includes: receiving a call request from an upper layer to the current layer; determining, according to the metadata carried in the call request, the target entity inherited by a target operation object, where the target operation object is the object that the current layer needs to parse and the target entity is the data defined by the metadata; and performing a parsing operation on the target operation object according to the target entity.
Optionally, parsing the crawled page to obtain result data includes: parsing the crawled page to obtain the raw data corresponding to the crawled page; performing data cleaning and screening on the raw data by deleting the data entries that contain terms from a blacklist lexicon, to obtain first result data; and selecting, from the first result data, the data entries that contain a keyword, to obtain second result data.
In another aspect, an embodiment of the present invention provides a device for obtaining external data using a crawler, the device comprising: an acquisition module, configured to obtain a data acquisition instruction according to a trigger condition; a calling module, configured to call a crawler program according to the data acquisition instruction; a receiving module, configured to receive the crawled page fetched by the crawler program; and a parsing module, configured to parse the crawled page to obtain result data and to store the result data in a MySQL database.
Optionally, the calling module includes: a conversion unit, configured to convert the data acquisition instruction into a crawler task; a first determination unit, configured to determine the difficulty coefficient of the crawler task; and a second determination unit, configured to determine, according to the difficulty coefficient, the number of crawler programs and the crawler request method of the crawler programs.
Optionally, the first determination unit includes a determination subunit, configured to determine the difficulty coefficient of the crawler task according to at least one of the following: the number of data sources, the size of the data, the size of the region over which the data is distributed, and the complexity of the link addresses.
Optionally, the second determination unit includes a selection subunit, configured to select one crawler program and a crawler request method of a first type when the difficulty coefficient is below a preset threshold, and to select multiple crawler programs and multiple corresponding crawler request methods of a second type when the difficulty coefficient is greater than or equal to the preset threshold. The first type of crawler request method is one of the following: fetching the uniform resource locator (URL) directly, or requesting through a proxy. The second type of crawler request method is one of the following: requesting through a simulated browser, or requesting through a real browser kernel.
Optionally, the calling module includes: a conversion unit, configured to convert the data acquisition instruction into a crawler task; a calling unit, configured to call multiple crawler nodes in a distributed network, where a crawler program is deployed on each crawler node and the crawler nodes reside on servers of the distributed network; an obtaining unit, configured to obtain the processing capacity of each crawler node in the distributed network; and an allocation unit, configured to assign crawler subtasks to each crawler node according to its processing capacity, where the crawler task comprises multiple crawler subtasks.
Optionally, when the crawled page is parsed layer by layer, the parsing module includes: a receiving unit, configured to receive a call request from an upper layer to the current layer; a determination unit, configured to determine, according to the metadata carried in the call request, the target entity inherited by a target operation object, where the target operation object is the object that the current layer needs to parse and the target entity is the data defined by the metadata; and a parsing unit, configured to perform a parsing operation on the target operation object according to the target entity.
Optionally, the parsing module includes: a parsing unit, configured to parse the crawled page to obtain the raw data corresponding to the crawled page; a screening unit, configured to perform data cleaning and screening on the raw data by deleting the data entries that contain terms from a blacklist lexicon, to obtain first result data; and a selection unit, configured to select, from the first result data, the data entries that contain a keyword, to obtain second result data.
According to yet another embodiment of the present invention, a storage medium is also provided. A computer program is stored in the storage medium, and the computer program is configured to execute, when run, the steps of any of the above method embodiments.
According to yet another embodiment of the present invention, an electronic device is also provided, comprising a memory and a processor. A computer program is stored in the memory, and the processor is configured to run the computer program so as to execute the steps of any of the above method embodiments.
The invention makes it possible to dispatch crawler tasks automatically and to store crawler result data automatically. It solves the technical problem in the prior art that crawler programs cannot be called automatically to obtain data, improves the efficiency of fetching data with crawlers, and reduces manual operation.
[Brief description of the drawings]
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a hardware block diagram of a server that obtains external data using a crawler, according to an embodiment of the present invention;
Fig. 2 is a flow chart of the method for obtaining external data using a crawler, according to an embodiment of the present invention;
Fig. 3 is an application framework diagram of an embodiment of the present invention;
Fig. 4 is an overall workflow diagram of an embodiment of the present invention, including data cleaning;
Fig. 5 is a structural block diagram of the device for obtaining external data using a crawler, according to an embodiment of the present invention.
[Detailed description]
The present invention is described in detail below with reference to the drawings and in combination with the embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in them may be combined with one another.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and do not necessarily describe a particular order or sequence.
Embodiment 1
The method embodiment provided in Embodiment 1 of the present application may be executed on a mobile terminal, a server, a computer terminal, or a similar computing device. Taking execution on a server as an example, Fig. 1 is a hardware block diagram of a server that obtains external data using a crawler, according to an embodiment of the present invention. As shown in Fig. 1, the server 10 may include one or more processors 102 (only one is shown in Fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data. Optionally, the server may also include a transmission device 106 for communication functions and an input/output device 108. A person of ordinary skill in the art will understand that the structure shown in Fig. 1 is only illustrative and does not limit the structure of the server. For example, the server 10 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1.
The memory 104 can be used to store computer programs, for example the software programs and modules of application software, such as the computer program corresponding to the method for obtaining external data using a crawler in the embodiments of the present invention. The processor 102 runs the computer program stored in the memory 104 to execute various functional applications and data processing, that is, to implement the above method. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and such remote memory may be connected to the server 10 through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The transmission device 106 is used to receive or send data via a network. A specific example of such a network is a wireless network provided by the communication provider of the server 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can connect to other network devices through a base station in order to communicate with the Internet. In another example, the transmission device 106 may be a radio frequency (RF) module, which communicates with the Internet wirelessly.
This embodiment provides a method for obtaining external data using a crawler. Fig. 2 is a flow chart of the method according to an embodiment of the present invention. As shown in Fig. 2, the process includes the following steps:
Step S202: obtain a data acquisition instruction according to a trigger condition;
The trigger condition in this embodiment may be an acquisition instruction sent by the user in real time, or an automatic periodic trigger. For example, when the external data is stock market trading volume, acquisition of the market-wide trading data is triggered automatically after the market closes on each trading day (for example, at 15:30).
Step S204: call a crawler program according to the data acquisition instruction;
Step S206: receive the crawled page fetched by the crawler program;
Step S208: parse the crawled page to obtain result data, and store the result data in a MySQL database (a relational database management system).
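The periodic trigger in step S202 can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `next_trigger` is hypothetical, the 15:30 time comes from the example above, and trading days are approximated as Monday to Friday with public holidays ignored.

```python
from datetime import datetime, time, timedelta

MARKET_CLOSE = time(15, 30)  # assumed trigger time, taken from the example in the text


def next_trigger(now: datetime, at: time = MARKET_CLOSE) -> datetime:
    """Return the next weekday occurrence of `at` that is strictly after `now`."""
    candidate = datetime.combine(now.date(), at)
    if candidate <= now:
        candidate += timedelta(days=1)
    # Skip Saturday (5) and Sunday (6); public holidays are not modelled here.
    while candidate.weekday() >= 5:
        candidate += timedelta(days=1)
    return candidate


# A Friday morning triggers the same afternoon; a Friday evening rolls over to Monday.
friday_morning = next_trigger(datetime(2019, 4, 19, 10, 0))
friday_evening = next_trigger(datetime(2019, 4, 19, 16, 0))
```

A scheduler would sleep until the returned time and then issue the data acquisition instruction.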
The scheme of this embodiment makes it possible to dispatch crawler tasks automatically and to store crawler result data automatically. It solves the technical problem in the prior art that crawler programs cannot be called automatically to obtain data, improves the efficiency of fetching data with crawlers, and reduces manual operation.
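Steps S204 to S208 can be sketched as a single function that calls a crawler, parses the returned page, and persists the rows. This is only a sketch: `run_pipeline`, `fake_crawl`, `parse_pairs`, and the `result` table are hypothetical names, and the standard-library `sqlite3` module stands in for the MySQL database so that the example runs without a database server.

```python
from __future__ import annotations

import sqlite3
from typing import Callable, Iterable


def run_pipeline(
    instruction: str,
    crawl: Callable[[str], str],
    parse: Callable[[str], Iterable[tuple[str, str]]],
    conn: sqlite3.Connection,
) -> int:
    """Call the crawler, receive its page, parse it, and store the rows."""
    page = crawl(instruction)   # S204 and S206: call the crawler, receive the page
    rows = list(parse(page))    # S208: parse the page into result data
    conn.execute("CREATE TABLE IF NOT EXISTS result (name TEXT, value TEXT)")
    conn.executemany("INSERT INTO result VALUES (?, ?)", rows)  # S208: persist
    conn.commit()
    return len(rows)


def fake_crawl(instruction: str) -> str:
    """Stand-in crawler that returns a fixed page instead of fetching one."""
    return "volume=1200;turnover=3400"


def parse_pairs(page: str) -> list[tuple[str, str]]:
    """Split 'k=v;k=v' text into (name, value) rows."""
    return [tuple(item.split("=", 1)) for item in page.split(";")]


conn = sqlite3.connect(":memory:")
stored = run_pipeline("daily-volume", fake_crawl, parse_pairs, conn)
```

Passing the crawler and parser in as callables keeps the scheduling, fetching, and storage concerns separate, which mirrors the modular framework described in this embodiment.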
Fig. 3 is an application framework diagram of an embodiment of the present invention. As shown in Fig. 3, each function in the framework is modularized, and the framework includes Apscheduler (the crawler task manager), spider (the crawler programs), and a MySQL database. Apscheduler manages and schedules the crawler programs and the MySQL database; the crawler programs fetch external data according to their tasks; the MySQL database stores the external data. Specifically, Apscheduler is responsible for managing, scheduling, and periodically controlling crawler tasks, including setting up, pausing, removing, and dispatching tasks. For periodic scheduling control, it performs periodic task triggering and reminders according to the trigger condition set by the user. Each task is a thread task and is processed in a background thread. When a new crawler task is needed later, it only has to inherit the corresponding Task class, implement that class's task-specific method, and be added to the task manager. spider is responsible for the concrete implementation of crawler tasks and includes a crawler request module, a crawled-page parsing module, and a module for cleaning and organizing crawler result data. The MySQL database is responsible for persisting the final crawler result data.
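The Task-class extension point described above can be sketched as follows. The names `Task`, `TaskManager`, and `QuoteTask` are hypothetical, and the manager is a plain-Python stand-in that only illustrates "inherit, implement, register, run in a background thread"; it does not reproduce the API of any scheduling library.

```python
from __future__ import annotations

import threading
from abc import ABC, abstractmethod


class Task(ABC):
    """Base class that every crawler task inherits from."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def run(self) -> None:
        """Task-specific crawl logic, implemented by each subclass."""


class TaskManager:
    """Keeps registered tasks and runs each one in its own background thread."""

    def __init__(self) -> None:
        self._tasks: dict[str, Task] = {}

    def add(self, task: Task) -> None:
        self._tasks[task.name] = task

    def remove(self, name: str) -> None:
        self._tasks.pop(name, None)

    def dispatch(self, name: str) -> threading.Thread:
        thread = threading.Thread(target=self._tasks[name].run, daemon=True)
        thread.start()
        return thread


class QuoteTask(Task):
    """Example of a new task: subclass Task and implement run()."""

    def __init__(self) -> None:
        super().__init__("quotes")
        self.done = threading.Event()

    def run(self) -> None:
        self.done.set()  # a real task would call the crawler program here


manager = TaskManager()
task = QuoteTask()
manager.add(task)
manager.dispatch("quotes").join(timeout=5)
```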
In this embodiment, the data acquisition instruction is converted into a crawler task, a crawler request method is selected according to the difficulty of the crawler task, and the crawler programs are called to complete the task. Each crawler task requires one or more crawler programs, and their number can be determined from the difficulty of the task: difficulty is graded, a task of the lowest difficulty grade is assigned one crawler program, and a high-difficulty task is assigned several. In this embodiment, calling a crawler program according to the data acquisition instruction includes:
S11: convert the data acquisition instruction into a crawler task;
S12: determine the difficulty coefficient of the crawler task;
Optionally, the difficulty of the crawler task is determined from the number of data sources (for example, the number of web pages or databases), the size of the data, the size of the region over which the data is distributed (for example, within the province or outside it), the complexity of the link addresses, and so on.
S13: determine, according to the difficulty coefficient, the number of crawler programs and the crawler request method of the crawler programs.
In an optional implementation of this embodiment, the difficulty grade of the crawler task is determined, and a corresponding number of crawler programs and a corresponding crawler request method are then assigned according to that grade. Determining the number of crawler programs and their crawler request method according to the difficulty coefficient includes: when the difficulty coefficient is below a preset threshold, selecting one crawler program and a crawler request method of a first type; when the difficulty coefficient is greater than or equal to the preset threshold, selecting multiple crawler programs and multiple corresponding crawler request methods of a second type. The first type of crawler request method is one of the following: fetching the uniform resource locator (URL) directly, or requesting through a proxy. The second type of crawler request method is one of the following: requesting through a simulated browser, or requesting through a real browser kernel.
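Steps S12 and S13 can be sketched as a scoring function followed by a threshold test. The source gives no formula, so the weights, the threshold of 10.0, the crawler count of 4, and all names below (`difficulty`, `plan_for`, `CrawlPlan`, the request-method labels) are assumptions chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical labels for the two request-method types named in the text.
FIRST_TYPE = ("direct_url", "proxy")
SECOND_TYPE = ("simulated_browser", "browser_kernel")


@dataclass(frozen=True)
class CrawlPlan:
    crawler_count: int
    request_methods: tuple[str, ...]


def difficulty(sources: int, size_mb: float, regions: int, url_depth: int) -> float:
    """Assumed weighted sum of the four factors; the weights are illustrative."""
    return 1.0 * sources + 0.01 * size_mb + 2.0 * regions + 0.5 * url_depth


def plan_for(coefficient: float, threshold: float = 10.0, many: int = 4) -> CrawlPlan:
    """Below the threshold: one crawler, first type. Otherwise: several, second type."""
    if coefficient < threshold:
        return CrawlPlan(1, (FIRST_TYPE[0],))
    # One second-type request method per crawler, alternating between the two.
    methods = tuple(SECOND_TYPE[i % len(SECOND_TYPE)] for i in range(many))
    return CrawlPlan(many, methods)


easy = plan_for(difficulty(sources=1, size_mb=100, regions=1, url_depth=2))
hard = plan_for(difficulty(sources=8, size_mb=5000, regions=3, url_depth=6))
```

Keeping the score and the selection in separate functions lets the threshold be tuned without touching how the coefficient is computed.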
When implementing a specific crawler task, the system only needs to make the call that matches the task's difficulty. For example, when the difficulty of a crawler task is low, the request method that fetches the URL directly can be chosen to crawl the data. Different request methods differ in their ability to obtain data (and the more capable a method is, the more resources and overhead it consumes), so the request method needs to match the difficulty of the crawler task in order to allocate resources sensibly. Because a crawler task corresponds to both a number of crawler programs and a crawler request method, both can be assigned according to task difficulty. Other optional implementations of this embodiment include: assigning a corresponding number of crawler programs and a fixed crawler request method according to the difficulty grade; or assigning a fixed number of crawler programs and a corresponding crawler request method according to the difficulty grade. Each crawler program uses the same crawler request method.
In one application scenario of this embodiment, the method is applied in a distributed network, and calling a crawler program according to the data acquisition instruction includes: converting the data acquisition instruction into a crawler task; calling multiple crawler nodes in the distributed network, where a crawler program is deployed on each crawler node and the crawler nodes reside on servers of the distributed network; obtaining the processing capacity of each crawler node in the distributed network; and assigning crawler subtasks to each crawler node according to its processing capacity, where the crawler task comprises multiple crawler subtasks.
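One way to assign subtasks according to node processing capacity is proportional apportionment. The sketch below assumes capacity is a single positive integer per node and uses the largest-remainder method so that every subtask is assigned exactly once; the source does not specify the allocation rule, and `split_by_capacity` is a hypothetical name.

```python
from __future__ import annotations


def split_by_capacity(subtasks: list[str], capacity: dict[str, int]) -> dict[str, list[str]]:
    """Give each node a share of the subtasks proportional to its capacity."""
    total = sum(capacity.values())
    if total <= 0:
        raise ValueError("at least one node needs positive capacity")
    # Ideal (fractional) share per node, then its integer part.
    quotas = {node: len(subtasks) * cap / total for node, cap in capacity.items()}
    counts = {node: int(quota) for node, quota in quotas.items()}
    # Hand the leftover subtasks to the nodes with the largest fractional remainders.
    leftover = len(subtasks) - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda n: quotas[n] - counts[n], reverse=True)
    for node in by_remainder[:leftover]:
        counts[node] += 1
    remaining = iter(subtasks)
    return {node: [next(remaining) for _ in range(count)] for node, count in counts.items()}


assignment = split_by_capacity([f"sub{i}" for i in range(6)], {"node-a": 2, "node-b": 1})
```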
In one example, crawler program priorities may also be introduced into the call. The crawler programs are distributed across crawler nodes, each crawler node hosts several crawler programs with different priorities, and the crawler nodes reside on servers of the distributed network. The processing capacity of each crawler node in the distributed network is obtained, and crawler tasks are assigned to each crawler node according to the preset priority order and the node's processing capacity, so that the node processes the tasks assigned to it. The maximum access volume of a single crawler program is determined from the processing capacity of its crawler node. If the crawler task volume assigned to a crawler node is greater than or equal to that maximum access volume, multiple crawler programs are used to process it; if the assigned volume is less than the maximum access volume of a single crawler program, the one crawler program with the highest priority is used. Crawler tasks may of course also be distributed evenly across the crawler programs of all nodes. In the distributed network, the crawler nodes are fixed on the network's servers, and the crawler programs on each node are likewise deployed in advance. Crawler tasks are distributed to the crawler nodes according to priority and processing capacity, and the crawler programs to use on each node are then determined from the tasks assigned to that node and the maximum access volume of each crawler program on it.
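The per-node choice between one crawler program and several can be sketched as below. The mapping from priority to an integer (higher runs first) and the rule for how many programs "multiple" means (enough to cover the load, and at least two) are assumptions; the source only states the comparison against the maximum access volume. `select_crawlers` is a hypothetical name.

```python
from __future__ import annotations


def select_crawlers(task_load: int, max_access: int, priorities: dict[str, int]) -> list[str]:
    """Pick crawler programs on one node; `priorities` maps name to priority."""
    if max_access <= 0:
        raise ValueError("max_access must be positive")
    ranked = sorted(priorities, key=lambda name: priorities[name], reverse=True)
    if task_load < max_access:
        return ranked[:1]  # the single highest-priority crawler program suffices
    needed = -(-task_load // max_access)  # ceiling division
    return ranked[: max(2, needed)]       # "multiple" is assumed to mean at least two


node_crawlers = {"spider-a": 1, "spider-b": 3, "spider-c": 2}
light = select_crawlers(50, 100, node_crawlers)
heavy = select_crawlers(250, 100, node_crawlers)
```

If the load calls for more programs than the node hosts, the slice simply returns all of them.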
In this embodiment, because crawled pages are so varied, they cannot all be handled with a single parsing method. Therefore only the top-level parent class of the crawler parsing logic is defined in advance, and the parsing logic for a specific page is implemented by each layer inheriting from the top-level parent class and customizing it. The parsing process includes: on receiving a call request from the upper layer to the current layer, determining the target entity inherited by the target operation object according to the metadata carried in the request, and parsing the target operation object according to the target entity.
Parsing a crawled page takes multiple steps and proceeds layer by layer. Each layer is responsible for a different parsing operation and parses different metadata, until all the data of the crawled page has been obtained. The target operation object is the object that the current layer needs to parse; the target entity is the data described by the metadata; and metadata is data that defines data (descriptive information about data and information resources). For example, given a student information record with the fields name (name), age (age), gender (male), and class (class), the fields name, age, male, and class are the metadata. Each layer is responsible for parsing the data of at least one metadata item (for example, name corresponds to Zhang San or Li Si), and the data that a metadata item describes is the target entity. After all layers have parsed all the metadata, the result data of the crawled page is obtained.
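The layered, inheritance-based parsing described above can be sketched as a chain of parser objects that share one parent class. The class names, the `handle`/`extract` split, and the "field: value" page format are all assumptions made for illustration; the student-record fields come from the example in the text.

```python
from __future__ import annotations

import re
from abc import ABC, abstractmethod


class LayerParser(ABC):
    """Top-level parent: fixes how layers call each other; subclasses extract."""

    def __init__(self, next_layer: LayerParser | None = None) -> None:
        self.next_layer = next_layer

    def handle(self, metadata: list[str], page: str, result: dict[str, str]) -> dict[str, str]:
        """Handle the call from the upper layer: parse one field, pass the rest down."""
        if not metadata:
            return result
        field, remaining = metadata[0], metadata[1:]
        result[field] = self.extract(field, page)  # the target entity for this layer
        if self.next_layer is not None:
            return self.next_layer.handle(remaining, page, result)
        return result

    @abstractmethod
    def extract(self, field: str, page: str) -> str:
        """Page-specific extraction of the data that `field` describes."""


class KeyValueLayer(LayerParser):
    """Example subclass for pages made of 'field: value' lines."""

    def extract(self, field: str, page: str) -> str:
        match = re.search(rf"^{re.escape(field)}:\s*(.+)$", page, flags=re.MULTILINE)
        return match.group(1).strip() if match else ""


page = "name: Zhang San\nage: 20\nclass: 3"
layers = KeyValueLayer(KeyValueLayer(KeyValueLayer()))
record = layers.handle(["name", "age", "class"], page, {})
```

In this sketch each layer consumes one metadata item; a page with a different layout would get its own subclass that overrides only `extract`.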
In one implementation of this embodiment, parsing the crawled page to obtain result data includes: parsing the crawled page to obtain the raw data corresponding to the crawled page; performing data cleaning and screening on the raw data by deleting the data entries that contain terms from a blacklist lexicon, to obtain first result data; and selecting, from the first result data, the data entries that contain a keyword, to obtain second result data. The second result data is taken as the final result data, and the result data of the crawler task is stored in the MySQL database automatically, which completes the persistence of the crawler result. The logs of the important steps of a crawler task can also be stored in MySQL, so that each crawler task can be monitored effectively by inspecting its logs. Fig. 4 is an overall workflow diagram of an embodiment of the present invention, including data cleaning, in which each function is encapsulated as a module.
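The two-stage cleaning and screening step can be sketched as two list filters. Treating each data entry as a string and matching by substring are assumptions, as are the function name `clean_and_screen` and the sample data.

```python
from __future__ import annotations


def clean_and_screen(
    raw: list[str], blacklist: set[str], keywords: set[str]
) -> tuple[list[str], list[str]]:
    """Return (first result data, second result data) as described in the text."""
    # Stage 1: drop every entry that contains a blacklisted term.
    first = [item for item in raw if not any(term in item for term in blacklist)]
    # Stage 2: keep only the entries that contain at least one keyword.
    second = [item for item in first if any(word in item for word in keywords)]
    return first, second


raw_entries = ["index volume up", "sponsored ad volume", "weather report"]
first_result, second_result = clean_and_screen(raw_entries, {"sponsored"}, {"volume"})
```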
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
Embodiment 2
This embodiment also provides a device for obtaining external data using a crawler. The device is used to implement the above embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the following embodiment is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 5 is a structural block diagram of the device for obtaining external data using a crawler, according to an embodiment of the present invention. As shown in Fig. 5, the device includes:
an acquisition module 50, configured to obtain a data acquisition instruction according to a trigger condition;
a calling module 52, configured to call a crawler program according to the data acquisition instruction;
a receiving module 54, configured to receive the crawled page fetched by the crawler program;
a parsing module 56, configured to parse the crawled page to obtain result data and to store the result data in a MySQL database.
Optionally, the calling module includes: a conversion unit, configured to convert the data acquisition instruction into a crawler task; a first determination unit, configured to determine the difficulty coefficient of the crawler task; and a second determination unit, configured to determine, according to the difficulty coefficient, the number of crawler programs and the crawler request method of the crawler programs.
Optionally, the first determination unit includes a determination subunit, configured to determine the difficulty coefficient of the crawler task according to at least one of the following: the number of data sources, the size of the data, the size of the region over which the data is distributed, and the complexity of the link addresses.
Optionally, the second determination unit includes a selection subunit, configured to select one crawler program and a crawler request method of a first type when the difficulty coefficient is below a preset threshold, and to select multiple crawler programs and multiple corresponding crawler request methods of a second type when the difficulty coefficient is greater than or equal to the preset threshold. The first type of crawler request method is one of the following: fetching the uniform resource locator (URL) directly, or requesting through a proxy. The second type of crawler request method is one of the following: requesting through a simulated browser, or requesting through a real browser kernel.
Optionally, the calling module includes: a conversion unit, configured to convert the data acquisition instruction into a crawler task; a calling unit, configured to call multiple crawler nodes in a distributed network, where a crawler program is deployed on each crawler node and the crawler nodes reside on servers of the distributed network; an obtaining unit, configured to obtain the processing capacity of each crawler node in the distributed network; and an allocation unit, configured to assign crawler subtasks to each crawler node according to its processing capacity, where the crawler task comprises multiple crawler subtasks.
Optionally, when the crawled page is parsed layer by layer, the parsing module includes: a receiving unit, configured to receive a call request from an upper layer to the current layer; a determination unit, configured to determine, according to the metadata carried in the call request, the target entity inherited by a target operation object, where the target operation object is the object that the current layer needs to parse and the target entity is the data defined by the metadata; and a parsing unit, configured to perform a parsing operation on the target operation object according to the target entity.
Optionally, the parsing module includes: a parsing unit, configured to parse the crawled page to obtain the raw data corresponding to the crawled page; a screening unit, configured to perform data cleaning and screening on the raw data by deleting the data entries that contain terms from a blacklist lexicon, to obtain first result data; and a selection unit, configured to select, from the first result data, the data entries that contain a keyword, to obtain second result data.
It should be noted that each of the above modules can be implemented in software or hardware. In the latter case this can be done in the following ways, among others: the above modules are all located in the same processor; or the above modules are distributed across different processors in any combination.
Embodiment 3
In the several embodiments provided by the present invention, it should be understood that the disclosed system, device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function; in an actual implementation there may be other ways of dividing them, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) or a processor to execute some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
An embodiment of the present invention further provides a storage medium in which a computer program is stored, wherein the computer program is configured to execute, when run, the steps in any of the above method embodiments.
Optionally, in this embodiment, the storage medium may be configured to store a computer program for executing the following steps:
S1: obtaining a data acquisition instruction according to a trigger condition;
S2: calling a crawler program according to the data acquisition instruction;
S3: receiving the crawler page captured by the crawler program;
S4: parsing the crawler page to obtain result data, and storing the result data in a MySQL database.
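Steps S1 to S4 can be sketched end to end as below. This is a minimal illustration, not the patented implementation: the trigger is assumed to be a dictionary carrying a `url`, the crawler is an injected `fetch` callable, only the page title is parsed, and Python's built-in `sqlite3` stands in for the MySQL database named in the text so that the example runs without a database server.

```python
import sqlite3
from html.parser import HTMLParser
from typing import Callable, Dict, Optional


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element of a page."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def get_instruction(trigger: Dict[str, str]) -> Optional[str]:
    """S1: derive a data acquisition instruction (here, a URL) from the trigger."""
    return trigger.get("url")


def run_pipeline(
    trigger: Dict[str, str],
    fetch: Callable[[str], str],
    conn: sqlite3.Connection,
) -> Optional[str]:
    url = get_instruction(trigger)  # S1
    if url is None:
        return None
    page = fetch(url)               # S2 and S3: call the crawler, receive the page
    parser = TitleParser()
    parser.feed(page)               # S4: parse the page into result data
    title = parser.title.strip()
    # S4 (continued): store the result; sqlite3 is a stand-in for MySQL.
    conn.execute("CREATE TABLE IF NOT EXISTS result (url TEXT, title TEXT)")
    conn.execute("INSERT INTO result VALUES (?, ?)", (url, title))
    conn.commit()
    return title
```

Passing the crawler in as a callable keeps the sketch testable offline; in a deployment it would wrap whichever crawler program the instruction selects.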
Optionally, in this embodiment, the storage medium may include, but is not limited to, various media that can store a computer program, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
An embodiment of the present invention further provides an electronic device, including a memory and a processor, wherein a computer program is stored in the memory and the processor is configured to run the computer program to execute the steps in any of the above method embodiments.
Optionally, the electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by means of the computer program:
S1: obtaining a data acquisition instruction according to a trigger condition;
S2: calling a crawler program according to the data acquisition instruction;
S3: receiving the crawler page captured by the crawler program;
S4: parsing the crawler page to obtain result data, and storing the result data in a MySQL database.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (10)
1. A method for obtaining external data using a crawler, characterized in that the method comprises:
obtaining a data acquisition instruction according to a trigger condition;
calling a crawler program according to the data acquisition instruction;
receiving the crawler page captured by the crawler program;
parsing the crawler page to obtain result data, and storing the result data in a MySQL database.
2. The method according to claim 1, characterized in that calling a crawler program according to the data acquisition instruction comprises:
converting the data acquisition instruction into a crawler task;
determining a difficulty coefficient of the crawler task;
determining, according to the difficulty coefficient, the number of crawler programs and the crawler request mode of the crawler programs.
3. The method according to claim 2, characterized in that determining the difficulty coefficient of the crawler task comprises:
determining the difficulty coefficient of the crawler task according to at least one of: the number of data sources, the size of the data, the size of the data distribution area, and the complexity of the link addresses.
4. The method according to claim 2, characterized in that determining, according to the difficulty coefficient, the number of crawler programs and the crawler request mode of the crawler programs comprises:
when the difficulty coefficient is lower than a preset threshold, selecting one crawler program and a crawler request mode of a first type; when the difficulty coefficient is greater than or equal to the preset threshold, selecting multiple crawler programs and multiple corresponding crawler request modes of a second type;
wherein the crawler request mode of the first type includes one of: directly fetching a uniform resource locator (URL), or requesting through a proxy; and the crawler request mode of the second type includes one of: requesting with a simulated browser, or requesting with a real browser kernel.
5. The method according to claim 1, characterized in that calling a crawler program according to the data acquisition instruction comprises:
converting the data acquisition instruction into a crawler task;
calling multiple crawler nodes in a distributed network, wherein crawler programs are distributed across the crawler nodes and the crawler nodes are deployed on servers of the distributed network;
obtaining the processing capacity of each crawler node in the distributed network;
allocating crawler subtasks to each crawler node according to its processing capacity, wherein the crawler task includes multiple crawler subtasks.
6. The method according to claim 1, characterized in that, when the crawler page is parsed in layers, parsing the crawler page to obtain result data comprises:
receiving a call request from an upper layer to the current layer;
determining, according to the metadata carried in the call request, the target entity inherited by the target operation object, wherein the target operation object is the object that the current layer needs to parse and the target entity is the data defined by the metadata;
performing a parsing operation on the target operation object according to the target entity.
7. The method according to claim 1, characterized in that parsing the crawler page to obtain result data comprises:
parsing the crawler page to obtain the raw data corresponding to the crawler page;
performing data cleaning and screening on the raw data, deleting data packets that contain words from a blacklist lexicon, to obtain first result data;
selecting the data packets in the first result data that contain keywords, to obtain second result data.
8. A device for obtaining external data using a crawler, characterized in that the device comprises:
an obtaining module, configured to obtain a data acquisition instruction according to a trigger condition;
a calling module, configured to call a crawler program according to the data acquisition instruction;
a receiving module, configured to receive the crawler page captured by the crawler program;
a parsing module, configured to parse the crawler page to obtain result data, and to store the result data in a MySQL database.
9. A computer device, including a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
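For readers who want a concrete picture of claims 2 to 4, the following sketch shows one way a difficulty coefficient could drive the choice of crawler count and request mode. The weighted-sum formula, the weights, the threshold value, the cap on crawler count and the mode labels are all assumptions made for illustration; the claims only name the input factors and the two categories of request mode.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class CrawlTask:
    data_sources: int       # number of data sources
    size_mb: float          # size of the data
    regions: int            # size of the data distribution area
    link_complexity: float  # complexity of link addresses, assumed in [0, 1]


def difficulty(task: CrawlTask, weights=(1.0, 0.01, 0.5, 5.0)) -> float:
    """Assumed weighted sum of the four factors named in claim 3."""
    w_src, w_size, w_region, w_link = weights
    return (
        w_src * task.data_sources
        + w_size * task.size_mb
        + w_region * task.regions
        + w_link * task.link_complexity
    )


def plan_crawl(task: CrawlTask, threshold: float = 10.0, max_crawlers: int = 8) -> Tuple[int, str]:
    """Return (number of crawler programs, request mode) for a task."""
    if difficulty(task) < threshold:
        # First type: a single crawler fetching the URL directly (or via a proxy).
        return 1, "direct_url"
    # Second type: several crawlers using a simulated or real browser.
    count = min(max_crawlers, max(2, task.data_sources))
    return count, "simulated_browser"
```

Under these assumed weights, a small single-source task stays on the lightweight path, while a task spanning many sources and regions is spread over several browser-based crawlers.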
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910320214.XA CN110188258A (en) | 2019-04-19 | 2019-04-19 | The method and device of external data is obtained using crawler |
PCT/CN2019/117722 WO2020211351A1 (en) | 2019-04-19 | 2019-11-12 | Method and device for obtaining external data by using crawler |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910320214.XA CN110188258A (en) | 2019-04-19 | 2019-04-19 | The method and device of external data is obtained using crawler |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110188258A true CN110188258A (en) | 2019-08-30 |
Family
ID=67714829
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910320214.XA Pending CN110188258A (en) | 2019-04-19 | 2019-04-19 | The method and device of external data is obtained using crawler |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110188258A (en) |
WO (1) | WO2020211351A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020211351A1 (en) * | 2019-04-19 | 2020-10-22 | 平安科技(深圳)有限公司 | Method and device for obtaining external data by using crawler |
CN113076457A (en) * | 2021-04-09 | 2021-07-06 | 航天信息(广东)有限公司 | Crawler action processing method and device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070050445A1 (en) * | 2005-08-31 | 2007-03-01 | Hugh Hyndman | Internet content analysis |
US20070124110A1 (en) * | 2005-11-28 | 2007-05-31 | Fatlens Inc. | Method, system and computer program product for identifying primary product objects |
US20090288099A1 (en) * | 2008-05-18 | 2009-11-19 | Sap Portals Israel Ltd | Apparatus and method for accessing and indexing dynamic web pages |
CN107015986A (en) * | 2016-01-27 | 2017-08-04 | 北京国双科技有限公司 | A kind of reptile crawls the method and device of webpage |
US20180067932A1 (en) * | 2016-09-02 | 2018-03-08 | FutureVault Inc. | Real-time document filtering systems and methods |
CN107895009A (en) * | 2017-11-10 | 2018-04-10 | 北京国信宏数科技有限责任公司 | One kind is based on distributed internet data acquisition method and system |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8229911B2 (en) * | 2008-05-13 | 2012-07-24 | Enpulz, Llc | Network search engine utilizing client browser activity information |
CN101826110B (en) * | 2010-04-13 | 2011-12-21 | 北京大学 | Method for crawling BitTorrent torrent files |
CN102890692A (en) * | 2011-07-22 | 2013-01-23 | 阿里巴巴集团控股有限公司 | Webpage information extraction method and webpage information extraction system |
CN105653599A (en) * | 2015-12-23 | 2016-06-08 | 浪潮软件集团有限公司 | Data acquisition method and device |
CN108021369B (en) * | 2017-12-21 | 2020-10-16 | 马上消费金融股份有限公司 | Data integration processing method and related device |
CN109614539A (en) * | 2019-01-16 | 2019-04-12 | 重庆金融资产交易所有限责任公司 | Data grab method, device and computer readable storage medium |
CN110188258A (en) * | 2019-04-19 | 2019-08-30 | 平安科技(深圳)有限公司 | The method and device of external data is obtained using crawler |
-
2019
- 2019-04-19 CN CN201910320214.XA patent/CN110188258A/en active Pending
- 2019-11-12 WO PCT/CN2019/117722 patent/WO2020211351A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2020211351A1 (en) | 2020-10-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9720731B2 (en) | Methods and apparatus for coordinating and selecting protocols for resources acquisition from multiple resource managers | |
CN108370341B (en) | Resource allocation method, virtual network function manager and network element management system | |
CN109347947A (en) | A kind of method of load balancing, name server and cluster nas server | |
CN107682397B (en) | Customer resources acquisition methods, device, terminal device and storage medium | |
CN109471727A (en) | A kind of task processing method, apparatus and system | |
CN108431796A (en) | Distributed resource management system and method | |
Delicato et al. | Resource management for Internet of Things | |
CN106331150A (en) | Method and device for scheduling cloud servers | |
CN106201661A (en) | Method and apparatus for elastic telescopic cluster virtual machine | |
CN106155812A (en) | Method, device, system and the electronic equipment of a kind of resource management to fictitious host computer | |
CN110506259A (en) | System and method for calculate node management agreement | |
CN108696400A (en) | network monitoring method and device | |
CN105468619B (en) | Resource allocation methods and device for database connection pool | |
CN111552838A (en) | Data processing method and device, computer equipment and storage medium | |
CN108365989A (en) | Event-handling method and device | |
CN107819825A (en) | A kind of service scheduling method, device and electronic equipment | |
CN110188258A (en) | The method and device of external data is obtained using crawler | |
CN108563509A (en) | Data query implementation method, device, medium and electronic equipment | |
CN108111499A (en) | Service process performance optimization method, device, electronic equipment and storage medium | |
CN115134371A (en) | Scheduling method, system, equipment and medium containing edge network computing resources | |
US7478396B2 (en) | Tunable engine, method and program product for resolving prerequisites for client devices in an open service gateway initiative (OSGi) framework | |
CN113037891A (en) | Access method and device for stateful application in edge computing system and electronic equipment | |
CN110472109A (en) | Mobilism Data Quality Analysis method and plateform system | |
CN111565120B (en) | 5G network slicing product configuration method and system and electronic equipment | |
CN116700929A (en) | Task batch processing method and system based on artificial intelligence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||