CN110188258A - Method and device for obtaining external data using a crawler - Google Patents

Method and device for obtaining external data using a crawler

Info

Publication number
CN110188258A
CN110188258A CN201910320214.XA CN201910320214A CN110188258A CN 110188258 A CN110188258 A CN 110188258A CN 201910320214 A CN201910320214 A CN 201910320214A CN 110188258 A CN110188258 A CN 110188258A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910320214.XA
Other languages
Chinese (zh)
Inventor
申超波
阮晓雯
徐亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201910320214.XA
Publication of CN110188258A
Priority to PCT/CN2019/117722 (WO2020211351A1)
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

Embodiments of the invention provide a method and a device for obtaining external data using a crawler. In one aspect, the method comprises: obtaining a data acquisition instruction according to a trigger condition; calling a crawler program according to the data acquisition instruction; receiving the crawler page fetched by the crawler program; and parsing the crawler page to obtain result data and storing the result data in a MySQL database. The invention solves the technical problem in the prior art that crawler programs cannot be called automatically to obtain data, improves the efficiency of capturing data with crawlers, and reduces manual operation.

Description

Method and device for obtaining external data using a crawler
[Technical field]
The present invention relates to the computer field, and in particular to a method and a device for obtaining external data using a crawler.
[Background]
In the prior art, a crawler is a program or script that automatically fetches web information according to certain rules. Crawlers are currently the most common and most important means by which companies obtain external data, and such data can be a useful supplement to business data.
Many crawler technologies now exist, but each offers only a narrow function, and both crawler automation and the persistence of crawled data are lacking. After a crawler obtains data, the user still has to screen and process it further, which is inefficient; building large databases or running periodic tasks this way consumes a great deal of manpower.
No effective solution to the above problems in the related art has yet been found.
[Summary of the invention]
In view of this, embodiments of the invention provide a method and a device for obtaining external data using a crawler.
In one aspect, an embodiment of the invention provides a method for obtaining external data using a crawler, the method comprising: obtaining a data acquisition instruction according to a trigger condition; calling a crawler program according to the data acquisition instruction; receiving the crawler page fetched by the crawler program; and parsing the crawler page to obtain result data and storing the result data in a MySQL database.
Optionally, calling the crawler program according to the data acquisition instruction comprises: converting the data acquisition instruction into a crawler task; determining a difficulty coefficient of the crawler task; and determining, according to the difficulty coefficient, the number of crawler programs and the crawler request method of the crawler programs.
Optionally, determining the difficulty coefficient of the crawler task comprises determining it according to at least one of: the number of data sources, the size of the data, the size of the region over which the data is distributed, and the complexity of the link address.
Optionally, determining the number of crawler programs and their crawler request method according to the difficulty coefficient comprises: when the difficulty coefficient is below a preset threshold, selecting one crawler program and a crawler request method of a first type; when the difficulty coefficient is greater than or equal to the preset threshold, selecting multiple crawler programs and multiple corresponding crawler request methods of a second type; wherein the first type of crawler request method comprises one of: fetching the uniform resource locator (URL) directly, or requesting through a proxy; and the second type of crawler request method comprises one of: requesting through a simulated browser, or requesting through a real browser kernel.
Optionally, calling the crawler program according to the data acquisition instruction comprises: converting the data acquisition instruction into a crawler task; calling multiple crawler nodes in a distributed network, wherein crawler programs are distributed on each crawler node and the crawler nodes are arranged in servers of the distributed network; obtaining the processing capacity of each crawler node in the distributed network; and allocating crawler subtasks to each crawler node according to its processing capacity, wherein the crawler task comprises multiple crawler subtasks.
Optionally, when the crawler page is parsed layer by layer, parsing the crawler page to obtain result data comprises: receiving a call request from an upper layer to the current layer; determining, according to metadata carried in the call request, the target entity inherited by a target operation object, wherein the target operation object is the object that the current layer needs to parse and the target entity is the data defined by the metadata; and performing a parsing operation on the target operation object according to the target entity.
Optionally, parsing the crawler page to obtain result data comprises: parsing the crawler page to obtain raw data corresponding to the crawler page; cleaning and screening the raw data by deleting data packets that contain words from a blacklist dictionary, to obtain first result data; and selecting, from the first result data, the data packets that contain a keyword, to obtain second result data.
In another aspect, an embodiment of the invention provides a device for obtaining external data using a crawler, the device comprising: an obtaining module, configured to obtain a data acquisition instruction according to a trigger condition; a calling module, configured to call a crawler program according to the data acquisition instruction; a receiving module, configured to receive the crawler page fetched by the crawler program; and a parsing module, configured to parse the crawler page to obtain result data and store the result data in a MySQL database.
Optionally, the calling module comprises: a converting unit, configured to convert the data acquisition instruction into a crawler task; a first determining unit, configured to determine a difficulty coefficient of the crawler task; and a second determining unit, configured to determine, according to the difficulty coefficient, the number of crawler programs and the crawler request method of the crawler programs.
Optionally, the first determining unit comprises: a determining subunit, configured to determine the difficulty coefficient of the crawler task according to at least one of: the number of data sources, the size of the data, the size of the region over which the data is distributed, and the complexity of the link address.
Optionally, the second determining unit comprises: a selecting subunit, configured to select one crawler program and a crawler request method of a first type when the difficulty coefficient is below a preset threshold, and to select multiple crawler programs and multiple corresponding crawler request methods of a second type when the difficulty coefficient is greater than or equal to the preset threshold; wherein the first type of crawler request method comprises one of: fetching the uniform resource locator (URL) directly, or requesting through a proxy; and the second type of crawler request method comprises one of: requesting through a simulated browser, or requesting through a real browser kernel.
Optionally, the calling module comprises: a converting unit, configured to convert the data acquisition instruction into a crawler task; a calling unit, configured to call multiple crawler nodes in a distributed network, wherein crawler programs are distributed on each crawler node and the crawler nodes are arranged in servers of the distributed network; an obtaining unit, configured to obtain the processing capacity of each crawler node in the distributed network; and an allocating unit, configured to allocate crawler subtasks to each crawler node according to its processing capacity, wherein the crawler task comprises multiple crawler subtasks.
Optionally, when the crawler page is parsed layer by layer, the parsing module comprises: a receiving unit, configured to receive a call request from an upper layer to the current layer; a determining unit, configured to determine, according to metadata carried in the call request, the target entity inherited by a target operation object, wherein the target operation object is the object that the current layer needs to parse and the target entity is the data defined by the metadata; and a parsing unit, configured to perform a parsing operation on the target operation object according to the target entity.
Optionally, the parsing module comprises: a parsing unit, configured to parse the crawler page to obtain raw data corresponding to the crawler page; a screening unit, configured to clean and screen the raw data by deleting data packets that contain words from a blacklist dictionary, to obtain first result data; and a selecting unit, configured to select, from the first result data, the data packets that contain a keyword, to obtain second result data.
According to yet another embodiment of the invention, a storage medium is provided in which a computer program is stored, wherein the computer program is configured to perform, when run, the steps of any of the above method embodiments.
According to yet another embodiment of the invention, an electronic device is provided, comprising a memory and a processor, wherein a computer program is stored in the memory and the processor is configured to run the computer program to perform the steps of any of the above method embodiments.
The invention enables automatic scheduling of crawler tasks and automatic storage of crawler result data. It solves the technical problem in the prior art that crawler programs cannot be called automatically to obtain data, improves the efficiency of capturing data with crawlers, and reduces manual operation.
[Brief description of the drawings]
To explain the technical solutions of the embodiments of the invention more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a hardware block diagram of a server for obtaining external data using a crawler according to an embodiment of the invention;
Fig. 2 is a flowchart of the method for obtaining external data using a crawler according to an embodiment of the invention;
Fig. 3 is an application framework diagram of an embodiment of the invention;
Fig. 4 is an overall workflow diagram, including data cleaning, of an embodiment of the invention;
Fig. 5 is a structural block diagram of the device for obtaining external data using a crawler according to an embodiment of the invention.
[Detailed description]
The invention is described in detail below with reference to the drawings and in combination with the embodiments. It should be noted that, provided there is no conflict, the embodiments of this application and the features in them may be combined with one another.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects, and do not describe a particular order or sequence.
Embodiment 1
The method embodiment provided in Embodiment 1 of this application may be executed on a mobile terminal, a server, a terminal, or a similar computing device. Taking execution on a server as an example, Fig. 1 is a hardware block diagram of a server for obtaining external data using a crawler according to an embodiment of the invention. As shown in Fig. 1, the server 10 may include one or more processors 102 (only one is shown in Fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data. Optionally, the server may also include a transmission device 106 for communication and an input/output device 108. A person of ordinary skill in the art will understand that the structure shown in Fig. 1 is only illustrative and does not limit the structure of the server. For example, the server 10 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1.
The memory 104 can be used to store computer programs, for example the software programs and modules of application software, such as the computer program corresponding to the method for obtaining external data using a crawler in the embodiments of the invention. By running the computer programs stored in the memory 104, the processor 102 performs various functional applications and data processing, that is, it implements the above method. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and such remote memory may be connected to the server 10 through a network. Examples of the network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The transmission device 106 is used to receive or send data via a network. A specific example of the network may include a wireless network provided by the communication provider of the server 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can connect to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 may be a radio frequency (RF) module, which communicates with the Internet wirelessly.
This embodiment provides a method for obtaining external data using a crawler. Fig. 2 is a flowchart of the method according to an embodiment of the invention. As shown in Fig. 2, the process includes the following steps:
Step S202: obtain a data acquisition instruction according to a trigger condition;
The trigger condition in this embodiment may be an acquisition instruction sent by the user in real time, or it may fire automatically on a schedule. For example, when the external data is stock market trading volume, acquisition of the market-wide trading data is triggered automatically after the market closes on a trading day (for example, at 15:30).
Step S204: call a crawler program according to the data acquisition instruction;
Step S206: receive the crawler page fetched by the crawler program;
Step S208: parse the crawler page to obtain result data, and store the result data in a MySQL database (a relational database management system).
The scheme of this embodiment enables automatic scheduling of crawler tasks and automatic storage of crawler result data. It solves the technical problem in the prior art that crawler programs cannot be called automatically to obtain data, improves the efficiency of capturing data with crawlers, and reduces manual operation.
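Steps S202 to S208 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the canned page returned by `call_crawler`, the instruction format, and the use of weekdays as a stand-in for trading days are all hypothetical, and Python's built-in `sqlite3` is swapped in for MySQL so that the sketch runs without a database server.

```python
import re
import sqlite3
from datetime import datetime, time

CLOSE_TIME = time(15, 30)  # trigger time taken from the example in the text


def should_trigger(now, user_requested=False):
    """S202: fire on a real-time user request, or on a weekday after CLOSE_TIME."""
    if user_requested:
        return True
    # Assumption: weekdays stand in for trading days; holidays are ignored.
    return now.weekday() < 5 and now.time() >= CLOSE_TIME


def call_crawler(instruction):
    """S204/S206: hypothetical crawler that returns a canned page, no network."""
    return "<td class='symbol'>{}</td><td class='volume'>12345</td>".format(
        instruction["symbol"])


def parse_page(page):
    """S208 (parse): pull the symbol and volume out of the crawler page."""
    symbol = re.search(r"class='symbol'>([^<]+)<", page).group(1)
    volume = int(re.search(r"class='volume'>(\d+)<", page).group(1))
    return {"symbol": symbol, "volume": volume}


def store(conn, row):
    """S208 (store): persist the result row; sqlite3 stands in for MySQL here."""
    conn.execute("CREATE TABLE IF NOT EXISTS result (symbol TEXT, volume INTEGER)")
    conn.execute("INSERT INTO result VALUES (?, ?)", (row["symbol"], row["volume"]))
    conn.commit()


def run_once(conn, now, user_requested=False):
    """Run the four steps once; return the stored row, or None if not triggered."""
    if not should_trigger(now, user_requested):
        return None
    instruction = {"symbol": "INDEX"}  # hypothetical data acquisition instruction
    row = parse_page(call_crawler(instruction))
    store(conn, row)
    return row
```

Calling `run_once(sqlite3.connect(":memory:"), datetime.now())` stores a row only when the trigger condition holds; passing `user_requested=True` models the real-time instruction path.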
Fig. 3 is an application framework diagram of an embodiment of the invention. As shown in Fig. 3, each function in the framework is modularized, and the application framework includes: Apscheduler (the crawler task manager), spider (the crawler program), and the MySQL database. Apscheduler manages and schedules the crawler programs and the MySQL database; the crawler programs fetch external data according to their tasks; and the MySQL database stores the external data. Specifically, Apscheduler is responsible for managing, scheduling, and periodically controlling crawler tasks, including setting, pausing, removing, and scheduling tasks. For periodic scheduling control, it triggers tasks or reminders periodically according to the trigger condition set by the user. Tasks are thread tasks, and each task is processed in a background thread. To add a new crawler task later, it is enough to inherit the corresponding Task class, implement the task-specific method of the Task class, and add the task to the task manager. spider is responsible for the concrete implementation of crawler tasks, and includes a crawler request module, a crawler page parsing module, and a module for cleaning and sorting crawler result data. MySQL is responsible for persisting the final crawler result data, that is, storing the crawler result data.
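The "inherit the Task class and add it to the task manager" pattern described above might look like the following sketch. It uses only the standard `threading` module and does not reproduce the Apscheduler API; `Task`, `TaskManager`, `VolumeTask`, and all method names are hypothetical, and periodic triggering is left out so that the example stays short.

```python
import threading


class Task:
    """Hypothetical base class: subclasses implement run() with the crawl logic."""
    name = "task"

    def run(self):
        raise NotImplementedError


class TaskManager:
    """Hypothetical manager: holds tasks and runs each one in a background thread."""

    def __init__(self):
        self._tasks = {}
        self._paused = set()

    def add(self, task):
        self._tasks[task.name] = task

    def pause(self, name):
        self._paused.add(name)

    def resume(self, name):
        self._paused.discard(name)

    def remove(self, name):
        self._tasks.pop(name, None)
        self._paused.discard(name)

    def trigger_all(self):
        """Start one daemon thread per active task and wait for them to finish."""
        threads = [threading.Thread(target=t.run, daemon=True)
                   for n, t in self._tasks.items() if n not in self._paused]
        for th in threads:
            th.start()
        for th in threads:
            th.join()
        return len(threads)


class VolumeTask(Task):
    """Example of a new crawler task added by inheriting Task."""
    name = "volume"

    def __init__(self, sink):
        self.sink = sink

    def run(self):
        self.sink.append("volume crawled")  # placeholder for real crawl work
```

A new task type then needs only a subclass with its own `run()` and one `add()` call; the manager itself does not change.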
In this embodiment, the data acquisition instruction is converted into a crawler task, a crawler request method is selected according to the difficulty of the crawler task, and crawler programs are called to complete the crawler task. Each crawler task requires one or more crawler programs. The number of crawler programs can be determined by the difficulty of the crawler task: crawler difficulty is graded, a crawler task of the lowest difficulty grade is assigned one crawler program, and a high-difficulty crawler task is assigned multiple crawler programs. In this embodiment, calling the crawler program according to the data acquisition instruction includes:
S11: convert the data acquisition instruction into a crawler task;
S12: determine the difficulty coefficient of the crawler task;
Optionally, the difficulty of the crawler task is determined by factors such as the number of data sources (for example, the number of web pages or databases), the size of the data, the size of the region over which the data is distributed (for example, within or outside the province), and the complexity of the link address.
S13: determine, according to the difficulty coefficient, the number of crawler programs and the crawler request method of the crawler programs.
In an optional implementation of this embodiment, the difficulty grade of the crawler task is determined, and a corresponding number of crawler programs and a corresponding crawler request method are then assigned according to that grade. Determining the number of crawler programs and their crawler request method according to the difficulty coefficient includes: when the difficulty coefficient is below a preset threshold, selecting one crawler program and a crawler request method of a first type; when the difficulty coefficient is greater than or equal to the preset threshold, selecting multiple crawler programs and multiple corresponding crawler request methods of a second type. The first type of crawler request method includes one of: fetching the uniform resource locator (URL) directly, or requesting through a proxy. The second type of crawler request method includes one of: requesting through a simulated browser, or requesting through a real browser kernel.
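Steps S12 and S13 can be illustrated with a short sketch. The source names the four difficulty factors and the threshold rule but gives no formula, so the weighted sum, the weights, the threshold value, the pre-normalized factor values, and the default of three crawler programs are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed weights and threshold; the source names the factors but gives no formula.
WEIGHTS = {"sources": 0.4, "size": 0.2, "region": 0.2, "link": 0.2}
THRESHOLD = 0.5

FIRST_TYPE = ("direct_url", "proxy")
SECOND_TYPE = ("simulated_browser", "real_browser_kernel")


@dataclass
class CrawlerTask:
    sources: float  # each factor is pre-normalized to the range 0..1 (assumption)
    size: float
    region: float
    link: float


def difficulty(task):
    """S12: weighted sum of the four factors named in the text."""
    return sum(WEIGHTS[k] * getattr(task, k) for k in WEIGHTS)


def plan(task, many=3):
    """S13: return (number of crawler programs, request methods) for the task."""
    if difficulty(task) < THRESHOLD:
        return 1, [FIRST_TYPE[0]]
    # At or above the threshold: several crawlers, each with a second-type method.
    return many, [SECOND_TYPE[i % len(SECOND_TYPE)] for i in range(many)]
```

Keeping `difficulty` separate from `plan` means a different scoring rule can be substituted without touching the selection logic.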
When carrying out a specific crawler task, the system only needs to call crawler programs according to the difficulty of the task. For example, when the difficulty of the crawler task is low, the request method that fetches the URL directly can be chosen to crawl the data. Different request methods differ in their ability to obtain data (the more capable the method, the more resources and overhead it requires), so the crawler request method needs to match the difficulty of the crawler task in order to allocate resources sensibly. Because the crawler task corresponds to both the number of crawler programs and their request method, both can be assigned according to the difficulty of the task. Other optional implementations of this embodiment include: assigning a corresponding number of crawler programs and a fixed crawler request method according to the difficulty grade; or assigning a fixed number of crawler programs and a corresponding crawler request method according to the difficulty grade. Each crawler program uses the same crawler request method.
In one application scenario of this embodiment, the method is applied in a distributed network, and calling the crawler program according to the data acquisition instruction includes: converting the data acquisition instruction into a crawler task; calling multiple crawler nodes in the distributed network, wherein crawler programs are distributed on each crawler node and the crawler nodes are arranged in servers of the distributed network; obtaining the processing capacity of each crawler node in the distributed network; and allocating crawler subtasks to each crawler node according to its processing capacity, wherein the crawler task includes multiple crawler subtasks.
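One way to allocate subtasks by processing capacity is a proportional split, sketched below. The source does not say how capacity is measured or how the split is computed, so the integer capacity scores, the node names, and the largest-node-first handling of leftover subtasks are assumptions.

```python
def allocate(subtasks, capacities):
    """Split subtasks across nodes in proportion to each node's capacity.

    capacities maps node name -> positive integer capacity score (hypothetical unit).
    """
    total = sum(capacities.values())
    nodes = sorted(capacities, key=capacities.get, reverse=True)
    # Floor of each node's proportional share.
    shares = {n: int(len(subtasks) * capacities[n] // total) for n in nodes}
    # Hand leftover subtasks to the most capable nodes first.
    leftover = len(subtasks) - sum(shares.values())
    for n in nodes[:leftover]:
        shares[n] += 1
    plan, start = {}, 0
    for n in nodes:
        plan[n] = subtasks[start:start + shares[n]]
        start += shares[n]
    return plan
```

Every subtask is assigned exactly once, and a node with three times the capacity of another receives roughly three times as many subtasks.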
In one example, crawler program priorities may also be introduced. Crawler programs are distributed on crawler nodes, with multiple crawler programs of different priorities on each crawler node, and the crawler nodes are arranged in servers of the distributed network. The processing capacity of each crawler node in the distributed network is obtained, and crawler tasks are allocated to each crawler node according to a preset priority order and the processing capacity of each crawler node, so that the crawler node processes the crawler tasks allocated to it. The maximum access volume of a single crawler program is determined according to the processing capacity of each crawler node. If the crawler task volume allocated to a crawler node is greater than or equal to that maximum access volume, multiple crawler programs are used; if it is less than the maximum access volume of a single crawler program, the single highest-priority crawler program is used. Alternatively, the crawler task can be spread evenly across the crawler programs of all nodes. In the distributed network the crawler nodes are fixed, being arranged in servers of the distributed network, and the crawler programs on each crawler node are also deployed in advance. Crawler tasks are allocated to the crawler nodes according to priority and processing capacity, and the crawler programs to use are then determined from the tasks allocated to each crawler node and the maximum access volume of each crawler program on that node.
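The per-node choice between one crawler program and several can be sketched as follows. The rule "below the maximum access volume, use the single highest-priority crawler; otherwise use multiple" comes from the text, while the convention that a larger number means higher priority and the choice of just enough crawlers to cover the volume (with a minimum of two) are assumptions.

```python
import math


def pick_crawlers(task_volume, crawlers, max_volume):
    """Choose which crawler programs on one node handle the allocated volume.

    crawlers: list of (name, priority) pairs; a larger number means a higher
    priority (assumed convention). max_volume: maximum access volume of a
    single crawler program on this node.
    """
    ranked = [name for name, _ in sorted(crawlers, key=lambda c: c[1], reverse=True)]
    if task_volume < max_volume:
        return ranked[:1]  # the single highest-priority crawler is enough
    # Assumption: take just enough crawlers to cover the volume, at least two,
    # in priority order; the slice caps the count at the crawlers available.
    needed = max(2, math.ceil(task_volume / max_volume))
    return ranked[:needed]
```

For a node with three crawler programs and a maximum access volume of 100, a volume of 50 uses one crawler, a volume of exactly 100 uses two, and a volume of 250 uses all three.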
In this embodiment, because crawler pages are highly diverse, they cannot all be parsed in a single way. Therefore only the top-level parent class of the crawler parsing logic is predefined, and the parsing logic for a specific page is implemented by each layer inheriting from the top-level parent class and customizing it. Layered parsing includes: on receiving a call request from the upper layer to the current layer, determining, according to the metadata carried in the request, the target entity inherited by the target operation object, and parsing the target operation object according to the target entity.
Parsing a crawler page takes multiple steps and proceeds layer by layer. Each layer is responsible for a different parsing operation and parses different metadata, until all the data of the crawler page has been obtained. The target operation object is the object that the current layer needs to parse, and the target entity is the data described by the metadata; metadata is data that defines data (descriptive information about data and information resources). For example, suppose there is a student information record with the fields name, age, gender (male), class, and so on. Then name, age, male, and class are metadata, and each layer is responsible for parsing the data of at least one metadata item (for example, name corresponds to Zhang San or Li Si); the data that a metadata item describes is the target entity. After all layers have parsed all the metadata, the result data of the crawler page is obtained.
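The layered parsing described above can be read as a chain of parser classes that share one parent, as in the sketch below. This is one interpretation rather than the patent's actual design: `BaseParser`, the request dictionary with `metadata` and `result` keys, and the tag-style page format are hypothetical, and the student record fields follow the example in the text.

```python
import re


class BaseParser:
    """Top-level parent: defines the layer protocol; subclasses customize extract()."""
    field = None  # the metadata item this layer is responsible for

    def __init__(self, next_layer=None):
        self.next_layer = next_layer

    def extract(self, page):
        raise NotImplementedError

    def parse(self, page, request):
        """Handle a call from the upper layer, then call the next layer down."""
        result = dict(request.get("result", {}))
        # Parse only if the request's metadata names this layer's field.
        if self.field in request["metadata"]:
            result[self.field] = self.extract(page)
        if self.next_layer is None:
            return result
        return self.next_layer.parse(
            page, {"metadata": request["metadata"], "result": result})


class NameParser(BaseParser):
    field = "name"

    def extract(self, page):
        return re.search(r"<name>(.*?)</name>", page).group(1)


class AgeParser(BaseParser):
    field = "age"

    def extract(self, page):
        return int(re.search(r"<age>(\d+)</age>", page).group(1))
```

Building the chain as `NameParser(AgeParser())` and calling `parse(page, {"metadata": ["name", "age"]})` passes the partial result down layer by layer; supporting a new page type means adding a subclass, not editing the parent.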
In one implementation of this embodiment, parsing the crawler page to obtain result data includes: parsing the crawler page to obtain raw data corresponding to the crawler page; cleaning and screening the raw data by deleting data packets that contain words from a blacklist dictionary, to obtain first result data; and selecting, from the first result data, the data packets that contain a keyword, to obtain second result data. The second result data is taken as the final result data, and the result data of the crawler task is stored automatically in the MySQL database, which completes the persistence of the crawler result. Meanwhile, logs of the important steps of a crawler task can be stored in MySQL, and each crawler task can be monitored effectively by checking the program logs. Fig. 4 is an overall workflow diagram, including data cleaning, of an embodiment of the invention, in which each function is encapsulated as a module.
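The two-stage filtering (blacklist removal, then keyword selection) is small enough to show directly. The sketch below treats each data packet as a plain string and uses substring matching; both are assumptions, since the source does not define the packet format or the matching rule.

```python
def clean(packets, blacklist):
    """First result data: drop every packet that contains a blacklisted word."""
    return [p for p in packets if not any(w in p for w in blacklist)]


def select(packets, keywords):
    """Second result data: keep only the packets that contain at least one keyword."""
    return [p for p in packets if any(k in p for k in keywords)]


def second_result(packets, blacklist, keywords):
    """Apply both stages in the order given in the text."""
    return select(clean(packets, blacklist), keywords)
```

Running the blacklist stage first means a packet that contains both a blacklisted word and a keyword is discarded, which matches the order of operations in the text.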
From the description of the above implementations, a person skilled in the art can clearly understand that the method of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the invention, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the invention.
Embodiment 2
This embodiment also provides a device for obtaining external data using a crawler. The device is used to implement the above embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the following embodiment is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 5 is a structural block diagram of the device for obtaining external data using a crawler according to an embodiment of the invention. As shown in Fig. 5, the device includes:
an obtaining module 50, configured to obtain a data acquisition instruction according to a trigger condition;
a calling module 52, configured to call a crawler program according to the data acquisition instruction;
a receiving module 54, configured to receive the crawler page fetched by the crawler program;
a parsing module 56, configured to parse the crawler page to obtain result data and store the result data in a MySQL database.
Optionally, the calling module includes: converting unit, is appointed for data acquisition instruction to be converted to crawler Business;First determination unit, for determining the degree-of-difficulty factor of the crawler task;Second determination unit, for according to the difficulty Coefficient determines the quantity of crawlers and the crawler request method of the crawlers.
Optionally, first determination unit comprises determining that subelement, appoints for the crawler according at least one of The degree-of-difficulty factor of business: the quantity of data source, the size of data, the size in data distribution region, the complexity of chained address.
Optionally, second determination unit includes: selection subelement, for being lower than preset threshold in the degree-of-difficulty factor When, select the crawler request method of a crawlers and the first kind;It is greater than or equal in the degree-of-difficulty factor described default When threshold value, the crawler request method of multiple crawlers and multiple corresponding Second Types is selected;Wherein, the first kind Crawler request method includes following one: directly acquiring uniform resource position mark URL, utilizes proxy requests;The Second Type Crawler request method include following one: using model browse request, using true browser kernel request.
Optionally, the calling module includes: converting unit, is appointed for data acquisition instruction to be converted to crawler Business;Call unit, for calling multiple crawler nodes in distributed network, wherein crawlers are distributed in each crawler section On point, crawler node is arranged in the server of distributed network;Acquiring unit, for obtaining each crawler in distributed network The processing capacity of node;Allocation unit, for being that each crawler node distributes crawler according to the processing capacity of each crawler node Subtask, wherein the crawler task includes multiple crawler subtasks.
Optionally, described in layering analysis when the crawler page, the parsing module includes: receiving unit, for receiving Call request of the upper layer to current layer;Determination unit, for determining that target is grasped according to the metadata carried in the call request Make the target entity that object is inherited, wherein the target operation object is the object that current layer needs to parse, and the target is real Body is the data of the metadata definition;Resolution unit, for executing parsing to the operation object according to the target entity Operation.
Optionally, the parsing module includes: a parsing unit, for parsing the crawler page to obtain raw data corresponding to the crawler page; a screening unit, for performing data cleaning and screening on the raw data and deleting data packets that contain words from a blacklist lexicon, to obtain first result data; and a selection unit, for selecting, from the first result data, the data packets that contain keywords, to obtain second result data.
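The two-stage filtering can be sketched as below, assuming each data packet is a plain text string and that matching is a case-insensitive substring test; neither detail is stated in the source, and the function names are illustrative.

```python
from typing import Iterable, List


def clean_and_screen(raw: Iterable[str], blacklist: Iterable[str]) -> List[str]:
    """Normalise whitespace, drop empty packets and packets with blacklisted words."""
    banned = [word.lower() for word in blacklist]
    first_result = []
    for packet in raw:
        cleaned = " ".join(packet.split())  # data cleaning: collapse whitespace
        if cleaned and not any(word in cleaned.lower() for word in banned):
            first_result.append(cleaned)
    return first_result


def select_by_keywords(first_result: Iterable[str],
                       keywords: Iterable[str]) -> List[str]:
    """Keep only the packets that contain at least one keyword."""
    wanted = [word.lower() for word in keywords]
    return [p for p in first_result if any(word in p.lower() for word in wanted)]
```

Applying `clean_and_screen` and then `select_by_keywords` yields the first and second result data respectively.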
It should be noted that each of the above modules can be implemented in software or hardware. For the latter, this can be achieved in the following ways, without limitation: the above modules are all located in the same processor; or the above modules are located in different processors in any combination.
Embodiment 3
In the several embodiments provided by the present invention, it should be understood that the disclosed system, device, and method can be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The above integrated unit, when implemented in the form of a software functional unit, can be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
An embodiment of the present invention also provides a storage medium in which a computer program is stored, wherein the computer program is configured to perform, when run, the steps in any of the above method embodiments.
Optionally, in this embodiment, the above storage medium may be configured to store a computer program for performing the following steps:
S1, obtaining a data acquisition instruction according to a trigger condition;
S2, calling a crawler program according to the data acquisition instruction;
S3, receiving the crawler page crawled by the crawler program;
S4, parsing the crawler page to obtain result data, and storing the result data in a MySQL database.
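Steps S1 to S4 can be sketched end to end as follows. To keep the example self-contained, the page fetcher is passed in as a function and Python's built-in `sqlite3` module stands in for the MySQL database named in the source; both choices, the `result` table layout, and the use of the page title as the result data are assumptions for illustration only.

```python
import sqlite3
from html.parser import HTMLParser
from typing import Callable, Iterable


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element of a page."""

    def __init__(self) -> None:
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def run_pipeline(trigger_fired: bool, urls: Iterable[str],
                 fetch: Callable[[str], str], conn: sqlite3.Connection) -> int:
    """Run S1 to S4 and return the number of stored rows."""
    if not trigger_fired:                      # S1: trigger condition not met
        return 0
    instruction = {"urls": list(urls)}         # S1: data acquisition instruction
    # S2 and S3: call the crawler (here, `fetch`) and receive the crawled pages.
    pages = [(url, fetch(url)) for url in instruction["urls"]]
    rows = []
    for url, html in pages:                    # S4: parse each page
        parser = TitleParser()
        parser.feed(html)
        rows.append((url, parser.title.strip()))
    # S4: store the result data (sqlite3 used here in place of MySQL).
    conn.execute("CREATE TABLE IF NOT EXISTS result (url TEXT, title TEXT)")
    conn.executemany("INSERT INTO result VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)
```

A production version would replace `fetch` with a real crawler call and `conn` with a MySQL connection; the control flow of the four steps would stay the same.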
Optionally, in this embodiment, the above storage medium may include, but is not limited to, various media that can store a computer program, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
An embodiment of the present invention also provides an electronic device including a memory and a processor, wherein a computer program is stored in the memory and the processor is configured to run the computer program to perform the steps in any of the above method embodiments.
Optionally, the above electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor and the input/output device is connected to the processor.
Optionally, in this embodiment, the above processor may be configured to perform the following steps by means of the computer program:
S1, obtaining a data acquisition instruction according to a trigger condition;
S2, calling a crawler program according to the data acquisition instruction;
S3, receiving the crawler page crawled by the crawler program;
S4, parsing the crawler page to obtain result data, and storing the result data in a MySQL database.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. A method for obtaining external data using a crawler, characterized in that the method comprises:
obtaining a data acquisition instruction according to a trigger condition;
calling a crawler program according to the data acquisition instruction;
receiving the crawler page crawled by the crawler program;
parsing the crawler page to obtain result data, and storing the result data in a MySQL database.
2. The method according to claim 1, characterized in that calling a crawler program according to the data acquisition instruction comprises:
converting the data acquisition instruction into a crawler task;
determining the difficulty coefficient of the crawler task;
determining, according to the difficulty coefficient, the number of crawler programs and the crawler request mode of the crawler programs.
3. The method according to claim 2, characterized in that determining the difficulty coefficient of the crawler task comprises:
determining the difficulty coefficient of the crawler task according to at least one of the following: the number of data sources, the size of the data, the size of the data distribution area, and the complexity of the link address.
4. The method according to claim 2, characterized in that determining, according to the difficulty coefficient, the number of crawler programs and the crawler request mode of the crawler programs comprises:
selecting one crawler program and a first-type crawler request mode when the difficulty coefficient is below a preset threshold; and selecting multiple crawler programs and multiple corresponding second-type crawler request modes when the difficulty coefficient is greater than or equal to the preset threshold;
wherein the first-type crawler request mode includes one of the following: directly fetching the uniform resource locator (URL), or requesting through a proxy; and the second-type crawler request mode includes one of the following: requesting with a simulated browser, or requesting with a real browser kernel.
5. The method according to claim 1, characterized in that calling a crawler program according to the data acquisition instruction comprises:
converting the data acquisition instruction into a crawler task;
calling multiple crawler nodes in a distributed network, wherein the crawler programs are distributed over the crawler nodes and the crawler nodes are deployed on servers of the distributed network;
obtaining the processing capacity of each crawler node in the distributed network;
allocating crawler subtasks to each crawler node according to the processing capacity of that node, wherein the crawler task includes multiple crawler subtasks.
6. The method according to claim 1, characterized in that, when the crawler page is parsed in layers, parsing the crawler page to obtain result data comprises:
receiving a call request from the upper layer to the current layer;
determining, according to the metadata carried in the call request, the target entity inherited by the target operation object, wherein the target operation object is the object that the current layer needs to parse and the target entity is the data defined by the metadata;
performing a parsing operation on the operation object according to the target entity.
7. The method according to claim 1, characterized in that parsing the crawler page to obtain result data comprises:
parsing the crawler page to obtain raw data corresponding to the crawler page;
performing data cleaning and screening on the raw data and deleting data packets that contain words from a blacklist lexicon, to obtain first result data;
selecting, from the first result data, the data packets that contain keywords, to obtain second result data.
8. A device for obtaining external data using a crawler, characterized in that the device comprises:
an obtaining module, for obtaining a data acquisition instruction according to a trigger condition;
a calling module, for calling a crawler program according to the data acquisition instruction;
a receiving module, for receiving the crawler page crawled by the crawler program;
a parsing module, for parsing the crawler page to obtain result data and storing the result data in a MySQL database.
9. A computer device, including a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
10. A computer storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN201910320214.XA 2019-04-19 2019-04-19 The method and device of external data is obtained using crawler Pending CN110188258A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910320214.XA CN110188258A (en) 2019-04-19 2019-04-19 The method and device of external data is obtained using crawler
PCT/CN2019/117722 WO2020211351A1 (en) 2019-04-19 2019-11-12 Method and device for obtaining external data by using crawler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910320214.XA CN110188258A (en) 2019-04-19 2019-04-19 The method and device of external data is obtained using crawler

Publications (1)

Publication Number Publication Date
CN110188258A true CN110188258A (en) 2019-08-30

Family

ID=67714829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910320214.XA Pending CN110188258A (en) 2019-04-19 2019-04-19 The method and device of external data is obtained using crawler

Country Status (2)

Country Link
CN (1) CN110188258A (en)
WO (1) WO2020211351A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020211351A1 (en) * 2019-04-19 2020-10-22 平安科技(深圳)有限公司 Method and device for obtaining external data by using crawler
CN113076457A (en) * 2021-04-09 2021-07-06 航天信息(广东)有限公司 Crawler action processing method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070050445A1 (en) * 2005-08-31 2007-03-01 Hugh Hyndman Internet content analysis
US20070124110A1 (en) * 2005-11-28 2007-05-31 Fatlens Inc. Method, system and computer program product for identifying primary product objects
US20090288099A1 (en) * 2008-05-18 2009-11-19 Sap Portals Israel Ltd Apparatus and method for accessing and indexing dynamic web pages
CN107015986A (en) * 2016-01-27 2017-08-04 北京国双科技有限公司 A kind of reptile crawls the method and device of webpage
US20180067932A1 (en) * 2016-09-02 2018-03-08 FutureVault Inc. Real-time document filtering systems and methods
CN107895009A (en) * 2017-11-10 2018-04-10 北京国信宏数科技有限责任公司 One kind is based on distributed internet data acquisition method and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8229911B2 (en) * 2008-05-13 2012-07-24 Enpulz, Llc Network search engine utilizing client browser activity information
CN101826110B (en) * 2010-04-13 2011-12-21 北京大学 Method for crawling BitTorrent torrent files
CN102890692A (en) * 2011-07-22 2013-01-23 阿里巴巴集团控股有限公司 Webpage information extraction method and webpage information extraction system
CN105653599A (en) * 2015-12-23 2016-06-08 浪潮软件集团有限公司 Data acquisition method and device
CN108021369B (en) * 2017-12-21 2020-10-16 马上消费金融股份有限公司 Data integration processing method and related device
CN109614539A (en) * 2019-01-16 2019-04-12 重庆金融资产交易所有限责任公司 Data grab method, device and computer readable storage medium
CN110188258A (en) * 2019-04-19 2019-08-30 平安科技(深圳)有限公司 The method and device of external data is obtained using crawler

Also Published As

Publication number Publication date
WO2020211351A1 (en) 2020-10-22

Similar Documents

Publication Publication Date Title
US9720731B2 (en) Methods and apparatus for coordinating and selecting protocols for resources acquisition from multiple resource managers
CN108370341B (en) Resource allocation method, virtual network function manager and network element management system
CN109347947A (en) A kind of method of load balancing, name server and cluster nas server
CN107682397B (en) Customer resources acquisition methods, device, terminal device and storage medium
CN109471727A (en) A kind of task processing method, apparatus and system
CN108431796A (en) Distributed resource management system and method
Delicato et al. Resource management for Internet of Things
CN106331150A (en) Method and device for scheduling cloud servers
CN106201661A (en) Method and apparatus for elastic telescopic cluster virtual machine
CN106155812A (en) Method, device, system and the electronic equipment of a kind of resource management to fictitious host computer
CN110506259A (en) System and method for calculate node management agreement
CN108696400A (en) network monitoring method and device
CN105468619B (en) Resource allocation methods and device for database connection pool
CN111552838A (en) Data processing method and device, computer equipment and storage medium
CN108365989A (en) Event-handling method and device
CN107819825A (en) A kind of service scheduling method, device and electronic equipment
CN110188258A (en) The method and device of external data is obtained using crawler
CN108563509A (en) Data query implementation method, device, medium and electronic equipment
CN108111499A (en) Service process performance optimization method, device, electronic equipment and storage medium
CN115134371A (en) Scheduling method, system, equipment and medium containing edge network computing resources
US7478396B2 (en) Tunable engine, method and program product for resolving prerequisites for client devices in an open service gateway initiative (OSGi) framework
CN113037891A (en) Access method and device for stateful application in edge computing system and electronic equipment
CN110472109A (en) Mobilism Data Quality Analysis method and plateform system
CN111565120B (en) 5G network slicing product configuration method and system and electronic equipment
CN116700929A (en) Task batch processing method and system based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination