CN110245278A - Method, apparatus, electronic device, and storage medium for acquiring web page data - Google Patents


Info

Publication number
CN110245278A
Authority
CN
China
Prior art keywords
web
web data
acquisition
data
configuration file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811033910.4A
Other languages
Chinese (zh)
Inventor
周健
方兴
王美玲
戴才良
赵成军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Love Letter And Letter Co Ltd
Aisino Corp
Original Assignee
Love Letter And Letter Co Ltd
Aisino Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Love Letter And Letter Co Ltd and Aisino Corp
Priority to CN201811033910.4A
Publication of CN110245278A
Legal status: Pending


Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

Embodiments of the present application provide a method, an apparatus, an electronic device, and a storage medium for acquiring web page data, and relate to the field of data acquisition. The method includes: receiving a configuration file, sent by a terminal device, for configuring a web crawler; generating, based on the configuration file, a web crawler corresponding to the configuration file; and acquiring web page data through that web crawler. With these embodiments, there is no need to separately develop an adapted web crawler for the pages of each website, which removes the crawler development task and allows web page data to be acquired quickly and conveniently.

Description

Method, apparatus, electronic device, and storage medium for acquiring web page data
Technical field
Embodiments of the present application relate to the field of data acquisition, and in particular to a method, an apparatus, an electronic device, and a computer-readable storage medium for acquiring web page data.
Background
The rapid development of the Internet has brought exponential growth in online information. With network information resources so abundant, search engines emerged to obtain relevant information both quickly and in a targeted way. A search engine is a system in which specific computer programs automatically gather information from the Internet according to a certain strategy, organize and process that information, provide a retrieval service for users, and present the information relevant to a user's query. The process by which a search engine collects information from the Internet relies on web crawlers to crawl website information. A web crawler is a program that automatically browses web pages and analyzes their content, and it is an important component of a search engine. In addition, the Scrapy framework can be used as the foundation of a web crawler, with a custom crawler built on top of it.
In the prior art, most web crawlers obtain network information by crawling the pages of a website. Starting from the links of one or several initial pages, the crawler stores the page links it visits in a database or in local storage, and different web crawlers are then used to process the different types of page links. As is well known, the pages of a website are mostly organized as a tree: navigation proceeds level by level from the home page, often through many levels. As a result, several web crawlers for different levels have to be written to obtain the relevant information, so that each website requires multiple dedicated crawlers. Moreover, a given web crawler can generally be adapted only to a narrow range of websites and data modules: it can acquire data only from a specific class of website or a specific data module and cannot be adapted broadly. Different web crawlers therefore have to be written for different types of website, which is cumbersome.
Summary of the invention
In view of this, one of the technical problems addressed by the embodiments of the present application is to provide a technical solution for acquiring web page data, so as to solve the prior-art problem that acquiring network data is cumbersome because an adapted web crawler must be developed separately for the pages of each website.
Embodiments of the present application provide a method for acquiring web page data, the method comprising: receiving a configuration file, sent by a terminal device, for configuring a web crawler; generating, based on the configuration file, a web crawler corresponding to the configuration file; and acquiring web page data through the web crawler corresponding to the configuration file.
Optionally, generating, based on the configuration file, the web crawler corresponding to the configuration file comprises: parsing the configuration file to obtain the configuration information in the configuration file; and assembling, based on the configuration information, a web crawler that matches the configuration information.
Optionally, assembling, based on the configuration information, the web crawler that matches the configuration information comprises: configuring a corresponding web page data request method for the web crawler based on the web page data request method in the configuration information; and/or configuring a corresponding web page data parsing mode for the web crawler based on the web page data parsing mode in the configuration information.
Optionally, assembling, based on the configuration information, the web crawler that matches the configuration information further comprises: configuring a corresponding pagination rule for the web crawler based on the pagination rule in the configuration information.
Optionally, acquiring web page data through the web crawler corresponding to the configuration file comprises: acquiring web page data by starting the web crawler on a schedule.
Optionally, acquiring web page data through the web crawler corresponding to the configuration file further comprises: acquiring the first item of web page data of the current page in the current acquisition; determining whether the content of that first item is identical to the content of the first item of web page data of the current page in the previous acquisition; and if not, continuing to acquire web page data of the current page in the current acquisition until the content of an acquired item is identical to the content of the first item of web page data of the current page in the previous acquisition.
Optionally, the method further comprises: upon determining that the content of the first item of web page data of the current page in the current acquisition differs from the content of the first item of web page data of the current page in the previous acquisition, storing the first item of web page data of the current page in the current acquisition.
Optionally, the method further comprises: when acquiring web page data, formatting the dates in the web page data into a predetermined format.
Optionally, the configuration information in the configuration file includes at least one of the following: site information, web page information type, and extraction rules for field information.
Optionally, the site information includes at least one of the following: site name, link address of the site's home page, web page data update frequency, web page data request method, web page data parsing mode, and pagination rule.
Embodiments of the present application further provide an apparatus for acquiring web page data, the apparatus comprising: a receiving module configured to receive a configuration file, sent by a terminal device, for configuring a web crawler; a generation module configured to generate, based on the configuration file, a web crawler corresponding to the configuration file; and a first acquisition module configured to acquire web page data through the web crawler corresponding to the configuration file.
Optionally, the generation module comprises: a parsing module configured to parse the configuration file to obtain the configuration information in the configuration file; and an assembly module configured to assemble, based on the configuration information, a web crawler that matches the configuration information.
Optionally, the assembly module comprises: a first configuration module configured to configure a corresponding web page data request method for the web crawler based on the web page data request method in the configuration information; and/or a second configuration module configured to configure a corresponding web page data parsing mode for the web crawler based on the web page data parsing mode in the configuration information.
Optionally, the assembly module further comprises: a third configuration module configured to configure a corresponding pagination rule for the web crawler based on the pagination rule in the configuration information.
Optionally, the first acquisition module comprises: a second acquisition module configured to acquire web page data by starting the web crawler on a schedule.
Optionally, the first acquisition module further comprises: a third acquisition module configured to acquire the first item of web page data of the current page in the current acquisition; a determining module configured to determine whether the content of that first item is identical to the content of the first item of web page data of the current page in the previous acquisition; and a fourth acquisition module configured to, if not, continue acquiring web page data of the current page in the current acquisition until the content of an acquired item is identical to the content of the first item of web page data of the current page in the previous acquisition.
Optionally, the apparatus further comprises: a storage module configured to store the first item of web page data of the current page in the current acquisition upon determining that its content differs from the content of the first item of web page data of the current page in the previous acquisition.
Optionally, the apparatus further comprises: a formatting module configured to format the dates in the web page data into a predetermined format when web page data is acquired.
Optionally, the configuration information in the configuration file includes at least one of the following: site information, web page information type, and extraction rules for field information.
Optionally, the site information includes at least one of the following: site name, link address of the site's home page, web page data update frequency, web page data request method, web page data parsing mode, and pagination rule.
Embodiments of the present application further provide an electronic device, comprising: one or more processors; and a storage device configured to store one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for acquiring web page data described above.
Embodiments of the present application further provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for acquiring web page data described above.
With the technical solution for acquiring web page data provided by the embodiments of the present application, a configuration file for configuring a web crawler is received from a terminal device, a web crawler corresponding to the configuration file is generated based on it, and web page data is acquired through that web crawler. Compared with existing approaches, because the web crawler is generated from the configuration file sent by the terminal device, there is no need to separately develop an adapted web crawler for the pages of each website; the crawler development task is removed, and web page data can be acquired quickly and conveniently.
Brief description of the drawings
Some specific embodiments of the present application are described in detail below, by way of example and not limitation, with reference to the accompanying drawings. The same reference numerals in the drawings denote the same or similar components or parts. Those skilled in the art should understand that the drawings are not necessarily drawn to scale. In the drawings:
Fig. 1 is a flow chart of the steps of a method for acquiring web page data according to Embodiment One of the present application;
Fig. 2 is a flow chart of the steps of a method for acquiring web page data according to Embodiment Two of the present application;
Fig. 3 is a structural block diagram of an apparatus for acquiring web page data according to Embodiment Three of the present application;
Fig. 4 is a structural block diagram of an apparatus for acquiring web page data according to Embodiment Four of the present application;
Fig. 5 is a structural block diagram of an apparatus for acquiring web page data according to Embodiment Five of the present application;
Fig. 6 is a structural schematic diagram of an electronic device according to Embodiment Six of the present application.
Detailed description
To help those skilled in the art better understand the technical solutions in the embodiments of the present application, these solutions are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present application shall fall within the scope of protection of the embodiments of the present application.
Embodiment One
Referring to Fig. 1, a flow chart of the steps of a method for acquiring web page data according to Embodiment One of the present application is shown.
The method for acquiring web page data of this embodiment includes the following steps:
In step S101, a configuration file for configuring a web crawler, sent by a terminal device, is received.
In the embodiments of the present application, the terminal device may include at least one of the following: an in-vehicle device, an entertainment device, an advertising device, a personal digital assistant (PDA), a tablet computer, a laptop, a handheld device, smart glasses, a smart watch, a wearable device, a virtual display device, or a display enhancement device (such as Google Glass, Oculus Rift, Hololens, Gear VR). The configuration information in the configuration file includes at least one of the following: site information, web page information type, and extraction rules for field information. The site information includes at least one of the following: site name, link address of the site's home page, web page data update frequency, web page data request method, web page data parsing mode, and pagination rule. It should be understood that the above description is only exemplary, and the embodiments of the present application impose no limitation in this respect.
In a specific example, the basic information of the website to be acquired is entered through a web page on the terminal device. Specifically, the basic website information entered by the user includes: the name of the person making the entry, site name, site identifier, link address of the site's home page, affiliated region, data source, and subject classification. The site name may be, for example, JD.com, Tmall, or Taobao. The site identifier may be defined by the user. The affiliated region is the region to which the website belongs, for example a municipality directly under the central government or a provincial capital. When the site name is JD.com, the data source may be JD.com and the subject classification may be, for example, JD.com daily necessities or electronic products. When the basic website information entered by the user through the web page on the terminal device is saved, a corresponding website channel is generated. At the same time, the web page data update frequency, web page data request method, web page data parsing mode, pagination rule, and so on are obtained according to the site name or home page link address entered by the user.
However, each website carries many kinds of information, which can be divided into different web page information types. After the corresponding website channel is generated, the user can also enter, through the web page on the terminal device, the basic information of the data module to be acquired. Specifically, the basic data module information entered by the user includes: the name of the person making the entry, sub-channel name, sub-channel identifier, and the link address or entry address of the sub-channel. The sub-channel name may be the name of a data module of the website, the sub-channel identifier may be the identifier of that data module, and the link address or entry address of the sub-channel may be the entry address of that data module. When the basic data module information entered by the user through the web page on the terminal device is saved, a corresponding configuration file in .yaml format is generated.
After the configuration file is generated, the user can download it through a page on the terminal device and configure, in the downloaded file, the specific web page data to be acquired from the data module, for example title, time, issue date, publication date, file download address, pagination, product price, and product style. To do so, the user configures in the configuration file the names of the fields to be acquired and the extraction rules for the field information. For example, if the user needs to acquire the price of umbrellas on JD.com, the extraction rule for the field information may be the user's description of how the umbrella price is extracted. After configuring the specific web page data to be acquired from the data module, the user uploads the completed configuration file to the terminal device. The terminal device saves the completed configuration file and sends it to the server, so that the server generates a web crawler corresponding to the configuration file. It should be understood that the above description is only exemplary, and the embodiments of the present application impose no limitation in this respect.
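As an illustration of the configuration information described above, the following minimal sketch models it with Python dataclasses. The patent does not disclose the layout of the .yaml file, so every name here (`SiteInfo`, `CrawlerConfig`, `parse_config`, and the dictionary keys) is an assumption, and the input is taken to be a dictionary already loaded from the file.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SiteInfo:
    """Site information; attribute names and defaults are hypothetical."""
    name: str
    home_url: str
    update_frequency: str = "daily"
    request_method: str = "get"
    parse_mode: str = "html"
    paging_rule: str = ""


@dataclass
class CrawlerConfig:
    """Configuration information: site info, information type, field rules."""
    site: SiteInfo
    info_type: str
    # Maps each field name to the extraction rule for that field.
    fields: Dict[str, str] = field(default_factory=dict)


def parse_config(raw: Dict[str, Any]) -> CrawlerConfig:
    """Build a CrawlerConfig from a dictionary loaded from the config file."""
    site = SiteInfo(**raw["site"])
    return CrawlerConfig(
        site=site,
        info_type=raw["info_type"],
        fields=dict(raw.get("fields", {})),
    )
```

For instance, `parse_config({"site": {"name": "Example", "home_url": "https://example.com"}, "info_type": "products", "fields": {"price": "span.price"}})` yields a configuration whose request method falls back to the assumed default of `"get"`.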
In step S102, a web crawler corresponding to the configuration file is generated based on the configuration file.
In some optional embodiments, the server receives the configuration file sent by the terminal device and directly generates the corresponding web crawler from it. In other optional embodiments, the server receives the configuration file sent by the terminal device and saves it in a database; after receiving a web page data acquisition request sent by the terminal device, the server retrieves the configuration file from the database and generates the corresponding web crawler from it. Here, a web crawler is a program or script that automatically fetches network data according to certain rules. It should be understood that the above description is only exemplary, and the embodiments of the present application impose no limitation in this respect.
In some optional embodiments, when the web crawler corresponding to the configuration file is generated, the configuration file is parsed to obtain the configuration information it contains, and a web crawler matching the configuration information is assembled based on that information. It should be understood that any implementation that generates a web crawler corresponding to the configuration file based on the configuration file is applicable here, and the embodiments of the present application impose no limitation in this respect.
In a specific example, the server holds various functional modules developed in advance, for example single-purpose functional modules and highly reusable functional modules. According to the configuration information in the configuration file sent by the terminal device, the server selects the functional modules required, adds the fields to be acquired and the extraction rules for the field information, and assembles the web crawler corresponding to the configuration file. It should be understood that the above description is only exemplary, and the embodiments of the present application impose no limitation in this respect.
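One way to picture the assembly from pre-developed modules is a lookup table keyed by the values found in the configuration. The sketch below is an assumption about how such a selection could work, not the patent's implementation; the registries `FETCHERS` and `PARSERS`, the `Crawler` class, and the configuration keys are all hypothetical, and the fetch modules are placeholders that only describe the request.

```python
from typing import Callable, Dict, List

# Hypothetical registries standing in for the pre-developed functional modules.
FETCHERS: Dict[str, Callable[[str], str]] = {
    "get": lambda url: f"GET {url}",    # placeholder: describes the request only
    "post": lambda url: f"POST {url}",
}
PARSERS: Dict[str, Callable[[str], List[str]]] = {
    "lines": lambda body: body.splitlines(),
    "words": lambda body: body.split(),
}


class Crawler:
    """A crawler assembled from one fetch module, one parse module, and field rules."""

    def __init__(self, fetch, parse, fields):
        self.fetch = fetch
        self.parse = parse
        self.fields = fields


def assemble_crawler(config: dict) -> Crawler:
    """Select the modules named in the configuration and attach the field rules."""
    method = config["request_method"]
    mode = config["parse_mode"]
    if method not in FETCHERS or mode not in PARSERS:
        raise ValueError(f"no pre-developed module for {method!r}/{mode!r}")
    return Crawler(FETCHERS[method], PARSERS[mode], dict(config.get("fields", {})))
```

Because the crawler is only a combination of registry entries plus field rules, supporting a new website in this sketch means writing a new configuration rather than new crawler code, which is the benefit the text describes.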
In some optional embodiments, when the web crawler matching the configuration information is assembled, a corresponding web page data request method is configured for the web crawler based on the web page data request method in the configuration information; and/or a corresponding web page data parsing mode is configured for the web crawler based on the web page data parsing mode in the configuration information. It should be understood that any implementation that assembles a web crawler matching the configuration information based on the configuration information is applicable here, and the embodiments of the present application impose no limitation in this respect.
In a specific example, web page data on the Internet is served by servers, and a server's data request methods may include the POST request method and the GET request method. To handle both, the server of the present application has two adaptation schemes developed in advance and invokes the appropriate one according to the request method, thereby acquiring the web page data. Specifically, when the website's web page data request method is POST, the POST request method is configured for the web crawler; when it is GET, the GET request method is configured. In addition, the web page data parsing mode also differs from website to website. To handle these cases, the server of the present application has multiple parsing modes developed in advance and configures the corresponding one for the web crawler according to the parsing mode required, thereby acquiring the web page data. It should be understood that the above description is only exemplary, and the embodiments of the present application impose no limitation in this respect.
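The choice between the GET and POST schemes can be sketched with the Python standard library. This is an illustrative assumption rather than the patent's code: the function name `build_request` and the form-encoded POST body are choices made for the example, and the request object is only constructed, not sent.

```python
from typing import Dict, Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url: str, method: str, params: Optional[Dict[str, object]] = None) -> Request:
    """Build a GET or POST request according to the configured request method."""
    method = method.lower()
    query = urlencode(params or {})
    if method == "get":
        # GET carries the parameters in the query string.
        full_url = f"{url}?{query}" if query else url
        return Request(full_url, method="GET")
    if method == "post":
        # POST carries the parameters as a form-encoded body.
        return Request(url, data=query.encode("utf-8"), method="POST")
    raise ValueError(f"unsupported request method: {method!r}")
```

A crawler would pass the returned object to `urllib.request.urlopen` to perform the request; that step is left out here so the sketch does not depend on network access.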
In step S103, web page data is acquired through the web crawler corresponding to the configuration file.
In some optional embodiments, when web page data is acquired, the dates in the web page data are formatted into a predetermined format. Date formats in website data vary widely; to unify them, date-formatting code is written that converts each acquired date into the prescribed format. In this way, the date formats in the web page data are unified. It should be understood that any implementation of acquiring web page data is applicable here, and the embodiments of the present application impose no limitation in this respect.
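Date unification of this kind can be sketched by trying a list of known input formats in turn. The formats listed and the target format `%Y-%m-%d` are assumptions for illustration; the patent does not say which formats its code handles.

```python
from datetime import datetime
from typing import Optional

# Input formats to try, in order. This list is an assumption for illustration.
KNOWN_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y.%m.%d", "%Y年%m月%d日")


def normalize_date(text: str, target: str = "%Y-%m-%d") -> Optional[str]:
    """Return the date in the target format, or None if no known format matches."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime(target)
        except ValueError:
            continue  # try the next known format
    return None
```

Returning `None` for an unrecognized date, rather than raising, lets the caller decide whether to keep the raw string or discard the record.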
With the method for acquiring web page data provided by the embodiments of the present application, a configuration file for configuring a web crawler is received from a terminal device, a web crawler corresponding to the configuration file is generated based on it, and web page data is acquired through that web crawler. Compared with existing approaches, because the web crawler is generated from the configuration file sent by the terminal device, there is no need to separately develop an adapted web crawler for the pages of each website; the crawler development task is removed, and web page data can be acquired quickly and conveniently.
The method for acquiring web page data of this embodiment can be executed by any suitable device with data processing capability, including but not limited to: a camera, a terminal, a mobile terminal, a PC, a server, an in-vehicle device, an entertainment device, an advertising device, a personal digital assistant (PDA), a tablet computer, a laptop, a handheld device, smart glasses, a smart watch, a wearable device, a virtual display device, or a display enhancement device (such as Google Glass, Oculus Rift, Hololens, Gear VR).
Embodiment Two
Referring to Fig. 2, a flow chart of the steps of a method for acquiring web page data according to Embodiment Two of the present application is shown.
The method for acquiring web page data of this embodiment includes the following steps:
In step S201, a configuration file for configuring a web crawler, sent by a terminal device, is received.
Since step S201 is similar to step S101 above, it is not described again here.
In step S202, a web crawler corresponding to the configuration file is generated based on the configuration file.
In some optional embodiments, when the web crawler corresponding to the configuration file is generated, the configuration file is parsed to obtain the configuration information it contains, and a web crawler matching the configuration information is assembled based on that information. It should be understood that any implementation that generates a web crawler corresponding to the configuration file based on the configuration file is applicable here, and the embodiments of the present application impose no limitation in this respect.
In some optional embodiments, when the web crawler matching the configuration information is assembled, a corresponding pagination rule is configured for the web crawler based on the pagination rule in the configuration information. It should be understood that any implementation that assembles a web crawler matching the configuration information based on the configuration information is applicable here, and the embodiments of the present application impose no limitation in this respect.
In a specific example, the pagination rule refers to the rule by which the link addresses of a website's pages change from one page to the next. This rule differs from website to website. To handle pagination, the server of the present application has multiple pagination rules developed in advance. According to the pagination rule in the configuration information sent by the terminal device, the matching pagination rule is selected automatically, which implements the page-turning function. It should be understood that the above description is only exemplary, and the embodiments of the present application impose no limitation in this respect.
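A pagination rule over link addresses can be sketched as a URL template with a page placeholder. The patent does not specify how its rules are expressed, so the `{page}` template, the function name `page_urls`, and its parameters are assumptions made for this example.

```python
from typing import Iterator


def page_urls(template: str, start: int = 1, step: int = 1, max_pages: int = 100) -> Iterator[str]:
    """Yield successive page URLs by filling the {page} placeholder in the template."""
    page = start
    for _ in range(max_pages):
        yield template.format(page=page)
        page += step


# Example: the first three listing pages of a hypothetical site.
first_three = list(page_urls("https://example.com/news?page={page}", max_pages=3))
```

Sites that page by offset instead of page number fit the same shape by changing `start` and `step`, for example `start=0, step=20`.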
In step S203, web page data is acquired by starting the web crawler on a schedule.
In some optional embodiments, the web crawler is started on a schedule, according to the web page data update frequency in the site information, to acquire the latest web page data. This achieves self-starting of the web crawler and automated acquisition of web page data. Specifically, the scheduled start time of the web crawler is determined from the web page data update frequency in the site information, and the web crawler is then started at that time to acquire the latest web page data. Alternatively, the user can configure the scheduled start time of the web crawler in the configuration file, which likewise achieves self-starting and automated acquisition. For example, the web crawler may be started once a day, once a week, or once a month. It should be understood that any implementation that acquires web page data by starting the web crawler on a schedule is applicable here, and the embodiments of the present application impose no limitation in this respect.
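Deriving the next scheduled start from an update frequency can be sketched as follows. The frequency labels, the `INTERVALS` table, and the approximation of a month as 30 days are all assumptions for illustration; the patent only gives daily, weekly, and monthly starts as examples.

```python
from datetime import datetime, timedelta

# Mapping from update frequency to start interval. The labels are hypothetical,
# and "monthly" is approximated as 30 days for simplicity.
INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
}


def next_start(last_start: datetime, frequency: str) -> datetime:
    """Return the next scheduled start time for the given update frequency."""
    if frequency not in INTERVALS:
        raise ValueError(f"unknown update frequency: {frequency!r}")
    return last_start + INTERVALS[frequency]
```

A long-running service would compare the current time against `next_start(...)` and launch the crawler once it has passed; a production deployment might instead delegate this to cron or a job scheduler.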
In some optional embodiments, by web crawlers corresponding with the configuration file, when acquiring web data, Current web page is acquired when first web data in time acquisition;Determine current web page when first web data in time acquisition First web data in previous acquisition of content and current web page content it is whether identical;If it is not, then continuing current web page When time collecting webpage data, until the current web page that collects when in time acquisition when the content of web data with Until the content of first web data of the current web page in previous acquisition is identical.Take this, can be realized the increment of web data Acquisition, and then reduce the pressure of the server of collected website, and avoid the repeated acquisition of web data.It is understood that It is that any by web crawlers corresponding with the configuration file, the embodiment for acquiring web data may be applicable to this, this Application embodiment does not do any restriction to this.
In a specific example, determine current web page when time acquisition in first web data content with work as When the content of first web data of the preceding webpage in previous acquisition is identical, it is determined that current web page does not have web data update, Terminate when time collecting webpage data.Determining current web page in the content and current web page when first web data in time acquisition When the content of the first web data in previous acquisition is not identical, storage current web page is when first webpage number in time acquisition According to the foundation compared as next time.Why operate in this way, is because the webpage more new data of website is always shown in webpage Foremost.After collecting web data, collected web data is saved in database, realizes persistent storage. It is understood that above description is exemplary only, the embodiment of the present application does not do any restriction to this.
In a specific example, to avoid acquiring all web data again on every acquisition, the present application implements an incremental crawler. On the first acquisition, the first item of data of the website is recorded and saved in a database. When the scheduled start time arrives, the web crawler is started again and compares the previously stored first item of data with the first item of data currently displayed on the web page. If the two are the same, it is considered that no data has been updated; nothing is acquired, the current acquisition ends, and the web crawler waits for its next scheduled start. If the two differ, data has been updated and new data has been published, so acquisition starts; when the acquired data is the same as the previously stored first item of data, acquisition stops, and the first item of data of the web page is stored as the basis for the next comparison. It can be understood that the above description is exemplary only, and the embodiments of the present application place no limitation on this.
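As a non-limiting illustration, the incremental acquisition described above can be sketched as follows. The function name `incremental_collect`, the use of plain strings as web data items, and the in-memory marker are assumptions made for this sketch; the application itself stores the marker in a database and does not prescribe any particular code.

```python
from typing import Iterable, List, Optional, Tuple


def incremental_collect(
    page_items: Iterable[str], previous_first: Optional[str]
) -> Tuple[List[str], Optional[str]]:
    """Collect items in page order until the stored first item is reached.

    page_items: items in the order the page displays them (newest first).
    previous_first: first item stored by the previous acquisition,
        or None if this is the first acquisition.
    Returns (newly collected items, first item to store for next time).
    """
    collected: List[str] = []
    for item in page_items:
        if item == previous_first:
            break  # reached data that the previous acquisition already saw
        collected.append(item)
    # The new marker is the first item seen in this acquisition;
    # keep the old marker when nothing new was published.
    new_marker = collected[0] if collected else previous_first
    return collected, new_marker


# Hypothetical page content, newest first; "b" was the first item last time.
items, marker = incremental_collect(["d", "c", "b", "a"], previous_first="b")
```

Here only the two items published since the previous acquisition are collected, and the stored marker advances to the newest item. The sketch relies on the same premise as the text: new data always appears at the top of the page.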
With the acquisition method of web data provided by the embodiments of the present application, a configuration file for configuring a web crawler sent by a terminal device is received, a web crawler corresponding to the configuration file is generated based on the configuration file, and web data is then acquired by starting the web crawler on a schedule. Compared with other existing approaches, generating the web crawler from the configuration file sent by the terminal device means that there is no need to separately develop an adapted web crawler for the web pages of each website; the task of developing web crawlers is eliminated, and web data can be acquired quickly and conveniently. In addition, starting the web crawler on a schedule enables automated acquisition of web data.
The acquisition method of web data of this embodiment may be executed by any suitable device with data processing capability, including but not limited to: a camera, a terminal, a mobile terminal, a PC, a server, an in-vehicle device, an entertainment device, an advertising device, a personal digital assistant (PDA), a tablet computer, a laptop, a handheld device, smart glasses, a smart watch, a wearable device, a virtual display device, or a display enhancement device (such as Google Glass, Oculus Rift, Hololens, Gear VR).
Embodiment three
Referring to Fig. 3, a structural block diagram of an acquisition device of web data according to Embodiment three of the present application is shown.
The acquisition device of web data of this embodiment includes: a receiving module 301, configured to receive a configuration file for configuring a web crawler sent by a terminal device; a generation module 302, configured to generate, based on the configuration file, a web crawler corresponding to the configuration file; and a first acquisition module 303, configured to acquire web data through the web crawler corresponding to the configuration file.
The acquisition device of web data of this embodiment is used to implement the corresponding acquisition method of web data in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Embodiment four
Referring to Fig. 4, a structural block diagram of an acquisition device of web data according to Embodiment four of the present application is shown.
The acquisition device of web data of this embodiment includes: a receiving module 401, configured to receive a configuration file for configuring a web crawler sent by a terminal device; a generation module 402, configured to generate, based on the configuration file, a web crawler corresponding to the configuration file; and a first acquisition module 403, configured to acquire web data through the web crawler corresponding to the configuration file.
Optionally, the generation module 402 includes: a parsing module 4021, configured to parse the configuration file to obtain the configuration information in the configuration file; and an assembling module 4022, configured to assemble, based on the configuration information, a web crawler matching the configuration information.
Optionally, the assembling module 4022 includes: a first configuration module 4023, configured to configure a corresponding web data request method for the web crawler based on the web data request method in the configuration information; and/or a second configuration module 4024, configured to configure a corresponding web data parsing mode for the web crawler based on the web data parsing mode in the configuration information.
Optionally, the assembling module 4022 further includes: a third configuration module 4025, configured to configure a corresponding web page pagination rule for the web crawler based on the web page pagination rule in the configuration information.
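As a non-limiting illustration, assembling a web crawler from the request method, parsing mode, and page-turning (pagination) rule in the configuration information can be sketched as follows. All names here (`Crawler`, `assemble_crawler`, `PARSERS`, the dictionary keys, and the `{n}` placeholder in the pagination rule) are hypothetical, and the fetcher is a stub that reads from an in-memory dictionary rather than the network.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Fetcher = Callable[[str], str]        # URL -> response body
Parser = Callable[[str], List[str]]   # response body -> extracted items


def parse_lines(body: str) -> List[str]:
    """Treat every non-empty line of the body as one item."""
    return [line.strip() for line in body.splitlines() if line.strip()]


def parse_csv(body: str) -> List[str]:
    """Treat every non-empty comma-separated cell as one item."""
    return [cell.strip() for cell in body.split(",") if cell.strip()]


# Registry of parsing modes that the configuration may select (assumed names).
PARSERS: Dict[str, Parser] = {"lines": parse_lines, "csv": parse_csv}


@dataclass
class Crawler:
    fetch: Fetcher
    parse: Parser
    page_url: Callable[[int], str]

    def collect(self, pages: int) -> List[str]:
        """Fetch and parse pages 1..pages in order."""
        items: List[str] = []
        for n in range(1, pages + 1):
            items.extend(self.parse(self.fetch(self.page_url(n))))
        return items


def assemble_crawler(info: Dict[str, str], fetchers: Dict[str, Fetcher]) -> Crawler:
    """Pick the request method, parsing mode and pagination rule from the config."""
    template = info["home_url"] + info["paging_rule"]
    return Crawler(
        fetch=fetchers[info["request_method"]],
        parse=PARSERS[info["parse_mode"]],
        page_url=lambda n: template.format(n=n),
    )


# Stub pages standing in for a real website (hypothetical URLs).
fake_pages = {
    "https://example.com/list?page=1": "a\nb\n",
    "https://example.com/list?page=2": "c\n",
}
crawler = assemble_crawler(
    {
        "request_method": "GET",
        "parse_mode": "lines",
        "home_url": "https://example.com/list",
        "paging_rule": "?page={n}",
    },
    {"GET": lambda url: fake_pages[url]},
)
```

Selecting behaviour from registries keyed by configuration values is one way to avoid writing a new crawler per website; other designs are equally compatible with the description.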
Optionally, the device further includes: a formatting module 404, configured to format the dates in the web data into a predetermined format when web data is acquired.
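As a non-limiting illustration, formatting dates into a predetermined format can be sketched as follows. The list of candidate input formats and the choice of `%Y-%m-%d` as the predetermined format are assumptions of this sketch; the application does not specify either.

```python
from datetime import datetime
from typing import Optional

# Input formats a site might use (assumed for illustration).
CANDIDATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y.%m.%d", "%Y年%m月%d日")
# The predetermined output format (assumed for illustration).
TARGET_FORMAT = "%Y-%m-%d"


def normalize_date(raw: str) -> Optional[str]:
    """Return the date in TARGET_FORMAT, or None if no candidate format matches."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime(TARGET_FORMAT)
        except ValueError:
            continue  # try the next candidate format
    return None
```

Returning `None` for unrecognized input leaves the decision about malformed dates (skip, log, or keep the raw value) to the caller.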
Optionally, the configuration information in the configuration file includes at least one of the following: site information, web page information type, and extraction rules for field information.
Optionally, the site information includes at least one of the following: site name, link address of the website home page, web data update frequency, web data request method, web data parsing mode, and web page pagination rule.
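As a non-limiting illustration, a configuration file carrying the items listed above could be parsed as follows. The JSON encoding, the key names, the example values, and the classes `SiteInfo` and `CrawlerConfig` are all assumptions of this sketch; the application does not state the file format.

```python
import json
from dataclasses import dataclass
from typing import Dict


@dataclass
class SiteInfo:
    name: str                      # site name
    home_url: str                  # link address of the website home page
    update_frequency_minutes: int  # web data update frequency
    request_method: str            # web data request method
    parse_mode: str                # web data parsing mode
    paging_rule: str               # web page page-turning (pagination) rule


@dataclass
class CrawlerConfig:
    site: SiteInfo
    info_type: str                 # web page information type
    field_rules: Dict[str, str]    # extraction rule per field


def parse_config(text: str) -> CrawlerConfig:
    """Parse the configuration file text into typed configuration information."""
    raw = json.loads(text)
    return CrawlerConfig(
        site=SiteInfo(**raw["site"]),
        info_type=raw["info_type"],
        field_rules=dict(raw["field_rules"]),
    )


# Hypothetical configuration file content.
EXAMPLE_CONFIG = """
{
  "site": {
    "name": "Example Site",
    "home_url": "https://example.com/news",
    "update_frequency_minutes": 60,
    "request_method": "GET",
    "parse_mode": "html",
    "paging_rule": "?page={n}"
  },
  "info_type": "announcement",
  "field_rules": {"title": "h1", "date": "span.date"}
}
"""

config = parse_config(EXAMPLE_CONFIG)
```

A missing or misspelled key raises an error at parse time, so a faulty configuration file is rejected before any crawler is generated from it.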
The acquisition device of web data of this embodiment is used to implement the corresponding acquisition method of web data in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Embodiment five
Referring to Fig. 5, a structural block diagram of an acquisition device of web data according to Embodiment five of the present application is shown.
The acquisition device of web data of this embodiment includes: a receiving module 501, configured to receive a configuration file for configuring a web crawler sent by a terminal device; a generation module 502, configured to generate, based on the configuration file, a web crawler corresponding to the configuration file; and a first acquisition module 503, configured to acquire web data through the web crawler corresponding to the configuration file.
Optionally, the first acquisition module 503 includes: a second acquisition module 5031, configured to acquire web data by starting the web crawler on a schedule.
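As a non-limiting illustration, starting the web crawler on a schedule can be sketched with Python's standard `sched` module. The helper `schedule_runs`, the one-hour interval, and the `FakeClock` class are assumptions of this sketch; the fake clock only makes the example finish instantly, and a real deployment would pass `time.time` and `time.sleep` instead.

```python
import sched
from typing import Callable, List


def schedule_runs(
    scheduler: sched.scheduler, interval: float, runs: int, job: Callable[[], None]
) -> None:
    """Queue `runs` executions of `job`, spaced `interval` seconds apart."""
    for i in range(1, runs + 1):
        scheduler.enter(interval * i, 1, job)


class FakeClock:
    """Simulated clock so the example does not actually wait."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


clock = FakeClock()
scheduler = sched.scheduler(clock.time, clock.sleep)

started_at: List[float] = []
# The job stands in for "start the web crawler and acquire web data".
schedule_runs(scheduler, 3600, 3, lambda: started_at.append(clock.time()))
scheduler.run()  # runs all three queued starts against the simulated clock
```

In this sketch the interval would naturally come from the web data update frequency in the site information, although the application does not state how the schedule is derived.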
Optionally, the first acquisition module 503 further includes: a third acquisition module 5032, configured to acquire the first item of web data of the current web page in the current acquisition; a determining module 5033, configured to determine whether the content of the first item of web data of the current web page in the current acquisition is the same as the content of the first item of web data of the current web page in the previous acquisition; and a fourth acquisition module 5034, configured to, if not, continue acquiring the web data of the current web page in the current acquisition until the content of the web data acquired in the current acquisition is the same as the content of the first item of web data of the current web page in the previous acquisition.
Optionally, the device further includes: a storage module 5035, configured to store the first item of web data of the current web page in the current acquisition when it is determined that the content of the first item of web data of the current web page in the current acquisition is not the same as the content of the first item of web data of the current web page in the previous acquisition.
The acquisition device of web data of this embodiment is used to implement the corresponding acquisition method of web data in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Embodiment six
Referring to Fig. 6, a structural schematic diagram of electronic equipment according to Embodiment six of the present application is shown. The specific embodiments of the present application do not limit the specific implementation of the electronic equipment.
As shown in Fig. 6, the electronic equipment may include a processor 601 and a storage device 602.
Wherein:
The processor 601 is configured to execute a program 603, and may specifically perform the relevant steps in the above embodiments of the acquisition method of web data.
Specifically, the program 603 may include program code, and the program code includes computer operation instructions.
The processor 601 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The one or more processors included in the electronic equipment may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The storage device 602 is configured to store one or more programs 603. The storage device 602 may include high-speed RAM, and may also include non-volatile memory, for example at least one magnetic disk storage.
The program 603 may specifically be used to cause the processor 601 to perform the following operations: receiving a configuration file for configuring a web crawler sent by a terminal device; generating, based on the configuration file, a web crawler corresponding to the configuration file; and acquiring web data through the web crawler corresponding to the configuration file.
In an optional implementation, the program 603 is further used to cause the processor 601, when generating the web crawler corresponding to the configuration file based on the configuration file, to parse the configuration file to obtain the configuration information in the configuration file, and to assemble, based on the configuration information, a web crawler matching the configuration information.
In an optional implementation, the program 603 is further used to cause the processor 601, when assembling the web crawler matching the configuration information based on the configuration information, to configure a corresponding web data request method for the web crawler based on the web data request method in the configuration information; and/or to configure a corresponding web data parsing mode for the web crawler based on the web data parsing mode in the configuration information.
In an optional implementation, the program 603 is further used to cause the processor 601, when assembling the web crawler matching the configuration information based on the configuration information, to configure a corresponding web page pagination rule for the web crawler based on the web page pagination rule in the configuration information.
In an optional implementation, the program 603 is further used to cause the processor 601, when acquiring web data through the web crawler corresponding to the configuration file, to acquire web data by starting the web crawler on a schedule.
In an optional implementation, the program 603 is further used to cause the processor 601, when acquiring web data through the web crawler corresponding to the configuration file, to acquire the first item of web data of the current web page in the current acquisition; to determine whether the content of the first item of web data of the current web page in the current acquisition is the same as the content of the first item of web data of the current web page in the previous acquisition; and, if not, to continue acquiring the web data of the current web page in the current acquisition until the content of the web data acquired in the current acquisition is the same as the content of the first item of web data of the current web page in the previous acquisition.
In an optional implementation, the program 603 is further used to cause the processor 601 to store the first item of web data of the current web page in the current acquisition when it is determined that the content of the first item of web data of the current web page in the current acquisition is not the same as the content of the first item of web data of the current web page in the previous acquisition.
In an optional implementation, the program 603 is further used to cause the processor 601, when acquiring web data, to format the dates in the web data into a predetermined format.
In an optional implementation, the configuration information in the configuration file includes at least one of the following: site information, web page information type, and extraction rules for field information.
In an optional implementation, the site information includes at least one of the following: site name, link address of the website home page, web data update frequency, web data request method, web data parsing mode, and web page pagination rule.
For the specific implementation of each step in the program 603, reference may be made to the corresponding descriptions of the corresponding steps and units in the above embodiments of the acquisition method of web data, which are not repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices and modules described above may refer to the corresponding process descriptions in the foregoing method embodiments, which are not repeated here.
With the electronic equipment of this embodiment, a configuration file for configuring a web crawler sent by a terminal device is received, a web crawler corresponding to the configuration file is generated based on the configuration file, and web data is then acquired through the web crawler corresponding to the configuration file. Compared with other existing approaches, generating the web crawler from the configuration file sent by the terminal device means that there is no need to separately develop an adapted web crawler for the web pages of each website; the task of developing web crawlers is eliminated, and web data can be acquired quickly and conveniently.
It should be noted that, according to implementation needs, each component/step described in the embodiments of the present application may be split into more components/steps, and two or more components/steps or partial operations of components/steps may be combined into new components/steps, so as to achieve the purpose of the embodiments of the present application.
In particular, according to the embodiments of the present application, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, the embodiments of the present application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the methods shown in the method embodiments above. In such embodiments, the computer program may be downloaded and installed from a network through a communication part, and/or installed from a removable medium. When the computer program is executed by a central processing unit (CPU), the above functions defined in the methods of the embodiments of the present application are performed. It should be noted that the computer-readable medium described in the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device. In the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and that computer-readable medium may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted over any suitable medium, including but not limited to: wireless, electric wire, optical cable, RF, etc., or any suitable combination of the above.
Computer program code for performing the operations of the present application may be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages, such as Java, Smalltalk, and C++, as well as conventional procedural programming languages, such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to the various embodiments of the present application. In this regard, each box in a flowchart or block diagram may represent a module, a program segment, or a part of code, and the module, program segment, or part of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two boxes shown in succession may actually be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application may be implemented in software or in hardware. The described units may also be provided in a processor; for example, it may be described as: a processor including an acquiring unit, a processing unit, and a test unit. The names of these units do not, under certain circumstances, constitute a limitation on the units themselves; for example, the acquiring unit may also be described as "a unit that obtains a business test file of a target object to be tested according to a test instruction".
In another aspect, the present application further provides a computer-readable storage medium on which a computer program is stored, and the method described in any of the above embodiments is implemented when the program is executed by a processor.
In another aspect, the present application further provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to: receive a configuration file for configuring a web crawler sent by a terminal device; generate, based on the configuration file, a web crawler corresponding to the configuration file; and acquire web data through the web crawler corresponding to the configuration file.
The above description is only a preferred embodiment of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present application.

Claims (22)

1. An acquisition method of web data, characterized in that the method comprises:
receiving a configuration file for configuring a web crawler sent by a terminal device;
generating, based on the configuration file, a web crawler corresponding to the configuration file;
acquiring web data through the web crawler corresponding to the configuration file.
2. The method according to claim 1, characterized in that the generating, based on the configuration file, a web crawler corresponding to the configuration file comprises:
parsing the configuration file to obtain the configuration information in the configuration file;
assembling, based on the configuration information, a web crawler matching the configuration information.
3. The method according to claim 2, characterized in that the assembling, based on the configuration information, a web crawler matching the configuration information comprises:
configuring a corresponding web data request method for the web crawler based on the web data request method in the configuration information; and/or
configuring a corresponding web data parsing mode for the web crawler based on the web data parsing mode in the configuration information.
4. The method according to claim 2, characterized in that the assembling, based on the configuration information, a web crawler matching the configuration information further comprises:
configuring a corresponding web page pagination rule for the web crawler based on the web page pagination rule in the configuration information.
5. The method according to claim 1, characterized in that the acquiring web data through the web crawler corresponding to the configuration file comprises:
acquiring web data by starting the web crawler on a schedule.
6. The method according to claim 1, characterized in that the acquiring web data through the web crawler corresponding to the configuration file further comprises:
acquiring the first item of web data of the current web page in the current acquisition;
determining whether the content of the first item of web data of the current web page in the current acquisition is the same as the content of the first item of web data of the current web page in the previous acquisition;
if not, continuing to acquire the web data of the current web page in the current acquisition until the content of the web data acquired in the current acquisition is the same as the content of the first item of web data of the current web page in the previous acquisition.
7. The method according to claim 6, characterized in that the method further comprises:
storing the first item of web data of the current web page in the current acquisition when it is determined that the content of the first item of web data of the current web page in the current acquisition is not the same as the content of the first item of web data of the current web page in the previous acquisition.
8. The method according to claim 1, characterized in that the method further comprises:
formatting the dates in the web data into a predetermined format when web data is acquired.
9. The method according to any one of claims 1-8, characterized in that the configuration information in the configuration file comprises at least one of the following:
site information, web page information type, and extraction rules for field information.
10. The method according to claim 9, characterized in that the site information comprises at least one of the following:
site name, link address of the website home page, web data update frequency, web data request method, web data parsing mode, and web page pagination rule.
11. An acquisition device of web data, characterized in that the device comprises:
a receiving module, configured to receive a configuration file for configuring a web crawler sent by a terminal device;
a generation module, configured to generate, based on the configuration file, a web crawler corresponding to the configuration file;
a first acquisition module, configured to acquire web data through the web crawler corresponding to the configuration file.
12. The device according to claim 11, characterized in that the generation module comprises:
a parsing module, configured to parse the configuration file to obtain the configuration information in the configuration file;
an assembling module, configured to assemble, based on the configuration information, a web crawler matching the configuration information.
13. The device according to claim 12, characterized in that the assembling module comprises:
a first configuration module, configured to configure a corresponding web data request method for the web crawler based on the web data request method in the configuration information; and/or
a second configuration module, configured to configure a corresponding web data parsing mode for the web crawler based on the web data parsing mode in the configuration information.
14. The device according to claim 12, characterized in that the assembling module further comprises:
a third configuration module, configured to configure a corresponding web page pagination rule for the web crawler based on the web page pagination rule in the configuration information.
15. The device according to claim 11, characterized in that the first acquisition module comprises:
a second acquisition module, configured to acquire web data by starting the web crawler on a schedule.
16. The device according to claim 11, characterized in that the first acquisition module further comprises:
a third acquisition module, configured to acquire the first item of web data of the current web page in the current acquisition;
a determining module, configured to determine whether the content of the first item of web data of the current web page in the current acquisition is the same as the content of the first item of web data of the current web page in the previous acquisition;
a fourth acquisition module, configured to, if not, continue acquiring the web data of the current web page in the current acquisition until the content of the web data acquired in the current acquisition is the same as the content of the first item of web data of the current web page in the previous acquisition.
17. The device according to claim 16, characterized in that the device further comprises:
a storage module, configured to store the first item of web data of the current web page in the current acquisition when it is determined that the content of the first item of web data of the current web page in the current acquisition is not the same as the content of the first item of web data of the current web page in the previous acquisition.
18. The device according to claim 11, characterized in that the device further comprises:
a formatting module, configured to format the dates in the web data into a predetermined format when web data is acquired.
19. The device according to any one of claims 11-18, characterized in that the configuration information in the configuration file comprises at least one of the following:
site information, web page information type, and extraction rules for field information.
20. The device according to claim 19, characterized in that the site information comprises at least one of the following:
site name, link address of the website home page, web data update frequency, web data request method, web data parsing mode, and web page pagination rule.
21. Electronic equipment, comprising:
one or more processors;
a storage device, configured to store one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the acquisition method of web data according to any one of claims 1-10.
22. A computer-readable storage medium on which a computer program is stored, wherein the acquisition method of web data according to any one of claims 1-10 is implemented when the program is executed by a processor.
CN201811033910.4A 2018-09-05 2018-09-05 Acquisition method, device, electronic equipment and the storage medium of web data Pending CN110245278A (en)

Publications (1)

Publication Number Publication Date
CN110245278A true CN110245278A (en) 2019-09-17

Family

ID=67882858

Country Status (1)

Country Link
CN (1) CN110245278A (en)

CN113987318B (en) * 2021-11-01 2024-03-12 盐城天眼察微科技有限公司 Page monitoring method, device, equipment and computer storage medium
CN114428635A (en) * 2022-04-06 2022-05-03 杭州未名信科科技有限公司 Data acquisition method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110245278A (en) Acquisition method, device, electronic equipment and the storage medium of web data
US10698960B2 (en) Content validation and coding for search engine optimization
CN107832468B (en) Demand recognition methods and device
CN101971172B (en) Mobile sitemaps
CN107105031A (en) Information-pushing method and device
CN107729319A (en) Method and apparatus for output information
CN103020207B (en) Browser label page grouping management method and device
CN106933722A (en) A kind of web application monitoring method, server and system
CN108572990A (en) Information-pushing method and device
CN107885777A (en) A kind of control method and system of the crawl web data based on collaborative reptile
CN107609890A (en) A kind of method and apparatus of order tracking
CN108287927B (en) For obtaining the method and device of information
CN107145556B (en) Universal distributed acquisition system
CN110275963A (en) Method and apparatus for output information
CN105824887A (en) Mobile electronic investigation system and method based on intelligent questionnaire generative design
CN109871311A (en) A kind of method and apparatus for recommending test case
CN107391675A (en) Method and apparatus for generating structure information
CN108121814A (en) Search results ranking model generating method and device
CN106407377A (en) Search method and device based on artificial intelligence
CN106354856A (en) Enhanced deep neural network search method and device based on artificial intelligence
CN103514189A (en) Implementing method for web crawler based on search engines
US20190347068A1 (en) Personal history recall
CN106951495A (en) Method and apparatus for information to be presented
CN107729508A (en) Information crawler method and apparatus
CN108574669A (en) User behavior tree constructing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190917