CN110245278A - Method, device, electronic equipment, and storage medium for acquiring web data - Google Patents
- Publication number
- CN110245278A (CN201811033910.4A)
- Authority
- CN
- China
- Prior art keywords
- web
- web data
- acquisition
- data
- configuration file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The embodiments of the present application provide a method, a device, electronic equipment, and a storage medium for acquiring web data, and relate to the field of data acquisition. The method for acquiring web data includes: receiving a configuration file, sent by a terminal device, for configuring a web crawler; generating, based on the configuration file, a web crawler corresponding to the configuration file; and acquiring web data by means of the web crawler corresponding to the configuration file. With the embodiments of the present application, there is no need to separately develop an adapted web crawler for the web pages of each website; the task of developing web crawlers is eliminated, and web data can be acquired quickly and conveniently.
Description
Technical field
The embodiments of the present application relate to the field of data acquisition, and in particular to a method, a device, electronic equipment, and a computer-readable storage medium for acquiring web data.
Background art
The rapid development of the Internet has brought exponential growth in online information. With such abundant information resources, the need to obtain relevant information both quickly and in a targeted way gave rise to search engines. A search engine is a system in which dedicated computer programs automatically gather information from the Internet according to a certain strategy and, after organizing and processing that information, provide a retrieval service that presents the information relevant to a user's query to the user. To collect information from the Internet, a search engine relies on web crawlers to crawl the information of relevant websites. A web crawler is a program that automatically browses web pages and analyzes their content, and it is an important component of a search engine. In addition, the Scrapy framework can be used as the basis of a web crawler, and a custom web crawler can be built on top of it.
In the prior art, a web crawler mostly obtains relevant online information by crawling the web pages of a website. Starting from the links of one or several initial pages, the crawler stores the web page links it visits in a database or in local storage, and different web crawlers are then used to process the different types of web page links. As is well known, the pages of a website are mostly organized as a tree: starting from the home page, one can jump level by level through many levels of pages. As a result, several web crawlers for different levels have to be written to obtain the relevant information, so that multiple dedicated web crawlers must be written for each website. Moreover, the websites to which a given web crawler can be adapted are generally limited, and so are the data modules: a crawler can acquire data only from a certain type of website or from a certain specific data module and cannot be adapted on a large scale, so different web crawlers have to be written for different types of websites, which is cumbersome.
Summary of the invention
In view of this, one of the technical problems solved by the embodiments of the present application is to provide a technical solution for acquiring web data, so as to solve the prior-art problem that web data acquisition is cumbersome because an adapted web crawler has to be developed separately for the web pages of each website.
The embodiments of the present application provide a method for acquiring web data, the method including: receiving a configuration file, sent by a terminal device, for configuring a web crawler; generating, based on the configuration file, a web crawler corresponding to the configuration file; and acquiring web data by means of the web crawler corresponding to the configuration file.
Optionally, generating, based on the configuration file, the web crawler corresponding to the configuration file includes: parsing the configuration file to obtain the configuration information in the configuration file; and assembling, based on the configuration information, a web crawler that matches the configuration information.
Optionally, assembling, based on the configuration information, the web crawler that matches the configuration information includes: configuring a corresponding web data request method for the web crawler based on the web data request method in the configuration information; and/or configuring a corresponding web data parsing mode for the web crawler based on the web data parsing mode in the configuration information.
Optionally, assembling, based on the configuration information, the web crawler that matches the configuration information further includes: configuring a corresponding page-turning rule for the web crawler based on the page-turning rule in the configuration information.
Optionally, acquiring web data by means of the web crawler corresponding to the configuration file includes: acquiring web data by starting the web crawler on a schedule.
Optionally, acquiring web data by means of the web crawler corresponding to the configuration file further includes: acquiring the first item of web data of the current web page in the current acquisition; determining whether the content of the first item of web data of the current web page in the current acquisition is identical to the content of the first item of web data of the current web page in the previous acquisition; and if not, continuing to acquire web data of the current web page in the current acquisition until the content of the web data acquired in the current acquisition is identical to the content of the first item of web data of the current web page in the previous acquisition.
Optionally, the method further includes: when it is determined that the content of the first item of web data of the current web page in the current acquisition is not identical to the content of the first item of web data of the current web page in the previous acquisition, storing the first item of web data of the current web page in the current acquisition.
Optionally, the method further includes: when acquiring web data, formatting the dates in the web data into a predetermined format.
Optionally, the configuration information in the configuration file includes at least one of the following: site information, a web page information type, and extraction rules for field information.
Optionally, the site information includes at least one of the following: a site name, the link address of the site's home page, a web data update frequency, a web data request method, a web data parsing mode, and a page-turning rule.
The embodiments of the present application further provide a device for acquiring web data, the device including: a receiving module configured to receive a configuration file, sent by a terminal device, for configuring a web crawler; a generation module configured to generate, based on the configuration file, a web crawler corresponding to the configuration file; and a first acquisition module configured to acquire web data by means of the web crawler corresponding to the configuration file.
Optionally, the generation module includes: a parsing module configured to parse the configuration file to obtain the configuration information in the configuration file; and an assembling module configured to assemble, based on the configuration information, a web crawler that matches the configuration information.
Optionally, the assembling module includes: a first configuration module configured to configure a corresponding web data request method for the web crawler based on the web data request method in the configuration information; and/or a second configuration module configured to configure a corresponding web data parsing mode for the web crawler based on the web data parsing mode in the configuration information.
Optionally, the assembling module further includes: a third configuration module configured to configure a corresponding page-turning rule for the web crawler based on the page-turning rule in the configuration information.
Optionally, the first acquisition module includes: a second acquisition module configured to acquire web data by starting the web crawler on a schedule.
Optionally, the first acquisition module further includes: a third acquisition module configured to acquire the first item of web data of the current web page in the current acquisition; a determining module configured to determine whether the content of the first item of web data of the current web page in the current acquisition is identical to the content of the first item of web data of the current web page in the previous acquisition; and a fourth acquisition module configured to, if not, continue to acquire web data of the current web page in the current acquisition until the content of the web data acquired in the current acquisition is identical to the content of the first item of web data of the current web page in the previous acquisition.
Optionally, the device further includes: a storage module configured to store the first item of web data of the current web page in the current acquisition when it is determined that its content is not identical to the content of the first item of web data of the current web page in the previous acquisition.
Optionally, the device further includes: a formatting module configured to format the dates in the web data into a predetermined format when web data is acquired.
Optionally, the configuration information in the configuration file includes at least one of the following: site information, a web page information type, and extraction rules for field information.
Optionally, the site information includes at least one of the following: a site name, the link address of the site's home page, a web data update frequency, a web data request method, a web data parsing mode, and a page-turning rule.
The embodiments of the present application further provide electronic equipment, including: one or more processors; and a storage device configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for acquiring web data described above.
The embodiments of the present application further provide a computer-readable storage medium on which a computer program is stored, the program implementing the method for acquiring web data described above when executed by a processor.
With the technical solution for acquiring web data provided by the embodiments of the present application, a configuration file for configuring a web crawler, sent by a terminal device, is received; a web crawler corresponding to the configuration file is generated based on the configuration file; and web data is then acquired by means of that web crawler. Compared with other existing approaches, generating the web crawler from the configuration file sent by the terminal device means that there is no need to separately develop an adapted web crawler for the web pages of each website; the task of developing web crawlers is eliminated, and web data can be acquired quickly and conveniently.
Brief description of the drawings
Some specific embodiments of the present application are described in detail below, by way of example and not limitation, with reference to the accompanying drawings. The same reference numerals in the drawings denote the same or similar components or parts. Those skilled in the art should understand that the drawings are not necessarily drawn to scale. In the drawings:
Fig. 1 is a flow chart of the steps of a method for acquiring web data according to Embodiment One of the present application;
Fig. 2 is a flow chart of the steps of a method for acquiring web data according to Embodiment Two of the present application;
Fig. 3 is a structural block diagram of a device for acquiring web data according to Embodiment Three of the present application;
Fig. 4 is a structural block diagram of a device for acquiring web data according to Embodiment Four of the present application;
Fig. 5 is a structural block diagram of a device for acquiring web data according to Embodiment Five of the present application;
Fig. 6 is a structural schematic diagram of electronic equipment according to Embodiment Six of the present application.
Detailed description of the embodiments
In order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present application, the technical solutions are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present application shall fall within the scope of protection of the embodiments of the present application.
Embodiment one
Referring to Fig. 1, a flow chart of the steps of a method for acquiring web data according to Embodiment One of the present application is shown.
The method for acquiring web data of this embodiment includes the following steps:
In step S101, a configuration file for configuring a web crawler, sent by a terminal device, is received.
In the embodiments of the present application, the terminal device may include at least one of the following: an in-vehicle device, an entertainment device, an advertising device, a personal digital assistant (PDA), a tablet computer, a laptop, a handheld device, smart glasses, a smart watch, a wearable device, a virtual display device, or an augmented display device (such as Google Glass, Oculus Rift, HoloLens, Gear VR). The configuration information in the configuration file includes at least one of the following: site information, a web page information type, and extraction rules for field information. The site information includes at least one of the following: a site name, the link address of the site's home page, a web data update frequency, a web data request method, a web data parsing mode, and a page-turning rule. It should be understood that the above description is merely exemplary, and the embodiments of the present application place no limitation on this.
In a specific example, the basic information of the website to be collected is entered through a web page on the terminal device. Specifically, the basic information of the website entered by the user includes: the name of the person making the entry, the site name, the site identifier, the link address of the site's home page, the region to which the site belongs, the data source, and the subject classification. The site name may be, for example, JD.com, Tmall, or Taobao. The site identifier may be an identifier defined by the user. The region may be the region to which the website belongs, for example a municipality directly under the central government or a provincial capital. When the site name is JD.com, the data source may be JD.com, and the subject classification may be, for example, daily necessities or electronic products on JD.com. When the basic information of the website entered by the user through the web page of the terminal device is saved, a corresponding website channel is generated. At the same time, the web data update frequency, web data request method, web data parsing mode, page-turning rule, and so on are obtained according to the site name or home page link address entered by the user.
However, each website contains many kinds of information, which can be divided into different web page information types. After the corresponding website channel is generated, the user can also enter, through the web page of the terminal device, the basic information of the data module to be collected. Specifically, the basic information of the data module entered by the user includes: the name of the person making the entry, the sub-channel name, the sub-channel identifier, and the link address or entry address of the sub-channel. The sub-channel name may be the name of a data module of the website, the sub-channel identifier may be the identifier of the data module, and the link address or entry address of the sub-channel may be the entry address of the data module. When the basic information of the data module entered by the user through the web page of the terminal device is saved, a corresponding configuration file in .yaml format is generated.
After the configuration file is generated, the user can download it through a page on the terminal device and configure, in the downloaded configuration file, the specific web data to be collected from the data module, for example the title, time, issue date, publication date, file download address, page turning, product price, and product style. To configure the specific web data to be collected from the data module, the user configures in the configuration file the names of the fields to be collected and the extraction rules for the field information. For example, if the user needs to collect the price of umbrellas on JD.com, the extraction rule for the field information may be the user's description of how the umbrella price is to be extracted. After configuring the specific web data to be collected from the data module, the user uploads the completed configuration file to the terminal device; the terminal device saves the completed configuration file and sends it to the server, so that the server generates the web crawler corresponding to the configuration file according to the configuration file. It should be understood that the above description is merely exemplary, and the embodiments of the present application place no limitation on this.
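As a non-limiting illustration, the configuration information described above might look as follows once the .yaml file has been parsed into memory. The application does not give the schema of the configuration file, so every key name, the example site, and the XPath-style extraction rules in this Python sketch are assumptions made for illustration only.

```python
# Hypothetical result of parsing the .yaml configuration file.
# All key names and values are assumptions; the application does not define a schema.
EXAMPLE_CONFIG = {
    "site": {
        "name": "example-shop",
        "home_url": "https://example.com/",
        "update_frequency": "daily",
        "request_method": "get",
        "parse_mode": "html",
        "paging_rule": "https://example.com/list?page={page}",
    },
    "page_type": "product-list",
    "fields": [
        {"name": "title", "rule": "//h1/text()"},
        {"name": "price", "rule": "//span[@class='price']/text()"},
    ],
}

# Site entries this sketch treats as mandatory (an assumption).
REQUIRED_SITE_KEYS = ("name", "home_url", "request_method", "parse_mode")


def validate_config(config):
    """Return a list of problems found in a parsed configuration (empty if none)."""
    problems = []
    site = config.get("site", {})
    for key in REQUIRED_SITE_KEYS:
        if not site.get(key):
            problems.append(f"site.{key} is missing")
    fields = config.get("fields", [])
    if not fields:
        problems.append("no fields to collect")
    for index, entry in enumerate(fields):
        # Each field needs a name and an extraction rule, as in the description above.
        if not entry.get("name") or not entry.get("rule"):
            problems.append(f"fields[{index}] needs both a name and a rule")
    return problems
```

Checking the uploaded file before it is forwarded to the server is one possible design choice; it lets a missing field name or extraction rule be reported to the user at upload time.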
In step S102, a web crawler corresponding to the configuration file is generated based on the configuration file.
In some optional embodiments, the server receives the configuration file sent by the terminal device and directly generates the web crawler corresponding to the configuration file according to the configuration file. In other optional embodiments, the server receives the configuration file sent by the terminal device and saves the received configuration file in a database. After receiving a web data acquisition request sent by the terminal device, the server retrieves the received configuration file from the database and generates the web crawler corresponding to the configuration file according to the configuration file. Here, a web crawler is a program or script that automatically captures network data according to certain rules. It should be understood that the above description is merely exemplary, and the embodiments of the present application place no limitation on this.
In some optional embodiments, when the web crawler corresponding to the configuration file is generated based on the configuration file, the configuration file is parsed to obtain the configuration information in the configuration file, and a web crawler that matches the configuration information is assembled based on the configuration information. It should be understood that any implementation of generating, based on the configuration file, a web crawler corresponding to the configuration file is applicable here, and the embodiments of the present application place no limitation on this.
In a specific example, the server holds various functional modules developed in advance, for example functional modules with a single function and functional modules with high reusability. According to the configuration information in the configuration file sent by the terminal device, the server selects the required functional modules, adds on this basis the fields to be collected and the extraction rules for the field information, and assembles the web crawler corresponding to the configuration file. It should be understood that the above description is merely exemplary, and the embodiments of the present application place no limitation on this.
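The assembly of a web crawler from pre-developed functional modules can be sketched as a lookup in module registries keyed by the values found in the configuration information. The following Python sketch is an assumption about one possible structure, not the implementation of the application: the registries, the `Crawler` class, the key names, and the placeholder modules (which only return strings) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical registries of pre-developed modules. The placeholder functions
# only return strings; real modules would send requests and parse responses.
REQUEST_MODULES: Dict[str, Callable[[str], str]] = {
    "get": lambda url: f"GET {url}",
    "post": lambda url: f"POST {url}",
}
PARSE_MODULES: Dict[str, Callable[[str], str]] = {
    "html": lambda body: f"parsed-html({body})",
    "json": lambda body: f"parsed-json({body})",
}


@dataclass
class Crawler:
    """A crawler assembled from the selected modules plus the fields to collect."""
    name: str
    request: Callable[[str], str]
    parse: Callable[[str], str]
    fields: List[dict] = field(default_factory=list)


def pick_module(registry, key, kind):
    """Select a pre-developed module, or fail clearly if none matches."""
    normalized = str(key).lower()
    if normalized not in registry:
        raise ValueError(f"no pre-developed {kind} module for {key!r}")
    return registry[normalized]


def assemble_crawler(config):
    """Assemble a crawler that matches the configuration information."""
    site = config["site"]
    return Crawler(
        name=site["name"],
        request=pick_module(REQUEST_MODULES, site.get("request_method"), "request"),
        parse=pick_module(PARSE_MODULES, site.get("parse_mode"), "parsing"),
        fields=list(config.get("fields", [])),
    )
```

With such a structure, supporting a new kind of website would mean registering another module rather than writing another crawler.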
In some optional embodiments, when the web crawler that matches the configuration information is assembled based on the configuration information, a corresponding web data request method is configured for the web crawler based on the web data request method in the configuration information; and/or a corresponding web data parsing mode is configured for the web crawler based on the web data parsing mode in the configuration information. It should be understood that any implementation of assembling, based on the configuration information, a web crawler that matches the configuration information is applicable here, and the embodiments of the present application place no limitation on this.
In a specific example, the web data of the Internet is transmitted by servers, and a server's data request methods may include the POST request method and the GET request method. To handle both request methods, the server of the present application has two adapted schemes developed in advance and invokes the appropriate scheme according to the request method to acquire the web data. Specifically, when the web data request method of the website is POST, the POST request method is configured for the web crawler; when it is GET, the GET request method is configured for the web crawler. In addition, the web data parsing modes of different websites also differ. To handle these cases, the server of the present application has multiple web data parsing modes developed in advance and, according to the parsing mode required, configures the corresponding web data parsing mode for the web crawler to acquire the web data. It should be understood that the above description is merely exemplary, and the embodiments of the present application place no limitation on this.
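The choice between the GET and POST schemes can be illustrated with Python's standard `urllib.request.Request` class. This is a minimal sketch under assumptions: the function name `build_request` and its parameters are hypothetical, form-encoded POST bodies are assumed, and the request object is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, method, params=None):
    """Build a GET or POST request object according to the configured method."""
    params = params or {}
    method = method.lower()
    if method == "get":
        # GET: parameters travel in the query string.
        query = urlencode(params)
        if query:
            separator = "&" if "?" in url else "?"
            url = f"{url}{separator}{query}"
        return Request(url, method="GET")
    if method == "post":
        # POST: parameters travel in the request body (form encoding assumed).
        body = urlencode(params).encode("utf-8")
        return Request(url, data=body, method="POST")
    raise ValueError(f"unsupported request method: {method!r}")
```

A crawler would pass the returned object to `urllib.request.urlopen` (or an equivalent client) to fetch the page.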
In step S103, web data is acquired by means of the web crawler corresponding to the configuration file.
In some optional embodiments, when web data is acquired, the dates in the web data are formatted into a predetermined format. Date formats in the web data of websites vary widely; to unify them, date-formatting code is written that converts each collected date into the prescribed format. In this way, the date formats in the web data can be unified. It should be understood that any implementation of acquiring web data is applicable here, and the embodiments of the present application place no limitation on this.
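The date formatting described above can be sketched by trying a list of known input formats in turn and emitting a single target format. The list of formats and the ISO-style target format below are assumptions chosen for illustration; the application does not specify which formats are handled.

```python
from datetime import datetime

# Input formats this sketch recognizes (an assumption; extend as needed).
KNOWN_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y.%m.%d", "%Y年%m月%d日")
# Predetermined output format (assumed here to be ISO 8601 calendar date).
TARGET_FORMAT = "%Y-%m-%d"


def normalize_date(text):
    """Return the date in the predetermined format, or None if it is not recognized."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime(TARGET_FORMAT)
        except ValueError:
            continue  # try the next known format
    return None
```

Returning `None` for an unrecognized date, rather than raising, lets the crawler keep the rest of the record and flag the date for later inspection.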
With the method for acquiring web data provided by the embodiments of the present application, a configuration file for configuring a web crawler, sent by a terminal device, is received; a web crawler corresponding to the configuration file is generated based on the configuration file; and web data is then acquired by means of that web crawler. Compared with other existing approaches, generating the web crawler from the configuration file sent by the terminal device means that there is no need to separately develop an adapted web crawler for the web pages of each website; the task of developing web crawlers is eliminated, and web data can be acquired quickly and conveniently.
The method for acquiring web data of this embodiment may be executed by any suitable device with data processing capability, including but not limited to: cameras, terminals, mobile terminals, PCs, servers, in-vehicle devices, entertainment devices, advertising devices, personal digital assistants (PDAs), tablet computers, laptops, handheld devices, smart glasses, smart watches, wearable devices, virtual display devices, or augmented display devices (such as Google Glass, Oculus Rift, HoloLens, Gear VR).
Embodiment two
Referring to Fig. 2, a flow chart of the steps of a method for acquiring web data according to Embodiment Two of the present application is shown.
The method for acquiring web data of this embodiment includes the following steps:
In step S201, a configuration file for configuring a web crawler, sent by a terminal device, is received.
Since step S201 is similar to step S101 above, details are not repeated here.
In step S202, a web crawler corresponding to the configuration file is generated based on the configuration file.
In some optional embodiments, when the web crawler corresponding to the configuration file is generated based on the configuration file, the configuration file is parsed to obtain the configuration information in the configuration file, and a web crawler that matches the configuration information is assembled based on the configuration information. It should be understood that any implementation of generating, based on the configuration file, a web crawler corresponding to the configuration file is applicable here, and the embodiments of the present application place no limitation on this.
In some optional embodiments, when the web crawler that matches the configuration information is assembled based on the configuration information, a corresponding page-turning rule is configured for the web crawler based on the page-turning rule in the configuration information. It should be understood that any implementation of assembling, based on the configuration information, a web crawler that matches the configuration information is applicable here, and the embodiments of the present application place no limitation on this.
In a specific example, the page-turning rule refers to the rule by which the web page link addresses of a website change from one page to the next. The page-turning rules of different websites differ. To handle page turning, the server of the present application has multiple page-turning rules developed in advance and, according to the page-turning rule in the configuration information sent by the terminal device, automatically selects the matching rule to implement the page-turning function. It should be understood that the above description is merely exemplary, and the embodiments of the present application place no limitation on this.
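One way to picture a page-turning rule is as a URL template plus a starting value and a step, from which the link address of each successive page is derived. The rule structure below (keys `template`, `start`, `step`, and the `{page}` placeholder) is a hypothetical representation chosen for this Python sketch; the application does not specify how its pre-developed rules are expressed.

```python
def page_urls(rule, max_pages):
    """Yield the link addresses of successive pages for a hypothetical paging rule.

    rule["template"] contains a {page} placeholder; rule["start"] and
    rule["step"] (both defaulting to 1) describe how the placeholder advances.
    """
    start = rule.get("start", 1)
    step = rule.get("step", 1)
    for index in range(max_pages):
        yield rule["template"].format(page=start + index * step)
```

The same function covers sites that number pages 1, 2, 3 and sites that page by record offset (0, 20, 40), which is why the step is part of the rule rather than fixed.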
In step S203, web data is acquired by starting the web crawler on a schedule.
In some optional embodiments, the web crawler is started on a schedule according to the web data update frequency in the site information to collect the latest web data, so that the web crawler starts by itself and web data is collected automatically. Specifically, the scheduled start time of the web crawler is determined according to the web data update frequency in the site information, and the web crawler is then started at that time to collect the latest web data. Alternatively, the user may configure the scheduled start time of the web crawler in the configuration file, which likewise achieves self-starting of the web crawler and automatic collection of web data. For example, the web crawler may be started once a day, once a week, or once a month. It should be understood that any implementation of acquiring web data by starting the web crawler on a schedule is applicable here, and the embodiments of the present application place no limitation on this.
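Deriving the scheduled start time from the update frequency can be sketched as adding a fixed interval to the time of the last start. The frequency names and the treatment of "monthly" as 30 days are simplifying assumptions made for this Python illustration; a production scheduler would more likely rely on cron-style calendar rules.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from the configured update frequency to an interval.
INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),  # simplification: not a calendar month
}


def next_start(last_start, frequency):
    """Return the next scheduled start time for the given update frequency."""
    if frequency not in INTERVALS:
        raise ValueError(f"unknown update frequency: {frequency!r}")
    return last_start + INTERVALS[frequency]


def is_due(now, last_start, frequency):
    """Return True once the scheduled start time has been reached."""
    return now >= next_start(last_start, frequency)
```

A scheduler loop could call `is_due` periodically and start the crawler whenever it returns `True`.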
In some optional embodiments, when web data is acquired by means of the web crawler corresponding to the configuration file, the first item of web data of the current web page in the current acquisition is acquired; it is determined whether the content of the first item of web data of the current web page in the current acquisition is identical to the content of the first item of web data of the current web page in the previous acquisition; and if not, web data of the current web page continues to be acquired in the current acquisition until the content of the web data acquired is identical to the content of the first item of web data of the current web page in the previous acquisition. In this way, incremental acquisition of web data can be achieved, which reduces the load on the server of the website being collected and avoids repeated acquisition of web data. It should be understood that any implementation of acquiring web data by means of the web crawler corresponding to the configuration file is applicable here, and the embodiments of the present application place no limitation on this.
In a specific example, when it is determined that the content of the first item of web data of the current web page in the current acquisition is identical to the content of the first item of web data of the current web page in the previous acquisition, it is determined that the current web page has no web data update, and the current acquisition ends. When the two are determined not to be identical, the first item of web data of the current web page in the current acquisition is stored as the basis for the next comparison. This works because a website's newly updated data is always displayed at the top of the web page. After web data is collected, it is saved to a database for persistent storage. It should be understood that the above description is merely exemplary, and the embodiments of the present application place no limitation on this.
In a specific example, in order to avoid acquiring all of the web data again on every acquisition, the present application implements an incremental crawler. On the first acquisition, the first item of data of the website is recorded and saved to the database. When the scheduled start time arrives, the web crawler is started again and compares the previously stored first item of data with the first item of data currently displayed on the web page. If the two are identical, it is considered that no data has been updated; nothing is acquired, the current acquisition ends, and the web crawler waits for its next scheduled start. If the two differ, data has been updated and new data has been published, so acquisition begins. When the acquired data is identical to the previously stored first item of data, acquisition stops, and the first item of data of the web page is stored as the basis for the next comparison. It should be understood that the above description is merely exemplary, and the embodiments of the present application do not limit this in any way.
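The incremental comparison described above can be illustrated with a minimal Python sketch. It is not the implementation of the application: the function name `incremental_collect`, the representation of page items as strings, and the assumption that items arrive newest first are all hypothetical choices made for illustration.

```python
from typing import Iterable, List, Optional, Tuple


def incremental_collect(
    items_newest_first: Iterable[str],
    previous_first: Optional[str],
) -> Tuple[List[str], Optional[str]]:
    """Collect only the items published since the previous run.

    items_newest_first: page items in display order (newest at the top).
    previous_first: first item stored by the previous run, or None on the
        very first run.
    Returns (newly collected items, first item to store for the next run).
    """
    collected: List[str] = []
    new_first: Optional[str] = previous_first
    for index, item in enumerate(items_newest_first):
        if index == 0:
            if item == previous_first:
                # First item unchanged: no update, so end this run at once.
                return [], previous_first
            # First item changed: keep it as the next comparison basis.
            new_first = item
        elif item == previous_first:
            # Reached the item stored last time: the rest is already known.
            break
        collected.append(item)
    return collected, new_first
```

For example, if the previous run stored "c" and the page now lists "e", "d", "c", "b", "a", the sketch collects "e" and "d" and returns "e" as the new comparison basis. In a real crawler the stored first item would be read from and written back to the database mentioned above.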
With the web data acquisition method provided by the embodiments of the present application, a configuration file for configuring a web crawler, sent by a terminal device, is received; a web crawler corresponding to the configuration file is generated based on the configuration file; and web data is then acquired by starting the web crawler on a schedule. Compared with other existing approaches, because the web crawler is generated from the configuration file sent by the terminal device, there is no need to separately develop an adapted web crawler for the web pages of each website. This eliminates the task of developing web crawlers and allows web data to be acquired quickly and conveniently. In addition, starting the web crawler on a schedule automates the acquisition of web data.
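As a rough illustration of scheduled starting, the following Python sketch runs a crawl callable at a fixed interval. The function `run_on_schedule`, its parameters, and the idea of deriving the interval from the configured web data update frequency are assumptions made for this example; the application does not specify a scheduling mechanism.

```python
import time
from typing import Callable


def run_on_schedule(
    crawl: Callable[[], None],
    interval_seconds: float,
    runs: int,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Start `crawl` `runs` times, waiting `interval_seconds` between starts.

    `sleep` is injectable so the loop can be exercised without real waiting.
    """
    for run_index in range(runs):
        crawl()
        if run_index < runs - 1:
            # Wait for the next scheduled start.
            sleep(interval_seconds)
```

A production deployment would more likely rely on an operating-system scheduler or a job queue; the loop above only shows the control flow.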
The web data acquisition method of this embodiment may be executed by any suitable device with data processing capability, including but not limited to: a camera, a terminal, a mobile terminal, a PC, a server, a vehicle-mounted device, an entertainment device, an advertising device, a personal digital assistant (PDA), a tablet computer, a laptop, a handheld device, smart glasses, a smart watch, a wearable device, a virtual display device or a display enhancement device (such as Google Glass, Oculus Rift, Hololens, Gear VR).
Embodiment three
Referring to Fig. 3, a structural block diagram of a web data acquisition device according to Embodiment three of the present application is shown.
The web data acquisition device of this embodiment includes: a receiving module 301, configured to receive a configuration file, sent by a terminal device, for configuring a web crawler; a generation module 302, configured to generate, based on the configuration file, a web crawler corresponding to the configuration file; and a first acquisition module 303, configured to acquire web data through the web crawler corresponding to the configuration file.
The web data acquisition device of this embodiment is used to implement the corresponding web data acquisition method in the foregoing method embodiments and has the beneficial effects of the corresponding method embodiment, which are not repeated here.
Embodiment four
Referring to Fig. 4, a structural block diagram of a web data acquisition device according to Embodiment four of the present application is shown.
The web data acquisition device of this embodiment includes: a receiving module 401, configured to receive a configuration file, sent by a terminal device, for configuring a web crawler; a generation module 402, configured to generate, based on the configuration file, a web crawler corresponding to the configuration file; and a first acquisition module 403, configured to acquire web data through the web crawler corresponding to the configuration file.
Optionally, the generation module 402 includes: a parsing module 4021, configured to parse the configuration file to obtain the configuration information in the configuration file; and an assembling module 4022, configured to assemble, based on the configuration information, a web crawler that matches the configuration information.
Optionally, the assembling module 4022 includes: a first configuration module 4023, configured to configure a corresponding web data request method for the web crawler based on the web data request method in the configuration information; and/or a second configuration module 4024, configured to configure a corresponding web data parsing mode for the web crawler based on the web data parsing mode in the configuration information.
Optionally, the assembling module 4022 further includes: a third configuration module 4025, configured to configure a corresponding web page turning rule for the web crawler based on the web page turning rule in the configuration information.
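One simple form a web page turning rule could take is a URL template with a page-number placeholder. The Python sketch below assumes that form; the function `page_urls`, the `{page}` placeholder, and the example URL are hypothetical and are not taken from the application, which leaves the rule format open.

```python
from typing import Iterator


def page_urls(url_template: str, first_page: int, max_pages: int) -> Iterator[str]:
    """Yield list-page URLs by substituting successive page numbers.

    url_template must contain a `{page}` placeholder, for example
    "https://example.com/news?page={page}".
    """
    for page in range(first_page, first_page + max_pages):
        yield url_template.format(page=page)
```

Other sites turn pages by following a "next" link or by posting a form; a rule for those cases would need a different representation, such as a selector for the link element.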
Optionally, the device further includes: a formatting module 404, configured to format the dates in the web data into a predetermined format when the web data is acquired.
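Date formatting of this kind can be sketched in Python as follows. The target format `%Y-%m-%d` and the list of recognized input formats are assumptions chosen for illustration; the application only states that dates are converted to a predetermined format.

```python
from datetime import datetime
from typing import Optional, Sequence

# Assumed predetermined format (ISO-style calendar date).
TARGET_FORMAT = "%Y-%m-%d"
# Assumed input formats that scraped pages might use.
KNOWN_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y.%m.%d", "%Y年%m月%d日")


def normalize_date(
    raw: str, known_formats: Sequence[str] = KNOWN_FORMATS
) -> Optional[str]:
    """Return `raw` in TARGET_FORMAT, or None if no known format matches."""
    text = raw.strip()
    for fmt in known_formats:
        try:
            return datetime.strptime(text, fmt).strftime(TARGET_FORMAT)
        except ValueError:
            # This format does not match; try the next one.
            continue
    return None
```

Returning `None` for unrecognized input leaves it to the caller to decide whether to keep the raw string or discard the record.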
Optionally, the configuration information in the configuration file includes at least one of the following: site information, web page information type, and extraction rules for field information.
Optionally, the site information includes at least one of the following: site name, link address of the website home page, web data update frequency, web data request method, web data parsing mode, and web page turning rule.
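To make the parse-then-assemble flow concrete, the Python sketch below reads a configuration file and builds a crawler description from it. The JSON layout, every key name, the `CrawlerSpec` class and the set of supported request methods are hypothetical; the application lists the kinds of configuration information but does not fix a file format.

```python
import json
from dataclasses import dataclass
from typing import Dict

# Hypothetical configuration file content; all keys are illustrative.
EXAMPLE_CONFIG = """
{
  "site": {
    "name": "Example News",
    "home_url": "https://example.com/",
    "update_frequency_hours": 24,
    "request_method": "GET",
    "parse_mode": "xpath",
    "page_turning_rule": "https://example.com/news?page={page}"
  },
  "page_type": "list",
  "field_rules": {
    "title": "//h1/text()",
    "published": "//span[@class='date']/text()"
  }
}
"""

SUPPORTED_REQUEST_METHODS = {"GET", "POST"}  # assumed set


@dataclass
class CrawlerSpec:
    """Crawler settings assembled from the parsed configuration."""
    site_name: str
    home_url: str
    update_frequency_hours: float
    request_method: str
    parse_mode: str
    page_turning_rule: str
    page_type: str
    field_rules: Dict[str, str]


def parse_config(text: str) -> CrawlerSpec:
    """Parse the configuration text and assemble a matching CrawlerSpec."""
    raw = json.loads(text)
    site = raw["site"]
    method = site["request_method"].upper()
    if method not in SUPPORTED_REQUEST_METHODS:
        raise ValueError(f"unsupported request method: {method}")
    return CrawlerSpec(
        site_name=site["name"],
        home_url=site["home_url"],
        update_frequency_hours=float(site["update_frequency_hours"]),
        request_method=method,
        parse_mode=site["parse_mode"],
        page_turning_rule=site["page_turning_rule"],
        page_type=raw["page_type"],
        field_rules=dict(raw["field_rules"]),
    )
```

Because the crawler is described entirely by data, supporting a new website in this sketch means writing a new configuration file rather than new crawler code, which is the benefit the embodiments describe.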
The web data acquisition device of this embodiment is used to implement the corresponding web data acquisition method in the foregoing method embodiments and has the beneficial effects of the corresponding method embodiment, which are not repeated here.
Embodiment five
Referring to Fig. 5, a structural block diagram of a web data acquisition device according to Embodiment five of the present application is shown.
The web data acquisition device of this embodiment includes: a receiving module 501, configured to receive a configuration file, sent by a terminal device, for configuring a web crawler; a generation module 502, configured to generate, based on the configuration file, a web crawler corresponding to the configuration file; and a first acquisition module 503, configured to acquire web data through the web crawler corresponding to the configuration file.
Optionally, the first acquisition module 503 includes: a second acquisition module 5031, configured to acquire web data by starting the web crawler on a schedule.
Optionally, the first acquisition module 503 further includes: a third acquisition module 5032, configured to acquire the first item of web data of the current web page in the current acquisition; a determining module 5033, configured to determine whether the content of the first item of web data of the current web page in the current acquisition is identical to the content of the first item of web data of the current web page in the previous acquisition; and a fourth acquisition module 5034, configured to, if not, continue acquiring web data of the current web page in the current acquisition until the content of the acquired web data of the current web page in the current acquisition is identical to the content of the first item of web data of the current web page in the previous acquisition.
Optionally, the device further includes: a storage module 5035, configured to store the first item of web data of the current web page in the current acquisition when it is determined that the content of the first item of web data of the current web page in the current acquisition is not identical to the content of the first item of web data of the current web page in the previous acquisition.
The web data acquisition device of this embodiment is used to implement the corresponding web data acquisition method in the foregoing method embodiments and has the beneficial effects of the corresponding method embodiment, which are not repeated here.
Embodiment six
Referring to Fig. 6, a structural schematic diagram of an electronic equipment according to Embodiment six of the present application is shown. The specific embodiments of the present application do not limit the specific implementation of the electronic equipment.
As shown in Fig. 6, the electronic equipment may include: a processor 601 and a storage device 602.
Wherein:
The processor 601 is configured to execute a program 603, and may specifically execute the relevant steps in the above embodiments of the web data acquisition method.
Specifically, the program 603 may include program code, and the program code includes computer operation instructions.
The processor 601 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The one or more processors included in the electronic equipment may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The storage device 602 is configured to store one or more programs 603. The storage device 602 may include high-speed RAM, and may also include non-volatile memory, for example at least one magnetic disk memory.
The program 603 may specifically be used to cause the processor 601 to perform the following operations: receiving a configuration file, sent by a terminal device, for configuring a web crawler; generating, based on the configuration file, a web crawler corresponding to the configuration file; and acquiring web data through the web crawler corresponding to the configuration file.
In an optional implementation, the program 603 is further used to cause the processor 601, when generating the web crawler corresponding to the configuration file based on the configuration file, to parse the configuration file to obtain the configuration information in the configuration file, and to assemble, based on the configuration information, a web crawler that matches the configuration information.
In an optional implementation, the program 603 is further used to cause the processor 601, when assembling the web crawler that matches the configuration information based on the configuration information, to configure a corresponding web data request method for the web crawler based on the web data request method in the configuration information; and/or to configure a corresponding web data parsing mode for the web crawler based on the web data parsing mode in the configuration information.
In an optional implementation, the program 603 is further used to cause the processor 601, when assembling the web crawler that matches the configuration information based on the configuration information, to configure a corresponding web page turning rule for the web crawler based on the web page turning rule in the configuration information.
In an optional implementation, the program 603 is further used to cause the processor 601, when acquiring web data through the web crawler corresponding to the configuration file, to acquire the web data by starting the web crawler on a schedule.
In an optional implementation, the program 603 is further used to cause the processor 601, when acquiring web data through the web crawler corresponding to the configuration file, to acquire the first item of web data of the current web page in the current acquisition; to determine whether the content of the first item of web data of the current web page in the current acquisition is identical to the content of the first item of web data of the current web page in the previous acquisition; and, if not, to continue acquiring web data of the current web page in the current acquisition until the content of the acquired web data of the current web page in the current acquisition is identical to the content of the first item of web data of the current web page in the previous acquisition.
In an optional implementation, the program 603 is further used to cause the processor 601 to store the first item of web data of the current web page in the current acquisition when it is determined that the content of the first item of web data of the current web page in the current acquisition is not identical to the content of the first item of web data of the current web page in the previous acquisition.
In an optional implementation, the program 603 is further used to cause the processor 601, when acquiring web data, to format the dates in the web data into a predetermined format.
In an optional implementation, the configuration information in the configuration file includes at least one of the following: site information, web page information type, and extraction rules for field information.
In an optional implementation, the site information includes at least one of the following: site name, link address of the website home page, web data update frequency, web data request method, web data parsing mode, and web page turning rule.
For the specific implementation of each step in the program 603, reference may be made to the corresponding descriptions of the corresponding steps and units in the above embodiments of the web data acquisition method, which are not repeated here. Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the equipment and modules described above may refer to the corresponding process descriptions in the foregoing method embodiments, which are not repeated here.
With the electronic equipment of this embodiment, a configuration file for configuring a web crawler, sent by a terminal device, is received; a web crawler corresponding to the configuration file is generated based on the configuration file; and web data is then acquired through the web crawler corresponding to the configuration file. Compared with other existing approaches, because the web crawler is generated from the configuration file sent by the terminal device, there is no need to separately develop an adapted web crawler for the web pages of each website. This eliminates the task of developing web crawlers and allows web data to be acquired quickly and conveniently.
It should be noted that, according to implementation needs, each component/step described in the embodiments of the present application may be split into more components/steps, and two or more components/steps, or partial operations of components/steps, may be combined into new components/steps to achieve the purpose of the embodiments of the present application.
In particular, according to the embodiments of the present application, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, the embodiments of the present application include a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code for executing the methods shown in the method embodiments above. In such embodiments, the computer program may be downloaded and installed from a network through a communication part, and/or installed from a removable medium. When the computer program is executed by a central processing unit (CPU), the above functions defined in the methods of the embodiments of the present application are performed. It should be noted that the computer-readable medium described herein may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus or device. In the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and that computer-readable medium may send, propagate or transmit a program for use by or in combination with an instruction execution system, apparatus or device. The program code contained on the computer-readable medium may be transmitted over any suitable medium, including but not limited to: wireless, wire, optical cable, RF, or any suitable combination of the above.
Computer program code for performing the operations of the present application may be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages, such as Java, Smalltalk and C++, as well as conventional procedural programming languages, such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions and operations of systems, methods and computer program products according to the various embodiments of the present application. In this regard, each box in a flowchart or block diagram may represent a module, a program segment or a part of code, and the module, program segment or part of code contains one or more executable instructions for implementing the specified logic function. It should also be noted that, in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application may be implemented in software or in hardware. The described units may also be provided in a processor; for example, a processor may be described as including an acquisition unit, a processing unit and a test unit. The names of these units do not, under certain conditions, constitute a limitation on the units themselves; for example, the acquisition unit may also be described as "a unit that obtains the service test file of the target object to be tested according to the test instruction".
In another aspect, the present application also provides a computer-readable storage medium on which a computer program is stored, and the method described in any of the above embodiments is implemented when the program is executed by a processor.
In another aspect, the present application also provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist separately without being assembled into the device. The computer-readable medium carries one or more programs, and when the one or more programs are executed by the device, the equipment in which the device is located is caused to: receive a configuration file, sent by a terminal device, for configuring a web crawler; generate, based on the configuration file, a web crawler corresponding to the configuration file; and acquire web data through the web crawler corresponding to the configuration file.
The above description is only of the preferred embodiments of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the present application.
Claims (22)
1. A web data acquisition method, characterized in that the method comprises:
receiving a configuration file, sent by a terminal device, for configuring a web crawler;
generating, based on the configuration file, a web crawler corresponding to the configuration file;
acquiring web data through the web crawler corresponding to the configuration file.
2. The method according to claim 1, characterized in that generating, based on the configuration file, the web crawler corresponding to the configuration file comprises:
parsing the configuration file to obtain the configuration information in the configuration file;
assembling, based on the configuration information, a web crawler that matches the configuration information.
3. The method according to claim 2, characterized in that assembling, based on the configuration information, the web crawler that matches the configuration information comprises:
configuring a corresponding web data request method for the web crawler based on the web data request method in the configuration information; and/or
configuring a corresponding web data parsing mode for the web crawler based on the web data parsing mode in the configuration information.
4. The method according to claim 2, characterized in that assembling, based on the configuration information, the web crawler that matches the configuration information further comprises:
configuring a corresponding web page turning rule for the web crawler based on the web page turning rule in the configuration information.
5. The method according to claim 1, characterized in that acquiring web data through the web crawler corresponding to the configuration file comprises:
acquiring web data by starting the web crawler on a schedule.
6. The method according to claim 1, characterized in that acquiring web data through the web crawler corresponding to the configuration file further comprises:
acquiring the first item of web data of the current web page in the current acquisition;
determining whether the content of the first item of web data of the current web page in the current acquisition is identical to the content of the first item of web data of the current web page in the previous acquisition;
if not, continuing to acquire web data of the current web page in the current acquisition until the content of the acquired web data of the current web page in the current acquisition is identical to the content of the first item of web data of the current web page in the previous acquisition.
7. The method according to claim 6, characterized in that the method further comprises:
when it is determined that the content of the first item of web data of the current web page in the current acquisition is not identical to the content of the first item of web data of the current web page in the previous acquisition, storing the first item of web data of the current web page in the current acquisition.
8. The method according to claim 1, characterized in that the method further comprises:
when acquiring web data, formatting the dates in the web data into a predetermined format.
9. The method according to any one of claims 1-8, characterized in that the configuration information in the configuration file includes at least one of the following:
site information, web page information type, and extraction rules for field information.
10. The method according to claim 9, characterized in that the site information includes at least one of the following:
site name, link address of the website home page, web data update frequency, web data request method, web data parsing mode, and web page turning rule.
11. A web data acquisition device, characterized in that the device comprises:
a receiving module, configured to receive a configuration file, sent by a terminal device, for configuring a web crawler;
a generation module, configured to generate, based on the configuration file, a web crawler corresponding to the configuration file;
a first acquisition module, configured to acquire web data through the web crawler corresponding to the configuration file.
12. The device according to claim 11, characterized in that the generation module comprises:
a parsing module, configured to parse the configuration file to obtain the configuration information in the configuration file;
an assembling module, configured to assemble, based on the configuration information, a web crawler that matches the configuration information.
13. The device according to claim 12, characterized in that the assembling module comprises:
a first configuration module, configured to configure a corresponding web data request method for the web crawler based on the web data request method in the configuration information; and/or
a second configuration module, configured to configure a corresponding web data parsing mode for the web crawler based on the web data parsing mode in the configuration information.
14. The device according to claim 12, characterized in that the assembling module further comprises:
a third configuration module, configured to configure a corresponding web page turning rule for the web crawler based on the web page turning rule in the configuration information.
15. The device according to claim 11, characterized in that the first acquisition module comprises:
a second acquisition module, configured to acquire web data by starting the web crawler on a schedule.
16. The device according to claim 11, characterized in that the first acquisition module further comprises:
a third acquisition module, configured to acquire the first item of web data of the current web page in the current acquisition;
a determining module, configured to determine whether the content of the first item of web data of the current web page in the current acquisition is identical to the content of the first item of web data of the current web page in the previous acquisition;
a fourth acquisition module, configured to, if not, continue acquiring web data of the current web page in the current acquisition until the content of the acquired web data of the current web page in the current acquisition is identical to the content of the first item of web data of the current web page in the previous acquisition.
17. The device according to claim 16, characterized in that the device further comprises:
a storage module, configured to store the first item of web data of the current web page in the current acquisition when it is determined that the content of the first item of web data of the current web page in the current acquisition is not identical to the content of the first item of web data of the current web page in the previous acquisition.
18. The device according to claim 11, characterized in that the device further comprises:
a formatting module, configured to format the dates in the web data into a predetermined format when the web data is acquired.
19. The device according to any one of claims 11-18, characterized in that the configuration information in the configuration file includes at least one of the following:
site information, web page information type, and extraction rules for field information.
20. The device according to claim 19, characterized in that the site information includes at least one of the following:
site name, link address of the website home page, web data update frequency, web data request method, web data parsing mode, and web page turning rule.
21. An electronic equipment, comprising:
one or more processors;
a storage device, configured to store one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the web data acquisition method according to any one of claims 1-10.
22. A computer-readable storage medium on which a computer program is stored, wherein the web data acquisition method according to any one of claims 1-10 is implemented when the program is executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811033910.4A CN110245278A (en) | 2018-09-05 | 2018-09-05 | Acquisition method, device, electronic equipment and the storage medium of web data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110245278A true CN110245278A (en) | 2019-09-17 |
Family
ID=67882858
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811033910.4A Pending CN110245278A (en) | 2018-09-05 | 2018-09-05 | Acquisition method, device, electronic equipment and the storage medium of web data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110245278A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110990675A (en) * | 2019-11-25 | 2020-04-10 | 爱信诺征信有限公司 | Webpage data crawling method and system |
CN111953766A (en) * | 2020-08-07 | 2020-11-17 | 福建省天奕网络科技有限公司 | Method and system for collecting network data |
CN112015963A (en) * | 2020-08-21 | 2020-12-01 | 北京金和网络股份有限公司 | Web crawler system based on big data |
CN112800307A (en) * | 2021-01-25 | 2021-05-14 | 浪潮云信息技术股份公司 | Configurable webpage information crawling method based on Java Web |
CN112948659A (en) * | 2021-03-09 | 2021-06-11 | 深圳九星互动科技有限公司 | Webpage data acquisition method, device, system and medium |
CN113987318A (en) * | 2021-11-01 | 2022-01-28 | 盐城金堤科技有限公司 | Page monitoring method, device, equipment and computer storage medium |
CN114428635A (en) * | 2022-04-06 | 2022-05-03 | 杭州未名信科科技有限公司 | Data acquisition method and device, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104866555A (en) * | 2015-05-15 | 2015-08-26 | 浪潮软件集团有限公司 | Automatic acquisition method based on web crawler |
CN107808000A (en) * | 2017-11-13 | 2018-03-16 | 哈尔滨工业大学(威海) | A kind of hidden web data collection and extraction system and method |
CN108334585A (en) * | 2018-01-29 | 2018-07-27 | 湖北省楚天云有限公司 | A kind of spiders method, apparatus and electronic equipment |
- 2018-09-05 CN CN201811033910.4A patent/CN110245278A/en active Pending
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110990675A (en) * | 2019-11-25 | 2020-04-10 | 爱信诺征信有限公司 | Webpage data crawling method and system |
CN110990675B (en) * | 2019-11-25 | 2024-07-26 | 爱信诺征信有限公司 | Webpage data crawling method and system |
CN111953766A (en) * | 2020-08-07 | 2020-11-17 | 福建省天奕网络科技有限公司 | Method and system for collecting network data |
CN112015963A (en) * | 2020-08-21 | 2020-12-01 | 北京金和网络股份有限公司 | Web crawler system based on big data |
CN112800307A (en) * | 2021-01-25 | 2021-05-14 | 浪潮云信息技术股份公司 | Configurable webpage information crawling method based on Java Web |
CN112948659A (en) * | 2021-03-09 | 2021-06-11 | 深圳九星互动科技有限公司 | Webpage data acquisition method, device, system and medium |
CN113987318A (en) * | 2021-11-01 | 2022-01-28 | 盐城金堤科技有限公司 | Page monitoring method, device, equipment and computer storage medium |
CN113987318B (en) * | 2021-11-01 | 2024-03-12 | 盐城天眼察微科技有限公司 | Page monitoring method, device, equipment and computer storage medium |
CN114428635A (en) * | 2022-04-06 | 2022-05-03 | 杭州未名信科科技有限公司 | Data acquisition method and device, electronic equipment and storage medium |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN110245278A (en) | Acquisition method, device, electronic equipment and the storage medium of web data | |
US10698960B2 (en) | Content validation and coding for search engine optimization | |
CN107832468B (en) | Demand recognition methods and device | |
CN101971172B (en) | Mobile sitemaps | |
CN107105031A (en) | Information-pushing method and device | |
CN107729319A (en) | Method and apparatus for output information | |
CN103020207B (en) | Browser label page grouping management method and device | |
CN106933722A (en) | A kind of web application monitoring method, server and system | |
CN108572990A (en) | Information-pushing method and device | |
CN107885777A (en) | A kind of control method and system of the crawl web data based on collaborative reptile | |
CN107609890A (en) | A kind of method and apparatus of order tracking | |
CN108287927B (en) | For obtaining the method and device of information | |
CN107145556B (en) | Universal distributed acquisition system | |
CN110275963A (en) | Method and apparatus for output information | |
CN105824887A (en) | Mobile electronic investigation system and method based on intelligent questionnaire generative design | |
CN109871311A (en) | A kind of method and apparatus for recommending test case | |
CN107391675A (en) | Method and apparatus for generating structure information | |
CN108121814A (en) | Search results ranking model generating method and device | |
CN106407377A (en) | Search method and device based on artificial intelligence | |
CN106354856A (en) | Enhanced deep neural network search method and device based on artificial intelligence | |
CN103514189A (en) | Implementing method for web crawler based on search engines | |
US20190347068A1 (en) | Personal history recall | |
CN106951495A (en) | Method and apparatus for information to be presented | |
CN107729508A (en) | Information crawler method and apparatus | |
CN108574669A (en) | User behavior tree constructing method and device |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190917 |