CN106126716A

CN106126716A - Data crawling method and device

Info

Publication number: CN106126716A
Application number: CN201610511377.2A
Authority: CN
Inventors: 姚光明
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2016-06-30
Filing date: 2016-06-30
Publication date: 2016-11-16

Abstract

An embodiment of the invention discloses a data crawling method and device. Identification information of at least one content producer is obtained and stored in advance; according to the identification information of the at least one content producer, at least one content producer personal homepage in one-to-one correspondence with the content producers is determined; for each content producer, all data produced by that content producer is crawled from its personal homepage. By applying the embodiments provided by the invention, comprehensive data can be crawled.

Description

Data crawling method and device

Technical field

The present invention relates to the field of network information search, and in particular to a data crawling method and device.

Background art

With the rapid development of the network, the World Wide Web has become a carrier of massive amounts of information, and people use search tools to retrieve from this mass of data. The results returned by search tools contain a large amount of data that the user does not care about, and finding the data the user does care about among them becomes a difficult problem. Under these circumstances, crawler systems that capture relevant web page resources in a targeted way came into being; according to a set crawling target, they selectively visit web pages or links on the World Wide Web to obtain the required data.

When existing crawler systems crawl microblog or video data, the approaches generally used are crawling based on search keywords and crawling based on list pages. When crawling by keyword search, the main steps are: calling the search interface of the crawler system, entering the search keyword, and downloading the search results; then extracting the URLs of content detail pages from the search results and downloading those pages. The main shortcoming of this approach is that the number of search results is limited, so the crawled data is not comprehensive. Crawling based on list pages is limited by the number of list pages and likewise suffers from incomplete data.

Summary of the invention

The purpose of the embodiments of the present invention is to provide a data crawling method and device, so as to crawl comprehensive data.

To achieve the above purpose, an embodiment of the present invention discloses a data crawling method, in which identification information of at least one content producer is obtained and stored in advance; the method includes:

determining, according to the identification information of the at least one content producer, at least one content producer personal homepage in one-to-one correspondence with the content producers;

for each content producer, crawling, in that content producer's personal homepage, all data produced by the content producer.

Preferably, obtaining the identification information of a content producer includes:

extracting the identification information of the content producer from the results page of a keyword search;

or

extracting the identification information of the content producer from a target website, based on a homepage-depth crawling scheme.

Preferably, the method further includes:

for each content producer, determining, according to all the crawled data produced by the content producer, the frequency at which the content producer produces data;

at the determined frequency, crawling, in the content producer's personal homepage, the data produced by the content producer that has not yet been crawled.

Preferably, the method further includes:

determining the priority of each content producer according to users' evaluation information on the crawled data produced by the content producer;

in order of priority from high to low, crawling, in the content producers' personal homepages, the data produced by the content producers that has not yet been crawled.

To achieve the above purpose, an embodiment of the present invention discloses a data crawling device, the device including:

an acquisition module, configured to obtain and store in advance identification information of at least one content producer;

a first determining module, configured to determine, according to the identification information of the at least one content producer, at least one content producer personal homepage in one-to-one correspondence with the content producers;

a first crawling module, configured to crawl, for each content producer and in that content producer's personal homepage, all data produced by the content producer.

Preferably, the acquisition module is specifically configured to:

extract and store identification information of at least one content producer from the results page of a keyword search;

or

extract and store identification information of at least one content producer from a target website, based on a homepage-depth crawling scheme.

Preferably, the device further includes a second determining module and a second crawling module;

the second determining module is configured to determine, for each content producer and according to all the crawled data produced by the content producer, the frequency at which the content producer produces data;

the second crawling module is configured to crawl, at the determined frequency and in the content producer's personal homepage, the data produced by the content producer that has not yet been crawled.

Preferably, the device further includes a third determining module and a third crawling module;

the third determining module is configured to determine the priority of each content producer according to users' evaluation information on the crawled data produced by the content producer;

the third crawling module is configured to crawl, in order of priority from high to low and in the content producers' personal homepages, the data produced by the content producers that has not yet been crawled.

As can be seen from the above technical solutions, in the data crawling method and device provided by the embodiments of the present invention, identification information of at least one content producer is obtained and stored in advance; according to the identification information of the at least one content producer, at least one content producer personal homepage in one-to-one correspondence with the content producers is determined; for each content producer, all data produced by the content producer is crawled from its personal homepage. With the technical solution provided by the embodiments of the present invention, once the identification information of a content producer has been obtained, all data produced by that content producer can be crawled according to that identification information, so that comprehensive data is crawled.

Of course, a method or device implementing the present invention does not necessarily need to achieve all of the above advantages at the same time.

Brief description of the drawings

To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a data crawling method provided by an embodiment of the present invention;

Fig. 2 is another schematic flowchart of the data crawling method provided by an embodiment of the present invention;

Fig. 3 is yet another schematic flowchart of the data crawling method provided by an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a data crawling device provided by an embodiment of the present invention;

Fig. 5 is another schematic structural diagram of the data crawling device provided by an embodiment of the present invention;

Fig. 6 is yet another schematic structural diagram of the data crawling device provided by an embodiment of the present invention.

Detailed description of the invention

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, fall within the scope of protection of the present invention.

To solve the problems of the prior art, the embodiments of the present invention provide a data crawling method and device, which are described in detail below.

It should be noted that the data crawling method and device provided by the embodiments of the present invention are applicable to crawler systems. In practice, identification information of at least one content producer, such as the personal ID or account name of a video uploader, is stored in advance for use when data is crawled later. This step establishes the identifiers that correspond to the content to be searched, in preparation for comprehensive content crawling.

Fig. 1 is a schematic flowchart of a data crawling method provided by an embodiment of the present invention, which includes the following steps:

S101: according to the identification information of the at least one content producer, determine at least one content producer personal homepage in one-to-one correspondence with the content producers.

Specifically, in practice, identification information of at least one content producer is obtained and stored in advance. The identification information may be a name, an ID number, an account, and so on; the embodiments of the present invention do not limit its specific form.

Specifically, the identification information of a content producer may be extracted from the results page of a keyword search.

For example, searching with "abcdefghijk" as the keyword yields the page "http://www.yyy.com/movies/key=abcdefghijk". Suppose this page corresponds to 3 video uploaders whose IDs are "AAAA1", "AAAA2" and "AAAA3". The IDs of all these video uploaders are extracted, and the IDs "AAAA1", "AAAA2" and "AAAA3" are the identification information of the content producers. If an ID has a corresponding account name, that account name may also be extracted as the identification information of the content producer. The embodiments of the present invention aim to extract one piece of identification information corresponding to each content producer; the type of identification information is not limited, provided that a one-to-one correspondence between the identification information and the content producer can be achieved.
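
The extraction described above can be sketched as follows. This is a minimal illustration only: the `data-uploader-id` attribute, the sample markup, and the names `UploaderIdParser` and `extract_producer_ids` are assumptions made for the example, since the embodiment does not specify how a results page marks up its uploaders.

```python
from html.parser import HTMLParser


class UploaderIdParser(HTMLParser):
    """Collects uploader IDs from a hypothetical 'data-uploader-id' attribute."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Keep each ID once, in first-seen order, so that one identifier
            # corresponds to one content producer.
            if name == "data-uploader-id" and value and value not in self.ids:
                self.ids.append(value)


def extract_producer_ids(html):
    """Return the distinct uploader IDs found in a search results page."""
    parser = UploaderIdParser()
    parser.feed(html)
    return parser.ids


# Assumed stand-in for the downloaded results page of the keyword search.
SAMPLE_RESULTS_PAGE = """
<ul>
  <li class="result" data-uploader-id="AAAA1">video one</li>
  <li class="result" data-uploader-id="AAAA2">video two</li>
  <li class="result" data-uploader-id="AAAA1">video three</li>
  <li class="result" data-uploader-id="AAAA3">video four</li>
</ul>
"""

producer_ids = extract_producer_ids(SAMPLE_RESULTS_PAGE)
print(producer_ids)
```

Duplicates are dropped because one uploader typically appears in several search results, while only one identifier per content producer needs to be stored.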

Specifically, the identification information of a content producer may also be extracted from a target website, based on a homepage-depth crawling scheme.

For example, the target website has the homepage http://www.xyz.com, and crawling starts from this site to obtain all of its video uploaders. Suppose there are 5 video uploaders whose IDs are "aaa1", "aaa2", "aaa3", "aaa4" and "aaa5". All the IDs are extracted, and the IDs "aaa1", "aaa2", "aaa3", "aaa4" and "aaa5" are the identification information of all the content producers of this website. Using the ID as the content producer identifier is merely exemplary and does not constitute a limitation of the present invention.
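
One possible reading of a homepage-depth crawling scheme is a breadth-first traversal that starts at the site homepage and stops at a fixed link depth. The sketch below follows that reading as an assumption; the in-memory `SITE_LINKS` map stands in for real page downloads, and `collect_producer_ids`, `ID_PATTERN` and the `max_depth` parameter are hypothetical names.

```python
import re
from collections import deque

# Assumed URL shape of a personal homepage, following the ".../ID=aaa1" example.
ID_PATTERN = re.compile(r"/ID=(\w+)$")


def collect_producer_ids(start_url, links_of, max_depth):
    """Breadth-first walk from the homepage, at most max_depth links deep."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    ids = []
    while queue:
        url, depth = queue.popleft()
        match = ID_PATTERN.search(url)
        if match and match.group(1) not in ids:
            ids.append(match.group(1))
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for link in links_of(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return ids


# Assumed link structure of the target site (page URL -> outgoing links).
SITE_LINKS = {
    "http://www.xyz.com": [
        "http://www.xyz.com/movies",
        "http://www.xyz.com/ID=aaa1",
    ],
    "http://www.xyz.com/movies": [
        "http://www.xyz.com/ID=aaa2",
        "http://www.xyz.com/ID=aaa1",
    ],
}

found_ids = collect_producer_ids(
    "http://www.xyz.com", lambda url: SITE_LINKS.get(url, []), max_depth=2
)
print(found_ids)
```

In a real crawler, `links_of` would download the page and parse its links; it is passed in as a function here so the traversal can be exercised without network access.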

In practice, after the identifier of a content producer has been obtained and saved, the homepage corresponding to that identifier, that is, the corresponding personal homepage, is determined. Taking the content producer ID "aaa1" as an example, and supposing that the homepage corresponding to "aaa1" is "http://www.xyz.com/ID=aaa1", then "http://www.xyz.com/ID=aaa1" is determined to be the personal homepage corresponding to content producer "aaa1".
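
The one-to-one mapping from identifier to personal homepage can be sketched as a simple URL template. The `/ID=<id>` pattern is taken from the example above; the function names `homepage_url` and `homepages` are hypothetical, and a real site may use a different URL scheme.

```python
def homepage_url(site_root, producer_id):
    """Build the personal homepage URL for one content producer ID."""
    return f"{site_root.rstrip('/')}/ID={producer_id}"


def homepages(site_root, producer_ids):
    """Map each stored ID to exactly one personal homepage URL."""
    return {pid: homepage_url(site_root, pid) for pid in producer_ids}


pages = homepages("http://www.xyz.com", ["aaa1", "aaa2", "aaa3", "aaa4", "aaa5"])
print(pages["aaa1"])
```

Keeping the result in a dictionary keyed by ID makes the one-to-one correspondence explicit: each stored identifier resolves to a single homepage.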

S102: for each content producer, crawl, in that content producer's personal homepage, all data produced by the content producer.

Those skilled in the art will understand that the personal homepage so obtained contains all of the content producer's information, so extracting from it yields comprehensive information. For example, the personal homepage of content producer ID "aaa1" is "http://www.xyz.com/ID=aaa1", and this homepage contains all the data uploaded by content producer "aaa1". Searching this personal homepage yields all the data produced by "aaa1", which is then crawled in full. Crawling web page data is prior art and is not described further here.

It can be seen that, by applying the embodiment of Fig. 1 of the present invention, once the identification information of a content producer has been obtained, all data produced by that content producer can be crawled according to that identification information, so that comprehensive data is crawled.

Fig. 2 is another schematic flowchart of the data crawling method provided by an embodiment of the present invention, which adds S103 and S104 to the embodiment shown in Fig. 1.

S103: for each content producer, determine, according to all the crawled data produced by the content producer, the frequency at which the content producer produces data.

Those skilled in the art will understand that all the crawled data is analyzed after it has been extracted. For example, after all the data produced by content producer "aaa1" has been crawled, the frequency of its data updates is analyzed; supposing the data is updated once every 2 days, the crawling frequency for the personal homepage of content producer "aaa1" is set to once every 2 days.
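
The embodiment does not say how the update frequency is derived from the crawled data. One simple possibility, shown here as an assumption, is to take the median gap between the publication times of the crawled items; `estimate_update_interval` is a hypothetical name.

```python
from datetime import datetime, timedelta
from statistics import median


def estimate_update_interval(published_times):
    """Estimate how often a producer publishes, as the median gap between items."""
    times = sorted(published_times)
    if len(times) < 2:
        return None  # not enough history to estimate a frequency
    gaps_in_seconds = [
        (later - earlier).total_seconds()
        for earlier, later in zip(times, times[1:])
    ]
    return timedelta(seconds=median(gaps_in_seconds))


# Assumed publication times of the items already crawled for producer "aaa1".
history = [
    datetime(2016, 6, 1, 14, 0),
    datetime(2016, 6, 3, 14, 0),
    datetime(2016, 6, 5, 14, 0),
]
interval = estimate_update_interval(history)
print(interval)
```

The median is used rather than the mean so that a single long pause in uploads does not stretch the crawl interval; other estimators would fit the described step equally well.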

S104: at the determined frequency, crawl, in the content producer's personal homepage, the data produced by the content producer that has not yet been crawled.

For example, for the content producer with ID "aaa1", after its crawling frequency has been set to once every 2 days, suppose the most recent crawl took place at 14:00 on 5 June 2016; the next crawl then takes place at 14:00 on 7 June 2016, and it fetches only the data updated between 14:00 on 5 June 2016 and 14:00 on 7 June 2016, that is, an incremental crawl is performed. Incremental crawling itself is prior art and is not described further here.
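
The scheduling arithmetic in this example can be sketched as follows. The item structure (a dictionary with a `published` timestamp) and the names `next_crawl_time` and `items_since_last_crawl` are assumptions for illustration; the filtering shown is only one simple way of selecting the increment.

```python
from datetime import datetime, timedelta


def next_crawl_time(last_crawl, interval):
    """The next crawl is due one interval after the previous crawl."""
    return last_crawl + interval


def items_since_last_crawl(items, last_crawl, now):
    """Keep only items published after the previous crawl, up to the current one."""
    return [item for item in items if last_crawl < item["published"] <= now]


last_crawl = datetime(2016, 6, 5, 14, 0)
due = next_crawl_time(last_crawl, timedelta(days=2))

# Assumed listing of the personal homepage at the time of the next crawl.
listing = [
    {"title": "old", "published": datetime(2016, 6, 4, 9, 0)},
    {"title": "new-1", "published": datetime(2016, 6, 6, 8, 30)},
    {"title": "new-2", "published": datetime(2016, 6, 7, 13, 59)},
]
fresh = items_since_last_crawl(listing, last_crawl, due)
print(due, [item["title"] for item in fresh])
```

Only the two items published inside the window are selected, so the item already fetched by the earlier crawl is not downloaded again.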

It can be seen that, by applying the embodiment of Fig. 2 of the present invention, once the frequency at which a content producer produces data has been obtained, incremental crawling is carried out in the content producer's personal homepage at the determined frequency, which reduces the crawling frequency and the crawling workload while still ensuring that comprehensive data can be crawled.

Fig. 3 is yet another schematic flowchart of the data crawling method provided by an embodiment of the present invention, which adds S105 and S106 to the embodiment shown in Fig. 1.

S105: determine the priority of each content producer according to users' evaluation information on the crawled data produced by the content producer.

In practice, the crawled data often carries attention indicators, such as click counts, like counts and comment counts, which reflect how much attention the data receives. On a video website, for example, each video may have comments, user ratings, like counts and, as the opposite of likes, dislike scores. These attention indicators are obtained while the data is crawled, and one or more of them is used as the criterion for assigning priorities to videos. For example, priority may be assigned by like count, where more likes means higher priority; or by user rating, where a higher score means higher priority; or by dislike score, where a higher score means lower priority.

For example, when priority is assigned by click count, and the click counts for "aaa1", "aaa2", "aaa3", "aaa4" and "aaa5" are 900, 700, 300, 800 and 100 respectively, the priority order from high to low is: level 1 "aaa1", level 2 "aaa4", level 3 "aaa2", level 4 "aaa3", level 5 "aaa5". This is merely exemplary; the embodiments of the present invention do not limit the specific criterion for assigning priorities.
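
The ranking in this example amounts to sorting the content producers by click count in descending order. A minimal sketch, using the figures given above and the hypothetical name `rank_by_clicks`:

```python
def rank_by_clicks(click_counts):
    """Assign priority levels (1 = highest) by descending click count."""
    ordered = sorted(click_counts.items(), key=lambda pair: pair[1], reverse=True)
    return [(level, pid) for level, (pid, _clicks) in enumerate(ordered, start=1)]


clicks = {"aaa1": 900, "aaa2": 700, "aaa3": 300, "aaa4": 800, "aaa5": 100}
ranking = rank_by_clicks(clicks)
print(ranking)
```

Any other attention indicator could be substituted by changing the sort key, for example sorting in ascending order for a dislike score.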

S106: in order of priority from high to low, crawl, in the content producers' personal homepages, the data produced by the content producers that has not yet been crawled.

In practice, the content producers are sorted once their priorities have been assigned, and those with higher priority are crawled first. For example, the content producers with IDs "aaa1", "aaa2", "aaa3", "aaa4" and "aaa5" are sorted by priority, giving the order: level 1 "aaa1", level 2 "aaa4", level 3 "aaa2", level 4 "aaa3", level 5 "aaa5". When crawling starts, the data of the personal homepage http://www.xyz.com/ID=aaa1, which corresponds to the level 1 content producer "aaa1", is crawled first, followed in turn by level 2, level 3, and so on. The priority ordering given here is merely exemplary and does not constitute a limitation of the present invention.
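
Crawling in priority order can be sketched with a priority queue. The use of `heapq`, the `crawl` callback and the name `crawl_in_priority_order` are assumptions for illustration; the actual page download is stubbed out, since the embodiment treats crawling itself as prior art.

```python
import heapq


def crawl_in_priority_order(levels, homepage_of, crawl):
    """Visit personal homepages from priority level 1 downward."""
    queue = [(level, pid) for pid, level in levels.items()]
    heapq.heapify(queue)  # smallest level number is popped first
    visited = []
    while queue:
        _level, pid = heapq.heappop(queue)
        crawl(homepage_of(pid))
        visited.append(pid)
    return visited


levels = {"aaa1": 1, "aaa2": 3, "aaa3": 4, "aaa4": 2, "aaa5": 5}
fetched_urls = []  # stand-in for the real download step
order = crawl_in_priority_order(
    levels,
    lambda pid: f"http://www.xyz.com/ID={pid}",
    fetched_urls.append,
)
print(order)
```

A heap is convenient if priorities are recomputed between crawls, because newly ranked producers can be pushed onto the queue without re-sorting the whole list.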

To some extent, a higher priority reflects that the information is more popular and receives more attention, and people's desire to obtain it is generally stronger. To present such information to the public as early as possible, high-priority information is crawled first, which ensures the timeliness of the information.

It can be seen that, by applying the embodiment of Fig. 3 of the present invention, once the priorities of the content producers have been obtained, the personal homepages are crawled in order of the determined priority from high to low, which ensures that the information of high-priority content producers is crawled first.

Fig. 4 is a schematic structural diagram of a data crawling device provided by an embodiment of the present invention, which may include an acquisition module 201, a first determining module 202 and a first crawling module 203.

The acquisition module 201 is configured to obtain and store in advance identification information of at least one content producer.

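Specifically, in practice, the acquisition module 201 is specifically configured to:

extract and store identification information of at least one content producer from the results page of a keyword search;

or

extract and store identification information of at least one content producer from a target website, based on a homepage-depth crawling scheme.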

The first determining module 202 is configured to determine, according to the identification information of the at least one content producer, at least one content producer personal homepage in one-to-one correspondence with the content producers.

The first crawling module 203 is configured to crawl, for each content producer and in that content producer's personal homepage, all data produced by the content producer.

It can be seen that, by applying the embodiment shown in Fig. 4 of the present invention, once the identification information of a content producer has been obtained, all data produced by that content producer can be crawled according to that identification information, so that comprehensive data is crawled.

Fig. 5 is another schematic structural diagram of the data crawling device provided by an embodiment of the present invention. The embodiment shown in Fig. 5 adds a second determining module 204 and a second crawling module 205 to the embodiment shown in Fig. 4.

The second determining module 204 is configured to determine, for each content producer and according to all the crawled data produced by the content producer, the frequency at which the content producer produces data.

The second crawling module 205 is configured to crawl, at the determined frequency and in the content producer's personal homepage, the data produced by the content producer that has not yet been crawled.

It can be seen that, by applying the embodiment shown in Fig. 5 of the present invention, once the frequency at which a content producer produces data has been obtained, incremental crawling is carried out in the content producer's personal homepage at the determined frequency, which reduces the crawling frequency and the crawling workload while still ensuring that comprehensive data can be crawled.

Fig. 6 is yet another schematic structural diagram of the data crawling device provided by an embodiment of the present invention. The embodiment shown in Fig. 6 adds a third determining module 206 and a third crawling module 207 to the embodiment shown in Fig. 4.

The third determining module 206 is configured to determine the priority of each content producer according to users' evaluation information on the crawled data produced by the content producer.

The third crawling module 207 is configured to crawl, in order of priority from high to low and in the content producers' personal homepages, the data produced by the content producers that has not yet been crawled.

It can be seen that, by applying the embodiment shown in Fig. 6 of the present invention, once the priorities of the content producers have been obtained, the personal homepages are crawled in order of the determined priority from high to low, which ensures that the information of high-priority content producers is crawled first.

It should be noted that, in this document, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the statement "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.

The embodiments in this specification are described in a related manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for the relevant details, see the description of the method embodiments.

Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium; the storage medium referred to here is, for example, ROM/RAM, a magnetic disk or an optical disc.

The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims

1. A data crawling method, characterized in that identification information of at least one content producer is obtained and stored in advance; the method includes:

determining, according to the identification information of the at least one content producer, at least one content producer personal homepage in one-to-one correspondence with the content producers;

for each content producer, crawling, in that content producer's personal homepage, all data produced by the content producer.

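2. The method according to claim 1, characterized in that obtaining the identification information of a content producer includes:

extracting the identification information of the content producer from the results page of a keyword search;

or

extracting the identification information of the content producer from a target website, based on a homepage-depth crawling scheme.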

3. The method according to claim 1, characterized in that the method further includes:

for each content producer, determining, according to all the crawled data produced by the content producer, the frequency at which the content producer produces data;

at the determined frequency, crawling, in the content producer's personal homepage, the data produced by the content producer that has not yet been crawled.

4. The method according to claim 1, characterized in that the method further includes:

determining the priority of each content producer according to users' evaluation information on the crawled data produced by the content producer;

in order of priority from high to low, crawling, in the content producers' personal homepages, the data produced by the content producers that has not yet been crawled.

5. A data crawling device, characterized in that the device includes:

an acquisition module, configured to obtain and store in advance identification information of at least one content producer;

a first determining module, configured to determine, according to the identification information of the at least one content producer, at least one content producer personal homepage in one-to-one correspondence with the content producers;

a first crawling module, configured to crawl, for each content producer and in that content producer's personal homepage, all data produced by the content producer.

6. The device according to claim 5, characterized in that the acquisition module is specifically configured to:

extract and store identification information of at least one content producer from the results page of a keyword search;

or

extract and store identification information of at least one content producer from a target website, based on a homepage-depth crawling scheme.

7. The device according to claim 5, characterized in that the device further includes a second determining module and a second crawling module;

the second determining module is configured to determine, for each content producer and according to all the crawled data produced by the content producer, the frequency at which the content producer produces data;

the second crawling module is configured to crawl, at the determined frequency and in the content producer's personal homepage, the data produced by the content producer that has not yet been crawled.

8. The device according to claim 5, characterized in that the device further includes a third determining module and a third crawling module;

the third determining module is configured to determine the priority of each content producer according to users' evaluation information on the crawled data produced by the content producer;

the third crawling module is configured to crawl, in order of priority from high to low and in the content producers' personal homepages, the data produced by the content producers that has not yet been crawled.