CN110008393A - Method and device for obtaining website information
- Publication number
- CN110008393A (CN201910262577.2A)
- Authority
- CN
- China
- Prior art keywords
- webpage
- level
- data
- update amount
- webpage level
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The purpose of the application is to provide a method and device for obtaining website information. The application determines the webpage level information of a target website and, according to the webpage level information, determines a preset data collection interval and a preset data update amount for each webpage level; determines the data collection interval corresponding to each webpage level according to that level's preset data collection interval and preset data update amount; obtains the content information of each webpage level according to the data collection interval corresponding to that level; and determines the actual data update amount corresponding to each webpage level according to the content information of that level, so as to update the data collection interval corresponding to the webpage level. The interval at which webpage data is collected can thus be adjusted according to the update pattern of the website, so that webpage data is obtained in time, time resources are saved, and crawler efficiency is improved.
Description
Technical field
This application relates to the computer field, and in particular to a method and device for obtaining website information.
Background
The internet currently holds a massive amount of information, and obtaining that information requires web crawlers. When web crawler technology is used to collect information from a target website, a starting address page is generally found first and data is collected from it; the required address links are then found in the starting address page and data is collected from them in turn. However, because the target website updates its pages, the links to new entries generated on the relevant pages must be refreshed in time and added to the space of links to be collected. At the same time, a crawler collects a webpage far more efficiently than a database stores one. At present, most crawlers collect data cyclically under a default setting, for example by setting a time point and returning to the index page to check for updates when that time point is reached. However, updating on a fixed cycle wastes considerable resources: if the website has not updated by the set time point, the page is accessed several extra times, and if the website updates only after a collection round has finished, the collected data loses timeliness.
Summary of the invention
The purpose of the application is to provide a method and device for obtaining website information, solving the prior-art problems of low time efficiency, wasted resources and poor data timeliness when website data is obtained with crawler technology.
According to one aspect of the application, a method for obtaining website information is provided, the method comprising: determining the webpage level information of a target website, and determining a preset data collection interval and a preset data update amount for each webpage level according to the webpage level information;
determining the data collection interval corresponding to each webpage level according to the preset data collection interval and the preset data update amount of that webpage level;
obtaining the content information of each webpage level according to the data collection interval corresponding to that webpage level;
determining the actual data update amount corresponding to each webpage level according to the content information of that webpage level, so as to update the data collection interval corresponding to the webpage level.
Further, determining the preset data collection interval and preset data update amount of each webpage level according to the webpage level information comprises:
determining historical data collection intervals and the actual historical data update amounts corresponding to the webpage level according to the webpage level information;
determining the preset data collection interval and preset data update amount of each webpage level according to the historical data collection intervals and the actual historical data update amounts corresponding to the webpage level.
Further, after determining the actual data update amount corresponding to each webpage level according to the content information of each webpage level, the method comprises:
judging whether the difference between the actual data update amount corresponding to each webpage level and the preset data update amount of that webpage level is within a preset threshold, and if so, continuing to obtain the content information of each webpage level at the data collection interval corresponding to the webpage level.
Further, updating the data collection interval corresponding to the webpage level comprises:
judging whether the difference between the actual data update amount corresponding to each webpage level and the preset data update amount of that webpage level is within the preset threshold, and if not, updating the data collection interval corresponding to the webpage level according to the preset data collection interval, the preset data update amount and the actual data update amount of each webpage level.
Further, updating the data collection interval corresponding to the webpage level according to the preset data collection interval, the preset data update amount and the actual data update amount of each webpage level comprises:
determining the ratio of the preset data update amount to the actual data update amount of each webpage level;
updating the data collection interval corresponding to the webpage level according to the product of the ratio and the preset data collection interval of each webpage level.
Further, after obtaining the content information of each webpage level according to the data collection interval corresponding to the webpage level, the method comprises:
comparing the obtained content information of each webpage level with the data already in a database to determine a comparison result;
storing the obtained content information of each webpage level in the database according to the comparison result.
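As a non-limiting illustration, the compare-then-store step above might be sketched as follows. The sketch assumes that each collected item is keyed by its URL and uses an in-memory dictionary as a stand-in for the database; the function name `store_deduplicated` and both assumptions are hypothetical and are not specified by the application.

```python
def store_deduplicated(collected: dict[str, str], database: dict[str, str]) -> list[str]:
    """Store only the collected items that the database does not already hold.

    `collected` and `database` both map a URL to page content (an assumption
    made for this sketch). Returns the URLs that were newly stored.
    """
    # Comparison result: the items that are not yet present in the database.
    new_items = {url: content for url, content in collected.items() if url not in database}
    # Storing according to the comparison result: existing data is left untouched.
    database.update(new_items)
    return sorted(new_items)


database = {"http://www.AAA.me/book/1": "chapter list v1"}
collected = {
    "http://www.AAA.me/book/1": "chapter list v1",  # already stored
    "http://www.AAA.me/book/2": "chapter list v1",  # new entry
}
print(store_deduplicated(collected, database))
```

Keying by URL is only one possible notion of "already in the database"; a content hash could serve the same purpose where pages change without changing address.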
Further, determining the actual data update amount corresponding to each webpage level according to the content information of each webpage level comprises:
determining, according to the content information of each webpage level, the average of the data update amounts of the webpages collected at that webpage level, and using the average as the actual data update amount corresponding to that webpage level.
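As a non-limiting illustration, the averaging described above might be sketched as follows, assuming the update amount of each collected page is counted as the number of new entries found on it. The function name and the sample counts are hypothetical.

```python
def actual_update_amount(new_entries_per_page: list[int]) -> float:
    """Average the per-page update amounts collected at one webpage level."""
    if not new_entries_per_page:
        # Nothing was collected at this level, so no update was observed.
        return 0.0
    return sum(new_entries_per_page) / len(new_entries_per_page)


# Hypothetical counts of new entries found on three index pages in one round.
print(actual_update_amount([4, 6, 5]))  # 5.0
```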
Further, the method also comprises:
determining multiple data collection update amounts under the preset data collection interval;
setting the preset data update amount based on a comparison of the multiple data collection update amounts.
Further, after updating the data collection interval corresponding to the webpage level, the method comprises:
adjusting the updated data collection interval according to a preset range.
Further, adjusting the updated data collection interval according to the preset range comprises:
judging whether the updated data collection interval lies within the preset range, and if not, adjusting the updated data collection interval according to the upper bound or lower bound of the preset range.
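As a non-limiting illustration, adjusting an updated interval to the bounds of the preset range might be sketched as follows. The bounds of 60 and 3600 seconds are hypothetical values chosen for the example.

```python
def clamp_interval(updated: float, lower: float, upper: float) -> float:
    """Keep an updated collection interval inside the preset range [lower, upper]."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    # Below the range: use the lower bound. Above the range: use the upper bound.
    return min(max(updated, lower), upper)


print(clamp_interval(30.0, 60.0, 3600.0))    # 60.0
print(clamp_interval(7200.0, 60.0, 3600.0))  # 3600.0
print(clamp_interval(600.0, 60.0, 3600.0))   # 600.0
```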
Further, the webpage level information of the target website includes information on the hierarchical relationship determined by the homepage, index pages and entry pages.
Further, the method comprises:
recording the data collection interval of each webpage level.
According to another aspect of the application, a computer-readable medium is also provided, on which computer-readable instructions are stored, the computer-readable instructions being executable by a processor to implement the aforementioned method for obtaining website information.
According to yet another aspect of the application, a device for obtaining website information is also provided, wherein the device comprises:
one or more processors; and
a memory storing computer-readable instructions which, when executed, cause the processor to perform the operations of the aforementioned method for obtaining website information.
Compared with the prior art, the application determines the webpage level information of a target website and, according to the webpage level information, determines a preset data collection interval and a preset data update amount for each webpage level; determines the data collection interval corresponding to each webpage level according to that level's preset data collection interval and preset data update amount; obtains the content information of each webpage level according to the data collection interval corresponding to that level; and determines the actual data update amount corresponding to each webpage level according to the content information of that level, so as to update the data collection interval corresponding to the webpage level. The interval at which webpage data is collected can thus be adjusted according to the update pattern of the website, so that webpage data is obtained in time and time resources are saved. This improves the efficiency of parsing webpages and collecting data, and in turn balances that efficiency against the efficiency of storing data in the database.
Brief description of the drawings
Other features, objects and advantages of the application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 shows a flow diagram of a method for obtaining website information provided according to one aspect of the application;
Fig. 2 shows a framework diagram of a method for obtaining website information in one embodiment of the application.
The same or similar reference numerals in the drawings denote the same or similar components.
Detailed description
The application is described in further detail below with reference to the accompanying drawings.
In a typical configuration of this application, the terminal, the device of the service network and the trusted party each include one or more processors (CPU), input/output interfaces, network interfaces and memory.
The memory may include volatile memory in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
Fig. 1 shows a flow diagram of a method for obtaining website information provided according to one aspect of the application. The method comprises steps S1 to S4. In step S1, the webpage level information of the target website is determined, and the preset data collection interval and preset data update amount of each webpage level are determined according to the webpage level information. In step S2, the data collection interval corresponding to each webpage level is determined according to the preset data collection interval and the preset data update amount of that level. In step S3, the content information of each webpage level is obtained according to the data collection interval corresponding to that level. In step S4, the actual data update amount corresponding to each webpage level is determined according to the content information of that level, so as to update the data collection interval corresponding to the webpage level. In this way, the collection intervals for webpages of different levels can be adjusted according to the update pattern of the website, so that webpage data is obtained in time, time resources are saved, and crawler efficiency is improved.
Specifically, in step S1, the webpage level information of the target website is determined, and the preset data collection interval and preset data update amount of each webpage level are determined according to the webpage level information. Here, the webpages of a website may span multiple levels. Webpages at different levels update their content at different rates, and the preset data collection interval and preset data update amount of each webpage level are independent of those of the other levels. For a target website, its webpage level information is determined first; by classifying the webpages of the target website into levels, the preset data collection interval and preset data update amount of each level are determined. For example, if the webpages of a website are divided into levels K1 and K11, the preset data collection interval of K1 webpages is T' and their preset data update amount is N1, while the preset data collection interval of K11 webpages is T'' and their preset data update amount is N2. Initially, different webpage levels may have the same or different preset data collection intervals and the same or different preset data update amounts; that is, T' and T'' may or may not be equal, and N1 and N2 may or may not be equal.
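As a non-limiting illustration, the independent per-level presets of step S1 might be held in a structure such as the following. The class name `LevelPreset`, its field names and the numeric values standing in for T', T'', N1 and N2 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LevelPreset:
    """Preset values for one webpage level, kept independent of other levels."""
    interval_s: float      # preset data collection interval, in seconds
    update_amount: float   # preset data update amount, e.g. new entries per round


# Levels K1 and K11 from the example above, with made-up values.
presets = {
    "K1": LevelPreset(interval_s=3600.0, update_amount=20.0),   # T', N1
    "K11": LevelPreset(interval_s=600.0, update_amount=5.0),    # T'', N2
}

# Changing one level leaves the other untouched.
presets["K11"].interval_s = 300.0
print(presets["K1"].interval_s, presets["K11"].interval_s)  # 3600.0 300.0
```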
Preferably, the webpage level information of the target website includes information on the hierarchical relationship determined by the homepage, index pages and entry pages. In a preferred embodiment of the application, a start page is first determined for a target website; the start page for data collection is preferably the homepage of the target website. Collecting the information on the homepage yields the link addresses of some of the target website's popular entries and the link addresses of the index pages by which the target website classifies its entries. Crawling each index page yields the entry link addresses on that index page, thereby forming the three-level hierarchy of homepage, index pages and entry pages. For example, for a novel website, the homepage is http://www.AAA.me/, an index page is http://www.AAA.me/sort/23, and each novel is an entry page. For a search engine website, the homepage is http://www.XXX.com, an index page is http://www.XXX.com/category/TreeManage.jsp, and each search result page is an entry page.
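As a non-limiting illustration, assigning a page to one of the three levels might be sketched by inspecting its URL path. The sketch assumes, based only on the novel-website example above, that index pages live under a `/sort/` path; that prefix, the function name and the entry URL used in the example are hypothetical, and a real website would need its own rules.

```python
from urllib.parse import urlparse


def classify_page(url: str, index_prefix: str = "/sort/") -> str:
    """Classify a URL as 'homepage', 'index' or 'entry' from its path."""
    path = urlparse(url).path
    if path in ("", "/"):
        return "homepage"
    if path.startswith(index_prefix):
        return "index"
    # Everything else is treated as an entry page in this sketch.
    return "entry"


print(classify_page("http://www.AAA.me/"))           # homepage
print(classify_page("http://www.AAA.me/sort/23"))    # index
print(classify_page("http://www.AAA.me/book/1001"))  # entry
```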
Then, in step S2, the data collection interval corresponding to each webpage level is determined according to the preset data collection interval and the preset data update amount of that level. Here, for multiple webpages at the same webpage level, the preset data collection interval is the same and the preset data update amount is also the same. Different webpage levels have independent data collection intervals, each determined from the preset data collection interval and preset data update amount of its own level, so that the webpage data of the corresponding level can be obtained in time at the determined interval and resources are saved.
Then, in step S3, the content information of each webpage level is obtained according to the data collection interval corresponding to that level. In a preferred embodiment of the application, the webpages of the target website are classified into three levels: homepage, index pages and entry pages. The homepage corresponds to data collection interval T10, the index pages to data collection interval T20, and the entry pages to data collection interval T30. When the content information of the homepage level is obtained, the content information of the single page or multiple pages at the homepage level is obtained once every T10; when the content information of the index page level is obtained, the content information of the multiple pages at the index page level is obtained once every T20; when the content information of the entry page level is obtained, the content information of the multiple pages at the entry page level is obtained once every T30. Here, T10, T20 and T30 may or may not be equal.
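As a non-limiting illustration, deciding which levels are due for collection under their own intervals might be sketched as follows. Times are expressed in seconds, and the function name and the values standing in for T10, T20 and T30 are hypothetical.

```python
def due_levels(now: float, last_collected: dict[str, float],
               intervals: dict[str, float]) -> list[str]:
    """Return the levels whose own interval has elapsed since their last collection."""
    return [
        level
        for level, interval in intervals.items()
        # A level never collected before is always due.
        if now - last_collected.get(level, float("-inf")) >= interval
    ]


intervals = {"homepage": 300.0, "index": 900.0, "entry": 3600.0}  # T10, T20, T30
last_collected = {"homepage": 0.0, "index": 0.0, "entry": 0.0}
print(due_levels(1000.0, last_collected, intervals))  # ['homepage', 'index']
```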
Then, in step S4, the actual data update amount corresponding to each webpage level is determined according to the content information of that level, so as to update the data collection interval corresponding to the webpage level. Here, within a given period, the data update amounts of webpages at different levels differ, and the updated interval is calculated independently for each level from the data update amount of that level's webpages.
Continuing the above embodiment, the content information of the single page or multiple pages at the homepage level is obtained once every T10, and the actual data update amount corresponding to the homepage level is determined from that content information; the content information of the multiple pages at the index page level is obtained once every T20, and the actual data update amount corresponding to the index page level is determined from that content information; the content information of the multiple pages at the entry page level is obtained once every T30, and the actual data update amount corresponding to the entry page level is determined from that content information. The data collection interval is adjusted dynamically according to the actual data update amount of the webpages, so as to avoid needlessly accessing a page several extra times, or collecting page content only after it has lost timeliness.
In this application, layering the pages before collecting data, presetting the data collection intervals and obtaining the data update amounts make it possible to update the data collection interval of each page level according to the website's update pattern, so that the webpage data of each level is obtained in time. Compared with the prior art, this saves the time resources needed to update the data collection interval, obtain the target website's data and store it in the database, thereby balancing the efficiency of parsing webpages, the efficiency of obtaining data and the efficiency of storing data in the database.
Preferably, in step S2, historical data collection intervals and the actual historical data update amounts corresponding to the webpage level are determined according to the webpage level information, and the preset data collection interval and preset data update amount of each webpage level are determined according to those historical data collection intervals and actual historical data update amounts. Here, when data is collected from a website, the preset data collection interval can be determined from the website's historical data, such as its historical data update pattern. For example, if a novel website regularly updates at 9 a.m., the preset data collection time can be set to 9 a.m. The preset data update amount is specifically the update amount of the number of entries and is set according to the data update amounts historically collected from the website. Specifically, multiple data collection update amounts can be determined under the preset data collection interval, and the preset data update amount can be set based on a comparison of those multiple data collection update amounts. Here, the update amounts of several data collections at a given time interval can be compared and used as a reference for setting the data update amount. For example, the update amounts of two data collections are chosen as a reference: a first round collects data for a period of time, a second round collects data after a certain time has elapsed, and the two are compared as a reference for setting the preset data update amount.
In one embodiment of the application, the obtained webpage level information includes information on the hierarchical relationship determined by the homepage, index pages and entry pages. For a target website, a start page is first determined; the start page for data collection is preferably the homepage of the target website. Collecting the information on the homepage yields the link addresses of some of the target website's popular entries and the link addresses of the index pages by which the target website classifies its entries. Crawling each index page yields the entry link addresses on that index page, preferably forming a three-level hierarchy of homepage, index pages and entry pages. The historical data collection intervals and actual historical data update amounts corresponding to the homepage level, to the index page level and to the entry page level are then obtained. In a preferred embodiment of the application, the historical data collection intervals of the homepage are T11, T12, T13 and T14, and the collection duration is set to Th1; after the duration Th1 has elapsed, the actual historical data update amounts obtained by collecting the homepage at the intervals T11, T12, T13 and T14 are A11, A12, A13 and A14 respectively.
Then, under different data collection intervals and the same collection duration, the multiple actual data update amounts collected at the multiple data collection intervals are obtained, and the preset data update amount and preset data collection interval corresponding to the page level are obtained from the result of comparing those actual data update amounts. Continuing the above embodiment, the preset data update amount A1 and the preset data collection interval T1 of the homepage are obtained from the comparison of A11, A12, A13 and A14. Here, the comparison of A11, A12, A13 and A14 includes, but is not limited to, grouping them into A11, A12 and A13, A14 and comparing them pairwise; the manner of comparison may be designed freely. Here, multiple historical data collection intervals and corresponding actual historical data update amounts are obtained. The historical data collection intervals and actual historical data update amounts are obtained separately for the page levels of the different layers in order to obtain the corresponding preset data update amounts. That is, the preset data update amount of each page level is set independently according to that level's historical data collection intervals and actual historical data update amounts, and the preset data update amounts obtained for pages of different levels may or may not be equal.
Preferably, after the actual data update amount corresponding to each webpage level is determined according to the content information of that level, it is judged whether the difference between the actual data update amount corresponding to each webpage level and the preset data update amount of that level is within a preset threshold; if so, the content information of each webpage level continues to be obtained at the data collection interval corresponding to the webpage level. Here, a collection interval is preset and the crawler collects data from the website at that interval, while a data update amount is preset as a reference for comparison. The amount of data actually collected within the preset interval is compared with the preset data update amount. If the difference between the two is within the preset threshold, for example if they are equal, the preset collection interval is reasonable and within the required range, and the crawler continues to collect webpage data at that interval. If the difference exceeds the preset threshold, the actual interval of data collection needs to be adjusted according to the preset data collection interval, the preset data update amount and the actual data update amount.
Specifically, a preset threshold is set for the difference between the actual data update amount corresponding to each webpage level and the preset data update amount of that level, where the preset threshold of each webpage level is set independently for that level. Methods of setting the preset threshold include, but are not limited to, setting it according to the webpage level information and the historical data update amount corresponding to each webpage level. Setting a preset threshold for each webpage level avoids the waste of resources caused by updating the data collection interval unnecessarily, and restricts the conditions under which the data collection interval needs to be updated, thereby saving time resources while ensuring the timeliness of the data.
Continuing the above embodiment, after the preset data collection interval T1 and the preset data update amount A1 of the homepage have been determined, the data of the single or multiple pages at the homepage level is collected at the preset interval T1, yielding the corresponding actual data update amount A2. Within the same interval T1, an actual data update amount A2 greater than the preset data update amount A1 indicates that data timeliness has been lost; within the same interval T1, an actual data update amount A2 less than the preset data update amount A1 indicates that the crawler has repeatedly fetched some duplicate information, wasting resources. When the difference between the actual data update amount A2 and the preset data update amount A1 is within the preset threshold corresponding to the homepage level, the preset data collection interval T1 is not adjusted; that is, the data of the single or multiple pages at the homepage level continues to be collected at the preset data collection interval T1, which saves resources and ensures data timeliness.
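As a non-limiting illustration, the threshold judgment for one level might be sketched as follows, assuming the "difference" is taken as an absolute difference. That reading, the function name and the sample numbers standing in for A1, A2 and the threshold are assumptions made for the sketch.

```python
def within_threshold(actual: float, preset: float, threshold: float) -> bool:
    """True if the actual and preset update amounts differ by at most `threshold`."""
    return abs(actual - preset) <= threshold


A1, threshold = 10.0, 2.0                   # hypothetical preset amount and threshold
print(within_threshold(11.0, A1, threshold))  # True: keep collecting at T1
print(within_threshold(20.0, A1, threshold))  # False: too many updates per round
print(within_threshold(5.0, A1, threshold))   # False: too few updates per round
```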
Preferably, in step S4, the ratio of the preset data update amount to the actual data update amount of each webpage level is determined, and the data collection interval corresponding to the webpage level is updated according to the product of that ratio and the preset data collection interval of the level.
Specifically, taking the collection of homepage-level data as an example, the preset data collection interval T1 and the preset data update amount A1 of the single or multiple webpages at the homepage level are determined, and the actual data update amount A2 of that webpage level is obtained by collecting at the interval T1. Within the same interval T1, an actual data update amount A2 greater than the preset data update amount A1 indicates that data timeliness has been lost; within the same interval T1, an actual data update amount A2 less than the preset data update amount A1 indicates that the crawler has repeatedly fetched duplicate homepage information, wasting resources. When the difference between the actual data update amount A2 and the preset data update amount A1 is not within the preset threshold corresponding to the homepage level, the preset data collection interval T1 needs to be adjusted.
Next, preferably, in order to avoid wasting resources on repeatedly collecting page information while also guaranteeing the timeliness of the collected data, over the same collection duration the product of the preset data collection interval and the preset data update amount of each webpage level is made equal to the product of the actual data collection interval and the actual data update amount of that level. It follows that, by determining the ratio of the preset data update amount to the actual data update amount of the webpage level and updating the data collection interval of the level according to the product of that ratio and the level's preset data collection interval, the following update equation is obtained for each webpage level after rearrangement:
$$T_{\text{updated}} = \frac{A_{\text{preset}}}{A_{\text{actual}}} \times T_{\text{preset}}$$
where the updated data collection interval $T_{\text{updated}}$, the preset data collection interval $T_{\text{preset}}$, the preset data update amount $A_{\text{preset}}$ and the actual data update amount $A_{\text{actual}}$ are all obtained for the single or multiple webpages of the same webpage level over the same collection duration.
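As a non-limiting illustration, the update rule described above (the ratio of preset to actual update amount, multiplied by the preset interval) might be sketched as follows. The function name and the numeric values are hypothetical, and raising an error when no updates were observed is a choice made for the sketch, since the application does not state how that case is handled.

```python
def updated_interval(preset_interval: float, preset_amount: float,
                     actual_amount: float) -> float:
    """Scale the preset interval by the ratio preset amount / actual amount."""
    if actual_amount <= 0:
        # Assumption of this sketch: the case of zero observed updates is rejected.
        raise ValueError("actual update amount must be positive")
    return preset_amount / actual_amount * preset_interval


T1, A1 = 600.0, 10.0                   # hypothetical preset interval (s) and amount
print(updated_interval(T1, A1, 20.0))  # 300.0: more updates than expected, collect sooner
print(updated_interval(T1, A1, 5.0))   # 1200.0: fewer updates than expected, collect later
```

The two results show the intended behaviour: the interval shrinks when the level updates faster than preset, and grows when it updates more slowly.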
Continuing the above embodiment, when the webpages at the homepage level are collected at the same preset data collection interval T1 and the difference between the actual data update amount A2 obtained from the single or multiple pages of that level and the preset data update amount A1 is not within the preset threshold corresponding to the homepage level, the ratio of the preset data update amount A1 to the actual data update amount A2 is determined and multiplied by the preset data collection interval T1, so that the updated data collection interval is $\frac{A1}{A2} \times T1$. The next time data is collected from the single or multiple pages at the homepage level of the target website, the data collection interval $\frac{A1}{A2} \times T1$ is used. Updating the data collection interval corresponding to the pages of each level saves time resources and maintains the timeliness of the data.
Preferably, after the content information of each webpage level is obtained according to the data collection interval corresponding to the webpage level, the obtained content information of each webpage level is compared with the data already in the database to determine a comparison result, and the obtained content information of each webpage level is stored in the database according to the comparison result.
Specifically, in a preferred embodiment of the application, as shown in Fig. 2, a data management
module is provided. The data management module compares the obtained content information of each
web page level with the existing data in the database to determine the comparison result. If the
comparison result is that the obtained content information contains data already in the database,
the data management module deduplicates the obtained content information before storing it in the
database, where deduplication means deleting the already-stored data from the obtained content
information. If the comparison result is that the obtained content information contains no data
already in the database, the data management module stores the obtained content information of
each web page level in the database. Providing the data management module simplifies storage
handling of the obtained content information, and deduplicating the content information of each
web page level through the module further improves the efficiency of storing content information
in the database.
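The compare-then-store behaviour of the data management module can be sketched as below. This is
only an illustration under stated assumptions: content items are treated as hashable strings, the
database is modelled as an in-memory `set`, and the function name `store_new_items` is
hypothetical. A real deployment would compare against a persistent store.

```python
def store_new_items(fetched_items, database):
    """Store only the fetched items that the database does not hold yet.

    Returns the newly stored items in the order they were fetched.
    """
    newly_stored = []
    for item in fetched_items:
        if item in database:
            # Already stored: drop it from the fetched content (deduplication).
            continue
        database.add(item)
        newly_stored.append(item)
    return newly_stored


# Hypothetical content: two items are already stored, two are new.
existing = {"item-a", "item-b"}
print(store_new_items(["item-b", "item-c", "item-d"], existing))
```

Because each stored item is added to the set immediately, repeats inside a single fetched batch
are also dropped, which goes slightly beyond what the text describes.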
Preferably, in step S4, the average of the data update amounts of the individual web pages under a
web page level is determined from the content information of that level, and this average is used
as the actual data update amount of the level. Here, each web page level has its own data update
amount, and for pages of the same level the data update amount is taken as the average of the
per-page data update amounts under that level, for example an arithmetic mean or a weighted
average. In one embodiment of the application, an e-commerce shopping home page has different
index pages, such as a menswear category, a womenswear category, and a consumer-electronics
category; these categories are treated as one level, computed separately, and their arithmetic
mean is then used as the time interval under the home-page level.
Specifically, in a preferred embodiment of the application, take the index-page level as an
example. The index pages comprise multiple pages, each containing the link addresses of multiple
entry pages. When the index pages of the target website comprise X pages, the data update amount
of the X pages within the index-page-level data collection interval T20 is determined from the
collected content information to be Y. Preferably, the arithmetic mean Y/X is obtained from the
data update amount Y of the X pages, and Y/X is used as the actual data update amount of the X
pages of the index-page level. Those skilled in the art will appreciate that determining the
average of the per-page data update amounts under a web page level by taking the arithmetic mean
is only one preferred embodiment of the application; other methods of determining this average
are also included here.
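The averaging step above might look like the following sketch, which supports both the arithmetic
mean and a weighted average. The function name `level_update_amount` and the choice of weights
are assumptions for illustration; the source does not say how weights would be chosen.

```python
def level_update_amount(per_page_amounts, weights=None):
    """Average the per-page update amounts of one page level."""
    if not per_page_amounts:
        raise ValueError("at least one page is required")
    if weights is None:
        # Arithmetic mean: total update amount divided by the number of pages.
        return sum(per_page_amounts) / len(per_page_amounts)
    if len(weights) != len(per_page_amounts):
        raise ValueError("weights and per_page_amounts must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted average: each page contributes in proportion to its weight.
    weighted_sum = sum(a * w for a, w in zip(per_page_amounts, weights))
    return weighted_sum / total_weight


# Hypothetical update counts for three index pages of the same level.
print(level_update_amount([10, 20, 30]))
print(level_update_amount([10, 30], weights=[1, 3]))
```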
Preferably, after the data collection interval of a web page level is updated, the updated data
collection interval may further be adjusted according to a preset range.
Specifically, a corresponding preset range is set for the data collection interval of each web
page level of the target website. The preset range prevents the data collection interval from
becoming too small, so that data collection does not overload the target website and cause its
pages to crash. It also prevents the data collection interval from becoming too large, so that the
link pool for pages of the next level continues to be replenished in time.
Preferably, the updated data collection interval may be adjusted according to the preset range as
follows: determine whether the updated data collection interval lies within the preset range and,
if not, adjust it according to the upper or lower bound of the preset range. That is, an updated
data collection interval that falls outside the preset range is set to the upper bound or the
lower bound of that range.
Specifically, the preset range is set separately for each page level according to its place in the
page hierarchy. Setting it separately means that the preset range of the data collection interval
is set according to the historical data collection intervals and historical data update amounts of
that page level together with the historical content information of the target website. In a
preferred embodiment of the application, take the multiple pages of the index-page level of the
target website as an example. The minimum data collection interval T00 that will not cause the
index pages to crash is used as the lower limit of the preset range, and the maximum time interval
T01, obtained from the historical data collection intervals of the index pages with reference to
the historical data update amounts, is used as the upper limit. Let the updated data collection
interval be T60. When T60 is less than the lower limit T00, the updated data collection interval
is adjusted to T00; when T60 is greater than the upper limit T01, it is adjusted to T01. This
prevents data collection from overloading the target website and crashing its pages, and also
prevents the index-page data collection interval from becoming too large, so that the link pool
for the entry pages of the next level continues to be replenished in time.
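The adjustment against the preset range amounts to clamping the updated interval between the
lower limit T00 and the upper limit T01. A minimal sketch follows; the function name
`clamp_interval` and the sample values are illustrative assumptions.

```python
def clamp_interval(updated_interval, lower_limit, upper_limit):
    """Keep the updated collection interval inside [lower_limit, upper_limit]."""
    if lower_limit > upper_limit:
        raise ValueError("lower_limit must not exceed upper_limit")
    # Below the range: raise to the lower limit (T00).
    # Above the range: cut back to the upper limit (T01).
    return min(max(updated_interval, lower_limit), upper_limit)


# Hypothetical limits in minutes: T00 = 10, T01 = 100.
print(clamp_interval(5, 10, 100))    # too small, raised to the lower limit
print(clamp_interval(500, 10, 100))  # too large, cut to the upper limit
print(clamp_interval(50, 10, 100))   # already inside the range, unchanged
```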
Preferably, the method further comprises step S5: recording the data collection interval of each
web page level. Specifically, as shown in Fig. 2, in one embodiment of the application a data
acquisition module is provided. For each page level, the data acquisition module sets the
corresponding preset data collection interval and preset data update amount, and for the one or
more pages of each level it collects the corresponding web page data with a crawler at the preset
time interval. It then adjusts the data collection interval according to the preset range of each
page level and the actual data update amount of each page level, and records the data collection
interval of each web page level, thereby building up historical data collection interval data.
This history supplies the reference data for setting the preset data collection interval of each
level the next time, and makes it easier to adjust the preset range of each page level's data
collection interval. It helps in learning the update pattern of the target website and in updating
the data collection interval of each page level based on that pattern, so that the web page data
of each level is obtained in time, time resources are saved, and the efficiency of parsing pages,
obtaining data, and storing data in the database is balanced.
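One way to keep the interval history described in step S5 is sketched below. The class name
`IntervalHistory` is hypothetical, and deriving the next preset interval as the mean of past
intervals is an assumption made for this example; the source only states that the history is used
when setting the next preset interval.

```python
from statistics import fmean


class IntervalHistory:
    """Records the collection intervals used for each page level."""

    def __init__(self):
        self._intervals = {}

    def record(self, level, interval):
        # Append the interval used in this cycle to the level's history.
        self._intervals.setdefault(level, []).append(interval)

    def history(self, level):
        # Return a copy so callers cannot modify the stored history.
        return list(self._intervals.get(level, []))

    def suggest_preset(self, level, default):
        # Assumption: use the mean of past intervals as the next preset value.
        past = self._intervals.get(level)
        return fmean(past) if past else default


log = IntervalHistory()
log.record("index", 30)
log.record("index", 60)
print(log.history("index"), log.suggest_preset("index", default=10))
```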
In addition, an embodiment of the application provides a computer-readable medium storing
computer-readable instructions that can be executed by a processor to implement the aforementioned
method for obtaining website information.
According to another aspect of the application, a device for obtaining website information is also
provided, wherein the device comprises:
one or more processors; and
a memory storing computer-readable instructions that, when executed, cause the processor to
perform the operations of the aforementioned method for obtaining website information.
For example, the computer-readable instructions, when executed, cause the one or more processors
to:
determine the web page level information of a target website, and determine the preset data
collection interval and preset data update amount of each web page level according to the web page
level information;
determine the data collection interval of each web page level according to the preset data
collection interval and the preset data update amount of that level;
obtain the content information of each web page level according to the data collection interval of
that level;
determine the actual data update amount of each web page level according to the content
information of that level, so as to update the data collection interval of that level.
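The four operations listed above can be tied together in one collection cycle, sketched below.
This is a simplified illustration under stated assumptions: the crawler is replaced by a
caller-supplied function `measure_update_amount`, each level starts from its preset interval, and
the names and tuple layout are hypothetical.

```python
def crawl_cycle(presets, measure_update_amount):
    """Run one collection cycle and return the next interval for each page level.

    presets maps a level name to (preset_interval, preset_amount, threshold).
    measure_update_amount(level, interval) stands in for the crawler: it collects
    the level's pages at the given interval and returns the observed update amount.
    """
    next_intervals = {}
    for level, (preset_interval, preset_amount, threshold) in presets.items():
        # Assumption: the first cycle uses the preset interval as the collection interval.
        interval = preset_interval
        # Collect content and derive the actual update amount for this level.
        actual_amount = measure_update_amount(level, interval)
        # Rescale only when the difference exceeds the preset threshold.
        if abs(actual_amount - preset_amount) > threshold and actual_amount > 0:
            interval = preset_interval * preset_amount / actual_amount
        next_intervals[level] = interval
    return next_intervals


# Hypothetical presets and observations for two page levels.
presets = {"home": (60.0, 100, 10), "index": (30.0, 50, 5)}
observed = {"home": 50, "index": 52}
print(crawl_cycle(presets, lambda level, interval: observed[level]))
```

Passing the crawler in as a function keeps the scheduling logic testable without network access.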
Obviously, those skilled in the art can make various modifications and variations to the
application without departing from its spirit and scope. Thus, if these modifications and
variations fall within the scope of the claims of the application and their technical equivalents,
the application is intended to include them as well.
It should be noted that the application may be implemented in software and/or a combination of
software and hardware, for example using an application-specific integrated circuit (ASIC), a
general-purpose computer, or any other similar hardware device. In one embodiment, the software
program of the application may be executed by a processor to implement the steps or functions
described above. Likewise, the software program of the application (including related data
structures) may be stored in a computer-readable recording medium, for example RAM, a magnetic or
optical drive, a floppy disk, or a similar device. In addition, some steps or functions of the
application may be implemented in hardware, for example as circuitry that cooperates with a
processor to perform the steps or functions.
In addition, part of the application may be applied as a computer program product, for example
computer program instructions which, when executed by a computer, can invoke or provide the method
and/or technical solution of the application through the operation of that computer. The program
instructions that invoke the method of the application may be stored in a fixed or removable
recording medium, and/or transmitted via a data stream in broadcast or other signal-bearing media,
and/or stored in the working memory of a computer device that runs according to the program
instructions. Here, one embodiment of the application comprises an apparatus that includes a
memory for storing computer program instructions and a processor for executing them, wherein, when
the computer program instructions are executed by the processor, the apparatus is triggered to run
the methods and/or technical solutions based on the foregoing embodiments of the application.
It will be evident to those skilled in the art that the application is not limited to the details
of the exemplary embodiments above, and that it can be implemented in other specific forms without
departing from its spirit or essential characteristics. The embodiments should therefore be
regarded in all respects as illustrative and not restrictive, the scope of the application being
defined by the appended claims rather than by the above description, and all changes that fall
within the meaning and range of equivalents of the claims are intended to be embraced in the
application. No reference sign in the claims should be construed as limiting the claim concerned.
Furthermore, the word "comprising" does not exclude other units or steps, and the singular does
not exclude the plural. Multiple units or devices recited in a device claim may also be
implemented by a single unit or device through software or hardware. Words such as "first" and
"second" are used to denote names and do not indicate any particular order.
Claims (14)
1. A method for obtaining website information, wherein the method comprises:
determining web page level information of a target website, and determining a preset data
collection interval and a preset data update amount of each web page level according to the web
page level information;
determining the data collection interval corresponding to each web page level according to the
preset data collection interval and the preset data update amount of that web page level;
obtaining content information of each web page level according to the data collection interval
corresponding to that web page level;
determining the actual data update amount corresponding to each web page level according to the
content information of that web page level, so as to update the data collection interval
corresponding to that web page level.
2. The method according to claim 1, wherein determining the preset data collection interval and
the preset data update amount of each web page level according to the web page level information
comprises:
determining a historical data collection interval and an actual historical data update amount
corresponding to the web page level according to the web page level information;
determining the preset data collection interval and the preset data update amount of each web page
level according to the historical data collection interval and the actual historical data update
amount corresponding to the web page level.
3. The method according to claim 1, wherein, after determining the actual data update amount
corresponding to each web page level according to the content information of that web page level,
the method comprises:
judging whether the difference between the actual data update amount corresponding to each web
page level and the preset data update amount of that web page level is within a preset threshold
and, if so, continuing to obtain the content information of each web page level at the data
collection interval corresponding to that web page level.
4. The method according to claim 1, wherein updating the data collection interval corresponding to
the web page level comprises:
judging whether the difference between the actual data update amount corresponding to each web
page level and the preset data update amount of that web page level is within a preset threshold
and, if not, updating the data collection interval corresponding to the web page level according
to the preset data collection interval, the preset data update amount, and the actual data update
amount of that web page level.
5. The method according to claim 4, wherein updating the data collection interval corresponding to
the web page level according to the preset data collection interval, the preset data update
amount, and the actual data update amount of each web page level comprises:
determining the ratio of the preset data update amount to the actual data update amount of each
web page level;
updating the data collection interval corresponding to the web page level according to the product
of the ratio and the preset data collection interval of that web page level.
6. The method according to claim 1, wherein, after obtaining the content information of each web
page level according to the data collection interval corresponding to that web page level, the
method comprises:
comparing the obtained content information of each web page level with the existing data in a
database to determine a comparison result;
storing the obtained content information of each web page level in the database according to the
comparison result.
7. The method according to claim 1, wherein determining the actual data update amount
corresponding to each web page level according to the content information of that web page level
comprises:
determining, according to the content information of each web page level, the average of the data
update amounts of the individual web pages under that web page level, and using the average as the
actual data update amount corresponding to that web page level.
8. The method according to claim 1, wherein the method further comprises:
determining multiple data update amounts collected at the preset data collection interval;
setting the preset data update amount based on a comparison of the multiple data update amounts.
9. The method according to claim 1, wherein, after updating the data collection interval
corresponding to the web page level, the method comprises:
adjusting the updated data collection interval according to a preset range.
10. The method according to claim 9, wherein adjusting the updated data collection interval
according to the preset range comprises:
judging whether the updated data collection interval lies within the preset range and, if not,
adjusting the updated data collection interval according to the upper bound or the lower bound of
the preset range.
11. The method according to claim 1, wherein the web page level information of the target website
comprises information on the hierarchical relationship determined by home pages, index pages, and
entry pages.
12. The method according to claim 1, wherein the method comprises:
recording the data collection interval of each web page level.
13. A computer-readable medium storing computer-readable instructions that can be executed by a
processor to implement the method according to any one of claims 1 to 12.
14. A device for obtaining website information, wherein the device comprises:
one or more processors; and
a memory storing computer-readable instructions that, when executed, cause the processor to
perform the operations of the method according to any one of claims 1 to 12.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2018116401467 | 2018-12-29 | ||
CN201811640146 | 2018-12-29 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110008393A (en) | 2019-07-12 |
CN110008393B (en) | 2023-03-07 |
Family
ID=67169789
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910262577.2A Active CN110008393B (en) | 2018-12-29 | 2019-04-02 | Method and equipment for acquiring website information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110008393B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090112862A1 (en) * | 2007-10-26 | 2009-04-30 | G&G Commerce Ltd. | Image-based search system and method |
CN109033195A (en) * | 2018-06-28 | 2018-12-18 | 上海盛付通电子支付服务有限公司 | The acquisition methods of webpage information obtain equipment and computer-readable medium |
Non-Patent Citations (1)
Title |
---|
Zhang Feng: "Research on web page information collection based on URL and web page type", 《电子制作》 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110879818A (en) * | 2019-10-12 | 2020-03-13 | 北京字节跳动网络技术有限公司 | Method, device, medium and electronic equipment for acquiring data |
CN112100472A (en) * | 2020-09-11 | 2020-12-18 | 深圳市科盾科技有限公司 | Crawler scheduling method and device, terminal equipment and readable storage medium |
CN112100472B (en) * | 2020-09-11 | 2023-11-28 | 深圳市科盾科技有限公司 | Crawler scheduling method, crawler scheduling device, terminal equipment and readable storage medium |
CN113992378A (en) * | 2021-10-22 | 2022-01-28 | 绿盟科技集团股份有限公司 | Safety monitoring method and device, electronic equipment and storage medium |
CN113992378B (en) * | 2021-10-22 | 2023-11-07 | 绿盟科技集团股份有限公司 | Security monitoring method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110008393B (en) | 2023-03-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |
 | TR01 | Transfer of patent right | Effective date of registration: 20230921. Address after: No. 106 Fengze East Road, Nansha District, Guangzhou City, Guangdong Province, 511457 (self made Building 1) X1301-B4056 (cluster registration) (JM). Patentee after: Semantic Intelligent Technology (Guangzhou) Co.,Ltd. Address before: 201203 Shanghai Pudong New Area free trade trial area, 1 spring 3, 400 Fang Chun road. Patentee before: YIYU INTELLIGENT TECHNOLOGY (SHANGHAI) CO.,LTD. |