CN110008393A - Method and apparatus for obtaining website information - Google Patents

Method and apparatus for obtaining website information

Info

Publication number
CN110008393A
Authority
CN
China
Prior art keywords
webpage
rank
data
renewal amount
webpage rank
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910262577.2A
Other languages
Chinese (zh)
Other versions
CN110008393B (en)
Inventor
孟孔明
舒畅
李竹桥
陆晨昱
刘尧
郑思璇
李先云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Semantic Intelligent Technology Guangzhou Co ltd
Original Assignee
Yi Language Intelligent Technology (shanghai) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yi Language Intelligent Technology (shanghai) Co Ltd
Publication of CN110008393A
Application granted
Publication of CN110008393B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application provides a method and apparatus for obtaining website information. The method determines webpage level information of a target website and, according to the webpage level information, determines a preset data collection interval and a preset data update amount for each webpage level; determines the data collection interval corresponding to each webpage level according to that level's preset data collection interval and preset data update amount; obtains the content information of each webpage level at the data collection interval corresponding to that level; and determines the actual data update amount corresponding to each webpage level according to the content information of that level, so as to update the data collection interval corresponding to that level. The interval at which webpage data is collected can thus be adjusted according to the update pattern of the website, so that webpage data is obtained promptly, time resources are saved, and crawler efficiency is improved.

Description

Method and apparatus for obtaining website information
Technical field
The present application relates to the field of computers, and in particular to a method and apparatus for obtaining website information.
Background
The Internet currently holds a massive amount of information, and obtaining it requires web crawlers. When web crawler technology is used to collect information from a target website, an initial address page is generally located first and its data collected; the required links are then found in the initial address page and their data collected in turn. However, because the target website updates its webpages, the links to new entries generated by the relevant pages must be refreshed promptly and added to the space of links to be collected. At the same time, a crawler collects a webpage far more efficiently than a database stores one. At present, most crawlers used for data collection collect data cyclically according to a default setting, for example by setting a time point and returning to the index page to retrieve updates when that time point is reached. However, cyclic updating at a set time wastes resources: if the website has not updated at the designated time point, the page is accessed several extra times, and if updates are retrieved only after a full collection round has finished, timeliness is lost.
Summary of the invention
The purpose of the present application is to provide a method and apparatus for obtaining website information that solve the problems of low time efficiency, wasted resources, and poor data timeliness when website data is obtained with crawler technology in the prior art.
According to one aspect of the present application, a method for obtaining website information is provided, the method comprising: determining webpage level information of a target website, and determining a preset data collection interval and a preset data update amount for each webpage level according to the webpage level information;
determining the data collection interval corresponding to each webpage level according to the preset data collection interval and the preset data update amount of that webpage level;
obtaining the content information of each webpage level at the data collection interval corresponding to that webpage level;
determining the actual data update amount corresponding to each webpage level according to the content information of that webpage level, so as to update the data collection interval corresponding to that webpage level.
Further, determining the preset data collection interval and the preset data update amount of each webpage level according to the webpage level information comprises:
determining historical data collection intervals and the actual historical data update amounts corresponding to the webpage level according to the webpage level information;
determining the preset data collection interval and the preset data update amount of each webpage level according to the historical data collection intervals and the actual historical data update amounts corresponding to the webpage level.
Further, after determining the actual data update amount corresponding to each webpage level according to the content information of that webpage level, the method comprises:
judging whether the difference between the actual data update amount corresponding to each webpage level and the preset data update amount of that webpage level is within a preset threshold, and if so, continuing to obtain the content information of that webpage level at the data collection interval corresponding to that webpage level.
Further, updating the data collection interval corresponding to the webpage level comprises:
judging whether the difference between the actual data update amount corresponding to each webpage level and the preset data update amount of that webpage level is within the preset threshold, and if not, updating the data collection interval corresponding to the webpage level according to the preset data collection interval, the preset data update amount, and the actual data update amount of that webpage level.
Further, updating the data collection interval corresponding to the webpage level according to the preset data collection interval, the preset data update amount, and the actual data update amount of that webpage level comprises:
determining the ratio of the preset data update amount to the actual data update amount of each webpage level;
updating the data collection interval corresponding to the webpage level according to the product of the ratio and the preset data collection interval of that webpage level.
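The ratio-and-product rule above can be sketched in Python as follows. This is a minimal illustration, not the claimed implementation: the function name `updated_interval` and its parameter names are hypothetical, and the handling of a zero actual update amount is an assumption, since the text does not address that case.

```python
def updated_interval(preset_interval: float,
                     preset_updates: float,
                     actual_updates: float) -> float:
    """Return preset_interval * (preset_updates / actual_updates).

    All names are illustrative; the source only describes the rule in words.
    """
    if actual_updates <= 0:
        # Assumption: the source does not say what to do when nothing was
        # updated, so this sketch simply keeps the preset interval.
        return preset_interval
    ratio = preset_updates / actual_updates
    return preset_interval * ratio


# More updates than expected shortens the interval; fewer lengthens it.
print(updated_interval(60.0, 10, 20))
print(updated_interval(60.0, 10, 5))
```

The numbers in the usage lines are illustrative only and do not come from the application.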
Further, after obtaining the content information of each webpage level at the data collection interval corresponding to that webpage level, the method comprises:
comparing the obtained content information of each webpage level with the existing data in a database to determine a comparison result;
storing the obtained content information of each webpage level in the database according to the comparison result.
Further, determining the actual data update amount corresponding to each webpage level according to the content information of that webpage level comprises:
determining, according to the content information of each webpage level, the average of the data update amounts of the webpages collected under that webpage level, and using the average as the actual data update amount corresponding to that webpage level.
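As a small sketch of the averaging step, assuming that the update amount of each collected page is available as a count of new entries (the function name and the list-of-counts representation are assumptions, not taken from the application):

```python
from statistics import mean


def actual_update_amount(new_entries_per_page: list[int]) -> float:
    """Average the per-page update counts collected at one webpage level."""
    if not new_entries_per_page:
        # Assumption: with no pages collected, report no updates.
        return 0.0
    return float(mean(new_entries_per_page))


# Three index pages yielded 3, 5 and 4 new entries in this round.
print(actual_update_amount([3, 5, 4]))
```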
Further, the method also comprises:
determining multiple data collection update amounts under the preset data collection interval;
setting the preset data update amount based on a comparison of the multiple data collection update amounts.
Further, after updating the data collection interval corresponding to the webpage level, the method comprises:
adjusting the updated data collection interval according to a preset range.
Further, adjusting the updated data collection interval according to the preset range comprises:
judging whether the updated data collection interval lies within the preset range, and if not, adjusting the updated data collection interval according to the upper bound or the lower bound of the preset range.
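One plausible reading of this adjustment is a clamp to the nearest bound. The sketch below assumes that reading; the text only says the interval is adjusted "according to" the upper or lower bound, and the name `clamp_interval` is hypothetical.

```python
def clamp_interval(interval: float, lower: float, upper: float) -> float:
    """Keep an updated collection interval inside [lower, upper].

    Assumption: an out-of-range interval is replaced by the nearest bound.
    """
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    return max(lower, min(interval, upper))


print(clamp_interval(5.0, 10.0, 120.0))    # below the range
print(clamp_interval(45.0, 10.0, 120.0))   # already inside the range
print(clamp_interval(600.0, 10.0, 120.0))  # above the range
```

Bounding the interval this way would keep a single unusual round from pushing a level toward near-continuous crawling or toward never being revisited.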
Further, the webpage level information of the target website includes information on the hierarchical relationship determined by the homepage, index pages, and entry pages.
Further, the method comprises:
recording the data collection interval of each webpage level.
According to another aspect of the present application, a computer-readable medium is also provided, on which computer-readable instructions are stored, the computer-readable instructions being executable by a processor to implement the aforementioned method for obtaining website information.
According to yet another aspect of the present application, a device for obtaining website information is also provided, wherein the device comprises:
one or more processors; and
a memory storing computer-readable instructions that, when executed, cause the processor to perform the operations of the aforementioned method for obtaining website information.
Compared with the prior art, the present application determines the webpage level information of a target website and, according to the webpage level information, determines the preset data collection interval and the preset data update amount of each webpage level; determines the data collection interval corresponding to each webpage level according to that level's preset data collection interval and preset data update amount; obtains the content information of each webpage level at the data collection interval corresponding to that level; and determines the actual data update amount corresponding to each webpage level according to the content information of that level, so as to update the data collection interval corresponding to that level. The interval at which webpage data is collected can thus be adjusted according to the update pattern of the website, so that webpage data is obtained promptly and time resources are saved. This improves the efficiency of parsing webpages to collect data, and thereby balances the efficiency of parsing webpages to collect data against the efficiency of storing data in the database.
Brief description of the drawings
Other features, objects, and advantages of the present application will become more apparent upon reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
Fig. 1 shows a schematic flowchart of a method for obtaining website information provided according to one aspect of the present application;
Fig. 2 shows a schematic framework diagram of a method for obtaining website information in one embodiment of the present application.
The same or similar reference numerals in the drawings denote the same or similar components.
Detailed description
The present application is described in further detail below with reference to the accompanying drawings.
In a typical configuration of the present application, the terminal, the device of the service network, and the trusted party each include one or more processors (CPU), input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
Fig. 1 shows a schematic flowchart of a method for obtaining website information provided according to one aspect of the present application. The method comprises steps S1 to S4. In step S1, the webpage level information of the target website is determined, and the preset data collection interval and the preset data update amount of each webpage level are determined according to the webpage level information. In step S2, the data collection interval corresponding to each webpage level is determined according to the preset data collection interval and the preset data update amount of that level. In step S3, the content information of each webpage level is obtained at the data collection interval corresponding to that level. In step S4, the actual data update amount corresponding to each webpage level is determined according to the content information of that level, so as to update the data collection interval corresponding to that level. The intervals at which webpages of different levels are collected can thus be adjusted according to the update pattern of the website, so that webpage data is obtained promptly, time resources are saved, and crawler efficiency is improved.
Specifically, in step S1, the webpage level information of the target website is determined, and the preset data collection interval and the preset data update amount of each webpage level are determined according to the webpage level information. Here, the webpages of a website may span multiple levels. Webpages at different levels update their content at different rates, and the preset data collection interval and preset data update amount corresponding to each webpage level are independent of those of the other levels. For a target website, the webpage level information is determined first; by classifying the webpages of the target website into levels, the preset data collection interval and the preset data update amount are determined for each level. For example, if the webpages of a website are divided into levels K1 and K11, the preset data collection interval of K1 webpages is T' and their preset data update amount is N1, while the preset data collection interval of K11 webpages is T'' and their preset data update amount is N2. Initially, different webpage levels may have the same or different preset data collection intervals and the same or different preset data update amounts; that is, T' and T'' may or may not be equal, and N1 and N2 may or may not be equal.
Preferably, the webpage level information of the target website includes information on the hierarchical relationship determined by the homepage, index pages, and entry pages. In a preferred embodiment of the present application, a start page is first determined for a target website; the start page for data collection is preferably the homepage of the target website. Collecting the information on the homepage yields the link addresses of some popular entries of the target website and the link addresses of the indexes by which the target website classifies its entries. Crawling each index page yields the entry link addresses in that index page, thereby forming a three-level hierarchy of homepage, index pages, and entry pages. For example, for a certain novel website, the homepage is http://www.AAA.me/, an index page is http://www.AAA.me/sort/23, and each novel is an entry page. For a search engine website, the homepage is http://www.XXX.com, an index page is http://www.XXX.com/category/TreeManage.jsp, and each search result page is an entry page.
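The three-level classification described above might be sketched as follows. Everything beyond the two example URLs from the text is hypothetical: the `PageLevel` enum, the `classify_pages` function, the in-memory `SITE` link map that stands in for real fetching and parsing, and the `/book/...` entry URLs.

```python
from enum import Enum


class PageLevel(Enum):
    HOME = "home"
    INDEX = "index"
    ENTRY = "entry"


# Hypothetical link map standing in for fetching and parsing real pages.
# The /book/... entry URLs are invented for illustration.
SITE = {
    "http://www.AAA.me/": {
        "index": ["http://www.AAA.me/sort/23"],
        "entry": ["http://www.AAA.me/book/1"],  # a popular entry on the homepage
    },
    "http://www.AAA.me/sort/23": {
        "index": [],
        "entry": ["http://www.AAA.me/book/1", "http://www.AAA.me/book/2"],
    },
}


def classify_pages(home_url: str, site: dict) -> dict:
    """Assign each discovered URL to the homepage, index, or entry level."""
    levels = {home_url: PageLevel.HOME}
    home_links = site.get(home_url, {})
    # Links found on the homepage: index pages and some popular entries.
    for url in home_links.get("index", []):
        levels.setdefault(url, PageLevel.INDEX)
    for url in home_links.get("entry", []):
        levels.setdefault(url, PageLevel.ENTRY)
    # Each index page contributes the entry links it lists.
    for index_url in home_links.get("index", []):
        for url in site.get(index_url, {}).get("entry", []):
            levels.setdefault(url, PageLevel.ENTRY)
    return levels


for url, level in classify_pages("http://www.AAA.me/", SITE).items():
    print(level.value, url)
```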
Then, in step S2, the data collection interval corresponding to each webpage level is determined according to the preset data collection interval and the preset data update amount of that level. Here, for the multiple webpages at the same webpage level, the preset data collection interval is the same and the preset data update amount is also the same. Each webpage level has its own independent data collection interval, which is determined according to the preset data collection interval and the preset data update amount of that level, so that the webpage data of the corresponding level can be obtained promptly at the determined interval and resources are saved.
Then, in step S3, the content information of each webpage level is obtained at the data collection interval corresponding to that level. In a preferred embodiment of the present application, the webpages of the target website are classified into three levels: homepage, index pages, and entry pages. The homepage corresponds to data collection interval T10, the index pages to data collection interval T20, and the entry pages to data collection interval T30. When the content information of the homepage level is obtained, the content information of the single page or multiple pages at the homepage level is obtained once every T10; when the content information of the index page level is obtained, the content information of the multiple pages at the index page level is obtained once every T20; when the content information of the entry page level is obtained, the content information of the multiple pages at the entry page level is obtained once every T30. Here, T10, T20, and T30 may be equal or unequal.
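A minimal sketch of per-level scheduling, assuming each level keeps its own interval and last collection time. The function `due_levels`, the dictionary layout, and the numeric intervals are illustrative stand-ins for T10, T20, and T30, which the text leaves unspecified.

```python
def due_levels(now: float, last_fetch: dict, intervals: dict) -> list:
    """Return the webpage levels whose collection interval has elapsed."""
    due = []
    for level, interval in intervals.items():
        # A level that has never been collected is treated as due.
        elapsed = now - last_fetch.get(level, float("-inf"))
        if elapsed >= interval:
            due.append(level)
    return due


# Illustrative intervals in minutes, standing in for T10, T20 and T30.
intervals = {"home": 10, "index": 30, "entry": 60}
last_fetch = {"home": 0, "index": 0, "entry": 0}
print(due_levels(30, last_fetch, intervals))
```

Because each level is checked against its own interval, a change to one level's interval has no effect on when the other levels are collected.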
Then, in step S4, the actual data update amount corresponding to each webpage level is determined according to the content information of that level, so as to update the data collection interval corresponding to that level. Here, within a given period, the data update amounts of webpages at different levels differ, and the update interval of each webpage level is calculated independently from the data update amount of the webpages at that level.
Continuing the above embodiment, the content information of the single page or multiple pages at the homepage level is obtained once every T10, and the actual data update amount corresponding to the homepage level is determined from that content information; the content information of the multiple pages at the index page level is obtained once every T20, and the actual data update amount corresponding to the index page level is determined from that content information; the content information of the multiple pages at the entry page level is obtained once every T30, and the actual data update amount corresponding to the entry page level is determined from that content information. The data collection interval is adjusted dynamically according to the actual data update amount of the webpages, so as to avoid accessing pages several extra times for no reason and to avoid collecting webpage content only after it has lost its timeliness.
In the present application, classifying pages into levels before collecting data, and presetting a data collection interval and a data update amount for each level, makes it possible to update the data collection interval corresponding to each page level according to the update pattern of that level, so that the webpage data of each level is obtained promptly. Compared with the prior art, this saves the time resources required to update the data collection interval, obtain the target website data, and store it in the database, thereby balancing the efficiency of parsing webpages, the efficiency of obtaining data, and the efficiency of storing data in the database.
Preferably, in step S2, historical data collection intervals and the actual historical data update amounts corresponding to the webpage level are determined according to the webpage level information, and the preset data collection interval and the preset data update amount of each webpage level are determined according to the historical data collection intervals and the actual historical data update amounts corresponding to that level. Here, when data is collected from a website, the preset data collection interval can be determined from the historical data of the website, such as its historical update pattern. For example, if a certain novel website regularly updates at 9 a.m., the preset data collection interval can be set so that collection takes place at 9 a.m. The preset data update amount is specifically an update amount counted in entries and is set according to the data update amounts of historical collections from the website. Specifically, multiple data collection update amounts under the preset data collection interval can be determined, and the preset data update amount set based on a comparison of those amounts. Here, the update amounts of multiple data collections at a given interval can be compared and used as a reference for setting the data update amount. For example, the update amounts of two data collections are taken as a reference: a first round collects data for a period of time, a second round collects data after a certain time has passed, and the two are compared as a reference for setting the preset data update amount.
In one embodiment of the present application, the webpage level information obtained includes information on the hierarchical relationship determined by the homepage, index pages, and entry pages. For a target website, a start page is first determined; the start page for data collection is preferably the homepage of the target website. Collecting the information on the homepage yields the link addresses of some popular entries of the target website and the link addresses of the indexes by which the target website classifies its entries; crawling each index page yields the entry link addresses in that index page, preferably forming a three-level hierarchy of homepage, index pages, and entry pages. The historical data collection intervals and actual historical data update amounts are then obtained for the homepage level, for the index page level, and for the entry page level. In a preferred embodiment of the present application, the historical data collection intervals of the homepage are T11, T12, T13, and T14, and a collection duration Th1 is set; after the duration Th1 has elapsed, the actual historical data update amounts obtained by collecting the homepage at the intervals T11, T12, T13, and T14 are A11, A12, A13, and A14, respectively.
Then, under different data collection intervals and the same collection duration, the multiple actual data update amounts collected at the corresponding intervals are obtained, and the preset data update amount and the preset data collection interval corresponding to the page level are obtained from the result of comparing those actual data update amounts. Continuing the above embodiment, the preset data update amount A1 and the preset data collection interval T1 of the homepage are obtained from the result of comparing A11, A12, A13, and A14. Here, the comparison of A11, A12, A13, and A14 includes, but is not limited to, pairwise comparison within the groups (A11, A12) and (A13, A14); the manner of comparison can be designed freely. Multiple historical data collection intervals and corresponding actual historical data update amounts are obtained. The historical data collection intervals and actual historical data update amounts are obtained separately for each page level in order to derive the corresponding preset data update amount; that is, the preset data update amount of each page level is set independently according to the historical data collection intervals and actual historical data update amounts of that level, and the preset data update amounts obtained for pages of different levels may be equal or unequal.
Preferably, after the actual data update amount corresponding to each webpage level is determined according to the content information of that level, it is judged whether the difference between the actual data update amount corresponding to each webpage level and the preset data update amount of that level is within a preset threshold; if so, the content information of that level continues to be obtained at the data collection interval corresponding to that level. Here, a collection interval is preset and the website is crawled at that interval, while a data update amount is preset as a reference for comparison. The actual amount of data collected within the preset interval is compared with the preset data update amount. If the difference between the two is within the preset threshold, for example if the two are equal, the preset collection interval is reasonable and within the required range, and the crawler continues to collect webpage data at that interval. If the difference between the two exceeds the preset threshold, the actual interval of data collection needs to be adjusted according to the preset data collection interval, the preset data update amount, and the actual data update amount.
Specifically, a preset threshold is set for the difference between the actual data update amount corresponding to each webpage level and the preset data update amount of that level, wherein the preset threshold corresponding to each webpage level is set independently for that level. Methods of setting the preset threshold include, but are not limited to, setting it according to the webpage level information and the historical data update amounts corresponding to each webpage level. Setting a preset threshold for each webpage level avoids the waste of resources caused by unnecessary updates of the data collection interval and limits the conditions under which the data collection interval needs to be updated, thereby saving time resources and guaranteeing the timeliness of the data.
Continuing the above embodiment, after the preset data collection interval T1 and the preset data update amount A1 of the homepage have been determined, the data of the single page or multiple pages at the homepage level is collected at the preset data collection interval T1, yielding the corresponding actual data update amount A2. Within the same interval T1, if the actual data update amount A2 is greater than the preset data update amount A1, the timeliness of the data has been lost; if the actual data update amount A2 is less than the preset data update amount A1, the crawler has repeatedly obtained duplicate information, wasting resources. When the difference between the actual data update amount A2 and the preset data update amount A1 is within the preset threshold corresponding to the homepage level, the preset data collection interval T1 is not adjusted; that is, the data of the single page or multiple pages at the homepage level continues to be collected at the preset data collection interval T1, which saves resources and guarantees the timeliness of the data.
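The threshold test in this passage can be sketched as below. Using the absolute difference is an assumption, on the grounds that the text treats both A2 greater than A1 and A2 less than A1 as deviations; the function name and the sample numbers are illustrative.

```python
def needs_interval_update(preset_updates: float,
                          actual_updates: float,
                          threshold: float) -> bool:
    """Return True when the actual update amount deviates from the preset
    amount by more than the level's threshold.

    Assumption: the deviation is measured as an absolute difference.
    """
    return abs(actual_updates - preset_updates) > threshold


# With A1 = 10 and a threshold of 3 (illustrative values):
print(needs_interval_update(10, 12, 3))  # within threshold, keep T1
print(needs_interval_update(10, 20, 3))  # too many updates, timeliness lost
print(needs_interval_update(10, 5, 3))   # too few updates, wasted fetches
```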
Preferably, in step s 4, the preset data renewal amount and real data renewal amount of each webpage rank are determined Ratio;The webpage grade is updated according to the product at the ratio and the preset data acquisition time interval of each webpage rank Not corresponding data collection interval.
Specifically, to obtain the data instance of homepage rank, it is determined that the present count of the single or multiple webpages of homepage rank According to acquisition time interval T1 and preset data renewal amount A1, and the reality of the webpage rank is got by the time interval of T1 Border data renewal amount A2.It is spaced in T1 at the same time, the real data acquisition renewal amount A2 is updated greater than preset data A1 is measured, illustrates that data age is lost;It is spaced in T1 at the same time, the real data acquisition renewal amount A2 is less than Preset data renewal amount A1 illustrates that crawlers obtain the duplicate message of some homepages repeatedly, so as to cause the wave of resource Take.When real data acquisition renewal amount A2 is corresponding not in this level page of the homepage with the difference of preset data renewal amount A1 Preset threshold in the range of it is necessary to being adjusted update to preset data acquisition time interval T1.
Then, preferably, in order to avoid wasting resources on repeated collection of page information while also ensuring the timeliness of the collected data, the following condition is imposed over the same collection duration: for each webpage level, the product of the preset data collection interval and the preset data update amount equals the product of the actual (updated) data collection interval and the actual data update amount. It follows that, by determining the ratio of the preset data update amount to the actual data update amount of the webpage level and updating the data collection interval of the webpage level according to the product of that ratio and the preset data collection interval, the following update equation is obtained for each webpage level after rearrangement:
T_updated = (A_preset / A_actual) × T_preset
where T_updated is the updated data collection interval, T_preset is the preset data collection interval, A_preset is the preset data update amount, and A_actual is the actual data update amount, all obtained for the single or multiple webpages of the same webpage level over the same collection duration.
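The update rule stated above (updated interval = ratio of preset to actual update amount, multiplied by the preset interval) can be sketched as a small function. The function and parameter names are illustrative, and the guard against a non-positive actual update amount is an assumption, since the text does not say how that case is handled.

```python
def updated_interval(preset_interval: float, preset_amount: float,
                     actual_amount: float) -> float:
    """Return (preset_amount / actual_amount) * preset_interval.

    Assumption: a non-positive actual update amount is rejected, because
    the ratio is undefined in that case and the source does not cover it.
    """
    if actual_amount <= 0:
        raise ValueError("actual update amount must be positive")
    return preset_amount / actual_amount * preset_interval


# Twice the expected updates were observed, so the interval is halved.
print(updated_interval(60.0, 10.0, 20.0))  # 30.0
```

The result preserves the product stated in the text: 60 × 10 equals 30 × 20.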
Continuing the above embodiment, when, for the webpages at the homepage level collected at the same preset data collection interval T1, the difference between the actual data update amount A2 obtained from the single or multiple pages of that level and the preset data update amount A1 falls outside the preset threshold range corresponding to the homepage level, the ratio of A1 to A2 is determined and multiplied by the preset data collection interval T1, giving the updated data collection interval (A1/A2) × T1. The next time data is collected from the single or multiple pages at the homepage level of the target website, the interval (A1/A2) × T1 is used. Updating the data collection interval of each page level in this way saves time resources and maintains the timeliness of the data.
Preferably, after the content information of each webpage level is obtained according to the data collection interval corresponding to the webpage level, the obtained content information of each webpage level is compared with the existing data in the database to determine a comparison result, and the obtained content information of each webpage level is stored in the database according to the comparison result.
Specifically, in a preferred embodiment of the application, as shown in Fig. 2, a data management module is provided. The data management module compares the obtained content information of each webpage level with the existing data in the database and determines the comparison result. If the comparison result is that the obtained content information includes data that already exists in the database, the data management module deduplicates the obtained content information before storing it in the database, where deduplication means deleting the already-existing data from the obtained content information of each webpage level. If the comparison result is that the obtained content information does not include data that already exists in the database, the data management module stores the obtained content information of each webpage level in the database. Providing the data management module simplifies storage of the obtained content information, and deduplicating the content information of each webpage level through the module further improves the efficiency of storing content information in the database.
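The compare-then-store behaviour of the data management module might look like the sketch below. It assumes, purely for illustration, that content items are hashable strings and that the database can be modelled as an in-memory set; the function name is hypothetical.

```python
from typing import Iterable, List, Set


def store_without_duplicates(fetched: Iterable[str],
                             database: Set[str]) -> List[str]:
    """Store only items not already in the database; return what was stored.

    Assumption: the database is modelled as a set of content strings.
    """
    stored = []
    for item in fetched:
        if item in database:
            continue            # already present: dropped by deduplication
        database.add(item)      # new content: written to the database
        stored.append(item)
    return stored


db = {"page-a", "page-b"}
print(store_without_duplicates(["page-b", "page-c", "page-d"], db))
# ['page-c', 'page-d']
```

Because each stored item is added to the set immediately, repeated items within a single batch are also stored only once.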
Preferably, in step S4, the average of the data update amounts of the individual webpages under a webpage level is determined from the content information of that webpage level, and the average is used as the actual data update amount corresponding to that webpage level. Here, each webpage level has its own data update amount, and the data update amount of the webpages at the same level is taken as the average of the data update amounts of the individual webpages under that level, for example an arithmetic mean or a weighted average. In one embodiment of the application, an e-commerce shopping homepage has different index pages, such as a menswear category, a womenswear category, and a consumer electronics category; these categories are treated as one level, each is calculated independently, and the arithmetic mean is then taken as the time interval under the homepage level.
Specifically, in a preferred embodiment of the application, take the index page level as an example. The index page level includes multiple pages, each of which contains the link addresses of multiple entry pages. When the index page level of the target website contains X pages, the data update amount of the X pages within the data collection interval T20 of the index page level is determined from the collected content information to be Y. Preferably, the arithmetic mean Y/X is obtained from the data update amount Y of the X pages, and Y/X is used as the actual data update amount of the X pages at the index page level. Those skilled in the art will understand that determining the average of the data update amounts of the individual webpages under a webpage level by taking the arithmetic mean is only one preferred embodiment of the application; other methods of determining that average are likewise covered.
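The averaging step can be illustrated with a short sketch covering both the arithmetic mean and the weighted average mentioned in the description. The function name and the way weights are supplied are assumptions; the source does not say how weights would be chosen.

```python
from typing import Optional, Sequence


def level_update_amount(per_page_amounts: Sequence[float],
                        weights: Optional[Sequence[float]] = None) -> float:
    """Average the per-page update amounts of one webpage level.

    Without weights this is the arithmetic mean; with weights it is a
    weighted average (the weighting scheme itself is an assumption).
    """
    if not per_page_amounts:
        raise ValueError("at least one page is required")
    if weights is None:
        return sum(per_page_amounts) / len(per_page_amounts)
    if len(weights) != len(per_page_amounts) or sum(weights) <= 0:
        raise ValueError("weights must match the pages and sum to > 0")
    total = sum(a * w for a, w in zip(per_page_amounts, weights))
    return total / sum(weights)


print(level_update_amount([4, 6, 8]))            # 6.0
print(level_update_amount([10, 20], [3, 1]))     # 12.5
```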
Preferably, after the data collection interval corresponding to the webpage level is updated, the updated data collection interval may also be adjusted according to a preset range.
Specifically, a corresponding preset range is set for the data collection interval of each webpage level of the target website. The preset range prevents the data collection interval from becoming too small, which avoids burdening the target website and thus avoids page crashes caused by data collection. It also prevents the data collection interval from becoming too large, so that the link space for the pages at the next level is replenished in time.
Preferably, the updated data collection interval may be adjusted according to the preset range as follows: determine whether the updated data collection interval lies within the preset range and, if not, adjust it according to the upper bound or lower bound of the preset range. That is, an updated data collection interval that lies outside the preset range is set to the upper bound or the lower bound of the preset range.
Specifically, the preset range is set separately for each page level according to the hierarchical relationship of the pages. Setting it separately means that the preset range of the data collection interval is set according to the historical data collection intervals and historical data update amounts of each page level, together with the historical content information of the target website. In a preferred embodiment of the application, taking the multiple pages of the index page level of the target website as an example, the minimum data collection interval T00 that will not cause the index pages to crash is used as the lower limit of the preset range, and the maximum interval T01, obtained from the historical data collection intervals of the multiple index pages with reference to the historical data update amounts, is used as the upper limit of the preset range. Let the updated data collection interval be T60. When T60 is less than the lower limit T00 of the preset range, the updated data collection interval is adjusted to T00; when T60 is greater than the upper limit T01 of the preset range, the updated data collection interval is adjusted to T01. This avoids burdening the target website and the page crashes that data collection could otherwise cause, and it also prevents the data collection interval of the index pages from becoming too large, so that the link space for the entry pages at the next level is replenished in time.
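The adjustment against the preset range amounts to clamping the updated interval between the lower limit T00 and the upper limit T01. A minimal sketch follows; the function name and the sample values are illustrative only.

```python
def clamp_interval(updated: float, lower: float, upper: float) -> float:
    """Clamp the updated interval T60 into the preset range [T00, T01]."""
    if lower > upper:
        raise ValueError("lower limit must not exceed upper limit")
    # Below T00 the result is T00, above T01 it is T01, otherwise unchanged.
    return max(lower, min(updated, upper))


print(clamp_interval(2.0, 5.0, 60.0))    # 5.0  (raised to the lower limit)
print(clamp_interval(90.0, 5.0, 60.0))   # 60.0 (lowered to the upper limit)
print(clamp_interval(30.0, 5.0, 60.0))   # 30.0 (already inside the range)
```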
Preferably, the method further comprises step S5: recording the data collection interval of each webpage level. Specifically, as shown in Fig. 2, in one embodiment of the application a data acquisition module is provided. The data acquisition module sets a corresponding preset data collection interval and a corresponding preset data update amount for each page level, and collects the web data of the single or multiple pages of each level by crawler technology at the preset interval. It adjusts the data collection interval according to the preset interval, the preset range corresponding to each page level, and the actual data update amount of each page level, and it records the data collection interval of each webpage level, thereby forming historical data collection interval data. This history supplies the historical collection intervals of each page level the next time preset data collection intervals are set, makes it easier to adjust the preset range of the data collection interval of each page level, and helps reveal the update pattern of the target website. Based on that update pattern, the data collection interval of each page level can be updated so that the web data of each level is obtained in time, time resources are saved, and the efficiency of parsing webpages, obtaining data, and storing data in the database is balanced.
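Step S5 can be pictured as a per-level log of the intervals actually used. The sketch below is an assumption-laden illustration: the class and method names are hypothetical, and deriving the next preset interval as the mean of the recorded history is only one possible choice, since the description does not specify how the history is used.

```python
from collections import defaultdict
from typing import Dict, List


class IntervalHistory:
    """Records the data collection interval used for each webpage level."""

    def __init__(self) -> None:
        self._records: Dict[str, List[float]] = defaultdict(list)

    def record(self, level: str, interval: float) -> None:
        self._records[level].append(interval)

    def history(self, level: str) -> List[float]:
        # .get avoids creating empty entries for levels never recorded.
        return list(self._records.get(level, []))

    def suggest_preset(self, level: str) -> float:
        """Assumed heuristic: mean of the recorded intervals for the level."""
        values = self.history(level)
        if not values:
            raise ValueError(f"no history recorded for level {level!r}")
        return sum(values) / len(values)


log = IntervalHistory()
log.record("index", 30.0)
log.record("index", 50.0)
print(log.suggest_preset("index"))  # 40.0
```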
In addition, an embodiment of the application also provides a computer-readable medium storing computer-readable instructions that can be executed by a processor to implement the aforementioned method for obtaining website information.
According to another aspect of the application, a device for obtaining website information is also provided, wherein the device comprises:
one or more processors; and
a memory storing computer-readable instructions that, when executed, cause the processor to perform the operations of the aforementioned method for obtaining website information.
For example, the computer-readable instructions, when executed, cause the one or more processors to:
determine the webpage level information of the target website, and determine the preset data collection interval and preset data update amount of each webpage level according to the webpage level information;
determine the data collection interval corresponding to the webpage level according to the preset data collection interval and the preset data update amount of each webpage level;
obtain the content information of each webpage level according to the data collection interval corresponding to the webpage level;
determine the actual data update amount corresponding to each webpage level according to the content information of each webpage level, so as to update the data collection interval corresponding to the webpage level.
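The four operations listed above can be tied together in one collection cycle, sketched below under several assumptions: the `LevelState` fields, the `fetch` callable (a stand-in for the crawler that returns per-page update amounts), the symmetric threshold, and the use of the arithmetic mean are all illustrative choices rather than details confirmed by the application.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LevelState:
    preset_interval: float   # preset data collection interval
    preset_amount: float     # preset data update amount
    threshold: float         # assumed symmetric bound on the difference
    interval: float = 0.0    # interval currently in use

    def __post_init__(self) -> None:
        if self.interval <= 0:
            self.interval = self.preset_interval  # start from the preset


def run_cycle(levels: Dict[str, LevelState],
              fetch: Callable[[str, float], List[float]]) -> Dict[str, float]:
    """Collect every level once and update its interval where needed."""
    for name, state in levels.items():
        amounts = fetch(name, state.interval)   # hypothetical crawler call
        if not amounts:
            continue
        actual = sum(amounts) / len(amounts)    # actual data update amount
        outside = abs(actual - state.preset_amount) > state.threshold
        if outside and actual > 0:
            state.interval = (state.preset_amount / actual
                              * state.preset_interval)
    return {name: state.interval for name, state in levels.items()}
```

A caller would supply the real crawler in place of `fetch`; in a test, a lambda returning fixed per-page amounts is enough to exercise the update logic.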
Obviously, those skilled in the art can make various modifications and variations to the application without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the application and their technical equivalents, the application is intended to include them.
It should be noted that the application may be implemented in software and/or a combination of software and hardware, for example using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In one embodiment, the software program of the application may be executed by a processor to implement the steps or functions described above. Likewise, the software program of the application (including related data structures) may be stored in a computer-readable recording medium, for example RAM, a magnetic or optical drive, a floppy disk, or a similar device. In addition, some steps or functions of the application may be implemented in hardware, for example as a circuit that cooperates with a processor to perform each step or function.
In addition, part of the application may be applied as a computer program product, such as computer program instructions which, when executed by a computer, invoke or provide the method and/or technical solution according to the application through the operation of that computer. The program instructions that invoke the method of the application may be stored in a fixed or removable recording medium, and/or transmitted via a data stream in broadcast or other signal-bearing media, and/or stored in the working memory of a computer device that runs according to the program instructions. Here, one embodiment of the application includes a device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to run the methods and/or technical solutions based on the aforementioned embodiments of the application.
It is obvious to those skilled in the art that the application is not limited to the details of the above exemplary embodiments, and that the application can be implemented in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as illustrative and not restrictive. The scope of the application is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and scope of equivalents of the claims are therefore intended to be included in the application. No reference sign in the claims should be construed as limiting the claim concerned. Moreover, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.

Claims (14)

1. A method for obtaining website information, wherein the method comprises:
determining the webpage level information of a target website, and determining the preset data collection interval and preset data update amount of each webpage level according to the webpage level information;
determining the data collection interval corresponding to the webpage level according to the preset data collection interval and the preset data update amount of each webpage level;
obtaining the content information of each webpage level according to the data collection interval corresponding to the webpage level;
determining the actual data update amount corresponding to each webpage level according to the content information of each webpage level, so as to update the data collection interval corresponding to the webpage level.
2. The method according to claim 1, wherein determining the preset data collection interval and preset data update amount of each webpage level according to the webpage level information comprises:
determining the historical data collection interval and the actual historical data update amount corresponding to the webpage level according to the webpage level information;
determining the preset data collection interval and preset data update amount of each webpage level according to the historical data collection interval and the actual historical data update amount corresponding to the webpage level.
3. The method according to claim 1, wherein, after determining the actual data update amount corresponding to each webpage level according to the content information of each webpage level, the method comprises:
determining whether the difference between the actual data update amount corresponding to each webpage level and the preset data update amount of that webpage level is within a preset threshold and, if so, continuing to obtain the content information of each webpage level at the data collection interval corresponding to the webpage level.
4. The method according to claim 1, wherein updating the data collection interval corresponding to the webpage level comprises:
determining whether the difference between the actual data update amount corresponding to each webpage level and the preset data update amount of that webpage level is within a preset threshold and, if not, updating the data collection interval corresponding to the webpage level according to the preset data collection interval, the preset data update amount, and the actual data update amount of each webpage level.
5. The method according to claim 4, wherein updating the data collection interval corresponding to the webpage level according to the preset data collection interval, the preset data update amount, and the actual data update amount of each webpage level comprises:
determining the ratio of the preset data update amount to the actual data update amount of each webpage level;
updating the data collection interval corresponding to the webpage level according to the product of the ratio and the preset data collection interval of each webpage level.
6. The method according to claim 1, wherein, after obtaining the content information of each webpage level according to the data collection interval corresponding to the webpage level, the method comprises:
comparing the obtained content information of each webpage level with the existing data in a database to determine a comparison result;
storing the obtained content information of each webpage level in the database according to the comparison result.
7. The method according to claim 1, wherein determining the actual data update amount corresponding to each webpage level according to the content information of each webpage level comprises:
determining the average of the data update amounts of the individual webpages under the webpage level according to the content information of each webpage level, and using the average as the actual data update amount corresponding to each webpage level.
8. The method according to claim 1, wherein the method further comprises:
determining multiple data update amounts collected at the preset data collection interval;
setting the preset data update amount based on a comparison of the multiple data update amounts.
9. The method according to claim 1, wherein, after updating the data collection interval corresponding to the webpage level, the method comprises:
adjusting the updated data collection interval according to a preset range.
10. The method according to claim 9, wherein adjusting the updated data collection interval according to the preset range comprises:
determining whether the updated data collection interval lies within the preset range and, if not, adjusting the updated data collection interval according to the upper bound or lower bound of the preset range.
11. The method according to claim 1, wherein the webpage level information of the target website comprises information on the hierarchical relationship determined by the homepage, index pages, and entry pages.
12. The method according to claim 1, wherein the method comprises:
recording the data collection interval of each webpage level.
13. A computer-readable medium storing computer-readable instructions that can be executed by a processor to implement the method according to any one of claims 1 to 12.
14. A device for obtaining website information, wherein the device comprises:
one or more processors; and
a memory storing computer-readable instructions that, when executed, cause the processor to perform the operations of the method according to any one of claims 1 to 12.
CN201910262577.2A 2018-12-29 2019-04-02 Method and equipment for acquiring website information Active CN110008393B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2018116401467 2018-12-29
CN201811640146 2018-12-29

Publications (2)

Publication Number Publication Date
CN110008393A true CN110008393A (en) 2019-07-12
CN110008393B CN110008393B (en) 2023-03-07

Family

ID=67169789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910262577.2A Active CN110008393B (en) 2018-12-29 2019-04-02 Method and equipment for acquiring website information

Country Status (1)

Country Link
CN (1) CN110008393B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090112862A1 (en) * 2007-10-26 2009-04-30 G&G Commerce Ltd. Image-based search system and method
CN109033195A (en) * 2018-06-28 2018-12-18 上海盛付通电子支付服务有限公司 The acquisition methods of webpage information obtain equipment and computer-readable medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张锋: "基于URL和网页类型的网页信息采集研究", 《电子制作》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110879818A (en) * 2019-10-12 2020-03-13 北京字节跳动网络技术有限公司 Method, device, medium and electronic equipment for acquiring data
CN112100472A (en) * 2020-09-11 2020-12-18 深圳市科盾科技有限公司 Crawler scheduling method and device, terminal equipment and readable storage medium
CN112100472B (en) * 2020-09-11 2023-11-28 深圳市科盾科技有限公司 Crawler scheduling method, crawler scheduling device, terminal equipment and readable storage medium
CN113992378A (en) * 2021-10-22 2022-01-28 绿盟科技集团股份有限公司 Safety monitoring method and device, electronic equipment and storage medium
CN113992378B (en) * 2021-10-22 2023-11-07 绿盟科技集团股份有限公司 Security monitoring method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110008393B (en) 2023-03-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230921

Address after: No. 106 Fengze East Road, Nansha District, Guangzhou City, Guangdong Province, 511457 (self made Building 1) X1301-B4056 (cluster registration) (JM)

Patentee after: Semantic Intelligent Technology (Guangzhou) Co.,Ltd.

Address before: 201203 Shanghai Pudong New Area free trade trial area, 1 spring 3, 400 Fang Chun road.

Patentee before: YIYU INTELLIGENT TECHNOLOGY (SHANGHAI) CO.,LTD.