CN106250512B

CN106250512B - A topic-focused network information collection method that takes temporal intent into account

Info

Publication number: CN106250512B
Application number: CN201610630419.4A
Authority: CN
Inventors: 陈军; 武昊; 侯东阳
Original assignee: NATIONAL GEOMATICS CENTER OF CHINA
Current assignee: NATIONAL GEOMATICS CENTER OF CHINA
Priority date: 2016-08-04
Filing date: 2016-08-04
Publication date: 2019-07-26
Anticipated expiration: 2036-08-04
Also published as: CN106250512A

Abstract

A topic-focused network information collection method that takes temporal intent into account, used to collect and prioritize Internet web page information for topic events, comprising the following steps: Step A, determine the start time of the topic event from prior data and quantify its temporal distribution, obtaining a quantized value of the temporal distribution; Step B, represent the temporal intent and the ordinary keywords of the topic separately using different representation methods, and compute the time relevance and the ordinary-keyword relevance separately; Step C, based on the time relevance and ordinary-keyword relevance computed in Step B, construct an increasing function whose variable is the quantized value of the temporal distribution obtained in Step A, derive from it a URL priority assignment formula based on that quantized value, and compute the final URL priority. The method provided by the invention substantially increases the number of web pages discovered and the precision.

Description

A topic-focused network information collection method that takes temporal intent into account

Technical field

The present invention relates to the field of Internet web page search, in particular to topical crawling methods for acquiring web pages with specific content from the Internet, and more particularly to a topic-focused network information collection method that takes temporal intent into account.

Background art

Topical crawling is a key technique for acquiring domain-specific web pages from the Internet; its aim is to download as many pages relevant to a specified topic as possible. Guided by a user-specified topic, it continuously retrieves relevant pages from ubiquitous network resources using crawl strategies such as topic relevance computation and URL priority assignment.

URL priority assignment based on page content is the method commonly used in traditional topical crawling. It is computed mainly from two kinds of relevance: (1) topic relevance of the parent page content: the higher this value, the higher the priority of the URLs contained in the parent page; (2) topic relevance of the anchor text: the relevance between the topic and information such as the anchor text, the anchor text context, and the URL string, where the anchor text is usually a summary description of the page the URL points to.

In content-based URL priority assignment, both relevance values are commonly computed with the cosine formula. For example, if the parent page content topic relevance of a URL is sim(V_Dk, V_Tk) and its anchor text topic relevance is sim(V_Ak, V_Tk), the priority Priority(URL) of that URL can be computed as follows:

Priority(URL) = θ × sim(V_Dk, V_Tk) + γ × sim(V_Ak, V_Tk) (1-1)

In the above formula, θ and γ denote the decay factors of the parent page content topic relevance and the anchor text topic relevance respectively, and satisfy θ + γ = 1.
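Formula (1-1) can be illustrated with a minimal Python sketch. It assumes that the keyword vectors are stored as sparse dictionaries mapping keyword to weight; the function names are illustrative only and do not come from the patent. The default values θ = 0.4 and γ = 0.6 are the ones the description later gives for the preferred embodiment.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse keyword-weight vectors (dicts)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (norm_u * norm_v)


def content_priority(parent_vec, anchor_vec, topic_vec, theta=0.4, gamma=0.6):
    """Formula (1-1): theta * sim(V_Dk, V_Tk) + gamma * sim(V_Ak, V_Tk)."""
    if abs(theta + gamma - 1.0) > 1e-9:
        raise ValueError("theta + gamma must equal 1")
    return (theta * cosine_similarity(parent_vec, topic_vec)
            + gamma * cosine_similarity(anchor_vec, topic_vec))
```

Because both similarities lie in [0, 1] for non-negative weights and the factors sum to 1, the resulting priority also lies in [0, 1].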

When topical crawling is used to collect time-sensitive emergency information, temporal intent usually acts as a constraining element of the topic. According to the ISO 19100 series of standards (2002), temporal objects are divided into "instants" and "periods": an instant represents a point in the time dimension, while a period corresponds to a line segment in the time dimension, with attributes such as a start point, an end point, and a length. In general, information about an emergency event is published on the network mainly after the event occurs, so the publication time of a report should be later than the start time of the event. In addition, an emergency event evolves through emergence, development, change, and extinction, and the attention people pay to it differs across these stages. Preferentially downloading information from the periods of higher attention meets the needs of most users, and this attention reflects, to some extent, the temporal distribution of the event. In other words, when network information is collected by topic, temporal intent (such as the start time and the temporal distribution) plays a clear role both in judging information relevance and in assigning the priority order of information discovery.

Although setting a start time can by itself filter out some irrelevant information during topical crawling, and the temporal distribution affects the priority order of information discovery, traditional network information collection methods still attend only to the ordinary semantics of the topic. They neither analyze nor exploit its temporal intent and implicitly treat the temporal distribution as uniform, which results in low precision. Specifically:

(1) No representation of temporal intent: the traditional single-vector topic representation expresses only the topic keywords and provides no representation of the topic's temporal intent;

(2) Weakened role of the topic start time: traditional topic relevance computation relies solely on page content to judge relevance to the topic, which weakens the role of the topic start time;

(3) Neglect of the influence of the topic's temporal distribution on the priority order of information discovery: traditional URL priority assignment methods currently rely mainly on page content, anchor text and its context, the URL string, link relationships, and even page update time, but ignore the topic's temporal distribution.

Summary of the invention

The technical problem to be solved by the present invention is to provide a topic-focused network information collection method that takes temporal intent into account, so as to reduce or avoid the problems mentioned above.

To solve the above technical problem, the present invention provides a topic-focused network information collection method that takes temporal intent into account, used to collect and prioritize Internet web page information for topic events, comprising the following steps:

Step A: determine the start time of the topic event from prior data and quantify its temporal distribution, obtaining a quantized value of the temporal distribution;

Step B: represent the temporal intent and the ordinary keywords of the topic separately using different representation methods, and compute the time relevance and the ordinary-keyword relevance separately;

Step C: based on the time relevance and ordinary-keyword relevance computed in Step B, construct an increasing function whose variable is the quantized value of the temporal distribution obtained in Step A, and incorporate it into the content-based URL priority assignment method to obtain a URL priority assignment formula based on the temporal distribution quantized value. The final URL priority is computed with this formula, so that URLs from periods of high attention receive higher priority.

Preferably, the prior data in Step A is Google Trends data.

Preferably, in Step B, the temporal intent of the topic is represented as follows:

Overall formal representation of the topic and the page content: given a topic T and page content D, they are represented as follows.

T = <V_Tk, T_ST, T_TD>

D = <V_Dk, T_PT>

where V_Tk, T_ST, and T_TD denote the ordinary keyword vector of the topic, the start and end time of the topic, and its temporal distribution respectively; V_Dk and T_PT denote the ordinary keyword vector of the page content and its publication time.

Formal representation of the topic: its ordinary keyword vector V_Tk, start and end time T_ST, and temporal distribution T_TD are expressed as follows.

V_Tk = {(k₁, w_Tk1), (k₂, w_Tk2), ..., (k_s, w_Tks)}

T_ST = [t_STs, t_STe]

T_TD = {<[t_TDs1, t_TDe1], λ₁>, ..., <[t_TDsr, t_TDer], λ_r>}

where k_i denotes the i-th ordinary keyword in the topic; w_Tki denotes the weight of ordinary keyword k_i; s denotes the number of ordinary keywords in the topic; t_STs denotes the start time of the topic and t_STe its end time; <[t_TDsi, t_TDei], λ_i> denotes the i-th <period, search volume index> pair in the temporal distribution; t_TDsi and t_TDei are the start time and end time of the i-th period, and λ_i is the search volume index value of the i-th period;

Formal representation of the page content: its ordinary keyword vector V_Dk and publication time T_PT are expressed as follows.

V_Dk = {(k₁, w_Dk1), (k₂, w_Dk2), ..., (k_s, w_Dks)}

T_PT = t_PT

where k_i denotes the i-th ordinary keyword in the page content; w_Dki denotes the weight of ordinary keyword k_i; t_PT denotes the publication time of the page.

Preferably, in Step B, the time relevance and the ordinary-keyword relevance are computed as follows:

The time relevance between the topic and the page content is computed by the following formula:

where sim(T_PT, T_ST) denotes the time relevance value between the topic and the page content;

The ordinary topic relevance between the topic and the page content is computed by the following formula:

where sim(V_Dk, V_Tk) denotes the ordinary topic relevance value between topic T and page content D.

Preferably, the URL priority assignment formula in Step C is:

where Priority_T(URL) denotes the final URL priority; Priority(URL) is the priority obtained by the existing content-based URL priority assignment method; Pr(t/T) is the normalized value of the temporal distribution quantized value and also represents the probability that a page published at time t is relevant to topic T; the threshold takes a value in the interval from 0 to 1.

Preferably, the threshold is set to 0.4.

Preferably, the priority Priority(URL) obtained by the content-based URL priority assignment method is computed as:

Priority(URL) = θ × sim(V_Dk, V_Tk) + γ × sim(V_Ak, V_Tk)

where θ and γ denote the decay factors of the parent page content topic relevance and the anchor text topic relevance respectively, and satisfy θ + γ = 1.

Preferably, the decay factor θ is set to 0.4 and γ is set to 0.6.

The topic-focused network information collection method provided by the present invention quantizes the start time and temporal distribution of the topic and formally expresses temporal intent on the basis of the international time standard, forming a multi-part representation in which temporal intent and ordinary keywords (non-temporal words) are represented independently. It then computes the time relevance and the ordinary-keyword relevance in separate steps, and finally incorporates the quantized temporal distribution, as the variable of an increasing function, into the URL priority assignment method to compute the URL priority. This substantially increases the number of web pages discovered and the precision.

Detailed description of embodiments

For a clearer understanding of the technical features, objects, and effects of the present invention, specific embodiments of the invention are now described.

The present invention provides a topic-focused network information collection method that takes temporal intent into account, used to collect and prioritize Internet web page information for topic events, comprising the following steps:

The temporal intent of a topic refers to the temporal characteristics contained in the topic. The invention divides temporal intent into explicit temporal intent and latent temporal intent. Explicit temporal intent means that a time constraint is stated in the topic; for example, the topic "2008 earthquake" explicitly requires earthquake information from 2008. Latent temporal intent means that no time constraint is stated in the topic, but the event it describes implies temporal characteristics; for example, the topic "Wenchuan earthquake" implies the start time of the Wenchuan earthquake, May 12, 2008.

In topic-focused network information collection, the start time and the temporal distribution of a topic event play different roles. Temporal intent identification in the invention therefore has two parts: identifying the start time of the topic event and identifying its temporal distribution.

In existing temporal information retrieval, identifying the temporal intent of query terms relies mainly on prior data such as user search logs and annotated news corpora. On this basis, the invention likewise uses prior data to identify the temporal intent of topics. In one specific embodiment, the prior data is Google Trends data.

Google Trends data is the search volume index of a query term over a past period of time. It is not the raw search volume but a value normalized relative to the total search volume. After normalization, Google Trends values range from 0 to 100, and a larger value indicates a larger search volume. Google Trends data has been widely used in disease prediction, conservation biology, online public opinion analysis, and other areas. The main reason is that it reflects users' level of attention to the content of a query term: a larger search volume means more people are paying attention, and the more people pay attention, the more likely it is that an event related to that content has occurred. Based on this characteristic, the invention uses Google Trends data to identify the temporal intent of land-cover topic events, in two steps:

(1) Identifying the start time of the topic event: this relies mainly on the change of the search volume index in Google Trends data from absent to present. Given that an event evolves through emergence, development, change, and extinction, few users pay attention to the topic before the event arises, and the search volume does not reach the level required for Google Trends statistics. In practice, the start time identification method based on Google Trends data handles only topics whose search volume index in the initial period is 0. There are two reasons. First, not every topic has a definite start time (for example, the topic "earthquake" does not refer to one specific event and has no definite start time), and the initial search volume index of such topics is not 0. Second, Google Trends data itself is limited: its statistics begin in January 2004, so the initial search volume index of a topic that arose before 2004 and continued into 2004 is not 0. The identified topic start time is the moment at which a search volume index greater than 0 first appears in the Google Trends data.

(2) Quantifying the temporal distribution of the topic event: this is expressed directly by the variation of the search volume index in Google Trends data, i.e. the search volume index is used to quantize the temporal distribution. Google Trends data inherently reflects how the attention paid to the topic on the Internet varies across periods, which is the temporal distribution of the topic event.
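The two identification steps above can be sketched in Python as follows. The sketch assumes the search volume index has already been obtained as a time-ordered list of (period start, period end, index) tuples; how that series is downloaded is outside the sketch, and the function names and sample values are hypothetical.

```python
from datetime import date


def identify_start_time(series):
    """Return the start of the first period whose index is > 0.

    Returns None when the series is empty, when its first index is not 0
    (the topic has no identifiable start within the data), or when the
    index never rises above 0.
    """
    if not series or series[0][2] != 0:
        return None
    for period_start, _period_end, index in series:
        if index > 0:
            return period_start
    return None


def quantify_distribution(series):
    """Keep only <period, index> pairs with a non-zero index."""
    return [(start, end, index) for start, end, index in series if index > 0]


# Hypothetical monthly values, for illustration only.
series = [
    (date(2008, 3, 1), date(2008, 3, 31), 0),
    (date(2008, 4, 1), date(2008, 4, 30), 0),
    (date(2008, 5, 1), date(2008, 5, 31), 100),
    (date(2008, 6, 1), date(2008, 6, 30), 23),
]
start_time = identify_start_time(series)
distribution = quantify_distribution(series)
```

Note that the resolution of the identified start time is limited by the period length of the series; with monthly data, only the month of onset is recovered.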

Applying the start time identification method above, temporal intent identification based on Google Trends data can roughly identify the start time of a topic event. For example, the topic "Wenchuan earthquake" received heavy user attention from May 2008 to December 2008 and attracted attention again in May 2009, the anniversary month, which is consistent with the evolution of the event. This shows that quantifying the temporal distribution of topic events directly from Google Trends data is reasonable.

In addition, the Baidu Index can also serve as prior data for identifying temporal intent. Like Google Trends data, it is based on the query logs of a general-purpose search engine, Baidu, and reflects the user attention and media attention that different topic query terms received over a past period. The topic temporal intent identification method based on the Baidu Index is similar to the one based on Google Trends data and is not described again here.

Step B, topic representation and relevance computation that take temporal intent into account: represent the temporal intent and the ordinary keywords of the topic separately using different representation methods, and compute the time relevance and the ordinary-keyword relevance separately;

In existing topic-focused network information collection, a traditional single vector is generally used to represent a topic that contains temporal intent, and such a vector cannot express the start time or the temporal distribution. The method provided by the invention therefore uses different forms to represent the ordinary keywords of the topic, the start and end time of the topic, the temporal distribution of the topic, and the ordinary keywords and publication time of the page content. Specifically:

(1) Representing ordinary keywords with a single vector: the ordinary keywords of the topic and of the page content are represented as <keyword, weight> pairs. The dimension of the vector depends on the number of keywords in the topic and stays fixed as long as the topic is unchanged.

(2) Representing temporal intent based on the international time standard: in the international standard, time is divided into instants and periods. The start time of the topic and the publication time of the page content are usually time points and are represented as instants. For ease of computation, the invention represents the start time and end time of the topic together as a period (the start and end time). The temporal distribution reflects how the attention paid to the event varies across time ranges, so it is represented as <period, search volume index> pairs, where the period is the time range and the search volume index is the attention value of the topic event. In particular, to save storage space, periods whose search volume index is 0 are not represented.

Their formal representations are as follows:

(1) Overall formal representation of the topic and the page content: given a topic T and page content D, they can be represented by the following formulas.

T = <V_Tk, T_ST, T_TD> (1-2)

D = <V_Dk, T_PT> (1-3)

where V_Tk, T_ST, and T_TD denote the ordinary keyword vector of the topic, the start and end time of the topic, and its temporal distribution respectively; V_Dk and T_PT denote the ordinary keyword vector of the page content and its publication time.

(2) Formal representation of the topic: its ordinary keyword vector V_Tk, start and end time T_ST, and temporal distribution T_TD can be expressed by the following formulas.

V_Tk = {(k₁, w_Tk1), (k₂, w_Tk2), ..., (k_s, w_Tks)} (1-4)

T_ST = [t_STs, t_STe] (1-5)

T_TD = {<[t_TDs1, t_TDe1], λ₁>, ..., <[t_TDsr, t_TDer], λ_r>} (1-6)

where k_i denotes the i-th ordinary keyword in the topic; w_Tki denotes the weight of ordinary keyword k_i; s denotes the number of ordinary keywords in the topic; t_STs denotes the start time of the topic, either specified by the user or identified by the method of Step A; t_STe denotes the end time of the topic, either specified by the user or defaulting to infinity; <[t_TDsi, t_TDei], λ_i> denotes the i-th <period, search volume index> pair in the temporal distribution; t_TDsi and t_TDei are the start time and end time of the i-th period, and λ_i is the search volume index value of the i-th period. These parameters can be obtained from the prior data used in Step A (for example Google Trends data), omitting periods whose search volume index is 0;

(3) Formal representation of the page content: its ordinary keyword vector V_Dk and publication time T_PT are expressed by the following formulas.

V_Dk = {(k₁, w_Dk1), (k₂, w_Dk2), ..., (k_s, w_Dks)} (1-7)

T_PT = t_PT (1-8)

where k_i denotes the i-th ordinary keyword in the page content; w_Dki denotes the weight of ordinary keyword k_i; t_PT denotes the publication time of the page.
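The representations in formulas (1-2) to (1-8) map naturally onto simple record types. The following Python sketch is one possible encoding, not the patent's own data structure; the class and field names are hypothetical, and an end time of None is used here to stand for the "infinity" default.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional, Tuple


@dataclass
class Topic:
    """T = <V_Tk, T_ST, T_TD>, formulas (1-2) and (1-4) to (1-6)."""
    keywords: Dict[str, float]      # V_Tk: keyword -> weight
    start: date                     # t_STs
    end: Optional[date] = None      # t_STe; None means unbounded
    # T_TD: (period start, period end, search volume index), zero periods omitted
    distribution: List[Tuple[date, date, float]] = field(default_factory=list)


@dataclass
class Page:
    """D = <V_Dk, T_PT>, formulas (1-3), (1-7) and (1-8)."""
    keywords: Dict[str, float]      # V_Dk: keyword -> weight
    published: date                 # t_PT


# Illustrative instance; the weights and index value are made up.
topic = Topic(
    keywords={"wenchuan": 0.6, "earthquake": 0.4},
    start=date(2008, 5, 12),
    distribution=[(date(2008, 5, 1), date(2008, 5, 31), 100.0)],
)
page = Page(keywords={"earthquake": 1.0}, published=date(2008, 5, 20))
```

Keeping the temporal fields separate from the keyword vector is what allows the time relevance and the keyword relevance to be computed independently.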

The weights of the ordinary keywords in the topic and in the page content can be computed with existing techniques, for example the method provided in "Wu H, Chen J, et al. A Focused Crawler for Borderlands Situation Information with Geographical Properties of Place Names [J]. Sustainability, 2014, 6(10): 6529-6552."

As described in the background, traditional topic relevance computation judges relevance to the topic from page content alone, which weakens the ability of the topic start time to independently filter out some irrelevant information; this easily causes some information to be misjudged and hurts the precision of topical crawling. Building on the traditional vector space model, the invention judges the relevance between page content and topic in two steps, from the start time and from the ordinary keywords, thereby providing a new topic relevance computation strategy that takes the start time into account. The computation proceeds in the following two steps:

(1) Compute the time relevance between the topic and the page content. Because the topic start time can by itself filter out some irrelevant information, comparing the publication time of the page content with the start and end time of the topic is enough for a preliminary judgment of whether the page is relevant to the topic. The time relevance can therefore be computed by the following formula.

where sim(T_PT, T_ST) denotes the time relevance value between the topic and the page content; the other parameters are as described above. A time relevance value of 0 means the page content is not relevant to the topic, and the page should be discarded during crawling. A time relevance value of 1 means the page content may be relevant to the topic, and the final relevance must be determined further from the page content. The ordinary topic relevance is therefore computed only when the time relevance value is 1.

(2) Compute the ordinary topic relevance between the topic and the page content. The ordinary keywords of the topic and the page content are still represented as single vectors, so the relevance value can be computed with the traditional cosine formula, as shown in the following formula.

where sim(V_Dk, V_Tk) denotes the ordinary topic relevance value between topic T and page content D; the other parameters are as described above. If sim(V_Dk, V_Tk) is greater than or equal to a given threshold, the page content is judged relevant to the topic; otherwise it is judged irrelevant and the page is discarded.
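The two formulas referred to in steps (1) and (2), numbered (1-9) and (1-10) later in the description, are not reproduced in this text. A reconstruction consistent with the surrounding description is given below; it is an assumption based on the stated behavior (a value of 0 or 1 from comparing the publication time with the start and end time, and the traditional cosine formula), not the patent's verbatim formulas.

```latex
% Assumed form of (1-9): binary time relevance
sim(T_{PT}, T_{ST}) =
\begin{cases}
1, & t_{STs} \le t_{PT} \le t_{STe} \\
0, & \text{otherwise}
\end{cases}

% Assumed form of (1-10): cosine similarity of the ordinary keyword vectors
sim(V_{Dk}, V_{Tk}) =
\frac{\sum_{i=1}^{s} w_{Dki}\, w_{Tki}}
     {\sqrt{\sum_{i=1}^{s} w_{Dki}^{2}}\;\sqrt{\sum_{i=1}^{s} w_{Tki}^{2}}}
```

When the end time defaults to infinity, the first condition reduces to requiring that the page be published no earlier than the topic start time.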

In the topic relevance computation strategy that takes the start time into account, the time relevance is computed first because it is comparatively simple to compute.
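The two-step strategy can be sketched in Python as below. The binary time check follows the described behavior (0 means discard, 1 means continue), while the relevance threshold of 0.5 is a placeholder chosen for this sketch, since the description only speaks of "a given threshold". All function names are hypothetical.

```python
import math
from datetime import date


def time_relevance(published, start, end=None):
    """Return 1 if the publication time falls within [start, end], else 0.

    end=None stands for an unbounded (infinite) end time.
    """
    if published < start:
        return 0
    if end is not None and published > end:
        return 0
    return 1


def cosine_similarity(u, v):
    """Cosine similarity of two sparse keyword-weight vectors (dicts)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_relevant(page_keywords, published, topic_keywords, start, end=None,
                threshold=0.5):
    """Two-step check: cheap time filter first, keyword cosine second."""
    if time_relevance(published, start, end) == 0:
        return False  # published outside the topic period: discard
    return cosine_similarity(page_keywords, topic_keywords) >= threshold
```

Running the inexpensive date comparison first means the cosine computation is skipped entirely for pages published before the topic began.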

Step C: based on the time relevance and ordinary-keyword relevance computed in Step B, construct an increasing function whose variable is the quantized value of the temporal distribution obtained in Step A, and incorporate it into the content-based URL priority assignment method to obtain a URL priority assignment formula based on the temporal distribution quantized value. URLs from periods of high attention thereby receive higher priority, which solves the problem of the temporal distribution being treated as uniform.

In topic-focused network information collection, the temporal distribution of the topic affects the priority order of information discovery. Specifically, if many relevant pages exist at the publication time t of the page corresponding to a URL, then, for a given topic T, the probability Pr(t/T) that a page published at time t is relevant to topic T is large, so URLs from that time should have higher priority. Existing URL priority assignment methods do not consider this property.

To solve this problem, the invention provides a URL priority assignment method based on the quantized value of the temporal distribution (i.e. the search volume index in the Google Trends data described above). The process is as follows:

First, construct an increasing function with the quantized value as its independent variable. The quantized value of the temporal distribution reflects, to some extent, the number of relevant pages published in a given period, and the two are roughly proportional: the larger the quantized value, the more relevant pages were published. An increasing function captures exactly this property. The invention therefore constructs an exponential function with the natural constant e as the base and the temporal distribution quantized value as the exponent (the natural exponential function).

Then, fuse the increasing function with the content-based URL priority assignment method. Before fusion, the method first computes the content priority with the content-based URL priority assignment method, and performs the fusion only when that value is greater than or equal to a given threshold. This ensures that the temporal distribution influences only the discovery order of URLs belonging to relevant pages and does not raise the discovery order of URLs belonging to irrelevant pages. In the fusion, the invention multiplies the content priority by the increasing function.

Finally, the URL priority assignment formula based on the temporal distribution quantized value is as follows.

where Priority_T(URL) denotes the final URL priority;

Priority(URL) is the priority obtained by the existing content-based URL priority assignment method, which can be computed with formula (1-1) given in the background; Pr(t/T) is the normalized value of the temporal distribution quantized value and also represents the probability that a page published at time t is relevant to topic T. The threshold in the formula takes a value in the interval from 0 to 1: when it is 1, the URL priority is always computed in the traditional way; when it is 0, the URL priority is always computed with the method that incorporates the temporal distribution.
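The priority formula itself, numbered (1-11) later in the description, is not reproduced in this text. Based on the stated construction (the content priority multiplied by a natural exponential of the normalized quantized value, applied only at or above the threshold), one consistent reconstruction is the following. The symbol δ for the threshold is introduced here for illustration and the exact form is an assumption.

```latex
% Assumed form of (1-11); \delta denotes the threshold (symbol chosen here)
Priority_T(URL) =
\begin{cases}
Priority(URL) \times e^{\Pr(t/T)}, & Priority(URL) \ge \delta \\
Priority(URL), & Priority(URL) < \delta
\end{cases}
```

Since the exponent is non-negative, the factor is at least 1, so fusion can only raise the priority of a URL whose content priority already passes the threshold. With δ = 0 every URL is fused, and as δ approaches 1 almost none are, which matches the limiting behavior described above.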

In a preferred embodiment, the computation of the URL priority assignment method based on the temporal distribution quantized value is divided into six steps, as follows:

(1) Quantify the temporal distribution of the topic. The temporal distribution of the topic can be obtained from Google Trends data; its quantized value is the search volume index in the Google Trends data.

(2) Estimate the publication time t of the page content corresponding to the URL to be downloaded.

During information discovery, the publication time of the page content corresponding to a URL to be downloaded is unknown. The invention uses two main estimation methods:

1) Estimation based on the URL string: when the URL string itself contains time information (for example, "20080905" in "http://news.sohu.com/20080905/n259388056.shtml" is the publication time of the page the URL points to), the time is extracted with a suitable time regular expression and used as the publication time of the page content corresponding to the URL to be downloaded;

2) Estimation based on the parent page time: when the URL string itself contains no time information, the publication time of the parent page of the URL to be downloaded is used as the publication time of the corresponding page content. On the one hand, the publication time of the parent page is usually slightly later than or equal to that of the page the URL points to, and each period in Google Trends data spans a comparatively long interval. On the other hand, this assumption does not affect the relevance value between the topic and the page the URL points to; it affects only the discovery order of the URL.

(3) Normalize the quantized value of the temporal distribution to obtain Pr(t/T). As described above, it suffices to obtain the search volume index of the period containing time t and normalize it, as shown in the following formula.

The parameters in the formula are as described above.

(4) Compute the anchor text topic relevance value sim(V_Ak, V_Tk) of the URL to be downloaded. The anchor text vector (composed of the anchor text, its context, and the URL string information) is given by the following formula,

V_Ak = {(k₁, w_Ak1), (k₂, w_Ak2), ..., (k_s, w_Aks)} (1-13)

The anchor text topic relevance value is given by the following formula.

where V_Ak denotes the anchor text vector; w_Aki denotes the weight of ordinary keyword k_i in the anchor text; the other parameters are as described above.

(5) Compute the content priority Priority(URL) of the URL to be downloaded: the formula is the one given in the background. Because the anchor text is the page's direct description of the URL to be downloaded, it is more important than the parent page content, so the invention sets the decay factors θ and γ in the formula to 0.4 and 0.6 respectively.

(6) Compute the final priority of the URL to be downloaded: the formula is (1-11). Based on experimental analysis, the invention sets the threshold in formula (1-11) to 0.4.
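Steps (2), (3), and (6) of the procedure above can be sketched in Python as follows. Three points are assumptions of this sketch rather than statements from the patent: the regular expression matches only an eight-digit YYYYMMDD path segment, as in the example URL; the normalization divides the index by 100, the upper bound of Google Trends values, because the normalization formula is not reproduced in this text; and the final priority multiplies the content priority by e raised to Pr(t/T) when the content priority reaches the threshold. Function names are hypothetical.

```python
import math
import re
from datetime import date, datetime

# An 8-digit path segment such as /20080905/ (assumed URL date layout).
_URL_DATE = re.compile(r"/(\d{8})(?=/)")


def date_from_url(url):
    """Return the first YYYYMMDD path segment that is a valid date, else None."""
    for match in _URL_DATE.finditer(url):
        try:
            return datetime.strptime(match.group(1), "%Y%m%d").date()
        except ValueError:
            continue  # eight digits that do not form a real date
    return None


def estimate_publish_date(url, parent_published):
    """Use the date in the URL if present, else fall back to the parent page."""
    found = date_from_url(url)
    return found if found is not None else parent_published


def normalized_index(t, distribution, scale=100.0):
    """Pr(t/T): index of the period containing t, divided by `scale` (assumed)."""
    for start, end, index in distribution:
        if start <= t <= end:
            return index / scale
    return 0.0  # periods with index 0 are not stored


def final_priority(content_priority, pr, threshold=0.4):
    """Boost the content priority by e**pr only at or above the threshold."""
    if content_priority >= threshold:
        return content_priority * math.exp(pr)
    return content_priority
```

For the example URL in step (2), `date_from_url("http://news.sohu.com/20080905/n259388056.shtml")` yields 5 September 2008. Real sites use many other date layouts, so a production crawler would need additional patterns.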

In a specific embodiment, the present invention is directed to as much as possible is found to have the networks of temporal characteristics from network Change information, while the incoherent information of downloading as few as possible.Its basic procedure may include following five step:

(1) Preparation: the user specifies the content theme and initial URLs relevant to the theme. Then, the time intention recognition method based on Google Trends data is used to determine the start time of the theme and quantify its time distribution.

(2) Request and parse web pages: using the HTTP protocol, request from the Internet the initial URL or the highest-priority URL in the URL priority queue, to obtain the web page content for that URL. Next, according to the page's Document Object Model (DOM), parse out the page's title, body text, publication time, URLs to be downloaded, and their anchor text information.

(3) Theme relevance calculation: first, based on the theme start time and the web page publication time obtained in steps (1) and (2), use formulas (1-2) to (1-6) to represent the start and end time, general keywords, and time distribution of the theme, and the general keywords and publication time of the web page content; then calculate their time relevance using formula (1-9), filtering out web page content that has a Before temporal relationship with the theme; then calculate the general theme relevance value using formula (1-10). When the relevance value is greater than or equal to a certain threshold, the web page is saved to the web page resource library; otherwise, the web page is judged to be irrelevant to the theme and is discarded.

(4) URL priority assignment: calculate the URL priority according to formulas (1-11) to (1-14), then store the URL into the URL priority queue according to that priority value.

(5) Repeat steps (2), (3), and (4) until the URL priority queue is empty or a certain termination condition is reached.
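The five steps above can be sketched as a priority-queue loop. This is only an outline under stated assumptions: `fetch`, `is_relevant`, and `priority` are hypothetical callables standing in for steps (2), (3), and (4); a page is assumed to be a dictionary with a `"links"` entry; links found on discarded pages are assumed not to be enqueued; and `max_pages` stands in for the unspecified termination condition.

```python
import heapq
from typing import Any, Callable, Dict, Iterable, List, Optional, Set, Tuple

Page = Dict[str, Any]  # hypothetical parsed page, e.g. {"links": [...], ...}


def crawl(seeds: Iterable[str],
          fetch: Callable[[str], Optional[Page]],
          is_relevant: Callable[[Page], bool],
          priority: Callable[[Page, str], float],
          max_pages: int = 1000) -> List[str]:
    """Visit URLs in descending priority order and return the URLs of saved pages."""
    seeds = list(seeds)
    # heapq is a min-heap, so priorities are negated to pop the highest first.
    queue: List[Tuple[float, str]] = [(-1.0, url) for url in seeds]
    heapq.heapify(queue)
    seen: Set[str] = set(seeds)
    saved: List[str] = []
    fetched = 0
    while queue and fetched < max_pages:
        _, url = heapq.heappop(queue)          # step (2): highest-priority URL
        page = fetch(url)
        fetched += 1
        if page is None or not is_relevant(page):
            continue                           # step (3): discard irrelevant pages
        saved.append(url)
        for link in page["links"]:             # step (4): assign priorities
            if link not in seen:
                seen.add(link)
                heapq.heappush(queue, (-priority(page, link), link))
    return saved                               # step (5): queue empty or limit reached
```

Passing the three steps as callables keeps the loop independent of how the time relevance and priority formulas are actually evaluated.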

Under identical hardware conditions and network bandwidth, the method provided by the present invention captures 10%-30% more web pages than existing subject network information collecting methods, and can improve precision by about 10%.

Those skilled in the art will appreciate that although the present invention is described by way of multiple embodiments, not every embodiment contains only one independent technical solution. The specification is written this way merely for the sake of clarity. Those skilled in the art should treat the specification as a whole and understand the scope of protection of the present invention by regarding the technical solutions of the embodiments as combinable with one another into different embodiments.

The foregoing is merely illustrative specific embodiments of the present invention and is not intended to limit its scope. Any equivalent changes, modifications, and combinations made by those skilled in the art without departing from the concept and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A subject network information collecting method that takes time intention into account, used to collect and order Internet web page information for subject events, characterized in that it comprises the following steps:

Step A: determine the start time of the subject event using prior data, and quantify its time distribution to obtain the quantized value of the time distribution;

Step B: represent the time intention and the general keywords in the theme separately using different representation methods, and calculate the time relevance and the general keyword relevance respectively;

Step C: based on the time relevance and the general keyword relevance calculated in step B, construct an increasing function whose variable is the quantized value of the time distribution obtained in step A, and incorporate it into the URL priority assignment method based on web page content, thereby obtaining a URL priority assignment formula based on the time distribution quantized value and calculating the final URL priority, so that URLs at the moments of greatest attention obtain higher priority.

The URL priority assignment formula is:

where Priority_T(URL) denotes the final URL priority; Priority(URL) is the priority obtained by the existing URL priority assignment method based on web page content; Pr(t/T) is the normalized value of the time distribution quantized value, which also represents the probability that a web page published at time t is relevant to theme T; and the threshold takes a value in the interval 0 to 1.

2. The method according to claim 1, characterized in that the prior data in step A is Google Trends data.

3. The method according to claim 1, characterized in that, in step B, the time intention in the theme is expressed as follows:

General formal representation of the theme and the web page content: given a theme T and web page content D, they are represented as follows:

T = <V_Tk, T_ST, T_TD>

D = <V_Dk, T_PT>

where V_Tk, T_ST, and T_TD denote the general vector of the theme, the start and end time of the theme, and its time distribution, respectively; V_Dk and T_PT denote the general vector of the web page content and its publication time, respectively.

Formal representation of the theme: its general vector V_Tk, start and end time T_ST, and time distribution T_TD are expressed by the following formulas:

V_Tk = {(k₁, w_Tk1), (k₂, w_Tk2), ..., (k_s, w_Tks)}

T_ST=[t_STs,t_STe]

T_TD={ < [t_TDs1,t_TDe1], λ₁>,...,<[t_TDsr,t_TDer], λ_r>}

where k_i denotes the i-th general keyword in the theme; w_Tki denotes the weight of general keyword k_i; s denotes the number of general keywords in the theme; t_STs denotes the start time of the theme and t_STe its end time; <[t_TDsi, t_TDei], λ_i> denotes the i-th <period, search volume index> pair in the time distribution; t_TDsi and t_TDei are the start time and end time of the i-th period, respectively, and λ_i is the search volume index value of the i-th period.

Formal representation of the web page content: its general vector V_Dk and publication time T_PT are expressed by the following formulas:

V_Dk={ (k₁,w_Dk1),(k₂,w_Dk2),...,(k_s,w_Dks)}

T_PT=t_PT

where k_i denotes the i-th general keyword in the web page content; w_Dki denotes the weight of general keyword k_i; and t_PT denotes the publication time of the web page.

4. The method according to claim 3, characterized in that the formulas for calculating the time relevance and the general keyword relevance in step B are, respectively, as follows:

5. The method according to claim 1, characterized in that the threshold is set to 0.4.