CN106250512B - A kind of subject network information collecting method for taking time intention into account - Google Patents

Info

Publication number
CN106250512B
Authority
CN
China
Prior art keywords
time
theme
web page
url
page contents
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610630419.4A
Other languages
Chinese (zh)
Other versions
CN106250512A (en)
Inventor
陈军
武昊
侯东阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NATIONAL GEOMATICS CENTER OF CHINA
Original Assignee
NATIONAL GEOMATICS CENTER OF CHINA
Application filed by NATIONAL GEOMATICS CENTER OF CHINA
Priority to CN201610630419.4A
Publication of CN106250512A
Application granted
Publication of CN106250512B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

A topic-focused web information collection method that takes time intent into account, used to collect and rank Internet web page information about a topic event, comprises the following steps. Step A: determine the start time of the topic event from prior data and quantify its time distribution, obtaining a quantized value of the time distribution. Step B: represent the time intent and the general keywords of the topic separately, using different representation methods, and compute the time relevance and the general-keyword relevance separately. Step C: based on the time relevance and the general-keyword relevance computed in Step B, construct an increasing function whose variable is the time-distribution quantized value obtained in Step A, derive from it a URL priority assignment formula based on that quantized value, and compute the final URL priority. The method substantially increases the number of web pages discovered and the precision.

Description

A kind of subject network information collecting method for taking time intention into account
Technical field
The present invention relates to the field of Internet web page search, in particular to topic crawling methods for acquiring web pages with specific content from the Internet, and more particularly to a topic-focused web information collection method that takes time intent into account.
Background
Topic crawling is a key technique for acquiring domain-specific web pages from the Internet; it aims to download as many web pages relevant to a specified topic as possible. Guided by the user-specified topic, it relies on crawling strategies such as topic relevance computation and URL priority assignment to continuously obtain relevant web page information from ubiquitous network resources.
URL priority assignment based on web page content is the method commonly used in traditional topic crawling. The priority is computed mainly from two kinds of relevance: (1) parent-page content topic relevance: the higher its value, the higher the priority of the URLs contained in the parent page; (2) anchor-text topic relevance: the relevance between the topic and information such as the anchor text, the anchor-text context, and the URL string, where the anchor text is usually a summary description of the page content that the URL points to.
In content-based URL priority assignment, both relevance values are usually computed with the cosine formula. For example, if the parent-page content topic relevance of a URL is $\mathrm{sim}(V_{Dk}, V_{Tk})$ and its anchor-text topic relevance is $\mathrm{sim}(V_{Ak}, V_{Tk})$, the priority $\mathrm{Priority}(URL)$ of the URL can be computed as:
$$\mathrm{Priority}(URL) = \theta \times \mathrm{sim}(V_{Dk}, V_{Tk}) + \gamma \times \mathrm{sim}(V_{Ak}, V_{Tk}) \tag{1-1}$$
In the formula above, $\theta$ and $\gamma$ denote the decay factors of the parent-page content topic relevance and the anchor-text topic relevance, respectively, and satisfy $\theta + \gamma = 1$.
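As an illustration of formula (1-1), the following is a minimal sketch that assumes keyword vectors are stored as Python dictionaries mapping keyword to weight. The function names `cosine_sim` and `content_priority` are hypothetical, and the default values 0.4 and 0.6 are the ones the document adopts later for θ and γ.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity of two sparse keyword-weight vectors (dicts)."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (norm_a * norm_b)


def content_priority(parent_vec, anchor_vec, topic_vec, theta=0.4, gamma=0.6):
    """Formula (1-1): weighted sum of the two relevance values."""
    if abs(theta + gamma - 1.0) > 1e-9:
        raise ValueError("theta + gamma must equal 1")
    return (theta * cosine_sim(parent_vec, topic_vec)
            + gamma * cosine_sim(anchor_vec, topic_vec))


# Illustrative vectors only; the weights are made up.
topic = {"wenchuan": 1.0, "earthquake": 1.0}
parent = {"earthquake": 1.0}
anchor = {"wenchuan": 1.0, "earthquake": 1.0}
print(content_priority(parent, anchor, topic))
```

Here the anchor text matches the topic exactly while the parent page matches only one keyword, so the anchor-text term dominates the result.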
When topic crawling is used to collect time-sensitive emergency information, time intent usually serves as a constraining element of the topic. According to the ISO 19100 series of standards (2002), temporal objects are divided into "instants" and "periods": an instant is a point in temporal space, while a period corresponds to a line in temporal space, with attributes such as a start point, an end point, and a length. In general, information about an emergency event is published on the web mainly after the event occurs, so the publication time of a report should be later than the start time of the event. In addition, an emergency event evolves through emergence, development, change, and extinction, and public attention to the event differs across these stages. Preferentially downloading information from the periods with higher attention meets most users' needs, and this pattern of attention reflects, to some extent, the time distribution of the event. In other words, when web information is collected by topic, time intent (such as the start time and the time distribution) plays a clear role both in judging information relevance and in ordering information discovery.
Although setting a start time can by itself filter out part of the irrelevant information when topic crawling is used to collect web information, and the time distribution affects the order in which information is discovered, traditional web information collection methods still consider only the ordinary semantics of the topic. They do not analyze or exploit the time intent of the topic and in effect treat the time distribution as uniform, which lowers their precision. Specifically:
(1) No representation of time intent: the traditional single-vector topic representation expresses only the topic keywords and provides no way to represent the time intent.
(2) The role of the topic start time is weakened: traditional topic relevance computation judges the relevance of a web page to the topic solely from the page content.
(3) The influence of the topic time distribution on the order of information discovery is ignored: traditional URL priority assignment methods mainly use the page content, the anchor text and its context, the URL string, the link relationships, and even the page update time, but ignore the topic time distribution.
Summary of the invention
The technical problem to be solved by the present invention is to provide a topic-focused web information collection method that takes time intent into account, so as to reduce or avoid the problems mentioned above.
To solve this technical problem, the present invention provides a topic-focused web information collection method that takes time intent into account, used to collect and rank Internet web page information about a topic event, comprising the following steps:
Step A: determine the start time of the topic event from prior data and quantify its time distribution, obtaining a quantized value of the time distribution.
Step B: represent the time intent and the general keywords of the topic separately, using different representation methods, and compute the time relevance and the general-keyword relevance separately.
Step C: based on the time relevance and the general-keyword relevance computed in Step B, construct an increasing function whose variable is the time-distribution quantized value obtained in Step A, and integrate it into the content-based URL priority assignment method to obtain a URL priority assignment formula based on the time-distribution quantized value, from which the final URL priority is computed, so that URLs from periods of high attention receive higher priority.
Preferably, the prior data in Step A is Google Trends data.
Preferably, in Step B, the time intent in the topic is represented as follows.
General formal representation of the topic and the web page content: given a topic $T$ and web page content $D$, they are represented as:
$$T = \langle V_{Tk}, T_{ST}, T_{TD} \rangle$$
$$D = \langle V_{Dk}, T_{PT} \rangle$$
where $V_{Tk}$, $T_{ST}$, and $T_{TD}$ denote the general vector of the topic, the start-end time of the topic, and its time distribution, respectively; $V_{Dk}$ and $T_{PT}$ denote the general vector of the web page content and its publication time.
Formal representation of the topic: its general vector $V_{Tk}$, start-end time $T_{ST}$, and time distribution $T_{TD}$ are expressed as:
$$V_{Tk} = \{(k_1, w_{Tk_1}), (k_2, w_{Tk_2}), \ldots, (k_s, w_{Tk_s})\}$$
$$T_{ST} = [t_{STs}, t_{STe}]$$
$$T_{TD} = \{\langle [t_{TDs_1}, t_{TDe_1}], \lambda_1 \rangle, \ldots, \langle [t_{TDs_r}, t_{TDe_r}], \lambda_r \rangle\}$$
where $k_i$ is the $i$-th general keyword in the topic; $w_{Tk_i}$ is the weight of general keyword $k_i$; $s$ is the number of general keywords in the topic; $t_{STs}$ is the start time of the topic and $t_{STe}$ is its end time; $\langle [t_{TDs_i}, t_{TDe_i}], \lambda_i \rangle$ is the $i$-th <period, search volume index> pair in the time distribution; $t_{TDs_i}$ and $t_{TDe_i}$ are the start time and end time of the $i$-th period, and $\lambda_i$ is the search volume index of the $i$-th period.
Formal representation of the web page content: its general vector $V_{Dk}$ and publication time $T_{PT}$ are expressed as:
$$V_{Dk} = \{(k_1, w_{Dk_1}), (k_2, w_{Dk_2}), \ldots, (k_s, w_{Dk_s})\}$$
$$T_{PT} = t_{PT}$$
where $k_i$ is the $i$-th general keyword in the web page content; $w_{Dk_i}$ is the weight of general keyword $k_i$; $t_{PT}$ is the publication time of the web page.
Preferably, in Step B, the time relevance and the general-keyword relevance are computed as follows.
The time relevance between the topic and the web page content is computed as:
$$\mathrm{sim}(T_{PT}, T_{ST}) = \begin{cases} 1, & t_{STs} \le t_{PT} \le t_{STe} \\ 0, & \text{otherwise} \end{cases}$$
where $\mathrm{sim}(T_{PT}, T_{ST})$ is the time relevance value between the topic and the web page content.
The general topic relevance between the topic and the web page content is computed as:
$$\mathrm{sim}(V_{Dk}, V_{Tk}) = \frac{\sum_{i=1}^{s} w_{Dk_i}\, w_{Tk_i}}{\sqrt{\sum_{i=1}^{s} w_{Dk_i}^{2}}\ \sqrt{\sum_{i=1}^{s} w_{Tk_i}^{2}}}$$
where $\mathrm{sim}(V_{Dk}, V_{Tk})$ is the general topic relevance value between topic $T$ and web page content $D$.
Preferably, the URL priority assignment formula in Step C is:
$$\mathrm{Priority}_T(URL) = \begin{cases} \mathrm{Priority}(URL) \times e^{\Pr(t/T)}, & \mathrm{Priority}(URL) \ge \text{threshold} \\ \mathrm{Priority}(URL), & \text{otherwise} \end{cases}$$
where $\mathrm{Priority}_T(URL)$ is the final URL priority; $\mathrm{Priority}(URL)$ is the priority obtained from the existing content-based URL priority assignment method; $\Pr(t/T)$ is the normalized time-distribution quantized value, which also represents the probability that a web page published at time $t$ is relevant to topic $T$. The threshold takes a value in the interval from 0 to 1.
Preferably, the threshold is set to 0.4.
Preferably, the priority $\mathrm{Priority}(URL)$ obtained from the content-based URL priority assignment method is computed as:
$$\mathrm{Priority}(URL) = \theta \times \mathrm{sim}(V_{Dk}, V_{Tk}) + \gamma \times \mathrm{sim}(V_{Ak}, V_{Tk})$$
where $\theta$ and $\gamma$ denote the decay factors of the parent-page content topic relevance and the anchor-text topic relevance, respectively, and satisfy $\theta + \gamma = 1$.
Preferably, the decay factor $\theta$ is set to 0.4 and $\gamma$ is set to 0.6.
The topic-focused web information collection method provided by the present invention quantifies the start time and the time distribution of the topic, formally expresses the time intent according to the international standard for time, and forms a multi-element representation in which the time intent and the general keywords (non-temporal words) are expressed independently. It then computes the time relevance and the general-keyword relevance in separate steps, and finally integrates the quantized time distribution, as the variable of an increasing function, into the URL priority assignment method to compute the URL priority. This substantially increases the number of web pages discovered and the precision.
Specific embodiment
For a clearer understanding of the technical features, objects, and effects of the present invention, specific embodiments of the invention are now described.
The present invention provides a topic-focused web information collection method that takes time intent into account, used to collect and rank Internet web page information about a topic event, comprising the following steps:
Step A: determine the start time of the topic event from prior data and quantify its time distribution, obtaining a quantized value of the time distribution.
The time intent of a topic refers to the temporal characteristics contained in the topic. The present invention divides it into explicit time intent and implicit time intent. Explicit time intent means that the topic states a time constraint explicitly; for example, the topic "2008 earthquake" explicitly requires earthquake information from 2008. Implicit time intent means that the topic states no explicit time constraint but the event it describes implies one; for example, the topic "Wenchuan earthquake" implies the start time of the Wenchuan earthquake, May 12, 2008.
In topic-focused web information discovery, the start time and the time distribution of a topic event play different roles. Time intent identification in the present invention therefore consists of two parts: identifying the start time of the topic event and identifying its time distribution.
In existing temporal information retrieval, the time intent of a query term is identified mainly with the help of prior data, such as user search logs and annotated news corpora. On this basis, the present invention also identifies the time intent of a topic from prior data. In a specific embodiment, the prior data used is Google Trends data.
Google Trends data is the search volume index of a query term over a past period. It is not the raw search volume but a value normalized relative to the total search volume; after normalization it ranges from 0 to 100, and a larger value indicates a larger search volume. Google Trends data has been widely used in disease prediction, conservation biology, online public opinion analysis, and other areas. The main reason is that it reflects how much attention users pay to the content of the query term: the larger the search volume, the more people are paying attention, and the more people pay attention, the more likely it is that an event related to that content has occurred. The present invention uses this property of Google Trends data to identify the time intent of land cover topic events, in two steps:
(1) Identify the start time of the topic event, mainly from the change of the search volume index in Google Trends data from zero to non-zero. Given the evolution of an event through emergence, development, change, and extinction, few users pay attention to a topic before the event occurs, so the search volume does not reach the level that Google Trends reports. In practice, start time identification based on Google Trends data handles only topics whose search volume index is 0 in the initial period. There are two reasons. First, not every topic has a definite start time (for example, the topic "earthquake" does not refer to a specific event and has no definite start time), and the initial search volume index of such topics is not 0. Second, Google Trends data itself is limited: its statistics begin in January 2004, so a topic that arose before 2004 and continued into 2004 has a non-zero initial search volume index. The identified start time is the first moment at which the search volume index in Google Trends data exceeds 0.
(2) Quantify the time distribution of the topic event, directly from the variation of the search volume index in Google Trends data; that is, the search volume index is used as the quantized time distribution. This works because Google Trends data itself reflects how the attention paid to the topic on the Internet changes across periods, which is the time distribution of the topic event.
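The two identification steps above can be sketched as follows. This is a minimal illustration that assumes the prior data has already been exported as a chronologically sorted list of (period start, search volume index) pairs; the function names and the sample values are hypothetical and are not real Google Trends data.

```python
from datetime import date


def identify_start_time(series):
    """Return the start of the first period whose index exceeds 0.

    Returns None when the series is empty, already starts above 0
    (no identifiable start time), or never rises above 0.
    """
    if not series or series[0][1] > 0:
        return None
    for period_start, index in series:
        if index > 0:
            return period_start
    return None


def quantize_time_distribution(series):
    """Keep only the periods with a non-zero search volume index."""
    return [(start, index) for start, index in series if index > 0]


# Made-up monthly indices for illustration only.
series = [
    (date(2008, 3, 1), 0),
    (date(2008, 4, 1), 0),
    (date(2008, 5, 1), 100),
    (date(2008, 6, 1), 37),
    (date(2008, 7, 1), 0),
]
print(identify_start_time(series))
print(quantize_time_distribution(series))
```

Returning `None` for a series that starts above 0 mirrors the restriction stated in step (1): such topics either have no definite start time or predate the available data.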
Time intent identification based on Google Trends data can identify the start time of a topic event approximately. For example, the topic "Wenchuan earthquake" received heavy user attention from May 2008 to December 2008 and attracted attention again in the anniversary month of May 2009, which matches the evolution of the event. This shows that quantifying the time distribution of a topic event directly with Google Trends data is reasonable.
In addition, the Baidu Index can also serve as the prior data for identifying time intent. Like Google Trends data, it is based on the query logs of a general-purpose search engine, Baidu, and reflects user attention and media attention for different topic query terms over a past period. Time intent identification based on the Baidu Index is similar to that based on Google Trends data and is not described further here.
Step B: topic representation and relevance computation that take time intent into account. Represent the time intent and the general keywords of the topic separately, using different representation methods, and compute the time relevance and the general-keyword relevance separately.
In existing topic-focused web information collection, a topic containing time intent is generally represented by a traditional single vector, which cannot express the start time and the time distribution. In the method of the present invention, therefore, different forms are used to represent the general keywords of the topic, the start-end time of the topic, the time distribution of the topic, and the general keywords and publication time of the web page content. Specifically:
(1) General keywords are represented with the single-vector approach: the general keywords of the topic and of the web page content are represented as <keyword, weight> pairs. The dimension of the vector depends on the number of keywords in the topic and remains fixed as long as the topic is unchanged.
(2) Time intent is represented according to the international standard for time, which divides time into instants and periods. The start time of the topic and the publication time of the web page content are usually single points in time and are represented as instants. For ease of computation, the present invention represents the start time and end time of the topic (the start-end time) as a period. The time distribution reflects how attention to the event varies across time ranges, so it is represented as <period, search volume index> pairs, where the period is the time range and the search volume index is the attention value of the topic event. To save storage space, periods whose search volume index is 0 are not represented.
Their formal representations are as follows:
(1) General formal representation of the topic and the web page content: given a topic $T$ and web page content $D$, they can be represented as:
$$T = \langle V_{Tk}, T_{ST}, T_{TD} \rangle \tag{1-2}$$
$$D = \langle V_{Dk}, T_{PT} \rangle \tag{1-3}$$
where $V_{Tk}$, $T_{ST}$, and $T_{TD}$ denote the general vector of the topic, the start-end time of the topic, and its time distribution, respectively; $V_{Dk}$ and $T_{PT}$ denote the general vector of the web page content and its publication time.
(2) Formal representation of the topic: its general vector $V_{Tk}$, start-end time $T_{ST}$, and time distribution $T_{TD}$ can be expressed as:
$$V_{Tk} = \{(k_1, w_{Tk_1}), (k_2, w_{Tk_2}), \ldots, (k_s, w_{Tk_s})\} \tag{1-4}$$
$$T_{ST} = [t_{STs}, t_{STe}] \tag{1-5}$$
$$T_{TD} = \{\langle [t_{TDs_1}, t_{TDe_1}], \lambda_1 \rangle, \ldots, \langle [t_{TDs_r}, t_{TDe_r}], \lambda_r \rangle\} \tag{1-6}$$
where $k_i$ is the $i$-th general keyword in the topic; $w_{Tk_i}$ is the weight of general keyword $k_i$; $s$ is the number of general keywords in the topic; $t_{STs}$ is the start time of the topic, either specified by the user or identified with the method in Step A; $t_{STe}$ is the end time of the topic, either specified by the user or defaulting to infinity; $\langle [t_{TDs_i}, t_{TDe_i}], \lambda_i \rangle$ is the $i$-th <period, search volume index> pair in the time distribution; $t_{TDs_i}$ and $t_{TDe_i}$ are the start time and end time of the $i$-th period, and $\lambda_i$ is the search volume index of the $i$-th period. These parameters can be obtained from the prior data used in Step A (for example, Google Trends data), omitting the periods whose search volume index is 0.
(3) Formal representation of the web page content: its general vector $V_{Dk}$ and publication time $T_{PT}$ are expressed as:
$$V_{Dk} = \{(k_1, w_{Dk_1}), (k_2, w_{Dk_2}), \ldots, (k_s, w_{Dk_s})\} \tag{1-7}$$
$$T_{PT} = t_{PT} \tag{1-8}$$
where $k_i$ is the $i$-th general keyword in the web page content; $w_{Dk_i}$ is the weight of general keyword $k_i$; $t_{PT}$ is the publication time of the web page.
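One way to hold these representations in code is sketched below. The class and field names (`Period`, `Topic`, `Page`, `index_at`) are hypothetical, keyword vectors are assumed to be dictionaries, and an end time of `None` stands in for the default of infinity.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Period:
    start: date    # t_TDs_i
    end: date      # t_TDe_i
    index: float   # lambda_i, the search volume index of the period


@dataclass
class Topic:
    keywords: Dict[str, float]          # V_Tk: keyword -> weight
    start: date                         # t_STs
    end: Optional[date] = None          # t_STe; None means no upper bound
    distribution: List[Period] = field(default_factory=list)  # T_TD


@dataclass
class Page:
    keywords: Dict[str, float]          # V_Dk: keyword -> weight
    published: date                     # t_PT


def index_at(topic: Topic, t: date) -> float:
    """Search volume index of the period containing t (0 if omitted)."""
    for period in topic.distribution:
        if period.start <= t <= period.end:
            return period.index
    return 0.0


# Illustrative values only; the weights and index are made up.
topic = Topic(
    keywords={"wenchuan": 0.5, "earthquake": 0.5},
    start=date(2008, 5, 12),
    distribution=[Period(date(2008, 5, 1), date(2008, 5, 31), 100.0)],
)
print(index_at(topic, date(2008, 5, 20)))
```

Because zero-index periods are not stored, a lookup that finds no matching period simply returns 0.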
The weights of the general keywords in the topic and in the web page content can be computed with existing techniques, for example the method given in: Wu H, Chen J, et al. A Focused Crawler for Borderlands Situation Information with Geographical Properties of Place Names[J]. Sustainability, 2014, 6(10): 6529-6552.
As described in the background, traditional topic relevance computation judges whether a web page is relevant to the topic from the page content alone. This weakens the ability of the topic start time to filter out part of the irrelevant information on its own, easily leads to misjudging some information, and lowers the precision of topic crawling. Building on the traditional vector space model, the present invention considers both the start time and the general keywords and judges the relevance between the web page content and the topic in two steps, giving a new topic relevance computation strategy that takes the start time into account. The computation proceeds in the following two steps:
(1) Compute the time relevance between the topic and the web page content. Because the topic start time can by itself filter out part of the irrelevant information, comparing the publication time of the web page content with the start-end time of the topic gives a preliminary judgment of whether the page is relevant to the topic. The time relevance can therefore be computed as:
$$\mathrm{sim}(T_{PT}, T_{ST}) = \begin{cases} 1, & t_{STs} \le t_{PT} \le t_{STe} \\ 0, & \text{otherwise} \end{cases} \tag{1-9}$$
where $\mathrm{sim}(T_{PT}, T_{ST})$ is the time relevance value between the topic and the web page content; the other parameters are as described above. A time relevance value of 0 means that the web page content is irrelevant to the topic, and the page should be discarded during crawling. A time relevance value of 1 means that the web page content may be relevant to the topic, and the final relevance must be determined from the page content. The general topic relevance is therefore computed only when the time relevance value is 1.
(2) Compute the general topic relevance between the topic and the web page content. The general keywords of the topic and of the web page content are still represented as single vectors, so the relevance value can be computed with the traditional cosine formula:
$$\mathrm{sim}(V_{Dk}, V_{Tk}) = \frac{\sum_{i=1}^{s} w_{Dk_i}\, w_{Tk_i}}{\sqrt{\sum_{i=1}^{s} w_{Dk_i}^{2}}\ \sqrt{\sum_{i=1}^{s} w_{Tk_i}^{2}}} \tag{1-10}$$
where $\mathrm{sim}(V_{Dk}, V_{Tk})$ is the general topic relevance value between topic $T$ and web page content $D$; the other parameters are as described above. If $\mathrm{sim}(V_{Dk}, V_{Tk})$ is greater than or equal to a given threshold, the web page content is judged relevant to the topic; otherwise it is judged irrelevant and the page is discarded.
In this topic relevance computation strategy, the time relevance is computed first because it is comparatively simple to compute.
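The two-step judgment can be sketched as below: the inexpensive time check runs first, and the cosine comparison runs only for pages that pass it. The function names are hypothetical, keyword vectors are assumed to be dictionaries, and the relevance threshold is left as a required argument because the text does not give a value for it.

```python
import math
from datetime import date
from typing import Dict, Optional


def time_relevance(published: date, start: date,
                   end: Optional[date] = None) -> int:
    """1 if the publication time lies within the topic's start-end time."""
    if published < start:
        return 0
    if end is not None and published > end:
        return 0
    return 1


def cosine_sim(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse keyword-weight vectors."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_relevant(page_vec: Dict[str, float], published: date,
                topic_vec: Dict[str, float], start: date,
                end: Optional[date], threshold: float) -> bool:
    """Step 1: time filter. Step 2: keyword relevance against a threshold."""
    if time_relevance(published, start, end) == 0:
        return False  # discard without computing the cosine
    return cosine_sim(page_vec, topic_vec) >= threshold


# Illustrative data only; 0.5 is an arbitrary threshold for the demo.
topic_vec = {"wenchuan": 1.0, "earthquake": 1.0}
print(is_relevant({"wenchuan": 1.0, "earthquake": 1.0}, date(2008, 5, 20),
                  topic_vec, date(2008, 5, 12), None, 0.5))
```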
Step C: based on the time relevance and the general-keyword relevance computed in Step B, construct an increasing function whose variable is the time-distribution quantized value obtained in Step A, and integrate it into the content-based URL priority assignment method to obtain a URL priority assignment formula based on the time-distribution quantized value, so that URLs from periods of high attention receive higher priority. This solves the problem of treating the time distribution as uniform.
In topic-focused web information collection, the time distribution of the topic affects the order of information discovery. Specifically, if many relevant web pages exist at the publication time $t$ of the page that a URL points to, then, for a given topic $T$, the probability $\Pr(t/T)$ that a page published at $t$ is relevant to $T$ is larger, and URLs from that time should have higher priority. Existing URL priority assignment methods do not consider this property.
To solve this problem, the present invention provides a URL priority assignment method based on the time-distribution quantized value (the search volume index in Google Trends data described above). The process is as follows.
First, construct an increasing function with the quantized value as the independent variable. The time-distribution quantized value reflects, to some extent, the number of relevant web pages published in a given period, and the two are positively related: the larger the quantized value, the more relevant pages were published. An increasing function expresses exactly this property. The present invention therefore constructs an exponential function with the natural constant e as the base and the time-distribution quantized value as the exponent (the natural exponential function).
Then, fuse the increasing function with the content-based URL priority assignment method. Before fusion, the content priority is first computed with the content-based method, and fusion is performed only when this value is greater than or equal to a given threshold. This ensures that the time distribution affects only the discovery order of URLs that point to relevant pages and does not raise the discovery order of URLs that point to irrelevant pages. The fusion itself multiplies the content priority by the increasing function.
Finally, the URL priority assignment formula based on the time-distribution quantized value is:
$$\mathrm{Priority}_T(URL) = \begin{cases} \mathrm{Priority}(URL) \times e^{\Pr(t/T)}, & \mathrm{Priority}(URL) \ge \text{threshold} \\ \mathrm{Priority}(URL), & \text{otherwise} \end{cases} \tag{1-11}$$
where $\mathrm{Priority}_T(URL)$ is the final URL priority;
$\mathrm{Priority}(URL)$ is the priority obtained from the existing content-based URL priority assignment method and can be computed with formula (1-1) from the background; $\Pr(t/T)$ is the normalized time-distribution quantized value, which also represents the probability that a web page published at time $t$ is relevant to topic $T$. The threshold takes a value in the interval from 0 to 1: when it is 1, the URL priority is always computed in the traditional way; when it is 0, the URL priority is always computed with the time distribution integrated.
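A minimal sketch of the fusion rule described above, assuming the piecewise form implied by the text: the content priority is multiplied by e raised to Pr(t/T) only when it reaches the threshold. The function name `final_priority` is hypothetical, and the default threshold of 0.4 is the value the document states as preferred.

```python
import math


def final_priority(content_priority: float, pr_t: float,
                   threshold: float = 0.4) -> float:
    """Boost the content priority by exp(Pr(t/T)) when it is high enough."""
    if content_priority >= threshold:
        return content_priority * math.exp(pr_t)
    return content_priority  # below the threshold: leave unchanged


# Two URLs with the same content priority but different attention levels,
# and one URL whose content priority is below the threshold.
print(final_priority(0.5, 0.8))
print(final_priority(0.5, 0.2))
print(final_priority(0.3, 0.8))
```

Since exp is increasing and is at least 1 for non-negative input, a URL above the threshold is never demoted, and among such URLs the one published in the higher-attention period ranks first.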
In a preferred embodiment, the computation of the URL priority based on the time-distribution quantized value is divided into six steps, as follows:
(1) Quantify the time distribution of the topic. The time distribution of the topic can be obtained from Google Trends data, and its quantized value is the search volume index in that data.
(2) Estimate the publication time $t$ of the web page content that the URL to be downloaded points to.
During information discovery, the publication time of the web page content that a URL to be downloaded points to is unknown. The present invention mainly uses two estimation methods:
1) Estimation from the URL string: when the URL string itself contains time information (for example, "20080905" in "http://news.sohu.com/20080905/n259388056.shtml" is the publication time of the page that the URL points to), the time is extracted with a corresponding time regular expression and used as the publication time of the web page content that the URL points to.
2) Estimation from the parent page: when the URL string itself contains no time information, the publication time of the parent page content is used as the publication time of the web page content that the URL points to. On the one hand, the publication time of the parent page content is usually slightly later than or equal to the publication time of the web page content that the URL points to, and each period in Google Trends data spans a relatively long interval. On the other hand, this assumption does not affect the relevance between the topic and the page that the URL points to; it only affects the discovery order of the URL.
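The two estimation methods can be sketched together as below. The regular expression is an assumption made for illustration (an 8-digit `YYYYMMDD` path segment, as in the example URL); real sites use many other date layouts, and the function name `estimate_publish_date` is hypothetical.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumed pattern: a path segment made of exactly eight digits.
_DATE_IN_URL = re.compile(r"/(\d{8})(?=/|$)")


def estimate_publish_date(url: str,
                          parent_date: Optional[date]) -> Optional[date]:
    """Prefer a date embedded in the URL; otherwise use the parent's date."""
    for match in _DATE_IN_URL.finditer(url):
        try:
            return datetime.strptime(match.group(1), "%Y%m%d").date()
        except ValueError:
            continue  # eight digits that do not form a valid date
    return parent_date


print(estimate_publish_date(
    "http://news.sohu.com/20080905/n259388056.shtml", None))
print(estimate_publish_date(
    "http://example.com/article.html", date(2008, 5, 13)))
```

Validating the digits with `strptime` avoids treating arbitrary numeric identifiers as dates.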
(3) Normalize the time-distribution quantized value to obtain $\Pr(t/T)$. As described above, it suffices to obtain the search volume index of the period that contains time $t$ and normalize it, as given by formula (1-12).
The parameters in the formula are as described above.
(4) Calculate the anchor text topic relevance value $\mathrm{sim}(V_{Ak}, V_{Tk})$ of the URL to be downloaded. The anchor text vector (composed of the anchor text, its context, and the URL string information) is given by the following formula:
$$V_{Ak} = \{(k_1, w_{Ak1}), (k_2, w_{Ak2}), \ldots, (k_s, w_{Aks})\} \tag{1-13}$$
The anchor text topic relevance value is given by formula (1-14).
In the formulas, $V_{Ak}$ denotes the anchor text vector; $w_{Aki}$ denotes the weight of general keyword $k_i$ in the anchor text; the other parameters are as described above.
(5) Calculate the content priority Priority(URL) of the URL to be downloaded; its formula is as given in the background art. Because the anchor text is the parent page's direct description of the URL to be downloaded, it is more important than the content of the parent page, so the present invention sets the decay factors θ and γ in the formula to 0.4 and 0.6, respectively.
(6) Calculate the final priority of the URL to be downloaded using formula (1-11). Based on experimental analysis, the present invention sets the threshold in formula (1-11) to 0.4.
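Steps (2) to (5) above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the patented implementation: all function names are hypothetical, the regular expression covers only an eight-digit YYYYMMDD path segment like the one in the Sohu example, the standardization divides by the peak search volume index, the anchor text relevance is taken to be cosine similarity, and Priority(URL) is taken to be a weighted sum with θ applied to the parent page and γ to the anchor text. The final priority of formula (1-11) is deliberately left out, because its functional form is not given in this passage.

```python
import math
import re
from datetime import date, datetime

# Assumption: the date appears as an 8-digit YYYYMMDD path segment.
_URL_DATE = re.compile(r"/(\d{8})(?=/|$)")


def estimate_publish_date(url: str, parent_date: date) -> date:
    """Step (2): use the date in the URL string, else the parent page's date."""
    match = _URL_DATE.search(url)
    if match:
        try:
            return datetime.strptime(match.group(1), "%Y%m%d").date()
        except ValueError:
            pass  # eight digits that do not form a valid calendar date
    return parent_date


def normalized_time_value(t: date, distribution) -> float:
    """Step (3): search volume index of the period containing t, scaled to [0, 1].

    `distribution` is a list of (start, end, index) tuples.
    Assumption: standardization divides by the largest index.
    """
    if not distribution:
        return 0.0
    peak = max(index for _, _, index in distribution)
    for start, end, index in distribution:
        if start <= t <= end:
            return index / peak if peak else 0.0
    return 0.0  # t falls outside every period


def cosine(a: dict, b: dict) -> float:
    """Step (4), assumed form: cosine similarity of keyword-weight vectors."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def content_priority(parent_sim: float, anchor_sim: float,
                     theta: float = 0.4, gamma: float = 0.6) -> float:
    """Step (5), assumed form: weighted sum of parent-page and anchor relevance."""
    return theta * parent_sim + gamma * anchor_sim
```

With the Sohu URL from step (2), `estimate_publish_date` returns 5 September 2008 regardless of the parent page's date; a URL without a date segment falls back to the parent page's date.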
In a specific embodiment, the present invention aims to discover as much network change information with temporal characteristics as possible from the network while downloading as little irrelevant information as possible. Its basic flow may include the following five steps:
(1) Preparation: the user specifies the content topic and the initial URLs related to the theme. The time intention recognition method based on Google Trends data is then used to determine the start time of the theme and to quantify its time distribution.
(2) Requesting and parsing web pages: using the HTTP protocol, request from the Internet the initial URL, or the highest-priority URL in the URL priority queue, to obtain the web page content corresponding to that URL. Then, according to the page's Document Object Model (DOM), parse out the page's title, body text, publication time, URLs to be downloaded, and their anchor text information.
(3) Topic relevance calculation: first, based on the theme start time obtained in step (1) and the web page publication time obtained in step (2), use formulas (1-2) to (1-6) to represent the theme's start and end time, general keywords, and time distribution, together with the web page content's general keywords and publication time. Then use formula (1-9) to calculate their time relevance and filter out web page content that has a Before temporal relation with the theme. Next, use formula (1-10) to calculate the general topic relevance value. When the relevance value is greater than or equal to a certain threshold, the web page is saved to the web page repository; otherwise, the web page is judged irrelevant to the theme and discarded.
(4) URL priority assignment: calculate the URL priority according to formulas (1-11) to (1-14), then insert the URL into the URL priority queue according to that priority value.
(5) Repeat steps (2), (3), and (4) until the URL priority queue is empty or a certain termination condition is reached.
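The five-step flow above amounts to a best-first crawl loop, which can be sketched in Python as follows. This is a hedged outline rather than the patented implementation: `fetch`, `parse`, `relevance`, and `priority` are hypothetical caller-supplied callables standing in for the HTTP request, the DOM parsing, the relevance formulas, and the priority formulas; the default `threshold` of 0.5 and the `max_pages` stopping condition are placeholders; and the sketch assumes that links from every fetched page are queued, which the description does not specify.

```python
import heapq
import itertools


def crawl(seeds, fetch, parse, relevance, priority,
          threshold=0.5, max_pages=100):
    """Best-first crawl: always download the highest-priority URL next.

    fetch(url) -> raw page; parse(raw) -> dict with a "links" list;
    relevance(page) -> float; priority(link, parent_page) -> float.
    All four callables are hypothetical and supplied by the caller.
    """
    counter = itertools.count()  # tie-breaker so the heap never compares URLs
    # heapq is a min-heap, so priorities are stored negated.
    queue = [(-1.0, next(counter), url) for url in seeds]
    heapq.heapify(queue)
    seen = set(seeds)
    saved = []
    fetched = 0

    while queue and fetched < max_pages:      # step (5): loop condition
        _, _, url = heapq.heappop(queue)
        page = parse(fetch(url))              # step (2): request and parse
        fetched += 1

        if relevance(page) >= threshold:      # step (3): keep or discard
            saved.append(url)

        for link in page["links"]:            # step (4): queue new URLs
            if link not in seen:
                seen.add(link)
                score = priority(link, page)
                heapq.heappush(queue, (-score, next(counter), link))

    return saved
```

Passing the callables in keeps the loop independent of any particular HTTP client or scoring formula, so it can be exercised against an in-memory dictionary of pages before being pointed at a real site.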
Under identical hardware conditions and network bandwidth, the method provided by the present invention captures 10%-30% more web pages than existing subject network information collecting methods and improves precision by about 10%.
The subject network information collecting method taking time intention into account provided by the present invention quantifies the start time and time distribution of the theme, formally expresses the time intention based on an international time standard, and forms a multi-part representation in which the time intention and the general keywords (non-temporal words) are represented independently. It then calculates the time relevance and the general keyword relevance in separate steps, and finally incorporates the quantized time distribution, as the variable of an increasing function, into the URL priority assignment method to calculate the URL priority. This substantially increases both the number of web pages discovered and the precision.
Those skilled in the art will appreciate that, although the present invention is described in terms of multiple embodiments, not every embodiment contains only one independent technical solution. The specification is written this way merely for clarity; those skilled in the art should treat the specification as a whole and regard the technical solutions of the individual embodiments as combinable with one another into different embodiments when construing the protection scope of the present invention.
The foregoing describes only illustrative specific embodiments of the present invention and is not intended to limit its scope. Any equivalent variations, modifications, and combinations made by those skilled in the art without departing from the concept and principles of the present invention shall fall within the protection scope of the present invention.

Claims (5)

1. A subject network information collecting method taking time intention into account, used for collecting and ranking Internet web page information for subject events, characterized in that it comprises the following steps:
Step A: determine the start time of the subject event using prior data and quantify its time distribution to obtain a quantized value of the time distribution;
Step B: represent the time intention and the general keywords in the theme separately using different representation methods, and calculate the time relevance and the general keyword relevance separately;
Step C: based on the time relevance and the general keyword relevance calculated in step B, construct an increasing function whose variable is the quantized value of the time distribution obtained in step A, and incorporate it into the URL priority assignment method based on web page content to obtain a URL priority assignment formula based on the quantized time distribution value, from which the final URL priority is calculated, so that URLs from moments that attract attention obtain a higher priority;
The URL priority assignment formula is:
where $Priority_T(URL)$ denotes the final URL priority; Priority(URL) is the priority obtained by the existing URL priority assignment method based on web page content; Pr(t/T) is the standardized value of the quantized time distribution value and also represents the probability that a web page with publication time t is relevant to theme T; the threshold takes a value in the interval from 0 to 1.
2. The method according to claim 1, characterized in that the prior data in step A is Google Trends data.
3. The method according to claim 1, characterized in that in step B the time intention in the theme is expressed as follows:
Overall formal representation of the theme and the web page content: a given theme T and web page content D are expressed as
$$T = \langle V_{Tk}, T_{ST}, T_{TD} \rangle$$
$$D = \langle V_{Dk}, T_{PT} \rangle$$
where $V_{Tk}$, $T_{ST}$, and $T_{TD}$ denote the general vector of the theme, the start and end time of the theme, and its time distribution, respectively; $V_{Dk}$ and $T_{PT}$ denote the general vector of the web page content and its publication time, respectively;
Formal representation of the theme: its general vector $V_{Tk}$, start and end time $T_{ST}$, and time distribution $T_{TD}$ are expressed as
$$V_{Tk} = \{(k_1, w_{Tk1}), (k_2, w_{Tk2}), \ldots, (k_s, w_{Tks})\}$$
$$T_{ST} = [t_{STs}, t_{STe}]$$
$$T_{TD} = \{\langle [t_{TDs1}, t_{TDe1}], \lambda_1 \rangle, \ldots, \langle [t_{TDsr}, t_{TDer}], \lambda_r \rangle\}$$
where $k_i$ denotes the i-th general keyword in the theme; $w_{Tki}$ denotes the weight of general keyword $k_i$; s denotes the number of general keywords in the theme; $t_{STs}$ denotes the start time of the theme and $t_{STe}$ its end time; $\langle [t_{TDsi}, t_{TDei}], \lambda_i \rangle$ denotes the i-th <period, search volume index> pair in the time distribution; $t_{TDsi}$ and $t_{TDei}$ are the start time and end time of the i-th period, and $\lambda_i$ is the search volume index value of the i-th period;
Formal representation of the web page content: its general vector $V_{Dk}$ and publication time $T_{PT}$ are expressed as
$$V_{Dk} = \{(k_1, w_{Dk1}), (k_2, w_{Dk2}), \ldots, (k_s, w_{Dks})\}$$
$$T_{PT} = t_{PT}$$
where $k_i$ denotes the i-th general keyword in the web page content; $w_{Dki}$ denotes the weight of general keyword $k_i$; $t_{PT}$ denotes the publication time of the web page.
4. The method according to claim 3, characterized in that in step B the formulas for calculating the time relevance and the general keyword relevance are as follows:
The time relevance between the theme and the web page content is calculated as follows:
where $\mathrm{sim}(T_{PT}, T_{ST})$ denotes the time relevance value between the theme and the web page content;
The general topic relevance between the theme and the web page content is calculated as follows:
where $\mathrm{sim}(V_{Dk}, V_{Tk})$ denotes the general topic relevance value between theme T and web page content D.
5. The method according to claim 1, characterized in that the threshold is set to 0.4.
CN201610630419.4A 2016-08-04 2016-08-04 A kind of subject network information collecting method for taking time intention into account Active CN106250512B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610630419.4A CN106250512B (en) 2016-08-04 2016-08-04 A kind of subject network information collecting method for taking time intention into account

Publications (2)

Publication Number Publication Date
CN106250512A CN106250512A (en) 2016-12-21
CN106250512B true CN106250512B (en) 2019-07-26

Family

ID=57605946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610630419.4A Active CN106250512B (en) 2016-08-04 2016-08-04 A kind of subject network information collecting method for taking time intention into account

Country Status (1)

Country Link
CN (1) CN106250512B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114417200B (en) * 2022-01-04 2023-04-14 马上消费金融股份有限公司 Network data acquisition method and device and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231640A (en) * 2007-01-22 2008-07-30 北大方正集团有限公司 Method and system for automatically computing subject evolution trend in the internet
CN102073730A (en) * 2011-01-14 2011-05-25 哈尔滨工程大学 Method for constructing topic web crawler system
CN103631856A (en) * 2013-10-17 2014-03-12 四川大学 Subject visualization method for Chinese document set
CN105528422A (en) * 2015-12-07 2016-04-27 中国建设银行股份有限公司 Focused crawler processing method and apparatus

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10068013B2 (en) * 2014-06-19 2018-09-04 Samsung Electronics Co., Ltd. Techniques for focused crawling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
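Research on web page text extraction and processing based on network information retrieval; 余浩 (Yu Hao); China Master's Theses Full-text Database, Information Science and Technology; 2015-05-15 (No. 5); main text pp. 28-30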

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant