CN106250512B - A topical network information collection method that takes time intent into account
- Publication number
- CN106250512B (application CN201610630419.4A)
- Authority
- CN
- China
- Prior art keywords
- time
- theme
- web page
- url
- page contents
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
A topical network information collection method that takes time intent into account, used to collect and prioritize Internet web page information for a topic event, comprising the following steps: Step A, determine the start time of the topic event from prior data and quantify its time distribution, obtaining a quantized value of the time distribution; Step B, represent the time intent and the ordinary keywords of the topic separately, using different representation methods, and compute the time relevance and the ordinary keyword relevance separately; Step C, based on the time relevance and the ordinary keyword relevance computed in Step B, construct an increasing function whose variable is the quantized value of the time distribution obtained in Step A, thereby obtaining a URL priority assignment formula based on the quantized time distribution, and compute the final URL priority. The method substantially increases the number of web pages discovered and the precision.
Description
Technical field
The present invention relates to the field of Internet web page search, in particular to topical crawling methods for acquiring web pages with specific content from the Internet, and more particularly to a topical network information collection method that takes time intent into account.
Background art
Topical crawling is a key technique for acquiring domain-specific web pages from the Internet; it aims to download as many web pages relevant to a specified topic as possible. Guided by the user-specified topic, and using crawl strategies such as topic relevance computation and URL priority assignment, it continually retrieves relevant web pages from ubiquitous network resources.
URL priority assignment based on web page content is the method commonly used in traditional topical crawling. The priority is computed mainly from two kinds of relevance values: (1) the topic relevance of the parent page content: the higher this value, the higher the priority of the URLs contained in the parent page; (2) the topic relevance of the anchor text: the relevance between the topic and information such as the anchor text, the anchor text context, and the URL string, where the anchor text is usually a summary description of the content of the page the URL points to.
In URL priority assignment based on web page content, the parent page content topic relevance and the anchor text topic relevance are usually computed with the cosine formula. For example, if the parent page content topic relevance of a URL is $\mathrm{sim}(V_{Dk}, V_{Tk})$ and its anchor text topic relevance is $\mathrm{sim}(V_{Ak}, V_{Tk})$, the priority $\mathrm{Priority}(\mathrm{URL})$ of that URL can be computed as follows:
$$\mathrm{Priority}(\mathrm{URL}) = \theta \times \mathrm{sim}(V_{Dk}, V_{Tk}) + \gamma \times \mathrm{sim}(V_{Ak}, V_{Tk}) \tag{1-1}$$
In the above formula, $\theta$ and $\gamma$ denote the decay factors of the parent page content topic relevance and the anchor text topic relevance, respectively, and satisfy $\theta + \gamma = 1$.
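As a minimal sketch of formula (1-1), the weighted sum can be written in Python as below. The function name `content_priority` and its parameter names are hypothetical; the default values 0.4 and 0.6 are the ones the description later prefers, and the two similarity values are assumed to have been computed already.

```python
import math


def content_priority(sim_parent: float, sim_anchor: float,
                     theta: float = 0.4, gamma: float = 0.6) -> float:
    """Content-based URL priority, following formula (1-1).

    sim_parent: topic relevance of the parent page content.
    sim_anchor: topic relevance of the anchor text.
    theta, gamma: decay factors, which must sum to 1.
    """
    if not math.isclose(theta + gamma, 1.0):
        raise ValueError("theta + gamma must equal 1")
    return theta * sim_parent + gamma * sim_anchor
```

With the default factors, a parent page relevance of 0.5 and an anchor text relevance of 1.0 give a priority of 0.4 × 0.5 + 0.6 × 1.0 = 0.8.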
When topical crawling is used to collect time-sensitive emergency information, time intent usually serves as a constraining element of the topic. According to the ISO 19100 series of standards (2002), temporal objects can be divided into "instants" and "periods": an instant represents a point in time space, while a period corresponds to a line in time space, with attributes such as a start point, an end point, and a length. In general, information about an emergency event is published on the network mainly after the event occurs, that is, the publication time of a report should be later than the start time of the event. In addition, an emergency event evolves through emergence, development, change, and extinction, and public attention to the event differs between stages; downloading information from the periods of higher attention first meets the needs of most people, and this attention reflects, to some extent, the time distribution of the event. In other words, when network information is collected by topic, time intent (such as the start time and the time distribution) plays a clear role in judging information relevance and in assigning the priority order of information discovery.
When network information is collected by topical crawling, setting a start time can by itself filter out some irrelevant information, and the time distribution affects the priority order of information discovery. Nevertheless, traditional network information collection methods still attend only to the ordinary semantics of the topic. They do not analyze or use the time intent of the topic, and they treat the time distribution as uniform (the time distribution equalization problem), which lowers their precision. Specifically:
(1) No representation of time intent: the traditional single-vector topic representation expresses only the topic keywords and provides no representation of the topic's time intent;
(2) The role of the topic start time is weakened: traditional topic relevance computation relies only on web page content to judge relevance to the topic, which weakens the role of the topic start time;
(3) The influence of the topic time distribution on the priority order of information discovery is ignored: traditional URL priority assignment methods currently rely mainly on web page content, the anchor text and its context, the URL string, link relationships, and even the update time of the web page, but ignore the influence of the topic time distribution.
Summary of the invention
The technical problem to be solved by the present invention is to provide a topical network information collection method that takes time intent into account, so as to reduce or avoid the problems mentioned above.
To solve the above technical problem, the present invention provides a topical network information collection method that takes time intent into account, used to collect and prioritize Internet web page information for a topic event, comprising the following steps:
Step A: determine the start time of the topic event from prior data and quantify its time distribution, obtaining a quantized value of the time distribution;
Step B: represent the time intent and the ordinary keywords of the topic separately, using different representation methods, and compute the time relevance and the ordinary keyword relevance separately;
Step C: based on the time relevance and the ordinary keyword relevance computed in Step B, construct an increasing function whose variable is the quantized value of the time distribution obtained in Step A, and merge it into the URL priority assignment method based on web page content, thereby obtaining a URL priority assignment formula based on the quantized time distribution and computing the final URL priority, so that URLs from periods of high attention receive higher priority.
Preferably, the prior data in Step A is Google Trends data.
Preferably, in Step B, the time intent of the topic is expressed as follows.
Overall formal representation of the topic and the web page content: a given topic $T$ and web page content $D$ are represented as follows.
$$T = \langle V_{Tk}, T_{ST}, T_{TD} \rangle$$
$$D = \langle V_{Dk}, T_{PT} \rangle$$
where $V_{Tk}$, $T_{ST}$, and $T_{TD}$ denote the ordinary keyword vector of the topic, the start-end time of the topic, and its time distribution, respectively; $V_{Dk}$ and $T_{PT}$ denote the ordinary keyword vector of the web page content and its publication time, respectively.
Formal representation of the topic: its ordinary keyword vector $V_{Tk}$, start-end time $T_{ST}$, and time distribution $T_{TD}$ are expressed by the following formulas.
$$V_{Tk} = \{(k_1, w_{Tk_1}), (k_2, w_{Tk_2}), \ldots, (k_s, w_{Tk_s})\}$$
$$T_{ST} = [t_{STs}, t_{STe}]$$
$$T_{TD} = \{\langle [t_{TDs_1}, t_{TDe_1}], \lambda_1 \rangle, \ldots, \langle [t_{TDs_r}, t_{TDe_r}], \lambda_r \rangle\}$$
where $k_i$ denotes the $i$-th ordinary keyword in the topic; $w_{Tk_i}$ denotes the weight of ordinary keyword $k_i$; $s$ denotes the number of ordinary keywords in the topic; $t_{STs}$ denotes the start time of the topic and $t_{STe}$ its end time; $\langle [t_{TDs_i}, t_{TDe_i}], \lambda_i \rangle$ denotes the $i$-th <period, search volume index> pair in the time distribution; $t_{TDs_i}$ and $t_{TDe_i}$ are the start time and end time of the $i$-th period, and $\lambda_i$ is the search volume index of the $i$-th period;
Formal representation of the web page content: its ordinary keyword vector $V_{Dk}$ and publication time $T_{PT}$ are expressed by the following formulas.
$$V_{Dk} = \{(k_1, w_{Dk_1}), (k_2, w_{Dk_2}), \ldots, (k_s, w_{Dk_s})\}$$
$$T_{PT} = t_{PT}$$
where $k_i$ denotes the $i$-th ordinary keyword in the web page content; $w_{Dk_i}$ denotes the weight of ordinary keyword $k_i$; $t_{PT}$ denotes the publication time of the web page.
Preferably, in Step B, the time relevance and the ordinary keyword relevance are computed with the following formulas.
The time relevance between the topic and the web page content is computed as follows:
$$\mathrm{sim}(T_{PT}, T_{ST}) = \begin{cases} 1, & t_{STs} \le t_{PT} \le t_{STe} \\ 0, & \text{otherwise} \end{cases}$$
where $\mathrm{sim}(T_{PT}, T_{ST})$ denotes the time relevance between the topic and the web page content;
The ordinary topic relevance between the topic and the web page content is computed as follows:
$$\mathrm{sim}(V_{Dk}, V_{Tk}) = \frac{\sum_{i=1}^{s} w_{Dk_i} \, w_{Tk_i}}{\sqrt{\sum_{i=1}^{s} w_{Dk_i}^2} \, \sqrt{\sum_{i=1}^{s} w_{Tk_i}^2}}$$
where $\mathrm{sim}(V_{Dk}, V_{Tk})$ denotes the ordinary topic relevance between topic $T$ and web page content $D$.
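Preferably, the URL priority assignment formula in Step C is:
$$\mathrm{Priority}_T(\mathrm{URL}) = \begin{cases} \mathrm{Priority}(\mathrm{URL}) \times e^{\Pr(t/T)}, & \mathrm{Priority}(\mathrm{URL}) \ge \text{threshold} \\ \mathrm{Priority}(\mathrm{URL}), & \text{otherwise} \end{cases}$$
where $\mathrm{Priority}_T(\mathrm{URL})$ denotes the final URL priority; $\mathrm{Priority}(\mathrm{URL})$ is the priority obtained by the existing URL priority assignment method based on web page content; $\Pr(t/T)$ is the normalized quantized value of the time distribution, which also represents the probability that a web page published at time $t$ is relevant to topic $T$; the threshold takes a value in the interval from 0 to 1.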
Preferably, the threshold is set to 0.4.
Preferably, the priority $\mathrm{Priority}(\mathrm{URL})$ obtained by the URL priority assignment method based on web page content is computed as:
$$\mathrm{Priority}(\mathrm{URL}) = \theta \times \mathrm{sim}(V_{Dk}, V_{Tk}) + \gamma \times \mathrm{sim}(V_{Ak}, V_{Tk})$$
where $\theta$ and $\gamma$ denote the decay factors of the parent page content topic relevance and the anchor text topic relevance, respectively, and satisfy $\theta + \gamma = 1$.
Preferably, the decay factor $\theta$ is set to 0.4 and $\gamma$ is set to 0.6.
The topical network information collection method that takes time intent into account provided by the present invention quantifies the start time and the time distribution of the topic, formally expresses time intent on the basis of the international time standard, and forms a multi-part representation in which time intent and ordinary keywords (non-temporal words) are represented independently. It then computes the time relevance and the ordinary keyword relevance in separate steps, and finally merges the quantized time distribution, as the variable of an increasing function, into the URL priority assignment method to compute the URL priority. This substantially increases the number of web pages discovered and the precision.
Detailed description of the embodiments
For a clearer understanding of the technical features, objects, and effects of the present invention, specific embodiments of the invention are now described.
The present invention provides a topical network information collection method that takes time intent into account, used to collect and prioritize Internet web page information for a topic event, comprising the following steps:
Step A: determine the start time of the topic event from prior data and quantify its time distribution, obtaining a quantized value of the time distribution;
The time intent of a topic refers to the temporal features contained in the topic. The present invention divides the time intent of a topic into explicit time intent and implicit time intent. Explicit time intent means that a time limit is stated explicitly in the topic; for example, the topic "earthquake in 2008" explicitly indicates that earthquake information from 2008 is required. Implicit time intent means that the topic states no time limit, but the event it describes implies one; for example, the topic "Wenchuan earthquake" implies the start time of the Wenchuan earthquake, May 12, 2008.
During topical network information collection and discovery, the start time and the time distribution of a topic event play different roles. The time intent recognition of the present invention therefore consists of two parts: recognition of the start time of the topic event and recognition of its time distribution.
In existing temporal information retrieval, recognizing the time intent of a query relies mainly on prior data such as user search logs and annotated news corpora. On this basis, the present invention likewise recognizes the time intent of a topic by means of prior data. In a specific embodiment, the prior data used by the present invention is Google Trends data.
Google Trends data is the search volume index of a query over a past period of time. It is not the raw search volume but a value normalized relative to the total search volume. After normalization, Google Trends values lie between 0 and 100, and a larger value indicates a larger search volume. Google Trends data is now widely used in disease prediction, conservation biology, online public opinion analysis, and other areas. The reason is that Google Trends data reflects the degree of user attention to the content the query refers to: the larger the search volume, the more people are paying attention, and the more people are paying attention, the more likely it is that an event related to that content has occurred. The present invention uses this property of Google Trends data to recognize the time intent of land-cover topic events, in two main steps:
(1) Recognize the start time of the topic event, based mainly on the transition of the search volume index in Google Trends data from absent to present. According to the evolution of an event through emergence, development, change, and extinction, few users pay attention to a topic before the topic event arises, so the search volume does not reach the level that Google Trends records. In actual computation, the start-time recognition method based on Google Trends data only handles topics whose search volume index in the initial period is 0. There are two reasons. First, not every topic has a definite start time (for example, the topic "earthquake" does not refer to any particular event and has no definite start time), and the initial search volume index of such topics is not 0. Second, Google Trends data itself is limited: its statistics begin in January 2004, so a topic that arose before 2004 and continued into 2004 has an initial search volume index that is not 0. The recognized topic start time is the first moment at which a search volume index greater than 0 appears in the Google Trends data.
(2) Quantify the time distribution of the topic event: it is expressed directly by the variation of the search volume index in Google Trends data, that is, the search volume index is used to quantify the time distribution. This is because Google Trends data inherently reflects how the attention paid to the topic on the Internet varies across periods, which is the time distribution of the topic event.
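The two recognition steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the input is assumed to be a chronologically sorted list of `(period_start, index)` pairs already exported from a trend service, and the function names are hypothetical.

```python
from datetime import date
from typing import List, Optional, Tuple

Series = List[Tuple[date, int]]  # (period start, search volume index 0..100)


def identify_start_time(series: Series) -> Optional[date]:
    """Return the start of the first period whose index is greater than 0.

    Only series whose initial index is 0 are handled; any other series
    has no identifiable start time, so None is returned.
    """
    if not series or series[0][1] != 0:
        return None
    for period_start, index in series:
        if index > 0:
            return period_start
    return None


def quantify_time_distribution(series: Series) -> Series:
    """Use the search volume index as the quantized time distribution,
    dropping periods whose index is 0."""
    return [(period_start, index) for period_start, index in series if index > 0]
```

A series that starts at 0 and later becomes positive yields the first positive period as the start time; a series whose first value is already positive yields `None`, matching the restriction described in step (1).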
Using the start-time recognition method above, time intent recognition based on Google Trends data can roughly identify the start time of a topic event. For example, the topic "Wenchuan earthquake" received heavy user attention from May 2008 to December 2008 and attracted attention again in May 2009, the anniversary month, which is consistent with the evolution of the event. This shows that quantifying the time distribution of a topic event directly with Google Trends data is reasonable.
In addition, Baidu Index can also serve as the prior data for recognizing time intent. Like Google Trends data, it is based on the query logs of a general-purpose search engine, Baidu, and reflects the user attention and media attention that different topic queries received over a past period. The topic time intent recognition method based on Baidu Index is similar to the one based on Google Trends data and is not described again here.
Step B, topic representation and relevance computation that take time intent into account: represent the time intent and the ordinary keywords of the topic separately, using different representation methods, and compute the time relevance and the ordinary keyword relevance separately;
Existing topical network information collection generally uses a traditional single vector to represent a topic that contains time intent, which cannot express the start time or the time distribution. The method provided by the present invention therefore uses different forms to represent the ordinary keywords of the topic, the start-end time of the topic, the time distribution of the topic, and the ordinary keywords and publication time of the web page content. Specifically:
(1) Represent ordinary keywords with a single vector: the ordinary keywords of the topic and of the web page content are represented as <keyword, weight> pairs. The dimension of the vector is determined by the number of keywords in the topic and stays fixed as long as the topic does not change.
(2) Represent time intent according to the international time standard: the standard divides time into instants and periods. The start time of a topic and the publication time of web page content are each usually a point in time and are represented as instants. For ease of computation, the present invention uses a period to represent the start time and end time of the topic together (the start-end time). The time distribution reflects how the attention paid to the event varies across time ranges; it is therefore represented as <period, search volume index> pairs, in which the period gives the time range and the search volume index gives the attention level of the topic event. In particular, to save storage space, periods whose search volume index is 0 are not represented.
Their formal representations are as follows:
(1) Overall formal representation of the topic and the web page content: a given topic $T$ and web page content $D$ can be represented by the following formulas.
$$T = \langle V_{Tk}, T_{ST}, T_{TD} \rangle \tag{1-2}$$
$$D = \langle V_{Dk}, T_{PT} \rangle \tag{1-3}$$
In the formulas, $V_{Tk}$, $T_{ST}$, and $T_{TD}$ denote the ordinary keyword vector of the topic, the start-end time of the topic, and its time distribution, respectively; $V_{Dk}$ and $T_{PT}$ denote the ordinary keyword vector of the web page content and its publication time, respectively.
(2) Formal representation of the topic: its ordinary keyword vector $V_{Tk}$, start-end time $T_{ST}$, and time distribution $T_{TD}$ can be expressed by the following formulas.
$$V_{Tk} = \{(k_1, w_{Tk_1}), (k_2, w_{Tk_2}), \ldots, (k_s, w_{Tk_s})\} \tag{1-4}$$
$$T_{ST} = [t_{STs}, t_{STe}] \tag{1-5}$$
$$T_{TD} = \{\langle [t_{TDs_1}, t_{TDe_1}], \lambda_1 \rangle, \ldots, \langle [t_{TDs_r}, t_{TDe_r}], \lambda_r \rangle\} \tag{1-6}$$
In the formulas, $k_i$ denotes the $i$-th ordinary keyword in the topic; $w_{Tk_i}$ denotes the weight of ordinary keyword $k_i$; $s$ denotes the number of ordinary keywords in the topic; $t_{STs}$ denotes the start time of the topic, which is specified by the user or recognized with the method in Step A; $t_{STe}$ denotes the end time of the topic, which is specified by the user or defaults to infinity; $\langle [t_{TDs_i}, t_{TDe_i}], \lambda_i \rangle$ denotes the $i$-th <period, search volume index> pair in the time distribution; $t_{TDs_i}$ and $t_{TDe_i}$ are the start time and end time of the $i$-th period, and $\lambda_i$ is the search volume index of the $i$-th period. These parameters can be obtained from the prior data used in Step A (such as Google Trends data), omitting the periods whose search volume index is 0;
(3) Formal representation of the web page content: its ordinary keyword vector $V_{Dk}$ and publication time $T_{PT}$ are expressed by the following formulas.
$$V_{Dk} = \{(k_1, w_{Dk_1}), (k_2, w_{Dk_2}), \ldots, (k_s, w_{Dk_s})\} \tag{1-7}$$
$$T_{PT} = t_{PT} \tag{1-8}$$
In the formulas, $k_i$ denotes the $i$-th ordinary keyword in the web page content; $w_{Dk_i}$ denotes the weight of ordinary keyword $k_i$; $t_{PT}$ denotes the publication time of the web page.
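One possible way to hold the representations of formulas (1-2) to (1-8) in code is sketched below. The class and field names (`Topic`, `Page`, `Period`, `add_period`) are hypothetical, the keyword vectors are stored as sparse dictionaries rather than fixed-length vectors, and an end time of `None` stands in for the default of infinity.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Period:
    """One <period, search volume index> pair of the time distribution."""
    start: date    # t_TDs_i
    end: date      # t_TDe_i
    index: float   # lambda_i, always greater than 0 once stored


@dataclass
class Topic:
    """T = <V_Tk, T_ST, T_TD>."""
    keywords: Dict[str, float]       # V_Tk: keyword -> weight
    start: date                      # t_STs
    end: Optional[date] = None       # t_STe; None means unbounded
    distribution: List[Period] = field(default_factory=list)  # T_TD

    def add_period(self, start: date, end: date, index: float) -> None:
        # Periods with index 0 are not stored, to save space.
        if index > 0:
            self.distribution.append(Period(start, end, index))


@dataclass
class Page:
    """D = <V_Dk, T_PT>."""
    keywords: Dict[str, float]  # V_Dk: keyword -> weight
    published: date             # t_PT
```

Storing only the non-zero periods mirrors the space-saving rule in the text; a lookup that finds no period for a given time can then treat the index as 0.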
The weights of the ordinary keywords in the topic and in the web page content can be computed with existing techniques, for example the method given in: Wu H, Chen J, et al. A Focused Crawler for Borderlands Situation Information with Geographical Properties of Place Names [J]. Sustainability, 2014, 6(10): 6529-6552.
As described in the background, traditional topic relevance computation uses only web page content to judge whether a page is relevant to the topic. This weakens the ability of the topic start time to filter out some irrelevant information on its own, easily leads to misjudging some information, and lowers the precision of topical crawling. Building on the traditional vector space model, the present invention judges the relevance between web page content and the topic in two steps, from the start time and from the ordinary keywords, and thereby provides a new topic relevance computation strategy that takes the start time into account. The computation proceeds in the following two steps:
(1) Compute the time relevance between the topic and the web page content. Because the topic start time can be used on its own to filter out some irrelevant information, comparing the publication time of the web page content with the start-end time of the topic is enough for a preliminary judgment of whether the page is relevant to the topic. The time relevance can therefore be computed as shown in the following formula.
$$\mathrm{sim}(T_{PT}, T_{ST}) = \begin{cases} 1, & t_{STs} \le t_{PT} \le t_{STe} \\ 0, & \text{otherwise} \end{cases} \tag{1-9}$$
In the formula, $\mathrm{sim}(T_{PT}, T_{ST})$ denotes the time relevance between the topic and the web page content; the other parameters are as defined above. A time relevance of 0 indicates that the web page content is irrelevant to the topic, and the page should be discarded during crawling. A time relevance of 1 indicates that the web page content may be relevant to the topic; the final relevance must be further determined from the web page content. Therefore, when the time relevance is 1, the ordinary topic relevance is computed next.
(2) Compute the ordinary topic relevance between the topic and the web page content. The ordinary keywords of the topic and of the web page content are still represented as single vectors, so their relevance can be computed with the traditional cosine formula, as shown in the following formula.
$$\mathrm{sim}(V_{Dk}, V_{Tk}) = \frac{\sum_{i=1}^{s} w_{Dk_i} \, w_{Tk_i}}{\sqrt{\sum_{i=1}^{s} w_{Dk_i}^2} \, \sqrt{\sum_{i=1}^{s} w_{Tk_i}^2}} \tag{1-10}$$
In the formula, $\mathrm{sim}(V_{Dk}, V_{Tk})$ denotes the ordinary topic relevance between topic $T$ and web page content $D$; the other parameters are as defined above. If $\mathrm{sim}(V_{Dk}, V_{Tk})$ is greater than or equal to a given threshold, the web page content is judged relevant to the topic; otherwise it is judged irrelevant, and the page is discarded.
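The two-step relevance check described above (a time filter first, then cosine similarity against a threshold) might look like the following Python sketch. The function names are hypothetical, keyword vectors are assumed to be sparse dictionaries, and the default relevance threshold of 0.5 is an illustrative assumption, since the text does not state a value for it.

```python
import math
from datetime import date
from typing import Dict, Optional


def time_relevance(published: date, start: date,
                   end: Optional[date] = None) -> int:
    """Return 1 if `published` lies within [start, end], else 0.

    end=None models an end time that defaults to infinity.
    """
    if published < start:
        return 0
    if end is not None and published > end:
        return 0
    return 1


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse keyword-weight vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_relevant(page_vec: Dict[str, float], published: date,
                topic_vec: Dict[str, float], start: date,
                end: Optional[date] = None,
                threshold: float = 0.5) -> bool:
    """Two-step check: cheap time filter first, then keyword similarity."""
    if time_relevance(published, start, end) == 0:
        return False  # published outside the topic's start-end time
    return cosine(page_vec, topic_vec) >= threshold
```

Running the time filter first means the keyword vectors of pages published outside the topic's time span never need to be compared at all.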
In the topic relevance computation strategy that takes the start time into account, the time relevance is computed first because it is comparatively simple to compute.
Step C: based on the time relevance and the ordinary keyword relevance computed in Step B, construct an increasing function whose variable is the quantized value of the time distribution obtained in Step A, and merge it into the URL priority assignment method based on web page content, thereby obtaining a URL priority assignment formula based on the quantized time distribution, so that URLs from periods of high attention receive higher priority and the time distribution equalization problem is solved.
During topical network information collection, the time distribution of the topic affects the priority order of information discovery. Specifically, if many relevant web pages exist at the publication time $t$ of the web page content corresponding to a URL, then, for a given topic $T$, the probability $\Pr(t/T)$ that web page content published at time $t$ is relevant to topic $T$ is high, and URLs from that time should have higher priority. Existing URL priority assignment methods, however, do not take this property into account.
To solve this problem, the present invention provides a URL priority assignment method based on the quantized value of the time distribution (the search volume index in the Google Trends data described above). The process is as follows:
First, construct an increasing function whose independent variable is the quantized value. The quantized value of the time distribution reflects, to some extent, the number of relevant web pages published in a given period, and the quantized value is positively related to the number of relevant web pages: the larger the quantized value, the more relevant web pages were published. An increasing function captures exactly this property. The present invention therefore constructs an exponential function with the natural constant $e$ as its base and the quantized value of the time distribution as its exponent (the natural exponential function).
Then, merge the increasing function with the URL priority assignment method based on web page content. Before merging, the method first computes the content priority of the URL with the content-based URL priority assignment method, and merges only when that value is greater than or equal to a given threshold. This ensures that the time distribution affects only the discovery order of URLs corresponding to relevant web pages and does not raise the discovery order of URLs corresponding to irrelevant web pages. In the present invention, merging consists of multiplying the content priority by the increasing function.
Finally, the URL priority assignment formula based on the quantized time distribution is as follows.
$$\mathrm{Priority}_T(\mathrm{URL}) = \begin{cases} \mathrm{Priority}(\mathrm{URL}) \times e^{\Pr(t/T)}, & \mathrm{Priority}(\mathrm{URL}) \ge \text{threshold} \\ \mathrm{Priority}(\mathrm{URL}), & \text{otherwise} \end{cases} \tag{1-11}$$
In the formula, $\mathrm{Priority}_T(\mathrm{URL})$ denotes the final URL priority; $\mathrm{Priority}(\mathrm{URL})$ is the priority obtained by the existing URL priority assignment method based on web page content, which can be computed with formula (1-1) given in the background; $\Pr(t/T)$ is the normalized quantized value of the time distribution, which also represents the probability that a web page published at time $t$ is relevant to topic $T$. The threshold in the formula takes a value in the interval from 0 to 1: when it is 1, the URL priority is always computed in the traditional way; when it is 0, the URL priority is always computed with the method that merges in the time distribution.
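The merging rule described above (multiply the content priority by the natural exponential of the normalized quantized value, but only when the content priority reaches the threshold) can be sketched as follows. The function and parameter names are hypothetical; the default threshold of 0.4 is the value the description prefers.

```python
import math


def final_priority(content_priority: float, pr_t: float,
                   threshold: float = 0.4) -> float:
    """Merge the time distribution into a content-based URL priority.

    content_priority: Priority(URL) from the content-based method.
    pr_t: normalized quantized value of the time distribution, Pr(t/T).
    threshold: minimum content priority at which merging is applied.
    """
    if content_priority >= threshold:
        return content_priority * math.exp(pr_t)
    return content_priority
```

For example, a content priority of 0.8 with Pr(t/T) = 0.5 becomes 0.8 × e^0.5, about 1.32, while a content priority of 0.3 stays at 0.3 because it is below the threshold. Since e^x ≥ 1 for x ≥ 0, merging never lowers the priority of a URL that passes the threshold.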
In a preferred embodiment, the computation of the URL priority based on the quantized time distribution is divided into six steps, as follows:
(1) Quantify the time distribution of the topic. The time distribution of the topic can be obtained from Google Trends data; its quantized value is the search volume index in the Google Trends data.
(2) Estimate the publication time $t$ of the web page content corresponding to the URL to be downloaded.
During information discovery, the publication time of the web page content corresponding to a URL to be downloaded is unknown. The present invention mainly uses two estimation methods:
1) Estimation from the URL string: when the URL string to be downloaded itself contains time information (for example, "20080905" in "http://news.sohu.com/20080905/n259388056.shtml" is the publication time of the web page that URL points to), the time is extracted with a suitable time regular expression and used as the publication time of the web page content corresponding to the URL to be downloaded;
2) Estimation from the parent page content time: when the URL string to be downloaded contains no time information, the publication time of the parent page content of the URL is used as the publication time of the web page content it points to. On the one hand, the publication time of the parent page content is usually slightly later than or equal to the publication time of the web page content the URL points to, and each period in the Google Trends data is comparatively long. On the other hand, this assumption does not affect the relevance between the web page the URL points to and the topic; it only affects the discovery order of the URL.
(3) Normalize the quantized value of the time distribution to obtain $\Pr(t/T)$. As described above, this only requires taking the search volume index of the period that contains time $t$ and normalizing it according to formula (1-12), whose parameters are as defined above.
(4) Compute the anchor text topic relevance $\mathrm{sim}(V_{Ak}, V_{Tk})$ of the URL to be downloaded. The anchor text vector (composed of the anchor text, its context, and the URL string information) is given by the following formula,
$$V_{Ak} = \{(k_1, w_{Ak_1}), (k_2, w_{Ak_2}), \ldots, (k_s, w_{Ak_s})\} \tag{1-13}$$
and the anchor text topic relevance is given by the following formula.
$$\mathrm{sim}(V_{Ak}, V_{Tk}) = \frac{\sum_{i=1}^{s} w_{Ak_i} \, w_{Tk_i}}{\sqrt{\sum_{i=1}^{s} w_{Ak_i}^2} \, \sqrt{\sum_{i=1}^{s} w_{Tk_i}^2}} \tag{1-14}$$
In the formulas, $V_{Ak}$ denotes the anchor text vector; $w_{Ak_i}$ denotes the weight of ordinary keyword $k_i$ in the anchor text; the other parameters are as defined above.
(5) Compute the content priority $\mathrm{Priority}(\mathrm{URL})$ of the URL to be downloaded, using the formula given in the background. Because the anchor text is the parent page's direct description of the URL to be downloaded, it is more important than the content of the parent page; the present invention therefore sets the decay factors $\theta$ and $\gamma$ in the formula to 0.4 and 0.6, respectively.
(6) calculate the final priority of URL to be downloaded: its calculation formula such as (1-11) is shown, through experimental analysis, the present invention
0.4 is set by the threshold value in formula (1-11).
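Steps (5) and (6) can be sketched as follows, under stated assumptions. The sketch assumes that the content priority is a weighted sum in which θ weights the parent page relevance and γ weights the anchor text relevance, and it uses a hypothetical increasing function of Pr(t/T) for the final priority. The actual form of formula (1-11), including the role of its threshold, is not modeled here.

```python
THETA = 0.4  # decay factor assumed to weight the parent web page content relevance
GAMMA = 0.6  # decay factor assumed to weight the anchor text relevance


def content_priority(parent_sim: float, anchor_sim: float,
                     theta: float = THETA, gamma: float = GAMMA) -> float:
    """Content-based priority as a weighted sum (assumed form of the background formula)."""
    return theta * parent_sim + gamma * anchor_sim


def final_priority(priority: float, pr_t: float) -> float:
    """Hypothetical final priority that increases with Pr(t/T).

    This only illustrates the stated property that the quantized time distribution
    enters as the variable of an increasing function; it is not formula (1-11).
    """
    return priority * (1.0 + pr_t)
```

With these assumptions, two URLs with the same content priority are ordered by how much attention the topic received at their publication time.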
In a specific embodiment, the present invention aims to discover as much network change information with
temporal characteristics as possible from the network while downloading as little irrelevant information as
possible. Its basic flow may include the following five steps:
(1) Preparation: the user specifies the content topic and the initial URLs related to the topic. Then the
time intent recognition method based on Google Trends data is used to determine the start time of the topic
and to quantify its time distribution.
(2) Requesting and parsing web pages: using the HTTP protocol, the initial URL, or the URL with the highest
priority in the URL priority queue, is requested from the Internet to obtain the corresponding web page
content. Next, according to the Document Object Model (DOM) of the web page, the title, body text,
publication time, URLs to be downloaded, and their anchor text information are parsed out of the page.
(3) Topic relevance calculation: first, based on the topic start time and the web page publication time
obtained in steps (1) and (2), formulas (1-2) to (1-6) are used to represent the start and end time, general
keywords, and time distribution of the topic, and the general keywords and publication time of the web page
content. Their time relevance is then calculated using formula (1-9), and web page content that has a Before
temporal relation with the topic is filtered out. Next, the general topic relevance value is calculated using
formula (1-10). When the relevance value is greater than or equal to a certain threshold, the web page is
saved to the web page resource library; otherwise, the web page is judged to be irrelevant to the topic and is
discarded.
(4) URL priority allocation: the URL priority is calculated according to formulas (1-11) to (1-14), and the
URL is then stored in the URL priority queue according to that priority value.
(5) Steps (2), (3) and (4) are repeated until the URL priority queue is empty or a certain termination
condition is reached.
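The loop formed by steps (2) to (5) can be sketched as below. This is a minimal illustration, not the implementation of the invention: the `Page` container, the `fetch_and_score` callback (standing in for requesting, parsing, and scoring a page), and the download limit used as the termination condition are all assumptions. The sketch also assumes that links are taken only from pages that are saved, which the text does not specify.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class Page:
    """Result of requesting and analysing one URL (hypothetical container)."""
    url: str
    relevance: float      # general topic relevance value of the page
    before_topic: bool    # True if the page has a Before relation with the topic
    links: List[Tuple[str, float]] = field(default_factory=list)  # (URL, final priority)


def crawl(seeds: Iterable[str],
          fetch_and_score: Callable[[str], Page],
          relevance_threshold: float,
          max_downloads: int) -> List[str]:
    """Download highest-priority URLs first and return the URLs of the saved pages."""
    tie_breaker = itertools.count()  # keeps heap entries comparable on equal priority
    queue: List[Tuple[float, int, str]] = []
    seen = set()
    for url in seeds:
        if url not in seen:
            seen.add(url)
            # heapq is a min-heap, so priorities are negated; seeds come out first.
            heapq.heappush(queue, (float("-inf"), next(tie_breaker), url))

    saved: List[str] = []
    downloads = 0
    while queue and downloads < max_downloads:
        _, _, url = heapq.heappop(queue)
        page = fetch_and_score(url)
        downloads += 1
        # Step (3): drop pages published before the topic or below the threshold.
        if page.before_topic or page.relevance < relevance_threshold:
            continue
        saved.append(page.url)
        # Step (4): enqueue newly found URLs by their priority value.
        for link, priority in page.links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(queue, (-priority, next(tie_breaker), link))
    return saved
```

Passing the scoring logic in as a callback keeps the queue handling separate from the relevance formulas, so either part can be changed without touching the other.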
Under identical hardware conditions and network bandwidth, the method provided by the present invention
captures 10%-30% more web pages than existing topic-focused network information collection methods, and can
improve precision by about 10%.
The topic-focused network information collection method taking time intent into account provided by the
present invention quantifies the start time and time distribution of the topic, formally expresses the time
intent on the basis of the international standard for time, and forms a diversified representation in which
the time intent and the ordinary keywords (non-temporal words) are represented independently. It then
calculates the time relevance and the general keyword relevance in separate steps, and finally integrates the
quantized time distribution, as the variable of an increasing function, into the URL priority allocation
method to compute the URL priority, which greatly increases both the number of web pages discovered and the
precision.
Those skilled in the art will appreciate that, although the present invention is described by way of multiple
embodiments, not every embodiment contains only one independent technical solution. The specification is
written in this way merely for the sake of clarity. Those skilled in the art should read the specification as
a whole and regard the technical solutions of the individual embodiments as capable of being combined with
one another into different embodiments in order to understand the scope of protection of the present
invention.
The foregoing is merely an illustrative specific embodiment of the present invention and is not intended to
limit the scope of the invention. Any equivalent change, modification, or combination made by a person
skilled in the art without departing from the concept and principles of the present invention shall fall
within the scope of protection of the invention.
Claims (5)
1. A topic-focused network information collection method taking time intent into account, used to collect and
order Internet web page information for topic events, characterized in that it comprises the following steps:
Step A: determining the start time of the topic event using prior data and quantifying its time distribution,
to obtain a quantized value of the time distribution;
Step B: representing the time intent and the general keywords in the topic separately using different
representation methods, and calculating the time relevance and the general keyword relevance separately;
Step C: according to the time relevance and the general keyword relevance calculated in step B, constructing
an increasing function whose variable is the quantized value of the time distribution obtained in step A, and
integrating it into the URL priority allocation method based on web page content, so as to obtain a URL
priority allocation formula based on the quantized value of the time distribution and calculate the final URL
priority, so that URLs at moments receiving attention obtain higher priority,
The URL priority allocation formula is:
where $\mathrm{Priority}_T(\mathrm{URL})$ denotes the final URL priority; $\mathrm{Priority}(\mathrm{URL})$ is
the priority obtained by the existing URL priority allocation method based on web page content;
$\mathrm{Pr}(t/T)$ is the standardized value of the quantized time distribution, and also represents the
probability that a web page whose publication time is $t$ is relevant to topic $T$; the threshold takes a
value in the interval from 0 to 1.
2. The method according to claim 1, characterized in that the prior data in step A is Google Trends data.
3. The method according to claim 1, characterized in that, in step B, the time intent in the topic is
expressed as follows:
The overall formal representation of the topic and the web page content: a given topic $T$ and web page
content $D$ are represented as follows,
$T = \langle V_{Tk}, T_{ST}, T_{TD} \rangle$
$D = \langle V_{Dk}, T_{PT} \rangle$
where $V_{Tk}$, $T_{ST}$ and $T_{TD}$ denote the general vector of the topic, the start and end time of the
topic, and its time distribution, respectively; $V_{Dk}$ and $T_{PT}$ denote the general vector of the web
page content and its publication time, respectively,
The formal representation of the topic: its general vector $V_{Tk}$, start and end time $T_{ST}$, and time
distribution $T_{TD}$ are expressed according to the following formulas,
$V_{Tk} = \{(k_1, w_{Tk1}), (k_2, w_{Tk2}), \ldots, (k_s, w_{Tks})\}$
$T_{ST} = [t_{STs}, t_{STe}]$
$T_{TD} = \{\langle [t_{TDs1}, t_{TDe1}], \lambda_1 \rangle, \ldots, \langle [t_{TDsr}, t_{TDer}], \lambda_r \rangle\}$
where $k_i$ denotes the $i$-th general keyword in the topic; $w_{Tki}$ denotes the weight of general keyword
$k_i$; $s$ denotes the number of general keywords in the topic; $t_{STs}$ denotes the start time of the topic
and $t_{STe}$ denotes the end time of the topic; $\langle [t_{TDsi}, t_{TDei}], \lambda_i \rangle$ denotes the
$i$-th <period, search volume index> pair in the time distribution; $t_{TDsi}$ and $t_{TDei}$ are the start
time and end time of the $i$-th period, respectively, and $\lambda_i$ is the search volume index value of the
$i$-th period;
The formal representation of the web page content: its general vector $V_{Dk}$ and publication time $T_{PT}$
are expressed according to the following formulas,
$V_{Dk} = \{(k_1, w_{Dk1}), (k_2, w_{Dk2}), \ldots, (k_s, w_{Dks})\}$
$T_{PT} = t_{PT}$
where $k_i$ denotes the $i$-th general keyword in the web page content; $w_{Dki}$ denotes the weight of
general keyword $k_i$; $t_{PT}$ denotes the publication time of the web page.
4. The method according to claim 3, characterized in that the formulas used in step B to calculate the time
relevance and the general keyword relevance are as follows:
The time relevance of the topic and the web page content is calculated as follows:
where $\mathrm{sim}(T_{PT}, T_{ST})$ denotes the time relevance value of the topic and the web page content;
The general topic relevance of the topic and the web page content is calculated as follows:
where $\mathrm{sim}(V_{Dk}, V_{Tk})$ denotes the general topic relevance value of topic $T$ and web page
content $D$.
5. The method according to claim 1, characterized in that the threshold is set to 0.4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610630419.4A CN106250512B (en) | 2016-08-04 | 2016-08-04 | A kind of subject network information collecting method for taking time intention into account |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610630419.4A CN106250512B (en) | 2016-08-04 | 2016-08-04 | A kind of subject network information collecting method for taking time intention into account |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106250512A CN106250512A (en) | 2016-12-21 |
CN106250512B CN106250512B (en) | 2019-07-26 |
Family
ID=57605946
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610630419.4A Active CN106250512B (en) | 2016-08-04 | 2016-08-04 | A kind of subject network information collecting method for taking time intention into account |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106250512B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114417200B (en) * | 2022-01-04 | 2023-04-14 | 马上消费金融股份有限公司 | Network data acquisition method and device and electronic equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101231640A (en) * | 2007-01-22 | 2008-07-30 | 北大方正集团有限公司 | Method and system for automatically computing subject evolution trend in the internet |
CN102073730A (en) * | 2011-01-14 | 2011-05-25 | 哈尔滨工程大学 | Method for constructing topic web crawler system |
CN103631856A (en) * | 2013-10-17 | 2014-03-12 | 四川大学 | Subject visualization method for Chinese document set |
CN105528422A (en) * | 2015-12-07 | 2016-04-27 | 中国建设银行股份有限公司 | Focused crawler processing method and apparatus |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10068013B2 (en) * | 2014-06-19 | 2018-09-04 | Samsung Electronics Co., Ltd. | Techniques for focused crawling |
Non-Patent Citations (1)
Title |
---|
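Research on web page text extraction and processing based on network information retrieval (基于网络信息检索的网页文本抽取和处理的研究); Yu Hao (余浩); China Excellent Master's Theses Full-text Database, Information Science and Technology Series; 2015-05-15 (No. 5); main text pp. 28-30 |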
Also Published As
Publication number | Publication date |
---|---|
CN106250512A (en) | 2016-12-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103714084B (en) | The method and apparatus of recommendation information | |
Eeckhout et al. | Knowledge spillovers and inequality | |
Eirinaki et al. | Web path recommendations based on page ranking and markov models | |
KR102080362B1 (en) | Query expansion | |
CN102750390B (en) | Automatic news webpage element extracting method | |
CN102622445A (en) | User interest perception based webpage push system and webpage push method | |
Paranjape et al. | Improving website hyperlink structure using server logs | |
CN101630327A (en) | Design method of theme network crawler system | |
CN104035972B (en) | A kind of knowledge recommendation method and system based on microblogging | |
CN109634924A (en) | File system parameter automated tuning method and system based on machine learning | |
CN108804576A (en) | A kind of domain name hierarchical structure detection method based on link analysis | |
CN106980651B (en) | Crawling seed list updating method and device based on knowledge graph | |
CN109065173B (en) | Knowledge path acquisition method | |
CN112488716B (en) | Abnormal event detection system | |
CN102737125B (en) | Web temporal object model-based outdated webpage information automatic discovering method | |
CN106250512B (en) | A kind of subject network information collecting method for taking time intention into account | |
Trevisiol et al. | Image ranking based on user browsing behavior | |
CN103064984A (en) | Spam webpage identifying method and spam webpage identifying system | |
CN110012122A (en) | A kind of domain name similarity analysis method of word-based embedded technology | |
Berthold et al. | Pure spreading activation is pointless | |
An et al. | A heuristic approach on metadata recommendation for search engine optimization | |
CN109977285A (en) | A kind of auto-adaptive increment collecting method towards Deep Web | |
KR20200072851A (en) | Method and System for Enrichment of Ontology Instances Using Linked Data and Supplemental String Data | |
CN109033147A (en) | A kind of method for exhibiting data, terminal and computer can storage mediums | |
CN109213793A (en) | A kind of stream data processing method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||