Detailed Description of the Embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.
In embodiments of the present invention, each information web page source is treated as a voter, each reprinted information item as the subject being voted on, and the popularity of each information web page source as the weight of its vote. The voting score of each reprinted information item is calculated comprehensively, and reprinted information items with high scores are regarded as hot information and ranked first. At the same time, because news takes time to spread, the publication time of a reprinted information item can be used as a correction factor to correct its voting score, thereby obtaining the final hotness ranking.
Fig. 1 is a schematic flow chart of the hot information mining method according to an embodiment of the present invention.
As shown in Fig. 1, the method includes:
Step 101: Calculate the relative hotness value of each information web page source according to the access counts of the information web page sources.
Here, the access hotness of each information web page source can be calculated from the access log of hot information and the access log of other news in that source. For example, the access count recorded in the access log of hot information and the access count recorded in the access log of other news are added together to give the access count of the information web page source.
Preferably, the information web page sources can be news websites of various types.
The access hotness of an information web page source can be calculated in a variety of ways. The underlying principle is that the more often an information web page source is accessed, the higher its relative hotness value should be. For example:
For the k-th information web page source, its relative hotness value SiteHotness_k is calculated as:
$$\mathrm{SiteHotness}_k = \mathrm{norm} \cdot \frac{\log(\mathrm{AccessCount}_k)}{\log\left(\sum_{j \in K} \mathrm{AccessCount}_j\right)}$$
where norm is a normalization coefficient, AccessCount_k is the access count of the k-th information web page source, and K is the set of all information web page sources.
For example, suppose that a search engine has collected news from three information web page sources A, B and C, and that the access counts (AccessCount) of these three sources in the search engine are 50, 20 and 30 respectively.
Then the hotness of website C is SiteHotness_C = norm * (log(30) / log(50+20+30));
the hotness of website B is SiteHotness_B = norm * (log(20) / log(50+20+30));
the hotness of website A is SiteHotness_A = norm * (log(50) / log(50+20+30)).
The base of the logarithms above can be 10 or e. With either base, the hotness SiteHotness_A of website A is greater than the hotness SiteHotness_C of website C, and SiteHotness_C is in turn greater than the hotness SiteHotness_B of website B.
The specific value of norm can be changed or adjusted according to practical experience in the application.
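As a minimal, non-limiting sketch of the calculation in step 101, the following Python code evaluates the relative hotness values for the three-source example above. The function name site_hotness, the default norm of 1.0 and the base parameter are assumptions introduced only for illustration; every access count is assumed to be at least 1.

```python
import math


def site_hotness(access_counts, norm=1.0, base=10):
    """Return norm * log(AccessCount_k) / log(sum of all AccessCount) per source.

    access_counts maps a source name to its access count (assumed >= 1).
    """
    total = sum(access_counts.values())
    if total <= 1:
        # log(total) must be positive for the ratio to be meaningful.
        raise ValueError("total access count must be greater than 1")
    return {
        site: norm * math.log(count, base) / math.log(total, base)
        for site, count in access_counts.items()
    }


# The example from the text: sources A, B, C with 50, 20 and 30 accesses.
hotness = site_hotness({"A": 50, "B": 20, "C": 30})
```

Because the formula is a ratio of two logarithms, the choice of base cancels out and the ordering A > C > B holds for base 10 and base e alike.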
Step 102: According to the relative hotness values of the information web page sources, calculate the reprint weight of each reprinted information item in each information web page source that reprints it.
Here, the reprinted information can be determined from the information web page sources by a similarity algorithm based on text features. That is, the similarity algorithm identifies the reprint relationships among news items, i.e., which news items are reprints of the same news article.
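The description does not fix a particular similarity algorithm. As one possible sketch, the Python code below groups articles whose character 3-gram sets have a Jaccard similarity above a threshold. The use of Jaccard similarity, the greedy grouping, the 0.6 threshold and all function names are assumptions made for illustration, not the method of the embodiment.

```python
def shingles(text, n=3):
    """Set of character n-grams of the text, ignoring case and whitespace."""
    compact = "".join(text.lower().split())
    return {compact[i:i + n] for i in range(max(len(compact) - n + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_reprints(articles, threshold=0.6):
    """Greedily group (source, text) pairs that look like the same article.

    Returns a list of groups; each group is a list of indices into articles.
    """
    groups = []  # pairs of (shingles of the first member, member indices)
    for idx, (_source, text) in enumerate(articles):
        current = shingles(text)
        for representative, members in groups:
            if jaccard(representative, current) >= threshold:
                members.append(idx)
                break
        else:
            groups.append((current, [idx]))
    return [members for _, members in groups]


articles = [
    ("A", "China's software market expected to reach 71.5 billion yuan in 2015"),
    ("B", "China's software market expected to reach 71.5 billion yuan in 2015."),
    ("C", "Local team wins the championship final"),
]
groups = group_reprints(articles)
```

Each resulting group corresponds to one reprinted information item, and the sources of its members form the set K used when its hotness value is calculated.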
Preferably, a time factor can further be determined according to the publication time of each reprinted information item, and each information hotness value is corrected using the time factor. As an example, the time at which an information item was reprinted can also be used as the time factor.
For example, for the i-th reprinted information item, its information hotness value NewsHotness_i is calculated as:
$$\mathrm{NewsHotness}_i = f(\mathrm{PublishTime}) \cdot \sum_{k \in K} \mathrm{CitationHotness}_k$$
where:
CitationHotness_k = g(SiteHotness_k);
K is the set of all information web page sources that reprint the i-th reprinted information item; PublishTime is the publication time of the i-th reprinted information item; f(PublishTime) is a time weighting function of PublishTime; CitationHotness_k is the reprint weight of the i-th reprinted information item in the k-th information web page source that reprints it; and g(SiteHotness_k) is a hotness weighting function of SiteHotness_k.
The time weighting function f(PublishTime) ensures the timeliness of the information hotness value NewsHotness_i. In general, the closer the publication time PublishTime is to the current time, the larger the value of f(PublishTime) should be.
The time weighting function f(PublishTime) can take many specific functional forms, and may be linear or non-linear. Embodiments of the present invention place no limitation on the specific functional form of f(PublishTime), provided that it satisfies the basic principle that the closer the publication time PublishTime is to the current time, the larger the value of f(PublishTime) (and therefore the larger the value of NewsHotness_i).
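Two hypothetical choices of f(PublishTime) that satisfy this principle are sketched below in Python: a linear decay and a non-linear (exponential) decay. The function names, the 72-hour window and the 24-hour half-life are illustrative assumptions only; any function that grows as PublishTime approaches the current time would serve equally well.

```python
from datetime import datetime, timedelta


def f_linear(publish_time, now, window_hours=72.0):
    """Linear form: 1.0 when just published, falling to 0.0 at window_hours."""
    age_hours = max((now - publish_time).total_seconds() / 3600.0, 0.0)
    return max(1.0 - age_hours / window_hours, 0.0)


def f_exponential(publish_time, now, half_life_hours=24.0):
    """Non-linear form: the weight halves every half_life_hours."""
    age_hours = max((now - publish_time).total_seconds() / 3600.0, 0.0)
    return 0.5 ** (age_hours / half_life_hours)


now = datetime(2024, 1, 3, 12, 0)
recent = now - timedelta(hours=1)
older = now - timedelta(hours=48)
# Both forms give the more recent item the larger weight.
weights = {
    "linear": (f_linear(recent, now), f_linear(older, now)),
    "exponential": (f_exponential(recent, now), f_exponential(older, now)),
}
```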
g(SiteHotness_k) is the hotness weighting function, which ensures that the reprint weight CitationHotness_k serves as a quality indicator. In general, the higher the relative hotness value SiteHotness_k of a website, the larger its reprint weight CitationHotness_k should be.
Similarly, the hotness weighting function g(SiteHotness_k) can take many specific functional forms, and may be linear or non-linear. Embodiments of the present invention place no limitation on the specific functional form of g(SiteHotness_k), provided that it satisfies the basic principle that the higher the relative hotness value SiteHotness_k of a website, the larger the value of the reprint weight CitationHotness_k.
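The calculation of NewsHotness_i can be sketched as follows in Python. The sketch assumes that the time factor is applied by multiplication, that g is the identity function, and that f is an exponential decay with a 24-hour half-life; these choices and all names are illustrative assumptions, not requirements of the embodiment.

```python
from datetime import datetime, timedelta


def g(site_hotness_value):
    """Hotness weighting function; assumed linear (identity) in this sketch."""
    return site_hotness_value


def f(publish_time, now, half_life_hours=24.0):
    """Time weighting function; assumed exponential decay in this sketch."""
    age_hours = max((now - publish_time).total_seconds() / 3600.0, 0.0)
    return 0.5 ** (age_hours / half_life_hours)


def news_hotness(reprinting_sources, site_hotness, publish_time, now):
    """f(PublishTime) times the sum of CitationHotness_k over the sources in K.

    reprinting_sources is the set K of sources that reprint the item;
    site_hotness maps each source to its relative hotness value.
    """
    citation_sum = sum(g(site_hotness[k]) for k in reprinting_sources)
    return f(publish_time, now) * citation_sum


site_hotness = {"A": 0.85, "B": 0.65, "C": 0.74}  # illustrative values
now = datetime(2024, 1, 3, 12, 0)
fresh = news_hotness(["A", "B"], site_hotness, now, now)
day_old = news_hotness(["A", "B"], site_hotness, now - timedelta(hours=24), now)
```

With these assumptions, an item reprinted by A and B scores 1.5 when just published and half of that one day later, so the same set of reprints counts for less as the item ages.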
Step 103: Sum the reprint weights of each reprinted information item over the information web page sources to calculate the information hotness value of each reprinted information item, and determine the hot information from the reprinted information according to the order of the information hotness values.
Here, the reprint weights of each reprinted information item are summed to obtain its information hotness score. The items can then be sorted from high to low, and a suitable number of news items selected for display.
For example, the display can be preset to show 10 hot information items. After the reprinted information items are sorted by information hotness score, the top 10 news items are selected, from high to low, and displayed as hot information.
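The selection of a preset number of hot information items can be sketched in Python as below. The function name top_hot_items and the sample scores are hypothetical; the default of 10 follows the example in the text.

```python
import heapq


def top_hot_items(hotness_by_item, n=10):
    """Return the n (item, hotness) pairs with the highest hotness, highest first."""
    return heapq.nlargest(n, hotness_by_item.items(), key=lambda pair: pair[1])


# Illustrative scores only.
scores = {"item-1": 1.50, "item-2": 0.74, "item-3": 2.10, "item-4": 0.20}
top2 = top_hot_items(scores, n=2)
```

When fewer than n items are available, all of them are returned in descending order of hotness.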
Preferably, in embodiments of the present invention, all news can first be classified, for example into domestic, international, entertainment and so on, and the embodiment is then applied within each news category to mine the hot information of that category.
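Mining within each category can be sketched in Python as below, assuming that each item already carries a category label and a computed hotness value. The tuple layout, the function name and the sample data are illustrative assumptions.

```python
from collections import defaultdict


def top_hot_items_by_category(items, n=10):
    """Rank items separately within each category.

    items is an iterable of (item_id, category, hotness) tuples.
    Returns {category: [(item_id, hotness), ...]} sorted from high to low.
    """
    by_category = defaultdict(list)
    for item_id, category, hotness in items:
        by_category[category].append((item_id, hotness))
    return {
        category: sorted(entries, key=lambda e: e[1], reverse=True)[:n]
        for category, entries in by_category.items()
    }


# Illustrative data only.
items = [
    ("n1", "domestic", 1.5),
    ("n2", "international", 0.7),
    ("n3", "domestic", 2.1),
    ("n4", "entertainment", 0.9),
]
ranked = top_hot_items_by_category(items, n=1)
```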
Based on the above analysis, embodiments of the present invention also propose a hot information mining system.
Fig. 2 is a schematic diagram of the hot information mining system according to an embodiment of the present invention.
As shown in Fig. 2, the system includes a relative hotness value calculation unit 201, a reprint weight calculation unit 202 and a hot information determination unit 203.
Wherein:
the relative hotness value calculation unit 201 is configured to calculate the relative hotness value of each information web page source according to the access counts of the information web page sources;
the reprint weight calculation unit 202 is configured to calculate, according to the relative hotness values of the information web page sources, the reprint weight of each reprinted information item in each information web page source that reprints it;
the hot information determination unit 203 is configured to sum the reprint weights of each reprinted information item over the information web page sources to calculate the information hotness value of each reprinted information item, and to determine the hot information from the reprinted information according to the order of the information hotness values.
Preferably, the hot information determination unit 203 is further configured to determine a time factor according to the publication time of each reprinted information item, and to correct each information hotness value using the time factor.
Preferably, the reprint weight calculation unit 202 is further configured to determine the reprinted information from the information web page sources by a similarity algorithm based on text features.
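In one embodiment, the relative hotness value calculation unit 201 is configured to calculate, for the k-th information web page source, its relative hotness value SiteHotness_k as:
$$\mathrm{SiteHotness}_k = \mathrm{norm} \cdot \frac{\log(\mathrm{AccessCount}_k)}{\log\left(\sum_{j \in K} \mathrm{AccessCount}_j\right)}$$
where norm is a normalization coefficient, AccessCount_k is the access count of the k-th information web page source, and K is the set of all information web page sources.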
In one embodiment, the reprint weight calculation unit 202 is configured to calculate, for the i-th reprinted information item, its information hotness value NewsHotness_i as:
$$\mathrm{NewsHotness}_i = f(\mathrm{PublishTime}) \cdot \sum_{k \in K} \mathrm{CitationHotness}_k$$
where:
CitationHotness_k = g(SiteHotness_k);
K is the set of all information web page sources that reprint the i-th reprinted information item; PublishTime is the publication time of the i-th reprinted information item; f(PublishTime) is a time weighting function of PublishTime; CitationHotness_k is the reprint weight of the i-th reprinted information item in the k-th information web page source that reprints it; and g(SiteHotness_k) is a hotness weighting function of SiteHotness_k.
Similarly, the time weighting function f(PublishTime) ensures the timeliness of the information hotness value NewsHotness_i. In general, the closer the publication time PublishTime is to the current time, the larger the value of f(PublishTime) should be.
The time weighting function f(PublishTime) can take many specific functional forms, and may be linear or non-linear. Embodiments of the present invention place no limitation on the specific functional form of f(PublishTime), provided that it satisfies the basic principle that the closer the publication time PublishTime is to the current time, the larger the value of f(PublishTime).
g(SiteHotness_k) is the hotness weighting function, which ensures that the reprint weight CitationHotness_k serves as a quality indicator. In general, the higher the relative hotness value SiteHotness_k of a website, the larger its reprint weight CitationHotness_k should be.
Similarly, the hotness weighting function g(SiteHotness_k) can take many specific functional forms, and may be linear or non-linear. Embodiments of the present invention place no limitation on the specific functional form of g(SiteHotness_k), provided that it satisfies the basic principle that the higher the relative hotness value SiteHotness_k of a website, the larger the value of the reprint weight CitationHotness_k.
In one embodiment, the system further includes a hot information display unit 204, which is configured to display the hot information determined from the reprinted information. For example, the hot information display unit 204 can be preset to display 10 hot information items; after the reprinted information items are sorted by information hotness score, the top 10 news items are selected, from high to low, and displayed as hot information.
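The cooperation of units 201 to 204 can be sketched in Python as below. The class and method names are hypothetical, g is taken to be the identity function, and the optional time factor is omitted for brevity; this is an illustrative arrangement under those assumptions rather than the structure of any particular implementation.

```python
import math


class RelativeHotnessUnit:
    """Unit 201: relative hotness value of each source from its access count."""

    def __init__(self, norm=1.0):
        self.norm = norm

    def compute(self, access_counts):
        # Assumes every count is at least 1 and the total exceeds 1.
        total = sum(access_counts.values())
        return {source: self.norm * math.log(count) / math.log(total)
                for source, count in access_counts.items()}


class ReprintWeightUnit:
    """Unit 202: reprint weight of each item in each source that reprints it."""

    def compute(self, reprints, site_hotness):
        # g is assumed to be the identity function in this sketch.
        return {item: [site_hotness[source] for source in sources]
                for item, sources in reprints.items()}


class HotInformationUnit:
    """Unit 203: sum the reprint weights of each item and rank the items."""

    def compute(self, weights):
        scores = {item: sum(ws) for item, ws in weights.items()}
        return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


class DisplayUnit:
    """Unit 204: keep only the preset number of hot information items."""

    def __init__(self, limit=10):
        self.limit = limit

    def select(self, ranked):
        return ranked[:self.limit]


def mine_hot_information(access_counts, reprints, limit=10):
    """Run units 201 to 204 in sequence and return the items to display."""
    hotness = RelativeHotnessUnit().compute(access_counts)
    weights = ReprintWeightUnit().compute(reprints, hotness)
    ranked = HotInformationUnit().compute(weights)
    return DisplayUnit(limit).select(ranked)


result = mine_hot_information(
    {"A": 50, "B": 20, "C": 30},
    {"n1": ["A", "B"], "n2": ["C"]},
    limit=1,
)
```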
According to embodiments of the present invention, hot news can be mined from the many news sources on the Internet. Based on the above detailed analysis, Fig. 3 is a schematic diagram of an exemplary hot news mining process according to an embodiment of the present invention.
As shown in Fig. 3, at processing block 1, massive amounts of news are crawled from numerous news sources on the Internet (such as news websites), and the reprint relationships among the news items are identified, i.e., which news items are reprints of the same news article.
For example, the identification technique here can use similarity calculation based on text features.
As an example, Fig. 4 is a schematic diagram of a reprinted-news identification result according to an embodiment of the present invention.
Fig. 4 shows that the news items "China's software market expected to reach 71.5 billion yuan in 2015" from different news sources are in fact reprints of the same news article.
At processing block 2, the relative hotness value (access hotness) of each news website is calculated from the hot news access logs and the access logs of other news of the numerous news websites.
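The relative hotness value of each website is calculated as follows:
$$\mathrm{SiteHotness}_k = \mathrm{norm} \cdot \frac{\log(\mathrm{AccessCount}_k)}{\log\left(\sum_{j \in K} \mathrm{AccessCount}_j\right)}$$
where K is the set of all websites, norm is a normalization coefficient, and AccessCount_k is the access count of the k-th news website.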
At processing block 3, the news hotness values are calculated by combining the reprint identification results of processing block 1, the publication times of the reprinted news, and the relative hotness values of the news websites calculated at processing block 2.
For example, for the i-th reprinted news item, its news hotness value NewsHotness_i is calculated as:
$$\mathrm{NewsHotness}_i = f(\mathrm{PublishTime}) \cdot \sum_{k \in K} \mathrm{CitationHotness}_k$$
where:
CitationHotness_k = g(SiteHotness_k);
K is the set of all news websites that reprint the i-th reprinted news item; PublishTime is the publication time of the i-th reprinted news item; f(PublishTime) is a time weighting function of PublishTime; CitationHotness_k is the reprint weight of the i-th reprinted news item in the k-th news website that reprints it; and g(SiteHotness_k) is a hotness weighting function of SiteHotness_k.
The time weighting function f(PublishTime) ensures the timeliness of the news hotness value NewsHotness_i. In general, the closer the publication time PublishTime is to the current time, the larger the value of f(PublishTime) should be.
The time weighting function f(PublishTime) can take many specific functional forms, and may be linear or non-linear. Embodiments of the present invention place no limitation on the specific functional form of f(PublishTime), provided that it satisfies the basic principle that the closer the publication time PublishTime is to the current time, the larger the value of f(PublishTime).
At processing block 4, the hot news is determined according to the calculation results of processing block 3 and pushed to users through various channels such as microblogs, web pages and e-mail. Once determined, the hot news can be stored in the hot news access log so that users can go back and access it at any time.
For example, Fig. 5 is a schematic diagram of a hot information display according to an embodiment of the present invention. Moreover, embodiments of the present invention preferably show the specific sources of the hot information in the pushed results.
In embodiments of the present invention, the relative hotness value of each information web page source is first calculated according to the access counts of the information web page sources; the reprint weight of each reprinted information item in each information web page source that reprints it is then calculated according to those relative hotness values; the reprint weights of each reprinted information item are summed to calculate its information hotness value; and the hot information is determined from the reprinted information according to the order of the information hotness values. It can thus be seen that, with embodiments of the present invention, hot information can be generated automatically from the entire Internet based on the information hotness values of reprinted information, which saves manual effort and reduces cost.
Moreover, embodiments of the present invention can support the display of any number of hot news items and can support computation over the news of the entire Internet. Through automatic mining of source authority, low-quality websites can be dynamically eliminated and high-quality websites reinforced, so that the mining quality is continuously optimized.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.