CN108509490A - A network hot topic discovery method and system
- Publication number
- CN108509490A (CN201810136641.8A)
- Authority
- CN
- China
- Prior art keywords
- word
- hot
- hot word
- topic
- candidate
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a network hot topic discovery method and system. The method includes: extracting hot words from a text corpus of a preset time period; when the correlation between two hot words exceeds a first preset threshold, merging the two hot words into the same set; calculating the co-occurrence degree between every two hot word sets and, when the co-occurrence degree between two hot word sets exceeds a second preset threshold, merging the two sets; and calculating the co-occurrence degree between every two hot words within each hot word set, tallying the co-occurrence degree of each hot word, and selecting the several top-ranked hot words by co-occurrence degree as the title of the corresponding hot topic. The method and system discover network hot topics from the perspective of hot words, which better matches the definition of a hot topic. Merging related hot words and co-occurring hot word sets ensures that the information within each hot topic is highly related, enables accurate discovery of network hot topics, and helps users identify them intuitively.
Description
Technical field
The present invention relates to the technical field of text information processing, and more particularly to a network hot topic discovery method and system.
Background
With the growing popularity of the Internet, network resources are increasing exponentially, and traditional manual information processing can no longer meet the demands of large-scale information acquisition. New information technology is therefore needed to monitor and analyze online public opinion and to meet the information needs of users in every industry. Discovering network hot topics makes it possible to follow developments in each industry promptly and to monitor its online information in real time. Tracking hot topics also helps the relevant departments keep abreast of opinion trends on the network, guide and analyze public opinion correctly, and maintain social stability.
A topic is centered on one event and includes multiple events related to that theme. A hot topic is an event that, within the same period, receives a large volume of news coverage and user discussion and spreads widely. The title of a hot topic usually consists of several semantically related words or a single phrase, from which the main content of the topic can be understood fairly completely, for example the hot topics "2017 potato conference" and "'Internet+' modern agriculture". Traditional hot topic discovery methods cluster texts; by what is clustered, they fall mainly into word-based clustering, content-based clustering, and information-based clustering. Different clustering algorithms differ in effectiveness, but clustering-based methods do not help users identify hot topics intuitively.
In view of this, there is an urgent need for a network hot topic discovery method and system that helps users identify hot topics intuitively and supports timely understanding of developments in each industry.
Summary of the invention
To overcome the problems that prior-art hot topic discovery methods are not accurate enough and do not help users identify hot topics intuitively, the present invention provides a network hot topic discovery method and system.
In one aspect, the present invention provides a network hot topic discovery method, including:
S1: extracting candidate hot words from a text corpus of a preset time period based on the information entropy of words;
S2: calculating the heat of each candidate hot word from its single-day word frequency and historical volatility, sorting the candidate hot words in descending order of heat, and taking the top N candidate hot words as hot words, where N>=1;
S3: calculating the correlation between every two hot words; when the correlation between two hot words exceeds a first preset threshold, merging the two hot words into the same set, and storing each remaining hot word in a set of its own, to obtain all hot word sets;
S4: calculating the co-occurrence degree between every two hot word sets; when the co-occurrence degree between two hot word sets exceeds a second preset threshold, merging the two hot word sets, to obtain all merged hot word sets;
S5: calculating the co-occurrence degree between every two hot words in each merged hot word set, tallying the co-occurrence degree of each hot word, sorting all hot words in descending order of co-occurrence degree, and taking the top M hot words as the title of the hot topic corresponding to each merged hot word set, where M>=1.
Preferably, step S1 further includes:
S11: obtaining the text corpus of the preset time period, and calculating the left information entropy and the right information entropy of each word in the text corpus using the information entropy formula;
S12: comparing the left information entropy and the right information entropy with a preset minimum threshold and a preset maximum threshold, respectively; when r1<H(lw)<r2 and r1<H(rw)<r2, calculating the richness of the word by:
R = H(lw) * H(rw)
where H(lw) and H(rw) are the left information entropy and the right information entropy of word w, respectively; r1 and r2 are the preset minimum threshold and maximum threshold, respectively; and R is the richness of word w;
S13: sorting the words in descending order of richness, and taking the top K words as candidate hot words, where K>=1.
Preferably, step S12 further includes:
for two adjacent words w and w1, when H(lw)>r2, H(rw)<r1, H(lw1)<r1 and H(rw1)>r2, merging the words w and w1 into one new word, where H(lw) and H(rw) are the left information entropy and the right information entropy of word w, and H(lw1) and H(rw1) are the left information entropy and the right information entropy of word w1;
correspondingly, step S13 further includes: taking the new word as a candidate hot word.
Preferably, calculating the heat of the candidate hot word from its single-day word frequency and historical volatility in step S2 further includes:
calculating a base weight of the candidate hot word from its single-day word frequency, and calculating a fluctuation weight of the candidate hot word from its historical volatility;
calculating the heat of the candidate hot word from the base weight and the fluctuation weight by:
H = B*0.5 + F*0.5
where B is the base weight of the candidate hot word, F is its fluctuation weight, and H is its heat.
Preferably, calculating the correlation between every two hot words in step S3 further includes:
calculating the edit-distance similarity and the HowNet similarity of every two hot words;
calculating the correlation between every two hot words from the edit-distance similarity and the HowNet similarity by:
sim(X, Y) = α*sim_e(X, Y) + β*sim_c(X, Y), α+β=1
where sim(X, Y) is the correlation between word X and word Y; sim_e(X, Y) is the edit-distance similarity of word X and word Y; sim_c(X, Y) is the HowNet similarity of word X and word Y; and α and β are the weights of the edit-distance similarity and the HowNet similarity, respectively.
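The weighted correlation can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the text does not say how edit distance is normalized into a similarity, so `edit_similarity` assumes 1 - distance / (longer length), and the HowNet similarity is passed in as a hypothetical callable `hownet_sim` because no HowNet API is described here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer word."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def correlation(x: str, y: str, hownet_sim, alpha: float = 0.5) -> float:
    """sim(X, Y) = alpha * sim_e(X, Y) + beta * sim_c(X, Y), with alpha + beta = 1.

    hownet_sim is a hypothetical callable (x, y) -> float supplied by the caller.
    """
    beta = 1.0 - alpha
    return alpha * edit_similarity(x, y) + beta * hownet_sim(x, y)
```

The default `alpha=0.5` is only a placeholder; the text leaves the choice of α and β open.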
Preferably, calculating the co-occurrence degree between every two hot word sets in step S4 further includes:
calculating the co-occurrence degree between every two words across the two hot word sets; comparing these word-level co-occurrence degrees, and taking their maximum as the co-occurrence degree between the two hot word sets.
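A small sketch of the "maximum over word pairs" rule follows. The text does not define how the co-occurrence degree of two words is measured, so `word_cooccurrence` uses an assumed stand-in (the Jaccard overlap of the documents that contain each word), and `docs` is a hypothetical list of token sets, one per document.

```python
from itertools import product


def word_cooccurrence(a, b, docs):
    """Assumed measure: Jaccard overlap of the documents containing a and b."""
    docs_a = {i for i, doc in enumerate(docs) if a in doc}
    docs_b = {i for i, doc in enumerate(docs) if b in doc}
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


def set_cooccurrence(set_a, set_b, docs):
    """Co-occurrence of two hot word sets: the maximum over all cross-set word pairs."""
    return max(
        (word_cooccurrence(a, b, docs) for a, b in product(set_a, set_b)),
        default=0.0,  # either set is empty
    )
```

Taking the maximum means one strongly co-occurring word pair is enough to link two sets; any other word-level measure could be substituted for the Jaccard stand-in without changing `set_cooccurrence`.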
Preferably, the method further includes, after step S2: adding the candidate hot words to the word segmentation dictionary.
In another aspect, the present invention provides a network hot topic discovery system, including:
a candidate hot word extraction module, configured to extract candidate hot words from a text corpus of a preset time period based on the information entropy of words;
a hot word extraction module, configured to calculate the heat of each candidate hot word from its single-day word frequency and historical volatility, sort the candidate hot words in descending order of heat, and take the top N candidate hot words as hot words, where N>=1;
a hot word set acquisition module, configured to calculate the correlation between every two hot words, merge two hot words into the same set when their correlation exceeds a first preset threshold, and store each remaining hot word in a set of its own, to obtain all hot word sets;
a hot word set merging module, configured to calculate the co-occurrence degree between every two hot word sets and, when the co-occurrence degree between two hot word sets exceeds a second preset threshold, merge the two hot word sets, to obtain all merged hot word sets;
a hot topic acquisition module, configured to calculate the co-occurrence degree between every two hot words in each merged hot word set, tally the co-occurrence degree of each hot word, sort all hot words in descending order of co-occurrence degree, and take the top M hot words as the title of the hot topic corresponding to each merged hot word set, where M>=1.
In another aspect, the present invention provides a device for the network hot topic discovery method, including:
at least one processor; and
at least one memory communicatively connected to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform any of the methods described above.
In another aspect, the present invention provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform any of the methods described above.
The network hot topic discovery method and system provided by the present invention extract candidate hot words from a text corpus of a preset time period based on the information entropy of words; calculate the heat of each candidate hot word from its single-day word frequency and historical volatility, and select the several candidate hot words with the highest heat as hot words; then calculate the correlation between every two hot words, merge two hot words whose correlation exceeds the first preset threshold into the same set, and store each remaining hot word in a set of its own, to obtain all hot word sets; next calculate the co-occurrence degree between every two hot word sets and merge two hot word sets whose co-occurrence degree exceeds the second preset threshold; and finally calculate the co-occurrence degree between every two hot words in each hot word set, tally the co-occurrence degree of each hot word in each set, and take the several hot words with the highest co-occurrence degree as the title of the hot topic corresponding to each hot word set, thereby discovering the network hot topics. Because the method discovers network hot topics from the perspective of hot words, it ensures that the topics are genuinely hot and better matches the definition of a hot topic. In addition, by measuring hot word correlation and calculating the co-occurrence degree of hot word sets, related hot words and co-occurring hot word sets are merged, which ensures that the information within each hot topic is highly related, enables accurate discovery of network hot topics, helps users identify them intuitively, supports timely understanding of developments in each industry, and is of great significance for monitoring and analyzing online public opinion.
Description of the drawings
Fig. 1 is an overall flow diagram of a network hot topic discovery method according to an embodiment of the present invention;
Fig. 2 is an overall structural diagram of a network hot topic discovery system according to an embodiment of the present invention;
Fig. 3 is a structural block diagram of a device for the network hot topic discovery method according to an embodiment of the present invention.
Detailed description
Specific implementations of the present invention are described in further detail below with reference to the accompanying drawings and embodiments. The following embodiments illustrate the present invention and do not limit its scope.
Fig. 1 is an overall flow diagram of a network hot topic discovery method according to an embodiment of the present invention. As shown in Fig. 1, the present invention provides a network hot topic discovery method, including:
S1: extracting candidate hot words from a text corpus of a preset time period based on the information entropy of words;
S2: calculating the heat of each candidate hot word from its single-day word frequency and historical volatility, sorting the candidate hot words in descending order of heat, and taking the top N candidate hot words as hot words, where N>=1;
S3: calculating the correlation between every two hot words; when the correlation between two hot words exceeds a first preset threshold, merging the two hot words into the same set, and storing each remaining hot word in a set of its own, to obtain all hot word sets;
S4: calculating the co-occurrence degree between every two hot word sets; when the co-occurrence degree between two hot word sets exceeds a second preset threshold, merging the two hot word sets, to obtain all merged hot word sets;
S5: calculating the co-occurrence degree between every two hot words in each merged hot word set, tallying the co-occurrence degree of each hot word, sorting all hot words in descending order of co-occurrence degree, and taking the top M hot words as the title of the hot topic corresponding to each merged hot word set, where M>=1.
Specifically, for the hot topics of a given time period, the text corpus of that period is obtained, for example by searching for online news reports from that period, and candidate hot words are then extracted from the obtained text corpus. In this embodiment, candidate hot words are extracted based on the information entropy of words: the information entropy of each word in the text corpus is calculated first, and candidate hot words are then determined from it. The information entropy of a word indicates how rich its collocations are; the richer the collocations of a word, the more representative it is and the more likely it is to become a hot word. Candidate hot words can therefore be extracted accurately from the text corpus based on the information entropy of words. Candidate hot words include new words, entity words, and non-entity words; they can generally express the content of the text fully and are representative.
Further, candidate hot words all attract some attention on the network, but the level of attention differs from one candidate hot word to another, and only candidate hot words with high attention can be called hot words. Therefore, on the basis of the above technical solution, the heat of each obtained candidate hot word is calculated, all candidate hot words are sorted in descending order of heat, and the top N candidate hot words are selected as hot words, where N>=1. That is, the several candidate hot words with the highest heat are chosen as hot words; the exact number can be set according to actual needs and is not specifically limited here. Generally, hot words are characterized by high single-day word frequency and large historical volatility. Accordingly, in this embodiment the heat of a candidate hot word is calculated from its single-day word frequency and historical volatility, so that hot words can be extracted accurately from the candidate hot words according to their heat. In other embodiments, the heat of a candidate hot word may be calculated in other ways, which can be set according to actual needs and are not specifically limited here.
Further, for the hot words obtained above, the correlation between every two hot words is calculated. The correlation of words comprises syntactic correlation and semantic correlation: syntactic correlation is reflected in the edit-distance similarity between words, and semantic correlation is reflected in the HowNet similarity between words. The correlation between every two hot words can therefore be obtained by calculating their edit-distance similarity and HowNet similarity. On this basis, the correlation between every two hot words is compared with the first preset threshold. When the correlation between two hot words exceeds the first preset threshold, the two hot words are merged into one set; if the correlation between a hot word and every other hot word does not exceed the first preset threshold, that hot word is stored in a set of its own. Once every hot word has been stored in a set, all hot word sets are obtained. The first preset threshold is set in advance, can be set according to actual needs, and is not specifically limited here.
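The grouping step can be sketched with a union-find structure, as below. This is an illustration under assumptions rather than the patent's procedure: the text does not say whether merging is transitive (if A relates to B and B relates to C, do all three share a set?), and this sketch assumes it is. The `sim` argument is a hypothetical callable returning the correlation of two words.

```python
from itertools import combinations


def group_hot_words(hot_words, sim, threshold):
    """Group hot words whose pairwise correlation exceeds the threshold.

    Words that exceed the threshold with no other word stay in singleton sets.
    """
    parent = {w: w for w in hot_words}

    def find(w):
        # Walk to the representative of w's group, halving the path on the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(hot_words, 2):
        if sim(a, b) > threshold:
            parent[find(a)] = find(b)  # union the two groups

    groups = {}
    for w in hot_words:
        groups.setdefault(find(w), set()).add(w)
    return list(groups.values())
```

Comparing every pair is quadratic in the number of hot words, which is usually acceptable because only the top N candidate hot words reach this step.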
Further, for all the hot word sets obtained above, the co-occurrence degree between every two hot word sets is calculated. Specifically, the co-occurrence degree between two hot word sets is obtained by calculating the co-occurrence degrees between the words of the two sets. On this basis, the co-occurrence degree between every two hot word sets is compared with the second preset threshold, and when the co-occurrence degree between two hot word sets exceeds the second preset threshold, the two sets are merged. After this processing, all merged hot word sets are obtained, and each resulting hot word set represents one hot topic. The second preset threshold is set in advance, can be set according to actual needs, and is not specifically limited here.
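One way to realize the set-merging step is an iterative pass that keeps merging until no pair of sets exceeds the threshold. The text does not state whether merged sets are re-compared against the remaining sets; the sketch below assumes they are. `set_cooc` is a hypothetical callable that returns the co-occurrence degree of two sets.

```python
def merge_hot_word_sets(word_sets, set_cooc, threshold):
    """Merge hot word sets until no two sets co-occur above the threshold."""
    sets = [set(s) for s in word_sets]  # copy so the caller's sets are untouched
    merged = True
    while merged:
        merged = False
        for i in range(len(sets)):
            for j in range(i + 1, len(sets)):
                if set_cooc(sets[i], sets[j]) > threshold:
                    sets[i] |= sets.pop(j)  # fold set j into set i
                    merged = True
                    break
            if merged:
                break  # restart the scan, since the list has changed
    return sets
```

Restarting the scan after each merge keeps the loop simple at the cost of repeated comparisons; for the small number of hot word sets expected at this stage that trade-off seems reasonable.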
Further, for each merged hot word set, the co-occurrence degree between every two hot words in the set is calculated, and the co-occurrence degree of each hot word in the set is then tallied. All hot words in each set are sorted in descending order of co-occurrence degree, and the top M hot words are selected as the title of the hot topic corresponding to that hot word set, where M>=1. That is, for any hot word set, the several hot words with the highest co-occurrence degree in the set are chosen as the title of the corresponding hot topic; the number of hot words chosen can be set according to actual needs and is not specifically limited here. When a hot topic title contains multiple hot words, the hot words can be separated by spaces.
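The title selection can be sketched as follows. The text says the co-occurrence degree of each hot word is tallied but does not give the aggregation; this sketch assumes it is the sum of the word's pairwise co-occurrence degrees with the other words in the set. `word_cooc` is a hypothetical callable for the pairwise degree.

```python
from itertools import combinations


def topic_title(hot_word_set, word_cooc, m=2):
    """Join the top-m hot words, ranked by total in-set co-occurrence, with spaces."""
    totals = {w: 0.0 for w in hot_word_set}
    for a, b in combinations(sorted(hot_word_set), 2):
        score = word_cooc(a, b)
        totals[a] += score
        totals[b] += score
    # Descending by total; ties broken alphabetically so the result is deterministic.
    ranked = sorted(totals, key=lambda w: (-totals[w], w))
    return " ".join(ranked[:m])
```

With a set containing "2017", "potato" and "congress", a title such as "potato congress" would result if those two words had the highest totals.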
The network hot topic discovery method provided by the present invention extracts candidate hot words from a text corpus of a preset time period based on the information entropy of words; calculates the heat of each candidate hot word from its single-day word frequency and historical volatility, and selects the several candidate hot words with the highest heat as hot words; then calculates the correlation between every two hot words, merges two hot words whose correlation exceeds the first preset threshold into the same set, and stores each remaining hot word in a set of its own, to obtain all hot word sets; next calculates the co-occurrence degree between every two hot word sets and merges two hot word sets whose co-occurrence degree exceeds the second preset threshold; and finally calculates the co-occurrence degree between every two hot words in each hot word set, tallies the co-occurrence degree of each hot word in each set, and takes the several hot words with the highest co-occurrence degree as the title of the hot topic corresponding to each hot word set, thereby discovering the network hot topics. Because the method discovers network hot topics from the perspective of hot words, it ensures that the topics are genuinely hot and better matches the definition of a hot topic. In addition, by measuring hot word correlation and calculating the co-occurrence degree of hot word sets, related hot words and co-occurring hot word sets are merged, which ensures that the information within each hot topic is highly related, enables accurate discovery of network hot topics, helps users identify them intuitively, supports timely understanding of developments in each industry, and is of great significance for monitoring and analyzing online public opinion.
Based on any of the above embodiments, a network hot topic discovery method is provided, in which step S1 further includes:
S11: obtaining the text corpus of the preset time period, and calculating the left information entropy and the right information entropy of each word in the text corpus using the information entropy formula;
S12: comparing the left information entropy and the right information entropy with a preset minimum threshold and a preset maximum threshold, respectively; when r1<H(lw)<r2 and r1<H(rw)<r2, calculating the richness of the word by:
R = H(lw) * H(rw)
where H(lw) and H(rw) are the left information entropy and the right information entropy of word w, respectively; r1 and r2 are the preset minimum threshold and maximum threshold, respectively; and R is the richness of word w;
S13: sorting the words in descending order of richness, and taking the top K words as candidate hot words, where K>=1.
Specifically, on the basis of the above technical solution, extracting candidate hot words from the text corpus of the preset time period based on the information entropy of words is implemented as follows:
For the hot topics of a given time period, the text corpus of that period is obtained, for example by searching for online news reports from that period. For a word w in the text corpus, the information entropy H(w) is calculated by:
H(w) = -∑ p(x) log p(x)
where p(x) is the probability of character x among all characters.
On this basis, the left information entropy and the right information entropy of word w are calculated using the information entropy formula, as follows:
Let L = {(l1, cl1), (l2, cl2), …, (ln, cln)} be the left-neighbor set of word w and R = {(r1, cr1), (r2, cr2), …, (rm, crm)} be its right-neighbor set, where li and rj are the characters adjacent to word w on the left and right, respectively, and cli and crj are the numbers of times those adjacent characters occur. The left information entropy H(lw) and the right information entropy H(rw) of word w are then defined as:
H(lw) = -∑_{i=1..n} p(li) log p(li), with p(li) = cli / (cl1 + … + cln)
H(rw) = -∑_{j=1..m} p(rj) log p(rj), with p(rj) = crj / (cr1 + … + crm)
Further, the left information entropy H(lw) and the right information entropy H(rw) of word w calculated above are each compared with the preset minimum threshold r1 and maximum threshold r2. When r1<H(lw)<r2 and r1<H(rw)<r2, the richness R of the collocations of word w is calculated by:
R = H(lw) * H(rw)
The words are sorted in descending order of the calculated R values, and the top K words are selected as candidate hot words, where K>=1; K can be set according to actual needs and is not specifically limited here.
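The entropy-based extraction in S11 to S13 can be sketched as follows. Several details are assumptions made for illustration: the corpus is taken to be an already segmented token sequence, the neighbors of a word are taken to be its adjacent tokens (the text speaks of adjacent characters), and the natural logarithm is used because the base is not stated.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy of a frequency table, using the natural logarithm."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def neighbor_entropies(tokens, word):
    """Return (left entropy, right entropy) of word over its adjacent tokens."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        if i > 0:
            left[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            right[tokens[i + 1]] += 1
    return entropy(left), entropy(right)


def candidate_words(tokens, r1, r2, k):
    """Keep words with both entropies strictly inside (r1, r2); rank by R = H_l * H_r."""
    richness = {}
    for word in set(tokens):
        h_left, h_right = neighbor_entropies(tokens, word)
        if r1 < h_left < r2 and r1 < h_right < r2:
            richness[word] = h_left * h_right
    # Descending by richness; ties broken alphabetically for determinism.
    return sorted(richness, key=lambda w: (-richness[w], w))[:k]
```

Rescanning the token list for every distinct word is quadratic in the worst case; a single pass that fills per-word neighbor counters would scale better on a real corpus, at the cost of a slightly longer sketch.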
The network hot topic discovery method provided by the present invention calculates the left information entropy and the right information entropy of each word in the text corpus using the information entropy formula. When both the left and the right information entropy fall within the preset range, the richness of the word is calculated from them, and the several words with the highest richness are selected as candidate hot words. This achieves accurate extraction of candidate hot words and, in turn, of hot words, which facilitates discovering network hot topics from the perspective of hot words, ensures that the topics are genuinely hot, and better matches the definition of a hot topic. It also helps achieve accurate discovery of network hot topics, makes it easier for users to identify them intuitively, supports timely understanding of developments in each industry, and is of great significance for monitoring and analyzing online public opinion.
Based on any of the above embodiments, a network hot topic discovery method is provided, in which step S12 further includes:
for two adjacent words w and w1, when H(lw)>r2, H(rw)<r1, H(lw1)<r1 and H(rw1)>r2, merging the words w and w1 into one new word, where H(lw) and H(rw) are the left information entropy and the right information entropy of word w, and H(lw1) and H(rw1) are the left information entropy and the right information entropy of word w1;
correspondingly, step S13 further includes: taking the new word as a candidate hot word.
Specifically, in general, the larger the left and right information entropy of a word, the better the word expresses the content of the text. There is, however, a special case: one word has a very large left information entropy and a very small right information entropy, while its adjacent word has a very small left information entropy and a very large right information entropy. In that case the word and its adjacent word need to be merged into one new word. That is, for two adjacent words w and w1, when H(lw)>r2, H(rw)<r1, H(lw1)<r1 and H(rw1)>r2, words w and w1 are merged into one new word. Finally, the merged new word is taken as a candidate hot word.
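The merge rule for adjacent words can be sketched as below. The sketch assumes the left and right entropies have already been computed and are supplied in a hypothetical mapping `entropies` from word to a (left, right) pair; the two words are concatenated without a separator, as would be natural for Chinese text, and the `sep` parameter is an added convenience that is not part of the text.

```python
def merge_adjacent(tokens, entropies, r1, r2, sep=""):
    """Return new words formed from adjacent pairs (w, w1) that satisfy
    H(lw) > r2, H(rw) < r1, H(lw1) < r1 and H(rw1) > r2."""

    def opens_phrase(word):
        # Varied left context, fixed right context: likely the start of a phrase.
        h_left, h_right = entropies[word]
        return h_left > r2 and h_right < r1

    def closes_phrase(word):
        # Fixed left context, varied right context: likely the end of a phrase.
        h_left, h_right = entropies[word]
        return h_left < r1 and h_right > r2

    new_words = set()
    for w, w1 in zip(tokens, tokens[1:]):
        if w in entropies and w1 in entropies and opens_phrase(w) and closes_phrase(w1):
            new_words.add(w + sep + w1)
    return new_words
```

The intuition is that a word almost always followed by the same neighbor, next to a word almost always preceded by the same neighbor, is probably one half of a single expression that the segmenter split.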
The network hot topic discovery method provided by the present invention calculates the left information entropy and the right information entropy of each word in the text corpus using the information entropy formula. When a word has a very large left information entropy and a very small right information entropy, and its adjacent word has a very small left information entropy and a very large right information entropy, the word and its adjacent word are merged into one new word, and the merged new word is taken as a candidate hot word. This achieves accurate extraction of candidate hot words and, in turn, of hot words, which facilitates discovering network hot topics from the perspective of hot words, ensures that the topics are genuinely hot, and better matches the definition of a hot topic. It also helps achieve accurate discovery of network hot topics, makes it easier for users to identify them intuitively, supports timely understanding of developments in each industry, and is of great significance for monitoring and analyzing online public opinion.
Based on any of the above embodiments, a network hot topic discovery method is provided, in which calculating the heat of the candidate hot word from its single-day word frequency and historical volatility in step S2 further includes:
calculating a base weight of the candidate hot word from its single-day word frequency, and calculating a fluctuation weight of the candidate hot word from its historical volatility;
calculating the heat of the candidate hot word from the base weight and the fluctuation weight by:
H = B*0.5 + F*0.5
where B is the base weight of the candidate hot word, F is its fluctuation weight, and H is its heat.
Specifically, on the basis of the above technical solution, calculating the heat of a candidate hot word from its single-day word frequency and historical volatility is implemented as follows:
Because hot words are characterized by high single-day word frequency and large historical volatility, the heat of a candidate hot word is calculated from these two aspects. First, the base weight of the candidate hot word is calculated from its single-day word frequency, that is, the word frequency counted for the candidate hot word on a given day. To keep differences in the number of texts per day from affecting the base weight, the frequency is smoothed. The base weight B is calculated by:
B = log(1 + log(1 + log(tf + 1)))
where tf is the daily word frequency of the candidate hot word.
Further, the fluctuation weight of the hot word candidate is calculated from its historical volatility. Historical volatility is considered from three aspects of the base weight: overall fluctuation, long-term change and short-term change. The larger the historical volatility of a hot word candidate, the more likely it is to become a hot word. The overall fluctuation and long-term change of the single-day word frequency reflect the historical volatility of a hot word candidate better than the short-term change does, so they are weighted more heavily than the short-term change when the fluctuation weight is calculated. To simplify the choice of weight coefficients, this embodiment gives the overall fluctuation and the long-term change equal weights. Given that the three weights must sum to 1, the final formula assigns a weight of 0.4 each to the overall fluctuation and the long-term change of the base weight, and a weight of 0.2 to the short-term change. The overall fluctuation V, long-term change L and short-term change S of the base weight, and the fluctuation weight F, are expressed as follows:
F = V*0.4 + L*0.4 + S*0.2
where n denotes the length of the experimental data period and Bi denotes the base word frequency.
The heat weight of a hot word candidate consists of two parts, the base weight and the fluctuation weight, which reflect the two characteristics of hot words respectively, so the two are weighted equally when the heat weight is calculated. The heat weight H is expressed as follows:
H = B*0.5 + F*0.5
The heat of every hot word candidate can be calculated with the above formula, and the several hot word candidates with the highest heat are selected as hot words.
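The heat calculation above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the logarithm is taken to be the natural logarithm (the text does not name a base), and because the formulas for V, L and S are not reproduced here, they are passed in as precomputed values. The function names are hypothetical.

```python
import math


def base_weight(tf):
    """B = log(1 + log(1 + log(tf + 1))), with tf the single-day word frequency."""
    return math.log(1 + math.log(1 + math.log(tf + 1)))


def fluctuation_weight(v, l, s):
    """F = V*0.4 + L*0.4 + S*0.2 (v, l, s are assumed to be precomputed)."""
    return v * 0.4 + l * 0.4 + s * 0.2


def heat(tf, v, l, s):
    """H = B*0.5 + F*0.5."""
    return base_weight(tf) * 0.5 + fluctuation_weight(v, l, s) * 0.5


def top_n_hot_words(candidates, n):
    """candidates maps a word to (tf, v, l, s); return the n words with the highest heat."""
    ranked = sorted(candidates, key=lambda w: heat(*candidates[w]), reverse=True)
    return ranked[:n]


# Illustrative inputs only; the numbers are made up.
candidates = {
    "a": (100, 0.9, 0.9, 0.9),
    "b": (1, 0.1, 0.1, 0.1),
    "c": (50, 0.5, 0.5, 0.5),
}
hot_words = top_n_hot_words(candidates, 2)
```

The triple logarithm grows very slowly, so a large difference in raw frequency translates into a small difference in base weight, which is the smoothing effect the text describes.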
In the network hot topic discovery method provided by the present invention, the base weight of a hot word candidate is calculated from its single-day word frequency, the fluctuation weight is calculated from its historical volatility, the heat of the candidate is calculated from the base weight and the fluctuation weight, and the several candidates with the highest heat are finally selected as hot words. This achieves accurate extraction of hot words, supports discovering network hot topics from the perspective of hot words, ensures that the discovered topics are genuinely hot, and better matches the definition of a hot topic. It also helps discover network hot topics accurately, lets users identify them intuitively, supports a timely understanding of trends in each industry, and is of great significance for monitoring and analyzing online public opinion.
Based on any of the above embodiments, a network hot topic discovery method is provided in which calculating the correlation of every two hot words, as described in step S3, further comprises:
calculating the edit distance similarity and the HowNet similarity of every two hot words;
calculating the correlation of every two hot words from the edit distance similarity and the HowNet similarity, using the formula:
sim(X, Y) = α*sime(X, Y) + β*simc(X, Y), α + β = 1
where sim(X, Y) denotes the correlation of word X and word Y; sime(X, Y) denotes the edit distance similarity of word X and word Y; simc(X, Y) denotes the HowNet similarity of word X and word Y; and α and β denote the weights of the edit distance similarity and the HowNet similarity respectively.
Specifically, on the basis of the above technical solution, the correlation between every two hot words is calculated as follows:
The correlation of words comprises syntactic correlation and semantic correlation. Syntactic correlation is reflected in the edit distance similarity between words, and semantic correlation is reflected in the HowNet similarity between words, so the correlation between every two hot words can be obtained by calculating their edit distance similarity and HowNet similarity.
Edit distance, also known as Levenshtein distance, is the minimum number of edit operations needed to transform one string into another. There are three kinds of edit operation: substitution, insertion and deletion. The larger the edit distance between two strings, the lower their similarity.
Let word X consist of n characters, X = x1x2…xn, and word Y consist of m characters, Y = y1y2…ym, and let C = {cs, ci, cd} denote the costs of substituting, inserting and deleting one character respectively. The edit distance between X and Y is then defined recursively as follows:
where Head(X) = x1x2…xn-1, Head(Y) = y1y2…ym-1, Ci(ε, ym) = ci, and Cd(xn, ε) = cd.
It is generally considered that inserting a character and deleting a character cost about the same, so they are given the same weight, i.e. ci = cd, whereas the cost of substituting a character is different. For example, if substituting a character is regarded as first deleting a character and then inserting a new character in its place, the cost of a substitution is twice the cost of a deletion or insertion. Therefore ci = cd ≠ cs.
Accordingly, the edit distance similarity of hot words X and Y is calculated as follows:
where |X| and |Y| denote the lengths of hot word X and hot word Y respectively.
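The weighted edit distance described above can be sketched with the standard dynamic-programming recurrence: the distance between X and Y is the minimum of (a) the distance between Head(X) and Head(Y) plus the substitution cost, which is zero when the last characters match, (b) the distance between X and Head(Y) plus ci, and (c) the distance between Head(X) and Y plus cd. The costs ci = cd = 1 and cs = 2 follow the example in the text. The similarity formula is not reproduced here, so the normalization by |X| + |Y| in `edit_similarity` is an assumption made for illustration only, and both function names are hypothetical.

```python
def edit_distance(x, y, c_s=2.0, c_i=1.0, c_d=1.0):
    """Weighted Levenshtein distance between strings x and y."""
    n, m = len(x), len(y)
    # d[i][j] is the minimum cost of turning x[:i] into y[:j].
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + c_d
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + c_i
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0.0 if x[i - 1] == y[j - 1] else c_s
            d[i][j] = min(
                d[i - 1][j - 1] + substitution,  # substitute or keep
                d[i][j - 1] + c_i,               # insert y[j-1]
                d[i - 1][j] + c_d,               # delete x[i-1]
            )
    return d[n][m]


def edit_similarity(x, y):
    """ASSUMPTION: 1 - D(X, Y) / (|X| + |Y|); not the patent's stated formula."""
    total = len(x) + len(y)
    if total == 0:
        return 1.0
    return 1.0 - edit_distance(x, y) / total
```

With cs = 2 and ci = cd = 1 the distance can never exceed |X| + |Y|, so this assumed normalization keeps the similarity within [0, 1].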
Let word X have n senses, denoted Cx1, Cx2, …, Cxn, and word Y have m senses, denoted Cy1, Cy2, …, Cym. The HowNet similarity simc(X, Y) of X and Y is then calculated as follows:
where Sim(Cxi, Cyj) denotes the similarity of the two senses Cxi and Cyj.
The similarity of senses Cxi and Cyj is calculated as follows:
where Simpj(P1, P2) denotes the similarity of two sememes P1 and P2, and βi (1≤i≤4) are adjustable parameters satisfying β1+β2+β3+β4=1 and β1≤β2≤β3≤β4.
The similarity of sememes P1 and P2 is calculated as follows:
Simp(P1, P2) = σ/(d + σ)
where d is the path length between P1 and P2 in the sememe hierarchy, and σ is an adjustable parameter.
Further, edit distance similarity measures the correlation of words at the syntactic level and cannot capture their correlation at the semantic level. In view of this, this embodiment combines the edit distance similarity with the HowNet similarity and proposes a semantic correlation measure, that is, the correlation of every two hot words is calculated from their edit distance similarity and HowNet similarity, using the formula:
sim(X, Y) = α*sime(X, Y) + β*simc(X, Y), α + β = 1
where sim(X, Y) denotes the correlation of word X and word Y; sime(X, Y) denotes the edit distance similarity of word X and word Y; simc(X, Y) denotes the HowNet similarity of word X and word Y; and α and β denote the weights of the edit distance similarity and the HowNet similarity respectively.
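As a small illustration of the two closed-form expressions given above, the following Python sketch evaluates the sememe similarity σ/(d + σ) and the weighted combination of the two similarities. The default values `sigma=1.6` and `alpha=0.5` are assumptions chosen only for the example; the text leaves both as adjustable parameters. The edit distance similarity and HowNet similarity are taken as already-computed inputs.

```python
def sememe_similarity(d, sigma=1.6):
    """Simp(P1, P2) = sigma / (d + sigma), with d the path length between two sememes."""
    if d < 0 or sigma <= 0:
        raise ValueError("d must be non-negative and sigma positive")
    return sigma / (d + sigma)


def correlation(sim_e, sim_c, alpha=0.5):
    """sim(X, Y) = alpha * sim_e + beta * sim_c, with alpha + beta = 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha
    return alpha * sim_e + beta * sim_c
```

A sememe compared with itself (d = 0) gets similarity 1, and the similarity decays toward 0 as the path between the two sememes grows longer.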
In the network hot topic discovery method provided by the present invention, the edit distance similarity and the HowNet similarity between every two hot words are calculated, and the correlation of every two hot words is calculated from them. The method measures the correlation between hot words at both the syntactic level and the semantic level, which helps identify related hot words accurately. Merging related hot words into the same set ensures that the information within a hot topic is highly related. The method also helps discover network hot topics accurately, lets users identify them intuitively, supports a timely understanding of trends in each industry, and is of great significance for monitoring and analyzing online public opinion.
Based on any of the above embodiments, a network hot topic discovery method is provided in which calculating the co-occurrence degree between every two hot word sets, as described in step S4, further comprises:
calculating the co-occurrence degree between every two words drawn from the two hot word sets, comparing these co-occurrence degrees, and taking their maximum value as the co-occurrence degree between the two hot word sets.
Specifically, on the basis of the above technical solution, the co-occurrence degree between every two hot word sets is calculated as follows:
Word co-occurrence means that two words appear together in the same unit window (a paragraph, a sentence, etc.). It is limited to two words and is relatively simple and fixed. This embodiment extends the concept of co-occurrence from two words to two sets and proposes the concept of set co-occurrence degree. The co-occurrence degree of two hot word sets is obtained by calculating the co-occurrence degrees between the words of the two sets.
Let hot word set A contain n semantically related hot words, A = {X1, X2, …, Xn}, where n≥1, and let hot word set B contain m semantically related hot words, B = {Y1, Y2, …, Ym}, where m≥1. The co-occurrence degree C(A, B) of set A and set B is defined as:
C(A, B) = max{C(Xi, Yj)}, i = 1, 2, …, n; j = 1, 2, …, m
where C(Xi, Yj) denotes the co-occurrence degree of hot word Xi in set A and hot word Yj in set B.
The co-occurrence degree C(Xi, Yj) of hot word Xi and hot word Yj is calculated as follows:
where R(Xi|Yj) denotes the relative co-occurrence degree of hot word Xi with respect to hot word Yj, and R(Yj|Xi) denotes the relative co-occurrence degree of hot word Yj with respect to hot word Xi. In general R(Xi|Yj) is not equal to R(Yj|Xi), but C(Xi, Yj) = C(Yj, Xi).
The relative co-occurrence degree R(Xi|Yj) of hot word Xi with respect to hot word Yj is calculated as follows:
where f(Xi, Yj) denotes the number of times hot word Xi and hot word Yj occur together in the same text, and f(Yj) denotes the number of occurrences of hot word Yj.
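The set co-occurrence degree can be sketched in Python as follows. The formulas for R and for C(Xi, Yj) are not reproduced in this text, so the sketch rests on two loudly stated assumptions: R(X|Y) is taken to be f(X, Y)/f(Y), with both counts measured as numbers of texts, and C(X, Y) is taken to be the arithmetic mean of R(X|Y) and R(Y|X), which is symmetric as the text requires. Only the maximum over word pairs in `set_cooccurrence` is stated explicitly in the source. All function names are hypothetical.

```python
def relative_cooccurrence(x, y, docs):
    """ASSUMPTION: R(x|y) = f(x, y) / f(y), counting texts (each text is a set of words)."""
    f_y = sum(1 for doc in docs if y in doc)
    if f_y == 0:
        return 0.0
    f_xy = sum(1 for doc in docs if x in doc and y in doc)
    return f_xy / f_y


def word_cooccurrence(x, y, docs):
    """ASSUMPTION: C(x, y) is the mean of R(x|y) and R(y|x), hence symmetric."""
    return (relative_cooccurrence(x, y, docs) + relative_cooccurrence(y, x, docs)) / 2


def set_cooccurrence(set_a, set_b, docs):
    """C(A, B) = max over all pairs (Xi in A, Yj in B) of C(Xi, Yj)."""
    return max(word_cooccurrence(x, y, docs) for x in set_a for y in set_b)


# Toy corpus: each text is represented by the set of hot words it contains.
docs = [
    {"corn", "price"},
    {"corn", "harvest"},
    {"price"},
    {"corn", "price", "harvest"},
]
degree = set_cooccurrence({"corn"}, {"price", "harvest"}, docs)
```

Taking the maximum means two sets are considered to co-occur as soon as one strongly co-occurring pair links them, which makes merging in step S4 fairly permissive.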
In the network hot topic discovery method provided by the present invention, the co-occurrence degree between every two words drawn from two hot word sets is calculated, these co-occurrence degrees are compared, and their maximum value is taken as the co-occurrence degree between the two hot word sets. This makes it possible to merge co-occurring hot word sets and ensures that the information within a hot topic is highly related. It also helps discover network hot topics accurately, lets users identify them intuitively, supports a timely understanding of trends in each industry, and is of great significance for monitoring and analyzing online public opinion.
Based on any of the above embodiments, a network hot topic discovery method is provided which, after step S2, further comprises: adding the hot word candidates to the word segmentation dictionary.
Specifically, a combined new-word candidate such as "corn price" would be split into the two words "corn" and "price" during word segmentation, so that the word frequency counted for "corn price" would be 0. To avoid this, in this embodiment the hot word candidates are added to the word segmentation dictionary once they have been obtained, so that combined candidates are no longer split.
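The effect of adding a combined candidate to the segmentation dictionary can be illustrated with a toy segmenter. The text does not say which segmenter is used; the sketch below substitutes a simple forward maximum matching segmenter purely for illustration, and the dictionary contents are made up. The Chinese strings correspond to "corn" (玉米), "price" (价格), "rises" (上涨) and "corn price" (玉米价格).

```python
def segment(text, dictionary):
    """Forward maximum matching: take the longest dictionary word at each position."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the dictionary matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


dictionary = {"玉米", "价格", "上涨"}
text = "玉米价格上涨"

before = segment(text, dictionary)   # the combined candidate is split
dictionary.add("玉米价格")            # add the hot word candidate to the dictionary
after = segment(text, dictionary)    # the combined candidate survives as one token

frequency_before = before.count("玉米价格")
frequency_after = after.count("玉米价格")
```

Before the candidate is added, its counted frequency is 0 because it never appears as a single token; after it is added, the same text yields one occurrence.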
In the network hot topic discovery method provided by the present invention, the hot word candidates are added to the word segmentation dictionary once they have been obtained. This effectively prevents combined new words from being split, allows the word frequency of hot word candidates to be counted accurately, and ultimately enables accurate extraction of hot words.
Fig. 2 is an overall structural diagram of a network hot topic discovery system according to an embodiment of the present invention. As shown in Fig. 2, the present invention provides a network hot topic discovery system, comprising:
a hot word candidate extraction module 1, configured to extract hot word candidates from the text corpus of a preset time period based on the information entropy of words;
a hot word extraction module 2, configured to calculate the heat of each hot word candidate according to its single-day word frequency and historical volatility, sort the hot word candidates in descending order of heat, and take the top N hot word candidates as hot words, where N>=1;
a hot word set acquisition module 3, configured to calculate the correlation of every two hot words, merge two hot words into the same set when their correlation exceeds a first preset threshold, and store each of the remaining hot words in a set of its own, to obtain all hot word sets;
a hot word set merging module 4, configured to calculate the co-occurrence degree between every two hot word sets and merge two hot word sets when the co-occurrence degree between them exceeds a second preset threshold, to obtain all merged hot word sets;
a hot topic acquisition module 5, configured to calculate the co-occurrence degree between every two hot words in each merged hot word set, compute the co-occurrence degree of each hot word, sort all hot words in descending order of co-occurrence degree, and take the top M hot words as the title of the hot topic corresponding to each merged hot word set, where M>=1.
Specifically, the present invention provides a network hot topic discovery system comprising a hot word candidate extraction module 1, a hot word extraction module 2, a hot word set acquisition module 3, a hot word set merging module 4 and a hot topic acquisition module 5. These modules together implement the network hot topic discovery method of any of the above embodiments, as follows:
For the hot topics of a given time period, the hot word candidate extraction module 1 obtains the text corpus of that period, for example by collecting online news reports from the period, and then extracts hot word candidates from the obtained corpus. In this embodiment, hot word candidates are extracted from the text corpus based on the information entropy of words: the information entropy of the words in the corpus is calculated first, and the hot word candidates are then determined from it. The information entropy of a word indicates how rich its collocations are. The richer the collocations of a word, the more representative it is and the more likely it is to become a hot word. Hot word candidates in the text corpus can therefore be extracted accurately based on the information entropy of words. Hot word candidates include new words, entity words and non-entity words. They can generally express the content of a text fully and are representative.
Further, hot word candidates receive a certain amount of attention online, but the level of attention differs from one candidate to another, and only candidates that receive high attention can be called hot words. In view of this, on the basis of the above technical solution, the hot word extraction module 2 calculates the heat of each obtained hot word candidate, sorts all candidates in descending order of heat, and finally selects the top N candidates as hot words, where N>=1. That is, the several candidates with the highest heat are selected as hot words; the specific number can be set according to actual needs and is not specifically limited here. Hot words are generally characterized by a high single-day word frequency and large historical volatility. In this embodiment, the hot word extraction module 2 therefore calculates the heat of a hot word candidate from its single-day word frequency and historical volatility, so that hot words can be extracted accurately from the candidates according to their heat. In other embodiments, the heat of a hot word candidate may also be calculated in other ways, which can be chosen according to actual needs and are not specifically limited here.
Further, for the hot words obtained above, the hot word set acquisition module 3 calculates the correlation between every two hot words. The correlation of words comprises syntactic correlation and semantic correlation. Syntactic correlation is reflected in the edit distance similarity between words, and semantic correlation is reflected in the HowNet similarity between words, so the correlation between every two hot words can be obtained by calculating their edit distance similarity and HowNet similarity. On this basis, the hot word set acquisition module 3 compares the correlation between every two hot words with the first preset threshold. When the correlation between two hot words exceeds the first preset threshold, the two hot words are merged into one set. If the correlation between a hot word and every other hot word does not exceed the first preset threshold, that hot word is stored in a set of its own. Once every hot word has been placed in a set, all hot word sets are obtained. The first preset threshold is set in advance, can be chosen according to actual needs, and is not specifically limited here.
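The grouping step performed by the hot word set acquisition module can be sketched with a union-find structure. This is a minimal sketch under an assumption the text does not spell out: merging is treated as transitive, so if A is related to B and B to C, all three end up in one set. The `correlation` argument is any function returning the correlation of two words; the lookup table used here is made-up illustrative data, and the function name is hypothetical.

```python
from itertools import combinations


def group_hot_words(hot_words, correlation, threshold):
    """Merge hot words whose pairwise correlation exceeds the threshold into sets."""
    parent = {word: word for word in hot_words}

    def find(word):
        # Follow parent links to the set representative, compressing the path.
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for x, y in combinations(hot_words, 2):
        if correlation(x, y) > threshold:
            parent[find(x)] = find(y)

    groups = {}
    for word in hot_words:
        groups.setdefault(find(word), set()).add(word)
    return list(groups.values())


# Illustrative correlation table; unlisted pairs are treated as unrelated.
scores = {
    frozenset(("corn", "maize")): 0.9,
    frozenset(("corn", "election")): 0.1,
}


def lookup(x, y):
    return scores.get(frozenset((x, y)), 0.0)


hot_word_sets = group_hot_words(["corn", "maize", "election"], lookup, threshold=0.6)
```

A hot word that is not related to any other word keeps its initial singleton set, matching the rule that such words are stored in a set of their own.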
Further, for all the hot word sets obtained above, the hot word set merging module 4 calculates the co-occurrence degree between every two hot word sets. Specifically, the co-occurrence degree between two hot word sets is obtained by calculating the co-occurrence degrees between the words of the two sets. On this basis, the hot word set merging module 4 compares the co-occurrence degree between every two hot word sets with the second preset threshold, and merges two hot word sets when the co-occurrence degree between them exceeds the second preset threshold. After this processing, all merged hot word sets are obtained, and each final hot word set represents one hot topic. The second preset threshold is set in advance, can be chosen according to actual needs, and is not specifically limited here.
Further, for each merged hot word set, the hot topic acquisition module 5 calculates the co-occurrence degree between every two hot words in the set, computes the co-occurrence degree of each hot word in the set, sorts all hot words in the set in descending order of co-occurrence degree, and finally selects the top M hot words as the title of the hot topic corresponding to that set, where M>=1. That is, for any hot word set, the several hot words with the largest co-occurrence degree in the set are selected as the title of the corresponding hot topic; the number of hot words selected can be set according to actual needs and is not specifically limited here. When a hot topic title contains several hot words, the hot words may be separated by spaces.
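The title selection step can be sketched in Python as follows. The text does not say how the co-occurrence degree of a single hot word is derived from the pairwise values, so this sketch assumes it is the sum of the word's co-occurrence degrees with every other hot word in the same set. The function name `topic_title` and the scores in the usage lines are hypothetical.

```python
from itertools import combinations


def topic_title(hot_word_set, cooccurrence, m=2):
    """Return the top-m hot words of a set, joined by spaces, as the topic title."""
    # ASSUMPTION: a hot word's co-occurrence degree is the sum over its pairs in the set.
    totals = {word: 0.0 for word in hot_word_set}
    for x, y in combinations(sorted(hot_word_set), 2):
        degree = cooccurrence(x, y)
        totals[x] += degree
        totals[y] += degree
    # Sort by descending total; break ties alphabetically for a stable result.
    ranked = sorted(totals, key=lambda word: (-totals[word], word))
    return " ".join(ranked[:m])


# Illustrative pairwise co-occurrence degrees.
pair_scores = {
    frozenset(("corn", "price")): 0.8,
    frozenset(("corn", "harvest")): 0.5,
    frozenset(("harvest", "price")): 0.2,
}


def pair_lookup(x, y):
    return pair_scores.get(frozenset((x, y)), 0.0)


title = topic_title({"corn", "price", "harvest"}, pair_lookup, m=2)
```

Here "corn" accumulates the largest total, followed by "price", so the two words form the title in that order.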
The network hot topic discovery system provided by the present invention extracts hot word candidates from the text corpus of a preset time period based on the information entropy of words; calculates the heat of each hot word candidate from its single-day word frequency and historical volatility, and selects the several candidates with the highest heat as hot words; then calculates the correlation of every two hot words, merges two hot words whose correlation exceeds the first preset threshold into the same set, and stores each of the remaining hot words in a set of its own, to obtain all hot word sets; then calculates the co-occurrence degree between every two hot word sets and merges two hot word sets whose co-occurrence degree exceeds the second preset threshold; and finally calculates the co-occurrence degree between every two hot words in each hot word set, computes the co-occurrence degree of each hot word in each set, and takes the several hot words with the highest co-occurrence degree as the title of the hot topic corresponding to each hot word set, thereby discovering network hot topics. The system discovers network hot topics from the perspective of hot words, which ensures that the discovered topics are genuinely hot and better matches the definition of a hot topic. By measuring hot word correlation and calculating the co-occurrence degree of hot word sets, it merges related hot words and co-occurring hot word sets, which ensures that the information within a hot topic is highly related. It also helps discover network hot topics accurately, lets users identify them intuitively, supports a timely understanding of trends in each industry, and is of great significance for monitoring and analyzing online public opinion.
Fig. 3 is a structural block diagram of a device for the network hot topic discovery method according to an embodiment of the present invention. Referring to Fig. 3, the device comprises a processor 31, a memory 32 and a bus 33, where the processor 31 and the memory 32 communicate with each other through the bus 33. The processor 31 is configured to call program instructions in the memory 32 to execute the methods provided by the above method embodiments, for example: extracting hot word candidates from the text corpus of a preset time period based on the information entropy of words; calculating the heat of each hot word candidate according to its single-day word frequency and historical volatility, sorting the hot word candidates in descending order of heat, and taking the top N hot word candidates as hot words, where N>=1; calculating the correlation of every two hot words, merging two hot words into the same set when their correlation exceeds the first preset threshold, and storing each of the remaining hot words in a set of its own, to obtain all hot word sets; calculating the co-occurrence degree between every two hot word sets and merging two hot word sets when the co-occurrence degree between them exceeds the second preset threshold, to obtain all merged hot word sets; and calculating the co-occurrence degree between every two hot words in each merged hot word set, computing the co-occurrence degree of each hot word, sorting all hot words in descending order of co-occurrence degree, and taking the top M hot words as the title of the hot topic corresponding to each merged hot word set, where M>=1.
This embodiment provides a non-transitory computer-readable storage medium that stores computer instructions, the computer instructions causing a computer to execute the methods provided by the above method embodiments, for example: extracting hot word candidates from the text corpus of a preset time period based on the information entropy of words; calculating the heat of each hot word candidate according to its single-day word frequency and historical volatility, sorting the hot word candidates in descending order of heat, and taking the top N hot word candidates as hot words, where N>=1; calculating the correlation of every two hot words, merging two hot words into the same set when their correlation exceeds the first preset threshold, and storing each of the remaining hot words in a set of its own, to obtain all hot word sets; calculating the co-occurrence degree between every two hot word sets and merging two hot word sets when the co-occurrence degree between them exceeds the second preset threshold, to obtain all merged hot word sets; and calculating the co-occurrence degree between every two hot words in each merged hot word set, computing the co-occurrence degree of each hot word, sorting all hot words in descending order of co-occurrence degree, and taking the top M hot words as the title of the hot topic corresponding to each merged hot word set, where M>=1.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware driven by program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes ROM, RAM, magnetic disks, optical discs and other media capable of storing program code.
The embodiments described above, such as the device for the network hot topic discovery method, are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement it without creative effort.
From the above description of the embodiments, a person skilled in the art can clearly understand that each embodiment may be implemented by software plus the necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the above technical solution, or the part of it that contributes to the prior art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments or in certain parts of the embodiments.
Finally, the methods of the present application are only preferred embodiments and are not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (10)
1. A network hot topic discovery method, characterized by comprising:
S1, extracting hot word candidates from the text corpus of a preset time period based on the information entropy of words;
S2, calculating the heat of each hot word candidate according to its single-day word frequency and historical volatility, sorting the hot word candidates in descending order of heat, and taking the top N hot word candidates as hot words, where N>=1;
S3, calculating the correlation of every two hot words, merging two hot words into the same set when their correlation exceeds a first preset threshold, and storing each of the remaining hot words in a set of its own, to obtain all hot word sets;
S4, calculating the co-occurrence degree between every two hot word sets, and merging two hot word sets when the co-occurrence degree between them exceeds a second preset threshold, to obtain all merged hot word sets;
S5, calculating the co-occurrence degree between every two hot words in each merged hot word set, computing the co-occurrence degree of each hot word, sorting all hot words in descending order of co-occurrence degree, and taking the top M hot words as the title of the hot topic corresponding to each merged hot word set, where M>=1.
2. The method according to claim 1, characterized in that step S1 further comprises:
S11, obtaining the text corpus of the preset time period, and calculating the left information entropy and right information entropy of each word in the text corpus using the information entropy formula;
S12, comparing the left information entropy and the right information entropy with a preset minimum threshold and a preset maximum threshold respectively, and, when r1<H(lw)<r2 and r1<H(rw)<r2 are satisfied, calculating the richness of the word using the formula:
R = H(lw)*H(rw)
where H(lw) and H(rw) are the left information entropy and right information entropy of word w respectively; r1 and r2 are the preset minimum threshold and maximum threshold respectively; and R is the richness of word w;
S13, sorting the words in descending order of richness, and taking the top K words as the hot word candidates, where K>=1.
3. The method according to claim 2, characterized in that step S12 further comprises:
for two adjacent words $w$ and $w_1$, when $H(lw) > r_2$, $H(rw) < r_1$, $H(lw_1) < r_1$ and $H(rw_1) > r_2$ are all satisfied, merging the words $w$ and $w_1$ into one new word; where $H(lw)$ and $H(rw)$ are the left information entropy and the right information entropy of word $w$, respectively, and $H(lw_1)$ and $H(rw_1)$ are the left information entropy and the right information entropy of word $w_1$, respectively;
correspondingly, step S13 further comprises: taking the new word as a hot word candidate.
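The merging rule of claim 3 can be sketched as follows. The intuition is that a word with a varied left context but a fixed right context, followed by a word with a fixed left context but a varied right context, likely forms one longer word. The function name, the dictionary inputs `h_left` and `h_right` (precomputed entropies per word), and joining the two words by plain concatenation are assumptions made for illustration.

```python
def merge_adjacent_words(tokens, h_left, h_right, r1, r2):
    """Join neighbors w, w1 when H(lw) > r2, H(rw) < r1, H(lw1) < r1 and H(rw1) > r2."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            w, w1 = tokens[i], tokens[i + 1]
            # Missing entropies default to 0.0, which can never satisfy "> r2".
            if (h_left.get(w, 0.0) > r2 and h_right.get(w, 0.0) < r1
                    and h_left.get(w1, 0.0) < r1 and h_right.get(w1, 0.0) > r2):
                merged.append(w + w1)  # assumed: the new word is the concatenation
                i += 2
                continue
        merged.append(tokens[i])
        i += 1
    return merged
```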
4. The method according to claim 1, characterized in that calculating the heat of the hot word candidates according to their single-day word frequency and historical fluctuation in step S2 further comprises:
calculating a base weight of each hot word candidate according to its single-day word frequency, and calculating a fluctuation weight of the hot word candidate according to its historical fluctuation;
calculating the heat of the hot word candidate from the base weight and the fluctuation weight as:
$$H = 0.5\,B + 0.5\,F$$
where $B$ is the base weight of the hot word candidate, $F$ is its fluctuation weight, and $H$ is its heat.
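The heat formula and the subsequent top-N selection can be sketched as follows. The claim does not say how the base weight B and the fluctuation weight F are derived from the raw word frequency and its history, so this example simply takes both as precomputed inputs; the function names and the `{word: (B, F)}` layout are assumptions.

```python
def heat(base_weight, fluctuation_weight):
    """H = 0.5 * B + 0.5 * F, the equal-weight combination given in the claim."""
    return 0.5 * base_weight + 0.5 * fluctuation_weight


def top_n_hot_words(weights, n):
    """Rank candidates by heat (descending) and keep the top N.

    weights maps each hot word candidate to a (B, F) pair.
    """
    scored = {w: heat(b, f) for w, (b, f) in weights.items()}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scored, key=lambda w: (-scored[w], w))[:n]
```

Because B and F are weighted equally, a candidate with a modest frequency but a sharp recent rise can outrank a word that is frequent every day.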
5. The method according to claim 1, characterized in that calculating the correlation of every two hot words in step S3 further comprises:
calculating the edit distance similarity and the HowNet similarity of every two hot words;
calculating the correlation of every two hot words from the edit distance similarity and the HowNet similarity as:
$$\mathrm{sim}(X, Y) = \alpha \cdot \mathrm{sim}_e(X, Y) + \beta \cdot \mathrm{sim}_c(X, Y), \quad \alpha + \beta = 1$$
where $\mathrm{sim}(X, Y)$ denotes the correlation of word $X$ and word $Y$; $\mathrm{sim}_e(X, Y)$ denotes the edit distance similarity of word $X$ and word $Y$; $\mathrm{sim}_c(X, Y)$ denotes the HowNet similarity of word $X$ and word $Y$; and $\alpha$ and $\beta$ denote the weights of the edit distance similarity and the HowNet similarity, respectively.
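The weighted combination in claim 5 can be sketched as follows. The claim does not state how edit distance is turned into a similarity, so this example assumes the common normalization 1 - distance / max(length). The HowNet similarity depends on an external lexical resource and is therefore passed in as a hypothetical callable, `hownet_similarity`, rather than implemented here.

```python
def edit_distance(x, y):
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(x, y):
    """Assumed normalization: 1 - distance / max(len), so identical words score 1."""
    longest = max(len(x), len(y))
    return 1.0 if longest == 0 else 1.0 - edit_distance(x, y) / longest


def correlation(x, y, hownet_similarity, alpha=0.5):
    """sim(X, Y) = alpha * sim_e(X, Y) + beta * sim_c(X, Y) with alpha + beta = 1."""
    beta = 1.0 - alpha
    return alpha * edit_similarity(x, y) + beta * hownet_similarity(x, y)
```

Combining a surface measure with a semantic one lets the correlation catch both spelling variants of the same term and differently spelled words with related meanings; the default `alpha=0.5` is an arbitrary choice for the sketch.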
6. The method according to claim 1, characterized in that calculating the co-occurrence degree between every two hot word sets in step S4 further comprises:
calculating the co-occurrence degree between every pair of words drawn one from each of the two hot word sets; comparing these pairwise co-occurrence degrees, and taking their maximum value as the co-occurrence degree between the two hot word sets.
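The set-level co-occurrence degree of claim 6, together with the merging of step S4, can be sketched as follows. The pairwise `cooccurrence` callable is a hypothetical stand-in, since the claims in this section do not define how word-level co-occurrence is measured. Repeating the merge until no pair of sets exceeds the threshold is also an assumption; the claim only states that two such sets are merged.

```python
from itertools import product


def set_cooccurrence(set_a, set_b, cooccurrence):
    """Co-occurrence of two non-empty sets: the maximum over all cross-set word pairs."""
    return max(cooccurrence(a, b) for a, b in product(set_a, set_b))


def merge_sets(word_sets, cooccurrence, threshold):
    """S4 sketch: keep merging any two sets whose co-occurrence exceeds the threshold."""
    sets = [set(s) for s in word_sets]
    merged = True
    while merged:
        merged = False
        for i in range(len(sets)):
            for j in range(i + 1, len(sets)):
                if set_cooccurrence(sets[i], sets[j], cooccurrence) > threshold:
                    sets[i] |= sets.pop(j)  # absorb set j into set i
                    merged = True
                    break
            if merged:
                break  # restart the scan, since the set list has changed
    return sets
```

Taking the maximum means a single strongly co-occurring word pair is enough to link two sets, which favors broader topics over many small ones.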
7. The method according to claim 1, characterized in that, after step S2, the method further comprises: adding the hot word candidates to the word segmentation dictionary.
8. A network hot topic discovery system, characterized by comprising:
a hot word candidate extraction module, configured to extract hot word candidates from the text corpus of a preset time period based on the information entropy of words;
a hot word extraction module, configured to calculate the heat of each hot word candidate according to its single-day word frequency and historical fluctuation, sort the hot word candidates in descending order of heat, and take the top N hot word candidates as hot words, where N >= 1;
a hot word set acquisition module, configured to calculate the correlation of every two hot words, and, when the correlation of two hot words exceeds a first preset threshold, merge the two hot words into the same set and store each remaining hot word alone in its own set, yielding all hot word sets;
a hot word set merging module, configured to calculate the co-occurrence degree between every two hot word sets, and, when the co-occurrence degree between two hot word sets exceeds a second preset threshold, merge the two hot word sets to obtain all merged hot word sets;
a hot topic acquisition module, configured to calculate the co-occurrence degree between every two hot words in each merged hot word set, total the co-occurrence degree of each hot word, sort all hot words in descending order of co-occurrence degree, and take the top M hot words as the title of the hot topic corresponding to each merged hot word set, where M >= 1.
9. A device for the network hot topic discovery method, characterized by comprising:
at least one processor; and
at least one memory communicatively connected to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor, by calling the program instructions, is able to perform the method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions that cause a computer to perform the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810136641.8A CN108509490B (en) | 2018-02-09 | 2018-02-09 | Network hot topic discovery method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108509490A (en) | 2018-09-07 |
CN108509490B (en) | 2020-10-02 |
Family
ID=63375282
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810136641.8A Active CN108509490B (en) | 2018-02-09 | 2018-02-09 | Network hot topic discovery method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108509490B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5467425A (en) * | 1993-02-26 | 1995-11-14 | International Business Machines Corporation | Building scalable N-gram language models using maximum likelihood maximum entropy N-gram models |
CN104679738A (en) * | 2013-11-27 | 2015-06-03 | 北京拓尔思信息技术股份有限公司 | Method and device for mining Internet hot words |
CN106095736A (en) * | 2016-06-07 | 2016-11-09 | 华东师范大学 | A kind of method of field neologisms extraction |
CN106528755A (en) * | 2016-10-28 | 2017-03-22 | 东软集团股份有限公司 | Hot topic generation method and device |
CN107330022A (en) * | 2017-06-21 | 2017-11-07 | 腾讯科技(深圳)有限公司 | A kind of method and device for obtaining much-talked-about topic |
CN107423444A (en) * | 2017-08-10 | 2017-12-01 | 世纪龙信息网络有限责任公司 | Hot word phrase extracting method and system |
Non-Patent Citations (3)
Title |
---|
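Li Yuqin et al.: "Hot word analysis technology for Internet public opinion", Journal of Chinese Information Processing * |
Duan Qingling et al.: "Automatic hot word extraction method based on agricultural web information classification", Transactions of the Chinese Society for Agricultural Machinery * |
Hao Xiaoling et al.: "Research on microblog hot word extraction and topic discovery", Journal of Intelligence * |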
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109271639A (en) * | 2018-10-11 | 2019-01-25 | 南京中孚信息技术有限公司 | Hot event discovery method and device |
CN109271639B (en) * | 2018-10-11 | 2021-03-05 | 南京中孚信息技术有限公司 | Hot event discovery method and device |
CN111125484A (en) * | 2019-12-17 | 2020-05-08 | 网易(杭州)网络有限公司 | Topic discovery method and system and electronic device |
CN111125484B (en) * | 2019-12-17 | 2023-06-30 | 网易(杭州)网络有限公司 | Topic discovery method, topic discovery system and electronic equipment |
CN113626722A (en) * | 2020-05-08 | 2021-11-09 | 国家广播电视总局广播电视科学研究院 | Public opinion guiding method, device, equipment and computer readable storage medium |
CN114938477A (en) * | 2022-06-23 | 2022-08-23 | 阿里巴巴(中国)有限公司 | Video topic determination method, device and equipment |
CN114938477B (en) * | 2022-06-23 | 2024-05-03 | 阿里巴巴(中国)有限公司 | Video topic determination method, device and equipment |
CN117034904A (en) * | 2023-10-09 | 2023-11-10 | 北京睿企信息科技有限公司 | Method for obtaining hot words with stable heat, electronic equipment and storage medium |
CN117034904B (en) * | 2023-10-09 | 2023-12-08 | 北京睿企信息科技有限公司 | Method for obtaining hot words with stable heat, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108509490B (en) | 2020-10-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||