CN108509490A

CN108509490A - Method and system for discovering network hot topics

Info

Publication number: CN108509490A
Application number: CN201810136641.8A
Authority: CN
Inventors: 段青玲; 李道亮; 张璐; 刘怡然; 曹新凯; 王凯
Original assignee: China Agricultural University
Current assignee: China Agricultural University
Priority date: 2018-02-09
Filing date: 2018-02-09
Publication date: 2018-09-07
Anticipated expiration: 2038-02-09
Also published as: CN108509490B

Abstract

The present invention provides a method and system for discovering network hot topics. The method comprises: extracting hot words from a text corpus of a preset time period; when the correlation between two hot words exceeds a first preset threshold, merging the two hot words into the same set; calculating the co-occurrence degree between every two hot word sets and, when the co-occurrence degree between two hot word sets exceeds a second preset threshold, merging the two hot word sets; and calculating the co-occurrence degree between every two hot words in each hot word set, tallying the co-occurrence degree of each hot word, and selecting the several hot words ranked highest by co-occurrence degree as the title of the corresponding hot topic. The method and system discover network hot topics from the perspective of hot words, which better matches the definition of a hot topic. Merging related hot words and co-occurring hot word sets keeps the information inside each hot topic highly related, enables accurate discovery of network hot topics, and helps users identify them intuitively.

Description

Method and system for discovering network hot topics

Technical field

The present invention relates to the technical field of text information processing, and more particularly to a method and system for discovering network hot topics.

Background art

With the growing popularity of the Internet, online resources are increasing exponentially, and traditional manual information processing can no longer keep up with large-scale information acquisition. New information technology is therefore needed to monitor and analyze online public opinion and to meet the information needs of users in every industry. Discovering network hot topics makes it possible to follow developments in each industry in a timely manner and to monitor each industry's online information in real time. Tracking hot topics helps the relevant departments keep abreast of trends of thought on the network, guide and analyze public opinion correctly, and maintain social stability.

A topic is centered on one event and includes multiple events related to that theme. A hot topic is an event that, within the same time window, receives heavy news coverage, a large amount of user discussion, and wide dissemination. The title of a hot topic is usually expressed by several semantically related words or a phrase, from which the main content of the topic can be understood fairly completely; examples are the hot topics "2017 Potato Conference" and "'Internet+' modern agriculture". Traditional hot topic discovery methods cluster texts; according to what is clustered, they fall into word-based clustering, content-based clustering, and information-based clustering. Different clustering algorithms differ in effectiveness, but clustering-based methods do not help users identify hot topics intuitively.

In view of this, there is an urgent need for a method and system for discovering network hot topics that helps users identify hot topics intuitively while supporting a timely understanding of developments in each industry.

Summary of the invention

To overcome the problems that prior-art hot topic discovery methods have low accuracy and do not help users identify hot topics intuitively, the present invention provides a method and system for discovering network hot topics.

In one aspect, the present invention provides a method for discovering network hot topics, comprising:

S1: extracting hot word candidates from a text corpus of a preset time period based on the information entropy of words;

S2: calculating the heat of each hot word candidate from its single-day word frequency and historical volatility, sorting the hot word candidates in descending order of heat, and taking the top N hot word candidates as hot words, where N>=1;

S3: calculating the correlation between every two hot words; when the correlation between two hot words exceeds a first preset threshold, merging the two hot words into the same set, and storing each remaining hot word in a set of its own, to obtain all hot word sets;

S4: calculating the co-occurrence degree between every two hot word sets; when the co-occurrence degree between two hot word sets exceeds a second preset threshold, merging the two hot word sets, to obtain all merged hot word sets;

S5: calculating the co-occurrence degree between every two hot words in each merged hot word set, tallying the co-occurrence degree of each hot word, sorting all hot words in descending order of co-occurrence degree, and taking the top M hot words as the title of the hot topic corresponding to each merged hot word set, where M>=1.

Preferably, step S1 further comprises:

S11: obtaining the text corpus of the preset time period, and calculating the left information entropy and the right information entropy of each word in the text corpus using the information entropy formula;

S12: comparing the left information entropy and the right information entropy with a preset minimum threshold and a preset maximum threshold, respectively, and, when r₁<H(lw)<r₂ and r₁<H(rw)<r₂, calculating the richness of the word as:

R=H(lw)*H(rw)

where H(lw) and H(rw) are the left information entropy and the right information entropy of word w, respectively; r₁ and r₂ are the preset minimum threshold and maximum threshold, respectively; and R is the richness of word w;

S13: sorting the words in descending order of richness and taking the top K words as the hot word candidates, where K>=1.

Preferably, step S12 further comprises:

for two adjacent words w and w₁, when H(lw)>r₂, H(rw)<r₁, H(lw₁)<r₁ and H(rw₁)>r₂, merging the words w and w₁ into one new word, where H(lw) and H(rw) are the left information entropy and the right information entropy of word w, and H(lw₁) and H(rw₁) are the left information entropy and the right information entropy of word w₁;

correspondingly, step S13 further comprises: taking the new word as a hot word candidate.

Preferably, calculating the heat of the hot word candidate from its single-day word frequency and historical volatility in step S2 further comprises:

calculating a base weight of the hot word candidate from its single-day word frequency, and calculating a fluctuation weight of the hot word candidate from its historical volatility;

calculating the heat of the hot word candidate from the base weight and the fluctuation weight as:

H=B*0.5+F*0.5

where B is the base weight of the hot word candidate, F is the fluctuation weight of the hot word candidate, and H is the heat of the hot word candidate.

Preferably, calculating the correlation between every two hot words in step S3 further comprises:

calculating the edit-distance similarity and the HowNet similarity of every two hot words;

calculating the correlation between every two hot words from the edit-distance similarity and the HowNet similarity as:

sim(X, Y)=α*sim_e(X, Y)+β*sim_c(X, Y), α+β=1

where sim(X, Y) is the correlation between word X and word Y; sim_e(X, Y) is the edit-distance similarity of word X and word Y; sim_c(X, Y) is the HowNet similarity of word X and word Y; and α and β are the weights of the edit-distance similarity and the HowNet similarity, respectively.

Preferably, calculating the co-occurrence degree between every two hot word sets in step S4 further comprises:

calculating the co-occurrence degree between every two words of the two hot word sets, comparing these co-occurrence degrees, and taking their maximum as the co-occurrence degree between the two hot word sets.

Preferably, the method further comprises, after step S2: adding the hot word candidates to a word segmentation dictionary.

In another aspect, the present invention provides a system for discovering network hot topics, comprising:

a hot word candidate extraction module, configured to extract hot word candidates from a text corpus of a preset time period based on the information entropy of words;

a hot word extraction module, configured to calculate the heat of each hot word candidate from its single-day word frequency and historical volatility, sort the hot word candidates in descending order of heat, and take the top N hot word candidates as hot words, where N>=1;

a hot word set acquisition module, configured to calculate the correlation between every two hot words, merge two hot words into the same set when their correlation exceeds a first preset threshold, and store each remaining hot word in a set of its own, to obtain all hot word sets;

a hot word set merging module, configured to calculate the co-occurrence degree between every two hot word sets and, when the co-occurrence degree between two hot word sets exceeds a second preset threshold, merge the two hot word sets, to obtain all merged hot word sets;

a hot topic acquisition module, configured to calculate the co-occurrence degree between every two hot words in each merged hot word set, tally the co-occurrence degree of each hot word, sort all hot words in descending order of co-occurrence degree, and take the top M hot words as the title of the hot topic corresponding to each merged hot word set, where M>=1.

In another aspect, the present invention provides a device for the network hot topic discovery method, comprising:

at least one processor; and

at least one memory communicatively connected to the processor, wherein:

the memory stores program instructions executable by the processor, and the processor, by calling the program instructions, is able to perform any of the methods described above.

In another aspect, the present invention provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method according to any of the above.

In the method and system for discovering network hot topics provided by the present invention, hot word candidates are extracted from the text corpus of a preset time period based on the information entropy of words; the heat of each candidate is calculated from its single-day word frequency and historical volatility, and the several candidates with the highest heat are taken as hot words; the correlation between every two hot words is then calculated, two hot words whose correlation exceeds the first preset threshold are merged into the same set, and each remaining hot word is stored in a set of its own, giving all hot word sets; the co-occurrence degree between every two hot word sets is then calculated, and two sets whose co-occurrence degree exceeds the second preset threshold are merged; finally, the co-occurrence degree between every two hot words in each hot word set is calculated, the co-occurrence degree of each hot word in the set is tallied, and the several hot words with the highest co-occurrence degree are taken as the title of the hot topic corresponding to that set, which completes the discovery of network hot topics. Because the method discovers hot topics from the perspective of hot words, the topics found have high heat, which better matches the definition of a hot topic. Measuring hot word correlation and calculating the co-occurrence degree of hot word sets, and merging related hot words and co-occurring sets accordingly, keeps the information inside each hot topic highly related and enables accurate discovery of network hot topics. This helps users identify network hot topics intuitively, supports a timely understanding of developments in each industry, and is of great significance for monitoring and analyzing online public opinion.

Description of the drawings

Fig. 1 is an overall flow diagram of a method for discovering network hot topics according to an embodiment of the present invention;

Fig. 2 is an overall structural diagram of a system for discovering network hot topics according to an embodiment of the present invention;

Fig. 3 is a structural block diagram of a device for the network hot topic discovery method according to an embodiment of the present invention.

Detailed description of the embodiments

The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples illustrate the present invention and are not intended to limit its scope.

Fig. 1 is an overall flow diagram of a method for discovering network hot topics according to an embodiment of the present invention. As shown in Fig. 1, the present invention provides a method for discovering network hot topics, comprising:

Specifically, to find the hot topics of a given time period, the text corpus of that period is obtained, for example by searching for online news reports from that period, and hot word candidates are then extracted from the obtained corpus. In this embodiment, hot word candidates are extracted based on the information entropy of words: the information entropy of each word in the corpus is calculated first, and the hot word candidates are then determined from it. The information entropy of a word indicates how rich its collocations are; the richer a word's collocations, the more representative the word is and the more likely it is to become a hot word. Extraction based on the information entropy of words can therefore pick out the hot word candidates in the corpus accurately. Hot word candidates include new words, entity words, and non-entity words; they generally express the content of the text well and are representative.

Further, hot word candidates all attract some attention online, but the level of attention differs from candidate to candidate, and only candidates with high attention can be called hot words. In view of this, on the basis of the above technical solution, the heat of each obtained hot word candidate is calculated, all candidates are sorted in descending order of heat, and the top N candidates are taken as hot words, where N>=1. That is, the several candidates with the highest heat are chosen as hot words; the exact number can be set according to actual needs and is not specifically limited here. In general, a hot word has a high single-day word frequency and a large historical volatility. In this embodiment, therefore, the heat of a hot word candidate is calculated from its single-day word frequency and historical volatility, so that hot words can be extracted accurately from the candidates according to heat. In other embodiments the heat may be calculated in other ways, chosen according to actual needs, and this is not specifically limited here.

Further, for the hot words obtained above, the correlation between every two hot words is calculated. The correlation of words comprises syntactic correlation and semantic correlation: syntactic correlation is reflected in the edit-distance similarity between words, and semantic correlation in their HowNet similarity, so the correlation between every two hot words can be obtained by calculating their edit-distance similarity and HowNet similarity. On this basis, the correlation between every two hot words is compared with the first preset threshold. When the correlation between two hot words exceeds the first preset threshold, the two hot words are merged into one set; if the correlation between a hot word and every other hot word does not exceed the first preset threshold, that hot word is stored in a set of its own. Once all hot words have been stored in sets, all hot word sets are obtained. The first preset threshold is set in advance according to actual needs and is not specifically limited here.
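The grouping step above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the pairwise similarity function `sim` is supplied by the caller, the weight `alpha` is a free parameter (the text only requires α+β=1), and treating the merges as transitive through a union-find structure is an assumption made here for simplicity.

```python
from itertools import combinations


def correlation(sim_edit, sim_hownet, alpha):
    """Weighted sum alpha*sim_e + beta*sim_c with beta = 1 - alpha."""
    return alpha * sim_edit + (1 - alpha) * sim_hownet


def group_hot_words(words, sim, threshold):
    """Put words whose pairwise correlation exceeds `threshold` in one set.

    `sim(x, y)` is a caller-supplied correlation function (hypothetical here).
    Merging is transitive, which is an assumption of this sketch.
    """
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the representative, compressing the path.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for x, y in combinations(words, 2):
        if sim(x, y) > threshold:
            parent[find(x)] = find(y)

    groups = {}
    for w in words:
        groups.setdefault(find(w), set()).add(w)
    return list(groups.values())
```

A hot word that is not correlated with any other word above the threshold ends up alone in its own set, as the text requires.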

Further, for all hot word sets obtained above, the co-occurrence degree between every two hot word sets is calculated; specifically, it is obtained from the co-occurrence degrees between the words of the two sets. On this basis, the co-occurrence degree between every two hot word sets is compared with the second preset threshold, and when it exceeds the second preset threshold the two hot word sets are merged. After this processing, all merged hot word sets are obtained, and each resulting hot word set represents one hot topic. The second preset threshold is set in advance according to actual needs and is not specifically limited here.
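A sketch of the set-merging step follows. The set-level co-occurrence degree is taken as the maximum word-level co-occurrence across the two sets, as stated for step S4. The word-level measure itself is not defined in this part of the text, so `word_cooccurrence` below uses a document-level Jaccard ratio purely as a hypothetical stand-in; repeating the merge until no pair exceeds the threshold is likewise an assumption.

```python
def word_cooccurrence(x, y, docs):
    """Hypothetical word-level measure: share of documents containing both
    words among documents containing either. `docs` is a list of word sets."""
    has_x = {i for i, d in enumerate(docs) if x in d}
    has_y = {i for i, d in enumerate(docs) if y in d}
    union = has_x | has_y
    return len(has_x & has_y) / len(union) if union else 0.0


def set_cooccurrence(a, b, cooc):
    """Set-level degree: maximum word-level degree over pairs across a and b."""
    return max(cooc(x, y) for x in a for y in b)


def merge_sets(sets, cooc, threshold):
    """Merge non-empty hot word sets whose co-occurrence degree exceeds
    `threshold`, restarting the scan after every merge until stable."""
    sets = [set(s) for s in sets]
    merged = True
    while merged:
        merged = False
        for i in range(len(sets)):
            for j in range(i + 1, len(sets)):
                if set_cooccurrence(sets[i], sets[j], cooc) > threshold:
                    sets[i].update(sets.pop(j))
                    merged = True
                    break
            if merged:
                break
    return sets
```

Each set that remains after merging stands for one hot topic.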

Further, for each merged hot word set, the co-occurrence degree between every two hot words in the set is calculated, the co-occurrence degree of each hot word in the set is tallied, all hot words in the set are sorted in descending order of co-occurrence degree, and the top M hot words are taken as the title of the hot topic corresponding to that set, where M>=1. That is, for any hot word set, the several hot words with the largest co-occurrence degree in the set are chosen as the title of the corresponding hot topic; the number of hot words chosen can be set according to actual needs and is not specifically limited here. When a hot topic title contains several hot words, they can be separated by spaces.
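The title selection can be sketched as below. The text does not spell out how the per-word co-occurrence degree is tallied; summing each word's pairwise degrees with the other words in the set is an assumption of this sketch, and `cooc` is a caller-supplied pairwise function (hypothetical).

```python
def topic_title(hot_words, cooc, m):
    """Return the top-m hot words of one set, joined by spaces.

    Each word's score is the sum of its pairwise co-occurrence degrees with
    the other words in the set (an assumed reading of "tallying").
    """
    totals = {
        w: sum(cooc(w, other) for other in hot_words if other != w)
        for w in hot_words
    }
    # Descending by score; ties are broken alphabetically for determinism.
    ranked = sorted(hot_words, key=lambda w: (-totals[w], w))
    return " ".join(ranked[:m])
```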

In the method for discovering network hot topics provided by the present invention, hot word candidates are extracted from the text corpus of a preset time period based on the information entropy of words; the heat of each candidate is calculated from its single-day word frequency and historical volatility, and the several candidates with the highest heat are taken as hot words; the correlation between every two hot words is then calculated, two hot words whose correlation exceeds the first preset threshold are merged into the same set, and each remaining hot word is stored in a set of its own, giving all hot word sets; the co-occurrence degree between every two hot word sets is then calculated, and two sets whose co-occurrence degree exceeds the second preset threshold are merged; finally, the co-occurrence degree between every two hot words in each hot word set is calculated, the co-occurrence degree of each hot word in the set is tallied, and the several hot words with the highest co-occurrence degree are taken as the title of the hot topic corresponding to that set, which completes the discovery of network hot topics. Because the method discovers hot topics from the perspective of hot words, the topics found have high heat, which better matches the definition of a hot topic. Measuring hot word correlation and calculating the co-occurrence degree of hot word sets, and merging related hot words and co-occurring sets accordingly, keeps the information inside each hot topic highly related and enables accurate discovery of network hot topics. This helps users identify network hot topics intuitively, supports a timely understanding of developments in each industry, and is of great significance for monitoring and analyzing online public opinion.

Based on any of the above embodiments, a method for discovering network hot topics is provided, in which step S1 further comprises:

R=H (lw) * H (rw)

Specifically, on the basis of the above technical solution, extracting hot word candidates from the text corpus of the preset time period based on the information entropy of words is implemented as follows:

To find the hot topics of a given time period, the text corpus of that period is obtained, for example by searching for online news reports from that period. For a word w in the corpus, the information entropy H(w) of word w is first calculated with the following formula:

H(w)=-∑p(x)log p(x)

where p(x) is the probability that character x occurs among all characters;

On the basis of the above technical solution, the left information entropy and the right information entropy of word w are calculated using the information entropy formula, as follows:

Let $L=\{(l_1,c_{l_1}),(l_2,c_{l_2}),\dots,(l_n,c_{l_n})\}$ be the left-neighbor set of word w and $R=\{(r_1,c_{r_1}),(r_2,c_{r_2}),\dots,(r_m,c_{r_m})\}$ be the right-neighbor set of word w, where $l_i$ and $r_j$ are the characters adjacent to word w on the left and right, respectively, and $c_{l_i}$ and $c_{r_j}$ are the numbers of times those adjacent characters occur. The left information entropy H(lw) and the right information entropy H(rw) of word w are then defined respectively as:
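A form consistent with the entropy formula H(w) above, assuming each neighbor's probability is estimated by its relative frequency among all neighbors on the same side (an assumption, not a formula confirmed by this text), would be:

```latex
H(lw) = -\sum_{i=1}^{n} \frac{c_{l_i}}{\sum_{k=1}^{n} c_{l_k}}
         \log \frac{c_{l_i}}{\sum_{k=1}^{n} c_{l_k}},
\qquad
H(rw) = -\sum_{j=1}^{m} \frac{c_{r_j}}{\sum_{k=1}^{m} c_{r_k}}
         \log \frac{c_{r_j}}{\sum_{k=1}^{m} c_{r_k}}
```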

Further, the left information entropy H(lw) and the right information entropy H(rw) of word w obtained above are each compared with the preset minimum threshold r₁ and maximum threshold r₂. When r₁<H(lw)<r₂ and r₁<H(rw)<r₂, the richness R of the collocations of word w is calculated with the following formula:

R=H (lw) * H (rw)

The words are sorted in descending order of the calculated R value, and the top K words are selected as hot word candidates, where K>=1; K can be set according to actual needs and is not specifically limited here.
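The candidate extraction described above can be sketched as follows. This is an illustration under stated assumptions: the corpus is a flat list of tokens, neighbors are taken at token level rather than character level, neighbor probabilities are relative frequencies, and the logarithm is natural (the text does not fix the base). All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def neighbor_counts(tokens):
    """Count, for every token, which tokens appear directly to its left and right."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        right[prev][cur] += 1
        left[cur][prev] += 1
    return left, right


def entropy(counts):
    """Shannon entropy (natural log) of a neighbor-count distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def select_candidates(tokens, r1, r2, k):
    """Keep words with r1 < H(lw) < r2 and r1 < H(rw) < r2, score them by
    R = H(lw) * H(rw), and return the top k as (word, R) pairs."""
    left, right = neighbor_counts(tokens)
    scored = []
    for word in set(tokens):
        h_left, h_right = entropy(left[word]), entropy(right[word])
        if r1 < h_left < r2 and r1 < h_right < r2:
            scored.append((word, h_left * h_right))
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]
```

A word that always appears with the same neighbor on one side has zero entropy on that side and is filtered out by the lower threshold.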

In the method for discovering network hot topics provided by the present invention, the left and right information entropy of each word in the text corpus are calculated using the information entropy formula; when both fall within the preset range, the richness of the word is calculated from them, and the several words with the highest richness are taken as hot word candidates. This extracts hot word candidates accurately, which in turn ensures accurate extraction of hot words and supports discovering network hot topics from the perspective of hot words, so that the topics found have high heat and better match the definition of a hot topic. It also helps achieve accurate discovery of network hot topics, makes it easier for users to identify them intuitively, supports a timely understanding of developments in each industry, and is of great significance for monitoring and analyzing online public opinion.

Based on any of the above embodiments, a method for discovering network hot topics is provided, in which step S12 further comprises:

Specifically, in general, the larger the left and right information entropy of a word, the better the word expresses the content of the text. There is, however, a special case: the left information entropy of a word is very large and its right information entropy very small, while the adjacent word has a very small left information entropy and a very large right information entropy. In that case the word and its adjacent word need to be merged into one new word. That is, for two adjacent words w and w₁, when H(lw)>r₂, H(rw)<r₁, H(lw₁)<r₁ and H(rw₁)>r₂, the words w and w₁ are merged into one new word. Finally, the merged new word is taken as a hot word candidate.
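The merge rule for adjacent words can be sketched as follows. The sketch assumes the entropies have already been computed and are passed in as a dictionary mapping each word to its (left, right) entropy pair; joining the two words by plain concatenation (suited to Chinese text) and the `joiner` parameter are assumptions introduced here.

```python
def should_merge(h_lw, h_rw, h_lw1, h_rw1, r1, r2):
    """Condition from the text: H(lw) > r2, H(rw) < r1, H(lw1) < r1, H(rw1) > r2."""
    return h_lw > r2 and h_rw < r1 and h_lw1 < r1 and h_rw1 > r2


def merge_adjacent(tokens, entropies, r1, r2, joiner=""):
    """Scan left to right and fuse each adjacent pair that meets the condition.

    `entropies` maps word -> (left entropy, right entropy); words missing
    from it are never merged.
    """
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            w, w1 = tokens[i], tokens[i + 1]
            if (w in entropies and w1 in entropies
                    and should_merge(*entropies[w], *entropies[w1], r1, r2)):
                merged.append(w + joiner + w1)
                i += 2  # both halves are consumed by the new word
                continue
        merged.append(tokens[i])
        i += 1
    return merged
```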

In the method for discovering network hot topics provided by the present invention, the left and right information entropy of each word in the text corpus are calculated using the information entropy formula. When a word has a very large left information entropy and a very small right information entropy, and the adjacent word has a very small left information entropy and a very large right information entropy, the two words are merged into one new word, and the merged new word is taken as a hot word candidate. This extracts hot word candidates accurately, which in turn ensures accurate extraction of hot words and supports discovering network hot topics from the perspective of hot words, so that the topics found have high heat and better match the definition of a hot topic. It also helps achieve accurate discovery of network hot topics, makes it easier for users to identify them intuitively, supports a timely understanding of developments in each industry, and is of great significance for monitoring and analyzing online public opinion.

Based on any of the above embodiments, a method for discovering network hot topics is provided, in which calculating the heat of the hot word candidate from its single-day word frequency and historical volatility in step S2 further comprises:

H=B*0.5+F*0.5

Specifically, on the basis of the above technical solution, calculating the heat of a hot word candidate from its single-day word frequency and historical volatility is implemented as follows:

Because a hot word has a high single-day word frequency and a large historical volatility, the heat of a hot word candidate is calculated from these two aspects. The base weight of the hot word candidate is calculated first from its single-day word frequency, where the single-day word frequency is the word frequency of the candidate counted per day. To keep differences in the number of texts per day from distorting the base weight, the frequency is smoothed. The base weight B is calculated as follows:

B=log (1+log (1+log (tf+1)))

where tf is the word frequency of the hot word candidate counted for the day.
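The smoothing in the base weight formula can be checked numerically with a short sketch. The logarithm base is not stated in the text; the natural logarithm is assumed here.

```python
import math


def base_weight(tf):
    """B = log(1 + log(1 + log(tf + 1))), using natural logarithms (assumed)."""
    return math.log(1 + math.log(1 + math.log(tf + 1)))
```

The triple logarithm compresses the range sharply: a word that never occurs gets a weight of 0, and a hundredfold increase in frequency raises the weight only modestly, which limits the effect of days with unusually many texts.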

Further, the fluctuation weight of a hot word candidate is calculated from its historical volatility. Specifically, historical volatility is considered from three aspects of the base weight: overall fluctuation, long-term change, and short-term change. The larger the historical volatility of a hot word candidate, the more likely the word is to become a hot word. The overall fluctuation and long-term change of the single-day word frequency reflect the historical volatility of a candidate better than the short-term change does, so in calculating the fluctuation weight they are weighted more heavily than the short-term change. To simplify the choice of weight coefficients, this embodiment gives the overall fluctuation and the long-term change equal weights. From the above analysis, with the three weights summing to 1, the final fluctuation weight formula assigns a weight of 0.4 to each of the overall fluctuation and the long-term change of the base weight and a weight of 0.2 to the short-term change. The overall fluctuation V, long-term change L, short-term change S, and fluctuation weight F of the base weight are expressed respectively as follows:

F=V*0.4+L*0.4+S*0.2

where n denotes the experimental data period and B_i denotes the base word frequency.

The heat weight of a hot word candidate consists of two parts, the base weight and the fluctuation weight, which reflect the two characteristics of a hot word respectively, so the two are given equal weight in calculating the heat weight. The heat weight H is expressed as follows:

H=B*0.5+F*0.5

The heat of each hot word candidate can be calculated with the above heat weight formula, and the several candidates with the highest heat are chosen as hot words.
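The weighting and ranking can be sketched as below. The values of V, L, and S are taken as given inputs, since their defining formulas are not reproduced in this text; the dictionary layout and function names are hypothetical.

```python
def fluctuation_weight(v, l, s):
    """F = V*0.4 + L*0.4 + S*0.2."""
    return 0.4 * v + 0.4 * l + 0.2 * s


def heat(b, f):
    """H = B*0.5 + F*0.5."""
    return 0.5 * b + 0.5 * f


def top_hot_words(candidates, n):
    """Rank candidates by heat and return the top n words.

    `candidates` maps word -> (B, V, L, S), all precomputed by the caller.
    """
    scored = {
        w: heat(b, fluctuation_weight(v, l, s))
        for w, (b, v, l, s) in candidates.items()
    }
    return sorted(scored, key=lambda w: (-scored[w], w))[:n]
```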

In the method for discovering network hot topics provided by the present invention, the base weight of a hot word candidate is calculated from its single-day word frequency, the fluctuation weight is calculated from its historical volatility, the heat of the candidate is then calculated from the base weight and the fluctuation weight, and the several candidates with the highest heat are taken as hot words. This extracts hot words accurately and supports discovering network hot topics from the perspective of hot words, so that the topics found have high heat and better match the definition of a hot topic. It also helps achieve accurate discovery of network hot topics, makes it easier for users to identify them intuitively, supports a timely understanding of developments in each industry, and is of great significance for monitoring and analyzing online public opinion.

Based on any of the above-described embodiment, a kind of network hot topic discovery method is provided, every two are calculated described in step S3 The correlation of a hot word, further comprises：

sim(X, Y) = α*sim_e(X, Y) + β*sim_c(X, Y), α + β = 1

Specifically, on the basis of the above technical solution, calculating the correlation between every two hot words is implemented as follows:

The correlation of words comprises syntactic correlation and semantic correlation: syntactic correlation is reflected in the edit distance similarity between words, and semantic correlation is reflected in the HowNet similarity between words. The correlation between every two hot words can therefore be obtained by calculating their edit distance similarity and HowNet similarity.

Edit distance, also known as Levenshtein distance, is the minimum number of edit operations needed to transform one string into another. There are three kinds of edit operation: substitution, insertion, and deletion. The larger the edit distance between two strings, the lower their similarity.

Let word X consist of n characters, X = x₁x₂…x_n, and word Y consist of m characters, Y = y₁y₂…y_m, and let C = {c_s, c_i, c_d} denote the costs of substituting, inserting, and deleting one character, respectively. The edit distance of X and Y is then defined recursively as follows:

where Head(X) = x₁x₂…x_{n-1}, Head(Y) = y₁y₂…y_{m-1}, C_i(ε, y_m) = c_i, C_d(x_n, ε) = c_d,

It is generally considered that inserting a character and deleting a character cost about the same, so the two operations have the same weight, i.e. c_i = c_d, whereas the cost of substituting a character is different. For example, a substitution can be regarded as first deleting a character and then inserting a new character in its place, so the cost of a substitution is twice that of a deletion or an insertion. Therefore c_i = c_d ≠ c_s.

Accordingly, the edit distance similarity of hot words X and Y is calculated as follows:

where |X| and |Y| denote the lengths of hot word X and hot word Y, respectively.
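
The weighted edit distance described above can be sketched with the standard dynamic-programming formulation. This is a minimal sketch, not necessarily the exact recursion of the embodiment: the default costs c_i = c_d = 1 and c_s = 2 are assumed values chosen only to match the statement that a substitution costs twice an insertion or deletion, and the function name is hypothetical. The sketch covers the distance only, not its normalization into a similarity.

```python
from typing import List


def edit_distance(x: str, y: str, c_s: float = 2.0, c_i: float = 1.0, c_d: float = 1.0) -> float:
    """Weighted edit distance between x and y (assumed costs, see lead-in)."""
    n, m = len(x), len(y)
    # dist[i][j] is the cheapest cost of turning x[:i] into y[:j].
    dist: List[List[float]] = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * c_d  # delete every character of x[:i]
    for j in range(1, m + 1):
        dist[0][j] = j * c_i  # insert every character of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0.0 if x[i - 1] == y[j - 1] else c_s
            dist[i][j] = min(
                dist[i - 1][j] + c_d,               # delete x[i-1]
                dist[i][j - 1] + c_i,               # insert y[j-1]
                dist[i - 1][j - 1] + substitution,  # match or substitute
            )
    return dist[n][m]
```

With c_s = c_i = c_d = 1 the function reduces to the ordinary Levenshtein distance.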

Let word X have n senses, denoted C_x1, C_x2, …, C_xn, and let word Y have m senses, denoted C_y1, C_y2, …, C_ym. The HowNet similarity sim_c(X, Y) of X and Y is then calculated as follows:

where Sim(C_xi, C_yj) is the similarity of the two senses C_xi and C_yj.

The similarity of the senses C_xi and C_yj is calculated as follows:

where Sim_pj(P₁, P₂) is the similarity of two sememes P₁ and P₂, and β_i (1 ≤ i ≤ 4) are adjustable parameters satisfying β₁ + β₂ + β₃ + β₄ = 1 and β₁ ≤ β₂ ≤ β₃ ≤ β₄.

The similarity of the sememes P₁ and P₂ is calculated as follows:

Sim_p(P₁,P₂)=σ/(d+ σ)

where d is the path length between P₁ and P₂ in the sememe hierarchy, and σ is an adjustable parameter.
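
A small worked example may help show how σ/(d + σ) behaves: the similarity is 1 when the path length d is 0 and falls toward 0 as d grows, reaching exactly 0.5 at d = σ. The function name and the default value of σ below are assumptions for illustration; the text only says that σ is adjustable.

```python
def sememe_similarity(d: float, sigma: float = 1.6) -> float:
    """Sim_p(P1, P2) = sigma / (d + sigma), where d is the path length between P1 and P2.

    The default sigma is an assumed value, not one given in the text.
    """
    if d < 0 or sigma <= 0:
        raise ValueError("d must be non-negative and sigma must be positive")
    return sigma / (d + sigma)


# Similarity shrinks as the two sememes move apart in the hierarchy.
similarities = [sememe_similarity(d) for d in (0, 1, 2, 4)]
```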

Further, edit distance similarity measures the correlation of words at the syntactic level and cannot capture their correlation at the semantic level. In view of this, this embodiment combines edit distance similarity with HowNet similarity and proposes a semantic correlation measure, i.e. the correlation of every two hot words is calculated from the edit distance similarity and the HowNet similarity, as follows:

sim(X, Y) = α*sim_e(X, Y) + β*sim_c(X, Y), α + β = 1
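
The weighted combination above can be sketched as below. The two component similarities are passed in as already-computed numbers, since computing sim_c needs the HowNet resource; the function name and the default α = 0.5 are assumptions, as the text does not state the values of α and β.

```python
def correlation(sim_e: float, sim_c: float, alpha: float = 0.5) -> float:
    """sim(X, Y) = alpha * sim_e(X, Y) + beta * sim_c(X, Y), with alpha + beta = 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha  # enforces the constraint alpha + beta = 1
    return alpha * sim_e + beta * sim_c
```

Deriving β from α, rather than accepting both, keeps the constraint α + β = 1 from being violated by a caller.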

In the network hot topic discovery method provided by the invention, the edit distance similarity and the HowNet similarity between every two hot words are calculated, and the correlation of every two hot words is calculated from them. The method measures the correlation between hot words at both the syntactic level and the semantic level, which helps to identify related hot words accurately; merging related hot words into the same set ensures that the information within a hot topic is highly correlated. It also enables accurate discovery of network hot topics, helps users identify them intuitively, supports timely understanding of developments in each industry, and is of great significance for monitoring and analyzing online public opinion.

Based on any of the above embodiments, a network hot topic discovery method is provided, in which calculating the co-occurrence degree between every two hot word sets in step S4 further comprises:

Specifically, on the basis of the above technical solution, calculating the co-occurrence degree between every two hot word sets is implemented as follows:

Word co-occurrence means that two words appear together in the same unit window (a paragraph, a sentence, etc.); it is limited to two words and is relatively simple and fixed. This embodiment extends the concept of co-occurrence from two words to two sets and proposes the concept of set co-occurrence degree. The co-occurrence degree of two hot word sets is obtained by calculating the co-occurrence degrees between the words of the two sets.

Let hot word set A contain n semantically related hot words, i.e. A = {X₁, X₂, …, X_n} with n ≥ 1, and let hot word set B contain m semantically related hot words, i.e. B = {Y₁, Y₂, …, Y_m} with m ≥ 1. The co-occurrence degree C(A, B) of set A and set B is then defined as follows:

C(A, B) = max{ C(X_i, Y_j) | i = 1, 2, …, n; j = 1, 2, …, m }

where C(X_i, Y_j) denotes the co-occurrence degree of hot word X_i in set A and hot word Y_j in set B.
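
The set-level definition C(A, B) can be sketched as a maximum over all cross pairs. The word-level co-occurrence function is supplied by the caller; the function names and the lookup table of pairwise values below are hypothetical and serve only as an example.

```python
from itertools import product
from typing import Callable, Iterable


def set_cooccurrence(
    a: Iterable[str],
    b: Iterable[str],
    pair_cooccurrence: Callable[[str, str], float],
) -> float:
    """C(A, B): the largest C(X_i, Y_j) over all X_i in A and Y_j in B.

    Both sets must be non-empty (n >= 1 and m >= 1 in the text).
    """
    return max(pair_cooccurrence(x, y) for x, y in product(a, b))


# Hypothetical pairwise values, for illustration only.
pair_table = {
    ("corn", "price"): 0.6,
    ("corn", "market"): 0.3,
    ("maize", "price"): 0.4,
    ("maize", "market"): 0.1,
}


def lookup(x: str, y: str) -> float:
    """Symmetric lookup into the example table; unknown pairs count as 0."""
    return pair_table.get((x, y), pair_table.get((y, x), 0.0))


degree = set_cooccurrence({"corn", "maize"}, {"price", "market"}, lookup)
```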

The co-occurrence degree C(X_i, Y_j) of hot word X_i and hot word Y_j is calculated as follows:

where R(X_i|Y_j) denotes the relative co-occurrence degree of hot word X_i with respect to hot word Y_j, and R(Y_j|X_i) denotes the relative co-occurrence degree of hot word Y_j with respect to hot word X_i. In general R(X_i|Y_j) is not equal to R(Y_j|X_i), but C(X_i, Y_j) = C(Y_j, X_i).

The relative co-occurrence degree R(X_i|Y_j) of hot word X_i with respect to hot word Y_j is calculated as follows:

where f(X_i, Y_j) denotes the number of times hot word X_i and hot word Y_j appear together in the same text, and f(Y_j) denotes the number of times hot word Y_j appears.
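
The asymmetry of the relative co-occurrence degree can be seen in a small sketch. It assumes, based only on the definitions of f(X_i, Y_j) and f(Y_j) above, that R(X_i|Y_j) is the ratio f(X_i, Y_j) / f(Y_j), and it counts occurrences per text (each text is a set of words). Both choices, like the function name and sample texts, are assumptions for illustration.

```python
from typing import List, Set


def relative_cooccurrence(x: str, y: str, texts: List[Set[str]]) -> float:
    """R(x|y), assumed here to be f(x, y) / f(y), counted once per text."""
    f_y = sum(1 for words in texts if y in words)
    if f_y == 0:
        return 0.0  # y never appears, so nothing co-occurs with it
    f_xy = sum(1 for words in texts if x in words and y in words)
    return f_xy / f_y


# Hypothetical corpus: "price" always appears with "corn", but not vice versa.
example_texts = [
    {"corn", "price"},
    {"corn", "price"},
    {"corn"},
    {"corn"},
]
r_corn_given_price = relative_cooccurrence("corn", "price", example_texts)
r_price_given_corn = relative_cooccurrence("price", "corn", example_texts)
```

In this corpus the two directions differ, which matches the remark that R(X_i|Y_j) is generally not equal to R(Y_j|X_i).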

In the network hot topic discovery method provided by the invention, for every two hot word sets the co-occurrence degree is calculated between each pair of words drawn one from each set; these pairwise co-occurrence degrees are compared, and their maximum is taken as the co-occurrence degree between the two hot word sets. This facilitates merging co-occurring hot word sets and ensures that the information within a hot topic is highly correlated. It also enables accurate discovery of network hot topics, helps users identify them intuitively, supports timely understanding of developments in each industry, and is of great significance for monitoring and analyzing online public opinion.

Based on any of the above embodiments, a network hot topic discovery method is provided, which further comprises, after step S2: adding the hot word candidate words to the word segmentation dictionary.

Specifically, a combined new-word candidate such as "corn price" may be split during word segmentation into the two words "corn" and "price", so that when word frequency statistics are computed for "corn price" the frequency is 0. To avoid this, in this embodiment the hot word candidate words are added to the word segmentation dictionary once they have been obtained, so that combined candidate words are no longer split.
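
The effect described above can be reproduced with a toy segmenter. The sketch below uses forward maximum matching, which is a stand-in chosen for brevity and not the segmenter of the embodiment; the function name, the dictionary contents, and the sample sentence (玉米价格上涨, "corn price rises") are illustrative assumptions.

```python
from typing import List, Set


def segment(text: str, dictionary: Set[str]) -> List[str]:
    """Toy forward-maximum-matching segmenter over a word dictionary."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest dictionary word first; fall back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


sentence = "玉米价格上涨"              # "corn price rises"
dictionary = {"玉米", "价格", "上涨"}  # "corn", "price", "rises"

before = segment(sentence, dictionary)  # candidate is split, so its count is 0
dictionary.add("玉米价格")              # add the candidate "corn price"
after = segment(sentence, dictionary)   # candidate now survives as one token
```

Counting "玉米价格" in `before` gives 0, while in `after` it gives 1, which is the difference the paragraph above describes.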

In the network hot topic discovery method provided by the invention, after the hot word candidate words are obtained they are added to the word segmentation dictionary. This effectively prevents combined new words from being split, allows the word frequency of hot word candidate words to be counted accurately, and ultimately enables accurate extraction of hot words.

Fig. 2 is a schematic diagram of the overall structure of a network hot topic discovery system according to an embodiment of the invention. As shown in Fig. 2, the invention provides a network hot topic discovery system, comprising:

a hot word candidate word extraction module 1, configured to extract hot word candidate words from the text corpus of a preset time period based on the information entropy of words;

a hot word extraction module 2, configured to calculate the heat of the hot word candidate words from their single-day word frequency and historical volatility, sort the hot word candidate words in descending order of heat, and take the top N hot word candidate words as hot words, where N >= 1;

a hot word set acquisition module 3, configured to calculate the correlation of every two hot words, merge two hot words into the same set when their correlation is greater than a first predetermined threshold, and store each remaining hot word in a set of its own, thereby obtaining all hot word sets;

a hot word set merging module 4, configured to calculate the co-occurrence degree between every two hot word sets and merge two hot word sets when the co-occurrence degree between them is greater than a second predetermined threshold, thereby obtaining all merged hot word sets;

a hot topic acquisition module 5, configured to calculate the co-occurrence degree between every two hot words in each merged hot word set, compute the co-occurrence degree of each hot word, sort all hot words in descending order of co-occurrence degree, and take the top M hot words as the title of the hot topic corresponding to each merged hot word set, where M >= 1.

Specifically, the invention provides a network hot topic discovery system comprising a hot word candidate word extraction module 1, a hot word extraction module 2, a hot word set acquisition module 3, a hot word set merging module 4, and a hot topic acquisition module 5. Together the modules implement the network hot topic discovery method of any of the above embodiments, as follows:

For the hot topics of a given time period, the hot word candidate word extraction module 1 obtains the text corpus of that period, for example by searching online news reports from the period, and then extracts hot word candidate words from the obtained text corpus. In this embodiment, hot word candidate words are extracted from the text corpus based on the information entropy of words: the information entropy of the words in the corpus is calculated first, and the hot word candidate words are then determined from it. The information entropy of a word indicates how rich its collocations are; the richer the collocations of a word, the more representative it is and the more likely it is to become a hot word. Hot word candidate words in the text corpus can therefore be extracted accurately on the basis of information entropy. Hot word candidate words include new words, entity words, and non-entity words; they generally express the content of the text well and are representative.

Further, hot word candidate words all attract some degree of attention online, but the level of attention differs from one to another, and only those with high attention can be called hot words. In view of this, on the basis of the above technical solution, the hot word extraction module 2 calculates the heat of each obtained hot word candidate word, arranges all hot word candidate words in descending order of heat, and finally selects the top N hot word candidate words as hot words, where N >= 1. That is, the several hot word candidate words with the highest heat are selected as hot words; the exact number can be set according to actual requirements and is not specifically limited here. In general, a hot word is characterized by a high single-day word frequency and a large historical volatility. Accordingly, in this embodiment the hot word extraction module 2 calculates the heat of a hot word candidate word from its single-day word frequency and historical volatility, so that hot words can be extracted accurately from the hot word candidate words according to their heat. In other embodiments, the heat of hot word candidate words may be calculated in other ways, which can be set according to actual requirements and are not specifically limited here.

Further, for the hot words obtained above, the hot word set acquisition module 3 calculates the correlation between every two hot words. The correlation of words comprises syntactic correlation and semantic correlation: syntactic correlation is reflected in the edit distance similarity between words, and semantic correlation is reflected in the HowNet similarity between words, so the correlation between every two hot words can be obtained by calculating their edit distance similarity and HowNet similarity. On this basis, the hot word set acquisition module 3 compares the correlation between every two hot words with the first predetermined threshold. When the correlation between two hot words is greater than the first predetermined threshold, the two hot words are merged into one set; if the correlation between a hot word and every other hot word is not greater than the first predetermined threshold, that hot word is stored in a set of its own. Once every hot word has been placed in a set, all hot word sets are obtained. The first predetermined threshold is set in advance, can be chosen according to actual requirements, and is not specifically limited here.
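
The grouping step performed by module 3 could be sketched as below. This sketch uses a union-find structure, which merges transitively (if X is related to Y and Y to Z, all three share a set); the text does not say how such chains are handled, so that behavior, along with the function names and the sample correlation values, is an assumption.

```python
from itertools import combinations
from typing import Callable, Dict, List


def group_hot_words(
    words: List[str],
    correlation: Callable[[str, str], float],
    threshold: float,
) -> List[List[str]]:
    """Put words whose pairwise correlation exceeds `threshold` into one set."""
    parent: Dict[str, str] = {w: w for w in words}

    def find(w: str) -> str:
        # Follow parent links to the representative, halving the path as we go.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for x, y in combinations(words, 2):
        if correlation(x, y) > threshold:
            parent[find(x)] = find(y)

    groups: Dict[str, List[str]] = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return list(groups.values())


# Hypothetical correlation values, for illustration only.
example_corr = {("corn", "maize"): 0.9, ("corn", "pork"): 0.2}


def corr(x: str, y: str) -> float:
    return example_corr.get((x, y), example_corr.get((y, x), 0.0))


hot_word_sets = group_hot_words(["corn", "maize", "pork", "rain"], corr, 0.5)
```

Words that are not related to any other word end up in singleton sets, matching the description above.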

Further, for all the hot word sets obtained above, the hot word set merging module 4 calculates the co-occurrence degree between every two hot word sets; specifically, the co-occurrence degree between two hot word sets is obtained by calculating the co-occurrence degrees between the words of the two sets. On this basis, the hot word set merging module 4 compares the co-occurrence degree between every two hot word sets with the second predetermined threshold, and merges two hot word sets when the co-occurrence degree between them is greater than the second predetermined threshold. After this processing, all merged hot word sets are obtained, and each final hot word set represents one hot topic. The second predetermined threshold is set in advance, can be chosen according to actual requirements, and is not specifically limited here.

Further, for each merged hot word set, the hot topic acquisition module 5 calculates the co-occurrence degree between every two hot words in the set, then computes the co-occurrence degree of each hot word in the set, arranges all hot words in the set in descending order of co-occurrence degree, and finally selects the top M hot words as the title of the hot topic corresponding to that hot word set, where M >= 1. That is, for any hot word set, the several hot words with the largest co-occurrence degree in the set are selected as the title of the corresponding hot topic; the number of hot words selected can be set according to actual requirements and is not specifically limited here. When a hot topic title contains several hot words, the hot words may be separated by spaces.
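
The title selection performed by module 5 could be sketched as follows. The text says that the co-occurrence degree of each hot word is computed from the pairwise values but does not give the aggregation; this sketch assumes it is the sum of a word's co-occurrence degrees with every other word in the set. The function names and sample values are likewise hypothetical.

```python
from typing import Callable, Dict, List


def topic_title(
    hot_words: List[str],
    pair_cooccurrence: Callable[[str, str], float],
    m: int,
) -> str:
    """Join the m hot words with the largest total co-occurrence, space-separated."""
    totals: Dict[str, float] = {
        w: sum(pair_cooccurrence(w, other) for other in hot_words if other != w)
        for w in hot_words
    }
    ranked = sorted(hot_words, key=lambda w: totals[w], reverse=True)
    return " ".join(ranked[:m])


# Hypothetical pairwise values, for illustration only.
example_pairs = {("corn", "price"): 0.8, ("corn", "rise"): 0.5, ("price", "rise"): 0.2}


def pair_value(x: str, y: str) -> float:
    return example_pairs.get((x, y), example_pairs.get((y, x), 0.0))


title = topic_title(["rise", "price", "corn"], pair_value, 2)
```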

In the network hot topic discovery system provided by the invention, hot word candidate words are extracted from the text corpus of a preset time period based on the information entropy of words; the heat of the hot word candidate words is calculated from their single-day word frequency and historical volatility, and the several hot word candidate words ranked highest by heat are selected as hot words. The correlation of every two hot words is then calculated, two hot words whose correlation is greater than the first predetermined threshold are merged into the same set, and each remaining hot word is stored in a set of its own, giving all hot word sets. Next, the co-occurrence degree between every two hot word sets is calculated, and two hot word sets whose co-occurrence degree is greater than the second predetermined threshold are merged. Finally, the co-occurrence degree between every two hot words in each hot word set is calculated, the co-occurrence degree of each hot word in each set is computed, and the several hot words ranked highest by co-occurrence degree are taken as the title of the hot topic corresponding to each hot word set; the network hot topics are thereby discovered. The system discovers network hot topics from the perspective of hot words, which ensures that the topics are genuinely hot and better matches the definition of a hot topic. At the same time, merging related hot words and co-occurring hot word sets by means of the hot word correlation measure and the hot word set co-occurrence degree calculation ensures that the information within a hot topic is highly correlated. It also enables accurate discovery of network hot topics, helps users identify them intuitively, supports timely understanding of developments in each industry, and is of great significance for monitoring and analyzing online public opinion.

Fig. 3 shows a structural block diagram of a device for the network hot topic discovery method according to an embodiment of the invention. Referring to Fig. 3, the device comprises a processor 31, a memory 32, and a bus 33, where the processor 31 and the memory 32 communicate with each other via the bus 33. The processor 31 is configured to call program instructions in the memory 32 to execute the methods provided by the above method embodiments, for example including: extracting hot word candidate words from the text corpus of a preset time period based on the information entropy of words; calculating the heat of the hot word candidate words from their single-day word frequency and historical volatility, sorting the hot word candidate words in descending order of heat, and taking the top N hot word candidate words as hot words, where N >= 1; calculating the correlation of every two hot words, merging two hot words into the same set when their correlation is greater than the first predetermined threshold, and storing each remaining hot word in a set of its own, to obtain all hot word sets; calculating the co-occurrence degree between every two hot word sets and merging two hot word sets when the co-occurrence degree between them is greater than the second predetermined threshold, to obtain all merged hot word sets; and calculating the co-occurrence degree between every two hot words in each merged hot word set, computing the co-occurrence degree of each hot word, sorting all hot words in descending order of co-occurrence degree, and taking the top M hot words as the title of the hot topic corresponding to each merged hot word set, where M >= 1.

This embodiment provides a non-transitory computer-readable storage medium that stores computer instructions, the computer instructions causing a computer to execute the methods provided by the above method embodiments, for example including: extracting hot word candidate words from the text corpus of a preset time period based on the information entropy of words; calculating the heat of the hot word candidate words from their single-day word frequency and historical volatility, sorting the hot word candidate words in descending order of heat, and taking the top N hot word candidate words as hot words, where N >= 1; calculating the correlation of every two hot words, merging two hot words into the same set when their correlation is greater than the first predetermined threshold, and storing each remaining hot word in a set of its own, to obtain all hot word sets; calculating the co-occurrence degree between every two hot word sets and merging two hot word sets when the co-occurrence degree between them is greater than the second predetermined threshold, to obtain all merged hot word sets; and calculating the co-occurrence degree between every two hot words in each merged hot word set, computing the co-occurrence degree of each hot word, sorting all hot words in descending order of co-occurrence degree, and taking the top M hot words as the title of the hot topic corresponding to each merged hot word set, where M >= 1.

Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, and optical discs.

The embodiments described above, such as the device for the network hot topic discovery method, are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software together with the necessary general-purpose hardware platform, and of course also by hardware. Based on this understanding, the above technical solution in essence, or the part of it that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments or in certain parts of the embodiments.

Finally, the methods of the present application are only preferred embodiments and are not intended to limit the scope of protection of the invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A network hot topic discovery method, characterized by comprising:

S1, extracting hot word candidate words from the text corpus of a preset time period based on the information entropy of words;

S2, calculating the heat of the hot word candidate words from the single-day word frequency and historical volatility of the hot word candidate words, sorting the hot word candidate words in descending order of heat, and taking the top N hot word candidate words as hot words, where N >= 1;

S3, calculating the correlation of every two of the hot words, merging two hot words into the same set when the correlation of the two hot words is greater than a first predetermined threshold, and storing each remaining hot word in a set of its own, to obtain all hot word sets;

S4, calculating the co-occurrence degree between every two of the hot word sets, and merging two hot word sets when the co-occurrence degree between the two hot word sets is greater than a second predetermined threshold, to obtain all merged hot word sets;

S5, calculating the co-occurrence degree between every two hot words in each merged hot word set, computing the co-occurrence degree of each hot word, sorting all hot words in descending order of co-occurrence degree, and taking the top M hot words as the title of the hot topic corresponding to each merged hot word set, where M >= 1.

2. The method according to claim 1, characterized in that step S1 further comprises:

S11, obtaining the text corpus of a preset time period, and calculating the left information entropy and the right information entropy of each word in the text corpus based on the information entropy formula;

S12, comparing the left information entropy and the right information entropy with a preset minimum threshold and a preset maximum threshold, respectively, and, when r₁ < H(lw) < r₂ and r₁ < H(rw) < r₂ are satisfied, calculating the abundance of the word by the formula:

R = H(lw) * H(rw)

where H(lw) and H(rw) are respectively the left information entropy and the right information entropy of word w; r₁ and r₂ are respectively the preset minimum threshold and the preset maximum threshold; and R is the abundance of word w;

S13, sorting the words in descending order of abundance, and taking the top K words as the hot word candidate words, where K >= 1.

3. The method according to claim 2, characterized in that step S12 further comprises:

for two adjacent words w and w₁, when H(lw) > r₂, H(rw) < r₁, H(lw₁) < r₁, and H(rw₁) > r₂ are satisfied, merging the word w and the word w₁ into one new word; where H(lw) and H(rw) are respectively the left information entropy and the right information entropy of the word w, and H(lw₁) and H(rw₁) are respectively the left information entropy and the right information entropy of the word w₁.

4. The method according to claim 1, characterized in that calculating the heat of the hot word candidate words from the single-day word frequency and historical volatility of the hot word candidate words in step S2 further comprises:

calculating the base weight of the hot word candidate word from the single-day word frequency of the hot word candidate word, and calculating the fluctuation weight of the hot word candidate word from the historical volatility of the hot word candidate word;

H = B*0.5 + F*0.5

where B is the base weight of the hot word candidate word; F is the fluctuation weight of the hot word candidate word; and H is the heat of the hot word candidate word.

5. The method according to claim 1, characterized in that calculating the correlation of every two of the hot words in step S3 further comprises:

calculating the correlation of every two of the hot words from the edit distance similarity and the HowNet similarity, by the formula:

sim(X, Y) = α*sim_e(X, Y) + β*sim_c(X, Y), α + β = 1

where sim(X, Y) denotes the correlation of word X and word Y; sim_e(X, Y) denotes the edit distance similarity of word X and word Y; sim_c(X, Y) denotes the HowNet similarity of word X and word Y; and α and β denote the weights of the edit distance similarity and the HowNet similarity, respectively.

6. The method according to claim 1, characterized in that calculating the co-occurrence degree between every two of the hot word sets in step S4 further comprises:

calculating the co-occurrence degree between every two words in every two of the hot word sets; comparing the co-occurrence degrees between every two of the words, and taking the maximum of the co-occurrence degrees between every two of the words as the co-occurrence degree between the two hot word sets.

7. The method according to claim 1, characterized by further comprising, after step S2: adding the hot word candidate words to the word segmentation dictionary.

8. A network hot topic discovery system, characterized by comprising:

a hot word candidate word extraction module, configured to extract hot word candidate words from the text corpus of a preset time period based on the information entropy of words;

a hot word extraction module, configured to calculate the heat of the hot word candidate words from the single-day word frequency and historical volatility of the hot word candidate words, sort the hot word candidate words in descending order of heat, and take the top N hot word candidate words as hot words, where N >= 1;

a hot word set acquisition module, configured to calculate the correlation of every two of the hot words, merge two hot words into the same set when the correlation of the two hot words is greater than a first predetermined threshold, and store each remaining hot word in a set of its own, to obtain all hot word sets;

a hot word set merging module, configured to calculate the co-occurrence degree between every two of the hot word sets, and merge two hot word sets when the co-occurrence degree between the two hot word sets is greater than a second predetermined threshold, to obtain all merged hot word sets;

a hot topic acquisition module, configured to calculate the co-occurrence degree between every two hot words in each merged hot word set, compute the co-occurrence degree of each hot word, sort all hot words in descending order of co-occurrence degree, and take the top M hot words as the title of the hot topic corresponding to each merged hot word set, where M >= 1.

9. A device for the network hot topic discovery method, characterized by comprising:

at least one processor; and

the memory stores program instructions executable by the processor, and the processor, by calling the program instructions, is able to execute the method according to any one of claims 1 to 7.

10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, the computer instructions causing a computer to execute the method according to any one of claims 1 to 7.