CN104679738A

CN104679738A - Method and device for mining Internet hot words

Info

Publication number: CN104679738A
Application number: CN201310607937.0A
Authority: CN
Inventors: 肖诗斌; 孙丽华
Original assignee: BEIJING TRS INFORMATION TECHNOLOGY Co Ltd
Current assignee: TOLS INFORMATION TECHNOLOGY Co.,Ltd.
Priority date: 2013-11-27
Filing date: 2013-11-27
Publication date: 2015-06-03
Anticipated expiration: 2033-11-27
Also published as: CN104679738B

Abstract

The invention provides a method for mining Internet hot words. The method comprises the following steps: initializing a word graph and a background library; identifying an entity string and a non-entity string; updating a word string statistical index; calculating the popular degree of the word string; and sorting the popular degree of the word string, and outputting the word string. The word string is divided into the entity string and the non-entity string, the entity string and the non-entity string are subjected to targeted division identification, the background library is arranged to realize the incremental updating of corpuses and calculation indexes, and hot word extraction accuracy and efficiency can be improved. Meanwhile, the invention also provides a device for mining the Internet hot words. The device comprises a storage unit, an entity string identification unit, a non-entity string identification unit and a hot word extraction unit, wherein the hot word extraction unit finishes the incremental updating of the statistical index, the calculation of the popular degree of the word string and word string sorting output. The hot words can be orderly, efficiently and accurately extracted.

Description

Method and device for mining Internet hot words

Technical field

The present invention relates to natural language processing technology, and in particular to a method and device for mining Internet hot words.

Background art

A hot word is a word used with high frequency within a certain period. Hot words often carry the character of their time and reflect the hot topics and livelihood issues of that period. Besides words already included in dictionaries, Internet hot words also include Internet buzzwords, which originate and spread in cyberspace and are widely used in everyday communication, such as "why give up treatment" (何弃疗), "I don't understand it, but it sounds impressive" (不明觉厉) and "Chen Ou style" (陈欧体). Word segmentation systems usually have difficulty recognizing such words. Internet buzzwords have emerged on the Internet as an important new propagation phenomenon, and they evolve and change with the times.

Internet hot words are closely tied to social events and phenomena and have become a tool for expressing public opinion and for public supervision. Extracting Internet hot words accurately and efficiently is a foundation for important current tasks such as public opinion monitoring and Internet research.

Two kinds of techniques are currently used for hot word mining. The first is hot word mining based on document clustering; methods of this kind usually suffer from high clustering complexity and cannot meet the real-time requirements of Internet hot word mining. The second uses features of a word string, such as its word-formation boundaries and time distribution, to train a machine learning model that classifies the string as a hot word or not. Methods of this kind require knowledge base support, and the selected features are mostly features common to all word strings, with no special handling of special word strings; as a result they admit many noise words and the accuracy of hot word discovery is low.

Each type of entity string has its own word-formation rules; for example, a personal name string consists of one of a limited set of surnames followed by high-frequency given-name characters. Large entity string knowledge bases are also available, which makes entity strings convenient for machine learning models to learn. The present invention therefore divides Internet hot words into entity strings and non-entity strings and proposes a method and device for mining Internet hot words, so as to remove the efficiency bottleneck of Internet hot word mining.

Summary of the invention

In view of this, the main purpose of the present invention is to provide a method and device for mining Internet hot words, so as to improve the accuracy and efficiency of hot word mining.

The invention provides a method for mining Internet hot words, comprising the following steps.

Step A: build and initialize the word graph Words and the background library Corpus.

The word graph Words stores the words extracted in each step.

The background library Corpus stores the source data collected from the Internet and records each statistical indicator for each time unit, such as title string frequency, body-text string frequency and total string frequency.

Step B: entity string recognition.

Using sentence terminators as the delimiter, the raw Internet data is split into individual original word string sequences.

Each word string sequence is segmented into atomic units, and the atomic units are combined pairwise to obtain a binary coarse segmentation of the sequence. The N best coarse segmentation results are extracted and added to the word graph Words.

Three interconnected levels of hidden Markov models (HMMs) are built; from bottom to top these are the personal name recognition HMM, the place name recognition HMM and the organization name recognition HMM. Each level uses a hidden Markov model as its basic algorithmic model, and together they form a cascaded hidden Markov model (Cascaded Hidden Markov Model, Cascaded HMM for short).

Each HMM layer adopts an N-Best strategy and sends its N best results to the word graph Words for use by the higher-level models.

The lower-level HMMs support parameter estimation for the higher-level HMMs through the word generation model.

The input to the first-layer personal name recognition is the binary coarse segmentation sequence. Every HMM layer uses an improved Viterbi algorithm and sends its N best results to the word graph for use by the next higher model.

The top-level HMM performs organization name recognition on the basis of the personal name and place name recognition results.
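
Each layer of the cascade decodes a tag sequence with a Viterbi-style search. The following is a minimal sketch of the standard 1-best Viterbi algorithm only; it is not the improved N-Best variant the text refers to, and the tag set (`S` surname, `G` given-name character, `O` other) and all probabilities are made-up illustrative values.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for obs under an HMM."""
    floor = 1e-12  # assumed floor probability for unseen emissions
    # best[t][s] = (log-probability of the best path ending in s at t, previous state)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], floor)), None)
             for s in states}]
    for o in obs[1:]:
        row = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda x: x[1],
            )
            row[s] = (score + math.log(emit_p[s].get(o, floor)), prev)
        best.append(row)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return path[::-1]


# Toy personal-name model; every number below is hypothetical.
states = ["S", "G", "O"]
start_p = {"S": 0.3, "G": 0.1, "O": 0.6}
trans_p = {
    "S": {"S": 0.05, "G": 0.9, "O": 0.05},
    "G": {"S": 0.05, "G": 0.45, "O": 0.5},
    "O": {"S": 0.3, "G": 0.05, "O": 0.65},
}
emit_p = {
    "S": {"王": 0.6, "李": 0.4},
    "G": {"小": 0.5, "明": 0.5},
    "O": {"说": 0.5, "好": 0.5},
}
obs = ["王", "小", "明", "说"]
print(viterbi(obs, states, start_p, trans_p, emit_p))  # ['S', 'G', 'G', 'O']
```

In a cascade, the tagged output of this layer (here a surname followed by two given-name characters) would be merged into a single name token before the next layer runs.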

Step C: non-entity string recognition.

The Nagao algorithm is used to count the string frequency of substrings of length L in the word strings. Substrings whose string frequency exceeds a certain threshold are extracted, and substring reduction is applied.

Garbage strings are then filtered out using strategies such as "general geological coordinate system" filtering, IWP filtering, mutual information filtering and head-and-tail character filtering, which yields the candidate strings. Entity strings are then removed from the candidate strings; the remaining strings are the non-entity strings.
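
As a rough illustration of the last two stages of this filter chain, the sketch below applies a head-and-tail character filter and then removes known entity strings. The character lists and function names are hypothetical; the other filters named in the text are not shown because their definitions are not given here.

```python
# Hypothetical lists of characters that rarely begin or end a real word.
BAD_HEAD = set("的了和是")
BAD_TAIL = set("的了和是")


def head_tail_filter(candidates):
    """Drop candidates whose first or last character is in a blocked list."""
    return [c for c in candidates
            if c and c[0] not in BAD_HEAD and c[-1] not in BAD_TAIL]


def filter_candidates(candidates, known_entities):
    """Apply the head/tail filter, then drop strings already found as entities."""
    return [c for c in head_tail_filter(candidates) if c not in known_entities]


candidates = ["的天气", "不明觉厉", "北京", "好的"]
print(filter_candidates(candidates, known_entities={"北京"}))  # ['不明觉厉']
```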

Step D: word string statistical indicator update.

The word strings comprise the candidate entity strings and the non-entity strings extracted in the preceding steps.

Word string statistical indicators here are the statistics that serve the calculation of word string popularity, such as the frequency with which a word string occurs in titles and in body text, its total frequency of occurrence, the number of documents in which it occurs, and its frequency within a given time unit.

The background library Corpus records the word string indicator values for each time unit in which the source data is updated. When the Internet data for a subsequent time unit arrives, the corpus in the background library Corpus is updated incrementally, and the recorded indicators are updated incrementally at the same time.

Step E: word string popularity calculation.

The word string popularity weight consists of a base weight and a fluctuation weight. Word string popularity is calculated from the statistical indicators that are updated in real time in the background library.

The base weight is determined by the position in which the string occurs, its frequency and its inverse document frequency.

The fluctuation weight is described by the time attenuation degree of the word string.

A hot word is defined as a word used frequently and in large volume within a certain period. The time attenuation degree of an entry, attenuation degree for short, is therefore used to characterize how the word string frequency changes over time.

Further, the base weight is calculated as follows:

basew(s) = titlew(s) * […] + contentw(s), where titlew is the weight of the word string's occurrences in titles, contentw is the weight of its occurrences in body text, both weights are measured with tf-idf, and […] is a function coefficient reflecting the difference between a word string occurring in a title and in body text.
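
A minimal sketch of the base weight, assuming a plain tf-idf with add-one smoothing in the idf term and a constant title coefficient. The value `title_coeff=2.0`, the idf variant and the function names are assumptions for illustration; the text does not give the coefficient or the exact tf-idf form.

```python
import math


def tf_idf(freq, doc_count, total_docs):
    """Plain tf-idf; the +1 smoothing in the idf term is an assumed variant."""
    return freq * math.log((1 + total_docs) / (1 + doc_count))


def base_weight(title_freq, body_freq, doc_count, total_docs, title_coeff=2.0):
    """basew(s) = titlew(s) * coefficient + contentw(s); title_coeff is hypothetical."""
    titlew = tf_idf(title_freq, doc_count, total_docs)
    contentw = tf_idf(body_freq, doc_count, total_docs)
    return titlew * title_coeff + contentw


# A string seen once in titles and twice in body text, in 9 of 99 documents.
print(base_weight(title_freq=1, body_freq=2, doc_count=9, total_docs=99))
```

With a coefficient above 1, one title occurrence counts for more than one body-text occurrence, which is the asymmetry the coefficient is meant to express.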

To balance low-frequency and high-frequency strings, the base weight is smoothed as follows:

Convbasew(s) = log(1 + log(1 + log(basew(s)))).
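
The smoothing can be sketched directly from the formula. The natural logarithm and the guard `basew >= 1` are assumptions: the text does not state the base of the logarithm, and the guard is simply a sufficient condition for all three logarithms to be defined and non-negative.

```python
import math


def conv_base_weight(basew):
    """Convbasew(s) = log(1 + log(1 + log(basew(s)))), natural log assumed."""
    if basew < 1:
        raise ValueError("basew must be >= 1 so that every logarithm is defined")
    return math.log(1 + math.log(1 + math.log(basew)))


for w in (1, 10, 1000):
    print(w, round(conv_base_weight(w), 4))
```

The triple logarithm compresses the range sharply: a hundredfold increase in basew, from 10 to 1000, raises the smoothed weight by well under 1.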

The fluctuation weight is the attenuation of the word string frequency over time and is calculated as follows:

Wavew(s, t) = […], t ∈ [1, T], where t is one time unit.

Word string popularity: finalweight(s, t) = Convbasew(s) * Wavew(s, t).

Step F: hot word ranking and output.

Sorting the word strings in descending order of popularity weight yields the hot personal names, place names, organization names and hot non-entity words of a given period.
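
The ranking step can be sketched as follows, assuming the smoothed base weights and fluctuation weights have already been computed and are passed in as dictionaries. The function name, the `top_k` parameter and the sample values are illustrative only.

```python
def rank_hot_words(conv_base, wave, top_k=None):
    """Sort strings by finalweight = Convbasew * Wavew, highest first."""
    final = {s: conv_base[s] * wave[s] for s in conv_base if s in wave}
    ranked = sorted(final.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k] if top_k is not None else ranked


conv_base = {"a": 1.0, "b": 0.5, "c": 0.8}   # made-up smoothed base weights
wave = {"a": 0.5, "b": 2.0, "c": 1.0}        # made-up fluctuation weights
print(rank_hot_words(conv_base, wave, top_k=2))  # [('b', 1.0), ('c', 0.8)]
```

String "b" has the lowest base weight but ranks first because its fluctuation weight is highest, which is the intended effect of multiplying the two factors.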

In addition, the present invention also provides a device for mining Internet hot words, comprising: a storage unit 101, an entity recognition unit 102, a non-entity string recognition unit 103 and a hot word extraction unit 104.

The storage unit 101 is mainly responsible for storing and supplying the background library, the word graph, intermediate results and the like.

The entity recognition unit 102 is mainly responsible for word string segmentation and entity string recognition, including personal name recognition, place name recognition and organization name recognition.

The non-entity string recognition unit 103 is mainly responsible for high-frequency string extraction, garbage string filtering and candidate non-entity string extraction.

The hot word extraction unit 104 is mainly responsible for updating the statistical indicators of word strings in the background library, calculating the popularity of entity strings and non-entity strings, and ranking the word strings by popularity and outputting them.

The hot word extraction unit in turn consists of a statistical indicator update module 104_1, a popularity calculation module 104_2, and a popularity ranking and hot word output module 104_3.

The statistical indicator update module 104_1 uses an incremental update mechanism to calculate and update the statistical indicators of word strings not yet present in the background library.

The popularity calculation module 104_2 calculates the base weight and fluctuation weight of each word string from the statistical indicators and obtains the word string popularity value.

The popularity ranking and hot word output module 104_3 sorts the word strings from high to low by popularity value and outputs those whose popularity value exceeds a certain threshold as hot words.

As can be seen from the above scheme, the method and device for mining Internet hot words provided by the embodiments of the present invention set up a background library so that word string statistical indicators can be updated in real time at a specified time unit. Internet hot words are divided into entity strings and non-entity strings, which are recognized separately: entity strings are obtained with a machine learning model trained on the basis of word segmentation, and non-entity strings are obtained as high-frequency substrings using the Nagao algorithm, so that word string recognition makes better use of the attributes the word strings themselves possess. The popularity calculation considers not only features such as the position in which a word string occurs, its frequency and its inverse document frequency, but also makes full use of how the word string fluctuates over time. In this way, the efficiency of hot word extraction is improved on the one hand, and the accuracy of hot word extraction is ensured on the other, especially for out-of-vocabulary words that are hot words.

Brief description of the drawings

Fig. 1 is a flowchart of the method for mining Internet hot words provided by an embodiment of the present invention.

Fig. 2 is a module diagram of the device for mining Internet hot words provided by an embodiment of the present invention.

Detailed description of the embodiments

To make the purpose, technical method and advantages of the embodiments of the present invention clearer, the technical scheme provided by the embodiments is described in detail below with reference to the accompanying drawings; the description does not limit the present invention.

A hot word is a word used with high frequency within a certain period and therefore has a time attribute. The embodiment of the present invention accordingly builds a background library to store the corpus and statistical information from before a given period. Hot words are divided into entity strings and non-entity strings, so that the attributes of each type of entity string can be better exploited for training, and a high-frequency string statistics algorithm is used to extract candidate non-entity strings. The popularity calculation considers not only basic information such as word string position, word frequency and inverse document frequency, but also the fluctuation of the word string's distribution over time, which improves the efficiency and accuracy of hot word extraction.

Fig. 1 shows the flowchart of the method for mining Internet hot words provided by an embodiment of the present invention, which comprises the following steps.

Step A: build and initialize the word graph Words and the background library Corpus.

The word graph Words stores the words and candidate strings extracted at each level.

The background library Corpus is divided into a corpus store and a word string index store. The corpus store holds the Internet resources from before a given period from which hot words are to be extracted. The word string index store holds the word strings contained in those resources and their corresponding indicator values; the statistical indicators generally include word string position, word string frequency and word string document count. At initialization, both the corpus store and the word string index store are empty.
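
One possible in-memory layout for the two stores `Words` and `Corpus` is sketched below. The class and field names (`StringStats`, `title_freq`, `documents`, `index` and so on) are illustrative assumptions, not structures defined in the text.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class StringStats:
    """Indicators of one word string in one time unit (field names are illustrative)."""
    title_freq: int = 0   # occurrences in titles
    body_freq: int = 0    # occurrences in body text
    doc_count: int = 0    # number of documents containing the string

    @property
    def total_freq(self):
        return self.title_freq + self.body_freq


@dataclass
class Corpus:
    """Corpus store plus word string index store."""
    documents: list = field(default_factory=list)
    # index[word_string][time_unit] -> StringStats
    index: dict = field(default_factory=lambda: defaultdict(dict))


@dataclass
class Words:
    """Candidate strings collected by each extraction step."""
    candidates: set = field(default_factory=set)


# Initialization: both stores start empty.
words, corpus = Words(), Corpus()
print(len(corpus.documents), len(corpus.index), len(words.candidates))  # 0 0 0
```

Keying the index by word string and then by time unit keeps each period's counts separate, which is what a time-based fluctuation weight needs.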

Step B: entity string recognition.

Using sentence terminators such as ".", "!" and "?" as the delimiter, the raw Internet data is split into individual original word string sequences.
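
The sentence split can be sketched with a regular expression. The terminator set below, which mixes full-width Chinese and ASCII marks, is an assumption about what the source data contains.

```python
import re

# Assumed terminator set: full-width and ASCII sentence-ending marks.
TERMINATORS = "。！？!?"


def split_sentences(text):
    """Split raw text into word string sequences at sentence terminators."""
    parts = re.split("[" + re.escape(TERMINATORS) + "]+", text)
    return [p.strip() for p in parts if p.strip()]


print(split_sentences("今天天气好。你去哪里？走吧!"))
# ['今天天气好', '你去哪里', '走吧']
```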

Each word string sequence is segmented into atomic units; an atomic unit is a substring that cannot be split any further. For example, for the sentence "The 18th Third Plenary Session will be held in Beijing from November 9 to 12", the atomic units are: 18, Third Plenary Session, will, 9, to, 12, November.

The atomic units are combined pairwise to obtain a binary coarse segmentation of the word string sequence, and the N best coarse segmentation results, chosen according to word string frequency, are added to the word graph Words.
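
A simplified sketch of atom segmentation and pairwise combination follows. The atom rule (a run of digits, a run of Latin letters, or any other single non-space character) is an assumption, and ranking pairs by raw frequency is a stand-in for whatever selection of the N best coarse segmentations the method actually uses.

```python
import re
from collections import Counter

# Assumed atom rule: a digit run, a Latin-letter run, or one other character.
ATOM = re.compile(r"[0-9]+|[A-Za-z]+|\S")


def atoms(sentence):
    """Split a sentence into atomic units."""
    return ATOM.findall(sentence)


def bigrams(units):
    """Combine neighbouring atomic units pairwise."""
    return [a + b for a, b in zip(units, units[1:])]


def top_n_bigrams(sentences, n):
    """Rank pairs by string frequency and keep the n most frequent."""
    counts = Counter(bg for s in sentences for bg in bigrams(atoms(s)))
    return [bg for bg, _ in counts.most_common(n)]


print(atoms("11月9日"))             # ['11', '月', '9', '日']
print(bigrams(atoms("11月9日")))    # ['11月', '月9', '9日']
```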

Step C: non-entity string recognition.

Using in-sentence punctuation marks such as ",", "." and ";" as the delimiter, the raw Internet data is split into individual word string sequences.

The Nagao algorithm is used to count the frequency of the substrings of these word strings. Substrings whose frequency of occurrence exceeds a certain threshold are retained, and substring reduction is applied under a given strategy to obtain the candidate substrings.
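
The sketch below counts substring frequencies naively and then applies one common form of substring reduction. The naive count is a stand-in for the Nagao algorithm, which obtains the same counts more efficiently by sorting suffix pointers, and the reduction rule (drop a substring when a longer retained string has the same count) is an assumption, since the text only says "a given strategy".

```python
from collections import Counter


def substring_counts(sequences, max_len):
    """Count every substring of length 2..max_len (naive stand-in for Nagao)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq)):
            for n in range(2, max_len + 1):
                if i + n <= len(seq):
                    counts[seq[i:i + n]] += 1
    return counts


def reduce_substrings(counts, threshold):
    """Keep substrings with count > threshold, minus those subsumed by a longer one."""
    frequent = {s: c for s, c in counts.items() if c > threshold}
    return {
        s: c for s, c in frequent.items()
        if not any(s != t and s in t and c == frequent[t] for t in frequent)
    }


sequences = ["不明觉厉啊", "真不明觉厉", "不明觉厉"]
print(reduce_substrings(substring_counts(sequences, max_len=4), threshold=2))
# {'不明觉厉': 3}
```

Every shorter fragment such as "不明" occurs exactly as often as "不明觉厉", so it carries no independent evidence and is dropped in favour of the longer string.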

Step D: word string statistical indicator update.

The index store mainly records indicator values for each word string, such as its position, its frequency, the number of documents in which it occurs and the current statistics time.

The recognized entity strings and non-entity strings are written to the word string index store in the background library incrementally: if a word string already exists in the index store, only its indicators for the current time are updated; if it does not exist, it is written as a new entry.
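
The incremental write can be sketched as a merge into a nested dictionary. The index layout, the indicator names `freq` and `docs`, and the date strings used as time units are assumptions for illustration.

```python
from collections import defaultdict


def update_index(index, time_unit, doc_strings):
    """Merge one batch of documents (each a list of extracted strings) into the index.

    index[word_string][time_unit] -> {"freq": int, "docs": int}  (assumed layout)
    """
    for strings in doc_strings:
        for s in set(strings):
            # Existing strings only gain or extend an entry for this time unit;
            # unseen strings are created on first access.
            entry = index[s].setdefault(time_unit, {"freq": 0, "docs": 0})
            entry["docs"] += 1
            entry["freq"] += strings.count(s)
    return index


index = defaultdict(dict)
update_index(index, "2013-11-26", [["a", "b", "a"], ["a"]])
update_index(index, "2013-11-27", [["a"], ["c"]])
print(index["a"])
# {'2013-11-26': {'freq': 3, 'docs': 2}, '2013-11-27': {'freq': 1, 'docs': 1}}
```

Earlier time units are never recomputed; each new batch touches only the entries for its own time unit.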

Step E: word string popularity calculation.

The base weight and fluctuation weight of each word string are calculated. The base weight is determined by the word string's position, frequency and document count, and the fluctuation weight is driven by time.

Further, the smoothed base weight is calculated as follows:

Convbasew(s) = log(1 + log(1 + log(basew(s)))).

Wavew(s, t) = […], t ∈ [1, T], where t is one time unit.

Word string popularity is calculated as: finalweight(s, t) = Convbasew(s) * Wavew(s, t).

Step F: word string extraction.

Sorting the word strings from high to low by popularity yields the hot personal names, place names, organization names and hot non-entity words of a given period.

Fig. 2 shows the device for mining Internet hot words provided by an embodiment of the present invention, comprising: a storage unit 101; an entity recognition unit 102; a non-entity string recognition unit 103; and a hot word extraction unit 104.

The storage unit 101 is mainly responsible for storing resources and data and provides corresponding access interfaces to the other modules, such as access to words and access to statistical indicators.

The entity recognition unit 102 builds a cascaded hidden Markov model and, on the basis of word segmentation, extracts entity names such as personal names, place names and organization names.

The non-entity string recognition unit 103 splits the corpus into word string sequences by clause, uses the Nagao algorithm to collect the substrings whose frequency of occurrence exceeds a certain threshold, applies substring reduction and garbage string filtering, and then excludes entity strings to obtain the non-entity strings.

The hot word extraction unit 104 is responsible for popularity calculation and popularity ranking, and comprises the statistical indicator update module 104_1, the popularity calculation module 104_2, and the popularity ranking and hot word output module 104_3.

This embodiment uses Internet news, forums and blogs as source data and one day as the time unit. Entity recognition reaches a speed of about 500K per second, and non-entity recognition is faster, at about 2M per second. Hot word mining accuracy and recall both reach a relatively high level, which satisfies engineering applications and can effectively serve hot spot monitoring.

Claims

1. A method for mining Internet hot words, characterized in that the method comprises:

Step A: building and initializing a word graph Words and a background library Corpus;

Step B: entity string recognition;

Step C: non-entity string recognition;

Step D: word string statistical indicator update;

Step E: word string popularity calculation;

Step F: hot word ranking and output.

2. The method for mining Internet hot words as claimed in claim 1, characterized in that the word graph is used to store the intermediate words extracted; the background library is used to store the background corpus and the quantized value of each statistical indicator per unit time; each statistical indicator is an indicator serving the calculation of word string popularity, and the statistical indicators differ according to the popularity calculation method used.

3. The method for mining Internet hot words as claimed in claim 1, characterized in that, using sentence terminators as the delimiter, the raw Internet data is split into individual original word string sequences before the next processing step.

4. The method for mining Internet hot words as claimed in claim 1, characterized in that entity strings include personal names, place names, organization names and the like; entity string recognition is based on word segmentation; three interconnected levels of hidden Markov models are built, which from bottom to top are the personal name HMM, the place name HMM and the organization name HMM; each level uses a hidden Markov model as its basic algorithmic model, and together they form a cascaded hidden Markov model.

5. The method for mining Internet hot words as claimed in claim 1 and claim 4, characterized in that each HMM layer uses an improved Viterbi algorithm and an N-Best strategy, sending its N best results to the word graph Words for use by the higher-level models.

6. The method for mining Internet hot words as claimed in claim 4, characterized in that the lower-level HMMs support parameter estimation for the higher-level models through the word generation model.

7. The method for mining Internet hot words as claimed in claim 4, characterized in that the input to the first-layer personal name recognition model is the binary coarse segmentation sequence obtained after word segmentation, and the top-level HMM performs organization name recognition on the basis of personal name and place name recognition.

8. The method for mining Internet hot words as claimed in claim 1, characterized in that a string frequency statistics algorithm, such as the Nagao algorithm, is used to count the string frequency of substrings of length L in the word strings; substrings whose string frequency exceeds a certain threshold are extracted, and substring reduction and garbage substring filtering are applied.

9. The method for mining Internet hot words as claimed in claim 1 and claim 2, characterized in that the corpus in the background library is updated at fixed points and the word string statistical indicators are updated at the same time, where word strings here means entity strings and non-entity strings.

10. The method for mining Internet hot words as claimed in claim 1, characterized in that the word string popularity weight is divided into a base weight and a fluctuation weight; the calculation of the word string popularity weight depends on the values of the word string statistical indicators and is performed as follows:

Word string popularity: finalweight(s, t) = Convbasew(s) * Wavew(s, t), where Convbasew(s) is the base weight of the word string and Wavew(s, t) is its fluctuation weight.

11. The method for mining Internet hot words as claimed in claim 1 and claim 10, characterized in that the base weight is determined by the position in which the word string occurs, its frequency and its inverse document frequency; the fluctuation weight is described by the time attenuation degree of the word string, that is, how the word string frequency changes over time.

12. The method for mining Internet hot words as claimed in claim 11, characterized in that the base weight is calculated as follows:

basew(s) = titlew(s) * […] + contentw(s), where titlew is the weight of the word string's occurrences in titles, contentw is the weight of its occurrences in body text, both weights are measured with tf-idf, and […] is a function coefficient reflecting the difference between a word string occurring in a title and in body text;

Convbasew(s) = log(1 + log(1 + log(basew(s)))).

13. The method for mining Internet hot words as claimed in claim 11, characterized in that the fluctuation weight is calculated as: Wavew(s, t) = […], t ∈ [1, T], where t is one time unit.

14. The method for mining Internet hot words as claimed in claim 1, characterized in that the word strings are sorted in descending order of popularity weight, and the word strings whose popularity within a certain time exceeds a certain threshold are output as hot words, including hot personal names, place names, organization names and non-entity words.

15. A device for mining Internet hot words provided by the invention, characterized in that it comprises the following modules:

a storage unit 101, responsible for storing and supplying the word graph, the background library and the like;

an entity recognition unit 102, responsible for word string segmentation and entity string recognition, including personal name, place name and organization name recognition;

a non-entity string recognition unit 103, responsible for high-frequency string extraction, garbage string filtering and candidate non-entity string extraction;

a hot word extraction unit 104, mainly responsible for updating the statistical indicators of word strings in the background library, calculating word string popularity, and ranking the word strings by popularity and outputting them.

16. The device for mining Internet hot words as claimed in claim 15, characterized in that the hot word extraction unit 104 in turn consists of a statistical indicator update module 104_1, a popularity calculation module 104_2, and a popularity ranking and hot word output module 104_3.