CN104679738A - Method and device for mining Internet hot words - Google Patents

Info

Publication number
CN104679738A
Authority
CN
China
Prior art keywords
word
string
internet
hot
word string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310607937.0A
Other languages
Chinese (zh)
Other versions
CN104679738B (en)
Inventor
肖诗斌
孙丽华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TOLS INFORMATION TECHNOLOGY Co.,Ltd.
Original Assignee
BEIJING TRS INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING TRS INFORMATION TECHNOLOGY Co Ltd
Priority to CN201310607937.0A
Publication of CN104679738A
Application granted
Publication of CN104679738B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method for mining Internet hot words. The method comprises the following steps: initializing a word graph and a background library; identifying an entity string and a non-entity string; updating a word string statistical index; calculating the popular degree of the word string; and sorting the popular degree of the word string, and outputting the word string. The word string is divided into the entity string and the non-entity string, the entity string and the non-entity string are subjected to targeted division identification, the background library is arranged to realize the incremental updating of corpuses and calculation indexes, and hot word extraction accuracy and efficiency can be improved. Meanwhile, the invention also provides a device for mining the Internet hot words. The device comprises a storage unit, an entity string identification unit, a non-entity string identification unit and a hot word extraction unit, wherein the hot word extraction unit finishes the incremental updating of the statistical index, the calculation of the popular degree of the word string and word string sorting output. The hot words can be orderly, efficiently and accurately extracted.

Description

Internet hot word mining method and device
Technical field
The present invention relates to natural language processing technology, and in particular to an Internet hot word mining method and device.
Background art
Hot words are words used with high frequency within a certain period. They often carry the characteristics of their time and reflect the hot topics and livelihood issues of that period. Besides words already included in dictionaries, Internet hot words also include Internet buzzwords, which originate and spread in cyberspace and are widely used in everyday communication, such as "何弃疗" ("why give up treatment"), "不明觉厉" ("I don't understand it, but it sounds impressive"), and "陈欧体" ("Chen Ou style"). Word segmentation systems usually have difficulty recognizing such words. Internet buzzwords have emerged as an important new propagation phenomenon on the Internet, and they evolve and change over time.
Internet hot words are closely tied to social events and phenomena and have become an instrument for expressing and supervising public opinion. Accurate and efficient extraction of Internet hot words is a foundation for important tasks such as public opinion monitoring and Internet research.
Current hot word mining techniques fall into two groups. The first mines hot words through document clustering; such methods usually have high clustering complexity and cannot meet the real-time requirements of Internet hot word mining. The second uses a machine learning model to classify whether a word string is a hot word, based on features such as word-formation boundaries and temporal distribution. These methods require knowledge base support on the one hand; on the other hand, the selected features are mostly features common to all word strings, with no special handling of special word strings, so noise words are numerous and the accuracy of hot word discovery is low.
Each type of entity string has its own word-formation rules; for example, a person name string consists of one of a limited set of surnames plus high-frequency given-name characters. Large entity string knowledge bases are also available, which makes entity strings convenient for machine learning models to learn. The present invention therefore divides Internet hot words into entity strings and non-entity strings and proposes an Internet hot word mining method and device to address the efficiency bottleneck of Internet hot word mining.
Summary of the invention
In view of this, the main purpose of the present invention is to provide an Internet hot word mining method and device that improve the accuracy and efficiency of hot word mining.
The invention provides an Internet hot word mining method comprising the following steps.
Step A: build the word graph Words and the background library Corpus, and initialize them.
The word graph Words stores the words extracted in each step.
The background library Corpus holds the source data collected from the Internet and also records the statistical indicator results for each time unit, such as title string frequency, body-text string frequency, and total string frequency.
Step B: entity string recognition.
Using sentence terminators as the delimiter, the raw Internet data is cut into individual original word string sequences.
Atomic segmentation is applied to each word string sequence, and the atomic units are combined pairwise to achieve a bigram coarse segmentation of the sequence; the best N coarse segmentation results are extracted and added to the word graph Words.
Three cascaded hidden Markov models (HMMs) are built; from bottom to top they are the person name recognition HMM, the place name recognition HMM, and the organization name recognition HMM. Each level uses a hidden Markov model as its basic algorithmic model, and together they form a cascaded hidden Markov model (Cascaded Hidden Markov Model, Cascaded HMM for short).
Each HMM layer adopts an N-best strategy and sends its best N results to the word graph Words for use by the higher-level model.
Through the word generation model, the lower-layer HMM supports parameter estimation for the higher-layer HMM.
The input to the first layer, person name recognition, is the bigram coarse segmentation sequence. Each HMM layer uses an improved Viterbi algorithm and sends its best N results to the word graph for the next level up.
The top-level HMM performs organization name recognition on the basis of the person name and place name recognition results.
Step C: non-entity string recognition.
The Nagao algorithm is used to count the string frequency of substrings of length L in the word strings; substrings whose string frequency exceeds a threshold are extracted, and substring reduction is performed.
Cascaded garbage-string filtering is then applied using strategies such as general geological coordinate system filtration, IWP filtering, mutual information filtering, and head/tail character filtering, which yields candidate strings; removing the entity strings from the candidates leaves the non-entity strings.
Step D: word string statistical indicator update.
Word strings comprise the candidate entity strings and non-entity strings, that is, the strings extracted in the preceding steps.
The word string statistical indicators here are the statistics that serve the popularity calculation, such as the frequency with which a word string occurs in titles and in body text, the total frequency of the word string, the number of documents in which it occurs, and its frequency within a given time unit.
The background library Corpus records the statistical indicator values of word strings for each update time unit of the source data. When Internet data for a subsequent time unit arrives, the corpus in Corpus is incrementally updated, and the recorded indicators are incrementally updated at the same time.
Step E: word string popularity calculation.
The word string popularity weight is divided into a base weight and a fluctuation weight, and popularity is calculated from the statistical indicators updated in real time in the background library.
The base weight is determined by the position, frequency, and inverse document frequency of the string's occurrences.
The fluctuation weight is described by the temporal attenuation degree of the word string.
A hot word is defined as a word used frequently and heavily within a certain period, so the temporal attenuation degree of a term (attenuation degree for short) is used to characterize how the word string's frequency changes over time.
Further, the base weight is calculated as follows:
basew(s) = titlew(s) * + contentw(s), where titlew is the weight of the word string's occurrences in titles and contentw is the weight of its occurrences in body text. The weights are measured with tf-idf, and the factor multiplying titlew(s) is a function coefficient that reflects the difference between a word string appearing in the title and in the body text.
To balance low-frequency and high-frequency strings, the base weight is smoothed as follows:
Convbasew(s) = log(1 + log(1 + log(basew(s)))).
The fluctuation weight is the attenuation of the word string's frequency over time and is calculated as follows:
Wavew(s, t) = , t ∈ [1, T], where t is a time unit.
Word string popularity: finalweight(s, t) = Convbasew(s) * Wavew(s, t).
Step F: hot word sorting and output.
Sorting the word strings in descending order of popularity weight yields the hot person names, place names, organization names, and hot non-entity words for a period of time.
In addition, the present invention also provides an Internet hot word mining device comprising: a storage unit 101, an entity recognition unit 102, a non-entity string recognition unit 103, and a hot word extraction unit 104.
The storage unit 101 is mainly responsible for storing and supplying the background library, the word graph, intermediate results, and so on.
The entity recognition unit 102 is mainly responsible for word string segmentation and entity string recognition, including person name recognition, place name recognition, and organization name recognition.
The non-entity string recognition unit 103 is mainly responsible for high-frequency string extraction, cascaded garbage-string filtering, and candidate non-entity string extraction.
The hot word extraction unit 104 is mainly responsible for updating the statistical indicators of word strings in the background library, calculating the popularity of entity strings and non-entity strings, and sorting word strings by popularity and outputting them.
The hot word extraction unit in turn consists of a statistical indicator update module 104_1, a popularity calculation module 104_2, and a popularity sorting and hot word output module 104_3.
The statistical indicator update module 104_1 uses an incremental update mechanism to calculate and update the statistical indicators of word strings not present in the background library.
The popularity calculation module 104_2 calculates the base weight and fluctuation weight of each word string from the statistical indicators to obtain the word string's popularity value.
The popularity sorting and hot word output module 104_3 sorts word strings from high to low by popularity value and outputs those whose popularity value exceeds a threshold as hot words.
As can be seen from the above scheme, in the Internet hot word mining method and device provided by the embodiments of the present invention, a background library is set up so that word string statistical indicators can be updated in real time per time unit. Internet hot words are divided into entity strings and non-entity strings and recognized separately: entity strings are obtained with machine learning models trained on segmented text, and non-entity strings are obtained as high-frequency substrings via the Nagao algorithm, so that recognition makes better use of the attribute features of the word strings themselves. The popularity calculation considers, in addition to features such as the position, frequency, and inverse document frequency of a word string's occurrences, the way the word string fluctuates over time. This improves hot word extraction efficiency on the one hand and ensures extraction accuracy on the other, especially for hot words that are out-of-vocabulary words.
Brief description of the drawings
Fig. 1 is a flowchart of an Internet hot word mining method provided by an embodiment of the present invention.
Fig. 2 is a module diagram of an Internet hot word mining device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical approach, and advantages of the embodiments of the present invention clearer, the technical solution provided by the embodiments is described in detail below with reference to the drawings; the description does not limit the present invention.
Hot words are words used with high frequency within a certain period and therefore have a time attribute. The embodiments of the present invention accordingly build a background library to store the corpus and statistical information from before a given time. Hot words are also divided into entity strings and non-entity strings so that the attribute features of each kind of entity string can be better exploited for training, while a high-frequency string statistics algorithm extracts candidate non-entity strings. The popularity calculation considers not only basic information such as word string position, word frequency, and inverse document frequency, but also the fluctuation of the word string's distribution over time, which improves the efficiency and accuracy of hot word extraction.
As shown in Fig. 1, the flowchart of an Internet hot word mining method provided by an embodiment of the present invention comprises the following steps.
Step A: build and initialize the word graph Words and the background library Corpus.
The word graph Words stores the words and candidate strings extracted at each level.
The background library Corpus is divided into a corpus store and a word string indicator store. The corpus store holds the Internet resources, collected up to a given time, from which hot words are to be extracted. The word string indicator store holds the word strings contained in those resources and their statistical indicator values; the statistical indicators generally include word string position, word string frequency, and the number of documents containing the word string. At initialization, both the corpus store and the word string indicator store are empty.
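As an illustration of step A, the two stores could be laid out roughly as follows. This is only a sketch: the class names, field names, and the use of a plain set for the word graph are assumptions made for the example, not structures given in the patent.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class StringStats:
    """Indicators of one word string in one time unit (field names are illustrative)."""
    title_freq: int = 0   # occurrences in titles
    body_freq: int = 0    # occurrences in body text
    doc_count: int = 0    # number of documents containing the string

    @property
    def total_freq(self) -> int:
        return self.title_freq + self.body_freq


@dataclass
class Corpus:
    """Background library: a corpus store plus a word string indicator store."""
    # time unit -> list of raw documents collected in that unit
    documents: dict = field(default_factory=lambda: defaultdict(list))
    # word string -> time unit -> StringStats
    index: dict = field(default_factory=lambda: defaultdict(dict))


words = set()      # simplified stand-in for the word graph Words
corpus = Corpus()  # both stores start empty, as described for initialization
```

Keeping the indicators keyed by time unit is what later allows the fluctuation of a word string over time to be read directly from the store.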
Step B: entity string recognition.
Using sentence terminators such as ".", "!", and "?" as the delimiter, the raw Internet data is cut into individual original word string sequences.
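The sentence-level cutting can be sketched with a regular expression. The terminator set below, including the full-width Chinese forms, and the function name are assumptions for illustration.

```python
import re

# Sentence terminators; including the full-width forms is an assumption for Chinese text.
TERMINATORS = "。！？.!?"


def split_sentences(text):
    """Cut raw text into sentence-level word strings at terminator characters."""
    parts = re.split("[" + re.escape(TERMINATORS) + "]+", text)
    return [p.strip() for p in parts if p.strip()]


sentences = split_sentences("今天天气好。明天呢？Fine!")
```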
Atomic segmentation is applied to each word string sequence to obtain atomic units, which are substrings that cannot be cut further. For example, for the sentence "18 Third Plenary Sessions will be held in Beijing November 9 to 12 days", the atomic units are: 18 Third Plenary Sessions will 9 to 12 November.
The atomic units are combined pairwise to achieve a bigram coarse segmentation of the word string sequence, and, according to word string frequency, the best N coarse segmentation results are extracted and added to the word graph Words.
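A much simplified reading of this step is to join adjacent atomic units into pairs and keep the N most frequent pairs. The sketch below does only that; it assumes single characters as atomic units and ranks by raw count, whereas the segmentation described here may well rank whole segmentation paths. The function name is hypothetical.

```python
from collections import Counter


def top_bigram_candidates(atom_sequences, n):
    """Join adjacent atomic units pairwise and keep the n most frequent pairs."""
    counts = Counter()
    for atoms in atom_sequences:
        for left, right in zip(atoms, atoms[1:]):
            counts[left + right] += 1
    return [pair for pair, _ in counts.most_common(n)]


# Toy input: two word strings already cut into single-character atoms.
candidates = top_bigram_candidates([list("北京大学"), list("北京市")], n=1)
```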
Three cascaded hidden Markov models (HMMs) are built; from bottom to top they are the person name recognition HMM, the place name recognition HMM, and the organization name recognition HMM. Each level uses a hidden Markov model as its basic algorithmic model, and together they form a cascaded hidden Markov model (Cascaded Hidden Markov Model, Cascaded HMM for short).
Each HMM layer adopts an N-best strategy and sends its best N results to the word graph Words for use by the higher-level model.
Through the word generation model, the lower-layer HMM supports parameter estimation for the higher-layer HMM.
The input to the first layer, person name recognition, is the bigram coarse segmentation sequence. Each HMM layer uses an improved Viterbi algorithm and sends its best N results to the word graph for the next level up.
The top-level HMM performs organization name recognition on the basis of the person name and place name recognition results.
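For orientation, a single HMM layer can be decoded with the textbook Viterbi algorithm, sketched below in log space. This is the plain 1-best version, not the improved N-best variant the method refers to, and the role tags ("S" surname, "G" given-name character, "O" other) and all probabilities are made-up toy values.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the single most probable state sequence for the observations."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s] = (log-probability of the best path ending in s at time t, previous state)
    best = [{s: (lg(start_p[s]) + lg(emit_p[s].get(observations[0], 0)), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + lg(trans_p[p].get(s, 0)), p) for p in states
            )
            column[s] = (score + lg(emit_p[s].get(obs, 0)), prev)
        best.append(column)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Toy person name model; every number here is invented for the example.
states = ("S", "G", "O")
start_p = {"S": 0.5, "G": 0.0, "O": 0.5}
trans_p = {"S": {"G": 1.0}, "G": {"G": 0.3, "O": 0.7}, "O": {"S": 0.3, "O": 0.7}}
emit_p = {"S": {"王": 1.0}, "G": {"伟": 0.8, "说": 0.2}, "O": {"王": 0.1, "伟": 0.1, "说": 0.8}}
path = viterbi(list("王伟说"), states, start_p, trans_p, emit_p)
```

In a cascade, the tags produced by one layer (here, the span tagged S followed by G as a person name) would be passed upward as input to the next layer.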
Step C: non-entity string recognition.
Using in-sentence punctuation marks such as ",", ".", and ";" as the delimiter, the raw Internet data is cut into individual word string sequences.
The Nagao algorithm is used to count the frequency of the substrings of these word strings; the substrings whose frequency of occurrence exceeds a threshold are obtained, and substring reduction is performed according to a chosen strategy to obtain the candidate substrings.
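The counting and reduction can be illustrated with a naive counter. Note the assumptions: a brute-force `Counter` stands in for the Nagao algorithm (which uses sorted suffix tables to scale to large corpora), and the reduction rule shown, dropping a substring when a longer candidate containing it has the same frequency, is one common strategy rather than the one prescribed here.

```python
from collections import Counter


def frequent_substrings(strings, max_len, threshold):
    """Count substrings of length 2..max_len and keep those with frequency > threshold."""
    counts = Counter()
    for s in strings:
        for n in range(2, max_len + 1):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
    return {sub: c for sub, c in counts.items() if c > threshold}


def reduce_substrings(freq):
    """Drop a substring when a longer candidate contains it with the same frequency."""
    return {
        sub: c for sub, c in freq.items()
        if not any(sub != other and sub in other and c == oc
                   for other, oc in freq.items())
    }


freq = frequent_substrings(["不明觉厉", "真是不明觉厉", "不明觉厉啊"], max_len=4, threshold=2)
candidates = reduce_substrings(freq)
```

In this toy input every fragment of the four-character string occurs exactly as often as the full string, so only the full string survives reduction.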
Cascaded garbage-string filtering is then applied using strategies such as general geological coordinate system filtration, IWP filtering, mutual information filtering, and head/tail character filtering, which yields candidate strings; removing the entity strings from the candidates leaves the non-entity strings.
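Of the filters named above, mutual information filtering is the easiest to sketch. The version below scores a candidate by the smallest pointwise mutual information over all ways of splitting it into two parts, so that a low score suggests the string is an accidental concatenation. The function name, the frequency table, and the idea of thresholding on the minimum are assumptions for illustration; the exact criterion is not specified here.

```python
import math


def min_split_pmi(candidate, freq, total):
    """Smallest pointwise mutual information over all two-part splits of candidate."""
    p_whole = freq[candidate] / total
    scores = []
    for i in range(1, len(candidate)):
        left, right = candidate[:i], candidate[i:]
        p_left = freq[left] / total
        p_right = freq[right] / total
        scores.append(math.log(p_whole / (p_left * p_right)))
    return min(scores)


# Toy counts: "ab" occurs 8 times, "a" and "b" 10 times each, out of 100 strings.
score = min_split_pmi("ab", {"ab": 8, "a": 10, "b": 10}, total=100)
```

A candidate would be kept only when its score exceeds a chosen cutoff, which would have to be tuned on real data.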
Step D: word string statistical indicator update.
The indicator store mainly records, for each word string, indicator values such as the number of the record in which it appears, its position, its frequency, the number of documents containing it, and the current statistics time.
The recognized entity strings and non-entity strings are written incrementally to the word string indicator store in the background library: if a word string already exists in the indicator store, only its indicators for the current time are updated; if it does not exist, it is written as a new entry.
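The incremental write could look like the following sketch. The nested dictionary layout and the two indicators tracked ("freq" and "docs") are assumptions chosen to keep the example short.

```python
def update_index(index, time_unit, observations):
    """Merge one time unit's counts into the indicator store, in place.

    index:        {word_string: {time_unit: {"freq": int, "docs": int}}}
    observations: iterable of (word_string, freq, docs) for the new time unit
    """
    for string, freq, docs in observations:
        per_time = index.setdefault(string, {})  # unseen string: write a new entry
        stats = per_time.setdefault(time_unit, {"freq": 0, "docs": 0})
        stats["freq"] += freq                    # known string: only this time unit changes
        stats["docs"] += docs
    return index


index = {}
update_index(index, "2013-11-26", [("A", 3, 2)])
update_index(index, "2013-11-27", [("A", 5, 4), ("B", 1, 1)])
```

Earlier time units are never rewritten, so the cost of an update depends only on the size of the newly arrived data.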
Step E: word string popularity calculation.
The base weight and the fluctuation weight of each word string are calculated; the base weight is determined by the word string's position, its frequency, and the number of documents containing it, and the fluctuation weight is affected by time.
Further, the base weight is calculated as follows:
basew(s) = titlew(s) * + contentw(s), where titlew is the weight of the word string's occurrences in titles and contentw is the weight of its occurrences in body text. The weights are measured with tf-idf, and the factor multiplying titlew(s) is a function coefficient that reflects the difference between a word string appearing in the title and in the body text.
To balance low-frequency and high-frequency strings, the base weight is smoothed as follows:
Convbasew(s) = log(1 + log(1 + log(basew(s)))).
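The smoothing formula can be checked numerically with the short sketch below. It assumes natural logarithms, since the base is not stated, and adds a domain check because the nested logarithms are only defined when basew is large enough (roughly above 0.53 for natural logarithms). The function name is illustrative.

```python
import math


def conv_basew(basew):
    """Triple-log smoothing: log(1 + log(1 + log(basew))), natural logs assumed."""
    inner = 1 + math.log(basew)
    if inner <= 0 or 1 + math.log(inner) <= 0:
        raise ValueError("basew is too small for the nested logarithms")
    return math.log(1 + math.log(inner))


low, high = conv_basew(10), conv_basew(1000)
```

A hundredfold difference in base weight shrinks to a difference of well under one after smoothing, which is the balancing effect between low-frequency and high-frequency strings described above.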
The fluctuation weight is the attenuation of the word string's frequency over time and is calculated as follows:
Wavew(s, t) = , t ∈ [1, T], where t is a time unit.
The word string popularity is calculated as: finalweight(s, t) = Convbasew(s) * Wavew(s, t).
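The way the two weights combine can be sketched as follows. Because the expression for Wavew(s, t) is not reproduced in this text, the function `wavew_placeholder` below is a purely hypothetical stand-in (frequency in time unit t relative to the mean over all T units) and should not be read as the patent's formula; only the final product follows the text.

```python
import math


def wavew_placeholder(freqs, t):
    """HYPOTHETICAL stand-in for Wavew: frequency at time unit t over the mean of 1..T."""
    mean = sum(freqs) / len(freqs)
    return freqs[t - 1] / mean if mean > 0 else 0.0


def final_weight(basew, freqs, t):
    """finalweight(s, t) = Convbasew(s) * Wavew(s, t), with natural logs assumed."""
    conv = math.log(1 + math.log(1 + math.log(basew)))
    return conv * wavew_placeholder(freqs, t)


rising = final_weight(10, [1, 1, 4], t=3)  # frequency spikes in the latest time unit
steady = final_weight(10, [2, 2, 2], t=3)  # same total frequency, no spike
```

With any reasonable fluctuation term, two strings with the same base weight are separated by how sharply their frequency has changed recently.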
Step F: word string extraction.
Sorting the word strings from high to low by popularity yields the hot person names, place names, organization names, and hot non-entity words for a period of time.
As shown in Fig. 2, an Internet hot word mining device provided by an embodiment of the present invention comprises: a storage unit 101; an entity recognition unit 102; a non-entity string recognition unit 103; and a hot word extraction unit 104.
The storage unit 101 is mainly responsible for storing resources and data and provides the other modules with the corresponding access interfaces, such as access to words and access to statistical indicators.
The entity recognition unit 102 builds the cascaded hidden Markov model and, on the basis of word segmentation, extracts entity names such as person names, place names, and organization names.
The non-entity string recognition unit 103 splits the corpus into word string sequences by clause, uses the Nagao algorithm to find the substrings whose frequency of occurrence exceeds a threshold, performs substring reduction and cascaded garbage-string filtering, and then removes the entity strings to obtain the non-entity strings.
The hot word extraction unit 104 is responsible for popularity calculation and popularity sorting and comprises a statistical indicator update module 104_1, a popularity calculation module 104_2, and a popularity sorting and hot word output module 104_3.
The statistical indicator update module 104_1 uses an incremental update mechanism to calculate and update the statistical indicators of word strings not present in the background library.
The popularity calculation module 104_2 calculates the base weight and fluctuation weight of each word string from the statistical indicators to obtain the word string's popularity value.
The popularity sorting and hot word output module 104_3 sorts word strings from high to low by popularity value and outputs those whose popularity value exceeds a threshold as hot words.
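The behavior of module 104_3 amounts to a sort followed by a cutoff, as in this minimal sketch. The function name, the input dictionary, and the threshold value are illustrative.

```python
def rank_hot_words(scores, threshold):
    """Sort word strings by popularity, highest first, keeping those above threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(string, weight) for string, weight in ranked if weight > threshold]


hot_words = rank_hot_words({"a": 0.2, "b": 0.9, "c": 0.5}, threshold=0.3)
```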
This embodiment uses Internet news, forums, and blogs as source data and one day as the time unit. Entity recognition reaches a speed of about 500K per second, and non-entity recognition is faster, reaching about 2M per second. Hot word mining precision and recall both reach a high level, which meets the needs of engineering applications and effectively serves hot spot monitoring.

Claims (16)

1. An Internet hot word mining method, characterized in that the method comprises:
Step A, building the word graph Words and the background library Corpus, and initializing them;
Step B, entity string recognition;
Step C, non-entity string recognition;
Step D, word string statistical indicator update;
Step E, word string popularity calculation;
Step F, hot word sorting and output.
2. The Internet hot word mining method according to claim 1, characterized in that the word graph is used to store the words extracted in intermediate steps; the background library is used to store the background corpus and the quantized value of each statistical indicator per unit of time; the statistical indicators are indicators that serve the word string popularity calculation, and they differ according to the popularity calculation method used.
3. The Internet hot word mining method according to claim 1, characterized in that, using sentence terminators as the delimiter, the raw Internet data is cut into individual original word string sequences before further processing.
4. The Internet hot word mining method according to claim 1, characterized in that entity strings include person names, place names, organization names, and the like; entity string recognition is based on word segmentation and builds three cascaded hidden Markov models (HMMs), which from bottom to top are the person name HMM, the place name HMM, and the organization name HMM; each level uses a hidden Markov model as its basic algorithmic model, and together they form a cascaded hidden Markov model.
5. The Internet hot word mining method according to claim 1 and claim 4, characterized in that each HMM layer uses an improved Viterbi algorithm with an N-best strategy and sends its best N results to the word graph Words for use by the higher-level model.
6. The Internet hot word mining method according to claim 4, characterized in that, through the word generation model, the lower-layer HMM supports parameter estimation for the higher-level model.
7. The Internet hot word mining method according to claim 4, characterized in that the input to the first-layer person name recognition model is the bigram coarse segmentation sequence obtained after word segmentation, and the top-level HMM performs organization name recognition on the basis of person name and place name recognition.
8. The Internet hot word mining method according to claim 1, characterized in that a statistical string frequency algorithm, such as the Nagao algorithm, is used to count the string frequency of substrings of length L in the word strings; substrings whose string frequency exceeds a threshold are extracted, and substring reduction and garbage substring filtering are performed.
9. The Internet hot word mining method according to claim 1 and claim 2, characterized in that the corpus in the background library is updated by fixed-point updating and the word string statistical indicators are updated at the same time, where word strings refer to entity strings and non-entity strings.
10. The Internet hot word mining method according to claim 1, characterized in that the word string popularity weight is divided into a base weight and a fluctuation weight; the calculation of the word string popularity weight depends on the values of the word string statistical indicators, and the calculation method is:
word string popularity finalweight(s, t) = Convbasew(s) * Wavew(s, t), where Convbasew(s) is the base weight of the word string and Wavew(s, t) is the fluctuation weight of the word string.
11. The Internet hot word mining method according to claim 1 and claim 10, characterized in that the base weight is determined by the position, frequency, and inverse document frequency of the word string's occurrences; the fluctuation weight is described by the temporal attenuation degree of the word string, that is, the change of the word string's frequency over time.
12. The Internet hot word mining method according to claim 11, characterized in that the base weight is calculated as:
basew(s) = titlew(s) * + contentw(s), where titlew is the weight of the word string's occurrences in titles and contentw is the weight of its occurrences in body text; the weights are measured with tf-idf, and the factor multiplying titlew(s) is a function coefficient that reflects the difference between a word string appearing in the title and in the body text;
to balance low-frequency and high-frequency strings, the base weight is smoothed as follows:
Convbasew(s) = log(1 + log(1 + log(basew(s)))).
13. The Internet hot word mining method according to claim 11, characterized in that the fluctuation weight is calculated as: Wavew(s, t) = , t ∈ [1, T], where t is a time unit.
14. The Internet hot word mining method according to claim 1, characterized in that the word strings are sorted in descending order of popularity weight, and those whose popularity within a certain time exceeds a threshold are output as hot words, including hot person names, place names, organization names, and non-entity words.
15. An Internet hot word mining device provided by the present invention, characterized by comprising the following modules:
a storage unit 101, responsible for storing and supplying the word graph, the background library, and the like;
an entity recognition unit 102, responsible for word string segmentation and entity string recognition, including person name, place name, and organization name recognition;
a non-entity string recognition unit 103, responsible for high-frequency string extraction, cascaded garbage-string filtering, and candidate non-entity string extraction;
a hot word extraction unit 104, mainly responsible for updating the statistical indicators of word strings in the background library, calculating word string popularity, and sorting word strings by popularity and outputting them.
16. The Internet hot word mining device according to claim 15, characterized in that the hot word extraction unit 104 in turn consists of a statistical indicator update module 104_1, a popularity calculation module 104_2, and a popularity sorting and hot word output module 104_3.
CN201310607937.0A 2013-11-27 2013-11-27 Internet hot words mining method and device Active CN104679738B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310607937.0A CN104679738B (en) 2013-11-27 2013-11-27 Internet hot words mining method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310607937.0A CN104679738B (en) 2013-11-27 2013-11-27 Internet hot words mining method and device

Publications (2)

Publication Number Publication Date
CN104679738A true CN104679738A (en) 2015-06-03
CN104679738B CN104679738B (en) 2018-02-27

Family

ID=53314802

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310607937.0A Active CN104679738B (en) 2013-11-27 2013-11-27 Internet hot words mining method and device

Country Status (1)

Country Link
CN (1) CN104679738B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105205048A (en) * 2015-10-21 2015-12-30 上海迪爱斯通信设备有限公司 Hot word analysis and statistic system and method
CN105488196A (en) * 2015-12-07 2016-04-13 中国人民大学 Automatic hot topic mining system based on internet corpora
CN105824803A (en) * 2016-03-31 2016-08-03 北京奇艺世纪科技有限公司 Method and device for determining hotspot event name
CN106407175A (en) * 2015-07-31 2017-02-15 北京国双科技有限公司 Method and device for processing character strings in new word discovery
CN106503256A (en) * 2016-11-11 2017-03-15 中国科学院计算技术研究所 A kind of hot information method for digging based on social networkies document
CN108009234A (en) * 2017-11-29 2018-05-08 苏州大学 A kind of abstracting method, device and the equipment of non-physical type argument
CN108446274A (en) * 2018-03-15 2018-08-24 北京科技大学 A kind of keyword extracting method based on time-sensitive tf-idf
CN108509490A (en) * 2018-02-09 2018-09-07 中国农业大学 A kind of network hot topic discovery method and system
CN108595435A (en) * 2018-05-03 2018-09-28 鹏元征信有限公司 A kind of organization names identifying processing method, intelligent terminal and storage medium
CN110750682A (en) * 2018-07-06 2020-02-04 武汉斗鱼网络科技有限公司 Title hot word automatic metering method, storage medium, electronic equipment and system
CN110765239A (en) * 2019-10-29 2020-02-07 腾讯科技(深圳)有限公司 Hot word recognition method, device and storage medium
CN111916058A (en) * 2020-06-24 2020-11-10 西安交通大学 Voice recognition method and system based on incremental word graph re-scoring
CN113076335A (en) * 2021-04-02 2021-07-06 西安交通大学 Network cause detection method, system, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101256557A (en) * 2008-04-16 2008-09-03 腾讯科技(深圳)有限公司 Custom word management apparatus, method and word segmentation system
CN101504667A (en) * 2009-03-20 2009-08-12 北京学之途网络科技有限公司 Keyword confirming method and system, weight vector learning method and system
US20090222883A1 (en) * 2008-02-29 2009-09-03 Zhen Zhong Huo Method and Apparatus for Confidential Knowledge Protection in Software System Development
CN101673305A (en) * 2009-09-29 2010-03-17 百度在线网络技术(北京)有限公司 Industry sorting method, industry sorting device and industry sorting server
CN102043843A (en) * 2010-12-08 2011-05-04 百度在线网络技术(北京)有限公司 Method and obtaining device for obtaining target entry based on target application
US20120130705A1 (en) * 2010-11-22 2012-05-24 Alibaba Group Holding Limited Text segmentation with multiple granularity levels

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090222883A1 (en) * 2008-02-29 2009-09-03 Zhen Zhong Huo Method and Apparatus for Confidential Knowledge Protection in Software System Development
CN101256557A (en) * 2008-04-16 2008-09-03 腾讯科技(深圳)有限公司 Custom word management apparatus, method and word segmentation system
CN101504667A (en) * 2009-03-20 2009-08-12 北京学之途网络科技有限公司 Keyword confirming method and system, weight vector learning method and system
CN101673305A (en) * 2009-09-29 2010-03-17 百度在线网络技术(北京)有限公司 Industry sorting method, industry sorting device and industry sorting server
US20120130705A1 (en) * 2010-11-22 2012-05-24 Alibaba Group Holding Limited Text segmentation with multiple granularity levels
CN102043843A (en) * 2010-12-08 2011-05-04 百度在线网络技术(北京)有限公司 Method and obtaining device for obtaining target entry based on target application

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
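Tang Yuanhua: "Automatic discovery and display of Web news hot-spot information", China Master's Theses Full-text Database, Information Science and Technology *
Li Yuqin et al.: "Hot word analysis technology for Internet public opinion", Journal of Chinese Information Processing *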

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407175A (en) * 2015-07-31 2017-02-15 北京国双科技有限公司 Method and device for processing character strings in new word discovery
CN105205048B (en) * 2015-10-21 2018-05-04 迪爱斯信息技术股份有限公司 Hot word analysis and statistic system and method
CN105205048A (en) * 2015-10-21 2015-12-30 上海迪爱斯通信设备有限公司 Hot word analysis and statistic system and method
CN105488196A (en) * 2015-12-07 2016-04-13 中国人民大学 Automatic hot topic mining system based on internet corpora
CN105488196B (en) * 2015-12-07 2019-01-22 中国人民大学 Automatic hot topic mining system based on internet corpora
CN105824803B (en) * 2016-03-31 2018-10-30 北京奇艺世纪科技有限公司 Method and device for determining hotspot event name
CN105824803A (en) * 2016-03-31 2016-08-03 北京奇艺世纪科技有限公司 Method and device for determining hotspot event name
CN106503256A (en) * 2016-11-11 2017-03-15 中国科学院计算技术研究所 Hot information mining method based on social network documents
CN106503256B (en) * 2016-11-11 2019-05-07 中国科学院计算技术研究所 Hot information mining method based on social network documents
CN108009234B (en) * 2017-11-29 2022-02-11 苏州大学 Extraction method, device and equipment of non-entity type argument
CN108009234A (en) * 2017-11-29 2018-05-08 苏州大学 Extraction method, device and equipment of non-entity type argument
CN108509490A (en) * 2018-02-09 2018-09-07 中国农业大学 Network hot topic discovery method and system
CN108509490B (en) * 2018-02-09 2020-10-02 中国农业大学 Network hot topic discovery method and system
CN108446274A (en) * 2018-03-15 2018-08-24 北京科技大学 Keyword extraction method based on time-sensitive TF-IDF
CN108595435A (en) * 2018-05-03 2018-09-28 鹏元征信有限公司 Organization name recognition processing method, intelligent terminal and storage medium
CN108595435B (en) * 2018-05-03 2020-09-01 鹏元征信有限公司 Organization name recognition processing method, intelligent terminal and storage medium
CN110750682A (en) * 2018-07-06 2020-02-04 武汉斗鱼网络科技有限公司 Title hot word automatic metering method, storage medium, electronic equipment and system
CN110765239A (en) * 2019-10-29 2020-02-07 腾讯科技(深圳)有限公司 Hot word recognition method, device and storage medium
CN110765239B (en) * 2019-10-29 2023-03-28 腾讯科技(深圳)有限公司 Hot word recognition method, device and storage medium
CN111916058A (en) * 2020-06-24 2020-11-10 西安交通大学 Voice recognition method and system based on incremental word graph re-scoring
CN113076335A (en) * 2021-04-02 2021-07-06 西安交通大学 Network cause detection method, system, equipment and storage medium
CN113076335B (en) * 2021-04-02 2024-05-24 西安交通大学 Network module factor detection method, system, equipment and storage medium

Also Published As

Publication number Publication date
CN104679738B (en) 2018-02-27

Similar Documents

Publication Publication Date Title
CN104679738A (en) Method and device for mining Internet hot words
CN107862070B (en) Online classroom discussion short text instant grouping method and system based on text clustering
CN104699763B (en) Text similarity measurement system based on multi-feature fusion
CN103268339B (en) Named entity recognition method and system in Twitter message
CN103984681B (en) News event evolution analysis method based on time sequence distribution information and topic model
CN103279478B (en) Document feature extraction method based on distributed mutual information
CN104199972A (en) Named entity relation extraction and construction method based on deep learning
CN105608200A (en) Network public opinion tendency prediction analysis method
CN104199965A (en) Semantic information retrieval method
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
CN109670039A (en) Semi-supervised e-commerce review sentiment analysis method based on tripartite graph and clustering
CN103699525A (en) Method and device for automatically generating abstract on basis of multi-dimensional characteristics of text
CN105320646A (en) Incremental clustering based news topic mining method and apparatus thereof
CN104008106A (en) Method and apparatus for obtaining hot topic
CN102253930A (en) Method and device for translating text
CN103646112A (en) Dependency parsing field self-adaption method based on web search
CN102411611A (en) Instant interactive text oriented event identifying and tracking method
CN104778256A (en) Rapid incremental clustering method for domain question-answering system consultations
CN109408802A (en) Method, system and storage medium for improving sentence vector semantics
CN103207864A (en) Online novel content similarity comparison method
CN105183765A (en) Big data-based topic extraction method
CN110457711A (en) Social media event topic recognition method based on descriptors
CN111090994A (en) Chinese-internet-forum-text-oriented event place attribution province identification method
CN102779119B (en) Method and device for extracting keywords
CN103869999A (en) Method and device for sorting candidate items generated by input method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100088 Beijing city Haidian District No. 6 Zhichun Road Jinqiu International Building 14 floor 14B04

Patentee after: TOLS INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 100088 Beijing city Haidian District No. 6 Zhichun Road Jinqiu International Building 14 floor 14B04

Patentee before: BEIJING TRS INFORMATION TECHNOLOGY Co.,Ltd.