CN107908618A - Hot word discovery method and apparatus - Google Patents

Hot word discovery method and apparatus

Info

Publication number
CN107908618A
Authority
CN
China
Prior art keywords
character string
word
data
information entropy
hot spot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711058951.4A
Other languages
Chinese (zh)
Inventor
陈思佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bank of China Ltd
Original Assignee
Bank of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bank of China Ltd
Priority to CN201711058951.4A
Publication of CN107908618A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the present invention discloses a hot word discovery method and apparatus, relating to the field of information processing, which can effectively capture hot words and thereby improve the adaptivity of a system. The method includes: collecting data generated by a network system; segmenting the character strings in the data based on an information entropy model to obtain candidate words; matching the candidate words against the words in a dictionary to obtain new words; performing a Bayesian average calculation according to the frequency of occurrence and the score of each new word to obtain the Bayesian average of the new word; and, if the Bayesian average is determined to satisfy a predetermined condition, determining that the new word is a hot word.

Description

Hot word discovery method and apparatus
Technical field
Embodiments of the present invention relate to the field of information processing, and in particular to a hot word discovery method and apparatus.
Background
Chinese has a very strong word-forming capacity. In theory, any combination of two or more Chinese characters may form a word, and this productivity makes new word identification extremely difficult.
In general, there are two main research approaches to new word identification: rule-based methods and statistics-based methods. The former use word-formation rules together with semantic or part-of-speech information to construct templates, and then find and identify new words by matching. The latter identify new words by collecting statistics on the composition or feature information of the entries in a corpus. Most researchers currently combine rules and statistics to exploit the advantages of both and so improve new word identification.
Another approach to new word identification is dictionary-based word segmentation. Its main idea is to match the string to be segmented against the existing entries of one or more dictionaries; if a string is found in a dictionary, the match succeeds. Dictionary-free segmentation, by contrast, is generally based on character frequency statistics: it does not rely on a dictionary but counts how often any two characters occur together in the text, and the higher the count, the more likely the pair is a word. In a combined approach, all possible words that match the vocabulary are first segmented out, and the optimal segmentation is then determined with a statistical language model and a decision algorithm.
After the text has been segmented, word frequencies are counted; words with higher frequencies are those that occur often. Hot words can then be obtained by comparing the high-frequency vocabulary with common words and filtering it. However, whether hot words are sought within a professional field or along a timeline, one important method is mutual comparison, that is, finding the lexical differences between fields or between the periods before and after a point in time, and a simple calculation of the change in word frequency or ratio does not achieve good results.
Summary of the invention
Embodiments of the present invention provide a hot word discovery method and apparatus that can effectively capture hot words and thereby improve the adaptivity of a system.
In a first aspect, a hot word discovery method is provided, including:
collecting data generated by a network system;
segmenting the character strings in the data based on an information entropy model to obtain candidate words;
matching the candidate words against the words in a dictionary to obtain new words;
performing a Bayesian average calculation according to the frequency of occurrence and the score of the new words to obtain the Bayesian average of the new words; and
if the Bayesian average is determined to satisfy a predetermined condition, determining that the new words are hot words.
In a second aspect, a hot word discovery apparatus is provided, including:
a collecting unit, configured to collect data generated by a network system;
a segmentation unit, configured to segment, based on an information entropy model, the character strings in the data collected by the collecting unit to obtain candidate words;
a matching unit, configured to match the candidate words obtained by the segmentation unit against the words in a dictionary to obtain new words; and
a hot word acquiring unit, configured to perform a Bayesian average calculation according to the frequency of occurrence and the score of the new words obtained by the matching unit to obtain the Bayesian average of the new words, and, if the Bayesian average is determined to satisfy a predetermined condition, to determine that the new words are hot words.
In the above solution, the hot word discovery apparatus collects data generated by a network system; segments the character strings in the data based on an information entropy model to obtain candidate words; matches the candidate words against the words in a dictionary to obtain new words; performs a Bayesian average calculation according to the frequency of occurrence and the score of the new words to obtain their Bayesian average; and, if the Bayesian average is determined to satisfy a predetermined condition, determines that the new words are hot words. Because hot word selection refers to a Bayesian average that combines the frequency of occurrence and the score of the new words, hot words can be captured more effectively than when they are determined from the frequency of occurrence or the ratio of the new words alone, which improves the adaptivity of the system.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a hot word discovery method provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of a hot word discovery apparatus provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Evidently, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The system architecture and business scenario described in the embodiments of the present invention are intended to explain the technical solutions of the embodiments more clearly and do not limit the technical solutions provided by the embodiments. A person of ordinary skill in the art will understand that, as system architectures evolve and new business scenarios appear, the technical solutions provided by the embodiments of the present invention remain applicable to similar technical problems.
The method is described in detail below with reference to specific embodiments. With reference to Fig. 1, the embodiments of the present invention are applied to the following scenario, which includes a database 11, a hot word discovery apparatus 12, a data analysis and mining server 13, a management platform 14, and a display and business support device 15.
The database 11 stores the network system data used by the hot word discovery apparatus 12, the hot words generated by the hot word discovery apparatus 12, and other data used by the data analysis and mining server 13. The management platform 14 performs status monitoring, rights management and security assurance for the other parts. The data analysis and mining server 13 mainly uses the hot words generated by the hot word discovery apparatus 12 and the other data provided by the database 11 to carry out public opinion analysis such as content association analysis, time series analysis, propagation trend analysis, hot topic identification, automatic summary generation and topic tracking. The display and business support device 15 may be a terminal device that serves as the human-computer interaction device; it mainly implements functions such as public opinion early warning, statistical reports, visualization and propagation topology according to the analysis results of the data analysis and mining server 13. The embodiments of the present invention mainly provide the hot word discovery apparatus 12, whose main functions are data collection, data cleansing, new word discovery and hot word screening.
Specifically, with reference to Fig. 2, an embodiment of the present invention provides a hot word discovery method, including:
101. Collect data generated by a network system.
After step 101, the method further includes performing data cleansing on the data. A new word usually refers to a word that has not appeared before or is not included in the dictionary. In the field of new word identification there is no unified definition of the concept of a "new word"; current research covers two aspects, unknown word identification (Unknown Words Identification, UWI) and new word identification (NWI). An unknown word is a word that does not appear in the dictionary currently in use. UWI is an important stage of automatic Chinese word segmentation; research on it started earlier and has produced many results. A new word (New Word), by contrast, is a word newly created as the times develop, or an old word given a new use, such as "SARS" (Severe Acute Respiratory Syndrome) or "shanzhai" (literally "mountain village", meaning knock-off). New word identification in this sense has only developed in recent years. However, because new words are also unknown words, many researchers do not distinguish between the two concepts, and this application does not distinguish clearly between them either.
The main tasks of new word identification are candidate new word extraction and garbage string filtering. Candidate new word extraction means extracting character strings that satisfy preliminary conditions as candidate new words. Because Chinese characters have an extremely strong word-forming capacity, any combination of adjacent Chinese characters can in theory form a word, so the first step of new word identification is to extract character strings from the corpus as candidate words. To keep non-word garbage strings out of the extracted candidate words, garbage strings must be filtered, that is, the data must be cleansed. Data cleansing can use methods such as keyword filtering, length filtering and specific-format filtering.
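The three cleansing methods named above can be sketched as follows. This is only an illustrative sketch: the blocked keywords, the length limits and the format pattern are hypothetical values chosen for the example, since the text names the filter types but not their settings.

```python
import re

# Hypothetical settings; the source names the filter types but not their values.
BLOCKED_KEYWORDS = {"广告", "点击"}            # keyword filtering
MIN_LEN, MAX_LEN = 2, 6                        # length filtering
URL_OR_NUMBER = re.compile(r"^(https?://\S+|\d+)$")  # specific-format filtering


def is_garbage(candidate: str) -> bool:
    """Return True if the candidate string should be dropped."""
    if not (MIN_LEN <= len(candidate) <= MAX_LEN):
        return True
    if any(keyword in candidate for keyword in BLOCKED_KEYWORDS):
        return True
    if URL_OR_NUMBER.match(candidate):
        return True
    return False


def clean(candidates):
    """Keep only the candidates that pass all three filters, in order."""
    return [c for c in candidates if not is_garbage(c)]


cleaned = clean(["计算机", "的", "点击这里", "2017", "http://a.cn"])
```

In this sketch a candidate is rejected as soon as any one filter fires; a real system might instead log which filter rejected each string so that the settings can be tuned.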
102. Segment the character strings in the data based on an information entropy model to obtain candidate words.
Specifically, step 102 includes:
Sa. Obtain the character strings in the data.
Sb. Calculate the left information entropy and the right information entropy of each character string.
Step Sb is specifically as follows:
The embodiments of this application introduce the concept of "degree of freedom". The degree of freedom refers to how rich the left and right neighbours of a character string s are: if the characters adjoining s on the left and on the right are varied, s can be regarded as a left and right word boundary. For example, consider the following data:
"At present the performance of computers has improved greatly, and people's dependence on computers is also growing"
For the character strings "计", "计算" and "计算机" ("computer"), the characters that appear on the left are "前" and "对". If the data is long enough and the string "计算机" occurs often enough, the characters appearing to the left of "计", "计算" and "计算机" turn out to be highly uncertain, and such strings are regarded as left word boundaries. Strings whose left neighbours are relatively fixed are not regarded as left word boundaries. For example, strings such as "前计" and "性" occur only once in the example sentence, so the conditional probability of the character on their left is 1 and the collocation is completely fixed. The string "算机" occurs several times, but the character on its left is always "计", so its collocation is also fixed. These strings are therefore not left word boundaries. The uncertainty of a string's collocation is estimated by calculating the information entropy of the string:
According to the formula
$$H_l(s) = -\sum_{a \in A} p(s_{la} \mid s)\,\log p(s_{la} \mid s) \tag{1-1}$$
the left information entropy of the character string is calculated, where $s$ denotes the character string, $H_l(s)$ denotes the left information entropy of $s$, $A$ is the set of Chinese characters that appear on the left of $s$, $s_{la}$ denotes the character string formed by combining the Chinese character $a$ on the left of $s$ with $s$, and $p(s_{la} \mid s)$ denotes the conditional probability that the Chinese character $a$ appears on the left of $s$ given that $s$ appears in the data.
$H_l(s)$ reflects the average uncertainty of the Chinese character that appears on the left of $s$. The larger $H_l(s)$ is, the more uncertain the character collocated on the left of $s$.
If the character string $s$ satisfies the condition
$$H_l(s) > h_{min} \tag{1-2}$$
then $s$ is considered a left word boundary.
$h_{min}$ is a constant representing the minimum information entropy of a word boundary.
Similarly, whether the character string $s$ is a right word boundary is judged as follows. According to the formula
$$H_r(s) = -\sum_{b \in B} p(s_{rb} \mid s)\,\log p(s_{rb} \mid s) \tag{1-3}$$
the right information entropy of the character string is calculated, where $H_r(s)$ denotes the right information entropy of $s$, $B$ is the set of Chinese characters that appear on the right of $s$, $s_{rb}$ is the character string formed by combining $s$ with the Chinese character $b$ on the right of $s$, and $p(s_{rb} \mid s)$ denotes the conditional probability that the Chinese character $b$ appears on the right of $s$ given that $s$ appears in the data.
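The left and right information entropy of formulas (1-1) and (1-3) can be illustrated with a short sketch. Several things here are assumptions made for the example: the function name `neighbor_entropy` is hypothetical, the sample sentence is a plausible Chinese rendering of the example data rather than a quotation, the conditional probabilities are estimated from raw neighbour counts, the natural logarithm is used, and punctuation marks are counted as neighbours like any other character.

```python
import math
from collections import Counter


def neighbor_entropy(text: str, s: str, side: str) -> float:
    """Entropy of the characters adjacent to s on the given side ("left" or "right")."""
    counts = Counter()
    start = text.find(s)
    while start != -1:
        idx = start - 1 if side == "left" else start + len(s)
        if 0 <= idx < len(text):
            counts[text[idx]] += 1
        start = text.find(s, start + 1)  # allow overlapping occurrences
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p(neighbour | s) is estimated as count / total
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Assumed sample sentence, reconstructed for illustration only.
sample = "目前计算机的性能有很大提高，人们对计算机的依赖也越来越大"
left_h = neighbor_entropy(sample, "计算机", "left")    # neighbours: 前, 对
fixed_h = neighbor_entropy(sample, "算机", "left")     # neighbour is always 计
```

With only two occurrences the estimate is very rough; as the text notes, the data needs to be long enough for the neighbour distribution to be meaningful.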
Sc. Segment the character strings in the data according to their left and right information entropy to obtain candidate phrases, and determine the candidate words among the candidate phrases.
In this way, according to formulas (1-1) and (1-3), the character strings that are both left word boundaries and right word boundaries can be extracted from the data, and these character strings are the words obtained. Because such strings generally occur frequently, at least more than twice, they are usually high-frequency words; the larger the given $h_{min}$, the higher the frequency of these words.
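A minimal sketch of this extraction step follows, keeping every substring whose left and right entropies both exceed $h_{min}$. The function name `boundary_strings`, the maximum substring length, the value of `h_min` and the sample text are all assumptions for illustration; the text does not specify how candidate substrings are enumerated.

```python
import math
from collections import Counter, defaultdict


def entropy(counter: Counter) -> float:
    """Shannon entropy (natural log) of a non-empty count distribution."""
    total = sum(counter.values())
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def boundary_strings(text: str, max_len: int = 3, h_min: float = 0.5) -> set:
    """Substrings that are both left and right word boundaries under h_min."""
    left = defaultdict(Counter)   # substring -> counts of left neighbours
    right = defaultdict(Counter)  # substring -> counts of right neighbours
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            s = text[i:i + n]
            if i > 0:
                left[s][text[i - 1]] += 1
            if i + n < len(text):
                right[s][text[i + n]] += 1
    return {
        s for s in left
        if s in right and entropy(left[s]) > h_min and entropy(right[s]) > h_min
    }


# Assumed sample text (a tongue twister in which "葡萄" has varied neighbours).
words = boundary_strings("吃葡萄不吐葡萄皮，不吃葡萄倒吐葡萄皮", max_len=2)
```

Here "葡萄" passes because both its left and right neighbours vary, while "葡" and "萄" are rejected because one side is always the same character.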
In step Sc, determining the candidate words among the candidate phrases is specifically as follows. According to the formula
$$NJ(s_1 s_2 \cdots s_n) = \frac{P(s_1 s_2 \cdots s_n)}{\prod_{i=1}^{n} P(s_i)} \tag{1-4}$$
a cohesion calculation is performed on the character strings of the candidate phrase, where $NJ(s_1 s_2 \cdots s_n)$ is the cohesion of the candidate phrase composed of the character strings $s_i$, $P(s_1 s_2 \cdots s_n)$ is the probability that the candidate phrase composed of the character strings $s_i$ appears in the data, and $P(s_i)$ is the probability that the character string $s_i$ appears in the data. If $NJ(s_1 s_2 \cdots s_n)$ is determined to satisfy a predetermined condition, the candidate phrase is determined to be a candidate word.
By calculating the "degree of freedom" of a character string we can determine how flexibly it is used and hence whether it can serve as a word boundary, but information entropy alone cannot serve as the complete basis for segmentation. We therefore also introduce "cohesion" to judge the internal stability of a candidate string. For example, if a candidate phrase is composed of the character strings a and b, formula (1-4) becomes:
$$NJ(ab) = \frac{p(ab)}{p(a)\,p(b)} \tag{1-5}$$
where $NJ(ab)$ is the cohesion of the candidate phrase composed of the character strings a and b, $p(ab)$ is the frequency with which the candidate phrase composed of a and b appears in the data, and $p(a)$ and $p(b)$ are the frequencies with which the character strings a and b respectively appear in the data.
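The two-string case of formula (1-5) can be sketched as below. How the frequencies are estimated is an assumption of this example: each frequency is taken as the number of (possibly overlapping) occurrences divided by the number of positions at which a string of that length could start. The function names and the sample text are likewise hypothetical.

```python
def frequency(text: str, s: str) -> float:
    """Relative frequency of s among all substrings of the same length (assumed estimator)."""
    positions = len(text) - len(s) + 1
    if positions <= 0:
        return 0.0
    count = sum(1 for i in range(positions) if text.startswith(s, i))
    return count / positions


def cohesion(text: str, a: str, b: str) -> float:
    """NJ(ab) = p(ab) / (p(a) * p(b)), as in formula (1-5)."""
    p_a, p_b = frequency(text, a), frequency(text, b)
    if p_a == 0 or p_b == 0:
        return 0.0
    return frequency(text, a + b) / (p_a * p_b)


text = "吃葡萄不吐葡萄皮，不吃葡萄倒吐葡萄皮"
nj = cohesion(text, "葡", "萄")
```

A value well above 1 means the two parts occur together far more often than they would if they were independent, which is the internal stability that the cohesion measure is meant to capture. The predetermined condition on the cohesion value is not specified in the text, so no threshold is applied here.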
103. Match the candidate words against the words in a dictionary to obtain new words.
104. Perform a Bayesian average calculation according to the frequency of occurrence and the score of the new words to obtain the Bayesian average of the new words.
105. If the Bayesian average is determined to satisfy a predetermined condition, determine that the new words are hot words.
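Step 103 can be pictured as a simple membership test against the dictionary. The sketch below assumes the dictionary is available as an in-memory collection of words and that a candidate counts as new when it has no exact match; the function name and sample words are hypothetical.

```python
def find_new_words(candidates, dictionary):
    """Return candidates absent from the dictionary, de-duplicated, in first-seen order."""
    known = set(dictionary)
    seen = set()
    new_words = []
    for word in candidates:
        if word not in known and word not in seen:
            seen.add(word)
            new_words.append(word)
    return new_words


new_words = find_new_words(["计算机", "山寨", "山寨", "给力"], {"计算机"})
```

Using a set for the dictionary keeps each lookup at roughly constant time, which matters when the candidate list is large.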
The Bayesian average is a method of estimating the mean of a data set that is consistent with Bayesian theory. Rather than strictly averaging the existing data set, it also brings existing information related to the data into the calculation in order to reduce the influence of large deviations on the result, or directly assumes a default value when the data set is very small.
Bayesian decision-making means estimating subjective probabilities for partly unknown states under incomplete information, correcting the probabilities of occurrence with Bayes' formula, and finally using the expected values and the corrected probabilities to make the optimal decision. If Bayesian theory is understood as how a perfectly rational person assigns confidence to an outcome, then the Bayesian average is a way of calculating a mean that follows from Bayesian theory. The Bayesian average formula is as follows:
where C is a constant proportional to the size of the data set, m is the arithmetic mean of the data set, and n is the total number of items in the data set.
The formula alone does not make clear what role the Bayesian average plays in calculating a mean, so an example follows. Suppose we want to buy a book on data mining and find three books, A, B and C, on a website. Book A has been rated by 3 people with an average score of 5.0; book B by 10 people with an average score of 4.8; and book C by 50 people with an average score of 4.5.
If the books are ranked by average score, book A is the best choice. Intuitively, however, the more people buy something, the more credible it is, whereas praise from only one or two people may come from shills. We therefore need to draw on the wisdom of the crowd: the more people who rate a product, the more credible its score and the higher the weight it should receive. The Bayesian average is in effect an averaging method that combines the number of ratings with the average score. Because it can rank items by count and score together, it can play an important role in screening hot words.
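The book example can be worked through numerically. The formula itself is not reproduced in the text above, so this sketch assumes the commonly used form of the Bayesian average, $(C \cdot m + n \cdot \bar{x}) / (C + n)$, where $\bar{x}$ is an item's own average score and $n$ its number of ratings. The value $C = 10$ and the choice of $m$ as the rating-weighted mean over all three books are also assumptions made for illustration.

```python
def bayesian_average(n: int, avg: float, C: float, m: float) -> float:
    """Assumed common form: pull an item's average towards the prior mean m with weight C."""
    return (C * m + n * avg) / (C + n)


# (number of ratings, average score) for each book in the example
books = {"A": (3, 5.0), "B": (10, 4.8), "C": (50, 4.5)}

total_ratings = sum(n for n, _ in books.values())
m = sum(n * avg for n, avg in books.values()) / total_ratings  # overall mean score
C = 10  # assumed constant; the source only says it scales with the data set size

scores = {name: bayesian_average(n, avg, C, m) for name, (n, avg) in books.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions book B ranks ahead of book A: book A's perfect average rests on only three ratings, so it is pulled strongly towards the overall mean, while book B's score is backed by more ratings. A different choice of C would change how strong this pull is.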
Whether hot words are sought within a professional field or along a timeline, one important method is mutual comparison, that is, finding the lexical differences between fields or between the periods before and after a point in time, but a simple calculation of the change in word frequency or ratio does not achieve good results. For a word with a low frequency, the Bayesian average stays close to the overall average score, whereas for a word with a high frequency the Bayesian average approaches its original score. This result matches the characteristics of hot words: on the one hand there must be a relative change, and on the other hand the change must be a group phenomenon, that is, the frequency must be high enough.
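Steps 104 and 105 can be sketched together as follows. Much of this is assumed rather than taken from the text: the Bayesian average uses the common form $(C \cdot m + n \cdot x) / (C + n)$, the "score" of a word is treated as a given number on a 0 to 10 scale (the text does not say how it is computed), the predetermined condition is modelled as a simple threshold, and the words, frequencies, scores, C and threshold are invented for the example.

```python
def select_hot_words(stats, C, threshold):
    """stats maps word -> (frequency, score); returns (hot words, Bayesian averages, m)."""
    total = sum(freq for freq, _ in stats.values())
    m = sum(freq * score for freq, score in stats.values()) / total  # overall mean score
    averages = {
        word: (C * m + freq * score) / (C + freq)
        for word, (freq, score) in stats.items()
    }
    hot = [word for word, value in averages.items() if value > threshold]
    return hot, averages, m


# Hypothetical new words with (frequency, score)
stats = {"w1": (2, 9.0), "w2": (200, 7.5), "w3": (150, 3.0)}
hot, averages, m = select_hot_words(stats, C=20, threshold=6.0)
```

The rare word w1 has the highest raw score, but its Bayesian average stays near the overall mean and falls below the threshold, while the frequent word w2 keeps a value close to its own score and is selected. This mirrors the two requirements stated above: a relative change and a sufficiently high frequency.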
In the above solution, the hot word discovery apparatus collects data generated by a network system; segments the character strings in the data based on an information entropy model to obtain candidate words; matches the candidate words against the words in a dictionary to obtain new words; performs a Bayesian average calculation according to the frequency of occurrence and the score of the new words to obtain their Bayesian average; and, if the Bayesian average is determined to satisfy a predetermined condition, determines that the new words are hot words. Because hot word selection refers to a Bayesian average that combines the frequency of occurrence and the score of the new words, hot words can be captured more effectively than when they are determined from the frequency of occurrence or the ratio of the new words alone, which improves the adaptivity of the system.
As shown in Fig. 3, an embodiment of the present invention provides a hot word discovery apparatus, including:
a collecting unit 31, configured to collect data generated by a network system;
a segmentation unit 32, configured to segment, based on an information entropy model, the character strings in the data collected by the collecting unit 31 to obtain candidate words;
a matching unit 33, configured to match the candidate words obtained by the segmentation unit 32 against the words in a dictionary to obtain new words; and
a hot word acquiring unit 34, configured to perform a Bayesian average calculation according to the frequency of occurrence and the score of the new words obtained by the matching unit 33 to obtain the Bayesian average of the new words, and, if the Bayesian average is determined to satisfy a predetermined condition, to determine that the new words are hot words.
In an exemplary implementation, the apparatus further includes a data cleansing unit 35, configured to perform data cleansing on the data collected by the collecting unit.
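The way the units hand data to one another can be pictured with a small composition sketch. This is not the apparatus itself: the class name `HotWordFinder`, the field names and the toy callables are hypothetical, and each unit is reduced to a plain function so that only the order of the data flow is shown.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class HotWordFinder:
    collect: Callable[[], str]                       # stands in for collecting unit 31
    cleanse: Callable[[str], str]                    # stands in for data cleansing unit 35
    segment: Callable[[str], List[str]]              # stands in for segmentation unit 32
    match: Callable[[List[str]], List[str]]          # stands in for matching unit 33
    select_hot: Callable[[List[str]], List[str]]     # stands in for hot word acquiring unit 34

    def run(self) -> List[str]:
        data = self.cleanse(self.collect())
        candidates = self.segment(data)
        new_words = self.match(candidates)
        return self.select_hot(new_words)


# Toy stand-ins, only to show the data flow between the units.
finder = HotWordFinder(
    collect=lambda: "a b  c",
    cleanse=lambda text: " ".join(text.split()),
    segment=lambda text: text.split(" "),
    match=lambda words: [w for w in words if w not in {"a"}],
    select_hot=lambda words: words[:1],
)
result = finder.run()
```

Passing the units in as callables keeps the sketch independent of how each unit is realised, in line with the later remark that the units may be separate processors or program code run by one processor.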
In an exemplary implementation, the hot word acquiring unit 34 is specifically configured to: obtain the character strings in the data; calculate the left information entropy and the right information entropy of each character string; segment the character strings in the data according to their left and right information entropy to obtain candidate phrases; and determine the candidate words among the candidate phrases.
In an exemplary implementation, the hot word acquiring unit 34 is specifically configured to perform a cohesion calculation on the character strings of the candidate phrase according to the formula $NJ(s_1 s_2 \cdots s_n) = \frac{P(s_1 s_2 \cdots s_n)}{\prod_{i=1}^{n} P(s_i)}$, where $NJ(s_1 s_2 \cdots s_n)$ is the cohesion of the candidate phrase composed of the character strings $s_i$, $P(s_1 s_2 \cdots s_n)$ is the probability that the candidate phrase composed of the character strings $s_i$ appears in the data, and $P(s_i)$ is the probability that the character string $s_i$ appears in the data; and, if $NJ(s_1 s_2 \cdots s_n)$ is determined to satisfy a predetermined condition, to determine that the candidate phrase is a candidate word.
In an exemplary implementation, the hot word acquiring unit 34 is specifically configured to calculate the left information entropy of the character string according to the formula $H_l(s) = -\sum_{a \in A} p(s_{la} \mid s)\,\log p(s_{la} \mid s)$, where $s$ denotes the character string, $H_l(s)$ denotes the left information entropy of $s$, $A$ is the set of Chinese characters that appear on the left of $s$, $s_{la}$ denotes the character string formed by combining the Chinese character $a$ on the left of $s$ with $s$, and $p(s_{la} \mid s)$ denotes the conditional probability that the Chinese character $a$ appears on the left of $s$ given that $s$ appears in the data; and to calculate the right information entropy of the character string according to the formula $H_r(s) = -\sum_{b \in B} p(s_{rb} \mid s)\,\log p(s_{rb} \mid s)$, where $H_r(s)$ denotes the right information entropy of $s$, $B$ is the set of Chinese characters that appear on the right of $s$, $s_{rb}$ is the character string formed by combining $s$ with the Chinese character $b$ on the right of $s$, and $p(s_{rb} \mid s)$ denotes the conditional probability that the Chinese character $b$ appears on the right of $s$ given that $s$ appears in the data.
Because the apparatus in the embodiments of this application can be applied to the hot word discovery method above, the technical effects it can achieve may also be found in the method embodiments above and are not repeated here.
It should be noted that the collecting unit 31, segmentation unit 32, matching unit 33, hot word acquiring unit 34 and data cleansing unit 35 may each be a separately provided processor, or may be integrated into one of the processors of a controller; they may also be stored as program code in the memory of the controller, with one of the processors of the controller calling and executing the functions of the above units. The processor described here may be a central processing unit (Central Processing Unit, CPU), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits configured to implement the embodiments of this application.
It should be understood that, in the various embodiments of this application, the sequence numbers of the above processes do not imply an order of execution. The order of execution of the processes should be determined by their functions and internal logic, and the sequence numbers should not limit the implementation of the embodiments of this application in any way.
In addition, a computer-readable medium is provided, including computer-readable instructions that, when executed, perform the operations of the methods in the above embodiments.
In addition, a computer program product is provided, including the above computer-readable medium.
It should be understood that, in the various embodiments of the present invention, the sequence numbers of the above processes do not imply an order of execution. The order of execution of the processes should be determined by their functions and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present invention in any way.
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementations should not be considered to go beyond the scope of the present invention.
A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices or units, and may be electrical, mechanical or of other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods of the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above are only specific embodiments of the present invention, but the scope of protection of the present invention is not limited to them. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall be covered by the scope of protection of the present invention. The scope of protection of the present invention shall therefore be subject to the scope of protection of the claims.

Claims (11)

1. A hot spot word discovery method, characterized by comprising:
collecting data generated by a network system;
segmenting the character strings in the data based on an entropy model to obtain candidate words;
matching the candidate words against the words in a dictionary to obtain new words;
performing a Bayesian average calculation according to the frequency of occurrence and the score of each new word to obtain the Bayesian average of the new word;
if the Bayesian average is determined to satisfy a predetermined condition, determining that the new word is a hot spot word.
2. The method according to claim 1, characterized in that, before the segmenting of the character strings in the data based on the entropy model to obtain candidate words, the method further comprises:
performing data cleaning on the data.
3. The method according to claim 1, characterized in that the segmenting of the character strings in the data based on the entropy model to obtain candidate words comprises:
obtaining a character string in the data;
calculating the left information entropy and the right information entropy of the character string;
segmenting the character strings in the data according to the left information entropy and the right information entropy of the character string to obtain candidate phrases, and determining candidate words among the candidate phrases.
4. The method according to claim 3, characterized in that the determining of candidate words among the candidate phrases comprises:
calculating the cohesion degree of the character strings of a candidate phrase according to a formula [formula not reproduced], where [symbol not reproduced] denotes the cohesion degree of the candidate phrase composed of the character strings $s_i$, [symbol not reproduced] denotes the probability that the candidate phrase composed of the character strings $s_i$ occurs in the data, and $P(s_i)$ denotes the probability that the character string $s_i$ occurs in the data;
if the calculated cohesion degree is determined to satisfy a predetermined condition, determining that the candidate phrase is a candidate word.
5. The method according to claim 3, characterized in that the calculating of the left information entropy and the right information entropy of the character string comprises:
calculating the left information entropy of the character string according to the formula $H_l(s) = -\sum_{a \in A} p(s_{la} \mid s) \log p(s_{la} \mid s)$,
where $s$ denotes the character string, $H_l(s)$ denotes the left information entropy of $s$, $A$ is the set of Chinese characters that occur to the left of $s$, $s_{la}$ denotes the character string formed by combining a Chinese character $a$ on the left of $s$ with $s$, and $p(s_{la} \mid s)$ denotes the conditional probability that the Chinese character $a$ occurs to the left of $s$ given that $s$ occurs in the data;
calculating the right information entropy of the character string according to the formula $H_r(s) = -\sum_{b \in B} p(s_{rb} \mid s) \log p(s_{rb} \mid s)$, where $H_r(s)$ denotes the right information entropy of $s$, $B$ is the set of Chinese characters that occur to the right of $s$, $s_{rb}$ denotes the character string formed by combining a Chinese character $b$ on the right of $s$ with $s$, and $p(s_{rb} \mid s)$ denotes the conditional probability that the Chinese character $b$ occurs to the right of $s$ given that $s$ occurs in the data.
6. A hot spot word discovery device, characterized by comprising:
a collecting unit, configured to collect data generated by a network system;
a word segmentation unit, configured to segment, based on an entropy model, the character strings in the data collected by the collecting unit to obtain candidate words;
a matching unit, configured to match the candidate words obtained by the word segmentation unit against the words in a dictionary to obtain new words;
a hot spot word acquiring unit, configured to perform a Bayesian average calculation according to the frequency of occurrence and the score of each new word obtained by the matching unit to obtain the Bayesian average of the new word, and, if the Bayesian average is determined to satisfy a predetermined condition, to determine that the new word is a hot spot word.
7. The device according to claim 6, characterized by further comprising:
a data cleaning unit, configured to perform data cleaning on the data collected by the collecting unit.
8. The device according to claim 6, characterized in that the hot spot word acquiring unit is specifically configured to: obtain a character string in the data; calculate the left information entropy and the right information entropy of the character string; segment the character strings in the data according to the left information entropy and the right information entropy of the character string to obtain candidate phrases; and determine candidate words among the candidate phrases.
9. The device according to claim 8, characterized in that the hot spot word acquiring unit is specifically configured to calculate the cohesion degree of the character strings of a candidate phrase according to a formula [formula not reproduced], where [symbol not reproduced] denotes the cohesion degree of the candidate phrase composed of the character strings $s_i$, [symbol not reproduced] denotes the probability that the candidate phrase composed of the character strings $s_i$ occurs in the data, and $P(s_i)$ denotes the probability that the character string $s_i$ occurs in the data; and, if the calculated cohesion degree is determined to satisfy a predetermined condition, to determine that the candidate phrase is a candidate word.
10. The device according to claim 8, characterized in that the hot spot word acquiring unit is specifically configured to calculate the left information entropy of the character string according to the formula $H_l(s) = -\sum_{a \in A} p(s_{la} \mid s) \log p(s_{la} \mid s)$, where $s$ denotes the character string, $H_l(s)$ denotes the left information entropy of $s$, $A$ is the set of Chinese characters that occur to the left of $s$, $s_{la}$ denotes the character string formed by combining a Chinese character $a$ on the left of $s$ with $s$, and $p(s_{la} \mid s)$ denotes the conditional probability that the Chinese character $a$ occurs to the left of $s$ given that $s$ occurs in the data; and to calculate the right information entropy of the character string according to the formula $H_r(s) = -\sum_{b \in B} p(s_{rb} \mid s) \log p(s_{rb} \mid s)$, where $H_r(s)$ denotes the right information entropy of $s$, $B$ is the set of Chinese characters that occur to the right of $s$, $s_{rb}$ denotes the character string formed by combining a Chinese character $b$ on the right of $s$ with $s$, and $p(s_{rb} \mid s)$ denotes the conditional probability that the Chinese character $b$ occurs to the right of $s$ given that $s$ occurs in the data.
11. A computer-readable storage medium storing one or more programs, characterized in that the one or more programs comprise instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 5.
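
The calculations named in claims 1, 4 and 5 can be illustrated with a minimal Python sketch. It is not the claimed implementation. The function names, the use of the natural logarithm, the ratio form of the cohesion degree (the formula of claim 4 is not reproduced in this text) and the prior-weighted form of the Bayesian average, with its hypothetical parameters `prior_weight` and `prior_mean`, are all assumptions made for illustration.

```python
import math
from collections import Counter


def side_entropies(text, s):
    """Return (left, right) information entropy of string s within text.

    Simplification (assumption): probabilities are normalised over the
    occurrences of s that actually have a neighbouring character on that side.
    """
    left, right = Counter(), Counter()
    start = text.find(s)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1      # character immediately left of s
        end = start + len(s)
        if end < len(text):
            right[text[end]] += 1           # character immediately right of s
        start = text.find(s, start + 1)     # overlapping matches are counted

    def entropy(counts):
        total = sum(counts.values())
        return -sum(c / total * math.log(c / total) for c in counts.values())

    return entropy(left), entropy(right)


def cohesion(p_phrase, part_probs):
    """Assumed form of the cohesion degree: P(phrase) / product of P(s_i)."""
    return p_phrase / math.prod(part_probs)


def bayesian_average(frequency, score, prior_weight, prior_mean):
    """Assumed form: shrink a word's score toward prior_mean; the more often
    the word occurs, the less the prior matters."""
    return (prior_weight * prior_mean + frequency * score) / (prior_weight + frequency)


# Example corpus (hypothetical data): entropies of the neighbours of "葡萄".
corpus = "吃葡萄不吐葡萄皮不吃葡萄倒吐葡萄皮"
left_entropy, right_entropy = side_entropies(corpus, "葡萄")
```

A string whose left and right entropies are both high appears in varied contexts, which suggests a free-standing word. A high cohesion degree suggests that its parts co-occur more often than chance would predict. Any thresholds standing in for the "predetermined condition" would need to be chosen for the data at hand.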
CN201711058951.4A 2017-11-01 2017-11-01 A kind of hot spot word finds method and apparatus Pending CN107908618A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711058951.4A CN107908618A (en) 2017-11-01 2017-11-01 A kind of hot spot word finds method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711058951.4A CN107908618A (en) 2017-11-01 2017-11-01 A kind of hot spot word finds method and apparatus

Publications (1)

Publication Number Publication Date
CN107908618A true CN107908618A (en) 2018-04-13

Family

ID=61843344

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711058951.4A Pending CN107908618A (en) 2017-11-01 2017-11-01 A kind of hot spot word finds method and apparatus

Country Status (1)

Country Link
CN (1) CN107908618A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516229A (en) * 2019-07-10 2019-11-29 杭州电子科技大学 A kind of domain-adaptive Chinese word cutting method based on deep learning
CN110991173A (en) * 2019-11-29 2020-04-10 支付宝(杭州)信息技术有限公司 Word segmentation method and system
CN111538893A (en) * 2020-04-29 2020-08-14 四川大学 Method for extracting network security new words from unstructured data
CN111563143A (en) * 2020-07-20 2020-08-21 上海二三四五网络科技有限公司 Method and device for determining new words
CN111737555A (en) * 2020-06-18 2020-10-02 苏州朗动网络科技有限公司 Method and device for selecting hot keywords and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110191355A1 (en) * 2007-04-24 2011-08-04 Peking University Method for monitoring abnormal state of internet information
CN107180025A (en) * 2017-03-31 2017-09-19 北京奇艺世纪科技有限公司 A kind of recognition methods of neologisms and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110191355A1 (en) * 2007-04-24 2011-08-04 Peking University Method for monitoring abnormal state of internet information
CN107180025A (en) * 2017-03-31 2017-09-19 北京奇艺世纪科技有限公司 A kind of recognition methods of neologisms and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MATRIX67: "互联网时代的社会语言学:基于SNS的文本数据挖掘", 《HTTP://WWW.MATRIX67.COM/BLOG/ARCHIVES/5044》 *
ZHZHX0318: "关于新词发现", 《HTTPS://BLOG.CSDN.NET/ZHZHX0318/ARTICLE/DETAILS/78253378》 *
郝晓玲 等: "微博热词抽取及话题发现研究", 《情报杂志》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516229A (en) * 2019-07-10 2019-11-29 杭州电子科技大学 A kind of domain-adaptive Chinese word cutting method based on deep learning
CN110516229B (en) * 2019-07-10 2020-05-05 杭州电子科技大学 Domain-adaptive Chinese word segmentation method based on deep learning
CN110991173A (en) * 2019-11-29 2020-04-10 支付宝(杭州)信息技术有限公司 Word segmentation method and system
CN110991173B (en) * 2019-11-29 2023-09-29 支付宝(杭州)信息技术有限公司 Word segmentation method and system
CN111538893A (en) * 2020-04-29 2020-08-14 四川大学 Method for extracting network security new words from unstructured data
CN111737555A (en) * 2020-06-18 2020-10-02 苏州朗动网络科技有限公司 Method and device for selecting hot keywords and storage medium
CN111563143A (en) * 2020-07-20 2020-08-21 上海二三四五网络科技有限公司 Method and device for determining new words

Similar Documents

Publication Publication Date Title
CN107908618A (en) A kind of hot spot word finds method and apparatus
Hoffart et al. Discovering emerging entities with ambiguous names
US10565233B2 (en) Suffix tree similarity measure for document clustering
JP5092165B2 (en) Data construction method and system
Inzalkar et al. A survey on text mining-techniques and application
CN108829658A (en) The method and device of new word discovery
WO2017101728A1 (en) Similar word aggregation method and apparatus
CN104978332B (en) User-generated content label data generation method, device and correlation technique and device
US10467255B2 (en) Methods and systems for analyzing reading logs and documents thereof
CN104137097A (en) Predicate template gathering device, specified phrase pair gathering device and computer program for said devices
CN108733791B (en) Network event detection method
CN110232126A (en) Hot spot method for digging and server and computer readable storage medium
Bykau et al. Fine-grained controversy detection in Wikipedia
Ouyang et al. Sentistory: multi-grained sentiment analysis and event summarization with crowdsourced social media data
Wang et al. Exploring text links for coherent multi-document summarization
Harandizadeh et al. Tweeki: Linking named entities on Twitter to a knowledge graph
Liang et al. Clustering web services for automatic categorization
Abidi et al. Searching Personalized k-wing in Bipartite Graphs
Schinas et al. Event detection and retrieval on social media
Unankard et al. Sub-events tracking from social network based on the relationships between topics
CN112528640A (en) Automatic domain term extraction method based on abnormal subgraph detection
CN116467291A (en) Knowledge graph storage and search method and system
Zhou et al. Real-time timeline summarisation for high-impact events in twitter
Jaber et al. Inferring offline hierarchical ties from online social networks
AT&T

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination