CN107908618A - Hot word discovery method and apparatus - Google Patents
- Publication number
- CN107908618A CN201711058951.4A
- Authority
- CN
- China
- Prior art keywords
- character string
- word
- data
- information entropy
- hot spot
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
Abstract
Embodiments of the present invention disclose a hot word discovery method and apparatus, relating to the field of information processing, which can effectively capture hot words and thereby improve the adaptivity of a system. The method includes: collecting data generated by a network system; segmenting the character strings in the data based on an entropy model to obtain candidate words; matching the candidate words against the words in a dictionary to obtain new words; performing a Bayesian average calculation on the occurrence frequency and score of each new word to obtain the Bayesian average of the new word; and, if the Bayesian average meets a predetermined condition, determining that the new word is a hot word.
Description
Technical field
Embodiments of the present invention relate to the field of information processing, and in particular to a hot word discovery method and apparatus.
Background
Chinese has a very strong capacity for word formation: in theory, any combination of two or more Chinese characters may form a word. This makes new word identification extremely difficult.
In general, there are two main approaches to new word identification: rule-based and statistical. The former uses word-formation rules, together with semantic or part-of-speech information, to build templates, and then finds and identifies new words by matching. The latter identifies new words by collecting statistics on the composition or feature information of entries in a corpus. Most researchers currently combine rules and statistics to exploit the strengths of both and thereby improve identification results.
Another approach is dictionary-based word segmentation. Its key idea is to match the string to be segmented against the entries of an existing dictionary; if a substring is found in the dictionary, the match succeeds. Dictionary-free segmentation, by contrast, is generally based on character frequency statistics: it does not rely on a dictionary, but counts how often any two characters occur together in a text, and pairs that co-occur more often are more likely to be a word. A combined approach first cuts out all possible words that match a vocabulary and then determines the best segmentation with a statistical language model and a decision algorithm.
After the text is segmented, word frequencies are counted. Words with higher frequencies are frequently occurring words, and hot words can be obtained by comparing them with common words and screening out the common high-frequency vocabulary. However, whether hot words are sought within a professional domain or along a timeline, one important method is mutual comparison, that is, finding the vocabulary differences between domains or between earlier and later periods. Simple calculations of changes in word frequency or ratio do not give good results.
Summary of the invention
Embodiments of the present invention provide a hot word discovery method and apparatus that can effectively capture hot words and thereby improve the adaptivity of a system.
In a first aspect, a hot word discovery method is provided, including:
collecting data generated by a network system;
segmenting the character strings in the data based on an entropy model to obtain candidate words;
matching the candidate words against the words in a dictionary to obtain new words;
performing a Bayesian average calculation on the occurrence frequency and score of each new word to obtain the Bayesian average of the new word;
if the Bayesian average meets a predetermined condition, determining that the new word is a hot word.
In a second aspect, a hot word discovery apparatus is provided, including:
an acquisition unit, configured to collect data generated by a network system;
a segmentation unit, configured to segment, based on an entropy model, the character strings in the data collected by the acquisition unit to obtain candidate words;
a matching unit, configured to match the candidate words obtained by the segmentation unit against the words in a dictionary to obtain new words;
a hot word acquisition unit, configured to perform a Bayesian average calculation on the occurrence frequency and score of each new word obtained by the matching unit to obtain the Bayesian average of the new word, and, if the Bayesian average meets a predetermined condition, to determine that the new word is a hot word.
In the above scheme, the hot word discovery apparatus collects data generated by a network system; segments the character strings in the data based on an entropy model to obtain candidate words; matches the candidate words against the words in a dictionary to obtain new words; performs a Bayesian average calculation on the occurrence frequency and score of each new word to obtain its Bayesian average; and, if the Bayesian average meets a predetermined condition, determines that the new word is a hot word. Because hot words are selected using a Bayesian average that combines the occurrence frequency and the score of each new word, rather than the occurrence frequency or ratio alone, hot words can be captured effectively, which improves the adaptivity of the system.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a hot word discovery method provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of a hot word discovery apparatus provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
The system architecture and business scenario described here are intended to explain the technical solutions of the embodiments more clearly and do not limit them. A person of ordinary skill in the art will understand that, as system architectures evolve and new business scenarios appear, the technical solutions provided by the embodiments apply equally to similar technical problems.
The above method is described in detail below with reference to specific embodiments. Referring to Fig. 1, the embodiments are applied to the following scenario, which includes a database 11, a hot word discovery apparatus 12, a data analysis and mining server 13, a management platform 14, and a display and business support device 15.
The database 11 stores the network-system data used by the hot word discovery apparatus, the hot words generated by the hot word discovery apparatus 12, and other data used by the data analysis and mining server 13. The management platform 14 performs status monitoring, rights management and security assurance for the other parts. The data analysis and mining server 13 mainly uses the hot words generated by the hot word discovery apparatus 12 and the other data provided by the database 11 to perform public opinion analysis such as content association analysis, time series analysis, propagation trend analysis, hot topic identification, automatic summary generation and topic tracking. The display and business support device 15 may be a terminal device; as a human-computer interaction device, it mainly implements functions such as public opinion early warning, statistical reports, visualization and propagation topology based on the analysis results of the data analysis and mining server 13. The embodiments of the present invention mainly provide the hot word discovery apparatus 12, whose main functions are data collection, data cleaning, new word discovery and hot word screening.
Specifically, referring to Fig. 2, an embodiment of the present invention provides a hot word discovery method, including:
101. Collect data generated by a network system.
After step 101, the method further includes performing data cleaning on the data. A new word generally refers to a word that has not appeared before or is not included in the dictionary. In the field of new word identification there is no unified definition of the concept of a "new word"; current research covers two aspects, unknown word identification (UWI) and new word identification (NWI). An unknown word is a word that does not appear in the dictionary currently in use. UWI is an important stage of automatic Chinese word segmentation; research on it started early and has produced many results. A new word in the narrower sense is a word that has newly emerged with the times, or an old word used with a new meaning, such as "SARS" or "shanzhai" (literally "mountain village"). New word identification in this sense has developed only in recent years. However, because new words are also unknown words, many researchers do not distinguish the two concepts, and this application does not clearly distinguish them either.
The main tasks of new word identification are extracting candidate new words and filtering out junk strings. Extracting candidate new words means extracting the character strings that meet preliminary conditions as candidates. Because Chinese characters have an extremely strong capacity for word formation, any adjacent Chinese characters may in theory combine into a word, so the first step of new word identification is to extract character strings from the corpus as candidate words. To keep non-word junk strings out of the extracted candidates, junk strings must be filtered out, that is, the data must be cleaned. Data cleaning can use methods such as keyword filtering, length filtering and specific-format filtering.
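The three cleaning methods named above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the source does not give the keyword list, the length limits or the formats to reject, so `BLOCKED_KEYWORDS`, `MIN_LEN`, `MAX_LEN` and `URL_OR_DIGITS` are hypothetical values chosen for the example.

```python
import re

# Hypothetical settings; the source names the filter types but not their values.
BLOCKED_KEYWORDS = {"advert", "spam"}
MIN_LEN, MAX_LEN = 2, 8
URL_OR_DIGITS = re.compile(r"^(https?://\S+|\d+)$")


def clean_candidates(candidates):
    """Drop junk strings by keyword, length, and format, keeping input order."""
    kept = []
    for s in candidates:
        if any(k in s for k in BLOCKED_KEYWORDS):
            continue  # keyword filtering
        if not MIN_LEN <= len(s) <= MAX_LEN:
            continue  # length filtering
        if URL_OR_DIGITS.match(s):
            continue  # specific-format filtering (URLs, pure digit runs)
        kept.append(s)
    return kept
```

Each filter is independent, so a deployment could enable only the ones that suit its data source.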
102. Segment the character strings in the data based on an entropy model to obtain candidate words.
Specifically, step 102 includes:
Sa. Obtain the character strings in the data.
Sb. Calculate the left information entropy and the right information entropy of each character string.
Step Sb is specifically as follows.
The embodiments of this application introduce the concept of "degree of freedom", which refers to how varied the left and right contexts of a character string s are. If the characters adjacent to s on the left and right are diverse, s can be regarded as a left and right word boundary. For example, consider the following data:
"The performance of computers has improved greatly at present, and people's dependence on computers is also growing"
In the original Chinese sentence, the characters that appear to the left of the strings "计", "计算" and "计算机" ("computer") are "前" and "对". If the data is long enough and the string "计算机" occurs often enough, the characters that appear to the left of "计", "计算" and "计算机" turn out to be highly uncertain, and such strings are regarded as left word boundaries. Strings whose left-hand character is fairly fixed are not regarded as left word boundaries. For example, strings such as "前计" and "性" occur only once in the example sentence, so the conditional probability of the character on their left is 1 and the collocation is fixed. The string "算机" occurs several times, but the character on its left is always "计", so its collocation is also fixed and it is not a left word boundary. The uncertainty of a string's collocation is estimated by calculating the information entropy of the string.
The left information entropy of a character string is calculated according to the formula
$$H_l(s) = -\sum_{a \in A} p(s_{la} \mid s)\,\log p(s_{la} \mid s) \tag{1-1}$$
where $s$ denotes the character string, $H_l(s)$ denotes the left information entropy of $s$, $A$ is the set of Chinese characters that occur to the left of $s$, $s_{la}$ denotes the string formed by combining the character $a$ on the left of $s$ with $s$, and $p(s_{la} \mid s)$ denotes the conditional probability that the character $a$ occurs to the left of $s$, given that $s$ occurs in the data.
$H_l(s)$ reflects the average uncertainty of the character that occurs to the left of $s$. The larger $H_l(s)$ is, the more uncertain the character on the left of $s$.
If the character string $s$ satisfies the condition
$$H_l(s) > h_{\min} \tag{1-2}$$
then $s$ is regarded as a left word boundary. Here $h_{\min}$ is a constant denoting the minimum information entropy of a word boundary.
Similarly, whether $s$ is a right word boundary is judged as follows. The right information entropy of the character string is calculated according to the formula
$$H_r(s) = -\sum_{b \in B} p(s_{rb} \mid s)\,\log p(s_{rb} \mid s) \tag{1-3}$$
where $H_r(s)$ denotes the right information entropy of $s$, $B$ is the set of Chinese characters that occur to the right of $s$, $s_{rb}$ is the string formed by combining the character $b$ on the right of $s$ with $s$, and $p(s_{rb} \mid s)$ denotes the conditional probability that the character $b$ occurs to the right of $s$, given that $s$ occurs in the data.
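Formulas (1-1) to (1-3) can be illustrated with a short sketch that counts the characters adjacent to each occurrence of a string and computes the entropy of those counts. Several details are assumptions rather than part of the source: the natural logarithm is used because the base is not stated, occurrences at the very start or end of the text contribute no neighbour, and the threshold `H_MIN` is a hypothetical value. The toy text uses Latin letters only to keep the counts easy to check by hand.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (natural log) of a Counter of neighbouring characters."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def side_entropies(text, s):
    """Return (left, right) information entropy of the characters adjacent to s."""
    left, right = Counter(), Counter()
    start = text.find(s)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1   # character immediately to the left
        end = start + len(s)
        if end < len(text):
            right[text[end]] += 1        # character immediately to the right
        start = text.find(s, start + 1)  # also finds overlapping occurrences
    return entropy(left), entropy(right)


H_MIN = 0.5  # hypothetical boundary threshold h_min
left_h, right_h = side_entropies("xabcyabczabc", "ab")
is_left_boundary = left_h > H_MIN    # left neighbours x, y, z are varied
is_right_boundary = right_h > H_MIN  # right neighbour is always c
```

In this toy text "ab" qualifies as a left word boundary but not as a right one, which mirrors the "算机" case above with the sides swapped: a fixed neighbour on one side yields zero entropy on that side.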
Sc. Segment the character strings in the data according to their left and right information entropy to obtain candidate phrases, and determine candidate words among the candidate phrases.
In this way, using formulas (1-1) and (1-3), the strings that are both left word boundaries and right word boundaries can be extracted from the data; these strings are the words obtained. Because such strings generally occur fairly often, at least more than twice, they are typically high-frequency words, and the larger the given $h_{\min}$, the higher the frequency of these words.
Determining candidate words among the candidate phrases in step Sc is specifically: computing the cohesion of the character strings of the candidate phrase according to formula (1-4) (the formula image is not reproduced in this text). The terms of the formula denote, respectively, the cohesion of the candidate phrase composed of the character strings $s_i$, the probability that the candidate phrase composed of the character strings $s_i$ occurs in the data, and $P(s_i)$, the probability that the character string $s_i$ occurs in the data. If the cohesion meets a predetermined condition, the candidate phrase is determined to be a candidate word.
Calculating the degree of freedom of a character string determines how flexibly the string is used and hence whether it can serve as a word boundary, but information entropy alone cannot be the sole basis for segmentation. "Cohesion" is therefore also introduced to judge the internal stability of a candidate string. For example, if the candidate phrase is composed of the character strings a and b, formula (1-4) becomes:
$$NJ(ab) = \frac{p(ab)}{p(a)\,p(b)} \tag{1-5}$$
where $NJ(ab)$ is the cohesion of the candidate phrase composed of the character strings a and b, $p(ab)$ is the frequency with which the candidate phrase composed of a and b occurs in the data, and $p(a)$ and $p(b)$ are the frequencies with which the character strings a and b occur in the data.
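Formula (1-5) can be sketched as a ratio of estimated frequencies. The source does not say how the frequencies are estimated, so this example assumes each frequency is the number of (possibly overlapping) occurrences divided by the text length; the function names are hypothetical.

```python
def count_occurrences(text, s):
    """Count possibly overlapping occurrences of s in text."""
    return sum(1 for i in range(len(text) - len(s) + 1) if text.startswith(s, i))


def cohesion(text, a, b):
    """NJ(ab) = p(ab) / (p(a) * p(b)), with p estimated as count / len(text)."""
    n = len(text)
    p_ab = count_occurrences(text, a + b) / n
    p_a = count_occurrences(text, a) / n
    p_b = count_occurrences(text, b) / n
    if p_a == 0 or p_b == 0:
        return 0.0  # a part that never occurs cannot form a cohesive phrase
    return p_ab / (p_a * p_b)


text = "abcabcabxc"
tight = cohesion(text, "a", "b")  # "a" is always followed by "b"
loose = cohesion(text, "b", "c")  # "b" is followed by "c" only two times out of three
```

A pair whose parts almost always occur together scores higher than a pair whose parts also occur apart, which is the internal stability that the cohesion measure is meant to capture.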
103. Match the candidate words against the words in a dictionary to obtain new words.
104. Perform a Bayesian average calculation on the occurrence frequency and score of each new word to obtain the Bayesian average of the new word.
105. If the Bayesian average meets a predetermined condition, determine that the new word is a hot word.
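Steps 103 and 105 can be sketched as two small filters. The source does not specify the matching rule or the predetermined condition, so this example assumes exact-match lookup against the dictionary and a simple numeric threshold; both function names and the `threshold` parameter are hypothetical.

```python
def find_new_words(candidates, dictionary):
    """Step 103 sketch: keep candidates absent from the dictionary, in first-seen order."""
    known = set(dictionary)
    new_words = []
    for word in candidates:
        if word not in known and word not in new_words:
            new_words.append(word)
    return new_words


def select_hot_words(bayes_avg_by_word, threshold):
    """Step 105 sketch: keep new words whose Bayesian average exceeds the threshold."""
    return [word for word, value in bayes_avg_by_word.items() if value > threshold]
```

For example, `find_new_words(["alpha", "beta", "alpha"], {"beta"})` keeps only `"alpha"`, and `select_hot_words({"alpha": 4.7, "gamma": 4.1}, 4.5)` keeps only `"alpha"`.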
The Bayesian average is a method of estimating the mean of a data set in a way that is consistent with Bayesian theory. Rather than averaging strictly over the existing data set, it brings existing information related to the data into the calculation, or directly assumes a default value when the data set is very small, in order to reduce the influence of large deviations on the result.
Bayesian decision-making means estimating subjective probabilities for partly unknown states under incomplete information, then revising the probabilities of occurrence with Bayes' formula, and finally using the expected value and the revised probabilities to make the best decision. If Bayesian theory is understood as the way an ideally rational person assigns confidence to a result, then the Bayesian average is a method of calculating an average according to that theory. The Bayesian average formula (the formula image is not reproduced in this text) uses the following quantities:
C is a constant proportional to the size of the data set, m is the arithmetic mean of the data set, and n is the total number of items in the data set.
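For reference, a commonly used form of the Bayesian average that is consistent with the symbol definitions above is given below. It is supplied here as an assumption, since the patent's own formula is not available in this text; $x_i$ denotes the individual scores being averaged, a symbol introduced only for this illustration.

$$\bar{x} = \frac{C\,m + \sum_{i=1}^{n} x_i}{C + n}$$

When $n$ is small, the result stays close to $m$; as $n$ grows, the term $\sum x_i$ dominates and the result approaches the plain average of the scores.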
The formula alone does not make clear what role it plays in computing an average, so consider an example. Suppose one wants to buy a book on data mining and finds three books, A, B and C, on a website. Book A has ratings from 3 people with an average score of 5.0; book B has ratings from 10 people with an average score of 4.8; book C has ratings from 50 people with an average score of 4.5.
Sorted by average score, book A is the best choice. Intuitively, however, the more people buy something, the more trustworthy it is, whereas praise from only one or two people may come from shills. We therefore rely on the wisdom of the crowd: the more people rate a product, the more credible its score and the higher the weight it should receive. The Bayesian average is thus an averaging method that combines the number of ratings with the average score. Because it can rank items by both count and score, it can play an important role in screening hot words.
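The book example can be worked through with a short sketch. It assumes the commonly used form of the Bayesian average, in which a constant number of pseudo-ratings at a prior mean is mixed with the item's own ratings; the prior mean is taken here as the overall mean score across all ratings, and `c = 10` is a hypothetical constant, since the source gives neither.

```python
def bayesian_average(mean, n, prior_mean, c):
    """Shrink an item's mean score toward prior_mean; c acts as a count of prior pseudo-ratings."""
    return (c * prior_mean + mean * n) / (c + n)


# (average score, number of ratings) for each book in the example
books = {"A": (5.0, 3), "B": (4.8, 10), "C": (4.5, 50)}

total_ratings = sum(n for _, n in books.values())
prior_mean = sum(mean * n for mean, n in books.values()) / total_ratings  # overall mean
c = 10  # hypothetical constant, not specified in the source

by_plain_mean = sorted(books, key=lambda k: books[k][0], reverse=True)
by_bayes = sorted(
    books,
    key=lambda k: bayesian_average(*books[k], prior_mean, c),
    reverse=True,
)
```

With these assumed values, the plain average ranks the books A, B, C, while the Bayesian average ranks them B, A, C: book A's perfect score rests on only three ratings, so it is pulled toward the overall mean more strongly than book B's score is.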
Whether hot words are sought within a professional domain or along a timeline, one important method is mutual comparison, that is, finding the vocabulary differences between domains or between earlier and later periods, but simple calculations of changes in word frequency or ratio do not give good results. For words with a low frequency, the Bayesian average lies near the overall average score, whereas for words with a high frequency it approaches their original score. Such results match the characteristics of hot words: on the one hand there must be a relative change, and on the other hand the change must be a group phenomenon, that is, the word must occur often enough.
In the above scheme, the hot word discovery apparatus collects data generated by a network system; segments the character strings in the data based on an entropy model to obtain candidate words; matches the candidate words against the words in a dictionary to obtain new words; performs a Bayesian average calculation on the occurrence frequency and score of each new word to obtain its Bayesian average; and, if the Bayesian average meets a predetermined condition, determines that the new word is a hot word. Because hot words are selected using a Bayesian average that combines the occurrence frequency and the score of each new word, rather than the occurrence frequency or ratio alone, hot words can be captured effectively, which improves the adaptivity of the system.
As shown in Fig. 3, an embodiment of the present invention provides a hot word discovery apparatus, including:
an acquisition unit 31, configured to collect data generated by a network system;
a segmentation unit 32, configured to segment, based on an entropy model, the character strings in the data collected by the acquisition unit 31 to obtain candidate words;
a matching unit 33, configured to match the candidate words obtained by the segmentation unit 32 against the words in a dictionary to obtain new words;
a hot word acquisition unit 34, configured to perform a Bayesian average calculation on the occurrence frequency and score of each new word obtained by the matching unit 33 to obtain the Bayesian average of the new word, and, if the Bayesian average meets a predetermined condition, to determine that the new word is a hot word.
In an exemplary implementation, the apparatus further includes a data cleaning unit 35, configured to perform data cleaning on the data collected by the acquisition unit.
In an exemplary implementation, the hot word acquisition unit 34 is specifically configured to: obtain the character strings in the data; calculate the left information entropy and the right information entropy of each character string; segment the character strings in the data according to their left and right information entropy to obtain candidate phrases; and determine candidate words among the candidate phrases.
In an exemplary implementation, the hot word acquisition unit 34 is specifically configured to compute the cohesion of the character strings of the candidate phrase according to a formula (the formula image is not reproduced in this text), whose terms denote, respectively, the cohesion of the candidate phrase composed of the character strings $s_i$, the probability that the candidate phrase composed of the character strings $s_i$ occurs in the data, and $P(s_i)$, the probability that the character string $s_i$ occurs in the data; if the cohesion meets a predetermined condition, the candidate phrase is determined to be a candidate word.
In an exemplary implementation, the hot word acquisition unit 34 is specifically configured to calculate the left information entropy of a character string according to the formula
$$H_l(s) = -\sum_{a \in A} p(s_{la} \mid s)\,\log p(s_{la} \mid s)$$
where $s$ denotes the character string, $H_l(s)$ denotes the left information entropy of $s$, $A$ is the set of Chinese characters that occur to the left of $s$, $s_{la}$ denotes the string formed by combining the character $a$ on the left of $s$ with $s$, and $p(s_{la} \mid s)$ denotes the conditional probability that the character $a$ occurs to the left of $s$, given that $s$ occurs in the data; and to calculate the right information entropy of the character string according to the formula
$$H_r(s) = -\sum_{b \in B} p(s_{rb} \mid s)\,\log p(s_{rb} \mid s)$$
where $H_r(s)$ denotes the right information entropy of $s$, $B$ is the set of Chinese characters that occur to the right of $s$, $s_{rb}$ is the string formed by combining the character $b$ on the right of $s$ with $s$, and $p(s_{rb} \mid s)$ denotes the conditional probability that the character $b$ occurs to the right of $s$, given that $s$ occurs in the data.
Because the apparatus in the embodiments of this application can be applied to the hot word discovery method described above, the technical effects it achieves can be found in the method embodiments and are not repeated here.
It should be noted that the acquisition unit 31, the segmentation unit 32, the matching unit 33, the hot word acquisition unit 34 and the data cleaning unit 35 may each be a separately provided processor, or may be integrated into one processor of a controller. They may also be stored in the memory of the controller in the form of program code, which a processor of the controller calls to perform the functions of the above units. The processor described here may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of this application.
It should be understood that, in the various embodiments of this application, the sequence numbers of the above processes do not imply an order of execution. The execution order of the processes should be determined by their functions and internal logic, and the numbering does not limit the implementation of the embodiments of this application in any way.
In addition, a computer-readable medium (or media) is provided, including computer-readable instructions that, when executed, perform the operations of the method in the above embodiments.
In addition, a computer program product is provided, including the above computer-readable medium (or media).
It should be understood that, in the various embodiments of the present invention, the sequence numbers of the above processes do not imply an order of execution. The execution order of the processes should be determined by their functions and internal logic, and the numbering does not limit the implementation of the embodiments of the present invention in any way.
A person of ordinary skill in the art will recognize that the units and algorithm steps of the examples described in the embodiments disclosed here can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present invention.
A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function; other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices or units, and may be electrical, mechanical or of other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of an embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods of the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
The above is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be subject to the scope of protection of the claims.
Claims (11)
1. A hot spot word discovery method, characterized by comprising:
collecting data generated by a network system;
performing word segmentation on character strings in the data based on an entropy model to obtain candidate words;
matching the candidate words against the words in a dictionary to obtain neologisms;
performing a Bayesian average calculation according to the frequency of occurrence and the score of each neologism to obtain the Bayesian average of the neologism; and
if the Bayesian average is determined to satisfy a predetermined condition, determining that the neologism is a hot spot word.
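The last three steps of claim 1 can be illustrated with a minimal Python sketch. The claim does not state how the Bayesian average is formed, so the sketch assumes the common form (C·m + n·x) / (C + n), with n taken as the frequency of occurrence and x as the score. The names `Neologism`, `bayesian_average`, `find_hot_words`, the way the priors C and m are estimated, and the threshold test are all hypothetical choices for illustration, not the patent's definition.

```python
from dataclasses import dataclass


@dataclass
class Neologism:
    word: str
    frequency: int  # number of occurrences in the collected data
    score: float    # score assigned to the word (scale is an assumption)


def bayesian_average(item, prior_weight, prior_mean):
    """Assumed form: (C*m + n*x) / (C + n), with n = frequency and x = score."""
    return (prior_weight * prior_mean + item.frequency * item.score) / (
        prior_weight + item.frequency
    )


def find_hot_words(candidates, dictionary, threshold):
    # Assumption: a neologism is a candidate word that is absent from the dictionary.
    neologisms = [c for c in candidates if c.word not in dictionary]
    total_freq = sum(n.frequency for n in neologisms)
    if total_freq == 0:
        return []
    # Assumption: priors are estimated from the neologisms themselves.
    prior_weight = total_freq / len(neologisms)  # C: mean frequency
    prior_mean = sum(n.frequency * n.score for n in neologisms) / total_freq  # m
    # Assumption: the "predetermined condition" is a simple lower bound.
    return [
        n.word
        for n in neologisms
        if bayesian_average(n, prior_weight, prior_mean) >= threshold
    ]
```

The prior pulls the value for a rarely seen word toward the overall mean, so a word with very few occurrences cannot become a hot spot word on the strength of a high score alone.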
2. The method according to claim 1, characterized in that, before performing word segmentation on the character strings in the data based on the entropy model to obtain candidate words, the method further comprises:
performing data cleansing on the data.
3. The method according to claim 1, characterized in that performing word segmentation on the character strings in the data based on the entropy model to obtain candidate words comprises:
obtaining the character strings in the data;
calculating the left information entropy and the right information entropy of each character string; and
segmenting the character strings in the data according to the left information entropy and the right information entropy of the character strings to obtain candidate phrases, and determining the candidate words from among the candidate phrases.
4. The method according to claim 3, characterized in that determining the candidate words from among the candidate phrases comprises:
calculating the cohesion degree of the character strings of the candidate phrase according to a formula [formula not reproduced in this text], wherein [symbol not reproduced] denotes the cohesion degree of the candidate phrase composed of the character strings $s_i$, [symbol not reproduced] denotes the probability that the candidate phrase composed of the character strings $s_i$ occurs in the data, and $P(s_i)$ denotes the probability that the character string $s_i$ occurs in the data; and
if the cohesion degree is determined to satisfy a predetermined condition, determining that the candidate phrase is a candidate word.
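The formula of claim 4 is not reproduced in this text, so the following Python sketch is only one plausible reading of the surrounding definitions: the cohesion degree is taken as the probability of the whole phrase divided by the product of the probabilities of its component strings. That ratio, the helper names `ngram_counts` and `cohesion`, and the use of the text length as the normalizer are assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_counts(text, max_len):
    """Count every substring of length 1..max_len in the text."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cohesion(parts, counts, total):
    """Assumed measure: P(s_1 s_2 ... s_n) / (P(s_1) * P(s_2) * ... * P(s_n))."""
    phrase = "".join(parts)
    p_phrase = counts[phrase] / total
    p_parts = math.prod(counts[p] / total for p in parts)
    if p_parts == 0:
        return 0.0  # a component string never occurs in the data
    return p_phrase / p_parts
```

For example, with `counts = ngram_counts(text, 2)` and `total = len(text)`, a value well above 1 suggests that the parts occur together more often than chance would predict, which is the intuition behind keeping such a phrase as a candidate word.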
5. The method according to claim 3, characterized in that calculating the left information entropy and the right information entropy of the character string comprises:
calculating the left information entropy of the character string according to the formula $H_l(s) = -\sum_{a \in A} p(s_{la} \mid s) \log p(s_{la} \mid s)$, where $s$ denotes the character string, $H_l(s)$ denotes the left information entropy of $s$, $A$ is the set of Chinese characters that occur to the left of $s$, $s_{la}$ denotes the character string formed by combining the Chinese character $a$ on the left of $s$ with $s$, and $p(s_{la} \mid s)$ denotes the conditional probability that the Chinese character $a$ occurs to the left of $s$ given that $s$ occurs in the data; and
calculating the right information entropy of the character string according to the formula $H_r(s) = -\sum_{b \in B} p(s_{rb} \mid s) \log p(s_{rb} \mid s)$, where $H_r(s)$ denotes the right information entropy of $s$, $B$ is the set of Chinese characters that occur to the right of $s$, $s_{rb}$ denotes the character string formed by combining the Chinese character $b$ on the right of $s$ with $s$, and $p(s_{rb} \mid s)$ denotes the conditional probability that the Chinese character $b$ occurs to the right of $s$ given that $s$ occurs in the data.
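The left and right information entropy of claim 5 can be estimated from raw neighbor counts, as in the minimal Python sketch below. It assumes that the conditional probabilities are estimated as relative frequencies of the neighboring characters, that occurrences at the very start or end of the text contribute no neighbor on that side, and that the logarithm is natural (the claim does not fix the base). The function names are hypothetical.

```python
import math
from collections import Counter


def _entropy(neighbor_counts):
    """Shannon entropy (natural log) of a distribution given as raw counts."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neighbor_counts.values())


def neighbor_entropy(text, s):
    """Return (left entropy, right entropy) of string s within text."""
    left, right = Counter(), Counter()
    start = text.find(s)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1   # character a immediately left of s
        end = start + len(s)
        if end < len(text):
            right[text[end]] += 1        # character b immediately right of s
        start = text.find(s, start + 1)  # also finds overlapping occurrences
    return _entropy(left), _entropy(right)
```

A string whose neighbors vary widely on both sides has high entropy on both sides, which suggests that it behaves as an independent unit; a low value on either side suggests that the string is a fragment of a longer word.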
6. A hot spot word discovery device, characterized by comprising:
a collecting unit, configured to collect data generated by a network system;
a word segmentation unit, configured to perform word segmentation, based on an entropy model, on character strings in the data collected by the collecting unit to obtain candidate words;
a matching unit, configured to match the candidate words obtained by the word segmentation unit against the words in a dictionary to obtain neologisms; and
a hot spot word acquiring unit, configured to perform a Bayesian average calculation according to the frequency of occurrence and the score of each neologism obtained by the matching unit to obtain the Bayesian average of the neologism, and, if the Bayesian average is determined to satisfy a predetermined condition, to determine that the neologism is a hot spot word.
7. The device according to claim 6, characterized by further comprising:
a data cleansing unit, configured to perform data cleansing on the data collected by the collecting unit.
8. The device according to claim 6, characterized in that the hot spot word acquiring unit is specifically configured to: obtain the character strings in the data; calculate the left information entropy and the right information entropy of each character string; segment the character strings in the data according to the left information entropy and the right information entropy of the character strings to obtain candidate phrases; and determine the candidate words from among the candidate phrases.
9. The device according to claim 8, characterized in that the hot spot word acquiring unit is specifically configured to calculate the cohesion degree of the character strings of the candidate phrase according to a formula [formula not reproduced in this text], wherein [symbol not reproduced] denotes the cohesion degree of the candidate phrase composed of the character strings $s_i$, [symbol not reproduced] denotes the probability that the candidate phrase composed of the character strings $s_i$ occurs in the data, and $P(s_i)$ denotes the probability that the character string $s_i$ occurs in the data; and, if the cohesion degree is determined to satisfy a predetermined condition, to determine that the candidate phrase is a candidate word.
10. The device according to claim 8, characterized in that the hot spot word acquiring unit is specifically configured to: calculate the left information entropy of the character string according to the formula $H_l(s) = -\sum_{a \in A} p(s_{la} \mid s) \log p(s_{la} \mid s)$, where $s$ denotes the character string, $H_l(s)$ denotes the left information entropy of $s$, $A$ is the set of Chinese characters that occur to the left of $s$, $s_{la}$ denotes the character string formed by combining the Chinese character $a$ on the left of $s$ with $s$, and $p(s_{la} \mid s)$ denotes the conditional probability that the Chinese character $a$ occurs to the left of $s$ given that $s$ occurs in the data; and calculate the right information entropy of the character string according to the formula $H_r(s) = -\sum_{b \in B} p(s_{rb} \mid s) \log p(s_{rb} \mid s)$, where $H_r(s)$ denotes the right information entropy of $s$, $B$ is the set of Chinese characters that occur to the right of $s$, $s_{rb}$ denotes the character string formed by combining the Chinese character $b$ on the right of $s$ with $s$, and $p(s_{rb} \mid s)$ denotes the conditional probability that the Chinese character $b$ occurs to the right of $s$ given that $s$ occurs in the data.
11. A computer-readable storage medium storing one or more programs, characterized in that the one or more programs comprise instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711058951.4A CN107908618A (en) | 2017-11-01 | 2017-11-01 | A kind of hot spot word finds method and apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711058951.4A CN107908618A (en) | 2017-11-01 | 2017-11-01 | A kind of hot spot word finds method and apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107908618A true CN107908618A (en) | 2018-04-13 |
Family
ID=61843344
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711058951.4A Pending CN107908618A (en) | 2017-11-01 | 2017-11-01 | A kind of hot spot word finds method and apparatus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107908618A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110191355A1 (en) * | 2007-04-24 | 2011-08-04 | Peking University | Method for monitoring abnormal state of internet information |
CN107180025A (en) * | 2017-03-31 | 2017-09-19 | 北京奇艺世纪科技有限公司 | A kind of recognition methods of neologisms and device |
Non-Patent Citations (3)
Title |
---|
MATRIX67: "互联网时代的社会语言学:基于SNS的文本数据挖掘", 《HTTP://WWW.MATRIX67.COM/BLOG/ARCHIVES/5044》 * |
ZHZHX0318: "关于新词发现", 《HTTPS://BLOG.CSDN.NET/ZHZHX0318/ARTICLE/DETAILS/78253378》 * |
郝晓玲 等: "微博热词抽取及话题发现研究", 《情报杂志》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110516229A (en) * | 2019-07-10 | 2019-11-29 | 杭州电子科技大学 | A kind of domain-adaptive Chinese word cutting method based on deep learning |
CN110516229B (en) * | 2019-07-10 | 2020-05-05 | 杭州电子科技大学 | Domain-adaptive Chinese word segmentation method based on deep learning |
CN110991173A (en) * | 2019-11-29 | 2020-04-10 | 支付宝(杭州)信息技术有限公司 | Word segmentation method and system |
CN110991173B (en) * | 2019-11-29 | 2023-09-29 | 支付宝(杭州)信息技术有限公司 | Word segmentation method and system |
CN111538893A (en) * | 2020-04-29 | 2020-08-14 | 四川大学 | Method for extracting network security new words from unstructured data |
CN111737555A (en) * | 2020-06-18 | 2020-10-02 | 苏州朗动网络科技有限公司 | Method and device for selecting hot keywords and storage medium |
CN111563143A (en) * | 2020-07-20 | 2020-08-21 | 上海二三四五网络科技有限公司 | Method and device for determining new words |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107908618A (en) | A kind of hot spot word finds method and apparatus | |
Hoffart et al. | Discovering emerging entities with ambiguous names | |
US10565233B2 (en) | Suffix tree similarity measure for document clustering | |
JP5092165B2 (en) | Data construction method and system | |
Inzalkar et al. | A survey on text mining-techniques and application | |
CN108829658A (en) | The method and device of new word discovery | |
WO2017101728A1 (en) | Similar word aggregation method and apparatus | |
CN104978332B (en) | User-generated content label data generation method, device and correlation technique and device | |
US10467255B2 (en) | Methods and systems for analyzing reading logs and documents thereof | |
CN104137097A (en) | Predicate template gathering device, specified phrase pair gathering device and computer program for said devices | |
CN108733791B (en) | Network event detection method | |
CN110232126A (en) | Hot spot method for digging and server and computer readable storage medium | |
Bykau et al. | Fine-grained controversy detection in Wikipedia | |
Ouyang et al. | Sentistory: multi-grained sentiment analysis and event summarization with crowdsourced social media data | |
Wang et al. | Exploring text links for coherent multi-document summarization | |
Harandizadeh et al. | Tweeki: Linking named entities on Twitter to a knowledge graph | |
Liang et al. | Clustering web services for automatic categorization | |
Abidi et al. | Searching Personalized $ k $-wing in Bipartite Graphs | |
Schinas et al. | Event detection and retrieval on social media | |
Unankard et al. | Sub-events tracking from social network based on the relationships between topics | |
CN112528640A (en) | Automatic domain term extraction method based on abnormal subgraph detection | |
CN116467291A (en) | Knowledge graph storage and search method and system | |
Zhou et al. | Real-time timeline summarisation for high-impact events in twitter | |
Jaber et al. | Inferring offline hierarchical ties from online social networks | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||