CN105956158A

CN105956158A - Automatic extraction method of network neologism on the basis of mass microblog texts and use information

Info

Publication number: CN105956158A
Application number: CN201610324541.9A
Authority: CN
Inventors: 黄永峰; 吴方照; 刘佳伟; 袁志刚; 吴思行
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2016-05-17
Filing date: 2016-05-17
Publication date: 2016-09-21
Anticipated expiration: 2036-05-17
Also published as: CN105956158B

Abstract

The invention provides an automatic extraction method of a network neologism on the basis of mass microblog texts and use information. The automatic extraction method comprises the following steps: obtaining a microblog text and an author identifier corresponding to the microblog; establishing a neologism list; according to a Chinese analysis tool, carrying out a word segmentation operation on the microblog text, obtaining a segmentation word segment, and independently carrying out statistics on the word frequency information of each segmentation word segment on the basis of two dimensions including the text and a user; deleting a word of which the word frequency is smaller than a first frequency threshold value in the neologism list from the neologism list; carrying out statistics on all two-tuples and triples in microblog data, and taking the two-tuples and triples as candidate neologisms; calculating the score of the relevance of the candidate neologisms; adding the word of which the word frequency is greater than a second frequency threshold value and the score of the relevance is greater than a score threshold value in the candidate neologisms into the neologism list; and carrying out iterative execution on the above process until no new neologisms are generated and no candidate neologisms in the neologism list are deleted. The automatic extraction method can automatically extract the network neologism and has high accuracy and low time and space complexity.

Description

Method for automatically extracting network neologisms based on massive microblog text and user information

Technical field

The present invention relates to the technical field of network data mining, and in particular to a method for automatically extracting network neologisms based on massive microblog text and user information.

Background art

New word discovery is an important part of Chinese natural language processing research. Neologisms are words that do not appear in traditional dictionaries. On the Internet, and particularly on social networks, neologisms emerge continuously. Social network users frequently use network neologisms to express strong feelings, to add personal emotional color, or to make the text they post more interesting and lively. Such neologisms may be abbreviations of longer words or sentences, homophones of traditional words, or words entirely unrelated to traditional vocabulary. Social networks are now an important component of the Internet, and the analysis of social media data is a popular research field for many data mining experts and scholars. On the one hand, social media data is updated very quickly, so the volume of data available for study is abundant. On the other hand, social network users are very active and tend to use novel expressions that depart from the grammatical rules of traditional text, which causes neologisms to emerge in large numbers on social networks and poses a considerable challenge to traditional text analysis techniques.

Unlike languages such as English, in which spaces naturally separate words, Chinese text consists of a sequence of Chinese characters, and the unit that carries the semantics of Chinese text is usually the word rather than the single character. Each Chinese word has its own specific meaning and part of speech. The first step of most Chinese natural language processing tasks is therefore to divide the Chinese text into "word segments" made up of different words, a step called "word segmentation". Word segmentation depends heavily on the dictionary used by the segmenter. According to statistics, more than 60% of segmentation errors are caused by failing to segment neologisms correctly: because the neologisms are absent from the dictionary of the segmentation tool, the tool cannot recognize them correctly.

Traditional neologism detection methods mainly take the following forms: embedding neologism detection in the word segmentation task, relying on complex linguistic rules and knowledge, converting word detection into a classification problem, and statistical methods. Traditional methods have difficulty reaching high accuracy, and because the algorithms often generate a large number of candidate neologisms, their time or space complexity is high.

Summary of the invention

The present invention aims to solve at least one of the above technical problems.

To this end, an object of the present invention is to propose a method for automatically extracting network neologisms based on massive microblog text and user information. On the basis of microblog text, the method also takes user information into account and establishes an iterative algorithm that extracts network neologisms automatically; the results have higher accuracy and lower time and space complexity.

To achieve the above object, an embodiment of the present invention discloses a method for automatically extracting network neologisms based on massive microblog text and user information, comprising the following steps: S1: obtaining microblog data, wherein the microblog data comprises microblog text and the author identifier corresponding to each microblog; S2: establishing a neologism list, wherein the neologism list is initialized as an empty set; S3: adding the neologism list to a preset Chinese analysis tool, performing word segmentation on the microblog text with the Chinese analysis tool so as to map the microblog text to a set of word segments, and counting, for each word segment, frequency information in the two dimensions of text and user; S4: updating the frequency information of the corresponding words in the neologism list according to the frequency information obtained, and deleting from the neologism list any word whose frequency is less than a first frequency threshold; S5: defining n consecutively occurring word segments in the segmentation result as an n-tuple, counting all two-tuples and triples in the microblog data, and taking the two-tuples and triples as candidate neologisms; S6: according to the distribution of the candidate neologisms in the two dimensions of text and user, counting the frequency information of the candidate neologisms in those two dimensions and calculating the relevance score of each candidate neologism; S7: adding to the neologism list those candidate neologisms whose frequency is greater than a second frequency threshold and whose relevance score is greater than a score threshold; and S8: iteratively performing S2 to S7 until no new candidate neologism is produced in the microblog data and no candidate neologism is deleted from the neologism list.

The method for automatically extracting network neologisms based on massive microblog text and user information according to the embodiment of the present invention exploits the characteristics of microblog data: on the basis of microblog text it also takes user information into account, and it establishes an iterative algorithm that extracts network neologisms automatically. Compared with the results of traditional methods, the results of this method have higher accuracy and lower time and space complexity, and the method has important applications in the mining and analysis of social media data.

In addition, the method for automatically extracting network neologisms based on massive microblog text and user information according to the above embodiment of the present invention may also have the following additional technical features:

In some examples, in S4, neologisms are automatically extracted on the basis of the microblog text and according to the user information of the microblog data.

In some examples, in S8, the word segments in the segmentation result are merged iteratively by the iterative algorithm, wherein each iteration only needs to search for the two-tuples and triples in the microblog data.

In some examples, S8 further comprises: after each iteration is completed, adding the neologisms found to the neologism list, and using the neologism list as the user-defined dictionary of the preset Chinese analysis tool, so that in the next word segmentation operation the neologisms found in the previous iteration are segmented correctly.

In some examples, S6 further comprises: calculating, based on enhanced mutual information (EMI) theory, the EMI score of each word based on text frequency, specifically:

\mathrm{EMI}(w^{n}) = \log \frac{N_{w^{n}} / T}{\prod_{i=1}^{n} \left[ (N_{w_{i}^{n}} - N_{w^{n}}) / T \right]},

where N_{w^{n}} and N_{w_{i}^{n}} denote the frequencies, based on microblog text, of the word wⁿ and of its i-th segment w_{i}^{n}, respectively; T is the total number of microblogs; and n is the parameter n of the n-tuple, n = 2 or 3;

The EMI score across users is calculated according to the distribution of the word among users, specifically:

\mathrm{usrEMI}(w^{n}) = \log \frac{N_{w^{n}}^{u} / T_{u}}{\prod_{i=1}^{n} \left[ (N_{w_{i}^{n}}^{u} - N_{w^{n}}^{u}) / T_{u} \right]},

where N_{w^{n}}^{u} and N_{w_{i}^{n}}^{u} denote the frequencies, based on user usage, of the word wⁿ and of its i-th segment w_{i}^{n}, respectively; T_u is the total number of users; and n is the parameter n of the n-tuple, n = 2 or 3;

The relevance score of a candidate neologism is obtained from the EMI score based on text frequency and the EMI score across users, specifically:

ascore(wⁿ)=EMI (wⁿ)+usrEMI(wⁿ),

where ascore(wⁿ) is the relevance score of the candidate neologism wⁿ.

In some examples, in S1, the microblog data is obtained by web crawler technology.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a flow chart of the method for automatically extracting network neologisms based on massive microblog text and user information according to an embodiment of the present invention; and

Fig. 2 is an overall flow chart of the method for automatically extracting network neologisms based on massive microblog text and user information according to one embodiment of the present invention.

Detailed description of the invention

Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and are not to be construed as limiting the present invention.

In the description of the present invention, it should be understood that orientation or positional terms such as "center", "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner" and "outer" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they are therefore not to be construed as limiting the present invention. In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

In the description of the present invention, it should be noted that, unless otherwise expressly specified and limited, the terms "mounted", "connected" and "coupled" are to be understood broadly: a connection may, for example, be fixed, detachable or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediate medium, or internal between two elements. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances.

The method for automatically extracting network neologisms based on massive microblog text and user information according to embodiments of the present invention is described below with reference to the accompanying drawings.

Fig. 1 is a flow chart of the method for automatically extracting network neologisms based on massive microblog text and user information according to an embodiment of the present invention. Fig. 2 is an overall flow chart of the method according to one embodiment of the present invention. With reference to Fig. 1 and Fig. 2, the method according to the embodiment of the present invention comprises the following steps:

Step S1: Obtain microblog data, wherein the microblog data comprises the microblog text content and the author identifier corresponding to each microblog. In some examples, a large amount of microblog data is obtained by web crawler technology. For example, the microblog data set obtained is denoted D, each item of which comprises a microblog text D_i and the corresponding user identifier S_i.

Step S2: Establish a neologism list, denoted for example W, wherein the neologism list is initialized as an empty set, that is, W is initially empty.

Step S3: Add the neologism list to the preset Chinese analysis tool as a user-defined dictionary, and use the Chinese analysis tool to perform word segmentation on each microblog text D_i in the microblog data set D, so as to map the microblog text D_i to a set of word segments, and count the frequency information of each word segment in the two dimensions of text and user. In other words, each word segment obtained is taken as a basic unit, denoted w_i, and the frequency information of each basic unit w_i is counted in the two dimensions of microblog text and user information, denoted N_{w_i} and N_{w_i}^{u}, respectively.
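The role of the user-defined dictionary in step S3 can be illustrated with a toy segmenter. The sketch below is not the preset Chinese analysis tool of the embodiment; it is a hypothetical greedy longest-match stand-in (the name `segment` and the parameter `max_len` are assumptions), and it only shows how adding a word to the neologism list changes the segmentation result.

```python
def segment(text, user_dict, max_len=6):
    """Toy stand-in for a Chinese segmenter (an assumption, not the tool in the source).

    At each position, take the longest entry of `user_dict` that starts there;
    if none matches, emit a single character as its own word segment.
    """
    segments = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: one character
        # Try the longest window first, down to length 2.
        for length in range(min(max_len, len(text) - i), 1, -1):
            piece = text[i:i + length]
            if piece in user_dict:
                match = piece
                break
        segments.append(match)
        i += len(match)
    return segments


# With an empty neologism list every character is split apart.
print(segment("今天真给力", set()))
# Once "今天" and "给力" are in the list, they are kept whole.
print(segment("今天真给力", {"今天", "给力"}))
```

A production segmenter would combine its own base dictionary with the user-defined one; the point here is only that the neologism list feeds back into segmentation.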

Step S4: Update the frequency information of the corresponding words in the neologism list according to the frequency information obtained in step S3, and delete from the neologism list W any word whose frequency is less than the first frequency threshold.

In step S4, neologisms are extracted automatically on the basis of the microblog text while making full use of the user information of the microblog data. Unlike traditional methods, which consider only the distribution of neologisms in the text content, the method takes into account, in view of the characteristics of microblog data, the distribution of network neologisms in both the text and user dimensions.

Step S5: Among the word segments obtained by the segmentation operation, take each individual word segment as a basic unit and define n consecutively occurring word segments as an n-tuple; accordingly, count all two-tuples and triples occurring in the microblog data, and take them as candidate neologisms.
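A minimal sketch of step S5, assuming each microblog has already been segmented into a list of word segments. The helper names `ngrams` and `candidates` are illustrative and do not come from the source.

```python
def ngrams(segments, n):
    """Return every run of n consecutive word segments as a tuple."""
    return [tuple(segments[i:i + n]) for i in range(len(segments) - n + 1)]


def candidates(segments):
    """Candidate neologisms of one microblog: all two-tuples, then all triples."""
    return ngrams(segments, 2) + ngrams(segments, 3)


print(candidates(["今天", "真", "给", "力"]))
```

A microblog with fewer than two segments yields no candidates, because the range in `ngrams` is then empty.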

Step S6: According to the distribution of the candidate neologisms in the two dimensions of text and user, count the frequency information of the candidate neologisms in those two dimensions, and calculate the relevance score of each candidate neologism.

In some examples, this step specifically comprises:

First, based on enhanced mutual information (EMI) theory, the EMI score of each word based on text frequency is calculated, specifically:

\mathrm{EMI}(w^{n}) = \log \frac{N_{w^{n}} / T}{\prod_{i=1}^{n} \left[ (N_{w_{i}^{n}} - N_{w^{n}}) / T \right]},

Second, the EMI score across users is calculated according to the distribution of the word among users, specifically:

\mathrm{usrEMI}(w^{n}) = \log \frac{N_{w^{n}}^{u} / T_{u}}{\prod_{i=1}^{n} \left[ (N_{w_{i}^{n}}^{u} - N_{w^{n}}^{u}) / T_{u} \right]},

Finally, the relevance score of the candidate neologism is obtained from the EMI score based on text frequency and the EMI score across users, specifically:

ascore(wⁿ)=EMI (wⁿ)+usrEMI(wⁿ),

where ascore(wⁿ) is the relevance score of the candidate neologism wⁿ.
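The three formulas of step S6 can be sketched as two small functions. Two points are assumptions rather than statements of the source: the logarithm is taken as the natural logarithm (the base is not given), and a candidate whose segment never occurs outside the candidate itself, which makes the denominator zero, is given an infinite score. The function and parameter names are illustrative.

```python
import math


def emi(n_word, n_parts, total):
    """EMI of one candidate in one dimension (text or user).

    n_word  -- frequency of the candidate w^n (number of microblogs or users)
    n_parts -- frequencies of its constituent word segments, in order
    total   -- T (number of microblogs) or T_u (number of users)
    """
    if n_word <= 0:
        raise ValueError("a candidate must occur at least once")
    denom = 1.0
    for n_part in n_parts:
        denom *= (n_part - n_word) / total
    if denom <= 0:
        # Assumption: a segment that only ever occurs inside the candidate
        # is treated as perfectly associated with it.
        return math.inf
    return math.log((n_word / total) / denom)


def ascore(n_text, parts_text, total_text, n_user, parts_user, total_user):
    """Relevance score: text-based EMI plus user-based EMI."""
    return emi(n_text, parts_text, total_text) + emi(n_user, parts_user, total_user)


# A two-tuple seen in 10 of 1000 microblogs, whose segments are seen in 30 and 20.
print(emi(10, [30, 20], 1000))
```

The same `emi` function serves both dimensions because the two formulas differ only in which counts and which total are supplied.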

Step S7: Add to the neologism list those candidate neologisms whose frequency is greater than the second frequency threshold and whose relevance score is greater than the score threshold.

Step S8: Iteratively perform steps S2 to S7 until no new candidate neologism is produced in the microblog data and no candidate neologism is deleted from the neologism list. In other words, this step establishes an iterative algorithm that merges the word segments in the segmentation result iteratively. Each iteration therefore only needs to find low-order n-tuples (for example two-tuples and triples), whereas traditional methods must find high-order n-tuples in order to discover long neologisms, so that the number of candidate words grows exponentially as n increases. Compared with traditional methods, the method of the embodiment of the present invention thus replaces the direct computation of traditional methods with an iterative computation, which significantly reduces the space and time complexity of the method.

Further, in step S8, after each iteration is completed, the neologisms found are added to the neologism list, and the neologism list is used as the user-defined dictionary of the preset Chinese analysis tool, so that in the next segmentation operation the neologisms found in the previous iteration are segmented correctly. The segmentation result is thereby continuously optimized, and the optimized segmentation result in turn improves the quality of the candidate neologisms.

In summary, the method for automatically extracting network neologisms based on massive microblog text and user information according to the embodiment of the present invention exploits the characteristics of microblog data: on the basis of microblog text it also takes user information into account, and it establishes an iterative algorithm that extracts network neologisms automatically. Compared with the results of traditional methods, the results of this method have higher accuracy and lower time and space complexity, and the method has important applications in the mining and analysis of social media data.

To facilitate a better understanding of the present invention, the method of the above embodiment is described in further detail below with reference to a specific embodiment, taking Sina Weibo data as an example.

In this embodiment, the method comprises, for example, the following steps:

Step 1: Using the API (application programming interface) provided by Sina Weibo, crawl a large amount of microblog data posted by Sina Weibo users, including all microblog texts and the identifiers of the corresponding users (an attribute that uniquely identifies a user, such as a user name or user ID). The data set is denoted D, each item of which comprises the microblog text content D_i and the user identifier S_i, that is, D = {(D_i, S_i) | i = 1, 2, 3, ...}. This prepares the data for the subsequent neologism detection task.
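One possible in-memory representation of the data set D is sketched below. The record type `Microblog`, its field names and the sample records are all hypothetical; the source only specifies that each item pairs a text D_i with a user identifier S_i.

```python
from typing import NamedTuple


class Microblog(NamedTuple):
    text: str     # D_i: microblog text content
    user_id: str  # S_i: attribute that uniquely identifies the author


# Made-up sample records; a real data set would come from the crawl.
D = [
    Microblog("今天真给力", "u1"),
    Microblog("给力啊", "u2"),
    Microblog("天气好", "u1"),
    Microblog("好给力", "u3"),
]

T = len(D)                          # total number of microblogs
T_u = len({m.user_id for m in D})   # total number of distinct users
print(T, T_u)
```

The two totals T and T_u are the normalizers used later in the text-based and user-based EMI scores.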

Step 2: Define a neologism list W and initialize it as an empty set. The method updates W continuously, and the final content of W is the output of the whole method, that is, the extracted network neologisms.

Step 3: Use the neologism list W as the user-defined dictionary of the Chinese analysis tool, and use this tool to perform word segmentation on each text content D_i in the microblog data set D, thereby obtaining the set of word segments of each microblog, denoted w, w = {w_i | i = 1, 2, 3, ...}, where w_i is each word segment. Because the neologism list W is continuously updated, newly discovered network neologisms continuously improve the segmentation result.

Step 4: According to the word segments in the segmentation result obtained in step 3, count the frequency information of each word segment. For each word segment w_i, count its frequency information in the two dimensions of microblog text and user. Specifically, for each word segment w_i, in the microblog text dimension, count how many microblogs contain w_i, and denote the result N_{w_i}; in the user usage dimension, count how many users have used w_i, and denote the result N_{w_i}^{u}. This step thus completes the mapping from microblog text content and user usage information to a word segment frequency table. Finally, the word segment frequency table obtained is denoted G.
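The two counts of step 4 can be gathered in a single pass. The sketch below assumes the input is a list of (word segments, user identifier) pairs and stores the table G as a dictionary from segment to a (text frequency, user frequency) pair; that layout and the name `build_frequency_table` are assumptions.

```python
from collections import defaultdict


def build_frequency_table(segmented):
    """Map each word segment to (number of microblogs, number of distinct users)."""
    text_freq = defaultdict(int)
    users = defaultdict(set)
    for segments, user_id in segmented:
        # set(): a segment repeated inside one microblog is counted once.
        for seg in set(segments):
            text_freq[seg] += 1
            users[seg].add(user_id)
    return {seg: (text_freq[seg], len(users[seg])) for seg in text_freq}


sample = [
    (["今天", "给力"], "u1"),
    (["给力", "给力"], "u1"),
    (["真", "给力"], "u2"),
]
G = build_frequency_table(sample)
print(G["给力"])  # three microblogs, two distinct users
```

Counting microblogs rather than raw occurrences follows the wording of step 4, which asks how many microblogs contain the segment and how many users used it.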

Step 5: After the frequency information of each word segment has been counted, use it to update the neologism list W. For each word in W, according to the information in the word segment frequency table G, filter out of W the words whose microblog text frequency is less than the first frequency threshold. The reason for counting the frequency of the words in W after every segmentation and then filtering is as follows: adding the neologism list W to the segmentation operation as the user-defined dictionary of the Chinese analysis tool lets the tool find a more suitable division of the microblog text according to each word in W, which optimizes the segmentation result; the optimized segmentation result can in turn be used to filter out erroneous neologisms found earlier.

Step 6: Use the word segments to find all two-tuples and triples in the massive microblog text. A two-tuple consists of two adjacent word segments w_i in the microblog text content and is denoted w² = w₁w₂; a triple consists of three adjacent word segments w_i and is denoted w³ = w₁w₂w₃. Two-tuples and triples are uniformly denoted wⁿ, and together they constitute the candidate neologisms.

Step 7: For each candidate neologism wⁿ, use the same method as for the word segment frequency table above to count its frequency information in the two dimensions of text content and user usage; the result is denoted F, where F is the set containing all possible candidate neologisms. Further, using a statistical method, whether each word in F is a neologism is computed from its text frequency information and user usage. First, based on enhanced mutual information (EMI) theory, the EMI score of each word based on text frequency is calculated as follows:

\mathrm{EMI}(w^{n}) = \log \frac{N_{w^{n}} / T}{\prod_{i=1}^{n} \left[ (N_{w_{i}^{n}} - N_{w^{n}}) / T \right]},

where N_{w^{n}} and N_{w_{i}^{n}} denote the frequencies, based on microblog text, of the word wⁿ and of its i-th segment w_{i}^{n}, respectively; T is the total number of microblogs; and n is the parameter n of the n-tuple (n = 2 or 3). A higher EMI score for the word wⁿ indicates that the word segments composing wⁿ are more strongly associated, and hence that wⁿ is more likely to be a network neologism.

Then, the EMI score across users is calculated from the distribution of the word among users, as follows:

\mathrm{usrEMI}(w^{n}) = \log \frac{N_{w^{n}}^{u} / T_{u}}{\prod_{i=1}^{n} \left[ (N_{w_{i}^{n}}^{u} - N_{w^{n}}^{u}) / T_{u} \right]},

where N_{w^{n}}^{u} and N_{w_{i}^{n}}^{u} denote the frequencies, based on user usage, of the word wⁿ and of its i-th segment w_{i}^{n}, respectively; T_u is the total number of users; and n is the parameter n of the n-tuple (n = 2 or 3). A higher user EMI score for wⁿ indicates that wⁿ may have been used by more users and is strongly associated across different users, and hence that wⁿ is more likely to be a popular network neologism.

Finally, the relevance score ascore of the candidate neologism wⁿ is defined as:

ascore(wⁿ)=EMI (wⁿ)+usrEMI(wⁿ),

where, for a candidate neologism wⁿ, a higher relevance score indicates that the word segments composing the word are more strongly associated in both dimensions, microblog text and user usage. At the same time, because wⁿ was not correctly detected by the segmentation tool, wⁿ is likely to be a popular user-coined word found in microblogs, that is, a network neologism.
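As a worked illustration of the three formulas, consider a two-tuple with made-up counts: T = 10000 microblogs, N_{w^2} = 50, N_{w_1^2} = 150, N_{w_2^2} = 100, and T_u = 2000 users, N_{w^2}^{u} = 40, N_{w_1^2}^{u} = 90, N_{w_2^2}^{u} = 80. The natural logarithm is assumed here, since the source does not state the base.

```latex
\mathrm{EMI}(w^{2})
  = \log \frac{50 / 10000}{\frac{150 - 50}{10000} \cdot \frac{100 - 50}{10000}}
  = \log \frac{0.005}{0.01 \times 0.005}
  = \log 100 \approx 4.605

\mathrm{usrEMI}(w^{2})
  = \log \frac{40 / 2000}{\frac{90 - 40}{2000} \cdot \frac{80 - 40}{2000}}
  = \log \frac{0.02}{0.025 \times 0.02}
  = \log 40 \approx 3.689

\mathrm{ascore}(w^{2}) = \mathrm{EMI}(w^{2}) + \mathrm{usrEMI}(w^{2}) \approx 8.294
```

Each factor in the denominator counts the microblogs (or users) in which a segment appears without the whole candidate, so a candidate whose segments rarely occur apart from it receives a high score.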

According to prior knowledge, a "neologism" is a newly emerged word that is generally accepted, has a specific meaning and does not exist in traditional dictionaries, so a neologism must be widely used by many different users. The frequency information of a word and its relevance score reflect these criteria well. Therefore, if the relevance score of wⁿ is greater than the relevance score threshold and the frequency of wⁿ is also greater than the frequency threshold, wⁿ is added to the neologism list W.

The above steps constitute one iteration of the method. The steps are repeated until an iteration neither adds a new word to the neologism list W nor deletes a word from it; the iteration then terminates, and each item in the neologism list W at that point is a network neologism extracted by the method of the present invention.
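The whole iteration can be sketched end to end on a toy data set. Everything below is a simplified illustration under stated assumptions, not the implementation of the embodiment: `segment` is a greedy longest-match stand-in for the Chinese analysis tool, the logarithm is natural, a zero denominator yields an infinite score, the text frequency is the one compared against both frequency thresholds, and `max_rounds` is a safety cap that the source does not mention. All names, thresholds and sample records are hypothetical.

```python
import math
from collections import defaultdict


def segment(text, lexicon, max_len=8):
    """Stand-in segmenter: longest match against `lexicon`, else one character."""
    out, i = [], 0
    while i < len(text):
        piece = text[i]
        for k in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + k] in lexicon:
                piece = text[i:i + k]
                break
        out.append(piece)
        i += len(piece)
    return out


def count(items_per_post):
    """Map each item to (number of posts, number of distinct users) containing it."""
    posts, users = defaultdict(int), defaultdict(set)
    for items, user in items_per_post:
        for item in set(items):
            posts[item] += 1
            users[item].add(user)
    return {x: (posts[x], len(users[x])) for x in posts}


def emi(n_word, n_parts, total):
    denom = math.prod((p - n_word) / total for p in n_parts)
    return math.inf if denom <= 0 else math.log((n_word / total) / denom)


def extract(data, min_keep, min_add, min_score, max_rounds=20):
    W = set()                                    # neologism list, initially empty
    T, T_u = len(data), len({u for _, u in data})
    for _ in range(max_rounds):
        seg = [(segment(text, W), user) for text, user in data]
        G = count(seg)                           # word segment frequency table
        kept = {w for w in W if G.get(w, (0, 0))[0] >= min_keep}
        grams = [([tuple(s[i:i + n]) for n in (2, 3)
                   for i in range(len(s) - n + 1)], u) for s, u in seg]
        F = count(grams)                         # candidate frequency table
        added = set()
        for gram, (n_t, n_u) in F.items():
            if n_t <= min_add:
                continue
            score = (emi(n_t, [G[p][0] for p in gram], T)
                     + emi(n_u, [G[p][1] for p in gram], T_u))
            if score > min_score:
                added.add("".join(gram))
        new_W = kept | added
        if new_W == W:                           # nothing added, nothing deleted
            break
        W = new_W
    return W


data = [("给力啊", "u1"), ("真给力", "u2"), ("好给力", "u3"), ("给力", "u1"),
        ("给我", "u4"), ("努力", "u5"), ("天气好", "u2"), ("真好啊", "u3")]
print(extract(data, min_keep=2, min_add=2, min_score=3.0))  # → {'给力'}
```

On this sample the first round accepts the two-tuple (给, 力); the second round re-segments with 给力 in the dictionary, finds no further candidate above the frequency threshold and deletes nothing, so the loop stops.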

In summary, the method of this embodiment has the following characteristics. It makes use of word distribution information in the microblog user dimension. Compared with traditional methods, the method uses a statistical approach based on enhanced mutual information (EMI) theory and analyzes not only the distribution of neologisms in the text content dimension but also, by exploiting the characteristics of microblogs as a network information carrier, the distribution of neologism usage across different users, which can noticeably improve the accuracy of the neologisms the method finds. In addition, the method establishes an iterative algorithm to carry out the automatic neologism extraction steps, unlike traditional methods, which compute directly. First, this effectively reduces the time and space complexity of the method. In the original EMI-based neologism detection algorithm, finding all neologisms in a single pass requires finding high-order n-tuples, that is, every combination of no more than n consecutive word segments in the text. As n increases, however, the number of candidate words grows exponentially, and the consumption of memory and time rises sharply. The present method instead works iteratively: each iteration uses only two-tuples and triples, and longer word combinations are found afterwards through repeated merging, so the method effectively reduces memory demand and has lower time and space complexity. Second, each iteration produces identified candidate neologisms; the method uses them to optimize the next word segmentation operation and then uses the optimized segmentation result to filter unqualified items out of the set of neologisms found so far, which further improves the accuracy of the neologisms the method finds.

In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, a person of ordinary skill in the art will understand that various changes, modifications, substitutions and variations may be made to these embodiments without departing from the principle and spirit of the present invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A method for automatically extracting network neologisms based on massive microblog text and user information, characterized by comprising the following steps:

S1: obtaining microblog data, wherein the microblog data comprises microblog text and the author identifier corresponding to each microblog;

S2: establishing a neologism list, wherein the neologism list is initialized as an empty set;

S3: adding the neologism list to a preset Chinese analysis tool, performing word segmentation on the microblog text with the Chinese analysis tool so as to map the microblog text to a set of word segments, and counting, for each word segment, frequency information in the two dimensions of text and user;

S4: updating the frequency information of the corresponding words in the neologism list according to the frequency information obtained, and deleting from the neologism list any word whose frequency is less than a first frequency threshold;

S5: defining n consecutively occurring word segments in the segmentation result as an n-tuple, counting all two-tuples and triples in the microblog data, and taking the two-tuples and triples as candidate neologisms;

S6: according to the distribution of the candidate neologisms in the two dimensions of text and user, counting the frequency information of the candidate neologisms in those two dimensions, and calculating the relevance score of each candidate neologism;

S7: adding to the neologism list those candidate neologisms whose frequency is greater than a second frequency threshold and whose relevance score is greater than a score threshold; and

S8: iteratively performing S2 to S7 until no new candidate neologism is produced in the microblog data and no candidate neologism is deleted from the neologism list.

2. The method for automatically extracting network neologisms based on massive microblog text and user information according to claim 1, characterized in that, in S4, neologisms are automatically extracted on the basis of the microblog text and according to the user information of the microblog data.

3. The method for automatically extracting network neologisms based on massive microblog text and user information according to claim 1, characterized in that, in S8, the word segments in the segmentation result are merged iteratively by an iterative algorithm, wherein each iteration only needs to search for the two-tuples and triples in the microblog data.

4. The method for automatically extracting network neologisms based on massive microblog text and user information according to claim 3, characterized in that S8 further comprises:

after each iteration is completed, adding the neologisms found to the neologism list, and using the neologism list as the user-defined dictionary of the preset Chinese analysis tool, so that in the next word segmentation operation the neologisms found in the previous iteration are segmented correctly.

5. The method for automatically extracting network neologisms based on massive microblog text and user information according to claim 1, characterized in that S6 further comprises:

calculating, based on enhanced mutual information (EMI) theory, the EMI score of each word based on text frequency, specifically:

\mathrm{EMI}(w^{n}) = \log \frac{N_{w^{n}} / T}{\prod_{i=1}^{n} \left[ (N_{w_{i}^{n}} - N_{w^{n}}) / T \right]},

where N_{w^{n}} and N_{w_{i}^{n}} denote the frequencies, based on microblog text, of the word wⁿ and of its i-th segment w_{i}^{n}, respectively; T is the total number of microblogs; and n is the parameter n of the n-tuple, n = 2 or 3;

\mathrm{usrEMI}(w^{n}) = \log \frac{N_{w^{n}}^{u} / T_{u}}{\prod_{i=1}^{n} \left[ (N_{w_{i}^{n}}^{u} - N_{w^{n}}^{u}) / T_{u} \right]},

where N_{w^{n}}^{u} and N_{w_{i}^{n}}^{u} denote the frequencies, based on user usage, of the word wⁿ and of its i-th segment w_{i}^{n}, respectively; T_u is the total number of users; and n is the parameter n of the n-tuple, n = 2 or 3;

obtaining the relevance score of the candidate neologism from the EMI score based on text frequency and the EMI score across users, specifically:

ascore(wⁿ)=EMI (wⁿ)+usrEMI(wⁿ),

where ascore(wⁿ) is the relevance score of the candidate neologism wⁿ.